feat: Causal DoubleMLEstimator (#8) #1715

Merged
merged 17 commits into from Dec 19, 2022

Conversation

dylanw-oss
Contributor

What changes are proposed in this pull request?

Add package 'com.microsoft.azure.synapse.ml.causal' and implementation LinearDMLEstimator

How is this patch tested?

  • I have written tests

Does this PR change any dependencies?

  • No.

Does this PR add a new feature? If so, have you added samples on website?

  • Yes.

Add package 'com.microsoft.azure.synapse.ml.causal' and implementation LinearDMLEstimator
@acrolinxatmsft1

Acrolinx Scorecards

A minimum Acrolinx score of 80 is required.

Click the scorecard links for each article to review the Acrolinx feedback on grammar, spelling, punctuation, writing style, and terminology.

Article | Acrolinx score | Word and Phrases Score | Correctness Score | Scorecard | Processed
website/docs/documentation/estimators/causal/_causalInferenceDML.md link ⚠️
website/docs/documentation/estimators/estimators_causal.md link ⚠️

More information about Acrolinx

@github-actions

Hey @dylanw-oss 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel
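
A title check along these lines could be sketched as follows. This is only an illustration: the prefix set is taken from the examples above, and `SemanticPrefixCheck` / `hasSemanticPrefix` are hypothetical names, not the check the repository actually runs.

```scala
// Hypothetical helper: not part of SynapseML or its CI tooling.
object SemanticPrefixCheck {
  // Assumption: only the prefixes shown in the examples above are accepted.
  val prefixes: Set[String] =
    Set("fix", "feat", "docs", "build", "perf", "refactor", "style", "test")

  // True when the text before the first ':' is one of the known prefixes.
  def hasSemanticPrefix(title: String): Boolean = {
    val idx = title.indexOf(':')
    idx > 0 && prefixes.contains(title.substring(0, idx).trim)
  }
}
```

With this sketch, a title such as "feat: Causal DoubleMLEstimator (#8)" passes, while a title with no prefix does not. Scoped or upper-case prefixes are deliberately not handled.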

To test your commit locally, please follow our guide on building from source.
Check out the developer guide for additional guidance on testing your change.

@dylanw-oss
Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 1715 in repo microsoft/SynapseML

@dylanw-oss
Contributor Author

@serena-ruan, @mhamilton723, can anyone help to give me permission to run a pipeline?

@memoryz
Contributor

memoryz commented Nov 12, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov-commenter

codecov-commenter commented Nov 12, 2022

Codecov Report

Merging #1715 (60f09e2) into master (4a25954) will decrease coverage by 0.57%.
The diff coverage is 32.25%.

@@            Coverage Diff             @@
##           master    #1715      +/-   ##
==========================================
- Coverage   86.51%   85.94%   -0.58%     
==========================================
  Files         273      276       +3     
  Lines       14420    14571     +151     
  Branches      769      754      -15     
==========================================
+ Hits        12476    12523      +47     
- Misses       1944     2048     +104     
Impacted Files | Coverage Δ
...ft/azure/synapse/ml/causal/DoubleMLEstimator.scala | 8.97% <8.97%> (ø)
.../azure/synapse/ml/causal/ResidualTransformer.scala | 28.12% <28.12%> (ø)
...osoft/azure/synapse/ml/causal/DoubleMLParams.scala | 71.05% <71.05%> (ø)
...azure/synapse/ml/core/schema/SchemaConstants.scala | 100.00% <100.00%> (ø)
...microsoft/azure/synapse/ml/train/AutoTrainer.scala | 100.00% <100.00%> (ø)
...osoft/azure/synapse/ml/train/TrainClassifier.scala | 84.32% <100.00%> (+0.11%) ⬆️
...rosoft/azure/synapse/ml/train/TrainRegressor.scala | 92.30% <100.00%> (+1.92%) ⬆️

@serena-ruan
Contributor

@serena-ruan, @mhamilton723, can anyone help to give me permission to run a pipeline?

Hi @dylanw-oss, thanks for this PR! Could you raise a request to join this team: https://github.com/orgs/microsoft/teams/synapseml so that @mhamilton723 can add you? Then you can run /azp run to trigger the pipeline.

}

@DeveloperApi
override def transformSchema(schema: StructType): StructType =
Collaborator

This transformSchema doesn't look right. Are you sure this doesn't add any info to the dataframe?

Contributor Author

LinearDMLEstimator's transform does nothing by design and isn't supposed to be called by the end user.
Previously, I had it throw an exception, but that didn't pass fuzzing tests, so I changed it to return the original dataset. In this case I don't think we need to transform the schema; please correct me if I'm missing something.

Collaborator

I believe there is actually a natural way to use this model to perform a regression. In particular, you can think of this pipeline as estimating a prediction variable in two steps. The first is the debiasing operation, where you map a dataframe to its residuals. The second is the prediction of the target residuals.

To form the actual prediction target, you first use your baseline estimate of the target from step 1, then add your predicted residual from step 2.

Collaborator

To give a little more info here:

First use your learned residual models to map the inputs to their residuals, then use your treatment effect model to map the residuals to the treatment. Then append that treatment as the prediction column. (If I'm missing something here, perhaps we can chat to help clarify.)
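
As a rough illustration of the two-step idea described in this thread (not what this PR implements, since its transform returns the dataset unchanged), a minimal plain-Scala sketch might look like the following. Everything here is an assumption made for illustration: the names `Line`, `Row`, `Fitted`, and `train` are hypothetical, the nuisance models are one-feature least-squares fits rather than arbitrary Spark estimators, and cross-fitting is left out for brevity.

```scala
// Hypothetical sketch in plain Scala (no Spark); none of these names are SynapseML APIs.
object ResidualPredictionSketch {
  // One-feature least-squares fit: y is approximated by intercept + slope * x.
  final case class Line(intercept: Double, slope: Double) {
    def predict(x: Double): Double = intercept + slope * x
  }

  def fit(xs: Seq[Double], ys: Seq[Double]): Line = {
    require(xs.nonEmpty && xs.size == ys.size, "inputs must be non-empty and aligned")
    val n = xs.size.toDouble
    val mx = xs.sum / n
    val my = ys.sum / n
    val sxx = xs.map(x => (x - mx) * (x - mx)).sum
    val sxy = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val slope = if (sxx == 0.0) 0.0 else sxy / sxx
    Line(my - slope * mx, slope)
  }

  final case class Row(confounder: Double, treatment: Double, outcome: Double)

  final case class Fitted(treatmentModel: Line, outcomeModel: Line, effect: Double) {
    // Baseline outcome estimate (step 1) plus the predicted outcome residual (step 2).
    def predict(confounder: Double, treatment: Double): Double = {
      val treatmentResidual = treatment - treatmentModel.predict(confounder)
      outcomeModel.predict(confounder) + effect * treatmentResidual
    }
  }

  def train(rows: Seq[Row]): Fitted = {
    val xs = rows.map(_.confounder)
    // Step 1: fit both nuisance models on the confounder and compute residuals.
    val treatmentModel = fit(xs, rows.map(_.treatment))
    val outcomeModel = fit(xs, rows.map(_.outcome))
    val tRes = rows.map(r => r.treatment - treatmentModel.predict(r.confounder))
    val yRes = rows.map(r => r.outcome - outcomeModel.predict(r.confounder))
    // Step 2: regress outcome residuals on treatment residuals (no intercept).
    val denom = tRes.map(r => r * r).sum
    val numer = tRes.zip(yRes).map { case (t, y) => t * y }.sum
    val effect = if (denom == 0.0) 0.0 else numer / denom
    Fitted(treatmentModel, outcomeModel, effect)
  }
}
```

On data where the outcome is an exact linear function of treatment and confounder, `train` recovers the treatment coefficient and `predict` reproduces the outcome. Whether such a prediction is meaningful in a causal setting is exactly the question discussed in the rest of this thread.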

Contributor Author

@memoryz, Jason, did you sync with our data scientist on whether this is feasible?

Contributor Author

If there are no objections, I'll set it as by design.

Contributor

@mhamilton723, I discussed the feedback in depth with our data scientist @sarahshy, and she confirmed that there is no meaningful natural transformation we can do here. We can implement a natural transformation as you suggested, but the result won't be meaningful or interpretable. Therefore, I suggest we resolve this item as "by design". I can schedule a meeting with @sarahshy if you still have concerns.

Collaborator

@mhamilton723 mhamilton723 left a comment

Lovely work! Really excited to see this go in. Please feel free to chat if any of these comments don't make sense!

@acrolinxatmsft1

Acrolinx Scorecards

A minimum Acrolinx score of 80 is required.

Click the scorecard links for each article to review the Acrolinx feedback on grammar, spelling, punctuation, writing style, and terminology.

Article | Acrolinx score | Word and Phrases Score | Correctness Score | Scorecard | Processed
website/docs/documentation/estimators/causal/_causalInferenceDML.md | 100 | 100 | 100 | link
website/docs/documentation/estimators/estimators_causal.md | 100 | 100 | 100 | link

More information about Acrolinx

@dylanw-oss
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

memoryz
memoryz previously approved these changes Dec 8, 2022
Contributor

@memoryz memoryz left a comment

LGTM now.

@dylanw-oss
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

memoryz
memoryz previously approved these changes Dec 8, 2022

@memoryz
Contributor

memoryz commented Dec 16, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@memoryz
Contributor

memoryz commented Dec 16, 2022

@mhamilton723 ready for merge. :)

6 participants