Reevaluating Causal Estimation Methods with Data from a Product Release

Justin Young and Eleanor W. Dillon (2026)

This repository contains the benchmark datasets, replication code, and pre-computed artifacts for the paper Reevaluating Causal Estimation Methods with Data from a Product Release.

We release a paired dataset: a randomized experiment and a parallel observational study on the same population. The repository also includes a reproducible notebook (notebooks/01_main_results.ipynb) that demonstrates our recommended best practices for observational causal estimation.

Repository Structure

├── Data/
│   ├── README.md                            # Dataset documentation (44 columns)
│   ├── FINAL_PUBLIC_experimental.parquet    # Randomized A/B sample (435,170 obs)
│   └── FINAL_PUBLIC_observed.parquet        # Observational sample (445,286 obs)
├── notebooks/
│   └── 01_main_results.ipynb                # §4 main results: Figures 1–3, Table 2
├── src/                                     # Reusable Python modules
│   ├── data_loading.py                      # Data loading + covariate list
│   ├── propensity.py                        # FLAML-tuned LGBM propensity ensembles
│   ├── trimming.py                          # Crump et al. (2009) optimal trimming
│   ├── estimators.py                        # ATE estimators (Reg, OM, IPW, PSM, DR)
│   ├── ensemble_wrappers.py                 # AveragingRegressor / AveragingClassifier
│   ├── plotting.py                          # Figure generation helpers
│   ├── cache.py                             # Pickle load/compute helpers
│   ├── utils.py                             # Cross-fitting + ensembling utilities
│   ├── cate.py                              # CATE meta-learners (additional)
│   └── sensitivity.py                       # Sensitivity analysis (additional)
├── saved_outputs/                           # Pre-computed artifacts for fast replication
│   ├── prop_averaged_FLAML_FINAL_LGBM.pkl   #   observational propensity scores
│   ├── exp_dr_ate_FLAML_LGBM_Continuous.pkl #   experimental DR benchmark
│   ├── *_hyperparams.pkl                    #   FLAML-tuned LGBM hyperparameter dicts
│   └── *_ate_psm_pass_noreplace.pkl         #   cached PSM results (R Matching pkg)
├── requirements.txt
└── README.md

Quick Start

1. Clone the repository

git clone https://github.com/microsoft/Reevaluating-Causal-Estimation-Methods.git
cd Reevaluating-Causal-Estimation-Methods

2. Install Python dependencies

python -m venv .venv
# Windows:    .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt

3. Run the main notebook

jupyter notebook notebooks/01_main_results.ipynb

With the default flags (RERUN_ALL=False, RUN_PSM=False), the notebook loads pre-computed FLAML hyperparameters and cached PSM results from saved_outputs/, cross-fits fresh LGBM nuisance models on the public data, and reproduces Figures 1–3 and Table 2 in a few minutes on a laptop.
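The cross-fitting step mentioned above can be illustrated with a minimal sketch. This is not the code in src/utils.py: the function name `cross_fit_predict`, the fold count, and the toy mean predictor that stands in for a tuned LGBM model are all assumptions made for illustration. The point is only that each row's nuisance prediction comes from a model fit on the other folds.

```python
import random
from statistics import mean


def cross_fit_predict(y, n_folds=5, seed=0):
    """Return out-of-fold predictions for y using a toy mean predictor.

    Hypothetical helper: a real pipeline would fit a tuned learner on
    covariates instead of taking the training-fold mean used here.
    """
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled row indices into n_folds roughly equal folds.
    folds = [idx[k::n_folds] for k in range(n_folds)]

    preds = [None] * n
    for k, held_out in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        fitted = mean(y[i] for i in train)  # "fit" on the other folds only
        for i in held_out:
            preds[i] = fitted               # predict on the held-out fold
    return preds


y = [float(i) for i in range(10)]
oof = cross_fit_predict(y, n_folds=5)
```

Because no row is ever predicted by a model that saw it, overfitting in the nuisance models does not leak directly into the downstream effect estimate.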

4. (Optional) Re-run from scratch

RERUN_ALL = True      # full FLAML AutoML search (slow; minutes–hours)
RUN_PSM   = True      # recompute PSM via R (~30 min/call); see step 5
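One way a flag such as RERUN_ALL could gate a pickle cache is sketched below. The helper name `load_or_compute` and its signature are hypothetical and are not taken from src/cache.py; the sketch only shows the load-if-present, otherwise-compute-and-save pattern.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute_fn, rerun=False):
    """Load a pickled result from path, or compute it and save it.

    Hypothetical helper; the repository's own cache module may differ.
    """
    path = Path(path)
    if path.exists() and not rerun:
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()                       # the slow step
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Under this pattern, passing `rerun=True` forces the expensive computation and overwrites the cached file, while the default reuses whatever is already on disk.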

5. (Optional) Install R for propensity score matching

PSM uses R's Matching package (Sekhon 2011) via rpy2. PSM with replace=FALSE, ties=FALSE on the trimmed sample (~300k rows) takes ~30 minutes per call. If R is not installed or RUN_PSM=False, the notebook falls back to cached PSM results.

# 1. Install R: https://cran.r-project.org/
# 2. Install rpy2 + Matching:
pip install rpy2==3.6.7
Rscript -e 'install.packages("Matching", repos="https://cran.r-project.org")'
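To make the matching idea concrete, here is a pure-Python sketch of greedy 1-nearest-neighbor matching on the propensity score without replacement. It is an illustration only and is not the algorithm implemented in R's Matching package: the function name `greedy_psm_att`, the processing order, and the tie-breaking rule are assumptions. The value it returns is an average treated-minus-matched-control difference, and the estimand and settings used in the notebook may differ.

```python
def greedy_psm_att(y, d, e):
    """Greedy 1-NN propensity-score matching without replacement.

    y: outcomes, d: 0/1 treatment indicators, e: propensity scores.
    Returns the mean outcome difference over matched pairs.
    """
    treated = [i for i, di in enumerate(d) if di == 1]
    available = {i for i, di in enumerate(d) if di == 0}
    diffs = []
    for t in treated:
        if not available:
            break  # more treated than controls: the rest stay unmatched
        # Nearest unused control by score distance; index breaks ties.
        c = min(available, key=lambda j: (abs(e[j] - e[t]), j))
        available.remove(c)  # without replacement
        diffs.append(y[t] - y[c])
    if not diffs:
        raise ValueError("no matched pairs")
    return sum(diffs) / len(diffs)
```

This naive search compares every treated unit against every remaining control, so it is only suitable for small examples and would be far too slow on a sample of roughly 300k rows.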

What the Notebook Does

| Step | Description | Paper reference |
|------|-------------|-----------------|
| 1 | Load public experimental & observational datasets | §2 |
| 2 | Compute naive difference-in-means | §4 |
| 3 | Estimate propensity scores (ensembled, tuned LGBM) | §4 |
| 4 | Propensity score distributions | Figure 1 |
| 5 | Apply Crump et al. (2009) optimal trimming | §4, Table 2 |
| 6 | Establish experimental ground-truth benchmark | §4 |
| 7 | Fit tuned, ensembled nuisance models (cross-fit) | §4 |
| 8 | Estimate ATE with five methods | §4, Figure 2 |
| 9 | Compare trimmed vs. untrimmed results | §4, Figure 3 |
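The trimming in step 5 can be sketched as a sample analog of the Crump et al. (2009) rule. This is not the code in src/trimming.py, and the function name `crump_trim` is hypothetical. Under the homoskedasticity simplification, the rule keeps units with propensity score e in [alpha, 1 - alpha], where the cutoff minimizes E[g | kept] / P(kept) with g = 1 / (e(1 - e)). The sketch searches over the observed values of g for that minimum.

```python
import math


def crump_trim(scores):
    """Sample-analog sketch of the Crump et al. (2009) trimming rule.

    Returns (alpha, keep), where keep[i] is True when scores[i] is retained.
    Assumes every score lies strictly between 0 and 1.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not all(0.0 < e < 1.0 for e in scores):
        raise ValueError("scores must lie strictly between 0 and 1")
    n = len(scores)
    g = [1.0 / (e * (1.0 - e)) for e in scores]
    g_sorted = sorted(g)

    # Objective when keeping the k smallest g values:
    # E[g | kept] / P(kept) = n * sum(g_1..g_k) / k**2
    best_obj, gamma, running = math.inf, g_sorted[-1], 0.0
    for k, gk in enumerate(g_sorted, start=1):
        running += gk
        obj = n * running / (k * k)
        if obj <= best_obj:  # on ties, prefer keeping more rows
            best_obj, gamma = obj, gk

    # Invert gamma = 1 / (alpha * (1 - alpha)) for alpha in [0, 0.5].
    alpha = 0.5 - math.sqrt(max(0.0, 0.25 - 1.0 / gamma))
    keep = [gi <= gamma for gi in g]
    return alpha, keep
```

When nothing is trimmed, the returned alpha is simply the boundary implied by the most extreme observed score rather than zero.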

Estimators

| Estimator | Method |
|-----------|--------|
| Reg | OLS on `y ~ D + W` |
| OM | Outcome modeling (cross-fit) |
| IPW | Inverse probability weighting (cross-fit) |
| PSM | 1-NN propensity-score matching (R `Matching`) |
| DR | Cross-fit AIPW via EconML `LinearDRLearner` |
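For readers who want the formulas behind three of these rows, the sketch below gives textbook versions of the naive difference in means, IPW, and AIPW, each taking already-computed nuisance predictions as plain lists. It is not the code in src/estimators.py and does not call EconML. The function names are hypothetical, and the use of normalized (Hajek) weights in the IPW version is an assumption rather than something the README states.

```python
def diff_in_means(y, d):
    """Naive difference in mean outcomes, treated minus control."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_ate(y, d, e):
    """Normalized (Hajek) IPW estimate of the ATE from scores e."""
    w1 = [di / ei for di, ei in zip(d, e)]
    w0 = [(1 - di) / (1 - ei) for di, ei in zip(d, e)]
    mu1 = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mu0 = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mu1 - mu0


def aipw_ate(y, d, e, m1, m0):
    """AIPW (doubly robust) ATE from nuisance predictions.

    m1, m0: predicted outcomes under treatment and control.
    e: predicted propensity scores.
    """
    scores = [
        (a - b) + di * (yi - a) / ei - (1 - di) * (yi - b) / (1 - ei)
        for yi, di, ei, a, b in zip(y, d, e, m1, m0)
    ]
    return sum(scores) / len(scores)
```

In a cross-fit workflow, `e`, `m1`, and `m0` would be out-of-fold predictions, so that no row's score is built from a model trained on that row.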

Citation

@article{young2026reevaluating,
  title={Reevaluating Causal Estimation Methods with Data from a Product Release},
  author={Young, Justin and Dillon, Eleanor W.},
  year={2026},
  url={https://arxiv.org/abs/2601.11845}
}

License

MIT. See LICENSE.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

This project has adopted the Microsoft Open Source Code of Conduct.
