# OpportunityFinder: A Framework for Automated Causal Inference

- [paper link](https://assets.amazon.science/64/f8/071f0d24476797333f0571270106/opportunityfinder-a-framework-for-automated-causal-inference.pdf)
- [amazon science link](https://www.amazon.science/publications/opportunityfinder-a-framework-for-automated-causal-inference)

**Note:** the notes below are written in the first person because the paper is, which makes note-taking easier. To be clear, I had nothing to do with writing the paper!


## Abstract

A code-less framework that lets non-expert users run a variety of causal inference studies on panel data. Users only need to provide raw observational data and a configuration file. This triggers a pipeline that inspects and processes the data, then chooses suitable algorithm(s) to execute the causal study. It returns the causal impact of the treatment on the configured outcome, together with sensitivity and robustness results.

## 1. Introduction

Automated machine learning (AutoML) frameworks have advanced significantly with the introduction of AutoGluon, Auto-sklearn, H2O, etc. Their advantage is that they abstract away the implementation of the underlying algorithms and hyper-parameter tuning, making it easy to experiment with a large number of models and identify the one that works best.

Causal inference is more challenging, however, as different methods rely on different sets of assumptions.

Currently there is DoWhy (still a low-level API) as well as AutoCausality, which is built on top of EconML and DoWhy. AutoCausality supports hyperparameter tuning but assumes that the provided causal graph is accurate. Also, neither supports panel data, which is common in real-world problems.

In OpportunityFinder (OPF), the decision to choose the algorithm is automated for both panel and cross-sectional data.

## 2. Literature Review

Traditional techniques such as propensity score matching (PSM), instrumental variables (IVs), and difference-in-differences (DiD) often struggle to account for high-dimensional covariates and complex interactions. The Synthetic Control Method (SCM) extends these approaches, and Generalized Synthetic Control (GSC) expands SCM further by incorporating interactive fixed effects models.

Recently, ML techniques have been widely integrated through packages such as DoubleML, EconML, and CausalML. Double machine learning in particular provides a flexible approach.

## 3. Framework Design

The key contributions of our design are:
1. integration of several causal modeling methods
2. branching based on type of observational data (cross-sectional vs panel) and number of treatment units
3. execution in the users' own AWS environment

### 3.1 Data Requirements

The configuration file has optional settings, such as the list of features to scale or the algorithms to choose from. The mandatory settings are the columns holding the time, unit id, outcome / treatment variables, etc.
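To make the split between mandatory and optional settings concrete, here is a minimal sketch of what a configuration check could look like. All key names (`time_column`, `features_to_scale`, `algorithms`, ...) and the `validate_config` helper are my own placeholders; the paper does not publish OPF's actual schema.

```python
# Hypothetical key names: illustrative only, not OPF's real configuration schema.
MANDATORY_KEYS = ("time_column", "unit_id_column", "outcome_column", "treatment_column")
OPTIONAL_COLUMN_LISTS = ("features_to_scale",)


def validate_config(config, data_columns):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    for key in MANDATORY_KEYS:
        if key not in config:
            problems.append(f"missing mandatory key: {key}")
        elif config[key] not in data_columns:
            problems.append(f"{key} points at unknown column: {config[key]}")
    for key in OPTIONAL_COLUMN_LISTS:
        for column in config.get(key, []):
            if column not in data_columns:
                problems.append(f"{key} contains unknown column: {column}")
    return problems


example_columns = ["week", "customer_id", "spend", "got_promo", "tenure"]
example_config = {
    "time_column": "week",
    "unit_id_column": "customer_id",
    "outcome_column": "spend",
    "treatment_column": "got_promo",
    "features_to_scale": ["tenure"],  # optional
    "algorithms": ["DML"],            # optional: restrict the model choice
}

print(validate_config(example_config, example_columns))
print(validate_config({"time_column": "week"}, example_columns))
```

Returning a list of problems (rather than raising on the first one) fits the non-expert audience: a user can fix the whole file in one pass.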

### 3.2 Implementation Details

The choice of algorithm has two stages. The first stage is a set of rules, for example:
- is the data panel or cross-sectional?
- is the total event data less or more than 500k?
- how many time periods are available?
- etc.
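A rule stage like the one above could be sketched as a plain function over a few dataset statistics. This is my own guess at the shape of such rules, not the paper's actual rule set: only the 500k threshold and the kinds of questions come from the paper, while the mapping from answers to candidate models, the `min_periods` cutoff, and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Summary statistics a rule stage might inspect (hypothetical fields)."""
    is_panel: bool
    n_events: int
    n_time_periods: int
    n_treated_units: int


def candidate_algorithms(profile, large_data_threshold=500_000, min_periods=10):
    """Map a dataset profile to candidate models. The mapping is illustrative."""
    if not profile.is_panel:
        # Cross-sectional data: ML-based estimators are applicable.
        return ["DML", "Causal Forests", "Meta Learners", "PSM"]
    if profile.n_treated_units == 1:
        # A single treated unit rules out ML-based estimators.
        return ["SC", "GSC"]
    if profile.n_time_periods < min_periods:
        # Assumed cutoff: too few periods to learn synthetic control weights.
        return ["DiD"]
    if profile.n_events > large_data_threshold:
        # Assumed choice: large panels get converted to cohorts and sent to DML.
        return ["DML (on cohorts)"]
    return ["GSC", "DiD"]


single_unit_panel = DatasetProfile(
    is_panel=True, n_events=1_200, n_time_periods=30, n_treated_units=1
)
print(candidate_algorithms(single_unit_panel))
```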

OPF also allows panel data to be transformed into cohorts (i.e., into cross-sectional data, which allows for techniques like double machine learning).
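One way to picture the panel-to-cohort transformation: collapse each unit's time series into a single row with pre-treatment features and a post-treatment outcome. The sketch below is my own simplification (one shared treatment start, mean aggregation, hypothetical field names) and says nothing about how OPF actually builds cohorts.

```python
from collections import defaultdict
from statistics import mean


def panel_to_cohort(rows, treatment_start):
    """Collapse panel rows (one per unit and time) into one row per unit.

    Each input row is a dict with the hypothetical keys
    'unit', 'time', 'outcome', and 'treated' (0 or 1).
    """
    pre, post, treated = defaultdict(list), defaultdict(list), {}
    for row in rows:
        bucket = pre if row["time"] < treatment_start else post
        bucket[row["unit"]].append(row["outcome"])
        treated[row["unit"]] = max(treated.get(row["unit"], 0), row["treated"])

    cohort = []
    for unit in sorted(treated):
        if not pre[unit] or not post[unit]:
            continue  # skip units not observed on both sides of the start date
        cohort.append({
            "unit": unit,
            "pre_outcome_mean": mean(pre[unit]),    # usable as a covariate
            "post_outcome_mean": mean(post[unit]),  # cross-sectional outcome
            "treated": treated[unit],
        })
    return cohort


panel = [
    {"unit": "A", "time": 1, "outcome": 10, "treated": 0},
    {"unit": "A", "time": 2, "outcome": 12, "treated": 0},
    {"unit": "A", "time": 3, "outcome": 15, "treated": 1},
    {"unit": "A", "time": 4, "outcome": 17, "treated": 1},
    {"unit": "B", "time": 1, "outcome": 8, "treated": 0},
    {"unit": "B", "time": 2, "outcome": 10, "treated": 0},
    {"unit": "B", "time": 3, "outcome": 9, "treated": 0},
    {"unit": "B", "time": 4, "outcome": 11, "treated": 0},
]
cohort = panel_to_cohort(panel, treatment_start=3)
for row in cohort:
    print(row)
```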

The following Causal Inference Models are the options available in OPF (that the algorithm will try to choose from):
- Synthetic Control (SC) and Generalized Synthetic Control (GSC)
- Double Machine Learning (DML)
- Causal Forests
- Neural Network based approaches
- Meta Learners
- Difference in Differences (DiD)
- Propensity Score Matching (PSM)

**Note:** The paper claims that DML is superior to PSM because it overcomes PSM's limitations and is more robust. I have not read about DML before, so this is something to read up on. That said, PSM has indeed come under scrutiny across the industry.

DML models and their treatment effect estimates are validated through refutation tests from the DoWhy package:
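Since DML is new to me, here is a minimal sketch of the core idea (my own, not code from the paper or from OPF): predict both the outcome and the treatment from the covariates using cross-fitting, then regress the outcome residuals on the treatment residuals. It assumes a partially linear model with a constant treatment effect, and it uses plain least squares as the nuisance model where a real DML setup would plug in flexible ML learners. The synthetic data and the names `dml_effect` and `fit_predict_linear` are illustrative.

```python
import numpy as np


def fit_predict_linear(X_train, y_train, X_test):
    """Least-squares nuisance model (stand-in for any ML regressor)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def dml_effect(X, t, y, n_folds=2, seed=0):
    """Cross-fitted residual-on-residual estimate of a constant effect."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_res, t_res = np.empty(len(y)), np.empty(len(y))
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Residualise outcome and treatment on held-out data only.
        y_res[test_idx] = y[test_idx] - fit_predict_linear(X[train_idx], y[train_idx], X[test_idx])
        t_res[test_idx] = t[test_idx] - fit_predict_linear(X[train_idx], t[train_idx], X[test_idx])
    return float(t_res @ y_res / (t_res @ t_res))


# Synthetic confounded data with a known effect of 2.0.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
t = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)            # treatment depends on X
y = 2.0 * t + X @ np.array([2.0, -1.0, 1.0]) + rng.normal(size=n)  # so does the outcome

naive = float(np.cov(t, y)[0, 1] / np.var(t, ddof=1))  # ignores confounding
dml = dml_effect(X, t, y)
print(f"naive slope: {naive:.2f}, DML estimate: {dml:.2f}, true effect: 2.00")
```

The naive slope absorbs the confounding through `X`, while the residual-on-residual estimate should land close to the true effect on this data.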
- Add random common cause
- Add unobserved common cause
- Data subsets validation
- Placebo Treatment
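To illustrate what a refutation test checks, here is a hand-rolled version of the placebo treatment idea: replace the real treatment with a randomly permuted one and re-estimate. If the original estimate reflects a real effect, the placebo estimates should sit near zero. This is not DoWhy's implementation or API; the difference-in-means estimator, the synthetic randomized data, and the function names are all assumptions for the sketch.

```python
import numpy as np


def difference_in_means(y, t):
    """Simple effect estimate: mean outcome of treated minus control."""
    return float(y[t == 1].mean() - y[t == 0].mean())


def placebo_refutation(y, t, n_runs=100, seed=0):
    """Re-estimate the effect with the treatment column randomly permuted."""
    rng = np.random.default_rng(seed)
    return np.array([difference_in_means(y, rng.permutation(t)) for _ in range(n_runs)])


# Synthetic randomized experiment with a true effect of 3.0.
rng = np.random.default_rng(42)
n = 2000
t = rng.integers(0, 2, size=n)
y = 3.0 * t + rng.normal(size=n)

estimate = difference_in_means(y, t)
placebo_effects = placebo_refutation(y, t)
print(f"original estimate: {estimate:.2f}")
print(f"mean placebo effect: {placebo_effects.mean():.3f}")
```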

The GSC model is validated with a suite of sensitivity tests that check how the estimated causal effect changes under small changes in the data, such as:
- random down-sampling
- different pre-treatment window for learning synthetic control weights
- reduced covariate list
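The random down-sampling check can be sketched as follows: re-run the estimator on random subsets of units and look at how much the estimate moves. To keep the example short, a simple difference-in-differences estimator stands in for GSC, which is a substitution on my part; the 80% sampling fraction, the synthetic data, and the function names are also assumptions.

```python
import numpy as np


def did_estimate(pre, post, treated):
    """Difference-in-differences on per-unit pre/post mean outcomes."""
    change = post - pre
    return float(change[treated].mean() - change[~treated].mean())


def downsample_sensitivity(pre, post, treated, frac=0.8, n_runs=50, seed=0):
    """Re-estimate the effect on random subsets of units."""
    rng = np.random.default_rng(seed)
    n_keep = int(frac * len(pre))
    estimates = []
    for _ in range(n_runs):
        keep = rng.choice(len(pre), size=n_keep, replace=False)
        estimates.append(did_estimate(pre[keep], post[keep], treated[keep]))
    return np.array(estimates)


# Synthetic panel summary: 400 units, true effect 1.5, common time trend 1.0.
rng = np.random.default_rng(1)
n_units = 400
treated = np.arange(n_units) % 2 == 0
baseline = 10 + rng.normal(size=n_units) + 2.0 * treated  # treated units start higher
pre = baseline + rng.normal(scale=0.3, size=n_units)
post = baseline + 1.0 + 1.5 * treated + rng.normal(scale=0.3, size=n_units)

full_estimate = did_estimate(pre, post, treated)
subsample_estimates = downsample_sensitivity(pre, post, treated)
print(f"full-data estimate: {full_estimate:.2f}")
print(f"subsample spread (std): {subsample_estimates.std():.3f}")
```

A small spread relative to the estimate itself suggests the result is not driven by a handful of units.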

### 3.3 Limitations and Risks

As of today, OPF does not implement causal graph generation algorithms.

## 4. Validation of Causal Estimates by OpportunityFinder

We validate the causal inference algorithms on benchmark datasets using 3 metrics:

1. ATE (average treatment effect)
2. ATT (average treatment effect on the treated)
3. MAE (mean absolute error): the average absolute difference between the estimated ATE and the true ATE, where available, used to evaluate the accuracy of a causal estimation method.
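On benchmark data where both potential outcomes are known for every unit, the three metrics reduce to a few lines. The sketch below uses made-up numbers and my own function names, purely to pin down the definitions.

```python
from statistics import mean


def ate(y_treated, y_control):
    """Average treatment effect: mean of per-unit effects over all units."""
    return mean(y1 - y0 for y1, y0 in zip(y_treated, y_control))


def att(y_treated, y_control, treated):
    """Average treatment effect on the treated: same mean, treated units only."""
    return mean(y1 - y0 for y1, y0, t in zip(y_treated, y_control, treated) if t)


def mae(estimated_ates, true_ates):
    """Mean absolute error between estimated and true ATEs across datasets."""
    return mean(abs(est - true) for est, true in zip(estimated_ates, true_ates))


# Made-up potential outcomes for four units (both outcomes known, as in simulations).
y_if_treated = [5, 7, 9, 4]
y_if_control = [3, 4, 8, 4]
is_treated = [1, 1, 0, 0]

print(ate(y_if_treated, y_if_control))              # per-unit effects: 2, 3, 1, 0
print(att(y_if_treated, y_if_control, is_treated))  # effects of treated units: 2, 3
print(mae([1.4, 1.7], [1.5, 1.5]))
```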

### 4.1 IHDP (public benchmark)

Infant Health and Development Program

### 4.2 Smoking (public benchmark)

Has only one treated unit, thus causal estimations based on machine learning (DML, NN) do not apply.

### 4.3 Synthetic Data 1: Cross Sectional

Generated using DoWhy.

### 4.4 Synthetic Data 2: Large Panel

### 4.5 Discussion on Model Choice

So far, OPF performs well on synthetic data. We will continue to iterate as NN models become more popular for causal inference.

## 5. Applications on Real World Data

OPF has been used to measure uplift internally at Amazon. Uplift is defined as the percentage increase/decrease in the outcome attributed to the treatment over a defined period. It is calculated as the ATE or ATT divided by the average outcome over control units.
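As a quick worked example of the uplift calculation described above (the numbers and the `uplift_pct` name are made up for illustration):

```python
from statistics import mean


def uplift_pct(effect, control_outcomes):
    """Uplift in percent: estimated effect (ATE or ATT) over the control average."""
    baseline = mean(control_outcomes)
    if baseline == 0:
        raise ValueError("control average is zero; uplift is undefined")
    return 100 * effect / baseline


# An estimated ATE of 1.5 against control units averaging 20 is a 7.5% uplift.
example_uplift = uplift_pct(1.5, [10, 20, 30])
print(f"{example_uplift:.1f}%")
```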

## 6. Conclusion and Future Work

Actively taking feature requests from current OPF users.