# Advances in Machine Learning with Big Data

### Trinity 2021
### Jeremy Large
#### jeremy.large@economics.ox.ac.uk


&#169; Jeremy Large ; shared under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

## 11. Introductory remarks on Causal Inference

## Contents Weeks 1-7:

1. Introducing this course's dataset

1. Being an econometrician _and_ a data scientist

1. Overfit and regularization

1. Regularization through predictor/feature selection (Lasso etc.)

1. Resampling methods, and model selection

1. Classification

1. Decision trees, bagging, and random forests

1. Make a start on neural networks

1. Convolutional neural nets and image classification (Lucas Kruitwagen)

1. Transfer learning (Lucas Kruitwagen)

1. **Causal inference**

### The relationship between machine learning and empirical economics

Judea Pearl (~2018): ["All the impressive achievements of deep learning amount to just curve fitting"](https://www.theatlantic.com/technology/archive/2018/05/machine-learning-is-stuck-on-asking-why/560675/)

He's deprecating the activity of predicting, and he certainly has a point:

Prediction may be interesting for observers, speculators, but ...

... for action, including policy and law, we need to understand causes and effects.

* Often, we want causation, not correlations,
    
* so we want to do Causal Inference.

Interestingly, Pearl is not quite in agreement with mainstream econometricians about how to formulate CI:

* [Imbens (2020)](https://arxiv.org/pdf/1907.07271.pdf) outlines the fault-lines

> *Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics*


A comprehensive, recent overview of ML and empirical economics is in [Athey and Imbens (2019)](https://arxiv.org/pdf/1903.10075.pdf).

* strong emphasis on causation (esp. Section 6)

### **Manipulability** and the meaning of 'cause'

Widespread agreement on the centrality of 'manipulation' in causal theory.

Kids play/experiment -> discover their ability to affect the world -> discover they can cause changes in the world

They watch other kids play and learn cause/effect by watching

Adults (we) inherit the same way of thinking:
* sometimes we experiment ourselves (Randomised Controlled Trials / A/B tests);
* sometimes we watch others experiment;
* sometimes we watch a natural experiment ('observational study').

Policy-makers, econometricians and statisticians do the same

We often have to make do with observational studies.

**Recall: supervised learning**: an i.i.d. sequence of observations, $\{(Y_i, X_i), i=0, 1, ...\}$. There is a distribution of the R.V. $Y_i$, conditional on the stacked regressors, namely $X$:
\begin{equation}
Y_i | X \ \ \sim \ \ \mathcal f_{X_i; \theta},
\end{equation}

where $\theta$ stands in for our parameters.

NB: Just because we might predict $Y_i$ from $X_i$ in test data, this does not mean that $X_i$ causes $Y_i$.
* they could share a common cause (a confounder)

* so we need to add at least a third random variable to our mix

* we'll let $X_i$ do the job of 'common cause / confounder'.

* ... and we'll add a *treatment* variable, $W_i$, that might affect our *outcome*, $Y_i$
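
A minimal simulation of the common-cause point, as a sketch rather than anything taken from the course dataset. The variable names `z`, `a`, `b` and the Gaussian data-generating process are assumptions made purely for illustration: `z` is a common cause, and `a` and `b` never influence each other. They are nonetheless correlated, and the correlation vanishes once `a` is set by hand instead of being driven by `z`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: z is a common cause of a and b.
# Neither a nor b has any causal effect on the other.
z = rng.normal(size=n)
a = z + rng.normal(size=n)
b = z + rng.normal(size=n)

# Observed correlation: a is a useful predictor of b (theoretical value 0.5).
corr_observed = np.corrcoef(a, b)[0, 1]

# Manipulation: we set a ourselves, independently of z.
# b is generated exactly as before, so it does not respond.
a_set_by_hand = rng.normal(size=n)
corr_after_manipulation = np.corrcoef(a_set_by_hand, b)[0, 1]

print(f"correlation when a is merely observed: {corr_observed:.3f}")
print(f"correlation when a is set by hand:     {corr_after_manipulation:.3f}")
```

Predictive power survives in the first case and disappears in the second, which is why prediction alone cannot tell us what happens when we act.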

### Treatment

Call the treatment $W_i$.

So we see an i.i.d. sequence of 'observational data', $\{(Y_i, X_i, W_i), i=0, 1, ...\}$. The treatment, $W_i$, can be 0 or 1; 'absent' or 'present'. 

The crucial idea is that, unlike $Y_i$ and $X_i$,
* $W_i$ is special because, even though we often simply observe it, sometimes we can *also* ...
\begin{equation*}
do(W_i)
\end{equation*}

* When we $do(W_i)$ we **break the correlation** between $W_i$ and its usual causes (namely, here, $X_i$).

This is a random variable on which we can intervene and act! See Pearl's [do-calculus](https://en.wikipedia.org/wiki/Causal_model#Do_calculus).

When we $do(W_i=0)$, there is a distribution of the R.V. $Y_i$, conditional on the stacked regressors, namely $X$:
\begin{equation*}
Y_i(0) | X \ \ \sim \ \ \mathcal f^0_{X_i; \theta},
\end{equation*}
Similarly, when we $do(W_i=1)$,
\begin{equation*}
Y_i(1) | X \ \ \sim \ \ \mathcal f^1_{X_i; \theta},
\end{equation*}

where $\theta$ stands in for our parameters.

And **if the treatment has an effect, then**:
\begin{equation*}
\mathcal f^1_{X_i; \theta} \ \ \neq \ \ \mathcal f^0_{X_i; \theta}.
\end{equation*}

### Treatment Effects

**ITE**  : $Y_i(1) - Y_i(0)$
 * individual treatment effect
 * unknowable! - we never see both parts
 

**ATE**  : $E[Y_i(1) - Y_i(0)]$
 * average treatment effect
 * a significant objective in science

Important point: in general,
\begin{equation*}
E[Y_i(1) - Y_i(0)] \ \ \neq \ \ E[Y_i| W_i=1] - E[Y_i | W_i=0],
\end{equation*}
because, being good Bayesians, we get *backdoor* information about $X_i$ from $W_i$, and that information pertains to $Y_i$ as well.

... unless, that is, $\{W_i\}$ is being generated in a randomized trial.
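
The inequality can be checked in a small simulation. This is only a sketch: the binary confounder, the treatment probabilities (0.8 and 0.2) and the constant treatment effect of 1 are all assumed values chosen for illustration. Because it is a simulation, both potential outcomes are visible, which is never the case in real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical setup: a binary confounder X_i that raises the outcome.
x = rng.binomial(1, 0.5, size=n)
y0 = 2.0 * x + rng.normal(size=n)   # potential outcome Y_i(0)
y1 = y0 + 1.0                       # potential outcome Y_i(1); effect is 1 for everyone

ate_true = np.mean(y1 - y0)         # only computable because we simulate both outcomes

def difference_in_means(y, w):
    """E[Y | W=1] - E[Y | W=0], estimated from the sample."""
    return y[w == 1].mean() - y[w == 0].mean()

# Observational data: units with x = 1 are far more likely to be treated.
w_obs = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
y_obs = np.where(w_obs == 1, y1, y0)    # we only ever see one potential outcome
naive_obs = difference_in_means(y_obs, w_obs)

# Randomized trial: treatment is a coin flip, independent of x.
w_rct = rng.binomial(1, 0.5, size=n)
y_rct = np.where(w_rct == 1, y1, y0)
naive_rct = difference_in_means(y_rct, w_rct)

print(f"true ATE:                           {ate_true:.3f}")
print(f"difference in means, observational: {naive_obs:.3f}")
print(f"difference in means, randomized:    {naive_rct:.3f}")
```

Under these assumed numbers the observational difference in means is centred on 2.2 rather than 1, because treated units are disproportionately those with `x = 1`. Randomizing the treatment removes that backdoor information.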

### Sufficient Adjustment Set

Recall that, in general,
\begin{equation*}
E[Y_i(1) - Y_i(0)] \ \ \neq \ \ E[Y_i| W_i=1] - E[Y_i | W_i=0].
\end{equation*}

However, if $X$ contains enough confounding variables, then:
\begin{equation*}
E[Y_i| do(W_i=w), X_i] \ \ = \ \ E[Y_i|W_i=w, X_i],
\end{equation*}
so that, using the Law of Iterated Expectations,
\begin{equation*}
E[Y_i(w)] \ \ = \ \ E_X\left[ E[Y_i|W_i=w, X_i] \right],
\end{equation*}
and we can use this to calculate the ATE of $W_i$ as $E_X\left[ E[Y_i|W_i=1, X_i] - E[Y_i|W_i=0, X_i] \right]$.

If $X$ has this property, then we say that $X$ is a *sufficient adjustment set*.
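
A sketch of the adjustment formula $E_X\left[ E[Y_i|W_i=w, X_i] \right]$ for a discrete $X$, estimated by stratification. The data-generating process, the helper name `adjusted_mean`, and all numerical values are assumptions for illustration only. Here $X$ is the sole confounder by construction, so it is a sufficient adjustment set, and the treatment effect is $1 + X_i$, giving a true ATE of 1.5.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical observational data with a single binary confounder.
x = rng.binomial(1, 0.5, size=n)
w = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # treatment depends on x
y0 = 2.0 * x + rng.normal(size=n)                 # Y_i(0)
y1 = y0 + 1.0 + x                                 # Y_i(1); effect is 1 + x
y = np.where(w == 1, y1, y0)                      # observed outcome

def adjusted_mean(y, w, x, level):
    """Estimate E_X[ E[Y | W=level, X] ]: average the within-stratum means
    of Y among units with W=level, weighting by the marginal share of each X."""
    total = 0.0
    for value in np.unique(x):
        in_stratum = x == value
        stratum_mean = y[in_stratum & (w == level)].mean()
        total += in_stratum.mean() * stratum_mean
    return total

ate_adjusted = adjusted_mean(y, w, x, 1) - adjusted_mean(y, w, x, 0)
ate_naive = y[w == 1].mean() - y[w == 0].mean()

print(f"naive difference in means: {ate_naive:.3f}")
print(f"adjusted estimate of ATE:  {ate_adjusted:.3f}")
```

The weights are the marginal shares of $X$, not the shares among the treated or the untreated; that is what the outer expectation $E_X$ requires. With continuous or high-dimensional $X$, the inner conditional expectation would instead be fitted with a regression or ML model, which this sketch does not attempt.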

> The case of estimating average treatment effects under unconfoundedness is an example of a more general theme from econometrics; typically, economists prioritize **precise estimates of causal effects** above **predictive power** ([Athey and Imbens (2019)](https://arxiv.org/pdf/1903.10075.pdf))