# Lecture 1: An Overview

## Basic Concepts:

### Two Purposes of Statistical Inference:
- **Prediction:**
    - Purpose: 
        - Given X (features/independent variables), what is our best estimate of the outcome Y (target/dependent variable)?
    - Characteristic:
        - Minimize some form of expected prediction error (using the in-sample prediction error on Y as an approximation)
        - Care about and evaluate out-of-sample performance (is there overfitting? is the prediction stable?)
- **Causal Inference:**
    - Purpose: 
        - What happens to Y if we intervene to change some X while keeping the other X unchanged? (The resulting change in Y is the treatment effect.)
        - In other words, we are predicting (estimating) the treatment effect (TE)
    - Characteristic:
        - Impossible to minimize the expected prediction error directly, as the true TE is never observed
        - As a result, we need variation that isolates the causal mechanism in order to identify the TE
            - For experimental data, such variation is guaranteed by random assignment of treatment
            - For observational data, the variation must come from a quasi-experiment or from identification assumptions
        - Care about out-of-sample performance (is the TE estimator stable?) but cannot evaluate it directly
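
The role of random assignment can be illustrated with a small simulation. This is a minimal sketch under assumed values: the data generating process, the true TE of 2, and the function name `simulate_difference_in_means` are all made up for illustration. Both potential outcomes are generated, but the estimator only ever sees one of them per unit.

```python
import numpy as np

def simulate_difference_in_means(n=100_000, true_te=2.0, seed=0):
    """Compare a difference in means under random vs. confounded assignment."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                 # confounder (hypothetical)
    y0 = x + rng.normal(size=n)            # potential outcome without treatment
    y1 = y0 + true_te                      # potential outcome with treatment

    # Experimental data: treatment is a coin flip, independent of x.
    d_random = rng.integers(0, 2, size=n)
    # Observational data: units with high x are more likely to be treated.
    d_confounded = (x + rng.normal(size=n) > 0).astype(int)

    def diff_in_means(d):
        y = np.where(d == 1, y1, y0)       # only one potential outcome is observed
        return y[d == 1].mean() - y[d == 0].mean()

    return diff_in_means(d_random), diff_in_means(d_confounded)

est_random, est_confounded = simulate_difference_in_means()
print(f"random assignment: {est_random:.2f}, confounded assignment: {est_confounded:.2f}")
```

Under random assignment the difference in means should land close to the assumed TE, while under confounded assignment it also picks up the difference in `x` between treated and untreated units.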

### Two Cultures of Statistical Inference:
- **Parametric Methods:**
    - Assumes that the data are generated by a given stochastic data model / data generation process (DGP)
    - DGP: an assumed relationship between Y and X plus a stochastic error term
- **Non-parametric Methods:**
    - Uses algorithmic models and treats the data generation process as unknown
    - These methods are usually referred to as machine learning
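
The contrast between the two cultures can be sketched on simulated data. Everything here is an illustrative assumption: the sine-shaped DGP, the sample sizes, and the choice of k-nearest-neighbour averaging as the algorithmic method. The parametric fit assumes a linear DGP; the k-NN fit assumes nothing about the functional form.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical DGP, unknown to both methods: Y = sin(2X) + noise
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(2 * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(500)
x_test, y_test = make_data(500)

# Parametric: assume Y = a + b * X + error and estimate (a, b) by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
pred_linear = intercept + slope * x_test

# Algorithmic: predict with the average Y of the k nearest training points.
def knn_predict(x_tr, y_tr, x_new, k=10):
    dist = np.abs(x_new[:, None] - x_tr[None, :])   # pairwise distances
    nearest = np.argsort(dist, axis=1)[:, :k]       # indices of the k closest points
    return y_tr[nearest].mean(axis=1)

pred_knn = knn_predict(x_train, y_train, x_test)

mse_linear = np.mean((y_test - pred_linear) ** 2)
mse_knn = np.mean((y_test - pred_knn) ** 2)
print(f"out-of-sample MSE, linear: {mse_linear:.3f}, k-NN: {mse_knn:.3f}")
```

When the assumed linear DGP is wrong, as it is by construction here, the algorithmic method should predict better out of sample; if the true relation were linear, the parametric fit would be hard to beat.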

### Machine Learning Revolution:
- Traditional statistics adopted parametric methods for prediction.
- Machine learning (non-parametric methods) revolutionized this field through three advantages:
    - ML concentrated on developing computationally efficient algorithms
    - By introducing regularization, many ML methods exploit the variance-bias trade-off to tackle prediction problems in high dimensions
    - ML algorithms are model agnostic; they do not pre-assume a functional form for the relation between features and target
- Big data challenge and the variance-bias trade-off
    - For big data, the number of variables is usually large relative to the number of observations (high-dimension problem)
    - Most parametric methods produce high-variance estimators in this scenario (reducing out-of-sample prediction accuracy)
    - By introducing regularization, ML methods shrink some coefficients or drop some variables to reduce the variance of the estimators
    - This improves the out-of-sample stability of the estimator (and the prediction) while increasing bias (the estimator is no longer correct on average)
    - That's why it is called variance-bias trade-off
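
The trade-off can be written down for a generic estimator $\hat \theta$ of a parameter $\theta$ (notation introduced here for illustration only). Its mean squared error splits into a squared-bias term and a variance term:

$$
\mathbb{E}\big[(\hat \theta - \theta)^2\big] = \big(\mathbb{E}[\hat \theta] - \theta\big)^2 + \mathrm{Var}(\hat \theta)
$$

Regularization accepts a non-zero first term in exchange for a smaller second term; it helps whenever the drop in variance exceeds the added squared bias.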

### Machine Learning and Econometrics:
- The adoption of ML methods in econometrics has been slower, as the goal of econometrics is causal inference, while ML aims to optimize prediction
- However, machine learning (non-parametric methods) can be combined with causal inference; this is called causal machine learning
- **Benefit of causal ML:**
    - ML methods do not create quasi-experimental variation
    - That is, if the identification assumptions for causal inference (in RCT, conditional independence, DiD, instrumental variables etc.) do not hold, causal ML will not help
    - Instead, causal ML methods provide tools for robustness checks and for improving the efficiency (lowering the variance) of TE estimators
    - For example, a TE identified using parametric methods that control for selection and confounding effects may not be significantly different from 0 (its 95% CI includes 0)
    - Causal ML might overturn this conclusion by narrowing the CI (but there is no guarantee)
- **Challenge of causal ML:**
    - Causal inference is based on identification assumptions
    - These assumptions need to be carried over when ML is used for causal inference
    - This has created challenges in adapting predictive algorithms for causal inference

## Causal ML: Example on LASSO

### Set Up:
- Suppose we want to estimate the effect of a treatment $d$ on $Y$ with confounders $X$
- $X$ is a vector of $P$ variables, with $P$ very close to the number of observations $N$
- The conditional independence assumption holds: after controlling for $X$, the assignment of $d$ can be treated as random

### OLS:
- OLS equation: $Y = \tau d + X \beta + \epsilon$
- Due to the high dimension of $X$, the OLS estimator $\hat \tau$ of the TE will be very unstable, i.e. have high variance (though it is unbiased)
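
The instability can be seen in a small Monte Carlo exercise. This is a sketch under assumed values: the DGP, the true $\tau = 1$, the sample size $N = 100$, and the function name `ols_tau_draws` are illustrative choices. It re-estimates $\hat \tau$ by OLS on many simulated samples, once with few controls and once with $P$ close to $N$.

```python
import numpy as np

def ols_tau_draws(n, p, tau=1.0, n_reps=300, seed=0):
    """Return OLS estimates of tau across n_reps simulated samples."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.normal(size=(n, p))                  # controls (mean zero, so no intercept)
        d = 0.5 * X[:, 0] + rng.normal(size=n)       # treatment depends on one confounder
        y = tau * d + X[:, 0] + rng.normal(size=n)   # outcome
        Z = np.column_stack([d, X])                  # regress Y on d and all controls
        draws[r] = np.linalg.lstsq(Z, y, rcond=None)[0][0]
    return draws

low_dim = ols_tau_draws(n=100, p=5)
high_dim = ols_tau_draws(n=100, p=90)
print(f"P=5:  mean {low_dim.mean():.2f}, std {low_dim.std():.2f}")
print(f"P=90: mean {high_dim.mean():.2f}, std {high_dim.std():.2f}")
```

Both sets of estimates should be centred near the assumed $\tau$, but the spread with $P = 90$ should be several times larger, because almost no independent variation in $d$ is left after partialling out the controls.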

### How LASSO Can Help:
- If many coefficients corresponding to variables in $X$ are 0 or close to 0, we say that the regression coefficients are approximately sparse
- In this case, we can use LASSO to shrink some coefficients on controls to 0 to reduce dimensionality
- In turn, this performs well in prediction tasks, as LASSO conducts feature selection and reduces the variance of the estimators
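
To make the shrinkage concrete, here is a minimal LASSO sketch using cyclic coordinate descent with soft-thresholding. It is one common way to solve the LASSO problem, not necessarily the one used in the lecture. The helper name `lasso_cd`, the penalty value, and the simulated sparse DGP are assumptions for illustration; the data are mean zero, so no intercept is fitted.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-8):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j |b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    scale = (X ** 2).sum(axis=0) / n          # (1/n) x_j'x_j, assumed non-zero
    resid = np.array(y, dtype=float)          # equals y - X @ beta throughout
    for _ in range(n_sweeps):
        max_step = 0.0
        for j in range(p):
            # Correlation of column j with the residual that excludes its own fit.
            rho = X[:, j] @ resid / n + scale[j] * beta[j]
            # Soft-thresholding: small correlations are set exactly to zero.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / scale[j]
            resid += X[:, j] * (beta[j] - new)
            max_step = max(max_step, abs(new - beta[j]))
            beta[j] = new
        if max_step < tol:                    # stop once a full sweep changes nothing
            break
    return beta

rng = np.random.default_rng(0)
n, p = 100, 90                                # P close to N
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                           # sparse: only 5 controls matter
y = X @ beta_true + rng.normal(size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_lasso = lasso_cd(X, y, lam=0.3)

err_ols = np.sum((beta_ols - beta_true) ** 2)
err_lasso = np.sum((beta_lasso - beta_true) ** 2)
n_selected = np.count_nonzero(beta_lasso)
print(f"non-zero LASSO coefficients: {n_selected} of {p}")
print(f"squared coefficient error, OLS: {err_ols:.2f}, LASSO: {err_lasso:.2f}")
```

The surviving coefficients are shrunk toward zero, which is exactly the bias that buys the lower variance.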

- **Can we apply LASSO for causal inference?**:
    - LASSO shrinks the coefficient estimates toward 0, so a penalized estimate of the TE $\hat \tau$ is biased and invalid for inference
- **Solution: Naive Post LASSO**:
    - Estimate the equation $Y = \tau d + X \beta + \epsilon$ first with LASSO penalty terms
    - Then re-estimate the equation by OLS, keeping only the variables whose estimated coefficients are not equal to 0 under LASSO
    - To prevent $d$ from being dropped (possible if it is correlated with the controls), we do not include $\tau$ in the LASSO penalty
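
The two steps can be sketched as follows. This is a minimal illustration, assuming a hand-written coordinate-descent LASSO with per-coefficient penalty weights so that $\tau$ can be left unpenalized; the names `lasso_cd` and `naive_post_lasso`, the penalty value, and the simulated data are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, weights, n_sweeps=500, tol=1e-8):
    """Coordinate-descent LASSO; coefficient j is penalised by lam * weights[j]."""
    n, p = X.shape
    beta = np.zeros(p)
    scale = (X ** 2).sum(axis=0) / n
    resid = np.array(y, dtype=float)          # equals y - X @ beta throughout
    for _ in range(n_sweeps):
        max_step = 0.0
        for j in range(p):
            rho = X[:, j] @ resid / n + scale[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0) / scale[j]
            resid += X[:, j] * (beta[j] - new)
            max_step = max(max_step, abs(new - beta[j]))
            beta[j] = new
        if max_step < tol:
            break
    return beta

def naive_post_lasso(y, d, X, lam):
    Z = np.column_stack([d, X])
    weights = np.ones(Z.shape[1])
    weights[0] = 0.0                           # tau is excluded from the penalty
    coef = lasso_cd(Z, y, lam, weights)        # step 1: LASSO on the full equation
    selected = np.flatnonzero(coef[1:])        # controls with non-zero coefficients
    Z_post = np.column_stack([d, X[:, selected]])
    tau_hat = np.linalg.lstsq(Z_post, y, rcond=None)[0][0]   # step 2: OLS refit
    return tau_hat, selected

# Simulated example (assumed DGP): two controls with large effects on Y, true tau = 1.
rng = np.random.default_rng(0)
n, p = 200, 150
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 1.0 * d + 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

tau_hat, selected = naive_post_lasso(y, d, X, lam=0.25)
print(f"selected controls: {selected}, tau_hat: {tau_hat:.2f}")
```

In this assumed design both relevant controls have large effects on $Y$, so the selection step keeps them and the refit behaves well.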

- **Problem: Regularisation bias**:
    - Controls that are strongly correlated with $d$ but have only a small effect on $Y$ will be dropped
    - This introduces omitted variable bias, contaminating the TE estimator $\hat \tau$ in the second-stage OLS
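
The size of this bias follows from the standard omitted variable bias formula. As a simplified sketch, assume a single control $X_j$ with coefficient $\beta_j$ is dropped and the second stage regresses $Y$ on $d$ alone (the index $j$ and this one-control setting are introduced here for illustration). Then, in large samples,

$$
\hat \tau \to \tau + \beta_j \frac{\mathrm{Cov}(d, X_j)}{\mathrm{Var}(d)}
$$

A small $\beta_j$ is what gets the control dropped, but when $X_j$ is strongly correlated with $d$ the second factor is large, so the product need not be negligible.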
- **Solution: Double LASSO**:
    - First estimate the equation $Y = X \beta_Y + \epsilon_Y$ with LASSO penalty terms
    - Derive the residual $\tilde Y = Y - X \hat \beta_Y$
    - Next estimate the equation $d = X \beta_d + \epsilon_d$ with LASSO penalty terms
    - Derive the residual $\tilde d = d - X \hat \beta_d$
    - Estimate the equation $\tilde Y = \alpha + \tau \tilde d + \epsilon$ with OLS
    - By the FWL theorem and Neyman orthogonality, $\hat \tau$ from the final OLS regression is a TE estimate that is robust to regularisation bias
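
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `lasso_cd` is a hand-written coordinate-descent LASSO, the penalty is fixed by hand rather than tuned, and the simulated DGP (true $\tau = 1$, one control that drives $d$ strongly but affects $Y$ only moderately) is made up. For comparison, the sketch also reports the regression of $Y$ on $d$ alone, which is what remains if the selection step drops that control.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-8):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j |b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    scale = (X ** 2).sum(axis=0) / n
    resid = np.array(y, dtype=float)          # equals y - X @ beta throughout
    for _ in range(n_sweeps):
        max_step = 0.0
        for j in range(p):
            rho = X[:, j] @ resid / n + scale[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / scale[j]
            resid += X[:, j] * (beta[j] - new)
            max_step = max(max_step, abs(new - beta[j]))
            beta[j] = new
        if max_step < tol:
            break
    return beta

def double_lasso(y, d, X, lam):
    y_res = y - X @ lasso_cd(X, y, lam)       # residual of Y after LASSO on X
    d_res = d - X @ lasso_cd(X, d, lam)       # residual of d after LASSO on X
    Z = np.column_stack([np.ones(len(y)), d_res])
    alpha_hat, tau_hat = np.linalg.lstsq(Z, y_res, rcond=None)[0]
    return tau_hat

# Assumed DGP: X_0 determines most of d but has a moderate direct effect on Y.
rng = np.random.default_rng(0)
n, p = 2000, 100
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.3 * rng.normal(size=n)
y = 1.0 * d + 0.5 * X[:, 0] + rng.normal(size=n)

tau_short = (d @ y) / (d @ d)                 # Y on d only: X_0 omitted
tau_double = double_lasso(y, d, X, lam=0.1)
print(f"Y on d alone: {tau_short:.2f}, double LASSO: {tau_double:.2f}")
```

Because $X_0$ is a strong predictor of $d$ (and, through $d$, of $Y$), both LASSO steps keep it even though its direct effect on $Y$ is moderate, which is the point of running two separate selection regressions.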

- **Problem: Overfitting bias**:
    - Suppose we pin down the LASSO penalty size using cross-validation on dataset $D$, then train the LASSO models and compute the residualized variables $\tilde Y$ and $\tilde d$ on the same dataset
    - Then any overfitting at the LASSO training stage turns into bias at the parameter estimation stage of $\hat \tau$
- **Solution: Sample Splitting**:
    - Split $D$ into two sets $D_1$ and $D_2$
    - Use $D_1$ for cross-validation and for training the LASSO models
    - Apply the trained LASSO models to $D_2$ to get the residualized variables $\tilde Y$ and $\tilde d$
    - Estimate $\hat \tau$ from the final OLS regression of $\tilde Y$ on $\tilde d$
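
A minimal sketch of the split-sample procedure follows. The helper names (`lasso_cd`, `choose_lambda`, `split_sample_double_lasso`), the three-fold cross-validation, the penalty grid, and the simulated DGP with true $\tau = 1$ are all assumptions made for illustration.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-8):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j |b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    scale = (X ** 2).sum(axis=0) / n
    resid = np.array(y, dtype=float)          # equals y - X @ beta throughout
    for _ in range(n_sweeps):
        max_step = 0.0
        for j in range(p):
            rho = X[:, j] @ resid / n + scale[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / scale[j]
            resid += X[:, j] * (beta[j] - new)
            max_step = max(max_step, abs(new - beta[j]))
            beta[j] = new
        if max_step < tol:
            break
    return beta

def choose_lambda(X, y, grid, n_folds=3, seed=0):
    """Pick the penalty in `grid` with the lowest cross-validated MSE."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds          # balanced random fold labels
    cv_mse = []
    for lam in grid:
        errs = []
        for k in range(n_folds):
            train, test = folds != k, folds == k
            b = lasso_cd(X[train], y[train], lam)
            errs.append(np.mean((y[test] - X[test] @ b) ** 2))
        cv_mse.append(np.mean(errs))
    return grid[int(np.argmin(cv_mse))]

def split_sample_double_lasso(y, d, X, grid=(0.05, 0.1, 0.2), seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    i1, i2 = idx[: len(y) // 2], idx[len(y) // 2 :]    # D_1 and D_2
    # Cross-validate and train both LASSO models on D_1 only.
    b_y = lasso_cd(X[i1], y[i1], choose_lambda(X[i1], y[i1], grid))
    b_d = lasso_cd(X[i1], d[i1], choose_lambda(X[i1], d[i1], grid))
    # Residualize on D_2 with the models trained on D_1.
    y_res = y[i2] - X[i2] @ b_y
    d_res = d[i2] - X[i2] @ b_d
    Z = np.column_stack([np.ones(len(i2)), d_res])
    return np.linalg.lstsq(Z, y_res, rcond=None)[0][1]  # coefficient on d_res

rng = np.random.default_rng(1)
n, p = 1000, 30
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 1.0 * d + X[:, 0] + rng.normal(size=n)

tau_split = split_sample_double_lasso(y, d, X)
print(f"split-sample double LASSO tau_hat: {tau_split:.2f}")
```

Only $D_2$ enters the final regression here, as in the steps above, so half of the sample is not used for estimating $\hat \tau$.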
