# Lecture 1: An Overview

## Basic Concepts:

### Two Purposes of Statistical Inference:
- **Prediction:**
    - Purpose: 
        - Given X (features/independent variables), what is our best estimate of the outcome Y (target/dependent variable)?
    - Characteristic:
        - Minimizes some form of expected prediction error (using the in-sample prediction error on Y as an approximation)
        - Cares about and evaluates out-of-sample performance (is there overfitting? is the prediction stable?)
- **Causal Inference:**
    - Purpose: 
        - What happens to Y if we intervene to change some X while keeping the other X unchanged (the treatment effect)?
        - In other words, predicting (estimating) the treatment effect (TE)
    - Characteristic:
        - Impossible to minimize expected prediction error directly, as the true TE is never observed
        - As a result, we need variation that isolates the causal mechanism in order to identify the TE
            - For experimental data, such variation is guaranteed through random assignment of treatment
            - For observational data, however, the variation must come from a quasi-experiment or from identification assumptions
        - Cares about out-of-sample performance (is the TE estimator stable?) but cannot evaluate it directly

### Two Cultures of Statistical Inference:
- **Parametric Methods:**
    - Assume that the data are generated by a given stochastic data model / data generating process (DGP)
    - DGP: an assumed relationship between Y and X plus a stochastic error term
- **Non-parametric Methods:**
    - Use algorithmic models and treat the data generating process as unknown
    - These methods are usually referred to as machine learning

### Machine Learning Revolution:
- Traditional statistics adopted parametric methods for prediction.
- Machine learning (non-parametric methods) revolutionized this field through three advantages:
    - ML concentrated on developing computationally efficient algorithms
    - By introducing regularization, many ML methods leverage the bias-variance trade-off to tackle prediction problems in high dimensions
    - ML algorithms are model agnostic; they do not pre-assume a functional form for the relation between features and target
- Big data challenge and the bias-variance trade-off
    - For big data, the number of variables is usually large relative to the number of observations (the high-dimension problem)
    - Most parametric methods yield high-variance estimators in this scenario (which reduces out-of-sample prediction accuracy)
    - By introducing regularization, ML methods shrink or drop some variables to reduce the variance of the estimators
    - This improves the out-of-sample stability of the estimator (and of the prediction) while increasing bias (the estimator is no longer correct on average)
    - That is why it is called the bias-variance trade-off
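
The trade-off can be seen in a small simulation. The sketch below is illustrative only: the design (`n = 60`, `p = 50`, one non-zero coefficient) and the penalty `lam = 20` are assumptions, and ridge regression is swapped in as the regularised estimator because it has a closed form (it shrinks coefficients rather than dropping them).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps, lam = 60, 50, 200, 20.0   # assumed design: p close to n
beta = np.zeros(p)
beta[0] = 1.0                          # assumed sparse truth: one non-zero coefficient

def fit(X, y, lam):
    """Ridge estimate (X'X + lam * I)^{-1} X'y; lam = 0 gives OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols_draws, ridge_draws = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    ols_draws.append(fit(X, y, 0.0)[0])      # estimate of the first coefficient
    ridge_draws.append(fit(X, y, lam)[0])

ols_draws, ridge_draws = np.array(ols_draws), np.array(ridge_draws)
print("OLS   mean / variance:", ols_draws.mean(), ols_draws.var())
print("Ridge mean / variance:", ridge_draws.mean(), ridge_draws.var())
```

Across repeated samples the OLS estimate should centre on the true value 1 but vary a lot, while the ridge estimate is pulled towards 0 (bias) and varies much less.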

### Machine Learning and Econometrics:
- The adoption of ML methods in econometrics has been slower, as the goal of econometrics is causal inference, while ML aims to optimize prediction
- However, there is a way to combine machine learning (non-parametric methods) with causal inference: causal machine learning
- **Benefit of causal ML:**
    - ML methods do not create quasi-experimental variation
    - That is, if the identification assumptions for causal inference (in RCT, conditional independence, DiD, instrumental variables, etc.) do not hold, causal ML will not help
    - Instead, causal ML methods provide tools for robustness checking and for improving the efficiency of TE estimators (lower variance)
    - For example, a TE identified using parametric methods, with controls for selection and confounding effects, may not be distinguishable from 0 at the 95% confidence level
    - Causal ML might overturn this conclusion by narrowing the CI (but there is no guarantee)
- **Challenge of causal ML:**
    - Causal inference is based on identification assumptions
    - These assumptions need to be carried over when ML is used for causal inference
    - This has created challenges for the modification of predictive algorithms for causal inference

## Causal ML: Example on LASSO

### Set Up:
- Suppose we want to estimate the effect of a treatment $d$ on $Y$ with confounders $X$
- $X$ is a vector of $P$ variables, with $P$ very close to the number of observations $N$
- The conditional independence assumption holds: after controlling for $X$, the assignment of $d$ can be treated as random

### OLS:
- OLS equation: $Y = \tau d + X \beta + \epsilon$
- Due to the high dimension of $X$, the OLS estimator of the TE, $\hat \tau$, will be very unstable (though unbiased)
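
A short worked expression shows where the instability comes from. It assumes homoskedastic errors with variance $\sigma^2$ (an added assumption, not stated above), and the symbols $M_X$, $\tilde v$, $\gamma$, $v$ and $\sigma_v^2$ are introduced here for illustration only. Writing $M_X = I - X(X'X)^{-1}X'$ and $\tilde v = M_X d$ for the part of $d$ not explained by $X$:

$$
\hat \tau = \frac{\tilde v' Y}{\tilde v' \tilde v}, \qquad
Var(\hat \tau \mid d, X) = \frac{\sigma^2}{\tilde v' \tilde v} = \frac{\sigma^2}{d' M_X d}
$$

If, in addition, $d = X\gamma + v$ with $v$ independent of $X$ and of variance $\sigma_v^2$, then $E[d' M_X d \mid X] = (N - P)\,\sigma_v^2$, so the variance is roughly $\sigma^2 / ((N - P)\,\sigma_v^2)$ and grows without bound as $P$ approaches $N$.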

### How LASSO Can Help:
- If many coefficients corresponding to variables in $X$ are 0 or close to 0, we say that we have approximate sparsity in the regression coefficients
- In this case, we can use LASSO to shrink some coefficients on controls to 0 to reduce dimensionality
- In turn, this performs well in prediction tasks, as LASSO conducts feature selection and reduces the variance of the estimators

- **Can we apply LASSO for causal inference?** :
    - LASSO shrinks the coefficient estimates towards zero; this makes the TE estimator $\hat \tau$ invalid
- **Solution: Naive Post LASSO**:
    - Estimate the equation $Y = \tau d + X \beta + \epsilon$ first with LASSO penalty terms
    - Then re-estimate the equation by OLS, keeping only the variables whose estimated coefficients are non-zero under LASSO
    - This approach reduces the shrinkage bias while maintaining the benefits of variable selection
    - To prevent $d$ from being dropped (possible if it is correlated with the controls), we exclude $\tau$ from the LASSO penalty
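
The two steps of naive post-LASSO can be sketched as follows. This is a minimal illustration under assumptions: `weighted_lasso` is a hypothetical helper written here (a plain coordinate-descent solver, not a library function), the penalty `lam = 0.2` is fixed by hand rather than tuned, and the simulated design makes the confounders matter strongly for $Y$, so this is the favourable case in which they are retained.

```python
import numpy as np

def weighted_lasso(Z, y, lam, weights, n_iter=300):
    """Coordinate descent for (1/2n)||y - Zb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, k = Z.shape
    b = np.zeros(k)
    col_sq = (Z ** 2).sum(axis=0) / n
    resid = y.astype(float)                      # residual y - Zb (b starts at 0)
    for _ in range(n_iter):
        for j in range(k):
            resid += Z[:, j] * b[j]              # add back coordinate j
            rho = Z[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0) / col_sq[j]
            resid -= Z[:, j] * b[j]
    return b

rng = np.random.default_rng(1)
n, p, tau = 300, 50, 1.0                         # assumed simulation design
X = rng.normal(size=(n, p))
d = 0.5 * X[:, 0] + rng.normal(size=n)
Y = tau * d + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)

# Step 1: LASSO on (d, X) with no penalty on the treatment coefficient
Z = np.column_stack([d, X])                      # column 0 is the treatment
weights = np.r_[0.0, np.ones(p)]
b_lasso = weighted_lasso(Z, Y, lam=0.2, weights=weights)

# Step 2: OLS on the treatment plus the controls with non-zero LASSO coefficients
keep = np.flatnonzero(b_lasso[1:]) + 1           # column indices in Z
design = np.column_stack([np.ones(n), Z[:, np.r_[0, keep]]])
tau_post = np.linalg.lstsq(design, Y, rcond=None)[0][1]
print("selected controls:", keep - 1, "| post-LASSO tau:", round(tau_post, 3))
```

Setting the weight on the first column to zero is what keeps $d$ in the model regardless of its correlation with the controls.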

- **Problem: Regularisation bias** : 
    - Controls that are strongly correlated with $d$ but have little effect on $Y$ will be dropped
    - This introduces omitted variable bias, contaminating the TE estimator $\hat \tau$ in the second-stage OLS
- **Solution: Double LASSO**:
    - Estimate the equation $Y = X \beta_Y + \epsilon_Y$ first with LASSO penalty terms
    - Derive the residual $\tilde Y = Y - X \hat \beta_Y$
    - Next estimate the equation $d = X \beta_d + \epsilon_d$ with LASSO penalty terms
    - Derive the residual $\tilde d = d - X \hat \beta_d$
    - Estimate the equation $\tilde Y = \alpha + \tau \tilde d + \epsilon$ with OLS
    - By the FWL theorem and Neyman orthogonality, $\hat \tau$ from the final OLS regression is a TE estimate robust to regularisation bias
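
The double-LASSO steps can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: `lasso` is a hypothetical coordinate-descent helper written here, the penalty `lam = 0.15` is fixed by hand, and the simulated design deliberately includes a control (`X[:, 0]`) that is strongly related to $d$ but only weakly related to $Y$.

```python
import numpy as np

def lasso(X, y, lam, n_iter=300):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)                      # residual y - Xb (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]              # add back coordinate j
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(2)
n, p, tau = 400, 100, 1.0                        # assumed simulation design
X = rng.normal(size=(n, p))
d = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
Y = tau * d + 0.2 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)

beta_Y = lasso(X, Y, lam=0.15)                   # LASSO of Y on X
beta_d = lasso(X, d, lam=0.15)                   # LASSO of d on X
Y_tilde = Y - X @ beta_Y                         # residualised outcome
d_tilde = d - X @ beta_d                         # residualised treatment

# Final OLS of Y_tilde on a constant and d_tilde
design = np.column_stack([np.ones(n), d_tilde])
alpha_hat, tau_hat = np.linalg.lstsq(design, Y_tilde, rcond=None)[0]
print("double-LASSO tau:", round(tau_hat, 3))
```

Because both residuals carry only small shrinkage errors, their product enters the final regression at second order, which is why the estimate should stay close to the true value of 1 in this design.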

- **Problem: Overfitting bias** :
    - Suppose we choose the LASSO penalty size using cross-validation on dataset $D$
    - and then train the LASSO models and compute the residualized variables $\tilde Y$ and $\tilde d$ on the same dataset
    - Then any overfitting at the LASSO training stage carries over as bias into the estimation of $\hat \tau$
- **Solution: Sample Splitting**:
    - Split $D$ into two sets $D_1$ and $D_2$
    - Use $D_1$ for cross-validation and training the LASSO models
    - Apply the trained LASSO models to $D_2$ to get the residualized variables $\tilde Y$ and $\tilde d$
    - Estimate $\hat \tau$ from the final OLS regression of $\tilde Y$ on $\tilde d$
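
A minimal sketch of the sample-splitting recipe follows. The helpers `lasso` and `pick_lambda` are hypothetical and written here for illustration; `pick_lambda` uses a single hold-out split inside $D_1$ as a simple stand-in for full K-fold cross-validation, and the penalty grid and simulated design are assumptions.

```python
import numpy as np

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]
            rho = X[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * b[j]
    return b

def pick_lambda(X, y, grid, rng):
    """Choose the penalty with the lowest error on a held-out half of the data."""
    idx = rng.permutation(len(y))
    tr, va = idx[: len(y) // 2], idx[len(y) // 2 :]
    errs = [np.mean((y[va] - X[va] @ lasso(X[tr], y[tr], lam)) ** 2) for lam in grid]
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(3)
n, p, tau = 600, 60, 1.0                         # assumed simulation design
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
Y = tau * d + 0.2 * X[:, 0] + X[:, 2] + rng.normal(size=n)

perm = rng.permutation(n)
D1, D2 = perm[: n // 2], perm[n // 2 :]          # D1: tuning and training, D2: estimation
grid = [0.05, 0.1, 0.2]

# Penalty choice and LASSO training use D1 only
beta_Y = lasso(X[D1], Y[D1], pick_lambda(X[D1], Y[D1], grid, rng))
beta_d = lasso(X[D1], d[D1], pick_lambda(X[D1], d[D1], grid, rng))

# Residuals and the final OLS use D2 only
Y_tilde = Y[D2] - X[D2] @ beta_Y
d_tilde = d[D2] - X[D2] @ beta_d
design = np.column_stack([np.ones(len(D2)), d_tilde])
tau_hat = np.linalg.lstsq(design, Y_tilde, rcond=None)[0][1]
print("split-sample tau:", round(tau_hat, 3))
```

The cost of the split is that only half of the observations enter the final regression, so the estimate is noisier than a full-sample one.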

## Generalised Double Debiased ML 

### Basic Framework
- Set Up:
    - the dependent variable is $Y$
    - set of confounders: $X$
    - treatment indicator: $d$
    - the structural form (real relationship) between $Y$, $X$, and $d$ is: 
        - $Y_i = \tau d_i + g(X_i) + \epsilon _i$ (partial linear model: Y linear in d but non-linear in X)
        - $d_i = m(X_i) + v _i$ (treatment equation: d depends on X through a possibly non-linear function m)
- Sample Splitting: 
    - split the dataset into two pieces $D_1$ and $D_2$
    - one piece ($D_1$) for model selection (train non-parametric ML model for predicting $Y$ and $d$ using $X$)
    - another piece ($D_2$) for causal inference parameter estimation
- Prediction (Selection) Stage:
    - find how confounders relate to outcomes and treatments (nuisance function estimation):
        - $g_0(X_i) = E[Y_i|X_i]$
        - $m_0(X_i) = E[d_i|X_i]$
        - where $g_0(.)$ and $m_0(.)$ are estimated using nonparametric ML methods
        - their estimates are $\hat g_0(.)$ and $\hat m_0(.)$
        - their functional form and the confounders included are "selected" through ML methods
        - note: nuisance functions affect identification of the treatment effect, but their parameters are not of intrinsic interest
- Causal Inference Stage:
    - assuming that the condition $E[(Y_i - \tau d_i - g(X_i))(d_i - m(X_i))] = 0$ holds at the true parameters $\tau$, $g$, and $m$
    - then by the FWL theorem, $\tau = \frac{cov(Y_i^*, d_i^*)}{var(d_i^*)}$
    - where $Y_i^* = Y_i - g_0(X_i)$ and $d_i^* = d_i - m_0(X_i)$
    - this motivates a partialling-out (residualization) estimator, where:
        - $\tilde Y_i = Y_i - \hat g_0(X_i)$
        - $\tilde d_i = d_i - \hat m_0(X_i)$
        - residuals are constructed on $D_2$ using nuisance models estimated on $D_1$
        - TE $\hat \tau$ is estimated from OLS regression $\tilde Y_i = \tau \tilde d_i + w_i$
    - the score function of the residualization estimator is Neyman-orthogonal, implying that $\hat \tau$ is locally insensitive, to first order, to errors in the nuisance functions ($g_0(.)$ and $m_0(.)$)
- Two Sources of Bias Addressed:
    - Regularisation Bias: 
        - the bias from using ML methods for selection (e.g. dropping some variables) becomes negligible, because $\hat \tau$ is first-order insensitive to errors in the nuisance functions
    - Overfitting Bias:
        - we avoid using the same data for both nuisance estimation and TE parameter estimation
        - thus we avoid this bias
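
The framework above does not depend on LASSO. The sketch below runs the same split, residualise, regress sequence with a k-nearest-neighbour average as the nuisance learner. Everything specific is an assumption made for illustration: `knn_predict` is a hypothetical helper, $X$ is one-dimensional, and the functions $m(X) = \sin X$ and $g(X) = \cos X + X$ are invented for the simulation.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=25):
    """Predict by averaging the outcomes of the k nearest training points (1-D feature)."""
    dist = np.abs(x_new[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(4)
n, tau = 2000, 0.5
X = rng.uniform(-2, 2, size=n)
d = np.sin(X) + rng.normal(scale=0.5, size=n)                 # d = m(X) + v
Y = tau * d + np.cos(X) + X + rng.normal(scale=0.5, size=n)   # Y = tau * d + g(X) + eps

perm = rng.permutation(n)
D1, D2 = perm[: n // 2], perm[n // 2 :]

# Prediction stage: nuisance functions learned on D1, evaluated on D2
g_hat = knn_predict(X[D1], Y[D1], X[D2])     # estimate of E[Y | X]
m_hat = knn_predict(X[D1], d[D1], X[D2])     # estimate of E[d | X]

# Causal inference stage: residual-on-residual regression on D2
Y_tilde = Y[D2] - g_hat
d_tilde = d[D2] - m_hat
tau_hat = (d_tilde @ Y_tilde) / (d_tilde @ d_tilde)

naive = np.polyfit(d, Y, 1)[0]               # slope of Y on d, ignoring X
print("partialling-out tau:", round(tau_hat, 3), "| naive slope:", round(naive, 3))
```

In this design the naive slope is far from the true value 0.5 because $X$ drives both $d$ and $Y$, while the residualised estimate should land close to it. A common refinement, not shown here, is to swap the roles of $D_1$ and $D_2$ and average the two estimates.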

### Extension to Heterogeneous Treatment Effects
- Linear form (parametric) model: $y = \alpha + \tau d + X'\beta + \epsilon$
- Partial linear form (nuisance function estimated using non-parametric method): $y = \alpha + \tau d + g(X) + \epsilon$
- Fully nonparametric form: $y = \alpha(X) + \tau(X)d + \epsilon$

## Points of Departure of Causal ML

### Conditional Expectation Function
- The CEF captures all the predictive information about Y that is contained in X
- The principle of all statistical learning techniques is approximating the CEF
- OLS finds the best linear approximation of the CEF (best in the sense of being optimal among linear unbiased estimators for in-sample prediction)
- It is **best** only if:
    - we believe the CEF is linear in X
    - we require unbiasedness
- If the CEF is not linear, we should move from OLS to ML methods
- If we do not insist on unbiasedness, we can also switch to ML methods that trade bias for lower variance (achieving better out-of-sample prediction)
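
A small example makes the "best linear approximation" idea concrete. It assumes a data generating process invented for illustration, with $X$ uniform on $[0, 1]$ and CEF $E[Y|X] = X^2$. For this case the best linear approximation of the CEF has slope 1 and intercept $-1/6$, and a simple binned average (used here as a stand-in for a flexible learner) tracks the CEF more closely.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 20_000
X = rng.uniform(0, 1, size=n)
Y = X ** 2 + rng.normal(scale=0.1, size=n)    # assumed DGP: E[Y | X] = X^2

# OLS line: linear approximation of the non-linear CEF
slope, intercept = np.polyfit(X, Y, 1)
mse_linear = np.mean((Y - (intercept + slope * X)) ** 2)

# Flexible alternative: average Y within 20 equal-width bins of X
bins = np.minimum((X * 20).astype(int), 19)
bin_means = np.array([Y[bins == b].mean() for b in range(20)])
mse_binned = np.mean((Y - bin_means[bins]) ** 2)

print("OLS slope, intercept:", round(slope, 3), round(intercept, 3))
print("MSE linear vs binned:", round(mse_linear, 4), round(mse_binned, 4))
```

Both errors are computed in-sample. With 20,000 observations and only 20 bins, overfitting by the binned estimator is negligible, so the gap mainly reflects the misspecification of the linear form.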

### High Dimensional Methods in Statistics
- When p (number of predictors) is large relative to n (number of observations), we face the high-dimensional challenge
- This challenge can be rephrased as the curse of dimensionality: as p increases, the sample size required for learning increases exponentially
- The OLS estimators will be unstable (have high variance) in this setting, performing poorly in out-of-sample prediction
- ML methods can trade bias for variance, achieving better performance through:
    - Regularisation (Lasso, Ridge, Elastic Net)
    - Model selection and screening
    - Dimension reduction techniques
- More specifically, under a sparsity assumption (many coefficients are zero), regularisation discovers which coefficients matter (thus reducing dimension)
- The better prediction performance by ML can in turn be converted to more efficient causal inference

### Frisch-Waugh-Lovell (FWL) Theorem
- When we apply regularisation methods like LASSO to select variables from a high-dimensional X,
- the result is regularisation bias (a difference between the estimator and the true parameter), because:
    - Some variables are dropped, creating bias similar to omitted variable bias
    - The penalty term shrinks all coefficients based on predictive considerations
- Solution: Applying FWL Logic
    - we can use high dimensional methods (such as LASSO) to estimate X-Y relationship and residualise Y
    - we can use high dimensional methods to estimate X-d relationship and residualise d
    - finally estimate the τ from residualised regression
- The estimator is valid under the identification assumption that $E[(Y - \tau d - g(X))(d - m(X))] = 0$ at the true parameters $\tau$, $g$, and $m$
- In addition, the estimator of τ from the last step is locally insensitive to first-order errors in the estimated X-Y and X-d relationships
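
The exact FWL result, on which this logic builds, can be checked numerically. The sketch below uses plain OLS for the partialling-out step, where the equality holds exactly; with LASSO or another ML learner in that step it holds only approximately. The simulated data and the helper `residualise` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 500, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # controls incl. intercept
d = X @ rng.normal(size=p + 1) + rng.normal(size=n)
Y = 2.0 * d + X @ rng.normal(size=p + 1) + rng.normal(size=n)

# (1) Long regression of Y on d and X
tau_long = np.linalg.lstsq(np.column_stack([d, X]), Y, rcond=None)[0][0]

# (2) Partial X out of both Y and d, then regress residual on residual
def residualise(v, X):
    """Residual of v after an OLS projection on X."""
    return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

Y_res, d_res = residualise(Y, X), residualise(d, X)
tau_fwl = (d_res @ Y_res) / (d_res @ d_res)

print("long regression:", tau_long, "| residual regression:", tau_fwl)
```

The two numbers agree up to floating-point error, which is the content of the FWL theorem.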

### Treatment Effects
- We will review the treatment effects literature including:
    - Identification assumptions of TE
    - Heterogeneous treatment effects
    - Parametric and nonparametric estimators of TE
- We will also cover the connection to ML methods through:
    - Doubly robust estimators
    - Causal forests for heterogeneity
    - Optimal policy learning

### Machine Learning Methods for Prediction
- Note: ML helps estimate nuisance functions but cannot solve identification problems
- Core ML principles:
    - Flexibility in functional forms
    - Data-driven model selection
    - Focus on out-of-sample performance
- Key algorithms:
    - Regularised regression (Lasso, Ridge)
    - Tree-based methods, Random Forests, and Generalised Random Forests
    - Neural networks and deep learning
    - Ensemble methods and stacking

## Two Sets of Causal ML Methods

### High-Dimensional Controls/Instruments Approach
- Models with a large number of potential confounding variables or instruments
- Employs ML to manage the curse of dimensionality (variable selection and dimension reduction)
- Tackles bias due to model selection
- e.g. Double Debiased Machine Learning

### Treatment Effect Heterogeneity Approach
- Explores and identifies heterogeneous treatment effects (conditional average TE, group average TE)
- Facilitates understanding of differential treatment effects across groups
- Enables personalized treatment decisions and targeted policy implementation
- Modified MSE for heterogeneous causal effects:
    - MSE measures expected squared difference between estimator and true parameter
    - Original MSE: $MSE(\hat \theta) = E[{(\hat \theta - \theta)}^2] = {[Bias(\hat \theta)]}^2 + Var(\hat \theta)$
    - Modified MSE: $MSE(\hat \tau) = E[{(\hat \tau(X) - \tau(X))}^2]$
    - Key differences from standard MSE:
        - The target is the CATE, which is never observed, rather than an outcome (e.g. Y) that can be predicted and checked
        - Rewards finding genuine treatment effect heterogeneity
- e.g. causal forest and generalized random forest
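
For reference, the bias-variance form of the original MSE quoted above follows from adding and subtracting $E[\hat \theta]$ inside the square:

$$
\begin{aligned}
E[(\hat \theta - \theta)^2]
&= E[(\hat \theta - E[\hat \theta] + E[\hat \theta] - \theta)^2] \\
&= E[(\hat \theta - E[\hat \theta])^2] + 2\,(E[\hat \theta] - \theta)\,E[\hat \theta - E[\hat \theta]] + (E[\hat \theta] - \theta)^2 \\
&= Var(\hat \theta) + [Bias(\hat \theta)]^2
\end{aligned}
$$

The cross term vanishes because $E[\hat \theta - E[\hat \theta]] = 0$.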

## Extras

### Predictive Effect
- Consider the regression model $Y = \alpha + \tau d + X \beta + \epsilon$
- The regression coefficient $\tau$ measures the change in the linear prediction of Y if d changes from 0 to 1, holding the controls X fixed
- We can call this the **predictive effect** (PE), as it measures the impact of a variable on the prediction we make
- PE is a measure of statistical dependence or association between d and Y, even after linearly partialling out the controls X
- Without identification assumptions, the PE should not be interpreted as a treatment effect (TE)
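
A simulated example shows how far the PE can be from the TE when identification fails. The design is an assumption made for illustration: an unobserved variable `U` affects both treatment take-up and the outcome, so conditional independence given `X` does not hold.

```python
import numpy as np

rng = np.random.default_rng(6)
n, tau = 5000, 1.0
X = rng.normal(size=n)                  # observed control
U = rng.normal(size=n)                  # unobserved confounder (assumed for illustration)
d = (X + U + rng.normal(size=n) > 0).astype(float)
Y = tau * d + X + 2.0 * U + rng.normal(size=n)

# Regression of Y on a constant, d and X: the coefficient on d is the PE
design = np.column_stack([np.ones(n), d, X])
alpha_hat, pe_hat, beta_hat = np.linalg.lstsq(design, Y, rcond=None)[0]
print("true TE:", tau, "| predictive effect:", round(pe_hat, 3))
```

The regression coefficient is a perfectly good description of how the prediction of Y changes with d, but it overstates the causal effect here because treated units also tend to have high `U`.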

### Local to Global Treatment Effects
- We can define the conditional average treatment effect (CATE) as $\tau (x) = E[Y(1) - Y(0)|X = x]$
- ATE emerges from CATE through averaging over the distribution of X: $\tau _{ATE} = E[\tau (X)] = E[E[Y(1) - Y(0)|X]]$
- If CATE is constant, $\tau (x) = \tau$ and $\tau _{ATE} = E[\tau] = \tau$
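
The averaging step can be illustrated with simulated potential outcomes. The CATE function $\tau(x) = 1 + 2x$, the uniform distribution of $X$, and the randomised treatment are all assumptions chosen so that the answer is known analytically: $E[\tau(X)] = 1 + 2 \cdot 0.5 = 2$.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
X = rng.uniform(0, 1, size=n)
tau_x = 1.0 + 2.0 * X                   # assumed CATE: tau(x) = 1 + 2x
Y0 = X + rng.normal(size=n)             # potential outcome without treatment
Y1 = Y0 + tau_x                         # potential outcome with treatment
d = rng.integers(0, 2, size=n)          # randomised treatment
Y = np.where(d == 1, Y1, Y0)            # only one potential outcome is observed

ate_from_cate = tau_x.mean()            # sample analogue of E[tau(X)]
diff_in_means = Y[d == 1].mean() - Y[d == 0].mean()
print("ATE from CATE:", round(ate_from_cate, 3), "| difference in means:", round(diff_in_means, 3))
```

In real data `tau_x` is never available; it is computable here only because both potential outcomes are simulated. Under randomisation the observable difference in means targets the same quantity.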

### FWL Logic in Heterogeneous TE
- For the partial linear model, we residualise globally to derive the estimator of the TE
    - $Y_i^* = Y_i - g_0(X_i)$ and $d_i^* = d_i - m_0(X_i)$
    - $\tau = \frac{cov(Y_i^*, d_i^*)}{var(d_i^*)}$
- For heterogeneous TE (fully non-parametric model), residualisation must work locally, using either discrete or 'continuous' partitions of X
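
For a discrete partition, local residualisation amounts to applying the same covariance-over-variance formula within each cell. The sketch below assumes a simulated design with three groups and invented group-specific effects; within a cell, residualising on $X$ reduces to subtracting the cell mean.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6000
group = rng.integers(0, 3, size=n)               # discrete partition of X
tau_by_group = np.array([0.5, 1.0, 2.0])         # assumed true group-level effects
d = 0.5 * group + rng.normal(size=n)
Y = tau_by_group[group] * d + 1.0 * group + rng.normal(size=n)

tau_hat = np.empty(3)
for g in range(3):
    cell = group == g
    Y_star = Y[cell] - Y[cell].mean()            # local residualisation of Y
    d_star = d[cell] - d[cell].mean()            # local residualisation of d
    tau_hat[g] = (d_star @ Y_star) / (d_star @ d_star)

print("estimated effects by group:", np.round(tau_hat, 3))
```

A single global residual regression on the same data would instead return one weighted average of the three effects.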

### Classification Model
- Econometrics commonly uses the parametric probit model for classification
    - Probit model: $Y^* = X'\beta + \epsilon$ , $Y = 1 (Y^* > 0)$ , $E[Y|X] = Pr(\epsilon < X'\beta) = \Phi(X'\beta)$ under normality
    - It is a normal CDF applied to a linear combination of the predictors
    - It makes global assumptions and estimates smooth probabilities
    - Forbidden regression: the 2SLS plug-in logic does not carry over to non-linear models such as probit; substituting first-stage fitted values into the probit does not yield consistent estimates
- Classification trees:
    - Partition the feature space into rectangular regions
    - Voting: estimate local probability as the fraction of training observations in that region belonging to each class
    - This can capture non-linear relationships and interactions
    - Can it circumvent the forbidden regression problem?
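
The local-fraction rule used by classification trees can be shown on a hand-picked partition. This is only a sketch under assumptions: the four rectangles are fixed in advance rather than learned by a tree algorithm, and the class probabilities in the simulated data are invented so that the outcome depends on an interaction between the two features.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 4000
X = rng.uniform(-1, 1, size=(n, 2))
# Assumed truth: class 1 is likely only when both features are positive
p_true = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 0.9, 0.2)
Y = rng.binomial(1, p_true)

# Four rectangular regions defined by the sign of each feature (cells 0 to 3)
cell = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)

# Local probability estimate: fraction of class-1 observations in each region
p_hat = np.array([Y[cell == c].mean() for c in range(4)])
print("estimated P(Y = 1) by region:", np.round(p_hat, 3))
```

A probit with only the two main effects would impose a smooth index in the features and could not reproduce this step pattern without an added interaction term.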

### Model Interpretability
- Black-box Approach:
    - ML algorithms have historically been “black boxes”
    - Difficult to understand their inner processes and explain to regulatory agencies and stakeholders
    - Interpretability: the extent to which effect from each predictor can be determined
- White-box approach:
    - Decision-making processes are completely transparent (e.g. OLS: interpretable, but poorer prediction)
    - Users are able to audit decisions made by the model (e.g. gender should not contribute much to prediction)
    - Answers to questions:
        - Why did the model make a particular decision?
        - What are the most influential variables for a particular decision?