# ECON5280 Lecture 8 Causal Forest

<font size="5">Junlong Feng</font>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/junlong-feng/econ5280/main?filepath=Lecture8_Forest.ipynb)

## Outline

* Motivation: Knowing heterogeneity is great and we hate assumptions on functional form.
* CATE and Heterogeneity: Why is heterogeneity interesting but hard to know?
* Traditional Methods and Their Drawbacks: A brief intro/review of nonparametrics.
* Random Tree, Causal Tree and Causal Forest: Econometrics learns from and develops ML.
* Applications and Implementation: Everything can be done in R by some simple commands.

## 1. CATE and Heterogeneity

We learned CATE in Lecture 5. Let $D$ be the treatment, $W$ be a vector of control variables, and $Y$ be the outcome. Recall that the CATE from $D=d'$ to $d$ is defined as:
$$
CATE(d,d';W)\equiv\mathbb{E}(Y(d)|W)-\mathbb{E}(Y(d')|W),
$$
where $Y(d)$ is the potential outcome at $d$. 

Knowing CATE is great because

- You can get ATE by $ATE(d,d')=\mathbb{E}[CATE(d,d';W)]$.
- You can get the so-called *heterogeneous effects*: People with different values of $W$ may have different treatment effects. Closer to the notion of *individual treatment effect*.
  - Let $D$ be attending graduate school or not. Let $W$ contain family income, gender, college major.
  - The expected income effect of attending graduate school may be different for a girl from a high-income family with econ major compared to the effect for a boy from a low-income family with non-econ major. ATE averages everything out, but CATE tells you the heterogeneity.
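
The first bullet is the law of iterated expectations applied to the definition of CATE. Written out as a short worked step (using only the notation above):
$$
ATE(d,d')=\mathbb{E}[Y(d)-Y(d')]=\mathbb{E}\big[\mathbb{E}(Y(d)|W)-\mathbb{E}(Y(d')|W)\big]=\mathbb{E}[CATE(d,d';W)].
$$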

Knowing CATE is possible because

- Under conditional random assignment: $D\perp Y(d)|W$ for all $d$, we have CATE identified as:
  $$
  CATE(d,d';W)=\mathbb{E}(Y|D=d,W)-\mathbb{E}(Y|D=d',W).
  $$

- The two conditional expectations are directly identified and estimable from data.

- We said in Lecture 5 that you can make it linear if you're willing to assume the expectation is linear in $W$.

- But this can be a crazy assumption.
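
The identification result in the first bullet takes two steps, spelled out here as a sketch. The first equality uses conditional random assignment; the second uses the fact that the observed $Y$ equals $Y(d)$ for individuals with $D=d$:
$$
\mathbb{E}(Y(d)|W)=\mathbb{E}(Y(d)|D=d,W)=\mathbb{E}(Y|D=d,W).
$$
Applying the same argument to $d'$ and taking the difference gives the expression for $CATE(d,d';W)$ above.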

Knowing CATE is hard because

- Estimating unconditional expectation is easy: Take average and done.
- Estimating conditional expectation when the conditioning variable is discrete is easy as well. 
  - For instance, let $W\in\{0,1\}$. Then $\mathbb{E}(Y|W=1)$ can be estimated by taking the average of $Y_{i}$s for the **subsample** where $W_{i}=1$: $\sum_{i:W_{i}=1}Y_{i}/\#(i:W_{i}=1)$.
- Estimating conditional expectation when the conditioning variable is continuous is hard. 
  - When the conditioning variable $W$ is continuous, $W=w$ has probability 0 for any $w$.
  - Impossible (with probability one) to find a **subgroup** whose $W_{i}=w$.
  - For instance, suppose $W$ is annual income. Suppose I ask you to estimate $\mathbb{E}(Y|W=10125)$, i.e., the expected $Y$ for someone whose annual income is exactly equal to 10125. It is quite likely that in your data set there is no individual whose income is exactly equal to this number.
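
To make the contrast concrete, here is a minimal base-R sketch on simulated data. The data-generating process, the variable names (`W_binary`, `income`) and the log-normal income distribution are all assumptions made for illustration; they are not part of the lecture's data.

```r
set.seed(1)
n <- 1000

# Discrete conditioning variable: W is 0 or 1.
W_binary <- rbinom(n, 1, 0.5)
Y <- 2 * W_binary + rnorm(n)
# Subsample average: sum of Y_i over {i : W_i = 1} divided by the subsample size.
mean_Y_given_W1 <- mean(Y[W_binary == 1])

# Continuous conditioning variable: a hypothetical annual income.
income <- rlnorm(n, meanlog = log(10000), sdlog = 0.5)
# Number of individuals whose income is exactly 10125.
n_exact_match <- sum(income == 10125)

mean_Y_given_W1
n_exact_match
```

With a binary $W$ the subsample is large and the average is informative; with a continuous $W$ the "subsample" with an exact match is empty, so there is nothing to average.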

## 2. Traditional Methods and Their Drawbacks

Now let's forget about treatment effects for a moment. The conditional expectation is a function of the conditioning variables. For instance, $\mathbb{E}(Y|D=1,W)$ changes when $W$ changes. So if we recover the entire function $\mathbb{E}(Y|D=1,W=w)$ for all possible values $w$ that $W$ can take, then we are done. For notational simplicity, let
$$
f_{d}(w)\equiv \mathbb{E}(Y|D=d,W=w).
$$
Our goal is to estimate the function $f_{d}(w)$ for different values of $d$.

Traditional methods estimate $f_{d}(w)$ using $Y$s around $W=w$. For instance, suppose I assume the function $f_{d}(w)$ is continuous in $w$. Then if I go arbitrarily close to $w$, the function value will also be arbitrarily close to $f_{d}(w)$.

- Suppose $W$ is a scalar and is continuous.
- Suppose we want to know $f_{d}(10)$.
- Ideally we would like to find all $Y_{i}$s whose $W_{i}=10$ and then take average.
- But this may not be possible. So instead, we find all $i$s whose $W_{i}\in (10-h,10+h)$, and average over their $Y_{i}$s.
- Of course there is bias because, say, $f_{d}(9.8)$, is not equal to $f_{d}(10)$ even though they are close. 
- Bias is larger if $h$ is larger. Variance is smaller if $h$ is larger because the effective sample size increases.
- If I send $h\to 0$ when $n\to\infty $, the bias should be gone asymptotically and I get a consistent estimate.
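
The local-averaging idea in the bullets above can be sketched in a few lines of base R. The true function `f`, the uniform design and the three bandwidths are hypothetical choices for illustration only.

```r
set.seed(1)
n <- 2000
W <- runif(n, 0, 20)
f <- function(w) sin(w / 3)   # hypothetical true conditional mean
Y <- f(W) + rnorm(n)

# Average the Y_i whose W_i lies in (w0 - h, w0 + h).
local_average <- function(w0, h, W, Y) {
  in_window <- abs(W - w0) < h
  c(estimate = mean(Y[in_window]), n_used = sum(in_window))
}

for (h in c(0.1, 1, 5)) {
  res <- local_average(10, h, W, Y)
  cat(sprintf("h = %.1f: estimate = %.3f, observations used = %.0f, truth = %.3f\n",
              h, res["estimate"], res["n_used"], f(10)))
}
```

A small $h$ uses few observations (noisy estimate); a large $h$ uses many observations but averages over values of $f$ that differ from $f(10)$.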

This is the central idea of the traditional **nonparametric methods** in statistics.

- Key logic: When $w$ changes a little, $f_{d}(w)$ changes a little.
- Key task: Find observations whose $W$'s values are close to the target $w$.
- Key challenge: **Curse of dimensionality**.

**Curse of dimensionality**. When $W$ is no longer a scalar, say it contains $p>1$ variables, it's hard to find *nearby* observations:

- Suppose you have $n$ data points, i.e., $n$ individuals.
- Suppose all variables in $W$ take values on $[0,1]$.
- When $p=1$, it's like throwing $n$ points randomly into a unit interval. Points are close to each other.
- When $p=2$, throw $n$ points into a unit square. When $p=3$, a cube. And so on...
- On average, your $n$ data points lie farther away from each other as $p$ gets larger, creating the so-called **sparsity**.
- For instance, suppose $W$ only contains income. Then for $W=10125$, you only need to find people whose income is around 10125. But if $W$ contains income and experience, then for $W=(10125,10)$ you need to find people whose income is around 10125 **and** whose experience is around 10 years. Sounds more difficult, right?
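
A quick simulation makes the sparsity visible. This is a sketch under assumed settings ($n=200$ points drawn uniformly on $[0,1]^{p}$, Euclidean distance); the function name `mean_nn_distance` is made up for this example.

```r
set.seed(1)
n <- 200

# Average distance from each point to its nearest neighbour in [0, 1]^p.
mean_nn_distance <- function(p) {
  W <- matrix(runif(n * p), n, p)
  d <- as.matrix(dist(W))   # n x n matrix of Euclidean distances
  diag(d) <- Inf            # ignore the distance from a point to itself
  mean(apply(d, 1, min))
}

dims <- c(1, 2, 5, 10)
nn_by_p <- sapply(dims, mean_nn_distance)
names(nn_by_p) <- paste0("p=", dims)
round(nn_by_p, 3)
```

With the same $n$, the typical distance to the closest observation grows quickly with $p$, so a window of fixed width around a target $w$ contains fewer and fewer points.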

## 3. Random Tree, Causal Tree, and Causal Forest

The key problem behind the curse of dimensionality is that we are obsessed with the idea of finding $i$ whose $W$ is close to $w$. However, this is sufficient but not necessary to approximate $f_{d}(w)$.

- It is possible that $f_{d}(w)=f_{d}(w')$ when $w$ and $w'$ are far away. 
- So only focusing on points near $w$ loses information.

It would be better if we find individuals whose $f_{d}(w)$ are similar, and average over their $Y_{i}$s. But how is this possible as $f_{d}(w)$ is unknown and we are going to estimate it?

### 3.1 Random Tree

Now suppose $p=2$, i.e., there are two variables in $W$: $W_{1}$ and $W_{2}$. Further, imagine $(W_{1},W_{2})$ take values on the unit square $[0,1]\times [0,1]$.

- Now forget about the target value $w$ and we'll never try to find values close to $w$ any more.
- Instead, we are going to split the unit square into small rectangles.
- The goal is, in each rectangle, the conditional expectation, or, $f_{d}(w)$, is almost a constant.
- After we finish, we go back and find out in which rectangle our target $w$ lies. Then simply average $Y_{i}$s for the $i$s whose $W$ take value in the rectangle.

Two problems:

1. How do we split the unit square efficiently?
   - Infinite ways to split it.
2. How do we know whether $f_{d}(w)$ changes a lot in a rectangle or not? 
   - If we estimate $f_{d}(w)$ by taking the average of the $Y$s in a rectangle, then by construction the estimate does not change at all across $w$ in the rectangle, so it cannot tell us how much the true function varies there.

**The random tree algorithm** (a sketch).

1. Split the sample into two halves by $W_{1}<t_{1}$ and $W_{1}>t_{1}$. Calculate the average of $Y_{i}$s in these two subsamples. Denote them by $\bar{Y}_{t1}^{(1)}$ and $\bar{Y}_{t1}^{(2)}$. Calculate $\Delta_{1}(t_{1})\equiv (\bar{Y}_{t1}^{(1)}-\bar{Y}_{t1}^{(2)})^{2}$. 
2. Try all possible $t_{1}\in (0,1)$. Find the one that yields the largest $\Delta_{1}(t_{1})$. Call it $t_{1}^{*}$. Mathematically, $t_{1}^{*}=\arg\max_{t_{1}}\Delta_{1}(t_{1})$.
3. Split the **original** sample into two halves by $W_{2}<t_{2}$ and $W_{2}>t_{2}$. Repeat Steps 1 and 2 for all possible $t_{2}\in (0,1)$ and find $t_{2}^{*}$.
4. Compare $\Delta_{1}(t_{1}^{*})$ and $\Delta_{2}(t_{2}^{*})$. We choose **the regressor**, 1 or 2, and its corresponding $t^{*}$ which yields the largest $\Delta(t^{*})$ as our first-round split rule.
5. Now we have two subsamples. In each of them, repeat Steps 1-4.
6. Repeat the splitting until a pre-set number of splitting rounds and/or a pre-set minimum size of the final rectangles is reached. The final rectangles are called nodes or **leaves**.
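
Steps 1-4 (one round of splitting) can be sketched in base R as follows. This is an illustration, not the implementation used by any package: the function names, the use of midpoints between sorted values as candidate thresholds, and the `min_size` guard (so that neither side is nearly empty) are assumptions added here.

```r
# Delta_j(t): squared difference between the average Y on the two sides of t.
split_contrast <- function(w, y, t) {
  left <- w < t
  if (!any(left) || all(left)) return(-Inf)   # both sides must be non-empty
  (mean(y[left]) - mean(y[!left]))^2
}

# Steps 1-4: for every covariate j, try candidate thresholds and keep the
# pair (j, t) with the largest contrast.
best_split <- function(W, Y, min_size = 20) {
  best <- list(var = NA, t = NA, delta = -Inf)
  for (j in seq_len(ncol(W))) {
    w_sorted <- sort(unique(W[, j]))
    candidates <- (head(w_sorted, -1) + tail(w_sorted, -1)) / 2   # midpoints
    for (t in candidates) {
      n_left <- sum(W[, j] < t)
      if (n_left < min_size || length(Y) - n_left < min_size) next
      delta <- split_contrast(W[, j], Y, t)
      if (delta > best$delta) best <- list(var = j, t = t, delta = delta)
    }
  }
  best
}

set.seed(1)
n <- 500
W <- matrix(runif(n * 2), n, 2)
Y <- 3 * (W[, 2] > 0.5) + rnorm(n)   # hypothetical: the mean of Y jumps at W_2 = 0.5
first_split <- best_split(W, Y)
first_split
```

Step 5 would call `best_split` again on each of the two resulting subsamples, and Step 6 stops the recursion.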

Notes. If you learned random tree from ML, you may remember one needs a testing set. This algorithm is simpler by taking advantage of the splitting criterion we use. The idea is from Wager and Athey (2018, JASA) and Athey, Tibshirani and Wager (2019, Annals of Stats). Feel free to explore the difference if you’re interested but this is not required by this course.

This algorithm solves the two questions we raised earlier:

1. How do we split the unit square efficiently?

   - No matter how large $p$ is, each time we only focus on one of them. Each splitting trial is a simple one-dimensional splitting problem. Very easy to handle; the complexity of the problem is linear in $p$.
   - The above algorithm is for $p=2$. When $p>2$, we only need to add a step like Step 3 for every additional covariate. Then in Step 4, compare $\Delta_{1}(t_{1}^{*})$, $\Delta_{2}(t_{2}^{*})$, ..., and $\Delta_{p}(t_{p}^{*})$, and find the largest one.
   - This **recursive splitting** algorithm is so simple that it can be represented by a binary tree (the code in Section 4 plots one).

2. How do we know whether $f_{d}(w)$ changes a lot in a leaf or not? 

   - We actually do not look at $f_{d}(W)$, i.e., $\mathbb{E}(Y|D=d,W_{1},W_{2})$, within one leaf. We compare the $f_{d}(W)$s across different leaves and pick the split that yields the largest difference (largest contrast, as some authors prefer).
   - In this way, the points on one leaf A have relatively similar function values $f_{d}(\cdot)$ compared with the points on another leaf B. Otherwise, those points would have been classified onto leaf B.
   - When the leaves are small enough, we are confident that function $f_{d}(\cdot)$ is almost constant on it, leading to small enough bias when averaging $Y_{i}$s on it.
   
### 3.2 Causal Tree

So far we used tree to estimate conditional expectations. That works well based on ML theory. Econometrics plays no role in it yet. 

However, our ultimate goals are i) to estimate causal effect and ii) conduct statistical inference. These are two new questions to answer besides the two in [Section 3.1](#3.1 Random Tree). 

1. How do we estimate CATE?
2. How do we establish statistical properties like consistency and asymptotic distribution?

#### 3.2.1 Changing the Splitting Criterion

In the tree algorithm, we treated $\mathbb{E}(Y|D=d,W=w)$ as a function of $w$, i.e., $f_{d}(w)$. We set splitting criterion by comparing $\hat{f}_{d}$ for $W_{j}>t_{j}$ and $W_{j}<t_{j}$, and set the split as the $j$ and $t_{j}$ that yields the largest contrast.

Now since our goal is to estimate $\mathbb{E}(Y|D=d,W=w)-\mathbb{E}(Y|D=d',W=w)$, we can simply treat this difference as the function we care about:

- Let $f_{d,d'}(w)\equiv \mathbb{E}(Y|D=d,W=w)-\mathbb{E}(Y|D=d',W=w)$.
- Revise Step 1 in the algorithm in [Section 3.1](#3.1 Random Tree) as follows:
  1. Split the sample into two halves by $W_{1}<t_{1}$ and $W_{1}>t_{1}$. In each subsample, calculate the average of $Y$ for the observations with $D=d$ minus the average of $Y$ for the observations with $D=d'$. Denote the two differences by $\bar{Y}_{d-d',t1}^{(1)}$ and $\bar{Y}_{d-d',t1}^{(2)}$. Calculate $\Delta_{1}(t_{1})\equiv (\bar{Y}_{d-d',t1}^{(1)}-\bar{Y}_{d-d',t1}^{(2)})^{2}$.
- Then for all the following steps, revise $\Delta$ in the same way.
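
The modified criterion can be sketched in base R for a binary treatment ($d=1$, $d'=0$). The simulated data, the three thresholds and the function names are assumptions chosen for illustration.

```r
# Difference in average Y between D = 1 and D = 0 within a subsample.
diff_in_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# Modified Delta_j(t): squared difference between the two subsample-level
# treatment-effect estimates.
causal_contrast <- function(w, y, d, t) {
  left <- w < t
  (diff_in_means(y[left], d[left]) - diff_in_means(y[!left], d[!left]))^2
}

set.seed(1)
n <- 2000
W1 <- rnorm(n)
D <- rbinom(n, 1, 0.5)
Y <- pmax(W1, 0) * D + rnorm(n)   # hypothetical: the effect of D is max(W1, 0)

thresholds <- c(-1, 0, 1)
contrasts <- sapply(thresholds, function(t) causal_contrast(W1, Y, D, t))
names(contrasts) <- paste0("t=", thresholds)
round(contrasts, 3)
```

The only change relative to the random tree criterion is that each side of the split is summarized by a difference in means across treatment status rather than by a single mean of $Y$.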

In this way, we can make sure that on each leaf, i.e., each final rectangle, the CATE does not change much, so estimating it by the difference in average $Y$s between $D=d$ and $D=d'$ on that leaf does not create much bias.

#### 3.2.2 Honesty

The second problem is much harder to solve. It took academia nearly two decades to come up with a solution.

- Susan Athey and Guido Imbens, in their 2016 *Proceedings of the National Academy of Sciences of the United States of America (PNAS)* paper, address the problem by introducing a notion called “honesty”, which Wager and Athey (2018, JASA) carry over to forests.
- Recall in the algorithm, $Y$ is used for two purposes: a) determine the covariate and the split in each round, and b) after the leaves are formed, compute the final estimates of CATE.
- Such dependence causes theoretical challenges for statistical properties.
  - For instance, consistency could be derived by applying WLLN on each leaf: at the end of the day, we just do sample average on each leaf.
  - WLLN needs i.i.d.
  - However, these leaves are formed by comparing $Y$s. So each leaf is a function of $Y$. Conditional on the leaf, the $Y$s are no longer i.i.d.

To resolve this issue, Wager and Athey propose an extra step before Step 1 in the algorithm in [Section 3.1](#3.1 Random Tree), called *honest splitting*:

**The causal tree algorithm** (a sketch).

- Step 0 (Honest splitting). Randomly split the data set into two halves by $i$ (**NOT BY $W$**!)
  - Each subsample has around $n/2$ observations. Call them the **training data** and **estimation data** respectively.
- Step 1-6. Grow a random tree **using the training data ONLY** following the algorithm in [Section 3.1](#3.1 Random Tree) with the modified $\Delta$ in [Section 3.2.1](#3.2.1 Changing the Splitting Critirion). 
- Given the leaves, for $w$ of interest, estimate $CATE(d,d';w)$ by first checking which leaf $w$ falls into, and then calculating $\bar{Y}_{D=d}-\bar{Y}_{D=d'}$ on that leaf **using the estimation data**.

*Honesty* means that we use independent subsamples to grow a tree and to estimate, respectively. Then the bias caused by “Y is correlated to Y” can be avoided.

- Wager and Athey show that under some regularity conditions, the estimated CATE is unbiased, consistent and asymptotically normal. Inference can be easily done by standard t-test, p-value and confidence interval.
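
Here is a minimal base-R sketch of honesty with a one-split "tree" and a binary treatment. Everything specific in it is an assumption for illustration: the simulated data, the grid of candidate thresholds (quantiles of the training data), and the target point $w=1.5$.

```r
diff_in_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

set.seed(1)
n <- 4000
W1 <- rnorm(n)
D <- rbinom(n, 1, 0.5)
Y <- pmax(W1, 0) * D + rnorm(n)   # hypothetical: the true CATE is max(W1, 0)

# Step 0 (honest splitting): split by observation index i, not by W.
train_idx <- sample(n, n / 2)
est_idx <- setdiff(seq_len(n), train_idx)

# Grow a one-split "tree" on the training data only.
candidates <- quantile(W1[train_idx], probs = seq(0.1, 0.9, by = 0.05))
contrast <- sapply(candidates, function(t) {
  left <- W1[train_idx] < t
  (diff_in_means(Y[train_idx][left], D[train_idx][left]) -
     diff_in_means(Y[train_idx][!left], D[train_idx][!left]))^2
})
t_star <- unname(candidates[which.max(contrast)])

# Estimate the CATE at w using only estimation data in the same leaf as w.
w <- 1.5
same_leaf <- (W1[est_idx] < t_star) == (w < t_star)
cate_hat <- diff_in_means(Y[est_idx][same_leaf], D[est_idx][same_leaf])
c(t_star = t_star, cate_hat = cate_hat)
```

The $Y$s that choose `t_star` and the $Y$s that produce `cate_hat` come from disjoint sets of individuals, which is exactly what honesty requires.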

### 3.3 Causal Forest

A causal tree has, again, two drawbacks:

1. It’s sensitive to noise.
2. Some data (the estimation data) are never used to grow a tree, while some (the training data) are never used for estimation. Information loss.

**Causal Forest**: Grow many trees and take the average.

**The causal forest algorithm** (a sketch).

1. Randomly draw $s$ observations from the full sample of $n$ observations.
2. Grow a causal tree using this subsample with the algorithm in [Section 3.2.2](#3.2.2 Honesty).
   - The steps are exactly what we described earlier. Recall the key steps are: splitting this subsample into two halves (honesty), recursively splitting the training data to grow a tree, and using the estimation data to estimate the CATE.
3. Repeat Steps 1 and 2 $B$ times. Choose $B$ as large as possible.
4. You end up with $B$ causal trees, which together are called a **causal forest**. For any given $w$, each tree $b$ yields an estimate $CATE^{(b)}(d,d';w)$. Take the average of these $B$ estimates: $\sum_{b=1}^{B}CATE^{(b)}(d,d';w)/B$.

The causal forest resolves the two drawbacks of causal trees by:

1. It’s sensitive to noise.
   - Averaging makes the noise in each tree only have $1/B$ share of impact.
2. Some data (the estimation data) are never used to grow a tree, while some (the training data) are never used for estimation. Information loss.
   - Training data in one tree may be used as estimation data in another tree. When B is large enough, all data are likely to be used to grow trees and to estimate.

Wager and Athey show that under certain regularity conditions: CATE estimated by causal forest is consistent and asymptotically normal.
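
The subsample-and-average structure of the forest can be sketched in base R. To keep the example short, each "tree" below makes a single honest split on one covariate; the function `honest_stump_cate`, the simulated data, and the values of $s$, $B$ and $w$ are all assumptions for illustration and not how the `grf` package is implemented.

```r
diff_in_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

# One honest one-split causal tree grown on the subsample `idx`;
# returns its CATE estimate at the scalar point w.
honest_stump_cate <- function(idx, w, W1, Y, D) {
  train <- sample(idx, length(idx) %/% 2)   # honesty: random half for growing
  est <- setdiff(idx, train)                # the other half for estimating
  candidates <- quantile(W1[train], probs = seq(0.2, 0.8, by = 0.1))
  contrast <- sapply(candidates, function(t) {
    left <- W1[train] < t
    (diff_in_means(Y[train][left], D[train][left]) -
       diff_in_means(Y[train][!left], D[train][!left]))^2
  })
  t_star <- candidates[which.max(contrast)]
  same_leaf <- (W1[est] < t_star) == (w < t_star)
  diff_in_means(Y[est][same_leaf], D[est][same_leaf])
}

set.seed(1)
n <- 2000; s <- 1000; B <- 200
W1 <- rnorm(n)
D <- rbinom(n, 1, 0.5)
Y <- pmax(W1, 0) * D + rnorm(n)   # hypothetical: the true CATE is max(W1, 0)

# Steps 1-4: draw s observations, grow a tree, repeat B times, average.
tree_estimates <- replicate(B, honest_stump_cate(sample(n, s), w = 1, W1, Y, D))
forest_estimate <- mean(tree_estimates)
c(forest = forest_estimate, sd_across_trees = sd(tree_estimates))
```

The spread of `tree_estimates` shows how noisy a single tree is; the forest estimate averages that noise out, and an observation used for growing in one tree can be used for estimation in another.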

#### 3.3.1 Implementation Details

Pay attention to the following three aspects in implementation.

**Covariates subsetting**. 

Recall that when growing a tree, in each round in the recursive splitting, one needs to try every possible split for every covariate. 

- This can be slow when $p$ is large.
- In practice, for each round in the recursive splitting, one only needs to try out a random subset of covariates.
- This subset is redrawn at random in each round.
- A rule of thumb is randomly choosing $\min\{\sqrt{p}+20,p\}$ covariates.

**Minimum leaf size**.

Leaf size is like $h$ in traditional nonparametrics. If a leaf is too large, large bias (because extrapolated too much). If a leaf is too small, large variance (because data points are too few).

- We can control the minimum size so that when it is reached, recursive splitting ends.
- This size can be cross-validated.

**Imbalance of a split**.

- When splitting a parent node, the size of each child node is not allowed to be too different.
- Meanwhile, the numbers of treated and untreated observations in a child node cannot be too different either.
- The level of imbalance can be controlled in the algorithm.
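
As a sketch of how these three aspects map to code: the first part computes the rule of thumb in base R (rounding up with `ceiling` is an assumption added here so that the result is an integer). The second part, which runs only if the `grf` package is installed, passes the corresponding tuning arguments of `causal_forest` on simulated data; the specific values are illustrative, not recommendations.

```r
# Rule of thumb for the number of covariates tried at each split.
n_covariates_tried <- function(p) min(ceiling(sqrt(p) + 20), p)
sapply(c(5, 10, 100, 1000), n_covariates_tried)   # 5 10 30 52

# Only runs if the grf package is installed.
if (requireNamespace("grf", quietly = TRUE)) {
  set.seed(1)
  n <- 500; p <- 10
  W <- matrix(rnorm(n * p), n, p)
  D <- rbinom(n, 1, 0.5)
  Y <- pmax(W[, 1], 0) * D + rnorm(n)
  forest <- grf::causal_forest(
    W, Y, D,                        # covariates, outcome, treatment
    num.trees = 500,
    mtry = n_covariates_tried(p),   # covariate subsetting
    min.node.size = 5,              # minimum leaf size
    alpha = 0.05,                   # caps how unbalanced a split may be
    imbalance.penalty = 0,          # extra penalty on unbalanced splits
    stabilize.splits = TRUE         # count treated/untreated when judging imbalance
  )
}
```

The minimum leaf size plays the role of $h$, so it is the natural candidate for cross-validation.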

### 3.4 Comparison with OLS

Recall that OLS has the following drawbacks:

- You have to assume everything is linear. When your $D$ is binary and you care about ATE, that's fine. But in almost all other cases, this can be a crazy assumption.
- Even if the true model is indeed linear, in order to get a consistent estimate of $\beta$ in front of $D$, both $D$ and $W$ need to be exogenous in general.

Causal forest overcomes these drawbacks because i) it does not impose any restrictions on the functional form and ii) everything works under conditional randomization of $D$ whereas $W$ can be arbitrarily correlated with the unobservable(s).

## 4. Applications and Implementation

You can imagine there is a wide range of applications of causal forest in economics. Whenever you have an exogenous $D$ and want to know the heterogeneous effects, you can replace linear models (or any parametric nonlinear models) with it and obtain everything you want.

Moreover, you can use it with an instrumental variable as well! 

* Recall that CLATE when IV and the treatment are both binary is identified as the ratio of two conditional expectation differences. 
* You can imagine it's not hard to adapt the algorithm to estimate this quantity.
  * Already done in the literature.
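
For reference, the ratio mentioned above can be written out as follows. The symbol $Z$ for the binary instrument is notation introduced here for illustration; the usual IV assumptions (conditional on $W$) are taken as given:
$$
CLATE(w)=\frac{\mathbb{E}(Y|Z=1,W=w)-\mathbb{E}(Y|Z=0,W=w)}{\mathbb{E}(D|Z=1,W=w)-\mathbb{E}(D|Z=0,W=w)}.
$$
Both the numerator and the denominator are differences of conditional expectations, so on each leaf they can be estimated by differences in sample averages, just like the CATE.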

Implementation is also straightforward because an R package, `grf`, is available (check out its tutorial webpage: https://grf-labs.github.io/grf/index.html).

Here's an example.

- I generate a data set with $p=10$ and $n=2000$.
- The $2000\times 10$ matrix $W$ is randomly drawn from the standard normal distribution.
- Treatment $D$ is conditionally randomized: It's binary and equal to 1 with probability 0.4 if $W_{1}<0$ and with probability 0.6 if $W_{1}>0$.
- The model for $Y$ is nonlinear: $Y=\max\{W_{1},0\}\times D+W_{2}+\min\{W_{3},0\}+\varepsilon$.
  - Can you calculate the true CATE and ATE by hand? Try.

```r
library(grf)
library(DiagrammeR)  # needed by plot() for grf trees

### Generate data. You do not need to run these lines if a data set is already given.
n <- 2000
p <- 10
W <- matrix(rnorm(n * p), n, p)                # n x p matrix of covariates
D <- rbinom(n, 1, 0.4 + 0.2 * (W[, 1] > 0))    # P(D = 1) is 0.6 if W1 > 0 and 0.4 otherwise
Y <- pmax(W[, 1], 0) * D + W[, 2] + pmin(W[, 3], 0) + rnorm(n)
data <- data.frame(Y, D, W)

### Generate values of interest for w: W1 runs from -2 to 2, all other covariates are held at 0
W.test <- matrix(0, 101, p)
W.test[, 1] <- seq(-2, 2, length.out = 101)

### Build a causal forest.
# Arguments in order: covariates, outcome, treatment (grf names them X, Y, W).
tau.forest <- causal_forest(W, Y, D)
tree <- get_tree(tau.forest, 1)
plot(tree) # visualize the first tree in the forest

### Estimate CATE at specified w values using the forest
tau.hat <- predict(tau.forest, W.test)
plot(W.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions, 0, 2),
     xlab = "w", ylab = "tau", type = "l") # estimated CATE
lines(W.test[, 1], pmax(0, W.test[, 1]), col = 2, lty = 2) # true CATE

### Estimate treatment effects with confidence intervals
tau.forest <- causal_forest(W, Y, D, num.trees = 4000)
tau.hat <- predict(tau.forest, W.test, estimate.variance = TRUE)
sigma.hat <- sqrt(tau.hat$variance.estimates)
plot(W.test[, 1], tau.hat$predictions, ylim =
       range(tau.hat$predictions + 1.96 * sigma.hat,
             tau.hat$predictions - 1.96 * sigma.hat, 0, 2),
     xlab = "w", ylab = "tau", type = "l")
lines(W.test[, 1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2) # upper 95% bound
lines(W.test[, 1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2) # lower 95% bound
lines(W.test[, 1], pmax(0, W.test[, 1]), col = 2, lty = 1) # true CATE

### As promised, ATE can be obtained from CATE
average_treatment_effect(tau.forest, target.sample = "all")

### This is what you'll get if you do OLS
model <- lm(Y ~ D)
summary(model)
```