# (Athey and Imbens, 2016) The State of Applied Econometrics - Causality and Policy Evaluation

[Link to Paper](https://arxiv.org/pdf/1607.00699.pdf)

# Abstract

This paper discusses recent developments in econometrics that we view as important for empirical researchers working on evaluation questions. We focus on 3 main areas.
1. New research on identification strategies in program evaluation, with particular focus on synthetic control methods, regression discontinuity, external validity, and the causal interpretation of regression methods.
2. Discuss various forms of supplementary analyses to make the identification strategies more credible. Includes placebo analyses as well as sensitivity and robustness analyses.
3. Discuss recent advances in machine learning methods for causal effects. Includes methods to adjust for differences between treated and control units in high-dimensional settings, and methods for identifying and estimating heterogeneous treatment effects.

# 1. Introduction

Most of the attention in the econometrics literature on reduced-form policy evaluation focuses on separating correlation from causality in observational studies, that is, with non-experimental data. There are several distinct strategies for estimating causal effects with observational data. These strategies are often referred to as "identification strategies", or "empirical strategies" (Angrist and Krueger, 2000).

In Section 2, we review recent developments corresponding to different identification strategies, such as one based on "regression discontinuity". Other strategies such as synthetic control methods, methods designed for network settings, and methods that combine experimental and observational data are explored as well.

Section 3 discusses *supplementary analyses*, where the focus is on providing support for the identification strategy underlying the primary analyses, on establishing that the modeling decisions are adequate to capture the critical features of the identification strategy, or on establishing robustness of estimates to modeling assumptions.

Thus, the results of supplementary analyses are intended to convince the reader of the credibility of the primary analyses.

In Section 4, we briefly discuss new developments from what is referred to as the machine learning literature. Recently there has been much interesting work combining these predictive methods with causal analyses.

# 2. New Developments in Program Evaluation

The Potential Outcomes Framework is the most widely used approach in econometrics, although there is a complementary approach based on graphical models.

The literature on estimating average treatment effects under unconfoundedness is now a very mature literature with a number of competing estimators and many applications.

Some estimators use matching methods, some rely on weighting, and some involve the propensity score.

One area with continuing developments concerns settings with many covariates, possibly more than there are units.

For this setting, connections have been made with the machine learning and big data literatures.

Beyond settings with unconfoundedness we discuss issues related to a number of other identification strategies and settings. In 2.1 we discuss regression discontinuity designs. Next, we discuss synthetic control methods developed by Abadie et al. 2010, which we believe is one of the most important developments in program evaluation in the last decade. In 2.4 we discuss causal methods in network settings. In 2.5 we draw attention to some recent work on the causal interpretation of regression methods. We also discuss external validity in 2.6. In 2.7 we discuss how randomized experiments can provide leverage for observational studies.

This review does not look at the two major strands of instrumental variables literature as well as bounds and partial identification strategies.

## 2.1 Regression Discontinuity Designs

A research design that exploits discontinuities in incentives to participate in a treatment to evaluate the effect of that treatment.

### 2.1.1 Set Up

In RD designs, we are interested in the causal effect of a binary treatment or program, denoted by $W_i$. The key feature of the design is the presence of an exogenous variable, the forcing variable, denoted by $X_i$, such that at a particular value of this forcing variable, the threshold $c$, the probability of participating in the program or being exposed to the treatment changes discontinuously:

$$
\lim_{{x \to c^+}} \text{Pr}(W_i = 1 | X_i = x) \neq \lim_{{x \to c^-}} \text{Pr}(W_i = 1 | X_i = x).
$$

> Normally, when you approach $c$ from the left and from the right, the two limits converge to the same value. Here the probability jumps at $c$, so the two limits are not equal.

If the jump in the conditional probability is from zero to one, we have a *sharp* regression discontinuity (SRD) design; if the magnitude of the jump is less than one, we have a *fuzzy* regression discontinuity (FRD) design.

The estimand is the discontinuity in the conditional expectation of the outcome at the threshold, scaled by the discontinuity in the probability of receiving the treatment:

$$
\tau^{rd} = \frac{\lim_{{x \downarrow c}} E[Y_i | X_i = x] - \lim_{{x \uparrow c}} E[Y_i | X_i = x]}
{\lim_{{x \downarrow c}} E[W_i | X_i = x] - \lim_{{x \uparrow c}} E[W_i | X_i = x]}
$$

> By "discontinuity", they mean the difference between the two one-sided limits at $c$, so essentially how much they differ at the point of discontinuity.

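> Note how the denominator can at most be 1. This means that under FRD designs, the jump in the outcome is scaled up, because it is divided by a jump in the treatment probability that is less than 1 (I don't know if this fully makes sense to me though, so maybe I'm wrong).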

In the SRD case, the denominator is equal to one, and we just focus on the discontinuity of the conditional expectation of the outcome given the forcing variable at the threshold. In that case, under the assumption that the individuals just to the right and just to the left of the threshold are comparable, the estimand has an interpretation as the average effect of the treatment for individuals close to the threshold.

In the FRD case, the interpretation of the estimand is the average effect for compliers at the threshold.

> "Complying" means they comply with their treatment assignment I think.
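
To make the scaling in $\tau^{rd}$ concrete, here is a small worked example with made-up numbers (they are not from the paper). Suppose the average outcome jumps by $0.5$ at the threshold, and the probability of treatment jumps from $0.2$ just below the threshold to $0.6$ just above it. Plugging these into the ratio gives

$$
\tau^{rd} = \frac{0.5}{0.6 - 0.2} = \frac{0.5}{0.4} = 1.25
$$

In an SRD design with the same outcome jump, the denominator would be $1$ and the estimand would simply be $0.5$.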

### 2.1.2 Estimation and Inference

In the general FRD case, the estimand $\tau^{rd}$ has 4 components, each of them the limit of the conditional expectation of a variable at a particular value of the forcing variable.

This can be thought of as estimating the conditional expectation at a boundary point. Researchers typically wish to use flexible (semiparametric or nonparametric) methods for estimating these expectations.

Because the target in each case is the conditional expectation at a boundary point, simply differencing average outcomes close to the threshold on the right and on the left leads to an estimator with poor properties, as stressed by Porter (2003).

> I think this is because it requires too much extrapolation: more often than not, the further from the boundary point the observations you average over are, the less accurate the estimator becomes.

As an alternative, Porter (2003) suggested "local linear regression", which involves estimating linear regressions of outcomes on the forcing variable separately on the left and the right of the threshold, weighting most heavily observations close to the threshold, and then taking the difference between the predicted values at the threshold.

This local linear estimator has substantially better finite sample properties than nonparametric methods that do not account for threshold effects, and it has become the standard.
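
The local linear idea for the sharp case can be sketched in a few lines. This is only an illustration under assumptions that are not in the paper: a triangular kernel, a hand-picked bandwidth `h`, simulated data, and hypothetical function names (`local_linear_at_cutoff`, `sharp_rd_estimate`). It fits a weighted line on each side of the threshold and differences the two fitted values at the threshold.

```python
import random


def local_linear_at_cutoff(xs, ys, c, h):
    """Fit y = a + b * (x - c) by weighted least squares and return a.

    Weights come from a triangular kernel with bandwidth h, so points
    closer to the cutoff c count more and points beyond h get weight 0.
    """
    sw = swd = swy = swdd = swdy = 0.0
    for x, y in zip(xs, ys):
        d = x - c
        w = max(0.0, 1.0 - abs(d) / h)
        sw += w
        swd += w * d
        swy += w * y
        swdd += w * d * d
        swdy += w * d * y
    slope = (sw * swdy - swd * swy) / (sw * swdd - swd ** 2)
    return (swy - slope * swd) / sw  # fitted value at x == c


def sharp_rd_estimate(xs, ys, c, h):
    """Difference between the right and left local linear fits at c."""
    right = [(x, y) for x, y in zip(xs, ys) if x >= c]
    left = [(x, y) for x, y in zip(xs, ys) if x < c]
    mu_right = local_linear_at_cutoff(
        [x for x, _ in right], [y for _, y in right], c, h)
    mu_left = local_linear_at_cutoff(
        [x for x, _ in left], [y for _, y in left], c, h)
    return mu_right - mu_left


# Simulated data (hypothetical): the outcome jumps by 2.0 at the cutoff 0.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [1.0 + 0.5 * x + (2.0 if x >= 0 else 0.0) + random.gauss(0.0, 0.1)
      for x in xs]
print(sharp_rd_estimate(xs, ys, c=0.0, h=0.3))
```

The printed estimate should land near the simulated jump of 2.0. In a fuzzy design the same function could be applied to the treatment indicator to estimate the denominator of $\tau^{rd}$.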

Given a local linear estimation method, a key issue is the choice of the bandwidth, that is, how close observations need to be to the threshold.

Conventional methods for choosing optimal bandwidths in nonparametric estimation (e.g., based on cross-validation) look for bandwidths that are optimal for estimating the entire regression function, whereas here the interest is solely in the value of the regression function at a particular point. The current state of the literature suggests choosing the bandwidth for the local linear regression using asymptotic expansions of the estimators around small values for the bandwidth. See Imbens and Kalyanaraman (2012) for further discussion.

In some cases, the discontinuity involves *multiple* exogenous variables.

For example, in Jacob and Lefgren (2004) and Matsudaira (2008), the focus is on the causal effect of attending summer school. The formal rule is that students who score below a threshold on either a language or a mathematics test are required to attend summer school. Although not all students who are required to attend summer school do so (thus leading to an FRD instead of an SRD), the fact that the forcing variable is a known function of two observed exogenous variables makes it possible to estimate the effect of summer school at different margins.

Even more than the presence of other exogenous variables, the dependence of the threshold on multiple exogenous variables improves the ability to detect and analyze heterogeneity in the causal effects.

### 2.1.3 An Illustration

(This section talks further in depth about the RD design with summer school data from Jacob and Lefgren (2004))

Conditional on the design being valid, they saw a significant positive effect of summer school.

### 2.1.4 Regression Kink Designs

One of the most interesting recent developments in the area of regression discontinuity designs is the generalization to discontinuities in derivatives, rather than levels of conditional expectations. The first discussions of these regression "kink" designs are in Nielsen et al. 2010. The basic idea is that at a threshold for the forcing variable, the slope of the outcome function (as a function of the forcing variable) changes, and the goal is to estimate this change in slope.
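
As a rough illustration of "estimating the change in slope", the sketch below fits a kernel-weighted line on each side of the threshold and differences the two slopes. It is not the estimator from Nielsen et al. 2010: the triangular kernel, the bandwidth, the noiseless data, and the function names are all assumptions made for the example, and it only covers the change in the outcome slope.

```python
def local_linear_slope(xs, ys, c, h):
    """Weighted least-squares slope of y on (x - c), triangular kernel."""
    sw = swd = swy = swdd = swdy = 0.0
    for x, y in zip(xs, ys):
        d = x - c
        w = max(0.0, 1.0 - abs(d) / h)
        sw += w
        swd += w * d
        swy += w * y
        swdd += w * d * d
        swdy += w * d * y
    return (sw * swdy - swd * swy) / (sw * swdd - swd ** 2)


def slope_change_at_threshold(xs, ys, c, h):
    """Right-hand slope minus left-hand slope at the threshold c."""
    right = [(x, y) for x, y in zip(xs, ys) if x >= c]
    left = [(x, y) for x, y in zip(xs, ys) if x < c]
    slope_right = local_linear_slope(
        [x for x, _ in right], [y for _, y in right], c, h)
    slope_left = local_linear_slope(
        [x for x, _ in left], [y for _, y in left], c, h)
    return slope_right - slope_left


# Hypothetical noiseless data: the outcome is continuous at 0, but its
# slope rises from 0.5 on the left to 1.5 on the right.
kink_xs = [i / 100.0 - 1.0 for i in range(201)]
kink_ys = [1.0 + (1.5 if x >= 0 else 0.5) * x for x in kink_xs]
print(slope_change_at_threshold(kink_xs, kink_ys, c=0.0, h=0.3))
```

Unlike the level discontinuity in an RD design, the outcome here has no jump at the threshold; only its derivative changes.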



### 2.1.5 Summary of Recommendations

There are some specific choices to be made in RD analyses.

We recommend using local linear or local quadratic methods rather than global polynomial methods.

These local linear methods require a bandwidth choice. We recommend the optimal bandwidth algorithms based on asymptotic arguments involving local expansions discussed in Imbens and Kalyanaraman (2012).

We also recommend carrying out supplementary analyses to assess the credibility of the design, and in particular to test for evidence of manipulation of the forcing variable.

### 2.1.6 The Literature

RD designs have a long history, going all the way back to 1960 in the psychology literature, although they did not become part of the mainstream economics literature until the early 2000s.

## 2.2 Synthetic Control Methods and Differences-In-Differences

DID methods have become an important tool for empirical researchers.

In the basic setting there are two or more groups, at least one treated and one control, and we observe (possibly different) units from all groups in two or more time periods, some prior to the treatment and some after the treatment.

The difference between the treatment and control groups post-treatment is adjusted for the difference between the two groups prior to the treatment.

In the simple DID case these adjustments are linear: they take the form of estimating the average treatment effect as the difference in average outcomes post treatment minus the difference in average outcomes pre treatment.
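
The linear adjustment in the simple DID case can be written out directly. The numbers below are hypothetical and the function name `did_estimate` is just for this sketch.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Post-treatment gap between groups minus the pre-treatment gap."""
    post_gap = mean(treated_post) - mean(control_post)
    pre_gap = mean(treated_pre) - mean(control_pre)
    return post_gap - pre_gap


# Hypothetical outcomes for two groups observed in two periods.
treated_pre = [10.0, 12.0, 11.0]
treated_post = [15.0, 17.0, 16.0]
control_pre = [8.0, 9.0, 10.0]
control_post = [10.0, 11.0, 12.0]
print(did_estimate(treated_pre, treated_post, control_pre, control_post))
```

Here the post-treatment gap is 5 and the pre-treatment gap is 2, so the DID estimate attributes the remaining 3 to the treatment.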

Here we discuss two important recent developments, the synthetic control approach and the nonlinear changes-in-changes method.



### 2.2.1 Synthetic Control Methods

Arguably the most important innovation in the evaluation literature in the last 15 years is the synthetic control approach by Abadie et al. in 2003, 2010, 2014b.

This method builds on difference-in-differences estimation, but uses arguably more attractive comparisons to get causal effects.

We discuss the basic Abadie et al. 2010 approach and highlight alternative choices and restrictions that may be imposed to further improve the performance of the methods relative to DiD estimation methods.

We observe outcomes for a number of units, indexed by $i=0,...,N$ for a number of periods indexed by $t=1,...T$.

There is a single unit, say unit $0$, that was exposed to the control treatment during periods $1,...,T_0$ and to the active treatment in period $T_0+1$. Note that in this case there is only one post-treatment period, so $T=T_0+1$. All other units are exposed to the control treatment in all periods. The number of control units $N$ can be as small as 1, and the number of time periods $T$ can be as small as 2. We may also observe exogenous fixed covariates for each of the units. The units are often aggregates of individuals, such as states, cities, or countries. We are interested in the causal effect of the treatment for this unit,
$$Y_{0T}(1)-Y_{0T}(0)$$

The traditional DiD would compare the change for the treated unit (unit $0$) between periods $t$ and $T$, for some $t< T$, to the corresponding change for some other unit.
> So this could be comparing how 2 units change over time, one control and one treatment

The synthetic control idea is to move away from using a single control unit or a simple average of control units, and instead use a weighted average of the set of controls, with the weights chosen so that the weighted average is similar to the treated unit in terms of lagged outcomes and covariates.

The implementation of the synthetic control method requires a particular choice for estimating the weights. The original paper, Abadie et al. 2010, restricts the weights to be non-negative and requires them to add up to one.

Let $K$ be the dimension of the covariates $X_i$, and let $\Omega$ be an arbitrary positive definite $K\times K$ matrix.

Then, let $\lambda(\Omega)$ be the weights that solve

$$\lambda(\Omega)=\arg\min_\lambda\left(X_0-\sum_{i=1}^{N}\lambda_i\cdot X_i\right)'\Omega\left(X_0-\sum_{i=1}^{N}\lambda_i\cdot X_i\right)$$

> Intuitively, you want the weighted average of the control units' covariates to resemble the treated unit's covariates $X_0$ as closely as possible.

Abadie et al. 2010 choose the weight matrix $\Omega$ that minimizes

$$\sum_{t=1}^{T_0}\left(Y_{0t}-\sum_{i=1}^{N}\lambda_i(\Omega)\cdot Y_{it}\right)^2$$

If the covariates $X_i$ consist of the vector of lagged outcomes, this estimate amounts to minimizing

$$\sum_{t=1}^{T_0}\left(Y_{0t}-\sum_{i=1}^{N}\lambda_i\cdot Y_{it}\right)^2$$

> In other words, if there are no covariates besides lagged outcomes, you just need a weighted average of the controls' lagged outcomes that matches the lagged outcomes of the treated unit as closely as possible.
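
For the lagged-outcomes-only case, the constrained minimization can be sketched with projected gradient descent. This is not how Abadie et al. compute their weights, and it skips the choice of $\Omega$ entirely. The optimizer, the step size, the data, and the function names are all assumptions made for illustration. The weights are kept non-negative and summing to one by projecting onto the simplex after every gradient step.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_i >= 0, sum(w) == 1}."""
    ordered = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        candidate = (running - 1.0) / k
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def synthetic_control_weights(treated, controls, n_iter=5000):
    """Projected gradient descent on the pre-treatment squared error.

    treated:  list of T0 pre-treatment outcomes for unit 0
    controls: list of N lists, each with T0 pre-treatment outcomes
    """
    n = len(controls)
    periods = range(len(treated))
    # Conservative step size: 1 / (2 * sum of squared control outcomes).
    step = 1.0 / (2.0 * sum(y * y for unit in controls for y in unit))
    weights = [1.0 / n] * n
    for _ in range(n_iter):
        residuals = [
            treated[t] - sum(weights[i] * controls[i][t] for i in range(n))
            for t in periods
        ]
        gradient = [
            -2.0 * sum(residuals[t] * controls[i][t] for t in periods)
            for i in range(n)
        ]
        weights = project_to_simplex(
            [w - step * g for w, g in zip(weights, gradient)])
    return weights


# Hypothetical data: the treated unit is 0.3 * control 1 + 0.7 * control 2.
sc_controls = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 2.0, 1.0],
    [0.5, 3.0, 0.5, 3.0],
]
sc_treated = [1.7, 1.3, 2.3, 1.9]
print(synthetic_control_weights(sc_treated, sc_controls))
```

Because the treated series was built as an exact convex combination of the first two controls, the recovered weights should put essentially nothing on the third control.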


Doudchenko and Imbens (2016) point out that one can view the question of estimating the weights in the above synthetic control method differently.

Starting with the case without covariates and only lagged outcomes, one can consider the regression function

$$Y_{0t}=\sum_{i=1}^{N}\lambda_i\cdot Y_{it}+\epsilon_t$$

with $T_0$ observations and $N$ regressors.

Estimating this regression by least squares is typically not possible because the number of regressors $N$ (the number of control units) is often larger than (or the same as) the number of observations (the number of time periods $T_0$).

We need to thus regularize the estimates in some fashion or another.

Abadie et al. (2010) restrict the weights $\lambda_i$ to be non-negative and to add up to one, which often leads to a unique set of weights. However, this restriction may hurt performance, and there are alternatives. We could use something like best subset regression, or LASSO, where we add a penalty proportional to the sum of the absolute values of the weights.
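
A LASSO-style alternative can be sketched with plain coordinate descent. This is an illustration, not the procedure from Doudchenko and Imbens (2016): there is no intercept, the penalty level `alpha` is picked by hand, and the data and function names are hypothetical. The example deliberately has more control units than pre-treatment periods, which is the case where unregularized least squares breaks down.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(weights, treated, controls, alpha):
    """0.5 * squared error + alpha * sum of absolute weights."""
    fit = [sum(w * unit[t] for w, unit in zip(weights, controls))
           for t in range(len(treated))]
    squared_error = sum((y - f) ** 2 for y, f in zip(treated, fit))
    return 0.5 * squared_error + alpha * sum(abs(w) for w in weights)


def lasso_weights(treated, controls, alpha, n_sweeps=200):
    """Coordinate descent on lasso_objective, starting from all zeros."""
    periods = range(len(treated))
    weights = [0.0] * len(controls)
    residuals = list(treated)  # residuals when every weight is zero
    for _ in range(n_sweeps):
        for i, unit in enumerate(controls):
            norm_sq = sum(y * y for y in unit)
            if norm_sq == 0.0:
                continue
            # Inner product of unit i with the residual that leaves unit i out.
            rho = sum(unit[t] * (residuals[t] + weights[i] * unit[t])
                      for t in periods)
            new_weight = soft_threshold(rho, alpha) / norm_sq
            delta = new_weight - weights[i]
            residuals = [residuals[t] - delta * unit[t] for t in periods]
            weights[i] = new_weight
    return weights


# Hypothetical data with N = 4 control units but only T0 = 3 periods.
lasso_treated = [2.0, 3.0, 4.0]
lasso_controls = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0],
    [3.0, 1.0, 2.0],
]
print(lasso_weights(lasso_treated, lasso_controls, alpha=0.5))
```

Raising `alpha` pushes more weights to exactly zero; once it exceeds the largest inner product between a control series and the treated series, every weight is zero.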

### 2.2.2 Nonlinear Difference-in-Difference Models

A commonly noted concern with DiD methods is that the functional form assumptions play an important role.

The nonlinear DiD model can be used for two distinct purposes. First, the distribution is of direct interest for policy, beyond the average treatment effect. Further, a number of authors have used this approach as a robustness check for the results from a linear model.

## 2.3 Estimating Average Treatment Effects under Unconfoundedness in Settings with Multivalued Treatments

Here we discuss the results of the more recent multi-valued treatment effect literature.

In the binary case, many methods have been proposed for estimating the average treatment effect.

Here we focus on two of these methods, subclassification with regression and matching with regression that have been found to be effective in the binary treatment case (Imbens and Rubin, 2015). We discuss how these can be extended to the multi-valued treatment setting without increasing the complexity of the estimators. In particular, the dimension reducing properties of a generalized version of the propensity score can be maintained in the multi-valued treatment setting.

### 2.3.1 Set Up

In the binary case, the most effective estimation methods appear to be those that combine some covariance adjustment through regression with a covariate balancing method such as subclassification, matching, or weighting based on the propensity score (Imbens and Rubin, 2015).

> This paper was published before (King and Nielsen, 2019)

Substantially less attention has been paid to the case where the treatment takes on multiple values.

Let $\mathbb{W}=\{0,1,...,T\}$ be the set of values for the treatment. In the multivalued treatment case, one needs to be careful in defining estimands, and the role of the propensity score is subtly different. One natural set of estimands is the average treatment effect if all units were switched from treatment level $w_1$ to treatment level $w_2$:

$$\tau_{w_1,w_2}=\mathbb{E}[Y_i(w_2)-Y_i(w_1)]$$


It is not sufficient to take all the units with treatment levels $w_1,w_2$ and use the above binary methods, because that would actually lead to an estimate of

$$\tau_{w_1,w_2}'=\mathbb{E}[Y_i(w_2)-Y_i(w_1)|W_i\in\{w_1,w_2\}]$$

It is better to focus on unconditional treatment effects.

The idea is not to find subsets of the covariate space where we can interpret the difference in average outcomes by all treatment levels as estimates of causal effects. Instead we find subsets where we can estimate the marginal average outcome for a particular treatment level as the conditional average for units with that treatment level, one treatment level at a time. This opens up the way for using matching and other propensity score methods developed for the case with binary treatments in settings with multivalued treatments, irrespective of the number of treatment levels.

A separate literature has gone beyond the multi-valued treatment setting to look at dynamic treatment regimes. With few exceptions most of these studies appear in the biostatistical literature: see Hernan and Robins (2006) for a general discussion.




## 2.4 Causal Effects in Networks and Social Interactions

Here, we discuss some of the settings and some of the progress that has been made. However, this review will be brief, and incomplete, because this continues to be a very active area, with work ranging from econometrics (Manski 1993) to economic theory (Jackson 2010).

In general, the questions in this literature focus on causal effects in settings where units, often individuals, interact in a way that makes the no-interference or SUTVA (Rosenbaum and Rubin (1983a), Imbens and Rubin (2015)) assumptions that are routinely made in the treatment effect literature implausible.

Settings of interest include those where:
- the possible interference is simply a nuisance
- the interest is in the magnitude of the interactions or peer effects (that is, in the effects of changing treatments for one unit on the outcome of other units).

There are settings where the network (that is, the set of links connecting the individuals) is fixed exogenously, and some where the network itself is the result of a possibly complex set of choices by individuals, possibly dynamic and possibly affected by treatments.

There are settings where the population can be partitioned into subpopulations with all units within a subpopulation connected, as, for example, in classroom settings, workers in a labor market, or roommates in college, or with general networks, where friends of friends are not necessarily friends themselves.

This large set of scenarios has led to the literature becoming somewhat fractured and unwieldy. We will only touch on a subset of these problems in this review.



### 2.4.1 Models for Peer Effects

Before considering estimation strategies, it is useful to begin by considering models of the outcomes in a setting with peer effects. Such models have been proposed in the literature.

A seminal paper in the econometric literature is Manski's linear-in-means model. Manski's original paper focuses on the setting where the population is partitioned into groups (e.g., classrooms), and peer effects are constant within the groups. The basic model specification is

$$Y_i=\beta_0+\beta_{\bar Y}\bar Y_i+\beta_X'X_i+\beta_{\bar X}'\bar X_i+\beta_Z'Z_i+\epsilon_i$$

where,
- $i$ indexes the individual
- $Y_i$ is the outcome for individual $i$
- $\bar Y_i$ is the average outcome for individuals in the peer group for individual $i$
- $X_i$ is a set of exogenous characteristics of individual $i$
- $\bar X_i$ is the average value of the characteristics in individual $i$'s peer group
- $Z_i$ are group characteristics that are constant for all individuals in the same peer group

Manski considers 3 types of peer effects.
- Outcomes for individuals in the same group may be correlated because of a shared environment (Correlated Peer Effects, captured by $Z_i$)
- Exogenous peer effects, captured by group average $\bar X_i$ of the exogenous variables
- Endogenous peer effect, $\bar Y_i$.

Manski concludes that identification of these effects, even in the linear model setting, relies on very strong assumptions and is unrealistic in many settings. In subsequent empirical work, researchers have often ruled out some of these effects in order to identify others.
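
As a rough illustration of the group-average regressors $\bar Y_i$ and $\bar X_i$ in the linear-in-means model, the sketch below computes peer-group averages for each individual. Whether an individual's own value is included in their peer average is not specified in these notes, so the `leave_one_out` switch is an assumption, as are the function name and the toy data.

```python
from collections import defaultdict


def peer_means(values, groups, leave_one_out=True):
    """Average of `values` over each individual's peer group.

    With leave_one_out=True the individual's own value is excluded
    (an assumed convention; every group then needs at least two members).
    With leave_one_out=False the plain group mean is used.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1
    means = []
    for value, group in zip(values, groups):
        if leave_one_out:
            means.append((totals[group] - value) / (counts[group] - 1))
        else:
            means.append(totals[group] / counts[group])
    return means


# Hypothetical outcomes for five students in two classrooms.
classrooms = ["a", "a", "a", "b", "b"]
outcomes = [1.0, 2.0, 3.0, 10.0, 20.0]
print(peer_means(outcomes, classrooms))
print(peer_means(outcomes, classrooms, leave_one_out=False))
```

With plain group means, $\bar Y_i$ is identical for everyone in a group, just like the group characteristics $Z_i$, which gives some intuition for why separating the three effects is hard.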

### 2.4.2 Models for Network Formation

Another part of the literature has focused on developing models for network formation.

> It is not super clear what a network formation is as this section is very brief.

### 2.4.3 Exact Tests for Interactions

One challenge in testing hypotheses about peer effects using methods based on standard asymptotic theory is that when individuals interact (e.g., in a network), it is not clear how interactions among individuals would change as the network grows.

## 2.5 Randomization Inference and Causal Regressions

In recent empirical work, data from randomized experiments are often analyzed using conventional regression methods and are generally well-accepted as valid, especially for larger samples.

## 2.6 External Validity

One concern that has been raised in many studies of causal effects is that of external validity.
Even if a causal study is done carefully, either in analysis or by design, so that the internal
validity of such a study is high, there is often little guarantee that the causal effects are valid
for populations or settings other than those studied. This concern has been raised particularly
forcefully in experimental studies where the internal validity is guaranteed by design.

## 2.7 Leveraging Experiments

At an abstract level, the observational data
are used to estimate rich models that allow one to answer many questions, but the model is
forced to accommodate the answers from the experimental data for the limited set of questions
the latter can address. Doing so will improve the answers from the observational data without
compromising their ability to answer more questions.

Here we discuss two specific settings where experimental studies can be leveraged in combination with observational studies to provide richer answers than either of the designs could provide on their own.

### 2.7.1 Surrogate Variables

### 2.7.2 Experiments and Observational Studies

### 2.7.3 Multiple Experiments

Consider a setting where a number of experiments were
conducted. The experiments may vary in terms of the population that the sample is drawn from,
or in the exact nature of the treatments included. The researcher may be interested in combining
these experiments to obtain more efficient estimates, predicting the effect of a treatment in
another population, or estimating the effect of a treatment with different characteristics. Such
inferences are not validated by the design of the experiments, but the experiments are important
in making such inferences more credible.

# 3. Supplementary Analyses

The point of the supplementary analyses is to shed light on the credibility of the primary analyses. They are intended to probe the identification strategy underlying the primary analyses.

The supplementary analyses are often based on careful and creative examinations of the identification strategy. Although at first glance, this creativity may appear application-specific, in this section we try to highlight some common themes.

The supplementary analyses can take on a variety of forms, and we are not aware of a comprehensive survey to date.

> as of 2016, anyway (when this literature review was published)

## 3.1 Placebo Analyses

The most widely used of the supplementary analyses is what is often referred to as a *placebo analysis*.

In this case the researcher replicates the primary analysis with the outcome replaced by a pseudo outcome that is known not to be affected by the treatment. Thus, the true value of the estimand for this pseudo outcome is zero, and the goal of the supplementary analysis is to assess whether the adjustment methods employed in the primary analysis when applied to the pseudo outcome lead to estimates that are close to zero, taking into account the statistical uncertainty.
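
The mechanics of a placebo check can be sketched as follows. A simple difference in means stands in for whatever adjustment method the primary analysis uses; that stand-in, the 1.96 cutoff, the data, and the function names are all assumptions for illustration.

```python
import math
from statistics import mean, variance


def difference_in_means(outcomes, treatment):
    """Return (estimate, standard error) for treated minus control means."""
    treated = [y for y, w in zip(outcomes, treatment) if w == 1]
    control = [y for y, w in zip(outcomes, treatment) if w == 0]
    estimate = mean(treated) - mean(control)
    std_error = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control))
    return estimate, std_error


def placebo_check(pseudo_outcomes, treatment, critical_value=1.96):
    """Apply the estimator to a pseudo outcome and flag large effects."""
    estimate, std_error = difference_in_means(pseudo_outcomes, treatment)
    z_stat = estimate / std_error
    return {
        "estimate": estimate,
        "z": z_stat,
        "passes": abs(z_stat) < critical_value,
    }


# Hypothetical pseudo outcome that the treatment cannot have affected.
placebo_treatment = [1, 0, 1, 0, 1, 0, 1, 0]
placebo_outcome = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1]
print(placebo_check(placebo_outcome, placebo_treatment))
```

An estimate far from zero relative to its standard error would suggest that the adjustment method is picking up something other than the treatment effect.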

Here we discuss settings in which this has been applied and provide some general guidance.

> The two elephants in the room that I'm thinking of, before moving onto the explanations are - how do we find a pseudo outcome not affected by treatment? Is this an assumption based on expert knowledge? Second, what if the pseudo outcome is near zero regardless of if the estimation strategy is applied or not? Does this not mean that what we really want is a pseudo outcome that is visibly *correlated* with treatment but has no causal relation.

### 3.1.1 Lagged Outcomes

One type of placebo test relies on treating lagged outcomes as pseudo outcomes. Of course, this is only reliable when it is reasonable to assume that lagged outcomes are not affected by the treatment: for example, while post-lottery earnings may depend on whether someone won the lottery or not, their earnings a year prior should be largely unaffected.

In this section's example, they test this separately for various subpopulations and combine the 4 tests in a chi-squared test. In their example, the p-value was 0.135, so the null was not rejected, which lends support to unconfoundedness holding in this study.
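
The notes do not say exactly how the four tests were combined. One standard way, assuming the four test statistics are independent and standard normal under the null, is to sum their squares and compare the total to a chi-squared distribution with 4 degrees of freedom. The z-statistics below are made up, and the function names are hypothetical.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-squared variable with even dof.

    Uses the closed form exp(-x/2) * sum_{k < dof/2} (x/2)**k / k!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive even dof")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(dof // 2))


def combined_placebo_p_value(z_stats):
    """Joint p-value for independent z-statistics, N(0, 1) under the null."""
    statistic = sum(z * z for z in z_stats)
    return chi2_sf_even_dof(statistic, dof=len(z_stats))


# Hypothetical z-statistics from four subpopulation placebo tests.
z_stats = [1.2, -0.8, 1.5, -1.6]
print(combined_placebo_p_value(z_stats))
```

A large joint p-value means the pseudo-outcome effects are jointly indistinguishable from zero, which is the pattern that supports unconfoundedness.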

However, they showed that in the popular LaLonde dataset, the null was rejected, raising concern that unconfoundedness does not hold there.

### 3.1.2 Covariates in Regression Discontinuity Analyses

In an RD design, covariates typically play only a minor role in the primary analyses, although they can improve precision (Imbens and Lemieux, 2008).

The reason is that in most applications of RD designs, the covariates are uncorrelated with the treatment conditional on the forcing variable being close to the threshold, thus they are not required for eliminating bias. However, these exogenous covariates can play an important role in assessing the plausibility of the design. This is because according to the identification strategy, they should be uncorrelated with the treatment when the forcing variable is close to the threshold but there is nothing in the data that guarantees that this holds.

Thus, one way to test this is to use a covariate as the pseudo outcome in a regression discontinuity analysis.

### 3.1.3 Multiple Control Groups

Another example of the use of placebo regressions is Rosenbaum et al., 1987.

> this section is very brief and doesn't offer much explanation so I've skipped it

## 3.2 Robustness and Sensitivity

Remember that the classical frequentist statistical paradigm suggests that a researcher specifies a single statistical model. The researcher then estimates this model on the data, and reports estimates and standard errors. The standard errors and the corresponding confidence intervals are valid under the assumption that the model is correctly specified and estimated only once. Of course, this is far from common practice. In practice, a researcher considers many specifications and performs various specification tests before settling on a preferred model. Not all the intermediate estimation results and tests are reported.

A common practice in modern empirical work is to present in the final paper estimates of the preferred specification of the model, in combination with assessments of the robustness of the
findings from this preferred specification.

These alternative specifications are *not* intended to be interpreted as statistical tests for the validity of the preferred model, rather they are intended to convey that the substantive results of the preferred specification are not sensitive to some of the choices in the specification.

These alternative specifications may involve different functional forms of the regression function or different ways of controlling for differences in subpopulations.

Recently there has been some work trying to make these efforts at assessing robustness more systematic.

Athey and Imbens (2015) propose an approach to this problem.

Another place where it is natural to assess robustness is in estimation of average treatment effects $\mathbb{E}[Y_i(1)-Y_i(0)]$ under unconfoundedness or selection on observables,

$$W_i\perp\!\!\!\perp\left(Y_i(0),Y_i(1)\right)|X_i$$

The theoretical literature has developed many estimators in the setting with unconfoundedness. Some rely on estimating the conditional mean $\mathbb{E}[Y_i|X_i,W_i]$, some rely on estimating the propensity score $\mathbb{E}[W_i|X_i]$, while others rely on matching on the covariates or the propensity score. See Imbens and Wooldridge (2009) for a review of this literature. We believe that researchers should not rely on a single method, but report estimates based on a variety of methods to assess robustness.

Rosenbaum and Rubin propose another approach called sensitivity analysis, where they explore how strong unobserved confounding would have to be to meaningfully change the effect estimate.

## 3.3 Identification and Sensitivity

Gentzkow and Shapiro (2015) take a different approach to sensitivity. They propose a method
for highlighting what statistical relationships in a dataset are most closely related to parameters
of interest. Intuitively, the idea is that covariation between particular sets of variables may determine the magnitude of model estimates.

## 3.4 Supplementary Analysis in RD Designs

> Skipping this for now as I'm not super familiar with RD (the most I know is this literature review's review of RD up above)

# 4. Machine Learning and Econometrics

In recent years there have been substantial advances in flexible methods for analyzing data in
computer science and statistics, a literature that is commonly referred to as the “machine learning” literature.

These methods have made only limited inroads into the economics literature,
although interest has increased substantially very recently.

There are two broad categories of
machine learning, “supervised” and “unsupervised” learning. “Unsupervised learning” focuses
on methods for finding patterns in data, such as groups of similar items. In the parlance of this
review, it focuses on reducing the dimensionality of covariates in the absence of outcome data.

Unsupervised learning can be used as a first step
in a more complex model. For example, instead of including as covariates indicator variables
for whether a unit (a document) contains each of a very large set of words in the English language, unsupervised learning can be used to put documents into groups, and then subsequent
models could use as covariates indicators for whether a document belongs to one of the groups.

In this review, we focus primarily on problems of causal inference, showing how supervised machine
learning methods can be used to improve the performance of causal analysis, particularly in cases
with many covariates.

We also highlight a number of differences in focus between the supervised machine learning
literature and the econometrics literature on nonparametric regression.

A leading difference is
that the supervised machine learning literature focuses on how well a prediction model does in
minimizing the mean-squared error of prediction in an independent test set, often without much
attention to the asymptotic properties of the estimator.

The focus on minimizing mean-squared
error on a new sample implies that predictions will make a bias-variance tradeoff; successful
methods allow for bias in estimators (for example, by dampening model parameters towards
the mean) in order to reduce the variance of the estimator. Thus, predictions from machine
learning methods are not typically unbiased, and estimators may not be asymptotically normal
and centered around the estimand.

The fact that model performance (in the sense of predictive accuracy on a test set) can be
directly measured makes it possible to meaningfully compare predictive models, even when their
asymptotic properties are not understood. It is perhaps not surprising that enormous progress
has been made in the machine learning literature in terms of developing models that do well
(according to the stated criteria) in real-world datasets.

Here, we briefly review some of the
supervised machine learning methods that are most popular and also most useful for causal
inference, and relate them to methods traditionally used in the economics and econometrics literatures.


## 4.1 Prediction Problems

The first problem we discuss here is that of nonparametric estimation of regression functions.

The target is the conditional expectation

$$g(x)=\mathbb{E}[Y_i|X_i=x]$$

Traditional methods in econometrics are based on kernel regression or nearest neighbor methods.

For example, in KNN methods, $\hat g(x)$ is the sample average of the $K$ nearest observations to $x$ in Euclidean distance, with $K$ being a tuning parameter.

In the supervised machine learning literature, $K$ might be chosen through cross-validation to minimize MSE. In economics, however, bias reduction is paramount, so it is more common to use a small value of $K$.

> A small $K$ causes the model to fit the training set "better", although it may induce higher variance.
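The KNN estimator described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `knn_regress` and the toy data are made up, and ties in distance are broken by sample order.

```python
import math

def knn_regress(x_train, y_train, x0, k):
    """Estimate g(x0) as the mean outcome of the k training points nearest to x0."""
    # Pair each Euclidean distance with its outcome, then sort by distance only.
    by_distance = sorted(
        ((math.dist(xi, x0), yi) for xi, yi in zip(x_train, y_train)),
        key=lambda pair: pair[0],
    )
    nearest = by_distance[:k]
    return sum(yi for _, yi in nearest) / len(nearest)

# Toy data (made up for illustration): y = x^2 on a grid, so g(2) = 4.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
Y = [0.0, 1.0, 4.0, 9.0, 16.0]

g_k1 = knn_regress(X, Y, (2.0,), k=1)  # uses only the point at x = 2
g_k3 = knn_regress(X, Y, (2.0,), k=3)  # averages the points at x = 1, 2, 3
g_k5 = knn_regress(X, Y, (2.0,), k=5)  # averages the whole sample
print(g_k1, g_k3, g_k5)
```

On this noise-free toy sample the estimate drifts away from the true value 4 as $K$ grows, which is the bias side of the tradeoff; with noisy outcomes a larger $K$ would also average away more noise.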

Kernel regression is similar, but a weighting function is used to weight observations near $x$ more heavily than those far away. However, these kernel estimators are known to perform poorly when the dimension of $X_i$ is high.
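A kernel-weighted average of this kind can be sketched as follows for a scalar covariate. The Gaussian kernel, the function name `kernel_regress`, and the data are assumptions chosen for illustration; the text does not commit to a particular kernel.

```python
import math

def kernel_regress(x_train, y_train, x0, bandwidth):
    """Kernel-weighted average of outcomes around x0 (scalar x, Gaussian kernel)."""
    # Observations close to x0 get weights near 1, distant ones near 0.
    weights = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x_train]
    return sum(w * yi for w, yi in zip(weights, y_train)) / sum(weights)

# Toy data (made up): y = 2x on a grid.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [0.0, 2.0, 4.0, 6.0, 8.0]

g_mid = kernel_regress(X, Y, 2.0, bandwidth=1.0)          # symmetric weights around x = 2
g_edge_narrow = kernel_regress(X, Y, 0.0, bandwidth=0.1)  # essentially only the point at x = 0
g_edge_wide = kernel_regress(X, Y, 0.0, bandwidth=1e6)    # essentially the sample mean
print(g_mid, g_edge_narrow, g_edge_wide)
```

The bandwidth plays the role that $K$ plays for nearest neighbors: a narrow bandwidth tracks the local value, while a very wide one collapses to the overall sample mean.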

Other alternatives for nonparametric regression such as basis functions also do not work too well in high-dimensional cases.



### 4.1.1 Penalized Regression

Penalized regression is one of the most important classes of methods in the supervised machine learning literature.

The most popular member of the class of penalized regression models is the LASSO (Tibshirani, 1996).

This estimator imposes a linear model for outcomes as a function of covariates and attempts to minimize an objective that includes the sum of square residuals as in OLS, but also adds on an additional term penalizing the magnitude of regression parameters.

Formally, the objective function for these penalized regression models, after demeaning the covariates and outcome, and standardizing the variance of the covariates, can be written as

$$\min_{\beta_1,...,\beta_K}\sum_{i=1}^{N}\left(Y_i-\sum_{k=1}^{K}\beta_k\cdot X_{ik}\right)^2+\lambda\cdot\|\beta\|$$

where $\|\cdot\|$ is a general norm. Note that the OLS estimator is not unique if there are more regressors than units ($K > N$). Positive values for $\lambda$ regularize this problem, so that the solution to the LASSO minimization problem is well defined even if $K>N$.

A key feature is that for some choices of the norm, the algorithm sets some of the $\beta_k$ exactly to zero, leading to a sparse model.

One such choice is the $L_0$ norm,

$$\|\beta\|=\sum_{k=1}^{K}1_{\beta_k\neq 0}$$

which leads to optimal subset selection.

Another interesting choice is the $L_2$ norm, which gives ridge regression: all $\beta_k$ are shrunk towards zero, but none are set exactly to zero.

The most interesting case is the $L_1$ norm, which gives the LASSO: some $\beta_k$ are shrunk towards zero while others become exactly zero.

The original paper (Tibshirani, 1996) discusses scenarios in which the LASSO performs better, namely when some $\beta_k$ are actually zero. In addition, because the LASSO leads to a sparse model, its results are easier to interpret and discuss.
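To make the contrast between the two penalties concrete, here is a minimal coordinate-descent sketch of the objective above. It is an illustration under stated assumptions, not the paper's algorithm: the function name `penalized_cd` and the data are made up, the ridge branch uses the conventional squared $L_2$ penalty $\lambda\sum_k\beta_k^2$, and the toy design is orthogonal so the numbers are easy to verify by hand.

```python
def penalized_cd(columns, y, lam, penalty, n_sweeps=100):
    """Coordinate descent for: sum of squared residuals + lam * penalty(beta).

    penalty="l1" is the LASSO; penalty="l2" is ridge with a squared L2 penalty.
    `columns` holds one list per (demeaned) regressor.
    """
    k = len(columns)
    beta = [0.0] * k
    for _ in range(n_sweeps):
        for j in range(k):
            # Residual that leaves out regressor j's current contribution.
            partial = [
                yi - sum(beta[m] * columns[m][i] for m in range(k) if m != j)
                for i, yi in enumerate(y)
            ]
            rho = sum(xj * ri for xj, ri in zip(columns[j], partial))
            z = sum(xj * xj for xj in columns[j])
            if penalty == "l1":
                # Soft-thresholding: exactly zero whenever |rho| <= lam / 2.
                shrunk = max(abs(rho) - lam / 2, 0.0)
                beta[j] = (shrunk if rho > 0 else -shrunk) / z
            else:
                # Ridge shrinks proportionally; zero only if rho itself is zero.
                beta[j] = rho / (z + lam)
    return beta

# Toy orthogonal design (made up): OLS would give coefficients (2, 0.1).
x1 = [1.0, 1.0, -1.0, -1.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y = [2.0 * a + 0.1 * b for a, b in zip(x1, x2)]

beta_lasso = penalized_cd([x1, x2], y, lam=2.0, penalty="l1")
beta_ridge = penalized_cd([x1, x2], y, lam=2.0, penalty="l2")
print(beta_lasso, beta_ridge)
```

With $\lambda=2$ the LASSO keeps the large coefficient (shrunk from 2 to 1.75) and sets the small one exactly to zero, while ridge shrinks both and leaves the small one nonzero.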

There are further extensions such as elastic net, as well as extensions of the basic LASSO that allow for nonlinear regression (e.g., logistic) and for selection of groups of parameters.

One can think of the penalty term as taking into account the "over-fitting" error, which corresponds to the expected difference between in-sample goodness of fit and out-of-sample goodness of fit.

Unlike many supervised machine learning methods, there is a large literature on the formal asymptotic properties of the LASSO; this may make the LASSO more attractive as an empirical method in economics. Under some conditions, standard least squares confidence intervals that ignore the variable selection feature of the LASSO are valid.

In addition, it is important to recognize that regularized regression models reward parsimony: if there are several correlated variables, LASSO will prefer to put more weight on one and drop the others. Thus, individual coefficients should be interpreted with caution in moderate sample sizes or when sparsity is not known to hold.

### 4.1.2 Regression Trees

The classic reference for regression trees is Breiman et al. (1984). Given a sample with $N$ units and a set of regressors $X_i$, the idea is to sequentially partition the covariate space into subspaces in a way that reduces the sum of squared residuals as much as possible.
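A single step of this partitioning can be sketched as a search over covariates and thresholds for the split that minimizes the total sum of squared residuals. The names `sse` and `best_split` and the data are hypothetical, and candidate thresholds are taken to be the observed covariate values, which is one common convention rather than something the text specifies.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty leaf)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, y):
    """Return (feature index, threshold, total SSR) of the best split x[j] <= c."""
    best = (None, None, sse(y))  # start from the unsplit sample
    for j in range(len(X[0])):
        # The largest observed value is skipped: it would leave the right leaf empty.
        for c in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= c]
            right = [yi for row, yi in zip(X, y) if row[j] > c]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, c, total)
    return best

# Toy data (made up): the outcome jumps when the first covariate exceeds 3.
X = [(1, 7), (2, 3), (3, 9), (4, 1), (5, 8), (6, 2)]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

feature, threshold, ssr = best_split(X, y)
print(feature, threshold, ssr)
```

A full tree would apply `best_split` recursively within each resulting leaf and stop according to a tuning rule, such as a minimum leaf size or a penalty on the number of leaves.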

There is relatively little asymptotic theory on the properties of regression trees. A key problem in establishing such properties is that the estimated regression function is a non-smooth step function.

Regression trees are generally dominated by other, more continuous models when the only goal is prediction.

### 4.1.3 Random Forests

Random forests are among the most popular supervised ML methods, known for reliable "out-of-the-box" performance that does not require a lot of model tuning.

### 4.1.4 Boosting

A general way to improve simple machine learning methods is boosting.

The idea is to apply the simple method repeatedly: after the first application, we calculate the residuals, apply the same method to those residuals, and repeat. The sum of the fitted pieces is a flexible regression function.
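The residual-fitting loop can be sketched with a one-split regression tree (a "stump") as the simple method. All names here (`fit_stump`, `boost`, `predict`, `training_ssr`) and the data are made up for illustration, and the sketch omits refinements such as a shrinkage factor on each round.

```python
def fit_stump(x, r):
    """Least-squares stump on scalar x: returns (threshold, left mean, right mean)."""
    best = None
    for c in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= c]
        right = [ri for xi, ri in zip(x, r) if xi > c]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        ssr = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or ssr < best[0]:
            best = (ssr, c, lm, rm)
    return best[1:]

def boost(x, y, n_rounds):
    """Fit a stump, replace the target by its residuals, and repeat."""
    stumps, resid = [], list(y)
    for _ in range(n_rounds):
        c, lm, rm = fit_stump(x, resid)
        stumps.append((c, lm, rm))
        resid = [ri - (lm if xi <= c else rm) for xi, ri in zip(x, resid)]
    return stumps

def predict(stumps, x0):
    """The boosted prediction is the sum of all stump predictions."""
    return sum(lm if x0 <= c else rm for c, lm, rm in stumps)

# Toy data (made up): y = x^2, which a single stump fits poorly.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]

def training_ssr(n_rounds):
    stumps = boost(x, y, n_rounds)
    return sum((yi - predict(stumps, xi)) ** 2 for xi, yi in zip(x, y))

ssr_by_rounds = [training_ssr(r) for r in (0, 1, 10)]
print(ssr_by_rounds)
```

The in-sample sum of squared residuals falls as rounds are added; in practice the number of rounds is a tuning parameter chosen by out-of-sample performance, since fitting residuals indefinitely would overfit.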

### 4.1.5 Super Learners and Ensemble methods

One theme in the supervised ML literature is that model averaging often performs very well; many contests such as those held by Kaggle are won by algorithms that average many models.

The idea of Super Learners in Van der Laan et al. (2007) is to use model performance to construct weights, so that better performing models receive more weight in the averaging.
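As a rough illustration of performance-based weighting, the sketch below weights each model by the inverse of its held-out mean-squared error. This is a deliberately simplified stand-in and not the Super Learner procedure itself, whose weights come from a more careful cross-validated fit; the function names and numbers are made up.

```python
def mse(pred, y):
    """Mean-squared error of one model's held-out predictions."""
    return sum((p - yi) ** 2 for p, yi in zip(pred, y)) / len(y)

def performance_weights(held_out_preds, y):
    """Weights proportional to 1 / held-out MSE, normalized to sum to one.

    Assumes no model fits the held-out data perfectly (MSE > 0).
    """
    inverse = [1.0 / mse(pred, y) for pred in held_out_preds]
    total = sum(inverse)
    return [v / total for v in inverse]

def ensemble_predict(new_preds, weights):
    """Weighted average of the individual models' predictions for one new unit."""
    return sum(w * p for w, p in zip(weights, new_preds))

# Held-out outcomes and two models' predictions for them (made-up numbers).
y_held_out = [1.0, 2.0, 3.0, 4.0]
preds_a = [1.0, 2.0, 3.0, 5.0]  # held-out MSE 0.25
preds_b = [3.0, 4.0, 5.0, 6.0]  # held-out MSE 4.0

weights = performance_weights([preds_a, preds_b], y_held_out)
combined = ensemble_predict([10.0, 12.0], weights)
print(weights, combined)
```

The better-performing model receives weight 16/17 here, so the combined prediction sits close to its prediction of 10.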

## 4.2 Machine Learning Methods for Average Causal Effects

Machine learning methods have been introduced into the estimation of average causal effects to account for the presence of many covariates.

### 4.2.1 Propensity Score Methods

To deal with many covariates, researchers have proposed estimating the propensity score using random forests, boosting, or LASSO, and then using weights based on those estimates, following the usual approaches from the existing literature.

However, procedures such as "trimming" the data to eliminate extreme values of the estimated propensity score (thus changing the estimand) remain important.
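A minimal sketch of weighting with trimming, assuming the propensity scores have already been estimated by some first-stage method. The function name `ipw_ate`, the trimming bounds of 0.1 and 0.9, the use of normalized weights, and the data are all illustrative assumptions rather than choices taken from the text.

```python
def ipw_ate(w, y, e_hat, lower=0.1, upper=0.9):
    """Normalized inverse-propensity-weighted ATE after trimming on e_hat.

    Returns (estimate, number of units kept). Units with an estimated
    propensity score outside [lower, upper] are dropped, which changes the
    estimand to the average effect in the retained subpopulation.
    """
    kept = [(wi, yi, ei) for wi, yi, ei in zip(w, y, e_hat) if lower <= ei <= upper]
    treated = [(yi, 1.0 / ei) for wi, yi, ei in kept if wi == 1]
    control = [(yi, 1.0 / (1.0 - ei)) for wi, yi, ei in kept if wi == 0]
    mean_treated = sum(yi * wt for yi, wt in treated) / sum(wt for _, wt in treated)
    mean_control = sum(yi * wt for yi, wt in control) / sum(wt for _, wt in control)
    return mean_treated - mean_control, len(kept)

# Made-up sample: the last two units have extreme estimated propensity scores.
w = [1, 0, 1, 0, 1, 0]
y = [3.0, 1.0, 4.0, 2.0, 10.0, 0.0]
e_hat = [0.5, 0.5, 0.8, 0.2, 0.99, 0.02]

ate_trimmed, n_trimmed = ipw_ate(w, y, e_hat)
ate_all, n_all = ipw_ate(w, y, e_hat, lower=0.0, upper=1.0)
print(ate_trimmed, n_trimmed, ate_all, n_all)
```

Dropping the two units with extreme scores moves the estimate from roughly 3.9 to 2.0 in this example, which shows how sensitive weighting estimators can be to observations with propensity scores near zero or one.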

### 4.2.2 Regularized Regression Methods

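Using LASSO directly to estimate the average treatment effect by regression under unconfoundedness can perform poorly: the penalty may drop covariates that are strongly correlated with the treatment but only modestly related to the outcome, which biases the estimated effect.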

Belloni et al. (2013) propose a modification of the LASSO, a "double selection" procedure, that addresses these concerns and restores the ability of LASSO-based regression to produce valid causal estimates.
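Double selection is commonly described in three steps: select covariates that predict the outcome, select covariates that predict the treatment, then run OLS of the outcome on the treatment and the union of the two selected sets. The sketch below follows that description on a made-up, noise-free example; the helper names, the single penalty value, and the continuous "treatment" are simplifying assumptions, and all variables are demeaned so no intercept is needed.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def lasso_select(columns, y, lam, n_sweeps=50):
    """Indices of covariates with nonzero LASSO coefficients (coordinate descent)."""
    k = len(columns)
    beta = [0.0] * k
    for _ in range(n_sweeps):
        for j in range(k):
            partial = [yi - sum(beta[m] * columns[m][i] for m in range(k) if m != j)
                       for i, yi in enumerate(y)]
            rho = dot(columns[j], partial)
            shrunk = max(abs(rho) - lam / 2, 0.0)  # soft-thresholding
            beta[j] = (shrunk if rho > 0 else -shrunk) / dot(columns[j], columns[j])
    return {j for j in range(k) if beta[j] != 0.0}

def ols(columns, y):
    """OLS coefficients from the normal equations via Gauss-Jordan elimination."""
    k = len(columns)
    A = [[dot(columns[r], columns[c]) for c in range(k)] + [dot(columns[r], y)]
         for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][k] / A[r][r] for r in range(k)]

# Made-up orthogonal covariates; u is unobserved variation in the treatment.
x1 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
u = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
W = [a + 0.5 * b for a, b in zip(x1, u)]  # treatment depends strongly on x1
Y = [1.0 * wi - 0.9 * a + 2.0 * b for wi, a, b in zip(W, x1, x2)]  # true effect is 1

X = [x1, x2, x3]
from_outcome = lasso_select(X, Y, lam=4.0)    # misses x1: weak net relation to Y
from_treatment = lasso_select(X, W, lam=4.0)  # picks up x1
union = sorted(from_outcome | from_treatment)

tau_naive = ols([W] + [X[j] for j in sorted(from_outcome)], Y)[0]
tau_double = ols([W] + [X[j] for j in union], Y)[0]
print(from_outcome, from_treatment, tau_naive, tau_double)
```

Selecting on the outcome alone omits `x1` and gives an estimate of 0.28 instead of the true effect of 1, while adding the covariates selected in the treatment equation recovers it.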

### 4.2.3 Balancing and Regression

An alternative line of research has focused on finding weights that directly balance covariates
or functions of the covariates between treatment and control groups, so that once the data has
been re-weighted, it mimics more closely a randomized experiment.

## 4.3 Heterogeneous Causal Effects

A different problem is that of estimating the average effect of the treatment for each value of the features, that is, the conditional average treatment effect (CATE):

$$\tau(x)=\mathbb{E}[Y_i(1)-Y_i(0)|X_i=x]$$

The concern is that searching over many covariates and subsets of the covariate space may lead to spurious findings of treatment effect differences.

### 4.3.1 Multiple Hypothesis Testing

One approach to this problem is to exhaustively search for treatment effect heterogeneity and then correct for issues of multiple testing.

To address this problem, List et al. (2016) propose to discretize each covariate, and then loop through the covariates, testing whether the treatment effect is different when the covariate is low versus high.
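To see why a correction matters when looping over covariates, the sketch below applies Holm's step-down adjustment to a set of p-values, one per covariate. Holm's method is used here only as a simple, well-known stand-in and is not the procedure proposed by List et al.; the p-values are made up.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: controls the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject

# Made-up p-values from testing "low versus high" heterogeneity on four covariates.
p_values = [0.03, 0.001, 0.20, 0.02]

unadjusted = [p <= 0.05 for p in p_values]
adjusted = holm_reject(p_values)
print(unadjusted, adjusted)
```

Three of the four covariates look significant without adjustment, but only one survives the correction.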

A different approach is to adapt machine learning methods to discover particular forms of heterogeneity, as we discuss in the next section.

### 4.3.2 Subgroup Analysis

In some settings, it is useful to identify subgroups that have different treatment effects. Subgroup analysis has long been used in medical studies but is often subject to criticism due to concerns of multiple testing.

Athey and Imbens develop a method they call "causal trees".

The method is based on regression trees, and its goal is to identify a partition of the covariate space into subgroups based on treatment effect heterogeneity.

The output of the method is a treatment effect and a confidence interval for each subgroup.

The approach differs from standard regression trees. First, it uses a different criterion for tree building: rather than focusing on improvements in MSE of the prediction of outcomes, it focuses on MSE of treatment effects. Second, the method relies on "sample splitting" to ensure that confidence intervals have nominal coverage even when the number of covariates is large.
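The sample-splitting idea can be sketched with a single split on one covariate: one half of the data chooses the partition, and the other half estimates the treatment effect within each leaf. The splitting rule used here (maximize the squared gap between leaf effects) is a simplified stand-in for the criterion used in causal trees, and the function names and data are made up.

```python
def leaf_effect(units):
    """Treated-minus-control difference in mean outcomes; None if a group is empty."""
    treated = [y for _, w, y in units if w == 1]
    control = [y for _, w, y in units if w == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)

def choose_split(train):
    """Pick the threshold c that maximizes the squared gap between leaf effects."""
    best_c, best_gap = None, -1.0
    for c in sorted({x for x, _, _ in train})[:-1]:
        left = leaf_effect([u for u in train if u[0] <= c])
        right = leaf_effect([u for u in train if u[0] > c])
        if left is None or right is None:
            continue  # each leaf needs both treated and control units
        gap = (left - right) ** 2
        if gap > best_gap:
            best_c, best_gap = c, gap
    return best_c

def honest_effects(estimation, c):
    """Estimate leaf effects on data that played no role in choosing the split."""
    return (leaf_effect([u for u in estimation if u[0] <= c]),
            leaf_effect([u for u in estimation if u[0] > c]))

# Made-up units (x, w, y): the treatment effect is 0 for x <= 2 and 4 for x > 2.
train = [(1, 0, 1.0), (1, 1, 1.0), (2, 0, 1.0), (2, 1, 1.0),
         (3, 0, 1.0), (3, 1, 5.0), (4, 0, 1.0), (4, 1, 5.0)]
estimation = [(1, 0, 0.0), (1, 1, 0.0), (2, 0, 2.0), (2, 1, 2.0),
              (3, 0, 0.0), (3, 1, 4.0), (4, 0, 2.0), (4, 1, 6.0)]

split = choose_split(train)
effect_left, effect_right = honest_effects(estimation, split)
print(split, effect_left, effect_right)
```

Because the estimation half is not used to search for the split, the leaf estimates are not contaminated by that search, which is what allows standard confidence intervals to be computed within each leaf.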

### 4.3.3 Personalized Treatment Effects

Wager and Athey (2015) propose a method for estimating heterogeneous treatment effects based on random forests.

An alternative approach is based on Bayesian Additive Regression Trees (BART) (Chipman et al., 2010), which have also been applied to estimate heterogeneous treatment effects. BART is essentially a Bayesian version of random forests. Large sample properties of this method are unknown, but it appears to have good empirical performance in applications.

## 4.4 Machine Learning Methods with Instrumental Variables

The first stage in instrumental variables is typically purely a predictive exercise, where the conditional expectation of the endogenous variables is estimated using all the exogenous variables and excluded instruments. If there are many instruments, standard methods are known to have poor properties.

Belloni et al. (2013) develop LASSO methods for estimating the first stage in such settings.