# ECON5280 Lecture 10 Regularization, High-Dimensional Regression, and Matrix Completion

<font size="5">Junlong Feng</font>

## Outline

* Motivation: To introduce a powerful idea when handling high-dimensional data and how to use it for causal inference.
* High-dimensional (instrumental variable) regression: When $p\gg n$, what should we do?
* Netflix Prize competition and causal inference: Sometimes causal inference is like a Sudoku game.

## 1. High-dimensional (IV) Regression 

Suppose we have a linear regression model:
$$
Y=X\beta+\varepsilon,\quad \mathbb{E}(X_{i}\varepsilon_{i})=0\ \forall i.
$$
We learned that we can estimate $\beta$ by $\hat{\beta}=(X'X)^{-1}(X'Y)$. Suppose $X$ is $n\times p$; then $X'X$ is $p\times p$, and this matrix is invertible only if $X$ has rank $p$ (asymptotically). Matrix algebra tells us that $rank(X)\leq\min\{n,p\}$, so we need at least $n\geq p$. 

However, there are some datasets in this big data era where $p$ is much larger than $n$. Theoretically, we can model it as "$p\to\infty$ with $n\to\infty$ and $p/n\to\text{a constant or }\infty$".

- An online shopping website's customer data: $n$ is the number of registered users. $p$ is the number of variables about customer behavior that the website can track: browsing history, time spent on each page, etc. $p$ can potentially be very large.

When we do have $p>n$, OLS, and in fact every method we have learned so far, will no longer work because $X'X$ is no longer invertible.
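
The rank argument can be checked numerically. The following is a minimal sketch with simulated data; the dimensions `n = 5` and `p = 8` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                      # more regressors than observations
X = rng.normal(size=(n, p))

XtX = X.T @ X                    # the p x p matrix X'X
rank_XtX = np.linalg.matrix_rank(XtX)

# rank(X'X) = rank(X) <= min(n, p) = n < p, so X'X cannot be inverted.
print("shape of X'X:", XtX.shape)
print("rank of X'X:", rank_XtX)
```

Since the rank of $X'X$ cannot exceed $n$, the $p\times p$ matrix is singular whenever $p>n$.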

One way to solve this problem is to use **regularization**, or **penalty**. Below is a specific penalty called LASSO (**least absolute shrinkage and selection operator**):
$$
\hat{\beta}^{LASSO}=\arg\min_{b}(Y-Xb)'(Y-Xb)+\lambda\sum_{j=1}^{p}|b_{j}|.
$$
The first term on the right-hand side is the objective function of OLS. The second term is new: it is called a penalty. $\lambda$ is a positive constant that you choose. 

**What's the effect of the LASSO penalty?**

To understand how LASSO works, consider a simple case where $p=1$, so that $X'X$ is a scalar. If the minimizer $\hat{\beta}^{LASSO}$ turns out to be positive, then since $|\hat{\beta}^{LASSO}|=\hat{\beta}^{LASSO}$, the first-order condition gives
$$
-2X'(Y-X\hat{\beta}^{LASSO})+\lambda=0\implies X'X\hat{\beta}^{LASSO}=X'Y-\lambda/2\implies \hat{\beta}^{LASSO}=\hat{\beta}^{OLS}-\frac{\lambda}{2}(X'X)^{-1}.
$$
Similarly, if $\hat{\beta}^{LASSO}<0$, then we would have $\hat{\beta}^{LASSO}=\hat{\beta}^{OLS}+\frac{\lambda}{2}(X'X)^{-1}$.

Therefore, $|\hat{\beta}^{LASSO}|$ is never larger than $|\hat{\beta}^{OLS}|$, and it is strictly smaller whenever $\hat{\beta}^{OLS}\neq 0$. Specifically,
$$
\hat{\beta}^{LASSO}=\begin{cases}\max\{\hat{\beta}^{OLS}-\frac{\lambda}{2}(X'X)^{-1},0\}&\text{if } \hat{\beta}^{OLS}>0;\\\min\{\hat{\beta}^{OLS}+\frac{\lambda}{2}(X'X)^{-1},0\}&\text{if } \hat{\beta}^{OLS}<0.\end{cases}
$$
This is why LASSO is called a shrinkage estimator.
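
The closed form above can be checked against a brute-force minimization of the LASSO objective. This is a minimal sketch in plain Python; the four data points and the value $\lambda=6$ are made up for illustration, and the helper names `lasso_p1` and `objective` are hypothetical.

```python
def lasso_p1(xs, ys, lam):
    """Closed-form LASSO for one regressor: shift OLS toward 0 by lam / (2 X'X)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b_ols = sxy / sxx
    shift = lam / (2.0 * sxx)
    if b_ols > 0:
        return max(b_ols - shift, 0.0), b_ols
    return min(b_ols + shift, 0.0), b_ols


def objective(b, xs, ys, lam):
    """Sum of squared residuals plus the LASSO penalty."""
    return sum((y - x * b) ** 2 for x, y in zip(xs, ys)) + lam * abs(b)


# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.9, 2.3, 2.8, 4.1]

b_lasso, b_ols = lasso_p1(xs, ys, lam=6.0)

# Brute-force check: minimize the objective over a fine grid of candidate values.
grid = [i / 10000.0 for i in range(-20000, 20001)]
b_grid = min(grid, key=lambda b: objective(b, xs, ys, 6.0))

print("OLS:", b_ols)
print("LASSO (closed form):", b_lasso)
print("LASSO (grid search):", b_grid)
```

Here $X'X=30$, so the shift is $\lambda/(2X'X)=0.1$: the OLS estimate of about 1.01 is pulled down to about 0.91, and a sufficiently large $\lambda$ sets the estimate exactly to 0.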

**When does LASSO work?**

LASSO works when we assume that although $p\gg n$, the number of the **true** $\beta_{j}$s that are **not equal to 0** is **much smaller than $n$**. This assumption is called **sparsity**. Under sparsity, LASSO works because it sets a subset of the entries of $\hat{\beta}$ exactly equal to 0. 
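
To see the exact zeros in a case with $p\gg n$, here is a minimal sketch that minimizes the LASSO objective by cyclic coordinate descent, where each coordinate update is the one-regressor solution derived above. The solver `lasso_cd`, the simulated design, and the choice of $\lambda$ are all assumptions made for illustration; this is not a production implementation.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize (y - Xb)'(y - Xb) + lam * sum|b_j| by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.astype(float)                  # current residual y - Xb
    col_ss = (X ** 2).sum(axis=0)        # X_j'X_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r + col_ss[j] * b[j]    # X_j'(residual excluding j)
            # One-regressor LASSO solution: soft-threshold at lam / 2.
            b_new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            r -= X[:, j] * (b_new - b[j])
            b[j] = b_new
    return b


rng = np.random.default_rng(0)
n, p, s = 50, 200, 3                     # p >> n, only s nonzero coefficients
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:s] = [3.0, -2.0, 2.0]
y = X @ beta + 0.5 * rng.normal(size=n)

lam_max = 2.0 * np.max(np.abs(X.T @ y))  # at or above this, every coefficient is 0
lam = 0.3 * lam_max
b_hat = lasso_cd(X, y, lam)
b_zero = lasso_cd(X, y, 1.01 * lam_max)

print("indices of nonzero estimates:", np.flatnonzero(b_hat))
print("number of nonzero estimates:", np.count_nonzero(b_hat), "out of", p)
print("all zero for large lam:", not b_zero.any())
```

Most of the 200 estimated coefficients are exactly zero rather than merely small, which is what makes the problem tractable even though OLS is not defined here.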

**Bias-variance tradeoff and bias-correction**.

In the $p=1$ case above, $\hat{\beta}^{OLS}$ is unbiased by OLS theory. Then obviously LASSO introduces bias since $\lambda>0$. Its variance is usually smaller because the estimator is more concentrated around 0. A popular way to debias the estimator is to run another round of OLS only using the variables that are selected by LASSO. This procedure is called **post-LASSO**. 
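
The bias and its correction can be illustrated with a small Monte Carlo in the $p=1$ case. This is a minimal sketch under assumed values (true $\beta=1$, a fixed design with $X'X=n=100$, and $\lambda=40$, so the shift is 0.2). Because the regressor is selected in every draw here, the sketch only illustrates the bias and the post-LASSO fix, not the variance comparison.

```python
import random

random.seed(0)
n, beta, lam = 100, 1.0, 40.0
xs = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # fixed design with X'X = n
sxx = sum(x * x for x in xs)
shift = lam / (2.0 * sxx)                             # the LASSO shift, 0.2 here


def mean(values):
    return sum(values) / len(values)


ols, lasso, post = [], [], []
for _ in range(2000):
    ys = [beta * x + random.gauss(0.0, 1.0) for x in xs]
    b_ols = sum(x * y for x, y in zip(xs, ys)) / sxx
    # Closed-form LASSO for p = 1: shift OLS toward 0, truncating at 0.
    b_lasso = max(b_ols - shift, 0.0) if b_ols > 0 else min(b_ols + shift, 0.0)
    # Post-LASSO: rerun OLS if the regressor was selected, otherwise keep 0.
    b_post = b_ols if b_lasso != 0.0 else 0.0
    ols.append(b_ols)
    lasso.append(b_lasso)
    post.append(b_post)

print("mean OLS:       ", mean(ols))
print("mean LASSO:     ", mean(lasso))
print("mean post-LASSO:", mean(post))
```

The LASSO average sits about 0.2 below the truth, while post-LASSO coincides with OLS whenever the regressor is selected and so removes the shrinkage bias.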

**Implementation details**.

Since LASSO sets relatively small coefficients to 0, the coefficients for different regressors need to be comparable. Hence, one always needs to standardize the regressors first: demean each $X$ and divide it by its standard deviation so that the unit of $X$ does not matter.
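
A minimal sketch of the standardization step, using two simulated regressors on very different scales (the scales are hypothetical and chosen only to make the point visible):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical regressors measured in very different units.
X = np.column_stack([rng.normal(50_000.0, 10_000.0, size=200),
                     rng.normal(0.3, 0.05, size=200)])

# Demean each column, then divide by its sample standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print("column means after standardizing:", X_std.mean(axis=0))
print("column sds after standardizing:  ", X_std.std(axis=0, ddof=1))
```

After this step every column has mean 0 and standard deviation 1, so the same $\lambda$ penalizes all coefficients on a comparable scale.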

$\lambda$ can be chosen by, say, cross-validation.

### 1.1 LASSO and Causal Inference

Suppose we have a treatment $D$ and controls $W$: 
$$
Y_{i}=D_{i}\beta+W_{i}'\gamma+\varepsilon_{i}.
$$
Suppose we believe $\beta$ is causal (we explained in Lectures 5 and 6 when this is the case).

#### 1.1.1 Exogenous $D$ and High-Dimensional $W$

Suppose $\mathbb{E}(\varepsilon_{i}(D_{i},W_{i}')')=0$. We cannot estimate $\beta$ by OLS if $W$ is high-dimensional. It's also not good to directly run LASSO because:

- If we include $\beta$ in the penalty, $D$ may be dropped if $\beta$ is small in magnitude. 
- If we do not include $\beta$ in the penalty, $D$ is protected, but if LASSO mistakenly drops a $W$ whose omission causes OVB, we again get a biased estimate.

Belloni, Chernozhukov and Hansen (2014, *Journal of Economic Perspectives*) propose the **double selection** method to minimize the risk of dropping relevant controls:

1. Regress $Y$ on $W$ by LASSO. 
2. Regress $D$ on $W$ by LASSO.
3. Regress $Y$ on $D$ and the **union** of the selected control variables in 1 **and** 2 by **OLS**.
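
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the cited paper: the coordinate-descent solver `lasso_cd`, the simulated design (true $\beta=1$), and the hand-picked $\lambda=160$ are all hypothetical. In practice $\lambda$ would be chosen by a data-driven rule such as cross-validation.

```python
import numpy as np


def lasso_cd(X, y, lam, n_sweeps=100):
    """Minimize (y - Xb)'(y - Xb) + lam * sum|b_j| by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.astype(float)                  # current residual y - Xb
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ r + col_ss[j] * b[j]
            b_new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            r -= X[:, j] * (b_new - b[j])
            b[j] = b_new
    return b


rng = np.random.default_rng(0)
n, p = 200, 300                                    # more controls than observations
W = rng.normal(size=(n, p))                        # simulated, already on a common scale
D = W[:, 0] + W[:, 1] + rng.normal(size=n)         # two controls drive the treatment
Y = 1.0 * D + 2.0 * W[:, 0] + rng.normal(size=n)   # true beta = 1

lam = 160.0                                        # hand-picked for this design
sel_y = np.flatnonzero(lasso_cd(W, Y, lam))        # step 1: Y on W by LASSO
sel_d = np.flatnonzero(lasso_cd(W, D, lam))        # step 2: D on W by LASSO
union = np.union1d(sel_y, sel_d)                   # union of selected controls

# Step 3: OLS of Y on D and the union of the selected controls.
R = np.column_stack([D, W[:, union]])
coef, *_ = np.linalg.lstsq(R, Y, rcond=None)
beta_hat = coef[0]

print("controls selected in step 1:", sel_y)
print("controls selected in step 2:", sel_d)
print("double-selection estimate of beta:", beta_hat)
```

In this design the second control matters for $D$ even though its direct coefficient in the outcome equation is zero, so step 2 is what guarantees that it enters the final OLS regression.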

#### 1.1.2 Endogenous $D$ with High-Dimensional $W$ and $Z$ (Brief Overview)

Suppose $D$ is endogenous and there exists a high-dimensional $Z$. Since $W$ and $Z$ are both high-dimensional, the algorithm is now more complicated (Chernozhukov, Hansen and Spindler (2015), *AER: Papers & Proceedings*):

1. Regress $D$ on $(W,Z)$ by LASSO and obtain the fitted value $\hat{D}$.
2. Regress $Y$ on $W$ by LASSO and obtain the residual $\hat{u}_{YX}$.
3. Regress $\hat{D}$ on $W$ by LASSO and obtain the fitted value $\hat{\hat{D}}$.
4. Construct $\tilde{D}=(D-\hat{\hat{D}})$ and $\tilde{Z}=\hat{D}-\hat{\hat{D}}$.
5. Run 2SLS by treating $\hat{u}_{YX}$ as the dependent variable, $\tilde{D}$ as the endogenous variable and $\tilde{Z}$ as the IV.
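
Because step 5 has a single endogenous variable and a single instrument, the 2SLS estimator reduces to the simple IV ratio. As a sketch, assuming no intercept is included in the final step and writing $\hat{u}_{YX,i}$, $\tilde{D}_{i}$ and $\tilde{Z}_{i}$ for the $i$-th elements of the corresponding vectors:
$$
\hat{\beta}=\frac{\sum_{i=1}^{n}\tilde{Z}_{i}\hat{u}_{YX,i}}{\sum_{i=1}^{n}\tilde{Z}_{i}\tilde{D}_{i}}.
$$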

#### 1.1.3 Implementation

Both algorithms in Sections 1.1.1 and 1.1.2 can be implemented using the R package **hdm**. See https://cran.r-project.org/web/packages/hdm/index.html.

## 2. Causal Inference and Low-Rank Matrix Recovery

$***$ All examples and the estimator in this section are taken/adapted from Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2021). Matrix completion methods for causal panel data models. *Journal of the American Statistical Association*, *116*(536), 1716-1730.

So far, we only studied cross-sectional data: we observe multiple units for one time period. However, it's natural to have multiple time periods in a data set. If our data contain both multiple time periods and multiple units, we say we have a panel data set.

To fix ideas, suppose we only have a $Y$ variable: $\{Y_{it}:i=1,...,N;t=1,...,T\}$. Suppose there is a treatment variable $D$: For each individual $i$ and each time period $t$, $D_{it}\in\{0,1\}$. The so-called fundamental problem of causal inference is that $Y_{it}$ is equal to either $Y_{it}(1)$ or $Y_{it}(0)$ depending on $D_{it}$, but since $D_{it}$ only has one realization, we cannot observe both simultaneously. 

Let's denote the $N\times T$ matrix of all $Y_{it}(0)$s by $Y(0)$ and the $N\times T$ matrix of all $D_{it}$s by $D$.  Then if our $D$ is, for instance,
$$
D=\begin{pmatrix}0&0&1&\cdots&0\\
1&1&0&\cdots&0\\
\cdots&\cdots&\cdots&\cdots&\cdots\\
0&1&0&\cdots&1\end{pmatrix},
$$
then our $Y(0)$ matrix is:
$$
Y(0)=\begin{pmatrix}Y_{11}&Y_{12}&?&\cdots&Y_{1T}\\?&?&Y_{23}&\cdots&Y_{2T}\\\cdots&\cdots&\cdots&\cdots&\cdots\\Y_{N1}&?&Y_{N3}&\cdots&?\end{pmatrix}.
$$
As long as we can fill in the missing values, we get the causal effects, because $Y(1)$ is observed for exactly these entries.
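
In code, the partially observed $Y(0)$ matrix is just the outcome matrix with the treated cells masked out. A minimal sketch with a made-up $3\times 4$ panel:

```python
import numpy as np

# Hypothetical 3 x 4 panel: observed outcomes Y and treatment indicators D.
Y = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [0.5, 1.5, 2.5, 3.5]])
D = np.array([[0, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1]])

# Y(0) is observed only where D = 0; treated cells are missing (NaN).
Y0 = np.where(D == 0, Y, np.nan)
# Y(1) is observed exactly where Y(0) is missing.
Y1 = np.where(D == 1, Y, np.nan)

print(Y0)
print(Y1)
```

Given an estimate of the missing entries of `Y0`, the effect for each treated cell would be the corresponding entry of `Y1` minus that estimate.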

This setup is very general and nests many causal inference setups:

**Causal inference under unconfoundedness using cross-sectional data**:
$$
Y(0)=\begin{pmatrix}\checkmark\\\checkmark\\\vdots\\\checkmark\\?\\\vdots\\?\end{pmatrix}.
$$
**Causal inference under unconfoundedness using short panel (Difference-in-Differences)**:
$$
Y(0)=\begin{pmatrix}\checkmark&\checkmark\\\checkmark&\checkmark\\\vdots&\vdots\\\checkmark&\checkmark\\\checkmark&?\\\vdots&\vdots\\\checkmark&?\end{pmatrix}.
$$
**Synthetic Control**:
$$
Y(0)=\begin{pmatrix}\checkmark&\checkmark&\cdots& \checkmark&\cdots&\checkmark\\
\checkmark&\checkmark&\cdots& \checkmark&\cdots&\checkmark\\
\checkmark&\checkmark&\cdots& ?&\cdots&?\end{pmatrix}.
$$

### 2.1 Low-Rank Matrix Recovery

The $Y(0)$ matrix is formed by the same set of individuals. Their outcomes over time should exhibit some dependence. If the dependence is linear, the rank of the matrix would be low compared to $N$ and $T$.

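Suppose $rank(Y(0))\ll\min\{N,T\}$; then the estimation problem becomes:
$$
\hat{Y}(0)=\arg\min_{Y\text{ is low-rank}}\sum_{\{(i,t):D_{it}=0\}}(Y_{it}-Y_{it}(0))^{2}.
$$
Here $Y_{it}$ denotes the $(i,t)$ entry of the candidate matrix $Y$, and $Y_{it}(0)$ is observed whenever $D_{it}=0$. The question is how to impose the constraint.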

Emmanuel Candès, Benjamin Recht, Terry Tao, among others, pioneered a powerful method to estimate a low-rank matrix between 2005 and 2015. This method has wide applications: extracting humans/background from surveillance video, removing shadows from face images, etc. The idea is very similar to LASSO: 

- For an arbitrary symmetric matrix $M$, recall that it can be decomposed as $M=U\Sigma U'$, where $U$ is orthogonal and $\Sigma$ is a diagonal matrix containing the eigenvalues.
- The rank of a symmetric matrix is **equal to** the number of its nonzero eigenvalues.
- A symmetric matrix has low rank when many of its eigenvalues are zero.
- So the vector of eigenvalues is sparse for a low-rank matrix. 
- For a general (possibly non-square) matrix, the analogous objects are the singular values, and the rank equals the number of nonzero singular values.
- Therefore, if we impose the LASSO penalty on the singular values, we would presumably obtain a low-rank estimator, which approximates the true matrix well if the latter is low-rank.
- Recall that the LASSO penalty is the sum of absolute values. 
- Singular values are nonnegative, so the sum of their absolute values is just their sum. This sum is called the **nuclear norm** of the matrix $M$, denoted by $\|M\|_{*}$. 
- Therefore, the LASSO penalty applied to the singular values is equal to the nuclear norm.
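
The link between rank, singular values and the nuclear norm can be checked numerically. A minimal sketch with a simulated rank-2 matrix (the dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, r = 30, 20, 2
M = rng.normal(size=(N, r)) @ rng.normal(size=(r, T))   # rank 2 by construction

sing = np.linalg.svd(M, compute_uv=False)               # singular values, descending
nuclear = sing.sum()                                    # nuclear norm of M

print("numerical rank:", np.linalg.matrix_rank(M))
print("four largest singular values:", np.round(sing[:4], 6))
print("nuclear norm:", nuclear)
```

Only two of the twenty singular values are nonzero up to rounding error, so the vector of singular values is sparse in exactly the sense that LASSO exploits.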

Hence, we can estimate $Y(0)$ by:
$$
\hat{Y}(0)=\arg\min_{Y}\sum_{\{(i,t):D_{it}=0\}}(Y_{it}-Y_{it}(0))^{2}+\lambda\|Y\|_{*}.
$$
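
One common way to solve this kind of problem numerically is a SoftImpute-style iteration: fill the missing cells with the current estimate, then soft-threshold the singular values, which is the matrix analogue of the LASSO shift by $\lambda/2$. The following is a minimal sketch under stated assumptions, not the exact procedure of the cited paper: the helpers `svt` and `soft_impute`, the simulated rank-2 panel, the random treatment pattern, and $\lambda=1$ are all hypothetical choices for illustration.

```python
import numpy as np


def svt(A, tau):
    """Singular value soft-thresholding: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def soft_impute(Y_obs, observed, lam, n_iter=500):
    """Alternate between filling missing cells and thresholding singular values."""
    L = np.zeros_like(Y_obs)
    for _ in range(n_iter):
        filled = np.where(observed, Y_obs, L)   # data where observed, estimate elsewhere
        L = svt(filled, lam / 2.0)
    return L


rng = np.random.default_rng(0)
N, T, r = 30, 20, 2
Y0_true = rng.normal(size=(N, r)) @ rng.normal(size=(r, T))  # low-rank Y(0)
D = rng.random((N, T)) < 0.3                                 # treated cells
Y_obs = np.where(D, 0.0, Y0_true)                            # treated cells are never used

Y0_hat = soft_impute(Y_obs, ~D, lam=1.0)

rmse_missing = np.sqrt(np.mean((Y0_hat[D] - Y0_true[D]) ** 2))
rmse_zero = np.sqrt(np.mean(Y0_true[D] ** 2))                # benchmark: predict 0
print("RMSE on missing cells:", rmse_missing)
print("RMSE of predicting 0: ", rmse_zero)
```

The comparison with the trivial benchmark is only meant to show that the untreated outcomes of the treated cells are recovered from the observed cells; in an application $\lambda$ would be chosen by a data-driven rule such as cross-validation.
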
Some final remarks.

- Usually, nuclear norm-penalized estimators work in settings where both $N$ and $T$ are large.
- There are some inferential methods available. More to come.