# ECMA 31350: Lasso and Variations  
### 3rd TA Discussion – Post-Lasso, Partial Penalization, and Double Lasso  
**Date:** April 9, 2025  
**TA:** Lauren Qu

## Sala-i-Martin (1997) "I Just Ran Two Million Regressions"

### Background

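The study aims to identify the determinants of long-run economic growth across countries.
- The dataset includes 62 candidate explanatory variables.
- Rather than running every possible regression, the author restricts each regression to 7 explanatory variables: 3 that are always included, the variable being tested, and 3 drawn from the remaining pool.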

### Method & Assumption
A brute-force approach: run roughly 2 million regressions and track which variables are statistically significant in a large share of them.

### Problem
- Computationally infeasible when the number of covariates $p$ is large.
- Subset selection becomes intractable as $p$ grows, because the number of candidate models explodes combinatorially.
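
To make the combinatorial explosion concrete, the following sketch counts the regressions involved. It assumes the design reported in Sala-i-Martin (1997): 3 variables are always included, one of the remaining 59 is tested, and 3 more controls are drawn from the other 58. The variable names are illustrative.

```python
import math

n_candidates = 62                    # candidate explanatory variables in the dataset
n_fixed = 3                          # assumed: variables included in every regression
n_tested = n_candidates - n_fixed    # 59 variables, tested one at a time

# For each tested variable, choose 3 extra controls from the other 58.
per_variable = math.comb(n_tested - 1, 3)
total_restricted = n_tested * per_variable

# An unrestricted best-subset search considers every subset of the candidates.
total_subsets = 2 ** n_candidates

print(f"Regressions per tested variable: {per_variable:,}")
print(f"Total restricted regressions:    {total_restricted:,}")
print(f"All subsets of 62 variables:     {total_subsets:,}")
```

Even the restricted design needs about 1.8 million fits, while an unrestricted search over all subsets of 62 variables would need more than $4.6 \times 10^{18}$.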

### Motivation for Lasso
- Lasso offers a convex relaxation of the best subset selection problem.
- It performs variable selection efficiently, without searching all $2^p$ combinations.
- It is suitable when we believe only a few predictors are relevant (i.e., sparsity).


Topics covered in this discussion:

1. Post-Lasso
2. Partial Penalization
3. Double Lasso (with orthogonalization logic)
4. Neyman Orthogonality


### 1. Post-Lasso

#### Definition
Lasso performs both variable selection and shrinkage. However, the shrinkage introduces bias, especially when the true signal is strong. Post-Lasso is designed to remove the shrinkage bias while retaining variable selection.
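
A worked special case makes the shrinkage bias explicit. Assume, purely for illustration, an orthonormal design ($\frac{1}{n}X'X = I_p$) and the objective $\frac{1}{n}\sum_{i=1}^n (Y_i - X_i'b)^2 + \lambda \sum_{j=1}^p |b_j|$. Lasso then has a closed form in terms of the OLS coefficients:

$$
\hat{\beta}^{Lasso}_j = \operatorname{sign}(\hat{\beta}^{OLS}_j)\left(|\hat{\beta}^{OLS}_j| - \frac{\lambda}{2}\right)_+
$$

Every selected coefficient is pulled toward zero by $\lambda/2$, no matter how large the true signal is, while coefficients with $|\hat{\beta}^{OLS}_j| \le \lambda/2$ are set exactly to zero. Post-Lasso keeps the zeros but refits the surviving coefficients without the $\lambda/2$ offset.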

#### Procedure
1. Run Lasso on $Y \sim X$ and obtain the set of selected variables:
   $$
   \hat{J}_n = \{j: \hat{\beta}^{Lasso}_j \neq 0\}
   $$
2. Run OLS on the selected covariates $X_j$, $j \in \hat{J}_n$
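
The two steps above can be sketched in a few lines. This is a minimal illustration on simulated data, assuming scikit-learn's `LassoCV` for the selection step and `LinearRegression` for the refit; the data-generating process and all variable names are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]     # sparse truth: 3 relevant predictors
Y = X @ beta_true + rng.normal(size=n)

# Step 1: Lasso for selection; J_hat is the set of nonzero coefficients
lasso = LassoCV(cv=5).fit(X, Y)
J_hat = np.flatnonzero(lasso.coef_)

# Step 2: OLS refit on the selected columns only (no penalty, no shrinkage)
ols = LinearRegression().fit(X[:, J_hat], Y)
beta_post = np.zeros(p)
beta_post[J_hat] = ols.coef_

print("Selected indices:", J_hat)
print("Lasso coefs on true support:     ", np.round(lasso.coef_[:3], 3))
print("Post-Lasso coefs on true support:", np.round(beta_post[:3], 3))
```

Comparing the last two printed lines shows how much of the gap to the true coefficients comes from shrinkage rather than from selection.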

#### Advantage
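- Removes most of the shrinkage bias in the coefficients of the selected variables (the refit is unbiased only if selection recovers the true model)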
- Works well when the true model is sparse and selection is accurate

#### Reference
Belloni and Chernozhukov (2009) show that Post-Lasso retains the Lasso rate of convergence and reduces bias.


### 2. Partial Penalization

#### Motivation
In many applications, some variables are theoretically important (e.g., treatment, price, fixed effects) and should **not be penalized** during Lasso regularization.

#### Model
Let $X = (D, W)$, where:
- $D \in \mathbb{R}^d$: important variables (not penalized)
- $W \in \mathbb{R}^p$: high-dimensional controls (penalized)

The estimation problem becomes:
$$
\min_{b_1, b_2} \frac{1}{n} \sum_{i=1}^n (Y_i - D_i'b_1 - W_i'b_2)^2 + \lambda \sum_{j=1}^p |b_{2j}|
$$

#### Implementation
Partial out $D$ from both $Y$ and $W$ using the Frisch-Waugh-Lovell theorem, then perform Lasso on the residualized data.
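
The partialling-out recipe can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the data are simulated, $D$ contains an intercept plus one key regressor, and scikit-learn's `Lasso` is used, whose objective is $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\|w\|_1$, so its `alpha` corresponds to $\lambda/2$ in the notation above. The helper `residualize` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 40
D = np.column_stack([np.ones(n), rng.normal(size=n)])  # unpenalized: intercept + key regressor
W = rng.normal(size=(n, p))                            # penalized controls
b1_true = np.array([0.5, 2.0])
b2_true = np.zeros(p)
b2_true[:2] = [1.0, -1.0]
Y = D @ b1_true + W @ b2_true + rng.normal(size=n)

def residualize(A, D):
    """Return the residuals from a least-squares projection of A on D."""
    coef, *_ = np.linalg.lstsq(D, A, rcond=None)
    return A - D @ coef

# Partial D out of both Y and W (Frisch-Waugh-Lovell step)
Y_res = residualize(Y, D)
W_res = residualize(W, D)

# Lasso on the residualized data penalizes only the coefficients on W
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(W_res, Y_res)
b2_hat = lasso.coef_

# Recover the unpenalized coefficients by regressing Y - W b2_hat on D
b1_hat, *_ = np.linalg.lstsq(D, Y - W @ b2_hat, rcond=None)

print("Unpenalized coefficients on D:", np.round(b1_hat, 3))
print("Nonzero penalized coefficients:", np.flatnonzero(b2_hat))
```

The value `alpha=0.1` is arbitrary; in practice the penalty level would be chosen by cross-validation or a plug-in rule.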


### 3. Double Lasso: Estimation of Treatment Effects with High-Dimensional Controls

#### Goal
Estimate the treatment effect $\beta_1$ in:
$$
Y = D \beta_1 + W'\beta_2 + \varepsilon,\quad \mathbb{E}[\varepsilon|D, W] = 0
$$
when $W$ is high-dimensional and potentially correlated with both $D$ and $Y$.

#### Problem
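A single Lasso of $Y$ on $(D, W)$ selects controls by how well they predict $Y$. It may therefore drop a control that is highly correlated with $D$ but only moderately related to $Y$, and omitting that control biases the estimate of $\beta_1$ (omitted variable bias).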

#### Double Lasso Procedure
1. Run Lasso of $D \sim W$ → obtain $\hat{\gamma}$
2. Run Lasso of $Y \sim D + W$ → obtain $\hat{\beta}_2$
3. Estimate $\beta_1$ via orthogonal moment:
$$
\hat{\beta}_1 = \frac{\frac{1}{n} \sum_{i=1}^n (Y_i - W_i'\hat{\beta}_2)(D_i - W_i'\hat{\gamma})}{\frac{1}{n} \sum_{i=1}^n D_i (D_i - W_i'\hat{\gamma})}
$$

#### Reference
Belloni, Chernozhukov, and Hansen (2014), *Review of Economic Studies*

# Belloni, Chernozhukov, and Hansen (2014) – *“Inference on Treatment Effects after Selection among High-Dimensional Controls”*

## Goal
Estimate the **causal effect** of a treatment or policy variable (e.g., education, treatment assignment, price) **in the presence of many control variables**, where:
- The number of controls $p$ may be large relative to $n$
- The relevant control variables are assumed to be **sparse**

## Model Setup

Main model:
$$
Y = D \beta_1 + W'\beta_2 + \varepsilon, \quad \mathbb{E}[\varepsilon | D, W] = 0
$$

Auxiliary model (for selection bias correction):
$$
D = W'\gamma + \nu, \quad \mathbb{E}[\nu | W] = 0
$$

- $D$: treatment or policy variable (e.g., education, treatment assignment)
- $W$: high-dimensional control variables
- Lasso is used to select relevant $W$'s from both models


## Key Insight: Neyman Orthogonality

They construct a moment condition that is **robust to small errors in the nuisance parameters** $\beta_2, \gamma$:
$$
\psi(Y, D, W; \beta_1, \beta_2, \gamma) = (Y - D\beta_1 - W'\beta_2)(D - W'\gamma)
$$
At the true parameter values, this condition satisfies:
$$
\frac{\partial}{\partial \beta_2} \mathbb{E}[\psi] = 0, \quad \frac{\partial}{\partial \gamma} \mathbb{E}[\psi] = 0
$$

This means small estimation errors in $\hat{\beta}_2$ and $\hat{\gamma}$ have no first-order effect on the estimator of $\beta_1$, so it remains consistent even if the selection of controls is imperfect.
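
The orthogonality conditions can be checked directly from the two model equations. As a short derivation sketch, write $\varepsilon = Y - D\beta_1 - W'\beta_2$ and $\nu = D - W'\gamma$ at the true parameter values and differentiate under the expectation:

$$
\frac{\partial}{\partial \beta_2} \mathbb{E}[\psi] = -\mathbb{E}[W (D - W'\gamma)] = -\mathbb{E}[W \nu] = 0,
\qquad
\frac{\partial}{\partial \gamma} \mathbb{E}[\psi] = -\mathbb{E}[W (Y - D\beta_1 - W'\beta_2)] = -\mathbb{E}[W \varepsilon] = 0
$$

Both terms vanish because $\mathbb{E}[\nu | W] = 0$ and $\mathbb{E}[\varepsilon | D, W] = 0$. Setting $\mathbb{E}[\psi] = 0$ and solving for $\beta_1$ gives the population counterpart of the Double Lasso estimator:

$$
\beta_1 = \frac{\mathbb{E}[(Y - W'\beta_2)(D - W'\gamma)]}{\mathbb{E}[D (D - W'\gamma)]}
$$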


## Estimation Steps: Double Lasso

1. Run Lasso of $D$ on $W$ → get $\hat{\gamma}$
2. Run Lasso of $Y$ on $D$ and $W$ → get $\hat{\beta}_2$
3. Estimate $\hat{\beta}_1$ using the orthogonal moment:
$$
\hat{\beta}_1 = \frac{ \frac{1}{n} \sum_{i=1}^n (Y_i - W_i'\hat{\beta}_2)(D_i - W_i'\hat{\gamma}) }{ \frac{1}{n} \sum_{i=1}^n D_i(D_i - W_i'\hat{\gamma}) }
$$



In [None]:
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

# Simulate data
np.random.seed(42)
n, p = 500, 100
W = np.random.randn(n, p)
true_gamma = np.zeros(p); true_gamma[:2] = [0.5, 0.3]
D = W @ true_gamma + np.random.randn(n)
true_beta2 = np.zeros(p); true_beta2[:1] = [0.7]
Y = 1.5 * D + W @ true_beta2 + np.random.randn(n)

# Standardize W
scaler = StandardScaler()
W_scaled = scaler.fit_transform(W)

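# Step 1: Lasso of D ~ W (first stage); gamma_hat is on the standardized scale of W
lasso_D = LassoCV(cv=5).fit(W_scaled, D)
gamma_hat = lasso_D.coef_

# Step 2: Lasso of Y ~ D + W (note: D is penalized here along with W)
X_full = np.column_stack((D, W_scaled))
lasso_Y = LassoCV(cv=5).fit(X_full, Y)
beta_full = lasso_Y.coef_
beta1_tilde = beta_full[0]   # shrunken preliminary coefficient on D (not used below)
beta2_hat = beta_full[1:]    # coefficients on W

# Step 3: Orthogonalized estimation of beta_1
# Fitted intercepts are ignored: W_scaled is centered and D, Y are simulated with mean zero
W_gamma = W_scaled @ gamma_hat
W_beta2 = W_scaled @ beta2_hat

numerator = np.mean((Y - W_beta2) * (D - W_gamma))   # sample E[(Y - W'beta2)(D - W'gamma)]
denominator = np.mean(D * (D - W_gamma))             # sample E[D (D - W'gamma)]
beta1_hat = numerator / denominator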

print(f"Estimated treatment effect (Double Lasso): {beta1_hat:.4f}")


Estimated treatment effect (Double Lasso): 1.4377


In [None]:
# Naive OLS of Y on D and all controls: feasible here because p < n, but not when p > n
X_ols = sm.add_constant(np.column_stack((D, W_scaled)))
ols_model = sm.OLS(Y, X_ols).fit()
print(f"Naive OLS estimate on D: {ols_model.params[1]:.4f}")


Naive OLS estimate on D: 1.4243


## Comparison of Post-Lasso, Partial Penalization, and Double Lasso

| Feature / Method           | Post-Lasso                          | Partial Penalization                 | Double Lasso                                 |
|---------------------------|-------------------------------------|--------------------------------------|----------------------------------------------|
| **Goal**                  | Reduce bias after variable selection| Avoid penalizing key regressors      | Obtain valid causal inference with many controls |
| **Penalized Variables**   | All initially penalized             | Only subset of covariates penalized  | Only controls $W$ penalized              |
| **Bias**                  | Reduced (compared to Lasso)         | Reduced for key regressors           | Robust via orthogonalization                 |
| **Consistency for β₁**    | No guarantee                        | No guarantee                         | Yes (under regularity + orthogonality)       |
| **Interpretation**        | Improves estimation                 | Theory-driven modeling flexibility   | Supports valid inference                     |
| **Assumption**            | Correct model selection             | Known key variables                  | Approx. sparsity + moment orthogonality      |
| **Inference possible?**   | Risky (depends on selection accuracy)| Risky (unless known model)          | Yes (asymptotic normality holds)             |
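
The "asymptotic normality" entry for Double Lasso can be made concrete with a plug-in standard error. The sketch below is an illustration under stated assumptions rather than the procedure of any particular paper: it simulates data with the same design as the example above, uses the orthogonal moment $\psi_i = (Y_i - D_i\hat{\beta}_1 - W_i'\hat{\beta}_2)\hat{\nu}_i$ with $\hat{\nu}_i = D_i - W_i'\hat{\gamma}$, and estimates the variance of $\hat{\beta}_1$ as $\frac{1}{n} \cdot \frac{1}{n}\sum_i \psi_i^2 \big/ \big(\frac{1}{n}\sum_i D_i\hat{\nu}_i\big)^2$. The helper name `double_lasso_se` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def double_lasso_se(Y, D, W, cv=5):
    """Hypothetical helper: orthogonal-moment estimate of beta_1 with a plug-in SE."""
    n = len(Y)
    # First stage: residualize D on W (predict() includes the fitted intercept)
    lasso_D = LassoCV(cv=cv).fit(W, D)
    nu_hat = D - lasso_D.predict(W)
    # Outcome Lasso on (D, W); keep the intercept and W coefficients as nuisance fits
    lasso_Y = LassoCV(cv=cv).fit(np.column_stack((D, W)), Y)
    Y_adj = Y - lasso_Y.intercept_ - W @ lasso_Y.coef_[1:]
    # Solve the sample orthogonal moment for beta_1
    J = np.mean(D * nu_hat)
    beta1 = np.mean(Y_adj * nu_hat) / J
    # Plug-in variance based on the moment function
    psi = (Y_adj - D * beta1) * nu_hat
    se = np.sqrt(np.mean(psi ** 2) / J ** 2 / n)
    return beta1, se

# Simulated data (assumed design: true beta_1 = 1.5, sparse gamma and beta_2)
rng = np.random.default_rng(42)
n, p = 500, 100
W = rng.normal(size=(n, p))
gamma = np.zeros(p)
gamma[:2] = [0.5, 0.3]
D = W @ gamma + rng.normal(size=n)
beta2 = np.zeros(p)
beta2[0] = 0.7
Y = 1.5 * D + W @ beta2 + rng.normal(size=n)

beta1_hat, se = double_lasso_se(Y, D, W)
ci = (beta1_hat - 1.96 * se, beta1_hat + 1.96 * se)
print(f"beta_1 estimate: {beta1_hat:.3f}, SE: {se:.3f}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

This plug-in formula treats the nuisance fits as given, which is what orthogonality is meant to justify under approximate sparsity. It is a sketch, not a substitute for the variance estimators in the cited papers.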




## Recommended Use-Cases

### 1. **Post-Lasso**

- **When to use**:
  - You care about **prediction** or **point estimates**, but not inference
  - You believe Lasso selects the right variables
  - You want to reduce shrinkage bias
  
- **Example**:
  - **Belloni & Chernozhukov (2009)**:
    *“Least Squares After Model Selection in High-Dimensional Sparse Models”*, published in *Bernoulli*  
    > Shows Post-Lasso often dominates Lasso in mean-squared error if selection is correct


### 2. **Partial Penalization**

- **When to use**:
  - You **must include certain variables** due to theory (e.g., price, policy dummies, fixed effects)
  - You want flexible model selection for nuisance controls
  
- **Example**:
  - **DellaVigna & Gentzkow (2019)**:
    *“Uniform Pricing in U.S. Retail Chains”*, *Quarterly Journal of Economics*  
    > Important regressors like prices are **never penalized**, but other store-level or region-level controls are selected flexibly



### 3. **Double Lasso**

- **When to use**:
  - You want to **estimate a causal effect** of a treatment with many potential controls
  - You are concerned about **omitted variable bias**
  - You need **valid standard errors** and confidence intervals
  
- **Examples**:

  1. **Belloni, Chernozhukov, Hansen (2014)**  
     *“Inference on Treatment Effects After Selection Among High-Dimensional Controls”*, *Review of Economic Studies*  
     > Canonical paper introducing Double Lasso for estimating treatment effects

  2. **Chernozhukov et al. (2015)**  
     *“Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach”*, *Annual Review of Economics*  
     > Introduces Neyman orthogonality in a general double/debiased ML framework

  3. **Bartik et al. (2020)**  
     *“Using Machine Learning to Estimate Heterogeneous Treatment Effects”*, *AER: Insights*  
     > Combines Double Lasso and causal forests for robust treatment effect estimation



## Summary

- Use **Post-Lasso** when your goal is **better prediction** and you trust Lasso’s variable selection
- Use **Partial Penalization** when **economic theory mandates inclusion** of certain variables
- Use **Double Lasso** when you aim for **valid causal inference** in high-dimensional settings