# Double Machine Learning with Compressed Leave-One-Out Residuals

**Theory: From Leave-One-Out Residuals to Compressed Estimation**

The 'nonsmoothing' approach of [Delgado and Mora (1995, Econometrica)](https://e-archivo.uc3m.es/rest/api/core/bitstreams/bedb8220-ea10-4c31-98ae-293acdbbf0c3/content) estimates partially linear models using leave-one-out (LOO) residuals when the covariate vector is discrete, as is often the case in applications. Here, we derive an algebraic formulation that computes the necessary statistics from group-level aggregates alone, which keeps estimation efficient even on large datasets.

Consider the partially linear model:

$$Y_i = \beta X_i + g(Z_i) + U_i$$

where we are interested in estimating the parameter $\beta$. $Z_i$ can be a vector of multiple discrete covariates, which we concatenate into a single group indicator. The core idea of the partially linear model (PLM) estimator, the most popular DML method and one that goes back to Robinson (1988), is to estimate $\beta$ from a regression of residuals on residuals:

$$\tilde{Y}_i = \beta \tilde{X}_i + U_i$$

where $\tilde{Y}_i = Y_i - E[Y | Z_i]$ and $\tilde{X}_i = X_i - E[X | Z_i]$.


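The challenge is estimating the conditional expectations. The Delgado and Mora (1995) nonsmoothing approach uses a **leave-one-out mean** as the estimate. Let's define groups $\mathcal{G}$ based on unique values of $Z_i$, and let $N_g$ be the number of observations in group $g$. For an observation $i$ belonging to a group $g$ with $N_g \geq 2$ (a singleton group has no leave-one-out mean and cannot contribute), the LOO estimates are: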

$$
\hat{m}_{y(-i)}(Z_i) = \frac{1}{N_g - 1} \sum_{j \in g, j \neq i} Y_j \quad \text{and} \quad \hat{m}_{x(-i)}(Z_i) = \frac{1}{N_g - 1} \sum_{j \in g, j \neq i} X_j
$$

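Replacing the conditional expectations with these estimates gives the feasible residuals $\tilde{Y}_i = Y_i - \hat{m}_{y(-i)}(Z_i)$ and $\tilde{X}_i = X_i - \hat{m}_{x(-i)}(Z_i)$. The OLS estimator for $\beta$ is then $\hat{\beta} = (\sum_i \tilde{X}_i \tilde{Y}_i) / (\sum_i \tilde{X}_i^2)$.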
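Before any compression, the estimator can be sketched straight from these definitions. The version below is a minimal, dependency-free illustration: the function name `loo_beta_bruteforce` and the toy data are assumptions made for this example, not part of `duckreg`. It recomputes each leave-one-out mean from scratch, which is quadratic in the group size and therefore only suitable for small checks.

```python
from collections import defaultdict


def loo_beta_bruteforce(z, x, y):
    """Regress LOO residuals of y on LOO residuals of x (no intercept)."""
    # Collect the row indices that share each value of the group key z.
    members = defaultdict(list)
    for i, key in enumerate(z):
        members[key].append(i)

    num = 0.0  # running sum of x_tilde * y_tilde
    den = 0.0  # running sum of x_tilde ** 2
    for idx in members.values():
        n_g = len(idx)
        if n_g < 2:
            continue  # a singleton group has no leave-one-out mean
        for i in idx:
            # Mean of the other observations in the same group.
            m_x = sum(x[j] for j in idx if j != i) / (n_g - 1)
            m_y = sum(y[j] for j in idx if j != i) / (n_g - 1)
            x_tilde = x[i] - m_x
            y_tilde = y[i] - m_y
            num += x_tilde * y_tilde
            den += x_tilde ** 2
    return num / den


# Toy data (hypothetical): Y = 2 * X + g(Z) with no noise, so beta = 2.
z = ["a", "a", "a", "b", "b"]
x = [1.0, 2.0, 3.0, 0.0, 4.0]
y = [2.0 * xi + (5.0 if zi == "a" else -1.0) for xi, zi in zip(x, z)]
beta_hat = loo_beta_bruteforce(z, x, y)
```

Because $g(Z)$ is constant within a group, it cancels from the LOO residuals, and the noiseless toy data return the slope up to floating-point error.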

**The Algebraic Derivation for Compressed Data**

Our goal is to compute $\sum_i \tilde{X}_i^2$ and $\sum_i \tilde{X}_i \tilde{Y}_i$ using only group-level aggregates. Let $S_x^{(g)} = \sum_{j \in g} X_j$ be the sum of $X$ within group $g$. The LOO mean for observation $i$ in group $g$ can be written as:

$$
\hat{m}_{x(-i)} = \frac{S_x^{(g)} - X_i}{N_g - 1}
$$

The corresponding residual $\tilde{X}_i$ for observation $i$ is:

$$
\tilde{X}_i = X_i - \hat{m}_{x(-i)} = X_i - \frac{S_x^{(g)} - X_i}{N_g - 1} = \frac{(N_g - 1)X_i - S_x^{(g)} + X_i}{N_g - 1} = \frac{N_g X_i - S_x^{(g)}}{N_g - 1}
$$
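One consequence of this expression is worth spelling out. Writing $\bar{X}_g = S_x^{(g)} / N_g$ for the ordinary group mean, the LOO residual is just the usual within-group deviation scaled by a factor that depends only on the group size:

$$
\tilde{X}_i = \frac{N_g X_i - S_x^{(g)}}{N_g - 1} = \frac{N_g}{N_g - 1} \left( X_i - \bar{X}_g \right)
$$

The scaling factor is largest for small groups ($2$ when $N_g = 2$) and approaches $1$ as $N_g$ grows.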

Now, let's compute the sum of squared residuals **within group g**:

$$
\sum_{i \in g} \tilde{X}_i^2 = \sum_{i \in g} \left( \frac{N_g X_i - S_x^{(g)}}{N_g - 1} \right)^2 = \frac{1}{(N_g - 1)^2} \sum_{i \in g} (N_g^2 X_i^2 - 2 N_g X_i S_x^{(g)} + (S_x^{(g)})^2)
$$

Distributing the summation:

$$
= \frac{1}{(N_g - 1)^2} \left[ N_g^2 \sum_{i \in g} X_i^2 - 2 N_g S_x^{(g)} \sum_{i \in g} X_i + \sum_{i \in g} (S_x^{(g)})^2 \right]
$$

Let $S_{xx}^{(g)} = \sum_{i \in g} X_i^2$. We know $\sum_{i \in g} X_i = S_x^{(g)}$ and $\sum_{i \in g} (S_x^{(g)})^2 = N_g (S_x^{(g)})^2$. Substituting these in:

$$
= \frac{1}{(N_g - 1)^2} \left[ N_g^2 S_{xx}^{(g)} - 2 N_g (S_x^{(g)})^2 + N_g (S_x^{(g)})^2 \right] = \frac{N_g}{(N_g - 1)^2} \left( N_g S_{xx}^{(g)} - (S_x^{(g)})^2 \right)
$$
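As a quick sanity check of this identity, consider a hypothetical group with $N_g = 3$ and $X = (1, 2, 3)$, so that $S_x^{(g)} = 6$ and $S_{xx}^{(g)} = 14$. The residuals computed one observation at a time and the aggregate formula agree:

$$
\tilde{X} = \left( \frac{3 \cdot 1 - 6}{2},\ \frac{3 \cdot 2 - 6}{2},\ \frac{3 \cdot 3 - 6}{2} \right) = \left( -\tfrac{3}{2},\ 0,\ \tfrac{3}{2} \right), \qquad \sum_{i \in g} \tilde{X}_i^2 = \tfrac{9}{2} = \frac{3}{(3 - 1)^2} \left( 3 \cdot 14 - 6^2 \right)
$$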

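Let $S_y^{(g)} = \sum_{i \in g} Y_i$ and $S_{xy}^{(g)} = \sum_{i \in g} X_i Y_i$. Since $\tilde{Y}_i = (N_g Y_i - S_y^{(g)}) / (N_g - 1)$ by the same argument as for $\tilde{X}_i$, the sum of cross-products of residuals within group $g$ follows by exact analogy: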

$$
\sum_{i \in g} \tilde{X}_i \tilde{Y}_i = \frac{N_g}{(N_g - 1)^2} \left( N_g S_{xy}^{(g)} - S_x^{(g)} S_y^{(g)} \right)
$$
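For readers who want the intermediate step, expanding the product of the two numerators shows where the cancellation comes from:

$$
\sum_{i \in g} \left( N_g X_i - S_x^{(g)} \right) \left( N_g Y_i - S_y^{(g)} \right) = N_g^2 S_{xy}^{(g)} - N_g S_x^{(g)} S_y^{(g)} - N_g S_x^{(g)} S_y^{(g)} + N_g S_x^{(g)} S_y^{(g)} = N_g \left( N_g S_{xy}^{(g)} - S_x^{(g)} S_y^{(g)} \right)
$$

Dividing by $(N_g - 1)^2$ gives the expression above.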

The final estimator for $\beta$ is the ratio of the sum of these quantities across all groups:

$$
\hat{\beta} = \frac{\sum_{g \in \mathcal{G}} \sum_{i \in g} \tilde{X}_i \tilde{Y}_i}{\sum_{g \in \mathcal{G}} \sum_{i \in g} \tilde{X}_i^2} = \frac{\sum_{g \in \mathcal{G}} \frac{N_g}{(N_g - 1)^2} \left( N_g S_{xy}^{(g)} - S_x^{(g)} S_y^{(g)} \right)}{\sum_{g \in \mathcal{G}} \frac{N_g}{(N_g - 1)^2} \left( N_g S_{xx}^{(g)} - (S_x^{(g)})^2 \right)}
$$

This is our master formula. It depends only on group-level counts (`Ng`), sums (`Sx`, `Sy`), and sums of squares/cross-products (`Sxx`, `Sxy`). This is exactly what we can compute with a single SQL `GROUP BY` query!
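The aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than the `duckreg` implementation: it uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without extra dependencies, and the table name `obs`, the single key column `z`, and the function name `loo_beta_compressed` are all hypothetical. With several discrete covariates, the `GROUP BY` clause would list each of them.

```python
import sqlite3


def loo_beta_compressed(rows):
    """Estimate beta from group-level aggregates of (z, x, y) rows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE obs (z TEXT, x REAL, y REAL)")
    con.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)

    # One pass over the data: counts, sums, and sums of (cross-)products.
    # Singleton groups are dropped because N_g - 1 would be zero.
    agg = con.execute(
        """
        SELECT COUNT(*)   AS n_g,
               SUM(x)     AS s_x,
               SUM(y)     AS s_y,
               SUM(x * x) AS s_xx,
               SUM(x * y) AS s_xy
        FROM obs
        GROUP BY z
        HAVING COUNT(*) > 1
        """
    ).fetchall()
    con.close()

    # Apply the master formula to the compressed rows only.
    num = den = 0.0
    for n_g, s_x, s_y, s_xx, s_xy in agg:
        w = n_g / (n_g - 1) ** 2
        num += w * (n_g * s_xy - s_x * s_y)
        den += w * (n_g * s_xx - s_x ** 2)
    return num / den, len(agg)


# Toy data (hypothetical): Y = 2 * X + g(Z) with no noise, plus one singleton.
rows = [("a", 1.0, 7.0), ("a", 2.0, 9.0), ("a", 3.0, 11.0),
        ("b", 0.0, -1.0), ("b", 4.0, 7.0), ("c", 5.0, 0.0)]
beta_hat, n_groups = loo_beta_compressed(rows)
```

After the query, the estimator only touches one row per group, so the cost of the final step scales with the number of groups rather than the number of observations.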

In [1]:
import numpy as np
import pandas as pd
import duckdb
from duckreg.estimators import DuckDML

In [2]:
# --- Simulation Setup ---
DB_NAME = "dml_simulation.db"
TABLE_NAME = "sales_data"
TRUE_BETA = 2.5

# 1. Generate Data
print("Step 1: Generating sample data...")
N = 10_000_000
n_towns = 1000
n_days = 100

df = pd.DataFrame({
    'town_id': np.random.randint(0, n_towns, N),
    'day_id': np.random.randint(0, n_days, N),
})

# g(Z) is a non-linear function of town and day
g_z = 0.5 * df['town_id'] + 0.01 * df['day_id'] * df['town_id'] + np.sin(df['day_id'])

# Treatment X is correlated with the fixed effects
df['X'] = 0.2 * df['town_id'] + 0.1 * df['day_id'] + np.random.randn(N)

# Outcome Y follows the partially linear model
df['Y'] = TRUE_BETA * df['X'] + g_z + np.random.normal(0, 2, N)

# 2. Load data into DuckDB
print(f"Step 2: Loading {N:,} rows into DuckDB table '{TABLE_NAME}'...")
con = duckdb.connect(DB_NAME)
con.execute(f"DROP TABLE IF EXISTS {TABLE_NAME}")
con.register('df_pandas', df)
con.execute(f"CREATE TABLE {TABLE_NAME} AS SELECT * FROM df_pandas")
con.close()

Step 1: Generating sample data...
Step 2: Loading 10,000,000 rows into DuckDB table 'sales_data'...


In [3]:
# 3. Run the DML LOO Estimator
print("\nStep 3: Initializing and running the DuckPartialLinearLOO estimator...")
dml_model = DuckDML(
    db_name=DB_NAME,
    table_name=TABLE_NAME,
    outcome_var='Y',
    treatment_var='X',
    discrete_covars=['town_id', 'day_id'],
    seed=42,
    n_bootstraps=500
)

dml_model.fit()
results = dml_model.summary()


Step 3: Initializing and running the DuckPartialLinearLOO estimator...


Bootstrapping: 100%|██████████| 500/500 [00:05<00:00, 90.91it/s]


In [4]:
# 4. Print Results
print("\n--- Estimation Results ---")
print(f"True Beta: {TRUE_BETA}")
print(f"Estimated Beta: {results['point_estimate'][0]:.4f}")
print(f"Standard Error (Bootstrap): {results['standard_error'][[0]].squeeze():.4f}")
print("--------------------------\n")

# Verify the size of the compressed data
print(f"Original number of rows: {len(df):,}")
print(f"Compressed number of rows (groups): {len(dml_model.df_compressed):,}")


--- Estimation Results ---
True Beta: 2.5
Estimated Beta: 2.4995
Standard Error (Bootstrap): 0.0006
--------------------------

Original number of rows: 10,000,000
Compressed number of rows (groups): 100,000


This notebook combines a classic result from econometric theory with modern data engineering. By exploiting the algebraic properties of the leave-one-out nonsmoothing estimator, we created a Double Machine Learning workflow that:

-   **Recovers the parameter of interest** in the simulation (an estimate of 2.4995 against a true value of 2.5) while controlling for 100,000 town-by-day cells.
-   Is **extremely scalable**, as it operates on a compressed summary of the data: 100,000 group rows in place of 10,000,000 observations in this example.
-   **Avoids computational complexity**, replacing expensive k-fold cross-fitting with a single, efficient `GROUP BY` aggregation.

This method is a powerful tool for applied researchers who need to control for many discrete covariates without incurring prohibitive computational costs.