### Double ML Partially Linear Regression 
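
As background for the steps below, the model and the partialling-out identity behind the residual-on-residual regression can be written out explicitly. The symbols $g_0$, $m_0$, $\zeta$, and $V$ mirror `g0`, `m0`, `zeta`, and `v` in the simulation code of this section, and $\ell_0$ denotes the target of $\hat{\ell}$. The conditional-mean restrictions on the errors are the usual assumptions for this model and are stated here as assumptions.

$ Y = \theta_0 D + g_0(X) + \zeta, \quad \mathbb{E}[\zeta \mid D, X] = 0, \qquad D = m_0(X) + V, \quad \mathbb{E}[V \mid X] = 0. $

Taking expectations conditional on $X$ gives $\ell_0(X) = \mathbb{E}[Y \mid X] = \theta_0 m_0(X) + g_0(X)$. Subtracting this from the outcome equation removes $g_0$:

$ Y - \ell_0(X) = \theta_0 \, (D - m_0(X)) + \zeta. $

So $\theta_0$ is the slope in a regression of the outcome residual on the treatment residual, which is what steps 3 and 4 estimate with $\hat{\ell}$ and $\hat{m}$ in place of $\ell_0$ and $m_0$.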

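1. Split Data: Randomly split the data into $K$ folds $\mathcal{I}_1, \dots, \mathcal{I}_K$ of roughly equal size. $\mathcal{I}_{-k}$ denotes all observations outside fold $k$.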

2. Estimate Nuisance Functions:
   For each fold $k \in \{1, \dots, K\}$:
   - Train $\hat{\ell}(X) \approx \mathbb{E}[Y \mid X]$ and $\hat{m}(X) \approx \mathbb{E}[D \mid X]$ on $\mathcal{I}_{-k}$.
   - Predict $\hat{\ell}(X_i)$ and $\hat{m}(X_i)$ for $i \in \mathcal{I}_k$.

3. Residualization:
   For all $i \in \{1, \dots, N\}$, subtract the out-of-fold predictions from step 2:
   $ \tilde{Y}_i = Y_i - \hat{\ell}(X_i), \quad \tilde{D}_i = D_i - \hat{m}(X_i). $

4. Second-Stage Regression:
   - Regress $\tilde{Y}$ on $\tilde{D}$ by OLS without an intercept to estimate:
     $ \hat{\theta}_0 = \frac{\sum_{i=1}^N \tilde{Y}_i \tilde{D}_i}{\sum_{i=1}^N \tilde{D}_i^2}. $

5. Variance and Confidence Interval:
   - Compute residuals $\hat{u}_i = \tilde{Y}_i - \hat{\theta}_0 \tilde{D}_i$.
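   - Estimate the variance, assuming homoskedastic errors:
     $ \text{Var}(\hat{\theta}_0) = \frac{\frac{1}{N} \sum_{i=1}^N \hat{u}_i^2}{\sum_{i=1}^N \tilde{D}_i^2}. $
   - Construct the $(1-\alpha)$ confidence interval, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution:
     $ \text{CI} = \hat{\theta}_0 \pm z_{1-\alpha/2} \sqrt{\text{Var}(\hat{\theta}_0)}. $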
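
Steps 4 and 5 can be sketched directly from the formulas above. This is a minimal illustration, not a definitive implementation: the function name `plr_second_stage` and the simulated residuals are hypothetical, and the variance is the homoskedastic form given in step 5.

```python
from statistics import NormalDist

import numpy as np


def plr_second_stage(y_res, d_res, alpha=0.05):
    """No-intercept OLS of y_res on d_res with a normal-approximation CI."""
    y_res = np.asarray(y_res, dtype=float)
    d_res = np.asarray(d_res, dtype=float)

    # Step 4: theta_hat = sum(Y~ * D~) / sum(D~^2)
    sum_d2 = np.sum(d_res ** 2)
    theta_hat = np.sum(y_res * d_res) / sum_d2

    # Step 5: residuals, variance estimate, and confidence interval
    u_hat = y_res - theta_hat * d_res
    var_hat = np.mean(u_hat ** 2) / sum_d2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * np.sqrt(var_hat)
    return theta_hat, var_hat, (theta_hat - half_width, theta_hat + half_width)


# Hypothetical residuals, simulated only to exercise the function
rng = np.random.default_rng(0)
d_res = rng.normal(size=1000)
y_res = 0.5 * d_res + rng.normal(size=1000)

theta_hat, var_hat, ci = plr_second_stage(y_res, d_res)
print(f"theta_hat = {theta_hat:.3f}, CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

In practice `y_res` and `d_res` would be the cross-fitted residuals $\tilde{Y}$ and $\tilde{D}$ from step 3 rather than simulated draws.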

import numpy as np
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor
import statsmodels.api as sm

# Generate synthetic data
N = 500  # Number of observations
p = 5  # Number of covariates

np.random.seed(42)

# Covariance matrix for multivariate normal
Sigma = np.array([[0.7 ** abs(i - j) for j in range(p)] for i in range(p)])

# Covariates X ~ N(0, Σ)
X = np.random.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=N)

# True nuisance functions
def m0(x):
    return x[0] + 0.25 / (1 + np.exp(x[2])) + np.exp(x[3]) / (1 + np.exp(x[3]))

def g0(x):
    return np.exp(x[0]) / (1 + np.exp(x[0])) + 0.25 * x[3]

theta_0 = 0.5  # True causal effect

# Generate treatment D and outcome Y
v = np.random.normal(size=N)
zeta = np.random.normal(size=N)

m_values = np.array([m0(x) for x in X])
g_values = np.array([g0(x) for x in X])

D = m_values + v
Y = theta_0 * D + g_values + zeta

# Function to implement the cross-fitting procedure
def cross_fitting_plr(Y, D, X, K=10):
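    # Initialize out-of-fold residuals as float arrays
    Y_residual = np.zeros_like(Y, dtype=float)
    D_residual = np.zeros_like(D, dtype=float)

    # K-fold cross-fitting: each fold is predicted by models trained on the other folds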
    kf = KFold(n_splits=K, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(X):
        # Train nuisance models on training folds
        model_m = LGBMRegressor(verbose=-1)
        model_ell = LGBMRegressor(verbose=-1)

        model_m.fit(X[train_idx], D[train_idx])  # Fit model for E[D|X]
        model_ell.fit(X[train_idx], Y[train_idx])  # Fit model for E[Y|X]

        # Predict on validation folds
        m_hat = model_m.predict(X[val_idx])
        ell_hat = model_ell.predict(X[val_idx])

        # Compute residuals
        Y_residual[val_idx] = Y[val_idx] - ell_hat
        D_residual[val_idx] = D[val_idx] - m_hat

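    # Second-stage regression of the Y residuals on the D residuals.
    # Note: add_constant includes an intercept, whereas the formula in step 4 has none;
    # the slope is typically very close because the residuals are roughly mean zero.
    model = sm.OLS(Y_residual, sm.add_constant(D_residual)).fit()

    # Return the slope (theta estimate) and its default (nonrobust) OLS standard error
    return model.params[1], model.bse[1]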

# Run the cross-fitting procedure
theta_hat, stderr = cross_fitting_plr(Y, D, X, K=10)

# Output results
print(f"Estimated theta: {theta_hat:.3f}")
print(f"Standard error: {stderr:.3f}")


Estimated theta: 0.457
Standard error: 0.049
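
The printed estimate and standard error can be turned into the confidence interval from step 5. The sketch below simply plugs in the two rounded numbers shown above with $\alpha = 0.05$, so the interval is only as precise as that rounding. The variable names are illustrative.

```python
from statistics import NormalDist

theta_hat, stderr = 0.457, 0.049  # rounded values printed above
theta_true = 0.5                  # theta_0 used to simulate the data
alpha = 0.05

# Normal-approximation interval: theta_hat +/- z_{1-alpha/2} * stderr
z = NormalDist().inv_cdf(1 - alpha / 2)
ci_low, ci_high = theta_hat - z * stderr, theta_hat + z * stderr
covers = ci_low <= theta_true <= ci_high

print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}], covers true theta: {covers}")
```

With these inputs the interval is roughly [0.361, 0.553], which contains the true value 0.5 used in the simulation.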
