# DML Condition Number Study: Monte Carlo Simulation

This notebook implements the complete Monte Carlo study for analyzing the DML condition number $\kappa_{\mathrm{DML}}$ in the Partially Linear Regression (PLR) model.

## Theoretical Background

The DML condition number is defined as:
$$\kappa_{\mathrm{DML}} := \frac{1}{|\hat{J}_\theta|} = \frac{n}{\sum_i \hat{U}_i^2}$$

where $\hat{J}_\theta = -\frac{1}{n}\sum_i \hat{U}_i^2$ is the empirical Jacobian and $\hat{U}_i$ are residualized treatments.
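The definition above can be evaluated directly from a vector of residualized treatments. The following is a minimal sketch; the function name `kappa_dml` and the toy residual values are illustrative assumptions, not part of the `dml_condition` module. Because the sum of squared residuals sits in the denominator, shrinking the residuals by a factor of 10 inflates $\kappa_{\mathrm{DML}}$ by a factor of 100.

```python
def kappa_dml(u_hat):
    """kappa_DML = n / sum_i u_hat_i**2, i.e. 1 / |J_theta_hat|."""
    n = len(u_hat)
    return n / sum(u * u for u in u_hat)

# Toy residualized treatments (illustrative values, not simulation output)
u_large = [1.0, -2.0, 1.5, -0.5]        # plenty of residual variation in D
u_small = [u / 10.0 for u in u_large]   # same pattern, 10x less variation

print(f"kappa (large residuals): {kappa_dml(u_large):.3f}")
print(f"kappa (small residuals): {kappa_dml(u_small):.3f}")
```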

The refined linearization from the paper shows:
$$\hat{\theta} - \theta_0 = \kappa_{\mathrm{DML}} \cdot (S_n + B_n) + R_n$$

### Three Conditioning Regimes

Below, $\kappa_n$ denotes $\kappa_{\mathrm{DML}}$ at sample size $n$.

| Regime | Condition | CI Length | Interpretation |
|--------|-----------|-----------|----------------|
| Well-conditioned | $\kappa_n = O_P(1)$ | $O_P(n^{-1/2})$ | Standard inference reliable |
| Moderately ill-cond. | $\kappa_n = O_P(n^\beta)$, $0 < \beta < 1/2$ | $O_P(n^{\beta-1/2})$ | Slower convergence |
| Severely ill-cond. | $\kappa_n \asymp c\sqrt{n}$ | $O_P(1)$ | CI fails to shrink |
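The CI-length column follows from multiplying the growth rate of $\kappa_n$ by the usual $n^{-1/2}$ factor. As a rough sketch, assuming $\kappa_n = c\,n^\beta$ exactly with $c = 1$ and an illustrative $\beta = 0.25$ for the moderate regime (both values are assumptions made for this example), the ratio of CI lengths between $n = 2000$ and $n = 500$ shows how quickly intervals shrink in each regime:

```python
import math

def ci_length_rate(n, beta, c=1.0):
    """CI length up to a constant factor: kappa_n / sqrt(n) with kappa_n = c * n**beta."""
    return c * n ** beta / math.sqrt(n)

regimes = [
    ("Well-conditioned", 0.0),       # kappa_n = O(1)
    ("Moderately ill-cond.", 0.25),  # assumed beta, must lie in (0, 1/2)
    ("Severely ill-cond.", 0.5),     # kappa_n grows like sqrt(n)
]

for label, beta in regimes:
    ratio = ci_length_rate(2000, beta) / ci_length_rate(500, beta)
    print(f"{label:<22} length(n=2000) / length(n=500) = {ratio:.3f}")
```

A ratio of 0.5 corresponds to the standard $\sqrt{n}$ rate when $n$ quadruples; a ratio of 1 means the interval does not shrink at all.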

## 1. Setup and Imports

In [40]:
# Standard imports
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add src to path for local development
sys.path.insert(0, os.path.abspath('../src'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Configure matplotlib for publication-quality figures
plt.rcParams.update({
    'font.size': 12,
    'axes.titlesize': 14,
    'axes.labelsize': 13,
    'xtick.labelsize': 11,
    'ytick.labelsize': 11,
    'legend.fontsize': 11,
    'figure.figsize': (10, 7),
    'figure.dpi': 100,
})

# Import DML simulation module
from dml_condition import (
    # Constants
    THETA0, R2_TARGETS, DEFAULT_SEED, B_DEFAULT,
    # DGP
    generate_plr_data, calibrate_sigma_xi_sq, compute_V_gamma,
    # DML
    run_dml_plr, get_nuisance_model,
    # Simulation
    run_simulation, run_single_replication,
    # Summary & Tables
    compute_cell_summary, make_table1, make_table2, table_to_latex,
    # Visualization
    plot_coverage_vs_kappa, plot_ci_length_vs_kappa,
    # Main entry
    run_full_study,
)

print("DML Condition Number Study")
print("=" * 50)
print(f"True treatment effect: θ₀ = {THETA0}")
print(f"Default seed: {DEFAULT_SEED}")
print(f"Default replications: B = {B_DEFAULT}")
print(f"R² targets: {R2_TARGETS}")

DML Condition Number Study
True treatment effect: θ₀ = 1.0
Default seed: 20241205
Default replications: B = 500
R² targets: {'high': 0.75, 'moderate': 0.9, 'low': 0.97}


## 2. Design Verification: Overlap Calibration

First, let's verify that our DGP correctly achieves the target $R^2(D|X)$ values for each overlap level.

In [41]:
# Verify R²(D|X) calibration
print("R²(D|X) Calibration Verification")
print("=" * 60)
print(f"{'Overlap Level':<15} {'Target R²':<12} {'σ_ξ²':<12} {'V_γ':<12} {'Check R²':<12}")
print("-" * 60)

for level, R2 in R2_TARGETS.items():
    sigma_xi_sq, V_gamma = calibrate_sigma_xi_sq(R2, rho=0.5, p=10)
    check_R2 = V_gamma / (V_gamma + sigma_xi_sq)
    print(f"{level.capitalize():<15} {R2:<12.2f} {sigma_xi_sq:<12.4f} {V_gamma:<12.4f} {check_R2:<12.4f}")

print()
print("Empirical verification with n=5000:")
print("-" * 60)
for level, R2 in R2_TARGETS.items():
    Y, D, X, info = generate_plr_data(n=5000, R2_target=R2, rho=0.5, random_state=42)
    print(f"{level.capitalize():<15} Target: {R2:.2f}, Sample: {info.sample_R2:.4f}")

R²(D|X) Calibration Verification
Overlap Level   Target R²    σ_ξ²         V_γ          Check R²    
------------------------------------------------------------
High            0.75         1.3544       4.0632       0.7500      
Moderate        0.90         0.4515       4.0632       0.9000      
Low             0.97         0.1257       4.0632       0.9700      

Empirical verification with n=5000:
------------------------------------------------------------
High            Target: 0.75, Sample: 0.7499
Moderate        Target: 0.90, Sample: 0.8986
Low             Target: 0.97, Sample: 0.9686
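The check $R^2 = V_\gamma / (V_\gamma + \sigma_\xi^2)$ used above can be inverted to give $\sigma_\xi^2 = V_\gamma (1 - R^2) / R^2$. The sketch below applies that inversion to the printed $V_\gamma = 4.0632$ and reproduces the $\sigma_\xi^2$ column to four decimals. The helper name `sigma_xi_sq_for_r2` is hypothetical; the actual `calibrate_sigma_xi_sq` may be implemented differently.

```python
def sigma_xi_sq_for_r2(v_gamma, r2_target):
    """Solve R2 = V / (V + s2) for the noise variance s2."""
    if not 0.0 < r2_target < 1.0:
        raise ValueError("r2_target must lie strictly between 0 and 1")
    return v_gamma * (1.0 - r2_target) / r2_target

V_GAMMA = 4.0632  # signal variance as printed in the calibration table

for level, r2 in [("High", 0.75), ("Moderate", 0.90), ("Low", 0.97)]:
    s2 = sigma_xi_sq_for_r2(V_GAMMA, r2)
    check = V_GAMMA / (V_GAMMA + s2)
    print(f"{level:<10} R2={r2:.2f}  sigma_xi_sq={s2:.4f}  check R2={check:.4f}")
```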


## 3. Single Replication Example

Let's examine a single DML estimation to understand the components.

In [42]:
# Single replication example across different overlap levels and learners
print("Single Replication Example")
print("=" * 80)
print(f"{'R²(D|X)':<10} {'Learner':<8} {'θ̂':<10} {'κ_DML':<10} {'SE':<10} {'CI':<20} {'Covers θ₀?':<10}")
print("-" * 80)

for R2 in [0.75, 0.90, 0.97]:
    for learner in ['LIN', 'LAS', 'RF']:
        # Generate data
        Y, D, X, info = generate_plr_data(n=500, R2_target=R2, rho=0.5, random_state=123)
        
        # Run DML
        result = run_dml_plr(Y, D, X, learner_label=learner, K=5, random_state=123)
        
        # Format CI
        ci_str = f"[{result.ci_lower:.3f}, {result.ci_upper:.3f}]"
        covers = "Yes" if result.covers(THETA0) else "No"
        
        print(f"{R2:<10.2f} {learner:<8} {result.theta_hat:<10.4f} {result.kappa_dml:<10.3f} "
              f"{result.se_dml:<10.4f} {ci_str:<20} {covers:<10}")
    print()  # Blank line between R² levels

Single Replication Example
R²(D|X)    Learner  θ̂         κ_DML      SE         CI                   Covers θ₀?
--------------------------------------------------------------------------------
0.75       LIN      1.0109     0.711      0.0506     [0.912, 1.110]       Yes       
0.75       LAS      1.0079     0.711      0.0499     [0.910, 1.106]       Yes       
0.75       RF       1.0047     0.547      0.0413     [0.924, 1.086]       Yes       

0.90       LIN      1.0195     2.133      0.0871     [0.849, 1.190]       Yes       
0.90       LAS      1.0176     2.132      0.0868     [0.848, 1.188]       Yes       
0.90       RF       0.9534     1.076    
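To make the reported quantities concrete, here is a dependency-free sketch of a cross-fitted partialling-out estimator for the PLR model. It is not the implementation behind `run_dml_plr`: it assumes a single covariate, simple linear regressions as nuisance learners, and a toy DGP, and the names `ols_fit` and `dml_plr_sketch` are hypothetical. It shows how $\hat\theta$, $\kappa_{\mathrm{DML}}$, and a sandwich standard error all derive from the same out-of-fold residuals.

```python
import math
import random

def ols_fit(x, y):
    """Closed-form simple linear regression y ~ a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def dml_plr_sketch(y, d, x, k=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(X) + eps."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::k] for j in range(k)]
    u_hat = [0.0] * n  # D minus out-of-fold prediction of D given X
    v_hat = [0.0] * n  # Y minus out-of-fold prediction of Y given X
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a_d, b_d = ols_fit([x[i] for i in train], [d[i] for i in train])
        a_y, b_y = ols_fit([x[i] for i in train], [y[i] for i in train])
        for i in fold:
            u_hat[i] = d[i] - (a_d + b_d * x[i])
            v_hat[i] = y[i] - (a_y + b_y * x[i])
    suu = sum(u * u for u in u_hat)
    theta = sum(u * v for u, v in zip(u_hat, v_hat)) / suu
    kappa = n / suu
    # Sandwich SE: sqrt(mean(psi^2) / n) / |J|, where 1 / |J| = kappa
    mean_psi_sq = sum((u * (v - theta * u)) ** 2 for u, v in zip(u_hat, v_hat)) / n
    se = kappa * math.sqrt(mean_psi_sq / n)
    return theta, kappa, se

# Toy DGP (assumed for illustration): theta0 = 1, unit-variance noise terms
rng = random.Random(123)
n, theta0 = 2000, 1.0
x = [rng.gauss(0, 1) for _ in range(n)]
d = [2.0 * xi + rng.gauss(0, 1) for xi in x]
y = [theta0 * di + xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

theta_hat, kappa, se = dml_plr_sketch(y, d, x)
print(f"theta_hat={theta_hat:.4f}  kappa_DML={kappa:.3f}  SE={se:.4f}")
```

In this toy setup the treatment noise has unit variance, so $\kappa_{\mathrm{DML}}$ should land near 1; reducing that noise variance would raise $\kappa_{\mathrm{DML}}$ and widen the interval proportionally.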

## 4. Simulation Configuration

Define the Monte Carlo simulation parameters.

In [43]:
# =============================================================================
# SIMULATION CONFIGURATION
# =============================================================================

# Sample sizes
N_LIST = [500, 2000]

# Target R²(D|X) levels (overlap calibration)
R2_LIST = [0.75, 0.90, 0.97]  # High, Moderate, Low overlap

# Nuisance learners
LEARNERS = ["LIN", "LAS", "RF"]

# Monte Carlo replications per design cell
# Use B=500 for paper results, B=50-100 for quick testing
B = 500  # Set to 500 for full study

# Fixed design parameters
RHO = 0.5     # Toeplitz correlation
P = 10        # Covariate dimension
K = 5         # Cross-fitting folds

# Output directory
RESULTS_DIR = "../results"

# Print configuration
n_cells = len(N_LIST) * len(R2_LIST) * len(LEARNERS)
total_reps = n_cells * B

print("Simulation Configuration")
print("=" * 50)
print(f"Sample sizes:       n ∈ {N_LIST}")
print(f"Overlap levels:     R² ∈ {R2_LIST}")
print(f"Learners:           {LEARNERS}")
print(f"Replications:       B = {B}")
print(f"Fixed parameters:   ρ = {RHO}, p = {P}, K = {K}")
print("-" * 50)
print(f"Design cells:       {n_cells}")
print(f"Total replications: {total_reps:,}")
print(f"Results directory:  {RESULTS_DIR}")

Simulation Configuration
Sample sizes:       n ∈ [500, 2000]
Overlap levels:     R² ∈ [0.75, 0.9, 0.97]
Learners:           ['LIN', 'LAS', 'RF']
Replications:       B = 500
Fixed parameters:   ρ = 0.5, p = 10, K = 5
--------------------------------------------------
Design cells:       18
Total replications: 9,000
Results directory:  ../results
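The choice of B controls how precisely each cell's coverage rate is estimated. Treating coverage indicators as independent Bernoulli draws (an assumption that holds when replications use independent data), the Monte Carlo standard error of an estimated coverage rate $p$ is $\sqrt{p(1-p)/B}$. The sketch below evaluates it at the nominal 95% level; the function name `coverage_mc_se` is hypothetical.

```python
import math

def coverage_mc_se(p, b):
    """Monte Carlo standard error of a coverage rate estimated from b replications."""
    return math.sqrt(p * (1.0 - p) / b)

for b in (50, 100, 500):
    se = coverage_mc_se(0.95, b)
    print(f"B={b:<4} MC SE of coverage at p=0.95: {se:.4f}")
```

With B=500 the standard error is roughly one percentage point, so differences of one or two points between cells are within simulation noise; with B=50 it is about three points.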


## 5. Run Monte Carlo Simulation

⚠️ **Warning**: The full simulation with B=500 takes several hours. For quick testing, reduce B to 50-100.

In [44]:
%%time

# Run the full Monte Carlo simulation
print("Starting Monte Carlo simulation...")
print("=" * 60)

results_df = run_simulation(
    n_list=N_LIST,
    R2_list=R2_LIST,
    learners=LEARNERS,
    B=B,
    rho=RHO,
    p=P,
    K=K,
    verbose=True,
)

print(f"\nSimulation complete!")
print(f"Total rows: {len(results_df):,}")
print(f"Columns: {list(results_df.columns)}")

Starting Monte Carlo simulation...
DML Condition Number Monte Carlo Study
Design: 2 sample sizes × 3 R² levels × 3 learners
Replications per cell: B = 500
Total replications: 9,000
Fixed parameters: ρ = 0.5, p = 10, K = 5, θ₀ = 1.0

[1/18] n=500, R²=0.75 (high), learner=LIN
    Completed 100/500 replications
    Completed 200/500 replications
    Completed 300/500 replications
    Completed 400/500 replications
    Completed 500/500 replications

[2/18] n=500, R²=0.75 (high), learner=LAS
    Completed 100/500 replications
    Completed 200/500 replications
    Completed 300/500 replications
    Completed 400/500 replications
    Compl

In [45]:
# Quick look at raw results
print("Raw Results Sample:")
print(results_df.head(10).to_string())

print(f"\nResults shape: {results_df.shape}")
print(f"\nκ_DML range: [{results_df['kappa_dml'].min():.3f}, {results_df['kappa_dml'].max():.3f}]")
print(f"Coverage rate: {results_df['coverage'].mean():.3f}")

Raw Results Sample:
     n  R2_target  rho learner  replication  theta_hat  kappa_dml    se_dml  ci_lower  ci_upper  ci_length  coverage      bias  squared_error  sample_R2
0  500       0.75  0.5     LIN            0   1.062364   0.678946  0.051470  0.961483  1.163245   0.201761         1  0.062364       0.003889   0.767950
1  500       0.75  0.5     LIN            1   1.035781   0.637099  0.044177  0.949194  1.122368   0.173173         1  0.035781       0.001280   0.750382
2  500       0.75  0.5     LIN            2   1.005497   0.685950  0.055125  0.897451  1.113542   0.216092         1  0.005497       0.000030   0.707995
3  500       0.75  0.5     LIN            3   1.005220   0.683186  0.055045  0.897332  1.113108   0.215776         1  0.005220       0.000027   0.750023
4  500       0.75  0.5     LIN            4   0.928522   0.754188  0.050907  0.828744  1.028301   0.199556         1 -0.071478       0.005109   0.811837
5  500       0.75  0.5     LIN            5   0.968639   0.662
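The per-replication columns appear to be simple functions of `theta_hat` and `se_dml`. The sketch below assumes a two-sided normal interval at the 95% level and $\theta_0 = 1.0$, and recomputes the derived columns for the first row of the sample above. The function `replication_metrics` is hypothetical and only illustrates how the columns relate; small differences in the last digit come from the rounding of the printed inputs.

```python
from statistics import NormalDist

def replication_metrics(theta_hat, se, theta0=1.0, level=0.95):
    """Derive CI bounds, length, coverage indicator, error and squared error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    ci_lower = theta_hat - z * se
    ci_upper = theta_hat + z * se
    bias = theta_hat - theta0  # per-replication estimation error
    return {
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "ci_length": ci_upper - ci_lower,
        "coverage": int(ci_lower <= theta0 <= ci_upper),
        "bias": bias,
        "squared_error": bias ** 2,
    }

# Row 0 of the raw results sample: theta_hat = 1.062364, se_dml = 0.051470
row0 = replication_metrics(1.062364, 0.051470)
for name, value in row0.items():
    print(f"{name:<14} {value:.6f}" if isinstance(value, float) else f"{name:<14} {value}")
```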

## 6. Compute Summary Statistics

In [46]:
# Compute cell-level summary statistics
cell_summary = compute_cell_summary(results_df)

print("Cell-Level Summary Statistics")
print("=" * 100)
display_cols = ['n', 'R2_target', 'learner', 'overlap', 'median_kappa', 'mean_kappa', 
                'coverage', 'avg_ci_length', 'mean_bias', 'rmse']
print(cell_summary[display_cols].to_string(index=False))

# Add κ-regime classification
from dml_condition import assign_kappa_regime
cell_summary['kappa_regime'] = cell_summary['median_kappa'].apply(assign_kappa_regime)
print("\n\nWith κ-Regime Classification:")
print(cell_summary[['n', 'R2_target', 'learner', 'median_kappa', 'kappa_regime', 'coverage']].to_string(index=False))

Cell-Level Summary Statistics
   n  R2_target learner  overlap  median_kappa  mean_kappa  coverage  avg_ci_length  mean_bias     rmse
 500       0.75     LAS     High      0.724528    0.723932     0.940       0.194932   0.000607 0.050800
 500       0.75     LIN     High      0.724199    0.726663     0.956       0.193697   0.000908 0.048555
 500       0.75      RF     High      0.545200    0.548077     0.956       0.177379  -0.020183 0.045386
 500       0.90     LAS Moderate      2.148282    2.164505     0.934       0.333747  -0.000187 0.089426
 500       0.90     LIN Moderate      2.149735    2.157493     0.950       0.334531   0.001423 0.084770
 500       0.90      RF Moderate      1.058415    1.059487     0.962       0.279483  -0.043104 0.070660
 500       0.97     LAS      Low      7.762764    7.780217     0.932       0.636169   0.000343 0.170831
 500       0.97     LIN      Low      7.714352    7.732915     0.946       0.634666  -0.007035 0.165090
 500       0.97      RF      Low  
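The regime labels used in this study are `< 1`, `1-2`, and `> 2`. A minimal sketch of such a classifier follows; the name `classify_kappa` is hypothetical, and the handling of the boundary values 1 and 2 is an assumption, since the behavior of `assign_kappa_regime` at the cut points is not shown here.

```python
def classify_kappa(kappa):
    """Map a kappa_DML value to one of three regime labels.

    Boundary handling (1.0 and 2.0 both fall in "1-2") is an assumption.
    """
    if kappa < 1.0:
        return "< 1"
    if kappa <= 2.0:
        return "1-2"
    return "> 2"

# Median kappa values taken from the cell summary above
for median_kappa in (0.724528, 1.058415, 2.148282, 7.762764):
    print(f"median kappa = {median_kappa:.3f} -> regime {classify_kappa(median_kappa)}")
```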

## 7. Table 1: Design Summary

Summary of the experimental design with median $\kappa_{\mathrm{DML}}$ by overlap level.

In [47]:
# Generate Table 1: Design summary with median kappa
# Note: make_table1 expects raw results_df (not cell_summary)
from dml_condition import make_table1

table1 = make_table1(results_df)
print("Table 1: Design Summary")
print("=" * 60)
display(table1)

Table 1: Design Summary


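|   | Overlap | R²(D\|X) | Median κ_DML | Mean κ_DML | SD κ_DML | n values | Learners |
|---|---------|----------|--------------|------------|----------|----------|----------|
| 0 | High (R²=0.75) | 0.75 | 0.71 | 0.68 | 0.08 | 500, 2000 | LIN, LAS, RF |
| 1 | Moderate (R²=0.90) | 0.90 | 2.11 | 1.84 | 0.50 | 500, 2000 | LIN, LAS, RF |
| 2 | Low (R²=0.97) | 0.97 | 7.60 | 5.82 | 2.86 | 500, 2000 | LIN, LAS, RF |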


## 8. Table 2: Coverage & CI Length by $\kappa_{\mathrm{DML}}$ Regime

Main results table showing coverage and CI length across conditioning regimes.

In [48]:
# Generate Table 2: Coverage and CI length by kappa regime
# Note: make_table2 expects cell_summary (not raw results)
from dml_condition import make_table2

table2 = make_table2(cell_summary)
print("Table 2: Coverage and CI Length by κ_DML Regime")
print("=" * 80)
display(table2)

Table 2: Coverage and CI Length by κ_DML Regime


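|   | κ-Regime | Learner | Coverage | Avg CI Length | Bias | RMSE |
|---|----------|---------|----------|---------------|------|------|
| 1 | < 1 | LAS | 94.3 | 0.146 | 0.0002 | 0.0375 |
| 2 | < 1 | LIN | 95.1 | 0.145 | 0.0012 | 0.0365 |
| 3 | < 1 | RF | 89.0 | 0.131 | -0.0226 | 0.0384 |
| 0 | 1-2 | RF | 84.1 | 0.265 | -0.0592 | 0.0786 |
| 4 | > 2 | LAS | 94.4 | 0.364 | -0.0001 | 0.0960 |
| 5 | > 2 | LIN | 95.0 | 0.363 | -0.0003 | 0.0938 |
| 6 | > 2 | RF | 39.8 | 0.191 | -0.1035 | 0.1092 |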
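A back-of-the-envelope check helps relate the bias and CI-length columns to coverage. Assuming the estimator is normally distributed around $\theta_0 + \text{bias}$ with a standard deviation equal to the SE implied by the average CI length (a simplification, since each row pools cells with different $n$ and overlap), the implied coverage is $\Phi(z - b/s) - \Phi(-z - b/s)$. The function name `coverage_with_bias` is hypothetical.

```python
from statistics import NormalDist

def coverage_with_bias(bias, ci_length, level=0.95):
    """Coverage of a normal-based CI when the estimator is N(theta0 + bias, se**2)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)
    se = ci_length / (2 * z)  # SE implied by the CI length
    shift = bias / se         # bias measured in SE units
    return std_normal.cdf(z - shift) - std_normal.cdf(-z - shift)

# (regime, learner, bias, avg CI length) for the "> 2" regime rows
rows = [
    ("> 2", "LAS", -0.0001, 0.364),
    ("> 2", "LIN", -0.0003, 0.363),
    ("> 2", "RF", -0.1035, 0.191),
]
for regime, learner, bias, length in rows:
    implied = 100 * coverage_with_bias(bias, length)
    print(f"{regime:<4} {learner:<4} implied coverage = {implied:.1f}%")
```

For the RF row the bias is a little over two implied standard errors, which puts the implied coverage in the same range as the reported 39.8; for LAS and LIN the bias is negligible and the implied coverage stays at the nominal level.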


## 9. Export Tables to LaTeX

Generate publication-ready LaTeX tables.

In [49]:
# Export Table 1 to LaTeX
latex_table1 = table1.to_latex(
    caption="Design Summary: Median $\\kappa_{\\mathrm{DML}}$ by Overlap Level",
    label="tab:design_summary",
    escape=False,
    float_format="%.3f"
)
print("Table 1 LaTeX:")
print(latex_table1)

# Save to file
with open('../results/table1_design_summary.tex', 'w') as f:
    f.write(latex_table1)
print("\nSaved: results/table1_design_summary.tex")

Table 1 LaTeX:
\begin{table}
\caption{Design Summary: Median $\kappa_{\mathrm{DML}}$ by Overlap Level}
\label{tab:design_summary}
\begin{tabular}{llrrrrll}
\toprule
 & Overlap & R²(D|X) & Median κ_DML & Mean κ_DML & SD κ_DML & n values & Learners \\
\midrule
0 & High (R²=0.75) & 0.750 & 0.710 & 0.680 & 0.080 & 500, 2000 & LIN, LAS, RF \\
1 & Moderate (R²=0.90) & 0.900 & 2.110 & 1.840 & 0.500 & 500, 2000 & LIN, LAS, RF \\
2 & Low (R²=0.97) & 0.970 & 7.600 & 5.820 & 2.860 & 500, 2000 & LIN, LAS, RF \\
\bottomrule
\end{tabular}
\end{table}


Saved: results/table1_design_summary.tex


In [50]:
# Export Table 2 to LaTeX
latex_table2 = table2.to_latex(
    caption="Coverage and CI Length by $\\kappa_{\\mathrm{DML}}$ Regime",
    label="tab:coverage_by_regime",
    escape=False,
    float_format="%.3f"
)
print("Table 2 LaTeX:")
print(latex_table2)

# Save to file
with open('../results/table2_coverage_by_regime.tex', 'w') as f:
    f.write(latex_table2)
print("\nSaved: results/table2_coverage_by_regime.tex")

Table 2 LaTeX:
\begin{table}
\caption{Coverage and CI Length by $\kappa_{\mathrm{DML}}$ Regime}
\label{tab:coverage_by_regime}
\begin{tabular}{lllrrrr}
\toprule
 & κ-Regime & Learner & Coverage & Avg CI Length & Bias & RMSE \\
\midrule
1 & < 1 & LAS & 94.300 & 0.146 & 0.000 & 0.037 \\
2 & < 1 & LIN & 95.100 & 0.145 & 0.001 & 0.036 \\
3 & < 1 & RF & 89.000 & 0.131 & -0.023 & 0.038 \\
0 & 1-2 & RF & 84.100 & 0.265 & -0.059 & 0.079 \\
4 & > 2 & LAS & 94.400 & 0.364 & -0.000 & 0.096 \\
5 & > 2 & LIN & 95.000 & 0.363 & -0.000 & 0.094 \\
6 & > 2 & RF & 39.800 & 0.191 & -0.103 & 0.109 \\
\bottomrule
\end{tabular}
\end{table}


Saved: results/table2_coverage_by_regime.tex
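With `escape=False`, header text such as `κ_DML` and `R²(D|X)` is written to the `.tex` file verbatim. A bare underscore outside math mode is an error in LaTeX, and the Unicode characters and the `<`, `>`, `|` symbols may not compile or render as intended under pdfLaTeX. One possible remedy, sketched below, is to map labels to LaTeX math before export. The helper `latex_safe` and its replacement list are assumptions made for this example and cover only the labels that appear in Tables 1 and 2.

```python
# Ordered (pattern, replacement) pairs; longer patterns come first so that
# "R²(D|X)" and "κ_DML" are handled before their substrings "R²" and "κ".
_LATEX_REPLACEMENTS = [
    ("R²(D|X)", r"$R^2(D \mid X)$"),
    ("κ_DML", r"$\kappa_{\mathrm{DML}}$"),
    ("R²", r"$R^2$"),
    ("κ", r"$\kappa$"),
    ("<", "$<$"),
    (">", "$>$"),
]

def latex_safe(label):
    """Rewrite a table label so it is valid LaTeX (covers this study's labels only)."""
    text = str(label)
    for pattern, replacement in _LATEX_REPLACEMENTS:
        text = text.replace(pattern, replacement)
    return text

for label in ["Median κ_DML", "R²(D|X)", "High (R²=0.75)", "κ-Regime", "> 2"]:
    print(f"{label!r:<20} -> {latex_safe(label)}")
```

One way to apply it would be `table2.rename(columns=latex_safe)` before calling `to_latex`; the regime labels stored in the table cells would need the same treatment.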


## 10. Save All Results

Save the full simulation results and summary statistics for reproducibility.

In [51]:
# Save full results to CSV
results_df.to_csv('../results/simulation_results.csv', index=False)
print(f"Saved: results/simulation_results.csv ({len(results_df)} rows)")

# Save cell summary
cell_summary.to_csv('../results/cell_summary.csv', index=False)
print(f"Saved: results/cell_summary.csv ({len(cell_summary)} rows)")

# Save Table 1 and Table 2 as CSV for convenience
table1.to_csv('../results/table1_design_summary.csv')
table2.to_csv('../results/table2_coverage_by_regime.csv')
print("Saved: results/table1_design_summary.csv")
print("Saved: results/table2_coverage_by_regime.csv")

Saved: results/simulation_results.csv (9000 rows)
Saved: results/cell_summary.csv (18 rows)
Saved: results/table1_design_summary.csv
Saved: results/table2_coverage_by_regime.csv


## 11. Summary Statistics

Final overview of the simulation results.

In [52]:
# Final summary statistics
print("=" * 70)
print("SIMULATION SUMMARY")
print("=" * 70)
print(f"\nTotal replications: {len(results_df)}")
print(f"Unique configurations: {len(cell_summary)}")
print(f"  - Sample sizes (n): {sorted(results_df['n'].unique())}")
print(f"  - Overlap levels (R²): {sorted(results_df['R2_target'].unique())}")
print(f"  - Learners: {sorted(results_df['learner'].unique())}")

print("\n" + "-" * 70)
print("OVERALL COVERAGE BY LEARNER:")
print("-" * 70)
for learner in sorted(results_df['learner'].unique()):
    subset = results_df[results_df['learner'] == learner]
    cov = subset['coverage'].mean()
    print(f"  {learner}: {cov:.3f} (n={len(subset)})")

print("\n" + "-" * 70)
print("κ_DML DISTRIBUTION:")
print("-" * 70)
print(results_df['kappa_dml'].describe())

print("\n" + "=" * 70)
print("SIMULATION COMPLETE")
print("=" * 70)

SIMULATION SUMMARY

Total replications: 9000
Unique configurations: 18
  - Sample sizes (n): [np.int64(500), np.int64(2000)]
  - Overlap levels (R²): [np.float64(0.75), np.float64(0.9), np.float64(0.97)]
  - Learners: ['LAS', 'LIN', 'RF']

----------------------------------------------------------------------
OVERALL COVERAGE BY LEARNER:
----------------------------------------------------------------------
  LAS: 0.944 (n=3000)
  LIN: 0.950 (n=3000)
  RF: 0.784 (n=3000)

----------------------------------------------------------------------
κ_DML DISTRIBUTION:
----------------------------------------------------------------------
count    9000.000000
mean        2.778151
std         2.769398
min         0.459049
25%         0.739346
50%         1.808823
75%         2.299724
max         9.474677
Name: kappa_dml, dtype: float64

SIMULATION COMPLETE
