# DoubleML Analysis: Water Treatment Impact on E. coli Risk

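This notebook estimates the effect of household water treatment on E. coli risk with Double Machine Learning (DoubleML). It covers:
- Base models, a partially linear regression (PLR) and an interactive regression model (IRM), with a core set of covariates
- Extended models (PLR and IRM) with all available covariates
- Subsample analysis by `RiskSource` category
- LaTeX table output for each analysis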

## Variable Definitions

| Variable | Role | Type | Description |
|----------|------|------|-------------|
| **Outcome Variables (Y)** ||||
| `SomeRiskHome` | Dependent | Binary | E. coli risk indicator (1 = some risk, 0 = no risk) |
| `VeryHighRiskHome` | Dependent | Binary | High E. coli risk indicator (1 = very high risk, 0 = otherwise) |
| **Treatment Variable (T)** ||||
| `water_treatment` | Treatment | Binary | Household treats water before drinking (1 = yes, 0 = no) |
| **Subsample Variable** ||||
| `RiskSource` | Stratification | Categorical | Water source risk level: "No risk", "Moderate to high risk", "Very high risk" |
| **Basic Controls** ||||
| `windex5` | Control | Ordinal (0-4) | Wealth index quintile (0=Poorest, 1=Poor, 2=Middle, 3=Rich, 4=Richest) |
| `helevel` | Control | Ordinal (0-2) | Education level (0=None, 1=Primary, 2=Secondary+) |
| `urban` | Control | Binary | Urban residence (1=Urban, 0=Rural) |
| `wq27_decile` | Control | Ordinal (1-10) | Water quality decile based on E. coli contamination |
| `country_cat_*` | Control | Binary (one-hot) | Country fixed effects (24 countries, reference: Bangladesh) |
| `WS1_g_*` | Control | Binary (one-hot) | Water source type groups (7 types, reference: Delivered water) |
| **Extended Controls - Household Composition** ||||
| `Any_U5` | Control | Binary | Household has children under 5 years |
| `Girls_less_than15` | Control | Binary | Household has girls under 15 years |
| `Boys_15or_less` | Control | Binary | Household has boys 15 years or younger |
| **Extended Controls - Sanitation** ||||
| `improved_latrine` | Control | Binary | Household has improved latrine facility |
| `Flush` | Control | Binary | Household has flush toilet |
| `Pit_latrine` | Control | Binary | Household uses pit latrine |
| `Open_defecation` | Control | Binary | Household practices open defecation |
| **Extended Controls - Water Sources & Services** ||||
| `rainy_season` | Control | Binary | Survey conducted during rainy season |
| `RainandSurfaceWater` | Control | Binary | Main water source is rain/surface water |
| `PurchasedWater` | Control | Binary | Household purchases water |
| `Basic_water_service` | Control | Binary | Household has basic water service level |
| `Limited_water_service` | Control | Binary | Household has limited water service level |
| `Unimproved_water_service` | Control | Binary | Household has unimproved water service |
| `ImprovedWaterSource` | Control | Binary | Water source is classified as improved |
| `PipedWater` | Control | Binary | Household has piped water access |
| `WellandSpringWater` | Control | Binary | Main water source is well/spring |
| `water_carrier_edu` | Control | Ordinal | Education level of person who fetches water (-1=missing/NA) |

**Notes:**
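- **Base model** uses `windex5`, `helevel`, `country_cat_*`, and `WS1_g_*`; `urban` and `wq27_decile` enter only the extended model
- **Extended model** uses every column in the data except the outcomes, the treatment, and `RiskSource`
- **Subsample analysis** re-estimates all four models separately within each `RiskSource` category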
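
For reference, the two model classes can be written in the standard DoubleML notation. The symbols are not defined elsewhere in this notebook and are introduced here as assumptions: $Y$ is the outcome, $D$ the binary treatment `water_treatment`, $X$ the controls, and $\theta_0$ the target parameter. The IRM line assumes the default average-treatment-effect score.

```latex
\begin{align*}
\text{PLR:}\quad & Y = \theta_0 D + g_0(X) + \varepsilon, & E[\varepsilon \mid D, X] &= 0,\\
                 & D = m_0(X) + V,                        & E[V \mid X] &= 0,\\[4pt]
\text{IRM:}\quad & Y = g_0(D, X) + U,                     & E[U \mid D, X] &= 0,\\
                 & D = m_0(X) + V,                        & E[V \mid X] &= 0,\\
                 & \theta_0 = E\big[g_0(1, X) - g_0(0, X)\big].
\end{align*}
```

Because both outcomes are binary, a coefficient such as $-0.07$ reads roughly as a 7 percentage point lower probability of the outcome among treating households, conditional on the controls.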

In [1]:
import pandas as pd
import numpy as np
import optuna
from doubleml import DoubleMLData, DoubleMLPLR, DoubleMLIRM
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')

## 1. Load and Prepare Data

In [2]:
# Load cleaned data
data = pd.read_csv('mics_clean.csv')

print(f"Dataset shape: {data.shape}")
print(f"\nMissing values:\n{data.isnull().sum()[data.isnull().sum() > 0]}")

# Drop rows with NaN values (DoubleML requires complete cases)
data_complete = data.dropna()
print(f"\nComplete cases: {data_complete.shape}")

# Display basic info
data_complete.head()

Dataset shape: (56721, 55)

Missing values:
Open_defecation             1209
RainandSurfaceWater            3
PurchasedWater                 3
Basic_water_service            2
Limited_water_service          2
Unimproved_water_service       3
PipedWater                     3
WellandSpringWater             3
dtype: int64

Complete cases: (55510, 55)


Unnamed: 0,SomeRiskHome,VeryHighRiskHome,water_treatment,RiskSource,urban,windex5,helevel,wq27_decile,Any_U5,Girls_less_than15,...,country_cat_Togo,country_cat_Trinidad and Tobago,country_cat_Viet Nam,country_cat_Zimbabwe,WS1_g_Packaged/Bottled water,WS1_g_Piped water,WS1_g_Protected well/spring,WS1_g_Surface/Rain water,WS1_g_Tube/Well/Borehole,WS1_g_Unprotected well/spring
0,1,0,0,Moderate to high risk,0,1,0,7,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,0,0,No risk,0,1,0,1,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,1,0,Very high risk,0,2,0,8,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,1,0,Very high risk,0,2,0,8,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1,0,0,Moderate to high risk,0,0,0,8,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
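
`data.dropna()` removes a row if any column is missing, so the 1,209 rows without `Open_defecation` are also excluded from the base models, which never use that column. The sketch below contrasts this with dropping only on the columns a given model needs. The four-row frame and the name `base_cols` are made up for illustration and are not taken from `mics_clean.csv`.

```python
import numpy as np
import pandas as pd

# Toy data: illustrative values only, not from the survey file
toy = pd.DataFrame({
    'SomeRiskHome':    [1, 0, 1, 0],
    'water_treatment': [0, 1, 0, 1],
    'windex5':         [0, 2, 4, 1],
    'helevel':         [1, 0, 2, 1],
    'Open_defecation': [0.0, np.nan, 1.0, np.nan],
})

# Columns a base-style model would actually use (hypothetical subset)
base_cols = ['SomeRiskHome', 'water_treatment', 'windex5', 'helevel']

complete_all = toy.dropna()                   # drops rows with a gap in ANY column
complete_base = toy.dropna(subset=base_cols)  # drops rows with a gap in base columns only

print(len(complete_all), len(complete_base))
```

The notebook's choice of one common complete-case sample has its own merit: base and extended estimates are computed on identical rows, which keeps them directly comparable.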


## 2. Define Variable Sets

In [3]:
# Outcome variables
outcome_vars = ['SomeRiskHome', 'VeryHighRiskHome']

# Treatment variable
treatment_var = 'water_treatment'

# Subsample variable (used for filtering, not as covariate)
subsample_var = 'RiskSource'

# Base model covariates
# country_cat and WS1_g are one-hot encoded, so collect every encoded column by prefix
country_cols = [col for col in data_complete.columns if col.startswith('country_cat_')]
ws1g_cols = [col for col in data_complete.columns if col.startswith('WS1_g_')]

base_covariates = ['windex5', 'helevel'] + country_cols + ws1g_cols

# Extended model covariates (all variables except outcomes, treatment, and subsample var)
extended_covariates = [col for col in data_complete.columns 
                       if col not in outcome_vars + [treatment_var, subsample_var]]

# RiskSource categories (using actual string values from the data)
risk_categories = ['No risk', 'Moderate to high risk', 'Very high risk']

print(f"RiskSource value counts:\n{data_complete['RiskSource'].value_counts()}")

RiskSource value counts:
RiskSource
No risk                  23323
Moderate to high risk    20603
Very high risk           11584
Name: count, dtype: int64
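
The `head()` output in Section 1 lists an `Unnamed: 0` entry, which may be a row index written by an earlier `to_csv` call. If it is a real column rather than the displayed index, the exclusion-based definition of `extended_covariates` would pass it to the learners as a control. The sketch below shows one possible guard. The shortened column list and the `non_covariates` set are hypothetical stand-ins for `data_complete.columns` and the variables defined above.

```python
# Hypothetical, shortened column list standing in for data_complete.columns
columns = ['Unnamed: 0', 'SomeRiskHome', 'VeryHighRiskHome', 'water_treatment',
           'RiskSource', 'urban', 'windex5', 'helevel', 'country_cat_Togo']

# Columns that must never be used as controls (assumed names)
non_covariates = {'SomeRiskHome', 'VeryHighRiskHome', 'water_treatment', 'RiskSource'}

# Exclude outcomes, treatment, subsample variable, and any leftover index column
extended = [c for c in columns
            if c not in non_covariates and not c.startswith('Unnamed')]

print(extended)
```

Printing `extended_covariates` once in the real notebook is a cheap way to confirm which columns actually enter the extended models.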


## 3. Helper Functions

In [4]:
def create_hyperparameter_space():
    """Define hyperparameter search space for XGBoost"""
    return {
        'n_estimators': {'low': 50, 'high': 200, 'step': 25},
        'max_depth': {'low': 2, 'high': 6},
        'min_child_weight': {'low': 1, 'high': 10},
        'subsample': {'low': 0.6, 'high': 0.9},
        'colsample_bytree': {'low': 0.6, 'high': 0.9},
        'learning_rate': {'low': 0.01, 'high': 0.1}
    }

def run_doubleml_model(data, outcome, covariates, model_type='plr', n_trials=50):
    """
    Run DoubleML model with hyperparameter tuning
    
    Parameters:
    -----------
    data : DataFrame
        Complete dataset
    outcome : str
        Outcome variable name
    covariates : list
        List of covariate names
    model_type : str
        'plr' or 'irm'
    n_trials : int
        Number of Optuna trials
        
    Returns:
    --------
    dict : Model results
    """
    
    # Create DoubleML data object
    dml_data = DoubleMLData(
        data=data,
        y_col=outcome,
        d_cols=treatment_var,
        x_cols=covariates
    )
    
    # All nuisance learners draw from the same XGBoost search space,
    # so a single sampling function is shared across them
    def xgb_params(trial):
        hp = create_hyperparameter_space()
        return {
            'n_estimators': trial.suggest_int('n_estimators', hp['n_estimators']['low'],
                                              hp['n_estimators']['high'], step=hp['n_estimators']['step']),
            'max_depth': trial.suggest_int('max_depth', hp['max_depth']['low'], hp['max_depth']['high']),
            'min_child_weight': trial.suggest_int('min_child_weight', hp['min_child_weight']['low'],
                                                  hp['min_child_weight']['high']),
            'subsample': trial.suggest_float('subsample', hp['subsample']['low'], hp['subsample']['high']),
            'colsample_bytree': trial.suggest_float('colsample_bytree', hp['colsample_bytree']['low'],
                                                    hp['colsample_bytree']['high']),
            'learning_rate': trial.suggest_float('learning_rate', hp['learning_rate']['low'],
                                                 hp['learning_rate']['high'], log=True)
        }

    # Initialize XGBoost classifiers and the DoubleML model
    if model_type == 'plr':
        ml_l = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
        ml_m = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
        model = DoubleMLPLR(dml_data, ml_l=ml_l, ml_m=ml_m)
        param_space = {'ml_l': xgb_params, 'ml_m': xgb_params}

    elif model_type == 'irm':
        ml_g = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
        ml_m = XGBClassifier(objective='binary:logistic', eval_metric='logloss', random_state=42)
        model = DoubleMLIRM(dml_data, ml_g=ml_g, ml_m=ml_m)
        param_space = {'ml_g': xgb_params, 'ml_m': xgb_params}

    else:
        # Fail early instead of silently treating an unknown type as IRM
        raise ValueError(f"model_type must be 'plr' or 'irm', got {model_type!r}")
    
    # Optimize hyperparameters
    optuna_settings = {
        'n_jobs_optuna': -1,
        'show_progress_bar': True,
        'verbosity': optuna.logging.WARNING,
        'n_trials': n_trials
    }
    
    model.tune_ml_models(ml_param_space=param_space, optuna_settings=optuna_settings)
    
    # Fit the model
    model.fit()
    
    # Extract results
    summary = model.summary
    
    return {
        'model': model,
        'coef': model.coef[0],
        'se': model.se[0],
        'ci_lower': model.confint()['2.5 %'].values[0],
        'ci_upper': model.confint()['97.5 %'].values[0],
        'pval': model.pval[0],
        'n_obs': len(data),
        'summary': summary
    }

def create_results_table(results_dict, title):
    """
    Create formatted results table
    
    Parameters:
    -----------
    results_dict : dict
        Dictionary with model results
    title : str
        Table title
        
    Returns:
    --------
    DataFrame : Formatted results table
    """
    rows = []
    for key, result in results_dict.items():
        pval = result['pval']
        pval_str = "$< 0.0001$" if round(pval, 4) == 0 else f"{pval:.4f}"
    
        rows.append({
            'Model': key,
            'Coefficient': f"{result['coef']:.4f}",
            'Std. Error': f"{result['se']:.4f}",
            '95\\% CI': f"[{result['ci_lower']:.4f}, {result['ci_upper']:.4f}]",
            'p-value': pval_str,
            'n. obs.': f"{result['n_obs']:,}",
        })

    
    df = pd.DataFrame(rows)
    print(f"\n{'='*80}")
    print(f"{title.center(80)}")
    print(f"{'='*80}")
    print(df.to_string(index=False))
    print(f"{'='*80}\n")
    
    return df
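
To make the PLR estimate less of a black box, the sketch below reproduces the partialling-out idea by hand on simulated data: predict the outcome and the treatment from the controls out of fold, then regress the outcome residuals on the treatment residuals. It is only an illustration under stated assumptions. The data are simulated with a continuous treatment and a known effect of 0.5, the nuisance learners are plain least squares rather than tuned XGBoost, and all names (`ols_predict`, `theta_hat`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Simulated data (assumption): true effect theta = 0.5, linear confounding through X
X = rng.normal(size=(n, 3))
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
Y = 0.5 * D + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

def ols_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept on the training fold, predict on the test fold."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ beta

# 5-fold cross-fitting: every residual comes from a model that never saw that row
folds = np.array_split(rng.permutation(n), 5)
res_y = np.empty(n)
res_d = np.empty(n)
for test_idx in folds:
    train = np.ones(n, dtype=bool)
    train[test_idx] = False
    res_y[test_idx] = Y[test_idx] - ols_predict(X[train], Y[train], X[test_idx])
    res_d[test_idx] = D[test_idx] - ols_predict(X[train], D[train], X[test_idx])

# Partialling-out estimate and its standard error from the score function
theta_hat = (res_d @ res_y) / (res_d @ res_d)
psi = (res_y - theta_hat * res_d) * res_d
se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(res_d ** 2)

print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```

In `run_doubleml_model` the same two prediction tasks are handled by `ml_l` and `ml_m`, with the treatment being binary and the learners returning predicted probabilities.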

## 4. Analysis: SomeRiskHome

### 4.1 Base Models (PLR & IRM)

In [5]:
# Run base models for SomeRiskHome
somerisk_base_plr = run_doubleml_model(data_complete, 'SomeRiskHome', base_covariates, 'plr')
somerisk_base_irm = run_doubleml_model(data_complete, 'SomeRiskHome', base_covariates, 'irm')

# Create results table
somerisk_base_results = {
    'PLR': somerisk_base_plr,
    'IRM': somerisk_base_irm
}

somerisk_base_table = create_results_table(
    somerisk_base_results, 
    "SomeRiskHome - Base Models"
)

                           SomeRiskHome - Base Models                           
Model Coefficient Std. Error            95\% CI    p-value n. obs.
  PLR     -0.0708     0.0055 [-0.0816, -0.0600] $< 0.0001$  55,510
  IRM     -0.0529     0.0078 [-0.0682, -0.0376] $< 0.0001$  55,510
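
The interval and p-value columns follow from the coefficient and standard error under a normal approximation. As a check on how to read the table, the sketch below recomputes them for the PLR row using only the standard library. The helper name `normal_inference` is hypothetical, and because the inputs are the rounded values printed above, the last digit can differ from the library output in general.

```python
from math import erfc, sqrt
from statistics import NormalDist

def normal_inference(coef, se, level=0.95):
    """Return (lower, upper, two-sided p-value) under a normal approximation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    z = coef / se
    p_value = erfc(abs(z) / sqrt(2))                # equals 2 * (1 - Phi(|z|))
    return coef - z_crit * se, coef + z_crit * se, p_value

# PLR row of the base table: coefficient -0.0708, standard error 0.0055
lower, upper, p = normal_inference(-0.0708, 0.0055)
print(f"[{lower:.4f}, {upper:.4f}]")  # → [-0.0816, -0.0600]
print(p < 1e-4)                       # → True
```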



In [7]:
# Export to LaTeX
latex_base_somerisk = somerisk_base_table.to_latex(
    index=False,
    caption="SomeRiskHome - Base Models (PLR and IRM)",
    label="tab:somerisk_base"
)

# Save to tables/ folder (created here if it does not exist yet)
import os
os.makedirs('tables', exist_ok=True)
with open('tables/somerisk_base.tex', 'w') as f:
    f.write(latex_base_somerisk)

### 4.2 Extended Models (PLR & IRM)

In [8]:
# Run extended models for SomeRiskHome
somerisk_ext_plr = run_doubleml_model(data_complete, 'SomeRiskHome', extended_covariates, 'plr')
somerisk_ext_irm = run_doubleml_model(data_complete, 'SomeRiskHome', extended_covariates, 'irm')

# Create results table
somerisk_ext_results = {
    'PLR': somerisk_ext_plr,
    'IRM': somerisk_ext_irm
}

somerisk_ext_table = create_results_table(
    somerisk_ext_results, 
    "SomeRiskHome - Extended Models"
)

                         SomeRiskHome - Extended Models                         
Model Coefficient Std. Error            95\% CI    p-value n. obs.
  PLR     -0.0864     0.0050 [-0.0962, -0.0766] $< 0.0001$  55,510
  IRM     -0.0774     0.0081 [-0.0932, -0.0616] $< 0.0001$  55,510



In [9]:
# Export to LaTeX
latex_ext_somerisk = somerisk_ext_table.to_latex(index=False, caption="SomeRiskHome - Extended Models (PLR and IRM)", label="tab:somerisk_ext")

# Save to tables/ folder
with open('tables/somerisk_ext.tex', 'w') as f:
    f.write(latex_ext_somerisk)

### 4.3 Subsample Analysis by RiskSource

In [10]:
# Subsample analysis for SomeRiskHome
somerisk_subsample_results = {}

for risk_cat in risk_categories:
    print(f"\n{'='*60}")
    print(f"RiskSource: {risk_cat}")
    print(f"{'='*60}")
    
    # Filter data for this RiskSource category
    subsample_data = data_complete[data_complete['RiskSource'] == risk_cat].copy()
    print(f"Sample size: {len(subsample_data)}")
    
    if len(subsample_data) < 100:
        print(f"Skipping {risk_cat}: fewer than 100 observations")
        continue
    
    # Run all 4 models
    base_plr = run_doubleml_model(subsample_data, 'SomeRiskHome', base_covariates, 'plr', n_trials=50)
    base_irm = run_doubleml_model(subsample_data, 'SomeRiskHome', base_covariates, 'irm', n_trials=50)
    ext_plr = run_doubleml_model(subsample_data, 'SomeRiskHome', extended_covariates, 'plr', n_trials=50)
    ext_irm = run_doubleml_model(subsample_data, 'SomeRiskHome', extended_covariates, 'irm', n_trials=50)
    
    somerisk_subsample_results[risk_cat] = {
        'Base PLR': base_plr,
        'Base IRM': base_irm,
        'Extended PLR': ext_plr,
        'Extended IRM': ext_irm
    }
    
    # Create table for this subsample
    subsample_table = create_results_table(
        somerisk_subsample_results[risk_cat],
        f"SomeRiskHome - RiskSource: {risk_cat}"
    )
    
    # Export to LaTeX (risk_name is a label- and file-safe version of the category)
    risk_name = risk_cat.replace(' ', '_').lower()
    latex_subsample = subsample_table.to_latex(
        index=False, 
        caption=f"SomeRiskHome - RiskSource: {risk_cat}",
        label=f"tab:somerisk_{risk_name}"
    )
    
    # Save to tables/ folder
    filename = f'tables/somerisk_{risk_name}.tex'
    with open(filename, 'w') as f:
        f.write(latex_subsample)


RiskSource: No risk
Sample size: 23323


                       SomeRiskHome - RiskSource: No risk                       
       Model Coefficient Std. Error            95\% CI    p-value n. obs.
    Base PLR     -0.0641     0.0101 [-0.0839, -0.0443] $< 0.0001$  23,323
    Base IRM     -0.0633     0.0149 [-0.0924, -0.0342] $< 0.0001$  23,323
Extended PLR     -0.0601     0.0101 [-0.0800, -0.0403] $< 0.0001$  23,323
Extended IRM     -0.0681     0.0154 [-0.0982, -0.0380] $< 0.0001$  23,323


RiskSource: Moderate to high risk
Sample size: 20603


                SomeRiskHome - RiskSource: Moderate to high risk                
       Model Coefficient Std. Error            95\% CI    p-value n. obs.
    Base PLR     -0.1145     0.0074 [-0.1291, -0.0999] $< 0.0001$  20,603
    Base IRM     -0.0994     0.0117 [-0.1223, -0.0766] $< 0.0001$  20,603
Extended PLR     -0.1120     0.0074 [-0.1265, -0.0975] $< 0.0001$  20,603
Extended IRM     -0.0993     0.0108 [-0.1205, -0.0781] $< 0.0001$  20,603


RiskSource: Very high risk
Sample size: 11584


                   SomeRiskHome - RiskSource: Very high risk                    
       Model Coefficient Std. Error            95\% CI    p-value n. obs.
    Base PLR     -0.0837     0.0068 [-0.0969, -0.0704] $< 0.0001$  11,584
    Base IRM     -0.0651     0.0078 [-0.0803, -0.0498] $< 0.0001$  11,584
Extended PLR     -0.0825     0.0068 [-0.0959, -0.0692] $< 0.0001$  11,584
Extended IRM     -0.0685     0.0085 [-0.0852, -0.0519] $< 0.0001$  11,584



## 5. Analysis: VeryHighRiskHome

### 5.1 Base Models (PLR & IRM)

In [11]:
# Run base models for VeryHighRiskHome
veryhigh_base_plr = run_doubleml_model(data_complete, 'VeryHighRiskHome', base_covariates, 'plr')
veryhigh_base_irm = run_doubleml_model(data_complete, 'VeryHighRiskHome', base_covariates, 'irm')

# Create results table
veryhigh_base_results = {
    'PLR': veryhigh_base_plr,
    'IRM': veryhigh_base_irm
}

veryhigh_base_table = create_results_table(
    veryhigh_base_results, 
    "VeryHighRiskHome - Base Models"
)

                         VeryHighRiskHome - Base Models                         
Model Coefficient Std. Error            95\% CI    p-value n. obs.
  PLR     -0.0525     0.0058 [-0.0640, -0.0411] $< 0.0001$  55,510
  IRM     -0.0467     0.0084 [-0.0632, -0.0302] $< 0.0001$  55,510



In [12]:
# Export to LaTeX
latex_base_veryhigh = veryhigh_base_table.to_latex(index=False, caption="VeryHighRiskHome - Base Models (PLR and IRM)", label="tab:veryhigh_base")

# Save to tables/ folder
with open('tables/veryhigh_base.tex', 'w') as f:
    f.write(latex_base_veryhigh)

### 5.2 Extended Models (PLR & IRM)

In [13]:
# Run extended models for VeryHighRiskHome
veryhigh_ext_plr = run_doubleml_model(data_complete, 'VeryHighRiskHome', extended_covariates, 'plr')
veryhigh_ext_irm = run_doubleml_model(data_complete, 'VeryHighRiskHome', extended_covariates, 'irm')

# Create results table
veryhigh_ext_results = {
    'PLR': veryhigh_ext_plr,
    'IRM': veryhigh_ext_irm
}

veryhigh_ext_table = create_results_table(
    veryhigh_ext_results, 
    "VeryHighRiskHome - Extended Models"
)

                       VeryHighRiskHome - Extended Models                       
Model Coefficient Std. Error            95\% CI    p-value n. obs.
  PLR     -0.0652     0.0055 [-0.0759, -0.0545] $< 0.0001$  55,510
  IRM     -0.0645     0.0078 [-0.0798, -0.0492] $< 0.0001$  55,510



In [14]:
# Export to LaTeX
latex_ext_veryhigh = veryhigh_ext_table.to_latex(index=False, caption="VeryHighRiskHome - Extended Models (PLR and IRM)", label="tab:veryhigh_ext")

# Save to tables/ folder
with open('tables/veryhigh_ext.tex', 'w') as f:
    f.write(latex_ext_veryhigh)

### 5.3 Subsample Analysis by RiskSource

In [15]:
# Subsample analysis for VeryHighRiskHome
print("Running Subsample Analysis for VeryHighRiskHome...\n")

veryhigh_subsample_results = {}

for risk_cat in risk_categories:
    print(f"\n{'='*60}")
    print(f"RiskSource: {risk_cat}")
    print(f"{'='*60}")
    
    # Filter data for this RiskSource category
    subsample_data = data_complete[data_complete['RiskSource'] == risk_cat].copy()
    print(f"Sample size: {len(subsample_data)}")
    
    if len(subsample_data) < 100:
        print(f"Skipping {risk_cat}: fewer than 100 observations")
        continue
    
    # Run all 4 models
    base_plr = run_doubleml_model(subsample_data, 'VeryHighRiskHome', base_covariates, 'plr', n_trials=50)
    base_irm = run_doubleml_model(subsample_data, 'VeryHighRiskHome', base_covariates, 'irm', n_trials=50)
    ext_plr = run_doubleml_model(subsample_data, 'VeryHighRiskHome', extended_covariates, 'plr', n_trials=50)
    ext_irm = run_doubleml_model(subsample_data, 'VeryHighRiskHome', extended_covariates, 'irm', n_trials=50)
    
    veryhigh_subsample_results[risk_cat] = {
        'Base PLR': base_plr,
        'Base IRM': base_irm,
        'Extended PLR': ext_plr,
        'Extended IRM': ext_irm
    }
    
    # Create table for this subsample
    subsample_table = create_results_table(
        veryhigh_subsample_results[risk_cat],
        f"VeryHighRiskHome - RiskSource: {risk_cat}"
    )
    
    # Export to LaTeX (risk_name is a label- and file-safe version of the category)
    risk_name = risk_cat.replace(' ', '_').lower()
    latex_subsample = subsample_table.to_latex(
        index=False, 
        caption=f"VeryHighRiskHome - RiskSource: {risk_cat}",
        label=f"tab:veryhigh_{risk_name}"
    )
    
    # Save to tables/ folder
    filename = f'tables/veryhigh_{risk_name}.tex'
    with open(filename, 'w') as f:
        f.write(latex_subsample)

Running Subsample Analysis for VeryHighRiskHome...


RiskSource: No risk
Sample size: 23323


                     VeryHighRiskHome - RiskSource: No risk                     
       Model Coefficient Std. Error           95\% CI p-value n. obs.
    Base PLR     -0.0034     0.0063 [-0.0158, 0.0091]  0.5962  23,323
    Base IRM      0.0063     0.0110 [-0.0152, 0.0278]  0.5668  23,323
Extended PLR     -0.0050     0.0063 [-0.0174, 0.0073]  0.4256  23,323
Extended IRM     -0.0010     0.0115 [-0.0235, 0.0215]  0.9311  23,323


RiskSource: Moderate to high risk
Sample size: 20603


              VeryHighRiskHome - RiskSource: Moderate to high risk              
       Model Coefficient Std. Error            95\% CI    p-value n. obs.
    Base PLR     -0.0384     0.0087 [-0.0553, -0.0214] $< 0.0001$  20,603
    Base IRM     -0.0320     0.0152 [-0.0618, -0.0022]     0.0353  20,603
Extended PLR     -0.0376     0.0086 [-0.0544, -0.0207] $< 0.0001$  20,603
Extended IRM     -0.0472     0.0134 [-0.0735, -0.0209]     0.0004  20,603


RiskSource: Very high risk
Sample size: 11584


                 VeryHighRiskHome - RiskSource: Very high risk                  
       Model Coefficient Std. Error            95\% CI    p-value n. obs.
    Base PLR     -0.2110     0.0129 [-0.2364, -0.1857] $< 0.0001$  11,584
    Base IRM     -0.2164     0.0198 [-0.2552, -0.1776] $< 0.0001$  11,584
Extended PLR     -0.2096     0.0130 [-0.2351, -0.1841] $< 0.0001$  11,584
Extended IRM     -0.2188     0.0179 [-0.2538, -0.1838] $< 0.0001$  11,584
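
The subsample tables invite a comparison across `RiskSource` categories. One rough way to gauge whether two subsample coefficients differ is a z-test on their difference, sketched below for the Base PLR rows of the "Very high risk" and "No risk" tables. This is not part of the notebook's analysis. It assumes the two estimates are independent, which is plausible for disjoint subsamples but ignores any shared survey clustering, and the helper name `difference_z_test` is hypothetical.

```python
from math import erfc, sqrt

def difference_z_test(coef_a, se_a, coef_b, se_b):
    """z-test for coef_a - coef_b, assuming the two estimates are independent."""
    diff = coef_a - coef_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)  # variances add only under independence
    z = diff / se_diff
    p_value = erfc(abs(z) / sqrt(2))       # two-sided normal p-value
    return diff, se_diff, z, p_value

# Base PLR, VeryHighRiskHome: "Very high risk" vs "No risk" source
diff, se_diff, z, p = difference_z_test(-0.2110, 0.0129, -0.0034, 0.0063)
print(f"difference = {diff:.4f}, se = {se_diff:.4f}, z = {z:.2f}")
```

A large absolute z here would suggest that the association between treatment and `VeryHighRiskHome` is concentrated among households whose source water is itself very high risk, in line with the near-zero coefficients in the "No risk" table.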

