# ATE Mechanism Analysis

This notebook uses **Causal Forest DML** to test **mechanism effects**: we estimate the average effect of **Digital literacy** on two mediators, **Online social network** and **Entrepreneurship**.

- **Treatment (T)**: Digital literacy (first-level index).
- **Outcomes (Y)**: Online social network; Entrepreneurship (each in turn).
- **Covariates (X)**: Control variables with City one-hot encoded.
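
For orientation, Causal Forest DML is usually described through a partially linear model. The symbols below ($\theta$, $g$, $f$, $\varepsilon$, $\eta$) are notation assumed here for illustration; they do not appear elsewhere in the notebook:

```latex
\begin{aligned}
Y &= \theta(X)\,T + g(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid X, T] &= 0,\\
T &= f(X) + \eta, & \mathbb{E}[\eta \mid X] &= 0,\\
\mathrm{ATE} &= \mathbb{E}\big[\theta(X)\big].
\end{aligned}
```

Under this reading, `model_y` estimates $\mathbb{E}[Y \mid X]$, `model_t` estimates $f(X) = \mathbb{E}[T \mid X]$, and the forest estimates $\theta(X)$ from the residuals. Because the treatment is continuous, the ATE is the average change in the outcome per one-unit increase in Digital literacy.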

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.exceptions import DataConversionWarning
from econml.dml import CausalForestDML
from scipy.stats import norm
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore', category=DataConversionWarning)
warnings.filterwarnings('ignore', message='.*nonzero intercept.*', module='econml')
plt.rcParams['font.sans-serif'] = ['DejaVu Sans', 'Arial']
plt.rcParams['axes.unicode_minus'] = False

RANDOM_STATE = 42
OUTPUT_DIR = './results'
os.makedirs(OUTPUT_DIR, exist_ok=True)

HYPERPARAMS = {
    "n_estimators": 500,
    "min_samples_split": 50,
    "min_samples_leaf": 18,
    "max_samples": 0.4,
}

def run_causal_model(Y, T, X, param_dict):
    """Fit CausalForestDML and return ATE, 95% CI, StdErr, p-value, and significance stars."""
    X_train, X_test, T_train, T_test, Y_train, Y_test = train_test_split(
        X, T, Y, test_size=0.3, random_state=RANDOM_STATE
    )
    model_y = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=RANDOM_STATE)
    model_t = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=RANDOM_STATE)
    est = CausalForestDML(
        model_y=model_y,
        model_t=model_t,
        discrete_treatment=False,
        min_balancedness_tol=0.45,
        honest=True,
        inference=True,
        random_state=RANDOM_STATE,
        **param_dict
    )
    est.fit(Y_train, T_train, X=X_train)
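    # Per-observation marginal effects and 95% intervals on the held-out 30%.
    te_pred = est.const_marginal_effect(X_test)
    te_lo, te_hi = est.const_marginal_effect_interval(X_test, alpha=0.05)
    # ATE = mean of the held-out effects.
    avg_effect = te_pred.mean()
    # Heuristic: average the pointwise bounds, then back out a standard error
    # from the interval half-width (normal approximation).
    ci_lo, ci_hi = te_lo.mean(), te_hi.mean()
    stderr = (ci_hi - ci_lo) / (2 * 1.96)
    # Two-sided Wald test of H0: ATE = 0.
    z = avg_effect / stderr
    p = 2 * (1 - norm.cdf(abs(z)))
    # Stars: *** p < 0.01, ** p < 0.05, * p < 0.1.
    sig = '***' if p < 0.01 else '**' if p < 0.05 else '*' if p < 0.1 else ''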
    return {
        'ATE': round(avg_effect, 4),
        'StdErr': round(stderr, 4),
        'p_value': round(p, 4),
        'sig': sig
    }
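
The standard error in `run_causal_model` comes from averaging the per-observation interval bounds, which amounts to taking the mean of the pointwise standard errors. The sketch below uses made-up standard errors (not values from this dataset) to show where that heuristic sits relative to two reference cases. It is an illustration of the arithmetic only, not a statement about the forest's actual error correlation.

```python
import numpy as np

# Hypothetical pointwise standard errors for five held-out observations
# (illustrative values, not taken from the notebook's data).
se_pointwise = np.array([0.8, 1.1, 0.9, 1.3, 1.0])
n = se_pointwise.size

# What averaging the interval bounds recovers: the mean pointwise SE.
se_heuristic = se_pointwise.mean()

# SE of the mean if the five estimates were mutually independent.
se_independent = np.sqrt(np.sum(se_pointwise ** 2)) / n

# SE of the mean if the five estimates were perfectly positively correlated.
se_perfectly_correlated = se_pointwise.sum() / n

print(f"heuristic:            {se_heuristic:.4f}")
print(f"independent:          {se_independent:.4f}")
print(f"perfectly correlated: {se_perfectly_correlated:.4f}")
```

The heuristic coincides with the perfectly correlated case, so it acts as an upper bound on the standard error of the mean effect for fixed test points. EconML estimators also expose a population-level interval through `ate_interval`, which may be worth comparing against this value.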



---
## Load data and define variables

Variable names are standardized for analysis. We use controls (X), treatment (Digital literacy), and two mechanism outcomes (Online social network, Entrepreneurship).

In [2]:
df = pd.read_excel('data/data.xlsx')

X_cols_name = [
    'Gender', 'Age', 'Health status', 'Education level', 'Growing experience',
    'Marital status', 'Growing area', 'Labourer', 'Production facility', 'Storage facility',
    'Agricultural insurance', 'Loan', 'Social expenditure', 'Clan status', 'Natural disaster',
    'Training', 'Brand label usage', 'Logistics convenience', 'City'
]
D_cols_name = ['Digital literacy', 'Digital device access', 'Digital information acquisition', 'Digital platform usage']
Y_cols_name = ['kakwani_new', 'Household total income']
M_cols_name = ['Online social network', 'Entrepreneurship']

X_df = df[X_cols_name]
D_df = df[D_cols_name]
Y_df = df[Y_cols_name]
M_df = df[M_cols_name]


---
## Preprocessing: X (one-hot City), T, mechanism outcomes

In [3]:
X = X_df.copy()
categorical_cols = ['City']
enc = OneHotEncoder(drop='first', sparse_output=False)
encoded_region = enc.fit_transform(X[categorical_cols])
encoded_region_df = pd.DataFrame(encoded_region, columns=enc.get_feature_names_out(categorical_cols))
X = pd.concat([X.drop(columns=categorical_cols).reset_index(drop=True), encoded_region_df], axis=1)

T2 = D_df[['Digital literacy']].values
Y_m1 = M_df['Online social network'].values
Y_m2 = M_df['Entrepreneurship'].values
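
To see what the drop-first encoding does to `City`, here is a minimal sketch on a toy frame. It uses `pd.get_dummies` as a stand-in for `OneHotEncoder(drop='first')`, and the city labels `A`, `B`, `C` are placeholders rather than values from `data.xlsx`.

```python
import pandas as pd

# Hypothetical toy frame; city labels are placeholders, not values from data.xlsx.
toy = pd.DataFrame({
    "Age": [34, 51, 45, 29],
    "City": ["A", "B", "C", "A"],
})

# drop_first=True removes the first category ("A"), which becomes the baseline.
dummies = pd.get_dummies(toy[["City"]], drop_first=True, dtype=float)

# Reset both indices so rows are aligned by position before concatenating.
toy_X = pd.concat(
    [toy.drop(columns=["City"]).reset_index(drop=True),
     dummies.reset_index(drop=True)],
    axis=1,
)
print(toy_X)
```

With three cities, the encoded frame carries two indicator columns; rows from the baseline city have zeros in both. Resetting the index on both sides matters whenever the source frame has a non-default index, since `pd.concat(axis=1)` aligns on index labels.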


---
## Mechanism analysis: Digital literacy on each mediator

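We fit Causal Forest DML for each mechanism outcome (Online social network, Entrepreneurship) and report the ATE, standard error, p-value, and significance stars (\*\*\* p < 0.01, \*\* p < 0.05, \* p < 0.1). The 95% interval is used only to derive the standard error.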

In [4]:
mechanism_dict = {
    'Online social network': Y_m1,
    'Entrepreneurship': Y_m2,
}

results = []
for mech_name, Y in mechanism_dict.items():
    res = run_causal_model(Y, T2, X, HYPERPARAMS)
    results.append({
        'mechanism': mech_name,
        **res
    })
    print(f"Digital literacy on '{mech_name}': ATE = {res['ATE']}, StdErr = {res['StdErr']}, p = {res['p_value']}, sig = {res['sig']}")

results_df = pd.DataFrame(results)
results_df.to_csv(os.path.join(OUTPUT_DIR, 'CFDML_Mechanism_results.csv'), index=False, encoding='utf-8-sig')
print("\nResults saved to CFDML_Mechanism_results.csv")
display(results_df)

Digital literacy on 'Online social network': ATE = 4.8197, StdErr = 2.2038, p = 0.0287, sig = **
Digital literacy on 'Entrepreneurship': ATE = 0.1759, StdErr = 0.0808, p = 0.0294, sig = **

Results saved to CFDML_Mechanism_results.csv


| | mechanism | ATE | StdErr | p_value | sig |
|---|---|---|---|---|---|
| 0 | Online social network | 4.8197 | 2.2038 | 0.0287 | ** |
| 1 | Entrepreneurship | 0.1759 | 0.0808 | 0.0294 | ** |
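
As a consistency check, the p-values can be recomputed from the rounded ATE and StdErr printed by the cell. This sketch uses only the standard library; the helper names `wald_p_value` and `stars` are introduced here for illustration and are not part of the notebook.

```python
from statistics import NormalDist

def wald_p_value(ate, stderr):
    """Two-sided p-value for H0: ATE = 0 under a normal approximation."""
    z = ate / stderr
    return 2 * (1 - NormalDist().cdf(abs(z)))

def stars(p):
    """Same thresholds as run_causal_model: 0.01, 0.05, 0.1."""
    return '***' if p < 0.01 else '**' if p < 0.05 else '*' if p < 0.1 else ''

# Rounded (ATE, StdErr, p_value) copied from the printed results.
reported = {
    "Online social network": (4.8197, 2.2038, 0.0287),
    "Entrepreneurship": (0.1759, 0.0808, 0.0294),
}

for name, (ate, se, p_reported) in reported.items():
    p = wald_p_value(ate, se)
    print(f"{name}: z = {ate / se:.3f}, p = {p:.4f} "
          f"(reported {p_reported}), sig = {stars(p)}")
```

Small differences in the last digit are expected, because the reported p-values were computed from unrounded estimates.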
