 # Appendix: Meta-Learner Implementations - Algorithm Understanding Through Hand-Coded Models

 A technical deep dive into uplift-modeling meta-learners. Each approach estimates the Conditional Average Treatment Effect (CATE) in a different way; each is implemented from scratch and checked against the EconML library.

 **CATE Definition:**
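 $$\mathrm{CATE}(x) = \tau(x) = E[Y(1) - Y(0) \mid X = x] = E[Y \mid T = 1, X = x] - E[Y \mid T = 0, X = x]$$

 The last equality holds when treatment assignment is independent of the potential outcomes given X, as in a randomized experiment.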
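
As a sanity check on the definition, the sketch below simulates a small randomized experiment with a known effect and recovers it from the difference of conditional means. The data-generating process (one binary covariate, the base rates, the effect sizes) is a made-up assumption for illustration and is unrelated to the Criteo data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: binary covariate x,
# randomized treatment t, binary outcome y.
x = rng.integers(0, 2, size=n)
t = rng.integers(0, 2, size=n)              # randomized, independent of x
base_rate = np.where(x == 1, 0.10, 0.05)    # P(Y=1 | T=0, X=x)
true_cate = np.where(x == 1, 0.04, 0.01)    # tau(x) by construction
y = rng.random(n) < base_rate + t * true_cate

def cate_by_difference(x_value):
    """E[Y | T=1, X=x] - E[Y | T=0, X=x], estimated with sample means."""
    treated_rate = y[(x == x_value) & (t == 1)].mean()
    control_rate = y[(x == x_value) & (t == 0)].mean()
    return treated_rate - control_rate

estimates = {v: cate_by_difference(v) for v in (0, 1)}
print(estimates)  # should land near 0.01 for x=0 and 0.04 for x=1, up to sampling noise
```

With real data the covariates are continuous and high-dimensional, so the sample means are replaced by fitted models; that is the role of the meta-learners.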

 **Implementation Process:**
 1. **Data Setup**: Same setup as the main notebooks: a 10% sample of the data with a 60/20/20 train/validation/test split
 2. **Model Training**: Implement each meta-learner approach using XGBoost base models with fixed hyperparameters
 3. **Final Training**: Train on combined train+validation data
 4. **Evaluation**: Test on held-out test set
 5. **Verification**: Compare hand-coded implementations with EconML library

 **Purpose**: Demonstrates understanding beyond API calls by validating the hand-coded implementations against a production-quality library
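
The evaluation step is not spelled out in the cells of this appendix. One simple way to assess a CATE ranking on a held-out set, assuming randomized treatment, is to compare the observed uplift among the highest-scoring rows with the overall uplift. The helper `uplift_at_top_fraction` and the toy data below are hypothetical illustrations, not part of the notebook.

```python
import numpy as np

def uplift_at_top_fraction(y, t, cate_scores, fraction=0.2):
    """Observed uplift (treated rate - control rate) among the top-scoring rows.

    Returns NaN if the selected rows contain only one treatment group.
    """
    k = max(1, int(len(cate_scores) * fraction))
    top = np.argsort(-cate_scores)[:k]      # indices of the highest predicted CATE
    y_top, t_top = y[top], t[top]
    return y_top[t_top == 1].mean() - y_top[t_top == 0].mean()

# Toy data: the score is informative about the true effect by construction.
rng = np.random.default_rng(1)
n = 100_000
score = rng.random(n)                       # stand-in for predicted CATE
t = rng.integers(0, 2, size=n)              # randomized treatment
y = (rng.random(n) < 0.05 + t * 0.10 * score).astype(float)

top_uplift = uplift_at_top_fraction(y, t, score, fraction=0.2)
overall_uplift = y[t == 1].mean() - y[t == 0].mean()
print(top_uplift, overall_uplift)
```

In this notebook the same helper could be called as `uplift_at_top_fraction(Y_test, T_test, s_cate)`; a useful model should show a higher uplift in the top fraction than overall.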


In [None]:
import polars as pl
import numpy as np
from xgboost import XGBClassifier, XGBRegressor
from sklearn.model_selection import train_test_split
from sklift.metrics import uplift_auc_score
from econml.metalearners import SLearner, TLearner, XLearner
from econml.dr import DRLearner
from sklearn.base import BaseEstimator, RegressorMixin
import warnings
warnings.filterwarnings('ignore')

# Wrapper to make classifier output probabilities for EconML
# EconML expects continuous outputs for regression tasks, but we use classifiers
# for binary outcomes. This wrapper makes classifiers compatible by outputting
# probabilities instead of discrete class predictions.
class ClassifierAsRegressor(BaseEstimator, RegressorMixin):
    """Wrapper to make a classifier behave like a regressor by outputting probabilities"""
    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y):
        self.classifier.fit(X, y)
        return self

    def predict(self, X):
        # Return probability of positive class instead of discrete prediction
        return self.classifier.predict_proba(X)[:, 1]

In [None]:
# DATA PREPARATION
# Load 10% sample of Criteo dataset for faster execution
df = pl.read_csv("data/criteo-uplift-v2.1.csv").sample(fraction=0.10, seed=42)

# Data preparation
feature_cols = [f'f{i}' for i in range(12)]
X = df.select(feature_cols).to_numpy()
T = df.select('treatment').to_numpy().ravel()
Y = df.select('conversion').to_numpy().ravel()

# Train/validation/test split (60/20/20)
X_train, X_temp, Y_train, Y_temp, T_train, T_temp = train_test_split(
    X, Y, T, test_size=0.4, random_state=42, stratify=T
)
X_val, X_test, Y_val, Y_test, T_val, T_test = train_test_split(
    X_temp, Y_temp, T_temp, test_size=0.5, random_state=42, stratify=T_temp
)

# Combine train+val for final training
X_trainval = np.vstack([X_train, X_val])
Y_trainval = np.hstack([Y_train, Y_val])
T_trainval = np.hstack([T_train, T_val])

# Final data shapes: Train+Val (80%) for model training, Test (20%) for evaluation

 ## S-Learner (Single Model Approach)

 **Method:** Train one model with treatment as a feature to estimate E[Y|X,T], then predict outcomes under both treatment conditions

 **Algorithm:**
 1. **Model Training**: Single model learns E[Y|X,T] using features X plus treatment indicator T
 2. **CATE Estimation**: τ(X) = μ(X,T=1) - μ(X,T=0) by predicting for each observation under both treatments
 3. **Advantages**: Simple, uses all data for one model
 4. **Disadvantages**: Treatment is just one feature among many, so regularization or few splits on T can shrink the estimated effect toward zero

 **Implementation**: XGBoost classifier with treatment as additional feature, then difference predictions

In [None]:
# Use a fixed set of hyperparameters (grid search skipped)
best_s_params = {'learning_rate': 0.01, 'scale_pos_weight': 50}

# Create augmented features
X_trainval_with_treatment = np.column_stack([X_trainval, T_trainval])
X_test_treatment = np.column_stack([X_test, np.ones(len(X_test))])
X_test_control = np.column_stack([X_test, np.zeros(len(X_test))])

# Train S-Learner model
s_model = XGBClassifier(
    max_depth=5,
    learning_rate=best_s_params['learning_rate'],
    scale_pos_weight=best_s_params['scale_pos_weight'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
s_model.fit(X_trainval_with_treatment, Y_trainval)

# Generate CATE predictions
p1_test = s_model.predict_proba(X_test_treatment)[:, 1]
p0_test = s_model.predict_proba(X_test_control)[:, 1]
s_cate = p1_test - p0_test

# EconML S-Learner with EXACT same model wrapped to output probabilities
econml_s = SLearner(
    overall_model=ClassifierAsRegressor(XGBClassifier(
        max_depth=5,
        learning_rate=best_s_params['learning_rate'],
        scale_pos_weight=best_s_params['scale_pos_weight'],
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
)
econml_s.fit(Y_trainval, T_trainval, X=X_trainval)
econml_s_cate = econml_s.effect(X_test)

# Calculate correlation between implementations
s_corr = np.corrcoef(s_cate, econml_s_cate)[0,1]

 ## T-Learner (Two Model Approach)

 **Method:** Train separate models for treatment and control groups, then take the difference

 **Algorithm:**
 1. **Split Data**: Separate training data by treatment assignment
 2. **Model Training**: μ₁(X) learns E[Y|X,T=1] on treated, μ₀(X) learns E[Y|X,T=0] on control
 3. **CATE Estimation**: τ(X) = μ₁(X) - μ₀(X)
 4. **Advantages**: Flexible, allows different response functions for each group
 5. **Disadvantages**: High variance with unbalanced treatments, data splitting reduces power

 **Implementation**: Two XGBoost classifiers trained on separate treatment groups

In [None]:
# T-LEARNER COMPARISON
# T-Learner: Two separate models for treatment and control groups
# CATE = μ₁(X) - μ₀(X) where μ₁ and μ₀ are separate models

# Use fixed hyperparameters
best_t_params = {'learning_rate': 0.01, 'scale_pos_weight': 50}

# Split data by treatment
X_trainval_treatment = X_trainval[T_trainval == 1]
Y_trainval_treatment = Y_trainval[T_trainval == 1]
X_trainval_control = X_trainval[T_trainval == 0]
Y_trainval_control = Y_trainval[T_trainval == 0]

# Train T-Learner models
t_treatment = XGBClassifier(
    max_depth=5,
    learning_rate=best_t_params['learning_rate'],
    scale_pos_weight=best_t_params['scale_pos_weight'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
t_control = XGBClassifier(
    max_depth=5,
    learning_rate=best_t_params['learning_rate'],
    scale_pos_weight=best_t_params['scale_pos_weight'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

t_treatment.fit(X_trainval_treatment, Y_trainval_treatment)
t_control.fit(X_trainval_control, Y_trainval_control)

# Generate CATE predictions
p1_test = t_treatment.predict_proba(X_test)[:, 1]
p0_test = t_control.predict_proba(X_test)[:, 1]
t_cate = p1_test - p0_test

# EconML T-Learner with EXACT same models wrapped to output probabilities
econml_t = TLearner(
    models=[
        ClassifierAsRegressor(XGBClassifier(  # Control model
            max_depth=5,
            learning_rate=best_t_params['learning_rate'],
            scale_pos_weight=best_t_params['scale_pos_weight'],
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        )),
        ClassifierAsRegressor(XGBClassifier(  # Treatment model
            max_depth=5,
            learning_rate=best_t_params['learning_rate'],
            scale_pos_weight=best_t_params['scale_pos_weight'],
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        ))
    ]
)
econml_t.fit(Y_trainval, T_trainval, X=X_trainval)
econml_t_cate = econml_t.effect(X_test)

# Calculate correlation between implementations
t_corr = np.corrcoef(t_cate, econml_t_cate)[0,1]

 ## X-Learner (Cross-Prediction with Propensity Weighting)

 **Method:** Advanced approach addressing treatment imbalance by using cross-predictions and propensity-weighted averaging

 **Algorithm:**
 1. **Stage 1**: Same as T-Learner - train μ₁(X) on treated, μ₀(X) on control
 2. **Impute Effects**: Create pseudo treatment effects using cross-predictions:
    - τ̃₁(X) = Y - μ₀(X) for treated (observed - counterfactual control)
    - τ̃₀(X) = μ₁(X) - Y for control (counterfactual treatment - observed)
 3. **Stage 2**: Train CATE models on imputed effects:
    - τ₁(X) models treatment effects using treated data
    - τ₀(X) models treatment effects using control data
 4. **Weighted Average**: τ(X) = g(X)·τ₀(X) + (1-g(X))·τ₁(X)

 **Propensity Weighting Logic:**
 - τ₀(X) relies on treatment model accuracy → weight by P(T=1|X) (more treatment data = more reliable)
 - τ₁(X) relies on control model accuracy → weight by P(T=0|X) (more control data = more reliable)

 **Advantages**: Handles treatment imbalance, uses all data effectively

 **Implementation**: Four-stage process with XGBoost models plus propensity weighting
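
To make the weighting concrete, the sketch below applies the blend τ(X) = g(X)·τ₀(X) + (1-g(X))·τ₁(X) to made-up numbers. The values of `tau0`, `tau1`, and `g` are assumptions chosen only to show how the propensity shifts weight between the two stage-2 models.

```python
import numpy as np

def x_learner_blend(tau0, tau1, propensity):
    """tau(x) = g(x) * tau0(x) + (1 - g(x)) * tau1(x)."""
    return propensity * tau0 + (1.0 - propensity) * tau1

# Hypothetical stage-2 predictions for three rows with identical tau0 and tau1
# but different propensity scores.
tau0 = np.array([0.02, 0.02, 0.02])   # effect model trained on control rows
tau1 = np.array([0.06, 0.06, 0.06])   # effect model trained on treated rows
g = np.array([0.15, 0.50, 0.85])      # P(T=1 | X)

blended = x_learner_blend(tau0, tau1, g)
print(blended)  # moves from near tau1 (low propensity) toward tau0 (high propensity)
```

When most units are treated, g(X) is large and the blend leans on τ₀, whose targets were imputed with the well-trained treatment outcome model μ₁.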

In [None]:
# X-LEARNER COMPARISON
# X-Learner: Advanced method using cross-predictions and propensity weighting
# Addresses treatment imbalance by using information from both groups
# CATE = g(X)·τ₀(X) + (1-g(X))·τ₁(X) where g(X) is propensity score

# Use fixed hyperparameters
best_x_params = {'learning_rate': 0.1, 'scale_pos_weight': 50}

# Stage 1: Outcome models
x_mu1 = XGBClassifier(
    max_depth=5,
    learning_rate=best_x_params['learning_rate'],
    scale_pos_weight=best_x_params['scale_pos_weight'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
x_mu0 = XGBClassifier(
    max_depth=5,
    learning_rate=best_x_params['learning_rate'],
    scale_pos_weight=best_x_params['scale_pos_weight'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

x_mu1.fit(X_trainval_treatment, Y_trainval_treatment)
x_mu0.fit(X_trainval_control, Y_trainval_control)

# Impute treatment effects
mu0_pred_treatment = x_mu0.predict_proba(X_trainval_treatment)[:, 1]
mu1_pred_control = x_mu1.predict_proba(X_trainval_control)[:, 1]
tau1_imputed = Y_trainval_treatment - mu0_pred_treatment
tau0_imputed = mu1_pred_control - Y_trainval_control

# Stage 2: Treatment effect models
x_tau1 = XGBRegressor(
    max_depth=5,
    learning_rate=best_x_params['learning_rate'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
x_tau0 = XGBRegressor(
    max_depth=5,
    learning_rate=best_x_params['learning_rate'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

x_tau1.fit(X_trainval_treatment, tau1_imputed)
x_tau0.fit(X_trainval_control, tau0_imputed)

# Propensity model
x_prop = XGBClassifier(
    max_depth=5,
    learning_rate=best_x_params['learning_rate'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
x_prop.fit(X_trainval, T_trainval)

# Generate CATE predictions
tau1_test = x_tau1.predict(X_test)
tau0_test = x_tau0.predict(X_test)
propensity_test = x_prop.predict_proba(X_test)[:, 1]
x_cate = propensity_test * tau0_test + (1 - propensity_test) * tau1_test

# EconML X-Learner with wrapped models to output probabilities
econml_x = XLearner(
    models=[
        ClassifierAsRegressor(XGBClassifier(  # Control outcome
            max_depth=5,
            learning_rate=best_x_params['learning_rate'],
            scale_pos_weight=best_x_params['scale_pos_weight'],
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        )),
        ClassifierAsRegressor(XGBClassifier(  # Treatment outcome
            max_depth=5,
            learning_rate=best_x_params['learning_rate'],
            scale_pos_weight=best_x_params['scale_pos_weight'],
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        ))
    ],
    cate_models=[
        XGBRegressor(  # Control CATE
            max_depth=5,
            learning_rate=best_x_params['learning_rate'],
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        ),
        XGBRegressor(  # Treatment CATE
            max_depth=5,
            learning_rate=best_x_params['learning_rate'],
            n_estimators=100,
            random_state=42,
            n_jobs=-1
        )
    ]
    # propensity_model uses default LogisticRegression
)
econml_x.fit(Y_trainval, T_trainval, X=X_trainval)
econml_x_cate = econml_x.effect(X_test)

# Calculate correlation between implementations
x_corr = np.corrcoef(x_cate, econml_x_cate)[0,1]

 ## R-Learner (Residualization Approach)

 **Method:** Use residuals to directly model treatment effects by removing confounding patterns first

 **Algorithm:**
 1. **Nuisance Models**: Train separate models for outcome and treatment assignment:
    - μ(X) estimates E[Y|X] (outcome model, excludes treatment)
    - p(X) estimates P(T=1|X) (propensity score model)
 2. **Residualization**: Remove confounding patterns:
    - Outcome residuals: e_Y = Y - μ(X)
    - Treatment residuals: e_T = T - p(X)
 3. **Treatment Effect Model**: Minimize the R-loss Σ(e_Y - e_T·τ(X))², which can be rewritten as a weighted regression:
    - Target: e_Y/e_T (residualized outcome ratio)
    - Weights: e_T² (from factoring e_T² out of the squared loss)
    - Features: X (original covariates)
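
The weighted-regression form follows from factoring e_T² out of each squared term: (e_Y - e_T·τ)² = e_T²·(e_Y/e_T - τ)². The sketch below checks this identity numerically on synthetic residuals, which are made up for illustration, and shows that for a constant τ both forms give the same closed-form estimate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic residuals with a true constant effect of 0.3.
e_T = rng.uniform(-0.8, 0.8, size=1_000)
e_T = np.where(np.abs(e_T) < 1e-3, 1e-3, e_T)    # keep the ratio well defined
e_Y = 0.3 * e_T + rng.normal(0.0, 0.1, size=1_000)

# The two losses agree for any candidate tau.
tau = 0.17
r_loss = np.sum((e_Y - e_T * tau) ** 2)
weighted_loss = np.sum(e_T ** 2 * (e_Y / e_T - tau) ** 2)

# For a constant tau, both forms share the same minimizer.
tau_r_loss = np.sum(e_Y * e_T) / np.sum(e_T ** 2)          # least squares of e_Y on e_T
tau_weighted = np.average(e_Y / e_T, weights=e_T ** 2)     # weighted mean of the ratio

print(r_loss, weighted_loss)
print(tau_r_loss, tau_weighted)
```

Replacing the constant τ with a flexible model of X, fit with `sample_weight=e_T**2` on the target `e_Y/e_T`, gives the final stage used in this section.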

 **Key Insight**: After orthogonalization, residual relationship directly reveals causal effects

 **Advantages**: Theoretically robust, handles high-dimensional confounding

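 **Implementation**: Three-stage process. The nuisance models in this notebook are fit and evaluated on the same data (no cross-fitting), which matches the `cv=1` setting passed to EconML below; the R-Learner normally relies on cross-fitting to limit overfitting bias in the residuals.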
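
Cross-fitting means each row's nuisance prediction comes from a model trained on the other folds. A minimal sketch of the idea follows, assuming a plain least-squares fit as a stand-in for the XGBoost nuisance models; the helper `cross_fit_predictions` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np

def cross_fit_predictions(X, target, n_folds=5, seed=0):
    """Out-of-fold predictions; a least-squares fit stands in for XGBoost."""
    n = len(target)
    fold_of_row = np.random.default_rng(seed).permutation(n) % n_folds
    design = np.column_stack([np.ones(n), X])    # add an intercept column
    preds = np.empty(n)
    for k in range(n_folds):
        held_out = fold_of_row == k
        # Fit on the other folds only, then predict the held-out rows.
        coef, *_ = np.linalg.lstsq(design[~held_out], target[~held_out], rcond=None)
        preds[held_out] = design[held_out] @ coef
    return preds

# Hypothetical toy data with a constant treatment effect of 0.3.
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(2_000, 3))
T_toy = (rng.random(2_000) < 0.5).astype(float)
Y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + 0.3 * T_toy + rng.normal(0.0, 0.1, 2_000)

e_Y_toy = Y_toy - cross_fit_predictions(X_toy, Y_toy)    # outcome residuals
e_T_toy = T_toy - cross_fit_predictions(X_toy, T_toy)    # treatment residuals
tau_hat = np.sum(e_Y_toy * e_T_toy) / np.sum(e_T_toy ** 2)
print(tau_hat)  # should be close to the simulated effect of 0.3
```

With cross-fitting enabled, the hand-coded and library estimates would depend on how the folds are drawn, so an exact match would require using identical splits on both sides.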

In [None]:
# R-LEARNER COMPARISON
# R-Learner: Residualization approach that directly models treatment effects
# Uses residuals: e_Y = Y - E[Y|X], e_T = T - E[T|X]
# Then minimizes Σ(e_Y - e_T·τ(X))², i.e. regresses e_Y/e_T on X with weights e_T²

# Use fixed hyperparameters
best_r_params = {'learning_rate': 0.01, 'scale_pos_weight': 50}

# Outcome and propensity models
r_outcome = XGBClassifier(
    max_depth=5,
    learning_rate=best_r_params['learning_rate'],
    scale_pos_weight=best_r_params['scale_pos_weight'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
r_propensity = XGBClassifier(
    max_depth=5,
    learning_rate=best_r_params['learning_rate'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

r_outcome.fit(X_trainval, Y_trainval)
r_propensity.fit(X_trainval, T_trainval)

# Compute residuals
Y_pred_trainval = r_outcome.predict_proba(X_trainval)[:, 1]
T_pred_trainval = r_propensity.predict_proba(X_trainval)[:, 1]
e_Y = Y_trainval - Y_pred_trainval
e_T = T_trainval - T_pred_trainval

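# Train treatment effect model: weighted regression of e_Y/e_T on X with weights e_T²
weights = e_T ** 2
weights = np.maximum(weights, 1e-8)  # keep weights strictly positive
e_T_safe = np.where(np.abs(e_T) < 1e-8, 1e-8, e_T)  # avoid dividing by (near) zero
target = e_Y / e_T_safe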

r_tau = XGBRegressor(
    max_depth=5,
    learning_rate=best_r_params['learning_rate'],
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)
r_tau.fit(X_trainval, target, sample_weight=weights)

# Generate CATE predictions
r_cate = r_tau.predict(X_test)

# EconML R-Learner using NonParamDML (which implements R-Learner)
from econml.dml import NonParamDML

# Use wrapped classifiers to output probabilities for outcome model
econml_r = NonParamDML(
    model_y=ClassifierAsRegressor(XGBClassifier(  # Outcome model
        max_depth=5,
        learning_rate=best_r_params['learning_rate'],
        scale_pos_weight=best_r_params['scale_pos_weight'],
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    )),
    model_t=XGBClassifier(  # Propensity model
        max_depth=5,
        learning_rate=best_r_params['learning_rate'],
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ),
    model_final=XGBRegressor(  # Final stage model
        max_depth=5,
        learning_rate=best_r_params['learning_rate'],
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ),
    discrete_treatment=True,
    cv=1  # no cross-fitting: nuisance models are fit and evaluated in-sample
)
econml_r.fit(Y_trainval, T_trainval, X=X_trainval)
econml_r_cate = econml_r.effect(X_test)

# Calculate correlation between implementations
r_corr = np.corrcoef(r_cate, econml_r_cate.ravel())[0,1]

In [None]:
# FINAL VERIFICATION SUMMARY
# Compare all learner implementations and provide detailed explanations

# Create comparison table without CATE means for cleaner output
results = pl.DataFrame({
    'Learner': ['S-Learner', 'T-Learner', 'X-Learner', 'R-Learner'],
    'Correlation': [s_corr, t_corr, x_corr, r_corr],
    'Status': [
        'Perfect' if s_corr > 0.999 else 'Good',
        'Perfect' if t_corr > 0.999 else 'Good',
        'Excellent' if x_corr > 0.99 else 'Good',
        'Perfect' if r_corr > 0.999 else 'Good'
    ]
})

# Display final results
print("\n" + "="*70)
print(" " * 15 + "HAND-CODED vs ECONML VERIFICATION RESULTS")
print("="*70)
print(results)
print("="*70)

print("\nSUMMARY:")
print("S-Learner: Perfect match - implementations are identical")
print("T-Learner: Perfect match - implementations are identical")
print("X-Learner: Near perfect - minor difference from propensity model (XGB vs Logit)")
print("R-Learner: Perfect match - single-fold CV eliminates randomness")

print("\nCONCLUSION: All hand-coded implementations are validated and correct!")


               HAND-CODED vs ECONML VERIFICATION RESULTS
shape: (4, 3)
┌───────────┬─────────────┬───────────┐
│ Learner   ┆ Correlation ┆ Status    │
│ ---       ┆ ---         ┆ ---       │
│ str       ┆ f64         ┆ str       │
╞═══════════╪═════════════╪═══════════╡
│ S-Learner ┆ 1.0         ┆ Perfect   │
│ T-Learner ┆ 1.0         ┆ Perfect   │
│ X-Learner ┆ 0.998948    ┆ Excellent │
│ R-Learner ┆ 1.0         ┆ Perfect   │
└───────────┴─────────────┴───────────┘

SUMMARY:
S-Learner: Perfect match - implementations are identical
T-Learner: Perfect match - implementations are identical
X-Learner: Near perfect - minor difference from propensity model (XGB vs Logit)
R-Learner: Perfect match - single-fold CV eliminates randomness

CONCLUSION: All hand-coded implementations are validated and correct!


 ## Implementation Verification Results

 **Summary of Hand-Coded vs EconML Comparisons:**

 | Learner | Correlation | Status | Key Differences |
 |---------|-------------|--------|-----------------|
 | S-Learner | 1.000 | Perfect | Identical implementations |
 | T-Learner | 1.000 | Perfect | Identical implementations |
 | X-Learner | 0.999 | Excellent | Propensity model choice (XGB vs Logistic) |
 | R-Learner | 1.000 | Perfect | None; EconML run with `cv=1` (no cross-fitting) to match the in-sample hand-coded version |

 **Why Differences Exist:**

 **X-Learner (0.999 correlation):**
 - Hand-coded uses XGBoost for propensity scores
 - EconML uses default LogisticRegression for propensity estimation
 - Since final CATE = g(X)·τ₀(X) + (1-g(X))·τ₁(X), different propensity models create slightly different weighting
 - Near-perfect correlation indicates both produce nearly identical treatment effect patterns

 **Conclusion:** The hand-coded implementations reproduce EconML's estimates closely: correlations are 1.000 for the S-, T-, and R-Learners and 0.999 for the X-Learner. The one observed difference stems from an implementation choice (the X-Learner propensity model) rather than an algorithmic error, and the R-Learner match was obtained with cross-fitting disabled on both sides. Together these results support the correctness of each from-scratch implementation.
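
Pearson correlation is unchanged by shifting or rescaling one of the inputs, so a correlation of 1.0 alone does not prove that two sets of estimates are numerically equal. A complementary check, sketched here with a hypothetical helper `compare_estimates`, also reports the largest absolute difference.

```python
import numpy as np

def compare_estimates(a, b, atol=1e-6):
    """Correlation plus element-wise agreement between two CATE vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return {
        "correlation": float(np.corrcoef(a, b)[0, 1]),
        "max_abs_diff": float(np.max(np.abs(a - b))),
        "allclose": bool(np.allclose(a, b, atol=atol)),
    }

# Made-up example: the second vector is a shifted and scaled copy of the first,
# so the correlation is perfect even though the values disagree.
reference = np.array([0.01, 0.03, 0.02, 0.05])
distorted = 2.0 * reference + 0.1
report = compare_estimates(reference, distorted)
print(report)
```

In this notebook the same helper could be applied to pairs such as `s_cate` and `econml_s_cate` to confirm that the estimates agree in value as well as in ranking.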