 # 01: Exploratory Data Analysis - Criteo Uplift Dataset

 Analysis of the Criteo Uplift Dataset reveals severe non-compliance: treatment assignment is randomized (85/15 split), but only 3.6% of assigned users are actually served an ad. Assignment and exposure must therefore be treated as distinct quantities when estimating effects and deciding whom to target.

 **Structure:**
 - **Part A**: Dataset Structure and Treatment Assignment Patterns
 - **Part B**: Randomization Quality Assessment and Non-Compliance Analysis

 **Key Findings:**
 - Valid randomization: Treatment assignment is properly randomized (|SMD| < 0.1 for all 12 features)
 - Severe non-compliance: Only 3.6% of assigned users actually see ads
 - Sequential structure: Random Assignment → Selective Exposure → Conversion

 **Key Numbers at a Glance**
 - **Sample size:** 14M users
 - **Treatment assignment:** 85% treatment, 15% control (randomized)
 - **Exposure rate:** 3.6% of assigned users see ads (non-compliance)
 - **ITT effect:** 0.12pp conversion lift (0.31% treatment vs 0.19% control)

In [None]:
import polars as pl 
from scipy import stats

 ## Part A: Dataset Structure and Treatment Assignment Patterns

In [None]:
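# Load the Criteo uplift dataset. sample(fraction=1.0) keeps every row;
# pass shuffle=True as well if a fully random row order is required.
df = pl.read_csv("data/criteo-uplift-v2.1.csv").sample(fraction=1.0, seed=42)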

obs, cols = df.shape
print(f"Dataset contains {obs} observations and {cols} columns.")
df.describe()

Dataset contains 13979592 observations and 16 columns.


statistic,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,treatment,conversion,visit,exposure
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
count,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0,13979592.0
null_count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,19.620297,10.069977,8.446582,4.178923,10.338837,4.028513,-4.155356,5.101765,3.933581,16.027638,5.333396,-0.170967,0.85,0.002917,0.046992,0.030631
std,5.377464,0.104756,0.299316,1.336645,0.343308,0.431097,4.577914,1.205248,0.05666,7.018975,0.168229,0.022833,0.357071,0.053927,0.211622,0.172316
min,12.616365,10.059654,8.214383,-8.398387,10.280525,-9.011892,-31.429784,4.833815,3.635107,13.190056,5.300375,-1.383941,0.0,0.0,0.0,0.0
25%,12.616365,10.059654,8.214383,4.679882,10.280525,4.115453,-6.699321,4.833815,3.910792,13.190056,5.300375,-0.168679,1.0,0.0,0.0,0.0
50%,21.923416,10.059654,8.214383,4.679882,10.280525,4.115453,-2.411115,4.833815,3.971858,13.190056,5.300375,-0.168679,1.0,0.0,0.0,0.0
75%,24.436459,10.059654,8.723335,4.679882,10.280525,4.115453,0.294443,4.833815,3.971858,13.190056,5.300375,-0.168679,1.0,0.0,0.0,0.0
max,26.745255,16.344187,9.051962,4.679882,21.123508,4.115453,0.294443,11.998401,3.971858,75.295017,6.473917,-0.168679,1.0,1.0,1.0,1.0


 ### Dataset Schema

 **Features:** `f0` to `f11` - User characteristics (12 dense, float features)

 **Treatment Variables:**
 - `treatment`: Assignment to ad campaign (1 = assigned, 0 = control)
 - `exposure`: Actually received treatment (1 = ad served, 0 = no ad)

 **Outcome Variables:**
 - `conversion`: User converted (binary outcome)
 - `visit`: User visited site (binary outcome)


In [None]:
# Check treatment assignment rate
treatment_rate = df['treatment'].mean()
print(f"Treatment Rate: {treatment_rate:.4f}")

Treatment Rate: 0.8500


In [None]:
# Check actual ad exposure rate within the dataset
exposure_rate = df['exposure'].mean()
exposure_rate_treated = df.filter(pl.col('treatment') == 1)['exposure'].mean()
exposure_rate_control = df.filter(pl.col('treatment') == 0)['exposure'].mean()
print(f"Exposure Rate (Overall): {exposure_rate:.4f}")
print(f"Exposure Rate (Treated): {exposure_rate_treated:.4f}")
print(f"Exposure Rate (Control): {exposure_rate_control:.4f}")

Exposure Rate (Overall): 0.0306
Exposure Rate (Treated): 0.0360
Exposure Rate (Control): 0.0000


In [None]:
# Calculate outcome rates for each group using the groups dictionary
groups = {
    'control': df.filter(pl.col('treatment') == 0),
    'treated': df.filter(pl.col('treatment') == 1),
    'non_exposed': df.filter((pl.col('treatment') == 1) & (pl.col('exposure') == 0)),
    'exposed': df.filter((pl.col('treatment') == 1) & (pl.col('exposure') == 1))
}

outcome_cols = ['visit', 'conversion']
for col in outcome_cols:
    col_rate = df[col].mean()
    print(f"Overall {col} rate: {col_rate:.4f}")

total = df.height
summary_data = []
for group_name, group_df in groups.items():
    count = group_df.height
    pct = count * 100 / total
    
    # Store rates for summary table
    rates = {}
    for col in outcome_cols:
        rate = group_df[col].mean()
        rate_std = group_df[col].std()
        rates[col] = (rate, rate_std)

    # Append to summary data using already calculated values
    summary_data.append({
        'Group': group_name,
        'N': count,
        'Pct': f"{pct:.2f}%",
        'Visit': f"{rates['visit'][0]:.4f}",
        'Visit_Std': f"{rates['visit'][1]:.4f}",
        'Conversion': f"{rates['conversion'][0]:.6f}",
        'Conversion_Std': f"{rates['conversion'][1]:.6f}"
    })

# Print summary table
print("\n" + "="*60)
print("SUMMARY TABLE")
print("="*60)
summary_df = pl.DataFrame(summary_data)
print(summary_df)

# Calculate naive ITT
print("\n" + "="*60)
print("INTENT-TO-TREAT (ITT) EFFECTS")
print("="*60)

for col in outcome_cols:
    treated_rate = groups['treated'][col].mean() 
    control_rate = groups['control'][col].mean()
    itt = treated_rate - control_rate # pyright: ignore[reportOperatorIssue]
    
    print(f"{col.upper()} ITT: {itt:.6f} ({itt*100:.4f}pp)") # pyright: ignore[reportOperatorIssue]
    print(f"  Treated: {treated_rate:.6f}")
    print(f"  Control: {control_rate:.6f}")

Overall visit rate: 0.0470
Overall conversion rate: 0.0029

SUMMARY TABLE
shape: (4, 7)
┌─────────────┬──────────┬────────┬────────┬───────────┬────────────┬────────────────┐
│ Group       ┆ N        ┆ Pct    ┆ Visit  ┆ Visit_Std ┆ Conversion ┆ Conversion_Std │
│ ---         ┆ ---      ┆ ---    ┆ ---    ┆ ---       ┆ ---        ┆ ---            │
│ str         ┆ i64      ┆ str    ┆ str    ┆ str       ┆ str        ┆ str            │
╞═════════════╪══════════╪════════╪════════╪═══════════╪════════════╪════════════════╡
│ control     ┆ 2096937  ┆ 15.00% ┆ 0.0382 ┆ 0.1917    ┆ 0.001938   ┆ 0.043975       │
│ treated     ┆ 11882655 ┆ 85.00% ┆ 0.0485 ┆ 0.2149    ┆ 0.003089   ┆ 0.055497       │
│ non_exposed ┆ 11454443 ┆ 81.94% ┆ 0.0349 ┆ 0.1834    ┆ 0.001194   ┆ 0.034538       │
│ exposed     ┆ 428212   ┆ 3.06%  ┆ 0.4145 ┆ 0.4926    ┆ 0.053784   ┆ 0.225591       │
└─────────────┴──────────┴────────┴────────┴───────────┴────────────┴────────────────┘

INTENT-TO-TREAT (ITT) EFFECTS
VISIT ITT: 

 ### Treatment vs Exposure: Key Distinction

 **Treatment Assignment:** `treatment = 1` (assigned to ad campaign), `treatment = 0` (control)

 **Ad Exposure:** `exposure = 1` (ad served), `exposure = 0` (no ad delivered)

 **Assignment Patterns:**
 - Treatment rate: 85% assigned, 15% control (randomized)
 - Exposure rate: Only 3.6% of assigned users see ads (non-compliance)
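
 The 3.6% figure is the exposure rate among assigned users, while the 3.06% shown in the summary table is the exposed share of the whole dataset. As a quick sanity check, the sketch below recomputes both from the counts in the summary table; the variable names are illustrative only.

```python
# Counts copied from the summary table above.
n_total = 13_979_592
n_treated = 11_882_655
n_exposed = 428_212  # every exposed user belongs to the treated group

rate_among_treated = n_exposed / n_treated
rate_overall = n_exposed / n_total
share_treated = n_treated / n_total

print(f"Exposure among treated: {rate_among_treated:.4f}")
print(f"Exposure overall:       {rate_overall:.4f}")

# Control exposure is zero, so the overall rate factors exactly into
# P(treated) * P(exposed | treated).
assert abs(rate_overall - share_treated * rate_among_treated) < 1e-12
```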

 **Why Low Exposure?** Plausible causes include lost ad auctions, ad blockers, technical failures, and limited ad inventory; none of the 16 columns records the reason.

 **Intent-to-Treat (ITT) Effects:**
 - Visit ITT: +1.03pp (4.85% vs 3.82%)
 - Conversion ITT: +0.12pp (0.31% vs 0.19%)
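
 The ITT figures above can be paired with a rough uncertainty estimate. The sketch below recomputes both ITT effects from the rounded rates and group sizes in the summary table and attaches a normal-approximation (Wald) 95% interval for a difference in proportions. The helper name `itt_with_ci` is hypothetical, and because the inputs are rounded table values, the results are approximate.

```python
import math

def itt_with_ci(p_treated, n_treated, p_control, n_control, z=1.96):
    """Difference in proportions with a normal-approximation (Wald) interval."""
    itt = p_treated - p_control
    se = math.sqrt(
        p_treated * (1 - p_treated) / n_treated
        + p_control * (1 - p_control) / n_control
    )
    return itt, itt - z * se, itt + z * se

# Rounded rates and group sizes copied from the summary table above.
N_TREATED, N_CONTROL = 11_882_655, 2_096_937
rates = {
    "visit": (0.0485, 0.0382),
    "conversion": (0.003089, 0.001938),
}

for outcome, (p_t, p_c) in rates.items():
    itt, lo, hi = itt_with_ci(p_t, N_TREATED, p_c, N_CONTROL)
    print(f"{outcome}: ITT={itt * 100:.3f}pp, "
          f"95% CI=[{lo * 100:.3f}, {hi * 100:.3f}]pp")
```

 With groups this large the intervals are narrow, so the sign of both ITT effects is not in doubt; the open question is their practical size.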

 ## Part B: Randomization Quality Assessment and Non-Compliance Analysis

In [None]:
# Check baseline feature balance between treatment and control groups
# This tests whether randomization worked - features should have similar means/std
feature_cols = [col for col in df.columns if col.startswith('f')]

# Define group comparisons (treatment_group, control_group)
# For SMD: treatment_group vs control_group standardized by control_group std
comparisons = {
    'Treatment vs Control': ('treated', 'control'),
    'Non-exposed vs Control': ('non_exposed', 'control'),
    'Exposed vs Control': ('exposed', 'control'),
    'Exposed vs Non-exposed': ('exposed', 'non_exposed')  # non_exposed as baseline
}

# Generic function for statistical tests
def run_statistical_tests(groups, comparisons, feature_cols):
    for comparison_name, (group1_name, group2_name) in comparisons.items():
        print(f"\n=== {comparison_name} ===")
        print("T-tests:")
        for col in feature_cols:
            group1_data = groups[group1_name][col].to_numpy()
            group2_data = groups[group2_name][col].to_numpy()

            t_stat, p_value = stats.ttest_ind(group1_data, group2_data)
            print(f"{col}: t-stat={t_stat:.3f}, p-value={p_value:.3f}")

        print("\nKolmogorov-Smirnov tests:")
        for col in feature_cols:
            group1_data = groups[group1_name][col].to_numpy()
            group2_data = groups[group2_name][col].to_numpy()

            ks_stat, p_value = stats.ks_2samp(group1_data, group2_data)
            print(f"{col}: KS-stat={ks_stat:.3f}, p-value={p_value:.3f}")

# Run all statistical tests
run_statistical_tests(groups, comparisons, feature_cols + outcome_cols)


=== Treatment vs Control ===
T-tests:
f0: t-stat=-9.174, p-value=0.000
f1: t-stat=30.604, p-value=0.000
f2: t-stat=-8.348, p-value=0.000
f3: t-stat=-63.343, p-value=0.000
f4: t-stat=10.576, p-value=0.000
f5: t-stat=-39.445, p-value=0.000
f6: t-stat=-53.349, p-value=0.000
f7: t-stat=27.994, p-value=0.000
f8: t-stat=-29.688, p-value=0.000
f9: t-stat=31.640, p-value=0.000
f10: t-stat=13.981, p-value=0.000
f11: t-stat=-6.845, p-value=0.000
visit: t-stat=65.257, p-value=0.000
conversion: t-stat=28.517, p-value=0.000

Kolmogorov-Smirnov tests:
f0: KS-stat=0.009, p-value=0.000
f1: KS-stat=0.003, p-value=0.000
f2: KS-stat=0.004, p-value=0.000
f3: KS-stat=0.014, p-value=0.000
f4: KS-stat=0.002, p-value=0.000
f5: KS-stat=0.006, p-value=0.000
f6: KS-stat=0.013, p-value=0.000
f7: KS-stat=0.006, p-value=0.000
f8: KS-stat=0.008, p-value=0.000
f9: KS-stat=0.007, p-value=0.000
f10: KS-stat=0.002, p-value=0.000
f11: KS-stat=0.001, p-value=0.101
visit: KS-stat=0.010, p-value=0.000
conversion: KS-stat=0

 ### Statistical Tests

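 With about 14M observations, nearly every test rejects equality (p < 0.001) even when the difference is tiny: KS statistics are at most 0.014 for the features in the Treatment vs Control output. The one exception shown is the KS test for `f11` (p = 0.101). Statistical significance is therefore uninformative here; use SMD to judge practical significance.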
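
 To see why tiny imbalances still produce extreme test statistics at this scale, note that a two-sample t-statistic is roughly the standardized difference divided by sqrt(1/n1 + 1/n2). The sketch below applies that approximation to the group sizes from the summary table. The helper name `approx_t_stat` is hypothetical, and the approximation treats both groups as sharing one standard deviation, so it will not reproduce the SciPy output exactly.

```python
import math

def approx_t_stat(smd, n1, n2):
    """Approximate two-sample t-statistic implied by a standardized mean difference."""
    return smd / math.sqrt(1 / n1 + 1 / n2)

N_TREATED, N_CONTROL = 11_882_655, 2_096_937  # group sizes from the summary table

# Even a difference of 1% of a standard deviation is many standard errors from zero.
for smd in (0.001, 0.01, 0.05):
    t = approx_t_stat(smd, N_TREATED, N_CONTROL)
    print(f"SMD={smd:.3f} -> t ~ {t:.1f}")
```

 Under this approximation, a standardized difference of 0.01 already corresponds to a t-statistic of roughly 13, far beyond any conventional significance threshold.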

 ### Standardized Mean Differences
 Measures effect size: (mean_treatment - mean_control) / std_control. By the common rule of thumb, |SMD| > 0.1 indicates a meaningful difference.
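
 As a minimal illustration of the formula above, the sketch below computes an SMD for two small made-up samples using only the standard library. The function name `smd` and the toy values are hypothetical. It follows the convention stated here (control-group standard deviation in the denominator); other references divide by a pooled standard deviation instead.

```python
from statistics import mean, stdev

def smd(treatment_values, control_values):
    """Standardized mean difference, scaled by the control group's sample std."""
    std_control = stdev(control_values)
    if std_control == 0:
        raise ValueError("control group has zero variance; SMD is undefined")
    return (mean(treatment_values) - mean(control_values)) / std_control

# Toy data, not taken from the Criteo dataset.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment = [1.5, 2.5, 3.5, 4.5, 5.5]

value = smd(treatment, control)
print(f"SMD = {value:.4f}")
print("imbalanced" if abs(value) > 0.1 else "balanced")
```

 Raising on a zero-variance control group is a choice made for this sketch; it makes a degenerate feature visible instead of silently rescaling it.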

In [None]:
# Generic function for SMD calculations
def calculate_smd(groups, comparisons, feature_cols):
    for comparison_name, (treatment_group, control_group) in comparisons.items():
        print(f"\n=== {comparison_name} ===")
        for col in feature_cols:
            mean_treatment = groups[treatment_group][col].mean()
            mean_control = groups[control_group][col].mean()
            std_control = groups[control_group][col].std()

            if std_control == 0.0:
                std_control = 1.0

            smd = (mean_treatment - mean_control) / std_control # pyright: ignore[reportOperatorIssue]
            print(f"{col}: SMD= {smd:.4f}")

# Calculate SMDs for all group comparisons using proper control groups
calculate_smd(groups, comparisons, feature_cols + outcome_cols)


=== Treatment vs Control ===
f0: SMD= -0.0069
f1: SMD= 0.0258
f2: SMD= -0.0062
f3: SMD= -0.0511
f4: SMD= 0.0080
f5: SMD= -0.0322
f6: SMD= -0.0412
f7: SMD= 0.0217
f8: SMD= -0.0227
f9: SMD= 0.0245
f10: SMD= 0.0107
f11: SMD= -0.0052
visit: SMD= 0.0540
conversion: SMD= 0.0262

=== Non-exposed vs Control ===
f0: SMD= 0.0190
f1: SMD= -0.0186
f2: SMD= -0.0073
f3: SMD= 0.0327
f4: SMD= -0.0254
f5: SMD= 0.0199
f6: SMD= 0.0168
f7: SMD= -0.0108
f8: SMD= 0.0209
f9: SMD= -0.0203
f10: SMD= -0.0110
f11: SMD= 0.0229
visit: SMD= -0.0174
conversion: SMD= -0.0169

=== Exposed vs Control ===
f0: SMD= -0.6981
f1: SMD= 1.2153
f2: SMD= 0.0215
f3: SMD= -2.2922
f4: SMD= 0.9019
f5: SMD= -1.4262
f6: SMD= -1.5918
f7: SMD= 0.8916
f8: SMD= -1.1891
f9: SMD= 1.2216
f10: SMD= 0.5924
f11: SMD= -0.7561
visit: SMD= 1.9633
conversion: SMD= 1.1790

=== Exposed vs Non-exposed ===
f0: SMD= -0.7177
f1: SMD= 1.4430
f2: SMD= 0.0288
f3: SMD= -2.4772
f4: SMD= 1.0663
f5: SMD= -1.5765
f6: SMD= -1.6459
f7: SMD= 0.9209
f8: SMD= -1.22

 ## Summary

 Randomized controlled trial of an online ad campaign with 14M users. 85% assigned to treatment (eligible for ads via auctions), 15% control (never shown ads). Sequential structure: Assignment → Exposure → Conversion.

 **Key Findings:**
 - Valid randomization: Treatment assignment is balanced (|SMD| < 0.1 for all 12 features)
 - Non-compliance problem: Only 3.6% of assigned users actually see ads (96.4% never exposed)
 - Selective exposure: Exposed users differ systematically from control (|SMD| > 0.1 for 11 of 12 features; only `f2` is balanced)
 - ITT effect: Assignment increases conversion by 0.12pp

 **Business Problem:** Real-time ad auctions require two decisions: (1) which users to target and (2) how much to bid. A standard CATE model produces a single score per user, which conflates the two: users with identical scores may need different bids if one scores high because of reachability (bid low) and the other because of responsiveness (bid high).
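
 The distinction can be made concrete with toy numbers. Assuming assignment affects conversion only through exposure (an assumption this notebook does not test), a user's assignment-level effect factors into the probability of being exposed times the effect of exposure. The sketch below shows two hypothetical users with the same assignment-level score but very different components; every name and value is illustrative, not estimated from the data.

```python
from dataclasses import dataclass

@dataclass
class UserScore:
    name: str
    p_exposure: float         # reachability: P(ad served | assigned)
    effect_if_exposed: float  # responsiveness: conversion lift when the ad is served

    @property
    def assignment_effect(self) -> float:
        # Valid only if assignment changes conversion solely via exposure.
        return self.p_exposure * self.effect_if_exposed

# Hypothetical users chosen so that both products are equal.
users = [
    UserScore("easy_to_reach", p_exposure=0.40, effect_if_exposed=0.005),
    UserScore("highly_responsive", p_exposure=0.02, effect_if_exposed=0.100),
]

for u in users:
    print(f"{u.name}: assignment effect={u.assignment_effect:.4f} "
          f"(reach={u.p_exposure:.2f}, response={u.effect_if_exposed:.3f})")
```

 A single assignment-level score cannot tell these two users apart, which is why reachability and responsiveness may need to be modeled separately.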

 **Next Steps:** Notebook 2 implements conditional average treatment effect (CATE) models as a performance baseline. There, CATE is the effect on conversion of being assigned to the ad campaign versus not being assigned; it is not the effect of actually seeing an ad (exposure).