# Structural Causal Models: Interactive Tutorial

This notebook provides hands-on experience with **Structural Causal Models (SCMs)**, covering the three levels of causation: association, intervention, and counterfactuals.

## Learning Objectives

By the end of this notebook, you will:
1. Understand how to define and implement SCMs in Python
2. Distinguish between observational, interventional, and counterfactual queries
3. Apply SCMs to biological problems (gene regulation, drug response)
4. Connect SCMs to do-calculus and propensity score methods

## Prerequisites

- Basic Python and NumPy
- Understanding of DAGs (see `02_causal_graphs.ipynb`)
- Familiarity with potential outcomes (see `01_treatment_effects.ipynb`)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Import our SCM framework
import sys
sys.path.append('../../src')

from causalbiolab.scm import StructuralCausalModel, SCMVariable
from causalbiolab.scm.examples import (
    simple_linear_scm,
    confounded_scm,
    gene_regulation_scm,
    drug_response_scm,
    cell_cycle_confounding_scm,
    collider_scm,
    mediation_scm
)
from causalbiolab.scm.counterfactuals import LinearSCM, compute_counterfactual

# Plotting setup
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

---

## Part 1: The Three Levels of Causation

Pearl's **Ladder of Causation** describes three increasingly powerful types of reasoning:

1. **Association (Seeing)**: $P(Y \mid X)$ - "What is?"
2. **Intervention (Doing)**: $P(Y \mid do(X))$ - "What if we do?"
3. **Counterfactual (Imagining)**: $P(Y_x \mid X = x', Y = y')$ - "What if we had done?"

Let's explore each level with a simple example.

### Example: Simple Linear SCM

Consider a simple causal relationship: $X \to Y$

**Structural equations:**
- $X := U_X$
- $Y := 2X + U_Y$

where $U_X \sim \mathcal{N}(0, 1)$ and $U_Y \sim \mathcal{N}(0, 0.5)$
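
As a minimal NumPy sketch of these two structural equations (independent of the `causalbiolab` helpers used below, and assuming the second parameter of $\mathcal{N}(0, 0.5)$ is a standard deviation), the exogenous noise is drawn first and each variable is then computed from its parents. The helper name `sample_simple_scm` is hypothetical.

```python
import numpy as np

def sample_simple_scm(n_samples, seed=0):
    """Draw samples from X := U_X, Y := 2X + U_Y (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    u_x = rng.normal(0.0, 1.0, size=n_samples)  # exogenous noise for X
    u_y = rng.normal(0.0, 0.5, size=n_samples)  # exogenous noise for Y (std 0.5 assumed)
    x = u_x                                     # X := U_X
    y = 2.0 * x + u_y                           # Y := 2X + U_Y
    return x, y

x, y = sample_simple_scm(10_000)
slope = np.polyfit(x, y, 1)[0]  # least-squares slope of Y on X
print(f"Estimated slope of Y on X: {slope:.2f}")
```

Because nothing confounds $X$ and $Y$ here, the fitted slope should sit close to the structural coefficient 2.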

In [None]:
# Create the SCM
scm = simple_linear_scm()
print(scm)

### Level 1: Association (Observational)

**Question:** What is the relationship between $X$ and $Y$ in observational data?

In [None]:
# Sample observational data
data_obs = scm.sample(n_samples=1000, random_seed=42)

# Convert to DataFrame for easier manipulation
df_obs = pd.DataFrame(data_obs)

# Compute observational statistics
print("Observational Data:")
print(f"E[Y] = {df_obs['Y'].mean():.3f}")
print(f"E[Y | X > 0] = {df_obs[df_obs['X'] > 0]['Y'].mean():.3f}")
print(f"E[Y | X ≤ 0] = {df_obs[df_obs['X'] <= 0]['Y'].mean():.3f}")
print(f"Correlation(X, Y) = {df_obs['X'].corr(df_obs['Y']):.3f}")

In [None]:
# Visualize observational data
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Scatter plot
axes[0].scatter(df_obs['X'], df_obs['Y'], alpha=0.3)
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].set_title('Observational Data: X vs Y')

# Add regression line
z = np.polyfit(df_obs['X'], df_obs['Y'], 1)
p = np.poly1d(z)
x_line = np.linspace(df_obs['X'].min(), df_obs['X'].max(), 100)
axes[0].plot(x_line, p(x_line), 'r--', label=f'y = {z[0]:.2f}x + {z[1]:.2f}')
axes[0].legend()

# Distribution of Y
axes[1].hist(df_obs['Y'], bins=30, alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Y')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Y (Observational)')
axes[1].axvline(df_obs['Y'].mean(), color='r', linestyle='--', label=f'Mean = {df_obs["Y"].mean():.2f}')
axes[1].legend()

plt.tight_layout()
plt.show()

### Level 2: Intervention (Causal)

**Question:** What would happen if we **forced** $X = 1.5$?

This is different from conditioning! We're applying the **do-operator**: $do(X = 1.5)$

In [None]:
# Apply intervention
scm_do_x = scm.intervene({'X': 1.5})

# Sample from intervened distribution
data_do_x = scm_do_x.sample(n_samples=1000, random_seed=42)
df_do_x = pd.DataFrame(data_do_x)

print("Interventional Data (do(X = 1.5)):")
print(f"E[Y | do(X = 1.5)] = {df_do_x['Y'].mean():.3f}")
print(f"Theoretical E[Y | do(X = 1.5)] = 2 * 1.5 = {2 * 1.5:.3f}")

In [None]:
# Compare: Observational vs Interventional
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Observational: condition on X ≈ 1.5
df_obs_cond = df_obs[(df_obs['X'] > 1.4) & (df_obs['X'] < 1.6)]
axes[0].hist(df_obs_cond['Y'], bins=20, alpha=0.7, label='P(Y | X ≈ 1.5)', edgecolor='black')
axes[0].axvline(df_obs_cond['Y'].mean(), color='blue', linestyle='--', 
                label=f'Mean = {df_obs_cond["Y"].mean():.2f}')
axes[0].set_xlabel('Y')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Observational: P(Y | X ≈ 1.5)')
axes[0].legend()

# Interventional: do(X = 1.5)
axes[1].hist(df_do_x['Y'], bins=20, alpha=0.7, color='orange', label='P(Y | do(X = 1.5))', edgecolor='black')
axes[1].axvline(df_do_x['Y'].mean(), color='red', linestyle='--', 
                label=f'Mean = {df_do_x["Y"].mean():.2f}')
axes[1].set_xlabel('Y')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Interventional: P(Y | do(X = 1.5))')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"\nDifference: {abs(df_obs_cond['Y'].mean() - df_do_x['Y'].mean()):.3f}")
print("Note: In this simple case, they're similar because there's no confounding.")

### Level 3: Counterfactual (Individual)

**Question:** For a specific individual with $X=1, Y=3$, what would $Y$ have been if $X$ had been $2$?

This requires the **three-step process**:
1. **Abduction**: Infer $U$ from observed $(X, Y)$
2. **Action**: Apply intervention $do(X = 2)$
3. **Prediction**: Compute $Y$ with inferred $U$
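
For the linear equation $Y := 2X + U_Y$, the three steps reduce to a few lines of arithmetic. The sketch below is a hand-rolled illustration with a hypothetical function name, not the `LinearSCM` implementation used in the next cell.

```python
def counterfactual_y(x_obs, y_obs, x_new, slope=2.0):
    """Abduction-action-prediction for Y := slope * X + U_Y (hypothetical helper)."""
    u_y = y_obs - slope * x_obs  # 1. Abduction: recover this unit's noise term
    x_cf = x_new                 # 2. Action: replace X's equation with X := x_new
    return slope * x_cf + u_y    # 3. Prediction: recompute Y with the same U_Y

y_cf = counterfactual_y(x_obs=1.0, y_obs=3.0, x_new=2.0)
print(y_cf)  # 5.0
```

The individual's noise term ($U_Y = 3 - 2 \cdot 1 = 1$) is carried over unchanged, which is what separates a counterfactual from a population-level intervention.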

In [None]:
# Use LinearSCM for efficient counterfactual computation
scm_linear = LinearSCM(
    coefficients={'Y': {'X': 2.0}},
    noise_distributions={'X': stats.norm(0, 1), 'Y': stats.norm(0, 0.5)}
)

# Observed: X=1, Y=3
observed = {'X': 1.0, 'Y': 3.0}

# Step 1: Abduction - infer U_Y
exogenous = scm_linear.abduct(observed)
print("Step 1 - Abduction:")
print(f"Observed: X={observed['X']}, Y={observed['Y']}")
print(f"Inferred U_Y = Y - 2*X = {observed['Y']} - 2*{observed['X']} = {exogenous['Y']:.3f}")

# Step 2 & 3: Action + Prediction
y_counterfactual = scm_linear.counterfactual(
    observed=observed,
    intervention={'X': 2.0},
    query='Y'
)

print("\nStep 2 & 3 - Action + Prediction:")
print(f"Counterfactual: Y_{{X=2}} = 2*2 + U_Y = 2*2 + {exogenous['Y']:.3f} = {y_counterfactual:.3f}")
print(f"\nInterpretation: If this individual had X=2 instead of X=1, their Y would be {y_counterfactual:.3f} instead of {observed['Y']:.3f}")

### Exercise 1: Compare the Three Levels

Compute and compare:
1. **Association**: $E[Y \mid X=2]$ (observational)
2. **Intervention**: $E[Y \mid do(X=2)]$ (population-level causal)
3. **Counterfactual**: $Y_{X=2}$ for individual with $X=1, Y=2.5$ (individual-level causal)

In [None]:
# Your code here
# 1. Association
association = df_obs[(df_obs['X'] > 1.9) & (df_obs['X'] < 2.1)]['Y'].mean()

# 2. Intervention
scm_do_x2 = scm.intervene({'X': 2.0})
data_do_x2 = scm_do_x2.sample(1000, random_seed=42)
intervention = pd.DataFrame(data_do_x2)['Y'].mean()

# 3. Counterfactual
counterfactual = scm_linear.counterfactual(
    observed={'X': 1.0, 'Y': 2.5},
    intervention={'X': 2.0},
    query='Y'
)

print(f"1. Association E[Y | X≈2]: {association:.3f}")
print(f"2. Intervention E[Y | do(X=2)]: {intervention:.3f}")
print(f"3. Counterfactual Y_{{X=2}} (for X=1, Y=2.5): {counterfactual:.3f}")

---

## Part 2: Confounding and Why Interventions Matter

Let's see why $P(Y \mid X) \neq P(Y \mid do(X))$ when there's confounding.

**SCM with confounder $Z$:**
- $Z := U_Z$
- $X := Z + U_X$
- $Y := 2X + Z + U_Y$

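$Z$ affects both $X$ and $Y$, so part of the observed association between $X$ and $Y$ is spurious: it flows through the back-door path $X \leftarrow Z \to Y$ rather than through the causal effect of $X$ on $Y$.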
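
To make the gap concrete, here is a self-contained NumPy sketch of these equations. It assumes unit-variance Gaussian noise for all three exogenous terms, which may differ from what `confounded_scm()` uses. The intervention is modeled as assigning $X$ at random, as in a randomized experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Exogenous noise (unit variances are an assumption of this sketch)
u_z, u_x, u_y = rng.normal(size=(3, n))

# Observational regime: every structural equation is active
z = u_z
x_obs = z + u_x
y_obs = 2.0 * x_obs + z + u_y
obs_slope = np.polyfit(x_obs, y_obs, 1)[0]

# Interventional regime: X is assigned independently of Z, cutting the Z -> X edge
x_do = rng.normal(size=n)
y_do = 2.0 * x_do + z + u_y
do_slope = np.polyfit(x_do, y_do, 1)[0]

print(f"Observational slope:  {obs_slope:.2f}")
print(f"Interventional slope: {do_slope:.2f}")
```

Under these assumptions the observational slope is $\mathrm{Cov}(X, Y) / \mathrm{Var}(X) = 5/2 = 2.5$, while the interventional slope matches the structural coefficient 2.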

In [None]:
# Create confounded SCM
scm_conf = confounded_scm()
print(scm_conf)

In [None]:
# Sample observational data
data_conf_obs = scm_conf.sample(n_samples=2000, random_seed=42)
df_conf_obs = pd.DataFrame(data_conf_obs)

# Visualize the confounding
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Z vs X
axes[0].scatter(df_conf_obs['Z'], df_conf_obs['X'], alpha=0.3)
axes[0].set_xlabel('Z (Confounder)')
axes[0].set_ylabel('X (Treatment)')
axes[0].set_title(f"Z → X (r = {df_conf_obs['Z'].corr(df_conf_obs['X']):.3f})")

# Z vs Y
axes[1].scatter(df_conf_obs['Z'], df_conf_obs['Y'], alpha=0.3)
axes[1].set_xlabel('Z (Confounder)')
axes[1].set_ylabel('Y (Outcome)')
axes[1].set_title(f"Z → Y (r = {df_conf_obs['Z'].corr(df_conf_obs['Y']):.3f})")

# X vs Y (confounded)
axes[2].scatter(df_conf_obs['X'], df_conf_obs['Y'], alpha=0.3, c=df_conf_obs['Z'], cmap='viridis')
axes[2].set_xlabel('X (Treatment)')
axes[2].set_ylabel('Y (Outcome)')
axes[2].set_title(f"X → Y (r = {df_conf_obs['X'].corr(df_conf_obs['Y']):.3f})")
plt.colorbar(axes[2].collections[0], ax=axes[2], label='Z')

plt.tight_layout()
plt.show()

### Observational vs Interventional Effect

Let's compare the **observational association** with the **causal effect**.

In [None]:
# Observational association (biased by Z)
high_x = df_conf_obs[df_conf_obs['X'] > df_conf_obs['X'].median()]
low_x = df_conf_obs[df_conf_obs['X'] <= df_conf_obs['X'].median()]
# Use the per-unit regression slope so the result is comparable to do(X=1) vs do(X=0);
# a raw high-vs-low difference in means is not on a per-unit scale
obs_effect = np.polyfit(df_conf_obs['X'], df_conf_obs['Y'], 1)[0]

print("Observational Analysis (BIASED):")
print(f"E[Y | X > median] = {high_x['Y'].mean():.3f}")
print(f"E[Y | X ≤ median] = {low_x['Y'].mean():.3f}")
print(f"Observational slope of Y on X = {obs_effect:.3f}")

# Interventional effect (unbiased)
scm_do_x_high = scm_conf.intervene({'X': 1.0})
scm_do_x_low = scm_conf.intervene({'X': 0.0})

data_do_high = scm_do_x_high.sample(1000, random_seed=42)
data_do_low = scm_do_x_low.sample(1000, random_seed=42)

causal_effect = pd.DataFrame(data_do_high)['Y'].mean() - pd.DataFrame(data_do_low)['Y'].mean()

print("\nInterventional Analysis (UNBIASED):")
print(f"E[Y | do(X=1)] = {pd.DataFrame(data_do_high)['Y'].mean():.3f}")
print(f"E[Y | do(X=0)] = {pd.DataFrame(data_do_low)['Y'].mean():.3f}")
print(f"Causal effect = {causal_effect:.3f}")
print(f"\nTrue causal effect (from structural equation) = 2.0")
print(f"\nBias = {obs_effect - 2.0:.3f}")

### Adjusting for Confounders

We can remove bias by **conditioning on $Z$** (back-door adjustment).
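
One simple way to condition on $Z$ in a linear model is regression adjustment: include $Z$ as a covariate and read off the coefficient on $X$. The sketch below uses this technique in place of stratification and assumes unit-variance Gaussian noise, which may not match `confounded_scm()`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2.0 * x + z + rng.normal(size=n)

# Unadjusted: regress Y on X only (intercept in column 0)
design_naive = np.column_stack([np.ones(n), x])
coef_naive, *_ = np.linalg.lstsq(design_naive, y, rcond=None)

# Adjusted: include Z, which blocks the back-door path X <- Z -> Y
design_adj = np.column_stack([np.ones(n), x, z])
coef_adj, *_ = np.linalg.lstsq(design_adj, y, rcond=None)

print(f"Coefficient on X, unadjusted:     {coef_naive[1]:.2f}")
print(f"Coefficient on X, adjusted for Z: {coef_adj[1]:.2f}")
```

Regression adjustment relies on the linear form of the structural equation for $Y$. Stratification makes weaker functional-form assumptions but leaves residual confounding within coarse bins.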

In [None]:
# Back-door adjustment: stratify by Z, then estimate the X -> Y slope within each stratum
df_conf_obs['Z_bin'] = pd.qcut(df_conf_obs['Z'], q=5, labels=False)

adjusted_slopes = []
stratum_sizes = []
for z_bin in range(5):
    stratum = df_conf_obs[df_conf_obs['Z_bin'] == z_bin]
    # Per-unit slope of Y on X, so the result is comparable to do(X=1) vs do(X=0)
    slope = np.polyfit(stratum['X'], stratum['Y'], 1)[0]
    adjusted_slopes.append(slope)
    stratum_sizes.append(len(stratum))

# Weight each stratum by its share of the sample, i.e. by the empirical P(Z)
adjusted_effect = np.average(adjusted_slopes, weights=stratum_sizes)

print(f"Adjusted effect (per-unit slope within Z strata): {adjusted_effect:.3f}")
print(f"Causal effect (from intervention): {causal_effect:.3f}")
print("\nStratifying on Z removes most of the confounding bias (coarse bins leave a little residual confounding).")

---

## Part 3: Biological Applications

Let's apply SCMs to stylized biological scenarios, using simulated data from the example SCMs.

### Application 1: Gene Regulatory Network

**SCM:** TF → Gene → Protein

**Question:** What happens if we knock out the transcription factor?
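
A knockout corresponds to graph surgery: the structural equation for TF is overwritten with a constant, while the downstream equations stay untouched. The sketch below illustrates this on a linear chain. Its coefficients, noise scales, and the `sample_chain` helper are hypothetical and are not taken from `gene_regulation_scm()`.

```python
import numpy as np

def sample_chain(n, do_tf=None, seed=0):
    """Sample TF -> Gene -> Protein with illustrative (assumed) coefficients."""
    rng = np.random.default_rng(seed)
    tf = 1.0 + rng.normal(0, 0.2, n)              # baseline TF activity
    if do_tf is not None:
        tf = np.full(n, float(do_tf))             # graph surgery: overwrite TF's equation
    gene = 1.5 * tf + rng.normal(0, 0.2, n)       # Gene := 1.5 * TF + U_Gene
    protein = 0.8 * gene + rng.normal(0, 0.2, n)  # Protein := 0.8 * Gene + U_Protein
    return tf, gene, protein

_, gene_wt, protein_wt = sample_chain(20_000)
_, gene_ko, protein_ko = sample_chain(20_000, do_tf=0.0)
print(f"Gene:    {gene_wt.mean():.2f} -> {gene_ko.mean():.2f}")
print(f"Protein: {protein_wt.mean():.2f} -> {protein_ko.mean():.2f}")
```

In this toy chain the knockout propagates to both descendants. A variable upstream of TF, if one existed, would be unaffected.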

In [None]:
# Create gene regulation SCM
scm_gene = gene_regulation_scm()

# Wild-type (observational)
data_wt = scm_gene.sample(1000, random_seed=42)
df_wt = pd.DataFrame(data_wt)

# Knockout (intervention)
scm_knockout = scm_gene.intervene({'TF': 0.0})
data_ko = scm_knockout.sample(1000, random_seed=42)
df_ko = pd.DataFrame(data_ko)

# Compare
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# TF distribution
axes[0].hist(df_wt['TF'], bins=30, alpha=0.5, label='Wild-type', edgecolor='black')
axes[0].hist(df_ko['TF'], bins=30, alpha=0.5, label='Knockout', edgecolor='black')
axes[0].set_xlabel('TF Activity')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Transcription Factor')
axes[0].legend()

# Gene expression
axes[1].hist(df_wt['Gene'], bins=30, alpha=0.5, label='Wild-type', edgecolor='black')
axes[1].hist(df_ko['Gene'], bins=30, alpha=0.5, label='Knockout', edgecolor='black')
axes[1].set_xlabel('Gene Expression')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Gene Expression')
axes[1].legend()

# Protein abundance
axes[2].hist(df_wt['Protein'], bins=30, alpha=0.5, label='Wild-type', edgecolor='black')
axes[2].hist(df_ko['Protein'], bins=30, alpha=0.5, label='Knockout', edgecolor='black')
axes[2].set_xlabel('Protein Abundance')
axes[2].set_ylabel('Frequency')
axes[2].set_title('Protein Abundance')
axes[2].legend()

plt.tight_layout()
plt.show()

print("Effect of TF Knockout:")
print(f"Gene expression: {df_wt['Gene'].mean():.3f} → {df_ko['Gene'].mean():.3f} ({(df_ko['Gene'].mean() - df_wt['Gene'].mean()) / df_wt['Gene'].mean() * 100:.1f}% change)")
print(f"Protein abundance: {df_wt['Protein'].mean():.3f} → {df_ko['Protein'].mean():.3f} ({(df_ko['Protein'].mean() - df_wt['Protein'].mean()) / df_wt['Protein'].mean() * 100:.1f}% change)")

### Application 2: Drug Response with Genetic Modifier

**SCM:** Genotype → DrugMetabolism → Response

**Counterfactual question:** Would this patient respond better with a different genotype?

In [None]:
# Create drug response SCM
scm_drug = drug_response_scm()

# Sample patients
data_patients = scm_drug.sample(100, random_seed=42)
df_patients = pd.DataFrame(data_patients)

# Identify a poor responder
poor_responder_idx = df_patients['Response'].idxmin()
poor_responder = df_patients.loc[poor_responder_idx]

print("Poor Responder Profile:")
print(f"Genotype: {poor_responder['Genotype']:.0f} (1 = poor metabolizer)")
print(f"Drug Dose: {poor_responder['DrugDose']:.3f}")
print(f"Drug Metabolism: {poor_responder['DrugMetabolism']:.3f}")
print(f"Response: {poor_responder['Response']:.3f}")

# Counterfactual: what if they had the normal-metabolizer genotype?
# Note: the statements below are qualitative expectations, not computed values.
# A proper counterfactual would require abduction, which is complex for this nonlinear SCM.
print("\nCounterfactual Analysis (qualitative):")
print("If this patient had Genotype=0 (normal metabolizer):")
print("- Drug metabolism would shift toward the normal-metabolizer range")
print("- Response would likely be higher")
print("\nThis suggests genotype-guided dosing could improve outcomes.")

In [None]:
# Visualize genotype effect on response
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Response by genotype
genotype_0 = df_patients[df_patients['Genotype'] == 0]
genotype_1 = df_patients[df_patients['Genotype'] == 1]

axes[0].scatter(genotype_0['DrugDose'], genotype_0['Response'], alpha=0.6, label='Genotype 0 (normal)', s=50)
axes[0].scatter(genotype_1['DrugDose'], genotype_1['Response'], alpha=0.6, label='Genotype 1 (poor)', s=50)
axes[0].set_xlabel('Drug Dose')
axes[0].set_ylabel('Response')
axes[0].set_title('Drug Response by Genotype')
axes[0].legend()

# Distribution of response by genotype
axes[1].hist(genotype_0['Response'], bins=15, alpha=0.6, label='Genotype 0', edgecolor='black')
axes[1].hist(genotype_1['Response'], bins=15, alpha=0.6, label='Genotype 1', edgecolor='black')
axes[1].set_xlabel('Response')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Response Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()

print(f"Average response - Genotype 0: {genotype_0['Response'].mean():.3f}")
print(f"Average response - Genotype 1: {genotype_1['Response'].mean():.3f}")
print(f"Difference: {genotype_0['Response'].mean() - genotype_1['Response'].mean():.3f}")

### Application 3: Cell Cycle Confounding

**SCM:** CellCycle → Transfection, CellCycle → GeneExpression, Transfection → GeneExpression

**Problem:** Cell cycle affects both transfection efficiency and gene expression, confounding the effect estimate.

In [None]:
# Create cell cycle confounding SCM
scm_cc = cell_cycle_confounding_scm()

# Observational data (confounded)
data_cc_obs = scm_cc.sample(1000, random_seed=42)
df_cc_obs = pd.DataFrame(data_cc_obs)

# Observational effect (biased)
obs_corr = df_cc_obs['Transfection'].corr(df_cc_obs['GeneExpression'])

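# Remove the confounder's variation by intervening on cell cycle
scm_cc_fixed = scm_cc.intervene({'CellCycle': 0.0})
data_cc_fixed = scm_cc_fixed.sample(1000, random_seed=42)
df_cc_fixed = pd.DataFrame(data_cc_fixed)

causal_corr = df_cc_fixed['Transfection'].corr(df_cc_fixed['GeneExpression'])

# Correlations are unitless, so compare regression slopes against the coefficient 2.0
obs_slope = np.polyfit(df_cc_obs['Transfection'], df_cc_obs['GeneExpression'], 1)[0]
fixed_slope = np.polyfit(df_cc_fixed['Transfection'], df_cc_fixed['GeneExpression'], 1)[0]

print("Cell Cycle Confounding Analysis:")
print(f"Observational correlation (confounded): {obs_corr:.3f}")
print(f"Correlation with cell cycle fixed: {causal_corr:.3f}")
print(f"\nDifference attributable to confounding: {obs_corr - causal_corr:.3f}")
print(f"Slope (observational): {obs_slope:.3f}; slope (cell cycle fixed): {fixed_slope:.3f}")
print("True causal effect (from structural equation): 2.0")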

In [None]:
# Visualize confounding
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Observational (confounded)
scatter = axes[0].scatter(df_cc_obs['Transfection'], df_cc_obs['GeneExpression'], 
                          c=df_cc_obs['CellCycle'], cmap='viridis', alpha=0.5)
axes[0].set_xlabel('Transfection Efficiency')
axes[0].set_ylabel('Gene Expression')
axes[0].set_title(f'Observational (r = {obs_corr:.3f})')
plt.colorbar(scatter, ax=axes[0], label='Cell Cycle')

# Interventional (unconfounded)
axes[1].scatter(df_cc_fixed['Transfection'], df_cc_fixed['GeneExpression'], alpha=0.5)
axes[1].set_xlabel('Transfection Efficiency')
axes[1].set_ylabel('Gene Expression')
axes[1].set_title(f'Interventional (r = {causal_corr:.3f}, Cell Cycle Fixed)')

plt.tight_layout()
plt.show()

---

## Part 4: Connection to Do-Calculus

SCMs provide the **implementation** of do-calculus rules. Let's see how back-door adjustment works in SCMs.

### Back-Door Adjustment in SCMs

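**Back-door adjustment formula** (for a discrete $Z$ that satisfies the back-door criterion; the sum becomes an integral when $Z$ is continuous):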
$$P(Y \mid do(X)) = \sum_Z P(Y \mid X, Z) P(Z)$$

**SCM implementation:**
1. Sample $Z$ from marginal (unaffected by intervention)
2. Sample $Y$ from conditional given $X$ and $Z$
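
With binary variables the adjustment formula can be evaluated exactly, which makes the difference from plain conditioning easy to see. The probability tables below are made-up illustrative numbers, not values from any SCM in this notebook.

```python
# Hypothetical probability tables for binary Z, X, Y (illustrative numbers only)
p_z = {0: 0.5, 1: 0.5}                      # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X = 1 | Z = z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y = 1 | X = x, Z = z), keyed by (x, z)
                 (1, 0): 0.4, (1, 1): 0.8}

# Back-door adjustment: average P(Y | X=1, Z=z) over the marginal P(Z=z)
p_y1_do_x1 = sum(p_y1_given_xz[(1, z)] * p_z[z] for z in p_z)

# Plain conditioning: average over P(Z=z | X=1) instead (Bayes' rule)
p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
p_z_given_x1 = {z: p_x1_given_z[z] * p_z[z] / p_x1 for z in p_z}
p_y1_given_x1 = sum(p_y1_given_xz[(1, z)] * p_z_given_x1[z] for z in p_z)

print(f"P(Y=1 | do(X=1)) = {p_y1_do_x1:.2f}")   # 0.60
print(f"P(Y=1 | X=1)     = {p_y1_given_x1:.2f}")  # 0.72
```

The only difference between the two quantities is the weighting of the strata: $P(Z)$ for the intervention versus $P(Z \mid X=1)$ for conditioning.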

In [None]:
# Use confounded SCM: Z → X, Z → Y
# Back-door adjustment: P(Y | do(X)) = sum_Z P(Y | X, Z) P(Z)

# Method 1: Direct intervention (ground truth)
scm_do_x_direct = scm_conf.intervene({'X': 1.0})
data_do_direct = scm_do_x_direct.sample(5000, random_seed=42)
y_do_direct = pd.DataFrame(data_do_direct)['Y'].mean()

# Method 2: Back-door adjustment from observational data
# Simulate: for each Z, compute E[Y | X=1, Z]
data_obs_large = scm_conf.sample(5000, random_seed=42)
df_obs_large = pd.DataFrame(data_obs_large)

# Stratify by Z and compute weighted average
z_bins = pd.qcut(df_obs_large['Z'], q=10, labels=False, duplicates='drop')
df_obs_large['Z_bin'] = z_bins

backdoor_estimates = []
weights = []

for z_bin in df_obs_large['Z_bin'].unique():
    stratum = df_obs_large[df_obs_large['Z_bin'] == z_bin]
    # Find observations with X close to 1.0
    x_close = stratum[(stratum['X'] > 0.9) & (stratum['X'] < 1.1)]
    
    if len(x_close) > 0:
        backdoor_estimates.append(x_close['Y'].mean())
        weights.append(len(stratum) / len(df_obs_large))

y_backdoor = np.average(backdoor_estimates, weights=weights)

print("Back-Door Adjustment:")
print(f"E[Y | do(X=1)] (direct intervention): {y_do_direct:.3f}")
print(f"E[Y | do(X=1)] (back-door adjustment): {y_backdoor:.3f}")
print(f"\nDifference: {abs(y_do_direct - y_backdoor):.3f}")
print("\nBack-door adjustment successfully recovers the causal effect!")

---

## Part 5: Connection to Propensity Scores

Propensity scores emerge naturally from SCMs. Let's see how.

### Propensity Score in SCM

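The **propensity score** $e(Z) = P(T=1 \mid Z)$ is the probability of receiving a binary treatment $T$ given the confounder, and it is determined by the structural equation for the treatment.

In our confounded SCM, $X := Z + U_X$ is continuous, so the code below first binarizes it at the median to define $T$.

We can then estimate $e(Z)$ and use inverse probability weighting (IPW) to estimate the causal effect of $T$.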
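
The reweighting idea is easiest to check in a setting where the treatment is binary by construction and the true effect is known. The following sketch assumes a logistic propensity and a true average treatment effect of 2.0. It uses the true propensity rather than a fitted model, and it is separate from the notebook's `scm_conf` example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
z = rng.normal(size=n)
e_z = 1.0 / (1.0 + np.exp(-z))        # true propensity P(T=1 | Z), assumed logistic
t = rng.binomial(1, e_z)              # treatment assignment depends on Z
y = 2.0 * t + z + rng.normal(size=n)  # true average treatment effect is 2.0

# Naive contrast: biased because treated units tend to have higher Z
ate_naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse probability weights, normalized within each arm (Hajek estimator)
w = np.where(t == 1, 1.0 / e_z, 1.0 / (1.0 - e_z))
y1 = np.sum(w * t * y) / np.sum(w * t)
y0 = np.sum(w * (1 - t) * y) / np.sum(w * (1 - t))
ate_ipw = y1 - y0

print(f"Naive difference in means: {ate_naive:.2f}")
print(f"IPW estimate:              {ate_ipw:.2f}")
```

Weighting each unit by the inverse of its probability of receiving the treatment it actually got makes the two arms resemble the full population in $Z$, mimicking a randomized assignment.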

In [None]:
# Create binary treatment version
df_obs_large['T'] = (df_obs_large['X'] > df_obs_large['X'].median()).astype(int)

# Estimate propensity score: P(T=1 | Z)
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(df_obs_large[['Z']], df_obs_large['T'])
df_obs_large['e_Z'] = lr.predict_proba(df_obs_large[['Z']])[:, 1]

# IPW estimator
df_obs_large['weight'] = np.where(
    df_obs_large['T'] == 1,
    1 / df_obs_large['e_Z'],
    1 / (1 - df_obs_large['e_Z'])
)

# Clip extreme weights
df_obs_large['weight'] = df_obs_large['weight'].clip(0.1, 10)

# Compute IPW estimate
y1_ipw = (df_obs_large['T'] * df_obs_large['Y'] * df_obs_large['weight']).sum() / \
         (df_obs_large['T'] * df_obs_large['weight']).sum()
y0_ipw = ((1 - df_obs_large['T']) * df_obs_large['Y'] * df_obs_large['weight']).sum() / \
         ((1 - df_obs_large['T']) * df_obs_large['weight']).sum()

ate_ipw = y1_ipw - y0_ipw

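# Reference value from intervention: effect of a one-unit change in X
x_median = df_obs_large['X'].median()
scm_do_high = scm_conf.intervene({'X': x_median + 0.5})
scm_do_low = scm_conf.intervene({'X': x_median - 0.5})
data_high = scm_do_high.sample(1000, random_seed=42)
data_low = scm_do_low.sample(1000, random_seed=42)
ate_true = pd.DataFrame(data_high)['Y'].mean() - pd.DataFrame(data_low)['Y'].mean()

# T lumps many X values together, so rescale the IPW contrast by the
# weighted gap in X between the two arms to get a per-unit effect
w1 = df_obs_large['T'] * df_obs_large['weight']
w0 = (1 - df_obs_large['T']) * df_obs_large['weight']
x_gap_ipw = (w1 * df_obs_large['X']).sum() / w1.sum() - (w0 * df_obs_large['X']).sum() / w0.sum()

print("IPW Estimation:")
print(f"ATE of T (IPW): {ate_ipw:.3f}")
print(f"Weighted gap in X between arms: {x_gap_ipw:.3f}")
print(f"Per-unit effect (IPW): {ate_ipw / x_gap_ipw:.3f}")
print(f"Per-unit effect (from intervention): {ate_true:.3f}")
print("\nAfter rescaling, the IPW estimate should land close to the interventional effect.")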

---

## Summary and Key Takeaways

### What We Learned

1. **Three Levels of Causation**
   - Association: $P(Y \mid X)$ - observational
   - Intervention: $P(Y \mid do(X))$ - population-level causal
   - Counterfactual: $Y_x$ - individual-level causal

2. **SCMs Provide Mechanisms**
   - Structural equations define data-generating process
   - Interventions = graph surgery
   - Counterfactuals = abduction-action-prediction

3. **Confounding Matters**
   - $P(Y \mid X) \neq P(Y \mid do(X))$ when confounded
   - Back-door adjustment removes bias
   - IPW reweights to simulate intervention

4. **Biological Applications**
   - Gene regulation: knockout effects
   - Drug response: genotype-guided dosing
   - Cell cycle: controlling for confounders

### Next Steps

- Explore mediation analysis (direct vs indirect effects)
- Learn about identifiability of counterfactuals
- Apply to your own biological data
- Integrate with causal discovery methods

### Further Reading

- Tutorial: `docs/causal_inference/structural-causal-models.md`
- Do-calculus: `docs/causal_inference/do-calculus.md`
- IPW: `docs/causal_inference/estimating-treatment-effects.md`

---

## Exercises

### Exercise 2: Collider Bias

Use the `collider_scm()` to demonstrate how conditioning on a collider creates spurious association.
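
As a warm-up that does not depend on the library, the sketch below simulates a linear collider $X \to C \leftarrow Y$ with independent standard-normal $X$ and $Y$. The functional form and noise scale are assumptions and may differ from `collider_scm()`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)                  # X and Y are generated independently
y = rng.normal(size=n)
c = x + y + rng.normal(0, 0.5, size=n)  # collider: X -> C <- Y (assumed linear)

corr_all = np.corrcoef(x, y)[0, 1]
selected = c > np.median(c)             # condition on the collider
corr_selected = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"corr(X, Y), all units:      {corr_all:.2f}")
print(f"corr(X, Y), C above median: {corr_selected:.2f}")
```

Among units with a high $C$, a low $X$ must be compensated by a high $Y$ and vice versa, which induces a negative correlation even though $X$ and $Y$ are independent overall.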

In [None]:
# Your code here
scm_collider = collider_scm()
data_collider = scm_collider.sample(1000, random_seed=42)
df_collider = pd.DataFrame(data_collider)

# Compute correlation without conditioning
corr_unconditional = df_collider['X'].corr(df_collider['Y'])

# Compute correlation conditioning on C
df_cond_c = df_collider[df_collider['C'] > df_collider['C'].median()]
corr_conditional = df_cond_c['X'].corr(df_cond_c['Y'])

print(f"Correlation(X, Y) unconditional: {corr_unconditional:.3f}")
print(f"Correlation(X, Y) | C > median: {corr_conditional:.3f}")
print("\nConditioning on collider C creates spurious correlation!")

### Exercise 3: Mediation Analysis

Use the `mediation_scm()` to decompose the total effect into direct and indirect effects.
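
In a linear mediation model the decomposition has a closed form, which is useful for checking a simulation. The sketch below assumes $M := aX + U_M$ and $Y := cX + bM + U_Y$ with hypothetical coefficients, which are not necessarily those of `mediation_scm()`. It reuses the same noise draws in every regime, so the contrasts are exact.

```python
import numpy as np

# Hypothetical linear mediation SCM: M := a*X + U_M, Y := c*X + b*M + U_Y
a, b, c = 1.5, 0.8, 0.5
rng = np.random.default_rng(4)
n = 10_000
u_m = rng.normal(size=n)
u_y = rng.normal(size=n)

def mean_y(x, m=None):
    """Mean of Y under do(X=x), optionally also fixing M via do(M=m)."""
    m_val = a * x + u_m if m is None else m
    return np.mean(c * x + b * m_val + u_y)

total_effect = mean_y(1.0) - mean_y(0.0)                 # equals c + a*b
direct_effect = mean_y(1.0, m=0.0) - mean_y(0.0, m=0.0)  # controlled direct effect, equals c
indirect_effect = total_effect - direct_effect           # equals a*b in a linear model

print(f"total={total_effect:.2f}, direct={direct_effect:.2f}, indirect={indirect_effect:.2f}")
```

The subtraction total minus direct identifies the indirect effect only because the model is linear with no $X$-$M$ interaction. With interactions, natural direct and indirect effects have to be defined through counterfactuals.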

In [None]:
# Your code here
scm_med = mediation_scm()

# Total effect: do(X=1) vs do(X=0)
scm_do_x1 = scm_med.intervene({'X': 1.0})
scm_do_x0 = scm_med.intervene({'X': 0.0})
data_x1 = scm_do_x1.sample(1000, random_seed=42)
data_x0 = scm_do_x0.sample(1000, random_seed=42)
total_effect = pd.DataFrame(data_x1)['Y'].mean() - pd.DataFrame(data_x0)['Y'].mean()

# Controlled direct effect: do(X=1, M=m) vs do(X=0, M=m), with m set to the mean of M under do(X=0)
# (This is simplified: natural direct and indirect effects require counterfactuals)
m_mean = pd.DataFrame(data_x0)['M'].mean()
scm_direct_x1 = scm_med.intervene({'X': 1.0, 'M': m_mean})
scm_direct_x0 = scm_med.intervene({'X': 0.0, 'M': m_mean})
data_direct_x1 = scm_direct_x1.sample(1000, random_seed=42)
data_direct_x0 = scm_direct_x0.sample(1000, random_seed=42)
direct_effect = pd.DataFrame(data_direct_x1)['Y'].mean() - pd.DataFrame(data_direct_x0)['Y'].mean()

# Indirect effect
indirect_effect = total_effect - direct_effect

print(f"Total effect: {total_effect:.3f}")
print(f"Direct effect (X → Y): {direct_effect:.3f}")
print(f"Indirect effect (X → M → Y): {indirect_effect:.3f}")
print(f"\nProportion mediated: {indirect_effect / total_effect * 100:.1f}%")