<a href="https://colab.research.google.com/github/julie-dfx/causal-decision-analytics/blob/main/00_reboot_03_causal_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Causal identification: when effects are learnable

This notebook explores the distinction between estimation and identification. Through simulations, it demonstrates cases where causal effects can be recovered by adjustment and cases where they are fundamentally not identifiable from the data.

*Topics: identification, selection bias, IV intuition*


### Results

This notebook shows that causal effects are identifiable only when the data and assumptions allow us to isolate variation in the treatment that is independent of confounders.
Increasing the sample size improves precision but does not resolve an identification failure caused by unobserved confounding.

### Limitations

The examples assume simple data-generating processes and focus on unobserved confounding as the primary source of non-identification. Other sources of non-identification, such as simultaneity or measurement error, are not yet explored.

# Backdoor criterion


In [None]:
import numpy as np
import statsmodels.api as sm

# Simulate a confounded world
np.random.seed(1)
n = 500

#Confounder:
Z = np.random.normal(0, 1, n) # --> affects both X and Y

#Treatment
X = 1.5 * Z + np.random.normal(0, 1, n)

#Outcome
Y = 2.0 * X + 3.0 * Z + np.random.normal(0, 1, n)

# The backdoor path is open X <-- Z --> Y
# True causal effect of X on Y = 2.0

In [None]:
#Regression #1: no adjustment
X1 = sm.add_constant(X)
res1 = sm.OLS(Y, X1).fit()
res1.params

#result is far from 2.0

array([0.09138593, 3.40609307])

In [None]:
#Regression #2: correct adjustment
X2 = sm.add_constant(np.column_stack([X, Z]))
res2 = sm.OLS(Y, X2).fit()
res2.params

#result: coefficient is close to 2; backdoor criterion in action

array([0.07616095, 2.0236437 , 2.98435111])

In [None]:
#Regression 3: bad adjustment (collider)
C = X + Y + np.random.normal(0, 1, n)

X3 = sm.add_constant(np.column_stack([X, C]))
res3 = sm.OLS(Y, X3).fit()
res3.params

#result: the coefficient on X is wrong again (~0.01); adjusting for the wrong variable (the collider C) introduced bias

array([0.04059645, 0.01346148, 0.76665627])

### Empirical demonstration of the backdoor criterion
When the confounder Z is omitted, the regression coefficient on X is biased.
Conditioning on Z blocks the backdoor path and recovers the true causal effect.
Conditioning on a collider opens a non-causal path and introduces bias, even though model fit may improve.
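
The size of the bias in the unadjusted regression can be worked out from the simulation's own coefficients. As a sketch, assuming the noise terms are independent of Z and of each other (as they are in the code above), the population limit of the coefficient on X when Z is omitted is:

$$
\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
= 2.0 + 3.0 \cdot \frac{\operatorname{Cov}(X, Z)}{\operatorname{Var}(X)}
= 2.0 + 3.0 \cdot \frac{1.5}{1.5^2 + 1}
\approx 3.38
$$

This is close to the 3.41 estimated with n = 500. The gap from 2.0 depends only on how strongly Z drives X and Y, not on the sample size.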



### Deep Dive

If we rerun for n in [100, 1000, 10000], we see that the bias persists: when identification is wrong, more data only makes us more certain of the wrong answer.

If we rerun with more noise relative to signal, `Y = 2.0 * X + 3.0 * Z + np.random.normal(0, 5, n)`, we observe that variance increases and confidence intervals widen, but the bias is unchanged. Variance is about uncertainty; bias is about structure.
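
The two reruns described above can be sketched in one loop. This is a minimal illustration, not the notebook's original cells: it uses plain NumPy least squares instead of statsmodels, and the helper names (`ols_slope`, `simulate_unadjusted`), the seed, and the number of repetitions are assumptions introduced here.

```python
import numpy as np

def ols_slope(x, y):
    """Return the OLS coefficient on x from a regression of y on [1, x]."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

def simulate_unadjusted(n, noise_sd, rng):
    """Draw one dataset from the confounded world and regress Y on X only."""
    z = rng.normal(0, 1, n)                      # confounder, omitted from the regression
    x = 1.5 * z + rng.normal(0, 1, n)            # treatment
    y = 2.0 * x + 3.0 * z + rng.normal(0, noise_sd, n)  # outcome, true effect = 2.0
    return ols_slope(x, y)

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
reps = 200                       # repetitions per setting (assumption)
results = {}
for noise_sd in (1, 5):
    for n in (100, 1000, 10000):
        slopes = np.array([simulate_unadjusted(n, noise_sd, rng) for _ in range(reps)])
        results[(noise_sd, n)] = (slopes.mean(), slopes.std())
        print(f"noise_sd={noise_sd} n={n:>5} mean slope={slopes.mean():.3f} sd={slopes.std():.3f}")
```

Across settings, the spread of the estimates should shrink with n and grow with the noise level, while the average stays near the biased value rather than near 2.0.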

In [None]:
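# Create a richer graph in code (reuses X, Z and n from the cells above)
W = np.random.normal(0, 1, n)             # independent of X, Z and Y
Z2 = 0.5 * Z + np.random.normal(0, 1, n)  # descendant (noisy proxy) of the confounder Z

# Outcome is redrawn with fresh noise; W and Z2 do not enter it
Y = 2.0 * X + 3.0 * Z + np.random.normal(0, 1, n)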

In [None]:
#Regression #1: control for X only
X1 = sm.add_constant(X)
res1 = sm.OLS(Y, X1).fit()
print(res1.summary())

#result is far from 2.0 (3.4)

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.915
Model:                            OLS   Adj. R-squared:                  0.915
Method:                 Least Squares   F-statistic:                     5351.
Date:                Thu, 15 Jan 2026   Prob (F-statistic):          1.48e-268
Time:                        09:07:06   Log-Likelihood:                -1025.3
No. Observations:                 500   AIC:                             2055.
Df Residuals:                     498   BIC:                             2063.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0460      0.084      0.545      0.5

In [None]:
#Regression #2: control for X + W
X2 = sm.add_constant(np.column_stack([X, W]))
res2 = sm.OLS(Y, X2).fit()
print(res2.summary())

#result: controlling for something that is not on any path between X and Y (W) doesn't change the coefficient on X, which is still wrong

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.915
Model:                            OLS   Adj. R-squared:                  0.915
Method:                 Least Squares   F-statistic:                     2676.
Date:                Thu, 15 Jan 2026   Prob (F-statistic):          8.42e-267
Time:                        09:07:06   Log-Likelihood:                -1024.8
No. Observations:                 500   AIC:                             2056.
Df Residuals:                     497   BIC:                             2068.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0470      0.084      0.557      0.5

In [None]:
#Regression #3: control for X + Z2 (Z2 is a descendant of the confounder Z, but is not on the causal path to Y)
X2 = sm.add_constant(np.column_stack([X, Z2]))
res2 = sm.OLS(Y, X2).fit()
print(res2.summary())

#result: Z2 is only a noisy proxy for Z, so controlling for it removes at most part of the bias; the coefficient on X is still wrong

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.922
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     2953.
Date:                Thu, 15 Jan 2026   Prob (F-statistic):          1.49e-276
Time:                        09:07:06   Log-Likelihood:                -1002.2
No. Observations:                 500   AIC:                             2010.
Df Residuals:                     497   BIC:                             2023.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0614      0.081      0.761      0.4

In [None]:
#Regression #4: control for X + Z + W (controlling both for the confounder and for something that is not on the causal path)
X2 = sm.add_constant(np.column_stack([X, Z, W]))
res2 = sm.OLS(Y, X2).fit()
print(res2.summary())

#result: controlling for the confounder Z plus an irrelevant variable W still gives the correct coefficient on X; adding W is harmless but unnecessary

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.975
Method:                 Least Squares   F-statistic:                     6450.
Date:                Thu, 15 Jan 2026   Prob (F-statistic):               0.00
Time:                        09:07:06   Log-Likelihood:                -718.87
No. Observations:                 500   AIC:                             1446.
Df Residuals:                     496   BIC:                             1463.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0314      0.046      0.684      0.4

### REAL EXAMPLE

- X = compensation issued
- Y = 28-day retention
- Z = order issue severity
- C = basket value



1.   severity --> compensation --> retention
2.   severity --> retention
3.   compensation --> basket value <-- retention


The backdoor path is open: severity is the confounder and must be adjusted for.

Basket value is influenced by both X and Y: it is a collider and should not be controlled for.
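
The graph above can be simulated to see both mistakes side by side. This is a sketch under loud assumptions: every coefficient is hypothetical and chosen only for illustration, retention is modelled as a continuous score rather than a binary 28-day flag, and the helper `coef_on_first` is introduced here. Plain NumPy least squares is used instead of statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical coefficients, not estimated from any real data.
severity = rng.normal(0, 1, n)                                          # Z: confounder
compensation = 1.0 * severity + rng.normal(0, 1, n)                     # X: treatment
retention = 0.5 * compensation - 1.0 * severity + rng.normal(0, 1, n)   # Y: true effect of X = 0.5
basket_value = compensation + retention + rng.normal(0, 1, n)           # C: collider

def coef_on_first(y, *regressors):
    """OLS of y on an intercept plus the regressors; return the first regressor's coefficient."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

naive = coef_on_first(retention, compensation)                              # backdoor path open
adjusted = coef_on_first(retention, compensation, severity)                 # backdoor path blocked
with_collider = coef_on_first(retention, compensation, severity, basket_value)  # collider path opened

print(f"no adjustment:            {naive:.2f}")
print(f"adjust for severity:      {adjusted:.2f}")
print(f"also adjust for basket:   {with_collider:.2f}")
```

With these made-up numbers, the unadjusted regression makes compensation look useless, because customers with severe issues receive more compensation and also churn more. Adding basket value on top of severity pushes the estimate away from 0.5 again.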




### Identification vs Estimation
Can this causal effect be learned from the data at all, even with infinite data?

In [None]:
## We simulate 2 worlds, one where regression is biased but fixable by adjustment, one where it's not

## World A
np.random.seed(1)
n = 1000

Z = np.random.normal(0, 1, n)
X = Z + np.random.normal(0, 1, n)
Y = 2 * X + 3 * Z + np.random.normal(0, 1, n)

In [None]:
#run 2 regressions

#unadjusted
res_unadj = sm.OLS(Y, sm.add_constant(X)).fit()

#adjusted
res_adj = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()

print(res_unadj.summary())
print(res_adj.summary())

# results: the adjusted coefficient on X is close to 2, the unadjusted one is not; the effect is identifiable because a valid adjustment set exists

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.817
Model:                            OLS   Adj. R-squared:                  0.817
Method:                 Least Squares   F-statistic:                     4453.
Date:                Thu, 15 Jan 2026   Prob (F-statistic):               0.00
Time:                        09:12:03   Log-Likelihood:                -2255.0
No. Observations:                1000   AIC:                             4514.
Df Residuals:                     998   BIC:                             4524.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0024      0.073      0.032      0.9

In [None]:
# World B - not identifiable

np.random.seed(1)
n = 10000  # the summary printed for World B reports 10000 observations
U = np.random.normal(0, 1, n) # unobserved confounder

X = U + np.random.normal(0, 1, n)
Y = 2 * X + U + np.random.normal(0, 1, n)

In [None]:
res = sm.OLS(Y, sm.add_constant(X)).fit()
print(res.summary())

#result: the coefficient on X is not 2 (in this setup OLS converges to 2 + Cov(X, U)/Var(X) = 2.5). The effect is not identifiable: as n increases the estimate converges, but not to 2 ==> not a variance problem, but an identification failure


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.895
Model:                            OLS   Adj. R-squared:                  0.895
Method:                 Least Squares   F-statistic:                 8.564e+04
Date:                Thu, 15 Jan 2026   Prob (F-statistic):               0.00
Time:                        09:19:06   Log-Likelihood:                -16171.
No. Observations:               10000   AIC:                         3.235e+04
Df Residuals:                    9998   BIC:                         3.236e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0145      0.012     -1.189      0.2

## Identification vs Estimation

In World A, the causal effect of X on Y is identifiable because a valid adjustment set exists. Conditioning on Z blocks all backdoor paths.

In World B, the causal effect is not identifiable from the observed data because the confounder is unobserved. Increasing the sample size improves precision but does not remove bias.
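
The World B claim can be checked by letting n grow. This is a minimal sketch assuming the same data-generating process as the World B cell; the function name `world_b_slope`, the seed, and the sample sizes are choices made here, and NumPy least squares stands in for statsmodels.

```python
import numpy as np

rng = np.random.default_rng(1)

def world_b_slope(n, rng):
    """Regress Y on X alone; U exists in the simulation but is never given to the model."""
    u = rng.normal(0, 1, n)                  # unobserved confounder
    x = u + rng.normal(0, 1, n)
    y = 2 * x + u + rng.normal(0, 1, n)      # true effect of X on Y = 2
    design = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

slopes_b = {n: world_b_slope(n, rng) for n in (1_000, 10_000, 100_000, 1_000_000)}
for n, slope in slopes_b.items():
    print(f"n={n:>9,} slope={slope:.4f} gap to true effect={slope - 2:.4f}")
```

The estimates settle down as n grows, but around 2 + Cov(X, U)/Var(X) = 2.5 rather than 2. No amount of data on X and Y alone closes that gap.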