### Introduction

This script demonstrates how confounding biases causal effect estimates and how adjusting for the confounder recovers a more accurate estimate. Using the `DoWhy` library, we simulate data with a binary treatment and an outcome that both depend on a single confounding variable; the true treatment effect in the simulation is 1. The script estimates the Average Treatment Effect (ATE) with and without adjusting for the confounder, illustrating how ignoring confounders leads to biased results. It then checks the adjusted estimate with a placebo-treatment refutation test.

In [10]:
import numpy as np
import pandas as pd
import dowhy
from dowhy import CausalModel

# Simulate data in which a single confounder drives both treatment and outcome
def simulate_data_with_confounding(sample_size=10000):
    np.random.seed(2)
    confounder = np.random.normal(0, 1, sample_size)
    # Logistic function maps the confounder to a probability of receiving treatment
    treatment_prob = 1 / (1 + np.exp(-confounder))
    # Binary treatment drawn with that probability
    treatment = np.random.binomial(1, treatment_prob, size=sample_size)
    # True treatment effect is 1; the confounder contributes 5 per unit
    outcome = treatment + 5 * confounder + np.random.normal(0, 1, sample_size)
    
    data = pd.DataFrame({'Treatment': treatment, 'Outcome': outcome, 'Confounder': confounder})
    return data

# Simulate the data
data = simulate_data_with_confounding()

# Display the first few rows of the data
print(data.head())

   Treatment    Outcome  Confounder
0          0  -2.848533   -0.416758
1          0  -1.559242   -0.056267
2          0 -12.528960   -2.136196
3          1   9.441820    1.640271
4          0  -9.568476   -1.793436


In [11]:
# Creating a causal model without controlling for the confounder
model_without_confounder = CausalModel(
    data=data,
    treatment="Treatment",
    outcome="Outcome"
)

# Identifying the effect without controlling for the confounder
identified_estimand_without_confounder = model_without_confounder.identify_effect(proceed_when_unidentifiable=True)
print("Identified estimand (without considering confounder in model): ", identified_estimand_without_confounder)

# Estimating the effect without controlling for the confounder
estimate_without_confounder = model_without_confounder.estimate_effect(
    identified_estimand_without_confounder,
    method_name="backdoor.linear_regression"  # regresses Outcome on Treatment only, since this model declares no confounder
)
# The true effect is 1, so an estimate well above 1 reflects confounding bias
print("Estimated ATE without confounder: ", estimate_without_confounder.value)

Identified estimand (without considering confounder in model):  Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d                  
────────────(E[Outcome])
d[Treatment]            
Estimand assumption 1, Unconfoundedness: If U→{Treatment} and U→Outcome then P(Outcome|Treatment,,U) = P(Outcome|Treatment,)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

Estimated ATE without confounder:  5.0331774736038195
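
To see where the inflated estimate comes from, the sketch below reproduces the comparison with plain NumPy instead of DoWhy. It is an illustration under assumptions, not DoWhy's implementation: the helper `ols_treatment_coef` is hypothetical, and the data are re-simulated with a different random generator and seed, so the numbers will not match the notebook output exactly. It shows that the unadjusted regression coefficient equals the raw difference in group means, and that the excess over the true effect of 1 is roughly 5 times the gap in the confounder between treated and untreated units.

```python
import numpy as np

def ols_treatment_coef(outcome, treatment, covariate=None):
    """Hypothetical helper: OLS coefficient on `treatment`, optionally adjusting for one covariate."""
    columns = [np.ones_like(outcome), treatment]
    if covariate is not None:
        columns.append(covariate)
    design = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs[1]

# Same data-generating process as the notebook, but with an assumed seed
rng = np.random.default_rng(0)
n = 10_000
confounder = rng.normal(0, 1, n)
treatment = rng.binomial(1, 1 / (1 + np.exp(-confounder)))
outcome = treatment + 5 * confounder + rng.normal(0, 1, n)

# Unadjusted estimate: regression on treatment alone
naive = ols_treatment_coef(outcome, treatment)
diff_in_means = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# Adjusted estimate: the confounder enters the regression as a covariate
adjusted = ols_treatment_coef(outcome, treatment, confounder)

# Treated units have systematically higher confounder values
confounder_gap = confounder[treatment == 1].mean() - confounder[treatment == 0].mean()

print("Naive estimate:         ", naive)
print("Difference in means:    ", diff_in_means)
print("Adjusted estimate:      ", adjusted)
print("1 + 5 * confounder gap: ", 1 + 5 * confounder_gap)
```

The last printed value approximates the naive estimate because the outcome depends on the confounder with coefficient 5, so any imbalance in the confounder between the two groups is multiplied by 5 and added to the true effect.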


In [12]:
# Creating a causal model controlling for the confounder
model_with_confounder = CausalModel(
    data=data,
    treatment="Treatment",
    outcome="Outcome",
    common_causes=["Confounder"]
)

# Identifying the effect with confounder control
identified_estimand_with_confounder = model_with_confounder.identify_effect(proceed_when_unidentifiable=True)
print("Identified estimand (with considering confounder): ", identified_estimand_with_confounder)

# Estimating the effect controlling for the confounder
estimate_with_confounder = model_with_confounder.estimate_effect(
    identified_estimand_with_confounder,
    method_name="backdoor.propensity_score_matching"  # alternative: "backdoor.linear_regression"
)

# With the confounder adjusted for, the estimate should be close to the true effect of 1
print("Estimated ATE with confounder: ", estimate_with_confounder.value)

Identified estimand (with considering confounder):  Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d                             
────────────(E[Outcome|Confounder])
d[Treatment]                       
Estimand assumption 1, Unconfoundedness: If U→{Treatment} and U→Outcome then P(Outcome|Treatment,Confounder,U) = P(Outcome|Treatment,Confounder)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!

Estimated ATE with confounder:  1.0332999920426305


**Note: several estimation methods can be used to adjust for confounders, e.g.:**
- [backdoor.linear_regression](https://www.pywhy.org/dowhy/v0.10/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor/regression_based_methods.html)
- [backdoor.propensity_score_matching](https://www.pywhy.org/dowhy/v0.10/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor/propensity_based_methods.html)
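
As a rough illustration of how a propensity-based adjustment works, the sketch below uses inverse probability weighting (IPW) in plain NumPy. Note the substitution: this is IPW, not the matching procedure behind `backdoor.propensity_score_matching`, and it is not DoWhy code. The helper `fit_logistic`, the seed, and the normalized form of the weights are all assumptions made for this example.

```python
import numpy as np

def fit_logistic(features, labels, n_iter=25):
    """Hypothetical helper: logistic regression coefficients via Newton's method."""
    beta = np.zeros(features.shape[1])
    for _ in range(n_iter):
        prob = 1 / (1 + np.exp(-features @ beta))
        gradient = features.T @ (labels - prob)
        hessian = features.T @ (features * (prob * (1 - prob))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Same data-generating process as the notebook, but with an assumed seed
rng = np.random.default_rng(0)
n = 10_000
confounder = rng.normal(0, 1, n)
treatment = rng.binomial(1, 1 / (1 + np.exp(-confounder)))
outcome = treatment + 5 * confounder + rng.normal(0, 1, n)

# Step 1: estimate each unit's propensity score P(Treatment = 1 | Confounder)
features = np.column_stack([np.ones(n), confounder])
beta = fit_logistic(features, treatment)
propensity = 1 / (1 + np.exp(-features @ beta))

# Step 2: weight each unit by the inverse probability of the treatment it received
treated_weights = treatment / propensity
control_weights = (1 - treatment) / (1 - propensity)

# Step 3: difference of the weighted outcome means (weights normalized within each group)
ipw_ate = (np.sum(treated_weights * outcome) / np.sum(treated_weights)
           - np.sum(control_weights * outcome) / np.sum(control_weights))

print("Fitted propensity coefficients (intercept, slope):", beta)
print("IPW estimate of the ATE:", ipw_ate)
```

Weighting rebalances the confounder across the two groups, so the weighted comparison should land near the true effect of 1. It is typically noisier than regression adjustment here because units with extreme propensity scores receive large weights.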

In [13]:
# Refuting the confounder-adjusted estimate using a placebo treatment
refutation = model_with_confounder.refute_estimate(
    identified_estimand_with_confounder,
    estimate_with_confounder,
    method_name="placebo_treatment_refuter"
)
print("Refutation result: ", refutation)

Refutation result:  Refute: Use a Placebo Treatment
Estimated effect:1.0332999920426305
New effect:0.003643336685162588
p value:0.74



### Refutation Step Explanation:

1. **Estimated Effect**:
   - This is the original treatment effect, estimated with the true treatment variable. In the output it is approximately 1.03, the estimate obtained when controlling for the confounder.

2. **Placebo Treatment**:
   - In the refutation step, `dowhy` introduces a placebo treatment variable which replaces the true treatment variable. The objective is to test if similar treatment effects would be detected using this placebo.
   - By using a placebo, the model checks whether the same confounding mechanisms or random chance could falsely indicate a treatment effect, when in fact, none should logically exist.

3. **New Effect**:
   - The `New effect` is the treatment effect estimated with the placebo treatment. If the original model correctly captures the causal effect, this new effect should be close to zero, indicating that the placebo produces no spurious treatment effect.

4. **p-value**:
   - The p-value tests the null hypothesis that the placebo effect (the new effect) is zero, i.e. that any nonzero value arises from random chance alone.
   - A high p-value (conventionally above 0.05) means there is no statistical evidence against this null hypothesis, so the placebo estimate is consistent with zero. In this output the p-value is 0.74, so the placebo shows no significant treatment effect.
   - A high p-value therefore supports the original causal estimate, since no artificial effect is detected under placebo conditions. It does not prove the estimate correct: the test only fails to refute it and cannot rule out unobserved confounders.
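
The logic of the placebo check described above can be sketched without DoWhy. The example below is a simplified stand-in and not the library's implementation: it assumes a placebo built by randomly permuting the treatment column, uses a hypothetical helper `adjusted_effect` based on linear regression, and re-simulates the data with an assumed seed. It reports the spread of the placebo estimates rather than a p-value.

```python
import numpy as np

def adjusted_effect(outcome, treatment, confounder):
    """Hypothetical helper: OLS coefficient on `treatment`, adjusting for the confounder."""
    design = np.column_stack([np.ones_like(outcome), treatment, confounder])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs[1]

# Same data-generating process as the notebook, but with an assumed seed
rng = np.random.default_rng(0)
n = 10_000
confounder = rng.normal(0, 1, n)
treatment = rng.binomial(1, 1 / (1 + np.exp(-confounder)))
outcome = treatment + 5 * confounder + rng.normal(0, 1, n)

# Effect estimated with the true treatment
original_effect = adjusted_effect(outcome, treatment, confounder)

# Effect estimated with placebo treatments: shuffling breaks any link
# between the treatment column and the outcome
placebo_effects = np.array([
    adjusted_effect(outcome, rng.permutation(treatment), confounder)
    for _ in range(100)
])

print("Original effect:       ", original_effect)
print("Mean placebo effect:   ", placebo_effects.mean())
print("Std. of placebo effect:", placebo_effects.std())
```

If the estimator were picking up structure unrelated to the treatment, the placebo effects would drift away from zero. Here they should cluster tightly around zero while the original effect stays near 1.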

### Interpretation:

- **Original Effect vs. Placebo Effect**: The original estimated effect (with the true treatment) was about 1.03, while the placebo effect was close to zero (about 0.004) with a high p-value (0.74). This suggests that the estimated effect is tied to the actual treatment rather than being an artifact of the estimation procedure or of random noise.
- **Robustness Verification**: The placebo test fails to refute the causal model: replacing the treatment with a placebo removes the effect, as expected if the estimate stems from the treatment itself. This increases confidence in the estimate but does not confirm it, because a placebo test cannot detect every hidden bias.

Overall, the refutation step with a placebo treatment acts as a robustness check that adds confidence in the causal conclusions drawn from the model. It guards against estimates driven by random variation or a flawed estimation procedure, but passing it does not guarantee that every confounder has been accounted for.