- Since we do not have live experiment data, let's use synthetic data instead
- The causalml.dataset module provides utility functions to generate synthetic data
- We can configure the generated data with parameters such as:
- - n : Number of samples
- - p : Number of covariates (i.e. number of features)
- - mode : Which simulation scenario to generate
- - sigma : Standard deviation of the error term
- It returns the following, in this order:
- - y : Outcome array (i.e. the synthetic outcome of the experiment); in this case it is a continuous variable
- - X : Independent variables, with shape (n, p)
- - w : Treatment flag; 0 signifies control and 1 signifies treatment
- - tau : Individual treatment effect (ITE)
- - b : Expected outcome
- - e : Propensity of receiving treatment
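
- To make the six returned arrays concrete, here is a minimal NumPy sketch of a generator with the same return order
- make_toy_data and the formulas inside it are assumptions chosen for illustration; this is not causalml's implementation

```python
import numpy as np

def make_toy_data(n=1000, p=5, sigma=1.0, seed=0):
    """Hypothetical generator returning y, X, w, tau, b, e (requires p >= 3)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, p))                      # covariates, shape (n, p)
    b = np.sin(np.pi * X[:, 0] * X[:, 1]) + X[:, 2]   # expected outcome (assumed form)
    e = 0.25 + 0.5 * X[:, 0]                          # propensity of treatment (assumed form)
    w = rng.binomial(1, e)                            # treatment flag, 0 = control
    tau = (X[:, 0] + X[:, 1]) / 2                     # individual treatment effect (assumed form)
    # Treated units sit tau/2 above b and control units tau/2 below, plus noise.
    y = b + (w - 0.5) * tau + sigma * rng.normal(size=n)
    return y, X, w, tau, b, e

y, X, w, tau, b, e = make_toy_data(n=1000, p=5)
print(X.shape, y.shape)
```

- Because the propensity e depends on a covariate that also drives the outcome, treatment assignment in this toy data is confounded, which is the situation causal inference methods are meant to handle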

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
from causalml.dataset import synthetic_data

# Unpack in the documented return order: y, X, w, tau, b, e
y, X, treatment_flags, ite_list, exp_outcome_list, e = synthetic_data(mode=1, n=10_000, p=5, sigma=1.0)

print(f"Sample outcome: {y[:5]}")
print(f"Sample independent variables: {X[:5]}")

print(f"Treatment Data Count = {np.count_nonzero(treatment_flags)}")
print(f"Control Data Count = {len(y) - np.count_nonzero(treatment_flags)}")

print(f"Sample ITE: {ite_list[:5]}")


Sample outcome: [1.27648274 2.86010907 2.19068981 2.19385831 3.00888889]
Sample independent variables: [[0.53815609 0.99937586 0.7999792  0.00416606 0.9715307 ]
 [0.7648678  0.72510838 0.86039634 0.92684141 0.79977226]
 [0.07446032 0.25089329 0.8692077  0.51098639 0.86212467]
 [0.61163504 0.40759826 0.97159678 0.69027405 0.99513741]
 [0.2145972  0.59304468 0.49659091 0.36907764 0.33733784]]
Treatment Data Count = 5244
Control Data Count = 4756
Sample ITE: [0.76876598 0.74498809 0.16267681 0.50961665 0.40382094]


- Now we have a set of synthetic data with a continuous outcome variable and independent variables
- Let's try one causal inference algorithm to estimate the average treatment effect (ATE)

- LRSRegressor : an S-learner that uses linear regression as its base model
- We can specify the following parameters when creating an instance of this regressor:
- - ate_alpha = Significance level of the confidence interval for the ATE estimate; the default of 0.05 gives a 95% interval
- - control_name = String or int value that identifies the control samples, 0 by default. It is used to pick out control samples when we call estimate_ate()
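
- The S-learner idea can be sketched with plain NumPy: fit a single model on the covariates plus the treatment flag, then compare predictions with the flag set to 1 and to 0
- s_learner_ols_ate is a hypothetical helper written for this illustration, not the library's code, and the demo data below is made up

```python
import numpy as np

def s_learner_ols_ate(X, w, y):
    """Fit y ~ 1 + X + w by least squares and average the predicted effect."""
    n = len(y)
    design = np.column_stack([np.ones(n), X, w])   # treatment is just one more feature
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    # Predict every unit as treated and as control, then average the difference.
    treated = np.column_stack([np.ones(n), X, np.ones(n)]) @ coef
    control = np.column_stack([np.ones(n), X, np.zeros(n)]) @ coef
    return float(np.mean(treated - control))

# Noise-free demo data with a known effect of 0.7 (values are illustrative).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(500, 3))
w_demo = rng.integers(0, 2, size=500)
y_demo = 1.0 + X_demo @ np.array([0.5, -1.0, 2.0]) + 0.7 * w_demo
ate_hat = s_learner_ols_ate(X_demo, w_demo, y_demo)
print(ate_hat)  # should be very close to 0.7
```

- With a linear model and no interaction terms, the averaged difference is simply the fitted coefficient on the treatment flag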

In [2]:
from causalml.inference.meta import LRSRegressor

lr = LRSRegressor()
print(lr)

LRSRegressor(model=<causalml.inference.meta.slearner.StatsmodelsOLS object at 0x7fa968817220>)


- Estimate the ATE with its lower and upper bounds at the preset significance level alpha of 0.05 (i.e. a 95% confidence interval)

In [6]:
te, lb, ub = lr.estimate_ate(X, treatment_flags, y)
print(f"Confidence Level Alpha = {lr.ate_alpha}")
print(f"ATE = {np.round(te[0], 2)}, Range : {np.round(lb[0], 2)}:{np.round(ub[0], 2)}")

Confidence Level Alpha = 0.05
ATE = 0.68, Range : 0.63:0.72
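
- To see how alpha turns into a range, here is a sketch of a normal-approximation confidence interval
- Whether LRSRegressor computes its bounds in exactly this way is an assumption; normal_ci is a hypothetical helper and the standard error of 0.02 is a made-up value

```python
from statistics import NormalDist

def normal_ci(estimate, std_err, alpha=0.05):
    """Two-sided (1 - alpha) interval: estimate +/- z * std_err."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return estimate - z * std_err, estimate + z * std_err

# Hypothetical inputs: a point estimate of 0.68 with a standard error of 0.02.
lower, upper = normal_ci(0.68, 0.02, alpha=0.05)
print(f"95% interval: {lower:.3f} to {upper:.3f}")
```

- A larger alpha (for example 0.10) gives a narrower interval, because we accept a lower confidence level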


- As can be seen, on the synthetic data of 10,000 samples the estimated Average Treatment Effect was 0.68, with a 95% confidence interval of 0.63 to 0.72
- The synthetic data had 5 independent variables
- We now have an estimate of our treatment's impact; because the data is synthetic, it can be checked against the true effects in ite_list. Based on the business use case, we can then weigh the ROI to decide on full-scale adoption of this change
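
- Because the data is synthetic, the true individual effects are available, so an estimate can be compared with the true ATE (the mean of tau)
- check_against_truth is a hypothetical helper; the demo uses only the five ITE values printed earlier, far too few to judge the estimator, so in the notebook the full ite_list would be passed instead

```python
import numpy as np

def check_against_truth(tau, te, lb, ub):
    """Compare an ATE estimate and its interval with the mean of the true effects."""
    true_ate = float(np.mean(tau))
    return {
        "true_ate": true_ate,
        "bias": te - true_ate,                # positive means overestimation
        "covered": lb <= true_ate <= ub,      # does the interval contain the truth?
    }

# Illustration only: the five sample ITE values printed earlier in the notebook.
tau_sample = np.array([0.76876598, 0.74498809, 0.16267681, 0.50961665, 0.40382094])
report = check_against_truth(tau_sample, te=0.68, lb=0.63, ub=0.72)
print(report)
```

- A check like this on the full data shows whether the estimator recovers the known effect before we rely on it for real experiment data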