<a href="https://colab.research.google.com/github/pgurazada/causal_inference/blob/master/AB%20test/tuned_metalearners_indirect_AB_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q flaml


In [2]:
import pandas as pd

from sklearn.model_selection import train_test_split
from flaml import AutoML

from tqdm import tqdm

# Data

In [3]:
file_url = "https://msalicedatapublic.z5.web.core.windows.net/datasets/RecommendationAB/ab_sample.csv"
data_df = pd.read_csv(file_url)

In [4]:
data_df.sample(5)

Unnamed: 0,days_visited_exp_pre,days_visited_free_pre,days_visited_fs_pre,days_visited_hs_pre,days_visited_rs_pre,days_visited_vrs_pre,locale_en_US,revenue_pre,os_type_osx,os_type_windows,easier_signup,became_member,days_visited_post
29676,0,1,22,7,11,0,1,1.49,1,0,0,0,11
5420,7,24,13,13,17,21,0,1.43,1,0,0,0,9
89542,22,5,3,26,9,4,0,0.02,0,0,0,0,1
70687,12,3,16,13,5,15,1,209.88,0,0,1,0,1
72516,6,27,4,18,17,26,0,1.85,1,0,0,0,23


In [5]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   days_visited_exp_pre   100000 non-null  int64  
 1   days_visited_free_pre  100000 non-null  int64  
 2   days_visited_fs_pre    100000 non-null  int64  
 3   days_visited_hs_pre    100000 non-null  int64  
 4   days_visited_rs_pre    100000 non-null  int64  
 5   days_visited_vrs_pre   100000 non-null  int64  
 6   locale_en_US           100000 non-null  int64  
 7   revenue_pre            100000 non-null  float64
 8   os_type_osx            100000 non-null  int64  
 9   os_type_windows        100000 non-null  int64  
 10  easier_signup          100000 non-null  int64  
 11  became_member          100000 non-null  int64  
 12  days_visited_post      100000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 9.9 MB


# Data

In this scenario, a travel website would like to know whether joining a membership program leads users to spend more time engaging with the website and to purchase more products.

A direct A/B test is infeasible because the website cannot force users to become members. Likewise, the travel company cannot simply compare members and non-members in existing data, because the customers who chose to become members are likely already more engaged than other users.

**Solution:** The company had run an earlier experiment to test the value of a new, faster sign-up process. This experimental nudge towards membership serves as an instrument that generates random variation in the likelihood of membership. This is known as an **intent-to-treat** setting: the intention is to give a random group of users the "treatment" (access to the easier sign-up process), but not all users will actually take it.

The data* comprises:
 * Features collected in the 28 days prior to the experiment (denoted by the suffix `_pre`)
 * Experiment variables (whether the user was exposed to the easier signup -> the instrument, and whether the user became a member -> the treatment)
 * Variables collected in the 28 days after the experiment (denoted by the suffix `_post`).

Feature Name | Type | Details
:--- |:--- |:---
**days_visited_exp_pre** | X | #days a user visits the attractions pages
**days_visited_free_pre** | X | #days a user visits the website through free channels (e.g. domain direct)
**days_visited_fs_pre** | X | #days a user visits the flights pages
**days_visited_hs_pre** | X | #days a user visits the hotels pages
**days_visited_rs_pre** | X | #days a user visits the restaurants pages
**days_visited_vrs_pre** | X | #days a user visits the vacation rental pages
**locale_en_US** | X | whether the user accesses the website from the US
**os_type** | X | user's operating system (windows, osx, other)
**revenue_pre** | X | how much the user spent on the website in the pre-period
**easier_signup** | Z | whether the user was exposed to the easier signup process
**became_member** | T | whether the user became a member
**days_visited_post** | Y | #days a user visits the website in the 28 days after the experiment


*\*To protect the privacy of the travel company's users, the data used in this scenario is synthetically generated and the feature distributions don't correspond to real distributions. However, the features have preserved their names and meaning.*
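
As a rough illustration of how an instrument such as `easier_signup` can be used, the sketch below computes the intent-to-treat effect (the difference in mean outcome by instrument assignment) and the Wald ratio, which scales that effect by the change in membership uptake. The toy data and the helper `wald_estimate` are hypothetical and not part of the original analysis; only the column names follow the table above.

```python
import pandas as pd

def wald_estimate(df, instrument, treatment, outcome):
    """Return (ITT effect, uptake difference, Wald ratio) for a binary instrument."""
    # Mean treatment uptake and mean outcome within each instrument arm
    by_z = df.groupby(instrument)[[treatment, outcome]].mean()
    itt = by_z.loc[1, outcome] - by_z.loc[0, outcome]
    uptake = by_z.loc[1, treatment] - by_z.loc[0, treatment]
    return itt, uptake, itt / uptake

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "easier_signup":     [0, 0, 0, 0, 1, 1, 1, 1],
    "became_member":     [0, 0, 0, 1, 0, 1, 1, 1],
    "days_visited_post": [5, 6, 7, 10, 6, 9, 10, 11],
})

itt, uptake, wald = wald_estimate(
    toy_df, "easier_signup", "became_member", "days_visited_post"
)
print(itt, uptake, wald)
```

Under the usual instrumental-variable assumptions (random assignment of the nudge, no direct effect of the nudge on the outcome, and no users who are discouraged from joining by it), the Wald ratio can be read as the effect of membership on users whose decision was changed by the nudge.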

# Overall impact

In [6]:
data_df.groupby('became_member').agg({'days_visited_post': 'mean'})

became_member,days_visited_post
0,8.323402
1,12.947956
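
The raw gap between members and non-members can be read directly off the group means above:

$$
E[Y|T=1] - E[Y|T=0] \approx 12.947956 - 8.323402 = 4.624554
$$

This difference mixes any effect of membership with pre-existing differences in engagement between users who chose to join and those who did not, so it should not be read as a causal estimate.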


# T-Learner

Estimated CATE:

$$
\hat{\tau}(x) = E[Y|X=x, T=1]-E[Y|X=x, T=0]=\hat{\mu}_1(x) - \hat{\mu}_0(x)
$$

where $\hat{\mu}_0=M_0(Y^0 \sim X^0)$ and $\hat{\mu}_1=M_1(Y^1 \sim X^1)$ are models fit with any machine learning algorithm on the control and treatment subsets of the training data, respectively.
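
A minimal sketch of the T-learner recipe on synthetic data may help make the formula concrete. It assumes a single covariate, a randomly assigned binary treatment, and a constant true effect of 3; plain linear regression stands in for the tuned learners, and all names (`mu_0`, `mu_1`, `tau_hat`) are illustrative rather than taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical synthetic data: one covariate, random treatment, true effect of 3
X = rng.normal(size=(n, 1))
T = rng.integers(0, 2, size=n)
Y = 2.0 + X[:, 0] + 3.0 * T + rng.normal(size=n)

# Fit one outcome model per arm (control and treated)
mu_0 = LinearRegression().fit(X[T == 0], Y[T == 0])
mu_1 = LinearRegression().fit(X[T == 1], Y[T == 1])

# CATE estimate: difference between the two models' predictions
tau_hat = mu_1.predict(X) - mu_0.predict(X)
print(tau_hat.mean())
```

Averaging `tau_hat` over a sample gives an estimate of the average treatment effect, which in this synthetic setup should land close to the true value of 3.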


We select the base learners with FLAML's `AutoML`, which searches over tree-based regressors (LightGBM, random forest, XGBoost, extra trees) and their hyperparameters within a fixed time budget, minimizing mean squared error.

In [7]:
NUM_ITERATIONS = 50

In [8]:
train_df, test_df = train_test_split(
    data_df, test_size=0.3, random_state=42
)

In [9]:
train_df.shape, test_df.shape

((70000, 13), (30000, 13))

In [15]:
target = 'days_visited_post'
treatment = 'became_member'

In [16]:
# Split data into treated and untreated
train_0_df = train_df.query("became_member == 0")
train_1_df = train_df.query("became_member == 1")

In [17]:
tlearners = []

In [18]:
automl_settings = {
    "time_budget": 120,
    "metric": 'mse',
    "task": 'regression'
}

In [19]:
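# Fit one outcome model per arm: index 0 = non-members (control), index 1 = members (treated)
for training_split in tqdm([train_0_df, train_1_df]):

    automl = AutoML()

    automl.fit(
        training_split.drop(columns=[target, treatment]),
        training_split[target],
        **automl_settings
    )

    # Keep the best underlying estimator found within the time budget
    tlearners.append(automl.model.estimator)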

  0%|          | 0/2 [00:00<?, ?it/s]

[flaml.automl.logger: 01-08 07:02:11] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 07:02:11] {1690} INFO - Evaluation method: holdout
[flaml.automl.logger: 01-08 07:02:11] {1788} INFO - Minimizing error metric: mse
[flaml.automl.logger: 01-08 07:02:11] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 07:02:11] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 07:02:11] {2344} INFO - Estimated sufficient time budget=17883s. Estimated necessary time budget=126s.
[flaml.automl.logger: 01-08 07:02:11] {2391} INFO -  at 0.7s,	estimator lgbm's best error=42.4393,	best estimator lgbm's best error=42.4393
[flaml.automl.logger: 01-08 07:02:11] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 07:02:11] {2391} INFO -  at 0.7s,	estimator lgbm's best error=42.4393,	best estimator lgbm's best error=42.4393
[flaml.automl.logger: 01-08 07:02:11] {221

 50%|█████     | 1/2 [02:00<02:00, 120.41s/it]

[flaml.automl.logger: 01-08 07:04:11] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 07:04:11] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-08 07:04:11] {1788} INFO - Minimizing error metric: mse
[flaml.automl.logger: 01-08 07:04:11] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 07:04:11] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 07:04:11] {2344} INFO - Estimated sufficient time budget=2570s. Estimated necessary time budget=18s.
[flaml.automl.logger: 01-08 07:04:11] {2391} INFO -  at 0.3s,	estimator lgbm's best error=46.0626,	best estimator lgbm's best error=46.0626
[flaml.automl.logger: 01-08 07:04:11] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 07:04:12] {2391} INFO -  at 0.6s,	estimator lgbm's best error=46.0626,	best estimator lgbm's best error=46.0626
[flaml.automl.logger: 01-08 07:04:12] {2218} INFO

100%|██████████| 2/2 [04:00<00:00, 120.11s/it]


In [20]:
test_df.sample(5)

Unnamed: 0,days_visited_exp_pre,days_visited_free_pre,days_visited_fs_pre,days_visited_hs_pre,days_visited_rs_pre,days_visited_vrs_pre,locale_en_US,revenue_pre,os_type_osx,os_type_windows,easier_signup,became_member,days_visited_post
29950,24,10,1,17,10,18,0,0.14,1,0,1,1,18
67185,1,18,12,7,4,17,0,0.04,0,0,0,0,17
95764,1,11,2,0,24,11,0,0.6,0,0,1,1,2
98076,22,1,5,18,18,8,0,0.24,0,1,0,0,11
68854,2,6,14,18,3,6,1,1.35,0,1,0,0,6


In [21]:
tlearners[0], tlearners[1]

(LGBMRegressor(colsample_bytree=0.9803618227602316,
               learning_rate=0.07638538669973302, max_bin=127,
               min_child_samples=3, n_estimators=1, n_jobs=-1, num_leaves=4,
               reg_alpha=0.005543226159822647, reg_lambda=0.029645253555435395,
               verbose=-1),
 LGBMRegressor(colsample_bytree=0.7096001901021348,
               learning_rate=0.2449651888970477, max_bin=127,
               min_child_samples=11, n_estimators=1, n_jobs=-1, num_leaves=4,
               reg_alpha=0.012306726057063571, reg_lambda=0.021777444718995027,
               verbose=-1))

In [22]:
# Estimated treatment effect per test user: predicted days visited as a member minus as a non-member
tlearner_te_1 = (
    tlearners[1].predict(test_df.drop(columns=[target, treatment])) -
    tlearners[0].predict(test_df.drop(columns=[target, treatment]))
)

In [23]:
tlearner_te_1.mean()

3.414076894490046

# X-Learner

In [24]:
# Reuse the two outcome models fitted for the T-learner
xlearners = tlearners

In [25]:
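# Imputed treatment effects:
# non-members: predicted outcome as a member minus the actual outcome
# members: actual outcome minus the predicted outcome as a non-member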
xlearner_te_0 = xlearners[1].predict(train_0_df.drop(columns=[target, treatment])) - train_0_df[target]
xlearner_te_1 = train_1_df[target] - xlearners[0].predict(train_1_df.drop(columns=[target, treatment]))
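
The two quantities computed above are the imputed treatment effects of the X-learner. In the notation of the T-learner section, and writing $D^0_i$ and $D^1_i$ (symbols introduced here for illustration only) for the imputed effects of control and treated users:

$$
D^0_i = \hat{\mu}_1(X^0_i) - Y^0_i, \qquad D^1_i = Y^1_i - \hat{\mu}_0(X^1_i)
$$

The textbook X-learner fits a separate model to each set of imputed effects and blends the two predictions with a propensity weight. The cells that follow take a simpler route and fit a single model on the stacked imputed effects.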

In [26]:
xlearner_combined = AutoML()

In [27]:
xlearner_combined.fit(
  # Stack the X variables for the treated and untreated users
  pd.concat([train_0_df, train_1_df]).drop(columns=[target, treatment]),
  # Stack the X-learner treatment effects for treated and untreated users
  pd.concat([xlearner_te_0, xlearner_te_1]),
  **automl_settings
)

[flaml.automl.logger: 01-08 07:06:48] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 07:06:48] {1690} INFO - Evaluation method: holdout
[flaml.automl.logger: 01-08 07:06:48] {1788} INFO - Minimizing error metric: mse
[flaml.automl.logger: 01-08 07:06:48] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 07:06:48] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 07:06:48] {2344} INFO - Estimated sufficient time budget=2084s. Estimated necessary time budget=15s.
[flaml.automl.logger: 01-08 07:06:48] {2391} INFO -  at 0.2s,	estimator lgbm's best error=34.9371,	best estimator lgbm's best error=34.9371
[flaml.automl.logger: 01-08 07:06:48] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 07:06:48] {2391} INFO -  at 0.3s,	estimator lgbm's best error=34.9371,	best estimator lgbm's best error=34.9371
[flaml.automl.logger: 01-08 07:06:48] {2218}

In [28]:
xlearner_simple_te = xlearner_combined.predict(test_df.drop(columns=[target, treatment]))

In [29]:
xlearner_simple_te.mean()

3.4181927488715234