<a href="https://colab.research.google.com/github/pgurazada/causal_inference/blob/master/case%20studies/revenue/tuned_metalearners_revenue.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q flaml

In [2]:
import pandas as pd

from sklearn.model_selection import train_test_split
from flaml import AutoML

from tqdm import tqdm

# Data

In [3]:
file_url = "https://msalicedatapublic.z5.web.core.windows.net/datasets/ROI/multi_attribution_sample.csv"
data_df = pd.read_csv(file_url)

In [4]:
data_df.sample(5)

Unnamed: 0,Global Flag,Major Flag,SMC Flag,Commercial Flag,IT Spend,Employee Count,PC Count,Size,Tech Support,Discount,Revenue
991,0,0,1,1,47989,152,179,255746,1,1,38496.75612
158,0,1,0,1,25135,55,45,131549,0,0,8435.037353
1170,1,0,1,1,20911,23,27,157897,1,0,18476.45679
215,0,0,1,0,1854,33,23,14154,0,0,2925.00813
1000,0,0,0,1,41467,31,29,264171,1,1,32723.72263


In [5]:
data_df = data_df.rename(columns={'Tech Support': "Tech_Support"})

In [6]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Global Flag      2000 non-null   int64  
 1   Major Flag       2000 non-null   int64  
 2   SMC Flag         2000 non-null   int64  
 3   Commercial Flag  2000 non-null   int64  
 4   IT Spend         2000 non-null   int64  
 5   Employee Count   2000 non-null   int64  
 6   PC Count         2000 non-null   int64  
 7   Size             2000 non-null   int64  
 8   Tech_Support     2000 non-null   int64  
 9   Discount         2000 non-null   int64  
 10  Revenue          2000 non-null   float64
dtypes: float64(1), int64(10)
memory usage: 172.0 KB


# Data

The data* covers 2,000 customers and comprises:

* Customer features: details about the industry, size, revenue, and technology profile of each customer.
* Interventions: information about which incentive was given to a customer.
* Outcome: the amount of product the customer bought in the year after the incentives were given.

Feature Name | Type | Details
:--- |:--- |:---
**Global Flag** | W | whether the customer has global offices
**Major Flag** | W | whether the customer is a large consumer in their industry (as opposed to SMC - Small Medium Corporation - or SMB - Small Medium Business)
**SMC Flag** | W | whether the customer is a Small Medium Corporation (SMC, as opposed to major and SMB)
**Commercial Flag** | W | whether the customer's business is commercial (as opposed to public sector)
**IT Spend** | W | \\$ spent on IT-related purchases
**Employee Count** | W | number of employees
**PC Count** | W | number of PCs used by the customer
**Size** | X | customer's size given by their yearly total revenue
**Tech Support** | T | whether the customer received tech support (binary)
**Discount** | T | whether the customer was given a discount (binary)
**Revenue** | Y | \\$ Revenue from customer given by the amount of software purchased

**To protect the privacy of the startup's customers, the data used in this scenario is synthetically generated, and the feature distributions don't correspond to real distributions. However, the features have preserved their names and meaning.*

# Overall impact

In [7]:
data_df.groupby(['Tech_Support', 'Discount']).agg({'Revenue': 'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue
Tech_Support,Discount,Unnamed: 2_level_1
0,0,6585.891792
0,1,12247.935953
1,0,15104.111534
1,1,26784.124649
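
As a rough point of comparison, the group means above can be turned into naive uplifts over the no-incentive group. This is a minimal sketch that hard-codes the four means printed above; the names `group_means` and `naive_uplift` are illustrative only. The differences are unadjusted, so they mix any causal effect with differences in customer mix across the groups.

```python
# Mean revenue per (Tech_Support, Discount) group, copied from the output above
group_means = {
    (0, 0): 6585.891792,
    (0, 1): 12247.935953,
    (1, 0): 15104.111534,
    (1, 1): 26784.124649,
}

baseline = group_means[(0, 0)]

# Unadjusted difference in mean revenue versus the no-incentive group
naive_uplift = {
    arm: mean - baseline
    for arm, mean in group_means.items()
    if arm != (0, 0)
}

for (tech_support, discount), uplift in sorted(naive_uplift.items()):
    print(f"Tech_Support={tech_support}, Discount={discount}: {uplift:,.2f}")
```

These raw gaps are a useful yardstick for the model-based estimates, which condition on customer features before differencing.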


We encode treatments using the following notation.


Tech_Support| Discount| Treatment encoding| Details
:--- |:--- |:--- |:---
0 | 0 | 0 | no incentive
1 | 0 | 1 | tech support only
0 | 1 | 2 | discount only
1 | 1 | 3 | both incentives
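
The encoding in the table can be written as a small function. This is only a sketch: `encode_treatment` and `labels` are hypothetical names, not part of the notebook's code, which works with the two binary columns directly.

```python
def encode_treatment(tech_support: int, discount: int) -> int:
    """Map the two binary incentive flags to a single code in {0, 1, 2, 3}."""
    if tech_support not in (0, 1) or discount not in (0, 1):
        raise ValueError("both flags must be 0 or 1")
    # Tech support contributes 1 and discount contributes 2, matching the table
    return tech_support + 2 * discount


labels = {
    0: "no incentive",
    1: "tech support only",
    2: "discount only",
    3: "both incentives",
}

for tech_support, discount in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    code = encode_treatment(tech_support, discount)
    print(tech_support, discount, code, labels[code])
```

Applied to the DataFrame, the same arithmetic would be `data_df["Tech_Support"] + 2 * data_df["Discount"]`.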


# T-Learner

Estimated CATE:

$$
\hat{\tau}(x) = E[Y|X=x, T=1]-E[Y|X=x, T=0]=\hat{\mu}_1(x) - \hat{\mu}_0(x)
$$

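where $\hat{\mu}_0=M_0(Y^0 \sim X^0)$ and $\hat{\mu}_1=M_1(Y^1 \sim X^1)$ are outcome models, fit with any machine learning algorithm on the control and treatment subsets of the training data, respectively. With four treatment arms, we fit one model per arm and compare each treated arm against the no-incentive arm.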
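
To make the formula concrete, here is a minimal T-learner sketch on toy data. It assumes a single feature and uses a hand-written least-squares line (`fit_line`, a hypothetical helper) in place of the tuned regressors used in this notebook. The toy outcomes are noise-free, with control outcome $1 + 2x$ and treated outcome $4 + 3x$, so the true effect is $3 + x$.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by least squares; return a predict function."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


# Toy data (assumed for illustration): separate control and treated samples
control_x = [0.0, 1.0, 2.0, 3.0]
control_y = [1 + 2 * x for x in control_x]
treated_x = [0.5, 1.5, 2.5, 3.5]
treated_y = [4 + 3 * x for x in treated_x]

# One outcome model per group, as in the definition of mu_0 and mu_1
mu_0 = fit_line(control_x, control_y)
mu_1 = fit_line(treated_x, treated_y)


def cate(x):
    """Estimated effect at x: difference of the two outcome models."""
    return mu_1(x) - mu_0(x)


print(cate(2.0))
```

Each model only ever sees its own group, which is what distinguishes the T-learner from approaches that fit a single model with the treatment as a feature.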


We select the base learners with FLAML's `AutoML`, which searches over tree-based regressors (LightGBM, random forest, XGBoost, and extra trees) and their hyperparameters within a fixed time budget.

In [8]:
NUM_ITERATIONS = 50

In [9]:
train_df, test_df = train_test_split(
    data_df, test_size=0.3, random_state=42
)

In [10]:
train_df.shape, test_df.shape

((1400, 11), (600, 11))

In [11]:
target = 'Revenue'
treatment1 = 'Tech_Support'
treatment2 = 'Discount'

In [12]:
# Split data into treated and untreated
train_0_df = train_df.query("Tech_Support == 0 & Discount == 0")
train_1_df = train_df.query("Tech_Support == 1 & Discount == 0")
train_2_df = train_df.query("Tech_Support == 0 & Discount == 1")
train_3_df = train_df.query("Tech_Support == 1 & Discount == 1")

In [13]:
tlearners = []

In [14]:
automl_settings = {
    "time_budget": 120,
    "metric": 'r2',
    "task": 'regression'
}

In [16]:
for training_split in tqdm([train_0_df, train_1_df, train_2_df, train_3_df]):

    automl = AutoML()

    automl.fit(
        training_split.drop(columns=[target, treatment1, treatment2]),
        training_split[target],
        **automl_settings
    )

    tlearners.append(automl.model.estimator)

  0%|          | 0/4 [00:00<?, ?it/s]

[flaml.automl.logger: 01-08 01:18:36] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 01:18:36] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-08 01:18:37] {1788} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 01-08 01:18:37] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 01:18:37] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 01:18:37] {2344} INFO - Estimated sufficient time budget=4710s. Estimated necessary time budget=33s.
[flaml.automl.logger: 01-08 01:18:37] {2391} INFO -  at 0.6s,	estimator lgbm's best error=0.7001,	best estimator lgbm's best error=0.7001
[flaml.automl.logger: 01-08 01:18:37] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 01:18:38] {2391} INFO -  at 1.1s,	estimator lgbm's best error=0.7001,	best estimator lgbm's best error=0.7001
[flaml.automl.logger: 01-08 01:18:38] {2218} INFO - 

 25%|██▌       | 1/4 [02:00<06:00, 120.07s/it]

[flaml.automl.logger: 01-08 01:20:37] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 01:20:37] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-08 01:20:37] {1788} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 01-08 01:20:37] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 01:20:37] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 01:20:37] {2344} INFO - Estimated sufficient time budget=483s. Estimated necessary time budget=3s.
[flaml.automl.logger: 01-08 01:20:37] {2391} INFO -  at 0.1s,	estimator lgbm's best error=0.6552,	best estimator lgbm's best error=0.6552
[flaml.automl.logger: 01-08 01:20:37] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 01:20:37] {2391} INFO -  at 0.1s,	estimator lgbm's best error=0.6552,	best estimator lgbm's best error=0.6552
[flaml.automl.logger: 01-08 01:20:37] {2218} INFO - it

 50%|█████     | 2/4 [04:00<04:00, 120.02s/it]

[flaml.automl.logger: 01-08 01:22:37] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 01:22:37] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-08 01:22:37] {1788} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 01-08 01:22:37] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 01:22:37] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 01:22:37] {2344} INFO - Estimated sufficient time budget=605s. Estimated necessary time budget=4s.
[flaml.automl.logger: 01-08 01:22:37] {2391} INFO -  at 0.1s,	estimator lgbm's best error=0.5650,	best estimator lgbm's best error=0.5650
[flaml.automl.logger: 01-08 01:22:37] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 01:22:37] {2391} INFO -  at 0.2s,	estimator lgbm's best error=0.5650,	best estimator lgbm's best error=0.5650
[flaml.automl.logger: 01-08 01:22:37] {2218} INFO - it

 75%|███████▌  | 3/4 [06:00<02:00, 120.02s/it]

[flaml.automl.logger: 01-08 01:24:37] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 01:24:37] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-08 01:24:37] {1788} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 01-08 01:24:37] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 01:24:37] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 01:24:37] {2344} INFO - Estimated sufficient time budget=712s. Estimated necessary time budget=5s.
[flaml.automl.logger: 01-08 01:24:37] {2391} INFO -  at 0.1s,	estimator lgbm's best error=0.5285,	best estimator lgbm's best error=0.5285
[flaml.automl.logger: 01-08 01:24:37] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 01:24:37] {2391} INFO -  at 0.2s,	estimator lgbm's best error=0.5285,	best estimator lgbm's best error=0.5285
[flaml.automl.logger: 01-08 01:24:37] {2218} INFO - it

100%|██████████| 4/4 [08:00<00:00, 120.04s/it]


In [17]:
test_df.sample(5)

Unnamed: 0,Global Flag,Major Flag,SMC Flag,Commercial Flag,IT Spend,Employee Count,PC Count,Size,Tech_Support,Discount,Revenue
1532,0,0,0,0,59529,24,15,181704,1,0,14601.54169
570,1,0,0,0,14352,102,88,45394,0,1,11339.28766
465,0,1,0,0,4127,10,8,12863,0,1,5745.206084
1782,0,1,0,0,18815,80,81,61704,0,0,7972.853529
289,0,1,0,1,5531,38,40,14398,0,1,7639.256995





In [18]:
tlearners[0], tlearners[1], tlearners[2], tlearners[3]

(XGBRegressor(base_score=None, booster=None, callbacks=[],
              colsample_bylevel=0.6902001610039813, colsample_bynode=None,
              colsample_bytree=0.5864653969007765, device=None,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=None,
              grow_policy='lossguide', importance_type=None,
              interaction_constraints=None, learning_rate=0.11204280907305136,
              max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=0, max_leaves=4,
              min_child_weight=2.9634783013379282, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=163,
              n_jobs=-1, num_parallel_tree=None, random_state=None, ...),
 ExtraTreesRegressor(max_features=0.88848737819076, max_leaf_nodes=85,
                     n_estimators=34, n_jobs=-1, random_state=12032022),
 LGBMRegressor(colsample_bytre

In [19]:
# Calculate the difference in revenue when only tech support is given
tlearner_te_1 = (
    tlearners[1].predict(test_df.drop(columns=[target, treatment1, treatment2])) -
    tlearners[0].predict(test_df.drop(columns=[target, treatment1, treatment2]))
)

In [20]:
tlearner_te_1.mean()

7383.874685778877

In [21]:
# Calculate the difference in revenue when only discount is given
tlearner_te_2 = (
    tlearners[2].predict(test_df.drop(columns=[target, treatment1, treatment2])) -
    tlearners[0].predict(test_df.drop(columns=[target, treatment1, treatment2]))
)

In [22]:
tlearner_te_2.mean()

5863.969301322161

In [23]:
# Calculate the difference in revenue when both incentives are given
tlearner_te_3 = (
    tlearners[3].predict(test_df.drop(columns=[target, treatment1, treatment2])) -
    tlearners[0].predict(test_df.drop(columns=[target, treatment1, treatment2]))
)

In [24]:
tlearner_te_3.mean()

13190.104125965105
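
The three cells above repeat one pattern: predict with a treated-arm model, predict with the no-incentive model, and average the difference. A possible way to factor that out is sketched below. `average_effect` and `OffsetModel` are hypothetical names; the stub model stands in for the fitted learners so the example runs on its own.

```python
class OffsetModel:
    """Hypothetical stand-in for a fitted regressor with a .predict method."""

    def __init__(self, offset):
        self.offset = offset

    def predict(self, rows):
        # Toy outcome model: a shared slope plus an arm-specific offset
        return [self.offset + 2.0 * x for x in rows]


def average_effect(models, rows, arm, baseline=0):
    """Mean predicted difference between a treatment arm and the baseline arm."""
    treated = models[arm].predict(rows)
    control = models[baseline].predict(rows)
    diffs = [t - c for t, c in zip(treated, control)]
    return sum(diffs) / len(diffs)


# One toy model per treatment encoding 0..3
toy_models = [OffsetModel(0.0), OffsetModel(5.0), OffsetModel(3.0), OffsetModel(9.0)]
toy_rows = [1.0, 2.0, 3.0]

effects = {arm: average_effect(toy_models, toy_rows, arm) for arm in (1, 2, 3)}
print(effects)
```

With the fitted learners, the call would look like `average_effect(tlearners, X_test, arm=1)`, where `X_test` is an assumed name for `test_df` with the target and treatment columns dropped.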

# X-Learner

In [25]:
xlearners = tlearners

In [26]:
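# Impute individual treatment effects for tech support only (arm 1) vs. no incentive (arm 0).
# Untreated customers: predicted revenue under treatment minus observed revenue
xlearner_te_0 = xlearners[1].predict(train_0_df.drop(columns=[target, treatment1, treatment2])) - train_0_df[target]
# Treated customers: observed revenue minus predicted revenue without treatment
xlearner_te_1 = train_1_df[target] - xlearners[0].predict(train_1_df.drop(columns=[target, treatment1, treatment2]))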

In [27]:
xlearner_combined = AutoML()

In [28]:
xlearner_combined.fit(
  # Stack the features of the untreated and treated customers
  pd.concat([train_0_df, train_1_df]).drop(columns=[target, treatment1, treatment2]),
  # Stack the imputed treatment effects in the same order
  pd.concat([xlearner_te_0, xlearner_te_1]),
  **automl_settings
)

[flaml.automl.logger: 01-08 02:11:40] {1679} INFO - task = regression
[flaml.automl.logger: 01-08 02:11:40] {1690} INFO - Evaluation method: cv
[flaml.automl.logger: 01-08 02:11:40] {1788} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 01-08 02:11:40] {1900} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth']
[flaml.automl.logger: 01-08 02:11:40] {2218} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 01-08 02:11:40] {2344} INFO - Estimated sufficient time budget=1776s. Estimated necessary time budget=13s.
[flaml.automl.logger: 01-08 02:11:40] {2391} INFO -  at 0.3s,	estimator lgbm's best error=0.6703,	best estimator lgbm's best error=0.6703
[flaml.automl.logger: 01-08 02:11:40] {2218} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 01-08 02:11:40] {2391} INFO -  at 0.4s,	estimator lgbm's best error=0.6703,	best estimator lgbm's best error=0.6703
[flaml.automl.logger: 01-08 02:11:40] {2218} INFO - 

In [29]:
xlearner_simple_te = xlearner_combined.predict(test_df.drop(columns=[target, treatment1, treatment2]))

In [30]:
xlearner_simple_te.mean()

7506.649169002118
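
The X-learner steps used in this section can be sketched end to end on toy data. As with any sketch, the details are assumptions: a single feature, a hand-written least-squares line (`fit_line`, a hypothetical helper) instead of the tuned regressors, and noise-free outcomes with a true effect of $3 + x$. Like the cells above, it fits one effect model on the pooled imputed effects; the usual X-learner formulation instead fits a separate effect model per group and blends the two with propensity scores.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by least squares; return a predict function."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


# Toy data (assumed for illustration): control outcome 1 + 2x, treated outcome 4 + 3x
control_x = [0.0, 1.0, 2.0, 3.0]
control_y = [1 + 2 * x for x in control_x]
treated_x = [0.5, 1.5, 2.5, 3.5]
treated_y = [4 + 3 * x for x in treated_x]

# Stage 1: outcome models per group (the T-learner models are reused here)
mu_0 = fit_line(control_x, control_y)
mu_1 = fit_line(treated_x, treated_y)

# Stage 2: imputed individual effects
# Control rows: predicted treated outcome minus observed outcome
d_control = [mu_1(x) - y for x, y in zip(control_x, control_y)]
# Treated rows: observed outcome minus predicted control outcome
d_treated = [y - mu_0(x) for x, y in zip(treated_x, treated_y)]

# Stage 3: one effect model fit on the pooled imputed effects
tau = fit_line(control_x + treated_x, d_control + d_treated)

print(round(tau(2.0), 6))
```

Because the second-stage model is trained on effects rather than outcomes, its predictions are treatment effects directly and need no differencing.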