<a href="https://colab.research.google.com/github/pgurazada/causal_inference/blob/master/media%20pricing/tuned_Tlearner_demand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import (
    HistGradientBoostingClassifier,
    HistGradientBoostingRegressor
)

# Data

In [2]:
file_url = "https://msalicedatapublic.z5.web.core.windows.net/datasets/Pricing/pricing_sample.csv"
data_df = pd.read_csv(file_url)

In [3]:
data_df.sample(5)

Unnamed: 0,account_age,age,avg_hours,days_visited,friends_count,has_membership,is_US,songs_purchased,income,price,demand
7979,2,44,5.038748,1,10,0,0,2.435949,0.49805,0.9,7.219374
6634,2,20,7.822185,0,12,0,1,2.754055,2.504828,1.0,20.911092
9457,3,38,5.529428,7,11,1,1,4.45485,0.937723,0.9,12.464714
4320,1,20,6.347718,0,11,0,1,4.377465,0.803267,0.8,9.573859
3619,4,36,4.64155,7,13,1,1,4.387687,0.911951,1.0,10.320775


The dataset* has 10,000 observations and 9 continuous and categorical variables that describe each user's characteristics and online behaviour history, such as age, log income, previous purchases, and previous online time per week.

We define the following variables:

Feature Name|Type|Details
:--- |:---|:---
**account_age** |W| user's account age
**age** |W|user's age
**avg_hours** |W| average hours per week the user was online in the past
**days_visited** |W| average number of days per week the user visited the website in the past
**friends_count** |W| number of friends the user is connected to in the account
**has_membership** |W| whether the user had a membership
**is_US** |W| whether the user accesses the website from the US
**songs_purchased** |W| average number of songs per week the user purchased in the past
**income** |X| user's income
**price** |T| the price the user was exposed to during the discount season (baseline price * small discount)
**demand** |Y| songs the user purchased during the discount season

**To protect the privacy of the company, we use simulated data as an example here. The data is synthetically generated and the feature distributions don't correspond to real distributions. However, the feature names and their meanings have been preserved.*
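
As a rough illustration of how the roles in the table map onto columns, the sketch below groups the column names by role and selects them from a small made-up frame. The dictionary name `ROLES` and the toy values are assumptions for illustration only; the cells in this notebook simply use every column other than `price` and `demand` as model inputs.

```python
import pandas as pd

# Hypothetical grouping of the columns by role (names follow the table above).
ROLES = {
    "W": ["account_age", "age", "avg_hours", "days_visited",
          "friends_count", "has_membership", "is_US", "songs_purchased"],
    "X": ["income"],
    "T": ["price"],
    "Y": ["demand"],
}

# Two made-up rows, only to demonstrate the column selection.
toy_df = pd.DataFrame({
    "account_age": [2, 3], "age": [44, 38], "avg_hours": [5.0, 5.5],
    "days_visited": [1, 7], "friends_count": [10, 11],
    "has_membership": [0, 1], "is_US": [0, 1],
    "songs_purchased": [2.4, 4.5], "income": [0.5, 0.9],
    "price": [0.9, 0.9], "demand": [7.2, 12.5],
})

features = toy_df[ROLES["W"] + ROLES["X"]]   # model inputs
treatment_col = toy_df[ROLES["T"][0]]        # price tier
outcome_col = toy_df[ROLES["Y"][0]]          # demand
print(features.shape)
```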

In [4]:
data_df.demand.describe()

count    10000.000000
mean        15.493496
std          6.568161
min          3.000000
25%          9.128451
50%         15.299043
75%         20.471066
max         27.923607
Name: demand, dtype: float64

In [6]:
data_df.price.value_counts()

1.0    4346
0.8    3089
0.9    2565
Name: price, dtype: int64

Overall impact: mean demand at each price tier

In [36]:
data_df.groupby('price').agg({'demand': 'mean'})

price,demand
0.8,12.704062
0.9,14.007674
1.0,18.353066
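
Note that these raw means rise with price. As a quick arithmetic check, the sketch below computes the unadjusted differences relative to the 0.8 tier from the group means printed above. These are plain comparisons of averages, not causal estimates: if users at different price tiers differ in their characteristics (an assumption worth checking in the data), the raw gaps mix the effect of price with those differences.

```python
import pandas as pd

# Group means copied from the output above.
mean_demand = pd.Series(
    {0.8: 12.704062, 0.9: 14.007674, 1.0: 18.353066}, name="demand"
)

# Unadjusted difference in mean demand relative to the 0.8 tier.
naive_diff = mean_demand - mean_demand.loc[0.8]
print(naive_diff.round(3))
```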


# T-Learner

Estimated CATE:

$$
\hat{\tau}(x) = E[Y|X=x, T=1]-E[Y|X=x, T=0]=\hat{\mu}_1(x) - \hat{\mu}_0(x)
$$

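where $\hat{\mu}_0=M_0(Y^0 \sim X^0)$ and $\hat{\mu}_1=M_1(Y^1 \sim X^1)$ are models, fit with any machine learning algorithm, on the control and treatment subsets of the training data respectively. Here the treatment takes three values (price tiers 0.8, 0.9 and 1.0), so we fit one model per tier and use the 0.8 tier as the baseline.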


We use gradient boosted regressors as base learners and tune their hyperparameters with a randomized search over combinations of tree depth and learning rate.

In [7]:
NUM_ITERATIONS = 5

In [8]:
train_df, test_df = train_test_split(
    data_df, test_size=0.3, random_state=42
)

In [9]:
train_df.shape, test_df.shape

((7000, 11), (3000, 11))

In [15]:
target = 'demand'
treatment = 'price'

In [16]:
# Split the training data by price tier (0.8 is the baseline)
train_0_df = train_df[train_df[treatment] == 0.8]
train_1_df = train_df[train_df[treatment] == 0.9]
train_2_df = train_df[train_df[treatment] == 1]

In [20]:
random_grid_params = {
    "max_depth": [6, 10, 12, 14, 16, 18, 20, 22],
    "learning_rate": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.03, 0.1, 0.2]
}

In [22]:
# Fit the models on each sample
regressor_random_grid_0 = RandomizedSearchCV(
    HistGradientBoostingRegressor(),
    random_grid_params,
    scoring="neg_mean_squared_error",
    n_iter=NUM_ITERATIONS,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

regressor_random_grid_0.fit(train_0_df.drop(columns=[target, treatment]), train_0_df[target])

Fitting 3 folds for each of 5 candidates, totalling 15 fits


In [23]:
tlearner_0 = regressor_random_grid_0.best_estimator_

In [24]:
tlearner_0

In [25]:
regressor_random_grid_1 = RandomizedSearchCV(
    HistGradientBoostingRegressor(),
    random_grid_params,
    scoring="neg_mean_squared_error",
    n_iter=NUM_ITERATIONS,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

regressor_random_grid_1.fit(train_1_df.drop(columns=[target, treatment]), train_1_df[target])

Fitting 3 folds for each of 5 candidates, totalling 15 fits


In [26]:
tlearner_1 = regressor_random_grid_1.best_estimator_

In [27]:
tlearner_1

In [28]:
regressor_random_grid_2 = RandomizedSearchCV(
    HistGradientBoostingRegressor(),
    random_grid_params,
    scoring="neg_mean_squared_error",
    n_iter=NUM_ITERATIONS,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

regressor_random_grid_2.fit(train_2_df.drop(columns=[target, treatment]), train_2_df[target])

Fitting 3 folds for each of 5 candidates, totalling 15 fits


In [29]:
tlearner_2 = regressor_random_grid_2.best_estimator_

In [30]:
tlearner_2

In [31]:
test_df.sample(5)

Unnamed: 0,account_age,age,avg_hours,days_visited,friends_count,has_membership,is_US,songs_purchased,income,price,demand
2110,5,26,4.80226,5,16,0,1,1.46417,1.505471,1.0,24.40113
1334,3,50,7.163916,3,9,1,1,5.464952,2.23847,1.0,20.581958
8537,3,45,5.948424,1,8,1,1,9.175136,0.737869,0.9,7.674212
3038,1,25,7.426659,6,12,1,1,10.713349,1.954194,1.0,25.713329
1569,2,59,7.909198,1,10,1,1,6.793841,0.670458,0.8,10.354599


In [32]:
# Calculate the difference in demand for price tier 0.8 to price tier 0.9
tlearner_te_tier1 = (
    tlearner_1.predict(test_df.drop(columns=[target, treatment])) -
    tlearner_0.predict(test_df.drop(columns=[target, treatment]))
)

In [33]:
tlearner_te_tier1.mean()

-0.9784107252867646

In [34]:
# Calculate the difference in demand for price tier 0.8 to price tier 1.0
tlearner_te_tier2 = (
    tlearner_2.predict(test_df.drop(columns=[target, treatment])) -
    tlearner_0.predict(test_df.drop(columns=[target, treatment]))
)

In [35]:
tlearner_te_tier2.mean()

-1.98819993053243

As price increases from 0.8 to 0.9, the T-learner estimates that demand falls by about 0.98 songs on average; as price increases from 0.8 to 1.0, demand falls by about 1.99 songs on average.
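
The two averages above can be put on a common scale by dividing each by the size of the price change. The sketch below does this arithmetic with the values printed earlier; the variable names are assumptions for illustration.

```python
# Average effects relative to the 0.8 tier, copied from the outputs above.
baseline_price = 0.8
effects = {0.9: -0.9784107252867646, 1.0: -1.98819993053243}

# Implied change in demand per unit change in price.
slopes = {
    price: effect / (price - baseline_price)
    for price, effect in effects.items()
}

for price, slope in slopes.items():
    print(f"{baseline_price} -> {price}: {slope:.2f} songs per unit of price")
```

Both implied slopes come out a little under 10 songs per unit of price, which suggests the estimated response is close to linear across these three tiers.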