<a href="https://colab.research.google.com/github/pgurazada/causal_inference/blob/master/case%20studies/media%20pricing/tuned_Tlearner_demand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import HistGradientBoostingRegressor

# Data

In [2]:
file_url = "https://msalicedatapublic.z5.web.core.windows.net/datasets/Pricing/pricing_sample.csv"
data_df = pd.read_csv(file_url)

In [3]:
data_df.sample(5)

Unnamed: 0,account_age,age,avg_hours,days_visited,friends_count,has_membership,is_US,songs_purchased,income,price,demand
4157,2,22,5.84279,6,4,0,1,7.360061,1.092814,1.0,24.921395
5354,5,24,6.092511,1,12,0,1,2.822009,1.607992,0.9,20.346255
16,5,58,7.534408,4,9,0,0,5.055373,1.329912,1.0,20.767204
5424,3,29,5.381692,2,7,0,1,7.145275,0.675496,0.8,9.090846
1491,1,56,6.946954,0,11,1,1,5.560778,2.203977,1.0,20.473477


The dataset* has 10,000 observations and 9 continuous and categorical variables that describe each user's characteristics and online behaviour history, such as age, log income, previous purchases, and previous time online per week.

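We define the following variables, where type W marks a control (potential confounder), X the feature along which the effect may vary, T the treatment, and Y the outcome: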

Feature Name|Type|Details
:--- |:---|:---
**account_age** |W| user's account age
**age** |W| user's age
**avg_hours** |W| average hours the user was online per week in the past
**days_visited** |W| average number of days the user visited the website per week in the past
**friends_count** |W| number of friends the user is connected to in the account
**has_membership** |W| whether the user had a membership
**is_US** |W| whether the user accesses the website from the US
**songs_purchased** |W| average number of songs the user purchased per week in the past
**income** |X| user's income
**price** |T| the price the user was exposed to during the discount season (baseline price * small discount)
**demand** |Y| songs the user purchased during the discount season

**To protect the privacy of the company, we use simulated data as an example here. The data is synthetically generated and the feature distributions don't correspond to real distributions. However, the feature names and their meanings have been preserved.*

In [4]:
data_df.demand.describe()

count    10000.000000
mean        15.493496
std          6.568161
min          3.000000
25%          9.128451
50%         15.299043
75%         20.471066
max         27.923607
Name: demand, dtype: float64

In [5]:
data_df.price.value_counts()

1.0    4346
0.8    3089
0.9    2565
Name: price, dtype: int64

Overall impact

In [6]:
data_df.groupby('price').agg({'demand': 'mean'})

price,demand
0.8,12.704062
0.9,14.007674
1.0,18.353066
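
The group means above can be turned into naive between-tier differences. The sketch below hard-codes the three means from the table and subtracts the 0.8 tier from the others. It adjusts for no covariates, so the results describe an association in the raw data rather than a causal effect; note that raw mean demand rises with price.

```python
# Mean demand per price tier, copied from the groupby output above
mean_demand = {0.8: 12.704062, 0.9: 14.007674, 1.0: 18.353066}

# Use the lowest price tier as the baseline
baseline = mean_demand[0.8]

# Naive (unadjusted) difference in mean demand relative to the 0.8 tier
naive_diff = {
    price: round(demand - baseline, 6)
    for price, demand in mean_demand.items()
    if price != 0.8
}
print(naive_diff)
```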


# T-Learner

Estimated CATE:

$$
\hat{\tau}(x) = E[Y|X=x, T=1]-E[Y|X=x, T=0]=\hat{\mu}_1(x) - \hat{\mu}_0(x)
$$

where $\hat{\mu}_0=M_0(Y^0 \sim X^0)$ and $\hat{\mu}_1=M_1(Y^1 \sim X^1)$ are models from any machine learning algorithm, fitted on the control and treatment subsets of the training data respectively.


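We use gradient boosted regressors as base learners and tune their hyperparameters with a randomized search over combinations of tree depth and learning rate. Because price takes three values here, we fit one model per price tier and treat the 0.8 tier as the baseline.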

In [7]:
NUM_ITERATIONS = 30

In [8]:
train_df, test_df = train_test_split(
    data_df, test_size=0.3, random_state=42
)

In [9]:
train_df.shape, test_df.shape

((7000, 11), (3000, 11))

In [10]:
target = 'demand'
treatment = 'price'

In [11]:
# Split the training data by price tier (0.8 is the baseline tier)
train_0_df = train_df[train_df[treatment] == 0.8]
train_1_df = train_df[train_df[treatment] == 0.9]
train_2_df = train_df[train_df[treatment] == 1]
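
The split above compares floating-point prices with `==`, which works when the CSV values parse to exactly `0.8`, `0.9`, and `1.0`. As a hedged alternative, the sketch below uses `np.isclose` and checks that the tiers partition the data. The `toy_df` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (assumption: made-up values)
toy_df = pd.DataFrame({
    "price": [0.8, 0.9, 1.0, 0.8, 1.0],
    "demand": [10.0, 12.0, 15.0, 11.0, 16.0],
})

tiers = [0.8, 0.9, 1.0]

# Select rows per tier with a tolerance-based comparison instead of ==
subsets = {tier: toy_df[np.isclose(toy_df["price"], tier)] for tier in tiers}

# Sanity check: every row belongs to exactly one tier
assert sum(len(subset) for subset in subsets.values()) == len(toy_df)
print({tier: len(subset) for tier, subset in subsets.items()})
```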

In [12]:
random_grid_params = {
    "max_depth": [6, 10, 12, 14, 16, 18, 20, 22],
    "learning_rate": [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.03, 0.1, 0.2]
}

In [13]:
# Tune and fit a model on the 0.8 price tier
regressor_random_grid_0 = RandomizedSearchCV(
    HistGradientBoostingRegressor(),
    random_grid_params,
    scoring="neg_mean_squared_error",
    n_iter=NUM_ITERATIONS,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

regressor_random_grid_0.fit(train_0_df.drop(columns=[target, treatment]), train_0_df[target])

Fitting 3 folds for each of 30 candidates, totalling 90 fits


In [14]:
tlearner_0 = regressor_random_grid_0.best_estimator_

In [15]:
tlearner_0

In [16]:
regressor_random_grid_1 = RandomizedSearchCV(
    HistGradientBoostingRegressor(),
    random_grid_params,
    scoring="neg_mean_squared_error",
    n_iter=NUM_ITERATIONS,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

regressor_random_grid_1.fit(train_1_df.drop(columns=[target, treatment]), train_1_df[target])

Fitting 3 folds for each of 30 candidates, totalling 90 fits


In [17]:
tlearner_1 = regressor_random_grid_1.best_estimator_

In [18]:
tlearner_1

In [19]:
regressor_random_grid_2 = RandomizedSearchCV(
    HistGradientBoostingRegressor(),
    random_grid_params,
    scoring="neg_mean_squared_error",
    n_iter=NUM_ITERATIONS,
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

regressor_random_grid_2.fit(train_2_df.drop(columns=[target, treatment]), train_2_df[target])

Fitting 3 folds for each of 30 candidates, totalling 90 fits


In [20]:
tlearner_2 = regressor_random_grid_2.best_estimator_

In [21]:
tlearner_2

In [22]:
test_df.sample(5)

Unnamed: 0,account_age,age,avg_hours,days_visited,friends_count,has_membership,is_US,songs_purchased,income,price,demand
8403,4,32,4.949337,6,7,0,1,4.529563,1.328372,1.0,24.474668
5123,2,60,5.776426,5,8,0,0,8.779147,1.069878,0.8,25.488213
1496,5,57,3.477391,4,11,0,1,7.118941,1.37489,1.0,18.738695
6309,2,45,2.804208,4,10,1,1,5.83068,1.301141,1.0,18.402104
5278,1,53,4.296996,7,10,1,1,0.0,0.387042,0.8,13.548498


In [23]:
# Calculate the difference in demand for price tier 0.8 to price tier 0.9
tlearner_te_tier1 = (
    tlearner_1.predict(test_df.drop(columns=[target, treatment])) -
    tlearner_0.predict(test_df.drop(columns=[target, treatment]))
)

In [24]:
tlearner_te_tier1.mean()

-0.9801041960832857

In [25]:
# Calculate the difference in demand for price tier 0.8 to price tier 1.0
tlearner_te_tier2 = (
    tlearner_2.predict(test_df.drop(columns=[target, treatment])) -
    tlearner_0.predict(test_df.drop(columns=[target, treatment]))
)

In [26]:
tlearner_te_tier2.mean()

-1.9905121919802387

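As price increases from 0.8 to 0.9, the T-learner estimates that demand falls by about 0.98 on average over the test set; as price increases from 0.8 to 1.0, it estimates that demand falls by about 1.99. These estimates have the opposite sign from the raw group means, which rise with price.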