# Causal ML for Marketing Campaigns

A retailer wants to make its discount marketing campaigns more effective. It distributes promotions across several channels and wants to refine its strategy using data on customer demographics, campaign and coupon details, product information, and past transactions. The original dataset is available on [Kaggle](https://www.kaggle.com/datasets/vasudeva009/predicting-coupon-redemption), and the sample used here comes from [this source](https://doi.org/10.7910/DVN/2P8AY0).

**Data dictionary:**

- dailyspending: daily spending of the customer
- coupons: whether the customer received a coupon
- coupons_preperiod: whether the customer received a coupon in the previous period
- dailyspending_preperiod: daily spending of the customer in the previous period
- income_bracket: income bracket from 1 to 12
- age_range: age range from 1 to 6
- married: whether the customer is married
- rented: whether the customer rents a house
- family_size: number of people in the customer's household

In [130]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model, ensemble
import doubleml as dml

import warnings
warnings.simplefilter('ignore')

## Check the data

In [131]:
# Read data
df = pd.read_csv('data/coupon.csv')
df.head()

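| | dailyspending | coupons | coupons_preperiod | dailyspending_preperiod | income_bracket | age_range | married | rented | family_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 411.624 | 0 | 0 | 0.0 | 4 | 6 | 1 | 0 | 2 |
| 1 | 253.574444 | 0 | 0 | 411.624 | 4 | 6 | 1 | 0 | 2 |
| 2 | 261.673684 | 1 | 0 | 253.574444 | 4 | 6 | 1 | 0 | 2 |
| 3 | 0.0 | 1 | 1 | 0.0 | 5 | 4 | 1 | 0 | 2 |
| 4 | 0.0 | 1 | 1 | 0.0 | 5 | 4 | 1 | 0 | 2 |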


In [132]:
# Descriptive Statistics
df.describe().round(2)

| | dailyspending | coupons | coupons_preperiod | dailyspending_preperiod | income_bracket | age_range | married | rented | family_size |
|---|---|---|---|---|---|---|---|---|---|
| count | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 | 1293.0 |
| mean | 291.45 | 0.24 | 0.18 | 269.47 | 5.01 | 3.57 | 0.74 | 0.08 | 2.54 |
| std | 310.26 | 0.43 | 0.39 | 380.83 | 2.35 | 1.3 | 0.44 | 0.27 | 1.19 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 56.09 | 0.0 | 0.0 | 0.0 | 4.0 | 3.0 | 0.0 | 0.0 | 2.0 |
| 50% | 210.57 | 0.0 | 0.0 | 123.42 | 5.0 | 4.0 | 1.0 | 0.0 | 2.0 |
| 75% | 427.36 | 0.0 | 0.0 | 395.34 | 6.0 | 4.0 | 1.0 | 0.0 | 3.0 |
| max | 1975.75 | 1.0 | 1.0 | 3565.34 | 12.0 | 6.0 | 1.0 | 1.0 | 5.0 |


## Regression

What is the effect of sending coupons on the daily spending of the customer?

$$
\text{dailyspending} = \beta_0 + \beta_1 \text{coupons} + e
$$

In [133]:
# OLS without controls, heteroskedasticity-robust (HC1) standard errors
model_base = 'dailyspending ~ coupons'
base = smf.ols(model_base, data=df)
results_ols = base.fit(cov_type='HC1')
results_ols.summary().tables[1]

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 268.7191 | 9.405 | 28.572 | 0.000 | 250.285 | 287.153 |
| coupons | 95.4337 | 21.778 | 4.382 | 0.000 | 52.750 | 138.117 |

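With a single binary regressor, the OLS slope is the difference between the two group means and the intercept is the mean of the untreated group. Read that way, the table says customers without coupons spend about 268.72 per day on average, and customers with coupons about 268.72 + 95.43 = 364.15. The sketch below checks the identity on a small toy sample; the numbers are hypothetical and do not come from `coupon.csv`.

```python
from statistics import mean

# Hypothetical toy data (not taken from the coupon dataset)
coupons = [0, 0, 0, 1, 1, 0, 1, 0]
spending = [250.0, 300.0, 280.0, 390.0, 350.0, 260.0, 370.0, 310.0]

# Difference in group means
treated = [y for d, y in zip(coupons, spending) if d == 1]
control = [y for d, y in zip(coupons, spending) if d == 0]
diff_in_means = mean(treated) - mean(control)

# OLS slope of spending on coupons: cov(d, y) / var(d)
d_bar, y_bar = mean(coupons), mean(spending)
cov_dy = sum((d - d_bar) * (y - y_bar) for d, y in zip(coupons, spending))
var_d = sum((d - d_bar) ** 2 for d in coupons)
beta_1 = cov_dy / var_d
beta_0 = y_bar - beta_1 * d_bar  # OLS intercept

print(round(diff_in_means, 2), round(beta_1, 2), round(beta_0, 2))
# 90.0 90.0 280.0
```

This raw comparison only has a causal reading if coupons were assigned independently of everything else that drives spending, which is why the next steps add controls.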

Let's add pre-treatment covariates to the model:

$$
\text{dailyspending} = \beta_0 + \beta_1 \text{coupons} + \beta_2' X + e
$$

In [134]:
# OLS with additive controls
X = df.drop(columns=['dailyspending'])
X = sm.add_constant(X)
Y = df['dailyspending']
results_ols_add = sm.OLS(Y, X).fit(cov_type='HC1')
results_ols_add.summary().tables[1]

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 170.7605 | 40.484 | 4.218 | 0.000 | 91.414 | 250.107 |
| coupons | 73.6412 | 23.234 | 3.169 | 0.002 | 28.103 | 119.180 |
| coupons_preperiod | -22.6512 | 24.574 | -0.922 | 0.357 | -70.815 | 25.513 |
| dailyspending_preperiod | 0.1200 | 0.029 | 4.127 | 0.000 | 0.063 | 0.177 |
| income_bracket | 19.6230 | 4.227 | 4.643 | 0.000 | 11.339 | 27.907 |
| age_range | -16.0827 | 6.030 | -2.667 | 0.008 | -27.902 | -4.263 |
| married | 40.0459 | 19.473 | 2.056 | 0.040 | 1.879 | 78.213 |
| rented | 11.3779 | 27.913 | 0.408 | 0.684 | -43.331 | 66.087 |
| family_size | 1.4065 | 9.218 | 0.153 | 0.879 | -16.660 | 19.473 |

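The coefficient on `coupons` in the regression with additive controls can also be obtained in two steps: remove the part of the outcome and of the treatment that the controls explain linearly, then regress residual on residual (the Frisch-Waugh-Lovell result). The sketch below illustrates this with synthetic data; the data-generating process and the helper `ols` are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical synthetic data: covariates drive both the treatment and the outcome
X = rng.normal(size=(n, 2))
d = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
y = 50.0 * d + 30.0 * X[:, 0] - 10.0 * X[:, 1] + rng.normal(scale=20.0, size=n)

def ols(design, target):
    """Least-squares coefficients of target on design."""
    return np.linalg.lstsq(design, target, rcond=None)[0]

const = np.ones((n, 1))
controls = np.hstack([const, X])

# (1) Full regression of y on a constant, d and X; index 1 is the coefficient on d
beta_full = ols(np.hstack([const, d[:, None], X]), y)[1]

# (2) Partial the controls out of y and d, then regress residual on residual
y_res = y - controls @ ols(controls, y)
d_res = d - controls @ ols(controls, d)
beta_fwl = (d_res @ y_res) / (d_res @ d_res)

print(beta_full, beta_fwl)  # the two estimates coincide up to floating-point error
```

The same residual-on-residual logic reappears in the partialling-out score of double machine learning, with the linear projections replaced by flexible learners.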

In [135]:
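# OLS with interacted controls
X = df.drop(columns=['dailyspending', 'coupons'])
X = X - X.mean(axis=0)  # center covariates: 'coupons' then measures the effect at the covariate means
X[['coupons*' + col for col in X.columns]] = df[['coupons']].values * X  # coupon-covariate interactions
X['coupons'] = df['coupons']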
X = sm.add_constant(X)
Y = df['dailyspending']
results_ols_int = sm.OLS(Y, X).fit(cov_type='HC1')
results_ols_int.summary().tables[1]

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 273.4269 | 9.944 | 27.497 | 0.000 | 253.937 | 292.917 |
| coupons_preperiod | -24.1186 | 30.235 | -0.798 | 0.425 | -83.378 | 35.141 |
| dailyspending_preperiod | 0.0800 | 0.035 | 2.283 | 0.022 | 0.011 | 0.149 |
| income_bracket | 22.6714 | 5.047 | 4.492 | 0.000 | 12.780 | 32.562 |
| age_range | -11.3180 | 6.471 | -1.749 | 0.080 | -24.000 | 1.364 |
| married | 16.9742 | 22.445 | 0.756 | 0.450 | -27.018 | 60.966 |
| rented | 9.7466 | 34.742 | 0.281 | 0.779 | -58.347 | 77.841 |
| family_size | 12.4076 | 11.451 | 1.084 | 0.279 | -10.036 | 34.851 |
| coupons*coupons_preperiod | 6.4552 | 47.905 | 0.135 | 0.893 | -87.438 | 100.348 |

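The interacted specification subtracts the covariate means before building the interactions. Without that step the coefficient on `coupons` would be the effect for a customer whose covariates are all zero; with it, the coefficient is the effect at the average covariate values. A minimal sketch with one synthetic covariate shows the difference. The data and the helper `coef_on_d` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Hypothetical synthetic data with a heterogeneous effect tau(x) = 40 + 10 * x
x = rng.normal(loc=3.0, scale=1.0, size=n)
d = rng.integers(0, 2, size=n).astype(float)
y = 100.0 + 5.0 * x + (40.0 + 10.0 * x) * d + rng.normal(scale=5.0, size=n)

def coef_on_d(x_used):
    """Coefficient on d in the regression y ~ 1 + d + x + d*x."""
    design = np.column_stack([np.ones(n), d, x_used, d * x_used])
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

beta_raw = coef_on_d(x)                  # effect at x = 0, close to 40
beta_centered = coef_on_d(x - x.mean())  # effect at the sample mean of x, close to 70

print(round(beta_raw, 1), round(beta_centered, 1))
```

Because the average of `x` is about 3, the centered coefficient lands near 40 + 10 * 3 = 70, which is the average effect in this simulated population.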

## Double Machine Learning

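The regressions above assume that the covariates enter the outcome equation linearly. Double machine learning relaxes this assumption: the partially linear model below keeps a constant coupon effect $\beta_1$ but lets machine learning models estimate the nuisance functions $g(X)$ and $m(X)$ flexibly.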

$$
\begin{gathered}
\text{dailyspending} = \beta_1 \text{coupons} + g(X) + u \\
\text{coupons} = m(X) + v
\end{gathered}
$$

In [136]:
# DML with linear and logistic regression
splits = 10
covariates = list(df.drop(['dailyspending', 'coupons'], axis = 1).columns)
dml_data = dml.DoubleMLData(df, y_col='dailyspending', d_cols='coupons', x_cols=covariates)
ml_l = linear_model.LinearRegression()    # outcome model E[dailyspending | X]
ml_m = linear_model.LogisticRegression()  # treatment model P(coupons = 1 | X)
results_dml_linear = dml.DoubleMLPLR(dml_data, ml_l, ml_m, n_folds=splits).fit()
print(results_dml_linear)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: LinearRegression()
Learner ml_m: LogisticRegression()
Out-of-sample Performance:
Learner ml_l RMSE: [[301.84048989]]
Learner ml_m RMSE: [[0.36392079]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|      2.5 %      97.5 %
coupons  76.961419  23.148433  3.324692  0.000885  31.591325  122.331514

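To make the "partialling out" score less of a black box, here is a hand-rolled sketch of the same idea: predict the outcome and the treatment from the covariates with cross-fitting, then regress the outcome residual on the treatment residual. It runs on synthetic data and, as a simplifying assumption, uses least-squares fits for both nuisance functions (a linear probability model instead of the logistic regression used in the cell above). The helper `fit_predict` is hypothetical and this is not DoubleML's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k_folds = 1000, 5
# Hypothetical synthetic data with a known treatment effect of 2.0
X = rng.normal(size=(n, 3))
d = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(float)
y = 2.0 * d + X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

def fit_predict(X_train, t_train, X_test):
    """Fit a linear model with intercept on the training rows, predict the test rows."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef = np.linalg.lstsq(A_train, t_train, rcond=None)[0]
    return A_test @ coef

# Cross-fitting: predictions for each fold come from models fit on the other folds
folds = np.array_split(rng.permutation(n), k_folds)
y_hat, d_hat = np.empty(n), np.empty(n)
for test_idx in folds:
    train_mask = np.ones(n, dtype=bool)
    train_mask[test_idx] = False
    y_hat[test_idx] = fit_predict(X[train_mask], y[train_mask], X[test_idx])
    d_hat[test_idx] = fit_predict(X[train_mask], d[train_mask], X[test_idx])

u, v = y - y_hat, d - d_hat   # outcome and treatment residuals
theta = (v @ u) / (v @ v)     # residual-on-residual estimate of the effect
psi = (u - theta * v) * v     # score evaluated at the estimate
se = np.sqrt(np.mean(psi**2) / np.mean(v**2) ** 2 / n)

print(f"theta = {theta:.3f}, se = {se:.3f}")
```

Swapping `fit_predict` for a lasso or a random forest changes only the nuisance predictions; the final residual-on-residual step stays the same.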

In [137]:
# DML with lasso
cv = 10
ml_l = linear_model.LassoCV(cv=cv)  # outcome model
ml_m = linear_model.LogisticRegressionCV(penalty='l1', solver='saga', cv=cv)  # treatment model
results_dml_lasso = dml.DoubleMLPLR(dml_data, ml_l, ml_m, n_folds=splits).fit()
print(results_dml_lasso)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: LassoCV(cv=10)
Learner ml_m: LogisticRegressionCV(cv=10, penalty='l1', solver='saga')
Out-of-sample Performance:
Learner ml_l RMSE: [[301.95858742]]
Learner ml_m RMSE: [[0.48573464]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1

------------------ Fit summary       ------------------
             coef  std err         t     P>|t|      2.5 %     97.5 %
coupons  57.21486  17.2296  3.320731  0.000898  23.445464  90.984256

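The out-of-sample RMSE of the treatment learner deserves a quick sanity check. For a binary treatment, always predicting the treated share $p$ gives an RMSE of $\sqrt{p(1-p)}$, so a useful propensity model should do better than that. The sketch below compares this baseline with the two `ml_m` RMSE values printed so far, using the rounded share 0.24 from the descriptive statistics.

```python
from math import sqrt

treated_share = 0.24  # mean of `coupons` in the descriptive statistics (rounded)
baseline_rmse = sqrt(treated_share * (1 - treated_share))

# ml_m RMSE values as printed in the two fit summaries above
reported = {"logistic": 0.36392079, "lasso_logistic": 0.48573464}
worse_than_baseline = [name for name, rmse in reported.items() if rmse > baseline_rmse]

print(round(baseline_rmse, 3), worse_than_baseline)
# 0.427 ['lasso_logistic']
```

The L1-penalized logistic regression does worse than the constant prediction. One possible explanation, which is an assumption and not verified here, is that the `saga` solver did not converge on the unscaled covariates and the corresponding warning was hidden by `warnings.simplefilter('ignore')`. The `DML_lasso` estimate should therefore be read with some caution.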

In [138]:
# DML with random forest
ml_l = ensemble.RandomForestRegressor(max_features='sqrt')  # outcome model
ml_m = ensemble.RandomForestClassifier()  # treatment model
results_dml_rf = dml.DoubleMLPLR(dml_data, ml_l, ml_m, n_folds=splits).fit()
print(results_dml_rf)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: partialling out

------------------ Machine learner   ------------------
Learner ml_l: RandomForestRegressor(max_features='sqrt')
Learner ml_m: RandomForestClassifier()
Out-of-sample Performance:
Learner ml_l RMSE: [[314.39476267]]
Learner ml_m RMSE: [[0.39721959]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|     2.5 %     97.5 %
coupons  46.690073  23.750186  1.965882  0.049312  0.140565  93.239582


We can also use a nonlinear interactive regression model (IRM), in which the outcome depends on the treatment and the covariates jointly, so the coupon effect may vary with $X$:

$$
\begin{gathered}
\text{dailyspending} = g(\text{coupons}, X) + u \\
\text{coupons} = m(X) + v
\end{gathered}
$$

In [139]:
# DML with interactive regression model (IRM) and lasso learners
ml_g = linear_model.LassoCV(cv=cv)  # outcome model, fit separately for treated and untreated
ml_m = linear_model.LogisticRegressionCV(penalty='l1', solver='saga', cv=cv)  # propensity model
results_dml_int = dml.DoubleMLIRM(dml_data, ml_g, ml_m, n_folds=splits,
                                  normalize_ipw=True, trimming_rule='truncate',
                                  trimming_threshold=0.01).fit()
print(results_dml_int)


------------------ Data summary      ------------------
Outcome variable: dailyspending
Treatment variable(s): ['coupons']
Covariates: ['coupons_preperiod', 'dailyspending_preperiod', 'income_bracket', 'age_range', 'married', 'rented', 'family_size']
Instrument variable(s): None
No. Observations: 1293

------------------ Score & algorithm ------------------
Score function: ATE

------------------ Machine learner   ------------------
Learner ml_g: LassoCV(cv=10)
Learner ml_m: LogisticRegressionCV(cv=10, penalty='l1', solver='saga')
Out-of-sample Performance:
Learner ml_g0 RMSE: [[291.01035026]]
Learner ml_g1 RMSE: [[332.17517523]]
Learner ml_m RMSE: [[0.48515701]]

------------------ Resampling        ------------------
No. folds: 10
No. repeated sample splits: 1

------------------ Fit summary       ------------------
              coef    std err         t     P>|t|      2.5 %      97.5 %
coupons  72.993101  21.804408  3.347631  0.000815  30.257247  115.728955

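The ATE score of the interactive model is a doubly robust (AIPW) score: it combines outcome predictions for both treatment states with an inverse-propensity correction of the prediction errors. The sketch below computes that score by hand on synthetic data. As simplifying assumptions it uses least-squares fits for all nuisance functions, skips cross-fitting and weight normalization, and truncates propensities at 0.01 to mirror `trimming_threshold=0.01`. The helper `linear_fit` is hypothetical and this is not DoubleML's code.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
# Hypothetical synthetic data: effect 1 + x, so the ATE is 1 because E[x] = 0
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-0.5 * x))
d = (rng.uniform(size=n) < p_true).astype(float)
y = x + (1.0 + x) * d + rng.normal(size=n)

A = np.column_stack([np.ones(n), x])

def linear_fit(rows, target):
    """Least-squares fit on the selected rows, predictions for all rows."""
    coef = np.linalg.lstsq(A[rows], target[rows], rcond=None)[0]
    return A @ coef

g0 = linear_fit(d == 0, y)                 # outcome model for the untreated
g1 = linear_fit(d == 1, y)                 # outcome model for the treated
m = linear_fit(np.ones(n, dtype=bool), d)  # linear probability model for the propensity
m = np.clip(m, 0.01, 0.99)                 # truncate extreme propensities

# Doubly robust score: model-based contrast plus weighted residual corrections
psi = g1 - g0 + d * (y - g1) / m - (1 - d) * (y - g0) / (1 - m)
ate = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)

print(f"ATE = {ate:.3f}, se = {se:.3f}")
```

Here the outcome models are correctly specified while the propensity model is only a rough approximation, and the estimate still lands near the true value of 1. That tolerance to one misspecified nuisance model is what "doubly robust" refers to.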

In [140]:
# Group average treatment effects (GATEs) by age range
groups = df[['age_range']].astype('str')
gate_age = results_dml_int.gate(groups=groups)
print(gate_age)


------------------ Fit summary ------------------
               coef     std err         t     P>|t|      [0.025      0.975]
Group_1   72.655940  104.062732  0.698194  0.485182 -131.495259  276.807140
Group_2   76.196123   53.832290  1.415435  0.157183  -29.412545  181.804790
Group_3   82.992428   40.899639  2.029173  0.042646    2.755152  163.229705
Group_4   48.426216   40.092860  1.207851  0.227326  -30.228316  127.080748
Group_5   74.663688   68.382612  1.091852  0.275103  -59.489932  208.817309
Group_6  108.213310   66.879514  1.618034  0.105900  -22.991520  239.418139

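Group effects of this kind can be thought of as averages of the per-observation orthogonal scores within each group, which is the same as regressing the scores on group indicators. The sketch below shows that equivalence, and that the share-weighted average of the group effects recovers the overall ATE. The scores and group labels are simulated and hypothetical; the code is an illustration under these assumptions, not the library's internals.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
# Hypothetical inputs: one group label and one orthogonal score per observation
group = rng.integers(1, 4, size=n)  # three groups labelled 1, 2, 3
psi = rng.normal(loc=10.0 * group, scale=5.0, size=n)

labels = np.unique(group)
dummies = (group[:, None] == labels[None, :]).astype(float)

# Regressing the scores on group dummies (no intercept) returns the group means
gate = np.linalg.lstsq(dummies, psi, rcond=None)[0]
group_means = np.array([psi[group == g].mean() for g in labels])

# The overall ATE is the share-weighted average of the group effects
shares = dummies.mean(axis=0)
ate = psi.mean()

print(np.round(gate, 2), round(float(shares @ gate), 2), round(float(ate), 2))
```

Small groups get noisy averages, which is consistent with the wide confidence intervals in the table above: most age groups contain only a fraction of the 1293 observations.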

## Summary

In [149]:
# Collect the coupon effect from every model into one comparison table
results = pd.DataFrame(columns=['Estimate', 'SE', 't-stat', 'p-value', 'CI_low', 'CI_high'],
                       index=['OLS', 'OLS_add', 'OLS_int', 'DML_linear', 'DML_lasso', 'DML_rf', 'DML_int'])

# statsmodels results: look up the 'coupons' row by name
for i, res in enumerate([results_ols, results_ols_add, results_ols_int]):
    results.iloc[i, 0] = res.params['coupons']
    results.iloc[i, 1] = res.bse['coupons']
    results.iloc[i, 2] = res.tvalues['coupons']
    results.iloc[i, 3] = res.pvalues['coupons']
    results.iloc[i, 4] = res.conf_int().loc['coupons', 0]
    results.iloc[i, 5] = res.conf_int().loc['coupons', 1]

# DoubleML results: there is a single treatment, so take the first element
for i, res in enumerate([results_dml_linear, results_dml_lasso, results_dml_rf, results_dml_int]):
    results.iloc[i+3, 0] = res.coef[0]
    results.iloc[i+3, 1] = res.se[0]
    results.iloc[i+3, 2] = res.t_stat[0]
    results.iloc[i+3, 3] = res.pval[0]
    results.iloc[i+3, 4] = res.confint().iloc[0, 0]
    results.iloc[i+3, 5] = res.confint().iloc[0, 1]

results.astype('float').round(2)

| | Estimate | SE | t-stat | p-value | CI_low | CI_high |
|---|---|---|---|---|---|---|
| OLS | 95.43 | 21.78 | 4.38 | 0.0 | 52.75 | 138.12 |
| OLS_add | 73.64 | 23.23 | 3.17 | 0.0 | 28.1 | 119.18 |
| OLS_int | 68.32 | 23.23 | 2.94 | 0.0 | 22.79 | 113.84 |
| DML_linear | 76.96 | 23.15 | 3.32 | 0.0 | 31.59 | 122.33 |
| DML_lasso | 57.21 | 17.23 | 3.32 | 0.0 | 23.45 | 90.98 |
| DML_rf | 46.69 | 23.75 | 1.97 | 0.05 | 0.14 | 93.24 |
| DML_int | 72.99 | 21.8 | 3.35 | 0.0 | 30.26 | 115.73 |

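Every row of the table follows the same recipe: the interval is the estimate plus or minus a normal quantile times the standard error, and the p-value comes from the ratio of the two. As a worked check, the sketch below recomputes the `DML_lasso` row from the coefficient and standard error printed in its fit summary, assuming a standard normal reference distribution.

```python
from statistics import NormalDist

# DML_lasso coefficient and standard error as printed in its fit summary
coef, se = 57.21486, 17.2296

z = NormalDist().inv_cdf(0.975)  # two-sided 95% quantile, about 1.96
ci_low, ci_high = coef - z * se, coef + z * se
t_stat = coef / se
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(round(ci_low, 2), round(ci_high, 2), round(t_stat, 2))
# 23.45 90.98 3.32
```

The recomputed values match the reported row. The same arithmetic explains why `DML_rf` is only borderline significant: its estimate of 46.69 is just under two standard errors (23.75) away from zero.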