# Chapter 1

Reproducing the R code from 効果検証入門 in Python.

## 1.4 Verifying the Effect of E-mail Marketing with R

In [3]:
!wget "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"

--2020-02-10 16:56:59--  http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv
Resolving www.minethatdata.com (www.minethatdata.com)... 69.168.84.70
Connecting to www.minethatdata.com (www.minethatdata.com)|69.168.84.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3964977 (3.8M) [text/csv]
Saving to: `Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv'


2020-02-10 16:57:01 (2.73 MB/s) - `Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv' saved [3964977/3964977]



In [5]:
!head ./Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv

recency,history_segment,history,mens,womens,zip_code,newbie,channel,segment,visit,conversion,spend
10,2) $100 - $200,142.44,1,0,Surburban,0,Phone,Womens E-Mail,0,0,0
6,3) $200 - $350,329.08,1,1,Rural,1,Web,No E-Mail,0,0,0
7,2) $100 - $200,180.65,0,1,Surburban,1,Web,Womens E-Mail,0,0,0
9,5) $500 - $750,675.83,1,0,Rural,1,Web,Mens E-Mail,0,0,0
2,1) $0 - $100,45.34,1,0,Urban,0,Web,Womens E-Mail,0,0,0
6,2) $100 - $200,134.83,0,1,Surburban,0,Phone,Womens E-Mail,1,0,0
9,3) $200 - $350,280.2,1,0,Surburban,1,Phone,Womens E-Mail,0,0,0
9,1) $0 - $100,46.42,0,1,Urban,0,Phone,Womens E-Mail,0,0,0
9,5) $500 - $750,675.07,1,1,Rural,1,Phone,Mens E-Mail,0,0,0


In [7]:
import numpy as np
import pandas as pd

df = pd.read_csv('./Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

In [8]:
df.head()

,recency,history_segment,history,mens,womens,zip_code,newbie,channel,segment,visit,conversion,spend
0,10,2) $100 - $200,142.44,1,0,Surburban,0,Phone,Womens E-Mail,0,0,0.0
1,6,3) $200 - $350,329.08,1,1,Rural,1,Web,No E-Mail,0,0,0.0
2,7,2) $100 - $200,180.65,0,1,Surburban,1,Web,Womens E-Mail,0,0,0.0
3,9,5) $500 - $750,675.83,1,0,Rural,1,Web,Mens E-Mail,0,0,0.0
4,2,1) $0 - $100,45.34,1,0,Urban,0,Web,Womens E-Mail,0,0,0.0


In [20]:
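# Drop users who received the women's e-mail, then flag recipients of the
# men's e-mail as treated (1); "No E-Mail" users form the control group (0).
df_filtered = df.query('segment != "Womens E-Mail"').assign(
    treatment=lambda d: (d['segment'] == 'Mens E-Mail').astype(int)
)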

In [25]:
df_filtered.groupby('treatment').agg({
    'conversion': ['count', 'sum', 'mean'],
    'spend': 'mean'
})

,conversion,conversion,conversion,spend
treatment,count,sum,mean,mean
0,21306,122,0.005726,0.652789
1,21307,267,0.012531,1.422617
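The table above reports group means; the naive effect estimate is the difference between the treated and control means. The following is a minimal sketch of that calculation, assuming a DataFrame with a binary `treatment` column like `df_filtered`. The helper name `mean_difference` and the toy data are hypothetical and only serve as illustration.

```python
import pandas as pd


def mean_difference(data: pd.DataFrame, outcome: str) -> float:
    """Return mean(outcome | treatment == 1) - mean(outcome | treatment == 0)."""
    group_means = data.groupby('treatment')[outcome].mean()
    return float(group_means.loc[1] - group_means.loc[0])


# Toy data (hypothetical values, not the MineThatData file).
toy = pd.DataFrame({
    'treatment': [0, 0, 0, 1, 1, 1],
    'spend': [0.0, 0.0, 3.0, 0.0, 2.0, 4.0],
})
toy_diff = mean_difference(toy, 'spend')
print(toy_diff)
```

On the real data the same call would presumably be `mean_difference(df_filtered, 'spend')`.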


In [26]:
from statsmodels.stats.weightstats import ttest_ind

In [31]:
# The third return value is named dof so that it does not overwrite the DataFrame df.
tstat, pvalue, dof = ttest_ind(
    df_filtered.query('treatment == 1')['spend'],
    df_filtered.query('treatment == 0')['spend'],
    usevar='pooled')

In [35]:
print('test statistic', tstat)
print('pvalue', pvalue)
print('degrees of freedom', dof)

test statistic 5.300090294465455
pvalue 1.163200872605976e-07
degrees of freedom 42611.0
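To make the `usevar='pooled'` option concrete, here is a sketch of the equal-variance two-sample t statistic computed by hand with NumPy. It is an illustration under the standard textbook formula, not the statsmodels implementation; the function name `pooled_t_statistic` and the toy samples are hypothetical. Note that the degrees of freedom reported above, 42611, equal 21306 + 21307 - 2, which matches the `dof` expression below.

```python
import numpy as np


def pooled_t_statistic(x, y):
    """Two-sample t statistic assuming both groups share one variance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_x, n_y = len(x), len(y)
    dof = n_x + n_y - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_x - 1) * x.var(ddof=1) + (n_y - 1) * y.var(ddof=1)) / dof
    std_err = np.sqrt(pooled_var * (1 / n_x + 1 / n_y))
    return (x.mean() - y.mean()) / std_err, dof


# Toy samples (hypothetical values).
t_toy, dof_toy = pooled_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(t_toy, dof_toy)
```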


## 1.4.3 Verifying the Effect in a Biased Setting

In [36]:
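# Simulate selection bias. Users with a large purchase history, a recent purchase,
# or multichannel usage are kept with probability 0.5 in the control group and 1 in
# the treatment group; for all other users the two probabilities are swapped.
# No random seed is set, so the sampled counts differ from run to run.
_df = df_filtered
df_biased = _df.assign(
    obs_rate_c=np.where((_df['history'] > 300) | (_df['recency'] < 6) | (_df['channel'] == 'Multichannel'), 0.5, 1),
    obs_rate_t=np.where((_df['history'] > 300) | (_df['recency'] < 6) | (_df['channel'] == 'Multichannel'), 1, 0.5),
    random_number=np.random.random(len(_df))
).query('(treatment == 0 and random_number < obs_rate_c) or (treatment == 1 and random_number < obs_rate_t)')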
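Because the cell above draws random numbers without a seed, its output changes on every run. One way to make the biased sample reproducible is sketched below with `np.random.default_rng` and a fixed seed. The function name `biased_sample`, the variable `high_potential`, the seed value, and the toy frame are all assumptions made for illustration; the keep probabilities mirror `obs_rate_c` and `obs_rate_t`.

```python
import numpy as np
import pandas as pd


def biased_sample(data: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    """Keep each row with a probability that depends on user type and treatment."""
    rng = np.random.default_rng(seed)
    high_potential = (
        (data['history'] > 300)
        | (data['recency'] < 6)
        | (data['channel'] == 'Multichannel')
    )
    # Treated rows: 1.0 for high-potential users, 0.5 otherwise.
    # Control rows: 0.5 for high-potential users, 1.0 otherwise.
    keep_prob = np.where(
        data['treatment'] == 1,
        np.where(high_potential, 1.0, 0.5),
        np.where(high_potential, 0.5, 1.0),
    )
    return data[rng.random(len(data)) < keep_prob]


# Toy frame (hypothetical values).
toy = pd.DataFrame({
    'history': [500.0, 50.0, 500.0, 50.0],
    'recency': [10, 10, 10, 10],
    'channel': ['Web', 'Web', 'Phone', 'Phone'],
    'treatment': [1, 1, 0, 0],
})
first = biased_sample(toy, seed=1)
second = biased_sample(toy, seed=1)
print(first.equals(second))
```

With the same seed the two calls return the same rows, so downstream tables and test statistics stay stable between runs.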

In [37]:
df_biased.groupby('treatment').agg({
    'conversion': ['count', 'sum', 'mean'],
    'spend': 'mean'
})

,conversion,conversion,conversion,spend
treatment,count,sum,mean,mean
0,14848,84,0.005657,0.653098
1,17182,224,0.013037,1.520655


In [38]:
# The third return value is named dof so that it does not overwrite the DataFrame df.
tstat, pvalue, dof = ttest_ind(
    df_biased.query('treatment == 1')['spend'],
    df_biased.query('treatment == 0')['spend'],
    usevar='pooled')

In [39]:
print('test statistic', tstat)
print('pvalue', pvalue)
print('degrees of freedom', dof)

test statistic 4.893675554384148
pvalue 9.945035252215874e-07
degrees of freedom 32028.0
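The point of this section is easier to see when the two naive estimates are placed side by side. The sketch below copies the mean `spend` values from the two summary tables above and compares the treated-minus-control differences. The variable names are hypothetical, and because the biased sample was drawn without a seed, the second set of figures applies only to the run shown here.

```python
# Mean spend per group, copied from the two summary tables above.
rct_means = {'control': 0.652789, 'treated': 1.422617}
biased_means = {'control': 0.653098, 'treated': 1.520655}

# Naive effect estimate: difference in mean spend between the groups.
rct_effect = rct_means['treated'] - rct_means['control']
biased_effect = biased_means['treated'] - biased_means['control']

print(f'Randomized data estimate: {rct_effect:.4f}')
print(f'Biased data estimate:     {biased_effect:.4f}')
print(f'Gap:                      {biased_effect - rct_effect:.4f}')
```

In this run the biased sample gives the larger difference, and the t-test on it is still highly significant, so a significant result alone does not show that the estimate is free of selection bias.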
