# A/B Testing dan Regresi Logistik

Data yang digunakan disini adalah data dari sebuah perusahaan e-comerce yang ingin menerapkan laman baru untuk meningkatkan jumlah pengguna. A/B testing digunakan untuk memberikan wawasan kepada perusahaan tentang apakah perusahaan harus menerapkan halaman lama atau halaman baru.

## Import library yang diperlukan

In [23]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.stats.api as sm
%matplotlib inline

## Data Preparation

Input data dan lihat bagaimana data tersusun

In [10]:
df = pd.read_csv('ab_data.csv')
df

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1
...,...,...,...,...,...
294473,751197,2017-01-03 22:28:38.630509,control,old_page,0
294474,945152,2017-01-12 00:51:57.078372,control,old_page,0
294475,734608,2017-01-22 11:45:03.439544,control,old_page,0
294476,697314,2017-01-15 01:20:28.957438,control,old_page,0


## Mencari Statistik dasar

In [17]:
print(' Panjang data : ', df.shape[0],'\n',
        'Jumlah data unik : ',df['user_id'].nunique()
     )

 Panjang data :  294478 
 Jumlah data unik :  290584


 Proporsi data

In [67]:
print(' Jumlah pelanggan memilih beralih : ', df['converted'].value_counts()[1],'\n',
        'Jumlah pelanggan memilih tetap : ', df['converted'].shape[0],'\n',
      'Proporsi antara pelanggan tetap dibanding beralih :', np.round(df['converted'].mean()*100, 2), '%')

 Jumlah pelanggan memilih beralih :  35237 
 Jumlah pelanggan memilih tetap :  294478 
 Proporsi antara pelanggan tetap dibanding beralih : 11.97 %


Jumlah pelanggan dengan treatment yang beralih dan pelanggan tanpa treatment yang tidak beralih

In [18]:
a = df.query("group != 'treatment' & landing_page == 'new_page'").count()
b = df.query("group == 'treatment' & landing_page != 'new_page'").count()

a + b

user_id         3893
timestamp       3893
group           3893
landing_page    3893
converted       3893
dtype: int64

Mencari adanya missing value

In [19]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

## Membersihkan data

In [8]:
# Store all records that new_page and treatment don't match
a = df.query("group != 'treatment' & landing_page == 'new_page'")
b = df.query("group == 'treatment' & landing_page != 'new_page'")

# Store all records that old_page and control don't match
c = df.query("group != 'control' & landing_page == 'old_page'")
d = df.query("group == 'control' & landing_page != 'old_page'")

df2 = pd.concat([df, a, b, c, d]).drop_duplicates(keep=False)
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


Mencari tau apakah ada duplicated data pada user

In [11]:
df2[df2['user_id'].duplicated()].user_id

2893    773192
Name: user_id, dtype: int64

In [12]:
df2[df2.duplicated(['user_id'], keep = False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


Menghapus duplicated data

In [13]:
df2 = df2.drop_duplicates(keep = 'first')
df2.duplicated().sum()

0

# A/B testing

## 1. Mencari Statistik Dasar

Mencari beberapa informasi dasar tentang bentuk sampel

In [24]:
conversion_rates = df2.groupby('group')['converted']

std_p = lambda x: np.std(x, ddof=0)              # Std. deviation of the proportion
se_p = lambda x: stats.sem(x, ddof=0)            # Std. error of the proportion (std / sqrt(n))

conversion_rates = conversion_rates.agg([np.mean, std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']


conversion_rates.style.format('{:.3f}')

Unnamed: 0_level_0,conversion_rate,std_deviation,std_error
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
control,0.12,0.325,0.001
treatment,0.119,0.324,0.001


Dilihat dari statistik di atas, desain lama dan baru performa yang sangat mirip, orang-orang dalam group treatment/perlakuan memiliki kemungkinan yang lebih rendah untuk berpindah (sebesar 11,9%) dibandingkan orang-orang dalam group control (12,0%). Namun, perbedaan ini terlalu kecil sehingga diperlukan investigasi lanjutan.

## 2. Melakukan Uji Hipotesis

A/B Testing dilakukan dengan asumsi bahwa halaman lama menghasilkan tingkat konversi yang lebih besar daripada halaman baru, kecuali jika halaman baru terbukti benar-benar menghasilkan tingkat konversi yang lebih besar pada tingkat kesalahan Tipe I sebesar 5%. Berdasarkan asumsi ini,

untuk Hipotesis Null, probabilitas semua pengguna yang berkonversi dari mendarat di halaman lama lebih besar atau sama dengan probabilitas semua pengguna yang berkonversi dari mendarat di halaman baru.
untuk Hipotesis Alternatif, probabilitas semua pengguna yang berkonversi dari mendarat di halaman baru lebih besar atau sama dengan probabilitas semua pengguna yang berkonversi dari mendarat di halaman lama.

In [68]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = df2[df2['group'] == 'control']['converted']
treatment_results = df2[df2['group'] == 'treatment']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')

z statistic: 1.31
p-value: 0.190


Dari hasil diatas kita tidak dapat menolak hipotesis Null, karena nilai p-value lebih besar dari 0.05, yang berarti bahwa desain baru kita tidak bekerja secara signifikan berbeda (apalagi lebih baik) dari yang lama.

# Regressi Logistik

## 1. Persiapan

Pertama kita harus membuat kolom untuk intersep dahulu, kemudian membuat kolom untuk meletakkan variabel dummy dari treatment atau control yang diterima setiap pengguna.

In [28]:
df2['intercept'] = 1
df2[['control', 'treatment']] = pd.get_dummies(df2['group'])

In [29]:
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept,control,treatment
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1,1,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1,1,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1,0,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1,0,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1,1,0


Pada prinsipnya kolom control dan treatmen adalah sama maka kita akan drop salah satu kolom dan mengganti namanya menjadi ab_page

In [32]:
df2.drop(columns=['control'], inplace=True)
df2.rename(columns={'treatment': 'ab_page'}, inplace=True)

In [33]:
df2.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept,ab_page
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1,0


## 2. Membangun model 

In [34]:
# Model the Logistic Regression
model = sm.Logit(df2['converted'], df2[['intercept', 'ab_page']])
# Fit the model
results = model.fit()
# Show the logistic regression summary
results.summary2()

Optimization terminated successfully.
         Current function value: 0.366118
         Iterations 6


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.0
Dependent Variable:,converted,AIC:,212780.6032
Date:,2021-12-07 13:16,BIC:,212801.7625
No. Observations:,290585,Log-Likelihood:,-106390.0
Df Model:,1,LL-Null:,-106390.0
Df Residuals:,290583,LLR p-value:,0.18965
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
intercept,-1.9888,0.0081,-246.6690,0.0000,-2.0046,-1.9730
ab_page,-0.0150,0.0114,-1.3116,0.1897,-0.0374,0.0074


In [66]:
np.exp(results.params)

intercept    0.136863
ab_page      0.985115
dtype: float64

Nilai p-Value lebih besar dari 0.05 menunjukkan bahwa kita tidak dapat untuk menolak H0. Berdasarkan Nilai koefisien yang diperoleh juga menunjukkan bahwa tidak ada dampak signifikan karena tingkat keputusan pengunjung untuk beralih hanya 0.99 kali yang artinya tidak ada peningkatan nilai.

## 3. Menambahkan Faktor Negara Asal dala model

In [36]:
df_countries = pd.read_csv('countries.csv')
df_countries.head()

Unnamed: 0,user_id,country
0,834778,UK
1,928468,US
2,822059,UK
3,711597,UK
4,710616,UK


In [37]:
df_countries.country.value_counts()

US    203619
UK     72466
CA     14499
Name: country, dtype: int64

In [38]:
# Merge the dataframes
df3 = df2.merge(df_countries, on="user_id", how = "left")
df3.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept,ab_page,country
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1,0,US
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1,0,US
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1,1,US
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1,1,US
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1,0,US


In [64]:
# Add the dummy variables
df3[['US','UK','CA']] = pd.get_dummies(df3['country'])
df3.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept,ab_page,country,US,UK,CA,US_interaction,UK_interaction,CA_interaction
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1,0,US,0,0,1,0,0,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1,0,US,0,0,1,0,0,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1,1,US,0,0,1,0,0,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1,1,US,0,0,1,0,0,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1,0,US,0,0,1,0,0,0


In [59]:
model2 = sm.Logit(df3['converted'], df3[['intercept', 'US', 'UK']])
# Fit the model
results2 = model2.fit()

# Show the logistic regression summary
results2.summary2()

Optimization terminated successfully.
         Current function value: 0.366115
         Iterations 6


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.0
Dependent Variable:,converted,AIC:,212781.088
Date:,2021-12-07 13:49,BIC:,212812.8269
No. Observations:,290585,Log-Likelihood:,-106390.0
Df Model:,2,LL-Null:,-106390.0
Df Residuals:,290582,LLR p-value:,0.19834
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
intercept,-1.9967,0.0068,-292.3154,0.0000,-2.0101,-1.9833
US,-0.0408,0.0269,-1.5176,0.1291,-0.0935,0.0119
UK,0.0099,0.0133,0.7462,0.4555,-0.0161,0.0360


In [60]:
np.exp(results2.params)

intercept    0.135778
US           0.960024
UK           1.009971
dtype: float64

Berdasarkan nilai koefisien kita tahu bahwa tidak ada impact yang berari kepada keputusan pengunjung dari UK untuk merubah perilakunya kaena nilainya kurang lebih  sama dengan 1 (menunjukkan tidak ada peningkatan nilai). Begitu pula dengan pengunjung dari US yang memiliki peluang 0.96x lebih kecil. Dengan hasil demikian menunjukkan bahwa negara asal tidak memberikan pengaruh perilaku pengunjung. 

## 4. Menambahkan interaksi ke dalam model

In [43]:
# Create interaction variables
df3['US_interaction'] = df3['ab_page'] * df3['US']
df3['UK_interaction'] = df3['ab_page'] * df3['UK']
df3['CA_interaction'] = df3['ab_page'] * df3['CA']

In [44]:
df3.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,intercept,ab_page,country,US,UK,CA,US_interaction,UK_interaction,CA_interaction
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,1,0,US,0,0,1,0,0,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,1,0,US,0,0,1,0,0,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,1,1,US,0,0,1,0,0,1
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1,1,US,0,0,1,0,0,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,1,0,US,0,0,1,0,0,0


In [61]:
model3 = sm.Logit(df3['converted'], df3[['intercept', 'ab_page', 'US_interaction', 'UK_interaction','US','UK']])

In [62]:
# Fit the model
results3 = model3.fit()

# Show the logistic regression summary
results3.summary2()

Optimization terminated successfully.
         Current function value: 0.366108
         Iterations 6


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.0
Dependent Variable:,converted,AIC:,212782.9124
Date:,2021-12-07 13:49,BIC:,212846.3903
No. Observations:,290585,Log-Likelihood:,-106390.0
Df Model:,5,LL-Null:,-106390.0
Df Residuals:,290579,LLR p-value:,0.19182
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
intercept,-1.9865,0.0096,-206.3440,0.0000,-2.0053,-1.9676
ab_page,-0.0206,0.0137,-1.5060,0.1321,-0.0474,0.0062
US_interaction,-0.0469,0.0538,-0.8716,0.3834,-0.1523,0.0585
UK_interaction,0.0314,0.0266,1.1811,0.2375,-0.0207,0.0835
US,-0.0175,0.0377,-0.4652,0.6418,-0.0914,0.0563
UK,-0.0057,0.0188,-0.3057,0.7598,-0.0426,0.0311


In [63]:
# Exponentiate the results for interpretation
np.exp(results3.params)

intercept         0.137178
ab_page           0.979636
US_interaction    0.954208
UK_interaction    1.031907
US                0.982625
UK                0.994272
dtype: float64

## Kesimpulan

Nilai p-value pada interaction variable, lebih besar dari 0.05. Ini berarti tidak ada pengaaruh negara asal untuk merubah perilaku pengunjung. Dengan demikian akan lebih baik jika tetap menggunakan laman lama atau membuat design baru.