## A/B testing on an Ecommerce dataset

The UX designer of an ecommerce company is working on a new version of the product page, which is expected to lead to a higher conversion rate (the percentage of users who purchase a product after viewing it). The product manager shared that the current conversion rate is about **13%** on average throughout the year, and that success would look like a **2%** increase in conversion rate (in absolute terms, from 13% to 15%). Before rolling out the change, the team wants to run an A/B test on a small number of users to see how it performs.

This notebook will cover the process of analyzing an A/B test. The dataset comes from a [Kaggle competition](https://www.kaggle.com/zhangluyuan/ab-testing), which contains the results of an A/B test on what seems to be 2 different designs of a website page (old_page vs. new_page). 

* [Perform Power Analysis](#Section 1)
* [Collect and Clean the data](#Section 2)
* [Data Sampling](#Section 3)
* [Calculate Test Metrics](#Section 4)
* [Testing the hypothesis](#Section 5)
* [Draw conclusions](#Section 6)

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.stats.api as sms
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

<a id='Section 1'></a>
### Step 1: Power Analysis

* Power of the test: $1 - \beta = 0.8$
* Significance level: $\alpha = 0.05$
* Effect size: the change counts as effective if the conversion rate rises from 13% to 15% (an absolute increase of **2 percentage points**)
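The three settings above are enough to approximate the required sample size in closed form. The sketch below is a hand calculation using Cohen's h and only the Python standard library (`statistics.NormalDist`), not the statsmodels solver; the variable names are illustrative. It ignores the negligible opposite-tail term of a two-sided test, so the result should land very close to, but is not guaranteed to equal, the solver's answer.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

alpha, power = 0.05, 0.8
p_baseline, p_target = 0.13, 0.15

# Cohen's h: difference between the arcsine-transformed proportions
h = 2 * asin(sqrt(p_target)) - 2 * asin(sqrt(p_baseline))

# Critical values of the standard normal distribution
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
z_power = NormalDist().inv_cdf(power)

# Equal group sizes: n per group = 2 * ((z_alpha + z_power) / h)^2
n_per_group = ceil(2 * ((z_alpha + z_power) / h) ** 2)
print(f"Approximate samples needed per group: {n_per_group}")
```

The factor of 2 reflects that each group contributes its own sampling variance to the difference being tested.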

In [2]:
# Experiment configurations

random_seed = 0
power = 0.8
sig_level = 0.05
expected_cvr = 0.13
effective_cvr = 0.15

effect_size = sms.proportion_effectsize(expected_cvr, effective_cvr)

required_sample_size = sms.NormalIndPower().solve_power(
    effect_size,
    power=power,
    alpha=sig_level,
    ratio=1
)

required_sample_size = ceil(required_sample_size)
print(f"We need {required_sample_size} samples in each of the control and treatment groups.")

We need 4720 samples in each of the control and treatment groups.


<a id='Section 2'></a>

### Step 2: Collect and clean data

In [3]:
df = pd.read_csv('./data/ab_data.csv')
df.head(3)

| | user_id | timestamp | group | landing_page | converted |
| --- | --- | --- | --- | --- | --- |
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [5]:
# A quick glance at what is in the dataset

users_count = df.user_id.nunique()
print(f'There are {users_count} users in the dataset.')

for var in ['group', 'landing_page', 'converted']:
    print(f"The {var} variable has the following classes: ")
    print(df[var].unique().tolist())
    print("\n")

There are 290584 users in the dataset.
The group variable has the following classes: 
['control', 'treatment']


The landing_page variable has the following classes: 
['old_page', 'new_page']


The converted variable has the following classes: 
[0, 1]




We learn the following about the dataset:

* `user_id` - The user ID of each session
* `timestamp` - Timestamp for the session
* `group` - Which group the user was assigned to for that session {control, treatment}
* `landing_page` - Which design each user saw on that session {old_page, new_page}
* `converted` - Whether the session ended in a conversion or not (binary, 0=not converted, 1=converted)

Next, we perform a sanity check of the dataset to see if there is any experiment 'leak'. 

In [6]:
# Sanity Check: whether (i) all control group users are seeing the old page and  
# (ii) all treatment group users are seeing the new page

pd.crosstab(df['group'], df['landing_page'])

| group / landing_page | new_page | old_page |
| --- | --- | --- |
| control | 1928 | 145274 |
| treatment | 145311 | 1965 |


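**Observation**: There is a small amount of leakage: 1,928 control sessions saw the new page and 1,965 treatment sessions saw the old page. These should be removed from the dataset before the analysis.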

In [7]:
# Remove the leakage by dropping every user who has more than one session
# (this assumes the mismatched rows all belong to such users).

users_sessions_count = df['user_id'].value_counts()
users_w_multiple_sessions = users_sessions_count[
    users_sessions_count>1
].index.tolist()

df = df[~df.user_id.isin(users_w_multiple_sessions)]
print(f'Removing {len(users_w_multiple_sessions)} multi-session users from the dataset.')

Removing 3894 multi-session users from the dataset.
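Dropping multi-session users only removes the leakage if every mismatched row belongs to such a user, so it is worth confirming that no mismatch survives the filter. The sketch below shows one way to do that on a small made-up DataFrame (the `toy` data and the variable names are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Made-up toy data: user 2 has two sessions, one of them on the wrong page.
toy = pd.DataFrame({
    "user_id":      [1, 2, 2, 3, 4],
    "group":        ["control", "control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "new_page"],
})

# Same rule as in the cell above: drop every user with more than one session.
session_counts = toy["user_id"].value_counts()
repeat_users = session_counts[session_counts > 1].index
toy_clean = toy[~toy["user_id"].isin(repeat_users)]

# A row is mismatched when control saw new_page or treatment saw old_page.
expected_page = toy_clean["group"].map(
    {"control": "old_page", "treatment": "new_page"}
)
n_mismatched = int((toy_clean["landing_page"] != expected_page).sum())
print(f"Mismatched rows left after filtering: {n_mismatched}")
```

On the real data, re-running the `pd.crosstab` sanity check on the filtered `df` gives the same confirmation: both off-diagonal cells should be zero.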


<a id='Section 3'></a>

### Step 3: Sampling

Remember that we calculated how many samples we need to achieve enough power, so we draw that many users at random from each group of the cleaned dataset.

In [8]:
# Sample to constitute a new test dataset, with each group having the required number of samples.

test_df = pd.concat([
    df[df.group=='control'].sample(n=required_sample_size, random_state=random_seed),
    df[df.group=='treatment'].sample(n=required_sample_size, random_state=random_seed)
], axis=0)
test_df.reset_index(drop=True, inplace=True)

In [9]:
# Verify

test_df.group.value_counts()

control      4720
treatment    4720
Name: group, dtype: int64

<a id='Section 4'></a>
### Step 4: Calculate test metrics

In [11]:
grouped = test_df.groupby('group')['converted']

def p_std(x):
    # Population standard deviation (ddof=0) of the 0/1 conversion flags
    return np.std(x)

def p_stderr(y):
    # Standard error of the conversion rate: std / sqrt(n)
    return np.std(y)/np.sqrt(y.shape[0])

conversion_tbl = grouped.agg(
    ['mean',
     p_std,
     p_stderr
    ]
)
conversion_tbl.columns = ['conversion rate', 'standard deviation', 
                          'standard error']
conversion_tbl

| group | conversion rate | standard deviation | standard error |
| --- | --- | --- | --- |
| control | 0.111017 | 0.314153 | 0.004573 |
| treatment | 0.120975 | 0.326098 | 0.004747 |
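Because `converted` is a 0/1 variable, the standard deviation and standard error in the table follow directly from the conversion rate: the standard deviation is $\sqrt{p(1-p)}$ and the standard error is that value divided by $\sqrt{n}$. The sketch below checks this for the control group using only the standard library. The count of 524 conversions is inferred here from 0.111017 × 4720 and is not stated in the notebook.

```python
from math import sqrt
from statistics import pstdev

n = 4720
conversions = 524  # assumption: inferred from the reported rate, 0.111017 * 4720

# Rebuild the control group's 0/1 conversion flags
sessions = [1] * conversions + [0] * (n - conversions)

p = conversions / n
std_from_data = pstdev(sessions)      # population standard deviation (ddof=0)
std_from_formula = sqrt(p * (1 - p))  # Bernoulli standard deviation
std_err = std_from_formula / sqrt(n)  # standard error of the conversion rate

print(f"rate={p:.6f}, std={std_from_formula:.6f}, stderr={std_err:.6f}")
```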


**Observation:** The conversion rate of the treatment group is higher, but the difference is small and could be due to chance. We need a hypothesis test to check this.

<a id='Section 5'></a>
### Step 5: Hypothesis testing

$H_0: p_{control} = p_{treatment}$ <br>
$H_a: p_{control} \neq p_{treatment}$
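
Under the null hypothesis both groups share a single conversion rate, so a common form of the two-proportion z statistic pools the two samples to estimate it. The symbols $x$ (number of conversions) and $n$ (number of sessions) are notation introduced here for illustration, not names used in the notebook:

$$
z = \frac{\hat{p}_{control} - \hat{p}_{treatment}}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_{control}} + \dfrac{1}{n_{treatment}}\right)}},
\qquad
\hat{p} = \frac{x_{control} + x_{treatment}}{n_{control} + n_{treatment}}
$$

For the two-sided alternative, the p-value is $2\,\bigl(1 - \Phi(|z|)\bigr)$, where $\Phi$ is the standard normal CDF.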

**(i)** Calculate the test statistic and confidence intervals using the statsmodels library

In [12]:
control_subdf = test_df[test_df.group=='control']
treatment_subdf = test_df[test_df.group=='treatment']
successes = [control_subdf.converted.sum(), 
             treatment_subdf.converted.sum()]
counts = [control_subdf.converted.count(), 
          treatment_subdf.converted.count()]

# Calculate the z-score and p value
zscore, pval = proportions_ztest(successes, nobs=counts)
(lower_control, lower_treatment), (upper_control, upper_treatment) = \
                                proportion_confint(successes, nobs=counts, alpha=sig_level)

print(f'z score: {zscore:.2f}')
print(f'p-value: {pval:.3f}')
print(f'The 95% confidence interval for control group: [{lower_control:.3f}, {upper_control:.3f}]')
print(f'The 95% confidence interval for treatment group: [{lower_treatment:.3f}, {upper_treatment:.3f}]')

z score: -1.51
p-value: 0.131
The 95% confidence interval for control group: [0.102, 0.120]
The 95% confidence interval for treatment group: [0.112, 0.130]


**(ii)** Calculate the test statistic and confidence intervals by hand, using the normal approximation

In [17]:
# Two-sample z-test
# According to the CLT, we can assume the sample proportion follows a normal distribution.

p_control, p_treatment = conversion_tbl.iloc[0, 0], conversion_tbl.iloc[1, 0]
stderr_control, stderr_treatment = conversion_tbl.iloc[0, 2], conversion_tbl.iloc[1, 2]

# Standard error of the difference, using each group's own variance (unpooled)
unpooled_std = np.sqrt(p_control*(1-p_control)/counts[0] +
                       p_treatment*(1-p_treatment)/counts[1])
zscore = (p_control-p_treatment)/unpooled_std
# Two-sided p-value; abs() keeps it valid whichever sign the z-score has
pval = 2 * stats.norm.sf(abs(zscore))

# 95% confidence interval
lower_control = p_control - stats.norm.ppf((1-sig_level/2))* stderr_control
upper_control = p_control + stats.norm.ppf((1-sig_level/2))* stderr_control
lower_treatment = p_treatment - stats.norm.ppf((1-sig_level/2))* stderr_treatment
upper_treatment = p_treatment + stats.norm.ppf((1-sig_level/2))* stderr_treatment

print(f'z score: {zscore:.2f}')
print(f'p-value: {pval:.3f}')
print(f'The 95% confidence interval for control group: [{lower_control:.3f}, {upper_control:.3f}]')
print(f'The 95% confidence interval for treatment group: [{lower_treatment:.3f}, {upper_treatment:.3f}]')

z score: -1.51
p-value: 0.131
The 95% confidence interval for control group: [0.102, 0.120]
The 95% confidence interval for treatment group: [0.112, 0.130]
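
The manual calculation in (ii) uses an unpooled standard error, whereas `proportions_ztest` in (i) is understood to pool the two samples by default. The sketch below compares the two variants to show why the results agree to the printed precision. The conversion counts (524 and 571) are inferred from the rates in Step 4 and are an assumption, and only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist

n_control = n_treatment = 4720
x_control, x_treatment = 524, 571  # assumption: inferred from the Step 4 rates

p_c = x_control / n_control
p_t = x_treatment / n_treatment
p_pool = (x_control + x_treatment) / (n_control + n_treatment)

# Standard error of the difference under each variance estimate
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
se_unpooled = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)

results = {}
for label, se in [("pooled", se_pooled), ("unpooled", se_unpooled)]:
    z = (p_c - p_t) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    results[label] = (z, p_value)
    print(f"{label}: z = {z:.2f}, p-value = {p_value:.3f}")
```

With equal group sizes and similar conversion rates the two standard errors are nearly identical, so the choice between them does not change the conclusion here.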


<a id='Section 6'></a>
### Conclusion

Since the p-value is 0.131, which is greater than $\alpha = 0.05$, we cannot reject the null hypothesis $H_0$. This indicates that the new design **did not perform significantly differently** from the old design in this sample.

The confidence intervals tell a consistent story. The 95% CI for the treatment group is [0.112, 0.130], which overlaps the control group's interval of [0.102, 0.120] and does not reach the target conversion rate of 15%. Overlapping intervals are only a rough check, so the z-test above remains the basis for the conclusion.
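
A more direct interval for this question is one on the difference between the two conversion rates, rather than on each group separately. The sketch below computes a 95% normal-approximation (Wald) interval for the lift. It is not part of the original analysis, the conversion counts are inferred from the Step 4 rates (an assumption), and only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist

n_control = n_treatment = 4720
x_control, x_treatment = 524, 571  # assumption: inferred from the Step 4 rates
alpha = 0.05

p_c = x_control / n_control
p_t = x_treatment / n_treatment

# Observed lift and its unpooled standard error
diff = p_t - p_c
se_diff = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)

# Wald interval: estimate +/- critical value * standard error
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff

print(f"lift = {diff:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

With these counts the interval contains zero, which agrees with the non-significant p-value.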