### A/B Testing in Python 

>Let's imagine you work on the product team at a <b>medium-sized online e-commerce business</b>. The UX designer worked really hard on <b>a new version of the product page</b>, with the <b>hope</b> that it will lead to a <b>higher conversion rate</b>. The product manager (PM) told you that the current conversion rate is about 13% on average throughout the year, and that the team would be happy with an increase of 2%, meaning that the <b>new design will be considered a success if it raises the conversion rate</b> to 15%.

This notebook walks through the process of analysing an A/B test: formulating a hypothesis, testing it, and finally interpreting the results.

<b>References:</b> 
1. A/B Testing Procedure: https://towardsdatascience.com/ab-testing-with-python-e5964dd66143 \
2. Code Replicated From Here: https://github.com/renatofillinich/ab_test_guide_in_python \
3. Data Source: https://www.kaggle.com/datasets/zhangluyuan/ab-testing?resource=download 

We use a dataset from Kaggle (linked above) that contains the results of an A/B test on what seem to be two different designs of a webpage (old_page vs. new_page).

<b>Steps illustrated in the notebook:</b>
1. Designing our Experiment 
2. Collecting and Preparing the Data
3. Visualizing the Results
4. Testing the Hypothesis
5. Drawing Conclusions 

The scenario above was framed after looking at the data, to make the exercise more realistic. In this scenario, we run an A/B test on a subset of the user base before rolling out the change.

Before we get started, it's natural to ask the question: <b>What is an A/B Test?</b>

>A/B testing (also known as bucket testing, split-run testing, or split testing) is a user experience research methodology. A/B tests consist of a randomized experiment that usually involves two variants (A and B), although the concept can be also extended to multiple variants of the same variable. It includes application of statistical hypothesis testing or "two-sample hypothesis testing" as used in the field of statistics. A/B testing is a way to compare multiple versions of a single variable, for example by testing a subject's response to variant A against variant B, and determining which of the variants is more effective. [https://en.wikipedia.org/wiki/A/B_testing] 

### Step 01: Designing The Experiment 
<b><u>Formulating a Hypothesis</u></b>

Here we don't know if the new design will perform better, worse, or the same as the current one, so we'll choose a <b>two-tailed test</b>: 
>In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic. A two-tailed test is appropriate if the estimated value is greater or less than a certain range of values, for example, whether a test taker may score above or below a specific range of scores. This method is used for null hypothesis testing and if the estimated value exists in the critical areas, the alternative hypothesis is accepted over the null hypothesis. A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, left or right, but not both. [https://en.wikipedia.org/wiki/One-_and_two-tailed_tests]

H<sub>0</sub>: p = p<sub>0</sub>\
H<sub>A</sub>: p != p<sub>0</sub>

where p and p<sub>0</sub> stand for the conversion rates of the new and old designs, respectively. We also set the confidence level at 95%, which gives the significance level: \
>$\alpha = 0.05$
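
To make the two-tailed choice concrete, here is a minimal sketch (standard library only, not part of the original analysis) of the critical value implied by $\alpha = 0.05$. Because the test is two-tailed, $\alpha$ is split evenly between both tails of the standard normal distribution.

```python
from statistics import NormalDist

alpha = 0.05  # significance level chosen in the text

# Two-tailed test: alpha/2 of the probability mass sits in each tail,
# so the critical value is the (1 - alpha/2) quantile of N(0, 1).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

print(f"Reject H0 if |z| > {z_crit:.2f}")  # Reject H0 if |z| > 1.96
```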

<b><u>Choosing the Variables</u></b>

For the test, we think of 2 groups: 
1. Control: Show the old design
2. Treatment: Show the new design

With two groups, we can basically control for factors other than the design that might impact our result. 

We are interested in capturing the conversion rate, so we encode the outcome of each user session as a binary variable:
1. 0: User did not buy the product during this user session
2. 1: User bought the product during this user session

This way, we can easily calculate the mean for each group to get the conversion rate of each design. 
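
As a quick illustration of why the binary encoding is convenient, the sketch below uses made-up session outcomes (hypothetical values, not taken from the dataset) and shows that the mean of the 0/1 values is exactly the share of sessions that converted.

```python
# Hypothetical outcomes per session: 1 = purchase, 0 = no purchase
outcomes = {
    "control":   [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "treatment": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
}

# The mean of a 0/1 variable is the fraction of ones, i.e. the conversion rate
rates = {group: sum(values) / len(values) for group, values in outcomes.items()}

print(rates)  # {'control': 0.2, 'treatment': 0.3}
```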

<b><u>Choosing a Sample Size</u></b>

We will not test our whole user base (duh!), but only a subset. Thus, the conversion rates obtained here will be <i>estimates</i> of the true rates. 

The number of people (or user sessions) we decide to capture in each group will have an effect on the precision of our estimated conversion rates: the larger the sample size, the more precise our estimates (i.e. the smaller our confidence intervals), the higher the chance to detect a difference in the two groups, if present. On the other hand, the larger our sample gets, the more expensive (and impractical) our study becomes.
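
The link between sample size and precision can be sketched with the usual normal-approximation formula for a proportion, where the half-width of a 95% confidence interval is $z \cdot \sqrt{p(1-p)/n}$. The helper name `ci_half_width` is ours, and the 13% baseline is the figure from the scenario.

```python
from math import sqrt
from statistics import NormalDist

p = 0.13                           # baseline conversion rate from the scenario
z = NormalDist().inv_cdf(0.975)    # 95% two-sided critical value

def ci_half_width(p, n):
    """Half-width of the normal-approximation 95% CI for a proportion."""
    return z * sqrt(p * (1 - p) / n)

# The interval shrinks with the square root of the sample size
for n in (100, 1_000, 10_000):
    print(n, round(ci_half_width(p, n), 4))
```

Under these assumptions the half-width goes from roughly 6.6 percentage points at n = 100 to about 0.7 percentage points at n = 10,000, so a hundredfold increase in sample size buys a tenfold gain in precision.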

So, <b>how to decide the number of people to have in each group? </b>
>Power Analysis: A power analysis is the calculation used to estimate the smallest sample size needed for an experiment, given a required significance level, statistical power, and effect size. It helps to determine if a result from an experiment or survey is due to chance, or if it is genuine and significant.

Factors:
1. Power of the test = $1-\beta$: This represents the probability of finding a statistical difference between the groups in our test when a difference is actually present. This is usually set at 0.8 as a convention. 
2. Alpha value = $\alpha$: The significance level, set to 0.05 earlier. 
3. Effect Size: How big of a difference we expect there to be between the conversion rates. Since our team would be happy with a difference of 2%, we can use 13% and 15% to calculate the effect size we expect.
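
For two proportions, the effect size is commonly expressed as Cohen's h, the difference between the arcsine-transformed rates. As far as we know this is the same formula that `sms.proportion_effectsize` applies, but treat the sketch below as an illustration rather than a description of the library's internals. The helper name `cohens_h` is ours.

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Baseline (13%) vs. target (15%) conversion rate from the scenario
h = cohens_h(0.13, 0.15)
print(round(h, 4))
```

The result is a small negative number (about -0.058). The sign only reflects the order of the arguments; the power calculation depends on its magnitude.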

<b>Finally, time for some calculations:</b>

In [2]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.stats.api as sms
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil 

%matplotlib inline

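# 'seaborn-whitegrid' is no longer a valid style name in recent matplotlib
# releases, so the white grid is set through seaborn instead
sns.set_style('whitegrid')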
font = {
            'family' : 'Helvetica',
            'weight' : 'bold',
            'size'   : 14
       }

mpl.rc('font', **font)

In [3]:
# Calculating effect size based on our expected rates
effect_size = sms.proportion_effectsize(0.13, 0.15)    

# Calculating sample size needed
required_n = sms.NormalIndPower().solve_power(
                                                effect_size, 
                                                power=0.8, 
                                                alpha=0.05, 
                                                ratio=1
                                              ) 

# Rounding up to next whole number                          
required_n = ceil(required_n)                         
print(required_n)

4720


So, we need at least 4720 observations for each group. Setting the power parameter to 0.8 means that if there is an actual difference in conversion rate between our designs, and that difference is the one we assumed (13% vs. 15%), we have about an 80% chance of detecting it as statistically significant with the sample size we calculated.
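
To see where a number of this size comes from, here is a sketch of the textbook closed-form approximation for a two-sided test with equal group sizes, $n = 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2$, where $h$ is the arcsine-based effect size. This is a hand calculation with the standard library under those assumptions, not a reproduction of how `solve_power` works internally.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

alpha, power = 0.05, 0.8

# Effect size (Cohen's h) for 13% vs. 15%
h = 2 * asin(sqrt(0.13)) - 2 * asin(sqrt(0.15))

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power

# Per-group sample size; the factor 2 comes from comparing two independent groups
n_per_group = ceil(2 * ((z_alpha + z_beta) / h) ** 2)
print(n_per_group)
```

The value should land very close to the figure reported above, which is a useful sanity check on the inputs to the power analysis.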

### Step 02: Collecting and Preparing the Data
We have the number of observations that we will require. Ideally, this is the stage where an actual experiment is set up and then data is collected through it. But we will be using the Kaggle dataset referenced right at the start of this notebook. To simulate a data collection situation, we will check and clean the data as required, and further randomly sample 4720 rows from it for each group.  

In [9]:
df = pd.read_csv("Data-Files/ab_data.csv")
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [10]:
df.info()
# we're only interested in the group and converted column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [11]:
# Check that every control user saw the old page and every treatment user saw the new page
pd.crosstab(df['group'], df['landing_page'])

group,new_page,old_page
control,1928,145274
treatment,145311,1965
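
The crosstab shows that some rows pair a group with the wrong page (1928 control rows with new_page and 1965 treatment rows with old_page). One possible way to keep only consistent rows is sketched below on a toy DataFrame. The toy data and the names `toy`, `expected_page` and `consistent` are illustrative, and this filter is not a step the notebook itself applies.

```python
import pandas as pd

# Toy data with one mismatched row per group (hypothetical values)
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
    "converted": [0, 1, 0, 1],
})

# Page each group is supposed to see
expected_page = toy["group"].map({"control": "old_page", "treatment": "new_page"})

# Keep only rows where the page shown matches the assigned group
consistent = toy[toy["landing_page"] == expected_page]

print(len(toy) - len(consistent), "mismatched rows removed")  # 2 mismatched rows removed
```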


In [12]:
# Checking in case any user has been sampled multiple times 
session_counts = df['user_id'].value_counts(ascending = False)
multi_users = session_counts[session_counts > 1].count()

print(f'There are {multi_users} users that appear multiple times in the dataset')

There are 3894 users that appear multiple times in the dataset


Since 3894 is small compared to the total of 294478 entries, we can go ahead and drop the users with multiple entries.

In [14]:
user_drop = session_counts[session_counts>1].index
df = df[~df['user_id'].isin(user_drop)]
df.shape

(286690, 5)

<b>Sampling</b>

In [15]:
control_sample = df[df['group'] == 'control'].sample(n = required_n, random_state = 22)
treatment_sample = df[df['group'] == 'treatment'].sample(n = required_n, random_state = 22)

ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

In [16]:
ab_test

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,763854,2017-01-21 03:43:17.188315,control,old_page,0
1,690555,2017-01-18 06:38:13.079449,control,old_page,0
2,861520,2017-01-06 21:13:40.044766,control,old_page,0
3,630778,2017-01-05 16:42:36.995204,control,old_page,0
4,656634,2017-01-04 15:31:21.676130,control,old_page,0
...,...,...,...,...,...
9435,908512,2017-01-14 22:02:29.922674,treatment,new_page,0
9436,873211,2017-01-05 00:57:16.167151,treatment,new_page,0
9437,631276,2017-01-20 18:56:58.167809,treatment,new_page,0
9438,662301,2017-01-03 08:10:57.768806,treatment,new_page,0


In [17]:
ab_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9440 entries, 0 to 9439
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       9440 non-null   int64 
 1   timestamp     9440 non-null   object
 2   group         9440 non-null   object
 3   landing_page  9440 non-null   object
 4   converted     9440 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 368.9+ KB


In [18]:
ab_test['group'].value_counts()

control      4720
treatment    4720
Name: group, dtype: int64

### Step 03: Getting to the Results
Let's first get some basic statistics to get an idea of what our samples look like:

In [23]:
conversion_rates = ab_test.groupby('group')['converted']

# Standard deviation of the proportion
std_p = lambda x: np.std(x, ddof = 0)

# Standard error of the proportion (std / sqrt(n))
se_p = lambda x: stats.sem(x, ddof = 0)

conversion_rates = conversion_rates.agg([np.mean, std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']

conversion_rates.style.format('{:.3f}')

group,conversion_rate,std_deviation,std_error
control,0.123,0.329,0.005
treatment,0.126,0.331,0.005


The statistics above indicate that our two designs performed very similarly, with the new design performing only slightly better (12.6% vs. 12.3%).

Also note that the conversion rate of the control group is lower than what we would have expected given the average conversion rate we were told about (12.3% vs. 13%). This goes to show that there is some variation in results when sampling from a population.
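
To get a feel for how much variation to expect, we can express the gap between the observed control rate and the quoted baseline in units of standard error. The sketch below assumes that 13% is the true rate for this population, which is only the figure from the scenario and not something the data confirms.

```python
from math import sqrt

p_expected = 0.13   # baseline quoted in the scenario (assumed true rate)
p_observed = 0.123  # control conversion rate from the table above
n = 4720            # observations per group

# Standard error of a sample proportion under the assumed true rate
se = sqrt(p_expected * (1 - p_expected) / n)

# Distance between observed and expected rate, in standard errors
z = (p_observed - p_expected) / se

print(f"standard error: {se:.4f}, distance: {z:.2f} standard errors")
```

Under that assumption the observed rate sits about 1.4 standard errors below the baseline, which is within the range that random sampling alone can plausibly produce.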

<b>Trying out a different sample, just for a quick check:</b>

In [36]:
control_sample2 = df[df['group'] == 'control'].sample(n = required_n, random_state = 10)
treatment_sample2 = df[df['group'] == 'treatment'].sample(n = required_n, random_state = 10)

ab_test2 = pd.concat([control_sample2, treatment_sample2], axis=0)
ab_test2.reset_index(drop=True, inplace=True)

conversion_rates2 = ab_test2.groupby('group')['converted']

# Standard deviation of the proportion
std_p2 = lambda x: np.std(x, ddof = 0)

# Standard error of the proportion (std / sqrt(n))
se_p2 = lambda x: stats.sem(x, ddof = 0)

conversion_rates2 = conversion_rates2.agg([np.mean, std_p2, se_p2])
conversion_rates2.columns = ['conversion_rate', 'std_deviation', 'std_error']

conversion_rates2.style.format('{:.3f}')

group,conversion_rate,std_deviation,std_error
control,0.123,0.329,0.005
treatment,0.121,0.326,0.005


With this sample, the treatment group actually converted at a slightly lower rate than the control group (12.1% vs. 12.3%).

We need to check if these differences between control and treatment are <i>statistically significant</i>.

### Step 04: Testing the Hypothesis
Since we have a very large sample, we can use the normal approximation to calculate the p-value (i.e. a z-test).

In [37]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']

n_con = control_results.count()
n_treat = treatment_results.count()

successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs = nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

z statistic: -0.34
p-value: 0.732
ci 95% for control group: [0.114, 0.133]
ci 95% for treatment group: [0.116, 0.135]
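
For readers who want to see what a two-proportion z-test computes, here is a hand-rolled sketch using the pooled proportion under H<sub>0</sub>, which to our knowledge matches the default behaviour of `proportions_ztest`. The function name `two_proportion_ztest` and the counts passed to it are hypothetical and are not the counts from our sample.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Under H0 both groups share one rate, estimated from the pooled data
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: probability mass in both tails beyond |z|
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 120/1000 conversions vs. 150/1000 conversions
z, p = two_proportion_ztest(120, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The sign of z depends on which group is listed first, which is why the notebook reports a negative statistic when the control rate is the lower one.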


### Conclusion

We got a p-value of 0.732, which is way above our $\alpha = 0.05$, so we CANNOT reject the null hypothesis H<sub>0</sub>. This means that our <b><i>new design DID NOT perform significantly differently from the old one</i></b>. 

Additionally, looking at the confidence interval for the treatment group: [0.116, 0.135], we notice that:
1. It includes the baseline value 13%
2. But it does not include the target value 15% (the +2% that was aimed for)

What this means is that the true conversion rate of the new design is more likely to be similar to our baseline than to reach the 15% target we had hoped for. This is further evidence that our new design is not likely to be an improvement on our old design, and that unfortunately we are back to the drawing board!
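
The interval check can be reproduced by hand with the normal-approximation (Wald) interval, $\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, which we believe is the default method of `proportion_confint`. The sketch below plugs in the rounded treatment rate from the summary table (0.126), so the bounds may differ from the reported interval in the last digit. The helper name `wald_interval` is ours.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(p_hat, n, alpha=0.05):
    """Normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Rounded treatment conversion rate and per-group sample size from above
lower, upper = wald_interval(0.126, 4720)

print(f"[{lower:.3f}, {upper:.3f}]")
print("contains baseline 13%:", lower <= 0.13 <= upper)
print("contains target 15%:", lower <= 0.15 <= upper)
```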

<b><i>The End</i></b>