# A/B Testing

A/B testing is a user experience research methodology built on a randomized experiment that usually involves two variants (A and B). Two-sample hypothesis testing is the statistical method typically used to analyze the results.

A/B testing compares two or more versions of a single variable, for example by testing a subject's response to variant A against variant B and determining which variant is more effective. This is particularly useful for social media and other online products, where it helps measure how a new feature or design affects user engagement and satisfaction.

This analysis is inspired by the Towards Data Science tutorial [here](https://towardsdatascience.com/ab-testing-with-python-e5964dd66143).


## Business Problem
The setup for this analysis is described in the scenario below:

> Suppose that a company has updated the "Click Here to Sign-up" button on their website in an attempt to make it easier for users to see the button and sign up for their newsletter. The current sign-up (conversion) rate is about 12% on average throughout the year. With this new button, the team would like to see an increase of 3 percentage points (to about 15%) before rolling out the feature to all users.

Thus, the task at hand is to conduct an analysis to answer the following question:
> Does the new design perform differently than the old design? If so, does it perform better or worse?

To answer it, we run an A/B test on a random subset of the users and apply both a two-tailed and a one-tailed hypothesis test.


## Data 

The data for this analysis contains the results of an A/B test on 2 different designs of a website page (old_page vs. new_page). The dataset is available on [Kaggle](https://www.kaggle.com/datasets/zhangluyuan/ab-testing?select=ab_data.csv). 

Let us look at the first few rows of the dataset:

In [54]:
# load required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.stats.api as sms
from math import ceil
import scipy.stats as stats
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

In [None]:
# read in ab test data
file = 'ab_data.csv'
df = pd.read_csv(file)

# display the first few rows of the file
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1

Here we see that the dataset contains five variables, namely:


* **user_id:** unique user identifier for each session
* **timestamp:** timestamp including date and time of the session
* **group:** which group the user was assigned to for that session
* **landing_page:** which design the user was shown, the new or old page
* **converted:** whether the user converted (signed up) or not, binary

For our analysis, our ***independent variable*** will be the group that the user is assigned to:


*   **Control** - These users will be shown the old page.
*   **Treatment** - These users will be shown the new page.

Our ***dependent variable*** will be whether they converted (signed up) or not:


*   **0** - Did not sign up
*   **1** - Signed up





### EDA

Before we begin, it is important to do some exploratory data analysis to better understand the data and ensure that all the variables are in the proper format. 

In [55]:
print(df.shape)

(294478, 5)


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


There are 294,478 observations in the dataset, each representing a user session. There are two numerical variables in the dataframe, *user_id* and *converted*. We note that there are no missing values.

Looking at the timestamp variable, we can see when this data was collected.

In [57]:
# convert timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day

df.head()

# see when this data was collected
pd.crosstab(df['year'],df['month'])

month,1
year,Unnamed: 1_level_1
2017,294478


It appears as though these data were collected in January of 2017.

Next, let us look at some cross-tabulations. First, we want to check that all the users in the control group receive the old design, and that all the users in the treatment group receive the new design.

In [58]:
pd.crosstab(df['group'], df['landing_page'])

landing_page,new_page,old_page
group,Unnamed: 1_level_1,Unnamed: 2_level_1
control,1928,145274
treatment,145311,1965


Looking at the table above, we see that this is not the case. There appear to be 1,928 users in the control group who received the new page, and 1,965 users in the treatment group who received the old page. These will be handled in the data cleaning step below.

In [59]:
pd.crosstab(df['group'],df['converted'])

converted,0,1
group,Unnamed: 1_level_1,Unnamed: 2_level_1
control,129479,17723
treatment,129762,17514


Here we see that 17,723 users signed up in the control group, and 17,514 users signed up in the treatment group. We will calculate the conversion rate after we clean the data further. 

### Data Cleaning and Preparation
Next, we would like to clean and prepare the data for the analysis. Based on the EDA, we have a few cases that need to be handled:

1. Handle cases where the control group does not match the page shown
2. Check for duplicates
3. Select random subset


#### 1. Cases where the Control Group does not Match the Page Shown
First, let us review the cases where the control group does not match the page shown. For this analysis, we need the control group to see the old design, and the treatment group to see the new design.

In [60]:
## Because we can't be sure which page these users actually received, we opt to remove these rows.

# a row is mismatched when exactly one of "treatment group" / "new page" holds
mismatch = (df['group'] == 'treatment') != (df['landing_page'] == 'new_page')
df = df[~mismatch]

In [61]:
## Check
pd.crosstab(df['group'], df['landing_page'])

landing_page,new_page,old_page
group,Unnamed: 1_level_1,Unnamed: 2_level_1
control,0,145274
treatment,145311,0


#### 2. Check for Duplicates
Now that we have handled those cases, we can check for duplicates in the data.



In [62]:
## See if there are users that appear multiple times
visit_counts = df['user_id'].value_counts(ascending=False)
multi_users = visit_counts[visit_counts > 1].count()

print(f'There are {multi_users} user(s) that appear multiple times in the dataset')

There are 1 user(s) that appear multiple times in the dataset


There is 1 user that is duplicated in the data. Let us take a look at the entries. 

In [63]:
df[df.duplicated(['user_id'], keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted,year,month,day
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0,2017,1,9
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0,2017,1,14


Here we see that the same user visited the website twice, once on January 9 and again on January 14. Both times, they were shown the new design. Because we want each user to appear only once in the data, we drop this user's rows (the code below removes both entries) to ensure we are not sampling the same person twice.

In [64]:
users_to_drop = visit_counts[visit_counts > 1].index

df = df[~df['user_id'].isin(users_to_drop)]
print(f'The updated dataset now has {df.shape[0]} entries')

The updated dataset now has 290583 entries


#### 3. Select Random Subset
Finally, we need to select a random subset of the user base to use for our analysis. For this analysis, we draw the same number of observations for the control and treatment groups.

In order to calculate our required sample size, we must perform a power analysis. This relies on several factors, namely the power of the test, our alpha value, and the effect size. The power of the test is the probability of finding a statistically significant difference between the groups when a difference is actually present; by convention, a value of 0.8 is used. We set our alpha value at 0.05, which in turn means that our confidence level is 95%. Finally, because we are interested in a difference of 3 percentage points, as mentioned in the initial scenario, we can use 12% and 15% to calculate the effect size we expect.

In [65]:
## Calculate required sample size
## Code from tutorial
effect_size = sms.proportion_effectsize(0.12, 0.15)    # Calculating effect size based on our expected rates

required_n = sms.NormalIndPower().solve_power(
    effect_size, 
    power=0.8, 
    alpha=0.05, 
    ratio=1
    )                                                  # Calculating sample size needed

required_n = ceil(required_n)                          # Rounding up to next whole number                          

print(required_n)

2031


Based on the power analysis calculations, we need a minimum sample size of 2031 observations for each group. 
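
As a cross-check, the required sample size can be approximated by hand. The following is a minimal sketch using only the Python standard library. It assumes the usual normal-approximation formula $n = 2\left((z_{1-\alpha/2} + z_{\text{power}}) / h\right)^2$ with Cohen's $h$ as the effect size; this is an assumption about the calculation, not a copy of the statsmodels implementation, and the function name `approx_sample_size` is hypothetical.

```python
from math import asin, sqrt, ceil
from statistics import NormalDist


def approx_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    # Cohen's h: difference between the arcsine-transformed proportions
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    # equal group sizes are assumed, hence the factor of 2
    return ceil(2 * ((z_alpha + z_power) / h) ** 2)


required_n_check = approx_sample_size(0.12, 0.15)
print(required_n_check)
```

For the 12% and 15% rates used here, this approximation agrees with the 2031 observations per group reported above.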

Now that we know how many observations we need for each group, we can go ahead and select a simple random sample of size 2031 from each group to get our final dataset.

In [66]:
# create new dfs that have the same num of samples from control group as test group
control_sample = df[df['group'] == 'control'].sample(required_n, random_state=6)
treatment_sample = df[df['group'] == 'treatment'].sample(required_n, random_state=6)

In [71]:
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

ab_test.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,year,month,day
0,937543,2017-01-22 13:27:36.377738,control,old_page,1,2017,1,22
1,792951,2017-01-02 15:11:46.641852,control,old_page,0,2017,1,2
2,749418,2017-01-02 15:17:21.137629,control,old_page,0,2017,1,2
3,919364,2017-01-13 19:07:07.736603,control,old_page,0,2017,1,13
4,764985,2017-01-08 16:43:28.208056,control,old_page,0,2017,1,8


In [72]:
ab_test['group'].value_counts()

control      2031
treatment    2031
Name: group, dtype: int64

In [73]:
ab_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4062 entries, 0 to 4061
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   user_id       4062 non-null   int64         
 1   timestamp     4062 non-null   datetime64[ns]
 2   group         4062 non-null   object        
 3   landing_page  4062 non-null   object        
 4   converted     4062 non-null   int64         
 5   year          4062 non-null   int64         
 6   month         4062 non-null   int64         
 7   day           4062 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(2)
memory usage: 254.0+ KB


Now we have a cleaned and prepared data set with a total of 4062 observations, with 2031 observations from the control group, and 2031 observations from the treatment group. 

### Final Dataset

Let us take a look at our final dataset before we continue our analysis.

In [76]:
import scipy.stats as stats
conversion_rates = ab_test.groupby('group')['converted']

std_p = lambda x: np.std(x, ddof=0)              # Std. deviation of the proportion
se_p = lambda x: stats.sem(x, ddof=0)            # Std. error of the proportion (std / sqrt(n))

conversion_rates = conversion_rates.agg([np.mean, std_p, se_p])
conversion_rates.columns = ['conversion_rate', 'std_deviation', 'std_error']


conversion_rates.style.format('{:.3f}')

Unnamed: 0_level_0,conversion_rate,std_deviation,std_error
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
control,0.129,0.336,0.007
treatment,0.123,0.329,0.007


Here we see that the conversion rates for the control and treatment groups are very similar, at 12.9% and 12.3% respectively. The control group performed slightly better than the treatment group, by 0.6 percentage points.
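
For a binary outcome, the standard deviation and standard error in the table follow directly from the conversion rate. The sketch below illustrates the formulas $\sqrt{\hat{p}(1-\hat{p})}$ and $\sqrt{\hat{p}(1-\hat{p})/n}$ with the standard library only. The function name `proportion_spread` and the count of 263 conversions are assumptions made for illustration (the count is chosen to be consistent with the rounded 12.9% rate, not read from the data).

```python
from math import sqrt


def proportion_spread(conversions, n):
    """Return (rate, std_deviation, std_error) for n binary (0/1) outcomes."""
    rate = conversions / n
    std_dev = sqrt(rate * (1 - rate))  # population std. deviation (ddof=0) of 0/1 data
    std_err = std_dev / sqrt(n)        # std. error of the estimated proportion
    return rate, std_dev, std_err


# Assumed, illustrative count: 263 conversions among 2031 users
rate, std_dev, std_err = proportion_spread(263, 2031)
print(f'{rate:.3f} {std_dev:.3f} {std_err:.3f}')
```

Because both groups have the same size and similar rates, their standard errors are nearly identical, which is why the table shows the same rounded value for each.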

We are interested in knowing if this difference is **statistically significant**. In order to make a claim about this, we must perform some statistical tests. 

## Methods
For this analysis, we will examine two types of hypothesis tests, namely, a two-sided (two-tailed) hypothesis test and a one-sided (one-tailed) hypothesis test.

The two-sided hypothesis test will tell us if there is a significant difference between the new and the old design. This can be written as follows:
$$
H_0: p = p_0
$$
$$
H_1: p \ne p_0
$$
where $p$ and $p_0$ stand for the conversion rate of the new and old design, respectively.

This is helpful as we do not know if the new design will perform better than, worse than, or the same as our current design.

A one-sided hypothesis test will tell us if there is a significant difference in a certain direction between the new and old design. In other words, this test will enable us to examine whether the new button proves to be better or worse than the old design. These types of tests are used as superiority tests and are common in A/B testing.

The two one-sided hypothesis tests we will examine can be written as follows:
$$
H_0: p \leq p_0
$$
$$
H_1: p > p_0
$$

and

$$
H_0: p \geq p_0
$$
$$
H_1: p < p_0
$$

Here, the first test examines whether the new design performs better than the old design, and the second test examines the opposite, i.e., whether the old design outperforms the new design.
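
All three tests use the same z statistic and differ only in which tail of the standard normal distribution is counted. The sketch below shows that relationship with the standard library. The function name `p_values_from_z` and the value `z = 1.5` are hypothetical, and the sign convention assumed is $z = (\hat{p} - \hat{p}_0) / \text{SE}$, so a positive $z$ favors the new design.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return p-values for H1: p > p0, H1: p < p0 and H1: p != p0."""
    cdf = NormalDist().cdf(z)
    p_greater = 1 - cdf                        # upper tail: evidence for p > p0
    p_less = cdf                               # lower tail: evidence for p < p0
    p_two_sided = 2 * min(p_greater, p_less)   # both tails
    return p_greater, p_less, p_two_sided


# Hypothetical z statistic, used only to illustrate the three tails
p_greater, p_less, p_two_sided = p_values_from_z(1.5)
print(f'{p_greater:.3f} {p_less:.3f} {p_two_sided:.3f}')
```

Note that the two one-sided p-values always sum to 1, so at most one of the two directional tests can be significant for a given sample.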

In all cases, we can set our alpha ($\alpha$) value equal to $0.05$, setting our confidence level at $95\%$.



## Results

### Two-sided Hypothesis Test
First, let us perform the two-sided hypothesis test to see if there is a significant difference between the performance of the new and old designs.

In [79]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']
n_con = control_results.count() # number of control users
n_treat = treatment_results.count() # number of treatment users
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

z statistic: 0.61
p-value: 0.539
ci 95% for control group: [0.115, 0.144]
ci 95% for treatment group: [0.109, 0.137]


Here we see that the p-value for the two-sided test equals 0.539, which is greater than our $\alpha = 0.05$. This means that we **cannot** reject the null hypothesis that the conversion rates of the two designs are equal. In other words, the new design did not perform significantly differently from the old design.
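
To make the calculation behind this test concrete, here is a minimal sketch of a pooled two-proportion z-test written with the standard library only. The function name `two_proportion_ztest` is hypothetical, and the counts (263 and 250 conversions out of 2031 users each) are assumptions chosen to be consistent with the rounded conversion rates reported earlier; they are not read from the dataset.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided pooled z-test for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # under H0 both groups share one rate, so the variance uses the pooled estimate
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed, illustrative counts: control first, treatment second
z, p_value = two_proportion_ztest(263, 2031, 250, 2031)
print(f'z statistic: {z:.2f}')
print(f'p-value: {p_value:.3f}')
```

With these assumed counts, the sketch gives a z statistic and p-value that round to the 0.61 and 0.539 reported above. The sign of z depends on which group is listed first.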

We note that the 95% confidence interval for the treatment group is 10.9% to 13.7%. It includes our baseline conversion rate of 12% and does not include our target value of 15% (the 3 percentage point increase we were hoping to see). This suggests that the true conversion rate of the new design is close to our baseline and below the target.
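
The interval itself is a normal-approximation (Wald) interval, $\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below computes it with the standard library. The function name `wald_confint` is hypothetical, and the count of 250 conversions out of 2031 treatment users is an assumption chosen to match the rounded 12.3% rate, not a value read from the data.

```python
from math import sqrt
from statistics import NormalDist


def wald_confint(successes, n, alpha=0.05):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Assumed, illustrative count for the treatment group
lower, upper = wald_confint(250, 2031)
print(f'ci 95% for treatment group: [{lower:.3f}, {upper:.3f}]')
```

Checking whether 0.12 and 0.15 fall between `lower` and `upper` reproduces the comparison made in the paragraph above.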

This result alone may make the team reconsider using the new sign-up button, as there does not seem to be a difference. Note that this test does not answer the second part of our question: whether the new button performs better or worse than the old design. This is because the two-tailed test does not indicate a direction.

Thus, to determine if the new design has more converted users than the old design, we will need to run a one-tailed test.

### One-sided Hypothesis Tests 
First, let us test the alternative hypothesis that the new design attracts more sign-ups than the old design.

In [81]:
# alternative: the new design attracts more sign-ups than the old
# proportions_ztest compares the first sample against the second,
# so the treatment group is listed first to test treatment > control
successes_tc = [treatment_results.sum(), control_results.sum()]
nobs_tc = [n_treat, n_con]
z_stat, pval = proportions_ztest(successes_tc, nobs=nobs_tc, alternative='larger')

print(f'z statistic one-tailed test that treatment > control : {z_stat:.2f}')
print(f'p-value one-tailed test that treatment > control : {pval:.3f}')

z statistic one-tailed test that treatment > control : -0.61
p-value one-tailed test that treatment > control : 0.730


Here we see that the p-value for this test is 0.73, which is greater than our $\alpha = 0.05$. This means that we **cannot** reject the null hypothesis, so we cannot say that the new design attracts more sign-ups than the old design.

In [82]:
# alternative: the old design attracts more subscribers than the new
# (the new design is losing us subscribers)
z_stat, pval = proportions_ztest(successes_tc, nobs=nobs_tc, alternative='smaller')

print(f'z statistic one-tailed test that treatment < control : {z_stat:.2f}')
print(f'p-value one-tailed test that treatment < control : {pval:.3f}')

z statistic one-tailed test that treatment < control : -0.61
p-value one-tailed test that treatment < control : 0.270


Here we see that the p-value for this test is 0.27, which again is greater than our $\alpha = 0.05$. This means that we **cannot** reject the null hypothesis, so we cannot say that the old design attracts more subscribers than the new design or that the new design is making us lose users.

## Discussion

Our statistical tests did not find evidence that the new design does a better job than the old design of attracting customers to sign up for our newsletter. None of the three tests rejected its null hypothesis at $\alpha = 0.05$, and the 95% confidence interval for the new design's conversion rate excludes the 15% target, so the new design is not likely to deliver the improvement the team was hoping for. Thus, business leaders may consider whether to revert to the old design or to continue trying other solutions to attract more users to sign up.