# A/B Testing

This notebook walks through the process of setting up and analyzing an A/B test.

Inspired by the tutorial at: https://towardsdatascience.com/ab-testing-with-python-e5964dd66143

Dataset can be found at: https://www.kaggle.com/datasets/zhangluyuan/ab-testing?resource=download

In [7]:
# import basic packages
#!pip install pandas
import pandas as pd
import os
import numpy as np

# stats packages 
#!pip install scipy
import scipy.stats as stats
#!pip install statsmodels
import statsmodels.stats.api as sms
from statsmodels.stats.proportion import proportions_ztest

# data viz
#!pip install matplotlib
import matplotlib.pyplot as plt
#!pip install seaborn
import seaborn as sns

os.getcwd()

'C:\\Users\\18594\\AB_Testing'

In [8]:
df = pd.read_csv(os.path.join(os.getcwd(), 'ab_data.csv'))

In [9]:
df.head(50)

print(df.shape)

(294478, 5)


# Exploratory Data Analysis

In [10]:
# see how many users are in each group
df['group'].value_counts()

treatment    147276
control      147202
Name: group, dtype: int64

In [11]:
# see how many were converted from test and control groups
pd.crosstab(df['group'],df['converted'])

| group | converted = 0 | converted = 1 |
|---|---|---|
| control | 129479 | 17723 |
| treatment | 129762 | 17514 |


In [12]:
df.describe()

| | user_id | converted |
|---|---|---|
| count | 294478.0 | 294478.0 |
| mean | 787974.124733 | 0.119659 |
| std | 91210.823776 | 0.324563 |
| min | 630000.0 | 0.0 |
| 25% | 709032.25 | 0.0 |
| 50% | 787933.5 | 0.0 |
| 75% | 866911.75 | 0.0 |
| max | 945999.0 | 1.0 |


In [13]:
## How long was this test run for?
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day

df.head()

# see what months/years this test spans across
pd.crosstab(df['year'],df['month'])


| year | month = 1 |
|---|---|
| 2017 | 294478 |


This test was conducted in January 2017. Let's see how many observations were recorded on each day in January.

In [14]:
# count the observations recorded on each day of the month
pd.crosstab(df['day'],df['month'])

| day | month = 1 |
|---|---|
| 2 | 5783 |
| 3 | 13394 |
| 4 | 13284 |
| 5 | 13124 |
| 6 | 13528 |
| 7 | 13381 |
| 8 | 13564 |
| 9 | 13439 |
| 10 | 13523 |
| 11 | 13553 |
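The length of the test can also be read directly from the timestamps instead of from a crosstab. The sketch below uses a tiny made-up frame (`demo` and its values are illustrative, not the real data); on the notebook's `df` the same calls would be applied to `df['timestamp']`.

```python
import pandas as pd

# Made-up timestamps standing in for df['timestamp'] (illustrative only)
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-01-02 13:42:05",
        "2017-01-02 18:10:44",
        "2017-01-05 09:15:00",
        "2017-01-11 23:59:59",
    ])
})

start = demo["timestamp"].min()
end = demo["timestamp"].max()
n_days = (end.normalize() - start.normalize()).days + 1  # inclusive day count

# Observations per calendar date, in chronological order
per_day = demo["timestamp"].dt.date.value_counts().sort_index()

print(f"{start:%Y-%m-%d} to {end:%Y-%m-%d} ({n_days} days)")
print(per_day)
```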


# Define what Success looks like, as well as pre and post metrics.

In [15]:
df.loc[df['converted']==1, 'Success'] = "User converted"
df.loc[df['converted']==0, 'Success'] = "User did not convert"
df.head()

| | user_id | timestamp | group | landing_page | converted | year | month | day | Success |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 | 2017 | 1 | 21 | User did not convert |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 | 2017 | 1 | 12 | User did not convert |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 | 2017 | 1 | 11 | User did not convert |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 | 2017 | 1 | 8 | User did not convert |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 | 2017 | 1 | 21 | User converted |


### Pre-Test Success metric: Our baseline performance. 

How many control users clicked on the old version of the "Click Here to Subscribe" button on your site?

I am calculating this to use as our "normal" or "expected" performance. Keep in mind that this number comes from a sample. If you are applying this in a business setting, your business partners may be able to provide a metric for the typical performance of the population.

In [16]:
# keep only users in the control group (old page)
control = df[df['group']=='control']

# counts of converted (1) vs not converted (0) control users
control['converted'].value_counts()

# num of successes / total instances; the mean of a 0/1 column is exactly this ratio
pre_test_success = control['converted'].mean()
print(pre_test_success)

0.12039917935897611


### Post-Test Success metric: Our performance after the treatment is applied.

How many treatment users clicked on the new version of the "Click Here to Subscribe" button on your site?

In [17]:
# keep only users in the treatment group (new page)
treatment = df[df['group']=='treatment']

# counts of converted (1) vs not converted (0) treatment users
treatment['converted'].value_counts()

# num of successes / total instances; the mean of a 0/1 column is exactly this ratio
post_test_success = treatment['converted'].mean()
print(post_test_success)

0.11891957956489856
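Both rates can also be computed in one step by grouping on `group`. This is a minimal sketch on toy data with the same column names as the notebook's `df` (the values are made up); it relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas as pd

# Toy data with the same column names as the notebook's df (values are made up)
demo = pd.DataFrame({
    "group": ["control"] * 5 + ["treatment"] * 5,
    "converted": [1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the conversion rate
rates = demo.groupby("group")["converted"].mean()
print(rates)
```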


# Business Problem

Both pre and post conversion rates are around 12%, with the post-test rate actually looking slightly *worse* for the new design than for the old one.

Your business team is starting to worry that the new design is losing potential customers and wants to know whether to keep it or revert to the old button. To keep the new button, they would like to see at least a 3% increase in the subscriber conversion rate (treated in the next cell as 3 percentage points, i.e. from roughly 12% to 15%). Otherwise, they will return to the old design.

In [18]:
desired_effect = pre_test_success + 0.03
desired_effect

0.15039917935897612
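A "3% increase" can be read in two ways, and the choice changes the target considerably. The sketch below contrasts the absolute reading used in this notebook with a relative reading; the names `absolute_target` and `relative_target` are illustrative and do not appear elsewhere in the notebook.

```python
baseline = 0.12039917935897611  # control conversion rate printed earlier

absolute_target = baseline + 0.03   # +3 percentage points (what the notebook uses)
relative_target = baseline * 1.03   # +3% relative to the baseline

print(f"absolute: {absolute_target:.4f}")
print(f"relative: {relative_target:.4f}")
```

A relative target sits much closer to the baseline, so it would imply a smaller effect size and a much larger required sample.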

# Run the A/B test

# Test for duplicate users, and remove them if they appear more than once

Since we want to ensure that each observation is a unique user coming to the site, we want to remove any duplicate users from our df.


In [19]:
# see if there are users that appear multiple times
visit_counts = df['user_id'].value_counts(ascending=False)
multi_users = visit_counts[visit_counts > 1].count()

print(f'There are {multi_users} users that appear multiple times in the dataset')

There are 3894 users that appear multiple times in the dataset


In [20]:
users_to_drop = visit_counts[visit_counts > 1].index

df = df[~df['user_id'].isin(users_to_drop)]
print(f'The updated dataset now has {df.shape[0]} entries')

The updated dataset now has 286690 entries
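An alternative way to express the same filter is `Series.duplicated(keep=False)`, which flags every row belonging to a repeated `user_id` in a single pass. The sketch below runs on a small made-up frame (`demo` is illustrative only).

```python
import pandas as pd

# Made-up data: users 2 and 4 appear more than once
demo = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4, 4, 4],
    "group": ["control", "control", "treatment", "treatment",
              "control", "control", "treatment"],
})

# keep=False flags every row whose user_id occurs more than once
is_repeat = demo["user_id"].duplicated(keep=False)
deduped = demo[~is_repeat]

print(demo.loc[is_repeat, "user_id"].nunique(), "users appear more than once")
print(len(deduped), "rows remain")
```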


In [21]:
df.dtypes

user_id                  int64
timestamp       datetime64[ns]
group                   object
landing_page            object
converted                int64
year                     int64
month                    int64
day                      int64
Success                 object
dtype: object

# Two-tailed Hypothesis: The new button impacts website performance in attracting subscribers.

# Gathering the correct sample size

A/B testing generally requires many observations to get statistically valid results. Although this dataset is fairly large, we may not need to analyze all of it to answer our business question.

The code below calculates the minimum number of observations needed in each group (control and treatment) before performing the analysis. This can help us speed up processing and run a less computationally expensive test.

In [22]:
# effect size (Cohen's h) between the baseline rate and the rate the business wants to see
effect_size = sms.proportion_effectsize(pre_test_success, desired_effect)
effect_size

-0.08780542373591216
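For two proportions, this effect size is Cohen's h, defined as `h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2))`. As a sanity check, the sketch below recomputes it by hand from the two rates printed earlier, using only the standard library; the result should agree with the value above.

```python
import math

p_control = 0.12039917935897611  # pre_test_success from the notebook
p_target = 0.15039917935897612   # desired_effect from the notebook

# Cohen's h: difference of arcsine-transformed proportions
h = 2 * math.asin(math.sqrt(p_control)) - 2 * math.asin(math.sqrt(p_target))
print(round(h, 4))
```

The sign is negative only because the baseline is passed first; the power calculation depends on the magnitude.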

In [23]:
# https://www.statsmodels.org/dev/generated/statsmodels.stats.power.NormalIndPower.solve_power.html
# https://towardsdatascience.com/ab-testing-with-python-e5964dd66143
# Power is the probability that the test correctly rejects the Null Hypothesis
# if the Alternative Hypothesis is true. A typical default value is 0.8,
# i.e. an 80% chance that the test correctly rejects the null hypothesis.
required_n = sms.NormalIndPower().solve_power(
    effect_size,
    power=0.8,
    alpha=0.05,
    ratio=1
    )

In [24]:
# minimum number of users from each group that need to be sampled
# in order to detect a true difference of the desired size between control and treatment.
# tip: you can edit the power and alpha levels above if you want more/less precision behind your results
required_n = np.ceil(required_n)
print(required_n)

2037.0
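The same number can be approximated with the textbook normal-approximation formula for two equal-sized groups, `n = 2 * ((z_(1-alpha/2) + z_power) / h)^2`. This is a sketch under that approximation, not a reproduction of what `solve_power` does internally; it uses only the standard library.

```python
import math
from statistics import NormalDist

h = -0.08780542373591216  # effect size printed earlier
alpha, power = 0.05, 0.8

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = NormalDist().inv_cdf(power)

# Per-group n for two equal-sized groups (normal approximation)
n_per_group = math.ceil(2 * ((z_alpha + z_power) / h) ** 2)
print(n_per_group)
```

Because `h` enters squared, halving the detectable effect roughly quadruples the required sample.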


# Balance the data frame that we are testing

The sample-size calculation above assumed equally sized groups (`ratio=1`), so the control and treatment groups should contain the same number of observations.

We can make sure each group is represented by randomly sampling 2037 users (our minimum sample size) from each group and putting them into a new df.
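One possible way to draw an equal-sized random sample from each experiment group is `groupby(...).sample(...)`, available in pandas 1.1 and later. The sketch below uses a small made-up frame, and `n_per_group` is a stand-in for `required_n`.

```python
import pandas as pd

# Made-up data with unequal group sizes (12 control, 8 treatment)
demo = pd.DataFrame({
    "user_id": range(20),
    "group": ["control"] * 12 + ["treatment"] * 8,
    "converted": [0, 1] * 10,
})

n_per_group = 5  # stand-in for required_n

# Draw the same number of rows from each experiment group
balanced = (
    demo.groupby("group")
        .sample(n=n_per_group, random_state=22)
        .reset_index(drop=True)
)
print(balanced["group"].value_counts())
```

Sampling is done on `group`, never on the outcome column, so the conversion rate inside each group stays an unbiased estimate.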

In [25]:
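# create new df that has the same num of samples from control group as test group
required_n = int(required_n)
print(required_n)

# Sample by experiment group, not by outcome: filtering on `converted` would build
# a sample of 2037 non-converted and 2037 converted users and invalidate the test.
# Note: the outputs recorded below came from a run that sampled on `converted`;
# re-run the notebook to refresh them.
control_sample = df[df['group'] == 'control'].sample(n=required_n, random_state=22)
treatment_sample = df[df['group'] == 'treatment'].sample(n=required_n, random_state=22)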



2037


In [26]:
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

ab_test.head()

| | user_id | timestamp | group | landing_page | converted | year | month | day | Success |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 733022 | 2017-01-23 15:19:30.625626 | treatment | new_page | 0 | 2017 | 1 | 23 | User did not convert |
| 1 | 907867 | 2017-01-15 11:05:07.146881 | treatment | new_page | 0 | 2017 | 1 | 15 | User did not convert |
| 2 | 637065 | 2017-01-06 01:27:52.026002 | treatment | new_page | 0 | 2017 | 1 | 6 | User did not convert |
| 3 | 753838 | 2017-01-10 07:41:29.024607 | control | old_page | 0 | 2017 | 1 | 10 | User did not convert |
| 4 | 861583 | 2017-01-20 16:21:08.793702 | treatment | new_page | 0 | 2017 | 1 | 20 | User did not convert |


In [27]:
ab_test.shape

(4074, 9)

In [28]:
# check how many converted vs non-converted users ended up in the sample
ab_test['converted'].value_counts()

0    2037
1    2037
Name: converted, dtype: int64

In [29]:
ab_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4074 entries, 0 to 4073
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   user_id       4074 non-null   int64         
 1   timestamp     4074 non-null   datetime64[ns]
 2   group         4074 non-null   object        
 3   landing_page  4074 non-null   object        
 4   converted     4074 non-null   int64         
 5   year          4074 non-null   int64         
 6   month         4074 non-null   int64         
 7   day           4074 non-null   int64         
 8   Success       4074 non-null   object        
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 286.6+ KB


## Test to see if new design performed differently from the old design

In [30]:
control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']
n_con = control_results.count() # number of control users
n_treat = treatment_results.count() # number of treatment users
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)

print(f'z statistic two-tailed test: {z_stat:.2f}')
print(f'p-value two-tailed test: {pval:.3f}')


z statistic: -0.56
p-value: 0.573


# Results from two-tailed test:

Because the p-value of the two-tailed test (0.573) is greater than our alpha level of 0.05, we fail to reject the null hypothesis: we cannot conclude that there is a statistically significant difference in performance between the new and old designs. Note that this test checks for any difference at all, not specifically for the 3% lift the business leaders would have liked to see. These results alone may make our team reconsider whether to keep using the new subscriber button.

But this test does not answer the second part of the business leaders' question: whether the new button is actually worse than the old design. A two-tailed test only asks whether the two groups differ, not in which direction. To determine whether the new design converts more (or fewer) users than the old design, we need to run one-tailed tests.
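To make the mechanics of the two-proportion z-test concrete, the sketch below computes the pooled z statistic by hand with the standard library. As an assumption for illustration, it uses the full-dataset counts from the crosstab near the top of the notebook (before duplicate users were removed), so its result is not expected to match the sampled test above.

```python
import math
from statistics import NormalDist

# Counts from the crosstab near the top of the notebook (before de-duplication)
conv_control, n_control = 17723, 129479 + 17723
conv_treat, n_treat = 17514, 129762 + 17514

p_control = conv_control / n_control
p_treat = conv_treat / n_treat
p_pooled = (conv_control + conv_treat) / (n_control + n_treat)

# Standard error under the null hypothesis of one shared conversion rate
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treat))
z = (p_control - p_treat) / se

p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
```

The statistic is the difference in rates divided by its standard error, which is why the same difference can be significant in a large sample and not in a small one.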

# One-tailed Hypothesis: The new button is better than the old button at attracting subscribers.

# One-tailed test

In [36]:
# alternative: control < treatment 
# alternative: the new design attracts more subscribers than the old
z_stat, pval = proportions_ztest(successes, nobs=nobs, alternative='smaller')

print(f'z statistic one-tailed test that control < treatment : {z_stat:.2f}')
print(f'p-value one-tailed test that  control < treatment : {pval:.3f}')
print('')

# alternative: control > treatment 
# alternative: the old design attracts more subscribers than the new
# in other words, the new design is losing us subscribers
z_stat, pval = proportions_ztest(successes, nobs=nobs, alternative='larger')

print(f'z statistic one-tailed test that control > treatment : {z_stat:.2f}')
print(f'p-value one-tailed test that  control > treatment : {pval:.3f}')


z statistic one-tailed test that control < treatment : -0.56
p-value one-tailed test that  control < treatment : 0.286

z statistic one-tailed test that control > treatment : -0.56
p-value one-tailed test that  control > treatment : 0.714


# Results from one-tailed test:

Because the p-value of the one-tailed test with the alternative control < treatment (0.286) is greater than our alpha level of 0.05, we cannot conclude that the new button attracts subscribers better than the old button. So, we have no evidence that the new design is gaining us new subscribers.

Likewise, because the p-value of the one-tailed test with the alternative control > treatment (0.714) is greater than our alpha level of 0.05, we cannot conclude that the new button deters subscribers compared to the old button. This is reassuring for the business leaders: the data give no evidence that the new design is losing subscribers, as they had feared. Keep in mind that failing to reject a null hypothesis is not proof that there is no effect.
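The two one-tailed p-values and the two-tailed p-value all come from the same z statistic, which is why the one-tailed values sum to 1. The sketch below rebuilds them from the rounded `z = -0.56`, so the figures come out close to, but not identical to, the ones reported above.

```python
from statistics import NormalDist

z = -0.56  # rounded z statistic reported above

p_smaller = NormalDist().cdf(z)       # alternative: control < treatment
p_larger = 1 - NormalDist().cdf(z)    # alternative: control > treatment
p_two_sided = 2 * min(p_smaller, p_larger)

print(f"{p_smaller:.3f} {p_larger:.3f} {p_two_sided:.3f}")
```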

# Overall Results

Our statistical tests did not find evidence that the new design does a better job than the old design at converting visitors into subscribers. There is also no evidence that the new design is losing us customers, so business leaders may consider whether to revert to the old design or keep trying other ways to attract more subscribers.
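A confidence interval for the lift speaks more directly to the business threshold than a p-value does. The sketch below builds a 95% Wald interval for the difference in conversion rates; as an assumption for illustration, it uses the full-dataset counts from the crosstab near the top of the notebook (before de-duplication) and reads the required lift as 3 percentage points.

```python
import math
from statistics import NormalDist

# Full-dataset counts from the early crosstab (before de-duplication)
conv_control, n_control = 17723, 147202
conv_treat, n_treat = 17514, 147276
min_lift = 0.03  # business threshold, read as percentage points

p_c = conv_control / n_control
p_t = conv_treat / n_treat
diff = p_t - p_c  # lift of treatment over control

# Unpooled standard error of the difference in proportions
se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z_crit * se, diff + z_crit * se

print(f"lift: {diff:+.4f}, 95% CI: [{ci_low:+.4f}, {ci_high:+.4f}]")
print("CI reaches the required lift:", ci_high >= min_lift)
```

If the upper end of the interval falls short of the required lift, the data are hard to reconcile with the new design meeting the business target.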