# A/B Testing

This notebook goes through the processes of setting up an A/B test. 

Inspired by the tutorial at: https://towardsdatascience.com/ab-testing-with-python-e5964dd66143

Dataset can be found at: https://www.kaggle.com/datasets/zhangluyuan/ab-testing?resource=download

In [27]:
# import basic packages
#!pip install pandas
import pandas as pd
import os
import numpy as np

# stats packages 
import scipy.stats as stats
#!pip install statsmodels
import statsmodels.stats.api as sms
from statsmodels.stats.proportion import proportions_ztest

# data viz
import matplotlib.pyplot as plt
#!pip install seaborn
import seaborn as sns

os.getcwd()

'/Users/kxr6264/Documents/Taco_Bell_KDS_Analysis'

In [4]:
df = pd.read_csv('/Users/kxr6264/Documents/Taco_Bell_KDS_Analysis/ab_data.csv')

In [5]:
df.head(50)

print(df.shape)

(294478, 5)


# Exploratory Data Analysis

In [6]:
# see how many users are in each group
df['group'].value_counts()

treatment    147276
control      147202
Name: group, dtype: int64

In [7]:
# see how many were converted from test and control groups
pd.crosstab(df['group'],df['converted'])

converted,0,1
group,Unnamed: 1_level_1,Unnamed: 2_level_1
control,129479,17723
treatment,129762,17514


In [9]:
df.describe()

Unnamed: 0,user_id,converted
count,294478.0,294478.0
mean,787974.124733,0.119659
std,91210.823776,0.324563
min,630000.0,0.0
25%,709032.25,0.0
50%,787933.5,0.0
75%,866911.75,0.0
max,945999.0,1.0


In [29]:
## How long was this test run for?
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day

df.head()

# see what months/years this test spans across
pd.crosstab(df['year'],df['month'])


month,1
year,Unnamed: 1_level_1
2017,294478


This test was conducted in January 2017. Let's see how many clicks occurred on each date in January.

In [31]:
# see what months/years this test spans across
pd.crosstab(df['day'],df['month'])

month,1
day,Unnamed: 1_level_1
2,5783
3,13394
4,13284
5,13124
6,13528
7,13381
8,13564
9,13439
10,13523
11,13553


# Define what Success looks like, as well as pre and post metrics.

In [10]:
#df['Success'] = np.where(df['Post SOS']<df['Pre SOS'],1,0)
df.loc[df['converted']==1, 'Success'] = "User converted"
df.loc[df['converted']==0, 'Success'] = "User did not convert"
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,Success
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,User did not convert
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,User did not convert
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,User did not convert
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,User did not convert
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,User converted


### Pre-Test Success metric: Our baseline performance. 

How many control users clicked on the old version of the "Click Here to Subscribe" button on your site?

I am calculating this to use as the number for our "normal" or "expected" performance. Keep in mind that this is number is coming from a sample - if you are applying this in a business setting, your business partners may be able to provide a metric for what the typical performance of the population is.

In [17]:
# separate out only non-converted users
not_conv = df[df['group']=='control']

# percentage of time SOS was reduced vs not
not_conv['converted'].value_counts()

# take num of successes / total instances
pre_test_success = not_conv['converted'].value_counts()[1] / (not_conv['converted'].value_counts()[1] + not_conv['converted'].value_counts()[0])
print(pre_test_success)

0.12039917935897611


### Post-Test Success metric: Our performance after the treatment is applied.

How many treatment users clicked on the new version of the "Click Here to Subscribe" button on your site?

In [18]:
# separate out only non-converted users
conv = df[df['group']=='treatment']

# percentage of time SOS was reduced vs not
conv['converted'].value_counts()

# take num of successes / total instances
post_test_success = conv['converted'].value_counts()[1] / (conv['converted'].value_counts()[1] + conv['converted'].value_counts()[0])
print(post_test_success)

0.11891957956489856


# Business Problem

Both pre and post conversion rates are around 12%, with post test metrics actually seeming to be a little *worse* with the new design than the old design. 

Your business team is starting to worry that this new design is losing potential customers, wants to know if they should keep this design or revert back to the old button. They say that in order to keep the new button, they would like to see a 3% increase in subscribers being attracted. Otherwise, they will return back to the old design.

In [48]:
desired_effect = pre_test_success + 0.03
desired_effect

0.15039917935897612

# Run the A/B test

# Test for duplicates users, and remove them if they appear more than once

Since we want to ensure that each observation is a unique user coming to the site, we want to remove any duplicate users from our df.


In [33]:
# see if there are users that appear mulitple times
visit_counts = df['user_id'].value_counts(ascending=False)
multi_users = visit_counts[visit_counts > 1].count()

print(f'There are {multi_users} users that appear multiple times in the dataset')

There are 3894 users that appear multiple times in the dataset


In [36]:
users_to_drop = visit_counts[visit_counts > 1].index

df = df[~df['user_id'].isin(users_to_drop)]
print(f'The updated dataset now has {df.shape[0]} entries')

The updated dataset now has 286690 entries


In [37]:
df.dtypes

user_id                  int64
timestamp       datetime64[ns]
group                   object
landing_page            object
converted                int64
Success                 object
year                     int64
month                    int64
day                      int64
dtype: object

# Two-tailed Hypothesis: The new button impacts website performance in attracting subscribers.

# Gathering the correct sample size

A/B testing generally requires many observations to get statistically valid results. But, although we have a pretty good amount of data in this dataset, we still may not need to analyze all of it to answer our business question.

The code below calculates the minimumm number of observations in each group (control and treatment) that we need to collect before performaing the analysis. This can help us to speed up processing times and run a less computationally-expensive test.

In [49]:
# # # article eff size - same as my calc value?
#effect_size_article = sms.proportion_effectsize(pre_treatment_success,post_treatment_success)
effect_size= sms.proportion_effectsize(pre_test_success,desired_effect) # see if there is an effect between expected pre and post amounts
effect_size

-0.08780542373591216

In [50]:
# https://www.statsmodels.org/dev/generated/statsmodels.stats.power.NormalIndPower.solve_power.html
# https://towardsdatascience.com/ab-testing-with-python-e5964dd66143
required_n = sms.NormalIndPower().solve_power(
    effect_size, 
    power=0.8, # Power is the probability that the test correctly rejects the Null Hypothesis if the Alternative Hypothesis is true. Typical default value is 0.8, saying that there is an 80% chance that the test correctly rejects the null hypothesis.
    alpha=0.05, 
    ratio=1
    )               

In [51]:
# minimum number of stores from each group that need to be sampled 
# in order to detect if there is a true difference between our pre and post test.
# tip: you can edit the power and alpha levels above if you want more/less precision behind your results
required_n = np.ceil(required_n)
print(required_n)

2037.0



# Rule of Thumb for interpreting Cohen's d:
#### https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/cohens-d/

Use these general “rule of thumb” guidelines (which Cohen said should be used cautiously):

 - Small effect = 0.2
 - Medium Effect = 0.5
 - Large Effect = 0.8
 
“Small” effects are difficult to see with the naked eye. For example, Cohen reported that the height difference 
between 15-year-old and 16-year-old girls in the US is about this effect size. “Medium” is probably big enough to 
be discerned with the naked eye, while effects that are “large” can definitely be seen with the naked eye 
(Cohen calls this “grossly perceptible and therefore large”). For example, the difference in heights between 
13-year-old and 18-year-old girls is 0.8. An effect under 0.2 can be considered trivial, even if your results 
are statistically significant.

Bear in mind that a “large” effect isn’t necessarily better than a “small” effect, especially in settings where 
small differences can have a major impact. For example, an increase in academic scores or health grades by an effect 
size of just 0.1 can be very significant in the real world. Durlak (2009) suggests referring to prior research in 
order to get an idea of where your findings fit into the bigger context.

In [None]:
# # calculate effect size using cohen's d for equal group sizes. concerned we may not have enough samples. 
# # check here for unequal group sizes: https://toptipbio.com/cohens-d/
# # formula here: https://www.researchgate.net/figure/Formula-for-Cohens-d_fig1_286089628
# # business users state that they expect TKDS to save roughly 2-3 seconds on each transaction than non-TKDS.

# # calculate avg for pre and post of all stores we have, then find the difference between those values.
# # this number is the top of you
# m1 = df['Pre SOS'].mean()
# m2 = df['Post SOS'].mean()
# diff_in_means = m2-m1  # this number is the numerator of your cohen's d formula

# # calculate the sample std dev of each group and square it. add those two values together then divide by two
# m1sd_squared = df['Pre SOS'].std() * df['Pre SOS'].std() 
# m2sd_squared = df['Post SOS'].std() * df['Post SOS'].std()
# sum_squares = m1sd_squared + m2sd_squared
# div_by_2 = sum_squares / 2

# # lastly, take the radical of your resulting number above
# bottom_half_cohens_d_formula = np.sqrt(div_by_2)  # this number is the denominator of your cohen's d formula

# cohens_d = diff_in_means / bottom_half_cohens_d_formula 

# print(cohens_d)


In [None]:
# # https://www.statsmodels.org/dev/generated/statsmodels.stats.power.NormalIndPower.solve_power.html
# # https://towardsdatascience.com/ab-testing-with-python-e5964dd66143
# required_n = sms.NormalIndPower().solve_power(
#     cohens_d, 
#     power=0.8, # Power is the probability that the test correctly rejects the Null Hypothesis if the Alternative Hypothesis is true.
#     alpha=0.05, 
#     ratio=1
#     )     

In [None]:
# # minimum number of stores from each group that need to be sampled for an effective analysis
# required_n = np.ceil(required_n)
# print(required_n)

# Balance the data frame that we are testing

#### The control and test groups must have the same number of ns. Here tkds1s have 424 observations, and tkds0s have 326.
#### We can make sure each group is represented by randomly sampling 294 (our min sample size) from each group and putting these values into a new df

In [52]:
# create new df that have the same num of samples from control group as test group
required_n = required_n.astype(np.int64)
print(required_n)

control_sample = df[df['converted'] == 0].sample(required_n, random_state=22)
treatment_sample = df[df['converted'] == 1].sample(required_n, random_state=22)



2037


In [53]:
ab_test = pd.concat([control_sample, treatment_sample], axis=0)
ab_test.reset_index(drop=True, inplace=True)

ab_test.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,Success,year,month,day
0,733022,2017-01-23 15:19:30.625626,treatment,new_page,0,User did not convert,2017,1,23
1,907867,2017-01-15 11:05:07.146881,treatment,new_page,0,User did not convert,2017,1,15
2,637065,2017-01-06 01:27:52.026002,treatment,new_page,0,User did not convert,2017,1,6
3,753838,2017-01-10 07:41:29.024607,control,old_page,0,User did not convert,2017,1,10
4,861583,2017-01-20 16:21:08.793702,treatment,new_page,0,User did not convert,2017,1,20


In [54]:
ab_test.shape

(4074, 9)

In [55]:
# make sure control and test groups are equally spread
ab_test['converted'].value_counts()

0    2037
1    2037
Name: converted, dtype: int64

In [56]:
ab_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4074 entries, 0 to 4073
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   user_id       4074 non-null   int64         
 1   timestamp     4074 non-null   datetime64[ns]
 2   group         4074 non-null   object        
 3   landing_page  4074 non-null   object        
 4   converted     4074 non-null   int64         
 5   Success       4074 non-null   object        
 6   year          4074 non-null   int64         
 7   month         4074 non-null   int64         
 8   day           4074 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 286.6+ KB


## Test to see if new design performed differently from the old design

In [57]:
control_results = ab_test[ab_test['group'] == 'control']['converted']
treatment_results = ab_test[ab_test['group'] == 'treatment']['converted']
n_con = control_results.count() # number of converted users
n_treat = treatment_results.count() # number of treatment users
successes = [control_results.sum(), treatment_results.sum()]
nobs = [n_con, n_treat]

z_stat, pval = proportions_ztest(successes, nobs=nobs)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')


z statistic: -0.56
p-value: 0.573
ci 95% for control group: [0.474, 0.517]
ci 95% for treatment group: [0.483, 0.526]


# Results from two-tailed test:

Because our p-value > our aplha level of 0.05, we cannot conclude that there is a statistically difference in performance betwewn the new and old designs at the level that the business leaders would have liked to see. These results alone may make our team reconsider if they'd like to continue to use the new subscriber button or not.

But, this test does not tell us the second part of the business leader's question - which is if this new button proves to be worse than the old design or not. The two-tailed test did not provide any sort of direction.. To get those results, we will need to run a one-tailed test.