# A/B Testing

## AKA Applied Hypothesis Testing!

If you went through all the stats up to this point and thought "oh man when am I ever going to use this stuff" - I get it. But one of the most common ways that Hypothesis Testing techniques are used in the real world is through A/B Testing!

One of the most common places you see A/B Testing out in the world is in marketing - companies will run A/B tests on elements of their website, their emails, their calls to action, etc. While you see A/B testing in other places, marketing is going to be my example lens for today's session.

### A/B Testing in Marketing

Hubspot is a marketing software company, and I'm going to use some of their resources to set up why all this matters. You can access the A/B Testing Kit they put out for optimizing marketing at this link: https://drive.google.com/drive/folders/1Wk3J2nA5gguN1Y_41cACxQ9mcJls9TmI

Hubspot's definition of split testing, aka A/B testing:

> Split testing, commonly referred to as A/B testing, is a method of testing through which marketing variables (such as copy, images, layout, etc) are compared to each other to identify the one that brings a better conversion rate. In this context, the element that is being tested is called the “control” and the element that is argued to give a better result is called the “treatment.”

#### Hubspot's 10 Guidelines for Effective A/B Testing: 

1. Only conduct one test (on one asset) at a time
2. Test one variable at a time
3. Test minor changes, too
4. You can A/B test the entire element
5. Measure as far down funnel as possible
6. Set up control & treatment
7. Decide what you want to test
8. Split your sample group randomly 
9. Test at the same time
10. Decide on necessary significance before testing
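
Guideline 8 is the one that's easiest to get wrong in practice. Here's a minimal sketch of a random split, assuming we have a list of user IDs up front. The helper `assign_groups` and the seed are made up for illustration; they aren't part of Hubspot's kit or the dataset used below:

```python
import random
from collections import Counter

def assign_groups(user_ids, seed=42):
    """Randomly assign each user to 'control' or 'treatment' (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    return {user_id: rng.choice(["control", "treatment"]) for user_id in user_ids}

assignments = assign_groups(range(10_000))
group_counts = Counter(assignments.values())
print(group_counts)  # should be close to a 50/50 split
```

Because each assignment is an independent coin flip, the groups won't be exactly equal in size, which is also what we'll see in the real data.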

### What will the data look like?

Data source: https://www.kaggle.com/zhangluyuan/ab-testing

Unfortunately, this data has no real metadata associated with it, but the author did say the data comes from an e-commerce website.

Full credit to Robbie Geoghegan, now a Data Scientist at Facebook, for giving me the idea and sharing work they did on this dataset: https://medium.com/@robbiegeoghegan/implementing-a-b-tests-in-python-514e9eb5b3a1 

Another blog I referenced: https://medium.com/@RenatoFillinich/ab-testing-with-python-e5964dd66143

Before we go any further, and ideally before a test like this is ever run, we need to decide on our significance level. Beyond that, let's assume that the group who ran this test did it properly (ran the variants in parallel, split users randomly, etc.).

Significance Level: $\alpha = .05$
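
It also helps to write down the hypotheses before looking at any data. Assuming a two-sided test on conversion rates (which, as far as I know, matches the default of the `proportions_ztest` function imported in the next cell), and using $p_c$ and $p_t$ as my own notation for the control and treatment conversion rates:

$$
H_0: p_c = p_t \qquad \text{vs.} \qquad H_1: p_c \neq p_t
$$

We reject $H_0$ only if the p-value from the test comes in below $\alpha$.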

In [1]:
# Imports
import pandas as pd
import numpy as np

from scipy import stats

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize, proportions_ztest

In [2]:
# Grab our data - want the column 'timestamp' to be a datetime object
df = pd.read_csv('data/ab_data.csv', parse_dates=['timestamp'])

df.head()

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null datetime64[ns]
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 11.2+ MB


In [4]:
# Check our timeframe
print(df['timestamp'].min())
print(df['timestamp'].max())
# ran test for 22 days nearly exactly

2017-01-02 13:42:05.378582
2017-01-24 13:41:54.460509


#### There's an issue...

In [5]:
df.loc[df['group'] == 'control']['landing_page'].value_counts()

old_page    145274
new_page      1928
Name: landing_page, dtype: int64

In [6]:
df.loc[df['group'] == 'treatment']['landing_page'].value_counts()

new_page    145311
old_page      1965
Name: landing_page, dtype: int64

In [7]:
# For now, let's drop those rows - 3,893 in total (about 1.3% of original rows)
# First problem
to_drop_1 = df.loc[(df['group'] == 'control') & (df['landing_page'] == 'new_page')].index
# Sanity check
print(len(to_drop_1))

# Second problem
to_drop_2 = df.loc[(df['group'] == 'treatment') & (df['landing_page'] == 'old_page')].index
# Sanity check
print(len(to_drop_2))

joined_drop_list = [*to_drop_1, *to_drop_2] # unpacks both indexes into one list using star notation
print(len(joined_drop_list) == (len(to_drop_1) + len(to_drop_2)))

1928
1965
True


In [8]:
# Drop both
df = df.drop(labels=joined_drop_list)
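
Another way to see the same mismatch at a glance is a cross-tabulation of `group` against `landing_page`. Here's a small sketch on made-up toy rows (not the real dataset), which also shows filtering with a boolean mask instead of collecting index labels:

```python
import pandas as pd

# Toy data for illustration only - NOT the real ab_data.csv
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "landing_page": ["old_page", "old_page", "old_page", "new_page",
                     "new_page", "new_page", "new_page", "old_page"],
})

# The control/new_page and treatment/old_page cells are the mismatches
counts = pd.crosstab(toy["group"], toy["landing_page"])
print(counts)

# A row is consistent when control saw old_page or treatment saw new_page
is_control = toy["group"] == "control"
saw_old = toy["landing_page"] == "old_page"
toy_clean = toy.loc[is_control == saw_old]
print(len(toy), "->", len(toy_clean))
```

Either approach should leave you with the same rows; the mask version just avoids building a list of labels first.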

#### One more thing to check...

In [9]:
df['user_id'].duplicated().sum()

1

In [10]:
# check the one duplicated user
df.loc[df['user_id'].duplicated(keep=False)]

|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 1899 | 773192 | 2017-01-09 05:37:58.781806 | treatment | new_page | 0 |
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


In [11]:
# keep their first visit and drop the later one (drop_duplicates keeps the first occurrence by default)
df = df.drop_duplicates(subset=['user_id'])

#### Now, let's explore:

In [12]:
# Split out our two groups
control_group = df.loc[df['group'] == 'control']
treat_group = df.loc[df['group'] == 'treatment']

In [13]:
# Check the number of samples, timeframe and conv % for each group
for sub_df in [control_group, treat_group]:
    name = list(sub_df['group'])[0].title()
    print(f"Number of Samples in our {name} Group: {len(sub_df):,}")
    print(f"Timeframe: {sub_df['timestamp'].min()} - {sub_df['timestamp'].max()}")
    print(f"Number of Conversions in our {name} Group: {sub_df['converted'].sum():,}")
    print(f"Conversion % in our {name} Group: {sub_df['converted'].mean() * 100:.3f}%")
    print("*"*20)

Number of Samples in our Control Group: 145,274
Timeframe: 2017-01-02 13:42:15.234051 - 2017-01-24 13:41:54.460509
Number of Conversions in our Control Group: 17,489
Conversion % in our Control Group: 12.039%
********************
Number of Samples in our Treatment Group: 145,310
Timeframe: 2017-01-02 13:42:05.378582 - 2017-01-24 13:41:44.097174
Number of Conversions in our Treatment Group: 17,264
Conversion % in our Treatment Group: 11.881%
********************


Our friend at Facebook, whose [blog](https://medium.com/@robbiegeoghegan/implementing-a-b-tests-in-python-514e9eb5b3a1) and [code](https://github.com/RobbieGeoghegan/AB_Testing/blob/master/AB_Testing.ipynb) inspired this notebook, uses two things you can determine in advance to calculate effect size:

> Baseline rate — an estimate of the metric being analyzed before making any changes
>
> Practical significance level — the minimum change to the baseline rate that is useful to the business, for example an increase in the conversion rate of 0.001% may not be worth the effort required to make the change, whereas a 2% change will be

In other words, you can decide on the minimum amount of change you want to be able to detect between your two groups and use that to calculate effect size. This is different from calculating effect size after the study has been conducted, which isn't ideal.

To do this with statsmodels, since we're doing a test on a proportion, we use: https://www.statsmodels.org/stable/generated/statsmodels.stats.proportion.proportion_effectsize.html
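
For intuition about what that function returns: per the statsmodels docs, `proportion_effectsize` computes Cohen's h, the difference between the arcsine-transformed proportions. Here's a hand-rolled sketch (the helper `cohens_h` is my own name, and the 0.01 minimum change is just an illustrative choice):

```python
import math

def cohens_h(prop1, prop2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * (math.asin(math.sqrt(prop1)) - math.asin(math.sqrt(prop2)))

baseline = 0.12039     # control conversion rate printed above
minimum_lift = 0.01    # illustrative: one percentage point
h = cohens_h(baseline, baseline + minimum_lift)

# h comes out negative because the second proportion is larger;
# for a two-sided power calculation only its magnitude matters
print(f"Cohen's h: {h:.4f}")
```

The arcsine transform is there so that the same absolute gap counts for more near 0 or 1 than it does near 0.5, where proportions are noisier.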

In [14]:
# grabbing some useful variables, going ahead and doing for both groups
num_converted_control = control_group['converted'].sum()
samples_control = len(control_group)
conv_rate_control = num_converted_control / samples_control

num_converted_treat = treat_group['converted'].sum()
samples_treat = len(treat_group)
conv_rate_treat = num_converted_treat / samples_treat

In [15]:
# baseline is what we expect given what we have
# here, we'll capture that with our percentage of conversions 
baseline_rate = num_converted_control / samples_control # aka = conv_rate_control
practical_significance = 0.01 # user defined - want at least a 1 percentage point difference here (absolute, not relative)

effect_size = proportion_effectsize(baseline_rate, baseline_rate + practical_significance)

In [16]:
# determine our minimum sample size per group
alpha = 0.05 # significance level - corresponds to 95% confidence
power = 0.8 # user defined (1 - beta)

min_sample_size = NormalIndPower().solve_power(effect_size = effect_size, 
                                               power = power, 
                                               alpha = alpha)

print(f"Required minimum sample size: {min_sample_size:,.0f} per group")

Required minimum sample size: 17,209 per group
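
If you want to sanity check that number, the textbook normal-approximation formula for two equal-sized groups is $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{h}\right)^2$. The sketch below applies it by hand. I'm not claiming this is exactly what `solve_power` does internally, only that it should land close to the same answer (`sample_size_per_group` is my own helper name):

```python
import math
from statistics import NormalDist

def sample_size_per_group(h, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided, two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    return 2 * ((z_alpha + z_power) / h) ** 2

baseline = 0.12039  # control conversion rate from the summary above
h = 2 * (math.asin(math.sqrt(baseline)) - math.asin(math.sqrt(baseline + 0.01)))

n = sample_size_per_group(h)
print(f"Approximate minimum sample size: {n:,.0f} per group")
```

Notice that the required sample size scales with $1/h^2$, so halving the change you want to detect roughly quadruples the samples you need.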


In [17]:
# luckily...
print(f"Samples in Control Group: {samples_control:,}")
print(f"Samples in Treatment Group: {samples_treat:,}")

Samples in Control Group: 145,274
Samples in Treatment Group: 145,310


In [20]:
# Now let's test!
# Using a proportion test (not dealing with means but proportions)
results = proportions_ztest([num_converted_control, num_converted_treat], 
                            nobs=[samples_control, samples_treat])
results

In [22]:
print(f"Test Statistic: {results[0]:.3f}, P-Value: {results[1]:.3f}")

Test Statistic: 1.311, P-Value: 0.190
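
To see where those numbers come from, here's a hand-rolled sketch of a two-sided z-test for two proportions using a pooled standard error, which is my understanding of what `proportions_ztest` does by default. The helper `two_proportion_ztest` is my own, and the counts are copied from the group summary printed earlier:

```python
import math
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equal proportions using a pooled standard error."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Under the null both groups share one rate, so pool them to estimate it
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value

# Control: 17,489 conversions of 145,274; Treatment: 17,264 of 145,310
z_stat, p_val = two_proportion_ztest(17_489, 145_274, 17_264, 145_310)
print(f"Test Statistic: {z_stat:.3f}, P-Value: {p_val:.3f}")
```

The test statistic is just the observed gap in conversion rates, measured in standard errors.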


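So?

- Our p-value (0.190) is greater than our significance level ($\alpha = .05$), so we fail to reject the null hypothesis: we don't have evidence that the new page converts any differently than the old page.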
