## Motivation

**For this project, I will be working on the results of an A/B test run by an e-commerce website. My goal is to work through this notebook to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before making a decision.**

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seed for Python's random module. Note that this does not seed
# NumPy's generator, so the np.random draws further down are not fixed by it.
random.seed(42)

In [24]:
df = pd.read_csv('ab_data.csv')
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [25]:
df.shape[0]

294478

In [26]:
# The number of unique users in the dataset.
df['user_id'].nunique()

290584

In [27]:
# The proportion of users who converted
df['converted'].mean()

0.11965919355605512

In [28]:
# The number of rows where group and landing_page don't line up
# (control paired with new_page, or treatment paired with old_page)

df.query('(group=="control" and landing_page=="new_page") or (group=="treatment" and landing_page=="old_page")').shape[0]


3893
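The mismatch count above can also be read off a cross-tabulation of `group` against `landing_page`. The sketch below uses a small hypothetical frame named `toy` rather than `ab_data.csv`, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy rows (hypothetical), not taken from ab_data.csv.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page"],
})

# One cell per (group, landing_page) combination.
counts = pd.crosstab(toy["group"], toy["landing_page"])

# Mismatches are control + new_page and treatment + old_page.
mismatched = counts.loc["control", "new_page"] + counts.loc["treatment", "old_page"]

print(counts)
print("mismatched rows:", mismatched)
```

Reading the two off-diagonal cells separately also shows which direction the mismatches go, which the single `query` count hides.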

In [29]:
#Do any of the rows have missing values?

df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

### For the rows where treatment is not aligned with new_page or control is not aligned with old_page, we cannot be sure whether the user truly received the new or old page, so those rows are dropped.

In [30]:
df2 = df.query("(group == 'control' and landing_page == 'old_page') or (group == 'treatment' and landing_page == 'new_page')")


In [33]:
# How many unique user_ids are in df2?
df2.user_id.nunique()

290584

In [41]:
# There is one user_id repeated in df2. What is it?

df2[df2.duplicated('user_id', keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0
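The cells shown here leave both rows for the repeated `user_id` in `df2`. If one row per user is wanted, a minimal sketch (on a hypothetical frame `toy`, with an illustrative second user) could look like this:

```python
import pandas as pd

# Toy rows (hypothetical); only the repeated user_id mirrors the notebook.
toy = pd.DataFrame({
    "user_id": [773192, 630000, 773192],
    "group": ["treatment", "control", "treatment"],
    "converted": [0, 1, 0],
})

# Keep the first row seen for each user_id and drop later repeats.
deduped = toy.drop_duplicates(subset="user_id", keep="first")

print(len(toy), "->", len(deduped))
```

Both duplicate rows in the output above have `converted` equal to 0, so which one is kept would not change the conversion count, only the treatment group size by one.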


In [42]:
# What is the probability of an individual converting regardless of the page they receive?

df2['converted'].mean()

0.11959667567149027

In [43]:
# Given that an individual was in the control group, what is the probability they converted?
df2.query('group=="control"').converted.mean()

0.1203863045004612

In [44]:
# Given that an individual was in the treatment group, what is the probability they converted?
df2.query('group=="treatment"').converted.mean()

0.11880724790277405

In [45]:
# What is the probability that an individual received the new page?
(df2.landing_page=='new_page').mean()

0.5000636646764286
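The two conditional conversion rates above can also be computed in one pass with `groupby`. This is a sketch on a hypothetical frame `toy`, so the printed rates are not the notebook's results.

```python
import pandas as pd

# Toy rows (hypothetical), not taken from ab_data.csv.
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column within each group is that group's conversion rate.
rates = toy.groupby("group")["converted"].mean()

# Observed difference, treatment minus control.
obs_diff = rates["treatment"] - rates["control"]

print(rates)
print("treatment - control:", obs_diff)
```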

# A/B Test

Because each event has a timestamp, you could technically run a hypothesis test continuously as each observation is recorded.

The hard question is then when to stop. Do you stop as soon as one page is considered significantly better than the other, or does that result need to hold consistently for a certain amount of time? And how long do you run the test before deciding that neither page is better than the other?

These questions are the difficult parts associated with A/B tests in general.
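One reason the stopping question matters: testing repeatedly and stopping at the first significant result raises the false positive rate above the nominal level. The sketch below illustrates this with a simulated A/A test in which both groups share the same true rate. All settings (`n_sims`, `n_users`, `p`, the spacing of `looks`) are hypothetical and not taken from the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings, chosen only to keep the simulation small.
n_sims, n_users, p = 1000, 2000, 0.12
looks = np.arange(200, n_users + 1, 200)  # peek after every 200 users per group

# A/A setup: both groups have the same true rate, so any rejection is a false positive.
a = rng.random((n_sims, n_users)) < p
b = rng.random((n_sims, n_users)) < p

# Cumulative conversions at each look, shape (n_sims, len(looks)).
conv_a = a.cumsum(axis=1)[:, looks - 1]
conv_b = b.cumsum(axis=1)[:, looks - 1]

# Pooled two-proportion z statistic at each look (equal group sizes).
p_pool = (conv_a + conv_b) / (2 * looks)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / looks)
z = (conv_b - conv_a) / looks / se

# One-sided test at the 5% level: reject when z > 1.645.
reject_final_only = (z[:, -1] > 1.645).mean()
reject_any_look = (z > 1.645).any(axis=1).mean()

print("final look only:", reject_final_only)
print("any of the looks:", reject_any_look)
```

Because the final look is one of the looks, the second rate can never be lower than the first; the gap between them is the cost of peeking.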

1. For now, consider you need to make the decision based only on the data provided. If you want to assume that the old page is better unless the new page proves to be better at a Type I error rate of 5%, what should your null and alternative hypotheses be? You can state your hypotheses in words or in terms of $p_{old}$ and $p_{new}$, which are the conversion rates for the old and new pages.

$$ H_0: p_{new} \leq p_{old} $$
$$ H_1: p_{new} > p_{old} $$

2. Assume that under the null hypothesis $p_{new}$ and $p_{old}$ are equal, and that both equal the overall conversion rate in ab_data.csv regardless of the page.

In [48]:
# What is the conversion rate for $p_{new}$ under the null?
# Under the null, both pages share the overall conversion rate regardless of page.
p_new = df2['converted'].mean()
p_new

0.11959667567149027

In [47]:
# What is the conversion rate for $p_{old}$ under the null?
p_old = df2['converted'].mean()
p_old

0.11959667567149027

In [55]:
# What is $n_{new}$? Note that .shape returns a one-element tuple, not an int.
n_new = df2[df2['landing_page']=='new_page'].converted.shape
n_new

(145311,)

In [56]:
# What is $n_{old}$?
n_old = df2[df2['landing_page']=='old_page'].converted.shape
n_old

(145274,)
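With both group sizes and the observed rates available, the hypotheses in item 1 can be cross-checked analytically with a pooled two-proportion z-test. This is a sketch rather than part of the original analysis. The conversion counts are an assumption: they are reconstructed by rounding rate times group size from the outputs printed earlier, not read from the data.

```python
import math

# Group sizes and observed rates as printed earlier in the notebook.
n_old, n_new = 145274, 145311
rate_old = 0.1203863045004612
rate_new = 0.11880724790277405

# Assumption: conversion counts recovered by rounding rate * n.
conv_old = round(rate_old * n_old)
conv_new = round(rate_new * n_new)

# Pooled rate under H0 (both pages convert at the same rate).
p_pool = (conv_old + conv_new) / (n_old + n_new)

# Standard error of the difference in proportions under H0.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (conv_new / n_new - conv_old / n_old) / se

# One-sided p-value for H1: p_new > p_old, i.e. P(Z >= z) for standard normal Z.
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print("z =", z)
print("one-sided p-value =", p_value)
```

Since the observed treatment rate is below the control rate, `z` is negative and the one-sided p-value is above 0.5, so this check gives no evidence in favor of the new page.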

In [57]:
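# Simulate $n_{new}$ transactions with a conversion rate of $p_{new}$ under the null.
# As written, np.random.binomial(n_new, p_new) returns the simulated number of conversions
# (in a one-element array, because n_new is a tuple), not the individual 1's and 0's.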

new_page_converted = np.random.binomial(n_new,p_new)


In [58]:
# Simulate $n_{old}$ transactions with a conversion rate of $p_{old}$ under the null.
# As above, this stores the simulated number of conversions rather than individual 1's and 0's.
old_page_converted = np.random.binomial(n_old,p_old)


In [62]:
# Find $p_{new}$ - $p_{old}$ for the simulated values from the two cells above.

new_page_converted/n_new - old_page_converted/n_old 

array([-0.00316892])

In [61]:
old_page_converted

array([17586])
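The cells above draw a single conversion count per page. A sketch of the per-user version (one 0 or 1 per transaction), followed by a repeated simulation to build a sampling distribution under the null, might look like the following. The group sizes and rates are copied from the outputs above; `rng`, `p_null`, `n_sims`, and `p_diffs` are names introduced here for illustration, and the use of `np.random.default_rng` is a choice of this sketch, not the notebook's.

```python
import numpy as np

rng = np.random.default_rng(42)

# Values printed earlier in the notebook.
n_new, n_old = 145311, 145274
p_null = 0.11959667567149027  # overall conversion rate in df2

# One simulated experiment: one 0/1 outcome per user.
new_page_converted = rng.binomial(1, p_null, size=n_new)
old_page_converted = rng.binomial(1, p_null, size=n_old)
diff_once = new_page_converted.mean() - old_page_converted.mean()

# Many simulated experiments: drawing the conversion count directly is
# equivalent to summing n Bernoulli draws and is much faster.
n_sims = 10_000
p_diffs = (rng.binomial(n_new, p_null, n_sims) / n_new
           - rng.binomial(n_old, p_null, n_sims) / n_old)

# Observed difference (treatment minus control) from the rates printed earlier.
obs_diff = 0.11880724790277405 - 0.1203863045004612

# Share of null simulations with a difference larger than the observed one.
p_value = (p_diffs > obs_diff).mean()

print("single simulated difference:", diff_once)
print("simulated one-sided p-value:", p_value)
```

Comparing `obs_diff` against `p_diffs` gives a simulation-based counterpart to a one-sided test of $H_0: p_{new} \leq p_{old}$.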