## Analyze A/B Test Results


## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='intro'></a>
### Introduction

A/B tests are very commonly performed by data analysts and data scientists.  It is important that we get some practice working with the difficulties of these 

For this project, I will be working to understand the results of an A/B test run by an e-commerce website.  My goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.

<a id='probability'></a>
#### Part I - Probability

In [3]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
random.seed(42)

In [4]:
df = pd.read_csv("ab_data.csv")
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [5]:
df.shape[0]

294478

In [6]:
df.user_id.nunique()

290584

In [7]:
# The proportion of users converted

df.query('converted == 1').converted.count()/ df.shape[0]

0.11965919355605512

In [8]:
# The number of times the `new_page` and `treatment` don't match.

con = df.query('group == "control" & landing_page =="new_page"').count()
tr = df.query('group == "treatment" & landing_page =="old_page"').count()

con + tr

user_id         3893
timestamp       3893
group           3893
landing_page    3893
converted       3893
dtype: int64

In [9]:
df.isna().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page.

In [10]:
df2 = df.drop(df[(df.group == 'control') & (df.landing_page == 'new_page')].index)
df2.drop(df2[(df2.group == 'treatment') & (df.landing_page == 'old_page')].index, inplace = True)

df2.head()

  df2.drop(df2[(df2.group == 'treatment') & (df.landing_page == 'old_page')].index, inplace = True)


Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [11]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

0

In [12]:
# How many unique user_ids are in df2?

df2.user_id.nunique()

290584

In [13]:
# There is one user_id repeated in df2.  What is it?

df2.user_id.value_counts().head()

773192    2
630732    1
811737    1
797392    1
795345    1
Name: user_id, dtype: int64

In [14]:
# What is the row information for the repeat user_id? 

df2.query("user_id == 773192")

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


In [15]:
# Remove one of the rows with a duplicate user_id.

df2.drop_duplicates(subset = 'user_id', inplace = True)

In [16]:
# What is the probability of an individual converting regardless of the page they receive?

df2.converted.mean()

0.11959708724499628

In [17]:
# Given that an individual was in the `control` group, what is the probability they converted?

df2.query("group == 'control'").converted.mean()

0.1203863045004612

In [18]:
# Given that an individual was in the `treatment` group, what is the probability they converted?

df2.query("group == 'treatment'").converted.mean()

0.11880806551510564

In [19]:
# the probability that an individual received the new page

df2.query("landing_page == 'new_page'").count() / df2.shape[0]

user_id         0.500062
timestamp       0.500062
group           0.500062
landing_page    0.500062
converted       0.500062
dtype: float64

e. Consider results from parts (a) through (d) above, and explain below whether there is sufficient evidence to conclude that the new treatment page leads to more conversions:

No. The conversion rate of control group is actually higher than the one of treatment group. Considering there are almost the same number of participants in each group, we can assume the number of conversion would be similar as well, not just conversion "rates". Also, there is no evidence to support the idea that new treatment page leads to more conversions.

Null hypothesis (H0): **$p_{old}$** - **$p_{new}$** >= 0

Althernative hypothesis (H1): **$p_{old}$** - **$p_{new}$** < 0

`2.` Let's assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page. 

In [20]:
# conversion rate for p_new under the null

p_new = df2.converted.mean()
p_new

0.11959708724499628

In [21]:
# conversion rate for p_old under the null

p_old = df2.converted.mean()
p_old

0.11959708724499628

In [22]:
# the number of individuals in the treatment group

n_new = df2.query("group == 'treatment'").user_id.nunique()
n_new

145310

In [23]:
# the number of individuals in the control group

n_old = df2.query("group == 'control'").user_id.nunique()
n_old

145274

In [24]:
# Let's simulate n_new transactions with a conversion rate of p_new under the null

new_page_converted = np.random.choice([0,1], n_new, [p_new, 1-p_new])
print(new_page_converted)

[1 1 0 ... 0 1 1]
