## Analyze A/B Test Results


A project that leverages inferential statistics and hypothesis testing in combination with a database of user conversion rates to determine whether a company should adopt a new web page or keep the old one.

## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='intro'></a>
### Introduction

A/B tests are commonly performed by data analysts and data scientists, so it is important to get some practice working through the difficulties these tests present.

For this project, you will be working to understand the results of an A/B test run by an e-commerce website.  Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.


<a id='probability'></a>
### Part I - Probability

To get started, let's import our libraries.

In [91]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
# Set the seed to ensure you get the same answers on quizzes as we set up
random.seed(42)
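
Note that `random.seed` only seeds Python's built-in `random` module; NumPy keeps its own generator state. If later cells draw samples through NumPy, a minimal sketch like the following makes those draws repeatable as well. The binomial draw and its parameters are illustrative assumptions, not something the cell above does.

```python
import numpy as np

# Assumption: simulations are drawn with NumPy, so its generator is seeded too.
np.random.seed(42)
first_draw = np.random.binomial(n=1, p=0.12, size=5)

# Re-seeding with the same value reproduces exactly the same draw.
np.random.seed(42)
second_draw = np.random.binomial(n=1, p=0.12, size=5)

print((first_draw == second_draw).all())
```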

`1.` Now, read in the `ab_data.csv` data. Store it in `df`. 

a. Read in the dataset and take a look at the top few rows here:

In [92]:
df=pd.read_csv('ab_data.csv')
df.head()

b. Use the cell below to find the number of rows in the dataset.

In [93]:
df.shape

(294478, 5)

c. The number of unique users in the dataset.

In [94]:
df.user_id.nunique()

290584

d. The proportion of users converted.

In [42]:
df.converted.mean()

0.11965919355605512

e. The number of rows where `landing_page` and `group` don't match, that is, `new_page` without `treatment` or `old_page` without `control`.

In [95]:
df_np = df.query('landing_page == "new_page"')
(df_np['group'] != "treatment").sum()

1928

In [96]:
df_op = df.query('landing_page == "old_page"')
(df_op['group'] != "control").sum()

1965
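
Together, the two counts give 1928 + 1965 = 3893 misaligned rows. The same total could be obtained in one step by comparing two boolean masks, as in this sketch. The `toy` frame is hypothetical stand-in data that only borrows the column names of `ab_data.csv`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as ab_data.csv.
toy = pd.DataFrame({
    "group": ["treatment", "treatment", "control", "control", "treatment"],
    "landing_page": ["new_page", "old_page", "old_page", "new_page", "new_page"],
})

# A row is misaligned when exactly one of the two conditions holds.
is_treatment = toy["group"] == "treatment"
is_new_page = toy["landing_page"] == "new_page"
mismatches = int((is_treatment != is_new_page).sum())

print(mismatches)
```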

f. Do any of the rows have missing values?

In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB
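
Every column reports 294478 non-null entries, which matches the row count, so no values are missing. A more direct check is to count nulls per column. The sketch below uses a small hypothetical frame, with one missing value planted on purpose, to show what the check reports.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing value.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "converted": [0.0, np.nan, 1.0],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())

print(missing_per_column)
print(has_missing)
```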


`2.` For the rows where **treatment** does not match with **new_page** or **control** does not match with **old_page**, we cannot be sure if this row truly received the new or old page.  We should figure out how to handle these rows.

a. They should be removed because we want to be sure about the quality of our data.

In [98]:
df1 = df_np[df_np['group'] == "treatment"]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df2 = pd.concat([df1, df_op[df_op['group'] == "control"]])

In [99]:
# Double-check that all of the mismatched rows were removed - this should be 0
df2[(df2['group'] == 'treatment') != (df2['landing_page'] == 'new_page')].shape[0]

0

`3.` Clean the data.

a. How many unique **user_id**s are in **df2**?

In [100]:
df2.info()
df2.user_id.nunique()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 290585 entries, 2 to 294476
Data columns (total 5 columns):
user_id         290585 non-null int64
timestamp       290585 non-null object
group           290585 non-null object
landing_page    290585 non-null object
converted       290585 non-null int64
dtypes: int64(2), object(3)
memory usage: 13.3+ MB


290584

b. There is one **user_id** repeated in **df2**.  What is it?

In [119]:
df2[df2.user_id.duplicated()].user_id

2893    773192
Name: user_id, dtype: int64

c. What is the row information for the repeat **user_id**? 

In [121]:
df2[df2.user_id.duplicated()]

| | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 2893 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |


d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.

In [122]:
df2.drop(2893,axis=0,inplace=True)
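
Dropping by index label works here because the duplicate's label is already known. When it is not, `drop_duplicates` on the `user_id` column is one alternative. The sketch below runs on hypothetical toy data and keeps the first occurrence of each `user_id`.

```python
import pandas as pd

# Hypothetical toy data: user_id 10 appears twice.
toy = pd.DataFrame(
    {"user_id": [10, 11, 10, 12], "converted": [0, 1, 0, 0]},
    index=[5, 7, 9, 11],
)

# Keep the first row for each user_id and drop later repeats.
deduped = toy.drop_duplicates(subset="user_id", keep="first")

print(deduped)
```

With `keep="first"` the later occurrence is the one removed, which mirrors dropping row 2893 above, since `duplicated()` flags later occurrences by default.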

`4.` Investigate the data as follows:

a. What is the probability of an individual converting regardless of the page they receive?

In [123]:
df2.converted.mean()

0.11959708724499628

b. Given that an individual was in the `control` group, what is the probability they converted?

In [51]:
df2[df2['group']=="control"].converted.mean()

0.1203863045004612

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [52]:
df2[df2['group']=="treatment"].converted.mean()

0.11880806551510564
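
The two conditional probabilities in (b) and (c) can also be computed in a single call by grouping on `group`. This is a sketch on hypothetical toy data, not the notebook's own cell.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df2.
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 column within each group is that group's conversion rate.
rates = toy.groupby("group")["converted"].mean()

print(rates)
```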

d. What is the probability that an individual received the new page?

In [54]:
(df2['landing_page']=="new_page").mean()

0.50006194422266881

e. Consider your results from parts (a) through (d) above, and explain below whether you think there is sufficient evidence to conclude that the new treatment page leads to more conversions.

So far there is not enough evidence to conclude that the new page leads to more conversions. Assignment to the two pages was balanced (about half of the users received the new page), and the conversion rate for the treatment group (11.88%) is slightly lower than for the control group (12.04%). We should therefore run a hypothesis test to determine whether the difference between the pages is statistically significant.
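
As a rough preview of such a test, the figures above are enough for a pooled two-proportion z-test. The sketch below assumes a one-sided alternative (the new page converts better than the old one) and a normal approximation. The group sizes are not printed above, so they are derived from the 290584 cleaned rows and the share of users who received the new page. It is one possible check, not necessarily the approach the rest of the notebook takes.

```python
import math

# Figures taken from the notebook output above.
p_old = 0.1203863045004612      # conversion rate, control group (old page)
p_new = 0.11880806551510564     # conversion rate, treatment group (new page)
p_pooled = 0.11959708724499628  # overall conversion rate in df2

# Assumption: group sizes derived from the row count and the new-page share.
n_total = 290584
n_new = round(0.50006194422266881 * n_total)
n_old = n_total - n_new

# Standard error of the difference under the null (pooled proportion).
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_new + 1 / n_old))
z_score = (p_new - p_old) / std_err

# One-sided p-value for the alternative p_new > p_old, via the normal CDF.
p_value = 1 - 0.5 * (1 + math.erf(z_score / math.sqrt(2)))

print(f"z = {z_score:.3f}, one-sided p-value = {p_value:.3f}")
```

A negative z-score with a large one-sided p-value would mean the data give no support for the new page being better at the 5% level.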

<a id='ab_test'></a>
### Part II - A/B Test

Notice that because of the timestamp associated with each event, you could technically run a hypothesis test continuously as each observation is recorded.

The hard question, then, is whether you stop as soon as one page is considered significantly better than the other, or whether that result needs to hold consistently for a certain amount of time.  How long do you run the test before deciding that neither page is better than the other?

These questions are the difficult parts associated with A/B tests in general.  


`1.` For now, consider that you need to make the decision based only on the data provided.  If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be?  You can state your hypotheses in words or in terms of **$p_{old}$** and **$p_{new}$**, which are the conversion rates for the old and new pages.