## Analyze A/B Test Results


## Table of Contents
- [Introduction](#intro)

- [Part I - EDA](#probability)

- [Part II - A/B Test](#ab_test)

- [Conclusions](#conclusions)



<a id='intro'></a>
### Introduction

A/B tests are very commonly performed by data analysts and data scientists.  It is important that you get some practice working with the difficulties of these tests.

We will be working to understand the results of an A/B test run by an e-commerce website.  Your goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.


<a id='probability'></a>
### Part I - EDA

To get started, let's import our libraries.

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
#We are setting the seeds to ensure you get the same answers on quizzes as we set up
random.seed(42)
np.random.seed(42)  #the simulations below draw from np.random, which random.seed does not affect

`1.` Now, read in the `ab_data.csv` data. Store it in `df`.  **Use your dataframe to answer the questions in Quiz 1 of the classroom.**

a. Read in the dataset and take a look at the top few rows here:

In [None]:
#Read csv
df = pd.read_csv('ab_data.csv')
df.head()

b. Use the cell below to find the number of rows in the dataset.

In [None]:
#Find the number of rows 
df.shape[0]

c. The number of unique users in the dataset.

In [None]:
#Unique user ids
df['user_id'].nunique()


d. The proportion of users converted.

In [None]:
#Proportion of users converted
df['converted'].mean()


e. The number of times the `new_page` and `treatment` don't line up.

In [None]:
#Count the rows where group and landing_page do not line up (control with new_page, or treatment with old_page)
((df['group'] == 'treatment') != (df['landing_page'] == 'new_page')).sum()


f. Do any of the rows have missing values?

In [None]:
#Count the missing values in each column
df.isnull().sum()


`2.` For the rows where **treatment** is not aligned with **new_page** or **control** is not aligned with **old_page**, we cannot be sure if this row truly received the new or old page. We therefore drop these rows and store the result in **df2**.

In [None]:
#Select the rows where the control group was shown new_page (a mismatch)
npcontrol = df[(df.landing_page == "new_page") & (df.group == "control")]

#Select the rows where the treatment group was shown old_page (a mismatch)
optreatment = df[(df.landing_page == "old_page") & (df.group == "treatment")]

#Concatenate the mismatched rows
inaccurate = pd.concat([npcontrol, optreatment])

#Collect the index labels of these rows
inaccurate_index = inaccurate.index

#Drop the rows with those index labels
df2 = df.drop(inaccurate_index)
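
The same filtering can also be sketched with a single boolean mask instead of two filters and a concatenation. The example below is a minimal illustration on a made-up toy frame (`toy`, `aligned`, and `toy_clean` are hypothetical names, and the values are not taken from `ab_data.csv`); it assumes `group` only takes the values `control` and `treatment`, and `landing_page` only `old_page` and `new_page`.

```python
import pandas as pd

# Toy stand-in for ab_data.csv (values are made up for illustration)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "group": ["control", "treatment", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page"],
    "converted": [0, 1, 0, 0, 1],
})

# A row is consistent when "is in treatment" and "saw new_page" agree
aligned = (toy["group"] == "treatment") == (toy["landing_page"] == "new_page")

# Keep only the consistent rows; count the rows that would be dropped
toy_clean = toy[aligned]
n_mismatched = int((~aligned).sum())
```

Here `n_mismatched` counts the rows where the group and the page disagree, and `toy_clean` plays the role of **df2**.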

In [None]:
# Double check that all of the mismatched rows were removed - this should be 0
((df2['group'] == 'treatment') != (df2['landing_page'] == 'new_page')).sum()


In [None]:
#Check the new data frame
df2.head()

`3.` Use **df2** in the cells below to check its **user_id** values.

a. How many unique **user_id**s are in **df2**?

In [None]:
#Number of unique users
df2['user_id'].nunique()

b. There is one **user_id** repeated in **df2**.  What is it?

In [None]:
#Find the duplicate id
df2.loc[df2['user_id'].duplicated(), 'user_id']


c. What is the row information for the repeated **user_id**?

In [None]:
#Show every row that shares the duplicate id found above
df2[df2['user_id'].duplicated(keep=False)]


d. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.

In [None]:
#Remove one of the duplicate lines
df2.drop(labels = 1899, axis=0, inplace=True)

In [None]:
#Confirm removal of one of the lines - this should be 0
df2['user_id'].duplicated().sum()
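
Dropping by a hard-coded index label works only as long as the row order of the file does not change. As a hedged alternative, the sketch below removes repeated ids by value with `drop_duplicates`. The toy frame and the names `repeated_ids` and `deduped` are made up for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy frame with one repeated user_id (values are made up)
toy = pd.DataFrame({
    "user_id": [10, 11, 11, 12],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
    "converted": [0, 0, 0, 1],
})

# Ids that appear more than once (second and later occurrences are flagged)
repeated_ids = toy.loc[toy["user_id"].duplicated(), "user_id"].tolist()

# Keep the first row for each user_id and drop the later repeats
deduped = toy.drop_duplicates(subset="user_id", keep="first")
```

After this step `deduped["user_id"].duplicated().sum()` is 0, which is the same check used in the cell above.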


`4.` Use **df2** in the cells below to compute the following probabilities.

a. What is the probability of an individual converting regardless of the page they receive?

In [None]:
#converted is coded as 0/1, so its mean is the proportion of users who converted
df2['converted'].mean()

b. Given that an individual was in the `control` group, what is the probability they converted?

In [None]:
#Probability of a user converted in control group
df2[df2['group'] == "control"]['converted'].mean()

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [None]:
#Probability of a user converted in treatment group
df2[df2['group'] == "treatment"]['converted'].mean()
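
The two conditional probabilities from b. and c. can also be obtained in one pass with `groupby`. This is a small sketch on made-up data (the frame `toy` and the names `rates` and `obs_diff` are hypothetical), shown only to illustrate the pattern.

```python
import pandas as pd

# Toy data: 4 control users and 4 treatment users (values are made up)
toy = pd.DataFrame({
    "group": ["control"] * 4 + ["treatment"] * 4,
    "converted": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 column within each group = conversion rate per group
rates = toy.groupby("group")["converted"].mean()

# Observed difference in conversion rates (treatment minus control)
obs_diff = rates["treatment"] - rates["control"]
```

On the real data the same two lines would be applied to **df2** instead of `toy`.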

d. What is the probability that an individual received the new page?

In [None]:
#Probability of a user landing on new_page
(df2.landing_page == "new_page").mean()

In [None]:
df2.head()

e. Consider your results from a. through d. above, and explain below whether you think there is sufficient evidence to say that the new treatment page leads to more conversions.

#### According to the proportions above, the conversion rates of the treatment and control groups differ only slightly, so these results alone do not give sufficient evidence that the new treatment page leads to more conversions.

<a id='ab_test'></a>
### Part II - A/B Test

Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each new observation is recorded.  

However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time?  How long do you run to render a decision that neither page is better than another?  

These questions are the difficult parts associated with A/B tests in general.  


`1.` For now, suppose you need to make the decision based on all the data provided.  If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, what should your null and alternative hypotheses be?  You can state your hypotheses in words or in terms of **$p_{old}$** and **$p_{new}$**, which are the conversion rates for the old and new pages.
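
One way to write these hypotheses, taking the old page as the default and placing the burden of proof on the new page, is the one-sided pair below. This is a sketch of one reasonable formulation, not the only acceptable answer.

$$H_0: p_{new} - p_{old} \le 0$$

$$H_1: p_{new} - p_{old} > 0$$

With this choice, a Type I error means adopting the new page when it is in fact no better than the old one.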

`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page. <br><br>

Use a sample size for each page equal to the ones in **ab_data.csv**.  <br><br>

Simulate the sampling distribution for the difference in **converted** between the two pages over 10,000 iterations of calculating an estimate from the null.  <br><br>

Use the cells below to provide the necessary parts of this simulation.  If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem.  

a. What is the **conversion rate** for $p_{new}$ under the null? 

In [None]:
#Under the null, p_new equals the overall converted rate regardless of page
p_new = df2['converted'].mean()


b. What is the **conversion rate** for $p_{old}$ under the null? <br><br>

In [None]:
#Under the null, p_old equals the same overall converted rate
p_old = df2['converted'].mean()


c. What is $n_{new}$?

In [None]:
#Number of users landing on new page
n_new = (df2['landing_page'] == 'new_page').sum()



d. What is $n_{old}$?

In [None]:
#Number of users landing on old page
n_old = (df2['landing_page'] == 'old_page').sum()


e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null.  Store these $n_{new}$ 1's and 0's in **new_page_converted**.

In [None]:
#Draw n_new samples of 0/1 outcomes (binomial with a single trial) with success rate p_new
new_page_converted = np.random.binomial(1, p_new, n_new)


f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null.  Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [None]:
#Draw n_old samples of 0/1 outcomes (binomial with a single trial) with success rate p_old
old_page_converted = np.random.binomial(1, p_old, n_old)


g. Find $p_{new}$ - $p_{old}$ for your simulated values from parts (e) and (f).

In [None]:
#The new page has more rows than the old page, so compare the two conversion rates (means) rather than the raw arrays
new_page_converted.mean() - old_page_converted.mean()


h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using the same process you followed in parts **a. through g.** above.  Store all 10,000 values in a numpy array called **p_diffs**.

In [None]:
#Simulate 10000 samples of the differences in conversion rates
p_diffs = []

for _ in range(10000):
    new_page_converted = np.random.binomial(1, p_new, n_new)
    old_page_converted = np.random.binomial(1, p_old, n_old)
    new_page_p = new_page_converted.mean()
    old_page_p = old_page_converted.mean()
    p_diffs.append(new_page_p - old_page_p)
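
The loop above draws every individual 0/1 outcome in each iteration. A faster sketch under the same null hypothesis draws the number of conversions per iteration directly from a binomial distribution and divides by the sample size, because the sum of n independent 0/1 trials with success rate p is Binomial(n, p). The inputs `p_null`, `n_new`, and `n_old` below are placeholder values chosen for illustration, not the values from `ab_data.csv`.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

# Placeholder inputs (assumptions for illustration only)
p_null = 0.12
n_new, n_old = 5000, 5000
n_iter = 10_000

# Each draw is the count of conversions among n users; dividing gives a rate
new_rates = rng.binomial(n_new, p_null, size=n_iter) / n_new
old_rates = rng.binomial(n_old, p_null, size=n_iter) / n_old

# Simulated differences in conversion rate under the null
p_diffs_fast = new_rates - old_rates
```

The result is already a NumPy array, so no conversion from a list is needed before comparing it with an observed difference.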

i. Plot a histogram of the **p_diffs**.  Does this plot look like what you expected?  Use the matching problem in the classroom to ensure you fully understand what was computed here.

In [None]:
#Show the histogram
plt.hist(p_diffs);

j. What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?

In [None]:
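#Actual difference of converted rates (treatment minus control)
actual_diff = (df2[df2['group'] == 'treatment']['converted'].mean()
               - df2[df2['group'] == 'control']['converted'].mean())

In [None]:
#Convert to numpy array and calculate the p-value
(np.array(p_diffs) > actual_diff).mean()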
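
To make the computation in part j. concrete, here is a small self-contained sketch. The null distribution and the observed difference are made-up placeholders (a normal sample standing in for **p_diffs**, and an arbitrary `observed_diff`), so the resulting number says nothing about the real experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder null distribution and observed difference (made-up values)
null_diffs = rng.normal(loc=0.0, scale=0.0012, size=10_000)
observed_diff = -0.0015

# One-sided p-value: share of null differences above the observed one
p_value = (null_diffs > observed_diff).mean()

# Decision at the 5% Type I error rate
reject_null = p_value < 0.05
```

A large proportion means the observed difference is unremarkable under the null, so the null is not rejected. Only a proportion below 0.05 would count as evidence for the new page at the stated error rate.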


k. In words, explain what you just computed in part **j.**  What is this value called in scientific studies?  What does this value mean in terms of whether or not there is a difference between the new and old pages?

<a id='conclusions'></a>
## Conclusions

According to the analysis performed, we fail to reject the null hypothesis: the data do not provide sufficient evidence that the new page converts better than the old page. Moreover, the histogram of simulated differences gives no indication that the new page is better than the old page.
