## Analyze A/B Test Results


## Table of Contents
- [Introduction](#intro)
- [Part I - Probability](#probability)
- [Part II - A/B Test](#ab_test)
- [Part III - Regression](#regression)


<a id='intro'></a>
### Introduction

A/B tests are very commonly performed by data analysts and data scientists.  

For this project, I will be working to understand the results of an A/B test run by an e-commerce website.  My goal is to work through this notebook to help the company understand if they should implement the new page, keep the old page, or perhaps run the experiment longer to make their decision.


<a id='probability'></a>
#### Part I - Probability

To get started, let's import our libraries.

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
#We are setting the seed to assure you get the same answers on quizzes as we set up
random.seed(42)

In [2]:
df= pd.read_csv('ab_data.csv')
df.head(5)

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


b.Find the number of rows in the dataset.

In [3]:
df.shape[0]

294478

c. The number of unique users in the dataset.

`1.` Now, read in the `ab_data.csv` data. Store it in `df`. 
a. Read in the dataset and take a look at the top few rows here:

In [4]:
df.user_id.nunique()



290584

d. The proportion of users converted.

In [5]:
(df.query('converted== 1').user_id.nunique())/ (df.user_id.nunique())

0.12104245244060237

e. The number of times the `new_page` and `treatment` don't line up.

In [6]:
(df.query('group=="treatment" and landing_page != "new_page"').count()[0])+(df.query('group=="control" and landing_page != "old_page"').count()[0])

3893

f. Do any of the rows have missing values?

`2.` For the rows where **treatment** is not aligned with **new_page** or **control** is not aligned with **old_page**, we cannot be sure if this row truly received the new or old page.  

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [8]:
df2 = df.drop(df.query('(group == "treatment" and landing_page != "new_page") or \
(group == "control" and landing_page != "old_page")').index)

In [9]:
# Double Check all of the correct rows were removed - this should be 0
df2[((df2['group'] == 'treatment') == (df2['landing_page'] == 'new_page')) == False].shape[0]

0

3.

a. How many unique **user_id**s are in **df2**?

In [10]:
df2.user_id.nunique()

290584

b.What is the **user_id** repeated in **df2**

In [11]:
df2[df2.duplicated('user_id', keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


c. Remove **one** of the rows with a duplicate **user_id**, but keep your dataframe as **df2**.

In [12]:
df2.drop(df2.index[1899], inplace=True)

`4.` 

a. What is the probability of an individual converting regardless of the page they receive?

In [13]:
df2.query('converted==1').count()[0]/ len(df2)

0.11959708724499628

b. Given that an individual was in the `control` group, what is the probability they converted?

In [14]:
(df2.query('group=="control" and converted==1').count()[0])/(df2.query('group=="control"').count()[0])

0.12038713319061353

c. Given that an individual was in the `treatment` group, what is the probability they converted?

In [15]:
(df2.query('group=="treatment" and converted==1').count()[0])/(df2.query('group=="treatment"').count()[0])

0.11880724790277405

d. What is the probability that an individual received the new page?

In [16]:
(df2.query('landing_page == "new_page"').count()[0])/ (df2.user_id.nunique())

0.50006710647216113

The results in the previous two portions suggests that the probability of the user converting in either group is approximately the same therefore there isnt much evidence that suggests that one page leads to more conversions. 

<a id='ab_test'></a>
### Part II - A/B Test

Notice that because of the time stamp associated with each event, you could technically run a hypothesis test continuously as each observation was observed.  

However, then the hard question is do you stop as soon as one page is considered significantly better than another or does it need to happen consistently for a certain amount of time?  How long do you run to render a decision that neither page is better than another?  

These questions are the difficult parts associated with A/B tests in general.  


`1.` For now, consider you need to make the decision just based on all the data provided.  If you want to assume that the old page is better unless the new page proves to be definitely better at a Type I error rate of 5%, your null and alternative hypotheses should be:  

Null hypothesis: ùëùùëúùëôùëë >= ùëùùëõùëíùë§
alternative hypothesis ùëùùëõùëíùë§>ùëùùëúùëôùëë

`2.` Assume under the null hypothesis, $p_{new}$ and $p_{old}$ both have "true" success rates equal to the **converted** success rate regardless of page - that is $p_{new}$ and $p_{old}$ are equal. Furthermore, assume they are equal to the **converted** rate in **ab_data.csv** regardless of the page. <br><br>

Use a sample size for each page equal to the ones in **ab_data.csv**.  <br><br>

Perform the sampling distribution for the difference in **converted** between the two pages over 10,000 iterations of calculating an estimate from the null.  <br><br>

Use the cells below to provide the necessary parts of this simulation.  If this doesn't make complete sense right now, don't worry - you are going to work through the problems below to complete this problem.  <br><br>

a. What is the **convert rate** for $p_{new}$ under the null? 

In [17]:
pnew= (df2.query('converted ==1').count()[0])/(df2.user_id.nunique())
pnew

0.11959749882133504

b. What is the **convert rate** for $p_{old}$ under the null? <br><br>

In [18]:
pold= (df2.query('converted ==1').count()[0])/(df2.user_id.nunique())
pold

0.11959749882133504

c. What is $n_{new}$?

In [19]:
n_new= df2.query('landing_page=="new_page"')['user_id'].nunique()
n_new

145310

d. What is $n_{old}$?

In [20]:
n_old= df2.query('landing_page=="old_page"')['user_id'].nunique()
n_old

145273

e. Simulate $n_{new}$ transactions with a convert rate of $p_{new}$ under the null.  Store these $n_{new}$ 1's and 0's in **new_page_converted**.

In [21]:
new_converted= np.random.choice([0,1],n_new, p=(pnew,1-pnew))
new_converted

array([1, 1, 1, ..., 1, 1, 0])

f. Simulate $n_{old}$ transactions with a convert rate of $p_{old}$ under the null.  Store these $n_{old}$ 1's and 0's in **old_page_converted**.

In [22]:
old_converted= np.random.choice([0,1],n_old, p=(pold,1-pold))
old_converted

array([1, 1, 1, ..., 1, 1, 0])

g. Find $p_{new}$ - $p_{old}$ for your simulated values from part (e) and (f).

In [23]:
obs_diff= new_converted.mean()-old_converted.mean()
obs_diff

0.00095259598952801561

h. Simulate 10,000 $p_{new}$ - $p_{old}$ values using this same process similarly to the one you calculated in parts **a. through g.** above.  Store all 10,000 values in **p_diffs**.

In [None]:
p_diffs=[]
for _ in range(10000):
    bootsamp= df2.sample(df2.shape[0])
    new_pconverted= (np.random.choice([0,1],n_new, p=(pnew,1-pnew))).mean()
    old_pconverted= (np.random.choice([0,1],n_old, p=(pold,1-pold))).mean()
    p_diffs.append(new_pconverted-old_pconverted)

p_diffs=np.array(p_diffs)  
        

i. Plot a histogram of the **p_diffs**.  Does this plot look like what you expected?  Use the matching problem in the classroom to assure you fully understand what was computed here.

In [None]:
plt.hist(p_diffs);
plt.axvline(x= obs_diff, color="red");

j. What proportion of the **p_diffs** are greater than the actual difference observed in **ab_data.csv**?

In [None]:
actual_diffs = (df2.query('group == "treatment"')['converted'].mean()) - (df2.query('group=="control"')['converted'].mean())
(p_diffs > actual_diffs).mean()

k. What was computed in part **j.** is called the p-value, which is the probability of finding the statistic when the null hypothesis is assumed true. Since the pvalue is large, we fail to reject the null hypothesis.

l. We could also use a built-in to achieve similar results.  Though using the built-in might be easier to code, the above portions are a walkthrough of the ideas that are critical to correctly thinking about statistical significance. The below is used to calculate the number of conversions for each page, as well as the number of individuals who received each page. Let `n_old` and `n_new` refer the the number of rows associated with the old page and new pages, respectively.

In [None]:
import statsmodels.api as sm

convert_old = df2.query('group == "control"')['converted'].sum()
convert_new = df2.query('group == "treatment"')['converted'].sum()
n_old = df2.query('landing_page == "old_page"').shape[0]
n_new = df2.query('landing_page == "new_page"').shape[0]

m. Now use `stats.proportions_ztest` to compute  test statistic and p-value.  [Here](http://knowledgetack.com/python/statsmodels/proportions_ztest/) is a helpful link on using the built in.

In [None]:
z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new,n_old], value=None, alternative='two-sided', prop_var=False)
z_score, p_value

Since the z-score is between -1.96 and +1.96 and the p-value associated with a 95% confidence level is greater than 0.05 so we fail to reject the null.

<a id='regression'></a>
### Part III - A regression approach

`1.` In this final part, you will see that the result you acheived in the previous A/B test can also be acheived by performing regression.<br><br>

a. Since each row is either a conversion or no conversion, I would use Logistic regression in this case.

b. The goal is to use **statsmodels** to fit the regression model you specified in part **a.** to see if there is a significant difference in conversion based on which page a customer receives.  However, you first need to create a colun for the intercept, and create a dummy variable column for which page each user received.  Add an **intercept** column, as well as an **ab_page** column, which is 1 when an individual receives the **treatment** and 0 if **control**.

In [None]:
df2[['treatment', 'control']] = pd.get_dummies(df2['group'])
df2['ab_page'] = df2['treatment']

c. Use **statsmodels** to import your regression model.  Instantiate the model, and fit the model using the two columns you created in part **b.** to predict whether or not an individual converts.

In [None]:
import statsmodels.api as sm
df2['intercept']=1
logit_mod= sm.Logit(df2['converted'],df2[['intercept','ab_page']])
results=logit_mod.fit()

d. Provide the summary of your model below, and use it as necessary to answer the following questions.

In [None]:
results.summary()

e. What is the p-value associated with **ab_page** is p-value of 0.19>0.05 and we fail to reject the null hypothesis. Here we look at the difference between the two conditions where null hypothesis is pnew=pold and the alternative is pnew!=pold. In part two we were predicting which page was getting more conversions with the null hypothesis being pnew<= pold and alternative hypothesis being pnew>pold.

f. Now, there are considerations that might influence whether or not an individual converts. This includes taking into consideration existing users to avoid bias results due to change aversion and novelty effects.However the more metric added the more likely significant differences are observed by chance.

g. Now along with testing if the conversion rate changes for different pages, also add an effect based on which country a user lives. You will need to read in the **countries.csv** dataset and merge together your datasets on the approporiate rows.

In [None]:
cdf= pd.read_csv('countries.csv')
cdf.head()

In [None]:
df3= df2.set_index('user_id').join(cdf.set_index('user_id'),sort=False)


In [None]:
df3[['CA','UK','US']]= pd.get_dummies(df3['country'])
df3.tail()


In [None]:
logit_mod2= sm.Logit(df3['converted'], df3[['intercept','UK','US']])
results=logit_mod2.fit()
results.summary()

h. Though you have now looked at the individual factors of country and page on conversion, we would now like to look at an interaction between page and country to see if there significant effects on conversion. 

In [None]:
logit_mod4 = sm.Logit(df3['converted'],df3[['intercept', 'US', 'UK','ab_page']])
results2 = logit_mod4.fit()
results2.summary()

It doesn't appear that there is any interaction between country and page on conversion since all the pvalues are above 0.05.

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Analyze_ab_test_results_notebook.ipynb'])