# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice your skills at conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The scenario you'll be investigating is data collected from the homepage of a music app page for audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
    * Start investigating the id column:
        * How many viewers also clicked?
        * Are there any anomalies with the data; did anyone click who didn't view?
        * Is there any overlap between the control and experiment groups? 
            * If so, how do you plan to account for this in your experimental design?

In [2]:
#Your code here
import pandas as pd
df=pd.read_csv('homepage_actions.csv')

In [3]:
df.columns,df.index


(Index(['timestamp', 'id', 'group', 'action'], dtype='object'),
 RangeIndex(start=0, stop=8188, step=1))

In [4]:
df.describe()

Unnamed: 0,id
count,8188.0
mean,564699.749878
std,219085.845672
min,182988.0
25%,373637.5
50%,566840.5
75%,758078.0
max,937217.0


In [5]:
df['group'].unique()

array(['experiment', 'control'], dtype=object)

In [6]:
df['action'].unique()

array(['view', 'click'], dtype=object)

In [7]:
df.action.value_counts()

view     6328
click    1860
Name: action, dtype: int64

In [8]:
df


Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
3,2016-09-24 19:59:02.646620,671993,control,view
4,2016-09-24 20:26:14.466886,536734,experiment,view
...,...,...,...,...
8183,2017-01-18 09:11:41.984113,192060,experiment,view
8184,2017-01-18 09:42:12.844575,755912,experiment,view
8185,2017-01-18 10:01:09.026482,458115,experiment,view
8186,2017-01-18 10:08:51.588469,505451,control,view


In [10]:
# <1>* How many viewers also clicked?
# Looking for duplicated id (multiple rows with same id) and whether there were clicks and views for that id.
dfeda= df.copy()
dfeda['temp']=1
dfsort=dfeda.groupby(by = ['id']).sum()
dfsort.describe()
# interpretation: at a maximum, a given user appears twice in the dataframe.
# it could be click&click click&view view&click view&view (order matters)
# we are looking for view&click
# cids = set ( df[df.action == 'click'])

# set of unique users' ids who clicked
cids = set ( df[df.action == 'click']['id'].unique() ) 
# set of unique users' ids who viewed
vids = set ( df[df.action == 'view']['id'].unique() ) 

print("Number of viewers: {} \tNumber of clickers: {}".format(len(vids), len(cids)))

#     * Are there any anomalies with the data; did anyone click who didn't view?

for idnum in list(df.loc[df['action'] == 'click']['id']):
    if df['id'] == idnum & df[:
        print ( idnum , )
                              
#     * Is there any overlap between the control and experiment groups? 
#         * If so, how do you plan to account for this in your experimental design?


Number of viewers: 6328 	Number of clickers: 1860


[349125,
 601714,
 487634,
 468601,
 555973,
 398892,
 444902,
 544571,
 269335,
 596892,
 653403,
 922848,
 283438,
 194950,
 894454,
 370483,
 639852,
 826660,
 381380,
 910738,
 246990,
 832871,
 461125,
 368583,
 839729,
 332165,
 855872,
 691753,
 226057,
 417019,
 852585,
 482798,
 366151,
 769017,
 835449,
 365028,
 480891,
 234444,
 792966,
 550685,
 184992,
 408715,
 766674,
 489523,
 707411,
 304461,
 632767,
 726650,
 349239,
 225854,
 828266,
 774573,
 825953,
 449882,
 457832,
 931812,
 557715,
 786605,
 420863,
 755354,
 619230,
 424146,
 838042,
 443177,
 673124,
 378232,
 883682,
 559812,
 931905,
 638030,
 450877,
 539338,
 251716,
 649807,
 884709,
 363131,
 883875,
 634811,
 489374,
 615186,
 726961,
 559646,
 230350,
 378864,
 261489,
 700205,
 540729,
 388377,
 550280,
 619415,
 645349,
 887912,
 776364,
 432416,
 237346,
 828426,
 404428,
 696656,
 297965,
 430193,
 399915,
 918596,
 826501,
 934486,
 737683,
 200061,
 395723,
 677294,
 759783,
 581893,
 426465,
 

In [None]:
# Comment: Everyone who clicked, also viewed the homepage! 
# (Thank goodness!)

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [None]:
#Your code here
# H0 : the experimental homepage was no more effective than that of the control group
#     ie. click rate mean  is the same for  experiment and control populations

# Ha: click rate mean is higher for experiment than for control populations

# alpha = 0.05

# 

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not. 

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

## $n\bullet p (1-p)$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [None]:
#Your code here

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [None]:
#Your code here

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [None]:
#Your code here

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

## Summary

In this lab, you continued to get more practice designing and conducting AB tests. This required additional work preprocessing and formulating the initial problem in a suitable manner. Additionally, you also saw how to verify results, strengthening your knowledge of binomial variables, and reviewing initial statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.