# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The data you'll investigate was collected from the homepage of a music app, Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the `id` column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?
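The hints above can be explored with a few set operations on the `id` column. The sketch below is only illustrative: it uses a small hypothetical DataFrame (`toy`) with the same `id`, `group`, and `action` columns as `homepage_actions.csv`, and its values are made up. Swapping in the real `data` should give the actual answers.

```python
import pandas as pd

# Hypothetical toy log with the same columns as homepage_actions.csv.
toy = pd.DataFrame({
    'id':     [1, 1, 2, 3, 4, 4, 5],
    'group':  ['control', 'control', 'control', 'experiment',
               'experiment', 'experiment', 'experiment'],
    'action': ['view', 'click', 'view', 'view', 'view', 'click', 'click'],
})

viewers = set(toy.loc[toy['action'] == 'view', 'id'])
clickers = set(toy.loc[toy['action'] == 'click', 'id'])

# How many viewers also clicked?
viewed_and_clicked = viewers & clickers

# Anomaly check: ids with a click but no recorded view.
clicked_without_view = clickers - viewers

# Overlap check: ids that appear in both groups.
control_ids = set(toy.loc[toy['group'] == 'control', 'id'])
experiment_ids = set(toy.loc[toy['group'] == 'experiment', 'id'])
in_both_groups = control_ids & experiment_ids

print(len(viewed_and_clicked), len(clicked_without_view), len(in_both_groups))
```

Working with sets of ids rather than raw row counts keeps the questions about individuals separate from the questions about logged actions.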

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
data = pd.read_csv('homepage_actions.csv')

In [3]:
data.head()

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
3,2016-09-24 19:59:02.646620,671993,control,view
4,2016-09-24 20:26:14.466886,536734,experiment,view


In [4]:
data.groupby('group')['action'].value_counts()

group       action
control     view      3332
            click      932
experiment  view      2996
            click      928
Name: action, dtype: int64

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [5]:
def welch_t(a, b):
    
    """ Calculate Welch's t-statistic for two samples. """

    numerator = a.mean() - b.mean()
    
    # ddof ("Delta Degrees of Freedom"): the divisor used in the calculation is N - ddof,
    # where N is the number of elements. NumPy defaults to ddof=0 and pandas to ddof=1,
    # so ddof=1 is passed explicitly to get the sample variance in either case.
    
    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)
    
    return np.abs(numerator/denominator)

In [6]:
def welch_df(a, b):
    
    """ Calculate the effective degrees of freedom for two samples. """
    
    s1 = a.var(ddof=1) 
    s2 = b.var(ddof=1)
    n1 = a.size
    n2 = b.size
    
    numerator = (s1/n1 + s2/n2)**2
    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)
    
    return numerator/denominator

In [7]:
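def p_value(a, b, two_sided=False):

    """ Calculate the p-value of Welch's t-test for two samples. """

    # welch_t returns |t|, so the direction of the difference is lost here;
    # the one-sided result is only meaningful when a.mean() > b.mean().
    t = welch_t(a, b)
    df = welch_df(a, b)
    
    # Upper-tail probability of the t-distribution with df degrees of freedom
    p = 1-stats.t.cdf(np.abs(t), df)
    
    if two_sided:
        return 2*p
    else:
        return p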

In [8]:
exp = data[data['group'] == 'experiment']
control = data[data['group'] == 'control']

In [9]:
exp_click = exp['action'] == 'click'
exp_view = exp['action'] == 'view'

In [14]:
control_click = control['action'] == 'click'
control_view = control['action'] == 'view'

In [16]:
p_value(exp_click, control_click)

0.026743886922199422
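As a cross-check on the hand-written functions, SciPy's `stats.ttest_ind` runs Welch's t-test when `equal_var=False` is passed. The sketch below rebuilds 0/1 arrays from the counts printed by `value_counts` earlier, which mirrors how `exp_click` and `control_click` are constructed (one entry per logged row). The array names are assumptions for illustration, and the one-sided p-value should land close to the value returned by `p_value` above.

```python
import numpy as np
from scipy import stats

# Rebuild 0/1 arrays from the counts printed earlier (assumption: row order
# does not matter for the test, only the number of click and view rows).
exp_arr = np.concatenate([np.ones(928), np.zeros(2996)])
control_arr = np.concatenate([np.ones(932), np.zeros(3332)])

# equal_var=False requests Welch's t-test instead of Student's t-test.
t_stat, p_two_sided = stats.ttest_ind(exp_arr, control_arr, equal_var=False)

# The experiment mean is higher here, so the one-sided p-value
# is half of the two-sided one.
p_one_sided = p_two_sided / 2
print(t_stat, p_one_sided)
```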

## Verifying Results

One sensible way to formulate the data for the hypothesis test above is to create a binary variable for each individual in the experiment and control groups. This variable records whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.
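A minimal sketch of that formulation, assuming the log has one row per action and that an individual counts as having clicked if any of their rows is a click. The `toy` DataFrame and the `is_click` and `clicked` names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy log; ids and actions are made up for illustration.
toy = pd.DataFrame({
    'id':     [10, 10, 11, 12, 12, 13],
    'group':  ['control', 'control', 'control',
               'experiment', 'experiment', 'experiment'],
    'action': ['view', 'click', 'view', 'view', 'click', 'view'],
})

# One row per individual: 1 if any of their rows is a click, else 0.
clicked = (
    toy.assign(is_click=(toy['action'] == 'click').astype(int))
       .groupby(['group', 'id'])['is_click']
       .max()
       .rename('clicked')
       .reset_index()
)

# The mean of the binary variable is the per-individual click-through rate.
ctr_by_group = clicked.groupby('group')['clicked'].mean()
print(ctr_by_group)
```

Collapsing to one row per individual makes the sample size equal to the number of people rather than the number of logged actions.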

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$
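As a quick sanity check, the formula can be compared against SciPy's binomial distribution. The values of `n` and `p` below are illustrative only and are not taken from the dataset.

```python
from scipy import stats

n, p = 1000, 0.28  # illustrative values, not taken from the dataset

# Variance of the number of successes in n Bernoulli(p) trials.
variance_formula = n * p * (1 - p)

# The same quantity as reported by scipy's binomial distribution.
variance_scipy = stats.binom.var(n, p)

# The standard deviation is the square root of the variance.
std_dev = variance_formula ** 0.5
print(variance_formula, variance_scipy, std_dev)
```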

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [19]:
CTR = sum(control_click) / sum(control_view)
print(CTR)

0.2797118847539016


In [20]:
expected = sum(exp_view) * CTR
expected

838.0168067226891

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [31]:
data['Binary'] = np.where(data['action'] == 'click', 1, 0)

data.tail(10)

Unnamed: 0,timestamp,id,group,action,Binary
8178,2017-01-18 08:17:12.675797,616692,control,view,0
8179,2017-01-18 08:53:50.910310,615849,experiment,view,0
8180,2017-01-18 08:54:56.879682,615849,experiment,click,1
8181,2017-01-18 09:07:37.661143,795585,control,view,0
8182,2017-01-18 09:09:17.363917,795585,control,click,1
8183,2017-01-18 09:11:41.984113,192060,experiment,view,0
8184,2017-01-18 09:42:12.844575,755912,experiment,view,0
8185,2017-01-18 10:01:09.026482,458115,experiment,view,0
8186,2017-01-18 10:08:51.588469,505451,control,view,0
8187,2017-01-18 10:24:08.629327,461199,control,view,0


In [32]:
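# Note: len(exp) counts every experiment row (views and clicks);
# sum(exp_view) would match the denominator used for `expected`.
n = len(exp)
p = CTR

# Binomial variance and standard deviation of the number of clicks
var = n * p * (1-p)
std = np.sqrt(var)
print(std)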

28.117265621107368


In [38]:
z = (sum(exp_click) - expected) / std

In [39]:
print(z)

3.200282505769741


### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [40]:
stats.norm.sf(abs(z))

0.000686464723597255
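The three steps can also be collapsed into one small helper. This is a sketch under stated assumptions: `binomial_z_test` is a hypothetical name, the counts are copied from the `value_counts` output earlier in the lab, each view row is treated as one trial, and the control click-through rate is treated as a fixed known value. Because it uses the number of views as `n` for both the expected count and the standard deviation, its z-score will differ from the one above, where `n = len(exp)` includes click rows as well.

```python
import numpy as np
from scipy import stats

# Counts taken from the value_counts output earlier in the lab.
control_views, control_clicks = 3332, 932
exp_views, exp_clicks = 2996, 928

def binomial_z_test(successes, n_trials, p_null):
    """Return (expected, z, one-sided p) using the normal approximation
    to a binomial with n_trials trials and success probability p_null."""
    expected = n_trials * p_null
    std = np.sqrt(n_trials * p_null * (1 - p_null))
    z = (successes - expected) / std
    return expected, z, stats.norm.sf(z)

# Null hypothesis: the experiment group shares the control click-through rate.
ctr_control = control_clicks / control_views
expected, z, p = binomial_z_test(exp_clicks, exp_views, ctr_control)
print(expected, z, p)
```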

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the initial problem in a suitable manner. You also saw how to verify results, which strengthened your knowledge of binomial variables and reviewed the statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.