# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The scenario you'll investigate uses data collected from the homepage of a music app page for Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?
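
One way to approach the hinted checks is with plain set operations on the id column. The sketch below uses a tiny made-up DataFrame (`toy` and its values are hypothetical, chosen only to mirror the columns of `homepage_actions.csv`), so the counts it produces say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the lab's dataset (illustration only)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 4, 5],
    "group": ["experiment", "experiment", "experiment",
              "control", "control", "control", "control"],
    "action": ["view", "click", "view", "view", "view", "click", "click"],
})

# Sets of ids by action
viewers = set(toy.loc[toy["action"] == "view", "id"])
clickers = set(toy.loc[toy["action"] == "click", "id"])

viewed_and_clicked = viewers & clickers      # ids that viewed and also clicked
clicked_without_view = clickers - viewers    # anomaly: a click with no recorded view

# Sets of ids by group; a non-empty intersection means group overlap
exp_ids = set(toy.loc[toy["group"] == "experiment", "id"])
cont_ids = set(toy.loc[toy["group"] == "control", "id"])
in_both_groups = exp_ids & cont_ids

print(len(viewed_and_clicked), len(clicked_without_view), len(in_both_groups))
```

On the real data, the same three sets could be built from `df` instead of `toy`.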

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('homepage_actions.csv')
df

Unnamed: 0,timestamp,id,group,action
0,2016-09-24 17:42:27.839496,804196,experiment,view
1,2016-09-24 19:19:03.542569,434745,experiment,view
2,2016-09-24 19:36:00.944135,507599,experiment,view
3,2016-09-24 19:59:02.646620,671993,control,view
4,2016-09-24 20:26:14.466886,536734,experiment,view
...,...,...,...,...
8183,2017-01-18 09:11:41.984113,192060,experiment,view
8184,2017-01-18 09:42:12.844575,755912,experiment,view
8185,2017-01-18 10:01:09.026482,458115,experiment,view
8186,2017-01-18 10:08:51.588469,505451,control,view


In [3]:
df.group.unique()

array(['experiment', 'control'], dtype=object)

In [4]:
df.action.unique()

array(['view', 'click'], dtype=object)

In [5]:
df.timestamp = pd.to_datetime(df.timestamp)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
timestamp    8188 non-null datetime64[ns]
id           8188 non-null int64
group        8188 non-null object
action       8188 non-null object
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 256.0+ KB


In [6]:
df.timestamp.min()

Timestamp('2016-09-24 17:42:27.839496')

In [7]:
df.timestamp.max()

Timestamp('2017-01-18 10:24:08.629327')

In [8]:
exp_df = df[df.group == 'experiment']
cont_df = df[df.group == 'control']

In [9]:
len(exp_df.id.unique())

2996

In [10]:
len(cont_df.id.unique())

3332

In [11]:
# check to see if any id's are in both exp and control (an empty result means no overlap)
exp_df.merge(cont_df, how='inner', suffixes=('_exp', '_cont'), on='id')

Unnamed: 0,timestamp_exp,id,group_exp,action_exp,timestamp_cont,group_cont,action_cont


In [12]:
# check to see if anyone clicked but didn't view
exp_c_df = exp_df[exp_df.action == 'click']
exp_v_df = exp_df[exp_df.action == 'view']
cont_c_df = cont_df[cont_df.action == 'click']
cont_v_df = cont_df[cont_df.action == 'view']

In [13]:
exp_c_df.merge(exp_v_df, on='id', suffixes=('_cl', '_v'), how='left').info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 928 entries, 0 to 927
Data columns (total 7 columns):
timestamp_cl    928 non-null datetime64[ns]
id              928 non-null int64
group_cl        928 non-null object
action_cl       928 non-null object
timestamp_v     928 non-null datetime64[ns]
group_v         928 non-null object
action_v        928 non-null object
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 58.0+ KB


In [14]:
cont_c_df.merge(cont_v_df, on='id', suffixes=('_cl', '_v'), how='left')

Unnamed: 0,timestamp_cl,id,group_cl,action_cl,timestamp_v,group_v,action_v
0,2016-09-25 02:53:25.459874,398892,control,click,2016-09-25 02:52:43.844199,control,view
1,2016-09-25 05:19:15.810727,544571,control,click,2016-09-25 05:18:58.565357,control,view
2,2016-09-25 08:25:32.821891,194950,control,click,2016-09-25 08:24:31.192802,control,view
3,2016-09-25 09:45:12.114972,894454,control,click,2016-09-25 09:43:32.734737,control,view
4,2016-09-25 10:38:53.299877,639852,control,click,2016-09-25 10:37:38.286145,control,view
...,...,...,...,...,...,...,...
927,2017-01-17 18:32:30.832981,762498,control,click,2017-01-17 18:32:04.305072,control,view
928,2017-01-17 22:40:54.304413,591686,control,click,2017-01-17 22:39:20.924266,control,view
929,2017-01-17 23:20:35.483601,451198,control,click,2017-01-17 23:19:39.649126,control,view
930,2017-01-17 23:47:58.209653,252195,control,click,2017-01-17 23:46:19.329053,control,view


In [15]:
# left-join views to clicks; viewers who never clicked have nulls in the click columns
cont_v_df.merge(cont_c_df, on='id', suffixes=('_v', '_cl'), how='left').info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3332 entries, 0 to 3331
Data columns (total 7 columns):
timestamp_v     3332 non-null datetime64[ns]
id              3332 non-null int64
group_v         3332 non-null object
action_v        3332 non-null object
timestamp_cl    932 non-null datetime64[ns]
group_cl        932 non-null object
action_cl       932 non-null object
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 208.2+ KB


In [16]:
evn = len(exp_df[exp_df.action == 'view'])
evn

2996

In [17]:
ecn = len(exp_df[exp_df.action == 'click'])
ecn

928

In [18]:
cvn = len(cont_df[cont_df.action == 'view'])
cvn

3332

In [19]:
ccn = len(cont_df[cont_df.action == 'click'])
ccn

932

In [20]:
p_c = ccn / cvn  # control click-through rate: clicks / views
p_c

0.2797118847539016

In [21]:
p_e = ecn / evn  # experiment click-through rate: clicks / views
p_e

0.3097463284379172

In [22]:
# about 31% clicked in exp vs. 28% in control
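
As an alternative to counting each subset separately, the view and click counts per group can be tabulated in one step. This is only a sketch on a made-up DataFrame (`toy` is hypothetical), and it assumes each id contributes at most one view row and one click row, which is what the counts above suggest for the real data.

```python
import pandas as pd

# Hypothetical toy data with the same columns as the lab's dataset (illustration only)
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 4, 5],
    "group": ["experiment", "experiment", "experiment",
              "control", "control", "control", "control"],
    "action": ["view", "click", "view", "view", "view", "click", "view"],
})

# Rows = group, columns = action, cells = number of rows with that combination
counts = pd.crosstab(toy["group"], toy["action"])

# Click-through rate per group: clicks divided by views
ctr = counts["click"] / counts["view"]
print(ctr)
```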

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than the control homepage.

In [23]:
import numpy as np
import statsmodels.api as sm

In [24]:
zscore, pval = sm.stats.proportions_ztest([ccn, ecn], [cvn, evn], alternative='smaller')
print(pval)

0.004415037788297902


In [25]:
# we can reject the null hypothesis that control and experiment had the same effectiveness
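
To see what a two-proportion z-test computes, here is a hand-rolled sketch using only the standard library and the counts reported above. It assumes the pooled-variance form of the test with a lower-tail alternative (control rate smaller than experiment rate); the variable names `p_pool`, `se`, `z`, and `p_lower` are introduced here for illustration.

```python
from math import sqrt, erfc

# Counts taken from the cells above
ccn, cvn = 932, 3332   # control clicks, control views
ecn, evn = 928, 2996   # experiment clicks, experiment views

p_c = ccn / cvn
p_e = ecn / evn

# Pooled click rate under the null hypothesis that both groups share one rate
p_pool = (ccn + ecn) / (cvn + evn)

# Standard error of the difference in proportions under the null
se = sqrt(p_pool * (1 - p_pool) * (1 / cvn + 1 / evn))

# z-statistic for control minus experiment
z = (p_c - p_e) / se

# Lower-tail p-value: P(Z <= z) for a standard normal Z
p_lower = 0.5 * erfc(-z / sqrt(2))
print(z, p_lower)
```

The resulting p-value should agree closely with the one printed by `proportions_ztest` above.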

## Verifying Results

One sensible way to formulate the data for the hypothesis test above is to create a binary variable for each individual in the experiment and control groups. This variable records whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
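
In symbols, the three steps amount to the normal approximation below. The notation is introduced here for convenience and is not part of the lab: $X$ is the observed number of clicks in the experiment group, $n$ is the number of experiment viewers, $p_c$ is the control click-through rate, and $Z$ is a standard normal variable.

$$\mu = n \cdot p_c, \qquad \sigma = \sqrt{n \cdot p_c (1 - p_c)}, \qquad z = \frac{X - \mu}{\sigma}, \qquad p\text{-value} = P(Z \geq z)$$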

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [29]:
control_click_rate = p_c # the click rate was already calculated above

expected_exp_clicks = control_click_rate * evn # evn is the number of members of the experiment group
expected_exp_clicks

838.0168067226891

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [38]:
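actual_exp_clicks = ecn
diff_in_clicks = actual_exp_clicks - expected_exp_clicks
print(f'the difference in clicks was {diff_in_clicks}')
# binomial variance n * p * (1 - p): experiment group size with the control click rate
var_control = evn * control_click_rate * (1 - control_click_rate)
sd = np.sqrt(var_control)
print(f'the control sd is {sd}')
# z-score: how many standard deviations the actual clicks are from the expected clicks
sd_away = diff_in_clicks / sd
print(f'number of sd away is {sd_away}')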


the difference in clicks was 89.98319327731087
the control sd is 24.568547907005815
number of sd away is 3.6625360854823588


### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [34]:
import scipy.stats as stats

In [39]:
p_value = stats.norm.sf(sd_away)
p_value

0.00012486528006951198

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

In [40]:
print(f'The diff in p_values is {pval - p_value}')

The diff in p_values is 0.00429017250822839


In [41]:
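# the two p-values differ by more than an order of magnitude, but both reject the null hypothesis
# at conventional significance levels such as 0.05
# the verification treats the control click rate as a fixed, known value, while the two-sample z-test
# also accounts for sampling error in the control group, so it gives a larger p-value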
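
As an optional extra check that is not part of the lab's required steps, the normal approximation in Step 3 can be compared with an exact binomial tail probability. The sketch below uses only the standard library; `binom_sf` is a hypothetical helper written for this illustration (it is not a library function), and like the verification above it treats the control click rate as if it were known exactly.

```python
from math import lgamma, log, exp

def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing the pmf in log space."""
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p**i * (1 - p)**(n - i), computed without overflow
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return total

# Counts taken from the cells above
ccn, cvn = 932, 3332   # control clicks, control views
ecn, evn = 928, 2996   # experiment clicks, experiment views

# Probability of seeing at least ecn clicks among evn viewers at the control rate
exact_p = binom_sf(ecn, evn, ccn / cvn)
print(exact_p)
```

If the normal approximation is reasonable here, this value should be of the same order of magnitude as the Step 3 p-value.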

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the initial problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.