# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The scenario you'll be investigating uses data collected from the homepage of the music app Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [55]:
#Your code here
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('homepage_actions.csv')

In [19]:
df.head()

|   | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view |


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8188 entries, 0 to 8187
Data columns (total 4 columns):
timestamp    8188 non-null object
id           8188 non-null int64
group        8188 non-null object
action       8188 non-null object
dtypes: int64(1), object(3)
memory usage: 256.0+ KB


In [21]:
print('Number of types of groups: \n', df.group.value_counts(), '\n Number of types of action: \n', df.action.value_counts())

Number of types of groups: 
 control       4264
experiment    3924
Name: group, dtype: int64 
 Number of types of action: 
 view     6328
click    1860
Name: action, dtype: int64


In [22]:
# create subsets of viewers and clickers

viewers = df.loc[df.action == 'view']
clickers = df.loc[df.action == 'click']

len(viewers) + len(clickers)

8188

In [23]:
# use pd merge to find the intersection between these two subsets

viewers_and_clickers = viewers.merge(clickers,how='inner', on='id')
viewers_and_clickers

# 1860 viewers also clicked; of the 6328 viewers, the remaining 4468 did NOT click.

|   | timestamp_x | id | group_x | action_x | timestamp_y | group_y | action_y |
|---|---|---|---|---|---|---|---|
| 0 | 2016-09-24 20:57:20.336757 | 349125 | experiment | view | 2016-09-24 20:58:01.948663 | experiment | click |
| 1 | 2016-09-24 21:05:15.348935 | 601714 | experiment | view | 2016-09-24 21:06:27.553057 | experiment | click |
| 2 | 2016-09-24 21:29:19.766467 | 487634 | experiment | view | 2016-09-24 21:30:02.739756 | experiment | click |
| 3 | 2016-09-24 23:01:08.713402 | 468601 | experiment | view | 2016-09-24 23:01:12.108316 | experiment | click |
| 4 | 2016-09-25 00:00:47.700734 | 555973 | experiment | view | 2016-09-25 00:01:47.933853 | experiment | click |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1855 | 2017-01-17 23:19:39.649126 | 451198 | control | view | 2017-01-17 23:20:35.483601 | control | click |
| 1856 | 2017-01-17 23:46:19.329053 | 252195 | control | view | 2017-01-17 23:47:58.209653 | control | click |
| 1857 | 2017-01-18 00:55:55.026210 | 344770 | experiment | view | 2017-01-18 00:56:24.554729 | experiment | click |
| 1858 | 2017-01-18 08:53:50.910310 | 615849 | experiment | view | 2017-01-18 08:54:56.879682 | experiment | click |


In [24]:
viewers_and_clickers.columns

Index(['timestamp_x', 'id', 'group_x', 'action_x', 'timestamp_y', 'group_y',
       'action_y'],
      dtype='object')

In [25]:
# at least in viewers_and_clickers df, there is no overlap between control and experiment group
overlap_df = viewers_and_clickers.loc[viewers_and_clickers['group_x'] != viewers_and_clickers['group_y']]
overlap_df

|   | timestamp_x | id | group_x | action_x | timestamp_y | group_y | action_y |
|---|---|---|---|---|---|---|---|


In [35]:
# Subset df into the control and experiment groups
control = df.loc[df.group == 'control']
experiment = df.loc[df.group == 'experiment']

# If the groups were split cleanly, no id should appear in both.
# An inner merge on id keeps only the ids present in both subsets.
group_overlap = control.merge(experiment, how='inner', on='id')
group_overlap

# The inner merge on id did not come up with anything,
# therefore the two groups don't overlap.

In [38]:
# an alternative way to do the above
vid = set(df.loc[df.action == 'view']['id'].unique()) # unique IDs of viewers

cid = set(df.loc[df.action == 'click']['id'].unique()) # unique IDs of clickers

# lets create a subset of unique user IDs.
users = set(df['id'].unique()) # unique IDs of users

print(f'Length of data set is: {len(df)}')
print(f'Number of unique viewers: {len(vid)}')
print(f'Number of viewers who also clicked {len(vid & cid)}')
print(f'Number of viewers who DID NOT also click: {len(vid - cid)}')
print(f'Number of clickers who DID NOT also view: {len(cid - vid)}')

Length of data set is: 8188
Number of unique viewers: 6328
Number of viewers who also clicked 1860
Number of viewers who DID NOT also click: 4468
Number of clickers who DID NOT also view: 0
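
The hint about overlap between the control and experiment groups can also be checked without a merge, by counting how many distinct groups each id belongs to. The sketch below runs on a small made-up DataFrame (the `toy` data and the names `groups_per_id` and `overlapping_ids` are illustrative assumptions, not part of the lab's dataset); the same two lines could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data, not taken from homepage_actions.csv:
# id 3 appears in both groups on purpose.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "group": ["control", "control", "experiment", "control", "experiment"],
    "action": ["view", "click", "view", "view", "view"],
})

# Count the distinct groups each id was assigned to.
groups_per_id = toy.groupby("id")["group"].nunique()

# Any id with more than one group was exposed to both variants.
overlapping_ids = groups_per_id[groups_per_id > 1].index.tolist()
print(overlapping_ids)
```

An empty list would indicate that no user was assigned to both groups.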


## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [28]:
#Your code here
# Count the click events in each group
control_clickers = sum(control['action'] == 'click')
experiment_clickers = sum(experiment['action'] == 'click')

print(f'There were {control_clickers} clickers in the control group while there were {experiment_clickers} clickers in the experiment group')
# len() counts rows (one per view or click event), not unique users
print(f'There are {len(control)} rows in the control group and {len(experiment)} rows in the experiment group.')

There were 932 clickers in the control group while there were 928 clickers in the experiment group
There are 4264 rows in the control group and 3924 rows in the experiment group.


In [29]:
# control and experiment are slices of df; assign() returns new DataFrames,
# which avoids pandas' SettingWithCopyWarning
control = control.assign(count=1)
experiment = experiment.assign(count=1)


In [30]:
# build pivot tables

control = control.pivot(index='id',columns='action', values='count').fillna(0)

experiment = experiment.pivot(index = 'id', columns='action', values='count').fillna(0)
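
To make the effect of the pivot concrete, here is a minimal sketch on a made-up event log (the `events` and `per_user` names and the data are hypothetical). It illustrates how one row per event becomes one row per user, with a 0/1 indicator for each action.

```python
import pandas as pd

# Hypothetical event log: user 1 viewed and clicked, user 2 only viewed.
events = pd.DataFrame({
    "id": [1, 1, 2],
    "action": ["view", "click", "view"],
})
events = events.assign(count=1)

# One row per user, one column per action; actions a user never took become 0.
per_user = events.pivot(index="id", columns="action", values="count").fillna(0)
print(per_user)

# The mean of the 0/1 click column is the click-through rate.
print(per_user["click"].mean())
```

Because every user in this toy log has exactly one view, the mean of the `click` column is the share of viewers who clicked.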


In [31]:
import flatiron_stats as fs
fs.p_value_welch_ttest(control.click,experiment.click)

0.004466402814337078
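
`flatiron_stats` is a helper module that is not shown in this lab. As a cross-check, a Welch's t-test can be sketched with SciPy's `ttest_ind` and `equal_var=False`. The arrays below are made-up 0/1 click indicators, not the lab's data, and the one-sided conversion assumes the alternative hypothesis is that the experiment rate is higher; whether the helper above reports a one-sided or two-sided p-value is not stated here.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 click indicators (1 = clicked), standing in for
# control.click and experiment.click from the pivot tables.
control_clicks = np.array([0] * 70 + [1] * 30)
experiment_clicks = np.array([0] * 55 + [1] * 45)

# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_two_sided = stats.ttest_ind(
    experiment_clicks, control_clicks, equal_var=False
)

# One-sided p-value for H1: experiment rate > control rate.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_two_sided, p_one_sided)
```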

## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control groups. This binary variable would represent whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
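
In symbols, the three steps can be written as follows. The notation is introduced here for illustration: $x_c$ and $n_c$ are the number of clicks and users in the control group, $x_e$ and $n_e$ are the same quantities for the experiment group, and $\Phi$ is the standard normal CDF.

$$\hat{p}_c = \frac{x_c}{n_c}, \qquad E = n_e \cdot \hat{p}_c$$

$$z = \frac{x_e - E}{\sqrt{n_e \cdot \hat{p}_c \, (1 - \hat{p}_c)}}$$

$$p\text{-value} = 1 - \Phi(z)$$

The denominator of $z$ is the square root of the binomial variance $n \cdot p \cdot (1-p)$, evaluated with $n = n_e$ and $p = \hat{p}_c$.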

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [42]:
control_rate = control.click.mean() # this is the rate of the control group.
expected_number_clicks_experiment = control_rate * len(experiment)
expected_number_clicks_experiment

838.0168067226891

In [43]:
# actual number of clicks by experiment group
experiment_clickers

928

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [45]:
#Your code here
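n = len(experiment)   # number of users in the experiment group
p = control_rate      # click-through rate of the control group
var = n*p*(1-p)       # binomial variance of the number of clicks
sd = var**0.5         # standard deviation
sd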


24.568547907005815

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [53]:
#Your code here

num = experiment_clickers - expected_number_clicks_experiment
z_score = num/sd
z_score

3.6625360854823588

In [57]:
1-stats.norm.cdf(z_score) # one-sided (upper-tail) p-value

0.00012486528006949715
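
The three verification steps can also be bundled into one small function. This is a sketch rather than part of the lab: `binomial_z_check` is a hypothetical name, it uses `math.erfc` from the standard library in place of `stats.norm.cdf`, and the user counts (3332 and 2996) are derived by assuming one view row per user and at most one click per user, i.e. 4264 - 932 and 3924 - 928.

```python
import math

def binomial_z_check(control_clicks, control_users, experiment_clicks, experiment_users):
    """Return (expected_clicks, z_score, one_sided_p) for the experiment group."""
    p = control_clicks / control_users                # control click-through rate
    expected = experiment_users * p                   # Step 1: expected clicks
    sd = math.sqrt(experiment_users * p * (1 - p))    # binomial standard deviation
    z = (experiment_clicks - expected) / sd           # Step 2: z-score
    p_value = 0.5 * math.erfc(z / math.sqrt(2))       # Step 3: upper-tail normal probability
    return expected, z, p_value

# Assumed user counts: 4264 - 932 = 3332 control users,
# 3924 - 928 = 2996 experiment users.
expected, z, p_value = binomial_z_check(932, 3332, 928, 2996)
print(expected, z, p_value)
```

Under those assumptions the function reproduces the values computed step by step in the cells above.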

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.