# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis, along with your data exploration and processing skills. The scenario you'll investigate uses data collected from the homepage of a music app page for Audacity.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the id column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data; did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [25]:
from statsmodels.stats.proportion import proportions_ztest
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')


df = pd.read_csv('homepage_actions.csv')
df.head(15)

| | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view |
| 5 | 2016-09-24 20:32:25.712659 | 681598 | experiment | view |
| 6 | 2016-09-24 20:39:03.248853 | 522116 | experiment | view |
| 7 | 2016-09-24 20:57:20.336757 | 349125 | experiment | view |
| 8 | 2016-09-24 20:58:01.948663 | 349125 | experiment | click |
| 9 | 2016-09-24 21:00:12.278374 | 560027 | control | view |


In [26]:
# 1. How many viewers also clicked?
# First, we group the data by 'id' and then check if each group has both 'view' and 'click' actions.
grouped = df.groupby('id')['action'].agg(set).reset_index()
viewers_who_clicked = grouped[grouped['action'].apply(lambda x: 'view' in x and 'click' in x)]

# 2. Are there any anomalies with the data; did anyone click who didn't view?
# We check if there are any 'click' actions without a corresponding 'view' action.
clickers_who_didnt_view = grouped[grouped['action'].apply(lambda x: 'click' in x and 'view' not in x)]

# 3. Is there any overlap between the control and experiment groups?
# We will check if there are any ids that are present in both groups.
ids_in_experiment = df[df['group'] == 'experiment']['id'].unique()
ids_in_control = df[df['group'] == 'control']['id'].unique()
overlap_ids = set(ids_in_experiment).intersection(ids_in_control)

# Let's summarize the findings.
exploratory_summary = {
    'viewers_who_clicked_count': len(viewers_who_clicked),
    'viewers_who_clicked_ids': viewers_who_clicked['id'].tolist(),
    'clickers_who_didnt_view_count': len(clickers_who_didnt_view),
    'clickers_who_didnt_view_ids': clickers_who_didnt_view['id'].tolist(),
    'overlap_ids_count': len(overlap_ids),
    'overlap_ids': list(overlap_ids)
}

exploratory_summary


{'viewers_who_clicked_count': 1860,
 'viewers_who_clicked_ids': [182994, 183141, 183248, 183617, 183938, ...],
 ...}
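
Printing every id makes the summary hard to read. A minimal sketch of a counts-only summary follows; it runs on a tiny made-up action log (the `toy` frame and its values are assumptions for illustration, not rows from `homepage_actions.csv`), but the same grouping logic should carry over to `df`.

```python
import pandas as pd

# Tiny made-up action log with the same columns as the lab's dataset
toy = pd.DataFrame({
    "id":     [1, 1, 2, 3, 4, 4],
    "group":  ["experiment", "experiment", "control", "control",
               "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "view", "click"],
})

# One set of actions per user id
actions_per_id = toy.groupby("id")["action"].agg(set)

# Number of distinct groups each id appears in; more than 1 means overlap
groups_per_id = toy.groupby("id")["group"].nunique()

summary = {
    "viewers_who_clicked": int(
        actions_per_id.apply(lambda s: {"view", "click"} <= s).sum()
    ),
    "clicked_without_view": int(
        actions_per_id.apply(lambda s: "click" in s and "view" not in s).sum()
    ),
    "ids_in_both_groups": int((groups_per_id > 1).sum()),
}
print(summary)
```

Reporting counts first, and inspecting individual ids only when a count is unexpectedly nonzero, keeps the notebook output short.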

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.
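
Stated in terms of the underlying click-through rates, "more effective" can be written as a one-sided pair of hypotheses. The symbols $p_{\text{exp}}$ and $p_{\text{ctrl}}$ are notation introduced here for the true click-through rates of the experiment and control homepages; they do not appear in the original lab.

$$H_0: p_{\text{exp}} \le p_{\text{ctrl}} \qquad\qquad H_1: p_{\text{exp}} > p_{\text{ctrl}}$$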

In [27]:
# H0: The experimental homepage was not more effective than that of the control group
# H1: The experimental homepage was more effective than that of the control group
# alpha = 0.05


In [28]:
# Create both samples
experimental_df = df[df['group']=='experiment']
control_df = df[df['group']=='control']

# Calculating the number of clicks for each sample
number_experimental_clicks = (experimental_df['action']=='click').sum()

number_control_clicks = (control_df['action']=='click').sum()

# Calculating the number of views for each sample
number_experimental_views = (experimental_df['action']=='view').sum()

number_control_views = (control_df['action']=='view').sum()

# Calculating the click-through rate (CTR) for each sample
CRT_experimental = number_experimental_clicks/number_experimental_views

CRT_control = number_control_clicks/number_control_views

# Preparation for a z-test
n_clicks = np.array([number_experimental_clicks,number_control_clicks])
n_views = np.array([number_experimental_views,number_control_views])



In [29]:
proportions_ztest(n_clicks, n_views)

(2.618563885349469, 0.008830075576595804)

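The p-value (about 0.0088) is below alpha = 0.05, so we reject the null hypothesis. Note that `proportions_ztest` performs a two-sided test by default; passing `alternative='larger'` would test the one-sided hypothesis stated above and, because the z-statistic is positive, would halve the p-value.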
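
To make the mechanics of the test concrete, here is a minimal standard-library sketch of a pooled two-proportion z-test that returns both the one-sided and two-sided p-values. The function name `two_proportion_ztest` and the counts in the usage line are made up for illustration; this is not the statsmodels implementation, only the textbook formula it is understood to follow by default.

```python
import math

def two_proportion_ztest(clicks_a, views_a, clicks_b, views_b):
    """Pooled z-test comparing rate_a against rate_b.

    Returns (z, one-sided p for H1: rate_a > rate_b, two-sided p).
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b

    # Under H0 both groups share one rate, estimated from the pooled data
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))

    z = (rate_a - rate_b) / std_err

    # Upper-tail and two-tailed areas of the standard normal distribution
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_one_sided, p_two_sided

# Hypothetical counts: 120 clicks in 1000 views vs. 100 clicks in 1000 views
z, p_one, p_two = two_proportion_ztest(120, 1000, 100, 1000)
print(f"z = {z:.3f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

When `z` is positive, the two-sided p-value is exactly twice the one-sided one, which is why the choice of alternative matters when comparing against alpha.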

## Verifying Results

One sensible way to formulate the data for the hypothesis test above is to create a binary variable for each individual in the experiment and control groups. This variable records whether or not that individual clicked on the homepage: 1 if they did and 0 if they did not.
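
A minimal sketch of building such a binary variable with pandas is shown below. The `toy` frame, the `is_click` helper column, and the `clicked` column name are assumptions introduced for illustration; the real data would come from `homepage_actions.csv`.

```python
import pandas as pd

# Tiny made-up action log with the same columns as the lab's dataset
toy = pd.DataFrame({
    "id":     [1, 1, 2, 3, 4, 4, 5],
    "group":  ["experiment", "experiment", "control", "control",
               "experiment", "experiment", "control"],
    "action": ["view", "click", "view", "view", "view", "click", "view"],
})

# One row per individual: 1 if they have at least one click row, else 0
clicked = (
    toy.assign(is_click=toy["action"].eq("click"))
       .groupby(["group", "id"])["is_click"]
       .max()
       .astype(int)
       .rename("clicked")
       .reset_index()
)

# For a 0/1 column, sum is the click count and mean is the click-through rate
rates = clicked.groupby("group")["clicked"].agg(["sum", "count", "mean"])
print(rates)
```

With one 0/1 value per individual, each group's click count can be treated as a binomial variable, which is what the verification steps rely on.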

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
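
The three steps above can be sketched as one small function using only the standard library. The name `verify_with_binomial` and the numbers in the usage line are hypothetical; the sketch assumes the baseline rate is known exactly and uses the normal approximation to the binomial.

```python
import math

def verify_with_binomial(observed_clicks, n_views, baseline_rate):
    """Return (expected clicks, standard deviation, z-score, one-sided p-value)."""
    # Step 1: expected clicks if the group converted at the baseline rate
    expected = n_views * baseline_rate

    # Step 2: binomial standard deviation sqrt(n * p * (1 - p)), then the z-score
    std = math.sqrt(n_views * baseline_rate * (1 - baseline_rate))
    z = (observed_clicks - expected) / std

    # Step 3: upper-tail area of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return expected, std, z, p_value

# Hypothetical numbers: 130 clicks observed in 1000 views, baseline rate 10%
expected, std, z, p_value = verify_with_binomial(
    observed_clicks=130, n_views=1000, baseline_rate=0.10
)
print(f"expected = {expected:.1f}, std = {std:.2f}, z = {z:.2f}, p = {p_value:.5f}")
```

Because the baseline rate is treated as fixed, this check ignores the sampling variability of the control group, so its p-value will generally be smaller than that of a two-sample test on the same data.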

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [30]:
# Create both samples
experimental_df = df[df['group']=='experiment']
control_df = df[df['group']=='control']

# Calculating the number of clicks for each sample
number_experimental_clicks = (experimental_df['action']=='click').sum()

number_control_clicks = (control_df['action']=='click').sum()

# Calculating the number of views for each sample
number_experimental_views = (experimental_df['action']=='view').sum()

number_control_views = (control_df['action']=='view').sum()

# Calculating the click-through rate (CTR) for each sample
CRT_experimental = number_experimental_clicks/number_experimental_views

CRT_control = number_control_clicks/number_control_views

# Expected number of clicks for the experiment group if it had the control group's CTR
n_expected_clicks_experimental = CRT_control*number_experimental_views

# Number of expected clicks for the experiment group
n_expected_clicks_experimental

838.0168067226891

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [31]:
# Binomial standard deviation sqrt(n*p*(1-p)) of the experiment click count under the control CTR
std_experiment = np.sqrt(CRT_control*number_experimental_views*(1-CRT_control))
std_experiment

24.568547907005815

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [32]:
from scipy.stats import norm

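# Number of standard deviations the observed clicks lie above the expected count
z_score = (number_experimental_clicks - n_expected_clicks_experimental)/std_experiment

# One-sided (upper-tail) p-value from the standard normal distribution
p_value = 1 - norm.cdf(z_score)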

p_value

0.00012486528006949715

### Analysis:

Does this result roughly match that of the previous statistical test?

> Comment: **Your analysis here**

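Roughly, yes. Both p-values fall below alpha = 0.05 (about 0.0088 from the z-test and about 0.00012 here), so in both cases we reject H0. The two values are not identical: this check is one-sided and treats the control click-through rate as a fixed, known value, ignoring its sampling variability, whereas `proportions_ztest` was run with its default two-sided alternative and accounts for the variability of both samples.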

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the initial problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values.