# Website A/B Testing - Lab

## Introduction

In this lab, you'll get another chance to practice conducting a full A/B test analysis. It is also a chance to practice your data exploration and processing skills! The data you'll investigate was collected from the homepage of Audacity, a music app.

## Objectives

You will be able to:
* Analyze the data from a website A/B test to draw relevant conclusions
* Explore and analyze web action data

## Exploratory Analysis

Start by loading the dataset stored in the file `homepage_actions.csv`. Then conduct an exploratory analysis to get familiar with the data.

> Hints:
> * Start by investigating the `id` column:
>     * How many viewers also clicked?
>     * Are there any anomalies in the data? Did anyone click who didn't view?
>     * Is there any overlap between the control and experiment groups?
>         * If so, how do you plan to account for this in your experimental design?

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('homepage_actions.csv')

# Display the first few rows of the dataset
data.head()

|   | timestamp | id | group | action |
|---|---|---|---|---|
| 0 | 2016-09-24 17:42:27.839496 | 804196 | experiment | view |
| 1 | 2016-09-24 19:19:03.542569 | 434745 | experiment | view |
| 2 | 2016-09-24 19:36:00.944135 | 507599 | experiment | view |
| 3 | 2016-09-24 19:59:02.646620 | 671993 | control | view |
| 4 | 2016-09-24 20:26:14.466886 | 536734 | experiment | view |


In [2]:
# 1. Determine the number of unique users who viewed and clicked
viewed_users = set(data[data['action'] == 'view']['id'].unique())
clicked_users = set(data[data['action'] == 'click']['id'].unique())

# 2. Identify anomalies: users who clicked without viewing
click_without_view = clicked_users - viewed_users

# 3. Check for overlap between the control and experimental groups
control_users = set(data[data['group'] == 'control']['id'].unique())
experiment_users = set(data[data['group'] == 'experiment']['id'].unique())
overlap_users = control_users.intersection(experiment_users)

len(viewed_users), len(clicked_users), len(click_without_view), len(overlap_users)

(6328, 1860, 0, 0)

The dataset contains four columns:

1. `timestamp`: The time at which the action was taken.
2. `id`: The unique identifier for each user.
3. `group`: Specifies whether the user was part of the control group or the experimental group.
4. `action`: Indicates whether the user viewed the page or clicked on it.

To address the questions posed:

1. We'll first determine how many unique users viewed the page and how many clicked on it.
2. We'll then identify if there are any anomalies, such as users who clicked without viewing.
3. We'll check for any overlap between the control and experimental groups.
4. If overlaps are found, we'll discuss how to account for this in our analysis.

Here are the findings:

1. **Number of unique users who viewed the page:** 6,328
2. **Number of unique users who clicked on the page:** 1,860
3. **Anomalies (users who clicked without viewing):** None. Every user who clicked also has a view recorded.
4. **Overlap between the control and experimental groups:** None. This means no user was mistakenly categorized in both groups.

Given these results, there is no overlap between the control and experimental groups, so we don't need to make adjustments for this in our analysis.
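Had the overlap check returned a non-empty set, one common option would be to drop those users before computing click-through rates. The following is a minimal sketch on a small made-up action log; the rows and the names `toy`, `overlap_ids`, and `clean` are hypothetical and not taken from `homepage_actions.csv`.

```python
import pandas as pd

# Hypothetical toy log: user 3 appears in both groups (not real lab data)
toy = pd.DataFrame({
    "id":     [1, 1, 2, 3, 3, 4],
    "group":  ["control", "control", "experiment",
               "control", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "view", "view"],
})

# Count how many distinct groups each user was assigned to
groups_per_user = toy.groupby("id")["group"].nunique()
overlap_ids = set(groups_per_user[groups_per_user > 1].index)

# Drop every row belonging to a user seen in more than one group
clean = toy[~toy["id"].isin(overlap_ids)]

print(sorted(overlap_ids))
print(clean.groupby(["group", "action"]).size())
```

Dropping ambiguous users is only one possible choice; it is safest when the overlap is small relative to the sample.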

## Conduct a Statistical Test

Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group.

In [3]:
from statsmodels.stats.proportion import proportions_ztest

# 1. Compute the number of successes (clicks) and trials (views) for both groups
successes_control = len(data[(data['group'] == 'control') & (data['action'] == 'click')])
successes_experiment = len(data[(data['group'] == 'experiment') & (data['action'] == 'click')])

trials_control = len(data[(data['group'] == 'control') & (data['action'] == 'view')])
trials_experiment = len(data[(data['group'] == 'experiment') & (data['action'] == 'view')])

# 2. Conduct the Z-test
count = [successes_control, successes_experiment]
nobs = [trials_control, trials_experiment]

# alternative='smaller' tests H1: control CTR < experiment CTR (first entry vs. second)
z_stat, p_value = proportions_ztest(count, nobs, alternative='smaller')

# 3. Display the results
print(z_stat, p_value)

-2.618563885349469 0.004415037788297902
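As a rough cross-check on what `proportions_ztest` computes by default, the pooled two-proportion z-statistic can be written out by hand. The sketch below uses only the standard library. The four counts are assumptions inferred from the outputs shown in this lab (they are not printed by the cell above), so substitute `successes_control`, `trials_control`, and so on when running it in the notebook.

```python
from math import sqrt
from statistics import NormalDist

# Assumed counts (inferred from this lab's outputs, not printed above)
clicks_c, views_c = 932, 3332   # control
clicks_e, views_e = 928, 2996   # experiment

p_c = clicks_c / views_c
p_e = clicks_e / views_e

# Pooled click-through rate under H0: both pages share one rate
p_pool = (clicks_c + clicks_e) / (views_c + views_e)
se = sqrt(p_pool * (1 - p_pool) * (1 / views_c + 1 / views_e))

z = (p_c - p_e) / se
# One-sided p-value for H1: control rate < experiment rate
p_value = NormalDist().cdf(z)

print(round(z, 3), round(p_value, 4))
```

With these assumed counts, the printed values should agree with the z-statistic and p-value reported above to about three decimal places.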


## Verifying Results

One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not. 
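One way to build such a per-user binary variable is sketched below on a made-up log. The toy rows and the `clicked` column name are illustrative assumptions, not part of the lab data.

```python
import pandas as pd

# Hypothetical toy log (not the lab data)
toy = pd.DataFrame({
    "id":     [1, 1, 2, 3, 3, 4],
    "group":  ["control", "control", "control",
               "experiment", "experiment", "experiment"],
    "action": ["view", "click", "view", "view", "click", "view"],
})

# One row per user: 1 if the user has any click action, else 0
clicked = (
    toy.assign(clicked=toy["action"].eq("click").astype(int))
       .groupby(["group", "id"])["clicked"]
       .max()
       .reset_index()
)

# The mean of a 0/1 variable is the group's click-through rate
ctr = clicked.groupby("group")["clicked"].mean()
print(ctr)
```

Because the variable is 0 or 1, its group mean is the click-through rate and its group sum is the number of clicks.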

The variance for the number of successes in a sample of a binomial variable with n observations is given by:

$$n \cdot p \cdot (1-p)$$

Given this, perform 3 steps to verify the results of your statistical test:
1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. 
3. Finally, calculate a p-value using the normal distribution based on this z-score.
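
In symbols, the three steps can be summarized as follows. The notation is introduced here for illustration only: $n$ is the number of experiment-group views, $x$ is the observed number of experiment-group clicks, $p_0$ is the control click-through rate, and $\Phi$ is the standard normal CDF.

$$
\begin{aligned}
\mu &= n\,p_0 \\
\sigma &= \sqrt{n\,p_0\,(1-p_0)} \\
z &= \frac{x-\mu}{\sigma}, \qquad p\text{-value} = 1-\Phi(z)
\end{aligned}
$$

Note that this treats $p_0$ as a known constant, so $\sigma$ reflects only the sampling variability of the experiment group.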

### Step 1:
Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 

In [4]:
# Calculate the CTR of the control group
ctr_control = successes_control / trials_control

# Calculate the expected number of clicks for the experiment group
expected_clicks_experiment = ctr_control * trials_experiment

expected_clicks_experiment


838.0168067226891

### Step 2:
Calculate the number of standard deviations that the actual number of clicks was from this estimate.

In [5]:
import numpy as np

# Calculate the standard deviation for the expected number of clicks in the experiment group
std_dev = np.sqrt(trials_experiment * ctr_control * (1 - ctr_control))

# Calculate the z-score
actual_clicks_experiment = successes_experiment
z_score = (actual_clicks_experiment - expected_clicks_experiment) / std_dev

z_score

3.6625360854823588

### Step 3: 
Finally, calculate a p-value using the normal distribution based on this z-score.

In [6]:
from scipy.stats import norm

# Calculate the p-value using the standard normal distribution
p_value_verification = 1 - norm.cdf(z_score)

p_value_verification


0.00012486528006949715

### Analysis:

Does this result roughly match that of the previous statistical test?

**Analysis:**

Yes, roughly. Both methods lead to the same conclusion, although the p-values are not identical:

- The p-value from the two-proportion z-test was 0.0044.
- The p-value from the binomial variance verification was 0.000125.

Both p-values are well below the usual significance level ($\alpha = 0.05$), so the null hypothesis is rejected in both cases. The verification p-value is smaller because it treats the control click-through rate as a fixed, known value, whereas the two-proportion z-test also accounts for sampling variability in the control group.

**Comment:**

The agreement between the two approaches supports the conclusion that the experimental homepage had a higher click-through rate than the control homepage. Reaching the same conclusion from a somewhat different angle gives an extra layer of confidence in the result. Cross-checking results in this way is recommended.

## Summary

In this lab, you got more practice designing and conducting A/B tests. This required additional work preprocessing the data and formulating the problem in a suitable manner. You also saw how to verify results, strengthening your knowledge of binomial variables and reviewing the central limit theorem, standard deviation, z-scores, and their accompanying p-values.