# Designing and Analyzing the Results of an A/B Test

## Experiment Overview

Udacity is an online education company.  In this experiment, they tested a change to their course sign-up flow.  After clicking the “Start free trial” button, users were asked how many hours per week they could dedicate to the course.  If a user entered fewer than five hours, they were told that they would need to devote more time than that, and were informed that the course materials could be accessed for free if they were unsure of their commitment.  If the user entered five or more hours, they were directed to enter their credit card information and start their 14-day free trial.

<img src = "course-sign-up.png">
> Screen capture of the experimental treatment

In creating this change, Udacity wanted to determine whether this “warning” would help users better understand the commitment the course requires, and hence reduce future frustration.  Udacity hoped this would lead more users to finish their free trial and remain enrolled into the first pay period.

## Load Data

In [44]:
import pandas as pd
import math

In [45]:
# Baseline data
xlsx = pd.ExcelFile('baseline_values.xlsx')
baseline = pd.read_excel(xlsx, 'Sheet1', header=None, names = ['full_metric_name','value'])
baseline.set_index([['pageviews', 'clicks', 'enrollments', 'ctp', 'p_enroll_click', 'p_pay_enroll', 'p_pay_click']], inplace=True)
baseline

Unnamed: 0,full_metric_name,value
pageviews,Unique cookies to view course overview page pe...,40000.0
clicks,"Unique cookies to click ""Start free trial"" per...",3200.0
enrollments,Enrollments per day:,660.0
ctp,"Click-through-probability on ""Start free trial"":",0.08
p_enroll_click,"Probability of enrolling, given click:",0.20625
p_pay_enroll,"Probability of payment, given enroll:",0.53
p_pay_click,"Probability of payment, given click",0.109313


In [46]:
# Experiment data - control group
xlsx = pd.ExcelFile('experiment_results.xlsx')
res_cont = pd.read_excel(xlsx, 'Control')
res_cont

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0
5,"Thu, Oct 16",9670,823,138.0,82.0
6,"Fri, Oct 17",9008,748,146.0,76.0
7,"Sat, Oct 18",7434,632,110.0,70.0
8,"Sun, Oct 19",8459,691,131.0,60.0
9,"Mon, Oct 20",10667,861,165.0,97.0


In [47]:
# Experiment data - experiment group
res_exp = pd.read_excel(xlsx, 'Experiment')
res_exp

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0
5,"Thu, Oct 16",9500,788,129.0,61.0
6,"Fri, Oct 17",9088,780,127.0,44.0
7,"Sat, Oct 18",7664,652,94.0,62.0
8,"Sun, Oct 19",8434,697,120.0,77.0
9,"Mon, Oct 20",10496,860,153.0,98.0


## Experiment Design

### Metric Choices

Invariant Metrics:
* Number of unique cookies to view the course overview page
* Number of unique cookies to click the “Start free trial” button
* Click-through-probability of clicking “Start free trial” button from the course overview page

Evaluation Metrics:
* Gross conversion: number of user-ids to complete checkout divided by the number of unique cookies to click the “Start free trial” button
* Net conversion: number of user-ids to remain enrolled past the 14-day boundary divided by the number of unique cookies to click the “Start free trial” button


#### Invariant Metrics

Invariant metrics shouldn’t change across the experiment and control groups.  We use invariant metrics to size our experiment and ensure that our experiment is run properly.  

I chose the following as invariant metrics:
* Number of unique cookies to view the course overview page
* Number of unique cookies to click the “Start free trial” button
* Click-through-probability of clicking “Start free trial” button from the course overview page

Since we are not making any changes before the user presses the button, we would want the same number of clicks to the button and the same click-through-probability for it, as well as the same number of cookies (as a proxy for users) to visit the course overview page in both the experiment and control groups.  A metric I would not use for invariant checking is the number of users who enrolled in the free trial.  Our experiment may, and is meant to, affect how many users decide to enroll by possibly convincing the less fervent to access free course materials in lieu of starting the free trial.

These metrics were chosen as a way to make sure our two groups are sized similarly and to provide sanity checks.  The experiment does not affect any part of the user experience before the pressing of the “Start free trial” button, so if there are differences between the experimental and control groups for these metrics there is an error with the setup of the experiment, event-capturing, etc.

#### Evaluation Metrics

Evaluation metrics are used to compare the experiment and control groups after the treatment has been applied; these metrics are used to determine if the experiment was a success.

I chose the following as evaluation metrics:
* Gross conversion: number of user-ids to complete checkout divided by the number of unique cookies to click the “Start free trial” button
* Net conversion: number of user-ids to remain enrolled past the 14-day boundary divided by the number of unique cookies to click the “Start free trial” button

To understand the effects of the experimental treatment we need to use two evaluation metrics: gross conversion and net conversion.  Gross conversion quantifies how the experimental change affects the number of users that complete the checkout after clicking on the button.  This helps Udacity understand what portion of their users are abandoning the free trial sign-up in favor of accessing the free course materials due to the experimental treatment.  

Net conversion is also a useful evaluation metric to track, since it captures the overall effect of the “warning” across the whole funnel, from clicking the free trial button to entering the first pay period.  If gross conversion moves in the intended direction but net conversion does not change by enough, the change may not be worth launching.

The following metric was not chosen:
* Retention: number of user-ids to remain enrolled past the 14-day boundary (to make at least one payment) divided by the number of user-ids to complete checkout (enter their credit card details to start the free trial)

Retention measures the effect of our decision on the users that still choose to start a free trial after being advised about the commitment necessary for the course; retention can help us understand if a greater proportion of users in the experimental group that start a free trial are less frustrated, and more likely to study past the free 14 days, as compared with those in the control group.  This would be a valuable metric, but since the unit of analysis for this variable is different from the unit of diversion, the number of pageviews necessary to track this metric is far too large to be feasible.  This is discussed in more depth in the section concerned with experiment sizing.

A minimum practical change of 0.01 for gross conversion and 0.0075 for net conversion would be required in order to launch the change, as stated by Udacity.

### Measuring Standard Deviation

In [14]:
scale_factor = 5000/baseline.loc['pageviews', 'value']
scale_factor

0.125

In [15]:
values = baseline['value'].tolist()

# Scale pageviews, clicks, enrollments for the sample of 5000 pageviews
# Leave the probabilities the same
sample_vals = []
for i in range(0, len(values)):
    if i < 3:
        sample_vals.append(values[i]*scale_factor)
    else:
        sample_vals.append(values[i])
        
baseline['sample'] = sample_vals
baseline

Unnamed: 0,full_metric_name,value,sample
pageviews,Unique cookies to view course overview page pe...,40000.0,5000.0
clicks,"Unique cookies to click ""Start free trial"" per...",3200.0,400.0
enrollments,Enrollments per day:,660.0,82.5
ctp,"Click-through-probability on ""Start free trial"":",0.08,0.08
p_enroll_click,"Probability of enrolling, given click:",0.20625,0.20625
p_pay_enroll,"Probability of payment, given enroll:",0.53,0.53
p_pay_click,"Probability of payment, given click",0.109313,0.109313


#### Gross Conversion

In [20]:
# p = probability of enrolling given click
p = baseline.loc['p_enroll_click', 'value']
q = 1 - p

gc_std = math.sqrt(p*q/baseline.loc['clicks', 'sample'])
gc_std

0.020230604137049392

#### Net Conversion

In [21]:
# p = probability of making a payment given click
p = baseline.loc['p_pay_click', 'value']
q = 1 - p

nc_std = math.sqrt(p*q/baseline.loc['clicks', 'sample'])
nc_std

0.01560154458248846

#### Standard Deviation Summary

| Evaluation Metric | Standard Deviation   |
|:------------------|:---------------------|
| Gross conversion  | 0.020230604137049392 |
| Net conversion    | 0.01560154458248846  |
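
Both estimates above use the same analytic formula for the standard deviation of a binomial proportion, so they can be wrapped in one helper. The sketch below is only an illustration: the function name `binomial_std` is hypothetical, and the inputs are copied by hand from the baseline table rather than read from the spreadsheet.

```python
import math

def binomial_std(p, n):
    """Analytic standard deviation of a proportion p estimated from n trials."""
    return math.sqrt(p * (1 - p) / n)

# Assumed inputs, copied from the baseline table:
# 400 clicks are expected in a sample of 5,000 pageviews.
sample_clicks = 400
gc_std = binomial_std(0.20625, sample_clicks)    # gross conversion
nc_std = binomial_std(0.1093125, sample_clicks)  # net conversion

print(round(gc_std, 4), round(nc_std, 4))
```

The analytic estimate is reasonable here because, for both metrics, the denominator (cookies that click) matches the unit of diversion.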

### Sizing

The [Sample Size Calculator](http://www.evanmiller.org/ab-testing/sample-size.html) was used with the appropriate baseline values, and a statistical power $1-\beta = 0.8$ and significance level $\alpha = 0.05$ to calculate the necessary sample size required to detect the practical significance.

Given the baseline value of gross conversion (the probability of enrolling given click) of 0.20625 and a practical significance of 1%, 25,835 cookies that click the button are required in each group.  Given the baseline value of net conversion (the probability of making a payment given click) of 0.1093125 and a practical significance of 0.75%, 27,413 cookies that click the button are required in each group.

Since the unit of diversion is a cookie, we need to divide by the click-through-probability for the "Start free trial" button (0.08) to get the pageviews, and multiply by two to get the total required for the control and experiment.  This yields 645,875 pageviews for gross conversion and 685,325 for net conversion.  685,325 pageviews are therefore used for the experiment, in order to detect effects for both metrics.

In [83]:
# Results from the sample size calculator
gc_cookies = 25835
nc_cookies = 27413

gc_pageviews = int((gc_cookies * 2)/0.08)
nc_pageviews = int((nc_cookies * 2)/0.08)

| Evaluation Metric | Pageviews Required |
|:------------------|:-------------------|
| Gross conversion  | 645,875            |
| Net conversion    | 685,325            |
| FINAL             | 685,325            |

The unit of analysis for gross conversion and net conversion is the number of cookies that click the button.  However, for retention the unit of analysis is the number of user-ids that complete checkout.  The unit of diversion for the experiment is a cookie.  Since the unit of analysis for retention is not the same as the unit of diversion, it would require many more pageviews than gross conversion or net conversion.  In order to test retention, we would require 39,115 user-ids that have completed checkout in each group, given a baseline value of 0.53 and a practical significance of 1%.  Doubling this number to cover both groups, then dividing by the click-through-probability on the button (0.08) times the probability of enrolling given a click (0.20625), yields the number of pageviews required.
```
(39115 * 2)/(0.08*0.20625) = 4,741,212.121
```
This number of pageviews would greatly increase the length of the experiment, hence retention was not chosen as an evaluation metric.
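
The sample sizes quoted in this section came from the online calculator, but they can be approximated offline with a standard two-proportion formula. The sketch below is an assumption about how such a calculator might work, not a confirmed description of its internals; the function names `sample_size_per_group` and `pageviews_needed` are hypothetical. It evaluates both possible directions of the change and keeps the more demanding one.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p, d, alpha=0.05, power=0.8):
    """Approximate per-group sample size needed to detect an absolute
    change d from a baseline proportion p with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    def n_for(p_alt):
        sd_null = math.sqrt(2 * p * (1 - p))
        sd_alt = math.sqrt(p * (1 - p) + p_alt * (1 - p_alt))
        return ((z_alpha * sd_null + z_beta * sd_alt) / d) ** 2

    # Keep the larger requirement of the two directions (p + d or p - d).
    return max(n_for(p + d), n_for(p - d))

def pageviews_needed(n_per_group, units_per_pageview):
    """Convert a per-group sample size into total pageviews for both groups."""
    return 2 * n_per_group / units_per_pageview

gc_n = sample_size_per_group(0.20625, 0.01)      # gross conversion
nc_n = sample_size_per_group(0.1093125, 0.0075)  # net conversion
ret_n = sample_size_per_group(0.53, 0.01)        # retention

# Clicks occur on 8% of pageviews; enrollments on 8% * 20.625% of pageviews.
print(round(gc_n), round(pageviews_needed(round(gc_n), 0.08)))
print(round(nc_n), round(pageviews_needed(round(nc_n), 0.08)))
print(round(ret_n), round(pageviews_needed(round(ret_n), 0.08 * 0.20625)))
```

Under these assumptions the results should land within about one unit of the calculator's figures; small differences can come from rounding conventions.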

### Duration and Exposure

This experiment represents a somewhat risky change for Udacity, since it could lead more users to access the free course materials instead of enrolling in the free trial, which could ultimately lead to fewer paid enrollments.  In addition, any flaws in the implementation could dissuade users from enrolling.  In order to mitigate the risk while still diverting enough traffic to avoid overly prolonging the experiment, I would divert 50% of the traffic to this experiment.  Given the baseline value of 40,000 daily pageviews and the 685,325 pageviews required, the experiment would take 35 days to run.
```
685,325/(40000 * 0.5)  = 34.26625
```
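
To see how the duration responds to the share of traffic diverted, the same arithmetic can be repeated for several fractions. This is a small sketch; the helper name `days_to_run` and the list of fractions are illustrative assumptions, while the pageview figures come from the text above.

```python
import math

def days_to_run(pageviews_required, daily_pageviews, traffic_fraction):
    """Whole days needed to collect the required pageviews
    when only a fraction of daily traffic enters the experiment."""
    return math.ceil(pageviews_required / (daily_pageviews * traffic_fraction))

for fraction in (0.25, 0.5, 0.75, 1.0):
    print(fraction, days_to_run(685325, 40000, fraction))
```

Rounding up with `math.ceil` reflects that a partial day still has to be run in full.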

## Experiment Analysis

### Sanity Checks

For each of the invariant metrics, a 95% confidence interval needs to be created for the values that one expects to observe, in order to check the sizing and proper execution of the experiment.

The meaning of each field name:
* Pageviews: Number of unique cookies to view the course overview page
* Clicks: Number of unique cookies to click the “Start free trial” button
* Enrollments: Number of user-ids to enroll
* Payments: Number of user-ids who enrolled and remained enrolled for 14 days and thus made a payment

Invariant metrics to check:
1.	Number of unique cookies to view the course overview page (pageviews)
2.	Number of unique cookies to click the “Start free trial” button (clicks)
3.	Click-through-probability of clicking “Start free trial” button from the course overview page

#### Checking pageviews

In [48]:
tot_pvs_cont = res_cont['Pageviews'].sum()
tot_pvs_exp = res_exp['Pageviews'].sum()

print(tot_pvs_cont)
print(tot_pvs_exp)

345543
344660


In [38]:
std_err_pvs = math.sqrt((0.5*0.5)/(tot_pvs_cont + tot_pvs_exp))
std_err_pvs

0.0006018407402943247

In [41]:
margin_err_pvs = 1.96 * std_err_pvs
margin_err_pvs

0.0011796078509768765

In [43]:
conf_interval_pvs = [0.5 - margin_err_pvs, 0.5 + margin_err_pvs]
conf_interval_pvs

[0.49882039214902313, 0.5011796078509769]

In [50]:
actual_pvs = tot_pvs_cont/(tot_pvs_cont + tot_pvs_exp)
actual_pvs

0.5006396668806133

#### Checking clicks

In [54]:
tot_clicks_cont = res_cont['Clicks'].sum()
tot_clicks_exp = res_exp['Clicks'].sum()

print(tot_clicks_cont)
print(tot_clicks_exp)

28378
28325


In [57]:
std_err_clicks = math.sqrt((0.5*0.5)/(tot_clicks_cont + tot_clicks_exp))
std_err_clicks

0.002099747079699252

In [58]:
margin_err_clicks = 1.96 * std_err_clicks
margin_err_clicks

0.0041155042762105335

In [59]:
conf_interval_clicks = [0.5 - margin_err_clicks, 0.5 + margin_err_clicks]
conf_interval_clicks

[0.49588449572378945, 0.5041155042762105]

In [60]:
actual_clicks = tot_clicks_cont/(tot_clicks_cont + tot_clicks_exp)
actual_clicks

0.5004673474066628

I used a normal approximation of the binomial distribution to model the assignment of cookies to the experimental and control groups.  The normal approximation can be used since the sample size is suitably large.  The standard error for both of the above is calculated as:

$$\sqrt{p(1-p) \over n_{1}+n_{2}}$$

Since assignment is random, the probabilities of being chosen for the experiment or control are both 0.5.  The confidence interval is centered around 0.5, the mean of the distribution.  Both of these sanity checks pass as the actual values (actual_pvs and actual_clicks) fall within the confidence intervals calculated in conf_interval_pvs and conf_interval_clicks, respectively.
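
The pageview and click checks follow identical steps, so they can be expressed as one function. The following is a minimal sketch assuming an even 50/50 split and a z-score of 1.96; the name `count_sanity_check` is hypothetical, and the totals are copied from the cell outputs above.

```python
import math

def count_sanity_check(n_control, n_experiment, p=0.5, z=1.96):
    """Check whether the control share of a count is consistent
    with random assignment at probability p."""
    total = n_control + n_experiment
    std_err = math.sqrt(p * (1 - p) / total)
    margin = z * std_err
    lower, upper = p - margin, p + margin
    observed = n_control / total
    return lower, upper, observed, lower <= observed <= upper

print(count_sanity_check(345543, 344660))  # pageviews
print(count_sanity_check(28378, 28325))    # clicks
```

Each call returns the interval bounds, the observed control share, and whether the check passes.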

#### Checking click-through-probability

In [61]:
ctp_cont = tot_clicks_cont/tot_pvs_cont
ctp_cont

0.08212581357457682

In [62]:
ctp_exp = tot_clicks_exp/tot_pvs_exp
ctp_exp

0.08218244066616376

In [68]:
# p = pooled probability of click-through-probability
p = (tot_clicks_cont + tot_clicks_exp) / (tot_pvs_cont + tot_pvs_exp)
print(p)
q = 1 - p

0.08215409089789526


In [71]:
pooled_std_err_ctp = math.sqrt(p*q*(1/tot_pvs_cont + 1/tot_pvs_exp))
pooled_std_err_ctp

0.0006610608156387222

In [73]:
margin_err_ctp = 1.96 * pooled_std_err_ctp
margin_err_ctp

0.0012956791986518956

In [77]:
conf_interval_ctp = [0 - margin_err_ctp, 0 + margin_err_ctp]
conf_interval_ctp

[-0.0012956791986518956, 0.0012956791986518956]

In [78]:
ctp_diff = ctp_exp - ctp_cont
ctp_diff

5.662709158693602e-05

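The invariant metric above is calculated as clicks divided by pageviews.  Once again, the normal approximation of the binomial distribution is used, but this time the quantity of interest is the difference between the two groups' probabilities: the standard error is computed from the pooled probability, and the interval is centered around 0.  The pooled standard error is calculated as: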

$$ \sqrt{p(1-p) \left({1 \over n_{1}} + {1 \over n_{2}} \right)}$$

In order for the sanity check to pass, the difference between the click-through-probabilities for the control and experiment groups needs to be in the 95% confidence interval constructed around 0.  This sanity check passes as the actual difference in ctp_diff falls within the confidence interval calculated in conf_interval_ctp.
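
The click-through-probability check can be condensed in the same spirit. This sketch assumes a z-score of 1.96 and uses the totals printed earlier; the function name `pooled_diff_check` is hypothetical.

```python
import math

def pooled_diff_check(x_control, n_control, x_experiment, n_experiment, z=1.96):
    """Compare two proportions using a pooled standard error and
    report whether their difference lies within the margin around 0."""
    p_pool = (x_control + x_experiment) / (n_control + n_experiment)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_experiment))
    margin = z * std_err
    diff = x_experiment / n_experiment - x_control / n_control
    return diff, margin, abs(diff) <= margin

# clicks and pageviews for control, then for experiment
print(pooled_diff_check(28378, 345543, 28325, 344660))
```

The function returns the observed difference, the margin of error, and the pass/fail result, which makes it straightforward to reuse for any other ratio metric with the same structure.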

#### Sanity Checks Summary

| Invariant Metric         | Actual Value | CI Lower Bound | CI Upper Bound | Pass? |
|:-------------------------|:-------------|:---------------|:---------------|:------|
| pageviews                | 0.500639667  | 0.498820392    | 0.501179608    | Yes   |
| clicks                   | 0.500467347  | 0.495884496    | 0.504115504    | Yes   |
| click-through-probability| 0.000056627  | -0.001295679   | 0.001295679    | Yes   |