# A/B Testing Final Project

This is my walkthrough of the final project for Udacity's course on A/B Testing. The project description is:

> At the time of this experiment, Udacity courses currently have two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

> In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead.

> The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

> The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

In [46]:
import numpy as np
import scipy.stats as stats
import math
import os
import pandas as pd
control = pd.read_csv('Final_Project_Control.csv')
experiment = pd.read_csv('Final_Project_Experiment.csv')

## Experiment Design

To begin, we must select invariant and evaluation metrics. Invariant metrics should not be affected by the change, so we compare them between the two groups after the data is collected; if they differ significantly, something went wrong in the data collection process. Evaluation metrics are the ones we expect the change to move, and they are used to decide whether the experiment succeeded. The table below lists all of the collected metrics, each with its description and its practical significance boundary (D_min).

| Metric | Description | D_min |
| --- | ----------- | --- |
| Number of cookies | Number of unique cookies to view the course overview page | 3000 |
| Number of user-ids | Number of users who enroll in the free trial | 50 |
| Number of clicks | Number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered) | 240 |
| Click-through-probability | Number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page | .01 |
| Gross conversion | Number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button | .01 |
| Retention |  Number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout | .01 |
| Net conversion | number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button | .0075 |


We use number of cookies, number of clicks, and click-through-probability as invariant metrics, and gross conversion, retention, and net conversion as evaluation metrics. The invariant metrics are all measured before the prompt asking how many hours a week the student can devote to the course is triggered, so the change should not affect them. The evaluation metrics were chosen because we expect each of them to be impacted by the prompt.





## Baseline Values

To get started, we need to know our baseline values. The given values are:

| Metric | Value | 
| --- | -------- | 
| Unique cookies to view course overview page per day | 40,000 |
| Unique cookies to click "Start free trial" per day | 3,200 |
| Enrollments per day | 660 |
| Click-through-probability on "Start free trial" | .08 | 
| Probability of enrolling, given click| .20625|
| Probability of payment, given enroll| .53 |
| Probability of payment, given click| 0.1093125|



## Measuring Standard Deviation

The analytic standard deviation for each metric is given by the following formula 

$ SD = \sqrt{p*(1-p)/n}$

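where p is the baseline probability and n is the number of units in the metric's denominator. We are told to use a sample of 5,000 page views, which is 1/8 of the 40,000 daily page views, so we must scale the baseline click and enrollment counts by 1/8 to account for this.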


In [16]:
def analytic_sd(p, n):
    return round(math.sqrt(p*(1-p)/n), 4)

In [17]:
# Standard Deviation for gross conversion. 400 is the sample size because 400=3200*1/8
gc_sd = analytic_sd(.20625, 400)
gc_sd

0.0202

In [18]:
# Standard Deviation for retention. 82.5 is sample size because 82.5 = 660*1/8
r_sd = analytic_sd(.53, 82.5)
r_sd

0.0549

In [19]:
# Standard Deviation for net conversion
nc_sd = analytic_sd(.1093125, 400)
nc_sd

0.0156
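The 1/8 scaling used in the cells above can be made explicit. The following is a minimal sketch, assuming the sample consists of 5,000 page views; the helper name `scaled_count` and the constant names are illustrative and not part of the original notebook.

```python
import math

# Daily baseline values taken from the baseline table.
PAGEVIEWS_PER_DAY = 40_000
CLICKS_PER_DAY = 3_200
ENROLLMENTS_PER_DAY = 660


def scaled_count(daily_count, sample_pageviews=5_000):
    """Scale a daily count down to a sample of `sample_pageviews` page views."""
    return daily_count * sample_pageviews / PAGEVIEWS_PER_DAY


def analytic_sd(p, n):
    """Analytic standard deviation of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)


clicks_in_sample = scaled_count(CLICKS_PER_DAY)            # 400.0 clicks
enrollments_in_sample = scaled_count(ENROLLMENTS_PER_DAY)  # 82.5 enrollments

# Gross and net conversion are per click; retention is per enrollment.
print(round(analytic_sd(0.20625, clicks_in_sample), 4))       # 0.0202
print(round(analytic_sd(0.53, enrollments_in_sample), 4))     # 0.0549
print(round(analytic_sd(0.1093125, clicks_in_sample), 4))     # 0.0156
```

Keeping the denominator explicit makes it harder to pair a metric with the wrong unit, for example by using clicks for retention.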

## Experiment sizing

We must now estimate how many page views we need to adequately power the experiment, given alpha = .05 and beta = .2. We can use the online calculator linked in the Instructor's notes, found [here](https://www.evanmiller.org/ab-testing/sample-size.html).

From this, we find that we need the sample sizes

Gross conversion - 25,835

Retention - 39,115

Net conversion - 27,413
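For readers who prefer not to rely on the online calculator, the sketch below uses a common closed-form approximation for a two-sided test of two proportions. It is an assumption that the calculator uses the same formula, and the function name `sample_size_per_group` and the mirroring of baselines above 0.5 are illustrative choices. The standard library's `statistics.NormalDist` is used here instead of SciPy only to keep the example self-contained.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p, d_min, alpha=0.05, beta=0.2):
    """Approximate units needed per group to detect an absolute change d_min
    from baseline proportion p (two-sided test, level alpha, power 1 - beta)."""
    # Assumption: treat a baseline above 0.5 as its mirror image 1 - p.
    p = min(p, 1 - p)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    # Standard deviation terms under the null and under the alternative.
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + d_min) * (1 - p - d_min))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2


for name, p, d_min in [
    ("Gross conversion", 0.20625, 0.01),
    ("Retention", 0.53, 0.01),
    ("Net conversion", 0.1093125, 0.0075),
]:
    print(name, round(sample_size_per_group(p, d_min)))
```

Under these assumptions the results should land within rounding of the calculator figures listed above. The units are clicks for the two conversion metrics and enrollments for retention.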

However, these sample sizes are expressed in each metric's own denominator (clicks or enrollments), so we need to convert them to page views. For the two conversion metrics we divide the sample size by the click-through-probability of .08; for retention we divide by the enrollments-per-page-view ratio of .0165 (660/40,000). We then multiply by 2 because we have both a control and an experiment group. This results in:

In [38]:
print("Page views required for gross conversion: " + str(round(25835/.08 * 2)))
print("Page views required for retention: " + str(round(39115/.0165 * 2)))
print("Page views required for net conversion: " + str(round(27413/.08 * 2)))


Page views required for gross conversion: 645875
Page views required for retention: 4741212
Page views required for net conversion: 685325


Since only 40,000 unique cookies view the course overview page each day, diverting 100% of the traffic to the experiment would take:

In [216]:
print("Days required for gross conversion: " + str(round(round(25835/.08 * 2)/40000)))
print("Days required for retention: " + str(round(round(39115/.0165 * 2)/40000)))
print("Days required for net conversion: " + str(round(round(27413/.08 * 2)/40000)))

Days required for gross conversion: 16
Days required for retention: 119
Days required for net conversion: 17


While retention is a decent evaluation metric, gathering the necessary data would take 119 days, which is simply too long, so we drop it from our analysis. We also don't want to divert 100% of the traffic into the experiment. If we instead divert only 75% of the traffic, we get:

In [217]:
print("Days required for gross conversion: " + str(round(round(25835/.08 * 2)/(40000*.75))))
print("Days required for net conversion: " + str(round(round(27413/.08 * 2)/(40000*.75))))

Days required for gross conversion: 22
Days required for net conversion: 23


Since we need enough power for both metrics, we choose the larger of the two values, 685,325, as our required number of page views.
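One caveat about the duration arithmetic: rounding to the nearest day can understate the time needed, because a partial day still requires running the experiment for that day. The sketch below rounds up instead; the helper `days_required` is a hypothetical name introduced here for illustration.

```python
import math


def days_required(pageviews_needed, traffic_fraction, daily_pageviews=40_000):
    """Whole days needed to collect `pageviews_needed` page views when only
    `traffic_fraction` of the daily traffic is diverted to the experiment."""
    return math.ceil(pageviews_needed / (daily_pageviews * traffic_fraction))


print(days_required(645_875, 0.75))  # gross conversion at 75% traffic
print(days_required(685_325, 0.75))  # net conversion at 75% traffic
print(days_required(645_875, 1.0))   # gross conversion at 100% traffic
print(days_required(685_325, 1.0))   # net conversion at 100% traffic
```

At 75% of traffic, rounding up gives the same 22 and 23 days as above. At 100% of traffic it gives 17 and 18 days rather than 16 and 17.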

## Sanity Checks

Now it is time to check our invariant metrics and make sure nothing went wrong in the data collection process. To do this, we test each metric at the 95% confidence level. For the number of cookies and the number of clicks, we define a function that checks whether the fraction of units assigned to the control group falls inside a 95% confidence interval around .5. Since click-through-probability is compared as a difference between the two groups, we write out that code separately.

In [214]:
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [215]:
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


In [60]:
def get_z_star(alpha):
    return stats.norm().interval(1-alpha)[1]

def invariant_test(n_control, n_exp, alpha = .05, p = .5):
    sd = math.sqrt((p*(1-p))/(n_control+n_exp))
    z_star = get_z_star(alpha)
    m = sd * z_star
    lower = p-m
    upper = p+m
    p_hat = n_control/(n_control+n_exp)
    print("Lower confidence level: " + str(round(lower, 4)))
    print("Upper confidence level: " + str(round(upper, 4)))
    print("P_hat: " + str(round(p_hat, 4)))
    if p_hat > upper or p_hat < lower:
        print("Does not pass sanity check")
    else:
        # Values exactly on a bound are treated as passing
        print("Does pass sanity check")

In [218]:
n_cookies_control = control['Pageviews'].sum()
n_cookies_experiment = experiment['Pageviews'].sum()
print("Number of pageviews in control: " + str(n_cookies_control))
print("Number of pageviews in experiment: " + str(n_cookies_experiment))


n_clicks_control = control['Clicks'].sum()
n_clicks_experiment = experiment['Clicks'].sum()
print("\nClicks in control: " + str(n_clicks_control))
print("Clicks in experiment: " + str(n_clicks_experiment))


ctp_control = control['Clicks'].sum()/control['Pageviews'].sum()
ctp_experiment = experiment['Clicks'].sum()/experiment['Pageviews'].sum()
print("\nClick through probability in control: " + str(round(ctp_control,4)))
print("Click through probability in experiment: " + str(round(ctp_experiment, 4)))

Number of pageviews in control: 345543
Number of pageviews in experiment: 344660

Clicks in control: 28378
Clicks in experiment: 28325

Click through probability in control: 0.0821
Click through probability in experiment: 0.0822


In [219]:
print("Invariant test for number of cookies:")
invariant_test(n_cookies_control, n_cookies_experiment)

Invariant test for number of cookies:
Lower confidence level: 0.4988
Upper confidence level: 0.5012
P_hat: 0.5006
Does pass sanity check


In [220]:
print("Invariant test for number of clicks:")
invariant_test(n_clicks_control, n_clicks_experiment)

Invariant test for number of clicks:
Lower confidence level: 0.4959
Upper confidence level: 0.5041
P_hat: 0.5005
Does pass sanity check


In [223]:
d_hat = ctp_experiment - ctp_control
p_hat_pooled = (control['Clicks'].sum() + experiment['Clicks'].sum())/(control['Pageviews'].sum()+experiment['Pageviews'].sum())
pooled_sd = math.sqrt((p_hat_pooled*(1-p_hat_pooled))*(1/control['Pageviews'].sum()+1/experiment['Pageviews'].sum()))
z_star = get_z_star(.05)
m = pooled_sd * z_star
print("Invariant test for click-through-probability:")
print("Lower margin of error: " + str(round(-m, 4)))
print("Upper margin of error: " + str(round(m, 4)))
print("d_hat: " + str(round(d_hat, 4)))
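# Pass only if the observed difference lies within the margin of error around 0
if -m <= d_hat <= m:
    print("Does pass sanity check")
else:
    print("Does not pass sanity check")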

Invariant test for click-through-probability:
Lower margin of error: -0.0013
Upper margin of error: 0.0013
d_hat: 0.0001
Does pass sanity check


## Effect Size Tests

Now that we've passed our sanity checks, we can proceed to check our evaluation metrics and see if they are significant. To do this, we must exclude the dates that have any null values. We then sum the necessary variables, calculate the pooled probability and pooled standard deviation, and test each metric's significance at the 95% confidence level.
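In symbols, the calculation for each evaluation metric can be written as follows. The notation is introduced here for illustration: $X$ is the number of successes (enrollments or payments) and $N$ is the number of clicks in each group, with subscripts for control and experiment, and $z^{*}$ is the critical value for the 95% confidence level.

$$ \hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}, \qquad SD_{pool} = \sqrt{\hat{p}_{pool}\,(1 - \hat{p}_{pool})\left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)} $$

$$ \hat{d} = \frac{X_{exp}}{N_{exp}} - \frac{X_{cont}}{N_{cont}}, \qquad CI = \hat{d} \pm z^{*} \cdot SD_{pool} $$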

In [183]:
#Gross conversion
clicks_control = control['Clicks'].loc[control['Enrollments'].notnull()].sum()
clicks_experiment = experiment['Clicks'].loc[experiment['Enrollments'].notnull()].sum()

enrollments_control = control['Enrollments'].sum()
enrollments_experiment = experiment['Enrollments'].sum()

GC_control = enrollments_control/clicks_control
GC_experiment = enrollments_experiment/clicks_experiment

GC_pooled = (enrollments_control+enrollments_experiment)/(clicks_control+clicks_experiment)
GC_pooled_sd = math.sqrt(GC_pooled*(1-GC_pooled)*(1/clicks_control + 1/clicks_experiment))
m = get_z_star(.05) * GC_pooled_sd
d_hat = GC_experiment - GC_control
print("Difference due to experiment: " + str(round(d_hat, 4)))
print("Lower confidence interval:" + str(round(d_hat - m, 4)))
print("Upper confidence interval:" + str(round(d_hat+m, 4)))

Difference due to experiment: -0.0206
Lower confidence interval:-0.0291
Upper confidence interval:-0.012


Since the confidence interval does not contain 0, our results are statistically significant. Moreover, the entire interval lies below -d_min = -.01 (its upper bound is -.012), so our results for gross conversion are also practically significant.

In [185]:
#Net conversion
payments_control = control['Payments'].sum()
payments_experiment = experiment['Payments'].sum()

NC_control = payments_control/clicks_control
NC_experiment = payments_experiment/clicks_experiment

NC_pooled = (payments_control+payments_experiment)/(clicks_control+clicks_experiment)
NC_pooled_sd = math.sqrt(NC_pooled*(1-NC_pooled)*(1/clicks_control + 1/clicks_experiment))
m = get_z_star(.05) * NC_pooled_sd
d_hat = NC_experiment - NC_control
print("Difference due to experiment: " + str(round(d_hat, 4)))
print("Lower confidence Interval:" + str(round(d_hat - m, 4)))
print("Upper confidence interval:" + str(round(d_hat+m, 4)))

Difference due to experiment: -0.0049
Lower confidence Interval:-0.0116
Upper confidence interval:0.0019


Our results for net conversion are not statistically significant, since 0 is in the confidence interval. Similarly, the magnitude of the observed difference (.0049) is smaller than d_min = .0075, so our results are not practically significant.
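The two decisions above can be expressed as a small helper. This is a sketch under one common convention, assumed here rather than taken from the course: a result counts as practically significant only when the whole confidence interval lies beyond plus or minus d_min. The function name `assess` is hypothetical.

```python
def assess(ci_lower, ci_upper, d_min):
    """Classify a confidence interval for a difference in proportions.

    Returns (statistically_significant, practically_significant).
    """
    # Statistically significant: the interval excludes 0.
    statistically_significant = not (ci_lower <= 0 <= ci_upper)
    # Practically significant (assumed convention): the whole interval
    # lies above +d_min or below -d_min.
    practically_significant = ci_lower > d_min or ci_upper < -d_min
    return statistically_significant, practically_significant


print(assess(-0.0291, -0.0120, 0.01))    # gross conversion -> (True, True)
print(assess(-0.0116, 0.0019, 0.0075))   # net conversion   -> (False, False)
```

The intervals passed in are the ones reported in the cell outputs above.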

To further test our data, we now run a sign test on each evaluation metric: we count the number of days on which the experiment value is larger than the control value and test that count with a two-sided binomial test. If the change made no difference, the experiment value should be larger than the control value on about half of the days. Therefore, we test whether the observed proportion of such days is significantly different from .5.

In [224]:
def sign_test(Xs_cont,Xs_exp, Ns_cont , Ns_exp):
    n_days= len(Xs_cont)
    n_pos = 0
    exp_ctr = np.array(Xs_exp)/np.array(Ns_exp)
    control_ctr = np.array(Xs_cont)/np.array(Ns_cont)
    for i in range(n_days):
        if exp_ctr[i] > control_ctr[i]:
            n_pos+=1
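    # stats.binom_test has been deprecated and removed in recent SciPy releases;
    # stats.binomtest (SciPy >= 1.7) is its replacement
    print("p-value: " + str(round(stats.binomtest(n_pos, n_days).pvalue, 4)))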

In [225]:
full_data = control.merge(experiment, how = 'inner', on = 'Date', suffixes = ('_control', '_experiment'))

In [226]:
full_data = full_data.loc[full_data["Enrollments_control"].notnull()]
full_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 0 to 22
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    23 non-null     object 
 1   Pageviews_control       23 non-null     int64  
 2   Clicks_control          23 non-null     int64  
 3   Enrollments_control     23 non-null     float64
 4   Payments_control        23 non-null     float64
 5   Pageviews_experiment    23 non-null     int64  
 6   Clicks_experiment       23 non-null     int64  
 7   Enrollments_experiment  23 non-null     float64
 8   Payments_experiment     23 non-null     float64
dtypes: float64(4), int64(4), object(1)
memory usage: 1.8+ KB


In [227]:
# Gross conversion is enrollments per click, so clicks are the denominators
sign_test(full_data['Enrollments_control'], full_data['Enrollments_experiment'], full_data['Clicks_control'], full_data['Clicks_experiment'])

p-value: 0.0026


In [228]:
# Net conversion is payments per click, so clicks are the denominators
sign_test(full_data['Payments_control'], full_data['Payments_experiment'], full_data['Clicks_control'], full_data['Clicks_experiment'])

p-value: 0.6776


The sign test for gross conversion is statistically significant at the 95% confidence level since the p-value is less than .05. However, the sign test for net conversion is not statistically significant.
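The sign-test p-values can also be cross-checked without SciPy. The sketch below computes the exact two-sided binomial p-value under the null hypothesis that the experiment value is larger on half of the days; the function name `sign_test_p_value` is illustrative, and doubling the smaller tail is valid here only because the null probability is .5, which makes the distribution symmetric.

```python
from math import comb


def sign_test_p_value(n_pos, n_days):
    """Exact two-sided binomial p-value for H0: P(experiment > control) = 0.5."""
    # Use the smaller tail; the distribution is symmetric under H0.
    k = min(n_pos, n_days - n_pos)
    tail = sum(comb(n_days, i) for i in range(k + 1)) / 2 ** n_days
    # Doubling can exceed 1 when k is at the centre, so cap the result.
    return min(1.0, 2 * tail)


# Hypothetical day counts, chosen only for illustration.
print(round(sign_test_p_value(4, 23), 4))
print(round(sign_test_p_value(10, 23), 4))
```

The per-day counts are not shown in this walkthrough, so the inputs are hypothetical; they happen to give 0.0026 and 0.6776, the same values as the p-values reported above.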

## Recommendations

While the change has a statistically significant effect on gross conversion, there is not enough evidence that it changes net conversion. Since the ultimate goal is to increase the net conversion rate, we do not recommend launching this change.