<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Experiment-Overview" data-toc-modified-id="Experiment-Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Experiment Overview</a></span><ul class="toc-item"><li><span><a href="#Changes-to-test" data-toc-modified-id="Changes-to-test-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Changes to test</a></span></li><li><span><a href="#Hypothesis" data-toc-modified-id="Hypothesis-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Hypothesis</a></span></li><li><span><a href="#Unit-of-diversion" data-toc-modified-id="Unit-of-diversion-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Unit of diversion</a></span></li></ul></li><li><span><a href="#Choosing-a-Metric-to-Track" data-toc-modified-id="Choosing-a-Metric-to-Track-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Choosing a Metric to Track</a></span><ul class="toc-item"><li><span><a href="#Evaluation-Metrics" data-toc-modified-id="Evaluation-Metrics-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Evaluation Metrics</a></span></li><li><span><a href="#Invariant-Metrics" data-toc-modified-id="Invariant-Metrics-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Invariant Metrics</a></span></li></ul></li><li><span><a href="#Baseline-Values" data-toc-modified-id="Baseline-Values-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Baseline Values</a></span><ul class="toc-item"><li><span><a href="#Metric-Baseline-Values" data-toc-modified-id="Metric-Baseline-Values-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Metric Baseline Values</a></span></li><li><span><a href="#Estimating-Standard-Deviation" data-toc-modified-id="Estimating-Standard-Deviation-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Estimating Standard Deviation</a></span></li></ul></li><li><span><a href="#Choosing-Sample-Size" data-toc-modified-id="Choosing-Sample-Size-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Choosing Sample Size</a></span><ul 
class="toc-item"><li><span><a href="#Using-formula" data-toc-modified-id="Using-formula-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Using formula</a></span></li></ul></li><li><span><a href="#Evaluating-Experiment-Results" data-toc-modified-id="Evaluating-Experiment-Results-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Evaluating Experiment Results</a></span><ul class="toc-item"><li><span><a href="#Checking-Invariant-Metrics" data-toc-modified-id="Checking-Invariant-Metrics-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Checking Invariant Metrics</a></span></li></ul></li></ul></div>

In [113]:
# Imports
import pandas as pd
import numpy as np
from scipy.stats import norm

# Experiment Overview

## Changes to test
- Udacity courses currently have two options on the course overview page: "start free trial", and "access course materials"
- Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course.
- If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual.
- If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.

## Hypothesis
- This change might set clearer expectations for students upfront.
- That should reduce the number of frustrated students who leave the free trial.
- It should do so without significantly reducing the number of students who continue past the free trial.

- **Goal**: Improve overall student experience

## Unit of diversion

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

# Choosing a Metric to Track

We need to choose two types of metrics, evaluation metrics and invariant metrics, and a practical significance level for each of them.

**Invariant Metrics**: Metrics that we do not expect to change between the experiment and control groups (they should not be affected by the proposed change on Udacity's website).

**Evaluation Metrics**: These are the metrics that we want to track and in which we expect to see a significant change.

**Practical Significance Level**: Also known as dmin, this is a threshold set by stakeholders: the smallest change in a metric that is worth acting on from a business standpoint.

## Evaluation Metrics

Some possible metrics to choose from are:

- **Gross conversion**: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin=0.01)

- **Retention**: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01) - We expect this metric to grow, since students who completed checkout have seen the pop-up message and decided to commit 5 or more hours a week to learning.

- **Net conversion**: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.0075) - This metric may decrease, since we assume that the new pop-up message will deter some students from enrolling in the free trial; the hypothesis is that the decrease is not significant.
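
As a minimal sketch of how these three ratios relate, the snippet below computes them from raw counts. The function name `evaluation_metrics` and the example counts are illustrative assumptions, not part of the experiment data; the counts are chosen to match the baseline rates listed in the Baseline Values section (3200 clicks, 660 enrollments, 53% of enrollments paying).

```python
def evaluation_metrics(clicks: float, enrollments: float, payments: float) -> dict:
    """Compute the three candidate evaluation metrics from raw counts."""
    return {
        "gross_conversion": enrollments / clicks,  # enrollments per click
        "retention": payments / enrollments,       # payments per enrollment
        "net_conversion": payments / clicks,       # payments per click
    }

# Hypothetical daily counts, for illustration only
example = evaluation_metrics(clicks=3200, enrollments=660, payments=0.53 * 660)

# Net conversion is the product of gross conversion and retention
product = example["gross_conversion"] * example["retention"]
print(example)
print(product)
```

Because net conversion is the product of the other two, an increase in retention can offset a drop in gross conversion.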

## Invariant Metrics

- **Number of cookies**: That is, number of unique cookies to view the course overview page. (dmin=3000)

- **Number of clicks**: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)

- **Click-through-probability**: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)

# Baseline Values


## Metric Baseline Values

According to Udacity, these are the baseline values for the following metrics:

- Unique cookies to view course overview page per day: 40000
- Unique cookies to click "Start free trial" per day: 3200
- Enrollments per day: 660
- Click-through-probability on "Start free trial": 0.08
- Probability of enrolling, given click: 0.20625
- Probability of payment, given enroll: 0.53
- Probability of payment, given click: 0.1093125

In [75]:
# Create a dictionary to save baseline metrics
baseline_dict = {"cookies_unique": 40_000,
                 "clicks": 3200,
                 "enrollments_daily": 660,
                 "CTP": 0.08,
                 "GConversion": 0.20625,
                 "retention": 0.53,
                 "NConversion": 0.109313}

## Estimating Standard Deviation

The first thing we can do is to calculate some basic statistics to get an idea of what our samples look like.

In [76]:
# scale down baseline values to appropriate sample size
# only count estimates need to be scaled
sample_size = 5000 # defined sample size

baseline_scaled = baseline_dict.copy()
baseline_scaled["cookies_unique"] = sample_size 
baseline_scaled["clicks"] = baseline_dict["clicks"] * (sample_size/40_000)
baseline_scaled["enrollments_daily"] = baseline_dict["enrollments_daily"] * (sample_size/40_000)

baseline_scaled

{'cookies_unique': 5000,
 'clicks': 400.0,
 'enrollments_daily': 82.5,
 'CTP': 0.08,
 'GConversion': 0.20625,
 'retention': 0.53,
 'NConversion': 0.109313}

We can assume that every metric that represents a probability follows a binomial distribution, so its standard deviation (SD) can be estimated analytically with the following formula:

![image.png](attachment:image.png)
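
Assuming the image shows the usual standard error of a binomial proportion (which is what `sd_analytical` computes), the formula can be written as:

$$SD = \sqrt{\frac{p\,(1-p)}{n}}$$

where $p$ is the baseline probability of the metric and $n$ is the number of units in the metric's denominator (clicks or enrollments in the scaled sample).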

Note that estimating the SD analytically only works when the **unit of diversion** and the **unit of analysis** are the same. When this is not the case, it is better to calculate the SD empirically.
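
To illustrate what an empirical estimate looks like, here is a minimal sketch that simulates many samples and takes the standard deviation of the resulting proportions. The helper `empirical_sd`, the number of simulations, and the seed are assumptions made for this illustration. The simulated units are independent, so the result should land close to the analytical value; with real data, one would instead resample observed A/A data, where the two estimates can differ.

```python
import math
import random
import statistics

def empirical_sd(n: int, p: float, n_sims: int = 2000, seed: int = 0) -> float:
    """Estimate the SD of a sample proportion by repeated simulation."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_sims):
        # draw n independent Bernoulli(p) outcomes and record the sample proportion
        successes = sum(rng.random() < p for _ in range(n))
        proportions.append(successes / n)
    return statistics.stdev(proportions)

n, p = 400, 0.20625  # scaled clicks and baseline gross conversion
analytical = math.sqrt(p * (1 - p) / n)
empirical = empirical_sd(n, p)
print(f"analytical={analytical:.4f}, empirical={empirical:.4f}")
```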

In [78]:
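# Estimating SD for the 3 chosen evaluation metrics

def sd_analytical(n: float, p: float) -> float:
    """Analytical SD of a binomial proportion with n units and probability p, rounded to 4 decimals."""
    return round(np.sqrt((p*(1-p))/(n)), 4)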

# Gross Conversion rate
GConversion = {}
GConversion["n"] = baseline_scaled["clicks"]
GConversion["p"] = baseline_dict["GConversion"]
GConversion["d_min"] = 0.01

GConversion["sd"] = sd_analytical(GConversion["n"], GConversion["p"])

# Retention rate
retention = {}
retention["n"] = baseline_scaled["enrollments_daily"]
retention["p"] = baseline_dict["retention"]
retention["d_min"] = 0.01

retention["sd"] = sd_analytical(retention["n"], retention["p"])

# Net Conversion rate
NConversion = {}
NConversion["n"] = baseline_scaled["clicks"]
NConversion["p"] = baseline_dict["NConversion"]
NConversion["d_min"] = 0.0075

NConversion["sd"] = sd_analytical(NConversion["n"], NConversion["p"])

In [79]:
GConversion, retention, NConversion

({'n': 400.0, 'p': 0.20625, 'd_min': 0.01, 'sd': 0.0202},
 {'n': 82.5, 'p': 0.53, 'd_min': 0.01, 'sd': 0.0549},
 {'n': 400.0, 'p': 0.109313, 'd_min': 0.0075, 'sd': 0.0156})

# Choosing Sample Size

To calculate the correct sample size for our experiment and control groups, we have two options:
- Use [this](https://www.evanmiller.org/ab-testing/sample-size.html) online calculator

- Or calculate with the following formula: 

![image.png](attachment:image.png)
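
Assuming the image shows the standard sample size formula for comparing two proportions (which is what `get_sample_size` implements), the required size per group is:

$$n = \frac{\left(z_{1-\alpha/2}\,\sqrt{2\,p\,(1-p)} \;+\; z_{1-\beta}\,\sqrt{p\,(1-p) + (p+d)\,\bigl(1-(p+d)\bigr)}\right)^{2}}{d^{2}}$$

where $p$ is the baseline probability, $d$ is the practical significance level (dmin), $\alpha$ is the significance level and $1-\beta$ is the desired power.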

## Using formula

In [91]:
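def get_sd1(p: float) -> float:
    # SD term under the null hypothesis: both groups share the baseline probability p
    return np.sqrt(p*(1-p) + p*(1-p))

def get_sd2(p: float, d: float) -> float:
    # SD term under the alternative: the experiment group's probability is p + d
    return np.sqrt(p*(1-p) + (p+d) * (1-(p+d)))

def get_z_score(alpha_value: float):
    # z-value whose cumulative probability under the standard normal is alpha_value
    return norm.ppf(alpha_value)

def get_sample_size(alpha: float, beta:float, d:float, p:float) -> int:
    # required units per group for a two-sided test with significance alpha and power 1 - beta
    z_alpha = get_z_score(1-alpha/2)
    z_beta = get_z_score(1-beta)
    sd1 = get_sd1(p)
    sd2 = get_sd2(p, d)
    
    return round((z_alpha*sd1 + z_beta*sd2)**2/d**2)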

In [99]:
# sample size per evaluation metric

# Gross Conversion
GConversion["sample_size_clicks"] = get_sample_size(0.05, 0.2, GConversion["d_min"], GConversion["p"])

# convert clicks to the number of cookies that visited the site, then multiply by 2 (we have two groups)
# according to baseline metrics, 8% of cookies resulted in clicks
GConversion["sample_size"] = (GConversion["sample_size_clicks"]/0.08)*2

# Retention
retention["sample_size_user_id"] = get_sample_size(0.05, 0.2, retention["d_min"], retention["p"])

# convert the number of enrolled users to cookies that clicked, then to cookies that visited the site
# then multiply by 2 (we have two groups)

# Percentage of users who enroll given a click = 20.6%
retention["sample_size"] = ((retention["sample_size_user_id"]/0.20625)/0.08)*2

# Net Conversions
NConversion["sample_size_clicks"] = get_sample_size(0.05, 0.2, NConversion["d_min"], NConversion["p"])

# convert the number of clicks to cookies that visited the site
# then multiply by 2 (we have two groups)

NConversion["sample_size"] = (NConversion["sample_size_clicks"]/0.08)*2

In [100]:
GConversion["sample_size_clicks"]

25835.0

In [111]:
df_sample_size = pd.DataFrame(data={"Metric": ["GConversion", "Retention", "NConversion"], 
                   "SampleSize": [GConversion["sample_size"], retention["sample_size"], NConversion["sample_size"]]})

# calculate experiment length in days
average_daily_traffic = 40_000
divert_fraction = 0.8
df_sample_size["experiment_length_in_days"] = round(df_sample_size["SampleSize"] / (average_daily_traffic * divert_fraction), 2)

In [112]:
df_sample_size

| | Metric | SampleSize | experiment_length_in_days |
|---|---|---|---|
| 0 | GConversion | 645875.0 | 20.18 |
| 1 | Retention | 4737818.0 | 148.06 |
| 2 | NConversion | 685325.0 | 21.42 |


From the results we can see that, with a diversion fraction of 80%, using Retention as an evaluation metric would require running the experiment for about 148 days, which is far too long. Therefore this metric will not be considered any further.
The remaining two metrics need about 20 to 21 days each, which is a realistic experiment duration, so both are good options.
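
The trade-off between diversion fraction and duration can be sketched as follows. The helper `days_needed` is a hypothetical name introduced for this illustration, and the fractions other than 0.8 are assumed values; it rounds up to whole days and uses the larger of the two remaining requirements (net conversion) together with the baseline daily traffic.

```python
import math

def days_needed(sample_size: float, daily_traffic: int, divert_fraction: float) -> int:
    """Whole days required to collect `sample_size` pageviews."""
    return math.ceil(sample_size / (daily_traffic * divert_fraction))

required_pageviews = 685_325  # net conversion requirement from the table above
for fraction in (0.5, 0.8, 1.0):
    print(fraction, days_needed(required_pageviews, 40_000, fraction))
```

Diverting more traffic shortens the experiment, at the cost of exposing more users to the change.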

# Evaluating Experiment Results

Udacity provided two datasets containing the results for the experiment and control groups.

In [114]:
df_control = pd.read_excel("../raw_data/Final Project Results.xlsx", sheet_name="Control")
df_experiment = pd.read_excel("../raw_data/Final Project Results.xlsx", sheet_name="Experiment")

In [115]:
df_control.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723.0 | 687.0 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102.0 | 779.0 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511.0 | 909.0 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871.0 | 836.0 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014.0 | 837.0 | 163.0 | 64.0 |


In [116]:
df_control.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         37 non-null     object 
 1   Pageviews    37 non-null     float64
 2   Clicks       37 non-null     float64
 3   Enrollments  23 non-null     float64
 4   Payments     23 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.6+ KB


In [117]:
df_experiment.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7716.0 | 686.0 | 105.0 | 34.0 |
| 1 | Sun, Oct 12 | 9288.0 | 785.0 | 116.0 | 91.0 |
| 2 | Mon, Oct 13 | 10480.0 | 884.0 | 145.0 | 79.0 |
| 3 | Tue, Oct 14 | 9867.0 | 827.0 | 138.0 | 92.0 |
| 4 | Wed, Oct 15 | 9793.0 | 832.0 | 140.0 | 94.0 |


In [118]:
df_experiment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         37 non-null     object 
 1   Pageviews    37 non-null     float64
 2   Clicks       37 non-null     float64
 3   Enrollments  23 non-null     float64
 4   Payments     23 non-null     float64
dtypes: float64(4), object(1)
memory usage: 1.6+ KB


## Checking Invariant Metrics

We defined 3 invariant metrics to use as sanity checks. 

- Number of cookies

- Number of clicks

- Click-through-probability

We expect these 3 metrics to be approximately the same in both groups. If they are not, then we might have set up our experiment in a way that introduces bias.

**Number of cookies**

In [120]:
print("Number of cookies in control group", df_control["Pageviews"].sum())
print("Number of cookies in experiment group", df_experiment["Pageviews"].sum())

Number of cookies in control group 345543.0
Number of cookies in experiment group 344660.0


Although the numbers are very close, we should still check whether the difference in pageviews is statistically significant. To do this, we test whether the proportion of pageviews in the control group differs significantly from 0.5 (we expect the control group to contain 50% of the pageviews of both groups combined).

To do that, we first need to calculate the 95% confidence interval.

![image.png](attachment:image.png)

Here p is the expected proportion of pageviews in the control group (0.5), and ME is the margin of error.
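
Assuming the image shows the usual normal-approximation interval, it can be written as:

$$CI = \bigl[\,p - ME,\; p + ME\,\bigr], \qquad ME = z_{1-\alpha/2} \cdot SD, \qquad SD = \sqrt{\frac{p\,(1-p)}{N}}$$

where $N$ is the total number of pageviews across both groups. The observed control proportion is so close to 0.5 that using either value inside $SD$ gives practically the same result.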

In [127]:
alpha = 0.05
p = 0.5
total_pageviews = df_control["Pageviews"].sum() + df_experiment["Pageviews"].sum()
p_control = df_control["Pageviews"].sum() / total_pageviews
z_score = get_z_score(1-alpha/2)
sd = sd_analytical(total_pageviews, p_control)
me = z_score*sd

print(f"Confidence interval: {p-me:.4f} , {p+me:.4f}")

Confidence interval: 0.4988 , 0.5012


Our p_control = 0.5006 is within the confidence interval, so there is no statistically significant difference in pageviews between the two groups.
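
The same check can be wrapped in a small reusable function. This is a sketch under stated assumptions: `split_sanity_check` is a hypothetical helper, it uses the standard library's `NormalDist` instead of `scipy`, it takes the SD at the expected proportion of 0.5, and it does not round the SD. The inputs are the pageview totals printed above.

```python
import math
from statistics import NormalDist

def split_sanity_check(n_control: float, n_experiment: float,
                       expected: float = 0.5, alpha: float = 0.05):
    """Return (lower, upper, observed) for the control group's share of a count."""
    total = n_control + n_experiment
    z = NormalDist().inv_cdf(1 - alpha / 2)             # two-sided critical value
    se = math.sqrt(expected * (1 - expected) / total)   # SD under the expected split
    margin = z * se
    observed = n_control / total
    return expected - margin, expected + margin, observed

lower, upper, observed = split_sanity_check(345_543, 344_660)
print(f"CI: [{lower:.4f}, {upper:.4f}], observed: {observed:.4f}")
print("passes" if lower <= observed <= upper else "fails")
```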

**Number of clicks**


In [128]:
alpha = 0.05
p = 0.5
total_clicks = df_control["Clicks"].sum() + df_experiment["Clicks"].sum()
p_control = df_control["Clicks"].sum() / total_clicks
z_score = get_z_score(1-alpha/2)
# the SD must be based on the total number of clicks, not pageviews
sd = sd_analytical(total_clicks, p_control)
me = z_score*sd

print(f"Confidence interval: {p-me:.4f} , {p+me:.4f}")

Confidence interval: 0.4959 , 0.5041


Our p_control = 0.5004 is within the confidence interval, so there is no statistically significant difference in clicks between the two groups.

**Click-through-probability**


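CTP is defined as the ratio between unique clicks and unique pageviews. Because we now compare two proportions (one per group) rather than a single split around 0.5, the standard deviation is calculated differently than before, using the pooled standard error of the difference: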


![image.png](attachment:image.png)
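
Assuming the image shows the pooled standard error (which is what `get_sd_pooled` computes), the formula is:

$$\hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}, \qquad SD_{pool} = \sqrt{\hat{p}_{pool}\,\bigl(1-\hat{p}_{pool}\bigr)\left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}$$

where $X$ is the number of clicks and $N$ the number of pageviews in each group. The confidence interval for the difference $d = CTP_{exp} - CTP_{cont}$ is centered at 0 with margin $z_{1-\alpha/2} \cdot SD_{pool}$.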

In [145]:
def get_sd_pooled(n_cont, n_exp, x_cont, x_exp):
    
    p_pool = (x_cont + x_exp)/(n_cont + n_exp)
    
    return np.sqrt(p_pool*((1-p_pool)*((1/n_cont)+(1/n_exp))))

n_cont = df_control["Pageviews"].sum()
n_exp = df_experiment["Pageviews"].sum()
x_cont = df_control["Clicks"].sum()
x_exp = df_experiment["Clicks"].sum()

sd_pooled = get_sd_pooled(n_cont, n_exp, x_cont, x_exp)

ctp_cont = x_cont/n_cont
ctp_exp = x_exp/n_exp

d = round(ctp_exp - ctp_cont, 4)

print(f"Confidence interval: {0-z_score*sd_pooled:.4f} , {0+z_score*sd_pooled:.4f}")

Confidence interval: -0.0013 , 0.0013


Our d = 0.0001 is within the confidence interval, so there is no statistically significant difference in CTP between the two groups.