In [36]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import norm

In [9]:
baseline = pd.read_csv('data/baseline.csv', header=None)
control = pd.read_excel('data/results.xlsx', sheet_name=0)
experiment = pd.read_excel('data/results.xlsx', sheet_name=1)

In [21]:
baseline

Unnamed: 0,0,1
0,Unique cookies to view course overview page pe...,40000.0
1,"Unique cookies to click ""Start free trial"" per...",3200.0
2,Enrollments per day:,660.0
3,"Click-through-probability on ""Start free trial"":",0.08
4,"Probability of enrolling, given click:",0.20625
5,"Probability of payment, given enroll:",0.53
6,"Probability of payment, given click",0.109313


In [68]:
control

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0
5,"Thu, Oct 16",9670,823,138.0,82.0
6,"Fri, Oct 17",9008,748,146.0,76.0
7,"Sat, Oct 18",7434,632,110.0,70.0
8,"Sun, Oct 19",8459,691,131.0,60.0
9,"Mon, Oct 20",10667,861,165.0,97.0


In [13]:
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


### Experiment Overview:

Free Trial Screener
At the time of this experiment, Udacity courses had two options on the course overview page: "start free trial", and "access course materials". If the student clicked "start free trial", they were asked to enter their credit card information and were then enrolled in a free trial for the paid version of the course. After 14 days, they were automatically charged unless they canceled first. If the student clicked "access course materials", they could view the videos and take the quizzes for free, but they did not receive coaching support or a verified certificate, and they could not submit their final project for feedback.


In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.


The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.


The **unit of diversion is a cookie**, although **if the student enrolls in the free trial, they are tracked by user-id from that point forward**. The same user-id cannot enroll in the free trial twice. For users that **do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page**.



### Metric choice

**Number of cookies**: That is, number of unique cookies to view the course overview page. (dmin=3000)

**Number of user-ids**: That is, number of users who enroll in the free trial. (dmin=50)

**Number of clicks**: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)

**Click-through-probability**: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)

**Gross conversion**: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)

**Retention**: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)

**Net conversion**: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
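
As a quick illustration of how the probability metrics above relate to each other, the sketch below recomputes them from the daily baseline figures shown in the baseline table. The variable names are chosen here for illustration only, and net conversion is derived as gross conversion times retention, which assumes every payment comes from an enrolled user.

```python
# Baseline daily figures quoted in the baseline table of this notebook.
pageviews = 40000      # unique cookies viewing the course overview page
clicks = 3200          # unique cookies clicking "Start free trial"
enrollments = 660      # user-ids enrolling in the free trial
retention = 0.53       # probability of payment, given enrollment

# Each probability metric is a ratio of two of the counts above.
click_through_probability = clicks / pageviews
gross_conversion = enrollments / clicks
# Assumption: payments only come from enrolled users, so
# P(payment | click) = P(enroll | click) * P(payment | enroll).
net_conversion = gross_conversion * retention

print(f"Click-through-probability: {click_through_probability:.4f}")
print(f"Gross conversion: {gross_conversion:.5f}")
print(f"Net conversion: {net_conversion:.6f}")
```

The results agree with the baseline values 0.08, 0.20625 and 0.109313 (the last one up to rounding).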

### The invariant metrics are: 
**Number of cookies** - Cookies are counted on the course overview page, before users can see the change <br/>
**Number of clicks** - Clicks also happen before the free trial screener is triggered <br/>
**Click-through-probability** - This follows logically from the 2 above

### The evaluation metrics are: 
**Gross conversion** - This is likely to change because fewer students would enroll in the free trial and would just take the free course instead <br/>
**Retention** - This is likely to change because the number of enrolled students changes <br/>
**Net conversion** - This follows logically from the 2 above

### Calculating standard deviation

In [26]:
total_pageviews = baseline.iloc[0,1]
total_clicks = baseline.iloc[1,1]
total_enrollments = baseline.iloc[2,1]
clicks_to_pv_prob = baseline.iloc[3,1]
enroll_given_click_conversion = baseline.iloc[4,1]
pay_given_enroll_retention = baseline.iloc[5,1]
pay_given_click_net_conversion = baseline.iloc[6,1]

***Question***: For each metric you selected as an **evaluation** metric, estimate its standard deviation analytically, based on a sample size of 5000 PV. Do you expect the analytic estimates to be accurate? That is, for which metrics, if any, would you want to collect an empirical estimate of the variability if you had time?

*Formula to calculate standard deviation is:* <br/> $$\sigma = \sqrt{\frac{\hat{p}*(1-\hat{p})}{N}}$$

To calculate the above on a sample size of 5000 PVs, we would need to scale the denominator (total_clicks and total_enrollments) in this case to correspond to 5000 PVs

In [33]:
gross_conversion_std = np.sqrt(enroll_given_click_conversion * (1 - enroll_given_click_conversion) / ((total_clicks*5000)/total_pageviews))
retention_std = np.sqrt(pay_given_enroll_retention * (1 - pay_given_enroll_retention) / ((total_enrollments*5000)/total_pageviews))
net_conversion_std = np.sqrt(pay_given_click_net_conversion * (1 - pay_given_click_net_conversion) / ((total_clicks*5000)/total_pageviews))
print(f"Gross conversion standard deviation is: {gross_conversion_std}")
print(f"Retention standard deviation is: {retention_std}")
print(f"Net conversion standard deviation is: {net_conversion_std}")

Gross conversion standard deviation is: 0.020230604137049392
Retention standard deviation is: 0.05494901217850908
Net conversion standard deviation is: 0.01560154458248846


### Calculating number of pageviews

*The formula for the required sample size is:* <br/>
$$n = \frac{\left(z_{\alpha/2} * \sqrt{2*\sigma^2_{control}} + z_\beta * \sqrt{\sigma^2_{control} + \sigma^2_{experiment}}\right)^2}{\delta^2}$$<br/>
where, <br/>
$$\sigma^2 = p(1-p)$$<br/>
with, <br/>
**n**: number of samples needed in each group<br/>
**alpha**: desired statistical significance level<br/>
**beta**: 1 - power<br/>
**delta**: absolute practical significance<br/>

*A quick estimation of the above function is:*<br/>
$$n = 16\frac{\sigma^2}{\delta^2}$$
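
To see where the factor 16 in the quick estimate comes from, here is a minimal sketch. It assumes alpha = 0.05, power = 0.8 and equal variances in both groups, in which case the full formula reduces to 2(z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2. The function name `rule_of_thumb_sample_size` is hypothetical and only used for this illustration.

```python
from statistics import NormalDist

def rule_of_thumb_sample_size(p, delta):
    """Quick estimate n = 16 * sigma^2 / delta^2 with sigma^2 = p * (1 - p)."""
    return 16 * p * (1 - p) / delta ** 2

# With equal variances the full formula collapses to
# factor * sigma^2 / delta^2, where factor = 2 * (z_{alpha/2} + z_beta)^2.
z_alpha2 = NormalDist().inv_cdf(1 - 0.05 / 2)  # alpha = 0.05, two-sided
z_beta = NormalDist().inv_cdf(1 - 0.2)         # beta = 0.2, i.e. power = 0.8
factor = 2 * (z_alpha2 + z_beta) ** 2

# Gross conversion baseline and d_min from this notebook.
quick_n = rule_of_thumb_sample_size(0.20625, 0.01)

print(f"factor = {factor:.2f}")
print(f"quick estimate per group = {quick_n:.0f}")
```

The factor comes out slightly below 16, so the quick estimate is a little more conservative than the full formula.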

In [63]:
alpha = 0.05
beta = 0.2
def calc_sample_size(alpha, beta, p, delta):
    """ Based on https://www.evanmiller.org/ab-testing/sample-size.html
    Ref: https://stats.stackexchange.com/questions/357336/create-an-a-b-sample-size-calculator-using-evan-millers-post
    Args:
        alpha (float): How often are you willing to accept a Type I error (false positive)?
        beta (float): How often are you willing to accept a Type II error (false negative)? Power is 1 - beta.
        p (float): Base conversion rate
        delta (float): Minimum detectable effect, as an absolute change in the conversion rate.

    """
    t_alpha2 = norm.ppf(1.0-alpha/2)
    t_beta = norm.ppf(1-beta)

    sd1 = np.sqrt(2 * p * (1.0 - p))
    sd2 = np.sqrt(p * (1.0 - p) + (p + delta) * (1.0 - p - delta))

    return int((t_alpha2 * sd1 + t_beta * sd2) * (t_alpha2 * sd1 + t_beta * sd2) / (delta * delta))


def calculate_num_sample_multiple_metrics(metrics, alpha, beta, bonferroni=False):
    """
    metrics: a dictionary mapping name to the appropriate baseline and d_min
    bonferroni: to use bonferroni estimation or not (divide alpha by number of metrics)
    """
    needed_samples = []
    if bonferroni:
        alpha = alpha / len(metrics)
    for k, v in metrics.items():
        needed_samples.append((k, calc_sample_size(alpha, beta, v['baseline'], v['d_min'])))
    return needed_samples

In [64]:
metrics = {'gross_conversion': {'baseline': enroll_given_click_conversion, 'd_min': 0.01}, 'retention': {'baseline': pay_given_enroll_retention, 'd_min': 0.01}, 'net_conversion': {'baseline': pay_given_click_net_conversion, 'd_min': 0.0075}}
needed_samples = calculate_num_sample_multiple_metrics(metrics, alpha, beta, bonferroni=False)
needed_samples

[('gross_conversion', 25834), ('retention', 39086), ('net_conversion', 27413)]

<br/>*Remember that the sample sizes above are counted in each metric's own denominator, not in pageviews: gross_conversion needs 25834 clicks, retention needs 39086 enrollments, and net_conversion needs 27413 clicks. We therefore convert each to pageviews and multiply by two, because the function above returns the sample size for a single group (control or experiment):*

In [65]:
2*max(needed_samples[0][1] / total_clicks * total_pageviews, needed_samples[1][1] / total_enrollments * total_pageviews, needed_samples[2][1] / total_clicks * total_pageviews)

4737696.96969697

*So we need ~4.7M pageviews. At 40000 pageviews per day, that is 4737697/40000 ≈ 118.4, i.e. 119 days if we run on full traffic. This is too long, so we should remove retention from our evaluation metrics.*

In [66]:
metrics = {'gross_conversion': {'baseline': enroll_given_click_conversion, 'd_min': 0.01}, 'net_conversion': {'baseline': pay_given_click_net_conversion, 'd_min': 0.0075}}
needed_samples = calculate_num_sample_multiple_metrics(metrics, alpha, beta, bonferroni=False)
needed_samples

[('gross_conversion', 25834), ('net_conversion', 27413)]

In [67]:
2*max(needed_samples[0][1] / total_clicks * total_pageviews, needed_samples[1][1] / total_clicks * total_pageviews)

685325.0

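### Choosing duration and exposure
If we run this experiment on 100% of traffic, we would need 685325/40000 = 17.13, i.e. 18 days. This is already a fairly long period, and diverting a smaller fraction of traffic would lengthen it further, so it is advisable to run on 100% of traffic. However, be aware of the implication: all traffic is then enrolled in the experiment, so any problem with the change affects the largest possible share of users.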
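
The duration arithmetic can be sketched as a small helper. This is only an illustration: `days_needed` and its `traffic_fraction` parameter are names introduced here, and the daily traffic is assumed to stay at the baseline 40000 pageviews.

```python
import math

def days_needed(required_pageviews, daily_pageviews, traffic_fraction=1.0):
    """Whole days required to collect the pageviews at a given traffic share."""
    return math.ceil(required_pageviews / (daily_pageviews * traffic_fraction))

full_traffic_days = days_needed(685325, 40000)       # gross + net conversion
half_traffic_days = days_needed(685325, 40000, 0.5)  # same, on 50% of traffic
with_retention_days = days_needed(4737696.96969697, 40000)  # retention included

print(full_traffic_days, half_traffic_days, with_retention_days)
```

Halving the diverted traffic roughly doubles the duration, which is the trade-off between exposure and run time.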

### Sanity Check

#### Count metrics

In [86]:
idx = ["cookies","clicks","enrollments","payments"]
results = {"Control":pd.Series([control['Pageviews'].sum(), control['Clicks'].sum(), control['Enrollments'].sum(),control['Payments'].sum()],
                                index = idx),
           "Experiment":pd.Series([experiment['Pageviews'].sum(), experiment['Clicks'].sum(), experiment['Enrollments'].sum(),experiment['Payments'].sum()],
                               index = idx)}
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Control,Experiment
cookies,345543.0,344660.0
clicks,28378.0,28325.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


*First, we calculate the margin:*<br/>
$$m = z_{\alpha/2} * \sqrt{\frac{\sigma^2}{N}}$$<br/>
where,<br/>
$$\sigma^2 = p(1-p)$$

In [88]:
df_results['Total']=df_results['Control'] + df_results['Experiment']
df_results['Prob'] = 0.5
df_results['SE'] = np.sqrt((df_results['Prob'] * (1- df_results['Prob']))/df_results['Total'])
df_results['Margin'] = norm.ppf(1-alpha/2) * df_results['SE']
df_results

Unnamed: 0,Control,Experiment,Total,Prob,SE,Margin
cookies,345543.0,344660.0,690203.0,0.5,0.000602,0.00118
clicks,28378.0,28325.0,56703.0,0.5,0.0021,0.004115
enrollments,3785.0,3423.0,7208.0,0.5,0.005889,0.011543
payments,2033.0,1945.0,3978.0,0.5,0.007928,0.015538


In [93]:
df_results['CI_low'] = df_results['Prob'] - df_results['Margin']
df_results['CI_high'] = df_results['Prob'] + df_results['Margin']
df_results['Observed'] = df_results['Experiment'] / df_results['Total']
df_results['Pass'] = df_results.apply(lambda x: (x['Observed'] < x['CI_high']) and (x['Observed'] > x['CI_low']), axis=1)
df_results

Unnamed: 0,Control,Experiment,Total,Prob,SE,Margin,CI_low,CI_high,Observed,Pass
cookies,345543.0,344660.0,690203.0,0.5,0.000602,0.00118,0.49882,0.50118,0.49936,True
clicks,28378.0,28325.0,56703.0,0.5,0.0021,0.004115,0.495885,0.504115,0.499533,True
enrollments,3785.0,3423.0,7208.0,0.5,0.005889,0.011543,0.488457,0.511543,0.474889,False
payments,2033.0,1945.0,3978.0,0.5,0.007928,0.015538,0.484462,0.515538,0.488939,True
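
The check in the table above can also be written as a standalone function, sketched here with only the standard library. The function name `count_sanity_check` is hypothetical. It assumes the experiment share of each count should be 0.5 under an even split. Note that enrollments and payments are not among the invariant metrics chosen earlier, so a failed check on those rows is not by itself a sanity failure.

```python
from statistics import NormalDist

def count_sanity_check(n_control, n_experiment, alpha=0.05, p=0.5):
    """Return True if the experiment share of the count lies inside the CI around p."""
    total = n_control + n_experiment
    se = (p * (1 - p) / total) ** 0.5
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    observed = n_experiment / total
    return p - margin < observed < p + margin

# Counts taken from the table above.
cookies_ok = count_sanity_check(345543, 344660)
clicks_ok = count_sanity_check(28378, 28325)
enrollments_ok = count_sanity_check(3785, 3423)

print(cookies_ok, clicks_ok, enrollments_ok)
```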


### Other metrics

*For other metrics such as rates, we calculate the CI around the control rate and check whether the experiment rate falls within it:*<br/>
$$m = z_{\alpha/2} * \sqrt{\frac{p_{control}*(1-p_{control})}{N_{control}}}$$<br/>
$$CI = p_{control} \pm m$$

In [113]:
clicks_cont = df_results.loc['clicks', 'Control']
pvs_cont = df_results.loc['cookies', 'Control']
clicks_exp = df_results.loc['clicks', 'Experiment']
pvs_exp = df_results.loc['cookies', 'Experiment']


clicks_to_pvs_cont = clicks_cont / pvs_cont
clicks_to_pvs_exp = clicks_exp / pvs_exp

diff = clicks_to_pvs_exp - clicks_to_pvs_cont
se = np.sqrt(clicks_to_pvs_cont * (1- clicks_to_pvs_cont) / pvs_cont)
margin = se * norm.ppf(1-alpha/2)

lower = clicks_to_pvs_cont - margin
upper = clicks_to_pvs_cont + margin

In [117]:
print(f"Lower CI for CTR is: {lower}")
print(f"Upper CI for CTR is: {upper}")
print(f"Observed CTR is: {clicks_to_pvs_exp}")

Lower CI for CTR is: 0.08121037657420853
Upper CI for CTR is: 0.08304125057494512
Observed CTR is: 0.08218244066616376


### Effect size test

In [120]:
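# Keep only the days with enrollment data; .copy() avoids pandas'
# SettingWithCopyWarning when columns are added to these frames later
control = control[control['Enrollments'].notna()].copy()
experiment = experiment[experiment['Enrollments'].notna()].copy()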

In [122]:
idx = ["cookies","clicks","enrollments","payments"]
results = {"Control":pd.Series([control['Pageviews'].sum(), control['Clicks'].sum(), control['Enrollments'].sum(),control['Payments'].sum()],
                                index = idx),
           "Experiment":pd.Series([experiment['Pageviews'].sum(), experiment['Clicks'].sum(), experiment['Enrollments'].sum(),experiment['Payments'].sum()],
                               index = idx)}
df_results = pd.DataFrame(results)
df_results

Unnamed: 0,Control,Experiment
cookies,212163.0,211362.0
clicks,17293.0,17260.0
enrollments,3785.0,3423.0
payments,2033.0,1945.0


In [124]:
df_results['Total']=df_results['Control'] + df_results['Experiment']

In [129]:
# Get the values

enrollments_cont = df_results.loc['enrollments', 'Control']
clicks_cont = df_results.loc['clicks', 'Control']
payments_cont = df_results.loc['payments', 'Control']
enrollments_exp = df_results.loc['enrollments','Experiment']
clicks_exp = df_results.loc['clicks', 'Experiment']
payments_exp = df_results.loc['payments', 'Experiment']

gross_conversion_cont = enrollments_cont/clicks_cont
net_conversion_cont = payments_cont/clicks_cont
gross_conversion_exp = enrollments_exp/clicks_exp
net_conversion_exp = payments_exp/clicks_exp


gross_conversion = (enrollments_exp + enrollments_cont)/(clicks_cont + clicks_exp)
net_conversion = (payments_cont + payments_exp)/(clicks_cont + clicks_exp)
print(f"Gross conversion is: {gross_conversion}")
print(f"Net conversion is: {net_conversion}")

Gross conversion is: 0.20860706740369866
Net conversion is: 0.1151274853124186


*Here, we can use either the analytical or empirical variance to calculate the standard error. With the pooled probability* $\hat{p}$ *computed from the observed data:*<br/>
$$SE_{empirical} = \sqrt{\hat{p} * (1 - \hat{p}) * \left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}$$<br/>
We can also use the analytical standard deviation estimated on 5000 pageviews and rescale it to the actual group sizes:<br/>
$$\frac{SE_{analytical}}{\sqrt{1/N_{cont} + 1/N_{exp}}} = \frac{std_{5000\_PVs}}{\sqrt{1/N_{event\_from\_5000\_pvs}}}$$
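
One reading of the analytical scaling relation, stated here as an assumption, is that both sides equal sqrt(p(1-p)): dividing the 5000-pageview standard deviation by sqrt(1/N) removes the sample-size dependence, and multiplying by sqrt(1/N_cont + 1/N_exp) rescales it to the actual groups. The sketch below applies this to gross conversion, using the baseline figures and the click counts reported in this notebook.

```python
import math

p = 0.20625                            # baseline gross conversion
std_5000 = 0.020230604137049392        # analytic std on 5000 pageviews (from above)
n_clicks_5000 = 3200 * 5000 / 40000    # clicks expected from 5000 pageviews

n_cont, n_exp = 17293, 17260           # clicks on days with enrollment data

# Remove the sample-size dependence of the 5000-pageview estimate ...
unit_sd = std_5000 / math.sqrt(1 / n_clicks_5000)
# ... then rescale to the sizes of the two groups.
se_analytic = unit_sd * math.sqrt(1 / n_cont + 1 / n_exp)

print(f"sqrt(p(1-p)) recovered: {unit_sd:.6f}")
print(f"Analytical SE for gross conversion: {se_analytic:.6f}")
```

The result is close to, but not identical with, the pooled standard error, because the baseline rate differs slightly from the pooled rate observed in the experiment.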

In [136]:
def get_ci(p_hat, p_hat_cont, p_hat_exp, alpha, N_cont, N_exp):
    SE = np.sqrt((p_hat * (1 - p_hat))*(1/N_cont + 1/N_exp))
    z_score = norm.ppf(1-alpha/2)
    margin = z_score * SE
    
    diff = p_hat_exp - p_hat_cont
    ci_lower = diff - margin
    ci_upper = diff + margin
    
    return ci_lower, ci_upper

In [137]:
res = get_ci(gross_conversion, gross_conversion_cont, gross_conversion_exp, alpha, clicks_cont, clicks_exp)
print(f"Gross conversion CI is {res}")

Gross conversion CI is (-0.02912320088750467, -0.011986548273218463)

In [138]:
res = get_ci(net_conversion, net_conversion_cont, net_conversion_exp, alpha, clicks_cont, clicks_exp)
print(f"Net conversion CI is {res}")

Net conversion CI is (-0.011604500677993734, 0.0018570553289053993)
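
To read the two intervals against the practical significance thresholds (dmin) given in the metric definitions, a common convention is sketched below: a change is statistically significant if the CI excludes 0, and practically significant if the whole CI lies beyond ±dmin. The helper `classify_ci` is a name introduced here for illustration, and the convention is an assumption rather than something this notebook states.

```python
def classify_ci(ci, d_min):
    """Return (statistically_significant, practically_significant) for a CI of a difference."""
    lower, upper = ci
    statistically_significant = not (lower <= 0 <= upper)
    # Practical significance: the entire interval is beyond the d_min boundary.
    practically_significant = upper < -d_min or lower > d_min
    return statistically_significant, practically_significant

gross = classify_ci((-0.02912320088750467, -0.011986548273218463), 0.01)
net = classify_ci((-0.011604500677993734, 0.0018570553289053993), 0.0075)

print(f"Gross conversion: {gross}")  # (True, True)
print(f"Net conversion: {net}")      # (False, False)
```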


### Sign test

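*The sign test is a binomial test on the number of days on which one group's metric exceeds the other's. Under the null hypothesis, each direction is equally likely (p = 0.5).*<br/>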

In [140]:
for dataset in [control, experiment]:
    dataset['gross_conversion'] = dataset['Enrollments'] / dataset['Clicks']
    dataset['net_conversion'] = dataset['Payments'] / dataset['Clicks']

In [152]:
from scipy.stats import binomtest  # binom_test was removed in SciPy 1.12

# True on days where the control rate exceeds the experiment rate
gross_conversion_sign = control['gross_conversion'] > experiment['gross_conversion']
net_conversion_sign = control['net_conversion'] > experiment['net_conversion']

gross_p = binomtest(int(gross_conversion_sign.sum()), len(gross_conversion_sign)).pvalue
net_p = binomtest(int(net_conversion_sign.sum()), len(net_conversion_sign)).pvalue
print(f"Gross conversion p-value is: {gross_p}")
print(f"Net conversion p-value is: {net_p}")

Gross conversion p-value is: 0.0025994777679443364
Net conversion p-value is: 0.6776394844055175


In [153]:
sum(gross_conversion_sign)

19

In [154]:
len(gross_conversion_sign)

23
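
The gross conversion p-value can be checked by hand from the counts above (19 of 23 days). The sketch below computes the exact two-sided binomial tail with p = 0.5 using only the standard library. The function name `sign_test_p_value` is hypothetical, and doubling the tail relies on the symmetry of the binomial distribution at p = 0.5.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Exact two-sided sign-test p-value under the null p = 0.5."""
    # Use the larger of the two counts so we always sum the upper tail.
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The distribution is symmetric at p = 0.5, so double one tail (capped at 1).
    return min(1.0, 2 * tail)

gross_p_by_hand = sign_test_p_value(19, 23)
print(f"Gross conversion p-value by hand: {gross_p_by_hand}")
```

This reproduces the value of about 0.0026 printed by the binomial test above.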