# 1. Experiment Overview

At the time of this experiment, Udacity courses have two options on the course overview page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This screenshot shows what the experiment looks like.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

# 2. Customer Journey

#### Experiment
Customer visits Udacity course page >> clicks "Start free trial" button >> is asked whether they can devote 5+ hours per week >> answers yes and enrolls >> after 14 days, pays

#### Control
Customer visits Udacity course page >> clicks "Start free trial" button >> enrolls >> after 14 days, pays

# 3. Metrics

1. Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
2. Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)
3. Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
4. Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
5. Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
6. Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
7. Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)
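
The three ratio metrics are linked: gross conversion and net conversion share the same denominator (clicks), so net conversion is gross conversion multiplied by retention. A minimal sketch of that relationship, using the baseline figures that appear in the metric-variability cell of this notebook (the variable names here are only illustrative):

```python
# Baseline daily values, copied from the metric-variability cell of this notebook.
n_clicks = 3200
n_enroll = 660
retention = 0.53  # payments / enrollments

gross_conversion = n_enroll / n_clicks          # enrollments / clicks
net_conversion = gross_conversion * retention   # payments / clicks

print("gross conversion:", gross_conversion)
print("net conversion:  ", round(net_conversion, 7))
```

This is why the three evaluation metrics are not independent: a change in any two of them fixes the third.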


In [1]:
import math
from scipy.stats import norm
import pandas as pd
import numpy as np

## Metric Choice
Which of the following metrics would you choose to measure for this experiment and why? For each metric you choose, indicate whether you would use it as an invariant metric or an evaluation metric. The practical significance boundary for each metric, that is, the difference that would have to be observed before that was a meaningful change for the business, is given in parentheses. All practical significance boundaries are given as absolute changes.


Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.


You should also decide now what results you will be looking for in order to launch the experiment. Would a change in any one of your evaluation metrics be sufficient? Would you want to see multiple metrics all move or not move at the same time in order to launch? This decision will inform your choices while designing the experiment.

#####  Invariant Metrics : number of cookies, number of clicks, click-through-probability.

#####  Evaluation Metrics : gross conversion, retention, net conversion.

#### website to double check your sample size result

https://www.evanmiller.org/ab-testing/sample-size.html

#### Metric variability

In [41]:
n_pageviews=40000
n_clicks=3200
n_enroll=660
ctp=0.08
n_sample=5000

click_through_probability=0.08 #clicks / pageviews
gross_conversion=0.20625 # enroll / click
retention=0.53 # payment / enroll
net_conversion=0.1093125 # payment / click

"""analytic standard deviation estimate"""
# gross_conversion
std_gross_conversion=math.sqrt(gross_conversion*(1-gross_conversion)/(n_clicks/n_pageviews*n_sample))
# retention
std_retention=math.sqrt(retention*(1-retention)/(n_enroll/n_pageviews*n_sample))
# net_conversion
std_net_conversion=math.sqrt(net_conversion*(1-net_conversion)/(n_clicks/n_pageviews*n_sample))
print("SD of GC: ",round(std_gross_conversion,4))
print("SD of Retention: ",round(std_retention,4))
print("SD of NC: ",round(std_net_conversion,4))

SD of GC:  0.0202
SD of Retention:  0.0549
SD of NC:  0.0156


In [43]:
print("SD of GC: ",round(np.sqrt((.206250*(1-.206250))/(5000*3200/40000)),4))
print("SD of Retention: ",round(np.sqrt((.53*(1-.53))/(5000*660/40000)),4))
print("SD of NC: ",round(np.sqrt((.109313*(1-.109313))/(5000*3200/40000)),4))


SD of GC:  0.0202
SD of Retention:  0.0549
SD of NC:  0.0156
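
Both cells above apply the same analytic estimate: the standard error of a binomial proportion, with the sampled pageviews scaled down to the metric's own unit of analysis. In the notation below (the symbols are mine, not the notebook's), $p$ is the baseline value, $N_{pv}$ is the number of sampled pageviews (5,000), and $r$ is the baseline number of denominator units per pageview (3200/40000 for click-based metrics, 660/40000 for retention):

```latex
\sigma_{\hat{p}} = \sqrt{\frac{p\,(1-p)}{N_{pv}\, r}}
\qquad\text{e.g.}\qquad
\sigma_{GC} = \sqrt{\frac{0.20625 \times 0.79375}{5000 \times 0.08}} \approx 0.0202
```

The analytic estimate is most trustworthy when the unit of analysis matches the unit of diversion, which holds for the click-based metrics (cookies) but not for retention (user-ids).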


# 4. Sizing
Will you use Bonferroni Correction? > No. The evaluation metrics are closely related to each other, so Bonferroni would be too conservative.
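
To make "too conservative" concrete, the sketch below compares the two-sided critical z-value at an unadjusted alpha of 0.05 with the Bonferroni-adjusted value for three evaluation metrics. It uses only the standard library, and the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
n_metrics = 3  # gross conversion, retention, net conversion

# Two-sided critical values without and with the Bonferroni adjustment.
z_plain = NormalDist().inv_cdf(1 - alpha / 2)
z_bonf = NormalDist().inv_cdf(1 - (alpha / n_metrics) / 2)

print(round(z_plain, 3), round(z_bonf, 3))
```

The adjusted critical value is larger, so every confidence interval would widen and the required sample size would grow, even though correlated metrics do not inflate the false-positive rate as much as independent ones would.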

Which evaluation metrics did you choose? > gross conversion, retention, net conversion
How many pageviews will you need?

##### Gross Conversion
Baseline Conversion: 20.625%
Minimum Detectable Effect: 1%
alpha: 5%
beta: 20%
1 - beta: 80%
sample size = 25,835 clicks/group
Number of groups = 2 (experiment and control)
total sample size = 51,670 clicks
clicks/pageview: 3200/40000 = .08 clicks/pageview
pageviews = 51,670/.08 = 645,875

##### Retention
Baseline Conversion: 53%
Minimum Detectable Effect: 1%
alpha: 5%
beta: 20%
1 - beta: 80%
sample size = 39,115 enrollments/group
Number of groups = 2 (experiment and control)
total sample size = 78,230 enrollments
enrollments/pageview: 660/40000 = .0165 enrollments/pageview
pageviews = 78,230/.0165 = 4,741,212


##### Net Conversion
Baseline Conversion: 10.9313%
Minimum Detectable Effect: .75%
alpha: 5%
beta: 20%
1 - beta: 80%
sample size = 27,413 clicks/group
Number of groups = 2 (experiment and control)
total sample size = 54,826 clicks
clicks/pageview: 3200/40000 = .08 clicks/pageview
pageviews = 54,826/.08 = 685,325
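
The per-group sample sizes above come from the online calculator. As a cross-check, here is a minimal sketch of a common normal-approximation formula for comparing two proportions. It is an assumption that this matches the calculator's internals. The function names are mine, and taking the harder of the two directions (p + d_min and p - d_min) is my own conservative choice. The results should land close to the figures listed above.

```python
from statistics import NormalDist

def sample_size_per_group(p, d_min, alpha=0.05, beta=0.20):
    """Approximate units per group needed to detect an absolute change d_min."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)

    def n_for(p_alt):
        sd_null = (2 * p * (1 - p)) ** 0.5
        sd_alt = (p * (1 - p) + p_alt * (1 - p_alt)) ** 0.5
        return ((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2

    # Assumption: size for whichever direction of change needs more data.
    return round(max(n_for(p + d_min), n_for(p - d_min)))

def pageviews_needed(n_per_group, units_per_pageview, groups=2):
    """Convert a per-group sample size into total pageviews."""
    return n_per_group * groups / units_per_pageview

n_gc = sample_size_per_group(0.20625, 0.01)
n_rt = sample_size_per_group(0.53, 0.01)
n_nc = sample_size_per_group(0.1093125, 0.0075)
print("per group:", n_gc, n_rt, n_nc)
print("pageviews for net conversion:", pageviews_needed(n_nc, 3200 / 40000))
```

The denominator unit matters when converting to pageviews: click-based metrics divide by 0.08 clicks per pageview, while retention divides by 0.0165 enrollments per pageview, which is why retention needs far more traffic.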

# 5. Duration

In [44]:
## too long
4741212.0/40000

118.5303

If we divert 100% of traffic, given 40,000 pageviews per day, the experiment would take 119 days, which is too long. If we drop retention, we are left with gross conversion and net conversion. This reduces the number of required pageviews to 685,325, which is an 18-day experiment at 100% diversion. Other experiments may need traffic as well, so we divert 50% and run for 35 days.




In [45]:
685325.0/40000

17.133125
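
The duration arithmetic can be wrapped in a small helper so that different diversion fractions are easy to compare. This is only a sketch: `days_needed` and its parameter names are hypothetical, and the 40,000 daily pageviews is the baseline figure used throughout this notebook.

```python
import math

def days_needed(total_pageviews, daily_pageviews=40000, traffic_fraction=1.0):
    """Whole days required to collect total_pageviews at the given diversion."""
    return math.ceil(total_pageviews / (daily_pageviews * traffic_fraction))

print(days_needed(4741212))                       # all three metrics, 100% of traffic
print(days_needed(685325))                        # retention dropped, 100% of traffic
print(days_needed(685325, traffic_fraction=0.5))  # retention dropped, 50% of traffic
```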

#### Analysis
The data for you to analyze is here. This data contains the raw information needed to compute the above metrics, broken down day by day. Note that there are two sheets within the spreadsheet - one for the experiment group, and one for the control group.


The meaning of each column is:

1. Pageviews: Number of unique cookies to view the course overview page that day.
2. Clicks: Number of unique cookies to click the "Start free trial" button on the course overview page that day.
3. Enrollments: Number of user-ids to enroll in the free trial that day.
4. Payments: Number of user-ids who enrolled on that day to remain enrolled for 14 days and thus make a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

# 6. Sanity Checks : Invariant Metrics

Start by checking whether your invariant metrics are equivalent between the two groups. If the invariant metric is a simple count that should be randomly split between the 2 groups, you can use a binomial test as demonstrated in Lesson 5. Otherwise, you will need to construct a confidence interval for a difference in proportions using a similar strategy as in Lesson 1, then check whether the difference between group values falls within that confidence interval.


If your sanity checks fail, look at the day by day data and see if you can offer any insight into what is causing the problem.

In [2]:
dates=['Sat, Oct 11', 'Sun, Oct 12', 'Mon, Oct 13', 'Tue, Oct 14',
       'Wed, Oct 15', 'Thu, Oct 16', 'Fri, Oct 17', 'Sat, Oct 18',
       'Sun, Oct 19', 'Mon, Oct 20', 'Tue, Oct 21', 'Wed, Oct 22',
       'Thu, Oct 23', 'Fri, Oct 24', 'Sat, Oct 25', 'Sun, Oct 26',
       'Mon, Oct 27', 'Tue, Oct 28', 'Wed, Oct 29', 'Thu, Oct 30',
       'Fri, Oct 31', 'Sat, Nov 1', 'Sun, Nov 2', 'Mon, Nov 3',
       'Tue, Nov 4', 'Wed, Nov 5', 'Thu, Nov 6', 'Fri, Nov 7',
       'Sat, Nov 8', 'Sun, Nov 9', 'Mon, Nov 10', 'Tue, Nov 11',
       'Wed, Nov 12', 'Thu, Nov 13', 'Fri, Nov 14', 'Sat, Nov 15',
       'Sun, Nov 16']
pageviews_cont=[ 7723,  9102, 10511,  9871, 10014,  9670,  9008,  7434,  8459,
       10667, 10660,  9947,  8324,  9434,  8687,  8896,  9535,  9363,
        9327,  9345,  8890,  8460,  8836,  9437,  9420,  9570,  9921,
        9424,  9010,  9656, 10419,  9880, 10134,  9717,  9192,  8630,
        8970]
pageviews_exp=[ 7716,  9288, 10480,  9867,  9793,  9500,  9088,  7664,  8434,
       10496, 10551,  9737,  8176,  9402,  8669,  8881,  9655,  9396,
        9262,  9308,  8715,  8448,  8836,  9359,  9427,  9633,  9842,
        9272,  8969,  9697, 10445,  9931, 10042,  9721,  9304,  8668,
        8988]
clicks_cont=[687, 779, 909, 836, 837, 823, 748, 632, 691, 861, 867, 838, 665,
       673, 691, 708, 759, 736, 739, 734, 706, 681, 693, 788, 781, 805,
       830, 781, 756, 825, 874, 830, 801, 814, 735, 743, 722]
clicks_exp=[686, 785, 884, 827, 832, 788, 780, 652, 697, 860, 864, 801, 642,
       697, 669, 693, 771, 736, 727, 728, 722, 695, 724, 789, 743, 808,
       831, 767, 760, 850, 851, 831, 802, 829, 770, 724, 710]
enrolls_cont=[134, 147, 167, 156, 163, 138, 146, 110, 131, 165, 196, 162, 127,
       220, 176, 161, 233, 154, 196, 167, 174, 156, 206]
enrolls_exp=[105, 116, 145, 138, 140, 129, 127,  94, 120, 153, 143, 128, 122,
       194, 127, 153, 213, 162, 201, 207, 182, 142, 182]
payment_cont=[ 70,  70,  95, 105,  64,  82,  76,  70,  60,  97, 105,  92,  56,
       122, 128, 104, 124,  91,  86,  75, 101,  93,  67]
payment_exp=[ 34,  91,  79,  92,  94,  61,  44,  62,  77,  98,  71,  70,  68,
        94,  81, 101, 119, 120,  96,  67, 123, 100, 103]
ctp_cont=[i/j for i,j in zip(clicks_cont,pageviews_cont)]
ctp_exp=[i/j for i,j in zip(clicks_exp,pageviews_exp)]

In [5]:
"""pageviews: cookies should be split evenly between control and experiment, so the expected control fraction is 0.5"""
sum_pv_cont=sum(pageviews_cont)
sum_pv_exp=sum(pageviews_exp)
SD_pageviews=math.sqrt(0.5*0.5/(sum_pv_cont+sum_pv_exp))
m=1.96*SD_pageviews
ci_min,ci_max=0.5-m,0.5+m
print("Confidence Interval for pageviews: [{},{}]".format(round(ci_min,4),round(ci_max,4)))
print("Observed: ",round(sum_pv_cont/(sum_pv_exp+sum_pv_cont),4))

Confidence Interval for pageviews: [0.4988,0.5012]
Observed:  0.5006


In [6]:
"""clicks: clicks should be split evenly between control and experiment, so the expected control fraction is 0.5"""
sum_click_cont=sum(clicks_cont)
sum_click_exp=sum(clicks_exp)
SD_clicks=math.sqrt(0.5*0.5/(sum_click_cont+sum_click_exp))
m=1.96*SD_clicks
ci_min,ci_max=0.5-m,0.5+m
print("Confidence Interval for clicks: [{},{}]".format(round(ci_min,4),round(ci_max,4)))
print("Observed: ",round(sum_click_cont/(sum_click_exp+sum_click_cont),4))

Confidence Interval for clicks: [0.4959,0.5041]
Observed:  0.5005


In [7]:
"""click-through probability: should be the same in both groups, so the expected difference is 0"""
ctp_cont=sum_click_cont/sum_pv_cont
ctp_exp=sum_click_exp/sum_pv_exp
d_hat=ctp_exp-ctp_cont
ctp_pool=(sum_click_cont+sum_click_exp)/(sum_pv_cont+sum_pv_exp)
SE_ctp=math.sqrt(ctp_pool*(1-ctp_pool)*(1/sum_pv_cont+1/sum_pv_exp))
m=1.96*SE_ctp
ci_min,ci_max=-m,m
print("Confidence Interval for ctp: [{},{}]".format(round(ci_min,4),round(ci_max,4)))
print("Observed: ",round(d_hat,4))

Confidence Interval for ctp: [-0.0013,0.0013]
Observed:  0.0001


In [25]:
# a different presentation

results = {"Control":pd.Series([sum(pageviews_cont),sum(clicks_cont),
                                  sum(enrolls_cont),sum(payment_cont)],
                                  index = ["cookies","clicks","enrollments","payments"]),
           "Experiment":pd.Series([sum(pageviews_exp),sum(clicks_exp),
                               sum(enrolls_exp),sum(payment_exp)],
                               index = ["cookies","clicks","enrollments","payments"])}
df_results = pd.DataFrame(results)


df_results['Total']=df_results.Control + df_results.Experiment
df_results['Prob'] = 0.5
df_results['StdErr'] = np.sqrt((df_results.Prob * (1- df_results.Prob))/df_results.Total)
df_results["MargErr"] = 1.96 * df_results.StdErr
df_results["CI_lower"] = df_results.Prob - df_results.MargErr
df_results["CI_upper"] = df_results.Prob + df_results.MargErr
df_results["Obs_val"] = df_results.Experiment/df_results.Total
df_results["Pass_Sanity"] = df_results.apply(lambda x: (x.Obs_val > x.CI_lower) and (x.Obs_val < x.CI_upper),axis=1)
df_results['Diff'] = abs((df_results.Experiment - df_results.Control)/df_results.Total)


df_results

| | Control | Experiment | Total | Prob | StdErr | MargErr | CI_lower | CI_upper | Obs_val | Pass_Sanity | Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cookies | 345543 | 344660 | 690203 | 0.5 | 0.000602 | 0.00118 | 0.49882 | 0.50118 | 0.49936 | True | 0.001279 |
| clicks | 28378 | 28325 | 56703 | 0.5 | 0.0021 | 0.004116 | 0.495884 | 0.504116 | 0.499533 | True | 0.000935 |
| enrollments | 3785 | 3423 | 7208 | 0.5 | 0.005889 | 0.011543 | 0.488457 | 0.511543 | 0.474889 | False | 0.050222 |
| payments | 2033 | 1945 | 3978 | 0.5 | 0.007928 | 0.015538 | 0.484462 | 0.515538 | 0.488939 | True | 0.022122 |
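
The count-based sanity checks can also be reported as p-values rather than as intervals. The sketch below uses the normal approximation to the binomial test (not an exact test) on the cookie and click totals from the table above. `split_p_value` is a hypothetical helper name. Only cookies and clicks are checked here, because enrollments and payments are not invariant metrics and are expected to differ between groups.

```python
import math
from statistics import NormalDist

def split_p_value(n_control, n_experiment, p_expected=0.5):
    """Two-sided p-value for an even split (normal approximation to the binomial)."""
    total = n_control + n_experiment
    se = math.sqrt(p_expected * (1 - p_expected) / total)
    z = (n_control / total - p_expected) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Totals taken from the summary table above.
p_cookies = split_p_value(345543, 344660)
p_clicks = split_p_value(28378, 28325)
print("cookies:", round(p_cookies, 3), "clicks:", round(p_clicks, 3))
```

A p-value above 0.05 agrees with the observed fraction falling inside the 95% interval, so both forms of the check lead to the same pass or fail decision.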


# 7. Evaluation Metrics

5. Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
6. Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
7. Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)


In [26]:
"""gross conversion"""
n=len(enrolls_exp)
d_min=0.01
sum_clicks_cont=sum(clicks_cont[:n])
sum_clicks_exp=sum(clicks_exp[:n])
sum_enroll_cont=sum(enrolls_cont[:n])
sum_enroll_exp=sum(enrolls_exp[:n])
# pooled standard error & error margin
p_pool=(sum_enroll_exp+sum_enroll_cont)/(sum_clicks_exp+sum_clicks_cont)
SE_pool=math.sqrt(p_pool*(1-p_pool)*(1/sum_clicks_cont+1/sum_clicks_exp))
m=SE_pool*1.96

d_hat=sum_enroll_exp/sum_clicks_exp-sum_enroll_cont/sum_clicks_cont # observed difference

print("Confidence Interval:[{},{}]".format(d_hat-m,d_hat+m))
print("Observed:",d_hat)
print ("Statistically significant:", d_hat+m<0 or d_hat-m>0 ,",  CI doesn't include 0")
print("Practically significant:",True,",  CI doesn't include d_min or -d_min")

Confidence Interval:[-0.0291233583354044,-0.01198639082531873]
Observed: -0.020554874580361565
Statistically significant: True ,  CI doesn't include 0
Practically significant: True ,  CI doesn't include d_min or -d_min


In [27]:
"""retention"""
n=len(payment_exp)
d_min=0.01
sum_payment_cont=sum(payment_cont[:n])
sum_payment_exp=sum(payment_exp[:n])
sum_enroll_cont=sum(enrolls_cont[:n])
sum_enroll_exp=sum(enrolls_exp[:n])
p_pool=(sum_payment_cont+sum_payment_exp)/(sum_enroll_cont+sum_enroll_exp)
SE_pool=math.sqrt(p_pool*(1-p_pool)*(1/sum_enroll_cont+1/sum_enroll_exp))
m=SE_pool*1.96
d_hat=sum_payment_exp/sum_enroll_exp-sum_payment_cont/sum_enroll_cont
print("Confidence Interval:[{},{}]".format(d_hat-m,d_hat+m))
print("Observed:",d_hat)
print ("Statistically significant:", d_hat+m<0 or d_hat-m>0 ,",  CI doesn't include 0")
print("Practically significant:",False,",  CI includes d_min")

Confidence Interval:[0.008104435728019967,0.05408517368626556]
Observed: 0.031094804707142765
Statistically significant: True ,  CI doesn't include 0
Practically significant: False ,  CI includes d_min


In [28]:
"""net conversion"""
n=len(enrolls_exp)
d_min=0.0075
sum_clicks_cont=sum(clicks_cont[:n])
sum_clicks_exp=sum(clicks_exp[:n])
sum_payment_cont=sum(payment_cont[:n])
sum_payment_exp=sum(payment_exp[:n])
p_pool=(sum_payment_exp+sum_payment_cont)/(sum_clicks_exp+sum_clicks_cont)
SE_pool=math.sqrt(p_pool*(1-p_pool)*(1/sum_clicks_cont+1/sum_clicks_exp))
m=SE_pool*1.96
d_hat=sum_payment_exp/sum_clicks_exp-sum_payment_cont/sum_clicks_cont
print("Confidence Interval:[{},{}]".format(d_hat-m,d_hat+m))
print("Observed:",d_hat)
print ("Statistically significant:", d_hat+m<0 or d_hat-m>0 ,",  CI includes 0")
print("Practically significant:",False,",  CI includes -d_min")

Confidence Interval:[-0.011604624359891718,0.001857179010803383]
Observed: -0.0048737226745441675
Statistically significant: False ,  CI includes 0
Practically significant: False ,  CI includes -d_min
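
The three cells above repeat the same pooled-standard-error calculation and hard-code the practical-significance flag. A minimal sketch of a shared helper that derives both flags from the interval follows. `diff_ci` is a hypothetical name, and treating "practically significant" as "the whole interval lies beyond plus or minus d_min" is my reading of the labels printed above. The totals are the sums of the first 23 days of the lists defined earlier.

```python
import math

def diff_ci(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """Pooled-SE confidence interval for p_exp - p_cont, plus significance flags."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    lo, hi = d_hat - z * se, d_hat + z * se
    stat_sig = lo > 0 or hi < 0
    # Assumption: practical significance requires the whole CI beyond +/- d_min.
    prac_sig = lo > d_min or hi < -d_min
    return d_hat, lo, hi, stat_sig, prac_sig

# Totals over the 23 days that have enrollment and payment data.
clicks_c, clicks_e = 17293, 17260
enroll_c, enroll_e = 3785, 3423
pay_c, pay_e = 2033, 1945

gross = diff_ci(enroll_c, clicks_c, enroll_e, clicks_e, d_min=0.01)
reten = diff_ci(pay_c, enroll_c, pay_e, enroll_e, d_min=0.01)
net = diff_ci(pay_c, clicks_c, pay_e, clicks_e, d_min=0.0075)
for name, r in [("gross", gross), ("retention", reten), ("net", net)]:
    print(name, [round(v, 4) for v in r[:3]], "stat:", r[3], "practical:", r[4])
```

Deriving the flags from the interval avoids the risk of a hard-coded label drifting out of sync with the numbers.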


## Sign Test

For each evaluation metric, do a sign test using the day-by-day breakdown. If the sign test does not agree with the confidence interval for the difference, see if you can figure out why.



In [38]:
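# binom_test was removed in SciPy 1.12; fall back to binomtest when it is missing.
try:
    from scipy.stats import binom_test
except ImportError:
    from scipy.stats import binomtest
    def binom_test(x, n=1, p=0.5):
        return binomtest(x, n=n, p=p).pvalue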
"""gross conversion"""
alpha=0.05
beta=0.2

gc_exp=[i/j for i,j in zip(enrolls_exp,clicks_exp)]
gc_cont=[i/j for i,j in zip(enrolls_cont,clicks_cont)]
gc_diff=sum([i>j for i,j in zip(gc_exp,gc_cont)])
days=len(gc_exp)

# The prob of gross conversion of experiment group > gross conversion of control group is 0.5
print('There are {} out of {} days that gross conversion of experiment group > gross conversion of control group'.format(gc_diff, days))
p_value=binom_test(gc_diff, n=days, p=0.5)
print("p-value:",p_value,", Statistically Significant:",p_value<alpha)

There are 4 out of 23 days that gross conversion of experiment group > gross conversion of control group
p-value: 0.002599477767944336 , Statistically Significant: True


In [39]:
"""retention"""
rt_exp=[i/j for i,j in zip(payment_exp,enrolls_exp)]
rt_cont=[i/j for i,j in zip(payment_cont,enrolls_cont)]
rt_diff=sum([i>j for i,j in zip(rt_exp,rt_cont)])
days=len(rt_exp)
p_value=binom_test(rt_diff, n=days, p=0.5)

print('There are {} out of {} days that retention of experiment group > retention of control group'.format(rt_diff, days))
print("p-value:",p_value,", Statistically Significant:",p_value<alpha)


There are 13 out of 23 days that retention of experiment group > retention of control group
p-value: 0.6776394844055176 , Statistically Significant: False


In [40]:
"""net conversion"""
nc_exp=[i/j for i,j in zip(payment_exp,clicks_exp)]
nc_cont=[i/j for i,j in zip(payment_cont,clicks_cont)]
nc_diff=sum([i>j for i,j in zip(nc_exp,nc_cont)])
days=len(nc_exp)
p_value=binom_test(nc_diff, n=days, p=0.5)
print('There are {} out of {} days that net conversion of experiment group > net conversion of control group'.format(nc_diff, days))
print("p-value:",p_value,", Statistically Significant:",p_value<alpha)

There are 10 out of 23 days that net conversion of experiment group > net conversion of control group
p-value: 0.6776394844055176 , Statistically Significant: False
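
Because `binom_test` is no longer available in recent SciPy releases, it can be useful to have a dependency-free cross-check. The sketch below computes the exact two-sided sign-test p-value under p = 0.5 with the standard library only. `sign_test_p_value` is a hypothetical helper, and doubling the smaller tail is valid here only because the null distribution is symmetric.

```python
from math import comb

def sign_test_p_value(successes, n):
    """Exact two-sided binomial p-value under the null hypothesis p = 0.5."""
    k = min(successes, n - successes)  # size of the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print("gross conversion:", sign_test_p_value(4, 23))
print("retention:       ", sign_test_p_value(13, 23))
print("net conversion:  ", sign_test_p_value(10, 23))
```

Retention and net conversion give the same p-value because 13 and 10 successes out of 23 are mirror images of each other around the midpoint.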


# 8. Result Summary

1. Give a good justification for the choice of whether to use the Bonferroni correction.
2. Give well-reasoned and plausible explanations for each discrepancy between the effect size tests and the sign tests.

# 9. Recommendation

1. Make a recommendation that is well reasoned and supported by the data.

# 10. Follow-up Experiment

1. Propose a plausible experiment that would be worth testing, with a clearly stated hypothesis.
2. Choose metrics that are sufficient to evaluate the hypothesis of the experiment, possible to measure under most infrastructures, and well supported by reasoning.
3. Choose a reasonable unit of diversion and give good support for the choice.