# P7: Design an A/B Test - Free Trial Screener

In [24]:
# __future__ imports must come before any other import in the cell
from __future__ import division

import numpy as np
import pandas as pd

# 1. Experiment Design

At the time of this experiment, Udacity courses currently have two options on the home page: "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free. At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This [screenshot](https://drive.google.com/file/d/0ByAfiG8HpNUMakVrS0s4cGN2TjQ/view) shows what the experiment looks like.

The hypothesis was that this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.

The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

## a. Metric Choice

**List which metrics you will use as invariant metrics and evaluation metrics here. Here is the list of metrics:**

- **Number of cookies:** That is, number of unique cookies to view the course overview page. (dmin=3000)
- **Number of user-ids:** That is, number of users who enroll in the free trial. (dmin=50)
- **Number of clicks:** That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
- **Click-through-probability:** That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
- **Gross conversion:** That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
- **Retention:** That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
- **Net conversion:** That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)

### Invariant metrics
Invariant metrics are metrics that shouldn't change between the control group and experiment
group. Because we ask students "how much time do you have available to devote to the course?"
only after they click the "Start free trial" button, the metrics **number of cookies, number of
clicks and click-through-probability** shouldn't change between the control and experiment
groups: they are measured before the screener is shown. The other
metrics are measured after the question "how much time do you have available to devote to the
course?" is shown, so we can expect number of user-ids, gross conversion, retention and net
conversion to vary between the control and experiment groups.
Later, during our sanity check, we will look only at the invariant metrics **number of cookies,
number of clicks and click-through-probability**.

### Evaluation metrics
In the experiment we want to analyze if the question "how much time do you have available to
devote to the course?" has an impact on the number of people who decide to checkout and
enroll in the free trial.

We can perform the following hypothesis testing on the **gross conversion** metric:
The hypothesis would be that the gross conversion for the experiment group is lower than the
gross conversion for the control group because the question sets clearer expectations for
students upfront in terms of weekly investment, thus reducing the share of students enrolling in
the free trial by at least the practical significance boundary of 0.01 (one percentage point).

We also want to use the **net conversion metric** as an evaluation metric to see if the number
of students who pass the free trial is not reduced significantly as a result of asking this question.

The hypothesis would be that the net conversion metric remains the same between the control
and experiment groups because the number of people who pass the free trial is the same
between the control and experiment groups. Those who would have left during the free trial
don't enroll in the free trial in the first place, because the expectations are set clearly. For this
reason, the lower bound of the confidence interval for the difference in net conversion
shouldn't be lower than -0.0075.

We could look at the retention metric; however, it's simply the ratio of the net
conversion to the gross conversion. For this reason we will only look at the **net conversion
and the gross conversion metrics**.
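Because gross conversion and net conversion share the same denominator (clicks), retention can be recovered by dividing net conversion by gross conversion. The snippet below is a minimal sketch of that relationship, written for Python 3 and using the baseline probabilities that appear later in this report; the variable names are illustrative and not part of the original notebook.

```python
# Baseline probabilities from the baseline table used in this report.
p_enroll_given_click = 0.20625    # gross conversion
p_pay_given_click = 0.1093125     # net conversion

# Retention = P(payment | enrollment) = net conversion / gross conversion.
retention = p_pay_given_click / p_enroll_given_click
print("Implied baseline retention: %.4f" % retention)
```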

## b. Measuring Standard Deviation

Let's make an analytic estimate of the standard deviation for the gross conversion and net conversion metrics, given a sample size of 5000 cookies visiting the course overview page.

We want to measure if the practical significance boundary is realistic with the variability of these metrics. 

For the metrics gross conversion and net conversion, the unit of analysis (denominator of the metric) and unit of diversion are the same: cookies. So we can use the analytical estimate instead of an empirical estimate. If we had used retention as an evaluation metric, we would have had to compute the empirical estimate. When the unit of diversion and unit of analysis are not the same, as in the case of retention, the empirical variability tends to be much higher than the analytical variability.

To compute the standard deviation of the gross conversion and net conversion metrics, we use the following [table including the baseline values](https://docs.google.com/spreadsheets/d/1MYNUtC47Pg8hdoCjOXaHqF-thheGpUshrFA21BAJnNc/edit#gid=0).

In [25]:
nb_cookies_view_per_day=40000
nb_cookies_click_free_trial=3200
Pr_enro_click=0.20625
Pr_pay_click=0.1093125

#sample size
nb_cookies_visit_course=5000

To compute the standard deviation for the gross conversion and net conversion metrics, we need to scale the number of unique cookies that click "Start free trial" per day in the table (3200) to the sample size of 5000 cookies. Because our metrics are probabilities (of success), we also assume that they follow a binomial distribution.
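Written out, the estimate is the binomial standard error, where $p$ is the baseline probability of the metric and $N$ is the number of clicks expected in a sample of 5000 pageviews (the symbols $p$ and $N$ are notation introduced here for illustration):

$$
N = 5000 \times \frac{3200}{40000} = 400, \qquad SD = \sqrt{\frac{p\,(1-p)}{N}}
$$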

In [26]:
SD_gross_conversion=np.sqrt(Pr_enro_click*(1-Pr_enro_click)/(nb_cookies_click_free_trial*nb_cookies_visit_course/nb_cookies_view_per_day))
print "Standard deviation for the gross conversion metric :",SD_gross_conversion
SD_net_conversion=np.sqrt(Pr_pay_click*(1-Pr_pay_click)/(nb_cookies_click_free_trial*nb_cookies_visit_course/nb_cookies_view_per_day))
print "Standard deviation for the net conversion metric :", SD_net_conversion

Standard deviation for the gross conversion metric : 0.020230604137
Standard deviation for the net conversion metric : 0.0156015445825


## c. Sizing

We won't use the **[Bonferroni correction](https://en.wikipedia.org/wiki/Bonferroni_correction)** because, to launch the change, both metrics, gross conversion and net conversion, must match the hypothesis. This differs from the case where any single significant metric would be enough to launch; in that case, we would have used the Bonferroni correction.

### i. Number of Samples vs. Power
**Give the number of pageviews you will need to power your experiment appropriately**

To get the number of pageviews we need for the experiment, we will use the [online calculator](http://www.evanmiller.org/ab-testing/sample-size.html).

#### Gross conversion

For the gross conversion metric, we have to enter the following parameters into the calculator:


- Baseline conversion rate=0.20625
- Minimum Detectable Effect = 0.01
- alpha = 0.05
- beta = 0.2

The calculator returns a sample_size = 25835. Because we want the total number of page views, we need to multiply the sample size returned by the calculator by two (one sample per group) and divide it by the click-through-probability on "Start free trial".

In [27]:
sample_size=25835
print "Total number of page views :",2*sample_size/(3200/40000)

Total number of page views : 645875.0


#### Net conversion

Let's do the same calculation for the net conversion metric.

Probability of payment, given click=0.1093125

- Baseline conversion rate=0.1093125
- Minimum Detectable Effect = 0.0075
- alpha = 0.05
- beta = 0.2

The calculator returns a sample_size = 27413.

In [28]:
sample_size=27413
print "Total number of page views:",2*sample_size/(3200/40000)

Total number of page views: 685325.0


In order to power the experiment appropriately for these two metrics we take the largest number of page views calculated for each metric. So we need to collect 685,325 pageviews.
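The sizing can also be approximated offline. The sketch below, written for Python 3, uses a standard normal-approximation formula for comparing two proportions; it is an assumption that this matches the online calculator's internals, so treat the results as a cross-check that should land close to the two sample sizes quoted above rather than as an exact reproduction. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_baseline, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group sample size for a two-sided test on two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    p_alt = p_baseline + d_min
    # Spread under the null (both groups at baseline) and under the alternative.
    sd_null = math.sqrt(2 * p_baseline * (1 - p_baseline))
    sd_alt = math.sqrt(p_baseline * (1 - p_baseline) + p_alt * (1 - p_alt))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / d_min ** 2


def pageviews_for(clicks_per_group, click_through_probability=3200 / 40000):
    """Two groups are needed, and only a fraction of pageviews lead to a click."""
    return 2 * clicks_per_group / click_through_probability


n_gross = sample_size_per_group(0.20625, 0.01)
n_net = sample_size_per_group(0.1093125, 0.0075)
print("Clicks per group, gross conversion: %d" % round(n_gross))
print("Clicks per group, net conversion: %d" % round(n_net))
print("Pageviews to collect: %d" % round(pageviews_for(max(n_gross, n_net))))
```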

### ii. Duration vs. Exposure

We will run the experiment on all traffic. It shouldn't be a risky test, as it's only a small popup
shown after selecting the free trial.

If we run the experiment on all traffic, it will take **18 days** to collect 685,325 pageviews. 18 days
is already a long experiment, and the period covers a mix of weekends and weekdays. That
confirms that we should run the experiment on all traffic.

In [29]:
print "Number of days:",685325/nb_cookies_view_per_day/1.0

Number of days: 17.133125


We will run the experiment for 18 days.
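To see how the duration depends on exposure, here is a small Python 3 sketch that rounds up to whole days and compares full traffic with a hypothetical 50% exposure. The helper name and the 50% scenario are illustrative assumptions, not part of the original analysis.

```python
import math

total_pageviews = 685325
pageviews_per_day = 40000


def days_required(pageviews, daily_pageviews, traffic_fraction=1.0):
    """Whole days needed when only `traffic_fraction` of daily traffic is diverted."""
    return int(math.ceil(pageviews / (daily_pageviews * traffic_fraction)))


days_full = days_required(total_pageviews, pageviews_per_day)
days_half = days_required(total_pageviews, pageviews_per_day, traffic_fraction=0.5)
print("Days at 100%% of traffic: %d" % days_full)
print("Days at 50%% of traffic: %d" % days_half)
```

Halving the exposure roughly doubles the duration, which supports diverting all traffic for a low-risk change like this one.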

# 2. Experiment Analysis

In [30]:
control=pd.read_csv("Results_control_group.csv")
experiment=pd.read_csv("Results_experiment_group.csv")

In [31]:
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134,70
1,"Sun, Oct 12",9102,779,147,70
2,"Mon, Oct 13",10511,909,167,95
3,"Tue, Oct 14",9871,836,156,105
4,"Wed, Oct 15",10014,837,163,64


In [32]:
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105,34
1,"Sun, Oct 12",9288,785,116,91
2,"Mon, Oct 13",10480,884,145,79
3,"Tue, Oct 14",9867,827,138,92
4,"Wed, Oct 15",9793,832,140,94


The meaning of each column is:

**Pageviews:** Number of unique cookies to view the course overview page that day.

**Clicks:** Number of unique cookies to click the "Start free trial" button on the course overview page that day.

**Enrollments:** Number of user-ids to enroll in the free trial that day.

**Payments:** Number of user-ids who enrolled on that day to remain enrolled for 14 days and thus make a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

## a. Sanity Checks

For each invariant metric, we want to give the 95% confidence interval for the value we expect to observe, the actual observed value, and whether the metric passes the sanity check.

#### Number of cookies

We will use the following method for the sanity check. Each cookie is randomly assigned to the control group and experiment group with probability 0.5.

1. Compute the standard deviation of a binomial with probability of success 0.5
2. Multiply the standard deviation by the z-score to get the margin of error
3. Compute the confidence interval around 0.5
4. Check whether the observed fraction is within the interval
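The four steps above can be wrapped in one reusable helper. This is a minimal Python 3 sketch; the function name is hypothetical and the counts in the usage example are made up purely for illustration.

```python
import math


def count_sanity_check(n_control, n_experiment, p_expected=0.5, z=1.96):
    """Return (lower, upper, observed, passed) for a count-based invariant metric."""
    n_total = n_control + n_experiment
    # Step 1: standard deviation of a binomial proportion with p = p_expected.
    sd = math.sqrt(p_expected * (1 - p_expected) / n_total)
    # Step 2: margin of error.
    margin = z * sd
    # Step 3: confidence interval around the expected fraction.
    lower, upper = p_expected - margin, p_expected + margin
    # Step 4: compare the observed control fraction with the interval.
    observed = n_control / float(n_total)
    return lower, upper, observed, lower <= observed <= upper


# Illustrative counts only (not experiment data).
lower, upper, observed, passed = count_sanity_check(5050, 4950)
print("CI: [%.4f, %.4f], observed: %.4f, passed: %s" % (lower, upper, observed, passed))
```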

In [33]:
nb_cookies_page_control=sum(control.Pageviews)
print "Number of unique cookies to view page in the control group:", nb_cookies_page_control
nb_cookies_page_experiment=sum(experiment.Pageviews)
print "Number of unique cookies to view page in the experiment group:", nb_cookies_page_experiment
print "There is a difference of:",nb_cookies_page_control-nb_cookies_page_experiment

Number of unique cookies to view page in the control group: 345543
Number of unique cookies to view page in the experiment group: 344660
There is a difference of: 883


In [34]:
sd=np.sqrt((0.5*0.5)/(nb_cookies_page_control+nb_cookies_page_experiment))
print "Standard deviation:",sd
m=sd*1.96
print "Margin of error:",m
lower_bound=0.5-m
print "Lower bound of the 95% confidence interval:",lower_bound
upper_bound=0.5+m
print "Upper bound of the 95% confidence interval:",upper_bound
p_observed=nb_cookies_page_control/(nb_cookies_page_control+nb_cookies_page_experiment)
print "Observed fraction:",p_observed

Standard deviation: 0.000601840740294
Margin of error: 0.00117960785098
Lower bound of the 95% confidence interval: 0.498820392149
Upper bound of the 95% confidence interval: 0.501179607851
Observed fraction: 0.500639666881


The observed fraction is included in the confidence interval, so the invariant metric, the number of cookies, passes the sanity check.

#### Number of clicks

In [35]:
nb_cookies_clicks_control=sum(control.Clicks)
print "Number of unique cookies to click 'start free trial' button in the control group:", nb_cookies_clicks_control
nb_cookies_clicks_experiment=sum(experiment.Clicks)
print "Number of unique cookies to click 'start free trial' button in the experiment group:", nb_cookies_clicks_experiment
print "There is a difference of:",nb_cookies_clicks_control-nb_cookies_clicks_experiment

Number of unique cookies to click 'start free trial' button in the control group: 28378
Number of unique cookies to click 'start free trial' button in the experiment group: 28325
There is a difference of: 53


In [36]:
sd=np.sqrt((0.5*0.5)/(nb_cookies_clicks_control+nb_cookies_clicks_experiment))
print "Standard deviation:",sd
m=sd*1.96
print "Margin of error:",m
lower_bound=0.5-m
print "Lower bound of the confidence interval:",lower_bound
upper_bound=0.5+m
print "Upper bound of the confidence interval:",upper_bound
p_observed=nb_cookies_clicks_control/(nb_cookies_clicks_control+nb_cookies_clicks_experiment)
print "Observed fraction:",p_observed

Standard deviation: 0.0020997470797
Margin of error: 0.00411550427621
Lower bound of the confidence interval: 0.495884495724
Upper bound of the confidence interval: 0.504115504276
Observed fraction: 0.500467347407


The observed fraction is included in the confidence interval, so the invariant metric, the number of clicks, passes the sanity check.

#### Click through probability

In [38]:
p=nb_cookies_clicks_control/(nb_cookies_page_control)

sd=np.sqrt((p*(1-p))/(nb_cookies_page_control))
print "Standard Deviation:",sd
m=sd*1.96
lower_bound=p-m
upper_bound=p+m
print "Lower bound:",lower_bound
print "Upper bound:",upper_bound

p_experiment=nb_cookies_clicks_experiment/(nb_cookies_page_experiment)
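print "Click-through probability observed in the experiment group:",p_experiment

Standard Deviation: 0.000467068276555
Lower bound: 0.0812103597525
Upper bound: 0.0830412673966
Click-through probability observed in the experiment group: 0.0821824406662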


The click-through probability observed in the experiment group is included in the confidence interval, so the invariant metric, the click-through probability, passes the sanity check.
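A common alternative for a probability metric is to test the difference between the two groups with a pooled standard error and check that it is compatible with zero. The Python 3 sketch below is that alternative, not the method used in the cell above; the function name is hypothetical, and the counts are the pageview and click totals reported earlier in this section.

```python
import math


def proportion_difference_check(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Check whether the observed difference in proportions is compatible with 0."""
    p_pool = (x_cont + x_exp) / float(n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / n_cont + 1.0 / n_exp))
    margin = z * se_pool
    diff = x_exp / float(n_exp) - x_cont / float(n_cont)
    # Under the null hypothesis the difference should fall within [-margin, margin].
    return diff, margin, -margin <= diff <= margin


diff, margin, passed = proportion_difference_check(28378, 345543, 28325, 344660)
print("Difference: %.6f, margin of error: %.6f, passed: %s" % (diff, margin, passed))
```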

## b. Result Analysis

### i. Effect Size Tests

We only keep the days of the experiment that include the number of enrollments and payments. Because the free trial lasts 14 days, the number of enrollments and payments is not yet available after November 2nd.

In [39]:
# keep only the days for which enrollments (and payments) are available
control_sub=control[control.Enrollments.notnull()]
experiment_sub=experiment[experiment.Enrollments.notnull()]

#### Gross conversion

In [40]:
N_cont=sum(control_sub.Clicks)
X_cont=sum(control_sub.Enrollments)

N_exp=sum(experiment_sub.Clicks)
X_exp=sum(experiment_sub.Enrollments)

print "Gross conversion observed for control group:",X_cont/N_cont
print "Gross conversion observed for experiment group:",X_exp/N_exp

p_pool=(X_cont+X_exp)/(N_cont+N_exp)
print "Pool probability:",p_pool

SE_pool=np.sqrt(p_pool*(1-p_pool)*(1/N_cont+1/N_exp))
print "Pool standard error:",SE_pool

m=1.96*SE_pool
print "Margin of error:",m

d_observed=X_exp/N_exp-X_cont/N_cont
print "Difference observed between control and experiment group",d_observed

lower_bound=d_observed-m
print "Lower bound of the 95% confidence interval:",lower_bound
upper_bound=d_observed+m
print "Upper bound of the 95% confidence interval:", upper_bound

Gross conversion observed for control group: 0.218874689181
Gross conversion observed for experiment group: 0.1983198146
Pool probability: 0.208607067404
Pool standard error: 0.00437167538523
Margin of error: 0.00856848375504
Difference observed between control and experiment group -0.0205548745804
Lower bound of the 95% confidence interval: -0.0291233583354
Upper bound of the 95% confidence interval: -0.0119863908253


For the gross conversion metric, the confidence interval doesn't include 0, so the result is statistically significant. The whole confidence interval [-0.0291, -0.0120] lies below the negative practical significance boundary (-0.01), so the result is also practically significant.

#### Net conversion

In [41]:
N_cont=sum(control_sub.Clicks)
X_cont=sum(control_sub.Payments)

N_exp=sum(experiment_sub.Clicks)
X_exp=sum(experiment_sub.Payments)

print "Net conversion observed for control group:",X_cont/N_cont
print "Net conversion observed for experiment group:",X_exp/N_exp

p_pool=(X_cont+X_exp)/(N_cont+N_exp)
print "Pool probability:",p_pool

SE_pool=np.sqrt(p_pool*(1-p_pool)*(1/N_cont+1/N_exp))
print "Pool standard error:",SE_pool

m=1.96*SE_pool
print "Margin of error:",m

d_observed=X_exp/N_exp-X_cont/N_cont
print "Difference observed between control and experiment group",d_observed

lower_bound=d_observed-m
print "Lower bound of the 95% confidence interval:",lower_bound
upper_bound=d_observed+m
print "Upper bound of the 95% confidence interval:", upper_bound

Net conversion observed for control group: 0.117562019314
Net conversion observed for experiment group: 0.11268829664
Pool probability: 0.115127485312
Pool standard error: 0.00343413351293
Margin of error: 0.00673090168535
Difference observed between control and experiment group -0.00487372267454
Lower bound of the 95% confidence interval: -0.0116046243599
Upper bound of the 95% confidence interval: 0.0018571790108


For the net conversion metric, the confidence interval [-0.0116, 0.0019] includes 0, so the result is not statistically significant. It also includes the negative practical significance boundary (-0.0075), so we cannot rule out a practically significant decrease.
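The reading of both intervals follows the same two rules, which can be sketched as a small Python 3 helper. The function name is hypothetical, and the bounds passed in are the rounded confidence limits from the outputs above.

```python
def assess_interval(lower, upper, d_min):
    """Classify a confidence interval for a difference against 0 and +/- d_min."""
    # Statistically significant when the interval excludes 0.
    statistically_significant = not (lower <= 0 <= upper)
    # Practically significant only when the whole interval lies beyond +/- d_min.
    practically_significant = lower > d_min or upper < -d_min
    return statistically_significant, practically_significant


gross = assess_interval(-0.0291, -0.0120, d_min=0.01)
net = assess_interval(-0.0116, 0.0019, d_min=0.0075)
print("Gross conversion (statistical, practical): %s" % (gross,))
print("Net conversion (statistical, practical): %s" % (net,))
```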

### ii. Sign Tests

#### Gross conversion

We create a pandas DataFrame with the gross conversion for the control and experiment groups. We also create a flag 'sign_test' that indicates whether the gross conversion for the experiment group is higher than the gross conversion for the control group.

In [42]:
gross_conversion_control=control_sub.Enrollments/control_sub.Clicks
gross_conversion_experiment=experiment_sub.Enrollments/experiment_sub.Clicks

sign_test_gross_conversion=pd.concat([gross_conversion_control,gross_conversion_experiment],axis=1)
sign_test_gross_conversion.head()
sign_test_gross_conversion.columns=['gross_conversion_control','gross_conversion_experiment']
sign_test_gross_conversion['sign_test']=sign_test_gross_conversion.gross_conversion_control<sign_test_gross_conversion.gross_conversion_experiment

In [43]:
sign_test_gross_conversion.head()

Unnamed: 0,gross_conversion_control,gross_conversion_experiment,sign_test
0,0.195051,0.153061,False
1,0.188703,0.147771,False
2,0.183718,0.164027,False
3,0.186603,0.166868,False
4,0.194743,0.168269,False


In [44]:
nb_days=len(sign_test_gross_conversion)
print "Number of days", nb_days
nb_days_positive_sign=sum(sign_test_gross_conversion.sign_test==True)
print "Number of days the gross conversion rate is higher for the experiment \
group than the control group:",nb_days_positive_sign

Number of days 23
Number of days the gross conversion rate is higher for the experiment group than the control group: 4


What is the chance of this happening randomly? If there is no difference between the control group and the experiment group, there is a 50% chance of a positive change on each day.

We can use this online [calculator](http://graphpad.com/quickcalcs/binomial1.cfm).

In the online calculator we enter:

Number of "successes" you observed = 4

Number of trials or experiments = 23

Probability = 0.5

The two-tail P-value returned by the calculator is equal to 0.0026, so the statistical test is significant. It means that these results are unlikely to happen by chance. The sign test agrees with the effect size test.
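The same p-value can be computed without the online calculator. The Python 3 sketch below doubles the smaller binomial tail, which is a valid two-tailed p-value here only because the null probability is 0.5 and the distribution is therefore symmetric; the function name is hypothetical.

```python
from math import comb


def sign_test_p_value(successes, trials):
    """Two-tailed binomial p-value under a fair-coin null (p = 0.5)."""
    k = min(successes, trials - successes)
    # Probability of observing k or fewer successes out of `trials` fair coin flips.
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2.0 ** trials
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


p_value = sign_test_p_value(4, 23)
print("Two-tailed p-value: %.4f" % p_value)
```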

#### Net conversion

We use the same method as for the gross conversion rate.

We create a pandas DataFrame with the net conversion for the control and experiment groups. We also create a flag 'sign_test' that indicates whether the net conversion for the experiment group is higher than the net conversion for the control group.

In [45]:
net_conversion_control=control_sub.Payments/control_sub.Clicks
net_conversion_experiment=experiment_sub.Payments/experiment_sub.Clicks

sign_test_net_conversion=pd.concat([net_conversion_control,net_conversion_experiment],axis=1)
sign_test_net_conversion.columns=['net_conversion_control','net_conversion_experiment']
sign_test_net_conversion['sign_test']=sign_test_net_conversion.net_conversion_control<sign_test_net_conversion.net_conversion_experiment

In [46]:
nb_days=len(sign_test_net_conversion)
print "Number of days", nb_days
nb_days_positive_sign=sum(sign_test_net_conversion.sign_test==True)
print "Number of days the net conversion rate is higher for the experiment \
group than the control group:",nb_days_positive_sign

Number of days 23
Number of days the net conversion rate is higher for the experiment group than the control group: 10


In the online calculator we enter:

Number of "successes" you observed =10

Number of trials or experiments =23

Probability = 0.5

The two-tail P-value returned by the calculator is equal to 0.6776, so the statistical test is not significant. It means that these results could plausibly happen by chance. The sign test agrees with the effect size test.

### iii. Summary

We can conclude that we have been able to decrease the gross conversion by setting clearer expectations for students upfront. However, for the net conversion, the confidence interval [-0.0116, 0.0019] includes the negative of the practical significance boundary (-0.0075). So there is a risk that the net conversion metric decreases by an amount that would matter for the business and negatively impact revenue. For this reason, we shouldn't launch this change.

# 3. Follow-Up Experiment

In this experiment, we will test whether an email sent to students enrolled in the free trial, showcasing a pertinent forum discussion from the course they are taking, can improve their retention. The hypothesis would be that students who are struggling would find useful information that would help them better understand concepts and ideas in the course, or additional explanations on exercises. This should reduce their frustration and increase their motivation.

Our evaluation metric would be retention: the number of user-ids to remain enrolled past the
14-day boundary (and thus make at least one payment) divided by the number of user-ids to
complete checkout. For the result to be practically significant, the increase in retention would
need to exceed a practical significance boundary of 0.01.

Our unit of diversion would be the user-id because we only care about students who have already
enrolled in the free trial.

### Resources
- Design and A/B test Udacity's course 