# A/B Testing Project
## Udacity free trial experiment



## Experiment Overview: Free Trial Screener

At the time of this experiment, Udacity courses had two options on the course overview page: "start free trial" and "access course materials". A user who clicks "start free trial" is asked to enter credit card information and is then enrolled in the paid version of the course, free of charge for the 14-day trial period. After 14 days, the user is charged automatically unless they have cancelled the enrollment.
The second option, "access course materials", does not require credit card information and gives full access to the course materials and quizzes, excluding project feedback, the coaching service and the verified certificate.
For this experiment Udacity tested a change in which users who clicked "start free trial" were asked how much time they were willing to devote to the course. If the answer was 5 or more hours per week, they were taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message appeared explaining that a greater time commitment is needed to succeed and suggesting that they might prefer to access the free content.
The hypothesis was that this change might set clearer expectations for students and reduce the number of frustrated students who leave the 14-day trial because they do not have enough time to work on the course. The change should also not significantly reduce the number of students who continue past the free trial and eventually complete the course.
Concretely, we expect gross conversion in the experiment group to be lower than in the control group, by a margin that is both statistically and practically significant. That would indicate the change is effective in communicating to prospective students the time commitment needed to be successful in the course. We also expect net conversion in the experiment group not to be significantly lower than in the control group, so ideally the difference in net conversion is neither statistically nor practically significant.
The unit of diversion is a cookie, although if the student enrolls in the free trial, a user-id is generated and used to track the user from that point forward. The same user-id cannot enroll in the free trial twice. For users who do not enroll, the user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.



## Metric Choice



Here is the list of metrics chosen to be used:

- **Number of Cookies**: The number of unique cookies to visit the course overview page. The cookie is the unit of diversion, and cookies are counted before users are exposed to the change, so an equal number of cookies is expected in each group. It is a good invariant metric because it is directly randomized between the experiment and control groups, and a poor evaluation metric for the same reason: it is not expected to differ between them.

- **Number of Clicks**: The number of unique cookies to click the "Start free trial" button. The click happens before the user is exposed to the screener, so the number of clicks is expected to be the same in the control and experiment groups. That makes it a good invariant metric, and I did not choose it as an evaluation metric because it is not expected to differ between the groups.

- **Click-through Probability**: The number of unique cookies to click the "Start free trial" button divided by the number of unique cookies to view the course overview page. Both counts are taken before the screener is shown, so the ratio is expected to remain unchanged for both groups. It is therefore a good invariant metric and a poor evaluation metric.

- **Gross Conversion**: The number of user-ids to enroll in the free trial divided by the number of unique cookies to click the "Start free trial" button. Enrollment is tracked by user-id. This was chosen as an evaluation metric because it measures how many students enroll after answering the question about committing 5 or more hours per week, which is exactly the behavior the experiment is trying to change. The number of enrolled students is expected to differ between the control and experiment groups, so it is a good evaluation metric and a poor invariant metric.

- **Net Conversion**: The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. This is also tracked by user-id. It was chosen as an evaluation metric because it measures how many students pay for the course after the 14-day free trial. The number of paying user-ids can differ between the control and experiment groups, so it is a good evaluation metric and cannot be used as an invariant metric.

#### Unused Metric

- **Number of user-ids**: The number of users to enroll in the free trial. User-ids are only tracked after a student enrolls in the course, so this is not a good invariant metric: it is counted after the click occurs and will therefore be affected by the change. It is also a weak evaluation metric because it is a raw count that is not normalized by the number of clicks, so the two groups could differ simply because of their size.

- **Retention**: The number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of user-ids to complete checkout. This metric tells us exactly what fraction of enrolled users stay past the 14-day trial, but it was not chosen as an evaluation metric for three reasons: it would make the experiment take too long, its unit of analysis (user-id) differs from the unit of diversion (cookie), which can make the analytical variance estimate inaccurate, and Gross Conversion and Net Conversion are sufficient for testing our hypothesis.
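The metric definitions above can be written down as a few ratio functions. This is only a minimal sketch: the function names are hypothetical and do not appear elsewhere in the notebook, and the example counts are the daily baseline values (40000 pageviews, 3200 clicks, 660 enrollments) reported in the baseline table of the Load Data section.

```python
# Hypothetical helper names, used here only to illustrate the metric definitions.

def click_through_probability(clicks, pageviews):
    """Cookies that click 'Start free trial' / cookies that view the overview page."""
    return clicks / pageviews


def gross_conversion(enrollments, clicks):
    """User-ids that enroll in the free trial / cookies that click 'Start free trial'."""
    return enrollments / clicks


def net_conversion(payments, clicks):
    """User-ids that stay past 14 days (and pay) / cookies that click 'Start free trial'."""
    return payments / clicks


def retention(payments, enrollments):
    """User-ids that stay past 14 days (and pay) / user-ids that enroll."""
    return payments / enrollments


# Daily baseline counts taken from the baseline table
pageviews, clicks, enrollments = 40000, 3200, 660
print(click_through_probability(clicks, pageviews))  # 0.08
print(gross_conversion(enrollments, clicks))         # 0.20625
```

Note that only Retention has user-ids in its denominator; the other three ratios are all per cookie, which is why they line up with the unit of diversion.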



## Load Data

In [132]:
import numpy as np
import pandas as pd
import math


In [133]:
# Baseline data
xlsx = pd.ExcelFile('Final Project Baseline Values.xlsx')
baseline = pd.read_excel(xlsx, 'Sheet1', header=None, names = ['full_metric_name','value'])
baseline.set_index([['pageviews', 'clicks', 'enrollments', 'ctp', 'p_enroll_click', 'p_pay_enroll', 'p_pay_click']], inplace=True)
baseline

| | full_metric_name | value |
| -------------- | ---------------------------------------------------- | --------: |
| pageviews | Unique cookies to view course overview page pe... | 40000.0 |
| clicks | Unique cookies to click "Start free trial" per... | 3200.0 |
| enrollments | Enrollments per day: | 660.0 |
| ctp | Click-through-probability on "Start free trial": | 0.08 |
| p_enroll_click | Probability of enrolling, given click: | 0.20625 |
| p_pay_enroll | Probability of payment, given enroll: | 0.53 |
| p_pay_click | Probability of payment, given click | 0.109313 |


In [134]:
# import experiment data
df_control = pd.read_csv("FP_Control.csv")
df_experiment = pd.read_csv("FP_Experiment.csv")

In [135]:
# Scale factor for a sample of 5000 pageviews
scale_factor = 5000/baseline.loc['pageviews', 'value']
scale_factor

0.125

In [136]:
# Calculating the sample value for each metric

values = baseline['value'].tolist()

sample_vals = []
for i in range(0, len(values)):
    if i < 3:
        sample_vals.append(values[i]*scale_factor)
    else:
        sample_vals.append(values[i])
        
baseline['sample'] = sample_vals
baseline

| | full_metric_name | value | sample |
| -------------- | ---------------------------------------------------- | --------: | --------: |
| pageviews | Unique cookies to view course overview page pe... | 40000.0 | 5000.0 |
| clicks | Unique cookies to click "Start free trial" per... | 3200.0 | 400.0 |
| enrollments | Enrollments per day: | 660.0 | 82.5 |
| ctp | Click-through-probability on "Start free trial": | 0.08 | 0.08 |
| p_enroll_click | Probability of enrolling, given click: | 0.20625 | 0.20625 |
| p_pay_enroll | Probability of payment, given enroll: | 0.53 | 0.53 |
| p_pay_click | Probability of payment, given click | 0.109313 | 0.109313 |


In [137]:
# Standard deviation for Gross Conversion (n = clicks in a sample of 5000 pageviews)
SDGC = round(np.sqrt((.206250*(1-.206250))/(5000*0.08)),4)
print("Standard deviation for the gross conversion metric :", SDGC)

# Standard deviation for Retention (n = enrollments in a sample of 5000 pageviews)
SDR = round(np.sqrt((.53*(1-.53))/(5000*0.0165)),4)
print("Standard deviation for the Retention metric :", SDR)

# Standard deviation for Net Conversion (n = clicks in a sample of 5000 pageviews)
SDNC = round(np.sqrt((.109313*(1-.109313))/(5000*0.08)),4)
print("Standard deviation for the net conversion  metric :", SDNC)

Standard deviation for the gross conversion metric : 0.0202
Standard deviation for the Retention metric : 0.0549
Standard deviation for the net conversion  metric : 0.0156


### Measuring Standard Deviation

I used a normal approximation to the binomial distribution, which is reasonable because the sample sizes are suitably large. The standard deviation of each of the three metrics above is calculated as:

$$\sqrt{{p(1-p)} \over {n}}$$

where $p$ is the baseline probability of success (enrolling or paying) and $n$ is the number of units in the metric's denominator.
For a sample of 5000 pageviews, $n$ is 400 clicks for Gross Conversion and Net Conversion and 82.5 enrollments for Retention. The results are presented in the following table:



| Evaluation Metric | Standard Deviation |
| ----------------- |:------------------:| 
| Gross Conversion  |     0.0202         |
| Retention         |     0.0549         |
| Net Conversion    |     0.0156         |

Analytical and empirical standard deviations tend to agree when the unit of analysis is the same as the unit of diversion. Gross Conversion and Net Conversion both use cookies that click the "Start free trial" button as their unit of analysis, which matches the cookie-based unit of diversion, so their analytical estimates should be reliable. Retention is analyzed per user-id, which differs from the unit of diversion, so its analytical estimate may be off. If Retention is ultimately used, its standard deviation should be calculated empirically.
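To make "calculated empirically" more concrete, here is a rough sketch of the mechanics using simulated samples. It is not an estimate from real Udacity data: the function name, the number of simulated samples and the seed are all assumptions, and the simulation treats every enrollment and payment as independent. Under that assumption the result should land close to the analytical 0.0549. With real data (for example from repeated A/A tests or a bootstrap over days) the value could be larger, because several cookies can belong to the same person.

```python
import numpy as np


def simulated_retention_sd(pageviews=5000, p_enroll=0.0165, p_pay=0.53,
                           n_sims=20000, seed=0):
    """Standard deviation of retention across simulated samples.

    Assumes independent binomial draws, which real traffic may not satisfy.
    """
    rng = np.random.default_rng(seed)
    # Enrollments per simulated sample of `pageviews` cookies
    enrollments = rng.binomial(pageviews, p_enroll, size=n_sims)
    # Payments among those enrollments
    payments = rng.binomial(enrollments, p_pay)
    # Guard against a sample with zero enrollments (extremely unlikely here)
    valid = enrollments > 0
    retention = payments[valid] / enrollments[valid]
    return retention.std(ddof=1)


sd = simulated_retention_sd()
print("Simulated standard deviation for Retention:", round(float(sd), 4))
```

The enrollment rate of 0.0165 per pageview is the baseline 660 enrollments divided by 40000 pageviews.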


## Sizing

### Number of Samples Given Power

I used [this calculator](http://www.evanmiller.org/ab-testing/sample-size.html), with an alpha of 0.05 and a beta of 0.2, to calculate the sample size for each metric:

#### **Gross Conversion**

* Baseline Conversion Rate: 0.20625
* Minimum Detectable Effect: 0.01
* sample size = 25,835
* total sample size = 51,670
* clicks/pageview: 3200/40000 = .08

**pageviews required = 645,875**


#### **Retention**

* Baseline Conversion Rate: 0.53
* Minimum Detectable Effect: 0.01
* sample size = 39,115
* total sample size = 78,230 
* enrollments/pageview: 660/40000 = .0165

**pageviews required = 4,741,212**


#### **Net Conversion**

* Baseline Conversion Rate: 0.1093125
* Minimum Detectable Effect: 0.0075
* sample size = 27,413
* total sample size = 54,826
* clicks/pageview: 3200/40000 = 0.08

**pageviews required = 685,325**
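For readers who want to check the per-group sample sizes without the online calculator, the sketch below uses a common normal-approximation formula for comparing two proportions. I am not claiming it is the calculator's exact implementation. The function name is hypothetical, and folding rates above 0.5 onto the lower half is an assumption on my part. With alpha = 0.05 and beta = 0.2, it returns values that round to the per-group sizes listed above for Gross Conversion and Net Conversion.

```python
from math import sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, min_effect, alpha=0.05, beta=0.2):
    """Approximate per-group sample size for detecting an absolute change in a rate."""
    # Assumption: fold rates above 0.5 onto the lower half (p -> 1 - p)
    p = min(baseline, 1 - baseline)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    sd_null = sqrt(2 * p * (1 - p))
    p_alt = p + min_effect
    sd_alt = sqrt(p * (1 - p) + p_alt * (1 - p_alt))
    return (z_alpha * sd_null + z_beta * sd_alt) ** 2 / min_effect ** 2


print(round(sample_size_per_group(0.20625, 0.01)))      # 25835 (Gross Conversion)
print(round(sample_size_per_group(0.1093125, 0.0075)))  # 27413 (Net Conversion)
```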

The largest requirement comes from Retention, so running the experiment with all three metrics would need 4,741,212 pageviews. That is a very large number considering how many people visit the site each day.

### Duration vs. Exposure

With 4,741,212 pageviews required and 40,000 pageviews a day, the experiment would take 119 days even with all traffic diverted to it, which is too long for an experiment like this. To shorten it, I excluded the Retention metric and kept Gross Conversion and Net Conversion. The larger of the two requirements is the 685,325 pageviews for Net Conversion, so the experiment takes only 18 days at 100% exposure, which is a reasonable time frame. Since no sensitive data is collected from users, I consider the experiment to carry little to no risk.
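The conversion from per-group sample size to pageviews and then to days is simple arithmetic, sketched below. The function names and the `exposure` parameter (the fraction of daily traffic diverted into the experiment) are illustrative assumptions; the 40,000 daily pageviews and the 0.08 clicks per pageview are the baseline values used above.

```python
from math import ceil


def pageviews_required(per_group_size, units_per_pageview, groups=2):
    """Pageviews needed so that every group collects `per_group_size` units."""
    return groups * per_group_size / units_per_pageview


def duration_days(pageviews, daily_pageviews=40000, exposure=1.0):
    """Whole days needed when a fraction `exposure` of daily traffic is used."""
    return ceil(pageviews / (daily_pageviews * exposure))


net_pageviews = pageviews_required(27413, 0.08)  # Net Conversion, 0.08 clicks per pageview
print(round(net_pageviews), duration_days(net_pageviews))  # 685325 18
print(duration_days(net_pageviews, exposure=0.5))          # 35
```

Halving the exposure roughly doubles the duration, which is the trade-off to weigh if the change were considered riskier.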

## Experiment Analysis


### Sanity Checks

Let's run a sanity check on each invariant metric to make sure the assumptions of the experiment hold.
We'll calculate the totals for cookies (pageviews), clicks, and click-through probability, which are expected to be split roughly evenly between the experiment and control groups. Then, for each invariant metric, we'll calculate the observed value, standard error, margin of error and confidence interval, and check whether the observed value passes the sanity check:


#### **Number of Cookies**

* Expected Value: 0.5
* Observed Value: 0.499360
* Standard Error: 0.000602
* Margin of Error: 0.001180
* CI Lower Bound: 0.498820
* CI Upper Bound: 0.501180

**Result: PASS**

#### **Number of Clicks**

* Expected Value: 0.5
* Observed Value: 0.499533
* Standard Error: 0.002100
* Margin of Error: 0.004116
* CI Lower Bound: 0.495884
* CI Upper Bound: 0.504116

**Result: PASS**

#### **Click-through Probability**

* Expected Value: 0.082126
* Observed Value: 0.082182
* Standard Error: 0.000467
* Margin of Error: 0.000915
* CI Lower Bound: 0.081267
* CI Upper Bound: 0.083098

**Result: PASS**

In [138]:
# Summarize data 
results = {"Control":pd.Series([df_control.Pageviews.sum(),df_control.Clicks.sum(),
                                  df_control.Enrollments.sum(),df_control.Payments.sum()],
                                  index = ["cookies","clicks","enrollments","payments"]),
           "Experiment":pd.Series([df_experiment.Pageviews.sum(),df_experiment.Clicks.sum(),
                               df_experiment.Enrollments.sum(),df_experiment.Payments.sum()],
                               index = ["cookies","clicks","enrollments","payments"])}
df_results = pd.DataFrame(results)
df_results

| | Control | Experiment |
| ----------- | --------: | ---------: |
| cookies | 345543.0 | 344660.0 |
| clicks | 28378.0 | 28325.0 |
| enrollments | 3785.0 | 3423.0 |
| payments | 2033.0 | 1945.0 |


In [139]:
# Calculate SE, ME, CI and Diff
# Under the null hypothesis each unit lands in either group with probability 0.5,
# so the standard error of the observed fraction is sqrt(0.5 * 0.5 / Total).
# Only the cookies and clicks rows are invariant checks; enrollments and payments
# are expected to move and are shown for reference only.
df_results['Total'] = df_results.Control + df_results.Experiment
df_results['Expected_value'] = 0.5
df_results['Observed_value'] = df_results.Experiment/df_results.Total
df_results['Standard_Error'] = np.sqrt(df_results.Expected_value * (1-df_results.Expected_value) / df_results.Total)
df_results['Marginal_Error'] = df_results.Standard_Error * 1.96
df_results['CI_lower'] = df_results.Expected_value - df_results.Marginal_Error
df_results['CI_upper'] = df_results.Expected_value + df_results.Marginal_Error
df_results["Pass_Sanity"] = df_results.apply(lambda x: (x.Observed_value > x.CI_lower) and (x.Observed_value < x.CI_upper),axis=1)
df_results['Diff'] = abs((df_results.Experiment - df_results.Control)/df_results.Total)

In [140]:
# Overview the results
df_results

| | Control | Experiment | Total | Expected_value | Observed_value | Standard_Error | Marginal_Error | CI_lower | CI_upper | Pass_Sanity | Diff |
| ----------- | --------: | ---------: | --------: | -------------: | -------------: | -------------: | -------------: | --------: | --------: | ----------- | --------: |
| cookies | 345543.0 | 344660.0 | 690203.0 | 0.5 | 0.499360 | 0.000602 | 0.001180 | 0.498820 | 0.501180 | True | 0.001279 |
| clicks | 28378.0 | 28325.0 | 56703.0 | 0.5 | 0.499533 | 0.002100 | 0.004116 | 0.495884 | 0.504116 | True | 0.000935 |
| enrollments | 3785.0 | 3423.0 | 7208.0 | 0.5 | 0.474889 | 0.005889 | 0.011543 | 0.488457 | 0.511543 | False | 0.050222 |
| payments | 2033.0 | 1945.0 | 3978.0 | 0.5 | 0.488939 | 0.007928 | 0.015538 | 0.484462 | 0.515538 | True | 0.022122 |


In [141]:
# click through probability 

control_cookies = df_results.loc['cookies','Control']
control_clicks = df_results.loc['clicks','Control']

exp_cookies = df_results.loc['cookies','Experiment']
exp_clicks = df_results.loc['clicks', 'Experiment']

## control value 
cont_v = control_clicks/control_cookies

## observed value 
obs_v = exp_clicks/exp_cookies

## Standard Error
se_v = np.sqrt((cont_v * (1- cont_v))/control_cookies)

## marginal error (z = 1.96)
me_v = se_v * 1.96

## Confidence Interval
upper_ClickProb = obs_v + me_v
lower_ClickProb = obs_v - me_v

## Sanity check: the control value should fall inside the interval around the observed value (it does)
print("control value:", cont_v)
print("observed value:", obs_v)
print("Standard Error:", se_v)
print("marginal error:", me_v)
print("Confidence Interval upper:", upper_ClickProb)
print("Confidence Interval lower :", lower_ClickProb)



control value: 0.08212581357457682
observed value: 0.08218244066616376
Standard Error: 0.0004670682765546443
marginal error: 0.0009154538220471028
Confidence Interval upper: 0.08309789448821087
Confidence Interval lower : 0.08126698684411665


## Result Analysis
### Effect Size Tests

In [142]:
# Keep only the days for which enrollments (and payments) were recorded
df_control_sub = df_control[df_control.Enrollments.notnull()]
df_experiment_sub = df_experiment[df_experiment.Enrollments.notnull()]

In [143]:
# Gross conversion = enrollments / clicks.
# Totals are cast to float so that 1/N is never an integer division.
N_cont = float(sum(df_control_sub.Clicks))
X_cont = float(sum(df_control_sub.Enrollments))

N_exp = float(sum(df_experiment_sub.Clicks))
X_exp = float(sum(df_experiment_sub.Enrollments))

print("Gross conversion observed for control group:", round(X_cont/N_cont, 6))
print("Gross conversion observed for experiment group:", round(X_exp/N_exp, 6))

p_pool = (X_cont+X_exp)/(N_cont+N_exp)
print("Pool probability:", round(p_pool, 6))

SE_pool = np.sqrt(p_pool*(1-p_pool)*(1/N_cont+1/N_exp))
print("Pool standard error:", round(SE_pool, 6))

m = 1.96*SE_pool
print("Margin of error:", round(m, 6))

d_observed = X_exp/N_exp-X_cont/N_cont
print("Difference observed between control and experiment group:", round(d_observed, 6))

lower_bound = d_observed-m
print("Lower bound of the 95% confidence interval:", round(lower_bound, 6))
upper_bound = d_observed+m
print("Upper bound of the 95% confidence interval:", round(upper_bound, 6))

Gross conversion observed for control group: 0.218875
Gross conversion observed for experiment group: 0.19832
Pool probability: 0.208607
Pool standard error: 0.004372
Margin of error: 0.008568
Difference observed between control and experiment group: -0.020555
Lower bound of the 95% confidence interval: -0.029123
Upper bound of the 95% confidence interval: -0.011986

In [144]:
# Net conversion = payments / clicks.
# Totals are cast to float so that 1/N is never an integer division.
N_cont = float(sum(df_control_sub.Clicks))
X_cont = float(sum(df_control_sub.Payments))

N_exp = float(sum(df_experiment_sub.Clicks))
X_exp = float(sum(df_experiment_sub.Payments))

print("Net conversion observed for control group:", round(X_cont/N_cont, 6))
print("Net conversion observed for experiment group:", round(X_exp/N_exp, 6))

p_pool = (X_cont+X_exp)/(N_cont+N_exp)
print("Pool probability:", round(p_pool, 6))

SE_pool = np.sqrt(p_pool*(1-p_pool)*(1/N_cont+1/N_exp))
print("Pool standard error:", round(SE_pool, 6))

m = 1.96*SE_pool
print("Margin of error:", round(m, 6))

d_observed = X_exp/N_exp-X_cont/N_cont
print("Difference observed between control and experiment group:", round(d_observed, 6))

lower_bound = d_observed-m
print("Lower bound of the 95% confidence interval:", round(lower_bound, 6))
upper_bound = d_observed+m
print("Upper bound of the 95% confidence interval:", round(upper_bound, 6))

Net conversion observed for control group: 0.117562
Net conversion observed for experiment group: 0.112688
Pool probability: 0.115127
Pool standard error: 0.003434
Margin of error: 0.006731
Difference observed between control and experiment group: -0.004874
Lower bound of the 95% confidence interval: -0.011605
Upper bound of the 95% confidence interval: 0.001857

Since the confidence interval for the gross conversion metric does not include 0, the difference is statistically significant.
The confidence interval for the net conversion metric includes 0, so that difference is not statistically significant.
On the other hand, the gross conversion confidence interval [-0.0291, -0.0120] lies entirely below the practical significance boundary (-0.01), so the change in gross conversion is also practically significant. The net conversion confidence interval [-0.0116, 0.0019] includes the negative practical significance boundary (-0.0075), so the change in net conversion is not practically significant.
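The decision logic in the paragraph above can be summarized in one small function. This is a sketch under stated assumptions rather than the notebook's own code: the function name is hypothetical, and the click totals 17293 (control) and 17260 (experiment) are inferred from the observed rates and enrollment totals, so treat them as assumptions. The intervals it produces agree with the ones quoted in the text.

```python
from math import sqrt


def effect_size_test(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """Pooled two-proportion confidence interval with significance flags."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    diff = x_exp / n_exp - x_cont / n_cont
    lower, upper = diff - z * se, diff + z * se
    # Statistically significant when the interval excludes 0
    statistically_significant = not (lower <= 0 <= upper)
    # Practically significant when the whole interval lies beyond +/- d_min
    practically_significant = upper < -d_min or lower > d_min
    return lower, upper, statistically_significant, practically_significant


# Click totals are inferred values (assumption); enrollments and payments are the
# totals from the summary table.
gross = effect_size_test(3785, 17293, 3423, 17260, d_min=0.01)
net = effect_size_test(2033, 17293, 1945, 17260, d_min=0.0075)
print("Gross conversion:", gross)
print("Net conversion:", net)
```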

### Sign Test

A two-tailed sign test was run using [this calculator](http://graphpad.com/quickcalcs/binomial1.cfm), which performs a binomial test on the number of days the experiment group showed an improvement over the control group. The results are compared with the statistical and practical significance tests above.

#### **Gross Conversion**

* Improvement Days: 4
* Total Days: 23
* Hypothesis Probability: 0.5
* p-value: 0.0026

**Statistical Significance: YES**

This is significant because the p-value is smaller than alpha.

#### **Net Conversion**

* Improvement Days: 10
* Total Days: 23
* Hypothesis Probability: 0.5
* p-value: 0.6776

**Statistical Significance: NO**

This is not significant because the p-value is greater than alpha.
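The two p-values can also be checked without the online tool. The sketch below computes an exact two-tailed binomial p-value by summing the probabilities of all outcomes that are no more likely than the observed one. The function name is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def sign_test_p_value(successes, trials, p=0.5):
    """Exact two-tailed binomial p-value for `successes` out of `trials`."""
    def pmf(k):
        return comb(trials, k) * p ** k * (1 - p) ** (trials - k)

    observed = pmf(successes)
    # Sum every outcome that is at most as likely as the observed one
    # (small tolerance guards against floating-point ties)
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


print(round(sign_test_p_value(4, 23), 4))   # 0.0026 (Gross Conversion)
print(round(sign_test_p_value(10, 23), 4))  # 0.6776 (Net Conversion)
```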

## Summary

To sum up, users in the control group went straight to enrollment in the free trial of the paid version of the course, while users in the experiment group were first asked how many hours per week they could commit to the class. The unit of diversion is the cookie (pageviews), which is also an invariant metric that did not change during the experiment.
I did not use the Bonferroni correction. Under the proposed hypothesis, both evaluation metrics have to meet expectations in order to launch the change: Gross Conversion has to decrease and Net Conversion has to stay at the same level or increase. The Bonferroni correction is meant for the opposite situation, where a significant result on any single metric would be enough to launch, so applying it here would be overly conservative.

From the above analysis we can conclude that the change in gross conversion is both practically and statistically significant, as confirmed by both the effect size test and the sign test, so the null hypothesis is rejected for that metric. The change in net conversion was not significant.

The experiment did not show strong evidence that the screener successfully diverted students who would not remain past the free trial period, or students who were taking advantage of the free trial for 14 days and cancelling before the first payment. It also did not show that Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.



## Recommendation

I do not recommend launching the free trial screener on the live Udacity service based on the results of this A/B test. As stated in the expectations for the experiment, the goal was to filter students based on study time commitment in order to improve the overall student experience, without negatively impacting the number of students who continue through the trial period and into full enrollment.
Our analysis found a significant decrease in the gross conversion rate, but the confidence interval for net conversion includes the negative practical significance boundary. In other words, enrollment decreased without a significant change in the share of students who went on to pay after the trial period, and a practically significant drop in net conversion cannot be ruled out. The experiment was therefore not effective in accomplishing its goals.



## Follow-Up Experiment

A follow-up experiment could test whether having coaches contact students directly reduces early cancellations. New enrollees would be asked whether they prefer to be contacted by phone or by email, and the experiment would test whether a consulting session improves retention. The unit of diversion would be the user-id, so that we can keep track of how many students were contacted and compare the number of students who dropped the class with the number who converted to the paid version after the trial. The hypothesis is that the session gives new enrollees enough information about the course and what is expected of them, and gives potential students a chance to ask questions that clear up any ambiguity.

