In [2]:
import numpy as np

## Experimental goal
1. Reduce the number of students who leave the free trial.
2. Do so without significantly reducing the number of students who complete the course.

## Experiment Design
### Metric Choice
> List which metrics you will use as invariant metrics and evaluation metrics here.
Explain both why you did or did not use it as an invariant metric and why you did or did not use it as an evaluation metric. Also, state what results you will look for in your evaluation metrics in order to launch the experiment.

#### Invariant metrics
* Number of cookies
* Number of clicks 
* Click through probability

**These events happen before the free trial screener is triggered, so their values should be similar in the experiment and control groups. "Number of cookies" and "Number of clicks" tell us whether samples were assigned equally to the two groups. "Click through probability" is the ratio of two invariants, so a significant difference between the control and experiment groups would indicate a problem with random assignment.**

#### Evaluation metrics
* Gross conversion: the fraction of students who enroll after clicking "Start free trial".
* Retention: the probability of payment, given enrollment.
* Net conversion: the fraction of students who pay after clicking "Start free trial".

**"Enrollment" and "payment" are events that happen after the experimental change (i.e., showing the screener). The number of enrollments directly reflects the effect of the screener in filtering out unready students, making gross conversion a good metric for our first goal. The number of payments directly reflects how many motivated students want to complete the degree, making net conversion a good metric for our second goal. Retention, the ratio of payments to enrollments, captures the fraction of enrolled students who pay, which also makes it a good metric for our second goal. Moreover, all three metrics are normalized, so they are more robust to differences in sample size between the groups.**
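As a minimal sketch of how the three evaluation metrics relate to each other, the snippet below computes them from baseline values. The counts (3200 clicks, 660 enrollments) and the retention of 0.53 are the baseline figures used in the standard deviation section of this notebook; the variable names are illustrative, and Python 3 division is assumed.

```python
# Baseline values taken from the standard deviation section of this notebook.
clicks = 3200        # clicks on "Start free trial"
enrollments = 660    # enrollments in the free trial
retention = 0.53     # P(payment | enrollment)

# Gross conversion: P(enrollment | click)
gross_conversion = enrollments / clicks
# Net conversion: P(payment | click) = P(enrollment | click) * P(payment | enrollment)
net_conversion = gross_conversion * retention

print('gross conversion: {:.5f}'.format(gross_conversion))
print('retention:        {:.2f}'.format(retention))
print('net conversion:   {:.7f}'.format(net_conversion))
```

Because net conversion is the product of gross conversion and retention, a drop in gross conversion can be offset by a rise in retention, which is the outcome the screener is hoping for.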

#### Criteria for evaluation metrics for launching the experiment
* **Gross conversion** The experiment will be launched only if gross conversion is significantly lower in the experiment group, which would show that the newly added screener indeed reduces enrollments by students with less commitment.

* **Retention** The experiment will be launched only if retention in the experiment group is not significantly lower than in the control group, so we can be confident that the screener doesn't have a negative effect on the number of students who proceed to payment and finish the course.

* **Net conversion** The experiment will be launched only if net conversion in the experiment group is not significantly lower than in the control group, so we can be confident that the screener won't significantly reduce the number of students who proceed to payment and finish the course.


#### Metrics not chosen
* **Number of user-ids** shows the number of users enrolled in the free trial, which gives similar information to gross conversion. However, it is a raw count, so differences in sample size between the experiment and control groups affect its accuracy. Gross conversion, on the other hand, is normalized by the number of cookies, so it is more robust to differences in sample size.


## Measuring Standard Deviation
>List the standard deviation of each of your evaluation metrics. 

In [13]:
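# Analytic standard deviations, scaled from 40000 pageviews to a 5000-pageview sample
scale = np.sqrt(40000 / 5000)
# gross_conversion: binomial over 3200 clicks (per 40000 pageviews)
gross = 0.20625
std_gross = np.sqrt(gross * (1 - gross) / 3200) * scale
print('std of gross_conversion:', std_gross)
# retention: binomial over 660 enrollments (per 40000 pageviews)
retention = 0.53
std_retention = np.sqrt(retention * (1 - retention) / 660) * scale
print('std of retention:', std_retention)
# net_conversion: binomial over 3200 clicks (per 40000 pageviews)
net = 0.1093125
std_net = np.sqrt(net * (1 - net) / 3200) * scale
print('std of net_conversion:', std_net)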

std of gross_conversion: 0.020230604137
std of retention: 0.0549490121785
std of net_conversion: 0.0156015445825


>For each of your evaluation metrics, indicate whether you think the analytic estimate would be comparable to the empirical variability, or whether you expect them to be different (in which case it might be worth doing an empirical estimate if there is time). Briefly give your reasoning in each case.

**Both "gross conversion" and "net conversion" use the number of cookies as the denominator, so their unit of analysis matches the unit of diversion. This suggests that the analytic estimate and the empirical variability will be comparable. For retention, on the other hand, the unit of analysis (enrollments) differs from the unit of diversion, so the empirical variability is likely to be higher than the analytic estimate. Therefore, I would do an empirical estimate for retention if I had time.**

## Sizing

### Number of Samples vs. Power
>Indicate whether you will use the Bonferroni correction during your analysis phase, and give the number of pageviews you will need to power your experiment appropriately. 

I didn't use the Bonferroni correction.
An [online calculator](http://www.evanmiller.org/ab-testing/sample-size.html) was used to calculate the sample sizes:
* gross_conversion: 
Baseline conversion rate = 0.2063,
Minimum Detectable Effect = 0.01,
alpha = 0.05,
1-beta = 0.8,
N = 25835 

* Retention:
Baseline conversion rate = 0.53,
Minimum Detectable Effect = 0.01,
alpha = 0.05,
1-beta = 0.8,
N = 39087

* net_conversion:
Baseline conversion rate = 0.1093,
Minimum Detectable Effect = 0.0075,
alpha = 0.05,
1-beta = 0.8,
N = 27413
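The sample sizes above can be cross-checked with a closed-form approximation for a two-sided test of two proportions. This is only a sketch: the formula below is an assumption about what the online calculator computes, not its documented implementation, and the function name `sample_size` is illustrative. It uses `statistics.NormalDist` from the Python 3.8+ standard library.

```python
from statistics import NormalDist  # standard library, Python 3.8+

def sample_size(p, d_min, alpha=0.05, power=0.8):
    """Approximate per-group sample size to detect an absolute change d_min
    from baseline probability p with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Standard deviation of the difference under the null (both groups at p)
    sd_null = (2 * p * (1 - p)) ** 0.5
    # Standard deviation of the difference under the alternative (p vs. p + d_min)
    sd_alt = (p * (1 - p) + (p + d_min) * (1 - p - d_min)) ** 0.5
    return ((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2

for name, p, d_min in [('gross_conversion', 0.20625, 0.01),
                       ('retention', 0.53, 0.01),
                       ('net_conversion', 0.1093125, 0.0075)]:
    print('{}: N is approximately {:.0f}'.format(name, sample_size(p, d_min)))
```

With the unrounded baseline values, this approximation lands within a few samples of the N values listed above.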

**N is the sample size required in each of the experiment and control groups. To calculate the actual number of pageviews required, we also need to account for the click-through rate (0.08). For retention, as the samples come from enrolled students, we also need to account for the probability of enrolling, given a click (0.20625).**

**If we include retention in the evaluation metrics, we need 39087$\times$2/0.08/0.20625 = 4,737,819 pageviews. This would take about 4 months, which is unrealistically long for the experiment. Using only gross_conversion and net_conversion as evaluation metrics, we need to collect 27413$\times$2/0.08 = 685,325 pageviews, which takes about 685,325/40,000 = 17.13 days if I use 100% of the traffic. Therefore, in the end I only use "gross conversion" and "net conversion" as evaluation metrics.**
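The pageview arithmetic can be sketched in a few lines. The helper name `pageviews_needed` and the constant names are illustrative; the rates and sample sizes are the ones quoted in this section.

```python
def pageviews_needed(n_per_group, *rates):
    """Convert a per-group sample size into total pageviews by doubling it
    (two groups) and dividing by each funnel rate in turn."""
    total = 2 * n_per_group
    for rate in rates:
        total /= rate
    return total

CLICK_RATE = 0.08             # P(click | pageview)
ENROLL_GIVEN_CLICK = 0.20625  # P(enrollment | click)
DAILY_PAGEVIEWS = 40000

# Retention is measured on enrollments, so both funnel rates apply.
with_retention = pageviews_needed(39087, CLICK_RATE, ENROLL_GIVEN_CLICK)
# Gross and net conversion are measured on clicks, so only the click rate applies.
without_retention = pageviews_needed(27413, CLICK_RATE)

print('with retention:    {:.0f} pageviews, {:.1f} days'.format(
    with_retention, with_retention / DAILY_PAGEVIEWS))
print('without retention: {:.0f} pageviews, {:.1f} days'.format(
    without_retention, without_retention / DAILY_PAGEVIEWS))
```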

### Duration vs. Exposure
>Indicate what fraction of traffic you would divert to this experiment and, given this, how many days you would need to run the experiment. 
Give your reasoning for the fraction you chose to divert. How risky do you think this experiment would be for Udacity?

**If I use 100% of the traffic, I will need 685325/40000 = 17.13 days to run the experiment. The experiment is not risky: it has no effect on students who are already enrolled and does no harm to students who want to take the courses, as it only adds one step/window that reminds students of the effort required before they enroll. We are not dealing with sensitive data, and the experiment doesn't increase the risks undertaken by participants. As long as we are certain that no other experiment needs to run in the coming 18 days, we can use the highest traffic possible to finish the experiment in 18 days. This needs 685325/18/40000 = 95.18% of the traffic.**
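A short sketch of the trade-off between traffic fraction and duration, using the 685,325 pageviews and 40,000 daily pageviews from this section. The function names are illustrative, and whole days are assumed (rounded up).

```python
import math

PAGEVIEWS_NEEDED = 685325
DAILY_PAGEVIEWS = 40000

def days_needed(traffic_fraction):
    """Whole days required when diverting the given fraction of daily traffic."""
    return math.ceil(PAGEVIEWS_NEEDED / (DAILY_PAGEVIEWS * traffic_fraction))

def fraction_needed(days):
    """Fraction of daily traffic required to finish in the given number of days."""
    return PAGEVIEWS_NEEDED / (DAILY_PAGEVIEWS * days)

for fraction in (1.0, 0.75, 0.5):
    print('{:.0%} traffic: {} days'.format(fraction, days_needed(fraction)))
print('traffic needed for 18 days: {:.2%}'.format(fraction_needed(18)))
```

Lower traffic fractions reduce exposure at the cost of a longer run, which is why the low risk of this change matters for choosing a fraction close to 100%.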

## Experiment Analysis
### Sanity Checks
>For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check. 

##### Number of cookies:
* $N_{cont}$: 345543
* $N_{exp}$: 344660
* $N_{total}$: $N_{cont}$ + $N_{exp}$
* expected probability in control and experiment group: P = 0.5
* SE = $\sqrt{\frac{P\times(1-P)}{N_{cont}+N_{exp}}}$ = 0.0006018
* margin of error (m) = 1.96 * SE = 0.00118
* CI = [0.5-m, 0.5+m] = [0.4988, 0.5012]
* observed probability = $\frac{N_{cont}}{N_{total}}$ = 0.5006, within the range of CI

##### Number of clicks:
* $N_{cont}$: 28378
* $N_{exp}$: 28325
* $N_{total}$: $N_{cont}$ + $N_{exp}$
* expected probability in control and experiment group: P = 0.5
* SE = $\sqrt{\frac{P\times(1-P)}{N_{cont}+N_{exp}}}$ = 0.0021
* margin of error (m) = 1.96 * SE = 0.0041
* CI = [0.5-m, 0.5+m] = [0.4959, 0.5041]
* observed probability = $\frac{N_{cont}}{N_{total}}$ = 0.5005, within the range of CI

##### Click through probability:
* $N_{cont}$: 345543
* control: P = 0.082125814
* SE = $\sqrt{\frac{P\times(1-P)}{N_{cont}}}$ = 0.000467
* margin of error (m) = 1.96 * SE = 0.000915
* CI = [P-m, P+m] = [0.0812, 0.0830]
* observed probability in the experiment group = 0.0822, within the range of CI

**All observed values are within the confidence interval so all of them passed the sanity check.**
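The count-based sanity checks above can be reproduced with a small helper. This is a sketch using the normal approximation to the binomial with an expected split of 0.5, as in the calculations above; the function name `count_sanity_check` is illustrative.

```python
import math

def count_sanity_check(n_cont, n_exp, p=0.5, z=1.96):
    """Return the 95% CI around the expected control fraction, the observed
    control fraction, and whether the observation falls inside the CI."""
    n_total = n_cont + n_exp
    se = math.sqrt(p * (1 - p) / n_total)
    margin = z * se
    observed = n_cont / n_total
    ci = (p - margin, p + margin)
    return ci, observed, ci[0] <= observed <= ci[1]

for name, n_cont, n_exp in [('cookies', 345543, 344660),
                            ('clicks', 28378, 28325)]:
    ci, observed, passed = count_sanity_check(n_cont, n_exp)
    print('{}: CI = [{:.4f}, {:.4f}], observed = {:.4f}, pass = {}'.format(
        name, ci[0], ci[1], observed, passed))
```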


## Result Analysis
### Effect Size Tests
>For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant. 

* $P_{pool} = \frac{X_{cont}+X_{exp}}{N_{cont}+N_{exp}}$
* $SE = \sqrt{P_{pool}(1-P_{pool})(\frac{1}{N_{exp}}+\frac{1}{N_{cont}})}$
* $d = \frac{X_{exp}}{N_{exp}} - \frac{X_{cont}}{N_{cont}}$
* $\mathrm{CI} = [\mathrm{d}-1.96\times\mathrm{SE}, \mathrm{d}+1.96\times\mathrm{SE}]$ 

##### Gross conversion:
* $d_{min}$ = 0.01
* $X_{cont}$ = 3785
* $N_{cont}$ = 17293
* $X_{exp}$ = 3423
* $N_{exp}$ = 17260
* $P_{pool}$ = 0.2086
* SE = 0.004371
* margin of error = 0.008568
* d = -0.02055
* CI = [-0.0291 , -0.0120]
* **Statistically significant: the CI doesn't include 0.**
* **Practically significant: the whole CI lies below $-d_{min}$, i.e., the decrease is larger than the practical significance boundary.**

##### Net conversion:
* $d_{min}$ = 0.0075
* $X_{cont}$ = 2033
* $N_{cont}$ = 17293
* $X_{exp}$ = 1945
* $N_{exp}$ = 17260
* $P_{pool}$ = 0.11513
* SE = 0.003434
* margin of error = 0.00673
* d = -0.00487
* CI = [-0.0116 , 0.00186]
* **Not statistically significant: the CI includes 0.**
* **Not practically significant: the CI does not lie entirely beyond the practical significance boundary. Most of the CI is in the negative region, showing that the experiment mainly results in a decrease in net conversion, and its lower bound (-0.0116) extends below $-d_{min}$ (-0.0075), so a practically significant decrease cannot be ruled out.**
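The effect size calculations for both metrics can be sketched with the pooled standard error formulas listed at the start of this section. The function names `effect_size_ci` and `significance` are illustrative; the counts are the ones reported above.

```python
import math

def effect_size_ci(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Return the observed difference d and its 95% CI using a pooled SE."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_exp + 1 / n_cont))
    d = x_exp / n_exp - x_cont / n_cont
    return d, (d - z * se, d + z * se)

def significance(ci, d_min):
    """Statistical significance: CI excludes 0.
    Practical significance: CI lies entirely beyond +/- d_min."""
    lower, upper = ci
    statistically = not (lower <= 0 <= upper)
    practically = upper < -d_min or lower > d_min
    return statistically, practically

metrics = [('gross conversion', 3785, 17293, 3423, 17260, 0.01),
           ('net conversion', 2033, 17293, 1945, 17260, 0.0075)]
for name, x_cont, n_cont, x_exp, n_exp, d_min in metrics:
    d, ci = effect_size_ci(x_cont, n_cont, x_exp, n_exp)
    stat, pract = significance(ci, d_min)
    print('{}: d = {:.5f}, CI = [{:.4f}, {:.4f}], statistical = {}, practical = {}'.format(
        name, d, ci[0], ci[1], stat, pract))
```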


### Sign Tests
>For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant. 

I used [this website](http://graphpad.com/quickcalcs/binomial1.cfm) to calculate the p-values for the sign tests.

##### Gross conversion:
* days on which the experiment group shows higher gross conversion than the control group: 4
* total days: 23
* two-tailed p-value: 0.0026
* **statistically significant**, as p-value < alpha (0.05)

##### Net conversion:
* days on which the experiment group shows higher net conversion than the control group: 10
* total days: 23
* two-tailed p-value: 0.6776
* **not statistically significant**, as p-value > alpha (0.05)
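The sign test p-values can also be computed directly as an exact two-tailed binomial test with success probability 0.5. This is a sketch using only the standard library (`math.comb` requires Python 3.8+); the function name `sign_test_p_value` is illustrative, and doubling the smaller tail is valid here because the distribution is symmetric at p = 0.5.

```python
from math import comb  # standard library, Python 3.8+

def sign_test_p_value(successes, n):
    """Two-tailed exact binomial p-value under the null hypothesis p = 0.5."""
    k = min(successes, n - successes)
    # Probability of a result at least as extreme in one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print('gross conversion: p = {:.4f}'.format(sign_test_p_value(4, 23)))
print('net conversion:   p = {:.4f}'.format(sign_test_p_value(10, 23)))
```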

### Summary
>State whether you used the Bonferroni correction, and explain why or why not. If there are any discrepancies between the effect size hypothesis tests and the sign tests, describe the discrepancy and why you think it arose.

I didn't use the Bonferroni correction. 

The Bonferroni correction is designed to reduce type I errors.
If the criterion for launching an experiment is that at least one of many evaluation metrics (say, N metrics) matches our expectation, then the probability that we make a mistake in launching the experiment is $1-(1-\alpha)^N$, assuming the metrics are independent. This probability becomes very high if we have lots of evaluation metrics. Therefore, the Bonferroni correction adjusts $\alpha$ to $\frac{\alpha}{N}$ to decrease the probability of making a mistake due to type I error.
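To illustrate how quickly the family-wise error rate grows, the sketch below evaluates $1-(1-\alpha)^N$ with and without the Bonferroni adjustment. It assumes independent metrics, which is a simplification, and the function name `family_wise_error` is illustrative.

```python
def family_wise_error(alpha, n_metrics):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_metrics

alpha = 0.05
for n_metrics in (1, 2, 5, 10):
    uncorrected = family_wise_error(alpha, n_metrics)
    corrected = family_wise_error(alpha / n_metrics, n_metrics)
    print('N = {:2d}: uncorrected = {:.4f}, Bonferroni = {:.4f}'.format(
        n_metrics, uncorrected, corrected))
```

With the adjusted per-test level $\frac{\alpha}{N}$, the family-wise rate stays at or below $\alpha$ for every N.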

In this case, the criteria for both gross conversion and net conversion must be met in order to launch the experiment. Requiring all metrics to pass already makes a false launch due to type I error less likely. The greater risk is a type II error, i.e., failing to detect a real effect in one of the metrics. The Bonferroni correction makes each test more conservative, so it won't help in this case.

### Recommendation
>Make a recommendation and briefly describe your reasoning.

I don't recommend launching the experiment.

First, gross conversion is significantly (both statistically and practically) lower in the experiment group. Therefore, we know the experiment achieved the first goal: reducing the number of students who leave the free trial. Second, net conversion shows no statistically significant change in the experiment group, so there is no evidence that the experiment reduced the number of students who pay. However, the change in net conversion is not practically significant, which is not enough for us to launch the experiment. The CI also covers a much broader negative range than positive range, implying a high chance that net conversion decreases, which conflicts with the second goal of the experiment.



## Follow-Up Experiment: How to Reduce Early Cancellations
>If you wanted to reduce the number of frustrated students who cancel early in the course, what experiment would you try? Give a brief description of the change you would make, what your hypothesis would be about the effect of the change, what metrics you would want to measure, and what unit of diversion you would use. Include an explanation of each of your choices.

I think the experiences of students who finished the nanodegree can help guide current students through their frustrations. We could provide videos or articles from students with different backgrounds who successfully finished the nanodegree, describing the difficulties they encountered, how they dealt with the challenges, how they planned their time to finish the nanodegree, and how they use the skills learned from the course in their current lives. This would help currently enrolled students learn from graduates with a similar background, overcome their frustrations, and stay motivated.

In the experiment, a new button "see how other students get the nanodegree" will be added to the main page of enrolled users. Only enrolled users can see the button. If they click it, they can watch the videos or read the articles from students with different backgrounds who successfully finished the nanodegree.

The hypothesis is that the experiences of students with a similar background who finished the nanodegree successfully can help frustrated students deal with the challenges and frustrations, thus reducing the number of cancellations. The experiment should not significantly reduce the number of students who have already paid for the nanodegree.

The unit of diversion will be user-id, as every enrolled student has their own user-id and the button can only be seen by students with a user-id.

The invariant metrics will be "num_view = the number of unique enrolled user-ids that view their main page (the page where the button exists)" and "num_enroll = the number of users enrolled in the free trial". We can use these metrics to verify that the samples are randomly assigned to the experiment and control groups.

The evaluation metrics will be "percent_paid = the number of user-ids past the 14-day boundary divided by the number of unique enrolled user-ids that view their main page" and "percent_cancelled = the number of cancelled user-ids divided by the number of unique enrolled user-ids that view their main page". The experiment will be launched only if the criteria for both percent_paid and percent_cancelled are met. The percent_paid in the experiment group should not be significantly lower than that in the control group, so we know the experiment won't have a negative effect on the students who have already paid for the nanodegree. The percent_cancelled in the experiment group should be significantly lower than that in the control group.