# Final Project

Instructions: https://docs.google.com/document/u/1/d/1aCquhIqsUApgsxQ8-SQBAigFDcfWVVohLEXcV6jWbdI/pub?embedded=True

**Experiment**: Free Trial Screener <br />
**Hypothesis**: Reducing #frustrated students <br />
**Unit of diversion** (the unit used to assign visitors to both groups): Cookie <br />
*Note:* if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

## Experiment Design
### Metric Choice
Candidates for invariant metrics (expected to be similarly distributed in both groups):
<ol>
<li>**#cookies** (#unique cookies to view the course overview page, dmin=3000): a good choice, since the cookie is the unit of diversion and we expect an even split between the two groups.</li>
<li>**#user-ids** (#users who enroll in the free trial, dmin=50): not appropriate, since user-ids are tracked only from enrollment, which happens after the screener, so the experimental group might have fewer of them than the control group. It can't be used as an evaluation metric either, since it is not normalized.</li>
<li>**#clicks** (#unique cookies to click the "Start free trial" button, which happens before the free trial screener is triggered, dmin=240): a good choice, since it is independent of the experiment.</li>
<li>**Click-through-probability** (#clicks/#cookies, dmin=0.01): also independent of the experiment and usable in sanity checks, but redundant if #cookies and #clicks are already chosen.</li>
</ol>

Candidates for evaluation metrics (where we would like to see a significant difference):
<ol>
<li>**Gross conversion** (#user-ids to complete checkout and enroll in the free trial / #unique cookies to click the "Start free trial" button, dmin=0.01): a good choice, as it depends on the experiment and is a probability (expected to be lower in the experimental group, as the number of users who leave the trial for lack of time should drop significantly).</li>
<li>**Retention** (#user-ids to remain enrolled past the 14-day boundary and thus make at least one payment / #user-ids to complete checkout, dmin=0.01): a good choice, as it depends on the experiment and is a probability (expected to be higher in the experimental group, since the screener filters out users who are likely to become frustrated). Based on further analysis, this metric is dropped because of the experiment duration it would require. Additionally, it can be derived from the other two metrics (retention = net conversion / gross conversion), so it can be skipped in the analysis.</li>
<li>**Net conversion** (#user-ids to remain enrolled past the 14-day boundary and thus make at least one payment / #unique cookies to click the "Start free trial" button, dmin=0.0075): a good choice, as it depends on the experiment and is a probability (it may be higher in the experimental group, but not necessarily; the number of users who make at least one payment should not drop significantly).</li>
</ol>

### Measuring Standard Deviation
#### Baseline Values
<ul>
<li>Unique cookies to view page per day:	40000</li>
<li>Unique cookies to click "Start free trial" per day:	3200</li>
<li>Enrollments per day:	660</li>
<li>Click-through-probability on "Start free trial":	0.08</li>
<li>Probability of enrolling, given click:	0.20625</li>
<li>Probability of payment, given enroll:	0.53</li>
<li>Probability of payment, given click:	0.1093125</li>
</ul>

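The baseline probabilities follow directly from the daily counts, which gives a quick consistency check of the list above. The sketch below recomputes them; the function name `baseline_rates` and the dictionary keys are illustrative and not part of the original notebook.

```python
def baseline_rates(cookies, clicks, enrollments, p_payment_given_enroll):
    """Recompute the baseline probabilities from the daily counts."""
    ctp = float(clicks) / cookies          # click-through-probability
    gross = float(enrollments) / clicks    # P(enroll | click)
    net = gross * p_payment_given_enroll   # P(payment | click)
    return {"ctp": ctp, "gross_conversion": gross, "net_conversion": net}

# Daily baseline counts and P(payment | enroll) taken from the list above
rates = baseline_rates(40000, 3200, 660, 0.53)
print(rates)
```

The three values should agree with the click-through-probability, the probability of enrolling given a click, and the probability of payment given a click listed above.
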
How many units of analysis will correspond to 5000 pageviews for each evaluation metric?

In [1]:
from math import sqrt
def standart_deviation(p,n): # standard error of a proportion: Bernoulli probability p, sample size n
    return round(sqrt(p*(1-p)/n),4)

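number_of_pageviews = 5000 # size of the pageview sample
number_of_cookies = 40000 # baseline unique cookies per day

number_of_clicks = number_of_pageviews*0.08 # clicks expected in the sample
number_of_enrollment = number_of_pageviews*0.08*0.20625 # enrollments expected in the sample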

The standard deviation of each of the evaluation metrics:

In [2]:
std_gross_conversion = standart_deviation(0.20625,number_of_clicks)
std_retention = standart_deviation(0.53,number_of_enrollment)
std_net_conversion = standart_deviation(0.1093125,number_of_clicks)

print "Gross conversion: %s" % std_gross_conversion
print "Retention: %s" % std_retention
print "Net conversion: %s" % std_net_conversion

Gross conversion: 0.0202
Retention: 0.0549
Net conversion: 0.0156


For gross and net conversion, the analytic estimates should be close to the empirical ones, since the unit of analysis (cookie) is the same as the unit of diversion. Retention uses user-ids as its unit of analysis, so if it is chosen as an evaluation metric, its variability should be estimated empirically.
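
One rough way to see what "analytic versus empirical" means is to simulate the binomial model and compare the spread of the simulated proportions with the formula. The sketch below does this for gross conversion with 400 clicks. The helper `simulated_sd`, the number of trials, and the seed are illustrative assumptions. It only checks the formula under independent draws and does not replace an empirical estimate from real data, which is what retention would need.

```python
import random
from math import sqrt

def simulated_sd(p, n, trials=2000, seed=0):
    """Empirical SD of a proportion estimated from n Bernoulli(p) draws."""
    rng = random.Random(seed)
    estimates = []
    for _trial in range(trials):
        successes = sum(1 for _draw in range(n) if rng.random() < p)
        estimates.append(successes / float(n))
    mean = sum(estimates) / trials
    variance = sum((e - mean) ** 2 for e in estimates) / (trials - 1)
    return sqrt(variance)

# Gross conversion: p = 0.20625, n = 5000 * 0.08 = 400 clicks
analytic = sqrt(0.20625 * (1 - 0.20625) / 400)
empirical = simulated_sd(0.20625, 400)
print("analytic: %.4f, simulated: %.4f" % (analytic, empirical))
```
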

## Sizing
### Number of Samples vs. Power
Using the analytic estimates of variance, how many pageviews total (across both groups) would you need to collect to adequately power the experiment? Use an alpha of 0.05 and a beta of 0.2. Make sure you have enough power for each metric.

The chosen metrics are not independent, so the Bonferroni correction would likely be too conservative and is not applied.<br/>
The sample size (the number of pageviews needed to power the experiment appropriately) can be estimated using the online calculator: http://www.evanmiller.org/ab-testing/sample-size.html

**Gross conversion** <br />
Baseline conversion rate = 0.20625*100 = 20.625% <br />
Minimum Detectable Effect = 1% <br />
=> Sample size = 25835 clicks per group

**Retention** <br />
Baseline conversion rate = 0.53*100 = 53% <br />
Minimum Detectable Effect = 1% <br />
=> Sample size = 39115 enrollments per group

**Net conversion** <br />
Baseline conversion rate = 0.1093125*100 = 10.9313% <br />
Minimum Detectable Effect = 0.75% <br />
=> Sample size = 27413 clicks per group

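The calculator figures can be cross-checked with a standard normal-approximation formula for comparing two proportions. The sketch below is an assumption about the kind of formula involved, not a description of the calculator's internals, so small differences from the quoted numbers are possible. The function name `sample_size_per_group` and the choice to take the more demanding of the two directions (baseline plus or minus the minimum detectable effect) are illustrative.

```python
from math import sqrt

Z_ALPHA = 1.959964   # two-sided critical value for alpha = 0.05
Z_BETA = 0.841621    # critical value for power = 1 - beta = 0.8

def sample_size_per_group(p, d_min):
    """Approximate per-group sample size for a two-proportion test."""
    def size(p_alt):
        sd_null = sqrt(2 * p * (1 - p))
        sd_alt = sqrt(p * (1 - p) + p_alt * (1 - p_alt))
        return (Z_ALPHA * sd_null + Z_BETA * sd_alt) ** 2 / d_min ** 2
    # Assumption: use the larger requirement of the two directions p +/- d_min
    return int(round(max(size(p + d_min), size(p - d_min))))

for name, p, d in [("Gross conversion", 0.20625, 0.01),
                   ("Retention", 0.53, 0.01),
                   ("Net conversion", 0.1093125, 0.0075)]:
    print("%s: %d per group" % (name, sample_size_per_group(p, d)))
```

The printed values should land close to the per-group sample sizes listed above.
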
In [3]:
from __future__ import division

sample_size_gross_conversion = 25835 * 2
sample_size_retention = 39115 * 2
sample_size_net_conversion = 27413 * 2

# units of analysis (clicks or enrollments) obtained per pageview
ctr_gross_conversion = 3200/number_of_cookies
ctr_retention = 660/number_of_cookies
ctr_net_conversion = 3200/number_of_cookies

pageviews_gross_conversion = int(sample_size_gross_conversion / ctr_gross_conversion)
pageviews_retention = int(sample_size_retention / ctr_retention)
pageviews_net_conversion = int(sample_size_net_conversion / ctr_net_conversion)

#print pageviews_gross_conversion,pageviews_retention,pageviews_net_conversion
print "Pageviews required: %s" % max(pageviews_gross_conversion,pageviews_retention,pageviews_net_conversion)

Pageviews required: 4741212


### Duration vs. Exposure
Indicate what fraction of traffic you would divert to this experiment and, given this, how many days you would need to run the experiment.

Give your reasoning for the fraction you chose to divert. How risky do you think this experiment would be for Udacity?

In [4]:
from math import ceil
def number_of_days(sample_size, fraction, daily_traffic):
    return int(ceil(sample_size / fraction / daily_traffic))

fraction = 1 # any value from 0 to 1, 1 means 100% of traffic

print "%s days are required" % number_of_days(pageviews_retention,fraction,number_of_cookies)

119 days are required


This is unreasonably long, so let's consider gross and net conversion only.

In [5]:
sample_size = max(pageviews_gross_conversion,pageviews_net_conversion)
number_of_days(sample_size,fraction,number_of_cookies)

18

In [6]:
fraction = 0.75
number_of_days(sample_size,fraction,number_of_cookies)

23

In [7]:
fraction = 0.5
number_of_days(sample_size,fraction,number_of_cookies)

35

These durations seem reasonable, so gross and net conversion are kept as the evaluation metrics. The final choice of duration and traffic fraction should be based on business needs.

## Experiment Analysis

The experiment raw data broken down day by day is provided. 

**Pageviews:** Number of unique cookies to view the course overview page that day.<br />
**Clicks:** Number of unique cookies to click the "Start free trial" button on the course overview page that day.<br />
**Enrollments:** Number of user-ids to enroll in the free trial that day.<br />
**Payments:** Number of user-ids who enrolled on that day and remained enrolled for 14 days, thus making a payment. (Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

In [8]:
import xlrd

def load_data(filename, sheet_index, header, rows_number):
    d = {}
    wb = xlrd.open_workbook(filename)
    sh = wb.sheet_by_index(sheet_index)  
    if header: 
        start = 1
    else: 
        start = 0
    for i in range(start,rows_number):
        d[sh.cell(i,0).value] = [sh.cell(i,j).value for j in range(1,5)]
    return d

In [9]:
control_dict = load_data('Final Project Results.xlsx', 0, True, 38)
experiment_dict = load_data('Final Project Results.xlsx', 1, True, 38)

### Sanity checks
Model the assignment to each group as a Bernoulli distribution with p = 0.5. <br/>
The observed value is the count in the control group divided by the total count across both groups. <br/>
For each of your invariant metrics, give the 95% confidence interval for the value you expect to observe, the actual observed value, and whether the metric passes your sanity check. For any sanity check that did not pass, explain your best guess as to what went wrong based on the day-by-day data. Do not proceed to the rest of the analysis unless all sanity checks pass.

In [10]:
pageviews_control = sum([float(v[0]) for v in control_dict.values() if v[0]!=''])
clicks_control = sum([float(v[1]) for v in control_dict.values() if v[1]!=''])

enrollments_control = sum([float(v[2]) for v in control_dict.values() if v[2]!=''])
payments_control = sum([float(v[3]) for v in control_dict.values() if v[3]!=''])

clicks_after_payments_control = sum([float(v[1]) for k,v in control_dict.items() if v[3]!=''])

In [11]:
pageviews_experiment = sum([float(v[0]) for v in experiment_dict.values() if v[0]!=''])
clicks_experiment = sum([float(v[1]) for v in experiment_dict.values() if v[1]!=''])

enrollments_experiment = sum([float(v[2]) for v in experiment_dict.values() if v[2]!=''])
payments_experiment = sum([float(v[3]) for v in experiment_dict.values() if v[3]!=''])

clicks_after_payments_experiment = sum([float(v[1]) for k,v in experiment_dict.items() if v[3]!=''])

In [12]:
import scipy.stats as st
z_score = st.norm.ppf(1-(1-0.95)/2) 
z_score

1.959963984540054

In [13]:
def confidence_interval(mean,standard_error,z_score):
    return (round(mean - (z_score*standard_error),4),round(mean + (z_score*standard_error),4))

In [14]:
p_true = 0.5

SE_number_of_cookies = standart_deviation(p_true,pageviews_control+pageviews_experiment)
CI = confidence_interval(p_true, SE_number_of_cookies, 1.96)
print CI

(0.4988, 0.5012)


In [15]:
p_hat_number_of_cookies = round(pageviews_control/(pageviews_control+pageviews_experiment),4)
print p_hat_number_of_cookies
print "Metric passes the sanity check: %s" % (p_hat_number_of_cookies > CI[0] and p_hat_number_of_cookies < CI [1])

0.5006
Metric passes the sanity check: True


In [16]:
p_true = 0.5
SE_number_of_clicks = standart_deviation(p_true,clicks_control+clicks_experiment)
CI = confidence_interval(p_true, SE_number_of_clicks, 1.96)
print CI

(0.4959, 0.5041)


In [17]:
p_hat_number_of_clicks = round(clicks_control/(clicks_control+clicks_experiment),4)
print p_hat_number_of_clicks
print "Metric passes the sanity check: %s" % (p_hat_number_of_clicks > CI[0] and p_hat_number_of_clicks < CI [1])

0.5005
Metric passes the sanity check: True


### Result Analysis
#### Effect Size Tests
For each of your evaluation metrics, give a 95% confidence interval around the difference between the experiment and control groups. Indicate whether each metric is statistically and practically significant.

As before, the Bonferroni correction is not applied, since the metrics are not independent.
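
For reference, the confidence intervals in this section are based on the pooled standard error of a difference between two proportions. The symbols below ($X$ for the numerator count, $N$ for the number of clicks, subscripts for the two groups) are notation introduced here for illustration.

```latex
\hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}, \qquad
SE_{pool} = \sqrt{\hat{p}_{pool}\,(1 - \hat{p}_{pool})
            \left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}

\hat{d} = \frac{X_{exp}}{N_{exp}} - \frac{X_{cont}}{N_{cont}}, \qquad
CI_{95\%} = \hat{d} \pm 1.96 \cdot SE_{pool}
```

Since $\frac{1}{N_{cont}} + \frac{1}{N_{exp}} = \frac{N_{cont} + N_{exp}}{N_{cont} N_{exp}}$, passing $n = \frac{N_{cont} N_{exp}}{N_{cont} + N_{exp}}$ to `standart_deviation` yields exactly $SE_{pool}$, which is why that expression appears in the cells that compute the standard errors.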

In [18]:
# Gross conversion
p_hat_gross_conversion = round((enrollments_control+enrollments_experiment)/(clicks_after_payments_control+clicks_after_payments_experiment),4)
print p_hat_gross_conversion

0.2086


In [19]:
SE_gross_conversion = standart_deviation(p_hat_gross_conversion,clicks_after_payments_control*clicks_after_payments_experiment/(clicks_after_payments_control + clicks_after_payments_experiment))
d_hat_gross_conversion = round(enrollments_experiment/clicks_after_payments_experiment - enrollments_control/clicks_after_payments_control,4)
CI = confidence_interval(d_hat_gross_conversion, SE_gross_conversion, 1.96)
print CI

(-0.0292, -0.012)


In [20]:
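d_min = 0.01 # practical significance boundary (magnitude)

print "Statistically significant: %s" % (0 < CI[0] or 0 > CI[1])
# practically significant only if the whole interval lies beyond +/- d_min
print "Practically significant: %s" % (CI[1] < -d_min or CI[0] > d_min)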

Statistically significant: True
Practically significant: True


In [21]:
# Net conversion
p_hat_net_conversion = round((payments_control+payments_experiment)/(clicks_after_payments_control+clicks_after_payments_experiment),4)
print p_hat_net_conversion

0.1151


In [22]:
SE_net_conversion = standart_deviation(p_hat_net_conversion,clicks_after_payments_control*clicks_after_payments_experiment/(clicks_after_payments_control + clicks_after_payments_experiment))
d_hat_net_conversion = round(payments_experiment/clicks_after_payments_experiment - payments_control/clicks_after_payments_control,4)
CI = confidence_interval(d_hat_net_conversion, SE_net_conversion, 1.96)
print CI

(-0.0116, 0.0018)


In [23]:
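d_min = 0.0075 # practical significance boundary (magnitude)

print "Statistically significant: %s" % (0 < CI[0] or 0 > CI[1])
# practically significant only if the whole interval lies beyond +/- d_min
print "Practically significant: %s" % (CI[1] < -d_min or CI[0] > d_min)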

Statistically significant: False
Practically significant: False


#### Sign Tests
For each of your evaluation metrics, do a sign test using the day-by-day data, and report the p-value of the sign test and whether the result is statistically significant.

The p-values are calculated with the online calculator http://graphpad.com/quickcalcs/binomial1.cfm.

In [24]:
# gross_conversion = enrollments / clicks
# net_conversion = payments / clicks
gross_conversion_control = {k: round(float(v[2])/float(v[1]),4) for k,v in control_dict.items() if v[1]!='' and v[2]!=''}
net_conversion_control = {k: round(float(v[3])/float(v[1]),4) for k,v in control_dict.items() if v[1]!='' and v[3]!=''}
gross_conversion_experiment = {k: round(float(v[2])/float(v[1]),4) for k,v in experiment_dict.items() if v[1]!='' and v[2]!=''}
net_conversion_experiment = {k: round(float(v[3])/float(v[1]),4) for k,v in experiment_dict.items() if v[1]!='' and v[3]!=''}

In [25]:
len(gross_conversion_control)

23

In [26]:
gross_conversion_successes = len([k for k in gross_conversion_experiment.keys() if (gross_conversion_experiment[k]-gross_conversion_control[k])>0])

In [27]:
(gross_conversion_experiment[u'Thu, Oct 16']-gross_conversion_control[u'Thu, Oct 16'])>0

False

In [28]:
gross_conversion_successes

4

Using http://graphpad.com/quickcalcs/binomial1.cfm with 4 successes out of 23 days, the two-tailed p-value is 0.0026 < 0.05 => the difference is statistically significant.

In [29]:
net_conversion_successes = len([k for k in net_conversion_experiment.keys() if (net_conversion_experiment[k]-net_conversion_control[k])>0])

In [30]:
net_conversion_successes

10

Using the same tool with 10 successes out of 23 days, the two-tailed p-value is 0.6776 > 0.05 => the difference is not statistically significant.
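
The sign-test p-values can also be reproduced without the online tool. The sketch below sums the binomial tail for p = 0.5 and doubles it for a two-tailed test. The function name `sign_test_p_value` is illustrative, and the snippet is written to run under both Python 2 and Python 3.

```python
from math import factorial

def sign_test_p_value(successes, n):
    """Two-tailed p-value of a sign test (binomial with p = 0.5)."""
    def comb(total, k):
        return factorial(total) // (factorial(k) * factorial(total - k))
    # Use the smaller tail, then double it (the distribution is symmetric)
    k = min(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k + 1)) / float(2 ** n)
    return min(1.0, 2 * tail)

print("Gross conversion: %.4f" % sign_test_p_value(4, 23))
print("Net conversion: %.4f" % sign_test_p_value(10, 23))
```

With 4 and 10 successes out of 23 days, this gives 0.0026 and 0.6776, matching the values obtained from the calculator.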

#### Summary
If there are any discrepancies between the effect size hypothesis tests and the sign tests, describe the discrepancy and why you think it arose.

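There are no discrepancies: the effect size tests and the sign tests agree. Gross conversion decreases significantly, while the change in net conversion is not statistically significant. The change would therefore reduce the number of enrollments that do not lead to a payment, without a statistically significant reduction in the enrollments that do.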

### Recommendation

To accept the change, all evaluation metrics (gross and net conversion) would need to show a significant effect, but this is not the case. The recommendation is therefore not to launch it.

## Follow-Up Experiment
Give a high-level description of the follow up experiment you would run, what your hypothesis would be, what metrics you would want to measure, what your unit of diversion would be, and your reasoning for these choices.

The change and the metrics should be reconsidered:
<ul>
<li>whether a user-id could be used as the unit of diversion (it is more stable than a cookie),</li>
<li>whether another evaluation metric could be used (for example, how many hours per day the user is online, or how many days pass between enrolling and cancelling),</li>
<li>whether an alternative change, for example to the trial period, could be tested instead.</li>
</ul>