In [1]:
import math as mt
import numpy as np
import pandas as pd
from scipy.stats import norm

# 1 Udacity Experiment

Udacity wants to:
1. Choose and characterize metrics to evaluate an experiment
2. Design an experiment with enough statistical power
3. Analyze results and draw valid conclusions

# 2 Experiment Overview
**Experiment Name**: "Free Trial" Screener

**Context**: Udacity is a website whose overall goal is to maximize course completion by students.

### 2.1 Current Conditions
- Two options to *click*: 
    - "start free trial": Get credit card info and charged after first 2 weeks unless cancelled
    - "access course materials": They can view videos and take quizzes for free but no coaching/feedback on final project
    
### 2.2 Experimental Change
- If a student selects "start free trial", they are asked how much time they can devote to the course
    - If 5 or more hrs/wk: Proceeds with the checkout process
    - If under 5 hrs/wk: Suggests accessing the course materials for free
    
### 2.3 Experiment Hypothesis
Sets clearer expectations, reducing the number of frustrated students who left the free trial because they didn't have enough time—without significantly reducing the number of students to continue past the free trial and eventually complete the course.

Udacity could improve overall student experience and improve coaches' capacity to support students likely to stay.

**Expect to see**:
- Roughly the same number of students staying past the trial as before
- Fewer students dropping out during the free trial

### 2.4 Additional Details
- Unit of diversion is a cookie
- If the student enrolls in the free trial, they are tracked by user-id from that point forward

# 3 Metric Choice

**Invariant Metrics** are metrics we expect not to change between groups; they are used for "sanity checks".

**Evaluation Metrics** are metrics we expect to change and that relate to the business goal.

For each metric we choose a ***Dmin***, the minimum change that is practically significant to the business. Example: we might state that any change in retention under 2%, even if statistically significant, is not practically relevant to the business.

### 3.1 Invariant Metrics - Sanity Check!
![image.png](attachment:image.png)

### 3.2 Evaluation Metrics - KPIs
![image-2.png](attachment:image-2.png)

# 4 Estimating the baseline values of metrics
We'd like to know how our metrics behave before experimentation


### 4.1 Collecting estimators data
![image.png](attachment:image.png)

In [2]:
# Place estimators in Dictionary for ease of use later
baseline = {"Cookies": 40000, "Clicks": 3200, "Enrollments": 660, 
            "CTP": 0.08, "GConversion": 0.20625, "Retention": 0.53,
           "NConversion": 0.109313}

### 4.2 Estimating Standard Deviation

**Assume a sample size of 5,000 cookies visiting the course overview page per day.**

- Used for Evaluation metrics only
- Standard deviation 's' relates to sample
- Standard deviation '$\sigma$' relates to population
    - s is an estimate of $\sigma$ computed from the sample, so it varies from sample to sample
    
The sample size we are considering should be smaller than the "population" we collected and small enough to have two groups with that size.

#### 4.2.1 Scaling Collected Data

Scale the collected count estimates to the assumed sample size. In this case, we scale from 40,000 unique cookies visiting the course overview page per day down to 5,000.

In [3]:
# Scale the counts estimates
baseline["Cookies"] = 5000
baseline["Clicks"] = baseline["Clicks"]*(5000/40000)
baseline["Enrollments"] = baseline["Enrollments"]*(5000/40000)
baseline

{'Cookies': 5000,
 'Clicks': 400.0,
 'Enrollments': 82.5,
 'CTP': 0.08,
 'GConversion': 0.20625,
 'Retention': 0.53,
 'NConversion': 0.109313}

#### 4.2.2 Estimating Analytically - Standard Deviation for Evaluation Metrics

Assume metrics which are probabilities are binomially distributed to estimate variance. This is only valid if **unit of diversion** = **unit of analysis**. If not, measure variance empirically. 
![image.png](attachment:image.png)
Where:

- *$\hat{p}$* - baseline probability of the event to occur

- *n* - sample size
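
For reference, the analytic estimate computed in the cells below can be written out as follows. This assumes the attached image shows the usual binomial standard deviation of a proportion, which is what the code uses:

$$
SD = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
$$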

What is our unit of diversion?

- The **unit of diversion** is the element by which we differentiate samples and assign them to the control and experiment groups.
    - Our unit of diversion = Cookies 

What are our units of analysis?

- The **unit of analysis** is the denominator of the evaluation metrics.

    - **Gross Conversion** - Cookies enrolled / Cookies clicked 
    - **Retention** - Cookies paid after 14 days / Cookies enrolled (those who enrolled and stayed)
    - **Net Conversion** - Cookies paid after 14 days / Cookies clicked 

In [11]:
# Let's get the p and n we need for Gross Conversion (GC)
# and compute the Standard Deviation (sd) rounded to 4 decimal digits.
GC = {}
GC["d_min"] = 0.01
GC["p"] = baseline["GConversion"]
# p is given but could have calculated from the enrollment/clicks
GC["n"]=baseline["Clicks"]
GC['sd'] = round(mt.sqrt((GC["p"]*(1-GC["p"]))/GC["n"]),4)
GC['sd']

0.0202

In [6]:
# Let's get the p and n we need for Retention(R)
# and compute the Standard Deviation (sd) rounded to 4 decimal digits.
R = {}
R['d_min'] = 0.01
R['p'] = baseline['Retention']
R['n'] = baseline['Enrollments']
R['sd'] = round(mt.sqrt((R["p"]*(1-R["p"]))/R["n"]),4)
R['sd']

0.0549

In [8]:
# Let's get the p and n we need for Net Conversion(NC)
# and compute the Standard Deviation (sd) rounded to 4 decimal digits.
NC = {}
NC['d_min'] = 0.0075
NC['p'] = baseline['NConversion']
NC['n'] = baseline['Clicks']
NC['sd'] = round(mt.sqrt((NC["p"]*(1-NC["p"]))/NC["n"]),4)
NC['sd']

0.0156

# 5 Experiment Sizing

Now we calculate the minimal number of samples we need to have enough statistical power (as well as significance) in our experiment.

Given $\alpha$ = 0.05 (significance) and $\beta$ = 0.2 (Power = 1-$\beta$), estimate total pageviews (cookies viewing page) we need in the experiment. 
- *p* is the baseline conversion rate (our *$\hat{p}$*)
- *d* is the detectable effect (*d* = D<sub>min</sub> )

**Hypothesis**:

**H<sub>0</sub>**: P<sub>cont</sub>  - P<sub>exp</sub> = 0 <br/>

**H<sub>A</sub>**: P<sub>cont</sub>  - P<sub>exp</sub> = *d*
    
<div>
<img src="attachment:image.png" width="300">
</div>

**What do we still need?**

- Z score for 1- $\alpha$/2 and for 1 - $\beta$
- sd<sub>1</sub> and sd<sub>2</sub> for baseline and expected changed rate, respectively.
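
Written out, the sample size per group computed by the functions below is as follows. This is a transcription of what `get_sds` and `get_sampSize` calculate, assuming the attached image shows the same expression:

$$
n = \frac{\left(Z_{1-\frac{\alpha}{2}}\, sd_1 + Z_{1-\beta}\, sd_2\right)^2}{d^2},
\qquad
sd_1 = \sqrt{2p(1-p)},
\qquad
sd_2 = \sqrt{p(1-p) + (p+d)\bigl(1-(p+d)\bigr)}
$$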

### 5.1 Get z-score critical value and Standard Deviations

In [10]:
def get_sds(p,d):
    sd1 = mt.sqrt(2*p*(1-p))
    sd2 = mt.sqrt(p*(1-p)+(p+d)*(1-(p+d)))
    x = [sd1, sd2]
    return x

In [16]:
# Inputs: Required alpha
# Outputs: z-score for given alpha

def get_z_score(alpha):
    return(norm.ppf(alpha))

In [17]:
# Inputs: sds - [sd1, sd2] as returned by get_sds (baseline and expected change), alpha, beta, d - d_min
# Returns: the minimum sample size required per group, in units of the metric's denominator
def get_sampSize(sds,alpha,beta,d):
    n=pow((get_z_score(1-alpha/2)*sds[0]+get_z_score(1-beta)*sds[1]),2)/pow(d,2)
    return n

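#### A note on ppf vs cdf

- `ppf(q, loc, scale)` takes a cumulative probability `q` (a percentile expressed as a fraction), a mean `loc`, and a standard deviation `scale`, and returns the value below which that fraction of the distribution lies. With the defaults `loc=0, scale=1`, the result is the number of SDs above or below the mean (the z-score).
- `cdf` is the inverse: it takes a value and returns the cumulative probability up to that value.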


<div>
<img src="attachment:image.png" width="500">
</div>

<div>
<img src="attachment:image-2.png" width="400">
</div>
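
A minimal sketch of the relationship between `ppf` and `cdf` for the standard normal distribution, using the same $\alpha$ = 0.05 and $\beta$ = 0.2 as above. The variable names are illustrative only.

```python
from scipy.stats import norm

alpha = 0.05
beta = 0.2

# ppf maps a cumulative probability to a z-score
z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
z_beta = norm.ppf(1 - beta)        # z-score tied to the desired power

# cdf maps a z-score back to the cumulative probability
prob_back = norm.cdf(z_alpha)

print(round(z_alpha, 4), round(z_beta, 4), round(prob_back, 4))
```

Applying `cdf` to the result of `ppf` returns the original probability, which is why `ppf` is often described as the inverse CDF.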


### 5.2 Calculate Sample Size per Metric

We now calculate the number of samples the experiment requires for each metric. The largest of these sample sizes determines the effective size of the experiment. This size should be weighed against duration and exposure: how long will it take to collect this many samples?

In [18]:
GC["d"]=0.01
R["d"]=0.01
NC["d"]=0.0075

In [20]:
# Get necessary sample size for Gross Conversion
GC["SampSize"]=round(get_sampSize(get_sds(GC["p"],GC["d"]),0.05,0.2,GC["d"]))
GC['SampSize']



25835

This means we need at least 25,835 cookies that click the Free Trial button, per group. Since we get 400 clicks out of 5,000 pageviews (400/5000 = 0.08), we need GC["SampSize"]/0.08 = 322,938 pageviews, again per group. Finally, the total number of pageviews required across both groups for the Gross Conversion metric is:

In [21]:
GC["SampSize"]=round(GC["SampSize"]/0.08*2)
GC["SampSize"]

645875

In [23]:
# Get necessary sample size for Retention
R["SampSize"]=round(get_sampSize(get_sds(R["p"],R["d"]),0.05,0.2,R["d"]))
R["SampSize"]

39087

This means that we need 39,087 enrolled users per group. We first convert this to cookies that clicked, then to cookies that viewed the page, and finally multiply by two to cover both groups.

In [24]:
R["SampSize"]=R["SampSize"]/0.08/0.20625*2
R["SampSize"]

4737818.181818182

This brings us to over 4.7 million pageviews in total. With about 40,000 pageviews per day, collecting them would take well over 100 days, which is impractical. We therefore drop this metric: an experiment of the much smaller size we can actually run would be underpowered to detect a change in Retention, so its results for this metric would not be reliable.

In [26]:
# Get necessary sample size Net Conversion
NC["SampSize"]=round(get_sampSize(get_sds(NC["p"],NC["d"]),0.05,0.2,NC["d"]))
NC["SampSize"]

27413

In [27]:
NC["SampSize"]=NC["SampSize"]/0.08*2
NC["SampSize"]

685325.0

We are all the way up to 685,325 cookies viewing the page. This is more than what was needed for Gross Conversion, so this will be our number. Assuming we divert 80% of each day's pageviews (about 40,000 per day) to the experiment, the data collection period (the period in which users are exposed to the experiment) will be about 3 weeks.
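
The duration estimates can be checked with a short sketch. The helper `days_needed` is a hypothetical name introduced here for illustration; it assumes a constant 40,000 pageviews per day, as in the baseline data.

```python
import math

def days_needed(total_pageviews, daily_pageviews=40000, traffic_fraction=1.0):
    """Whole days required to collect total_pageviews when only
    traffic_fraction of the daily pageviews is diverted to the experiment."""
    return math.ceil(total_pageviews / (daily_pageviews * traffic_fraction))

# Net Conversion: 685,325 pageviews at 80% of daily traffic
nc_days = days_needed(685325, traffic_fraction=0.8)

# Retention: about 4,737,818 pageviews, even at 100% of daily traffic
retention_days = days_needed(4737818)

print(nc_days, retention_days)
```

Under these assumptions Net Conversion needs roughly three weeks, while Retention would need close to four months even with all traffic diverted.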

### 5.3 Sample Size per Metric Summary

The number of samples each evaluation metric needs in order to reach the chosen significance level and statistical power is as follows:

**Cookies visiting site**:
- Gross Conversion: 645,875 
- Retention: 4,737,819
- Net Conversion: 685,325

**Conclusions**:
- If the site gets about 40,000 visits per day, Retention would need 100+ days to be included in this experiment.
- Choose to discard the Retention metric and use the Net Conversion metric sample size instead.

# 6 Analyzing Collected Data
### 6.1 Loading collected data

In [28]:
control = pd.read_csv('control_data.csv')
experiment = pd.read_csv('experiment_data.csv')
control.head()

| | Date | Pageviews | Clicks | Enrollments | Payments |
|---|---|---|---|---|---|
| 0 | Sat, Oct 11 | 7723 | 687 | 134.0 | 70.0 |
| 1 | Sun, Oct 12 | 9102 | 779 | 147.0 | 70.0 |
| 2 | Mon, Oct 13 | 10511 | 909 | 167.0 | 95.0 |
| 3 | Tue, Oct 14 | 9871 | 836 | 156.0 | 105.0 |
| 4 | Wed, Oct 15 | 10014 | 837 | 163.0 | 64.0 |


### 6.2 Sanity Checks

#### 6.2.1 Sanity Checks for differences between counts

- **Number of cookies who viewed the course overview page** - Check whether there is a statistically significant difference between the control and experiment counts

In [35]:
pageviews_cont = control['Pageviews'].sum()
pageviews_exp = experiment['Pageviews'].sum()
pageviews_total = pageviews_cont + pageviews_exp
print('Number of page views in control:', pageviews_cont)
print('Number of page views in experiment:', pageviews_exp)

Number of page views in control: 345543
Number of page views in experiment: 344660


**What we expect:**
- The number of pageviews in the control group to be about half (50%) of the total pageviews across both groups
- We can model this as a **binomial random variable** with N trials and probability p = 0.5

1. Use the **central limit theorem** to approximate this binomial distribution with a normal distribution

2. Test whether the observed $\hat{p}$ (number of samples in control divided by the total number in both groups) is not significantly different from p = 0.5 at the 95% confidence level. The **margin of error** is:

<p style="text-align: center;">$\hat{X}$ = p</p>
<p style="text-align: center;">SD = $\sqrt{\frac{p(1-p)}{N}}$</p>

<p style="text-align: center;">ME = Z<sub>1-$\frac{\alpha}{2}$</sub>SD</p>

Where $\alpha$ = 0.05, SD is the standard deviation, $\hat{X}$ is the sample mean, and N is the number of trials (here, the total count across both groups). The **confidence interval** gives the range within which an observed $\hat{p}$ is considered consistent with p:

<p style="text-align: center;">CI = [p-ME, p+ME]</p>


In [36]:
p = 0.5 # Population mean
alpha = 0.05 # 95% confidence
p_hat = round(pageviews_cont/pageviews_total,4) # Sample mean
sd = mt.sqrt(p*(1-p)/pageviews_total) #Population SD
ME = round(get_z_score(1-(alpha/2))*sd,4)
print ("The confidence interval is between",p-ME,"and",p+ME,"; Is",p_hat,"inside this range?")

The confidence interval is between 0.4988 and 0.5012 ; Is 0.5006 inside this range?


- **Number of cookies who clicked the Free Trial Button**

In [37]:
clicks_cont=control['Clicks'].sum()
clicks_exp=experiment['Clicks'].sum()
clicks_total=clicks_cont+clicks_exp

p_hat=round(clicks_cont/clicks_total,4)
sd=mt.sqrt(p*(1-p)/clicks_total)
ME=round(get_z_score(1-(alpha/2))*sd,4)
print ("The confidence interval is between",p-ME,"and",p+ME,"; Is",p_hat,"inside this range?")

The confidence interval is between 0.4959 and 0.5041 ; Is 0.5005 inside this range?
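
The two count checks above follow the same steps, so they could be wrapped in one helper. This is a sketch rather than part of the original analysis: `count_sanity_check` is a hypothetical name, and the example reuses the pageview totals printed earlier.

```python
import math
from scipy.stats import norm

def count_sanity_check(control_count, experiment_count, p=0.5, alpha=0.05):
    """Check whether the control share of a count is consistent with p.

    Returns (lower, upper, p_hat, passed), where [lower, upper] is the
    confidence interval around p and p_hat is the observed control share.
    """
    total = control_count + experiment_count
    p_hat = control_count / total
    sd = math.sqrt(p * (1 - p) / total)
    me = norm.ppf(1 - alpha / 2) * sd
    lower, upper = p - me, p + me
    return lower, upper, p_hat, lower <= p_hat <= upper

# Pageview totals from the control and experiment groups
lower, upper, p_hat, passed = count_sanity_check(345543, 344660)
print(round(lower, 4), round(upper, 4), round(p_hat, 4), passed)
```

Unlike the cells above, this version compares the unrounded values and only rounds for display, which avoids edge cases where rounding moves a value across the interval boundary.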


#### 6.2.2 Sanity Checks for differences between probabilities

- **Click-through-probability of the Free Trial Button** - Compare the difference between the experiment and control click-through-probabilities to zero, using the pooled click-through-probability of both groups to estimate the standard deviation.

In [38]:
ctp_cont=clicks_cont/pageviews_cont
ctp_exp=clicks_exp/pageviews_exp
d_hat=round(ctp_exp-ctp_cont,4)
p_pooled=clicks_total/pageviews_total
sd_pooled=mt.sqrt(p_pooled*(1-p_pooled)*(1/pageviews_cont+1/pageviews_exp))
ME=round(get_z_score(1-(alpha/2))*sd_pooled,4)
print ("The confidence interval is between",0-ME,"and",0+ME,"; Is",d_hat,"within this range?")

The confidence interval is between -0.0013 and 0.0013 ; Is 0.0001 within this range?


In [39]:
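# An alternative view of the check above: compare ctp_cont to the pooled CTP of control and experiment.
# Note: sd_pooled is the standard error of a difference, so this interval is wider than one for a single proportion.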

ctp_cont=clicks_cont/pageviews_cont
ctp_exp=clicks_exp/pageviews_exp
d_hat=round(ctp_exp-ctp_cont,4)
p_pooled=clicks_total/pageviews_total
sd_pooled=mt.sqrt(p_pooled*(1-p_pooled)*(1/pageviews_cont+1/pageviews_exp))

ME=round(get_z_score(1-(alpha/2))*sd_pooled,4)
print ("The confidence interval is between",p_pooled-ME,"and",p_pooled+ME,"; Is",ctp_cont,"within this range?")
       
       
       

The confidence interval is between 0.08085409089789526 and 0.08345409089789525 ; Is 0.08212581357457682 within this range?


# Appendix

### Central Limit Theorem
Use the central limit theorem to gauge when the distribution of sample means will be approximately normal.

**The mean and standard deviation of the sample mean become the following:**

- Mean of the sample mean: $\mu$<sub>$\hat{X}$</sub> = $\mu$
- Standard deviation of the sample mean (standard error): $\sigma$<sub>$\hat{X}$</sub> = $\frac{\sigma}{\sqrt{n}}$

- Applies to **normal populations**.
    - Sample size test:
        - n > 30
    - Translating mean and deviation
        - Population Mean: $\mu$
        - Population Standard deviation: $\sigma$
    
<div>
<img src="attachment:image-3.png" width="100">
</div>

- Applies to **binomial populations** (with yes/no characteristics).
    - Sample size test:
        - min(np, n(1-p))> 5
    - Translating mean and deviation
        - Population Mean: *$\mu$ = np*
        - Population Standard deviation: $\sigma$ = $\sqrt{np(1-p)}$

        
<div>
<img src="attachment:image-2.png" width="100">
</div>

- Applies to **Poisson populations** (counts of events in a fixed interval).
    - Sample size test:
        - n > 30
    - Translating mean and deviation
        - Population Mean: $\mu$
        - Population Standard deviation: $\sigma$ = $\sqrt{\mu}$

<div>
<img src="attachment:image-4.png" width="100">
</div>
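
As an illustration of the binomial case, the following sketch simulates many sample proportions and compares their spread to the $\sqrt{p(1-p)/n}$ formula used throughout this notebook. The values of `p`, `n`, and `n_samples` are arbitrary choices for the demonstration, not taken from the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

p = 0.5            # probability of "success" for each trial
n = 1000           # trials per sample
n_samples = 20000  # number of simulated samples

# Each entry is the proportion of successes in one sample of n trials
proportions = rng.binomial(n, p, size=n_samples) / n

theoretical_sd = np.sqrt(p * (1 - p) / n)
observed_sd = proportions.std(ddof=1)

print("mean of sample proportions:", proportions.mean())
print("theoretical SD:", theoretical_sd)
print("observed SD:", observed_sd)
```

With this many simulated samples, the observed spread should land very close to the theoretical value, and a histogram of `proportions` would look approximately normal.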


### Difference between Poisson and Normal Distribution
- Poisson is discrete, Normal is continuous
- Poisson is bounded below at 0, Normal ranges from -infinity to +infinity
- Poisson doesn't HAVE to be symmetric and can be skewed, Normal is always symmetric
- Poisson variance = mean, while for Normal the mean and variance are separate parameters
- Poisson with large mean value can sometimes be approximated as a normal distribution
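
The last point can be checked numerically. The sketch below measures the largest gap between a Poisson CDF and the CDF of a Normal with the same mean and variance, using a continuity correction of 0.5. The helper `max_cdf_gap` and the two means (2 and 200) are illustrative choices, not part of the original analysis.

```python
import math
import numpy as np
from scipy.stats import norm, poisson

def max_cdf_gap(mu):
    """Largest absolute difference between the Poisson(mu) CDF and the
    CDF of Normal(mu, sqrt(mu)), evaluated with a continuity correction."""
    upper = int(mu + 10 * math.sqrt(mu)) + 1
    ks = np.arange(0, upper + 1)
    poisson_cdf = poisson.cdf(ks, mu)
    normal_cdf = norm.cdf(ks + 0.5, loc=mu, scale=math.sqrt(mu))
    return float(np.max(np.abs(poisson_cdf - normal_cdf)))

gap_small_mean = max_cdf_gap(2)
gap_large_mean = max_cdf_gap(200)

print("max CDF gap for mean 2:  ", gap_small_mean)
print("max CDF gap for mean 200:", gap_large_mean)
```

The gap shrinks as the mean grows, which is why the normal approximation is usually reserved for Poisson variables with a large mean.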

