In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math

### Metric Choice

- Number of cookies: That is, number of unique cookies to view the course overview page. (dmin=3000)
- Number of user-ids: That is, number of users who enroll in the free trial. (dmin=50)
- Number of clicks: That is, number of unique cookies to click the "Start free trial" button (which happens before the free trial screener is triggered). (dmin=240)
- Click-through-probability: That is, number of unique cookies to click the "Start free trial" button divided by number of unique cookies to view the course overview page. (dmin=0.01)
- Gross conversion: That is, number of user-ids to complete checkout and enroll in the free trial divided by number of unique cookies to click the "Start free trial" button. (dmin= 0.01)
- Retention: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by number of user-ids to complete checkout. (dmin=0.01)
- Net conversion: That is, number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the "Start free trial" button. (dmin= 0.0075)

#### Invariant metrics

- Number of cookies 
- Number of clicks 
- Click-through probability 

####  Evaluation metrics

- Gross conversion
- Retention
- Net conversion

#### The goals of the experiment in terms of our metrics:

- The gross conversion should significantly decrease.
- The retention should significantly increase.
- The net conversion should not decrease.


### Measuring Variability

Final Project Baseline Values:

- Unique cookies to view course overview page per day:	40000
- Unique cookies to click "Start free trial" per day:	3200
- Enrollments per day:	660
- Click-through-probability on "Start free trial":	0.08
- Probability of enrolling, given click:	0.20625
- Probability of payment, given enroll:	0.53
- Probability of payment, given click:	0.1093125
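
The baseline values above can be turned into an analytic estimate of each evaluation metric's variability. The sketch below assumes each metric is binomially distributed and, as an illustrative choice that is not stated in this notebook, uses one day of baseline traffic as the sample. `binomial_se` is a hypothetical helper name.

```python
import math

def binomial_se(p, n):
    """Analytic standard error of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)

# Assumption: one day of baseline traffic is used as the sample.
clicks_per_day = 3200        # denominator for gross and net conversion
enrollments_per_day = 660    # denominator for retention

se_gross = binomial_se(0.20625, clicks_per_day)
se_retention = binomial_se(0.53, enrollments_per_day)
se_net = binomial_se(0.1093125, clicks_per_day)

print("Gross conversion SE:", round(se_gross, 4))
print("Retention SE:", round(se_retention, 4))
print("Net conversion SE:", round(se_net, 4))
```

Retention has the largest standard error because its denominator (enrollments) is much smaller than the number of clicks, which is consistent with it needing far more pageviews in the sizing step.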

### Sizing

Statistical power = 80% \
Significance level = 5%

(Notes: I have used the online calculator https://www.evanmiller.org/ab-testing/sample-size.html)
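
For readers who prefer to reproduce the sample sizes in code, here is a minimal sketch of a standard closed-form approximation for a two-sided, two-proportion test. It is not taken from the calculator's source, so small differences from the calculator's figures are possible. The function name, and the choice to mirror baselines above 0.5 toward 0.5 (the more conservative direction), are assumptions.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, dmin, alpha=0.05, power=0.80):
    """Approximate units needed per group to detect an absolute change of dmin."""
    # Assumption: mirror the baseline so the shifted rate moves toward 0.5,
    # where the binomial variance is largest.
    p = min(baseline, 1 - baseline)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    sd_null = math.sqrt(2 * p * (1 - p))
    sd_alt = math.sqrt(p * (1 - p) + (p + dmin) * (1 - p - dmin))
    return round((z_alpha * sd_null + z_beta * sd_alt) ** 2 / dmin ** 2)

n_gross = sample_size_per_variation(0.20625, 0.01)
n_retention = sample_size_per_variation(0.53, 0.01)
n_net = sample_size_per_variation(0.1093125, 0.0075)
print(n_gross, n_retention, n_net)
```

The results should land close to the per-variation figures quoted in the subsections that follow.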


#### Gross conversion

- Baseline conversion rate: 0.20625
- Minimal detectable effect: 0.01
- 25,835 clicks per variation
- Number of pageviews needed for gross conversion: 2 × 25835 × 40000 ÷ 3200 =  645875.0

#### Retention

- Baseline conversion rate: 0.53
- Minimal detectable effect: 0.01
- 39,115 enrollments per variation
- Number of pageviews needed for retention: 2 × 39115 × 40000 ÷ 660 =  4741212.12121

#### Net conversion

- Baseline conversion rate: 0.1093125
- Minimal detectable effect: 0.0075
- 27,413 clicks per variation
- Number of pageviews needed for net conversion: 2 × 27413 × 40000 ÷ 3200 =  685325.0
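
The three pageview conversions above follow one pattern: double the per-variation sample (two groups), then scale by the baseline number of pageviews per unit of analysis. A small sketch, with `pageviews_needed` as a hypothetical helper name:

```python
def pageviews_needed(n_per_variation, units_per_day, pageviews_per_day=40000):
    """Convert a per-variation sample size into total pageviews for both groups.

    units_per_day is the baseline daily count of the metric's denominator
    (clicks for gross/net conversion, enrollments for retention).
    """
    return 2 * n_per_variation * pageviews_per_day / units_per_day

pv_gross = pageviews_needed(25835, 3200)
pv_retention = pageviews_needed(39115, 660)
pv_net = pageviews_needed(27413, 3200)

print(pv_gross, round(pv_retention, 2), pv_net)
```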


### Duration vs. Exposure

- Duration to test three metrics: 4741212 / 40000 ≈ 118.5, so 119 days

- Duration to test two metrics (gross conversion and net conversion): 685325 / 40000 ≈ 17.1, so 18 days

Testing all three metrics takes 119 days even at 100% traffic, which is too long. Dropping retention and testing only gross conversion and net conversion needs 18 days at full traffic, but that gives a result too quickly: user behavior is hard to observe reliably over a window of only 2-3 weeks. To lengthen the window somewhat, we can divert 50% of traffic (fraction = 0.5), which gives 685325 / 20000 ≈ 34.3, so 35 days.
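
The duration arithmetic can be sketched as a small function. It rounds up to whole days, since a partial day still has to be run in full, so its results can be one day larger than a truncated division. `days_needed` is a hypothetical helper name.

```python
import math

def days_needed(pageviews, daily_pageviews=40000, fraction=1.0):
    """Whole days required to collect `pageviews` when `fraction` of traffic is diverted."""
    return math.ceil(pageviews / (daily_pageviews * fraction))

days_all_metrics = days_needed(4741212)                 # three metrics, full traffic
days_two_metrics = days_needed(685325)                  # two metrics, full traffic
days_two_metrics_half = days_needed(685325, fraction=0.5)

print(days_all_metrics, days_two_metrics, days_two_metrics_half)
```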

In [43]:
# Data 'Free Trial Screener' is obtained from the Udacity A/B Testing Final Project
# Read in control and experimental groups separately 

control_data = pd.read_csv('/Users/olivia/Desktop/AB Testing/Final_Project_Results_Control.csv')
experiment_data = pd.read_csv('/Users/olivia/Desktop/AB Testing/Final_Project_Results_Experiment.csv')
# Keep the first 23 days, copied so that later edits do not touch the full data
control_data2 = control_data[:23].copy()
experiment_data2 = experiment_data[:23].copy()

In [3]:
control_data2.head(3)

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723.0,687.0,134.0,70.0
1,"Sun, Oct 12",9102.0,779.0,147.0,70.0
2,"Mon, Oct 13",10511.0,909.0,167.0,95.0


In [4]:
experiment_data2.head(3)

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716.0,686.0,105.0,34.0
1,"Sun, Oct 12",9288.0,785.0,116.0,91.0
2,"Mon, Oct 13",10480.0,884.0,145.0,79.0


#### Descriptions:

- Pageviews: Number of unique cookies to view the course overview page that day.
- Clicks: Number of unique cookies to click the "Start free trial" button that day.
- Enrollments: Number of user-ids to enroll in the free trial that day.
- Payments: Number of user-ids who enrolled on that day and remained enrolled for 14 days, and thus made a payment.

(Note that the date for this column is the start date, that is, the date of enrollment, rather than the date of the payment. The payment happened 14 days later. Because of this, the enrollments and payments are tracked for 14 fewer days than the other columns.)

In [5]:
# Combine the actual data with simulated Gaussian noise

def add_noise(data):
    # Work on a copy so the caller's DataFrame (or slice) is not modified in place
    data = data.copy()
    n = len(data)  # one noise draw per row instead of a hard-coded 23
    data['Pageviews'] = data['Pageviews'] + np.random.normal(0, 100, n).astype(int)
    data['Clicks'] = data['Clicks'] + np.random.normal(0, 50, n).astype(int)
    data['Enrollments'] = data['Enrollments'] + np.random.normal(0, 10, n).astype(int)
    data['Payments'] = data['Payments'] + np.random.normal(0, 5, n).astype(int)
    return data

control_data2 = add_noise(control_data2)
experiment_data2 = add_noise(experiment_data2)

In [6]:
print('Control data with random noise:')
control_data2.head(3)

Control data with random noise:


Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7781.0,705.0,140.0,69.0
1,"Sun, Oct 12",9192.0,651.0,156.0,71.0
2,"Mon, Oct 13",10422.0,926.0,176.0,96.0


In [7]:
print('Experiment data with random noise:')
experiment_data2.head(3)

Experiment data with random noise:


Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7738.0,607.0,100.0,39.0
1,"Sun, Oct 12",9606.0,828.0,109.0,92.0
2,"Mon, Oct 13",10562.0,959.0,149.0,75.0


### First step, sanity check.

Before analyzing the results, the first step is a sanity check: verify that the invariant metrics have not changed between the groups. If a sanity check fails, do not proceed. Instead, investigate why it failed, either through (1) a retrospective analysis or (2) checking whether there is a learning effect.

In [8]:
Pageviews_cont = control_data['Pageviews'].sum()
Pageviews_exp = experiment_data['Pageviews'].sum()

Clicks_cont = control_data['Clicks'].sum()
Clicks_exp = experiment_data['Clicks'].sum()


print ("Control group:")
print ("Clicks = ", Clicks_cont, "   ", \
      "Pageviews = ", Pageviews_cont)

print ("Experimental group:")
print ("Clicks = ", Clicks_exp, "   ", \
      "Pageviews = ", Pageviews_exp)

Control group:
Clicks =  28494.0     Pageviews =  345439.0
Experimental group:
Clicks =  28239.0     Pageviews =  345196.0


#### Sanity Checks -> Cookies 

In [9]:
p = 0.5 

# Binomial standard error; p * (1 - p) equals p * p only because p = 0.5
se_cookies = math.sqrt(p * (1 - p) / (Pageviews_cont + Pageviews_exp))
m_cookies = se_cookies * 1.96
(lb_cookies, ub_cookies) = (p - m_cookies, p + m_cookies)
p_hat = Pageviews_cont/(Pageviews_cont + Pageviews_exp)
print ("Sanity Test (Cookies):", u'p\u0302', "=",p_hat , u'\u2208', (lb_cookies, ub_cookies), u'\u2713')

Sanity Test (Cookies): p̂ = 0.5001759250544788 ∈ (0.4988207611357565, 0.5011792388642435) ✓


#### Sanity Checks -> Clicks

In [10]:
# Binomial standard error; p * (1 - p) equals p * p only because p = 0.5
se_clicks = math.sqrt(p * (1 - p) / (Clicks_cont + Clicks_exp))
m_clicks = se_clicks * 1.96
(lb_clicks, ub_clicks) = (p - m_clicks, p + m_clicks)
p_hat = Clicks_cont/(Clicks_cont + Clicks_exp)

print ("Sanity Test (Clicks):", u'p\u0302', "=", p_hat, u'\u2208', (lb_clicks, ub_clicks), u'\u2713')

Sanity Test (Clicks): p̂ = 0.5022473692559886 ∈ (0.4958855839921207, 0.5041144160078793) ✓
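
The two count checks above print a check mark unconditionally. As a hedged alternative, the sketch below computes the pass/fail result instead of hard-coding it. `sanity_check_split` is a hypothetical helper name, and the totals are copied from the output of the cell that sums pageviews and clicks.

```python
import math

def sanity_check_split(count_cont, count_exp, p=0.5, z=1.96):
    """Check that the control share of a count is consistent with an expected split p."""
    total = count_cont + count_exp
    se = math.sqrt(p * (1 - p) / total)
    margin = z * se
    lower, upper = p - margin, p + margin
    p_hat = count_cont / total
    return p_hat, (lower, upper), lower <= p_hat <= upper

cookies_hat, cookies_ci, cookies_ok = sanity_check_split(345439, 345196)
clicks_hat, clicks_ci, clicks_ok = sanity_check_split(28494, 28239)

print("Cookies:", cookies_hat, cookies_ci, "pass" if cookies_ok else "fail")
print("Clicks:", clicks_hat, clicks_ci, "pass" if clicks_ok else "fail")
```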


#### Sanity Checks -> Click-Through-Probability (CTP)

In [11]:
p_pool = 1.0 * (Clicks_cont + Clicks_exp) / (Pageviews_cont + Pageviews_exp)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / Pageviews_cont + 1.0 / Pageviews_exp))
m_pool = se_pool * 1.96
d_hat = Clicks_cont / Pageviews_cont - Clicks_exp / Pageviews_exp
(lb_pool, ub_pool) = (0 - m_pool, 0 + m_pool)
print ("Sanity Test (CTP):", u'd\u0302', "=", d_hat, u'\u2208', (lb_pool, ub_pool), u'\u2713')

Sanity Test (CTP): d̂ = 0.0006806446729920174 ∈ (-0.0012952158606468556, 0.0012952158606468556) ✓


### Results Analysis

In [12]:
Clicks_cont = control_data2['Clicks'].sum()
Clicks_exp = experiment_data2['Clicks'].sum()

Enrollments_cont = control_data2['Enrollments'].sum()
Enrollments_exp = experiment_data2['Enrollments'].sum()

Payments_cont = control_data2['Payments'].sum()
Payments_exp = experiment_data2['Payments'].sum()

print ("Control group:")
print ("Clicks = ", Clicks_cont, "   ", \
"Enrollments = ", Enrollments_cont, "   ", \
"Payments = ", Payments_cont)


print ("Experimental group:")
print ("Clicks = ", Clicks_exp, "   ", \
"Enrollments = ", Enrollments_exp, "   ", \
"Payments = ", Payments_exp)

Control group:
Clicks =  17409.0     Enrollments =  3804.0     Payments =  2021.0
Experimental group:
Clicks =  17174.0     Enrollments =  3347.0     Payments =  1971.0


#### Result Analysis -> Gross Conversion

In [22]:
p_pool = 1.0 * (Enrollments_cont + Enrollments_exp) / (Clicks_cont+Clicks_exp)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / Clicks_cont + 1.0 / Clicks_exp))
m_pool = se_pool * 1.96
d = Enrollments_cont / Clicks_cont - Enrollments_exp / Clicks_exp
(lb_pool, ub_pool) = (d - m_pool, d + m_pool)

print (0, u'\u2209', (lb_pool, ub_pool))
print ((-0.01, 0, 0.01), u'\u2284', (lb_pool, ub_pool))
print ("Statistical significance", u'\u2713', "  Practical significance ", u'\u2713')

0 ∉ (0.015082871873568143, 0.032157223376795344)
(-0.01, 0, 0.01) ⊄ (0.015082871873568143, 0.032157223376795344)
Statistical significance ✓   Practical significance  ✓
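
The pooled-standard-error interval used in this section can be wrapped in a function that also derives the two significance flags rather than printing them by hand. This is a sketch under two assumptions: the difference is control minus experiment, as in the cell above, and "practically significant" is read as the whole interval lying beyond ±dmin. `diff_ci` is a hypothetical name, and the counts are copied from the printed totals.

```python
import math

def diff_ci(x_cont, n_cont, x_exp, n_exp, dmin, z=1.96):
    """95% CI for (control rate - experiment rate) using a pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d = x_cont / n_cont - x_exp / n_exp
    lower, upper = d - z * se_pool, d + z * se_pool
    statistically_significant = not (lower <= 0 <= upper)
    # Assumption: practical significance means the interval clears dmin entirely.
    practically_significant = lower > dmin or upper < -dmin
    return (lower, upper), statistically_significant, practically_significant

# Enrollments and clicks per group, from the totals printed earlier
gross_ci, gross_stat, gross_prac = diff_ci(3804, 17409, 3347, 17174, dmin=0.01)
print(gross_ci, gross_stat, gross_prac)
```

The same function applies to net conversion (payments over clicks, dmin = 0.0075) and retention (payments over enrollments, dmin = 0.01) by swapping the counts.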


#### Result Analysis -> Net Conversion

In [44]:
p_pool = 1.0 * (Payments_cont + Payments_exp) / (Clicks_cont+Clicks_exp)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / Clicks_cont + 1.0 / Clicks_exp))
m_pool = se_pool * 1.96
d = Payments_cont / Clicks_cont - Payments_exp / Clicks_exp
(lb_pool, ub_pool) = (d - m_pool, d + m_pool)

print (0, u'\u2208', (lb_pool, ub_pool))
print (0.0075, u'\u2208', (lb_pool, ub_pool))
print ("Statistical significance", u'\u2718', "  Practical significance ", u'\u2718')

0 ∈ (-0.005413006265008506, 0.008058749355919152)
0.0075 ∈ (-0.005413006265008506, 0.008058749355919152)
Statistical significance ✘   Practical significance  ✘


### Sign Tests

#### Binomial Test: 
$ Pr(X=k)={\binom {n}{k}}p^{k}(1-p)^{n-k} $

(Note: I used QuickCalcs, https://www.graphpad.com/quickcalcs/binomial1/, for the sign and binomial tests.)
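
The two-tailed P value can also be computed directly from the binomial formula above with p = 0.5, which avoids copying numbers from an external calculator. A minimal sketch using only the standard library; `sign_test_p_value` is a hypothetical helper name.

```python
from math import comb

def sign_test_p_value(successes, trials):
    """Two-tailed sign-test P value under the null hypothesis p = 0.5."""
    k = min(successes, trials - successes)          # size of the smaller tail
    tail = sum(comb(trials, i) for i in range(k + 1)) / 2 ** trials
    return min(1.0, 2 * tail)                       # symmetric, so double one tail

p_6_of_23 = sign_test_p_value(6, 23)
p_9_of_23 = sign_test_p_value(9, 23)
p_14_of_23 = sign_test_p_value(14, 23)

print(round(p_6_of_23, 4), round(p_9_of_23, 4), round(p_14_of_23, 4))
```

Because the noise added to the data is random, the success counts can change between runs, so computing the P value from the observed count is safer than hard-coding it.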

#### Sign test -> Gross conversion & Net conversion

In [27]:
# Supporting calculation of successful events for the evaluation metrics.
# A "success" is a day on which the experiment rate exceeds the control rate.
# The P values are hard-coded from QuickCalcs for the counts shown in the output.

print ("Sign test:")

Gross_conversion_success = experiment_data2['Enrollments']/experiment_data2['Clicks'] \
>control_data2['Enrollments']/control_data2['Clicks']
k, n = Gross_conversion_success.sum(), Gross_conversion_success.size
print (" Gross conversion: success =", k, "  total =", n)
print(' The chance of observing', min(k, n - k), 'or fewer successes, or', max(k, n - k),
      'or more successes, in', n, 'trials is:\n Two-tailed P value: 0.0347')

Net_conversion_success = experiment_data2['Payments']/experiment_data2['Clicks'] \
>control_data2['Payments']/control_data2['Clicks']
k, n = Net_conversion_success.sum(), Net_conversion_success.size
print (" Net conversion: success =", k, "  total =", n)
print(' The chance of observing', min(k, n - k), 'or fewer successes, or', max(k, n - k),
      'or more successes, in', n, 'trials is:\n Two-tailed P value: 0.4049')

Sign test:
 Gross conversion: success = 6   total = 23
 The chance of observing 6 or fewer successes, or 17 or more successes, in 23 trials is:
 Two-tailed P value: 0.0347
 Net conversion: success = 9   total = 23
 The chance of observing 9 or fewer successes, or 14 or more successes, in 23 trials is:
 Two-tailed P value: 0.4049


### Follow-Up Analysis

#### Result Analysis -> Retention

In [24]:
# Analyze for evaluation metrics "Retention" on available data

p_pool = 1.0 * (Payments_cont + Payments_exp) / (Enrollments_cont + Enrollments_exp)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1.0 / Enrollments_cont + 1.0 / Enrollments_exp))
m_pool = se_pool * 1.96
d = Payments_cont / Enrollments_cont - Payments_exp / Enrollments_exp
(lb_pool, ub_pool) = (d - m_pool, d + m_pool)

print (0, u'\u2209', (lb_pool, ub_pool))
print ((-0.01, 0, 0.01), u'\u2284', (lb_pool, ub_pool))
print ("Statistical significance", u'\u2713', "  Practical significance ", u'\u2713')

0 ∉ (-0.08066989681212079, -0.034535521226287544)
(-0.01, 0, 0.01) ⊄ (-0.08066989681212079, -0.034535521226287544)
Statistical significance ✓   Practical significance  ✓


#### Sign test -> Retention

In [28]:
print ("Sign test:")
Retention_success = experiment_data2['Payments']/experiment_data2['Enrollments'] \
>control_data2['Payments']/control_data2['Enrollments']
print (" success =", Retention_success.sum(), "  total =", Retention_success.size)
print(' The chance of observing',Retention_success.sum() , 
      'or more successes, or ',23-Retention_success.sum(),
      ' or fewer successes, in 23 trials is:\n Two-tailed P value: 0.4049 ')

Sign test:
 success = 14   total = 23
 The chance of observing 14 or more successes, or  9  or fewer successes, in 23 trials is:
 Two-tailed P value: 0.4049 


### Summary

#### Results Analysis

- Gross Conversion Difference (control minus experiment, as for all intervals in this list): 

95% CI: (0.0151, 0.0322) - ✅ Statistical significance, ✅ Practical significance

- Retention Difference:

95% CI: (-0.0807, -0.0345) - ✅ Statistical significance, ✅ Practical significance

- Net Conversion Difference: 

95% CI: (-0.0054, 0.0081) - ❌ Statistical significance, ❌ Practical significance


#### Sign Tests

- Gross Conversion : 

Success = 6   Total = 23 Two-tailed P value: 0.0347 - ✅ Statistical significance

- Retention:

Success = 14   Total = 23 Two-tailed P value: 0.4049 - ❌ Statistical significance

- Net Conversion: 

Success = 9   Total = 23 Two-tailed P value: 0.4049 - ❌ Statistical significance


In [32]:
Gross_conversion_diff = experiment_data2['Enrollments']/experiment_data2['Clicks'] - \
control_data2['Enrollments']/control_data2['Clicks']
print ("Gross conversion difference")
print ("Median: ", Gross_conversion_diff.median())
print ("Mean: ", Gross_conversion_diff.mean())

Retention_diff = experiment_data2['Payments']/experiment_data2['Enrollments'] - \
control_data2['Payments']/control_data2['Enrollments']
print ("Retention difference")
print ("Median: ", Retention_diff.median())
print ("Mean: ", Retention_diff.mean())

Net_conversion_diff = experiment_data2['Payments']/experiment_data2['Clicks'] - \
control_data2['Payments']/control_data2['Clicks']
print ("Net conversion difference")
print ("Median: ", Net_conversion_diff.median())
print ("Mean: ", Net_conversion_diff.mean())

Gross conversion difference
Median:  -0.0336617699528024
Mean:  -0.026018093753032923
Retention difference
Median:  0.01557489693082914
Mean:  0.0706052627681826
Net conversion difference
Median:  -0.003957843187037013
Mean:  -0.0006048057503616266



Effect size tests: the statistical significance of the difference between the control and experimental groups is assessed using the **mean**. \
Sign tests: the null hypothesis is that the **median** of the daily differences is zero.

Note that the daily differences above are computed as experiment minus control, the opposite sign from the confidence intervals.

For retention, the mean and median of the daily differences differ considerably, which is why the two tests give different results.

### Recommendation

Launching the change is not recommended, for the following reasons: 

- Gross conversion: The gross conversion difference (experiment minus control) is practically significant and negative, which indicates the team could risk a drop in revenue. 

- Retention: The retention difference is practically significant and positive. However, the sign test does not agree with the confidence interval for the difference. In addition, we have not gathered enough data to draw conclusions about retention, so the difference between the control and experimental groups is not convincing. 

- Net conversion: No significant difference in net conversion was observed, and the 95% confidence interval includes negative values, which is another sign of potential financial losses.
