# A/B Testing

In [13]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt


control = pd.read_csv("dataset/control_data.csv")
experiment = pd.read_csv("dataset/experiment_data.csv")

### Free-Trial Experiment

The experiment concerns enrollment in a Udacity course, with the overall business goal of maximizing course completion by students.

[Dataset Here](https://www.kaggle.com/tammyrotem/ab-tests-with-python?scriptVersionId=2109873)

#### Conditions Before Change
Visitors had two options:
* "start free trial": asks for credit card information; the student can cancel within the 14-day trial without being charged.
* "access course materials": free.

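#### Conditions After Change
Students who clicked "start free trial" were asked how much time they had to devote to the course:
* 5 or more hours per week: usual checkout process.
* fewer than 5 hours per week: a message saying the course usually needs a greater time commitment, with a suggestion to access the course materials for free.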

#### Hypothesis
The change should reduce the number of frustrated students who quit during the first 14 days, without significantly reducing the number of students who continue past 14 days and finish the course.

### Metric
We choose two kinds of metrics:
* Invariant: a metric that should not differ between the control and experiment groups, used as a sanity check.
* Evaluation: a metric used to test the outcome of the experiment.

<img src="https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/styles/simple_image/public/images/adwords-industry-benchmarks-average-ctr.jpg" alt="Drawing" style="width: 700px; height: 500px;"/>

<img src="https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/styles/simple_image/public/images/adwords-industry-benchmarks-average-conversion-rate.jpg" alt="Drawing" style="width: 600px; height: 700px;"/>

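#### Invariant Metrics
* number of cookies: unique daily cookies viewing the course overview page (Ck, e.g. 3000)
* number of clicks on free trial: unique daily cookies that clicked the free trial button (Cl, e.g. 240)
* free trial button click-through probability (CTP): Cl/Ck, e.g. 240/3000 = 0.08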

#### Evaluation Metrics
* gross conversion: enrolled/Cl
* retention: paid/enrolled
* net conversion: paid/Cl
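
The ratios listed above can be sketched as a small helper. This is only an illustration: the function name `funnel_metrics` is hypothetical, the cookie and click counts reuse the example figures from the invariant metrics list, and the enrollment and payment counts are made-up numbers, not data from the experiment.

```python
def funnel_metrics(cookies, clicks, enrollments, payments):
    """Return the click-through probability and the three evaluation metrics."""
    return {
        "CTP": clicks / cookies,                  # Cl / Ck
        "GConversion": enrollments / clicks,      # enrolled / Cl
        "Retention": payments / enrollments,      # paid / enrolled
        "NConversion": payments / clicks,         # paid / Cl
    }


# Hypothetical daily counts, for illustration only
example = funnel_metrics(cookies=3000, clicks=240, enrollments=50, payments=25)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Note that net conversion is the product of gross conversion and retention, since the enrollment count cancels out.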

#### Baseline

| Item | Description  | Baseline estimate  |
|:-:|:-:|:-:|
| Number of cookies | Daily unique cookies to view course overview page  | 40,000  |
| Number of clicks | Daily unique cookies to click Free Trial button  | 3,200 |
| Number of enrollments | Free Trial enrollments per day  | 660  |
| CTP | Click-through probability on Free Trial button  | 0.08  |
| Gross Conversion | Probability of enrolling, given a click  | 0.20625  |
| Retention | Probability of payment, given enrollment  | 0.53  |
| Net Conversion | Probability of payment, given click  | 0.109313 |
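
As a quick sanity check, the rates in the table can be cross-checked against the counts. The sketch below copies the table values into local variables (the variable names are chosen here for illustration) and assumes net conversion equals gross conversion times retention, which follows from the metric definitions.

```python
import math

# Counts and rates copied from the baseline table
cookies, clicks, enrollments = 40000, 3200, 660
ctp, gross_conversion, retention, net_conversion = 0.08, 0.20625, 0.53, 0.109313

derived_ctp = clicks / cookies              # clicks per cookie
derived_gross = enrollments / clicks        # enrollments per click
derived_net = gross_conversion * retention  # payments per click

checks = {
    "CTP": math.isclose(derived_ctp, ctp),
    "GConversion": math.isclose(derived_gross, gross_conversion),
    # The table rounds net conversion to six decimals, so allow a small tolerance
    "NConversion": math.isclose(derived_net, net_conversion, abs_tol=1e-6),
}
print(checks)
```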

In [14]:
# Place these estimates into a dictionary for ease of use later
baseline = {"Cookies":40000,"Clicks":3200,"Enrollments":660,"CTP":0.08,"GConversion":0.20625,
           "Retention":0.53,"NConversion":0.109313}

Having collected these estimates, we next estimate the standard deviation of each metric, which is needed for sample size calculations and for confidence intervals on the results. The higher a metric's variance, the harder it is to reach a significant result.

In [15]:
# scale the count metrics from 40,000 daily cookies down to a sample of 5,000 cookies

baseline['Cookies'] = 5000
baseline['Clicks'] = baseline['Clicks'] * 5000/40000
baseline['Enrollments'] = baseline['Enrollments'] * 5000/40000
baseline

{'Cookies': 5000,
 'Clicks': 400.0,
 'Enrollments': 82.5,
 'CTP': 0.08,
 'GConversion': 0.20625,
 'Retention': 0.53,
 'NConversion': 0.109313}
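
To see why the sample size matters, the following sketch compares the binomial standard deviation of gross conversion at 5,000 and at 40,000 cookies. The helper name `gross_conversion_sd` is hypothetical, and the calculation assumes the baseline click-through probability (0.08) and gross conversion (0.20625) hold at both sizes.

```python
import math


def gross_conversion_sd(cookies, ctp=0.08, p=0.20625):
    """Binomial SD of gross conversion for a given number of cookies."""
    clicks = cookies * ctp  # expected clicks, the denominator of the metric
    return math.sqrt(p * (1 - p) / clicks)


sd_5000 = gross_conversion_sd(5000)
sd_40000 = gross_conversion_sd(40000)

print(f"SD with  5,000 cookies: {sd_5000:.4f}")
print(f"SD with 40,000 cookies: {sd_40000:.4f}")
print(f"ratio: {sd_5000 / sd_40000:.3f}")
```

Because the standard deviation scales with one over the square root of the sample size, an eightfold smaller sample gives a standard deviation about 2.83 times larger.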

To estimate variance analytically, we can assume that metrics which are probabilities ($\hat{p}$) are binomially distributed.

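
Under that assumption, the standard deviation of an estimated proportion follows from the binomial variance. With $n$ the number of units in the metric's denominator (clicks or enrollments, depending on the metric), the quantity computed in the code cells of this section is:

$$
SD(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
$$
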
#### Gross Conversion
The baseline probability for gross conversion is the number of users who enroll in the free trial divided by the number of cookies that click the free trial button. In other words, the probability of enrollment given a click.

In [16]:
GC = {}
GC['d_min'] = 0.01
GC['p'] = baseline['GConversion']
#p is given in this case - or we could calculate it from enrollments/clicks
GC["n"] = baseline["Clicks"]
GC['sd'] = round(np.sqrt((GC["p"]*(1-GC["p"]))/GC["n"]), 4)
GC['sd']

0.0202
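
One way to build confidence in the analytic value is to compare it with a simulation. The sketch below is not part of the original analysis: it draws many binomial samples with the baseline gross conversion and the scaled click count, then measures the spread of the simulated rates. The seed and the number of simulated days are arbitrary choices.

```python
import numpy as np

p, n = 0.20625, 400  # baseline gross conversion and scaled daily clicks

# Analytic binomial standard deviation of the proportion
analytic_sd = np.sqrt(p * (1 - p) / n)

# Simulate many days of n clicks each and compute the enrollment rate per day
rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
simulated_rates = rng.binomial(n=n, p=p, size=100_000) / n
empirical_sd = simulated_rates.std(ddof=1)

print(f"analytic:  {analytic_sd:.4f}")
print(f"empirical: {empirical_sd:.4f}")
```

The two values should agree closely when the binomial assumption holds.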

#### Retention
The baseline probability for retention is the number of paying users (those still enrolled after the 14-day free trial) divided by the total number of enrolled users. In other words, the probability of payment, given enrollment.

In [19]:
R = {}
R['d_min'] = 0.01
R['p'] = baseline['Retention']
#p is given in this case - or we could calculate it from paid/enrollments
R["n"] = baseline["Enrollments"]
R['sd'] = round(np.sqrt((R["p"]*(1-R["p"]))/R["n"]), 4)
R['sd']

0.0549

#### Net Conversion
The baseline probability for the net conversion is the number of paying users divided by the number of cookies that clicked the free trial button. In other words, the probability of payment, given a click.

In [20]:
NC = {}
NC['d_min'] = 0.01
NC['p'] = baseline['NConversion']
#p is given in this case - or we could calculate it from paid/clicked
NC["n"] = baseline["Clicks"]
NC['sd'] = round(np.sqrt((NC["p"]*(1-NC["p"]))/NC["n"]), 4)
NC['sd']

0.0156
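
Since the three cells above repeat the same formula, it could be factored into a helper. The following is a minimal sketch, with the hypothetical name `binomial_sd`, that reproduces the three standard deviations from the scaled baseline values.

```python
import math


def binomial_sd(p, n, ndigits=4):
    """Rounded standard deviation of a proportion p estimated from n units."""
    return round(math.sqrt(p * (1 - p) / n), ndigits)


# (probability, denominator) pairs taken from the scaled baseline
metrics = {
    "GConversion": (0.20625, 400.0),   # enrollments per click
    "Retention": (0.53, 82.5),         # payments per enrollment
    "NConversion": (0.109313, 400.0),  # payments per click
}

sds = {name: binomial_sd(p, n) for name, (p, n) in metrics.items()}
print(sds)
```

Retention has the largest standard deviation because its denominator (enrollments) is much smaller than the click count used by the other two metrics.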