# Problem Session 5

The problems in this notebook cover the material from our Inference I lectures, including:
- Hypothesis Testing
- Confidence Intervals
- Linear Regression Inference

#### 1. The dangers of early peeking in A/B testing

One especially common use of hypothesis testing in data science is "A/B testing".

Many companies that prioritize data-informed decision making have very mature A/B testing platforms to trial changes before adoption.  Here is an example of an A/B test which comes from the book "Trustworthy Online Controlled Experiments" by Kohavi, Tang and Xu:

Someone at your company proposes implementing a coupon code system.  To rapidly get some idea of the potential impacts even *before* implementing the complete system, you decide to run the following A/B test:  for a period of two weeks you will show half of your customers your standard checkout page (the "control group"), and you will show the other half a new checkout page which has a coupon code box (the "treatment group"). Since there are no coupon codes in existence yet, putting anything in the box will simply display "invalid code" and otherwise do nothing.

You will monitor how customers interact with the coupon code box (how many people click on it, enter anything into it, enter one or more attempted codes, etc.), how long customers in the control and treatment groups stay on the checkout page, what fraction of customers who reach the checkout page actually complete their purchase, and the revenue per customer who reached the checkout page.

In this example, the mere presence of a coupon code box significantly reduced revenue per customer with an effect size large enough to scuttle the project!

It is common (but ill-advised) for companies to **continuously** monitor such experiments and stop early when a significant result is obtained in either direction.  The reasoning is that we would not want to continue a disastrous experiment for the full planned time (e.g. hardly anyone checks out after being presented with the coupon code box), and we would likewise not want to miss out on the benefits by **not** implementing a positive experiment as soon as possible.

In this exercise we will see why early stopping is such a bad idea through simulation.

##### a)

Finish the definition of the following function:

In [1]:
import numpy as np

def simulate_data(control_mean, treatment_mean, scale, size):
    '''
    Draws from two normal distributions with the same scale and different means

    Args:
        control_mean:  the mean of the control group
        treatment_mean: the mean of the treatment group
        scale: the common standard deviation of both groups
        size: the shape of both outputs
    
    returns:
        The tuple (control_data, treatment_data)        
    '''
    control_data = np.random.normal(loc = control_mean, scale = scale, size = size)
    treatment_data = np.random.normal(loc = treatment_mean, scale = scale, size = size)
    return (control_data, treatment_data)
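Because `simulate_data` draws from NumPy's global random state, repeated runs of this notebook will give different numbers. The following is a minimal sketch of a reproducible variant, assuming you are free to add a `seed` argument; the name `simulate_data_seeded` and the `seed` parameter are hypothetical and not part of the exercise.

```python
import numpy as np

def simulate_data_seeded(control_mean, treatment_mean, scale, size, seed=None):
    # Hypothetical variant of simulate_data: `seed` is an assumed extra argument.
    # A local Generator keeps the draws reproducible without touching global state.
    rng = np.random.default_rng(seed)
    control_data = rng.normal(loc=control_mean, scale=scale, size=size)
    treatment_data = rng.normal(loc=treatment_mean, scale=scale, size=size)
    return control_data, treatment_data

# Two calls with the same seed give identical draws.
control_a, treatment_a = simulate_data_seeded(0, 0, 1, 1000, seed=42)
control_b, treatment_b = simulate_data_seeded(0, 0, 1, 1000, seed=42)
print(control_a.shape, np.array_equal(control_a, control_b))
```

Within a single call the control and treatment groups are still drawn separately, so they differ from each other even when both means are equal.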

##### b)

Read the following documentation:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

Use `ttest_ind` to finish writing the following two functions:

In [2]:
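# Note: this is the statsmodels version of ttest_ind, which returns (tstat, pvalue, df).
# Like scipy.stats.ttest_ind, it pools the variances by default and gives the p-value at index 1.
from statsmodels.stats.weightstats import ttest_ind

def no_early_peeking_results(control_data, treatment_data):
    '''
    Returns the p-value of the t-test comparing the two group means.
    '''
    p_value = ttest_ind(control_data, treatment_data)[1]
    return p_value

# Compare floats with a tolerance rather than exact equality.
assert np.isclose(no_early_peeking_results([1,2,3], [4,5,6]), 0.021311641128756727)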
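To see what the test is doing, the p-value for `[1,2,3]` versus `[4,5,6]` can be reproduced by hand. The sketch below assumes the pooled (equal-variance) two-sample t-test, which is the default of `scipy.stats.ttest_ind`, and uses the t distribution from `scipy.stats`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

control = np.array([1, 2, 3])
treatment = np.array([4, 5, 6])
n1, n2 = len(control), len(treatment)

# Pooled sample variance (assumes both groups share one variance).
pooled_var = ((n1 - 1) * control.var(ddof=1)
              + (n2 - 1) * treatment.var(ddof=1)) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error.
t_stat = (control.mean() - treatment.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

# Two-sided p-value from the survival function of the t distribution.
p_manual = 2 * stats.t.sf(abs(t_stat), df)

# Cross-check against the library routine.
p_scipy = stats.ttest_ind(control, treatment).pvalue
print(p_manual, p_scipy)
```

Both values should agree with the p-value used in the assertion above, up to floating-point rounding.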

In [3]:
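def early_peeking_results(control_data, treatment_data, alpha):
    '''
    Runs a t-test on each initial segment of the control and treatment data,
    stopping at the first significant result.

    Args:
        control_data: observations from the control group
        treatment_data: observations from the treatment group
        alpha: the threshold for significance.
        
    Returns: 
        (p_value, nobs)
        p_value: the p-value of the first significant such test
        nobs: the number of observations in that test
        If no test is significant, the p-value of the final test
        (on the full data) and len(control_data) are returned.
    
    Example:  
        early_peeking_results([1,2,2,2,2,2], [4,5,5,5,5,5], 0.05)
        Should run a t-test on 
            [1,2], [4,5]
            [1,2,2], [4,5,5]
            [1,2,2,2], [4,5,5,5]
            etc
        when a significant p-value is found it will output
        that p-value and the length of the control group for that test.
    '''
    p_value = 1
    nobs = 1
    # Add one observation per group at a time, starting from two per group,
    # and stop as soon as a test is significant or the data runs out.
    while p_value > alpha and nobs < len(control_data):
        nobs += 1
        p_value = ttest_ind(control_data[:nobs], treatment_data[:nobs])[1]
    return p_value, nobs

# The first significant test in this example is [1,2,2] vs [4,5,5], so nobs is 3.
assert np.isclose(early_peeking_results([1,2,2,2,2,2], [4,5,5,5,5,5], 0.05)[0], 0.0031255892524457277)
assert(early_peeking_results([1,2,2,2,2,2], [4,5,5,5,5,5], 0.05)[1] == 3)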

##### c)

We will now see the impact of early peeking on the false positive rate.

By setting both the control and treatment means to 0, and $\alpha = 0.05$ for the threshold for significance, we should expect to see a false positive rate of roughly $0.05$.  We will see that early peeking wildly inflates the false positive rate!

This is bad news for our company, because we will be misled into thinking that our treatment has a significant effect (in either direction) when there really is no effect at all.
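A rough way to build intuition for the inflation: if each look at the data were an *independent* test at level $\alpha$, the chance of at least one false positive in $k$ looks would be $1 - (1 - \alpha)^k$. The sketch below computes this idealized figure; the function name `independent_looks_fpr` is hypothetical. Repeated looks at accumulating data are strongly correlated rather than independent, so the real inflation is typically smaller than this figure, but it is still far above $\alpha$.

```python
def independent_looks_fpr(alpha, num_looks):
    # Probability of at least one false positive across num_looks tests,
    # under the simplifying (and unrealistic) assumption that the tests are independent.
    return 1 - (1 - alpha) ** num_looks

for k in (1, 5, 20, 100):
    print(k, round(independent_looks_fpr(0.05, k), 3))
```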

In [8]:
def comparing_procedures(control_mean = 0, 
                         treatment_mean = 0, 
                         scale = 1, 
                         size = 1000, 
                         num_simulations = 100, 
                         alpha = 0.05):
    '''
    Simulates data from both control and treatment groups num_simulations times
    and applies both testing procedures to each simulated data set.
    Returns:
        no_early_peeking_false_positives of type int
        early_peeking_false_positives of type int
        early_peeking_nobs of type np.ndarray of int
    '''                
    no_early_peeking_false_positives = 0
    early_peeking_false_positives = 0
    early_peeking_nobs = []

    # Use the num_simulations and alpha arguments rather than hard-coded values.
    for _ in range(num_simulations):
        control_data, treatment_data = simulate_data(control_mean,treatment_mean,scale,size)
        no_early_peeking_p_value = no_early_peeking_results(control_data, treatment_data)
        early_peeking_p_value, nobs = early_peeking_results(control_data, treatment_data, alpha)
        if no_early_peeking_p_value < alpha:
            no_early_peeking_false_positives += 1
        if early_peeking_p_value < alpha:
            early_peeking_false_positives += 1
        early_peeking_nobs.append(nobs)

    early_peeking_nobs = np.array(early_peeking_nobs)

    return (no_early_peeking_false_positives, early_peeking_false_positives, early_peeking_nobs)

In [7]:
num_simulations = 100
size = 1000
(no_early_peeking_false_positives, early_peeking_false_positives, early_peeking_nobs) = comparing_procedures(num_simulations=num_simulations, size = size)
print(f"The number of false positives with no early peeking is {no_early_peeking_false_positives} out of {num_simulations}")
print(f"The number of false positives with early peeking is {early_peeking_false_positives} out of {num_simulations}")
print(f"The number of observations to reach those false positives (out of {size}) were \n {early_peeking_nobs[early_peeking_nobs != size]}")

The number of false positives with no early peeking is 4 out of 100
The number of false positives with early peeking is 58 out of 100
The number of observations to reach those false positives (out of 1000) were 
 [  4  86  87   4   3   9 704   4  31 166  81   3  41  11 901   3   9 550
 764   5   4  15  31   5  12   7 953 190 388   4   4   8  62  94  63  12
 922   7   4  23   6 262  80   6   3  38  50 249  68  24  55 233 138   4
   6   9 220   3]
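The recorded run above can be summarized with a few lines of plain Python. This sketch copies the printed stopping values verbatim; the names `recorded_nobs` and `stopped_by_10` are illustrative, and the cutoff of 10 is an arbitrary choice. A different random run would give different numbers.

```python
# Stopping values copied from the printed output of the recorded run.
recorded_nobs = [
    4, 86, 87, 4, 3, 9, 704, 4, 31, 166, 81, 3, 41, 11, 901, 3, 9, 550,
    764, 5, 4, 15, 31, 5, 12, 7, 953, 190, 388, 4, 4, 8, 62, 94, 63, 12,
    922, 7, 4, 23, 6, 262, 80, 6, 3, 38, 50, 249, 68, 24, 55, 233, 138, 4,
    6, 9, 220, 3,
]
num_simulations = 100

# Share of simulations in which early peeking declared a (false) significant effect.
early_peeking_rate = len(recorded_nobs) / num_simulations

# How many of those stops happened at a reported value of 10 or fewer.
stopped_by_10 = sum(1 for n in recorded_nobs if n <= 10)

print(early_peeking_rate, stopped_by_10, len(recorded_nobs))
```

In this run a large share of the false positives were triggered within the first handful of observations, where a t-test on so few data points is least trustworthy.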


#### 2. Confidence Intervals and Linear Regression Coefficients
