# Problem Session 5

The problems in this notebook cover the content of our Inference I lectures, including:
- Hypothesis Testing
- Confidence Intervals
- Linear Regression Inference

#### 1. The dangers of early peeking in A/B testing

One especially common use of hypothesis testing in data science is "A/B testing".

Many companies prioritize data-informed decision making and have very mature A/B testing platforms to trial changes before adoption.  Here is an example of an A/B test which comes from the book "Trustworthy Online Controlled Experiments" by Kohavi, Tang and Xu:

Someone at your company proposes implementing a coupon code system.  To rapidly get some idea of the potential impacts even *before* implementing the complete system, you decide to run the following A/B test:  for a period of two weeks you will show half of your customers your standard checkout page (the "control group"), and you will show the other half a new checkout page which has a coupon code box (the "treatment group"). Since there are no coupon codes in existence yet, putting anything in the box will simply display "invalid code" and otherwise do nothing.

You will monitor how customers interact with the coupon code box (how many people click on it, enter anything into it, enter one or more attempted codes, etc), how long customers in the control and treatment groups stay on the checkout page, what fraction of customers who make it to the checkout page actually complete their purchase, and the revenue per customer who made it to the checkout page.

In this example, the mere presence of a coupon code box significantly reduced revenue per customer with an effect size large enough to scuttle the project!

It is common (but ill-advised) for companies to **continuously** monitor such experiments and stop early when a significant result is obtained in either direction.  The reasoning is that we would not want to continue a disastrous experiment for the full planned time (e.g. one where hardly anyone checks out after being presented with the coupon code box), and we would likewise not want to miss out on the benefits of a positive experiment by **not** implementing it as soon as possible.

In this exercise we will see why early stopping is such a bad idea through simulation.

##### a)

Finish the definition of the following function:

In [1]:
import numpy as np
import pandas as pd

def simulate_data(control_mean, treatment_mean, scale, size):
    '''
    Draws from two normal distributions with the same scale and different means

    Args:
        control_mean:  the mean of the control group
        treatment_mean: the mean of the treatment group
        scale: the common standard deviation of both groups
        size: the shape of both outputs
    
    returns:
        The tuple (control_data, treatment_data)        
    '''
    control_data = 
    treatment_data = 
    return (control_data, treatment_data)


# Note: for the probabilistic checks below, the probabilities of both false positives and false negatives are so low as to be effectively zero.
# It is a fun little puzzle to estimate these probabilities.
assert(simulate_data(3,3,1,(10,2))[0].shape == (10,2)), "control_data does not have the correct shape"
assert(simulate_data(3,3,1,(10,2))[1].shape == (10,2)), "treatment_data does not have the correct shape"
assert(np.abs(simulate_data(0,3,1,(1000))[0].mean()) < 1  ), "control_data does not have the correct mean"
assert(np.abs(simulate_data(0,3,1,(1000))[1].mean() - 3) < 1  ), "treatment_data does not have the correct mean"
assert(np.abs(simulate_data(0,3,1,(1000))[0].std() - 1) < 0.5 ), "control_data does not have the correct standard deviation"
assert(np.abs(simulate_data(0,3,1,(1000))[1].std() - 1) < 0.5 ), "treatment_data does not have the correct standard deviation"
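
As a starting point for the puzzle mentioned in the comment above, here is a rough sketch of how one might estimate the chance that the mean check fails even for a correct implementation. It assumes the draws are independent normals, so the sample mean of `n` draws is normal with standard deviation `scale / sqrt(n)`. All variable names are illustrative and not part of the exercise.

```python
import numpy as np
from scipy.stats import norm

n, scale, tolerance = 1000, 1.0, 1.0

# Standard error of the sample mean of n independent N(mean, scale^2) draws.
standard_error = scale / np.sqrt(n)

# Two-sided tail probability that the sample mean lands more than
# `tolerance` away from the true mean, which would trip the assert.
p_spurious_failure = 2 * norm.sf(tolerance / standard_error)
print(p_spurious_failure)
```

The check on the standard deviation needs a different argument, since the sample standard deviation is not normally distributed.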

##### b)

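Use `ttest_ind` to finish writing the following two functions. The cell below imports the `statsmodels` version, which returns the tuple `(tstat, pvalue, df)`; SciPy's [`ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) runs the same pooled-variance test by default.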

In [2]:
from statsmodels.stats.weightstats import ttest_ind

def no_early_peeking_results(control_data, treatment_data):
    '''
    Returns the p-value of the t-test comparing the two group means.
    '''
    p_value = 
    return p_value

# Compare with a tolerance: exact float equality can break across library versions.
assert np.isclose(no_early_peeking_results([1,2,3], [4,5,6]), 0.021311641128756727)

In [3]:
def early_peeking_results(control_data, treatment_data, alpha):
    '''
    Runs a t-test on each initial segment of the control and treatment data.

    Args:
        alpha: the threshold for significance.
        
    Returns: 
        (p_value, nobs)
        p_value: the p-value of the first significant such test
        nobs: the number of observations in that test
    
    Example:  
        early_peeking_results([1,2,2,2,2,2], [4,5,5,5,5,5], 0.05)
        Should run a t-test on 
            [1,2], [4,5]
            [1,2,2], [4,5,5]
            [1,2,2,2], [4,5,5,5]
            etc
        when a significant p-value is found it will output
        that p-value and the length of the control group for that test.
    '''
    
    return p_value, nobs

# Compare with a tolerance: exact float equality can break across library versions.
assert np.isclose(early_peeking_results([1,2,2,2,2,2], [4,5,5,5,5,5], 0.05)[0], 0.0031255892524457277)

# Be careful about off by one errors here!
assert(early_peeking_results([1,2,2,2,2,2], [4,5,5,5,5,5], 0.05)[1] == 3)
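
To see why the expected `nobs` in the docstring example is 3 rather than 2, it may help to run the first two peeks by hand. The sketch below uses SciPy's `ttest_ind` (imported under the hypothetical alias `scipy_ttest_ind` so it does not shadow the `statsmodels` import) and only checks two initial segments; it is not the full function.

```python
from scipy.stats import ttest_ind as scipy_ttest_ind

control = [1, 2, 2, 2, 2, 2]
treatment = [4, 5, 5, 5, 5, 5]
alpha = 0.05

# First peek: two observations per group.
p_first = scipy_ttest_ind(control[:2], treatment[:2]).pvalue
# Second peek: three observations per group.
p_second = scipy_ttest_ind(control[:3], treatment[:3]).pvalue

print(p_first, p_first < alpha)    # just above alpha, so not yet significant
print(p_second, p_second < alpha)  # below alpha, so testing stops at nobs = 3
```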

##### c)

We will now see the impact of early peeking on the false positive rate.

By setting both the control and treatment means to 0 and using $\alpha = 0.05$ as the threshold for significance, we should expect to see a false positive rate of roughly $0.05$.  We will see that early peeking wildly inflates the false positive rate!

This is bad news for our company, because we will be misled into thinking that our treatment has a significant effect (in either direction) when there really is no effect at all.
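
The inflation already shows up with far fewer looks than the exercise asks for. The sketch below is a simplified variant, not the function you are asked to write: it peeks only after every 100 observations and uses SciPy's `ttest_ind` along an axis to run all simulations at once. The seed, the checkpoint spacing, and all variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind as scipy_ttest_ind

rng = np.random.default_rng(0)
num_simulations, size, alpha = 1000, 1000, 0.05
checkpoints = range(100, size + 1, 100)  # peek after every 100 observations

# Both groups share one distribution, so every "significant" result is a false positive.
control = rng.normal(0, 1, size=(num_simulations, size))
treatment = rng.normal(0, 1, size=(num_simulations, size))

ever_significant = np.zeros(num_simulations, dtype=bool)
for n in checkpoints:
    # One t-test per simulation (row), using only the first n observations.
    p_values = scipy_ttest_ind(control[:, :n], treatment[:, :n], axis=1).pvalue
    significant_now = p_values < alpha
    ever_significant |= significant_now

# After the loop, significant_now holds the single test on the full data.
fixed_rate = significant_now.mean()
peeking_rate = ever_significant.mean()
print(fixed_rate, peeking_rate)
```

Because the final look is one of the checkpoints, the peeking rate can never be lower than the fixed-sample rate; the question is how much higher it gets.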

In [None]:
def comparing_procedures(size = 1000, 
                         num_simulations = 100, 
                        alpha = 0.05):
    '''
    Simulates data from control and treatment groups both with mean 0 and variance 1.
    Returns a tuple of the following variables:
        no_early_peeking_false_positives of type int
            Number of significant trials without early peeking
        early_peeking_false_positives of type int
            Number of "significant" trials with early peeking
        early_peeking_nobs of type list(int)
            List of the number of samples in the "significant" early peeking trials.
    '''            
    no_early_peeking_false_positives = 0
    early_peeking_false_positives = 0
    early_peeking_nobs = []

    for i in range(num_simulations):
        # Your code here
        
    return (no_early_peeking_false_positives, early_peeking_false_positives, early_peeking_nobs)

In [None]:
num_simulations = 100
size = 1000
(no_early_peeking_false_positives, early_peeking_false_positives, early_peeking_nobs) = comparing_procedures(num_simulations=num_simulations, size = size)
print(f"The number of false positives with no early peeking is {no_early_peeking_false_positives} out of {num_simulations}")
print(f"The number of false positives with early peeking is {early_peeking_false_positives} out of {num_simulations}")
# early_peeking_nobs is a list, so filter it with a comprehension rather than a boolean mask.
print(f"The number of observations to reach those false positives (out of {size}) were \n {[n for n in early_peeking_nobs if n != size]}")

#### 2. Confidence Intervals and Linear Regression Coefficients


In [6]:
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

#### a) Problem introduction

The following [dataset](https://dasl.datadescription.com/datafile/bodyfat/) comes from the [Data and Story Library](https://dasl.datadescription.com/). 

Our target variable is `Pct.BF` which stands for "percent body fat".

Age is given in years, weight is given in pounds, and all other measurements are given in inches.

In [None]:
bf = pd.read_csv('../../data/bodyfat.csv')
bf.head()

This table does not include [Body Mass Index](https://en.wikipedia.org/wiki/Body_mass_index), but we can calculate it from the data.

Add a new column `BMI` which records this.
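
BMI is defined as mass in kilograms divided by height in meters squared, so with weight in pounds and height in inches a conversion factor of about 703 is needed. A minimal sketch on a toy frame, assuming the columns are named `Weight` and `Height` (check `bf.columns` for the actual names):

```python
import pandas as pd

# Toy rows for illustration only; `Weight` (pounds) and `Height` (inches)
# are assumed column names and the values are made up.
toy = pd.DataFrame({"Weight": [150.0, 200.0], "Height": [65.0, 72.0]})

# BMI is kg / m^2; with pounds and inches the conversion factor is about 703.
toy["BMI"] = 703 * toy["Weight"] / toy["Height"] ** 2
print(toy)
```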

In [8]:
bf['BMI'] = 

The name of the target variable `Pct.BF` is a bit problematic for `statsmodels` because of the period.  Change the name of that column to `BFP`.

In [9]:
bf = 

#### b)  Train/Test Split

One of the reasons that we have a reproducibility crisis in several fields of science is that researchers test multiple hypotheses on their full dataset and only report the statistically significant findings.  As you should know, if the null hypotheses are all actually true and you test at the $\alpha = 0.05$ level, you will find on average one "significant" result for every $20$ things you try.
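
The arithmetic behind this can be sketched in a few lines. The calculation below assumes the tests are independent, which real analyses on a shared dataset usually are not, so treat the second number as a rough guide.

```python
alpha = 0.05
num_tests = 20

# Expected number of false positives when every null hypothesis is true.
expected_false_positives = alpha * num_tests

# Probability of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** num_tests
print(expected_false_positives, round(p_at_least_one, 3))
```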

Even without deliberate "p-hacking", one can still be led astray, as the article ["The Garden of Forking Paths"](http://www.stat.columbia.edu/%7Egelman/research/unpublished/p_hacking.pdf) by Gelman and Loken convincingly argues.

One way to combat this is by conducting a replication study, but that can be expensive.

Alternatively, you can conduct your own replication study by splitting your data in half.  Experiment to your heart's content on the training data.  Find whatever interesting potential associations you like without worrying.  Then test one of these hypotheses on your testing set.  The downside is that your study is less powerful since you have access to only half the data.  The upside is that the data used to generate your research hypothesis is not the same data which you use to test your hypothesis (a very good thing).

For these reasons, make a 50/50 train/test split of the bodyfat data:

In [10]:
from sklearn.model_selection import 

bf_train, bf_test = 

#### c) Exploring possible associations

Fit the following regression models using `statsmodels` (not `sklearn`):

Note:  We imported `statsmodels.formula.api` as `smf`.  It is convenient to use formulas for fitting these models.  For example, if you wanted to regress `Neck` on `Chest` and `Height` you could write

```python
neck_model = smf.ols('Neck ~ Chest + Height', data=bf_train).fit()
```

In [11]:
# full_model uses BMI, Waist, and Abdomen as features
full_model = 

# waist_model uses Waist as the only feature
waist_model = 

# bmi_model uses BMI as the only feature
bmi_model = 

# abdomen_model uses Abdomen as the only feature
abdomen_model = 

Look at the summary of each model.  Do they tell a consistent story?

For each model discuss the following with your group:

1. Describe *precisely* what is the meaning of the $p$-value listed for each feature.
2. Describe *precisely* what is the meaning of the $95\%$ confidence interval listed for each feature.
3. How can it be that a feature is considered significant by one model and not by another?
4. Can you explain any unusual findings? 

In [None]:
full_model.summary()

In [None]:
waist_model.summary()

In [None]:
bmi_model.summary()

In [None]:
abdomen_model.summary()

Use an F-test to compare the full model to each of the waist, abdomen, and BMI models.  What is the precise meaning of the $p$-value of each test? 
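
One way to run such a comparison in `statsmodels` is the `compare_f_test` method of a fitted OLS result, which takes the restricted (smaller) model as its argument. The sketch below uses synthetic data with hypothetical columns `x1`, `x2`, and `y`, not the bodyfat data, so the numbers say nothing about the exercise.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
# Synthetic data: y depends on x1 only; x2 is pure noise.
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 2 * toy["x1"] + rng.normal(size=n)

full = smf.ols("y ~ x1 + x2", data=toy).fit()
only_x1 = smf.ols("y ~ x1", data=toy).fit()
only_x2 = smf.ols("y ~ x2", data=toy).fit()

# compare_f_test returns (F statistic, p-value, difference in degrees of freedom).
f_drop_x2 = full.compare_f_test(only_x1)
f_drop_x1 = full.compare_f_test(only_x2)
print("p-value for dropping x2:", f_drop_x2[1])
print("p-value for dropping x1:", f_drop_x1[1])
```

The second element of each returned tuple is the $p$-value, which is why the cell below indexes with `[1]`.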

In [None]:
# Perform the F-test
f_test_waist = 
f_test_bmi = 
f_test_abdomen = 


print("p-value of full compared to waist model:", f_test_waist[1])
print("p-value of full compared to abdomen model:", f_test_abdomen[1])
print("p-value of full compared to BMI model:", f_test_bmi[1])


If you have not already, make a plot of `Waist` against `Abdomen` to help explain what we have been seeing:

#### d)  Final model evaluation on testing set.

Let's choose the waist model as our final model.  Fit the model to the testing data and look at the summary.  Discuss with your group.

In [None]:
final_model = 

final_model.summary()

#### e) Confidence and prediction intervals

Make a graph which includes:

1. A scatterplot of the data
2. The confidence interval for the predicted mean response.
3. The prediction interval for the response.

Note:  You will want to use the [`.get_prediction`](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.get_prediction.html) method of a fitted OLS model, which returns a prediction results object. That object has a [`.summary_frame`](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.PredictionResults.summary_frame.html) method (the documentation is non-existent and you will need to look at the source to see what this does) which returns a DataFrame containing the confidence and prediction interval bounds you want.
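
Since the documentation is thin, here is a small sketch of what the two calls return. It uses synthetic data with hypothetical columns `x` and `y`, so only the column names of the resulting frame carry over to the exercise.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic data, made up for illustration.
toy = pd.DataFrame({"x": rng.uniform(30, 50, size=100)})
toy["y"] = 1.5 * toy["x"] - 40 + rng.normal(scale=4, size=100)

fit = smf.ols("y ~ x", data=toy).fit()

# Predict at a single new x value; alpha=0.05 gives 95% intervals.
new_points = pd.DataFrame({"x": [45.0]})
frame = fit.get_prediction(new_points).summary_frame(alpha=0.05)

# mean_ci_* bound the mean response; obs_ci_* bound a single new observation.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```

Passing a grid of `x` values instead of a single point gives the bands needed for the plot.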

Find the confidence and prediction interval for Body Fat Percentage at a waist size of 45 inches.  Explain these in a way that a layman could understand what they mean.