# Bootstrapping SEs and CIs with Grad School Data

**Instructions:**

The goal of this exercise is to become familiar with the technique of
bootstrapping and appreciate how it can be used to estimate the precision
of statistics through resampling data to generate standard errors and
confidence intervals that may otherwise be difficult to compute directly.

**What to do:**

Log in to Learning Catalytics and join the session for the
module entitled "Grad School Correlations". You will answer a series of
questions based on the guided programming below. Each section begins with
a '%%'. Read through the comments and follow the instructions provided.
In some cases you will be asked to answer a question, clearly indicated
by 'QUESTION'. In other cases, you will be asked to supply missing code,
indicated by 'TODO'. The corresponding question in Learning Catalytics
is indicated in parentheses (e.g. Q1). If there is no 'Q#'
accompanying a 'QUESTION', just type your answer into this script and
discuss it with your team.

Original source of exercise:
Efron, B. & Tibshirani, R. J. (1993). An Introduction to the
Bootstrap. Chapman & Hall, London. See Table 3.2 on p. 21.

RTB wrote it 07 July 2019 (Kinsale, Ireland; Cork Distance Week,
"Champion of Champions" day), ERBB translated to Python on 04 August, 2021


**Concepts covered:**
1. Standard error of the mean calculated 3 ways:
      a) formula, b) population sampling, c) bootstrap sampling
2. Calculating correlation coefficients with 'corr'
3. Bootstrapping standard errors with the built-in 'bootstrp' function
4. Bootstrapping confidence intervals with the built-in 'bootci' function
5. Parametric bootstrap by sampling from a bivariate normal distribution

The data here are GRE (quant) and GPA (science) scores from a census of
82 graduate programs in neuroscience. We also have a random sample of 15
schools drawn from this census. Note that these data were collected
prior to August of 2011, so the GRE scores were scaled from 200 to 800.


In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
sns.set(rc={'figure.figsize':(10, 5)})

import matplotlib.pyplot as plt

In [None]:
# Read in the data
ds82 = pd.read_excel('https://github.com/rickborn/TAD/blob/master/Unit%20%232%20Bootstrap%201/Grad_School_82.xlsx?raw=true') # All graduate programs (*census*)
ds15 = pd.read_excel('https://github.com/rickborn/TAD/blob/master/Unit%20%232%20Bootstrap%201/Grad_School_15.xlsx?raw=true') # Random sample of 15

# Define a few constants
n_boot = 10000
n_samp = len(ds15.GRE)
n_census = len(ds82.GRE)

# Look at the original excel spreadsheet for one of the files and compare
# it to the variable that 'pd.read_excel' created in Python. 
ds15.columns

Index(['SchoolID', 'GRE', 'GPA'], dtype='object')

In [None]:
# Plot GPA and GRE scores

fig, ax = plt.subplots(1, 1, figsize = (10, 5))

# Set axis limits
ax.set(xlim = [450, 750])

# Plot census scatter plot and least-squares line
sns.regplot(data = ds82, x = 'GRE', y = 'GPA', ax = ax, truncate = False, ci = None,
            color = 'k', marker = '+',
            scatter_kws={"s": 100}, 
            label = 'Census')

# Plot sample scatter plot and least-squares line
sns.regplot(data = ds15, x = 'GRE', y = 'GPA', ax = ax, truncate = False, ci = None,
            color = 'r', marker = 'o', 
            scatter_kws={"s": 100, "facecolor":'none', "edgecolor":'r'}, 
            label = 'Sample')

# Add legend
ax.legend()

# Set axis labels
ax.set(xlabel = 'GRE Score (quant)', ylabel = 'GPA (science)');

## Mean GRE score & standard error (SE)
We start with something that is easy to compute directly. The reason we
can do this is that, thanks to the Central Limit Theorem, we KNOW that
the sampling distribution of the mean, regardless of the distribution of
the original data from which the mean was calculated, will be normally
distributed. Furthermore, we know that the standard deviation of the
sampling distribution for the mean will be equal to the sample standard
deviation divided by the square-root of the number of samples. This is
the standard error of the mean.

NOTE: There are two standard deviations at play here. The first is the
standard deviation that we calculate from our sample--this is the sample
standard deviation. But, in the next sections, we will be explicitly
calculating the "sampling distribution of the mean," which is the
distribution of mean values we would get if we re-took our sample of 15
many different times and calculated a new sample mean each time. The
standard deviation of the sampling distribution of the mean is, by
definition, the standard error. In fact, and this is very important to
just flat out memorize until practice makes it intuitive: THE STANDARD
ERROR OF ANY STATISTIC IS THE STANDARD DEVIATION OF THE SAMPLING
DISTRIBUTION OF THAT STATISTIC. The mean is a special case where we can
use a handy-dandy formula to calculate the standard deviation of the
sampling distribution (i.e. the standard error of the mean) based on the
standard deviation of our single sample.

*Python note*: pandas and Matlab both divide by n - 1 when they compute the standard deviation. NumPy's `std` divides by n unless you set the `ddof` (delta degrees of freedom) argument to 1. Use pandas to match the Matlab results.
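To make the note about `ddof` concrete, here is a minimal sketch on a small made-up series (the values are hypothetical and are not the GRE data). It compares the three ways of getting a standard deviation and then applies the SEM formula described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores (NOT the GRE data), chosen so the arithmetic is easy
scores = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(scores)

sd_numpy_default = np.std(scores.to_numpy())        # divides by n
sd_numpy_ddof1 = np.std(scores.to_numpy(), ddof=1)  # divides by n - 1
sd_pandas = scores.std()                            # divides by n - 1 by default

# Standard error of the mean from the formula: sample SD / sqrt(n)
sem_formula = sd_pandas / np.sqrt(n)

print(sd_numpy_default, sd_numpy_ddof1, sd_pandas, sem_formula)
```

The pandas value and the NumPy value with `ddof=1` agree, while the NumPy default is slightly smaller because it divides by n.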


In [None]:
# TODO: Calculate the mean GRE score for your sample and its standard error (SE) 

mean_GRE = ...
sem_GRE = ...

sem_GRE


**QUESTION (Q1)**: What is the value of `sem_GRE` to 2 decimal places?


# Bootstrapping

## "True" standard error by sampling from the population
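As a rough illustration of the idea, the sketch below builds a synthetic "census" (random numbers, not the GRE data), repeatedly draws samples of size 15 from it, and takes the standard deviation of the resulting means. Drawing with replacement is an assumption made here to match how the exercise later describes sampling from the census; the names `population`, `n_rep`, and `se_from_sampling` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "census" of 82 values (synthetic, NOT the GRE data)
population = rng.normal(loc=600, scale=50, size=82)
n_samp, n_rep = 15, 10_000

# Draw n_rep samples of size n_samp (with replacement) and keep each mean
sample_means = np.array([
    rng.choice(population, size=n_samp, replace=True).mean()
    for _ in range(n_rep)
])

# The SD of this sampling distribution is the standard error of the mean
se_from_sampling = sample_means.std(ddof=1)

# Theory for draws with replacement: population SD / sqrt(n_samp)
se_theory = population.std(ddof=0) / np.sqrt(n_samp)

print(se_from_sampling, se_theory)
```

The two printed values should be close, which is the Central Limit Theorem result from the previous section seen empirically.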






In [None]:
all_means = pd.Series(np.zeros(n_boot,))

np.random.seed(123) # for consistency across the class; you would not normally do this.
for k in range(n_boot):

    # TODO: Draw n_boot samples of size 15 (n_samp) from the CENSUS of 82, each time 
    # calculating the sample mean. Save each mean in 'all_means'
    all_means[k] = ...

# Look at the sampling distribution of the mean
ax = sns.histplot(data = all_means, bins = 10)
ax.set(xlabel = 'mean GRE score', ylabel = '# of samples of size 15',
       title = 'Distribution of means, sampling from census');

In [None]:
# TODO: calculate the standard error of the mean from this sampling distribution:
sem_GRE_samp = ...

sem_GRE_samp

**QUESTION (Q2)**: What is the value of `sem_GRE_samp` to 2 decimal places?

## Bootstrap standard error by sampling from the sample

Calculate another SEM as you did above, but now, instead of drawing your samples from the CENSUS, you will draw your samples from the sample.
You do this by sampling WITH REPLACEMENT from your original actual sample of the 15 graduate schools. This is the essence of the bootstrap!
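A minimal sketch of that resampling step, using a synthetic sample of 15 values rather than the GRE data (all names and numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample of 15 values (synthetic, NOT the GRE data)
sample = rng.normal(loc=600, scale=50, size=15)
n = len(sample)
n_boot = 5000

boot_means = np.empty(n_boot)
for k in range(n_boot):
    # Resample the SAME 15 values with replacement, keeping the sample size
    resample = rng.choice(sample, size=n, replace=True)
    boot_means[k] = resample.mean()

# Bootstrap SE = SD of the bootstrap distribution of the mean
se_boot = boot_means.std(ddof=1)

# Formula-based SE for comparison
se_formula = sample.std(ddof=1) / np.sqrt(n)

print(se_boot, se_formula)
```

The bootstrap value usually comes out a little below the formula value, because resampling from the sample effectively uses the divide-by-n variance of the data.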




In [None]:
all_means_bootstrap = pd.Series(np.zeros(n_boot,))

np.random.seed(123) 
for k in range(n_boot):

    # TODO: sample with replacement from the sample and get mean
    all_means_bootstrap[k] = ...

# Visualize the sampling distribution of the mean
fig, axes = plt.subplots(2, 1, figsize = (10, 7))
bin_edges = np.histogram_bin_edges(all_means_bootstrap)

sns.histplot(data = all_means, bins = bin_edges, ax = axes[0])
axes[0].set(xlabel = 'mean GRE score', ylabel = '# of samples of size 15',
       title = 'Distribution of means, sampling from census', ylim = [0, 3100]);

sns.histplot(data = all_means_bootstrap, bins = bin_edges, ax = axes[1])
axes[1].set(xlabel = 'mean GRE score', ylabel = '# of samples of size 15',
       title = 'Distribution of means, re-sampling from the sample', ylim = [0, 3100]);

plt.tight_layout()

In [None]:
# TODO: calculate the standard error of the mean from the bootstrap 
#      sampling distribution of the means
sem_GRE_boot = ...

sem_GRE_boot

**QUESTION (Q3)**: What is the value of `sem_GRE_boot` to 2 decimal places?


**QUESTION (Q4)**: What is the error (in %) of the bootstrap estimate relative to the formula estimate? Calculate it in the next cell and round to the nearest whole percent.
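One common way to express such an error is the signed difference divided by the reference value. The sketch below uses made-up numbers, and the sign convention (bootstrap minus formula) is an assumption rather than something the exercise specifies.

```python
# Hypothetical values, NOT results from the GRE data
se_formula = 2.00   # reference value (from the formula)
se_boot = 1.90      # estimate being compared (from the bootstrap)

# Signed percent error relative to the reference value
percent_error = 100 * (se_boot - se_formula) / se_formula

print(round(percent_error))
```

A negative result means the bootstrap estimate is smaller than the formula estimate.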



In [None]:
# TODO: Compare your bootstrap estimate of the SE with that from the formula
percent_error = ...

percent_error

## Correlation of GRE and GPA in census and sample
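Before working with the real data, it may help to see the calls on a tiny made-up table. This sketch (hypothetical columns `x` and `y`) computes Pearson's r three ways: with the pandas `corr` method, with `np.corrcoef`, and from standardized scores.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the GRE/GPA data
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 1.0, 4.0, 3.0, 5.0]})

# 1. pandas: Series.corr gives a single number; DataFrame.corr gives a matrix
r_pandas = toy['x'].corr(toy['y'])
r_matrix = toy.corr().loc['x', 'y']

# 2. NumPy: corrcoef returns a 2 x 2 matrix; take the off-diagonal entry
r_numpy = np.corrcoef(toy['x'], toy['y'])[0, 1]

# 3. From standardized scores (sample SD, divide by n - 1)
zx = (toy['x'] - toy['x'].mean()) / toy['x'].std()
zy = (toy['y'] - toy['y'].mean()) / toy['y'].std()
r_manual = (zx * zy).sum() / (len(toy) - 1)

print(r_pandas, r_matrix, r_numpy, r_manual)
```

All four values agree (0.8 for this toy table).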

In [None]:
# TODO: Use pandas dataframe method `corr` to calculate correlation 
#       coefficients of both the census and sample

rho_hat_82 = ...
rho_hat_15 = ...

print(rho_hat_82)
print(rho_hat_15)

**QUESTION (Q5)**: What is a correlation coefficient?

**QUESTION (Q6)**: What is the correlation coefficient for the census? 

**QUESTION (Q7)**: Based on the correlation coefficient and the graph, would you guess GRE score and GPA are correlated?


## Standard error for the correlation coefficient



In [None]:
# Set up a pandas dataframe for various ways of estimating the standard error of 
#  the correlation coefficient

rhos = pd.DataFrame({
       'bs_rhos': np.zeros(n_boot,),
       'all_rhos_TS': np.zeros(n_boot,)
    })

Unlike for the mean, there is no handy-dandy formula for the standard error of a correlation coefficient. This is where the bootstrap comes in!

Get a bootstrap sample of correlation coefficients the old-fashioned way, using a 'for' loop. Store the results in our pandas dataframe `rhos` in the column `bs_rhos`.
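The key detail when bootstrapping a correlation is to resample whole rows, so that each x value stays paired with its own y value. Here is a minimal sketch on synthetic paired data (the frame `toy` and its columns are hypothetical, not the GRE/GPA data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical paired sample of 15 (synthetic, NOT the GRE/GPA data)
x = rng.normal(0, 1, size=15)
y = 0.8 * x + rng.normal(0, 0.6, size=15)
toy = pd.DataFrame({'x': x, 'y': y})

n, n_boot = len(toy), 2000
boot_rhos = np.empty(n_boot)
for k in range(n_boot):
    # Pick n row positions with replacement, then take those whole rows
    rows = rng.integers(0, n, size=n)
    resampled = toy.iloc[rows]
    boot_rhos[k] = resampled['x'].corr(resampled['y'])

# Bootstrap SE of the correlation = SD of the bootstrap replicates
se_rho = boot_rhos.std(ddof=1)
print(se_rho)
```

Resampling `x` and `y` independently would destroy the pairing and push the replicates toward zero correlation, which is not what we want here.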

In [None]:
np.random.seed(123) # for reproducibility

for k in range(n_boot):

    # TODO: Randomly sample n_samp rows from ds15 with replacement
    sampled_rows = ...
    
    # TODO: Compute the correlation of GRE score and GPA for this sample
    rhos.loc[k, 'bs_rhos'] = ...

ax = sns.histplot(rhos['bs_rhos'])
ax.set(xlabel = 'Correlation coefficient', ylabel = 'Probability');

In [None]:
# Compute standard error of our correlation coefficient
se_rho_boot = ...

se_rho_boot

**QUESTION (Q8)**: What is the value of `se_rho_boot` (referred to as `se_rho_boot_FL` in Learning Catalytics) to 4 decimal places?



In [None]:
# Compute mean of distribution
mean_rho_boot = ...

mean_rho_boot

**QUESTION (Q9)**: What is the mean of this distribution to 2 decimal places?


Q10/Q11/Q12 are specific to Matlab

## Sample from census

As we did above for the mean, we can take advantage of the fact that we
have data for the complete population (i.e. census), and see how our
estimate of rho is distributed when we repeatedly sample from the
population. That is, instead of re-sampling our sample of 15 with
replacement, we sample the 'population' of 82 graduate schools with
replacement.

Store in our pandas dataframe `rhos` in the column `all_rhos_TS`.

In [None]:
np.random.seed(123)  # for reproducibility
for k in range(n_boot):

    # TODO: Randomly sample n_samp rows from ds82 with replacement
    sampled_rows = ...
    
    # Compute the correlation of GRE score and GPA for this sample
    rhos.loc[k, 'all_rhos_TS'] = ...

# Visualize
ax = sns.histplot(rhos)
plt.legend(['Census', 'Bootstrap'])
ax.set(xlabel = 'Correlation coefficient', ylabel = 'Probability');

**QUESTION (Q13)**: How does this distribution compare to the bootstrapped resampling of 15 schools? Consider the general skew, spread, and location (i.e., mean, median) of the distributions.

**QUESTION (Q14)**: Compute the standard error of the correlation coefficient for the samples bootstrapped from the population.

In [None]:
# TODO: Compute the standard error of the correlation coefficient for the samples 
# bootstrapped from the population.
se_rho_boot_TS = ...

se_rho_boot_TS

## The 'parametric bootstrap' (p. 53 of E&T)

"Instead of sampling with replacement from the data, we draw B samples of
size n from the parametric estimate of the population."

The parametric bootstrap differs from the traditional bootstrap in that
we fit a model to the data and then draw random numbers from this fitted
model, rather than resampling the data itself. Why might one want to do
this? Well, in rare instances when one wants to bootstrap the SE for some
sample 'outlier', such as the 'min' or 'max', the data-driven bootstrap
will fail. (Try this and see for yourself what is going on.) In such
cases, the parametric bootstrap gets it right.
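To see what goes wrong with the data-driven bootstrap for an extreme value, the sketch below bootstraps the maximum of a synthetic sample of 15 (hypothetical values, not the GRE data) and counts how often a replicate simply reproduces the observed maximum.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sample of 15 values (synthetic, NOT the GRE data)
sample = rng.normal(loc=600, scale=50, size=15)
n, n_boot = len(sample), 5000

# Ordinary (nonparametric) bootstrap of the sample maximum
boot_max = np.array([
    rng.choice(sample, size=n, replace=True).max()
    for _ in range(n_boot)
])

# How often does a resample contain the original maximum?
frac_at_sample_max = np.mean(boot_max == sample.max())

# How many distinct values does the bootstrap distribution take?
n_distinct = len(np.unique(boot_max))

print(frac_at_sample_max, n_distinct)
```

A resample of size n contains the original maximum with probability 1 - (1 - 1/n)^n, roughly 64% for n = 15, so the bootstrap distribution is a handful of spikes at observed data values and can never exceed the sample maximum.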

In this case, we will assume that the population has a bivariate normal
distribution, with means `mu_hat` and a covariance matrix of `cov_hat`.




In [None]:
mu_hat = ds15[['GRE', 'GPA']].mean()
cov_hat = ds15[['GRE', 'GPA']].cov()

Using what we learned from bootstrapping, create a 'for' loop that uses the `np.random.multivariate_normal` function to draw n_boot samples of size `n_samp` from a bivariate normal distribution with mean `mu_hat` and covariance `cov_hat`. Compute the correlation coefficient for each sample and store in the column called `pbs_rhos` of our pandas dataframe `rhos`, which is initialized below. 
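The shape of such a loop can be sketched as follows. The mean vector and covariance matrix here are made-up numbers (not estimates from `ds15`), and the sketch uses the newer `np.random.default_rng` generator, whose `multivariate_normal` method takes the same mean, covariance, and size arguments as `np.random.multivariate_normal`.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical fitted parameters (NOT estimated from the GRE/GPA data)
mu = np.array([600.0, 3.2])
cov = np.array([[1600.0, 8.0],
                [8.0, 0.0625]])   # implies rho = 8 / (40 * 0.25) = 0.8

n_samp, n_boot = 15, 2000
pbs_rhos = np.empty(n_boot)
for k in range(n_boot):
    # One synthetic data set: n_samp rows, 2 columns
    R = rng.multivariate_normal(mu, cov, size=n_samp)
    # Off-diagonal entry of the 2 x 2 correlation matrix
    pbs_rhos[k] = np.corrcoef(R[:, 0], R[:, 1])[0, 1]

# Parametric-bootstrap SE = SD of the simulated correlation coefficients
se_rho_param = pbs_rhos.std(ddof=1)
print(pbs_rhos.mean(), se_rho_param)
```

The mean of the simulated coefficients should land near the correlation implied by the covariance matrix, and their standard deviation is the parametric-bootstrap standard error.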

In [None]:
rhos['pbs_rhos'] = np.zeros((n_boot,));

np.random.seed(123)
for k in range(n_boot):

    # Draw samples from normal distribution
    R = ...

    # Compute correlation coefficient (use np.corrcoef)
    rhos.loc[k, 'pbs_rhos'] = ...

**QUESTION (Q15)**: What is the standard error of the correlation coefficient as determined by parametric bootstrapping? Compute it in the cell below.

In [None]:
# TODO: get standard error of correlation coefficient from parametric bootstrapping
se_rho_PBS = ...

se_rho_PBS

In [None]:
# Visualize
ax = sns.histplot(rhos)
plt.legend(['Parametric', 'Census', 'Bootstrap'])
ax.set(xlabel = 'Correlation coefficient', ylabel = 'Probability');

In [None]:
print(se_rho_boot)
print(se_rho_boot_TS)
print(se_rho_PBS)

**QUESTION (Q16)**: How does the SE of the correlation coefficient from the parametric bootstrap compare to the SEs from our other bootstrapping strategies? If it's different, why do you think this may be so?

Let's look at a side-by-side view of our histograms


In [None]:
fig, ax = plt.subplots(1, 1, figsize = (10, 5))
ax.hist(np.array(rhos), bins = 25);

ax.set(xlim = [0, 1], xlabel = 'Correlation coefficient', ylabel = '# of bootstrap replicates',
       title = 'Distribution of rho values')
plt.legend(['Bootstrap','Census','Parametric']);

# Confidence intervals

We have used several different strategies to create sampling
distributions:
  1. Repeated sampling from the entire population.
  2. Repeated re-sampling from our original sample (bootstrap)
  3. Repeated sampling from a population defined by parameters derived
     from our original sample (parametric bootstrap)

But in each case, we have generated an estimate of the sampling
distribution for a given statistic. Thus far, we have used these
distributions to generate a single estimate of precision: the standard
error. However, we can use these same distributions to calculate other
measures of precision, such as confidence intervals. After all, under
normal assumptions, a standard error is a kind of confidence interval,
since we expect about 68% of the distribution to lie within +/- 1 s.d. of
its mean. That is, for example, the SEM can be thought of as defining a
68% CI for our estimate of the mean. But we can go further.


## CI by asymptotic normal distribution theory

Since the standard error corresponds to a 68% CI under normal assumptions, we can get any other CI by calculating the appropriate number of standard deviates from the normal distribution. Let's use our distribution of `rhos['bs_rhos']`, calculated above.

Below, we show the mean of the distribution with the vertical black dashed line.

In [None]:
# Plot of distribution of bootstrapped correlation coefficients
ax = sns.histplot(rhos['bs_rhos'], bins = 100)
_, y_max = ax.get_ylim()
ax.plot([rhos['bs_rhos'].mean(), rhos['bs_rhos'].mean()], [0, y_max], '--k')
ax.set(xlabel = 'Correlation coefficient', ylabel = 'Probability',
       title = 'Distribution of rho values: bootstrap',
       xlim = [0.15, 1.1]);

In [None]:
# This is our mean correlation
mean_rho_boot = rhos['bs_rhos'].mean()

# This is our standard error (the half-width of the ~68% CI)
se_rho_boot = rhos['bs_rhos'].std()

# Set alpha for a 95% CI
my_alpha = 0.05

You probably remember that a 95% CI is +/- 1.96 standard deviates. So we
could calculate our CI as `mean_rho_boot` +/- 1.96*`se_rho_boot`. But say we
wanted to be able to calculate any arbitrary confidence interval. For a
99% CI, we would set `my_alpha` to 0.01.
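One way to do this conversion is with the inverse CDF (percent point function) of the standard normal, `norm.ppf`. The sketch below wraps it in a hypothetical helper, `z_for_alpha`, and applies it to made-up values of an estimate and its standard error.

```python
from scipy.stats import norm

def z_for_alpha(alpha):
    """Two-sided critical value: leaves alpha / 2 in each tail."""
    return norm.ppf(1 - alpha / 2)

z95 = z_for_alpha(0.05)   # about 1.96
z99 = z_for_alpha(0.01)   # about 2.58

# Hypothetical estimate and standard error (NOT results from the exercise)
est, se = 0.50, 0.10
ci95 = (est - z95 * se, est + z95 * se)

print(z95, z99, ci95)
```

The interval is symmetric about the estimate by construction, which is worth keeping in mind when the statistic has a bounded range.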


In [None]:
# TODO: Write a line of code that will convert a desired CI, 
#       expressed as my_alpha to the appropriate number of standard deviates. 
#       Use `norm` from scipy.stats, imported below
from scipy.stats import norm
num_std_deviates = ...

**QUESTION (Q17)**: What is `num_std_deviates` for `my_alpha` = 0.001?

In [None]:
# TODO: Calculate the lower and upper bounds for the 95% CI
rho95_CI_low = ...
rho95_CI_hi = ...

print(rho95_CI_low)
print(rho95_CI_hi)

**QUESTION (Q18)**: What is the value of `rho95_CI_hi`?

**QUESTION (Q19)**: Does this value make sense? Why or why not?


In [None]:
# Plot of distribution of bootstrapped correlation coefficients
ax = sns.histplot(rhos['bs_rhos'], bins = 100)
_, y_max = ax.get_ylim()
ax.plot([rhos['bs_rhos'].mean(), rhos['bs_rhos'].mean()], [0, y_max], '--k')

# TODO: Draw lines for the 95%CI on our histogram in red 
ax.plot(...) 
ax.plot(...) 

ax.set(xlabel = 'Correlation coefficient', ylabel = 'Probability',
       title = 'Distribution of rho values: bootstrap',
       xlim = [0.15, 1.1]);

##  CI by percentile method

In this case, we generated 10,000 bootstrap replicates, so a more intuitive, brute-force way to calculate the 95% CI is just to sort our replicates and then find the values at the 250th and 9750th positions in the sorted array.
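The same idea on a smaller scale: the sketch below uses 1,000 synthetic, skewed replicates (drawn from a beta distribution purely for illustration, not from the exercise) and compares the sort-and-index approach with `np.percentile`.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical skewed "replicates" in [0, 1] (NOT from the exercise)
reps = rng.beta(8, 3, size=1000)
alpha = 0.05

# Brute force: sort, then index into the tails
reps_sorted = np.sort(reps)
n_reps = len(reps_sorted)
idx_lo = round(n_reps * alpha / 2)         # 25 for 1,000 replicates
idx_hi = round(n_reps * (1 - alpha / 2))   # 975 for 1,000 replicates
ci_sorted = (reps_sorted[idx_lo], reps_sorted[idx_hi])

# Library equivalent: interpolated percentiles
ci_percentile = np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print(ci_sorted, ci_percentile)
```

The two intervals differ only by interpolation between neighboring sorted values. Unlike the normal-theory interval, the percentile interval does not have to be symmetric about the mean.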


In [None]:
# Sort our bootstrap replicates
bs_rhos_sorted = rhos['bs_rhos'].sort_values()

# TODO: find indices corresponding to lower and upper bounds
idx_lo = ...   # index corresponding to lower bound
idx_hi = ...  # index corresponding to upper bound

# Get low and high bounds from the sorted replicates (use .iloc for positional indexing of a pandas Series)
rho95_CI_percentile_low = bs_rhos_sorted.iloc[int(idx_lo)]
rho95_CI_percentile_hi = bs_rhos_sorted.iloc[int(idx_hi)]

print(rho95_CI_percentile_low)
print(rho95_CI_percentile_hi)

**QUESTION (Q20)**: What is the lower bound of the 95% CI?


In [None]:
# Plot of distribution of bootstrapped correlation coefficients
ax = sns.histplot(rhos['bs_rhos'], bins = 100)
_, y_max = ax.get_ylim()
ax.plot([rhos['bs_rhos'].mean(), rhos['bs_rhos'].mean()], [0, y_max], '--k')

ax.plot([rho95_CI_low, rho95_CI_low], [0, y_max], '-r', label = 'Normal approx.') 
ax.plot([rho95_CI_hi, rho95_CI_hi], [0, y_max], '-r') 

# TODO: Draw lines for the 95%CI on our histogram in green 
ax.plot(..., label = 'Percentile') 
ax.plot(...) 

plt.legend()
ax.set(xlabel = 'Correlation coefficient', ylabel = 'Probability',
       title = 'Distribution of rho values: bootstrap',
       xlim = [0.15, 1.1]);

Question 21 is specific to Matlab

**QUESTION (Q22)**: Think about this confidence interval and your earlier guess about whether GRE score and GPA are correlated. How can you use this to generate a hypothesis test? (i.e. Can we say that GRE and GPA are significantly correlated at p < 0.05?)

**QUESTION (Q23)**: Today we've explored bootstrapping as a way to estimate standard errors and confidence intervals for means and correlation coefficients. Which of these measures are the most robust across our different ways of bootstrapping and estimating? Which are more sensitive to the method we chose?

