# Resampling methods
This chapter gives a brief introduction to resampling methods and their applications, with a taste of bootstrap resampling, jackknife resampling, and permutation testing. After completing this chapter, students will be able to start applying simple resampling methods to data analysis.

# 1. Introduction to resampling methods
## 1.1 Sampling with replacement
In this example, you will review the `np.random.choice()` function that you've already seen in the previous chapters. You are given multiple variations of `np.random.choice()` for sampling from arrays. Look at each variation carefully and use the console to test out the options. __Select the option that could generate `['a', 'c', 'c']` as an output__.

Possible Answers
1. `np.random.choice(['a', 'b', 'c'], size=3, replace=False)`
2. `np.random.choice(['a', 'b', 'c', 'd', 'e'], size=5, replace=True)[:3]`
3. `np.random.choice(['a', 'b', 'c', 'd', 'e'], size=5, replace=False)[:3]`
4. `np.random.choice(['a', 'b'], size=3, replace=True)`

In [1]:
import numpy as np

np.random.choice(['a', 'b', 'c'], size=3, replace=False)

array(['b', 'c', 'a'], dtype='<U1')

In [2]:
np.random.choice(['a', 'b', 'c', 'd', 'e'], size=5, replace=True)[:3]

array(['e', 'd', 'd'], dtype='<U1')

In [3]:
np.random.choice(['a', 'b', 'c', 'd', 'e'], size=5, replace=False)[:3]

array(['c', 'b', 'a'], dtype='<U1')

In [4]:
np.random.choice(['a', 'b'], size=3, replace=True)

array(['b', 'a', 'b'], dtype='<U1')

<div align="right">Answer: (2)</div>
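
One way to check the answer without relying on random draws is to enumerate what each option can produce. The sketch below uses only the standard library; the helper name `can_generate` is hypothetical and not part of the exercise. It relies on the fact that sampling without replacement can only yield orderings of distinct elements, while sampling with replacement can yield any ordered tuple.

```python
from itertools import permutations, product

target = ('a', 'c', 'c')

def can_generate(population, size, replace, target, keep=3):
    """Return True if `target` can appear as the first `keep` items of a draw."""
    # With replacement every ordered tuple is possible;
    # without replacement only permutations of distinct elements are.
    if replace:
        draws = product(population, repeat=size)
    else:
        draws = permutations(population, size)
    return any(draw[:keep] == target for draw in draws)

# The four answer options: (population, size, replace)
options = {
    1: (['a', 'b', 'c'], 3, False),
    2: (['a', 'b', 'c', 'd', 'e'], 5, True),
    3: (['a', 'b', 'c', 'd', 'e'], 5, False),
    4: (['a', 'b'], 3, True),
}
feasible = [k for k, (pop, size, rep) in options.items()
            if can_generate(pop, size, rep, target)]
print(feasible)  # [2]
```

Options 1 and 3 cannot repeat `'c'`, and option 4 has no `'c'` to draw, which leaves option 2.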

## 1.2 Probability example
In this exercise, we will review the difference between sampling with and without replacement. We will estimate the probability of an event by simulation and vary the sampling method to see how it affects the result.

Consider a bowl filled with colored candies: three blue, two green, and five yellow. Draw three candies at random, with replacement and without replacement. You want to know the probability of __drawing a blue candy first, a green candy second, and a yellow candy third__.

### Instructions:
* Set up your sample space `bowl` as a list having three blue `'b'`, two green `'g'` and five yellow `'y'` candies.
* Draw a sample of three candies with replacement (`sample_rep`) and without replacement (`sample_no_rep`).
* Write down the success condition: the first candy is blue, the second is green, and the third is yellow.
* Calculate the probability with and without replacement as successes divided by the number of iterations.

In [5]:
np.random.seed(123)

# Initialize the success counters and the number of simulations
success_rep, success_no_rep, sims = 0, 0, 10000
# Set up the bowl: three blue, two green, five yellow
bowl = ['b', 'b', 'b', 'g', 'g', 'y', 'y', 'y', 'y', 'y']

for i in range(sims):
    # Sample with and without replacement
    sample_rep = np.random.choice(bowl, size=3, replace=True)
    sample_no_rep = np.random.choice(bowl, size=3, replace=False)
    # Success: blue first, green second, yellow third
    if sample_rep[0] == 'b' and sample_rep[1] == 'g' and sample_rep[2] == 'y':
        success_rep += 1
    if sample_no_rep[0] == 'b' and sample_no_rep[1] == 'g' and sample_no_rep[2] == 'y':
        success_no_rep += 1

# Calculate probabilities
prob_with_replacement = success_rep/sims
prob_without_replacement = success_no_rep/sims
print("Probability with replacement = {}, without replacement = {}".format(prob_with_replacement, prob_without_replacement))

Probability with replacement = 0.0266, without replacement = 0.0415
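
The loop above counts draws that come out blue, then green, then yellow, so the simulated values can be checked against the exact probability of that sequence. The following is a minimal sketch using only the standard library; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Bowl composition from the exercise: 3 blue, 2 green, 5 yellow (10 candies).
blue, green, yellow = 3, 2, 5
total = blue + green + yellow

# With replacement the bowl is unchanged between draws.
p_rep = Fraction(blue, total) * Fraction(green, total) * Fraction(yellow, total)

# Without replacement the bowl shrinks by one candy after each draw.
p_no_rep = (Fraction(blue, total)
            * Fraction(green, total - 1)
            * Fraction(yellow, total - 2))

print(p_rep, float(p_rep))        # 3/100 0.03
print(p_no_rep, float(p_no_rep))  # 1/24, about 0.0417
```

The exact values, 0.03 with replacement and 1/24 (about 0.0417) without, are close to the simulated 0.0266 and 0.0415. The gap is the sampling noise expected from 10,000 iterations.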


# 2. Bootstrapping
## 2.1 Running a simple bootstrap
This first exercise in the bootstrapping section works through a simple bootstrap. As we saw in the video, the main idea behind bootstrapping is sampling with replacement.

Suppose you own a factory that produces wrenches. You want to characterize the average length of the wrenches and ensure that they meet specifications. Your factory produces thousands of wrenches every day, so measuring every wrench is infeasible. However, you have access to a representative sample of 100 wrenches. Let's use bootstrapping to get the 95% confidence interval (CI) for the average length.

Examine the array `wrench_lengths`, which holds the 100 observed wrench lengths.

### Instructions:
* Draw a random sample with replacement from `wrench_lengths` and store it in `temp_sample`. Set `size = len(wrench_lengths)`.
* Calculate the mean length of each sample, assign it to `sample_mean`, and then append it to `mean_lengths`.
* Calculate the bootstrapped mean (`boot_mean`) and bootstrapped 95% confidence interval (`boot_95_ci`) by using `np.percentile()`.

In [6]:
np.random.seed(123)
wrench_lengths = np.random.normal(loc=10, scale=1, size=100)
print(wrench_lengths)

[ 8.9143694  10.99734545 10.2829785   8.49370529  9.42139975 11.65143654
  7.57332076  9.57108737 11.26593626  9.1332596   9.32111385  9.90529103
 11.49138963  9.361098    9.55601804  9.56564872 12.20593008 12.18678609
 11.0040539  10.3861864  10.73736858 11.49073203  9.06416613 11.17582904
  8.74611933  9.3622485  10.9071052   8.5713193   9.85993128  9.1382451
  9.74438063  7.20141089  8.2284669   9.30012277 10.92746243  9.82636432
 10.00284592 10.68822271  9.12046366 10.28362732  9.19463348  8.27233051
  9.60910021 10.57380586 10.33858905  9.98816951 12.39236527 10.41291216
 10.97873601 12.23814334  8.70591468  8.96121179 11.74371223  9.20193726
 10.02968323 11.06931597 10.89070639 11.75488618 11.49564414 11.06939267
  9.22729129 10.79486267 10.31427199  8.67373454 11.41729905 10.80723653
 10.04549008  9.76690794  8.80169886 10.19952407 10.46843912  9.16884502
 11.16220405  8.90279695  7.87689965 11.03972709  9.59663396  9.87397041
  9.16248328  8.39403724 11.25523737  9.31113102 11.

In [7]:
# Draw random samples with replacement and append each sample mean to mean_lengths.
mean_lengths, sims = [], 1000
for i in range(sims):
    temp_sample = np.random.choice(wrench_lengths, replace=True, size=len(wrench_lengths))
    sample_mean = np.mean(temp_sample)
    mean_lengths.append(sample_mean)
    
# Calculate bootstrapped mean and 95% confidence interval.
boot_mean = np.mean(mean_lengths)
boot_95_ci = np.percentile(mean_lengths, [2.5, 97.5])
print("Bootstrapped Mean Length = {}, 95% CI = {}".format(boot_mean, boot_95_ci))

Bootstrapped Mean Length = 10.027059690070363, 95% CI = [ 9.78662216 10.24854356]
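
The loop above can be wrapped into a small reusable helper. The sketch below is one possible way to do that, not the course's own code: the function name `bootstrap_ci`, its parameters, and the stand-in data `demo` are all hypothetical, and it uses NumPy's `default_rng` generator rather than the global `np.random` state used in the exercise.

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, sims=1000, level=95, seed=None):
    """Percentile bootstrap: return (mean of replicates, [lower, upper])."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Each replicate resamples len(data) observations with replacement.
    replicates = np.array([
        stat(rng.choice(data, size=len(data), replace=True))
        for _ in range(sims)
    ])
    tail = (100 - level) / 2
    return replicates.mean(), np.percentile(replicates, [tail, 100 - tail])

# Hypothetical stand-in data drawn from the same distribution as wrench_lengths.
demo = np.random.default_rng(0).normal(loc=10, scale=1, size=100)
boot_mean_demo, ci_demo = bootstrap_ci(demo, seed=1)
print(boot_mean_demo, ci_demo)
```

Passing a different `stat`, for example `np.median`, reuses the same resampling logic for another estimator.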


## 2.2 Non-standard estimators
In the last exercise, you ran a simple bootstrap that we will now modify for more complicated estimators.

Suppose you are studying the health of students. You are given the height and weight of 1000 students and are interested in the median height and in the correlation between height and weight, each with its 95% CI. Let's use bootstrapping.

Examine the `pandas` DataFrame `df` with the heights and weights of 1000 students. _Using this, calculate the 95% CI for both the __median height__ and the __correlation__ between height and weight._

### Instructions:
* Use the `.sample()` method on `df` to generate a sample of the data with replacement and assign it to `tmp_df`.
* For each generated dataset `tmp_df`, calculate the median height and the correlation between heights and weights using `.median()` and `.corr()`.
* Append the median height to `height_medians` and the correlation to `hw_corr`.
* Finally calculate the 95% confidence intervals for each of the above quantities using `np.percentile()`.

In [8]:
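import pandas as pd
import numpy as np

# Simulated data: heights, and weights that depend linearly on height plus noise
heights = np.random.normal(loc=5.5, scale=2, size=1000)
weights = np.random.normal(loc=heights*35, scale=25, size=1000)
df = pd.DataFrame({'heights':heights, 'weights':weights})
# The seed is set after the data are generated, so it only fixes the bootstrap draws below
np.random.seed(123)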

In [9]:
# Sample with replacement and calculate quantities of interest
sims, data_size, height_medians, hw_corr = 1000, df.shape[0], [], []
for i in range(sims):
    tmp_df = df.sample(n=data_size, replace=True)
    height_medians.append(tmp_df.heights.median())
    hw_corr.append(tmp_df.weights.corr(tmp_df.heights))

# Calculate confidence intervals
height_median_ci = np.percentile(height_medians, [2.5, 97.5])
height_weight_corr_ci = np.percentile(hw_corr, [2.5, 97.5])
print("Height Median CI = {} \nHeight Weight Correlation CI = {}".format( height_median_ci, height_weight_corr_ci))

Height Median CI = [5.39064456 5.75530169] 
Height Weight Correlation CI = [0.93215527 0.94644951]
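
`df.sample()` draws whole rows, so each student's height and weight stay together in every bootstrap sample. The sketch below illustrates why that matters for the correlation, using NumPy index resampling on hypothetical paired data (`x`, `y`, and all other names are made up for this illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired data: y depends on x, so the pairs must stay together.
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

n = len(x)
corr_paired, corr_broken = [], []
for _ in range(500):
    # One set of row indices applied to both variables keeps the pairing.
    idx = rng.integers(0, n, size=n)
    corr_paired.append(np.corrcoef(x[idx], y[idx])[0, 1])
    # Independent indices for y break the pairing and destroy the correlation.
    idx_y = rng.integers(0, n, size=n)
    corr_broken.append(np.corrcoef(x[idx], y[idx_y])[0, 1])

print(np.percentile(corr_paired, [2.5, 97.5]))
print(np.percentile(corr_broken, [2.5, 97.5]))
```

Resampling the two columns independently gives an interval centered near zero, which says nothing about the relationship in the original data.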


## 2.3 Bootstrapping regression
Now let's see how bootstrapping works with regression. Bootstrapping helps estimate the uncertainty of non-standard estimators. Consider the ${R^2}$ statistic associated with a regression. When you run a simple least squares regression, you get a single value for ${R^2}$. Let's see how we can get a 95% CI for ${R^2}$.

Examine the DataFrame `df` with a dependent variable $y$ and two independent variables ${X1}$ and ${X2}$ using `df.head()`. We've already fit this regression with `statsmodels` (`sm`) using:
```
reg_fit = sm.OLS(df['y'], df.iloc[:,1:]).fit()
```
Examine the result using `reg_fit.summary()` to find that ${R^2=0.3504}$. Use bootstrapping to calculate the 95% CI.

### Instructions:
* Draw a bootstrap sample from the original dataset using the `sample()` method of a pandas DataFrame. The number of rows should be the same as that of the original DataFrame.
* Fit a regression similar to `reg_fit` using `sm.OLS()` and extract the ${R^2}$ statistic.
* Append the ${R^2}$ to the list `rsquared_boot`.
* Calculate the 95% CI for `rsquared_boot` as `r_sq_95_ci` using `np.percentile()`.

In [10]:
import statsmodels.api as sm
nobs = 1000
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, -0.5, 0.5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e
df = pd.DataFrame(X)
df = pd.concat([pd.Series(y), df], axis=1)
df.columns = ['y', 'Intercept', 'X1', 'X2']
reg_fit = sm.OLS(df['y'], df.iloc[:,1:]).fit()
np.random.seed(123)

In [11]:
# reg_fit was already fit in the setup cell, so only the bootstrap containers are needed here
rsquared_boot, sims = [], 1000

# Run 1K iterations
for i in range(sims):
    # First create a bootstrap sample with replacement with n=df.shape[0]
    bootstrap = df.sample(n=df.shape[0], replace=True)
    # Fit the regression and append the r square to rsquared_boot
    rsquared_boot.append(sm.OLS(bootstrap['y'],bootstrap.iloc[:,1:]).fit().rsquared)

# Calculate 95% CI on rsquared_boot
r_sq_95_ci = np.percentile(rsquared_boot, [2.5, 97.5])
print("R Squared 95% CI = {}".format(r_sq_95_ci))

R Squared 95% CI = [0.337539  0.4289047]
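
The same row-resampling idea can be sketched with NumPy alone. This is an illustrative stand-in, not the exercise's solution: `np.linalg.lstsq` replaces `sm.OLS`, the helper `r_squared` is hypothetical, and the data are freshly simulated in the same spirit as the setup cell, so the numbers will differ from those printed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: intercept plus two uniform regressors, uniform noise.
nobs = 1000
X = np.column_stack([np.ones(nobs), rng.random((nobs, 2))])
y = X @ np.array([1.0, -0.5, 0.5]) + rng.random(nobs)

def r_squared(X, y):
    """Centered R^2 of a least-squares fit (X includes an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rsq = []
for _ in range(1000):
    # Resample row indices with replacement so X and y stay aligned.
    idx = rng.integers(0, nobs, size=nobs)
    rsq.append(r_squared(X[idx], y[idx]))

rsq_point = r_squared(X, y)
rsq_ci = np.percentile(rsq, [2.5, 97.5])
print(rsq_point, rsq_ci)
```

The point estimate from the full data should fall inside the percentile interval, which gives a quick sanity check on the bootstrap loop.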
