# Resampling Methods - Lab

## Introduction

Now that you have some preliminary background on bootstrapping, jackknife, and permutation tests, it's time to practice those skills by coding them into functions. You'll then apply these techniques to a hypothesis test and compare the results to a parametric t-test.

## Objectives

In this lab you will: 

* Create functions that perform resampling techniques and use them on datasets

## Bootstrap sampling


Here, bootstrap sampling pools two distinct samples into a universal set and draws random samples, with replacement, from this combined sample space. Comparing these random splits with the two original samples shows whether the difference between the two **original** samples is statistically significant: if random resampling frequently produces differences at least as large, then the observed difference is not actually significant.


Write a function to perform bootstrap sampling. The function should take in two samples A and B. The two samples need not be the same size. From this, create a universal sample by combining A and B. Then, create a resampled universal sample of the same size using random sampling with replacement. Finally, split this randomly generated universal set into two samples which are the same size as the original samples, A and B. The function should return these resampled samples.

Example:

```python

A = [1,2,3]
B = [2,2,5,6]

Universal_Set = [1,2,2,2,3,5,6]
Resampled_Universal_Set = [6, 2, 3, 2, 1, 1, 2] # Could be different (randomly generated with replacement)

Resampled_A = [6,2,3]
Resampled_B = [2,1,1,2]
```

In [55]:
import numpy as np
from scipy import stats

def bootstrap(A, B):
    Universal_Set = A + B
    # .tolist() converts NumPy scalars back to plain Python numbers
    Resampled_Universal_Set = np.random.choice(Universal_Set, 
                                               size=len(Universal_Set), replace=True).tolist()
    Resampled_A = Resampled_Universal_Set[:len(A)]
    Resampled_B = Resampled_Universal_Set[len(A):]
    return Resampled_A, Resampled_B

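A = [1,2,3]
B = [2,2,5,6] 

# Call the function so the resampled samples exist before printing them
Resampled_A, Resampled_B = bootstrap(A, B)

print(f"Resampled sample A :>, {Resampled_A}")
print(f"Resampled sample B :>, {Resampled_B}")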

Resampled sample A :>, [5, 5, 2]
Resampled sample B :>, [1, 2, 3, 5]


## Jackknife 

Write a function that creates additional samples by removing one element at a time. The function should do this for each of the `n` items in the original sample, returning `n` samples, each with `n-1` members.

In [75]:
def jack1(sample):
    """First attempt: take a list of n observations and return n lists.

    Bug: the slice below drops the last n elements on each pass, so this
    returns progressively shorter prefixes, not leave-one-out samples.
    See jack2 for a version that removes exactly one element at a time."""
    new_sample_list = []
    for n in range(len(sample)):
        new_sample = sample[:len(sample)-n]
        new_sample_list.append(new_sample)
    return new_sample_list

sample = [1,2,2,2,3,5,6]
print(jack1(sample))

[[1, 2, 2, 2, 3, 5, 6], [1, 2, 2, 2, 3, 5], [1, 2, 2, 2, 3], [1, 2, 2, 2], [1, 2, 2], [1, 2], [1]]


In [77]:
def jack2(sample):
    """Take a list of n observations and return n lists,
    each with one member (the i-th) removed."""
    samples = []
    for i in range(len(sample)):
        new_sample = sample[:i] + sample[i+1:]
        samples.append(new_sample)
    return samples

sample = [1,2,2,2,3,5,6]
print(jack2(sample))

[[2, 2, 2, 3, 5, 6], [1, 2, 2, 3, 5, 6], [1, 2, 2, 3, 5, 6], [1, 2, 2, 3, 5, 6], [1, 2, 2, 2, 5, 6], [1, 2, 2, 2, 3, 6], [1, 2, 2, 2, 3, 5]]
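
A common next step with leave-one-out samples is to recompute a statistic on each of them. The following is a minimal sketch, assuming the statistic of interest is the mean; the helper name `leave_one_out` is hypothetical and simply restates the slicing used in `jack2`.

```python
from statistics import mean

def leave_one_out(sample):
    """Return n samples, each missing exactly one of the n observations."""
    return [sample[:i] + sample[i + 1:] for i in range(len(sample))]

sample = [1, 2, 2, 2, 3, 5, 6]

# One mean per leave-one-out sample (7 means for 7 observations)
loo_means = [mean(s) for s in leave_one_out(sample)]
print(loo_means)
```

The spread of these means gives a rough sense of how much any single observation influences the estimate.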


In [81]:
sample = [1,2,2,2,3,5,6]
sample[1:]

[2, 2, 2, 3, 5, 6]

## Permutation testing

Define a function that generates all possible two-set splits of the pooled elements of two sets A and B, where every split has the same group sizes as the originals. Sets A and B need not be the same size. For example, given a set with 5 members and a set with 7 members, the function would return all possible 5-7 splits of the 12 items.

> Note that these are actually combinations! However, as noted previously, permutation tests really investigate possible regroupings of the data observations, so calculating combinations is a more efficient approach!


Here's a more in-depth example:

```python
A = [1, 2, 2]
B = [1, 3]
combT(A, B) 
[([1,2,2], [1,3]),
 ([1,2,3], [1,2]),
 ([1,1,2], [2,3]),
 ([1,1,3], [2,2]),
 ([2,2,3], [1,1])]
               
```  

These are all the possible 3-2 member splits of the 5 elements: 1, 1, 2, 2, 3. 
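
One possible way to produce these splits is sketched below. It assumes the elements are sortable and that, as in the handwritten example, splits that differ only by swapping duplicate values count once. The name `combT_sketch` is hypothetical, and this is only one of several reasonable approaches.

```python
from itertools import combinations

def combT_sketch(a, b):
    """Return the distinct (len(a), len(b)) splits of the pooled observations."""
    pooled = list(a) + list(b)
    indices = range(len(pooled))
    seen = set()
    splits = []
    # Choose which positions of the pooled data go to the first group
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = sorted(pooled[i] for i in chosen)
        group_b = sorted(pooled[i] for i in indices if i not in chosen_set)
        key = (tuple(group_a), tuple(group_b))
        if key not in seen:  # skip regroupings that only swap duplicate values
            seen.add(key)
            splits.append((group_a, group_b))
    return splits

splits = combT_sketch([1, 2, 2], [1, 3])
for group_a, group_b in splits:
    print(group_a, group_b)
```

Iterating over index combinations rather than values keeps duplicate observations distinct until the final de-duplication step. When the goal is a p-value, it may be preferable to drop the `seen` check and keep every index combination, since each relabeling of the observations is equally likely under the null hypothesis.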

In [None]:
def combT(a,b):
    # Your code here

In [71]:
A = [1, 2, 2]
B = [1, 3]
combined = A + B
import itertools
combs = list(itertools.combinations(combined, len(A)))
combs

[(1, 2, 2),
 (1, 2, 1),
 (1, 2, 3),
 (1, 2, 1),
 (1, 2, 3),
 (1, 1, 3),
 (2, 2, 1),
 (2, 2, 3),
 (2, 1, 3),
 (2, 1, 3)]

In [72]:
from itertools import combinations

def equally_sized_splits(A, B):
    all_splits = []
    total_len = len(A) + len(B)

    for split_len in range(1, total_len // 2 + 1):
        for split_indices_A in combinations(range(len(A)), split_len):
            split_indices_B = tuple(i for i in range(len(B)) if i not in split_indices_A)
            split_A = [A[i] for i in split_indices_A]
            split_B = [B[i] for i in split_indices_B]
            all_splits.append((split_A, split_B))

    return all_splits

# Example usage
list_A = [1, 2, 3, 4, 5]
list_B = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

splits = equally_sized_splits(list_A, list_B)
for i, (split_A, split_B) in enumerate(splits):
    print(f"Split {i + 1}: List A: {split_A}, List B: {split_B}")

Split 1: List A: [1], List B: ['B', 'C', 'D', 'E', 'F', 'G']
Split 2: List A: [2], List B: ['A', 'C', 'D', 'E', 'F', 'G']
Split 3: List A: [3], List B: ['A', 'B', 'D', 'E', 'F', 'G']
Split 4: List A: [4], List B: ['A', 'B', 'C', 'E', 'F', 'G']
Split 5: List A: [5], List B: ['A', 'B', 'C', 'D', 'F', 'G']
Split 6: List A: [1, 2], List B: ['C', 'D', 'E', 'F', 'G']
Split 7: List A: [1, 3], List B: ['B', 'D', 'E', 'F', 'G']
Split 8: List A: [1, 4], List B: ['B', 'C', 'E', 'F', 'G']
Split 9: List A: [1, 5], List B: ['B', 'C', 'D', 'F', 'G']
Split 10: List A: [2, 3], List B: ['A', 'D', 'E', 'F', 'G']
Split 11: List A: [2, 4], List B: ['A', 'C', 'E', 'F', 'G']
Split 12: List A: [2, 5], List B: ['A', 'C', 'D', 'F', 'G']
Split 13: List A: [3, 4], List B: ['A', 'B', 'E', 'F', 'G']
Split 14: List A: [3, 5], List B: ['A', 'B', 'D', 'F', 'G']
Split 15: List A: [4, 5], List B: ['A', 'B', 'C', 'F', 'G']
Split 16: List A: [1, 2, 3], List B: ['D', 'E', 'F', 'G']
Split 17: List A: [1, 2, 4], List B: ['C'

In [73]:
from itertools import permutations

def equally_sized_splits(A, B):
    all_splits = []

    for perm_A in permutations(A):
        for perm_B in permutations(B):
            split_A = list(perm_A)
            split_B = list(perm_B)
            all_splits.append((split_A, split_B))

    return all_splits

# Example usage
list_A = [1, 2, 3, 4, 5]
list_B = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

splits = equally_sized_splits(list_A, list_B)
for i, (split_A, split_B) in enumerate(splits):
    print(f"Split {i + 1}: List A: {split_A}, List B: {split_B}")

Split 584291: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'C', 'F', 'B', 'E']
Split 584292: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'C', 'F', 'E', 'B']
Split 584293: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'E', 'B', 'C', 'F']
Split 584294: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'E', 'B', 'F', 'C']
Split 584295: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'E', 'C', 'B', 'F']
Split 584296: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'E', 'C', 'F', 'B']
Split 584297: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'E', 'F', 'B', 'C']
Split 584298: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'E', 'F', 'C', 'B']
Split 584299: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'F', 'B', 'C', 'E']
Split 584300: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'F', 'B', 'E', 'C']
Split 584301: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'F', 'C', 'B', 'E']
Split 584302: List A: [5, 4, 1, 3, 2], List B: ['G', 'D', 'A', 'F', 'C', 'E', 'B']
Spli

## Permutation testing in Practice
Let's further investigate the scenario proposed in the previous lesson. Below are two samples, A and B, containing mock blood pressure data for sample patients. The research study wants to determine whether there is a statistically significant difference in blood pressure between these two groups at a 5% significance level. First, calculate the mean blood pressure of each sample, then the difference of these means. Next, use your `combT()` function, defined above, to generate all possible splits of the pooled sample data into A-B groups of the same sizes as the original samples. For each split, calculate the mean blood pressure of the two groups and record the difference between these means. The number of recorded differences serves as the denominator when calculating the p-value associated with the difference between the original sample means.

For example, in our small handwritten example above:

$\mu_a = \frac{1+2+2}{3} = \frac{5}{3}$  
and  
$\mu_b = \frac{1+3}{2} = \frac{4}{2} = 2$  

Giving us

$\mu_a - \mu_b = \frac{5}{3} - 2 = -\frac{1}{3}$

In comparison, for our various combinations we have:

([1,2,2], [1,3]):  $\mu_a - \mu_b = \frac{5}{3} - 2 = -\frac{1}{3}$  
([1,2,3], [1,2]):  $\mu_a - \mu_b = 2 - \frac{3}{2} = \frac{1}{2}$  
([1,1,2], [2,3]):  $\mu_a - \mu_b = \frac{4}{3} - \frac{5}{2} = -\frac{7}{6}$  
([1,1,3], [2,2]):  $\mu_a - \mu_b = \frac{5}{3} - 2 = -\frac{1}{3}$  
([2,2,3], [1,1]):  $\mu_a - \mu_b = \frac{7}{3} - 1 = \frac{4}{3}$  

A standard hypothesis test for this scenario might be:

$H_0: \mu_a = \mu_b$  
$H_1: \mu_a < \mu_b$  
  
Thus, comparing our sample difference to the differences of our possible combinations, we count the combinations whose difference in means is the same as or greater than our sample statistic and divide by the total number of combinations. In this case, 4 out of 5 of the combinations produced the same or a greater difference in the two sample means. This value, 0.8, is a strong indication that we cannot reject the null hypothesis in this instance.
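
The arithmetic for the handwritten example can be checked with exact fractions. This is a small sketch that follows the counting rule described above (differences the same as or greater than the observed one); the helper `mean` and the variable names are illustrative only.

```python
from fractions import Fraction

def mean(values):
    """Exact mean of a list of integers."""
    return Fraction(sum(values), len(values))

# The five splits from the handwritten example
splits = [
    ([1, 2, 2], [1, 3]),
    ([1, 2, 3], [1, 2]),
    ([1, 1, 2], [2, 3]),
    ([1, 1, 3], [2, 2]),
    ([2, 2, 3], [1, 1]),
]

observed = mean([1, 2, 2]) - mean([1, 3])        # difference for the original samples
diffs = [mean(a) - mean(b) for a, b in splits]   # difference for every split

# Count splits whose difference is the same as or greater than the observed one
count = sum(d >= observed for d in diffs)
p_value = Fraction(count, len(diffs))
print(p_value)  # 4/5
```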

In [None]:
a = [109.6927759 , 120.27296943, 103.54012038, 114.16555857,
       122.93336175, 110.9271756 , 114.77443758, 116.34159338,
       112.66413025, 118.30562665, 132.31196515, 117.99000948]
b = [123.98967482, 141.11969004, 117.00293412, 121.6419775 ,
       123.2703033 , 123.76944385, 105.95249634, 114.87114479,
       130.6878082 , 140.60768727, 121.95433026, 123.11996767,
       129.93260914, 121.01049611]

In [None]:
# Your code here
# ⏰ Expect your code to take several minutes to run

## T-test revisited

The parametric statistical test equivalent to our permutation test above would be a t-test of the two groups. Perform a t-test on the same data above in order to calculate the p-value. How does this compare to the above results?
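
As a starting point, the sketch below shows the general shape of a two-sample t-test with `scipy.stats.ttest_ind`. The two short samples are hypothetical toy values, not the lab's blood pressure data, and the default settings (two-sided test, equal variances assumed) are an assumption rather than a requirement of the lab.

```python
from scipy import stats

# Hypothetical toy samples, NOT the lab's blood pressure data
group_a = [110.0, 115.5, 108.2, 120.1, 113.7]
group_b = [121.3, 126.8, 118.9, 130.4, 124.0, 122.5]

# Independent two-sample t-test; by default it is two-sided and assumes equal variances
result = stats.ttest_ind(group_a, group_b)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

Passing `equal_var=False` switches to Welch's t-test, which does not assume the two groups share a variance.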

In [None]:
# Your code here

## Bootstrap applied

Use your code above to apply the bootstrap technique to this hypothesis testing scenario. Here's a pseudo-code outline for how to do this:

1. Compute the difference between the sample means of A and B
2. Initialize a counter for the number of times the difference of the means of resampled samples is greater than or equal to the difference of the means of the original samples
3. Repeat the following process 10,000 times:
    1. Use the bootstrap sampling function you used above to create new resampled versions of A and B 
    2. Compute the difference between the means of these resampled samples 
    3. If the difference between the means of the resampled samples is greater than or equal to the original difference, add 1 to the counter you created in step 2
4. Compute the ratio between the counter and the number of simulations (10,000) that you performed
    > This ratio is the proportion of simulations in which the difference of sample means was greater than or equal to the original difference
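
The steps above can be sketched roughly as follows. This is a minimal illustration on hypothetical toy samples, not the lab's data; the function name `bootstrap_ratio`, the `rng` argument, and the fixed seed are assumptions added here for reproducibility.

```python
import numpy as np

def bootstrap(A, B, rng):
    """Pool A and B, resample with replacement, and split back into the original sizes."""
    pooled = np.concatenate([A, B])
    resampled = rng.choice(pooled, size=len(pooled), replace=True)
    return resampled[:len(A)], resampled[len(A):]

def bootstrap_ratio(A, B, n_sims=10_000, seed=0):
    rng = np.random.default_rng(seed)
    observed = np.mean(A) - np.mean(B)          # step 1: original difference in means
    count = 0                                   # step 2: counter
    for _ in range(n_sims):                     # step 3: repeat the resampling
        new_A, new_B = bootstrap(A, B, rng)
        if np.mean(new_A) - np.mean(new_B) >= observed:
            count += 1
    return count / n_sims                       # step 4: ratio of counter to simulations

# Hypothetical toy samples, NOT the lab's blood pressure data
group_a = [110.0, 115.5, 108.2, 120.1, 113.7]
group_b = [121.3, 126.8, 118.9, 130.4, 124.0, 122.5]

ratio = bootstrap_ratio(group_a, group_b)
print(ratio)
```

Passing a seeded generator into `bootstrap` keeps the simulation repeatable, which makes it easier to compare the result against the permutation test and the t-test.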

In [None]:
# Your code here

## Summary

Well done! In this lab, you practiced coding modern statistical resampling techniques of the 20th century! You also started to compare these non-parametric methods to parametric methods, such as the t-test that we previously discussed.