# <center> Resampling

## <center>Objectives

#### <center>Be able to use NumPy to generate random data, draw samples from distributions, and resample data.

#### <center>Understand some applications of resampling, including the bootstrap and jackknife methods.

## First, some basic operations

In [None]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline

Generate a random value from the standard normal distribution
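One possible answer, assuming the legacy `np.random` interface that the rest of this notebook uses. Called with no arguments, `np.random.normal` draws from a normal distribution with mean 0 and standard deviation 1. The name `value` is only for illustration.

```python
import numpy as np

# loc=0.0 and scale=1.0 are the defaults, i.e. the standard normal.
value = np.random.normal()
print(value)
```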

Generate 5 random values from the standard normal distribution
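A minimal sketch: the `size` argument controls how many values are drawn. The name `values` is an assumption made for this example.

```python
import numpy as np

# Five independent draws from the standard normal distribution.
values = np.random.normal(size=5)
print(values)
```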

Generate 5 random values from a normal distribution with a mean of 4 and a standard deviation of 2.
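A sketch of one way to do this: `loc` sets the mean and `scale` sets the standard deviation (not the variance). The name `values` is illustrative.

```python
import numpy as np

# Mean 4, standard deviation 2, five draws.
values = np.random.normal(loc=4, scale=2, size=5)
print(values)
```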

Generate 5 random values between 50 and 70 with equal probabilities.
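"Equal probabilities" is read here as a continuous uniform distribution, which is an assumption about the prompt's intent. `np.random.uniform` draws from the half-open interval `[low, high)`.

```python
import numpy as np

# Every value in [50, 70) is equally likely.
values = np.random.uniform(low=50, high=70, size=5)
print(values)
```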

Generate 5 random integers between 1 and 100.
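A possible answer, assuming both 1 and 100 should be possible results. `np.random.randint` excludes its `high` argument, so `high=101` is needed to include 100.

```python
import numpy as np

# low is inclusive, high is exclusive: results lie in 1..100.
ints = np.random.randint(low=1, high=101, size=5)
print(ints)
```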

Generate the same 5 random integers.
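Reproducible draws come from fixing the seed of the random number generator. The sketch below resets the seed before each call; the seed value 42 and the names `first` and `second` are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed chosen for this sketch
first = np.random.randint(low=1, high=101, size=5)

np.random.seed(42)  # resetting the same seed restarts the sequence
second = np.random.randint(low=1, high=101, size=5)

print(first)
print(second)
```

Both printed arrays are identical because the generator starts from the same state each time.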

Pick 5 values from a list with replacement.
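A minimal sketch using `np.random.choice`. The list `items` is made-up example data. With `replace=True` (the default) the same element can be picked more than once.

```python
import numpy as np

items = [10, 20, 30, 40, 50, 60]  # hypothetical example data

# Sampling with replacement: duplicates are allowed.
picked = np.random.choice(items, size=5, replace=True)
print(picked)
```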

Pick 5 values from a list without replacement.
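The same call with `replace=False` guarantees that each element is picked at most once, so the requested size cannot exceed the length of the list. The list `items` is again made-up example data.

```python
import numpy as np

items = [10, 20, 30, 40, 50, 60]  # hypothetical example data

# Sampling without replacement: all five picks are distinct elements.
picked = np.random.choice(items, size=5, replace=False)
print(picked)
```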

Pick 5 values from a list with probability weights.
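Weights are passed through the `p` argument, which needs one probability per element and must sum to 1. The values in `items` and `weights` below are invented for the example.

```python
import numpy as np

items = [1, 2, 3]          # hypothetical example data
weights = [0.7, 0.2, 0.1]  # one probability per item, summing to 1

# Element 1 is drawn about 70% of the time, element 3 about 10%.
picked = np.random.choice(items, size=5, p=weights)
print(picked)
```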

Shuffle the order of an array.
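Two options are sketched here: `np.random.permutation` returns a shuffled copy and leaves the input alone, while `np.random.shuffle` reorders the array in place and returns `None`. The array `arr` is example data.

```python
import numpy as np

arr = np.arange(10)  # hypothetical example data: 0..9

# Shuffled copy; arr itself is unchanged by this call.
shuffled_copy = np.random.permutation(arr)

# In-place shuffle; arr is modified and nothing is returned.
np.random.shuffle(arr)

print(shuffled_copy)
print(arr)
```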

Generate 10 samples of 5 elements each of an array.
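Passing a tuple as `size` to `np.random.choice` returns a 2-D array. In this sketch each row is treated as one sample of 5 elements, and the draws are made with replacement; `arr` is example data.

```python
import numpy as np

arr = np.arange(20)  # hypothetical example data: 0..19

# 10 rows, 5 columns: each row is one resample drawn with replacement.
samples = np.random.choice(arr, size=(10, 5))
print(samples)
```

The bootstrap cells later in this notebook use the same idea with the sample dimension along columns instead of rows.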

## <center>Bootstrap

## First with normal data

In [None]:
n=1000
mu = 50
sigma = 5
x = np.random.normal(loc=mu, scale=sigma, size=n)

In [None]:
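# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(x, kde=True, stat="density")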

In [None]:
x_bar = 60

In [None]:
# Theoretical P(X > x_bar); sf is the survival function, 1 - cdf.
stats.norm(mu, sigma).sf(x_bar)

In [None]:
reps = 10000
# Draw reps bootstrap samples of size n with replacement; each column is one sample.
bootstrap_samples = np.random.choice(x, size=(n, reps))

In [None]:
# Fraction of values above x_bar within each bootstrap sample (column), averaged over all samples.
np.mean((bootstrap_samples > x_bar).mean(axis=0))

## Now with non-normal data

In [None]:
x = np.concatenate([np.random.exponential(15,size=200), np.random.normal(50,20,size=100)])
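# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(x, kde=True, stat="density")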

In [None]:
x_bar = 80
# Normal approximation to P(X > x_bar), using the sample mean and standard deviation.
stats.norm(x.mean(), x.std()).sf(x_bar)

In [None]:
reps = 10000
n = len(x)  # bootstrap samples should match the size of the new data (300 points)
bootstrap_samples = np.random.choice(x, size=(n, reps))

In [None]:
# Fraction of values above x_bar within each bootstrap sample (column), averaged over all samples.
np.mean((bootstrap_samples > x_bar).mean(axis=0))

## Bootstrap confidence intervals around mean

In [None]:
bootstrap_mean = bootstrap_samples.mean(axis=0)
l_i, u_i = np.percentile(bootstrap_mean, [2.5, 97.5])
l_i, u_i

In [None]:
x.mean()-l_i, u_i-x.mean()

In [None]:
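# A confidence interval around the mean uses the standard error of the mean, not the standard deviation of the data.
norm_li, norm_ui = stats.norm.interval(0.95, loc=np.mean(x), scale=stats.sem(x))
norm_li, norm_ui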

In [None]:
sns.histplot(x, kde=True, stat="density")  # distplot is deprecated
plt.vlines(x.mean(), 0, 0.03)  # sample mean
plt.vlines([l_i, u_i], 0, 0.03, colors='r')  # bootstrap percentile interval
plt.vlines([norm_li, norm_ui], 0, 0.03, colors='g')  # normal-theory interval

## <center> Jackknife

The bootstrap is the most widely used resampling method, but it isn't the only one. Another is the jackknife, which generates additional samples by making copies of the original data, each missing exactly one of the original data points. A data set of n points therefore yields n jackknife samples of n-1 points each. <br> <br>
For example, say we have [1,2,3,4] as our data. Using the jackknife method, our generated samples would be: [1,2,3], [2,3,4], [1,3,4], and [1,2,4]. 
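The leave-one-out idea can be sketched in a few lines. This is only one possible approach, not a reference solution; the helper name `leave_one_out` is made up for this example.

```python
import numpy as np

def leave_one_out(data):
    """Return a list of arrays, each omitting one element of `data`."""
    data = np.asarray(data)
    # np.delete returns a copy of `data` without the element at index i.
    return [np.delete(data, i) for i in range(len(data))]

samples = leave_one_out([1, 2, 3, 4])
for s in samples:
    print(s)
```

The samples come out in the order in which elements are dropped, so the first one omits 1 and the last one omits 4.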

# <center> Activity

Your task is to first create a <b> jackknife </b> function that will take an array of data and return a list of resampled arrays through the jackknife method. <br><br>
Then, create both a normal and non-normal sample. <br><br>
For each sample, using a normal distribution first and then the resampling distribution:
 - Choose an extreme value and calculate the p-value for it.
 - Calculate the 95% confidence interval around the mean. <br>
 
#### Discuss
What are the advantages and disadvantages of the bootstrap method vs. the jackknife method?

In [None]:
def jackknife(x):
    return