# Resampling

Concepts
    
* Sample: the statistical process of selecting a portion of the population with the intent of making inferences about population parameters.
* Resample: methods that reuse a data sample to improve the estimate of a population parameter and to help quantify the uncertainty of that estimate.

## Sampling

Aspects to consider prior to collecting a data sample

* Sample goal
* Population
* Selection criteria
* Sample size

Three common types of sampling in applied ML

1. Simple random sampling
2. Systematic sampling
3. Stratified sampling
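
The three sampling types can be sketched with NumPy. This is a minimal illustration, not a reference implementation: the population of 100 unit ids, the 80/20 strata in `labels`, the sample size `n`, and the seed are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily for repeatability
population = np.arange(100)           # hypothetical population of 100 unit ids
labels = np.repeat([0, 1], [80, 20])  # hypothetical strata: 80 units in 0, 20 in 1
n = 10                                # assumed sample size

# 1. Simple random sampling: every unit has the same chance of selection.
simple = rng.choice(population, size=n, replace=False)

# 2. Systematic sampling: pick a random start, then take every step-th unit.
step = len(population) // n
start = rng.integers(step)
systematic = population[start::step][:n]

# 3. Stratified sampling: sample each stratum in proportion to its size.
stratified = np.concatenate([
    rng.choice(
        population[labels == s],
        size=int(round(n * np.mean(labels == s))),
        replace=False,
    )
    for s in np.unique(labels)
])

print(simple, systematic, stratified, sep="\n")
```

In this sketch the stratified sample always contains 8 units from stratum 0 and 2 from stratum 1, whereas a simple random sample may over- or under-represent the smaller stratum by chance.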

We can generalize estimates from a sample to the population, but this process introduces error. We can usually quantify that error with statistical methods (e.g., confidence intervals).
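
As a rough illustration of quantifying that error, the sketch below computes an approximate 95% confidence interval for a sample mean. The data are synthetic (200 draws from a normal distribution with assumed mean 5 and standard deviation 2), and the interval relies on the normal approximation with $z = 1.96$, which is only reasonable for moderately large samples.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: 200 draws from a population with mean 5 and std 2.
sample = rng.normal(loc=5.0, scale=2.0, size=200)

mean = sample.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1).
sem = sample.std(ddof=1) / np.sqrt(len(sample))

# Approximate 95% interval from the normal approximation (z = 1.96).
z = 1.96
ci_low, ci_high = mean - z * sem, mean + z * sem
print(f"mean = {mean:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```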

Types of sampling errors

* Selection bias
* Sampling error

## Resampling

Once we have a data sample we can estimate population parameters, but a single estimate says little about its own uncertainty. Resampling repeats the estimation on many samples drawn from the original data, and the spread of the resulting estimates gives an idea of that uncertainty.

Two common resampling methods

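* $k$-fold cross-validation: the dataset is partitioned into $k$ groups, and each group is used once as the test set while the remaining groups form the training set.
* Bootstrap: samples are drawn from the dataset with replacement.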

### Bootstrap

Procedure

1. Choose the size of the sample
2. While the size of the sample is less than the chosen size
    1. Randomly select an observation from the dataset and add it to the sample (the observation stays in the dataset, so it can be selected again)
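
The loop in the procedure above can be written out by hand with the standard library. This is only a sketch: `bootstrap_sample` is a hypothetical helper named for illustration, and the dataset and seed are arbitrary.

```python
import random


def bootstrap_sample(dataset, size, seed=None):
    """Draw `size` observations from `dataset` with replacement."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < size:               # step 2: keep going until full
        sample.append(rng.choice(dataset))  # step 2.1: pick one, keep it in the dataset
    return sample


dataset = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sample = bootstrap_sample(dataset, size=4, seed=1)
print(sample)
```

Because `rng.choice` never removes the chosen observation, the same value may appear more than once in the sample.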

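```python
from sklearn.utils import resample

data_sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# Draw a bootstrap sample of 4 observations with replacement
boot = resample(data_sample, replace=True, n_samples=4, random_state=1)
# Out-of-bag observations: those never selected for the bootstrap sample
oob = [x for x in data_sample if x not in boot]
print(boot, oob)
```

```
[0.6, 0.4, 0.5, 0.1] [0.2, 0.3]
```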
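
A single bootstrap sample says little on its own; the uncertainty estimate comes from repeating the draw many times and looking at the spread of the statistic. The sketch below does this for the mean and reports a percentile interval. The number of iterations (1000), the choice of the mean as the statistic, and the use of the loop index as `random_state` are assumptions made for the example.

```python
import numpy as np
from sklearn.utils import resample

data_sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# Recompute the statistic (here the mean) on many bootstrap samples,
# each the same size as the original data sample.
n_iterations = 1000  # assumed; more iterations give a smoother distribution
boot_means = [
    np.mean(resample(data_sample, replace=True,
                     n_samples=len(data_sample), random_state=i))
    for i in range(n_iterations)
]

# Percentile interval: the middle 95% of the bootstrap means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {np.mean(data_sample):.2f}, "
      f"95% interval = ({lower:.2f}, {upper:.2f})")
```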


### $k$-fold cross-validation

Procedure

1. Shuffle the dataset randomly
2. Split the dataset into $k$ groups
3. For each unique group
    1. Take the group as the hold-out (test) set and the remaining groups as the training set
    2. Fit a model on the training set and evaluate it on the test set
    3. Retain the evaluation score and discard the model
4. Summarize the skill of the model

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

for train, test in kfold.split(data):
    print('train: %s test: %s' % (data[train], data[test]))
```

```
train: [0.1 0.4 0.5 0.6] test: [0.2 0.3]
train: [0.2 0.3 0.4 0.6] test: [0.1 0.5]
train: [0.1 0.2 0.3 0.5] test: [0.4 0.6]
```
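
The listing above only shows how the data are split. To cover the remaining steps of the procedure (fit, evaluate, retain the score, summarize), here is a minimal sketch under stated assumptions: the data are synthetic (a noisy linear relation invented for the example), and `LinearRegression` with mean squared error stands in for whatever model and metric are actually of interest.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical data: y is a noisy linear function of x.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(30, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=30)

# Steps 1 and 2: shuffle and split into k = 3 groups.
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

scores = []
for train, test in kfold.split(X):  # step 3: each group is the test set once
    # Fit a fresh model on the training folds only.
    model = LinearRegression().fit(X[train], y[train])
    # Keep the score; the model itself is discarded on the next iteration.
    scores.append(mean_squared_error(y[test], model.predict(X[test])))

# Step 4: summarize the skill of the model across folds.
print('MSE: mean=%.4f std=%.4f' % (np.mean(scores), np.std(scores)))
```

scikit-learn's `cross_val_score` wraps a similar loop; the explicit version is shown here only because it maps one-to-one onto the numbered steps.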
