## Resampling Methods from Scratch

This notebook implements two resampling methods:
- Train and test split 
- k-fold cross validation
<br>

Mark Labinski

## 1. Train and Test Split

#### Important Notes about Train/Test Splitting:
- The training set is used to train the model, while the test set is held back and used to evaluate the performance of the model.
- The rows assigned to each set are selected at random, so the evaluation does not depend on how the rows happen to be ordered in the dataset.
- If multiple algorithms are compared or multiple configurations of the same algorithm are compared, the same train/test split should be used for consistent comparison.

#### Steps:
- The first function calculates how many rows the training set will require.
- A copy of the original dataset is made.
- Random rows are selected and removed from the copied dataset and added to the train dataset until the train set contains the target number of rows.
- The rows that remain in the copy of the dataset are then returned as the test dataset.
- The randrange() function from the random module is used to generate a random integer from 0 up to, but not including, the size of the list, so the result is always a valid index.
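As a quick check of the index range, the sketch below draws many indices with randrange() and confirms that each one is a valid list position. The `rows` list and the number of draws are illustrative assumptions, not part of the notebook's dataset.

```python
from random import randrange, seed

seed(1)  # fixed seed so the draws are repeatable
rows = ['a', 'b', 'c', 'd', 'e']  # hypothetical 5-row list

# randrange(n) returns an integer in 0..n-1, so it never overshoots the list.
draws = [randrange(len(rows)) for _ in range(1000)]
print(min(draws), max(draws))
```

Because the upper bound is excluded, the drawn index can be passed straight to `pop()` without an off-by-one adjustment.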

In [6]:
from random import randrange, seed

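# Split a dataset into a train and test set
def train_test_split(dataset, split=0.60):
    train = list()
    # Round to a whole number of rows so float error (e.g. 0.7 * 10) cannot add an extra row
    train_size = round(split * len(dataset))
    # Work on a copy so the caller's dataset is left untouched
    dataset_copy = list(dataset)
    while len(train) < train_size:
        # Move one randomly chosen row from the copy into the training set
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    # Whatever remains in the copy becomes the test set
    return train, dataset_copy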

# Test the train_test_split function using a dataset of 10 rows. 
#    Use seed to fix the random seed before splitting to ensure the exact same 
#    split of the data is made every time the code is executed

seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = train_test_split(dataset)
print('Training Rows: ' + str(train))
print('Testing Rows: ' + str(test))

Training Rows: [[2], [9], [8], [3], [5], [6]]
Testing Rows: [[1], [4], [7], [10]]
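To illustrate why the seed matters when comparing algorithms, the sketch below runs the split twice with the same seed and checks that both runs produce identical train and test sets. The function is a compact restatement of the one above, and the names `first_*` and `second_*` are assumptions made for this example.

```python
from random import randrange, seed

def train_test_split(dataset, split=0.60):
    """Pop random rows into the training set; the remainder is the test set."""
    train = []
    train_size = round(split * len(dataset))
    dataset_copy = list(dataset)
    while len(train) < train_size:
        train.append(dataset_copy.pop(randrange(len(dataset_copy))))
    return train, dataset_copy

dataset = [[i] for i in range(1, 11)]

# Re-seeding before each call reproduces the same sequence of random indices.
seed(1)
first_train, first_test = train_test_split(dataset)
seed(1)
second_train, second_test = train_test_split(dataset)

print(first_train == second_train and first_test == second_test)  # True
print(len(first_train), len(first_test))  # 6 4
```

Any two models evaluated this way see exactly the same rows, so a difference in their scores is not caused by a difference in the split.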


## 2. k-fold Cross Validation

#### Important Notes:
- A limitation of train_test_split is that it gives a noisy estimate of algorithm performance, because the score depends on a single random split. k-fold cross validation usually gives a more stable estimate by averaging over several splits.
- Data is split into k groups or "folds".
- The algorithm is trained and evaluated k times and the performance is summarized by taking the mean performance score.
- First, train the algorithm on k-1 groups of the data and evaluate it on the kth hold-out group as the test set. Repeat so each of the k groups is given an opportunity to be held out and used as the test set.
    - As such, the number of rows in your dataset should be divisible by k to ensure each of the k groups has the same number of rows. In the implementation below, any leftover rows are dropped.
- Choose a value for k that splits the data into groups with enough rows that each group is still representative of the original dataset.
    - Good defaults: k=3 for a small dataset, k=10 for a large dataset
    - To check if fold sizes are representative, calculate summary stats (mean, std dev) for each fold and see how much the values differ from those of the whole dataset
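The representativeness check described in the last note can be sketched with the standard `statistics` module. The helper name `summarize` and the two hand-picked folds are assumptions for illustration; in practice the folds would come from a random split.

```python
from statistics import mean, stdev

def summarize(rows, column=0):
    """Return (mean, sample standard deviation) of one column of the rows."""
    values = [row[column] for row in rows]
    return mean(values), stdev(values)

dataset = [[i] for i in range(1, 11)]
# Hypothetical k=2 split of the 10 rows, written out by hand for the example.
folds = [
    [[1], [4], [5], [8], [10]],
    [[2], [3], [6], [7], [9]],
]

whole_mean, whole_std = summarize(dataset)
print('dataset: mean=%.2f std=%.2f' % (whole_mean, whole_std))
for i, fold in enumerate(folds):
    fold_mean, fold_std = summarize(fold)
    # Small gaps from the whole-dataset values suggest the fold is representative.
    print('fold %d: mean=%.2f std=%.2f' % (i, fold_mean, fold_std))
```

If a fold's statistics drift far from those of the whole dataset, a smaller k (and therefore larger folds) may be a better choice.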
    

In [7]:
from random import seed
from random import randrange
 
# Split a dataset into k folds
def cross_validation_split(dataset, folds=3):
    dataset_split = list()
    dataset_copy = list(dataset)
    # Integer division: rows that do not fit evenly into the folds are left out
    fold_size = len(dataset) // folds
    for _ in range(folds):
        fold = list()
        while len(fold) < fold_size:
            # Move one randomly chosen row from the copy into the current fold
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split
 
# test cross validation split
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 4)
print(folds)

[[[2], [9]], [[8], [3]], [[5], [6]], [[7], [10]]]
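The train-and-evaluate loop described in the notes can be sketched as follows, using the folds printed above. The "model" here is only a placeholder that predicts the mean of the training values, and the names `evaluate_k_fold`, `fit_mean`, and `mean_absolute_error` are assumptions for this example rather than part of the notebook.

```python
from statistics import mean

def evaluate_k_fold(folds, fit, score):
    """Hold out each fold once, fit on the other k-1 folds, and score it."""
    scores = []
    for i, held_out in enumerate(folds):
        # Flatten every fold except the held-out one into a training set.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(train)
        scores.append(score(model, held_out))
    return scores

def fit_mean(train):
    # Placeholder model: a single number, the mean of the training values.
    return mean(row[0] for row in train)

def mean_absolute_error(model, test):
    # Average distance between the predicted mean and each held-out value.
    return mean(abs(row[0] - model) for row in test)

folds = [[[2], [9]], [[8], [3]], [[5], [6]], [[7], [10]]]
scores = evaluate_k_fold(folds, fit_mean, mean_absolute_error)
print('Scores per fold:', scores)
print('Mean score:', mean(scores))
```

Passing `fit` and `score` as arguments keeps the loop independent of any particular algorithm, so the same folds can be reused to compare several models.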
