# Case study: Cross-validation

In supervised learning studies we distinguish between the training and test error rates. The training error rate is measured on the data used to fit the model, whereas the test error rate is the average error of a model when predicting new observations (data that were not used in training). A useful model is one that can accurately predict new observations!

An issue in real ML studies is that data are expensive to collect, and a health data scientist often doesn't have as much validation data as they would ideally like for estimating the test error rate. Cross-validation is a general term for a set of statistical methods that efficiently split a dataset in order to estimate the test error of a model.
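
To make the idea of estimating test error concrete, here is a minimal sketch in plain Python. It assumes a deliberately trivial "model" that predicts the mean of its training data; the function names `training_mse` and `loocv_mse` are illustrative and are not part of the case study classes.

```python
def training_mse(y):
    """Mean squared error measured on the same data used to fit the mean."""
    mean = sum(y) / len(y)
    return sum((value - mean) ** 2 for value in y) / len(y)


def loocv_mse(y):
    """Leave-one-out estimate of the test MSE for the 'predict the mean' model."""
    squared_errors = []
    for i in range(len(y)):
        # hold out observation i and fit the model (the mean) on the rest
        train = y[:i] + y[i + 1:]
        prediction = sum(train) / len(train)
        squared_errors.append((y[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


y = [2.0, 4.0, 6.0, 8.0]
print(f'training MSE: {training_mse(y):.3f}')
print(f'LOOCV estimate of test MSE: {loocv_mse(y):.3f}')
```

For this toy dataset the leave-one-out estimate comes out larger than the training error, which is the usual pattern: error measured on data the model has already seen tends to be optimistic.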

In this case study we will explore creating two classes for cross-validation in machine learning: **Leave-One-Out Cross-Validation** and **K-Fold Cross-Validation**.

> If you are interested in reading further on cross-validation I recommend the excellent **An Introduction to Statistical Learning (with Applications in R)** by James, Witten, Hastie and Tibshirani.


> Disclaimer: In a real machine learning study you would almost certainly implement these classes using a `numpy` array, so in practice I would implement them slightly differently. We will cover `numpy` in the next chapter. Another bulletproof option, which requires no implementation, is to use the `sklearn.model_selection` namespace.

We'll begin by creating each class independently. Then we'll work on our OOP design credentials by extracting what can be shared into a common base class. Finally we will take a short look at what is meant by an abstract class and how this is implemented in Python.
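
As a preview of where this is heading, the sketch below shows one way an abstract class can be written with the standard library `abc` module. The names `CrossValidator` and `LeaveOneOutSketch` are hypothetical and chosen only for illustration; the final design in the case study may well differ.

```python
from abc import ABC, abstractmethod


class CrossValidator(ABC):
    """Hypothetical base class: behaviour shared by all splitters lives here."""

    def __repr__(self):
        return f'{type(self).__name__}()'

    @abstractmethod
    def split(self, X, y):
        """Yield train_X, train_y, test_X, test_y for each fold."""


class LeaveOneOutSketch(CrossValidator):
    """Concrete subclass: it must implement split() to be instantiable."""

    def split(self, X, y):
        for i in range(len(X)):
            yield X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], y[i]


try:
    CrossValidator()
except TypeError as err:
    # abstract classes cannot be instantiated directly
    print(f'Cannot instantiate: {err}')

print(LeaveOneOutSketch())
```

The base class fixes the interface (every splitter has a `split` method) while each subclass supplies its own splitting logic.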

In [242]:
import random
import numpy as np

In [None]:
## Leave one out cross validation

In [23]:
class LOOCV:
    def __init__(self):
        pass
    
    def __repr__(self):
        return 'LOOCV()'
    
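    # generator
    def split(self, X, y):
        '''
        Yields train_X, train_y, test_X, test_y, holding out one
        observation per iteration. X and y are lists of equal length.
        '''
        for test_index in range(len(X)):
        
            # training data: every observation except the one held out
            train_X = X[:test_index] + X[test_index + 1:]
            train_y = y[:test_index] + y[test_index + 1:]
            
            # test data: the single held-out observation
            test_X, test_y = X[test_index], y[test_index]
            
            yield train_X, train_y, test_X, test_y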

In [238]:
class KFoldCV:
    def __init__(self, k=5, shuffle=False, random_seed=None):
        self.k = k
        self.random_seed = random_seed
        self.shuffle = shuffle
    
    def __repr__(self):
        return f'KFoldCV(k={self.k})'
    
    # generator
    def split(self, X, y):
        
        # store the index of each element - it's these that get shuffled.
        idx = list(range(len(X)))
        if self.shuffle:
            random.seed(self.random_seed)
            random.shuffle(idx)
        
        # length of the first k - 1 folds; the final fold runs to the end.
        split_len = len(X) // self.k

        for fold in range(self.k):
            test_idx = fold * split_len
            # the last fold also takes any remainder observations
            fold_len = split_len if fold < self.k - 1 else len(X) - test_idx
            test_end = test_idx + fold_len
        
            # create k - 1 training folds for X 
            train_X = self._fold_training_data(X, idx, test_idx, fold_len)
            # X test data for fold
            test_X = [X[idx[i]] for i in range(test_idx, test_end)]
            
            # create k - 1 training segments for y
            train_y = self._fold_training_data(y, idx, test_idx, fold_len)
            # y test data fold
            test_y = [y[idx[i]] for i in range(test_idx, test_end)]
            
            yield train_X, test_X, train_y, test_y
            
        
    def _fold_training_data(self, data, idx, test_idx, split_len):
        '''
        create training segments for X or y
        '''
        train_seg1 = [data[idx[i]] for i in range(test_idx)]
        train_seg2 = [data[idx[i]] for i in range((test_idx + split_len), 
                                                 len(data))]                                
        return train_seg1 + train_seg2
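
The comment about the final split continuing to the end matters when the number of observations is not an exact multiple of `k`. The sketch below isolates just that arithmetic; `fold_bounds` is a hypothetical helper, not part of the class, and it assumes contiguous folds where the last fold absorbs any remainder.

```python
def fold_bounds(n, k):
    """Return (start, stop) index pairs for k contiguous test folds."""
    if not 1 <= k <= n:
        raise ValueError('k must be between 1 and n')
    split_len = n // k
    bounds = []
    for fold in range(k):
        start = fold * split_len
        # the final fold runs to the end and so absorbs any remainder
        stop = start + split_len if fold < k - 1 else n
        bounds.append((start, stop))
    return bounds


print(fold_bounds(10, 5))  # [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
print(fold_bounds(11, 5))  # [(0, 2), (2, 4), (4, 6), (6, 8), (8, 11)]
```

Other conventions exist, for example spreading the remainder across the first few folds, so treat this as one reasonable choice rather than the only one.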

In [239]:
# training data 
# this is three data points.  each data point has two features.
X = [[1, 2], [3, 4], [5, 6]]
y = [[1], [2], [3]]

In [240]:
def synthetic_classification(n=10, shuffle=False, random_seed=None):
    '''
    Generates a simple random synthetic dataset.
    
    X data is a sequence 1 to n 
    y data is 0 or 1 weighted roughly 50/50.
    '''
    X = [i for i in range(1, n+1)]
    y = ([1] * (n // 2)) + ([0] * ((n // 2) + (n % 2)))
    
    if shuffle: 
        for lst in [X, y]:
            random.seed(random_seed)
            random.shuffle(lst)
    return X, y
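
The function above re-seeds the generator before shuffling each list, so with a fixed seed both lists receive the same permutation. With `random_seed=None` the two shuffles are, as far as I can tell, seeded independently and the pairing between `X` and `y` is not guaranteed. A minimal alternative sketch is to shuffle a single list of indexes and apply it to both lists; `shuffle_together` is a hypothetical helper introduced only for illustration.

```python
import random


def shuffle_together(X, y, random_seed=None):
    """Hypothetical helper: apply one random permutation to both lists."""
    idx = list(range(len(X)))
    # a private Random instance leaves the global random state untouched
    random.Random(random_seed).shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


X = list(range(1, 11))
y = [x % 2 for x in X]   # label is 1 for odd x, 0 for even x
X_shuffled, y_shuffled = shuffle_together(X, y, random_seed=42)

# every label still belongs to its original observation
print(all(label == x % 2 for x, label in zip(X_shuffled, y_shuffled)))  # True
```

This is the same idea `KFoldCV.split` relies on: the indexes are shuffled, and the data are then read through them.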

In [241]:
X, y = synthetic_classification()

cv = KFoldCV(k=5)
i = 0
for train_X, test_X, train_y, test_y in cv.split(X, y):
    print(f'Fold {i}:\nTrain:\tX:{train_X}, y:{train_y}')
    print(f'Test:\tX:{test_X}, y:{test_y}')
    i += 1


Fold 0:
Train:	X:[3, 4, 5, 6, 7, 8, 9, 10], y:[1, 1, 1, 0, 0, 0, 0, 0]
Test:	X:[1, 2], y:[1, 1]
Fold 1:
Train:	X:[1, 2, 5, 6, 7, 8, 9, 10], y:[1, 1, 1, 0, 0, 0, 0, 0]
Test:	X:[3, 4], y:[1, 1]
Fold 2:
Train:	X:[1, 2, 3, 4, 7, 8, 9, 10], y:[1, 1, 1, 1, 0, 0, 0, 0]
Test:	X:[5, 6], y:[1, 0]
Fold 3:
Train:	X:[1, 2, 3, 4, 5, 6, 9, 10], y:[1, 1, 1, 1, 1, 0, 0, 0]
Test:	X:[7, 8], y:[0, 0]
Fold 4:
Train:	X:[1, 2, 3, 4, 5, 6, 7, 8], y:[1, 1, 1, 1, 1, 0, 0, 0]
Test:	X:[9, 10], y:[0, 0]


In [218]:
X, y = synthetic_classification(random_seed=101)
cv = LOOCV()

In [243]:
class KFold:
    def __init__(self, k=5, shuffle=False, random_seed=None):
        self.k = k
        self.shuffle = shuffle
        # keep the seed so that __repr__ can report it
        self.random_seed = random_seed
        self.rng = np.random.default_rng(random_seed)
    
    def __repr__(self):
        return f'KFold(k={self.k}, shuffle={self.shuffle}, ' \
                + f'random_seed={self.random_seed})'

    def split(self, X, y):
        
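        # store the index of each element - it's these that get shuffled.
        if self.shuffle:
            # a permutation uses every index exactly once; rng.integers
            # samples with replacement, so it can repeat or skip observations
            idx = self.rng.permutation(len(X))
        else:
            idx = np.arange(len(X))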
        
        # length of the first k - 1 folds; the final fold runs to the end.
        split_len = len(X) // self.k

        for fold in range(self.k):
            test_idx = fold * split_len
            # the last fold also takes any remainder observations
            fold_len = split_len if fold < self.k - 1 else len(X) - test_idx
            test_end = test_idx + fold_len
        
            # create k - 1 training folds for X 
            train_X = self._fold_training_data(X, idx, test_idx, fold_len)
            # X test data for fold
            test_X = [X[idx[i]] for i in range(test_idx, test_end)]
            
            # create k - 1 training segments for y
            train_y = self._fold_training_data(y, idx, test_idx, fold_len)
            # y test data fold
            test_y = [y[idx[i]] for i in range(test_idx, test_end)]
            
            yield train_X, test_X, train_y, test_y
            
        
    def _fold_training_data(self, data, idx, test_idx, split_len):
        '''
        create training segments for X or y
        '''
        train_seg1 = [data[idx[i]] for i in range(test_idx)]
        train_seg2 = [data[idx[i]] for i in range((test_idx + split_len), 
                                                 len(data))]                                
        return train_seg1 + train_seg2

In [247]:
X = np.arange(1, 11)
idx = np.array([5, 6, 7, 8, 9, 0, 1, 2, 3, 4])
test_idx = 2  # assumed example value: start position of the test fold
# index X with the shuffled indexes that come before the test fold
X[idx[:test_idx]]

In [220]:
i = 1
print(f'X:{X}, \ny:{y}')
for train_X, train_y, test_X, test_y in cv.split(X, y):
    print(f'Fold {i}:')
    print(f'Train: {train_X}, {train_y}')
    print(f'Test:{test_X}, {test_y}')
    i += 1

X:[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 
y:[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
Fold 1:
Train: [2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 0, 0, 0, 0, 0]
Test:1, 1
Fold 2:
Train: [1, 3, 4, 5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 0, 0, 0, 0, 0]
Test:2, 1
Fold 3:
Train: [1, 2, 4, 5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 0, 0, 0, 0, 0]
Test:3, 1
Fold 4:
Train: [1, 2, 3, 5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 0, 0, 0, 0, 0]
Test:4, 1
Fold 5:
Train: [1, 2, 3, 4, 6, 7, 8, 9, 10], [1, 1, 1, 1, 0, 0, 0, 0, 0]
Test:5, 1
Fold 6:
Train: [1, 2, 3, 4, 5, 7, 8, 9, 10], [1, 1, 1, 1, 1, 0, 0, 0, 0]
Test:6, 0
Fold 7:
Train: [1, 2, 3, 4, 5, 6, 8, 9, 10], [1, 1, 1, 1, 1, 0, 0, 0, 0]
Test:7, 0
Fold 8:
Train: [1, 2, 3, 4, 5, 6, 7, 9, 10], [1, 1, 1, 1, 1, 0, 0, 0, 0]
Test:8, 0
Fold 9:
Train: [1, 2, 3, 4, 5, 6, 7, 8, 10], [1, 1, 1, 1, 1, 0, 0, 0, 0]
Test:9, 0
Fold 10:
Train: [1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 1, 1, 1, 1, 0, 0, 0, 0]
Test:10, 0


In [3]:
from sklearn.model_selection import KFold, LeaveOneOut

In [None]:
x = LeaveOneOut()

In [61]:
import copy

def test_change(X):
    # shallow copy: a new outer list, so shuffling it leaves X's order alone
    x2 = copy.copy(X)
    random.shuffle(x2)
    print('inside')
    print(x2)
    
X = [[1, 2], [3, 4], [5, 6]]
test_change(X)

print('outside')
print(X)

inside
[[3, 4], [5, 6], [1, 2]]
outside
[[1, 2], [3, 4], [5, 6]]
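
The experiment above shows that reordering a shallow copy leaves the order of the original list untouched. One caveat worth a quick sketch is that a shallow copy still shares its inner lists with the original, so editing an element in place is visible through both. The example uses `reverse()` rather than `shuffle()` only to keep the output deterministic.

```python
import copy

X = [[1, 2], [3, 4], [5, 6]]

shallow = copy.copy(X)
shallow.reverse()        # reorders the new outer list only
shallow[0][0] = 99       # mutates an inner list that X also refers to

deep = copy.deepcopy(X)
deep[0][0] = -1          # inner lists are independent copies, X is unaffected

print(X)        # [[1, 2], [3, 4], [99, 6]]
print(shallow)  # [[99, 6], [3, 4], [1, 2]]
print(deep)     # [[-1, 2], [3, 4], [99, 6]]
```

For a splitter that only reorders or selects observations a shallow copy is enough; `copy.deepcopy` is the safer choice if the observations themselves might be modified.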
