# CPE232 Supplement: Cross Validation 

## k-fold Cross Validation 

In [22]:
import numpy as np

# Sample dataset (X: Features, y: Labels)
X = np.array([
    [2, 3], [3, 5], [5, 8], [7, 10], [9, 12], 
    [11, 15], [13, 17], [15, 19], [17, 21], [19, 23]
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # Binary classification labels

Let's see our dataset

In [23]:
X

array([[ 2,  3],
       [ 3,  5],
       [ 5,  8],
       [ 7, 10],
       [ 9, 12],
       [11, 15],
       [13, 17],
       [15, 19],
       [17, 21],
       [19, 23]])

And its corresponding labels

In [24]:
y

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

Specify the number of folds and the size of each fold. With 10 samples and k = 5, each fold holds 2 samples.

In [25]:
# Number of folds
k = 5
fold_size = len(X) // k

In [26]:
fold_size

2
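Integer division works cleanly here because 10 is a multiple of 5. When the sample count is not divisible by k, `len(X) // k` drops the remainder, so some samples never land in a test fold. The sketch below illustrates this with a hypothetical sample count of 11 (not part of the dataset above) and shows how `np.array_split` spreads the leftover samples over the first folds instead.

```python
import numpy as np

n_samples = 11  # hypothetical size chosen so that it is not divisible by k
k = 5

# Integer division: 5 folds of size 2 cover only 10 of the 11 samples
fold_size = n_samples // k
covered = fold_size * k

# np.array_split allows uneven folds, so every sample is assigned to one
folds = np.array_split(np.arange(n_samples), k)
fold_sizes = [len(f) for f in folds]

print("fold_size:", fold_size, "| samples covered:", covered)
print("array_split fold sizes:", fold_sizes)
```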

Shuffle the data

In [27]:
# Shuffle the dataset
indices = np.arange(len(X))
np.random.shuffle(indices)

In [28]:
X_shuffled = X[indices]
y_shuffled = y[indices]

Let's compare the data before and after shuffling

In [29]:
print('Before shuffled: ')
print(X)

print('\nAfter shuffled: ')
print(X_shuffled)

Before shuffled: 
[[ 2  3]
 [ 3  5]
 [ 5  8]
 [ 7 10]
 [ 9 12]
 [11 15]
 [13 17]
 [15 19]
 [17 21]
 [19 23]]

After shuffled: 
[[ 7 10]
 [11 15]
 [ 9 12]
 [17 21]
 [ 3  5]
 [19 23]
 [ 5  8]
 [ 2  3]
 [13 17]
 [15 19]]
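The shuffled order shown above will differ on every run, because `np.random.shuffle` draws from NumPy's global random state. If reproducible folds are wanted, one option is a seeded generator. This is a minimal sketch; the seed value 42 is an arbitrary assumption, not something the original notebook uses.

```python
import numpy as np

X = np.array([
    [2, 3], [3, 5], [5, 8], [7, 10], [9, 12],
    [11, 15], [13, 17], [15, 19], [17, 21], [19, 23]
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# A seeded generator returns the same permutation on every run
rng = np.random.default_rng(42)  # 42 is an arbitrary seed
indices = rng.permutation(len(X))

# Index X and y with the same permutation so features and labels stay paired
X_shuffled = X[indices]
y_shuffled = y[indices]

print(indices)
print(y_shuffled)
```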


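Now let's see how the shuffled data is split into folds. In each iteration, one fold serves as the test set and the remaining four folds form the training set.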

In [30]:
# Placeholder for per-fold accuracies (no model is trained in this walkthrough)
accuracies = []

# Perform k-fold cross-validation (k = 5 here)
for i in range(k):
    print('\n==================\nfold: ',i)

    # Split data into training and test sets
    start = i * fold_size
    end = (i + 1) * fold_size

    print('Test start from: %d to: %d'%(start,end-1))
    X_test, y_test = X_shuffled[start:end], y_shuffled[start:end]
    print(X_test)

    X_train = np.concatenate((X_shuffled[:start], X_shuffled[end:]), axis=0)
    y_train = np.concatenate((y_shuffled[:start], y_shuffled[end:]), axis=0)
    print('\nThe rest goes to Training set.')
    print(X_train)


fold:  0
Test start from: 0 to: 1
[[ 7 10]
 [11 15]]

The rest goes to Training set.
[[ 9 12]
 [17 21]
 [ 3  5]
 [19 23]
 [ 5  8]
 [ 2  3]
 [13 17]
 [15 19]]

fold:  1
Test start from: 2 to: 3
[[ 9 12]
 [17 21]]

The rest goes to Training set.
[[ 7 10]
 [11 15]
 [ 3  5]
 [19 23]
 [ 5  8]
 [ 2  3]
 [13 17]
 [15 19]]

fold:  2
Test start from: 4 to: 5
[[ 3  5]
 [19 23]]

The rest goes to Training set.
[[ 7 10]
 [11 15]
 [ 9 12]
 [17 21]
 [ 5  8]
 [ 2  3]
 [13 17]
 [15 19]]

fold:  3
Test start from: 6 to: 7
[[5 8]
 [2 3]]

The rest goes to Training set.
[[ 7 10]
 [11 15]
 [ 9 12]
 [17 21]
 [ 3  5]
 [19 23]
 [13 17]
 [15 19]]

fold:  4
Test start from: 8 to: 9
[[13 17]
 [15 19]]

The rest goes to Training set.
[[ 7 10]
 [11 15]
 [ 9 12]
 [17 21]
 [ 3  5]
 [19 23]
 [ 5  8]
 [ 2  3]]
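The loop above creates an `accuracies` list but never fills it, since no model is trained. The sketch below shows one way the loop could be completed. The classifier is a hypothetical stand-in (a nearest-centroid rule written in NumPy), and the seed 0 is arbitrary; neither comes from the original notebook, and any model with a fit/predict step could be substituted.

```python
import numpy as np

X = np.array([
    [2, 3], [3, 5], [5, 8], [7, 10], [9, 12],
    [11, 15], [13, 17], [15, 19], [17, 21], [19, 23]
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

k = 5
fold_size = len(X) // k

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
indices = rng.permutation(len(X))
X_shuffled, y_shuffled = X[indices], y[indices]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Hypothetical stand-in model: assign each test point to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    # distances has shape (n_test, n_classes)
    distances = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]


accuracies = []
for i in range(k):
    start, end = i * fold_size, (i + 1) * fold_size

    # Fold i is the test set; everything else is the training set
    X_test, y_test = X_shuffled[start:end], y_shuffled[start:end]
    X_train = np.concatenate((X_shuffled[:start], X_shuffled[end:]), axis=0)
    y_train = np.concatenate((y_shuffled[:start], y_shuffled[end:]), axis=0)

    y_pred = nearest_centroid_predict(X_train, y_train, X_test)
    accuracies.append(np.mean(y_pred == y_test))

# The cross-validation score is the average over the k folds
mean_accuracy = np.mean(accuracies)
print("Per-fold accuracy:", accuracies)
print("Mean accuracy:", mean_accuracy)
```

Averaging over the folds gives a single estimate in which every sample has been used for testing exactly once.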


## Stratified k-fold Cross Validation

In the example below, the dataset is imbalanced: class 0 has twice as many samples as class 1.

In this scenario, stratified k-fold cross validation is more suitable because it ensures that each fold preserves approximately the same class proportions as the full dataset.

In [31]:
import numpy as np

# Dataset with imbalanced classes 
X = np.array([
    [2, 3], [3, 5], [5, 8], [7, 10], [9, 12], 
    [11, 15], [13, 17], [15, 19], [17, 21], [19, 23],
    [21, 25], [23, 27], [25, 29], [27, 31], [29, 33]
])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # Class 0 is dominant
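To see why stratification matters for these labels, the sketch below counts the classes in each test fold of a plain k-fold split. It assumes contiguous folds taken without shuffling, which is the worst case; shuffling first makes single-class folds less likely but does not rule them out.

```python
import numpy as np

# Same labels as above: 10 samples of class 0 followed by 5 of class 1
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

k = 5
fold_size = len(y) // k

class_counts = []
for i in range(k):
    y_test = y[i * fold_size:(i + 1) * fold_size]
    # counts[0] = number of class-0 samples, counts[1] = number of class-1 samples
    counts = np.bincount(y_test, minlength=2)
    class_counts.append(counts.tolist())
    print(f"fold {i}: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

In this split, the first three test folds contain only class 0 and the last contains only class 1, so no fold reflects the 2:1 ratio of the full dataset.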


In [32]:
# Number of folds
k = 5

# Store accuracy results
accuracies = []

Separate data by class (0 and 1)

In [33]:
class_0_indices = np.where(y == 0)[0]
class_1_indices = np.where(y == 1)[0]

These are the IDs (indices) of the samples labeled 0. There are 10 of them.

In [34]:
class_0_indices

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

And these are the IDs of the samples labeled 1. There are only 5 of them.

In [35]:
class_1_indices

array([10, 11, 12, 13, 14])

Shuffle them up!

In [36]:
# Shuffle indices for randomness
np.random.shuffle(class_0_indices)
np.random.shuffle(class_1_indices)

Split each class into k folds

In [37]:
# Split each class into k folds
folds_0 = np.array_split(class_0_indices, k)
folds_1 = np.array_split(class_1_indices, k)

Since we have 10 samples of class 0, we divide them into 5 folds of size 2 each.

In [38]:
folds_0

[array([4, 9]), array([0, 1]), array([5, 3]), array([6, 7]), array([8, 2])]

Likewise, since we have 5 samples of class 1, we divide them into 5 folds of size 1 each.

In [39]:
folds_1

[array([11]), array([10]), array([13]), array([12]), array([14])]

For each fold:
- Take one mini-fold from class 0 and one mini-fold from class 1 as the test set.
- Use the remaining mini-folds of both classes as the training set.
- This ensures that each test set contains both classes in approximately the same ratio as the full dataset (2:1 here).

For example, for the first fold...

In [40]:
# Split each class into k folds
folds_0 = np.array_split(class_0_indices, k)
folds_1 = np.array_split(class_1_indices, k)

# Display the test and train sets for the first fold
test_indices = np.concatenate((folds_0[0], folds_1[0]))  # Test set for the first fold
X_test, y_test = X[test_indices], y[test_indices]
print('Data points in test set (IDs): ',test_indices)
print(X_test)

# Adjust to make sure the training set contains exactly 12 samples
train_indices = np.concatenate([fold for j, fold in enumerate(folds_0 + folds_1) if j != 0])  # Training set (exclude fold 0)
X_train, y_train = X[train_indices], y[train_indices]
print('\nData points in training set (IDs): ',train_indices)
print(X_train)

Data points in test set (IDs):  [ 4  9 11]
[[ 9 12]
 [19 23]
 [23 27]]

Data points in training set (IDs):  [ 0  1  5  3  6  7  8  2 11 10 13 12 14]
[[ 2  3]
 [ 3  5]
 [11 15]
 [ 7 10]
 [13 17]
 [15 19]
 [17 21]
 [ 5  8]
 [23 27]
 [21 25]
 [27 31]
 [25 29]
 [29 33]]


Is the code above correct? 

If not, fix it.

In [41]:
# Write your code here
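
One possible fix is sketched below; treat it as an illustration rather than the only valid answer. In the cell above, `folds_0 + folds_1` is a single list of 10 mini-folds, so the condition `j != 0` removes only `folds_0[0]`. The class-1 test sample (ID 11 in the output shown) therefore stays in the training set, which ends up with 13 samples instead of 12. The sketch filters each class's mini-folds separately and repeats the split for every fold. The seeded generator is an assumption added for reproducibility.

```python
import numpy as np

X = np.array([
    [2, 3], [3, 5], [5, 8], [7, 10], [9, 12],
    [11, 15], [13, 17], [15, 19], [17, 21], [19, 23],
    [21, 25], [23, 27], [25, 29], [27, 31], [29, 33]
])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

k = 5
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Shuffle the indices of each class separately, then cut each into k mini-folds
folds_0 = np.array_split(rng.permutation(np.where(y == 0)[0]), k)
folds_1 = np.array_split(rng.permutation(np.where(y == 1)[0]), k)

splits = []
for i in range(k):
    # Test set: mini-fold i from each class
    test_indices = np.concatenate((folds_0[i], folds_1[i]))

    # Training set: every other mini-fold, filtered per class
    train_indices = np.concatenate(
        [fold for j, fold in enumerate(folds_0) if j != i]
        + [fold for j, fold in enumerate(folds_1) if j != i]
    )
    splits.append((train_indices, test_indices))

    print(f"\nFold {i}")
    print("Test IDs: ", test_indices, "labels:", y[test_indices])
    print("Train IDs:", train_indices)
```

Each fold now has 3 test samples (two of class 0, one of class 1) and 12 training samples, with no ID appearing in both sets.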

## Using Library for Stratified k-fold

In [42]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Example dataset (imbalanced classes)
X = np.array([
    [2, 3], [3, 5], [5, 8], [7, 10], [9, 12], 
    [11, 15], [13, 17], [15, 19], [17, 21], [19, 23],
    [21, 25], [23, 27], [25, 29], [27, 31], [29, 33]
])

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # Class 0 is dominant

# Define StratifiedKFold with 5 splits
kf = StratifiedKFold(n_splits=5)

# Loop through the folds and display the indices for training and test sets
for fold, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    print(f"\nFold {fold+1}")
    print("Train indices:", train_idx)
    print("Test indices:", test_idx)
    
    # Display train and test sets
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    
    print("X_train:", X_train)
    print("y_train:", y_train)
    print("X_test:", X_test)
    print("y_test:", y_test)



Fold 1
Train indices: [ 2  3  4  5  6  7  8  9 11 12 13 14]
Test indices: [ 0  1 10]
X_train: [[ 5  8]
 [ 7 10]
 [ 9 12]
 [11 15]
 [13 17]
 [15 19]
 [17 21]
 [19 23]
 [23 27]
 [25 29]
 [27 31]
 [29 33]]
y_train: [0 0 0 0 0 0 0 0 1 1 1 1]
X_test: [[ 2  3]
 [ 3  5]
 [21 25]]
y_test: [0 0 1]

Fold 2
Train indices: [ 0  1  4  5  6  7  8  9 10 12 13 14]
Test indices: [ 2  3 11]
X_train: [[ 2  3]
 [ 3  5]
 [ 9 12]
 [11 15]
 [13 17]
 [15 19]
 [17 21]
 [19 23]
 [21 25]
 [25 29]
 [27 31]
 [29 33]]
y_train: [0 0 0 0 0 0 0 0 1 1 1 1]
X_test: [[ 5  8]
 [ 7 10]
 [23 27]]
y_test: [0 0 1]

Fold 3
Train indices: [ 0  1  2  3  6  7  8  9 10 11 13 14]
Test indices: [ 4  5 12]
X_train: [[ 2  3]
 [ 3  5]
 [ 5  8]
 [ 7 10]
 [13 17]
 [15 19]
 [17 21]
 [19 23]
 [21 25]
 [23 27]
 [27 31]
 [29 33]]
y_train: [0 0 0 0 0 0 0 0 1 1 1 1]
X_test: [[ 9 12]
 [11 15]
 [25 29]]
y_test: [0 0 1]

Fold 4
Train indices: [ 0  1  2  3  4  5  8  9 10 11 12 14]
Test indices: [ 6  7 13]
X_train: [[ 2  3]
 [ 3  5]
 [ 5  8]
 [ 7 