# Introduction
<hr style = "border:2px solid black" ></hr>


**What?** k-Fold CV for Imbalanced Classification



# Class imbalance and CV
<hr style = "border:2px solid black" ></hr>


- The two most common approaches used for model evaluation are the train/test split and the k-fold cross-validation procedure. 

- Both approaches can be very effective in general, but they can give misleading results, or fail outright, on classification problems with a severe class imbalance. 

- Instead, the techniques must be modified to stratify the sampling by class label. The modified versions are called a stratified train-test split and **stratified k-fold cross-validation**.

- The go-to validation procedure is classical 10-fold CV, but this can easily break down under class imbalance, even if the skew is NOT extreme.

- The **reason** is that the rows are assigned to the k folds uniformly at random, without regard to class label. This might work fine for data with a balanced class distribution, but when the distribution is severely skewed, it is likely that one or more folds will have few or no examples from the minority class. 
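
To put a rough number on that last point, here is a minimal sketch that computes the chance that one particular fold ends up with no minority examples at all. It assumes the fold is a uniformly random subset of the rows, so the minority count in the fold follows a hypergeometric distribution. The helper name `prob_fold_has_no_minority` and the figures (1,000 rows, 10 of them minority) are illustrative assumptions.

```python
from math import comb


def prob_fold_has_no_minority(n_total, n_minority, k):
    """Probability that one specific fold contains no minority examples.

    Hypothetical helper: assumes the fold is a uniformly random subset
    of n_total // k rows (hypergeometric sampling).
    """
    fold_size = n_total // k
    n_majority = n_total - n_minority
    # Every row in the fold must come from the majority class.
    return comb(n_majority, fold_size) / comb(n_total, fold_size)


p_5 = prob_fold_has_no_minority(n_total=1000, n_minority=10, k=5)
p_10 = prob_fold_has_no_minority(n_total=1000, n_minority=10, k=10)
print('k=5 : P(fold has no minority examples) = %.3f' % p_5)
print('k=10: P(fold has no minority examples) = %.3f' % p_10)
```

Under these assumptions the probability grows with k, because smaller folds are more likely to miss the minority class entirely.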



# Import modules
<hr style = "border:2px solid black" ></hr>

In [11]:
from numpy import unique
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

# Example of the issue
<hr style = "border:2px solid black" ></hr>


- Let's create a dataset of 1,000 rows with roughly a 1:100 minority-to-majority class distribution (1% minority, 99% majority).

- That gives a total of 10 examples in the minority class, which is not many. With 10 folds, we would get only one minority example in each fold in the ideal case, which is too few to work with. For demonstration purposes, we will use 5 folds.

- In the ideal case, each fold would hold 10/5 = 2 minority examples, so each training set (4 folds) would contain 4*2 = 8 of them and each test set (1 fold) would contain 1*2 = 2.
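
The arithmetic above can be written as a small sketch. The helper `ideal_minority_counts` is hypothetical and assumes the minority count divides evenly by k.

```python
def ideal_minority_counts(n_minority, k):
    """Return (train, test) minority counts per split in the ideal case.

    Hypothetical helper: assumes n_minority is divisible by k, so every
    fold holds exactly n_minority // k minority examples.
    """
    per_fold = n_minority // k
    # k - 1 folds form the training set, the remaining fold is the test set.
    return per_fold * (k - 1), per_fold


for k in (2, 5, 10):
    train_1, test_1 = ideal_minority_counts(n_minority=10, k=k)
    print('k=%d: train minority=%d, test minority=%d' % (k, train_1, test_1))
```

For k=5 this gives the 8/2 split described above, and for k=10 it leaves a single minority example in each test fold.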



In [3]:
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[
                           0.99, 0.01], flip_y=0, random_state=1)
# summarize dataset
classes = unique(y)
total = len(y)
for c in classes:
    n_examples = len(y[y == c])
    percent = n_examples / total * 100
    print('> Class=%d : %d/%d (%.1f%%)' % (c, n_examples, total, percent))

> Class=0 : 990/1000 (99.0%)
> Class=1 : 10/1000 (1.0%)


In [6]:
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# enumerate the splits and summarize the distributions
for train_ix, test_ix in kfold.split(X):
    # select rows
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # summarize train and test composition
    train_0, train_1 = len(train_y[train_y == 0]), len(train_y[train_y == 1])
    test_0, test_1 = len(test_y[test_y == 0]), len(test_y[test_y == 1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' %
          (train_0, train_1, test_0, test_1))

>Train: 0=791, 1=9, Test: 0=199, 1=1
>Train: 0=793, 1=7, Test: 0=197, 1=3
>Train: 0=794, 1=6, Test: 0=196, 1=4
>Train: 0=790, 1=10, Test: 0=200, 1=0
>Train: 0=792, 1=8, Test: 0=198, 1=2



- We can see that some splits have the expected 8/2 division of minority examples between train and test sets, while others are much worse, such as 6/4 (optimistic) and 10/0 (pessimistic).

- Evaluating a model on these splits of the data would not give a reliable estimate of performance.

- A similar issue exists if we use a simple 50/50 train/test split of the dataset, although the issue is less severe.
    


In [9]:
# split into train/test sets at random (no stratification)
trainX, testX, trainy, testy = train_test_split(
    X, y, test_size=0.5, random_state=2)
# summarize
train_0, train_1 = len(trainy[trainy == 0]), len(trainy[trainy == 1])
test_0, test_1 = len(testy[testy == 0]), len(testy[testy == 1])
print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' %
      (train_0, train_1, test_0, test_1))

>Train: 0=497, 1=3, Test: 0=493, 1=7



- Only three examples of the minority class are present in the training set, with seven in the test set.

- A model evaluated on this split would have too few minority examples to learn from and a disproportionate share to be tested on, so it would likely show poor performance. 

- You can imagine how the situation could be worse with an even more severe random split. 
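
To see how much a plain random split can vary, the following sketch repeats the 50/50 split with different seeds and records how many minority examples land in the training set each time. The range of 50 seeds is an arbitrary choice for illustration, and the dataset is regenerated with the same arguments as above so that the block is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as above: 990 majority and 10 minority examples.
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.99, 0.01], flip_y=0, random_state=1)

train_counts = []
for seed in range(50):
    # Plain random split: no stratify argument.
    _, _, trainy, testy = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    train_counts.append(int((trainy == 1).sum()))

print('Minority examples in the train set over 50 seeds: min=%d, max=%d' %
      (min(train_counts), max(train_counts)))
```

An ideal split would put five minority examples in the training set every time, so the spread between the minimum and maximum shows how far an unstratified split can drift from that.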



# Fixing the CV splitting
<hr style = "border:2px solid black" ></hr>


- The **solution** is to not split the data purely at random when using k-fold cross-validation or a train-test split.
- We can use a version of k-fold cross-validation that preserves the imbalanced class distribution in each fold. It is called **stratified** k-fold cross-validation and will enforce the class distribution in each split of the data to match the distribution in the complete training dataset.
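
To make the idea concrete, here is a simplified sketch of stratification done by hand: the rows of each class are shuffled separately and then dealt round-robin into the k folds. This is not scikit-learn's actual implementation, only an illustration of the principle. The function `stratified_fold_ids` and the toy label array are assumptions made for the example.

```python
import numpy as np


def stratified_fold_ids(y, k, seed=0):
    """Assign each row a fold id in [0, k) so each class is spread evenly.

    Hypothetical helper illustrating the principle of stratification.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold_ids = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        # Deal the shuffled rows of this class round-robin into the k folds.
        fold_ids[idx] = np.arange(len(idx)) % k
    return fold_ids


# Toy labels with the same class counts as the dataset above.
y_toy = np.array([0] * 990 + [1] * 10)
fold_ids = stratified_fold_ids(y_toy, k=5)
for f in range(5):
    test_y = y_toy[fold_ids == f]
    print('fold %d: test 0=%d, 1=%d' %
          (f, (test_y == 0).sum(), (test_y == 1).sum()))
```

Because each class is dealt out separately, every fold receives the same number of examples from each class (up to rounding), regardless of the random seed.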
    


In [12]:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# enumerate the splits and summarize the distributions
for train_ix, test_ix in kfold.split(X, y):
    # select rows
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # summarize train and test composition
    train_0, train_1 = len(train_y[train_y == 0]), len(train_y[train_y == 1])
    test_0, test_1 = len(test_y[test_y == 0]), len(test_y[test_y == 1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' %
          (train_0, train_1, test_0, test_1))

>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2
>Train: 0=792, 1=8, Test: 0=198, 1=2


# Other important considerations
<hr style = "border:2px solid black" ></hr>


- There are two things to keep in mind.

- **First**, make sure you use a value of k that ensures an adequate number of examples in the train and test sets to fit and evaluate a model.

- **Second**, make sure to use stratified k-fold CV with imbalanced datasets to preserve the class distribution in the train and test sets for each evaluation of a given model.
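
As a sketch of the second point, a `StratifiedKFold` object can be passed as the `cv` argument of `cross_val_score` so that every evaluation of the model uses stratified folds. The choice of `LogisticRegression` and of ROC AUC as the metric are assumptions made only for illustration. Any classifier and any metric suited to the problem could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same synthetic dataset as above: 990 majority and 10 minority examples.
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.99, 0.01], flip_y=0, random_state=1)

# Illustrative model choice (an assumption, not part of the original notes).
model = LogisticRegression(max_iter=1000)

# k=5 keeps two minority examples in every test fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Each of the 5 fits is scored on a test fold with the same class ratio.
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('ROC AUC per fold:', scores.round(3))
print('Mean ROC AUC: %.3f' % scores.mean())
```

Passing the splitter explicitly also makes the shuffling and the random seed visible, which helps when results need to be reproduced.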



In [17]:
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(
    X, y, test_size=0.5, random_state=2, stratify=y)

# summarize
train_0, train_1 = len(trainy[trainy == 0]), len(trainy[trainy == 1])
test_0, test_1 = len(testy[testy == 0]), len(testy[testy == 1])

print('Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' %
      (train_0, train_1, test_0, test_1))

Train: 0=495, 1=5, Test: 0=495, 1=5



- Running the example creates a random split of the dataset into training and test sets while preserving the class distribution, in this case leaving five minority examples in each set. 
- Note the argument `stratify=y`, which is what requests the stratified split.
    


# References
<hr style = "border:2px solid black" ></hr>


- https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/
- He, Haibo, and Yunqian Ma, eds. "Imbalanced learning: foundations, algorithms, and applications." (2013).

