### How to do K-fold cross validation (K-fold CV)

When training a model, a best practice is to:
* first, set aside a test dataset (hold-out) from the data
* then split the remaining data into a train and a validation dataset
* use the validation dataset, for example, for hyper-parameter optimization
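
The three steps above can be sketched with `train_test_split` from sklearn. This is only a minimal illustration on synthetic stand-in data; the split fractions (20% test, then 25% of the remainder for validation) and the names `X_rest`, `X_valid`, etc. are assumptions made for the example, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 4321

# synthetic stand-in data (assumption): 1000 samples, 5 features, binary target
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 1) set aside the hold-out test set (20% here, an arbitrary choice)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# 2) split what remains into train and validation
#    (25% of the remaining 800 samples = 200 samples)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=SEED
)

print(X_train.shape[0], X_valid.shape[0], X_test.shape[0])  # 600 200 200
```

With a small dataset, the 200-sample validation set here is exactly the kind of split whose metrics can vary noticeably from one random seed to another.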

But if the original dataset is not very big, the validation set obtained from a train/validation split may be small, and the results we see (metrics computed on the validation set) then depend on how the split was made.

A commonly adopted technique is K-fold cross validation. With this technique we split the data into K parts (folds) and:
* train K models
* each one using (K-1) parts for training
* and the remaining part for validation, a different part for each model
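
To make the mechanics concrete, here is a small sketch on a toy array of 10 samples with K=5 and no shuffling. The names `X_small`, `kf_demo` and `valid_seen` are hypothetical and used only for this illustration. It shows that every sample ends up in the validation part of exactly one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# toy data (assumption): 10 samples, 1 feature
X_small = np.arange(10).reshape(-1, 1)

# 5 folds, no shuffling, so folds are consecutive blocks of indexes
kf_demo = KFold(n_splits=5)

valid_seen = []
for fold, (train_idx, valid_idx) in enumerate(kf_demo.split(X_small)):
    print(f"fold {fold + 1}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
    valid_seen.extend(valid_idx.tolist())

# each index appears exactly once across the validation parts
print(sorted(valid_seen) == list(range(10)))  # True
```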

With sklearn it is fairly easy to do K-fold CV. There are some additional details if we want to be sure that no fold is biased (sometimes we want a stratified K-fold), but the technique is basically the same.
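
As a hedged sketch of the stratified variant mentioned above: sklearn provides `StratifiedKFold`, which keeps the class proportions roughly equal in every fold. The toy target below (80 "No", 20 "Yes") is an assumption for illustration and is not derived from the attrition dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy imbalanced target (assumption): 80% "No", 20% "Yes"
y_demo = np.array(["No"] * 80 + ["Yes"] * 20)
# placeholder features: only their length matters for the split
X_demo = np.zeros((len(y_demo), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4321)

yes_counts = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_demo, y_demo)):
    n_yes = int((y_demo[valid_idx] == "Yes").sum())
    yes_counts.append(n_yes)
    print(f"fold {fold + 1}: {len(valid_idx)} validation samples, {n_yes} 'Yes'")
```

Note that, unlike `KFold.split`, the `split` method of `StratifiedKFold` needs the target as its second argument, because the class labels drive the assignment to folds.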

In this example I'm using this dataset

https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/synthetic%2Forcl_attrition.csv

In [1]:
import pandas as pd
import numpy as np

# we use this class for K-fold split
from sklearn.model_selection import KFold

In [2]:
URL = "https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/hosted-ds-datasets/o/synthetic%2Forcl_attrition.csv"

data_orig = pd.read_csv(URL)

data_orig.head()

| | Age | Attrition | TravelForWork | SalaryLevel | JobFunction | CommuteLength | EducationalLevel | EducationField | Directs | EmployeeNumber | ... | WeeklyWorkedHours | StockOptionLevel | YearsinIndustry | TrainingTimesLastYear | WorkLifeBalance | YearsOnJob | YearsAtCurrentLevel | YearsSinceLastPromotion | YearsWithCurrManager | name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42 | Yes | infrequent | 5054 | Product Management | 2 | L2 | Life Sciences | 1 | 1 | ... | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 | Tracy Moore |
| 1 | 50 | No | often | 1278 | Software Developer | 9 | L1 | Life Sciences | 1 | 2 | ... | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 | Andrew Hoover |
| 2 | 38 | Yes | infrequent | 6296 | Software Developer | 3 | L2 | Other | 1 | 4 | ... | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 | Julie Bell |
| 3 | 34 | No | often | 6384 | Software Developer | 4 | L4 | Life Sciences | 1 | 5 | ... | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 | Thomas Adams |
| 4 | 28 | No | infrequent | 2710 | Software Developer | 3 | L1 | Medical | 1 | 7 | ... | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 | Johnathan Burnett |


In [3]:
# Let's imagine that we want to develop a classification model to predict whether an employee is likely to leave the company.
# The TARGET is the column Attrition

In [4]:
data_orig.columns

Index(['Age', 'Attrition', 'TravelForWork', 'SalaryLevel', 'JobFunction',
       'CommuteLength', 'EducationalLevel', 'EducationField', 'Directs',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'WeeklyWorkedHours', 'StockOptionLevel',
       'YearsinIndustry', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsOnJob', 'YearsAtCurrentLevel', 'YearsSinceLastPromotion',
       'YearsWithCurrManager', 'name'],
      dtype='object')

In [5]:
TARGET = "Attrition"

# we have chosen some columns; the exact choice is not important for showing K-fold
features = [
    "Age",
    "TravelForWork",
    "SalaryLevel",
    "JobFunction",
    "CommuteLength",
    "EducationalLevel",
    "EducationField",
    "Directs",
    "EmployeeNumber",
    "EnvironmentSatisfaction",
    "Gender",
    "HourlyRate",
    "JobInvolvement",
    "JobLevel",
    "JobRole",
    "JobSatisfaction",
    "MaritalStatus",
    "MonthlyIncome",
    "MonthlyRate",
    "NumCompaniesWorked",
    "Over18",
    "OverTime",
]
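
Several of the selected features (for example `TravelForWork` and `JobFunction`) hold strings, and many model libraries expect numeric input. One common option, shown here only as a sketch, is one-hot encoding with `pd.get_dummies`. The tiny DataFrame `df_demo` is a hypothetical stand-in built from a few values visible in the preview above, and the choice of encoding is an assumption, not part of the original notebook.

```python
import pandas as pd

# hypothetical mini-DataFrame with one numeric and two string columns
df_demo = pd.DataFrame(
    {
        "Age": [42, 50, 38],
        "TravelForWork": ["infrequent", "often", "infrequent"],
        "JobFunction": [
            "Product Management",
            "Software Developer",
            "Software Developer",
        ],
    }
)

# one-hot encode the string columns; numeric columns pass through unchanged
encoded = pd.get_dummies(df_demo, columns=["TravelForWork", "JobFunction"])
print(encoded.columns.tolist())

# convert to a purely numeric NumPy array usable by most model libraries
x_demo = encoded.to_numpy(dtype=float)
print(x_demo.shape)  # (3, 5)
```

In a real K-fold setup, any encoder that learns something from the data should be fitted on the training part of each fold only, to avoid leaking information from the validation part.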

### Model training with K-fold

In [6]:
# the number of FOLDS
FOLDS = 5
# good to set, to make it entirely reproducible
SEED = 4321

kf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)

# in this list we will save the trained models
models_list = []

#
# at each iteration you get a different set of indexes
# from which you get different samples for train and validation dataset
#
for i, (train_idx, valid_idx) in enumerate(kf.split(data_orig)):
    print()
    print("Processing fold:", i + 1)

    # here we split the DataFrame, using the indexes for the fold
    data_train = data_orig.iloc[train_idx]
    data_valid = data_orig.iloc[valid_idx]

    print("Samples for train dataset:", data_train.shape[0])
    print("Samples for validation dataset:", data_valid.shape[0])

    # get NumPy arrays; we assume that the model library supports NumPy input
    x_train = data_train[features].values
    y_train = data_train[TARGET].values
    y_train = y_train.reshape(-1, 1)

    x_valid = data_valid[features].values
    y_valid = data_valid[TARGET].values
    y_valid = y_valid.reshape(-1, 1)

    # here you would do something like this (the exact argument names depend on the model library):
    # model.fit(x_train, y_train, eval=(x_valid, y_valid))
    # models_list.append(model)


Processing fold: 1
Samples for train dataset: 1176
Samples for validation dataset: 294

Processing fold: 2
Samples for train dataset: 1176
Samples for validation dataset: 294

Processing fold: 3
Samples for train dataset: 1176
Samples for validation dataset: 294

Processing fold: 4
Samples for train dataset: 1176
Samples for validation dataset: 294

Processing fold: 5
Samples for train dataset: 1176
Samples for validation dataset: 294


In [7]:
# at this point, once the fit/append lines above are enabled, models_list holds FOLDS trained models (one per fold)

# to do prediction, for example on a test set, we need to use each model and compute the average of the probabilities predicted by the models.
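
The averaging step described above could look like the following. This is a minimal, self-contained sketch under stated assumptions: it uses synthetic data from `make_classification` and `LogisticRegression` as a stand-in model, neither of which comes from this notebook, and the 0.5 decision threshold is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

SEED = 4321
FOLDS = 5

# synthetic binary classification data (assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=SEED)

# hold-out test set, never seen during K-fold training
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

kf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
models_list = []

# train one model per fold on the (K-1) training parts
for train_idx, valid_idx in kf.split(X_rest):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_rest[train_idx], y_rest[train_idx])
    models_list.append(model)

# probability of the positive class from each model: shape (FOLDS, n_test)
probas = np.stack([m.predict_proba(X_test)[:, 1] for m in models_list])

# average over the models, then apply a threshold to get class labels
avg_proba = probas.mean(axis=0)
y_pred = (avg_proba >= 0.5).astype(int)

print(probas.shape, avg_proba.shape)  # (5, 100) (100,)
```

Averaging probabilities rather than hard labels keeps the information about how confident each fold model is, which tends to give smoother ensemble predictions.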