### K-Fold Cross Validation

In [1]:
import pandas as pd
from sklearn.model_selection import KFold


In [3]:
df = pd.read_csv("Datasets/winequality-red.csv", index_col=None)
df.shape

(1599, 12)

In [6]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.7,0.23,0.37,1.8,0.046,23.0,60.0,0.9971,3.41,0.71,12.1,6
1,6.2,0.46,0.17,1.6,0.073,7.0,11.0,0.99425,3.61,0.54,11.4,5
2,8.9,0.59,0.39,2.3,0.095,5.0,22.0,0.9986,3.37,0.58,10.3,5
3,6.6,0.84,0.03,2.3,0.059,32.0,48.0,0.9952,3.52,0.56,12.3,7
4,7.4,0.61,0.01,2.0,0.074,13.0,38.0,0.99748,3.48,0.65,9.8,5


In [45]:
kf = KFold(n_splits=5)

# How kf.split works:
# kf.split(X=df) yields one (train_indices, val_indices) pair per split, so n_splits=5 gives 5 pairs.
# Each validation set holds roughly number_of_samples_in_df / n_splits rows. Here 1599 / 5 = 319.8,
# so the first four validation sets have 320 rows and the last has 319.
# The validation sets do not overlap, and together they cover every row exactly once.
# The train set for each split contains the remaining rows: df - (rows in that validation set).
# This way, the model trains on one part of the data and is evaluated on rows it has not seen during training.
# NOTE: The training sets overlap with each other, while the validation sets do not.
# With n_splits=3 the pairs could be unpacked directly:
# (tr1, te1), (tr2, te2), (tr3, te3) = KFold(n_splits=3).split(X=df)
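
The fold sizes described in the comments above can be checked without the wine data. This is a minimal sketch that uses a placeholder array with the same number of rows (1599); the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1599  # same row count as the wine dataset above
X = np.arange(n_samples).reshape(-1, 1)  # placeholder features; the values are irrelevant

kf = KFold(n_splits=5)
splits = list(kf.split(X))

# Number of rows in each validation and training set.
val_sizes = [len(val_idx) for _, val_idx in splits]
train_sizes = [len(train_idx) for train_idx, _ in splits]
print(val_sizes)
print(train_sizes)

# Every row should land in exactly one validation set.
all_val = np.concatenate([val_idx for _, val_idx in splits])
covers_all_rows = np.array_equal(np.sort(all_val), np.arange(n_samples))
print(covers_all_rows)
```

Because 1599 is not divisible by 5, the validation sets differ in size by one row.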

In [55]:
df['kfold'] = -1  # create the column first; -1 marks rows not yet assigned to a fold
for fold, (train_, val_) in enumerate(kf.split(X=df)):
    df.loc[val_, 'kfold'] = fold  # .loc writes to df itself, avoiding chained assignment


In [56]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,kfold
0,7.7,0.23,0.37,1.8,0.046,23.0,60.0,0.9971,3.41,0.71,12.1,6,0
1,6.2,0.46,0.17,1.6,0.073,7.0,11.0,0.99425,3.61,0.54,11.4,5,0
2,8.9,0.59,0.39,2.3,0.095,5.0,22.0,0.9986,3.37,0.58,10.3,5,0
3,6.6,0.84,0.03,2.3,0.059,32.0,48.0,0.9952,3.52,0.56,12.3,7,0
4,7.4,0.61,0.01,2.0,0.074,13.0,38.0,0.99748,3.48,0.65,9.8,5,0


In [57]:
df.kfold.unique()

array([0, 1, 2, 3, 4])

In [62]:
df.to_csv("Datasets/output_datasets/train_k_folds.csv", index=False)
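
Once every row carries a fold label, one fold can be held out for validation while the rest is used for training. The sketch below assumes a small synthetic DataFrame in place of the wine data; the column names `feature` and `target` are placeholders, not columns from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the wine data; "feature" and "target" are placeholder names.
df = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) % 2})

# Label each row with the fold in which it serves as validation data.
df["kfold"] = -1
for fold, (_, val_idx) in enumerate(KFold(n_splits=5).split(X=df)):
    df.loc[val_idx, "kfold"] = fold

fold = 0  # the fold to hold out
train_df = df[df.kfold != fold].reset_index(drop=True)
valid_df = df[df.kfold == fold].reset_index(drop=True)
print(len(train_df), len(valid_df))
```

Repeating this for each fold number gives one train/validation pair per fold.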

K-Fold Cross Validation works well when the dataset is balanced, with roughly equal numbers of samples in each category.
If the dataset is imbalanced, plain K-Fold can produce folds whose class proportions differ from the full dataset, and a rare class may be missing from some folds entirely. In that case, Stratified K-Fold Cross Validation is the better choice.

### Stratified K-Fold Cross Validation

Stratified K-Fold Cross Validation keeps the ratio of target labels approximately the same in every fold. This helps when dealing with imbalanced datasets.
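
The difference between the two splitters is easiest to see on a toy example. The sketch below assumes hypothetical labels (90 samples of class 0 followed by 10 of class 1), deliberately left unshuffled, and counts how many minority samples fall into each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 90 samples of class 0 followed by 10 of class 1,
# deliberately left unshuffled to make the effect obvious.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; the splitters only need the length


def minority_counts(splitter):
    """Count class-1 samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


plain = minority_counts(KFold(n_splits=5))
stratified = minority_counts(StratifiedKFold(n_splits=5))
print(plain)
print(stratified)
```

On this data the plain splitter puts all ten minority samples into the last fold, while the stratified splitter places two in each fold. Shuffling before a plain K-Fold split reduces the problem but does not guarantee equal proportions.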

In [1]:
from sklearn.model_selection import StratifiedKFold
import pandas as pd

In [2]:
df = pd.read_csv("Datasets/winequality-red.csv", index_col=None)
df.shape

(1599, 12)

In [3]:
df = df.sample(frac=1).reset_index(drop=True)
df.head(3)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,8.7,0.41,0.41,6.2,0.078,25.0,42.0,0.9953,3.24,0.77,12.6,7
1,7.2,0.53,0.14,2.1,0.064,15.0,29.0,0.99323,3.35,0.61,12.1,6
2,11.1,0.42,0.47,2.65,0.085,9.0,34.0,0.99736,3.24,0.77,12.1,7


In [6]:
skf = StratifiedKFold(n_splits=5)

df['skfold'] = -1  # the column must exist before the loop assigns fold numbers to it

for fold, (train_, val_) in enumerate(skf.split(X=df, y=df.quality.values)):
    df.loc[val_, 'skfold'] = fold  # .loc writes to df itself, avoiding chained assignment


In [7]:
df.to_csv("Datasets/output_datasets/train_s_k_folds.csv", index=False)

In [9]:
df.head(3)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,skfold
0,8.7,0.41,0.41,6.2,0.078,25.0,42.0,0.9953,3.24,0.77,12.6,7,0
1,7.2,0.53,0.14,2.1,0.064,15.0,29.0,0.99323,3.35,0.61,12.1,6,0
2,11.1,0.42,0.47,2.65,0.085,9.0,34.0,0.99736,3.24,0.77,12.1,7,0


In [10]:
df.skfold.unique()

array([0, 1, 2, 3, 4])