## Creating a Five-Fold Cross-Validation Dataset

### KFold

The split() method of the KFold class in sklearn.model_selection returns a generator. Each item the generator yields is a tuple of two index arrays, one for training and another for testing or validation. A generator function is a function that behaves like an iterator, so you can consume its results in a loop.
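As a minimal sketch of the generator idea, independent of scikit-learn, the hypothetical function count_up_to below yields its values one at a time and can be used directly in a loop. The function name and its argument are illustrative assumptions, not part of the exercise.

```python
def count_up_to(limit):
    """Yield the integers 1..limit one at a time (hypothetical example)."""
    current = 1
    while current <= limit:
        # Execution pauses here until the caller asks for the next value.
        yield current
        current += 1


# A generator can be consumed directly in a loop or comprehension.
values = [n for n in count_up_to(3)]
print(values)
```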

In [5]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [2]:
# create headers for data
_headers = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'car']

In [3]:
# read in cars dataset
df = pd.read_csv('https://raw.githubusercontent.com/'\
                 'PacktWorkshops/The-Data-Science-Workshop/'\
                 'master/Chapter07/Dataset/car.data', names=_headers, index_col=None)
print(df.shape)
df.head()

(1728, 7)


   buying  maint  doors  persons  lug_boot  safety    car
0   vhigh  vhigh      2        2     small     low  unacc
1   vhigh  vhigh      2        2     small     med  unacc
2   vhigh  vhigh      2        2     small    high  unacc
3   vhigh  vhigh      2        2       med     low  unacc
4   vhigh  vhigh      2        2       med     med  unacc


In [4]:
# split data into 80% training and 20% for evaluation
# (the KFold steps below operate on the full df, not on training_df)
training_df, eval_df = train_test_split(df, train_size=0.8, random_state=0)

In [6]:
# instantiate KFold
_kf = KFold(n_splits=5)

In this step, you create an instance of KFold and assign it to a variable called _kf. You specify a value of 5 for the n_splits parameter so that it splits the dataset into five parts.
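To see what five splits look like in practice, here is a small sketch on hypothetical toy data (ten integers rather than the cars dataset). The names toy_data, kf, and fold_sizes are assumptions made for illustration. Because shuffle defaults to False, each validation fold is a contiguous block of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 rows, so each of the 5 folds holds 2 rows.
toy_data = np.arange(10)

kf = KFold(n_splits=5)  # shuffle defaults to False

fold_sizes = []
for train_idx, val_idx in kf.split(toy_data):
    # Each iteration yields positional indices for one train/validation pair.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print("train:", train_idx, "validation:", val_idx)
```

Each row lands in the validation set exactly once across the five iterations.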

In [7]:
# split the data
indices = _kf.split(df)

In [8]:
# what data type does indices have?
print(type(indices))

<class 'generator'>


In [9]:
# get the first set of indices:
train_indices, val_indices = next(indices)

In this step, you call Python's built-in next() function on the generator. Calling next() is how you ask a generator for its next result. You asked for five splits, so you can call next() five times on this particular generator. A sixth call raises a StopIteration exception because the generator is exhausted.

The call to next() returns a tuple. In this case, it is a pair of arrays holding row positions. The first array contains your training indices and the second contains your validation indices. You assign these to train_indices and val_indices.
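The limit on next() calls can be checked with a short sketch. It uses hypothetical toy data in place of the DataFrame, and the names toy_data, splits, and successful_calls are illustrative assumptions. The exception Python raises when a generator has nothing left to yield is StopIteration.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_data = np.arange(10)  # hypothetical stand-in for the DataFrame
splits = KFold(n_splits=5).split(toy_data)

successful_calls = 0
try:
    for _ in range(6):  # one call more than the number of splits
        train_idx, val_idx = next(splits)
        successful_calls += 1
except StopIteration:
    # Raised by the sixth next() call, once all five splits are consumed.
    print("generator exhausted after", successful_calls, "calls")
```

In everyday code a for loop over the generator handles this exception for you, so explicit next() calls are mostly useful for inspecting a single split.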

In [10]:
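# create a training dataset by dropping the validation rows
# (drop() matches index labels; this works because df has the default 0..n-1 index)
train_df = df.drop(val_indices)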
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1382 entries, 346 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1382 non-null   object
 1   maint     1382 non-null   object
 2   doors     1382 non-null   object
 3   persons   1382 non-null   object
 4   lug_boot  1382 non-null   object
 5   safety    1382 non-null   object
 6   car       1382 non-null   object
dtypes: object(7)
memory usage: 86.4+ KB


In [11]:
# create a validation dataset by dropping the training rows
# (again relies on df having the default 0..n-1 index)
val_df = df.drop(train_indices)
val_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 346 entries, 0 to 345
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    346 non-null    object
 1   maint     346 non-null    object
 2   doors     346 non-null    object
 3   persons   346 non-null    object
 4   lug_boot  346 non-null    object
 5   safety    346 non-null    object
 6   car       346 non-null    object
dtypes: object(7)
memory usage: 21.6+ KB
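The cells above build only the first of the five folds, and they select rows by dropping index labels. As a hedged alternative sketch, the loop below visits all five folds and selects rows by position with iloc, which also works when the index is not the default 0..n-1 range. The DataFrame toy_df and the list folds are hypothetical stand-ins, not part of the exercise.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for the cars DataFrame, with a non-default index.
toy_df = pd.DataFrame(
    {"safety": ["low", "med", "high"] * 5},
    index=range(100, 115),
)

folds = []
for train_idx, val_idx in KFold(n_splits=5).split(toy_df):
    # iloc selects by position, so the index labels do not matter.
    fold_train = toy_df.iloc[train_idx]
    fold_val = toy_df.iloc[val_idx]
    folds.append((fold_train, fold_val))
    print(fold_train.shape, fold_val.shape)
```

With 15 rows and five splits, each iteration produces 12 training rows and 3 validation rows.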
