## Creating a Five-Fold Cross-Validation Dataset Using a Loop for Calls
The goal of this exercise is to create a five-fold cross-validation dataset from the data that you imported in Exercise 7.01, Importing and Splitting Data. You will use a loop to call the generator function once per fold.

In [4]:
# import libraries
import pandas as pd
from sklearn.model_selection import KFold

In [2]:
# create headers for data
_headers = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'car']

In [3]:
# read in cars dataset
df = pd.read_csv('https://raw.githubusercontent.com/'
                 'PacktWorkshops/The-Data-Science-Workshop/'
                 'master/Chapter07/Dataset/car.data',
                 names=_headers, index_col=None)
print(df.shape)
df.head()

(1728, 7)


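|   | buying | maint | doors | persons | lug_boot | safety | car |
|---|--------|-------|-------|---------|----------|--------|-----|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |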


In [5]:
# define number of splits
n_splits = 5

In [6]:
# create an instance of KFold
_kf = KFold(n_splits=n_splits)

In [7]:
# generate the split indices
_indices = _kf.split(df)

In [8]:
# create two python lists
_t, _v = [], []

In [9]:
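# iterate over _indices, one fold per iteration
for _ in range(n_splits):
    # get the row indices for the next fold
    train_idx, val_idx = next(_indices)
    # training set: every row except the validation rows
    _train_df = df.drop(val_idx)
    _t.append(_train_df)
    # validation set: every row except the training rows
    _val_df = df.drop(train_idx)
    _v.append(_val_df)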

In this step, you use range() to control the number of iterations, passing n_splits so that the loop runs once per fold. On every iteration, you call next() on the _indices generator, which returns a pair of index arrays that you store in train_idx and val_idx. You then create _train_df by dropping the validation rows, val_idx, from df, and _val_df by dropping the training rows, train_idx. Because df uses the default integer index, the positions returned by KFold match the row labels that drop() expects.
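The same folds can also be built by iterating over the generator directly and selecting rows by position with iloc, which does not depend on the DataFrame having a default integer index. The following is a minimal sketch rather than the exercise's own method. It assumes a stand-in DataFrame with the same number of rows (1,728) in place of the cars data, and the names demo_df, train_folds, and val_folds are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# stand-in data (assumption): same row count as the cars dataset
demo_df = pd.DataFrame({'x': np.arange(1728)})

kf = KFold(n_splits=5)
train_folds, val_folds = [], []

# split() yields one (train_idx, val_idx) pair of positional indices per fold
for train_idx, val_idx in kf.split(demo_df):
    train_folds.append(demo_df.iloc[train_idx])
    val_folds.append(demo_df.iloc[val_idx])

# 1728 is not divisible by 5, so the first three folds get one extra row
print([len(v) for v in val_folds])    # [346, 346, 346, 345, 345]
print([len(t) for t in train_folds])  # [1382, 1382, 1382, 1383, 1383]
```

Every row lands in exactly one validation fold, so the training and validation sizes for each fold always add up to the full row count.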

In [10]:
# iterate over the training list
for d in _t:
    print(d.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1382 entries, 346 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1382 non-null   object
 1   maint     1382 non-null   object
 2   doors     1382 non-null   object
 3   persons   1382 non-null   object
 4   lug_boot  1382 non-null   object
 5   safety    1382 non-null   object
 6   car       1382 non-null   object
dtypes: object(7)
memory usage: 86.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1382 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1382 non-null   object
 1   maint     1382 non-null   object
 2   doors     1382 non-null   object
 3   persons   1382 non-null   object
 4   lug_boot  1382 non-null   object
 5   safety    1382 non-null   object
 6   car       1382 non-null   object
dtypes: object(7)
memory usage: 86.4+ KB
None
...

In [11]:
# iterate over the validation list
for d in _v:
    print(d.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 346 entries, 0 to 345
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    346 non-null    object
 1   maint     346 non-null    object
 2   doors     346 non-null    object
 3   persons   346 non-null    object
 4   lug_boot  346 non-null    object
 5   safety    346 non-null    object
 6   car       346 non-null    object
dtypes: object(7)
memory usage: 21.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 346 entries, 346 to 691
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    346 non-null    object
 1   maint     346 non-null    object
 2   doors     346 non-null    object
 3   persons   346 non-null    object
 4   lug_boot  346 non-null    object
 5   safety    346 non-null    object
 6   car       346 non-null    object
dtypes: object(7)
memory usage: 21.6+ KB
None
...

### cross_val_score takes care of the following:

- Creating cross-validation datasets
- Training models by fitting them to the training data
- Evaluating the models on the validation data
- Returning an array with the score of each trained model (by default, the R2 score for a regressor such as LinearRegression)

For all of the preceding actions to happen, you will need to provide the following inputs:

- An instance of an estimator (for example, LinearRegression)
- The original dataset (features and target)
- The number of splits to create (which is also the number of models that will be trained and evaluated)
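The three inputs listed above map onto a single call. The following is a minimal sketch, assuming synthetic regression data generated with make_regression, because the cars dataset is categorical and would need encoding before it could be passed to LinearRegression. The variable names X, y, and scores are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic features and target (assumption: stands in for a real dataset)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# estimator instance, dataset, and number of splits
scores = cross_val_score(LinearRegression(), X, y, cv=5)

# one score per fold; for a regressor the default metric is R2
print(scores.shape)  # (5,)
print(scores.mean())
```

The cv argument plays the role of n_splits in the manual loop: passing 5 trains and evaluates five models, one per fold.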
