# Part 5.2: Using K-Fold Cross-validation with Keras

Cross-validation can be used for a variety of purposes in predictive modeling.  These include:

* Generating out-of-sample predictions from a neural network
* Estimating a good number of epochs to train a neural network for (early stopping)
* Evaluating the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses a number of folds, and multiple models, to give each segment of the data a chance to serve in both the validation set and the training set. Cross-validation is shown in Figure 5.CROSS.

**Figure 5.CROSS: K-Fold Cross-Validation**
![K-Fold Cross-Validation](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_1_kfold.png "K-Fold Cross-Validation")
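
To make the fold mechanics concrete, the following minimal sketch splits ten toy samples into five folds and counts how often each sample is used for validation. The names `x_demo`, `kf_demo`, and `validation_counts` are illustrative assumptions and are not part of the course code.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples; the values are placeholders, only the indices matter here.
x_demo = np.arange(10).reshape(-1, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Track how many times each sample appears in a validation fold.
validation_counts = np.zeros(len(x_demo), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(x_demo), start=1):
    validation_counts[val_idx] += 1
    print(f"Fold #{fold}: train={train_idx}, validation={val_idx}")

# Each sample should be validated exactly once across the five folds.
print(validation_counts)
```

Each sample serves as validation data in exactly one fold and as training data in the other four, which is what allows cross-validation to produce an out-of-sample prediction for every row.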

It is important to note that there will be one model (neural network) for each fold. To generate predictions for new data, which is data not present in the training set, predictions from the fold models can be handled in several ways:

* Choose the model that had the highest validation score as the final model.
* Present new data to all of the fold models (five models for 5-fold cross-validation) and average the results (this is an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning)).
* Retrain a new model (using the same settings as the cross-validation) on the entire dataset.  Train for the same number of epochs and with the same hidden layer structure.

Generally, I prefer the last approach and will retrain a model on the entire dataset once I have selected hyperparameters.  Of course, I will always set aside a final holdout set for model validation that I do not use in any aspect of the training process.
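
The ensemble and retraining options above can be sketched as follows. To keep the example fast, it substitutes Scikit-Learn's `LinearRegression` for a neural network and uses synthetic data; both choices, along with every variable name, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (an assumption made purely for illustration).
rng = np.random.default_rng(42)
x_all = rng.normal(size=(100, 3))
y_all = x_all @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
x_new = rng.normal(size=(4, 3))  # "new" data that was never used for training

# Train one stand-in model per fold, as cross-validation would.
fold_models = []
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, _ in kf_demo.split(x_all):
    fold_models.append(LinearRegression().fit(x_all[train_idx], y_all[train_idx]))

# Ensemble option: average the predictions of the five fold models.
ensemble_pred = np.mean([m.predict(x_new) for m in fold_models], axis=0)

# Retrain option: fit a single model on the entire dataset.
final_model = LinearRegression().fit(x_all, y_all)
final_pred = final_model.predict(x_new)

print(ensemble_pred)
print(final_pred)
```

With a Keras network, the same pattern would apply: keep each fold's model in a list for the ensemble, or rebuild the network with the chosen settings and fit it on all of the data.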

### Regression vs Classification K-Fold Cross-Validation

Regression and classification are handled somewhat differently with regard to cross-validation.  Regression is the simpler case: you can break the dataset into K folds with little regard for where each item lands.  For regression, it is best that the data items fall into the folds as randomly as possible.  It is also important to remember that not every fold will necessarily have exactly the same number of data items, because the dataset cannot always be divided evenly into K folds.  For regression cross-validation we will use the Scikit-Learn class **KFold**.
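
As a small illustration of uneven fold sizes, the sketch below splits 11 samples into 5 folds. The sample count and the names `x_demo` and `fold_sizes` are hypothetical choices for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# 11 samples cannot be divided evenly into 5 folds.
x_demo = np.zeros((11, 1))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Record the number of validation items in each fold.
fold_sizes = [len(val_idx) for _, val_idx in kf_demo.split(x_demo)]
print(fold_sizes)
```

The fold sizes differ by at most one item, so the imbalance is rarely a practical concern.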

Cross-validation for classification could also use the **KFold** object; however, this technique would not ensure that the class balance in each fold remains the same as in the original dataset.  It is very important that the balance of classes in the data a model encounters remains the same as (or similar to) the balance in its training set.  A drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use.  Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. Preserving the class balance in every fold is referred to as stratified sampling and is accomplished by using the Scikit-Learn object **StratifiedKFold** in place of **KFold** whenever you are performing classification.  In summary, the following two objects in Scikit-Learn should be used:

* **KFold**: when dealing with a regression problem.
* **StratifiedKFold**: when dealing with a classification problem.
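
The difference between the two objects can be seen on a small imbalanced example. The sketch below assumes a toy label vector with 90 items of class 0 and 10 of class 1; the labels and the helper `minority_counts` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 of class 0 and 10 of class 1 (illustrative only).
y_demo = np.array([0] * 90 + [1] * 10)
x_demo = np.zeros((len(y_demo), 1))


def minority_counts(splitter):
    """Count the class-1 items that land in each validation fold."""
    return [int(y_demo[val_idx].sum())
            for _, val_idx in splitter.split(x_demo, y_demo)]


plain_counts = minority_counts(
    KFold(n_splits=5, shuffle=True, random_state=42))
stratified_counts = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold minority items per fold:          ", plain_counts)
print("StratifiedKFold minority items per fold:", stratified_counts)
```

**StratifiedKFold** places two minority items in every fold, matching the 10% rate of the full dataset, whereas **KFold** makes no such guarantee and its per-fold counts can vary.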

The following two sections demonstrate cross-validation with classification and regression. 

### Out-of-Sample Regression Predictions with K-Fold Cross-Validation

The following code trains a neural network on the simple dataset using 5-fold cross-validation.  The score computed on the generated out-of-sample predictions is the performance we would expect from a neural network of the type trained here.  We begin by preparing a feature vector using the jh-simple-dataset to predict age.  This is a regression problem.

In [1]:
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the data set
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for job
df = pd.concat([df,pd.get_dummies(df['job'],prefix="job")],axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df,pd.get_dummies(df['product'],prefix="product")],axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

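# Convert to numpy - Regression
x_columns = df.columns.drop('age').drop('id')
# Cast to float32 so that dummy columns, which are boolean in
# pandas 2.0 and later, convert cleanly to a tensor
x = df[x_columns].values.astype('float32')
y = df['age'].values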

Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions.  We will assume 500 epochs and will not use early stopping.  Later we will see how to estimate a better epoch count.

In [2]:
EPOCHS = 500

In [None]:
import pandas as pd
import os
import numpy as np
from sklearn.model_selection import KFold
from scipy.stats import zscore
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn import metrics


# Cross-validate with five shuffled folds
kf = KFold(5, shuffle=True, random_state=42)
oos_y = []
oos_pred = []

fold = 0
for train, test in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")
    
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    
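    # Build a fresh network for each fold so that no weights carry over
    model = Sequential()
    model.add(Dense(20, input_dim = x.shape[1], activation = 'relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(loss = 'mean_squared_error', optimizer = 'adam')
    # validation_data is only monitored here; with no early stopping,
    # training always runs for the full EPOCHS
    model.fit(x_train, y_train, validation_data = (x_test,y_test),
              verbose = 0, epochs = EPOCHS)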
    pred = model.predict(x_test)
    
    oos_y.append(y_test)
    oos_pred.append(pred)
    
    # Measure this fold's RMSE (arguments are y_true, then y_pred)
    score = np.sqrt(metrics.mean_squared_error(y_test, pred))
    print(f"Fold Score(RMSE) : {score}")

Fold #1
Fold Score(RMSE) : 0.7067026440646986
Fold #2
