# How to Evaluate the Performance of PyTorch Models

Designing a deep learning model is sometimes an art. 
There are many decision points, and it is not easy to tell which choice is best. 
One way to arrive at a design is by trial and error, evaluating each candidate on real data. 
Therefore, it is important to have a systematic method for evaluating the performance of your neural network and deep learning models. 
In fact, the same method can be used to compare any kind of machine learning model for a particular use case.

In this tutorial, you will discover the standard workflow for robustly evaluating model performance. 
The examples use PyTorch to build the models, but the method applies to other models as well.

## Outcome 

After completing this tutorial, you will know:

- How to evaluate a PyTorch model using a validation dataset
- How to evaluate a PyTorch model with k-fold cross-validation

## Overview

This tutorial is in four parts; they are:

- Empirical Evaluation of Models
- Data Splitting
- Training a PyTorch Model with Validation
- k-Fold Cross Validation

## Empirical Evaluation of Models

When designing and configuring a deep learning model from scratch, there are many decisions to make. 
These include design decisions such as how many layers to use, how big each layer is, and what kinds of layers and activation functions to use. 
They also include the choice of loss function, optimization algorithm, number of training epochs, and how the model output is interpreted. 
Fortunately, you can sometimes copy the structure of other people's networks. 
Other times, you can make your choices using heuristics. 
To tell whether you made a good choice, the best approach is to compare multiple alternatives by empirically evaluating them on actual data.

Deep learning is often used on problems that have very large datasets. 
These can contain tens of thousands or hundreds of thousands of data samples. 
This provides ample data for testing. 
But you need a robust test strategy to estimate the performance of your model on unseen data. 
Based on that, you have a metric for comparing different model configurations.

## Data Splitting

If you have a dataset of tens of thousands of samples or more, you don't always need to give everything to your model for training. 
Doing so unnecessarily increases the complexity and lengthens the training time. 
More is not always better, and you may not get the best result.

When you have a large amount of data, you should usually take a portion of it as the training set that is fed into the model for training. 
Another portion is held out from training as the test set and is used to evaluate a trained or partially trained model. 
This step is usually called a "train-test split".

Let's consider the Pima Indians Diabetes dataset. 
We can load the data using NumPy:

In [1]:
import numpy as np
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")

There are 768 data samples. 
That is not a lot, but it is enough to demonstrate the split. 
Let's use the first 66% as the training set and the remainder as the test set. 
The easiest way to do so is by slicing the array:

In [2]:
# find the boundary at 66% of total samples
count = len(data)
n_train = int(count * 0.66)
# split the data at the boundary
train_data = data[:n_train]
test_data = data[n_train:]

The choice of 66% is arbitrary, but you do not want the training set to be too small. 
Sometimes you may use a 70%-30% split. 
But if the dataset is huge, you may even use a 30%-70% split, provided that 30% of the data is large enough for training.
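
To make the arithmetic concrete, the short sketch below computes the resulting set sizes for a few split ratios on a 768-sample dataset. The helper name split_sizes is made up for this illustration and is not part of the tutorial's code.

```python
def split_sizes(n_samples, train_fraction):
    """Return (n_train, n_test) when the first train_fraction of rows is used for training."""
    # int() truncates the fractional part, as in the slicing code above
    n_train = int(n_samples * train_fraction)
    return n_train, n_samples - n_train


# Compare a few split ratios on a dataset of 768 rows
for fraction in (0.66, 0.70, 0.30):
    n_train, n_test = split_sizes(768, fraction)
    print(f"{fraction:.0%} train: {n_train} training rows, {n_test} test rows")
```

Because the boundary is truncated to an integer, the two sets always add up to the full dataset and never overlap.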

If you split the data in this way, you are assuming the dataset is shuffled so that the training set and the test set are equally diverse. 
If the original dataset is sorted and you take the test set only from the end, you may find that all the test data belong to the same class or carry the same value in one of the input features. 
That's not ideal.
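
The problem is easy to reproduce on a toy example. The sketch below uses a made-up label array sorted by class (it is not the diabetes dataset) and counts how many samples of each class land in each split when the data is sliced without shuffling.

```python
import numpy as np

# Hypothetical toy labels sorted by class: 60 negatives followed by 40 positives
labels = np.array([0] * 60 + [1] * 40)

# Same slicing approach as above: first 66% for training, the rest for testing
n_train = int(len(labels) * 0.66)
train_labels = labels[:n_train]
test_labels = labels[n_train:]

# Count how many samples of each class ended up in each split
train_counts = np.bincount(train_labels, minlength=2)
test_counts = np.bincount(test_labels, minlength=2)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)
```

Here the test set contains no negative samples at all, so an accuracy measured on it says nothing about how the model handles that class.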

Of course, you can call np.random.shuffle(data) before the split to avoid that. 
But many machine learning engineers use scikit-learn for this, since its train_test_split function shuffles the data by default. 
See this example:

In [3]:
import numpy as np
from sklearn.model_selection import train_test_split
 
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
train_data, test_data = train_test_split(data, test_size=0.33)

More commonly, the split is done after separating the input features from the output labels. 
Note that this scikit-learn function works not only on NumPy arrays but also on PyTorch tensors:

In [4]:
import numpy as np
import torch
from sklearn.model_selection import train_test_split
 
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
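
The split above is different on every run. If you need a repeatable split that also preserves the class ratio, train_test_split accepts the random_state and stratify parameters. The following is a minimal sketch on made-up NumPy data; the variable names and the toy class ratio are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 8 features, imbalanced binary labels
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 8))
y_toy = np.array([0] * 65 + [1] * 35)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

print("fraction of positives in training set:", y_tr.mean())
print("fraction of positives in test set:    ", y_te.mean())
```

With stratification, both fractions stay close to the 35% of positives in the full toy dataset, and rerunning the code with the same random_state reproduces the same split.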

## Training a PyTorch Model with Validation

Let's revisit the code for building and training a deep learning model on this dataset:

In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm


model = nn.Sequential(
    nn.Linear(8, 12),
    nn.ReLU(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
 
# loss function and optimizer
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.0001)
 
n_epochs = 50    # number of epochs to run
batch_size = 10  # size of each batch
batches_per_epoch = len(X_train) // batch_size
 
for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            bar.set_postfix(
                loss=float(loss)
            )

Epoch 0: 100%|███████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 167.93batch/s, loss=4.16]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 214.33batch/s, loss=3.32]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 219.68batch/s, loss=2.54]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 238.16batch/s, loss=1.8]
Epoch 4: 100%|███████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 213.33batch/s, loss=1.12]
Epoch 5: 100%|██████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 227.91batch/s, loss=0.613]
Epoch 6: 100%|██████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 225.13batch/s, loss=0.468]
Epoch 7: 100%|██████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 252.48batch/s, loss=0.466]


In this code, one batch is extracted from the training set in each iteration and sent to the model in the forward pass. 
Then the gradient is computed in the backward pass and the weights are updated.
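
One detail of this batching scheme is worth spelling out: because batches_per_epoch uses floor division, any rows left over after the last full batch are never seen during training. The sketch below reproduces only the index arithmetic. The training-set size of 514 is an assumption, chosen to be consistent with the 51 batches per epoch shown in the output above, and batch_slices is a hypothetical helper.

```python
def batch_slices(n_samples, batch_size):
    """Yield (start, stop) index pairs for full batches only, as in the loop above."""
    for i in range(n_samples // batch_size):
        start = i * batch_size
        yield start, start + batch_size


# Assumed training-set size, consistent with 51 batches of 10 samples
n_samples, batch_size = 514, 10
slices = list(batch_slices(n_samples, batch_size))
n_used = slices[-1][1]  # stop index of the last full batch
print(f"{len(slices)} batches cover rows 0..{n_used - 1}; "
      f"{n_samples - n_used} rows are skipped each epoch")
```

For a small remainder this hardly matters, but if it does, one option is to add a final partial batch or to reshuffle the training set at each epoch so that different rows are left out each time.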

While binary cross entropy is used as the loss metric in this training loop, you may be more concerned with prediction accuracy. 
Calculating accuracy is easy. 
You round the output (in the range of 0 to 1) to the nearest integer to get a binary value of 0 or 1. 
The accuracy is then the percentage of predictions that match the labels.
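
As a small worked example of this calculation, the sketch below applies it to six made-up sigmoid outputs. The tensors probs and labels are hypothetical values chosen for illustration, not output from the model.

```python
import torch

# Hypothetical sigmoid outputs and true labels for six samples
probs = torch.tensor([[0.91], [0.20], [0.65], [0.49], [0.08], [0.73]])
labels = torch.tensor([[1.0], [0.0], [0.0], [0.0], [1.0], [1.0]])

# Round to the nearest integer to get hard 0/1 predictions
hard_preds = probs.round()

# Compare with the labels, convert booleans to floats, and average
accuracy = (hard_preds == labels).float().mean().item()
print(f"accuracy: {accuracy:.3f}")
```

Four of the six rounded predictions match their labels, so the accuracy is 4/6. Note that rounding is the same as applying a decision threshold at 0.5.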

But what is your prediction? It is y_pred above, which is the prediction of your current model on X_batch. 
Adding accuracy to the training loop gives this:

In [12]:
for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress, with accuracy
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(loss=float(loss), acc=float(acc))

Epoch 0: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 216.98batch/s, acc=0.7, loss=0.638]
Epoch 1: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 224.33batch/s, acc=0.7, loss=0.638]
Epoch 2: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 224.90batch/s, acc=0.7, loss=0.638]
Epoch 3: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 240.41batch/s, acc=0.7, loss=0.638]
Epoch 4: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 246.96batch/s, acc=0.7, loss=0.638]
Epoch 5: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 233.56batch/s, acc=0.7, loss=0.638]
Epoch 6: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 238.23batch/s, acc=0.7, loss=0.638]
Epoch 7: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 249.94batch/s, acc=0.7, loss=0.639]


However, X_batch and y_batch are used by the optimizer, which fine-tunes your model so that it can predict y_batch from X_batch. 
And now you are using accuracy to check whether y_pred matches y_batch. 
This is like cheating: if your model somehow remembers the solution, it can simply report the memorized y_pred and get perfect accuracy without actually inferring y_pred from X_batch.

Indeed, a deep learning model can be so complex that you cannot know whether it is remembering the answer or inferring it. 
Therefore, the best approach is to calculate accuracy not from X_batch or anything else in X_train, but from something else: the test set. 
Let's add an accuracy measurement after each epoch, using X_test:

In [13]:
for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
    # evaluate model at end of epoch
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    acc = float(acc)
    print(f"End of {epoch}, accuracy {acc}")

Epoch 0: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 234.28batch/s, acc=0.7, loss=0.644]


End of 0, accuracy 0.7322834730148315


Epoch 1: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 238.78batch/s, acc=0.7, loss=0.644]


End of 1, accuracy 0.7283464670181274


Epoch 2: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 261.18batch/s, acc=0.7, loss=0.645]


End of 2, accuracy 0.7283464670181274


Epoch 3: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 266.54batch/s, acc=0.7, loss=0.647]


End of 3, accuracy 0.7244094610214233


Epoch 4: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 258.63batch/s, acc=0.7, loss=0.648]


End of 4, accuracy 0.7283464670181274


Epoch 5: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 254.76batch/s, acc=0.7, loss=0.649]


End of 5, accuracy 0.7283464670181274


Epoch 6: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 234.65batch/s, acc=0.7, loss=0.649]


End of 6, accuracy 0.7283464670181274


Epoch 7: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 256.47batch/s, acc=0.7, loss=0.649]


End of 7, accuracy 0.7283464670181274


Epoch 8: 100%|██████████████████████████████████████████████████████| 51/51 [00:00<00:00, 261.61batch/s, acc=0.7, loss=0.65]


End of 8, accuracy 0.7283464670181274


Epoch 9: 100%|██████████████████████████████████████████████████████| 51/51 [00:00<00:00, 250.02batch/s, acc=0.7, loss=0.65]


End of 9, accuracy 0.7283464670181274


Epoch 10: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 254.48batch/s, acc=0.7, loss=0.649]


End of 10, accuracy 0.7283464670181274


Epoch 11: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 257.87batch/s, acc=0.7, loss=0.649]


End of 11, accuracy 0.7283464670181274


Epoch 12: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 232.50batch/s, acc=0.7, loss=0.649]


End of 12, accuracy 0.7244094610214233


Epoch 13: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 246.57batch/s, acc=0.7, loss=0.649]


End of 13, accuracy 0.7244094610214233


Epoch 14: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 261.16batch/s, acc=0.7, loss=0.649]


End of 14, accuracy 0.7244094610214233


Epoch 15: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 258.71batch/s, acc=0.7, loss=0.649]


End of 15, accuracy 0.7244094610214233


Epoch 16: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 260.25batch/s, acc=0.7, loss=0.649]


End of 16, accuracy 0.7244094610214233


Epoch 17: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 173.63batch/s, acc=0.7, loss=0.648]


End of 17, accuracy 0.7244094610214233


Epoch 18: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 236.79batch/s, acc=0.7, loss=0.648]


End of 18, accuracy 0.7244094610214233


Epoch 19: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 252.78batch/s, acc=0.7, loss=0.647]


End of 19, accuracy 0.7244094610214233


Epoch 20: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 184.16batch/s, acc=0.7, loss=0.647]


End of 20, accuracy 0.7204724550247192


Epoch 21: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 208.87batch/s, acc=0.7, loss=0.647]


End of 21, accuracy 0.7204724550247192


Epoch 22: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 245.09batch/s, acc=0.7, loss=0.646]


End of 22, accuracy 0.7204724550247192


Epoch 23: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 236.70batch/s, acc=0.7, loss=0.646]


End of 23, accuracy 0.7204724550247192


Epoch 24: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 218.96batch/s, acc=0.7, loss=0.645]


End of 24, accuracy 0.7204724550247192


Epoch 25: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 220.49batch/s, acc=0.7, loss=0.645]


End of 25, accuracy 0.7204724550247192


Epoch 26: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 243.01batch/s, acc=0.7, loss=0.645]


End of 26, accuracy 0.7204724550247192


Epoch 27: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 215.26batch/s, acc=0.7, loss=0.644]


End of 27, accuracy 0.7244094610214233


Epoch 28: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 213.20batch/s, acc=0.7, loss=0.644]


End of 28, accuracy 0.7283464670181274


Epoch 29: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 238.55batch/s, acc=0.7, loss=0.644]


End of 29, accuracy 0.7283464670181274


Epoch 30: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 228.35batch/s, acc=0.7, loss=0.644]


End of 30, accuracy 0.7283464670181274


Epoch 31: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 200.08batch/s, acc=0.7, loss=0.643]


End of 31, accuracy 0.7244094610214233


Epoch 32: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 213.77batch/s, acc=0.7, loss=0.643]


End of 32, accuracy 0.7244094610214233


Epoch 33: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 264.85batch/s, acc=0.7, loss=0.643]


End of 33, accuracy 0.7244094610214233


Epoch 34: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 238.02batch/s, acc=0.7, loss=0.642]


End of 34, accuracy 0.7244094610214233


Epoch 35: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 203.85batch/s, acc=0.7, loss=0.642]


End of 35, accuracy 0.7283464670181274


Epoch 36: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 201.51batch/s, acc=0.7, loss=0.641]


End of 36, accuracy 0.7322834730148315


Epoch 37: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 187.70batch/s, acc=0.7, loss=0.641]


End of 37, accuracy 0.7322834730148315


Epoch 38: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 183.15batch/s, acc=0.7, loss=0.641]


End of 38, accuracy 0.7322834730148315


Epoch 39: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 205.04batch/s, acc=0.7, loss=0.641]


End of 39, accuracy 0.7322834730148315


Epoch 40: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 231.43batch/s, acc=0.7, loss=0.641]


End of 40, accuracy 0.7322834730148315


Epoch 41: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 222.71batch/s, acc=0.7, loss=0.64]


End of 41, accuracy 0.7283464670181274


Epoch 42: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 214.86batch/s, acc=0.7, loss=0.64]


End of 42, accuracy 0.7322834730148315


Epoch 43: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 215.72batch/s, acc=0.7, loss=0.64]


End of 43, accuracy 0.7283464670181274


Epoch 44: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 236.85batch/s, acc=0.7, loss=0.639]


End of 44, accuracy 0.7283464670181274


Epoch 45: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 230.76batch/s, acc=0.7, loss=0.639]


End of 45, accuracy 0.7322834730148315


Epoch 46: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 221.23batch/s, acc=0.7, loss=0.639]


End of 46, accuracy 0.7283464670181274


Epoch 47: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 193.60batch/s, acc=0.7, loss=0.639]


End of 47, accuracy 0.7362204790115356


Epoch 48: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 234.24batch/s, acc=0.7, loss=0.638]


End of 48, accuracy 0.7362204790115356


Epoch 49: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 180.64batch/s, acc=0.7, loss=0.638]

End of 49, accuracy 0.7362204790115356





In this case, the acc in the inner for-loop is just a metric showing the progress. 
It is not much different from displaying the loss metric, except that it is not involved in the gradient descent algorithm. 
You expect the accuracy to improve as the loss improves.

In the outer for-loop, at the end of each epoch, you calculate the accuracy from X_test. 
The workflow is similar: you give the test set to the model and ask for its predictions, then count how many of them match the test set labels. 
This accuracy is the one you should care about. It should improve as training progresses; if it stops improving or even deteriorates, you should interrupt the training because the model appears to be overfitting. 
Overfitting is when the model starts to remember the training set rather than learning to infer the prediction from it. 
A sign of overfitting is that the accuracy on the training set keeps increasing while the accuracy on the test set decreases.
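
One common way to act on this, which is not part of the code in this tutorial, is early stopping with patience: stop once the held-out accuracy has failed to improve for a fixed number of epochs. The sketch below applies that rule to a made-up list of per-epoch accuracies. The function name, the patience value, and the numbers are all assumptions for illustration.

```python
def best_epoch_with_patience(test_accuracies, patience=5):
    """Return (stop_epoch, best_epoch, best_acc) for a patience-based stopping rule.

    Training is considered stopped once the accuracy has not improved
    for `patience` consecutive epochs.
    """
    best_acc = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(test_accuracies):
        if acc > best_acc:
            # new best score: remember it and reset the waiting period
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # no improvement for `patience` epochs: stop here
            return epoch, best_epoch, best_acc
    # patience never ran out, so training ran through the last epoch
    return len(test_accuracies) - 1, best_epoch, best_acc


# Hypothetical per-epoch accuracies: improving at first, then degrading
history = [0.61, 0.66, 0.70, 0.73, 0.72, 0.72, 0.71, 0.70, 0.69, 0.68]
stop_epoch, best_epoch, best_acc = best_epoch_with_patience(history, patience=3)
print(f"stop at epoch {stop_epoch}; best was epoch {best_epoch} with accuracy {best_acc}")
```

In a real training loop you would also keep a copy of the model weights from the best epoch. In stricter setups, the score used for stopping comes from a separate validation set so that the test set remains untouched until the final evaluation.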

The following is the complete code implementing all of the above, from data splitting to validation using the test set:

In [14]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import train_test_split
 
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
 
model = nn.Sequential(
    nn.Linear(8, 12),
    nn.ReLU(),
    nn.Linear(12, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
 
# loss function and optimizer
loss_fn = nn.BCELoss()  # binary cross entropy
optimizer = optim.Adam(model.parameters(), lr=0.0001)
 
n_epochs = 50    # number of epochs to run
batch_size = 10  # size of each batch
batches_per_epoch = len(X_train) // batch_size
 
for epoch in range(n_epochs):
    with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0) as bar:
        bar.set_description(f"Epoch {epoch}")
        for i in bar:
            # take a batch
            start = i * batch_size
            X_batch = X_train[start:start+batch_size]
            y_batch = y_train[start:start+batch_size]
            # forward pass
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            # backward pass
            optimizer.zero_grad()
            loss.backward()
            # update weights
            optimizer.step()
            # print progress
            acc = (y_pred.round() == y_batch).float().mean()
            bar.set_postfix(
                loss=float(loss),
                acc=float(acc)
            )
    # evaluate model at end of epoch
    y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    acc = float(acc)
    print(f"End of {epoch}, accuracy {acc}")

Epoch 0: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 262.43batch/s, acc=0.5, loss=0.888]


End of 0, accuracy 0.5039370059967041


Epoch 1: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 227.89batch/s, acc=0.5, loss=0.835]


End of 1, accuracy 0.4881889820098877


Epoch 2: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 228.11batch/s, acc=0.6, loss=0.798]


End of 2, accuracy 0.4842519760131836


Epoch 3: 100%|██████████████████████████████████████████████████████| 51/51 [00:00<00:00, 242.48batch/s, acc=0.6, loss=0.77]


End of 3, accuracy 0.4803149700164795


Epoch 4: 100%|██████████████████████████████████████████████████████| 51/51 [00:00<00:00, 229.68batch/s, acc=0.6, loss=0.75]


End of 4, accuracy 0.4724409580230713


Epoch 5: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 246.54batch/s, acc=0.6, loss=0.744]


End of 5, accuracy 0.4685039222240448


Epoch 6: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 235.77batch/s, acc=0.6, loss=0.744]


End of 6, accuracy 0.4606299102306366


Epoch 7: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 250.52batch/s, acc=0.6, loss=0.748]


End of 7, accuracy 0.4685039222240448


Epoch 8: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 245.05batch/s, acc=0.6, loss=0.752]


End of 8, accuracy 0.4645669162273407


Epoch 9: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 214.50batch/s, acc=0.6, loss=0.759]


End of 9, accuracy 0.4763779640197754


Epoch 10: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 223.72batch/s, acc=0.6, loss=0.768]


End of 10, accuracy 0.4842519760131836


Epoch 11: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 247.74batch/s, acc=0.6, loss=0.779]


End of 11, accuracy 0.4842519760131836


Epoch 12: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 246.62batch/s, acc=0.6, loss=0.793]


End of 12, accuracy 0.4881889820098877


Epoch 13: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 225.34batch/s, acc=0.6, loss=0.809]


End of 13, accuracy 0.4960629940032959


Epoch 14: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 242.43batch/s, acc=0.5, loss=0.828]


End of 14, accuracy 0.5157480239868164


Epoch 15: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 222.59batch/s, acc=0.5, loss=0.847]


End of 15, accuracy 0.5236220359802246


Epoch 16: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 224.49batch/s, acc=0.5, loss=0.864]


End of 16, accuracy 0.5236220359802246


Epoch 17: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 203.69batch/s, acc=0.4, loss=0.881]


End of 17, accuracy 0.5236220359802246


Epoch 18: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 243.08batch/s, acc=0.4, loss=0.896]


End of 18, accuracy 0.5196850299835205


Epoch 19: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 188.09batch/s, acc=0.4, loss=0.909]


End of 19, accuracy 0.5275590419769287


Epoch 20: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 204.37batch/s, acc=0.4, loss=0.918]


End of 20, accuracy 0.5511810779571533


Epoch 21: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 164.11batch/s, acc=0.4, loss=0.928]


End of 21, accuracy 0.5629921555519104


Epoch 22: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 160.75batch/s, acc=0.4, loss=0.937]


End of 22, accuracy 0.5669291615486145


Epoch 23: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 191.39batch/s, acc=0.4, loss=0.944]


End of 23, accuracy 0.586614191532135


Epoch 24: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 170.89batch/s, acc=0.4, loss=0.951]


End of 24, accuracy 0.5826771855354309


Epoch 25: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 160.93batch/s, acc=0.4, loss=0.956]


End of 25, accuracy 0.5905511975288391


Epoch 26: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 198.53batch/s, acc=0.4, loss=0.964]


End of 26, accuracy 0.5944882035255432


Epoch 27: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 202.20batch/s, acc=0.4, loss=0.97]


End of 27, accuracy 0.5905511975288391


Epoch 28: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 196.17batch/s, acc=0.4, loss=0.974]


End of 28, accuracy 0.6023622155189514


Epoch 29: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 201.16batch/s, acc=0.3, loss=0.977]


End of 29, accuracy 0.6062992215156555


Epoch 30: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 146.12batch/s, acc=0.3, loss=0.979]


End of 30, accuracy 0.6141732335090637


Epoch 31: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 180.85batch/s, acc=0.3, loss=0.979]


End of 31, accuracy 0.6141732335090637


Epoch 32: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 241.71batch/s, acc=0.3, loss=0.98]


End of 32, accuracy 0.6220472455024719


Epoch 33: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 237.23batch/s, acc=0.3, loss=0.98]


End of 33, accuracy 0.625984251499176


Epoch 34: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 246.90batch/s, acc=0.3, loss=0.981]


End of 34, accuracy 0.625984251499176


Epoch 35: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 259.84batch/s, acc=0.3, loss=0.98]


End of 35, accuracy 0.625984251499176


Epoch 36: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 249.89batch/s, acc=0.3, loss=0.979]


End of 36, accuracy 0.6338582634925842


Epoch 37: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 250.30batch/s, acc=0.3, loss=0.979]


End of 37, accuracy 0.6338582634925842


Epoch 38: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 251.68batch/s, acc=0.3, loss=0.977]


End of 38, accuracy 0.6417322754859924


Epoch 39: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 252.06batch/s, acc=0.3, loss=0.975]


End of 39, accuracy 0.6456692814826965


Epoch 40: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 255.32batch/s, acc=0.3, loss=0.973]


End of 40, accuracy 0.6496062874794006


Epoch 41: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 244.94batch/s, acc=0.3, loss=0.971]


End of 41, accuracy 0.6496062874794006


Epoch 42: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 209.31batch/s, acc=0.3, loss=0.968]


End of 42, accuracy 0.6535432934761047


Epoch 43: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 260.30batch/s, acc=0.3, loss=0.967]


End of 43, accuracy 0.6535432934761047


Epoch 44: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 264.88batch/s, acc=0.3, loss=0.966]


End of 44, accuracy 0.6535432934761047


Epoch 45: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 258.53batch/s, acc=0.3, loss=0.966]


End of 45, accuracy 0.6614173054695129


Epoch 46: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 247.35batch/s, acc=0.3, loss=0.965]


End of 46, accuracy 0.6732283234596252


Epoch 47: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 238.20batch/s, acc=0.4, loss=0.963]


End of 47, accuracy 0.6692913174629211


Epoch 48: 100%|████████████████████████████████████████████████████| 51/51 [00:00<00:00, 248.54batch/s, acc=0.4, loss=0.961]


End of 48, accuracy 0.6692913174629211


Epoch 49: 100%|█████████████████████████████████████████████████████| 51/51 [00:00<00:00, 233.31batch/s, acc=0.4, loss=0.96]


End of 49, accuracy 0.6732283234596252
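
The end-of-epoch evaluation above calls the model directly on X_test. A common refinement, sketched here as a suggestion rather than as part of the original code, is to switch the model to evaluation mode and disable gradient tracking while scoring. The helper evaluate_accuracy and the small stand-in model below are hypothetical.

```python
import torch
import torch.nn as nn


def evaluate_accuracy(model, X, y):
    """Return the fraction of samples in X whose rounded prediction equals y."""
    model.eval()               # put layers such as dropout into inference behavior
    with torch.no_grad():      # gradients are not needed for evaluation
        y_pred = model(X)
    model.train()              # switch back to training mode for the next epoch
    return (y_pred.round() == y).float().mean().item()


# Hypothetical stand-in model and random data with 8 input features
torch.manual_seed(0)
demo_model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())
X_demo = torch.randn(20, 8)
y_demo = torch.randint(0, 2, (20, 1)).float()
demo_acc = evaluate_accuracy(demo_model, X_demo, y_demo)
print(f"accuracy on random data: {demo_acc:.2f}")
```

For the network in this tutorial, which has no dropout or batch normalization layers, the accuracy is the same either way. The helper mainly saves memory and guards against surprises if such layers are added later.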


## k-Fold Cross Validation

In the above example, you calculated the accuracy from the test set. 
It is used as a score for the model as training progresses. 
You want to stop at the point where this score is at its maximum. 
By merely comparing the scores from this one test set, you might conclude that the model works best at the epoch where the score peaks and starts to overfit afterward. 
Is that right?

If you built two models of different designs, should you just compare their accuracy on the same test set and claim that one is better than the other?

Actually, you can argue that the test set is not representative enough, even if you shuffled your dataset before extracting it. 
You may also argue that, by chance, one model fits this particular test set better but is not always better. 
To make a stronger argument about which model is better, independent of the selection of the test set, you can try multiple test sets and average the accuracy.

This is what k-fold cross validation does.
It is a process for deciding which design works better.
It works by repeating the training process from scratch $k$ times, each time with a different composition of the training and test sets.
Because of that, you will have $k$ models and $k$ accuracy scores from their respective test sets.
You are interested not only in the average accuracy but also in the standard deviation.
The standard deviation tells you whether the accuracy score is consistent or whether some test sets are particularly good or bad for a model.
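To make the idea of "a different composition of the training and test sets" concrete, here is a minimal sketch that builds the folds by hand with NumPy. The toy dataset size, the seed, and the variable names are assumptions made only for illustration; they are not part of the tutorial's model code.

```python
import numpy as np

n_samples = 10   # toy dataset size, chosen only for illustration
k = 5            # number of folds

rng = np.random.default_rng(0)
indices = rng.permutation(n_samples)   # shuffle the sample indices once
folds = np.array_split(indices, k)     # k portions of (nearly) equal size

splits = []
for i in range(k):
    # fold i is the test set; the other k-1 folds form the training set
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, test_idx))
    print(f"Fold {i}: train={sorted(train_idx.tolist())} test={sorted(test_idx.tolist())}")
```

Every sample lands in the test set exactly once across the $k$ iterations, so each accuracy score is measured on data that the corresponding model never saw during training.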

Since k-fold cross validation trains the model from scratch several times, it is best to wrap the training loop in a function:

In [15]:
def model_train(X_train, y_train, X_test, y_test):
    # create new model
    model = nn.Sequential(
        nn.Linear(8, 12),
        nn.ReLU(),
        nn.Linear(12, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )
 
    # loss function and optimizer
    loss_fn = nn.BCELoss()  # binary cross entropy
    optimizer = optim.Adam(model.parameters(), lr=0.0001)
 
    n_epochs = 25    # number of epochs to run
    batch_size = 10  # size of each batch
    batches_per_epoch = len(X_train) // batch_size
 
    for epoch in range(n_epochs):
        with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0, disable=True) as bar:
            bar.set_description(f"Epoch {epoch}")
            for i in bar:
                # take a batch
                start = i * batch_size
                X_batch = X_train[start:start+batch_size]
                y_batch = y_train[start:start+batch_size]
                # forward pass
                y_pred = model(X_batch)
                loss = loss_fn(y_pred, y_batch)
                # backward pass
                optimizer.zero_grad()
                loss.backward()
                # update weights
                optimizer.step()
                # print progress
                acc = (y_pred.round() == y_batch).float().mean()
                bar.set_postfix(
                    loss=float(loss),
                    acc=float(acc)
                )
    # evaluate accuracy at end of training (no gradients needed)
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    return float(acc)

The code above deliberately does not print anything (with disable=True in tqdm) to keep the screen less cluttered.

scikit-learn also provides a helper for k-fold cross validation.
You can use it to produce a robust estimate of model accuracy:

In [16]:
from sklearn.model_selection import StratifiedKFold
 
# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True)
cv_scores = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    acc = model_train(X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    cv_scores.append(acc)
# evaluate the model
print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores)*100, np.std(cv_scores)*100))

Accuracy: 0.65
Accuracy: 0.66
Accuracy: 0.68
Accuracy: 0.65
Accuracy: 0.65
65.62% (+/- 1.32%)


scikit-learn provides several k-fold cross validation classes, and the one used here is stratified k-fold.
It assumes y holds class labels and takes their values into account so that each split has a balanced class representation.
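The effect of stratification can be checked directly. The sketch below uses made-up toy labels (70 negatives and 30 positives, not the tutorial's dataset) and prints the fraction of positive samples in each test fold; the names `X_toy`, `y_toy`, and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels, made up for illustration: 70 negatives and 30 positives
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))   # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positive_ratios = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy)):
    ratio = y_toy[test_idx].mean()   # fraction of class 1 in this test fold
    positive_ratios.append(float(ratio))
    print(f"Fold {fold}: test size={len(test_idx)}, positive ratio={ratio:.2f}")
```

Each test fold keeps the same 30% share of positives as the full toy dataset, which is what you want when the classes are imbalanced.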

The code above used $k=5$, or 5 splits.
This means the dataset is split into 5 portions of nearly equal size; one of them is picked as the test set while the rest are combined into the training set.
There are 5 ways of doing that, so the for-loop above has 5 iterations.
In each iteration, you call the model_train() function and obtain the accuracy score in return.
You then save it into a list, which is used to calculate the mean and standard deviation at the end.

The kfold object returns indices rather than data.
Hence you do not need to run a train-test split in advance; instead, you use the indices provided to extract the training set and test set on the fly when you call the model_train() function.

The result above shows the model is moderately good, at about 66% average accuracy.
This score is also stable: the standard deviation is only about 1.3%, which means that most of the time you can expect the model accuracy to fall between roughly 64% and 67%.
You may try to change the model above, such as adding or removing a layer, and see how much the mean and standard deviation change.
You may also try to increase the number of epochs in training and observe the result.
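When comparing two designs this way, it may help to evaluate both on exactly the same folds so that the difference in scores is not caused by a different shuffle. A minimal sketch, assuming a fixed `random_state` (which the tutorial's code does not set) and a hypothetical helper named `make_splits()` on made-up toy data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy data, made up for illustration
y_toy = np.array([0, 1] * 25)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

def make_splits(seed):
    # a fixed random_state makes the shuffled folds repeatable
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return [(train.tolist(), test.tolist()) for train, test in kfold.split(X_toy, y_toy)]

splits_a = make_splits(seed=42)   # folds you would use for design A
splits_b = make_splits(seed=42)   # folds you would use for design B
print("Same folds for both designs:", splits_a == splits_b)
```

The seed value 42 is arbitrary. Note that this only fixes the data splits; the random initialization of the model weights is a separate source of variation.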

The mean and standard deviation from k-fold cross validation are what you should use to benchmark a model design.

Tying it all together, below is the complete code for k-fold cross validation:

In [17]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm
from sklearn.model_selection import StratifiedKFold
 
data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = data[:, 0:8]
y = data[:, 8]
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
 
def model_train(X_train, y_train, X_test, y_test):
    # create new model
    model = nn.Sequential(
        nn.Linear(8, 12),
        nn.ReLU(),
        nn.Linear(12, 8),
        nn.ReLU(),
        nn.Linear(8, 1),
        nn.Sigmoid()
    )
 
    # loss function and optimizer
    loss_fn = nn.BCELoss()  # binary cross entropy
    optimizer = optim.Adam(model.parameters(), lr=0.0001)
 
    n_epochs = 25    # number of epochs to run
    batch_size = 10  # size of each batch
    batches_per_epoch = len(X_train) // batch_size
 
    for epoch in range(n_epochs):
        with tqdm.trange(batches_per_epoch, unit="batch", mininterval=0, disable=True) as bar:
            bar.set_description(f"Epoch {epoch}")
            for i in bar:
                # take a batch
                start = i * batch_size
                X_batch = X_train[start:start+batch_size]
                y_batch = y_train[start:start+batch_size]
                # forward pass
                y_pred = model(X_batch)
                loss = loss_fn(y_pred, y_batch)
                # backward pass
                optimizer.zero_grad()
                loss.backward()
                # update weights
                optimizer.step()
                # print progress
                acc = (y_pred.round() == y_batch).float().mean()
                bar.set_postfix(
                    loss=float(loss),
                    acc=float(acc)
                )
    # evaluate accuracy at end of training (no gradients needed)
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
    acc = (y_pred.round() == y_test).float().mean()
    return float(acc)
 
# define 5-fold cross validation test harness
kfold = StratifiedKFold(n_splits=5, shuffle=True)
cv_scores = []
for train, test in kfold.split(X, y):
    # create model, train, and get accuracy
    acc = model_train(X[train], y[train], X[test], y[test])
    print("Accuracy: %.2f" % acc)
    cv_scores.append(acc)
# evaluate the model
print("%.2f%% (+/- %.2f%%)" % (np.mean(cv_scores)*100, np.std(cv_scores)*100))

Accuracy: 0.67
Accuracy: 0.73
Accuracy: 0.61
Accuracy: 0.67
Accuracy: 0.65
66.67% (+/- 3.76%)
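
The two runs shown in this section use the same model design but different random splits, so they give slightly different numbers. The sketch below summarizes both runs from the per-fold accuracies as printed; because those printed values are rounded to two decimals, the results will differ slightly from the summary lines above. The helper name `summarize()` is an assumption for illustration.

```python
import numpy as np

# per-fold accuracies as printed (rounded to two decimals) in the two runs above
run_1 = [0.65, 0.66, 0.68, 0.65, 0.65]
run_2 = [0.67, 0.73, 0.61, 0.67, 0.65]

def summarize(scores):
    # mean and standard deviation, both in percent
    return np.mean(scores) * 100, np.std(scores) * 100

mean_1, std_1 = summarize(run_1)
mean_2, std_2 = summarize(run_2)
print("Run 1: %.2f%% (+/- %.2f%%)" % (mean_1, std_1))
print("Run 2: %.2f%% (+/- %.2f%%)" % (mean_2, std_2))

# difference between the two means, to compare against the spread
print("Difference in mean: %.2f%%" % abs(mean_1 - mean_2))
```

The gap between the two means is smaller than either standard deviation, which suggests that a difference of this size between two model designs should not be read as evidence that one is better.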


## Summary

In this tutorial, you discovered why it is important to have a robust way to estimate the performance of your deep learning models on unseen data, and you learned how to do that.
You saw:

- How to split data into training and test sets using scikit-learn
- How to do k-fold cross validation with the help of scikit-learn
- How to modify the training loop of a PyTorch model to incorporate test set validation and cross validation