# Keep The Best Models During Training With Checkpointing

Deep learning models can take hours, days or even weeks to train, and if a training run is stopped unexpectedly, you can lose a lot of work. In this lesson you will discover how to checkpoint your deep learning models during training in Python using the Keras library. After completing this lesson you will know:

1. The importance of checkpointing neural network models when training.
2. How to checkpoint each improvement to a model during training.
3. How to checkpoint the very best model observed during training.


# Checkpointing Neural Network Models
Application checkpointing is a fault-tolerance technique for long-running processes. A snapshot of the system's state is taken periodically, so that if the system fails, not all is lost: the checkpoint can be used directly, or as the starting point for a new run that picks up where the old one left off.
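The idea does not depend on Keras. As a minimal sketch in plain Python (the train function, its JSON state file and the simulated failure are all hypothetical), a job can write its state after every step and, when restarted, resume from the last snapshot instead of starting over:

```python
import json
import os
import tempfile


def train(total_steps, checkpoint_path, fail_at=None):
    """Toy long-running job that checkpoints its state after every step.

    fail_at (hypothetical) simulates a crash when that step is reached.
    Returns the final state dictionary.
    """
    # Resume from the last snapshot if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError(f"simulated failure at step {fail_at}")
        # One unit of "work": accumulate the step index.
        state["total"] += state["step"]
        state["step"] += 1
        # Write to a temporary file, then swap it in, so a crash during the
        # write leaves the previous snapshot intact.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state


path = os.path.join(tempfile.mkdtemp(), "job.ckpt.json")
try:
    train(10, path, fail_at=6)  # completes steps 0..5, then "crashes"
except RuntimeError:
    pass
final = train(10, path)  # resumes at step 6 rather than step 0
print(final)  # {'step': 10, 'total': 45}
```

The second call only performs steps 6 to 9, yet ends in the same state as an uninterrupted run. The os.replace swap is typically atomic when both paths are on the same filesystem, which is why the snapshot is never half-written.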

When training deep learning models, the checkpoint captures the weights of the model. These weights can be used to make predictions as-is, or as the basis for further training. The Keras library provides checkpointing through its callback API: the ModelCheckpoint callback class lets you define where to save the model weights, how the file should be named and under what circumstances a checkpoint should be made.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset, and whether an improvement means maximizing or minimizing that score. Finally, the filename used to store the weights can include variables such as the epoch number or the value of a metric.
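To make the "maximize or minimize" choice concrete, here is a minimal pure-Python sketch of a "save only on improvement" rule. It assumes a strict comparison (a tie is not an improvement); the function name and the score sequences are hypothetical, not part of Keras:

```python
import math
import operator


def epochs_to_save(scores, mode):
    """Return the 1-based epochs at which a 'save only on improvement' rule writes a file."""
    if mode == "max":
        best, is_better = -math.inf, operator.gt  # higher is better (e.g. accuracy)
    elif mode == "min":
        best, is_better = math.inf, operator.lt  # lower is better (e.g. loss)
    else:
        raise ValueError("mode must be 'max' or 'min'")
    saved = []
    for epoch, score in enumerate(scores, start=1):
        if is_better(score, best):  # strict: equal scores do not trigger a save
            best = score
            saved.append(epoch)
    return saved


# Made-up per-epoch scores, for illustration only.
val_accuracy = [0.70, 0.74, 0.72, 0.76, 0.76, 0.75, 0.78]
val_loss = [0.60, 0.55, 0.57, 0.52, 0.53, 0.51, 0.51]

print(epochs_to_save(val_accuracy, "max"))  # [1, 2, 4, 7]
print(epochs_to_save(val_loss, "min"))  # [1, 2, 4, 6]
```

Monitoring accuracy needs mode 'max' and monitoring loss needs mode 'min'; swapping them would save on every regression instead. Keras also accepts mode='auto', which infers the direction from the metric name; that case is left out of this sketch.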

The ModelCheckpoint instance can then be passed to the training process when calling the fit() function on the model. Note that saving weights in HDF5 format requires the h5py library, which you may need to install.
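If you are unsure whether h5py is available in your environment, a quick check using only the Python standard library is:

```python
import importlib.util

# True if the h5py package can be imported in this environment.
# This only checks availability; it does not import or test h5py itself.
has_h5py = importlib.util.find_spec("h5py") is not None
if not has_h5py:
    print("h5py not found; install it, for example with: pip install h5py")
```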

**Checkpoint Neural Network Model Improvements**

A good use of checkpointing is to output the model weights each time an improvement is observed during training. The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem (see Section 7.2).

The example uses 33% of the data for validation. Checkpointing is set up to save the network weights only when classification accuracy on the validation dataset improves (monitor='val_accuracy' and mode='max'). The weights are stored in a file whose name includes the epoch number and the validation accuracy score.
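Keras documents that validation_split takes the validation data from the last samples in X and Y, before any shuffling. The sketch below reproduces that split with NumPy on random stand-in data; the helper name is hypothetical, and the 768-row shape is an assumption about the Pima CSV file rather than something stated in this lesson:

```python
import numpy as np


def keras_style_validation_split(X, Y, validation_split):
    """Hypothetical helper: hold out the LAST fraction of rows, without shuffling,
    following the behaviour documented for fit(validation_split=...)."""
    split_at = int(len(X) * (1.0 - validation_split))
    return (X[:split_at], Y[:split_at]), (X[split_at:], Y[split_at:])


# Stand-in data shaped like the Pima file (assumed 768 rows: 8 inputs + 1 label).
rng = np.random.default_rng(7)
data = rng.random((768, 9))
X, Y = data[:, 0:8], data[:, 8]

(X_train, Y_train), (X_val, Y_val) = keras_style_validation_split(X, Y, 0.33)
print(len(X_train), len(X_val))  # 514 254
```

If the file does have 768 rows, the validation set holds 254 rows, so validation accuracy moves in steps of 1/254 (about 0.0039). The best score of 0.78740 reported in the training log would then correspond to 200 of the 254 validation rows being classified correctly.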

```python
# Checkpoint the weights when validation accuracy improves
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
import matplotlib.pyplot as plt
import numpy as np
```

```python
# load the Pima Indians diabetes dataset (the CSV file must be in the working directory)
dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:, 0:8]
Y = dataset[:, 8]
```

```python
# fix random seed for reproducibility
# (this seeds NumPy only; TensorFlow's weight initialization uses its own random generator)
seed = 7
np.random.seed(seed)
```

```python
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
```
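Each checkpoint stores this network's weights. To get a sense of how small they are, the parameters can be counted by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. The total should match the parameter count that model.summary() reports for this model.

```python
# Layer widths of the model above: 8 inputs -> 12 -> 8 -> 1 output.
layer_sizes = [8, 12, 8, 1]

# Weights (n_in * n_out) plus biases (n_out) for each consecutive pair of layers.
params_per_layer = [n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(params_per_layer)  # [108, 104, 9]
print(sum(params_per_layer))  # 221
```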

```python
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

```python
# checkpoint: write a new file each time val_accuracy reaches a new maximum
filepath = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
```
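The filepath above is a Python format string. In effect, ModelCheckpoint fills it in with str.format, substituting the 1-based epoch number and the metrics logged at the end of that epoch. The snippet below shows the same expansion with made-up metric values (the logs dictionary is hypothetical):

```python
filepath = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Hypothetical end-of-epoch metrics; during training Keras supplies these.
logs = {"loss": 0.52, "accuracy": 0.75, "val_loss": 0.49, "val_accuracy": 0.7874}

# {epoch:02d} zero-pads to two digits; {val_accuracy:.2f} rounds to two decimals.
name = filepath.format(epoch=3, **logs)
print(name)  # weights-improvement-03-0.79.hdf5
```

Because the score is rounded to two decimals, several improvements can share the same score in their names; the epoch number keeps the files distinct.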

```python
# Fit the model, holding out the last 33% of the rows for validation
history = model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)
```


```text
Epoch 00001: val_accuracy did not improve from 0.78740
Epoch 00002: val_accuracy did not improve from 0.78740
Epoch 00003: val_accuracy did not improve from 0.78740
Epoch 00004: val_accuracy did not improve from 0.78740
Epoch 00005: val_accuracy did not improve from 0.78740
Epoch 00006: val_accuracy did not improve from 0.78740
Epoch 00007: val_accuracy did not improve from 0.78740
Epoch 00008: val_accuracy did not improve from 0.78740
Epoch 00009: val_accuracy did not improve from 0.78740
Epoch 00010: val_accuracy did not improve from 0.78740
Epoch 00011: val_accuracy did not improve from 0.78740
Epoch 00012: val_accuracy did not improve from 0.78740
Epoch 00013: val_accuracy did not improve from 0.78740
Epoch 00014: val_accuracy did not improve from 0.78740
Epoch 00015: val_accuracy did not improve from 0.78740
Epoch 00016: val_accuracy did not improve from 0.78740
Epoch 00017: val_accuracy did not improve from 0.78740
Epoch 00018: val_accuracy did not improve from
...
```

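Running the example produces output like that shown above, truncated for brevity. On a fresh run, each epoch in which validation accuracy improves is reported and a new weight file is written to disk. In the log shown here, however, even epoch 1 reports no improvement over 0.78740, which suggests the fit cell was re-run with a callback that had already recorded that best score during an earlier run.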

You will also see a number of files in your working directory containing the network weights in HDF5 format.

This is a very simple checkpointing strategy. Because the filename includes the epoch number and the score, each improvement is written to a new file rather than overwriting the previous one, so a run in which validation accuracy keeps creeping upward can leave many checkpoint files behind. Nevertheless, it ensures that you have a snapshot of the best model discovered during your run.
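Since every improvement leaves its own file, a small helper can pick out the best checkpoint after a run and, if you wish, delete the rest. This is a sketch assuming the "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5" naming used in this example; the helper names are hypothetical and not part of Keras. It compares the two-decimal scores parsed from the filenames, so ties fall back to the later epoch:

```python
import os
import re
import tempfile

# Matches names produced by "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5".
PATTERN = re.compile(r"^weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def best_checkpoint(directory):
    """Return the matching filename with the highest score (later epoch wins ties), or None."""
    candidates = []
    for name in os.listdir(directory):
        match = PATTERN.match(name)
        if match:
            epoch, score = int(match.group(1)), float(match.group(2))
            candidates.append((score, epoch, name))
    return max(candidates)[2] if candidates else None


def prune_checkpoints(directory, dry_run=True):
    """Delete every matching checkpoint except the best; returns the names (to be) removed."""
    keep = best_checkpoint(directory)
    removed = []
    for name in sorted(os.listdir(directory)):
        if PATTERN.match(name) and name != keep:
            if not dry_run:
                os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed


# Demo with empty placeholder files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for fname in ["weights-improvement-01-0.70.hdf5", "weights-improvement-04-0.76.hdf5",
              "weights-improvement-07-0.79.hdf5", "weights-improvement-12-0.79.hdf5",
              "notes.txt"]:
    open(os.path.join(demo_dir, fname), "w").close()

best = best_checkpoint(demo_dir)
print(best)  # weights-improvement-12-0.79.hdf5
removed = prune_checkpoints(demo_dir, dry_run=False)
print(sorted(os.listdir(demo_dir)))  # ['notes.txt', 'weights-improvement-12-0.79.hdf5']
```

The default dry_run=True only reports what would be deleted, which is a safer first step when pointing the helper at a real working directory. Files that do not match the naming pattern, such as notes.txt, are never touched.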