# Keep The Best Models During Training With Checkpointing
Deep learning models can take hours, days, or even weeks to train, and if a training run stops unexpectedly you can lose a lot of work. This tutorial shows how to checkpoint deep learning models during training with the Keras library, using two strategies:
- checkpoint each improvement
- checkpoint only the very best model

Application checkpointing is a fault-tolerance technique for long-running processes. A snapshot of the system's state is taken so that, if the system fails, not all is lost. The checkpoint may be used directly, or as the starting point for a new run that picks up where the previous one left off. When training deep learning models, the checkpoint captures the weights of the model. These weights can be used to make predictions as-is, or as the basis for ongoing training.
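The general idea can be sketched in plain Python, independent of Keras. This is only an illustration: the `run` function, the JSON state file, and the simulated crash are hypothetical and chosen just for this example.

```python
import json
import os
import tempfile

def run(total_steps, checkpoint_path, fail_at=None):
    """Sum the integers 0..total_steps-1, snapshotting progress after each step."""
    # Resume from the snapshot if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        # Snapshot the state so a later run can pick up from here.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state["total"]

checkpoint_path = os.path.join(tempfile.mkdtemp(), "state.json")

try:
    run(10, checkpoint_path, fail_at=6)   # the first run crashes part-way through
except RuntimeError:
    pass

resumed_total = run(10, checkpoint_path)  # the second run resumes from step 6
print(resumed_total)  # 45
```

The second call does not repeat the first six steps; it reads the snapshot and continues, which is the same role the saved weights play when training is resumed.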

The Keras library provides a checkpointing capability through its callback **API**. The **ModelCheckpoint** callback class allows you to define where to checkpoint the model **weights**, how the file should be **named**, and under what **circumstances** to make a checkpoint of the model.

The **API** allows you to specify which **metric** to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether an improvement means maximizing or minimizing that score. Finally, the **filename** that you use to store the weights can include variables like the **epoch** number or **metric**. The ModelCheckpoint instance is then passed to the training process when calling the **`fit()`** function on the model. Note that you may need to install the **h5py** library to write weights in HDF5 format.
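The effect of maximizing versus minimizing can be illustrated without Keras. The following is a simplified sketch of the bookkeeping a save-best-only checkpoint performs, not the library's actual implementation; the function name `epochs_that_checkpoint` and the score lists are made up for illustration.

```python
def epochs_that_checkpoint(scores, mode="max"):
    """Return the 1-based epochs at which a save-best-only checkpoint would fire."""
    if mode == "max":
        best = float("-inf")
        improved = lambda current, best_so_far: current > best_so_far
    elif mode == "min":
        best = float("inf")
        improved = lambda current, best_so_far: current < best_so_far
    else:
        raise ValueError("mode must be 'max' or 'min'")

    saved = []
    for epoch, score in enumerate(scores, start=1):
        # Only a strict improvement over the best score so far triggers a save.
        if improved(score, best):
            best = score
            saved.append(epoch)
    return saved

# Illustrative (made-up) per-epoch scores.
val_acc = [0.67, 0.66, 0.65, 0.68, 0.68, 0.70]
val_loss = [0.70, 0.65, 0.66, 0.60, 0.61, 0.59]

print(epochs_that_checkpoint(val_acc, mode="max"))   # [1, 4, 6]
print(epochs_that_checkpoint(val_loss, mode="min"))  # [1, 2, 4, 6]
```

Accuracy is a score to maximize, while loss is a score to minimize, so the mode must match the monitored metric.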

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

from keras.callbacks import ModelCheckpoint

np.random.seed(47)
df = pd.read_csv('pima-indians-diabetes.csv', header=None)
data = df.values
X = data[:,0:8] 
y = data[:,8]
df.head(2)

Using TensorFlow backend.


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |


In [2]:
model = Sequential() 
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu')) 
model.add(Dense(8, kernel_initializer='uniform', activation='relu')) 
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid')) 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### Checkpoint the Model Improvements
A good use of checkpointing is to **output the model weights each time an improvement is observed** during training. 

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example uses 33% of the data for validation.
- Checkpointing is set up to save the network weights only **when there is an improvement in classification accuracy** on the validation dataset **(monitor='val_acc', mode='max')**.
- The weights are stored in a file that includes the epoch and the score in the filename: **weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5**.
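The placeholders in the filename are ordinary Python format fields. As a rough illustration using plain `str.format` (the epoch and score values are copied from the fourth epoch of the training log further down), the template expands like this:

```python
template = "weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"

# {epoch:02d} zero-pads the epoch number to two digits;
# {val_acc:.2f} rounds the score to two decimal places.
filename = template.format(epoch=4, val_acc=0.68110)
print(filename)  # weights-improvement-04-0.68.hdf5
```

Because the score is rounded to two decimals, two different epochs can produce filenames with the same score part; the epoch number keeps them distinct.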

In [3]:
# Set up checkpoint

my_filepath="weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5" 

checkpoint = ModelCheckpoint(my_filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max') 

callbacks_list = [checkpoint] 

In [4]:
# then fit it!
model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)


Epoch 00001: val_acc improved from -inf to 0.67323, saving model to weights-improvement-01-0.67.hdf5

Epoch 00002: val_acc did not improve

Epoch 00003: val_acc did not improve

Epoch 00004: val_acc improved from 0.67323 to 0.68110, saving model to weights-improvement-04-0.68.hdf5

Epoch 00005: val_acc did not improve

Epoch 00006: val_acc improved from 0.68110 to 0.68898, saving model to weights-improvement-06-0.69.hdf5

Epoch 00007: val_acc improved from 0.68898 to 0.68898, saving model to weights-improvement-07-0.69.hdf5

Epoch 00008: val_acc improved from 0.68898 to 0.70079, saving model to weights-improvement-08-0.70.hdf5

Epoch 00009: val_acc did not improve

Epoch 00010: val_acc did not improve

Epoch 00011: val_acc did not improve

Epoch 00012: val_acc did not improve

Epoch 00013: val_acc did not improve

Epoch 00014: val_acc did not improve

Epoch 00015: val_acc did not improve

Epoch 00016: val_acc improved from 0.70079 to 0.70079, saving model to weights-improvement-16-0.7

<keras.callbacks.History at 0x2871332a518>

In the output you can see that each improvement in model accuracy on the validation dataset resulted in a new weight file being written to disk. This strategy may create a lot of unnecessary checkpoint files if the validation accuracy moves up and down over training epochs. Nevertheless, it ensures that you have a snapshot of the best model discovered during your run.
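When many checkpoint files accumulate, the one with the highest score can be picked out from the filenames. The sketch below assumes the `weights-improvement-{epoch}-{score}.hdf5` naming used in this section; the helper `best_checkpoint` is a hypothetical name, and the list of filenames is copied from the log above rather than read from disk.

```python
import re

# Matches names such as weights-improvement-08-0.70.hdf5
PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the filename with the highest score, preferring later epochs on ties."""
    candidates = []
    for name in filenames:
        match = PATTERN.search(name)
        if match:
            epoch, score = int(match.group(1)), float(match.group(2))
            candidates.append((score, epoch, name))
    if not candidates:
        return None
    # Tuples compare by score first, then by epoch.
    return max(candidates)[2]

files = [
    "weights-improvement-01-0.67.hdf5",
    "weights-improvement-04-0.68.hdf5",
    "weights-improvement-06-0.69.hdf5",
    "weights-improvement-07-0.69.hdf5",
    "weights-improvement-08-0.70.hdf5",
]
print(best_checkpoint(files))  # weights-improvement-08-0.70.hdf5
```

In practice the list could come from something like `glob.glob("weights-improvement-*.hdf5")`. Preferring the later epoch when scores tie is an arbitrary choice made for this sketch.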

### Checkpoint Best Model Only
A simpler checkpoint strategy is to save the model weights to the same file, if and only if the validation accuracy improves. This can be done easily using the same code from above and changing the output filename to be fixed (not including score or epoch information). In this case, model weights are written to the file **weights.best.hdf5** only if the classification accuracy of the model on the validation dataset improves over the best seen so far.

In [5]:
# Set up checkpoint

my_filepath = 'weights.best.hdf5'

checkpoint = ModelCheckpoint(my_filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

callbacks_list = [checkpoint]

In [6]:
# then fit it!
model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)


Epoch 00001: val_acc improved from -inf to 0.77953, saving model to weights.best.hdf5

Epoch 00002: val_acc did not improve

Epoch 00003: val_acc did not improve

Epoch 00004: val_acc did not improve

Epoch 00005: val_acc did not improve

Epoch 00006: val_acc did not improve

Epoch 00007: val_acc did not improve

Epoch 00008: val_acc did not improve

Epoch 00009: val_acc did not improve

Epoch 00010: val_acc did not improve

Epoch 00011: val_acc did not improve

Epoch 00012: val_acc improved from 0.77953 to 0.79528, saving model to weights.best.hdf5

Epoch 00013: val_acc did not improve

Epoch 00014: val_acc did not improve

Epoch 00015: val_acc did not improve

Epoch 00016: val_acc did not improve

Epoch 00017: val_acc did not improve

Epoch 00018: val_acc did not improve

Epoch 00019: val_acc did not improve

Epoch 00020: val_acc did not improve

Epoch 00021: val_acc did not improve

Epoch 00022: val_acc did not improve

Epoch 00023: val_acc did not improve

Epoch 00024: val_acc did

<keras.callbacks.History at 0x2871332afd0>

### Loading a Saved Model
Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model. 
- The checkpoint only includes the model **weights**. **It assumes you know the network structure.**
- The network structure can itself be serialized to a file in JSON or YAML format.
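As a minimal sketch of serializing the structure, the architecture can be written to a JSON file with `model.to_json()` and rebuilt with `model_from_json()`. The file name `model.json` and the temporary directory are assumptions made for this example, and the weights still have to be loaded separately.

```python
import os
import tempfile

from keras.models import Sequential, model_from_json
from keras.layers import Dense

# Same layer layout as the network in this tutorial (initializers omitted for brevity).
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Serialize the network structure (not the weights) to a JSON file.
json_path = os.path.join(tempfile.mkdtemp(), 'model.json')  # hypothetical location
with open(json_path, 'w') as f:
    f.write(model.to_json())

# Later: rebuild the structure from the file.
with open(json_path) as f:
    restored = model_from_json(f.read())

# The restored model has the same layers but freshly initialized weights.
# Load the checkpointed weights before using it, for example:
# restored.load_weights('weights.best.hdf5')
print(len(restored.layers))
```

With both the JSON file and the weight file on disk, the model can be reconstructed without repeating the layer definitions in code.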

In the example below, the model structure is known and the best weights are loaded from the previous experiment, stored in the working directory in the **weights.best.hdf5** file. The model is then used to make predictions on the entire dataset, and the resulting accuracy is reported.

In [7]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

np.random.seed(47)
df = pd.read_csv('pima-indians-diabetes.csv', header=None)
data = df.values
X = data[:,0:8] 
y = data[:,8]

model = Sequential() 
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu')) 
model.add(Dense(8, kernel_initializer='uniform', activation='relu')) 
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))


In [8]:
# Load the weights, then compile the model so it can be evaluated

model.load_weights('weights.best.hdf5')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Created model and loaded weights from file")

Created model and loaded weights from file


In [9]:
# estimate accuracy on whole dataset using loaded weights 
scores = model.evaluate(X, y, verbose=0) 
print("%s: %.2f" % (model.metrics_names[1], scores[1]))

acc: 0.78
