Model progress can be saved during and after training. A model can then resume where it left off instead of being retrained from scratch, and a saved model can be shared so that others can recreate your work. When publishing research models and techniques, most machine learning practitioners share:
- code to create the model, and
- the trained weights, or parameters, for the model


Summary:
- save weights during training: `tf.keras.callbacks.ModelCheckpoint` (with some options)
- save entire model at the end: `model.save` (SavedModel or HDF5 format)


## Options

There are different ways to save TensorFlow models, depending on the API you're using. This guide uses `tf.keras`, a high-level API to build and train models in TensorFlow. For other approaches, see the TensorFlow **Save and Restore** guide or **Saving in eager**.

## Setup


In [2]:
import os

import tensorflow as tf
from tensorflow import keras

print(tf.version.VERSION)

2.2.0


### Get an example dataset

In [3]:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

print(f"Train images shape: {train_images.shape}")
print(f"Train labels shape: {train_labels.shape}")

train_labels = train_labels[:1000]
test_labels  = test_labels[:1000]

train_images = train_images[:1000].reshape(-1, 28*28) / 255.0
test_images = test_images[:1000].reshape(-1, 28*28) / 255.0

print(f"Train images shape: {train_images.shape}")

Train images shape: (60000, 28, 28)
Train labels shape: (60000,)
Train images shape: (1000, 784)


### Define a model

In [4]:
def create_model():
    model = tf.keras.Sequential([
        keras.layers.Dense(512, activation='relu', input_shape=(784,)),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10)
    ])
    
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    
    return model

In [5]:
model = create_model()
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               401920    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________


## Save checkpoint during training

We can use a **trained model** without having to retrain it, or pick up training where we left off if the training process was interrupted. The `tf.keras.callbacks.ModelCheckpoint` callback lets us continually **save** the model both **during** and at **the end** of training.

### Checkpoint callback usage

Here, the callback saves only the weights, and only *during* training (once at the end of each epoch):

In [6]:
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

In [7]:
# Train the model with the callback created above
model.fit(train_images,
          train_labels,
          epochs=10,
          validation_data=(test_images, test_labels),
          callbacks=[cp_callback]
         )

Epoch 1/10
Epoch 00001: saving model to training_1/cp.ckpt
Epoch 2/10
Epoch 00002: saving model to training_1/cp.ckpt
Epoch 3/10
Epoch 00003: saving model to training_1/cp.ckpt
Epoch 4/10
Epoch 00004: saving model to training_1/cp.ckpt
Epoch 5/10
Epoch 00005: saving model to training_1/cp.ckpt
Epoch 6/10
Epoch 00006: saving model to training_1/cp.ckpt
Epoch 7/10
Epoch 00007: saving model to training_1/cp.ckpt
Epoch 8/10
Epoch 00008: saving model to training_1/cp.ckpt
Epoch 9/10
Epoch 00009: saving model to training_1/cp.ckpt
Epoch 10/10
Epoch 00010: saving model to training_1/cp.ckpt


<tensorflow.python.keras.callbacks.History at 0x7f84791274a8>

In [9]:
!ls {checkpoint_dir}

checkpoint  cp.ckpt.data-00000-of-00001  cp.ckpt.index


Create a **new untrained model**. When restoring a model from weights only, you must have a model with the **same architecture as the original model**. Because the architecture is the same, the weights can be shared even though this is a different *instance* of the model.

In [13]:
model = create_model()

loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print(f"Untrained model accuracy: {(100*acc):5.2f}%")

32/32 - 0s - loss: 2.4502 - accuracy: 0.1630
Untrained model accuracy: 16.30%


Load the weights from the checkpoint and re-evaluate

In [14]:
# Loads the weights
model.load_weights(checkpoint_path)

# Re-evaluate the model
loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print(f"Restored model accuracy: {(100*acc):5.2f}%")

32/32 - 0s - loss: 0.4255 - accuracy: 0.8600
Restored model accuracy: 86.00%


### Checkpoint callback options

The callback provides several options to give checkpoints unique names and to adjust the checkpointing frequency.
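The unique names come from Python `str.format` placeholders in `filepath`: `ModelCheckpoint` fills in `{epoch}` with the current epoch number each time it writes a checkpoint. The sketch below expands such a template by hand, without TensorFlow, to show the resulting names. The template string and the epoch numbers are chosen only for illustration.

```python
# Expand a checkpoint filename template by hand (illustrative; no TensorFlow needed).
template = "training_2/cp-{epoch:04d}.ckpt"

# {epoch:04d} formats the epoch as a zero-padded, 4-digit integer.
names = [template.format(epoch=e) for e in (0, 5, 50)]

for name in names:
    print(name)
```

Zero-padding keeps the checkpoint files in epoch order when they are sorted by name.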

Train a new model, and save uniquely named checkpoints once every five epochs.

In [15]:
checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights every 5 epochs
# (`period` is deprecated in TensorFlow 2.x in favor of `save_freq`)
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=True,
    verbose=1,
    period=5
)

model = create_model()

model.save_weights(checkpoint_path.format(epoch=0))

model.fit(train_images,
          train_labels,
          epochs=50,
          callbacks=[cp_callback],
          validation_data = (test_images, test_labels),
          verbose=0
         )


Epoch 00005: saving model to training_2/cp-0005.ckpt

Epoch 00010: saving model to training_2/cp-0010.ckpt

Epoch 00015: saving model to training_2/cp-0015.ckpt

Epoch 00020: saving model to training_2/cp-0020.ckpt

Epoch 00025: saving model to training_2/cp-0025.ckpt

Epoch 00030: saving model to training_2/cp-0030.ckpt

Epoch 00035: saving model to training_2/cp-0035.ckpt

Epoch 00040: saving model to training_2/cp-0040.ckpt

Epoch 00045: saving model to training_2/cp-0045.ckpt

Epoch 00050: saving model to training_2/cp-0050.ckpt


<tensorflow.python.keras.callbacks.History at 0x7f845431e518>

In [17]:
!ls {checkpoint_dir}

checkpoint			  cp-0025.ckpt.index
cp-0000.ckpt.data-00000-of-00001  cp-0030.ckpt.data-00000-of-00001
cp-0000.ckpt.index		  cp-0030.ckpt.index
cp-0005.ckpt.data-00000-of-00001  cp-0035.ckpt.data-00000-of-00001
cp-0005.ckpt.index		  cp-0035.ckpt.index
cp-0010.ckpt.data-00000-of-00001  cp-0040.ckpt.data-00000-of-00001
cp-0010.ckpt.index		  cp-0040.ckpt.index
cp-0015.ckpt.data-00000-of-00001  cp-0045.ckpt.data-00000-of-00001
cp-0015.ckpt.index		  cp-0045.ckpt.index
cp-0020.ckpt.data-00000-of-00001  cp-0050.ckpt.data-00000-of-00001
cp-0020.ckpt.index		  cp-0050.ckpt.index
cp-0025.ckpt.data-00000-of-00001


In [18]:
latest = tf.train.latest_checkpoint(checkpoint_dir)
latest

'training_2/cp-0050.ckpt'

To test, reset the model and load the latest checkpoint

In [19]:
model = create_model()

model.load_weights(latest)

loss, acc = model.evaluate(test_images, test_labels, verbose=2)
print(f"Restored model accuracy: {(100*acc):5.2f}%")

32/32 - 0s - loss: 0.4843 - accuracy: 0.8790
Restored model accuracy: 87.90%
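To continue training from a restored checkpoint rather than only evaluate it, `model.fit` accepts an `initial_epoch` argument. One possible approach, sketched below, recovers the epoch number from the checkpoint filename. The helper `epoch_from_checkpoint` is hypothetical and assumes the `cp-{epoch:04d}.ckpt` naming used above.

```python
import re

def epoch_from_checkpoint(path):
    """Return the epoch number encoded in a 'cp-XXXX.ckpt' checkpoint path.

    Hypothetical helper: assumes the filename pattern 'cp-{epoch:04d}.ckpt'.
    """
    match = re.search(r"cp-(\d+)\.ckpt$", path)
    if match is None:
        raise ValueError(f"no epoch number found in {path!r}")
    return int(match.group(1))

last_epoch = epoch_from_checkpoint("training_2/cp-0050.ckpt")
print(last_epoch)

# Possible use (not run here): train for 10 more epochs after loading the weights.
# model.fit(train_images, train_labels,
#           initial_epoch=last_epoch, epochs=last_epoch + 10)
```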


## What are these files?

The checkpoint files contain the trained weights in binary format:
- One or more shards that contain your model's weights
- An index file that indicates which weights are stored in which shard

If you are only training a model on a single machine, you'll have one shard with the suffix: `.data-00000-of-00001`
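To look at the shard and index files and the tensors they hold, one option is `tf.train.list_variables`. The sketch below is a stand-in under stated assumptions: it writes a tiny checkpoint with `tf.train.Checkpoint` instead of `model.save_weights` (assuming both produce the same TensorFlow checkpoint layout), and the variable name `demo` and the temporary directory are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write a small checkpoint containing a single variable (an illustrative
# stand-in for the model weights saved earlier).
ckpt_dir = tempfile.mkdtemp()
prefix = os.path.join(ckpt_dir, "demo.ckpt")
checkpoint = tf.train.Checkpoint(demo=tf.Variable([1.0, 2.0, 3.0]))
checkpoint.write(prefix)

# On a single machine this yields one index file and one data shard.
files = sorted(os.listdir(ckpt_dir))
print(files)

# List the (name, shape) pairs recorded in the checkpoint.
variables = tf.train.list_variables(prefix)
for name, shape in variables:
    print(name, shape)
```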

## Manually save weights

In [20]:
# Save the weights
model.save_weights('./checkpoints/my_checkpoint')

# Create a new model instance
model = create_model()

# Restore weights
model.load_weights('./checkpoints/my_checkpoint')

# Evaluate the model
loss,acc = model.evaluate(test_images,  test_labels, verbose=2)
print("Restored model, accuracy: {:5.2f}%".format(100*acc))

32/32 - 0s - loss: 0.4843 - accuracy: 0.8790
Restored model, accuracy: 87.90%


## Save the entire model

Call `model.save` to save a model's architecture, weights, and training configuration in a single file or folder. This allows us to export a model so it can be used without access to the original Python code. Since the optimizer state is recovered, we can resume training from exactly where we left off.

Saving a fully functional model is very useful: we can load it in TensorFlow.js and then train and run it in web browsers, or convert it to run on mobile devices using TensorFlow Lite.

### SavedModel format

The `SavedModel` format is one way to serialize models. Models saved in this format can be restored using `tf.keras.models.load_model` and are compatible with TensorFlow Serving.

In [21]:
model = create_model()
model.fit(train_images, train_labels, epochs=5)

!mkdir -p saved_model
model.save('saved_model/my_model')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: saved_model/my_model/assets


In [22]:
!ls saved_model

my_model


In [23]:
!ls saved_model/my_model

assets	saved_model.pb	variables


In [24]:
# Reload
new_model = tf.keras.models.load_model('saved_model/my_model')

new_model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_16 (Dense)             (None, 512)               401920    
_________________________________________________________________
dropout_8 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_17 (Dense)             (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________


In [25]:
# Evaluate the restored model
loss, acc = new_model.evaluate(test_images,  test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100*acc))

print(new_model.predict(test_images).shape)

32/32 - 0s - loss: 0.4280 - accuracy: 0.8580
Restored model, accuracy: 85.80%
(1000, 10)


### HDF5 format

Keras saves models by inspecting the architecture. This technique saves everything:

- The weight values
- The model's architecture
- The model's training configuration (what you passed to `compile`)
- The optimizer and its state, if any (this enables you to restart training where you left off)

In [26]:
model = create_model()
model.fit(train_images, train_labels, epochs=5)

# Save the entire model to an HDF5 file.
# The '.h5' extension indicates that the model should be saved to HDF5.
model.save('my_model.h5')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [27]:
new_model = tf.keras.models.load_model('my_model.h5')
new_model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_18 (Dense)             (None, 512)               401920    
_________________________________________________________________
dropout_9 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_19 (Dense)             (None, 10)                5130      
Total params: 407,050
Trainable params: 407,050
Non-trainable params: 0
_________________________________________________________________


In [28]:
loss, acc = new_model.evaluate(test_images,  test_labels, verbose=2)
print('Restored model, accuracy: {:5.2f}%'.format(100*acc))

32/32 - 0s - loss: 0.4108 - accuracy: 0.8690
Restored model, accuracy: 86.90%


### Saving custom objects

The key difference between HDF5 and SavedModel is that **HDF5** uses **object configs** to save the model architecture, while **SavedModel** saves the **execution graph**. Thus, a **SavedModel** can save **custom objects like subclassed models and custom layers** without requiring the original code.
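The point that a SavedModel carries its execution graph can be illustrated with the lower-level `tf.saved_model` API. This is only a sketch and not the `model.save` path used in this guide: the `Doubler` module, its `double` method, and the temporary export directory are hypothetical names introduced for the example.

```python
import tempfile

import tensorflow as tf

class Doubler(tf.Module):
    """Hypothetical custom object with one traced function."""

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
    def double(self, x):
        return 2.0 * x

export_dir = tempfile.mkdtemp()
tf.saved_model.save(Doubler(), export_dir)

# The restored object is rebuilt from the saved graph, not from the Doubler class.
restored = tf.saved_model.load(export_dir)
result = restored.double(tf.constant([1.0, 2.0, 3.0])).numpy()
print(type(restored).__name__, result)
```

The restored object exposes `double` even though loading never touched the Python class definition, which is what lets a SavedModel be used without the original code.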


To **save custom objects to HDF5**, you must do the following:

- Define a `get_config` method in your object, and optionally a `from_config` classmethod.
    - `get_config(self)` returns a JSON-serializable dictionary of parameters needed to recreate the object.
    - `from_config(cls, config)` uses the returned config from `get_config` to create a new object. By default, this function will use the config as initialization kwargs (`return cls(**config)`).
- Pass the object to the `custom_objects` argument when loading the model. The argument must be a dictionary mapping the string class name to the Python class. E.g. `tf.keras.models.load_model(path, custom_objects={'CustomLayer': CustomLayer})`
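The two steps above might look like the following minimal sketch. `ScaleLayer`, its `factor` parameter, and the file name `custom_model.h5` are hypothetical and exist only for this example; the model is left uncompiled to keep the sketch short.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

class ScaleLayer(tf.keras.layers.Layer):
    """Hypothetical custom layer that multiplies its input by a constant."""

    def __init__(self, factor=2.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, inputs):
        return inputs * self.factor

    def get_config(self):
        # Start from the base config (name, dtype, ...) and add our parameter.
        config = super().get_config()
        config.update({"factor": self.factor})
        return config

inputs = tf.keras.Input(shape=(3,))
outputs = ScaleLayer(factor=3.0)(tf.keras.layers.Dense(4)(inputs))
model = tf.keras.Model(inputs, outputs)

path = os.path.join(tempfile.mkdtemp(), "custom_model.h5")
model.save(path)

# The default from_config calls ScaleLayer(**config), so no override is needed.
restored = tf.keras.models.load_model(
    path, custom_objects={"ScaleLayer": ScaleLayer})

x = np.ones((2, 3), dtype="float32")
same = np.allclose(model.predict(x), restored.predict(x))
print(restored.layers[-1].factor, same)
```

Because `get_config` returns every constructor argument, the default `from_config` is enough here; a custom `from_config` would only be needed if the config could not be passed straight to `__init__`.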

## References

- https://www.tensorflow.org/tutorials/keras/save_and_load