# Checkpoints

This document examines how to save and restore TensorFlow models built with Estimators. TensorFlow provides two model formats:

- checkpoints, which is a format dependent on the code that created the model.
- SavedModel, which is a format independent of the code that created the model.

This document focuses on checkpoints. For details on SavedModel, see the [Saving and Restoring](https://tensorflow.google.cn/programmers_guide/saved_model) chapter of the *TensorFlow Programmer's Guide*.

## Saving partially-trained models

Estimators automatically write the following to disk:

- checkpoints, which are versions of the model created during training.
- event files, which contain information that [TensorBoard](https://developers.google.cn/machine-learning/glossary/#TensorBoard) uses to create visualizations.

To specify the top-level directory in which the Estimator stores its information, pass a value to the optional `model_dir` argument of the Estimator's constructor. For example, the following code sets `model_dir` to the `models/iris` directory:

In [3]:
import iris_data
import tensorflow as tf

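# Load the Iris training and test sets as (features, labels) pairs.
(train_x, train_y), (test_x, test_y) = iris_data.load_data()

# Describe every input feature as a numeric column.
my_feature_columns = []
for key in train_x.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
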
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir='models/iris')

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'models/iris', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001AB820D8400>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


Suppose you call the Estimator's train method. For example:

In [5]:
classifier.train(
    input_fn=lambda:iris_data.train_input_fn(train_x, train_y, batch_size=100),
    steps=200)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into models/iris\model.ckpt.
INFO:tensorflow:loss = 235.59146, step = 1
INFO:tensorflow:global_step/sec: 490.445
INFO:tensorflow:loss = 70.60144, step = 101 (0.203 sec)
INFO:tensorflow:Saving checkpoints for 200 into models/iris\model.ckpt.
INFO:tensorflow:Loss for final step: 51.913696.


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1ab8203bdd8>

As suggested by the following diagram, the first call to `train` adds checkpoints and other files to the `model_dir` directory:

![](https://tensorflow.google.cn/images/first_train_calls.png)
<center>The first call to train().</center>

To see the objects in the created `model_dir` directory on a UNIX-based system, call `ls` as follows:

In [7]:
!bash -c "ls -1 models/iris"

checkpoint
events.out.tfevents.1529404546.USER-20150123VJ
graph.pbtxt
model.ckpt-1.data-00000-of-00001
model.ckpt-1.index
model.ckpt-1.meta
model.ckpt-200.data-00000-of-00001
model.ckpt-200.index
model.ckpt-200.meta


The preceding ls command shows that the Estimator created checkpoints at steps 1 (the start of training) and 200 (the end of training).
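
The step numbers can also be read off the file names programmatically. The following is a minimal sketch, assuming the `model.ckpt-<step>.index` naming shown in the listing above; the helper `checkpoint_steps` is a hypothetical name and is not part of TensorFlow.

```python
import re

# Matches index files such as "model.ckpt-200.index" and captures the step.
_INDEX_FILE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(filenames):
    """Return the sorted training steps that have a checkpoint index file."""
    steps = []
    for name in filenames:
        match = _INDEX_FILE.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


# File names copied from the directory listing above.
listing = [
    "checkpoint",
    "events.out.tfevents.1529404546.USER-20150123VJ",
    "graph.pbtxt",
    "model.ckpt-1.data-00000-of-00001",
    "model.ckpt-1.index",
    "model.ckpt-1.meta",
    "model.ckpt-200.data-00000-of-00001",
    "model.ckpt-200.index",
    "model.ckpt-200.meta",
]

print(checkpoint_steps(listing))  # [1, 200]
```

To inspect a real directory, pass the result of `os.listdir('models/iris')` instead of the hard-coded list.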

## Default checkpoint directory

If you don't specify model_dir in an Estimator's constructor, the Estimator writes checkpoint files to a temporary directory chosen by Python's [tempfile.mkdtemp](https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp) function. For example, the following Estimator constructor does not specify the model_dir argument:

In [10]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3)

print(classifier.model_dir)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ADMINI~1\\AppData\\Local\\Temp\\tmpv3azjnld', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001AB836F5198>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
C:\Users\ADMINI~1\AppData\Local\Temp\tmpv3azjnld


The `tempfile.mkdtemp` function picks a secure, temporary directory appropriate for your operating system. In the run above, on Windows, the `_model_dir` entry of the logged config shows a directory under the user's `AppData\Local\Temp` folder; the location on macOS or Linux will differ.
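
The exact path varies by machine and by call. A minimal sketch for checking what `tempfile.mkdtemp` returns on your own system, using only the Python standard library and no Estimator:

```python
import os
import shutil
import tempfile

# mkdtemp creates a new, uniquely named directory and returns its path.
tmp_dir = tempfile.mkdtemp()
print(tmp_dir)  # the path differs per machine and per call

# The directory is not deleted automatically; the caller must clean it up.
dir_exists = os.path.isdir(tmp_dir)
shutil.rmtree(tmp_dir)
```

Because nothing removes such a directory for you, checkpoints written to a default temporary `model_dir` are easy to lose track of; setting `model_dir` explicitly avoids that.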

## Checkpointing frequency

By default, the Estimator saves [checkpoints](https://developers.google.cn/machine-learning/glossary/#checkpoint) in the `model_dir` according to the following schedule:

- Writes a checkpoint every 10 minutes (600 seconds).
- Writes a checkpoint when the `train` method starts (first iteration) and completes (final iteration).
- Retains only the 5 most recent checkpoints in the directory.
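
The retention rule in the last bullet can be pictured as a fixed-size queue. The sketch below is a simplified model of that rule only, not TensorFlow's implementation; `retained_checkpoints` and the example step numbers are made up for illustration.

```python
from collections import deque


def retained_checkpoints(saved_steps, keep_max=5):
    """Return the steps still on disk if only the keep_max newest are kept."""
    # A deque with maxlen silently drops the oldest entry once it is full.
    kept = deque(maxlen=keep_max)
    for step in saved_steps:
        kept.append(step)
    return list(kept)


# Hypothetical run that wrote seven checkpoints.
saved = [1, 100, 200, 300, 400, 500, 600]
print(retained_checkpoints(saved))               # [200, 300, 400, 500, 600]
print(retained_checkpoints(saved, keep_max=10))  # all seven steps
```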

You may alter the default schedule by taking the following steps:

- Create a [RunConfig](https://tensorflow.google.cn/api_docs/python/tf/estimator/RunConfig) object that defines the desired schedule.
- When instantiating the Estimator, pass that `RunConfig` object to the Estimator's `config` argument.

For example, the following code changes the checkpointing schedule to every 20 minutes and retains the 10 most recent checkpoints:

In [12]:
my_checkpointing_config = tf.estimator.RunConfig(
    save_checkpoints_secs=20 * 60,  # Save checkpoints every 20 minutes.
    keep_checkpoint_max=10,         # Retain the 10 most recent checkpoints.
)

classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir='models/iris',
    config=my_checkpointing_config)

INFO:tensorflow:Using config: {'_model_dir': 'models/iris', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 1200, '_session_config': None, '_keep_checkpoint_max': 10, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001AB836F56A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


## Restoring your model

The first time you call an Estimator's `train` method, TensorFlow saves a checkpoint to the `model_dir`. Each subsequent call to the Estimator's `train`, `evaluate`, or `predict` method causes the following:

1. The Estimator builds the model's [graph](https://developers.google.cn/machine-learning/glossary/#graph) by running the `model_fn()`. (For details on the `model_fn()`, see [Creating Custom Estimators](https://tensorflow.google.cn/get_started/custom_estimators).)
2. The Estimator initializes the weights of the new model from the data stored in the most recent checkpoint.

In other words, as the following illustration suggests, once checkpoints exist, TensorFlow rebuilds the model each time you call `train()`, `evaluate()`, or `predict()`.

![](https://tensorflow.google.cn/images/subsequent_calls.png)
<center>Subsequent calls to train(), evaluate(), or predict()</center>
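
The two steps above can be mimicked with plain Python dictionaries. This is a toy model under stated assumptions, not how TensorFlow is implemented: `build_model` stands in for `model_fn()`, and `checkpoints` is an in-memory stand-in for the files in `model_dir`.

```python
def build_model():
    """Stand-in for model_fn(): create variables with fresh initial values."""
    return {"weights": [0.0, 0.0, 0.0], "bias": 0.0}


def restore_latest(variables, checkpoints):
    """Overwrite fresh variables with the values from the newest checkpoint."""
    if not checkpoints:
        return dict(variables), None  # nothing saved yet: keep initial values
    latest_step = max(checkpoints)
    restored = dict(variables)
    restored.update(checkpoints[latest_step])
    return restored, latest_step


# Hypothetical checkpoints keyed by global step.
checkpoints = {
    1: {"weights": [0.1, 0.2, 0.3], "bias": 0.05},
    200: {"weights": [0.7, -0.4, 1.2], "bias": 0.3},
}

model = build_model()                             # step 1: rebuild from code
model, step = restore_latest(model, checkpoints)  # step 2: load saved values
print(step, model["bias"])  # 200 0.3
```

The point of the sketch is the ordering: the structure always comes from the code, and the checkpoint only supplies values for it.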

## Avoiding a bad restoration

Restoring a model's state from a checkpoint only works if the model and checkpoint are compatible. For example, suppose you trained a DNNClassifier Estimator containing two hidden layers, each having 10 nodes:

In [15]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir='models/iris')

classifier.train(
    input_fn=lambda:iris_data.train_input_fn(train_x, train_y, batch_size=100),
    steps=200)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'models/iris', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001AB836ACFD0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/iris\model.ckpt-200
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints 

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1ab836ace10>

After training (and, therefore, after creating checkpoints in models/iris), imagine that you changed the number of neurons in each hidden layer from 10 to 20 and then attempted to retrain the model:

In [19]:
classifier2 = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[20, 20],
    n_classes=3,
    model_dir='models/iris')

classifier2.train(
    input_fn=lambda:iris_data.train_input_fn(train_x, train_y, batch_size=100),
    steps=200)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'models/iris', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001AB82383128>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/iris\model.ckpt-600


InvalidArgumentError: tensor_name = dnn/hiddenlayer_0/bias; shape in shape_and_slice spec [20] does not match the shape stored in checkpoint: [10]
	 [[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]


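Since the state in the checkpoint is incompatible with the model described in `classifier2`, retraining fails with the error shown above: the checkpoint stores `dnn/hiddenlayer_0/bias` with shape [10], whereas `classifier2` expects shape [20].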

To run experiments in which you train and compare slightly different versions of a model, save a copy of the code that created each model_dir, possibly by creating a separate git branch for each version. This separation will keep your checkpoints recoverable.
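
To see why the restoration above fails, it can help to compare variable shapes by hand. The following is a rough sketch, not TensorFlow's restore logic. It assumes a fully connected network with 4 input features (an assumption about the Iris data that the text above does not state) and uses variable names patterned on the one in the error message; `dnn_variable_shapes` and `find_mismatches` are hypothetical helpers.

```python
def dnn_variable_shapes(n_features, hidden_units, n_classes):
    """Map variable names to shapes for a fully connected network."""
    shapes = {}
    n_inputs = n_features
    for i, n_units in enumerate(hidden_units):
        shapes["dnn/hiddenlayer_%d/kernel" % i] = (n_inputs, n_units)
        shapes["dnn/hiddenlayer_%d/bias" % i] = (n_units,)
        n_inputs = n_units
    shapes["dnn/logits/kernel"] = (n_inputs, n_classes)
    shapes["dnn/logits/bias"] = (n_classes,)
    return shapes


def find_mismatches(model_shapes, checkpoint_shapes):
    """Return names whose shape in the model differs from the checkpoint."""
    return sorted(
        name
        for name, shape in model_shapes.items()
        if checkpoint_shapes.get(name) != shape
    )


saved = dnn_variable_shapes(4, [10, 10], 3)    # what the checkpoint holds
rebuilt = dnn_variable_shapes(4, [20, 20], 3)  # what classifier2 expects

for name in find_mismatches(rebuilt, saved):
    print(name, "model:", rebuilt[name], "checkpoint:", saved[name])
```

Under these assumptions every hidden-layer variable and the logits kernel differ in shape, so no partial restoration is possible; only the logits bias happens to match.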

## Summary

Checkpoints provide an easy, automatic mechanism for saving and restoring models created by Estimators.

See the [Saving and Restoring](https://tensorflow.google.cn/programmers_guide/saved_model) chapter of the *TensorFlow Programmer's Guide* for details on:

- Saving and restoring models using low-level TensorFlow APIs.
- Exporting and importing models in the SavedModel format, which is a language-neutral, recoverable serialization format.