Monitored Session
----------------

Compared with a plain tf.Session object, MonitoredSession is more convenient to use.

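It wraps checkpoint saving and restoring, periodic summary saving, variable initialization, and the starting of queue runners, and it provides a number of hooks for monitoring the training process. It also implements a chief/worker mode, which makes it suitable for distributed environments.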

https://www.tensorflow.org/api_docs/python/tf/train/MonitoredTrainingSession

# tf.train.MonitoredTrainingSession

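This function is a factory method for tf.train.MonitoredSession. It takes a number of constructor arguments; the ones related to distributed environments are left aside for now.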

* checkpoint_dir: a directory; checkpoints are saved to it and restored from it automatically
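* scaffold: a tf.train.Scaffold that gathers or builds the supporting ops (init op, saver, summary op, and so on); a default one is created if it is omitted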
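* hooks: a list of SessionRunHook objects; every hook in the list gets triggered
* save_checkpoint_secs: save a checkpoint automatically every this many seconds
* save_summaries_steps: save a summary automatically every this many steps
* save_summaries_secs: save a summary automatically every this many seconds
* stop_grace_period_secs: the number of seconds the Coordinator allows for a graceful shutdown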
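
The three `save_*` parameters above each describe a periodic trigger, counted either in steps or in seconds. As a rough illustration of that idea, here is a plain-Python sketch. It is not TensorFlow's code: `EveryNTrigger`, `should_fire`, `mark_fired`, and `fired_steps` are hypothetical names made up for this example, and the sketch assumes that the very first check always fires.

```python
class EveryNTrigger:
    """Toy periodic trigger (hypothetical, not part of TensorFlow)."""

    def __init__(self, every_secs=None, every_steps=None):
        # Exactly one of the two intervals must be given.
        if (every_secs is None) == (every_steps is None):
            raise ValueError("set exactly one of every_secs or every_steps")
        self.every_secs = every_secs
        self.every_steps = every_steps
        self.last_step = None
        self.last_time = None

    def should_fire(self, step, now):
        if self.last_step is None:
            return True  # assumption of this sketch: the first check fires
        if self.every_steps is not None:
            return step - self.last_step >= self.every_steps
        return now - self.last_time >= self.every_secs

    def mark_fired(self, step, now):
        self.last_step = step
        self.last_time = now


def fired_steps(trigger, num_steps, secs_per_step):
    """Run a fake training loop and collect the steps at which the trigger fired."""
    fired = []
    for step in range(num_steps):
        now = step * secs_per_step  # fake clock, no real sleeping
        if trigger.should_fire(step, now):
            fired.append(step)
            trigger.mark_fired(step, now)
    return fired


# Step-based, in the spirit of save_summaries_steps=3.
by_steps = fired_steps(EveryNTrigger(every_steps=3), num_steps=10, secs_per_step=1.0)
# Time-based, in the spirit of save_checkpoint_secs=2 with 0.5 "seconds" per step.
by_secs = fired_steps(EveryNTrigger(every_secs=2.0), num_steps=10, secs_per_step=0.5)
print(by_steps)
print(by_secs)
```

With a fake clock the time-based variant is deterministic, which makes it easy to see that the trigger depends on elapsed time rather than on the step count.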

# tf.train.MonitoredSession

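This is a session that runs according to a fixed template of actions, split into three phases:

Initialization: the following steps happen in order

1. Call hook.begin() on each hook
1. Call scaffold.finalize()
1. Create the session
1. Initialize the model using the Scaffold
1. If a checkpoint exists, restore the model from it
1. Start the queue runners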
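
The ordering above, and in particular the "initialize first, then restore only if a checkpoint exists" rule, can be mimicked with a small plain-Python toy. This is only a sketch of the control flow, not TensorFlow code: `ToyHook`, `initialize`, and the JSON file used as a "checkpoint" are assumptions made for illustration.

```python
import json
import os
import tempfile


class ToyHook:
    """Hypothetical stand-in for a SessionRunHook; only begin() is modeled."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def begin(self):
        self.log.append("begin:" + self.name)


def initialize(hooks, checkpoint_path, log):
    """Walk through the six steps in order and return the model state."""
    for hook in hooks:
        hook.begin()                         # 1. hook.begin() on each hook
    log.append("finalize")                   # 2. stands in for scaffold.finalize()
    log.append("create_session")             # 3. stands in for creating the session
    state = {"global_step": 0}               # 4. fresh initial values
    if os.path.exists(checkpoint_path):      # 5. restore only if a checkpoint exists
        with open(checkpoint_path) as f:
            state = json.load(f)
        log.append("restore")
    log.append("start_queue_runners")        # 6. stands in for starting queue runners
    return state


checkpoint_path = os.path.join(tempfile.mkdtemp(), "toy.ckpt.json")

# First start: no checkpoint yet, so the initial values are used.
first_log = []
first_state = initialize([ToyHook("a", first_log), ToyHook("b", first_log)],
                         checkpoint_path, first_log)

# Pretend training reached step 3 and a checkpoint was written.
with open(checkpoint_path, "w") as f:
    json.dump({"global_step": 3}, f)

# Second start: the checkpoint is found and overrides the initial values.
second_log = []
second_state = initialize([ToyHook("a", second_log)], checkpoint_path, second_log)

print(first_log, first_state)
print(second_log, second_state)
```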

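Run: when run() is called, the following steps happen in order

1. Call hook.before_run() on each hook
1. Call the monitored session's run()
1. Call hook.after_run()
1. Return the result of session.run()
1. If an AbortedError or UnavailableError is raised, recreate and reinitialize the session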
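
To make the run phase concrete, the following plain-Python toy wraps a fake session with hooks and a recovery loop. It is a sketch under assumptions, not TensorFlow's implementation: `AbortedError` here is a local stand-in class, `FlakySession`, `RecordingHook`, and `ToyMonitoredSession` are hypothetical names, and retrying the whole call (hooks included) after recreating the session is a choice made for this example.

```python
class AbortedError(Exception):
    """Stand-in for tf.errors.AbortedError in this toy model."""


class RecordingHook:
    """Hypothetical hook that records when it is called."""

    def __init__(self):
        self.events = []

    def before_run(self):
        self.events.append("before_run")

    def after_run(self, result):
        self.events.append(("after_run", result))


class FlakySession:
    """Fake session whose run() fails while the shared failure budget lasts."""

    def __init__(self, budget):
        self._budget = budget

    def run(self, x):
        if self._budget["failures_left"] > 0:
            self._budget["failures_left"] -= 1
            raise AbortedError("simulated worker restart")
        return x * 2


class ToyMonitoredSession:
    def __init__(self, session_factory, hooks):
        self._session_factory = session_factory
        self._hooks = hooks
        self._session = session_factory()
        self.sessions_created = 1

    def run(self, x):
        while True:
            try:
                for hook in self._hooks:
                    hook.before_run()
                result = self._session.run(x)
                for hook in self._hooks:
                    hook.after_run(result)
                return result
            except AbortedError:
                # Recreate the session, then retry the same run() call.
                self._session = self._session_factory()
                self.sessions_created += 1


budget = {"failures_left": 1}  # the first run() attempt will fail once
hook = RecordingHook()
sess = ToyMonitoredSession(lambda: FlakySession(budget), [hook])
result = sess.run(21)
print(result, sess.sessions_created, hook.events)
```

The caller only sees a single successful `run()`; the failure and the second session are visible only through `sessions_created` and the repeated `before_run` entry.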

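Close: the following steps happen

1. Call hook.end()
1. Close the queue runners and the session
1. Suppress the OutOfRange error, which signals that the samples in the input queue have been used up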
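
The close phase is easiest to picture as a context manager that swallows the "input exhausted" error. The toy below is plain Python and only a sketch: `OutOfRangeError` is a local stand-in class, `EndHook`, `ToySession`, and `next_sample` are hypothetical names, and calling `end()` only when the block finishes cleanly (or with the stand-in error) is an assumption of this example.

```python
class OutOfRangeError(Exception):
    """Stand-in for tf.errors.OutOfRangeError in this toy model."""


class EndHook:
    """Hypothetical hook that only records whether end() was called."""

    def __init__(self):
        self.ended = False

    def end(self):
        self.ended = True


class ToySession:
    def __init__(self, hooks):
        self._hooks = hooks
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Treat "input exhausted" as a normal way to finish.
        if exc_type is not None and issubclass(exc_type, OutOfRangeError):
            exc_type = None
        if exc_type is None:
            for hook in self._hooks:
                hook.end()       # assumption: end() runs only on a clean finish
        self.closed = True       # stands in for closing queue runners and session
        return exc_type is None  # True suppresses the stand-in OutOfRangeError


def next_sample(queue):
    """Pop one sample, raising the stand-in error once the queue is empty."""
    if not queue:
        raise OutOfRangeError("input queue is empty")
    return queue.pop(0)


queue = [1, 2, 3]
seen = []
hook = EndHook()
with ToySession([hook]) as sess:
    while True:  # loops until the queue runs dry
        seen.append(next_sample(queue))

print(seen, hook.ended, sess.closed)
```

Any other exception type still propagates out of the `with` block in this toy, so real failures are not hidden.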

In [1]:
!rm -rf log && mkdir log

In [2]:
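import tensorflow as tf

v = tf.constant(1)

# No global step exists in the graph yet, so creating the session fails
# with the RuntimeError shown below.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir='log',
        ) as sess:
    sess.run(v)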

INFO:tensorflow:Create CheckpointSaverHook.


RuntimeError: Global step should be created to use StepCounterHook.

In [10]:
!ls -l log/

total 56
-rw-r--r--  1 huanghao  staff     81 May  9 21:38 checkpoint
-rw-r--r--  1 huanghao  staff      4 May  9 21:38 model.ckpt-3.data-00000-of-00001
-rw-r--r--  1 huanghao  staff    127 May  9 21:38 model.ckpt-3.index
-rw-r--r--  1 huanghao  staff  16337 May  9 21:38 model.ckpt-3.meta


In [19]:
!rm -rf log && mkdir log




In [21]:
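import time

# MonitoredTrainingSession needs a global step; create one by hand, starting at 3.
global_step_tensor = tf.Variable(3, trainable=False, name='global_step')

with tf.train.MonitoredTrainingSession(
        checkpoint_dir='log',
        hooks=[tf.train.StopAtStepHook(last_step=10)],
        save_checkpoint_secs=2,
        ) as sess:
    # No sess.run() call is made; a checkpoint for step 3 is still written,
    # as the log below shows.
    time.sleep(1)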

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 3 into log/model.ckpt.


# Training Hooks

In [26]:
v = tf.constant(1, name='v')

with tf.train.MonitoredTrainingSession(
        checkpoint_dir='log',
        hooks=[tf.train.LoggingTensorHook([v], every_n_iter=1)],
        ) as sess:
    for i in range(10):
        sess.run(v)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from log/model.ckpt-3
INFO:tensorflow:Saving checkpoints for 3 into log/model.ckpt.
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.004 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.003 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.003 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.003 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.003 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.002 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.006 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.009 sec)
INFO:tensorflow:Tensor("v_1:0", shape=(), dtype=int32) = 1 (0.008 sec)
