# Lab 3 Image Classification Labs

## Lab 3-2 Training ResNet on the CIFAR-10 Dataset Using TensorFlow (Single Node)

In this lab you train a ResNet neural network on the CIFAR-10 training data to classify images into 10 known categories. The code is written in TensorFlow.

Unlike Apache MXNet, TensorFlow does not yet support Amazon S3 storage. However, an issue that includes S3 file system support code has been filed in the TensorFlow GitHub repository: https://github.com/tensorflow/tensorflow/issues/10616



### Where to execute training and evaluation?

Run the commands in this lab from a terminal, not from a Jupyter notebook.



### Overall Steps of Lab 3-2

The steps below assume that you have cloned the CodeCommit repository and that all necessary files are in the lab3/tensorflow_resnet_cifar10 directory.

##### Step 1. Create an Amazon S3 bucket for storing checkpoints (you may reuse the bucket from Lab 3-1)

##### Step 2. Build resnet_main with the Bazel tool

```
$ cd lab3

$ bazel build -c opt tensorflow_resnet_cifar10/...
```

> **Bazel** is Google's own build tool, now publicly available in Beta. Bazel has built-in support for building both client and server software, including client applications for both Android and iOS platforms. It also provides an extensible framework that you can use to develop your own build rules. 

##### Step 3. Run training with the command below

```
$ bazel-bin/tensorflow_resnet_cifar10/resnet_main \
  --train_data_path=./tensorflow_resnet_cifar10/dataset/cifar-10-batches-bin/data_batch* \
  --train_dir=./tensorflow_resnet_cifar10/resnet_model/train \
  --log_root=./tensorflow_resnet_cifar10/resnet_model/ckpt \
  --dataset='cifar10'
```

##### Step 4. Stop the training after 3-4 epochs
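
To judge when 3-4 epochs have passed, it helps to convert epochs into the step count that the training log prints. The sketch below assumes the standard CIFAR-10 training set of 50,000 images and the training batch size of 128 that the main routine sets; `steps_per_epoch` is a hypothetical helper name, not part of the lab code.

```python
import math

CIFAR10_TRAIN_IMAGES = 50000  # size of the standard CIFAR-10 training set
TRAIN_BATCH_SIZE = 128        # batch size set by the main routine in train mode


def steps_per_epoch(num_images, batch_size):
    """Number of training steps needed to see every image once."""
    return int(math.ceil(num_images / float(batch_size)))


per_epoch = steps_per_epoch(CIFAR10_TRAIN_IMAGES, TRAIN_BATCH_SIZE)
print(per_epoch)                     # 391
print(3 * per_epoch, 4 * per_epoch)  # 1173 1564
```

Under these assumptions, 3-4 epochs correspond to a logged step count of roughly 1,200 to 1,600.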

##### Step 5. Run evaluation with the command below

```
$ bazel-bin/tensorflow_resnet_cifar10/resnet_main \
  --eval_data_path=./tensorflow_resnet_cifar10/dataset/cifar-10-batches-bin/test_batch.bin \
  --log_root=./tensorflow_resnet_cifar10/resnet_model/ckpt \
  --eval_dir=./tensorflow_resnet_cifar10/resnet_model/test \
  --mode=eval \
  --dataset='cifar10' \
  --num_gpus=0
```

##### Step 6. Upload checkpoint files to the S3 bucket

```
$ cd ./tensorflow_resnet_cifar10/resnet_model/ckpt
$ aws s3 sync . s3://<bucket_name>/deeplearning/tf-resnet-model
```

### Output from training

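The training program writes checkpoints to the directory given by the `--log_root` flag. The code in this lab does not set a checkpoint interval, so `tf.train.MonitoredTrainingSession` applies its default, which in TensorFlow 1.x is time-based (every 600 seconds) rather than a fixed number of steps or epochs.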

##### Checkpoint Files

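In the file names below, 000 stands for the global step at which the checkpoint was saved (for example, model.ckpt-1053).

* model.ckpt-000.meta : the network design (graph structure)
* model.ckpt-000.data-00000-of-00001 : the values of the variables in the graph
* model.ckpt-000.index : an index that maps each variable to its location in the data file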
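
The step number in each file name ties the three files of one checkpoint together. As an illustration only, the sketch below groups file names by step and reports the newest checkpoint that has all three parts. The function name `newest_complete_checkpoint` and the sample file names are hypothetical; TensorFlow itself records the latest checkpoint in a `checkpoint` file in the same directory rather than parsing names like this.

```python
import re
from collections import defaultdict

# Matches names such as model.ckpt-1053.index or
# model.ckpt-1053.data-00000-of-00001.
CKPT_FILE = re.compile(r"^model\.ckpt-(\d+)\.(meta|index|data-\d+-of-\d+)$")


def newest_complete_checkpoint(filenames):
    """Return the highest step that has .meta, .index, and .data files."""
    parts_by_step = defaultdict(set)
    for name in filenames:
        match = CKPT_FILE.match(name)
        if match:
            # Collapse "data-00000-of-00001" to "data" so shards count once.
            kind = match.group(2).split("-")[0]
            parts_by_step[int(match.group(1))].add(kind)
    complete = [step for step, parts in parts_by_step.items()
                if parts == {"meta", "index", "data"}]
    return max(complete) if complete else None


files = [
    "checkpoint",
    "model.ckpt-0.meta", "model.ckpt-0.index",
    "model.ckpt-0.data-00000-of-00001",
    "model.ckpt-1053.meta", "model.ckpt-1053.index",
    "model.ckpt-1053.data-00000-of-00001",
    "model.ckpt-1500.index",  # incomplete: other parts are missing
]
print(newest_complete_checkpoint(files))  # 1053
```

A check like this can be a quick sanity test before syncing the directory to S3, since a checkpoint with a missing part cannot be restored.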

##### Model Analysis Report : Neural network architecture and the number of parameters
```
==================Model Analysis Report======================
_TFProfRoot (--/464.15k params)
  init/init_conv/DW (3x3x3x16, 432/432 params)
  logit/DW (64x10, 640/640 params)
  logit/biases (10, 10/10 params)
  unit_1_0/shared_activation/init_bn/beta (16, 16/16 params)
  unit_1_0/shared_activation/init_bn/gamma (16, 16/16 params)
  unit_1_0/sub1/conv1/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_0/sub2/bn2/beta (16, 16/16 params)
  unit_1_0/sub2/bn2/gamma (16, 16/16 params)
  unit_1_0/sub2/conv2/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_1/residual_only_activation/init_bn/beta (16, 16/16 params)
  unit_1_1/residual_only_activation/init_bn/gamma (16, 16/16 params)
  unit_1_1/sub1/conv1/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_1/sub2/bn2/beta (16, 16/16 params)
  unit_1_1/sub2/bn2/gamma (16, 16/16 params)
  unit_1_1/sub2/conv2/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_2/residual_only_activation/init_bn/beta (16, 16/16 params)
  unit_1_2/residual_only_activation/init_bn/gamma (16, 16/16 params)
  unit_1_2/sub1/conv1/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_2/sub2/bn2/beta (16, 16/16 params)
  unit_1_2/sub2/bn2/gamma (16, 16/16 params)
  unit_1_2/sub2/conv2/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_3/residual_only_activation/init_bn/beta (16, 16/16 params)
  unit_1_3/residual_only_activation/init_bn/gamma (16, 16/16 params)
  unit_1_3/sub1/conv1/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_3/sub2/bn2/beta (16, 16/16 params)
  unit_1_3/sub2/bn2/gamma (16, 16/16 params)
  unit_1_3/sub2/conv2/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_4/residual_only_activation/init_bn/beta (16, 16/16 params)
  unit_1_4/residual_only_activation/init_bn/gamma (16, 16/16 params)
  unit_1_4/sub1/conv1/DW (3x3x16x16, 2.30k/2.30k params)
  unit_1_4/sub2/bn2/beta (16, 16/16 params)
  unit_1_4/sub2/bn2/gamma (16, 16/16 params)
  unit_1_4/sub2/conv2/DW (3x3x16x16, 2.30k/2.30k params)
  .... snipped ....
  unit_3_4/residual_only_activation/init_bn/beta (64, 64/64 params)
  unit_3_4/residual_only_activation/init_bn/gamma (64, 64/64 params)
  unit_3_4/sub1/conv1/DW (3x3x64x64, 36.86k/36.86k params)
  unit_3_4/sub2/bn2/beta (64, 64/64 params)
  unit_3_4/sub2/bn2/gamma (64, 64/64 params)
  unit_3_4/sub2/conv2/DW (3x3x64x64, 36.86k/36.86k params)
  unit_last/final_bn/beta (64, 64/64 params)
  unit_last/final_bn/gamma (64, 64/64 params)

======================End of Report==========================
total_params: 464154
```
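
The counts in the report can be checked by hand: a 3x3 convolution from 16 to 16 channels has 3 * 3 * 16 * 16 = 2,304 weights, which the report prints as 2.30k. The sketch below repeats that arithmetic for the whole network. It assumes that the snipped units follow the same pattern as the listed ones, with stage widths of 16, 32, and 64 channels and five units per stage (`num_residual_units=5` in the main routine), and that shortcut connections add no parameters. The helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution without bias (the DW entries)."""
    return kernel * kernel * c_in * c_out


def bn_params(channels):
    """Batch normalization has one beta and one gamma per channel."""
    return 2 * channels


def residual_unit_params(c_in, c_out):
    """BN, 3x3 conv, BN, 3x3 conv, matching the sub1/sub2 entries of a unit."""
    return (bn_params(c_in) + conv_params(3, c_in, c_out)
            + bn_params(c_out) + conv_params(3, c_out, c_out))


def resnet_params(units_per_stage=5, filters=(16, 16, 32, 64), num_classes=10):
    """Total trainable parameters under the assumptions stated above."""
    total = conv_params(3, 3, filters[0])             # init/init_conv/DW
    for stage in range(1, len(filters)):
        c_in, c_out = filters[stage - 1], filters[stage]
        total += residual_unit_params(c_in, c_out)    # first unit may widen
        total += (units_per_stage - 1) * residual_unit_params(c_out, c_out)
    total += bn_params(filters[-1])                   # unit_last/final_bn
    total += filters[-1] * num_classes + num_classes  # logit/DW + logit/biases
    return total


print(conv_params(3, 3, 16))    # 432
print(conv_params(3, 16, 16))   # 2304
print(conv_params(3, 64, 64))   # 36864
print(resnet_params())          # 464154
```

Under these assumptions the total agrees with the `total_params: 464154` line of the report.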

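### Output from evaluation

In evaluation mode, the program reads the evaluation dataset, predicts a category for each image, and compares the prediction with the true label. The reported precision is the fraction of images classified correctly. Unless `--eval_once` is set, the program reloads the latest checkpoint and repeats the evaluation every 60 seconds, which is why the log below shows the same checkpoint several times.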

```
INFO:tensorflow:Loading checkpoint ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:Restoring parameters from ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:loss: 1.231, precision: 0.694, best precision: 0.694
INFO:tensorflow:Loading checkpoint ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:Restoring parameters from ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:loss: 1.082, precision: 0.692, best precision: 0.694
INFO:tensorflow:Loading checkpoint ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:Restoring parameters from ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:loss: 1.231, precision: 0.694, best precision: 0.694
INFO:tensorflow:Loading checkpoint ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:Restoring parameters from ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:loss: 1.082, precision: 0.692, best precision: 0.694
INFO:tensorflow:Loading checkpoint ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:Restoring parameters from ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:loss: 1.231, precision: 0.694, best precision: 0.694
INFO:tensorflow:Loading checkpoint ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
INFO:tensorflow:Restoring parameters from ./tensorflow_resnet_cifar10/resnet_model/ckpt/model.ckpt-1053
```

## Source Code Explained

### 1. Importing modules and defining arguments with default values

```python
import time
import six
import sys

import cifar_input
import numpy as np
import resnet_model
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('dataset', 'cifar10', 'cifar10 or cifar100.')
tf.app.flags.DEFINE_string('mode', 'train', 'train or eval.')
tf.app.flags.DEFINE_string('train_data_path', './dataset/cifar-10-batches-bin/data_batch*',
                           'Filepattern for training data.')
tf.app.flags.DEFINE_string('eval_data_path', '',
                           'Filepattern for eval data')
tf.app.flags.DEFINE_integer('image_size', 32, 'Image side length.')
tf.app.flags.DEFINE_string('train_dir', './resnet_model/train',
                           'Directory to keep training outputs.')
tf.app.flags.DEFINE_string('eval_dir', '',
                           'Directory to keep eval outputs.')
tf.app.flags.DEFINE_integer('eval_batch_count', 50,
                            'Number of batches to eval.')
tf.app.flags.DEFINE_bool('eval_once', False,
                         'Whether evaluate the model only once.')
tf.app.flags.DEFINE_string('log_root', './resnet_model/ckpt',
                           'Directory to keep the checkpoints. Should be a '
                           'parent directory of FLAGS.train_dir/eval_dir.')
tf.app.flags.DEFINE_integer('num_gpus', 0,
                            'Number of gpus used for training. (0 or 1)')
```

### 2. Implementing training

In this code, `tf.summary` saves training summary data that can later be visualized with TensorBoard.

> **TensorBoard for MXNet**
>
> TensorBoard ships with TensorFlow. A separate Git project provides a stand-alone version for general-purpose visualization; it is available at https://github.com/dmlc/tensorboard
>
> See also [Bring TensorBoard to MXNet](http://dmlc.ml/2017/01/07/bring-TensorBoard-to-MXNet.html) for details.

```python
def train(hps):
  """Training loop."""
  images, labels = cifar_input.build_input(
      FLAGS.dataset, FLAGS.train_data_path, hps.batch_size, FLAGS.mode)
  model = resnet_model.ResNet(hps, images, labels, FLAGS.mode)
  model.build_graph()

  param_stats = tf.contrib.tfprof.model_analyzer.print_model_analysis(
      tf.get_default_graph(),
      tfprof_options=tf.contrib.tfprof.model_analyzer.
          TRAINABLE_VARS_PARAMS_STAT_OPTIONS)
  sys.stdout.write('total_params: %d\n' % param_stats.total_parameters)

  tf.contrib.tfprof.model_analyzer.print_model_analysis(
      tf.get_default_graph(),
      tfprof_options=tf.contrib.tfprof.model_analyzer.FLOAT_OPS_OPTIONS)

  truth = tf.argmax(model.labels, axis=1)
  predictions = tf.argmax(model.predictions, axis=1)
  precision = tf.reduce_mean(tf.to_float(tf.equal(predictions, truth)))

  summary_hook = tf.train.SummarySaverHook(
      save_steps=100,
      output_dir=FLAGS.train_dir,
      summary_op=tf.summary.merge([model.summaries,
                                   tf.summary.scalar('Precision', precision)]))

  logging_hook = tf.train.LoggingTensorHook(
      tensors={'step': model.global_step,
               'loss': model.cost,
               'precision': precision},
      every_n_iter=100)

  class _LearningRateSetterHook(tf.train.SessionRunHook):
    """Sets learning_rate based on global step."""

    def begin(self):
      self._lrn_rate = 0.1

    def before_run(self, run_context):
      return tf.train.SessionRunArgs(
          model.global_step,  # Asks for global step value.
          feed_dict={model.lrn_rate: self._lrn_rate})  # Sets learning rate

    def after_run(self, run_context, run_values):
      train_step = run_values.results
      if train_step < 40000:
        self._lrn_rate = 0.1
      elif train_step < 60000:
        self._lrn_rate = 0.01
      elif train_step < 80000:
        self._lrn_rate = 0.001
      else:
        self._lrn_rate = 0.0001

  with tf.train.MonitoredTrainingSession(
      checkpoint_dir=FLAGS.log_root,
      hooks=[logging_hook, _LearningRateSetterHook()],
      chief_only_hooks=[summary_hook],
      # Since we provide a SummarySaverHook, we need to disable default
      # SummarySaverHook. To do that we set save_summaries_steps to 0.
      save_summaries_steps=0,
      config=tf.ConfigProto(allow_soft_placement=True)) as mon_sess:
    while not mon_sess.should_stop():
      mon_sess.run(model.train_op)
```
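
The learning-rate hook in the training code implements a piecewise-constant schedule keyed on the global step. The sketch below restates that schedule as a plain function so the thresholds are easy to inspect; `learning_rate_for_step` is a hypothetical name, and the function only mirrors the branches of `after_run`, without the session plumbing.

```python
def learning_rate_for_step(train_step):
    """Learning rate chosen by the hook after observing this global step."""
    if train_step < 40000:
        return 0.1
    elif train_step < 60000:
        return 0.01
    elif train_step < 80000:
        return 0.001
    return 0.0001


# Print the rate on both sides of each threshold.
for step in (0, 39999, 40000, 59999, 60000, 79999, 80000):
    print(step, learning_rate_for_step(step))
```

Since this lab stops training after a few epochs, the step count stays far below the first threshold of 40,000 and the rate remains 0.1 throughout.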

### 3. Implementing evaluation

```python
def evaluate(hps):
  """Eval loop."""
  images, labels = cifar_input.build_input(
      FLAGS.dataset, FLAGS.eval_data_path, hps.batch_size, FLAGS.mode)
  model = resnet_model.ResNet(hps, images, labels, FLAGS.mode)
  model.build_graph()
  saver = tf.train.Saver()
  summary_writer = tf.summary.FileWriter(FLAGS.eval_dir)

  sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
  tf.train.start_queue_runners(sess)

  best_precision = 0.0
  while True:
    try:
      ckpt_state = tf.train.get_checkpoint_state(FLAGS.log_root)
    except tf.errors.OutOfRangeError as e:
      tf.logging.error('Cannot restore checkpoint: %s', e)
      continue
    if not (ckpt_state and ckpt_state.model_checkpoint_path):
      tf.logging.info('No model to eval yet at %s', FLAGS.log_root)
      continue
    tf.logging.info('Loading checkpoint %s', ckpt_state.model_checkpoint_path)
    saver.restore(sess, ckpt_state.model_checkpoint_path)

    total_prediction, correct_prediction = 0, 0
    for _ in six.moves.range(FLAGS.eval_batch_count):
      (summaries, loss, predictions, truth, train_step) = sess.run(
          [model.summaries, model.cost, model.predictions,
           model.labels, model.global_step])

      truth = np.argmax(truth, axis=1)
      predictions = np.argmax(predictions, axis=1)
      correct_prediction += np.sum(truth == predictions)
      total_prediction += predictions.shape[0]

    precision = 1.0 * correct_prediction / total_prediction
    best_precision = max(precision, best_precision)

    precision_summ = tf.Summary()
    precision_summ.value.add(
        tag='Precision', simple_value=precision)
    summary_writer.add_summary(precision_summ, train_step)
    best_precision_summ = tf.Summary()
    best_precision_summ.value.add(
        tag='Best Precision', simple_value=best_precision)
    summary_writer.add_summary(best_precision_summ, train_step)
    summary_writer.add_summary(summaries, train_step)
    tf.logging.info('loss: %.3f, precision: %.3f, best precision: %.3f' %
                    (loss, precision, best_precision))
    summary_writer.flush()

    if FLAGS.eval_once:
      break

    time.sleep(60)
```
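
The value logged as precision is the share of images whose highest-scoring class matches the one-hot label, accumulated over all evaluation batches. The sketch below reproduces that bookkeeping in plain Python on two made-up batches with three classes. The helper names and the toy data are hypothetical; the lab code performs the same comparison with `np.argmax`.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_counts(predictions, labels):
    """Return (correct, total) for one batch of class scores and one-hot labels."""
    correct = sum(1 for scores, one_hot in zip(predictions, labels)
                  if argmax(scores) == argmax(one_hot))
    return correct, len(predictions)


# Two toy batches: (class scores per image, one-hot labels per image).
batches = [
    ([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [[1, 0, 0], [0, 0, 1]]),
    ([[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]], [[0, 0, 1], [1, 0, 0]]),
]

correct_prediction, total_prediction = 0, 0
for scores, one_hot in batches:
    correct, total = batch_counts(scores, one_hot)
    correct_prediction += correct
    total_prediction += total

precision = 1.0 * correct_prediction / total_prediction
print(precision)  # 0.75
```

With the default `eval_batch_count` of 50 and the evaluation batch size of 100, each pass of the lab code scores 5,000 images in this way.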

### 4. Implementing main routine

```python
def main(_):
  if FLAGS.num_gpus == 0:
    dev = '/cpu:0'
  elif FLAGS.num_gpus == 1:
    dev = '/gpu:0'
  else:
    raise ValueError('Only support 0 or 1 gpu.')

  if FLAGS.mode == 'train':
    batch_size = 128
  elif FLAGS.mode == 'eval':
    batch_size = 100

  if FLAGS.dataset == 'cifar10':
    num_classes = 10
  elif FLAGS.dataset == 'cifar100':
    num_classes = 100

  hps = resnet_model.HParams(batch_size=batch_size,
                             num_classes=num_classes,
                             min_lrn_rate=0.0001,
                             lrn_rate=0.1,
                             num_residual_units=5,
                             use_bottleneck=False,
                             weight_decay_rate=0.0002,
                             relu_leakiness=0.1,
                             optimizer='mom')

  with tf.device(dev):
    if FLAGS.mode == 'train':
      train(hps)
    elif FLAGS.mode == 'eval':
      evaluate(hps)


if __name__ == '__main__':
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run()
```
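
The main routine maps three flags to a device, a batch size, and a class count. As listed, an unrecognized `--mode` or `--dataset` value leaves `batch_size` or `num_classes` unassigned, which would surface as an unbound-variable error when `HParams` is built. The sketch below shows the same mapping as a pure function with explicit checks. `resolve_run_settings` is a hypothetical helper, not part of the lab code.

```python
BATCH_SIZES = {'train': 128, 'eval': 100}        # values used by main()
CLASS_COUNTS = {'cifar10': 10, 'cifar100': 100}  # values used by main()


def resolve_run_settings(mode, dataset, num_gpus):
    """Return (device, batch_size, num_classes) or raise ValueError."""
    if num_gpus == 0:
        device = '/cpu:0'
    elif num_gpus == 1:
        device = '/gpu:0'
    else:
        raise ValueError('Only support 0 or 1 gpu.')
    if mode not in BATCH_SIZES:
        raise ValueError('Unknown mode: %r (expected train or eval)' % mode)
    if dataset not in CLASS_COUNTS:
        raise ValueError('Unknown dataset: %r (expected cifar10 or cifar100)'
                         % dataset)
    return device, BATCH_SIZES[mode], CLASS_COUNTS[dataset]


print(resolve_run_settings('eval', 'cifar10', 0))  # ('/cpu:0', 100, 10)
```

Failing early with a named flag value makes a mistyped command line easier to diagnose than an error raised later inside the model setup.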