PWC-Net-small model mixed-precision training (with cyclical learning rate schedule)
=======================================================

In this notebook we:
- Use a small model (no dense or residual connections) with a 6-level pyramid, and upsample the level 2 prediction by 4 to obtain the final flow prediction
- Train the PWC-Net-small model on a mix of the `FlyingChairs` and `FlyingThings3DHalfRes` datasets using a Cyclic<sub>short</sub> schedule of our own
- The Cyclic<sub>short</sub> schedule oscillates between `5e-04` and `1e-05` for 200,000 steps at a batch size of `8` (rescaled to 50,000 steps for our batch size of `32`)
- The training is done using mixed-precision with a loss scaler of `128.0` and a batch size of `32`
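
To make the pyramid settings above concrete, here is a minimal sketch of the feature-map resolution at each pyramid level. It assumes that each level halves the resolution of the previous one, and it uses the `(256, 448)` crop size configured later in this notebook. The helper name `pyramid_resolutions` is hypothetical and is not part of the model's code.

```python
def pyramid_resolutions(height, width, num_levels):
    """Return {level: (h, w)}, assuming level l has 1/2**l of the input resolution."""
    return {lvl: (height // 2 ** lvl, width // 2 ** lvl)
            for lvl in range(1, num_levels + 1)}

resolutions = pyramid_resolutions(256, 448, num_levels=6)
for lvl, (h, w) in sorted(resolutions.items()):
    print("level {}: {} x {}".format(lvl, h, w))

# Under this assumption, level 2 is at quarter resolution, so upsampling
# its prediction by 4 brings the flow back to the input size.
flow_pred_lvl = 2
h2, w2 = resolutions[flow_pred_lvl]
upsampled = (h2 * 4, w2 * 4)
print("upsampled level {} prediction: {} x {}".format(flow_pred_lvl, *upsampled))
```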

Below, look for `TODO` references and customize this notebook based on your own needs.

## Reference

[2018a]<a name="2018a"></a> Sun et al. 2018. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. [[arXiv]](https://arxiv.org/abs/1709.02371) [[web]](http://research.nvidia.com/publication/2018-02_PWC-Net%3A-CNNs-for) [[PyTorch (Official)]](https://github.com/NVlabs/PWC-Net/tree/master/PyTorch) [[Caffe (Official)]](https://github.com/NVlabs/PWC-Net/tree/master/Caffe)

In [1]:
"""
pwcnet_train.ipynb

PWC-Net model training.

Written by Phil Ferriere

Licensed under the MIT License (see LICENSE for details)

Tensorboard:
    [win] tensorboard --logdir=E:\\repos\\tf-optflow\\tfoptflow\\pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16
    [ubu] tensorboard --logdir=/media/EDrive/repos/tf-optflow/tfoptflow/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16
"""
from __future__ import absolute_import, division, print_function
import sys
from copy import deepcopy
import tensorflow as tf

from dataset_base import _DEFAULT_DS_TRAIN_OPTIONS
from dataset_flyingchairs import FlyingChairsDataset
from dataset_flyingthings3d import FlyingThings3DHalfResDataset
from dataset_mixer import MixedDataset
from model_pwcnet import ModelPWCNet, _DEFAULT_PWCNET_TRAIN_OPTIONS

## TODO: Set this first!

In [2]:
# TODO: You MUST set dataset_root to the correct path on your machine!
if sys.platform.startswith("win"):
    _DATASET_ROOT = 'E:/datasets/'
else:
    _DATASET_ROOT = '/media/EDrive/datasets/'
_FLYINGCHAIRS_ROOT = _DATASET_ROOT + 'FlyingChairs_release'
_FLYINGTHINGS3DHALFRES_ROOT = _DATASET_ROOT + 'FlyingThings3D_HalfRes'
    
# TODO: You MUST adjust the settings below based on the number of GPU(s) used for training
# Set controller device and devices
# A one-GPU setup would be something like controller='/device:GPU:0' and gpu_devices=['/device:GPU:0']
# Here, we use a single-GPU setup, as shown below
gpu_devices = ['/device:GPU:0']
controller = '/device:GPU:0'

# TODO: You MUST adjust this setting below based on the amount of memory on your GPU(s)
# Batch size
batch_size = 32

# Train on `FlyingChairs+FlyingThings3DHalfRes` mix

## Load the dataset

In [3]:
# TODO: You MUST set the batch size based on the capabilities of your GPU(s) 
#  Load train dataset
ds_opts = deepcopy(_DEFAULT_DS_TRAIN_OPTIONS)
ds_opts['in_memory'] = False                          # Too many samples to keep in memory at once, so don't preload them
ds_opts['aug_type'] = 'heavy'                         # Apply all supported augmentations
ds_opts['batch_size'] = batch_size * len(gpu_devices) # Use a multiple of 8; here, 32 on a single GPU
ds_opts['crop_preproc'] = (256, 448)                  # Crop to a smaller input size
ds1 = FlyingChairsDataset(mode='train_with_val', ds_root=_FLYINGCHAIRS_ROOT, options=ds_opts)
ds_opts['type'] = 'into_future'
ds2 = FlyingThings3DHalfResDataset(mode='train_with_val', ds_root=_FLYINGTHINGS3DHALFRES_ROOT, options=ds_opts)
ds = MixedDataset(mode='train_with_val', datasets=[ds1, ds2], options=ds_opts)

In [4]:
# Display dataset configuration
ds.print_config()


Dataset Configuration:
  verbose              False
  in_memory            False
  crop_preproc         (256, 448)
  scale_preproc        None
  input_channels       3
  tb_test_imgs         False
  random_seed          1969
  val_split            0.03
  aug_type             heavy
  aug_labels           True
  fliplr               0.5
  flipud               0.5
  translate            (0.5, 0.05)
  scale                (0.5, 0.05)
  batch_size           32
  type                 into_future
  mode                 train_with_val
  train size           41282
  val size             1230


## Configure the training

In [5]:
# Start from the default options
nn_opts = deepcopy(_DEFAULT_PWCNET_TRAIN_OPTIONS)
nn_opts['verbose'] = True
nn_opts['ckpt_dir'] = './pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/'
nn_opts['batch_size'] = ds_opts['batch_size']
nn_opts['x_shape'] = [2, ds_opts['crop_preproc'][0], ds_opts['crop_preproc'][1], 3]
nn_opts['y_shape'] = [ds_opts['crop_preproc'][0], ds_opts['crop_preproc'][1], 2]
nn_opts['use_tf_data'] = True # Use tf.data reader
nn_opts['gpu_devices'] = gpu_devices
nn_opts['controller'] = controller

# Use the PWC-Net-small model in quarter-resolution mode
nn_opts['use_dense_cx'] = False
nn_opts['use_res_cx'] = False
nn_opts['pyr_lvls'] = 6
nn_opts['flow_pred_lvl'] = 2

# Use mixed precision training
nn_opts['use_mixed_precision'] = True 
nn_opts['loss_scaler'] = 128.
nn_opts['x_dtype'] = tf.float16
nn_opts['y_dtype'] = tf.float32

# More options
nn_opts['max_to_keep'] = 50
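
Why use a loss scaler at all? In half precision, very small gradient values round to zero. Multiplying the loss by a constant before backpropagation, then dividing the gradients by the same constant in higher precision, keeps them representable. The sketch below illustrates the effect with Python's `struct` module, which can round a value to IEEE half precision. It is not the model's implementation: the helper `to_fp16` is hypothetical, and the gradient value `1e-8` is chosen purely for illustration.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', value))[0]

loss_scaler = 128.0   # same value as nn_opts['loss_scaler'] above
tiny_grad = 1e-8      # illustrative gradient magnitude, too small for fp16

unscaled = to_fp16(tiny_grad)               # underflows to zero in fp16
scaled = to_fp16(tiny_grad * loss_scaler)   # survives as a small nonzero fp16 value
recovered = scaled / loss_scaler            # unscale in higher precision

print(unscaled, scaled, recovered)
```

The recovered value is only approximate because the scaled gradient is still rounded to half precision, but it is no longer lost entirely.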

In [6]:
# Set the learning rate schedule. The values below are for a single GPU using a batch size of 8.
# Further down, we adjust the schedule to the size of the batch and the number of GPUs.
nn_opts['lr_policy'] = 'cyclic'
nn_opts['cyclic_lr_max'] = 5e-04 # Anything higher will generate NaNs
nn_opts['cyclic_lr_base'] = 1e-05
nn_opts['cyclic_lr_stepsize'] = 20000
nn_opts['max_steps'] = 200000

# Below, we adjust the schedule to the total batch size (per-GPU batch size * number of GPUs; here, 32 * 1).
nn_opts['max_steps'] = int(nn_opts['max_steps'] * 8 / ds_opts['batch_size'])
nn_opts['cyclic_lr_stepsize'] = int(nn_opts['cyclic_lr_stepsize'] * 8 / ds_opts['batch_size'])
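
As a sanity check on the settings above, here is a minimal sketch of a cyclical learning rate whose amplitude is halved after every full cycle (often called the `triangular2` policy). That this is the variant used here is an assumption, inferred from the `lr` values printed in the training log further down. The function name `cyclic_lr` is hypothetical and is not part of the model's API.

```python
import math

def cyclic_lr(step, base_lr, max_lr, stepsize):
    """Triangular cyclical learning rate whose amplitude halves after each full cycle."""
    cycle = math.floor(1 + step / (2 * stepsize))
    x = abs(step / stepsize - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale

# Rescale the batch-size-8 schedule to a total batch size of 32, as in the cell above
batch = 32
max_steps = int(200000 * 8 / batch)   # 50000
stepsize = int(20000 * 8 / batch)     # 5000

for step in (100, 5399, 10681, 16058):
    print(step, round(cyclic_lr(step, 1e-05, 5e-04, stepsize), 6))
```

Under these assumptions, the values printed for steps 100, 5399, 10681, and 16058 can be compared against the `lr` column logged at the same iterations during training.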

In [7]:
# Instantiate the model and display the model configuration
nn = ModelPWCNet(mode='train_with_val', options=nn_opts, dataset=ds)
nn.print_config()

Building model...
Instructions for updating:
`normal` is a deprecated alias for `truncated_normal`
... model built.
Configuring training ops...
... training ops configured.
Initializing model with random values for initial training...

... model initialized

Model Configuration:
  verbose                True
  ckpt_dir               ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/
  max_to_keep            50
  x_dtype                <dtype: 'float16'>
  x_shape                [2, 256, 448, 3]
  y_dtype                <dtype: 'float32'>
  y_shape                [256, 448, 2]
  train_mode             train
  adapt_info             None
  sparse_gt_flow         False
  display_step           100
  snapshot_step          1000
  val_step               1000
  val_batch_size         -1
  tb_val_imgs            pyramid
  tb_test_imgs           None
  gpu_devices            ['/device:GPU:0']
  controller             /device:GPU:0
  use_tf_data            True
  use_mixed_precision    True
  loss_sc

## Train the model

In [8]:
# Train the model
nn.train()

Start training from scratch...
2018-09-21 12:58:00 Iter 100 [Train]: loss=221.06, epe=18.03, lr=0.000020, samples/sec=24.7, sec/step=1.294, eta=17:55:52
2018-09-21 13:00:37 Iter 200 [Train]: loss=223.70, epe=18.25, lr=0.000030, samples/sec=29.8, sec/step=1.074, eta=14:51:14
2018-09-21 13:03:15 Iter 300 [Train]: loss=217.99, epe=17.78, lr=0.000039, samples/sec=29.4, sec/step=1.087, eta=15:00:33
2018-09-21 13:06:07 Iter 400 [Train]: loss=220.20, epe=17.93, lr=0.000049, samples/sec=26.2, sec/step=1.221, eta=16:49:29
2018-09-21 13:09:15 Iter 500 [Train]: loss=208.75, epe=16.71, lr=0.000059, samples/sec=22.8, sec/step=1.405, eta=19:19:30
2018-09-21 13:12:44 Iter 600 [Train]: loss=188.44, epe=14.56, lr=0.000069, samples/sec=19.7, sec/step=1.628, eta=22:20:28
2018-09-21 13:16:40 Iter 700 [Train]: loss=191.34, epe=14.71, lr=0.000079, samples/sec=16.5, sec/step=1.937, eta=1 day, 2:31:10
2018-09-21 13:20:59 Iter 800 [Train]: loss=189.87, epe=14.50, lr=0.000088, samples/sec=14.6, sec/step=2.186, 

2018-09-21 16:21:24 Iter 5399 [Train]: loss=127.55, epe=9.47, lr=0.000461, samples/sec=25.2, sec/step=1.268, eta=15:42:38
2018-09-21 16:24:30 Iter 5499 [Train]: loss=127.83, epe=9.41, lr=0.000451, samples/sec=23.1, sec/step=1.386, eta=17:08:11
2018-09-21 16:27:54 Iter 5599 [Train]: loss=124.06, epe=9.12, lr=0.000441, samples/sec=20.2, sec/step=1.587, eta=19:34:10
2018-09-21 16:31:41 Iter 5699 [Train]: loss=124.14, epe=9.12, lr=0.000431, samples/sec=17.4, sec/step=1.844, eta=22:41:20
2018-09-21 16:35:43 Iter 5798 [Train]: loss=120.70, epe=8.83, lr=0.000422, samples/sec=15.9, sec/step=2.011, eta=1 day, 0:41:15
2018-09-21 16:39:59 Iter 5898 [Train]: loss=124.88, epe=9.18, lr=0.000412, samples/sec=14.8, sec/step=2.157, eta=1 day, 2:25:37
2018-09-21 16:44:20 Iter 5998 [Train]: loss=122.74, epe=8.96, lr=0.000402, samples/sec=14.5, sec/step=2.201, eta=1 day, 2:54:07
2018-09-21 16:45:37 Iter 5998 [Val]: loss=117.94, epe=8.77
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmi

2018-09-21 19:51:30 Iter 10681 [Train]: loss=100.36, epe=6.93, lr=0.000043, samples/sec=20.6, sec/step=1.551, eta=16:55:47
2018-09-21 19:55:00 Iter 10781 [Train]: loss=96.44, epe=6.62, lr=0.000048, samples/sec=19.4, sec/step=1.646, eta=17:55:24
2018-09-21 19:58:54 Iter 10877 [Train]: loss=101.26, epe=7.01, lr=0.000053, samples/sec=16.6, sec/step=1.930, eta=20:57:50
2018-09-21 20:03:14 Iter 10976 [Train]: loss=99.25, epe=6.86, lr=0.000058, samples/sec=14.4, sec/step=2.218, eta=1 day, 0:01:28
2018-09-21 20:04:37 Iter 10976 [Val]: loss=100.78, epe=7.16
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-10976 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-10976
2018-09-21 20:09:21 Iter 11074 [Train]: loss=105.17, epe=7.33, lr=0.000063, samples/sec=14.3, sec/step=2.232, eta=1 day, 0:06:51
2018-09-21 20:13:42 Iter 11173 [Train]: loss=101.41, epe=7.02, lr=0.000067, sample

... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-15958
2018-09-21 23:30:57 Iter 16058 [Train]: loss=101.28, epe=6.93, lr=0.000203, samples/sec=14.5, sec/step=2.212, eta=20:49:51
2018-09-21 23:35:14 Iter 16158 [Train]: loss=100.14, epe=6.90, lr=0.000198, samples/sec=14.7, sec/step=2.172, eta=20:23:39
2018-09-21 23:39:35 Iter 16258 [Train]: loss=101.39, epe=7.00, lr=0.000193, samples/sec=14.5, sec/step=2.214, eta=20:43:17
2018-09-21 23:43:54 Iter 16358 [Train]: loss=98.16, epe=6.69, lr=0.000188, samples/sec=14.6, sec/step=2.199, eta=20:31:12
2018-09-21 23:48:13 Iter 16458 [Train]: loss=100.65, epe=6.93, lr=0.000184, samples/sec=14.5, sec/step=2.207, eta=20:32:19
2018-09-21 23:52:31 Iter 16558 [Train]: loss=101.67, epe=6.99, lr=0.000179, samples/sec=14.6, sec/step=2.185, eta=20:16:06
2018-09-21 23:56:51 Iter 16657 [Train]: loss=102.14, epe=6.99, lr=0.000174, samples/sec=14.5, sec/step=2.208, eta=20:25:28
2018-09-22 00:00:35 Iter 16757 [Train]: loss=98.86, epe=6.7

2018-09-22 03:13:31 Iter 21424 [Train]: loss=nan, epe=nan, lr=0.000045, samples/sec=14.5, sec/step=2.213, eta=17:31:07
2018-09-22 03:17:50 Iter 21524 [Train]: loss=86.88, epe=6.00, lr=0.000047, samples/sec=14.5, sec/step=2.203, eta=17:22:42
2018-09-22 03:22:11 Iter 21624 [Train]: loss=85.76, epe=5.91, lr=0.000050, samples/sec=14.4, sec/step=2.220, eta=17:27:07
2018-09-22 03:26:32 Iter 21722 [Train]: loss=85.53, epe=5.91, lr=0.000052, samples/sec=14.4, sec/step=2.221, eta=17:24:02
2018-09-22 03:30:51 Iter 21822 [Train]: loss=87.90, epe=6.08, lr=0.000055, samples/sec=14.6, sec/step=2.191, eta=17:06:08
2018-09-22 03:33:58 Iter 21922 [Train]: loss=85.98, epe=5.96, lr=0.000057, samples/sec=22.9, sec/step=1.398, eta=10:52:27
2018-09-22 03:35:22 Iter 21922 [Val]: loss=84.09, epe=5.86
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-21922 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp

2018-09-22 07:03:47 Iter 26881 [Train]: loss=83.10, epe=5.75, lr=0.000086, samples/sec=14.6, sec/step=2.199, eta=14:03:01
2018-09-22 07:05:08 Iter 26881 [Val]: loss=83.66, epe=5.88
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-26881 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-26881
2018-09-22 07:09:33 Iter 26980 [Train]: loss=86.73, epe=6.01, lr=0.000084, samples/sec=15.8, sec/step=2.019, eta=12:50:39
2018-09-22 07:12:42 Iter 27078 [Train]: loss=84.43, epe=5.84, lr=0.000082, samples/sec=22.7, sec/step=1.408, eta=8:54:53
2018-09-22 07:16:01 Iter 27178 [Train]: loss=82.93, epe=5.73, lr=0.000079, samples/sec=21.2, sec/step=1.512, eta=9:31:59
2018-09-22 07:19:41 Iter 27277 [Train]: loss=81.70, epe=5.65, lr=0.000077, samples/sec=18.2, sec/step=1.755, eta=11:01:03
2018-09-22 07:23:37 Iter 27377 [Train]: loss=77.62, epe=5.31, lr=0.000074, samples/sec=16.7, sec/ste

2018-09-22 10:44:04 Iter 32004 [Train]: loss=80.47, epe=5.53, lr=0.000035, samples/sec=14.5, sec/step=2.206, eta=10:54:31
2018-09-22 10:47:39 Iter 32104 [Train]: loss=75.55, epe=5.18, lr=0.000036, samples/sec=18.8, sec/step=1.701, eta=8:21:47
2018-09-22 10:50:43 Iter 32202 [Train]: loss=78.82, epe=5.43, lr=0.000037, samples/sec=23.9, sec/step=1.342, eta=6:33:32
2018-09-22 10:54:02 Iter 32301 [Train]: loss=80.18, epe=5.53, lr=0.000038, samples/sec=21.1, sec/step=1.514, eta=7:21:42
2018-09-22 10:57:34 Iter 32399 [Train]: loss=77.99, epe=5.36, lr=0.000039, samples/sec=19.2, sec/step=1.666, eta=8:03:11
2018-09-22 11:01:19 Iter 32498 [Train]: loss=78.36, epe=5.38, lr=0.000041, samples/sec=17.7, sec/step=1.809, eta=8:41:34
2018-09-22 11:05:29 Iter 32598 [Train]: loss=79.81, epe=5.49, lr=0.000042, samples/sec=15.3, sec/step=2.091, eta=9:59:32
2018-09-22 11:09:49 Iter 32698 [Train]: loss=74.48, epe=5.09, lr=0.000043, samples/sec=14.6, sec/step=2.196, eta=10:25:47
2018-09-22 11:14:10 Iter 32798

2018-09-22 14:29:25 Iter 37440 [Train]: loss=78.91, epe=5.43, lr=0.000041, samples/sec=20.2, sec/step=1.586, eta=5:25:09
2018-09-22 14:33:11 Iter 37538 [Train]: loss=74.09, epe=5.05, lr=0.000040, samples/sec=17.5, sec/step=1.828, eta=6:11:41
2018-09-22 14:37:15 Iter 37633 [Train]: loss=75.33, epe=5.16, lr=0.000039, samples/sec=15.9, sec/step=2.017, eta=6:46:51
2018-09-22 14:41:35 Iter 37731 [Train]: loss=73.58, epe=5.02, lr=0.000038, samples/sec=14.5, sec/step=2.212, eta=7:22:28
2018-09-22 14:42:48 Iter 37731 [Val]: loss=72.18, epe=4.99
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-37731 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-37731
2018-09-22 14:47:29 Iter 37828 [Train]: loss=76.38, epe=5.24, lr=0.000037, samples/sec=14.5, sec/step=2.203, eta=7:16:52
2018-09-22 14:51:50 Iter 37927 [Train]: loss=75.10, epe=5.13, lr=0.000035, samples/sec=14.4, sec/step=2

... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-42665
2018-09-22 18:15:53 Iter 42764 [Train]: loss=77.65, epe=5.34, lr=0.000027, samples/sec=15.1, sec/step=2.114, eta=4:03:04
2018-09-22 18:20:14 Iter 42863 [Train]: loss=74.64, epe=5.13, lr=0.000028, samples/sec=14.4, sec/step=2.226, eta=4:12:16
2018-09-22 18:24:31 Iter 42960 [Train]: loss=76.68, epe=5.28, lr=0.000028, samples/sec=14.8, sec/step=2.163, eta=4:01:32
2018-09-22 18:28:53 Iter 43059 [Train]: loss=77.51, epe=5.33, lr=0.000029, samples/sec=14.3, sec/step=2.230, eta=4:05:20
2018-09-22 18:33:12 Iter 43159 [Train]: loss=72.31, epe=4.94, lr=0.000029, samples/sec=14.6, sec/step=2.187, eta=3:56:53
2018-09-22 18:37:31 Iter 43258 [Train]: loss=74.74, epe=5.12, lr=0.000030, samples/sec=14.6, sec/step=2.189, eta=3:53:27
2018-09-22 18:41:52 Iter 43355 [Train]: loss=73.72, epe=5.04, lr=0.000031, samples/sec=14.3, sec/step=2.236, eta=3:54:47
2018-09-22 18:46:12 Iter 43454 [Train]: loss=72.30, epe=4.91, lr=0.00003

2018-09-22 21:59:49 Iter 48079 [Train]: loss=73.88, epe=5.05, lr=0.000022, samples/sec=14.7, sec/step=2.184, eta=0:54:36
2018-09-22 22:04:07 Iter 48176 [Train]: loss=75.65, epe=5.18, lr=0.000021, samples/sec=14.6, sec/step=2.192, eta=0:51:09
2018-09-22 22:08:26 Iter 48276 [Train]: loss=75.17, epe=5.15, lr=0.000021, samples/sec=14.5, sec/step=2.200, eta=0:47:40
2018-09-22 22:12:49 Iter 48374 [Train]: loss=71.49, epe=4.88, lr=0.000020, samples/sec=14.3, sec/step=2.245, eta=0:44:54
2018-09-22 22:17:11 Iter 48473 [Train]: loss=72.62, epe=4.97, lr=0.000019, samples/sec=14.4, sec/step=2.217, eta=0:40:39
2018-09-22 22:21:30 Iter 48571 [Train]: loss=71.60, epe=4.89, lr=0.000019, samples/sec=14.5, sec/step=2.201, eta=0:36:41
2018-09-22 22:22:38 Iter 48571 [Val]: loss=72.25, epe=4.99
Saving model...
INFO:tensorflow:./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/pwcnet.ckpt-48571 is not in all_model_checkpoint_paths. Manually adding it.
... model saved in ./pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/

## Training log

Here are the training curves for the run above:

![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/loss.png)
![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/epe.png)
![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/lr.png)
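
The `epe` curve tracks the end-point error. Assuming the standard definition (the mean Euclidean distance between predicted and ground-truth flow vectors), a minimal sketch of the metric could look like this. It operates on plain Python lists of `(u, v)` pairs for illustration only; the function name `average_epe` is hypothetical and this is not the code used by the model.

```python
import math

def average_epe(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth (u, v) flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        total += math.hypot(pu - gu, pv - gv)
    return total / len(pred_flow)

# Two pixels: one is off by the vector (3, 4), the other is predicted exactly
pred = [(3.0, 4.0), (1.0, 1.0)]
gt = [(0.0, 0.0), (1.0, 1.0)]
epe = average_epe(pred, gt)
print(epe)
```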

Here are the predictions issued by the model for a few validation samples:

![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/val1.png)
![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/val2.png)
![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/val3.png)
![](img/pwcnet-sm-6-2-cyclic-chairsthingsmix-fp16/val4.png)