# MNIST distributed training using Horovod

<font color='red'> <h3>To run the code in this notebook, go to [launch_horovod.ipynb](launch_horovod.ipynb). Running this notebook directly will not enable Horovod and will result in errors!</h3></font>


After reading up on the [basics of Hops notebooks and local model training on Hops](../TensorFlow/cifar10_on_hops.ipynb), it's time to scale out our training. We already discussed [multi-GPU training with TensorFlow](../TensorFlow/multigpu/Multi-gpu_training_cifar.ipynb), but the key difference here is that we use the [Horovod](https://github.com/uber/horovod) library to parallelize the training. The motivation is that Horovod scales much better than traditional methods, because it relies on MPI and ring-allreduce for communication between workers.
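
To make the ring-allreduce idea more concrete, here is a minimal, dependency-free simulation of it. This is only a sketch of the general algorithm, not Horovod's actual implementation: the function name `ring_allreduce` is hypothetical, the "workers" are plain Python lists, and the vector length is assumed to be divisible by the number of workers. Each worker only ever talks to its right-hand neighbour, first accumulating partial sums chunk by chunk (scatter-reduce) and then circulating the finished chunks (allgather).

```python
def ring_allreduce(worker_data):
    """Simulate a ring allreduce (sum) over one equally sized list per worker."""
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    # Every worker holds its own vector, split into n chunks.
    bufs = [[list(d[c * chunk:(c + 1) * chunk]) for c in range(n)]
            for d in worker_data]

    # Phase 1 (scatter-reduce): after n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot the payloads first so all sends in a step are simultaneous.
        sends = [(i, (i - step) % n, list(bufs[i][(i - step) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]

    # Phase 2 (allgather): pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(bufs[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            bufs[dst][c] = payload

    # Re-join the chunks into one flat vector per worker.
    return [[x for c in b for x in c] for b in bufs]


gradients = [
    [1, 2, 3, 4, 5, 6],              # worker 0
    [10, 20, 30, 40, 50, 60],        # worker 1
    [100, 200, 300, 400, 500, 600],  # worker 2
]
reduced = ring_allreduce(gradients)
```

Every worker ends up with the same element-wise sum, and dividing by the number of workers would give the average. In each step a worker sends only one chunk (1/n of the vector) to a single neighbour, which is why the per-worker traffic stays roughly constant as more workers are added.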

## MNIST Dataset Overview

This example uses the MNIST dataset of handwritten digits. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 1. For simplicity, each image has been flattened and converted to a 1-D numpy array of 784 features (28*28).
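
The flattening step can be illustrated with a small sketch. To keep it free of dependencies, plain Python lists stand in for numpy arrays, the helper names are hypothetical, and the raw pixels are assumed to be integers in the 0..255 range before scaling. Pixel `(row, col)` of the 28x28 image ends up at index `row * 28 + col` of the 784-element feature vector.

```python
HEIGHT, WIDTH = 28, 28


def flatten_image(rows):
    """Flatten a HEIGHT x WIDTH grid (list of rows) into one row-major list,
    scaling raw 0..255 pixel values to the 0..1 range."""
    return [pixel / 255.0 for row in rows for pixel in row]


def unflatten_image(flat):
    """Inverse of the flattening step: rebuild HEIGHT rows of WIDTH values."""
    return [flat[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]


# A synthetic image: black everywhere except one white pixel at row 3, column 5.
image = [[0] * WIDTH for _ in range(HEIGHT)]
image[3][5] = 255

features = flatten_image(image)       # 784 values in [0, 1]
restored = unflatten_image(features)  # back to 28 rows of 28 values
```

The model below undoes the flattening in the same way, by reshaping each 784-element feature vector back into a 28x28 image with one color channel.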

![MNIST Dataset](http://neuralnetworksanddeeplearning.com/images/mnist_100_digits.png)

More info: http://yann.lecun.com/exdb/mnist/

## Importing Libraries

In [None]:
import tensorflow as tf
import horovod.tensorflow as hvd
layers = tf.contrib.layers
learn = tf.contrib.learn
from hops import tensorboard
from hops import hdfs

tf.logging.set_verbosity(tf.logging.DEBUG)

## Building the model

In [None]:
def conv_model(feature, target, mode):
    """2-layer convolution model."""
    # Convert the target to a one-hot tensor of shape (batch_size, 10),
    # with an on-value of 1 in each one-hot vector of length 10.
    target = tf.one_hot(tf.cast(target, tf.int32), 10, 1, 0)

    # Reshape feature to a 4-D tensor, with the 2nd and 3rd dimensions being
    # image width and height and the final dimension being the number of color channels.
    feature = tf.reshape(feature, [-1, 28, 28, 1])
    tf.summary.image("image", feature)

    # First conv layer will compute 32 features for each 5x5 patch
    with tf.variable_scope('conv_layer1'):
        h_conv1 = layers.conv2d(
            feature, 32, kernel_size=[5, 5], activation_fn=tf.nn.relu)
        h_pool1 = tf.nn.max_pool(
            h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # Second conv layer will compute 64 features for each 5x5 patch.
    with tf.variable_scope('conv_layer2'):
        h_conv2 = layers.conv2d(
            h_pool1, 64, kernel_size=[5, 5], activation_fn=tf.nn.relu)
        h_pool2 = tf.nn.max_pool(
            h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        # reshape tensor into a batch of vectors
        h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])

    # Densely connected layer with 1024 neurons.
    h_fc1 = layers.dropout(
        layers.fully_connected(
            h_pool2_flat, 1024, activation_fn=tf.nn.relu),
        keep_prob=0.5,
        is_training=mode == tf.contrib.learn.ModeKeys.TRAIN)

    # Compute logits (1 per class) and compute loss.
    logits = layers.fully_connected(h_fc1, 10, activation_fn=None)
    loss = tf.losses.softmax_cross_entropy(target, logits)
    tf.summary.scalar('loss', loss)

    return tf.argmax(logits, 1), loss
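
The `7 * 7 * 64` in the reshape above follows from how the spatial size shrinks through the network. The sketch below reproduces that arithmetic. It assumes that both convolutions use stride 1 with `'SAME'` padding (the defaults of `tf.contrib.layers.conv2d`), so only the two stride-2 pooling layers halve the image. The helper name is hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a conv/pool layer with padding='SAME'."""
    return math.ceil(size / stride)


size = 28                             # input image is 28x28
size = same_padding_output(size, 1)   # conv_layer1, stride 1 -> 28
size = same_padding_output(size, 2)   # 2x2 max pool, stride 2 -> 14
size = same_padding_output(size, 1)   # conv_layer2, stride 1 -> 14
size = same_padding_output(size, 2)   # 2x2 max pool, stride 2 -> 7

channels = 64                         # feature maps from the second conv layer
flat_features = size * size * channels
```

With these assumptions `flat_features` is 3136, which is the length of each vector fed into the 1024-neuron fully connected layer.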

## Integration with Horovod

Until now, everything has been fairly standard deep learning and Hops usage. To run our code with Horovod, however, we need to modify just a few lines. These few lines enable us to scale our model to potentially hundreds of GPUs, provided that the data input pipeline is efficient and the GPUs are connected to each other by a fast interconnect.



### 7 steps to scale out with Horovod:

#### 1. Initialization using `hvd.init()`
#### 2. Making our optimizer Horovod-friendly:
```python
opt = tf.train.RMSPropOptimizer(lr)

# Wrap with Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
```
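
Conceptually, the wrapped optimizer lets each worker compute gradients on its own batch, averages those gradients across all workers with an allreduce, and only then applies the update. The dependency-free sketch below illustrates why this is equivalent to training on one larger batch. It uses a hypothetical one-parameter model `y = w * x` with a mean-squared-error loss, and assumes that all workers hold equally sized batches.

```python
import math


def mse_gradient(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def averaged_gradient(w, shards):
    """What an averaging allreduce would hand to every worker."""
    grads = [mse_gradient(w, xs, ys) for xs, ys in shards]
    return sum(grads) / len(grads)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Two workers, each seeing half of the examples.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

full_batch_grad = mse_gradient(w, xs, ys)
distributed_grad = averaged_gradient(w, shards)
```

Both gradients are identical here, so every worker applies the same update that a single process would have computed on the combined batch.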

####  3. Mind the rank:
Since we now potentially have tens or hundreds of GPUs running the same code, we need to take care with checkpointing and writing summaries. More specifically, these actions should be performed by only one worker, so we check the rank of the process before each of them. We will use this concept in the broadcasting step.
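
Horovod exposes two different ranks: `hvd.rank()` is unique across the whole job, while `hvd.local_rank()` is only unique within one machine. The sketch below simulates a typical layout with one process per GPU. The helper `assign_ranks` is hypothetical, and the real assignment depends on how the job is launched.

```python
def assign_ranks(gpus_per_host):
    """Return a (host, rank, local_rank) tuple per process, one process per GPU."""
    assignments = []
    rank = 0
    for host, gpu_count in enumerate(gpus_per_host):
        for local_rank in range(gpu_count):
            assignments.append((host, rank, local_rank))
            rank += 1
    return assignments


# Two machines with two GPUs each.
layout = assign_ranks([2, 2])

# Processes that would pass a `rank == 0` check vs. a `local_rank == 0` check.
global_leaders = [p for p in layout if p[1] == 0]
local_leaders = [p for p in layout if p[2] == 0]
```

In this layout exactly one process has rank 0, whereas one process per machine has local rank 0. A check on the local rank therefore selects one writer per machine, while a check on the global rank selects a single writer for the whole job.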

####  4. Providing `Config`: 
Pin each process to a single GPU, selected by its local rank (one GPU per process):

```python
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
```
####  5. Broadcasting: 
Broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

```python
if hvd.local_rank()==0:
    hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=5000),
         tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
                                    every_n_iter=10),
         tf.train.SummarySaverHook(save_steps=10,
                                   output_dir=tensorboard.logdir(),
                                   summary_op=tf.summary.merge_all()),
        tf.train.CheckpointSaverHook(checkpoint_dir=tensorboard.logdir(), save_steps=50)
         ]
else:
    hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=5000)
         ]
```
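
Why the broadcast matters can be shown with a small simulation. This is a sketch in plain Python with hypothetical helper names, not Horovod code: each worker initializes its weights from a different random seed, and all workers then apply the same averaged gradient. Without a broadcast the workers never agree on the weights, because identical updates preserve the initial differences.

```python
import random


def init_weights(seed, size=3):
    """Random initial weights, as each worker would draw them on its own."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(size)]


def broadcast(worker_weights, root=0):
    """Copy the root worker's weights to every worker."""
    return [list(worker_weights[root]) for _ in worker_weights]


def apply_update(worker_weights, gradient, lr=0.1):
    """Every worker applies the same (already averaged) gradient."""
    return [[w - lr * g for w, g in zip(ws, gradient)] for ws in worker_weights]


workers = [init_weights(seed) for seed in range(3)]
gradient = [0.5, -0.25, 1.0]

without_broadcast = apply_update(workers, gradient)
with_broadcast = apply_update(broadcast(workers, root=0), gradient)

# Number of distinct weight vectors across workers in each scenario.
distinct_without = len({tuple(w) for w in without_broadcast})
distinct_with = len({tuple(w) for w in with_broadcast})
```

After the broadcast all workers hold a single shared weight vector and stay in sync, whereas without it each worker keeps training its own diverging copy of the model.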

#### 6. Provide `hooks` and `config` to the Session:
Pass the `hooks` list, which includes the broadcast operation we defined earlier, together with the `config` to the session:

```python
with tf.train.SingularMonitoredSession(hooks=hooks, config=config) as mon_sess:
    ...  # the training loop goes here
```

Since we are using `SingularMonitoredSession`, the initialization of variables and the broadcasting are handled implicitly: the broadcast operation is set up by the `BroadcastGlobalVariablesHook` inside the `hooks` list.

#### 7. Launch it:

Congrats, we have finished adapting our code to use Horovod. To actually start training, however, we need another notebook. Refer to the [launch_horovod.ipynb](launch_horovod.ipynb) notebook for further instructions on how to run and monitor your training process.

<font color='red'> <h4>To launch this notebook, go to [launch_horovod.ipynb](launch_horovod.ipynb). Running this notebook directly will not enable Horovod and will result in errors!</h4></font>


In [None]:
def main(_):
    # Initialize Horovod.
    hvd.init()

    # Download and load MNIST dataset.
    mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())

    # Build model...
    with tf.name_scope('input'):
        image = tf.placeholder(tf.float32, [None, 784], name='image')
        label = tf.placeholder(tf.float32, [None], name='label')
    predict, loss = conv_model(image, label, tf.contrib.learn.ModeKeys.TRAIN)

    opt = tf.train.RMSPropOptimizer(0.01)

    # Add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt)

    global_step = tf.contrib.framework.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    print(tensorboard.logdir())

    # BroadcastGlobalVariablesHook broadcasts variables from rank 0 to all other
    # processes during initialization.
    if hvd.local_rank()==0:
        hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.train.StopAtStepHook(last_step=5000),
             tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
                                        every_n_iter=10),
             tf.train.SummarySaverHook(save_steps=10,
                                       output_dir=tensorboard.logdir(),
                                       summary_op=tf.summary.merge_all()),
            tf.train.CheckpointSaverHook(checkpoint_dir=tensorboard.logdir(), save_steps=50)
             ]
    else:
        hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.train.StopAtStepHook(last_step=5000)
             ]

    # Pin GPU to be used to process local rank (one GPU per process)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # SingularMonitoredSession takes care of session initialization, running
    # the hooks (including checkpoint and summary saving), and closing when
    # done or when an error occurs.
    with tf.train.SingularMonitoredSession(hooks=hooks, config=config) as mon_sess:
        while not mon_sess.should_stop():
            # Run a training step synchronously.
            image_, label_ = mnist.train.next_batch(100)
            mon_sess.run(train_op, feed_dict={image: image_, label: label_})

## Don't forget:
To call the main function if you wrapped your model logic in `main()`:

In [None]:
if __name__ == "__main__":
    tf.app.run()