# Distributed TensorFlow training using Horovod via OCI Jobs

## Contents

1. [Background](#Background)
1. [Prerequisites](#Prerequisites)
1. [Train](#Train)
1. [Setup IAM](#Setup%20IAM)
1. [Build](#Build)

---

## Background

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and MXNet. This notebook shows how to use Horovod with TensorFlow in OCI Data Science Jobs.

For more information about using Horovod with TensorFlow, see [Horovod-Tensorflow](https://horovod.readthedocs.io/en/stable/tensorflow.html).

---


## Prerequisites

### 1. Install ads package >= 2.5.9

In [None]:
!pip3 install oracle-ads
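
To confirm that the installed package meets the 2.5.9 minimum, a check along the following lines may help. This is a minimal sketch: `meets_minimum` is a hypothetical helper, and it compares only the first three numeric components of the version string, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata


def meets_minimum(installed, minimum="2.5.9"):
    """Compare the first three numeric components of two version strings."""
    def parse(version):
        return tuple(int(part) for part in re.findall(r"\d+", version)[:3])
    return parse(installed) >= parse(minimum)


try:
    # "oracle-ads" is the distribution name used in the pip command above.
    installed_version = metadata.version("oracle-ads")
    print(installed_version, "OK" if meets_minimum(installed_version) else "too old")
except metadata.PackageNotFoundError:
    print("oracle-ads is not installed")
```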

### 2. Install Docker

https://docs.docker.com/get-docker

## Train

### Training script
The sample training script `sample.py` below is run by each worker in the distributed training setup.

The script uses the Horovod framework for distributed training. Horovod-related lines are marked with comments that start with `Horovod:`, for example `Horovod: add Horovod DistributedOptimizer` and `Horovod: initialize optimizer state`.

In [None]:
%%writefile sample.py

# Copyright 2021 Uber Technologies, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import argparse
import tensorflow as tf
import horovod.tensorflow.keras as hvd
from distutils.version import LooseVersion

from ocifs import OCIFileSystem
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

parser = argparse.ArgumentParser(description='Tensorflow 2.0 Keras MNIST Example')

parser.add_argument('--use-mixed-precision', action='store_true', default=False,
                    help='use mixed precision for training')

args = parser.parse_args()

if args.use_mixed_precision:
    print(f"using mixed precision {args.use_mixed_precision}")
    if LooseVersion(tf.__version__) >= LooseVersion('2.4.0'):
        from tensorflow.keras import mixed_precision
        mixed_precision.set_global_policy('mixed_float16')
    else:
        policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
        tf.keras.mixed_precision.experimental.set_policy(policy)

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# (mnist_images, mnist_labels), _ = \
#     tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

import numpy as np
mnist_local = "/etc/datascience/horovod/examples/tf_data/mnist.npz"

def load_data():
    print("using pre-fetched dataset")
    with np.load(mnist_local, allow_pickle=True) as f:  # pylint: disable=unexpected-keyword-arg
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']
        return (x_train, y_train), (x_test, y_test)

(mnist_images, mnist_labels), _ = load_data() if os.path.exists(mnist_local) else tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())


dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
             tf.cast(mnist_labels, tf.int64))
)
dataset = dataset.repeat().shuffle(10000).batch(128)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Horovod: adjust learning rate based on number of GPUs.
scaled_lr = 0.001 * hvd.size()
opt = tf.optimizers.Adam(scaled_lr)

# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(
    opt, backward_passes_per_step=1, average_aggregated_gradients=True)

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt,
                    metrics=['accuracy'],
                    experimental_run_tf_function=False)

# Horovod: initialize optimizer state so we can synchronize across workers
# Keras has empty optimizer variables() for TF2:
# https://sourcegraph.com/github.com/tensorflow/tensorflow@v2.4.1/-/blob/tensorflow/python/keras/optimizer_v2/optimizer_v2.py#L351:10
model.fit(dataset, steps_per_epoch=1, epochs=1, callbacks=None)

state = hvd.elastic.KerasState(model, batch=0, epoch=0)

def on_state_reset():
    tf.keras.backend.set_value(state.model.optimizer.lr,  0.001 * hvd.size())
    # Re-initialize, to join with possible new ranks
    state.model.fit(dataset, steps_per_epoch=1, epochs=1, callbacks=None)

state.register_reset_callbacks([on_state_reset])


callbacks = [
    hvd.callbacks.MetricAverageCallback(),
    hvd.elastic.UpdateEpochStateCallback(state),
    hvd.elastic.UpdateBatchStateCallback(state),
    hvd.elastic.CommitStateCallback(state),
]

# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
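# Save the artifacts in the directory named by the ARTIFACTS_DIR environment
# variable, which must be set (the job YAML sets it under runtime.spec.env).
artifacts_dir = os.environ.get("ARTIFACTS_DIR")
tb_logs_path = os.path.join(artifacts_dir, "logs")
check_point_path = os.path.join(artifacts_dir, "ckpts", 'checkpoint-{epoch}.h5')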
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint(check_point_path))
    callbacks.append(tf.keras.callbacks.TensorBoard(tb_logs_path))

# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
@hvd.elastic.run
def train(state):
    state.model.fit(dataset, steps_per_epoch=500 // hvd.size(),
                    epochs=2-state.epoch, callbacks=callbacks,
                    verbose=1)

train(state)
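
The script scales the learning rate up and the steps per epoch down as the number of Horovod processes grows (`0.001 * hvd.size()` and `500 // hvd.size()`). The sketch below reproduces that arithmetic in plain Python so the effect of a given worker count can be inspected without launching a job. `scaled_hyperparameters` is a hypothetical helper, not part of Horovod, and `num_workers` stands in for `hvd.size()`.

```python
def scaled_hyperparameters(num_workers, base_lr=0.001, total_steps=500):
    """Mirror the learning-rate and step arithmetic used in sample.py."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return {
        # sample.py: scaled_lr = 0.001 * hvd.size()
        "learning_rate": base_lr * num_workers,
        # sample.py: steps_per_epoch = 500 // hvd.size()
        "steps_per_epoch": total_steps // num_workers,
    }


for n in (1, 2, 4, 8):
    print(n, scaled_hyperparameters(n))
```

Because the division is integer division, worker counts that do not divide 500 evenly drop a few steps per epoch.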


## Setup IAM

### Create the Dynamic Group
```
ALL {resource.type = 'datasciencejobrun', resource.compartment.id = '<COMPARTMENT_OCID>'}
```

### Create the following Policy
```
Allow dynamic-group <DYNAMIC_GROUP_NAME> to use log-content in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to use log-groups in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to inspect repos in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to inspect vcns in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to manage objects in compartment <COMPARTMENT_NAME> where any {target.bucket.name='<BUCKET_NAME>'}
Allow dynamic-group <DYNAMIC_GROUP_NAME> to manage buckets in compartment <COMPARTMENT_NAME> where any {target.bucket.name='<BUCKET_NAME>'}
```
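
Filling in the placeholders by hand is error-prone, so one option is to render the statements from a template. The following is a minimal sketch: `render_policies` and the example names passed to it are hypothetical, and the bucket name is assumed to be single-quoted in the condition.

```python
# Templates mirror the policy statements listed above; literal braces are doubled
# so that str.format only substitutes the named fields.
POLICY_TEMPLATES = [
    "Allow dynamic-group {group} to use log-content in compartment {compartment}",
    "Allow dynamic-group {group} to use log-groups in compartment {compartment}",
    "Allow dynamic-group {group} to inspect repos in compartment {compartment}",
    "Allow dynamic-group {group} to inspect vcns in compartment {compartment}",
    "Allow dynamic-group {group} to manage objects in compartment {compartment} "
    "where any {{target.bucket.name='{bucket}'}}",
    "Allow dynamic-group {group} to manage buckets in compartment {compartment} "
    "where any {{target.bucket.name='{bucket}'}}",
]


def render_policies(group, compartment, bucket):
    """Return the policy statements with all placeholders substituted."""
    return [
        template.format(group=group, compartment=compartment, bucket=bucket)
        for template in POLICY_TEMPLATES
    ]


# Example values are placeholders for illustration only.
for statement in render_policies("hvd-jobs", "ml-compartment", "hvd-bucket"):
    print(statement)
```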

## Build

### Initialize a distributed-training folder
By this point you should have a training file (or files), such as `sample.py` from the example above. Now run the command below to initialize the distributed-training folder:

In [None]:
!ads opctl distributed-training init --framework horovod --version v1

### Build Docker image

The sample code is assumed to be in the current working directory.

In [None]:
!docker build -f oci_dist_training_artifacts/horovod/docker/tensorflow.cpu.Dockerfile -t <IMAGE_NAME>:<TAG> .

### Push the Docker Image to your Tenancy OCIR

#### Steps
1. Follow the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm) to set up the container registry.
2. Create a repository in OCIR to push the image to.
3. Tag the local Docker image that needs to be pushed: `docker tag "image-identifier" "target-tag"`
4. Push the Docker image from the client machine to Container Registry: `docker push "target-tag"`

##### Tag Docker image

In [None]:
!docker tag hvdjob-cpu-tf-1.0 <REGION_CODE>.ocir.io/<TENANCY_NAME>/horovod:hvdjob-cpu-tf-1.0

##### Push Docker Image

In [None]:
!docker push <REGION_CODE>.ocir.io/<TENANCY_NAME>/horovod:hvdjob-cpu-tf-1.0
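
The tag and push commands must use exactly the same target name, so it can help to build that name once and reuse it. The sketch below assumes the `<REGION_CODE>.ocir.io/<TENANCY_NAME>/<REPOSITORY_NAME>:<IMAGE_TAG>` layout used in this notebook. `ocir_image_uri` is a hypothetical helper, and `iad` and `mytenancy` are example values only.

```python
def ocir_image_uri(region_code, tenancy_name, repository, tag):
    """Build the full OCIR image name from its parts."""
    return f"{region_code}.ocir.io/{tenancy_name}/{repository}:{tag}"


local_image = "hvdjob-cpu-tf-1.0"
target = ocir_image_uri("iad", "mytenancy", "horovod", "hvdjob-cpu-tf-1.0")

# Print the commands rather than running them, so they can be reviewed first.
print(f"docker tag {local_image} {target}")
print(f"docker push {target}")
```

The same string can then be used for the `image` field in the workload YAML.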

### Define your workload yaml:

The yaml file is a declarative way to express the workload.
Edit the `oci_dist_training_artifacts/horovod/jobrun_config/jobrun_config_hvd_tf.yaml` file to specify the run configuration.

The following variables are tenancy-specific and need to be modified:

| Variable | Description |
| :-------- | :----------- |
|compartmentId|OCID of the compartment where Data Science projects are created|
|projectId|OCID of the project created in Data Science service|
|subnetId|OCID of the subnet attached to your Job|
|logGroupId|OCID of the log group for JobRun logs|
|image|Image from OCIR to be used for JobRuns|
|workDir|URL to the working directory for opctl|
|WORKSPACE|Workspace with the working directory to be used|
|entryPoint|The script to be executed when launching the container|

```yaml
kind: distributed
apiVersion: v1.0
spec:
  infrastructure: # This section maps to Job definition. Does not include environment variables
    kind: infrastructure
    type: dataScienceJob
    apiVersion: v1.0
    spec:
      projectId: <PROJECT_OCID>
      compartmentId: <COMPARTMENT_OCID>
      displayName: <DISPLAY_NAME>
      logGroupId: <LOG_GROUP_OCID>
      subnetId: <SUBNET_OCID>
      shapeName: <COMPUTE_SHAPE>
      blockStorageSize: <SIZE_IN_GB>
      blockStorageSizeInGBs: <SIZE_IN_GB>
  cluster:
    kind: HOROVOD
    apiVersion: v1.0
    spec:
      image: "<REGION_CODE>.ocir.io/<TENANCY_NAME>/<REPOSITORY_NAME>:<IMAGE_TAG>"
      workDir:  "oci://<BUCKET_NAME>@<NAMESPACE>/"
      name: "<DISPLAY_NAME>"
      config:
        env:
          - name: MIN_NP
            value: 2
          - name: MAX_NP
            value: 8
          - name: SLOTS
            value: 2
          - name: WORKER_PORT
            value: 12345
          - name: GLOO_TIMEOUT_SECONDS
            value: 90
          - name: START_TIMEOUT
            value: 700
          - name: ENABLE_TIMELINE
            value: 1
          - name: SYNC_ARTIFACTS
            value: 1
          - name: WORKSPACE
            value: "horovod-ws"
          - name: WORKSPACE_PREFIX
            value: "hvd"
      main:
        name: "scheduler"
        replicas: 1
      worker:
        name: "worker"
        replicas: 2
  runtime:
    kind: python
    apiVersion: v1.0
    spec:
      entryPoint: "/code/sample.py"
      args: ""
      kwargs: ""
      env:
        - name: ARTIFACTS_DIR
          value: "/opt/ml"
```
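
Before submitting the job, it may be worth checking that no `<PLACEHOLDER>` values from the template above were left unreplaced. The following is a minimal sketch using only the standard library. `find_placeholders` is a hypothetical helper, it assumes placeholders are written as upper-case names in angle brackets, and the shape name in the sample text is an illustrative value.

```python
import re

# Matches tokens such as <PROJECT_OCID> or <BUCKET_NAME>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def find_placeholders(yaml_text):
    """Return (line_number, placeholder) pairs for every unreplaced placeholder."""
    found = []
    for line_no, line in enumerate(yaml_text.splitlines(), start=1):
        for match in PLACEHOLDER.findall(line):
            found.append((line_no, match))
    return found


sample = """\
spec:
  infrastructure:
    spec:
      projectId: <PROJECT_OCID>
      shapeName: VM.Standard2.4
  cluster:
    spec:
      workDir: "oci://<BUCKET_NAME>@<NAMESPACE>/"
"""

for line_no, name in find_placeholders(sample):
    print(f"line {line_no}: {name} still needs a value")
```

To check the real file, read its contents with `open(...).read()` and pass the text to `find_placeholders`.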