# Distributed PyTorch training using Horovod via OCI Jobs

## Contents

1. [Background](#Background)
1. [Prerequisites](#Prerequisites)
1. [Training Script](#Training%20Script)
1. [Build](#Build)
1. [Run](#Run)

---

## Background

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and MXNet. This notebook shows how to use Horovod with PyTorch in OCI Data Science Jobs. OCI Data Science currently supports Elastic Horovod workloads with the Gloo backend.

For more information about using Horovod with PyTorch, see [Horovod with PyTorch](https://horovod.readthedocs.io/en/stable/pytorch.html).

---


## Prerequisites

### 1. Install the oracle-ads package, version 2.5.9 or later

In [None]:
!pip3 install "oracle-ads>=2.5.9"

### 2. Install Docker

https://docs.docker.com/get-docker

### 3. Set IAM Policies

The following policies must be in place in the OCI IAM service. They allow OCI Data Science job runs to access the services they need, such as Logging, Object Storage, and VCNs.

#### Create the Dynamic Group
```
ALL {resource.type = 'datasciencejobrun', resource.compartment.id = <COMPARTMENT_OCID>}
```

#### Create policies
```
Allow dynamic-group <DYNAMIC_GROUP_NAME> to use log-content in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to use log-groups in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to inspect repos in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to inspect vcns in compartment <COMPARTMENT_NAME>
Allow dynamic-group <DYNAMIC_GROUP_NAME> to manage objects in compartment <COMPARTMENT_NAME> where any {target.bucket.name='<BUCKET_NAME>'}
Allow dynamic-group <DYNAMIC_GROUP_NAME> to manage buckets in compartment <COMPARTMENT_NAME> where any {target.bucket.name='<BUCKET_NAME>'}
```
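
The policy statements above differ only in the dynamic group, compartment, and bucket names, so they can be generated from a small template. The following is a minimal sketch of that idea. The function name `render_policies` and the example names passed to it are hypothetical and are not part of any OCI tooling.

```python
def render_policies(dynamic_group, compartment, bucket):
    """Fill the policy templates listed above with concrete names."""
    subject = f"Allow dynamic-group {dynamic_group} to"
    scope = f"in compartment {compartment}"
    statements = [
        f"{subject} use log-content {scope}",
        f"{subject} use log-groups {scope}",
        f"{subject} inspect repos {scope}",
        f"{subject} inspect vcns {scope}",
    ]
    # Object Storage access is restricted to a single bucket.
    for resource in ("objects", "buckets"):
        statements.append(
            f"{subject} manage {resource} {scope} "
            f"where any {{target.bucket.name='{bucket}'}}"
        )
    return statements


# Hypothetical names, used only for illustration.
for statement in render_policies("hvd-jobruns", "ml-training", "hvd-artifacts"):
    print(statement)
```

The printed statements can then be pasted into the policy editor in the OCI Console.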


### 4. Create VCN and private subnet

Horovod distributed training requires the nodes to communicate with each other, so network settings must be provisioned. Create a VCN and a private subnet. The OCID of this private subnet must be configured in the workload YAML file (`train.yaml`).

## Training Script

This section creates a Horovod PyTorch training script. This is the training code that executes on the Horovod cluster. The script must conform to the Elastic Horovod APIs.

The following script uses the Horovod framework for distributed training. Horovod-related API calls are marked with comments that start with `Horovod:`. <br> For example, `Horovod: initialize library`, `Horovod: wrap optimizer with DistributedOptimizer`, etc.

In [None]:
%%writefile train.py

# Script adapted from https://github.com/horovod/horovod/blob/master/examples/elastic/pytorch/pytorch_mnist_elastic.py

# ==============================================================================
import argparse
import os
from filelock import FileLock

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import torch.utils.data.distributed
import horovod.torch as hvd
from torch.utils.tensorboard import SummaryWriter

# Training settings
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                    help='input batch size for training (default: 64)')
parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                    help='input batch size for testing (default: 1000)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
                    help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                    help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--seed', type=int, default=42, metavar='S',
                    help='random seed (default: 42)')
parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                    help='how many batches to wait before logging training status')
parser.add_argument('--fp16-allreduce', action='store_true', default=False,
                    help='use fp16 compression during allreduce')
parser.add_argument('--use-adasum', action='store_true', default=False,
                    help='use adasum algorithm to do reduction')
parser.add_argument('--data-dir',
                    help='location of the training dataset in the local filesystem (will be downloaded if needed)')

args = parser.parse_args()
args.cuda = not args.no_cuda and torch.cuda.is_available()

checkpoint_format = 'checkpoint-{epoch}.pth.tar'

# Horovod: initialize library.
hvd.init()
torch.manual_seed(args.seed)

if args.cuda:
    # Horovod: pin GPU to local rank.
    torch.cuda.set_device(hvd.local_rank())
    torch.cuda.manual_seed(args.seed)


# Horovod: limit # of CPU threads to be used per worker.
torch.set_num_threads(1)

kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
data_dir = args.data_dir or './data'
with FileLock(os.path.expanduser("~/.horovod_lock")):
    train_dataset = \
        datasets.MNIST(data_dir, train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ]))
# Horovod: use DistributedSampler to partition the training data.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs)

test_dataset = \
    datasets.MNIST(data_dir, train=False, transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ]))
# Horovod: use DistributedSampler to partition the test data.
test_sampler = torch.utils.data.distributed.DistributedSampler(
    test_dataset, num_replicas=hvd.size(), rank=hvd.rank())
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.test_batch_size,
                                          sampler=test_sampler, **kwargs)


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        # Pass dim explicitly; the implicit-dim form of log_softmax is deprecated.
        return F.log_softmax(x, dim=1)


model = Net()

# By default, Adasum doesn't need scaling up learning rate.
lr_scaler = hvd.size() if not args.use_adasum else 1

if args.cuda:
    # Move model to GPU.
    model.cuda()
    # If using GPU Adasum allreduce, scale learning rate by local_size.
    if args.use_adasum and hvd.nccl_built():
        lr_scaler = hvd.local_size()

# Horovod: scale learning rate by lr_scaler.
optimizer = optim.SGD(model.parameters(), lr=args.lr * lr_scaler,
                      momentum=args.momentum)

# Horovod: (optional) compression algorithm.
compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none


def metric_average(val, name):
    tensor = torch.tensor(val)
    avg_tensor = hvd.allreduce(tensor, name=name)
    return avg_tensor.item()


def create_dir(path):
    # exist_ok avoids an error if the directory is already present.
    os.makedirs(path, exist_ok=True)


# Horovod: average metrics from distributed training.
class Metric(object):
    def __init__(self, name):
        self.name = name
        self.sum = torch.tensor(0.)
        self.n = torch.tensor(0.)

    def update(self, val):
        self.sum += hvd.allreduce(val.detach().cpu(), name=self.name)
        self.n += 1

    @property
    def avg(self):
        return self.sum / self.n

@hvd.elastic.run
def train(state):
    # post synchronization event (worker added, worker removed) init ...

    # OCI__SYNC_DIR must be set; indexing raises a clear KeyError if it is missing.
    artifacts_dir = os.path.join(os.environ["OCI__SYNC_DIR"], "artifacts")
    chkpts_dir = os.path.join(artifacts_dir, "ckpts")
    logs_dir = os.path.join(artifacts_dir, "logs")
    if hvd.rank() == 0:
        print("creating dirs for checkpoints and logs")
        create_dir(chkpts_dir)
        create_dir(logs_dir)

    writer = SummaryWriter(logs_dir) if hvd.rank() == 0 else None

    for state.epoch in range(state.epoch, args.epochs + 1):
        train_loss = Metric('train_loss')
        state.model.train()

        train_sampler.set_epoch(state.epoch)
        steps_remaining = len(train_loader) - state.batch

        for state.batch, (data, target) in enumerate(train_loader):
            if state.batch >= steps_remaining:
                break

            if args.cuda:
                data, target = data.cuda(), target.cuda()
            state.optimizer.zero_grad()
            output = state.model(data)
            loss = F.nll_loss(output, target)
            train_loss.update(loss)
            loss.backward()
            state.optimizer.step()
            if state.batch % args.log_interval == 0:
                # Horovod: use train_sampler to determine the number of examples in
                # this worker's partition.
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    state.epoch, state.batch * len(data), len(train_sampler),
                    100.0 * state.batch / len(train_loader), loss.item()))
            state.commit()
        if writer:
            writer.add_scalar("Loss", train_loss.avg, state.epoch)
        if hvd.rank() == 0:
            chkpt_path = os.path.join(chkpts_dir,checkpoint_format.format(epoch=state.epoch + 1))
            chkpt = {
                'model': state.model.state_dict(),
                'optimizer': state.optimizer.state_dict(),
            }
            torch.save(chkpt, chkpt_path)
        state.batch = 0


def test():
    model.eval()
    test_loss = 0.
    test_accuracy = 0.
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        output = model(data)
        # sum up batch loss
        test_loss += F.nll_loss(output, target, reduction='sum').item()
        # get the index of the max log-probability
        pred = output.data.max(1, keepdim=True)[1]
        test_accuracy += pred.eq(target.data.view_as(pred)).cpu().float().sum()

    # Horovod: use test_sampler to determine the number of examples in
    # this worker's partition.
    test_loss /= len(test_sampler)
    test_accuracy /= len(test_sampler)

    # Horovod: average metric values across workers.
    test_loss = metric_average(test_loss, 'avg_loss')
    test_accuracy = metric_average(test_accuracy, 'avg_accuracy')

    # Horovod: print output only on first rank.
    if hvd.rank() == 0:
        print('\nTest set: Average loss: {:.4f}, Accuracy: {:.2f}%\n'.format(
            test_loss, 100. * test_accuracy))


# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     compression=compression,
                                     op=hvd.Adasum if args.use_adasum else hvd.Average)


# adjust learning rate on reset
def on_state_reset():
    for param_group in optimizer.param_groups:
        param_group['lr'] = args.lr * hvd.size()


state = hvd.elastic.TorchState(model, optimizer, epoch=1, batch=0)
state.register_reset_callbacks([on_state_reset])
train(state)
test()
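
The learning-rate scaling in `train.py` depends on the reduction algorithm and on whether GPUs are in use. The following sketch restates that selection logic with plain integers so it can be checked without a Horovod cluster. The function name `lr_scale_factor` and its parameters are hypothetical stand-ins for the `hvd.size()`, `hvd.local_size()`, and `hvd.nccl_built()` calls used in the script.

```python
def lr_scale_factor(size, local_size, use_adasum=False, cuda=False, nccl_built=False):
    """Mirror the lr_scaler selection in train.py using plain integers."""
    if not use_adasum:
        return size        # plain averaging: scale by the total number of workers
    if cuda and nccl_built:
        return local_size  # GPU Adasum: scale by the number of workers on the node
    return 1               # Adasum otherwise: no scaling


base_lr = 0.01  # default value of --lr in train.py
print(base_lr * lr_scale_factor(size=4, local_size=2))
```

Note that `on_state_reset` in the script always rescales by `hvd.size()`, so the Adasum branches apply only to the initial optimizer setup.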


## Build

### Initialize a distributed-training folder
The next step is to create a distributed-training workspace. Run the following command to fetch the `horovod-pytorch` framework. It creates a directory named `oci_dist_training_artifacts`, which contains the artifacts (Dockerfile, configurations, Gloo code, etc.) needed to build a Horovod job Docker image.

In [None]:
!ads opctl distributed-training init --framework horovod-pytorch --version v1

### Build Docker image

In [None]:
IMAGE_NAME='hvdjob-cpu-pytorch'
IMAGE_TAG=1.0

In [None]:
!docker build -f oci_dist_training_artifacts/horovod/v1/docker/pytorch.cpu.Dockerfile -t $IMAGE_NAME:$IMAGE_TAG .

The training code (`train.py`) is assumed to be in the current working directory. This can be overridden using the `CODE_DIR` build arg.

`docker build --build-arg CODE_DIR=<code_folder> -f oci_dist_training_artifacts/horovod/v1/docker/pytorch.cpu.Dockerfile -t $IMAGE_NAME:$IMAGE_TAG .`
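
When the build is scripted rather than typed into a cell, the same command can be assembled as an argument list. This is a minimal sketch that only uses the `-f`, `-t`, and `--build-arg` options shown above. The helper name `docker_build_command` and the `my_code` folder are hypothetical.

```python
import shlex


def docker_build_command(image_name, image_tag, dockerfile, code_dir=None, context="."):
    """Assemble the docker build invocation as a list of arguments."""
    cmd = ["docker", "build"]
    if code_dir is not None:
        # Only pass CODE_DIR when the training code is not in the current directory.
        cmd += ["--build-arg", f"CODE_DIR={code_dir}"]
    cmd += ["-f", dockerfile, "-t", f"{image_name}:{image_tag}", context]
    return cmd


cmd = docker_build_command(
    "hvdjob-cpu-pytorch",
    "1.0",
    "oci_dist_training_artifacts/horovod/v1/docker/pytorch.cpu.Dockerfile",
    code_dir="my_code",  # hypothetical folder name
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` without any shell quoting concerns.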


### Push the Docker Image to your Tenancy OCIR
Steps
1. Follow the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Registry/Tasks/registrypushingimagesusingthedockercli.htm) to set up the container registry.
2. Create a repository in OCIR to push the image to.
3. Tag the local Docker image that needs to be pushed.
4. Push the Docker image from your machine to OCI Container Registry.

#### Tag Docker image
Replace `<TENANCY_NAMESPACE>` (shown in the tenancy information on the OCI Console) and `<REGION_CODE>` (for example, `iad` or `phx`) with your own values.

In [None]:
!docker tag $IMAGE_NAME:$IMAGE_TAG <REGION_CODE>.ocir.io/<TENANCY_NAMESPACE>/horovod/$IMAGE_NAME:$IMAGE_TAG

# Example: !docker tag $IMAGE_NAME:$IMAGE_TAG iad.ocir.io/ociodscdev/horovod/$IMAGE_NAME:$IMAGE_TAG

#### Push Docker Image

In [None]:
!docker push <REGION_CODE>.ocir.io/<TENANCY_NAMESPACE>/horovod/$IMAGE_NAME:$IMAGE_TAG

#Example: !docker push iad.ocir.io/ociodscdev/horovod/$IMAGE_NAME:$IMAGE_TAG
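
The tag and push commands must use exactly the same image reference, so it can help to build that string once. The sketch below follows the `<REGION_CODE>.ocir.io/<TENANCY_NAMESPACE>/<repo>/<image>:<tag>` layout used in the examples above. The function name `ocir_image_ref` is hypothetical.

```python
def ocir_image_ref(region_code, tenancy_namespace, repo, image_name, image_tag):
    """Compose an OCIR image reference from its parts."""
    parts = (region_code, tenancy_namespace, repo, image_name, image_tag)
    if not all(parts):
        raise ValueError("every component of the image reference must be non-empty")
    # Path components are separated by '/', and only the tag is separated by ':'.
    return f"{region_code}.ocir.io/{tenancy_namespace}/{repo}/{image_name}:{image_tag}"


# Values taken from the example comments above.
print(ocir_image_ref("iad", "ociodscdev", "horovod", "hvdjob-cpu-pytorch", "1.0"))
```

The same string is what the `image` field of the workload YAML expects.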

## Run

### Define your workload yaml:

The YAML file is a declarative way to express the workload.
Create a workload YAML file called `train.yaml` to specify the run configuration.

The workload YAML file has the following format.
<br>

```yaml
kind: distributed
apiVersion: v1.0
spec:
  infrastructure: # This section maps to Job definition. Does not include environment variables
    kind: infrastructure
    type: dataScienceJob
    apiVersion: v1.0
    spec:
      projectId: oci.xxxx.<project_ocid>
      compartmentId: oci.xxxx.<compartment_ocid>
      displayName: HVD-Distributed-PYTORCH
      logGroupId: oci.xxxx.<log_group_ocid>
      logId: oci.xxx.<log_ocid>
      subnetId: oci.xxxx.<subnet-ocid>
      shapeName: VM.Standard2.4
      blockStorageSize: 50
  cluster:
    kind: HOROVOD
    apiVersion: v1.0
    spec:
      image: "iad.ocir.io/<tenancy_id>/<repo_name>/<image_name>:<image_tag>"
      workDir:  "oci://<bucket_name>@<bucket_namespace>/<bucket_prefix>"
      name: "horovod_pytorch"
      config:
        env:
          # MIN_NP, MAX_NP and SLOTS are inferred from the shape. Modify only when needed.
          # - name: MIN_NP
          #   value: 2
          # - name: MAX_NP
          #   value: 4
          # - name: SLOTS
          #   value: 2
          - name: WORKER_PORT
            value: 12345
          - name: START_TIMEOUT #Optional: Defaults to 600.
            value: 600
          - name: ENABLE_TIMELINE # Optional: Disabled by default. Significantly increases training duration if switched on (1).
            value: 0
          - name: SYNC_ARTIFACTS #Mandatory: Switched on by Default.
            value: 1
          - name: WORKSPACE #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket to sync generated artifacts to.
            value: "<bucket_name>"
          - name: WORKSPACE_PREFIX #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket folder to sync generated artifacts to.
            value: "<bucket_prefix>"
          - name: HOROVOD_ARGS # Parameters for cluster tuning.
            value: "--verbose"
      main:
        name: "scheduler"
        replicas: 1 #this will be always 1
      worker:
        name: "worker"
        replicas: 2 #number of workers
  runtime:
    kind: python
    apiVersion: v1.0
    spec:
      entryPoint: "/code/train.py" #location of user's training script in docker image.
      args:  #any arguments that the training script requires.
      env:
```
<br> <br>
The following tenancy-specific variables need to be modified.

| Variable | Description |
| :-------- | :----------- |
|compartmentId|OCID of the compartment where Data Science projects are created|
|projectId|OCID of the project created in Data Science service|
|subnetId|OCID of the subnet attached to your job|
|logGroupId|OCID of the log group for JobRun logs|
|image|Image from OCIR to be used for JobRuns|
|workDir|URL to the working directory for opctl|
|WORKSPACE|Object Storage bucket that generated artifacts are synced to|
|entryPoint|The script to be executed when launching the container|

In [None]:
%%writefile train.yaml

kind: distributed
apiVersion: v1.0
spec:
  infrastructure: # This section maps to Job definition. Does not include environment variables
    kind: infrastructure
    type: dataScienceJob
    apiVersion: v1.0
    spec:
      projectId: oci.xxxx.<project_ocid>
      compartmentId: oci.xxxx.<compartment_ocid>
      displayName: HVD-Distributed-PYTORCH
      logGroupId: oci.xxxx.<log_group_ocid>
      logId: oci.xxx.<log_ocid>
      subnetId: oci.xxxx.<subnet-ocid>
      shapeName: VM.Standard2.4
      blockStorageSize: 50
  cluster:
    kind: HOROVOD
    apiVersion: v1.0
    spec:
      image: "iad.ocir.io/<tenancy_id>/<repo_name>/<image_name>:<image_tag>"
      workDir:  "oci://<bucket_name>@<bucket_namespace>/<bucket_prefix>"
      name: "horovod_pytorch"
      config:
        env:
          # MIN_NP, MAX_NP and SLOTS are inferred from the shape. Modify only when needed.
          # - name: MIN_NP
          #   value: 2
          # - name: MAX_NP
          #   value: 4
          # - name: SLOTS
          #   value: 2
          - name: WORKER_PORT
            value: 12345
          - name: START_TIMEOUT #Optional: Defaults to 600.
            value: 600
          - name: ENABLE_TIMELINE # Optional: Disabled by default. Significantly increases training duration if switched on (1).
            value: 0
          - name: SYNC_ARTIFACTS #Mandatory: Switched on by Default.
            value: 1
          - name: WORKSPACE #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket to sync generated artifacts to.
            value: "<bucket_name>"
          - name: WORKSPACE_PREFIX #Mandatory if SYNC_ARTIFACTS==1: Destination object bucket folder to sync generated artifacts to.
            value: "<bucket_prefix>"
          - name: HOROVOD_ARGS # Parameters for cluster tuning.
            value: "--verbose"
      main:
        name: "scheduler"
        replicas: 1 #this will be always 1
      worker:
        name: "worker"
        replicas: 2 #number of workers
  runtime:
    kind: python
    apiVersion: v1.0
    spec:
      entryPoint: "/code/train.py" #location of user's training script in docker image.
      args:  #any arguments that the training script requires.
      env:
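
Before submitting the workload, it can be useful to confirm that no `<...>` placeholder from the template is left in `train.yaml`. The following is a minimal sketch that scans the text with a regular expression. It assumes placeholders always use angle brackets as in the template above, and it treats everything after a `#` as a comment. The function name `find_placeholders` is hypothetical.

```python
import re

# Matches tokens such as <project_ocid> or <subnet-ocid>.
PLACEHOLDER = re.compile(r"<[A-Za-z_][\w\- ]*>")


def find_placeholders(yaml_text):
    """Return (line_number, placeholder) pairs for every <...> token left in the text."""
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # ignore trailing comments
        for match in PLACEHOLDER.finditer(code):
            hits.append((number, match.group(0)))
    return hits


sample = """\
spec:
  projectId: oci.xxxx.<project_ocid>
  shapeName: VM.Standard2.4
  image: "iad.ocir.io/<tenancy_id>/<repo_name>/<image_name>:<image_tag>"
"""
for number, placeholder in find_placeholders(sample):
    print(f"line {number}: {placeholder} still needs a real value")
```

To check the real file, pass the result of `open("train.yaml").read()` instead of `sample`.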

### Use ads opctl to create the cluster infrastructure and run the workload.

#### Dry run to check the runtime configuration

In [None]:
! ads opctl run -f train.yaml --dry-run

#### Submit the workload

In [None]:
! ads opctl run -f train.yaml

This emits the following information about the created job and its job runs.<br>
`jobId: <job_id>`<br>
`mainJobRunId: <scheduler_job_run_id>`<br>
`workDir: oci://<bucket_name>@<bucket_namespace>/<bucket_prefix>`<br>
`workerJobRunIds:`<br>
`- <worker_1_jobrun_id>`<br>
`- <worker_2_jobrun_id>`<br>
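
If the summary is captured as text, the job run IDs can be extracted and turned into `ads jobs watch` commands. This is a minimal sketch that assumes the output has exactly the flat `key: value` and `- item` layout shown above. The function name `parse_run_summary` and the sample IDs are hypothetical.

```python
def parse_run_summary(text):
    """Parse the flat 'key: value' / '- item' summary into a dict."""
    summary = {}
    current_list = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # List items belong to the most recent key that had no value.
            if current_list is not None:
                current_list.append(line[2:].strip())
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            summary[key.strip()] = value
            current_list = None
        else:
            current_list = summary.setdefault(key.strip(), [])
    return summary


sample = """\
jobId: job-1
mainJobRunId: run-main
workDir: oci://bucket@namespace/prefix
workerJobRunIds:
- run-w1
- run-w2
"""
info = parse_run_summary(sample)
for run_id in [info["mainJobRunId"], *info["workerJobRunIds"]]:
    print(f"ads jobs watch {run_id}")
```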

#### Monitor logs
You can monitor the logs emitted from the job runs using the following command.

In [None]:
! ads jobs watch <jobrun_id>