# This notebook is a work in progress




## Horovod

[Horovod](https://github.com/horovod/horovod) is a distributed deep-learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. Its goal is to make distributed deep learning fast and easy to use. Horovod is an open-source tool originally developed at Uber to speed up deep-learning model training across many engineering teams. It is one of a growing number of approaches to distributed training; another example is Distributed PyTorch. Horovod uses MPI for communication between distributed processes and the [NVIDIA Collective Communications Library (NCCL)](https://developer.nvidia.com/nccl) for its highly optimized reductions across processes and nodes. With this design, Horovod scales deep-learning model training across multiple GPUs and multiple nodes with only minor changes to the training script.

To use [Horovod with PyTorch](https://horovod.readthedocs.io/en/latest/pytorch.html), make the following modifications to your training script:

1. Run `hvd.init()`.

2. Pin each GPU to a single process. With the typical setup of one GPU per process, set the device to the process's local rank:

```python
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())
```
3. Scale the learning rate by the number of workers. In synchronous distributed training, the effective batch size grows with the number of workers, and a proportionally larger learning rate compensates for the larger batch.

4. Wrap the optimizer in `hvd.DistributedOptimizer`. The distributed optimizer delegates gradient computation to the original optimizer, averages the gradients using `allreduce` or `allgather`, and then applies the averaged gradients.

5. Broadcast the initial variable states from rank 0 to all other processes:

```python
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```
This step is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. You can do this by running the checkpointing code only when `hvd.rank() == 0`.
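
Steps 3 and 4 can be illustrated without Horovod at all. The following is a minimal plain-Python sketch, not Horovod's implementation: it simulates four workers in one process, averages their gradients element-wise (the result an averaging `allreduce` would give every worker), and applies one SGD update with a linearly scaled learning rate. The function names `allreduce_average` and `scaled_lr`, the gradient values, and the base learning rate are all hypothetical and chosen only for illustration.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of the per-worker gradients (simulated in one process)."""
    num_workers = len(worker_grads)
    return [sum(values) / num_workers for values in zip(*worker_grads)]


def scaled_lr(base_lr: float, num_workers: int) -> float:
    """Linear scaling from step 3: single-worker rate times the worker count."""
    return base_lr * num_workers


# Hypothetical gradients, one list per worker, each from a different mini-batch.
worker_grads = [
    [0.5, -1.0],
    [1.5, -3.0],
    [1.0, -2.0],
    [1.0, -2.0],
]

avg_grad = allreduce_average(worker_grads)
lr = scaled_lr(base_lr=0.25, num_workers=len(worker_grads))

# Every worker applies the same averaged gradient, so all replicas stay in sync.
weights = [0.0, 0.0]
weights = [w - lr * g for w, g in zip(weights, avg_grad)]

print(avg_grad, lr, weights)
```

In a real Horovod job the averaging happens across processes through MPI or NCCL, and `hvd.size()` supplies the worker count.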

The following sample shows how these steps fit together in a training script:

```python
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())

# Define dataset...
train_dataset = ...

# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=..., sampler=train_sampler)

# Build model...
model = ...
model.cuda()

# Scale the learning rate by the number of workers.
# base_lr is an illustrative single-worker value, not a recommendation.
base_lr = 0.01
optimizer = optim.SGD(model.parameters(), lr=base_lr * hvd.size())

# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast parameters and optimizer state from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

log_interval = 10  # illustrative logging frequency, in batches

for epoch in range(100):
    # Reshuffle the per-worker partitions at the start of each epoch.
    train_sampler.set_epoch(epoch)
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch to the GPU pinned above.
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
```
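
The sample above does not include the checkpointing guard from step 6. A minimal sketch of that guard follows, written so that it runs without Horovod: the rank is passed in as a plain integer, and `pickle` stands in for the real serialization call. The helper name `save_checkpoint`, the file name, and the state dictionary are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state: dict, path: str, rank: int) -> bool:
    """Write `state` to `path` only on rank 0; return True if a file was written."""
    if rank != 0:
        return False  # every other worker skips the write
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return True


# Simulate four workers reaching the checkpoint call at the end of an epoch.
checkpoint_dir = tempfile.mkdtemp()
path = os.path.join(checkpoint_dir, "checkpoint.pkl")
state = {"epoch": 1, "weights": [0.1, 0.2]}

written_by = [rank for rank in range(4) if save_checkpoint(state, path, rank)]
print(written_by)
```

In a Horovod training script, the same pattern would likely take `hvd.rank()` as the rank and call `torch.save(model.state_dict(), path)` in place of `pickle.dump`.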

To run the training script with 2 GPUs on a single machine, use either of the following commands. They are equivalent; the second names the host explicitly:

```text
horovodrun -np 2 python3 training.py
horovodrun -np 2 -H localhost:2 python3 training.py
```
To run the training script with 16 GPUs across four machines (4 GPUs each), use the command:

```text
horovodrun -np 16 -H hostname_1:4,hostname_2:4,hostname_3:4,hostname_4:4 python3 training.py
```
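*`-np` sets the total number of processes (one per GPU in these examples), and `-H` lists the hosts, each written as `hostname:N`, where `N` is the number of processes to run on that host.*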
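
When the host list is generated by a script, it can help to derive `-np` from the per-host counts so the two never disagree. The sketch below assembles the same command as above from a dictionary of hosts. The helper `build_horovodrun_command` is hypothetical and not part of Horovod; it uses only the `-np` and `-H` flags shown in this section.

```python
from typing import Dict, List


def build_horovodrun_command(hosts: Dict[str, int], script: str) -> List[str]:
    """Build a horovodrun argument list from {hostname: processes on that host}."""
    total_processes = sum(hosts.values())
    host_arg = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["horovodrun", "-np", str(total_processes),
            "-H", host_arg, "python3", script]


# Four machines with 4 GPUs each, matching the example above.
hosts = {f"hostname_{i}": 4 for i in range(1, 5)}
command = build_horovodrun_command(hosts, "training.py")

print(" ".join(command))
# horovodrun -np 16 -H hostname_1:4,hostname_2:4,hostname_3:4,hostname_4:4 python3 training.py
```

The returned list can be passed to `subprocess.run` if the launch itself should also be scripted.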

```text
!horovodrun -np 2 --mpi-args="--oversubscribe" python3 ../source_code/horovod_pytorch_mnist.py
```

---
## References

- https://horovod.readthedocs.io/en/latest/pytorch.html
- https://github.com/horovod/horovod