# Getting started with Intel Optimization for Horovod

This code sample shows how to get started with distributed deep learning workloads on Intel GPUs using Intel Optimization for Horovod. In this sample we will run multi-card inference benchmarks as well as a training example from Horovod. By the end of this sample, users should be able to get started with multi-card distributed deep learning using Intel Optimization for Horovod and TensorFlow.

In [None]:
import os
initial_cwd = os.getcwd()

### Find number of devices (GPUs)

Run `sycl-ls` to list all available devices in the system. We can use its output to check how many GPUs are available for distributing our deep learning workloads across cards.

In [None]:
!sycl-ls

Based on the output from `sycl-ls`, set `number_devices` below to match the number of GPUs available in the system.

In [None]:
number_devices = 2
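
The device count can also be derived from the `sycl-ls` output instead of being read off by eye. The following is a minimal sketch, assuming that each device appears on a line beginning with a bracketed tag of the form `[<backend>:gpu:<index>]` and that the Level Zero entries are the ones to count (the same card may also be listed under OpenCL). The helper name `count_gpus` and the sample text are illustrative only, and the exact output format may differ between oneAPI releases.

```python
import re

def count_gpus(sycl_ls_output, backend="level_zero"):
    """Count GPU entries for one backend in text captured from `sycl-ls`."""
    # Assumed line shape: "[level_zero:gpu:0] ..." (older releases may
    # prefix the backend with "ext_oneapi_").
    pattern = re.compile(
        r"^\[(?:ext_oneapi_)?" + re.escape(backend) + r":gpu(?::\d+)?\]"
    )
    return sum(
        1 for line in sycl_ls_output.splitlines() if pattern.match(line.strip())
    )

# Illustrative sample text, not real output from any particular system.
sample = """\
[opencl:cpu:0] Example CPU device
[opencl:gpu:1] Example GPU device
[level_zero:gpu:0] Example GPU device
[level_zero:gpu:1] Example GPU device
"""
print(count_gpus(sample))
```

In the notebook, the output can be captured with `output = !sycl-ls` and passed in as `count_gpus("\n".join(output))`. Verify the result against the printed device list before relying on it.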

## Inference with Horovod 
In this section we will download and run an inference benchmarking script from the Intel-optimization-for-horovod repository. The benchmarking script lets us run inference with different configurations, such as model, batch size, and number of iterations.

Download the `tensorflow2_keras_synthetic_benchmark.py` inference example from the [Intel-optimization-for-horovod](https://github.com/intel/intel-optimization-for-horovod) open-source repository. This example has already been modified for Intel GPU support, and will run on Intel GPU without any code modifications. 

In [None]:
if not os.path.exists(os.path.join(os.getcwd(), "tensorflow2_keras_synthetic_benchmark.py")):
    !wget https://raw.githubusercontent.com/intel/intel-optimization-for-horovod/main/examples/tensorflow2/tensorflow2_keras_synthetic_benchmark.py
else:
    print("Example already in current directory")

Run the ResNet50 benchmark example with the following parameters:

In [None]:
!horovodrun --num-proc $number_devices \
    python tensorflow2_keras_synthetic_benchmark.py \
    --fp16-allreduce \
    --model ResNet50 \
    --batch-size 32 \
    --num-batches-per-iter 10 \
    --num-iters 10

Upon completion, the example prints the benchmarking results for the inference run to the screen. Users can compare single-GPU images per second with multi-card images per second. Users can also rerun the workload with different parameters such as `batch size`, `batches per iteration` and `model`.
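
To make the single-card versus multi-card comparison concrete, the two throughput figures can be turned into a speedup and a scaling efficiency. This is a minimal sketch: the function name `scaling_efficiency` is hypothetical, and the numbers are placeholders rather than measured results. Substitute the images-per-second values reported by your own runs.

```python
def scaling_efficiency(single_img_per_sec, multi_img_per_sec, num_devices):
    """Return (speedup, efficiency) of a multi-card run relative to one card.

    speedup    = total multi-card throughput / single-card throughput
    efficiency = speedup / number of cards (1.0 means perfect linear scaling)
    """
    if single_img_per_sec <= 0 or num_devices <= 0:
        raise ValueError("throughput and device count must be positive")
    speedup = multi_img_per_sec / single_img_per_sec
    efficiency = speedup / num_devices
    return speedup, efficiency

# Placeholder values for illustration only, not benchmark results.
speedup, efficiency = scaling_efficiency(100.0, 180.0, 2)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

An efficiency well below 1.0 usually suggests that communication or input overhead is limiting the benefit of the additional cards for that model and batch size.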

Below is another inference benchmarking example that runs the `MobileNet` model instead of `ResNet50`, with a batch size of `64`:

In [None]:
!horovodrun --num-proc $number_devices \
    python tensorflow2_keras_synthetic_benchmark.py \
    --fp16-allreduce \
    --model MobileNet \
    --batch-size 64 \
    --num-batches-per-iter 10 \
    --num-iters 10
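
When trying several configurations, it can help to build the command from parameters instead of editing it by hand. The sketch below assembles the same `horovodrun` invocation used in the two cells above, using only the flags that appear there. The helper name `build_benchmark_command` is hypothetical, and the process count of 2 is an example value.

```python
import shlex

def build_benchmark_command(num_proc, model, batch_size,
                            num_batches_per_iter=10, num_iters=10,
                            fp16_allreduce=True):
    """Build the horovodrun argument list for the synthetic benchmark."""
    cmd = ["horovodrun", "--num-proc", str(num_proc),
           "python", "tensorflow2_keras_synthetic_benchmark.py"]
    if fp16_allreduce:
        cmd.append("--fp16-allreduce")
    cmd += ["--model", model,
            "--batch-size", str(batch_size),
            "--num-batches-per-iter", str(num_batches_per_iter),
            "--num-iters", str(num_iters)]
    return cmd

# Print the two configurations used in this section.
for model, batch_size in [("ResNet50", 32), ("MobileNet", 64)]:
    print(shlex.join(build_benchmark_command(2, model, batch_size)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch a run from Python.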

## Training with Horovod
In this section, we will run a training workload on the MNIST dataset. The workload comes from the public Horovod repository and requires code modifications to run on Intel GPUs.

Clone the Horovod repository and cd into the _examples/tensorflow2_ directory:

In [None]:
!git clone https://github.com/horovod/horovod.git
%cd horovod/examples/tensorflow2

Unlike the inference example, this training example requires a patch to run on Intel GPUs. The patch makes the necessary changes to run the Horovod training workload on Intel GPUs.
Download the patch from the [Intel-extension-for-tensorflow](https://github.com/intel/intel-extension-for-tensorflow) repository to the current directory:

In [None]:
if not os.path.exists(os.path.join(os.getcwd(), "tensorflow2_keras_mnist.patch")):
    !wget https://github.com/intel/intel-extension-for-tensorflow/raw/main/examples/train_horovod/mnist/tensorflow2_keras_mnist.patch
else:
    print("Patch already in current directory")

Let's take a look at the patch file to see the modifications required to run on Intel GPUs with Intel Optimization for Horovod.

In [None]:
%cat tensorflow2_keras_mnist.patch

**Note: Users can follow the [official guide](https://github.com/horovod/horovod/blob/master/docs/tensorflow.rst)
 from Horovod to enable distributed deep learning workloads in TensorFlow v2.x. The only modification needed to run on Intel GPUs is to change the device name from `GPU` to `XPU` when pinning each XPU to a single process.**
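
As an illustration of the note above, the sketch below follows the device-pinning pattern from the Horovod TensorFlow guide with `XPU` substituted for `GPU`. It is not a copy of the patch. It assumes that TensorFlow, Horovod, and Intel Extension for TensorFlow are installed so that the `XPU` device type is registered, and the function name `pin_process_to_xpu` is hypothetical.

```python
def pin_process_to_xpu():
    """Make only this process's XPU visible to TensorFlow.

    Returns the selected device, or None if no XPU is found.
    """
    # Imported inside the function so the definition itself loads even
    # where these packages are not installed.
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Assumption: the "XPU" device type is registered by
    # Intel Extension for TensorFlow.
    xpus = tf.config.experimental.list_physical_devices("XPU")
    for xpu in xpus:
        # Allocate device memory on demand instead of all at once.
        tf.config.experimental.set_memory_growth(xpu, True)
    if not xpus:
        return None

    # One process per device: the local rank selects the device.
    local_xpu = xpus[hvd.local_rank()]
    tf.config.experimental.set_visible_devices(local_xpu, "XPU")
    return local_xpu
```

Each process started by `horovodrun` would call this once at startup, before building the model, so that every process sees exactly one device.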

Apply the patch to enable Intel GPUs for the training example.

In [None]:
!git apply tensorflow2_keras_mnist.patch

We can now run the MNIST training workload on multiple devices with the patched Python file.

In [None]:
!horovodrun --num-proc $number_devices \
    python ./tensorflow2_keras_mnist.py

The output (both stdout and stderr) is displayed on the command line console.

In [None]:
os.chdir(initial_cwd)
print('[CODE_SAMPLE_COMPLETED_SUCCESFULLY]')