# Distributed

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/intro_ml/blob/main/docs/05_distributed.ipynb)

In [None]:
# if you're using colab, then install the required modules
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    pass

Distributing training over multiple devices generally uses either:

- [Data parallelism](https://developers.google.com/machine-learning/glossary/#data-parallelism)
    - Single model copied to multiple devices.
    - Split data over multiple devices.
    - Useful for big data.
- [Model parallelism](https://developers.google.com/machine-learning/glossary/#model-parallelism)
    - Split model over multiple devices.
    - Single data copied to multiple devices.
    - Useful for big models.
    
This lesson focuses on data parallelism.

## [Ray Train](https://docs.ray.io/en/latest/train/train.html)

Ray Train simplifies distributed deep learning for TensorFlow and PyTorch.

It handles the set up for you (e.g., [`TF_CONFIG`](https://www.tensorflow.org/guide/distributed_training#setting_up_the_tf_config_environment_variable) in TensorFlow).

There are a range of examples [here](https://docs.ray.io/en/latest/train/examples.html).

### [TensorFlow](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras)

Here is an [MNIST example](https://docs.ray.io/en/latest/train/examples/tensorflow_mnist_example.html):

In [100]:
import argparse
import json
import os

import numpy as np
import ray
import ray.train as train
import tensorflow as tf
from ray.train import Trainer
from tensorflow.keras.callbacks import Callback

#### [Define callback for reporting](https://docs.ray.io/en/latest/train/user_guide.html#logging-monitoring-and-callbacks)

In [114]:
class TrainReportCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        train.report(**logs)

#### Set up the dataset and model

```{tip}
The default [auto-sharding](https://www.tensorflow.org/api_docs/python/tf/data/experimental/AutoShardPolicy) by `FILE` can cause warning messages. Instead auto-shard by data: `tf.data.experimental.AutoShardPolicy.DATA`
```

In [115]:
def mnist_dataset(batch_size):
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    # The `x` arrays are in uint8 and have values in the [0, 255] range.
    # You need to convert them to float32 with values in the [0, 1] range.
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .shuffle(60000)
        .repeat()
        .batch(batch_size)
    )
    
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        tf.data.experimental.AutoShardPolicy.DATA
    )
    train_dataset = train_dataset.with_options(options)
    
    return train_dataset

In [116]:
def build_and_compile_cnn_model(config):
    learning_rate = config.get("lr", 0.001)
    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(28, 28)),
            tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ]
    )
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
        metrics=["accuracy"],
    )
    return model

#### Set up the training function for a _single_ worker

You can [configure training](https://docs.ray.io/en/latest/train/user_guide.html#configuring-training) using the `config` parameter.

In [117]:
def train_func(config):
    batch_size = 64
    single_worker_dataset = mnist_dataset(batch_size)
    single_worker_model = build_and_compile_cnn_model(config)
    single_worker_model.fit(
        single_worker_dataset, epochs=config["epochs"], steps_per_epoch=70
    )

In [118]:
config = {"epochs": 3}

In [119]:
train_func(config)

Epoch 1/3
Epoch 2/3
Epoch 3/3


#### [Update training function](https://docs.ray.io/en/latest/train/user_guide.html#update-training-function)

1. Set the _global_ batch size
    - Each worker will process the same size batch as in the single-worker code.
2. Choose your TensorFlow distributed training strategy.
    - In this example we use the [MultiWorkerMirroredStrategy](https://www.tensorflow.org/guide/distributed_training#multiworkermirroredstrategy) for synchronous training of multiple workers across many machines.
    - Within the strategy scope, you build a compiled the model.

In [120]:
def train_func(config):
    per_worker_batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)
    steps_per_epoch = config.get("steps_per_epoch", 70)

    tf_config = json.loads(os.environ["TF_CONFIG"])
    num_workers = len(tf_config["cluster"]["worker"])

    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    global_batch_size = per_worker_batch_size * num_workers
    multi_worker_dataset = mnist_dataset(global_batch_size)

    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_and_compile_cnn_model(config)

    history = multi_worker_model.fit(
        multi_worker_dataset,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        callbacks=[TrainReportCallback()],
        verbose=False,
    )
    results = history.history
    return results

#### [Create Ray Train Trainer](https://docs.ray.io/en/latest/train/user_guide.html#create-ray-train-trainer)

The `Trainer` manages state and training.

In [121]:
def train_tensorflow_mnist(num_workers=1, use_gpu=False, epochs=4):
    trainer = Trainer(backend="tensorflow", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()
    results = trainer.run(
        train_func=train_func, config={"lr": 1e-3, "batch_size": 64, "epochs": epochs}
    )
    trainer.shutdown()
    print(f"Results: {results[0]}")

#### [Run the training](https://docs.ray.io/en/latest/train/user_guide.html#run-training-function)

Initialise and shutdown the Ray client:

In [None]:
# ray.init()

In [123]:
# cpu
# train_tensorflow_mnist()

# gpu
# train_tensorflow_mnist(use_gpu=True)

2022-03-25 15:45:59,378	INFO trainer.py:199 -- Trainer logs will be logged in: /home/earlacoa/ray_results/train_2022-03-25_15-45-59
2022-03-25 15:45:59,858	INFO trainer.py:205 -- Run results will be logged in: /home/earlacoa/ray_results/train_2022-03-25_15-45-59/run_001
[2m[36m(BackendExecutor pid=615413)[0m 2022-03-25 15:45:59.976682: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
[2m[36m(BackendExecutor pid=615413)[0m 2022-03-25 15:45:59.976705: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2m[36m(BaseWorkerMixin pid=615414)[0m 2022-03-25 15:46:00.933622: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such fil

Results: {'loss': [2.2843410968780518, 2.234750747680664, 2.1810903549194336, 2.1127052307128906], 'accuracy': [0.16026785969734192, 0.3738839328289032, 0.5381696224212646, 0.6267856955528259]}


In [124]:
# ray.shutdown()

This Python script is in full [here](https://github.com/lukeconibear/intro_ml/blob/main/docs/distributed/tensorflow_ray_train_mnist_example.py).

The job submission script is (also [here](https://github.com/lukeconibear/intro_ml/blob/main/docs/distributed/distributed_ml_on_arc4_cpu.bash)):

```bash
#!/bin/bash
#$ -cwd
#$ -l h_rt=00:30:00
#$ -pe smp 12
#$ -l h_vmem=6G

conda activate intro_ml
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib  # (sometimes needed)

python tensorflow_ray_train_mnist_example.py --num-workers 12 --epochs 100
```

In this simple example using 12 CPUs, the job efficiency (using `qacct -j <JOBID>`):

```
Efficiency = 100 * cpu / (ru_wallclock * slots)
Efficiency = 100 * 10214 / (928 * 12)
Efficiency = 92 %
```

92% is good.

To run on the GPU ([submission script](https://github.com/lukeconibear/intro_ml/blob/main/docs/distributed/distributed_ml_on_arc4_gpu.bash)):
- Replace `#$ -pe smp 4` with `#$ -l coproc_v100=1`.
- Add `--use-gpu=True`.

### [PyTorch](https://pytorch.org/tutorials/beginner/dist_overview.html)

Here is an [Fashion MNIST example](https://docs.ray.io/en/latest/train/examples/train_fashion_mnist_example.html):

In [126]:
import argparse
from typing import Dict

import torch
import ray
import ray.train as train
from ray.train.trainer import Trainer
from ray.train.callbacks import JsonLoggerCallback
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

#### Set up the dataset and model

In [127]:
training_data = datasets.FashionMNIST(
    root="~/data",
    train=True,
    download=True,
    transform=ToTensor(),
)

test_data = datasets.FashionMNIST(
    root="~/data",
    train=False,
    download=True,
    transform=ToTensor(),
)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/26421880 [00:00<?, ?it/s]

Extracting /home/earlacoa/data/FashionMNIST/raw/train-images-idx3-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/29515 [00:00<?, ?it/s]

Extracting /home/earlacoa/data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/4422102 [00:00<?, ?it/s]

Extracting /home/earlacoa/data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/5148 [00:00<?, ?it/s]

Extracting /home/earlacoa/data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to /home/earlacoa/data/FashionMNIST/raw



In [129]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 10), nn.ReLU())

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

#### Define training and validation per epoch

In [130]:
def train_epoch(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset) // train.world_size()
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

In [131]:
def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // train.world_size()
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n "
          f"Accuracy: {(100 * correct):>0.1f}%, "
          f"Avg loss: {test_loss:>8f} \n")
    return test_loss

#### [Setup distributed training function](https://docs.ray.io/en/latest/train/user_guide.html#update-training-function)

Use `ray.train.torch.prepare_model` to automatically move your model to the right device.

Use `ray.train.torch.prepare_data_loader` utility functions to setup your data for distributed training.

In [132]:
def train_func(config: Dict):
    batch_size = config["batch_size"]
    lr = config["lr"]
    epochs = config["epochs"]

    worker_batch_size = batch_size // train.world_size()

    # Create data loaders.
    train_dataloader = DataLoader(training_data, batch_size=worker_batch_size)
    test_dataloader = DataLoader(test_data, batch_size=worker_batch_size)

    train_dataloader = train.torch.prepare_data_loader(train_dataloader)
    test_dataloader = train.torch.prepare_data_loader(test_dataloader)

    # Create model.
    model = NeuralNetwork()
    model = train.torch.prepare_model(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    loss_results = []

    for _ in range(epochs):
        train_epoch(train_dataloader, model, loss_fn, optimizer)
        loss = validate_epoch(test_dataloader, model, loss_fn)
        train.report(loss=loss)
        loss_results.append(loss)

    return loss_results

#### [Create Ray Train Trainer](https://docs.ray.io/en/latest/train/user_guide.html#create-ray-train-trainer)

In [133]:
def train_fashion_mnist(num_workers=1, use_gpu=False):
    trainer = Trainer(
        backend="torch", num_workers=num_workers, use_gpu=use_gpu)
    trainer.start()
    result = trainer.run(
        train_func=train_func,
        config={
            "lr": 1e-3,
            "batch_size": 64,
            "epochs": 4
        },
        callbacks=[JsonLoggerCallback()])
    trainer.shutdown()
    print(f"Loss results: {result}")

#### [Run the training](https://docs.ray.io/en/latest/train/user_guide.html#run-training-function)

In [134]:
# ray.init()

{'node_ip_address': '192.168.0.37',
 'raylet_ip_address': '192.168.0.37',
 'redis_address': None,
 'object_store_address': '/tmp/ray/session_2022-03-25_16-29-28_552181_560894/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-03-25_16-29-28_552181_560894/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2022-03-25_16-29-28_552181_560894',
 'metrics_export_port': 38196,
 'gcs_address': '192.168.0.37:43117',
 'address': '192.168.0.37:43117',
 'node_id': '790fd81aeb0811f07d2183039d15d093c6a6ee3a272d75fa8a64143e'}

In [135]:
# cpu
# train_fashion_mnist()

# gpu
# train_fashion_mnist(use_gpu=True)

2022-03-25 16:29:32,512	INFO trainer.py:199 -- Trainer logs will be logged in: /home/earlacoa/ray_results/train_2022-03-25_16-29-32
2022-03-25 16:29:33,611	INFO trainer.py:205 -- Run results will be logged in: /home/earlacoa/ray_results/train_2022-03-25_16-29-32/run_001
[2m[36m(BaseWorkerMixin pid=618603)[0m 2022-03-25 16:29:33,566	INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=1]
[2m[36m(BaseWorkerMixin pid=618603)[0m 2022-03-25 16:29:34,045	INFO torch.py:244 -- Moving model to device: cpu


[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.306230  [    0/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.301374  [ 6400/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.288759  [12800/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.285810  [19200/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.291880  [25600/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.275090  [32000/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.289195  [38400/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.276231  [44800/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.260754  [51200/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.255165  [57600/60000]
[2m[36m(BaseWorkerMixin pid=618603)[0m Test Error: 
[2m[36m(BaseWorkerMixin pid=618603)[0m  Accuracy: 25.4%, Avg loss: 2.259322 
[2m[36m(BaseWorkerMixin pid=618603)[0m 
[2m[36m(BaseWorkerMixin pid=618603)[0m loss: 2.260004  [    0/60000]
[2m[36m(BaseWorkerMixin pid=

In [None]:
# ray.shutdown()

This Python script is in full [here](https://github.com/lukeconibear/intro_ml/blob/main/docs/distributed/pytorch_ray_train_fashion_mnist_example.py).

The job submission script is the same as before ([here](https://github.com/lukeconibear/intro_ml/blob/main/docs/distributed/distributed_ml_on_arc4_cpu.bash)), except you use the line:

```bash
python tensorflow_ray_train_mnist_example.py --num-workers 12 --epochs 100
```

In this simple example using 12 CPUs, the job efficiency (using `qacct -j <JOBID>`):

```
Efficiency = 100 * cpu / (ru_wallclock * slots)
Efficiency = 100 * X / (X * 12)
Efficiency = X %
```

...

To run on the GPU ([submission script](https://github.com/lukeconibear/intro_ml/blob/main/docs/distributed/distributed_ml_on_arc4_gpu.bash)):
- Replace `#$ -pe smp 4` with `#$ -l coproc_v100=1`.
- Add `--use-gpu=True`.

## Jupyter Notebook to HPC

It's preferable to use a static job on the HPC. To do this, you could test out different ideas locally in a Jupyter Notebook, then when ready convert this to an executable script (`.py`) and move it over. 

...

## Exercises

```{admonition} Exercise 1

...

```

## {ref}`Solutions <distributed>`

## Key Points

```{important}

- [x] _..._

```

## Further information

### Good practices

- Ensure works on a single workers first, _before_ going distributed.
- Really ensure that you need multiple GPUs.
- Batch the dataset with the global batch size e.g., for 8 devices each capable of a btach of 64 use the global batch size of 512 (= 8 * 64).  
- ...

### Other options

- [Horovod](https://horovod.ai/)
- [DeepSpeed](https://www.deepspeed.ai/)
 
### Resources

- ...