# Distributed

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukeconibear/intro_ml/blob/main/docs/04_distributed.ipynb)

In [1]:
# if you're using colab, then install the required modules
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    pass

...

Examples of how to distribute deep learning on a High Performance Computer (HPC).

## Contents

These examples use [Ray Train](https://docs.ray.io/en/latest/train/train.html) in a static job on an HPC. Ray handles most of the complexity of distributing the work, with minimal changes to your [TensorFlow](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras) or [PyTorch](https://pytorch.org/tutorials/beginner/dist_overview.html) code.

First, install the Python environments for the required HPC: [`install_python_environments.md`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/install_python_environments.md).  

- Python script examples:
  - TensorFlow
    - MNIST end-to-end: [`tensorflow_mnist_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/tensorflow_mnist_example.py).  
    - MNIST tuning: [`tensorflow_tune_mnist_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/tensorflow_tune_mnist_example.py).  
    - Train linear model with Ray Datasets: [`tensorflow_linear_dataset_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/tensorflow_linear_dataset_example.py).  
  - PyTorch
    - Linear: [`pytorch_train_linear_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/pytorch_train_linear_example.py).  
    - Fashion MNIST: [`pytorch_train_fashion_mnist_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/pytorch_train_fashion_mnist_example.py).  
    - HuggingFace Transformer: [`pytorch_transformers_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/pytorch_transformers_example.py).  
    - Tune linear model with Ray Datasets: [`pytorch_tune_linear_dataset_example.py`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/pytorch_tune_linear_dataset_example.py).  
- Then submit the job to the HPC (choose one and update the Python script within it):
  - [ARC4](https://arcdocs.leeds.ac.uk/systems/arc4.html) (SGE)  
    - CPU: [`ray_train_on_arc4_cpu.bash`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/ray_train_on_arc4_cpu.bash).  
    - GPU: [`ray_train_on_arc4_gpu.bash`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/ray_train_on_arc4_gpu.bash).  
  - [Bede](https://bede-documentation.readthedocs.io/en/latest/) (SLURM)
    - GPU: [`ray_train_on_bede.bash`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/ray_train_on_bede.bash).  
  - [JADE-2](http://docs.jade.ac.uk/en/latest/index.html) (SLURM)
    - GPU: ...

It's preferable to use a static job on the HPC. To do this, you could test out different ideas locally in a Jupyter Notebook, then when ready convert this to an executable script (`.py`) and move it over. However, it is also possible to use Jupyter Notebooks interactively on the HPC following the instructions here: [`jupyter_notebook_to_hpc.md`](https://github.com/lukeconibear/distributed_deep_learning/blob/main/jupyter_notebook_to_hpc.md).  



For background, see the Keras guide to [multi-GPU and distributed training](https://keras.io/guides/distributed_training/).

Synchronous data-parallel training on all available GPUs:

In [None]:
# distribution_strategy = tf.distribute.MirroredStrategy()
# with distribution_strategy.scope():
#     # Everything that creates variables should be under the strategy scope.
#     # In general this is only model construction and compile().
#     model = build_model()
#     model.compile(optimizer=optimiser, loss=loss)
#
# # Training does not create variables, so it can run outside the scope.
# model.fit(dataset, epochs=epochs, callbacks=callbacks)

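The `model.fit` call does not need to be inside the scope: only code that creates variables (model construction and `compile()`) does, and the Keras guide calls `fit` after the `with` block.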
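To make "synchronous data-parallel" concrete, here is a minimal pure-Python sketch of the idea. It is not how `tf.distribute` is implemented. It assumes a toy one-parameter model `y = w * x` with a mean squared error loss, and the function names (`gradient`, `data_parallel_step`) and data are hypothetical. Each replica computes the gradient on its own equal-sized shard of the batch, the gradients are averaged (the role played by an all-reduce across devices), and every replica applies the same update, so the weights stay identical everywhere.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, xs, ys, n_replicas, learning_rate):
    """One synchronous step: shard the batch, average the gradients, update once."""
    shard_size = len(xs) // n_replicas  # assumes the batch divides evenly
    replica_gradients = []
    for replica in range(n_replicas):
        start, stop = replica * shard_size, (replica + 1) * shard_size
        replica_gradients.append(gradient(w, xs[start:stop], ys[start:stop]))
    # Averaging stands in for the all-reduce across devices.
    mean_gradient = sum(replica_gradients) / n_replicas
    return w - learning_rate * mean_gradient


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # hypothetical data generated with w = 2

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, xs, ys, n_replicas=2, learning_rate=0.01)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, which is why synchronous data parallelism behaves like single-device training on the whole batch.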

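In the SGE job submission script, use:

`#$ -cwd`

not:

`#$ -cwd -V`

The `-V` flag copies the environment variables from the submitting terminal into the job. Without it, the job script has to specify the reproducible environment itself, rather than relying on whatever happened to be set in the terminal.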

## Checkpointing

For longer or distributed training, it's helpful to save the model at regular intervals in case the job crashes during training.

This is called model checkpointing, and it is done via callbacks.

In [1]:
import tensorflow as tf


In [None]:
callback_model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='path/to/my/model_{epoch}',
    save_freq='epoch'  # save a model version at the end of each epoch
)
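After a crash, training can restart from the most recent checkpoint rather than from scratch. The sketch below shows one way to find it, assuming the checkpoints were written with the `model_{epoch}` naming pattern from the callback. The helper name `latest_checkpoint` and the directory layout are hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return (path, epoch) for the highest-numbered `model_{epoch}` entry, or None."""
    pattern = re.compile(r"model_(\d+)$")
    found = []
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    # Pick the entry with the largest epoch number.
    epoch, path = max(found, key=lambda item: item[0])
    return path, epoch
```

The returned path could then be passed to `tf.keras.models.load_model`, and the epoch to the `initial_epoch` argument of `model.fit`, so that epoch numbering continues from where the previous run stopped.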

## TensorBoard

In [None]:
callback_tensorboard = tf.keras.callbacks.TensorBoard(log_dir='./logs')

View the logs with:

`tensorboard --logdir=./logs`

TensorBoard can also be displayed inline in [Jupyter Notebooks and Google Colab](https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks).

## Example: Callbacks

In [None]:
callbacks = [
    callback_model_checkpoint,
    callback_tensorboard,
]

In [None]:
# model.fit(dataset, epochs=2, callbacks=callbacks)

## Exercises

```{admonition} Exercise 1

...

```

## {ref}`Solutions <distributed>`

## Key Points

```{important}

- [x] _..._

```

## Further information

### Good practices

- ...

### Other options

- [Horovod](https://horovod.ai/)
- [DeepSpeed](https://www.deepspeed.ai/)
 
### Resources

- ...