<a href="https://colab.research.google.com/github/jeffheaton/present/blob/master/youtube/gpu/keras-dual-gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Jeff Heaton

You can follow me at any of:

* [YouTube](https://www.youtube.com/user/HeatonResearch)
* [Website](https://www.heatonresearch.com/)
* [GitHub](https://github.com/jeffheaton)
* [Twitter](https://twitter.com/jeffheaton)


## Multi-GPU Support

Keras makes it easy to use more than one GPU for neural network training or scoring. This tutorial shows how to train a model on the [Cats vs Dogs](https://www.kaggle.com/c/dogs-vs-cats) dataset. Not every model benefits from multiple GPUs; in general, larger batch sizes and more complex neural networks gain the most.

The technique presented in this notebook can train with between 1 and 8 GPUs on a single host. It is also possible to train with larger numbers of GPUs spread across multiple hosts; however, a slightly different approach is needed.

First, we will list what GPUs are available on the system.


```
TensorFlow tends to output a large number of warnings that are rarely useful; the following cells suppress this output.
You might want to remove these cells if you are debugging.
```

In [1]:
%%html
<style>
    div.output_stderr {
    display: none;
    }
</style>

In [2]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# These two lines are a workaround that sets these two paths for Jupyter.
# `conda activate` normally sets them automatically, but not inside Jupyter.
os.environ["CUDNN_PATH"] = "/home/jheaton/miniconda3/envs/tensorflow/lib/python3.10/site-packages/nvidia/cudnn"
os.environ["LD_LIBRARY_PATH"] = "/home/jheaton/miniconda3/envs/tensorflow/lib/:/home/jheaton/miniconda3/envs/tensorflow/lib/python3.10/site-packages/nvidia/cudnn/lib:/usr/local/relion-4.0/lib:/usr/local/amber22/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda-11.6/lib:/usr/local/cuda-11.6/lib64:/usr/local/cuda-11.3/lib:/usr/local/cuda-11.3/lib64:/usr/local/cuda-11.2/lib:/usr/local/cuda-11.2/lib64:/usr/local/cuda-11.1/lib:/usr/local/cuda-11.1/lib64:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-10.1/lib:/usr/local/cuda-10.1/lib64:"

import tensorflow as tf
print(tf.__version__)

2.13.0


In [3]:
from tensorflow.python.client import device_lib
devices = device_lib.list_local_devices()

def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

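for d in devices:
    t = d.device_type
    # physical_device_desc is a comma-separated list of "key: value" pairs
    desc = d.physical_device_desc
    pairs = [item.split(':', 1) for item in desc.split(", ")]
    name_attr = dict(pair for pair in pairs if len(pair) == 2)
    # CPU devices carry no "name" entry, so fall back to a placeholder
    dev = name_attr.get('name', 'Unnamed device')
    print(f" {d.name} || {dev} || {t} || {sizeof_fmt(d.memory_limit)}")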

 /device:CPU:0 || Unnamed device || CPU || 256.0 MiB
 /device:GPU:0 ||  NVIDIA H100 PCIe || GPU || 76.8 GiB
 /device:GPU:1 ||  NVIDIA H100 PCIe || GPU || 76.8 GiB


## Obtain Cats vs Dogs Dataset

First, we obtain the Cats vs Dogs dataset. We use the [tensorflow_datasets](https://www.tensorflow.org/datasets) library to access this data. Any data can be used; **tensorflow_datasets** simply makes loading common datasets easy.

In [4]:
import tensorflow as tf
import tensorflow_datasets as tfds

BATCH_SIZE = 32
GPUS = ["GPU:0","GPU:1"]

def process(image, label):
    image = tf.image.resize(image, [299, 299]) / 255.0
    return image, label

strategy = tf.distribute.MirroredStrategy( GPUS )
print('Number of devices: %d' % strategy.num_replicas_in_sync)

batch_size = BATCH_SIZE * strategy.num_replicas_in_sync

dataset = tfds.load("cats_vs_dogs", split=tfds.Split.TRAIN, as_supervised=True)
dataset = dataset.map(process).shuffle(500).batch(batch_size)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')




Number of devices: 2
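The cell above multiplies `BATCH_SIZE` by the number of replicas, so the value passed to `dataset.batch()` is a global batch that is divided among the GPUs. The following is a minimal pure-Python sketch of that arithmetic. The helper names `global_batch_size` and `split_batch` are hypothetical, and the contiguous split is an assumption made for illustration; it is not how TensorFlow assigns elements internally.

```python
# Hypothetical helpers; only the arithmetic mirrors the notebook cell above.
PER_REPLICA_BATCH = 32  # same value as BATCH_SIZE in the notebook


def global_batch_size(per_replica_batch: int, num_replicas: int) -> int:
    """Batch size to request so that each replica sees per_replica_batch items."""
    return per_replica_batch * num_replicas


def split_batch(global_batch: list, num_replicas: int) -> list:
    """Split one global batch into equal contiguous shards (an assumption)."""
    shard = len(global_batch) // num_replicas
    return [global_batch[i * shard:(i + 1) * shard] for i in range(num_replicas)]


# Stand-in for one global batch of images on a two-GPU host.
batch = list(range(global_batch_size(PER_REPLICA_BATCH, 2)))
shards = split_batch(batch, 2)
print(len(batch), [len(s) for s in shards])
```

Scaling the global batch this way keeps the per-GPU workload constant as GPUs are added, which is why larger batch sizes tend to pair well with multiple GPUs.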


## Set Up Distributed Training

Training with multiple GPUs is not much different from training with a single GPU. All that is required is to create and compile the model inside the `MirroredStrategy` scope.
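To give a feel for what synchronous data-parallel training does, here is a small pure-Python sketch. It is a simplified illustration and not TensorFlow's implementation: each "replica" holds the same weight, computes a gradient on its own shard of the batch, and the gradients are averaged before one shared update. The one-parameter model, the toy data, and the function names `grad_mse` and `mirrored_step` are all assumptions made for this example.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def mirrored_step(w, xs, ys, num_replicas, lr):
    """One synchronous step: per-replica gradients, averaged, one shared update."""
    shard = len(xs) // num_replicas
    grads = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_replicas)
    ]
    avg_grad = sum(grads) / num_replicas  # stands in for the all-reduce
    return w - lr * avg_grad


# Toy data generated by y = 2x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
for _ in range(50):
    w = mirrored_step(w, xs, ys, num_replicas=2, lr=0.05)
print(round(w, 4))
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so two replicas take the same step a single device would take on the whole batch. The benefit comes from computing the shards in parallel.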

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import time

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

EPOCHS = 5
LR = 0.001

tf.get_logger().setLevel('ERROR')

start = time.time()
with strategy.scope():
    model = tf.keras.applications.InceptionResNetV2(weights=None, classes=2)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=[tf.keras.metrics.sparse_categorical_accuracy]
    )

model.fit(dataset, epochs=EPOCHS)

elapsed = time.time()-start
print(f'Training time: {hms_string(elapsed)}')

Training times reported for various GPU configurations:

* 06:16.87 - Dual H100
* 08:38.90 - Single H100
* 13:03.89 - Single A100
* 15:54.81 - Dual Quadro RTX 8000
* 18:30.76 - Single RTX A6000
* 24:17.07 - Single TITAN RTX
* 24:48.10 - Single Tesla V100-SXM2-16GB
* 26:23.19 - Single Quadro RTX 8000
* 37:48.45 - Single Tesla P100-PCIE-16GB
* 50:36.50 - Single Quadro RTX 5000
* 1:10:08.54 - Single Tesla T4
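The list above includes two cards measured in both single and dual configurations, which allows a rough speedup estimate. The sketch below assumes the entries are wall-clock training times in `MM:SS.ss` or `H:MM:SS.ss` form; the helper `to_seconds` and the `timings` dictionary are hypothetical names introduced for this example.

```python
def to_seconds(stamp: str) -> float:
    """Convert 'MM:SS.ss' or 'H:MM:SS.ss' into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# (single-GPU time, dual-GPU time), copied from the list above.
timings = {
    "H100": ("08:38.90", "06:16.87"),
    "Quadro RTX 8000": ("26:23.19", "15:54.81"),
}

for gpu, (single, dual) in timings.items():
    speedup = to_seconds(single) / to_seconds(dual)
    # Scaling efficiency: fraction of the ideal 2x speedup actually achieved.
    print(f"{gpu}: {speedup:.2f}x speedup, {speedup / 2:.0%} scaling efficiency")
```

On these numbers the second GPU gives a speedup of about 1.38x on the H100 and about 1.66x on the Quadro RTX 8000, which is consistent with the earlier point that not every workload scales linearly with GPU count.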