<a href="https://www.kaggle.com/code/peremartramanonellas/using-multiple-gpu-s-with-tensorflow-on-kaggle?scriptVersionId=109874709" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [18]:
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
h1 {
    text-align: center;
    background-color: AliceBlue;
    padding: 10px;
    margin: 0;
    font-family: monospace;
    color:DimGray;
    border-radius: 10px
}

h2 {
    text-align: center;
    background-color: HoneyDew;
    padding: 10px;
    margin: 0;
    font-family: monospace;
    color:DimGray;
    border-radius: 10px
}

h3 {
    text-align: center;
    background-color: MintCream;
    padding: 10px;
    margin: 0;
    font-family: monospace;
    color:DimGray;
    border-radius: 10px
}


body, p {
    font-family: monospace;
    font-size: 15px;
    color: charcoal;
}
div {
    font-size: 14px;
    margin: 0;

}

h4 {
    padding: 0px;
    margin: 0;
    font-family: monospace;
    color: purple;
}
</style>
""")



# Use multiple GPUs with TensorFlow / Keras on Kaggle
Recently, **Kaggle has introduced the possibility of using two GPUs in our notebooks.** We are going to see how to use them in a simple way with TensorFlow. 


In TensorFlow, training on more than one GPU in a single machine is handled by *MirroredStrategy*. It is the strategy we are going to use on Kaggle, and it works the same way on a local machine with more than one GPU.

Just make sure that you have multiple GPUs selected in the Settings section of the notebook.

![gpu_settings.png](attachment:67137424-990d-4480-ada8-b990ee512d83.png)
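
A quick way to confirm the setting took effect is to ask TensorFlow which GPUs it can see. This is a minimal sketch that uses only `tf.config.list_physical_devices`; the printed messages are illustrative and not part of the original notebook.

```python
import tensorflow as tf

# List the physical GPUs that TensorFlow can see in this session.
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs visible to TensorFlow: {len(gpus)}")
for gpu in gpus:
    print(gpu.name)

if len(gpus) < 2:
    # The notebook still runs, but there is nothing to distribute across.
    print("Fewer than two GPUs found; check the accelerator setting of the notebook.")
```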

In [19]:
import tensorflow as tf
import time
import tensorflow_datasets as tfds
import tensorflow_hub as hub

## Different strategies

There are different multi-GPU execution strategies that we can use depending on the type of environment in which we execute the model:

* MirroredStrategy: multiple GPUs on one machine, with the variables mirrored on each GPU. *tf.distribute.MirroredStrategy*
* CentralStorageStrategy: multiple GPUs on one machine, but the variables are stored on the CPU. *tf.distribute.experimental.CentralStorageStrategy*
* MultiWorkerMirroredStrategy: multiple machines. *tf.distribute.MultiWorkerMirroredStrategy*

On Kaggle, we can use MirroredStrategy and CentralStorageStrategy. The first is more efficient, while the second consumes less memory. However, CentralStorageStrategy is still labelled 'experimental', so I recommend always using MirroredStrategy.
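
To make "mirrored variables" more concrete, here is a toy simulation in plain Python. It is a sketch of the general idea of synchronous data-parallel training, not TensorFlow's implementation: every replica keeps its own copy of a weight, computes a gradient on its share of the batch, and all copies apply the same averaged gradient. The helper names `grad_mse` and `mirrored_step` are hypothetical.

```python
def grad_mse(w, shard):
    """Gradient of the mean squared error of y_hat = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def mirrored_step(weights, global_batch, lr):
    """One synchronous step: shard the batch, average the gradients, update every copy."""
    n = len(weights)
    per_replica = len(global_batch) // n
    shards = [global_batch[i * per_replica:(i + 1) * per_replica] for i in range(n)]
    grads = [grad_mse(w, shard) for w, shard in zip(weights, shards)]
    # "All-reduce" by averaging; this matches the full-batch gradient
    # because every shard has the same size.
    reduced = sum(grads) / n
    return [w - lr * reduced for w in weights]


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
weights = [0.0, 0.0]  # one copy of the weight per simulated GPU

for _ in range(50):
    weights = mirrored_step(weights, batch, lr=0.05)

print(weights)
```

Because every copy receives the same reduced gradient, the copies never drift apart, which is why the model can be read from any replica after training.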


In [20]:
# Define the strategy and get the number of devices.
strategy = tf.distribute.MirroredStrategy()
print('DEVICES AVAILABLE: {}'.format(strategy.num_replicas_in_sync))

DEVICES AVAILABLE: 2


## Import & data
I load the data from TensorFlow Datasets instead of a Kaggle dataset, so that the notebook is easier to run and test in other environments, such as a local machine.

In [21]:
#Choose the Dataset you want to use. 
#In Kaggle I do not recommend cats_vs_dogs 
#because we can reach the limit of memory consumption.

#setattr(tfds.image_classification.cats_vs_dogs, '_URL',"https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip")
#TS_2_DOWNLOAD = 'cats_vs_dogs'

TS_2_DOWNLOAD = 'horses_or_humans'

splits, info_cd = tfds.load(TS_2_DOWNLOAD, 
                         as_supervised=True, 
                         with_info=True, 
                         split=['train[:70%]', 'train[70%:]'], 
                         data_dir='./data')

In [22]:
num_examples_cd = info_cd.splits['train'].num_examples
num_classes_cd = info_cd.features['label'].num_classes

(train_examples, validation_examples) = splits
num_examples = num_examples_cd
num_classes = num_classes_cd

In [23]:
IMAGE_SIZE = 224

BUFFER_SIZE = 10000
BATCH_SIZE_PER_REPLICA = 64

# The global BATCH_SIZE is the per-replica batch size multiplied by the number of devices.
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
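
The arithmetic behind the global batch size can be checked without a GPU. The sketch below assumes about 720 training images, an illustrative figure that is not taken from the notebook; with that assumption it reproduces the `6/6` and `12/12` step counts that appear in the training logs further down. The helper names are hypothetical.

```python
import math

BATCH_SIZE_PER_REPLICA = 64
NUM_TRAIN_EXAMPLES = 720  # assumption for illustration, not read from the dataset


def global_batch_size(per_replica, num_replicas):
    """Each replica processes `per_replica` examples per step."""
    return per_replica * num_replicas


def steps_per_epoch(num_examples, batch_size):
    """The last batch may be partial, so round up."""
    return math.ceil(num_examples / batch_size)


for replicas in (1, 2):
    batch = global_batch_size(BATCH_SIZE_PER_REPLICA, replicas)
    steps = steps_per_epoch(NUM_TRAIN_EXAMPLES, batch)
    print(f"{replicas} replica(s): global batch {batch}, {steps} steps per epoch")
```

With two replicas each step handles twice as many images, so an epoch needs half as many steps.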

In [24]:
# Preprocess the image
def map_fn(img, label):
    # resize the image
    img = tf.image.resize(img, size=[IMAGE_SIZE, IMAGE_SIZE])
    # normalize the image
    img /= 255.0
    return img, label

In [25]:
# Prepare train dataset by using preprocessing with map_fn, shuffling and batching
def prepare_dataset(train_examples, validation_examples, num_examples, map_fn, batch_size):
    train_ds = train_examples.map(map_fn).shuffle(buffer_size = num_examples).batch(batch_size)
    valid_ds = validation_examples.map(map_fn).batch(batch_size)
    
    return train_ds, valid_ds

In [26]:
train_ds, valid_ds = prepare_dataset(train_examples,
                                     validation_examples,
                                     num_examples,
                                     map_fn, BATCH_SIZE)

# Executing the model on multiple GPUs

You have two models to choose from, both from TensorFlow Hub. I recommend using the smaller one, but if you want to try a heavier model, just uncomment the line that loads *resnet_v2_152*.

When defining the model, we do it inside the scope of the *strategy* variable that we created at the beginning.
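
The effect of the scope can be observed directly. In this minimal sketch a tiny `Dense` model stands in for the TensorFlow Hub model (an assumption made to keep the example short), and `build_model` is a hypothetical helper. On a machine without GPUs, `MirroredStrategy` should simply fall back to a single replica.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()


def build_model():
    # Tiny stand-in for the TF Hub model used in the notebook.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])


# Outside the block only the default (no-op) strategy is active.
outside_scope = tf.distribute.has_strategy()

with strategy.scope():
    # Inside the block our strategy is the current one, so the variables
    # created by build_model() are placed under its control.
    inside_scope = tf.distribute.get_strategy() is strategy
    model = build_model()

print(outside_scope, inside_scope, strategy.num_replicas_in_sync)
```

Only variable creation needs to happen inside the scope, which is why the model is built there.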

In [27]:
MODULE_HANDLE = 'https://tfhub.dev/tensorflow/resnet_50/feature_vector/1'
#MODULE_HANDLE = 'https://tfhub.dev/google/imagenet/resnet_v2_152/classification/5'

with strategy.scope():
    model = tf.keras.Sequential([
        hub.KerasLayer(MODULE_HANDLE, input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_2 (KerasLayer)   (None, 2048)              23561152  
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 4098      
Total params: 23,565,250
Trainable params: 4,098
Non-trainable params: 23,561,152
_________________________________________________________________


In [28]:
# The last layer already applies softmax, so the loss receives probabilities, not logits.
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

In [29]:
start_time=time.time()
model.fit(train_ds, epochs=20, validation_data=valid_ds, verbose=2)
end_time=time.time()

2022-11-02 22:07:18.950618: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:461] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.


Epoch 1/20


Cleanup called...

6/6 - 19s - loss: 0.6570 - accuracy: 0.6718 - val_loss: 0.1653 - val_accuracy: 0.9740




Epoch 2/20
6/6 - 3s - loss: 0.1148 - accuracy: 0.9708 - val_loss: 0.0425 - val_accuracy: 0.9968
Epoch 3/20
6/6 - 3s - loss: 0.0320 - accuracy: 0.9958 - val_loss: 0.0126 - val_accuracy: 1.0000
Epoch 4/20
6/6 - 3s - loss: 0.0147 - accuracy: 0.9986 - val_loss: 0.0074 - val_accuracy: 1.0000
Epoch 5/20
6/6 - 3s - loss: 0.0100 - accuracy: 0.9986 - val_loss: 0.0051 - val_accuracy: 1.0000
Epoch 6/20
6/6 - 3s - loss: 0.0077 - accuracy: 0.9986 - val_loss: 0.0038 - val_accuracy: 1.0000
Epoch 7/20
6/6 - 3s - loss: 0.0064 - accuracy: 0.9986 - val_loss: 0.0032 - val_accuracy: 1.0000
Epoch 8/20
6/6 - 3s - loss: 0.0054 - accuracy: 0.9986 - val_loss: 0.0028 - val_accuracy: 1.0000
Epoch 9/20
6/6 - 3s - loss: 0.0049 - accuracy: 0.9986 - val_loss: 0.0025 - val_accuracy: 1.0000
Epoch 10/20
6/6 - 3s - loss: 0.0043 - accuracy: 0.9986 - val_loss: 0.0024 - val_accuracy: 1.0000
Epoch 11/20
6/6 - 3s - loss: 0.0038 - accuracy: 0.9986 - val_loss: 0.0022 - val_accuracy: 1.0000
Epoch 12/20
6/6 - 3s - loss: 0.0036 - 

In [30]:
print(end_time-start_time)

75.12945342063904


Training took about 75 seconds on multiple GPUs.

# Executing the model on a single GPU
I prepare a new dataset with the correct BATCH_SIZE and define the same model, but without the *strategy.scope()* block.

In [31]:
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * 1

train_ds_eager, valid_ds_eager = prepare_dataset(train_examples, 
                                                        validation_examples, 
                                                        num_examples, 
                                                        map_fn, BATCH_SIZE)
model2 = tf.keras.Sequential([
    hub.KerasLayer(MODULE_HANDLE, input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

model2.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_3 (KerasLayer)   (None, 2048)              23561152  
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 4098      
Total params: 23,565,250
Trainable params: 4,098
Non-trainable params: 23,561,152
_________________________________________________________________


In [32]:
# The last layer already applies softmax, so the loss receives probabilities, not logits.
model2.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
               optimizer=tf.keras.optimizers.Adam(),
               metrics=['accuracy'])

In [33]:
start_time=time.time()
model2.fit(train_ds_eager, epochs=20, validation_data=valid_ds_eager, verbose=2)
end_time=time.time()

Epoch 1/20
12/12 - 9s - loss: 0.2849 - accuracy: 0.8748 - val_loss: 0.0222 - val_accuracy: 1.0000
Epoch 2/20
12/12 - 4s - loss: 0.0147 - accuracy: 0.9972 - val_loss: 0.0051 - val_accuracy: 1.0000
Epoch 3/20
12/12 - 4s - loss: 0.0078 - accuracy: 0.9986 - val_loss: 0.0046 - val_accuracy: 1.0000
Epoch 4/20
12/12 - 4s - loss: 0.0040 - accuracy: 1.0000 - val_loss: 0.0025 - val_accuracy: 1.0000
Epoch 5/20
12/12 - 4s - loss: 0.0026 - accuracy: 1.0000 - val_loss: 0.0019 - val_accuracy: 1.0000
Epoch 6/20
12/12 - 4s - loss: 0.0023 - accuracy: 1.0000 - val_loss: 0.0018 - val_accuracy: 1.0000
Epoch 7/20
12/12 - 4s - loss: 0.0020 - accuracy: 1.0000 - val_loss: 0.0016 - val_accuracy: 1.0000
Epoch 8/20
12/12 - 4s - loss: 0.0018 - accuracy: 1.0000 - val_loss: 0.0015 - val_accuracy: 1.0000
Epoch 9/20
12/12 - 4s - loss: 0.0017 - accuracy: 1.0000 - val_loss: 0.0014 - val_accuracy: 1.0000
Epoch 10/20
12/12 - 4s - loss: 0.0016 - accuracy: 1.0000 - val_loss: 0.0013 - val_accuracy: 1.0000
Epoch 11/20
12/12 -

In [34]:
print(end_time-start_time)

90.77057695388794


This execution on a single GPU takes about 91 seconds, roughly 16 seconds more than the run on multiple GPUs.
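
The two wall-clock times printed above can be compared directly. This short calculation uses only the values from the notebook output; the variable names are illustrative.

```python
multi_gpu_seconds = 75.12945342063904   # printed after the MirroredStrategy run
single_gpu_seconds = 90.77057695388794  # printed after the single-GPU run

saved = single_gpu_seconds - multi_gpu_seconds
speedup = single_gpu_seconds / multi_gpu_seconds
reduction_pct = 100 * saved / single_gpu_seconds

print(f"saved {saved:.1f}s, speedup {speedup:.2f}x, {reduction_pct:.0f}% less time")
```

These figures come from a single run of each configuration on a small dataset, so they are an indication rather than a benchmark.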
# Conclusions, Fork & More. 
Thank you very much, Kaggle, for allowing us to use multiple GPUs. Not only because of the improvement that we can obtain in performance during the execution of our models, but also because of the possibility of testing these techniques in an environment as affordable as Kaggle's.

Using multiple GPUs is so easy that there really is no reason not to do it now that we can. The advantage obtained will be greater or smaller depending on our dataset and model. But we can always test and decide if we want to use multiple GPUs or just one.

If you want, you can fork the notebook and try different models and datasets. You can even try CentralStorageStrategy.

# More Notebooks in the TensorFlow Series
I'm working on a series of notebooks covering some interesting techniques in TensorFlow: *TensorFlow beyond the basics*.

How to create a Siamese Network to compare images. https://www.kaggle.com/code/peremartramanonellas/how-to-create-a-siamese-network-to-compare-images

Multiple outputs with Keras Functional API. 
https://www.kaggle.com/code/peremartramanonellas/guide-multiple-outputs-with-keras-functional-api/edit/run/109206893

Improve Tensorflow performance with Graph mode.
https://www.kaggle.com/code/peremartramanonellas/improve-tensorflow-performance-with-graph-mode/edit

On my Medium profile you can find the articles where each of the notebooks is explained.
https://medium.com/@peremartra

#### **Please, if you liked the notebook, consider upvoting it. It really encourages me to write more notebooks like this one.**
