# Distributed Training with GPUs on Cloud AI Platform

**Learning Objectives:**
  1. Set up the environment
  1. Create a model to train locally
  1. Train on multiple GPUs/CPUs with MultiWorkerMirrored Strategy

In this notebook, we will walk through using Cloud AI Platform to perform distributed training using the `MultiWorkerMirroredStrategy` found within `tf.distribute`. This strategy will allow us to use synchronous AllReduce training on a VM with multiple GPUs attached.

Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [Solution Notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/production_ml/solutions/distributed_training.ipynb) for reference. 


In [1]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

Next we will configure our environment. Be sure to change the `PROJECT_ID` variable in the cell below to your Project ID. This is the project to which the Cloud AI Platform resources will be billed. We will also create a bucket for our training artifacts (if it does not already exist).

## Lab Task #1: Setting up the environment


In [2]:
import os
# TODO 1
PROJECT_ID = "qwiklabs-gcp-03-3fcd4f8a7fba"  # Replace with your PROJECT
BUCKET = PROJECT_ID 
REGION = 'australia-southeast1'
os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["BUCKET"] = BUCKET


Since we are going to submit our training job to Cloud AI Platform, we need to create our trainer package. We will create the `train` directory for our package and create a blank `__init__.py` file so Python knows that this folder contains a package.

In [3]:
!mkdir train
!touch train/__init__.py
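
The same layout can also be created from Python, which may be convenient when the cell is rerun, since `mkdir` reports an error if the directory already exists. The following is a minimal sketch using only the standard library; the `make_package` helper name is hypothetical and not part of the lab.

```python
from pathlib import Path


def make_package(root="train"):
    """Create a package directory with an empty __init__.py (safe to rerun)."""
    pkg = Path(root)
    pkg.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    init_file = pkg / "__init__.py"
    init_file.touch(exist_ok=True)  # leaves any existing contents untouched
    return init_file
```

Calling `make_package("train")` from the notebook's working directory would produce the same `train/__init__.py` file as the shell commands above.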

Next we will create a module containing a function which will create our model. Note that we will be using the Fashion MNIST dataset. Since it's a small dataset, we will simply load it into memory to get the input shape for our model.

Our model will be a DNN with only dense layers, applying dropout to each hidden layer. We will also use ReLU activation for all hidden layers.

In [4]:
%%writefile train/model_definition.py
import tensorflow as tf
import numpy as np

# Get data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# add empty color dimension
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

def create_model():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=x_train.shape[1:]))
    model.add(tf.keras.layers.Dense(1028))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(512))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(256))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Activation('softmax'))
    return model

Writing train/model_definition.py
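
As a rough sanity check on model size, the dense layer widths above imply a fixed number of trainable parameters. The sketch below counts them by hand, assuming the standard 28x28 Fashion MNIST image size; `dense_param_count` is a hypothetical helper, and `Dropout` and `Activation` layers contribute no parameters.

```python
def dense_param_count(input_size, layer_widths):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    fan_in = input_size
    for width in layer_widths:
        total += fan_in * width + width  # weight matrix plus bias vector
        fan_in = width
    return total


# Fashion MNIST images are 28x28, plus the single channel dimension we added.
flattened_inputs = 28 * 28 * 1
total_params = dense_param_count(flattened_inputs, [1028, 512, 256, 10])
print(f"Trainable parameters: {total_params:,}")  # Trainable parameters: 1,467,726
```

The figure should match the total reported by `model.summary()` for the model returned by `create_model()`.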


Before we submit our training jobs to Cloud AI Platform, let's be sure our model runs locally. We will call the `create_model` function from the `model_definition` module to create our model and use `tf.keras.datasets.fashion_mnist.load_data()` to import the Fashion MNIST dataset.

## Lab Task #2: Create a model to train locally


In [6]:
import os
import time
import tensorflow as tf
import numpy as np
from train import model_definition

def create_dataset(X, Y, epochs, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
    return dataset.repeat(epochs).batch(batch_size, drop_remainder=True)

In [7]:
# Get data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# add empty color dimension
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

ds_train = create_dataset(x_train, y_train, 20, 5000)
ds_test = create_dataset(x_test, y_test, 1, 1000)
model = model_definition.create_model()

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['sparse_categorical_accuracy']
)
    
start = time.time()
model.fit(
    ds_train,
    validation_data=ds_test, 
    verbose=1
)
print(f"Training time without GPUs locally: {time.time() - start} sec.")

Training time without GPUs locally: 134.2877902984619
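
Note that `model.fit` is called without an `epochs` argument, so Keras runs a single epoch over a dataset that has already been repeated 20 times. The sketch below works out how many steps that is, assuming the standard Fashion MNIST split of 60,000 training and 10,000 test images; `steps_per_fit` is a hypothetical helper for the arithmetic only.

```python
def steps_per_fit(num_examples, repeats, batch_size):
    """Batches yielded by repeat(repeats).batch(batch_size, drop_remainder=True)."""
    # repeat() comes first, so batching runs over the whole repeated stream
    # and only the final partial batch (if any) is dropped.
    return (num_examples * repeats) // batch_size


train_steps = steps_per_fit(60_000, repeats=20, batch_size=5000)
test_steps = steps_per_fit(10_000, repeats=1, batch_size=1000)
print(train_steps, test_steps)  # 240 10
```

So the single Keras epoch here covers 240 training steps, which is equivalent to 20 passes over the training data.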




## Train on multiple GPUs/CPUs with MultiWorkerMirrored Strategy


That took a few minutes to train our model for 20 epochs. Let's see how we can do better using Cloud AI Platform. We will be leveraging the `MultiWorkerMirroredStrategy` supplied in `tf.distribute`. The main difference between this code and the code from the local test is that we need to create and compile the model within the scope of the strategy. When we do this, our training op will use the information stored in the `TF_CONFIG` environment variable to assign ops to the various devices for the AllReduce strategy.
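
To make the idea of synchronous AllReduce more concrete, here is a conceptual sketch in plain Python. It is not how TensorFlow implements its collective ops; the function names and the toy numbers are made up for illustration. Each replica computes gradients on its own slice of the batch, the gradients are averaged across replicas, and every replica applies the same update so the model copies stay identical.

```python
def all_reduce_mean(per_replica_grads):
    """Average gradients element-wise and hand every replica the same result."""
    num_replicas = len(per_replica_grads)
    averaged = [sum(values) / num_replicas for values in zip(*per_replica_grads)]
    return [list(averaged) for _ in range(num_replicas)]


def sgd_step(weights, grads, learning_rate):
    """Plain gradient descent update for one replica."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]


# Two replicas start from identical weights but see different data.
weights = [[1.0, 2.0], [1.0, 2.0]]
local_grads = [[0.2, 0.4], [0.6, 0.0]]

synced_grads = all_reduce_mean(local_grads)
weights = [sgd_step(w, g, learning_rate=0.5) for w, g in zip(weights, synced_grads)]
print(weights)  # both replicas hold the same updated weights
```

Because every replica has to wait for the averaged gradients before updating, each step is only as fast as the slowest replica plus the communication cost.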

After the training process finishes, we will print out the time spent training. Since it takes a few minutes to spin up the resources being used for training on Cloud AI Platform, and this time can vary, we want a consistent measure of how long training took.

Note: When we train models on Cloud AI Platform, the `TF_CONFIG` variable is set automatically, so we do not need to adjust it for the cluster configuration we use.
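
`TF_CONFIG` is a JSON string with a `cluster` entry (the addresses of each task type) and a `task` entry (the role of the current node). The sketch below parses a hypothetical value for illustration; the host names are made up, and the real value is provided by the service for each node.

```python
import json

# Hypothetical TF_CONFIG for one chief and one worker (host names are made up).
example_tf_config = {
    "cluster": {
        "chief": ["chief-0.example.internal:2222"],
        "worker": ["worker-0.example.internal:2222"],
    },
    "task": {"type": "worker", "index": 0},
}


def describe_task(environ):
    """Return (task_type, task_index, total_tasks) parsed from TF_CONFIG."""
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    total_tasks = sum(len(addresses) for addresses in cluster.values())
    return task.get("type"), task.get("index"), total_tasks


fake_environ = {"TF_CONFIG": json.dumps(example_tf_config)}
print(describe_task(fake_environ))  # ('worker', 0, 2)
print(describe_task({}))            # (None, None, 0) when TF_CONFIG is unset
```

Inside a real training job, passing `os.environ` to `describe_task` would show which role the current node has been assigned.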

In [8]:
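%%writefile train/train_mult_worker_mirrored.py
import time

import numpy as np
import tensorflow as tf

# Relative import: this module runs as part of the `train` package
from . import model_definition

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()


def create_dataset(X, Y, epochs, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
    return dataset.repeat(epochs).batch(batch_size, drop_remainder=True)


# Get data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# add empty color dimension
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

ds_train = create_dataset(x_train, y_train, 20, 5000)
ds_test = create_dataset(x_test, y_test, 1, 1000)
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Create and compile the model inside the strategy scope so that its
# variables are mirrored across all replicas
with strategy.scope():
    model = model_definition.create_model()
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss='sparse_categorical_crossentropy',
        metrics=['sparse_categorical_accuracy']
    )

start = time.time()
model.fit(
    ds_train,
    validation_data=ds_test,
    verbose=2
)
print(f"Training time with multiple GPUs: {time.time() - start} sec.")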

Writing train/train_mult_worker_mirrored.py


## Lab Task #3: Training with multiple GPUs/CPUs on created model using MultiWorkerMirrored Strategy


First we will train a model without using GPUs to give us a baseline. We will use a consistent format throughout the trials: we define a `config.yaml` file containing our cluster configuration and then pass this file in as the value of the command-line argument `--config`.

In our first example, we will use a single `n1-highcpu-16` VM.

In [9]:
%%writefile config.yaml
# TODO 3a
# Configure a master worker
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16

Writing config.yaml


In [12]:
%%bash

now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="cpu_only_fashion_minst_$now"

gcloud ai-platform jobs submit training $JOB_NAME \
  --staging-bucket=gs://$BUCKET \
  --package-path=train \
  --module-name=train.train_mult_worker_mirrored \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --region=australia-southeast1 \
  --config config.yaml

# Note: the GPU jobs later in this notebook use us-west1 rather than australia-southeast1

jobId: cpu_only_fashion_minst_20220307_091331
state: QUEUED


Job [cpu_only_fashion_minst_20220307_091331] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe cpu_only_fashion_minst_20220307_091331

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs cpu_only_fashion_minst_20220307_091331


If we go through the logs, we see that the training job will take around 5-7 minutes to complete. Let's now attach two Nvidia Tesla K80 GPUs and rerun the training job.

In [13]:
%%writefile config.yaml
# TODO 3b
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 2
      type: NVIDIA_TESLA_K80

Overwriting config.yaml


In [15]:
%%bash

now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="multi_gpu_fashion_minst_2gpu_$now"

gcloud ai-platform jobs submit training $JOB_NAME \
  --staging-bucket=gs://$BUCKET \
  --package-path=train \
  --module-name=train.train_mult_worker_mirrored \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --region=us-west1 \
  --config config.yaml

# us-west1, rather than australia-southeast1

jobId: multi_gpu_fashion_minst_2gpu_20220307_091536
state: QUEUED


Job [multi_gpu_fashion_minst_2gpu_20220307_091536] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe multi_gpu_fashion_minst_2gpu_20220307_091536

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs multi_gpu_fashion_minst_2gpu_20220307_091536


That was a lot faster! The training job will take up to 5-10 minutes to complete. Let's keep going and add more GPUs!

In [16]:
%%writefile config.yaml
# TODO 3c
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80

Overwriting config.yaml
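
The three configurations in this section differ only in the accelerator count, so the file contents can also be generated rather than edited by hand. This is a minimal sketch that reuses the field names from the configs above; the `make_config` helper and its default arguments are hypothetical and not part of the lab.

```python
def make_config(gpu_count=0, machine_type="n1-highcpu-16",
                gpu_type="NVIDIA_TESLA_K80"):
    """Return config.yaml text for a single master VM with optional GPUs."""
    lines = [
        "trainingInput:",
        "  scaleTier: CUSTOM",
        f"  masterType: {machine_type}",
    ]
    if gpu_count > 0:
        # Only attach an accelerator block when GPUs are requested
        lines += [
            "  masterConfig:",
            "    acceleratorConfig:",
            f"      count: {gpu_count}",
            f"      type: {gpu_type}",
        ]
    return "\n".join(lines) + "\n"


print(make_config(gpu_count=4))
```

Writing the returned text to `config.yaml` and resubmitting the job would be one way to try other GPU counts.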


In [17]:
%%bash

now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="multi_gpu_fashion_minst_4gpu_$now"

gcloud ai-platform jobs submit training $JOB_NAME \
  --staging-bucket=gs://$BUCKET \
  --package-path=train \
  --module-name=train.train_mult_worker_mirrored \
  --runtime-version=2.3 \
  --python-version=3.7 \
  --region=us-west1 \
  --config config.yaml

# us-west1, rather than australia-southeast1

jobId: multi_gpu_fashion_minst_4gpu_20220307_091722
state: QUEUED


Job [multi_gpu_fashion_minst_4gpu_20220307_091722] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe multi_gpu_fashion_minst_4gpu_20220307_091722

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs multi_gpu_fashion_minst_4gpu_20220307_091722


The training job will take up to 10 minutes to complete. It was faster than using no GPUs, but why was it slower than using 2 GPUs? If you rerun this job with 8 GPUs, you'll actually see that it takes just as long as using no GPUs!

The answer is in our input pipeline. In short, the I/O involved in using more GPUs started to outweigh the benefits of having more available devices. We can try to improve our input pipelines to overcome this (e.g. using caching, adjusting batch size, etc.). 
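
One way to reason about the batch size suggestion is sketched below. It assumes that the batch size passed to `create_dataset` acts as the global batch, which is split across replicas on every step (the usual behaviour for synchronous mirrored training with Keras `fit`). Both helper names are hypothetical, and requiring an even split is a simplifying assumption of this sketch.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Examples each replica processes per step for a fixed global batch."""
    if global_batch_size % num_replicas:
        # Simplifying assumption for this sketch: require an even split
        raise ValueError("global batch size must divide evenly across replicas")
    return global_batch_size // num_replicas


def scaled_global_batch_size(per_replica_size, num_replicas):
    """Global batch that keeps the per-replica workload constant."""
    return per_replica_size * num_replicas


for gpus in (1, 2, 4, 8):
    print(gpus, per_replica_batch_size(5000, gpus))
```

With the global batch fixed at 5000, each added GPU gets less work per step while the synchronization cost per step remains, which is consistent with the diminishing returns observed above. Scaling the global batch with the number of replicas is one possible adjustment, though it may also require retuning the learning rate.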
