In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Distributed training with Reduction Server

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview

This notebook demonstrates how to optimize large distributed Vertex AI training jobs using Reduction Server.

The machine learning task in this example is fine-tuning a BERT model for sentence prediction using the Multi-Genre Natural Language Inference (MNLI) corpus from the GLUE benchmark.

The example uses components from the [TensorFlow NLP Modelling Toolkit](https://github.com/tensorflow/models/tree/master/official/nlp#tensorflow-nlp-modelling-toolkit), and distributed training is implemented using [tf.distribute.MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy).

For more information about using Reduction Server to optimize distributed training, refer to the [Optimizing distributed training with Vertex AI Reduction Server](tbd) article.

### Dataset



The example uses the *glue/mnli* dataset from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/glue).

### Objective

In this notebook, you will learn how to configure, submit, and monitor a Vertex AI custom training job that uses Reduction Server to optimize the network bandwidth and latency of the gradient reduction operation in a distributed training setting.

The steps performed include:

- Building a custom training container image based on TensorFlow NLP Modelling Toolkit
- Converting the glue/mnli dataset to the format required by TensorFlow NLP Modelling Toolkit
- Preparing a Vertex AI custom container training job that uses Reduction Server
- Submitting and monitoring the job

### Costs 


This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage


Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Set up your local development environment

**If you are using Colab or Google Cloud Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* The Google Cloud SDK
* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

1. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3. Activate the virtual environment.

1. To install Jupyter, run `pip3 install jupyter` on the
command-line in a terminal shell.

1. To launch Jupyter, run `jupyter notebook` on the command-line in a terminal shell.

1. Open this notebook in the Jupyter Notebook Dashboard.

### Install the required packages

Install TensorFlow NLP Modelling Toolkit

In [42]:
import os

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# Google Cloud Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_GOOGLE_CLOUD_NOTEBOOK:
    USER_FLAG = "--user"

In [43]:
! pip3 install --upgrade tf-models-official==2.5.0 tensorflow-text==2.5.0



Install the latest version of Vertex SDK

In [44]:
! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG



Install the latest GA version of google-cloud-storage library as well 

In [45]:
! pip3 install --upgrade google-cloud-storage $USER_FLAG



### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [46]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin


### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the Vertex AI API and Compute Engine API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component).

1. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).

1. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [2]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

### Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click *Create*. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [3]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# The Google Cloud Notebook product has specific requirements
IS_GOOGLE_CLOUD_NOTEBOOK = os.path.exists("/opt/deeplearning/metadata/env_version")

# If on Google Cloud Notebooks, then don't execute this code
if not IS_GOOGLE_CLOUD_NOTEBOOK:
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**


In this example, your training application uses Cloud Storage for accessing training and validation datasets and for storing checkpoints. 

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI.

In [4]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "[your-region]"  # @param {type:"string"}

In [5]:
BUCKET_NAME = "gs://jk-rs-example"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

In [6]:
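import time
TIMESTAMP = time.strftime("%Y%m%d%H%M%S")  # TIMESTAMP is not defined earlier in this notebook
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP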

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [7]:
! gsutil mb -l $REGION $BUCKET_NAME

Creating gs://jk-rs-example/...
ServiceException: 409 A Cloud Storage bucket named 'jk-rs-example' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [8]:
! gsutil ls -al $BUCKET_NAME

                                 gs://jk-rs-example/MNLI_20210708_222812/
                                 gs://jk-rs-example/datasets/


### Import libraries and define constants

In [9]:
import json
import pprint
import sys
import shutil
import time
import tensorflow as tf

from official.nlp.bert import tokenization
from official.nlp.data import classifier_data_lib

from google.cloud import aiplatform
from google.cloud.aiplatform_v1beta1 import types
from google.cloud.aiplatform_v1beta1.services.job_service import \
    JobServiceClient

### Vertex constants

Set up the following constants for Vertex:

- API_ENDPOINT: The Vertex API service endpoint for job services.
- PARENT: The Vertex location root path for job resources.

In [10]:
API_ENDPOINT = '{}-aiplatform.googleapis.com'.format(REGION)
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION
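
As a small illustration, the two constants can be wrapped in helper functions so that their formats are explicit. The function names `vertex_api_endpoint` and `vertex_parent` and the project ID `my-project` are hypothetical and used only for this sketch; they are not part of the Vertex SDK.

```python
def vertex_api_endpoint(region: str) -> str:
    """Return the regional Vertex AI API endpoint host name."""
    return f"{region}-aiplatform.googleapis.com"


def vertex_parent(project_id: str, region: str) -> str:
    """Return the location root path used as the parent of job resources."""
    return f"projects/{project_id}/locations/{region}"


# "my-project" is a placeholder project ID used only for illustration.
print(vertex_api_endpoint("us-central1"))
print(vertex_parent("my-project", "us-central1"))
```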

## Tutorial


### Set up clients

The Vertex client library follows a client/server model. In your Python code, you create a client that sends requests to the Vertex AI service and receives its responses.

In this example, you use the Job Service client for submitting and monitoring custom training jobs.

In [11]:
client_options = {"api_endpoint": API_ENDPOINT}
job_client = JobServiceClient(client_options=client_options)

### Prepare training and validation datasets

The TensorFlow NLP Modelling Toolkit, which you use to fine-tune BERT, requires datasets in a specific format. The toolkit includes a set of utility functions to help with data conversion. You will use them to convert the *glue/mnli* dataset from TensorFlow Datasets.

In [12]:
def generate_mnli_tfrecords(train_data_output_path, 
                            eval_data_output_path,
                            metadata_file_path,
                            vocab_file='gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16/vocab.txt', 
                            mnli_type='matched', 
                            max_seq_length=128, 
                            do_lower_case=True):
    """Generates MNLI training and validation splits in the TFRecord format
    compatible with TensorFlow NLP Modelling Toolkit."""

    tokenizer = tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=do_lower_case)

    processor_text_fn = tokenization.convert_to_unicode

    if mnli_type == 'matched':
        tfds_params = 'dataset=glue/mnli,text_key=hypothesis,text_b_key=premise,train_split=train,dev_split=validation_matched'
    else: 
        tfds_params = 'dataset=glue/mnli,text_key=hypothesis,text_b_key=premise,train_split=train,dev_split=validation_mismatched'

    processor = classifier_data_lib.TfdsProcessor(
        tfds_params=tfds_params, process_text_fn=processor_text_fn)

    metadata = classifier_data_lib.generate_tf_record_from_data_file(
        processor,
        None,
        tokenizer,
        train_data_output_path=train_data_output_path,
        eval_data_output_path=eval_data_output_path,
        max_seq_length=max_seq_length)

    with tf.io.gfile.GFile(metadata_file_path, "w") as writer:
        writer.write(json.dumps(metadata, indent=4) + "\n")

In [13]:
OUTPUT_LOCATION = f'{BUCKET_NAME}/datasets/MNLI'
TRAIN_FILE = f'{OUTPUT_LOCATION}/mnli_train.tf_record'
EVAL_FILE = f'{OUTPUT_LOCATION}/mnli_valid.tf_record'
METADATA_FILE = f'{OUTPUT_LOCATION}/metadata.json'

In [14]:
generate_mnli_tfrecords(TRAIN_FILE, EVAL_FILE, METADATA_FILE)

INFO:absl:Load dataset info from /home/jupyter/tensorflow_datasets/glue/mnli/1.0.0
INFO:absl:Reusing dataset glue (/home/jupyter/tensorflow_datasets/glue/mnli/1.0.0)
INFO:absl:Constructing tf.data.Dataset glue for split None, from /home/jupyter/tensorflow_datasets/glue/mnli/1.0.0
INFO:absl:Writing example 0 of 392702
INFO:absl:*** Example ***
INFO:absl:guid: train-0
INFO:absl:tokens: [CLS] meaningful partnerships with stakeholders is crucial . [SEP] in recognition of these tensions , l ##sc has worked dil ##igen ##tly since 1995 to convey the expectations of the state planning initiative and to establish meaningful partnerships with stakeholders aimed at foster ##ing a new sy ##mb ##ios ##is between the federal provider and recipients of legal services funding . [SEP]
INFO:absl:input_ids: 101 15902 13797 2007 22859 2003 10232 1012 102 1999 5038 1997 2122 13136 1010 1048 11020 2038 2499 29454 29206 14626 2144 2786 2000 16636 1996 10908 1997 1996 2110 4041 6349 1998 2000 5323 15902 13797

In [15]:
! gsutil ls {OUTPUT_LOCATION}

gs://jk-rs-example/datasets/MNLI/
gs://jk-rs-example/datasets/MNLI/metadata.json
gs://jk-rs-example/datasets/MNLI/mnli_train.tf_record
gs://jk-rs-example/datasets/MNLI/mnli_valid.tf_record


In [16]:
! gsutil cat {METADATA_FILE}

{
    "processor_type": "TFDS_glue/mnli",
    "train_data_size": 392702,
    "max_seq_length": 128,
    "task_type": "bert_classification",
    "num_labels": 3,
    "eval_data_size": 9815
}
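
The example logged during conversion shows the sentence-pair layout written to the TFRecord files: `[CLS] text_a [SEP] text_b [SEP]`, padded to `max_seq_length`. The sketch below reproduces that standard BERT layout in plain Python. It is an illustration only: the function name `pack_sentence_pair` is hypothetical, and the toolkit's own implementation may truncate and pad differently.

```python
def pack_sentence_pair(tokens_a, tokens_b, max_seq_length=128):
    """Pack two token lists into a BERT-style sentence-pair example."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)

    # Reserve three positions for [CLS], [SEP], [SEP]; trim the longer side first.
    while len(tokens_a) + len(tokens_b) > max_seq_length - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + text_a + [SEP]; segment 1 covers text_b + [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)

    # Pad every field to the fixed sequence length.
    padding = max_seq_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * padding,
        "segment_ids": segment_ids + [0] * padding,
        "input_mask": input_mask + [0] * padding,
    }


example = pack_sentence_pair(["a", "b"], ["c"], max_seq_length=8)
print(example["tokens"])
print(example["segment_ids"])
print(example["input_mask"])
```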


### Build a training container

There are two ways you can train a custom model using a container image:

- **Use a Google Cloud prebuilt container**. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. 

- **Use your own custom container image**. If you use your own container, the container needs to contain your code for training a custom model.

In this example, you use a custom container image.

#### Create a training script

Your training script is based on [the common training driver](https://github.com/tensorflow/models/blob/master/official/nlp/docs/train.md) from TensorFlow NLP Modelling Toolkit. The base driver has been adapted to work seamlessly on a distributed compute environment provisioned when running a Vertex training job. 

In [19]:
! rm -rf training_image
! mkdir training_image
! mkdir training_image/trainer

In [20]:
%%writefile training_image/trainer/train.py

# Copyright 2021 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""TFM common training driver."""

import os
import json

from absl import app
from absl import flags
from absl import logging
import gin


from official.common import distribute_utils
# pylint: disable=unused-import
from official.common import registry_imports
# pylint: enable=unused-import
from official.common import flags as tfm_flags
from official.core import task_factory
from official.core import train_lib
from official.core import train_utils
from official.modeling import performance

from tensorflow.dtypes import float16, bfloat16, float32

FLAGS = flags.FLAGS

def _get_model_dir(model_dir):
  """Adjusts model dir for multi-worker training.
  
     Checkpointing and saving need to happen on each worker, and each worker
     must write to a different path so that workers do not overwrite each
     other's files. This utility function adjusts the base model dir passed as
     a flag using the Vertex AI cluster topology from TF_CONFIG.
  """
  
  def _is_chief(task_type, task_id):
    return ((task_type == 'chief' and task_id == 0) or task_type is None)
  
  tf_config = os.getenv('TF_CONFIG')
  if tf_config:
    tf_config = json.loads(tf_config)
   
    if not _is_chief(tf_config['task']['type'], tf_config['task']['index']):
      model_dir = os.path.join(model_dir, 'worker-{}'.format(tf_config['task']['index']))
  
  logging.info('Setting model_dir to: %s', model_dir)
  
  return model_dir

def main(_):
  
  model_dir = _get_model_dir(FLAGS.model_dir)

  gin.parse_config_files_and_bindings(FLAGS.gin_file, FLAGS.gin_params)
  params = train_utils.parse_configuration(FLAGS)
  
  if 'train' in FLAGS.mode:
    # Pure eval modes do not output yaml files. Otherwise continuous eval job
    # may race against the train job for writing the same file.
    train_utils.serialize_config(params, model_dir)

  # Sets mixed_precision policy. Using 'mixed_float16' or 'mixed_bfloat16'
  # can have significant impact on model speeds by utilizing float16 in case of
  # GPUs, and bfloat16 in the case of TPUs. loss_scale takes effect only when
  # dtype is float16
  if params.runtime.mixed_precision_dtype:
    performance.set_mixed_precision_policy(params.runtime.mixed_precision_dtype)
  distribution_strategy = distribute_utils.get_distribution_strategy(
      distribution_strategy=params.runtime.distribution_strategy,
      all_reduce_alg=params.runtime.all_reduce_alg,
      num_gpus=params.runtime.num_gpus,
      tpu_address=params.runtime.tpu,
      **params.runtime.model_parallelism())
  with distribution_strategy.scope():
    task = task_factory.get_task(params.task, logging_dir=model_dir)

  train_lib.run_experiment(
      distribution_strategy=distribution_strategy,
      task=task,
      mode=FLAGS.mode,
      params=params,
      model_dir=model_dir)

  train_utils.save_gin_config(FLAGS.mode, model_dir)

if __name__ == '__main__':
  tfm_flags.define_flags()
  app.run(main)


Writing training_image/trainer/train.py
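
To make the per-worker model directory logic in the script easier to follow, here is a standalone sketch of the same decision. It takes the `TF_CONFIG` JSON as an argument instead of reading the environment so that it can be run anywhere. The function name `resolve_model_dir`, the bucket path, and the sample `TF_CONFIG` values are hypothetical.

```python
import json
import os


def resolve_model_dir(base_dir, tf_config_json=None):
    """Return base_dir for the chief and a per-worker subdirectory otherwise."""
    if not tf_config_json:
        # No TF_CONFIG means single-node training: use the base directory.
        return base_dir

    task = json.loads(tf_config_json)["task"]
    task_type, task_index = task.get("type"), task.get("index")
    is_chief = task_type is None or (task_type == "chief" and task_index == 0)
    if is_chief:
        return base_dir
    return os.path.join(base_dir, "worker-{}".format(task_index))


# Hypothetical TF_CONFIG values for a two-node cluster.
chief_config = json.dumps({
    "cluster": {"chief": ["host0:2222"], "worker": ["host1:2222"]},
    "task": {"type": "chief", "index": 0},
})
worker_config = json.dumps({
    "cluster": {"chief": ["host0:2222"], "worker": ["host1:2222"]},
    "task": {"type": "worker", "index": 0},
})

print(resolve_model_dir("gs://my-bucket/model", chief_config))
print(resolve_model_dir("gs://my-bucket/model", worker_config))
```

Only the chief writes to the base directory; every other replica writes to its own `worker-<index>` subdirectory.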


In [21]:
%%writefile training_image/trainer/glue_mnli_matched.yaml

task:
  hub_module_url: 'https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4'
  model:
    num_classes: 3
  init_checkpoint: ''
  metric_type: 'accuracy'
  train_data:
    drop_remainder: true
    global_batch_size: 32
    input_path: ''
    is_training: true
    seq_length: 128
    label_type: 'int'
  validation_data:
    drop_remainder: false
    global_batch_size: 32
    input_path: ''
    is_training: false
    seq_length: 128
    label_type: 'int'
trainer:
  checkpoint_interval: 3000
  optimizer_config:
    learning_rate:
      polynomial:
        # 100% of train_steps.
        decay_steps: 36813
        end_learning_rate: 0.0
        initial_learning_rate: 3.0e-05
        power: 1.0
      type: polynomial
    optimizer:
      type: adamw
    warmup:
      polynomial:
        power: 1
        # ~10% of train_steps.
        warmup_steps: 3681
      type: polynomial
  steps_per_loop: 1000
  summary_interval: 1000
  # Training data size 392,702 examples, 3 epochs.
  train_steps: 36813
  validation_interval: 6135
  # Eval data size = 9815 examples.
  validation_steps: 307
  best_checkpoint_export_subdir: 'best_ckpt'
  best_checkpoint_eval_metric: 'cls_accuracy'
  best_checkpoint_metric_comp: 'higher'

Writing training_image/trainer/glue_mnli_matched.yaml
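
The step counts in the YAML file follow from the dataset sizes and the global batch size. The sketch below reproduces the arithmetic. The reading of `validation_interval` as roughly half an epoch is an inference from the numbers, not something the configuration states.

```python
import math

TRAIN_EXAMPLES = 392_702   # train_data_size from metadata.json
EVAL_EXAMPLES = 9_815      # eval_data_size from metadata.json
GLOBAL_BATCH_SIZE = 32
EPOCHS = 3

# drop_remainder is true for training data, so partial batches are discarded.
steps_per_epoch = TRAIN_EXAMPLES // GLOBAL_BATCH_SIZE
train_steps = steps_per_epoch * EPOCHS

# Warmup covers about 10% of training.
warmup_steps = train_steps // 10

# Assumption: validation runs about twice per epoch.
validation_interval = steps_per_epoch // 2

# drop_remainder is false for validation data, so the last partial batch counts.
validation_steps = math.ceil(EVAL_EXAMPLES / GLOBAL_BATCH_SIZE)

print(train_steps, warmup_steps, validation_interval, validation_steps)
# 36813 3681 6135 307
```

If you change the global batch size or the number of epochs, recompute these values and pass them through `params_override` rather than editing the packaged YAML file.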


#### Create the Dockerfile

The custom training container image packages TensorFlow NLP Modelling Toolkit with the training script and the default configuration file created in the previous steps. It also installs the Reduction Server NCCL plugin.

In [22]:
%%writefile training_image/Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5

RUN apt remove -y google-fast-socket \
&&  echo "deb https://packages.cloud.google.com/apt google-fast-socket main" | tee /etc/apt/sources.list.d/google-fast-socket.list \
&&  curl -s -L https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
&&  apt update && apt install -y google-reduction-server

RUN pip install tf-models-official==2.5.0 tensorflow-text==2.5.0

WORKDIR /

COPY trainer /trainer

ENTRYPOINT ["python"]
CMD ["-c", "print('TF Model Garden')"]

Writing training_image/Dockerfile


#### Build the container image and push it to Container Registry

In [23]:
TRAIN_IMAGE = f'gcr.io/{PROJECT_ID}/mnli_finetuning'

In [24]:
! docker build -t {TRAIN_IMAGE} training_image

Sending build context to Docker daemon  9.216kB
Step 1/7 : FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
latest: Pulling from deeplearning-platform-release/tf2-gpu.2-5

In [25]:
! docker push {TRAIN_IMAGE}

Using default tag: latest
The push refers to repository [gcr.io/jk-mlops-dev/mnli_finetuning]

### Prepare your custom job specification

You can now create a Job Specification for your distributed training job with Reduction Server. 

If you run a distributed training job with Vertex AI, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. A group of replicas with the same configuration is called a worker pool. Vertex AI provides 4 worker pools to cover the different types of machine tasks.

Worker pool 0 configures the primary replica, also called the chief, scheduler, or "master". This worker generally takes on some extra work such as saving checkpoints and writing summary files. There is only ever one chief worker in a cluster, so your worker count for worker pool 0 will always be 1.

Worker pool 1 is where you configure the rest of the workers for your cluster. 
 
Worker pool 2 manages Reduction Server reducers. 

Worker pools 0 and 1 run your custom training container you created in the previous step. Worker pool 2 uses the Reduction Server image provided by Vertex AI.
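
The pool layout described above can be summarized with a short sketch that maps a total number of GPU replicas and a number of reducers to worker pool indexes. The function name `describe_worker_pools` and the role labels are hypothetical and serve only to illustrate the topology.

```python
def describe_worker_pools(replica_count, reduction_server_count):
    """Map worker pool index to a (role, replica_count) pair."""
    # Pool 0 always holds exactly one chief replica.
    pools = {0: ("chief", 1)}
    # Pool 1 holds the remaining GPU workers, if any.
    if replica_count > 1:
        pools[1] = ("workers", replica_count - 1)
    # Pool 2 holds the Reduction Server reducers.
    if reduction_server_count > 0:
        pools[2] = ("reducers", reduction_server_count)
    return pools


# Four GPU replicas in total (one chief, three workers) and two reducers.
print(describe_worker_pools(replica_count=4, reduction_server_count=2))
# {0: ('chief', 1), 1: ('workers', 3), 2: ('reducers', 2)}
```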

The helper function below creates a custom job specification using the described worker pool topology.


In [46]:
def prepare_custom_job_spec(
    job_name,
    image_uri,
    args,
    cmd, 
    replica_count=1,
    machine_type='n1-standard-4',
    accelerator_count=0,
    accelerator_type='ACCELERATOR_TYPE_UNSPECIFIED',
    reduction_server_count=0,
    reduction_server_machine_type='n1-highcpu-16',
    reduction_server_image_uri='us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest'
):

    if accelerator_count > 0:
        machine_spec = {
            'machine_type': machine_type,
            'accelerator_type': accelerator_type,
            'accelerator_count': accelerator_count,
        }
    else:
        machine_spec = {
            'machine_type': machine_type
        }
    
    container_spec = {
        'image_uri': image_uri,
        'args': args,
        'command': cmd,
    }
    
    chief_spec = {
        'replica_count': 1,
        'machine_spec': machine_spec,
        'container_spec': container_spec
    }

    worker_pool_specs = [chief_spec]
    if replica_count > 1:
        workers_spec = {
            'replica_count': replica_count - 1,
            'machine_spec': machine_spec,
            'container_spec': container_spec
        }
        worker_pool_specs.append(workers_spec)
        
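    # Reduction Server replicas are configured in their own worker pool.
    # Use `> 0` so that a single reducer is not silently dropped.
    if reduction_server_count > 0:
        reducers_spec = {
            'replica_count': reduction_server_count,
            'machine_spec': {
                'machine_type': reduction_server_machine_type,
            },
            'container_spec': {
                'image_uri': reduction_server_image_uri
            }
        }
        worker_pool_specs.append(reducers_spec)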
        
    custom_job_spec = {
        'display_name': job_name,
        'job_spec': {
            'worker_pool_specs': worker_pool_specs
        }
    }
    
    return custom_job_spec

#### Configure worker pools

Adjust the following constants to reflect the machine types and the number of replicas for your worker pools.

When choosing the number and type of reducers, you should consider the network bandwidth supported by a reducer replica's machine type. In GCP, a VM's machine type defines its maximum possible egress bandwidth. For example, the egress bandwidth of the n1-highcpu-16 machine type is capped at 32 Gbps.

Because reducers perform a very limited function, aggregating blocks of gradients, they can run on relatively low-powered and cost-effective machines. Even with a large number of gradients, this computation does not require accelerated hardware or high CPU or memory resources. However, to avoid network bottlenecks, the total aggregate bandwidth of all replicas in the reducer worker pool must be greater than or equal to the total aggregate bandwidth of all replicas in worker pools 0 and 1, which host the GPU workers.

Refer to [the article](TBD) for more information about configuring worker pools when using Reduction Server.
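
The bandwidth rule above translates into a simple lower bound on the number of reducers. The sketch below computes it, assuming every GPU worker has the same egress bandwidth. The function name `min_reducer_count` is hypothetical, and the worker bandwidth values in the usage lines are placeholders; look up the actual egress limit of your worker machine type in the Compute Engine documentation.

```python
import math


def min_reducer_count(worker_count, worker_egress_gbps, reducer_egress_gbps=32.0):
    """Smallest reducer count whose total bandwidth covers the GPU workers.

    The default reducer bandwidth is the 32 Gbps limit of n1-highcpu-16.
    """
    total_worker_bandwidth = worker_count * worker_egress_gbps
    return math.ceil(total_worker_bandwidth / reducer_egress_gbps)


# Placeholder figures: 8 workers at 32 Gbps each, then 3 workers at 100 Gbps each.
print(min_reducer_count(worker_count=8, worker_egress_gbps=32))
print(min_reducer_count(worker_count=3, worker_egress_gbps=100))
```

Treat the result as a starting point. The best reducer count for a given job may also depend on cost and on how much of the available bandwidth the training workload actually uses.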

In [56]:
REPLICA_COUNT = 1
WORKER_MACHINE_TYPE = 'a2-highgpu-1g'
ACCELERATOR_TYPE = 'NVIDIA_TESLA_A100'
PER_MACHINE_ACCELERATOR_COUNT = 1
PER_REPLICA_BATCH_SIZE = 32

REDUCTION_SERVER_COUNT = 0
REDUCTION_SERVER_MACHINE_TYPE = 'n1-highcpu-16'

#### Configure the MNLI experiment settings

The training script supports overriding the default configuration of a TensorFlow Modelling Toolkit task using YAML configuration files and command line parameters. The base configuration for the MNLI fine-tuning task is defined in the YAML file packaged into the training container. With each training run, you can override selected parameters using the `params_override` command line argument of the training script.

The `params_override` argument accepts a string of comma-separated key/value pairs, one for each parameter to be overridden.

The next cell overrides the following parameters:
- `trainer.train_steps` - The number of training steps. Recall that there are 392,702 examples in the training dataset.
- `trainer.steps_per_loop` - The training script prints a training progress update every `steps_per_loop` steps.
- `trainer.summary_interval` - The training script logs TensorBoard summaries every `summary_interval` steps.
- `trainer.validation_interval` - The training script runs validation every `validation_interval` steps.
- `trainer.checkpoint_interval` - The training script creates a checkpoint every `checkpoint_interval` steps.
- `task.train_data.global_batch_size` - The global batch size for training data. Adjust this value based on the GPU type and the number of GPU workers. For example, when using NVIDIA V100 GPUs, a batch size of 16 per GPU is a good starting point. With 2 workers, the `global_batch_size` would be 32.
- `task.validation_data.global_batch_size` - The global batch size for validation data.
- `task.train_data.input_path` - The location of the training dataset.
- `task.validation_data.input_path` - The location of the validation dataset.
- `runtime.num_gpus` - The number of GPUs to use on each worker. Adjust this value based on the worker machine type.
- `runtime.distribution_strategy` - The TensorFlow distribution strategy.
- `runtime.all_reduce_alg=nccl` - You must use NVIDIA NCCL with Reduction Server.
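
As a rough sketch of how the override string is put together, the snippet below computes a global batch size and joins key/value pairs into the comma-separated form that `params_override` expects. The helper names `global_batch_size` and `build_params_override` are hypothetical. Multiplying by the number of GPUs per worker is an assumption for machines with more than one GPU; the text above covers the one-GPU-per-worker case only.

```python
def global_batch_size(per_gpu_batch_size, gpus_per_worker, worker_count):
    """Global batch size when every GPU processes per_gpu_batch_size examples."""
    return per_gpu_batch_size * gpus_per_worker * worker_count


def build_params_override(overrides):
    """Join a dict of overrides into a comma-separated key=value string."""
    return ",".join(f"{key}={value}" for key, value in overrides.items())


# Two workers with one GPU each and 16 examples per GPU give a global batch of 32.
overrides = {
    "task.train_data.global_batch_size": global_batch_size(16, 1, 2),
    "runtime.num_gpus": 1,
    "runtime.all_reduce_alg": "nccl",
}
print(build_params_override(overrides))
# task.train_data.global_batch_size=32,runtime.num_gpus=1,runtime.all_reduce_alg=nccl
```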

In [57]:
PARAMS_OVERRIDE = ','.join([
    'trainer.train_steps=2000',
    'trainer.steps_per_loop=100',
    'trainer.summary_interval=100',
    'trainer.validation_interval=2000',
    'trainer.checkpoint_interval=2000',
    'task.train_data.global_batch_size=' + str(REPLICA_COUNT*PER_REPLICA_BATCH_SIZE),
    'task.validation_data.global_batch_size=' + str(REPLICA_COUNT*PER_REPLICA_BATCH_SIZE), 
    'task.train_data.input_path=' + TRAIN_FILE,
    'task.validation_data.input_path=' + EVAL_FILE,
    'runtime.num_gpus=' + str(PER_MACHINE_ACCELERATOR_COUNT),
    'runtime.distribution_strategy=multi_worker_mirrored',
    'runtime.all_reduce_alg=nccl',
])

#### Create a base configuration for the MNLI fine tuning task

The training script supports overriding the default configuration of a TensorFlow Modelling Toolkit task using YAML configuration files and/or command line parameters. The `glue_mnli_matched.yaml` file packaged into the training container sets the base configuration for the MNLI fine-tuning task used in this example.

#### Assemble a job specification

You are now ready to assemble a custom job spec.

In [58]:
JOB_NAME = 'MNLI_{}'.format(time.strftime('%Y%m%d_%H%M%S'))
MODEL_DIR = f'{BUCKET_NAME}/{JOB_NAME}/model'

WORKER_CMD = ['python', 'trainer/train.py']
WORKER_ARGS = [
    '--experiment=bert/sentence_prediction',
    '--mode=train',
    '--model_dir=' + MODEL_DIR,
    '--config_file=trainer/glue_mnli_matched.yaml',
    '--params_override=' + PARAMS_OVERRIDE,
]

custom_job_spec = prepare_custom_job_spec(
    job_name=JOB_NAME,
    image_uri=TRAIN_IMAGE,
    args=WORKER_ARGS,
    cmd=WORKER_CMD,
    replica_count=REPLICA_COUNT,
    machine_type=WORKER_MACHINE_TYPE,
    accelerator_count=PER_MACHINE_ACCELERATOR_COUNT,
    accelerator_type=ACCELERATOR_TYPE,
    reduction_server_count=REDUCTION_SERVER_COUNT,
    reduction_server_machine_type=REDUCTION_SERVER_MACHINE_TYPE,
)

pprint.pprint(custom_job_spec)

{'display_name': 'MNLI_20210708_193212',
 'job_spec': {'worker_pool_specs': [{'container_spec': {'args': ['--experiment=bert/sentence_prediction',
                                                                 '--mode=train',
                                                                 '--model_dir=gs://jk-rs-notebook-test/MNLI_20210708_193212/model',
                                                                 '--config_file=trainer/glue_mnli_matched.yaml',
                                                                 '--params_override=trainer.train_steps=2000,trainer.steps_per_loop=100,trainer.summary_interval=100,trainer.validation_interval=2000,trainer.checkpoint_interval=2000,task.train_data.global_batch_size=32,task.validation_data.global_batch_size=32,task.train_data.input_path=gs://jk-rs-notebook-test/datasets/MNLI/mnli_train.tf_record,task.validation_data.input_path=gs://jk-rs-notebook-test/datasets/MNLI/mnli_valid.tf_record,runtime.num_gpus=1,runtime.distributio

### Submit and monitor the job

You will now use the Vertex AI job client to submit and monitor a training job. To submit the job, use the job client service's `create_custom_job` method.

In [59]:
options = dict(api_endpoint=API_ENDPOINT)
client = JobServiceClient(client_options=options)

parent = f"projects/{PROJECT_ID}/locations/{REGION}"

response = client.create_custom_job(
    parent=parent, custom_job=custom_job_spec
)

response

name: "projects/895222332033/locations/us-central1/customJobs/3870056629399453696"
display_name: "MNLI_20210708_193212"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "a2-highgpu-1g"
      accelerator_type: NVIDIA_TESLA_A100
      accelerator_count: 1
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/jk-mlops-dev/mnli_finetuning"
      command: "python"
      command: "trainer/train.py"
      args: "--experiment=bert/sentence_prediction"
      args: "--mode=train"
      args: "--model_dir=gs://jk-rs-notebook-test/MNLI_20210708_193212/model"
      args: "--config_file=trainer/glue_mnli_matched.yaml"
      args: "--params_override=trainer.train_steps=2000,trainer.steps_per_loop=100,trainer.summary_interval=100,trainer.validation_interval=2000,trainer.checkpoint_interval=2000,task.train_data.global_batch_size=32,task.validation_data.global_batch_size=32,task.tr

#### Get information about a running job

You can use the job client service's `get_custom_job` method to retrieve information about a running job.
Note that you can also monitor the job in the Google Cloud Console.


In [60]:
client.get_custom_job(name=response.name)

name: "projects/895222332033/locations/us-central1/customJobs/3870056629399453696"
display_name: "MNLI_20210708_193212"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "a2-highgpu-1g"
      accelerator_type: NVIDIA_TESLA_A100
      accelerator_count: 1
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/jk-mlops-dev/mnli_finetuning"
      command: "python"
      command: "trainer/train.py"
      args: "--experiment=bert/sentence_prediction"
      args: "--mode=train"
      args: "--model_dir=gs://jk-rs-notebook-test/MNLI_20210708_193212/model"
      args: "--config_file=trainer/glue_mnli_matched.yaml"
      args: "--params_override=trainer.train_steps=2000,trainer.steps_per_loop=100,trainer.summary_interval=100,trainer.validation_interval=2000,trainer.checkpoint_interval=2000,task.train_data.global_batch_size=32,task.validation_data.global_batch_size=32,task.tr
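
If you prefer to wait for the job from the notebook rather than re-running `get_custom_job` by hand, a simple polling loop is one option. The sketch below is an assumption-laden illustration: `wait_for_job` is a hypothetical helper, the polling interval is arbitrary, and the set of terminal states lists only three names from the Vertex AI `JobState` enum, which may not be exhaustive for your client library version.

```python
import time

# State names assumed to mark a finished job (a subset of the JobState enum).
TERMINAL_STATES = {
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
}


def wait_for_job(get_state, poll_seconds=60, max_polls=1000, sleep=time.sleep):
    """Call get_state() until it returns a terminal state name, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f'Job still not finished after {max_polls} polls')


# In this notebook the helper could be wired to the job client like this:
# final_state = wait_for_job(
#     lambda: client.get_custom_job(name=response.name).state.name)
```

Passing the state lookup in as a callable keeps the loop independent of the client object, so the same helper can be exercised with a stub before it is pointed at a real job.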

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:


In [None]:

# Delete model resource
! gcloud ai models delete $MODEL_NAME --quiet

# Delete Cloud Storage objects that were created
! gsutil -m rm -r $JOB_DIR
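
The custom job submitted earlier is also a resource you may want to remove. The following is a minimal sketch, assuming the `client` and `response` objects from the submission cell are still available. `delete_finished_job` is a hypothetical helper; it relies on the job client's `get_custom_job` and `delete_custom_job` methods and, as a cautious design choice, leaves a job alone if it has not reached one of the listed states.

```python
# State names assumed to mark a finished job (a subset of the JobState enum).
TERMINAL_STATES = {
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
}


def delete_finished_job(client, job_name):
    """Delete the custom job if it has finished; return True if it was deleted."""
    state = client.get_custom_job(name=job_name).state.name
    if state not in TERMINAL_STATES:
        # The job may still be running, so do nothing in this sketch.
        return False
    # delete_custom_job returns a long-running operation; wait for it to finish.
    client.delete_custom_job(name=job_name).result()
    return True


# In this notebook:
# delete_finished_job(client, response.name)
```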

## Temporary tests

In [37]:
EXPERIMENT = 'bert/sentence_prediction'
CONFIG_FILE = 'trainer/glue_mnli_matched.yaml'
MODE = 'train'

BERT_HUB_URL = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4'

PER_REPLICA_BATCH_SIZE = 16
ACCELERATOR_COUNT = 2
ALL_REDUCE_ALG = 'nccl'
STRATEGY = 'mirrored'
GLOBAL_BATCH_SIZE = ACCELERATOR_COUNT * PER_REPLICA_BATCH_SIZE

TRAINING_STEPS = 200
STEPS_PER_LOOP = 50
SUMMARY_INTERVAL = 50
VALIDATION_INTERVAL = 200
CHECKPOINT_INTERVAL = 200

MIXED_PRECISION_TYPE = 'float16'

LOCAL_DIR = '/tmp'

PARAMS_OVERRIDE = ','.join([
    'task.train_data.input_path=' + TRAIN_FILE,
    'task.validation_data.input_path=' + EVAL_FILE,
    'task.train_data.global_batch_size=' + str(GLOBAL_BATCH_SIZE),
    'task.validation_data.global_batch_size=' + str(GLOBAL_BATCH_SIZE),
    'task.hub_module_url=' + BERT_HUB_URL,
    'runtime.num_gpus=' + str(ACCELERATOR_COUNT),
    'runtime.distribution_strategy=' + STRATEGY,
    'runtime.all_reduce_alg=' + ALL_REDUCE_ALG,
    'runtime.mixed_precision_dtype=' + MIXED_PRECISION_TYPE,
    'trainer.train_steps=' + str(TRAINING_STEPS),
    'trainer.steps_per_loop=' + str(STEPS_PER_LOOP),
    'trainer.summary_interval=' + str(SUMMARY_INTERVAL),
    'trainer.validation_interval=' + str(VALIDATION_INTERVAL),
    'trainer.checkpoint_interval=' + str(CHECKPOINT_INTERVAL),
])

In [41]:
! docker run -it --rm --gpus all {TRAIN_IMAGE} trainer/train.py \
--experiment={EXPERIMENT} \
--mode={MODE} \
--model_dir={LOCAL_DIR}/test \
--config_file={CONFIG_FILE} \
--params_override={PARAMS_OVERRIDE}


2021-07-09 01:23:15.658604: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
I0709 01:23:20.187328 140521104000832 train.py:59] Setting model_dir to: /tmp/test
I0709 01:23:20.204028 140521104000832 train_utils.py:286] Final experiment parameters: {'runtime': {'all_reduce_alg': 'nccl',
             'batchnorm_spatial_persistent': False,
             'dataset_num_private_threads': None,
             'default_shard_dim': -1,
             'distribution_strategy': 'mirrored',
             'enable_xla': False,
             'gpu_thread_mode': None,
             'loss_scale': None,
             'mixed_precision_dtype': 'float16',
             'num_cores_per_replica': 1,
             'num_gpus': 2,
             'num_packs': 1,
             'per_gpu_thread_count': 0,
             'run_eagerly': False,
             'task_index': -1,
             'tpu': None,
             'tpu_enable_xla_dynamic_padder': None,
             'worke