In [None]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# BERT fine-tuning using AI Platform Training


## Overview

This notebook demonstrates how to use [AI Platform (Unified)](https://cloud.google.com/ai-platform-unified/docs/start/introduction-unified-platform) to run TensorFlow 2.x distributed training with GPUs. Both single node and multi-worker scenarios are covered.

The ML scenario is BERT fine-tuning. You will use the text IMDB movie reviews database and the pre-trained BERT model from the [TensorFlow Hub](https://www.tensorflow.org/hub) to develop a text classification model for sentiment analysis.


There are three types of AI Platform resources you can use to train custom models on AI Platform:

- [Custom jobs](https://cloud.google.com/ai-platform-unified/docs/training/create-custom-job)
- [Hyperparameter tuning jobs](https://cloud.google.com/ai-platform-unified/docs/training/using-hyperparameter-tuning)
- [Training pipelines](https://cloud.google.com/ai-platform-unified/docs/training/create-training-pipeline)

This sample focuses on [Custom jobs](https://cloud.google.com/ai-platform-unified/docs/training/create-custom-job) with [custom training containers](https://cloud.google.com/ai-platform-unified/docs/training/containers-overview).

In the notebook, you will go through the following steps:

- Converting the text IMDB database to the TFRecords format
- Developing a custom training container
- Configuring, submitting, and monitoring single node and multi-worker Custom training jobs


### About BERT


[BERT](https://arxiv.org/abs/1810.04805) and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after, hence the name: Bidirectional Encoder Representations from Transformers. 

BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.

## Setting up the environment

### Setting up your notebook environment

This notebook has been tested with [AI Platform Notebooks](https://cloud.google.com/ai-platform/notebooks/docs) configured with the standard TensorFlow 2.4 image.

It may work in other environments, provided a similar hardware and software configuration is used.

#### Provisioning an instance of AI Platform Notebooks

To provision an instance of AI Platform Notebooks, follow the instructions in the [AI Platform Notebooks documentation](https://cloud.google.com/ai-platform/notebooks/docs/create-new). Configure your instance with multiple GPUs and use the TensorFlow 2.4 image.

#### Installing software prerequisites

In addition to the standard packages pre-installed in the TensorFlow 2.4 image, you need the following packages:
- [AI Platform Python client library](https://cloud.google.com/ai-platform-unified/docs/start/client-libraries)
- [TensorFlow Model Garden library](https://github.com/tensorflow/models/tree/master/official)
- [TF.Text](https://www.tensorflow.org/tutorials/tensorflow_text/intro)

Use `pip` to install the libraries. You can run `pip` from a terminal window of your AI Platform Notebooks instance or execute the following cells. Make sure to restart the notebook after installation.

In [8]:
%pip install -U google-cloud-aiplatform --user

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-0.5.1-py2.py3-none-any.whl (1.1 MB)
Installing collected packages: google-cloud-aiplatform
Successfully installed google-cloud-aiplatform-0.5.1
Note: you may need to restart the kernel to use updated packages.


In [9]:
%pip install tf-models-official tensorflow-text --user

Collecting tf-models-official
  Downloading tf_models_official-2.4.0-py2.py3-none-any.whl (1.1 MB)
Collecting tensorflow-text
  Downloading tensorflow_text-2.4.3-cp37-cp37m-manylinux1_x86_64.whl (3.4 MB)
Collecting grpcio~=1.32.0
  Downloading grpcio-1.32.0-cp37-cp37m-manylinux2014_x86_64.whl (3.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
Collecting Cython
  Downloading Cython-0.29.22-cp37-cp37m-manylinux1_x86_64.whl (2.0 MB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting gin-config
  Downloading gin_config-0.4.0

If you installed the packages from within the notebook, make sure to restart the kernel.

In [10]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Setting up your GCP project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a GCP project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the AI Platform APIs, Compute Engine APIs and Container Registry API.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component,containerregistry.googleapis.com)

4. [Google Cloud SDK](https://cloud.google.com/sdk) is already installed in AI Platform Notebooks.

5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.


#### Set your project ID

In [25]:
PROJECT_ID = 'jk-demos'

! gcloud config set project $PROJECT_ID

Updated property [core/project].


#### Set the default region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Below are the regions supported for AI Platform (Unified). We recommend choosing the region closest to you when possible.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You cannot use a Multi-Regional Storage bucket for training with AI Platform. Not all regions provide support for all AI Platform services. For the latest support per region, see [Region support for AI Platform (Unified) services](https://cloud.google.com/ai-platform-unified/docs/general/locations).

In [26]:
REGION = 'us-central1' 

#### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

In this tutorial, your training job retrieves data and saves the artifacts created during the job, including
a trained model, checkpoints, and the TensorBoard logs, into a Google Cloud Storage bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets. 

In [27]:
BUCKET_NAME = "jk-demos-bucket" 

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [4]:
! gsutil mb -l $REGION gs://$BUCKET_NAME

Creating gs://jk-demos-bucket/...
ServiceException: 409 A Cloud Storage bucket named 'jk-demos-bucket' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [5]:
! gsutil ls -al gs://$BUCKET_NAME

                                 gs://jk-demos-bucket/data/
                                 gs://jk-demos-bucket/test_run/
                                 gs://jk-demos-bucket/tfrecords/


#### Set AI Platform (Unified) constants

Let's now set up some constants for AI Platform (Unified):

- `API_ENDPOINT`: The AI Platform (Unified) API service endpoint for dataset, model, job, pipeline and endpoint services.
- `API_PREDICT_ENDPOINT`: The AI Platform (Unified) API service endpoint for prediction.
- `PARENT`: The AI Platform (Unified) location root path for dataset, model and endpoint resources.

In [28]:
# API Endpoint
API_ENDPOINT = "{}-aiplatform.googleapis.com".format(REGION)
API_PREDICT_ENDPOINT = "{}-prediction-aiplatform.googleapis.com".format(REGION)

# AI Platform (Unified) location root path for your dataset, model and endpoint resources
PARENT = "projects/" + PROJECT_ID + "/locations/" + REGION
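
To make the constants concrete, here is a minimal sketch that rebuilds the same strings for a placeholder project. The helper name `build_constants` and the project ID `my-project` are made up for illustration and are not part of the AI Platform client library.

```python
def build_constants(project_id, region):
    """Return the endpoint and parent strings, built the same way as in the cell above."""
    return {
        "API_ENDPOINT": "{}-aiplatform.googleapis.com".format(region),
        "API_PREDICT_ENDPOINT": "{}-prediction-aiplatform.googleapis.com".format(region),
        "PARENT": "projects/" + project_id + "/locations/" + region,
    }


# 'my-project' is a hypothetical project ID used only for this example.
constants = build_constants("my-project", "us-central1")
for name, value in constants.items():
    print(f"{name} = {value}")
```

Because both endpoints and the parent path embed the region, changing `REGION` changes all three values together.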

## The Lab

#### Import libraries 

In [29]:
import os
import shutil
import sys
import time

import tensorflow as tf

from datetime import datetime
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

from google.cloud.aiplatform import gapic as aip

### Preparing data

In this section, you will convert the original IMDB dataset that is in plain text into the [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) format. The TFRecord format is recommended for high performance input pipelines that are critical in large scale training scenarios like BERT pre-training and fine-tuning. The TFRecord format works well with the [tf.data API](https://www.tensorflow.org/guide/data) used to implement input pipelines in this sample.

After the TFRecord files are created, you will copy them to a Cloud Storage (GCS) bucket. In most distributed training scenarios, training data needs to be located in a shared storage location.

#### Download the IMDB dataset

In [8]:
local_dir = os.path.expanduser('~')
local_dir = f'{local_dir}/datasets'

if tf.io.gfile.exists(local_dir):
    tf.io.gfile.rmtree(local_dir)
tf.io.gfile.makedirs(local_dir)

url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
local_path = f'{local_dir}/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file(local_path, url,
                                  untar=True, 
                                  cache_dir=local_dir,
                                  cache_subdir='.'
                                  )
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

train_dir = os.path.join(dataset_dir, 'train')

# remove unused folders to make it easier to load the data
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


#### Convert the IMDB dataset to TFRecords files

##### Create training, validation and testing splits from IMDB text files

The IMDB dataset has already been divided into train and test, but it lacks a validation set. We will create a validation set using an 80:20 split of the training data.


In [9]:
def create_splits(train_dir, test_dir, val_split, seed):
    
    train_ds = tf.keras.preprocessing.text_dataset_from_directory(
        train_dir,
        validation_split=val_split,
        subset='training',
        seed=seed)

    class_names = train_ds.class_names
    
    train_ds = train_ds.unbatch()

    val_ds = tf.keras.preprocessing.text_dataset_from_directory(
        train_dir,
        validation_split=val_split,
        subset='validation',
        seed=seed).unbatch()

    test_ds = tf.keras.preprocessing.text_dataset_from_directory(
        test_dir).unbatch()

    return train_ds, val_ds, test_ds, class_names

In [10]:
seed = 42
val_split = 0.2
test_dir = f'{dataset_dir}/test'

train_ds, val_ds, test_ds, class_names = (
    create_splits(train_dir, test_dir, val_split, seed)
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
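
The split sizes reported above follow directly from `val_split`. The sketch below reproduces that arithmetic in plain Python and also shows how many steps one pass over the training split takes for a given batch size. The helper names are hypothetical, and the four-replica case (for example, two workers with two GPUs each) is an assumed configuration.

```python
import math


def split_sizes(num_examples, val_split):
    """Return (train, validation) counts for a fractional validation split."""
    num_val = int(num_examples * val_split)
    return num_examples - num_val, num_val


def steps_for_one_pass(num_examples, per_replica_batch_size, num_replicas=1):
    """Number of steps needed to see every example once."""
    global_batch_size = per_replica_batch_size * num_replicas
    return math.ceil(num_examples / global_batch_size)


num_train, num_val = split_sizes(25000, 0.2)
print(num_train, num_val)                                  # 20000 5000
print(steps_for_one_pass(num_train, 32))                   # 625
print(steps_for_one_pass(num_train, 32, num_replicas=4))   # 157
```

With a single replica and a batch size of 32, one pass over the training split takes 625 steps, which is the default value of the `steps_per_epoch` flag in the training module defined later in this notebook.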


##### Inspect a couple of examples

In [11]:
for text, label in train_ds.take(2):
    print(f'Review: {text.numpy()}')
    label = label.numpy()
    print(f'Label : {label} ({class_names[label]})')

Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label : 0 (neg)
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they

##### Prepare tf.Example serialization routines

In [12]:
def serialize_example(text_fragment, label):
    """Serializes text fragment and label in tf.Example."""
    
    def _bytes_feature(value):
        """Returns a bytes_list from a string / byte."""
        if isinstance(value, type(tf.constant(0))):
            value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    def _int64_feature(value):
        """Returns an int64_list from a bool / enum / int / uint."""
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    
    feature = {
        'text_fragment': _bytes_feature(text_fragment),
        'label': _int64_feature(label)
    }
    
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()
    
def tf_serialize_example(text_fragment, label):
    """Wraps serialize_example so that it can be used with Dataset.map."""
    tf_string = tf.py_function(
        serialize_example,
        (text_fragment, label),
        tf.string)
    return tf.reshape(tf_string, ())

##### Write TFRecords files

In [13]:
tfrecords_folder = '{}/tfrecords'.format(os.path.expanduser('~'))
if tf.io.gfile.exists(tfrecords_folder):
    tf.io.gfile.rmtree(tfrecords_folder)
tf.io.gfile.makedirs(tfrecords_folder)

filenames = ['train.tfrecords', 'valid.tfrecords', 'test.tfrecords']
for file_name, dataset in zip(filenames, [train_ds, val_ds, test_ds]):
    writer = tf.data.experimental.TFRecordWriter(os.path.join(tfrecords_folder, file_name))
    writer.write(dataset.map(tf_serialize_example))

##### Double check that you can read the created TFRecord files

In [14]:
# Read back the last file written above (test.tfrecords).
for record in tf.data.TFRecordDataset([os.path.join(tfrecords_folder, filenames[-1])]).take(2):
    print(record)

tf.Tensor(b'\n\xc4\x0f\n\x0e\n\x05label\x12\x05\x1a\x03\n\x01\x00\n\xb1\x0f\n\rtext_fragment\x12\x9f\x0f\n\x9c\x0f\n\x99\x0fI once had a conversation with my parents who told me British cinema goers in the 1940s and 50s would check to see a film\'s country of origin before going to see it . It didn\'t matter what the plot was or who was in it , if it was an American movie people would want to see it and if it was British people wouldn\'t want to see it . This might sound like a ridiculous generalisation but after seeing THE ASTONISHED HEART I can understand why people in those days preferred American cinema to the home grown variety Back in the 1940s <br /><br />British equity was devoid of working class members and it shows in this movie . Everyone speaks in an English lad dee daa upper class accent that makes the British Royal Family sound like working class scum and what this does is alienate a large amount of a potential British audience who would no doubt prefer to be watching Jim

##### Copy the created TFRecord files to GCS

In [15]:
gcs_paths = [f'gs://{BUCKET_NAME}/tfrecords/train',
             f'gs://{BUCKET_NAME}/tfrecords/valid',
             f'gs://{BUCKET_NAME}/tfrecords/test']

for filename, gcs_path in zip(filenames, gcs_paths):
    local_file_path = os.path.join(tfrecords_folder, filename)
    gcs_file_path = f'{gcs_path}/{filename}'
    !gsutil cp {local_file_path} {gcs_file_path}

Copying file:///home/jupyter/tfrecords/train.tfrecords [Content-Type=application/octet-stream]...
- [1 files][ 26.5 MiB/ 26.5 MiB]                                                
Operation completed over 1 objects/26.5 MiB.                                     
Copying file:///home/jupyter/tfrecords/valid.tfrecords [Content-Type=application/octet-stream]...
/ [1 files][  6.6 MiB/  6.6 MiB]                                                
Operation completed over 1 objects/6.6 MiB.                                      
Copying file:///home/jupyter/tfrecords/test.tfrecords [Content-Type=application/octet-stream]...
- [1 files][ 32.3 MiB/ 32.3 MiB]                                                
Operation completed over 1 objects/32.3 MiB.                                     


## Preparing the training container image


There are two ways of packaging your training code for AI Platform Custom jobs. 

- **Use a Google Cloud prebuilt container**. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. This Python package contains your code for training a custom model.

- **Use your own custom container image**. If you use your own container, the container needs to contain your code plus all the dependencies.

In this sample, we are using a custom container.

To create a custom training container you need to define a Python training module and package it in a container image together with all the required dependencies.

We will use the standard [Deep Learning Containers](https://cloud.google.com/ai-platform/deep-learning-containers/docs) image as a base image for the custom training container image. Specifically, we are going to use the `gcr.io/deeplearning-platform-release/tf2-gpu.2-4` image.

In [7]:
TRAIN_BASE_IMAGE = 'gcr.io/deeplearning-platform-release/tf2-gpu.2-4'


### Create the training module

A custom training image encapsulates your training code. You can structure your code any way you want, as long as you can invoke it through a standard Docker container interface.

In this sample, the training code is encapsulated in a single Python module - `task.py`. The runtime parameters can be passed as command line arguments. The sections below summarize the key design decisions behind the training regime.
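
As an illustration of passing runtime parameters on the command line, the following sketch renders a dictionary of parameters as `--name=value` arguments, the form accepted by `absl` flags. The helper `to_flag_args` is hypothetical and the values are only examples; the parameter names mirror the flags defined in `task.py`.

```python
def to_flag_args(params):
    """Render a dict of parameters as a list of --name=value arguments."""
    return [f"--{name}={value}" for name, value in params.items()]


# Example values only; the names mirror flags defined in task.py.
args = to_flag_args({
    "epochs": 2,
    "steps_per_epoch": 625,
    "per_replica_batch_size": 32,
    "strategy": "mirrored",
})
print(args)
```

A list like this can be appended to the container's entry point command, so the same image can be reused with different settings.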

#### Model design

This sample implements a simple classification model using pre-trained BERT components from TensorFlow Hub. Specifically, a classic BERT architecture with L=12 hidden layers, a hidden size of H=768, and A=12 attention heads is used. This [TF Hub model](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) uses the implementation of BERT from the [TensorFlow Model Garden repository](https://github.com/tensorflow/models/tree/master/official/nlp/bert).

Text inputs need to be transformed to numeric token ids and arranged in several Tensors before being input to BERT. TensorFlow Hub provides a matching preprocessing model for each of the BERT models, which implements this transformation using TF ops from the TF.text library. It is not necessary to run pure Python code outside your TensorFlow model to preprocess text.

The model implemented in the script embeds the preprocessing model from TF Hub as a Keras layer.

Since this is a binary classification problem and the model outputs a single logit (a single-unit `Dense` layer with no activation), the model uses the `tf.keras.losses.BinaryCrossentropy` loss function with `from_logits=True` and the `tf.metrics.BinaryAccuracy` metric.

The sample uses the same optimizer that BERT was originally trained with: "Adaptive Moments" (Adam). This optimizer minimizes the prediction loss and regularizes by weight decay (not using moments), a variant also known as [AdamW](https://arxiv.org/abs/1711.05101).

For the learning rate (`init_lr`), the same schedule as BERT pre-training is used: linear decay of a notional initial learning rate, prefixed with a linear warm-up phase over the first 10% of training steps (`num_warmup_steps`). In line with the BERT paper, the initial learning rate is smaller for fine-tuning (best of 5e-5, 3e-5, 2e-5).
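
The shape of this schedule can be sketched in a few lines of plain Python. This is a simplified illustration, not the Model Garden implementation: the function name `learning_rate` is hypothetical, and the sketch assumes that the decay is a straight line from the notional initial rate at step 0 to zero at the final training step. The numbers use the defaults from `task.py` (625 steps per epoch, 2 epochs, `init_lr` of 3e-5).

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warm-up followed by linear decay to zero (simplified sketch)."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 towards init_lr.
        return init_lr * step / num_warmup_steps
    # Linear decay of the notional initial rate, reaching 0 at the last step.
    return init_lr * max(0.0, 1.0 - step / num_train_steps)


num_train_steps = 625 * 2                        # steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)    # first 10% of the steps

for step in (0, 60, 125, 600, 1250):
    lr = learning_rate(step, 3e-5, num_train_steps, num_warmup_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions the rate never quite reaches `init_lr`, which is why the text calls it a notional initial learning rate.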

#### Input pipelines

The training code utilizes `tf.data` to implement input pipelines. The common techniques for optimizing performance - caching, prefetching - are applied. To better support distributed training the script allows for explicit configuration of [Auto Sharding Policy](https://www.tensorflow.org/tutorials/distribute/input).
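
To build intuition for what the sharding policy controls, here is a conceptual sketch in plain Python. It does not use `tf.data` and is not how TensorFlow implements sharding internally; the two helper functions are hypothetical stand-ins for file-based and element-based sharding across workers.

```python
def shard_by_file(files, num_workers, worker_index):
    """FILE-style sharding: each worker reads a disjoint subset of the files."""
    return [f for i, f in enumerate(files) if i % num_workers == worker_index]


def shard_by_data(elements, num_workers, worker_index):
    """DATA-style sharding: each worker reads everything and keeps every n-th element."""
    return [e for i, e in enumerate(elements) if i % num_workers == worker_index]


files = ["train.tfrecords"]   # a single file, as produced earlier in this notebook
records = list(range(8))      # stand-in for the records stored in that file

for worker in range(2):
    print(f"worker {worker}: files={shard_by_file(files, 2, worker)}, "
          f"records={shard_by_data(records, 2, worker)}")
```

With only one TFRecord file per split, file-based sharding leaves the second worker without any input, while element-based sharding gives each worker half of the records. This is one reason to make the policy configurable.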

#### Fault tolerance

The script utilizes the [`tf.keras.callbacks.experimental.BackupAndRestore`](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#backupandrestore_callback) callback for resilience from failures during training. The callback provides fault tolerance by backing up the model and current epoch number in a temporary checkpoint file. This is done at the end of each epoch.
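
The idea behind epoch-level backup and restore can be shown with a toy loop in plain Python. This is a conceptual sketch only and does not reflect how the Keras callback is implemented: the `train` function, the JSON backup file, and the simulated failure are all hypothetical.

```python
import json
import os
import tempfile


def train(total_epochs, backup_path, fail_after_epoch=None):
    """Run a toy training loop that records the last finished epoch in a file."""
    start_epoch = 0
    if os.path.exists(backup_path):
        with open(backup_path) as f:
            start_epoch = json.load(f)["epoch"] + 1    # resume after the saved epoch
    epochs_run = []
    for epoch in range(start_epoch, total_epochs):
        epochs_run.append(epoch)                       # placeholder for real training
        with open(backup_path, "w") as f:
            json.dump({"epoch": epoch}, f)             # back up at the end of the epoch
        if fail_after_epoch is not None and epoch == fail_after_epoch:
            return epochs_run, False                   # simulate an interrupted job
    if os.path.exists(backup_path):
        os.remove(backup_path)                         # finished: drop the backup
    return epochs_run, True


backup_path = os.path.join(tempfile.mkdtemp(), "backup.json")
first_run = train(4, backup_path, fail_after_epoch=1)
second_run = train(4, backup_path)
print(first_run)    # ([0, 1], False)
print(second_run)   # ([2, 3], True)
```

The restarted run picks up at epoch 2 instead of repeating the epochs that had already completed before the failure.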


In [17]:
! rm -rf trainer
! mkdir trainer
! touch trainer/__init__.py

In [18]:
%%writefile trainer/task.py


# Copyright 2021 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#            http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

from absl import app
from absl import flags
from absl import logging
from official.nlp import optimization 


TFHUB_HANDLE_ENCODER = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
TFHUB_HANDLE_PREPROCESS = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
LOCAL_TB_FOLDER = '/tmp/logs'
LOCAL_SAVED_MODEL_DIR = '/tmp/saved_model'

FLAGS = flags.FLAGS
flags.DEFINE_integer('steps_per_epoch', 625, 'Steps per training epoch')
flags.DEFINE_integer('eval_steps', 150, 'Evaluation steps')
flags.DEFINE_integer('epochs', 2, 'Number of epochs')
flags.DEFINE_integer('per_replica_batch_size', 32, 'Per replica batch size')
flags.DEFINE_string('training_data_path', 'gs://jk-demos-bucket/tfrecords/train', 'Training data GCS path')
flags.DEFINE_string('validation_data_path', 'gs://jk-demos-bucket/tfrecords/valid', 'Validation data GCS path')
flags.DEFINE_string('testing_data_path', 'gs://jk-demos-bucket/tfrecords/test', 'Testing data GCS path')

flags.DEFINE_string('job_dir', 'gs://jk-demos-bucket/jobs', 'A base GCS path for jobs')
flags.DEFINE_enum('strategy', 'multiworker', ['mirrored', 'multiworker'], 'Distribution strategy')
flags.DEFINE_enum('auto_shard_policy', 'auto', ['auto', 'data', 'file', 'off'], 'Dataset sharding policy')



auto_shard_policy = {
    'auto': tf.data.experimental.AutoShardPolicy.AUTO,
    'data': tf.data.experimental.AutoShardPolicy.DATA,
    'file': tf.data.experimental.AutoShardPolicy.FILE,
    'off': tf.data.experimental.AutoShardPolicy.OFF,
}


def create_unbatched_dataset(tfrecords_folder):
    """Creates an unbatched dataset in the format required by the 
       sentiment analysis model from the folder with TFrecords files."""
    
    feature_description = {
        'text_fragment': tf.io.FixedLenFeature([], tf.string, default_value=''),
        'label': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    }

    def _parse_function(example_proto):
        parsed_example = tf.io.parse_single_example(example_proto, feature_description)
        return parsed_example['text_fragment'], parsed_example['label']
  
    file_paths = [f'{tfrecords_folder}/{file_path}' for file_path in tf.io.gfile.listdir(tfrecords_folder)]
    dataset = tf.data.TFRecordDataset(file_paths)
    dataset = dataset.map(_parse_function)
    
    return dataset


def configure_dataset(ds, auto_shard_policy):
    """
    Optimizes the performance of a dataset.
    """
    
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = (
        auto_shard_policy
    )
    
    # Cache before repeating: a cache placed after an infinite repeat would never be complete.
    ds = ds.cache().repeat()
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    ds = ds.with_options(options)
    return ds


def create_input_pipelines(train_dir, valid_dir, test_dir, batch_size, auto_shard_policy):
    """Creates input pipelines from Imdb dataset."""
    
    train_ds = create_unbatched_dataset(train_dir)
    train_ds = train_ds.batch(batch_size)
    train_ds = configure_dataset(train_ds, auto_shard_policy)
    
    valid_ds = create_unbatched_dataset(valid_dir)
    valid_ds = valid_ds.batch(batch_size)
    valid_ds = configure_dataset(valid_ds, auto_shard_policy)
    
    test_ds = create_unbatched_dataset(test_dir)
    test_ds = test_ds.batch(batch_size)
    test_ds = configure_dataset(test_ds, auto_shard_policy)

    return train_ds, valid_ds, test_ds


def build_classifier_model(tfhub_handle_preprocess, tfhub_handle_encoder):
    """Builds a simple binary classification model with BERT trunk."""
    
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    
    return tf.keras.Model(text_input, net)


def copy_tensorboard_logs(local_path: str, gcs_path: str):
    """Copies Tensorboard logs from a local dir to a GCS location.
    
    After training, batch copy Tensorboard logs locally to a GCS location. This can result
    in faster pipeline runtimes over streaming logs per batch to GCS that can get bottlenecked
    when streaming large volumes.
    
    Args:
      local_path: local filesystem directory uri.
      gcs_path: cloud filesystem directory uri.
    Returns:
      None.
    """
    pattern = '{}/*/events.out.tfevents.*'.format(local_path)
    local_files = tf.io.gfile.glob(pattern)
    gcs_log_files = [local_file.replace(local_path, gcs_path) for local_file in local_files]
    for local_file, gcs_file in zip(local_files, gcs_log_files):
        tf.io.gfile.copy(local_file, gcs_file)


def main(argv):
    del argv
    
    def _is_chief(task_type, task_id):
        return ((task_type == 'chief' or task_type == 'worker') and task_id == 0) or task_type is None
        
    
    logging.info('Setting up training.')
    logging.info('   epochs: {}'.format(FLAGS.epochs))
    logging.info('   steps_per_epoch: {}'.format(FLAGS.steps_per_epoch))
    logging.info('   eval_steps: {}'.format(FLAGS.eval_steps))
    logging.info('   strategy: {}'.format(FLAGS.strategy))
    
    if FLAGS.strategy == 'mirrored':
        strategy = tf.distribute.MirroredStrategy()
    else:
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        
    if strategy.cluster_resolver:    
        task_type, task_id = (strategy.cluster_resolver.task_type,
                              strategy.cluster_resolver.task_id)
    else:
        task_type, task_id = None, None
        
    
    global_batch_size = (strategy.num_replicas_in_sync *
                         FLAGS.per_replica_batch_size)
    
    
    train_ds, valid_ds, test_ds = create_input_pipelines(
        FLAGS.training_data_path,
        FLAGS.validation_data_path,
        FLAGS.testing_data_path,
        global_batch_size,
        auto_shard_policy[FLAGS.auto_shard_policy])
        
    num_train_steps = FLAGS.steps_per_epoch * FLAGS.epochs
    num_warmup_steps = int(0.1*num_train_steps)
    init_lr = 3e-5
    
    with strategy.scope():
        model = build_classifier_model(TFHUB_HANDLE_PREPROCESS, TFHUB_HANDLE_ENCODER)
        loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
        metrics = tf.metrics.BinaryAccuracy()
        optimizer = optimization.create_optimizer(
            init_lr=init_lr,
            num_train_steps=num_train_steps,
            num_warmup_steps=num_warmup_steps,
            optimizer_type='adamw')

        model.compile(optimizer=optimizer,
                      loss=loss,
                      metrics=metrics)
        
    # Configure BackupAndRestore callback
    backup_dir = '{}/backupandrestore'.format(FLAGS.job_dir)
    callbacks = [tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=backup_dir)]
    
    # Configure TensorBoard callback on Chief
    if _is_chief(task_type, task_id):
        callbacks.append(tf.keras.callbacks.TensorBoard(
            log_dir=LOCAL_TB_FOLDER, update_freq='batch'))
    
    logging.info('Starting training ...')
    
    history = model.fit(x=train_ds,
                        validation_data=valid_ds,
                        steps_per_epoch=FLAGS.steps_per_epoch,
                        validation_steps=FLAGS.eval_steps,
                        epochs=FLAGS.epochs,
                        callbacks=callbacks)

    if _is_chief(task_type, task_id):
        # Copy tensorboard logs to GCS
        tb_logs = '{}/tb_logs'.format(FLAGS.job_dir)
        logging.info('Copying TensorBoard logs to: {}'.format(tb_logs))
        copy_tensorboard_logs(LOCAL_TB_FOLDER, tb_logs)
        saved_model_dir = '{}/saved_model'.format(FLAGS.job_dir)
    else:
        # Non-chief workers must also call save(), but they write to a local scratch directory.
        saved_model_dir = LOCAL_SAVED_MODEL_DIR

    # Save trained model
    logging.info('Training completed. Saving the trained model to: {}'.format(saved_model_dir))
    model.save(saved_model_dir)
    #tf.saved_model.save(model, saved_model_dir)
    
    
if __name__ == '__main__':
    logging.set_verbosity(logging.INFO)
    app.run(main)


Writing trainer/task.py
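
In `task.py` the task type and index come from `strategy.cluster_resolver`. In a multi-worker setup, TensorFlow reads the cluster layout from the `TF_CONFIG` environment variable. The sketch below shows, in plain Python, how the chief rule from `_is_chief` plays out for a two-worker cluster. The helper `task_from_tf_config` is hypothetical, and the host names and ports are placeholders.

```python
import json


def task_from_tf_config(tf_config_json):
    """Extract (task_type, task_id) from a TF_CONFIG JSON string, or (None, None)."""
    if not tf_config_json:
        return None, None
    task = json.loads(tf_config_json).get("task", {})
    return task.get("type"), task.get("index")


def is_chief(task_type, task_id):
    """Same rule as _is_chief in task.py."""
    return (task_type in ("chief", "worker") and task_id == 0) or task_type is None


# Hypothetical two-worker cluster; host names and ports are placeholders.
cluster = {"worker": ["worker-0:2222", "worker-1:2222"]}
for index in range(2):
    tf_config = json.dumps({"cluster": cluster,
                            "task": {"type": "worker", "index": index}})
    task_type, task_id = task_from_tf_config(tf_config)
    print(task_type, task_id, is_chief(task_type, task_id))

# Without TF_CONFIG (single-node training) the only process acts as chief.
print(is_chief(*task_from_tf_config(None)))
```

Only the process identified as chief writes TensorBoard logs and the final SavedModel to the job directory.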


#### Create a docker file

In [8]:
TRAIN_IMAGE = f'gcr.io/{PROJECT_ID}/imdb_bert'

dockerfile = f'''
FROM {TRAIN_BASE_IMAGE}

RUN pip install tf-models-official tensorflow-text

WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]
'''

with open('Dockerfile', 'w') as f:
    f.write(dockerfile)

#### Build a container image and upload it to your Container Registry

In [20]:
! docker build -t {TRAIN_IMAGE} .

Sending build context to Docker daemon  29.18MB
Step 1/5 : FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4
latest: Pulling from deeplearning-platform-release/tf2-gpu.2-4

In [21]:
! docker push {TRAIN_IMAGE}

The push refers to repository [gcr.io/jk-demos/imdb_bert]

Alternatively, you can build and push the image with Cloud Build.

!gcloud builds submit --tag {TRAIN_IMAGE} .

#### Testing the image locally

Distributed training jobs running in AI Platform can be difficult to troubleshoot. You can do some of the troubleshooting up front by simulating a distributed training environment on your AI Platform Notebooks instance.

Let's assume that you have provisioned your instance with 4 GPUs. To simulate a distributed environment with two nodes, each equipped with two GPUs, start two local containers configured as in the sample commands below. Run each command in a separate Jupyter terminal window.

```
docker run --rm -it --gpus '"device=0,1"' \
--env TF_CONFIG='{"cluster": {"worker": ["localhost:12345", "localhost:23456"]}, "task": {"type": "worker", "index": 0} }' \
--network=host \
gcr.io/jk-demos/imdb_bert --epochs=2 --steps_per_epoch=20 --eval_steps=10 --auto_shard_policy=data --job_dir=gs://jk-demos-bucket/test_run
```

```
docker run --rm -it --gpus '"device=2,3"' \
--env TF_CONFIG='{"cluster": {"worker": ["localhost:12345", "localhost:23456"]}, "task": {"type": "worker", "index": 1} }' \
--network=host \
gcr.io/jk-demos/imdb_bert --epochs=2 --steps_per_epoch=20 --eval_steps=10 --auto_shard_policy=data --job_dir=gs://jk-demos-bucket/test_run
```
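
The two commands differ only in the GPU devices and in the task index inside `TF_CONFIG`, so they can also be generated. The following is a minimal sketch under that assumption; the helper name `local_worker_commands` is hypothetical, and the image, ports, and flags are simply the values from the commands above.

```python
import json
import shlex


def local_worker_commands(image, gpu_groups, ports, trainer_args):
    """Build one `docker run` command line per simulated worker."""
    if len(gpu_groups) != len(ports):
        raise ValueError('Need exactly one port per GPU group.')
    workers = ['localhost:{}'.format(port) for port in ports]
    commands = []
    for index, gpus in enumerate(gpu_groups):
        # Every worker sees the same cluster but a different task index.
        tf_config = json.dumps({
            'cluster': {'worker': workers},
            'task': {'type': 'worker', 'index': index},
        })
        # The device list is wrapped in double quotes, as in the commands above.
        gpu_flag = '"device={}"'.format(','.join(str(g) for g in gpus))
        parts = ['docker', 'run', '--rm', '-it',
                 '--gpus', gpu_flag,
                 '--env', 'TF_CONFIG=' + tf_config,
                 '--network=host', image] + list(trainer_args)
        commands.append(' '.join(shlex.quote(part) for part in parts))
    return commands


commands = local_worker_commands(
    image='gcr.io/jk-demos/imdb_bert',
    gpu_groups=[[0, 1], [2, 3]],
    ports=[12345, 23456],
    trainer_args=['--epochs=2', '--steps_per_epoch=20', '--eval_steps=10',
                  '--auto_shard_policy=data',
                  '--job_dir=gs://jk-demos-bucket/test_run'],
)
for command in commands:
    print(command)
```

Each printed line can be pasted into its own terminal window. The sketch only builds strings; it does not start any containers.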

## Submitting training jobs

The AI Platform (Unified) SDK follows a client/server model. On your side, in the Python script, you create a client that sends requests to, and receives responses from, the server (AI Platform). Requests and responses conform to the schemas documented in the [AI Platform API Reference](https://cloud.google.com/ai-platform-unified/docs/reference/rest/v1/projects.locations.batchPredictionJobs/create).

We will use the term specification to refer to a formatted request. To submit a Custom job request, you need to create a Custom job specification.

The custom job specification comprises two parts:
- A worker pool configuration, and
- A scheduling configuration

For single-node training, you define a single worker pool. For multi-node distributed training, you define multiple worker pools.

Within a worker pool specification, you configure:
- Machine types and accelerators
- The training code that the worker pool runs

For jobs using custom containers (as in this sample), the second part of a worker pool specification contains a custom container configuration, including the URI of the container image and the parameters passed to the container.

The scheduling configuration includes parameters related to queuing and scheduling of custom jobs, including the maximum job running time and the job restart policy.
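
The jobs in this notebook do not set a scheduling configuration. As a sketch of where it would go in the request, the helper below adds a `scheduling` section to a custom job dictionary. The helper name `with_scheduling` is hypothetical, and the field names `timeout` and `restart_job_on_worker_restart` are assumptions based on the API reference. The timeout is written as a plain `{'seconds': ...}` mapping for illustration; the client library may expect a `Duration` message or a `datetime.timedelta` instead, so check the reference before relying on it.

```python
def with_scheduling(custom_job, timeout_seconds, restart_on_worker_restart=False):
    """Return a copy of a custom job dict with a scheduling section added.

    Hypothetical helper; the field names below are assumptions taken from
    the API reference, not something this notebook exercises.
    """
    job_spec = dict(custom_job['job_spec'])
    job_spec['scheduling'] = {
        # Maximum running time of the job (assumed representation).
        'timeout': {'seconds': timeout_seconds},
        # Whether the whole job restarts when a worker restarts.
        'restart_job_on_worker_restart': restart_on_worker_restart,
    }
    updated = dict(custom_job)
    updated['job_spec'] = job_spec
    return updated


example_job = {
    'display_name': 'imdb-bert-example',
    'job_spec': {'worker_pool_specs': []},  # worker pools omitted for brevity
}
scheduled_job = with_scheduling(example_job, timeout_seconds=3600)
```

The original dictionary is left untouched, so the same base specification can be reused with different scheduling settings.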


### Assembling the job specification for single-node training

For this job, we will use a single `n1-standard-4` machine with two NVIDIA V100 GPUs and the container image created in the previous sections of this notebook.

In [11]:
MACHINE_TYPE = 'n1-standard-4'
TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_V100, 2)

When configuring a custom container, you pass the command-line parameters expected by your script through the `args` field of the container specification.

In [15]:
epochs = 3
steps_per_epoch = 200
eval_steps = 50
per_replica_batch_size = 32
training_data_path = 'gs://jk-demos-bucket/tfrecords/train'
validation_data_path = 'gs://jk-demos-bucket/tfrecords/valid'
job_id = 'job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
job_dir = 'gs://jk-demos-bucket/jobs/{}'.format(job_id)

worker_pool_spec = [
    {
        "replica_count": 1,
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": TRAIN_GPU,
            "accelerator_count": TRAIN_NGPU
        },
        "container_spec": {
            "image_uri": TRAIN_IMAGE,
            "args": [
                "--epochs=" + str(epochs),
                "--steps_per_epoch=" + str(steps_per_epoch),
                "--eval_steps=" + str(eval_steps),
                "--training_data_path=" + training_data_path,
                "--validation_data_path=" + validation_data_path,
                "--job_dir=" + job_dir,
                "--per_replica_batch_size=" + str(per_replica_batch_size),
                "--strategy=mirrored",
                "--auto_shard_policy=data",
            ]
        },
    }
]

custom_job = {
        "display_name": f'imdb-bert-{job_id}',
        "job_spec": {
            "worker_pool_specs": worker_pool_spec
        },
    }

custom_job

{'display_name': 'imdb-bert-job-20210303221013',
 'job_spec': {'worker_pool_specs': [{'replica_count': 1,
    'machine_spec': {'machine_type': 'n1-standard-4',
     'accelerator_type': <AcceleratorType.NVIDIA_TESLA_V100: 3>,
     'accelerator_count': 2},
    'container_spec': {'image_uri': 'gcr.io/jk-demos/imdb_bert',
     'args': ['--epochs=3',
      '--steps_per_epoch=200',
      '--eval_steps=50',
      '--training_data_path=gs://jk-demos-bucket/tfrecords/train',
      '--validation_data_path=gs://jk-demos-bucket/tfrecords/valid',
      '--job_dir=gs://jk-demos-bucket/jobs/job-20210303221013',
      '--per_replica_batch_size=32',
      '--strategy=mirrored',
      '--auto_shard_policy=data']}}]}}
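
The `args` list above is assembled by string concatenation. A small helper can build the same list from a dictionary, which is easier to keep in sync when flags are added. This is a minimal sketch; `build_args` is a hypothetical name, and the flag values are the ones used in the cell above.

```python
def build_args(flags):
    """Turn a mapping of flag names to values into `--name=value` strings."""
    return ['--{}={}'.format(name, value) for name, value in flags.items()]


example_args = build_args({
    'epochs': 3,
    'steps_per_epoch': 200,
    'eval_steps': 50,
    'per_replica_batch_size': 32,
    'strategy': 'mirrored',
    'auto_shard_policy': 'data',
})
print(example_args)
```

The resulting list could be assigned to the `args` field of `container_spec` in place of the hand-written one.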

#### Submitting the job

To submit the job, create a Job Service client and invoke its `create_custom_job` method.

In [17]:
client_options = {"api_endpoint": API_ENDPOINT}

client = aip.JobServiceClient(client_options=client_options)
response = client.create_custom_job(parent=PARENT, custom_job=custom_job)
job_name = response.name
response

name: "projects/993115309906/locations/us-central1/customJobs/2556364534579200000"
display_name: "imdb-bert-job-20210303221013"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
      accelerator_type: NVIDIA_TESLA_V100
      accelerator_count: 2
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/jk-demos/imdb_bert"
      args: "--epochs=3"
      args: "--steps_per_epoch=200"
      args: "--eval_steps=50"
      args: "--training_data_path=gs://jk-demos-bucket/tfrecords/train"
      args: "--validation_data_path=gs://jk-demos-bucket/tfrecords/valid"
      args: "--job_dir=gs://jk-demos-bucket/jobs/job-20210303221013"
      args: "--per_replica_batch_size=32"
      args: "--strategy=mirrored"
      args: "--auto_shard_policy=data"
    }
  }
}
state: JOB_STATE_PENDING
create_time {
  seconds: 1614809668
  nanos: 632906000
}
update_time {
  seconds: 

#### Monitoring the job

You can monitor the job through the GCP Console or programmatically by using the `client.get_custom_job()` method.

In [18]:
response = client.get_custom_job(name=job_name)
response

name: "projects/993115309906/locations/us-central1/customJobs/2556364534579200000"
display_name: "imdb-bert-job-20210303221013"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
      accelerator_type: NVIDIA_TESLA_V100
      accelerator_count: 2
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/jk-demos/imdb_bert"
      args: "--epochs=3"
      args: "--steps_per_epoch=200"
      args: "--eval_steps=50"
      args: "--training_data_path=gs://jk-demos-bucket/tfrecords/train"
      args: "--validation_data_path=gs://jk-demos-bucket/tfrecords/valid"
      args: "--job_dir=gs://jk-demos-bucket/jobs/job-20210303221013"
      args: "--per_replica_batch_size=32"
      args: "--strategy=mirrored"
      args: "--auto_shard_policy=data"
    }
  }
}
state: JOB_STATE_PENDING
create_time {
  seconds: 1614809668
  nanos: 632906000
}
start_time {
  seconds: 1
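
A single `get_custom_job` call returns a snapshot of the job. To wait until the job finishes, you can poll in a loop. The sketch below keeps the polling logic independent of the client so that it can be tried without submitting a job. The function `wait_for_job` is hypothetical, and the terminal state names are assumptions based on the `JobState` values shown in the responses above.

```python
import time

# Assumed names of the states in which a job no longer changes.
TERMINAL_STATES = {
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
}


def wait_for_job(get_state, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Call `get_state` until it returns a terminal state or polls run out.

    `get_state` is any zero-argument callable that returns the state name.
    Returns the last state that was observed.
    """
    polls = 0
    while True:
        state = get_state()
        polls += 1
        if state in TERMINAL_STATES:
            return state
        if max_polls is not None and polls >= max_polls:
            return state
        sleep(poll_seconds)
```

With the client from this notebook, the callable could be something like `lambda: client.get_custom_job(name=job_name).state.name`, assuming the returned `state` is an enum with a `name` attribute.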

### Assembling the job specification for multi-worker training

If you run a distributed training job with AI Platform, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica.

Each replica in the training cluster is given a single role or task in distributed training. For example:

- Primary replica: Exactly one replica is designated the primary replica. This task manages the others and reports status for the job as a whole.
- Worker(s): One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.
- Parameter server(s): If supported by your ML framework, one or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.
- Evaluator(s): If supported by your ML framework, one or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.

You assign a role through the position of the worker pool specification in the `workerPoolSpecs` list:

- The first worker pool specification (index 0) maps to the primary replica, also called the chief worker. Only one replica can be configured in the first worker pool specification.
- The second worker pool specification maps to secondary workers.
- The third worker pool specification maps to parameter servers.
- The fourth worker pool specification maps to evaluators.
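
For a cluster made up of a chief and secondary workers that all use the same machine and container configuration, the list can be generated rather than written out twice. The following is a minimal sketch; `build_worker_pool_specs` is a hypothetical helper, and the string accelerator type and trimmed `args` stand in for the values used in this notebook.

```python
import copy


def build_worker_pool_specs(machine_spec, container_spec, secondary_workers=0):
    """Build the worker pool list: one chief pool, then an optional worker pool."""

    def pool(replica_count):
        # Copy the nested dicts so the pools do not share mutable state.
        return {
            'replica_count': replica_count,
            'machine_spec': copy.deepcopy(machine_spec),
            'container_spec': copy.deepcopy(container_spec),
        }

    # Index 0: the chief. It always has exactly one replica.
    specs = [pool(1)]
    # Index 1: the secondary workers, if any were requested.
    if secondary_workers > 0:
        specs.append(pool(secondary_workers))
    return specs


example_specs = build_worker_pool_specs(
    machine_spec={
        'machine_type': 'n1-standard-4',
        'accelerator_type': 'NVIDIA_TESLA_V100',  # placeholder for the enum value
        'accelerator_count': 2,
    },
    container_spec={
        'image_uri': 'gcr.io/jk-demos/imdb_bert',
        'args': ['--strategy=multiworker', '--auto_shard_policy=data'],
    },
    secondary_workers=1,
)
```

Calling the helper with `secondary_workers=0` yields a single-pool list, which corresponds to the single-node layout.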

Our second job will be a multi-worker distributed training job with one chief and one secondary worker. Both replicas will run on `n1-standard-4` machines with two NVIDIA V100 GPUs.

In [21]:
MACHINE_TYPE = 'n1-standard-4'
TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_V100, 2)

In [22]:
epochs = 100
steps_per_epoch = 500
eval_steps = 100
training_data_path = 'gs://jk-demos-bucket/tfrecords/train'
validation_data_path = 'gs://jk-demos-bucket/tfrecords/valid'
job_id = 'job-{}'.format(datetime.now().strftime("%Y%m%d%H%M%S"))
job_dir = 'gs://jk-demos-bucket/jobs/{}'.format(job_id)

worker_pool_spec = [
    {
        "replica_count": 1,
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": TRAIN_GPU,
            "accelerator_count": TRAIN_NGPU
        },
        "container_spec": {
            "image_uri": TRAIN_IMAGE,
            "args": [
                "--epochs=" + str(epochs),
                "--steps_per_epoch=" + str(steps_per_epoch),
                "--eval_steps=" + str(eval_steps),
                "--per_replica_batch_size=" + str(per_replica_batch_size),
                "--training_data_path=" + training_data_path,
                "--validation_data_path=" + validation_data_path,
                "--job_dir=" + job_dir,
                "--strategy=multiworker",
                "--auto_shard_policy=data",
            ]
        },
    },
    {
        "replica_count": 1,
        "machine_spec": {
            "machine_type": MACHINE_TYPE,
            "accelerator_type": TRAIN_GPU,
            "accelerator_count": TRAIN_NGPU
        },
        "container_spec": {
            "image_uri": TRAIN_IMAGE,
            "args": [
                "--epochs=" + str(epochs),
                "--steps_per_epoch=" + str(steps_per_epoch),
                "--eval_steps=" + str(eval_steps),
                "--per_replica_batch_size=" + str(per_replica_batch_size),
                "--training_data_path=" + training_data_path,
                "--validation_data_path=" + validation_data_path,
                "--job_dir=" + job_dir,
                "--strategy=multiworker",
                "--auto_shard_policy=data",
            ]
        },
    },
]

custom_job = {
        "display_name": f'imdb-bert-{job_id}',
        "job_spec": {
            "worker_pool_specs": worker_pool_spec
        },
    }

custom_job

{'display_name': 'imdb-bert-job-20210303223833',
 'job_spec': {'worker_pool_specs': [{'replica_count': 1,
    'machine_spec': {'machine_type': 'n1-standard-4',
     'accelerator_type': <AcceleratorType.NVIDIA_TESLA_V100: 3>,
     'accelerator_count': 2},
    'container_spec': {'image_uri': 'gcr.io/jk-demos/imdb_bert',
     'args': ['--epochs=100',
      '--steps_per_epoch=500',
      '--eval_steps=100',
      '--per_replica_batch_size=32',
      '--training_data_path=gs://jk-demos-bucket/tfrecords/train',
      '--validation_data_path=gs://jk-demos-bucket/tfrecords/valid',
      '--job_dir=gs://jk-demos-bucket/jobs/job-20210303223833',
      '--strategy=multiworker',
      '--auto_shard_policy=data']}},
   {'replica_count': 1,
    'machine_spec': {'machine_type': 'n1-standard-4',
     'accelerator_type': <AcceleratorType.NVIDIA_TESLA_V100: 3>,
     'accelerator_count': 2},
    'container_spec': {'image_uri': 'gcr.io/jk-demos/imdb_bert',
     'args': ['--epochs=100',
      '--steps_per_

In [23]:
client_options = {"api_endpoint": API_ENDPOINT}

client = aip.JobServiceClient(client_options=client_options)
response = client.create_custom_job(parent=PARENT, custom_job=custom_job)
job_name = response.name
response

name: "projects/993115309906/locations/us-central1/customJobs/3254422476821626880"
display_name: "imdb-bert-job-20210303223833"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
      accelerator_type: NVIDIA_TESLA_V100
      accelerator_count: 2
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/jk-demos/imdb_bert"
      args: "--epochs=100"
      args: "--steps_per_epoch=500"
      args: "--eval_steps=100"
      args: "--per_replica_batch_size=32"
      args: "--training_data_path=gs://jk-demos-bucket/tfrecords/train"
      args: "--validation_data_path=gs://jk-demos-bucket/tfrecords/valid"
      args: "--job_dir=gs://jk-demos-bucket/jobs/job-20210303223833"
      args: "--strategy=multiworker"
      args: "--auto_shard_policy=data"
    }
  }
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
      accelerator_type: NV

In [24]:
job_name = response.name

response = client.get_custom_job(name=job_name)
response

name: "projects/993115309906/locations/us-central1/customJobs/3254422476821626880"
display_name: "imdb-bert-job-20210303223833"
job_spec {
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
      accelerator_type: NVIDIA_TESLA_V100
      accelerator_count: 2
    }
    replica_count: 1
    disk_spec {
      boot_disk_type: "pd-ssd"
      boot_disk_size_gb: 100
    }
    container_spec {
      image_uri: "gcr.io/jk-demos/imdb_bert"
      args: "--epochs=100"
      args: "--steps_per_epoch=500"
      args: "--eval_steps=100"
      args: "--per_replica_batch_size=32"
      args: "--training_data_path=gs://jk-demos-bucket/tfrecords/train"
      args: "--validation_data_path=gs://jk-demos-bucket/tfrecords/valid"
      args: "--job_dir=gs://jk-demos-bucket/jobs/job-20210303223833"
      args: "--strategy=multiworker"
      args: "--auto_shard_policy=data"
    }
  }
  worker_pool_specs {
    machine_spec {
      machine_type: "n1-standard-4"
      accelerator_type: NV

## The End