In [1]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Before you begin

### Select a GPU runtime

**Make sure you're running this notebook in a GPU runtime if you have that option. In Colab, select "Runtime > Change runtime type > GPU".**

In [10]:
from datetime import datetime
import os

In [24]:
PROJECT_ID = "mg-ce-demos"
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET_NAME = "gs://mg-ce-demos"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

In [25]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/mikegoodman/Documents/developer/mg-ce-demos-baeebaf7fb05.json"

In [27]:
#! gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [26]:
! gsutil ls -al $BUCKET_NAME

   5511886  2022-04-19T13:27:33Z  gs://mg-ce-demos/area_cover_dataset.csv#1650374853449780  metageneration=1
      2320  2022-03-04T14:20:13Z  gs://mg-ce-demos/mg-ce-demos-baeebaf7fb05.json#1646403613524771  metageneration=1
                                 gs://mg-ce-demos/633472233130/
                                 gs://mg-ce-demos/demand_forecasting/
                                 gs://mg-ce-demos/forecastpipeline/
                                 gs://mg-ce-demos/leak/
TOTAL: 2 objects, 5514206 bytes (5.26 MiB)


### Import libraries and define constants

In [28]:
import sys

from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

### Write Dockerfile

The first step in containerizing your code is to create a Dockerfile. In the Dockerfile, you include all the commands needed to build and run the image, such as installing the necessary libraries and setting up the entry point for the training code.

This Dockerfile uses the Deep Learning Container TensorFlow Enterprise 2.5 GPU Docker image. The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. After downloading that image, this Dockerfile installs the [CloudML Hypertune](https://github.com/GoogleCloudPlatform/cloudml-hypertune) library and sets up the entrypoint for the training code.


In [29]:
%%writefile Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5
WORKDIR /

# Installs hypertune library
RUN pip install cloudml-hypertune

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

Writing Dockerfile


### Create training application code

Next, you create a trainer directory with a `task.py` script that contains the code for your training application.

In [30]:
# Create trainer directory

# -p keeps this cell from failing if the directory already exists (for example, on a re-run)
! mkdir -p trainer

In the next cell, you write the contents of the training script, `task.py`. This file downloads the _horses or humans_ dataset from TensorFlow datasets and trains a `tf.keras` functional model using `MirroredStrategy` from the `tf.distribute` module.

There are a few components that are specific to using the hyperparameter tuning service:

* The script imports the `hypertune` library. Note that the Dockerfile included instructions to pip install the hypertune library.
* The function `get_args()` defines a command-line argument for each hyperparameter you want to tune. In this example, the hyperparameters that will be tuned are the learning rate, the momentum value in the optimizer, and the number of units in the last hidden layer of the model. The value passed in those arguments is then used to set the corresponding hyperparameter in the code.
* At the end of the `main()` function, the hypertune library is used to define the metric to optimize. In this example, the metric that will be optimized is the validation accuracy from the final training epoch. This metric is passed to an instance of `HyperTune`, which reports it under the tag `accuracy`.
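
To make the command-line contract concrete, the sketch below rebuilds the same flags that `get_args()` declares and parses one trial's values. It assumes each hyperparameter arrives as a `--name=value` argument; the helper names `build_parser` and `trial_to_argv` and the trial values are hypothetical and only serve the illustration.

```python
import argparse


def build_parser():
    """Mirrors the flags declared in get_args() in trainer/task.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', required=True, type=float)
    parser.add_argument('--momentum', required=True, type=float)
    parser.add_argument('--units', required=True, type=int)
    parser.add_argument('--epochs', type=int, default=10)
    return parser


def trial_to_argv(trial_params):
    """Assumed flag format: one --name=value argument per hyperparameter."""
    return [f'--{name}={value}' for name, value in trial_params.items()]


# Hypothetical values for a single trial.
trial = {'learning_rate': 0.01, 'momentum': 0.9, 'units': 128}
args = build_parser().parse_args(trial_to_argv(trial))

# --epochs is not tuned, so it falls back to its default.
print(args.learning_rate, args.momentum, args.units, args.epochs)
```

If a hyperparameter name has no matching flag, `argparse` exits with an error, which is why the names in the tuning configuration and in `get_args()` need to agree exactly.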

In [31]:
%%writefile trainer/task.py

import argparse
import hypertune
import tensorflow as tf
import tensorflow_datasets as tfds

def get_args():
  """Parses args. Must include all hyperparameters you want to tune."""

  parser = argparse.ArgumentParser()
  parser.add_argument(
      '--learning_rate', required=True, type=float, help='learning rate')
  parser.add_argument(
      '--momentum', required=True, type=float, help='SGD momentum value')
  parser.add_argument(
      '--units',
      required=True,
      type=int,
      help='number of units in last hidden layer')
  parser.add_argument(
      '--epochs',
      required=False,
      type=int,
      default=10,
      help='number of training epochs')
  args = parser.parse_args()
  return args


def preprocess_data(image, label):
  """Resizes and scales images."""

  image = tf.image.resize(image, (150, 150))
  return tf.cast(image, tf.float32) / 255., label


def create_dataset(batch_size):
  """Loads Horses Or Humans dataset and preprocesses data."""

  data, info = tfds.load(
      name='horses_or_humans', as_supervised=True, with_info=True)

  # Create train dataset
  train_data = data['train'].map(preprocess_data)
  train_data = train_data.shuffle(1000)
  train_data = train_data.batch(batch_size)

  # Create validation dataset
  validation_data = data['test'].map(preprocess_data)
  validation_data = validation_data.batch(64)

  return train_data, validation_data


def create_model(units, learning_rate, momentum):
  """Defines and compiles model."""

  inputs = tf.keras.Input(shape=(150, 150, 3))
  x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu')(inputs)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu')(x)
  x = tf.keras.layers.MaxPooling2D((2, 2))(x)
  x = tf.keras.layers.Flatten()(x)
  x = tf.keras.layers.Dense(units, activation='relu')(x)
  outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
  model = tf.keras.Model(inputs, outputs)
  model.compile(
      loss='binary_crossentropy',
      optimizer=tf.keras.optimizers.SGD(
          learning_rate=learning_rate, momentum=momentum),
      metrics=['accuracy'])
  return model


def main():
  args = get_args()

  # Create Strategy
  strategy = tf.distribute.MirroredStrategy()

  # Scale batch size
  GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync  
  train_data, validation_data = create_dataset(GLOBAL_BATCH_SIZE)

  # Wrap model variables within scope
  with strategy.scope():
    model = create_model(args.units, args.learning_rate, args.momentum)

  # Train model
  history = model.fit(
      train_data, epochs=args.epochs, validation_data=validation_data)

  # Define Metric
  hp_metric = history.history['val_accuracy'][-1]

  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=hp_metric,
      global_step=args.epochs)


if __name__ == '__main__':
  main()

Writing trainer/task.py
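
As a rough check on what the `--units` hyperparameter controls, the layer shapes in `create_model()` can be traced by hand. The sketch below assumes the Keras defaults used in the script (3x3 convolutions with `valid` padding and stride 1, 2x2 max pooling); the helper names are hypothetical.

```python
def conv_valid(size, kernel=3):
    """Output width of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def max_pool(size, window=2):
    """Output width of non-overlapping max pooling (floor division)."""
    return size // window


def flattened_features(size=150, filters=(16, 32, 64)):
    """Number of values entering the Dense layer after three conv/pool stages."""
    for _ in filters:
        size = max_pool(conv_valid(size))
    return size * size * filters[-1]


def dense_params(units, features):
    """Weights plus biases of a fully connected layer."""
    return features * units + units


features = flattened_features()
for units in (64, 128, 512):
    print(units, dense_params(units, features))
```

The spatial size shrinks 150, 148, 74, 72, 36, 34, 17, so the flattened vector has 17 * 17 * 64 = 18496 values, and the hidden layer's parameter count grows linearly with `units`.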


### Build the Container

In the next cells, you build the container and push it to Google Container Registry.

In [32]:
# Set the IMAGE_URI
IMAGE_URI=f"gcr.io/{PROJECT_ID}/horse-human:hypertune"

In [None]:
# Build the docker image
! docker build -f Dockerfile -t $IMAGE_URI ./

[1A[1B[0G[?25l[+] Building 0.0s (0/1)                                                         
[?25h[1A[0G[?25l[+] Building 0.1s (2/3)                                                         
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 340B                                       0.0s
[0m[34m => [internal] load .dockerignore                                          0.0s
[0m[34m => => transferring context: 2B                                            0.0s
[0m => [internal] load metadata for gcr.io/deeplearning-platform-release/tf2  0.1s
[?25h[1A[1A[1A[1A[1A[1A[0G[?25l[+] Building 0.3s (2/3)                                                         
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 340B                                       0.0s
[0m[34m => [internal] load .dockerignore                           

In [None]:
# Push it to Google Container Registry:
! docker push $IMAGE_URI

### Create and run hyperparameter tuning job on Vertex AI

Once your container is pushed to Google Container Registry, you use the Vertex SDK to create and run the hyperparameter tuning job.

You define the following specifications:
* `worker_pool_specs`: Dictionary specifying the machine type and Docker image. This example defines a single node cluster with one `n1-standard-4` machine with two `NVIDIA_TESLA_T4` GPUs.
* `parameter_spec`: Dictionary specifying the parameters to optimize. The dictionary key is the string assigned to the command line argument for each hyperparameter in your training application code, and the dictionary value is the parameter specification. The parameter specification includes the type, min/max values, and scale for the hyperparameter.
* `metric_spec`: Dictionary specifying the metric to optimize. The dictionary key is the `hyperparameter_metric_tag` that you set in your training application code, and the value is the optimization goal.
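
To illustrate what the `scale` setting means for a range such as 0.001 to 1, the sketch below compares uniform sampling on a linear axis with uniform sampling on a logarithmic axis. This is not how the tuning service chooses values; it only shows how each scale spreads candidates across the range. The helper names are hypothetical.

```python
import math
import random


def sample_linear(low, high, rng):
    """Uniform draw on a linear axis."""
    return rng.uniform(low, high)


def sample_log(low, high, rng):
    """Uniform draw on a logarithmic axis, mapped back to the original range."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(values, threshold):
    """Fraction of values that fall under the threshold."""
    return sum(v < threshold for v in values) / len(values)


rng = random.Random(0)
linear = [sample_linear(0.001, 1.0, rng) for _ in range(10_000)]
logarithmic = [sample_log(0.001, 1.0, rng) for _ in range(10_000)]

print(f"linear scale, share below 0.01: {share_below(linear, 0.01):.3f}")
print(f"log scale, share below 0.01:    {share_below(logarithmic, 0.01):.3f}")
```

On a linear axis only about 1% of candidates land below 0.01, while on a log axis each decade receives roughly a third of them, which is why a log scale suits a learning rate that spans several orders of magnitude.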

In [None]:
worker_pool_specs = [{
    'machine_spec': {
        'machine_type': 'n1-standard-4',
        'accelerator_type': 'NVIDIA_TESLA_T4',
        'accelerator_count': 2
    },
    'replica_count': 1,
    'container_spec': {
        'image_uri': IMAGE_URI
    }
}]

metric_spec = {'accuracy': 'maximize'}

parameter_spec = {
    'learning_rate': hpt.DoubleParameterSpec(min=0.001, max=1, scale='log'),
    'momentum': hpt.DoubleParameterSpec(min=0, max=1, scale='linear'),
    'units': hpt.DiscreteParameterSpec(values=[64, 128, 512], scale=None)
}

Create a `CustomJob`.

In [2]:
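# Point the SDK at the project, region, and bucket defined earlier
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

# Create a CustomJob
JOB_NAME = 'horses-humans-hyperparam-job-' + TIMESTAMP

my_custom_job = aiplatform.CustomJob(
    display_name=JOB_NAME,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=BUCKET_NAME)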

Then, create and run a `HyperparameterTuningJob`.

There are a few arguments to note:

* `max_trial_count`: Sets an upper bound on the number of trials the service will run. The recommended practice is to start with a smaller number of trials and get a sense of how impactful your chosen hyperparameters are before scaling up.

* `parallel_trial_count`: If you use parallel trials, the service provisions multiple training clusters. The worker pool spec that you specify when creating the job is used for each individual training cluster. Increasing the number of parallel trials reduces the time the hyperparameter tuning job takes to run; however, it can reduce the effectiveness of the job overall, because the default tuning strategy uses the results of previous trials to inform the values assigned in subsequent trials.
 
* `search_algorithm`: The available search algorithms are grid search (`'grid'`), random search (`'random'`), or the default (`None`). The default option applies Bayesian optimization to search the space of possible hyperparameter values and is the recommended algorithm.
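
The trade-off between `max_trial_count` and `parallel_trial_count` can be sketched with simple arithmetic. The model below is a deliberate simplification: it assumes trials run in lockstep batches, which the real service does not guarantee, and the function name is hypothetical.

```python
import math


def sequential_rounds(max_trial_count, parallel_trial_count):
    """Number of back-to-back batches needed if trials run in lockstep."""
    return math.ceil(max_trial_count / parallel_trial_count)


max_trials = 15
for parallel in (1, 3, 5, 15):
    rounds = sequential_rounds(max_trials, parallel)
    # Trials in the first batch start before any result is available.
    uninformed = min(parallel, max_trials)
    print(f"parallel={parallel:2d}  rounds={rounds:2d}  "
          f"trials started without prior results={uninformed}")
```

Under this simplified view, the values used in this notebook (15 trials, 3 in parallel) give 5 rounds, so 12 of the 15 trials can draw on at least one earlier result, whereas running all 15 in parallel would leave the search with no earlier results to learn from.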

In [None]:
# Create and run HyperparameterTuningJob

hp_job = aiplatform.HyperparameterTuningJob(
    display_name=JOB_NAME,
    custom_job=my_custom_job,
    metric_spec=metric_spec,
    parameter_spec=parameter_spec,
    max_trial_count=15,
    parallel_trial_count=3,
    search_algorithm=None)

hp_job.run()

Click on the generated link to see your run in the Cloud Console. When the job completes, you will see the results of the tuning trials.
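
Once results are available, picking the winning configuration is a matter of comparing the reported metric across trials. The sketch below works on plain dictionaries with invented values, a hypothetical stand-in for the trial records of a real job, so that it runs without a Cloud project.

```python
def best_trial(trials, goal='maximize'):
    """Returns the trial whose reported accuracy is best for the given goal."""
    if goal not in ('maximize', 'minimize'):
        raise ValueError(f'unknown goal: {goal!r}')
    pick = max if goal == 'maximize' else min
    return pick(trials, key=lambda trial: trial['accuracy'])


# Invented example records; real values come from your own tuning job.
example_trials = [
    {'id': '1', 'learning_rate': 0.012, 'momentum': 0.41, 'units': 64, 'accuracy': 0.81},
    {'id': '2', 'learning_rate': 0.150, 'momentum': 0.88, 'units': 512, 'accuracy': 0.62},
    {'id': '3', 'learning_rate': 0.004, 'momentum': 0.73, 'units': 128, 'accuracy': 0.86},
]

winner = best_trial(example_trials, goal='maximize')
print(f"best trial: {winner['id']} with accuracy {winner['accuracy']}")
```

The `goal` argument mirrors the optimization goal given in `metric_spec`, so the same helper would apply to a metric that should be minimized, such as a loss.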

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete the Cloud Storage bucket and everything in it.
# Warning: this also removes any objects that were in the bucket before this tutorial.
! gsutil -m rm -r $BUCKET_NAME