In [None]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Monitoring PyTorch experiments with AI Platform TensorBoard

This code sample demonstrates how to configure AI Platform TensorBoard (Experimental) to monitor PyTorch training jobs.

AI Platform TensorBoard is an enterprise-ready, managed version of TensorBoard, a Google open source project for machine learning experiment visualization, that is tightly integrated with the Google Cloud AI Platform.

TensorBoard provides the visualization and tooling needed for machine learning experimentation:
* Tracking and visualizing metrics such as loss and accuracy
* Visualizing the model graph (ops and layers)
* Viewing histograms of weights, biases, or other tensors as they change over time
* Projecting embeddings to a lower dimensional space
* Displaying images, text, and audio data
* Profiling TensorFlow programs
* And much more

Note that AI Platform TensorBoard currently requires AI Platform Training and supports only the Scalars dashboard. This notebook will be updated as support for other TensorBoard features is added.


## Pre-requisites

This notebook was designed to run on [AI Platform Notebooks](https://cloud.google.com/ai-platform/notebooks/docs) using the standard PyTorch 1.6+ image. Your notebook instance should be in the same project as the AI Platform TensorBoard and Training services.

While AI Platform TensorBoard is in the Experimental stage, your project must be allow-listed before you can use the service. Use the [signup form](https://docs.google.com/forms/d/e/1FAIpQLSfbvZ5xrFStX54qEUlJ6A0tWZ-O20i_t-Hifm0JvbX8do5IcQ/viewform) to request access.

After your project has been allow-listed, make sure to [enable the Cloud AI Platform API](https://console.cloud.google.com/apis/library/aiplatform.googleapis.com?q=cloud%20ai%20platform%20api&id=6189b0c0-23b1-46a4-a32f-70639e83fe9b).

## Scenario

You will use AI Platform TensorBoard to monitor PyTorch AI Platform Training jobs. The training scenario is transfer learning for an image classification problem, inspired by Kaggle's classic Dogs vs. Cats competition.

You will run and monitor two types of AI Platform Training jobs: a custom training job and a hyperparameter tuning job. Both types of jobs use the same custom Docker image, which encapsulates the training application and the PyTorch runtime.



## Setting up the environment

In [11]:
import base64
import os
import json
import time
import numpy as np

import google.auth

from google.auth.credentials import Credentials
from google.auth.transport.requests import AuthorizedSession

from typing import List, Optional, Text, Tuple

### Create AI Platform Training service account



To integrate with AI Platform TensorBoard, AI Platform Training jobs have to run in the context of a service account that has permissions to write logs to GCS and to access the AI Platform TensorBoard service.

If you don't already have one set up, create and configure a new service account using the following instructions. Note that by default, your AI Platform Notebook instance runs under the **Compute Engine default service account**, which does not have the permissions (specifically `iam.serviceAccounts.setIamPolicy`) required to configure the service account. If this is the case, use a different environment (for example **Cloud Shell**) and an account with the required permissions to run the commands below.


1. Create a service account

```
PROJECT_ID=[YOUR_PROJECT_ID]
USER_SA_NAME=[YOUR_SA_ACCOUNT_NAME]

gcloud --project=$PROJECT_ID iam service-accounts create $USER_SA_NAME

```

2. Retrieve the internal service account used by AI Platform 

```
GOOGLE_SA=$(gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" --format="table(bindings.members)" \
    --filter="bindings.role:roles/aiplatform.customCodeServiceAgent" | \
    grep "serviceAccount:" | head -n1)
```

3. Give the AI Platform service account permissions to impersonate your service account

```
SA_EMAIL="${USER_SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud --project=$PROJECT_ID iam service-accounts add-iam-policy-binding \
    --role roles/iam.serviceAccountAdmin \
    --member $GOOGLE_SA $SA_EMAIL

```

4. Give your service account access to GCS and the AI Platform TensorBoard service.

```
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/storage.admin"

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:${SA_EMAIL}" \
    --role="roles/aiplatform.user"

```


Set the `SA_EMAIL` constant to the email address of your service account. If you followed the above steps to create the account, you can display it with:

```
echo $SA_EMAIL
```

In [12]:
SA_EMAIL = 'aip-training@jk-mlops-dev.iam.gserviceaccount.com'

### Set AI Platform (Unified) constants

`PROJECT_ID` - The GCP Project ID. Both your AI Platform Notebook instance and AI Platform Training jobs should be in the same project. Make sure to modify the placeholder with your Project ID.

`CAIP_REGION` - A GCP compute region to use for the cloud services used in this notebook. Make sure to choose a region where [Cloud AI Platform services](https://cloud.google.com/ml-engine/docs/tensorflow/regions) are available. The default region is `us-central1`. 


In [13]:
PROJECT_ID = 'jk-mlops-dev'
CAIP_REGION = 'us-central1'

The current version of the AI Platform (Unified) Python SDK supports neither the AI Platform TensorBoard service nor the integration of the AI Platform Training service with AI Platform TensorBoard. As a workaround, the notebook calls the APIs directly through the REST interface. You don't need to change the constants below, which define the API endpoint and root resource paths.

`CAIP_ENDPOINT` - The AI Platform (Unified) API service endpoint.

`CAIP_PARENT_ALPHA` - The AI Platform (Unified) Alpha (Experimental) API root resource path. AI Platform (Unified) TensorBoard is in the Experimental stage.

`CAIP_PARENT_BETA` - The AI Platform (Unified) Beta (Preview) API root resource path. AI Platform (Unified) Training is in the Preview stage.


In [15]:
CAIP_ENDPOINT = f'{CAIP_REGION}-aiplatform.googleapis.com'
CAIP_PARENT_ALPHA = f'https://{CAIP_ENDPOINT}/v1alpha1/projects/{PROJECT_ID}/locations/{CAIP_REGION}'
CAIP_PARENT_BETA = f'https://{CAIP_ENDPOINT}/v1beta1/projects/{PROJECT_ID}/locations/{CAIP_REGION}'
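Both root paths follow the same pattern: a regional endpoint, an API version, and a project/location pair. A minimal sketch of that pattern is shown below. The helper name `caip_parent` and the placeholder project ID are illustrative assumptions, not part of any SDK.

```python
def caip_parent(project_id: str, region: str, api_version: str) -> str:
    """Builds the root resource URL for a regional AI Platform (Unified) API.

    `caip_parent` is a hypothetical helper used only for illustration.
    """
    endpoint = f'{region}-aiplatform.googleapis.com'
    return (f'https://{endpoint}/{api_version}'
            f'/projects/{project_id}/locations/{region}')


# Example with placeholder values
print(caip_parent('my-project', 'us-central1', 'v1alpha1'))
```

Keeping the API version as an argument makes it easier to see which calls go to the Experimental (`v1alpha1`) surface and which go to the Preview (`v1beta1`) surface.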

### Create a GCS bucket to store TensorBoard logs

The training script writes TensorBoard logs to a GCS bucket, from which the AI Platform Training service ingests them into the TensorBoard service. The GCS bucket should be a regional bucket in the same region where the training job is executed.


In [16]:
GCS_BUCKET_NAME = f'{PROJECT_ID}-tensorboard-logs'

!gsutil mb -l {CAIP_REGION} gs://{GCS_BUCKET_NAME}

Creating gs://jk-mlops-dev-tensorboard-logs/...


### Create a TensorBoard instance

You will now create an AI Platform TensorBoard instance. As noted before you will use the REST interface to invoke the AI Platform TensorBoard API.

#### Create an authorized session 

In [17]:
credentials, _ = google.auth.default()
authed_session = AuthorizedSession(credentials)

#### Create a TensorBoard resource

The following REST call creates a TensorBoard resource and sets its display name. Note that multiple resources can share the same display name, so each execution of the following cell creates a new TensorBoard instance.

In [18]:
tensorboard_display_name = 'pytorch_tensorboard'

In [19]:
api_url = f'{CAIP_PARENT_ALPHA}/tensorboards'

request_body = {
    'display_name': tensorboard_display_name
}

response = authed_session.post(api_url, data=json.dumps(request_body))
response.json()

{'name': 'projects/895222332033/locations/us-central1/tensorboards/2152439146906386432/operations/8989011133394321408',
 'metadata': {'@type': 'type.googleapis.com/google.cloud.aiplatform.v1alpha1.CreateTensorboardOperationMetadata',
  'genericMetadata': {'createTime': '2020-12-08T00:53:29.144269Z',
   'updateTime': '2020-12-08T00:53:29.144269Z'}}}
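The create call returns a long-running operation rather than the TensorBoard resource itself. In the sample output above, the operation name embeds the TensorBoard resource name as a prefix. The sketch below extracts that prefix. It assumes the name layout seen in this output, which is an observation and not a documented guarantee, and `tensorboard_name_from_operation` is a hypothetical helper.

```python
from typing import Optional


def tensorboard_name_from_operation(operation_name: str) -> Optional[str]:
    """Returns the TensorBoard resource name embedded in an operation name.

    Assumes the layout '<tensorboard resource name>/operations/<id>' seen in
    the sample output; returns None when the name does not match it.
    """
    marker = '/operations/'
    if '/tensorboards/' not in operation_name or marker not in operation_name:
        return None
    return operation_name.split(marker, 1)[0]
```

For example, passing `response.json()['name']` from the cell above would yield the `projects/.../tensorboards/...` part of the name.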

#### List TensorBoard instances 

To verify that the instance was successfully created, list all instances with the given display name.

In [20]:
api_url = f'{CAIP_PARENT_ALPHA}/tensorboards?filter=display_name={tensorboard_display_name}'

response = authed_session.get(api_url)
response.json()

{'tensorboards': [{'name': 'projects/895222332033/locations/us-central1/tensorboards/2152439146906386432',
   'displayName': 'pytorch_tensorboard',
   'createTime': '2020-12-08T00:53:29.144269Z',
   'updateTime': '2020-12-08T00:53:29.345030Z',
   'etag': 'AMEw9yNS6EIF5uzD3Ze0iP7UoEHXtMmXORoc_x9diWExZgC-S5sVPwJCOf9bAWgKKkBn'}]}

#### Retrieve the full name of your instance

You can retrieve the full name of the created instance from the JSON response.

In [21]:
tensorboard_id = response.json()['tensorboards'][0]['name']
tensorboard_id

'projects/895222332033/locations/us-central1/tensorboards/2152439146906386432'
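Indexing `['tensorboards'][0]` raises a `KeyError` when no instance matches, because the list call can return an empty JSON object (as the listing in the cleanup section of this notebook shows). A slightly more defensive sketch follows; `first_tensorboard_name` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def first_tensorboard_name(list_response: Dict[str, Any]) -> Optional[str]:
    """Returns the name of the first listed TensorBoard, or None if none exist."""
    tensorboards = list_response.get('tensorboards', [])
    if not tensorboards:
        return None
    return tensorboards[0]['name']
```

Since several instances can share a display name, the first entry is not necessarily the one you created most recently.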

Your environment is ready. You will now create a custom PyTorch training container image and submit and monitor AI Platform Training jobs.

## Preparing a training container image

There are two ways to create AI Platform Training jobs:

* **Using a Google Cloud prebuilt container image**. If you use a prebuilt container, you have to prepare a Python package that is installed on top of the prebuilt container image when you submit the job. This Python package contains your code for training a model. The prebuilt container provides a runtime environment, for example TensorFlow. A number of prebuilt container images are available, providing pre-configured environments for the most popular frameworks, including TensorFlow, PyTorch, XGBoost and Scikit-learn.

* **Using your own custom container image**. You are responsible for creating the container image including packaging your training code and all its dependencies.

This notebook uses the second approach. You will build your own custom container image that will be a derivative of the standard PyTorch [Deep Learning container](https://cloud.google.com/ai-platform/deep-learning-containers/docs).



### Create a training script

The first step is to create a training script. Runtime parameters that control the execution of the script should be provided as command line arguments.
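As a minimal sketch of this pattern, the snippet below parses two of the script's arguments with `argparse`. It uses `parse_known_args`, as the training script does, so that arguments the script does not declare are collected instead of causing an error. The `--some-other-flag` argument is a made-up placeholder.

```python
import argparse


def parse_args(argv):
    """Parses known arguments and returns (args, unknown_arguments)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--num-epochs', type=int, default=20)
    parser.add_argument('--batch-size', type=int, default=32)
    # parse_known_args tolerates arguments that were not declared above
    return parser.parse_known_args(argv)


args, unknown = parse_args(['--batch-size=64', '--some-other-flag=1'])
# Hyphens in argument names become underscores in attribute names
print(args.num_epochs, args.batch_size, unknown)
```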

Take the time to review the script:

* The `get_catsanddogs` function creates PyTorch training and validation datasets (`torch.utils.data.Dataset`) using images of cats and dogs, downloaded from `https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip`
* The `get_model` function creates the CNN that is based on the pre-trained ResNet18 model and includes a simple, custom FCNN classification head
* The `train_eval` function implements a training and evaluation loop. Note the use of `torch.utils.tensorboard` to log the training and validation losses and the validation accuracy to the TensorBoard log. Also note how the `hypertune` package is used to report the validation accuracy to AI Platform Training.



In [22]:
%%writefile train_eval.py

# Copyright 2020 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#            http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import hypertune
import numpy as np
import time
import os
import copy
import matplotlib.pyplot as plt
import zipfile

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import torchvision
from torchvision import datasets, models, transforms


DEFAULT_ROOT = '/tmp'

def get_catsanddogs(root):
    """
    Creates training and validation Datasets based on images
    of cats and dogs from 
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip.
    """
    
    # Download and extract the images
    source_url = 'https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip'
    local_filename = source_url.split('/')[-1]
    datasets.utils.download_url(source_url, root, )
    path_to_zip = os.path.join(root, local_filename)
    with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
        zip_ref.extractall(root)
    
    
    # Create datasets
    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(256),
        transforms.RandomHorizontalFlip(),
        transforms.CenterCrop(size=224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    
    val_transforms = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
    
    train_dataset = datasets.ImageFolder(
        root=os.path.join(path_to_zip[:-4], 'train'),
        transform=train_transforms)
    
    val_dataset = datasets.ImageFolder(
        root=os.path.join(path_to_zip[:-4], 'validation'),
        transform=val_transforms
    )
    
    return train_dataset, val_dataset
    


def get_model(num_layers, dropout_ratio, num_classes):
    """
    Creates a convolutional net using a ResNet18 trunk and
    a custom head.
    """

    # Create the ResNet18 trunk
    model = models.resnet18(pretrained=True)

    # Get the number of input features to the default head
    num_features = model.fc.in_features

    # Freeze trunk weights
    for param in model.parameters():
        param.requires_grad = False

    # Define the new head
    head = nn.Sequential(nn.Linear(num_features, num_layers),
                         nn.ReLU(),
                         nn.Dropout(dropout_ratio),
                         nn.Linear(num_layers, num_classes))

    # Replace the head
    model.fc = head

    return model


def train_eval(device, model, train_dataloader, valid_dataloader,
               criterion, optimizer, scheduler, num_epochs, writer=None):
    """
    Trains and evaluates a model.
    """
    since = time.time()

    model = model.to(device)

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0
    
    hpt = hypertune.HyperTune()

    for epoch in range(1, num_epochs+1):

        # Training phase
        model.train()
        num_train_examples = 0
        train_loss = 0.0

        for inputs, labels in train_dataloader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            num_train_examples += inputs.size(0)
            train_loss += loss.item() * inputs.size(0)
        scheduler.step()

        # Validation phase
        model.eval()
        num_val_examples = 0
        val_loss = 0
        val_corrects = 0

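        # Gradients are not needed during validation
        with torch.no_grad():
            for inputs, labels in valid_dataloader:
                inputs = inputs.to(device)
                labels = labels.to(device)
                outputs = model(inputs)
                # Compute the loss on the validation batch
                loss = criterion(outputs, labels)

                num_val_examples += inputs.size(0)
                val_loss += loss.item() * inputs.size(0)
                val_corrects += torch.sum(torch.eq(torch.max(outputs, 1)
                                                   [1], labels))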

        # Log epoch metrics
        train_loss = train_loss / num_train_examples
        val_loss = val_loss / num_val_examples
        val_acc = val_corrects.double() / num_val_examples

        print('Epoch: {}/{}, Training loss: {:.3f}, Validation loss: {:.3f}, Validation accuracy: {:.3f}'.format(
              epoch, num_epochs, train_loss, val_loss, val_acc))

        # Write to Tensorboard
        if writer:
            writer.add_scalars(
                'Loss', {'training': train_loss, 'validation': val_loss}, epoch)
            writer.add_scalar('Validation accuracy', val_acc, epoch)
            writer.flush()
            
        # Report to HyperTune
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag='accuracy',
            metric_value=val_acc,
            global_step=epoch
        )

        if val_acc > best_acc:
            best_acc = val_acc
            best_model_wts = copy.deepcopy(model.state_dict())

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model, best_acc


def get_args():
    """
    Returns parsed command line arguments.
    """

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--num-epochs',
        type=int,
        default=20,
        help='number of times to go through the data, default=20')
    parser.add_argument(
        '--batch-size',
        default=32,
        type=int,
        help='number of records to read during each training step, default=32')
    parser.add_argument(
        '--num-layers',
        default=64,
        type=int,
        help='number of units in the hidden layer of the classification head, default=64')
    parser.add_argument(
        '--dropout-ratio',
        default=0.5,
        type=float,
        help='dropout ratio in the classification head, default=0.5')
    parser.add_argument(
        '--step-size',
        default=7,
        type=int,
        help='step size of LR scheduler')
    parser.add_argument(
        '--log-dir',
        type=str,
        default='/tmp',
        help='directory for TensorBoard logs')
    parser.add_argument(
        '--verbosity',
        choices=['DEBUG', 'ERROR', 'FATAL', 'INFO', 'WARN'],
        default='INFO')

    args, _ = parser.parse_known_args()
    return args


if __name__ == "__main__":
    
    # Parse command line arguments
    args = get_args()
    
    # Create train and validation dataloaders
    train_dataset, val_dataset = get_catsanddogs(DEFAULT_ROOT)
    train_dataloader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True)
    val_dataloader = DataLoader(val_dataset, batch_size=args.batch_size, shuffle=True)
    class_names = train_dataset.classes
    
    # Use GPU if available
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print('-' * 10)
    print(f'Training on device: {device}')

    # Configure training
    model = get_model(args.num_layers, args.dropout_ratio, len(class_names))
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=args.step_size, gamma=0.1)

    # Set location for the TensorBoard logs
    if 'AIP_TENSORBOARD_LOG_DIR' in os.environ:
        log_dir = os.environ['AIP_TENSORBOARD_LOG_DIR']
    else:
        log_dir = args.log_dir

    with SummaryWriter(log_dir) as writer:
        # Add sample normalized images to Tensorboard
        images, _ = next(iter(train_dataloader))
        img_grid = torchvision.utils.make_grid(images)
        writer.add_image('Example images', img_grid)
        # Add graph to Tensorboard
        writer.add_graph(model, images)
        trained_model, accuracy = train_eval(device, model, train_dataloader, val_dataloader,
                                             criterion, optimizer, scheduler, args.num_epochs, writer)

        # Add final results and hyperparams to Tensorboard
        writer.add_hparams({
            'batch_size': args.batch_size,
            'hidden_layers': args.num_layers,
            'dropout_ratio': args.dropout_ratio
        },
            {
            'hparam/accuracy': accuracy
        })


Overwriting train_eval.py
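The training loop reports per-epoch losses as an average over examples, not over batches: each batch loss is weighted by the batch size before dividing by the total number of examples. This matters when the last batch is smaller than the others. A plain-Python sketch of that aggregation, with made-up numbers and a hypothetical helper name:

```python
def epoch_mean_loss(batch_losses, batch_sizes):
    """Averages per-batch mean losses, weighting each batch by its size."""
    total = 0.0
    num_examples = 0
    for loss, size in zip(batch_losses, batch_sizes):
        total += loss * size
        num_examples += size
    return total / num_examples


# Two batches: a full batch of 32 and a final partial batch of 8
print(epoch_mean_loss([0.5, 1.0], [32, 8]))
```

An unweighted mean of the two batch losses would be 0.75, which overstates the contribution of the small final batch.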


### Create Dockerfile

The training script is packaged in a container image that is based on the standard PyTorch 1.6 Deep Learning container image - `gcr.io/deeplearning-platform-release/pytorch-gpu.1-6`. 

In [23]:
%%writefile Dockerfile

FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-6
    
RUN pip install -U tensorflow cloudml-hypertune

ADD train_eval.py .

ENTRYPOINT ["python3", "train_eval.py"]


Overwriting Dockerfile


### Build the image

You will use [Cloud Build](https://cloud.google.com/cloud-build/docs/) to build the image and push it to your project's [Container Registry](https://cloud.google.com/container-registry).

In [None]:
IMAGE_NAME = 'image_classifier'
IMAGE_TAG = 'latest'
IMAGE_URI = f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{IMAGE_TAG}'

In [24]:
!gcloud builds submit --tag {IMAGE_URI} .

Creating temporary tarball archive of 8 file(s) totalling 153.6 KiB before compression.
Uploading tarball of [.] to [gs://jk-mlops-dev_cloudbuild/source/1607388892.08279-dec718e2d26d41a5a702882b460ad99d.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/jk-mlops-dev/builds/57919f40-b3e9-48d4-8858-c92d5c6c5283].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/57919f40-b3e9-48d4-8858-c92d5c6c5283?project=895222332033].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "57919f40-b3e9-48d4-8858-c92d5c6c5283"

FETCHSOURCE
Fetching storage object: gs://jk-mlops-dev_cloudbuild/source/1607388892.08279-dec718e2d26d41a5a702882b460ad99d.tgz#1607388892514737
Copying gs://jk-mlops-dev_cloudbuild/source/1607388892.08279-dec718e2d26d41a5a702882b460ad99d.tgz#1607388892514737...
/ [1 files][ 29.9 KiB/ 29.9 KiB]                                                
Operation completed over 1 objects/29.9 KiB.                    

## Submitting training jobs

You are now ready to submit training jobs. You will submit two jobs. First, you will submit a custom training job that runs your custom container once using a single set of hyperparameters. Then, you will submit a hyperparameter tuning job that will use your custom container image to run multiple training trials using a number of hyperparameter combinations selected by AI Platform Training. In both cases you will use AI Platform TensorBoard to monitor the execution of training.


Note that a custom training job and a hyperparameter tuning job are distinct resource types in the AI Platform Training (Unified) API. The custom training job is the `CustomJob` resource and the hyperparameter tuning job is the `HyperParameterTuningJob` resource. Refer to the [AI Platform API documentation](https://googleapis.dev/python/aiplatform/latest/aiplatform_v1beta1/types.html) for detailed information about the job specifications.

### Submit a custom training job

The `worker_pool_specs` section of the job specification defines the compute infrastructure that runs the training job and the configuration of the custom training container. In this case, the job runs on a single `n1-standard-8` node equipped with a single NVIDIA T4 GPU. The job uses the container image created in the previous steps, configured to train for 20 epochs. All other arguments of the training script are left at their defaults. The `base_output_directory` section of the job spec defines the location for the TensorBoard logs. The value provided in the request (`BASE_OUTPUT_DIR`) is exposed to the training script as the `AIP_TENSORBOARD_LOG_DIR` environment variable. The `tensorboard` field of the job spec is set to the full name of the TensorBoard instance created in the notebook's setup section.
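On the script side, this means the log location is read from the environment when the job runs on AI Platform Training and falls back to a command line default otherwise. A minimal sketch of that lookup is shown below. `resolve_log_dir` is a hypothetical helper, and the bucket path in the example is a placeholder.

```python
import os


def resolve_log_dir(default_dir, environ=None):
    """Returns the TensorBoard log directory for the current run.

    Prefers the AIP_TENSORBOARD_LOG_DIR environment variable and falls back
    to `default_dir` (for example, when the script runs locally).
    """
    environ = os.environ if environ is None else environ
    return environ.get('AIP_TENSORBOARD_LOG_DIR', default_dir)


# Local run: the variable is not set, so the default is used
print(resolve_log_dir('/tmp', environ={}))
# Managed run: the service provides the location (placeholder value)
print(resolve_log_dir('/tmp', environ={'AIP_TENSORBOARD_LOG_DIR': 'gs://my-bucket/my-job'}))
```

Passing the environment as an argument keeps the lookup easy to test without modifying `os.environ`.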



In [25]:
JOB_NAME = "_CUSTOM_JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
BASE_OUTPUT_DIR = f'gs://{GCS_BUCKET_NAME}/{JOB_NAME}'

In [26]:
api_url = f'{CAIP_PARENT_BETA}/customJobs'

request_body = {
    'display_name': JOB_NAME,
    'job_spec': {
        'worker_pool_specs': [
            {
                'replica_count': 1,
                'machine_spec': {
                    'machine_type': 'n1-standard-8',
                    'accelerator_type': 'NVIDIA_TESLA_T4',
                    'accelerator_count': 1
                },
                'container_spec': {
                    'image_uri': IMAGE_URI,
                    'args': [
                        f'--num-epochs=20'
                    ]
                }
            }
        ],
        'base_output_directory': {
            'output_uri_prefix': BASE_OUTPUT_DIR,
        },
        'service_account': SA_EMAIL,
        'tensorboard': tensorboard_id
    }
}


response = authed_session.post(api_url, data=json.dumps(request_body))
response.json()

{'name': 'projects/895222332033/locations/us-central1/customJobs/485843401988636672',
 'displayName': '_CUSTOM_JOB_20201208_010310',
 'jobSpec': {'workerPoolSpecs': [{'machineSpec': {'machineType': 'n1-standard-8',
     'acceleratorType': 'NVIDIA_TESLA_T4',
     'acceleratorCount': 1},
    'replicaCount': '1',
    'diskSpec': {'bootDiskType': 'pd-standard', 'bootDiskSizeGb': 100},
    'containerSpec': {'imageUri': 'gcr.io/jk-mlops-dev/image_classifier:latest',
     'args': ['--num-epochs=20']}}],
  'serviceAccount': 'aip-training@jk-mlops-dev.iam.gserviceaccount.com',
  'baseOutputDirectory': {'outputUriPrefix': 'gs://jk-mlops-dev-tensorboard-logs/_CUSTOM_JOB_20201208_010310'},
  'tensorboard': 'projects/895222332033/locations/us-central1/tensorboards/2152439146906386432'},
 'state': 'JOB_STATE_PENDING',
 'createTime': '2020-12-08T01:03:11.582907Z',
 'updateTime': '2020-12-08T01:03:11.582907Z'}

After the job has started successfully, you can monitor the progress of training using TensorBoard dashboards. 

Find your job in [Cloud Console](https://pantheon.corp.google.com/ai/platform/training-pipelines). Click on the job name to display the job's details. You will see the *OPEN TENSORBOARD* link in the upper left section of the page. Click on it to open the TensorBoard instance. Note that it may take a few minutes before the job starts running your training script and the logs are available.
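You can also check the job state from the notebook by polling the job resource until it reaches a terminal state. The sketch below assumes a session object with a `get(url)` method whose result has `.json()`, such as the `authed_session` created earlier, and that the job resource exposes a `state` field as in the response above. `wait_for_job` and its parameters are hypothetical names.

```python
import time

# Terminal values of the job `state` field
TERMINAL_STATES = {'JOB_STATE_SUCCEEDED', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED'}


def wait_for_job(session, job_url, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Polls a job resource and returns the last observed state.

    Stops early when the job reaches a terminal state; otherwise gives up
    after `max_polls` requests and returns whatever state was seen last.
    """
    state = None
    for _ in range(max_polls):
        state = session.get(job_url).json().get('state')
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    return state
```

Assuming the standard resource URL layout, `job_url` could be built as `f"https://{CAIP_ENDPOINT}/v1beta1/{response.json()['name']}"` from the submit response.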

### Submit a hyperparameter tuning job

The `worker_pool_specs` of the job spec is the same as in the custom job spec. Since this is a hypertuning job there are additional sections that define the hyperparameter tuning configuration.

In this example, the job attempts to maximize accuracy using the grid search algorithm. The job expects the training script to report the `accuracy` metric using the `hypertune` package. Recall that the training script reports the validation accuracy at the end of each epoch.

The job is configured to tune two hyperparameters: `batch-size` and `num-layers`. The `batch-size` parameter is the batch size used in the script's training loop, and the `num-layers` parameter sets the number of units in the hidden layer of the model's FCNN classification head. Both parameters are passed to the script as command line arguments.
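To see why the request uses a `maxTrialCount` of 6, it helps to enumerate the grid: three batch sizes times two `num-layers` values. The sketch below lists the combinations locally. It only illustrates the search space, it is not how the service schedules trials, and the `--name=value` argument format is an assumption.

```python
import itertools

# The discrete values used in the tuning request
search_space = {
    'batch-size': [32, 64, 128],
    'num-layers': [64, 128],
}


def grid_trials(space):
    """Returns every combination of the discrete parameter values."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def trial_args(trial):
    """Formats one combination as command line arguments (assumed format)."""
    return [f'--{name}={value}' for name, value in trial.items()]


trials = grid_trials(search_space)
print(len(trials))
for trial in trials:
    print(trial_args(trial))
```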


In [27]:
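api_url = f'{CAIP_PARENT_BETA}/hyperparameterTuningJobs'
JOB_NAME = "HYPER_JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
# Write the tuning job's logs to its own location rather than reusing the custom job's
BASE_OUTPUT_DIR = f'gs://{GCS_BUCKET_NAME}/{JOB_NAME}'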

request_body = {
    'display_name': JOB_NAME,
    'study_spec' : {
        'metrics': [
            {
                'metric_id': 'accuracy',
                'goal': 'MAXIMIZE'
            }
        ],
        'parameters': [
            {
                'parameter_id': 'batch-size',
                'discrete_value_spec': {'values': [32, 64, 128]},
                'scale_type': 'UNIT_LINEAR_SCALE'
            },
            {
                'parameter_id': 'num-layers',
                'discrete_value_spec': {'values': [64, 128]},
                'scale_type': 'UNIT_LINEAR_SCALE'
            }
        ],
        'algorithm': 'GRID_SEARCH'
    },
    'maxTrialCount': 6,
    'parallelTrialCount': 3,
    'trial_job_spec': {
        'worker_pool_specs': [
            {
                'replica_count': 1,
                'machine_spec': {
                    'machine_type': 'n1-standard-8',
                    'accelerator_type': 'NVIDIA_TESLA_T4',
                    'accelerator_count': 1
                },
                'container_spec': {
                    'image_uri': IMAGE_URI,
                    'args': [
                        f'--num-epochs=20'
                    ]
                }
            }
        ],
        'base_output_directory': {
            'output_uri_prefix': BASE_OUTPUT_DIR,
        },
        'service_account': SA_EMAIL,
        'tensorboard': tensorboard_id
    }
}


response = authed_session.post(api_url, data=json.dumps(request_body))
response.json()


{'name': 'projects/895222332033/locations/us-central1/hyperparameterTuningJobs/5524949172551155712',
 'displayName': 'HYPER_JOB_20201208_012106',
 'studySpec': {'metrics': [{'metricId': 'accuracy', 'goal': 'MAXIMIZE'}],
  'parameters': [{'parameterId': 'batch-size',
    'discreteValueSpec': {'values': [32, 64, 128]},
    'scaleType': 'UNIT_LINEAR_SCALE'},
   {'parameterId': 'num-layers',
    'discreteValueSpec': {'values': [64, 128]},
    'scaleType': 'UNIT_LINEAR_SCALE'}],
  'algorithm': 'GRID_SEARCH'},
 'maxTrialCount': 6,
 'parallelTrialCount': 3,
 'trialJobSpec': {'workerPoolSpecs': [{'machineSpec': {'machineType': 'n1-standard-8',
     'acceleratorType': 'NVIDIA_TESLA_T4',
     'acceleratorCount': 1},
    'replicaCount': '1',
    'diskSpec': {'bootDiskType': 'pd-standard', 'bootDiskSizeGb': 100},
    'containerSpec': {'imageUri': 'gcr.io/jk-mlops-dev/image_classifier:latest',
     'args': ['--num-epochs=20']}}],
  'serviceAccount': 'aip-training@jk-mlops-dev.iam.gserviceaccount.co

## Cleaning up

### List all tensorboards in the project

In [10]:
api_url = f'{CAIP_PARENT_ALPHA}/tensorboards'

response = authed_session.get(api_url)
response.json()

{}

### Delete a TensorBoard resource

In [9]:
tensorboard_id = '7266276523785584640'

api_url = f'{CAIP_PARENT_ALPHA}/tensorboards/{tensorboard_id}'

response = authed_session.delete(api_url)
response.json()

{'name': 'projects/895222332033/locations/us-central1/operations/1765659543557111808',
 'metadata': {'@type': 'type.googleapis.com/google.cloud.aiplatform.v1alpha1.DeleteOperationMetadata',
  'genericMetadata': {'createTime': '2020-12-08T00:50:34.594360Z',
   'updateTime': '2020-12-08T00:50:34.594360Z'}},
 'done': True,
 'response': {'@type': 'type.googleapis.com/google.protobuf.Empty'}}
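The delete cell above uses a hard-coded numeric ID, while the listing calls return full resource names of the form `projects/.../tensorboards/<id>`. Assuming that layout, the short ID is simply the last path segment. `tensorboard_short_id` is a hypothetical helper name.

```python
def tensorboard_short_id(resource_name: str) -> str:
    """Returns the trailing ID segment of a TensorBoard resource name."""
    return resource_name.rstrip('/').rsplit('/', 1)[-1]


full_name = 'projects/895222332033/locations/us-central1/tensorboards/2152439146906386432'
print(tensorboard_short_id(full_name))
```

The result can be used in place of the hard-coded `tensorboard_id` when building the delete URL.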