In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vertex AI: Medical imaging custom training and prediction

<table align="left">
  <td>
    <a href="https://console.cloud.google.com/ai-platform/notebooks/deploy-notebook?name=Vertex%20AI%20Custom%20Training&download_url=https%3A%2F%2Fraw.githubusercontent.com%2Fkweinmeister%2Fnotebooks%2Fmaster%2Fmedical_imaging_custom_training.ipynb">
      <img src="https://cloud.google.com/images/products/ai/ai-solutions-icon.svg" alt="Google Cloud Notebooks"> Open in GCP Notebooks
    </a>
  </td> 
  <td>
    <a href="https://colab.research.google.com/github/kweinmeister/notebooks/blob/master/medical_imaging_custom_training.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Open in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/kweinmeister/notebooks/blob/master/medical_imaging_custom_training.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>

## Overview


This tutorial demonstrates how to use the Vertex SDK for Python to train and deploy a custom image classification model using a managed dataset.

### Usage
The pneumonia detection model in this notebook is intended for **demonstration purposes only**. This model is not intended for use in clinical diagnosis or clinical decision-making or for any other clinical use, and the performance of the model for clinical use has not been established.

### Dataset

The data used for this example comes from the [RSNA 2018 Pneumonia Detection Challenge](https://www.rsna.org/education/ai-resources-and-training/ai-image-challenge/RSNA-Pneumonia-Detection-Challenge-2018). The data is also available as a [Kaggle dataset](https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data). The dataset was originally drawn from the [NIH Chest X-ray Dataset](https://nihcc.app.box.com/v/ChestXray-NIHCC).

### Resources
* Notebook adapted from [SDK custom image classification with online prediction sample notebook](https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/official/custom/sdk-custom-image-classification-online.ipynb)
* Vertex AI dataset adapted from [SDK custom training sample](https://github.com/GoogleCloudPlatform/ai-platform-samples/blob/master/ai-platform-unified/notebooks/unofficial/sdk/AI_Platform_(Unified)_SDK_Custom_Training_Python_Package_Managed_Text_Dataset_Tensorflow_Serving_Container.ipynb)
* TensorFlow image dataset handling adapted from [image_dataset.py](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/preprocessing/image_dataset.py) 


### References
* Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald Summers, ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases, IEEE CVPR, pp. 3462-3471, 2017
* George Shih , Carol C. Wu, Safwan S. Halabi, Marc D. Kohli, Luciano M. Prevedello, Tessa S. Cook, Arjun Sharma, Judith K. Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, Ritu R. Gill, Myrna C.B. Godoy, Stephen Hobbs, Jean Jeudy, Archana Laroia, Palmi N. Shah, Dharshan Vummidi, Kavitha Yaddanapudi, Anouk Stein, Augmenting the National Institutes of Health Chest Radiograph Dataset with Expert Annotations of Possible Pneumonia, Radiology: AI, January 30, 2019, https://doi.org/10.1148/ryai.2019180041


### Objective

In this notebook, you create a custom-trained model from a Python script in a Docker container using the Vertex SDK for Python. Alternatively, you can create custom-trained models using the `gcloud` command-line tool, or online using the Cloud Console.

The steps performed include:

- Create a Vertex AI custom job for training a model.
- Train a TensorFlow model.
- Deploy the `Model` resource to a serving `Endpoint` resource.
- Undeploy the `Model` resource.

### Costs

This tutorial uses billable components of Google Cloud (GCP):

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the latest version of the Vertex SDK for Python.

In [None]:
!pip3 install -Uq google-cloud-aiplatform

### Restart the kernel

Once you've installed everything, you need to restart the notebook kernel so it can find the packages.

In [None]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API and Compute Engine API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component).

4. If you are running this notebook locally, you will need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

if not os.getenv("IS_TESTING"):
    # Get your Google Cloud project ID from gcloud
    shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID: ", PROJECT_ID)

Otherwise, set your project ID here.

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None:
    PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already
authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions
when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

1. In the Cloud Console, go to the [**Create service account key**
   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).

2. Click **Create service account**.

3. In the **Service account name** field, enter a name, and
   click **Create**.

4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type "Vertex AI"
into the filter box, and select
   **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

5. Click **Create**. A JSON file that contains your key downloads to your
local environment.

6. Enter the path to your service account key as the
`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [None]:
import os
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

# If on Google Cloud Notebooks, then don't execute this code
if not os.path.exists("/opt/deeplearning/metadata/env_version"):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you submit a training job using the Vertex SDK for Python, you upload a Python package
containing your training code to a Cloud Storage bucket. Vertex AI runs
the code from this package. In this tutorial, Vertex AI also saves the
trained model that results from your job in the same bucket. Using this model artifact, you can then
create Vertex AI model and endpoint resources in order to serve
online predictions.

Set the name of your Cloud Storage bucket below. It must be unique across all
Cloud Storage buckets.

You may also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. Make sure to [choose a region where Vertex AI services are
available](https://cloud.google.com/vertex-ai/docs/general/locations#available_regions). You may
not use a Multi-Regional Storage bucket for training with Vertex AI.

In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
!gsutil mb -l $REGION $BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
!gsutil ls -al $BUCKET_NAME

### Managed dataset location

This notebook assumes that you've run the Vertex Pipeline notebook that pre-processes the X-ray image data, and imports the images into a managed dataset. Ensure that you've successfully run that notebook, and make note of the dataset ID.

In [None]:
DATASET_RESOURCE = "projects/YOUR-PROJECT/locations/YOUR-REGION/datasets/YOUR-DATASET" # @param {type:"string"}

### Set up variables

Next, set up some variables used throughout the tutorial.

#### Import Vertex SDK for Python

Import the Vertex SDK for Python into your Python environment and initialize it.

In [None]:
import os
import sys

import tensorflow as tf
from google.cloud import aiplatform
from google.cloud.aiplatform import gapic as aip
from IPython.display import Image
from tensorflow.python.ops import image_ops, io_ops

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_NAME)

#### Set hardware accelerators

You can set hardware accelerators for both training and prediction.

Set the variables `TRAIN_GPU/TRAIN_NGPU` and `DEPLOY_GPU/DEPLOY_NGPU` to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 Nvidia Tesla K80 GPUs allocated to each VM, you would specify:

    (aip.AcceleratorType.NVIDIA_TESLA_K80, 4)

See the [locations where accelerators are available](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

Otherwise specify `(None, None)` to use a container image to run on a CPU.

*Note*: TensorFlow releases earlier than 2.3 for GPU support fail to load the custom model in this tutorial. This issue is caused by static graph operations that are generated in the serving function. This is a known issue, which is fixed in TensorFlow 2.3. If you encounter this issue with your own custom models, use a container image for TensorFlow 2.3 or later with GPU support.

In [None]:
TRAIN_GPU, TRAIN_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_V100, 4)

DEPLOY_GPU, DEPLOY_NGPU = (aip.AcceleratorType.NVIDIA_TESLA_K80, 1)

#### Set pre-built containers

Vertex AI provides pre-built containers to run training and prediction.

For the latest list, see [Pre-built containers for training](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers) and [Pre-built containers for prediction](https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers).

In [None]:
TRAIN_VERSION = "tf-gpu.2-4"
DEPLOY_VERSION = "tf2-gpu.2-4"

TRAIN_IMAGE = "gcr.io/cloud-aiplatform/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "gcr.io/cloud-aiplatform/prediction/{}:latest".format(DEPLOY_VERSION)

print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)

#### Set machine types

Next, set the machine types to use for training and prediction.

- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure your compute resources for training and prediction.
 - `machine type`
     - `n1-standard`: 3.75 GB of memory per vCPU
     - `n1-highmem`: 6.5 GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96 \]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.

In [None]:
MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

MACHINE_TYPE = "n1-standard"

VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)
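
The naming rule used above (machine family, a hyphen, then the vCPU count) can be captured in a small helper. This is only a sketch: `build_machine_type`, `VALID_VCPUS`, and `UNSUPPORTED_FOR_TRAINING` are hypothetical names that are not part of the Vertex SDK, and the allowed values simply mirror the lists given in this section.

```python
# Hypothetical helper, not part of the Vertex SDK. The allowed values
# mirror the lists given in this section.
VALID_VCPUS = (2, 4, 8, 16, 32, 64, 96)
UNSUPPORTED_FOR_TRAINING = {
    "n1-standard": {2},
    "n1-highcpu": {2, 4, 8},
}


def build_machine_type(family: str, vcpus: int, for_training: bool = False) -> str:
    """Return a machine type string such as 'n1-standard-4'."""
    if vcpus not in VALID_VCPUS:
        raise ValueError(f"Unsupported vCPU count: {vcpus}")
    if for_training and vcpus in UNSUPPORTED_FOR_TRAINING.get(family, set()):
        raise ValueError(f"{family}-{vcpus} is not supported for training")
    return f"{family}-{vcpus}"


print(build_machine_type("n1-standard", 4, for_training=True))
```

Validating the combination up front gives a clear local error rather than a failed job submission.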

# Tutorial

Now you are ready to start creating your own custom-trained model.

## Train a model

There are two ways you can train a custom model using a container image:

- **Use a Google Cloud prebuilt container**. If you use a prebuilt container, you will additionally specify a Python package to install into the container image. This Python package contains your code for training a custom model.

- **Use your own custom container image**. If you use your own container, the container needs to contain your code for training a custom model.

### Define the command args for the training script

Prepare the command-line arguments to pass to your training script.
- `args`: The command line arguments to pass to the corresponding Python module. In this example, they will be:
  - `"--epochs=" + EPOCHS`: The number of epochs for training.
  - `"--lr=" + LEARNING_RATE`: The learning rate.
  - `"--distribute=" + TRAIN_STRATEGY`: The training distribution strategy to use for single or distributed training.
     - `"single"`: single device.
     - `"mirror"`: all GPU devices on a single compute instance.
     - `"multi"`: all GPU devices on all compute instances.
  - `"--batch_size=" + BATCH_SIZE`: The batch size.
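
As a rough illustration of how the `--distribute` value relates to the hardware, the sketch below picks a strategy name from the number of GPUs per replica and the number of replicas. `choose_strategy` is a hypothetical helper; the `"multi"` branch for more than one replica is an assumption, since this notebook itself only chooses between `"single"` and `"mirror"`.

```python
def choose_strategy(gpus_per_replica, replica_count=1):
    """Pick a value for --distribute from the hardware configuration."""
    if replica_count > 1:
        # Assumption: several machines imply multi-worker training.
        return "multi"
    if gpus_per_replica and gpus_per_replica >= 2:
        # Several GPUs on one machine.
        return "mirror"
    # One GPU, or CPU only.
    return "single"


print(choose_strategy(4))
print(choose_strategy(None))
```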

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

In [None]:
JOB_NAME = "custom_job_" + TIMESTAMP
MODEL_DIR = "{}/{}".format(BUCKET_NAME, JOB_NAME)

if not TRAIN_NGPU or TRAIN_NGPU < 2:
    TRAIN_STRATEGY = "single"
else:
    TRAIN_STRATEGY = "mirror"

EPOCHS = 3
BATCH_SIZE = 8
LEARNING_RATE = 0.05
CMDARGS = [
    f"--epochs={EPOCHS}",
    f"--lr={LEARNING_RATE}",
    f"--distribute={TRAIN_STRATEGY}",
    f"--batch_size={BATCH_SIZE}",
]

# Convert args to dictionary for logging
PARAMS = dict()
args = list(map(lambda x: x.split('='), CMDARGS))
for arg in args:
    PARAMS[arg[0][2:]] = arg[1]
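
To see how these strings are consumed, the sketch below feeds a list shaped like `CMDARGS` to a parser that uses the same argument names as the training script in this notebook. The values are illustrative, and `cmdargs` and `params` are stand-in names for this example only.

```python
import argparse

# Same argument names as the training script; the values are illustrative.
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=20)
parser.add_argument("--distribute", type=str, default="single")
parser.add_argument("--batch_size", type=int, default=32)

cmdargs = ["--epochs=3", "--lr=0.05", "--distribute=mirror", "--batch_size=8"]
args = parser.parse_args(cmdargs)
print(args.epochs, args.lr, args.distribute, args.batch_size)

# Strip the leading "--" and split on the first "=" to get name/value pairs.
# The values stay strings because they come from the raw arguments.
params = dict(arg[2:].split("=", 1) for arg in cmdargs)
print(params)
```

The parser converts each value to its declared type, while the dictionary built for logging keeps the original strings.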

#### Training script

In the next cell, you will write the contents of the training script, `task.py`. In summary:

- Gets the directory in which to save the model artifacts from the environment variable `AIP_MODEL_DIR`. This variable is set by the training service.
- Reads the locations of the managed dataset's training, validation, and test splits from the environment variables `AIP_TRAINING_DATA_URI`, `AIP_VALIDATION_DATA_URI`, and `AIP_TEST_DATA_URI`.
- Sets a training distribution strategy according to the argument `args.distribute`.
- Builds a model using the TF.Keras model API.
- Compiles the model (`compile()`).
- Trains the model (`fit()`) for `args.epochs` epochs, with the number of steps per epoch derived from the dataset size and the global batch size.
- Evaluates the model on the test split (`evaluate()`).
- Saves the trained model (`save()`) to the directory given by `AIP_MODEL_DIR`.
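
The training script reads each data split as JSON Lines, taking the image location from `imageGcsUri` and the label from the first entry of `classificationAnnotations`. The sketch below parses two such records without TensorFlow. The field names are the ones the script reads; the bucket and file names are made-up placeholders.

```python
import json

# Two illustrative records; the URIs are placeholders, not real objects.
sample_lines = [
    '{"imageGcsUri": "gs://example-bucket/images/a.png", '
    '"classificationAnnotations": [{"displayName": "0"}]}',
    '{"imageGcsUri": "gs://example-bucket/images/b.png", '
    '"classificationAnnotations": [{"displayName": "1"}]}',
]

class_names = ["0", "1"]
class_indices = {name: index for index, name in enumerate(class_names)}

uris, labels = [], []
for raw in sample_lines:
    record = json.loads(raw)
    uris.append(record["imageGcsUri"])
    # Only the first annotation is used, as in the training script.
    display_name = record["classificationAnnotations"][0]["displayName"]
    labels.append(class_indices[display_name])

print(uris)
print(labels)
```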

In [None]:
%%writefile task.py

import argparse
import json
import os
import sys
from collections import Counter

import tensorflow as tf
import tqdm
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.python.client import device_lib
from tensorflow.python.data.ops import dataset_ops
from tensorflow.python.keras.preprocessing import dataset_utils
from tensorflow.python.ops import image_ops, io_ops

IMAGE_SIZE = [384, 384]
CLASS_NAMES = ['0', '1']

parser = argparse.ArgumentParser()
parser.add_argument('--lr', dest='lr', default=0.001,
                    type=float, help='Learning rate.')
parser.add_argument('--epochs', dest='epochs', default=20,
                    type=int, help='Number of epochs.')
parser.add_argument('--distribute', dest='distribute', type=str,
                    default='single', help='distributed training strategy')
parser.add_argument('--batch_size', dest='batch_size',
                    default=32, type=int, help='Batch size')
args = parser.parse_args()


print('Python Version = {}'.format(sys.version))
print('TensorFlow Version = {}'.format(tf.__version__))
print('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
print('DEVICES', device_lib.list_local_devices())

aip_model_dir = os.environ.get('AIP_MODEL_DIR')
aip_data_format = os.environ.get('AIP_DATA_FORMAT')
aip_training_data_uri = os.environ.get('AIP_TRAINING_DATA_URI')
aip_validation_data_uri = os.environ.get('AIP_VALIDATION_DATA_URI')
aip_test_data_uri = os.environ.get('AIP_TEST_DATA_URI')

print(f"aip_model_dir: {aip_model_dir}")
print(f"aip_data_format: {aip_data_format}")
print(f"aip_training_data_uri: {aip_training_data_uri}")
print(f"aip_validation_data_uri: {aip_validation_data_uri}")
print(f"aip_test_data_uri: {aip_test_data_uri}")

# Single Machine, single compute device
if args.distribute == 'single':
    is_gpu_available = len(tf.config.list_physical_devices('GPU')) > 0
    if is_gpu_available:
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
# Single Machine, multiple compute device
elif args.distribute == 'mirror':
    strategy = tf.distribute.MirroredStrategy()
# Multiple Machine, multiple compute device
elif args.distribute == 'multi':
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
else:
    raise ValueError(
        'Unknown --distribute value: {}'.format(args.distribute))

# Multi-worker configuration
print('num_replicas_in_sync = {}'.format(strategy.num_replicas_in_sync))


def paths_and_labels_to_dataset(image_paths,
                                labels,
                                label_mode,
                                num_classes):
    """Constructs a dataset of images and labels."""
    path_ds = dataset_ops.Dataset.from_tensor_slices(image_paths)
    img_ds = path_ds.map(lambda x: load_image(x))
    if label_mode:
        label_ds = dataset_utils.labels_to_dataset(
            labels, label_mode, num_classes)
        img_ds = dataset_ops.Dataset.zip((img_ds, label_ds))
    return img_ds


def load_image(path):
    """Load an image from a path and base64 encode it"""
    img = io_ops.read_file(path)
    img = tf.io.encode_base64(img)
    return img


def load_aip_dataset(aip_data_uri_pattern, batch_size, class_names,
                     test_run=False, shuffle=True, seed=42):
    """Load images and labels into a TensorFlow dataset"""

    data_file_urls = list()
    labels = list()

    class_indices = dict(zip(class_names, range(len(class_names))))

    for aip_data_uri in tqdm.tqdm(tf.io.gfile.glob(
                                  pattern=aip_data_uri_pattern)):
        with tf.io.gfile.GFile(name=aip_data_uri, mode='r') as gfile:
            for line in gfile.readlines():
                line = json.loads(line)
                data_file_urls.append(line['imageGcsUri'])
                classification_annotation = line['classificationAnnotations'][0]
                label = classification_annotation['displayName']
                labels.append(class_indices[label])
                if test_run:
                    break

    dataset = paths_and_labels_to_dataset(data_file_urls, labels, 'binary', 2)

    if shuffle:
        dataset = dataset.shuffle(buffer_size=batch_size * 8, seed=seed)
    dataset = dataset.batch(batch_size)
    dataset.class_names = class_names
    dataset.file_paths = data_file_urls

    return dataset, Counter(labels)


def base64str_to_tensor(base64_str, num_channels=1, image_size=IMAGE_SIZE,
                        interpolation='bilinear'):
    """Decode the base64 encoded string, then decode and resize the image"""
    img = tf.io.decode_base64(base64_str)
    img = image_ops.decode_image(
        img, channels=num_channels, expand_animations=False)
    img = image_ops.resize_images_v2(img, image_size, method=interpolation)
    return img


def build_and_compile_cnn_model():
    """Build the Keras model"""
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(1,), dtype='string'),
        tf.keras.layers.Lambda(lambda img: tf.map_fn(
            lambda x: base64str_to_tensor(x[0]), img,
            fn_output_signature='float32')),
        preprocessing.Rescaling(1./255),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(1, activation='sigmoid')])
    model.compile(
        loss=tf.keras.losses.BinaryCrossentropy(),
        optimizer=tf.keras.optimizers.SGD(learning_rate=args.lr),
        metrics=['accuracy'])
    return model


def calculate_class_weight(counts):
    """Calculate a class weighting that will address an imbalanced dataset"""
    total = counts[0] + counts[1]
    weight_0 = (1 / counts[0]) * total / 2.0
    weight_1 = (1 / counts[1]) * total / 2.0

    class_weight = {0: weight_0, 1: weight_1}

    return class_weight


# Train the model
NUM_WORKERS = strategy.num_replicas_in_sync
# Here the batch size scales up by number of workers since
# `tf.data.Dataset.batch` expects the global batch size.
GLOBAL_BATCH_SIZE = args.batch_size * NUM_WORKERS

with strategy.scope():
    # Dataset creation and model building/compiling
    # need to be within `strategy.scope()`.
    train_ds, train_counts = load_aip_dataset(
        aip_training_data_uri, GLOBAL_BATCH_SIZE, CLASS_NAMES)
    validation_ds, validation_counts = load_aip_dataset(
        aip_validation_data_uri, GLOBAL_BATCH_SIZE, CLASS_NAMES)
    test_ds, _ = load_aip_dataset(
        aip_test_data_uri, GLOBAL_BATCH_SIZE, CLASS_NAMES)
    model = build_and_compile_cnn_model()

# `summary()` prints the model layout itself and returns None
model.summary()

class_weight = calculate_class_weight(train_counts)
train_steps = (train_counts[0] + train_counts[1]) // GLOBAL_BATCH_SIZE
validation_steps = (validation_counts[0] +
                    validation_counts[1]) // GLOBAL_BATCH_SIZE

model.fit(x=train_ds, epochs=args.epochs, steps_per_epoch=train_steps,
          validation_data=validation_ds, validation_steps=validation_steps,
          class_weight=class_weight)
model.evaluate(x=test_ds)
model.save(aip_model_dir)
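
The class weighting and step arithmetic in the script can be checked by hand. The sketch below repeats the same formulas on hypothetical label counts (these are not the real dataset statistics) and on an assumed configuration of 4 replicas with a per-replica batch size of 8.

```python
from collections import Counter

# Hypothetical label counts, not the real dataset statistics.
counts = Counter({0: 300, 1: 100})
total = counts[0] + counts[1]

# Same formula as calculate_class_weight: the rarer class gets the
# larger weight, so both classes contribute equally to the loss.
class_weight = {label: (1 / counts[label]) * total / 2.0 for label in (0, 1)}
print(class_weight)

# Global batch size = per-replica batch size x replicas in sync.
per_replica_batch_size = 8  # assumed value
num_replicas = 4            # assumed value
global_batch_size = per_replica_batch_size * num_replicas
steps_per_epoch = total // global_batch_size
print(global_batch_size, steps_per_epoch)
```

With these weights, each class's count multiplied by its weight comes to half of the total number of examples.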

### Train the model

Define your custom training job on Vertex AI.

Use the `CustomTrainingJob` class to define the job, which takes the following parameters:

- `display_name`: The user-defined name of this training pipeline.
- `script_path`: The local path to the training script.
- `container_uri`: The URI of the training container image.
- `requirements`: The list of Python package dependencies of the script.
- `model_serving_container_image_uri`: The URI of a container that can serve predictions for your model — either a prebuilt container or a custom container.

Use the `run` function to start training, which takes the following parameters:

- `args`: The command line arguments to be passed to the Python script.
- `replica_count`: The number of worker replicas.
- `model_display_name`: The display name of the `Model` if the script produces a managed `Model`.
- `machine_type`: The type of machine to use for training.
- `accelerator_type`: The hardware accelerator type.
- `accelerator_count`: The number of accelerators to attach to a worker replica.

The `run` function creates a training pipeline that trains and creates a `Model` object. After the training pipeline completes, the `run` function returns the `Model` object.

In [None]:
ds = aiplatform.ImageDataset(DATASET_RESOURCE)

# Log parameters
aiplatform.init(experiment='pneumonia-detection-experiment')
aiplatform.start_run("pneumonia-detection-run-" + TIMESTAMP)
aiplatform.log_params(PARAMS)

# Create job
job = aiplatform.CustomTrainingJob(
    display_name=JOB_NAME,
    script_path="task.py",
    container_uri=TRAIN_IMAGE,
    model_serving_container_image_uri=DEPLOY_IMAGE,
)

MODEL_DISPLAY_NAME = "pneumonia-detection-" + TIMESTAMP

# Start the training
if TRAIN_GPU:
    model = job.run(
        dataset=ds,
        annotation_schema_uri=aiplatform.schema.dataset.annotation.image.classification,
        model_display_name=MODEL_DISPLAY_NAME,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_type=TRAIN_GPU.name,
        accelerator_count=TRAIN_NGPU,
    )
else:
    model = job.run(
        dataset=ds,
        annotation_schema_uri=aiplatform.schema.dataset.annotation.image.classification,
        model_display_name=MODEL_DISPLAY_NAME,
        args=CMDARGS,
        replica_count=1,
        machine_type=TRAIN_COMPUTE,
        accelerator_count=0,
    )

### Deploy the model

Before you use your model to make predictions, you need to deploy it to an `Endpoint`. You can do this by calling the `deploy` function on the `Model` resource. This will do two things:

1. Create an `Endpoint` resource for deploying the `Model` resource to.
2. Deploy the `Model` resource to the `Endpoint` resource.


The function takes the following parameters:

- `deployed_model_display_name`: A human readable name for the deployed model.
- `traffic_split`: Percent of traffic at the endpoint that goes to this model, which is specified as a dictionary of one or more key/value pairs.
   - If only one model, then specify as **{ "0": 100 }**, where "0" refers to this model being uploaded and 100 means 100% of the traffic.
   - If there are existing models on the endpoint across which the traffic will be split, then use `model_id` to specify as **{ "0": percent, model_id: percent, ... }**, where `model_id` is the ID of a model already deployed to the endpoint. The percentages must add up to 100.
- `machine_type`: The type of machine to use for serving predictions.
- `accelerator_type`: The hardware accelerator type.
- `accelerator_count`: The number of accelerators to attach to each serving node.
- `min_replica_count`: The minimum number of compute instances to provision.
- `max_replica_count`: The maximum number of compute instances to scale to. In this tutorial, only one instance is provisioned.

### Traffic split

The `traffic_split` parameter is specified as a Python dictionary. You can deploy more than one instance of your model to an endpoint, and then set the percentage of traffic that goes to each instance.

You can use a traffic split to introduce a new model gradually into production. For example, if you had one existing model in production with 100% of the traffic, you could deploy a new model to the same endpoint, direct 10% of traffic to it, and reduce the original model's traffic to 90%. This allows you to monitor the new model's performance while minimizing the disruption to the majority of users.
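
A traffic split dictionary is easy to get wrong by a few percent, so it can help to check it before deploying. The following is a minimal sketch: `validate_traffic_split` is a hypothetical helper, not an SDK function, and the model ID `"1234567890"` is made up for illustration.

```python
def validate_traffic_split(traffic_split):
    """Check that keys are strings and percentages sum to 100."""
    for model_id, percent in traffic_split.items():
        if not isinstance(model_id, str):
            raise TypeError(f"Model ID keys must be strings: {model_id!r}")
        if not 0 <= percent <= 100:
            raise ValueError(f"Percentage out of range for {model_id}: {percent}")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"Traffic percentages must sum to 100, got {total}")
    return traffic_split


# "0" stands for the model being deployed; the other key is a made-up ID
# of a model assumed to be on the endpoint already.
canary_split = validate_traffic_split({"0": 10, "1234567890": 90})
print(canary_split)
```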

### Compute instance scaling

You can specify a single instance (or node) to serve your online prediction requests. This tutorial uses a single node, so the variables `MIN_NODES` and `MAX_NODES` are both set to `1`.

If you want to use multiple nodes to serve your online prediction requests, set `MAX_NODES` to the maximum number of nodes you want to use. Vertex AI autoscales the number of nodes used to serve your predictions, up to the maximum number you set. Refer to the [pricing page](https://cloud.google.com/vertex-ai/pricing#prediction-prices) to understand the costs of autoscaling with multiple nodes.

### Endpoint

The method will block until the model is deployed and eventually return an `Endpoint` object. If this is the first time a model is deployed to the endpoint, it may take a few additional minutes to complete provisioning of resources.

In [None]:
DEPLOYED_NAME = "pneumonia-detection-deployed-" + TIMESTAMP

TRAFFIC_SPLIT = {"0": 100}

MIN_NODES = 1
MAX_NODES = 1

if DEPLOY_GPU:
    endpoint = model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=DEPLOY_COMPUTE,
        accelerator_type=DEPLOY_GPU.name,
        accelerator_count=DEPLOY_NGPU,
        min_replica_count=MIN_NODES,
        max_replica_count=MAX_NODES,
    )
else:
    endpoint = model.deploy(
        deployed_model_display_name=DEPLOYED_NAME,
        traffic_split=TRAFFIC_SPLIT,
        machine_type=DEPLOY_COMPUTE,
        accelerator_count=0,
        min_replica_count=MIN_NODES,
        max_replica_count=MAX_NODES,
    )

## Prediction

Finally, test that the custom-trained model is functioning properly with a couple of predictions.

*Note: the notebook currently resizes images prior to prediction to stay under request size limits. In the future, this example will be updated to avoid this additional step.*

### Constants

In [None]:
# Update paths as needed
BUCKET_PATH = 'data/pneumonia' # Location of images in bucket
LOCAL_IMAGES_PATH = 'images' # Local location to store images
OUTPUT_IMAGES_URI=f'{BUCKET_NAME}/{BUCKET_PATH}/stage_2_train_images_converted' # Full path to converted images

IMAGE_SIZE = [384, 384]

# Test images
PNEUMONIA_IMAGE = 'ce8e0d85-44a8-4b05-b421-cc5c510aa7a5.png'
NO_PNEUMONIA_IMAGE = '4c52f9c5-6ce0-4e28-ae5d-ddb37f374f31.png'

### Helper functions

In [None]:
def encode_image(filename):
    """
    Resize input image to match size of trained images
    (and to fit under 1.5MB prediction request limit)
    """

    # Read file
    img = io_ops.read_file(filename)

    # Convert bytes into a tensor
    img = image_ops.decode_image(img, channels=1, expand_animations=False)

    # Resize image
    img = image_ops.resize_images_v2(img, IMAGE_SIZE, method='bilinear')

    # Cast to uint8
    img = tf.cast(img, 'uint8')

    # Encode image as PNG
    img = image_ops.encode_png(img)

    # Base64 encode image
    img = tf.io.encode_base64(img)

    # Retrieve as array
    img = img.numpy()

    # Convert to string from bytes
    img = tf.compat.as_str_any(img)

    # Return a nested list: one instance holding one base64 string
    return [[img]]


def predict_image_classification_custom(
    project: str,
    endpoint_id: str,
    filename: str,
    location: str = REGION,
    api_endpoint: str = f"{REGION}-aiplatform.googleapis.com",
):
    """ Predict using a custom trained image classification model """

    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}

    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    prediction_client = aiplatform.gapic.PredictionServiceClient(
        client_options=client_options)

    endpoint = prediction_client.endpoint_path(
        project=project, location=location, endpoint=endpoint_id
    )

    instances = encode_image(filename)

    response = prediction_client.predict(
        endpoint=endpoint, instances=instances
    )
    print(" deployed_model_id:", response.deployed_model_id)

    predictions = response.predictions
    for prediction in predictions:
        print(" prediction:", prediction[0])
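
The shape of the request payload and the size check can be tried without TensorFlow or a deployed endpoint. The sketch below rests on several assumptions: `build_instances` and `request_size` are hypothetical helpers, the size limit is taken as 1,500,000 bytes, the byte string stands in for a real PNG, and the standard library's URL-safe base64 with padding stripped is used to approximate the web-safe encoding that `tf.io.encode_base64` produces by default.

```python
import base64
import json

# Assumption: the 1.5 MB limit mentioned above, taken as 1.5e6 bytes.
MAX_REQUEST_BYTES = 1_500_000


def build_instances(image_bytes: bytes):
    """Wrap raw image bytes as [[<web-safe base64 string>]]."""
    encoded = base64.urlsafe_b64encode(image_bytes).rstrip(b"=").decode("ascii")
    # One instance, holding one string feature.
    return [[encoded]]


def request_size(instances) -> int:
    """Approximate the request size as the length of the JSON body."""
    return len(json.dumps({"instances": instances}).encode("utf-8"))


fake_png = bytes(range(256)) * 100  # stand-in for real PNG bytes
instances = build_instances(fake_png)
print(len(instances), len(instances[0]))
print(request_size(instances) <= MAX_REQUEST_BYTES)
```

Base64 adds roughly one third to the size of the raw bytes, which is why the image is resized before it is encoded.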

### Copy test files locally

In [None]:
!mkdir -p '{LOCAL_IMAGES_PATH}'
!gsutil cp '{OUTPUT_IMAGES_URI}/{PNEUMONIA_IMAGE}' '{LOCAL_IMAGES_PATH}'
!gsutil cp '{OUTPUT_IMAGES_URI}/{NO_PNEUMONIA_IMAGE}' '{LOCAL_IMAGES_PATH}'

### Make predictions

In [None]:
# Access the endpoint ID, which is used to make predictions

try:
    endpoint_name = endpoint.name
except NameError:
    # `endpoint` is not defined in this session, so fall back to the
    # first endpoint returned by `Endpoint.list()`
    endpoint_name = aiplatform.Endpoint.list()[0].name

In [None]:
# Predict on image without pneumonia

display(Image(f'{LOCAL_IMAGES_PATH}/{NO_PNEUMONIA_IMAGE}', width=384, height=384))
predict_image_classification_custom(project=PROJECT_ID, endpoint_id=endpoint_name, location=REGION, filename=f'{LOCAL_IMAGES_PATH}/{NO_PNEUMONIA_IMAGE}')

In [None]:
# Predict on image with pneumonia

display(Image(f'{LOCAL_IMAGES_PATH}/{PNEUMONIA_IMAGE}', width=384, height=384))
predict_image_classification_custom(project=PROJECT_ID, endpoint_id=endpoint_name, location=REGION, filename=f'{LOCAL_IMAGES_PATH}/{PNEUMONIA_IMAGE}')
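
Each prediction is the single sigmoid output of the model's final `Dense(1, activation='sigmoid')` layer, so it is a score between 0 and 1 rather than a class name. The sketch below maps a score to one of the script's `CLASS_NAMES`. `score_to_label` is a hypothetical helper, and the 0.5 threshold is an assumption chosen for illustration, not a value validated for this model.

```python
CLASS_NAMES = ["0", "1"]  # same order as in the training script


def score_to_label(score: float, threshold: float = 0.5) -> str:
    """Map a sigmoid score to a class name using a fixed threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"Score must be between 0 and 1, got {score}")
    return CLASS_NAMES[1] if score >= threshold else CLASS_NAMES[0]


print(score_to_label(0.12))
print(score_to_label(0.87))
```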

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [None]:
# Delete the training job
job.delete()

# Undeploy the model, then delete the endpoint
endpoint.undeploy_all()
endpoint.delete()

# Delete the model
model.delete()