In [None]:
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### Import packages

In [40]:
from datetime import datetime
from pytz import timezone

This notebook is adapted from the following codelab:
https://codelabs.developers.google.com/vertex_custom_training_prediction

## 1. Overview
In this lab, you will use GKE instead of [Vertex AI](https://cloud.google.com/vertex-ai/docs) to train and serve a scikit-learn model using code in a custom container.

While we're using scikit-learn for the model code here, you could easily replace it with another framework.

What you'll learn

You'll learn how to:

+ Build and containerize model training code in Vertex Notebooks
+ Submit a custom model training job to GKE
+ Deploy your trained model to GKE as a job, and use that job to get predictions

## 2. Intro to Vertex AI
This lab uses GKE to run training and predictions.

Although running training and prediction on GKE is an option, also consider Vertex AI, the newest AI product offering on Google Cloud. [Vertex AI](https://cloud.google.com/vertex-ai/docs) integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI. If you have any feedback, please see the [support page](https://cloud.google.com/vertex-ai/docs/support/getting-support).

Vertex AI includes many different products to support end-to-end ML workflows. The products most relevant to this lab are Training, Prediction, and Notebooks.

## 3. Set up your environment
You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the [instructions here](https://cloud.google.com/resource-manager/docs/creating-managing-projects).

### Step 3.1: Enable the Compute Engine API
Navigate to [Compute Engine](https://console.cloud.google.com/marketplace/details/google/compute.googleapis.com) and select **Enable** if it isn't already enabled. You'll need this to create your notebook instance.
### Step 3.2: Enable the Vertex AI API
Navigate to the [Vertex AI section of your Cloud Console](https://console.cloud.google.com/ai/platform) and click **Enable Vertex AI API**.
### Step 3.3: Enable the Container Registry API
Navigate to the [Container Registry](https://console.cloud.google.com/apis/library/containerregistry.googleapis.com) and select **Enable** if it isn't already. You'll use this to create a container for your custom training job.
### Step 3.4: Create a Vertex Notebooks instance
From the [Vertex AI section](https://console.cloud.google.com/ai/platform) of your Cloud Console, click on Notebooks.

From there, select **New Instance**. Then select the **TensorFlow Enterprise 2.3** instance type **without GPUs**.

## 4. Containerize training code
We'll submit this training job to GKE by putting our training code in a [Docker container](https://www.docker.com/resources/what-container) and pushing this container to [Google Container Registry](https://cloud.google.com/container-registry). Using this approach, we can train a model built with any framework.

Create a new directory to package the container, in this case *sklearn_container_training*:

In [41]:
CONTAINER_DIR = "sklearn_container_training"
TASK_TYPE = "sklearn_container_training"
TASK_NAME = f"{TASK_TYPE}"
TASK_DIR = f"./{TASK_NAME}"
PYTHON_PACKAGE_APPLICATION_DIR = f"{TASK_NAME}/trainer"

print(f"Task Name:      {TASK_NAME}")
print(f"Task Directory: {TASK_DIR}")
print(f"Python Package Directory: {PYTHON_PACKAGE_APPLICATION_DIR}")

Task Name:      sklearn_container_training
Task Directory: ./sklearn_container_training
Python Package Directory: sklearn_container_training/trainer


In [42]:
# Create the container directory
!mkdir -p $CONTAINER_DIR

### Step 4.1: Create a Dockerfile
Our first step in containerizing our code is to create a Dockerfile. In our Dockerfile we'll include all the commands needed to run our image. It'll install all the libraries we're using and set up the entry point for our training code. From your Terminal, create an empty Dockerfile:

In [43]:
!touch $CONTAINER_DIR/Dockerfile

Open the Dockerfile and copy the following into it:

In [44]:
%%writefile $CONTAINER_DIR/Dockerfile
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3
# Run from the filesystem root so that `python -m trainer.train` finds /trainer
WORKDIR /

# Install pip reqs from both user and default
# NOTE: for this implementation, requirements.txt specifies 
#   the tornado, scikit-learn, and joblib libraries in 
#   the format: [library]==[version]. Build the requirements.txt
#   file to match your needs
#RUN pip install -r requirements.txt

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.train"]

Overwriting sklearn_container_training/Dockerfile


This Dockerfile uses the [Deep Learning Container TensorFlow Enterprise 2.3 Docker image](https://cloud.google.com/ai-platform/deep-learning-containers/docs/choosing-container#choose_a_container_image_type). The Deep Learning Containers on Google Cloud come with many common ML and data science frameworks pre-installed. The one we're using includes TF Enterprise 2.3, Pandas, Scikit-learn, and others. After downloading that image, this Dockerfile sets up the entrypoint for our training code. We haven't created these files yet – in the next step, we'll add the code for training and exporting our model.

Additional documentation on creating a custom container image is [here](https://cloud.google.com/vertex-ai/docs/training/create-custom-container).
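
The entry point in the Dockerfile relies on Python's `-m` switch: because the `trainer` directory is copied to `/trainer` and the working directory is `/`, the interpreter can resolve `trainer.train` as a module. The following is a minimal local sketch of that behavior, assuming nothing about the real training code: it writes a hypothetical stand-in `train.py` into a temporary directory and runs it the same way the container would.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_module_from(root: Path, module: str, *args: str) -> str:
    """Run `python -m <module> <args>` with `root` as the working directory."""
    result = subprocess.run(
        [sys.executable, "-m", module, *args],
        cwd=root,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Mirror the container layout: <root>/trainer/train.py
    (root / "trainer").mkdir()
    # Hypothetical stand-in script that only echoes its arguments
    (root / "trainer" / "train.py").write_text(
        "import sys\nprint('args:', sys.argv[1:])\n"
    )
    output = run_module_from(root, "trainer.train", "--input=in.csv", "--job-dir=out")

print(output)
```

Arguments placed after the image name in `docker run` are appended to the `ENTRYPOINT` in the same way the extra arguments are appended here.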

### Step 4.2: Create a Cloud Storage bucket
In our training job, we'll export our trained scikit-learn model to a Cloud Storage bucket. The prediction job on GKE (or Vertex AI) will read the exported model assets from this bucket. The cells below look up your project ID with `gcloud` and derive the bucket name from it; replace these values if you want to use a different project or bucket.

You can also get your project ID by running `gcloud config list --format 'value(core.project)'` in your terminal.

In [45]:
project_id = !gcloud config list --format 'value(core.project)' 2>/dev/null

In [46]:
# Configure your global variables
PROJECT = project_id[0]                     # Replace with your project ID if needed
BUCKET_NAME = project_id[0] + '-vertex-ai'  # Replace with your GCS bucket name if needed
REGION = 'us-central1'                      # Bucket should be in the same region as the model deployment

FOLDER_NAME = 'sklearn_models'
ALGORITHM = 'isolation_forest'
TIMEZONE = 'US/Pacific'

TRAIN_FEATURE_PATH = f"gs://{BUCKET_NAME}/{FOLDER_NAME}_data/{ALGORITHM}/train/train.csv"
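
Paths such as `TRAIN_FEATURE_PATH` follow the `gs://<bucket>/<object path>` layout. As a small sanity check before submitting a job, a helper like the one below can split such a URI into its two parts. This is a sketch only: `split_gcs_uri` is a hypothetical helper, not part of any Google Cloud library, and the bucket name in the example is made up.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, object_path); raise on anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Example with a made-up bucket name
bucket, blob = split_gcs_uri(
    "gs://example-project-vertex-ai/sklearn_models_data/isolation_forest/train/train.csv"
)
print(bucket)
print(blob)
```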

In [47]:
# Google Cloud AI Platform requires each job to have a unique name, so the
# original approach forms job names from a prefix plus a timestamp:
# JOB_NAME = 'custom_container_isolation_forest_{}'.format(
#     datetime.now(timezone(TIMEZONE)).strftime("%m%d%y_%H%M")
#     )

# This lab uses a fixed job name instead.
JOB_NAME = 'custom_container_isolation_forest_gke'
# We use the job names as folder names to store outputs.
JOB_DIR = 'gs://{}/{}/{}'.format(
    BUCKET_NAME,
    FOLDER_NAME,
    JOB_NAME,
    )

print("JOB_NAME = ", JOB_NAME)
print("JOB_DIR = ", JOB_DIR)

MAX_SAMPLES = '100'  # Number of samples drawn to train each tree
RANDOM_STATE_SEED = '42'

print("MAX_SAMPLES = ", MAX_SAMPLES)
print("RANDOM_STATE_SEED = ", RANDOM_STATE_SEED)

JOB_NAME =  custom_container_isolation_forest_gke
JOB_DIR =  gs://virtual-anomaly-vertex-ai/sklearn_models/custom_container_isolation_forest_gke
MAX_SAMPLES =  100
RANDOM_STATE_SEED =  42
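
If you prefer a separate output folder per run, the commented-out prefix-plus-timestamp naming can be restored. The sketch below shows one way to do it with the standard library only; `make_job_name` is a hypothetical helper, and it uses UTC rather than the `TIMEZONE` variable so that it does not depend on `pytz`.

```python
import datetime as dt
from typing import Optional


def make_job_name(prefix: str, now: Optional[dt.datetime] = None) -> str:
    """Return '<prefix>_<MMDDYY_HHMM>' using the given time (default: now, UTC)."""
    if now is None:
        now = dt.datetime.now(dt.timezone.utc)
    return "{}_{}".format(prefix, now.strftime("%m%d%y_%H%M"))


# A fixed timestamp keeps the example reproducible
example_name = make_job_name(
    "custom_container_isolation_forest",
    dt.datetime(2021, 6, 7, 8, 9, tzinfo=dt.timezone.utc),
)
print(example_name)
```

Two runs started within the same minute would still share a name, so add seconds to the format string if that matters for your workflow.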


Next, run the following to create a new bucket in your project. The `-l` (location) flag is important since the bucket needs to be in the same region where you deploy the model later in the tutorial:

In [None]:
!gsutil mb -l $REGION gs://$BUCKET_NAME 

### Step 4.3: Add model training code
From your Terminal, run the following to create a directory for our training code and a Python file where we'll add the code:

In [48]:
!mkdir -p $CONTAINER_DIR/trainer
!touch $CONTAINER_DIR/trainer/train.py

You should now have the following in your `sklearn_container_training/` directory:

+ Dockerfile
+ trainer/
    + train.py

Next, open the train.py file you just created and copy the code below (this is adapted from the tutorial in the TensorFlow docs).

The script does not hardcode a bucket: the training data location and the output directory are passed on the command line with `--input` and `--job-dir`.

In [52]:
%%writefile {CONTAINER_DIR}/trainer/train.py
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import argparse
import logging
import os

import numpy as np
import pandas as pd
import pathlib
import tensorflow as tf

from datetime import datetime
from sklearn.ensemble import IsolationForest
import joblib

print(tf.__version__)

# If the input CSV file has a header row, then set CSV_COLUMNS to None.
# Otherwise, set CSV_COLUMNS to a list of target and feature names:
# i.e. CSV_COLUMNS = None
CSV_COLUMNS = [
    'dimension_1',
    'dimension_2'
]

# Target name
# i.e. TARGET_NAME = 'tip'
TARGET_NAME = None

# The features to be used for training.
# If FEATURE_NAMES is None, then all the available columns will be
# used as features, except for the target column.
# i.e. FEATURE_NAMES = ['trip_miles','trip_seconds','fare','trip_start_month','trip_start_hour','trip_start_day',]
FEATURE_NAMES = None

# If the model is serialized using joblib
# then use 'model.joblib' for the model name
MODEL_FILE_NAME = 'model.joblib'

# Set to True if you want to tune some hyperparameters (currently unused)
HYPERPARAMETER_TUNING = False

def read_df_from_gcs(file_pattern):
    """Read CSV data from Google Cloud Storage into a single DataFrame.
    If CSV_COLUMNS is set, the files are assumed to have no header row and
    the column names are taken from CSV_COLUMNS. Otherwise the first row of
    each file is used as the header.
    Args:
      file_pattern: (string) pattern of the files containing training data.
      For example: [gs://bucket/folder_name/prefix]
    Returns:
      pandas.DataFrame
    """

    # Read each matching file directly from GCS and collect the DataFrames
    df_list = []

    for filepath in tf.io.gfile.glob(file_pattern):
        with tf.io.gfile.GFile(filepath, 'r') as f:
            if CSV_COLUMNS is None:
                df_list.append(pd.read_csv(f))
            else:
                df_list.append(pd.read_csv(f, names=CSV_COLUMNS,
                                           header=None))

    # Fail early with a clear message if the pattern matched nothing
    if not df_list:
        raise FileNotFoundError('No files matched: ' + file_pattern)

    data_df = pd.concat(df_list)

    return data_df

def upload_to_gcs(local_path, gcs_path):
    """Upload local file to Google Cloud Storage.
    Args:
      local_path: (string) Local file
      gcs_path: (string) Google Cloud Storage destination
    Returns:
      None
    """
    tf.io.gfile.copy(local_path, gcs_path)
    
def dump_object(object_to_dump, output_path):
    """Pickle the object and save to the output_path.
    Args:
      object_to_dump: Python object to be pickled
      output_path: (string) output path which can be Google Cloud Storage
    Returns:
      None
    """

    if not tf.io.gfile.exists(output_path):
        tf.io.gfile.makedirs(os.path.dirname(output_path))
    # joblib writes bytes, so open the target in binary mode
    with tf.io.gfile.GFile(output_path, 'wb') as wf:
        joblib.dump(object_to_dump, wf)
        
def get_estimator(arguments):
    """Create an Isolation Forest estimator for anomaly detection.

    Args:
      arguments: (argparse.Namespace), parameters passed from command-line
    Returns:
      estimator - the Isolation Forest estimator (still needs to be trained)
    """

    # max_samples and random_state_seed are expected to be passed as
    # command line arguments to train.py

    # max_samples: "auto", int or float, default="auto"
    # The number of samples to draw from X to train each base estimator.

    # random_state: int, RandomState instance or None, default=None
    # Controls the pseudo-randomness of the selection of the feature and
    # split values for each branching step and each tree in the forest.
    
    estimator = IsolationForest(
        max_samples=arguments.max_samples,
        random_state=arguments.random_state_seed)

    return estimator

def train_and_evaluate(estimator, dataset, output_dir):
    """Runs model training and exports the fitted model.
    Args:
      estimator: (sklearn.ensemble.IsolationForest), unfitted estimator
      dataset: (numpy.ndarray), array containing the training data
      output_dir: (string), directory that the trained model will be exported
    Returns:
      None
    """
    # Isolation Forest is unsupervised, so the whole dataset is used for fitting
    x_train = dataset

    estimator.fit(x_train)

    # Write the model to `output_dir`
    model_output_path = os.path.join(output_dir,
                                     MODEL_FILE_NAME)

    dump_object(estimator, model_output_path)
    
def run_experiment(arguments):
    """Testbed for running model training and evaluation."""
    # Get data for training and evaluation

    logging.info('Arguments: %s', arguments)
    
    # Get the training data
    logging.info('Getting the training data from: ' + arguments.input)
    dataset_df = read_df_from_gcs(arguments.input)
    dataset = dataset_df.to_numpy()

    # Get estimator
    estimator = get_estimator(arguments)

    # Run training and evaluation
    logging.info('Running training, outputting model to: ' + arguments.job_dir)
    train_and_evaluate(estimator, dataset, arguments.job_dir)
    
def parse_args():
    """Parses command-line arguments.

    Returns:
      argparse.Namespace holding the parsed arguments.
    """
    parser = argparse.ArgumentParser()

    parser.add_argument('--log-level', help='Logging level.', choices=['DEBUG', 'ERROR', 'FATAL', 'INFO', 'WARN'], default='INFO')
    parser.add_argument('--input', help='CSV file to use for training and evaluation.', required=True)
    parser.add_argument('--job-dir', help='Output directory for exporting model and other metadata.', required=True)
    parser.add_argument('--max-samples', type=int, default=100, help='number of samples to draw to train each tree, default=100')
    parser.add_argument('--random-state-seed', type=int, default=42, help='random seed used to initialize the pseudo-random number generator, default=42')
    parser.add_argument('--n-estimators', default=10, type=int, help='Number of trees in the forest.')
    parser.add_argument('--max-depth', type=int, default=3, help='The maximum depth of the tree.')

    return parser.parse_args()

if __name__ == '__main__':
    # Entry point: parse arguments, configure logging, then run training

    arguments = parse_args()
    logging.basicConfig(level=arguments.log_level)
    # Run the training experiment and log how long it took
    time_start = datetime.utcnow()
    run_experiment(arguments)
    time_end = datetime.utcnow()
    time_elapsed = time_end - time_start
    logging.info('Experiment elapsed time: {} seconds'.format(
        time_elapsed.total_seconds()))

Overwriting sklearn_container_training/trainer/train.py
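
The training script reads its configuration from command-line flags, and `argparse` converts hyphens in flag names to underscores in attribute names (`--job-dir` becomes `arguments.job_dir`). The sketch below reproduces a subset of the parser so you can check this mapping without running the container; `build_parser` is a hypothetical name and the paths are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a subset of the flags used by trainer/train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', required=True)
    parser.add_argument('--job-dir', required=True)
    parser.add_argument('--max-samples', type=int, default=100)
    parser.add_argument('--random-state-seed', type=int, default=42)
    return parser


# Passing an explicit list avoids reading sys.argv, which is handy in a notebook
args = build_parser().parse_args([
    '--input', 'gs://example-bucket/train.csv',   # placeholder path
    '--job-dir', 'gs://example-bucket/output',    # placeholder path
    '--max-samples', '256',
])

print(args.job_dir)
print(args.max_samples)
print(args.random_state_seed)
```

Flags that are omitted fall back to their defaults, which is why the `docker run` command later in this step only needs `--input` and `--job-dir`.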


### Step 4.4: Build and test the container locally
Define a variable with the URI of your container image in Google Container Registry:




In [None]:
IMAGE_URI=f"gcr.io/{PROJECT}/sklearn_isolation_forest_training:v1"
print(f"Container URI: {IMAGE_URI}")

Then, build the container by running the following from the root of your CONTAINER_DIR directory:

In [None]:
!cd $CONTAINER_DIR && docker build ./ -t $IMAGE_URI

Run the container within your notebook instance to ensure it's working correctly:

In [None]:
!cd $CONTAINER_DIR && docker run $IMAGE_URI --input=$TRAIN_FEATURE_PATH --job-dir=$JOB_DIR

The model should finish training in 1-2 minutes. When you've finished running the container locally, push it to Google Container Registry:

In [None]:
!cd $CONTAINER_DIR && docker push $IMAGE_URI

With our container pushed to Container Registry, we're now ready to kick off a custom model training job.

## 5. Create YAML to run a job on GKE
The following cells use a custom cell magic to write out a Kubernetes Job manifest, filling in the variables defined above for the location of the training data and where to store model artifacts.

In [50]:
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))
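
The magic above passes the cell body through `str.format`, so `{NAME}` placeholders are replaced with notebook variables while Kubernetes-style `$(NAME)` references are left untouched. Literal braces must be doubled (`{{` and `}}`), and a placeholder with no matching variable raises `KeyError`. The following sketch illustrates this with a made-up template fragment and a hypothetical `render` helper.

```python
# Made-up fragment for illustration; not the full manifest
TEMPLATE = """\
env:
  - name: JOB_DIR
    value: "{JOB_DIR}"
args: ["--job-dir=$(JOB_DIR)"]
labels: {{}}
"""


def render(template: str, variables: dict) -> str:
    """Fill {NAME} placeholders from `variables`, as the cell magic does with globals()."""
    return template.format(**variables)


rendered = render(TEMPLATE, {"JOB_DIR": "gs://example-bucket/output"})
print(rendered)

# A placeholder without a matching variable fails loudly
try:
    render("{MISSING}", {})
except KeyError as err:
    print("missing variable:", err)
```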

In [51]:
%%writetemplate k8s_job_training.yaml
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
apiVersion: batch/v1
kind: Job
metadata:
  name: trainer-job
spec:
  activeDeadlineSeconds: 1800
  template:
    metadata:
      name: train-anomaly
    spec:
      serviceAccountName: sa-trainer
      containers:
      - name: anomaly
        # Filled in from the IMAGE_URI variable defined in Step 4.4
        image: {IMAGE_URI}
        env:
          - name: TRAIN_FEATURE_PATH
            # Update below to change the input location
            value: "{TRAIN_FEATURE_PATH}"
          - name: JOB_DIR
            # Update below to change the output location
            value: "{JOB_DIR}"
        args: ["--input=$(TRAIN_FEATURE_PATH)", "--job-dir=$(JOB_DIR)"]
      restartPolicy: Never
  backoffLimit: 2
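
Once the manifest is written, it can be submitted with `kubectl`. The sketch below only assembles the commands and runs them if `kubectl` is on the path. It assumes your `kubectl` context already points at a GKE cluster in which the `sa-trainer` service account exists and can access the bucket; `job_commands` and `run_job` are hypothetical helper names.

```python
import shutil
import subprocess
from typing import List


def job_commands(manifest: str = "k8s_job_training.yaml",
                 job_name: str = "trainer-job",
                 timeout_s: int = 1800) -> List[List[str]]:
    """Return the kubectl commands to submit the job, wait for it, and read its logs."""
    return [
        ["kubectl", "apply", "-f", manifest],
        ["kubectl", "wait", "--for=condition=complete",
         f"job/{job_name}", f"--timeout={timeout_s}s"],
        ["kubectl", "logs", f"job/{job_name}"],
    ]


def run_job(commands: List[List[str]]) -> bool:
    """Run the commands in order; only print them if kubectl is not installed."""
    if shutil.which("kubectl") is None:
        for cmd in commands:
            print("would run:", " ".join(cmd))
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


for cmd in job_commands():
    print(" ".join(cmd))
```

Calling `run_job(job_commands())` executes the three steps. Note that `kubectl wait --for=condition=complete` keeps waiting until the timeout if the job fails, so check `kubectl describe job/trainer-job` when it does not return promptly.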

### Next, open the **2. sklearn-cb-ctr-setup-prediction** notebook