# Using custom containers with Vertex AI Training

**Learning Objectives:**
1. Learn how to create a train and a validation split with BigQuery
1. Learn how to wrap a machine learning model into a Docker container and train it on Vertex AI
1. Learn how to use the hyperparameter tuning engine on Vertex AI to find the best hyperparameters
1. Learn how to deploy a trained machine learning model on Vertex AI as a REST API and query it

In this lab, you develop a training application, package it as a Docker image, and run it on **Vertex AI Training**. The application trains a multi-class classification model that **predicts the type of forest cover from cartographic data**. The [dataset](../../../datasets/covertype/README.md) used in the lab is based on the **Covertype Data Set** from the UCI Machine Learning Repository.

The training code uses `scikit-learn` for data pre-processing and modeling. The code has been instrumented using the `hypertune` package so it can be used with **Vertex AI** hyperparameter tuning.


In [1]:
import os
import time

import pandas as pd
from google.cloud import aiplatform, bigquery
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## Configure environment settings

Set location paths, connection strings, and other environment settings. Make sure to update `REGION` and `ARTIFACT_STORE` with the settings reflecting your lab environment.

- `REGION` - the compute region for Vertex AI Training and Prediction
- `ARTIFACT_STORE` - a GCS bucket created in the same region

In [2]:
REGION = "us-central1"

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]

ARTIFACT_STORE = f"gs://{PROJECT_ID}-kfp-artifact-store"

DATA_ROOT = f"{ARTIFACT_STORE}/data"
JOB_DIR_ROOT = f"{ARTIFACT_STORE}/jobs"
TRAINING_FILE_PATH = f"{DATA_ROOT}/training/dataset.csv"
VALIDATION_FILE_PATH = f"{DATA_ROOT}/validation/dataset.csv"
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"

In [3]:
os.environ["JOB_DIR_ROOT"] = JOB_DIR_ROOT
os.environ["TRAINING_FILE_PATH"] = TRAINING_FILE_PATH
os.environ["VALIDATION_FILE_PATH"] = VALIDATION_FILE_PATH
os.environ["PROJECT_ID"] = PROJECT_ID
os.environ["REGION"] = REGION

We now create the `ARTIFACT_STORE` bucket if it does not exist yet. This bucket should be created in the region specified in the variable `REGION`. If you already have a bucket with this name in a different region, change the `ARTIFACT_STORE` name so that the command in the cell below can create a new bucket in `REGION`.

In [4]:
!gsutil ls | grep ^{ARTIFACT_STORE}/$ || gsutil mb -l {REGION} {ARTIFACT_STORE}

gs://qwiklabs-gcp-04-853e5675f5e8-kfp-artifact-store/


## Importing the dataset into BigQuery

In [5]:
%%bash

DATASET_LOCATION=US
DATASET_ID=covertype_dataset
TABLE_ID=covertype
DATA_SOURCE=gs://workshop-datasets/covertype/small/dataset.csv
SCHEMA=Elevation:INTEGER,\
Aspect:INTEGER,\
Slope:INTEGER,\
Horizontal_Distance_To_Hydrology:INTEGER,\
Vertical_Distance_To_Hydrology:INTEGER,\
Horizontal_Distance_To_Roadways:INTEGER,\
Hillshade_9am:INTEGER,\
Hillshade_Noon:INTEGER,\
Hillshade_3pm:INTEGER,\
Horizontal_Distance_To_Fire_Points:INTEGER,\
Wilderness_Area:STRING,\
Soil_Type:STRING,\
Cover_Type:INTEGER

bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID

bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \
--source_format=CSV \
--skip_leading_rows=1 \
--replace \
$TABLE_ID \
$DATA_SOURCE \
$SCHEMA

BigQuery error in mk operation: Dataset 'qwiklabs-
gcp-04-853e5675f5e8:covertype_dataset' already exists.


Waiting on bqjob_r279d66d9c5000076_0000017f73311340_1 ... (2s) Current status: DONE   


## Explore the Covertype dataset 

In [6]:
%%bigquery
SELECT *
FROM `covertype_dataset.covertype`

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1127.05query/s]                        
Downloading: 100%|██████████| 100000/100000 [00:02<00:00, 43373.34rows/s]


Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type
0,2085,256,18,150,27,738,176,248,208,914,Cache,C2702,5
1,2125,256,20,30,12,871,169,248,215,300,Cache,C2702,2
2,2146,256,34,150,62,1253,122,237,239,511,Cache,C2702,2
3,2186,256,38,210,102,1294,109,232,244,552,Cache,C2702,2
4,2831,256,25,277,183,1706,153,246,225,1485,Commanche,C2705,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,3136,254,12,319,60,5734,193,248,193,2467,Rawah,C7746,1
99996,3242,254,12,636,148,3551,193,248,193,2010,Commanche,C7757,0
99997,2071,255,12,234,63,342,192,247,193,247,Cache,C2706,2
99998,3248,255,12,730,113,725,192,247,193,2724,Commanche,C7756,1


## Create training and validation splits

Use BigQuery to sample training and validation splits and save them to Cloud Storage (GCS).
### Create a training split

In [7]:
!bq query \
-n 0 \
--destination_table covertype_dataset.training \
--replace \
--use_legacy_sql=false \
'SELECT * \
FROM `covertype_dataset.covertype` AS cover \
WHERE \
MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(cover))), 10) IN (1, 2, 3, 4)' 

Waiting on bqjob_r7d554294c3e4a30c_0000017f73333420_1 ... (1s) Current status: DONE   
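
The query above selects rows by hashing each row's JSON serialization and keeping those whose hash falls into buckets 1 to 4 out of 10, which should yield roughly 40% of the data and gives the same split on every run. The sketch below illustrates the same idea in plain Python. It is only an illustration: `bucket` and `assign_split` are hypothetical helper names, and SHA-256 is used as a stand-in because `FARM_FINGERPRINT` is a different hash function, so the bucket assignments will not match BigQuery's.

```python
import hashlib
import json


def bucket(row: dict, num_buckets: int = 10) -> int:
    """Map a row to a stable bucket in [0, num_buckets)."""
    # Serialize the whole row, similar in spirit to TO_JSON_STRING in the query.
    serialized = json.dumps(row, sort_keys=True)
    # SHA-256 is a stand-in; BigQuery's FARM_FINGERPRINT is a different hash.
    digest = hashlib.sha256(serialized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def assign_split(row: dict) -> str:
    """Return the split a row falls into, mirroring the bucket sets used in the lab."""
    b = bucket(row)
    if b in (1, 2, 3, 4):
        return "training"
    if b == 8:
        return "validation"
    return "unused"


# Toy row with a subset of the Covertype columns.
row = {"Elevation": 2085, "Aspect": 256, "Wilderness_Area": "Cache", "Cover_Type": 5}
print(assign_split(row))
```

Because the assignment depends only on the row contents, the training and validation bucket sets never overlap for distinct rows, and rerunning the split does not move rows between sets.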


In [8]:
!bq extract \
--destination_format CSV \
covertype_dataset.training \
$TRAINING_FILE_PATH

Waiting on bqjob_r4574ab68f2451d92_0000017f73335f9b_1 ... (0s) Current status: DONE   


### Create a validation split

## Exercise

In [9]:
# TODO: Your code to create the BQ table validation split
!bq query \
-n 0 \
--destination_table covertype_dataset.validation \
--replace \
--use_legacy_sql=false \
'SELECT * \
FROM `covertype_dataset.covertype` AS cover \
WHERE \
MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(cover))), 10) IN (8)' 

Waiting on bqjob_r419bcc6e8f99b193_0000017f7334427f_1 ... (1s) Current status: DONE   


In [10]:
# TODO: Your code to export the validation table to GCS
!bq extract \
--destination_format CSV \
covertype_dataset.validation \
$VALIDATION_FILE_PATH

Waiting on bqjob_r4441384921644dce_0000017f73345183_1 ... (0s) Current status: DONE   


In [11]:
df_train = pd.read_csv(TRAINING_FILE_PATH)
df_validation = pd.read_csv(VALIDATION_FILE_PATH)
print(df_train.shape)
print(df_validation.shape)

(40009, 13)
(9836, 13)
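
As a quick sanity check, the split sizes can be compared with what the bucket sets imply: 4 of 10 buckets for training and 1 of 10 for validation, assuming the hash spreads rows evenly. The row counts below are copied from the outputs shown in this notebook; the variable names are illustrative.

```python
TOTAL_ROWS = 100_000      # rows downloaded by the exploration query
TRAIN_ROWS = 40_009       # df_train.shape[0]
VALIDATION_ROWS = 9_836   # df_validation.shape[0]

train_fraction = TRAIN_ROWS / TOTAL_ROWS
validation_fraction = VALIDATION_ROWS / TOTAL_ROWS

print(f"training:   {train_fraction:.3f} (expected about 0.400)")
print(f"validation: {validation_fraction:.3f} (expected about 0.100)")
```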


## Develop a training application

### Configure the `sklearn` training pipeline.

The training pipeline preprocesses data by standardizing all numeric features using `sklearn.preprocessing.StandardScaler` and encoding all categorical features using `sklearn.preprocessing.OneHotEncoder`. It uses stochastic gradient descent linear classifier (`SGDClassifier`) for modeling.
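
To make the two preprocessing steps concrete, here is a standard-library sketch of what standardization and one-hot encoding compute on toy values. It is a simplified illustration, not scikit-learn's implementation: `standardize` and `one_hot` are hypothetical helpers, and the real `OneHotEncoder` returns a sparse matrix by default.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # fall back to 1.0 for constant columns
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector over the sorted distinct categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Toy columns in the style of the Covertype features.
elevation = [2085.0, 2125.0, 2146.0, 2186.0]
wilderness = ["Cache", "Commanche", "Cache", "Rawah"]

print(standardize(elevation))
print(one_hot(wilderness))
```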

In [12]:
numeric_feature_indexes = slice(0, 10)
categorical_feature_indexes = slice(10, 12)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_feature_indexes),
        ("cat", OneHotEncoder(), categorical_feature_indexes),
    ]
)

pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", SGDClassifier(loss="log", tol=1e-3)),
    ]
)

### Convert all numeric features to `float64`

To avoid warning messages from `StandardScaler`, all numeric features are converted to `float64`.

In [13]:
num_features_type_map = {
    feature: "float64" for feature in df_train.columns[numeric_feature_indexes]
}

df_train = df_train.astype(num_features_type_map)
df_validation = df_validation.astype(num_features_type_map)

### Run the pipeline locally.

In [14]:
X_train = df_train.drop("Cover_Type", axis=1)
y_train = df_train["Cover_Type"] 
X_validation = df_validation.drop("Cover_Type", axis=1)
y_validation = df_validation["Cover_Type"]

pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200)
pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  slice(0, 10, None)),
                                                 ('cat', OneHotEncoder(),
                                                  slice(10, 12, None))])),
                ('classifier',
                 SGDClassifier(alpha=0.001, loss='log', max_iter=200))])

### Calculate the trained model's accuracy.
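
For a classifier, `Pipeline.score` reports mean accuracy: the fraction of validation rows whose predicted label equals the true label. A minimal sketch of that computation, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: four of the five predictions are correct.
y_true = [5, 2, 2, 2, 1]
y_pred = [5, 2, 1, 2, 1]
print(accuracy(y_true, y_pred))
```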

In [15]:
accuracy = pipeline.score(X_validation, y_validation)
print(accuracy)

0.6968279788531924


### Prepare the hyperparameter tuning application.
Since the training run on this dataset is computationally expensive, you can benefit from running a distributed hyperparameter tuning job on Vertex AI Training.

In [16]:
TRAINING_APP_FOLDER = "training_app"
os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)

### Write the tuning script. 

Notice the use of the `hypertune` package to report the `accuracy` optimization metric to the Vertex AI hyperparameter tuning service.

In [17]:
%%writefile {TRAINING_APP_FOLDER}/train.py
import os
import subprocess
import sys

import fire
import hypertune
import numpy as np
import pandas as pd
import pickle
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


def train_evaluate(job_dir, training_dataset_path, validation_dataset_path, alpha, max_iter, hptune):
    
    df_train = pd.read_csv(training_dataset_path)
    df_validation = pd.read_csv(validation_dataset_path)

    if not hptune:
        df_train = pd.concat([df_train, df_validation])

    numeric_feature_indexes = slice(0, 10)
    categorical_feature_indexes = slice(10, 12)

    preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_feature_indexes),
        ('cat', OneHotEncoder(), categorical_feature_indexes) 
    ])

    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', SGDClassifier(loss='log',tol=1e-3))
    ])

    num_features_type_map = {feature: 'float64' for feature in df_train.columns[numeric_feature_indexes]}
    df_train = df_train.astype(num_features_type_map)
    df_validation = df_validation.astype(num_features_type_map) 

    print('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))
    X_train = df_train.drop('Cover_Type', axis=1)
    y_train = df_train['Cover_Type']

    pipeline.set_params(classifier__alpha=alpha, classifier__max_iter=max_iter)
    pipeline.fit(X_train, y_train)

    if hptune:
        X_validation = df_validation.drop('Cover_Type', axis=1)
        y_validation = df_validation['Cover_Type']
        accuracy = pipeline.score(X_validation, y_validation)
        print('Model accuracy: {}'.format(accuracy))
        # Log it with hypertune
        hpt = hypertune.HyperTune()
        hpt.report_hyperparameter_tuning_metric(
          hyperparameter_metric_tag='accuracy',
          metric_value=accuracy
        )

    # Save the model
    if not hptune:
        model_filename = 'model.pkl'
        with open(model_filename, 'wb') as model_file:
            pickle.dump(pipeline, model_file)
        gcs_model_path = "{}/{}".format(job_dir, model_filename)
        subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)
        print("Saved model in: {}".format(gcs_model_path)) 
    
if __name__ == "__main__":
    fire.Fire(train_evaluate)

Writing training_app/train.py
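
During tuning, Vertex AI passes each trial's hyperparameter values to the container as command-line flags named after the parameter IDs, which is why `train_evaluate` accepts `alpha` and `max_iter` as arguments. The sketch below shows how such a command line could be assembled. `trial_command` is a hypothetical helper, the bucket paths are placeholders, and the exact ordering of arguments the service uses is an assumption.

```python
def trial_command(fixed_args, hyperparameters):
    """Build the argument list a trial container might receive."""
    # Each tuned parameter is appended as --<parameter id>=<value>.
    tuned = [f"--{name}={value}" for name, value in hyperparameters.items()]
    return ["python", "train.py", *fixed_args, *tuned]


# Placeholder paths; the lab derives the real ones from ARTIFACT_STORE.
fixed_args = [
    "--job_dir=gs://example-bucket/jobs/example_job",
    "--training_dataset_path=gs://example-bucket/data/training/dataset.csv",
    "--validation_dataset_path=gs://example-bucket/data/validation/dataset.csv",
    "--hptune",
]

print(trial_command(fixed_args, {"alpha": 0.001, "max_iter": 10}))
```

With `fire`, a bare `--hptune` flag sets the boolean argument to `True`, and `--nohptune` sets it to `False`.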


### Package the script into a docker image.

Notice that we are installing specific versions of `scikit-learn` and `pandas` in the training image. This is done to make sure that the training runtime in the training container is aligned with the serving runtime in the serving container. 

Make sure to update the URI for the base image so that it points to your project's **Container Registry**.

### Exercise

Complete the Dockerfile below so that it copies the 'train.py' file into the container
at `/app` and runs it when the container is started. 

In [18]:
%%writefile {TRAINING_APP_FOLDER}/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire cloudml-hypertune scikit-learn==0.20.4 pandas==0.24.2

# TODO
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

Writing training_app/Dockerfile


### Build the docker image. 

You use **Cloud Build** to build the image and push it to your project's **Container Registry**. Because the image is built by a remote cloud service, you don't need a local installation of Docker.

In [19]:
IMAGE_NAME = "trainer_image"
IMAGE_TAG = "latest"
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}:{IMAGE_TAG}"

os.environ["IMAGE_URI"] = IMAGE_URI

In [22]:
!gcloud builds submit --async --tag $IMAGE_URI $TRAINING_APP_FOLDER

Creating temporary tarball archive of 2 file(s) totalling 2.6 KiB before compression.
Uploading tarball of [training_app] to [gs://qwiklabs-gcp-04-853e5675f5e8_cloudbuild/source/1646905496.70723-a012a772a5584bb896fac8fd0e2bad1e.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/qwiklabs-gcp-04-853e5675f5e8/locations/global/builds/9e9e5120-7f9f-43f1-8adf-7283b92794fb].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/9e9e5120-7f9f-43f1-8adf-7283b92794fb?project=1076138843678].
ID                                    CREATE_TIME                DURATION  SOURCE                                                                                                     IMAGES  STATUS
9e9e5120-7f9f-43f1-8adf-7283b92794fb  2022-03-10T09:45:09+00:00  -         gs://qwiklabs-gcp-04-853e5675f5e8_cloudbuild/source/1646905496.70723-a012a772a5584bb896fac8fd0e2bad1e.tgz  -       QUEUED


## Submit a Vertex AI hyperparameter tuning job

### Create the hyperparameter configuration file. 
Recall that the training code uses `SGDClassifier`. The training application has been designed to accept two hyperparameters that control `SGDClassifier`:
- Max iterations
- Alpha

The file below configures Vertex AI hypertuning to run up to 5 trials in parallel and to choose from two discrete values of `max_iter` and the linear range between `1.0e-4` and `1.0e-1` for `alpha`.
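
To see what a linear scale means for a range that spans three orders of magnitude, the sketch below maps positions in the unit interval onto the `alpha` range both linearly and logarithmically. The two mapping functions are illustrative only: they are not the service's search algorithm, and they simply show how the feasible range is stretched under each scale type (`UNIT_LOG_SCALE` is the logarithmic alternative to `UNIT_LINEAR_SCALE`).

```python
ALPHA_MIN = 1.0e-4
ALPHA_MAX = 1.0e-1


def from_unit_linear(u, lo=ALPHA_MIN, hi=ALPHA_MAX):
    """Map u in [0, 1] onto [lo, hi] with equal spacing."""
    return lo + u * (hi - lo)


def from_unit_log(u, lo=ALPHA_MIN, hi=ALPHA_MAX):
    """Map u in [0, 1] onto [lo, hi] with equal spacing in log space."""
    return lo * (hi / lo) ** u


for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"u={u:.2f}  linear={from_unit_linear(u):.5f}  log={from_unit_log(u):.5f}")
```

On a linear scale, the midpoint of the range is about 0.05, so most of the unit interval corresponds to large `alpha` values; on a log scale, the midpoint is about 0.003.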

In [23]:
TIMESTAMP = time.strftime("%Y%m%d_%H%M%S")
JOB_NAME = f"forestcover_tuning_{TIMESTAMP}"
JOB_DIR = f"{JOB_DIR_ROOT}/{JOB_NAME}"

os.environ["JOB_NAME"] = JOB_NAME
os.environ["JOB_DIR"] = JOB_DIR

### Exercise

Complete the `config.yaml` file generated below so that the hyperparameter
tuning engine tries the following parameter values:
* `max_iter`: the two values 10 and 20
* `alpha`: a linear range of values between 1.0e-4 and 1.0e-1

Also complete the `gcloud` command to start the hyperparameter tuning job with a max trial count and
a max parallel trial count of 5 each.

In [24]:
%%bash

MACHINE_TYPE="n1-standard-4"
REPLICA_COUNT=1
CONFIG_YAML=config.yaml

cat <<EOF > $CONFIG_YAML
studySpec:
  metrics:
  - metricId: accuracy
    goal: MAXIMIZE
  parameters:
  - parameterId: max_iter
    discreteValueSpec:
      values:
      - 10
      - 20
  # TODO
  - parameterId: alpha
    doubleValueSpec:
      minValue: 1.0e-4
      maxValue: 1.0e-1
    scaleType: UNIT_LINEAR_SCALE
  algorithm: ALGORITHM_UNSPECIFIED # results in Bayesian optimization
trialJobSpec:
  workerPoolSpecs:  
  - machineSpec:
      machineType: $MACHINE_TYPE
    replicaCount: $REPLICA_COUNT
    containerSpec:
      imageUri: $IMAGE_URI
      args:
      - --job_dir=$JOB_DIR
      - --training_dataset_path=$TRAINING_FILE_PATH
      - --validation_dataset_path=$VALIDATION_FILE_PATH
      - --hptune
EOF

# TODO
gcloud ai hp-tuning-jobs create \
    --region=$REGION \
    --display-name=$JOB_NAME \
    --config=$CONFIG_YAML \
    --max-trial-count=5 \
    --parallel-trial-count=5

echo "JOB_NAME: $JOB_NAME"

JOB_NAME: forestcover_tuning_20220310_094541


Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Hyperparameter tuning job [9218515493994889216] submitted successfully.

Your job is still active. You may view the status of your job with the command

  $ gcloud ai hp-tuning-jobs describe 9218515493994889216 --region=us-central1

Job State: JOB_STATE_PENDING


Go to the Vertex AI Training dashboard and view the progression of the HP tuning job under "Hyperparameter Tuning Jobs".

### Retrieve HP-tuning results.

After the job completes, you can review the results in the GCP Console or programmatically with the following functions. Note that this code assumes the metric optimized by the hyperparameter tuning engine is being maximized:

## Exercise

Complete the body of the functions below to retrieve the best trial for a given job name:

In [25]:
# TODO
def get_trials(job_name):
    jobs = aiplatform.HyperparameterTuningJob.list()
    # Match on the function argument rather than the global JOB_NAME.
    match = [job for job in jobs if job.display_name == job_name]
    tuning_job = match[0] if match else None
    return tuning_job.trials if tuning_job else None


def get_best_trial(trials):
    metrics = [trial.final_measurement.metrics[0].value for trial in trials]
    best_trial = trials[metrics.index(max(metrics))]
    return best_trial


def retrieve_best_trial_from_job_name(jobname):
    trials = get_trials(jobname)
    best_trial = get_best_trial(trials)
    return best_trial

You'll need to wait for the hyperparameter tuning job to complete before you can retrieve the best trial by running the cell below.

In [26]:
best_trial = retrieve_best_trial_from_job_name(JOB_NAME)

IndexError: list index (0) out of range
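
An `IndexError` like the one above most likely means that at least one trial has not reported a final measurement yet, so its `metrics` list is empty. The sketch below shows one defensive way to pick the best trial while skipping unfinished ones. It runs against stand-in objects built with `SimpleNamespace` that only mimic the attribute layout the lab code reads; `best_trial_by_metric` and `fake_trial` are hypothetical names, and the metric values are made up.

```python
from types import SimpleNamespace


def best_trial_by_metric(trials):
    """Return the trial with the highest first metric, ignoring unfinished trials."""
    scored = [
        (trial.final_measurement.metrics[0].value, trial)
        for trial in trials
        if trial.final_measurement.metrics  # empty until the trial reports
    ]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


def fake_trial(trial_id, metric_values):
    """Stand-in trial with made-up metric values."""
    metrics = [SimpleNamespace(value=v) for v in metric_values]
    return SimpleNamespace(
        id=trial_id, final_measurement=SimpleNamespace(metrics=metrics)
    )


trials = [fake_trial("1", [0.68]), fake_trial("2", []), fake_trial("3", [0.71])]
print(best_trial_by_metric(trials).id)
```

Returning `None` when no trial has finished lets the caller decide whether to wait and retry instead of failing with an exception.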

## Retrain the model with the best hyperparameters

You can now retrain the model using the best hyperparameters and using combined training and validation splits as a training dataset.

### Configure and run the training job

In [None]:
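# Look parameters up by ID rather than relying on their order in the trial.
best_params = {p.parameter_id: p.value for p in best_trial.parameters}
alpha = best_params["alpha"]
max_iter = int(best_params["max_iter"])  # discrete values come back as floats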

In [None]:
TIMESTAMP = time.strftime("%Y%m%d_%H%M%S")
JOB_NAME = f"JOB_VERTEX_{TIMESTAMP}"
JOB_DIR = f"{JOB_DIR_ROOT}/{JOB_NAME}"

MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1

WORKER_POOL_SPEC = f"""\
machine-type={MACHINE_TYPE},\
replica-count={REPLICA_COUNT},\
container-image-uri={IMAGE_URI}\
"""

ARGS = f"""\
--job_dir={JOB_DIR},\
--training_dataset_path={TRAINING_FILE_PATH},\
--validation_dataset_path={VALIDATION_FILE_PATH},\
--alpha={alpha},\
--max_iter={max_iter},\
--nohptune\
"""

!gcloud ai custom-jobs create \
  --region={REGION} \
  --display-name={JOB_NAME} \
  --worker-pool-spec={WORKER_POOL_SPEC} \
  --args={ARGS}


print("The model will be exported at:", JOB_DIR)

### Examine the training output

The training script saved the trained model as `model.pkl` in the `JOB_DIR` folder on GCS.

**Note:** Wait for the job triggered by the cell above to complete before running the cells below.

In [None]:
!gsutil ls $JOB_DIR

## Deploy the model to Vertex AI Prediction

In [None]:
MODEL_NAME = "forest_cover_classifier_2"
SERVING_CONTAINER_IMAGE_URI = (
    "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-20:latest"
)
SERVING_MACHINE_TYPE = "n1-standard-2"

### Uploading the trained model

## Exercise

Upload the trained model using `aiplatform.Model.upload`:

In [27]:
JOB_DIR

'gs://qwiklabs-gcp-04-853e5675f5e8-kfp-artifact-store/jobs/forestcover_tuning_20220310_094541'

In [None]:
# TODO
uploaded_model = aiplatform.Model.upload(
    display_name=MODEL_NAME,
    artifact_uri=JOB_DIR,
    serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
)

### Deploying the uploaded model

## Exercise

Deploy the model using `uploaded_model`:

In [None]:
# TODO
endpoint = uploaded_model.deploy(
    machine_type=SERVING_MACHINE_TYPE,
    accelerator_type=None,
    accelerator_count=None,
)

### Serve predictions
#### Prepare the input file with JSON formatted instances.
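
Each instance is a list with the ten numeric features followed by the two categorical ones, in the same column order as the training data. When calling the REST API directly, instances are wrapped in a JSON object under an `instances` key. The sketch below builds such a request body as a string; the variable names `example_instance` and `request_body` are illustrative, and the values are the same as the instance used in the exercise.

```python
import json

# Ten numeric features, then Wilderness_Area and Soil_Type.
example_instance = [
    2841.0, 45.0, 0.0, 644.0, 282.0,
    1376.0, 218.0, 237.0, 156.0, 1003.0,
    "Commanche", "C4758",
]

# The REST predict method expects a list of instances under "instances".
request_body = json.dumps({"instances": [example_instance]})
print(request_body)
```

The resulting string can be written to a file and sent with a command-line client, whereas `endpoint.predict` takes the Python list of instances directly.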

## Exercise

Query the deployed model using `endpoint`:

In [None]:
instance = [
    2841.0,
    45.0,
    0.0,
    644.0,
    282.0,
    1376.0,
    218.0,
    237.0,
    156.0,
    1003.0,
    "Commanche",
    "C4758",
]

# TODO
endpoint.predict([instance])

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.