# Using custom containers with AI Platform Training

In this lab, you develop a training application, package it as a Docker image, and run it on **AI Platform Training**. The application trains a multi-class classification model that predicts the type of forest cover from cartographic data. The [dataset](../datasets/covertype/README.md) used in the lab is based on the **Covertype Data Set** from the UCI Machine Learning Repository.

The training code uses **scikit-learn** for data pre-processing and modeling. The code has been instrumented using the `hypertune` package so it can be used with **AI Platform** hyperparameter tuning.


In [1]:
import os
import numpy as np
import pandas as pd
import uuid
import time
import tempfile

from googleapiclient import discovery
from googleapiclient import errors

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

## Configure environment settings

Set location paths, connection strings, and other environment settings. Make sure to update `PROJECT_ID` and `REGION` with your own project ID and region.

A GCS bucket, created in a later cell, is used as a staging area during the lab.

In [2]:
PROJECT_ID = 'mlops-workshop'
REGION = 'us-central1'

DATA_ROOT = 'gs://workshop-datasets/covertype'
TRAINING_FILE_PATH = DATA_ROOT + '/training/data.csv'
VALIDATION_FILE_PATH = DATA_ROOT + '/validation/data.csv'
TESTING_FILE_PATH = DATA_ROOT + '/testing/data.csv'
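
If you would rather not edit the cell for each project, the same settings could be read from environment variables. The sketch below is only an illustration: the `load_settings` helper and the use of `PROJECT_ID`, `REGION`, and `DATA_ROOT` as environment variable names are assumptions, not part of the lab.

```python
import os


def load_settings(env=None):
    """Return lab settings, letting environment variables override the defaults."""
    env = os.environ if env is None else env
    # Defaults mirror the values used in this lab; replace them with your own.
    project_id = env.get('PROJECT_ID', 'mlops-workshop')
    region = env.get('REGION', 'us-central1')
    data_root = env.get('DATA_ROOT', 'gs://workshop-datasets/covertype').rstrip('/')
    return {
        'PROJECT_ID': project_id,
        'REGION': region,
        'TRAINING_FILE_PATH': data_root + '/training/data.csv',
        'VALIDATION_FILE_PATH': data_root + '/validation/data.csv',
        'TESTING_FILE_PATH': data_root + '/testing/data.csv',
    }


# Passing a plain dict stands in for the real environment in this example.
settings = load_settings({'PROJECT_ID': 'my-project'})
print(settings['PROJECT_ID'], settings['TRAINING_FILE_PATH'])
```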

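Create a bucket for the AI Platform job directory. If you already created the bucket in an earlier run, `gsutil mb` reports a 409 error, as in the output below, and you can continue.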

In [4]:
JOB_DIR_BUCKET = 'gs://{}-lab11'.format(PROJECT_ID)

!gsutil mb $JOB_DIR_BUCKET

Creating gs://mlops-workshop-lab11/...
ServiceException: 409 Bucket mlops-workshop-lab11 already exists.
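
To avoid the 409 error on repeated runs, the bucket could be probed before it is created. The following is a minimal sketch, assuming `gsutil` is on the path; the `ensure_bucket` helper and the `fake_run` stand-in are hypothetical names introduced here so the example can run without `gsutil`.

```python
import subprocess


def ensure_bucket(bucket_uri, run=subprocess.run):
    """Create the bucket only if listing it fails; return True if it was created."""
    # `gsutil ls -b` lists the bucket itself and exits non-zero when the
    # bucket cannot be listed (for example, because it does not exist).
    probe = run(['gsutil', 'ls', '-b', bucket_uri], capture_output=True)
    if probe.returncode == 0:
        return False
    run(['gsutil', 'mb', bucket_uri], check=True)
    return True


# Dry run with a stand-in runner that records commands instead of executing them.
calls = []


def fake_run(cmd, **kwargs):
    calls.append(cmd)
    # Pretend the bucket is missing on `ls` and that `mb` succeeds.
    return subprocess.CompletedProcess(cmd, 1 if cmd[1] == 'ls' else 0)


created = ensure_bucket('gs://my-project-lab11', run=fake_run)
print(created, calls)
```

In the notebook you would call `ensure_bucket(JOB_DIR_BUCKET)` with the default runner. Note that a failed probe can also mean the bucket exists but is not accessible to you, in which case `gsutil mb` still fails.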


## Explore the Covertype dataset 

In [5]:
df_train = pd.read_csv(TRAINING_FILE_PATH)
df_validation = pd.read_csv(VALIDATION_FILE_PATH)
df_train

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area,Soil_Type,Cover_Type
0,3135,135,0,192,5,306,219,238,156,2790,Neota,7201,1
1,3211,90,0,30,1,5286,219,237,155,780,Rawah,7201,1
2,3046,0,0,228,1,666,218,238,156,1298,Rawah,7201,1
3,3211,180,0,437,30,5878,219,238,157,2230,Rawah,7201,2
4,3283,225,0,511,25,6031,218,238,157,631,Rawah,7201,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
173573,3147,96,59,216,-6,3037,220,0,0,1209,Commanche,7756,2
173574,2722,172,60,124,55,2823,169,175,59,6480,Rawah,7746,2
173575,2736,174,61,120,69,2853,163,171,59,6450,Rawah,7746,2
173576,2500,360,61,255,81,569,58,53,74,1473,Commanche,4703,2


In [6]:
print(df_train.shape)
print(df_validation.shape)

(173578, 13)
(58797, 13)


## Develop the training application

### Configure the `sklearn` training pipeline.

In [7]:
numeric_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
       'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
       'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
       'Horizontal_Distance_To_Fire_Points']
categorical_features = ['Wilderness_Area', 'Soil_Type']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features) 
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(loss='log'))
])
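
To make the two preprocessing steps concrete, here is a pure-Python sketch of what they compute on a single column: `StandardScaler` rescales values to zero mean and unit variance, and `OneHotEncoder` produces one indicator column per category. The `standardize` and `one_hot` helpers are illustrative only; the scikit-learn classes also handle details such as sparse output that this sketch ignores.

```python
import statistics


def standardize(values):
    """Z-score a numeric column using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # A constant column has zero spread; map it to zeros instead of dividing by zero.
    return [(v - mean) / std if std else 0.0 for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator rows, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Toy values for illustration only, not statistics of the Covertype data.
scaled = standardize([1.0, 2.0, 3.0])
categories, rows = one_hot(['Rawah', 'Neota', 'Rawah'])
print(scaled)
print(categories, rows)
```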

### Convert all numeric features in training and validation datasets to Pandas `float64`

In [8]:
num_features_type_map = {feature: 'float64' for feature in numeric_features}

df_train = df_train.astype(num_features_type_map)
df_validation = df_validation.astype(num_features_type_map)

### Train the pipeline locally.

In [9]:
X_train = df_train.drop('Cover_Type', axis=1)
y_train = df_train['Cover_Type']
X_validation = df_validation.drop('Cover_Type', axis=1)
y_validation = df_validation['Cover_Type']

pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', StandardScaler(copy=True, with_mean=True, with_std=True), ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Di...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))])

### Calculate the trained model's accuracy.

In [10]:
accuracy = pipeline.score(X_validation, y_validation)
print(accuracy)

0.7036413422453527
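
For a classifier pipeline, `score` returns the mean accuracy: the fraction of validation rows whose predicted label matches the true label. The sketch below shows that calculation and a majority-class baseline that is useful for judging whether an accuracy figure is meaningful. The labels are hypothetical and the helper names are introduced here for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def majority_baseline(y_train, y_eval):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_eval, [most_common_label] * len(y_eval))


# Hypothetical labels, for illustration only.
y_true = [1, 2, 2, 1, 2]
y_pred = [1, 2, 1, 1, 2]
model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = majority_baseline(y_true, y_true)
print(model_accuracy, baseline_accuracy)
```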


### Prepare the hyperparameter tuning application.
Because training on this dataset is computationally expensive, you can benefit from running a distributed hyperparameter tuning job on AI Platform Training.

In [11]:
TRAINING_APP_FOLDER = 'training_app'
os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)

### Write the tuning script. 

Notice the use of the `hypertune` package to report the `accuracy` optimization metric to the AI Platform hyperparameter tuning service.

In [21]:
%%writefile {TRAINING_APP_FOLDER}/train.py

import os
import subprocess
import sys

import fire
import pickle
import numpy as np
import pandas as pd

import hypertune

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


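def train_evaluate(job_dir, training_dataset_path, validation_dataset_path, alpha, max_iter, hptune):
  """Trains the pipeline.

  With hptune set, reports validation accuracy to the tuning service.
  Otherwise, trains on training plus validation data and saves the model to job_dir.
  """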
    
  df_train = pd.read_csv(training_dataset_path)
  df_validation = pd.read_csv(validation_dataset_path)
    
  if not hptune:
    df_train = pd.concat([df_train, df_validation])

  numeric_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
    'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points']
    
  categorical_features = ['Wilderness_Area', 'Soil_Type']

  preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features) 
    ])

  pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(loss='log'))
  ])
    
  num_features_type_map = {feature: 'float64' for feature in numeric_features}
  df_train = df_train.astype(num_features_type_map)
  df_validation = df_validation.astype(num_features_type_map) 

  print('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))
  X_train = df_train.drop('Cover_Type', axis=1)
  y_train = df_train['Cover_Type']
  
  pipeline.set_params(classifier__alpha=alpha, classifier__max_iter=max_iter)
  pipeline.fit(X_train, y_train)
  
  if hptune:
    X_validation = df_validation.drop('Cover_Type', axis=1)
    y_validation = df_validation['Cover_Type']
    accuracy = pipeline.score(X_validation, y_validation)
    print('Model accuracy: {}'.format(accuracy))
    # Log it with hypertune
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=accuracy
    )

  # Save the model
  if not hptune:
    model_filename = 'model.pkl'
    with open(model_filename, 'wb') as model_file:
        pickle.dump(pipeline, model_file)
    gcs_model_path = "{}/{}".format(job_dir, model_filename)
    subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)
    print("Saved model in: {}".format(gcs_model_path)) 
    
if __name__ == "__main__":
  fire.Fire(train_evaluate)

Overwriting training_app/train.py
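
`fire.Fire(train_evaluate)` exposes the function's parameters as command-line flags, which is how values such as `--alpha` and `--max_iter` reach the training code. For readers unfamiliar with `fire`, the sketch below shows a roughly equivalent interface written with the standard library's `argparse`. It is an illustration, not the lab's implementation, and the example argument values are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the parameters of train_evaluate."""
    parser = argparse.ArgumentParser(description='Sketch of the train.py command line')
    # Accept both spellings of the job directory flag.
    parser.add_argument('--job-dir', '--job_dir', dest='job_dir', default=None)
    parser.add_argument('--training_dataset_path', required=True)
    parser.add_argument('--validation_dataset_path', required=True)
    parser.add_argument('--alpha', type=float, required=True)
    # Accept values such as "500" or "500.0" and convert them to an integer.
    parser.add_argument('--max_iter', type=lambda s: int(float(s)), required=True)
    # A bare --hptune switches the script into tuning mode.
    parser.add_argument('--hptune', action='store_true')
    return parser


# Hypothetical arguments, shaped like the ones a tuning trial would receive.
args = build_parser().parse_args([
    '--job-dir=gs://my-bucket/jobs/JOB_1',
    '--training_dataset_path=gs://my-bucket/training/data.csv',
    '--validation_dataset_path=gs://my-bucket/validation/data.csv',
    '--alpha=0.001',
    '--max_iter=500.0',
    '--hptune',
])
print(args.alpha, args.max_iter, args.hptune, args.job_dir)
```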


### Package the script into a docker image.

Notice the use of `mlops-dev:TF115-TFX015-KFP136` as the base image for the training image. Your AI Platform Notebook instance is based on the same image, so using it as the base keeps the development environment and the AI Platform Training environment consistent.

Make sure to update the URI for the base image so that it points to your project's **Container Registry**.

In [22]:
%%writefile {TRAINING_APP_FOLDER}/Dockerfile

FROM gcr.io/mlops-workshop/mlops-dev:TF115-TFX015-KFP136
RUN pip install -U fire cloudml-hypertune
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

Overwriting training_app/Dockerfile


### Build the docker image. 

You use **Cloud Build** to build the image and push it to your project's **Container Registry**. Because the image is built by a remote cloud service, you don't need a local installation of Docker.

In [23]:
IMAGE_NAME='trainer_image'
IMAGE_TAG='latest'
IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, IMAGE_TAG)

!gcloud builds submit --tag $IMAGE_URI $TRAINING_APP_FOLDER

Creating temporary tarball archive of 3 file(s) totalling 3.1 KiB before compression.
Uploading tarball of [training_app] to [gs://mlops-workshop_cloudbuild/source/1579673215.41-9603be010533419f8a11aae8991689d8.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/mlops-workshop/builds/3dc000a4-4a05-4cc3-8108-6aac58d59e14].
Logs are available at [https://console.cloud.google.com/gcr/builds/3dc000a4-4a05-4cc3-8108-6aac58d59e14?project=745302968357].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "3dc000a4-4a05-4cc3-8108-6aac58d59e14"

FETCHSOURCE
Fetching storage object: gs://mlops-workshop_cloudbuild/source/1579673215.41-9603be010533419f8a11aae8991689d8.tgz#1579673215838515
Copying gs://mlops-workshop_cloudbuild/source/1579673215.41-9603be010533419f8a11aae8991689d8.tgz#1579673215838515...
/ [1 files][  1.5 KiB/  1.5 KiB]                                                
Operation completed over 1 objects/1.5 KiB.                     

## Submit the AI Platform hyperparameter tuning job

### Create the hyperparameter configuration file. 
Recall that the training code uses **sklearn SGDClassifier**. The training application has been designed to accept two hyperparameters that control **SGDClassifier**:
- Max iterations
- Alpha

The file below configures AI Platform hyperparameter tuning to run up to 6 trials, with up to 3 trials in parallel. Each trial chooses `max_iter` from two discrete values and `alpha` from the linear range between 0.00001 and 0.001.

In [24]:
%%writefile {TRAINING_APP_FOLDER}/hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 6
    maxParallelTrials: 3
    hyperparameterMetricTag: accuracy
    enableTrialEarlyStopping: TRUE 
    params:
    - parameterName: max_iter
      type: DISCRETE
      discreteValues: [
          200,
          500
          ]
    - parameterName: alpha
      type: DOUBLE
      minValue:  0.00001
      maxValue:  0.001
      scaleType: UNIT_LINEAR_SCALE

Overwriting training_app/hptuning_config.yaml
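
To get a feel for the search space defined above, the sketch below draws six candidate parameter sets from it. This is plain random sampling used for illustration: the tuning service applies its own search algorithm to choose trial values, and the `SEARCH_SPACE` dictionary and `sample_trial` helper are hypothetical names, not part of any AI Platform API.

```python
import random

# Mirrors the two parameters in hptuning_config.yaml.
SEARCH_SPACE = {
    'max_iter': {'type': 'DISCRETE', 'values': [200, 500]},
    'alpha': {'type': 'DOUBLE', 'min': 0.00001, 'max': 0.001},
}


def sample_trial(space, rng):
    """Draw one parameter set from the search space."""
    params = {}
    for name, spec in space.items():
        if spec['type'] == 'DISCRETE':
            params[name] = rng.choice(spec['values'])
        else:
            # Uniform draw over the range, analogous to a linear scale.
            params[name] = rng.uniform(spec['min'], spec['max'])
    return params


rng = random.Random(0)  # fixed seed so repeated runs draw the same values
trials = [sample_trial(SEARCH_SPACE, rng) for _ in range(6)]
for trial in trials:
    print(trial)
```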


### Start the hyperparameter tuning job.

Use the `gcloud` command to start the hyperparameter tuning job.

In [25]:
JOB_NAME = "JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = "{}/{}/{}".format(JOB_DIR_BUCKET, "jobs", JOB_NAME)
SCALE_TIER = "BASIC"

!gcloud ai-platform jobs submit training $JOB_NAME \
--region=$REGION \
--job-dir=$JOB_DIR \
--master-image-uri=$IMAGE_URI \
--scale-tier=$SCALE_TIER \
--config $TRAINING_APP_FOLDER/hptuning_config.yaml \
-- \
--training_dataset_path=$TRAINING_FILE_PATH \
--validation_dataset_path=$VALIDATION_FILE_PATH \
--hptune

Job [JOB_20200122_061611] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe JOB_20200122_061611

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs JOB_20200122_061611
jobId: JOB_20200122_061611
state: QUEUED
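
The job name above uses underscores in the timestamp. The sketch below assumes that AI Platform job names must start with a letter and contain only letters, numbers, and underscores, and validates a generated name against that rule before building the job directory. The `make_job_name` and `make_job_dir` helpers and the bucket name are hypothetical.

```python
import re
import time

# Assumed naming rule: a leading letter, then letters, digits, or underscores.
JOB_NAME_PATTERN = re.compile(r'^[A-Za-z][A-Za-z0-9_]*$')


def make_job_name(prefix='JOB', now=None):
    """Build a timestamped job name and check it against the assumed rule."""
    stamp = time.strftime('%Y%m%d_%H%M%S', now or time.localtime())
    name = '{}_{}'.format(prefix, stamp)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError('Invalid job name: {}'.format(name))
    return name


def make_job_dir(bucket, job_name):
    """Place each job's outputs under <bucket>/jobs/<job_name>."""
    return '{}/jobs/{}'.format(bucket.rstrip('/'), job_name)


# A fixed time keeps the example reproducible.
fixed_time = time.strptime('2020-01-22 06:16:11', '%Y-%m-%d %H:%M:%S')
job_name = make_job_name(now=fixed_time)
job_dir = make_job_dir('gs://my-project-lab11', job_name)
print(job_name)
print(job_dir)
```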


### Monitor the job.

You can monitor the job using the GCP Console or from within the notebook using `gcloud` commands.

In [26]:
!gcloud ai-platform jobs describe $JOB_NAME

createTime: '2020-01-22T06:16:13Z'
etag: 06Gy49c8xoY=
jobId: JOB_20200122_061611
state: PREPARING
trainingInput:
  args:
  - --training_dataset_path=gs://workshop-datasets/covertype/training/data.csv
  - --validation_dataset_path=gs://workshop-datasets/covertype/validation/data.csv
  - --hptune
  hyperparameters:
    enableTrialEarlyStopping: true
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxParallelTrials: 3
    maxTrials: 6
    params:
    - discreteValues:
      - 200.0
      - 500.0
      parameterName: max_iter
      type: DISCRETE
    - maxValue: 0.001
      minValue: 1e-05
      parameterName: alpha
      scaleType: UNIT_LINEAR_SCALE
      type: DOUBLE
  jobDir: gs://mlops-workshop-lab11/jobs/JOB_20200122_061611
  masterConfig:
    imageUri: gcr.io/mlops-workshop/trainer_image:latest
  region: us-central1
trainingOutput:
  isHyperparameterTuningJob: true

View job in the Cloud Console at:
https://console.cloud.google.com/mlengine/jobs/JOB_20200122_061611?proje
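
Instead of re-running the `describe` command by hand, you could poll the job state until it reaches a terminal value. The following is a minimal sketch: `wait_for_job` is a hypothetical helper, and `get_state` is any callable that returns the current state string, for example a wrapper around the job lookup through the REST API. The example replays a made-up sequence of states so it runs without a real job.

```python
import time

# States after which the job will not change any more.
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}


def wait_for_job(get_state, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_state() until the job reaches a terminal state, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError('Job did not finish within the polling budget')


# Stand-in for a real status call, replaying a plausible sequence of states.
states = iter(['QUEUED', 'PREPARING', 'RUNNING', 'SUCCEEDED'])
final_state = wait_for_job(lambda: next(states), sleep=lambda seconds: None)
print(final_state)
```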

In [None]:
!gcloud ai-platform jobs stream-logs $JOB_NAME

INFO	2020-01-22 06:16:12 +0000	service		Validating job requirements...
INFO	2020-01-22 06:16:13 +0000	service		Job creation request has been successfully validated.
INFO	2020-01-22 06:16:13 +0000	service		Job JOB_20200122_061611 is queued.
INFO	2020-01-22 06:16:22 +0000	service	1	Waiting for job to be provisioned.
INFO	2020-01-22 06:16:22 +0000	service	3	Waiting for job to be provisioned.
INFO	2020-01-22 06:16:22 +0000	service	2	Waiting for job to be provisioned.
INFO	2020-01-22 06:16:24 +0000	service	1	Waiting for training program to start.
INFO	2020-01-22 06:16:24 +0000	service	2	Waiting for training program to start.
INFO	2020-01-22 06:16:24 +0000	service	3	Waiting for training program to start.
INFO	2020-01-22 06:23:12 +0000	master-replica-0	3	Starting training: alpha=0.0009933559858798981, max_iter=500
INFO	2020-01-22 06:23:12 +0000	master-replica-0	3	Model accuracy: 0.703879449631784
INFO	2020-01-22 06:23:23 +0000	master-replica-0	2	Starting training: alpha=0.0005203486084938049,

### Retrieve HP-tuning results.

After the job completes, you can review the results using the GCP Console or programmatically by calling the AI Platform Training REST endpoint.

In [None]:
ml = discovery.build('ml', 'v1')

job_id = 'projects/{}/jobs/{}'.format(PROJECT_ID, JOB_NAME)
request = ml.projects().jobs().get(name=job_id)

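# Initialize the result so the final expression is defined even if the call fails.
response = None
try:
    response = request.execute()
except errors.HttpError as err:
    print(err)
except Exception as err:
    print("Unexpected error: {}".format(err))

response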

The returned trials are sorted by the optimization metric. The best trial is the first item in the returned list.

In [None]:
response['trainingOutput']['trials'][0]
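
The best trial's hyperparameters can then be turned into arguments for a final training run. The sketch below selects the best trial explicitly by its objective value rather than relying on list order. It assumes each trial entry carries `trialId`, `hyperparameters` (string values), and `finalMetric.objectiveValue`; the sample response is hypothetical, and `best_trial` and `training_flags` are illustrative helpers.

```python
def best_trial(response, goal='MAXIMIZE'):
    """Return the trial with the best objective value among completed trials."""
    all_trials = response['trainingOutput'].get('trials', [])
    completed = [t for t in all_trials if 'finalMetric' in t]
    if not completed:
        raise ValueError('No completed trials with a final metric')
    pick = max if goal == 'MAXIMIZE' else min
    return pick(completed, key=lambda t: t['finalMetric']['objectiveValue'])


def training_flags(trial):
    """Convert a trial's hyperparameters into flags for a non-tuning run."""
    hp = trial['hyperparameters']
    return [
        '--alpha={}'.format(float(hp['alpha'])),
        '--max_iter={}'.format(int(float(hp['max_iter']))),
        '--hptune=False',
    ]


# Hypothetical response with made-up metric values, for illustration only.
sample_response = {
    'trainingOutput': {
        'trials': [
            {'trialId': '1', 'hyperparameters': {'alpha': '0.0009', 'max_iter': '500'},
             'finalMetric': {'objectiveValue': 0.61}},
            {'trialId': '2', 'hyperparameters': {'alpha': '0.0005', 'max_iter': '200'},
             'finalMetric': {'objectiveValue': 0.72}},
            {'trialId': '3', 'hyperparameters': {'alpha': '0.0001', 'max_iter': '200'}},
        ]
    }
}

best = best_trial(sample_response)
flags = training_flags(best)
print(best['trialId'], flags)
```

The resulting flags could be passed after the `--` separator of a `gcloud ai-platform jobs submit training` command that omits the tuning configuration file.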

In [None]:
!gsutil ls gs://mlops-workshop-lab11/datasets/training/data.csv