# Orchestrating the training and deployment of a scikit-learn model with Kubeflow Pipelines and Cloud AI Platform

In this lab, you develop a Kubeflow Pipelines (KFP) pipeline that orchestrates BigQuery and Cloud AI Platform services to train and deploy a **scikit-learn** model. The lab uses the [Covertype Data Set](../datasets/covertype/README.md). The model is a multi-class classifier that predicts the type of forest cover from cartographic data.

The source data is in BigQuery. The pipeline uses BigQuery to prepare the training and evaluation splits, AI Platform Training to run a custom container with the data preprocessing and training code, and AI Platform Prediction as the deployment target. The diagram below shows the workflow orchestrated by the pipeline.

![Training pipeline](../images/kfp-caip.png)



In [39]:
import kfp
import os
import uuid
import time

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple


from sklearn.metrics import accuracy_score
from sklearn.externals import joblib
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

## Configure environment settings
Make sure to update the constants to reflect your environment settings.

In [95]:
PROJECT_ID = 'mlops-workshop'
DATASET_LOCATION = 'US'
CLUSTER_NAME = 'mlops-workshop-cluster'
CLUSTER_ZONE = 'us-central1-a'
REGION = 'us-central1'
DATASET_ID = 'lab_12'
SOURCE_TABLE_ID = 'covertype'
SPLITS_TABLE_ID = 'splits'
LAB_GCS_BUCKET='gs://mlops-workshop-lab-12'
COMPONENT_URL_SEARCH_PREFIX = 'https://raw.githubusercontent.com/kubeflow/pipelines/0.1.36/components/gcp/'

## Experimentation

### Explore the dataset 
Use the BigQuery Python client library to query the data.

In [19]:
client = bigquery.Client(project=PROJECT_ID, location=DATASET_LOCATION)

Read 100 rows from the source table and display the first 10.

In [None]:
query_template = """
SELECT *
FROM `{{ source_table }}`
LIMIT 100
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SOURCE_TABLE_ID))
df = client.query(query).to_dataframe()
df.head(10)

### Preparing the training, validation, and testing splits
#### Prepare the data splitting query

In [None]:
query_template = """
SELECT *, 
CASE(MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(cover))), 10))
  WHEN 9 THEN 'test'
  WHEN 8 THEN 'validation'
  ELSE 'training' END AS Split_Col
from `{{ source_table }}` as cover
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SOURCE_TABLE_ID))
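
The query above assigns each row to a split by hashing the whole row and taking the result modulo 10: bucket 9 goes to `test`, bucket 8 to `validation`, and the remaining eight buckets to `training`. The following is a minimal sketch of the same idea in plain Python. It assumes SHA-256 as a stand-in for `FARM_FINGERPRINT`, and the `assign_split` helper and the sample rows are hypothetical, so the bucket of any individual row will differ from what BigQuery computes.

```python
import hashlib
from collections import Counter


def assign_split(row_json: str) -> str:
    """Map a serialized row to 'training', 'validation', or 'test'."""
    # Stand-in for FARM_FINGERPRINT: any stable hash illustrates the idea.
    digest = hashlib.sha256(row_json.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10
    if bucket == 9:
        return "test"
    if bucket == 8:
        return "validation"
    return "training"


# Hypothetical rows; real rows would be the JSON produced by TO_JSON_STRING.
rows = ['{{"id": {}}}'.format(i) for i in range(10000)]
counts = Counter(assign_split(row) for row in rows)
print(counts)
```

Because the split depends only on the row content, re-running the query keeps every row in the same split, and the proportions come out at roughly 80/10/10.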

#### Submit the data splitting job

In [None]:
dataset_ref = client.dataset(DATASET_ID)
splits_table_ref = dataset_ref.table(SPLITS_TABLE_ID)

job_config = bigquery.QueryJobConfig()
job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
job_config.destination = splits_table_ref

# Pass the job configuration by keyword so the intent is explicit.
query_job = client.query(query, job_config=job_config)
query_job.result()  # Wait for the query to finish

#### Explore the table with splits

In [None]:
query_template = """
SELECT Cover_Type, Split_Col 
FROM `{{ source_table }}`
LIMIT 100
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SPLITS_TABLE_ID))

df = client.query(query).to_dataframe()
df.head(10)

### Developing the training script
#### Load training and validation data

In [24]:
query_template = """
SELECT *  
FROM `{{ source_table }}`
WHERE Split_Col in ('training', 'validation')
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SPLITS_TABLE_ID))

df = client.query(query).to_dataframe()
df_train = df[df.Split_Col == 'training'].drop('Split_Col', axis=1)
df_validation = df[df.Split_Col == 'validation'].drop('Split_Col', axis=1)

#### Configure training pipeline

In [21]:
numeric_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
       'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
       'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
       'Horizontal_Distance_To_Fire_Points']
categorical_features = ['Wilderness_Area', 'Soil_Type']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features) 
    ])

alpha = 0.0001
max_iter = 1000

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(alpha=alpha, max_iter=max_iter, loss='log'))
])

[0.68509501 0.7153876  0.71443346 0.69855428 0.71345155 0.72286145
 0.7177117  0.7046909  0.70216342 0.70334217]
*********
0.707769154705924
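
To make the preprocessing step configured above more concrete, here is a minimal sketch of what the two transformers compute, written with the standard library only. It is not the scikit-learn implementation: `standardize` and `one_hot` are hypothetical helpers, and the sample values are invented for illustration rather than taken from the Covertype table.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center to zero mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode each value as an indicator vector over the sorted categories."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical sample values, not taken from the Covertype table.
elevation = [2000.0, 2500.0, 3000.0]
wilderness = ["Rawah", "Cache", "Rawah"]

scaled = standardize(elevation)
categories, encoded = one_hot(wilderness)
print(scaled)
print(categories, encoded)
```

In the lab's pipeline, `ColumnTransformer` applies the scaling to the columns in `numeric_features` and the encoding to the columns in `categorical_features`, then concatenates the results before they reach the classifier.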


#### Train and evaluate

In [None]:
X_train = df_train.drop('Cover_Type', axis=1)
y_train = df_train['Cover_Type']
X_validation = df_validation.drop('Cover_Type', axis=1)
y_validation = df_validation['Cover_Type']

pipeline.fit(X_train, y_train)
pipeline.score(X_validation, y_validation)

### Hyperparameter tuning
Because a training run on this dataset is computationally expensive, you can benefit from running a distributed hyperparameter tuning job on AI Platform Training.

#### Prepare a hyperparameter tuning script

In [34]:
TRAINING_APP_FOLDER = 'training_app'
os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)

In [114]:
%%writefile {TRAINING_APP_FOLDER}/train.py

import logging
import os
import subprocess
import sys

import fire
import numpy as np
import pandas as pd

import hypertune

from google.cloud import bigquery
from jinja2 import Template
from sklearn.compose import ColumnTransformer
from sklearn.externals import joblib
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


def train_evaluate(job_dir, project_id, dataset_id, table_id, alpha, max_iter, dataset_location='US'):
    
  query_template = """
    SELECT *  
    FROM `{{ source_table }}`
    WHERE Split_Col in ('training', 'validation')
    """

  source_table='{}.{}.{}'.format(project_id, dataset_id, table_id)
  query = Template(query_template).render(
    source_table=source_table)

  client = bigquery.Client(project=project_id, location=dataset_location)

  logging.info('Reading data from BigQuery table: {}'.format(source_table))
  df = client.query(query).to_dataframe()
  df_train = df[df.Split_Col == 'training'].drop('Split_Col', axis=1)
  df_validation = df[df.Split_Col == 'validation'].drop('Split_Col', axis=1)

  numeric_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
    'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
    'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points']
    
  categorical_features = ['Wilderness_Area', 'Soil_Type']

  preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features) 
    ])

  pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(alpha=alpha, max_iter=max_iter, loss='log'))
  ])

  logging.info('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))
  X_train = df_train.drop('Cover_Type', axis=1)
  y_train = df_train['Cover_Type']
  X_validation = df_validation.drop('Cover_Type', axis=1)
  y_validation = df_validation['Cover_Type']
  pipeline.fit(X_train, y_train)
  accuracy = pipeline.score(X_validation, y_validation)
  logging.info('Finished training. Model accuracy: {}'.format(accuracy))
    
  # Log it with hypertune
  hpt = hypertune.HyperTune()
  hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',
    metric_value=accuracy
    )

  # Save the model
  model_filename = 'model.joblib'
  joblib.dump(value=pipeline, filename=model_filename)
  gcs_model_path = "{}/{}".format(job_dir, model_filename)
  subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)
  logging.info("Saved model in: {}".format(gcs_model_path)) 
    
if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO)
  fire.Fire(train_evaluate)

Overwriting training_app/train.py
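
`train.py` relies on `fire` to turn command-line flags into the keyword arguments of `train_evaluate`. The sketch below shows the same flag-to-argument mapping using `argparse` from the standard library as a stand-in; it is not how `fire` is implemented. The `parse_trainer_args` helper and the `JOB_EXAMPLE` directory are hypothetical. It also assumes that the tuning service supplies `--alpha` and `--max_iter` for each trial, which is what the function signature expects, and it accepts both spellings of the job directory flag as a precaution.

```python
import argparse


def parse_trainer_args(argv):
    """Parse the flags that the training script receives into typed values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job-dir", "--job_dir", dest="job_dir", required=True)
    parser.add_argument("--project_id", required=True)
    parser.add_argument("--dataset_id", required=True)
    parser.add_argument("--table_id", required=True)
    # Hyperparameters: assumed to arrive as one flag each per trial.
    parser.add_argument("--alpha", type=float, required=True)
    parser.add_argument("--max_iter", type=int, required=True)
    parser.add_argument("--dataset_location", default="US")
    return parser.parse_args(argv)


args = parse_trainer_args([
    "--job-dir=gs://mlops-workshop-lab-12/JOB_EXAMPLE",  # hypothetical job dir
    "--project_id=mlops-workshop",
    "--dataset_id=lab_12",
    "--table_id=splits",
    "--alpha=0.001",
    "--max_iter=500",
])

# Same path construction as the "Save the model" step in train.py.
model_path = "{}/{}".format(args.job_dir, "model.joblib")
print(args.alpha, args.max_iter, model_path)
```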


#### Package the script 

In [115]:
%%writefile {TRAINING_APP_FOLDER}/Dockerfile

FROM gcr.io/deeplearning-platform-release/base-cpu
RUN pip install -U fire cloudml-hypertune
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

Overwriting training_app/Dockerfile


In [116]:
IMAGE_NAME='covertype_trainer'
IMAGE_TAG='latest'
IMAGE_URI='gcr.io/{}/{}:{}'.format(PROJECT_ID, IMAGE_NAME, IMAGE_TAG)

!gcloud builds submit --tag $IMAGE_URI $TRAINING_APP_FOLDER

Creating temporary tarball archive of 2 file(s) totalling 2.9 KiB before compression.
Uploading tarball of [training_app] to [gs://mlops-workshop_cloudbuild/source/1576098389.71-d585e18e0b01433c8e3ca751aba78a9a.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/mlops-workshop/builds/2661c0a3-ae10-41a2-b15a-c0206f2e9cd5].
Logs are available at [https://console.cloud.google.com/gcr/builds/2661c0a3-ae10-41a2-b15a-c0206f2e9cd5?project=745302968357].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "2661c0a3-ae10-41a2-b15a-c0206f2e9cd5"

FETCHSOURCE
Fetching storage object: gs://mlops-workshop_cloudbuild/source/1576098389.71-d585e18e0b01433c8e3ca751aba78a9a.tgz#1576098390023138
Copying gs://mlops-workshop_cloudbuild/source/1576098389.71-d585e18e0b01433c8e3ca751aba78a9a.tgz#1576098390023138...
/ [1 files][  1.4 KiB/  1.4 KiB]                                                
Operation completed over 1 objects/1.4 KiB.                     

#### Create hyperparameter configuration file

In [121]:
%%writefile {TRAINING_APP_FOLDER}/hptuning_config.yaml

trainingInput:
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 10
    maxParallelTrials: 3
    hyperparameterMetricTag: accuracy
    enableTrialEarlyStopping: TRUE 
    params:
    - parameterName: max_iter
      type: DISCRETE
      discreteValues: [
          500,
          1000
          ]
    - parameterName: alpha
      type: DOUBLE
      minValue:  0.00001
      maxValue:  0.01
      scaleType: UNIT_LINEAR_SCALE

Writing training_app/hptuning_config.yaml
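
The configuration above searches `alpha` between 0.00001 and 0.01 with `scaleType: UNIT_LINEAR_SCALE`. The sketch below contrasts how a value in the unit interval would map onto that range under a linear and a logarithmic scaling. Both mapping functions are illustrative assumptions about what the scale types mean; they are not the service's search algorithm.

```python
import math

MIN_ALPHA, MAX_ALPHA = 0.00001, 0.01


def from_unit_linear(u, low, high):
    """Map u in [0, 1] onto [low, high] with even spacing in value."""
    return low + u * (high - low)


def from_unit_log(u, low, high):
    """Map u in [0, 1] onto [low, high] with even spacing in log(value)."""
    log_low, log_high = math.log(low), math.log(high)
    return math.exp(log_low + u * (log_high - log_low))


for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    linear = from_unit_linear(u, MIN_ALPHA, MAX_ALPHA)
    logarithmic = from_unit_log(u, MIN_ALPHA, MAX_ALPHA)
    print("u={:.2f}  linear={:.6f}  log={:.6f}".format(u, linear, logarithmic))
```

With the linear mapping, half of the unit interval lands above 0.005, so the small values of `alpha` receive little coverage. If even coverage across orders of magnitude matters for your experiment, `UNIT_LOG_SCALE` is another accepted `scaleType` value worth considering.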


#### Configure and submit hyperparameter tuning job

In [122]:
JOB_NAME = "JOB_{}".format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = "{}/{}".format(LAB_GCS_BUCKET, JOB_NAME)
SCALE_TIER = "BASIC"
ALPHA = 0.0001
MAX_ITER = 1000
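
The job name and job directory above follow a simple convention: a timestamped name, and a directory of the same name under the lab bucket. The sketch below wraps that convention in two hypothetical helpers and adds a sanity check on the name. The pattern (starts with a letter; letters, digits, and underscores only) is an assumption about the job naming rules, so verify it against the AI Platform documentation before relying on it.

```python
import re
import time

# Assumed naming rule: starts with a letter; letters, digits, underscores only.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_job_name(prefix="JOB", now=None):
    """Build a timestamped job name such as JOB_20191211_212541."""
    moment = now if now is not None else time.localtime()
    name = "{}_{}".format(prefix, time.strftime("%Y%m%d_%H%M%S", moment))
    if not JOB_ID_PATTERN.match(name):
        raise ValueError("Invalid job name: {}".format(name))
    return name


def make_job_dir(bucket, job_name):
    """Place the job directory directly under the bucket."""
    return "{}/{}".format(bucket.rstrip("/"), job_name)


job_name = make_job_name()
print(job_name, make_job_dir("gs://mlops-workshop-lab-12", job_name))
```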

In [123]:
!gcloud ai-platform jobs submit training $JOB_NAME \
--region=$REGION \
--job-dir=$JOB_DIR \
--master-image-uri=$IMAGE_URI \
--scale-tier=$SCALE_TIER \
--config $TRAINING_APP_FOLDER/hptuning_config.yaml \
-- \
--project_id=$PROJECT_ID \
--dataset_id=$DATASET_ID \
--table_id=$SPLITS_TABLE_ID 

Job [JOB_20191211_212541] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe JOB_20191211_212541

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs JOB_20191211_212541
jobId: JOB_20191211_212541
state: QUEUED


In [124]:
!gcloud ai-platform jobs describe $JOB_NAME

createTime: '2019-12-11T21:25:44Z'
etag: 0yn6EqV4vgs=
jobId: JOB_20191211_212541
startTime: '2019-12-11T21:25:48Z'
state: RUNNING
trainingInput:
  args:
  - --project_id=mlops-workshop
  - --dataset_id=lab_12
  - --table_id=splits
  hyperparameters:
    enableTrialEarlyStopping: true
    goal: MAXIMIZE
    hyperparameterMetricTag: accuracy
    maxParallelTrials: 3
    maxTrials: 10
    params:
    - discreteValues:
      - 500.0
      - 1000.0
      parameterName: max_iter
      type: DISCRETE
    - maxValue: 0.01
      minValue: 1e-05
      parameterName: alpha
      scaleType: UNIT_LINEAR_SCALE
      type: DOUBLE
  jobDir: gs://mlops-workshop-lab-12/JOB_20191211_212541
  masterConfig:
    imageUri: gcr.io/mlops-workshop/covertype_trainer:latest
  region: us-central1
trainingOutput:
  isHyperparameterTuningJob: true

View job in the Cloud Console at:
https://console.cloud.google.com/mlengine/jobs/JOB_20191211_212541?project=mlops-workshop

View logs at:
https://console.cloud.google.co
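
The `describe` output above reports the job status in a top-level `state:` field. If you want to wait for the job programmatically rather than re-running the cell, one option is to poll that field. The sketch below is a minimal version under stated assumptions: `extract_state` and `wait_for_job` are hypothetical helpers, the set of terminal states is an assumption, and the `describe` argument is any callable that returns the text of the job description.

```python
import time

# Assumed terminal states for a training job.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def extract_state(describe_output):
    """Return the value of the top-level `state:` field, or None if absent."""
    for line in describe_output.splitlines():
        if line.startswith("state:"):
            return line.split(":", 1)[1].strip()
    return None


def wait_for_job(describe, poll_seconds=60, max_polls=120):
    """Call `describe()` repeatedly until the job reaches a terminal state."""
    for _ in range(max_polls):
        state = extract_state(describe())
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("Job did not reach a terminal state in time")


# Simulated sequence of descriptions; no cloud call is made here.
responses = iter(["state: QUEUED\n", "state: RUNNING\n", "state: SUCCEEDED\n"])
print(wait_for_job(lambda: next(responses), poll_seconds=0))
```

In the notebook, the callable could run `gcloud ai-platform jobs describe` through `subprocess` and return its standard output.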

In [125]:
!gcloud ai-platform jobs stream-logs $JOB_NAME

INFO	2019-12-11 21:25:44 +0000	service		Validating job requirements...
INFO	2019-12-11 21:25:44 +0000	service		Job creation request has been successfully validated.
INFO	2019-12-11 21:25:44 +0000	service		Job JOB_20191211_212541 is queued.
INFO	2019-12-11 21:25:54 +0000	service	2	Waiting for job to be provisioned.
INFO	2019-12-11 21:25:54 +0000	service	3	Waiting for job to be provisioned.
INFO	2019-12-11 21:25:54 +0000	service	1	Waiting for job to be provisioned.
INFO	2019-12-11 21:25:56 +0000	service	1	Waiting for training program to start.
INFO	2019-12-11 21:25:57 +0000	service	2	Waiting for training program to start.
INFO	2019-12-11 21:25:58 +0000	service	3	Waiting for training program to start.
ERROR	2019-12-11 21:29:00 +0000	master-replica-0	1	INFO:root:Reading data from BigQuery table: mlops-workshop.lab_12.splits
ERROR	2019-12-11 21:29:00 +0000	master-replica-0	3	INFO:root:Reading data from BigQuery table: mlops-workshop.lab_12.splits
ERROR	2019-12-11 21:29:13 +0000	master-repli