# Productionising the ML model
This notebook walks through creating, building and committing the artifacts required to run the model developed in the Experimentation Notebook in production.

## Environment Setup
**NOTE:** Set Project ID to your project  

In [1]:
PROJECT_ID = 'demokfp'
PREFIX = PROJECT_ID
REGION = 'us-central1'

DATA_ROOT = 'gs://workshop-datasets/covertype'
TRAINING_FILE_PATH = DATA_ROOT + '/training/dataset.csv'
VALIDATION_FILE_PATH = DATA_ROOT + '/evaluation/dataset.csv'

# Job dir for AI Platform Training
JOB_DIR_ROOT='gs://{}-artifact-store/jobs'.format(PREFIX)


NAMESPACE='kubeflow'
ZONE='us-central1-a'
ARTIFACT_STORE_URI='gs://{}-artifact-store'.format(PREFIX)
GCS_STAGING_PATH='{}/staging'.format(ARTIFACT_STORE_URI)
GKE_CLUSTER_NAME='{}-cluster'.format(PREFIX)

!gcloud container clusters get-credentials $GKE_CLUSTER_NAME --zone $ZONE
HOST_TEMP=!(kubectl describe configmap inverse-proxy-config -n $NAMESPACE | grep "googleusercontent.com")
INVERSE_PROXY_HOSTNAME=HOST_TEMP[0]


Fetching cluster endpoint and auth data.
kubeconfig entry generated for demokfp-cluster.
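
The hostname captured from `kubectl` above is used verbatim in later cells, so it can help to check it before going further. The following is a minimal sketch, assuming the `grep` output contains the hostname on a line of its own; `clean_proxy_hostname` is a hypothetical helper and not part of the original notebook.

```python
def clean_proxy_hostname(lines):
    """Return the first line that looks like an inverse proxy hostname, stripped of whitespace."""
    for line in lines:
        candidate = line.strip()
        if candidate.endswith("googleusercontent.com"):
            return candidate
    raise ValueError("No inverse proxy hostname found in the kubectl output")


# Sample input shaped like HOST_TEMP, using the hostname printed later in this notebook.
sample_output = ["  21bce3d410fd3c82-dot-us-central1.notebooks.googleusercontent.com  "]
print(clean_proxy_hostname(sample_output))
```

In the notebook this would be called as `clean_proxy_hostname(HOST_TEMP)` in place of `HOST_TEMP[0]`.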


## Imports

In [2]:
import json
import os
import numpy as np
import pandas as pd
import pickle
import uuid
import time
import tempfile

from googleapiclient import discovery
from googleapiclient import errors
from datetime import datetime

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


## Import data set to BQ
Import the data set from Cloud Storage into BigQuery. The cell below creates the dataset `covertype_dataset` and loads the data into the table `covertype_dataset.covertype`.

In [None]:
DATASET_LOCATION='US'
DATASET_ID='covertype_dataset'
TABLE_ID='covertype'
DATA_SOURCE='gs://workshop-datasets/covertype/full/dataset.csv'
SCHEMA='Elevation:INTEGER,\
Aspect:INTEGER,\
Slope:INTEGER,\
Horizontal_Distance_To_Hydrology:INTEGER,\
Vertical_Distance_To_Hydrology:INTEGER,\
Horizontal_Distance_To_Roadways:INTEGER,\
Hillshade_9am:INTEGER,\
Hillshade_Noon:INTEGER,\
Hillshade_3pm:INTEGER,\
Horizontal_Distance_To_Fire_Points:INTEGER,\
Wilderness_Area:STRING,\
Soil_Type:STRING,\
Cover_Type:INTEGER'

!bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID
!bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \
--source_format=CSV \
--skip_leading_rows=1 \
--replace \
$TABLE_ID \
$DATA_SOURCE \
$SCHEMA
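
The inline schema string is easy to mistype, so it can be checked in Python before the load is run. The sketch below assumes the `name:TYPE,name:TYPE` format used above; `parse_bq_schema` is a hypothetical helper and does not call BigQuery.

```python
def parse_bq_schema(schema):
    """Split a 'name:TYPE,name:TYPE' string into a list of (name, type) pairs."""
    fields = []
    for item in schema.split(","):
        name, _, field_type = item.strip().partition(":")
        if not name or not field_type:
            raise ValueError("Malformed schema entry: {!r}".format(item))
        fields.append((name, field_type))
    return fields


# Same schema as the SCHEMA variable above, written as one string.
schema = (
    "Elevation:INTEGER,Aspect:INTEGER,Slope:INTEGER,"
    "Horizontal_Distance_To_Hydrology:INTEGER,"
    "Vertical_Distance_To_Hydrology:INTEGER,"
    "Horizontal_Distance_To_Roadways:INTEGER,"
    "Hillshade_9am:INTEGER,Hillshade_Noon:INTEGER,Hillshade_3pm:INTEGER,"
    "Horizontal_Distance_To_Fire_Points:INTEGER,"
    "Wilderness_Area:STRING,Soil_Type:STRING,Cover_Type:INTEGER"
)

fields = parse_bq_schema(schema)
string_columns = [name for name, field_type in fields if field_type == "STRING"]
print(len(fields), string_columns)
```

The ten integer features, two string features and the `Cover_Type` label match the column positions that `train.py` relies on later.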

## Prepare the training application.
Now that the data is hosted in BigQuery, the next step is to create the training application. Start by creating the folders that will hold the training script, the trainer image Dockerfile and the base image Dockerfile.

In [5]:
!pwd
#os.chdir('01_demo_mdeploy/')

/home/mlops-demo/01_demo_mdeploy


In [6]:
TRAINING_APP_FOLDER = 'trainer_image'
BASE_IMAGE_FOLDER='base_image'
os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)
os.makedirs(BASE_IMAGE_FOLDER, exist_ok=True)

### Write the training script. 

The code from the Experimentation Notebook that processes the data and trains the classification model is written to a training script, `train.py`, in the trainer image folder. In addition to the model code from experimentation, the script takes an `hptune` flag. When the flag is set, the script trains on the training split only, scores the model on the validation split and reports the accuracy through the `hypertune` library, so that a training job can try a range of hyperparameter values. In this case we vary *alpha* and the maximum number of iterations the model trains for. When the flag is not set, the script trains on the combined training and validation data and saves the model to the job directory.

In [7]:
%%writefile {TRAINING_APP_FOLDER}/train.py
"""Covertype Classifier trainer script."""

import os
import subprocess
import sys

import fire
import pickle
import numpy as np
import pandas as pd

import hypertune

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


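def train_evaluate(job_dir, training_dataset_path, validation_dataset_path, alpha, max_iter, hptune):
  """Trains the classifier; reports accuracy if hptune is set, otherwise saves the model to job_dir."""
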
  df_train = pd.read_csv(training_dataset_path)
  df_validation = pd.read_csv(validation_dataset_path)
    
  if not hptune:
    df_train = pd.concat([df_train, df_validation])

  numeric_feature_indexes = slice(0, 10)
  categorical_feature_indexes = slice(10, 12)

  preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_feature_indexes),
        ('cat', OneHotEncoder(), categorical_feature_indexes) 
    ])

  pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(loss='log'))
  ])
    
  num_features_type_map = {feature: 'float64' for feature in df_train.columns[numeric_feature_indexes]}
  df_train = df_train.astype(num_features_type_map)
  df_validation = df_validation.astype(num_features_type_map) 

  print('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))
  X_train = df_train.drop('Cover_Type', axis=1)
  y_train = df_train['Cover_Type']
  
  pipeline.set_params(classifier__alpha=alpha, classifier__max_iter=max_iter)
  pipeline.fit(X_train, y_train)
  
  if hptune:
    X_validation = df_validation.drop('Cover_Type', axis=1)
    y_validation = df_validation['Cover_Type']
    accuracy = pipeline.score(X_validation, y_validation)
    print('Model accuracy: {}'.format(accuracy))
    # Log it with hypertune
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=accuracy
    )

  # Save the model
  if not hptune:
    model_filename = 'model.pkl'
    with open(model_filename, 'wb') as model_file:
        pickle.dump(pipeline, model_file)
    gcs_model_path = "{}/{}".format(job_dir, model_filename)
    subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)
    print("Saved model in: {}".format(gcs_model_path)) 
    
if __name__ == "__main__":
  fire.Fire(train_evaluate)

Writing trainer_image/train.py
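
Because `train.py` exposes `train_evaluate` through `fire`, each function argument becomes a `--name=value` command-line flag. The sketch below shows how such a command line could be assembled for one tuning trial. The helper `build_train_command`, the job directory and the `alpha` and `max_iter` values are illustrative assumptions, not settings from the original notebook.

```python
def build_train_command(job_dir, training_path, validation_path, alpha, max_iter, hptune):
    """Return the argument list for one run of train.py, one flag per function argument."""
    return [
        "python", "train.py",
        "--job_dir={}".format(job_dir),
        "--training_dataset_path={}".format(training_path),
        "--validation_dataset_path={}".format(validation_path),
        "--alpha={}".format(alpha),
        "--max_iter={}".format(max_iter),
        "--hptune={}".format(hptune),
    ]


# Illustrative values only: one tuning trial with hypothetical alpha and max_iter.
command = build_train_command(
    job_dir="gs://demokfp-artifact-store/jobs/local-test",
    training_path="gs://workshop-datasets/covertype/training/dataset.csv",
    validation_path="gs://workshop-datasets/covertype/evaluation/dataset.csv",
    alpha=0.001,
    max_iter=500,
    hptune=True,
)
print(" ".join(command))
```

With `hptune=True` the script only reports accuracy; with `hptune=False` it retrains on both splits and copies `model.pkl` to the job directory.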


### Package the script into a docker image.

The Docker images used for training are based on the image `mlops-dev:TF115-TFX015-KFP136` created during the initial setup of the environment. Since the AI Platform Notebook instance is based on this image, we use it as the base for the training image as well.


We first write a base image Dockerfile which replicates the image used for the Notebook. Then we write a training Dockerfile which uses the same base image and adds `train.py` to it.


**NOTE:** Make sure to update the image URI so that it points to your project's **Container Registry**, e.g. `FROM gcr.io/PROJECT_ID/mlops-dev:TF115-TFX015-KFP136`

In [8]:
%%writefile {BASE_IMAGE_FOLDER}/Dockerfile
FROM gcr.io/demokfp/mlops-dev:TF115-TFX015-KFP136

Writing base_image/Dockerfile


In [9]:
%%writefile {TRAINING_APP_FOLDER}/Dockerfile

FROM gcr.io/demokfp/mlops-dev:TF115-TFX015-KFP136
RUN pip install -U fire cloudml-hypertune
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

Writing trainer_image/Dockerfile
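
The Dockerfiles above hard-code the `demokfp` project in the `FROM` line. As a minimal sketch, the same trainer Dockerfile could be rendered from `PROJECT_ID` so the URI does not have to be edited by hand. The template constant and `render_trainer_dockerfile` are hypothetical names introduced for illustration.

```python
TRAINER_DOCKERFILE_TEMPLATE = """\
FROM gcr.io/{project_id}/mlops-dev:TF115-TFX015-KFP136
RUN pip install -U fire cloudml-hypertune
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]
"""


def render_trainer_dockerfile(project_id):
    """Fill the project ID into the trainer Dockerfile text."""
    return TRAINER_DOCKERFILE_TEMPLATE.format(project_id=project_id)


dockerfile_text = render_trainer_dockerfile("demokfp")
print(dockerfile_text)
```

The rendered text could then be written to `trainer_image/Dockerfile` with a normal `open(..., 'w')` call instead of `%%writefile`.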


## Build base and trainer images 
Use **Cloud Build** to build the images and save them to the **Container Registry**.

In [None]:
IMAGE_URI_BASE="gcr.io/{}/{}:latest".format(PROJECT_ID,BASE_IMAGE_FOLDER)

!gcloud builds submit --timeout 15m --tag {IMAGE_URI_BASE} base_image

In [None]:
IMAGE_URI_TRAIN="gcr.io/{}/{}:latest".format(PROJECT_ID,TRAINING_APP_FOLDER)

!gcloud builds submit --timeout 15m --tag {IMAGE_URI_TRAIN} trainer_image

## Create pipeline YAML
Now we build the YAML file for the pipeline from the file `covertype_training_pipeline.py`. This script defines the pipeline, its individual components and their inputs and outputs. The `dsl-compile` command turns it into a static configuration (in YAML format) that Kubeflow Pipelines can execute.

**NOTE:** Change the environment settings in `covertype_training_pipeline.py` to reflect your BASE_IMAGE and TRAINER_IMAGE URIs, e.g. *'gcr.io/PROJECT_ID/base_image:latest'*

In [10]:
!dsl-compile --py covertype_training_pipeline.py --output covertype_training_pipeline.yaml
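
The contents of `covertype_training_pipeline.py` are not shown in this notebook, so the following is only a sketch of one way the image settings could be resolved rather than edited by hand. It assumes the settings are named `BASE_IMAGE` and `TRAINER_IMAGE`, as in the note above, and falls back to the URIs built earlier when no environment variable is set. `resolve_image_uris` is a hypothetical helper.

```python
import os


def resolve_image_uris(project_id, environ=None):
    """Return BASE_IMAGE and TRAINER_IMAGE, preferring environment variables over defaults."""
    environ = os.environ if environ is None else environ
    defaults = {
        "BASE_IMAGE": "gcr.io/{}/base_image:latest".format(project_id),
        "TRAINER_IMAGE": "gcr.io/{}/trainer_image:latest".format(project_id),
    }
    return {name: environ.get(name, default) for name, default in defaults.items()}


# An empty mapping stands in for the environment so the defaults are used.
uris = resolve_image_uris("demokfp", environ={})
print(uris["BASE_IMAGE"])
print(uris["TRAINER_IMAGE"])
```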

## Deploying the pipeline
Select a pipeline name and ensure it is not already in use (otherwise a 500 error will be displayed), then deploy the pipeline.

In [12]:
PIPELINE_NAME='covertype_classifier_training2'

!kfp --endpoint {INVERSE_PROXY_HOSTNAME} pipeline upload -p {PIPELINE_NAME} covertype_training_pipeline.yaml

Pipeline 7d6a97c4-9dc4-4ee5-9e53-819e4737f68b has been submitted

Pipeline Details
------------------
ID           7d6a97c4-9dc4-4ee5-9e53-819e4737f68b
Name         covertype_classifier_training2
Description
Uploaded at  2020-02-14T04:37:22+00:00
+-----------------------------+--------------------------------------------------+
| Parameter Name              | Default Value                                    |
| project_id                  |                                                  |
+-----------------------------+--------------------------------------------------+
| region                      |                                                  |
+-----------------------------+--------------------------------------------------+
| source_table_name           |                                                  |
+-----------------------------+--------------------------------------------------+
| gcs_root                    |                                                  |
+-----

The command below returns a list of the pipelines deployed at the given hostname. We should see that `covertype_classifier_training2` has been deployed. The list also lets us copy the pipeline ID.

In [None]:
!kfp --endpoint {INVERSE_PROXY_HOSTNAME} pipeline list
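
Since uploading under a name that is already taken fails, one option is to derive the pipeline name from a timestamp instead of bumping a numeric suffix by hand. This is a minimal sketch; `unique_pipeline_name` and the suffix format are assumptions, not part of the original notebook.

```python
from datetime import datetime, timezone


def unique_pipeline_name(base_name, now=None):
    """Append a UTC timestamp so repeated uploads do not reuse a pipeline name."""
    now = now or datetime.now(timezone.utc)
    return "{}_{}".format(base_name, now.strftime("%Y%m%d_%H%M%S"))


# A fixed timestamp is used here so the result is reproducible.
fixed_time = datetime(2020, 2, 14, 4, 37, 22, tzinfo=timezone.utc)
example_name = unique_pipeline_name("covertype_classifier_training", now=fixed_time)
print(example_name)
```

In the notebook, `PIPELINE_NAME = unique_pipeline_name('covertype_classifier_training')` would be set before the upload cell.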

#### Viewing the pipeline
The deployed pipeline can be viewed through the Kubeflow Pipeline UI given at the URL below. 

In [13]:
print('https://{}'.format(INVERSE_PROXY_HOSTNAME))

https://21bce3d410fd3c82-dot-us-central1.notebooks.googleusercontent.com


## Run Experiment 
Now that the pipeline is deployed, we want to run an experiment. This causes the pipeline to run: it pulls the data from BigQuery and splits it, trains the models, evaluates them and deploys the best performing model. The experiment takes approximately an hour to execute and results in a deployed model which can be queried through GCP's AI Platform Prediction service.

**NOTE:** Change the PIPELINE_ID to reflect the ID copied from above.  

In [14]:
PIPELINE_ID='7d6a97c4-9dc4-4ee5-9e53-819e4737f68b'

EXPERIMENT_NAME='Covertype_Classifier_Training'
RUN_ID='Run_001'
SOURCE_TABLE='covertype_dataset.covertype'
DATASET_ID='splits'
EVALUATION_METRIC='accuracy'
EVALUATION_METRIC_THRESHOLD='0.69'
MODEL_ID='covertype_classifier'
VERSION_ID='v01'
REPLACE_EXISTING_VERSION=True

In [15]:
!kfp --endpoint {INVERSE_PROXY_HOSTNAME} run submit \
-e {EXPERIMENT_NAME} \
-r {RUN_ID} \
-p {PIPELINE_ID} \
project_id={PROJECT_ID} \
gcs_root={GCS_STAGING_PATH} \
region={REGION} \
source_table_name={SOURCE_TABLE} \
dataset_id={DATASET_ID} \
evaluation_metric_name={EVALUATION_METRIC} \
evaluation_metric_threshold={EVALUATION_METRIC_THRESHOLD} \
model_id={MODEL_ID} \
version_id={VERSION_ID} \
replace_existing_version={REPLACE_EXISTING_VERSION}

Run f761f98c-85c1-4646-91b4-95ecd5ffb52d is submitted
+--------------------------------------+---------+----------+---------------------------+
| run id                               | name    | status   | created at                |
| f761f98c-85c1-4646-91b4-95ecd5ffb52d | Run_001 |          | 2020-02-14T04:40:13+00:00 |
+--------------------------------------+---------+----------+---------------------------+
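
The run submission above passes every pipeline parameter as a `name=value` argument. As a sketch, the same argument list can be assembled from a dictionary, which keeps the parameters in one place. `build_run_submit_args` is a hypothetical helper that uses only the flags already shown in the command above, and the endpoint value is a placeholder.

```python
def build_run_submit_args(endpoint, experiment, run_name, pipeline_id, params):
    """Return the kfp CLI argument list for submitting a run with the given parameters."""
    args = [
        "kfp", "--endpoint", endpoint, "run", "submit",
        "-e", experiment,
        "-r", run_name,
        "-p", pipeline_id,
    ]
    # Dictionaries keep insertion order, so parameters appear as listed.
    args.extend("{}={}".format(name, value) for name, value in params.items())
    return args


params = {
    "project_id": "demokfp",
    "region": "us-central1",
    "source_table_name": "covertype_dataset.covertype",
    "evaluation_metric_name": "accuracy",
    "evaluation_metric_threshold": "0.69",
    "replace_existing_version": True,
}
submit_args = build_run_submit_args(
    endpoint="HOSTNAME_PLACEHOLDER",
    experiment="Covertype_Classifier_Training",
    run_name="Run_001",
    pipeline_id="7d6a97c4-9dc4-4ee5-9e53-819e4737f68b",
    params=params,
)
print(" ".join(submit_args))
```

The resulting list could be passed to `subprocess.check_call` when the submission needs to run outside a notebook cell.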


## Testing model
To test the model we can use the AI Platform Prediction API to request a prediction for a JSON input. Alternatively, we can use the prediction UI and enter *{"instances":[[2395,0,0,60,6,1170,218,238,156,1054,"Cache","C2717"]]}* in the test case window.

We write a prediction JSON file with two data points. The correct cover types are 3 and 2 respectively.

In [16]:
%%writefile predict.json
[2395,0,0,60,6,1170,218,238,156,1054,"Cache","C2717"]
[2756,135,0,85,14,1608,219,238,156,2451,"Rawah","C4744"]


Writing predict.json


In [17]:
INPUT_DATA_FILE="./predict.json"

!gcloud ai-platform predict --model {MODEL_ID} \
  --version {VERSION_ID} \
  --json-instances {INPUT_DATA_FILE}

[3, 2]
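
The two input formats mentioned above differ only in their wrapping: `predict.json` holds one JSON instance per line, while the prediction UI expects a single object with an `instances` list. The sketch below produces both from the same rows. `to_json_lines` and `to_request_body` are hypothetical helper names.

```python
import json

# The two data points from predict.json: ten numeric features, then two categorical ones.
instances = [
    [2395, 0, 0, 60, 6, 1170, 218, 238, 156, 1054, "Cache", "C2717"],
    [2756, 135, 0, 85, 14, 1608, 219, 238, 156, 2451, "Rawah", "C4744"],
]


def to_json_lines(rows):
    """Serialise one instance per line, the layout used in predict.json above."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def to_request_body(rows):
    """Wrap the instances in the {"instances": [...]} object used in the prediction UI."""
    return json.dumps({"instances": rows})


json_lines = to_json_lines(instances)
request_body = to_request_body(instances)
print(json_lines)
print(request_body)
```

Each row keeps the column order the model was trained on: the first ten values are the numeric features and the last two are `Wilderness_Area` and `Soil_Type`.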

### Clean up demo


In [None]:
#!rm -r predict.json covertype_training_pipeline.yaml trainer_image base_image/