# Productionising the ML model
This notebook walks through creating, building and committing the artifacts required to run the model developed in the Experimentation Notebook in production.

## Environment Setup
**NOTE:** Set Project ID to your project  

In [1]:
PROJECT_ID = 'mmlops3'
PREFIX = PROJECT_ID
REGION = 'us-central1'

DATA_ROOT = 'gs://workshop-datasets/covertype'
TRAINING_FILE_PATH = DATA_ROOT + '/training/dataset.csv'
VALIDATION_FILE_PATH = DATA_ROOT + '/evaluation/dataset.csv'

# Job dir for AI Platform Training
JOB_DIR_ROOT='gs://{}-artifact-store/jobs'.format(PREFIX)


NAMESPACE='kubeflow'
ZONE='us-central1-a'
ARTIFACT_STORE_URI='gs://{}-artifact-store'.format(PREFIX)
GCS_STAGING_PATH='{}/staging'.format(ARTIFACT_STORE_URI)
GKE_CLUSTER_NAME='{}-cluster'.format(PREFIX)

!gcloud container clusters get-credentials $GKE_CLUSTER_NAME --zone $ZONE
HOST_TEMP=!(kubectl describe configmap inverse-proxy-config -n $NAMESPACE | grep "googleusercontent.com")
INVERSE_PROXY_HOSTNAME=HOST_TEMP[0]


Fetching cluster endpoint and auth data.
kubeconfig entry generated for mmlops3-cluster.


In [3]:
HOST_TEMP
INVERSE_PROXY_HOSTNAME

'3ea90122a145b3e7-dot-us-central2.pipelines.googleusercontent.com'
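
The cell above takes the first line returned by `grep` as the hostname. If the matched line ever carries surrounding text, a more defensive extraction could look like the sketch below. The sample ConfigMap text is an assumption made for illustration (the real `kubectl describe` output is not shown in this notebook), and `extract_proxy_hostname` is a hypothetical helper.

```python
import re

# Hypothetical sample of `kubectl describe configmap` output; the real layout may differ.
SAMPLE_DESCRIBE_OUTPUT = """\
Name:         inverse-proxy-config
Namespace:    kubeflow
Data
====
Hostname:
----
3ea90122a145b3e7-dot-us-central2.pipelines.googleusercontent.com
"""

# Matches any host that ends in .googleusercontent.com
HOSTNAME_PATTERN = re.compile(r"[A-Za-z0-9.-]+\.googleusercontent\.com")


def extract_proxy_hostname(text):
    """Return the first *.googleusercontent.com hostname in text, or None."""
    match = HOSTNAME_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_proxy_hostname(SAMPLE_DESCRIBE_OUTPUT))
```

Returning `None` when nothing matches makes a missing proxy configuration easy to detect before the hostname is passed to later `kfp` commands.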

## Imports

In [4]:
import json
import os
import numpy as np
import pandas as pd
import pickle
import uuid
import time
import tempfile

from googleapiclient import discovery
from googleapiclient import errors
from datetime import datetime

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer


## Import data set to BQ
Import the data set from Cloud Storage into BigQuery. A dataset is created and the table is loaded as `covertype_dataset.covertype`.

In [5]:
DATASET_LOCATION='US'
DATASET_ID='covertype_dataset'
TABLE_ID='covertype'
DATA_SOURCE='gs://workshop-datasets/covertype/full/dataset.csv'
SCHEMA='Elevation:INTEGER,\
Aspect:INTEGER,\
Slope:INTEGER,\
Horizontal_Distance_To_Hydrology:INTEGER,\
Vertical_Distance_To_Hydrology:INTEGER,\
Horizontal_Distance_To_Roadways:INTEGER,\
Hillshade_9am:INTEGER,\
Hillshade_Noon:INTEGER,\
Hillshade_3pm:INTEGER,\
Horizontal_Distance_To_Fire_Points:INTEGER,\
Wilderness_Area:STRING,\
Soil_Type:STRING,\
Cover_Type:INTEGER'

!bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID
!bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \
--source_format=CSV \
--skip_leading_rows=1 \
--replace \
$TABLE_ID \
$DATA_SOURCE \
$SCHEMA

Dataset 'mmlops3:covertype_dataset' successfully created.
Waiting on bqjob_r405ca85a3b467511_00000178a5e48ffd_1 ... (7s) Current status: DONE   
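
The inline schema passed to `bq load` is a comma-separated list of `name:TYPE` pairs. As a rough sketch, the string can be parsed in Python to check the column layout before loading. `parse_bq_schema` is a hypothetical helper written for this illustration and is not part of the `bq` tooling.

```python
def parse_bq_schema(schema):
    """Split a 'name:TYPE,name:TYPE' schema string into (name, type) pairs."""
    fields = []
    for item in schema.split(","):
        name, _, field_type = item.strip().partition(":")
        if not name or not field_type:
            raise ValueError("Malformed schema entry: {!r}".format(item))
        fields.append((name, field_type))
    return fields


# Same columns as the SCHEMA variable in the cell above.
SCHEMA = (
    "Elevation:INTEGER,Aspect:INTEGER,Slope:INTEGER,"
    "Horizontal_Distance_To_Hydrology:INTEGER,"
    "Vertical_Distance_To_Hydrology:INTEGER,"
    "Horizontal_Distance_To_Roadways:INTEGER,"
    "Hillshade_9am:INTEGER,Hillshade_Noon:INTEGER,Hillshade_3pm:INTEGER,"
    "Horizontal_Distance_To_Fire_Points:INTEGER,"
    "Wilderness_Area:STRING,Soil_Type:STRING,Cover_Type:INTEGER"
)

fields = parse_bq_schema(SCHEMA)
# The last column (Cover_Type) is the label, so it is excluded from the features.
numeric = [name for name, field_type in fields[:-1] if field_type == "INTEGER"]
categorical = [name for name, field_type in fields if field_type == "STRING"]
print(len(fields), len(numeric), len(categorical))
```

The ten numeric and two categorical feature columns correspond to the `slice(0, 10)` and `slice(10, 12)` index ranges used by the training script later in this notebook.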


## Prepare the training application.
Now that the data is hosted in BQ, the next step is to create the training application. Start by creating the folders that will hold the model script, the trainer image Dockerfile and the base image Dockerfile.

In [6]:
!pwd
#os.chdir('01_demo_mdeploy/')

/home/mlops-demo/01_demo_mdeploy


In [7]:
TRAINING_APP_FOLDER = 'trainer_image'
BASE_IMAGE_FOLDER='base_image'
os.makedirs(TRAINING_APP_FOLDER, exist_ok=True)
os.makedirs(BASE_IMAGE_FOLDER, exist_ok=True)

### Write the training script. 

The script from the Experimentation Notebook, which processes the data and trains the classification model, is written to `train.py` in the training image folder. In addition to the model developed during experimentation, the script takes an `hptune` flag and uses the `hypertune` package to report validation accuracy, which allows a training job to be run as a hyperparameter tuning job. Such a job trains multiple models over a range of parameter values; in this case we vary *alpha* and the maximum number of iterations for which the model trains.

In [8]:
%%writefile {TRAINING_APP_FOLDER}/train.py
"""Covertype Classifier trainer script."""

import os
import subprocess
import sys

import fire
import pickle
import numpy as np
import pandas as pd

import hypertune

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


def train_evaluate(job_dir, training_dataset_path, validation_dataset_path, alpha, max_iter, hptune):
    
  df_train = pd.read_csv(training_dataset_path)
  df_validation = pd.read_csv(validation_dataset_path)
    
  if not hptune:
    df_train = pd.concat([df_train, df_validation])

  numeric_feature_indexes = slice(0, 10)
  categorical_feature_indexes = slice(10, 12)

  preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_feature_indexes),
        ('cat', OneHotEncoder(), categorical_feature_indexes) 
    ])

  pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(loss='log'))
  ])
    
  num_features_type_map = {feature: 'float64' for feature in df_train.columns[numeric_feature_indexes]}
  df_train = df_train.astype(num_features_type_map)
  df_validation = df_validation.astype(num_features_type_map) 

  print('Starting training: alpha={}, max_iter={}'.format(alpha, max_iter))
  X_train = df_train.drop('Cover_Type', axis=1)
  y_train = df_train['Cover_Type']
  
  pipeline.set_params(classifier__alpha=alpha, classifier__max_iter=max_iter)
  pipeline.fit(X_train, y_train)
  
  if hptune:
    X_validation = df_validation.drop('Cover_Type', axis=1)
    y_validation = df_validation['Cover_Type']
    accuracy = pipeline.score(X_validation, y_validation)
    print('Model accuracy: {}'.format(accuracy))
    # Log it with hypertune
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
      hyperparameter_metric_tag='accuracy',
      metric_value=accuracy
    )

  # Save the model
  if not hptune:
    model_filename = 'model.pkl'
    with open(model_filename, 'wb') as model_file:
        pickle.dump(pipeline, model_file)
    gcs_model_path = "{}/{}".format(job_dir, model_filename)
    subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path], stderr=sys.stdout)
    print("Saved model in: {}".format(gcs_model_path)) 
    
if __name__ == "__main__":
  fire.Fire(train_evaluate)

Overwriting trainer_image/train.py
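
Because `train.py` exposes `train_evaluate` through `fire`, each function argument becomes a command-line flag. The sketch below builds the argument lists for a small grid of `alpha` and `max_iter` values. The grid values and the `gs://my-bucket` job directory are assumptions chosen for illustration, and a managed tuning service would normally pick trial values itself rather than enumerate a fixed grid.

```python
import itertools

TRAINING_FILE_PATH = "gs://workshop-datasets/covertype/training/dataset.csv"
VALIDATION_FILE_PATH = "gs://workshop-datasets/covertype/evaluation/dataset.csv"

# Hypothetical search values, chosen only for illustration.
ALPHAS = [0.0001, 0.001]
MAX_ITERS = [100, 200]


def trial_args(job_dir, alpha, max_iter):
    """Build the argv for one tuning trial of train.py (hptune=True)."""
    return [
        "python", "train.py",
        "--job_dir={}".format(job_dir),
        "--training_dataset_path={}".format(TRAINING_FILE_PATH),
        "--validation_dataset_path={}".format(VALIDATION_FILE_PATH),
        "--alpha={}".format(alpha),
        "--max_iter={}".format(max_iter),
        "--hptune=True",
    ]


# One argv per (alpha, max_iter) combination; the bucket name is a placeholder.
trials = [
    trial_args("gs://my-bucket/jobs/trial_{}".format(i), alpha, max_iter)
    for i, (alpha, max_iter) in enumerate(itertools.product(ALPHAS, MAX_ITERS))
]

for argv in trials:
    print(" ".join(argv))
```

With `--hptune=True` the script reports accuracy on the validation set and does not save a model; the final model is only written to `job_dir` when the flag is false.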


### Package the script into a docker image.

The Docker images used for training are based on the `mlops-dev` image created during the initial setup of the environment. Since the AI Platform Notebook instance is based on the `mlops-dev` image, we use the same image as the base for the training image.


We first write a base image Dockerfile which replicates the image used for the Notebook. Then we write a training Dockerfile which uses the same base image and adds `train.py` to the image.


**NOTE:** Make sure to update the URI for the image so that it points to your project's **Container Registry**. i.e. `FROM gcr.io/PROJECT_ID/mlops-dev:latest` 

In [12]:
%%writefile {BASE_IMAGE_FOLDER}/Dockerfile
FROM gcr.io/mmlops3/mlops-dev:latest

Overwriting base_image/Dockerfile


In [14]:
%%writefile {TRAINING_APP_FOLDER}/Dockerfile

FROM gcr.io/mmlops3/mlops-dev:latest
RUN pip install -U fire cloudml-hypertune
WORKDIR /app
COPY train.py .

ENTRYPOINT ["python", "train.py"]

Overwriting trainer_image/Dockerfile


## Build base and trainer images 
Use **Cloud Build** to build the images and save them to the **Cloud Container Registry**.

In [15]:
os.environ["BASE_IMAGE"]="gcr.io/{}/{}:latest".format(PROJECT_ID,BASE_IMAGE_FOLDER)
BASE_IMAGE=os.environ["BASE_IMAGE"]

!gcloud builds submit --timeout 15m --tag {BASE_IMAGE} base_image

Creating temporary tarball archive of 1 file(s) totalling 36 bytes before compression.
Uploading tarball of [base_image] to [gs://mmlops3_cloudbuild/source/1617691081.29-506612261fcc497098c8500bad6ef431.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/mmlops3/builds/690ce353-4e64-43ef-bbae-f03a309584df].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/690ce353-4e64-43ef-bbae-f03a309584df?project=286436730533].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "690ce353-4e64-43ef-bbae-f03a309584df"

FETCHSOURCE
Fetching storage object: gs://mmlops3_cloudbuild/source/1617691081.29-506612261fcc497098c8500bad6ef431.tgz#1617691081467474
Copying gs://mmlops3_cloudbuild/source/1617691081.29-506612261fcc497098c8500bad6ef431.tgz#1617691081467474...
/ [1 files][  160.0 B/  160.0 B]                                                
Operation completed over 1 objects/160.0 B.                                      
BUI

In [16]:
os.environ["TRAIN_IMAGE"]="gcr.io/{}/{}:latest".format(PROJECT_ID,TRAINING_APP_FOLDER)
TRAIN_IMAGE=os.environ["TRAIN_IMAGE"]

!gcloud builds submit --timeout 15m --tag {TRAIN_IMAGE} trainer_image

Creating temporary tarball archive of 2 file(s) totalling 2.5 KiB before compression.
Uploading tarball of [trainer_image] to [gs://mmlops3_cloudbuild/source/1617691303.66-99cc9e9d73d5499fa3e2d6d5bfd7afb6.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/mmlops3/builds/d294ddb9-28b1-4f29-9822-f2ca21a2f9fc].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/d294ddb9-28b1-4f29-9822-f2ca21a2f9fc?project=286436730533].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "d294ddb9-28b1-4f29-9822-f2ca21a2f9fc"

FETCHSOURCE
Fetching storage object: gs://mmlops3_cloudbuild/source/1617691303.66-99cc9e9d73d5499fa3e2d6d5bfd7afb6.tgz#1617691303915235
Copying gs://mmlops3_cloudbuild/source/1617691303.66-99cc9e9d73d5499fa3e2d6d5bfd7afb6.tgz#1617691303915235...
/ [1 files][  1.2 KiB/  1.2 KiB]                                                
Operation completed over 1 objects/1.2 KiB.                                      
B

## Create pipeline YAML
Now we build the YAML file for the pipeline from the file `covertype_training_pipeline.py`. This script defines the pipeline, its individual components and their inputs and outputs. The DSL compile command turns it into a static configuration (in YAML format) that Kubeflow Pipelines can execute.


In [36]:
!dsl-compile --py covertype_training_pipeline.py --output covertype_training_pipeline.yaml

  import cryptography.exceptions


## Deploying the pipeline
Select a pipeline name and make sure it is not already in use at the allocated hostname (otherwise a 500 error will be displayed). Then deploy the pipeline.

In [18]:
PIPELINE_NAME='covertype_classifier_training'

!kfp --endpoint {INVERSE_PROXY_HOSTNAME} pipeline upload -p {PIPELINE_NAME} covertype_training_pipeline.yaml

  import cryptography.exceptions
Pipeline 8d2f1468-7cfa-49bf-bf1b-11728e7970e6 has been submitted

Pipeline Details
------------------
ID           8d2f1468-7cfa-49bf-bf1b-11728e7970e6
Name         covertype_classifier_training
Description
Uploaded at  2021-04-06T06:45:31+00:00
+-----------------------------+--------------------------------------------------+
| Parameter Name              | Default Value                                    |
| project_id                  |                                                  |
+-----------------------------+--------------------------------------------------+
| region                      |                                                  |
+-----------------------------+--------------------------------------------------+
| source_table_name           |                                                  |
+-----------------------------+--------------------------------------------------+
| gcs_root                    |                          

The following command returns a list of the pipelines deployed at the given hostname. We see that `covertype_classifier_training` has been deployed. The list also allows us to copy the pipeline ID.

In [19]:
!kfp --endpoint {INVERSE_PROXY_HOSTNAME} pipeline list

  import cryptography.exceptions
+--------------------------------------+-------------------------------------------------+---------------------------+
| Pipeline ID                          | Name                                            | Uploaded at               |
| 8d2f1468-7cfa-49bf-bf1b-11728e7970e6 | covertype_classifier_training                   | 2021-04-06T06:45:31+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+
| 06e86792-dce6-4ff9-bf9e-0409a245792c | [Tutorial] DSL - Control structures             | 2021-04-06T06:24:19+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+
| da511c20-4357-413b-890e-dd6533b2bce0 | [Tutorial] Data passing in python components    | 2021-04-06T06:24:18+00:00 |
+--------------------------------------+-------------------------------------------------+---------------------------+
| 5e9345d6-3d49
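
Rather than copying the pipeline ID by hand, the table printed by `kfp pipeline list` can be parsed. The following is a minimal sketch that assumes the table layout shown above stays the same; `find_pipeline_id` is a hypothetical helper, and the sample text is a shortened copy of the output above.

```python
# Shortened copy of the `kfp pipeline list` output shown above.
SAMPLE_LISTING = """\
  import cryptography.exceptions
+--------------------------------------+-------------------------------------+---------------------------+
| Pipeline ID                          | Name                                | Uploaded at               |
| 8d2f1468-7cfa-49bf-bf1b-11728e7970e6 | covertype_classifier_training       | 2021-04-06T06:45:31+00:00 |
+--------------------------------------+-------------------------------------+---------------------------+
| 06e86792-dce6-4ff9-bf9e-0409a245792c | [Tutorial] DSL - Control structures | 2021-04-06T06:24:19+00:00 |
+--------------------------------------+-------------------------------------+---------------------------+
"""


def find_pipeline_id(table_text, pipeline_name):
    """Return the ID of the pipeline called pipeline_name, or None."""
    for line in table_text.splitlines():
        if not line.startswith("|"):
            continue  # skip table borders and any warnings printed by the CLI
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[1] == pipeline_name:
            return cells[0]
    return None


print(find_pipeline_id(SAMPLE_LISTING, "covertype_classifier_training"))
```

In the notebook, the command output could be captured with the same `VAR=!(...)` pattern used for `HOST_TEMP` and joined with newlines before being passed to the helper.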

#### Viewing the pipeline
The deployed pipeline can be viewed through the Kubeflow Pipeline UI given at the URL below. 

In [20]:
print('https://{}'.format(INVERSE_PROXY_HOSTNAME))

https://3ea90122a145b3e7-dot-us-central2.pipelines.googleusercontent.com


## Run Experiment 
Now that the pipeline is deployed, we want to run an experiment. This causes the pipeline to run: it pulls the data from BigQuery and splits it, trains the models, evaluates them and deploys the best performing model. The experiment takes approximately an hour to execute and results in a deployed model that can be queried through GCP's AI Platform Prediction service.

**NOTE:** Change the PIPELINE_ID to reflect the ID copied from above.  

In [21]:
PIPELINE_ID='8d2f1468-7cfa-49bf-bf1b-11728e7970e6'

EXPERIMENT_NAME='Covertype_Classifier_Training'
RUN_ID='Run_001'
SOURCE_TABLE='covertype_dataset.covertype'
DATASET_ID='splits'
EVALUATION_METRIC='accuracy'
EVALUATION_METRIC_THRESHOLD='0.69'
MODEL_ID='covertype_classifier'
VERSION_ID='v01'
REPLACE_EXISTING_VERSION=True

In [22]:
!kfp --endpoint {INVERSE_PROXY_HOSTNAME} run submit \
-e {EXPERIMENT_NAME} \
-r {RUN_ID} \
-p {PIPELINE_ID} \
project_id={PROJECT_ID} \
gcs_root={GCS_STAGING_PATH} \
region={REGION} \
source_table_name={SOURCE_TABLE} \
dataset_id={DATASET_ID} \
evaluation_metric_name={EVALUATION_METRIC} \
evaluation_metric_threshold={EVALUATION_METRIC_THRESHOLD} \
model_id={MODEL_ID} \
version_id={VERSION_ID} \
replace_existing_version={REPLACE_EXISTING_VERSION}

  import cryptography.exceptions
Creating experiment Covertype_Classifier_Training.
Run 0d2e06e6-4a11-4c92-b34c-668181c53e1c is submitted
+--------------------------------------+---------+----------+---------------------------+
| run id                               | name    | status   | created at                |
| 0d2e06e6-4a11-4c92-b34c-668181c53e1c | Run_001 |          | 2021-04-06T06:47:44+00:00 |
+--------------------------------------+---------+----------+---------------------------+
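
The run parameters are passed to `kfp run submit` as `key=value` pairs after the `-e`, `-r` and `-p` options. One way to keep them in a single place is to hold them in a dictionary and assemble the command from it, as in the sketch below. `run_submit_command` is a hypothetical helper; the values are the ones used in the cells above.

```python
import shlex


def run_submit_command(endpoint, experiment, run_name, pipeline_id, params):
    """Assemble the argv for `kfp run submit` with key=value parameters."""
    argv = [
        "kfp", "--endpoint", endpoint, "run", "submit",
        "-e", experiment, "-r", run_name, "-p", pipeline_id,
    ]
    argv += ["{}={}".format(key, value) for key, value in params.items()]
    return argv


params = {
    "project_id": "mmlops3",
    "gcs_root": "gs://mmlops3-artifact-store/staging",
    "region": "us-central1",
    "source_table_name": "covertype_dataset.covertype",
    "dataset_id": "splits",
    "evaluation_metric_name": "accuracy",
    "evaluation_metric_threshold": "0.69",
    "model_id": "covertype_classifier",
    "version_id": "v01",
    "replace_existing_version": True,
}

argv = run_submit_command(
    "3ea90122a145b3e7-dot-us-central2.pipelines.googleusercontent.com",
    "Covertype_Classifier_Training",
    "Run_001",
    "8d2f1468-7cfa-49bf-bf1b-11728e7970e6",
    params,
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be handed to `subprocess.run` if the run should be submitted from a script instead of a notebook cell.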


## Testing model
To test the model we can use the AI Platform Prediction API to request a prediction from a JSON input. Alternatively, we can use the prediction UI and enter *{"instances":[[2395,0,0,60,6,1170,218,238,156,1054,"Cache","C2717"]]}* in the test case window.

We write a prediction JSON file with two data points; the correct cover types are 6 and 1 respectively.

In [40]:
%%writefile predict.json
[3366,122,15,789,127,2881,244,227,107,2437,"Commanche","C8772"]
[2791,340,15,30,10,3906,188,217,168,5401,"Rawah","C7745"]

Writing predict.json


In [41]:
INPUT_DATA_FILE="./predict.json"

!gcloud ai-platform predict --model {MODEL_ID} \
  --version {VERSION_ID} \
  --json-instances {INPUT_DATA_FILE}

[6, 1]
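
The file passed to `--json-instances` holds one JSON instance per line. As a hedged sketch, the same file can be generated from Python with a basic check on the number of fields per instance. `write_instances` and `read_instances` are hypothetical helpers, and the file is written to a temporary directory so that the notebook's own `predict.json` is left untouched.

```python
import json
import os
import tempfile

EXPECTED_FIELDS = 12  # 10 numeric features + Wilderness_Area + Soil_Type

instances = [
    [3366, 122, 15, 789, 127, 2881, 244, 227, 107, 2437, "Commanche", "C8772"],
    [2791, 340, 15, 30, 10, 3906, 188, 217, 168, 5401, "Rawah", "C7745"],
]


def write_instances(path, rows):
    """Write one JSON array per line, the layout used by predict.json above."""
    with open(path, "w") as handle:
        for row in rows:
            if len(row) != EXPECTED_FIELDS:
                raise ValueError(
                    "Expected {} fields, got {}".format(EXPECTED_FIELDS, len(row))
                )
            handle.write(json.dumps(row) + "\n")


def read_instances(path):
    """Read the file back, parsing each non-empty line as JSON."""
    with open(path) as handle:
        return [json.loads(line) for line in handle if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "predict.json")
write_instances(path, instances)
print(read_instances(path) == instances)
```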









### Clean up demo


In [1]:
!rm -r predict.json covertype_training_pipeline.yaml trainer_image base_image/