# Orchestrating training and deployment of a scikit-learn model with Kubeflow Pipelines and Cloud AI Platform

In this lab, you develop a KFP pipeline that orchestrates BigQuery and Cloud AI Platform services to train and deploy a **scikit-learn** model. The lab uses the [Covertype Data Set](../datasets/covertype/README.md). The model is a multi-class classifier that predicts the type of forest cover from cartographic data.

The source data is stored in BigQuery. The pipeline uses BigQuery to prepare the training and evaluation splits, AI Platform Training to run a custom container with the data preprocessing and training code, and AI Platform Prediction as the deployment target. The diagram below shows the workflow orchestrated by the pipeline.

![Training pipeline](../images/kfp-caip.png)



In [15]:
import kfp
import uuid

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

## Configure environment settings
Make sure to update the constants to reflect your environment settings.

In [11]:
PROJECT_ID = 'mlops-workshop'
DATASET_LOCATION = 'US'
CLUSTER_NAME = 'mlops-workshop-cluster'
CLUSTER_ZONE = 'us-central1-a'
DATASET_ID = 'lab_12'
SOURCE_TABLE_ID = 'covertype'
SPLITS_TABLE_ID = 'splits'
COMPONENT_URL_SEARCH_PREFIX = 'https://raw.githubusercontent.com/kubeflow/pipelines/0.1.36/components/gcp/'
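
The cells that follow build fully qualified table names of the form `project.dataset.table` from these constants. As a minimal sketch, a small helper could keep that formatting in one place. The name `table_fqn` is hypothetical and not part of the lab.

```python
def table_fqn(project_id: str, dataset_id: str, table_id: str) -> str:
    """Join project, dataset, and table IDs into a fully qualified BigQuery table name."""
    return '{}.{}.{}'.format(project_id, dataset_id, table_id)


# Example with the default values used in this notebook.
source_table = table_fqn('mlops-workshop', 'lab_12', 'covertype')
print(source_table)
```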

## Experimentation

### Explore the dataset 
Use the BigQuery Python client library to query the data.

In [12]:
client = bigquery.Client(project=PROJECT_ID, location=DATASET_LOCATION)

Read and display 100 rows from the source table.

In [13]:
query_template = """
SELECT *
FROM `{{ source_table }}`
LIMIT 100
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SOURCE_TABLE_ID))
df = client.query(query).to_dataframe()
df.head(10)

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | Wilderness_Area | Soil_Type | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3094 | 82 | 65 | 42 | 3 | 3001 | 193 | 0 | 0 | 1315 | Commanche | 7202 | 2 |
| 1 | 3083 | 105 | 57 | 0 | 0 | 3002 | 228 | 0 | 0 | 1350 | Commanche | 7202 | 2 |
| 2 | 3159 | 60 | 37 | 150 | 0 | 3045 | 220 | 0 | 17 | 1177 | Commanche | 7756 | 2 |
| 3 | 3158 | 73 | 62 | 170 | -4 | 3042 | 191 | 0 | 0 | 1187 | Commanche | 7756 | 2 |
| 4 | 3147 | 96 | 59 | 216 | -6 | 3037 | 220 | 0 | 0 | 1209 | Commanche | 7756 | 2 |
| 5 | 2506 | 13 | 64 | 201 | 88 | 655 | 73 | 30 | 0 | 1470 | Commanche | 4703 | 2 |
| 6 | 2501 | 3 | 63 | 216 | 81 | 626 | 55 | 40 | 0 | 1470 | Commanche | 4703 | 2 |
| 7 | 3281 | 38 | 59 | 150 | 123 | 3012 | 137 | 42 | 0 | 1159 | Commanche | 7756 | 2 |
| 8 | 2500 | 0 | 62 | 234 | 83 | 598 | 54 | 45 | 67 | 1471 | Commanche | 4703 | 2 |
| 9 | 2555 | 3 | 60 | 190 | 135 | 684 | 67 | 53 | 65 | 1470 | Commanche | 4703 | 2 |


### Prepare the training, validation, and testing splits
#### Prepare the data splitting query

In [26]:
query_template = """
SELECT *, 
CASE(MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(cover))), 10))
  WHEN 9 THEN 'test'
  WHEN 8 THEN 'validation'
  ELSE 'training' END AS Split_Col
FROM `{{ source_table }}` AS cover
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SOURCE_TABLE_ID))
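
The splitting query above hashes each row and takes the hash modulo 10: bucket 9 goes to `test`, bucket 8 to `validation`, and the remaining eight buckets to `training`. This gives roughly an 80/10/10 split that stays stable across reruns, because the assignment depends only on the row contents. The sketch below illustrates the same idea in plain Python. It is an assumption-laden stand-in: it uses SHA-256 instead of BigQuery's `FARM_FINGERPRINT`, so the individual assignments would not match what BigQuery produces, and the rows are synthetic.

```python
import hashlib
import json
from collections import Counter


def assign_split(row: dict) -> str:
    """Map a row to 'training', 'validation', or 'test' from a hash of its contents."""
    # Serialize the row deterministically, similar in spirit to TO_JSON_STRING.
    payload = json.dumps(row, sort_keys=True).encode('utf-8')
    # SHA-256 is a stand-in; BigQuery's FARM_FINGERPRINT is a different hash function.
    bucket = int.from_bytes(hashlib.sha256(payload).digest()[:8], 'big') % 10
    if bucket == 9:
        return 'test'
    if bucket == 8:
        return 'validation'
    return 'training'


# Synthetic rows, used only to illustrate the bucket proportions.
rows = [{'Elevation': 2000 + i, 'Cover_Type': i % 7} for i in range(10000)]
counts = Counter(assign_split(row) for row in rows)
print(counts)
```

Because the split label is a pure function of the row, rerunning the assignment on the same data reproduces the same partitions without storing a random seed.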

#### Submit the data splitting job

In [27]:
dataset_ref = client.dataset(DATASET_ID)
splits_table_ref = dataset_ref.table(SPLITS_TABLE_ID)

job_config = bigquery.QueryJobConfig()
job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
job_config.destination = splits_table_ref

query_job = client.query(query, job_config=job_config)
query_job.result() # Wait for query to finish

<google.cloud.bigquery.table.RowIterator at 0x7f4be9c558d0>

#### Explore the table with splits

In [29]:
query_template = """
SELECT Cover_Type, Split_Col 
FROM `{{ source_table }}`
LIMIT 100
"""

query = Template(query_template).render(
    source_table='{}.{}.{}'.format(PROJECT_ID, DATASET_ID, SPLITS_TABLE_ID))

df = client.query(query).to_dataframe()
df.head(10)

| | Cover_Type | Split_Col |
|---|---|---|
| 0 | 2 | validation |
| 1 | 1 | training |
| 2 | 1 | validation |
| 3 | 1 | training |
| 4 | 1 | validation |
| 5 | 1 | training |
| 6 | 1 | training |
| 7 | 1 | training |
| 8 | 1 | training |
| 9 | 2 | test |
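
A 100-row sample does not show whether the splits came out near the intended 80/10/10 proportions. One way to check is to aggregate the splits table by `Split_Col`. The sketch below only builds the query text. The helper name `split_counts_query` is hypothetical, and the table name assumes the default constants from the configuration cell.

```python
def split_counts_query(table: str) -> str:
    """Return a query that counts rows per split in the given table."""
    return (
        "SELECT Split_Col, COUNT(*) AS row_count\n"
        "FROM `{table}`\n"
        "GROUP BY Split_Col\n"
        "ORDER BY Split_Col"
    ).format(table=table)


# Table name assembled from the default PROJECT_ID, DATASET_ID, and SPLITS_TABLE_ID values.
query = split_counts_query('mlops-workshop.lab_12.splits')
print(query)
```

In the notebook, the result could then be fetched with `client.query(query).to_dataframe()`, as in the cells above.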
