# Using built-in XGBoost with AI Platform Training


This notebook demonstrates how to use the AI Platform Training built-in XGBoost algorithm. You will train a multi-class classification model that predicts the type of forest cover from cartographic data. The [dataset](../../../datasets/covertype/README.md) used in the lab is based on the **Covertype Data Set** from the UCI Machine Learning Repository.



In [None]:
import json
import os
import numpy as np
import pandas as pd
import pickle
import uuid
import time
import tempfile

from googleapiclient import discovery
from googleapiclient import errors

from google.cloud import bigquery
from jinja2 import Template
from kfp.components import func_to_container_op
from typing import NamedTuple

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

## Configure environment settings

Set location paths, connection strings, and other environment settings. Make sure to update `REGION` and `ARTIFACT_STORE` with the settings reflecting your lab environment.

- `REGION` - the compute region for AI Platform Training and Prediction
- `ARTIFACT_STORE` - the GCS bucket used for storing data and output from AI Platform Training.

In [None]:
!gsutil ls

In [None]:
REGION = 'us-central1'
ARTIFACT_STORE = 'gs://mlops-dev-workspace/xgboos-demo'

PROJECT_ID = !(gcloud config get-value core/project)
PROJECT_ID = PROJECT_ID[0]
DATA_ROOT='{}/data'.format(ARTIFACT_STORE)
JOB_DIR_ROOT='{}/jobs'.format(ARTIFACT_STORE)

ORIG_DATASET = 'gs://workshop-datasets/covertype/small/dataset.csv'
TRAINING_DATASET = '{}/covertype_preprocessed/dataset.csv'.format(DATA_ROOT)

## Prepare the dataset for the built-in XGBoost

In [None]:
df = pd.read_csv(ORIG_DATASET)
df

### Convert numeric features to floats

In [None]:
numeric_feature_indexes = slice(0, 10)
num_features_type_map = {feature: 'float64' for feature in df.columns[numeric_feature_indexes]}

df_training = df.astype(num_features_type_map)
df_training
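
The type conversion above can be tried on a small frame first. The following is a minimal sketch, assuming a toy DataFrame whose column names and values are illustrative only; it applies the same `astype` mapping to the leading numeric columns and leaves the categorical column untouched.

```python
import pandas as pd

# Toy frame standing in for the covertype data (illustrative values only).
toy = pd.DataFrame({
    'Elevation': [3142, 2596],
    'Aspect': [183, 51],
    'Wilderness_Area': ['Commanche', 'Rawah'],
})

# Select the leading numeric columns by position, as the cell above does
# with slice(0, 10); the toy frame has only two numeric columns.
numeric_slice = slice(0, 2)
type_map = {name: 'float64' for name in toy.columns[numeric_slice]}

# astype with a dict converts only the listed columns.
toy_converted = toy.astype(type_map)
print(toy_converted.dtypes)
```

Passing a dictionary to `astype` converts only the named columns, so the categorical features keep their original type.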

### Move the target column to the first position

In [None]:
columns = list(df.columns)
columns.insert(0, columns.pop(columns.index('Cover_Type')))

df_training = df_training.reindex(columns=columns)
df_training
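
The reordering idiom above can be checked on a small example. This is a minimal sketch, assuming a toy DataFrame with made-up values; only the `Cover_Type` column name is taken from the notebook.

```python
import pandas as pd

# Toy frame with the target in the last position (illustrative values only).
toy = pd.DataFrame({
    'Elevation': [3142.0, 2596.0],
    'Soil_Type': ['C7757', 'C7745'],
    'Cover_Type': [1, 0],
})

# Pop the target name out of the column list and re-insert it at index 0.
columns = list(toy.columns)
columns.insert(0, columns.pop(columns.index('Cover_Type')))

# reindex returns a new frame with the columns in the requested order.
toy_reordered = toy.reindex(columns=columns)
print(list(toy_reordered.columns))
```

The relative order of the remaining feature columns is preserved; only the target moves.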

In [None]:
df_training.to_csv(TRAINING_DATASET, header=False, index=False)

In [None]:
!gsutil cat -r 0-297 {TRAINING_DATASET}

## Configure and submit the training job

In [None]:
IMAGE_URI = 'gcr.io/cloud-ml-algos/boosted_trees:latest'

JOB_NAME = 'job_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = '{}/{}'.format(JOB_DIR_ROOT, JOB_NAME)
SCALE_TIER = 'CUSTOM'
MASTER_TYPE = 'n1-standard-16'

VALIDATION_SPLIT = 0.10
TEST_SPLIT = 0.10
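
To get a feel for what the two split fractions imply, the sketch below estimates how many rows would land in each partition. The `split_sizes` helper and the row count of 100,000 are hypothetical, and the rounding shown here is an assumption: the notebook does not describe how the built-in algorithm actually partitions or shuffles the data.

```python
def split_sizes(n_rows, validation_split, test_split):
    """Estimate partition sizes for the given split fractions (hypothetical helper)."""
    if validation_split < 0 or test_split < 0 or validation_split + test_split >= 1:
        raise ValueError('split fractions must be non-negative and sum to less than 1')
    n_validation = round(n_rows * validation_split)
    n_test = round(n_rows * test_split)
    # Whatever is not held out for validation or test is used for training.
    n_train = n_rows - n_validation - n_test
    return n_train, n_validation, n_test


# Assumed row count, for illustration only.
n_train, n_validation, n_test = split_sizes(100_000, 0.10, 0.10)
print(n_train, n_validation, n_test)
```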

In [None]:
!gcloud ai-platform jobs submit training {JOB_NAME} \
--master-image-uri={IMAGE_URI} \
--scale-tier={SCALE_TIER} \
--master-machine-type={MASTER_TYPE} \
--job-dir={JOB_DIR} \
--region={REGION} \
-- \
--preprocess \
--objective=multi:softmax \
--training_data_path={TRAINING_DATASET} \
--validation_split={VALIDATION_SPLIT} \
--test_split={TEST_SPLIT}

## Monitor the job

In [None]:
!gcloud ai-platform jobs describe $JOB_NAME

In [None]:
!gcloud ai-platform jobs stream-logs $JOB_NAME

### Inspect the job's output

In [None]:
!gsutil ls {JOB_DIR}

In [None]:
!gsutil cat {JOB_DIR}/artifacts/instance_generator.py

## Configure and submit the hyperparameter tuning job

### Create the hyperparameter configuration file. 


The file below configures AI Platform hyperparameter tuning to run up to 12 trials, at most three of them in parallel, choosing `max_depth` from three discrete values and `eta` from the linear range between 0.2 and 0.4.
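
To make the shape of this search space concrete, the sketch below draws 12 illustrative trials from it using plain uniform sampling. This is only an illustration of the space: the sampling strategy, the `sample_trial` helper, and the fixed seed are assumptions, and the tuning service selects trial parameters with its own search algorithm.

```python
import random

# Search space as described in the text.
MAX_DEPTH_VALUES = [8, 10, 12]   # DISCRETE parameter
ETA_RANGE = (0.2, 0.4)           # DOUBLE parameter on a linear scale
MAX_TRIALS = 12


def sample_trial(rng):
    """Draw one illustrative parameter set uniformly from the space (hypothetical helper)."""
    return {
        'max_depth': rng.choice(MAX_DEPTH_VALUES),
        'eta': rng.uniform(*ETA_RANGE),
    }


rng = random.Random(0)  # fixed seed so the illustration is repeatable
trials = [sample_trial(rng) for _ in range(MAX_TRIALS)]
for number, params in enumerate(trials, start=1):
    print('trial {:2d}: max_depth={:2d} eta={:.3f}'.format(
        number, params['max_depth'], params['eta']))
```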

In [None]:
HPTUNING_CONFIG = 'hptuning_config.yaml'

In [None]:
%%writefile {HPTUNING_CONFIG}

# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#            http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

trainingInput:
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 12
    maxParallelTrials: 3
    hyperparameterMetricTag: merror
    enableTrialEarlyStopping: TRUE 
    params:
    - parameterName: max_depth
      type: DISCRETE
      discreteValues: [
          8,
          10,
          12
          ]
    - parameterName: eta
      type: DOUBLE
      minValue:  0.2
      maxValue:  0.4
      scaleType: UNIT_LINEAR_SCALE

### Start the hyperparameter tuning job.

Use the `gcloud` command to start the hyperparameter tuning job.

In [None]:
JOB_NAME = 'job_{}'.format(time.strftime("%Y%m%d_%H%M%S"))
JOB_DIR = '{}/{}'.format(JOB_DIR_ROOT, JOB_NAME)
SCALE_TIER = 'CUSTOM'
MASTER_TYPE = 'n1-standard-16'

In [None]:
!gcloud ai-platform jobs submit training {JOB_NAME} \
--master-image-uri={IMAGE_URI} \
--scale-tier={SCALE_TIER} \
--master-machine-type={MASTER_TYPE} \
--job-dir={JOB_DIR} \
--region={REGION} \
--config {HPTUNING_CONFIG} \
-- \
--preprocess \
--objective=multi:softmax \
--training_data_path={TRAINING_DATASET}

In [None]:
!gcloud ai-platform jobs describe $JOB_NAME

In [None]:
!gcloud ai-platform jobs stream-logs $JOB_NAME

### Retrieve HP-tuning results.

After the job completes, you can review the results in the GCP Console or programmatically by calling the AI Platform Training REST endpoint.
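
Because the results are only available once the job has finished, it can help to wait for a terminal state before querying them. The following is a minimal sketch, assuming a hypothetical `wait_for_job` helper that polls any zero-argument callable returning a job resource with a `state` field; the fake getter stands in for the real API call so the example runs without cloud access.

```python
import time

# States after which the job will not change any further.
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}


def wait_for_job(get_job, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_job() until the job reaches a terminal state (hypothetical helper)."""
    for _ in range(max_polls):
        job = get_job()
        if job.get('state') in TERMINAL_STATES:
            return job
        sleep(poll_seconds)
    raise TimeoutError('Job did not reach a terminal state in time')


# Stand-in for the real API call, cycling through a few made-up states.
_fake_states = iter(['QUEUED', 'RUNNING', 'SUCCEEDED'])


def fake_get_job():
    return {'state': next(_fake_states)}


final_job = wait_for_job(fake_get_job, poll_seconds=0, sleep=lambda seconds: None)
print(final_job['state'])
```

In the notebook, the getter could wrap the request built in the next cell, for example `lambda: ml.projects().jobs().get(name=job_id).execute()`.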

In [None]:
ml = discovery.build('ml', 'v1')

job_id = 'projects/{}/jobs/{}'.format(PROJECT_ID, JOB_NAME)
request = ml.projects().jobs().get(name=job_id)

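response = None
try:
    response = request.execute()
except errors.HttpError as err:
    # API-level failure, e.g. the job does not exist or access is denied
    print(err)
except Exception as err:
    print("Unexpected error: {}".format(err))

response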

The returned trials are sorted by the value of the optimization metric. The best trial is the first item in the returned list.

In [None]:
response['trainingOutput']['trials'][0]
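
Rather than relying on list order alone, the best trial can also be selected explicitly by its objective value. The sketch below assumes each trial carries a `finalMetric.objectiveValue` field; the `best_trial` helper is hypothetical and the sample trial data is made up for illustration, not taken from a real job.

```python
def best_trial(training_output, goal='MINIMIZE'):
    """Return the trial with the best objective value (hypothetical helper)."""
    # Trials without a final metric (e.g. failed ones) cannot be ranked.
    trials = [t for t in training_output.get('trials', []) if 'finalMetric' in t]
    if not trials:
        raise ValueError('no completed trials with a final metric')

    def objective(trial):
        return trial['finalMetric']['objectiveValue']

    return min(trials, key=objective) if goal == 'MINIMIZE' else max(trials, key=objective)


# Made-up trial data in the same general shape as the job response.
sample_output = {
    'trials': [
        {'trialId': '7', 'hyperparameters': {'max_depth': '12', 'eta': '0.31'},
         'finalMetric': {'objectiveValue': 0.2134}},
        {'trialId': '3', 'hyperparameters': {'max_depth': '10', 'eta': '0.27'},
         'finalMetric': {'objectiveValue': 0.2209}},
        {'trialId': '1', 'hyperparameters': {'max_depth': '8', 'eta': '0.40'},
         'finalMetric': {'objectiveValue': 0.2461}},
    ]
}

best = best_trial(sample_output, goal='MINIMIZE')
print(best['trialId'], best['hyperparameters'])
```

With `goal: MINIMIZE` and a sorted list, this agrees with taking the first element, and it still gives the right answer if the ordering ever differs.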

## Deploy the model to AI Platform Prediction

### Set the deployment config

In [None]:
training_output = response['trainingOutput']['trials'][0]['builtInAlgorithmOutput']['modelPath']

!gsutil cp  {training_output}/deployment_config.yaml .
!cat deployment_config.yaml

### Create a model resource

In [None]:
DATASET_NAME = 'covertype'
ALGORITHM = 'xgboost'
MODEL_TYPE = 'classification'
MODEL_NAME = '{}_{}_{}'.format(DATASET_NAME, ALGORITHM, MODEL_TYPE)

In [None]:
!gcloud ai-platform models create  $MODEL_NAME \
    --regions={REGION}

### Create a model version

In [None]:
MODEL_VERSION = 'v1'

!gcloud ai-platform versions create {MODEL_VERSION} \
  --model {MODEL_NAME} \
  --config deployment_config.yaml

### Serve predictions
#### Download training artifacts.

In [None]:
training_artifacts = training_output[:-14]
#training_artifacts
!gsutil ls {training_artifacts}/artifacts

In [None]:
!gsutil cp {training_artifacts}/artifacts/* .

#### Prepare the input file with JSON formatted instances.

In [None]:
INSTANCE_FILE = 'serving_instance.json'
RAW_DATA_POINT = '3142.0, 183.0, 9.0, 648.0, 101.0, 757.0, 223.0, 247.0, 157.0, 1871.0, Commanche, C7757'

!python instance_generator.py --raw_data_string="{RAW_DATA_POINT}" > {INSTANCE_FILE}

In [None]:
!cat {INSTANCE_FILE}
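
The layout of the raw data point follows the training data without the target column: ten numeric features followed by two categorical ones. The sketch below only illustrates that layout by splitting the string; it is not a reimplementation of `instance_generator.py`, whose internals are not shown in this notebook.

```python
RAW_DATA_POINT = ('3142.0, 183.0, 9.0, 648.0, 101.0, 757.0, 223.0, 247.0, '
                  '157.0, 1871.0, Commanche, C7757')

# Split on commas and drop the padding spaces around each field.
fields = [field.strip() for field in RAW_DATA_POINT.split(',')]

# The first ten fields are the numeric features converted to floats earlier;
# the remaining fields are categorical.
numeric_features = [float(value) for value in fields[:10]]
categorical_features = fields[10:]

print(numeric_features)
print(categorical_features)
```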

#### Invoke the model

In [None]:
!gcloud ai-platform predict \
--model {MODEL_NAME} \
--version {MODEL_VERSION} \
--json-instances {INSTANCE_FILE}

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>