# XGBoost Training on Google Cloud Machine Learning Engine
This notebook uses the Iris data to demonstrate how to train a model on Cloud Machine Learning Engine (ML Engine).

# How to bring your model to ML Engine
Getting your model ready for training can be done in 3 steps:
1. Create your python model file
    1. Add code to download your data from [Google Cloud Storage](https://cloud.google.com/storage) so that ML Engine can use it
    1. Add code to export and save the model to [Google Cloud Storage](https://cloud.google.com/storage) once ML Engine finishes training the model
1. Prepare a package
1. Submit the training job

# Prerequisites
Before you jump in, let’s cover some of the different tools you’ll be using to get training up and running on ML Engine.

[Google Cloud Platform](https://cloud.google.com/) lets you build and host applications and websites, store data, and analyze data on Google's scalable infrastructure.

[Cloud ML Engine](https://cloud.google.com/ml-engine/) is a managed service that enables you to easily build machine learning models that work on any type of data, of any size.

[Google Cloud Storage](https://cloud.google.com/storage/) (GCS) is unified object storage for developers and enterprises, covering live data serving, data analytics/ML, and data archiving.

[Cloud SDK](https://cloud.google.com/sdk/) is a command line tool which allows you to interact with Google Cloud products. In order to run this notebook, make sure that Cloud SDK is [installed](https://cloud.google.com/sdk/downloads) in the same environment as your Jupyter kernel.


# Part 0: Setup
* [Create a project on GCP](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
* [Create a Google Cloud Storage Bucket](https://cloud.google.com/storage/docs/quickstart-console)
* [Enable Cloud Machine Learning Engine and Compute Engine APIs](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.217405014.1312742076.1516128282-1417583630.1516128282)
* [Install Cloud SDK](https://cloud.google.com/sdk/downloads)
* [[Optional] Install XGBoost](http://xgboost.readthedocs.io/en/latest/build.html)
* [[Optional] Install scikit-learn](http://scikit-learn.org/stable/install.html)
* [[Optional] Install pandas](https://pandas.pydata.org/pandas-docs/stable/install.html)
* [[Optional] Install Google API Python Client](https://github.com/google/google-api-python-client)

These variables will be needed for the following steps.
* `TRAINER_PACKAGE_PATH <./trainer>` - A packaged training application that will be staged in a Google Cloud Storage location. The model file created below is placed inside this package path.
* `MAIN_TRAINER_MODULE <trainer.task>` - Tells ML Engine which file to execute. This is formatted as `<folder_name.python_file_name>`.
* `JOB_DIR <gs://$BUCKET_ID/xgb_job_dir>` - The path to a Google Cloud Storage location to use for job output.
* `RUNTIME_VERSION <1.9>` - The version of Cloud ML Engine to use for the job. If you don't specify a runtime version, the training service uses the default Cloud ML Engine runtime version 1.0. See the list of runtime versions for more information.
* `PYTHON_VERSION <3.5>` - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.

**Replace:**
* `PROJECT_ID <YOUR_PROJECT_ID>` - with your project's id. Use the PROJECT_ID that matches your Google Cloud Platform project.
* `BUCKET_ID <YOUR_BUCKET_ID>` - with the bucket id you created above.
* `JOB_DIR <gs://YOUR_BUCKET_ID/xgb_job_dir>` - with the bucket id you created above.
* `REGION <REGION>` - select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'. The region is where the training job will run.
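
As a rough sketch of how these values fit together, the snippet below derives `JOB_DIR` from the bucket id and fails early if a template placeholder was left in place. The helper name `make_settings` and the `YOUR_` prefix check are illustrative assumptions, not part of the ML Engine tooling.

```python
def make_settings(project_id, bucket_id, region='us-central1'):
    """Collect the notebook settings in one dict (hypothetical helper)."""
    for name, value in [('PROJECT_ID', project_id), ('BUCKET_ID', bucket_id)]:
        # The template placeholders look like YOUR_PROJECT_ID / YOUR_BUCKET_ID.
        if not value or value.startswith('YOUR_'):
            raise ValueError(
                '{} still holds a placeholder: {!r}'.format(name, value))
    return {
        'PROJECT_ID': project_id,
        'BUCKET_ID': bucket_id,
        'REGION': region,
        # JOB_DIR follows the gs://$BUCKET_ID/xgb_job_dir pattern listed above.
        'JOB_DIR': 'gs://{}/xgb_job_dir'.format(bucket_id),
    }


# Example values only; substitute your own project and bucket ids.
settings = make_settings('my-project', 'my-bucket')
print(settings['JOB_DIR'])
```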

In [1]:
GCS_BUCKET = 'gs://test-project-001-210814-acxiom' #CHANGE THIS TO YOUR BUCKET
BUCKET_ID = 'test-project-001-210814-acxiom'
PROJECT = 'test-project-001-210814' #CHANGE THIS TO YOUR PROJECT ID
REGION = 'us-central1' #OPTIONALLY CHANGE THIS

In [2]:
import os
os.environ['GCS_BUCKET'] = GCS_BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

# Part 1: Create your python model file

First, we'll create the Python model file (provided below) that we'll upload to ML Engine. This is similar to your normal process for creating an XGBoost model. However, there are two key differences:
1. Downloading the data from GCS at the start of your file, so that ML Engine can access the data.
1. Exporting/saving the model to GCS at the end of your file, so that you can use it for predictions.

The code in this file loads the data into a pandas DataFrame and converts it to arrays. This data is then loaded into a DMatrix and used to train a model. Lastly, the model is saved to a file and copied to GCS, from where it can be deployed to [ML Engine's prediction service](https://cloud.google.com/ml-engine/docs/scikit/getting-predictions#deploy_models_and_versions).

Note: In normal practice you would want to test your model locally on a small dataset to ensure that it works, before using it with your larger dataset on ML Engine. This avoids wasted time and costs.
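
One way to prepare such a small dataset is sketched below, using only the standard library. Because the Iris features and labels live in two separate CSV files, the same row indices have to be kept in both. The function names and the output filenames in the usage note are hypothetical, and the sketch treats every line as a data row, so if your files carry a header row you would want to keep index 0 in the subset.

```python
import csv
import random


def pick_indices(total_rows, n_rows, seed=0):
    """Choose a reproducible random subset of row indices, in sorted order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_rows), min(n_rows, total_rows)))


def write_subset(src_path, dst_path, indices):
    """Copy only the rows at the given indices from one CSV file to another."""
    wanted = set(indices)
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for i, row in enumerate(csv.reader(src)):
            if i in wanted:
                writer.writerow(row)
```

To use it, pick the indices once, for example `indices = pick_indices(total_rows, 20)`, and pass the same list to `write_subset` for both `iris_data.csv` and `iris_target.csv` so that features and labels stay aligned.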

# Part 2: Create Trainer Package
Before you can run your trainer application with ML Engine, your code and any dependencies must be placed in a Google Cloud Storage location that your Google Cloud Platform project can access. You can find more info [here](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer).

In [3]:
%%bash
mkdir -p trainer
touch trainer/__init__.py


In [4]:
%%writefile trainer/task.py
# USE THIS FOR THE SCIKIT-LEARN EXAMPLE
# [START setup]
import datetime
import os
import subprocess
import sys
import pandas as pd
from sklearn import svm
from sklearn.externals import joblib


# Fill in your Cloud Storage bucket name
BUCKET_ID = 'test-project-001-210814-acxiom'
# [END setup]

# [START download-data]
iris_data_filename = 'iris_data.csv'
iris_target_filename = 'iris_target.csv'
data_dir = 'gs://cloud-samples-data/ml-engine/iris'

# gsutil outputs everything to stderr so we need to divert it to stdout.
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_data_filename),
                       iris_data_filename], stderr=sys.stdout)
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_target_filename),
                       iris_target_filename], stderr=sys.stdout)
# [END download-data]


# [START load-into-pandas]
# Load data into pandas
iris_data = pd.read_csv(iris_data_filename).values
iris_target = pd.read_csv(iris_target_filename).values
iris_target = iris_target.reshape((iris_target.size,))
# [END load-into-pandas]

# Train the model
classifier = svm.SVC(verbose=True)
classifier.fit(iris_data, iris_target)

# Export the classifier to a file
model = 'model.joblib'
joblib.dump(classifier, model)

# [START upload-model]
# Upload the saved model file to Cloud Storage
model_path = os.path.join('gs://', BUCKET_ID, datetime.datetime.now().strftime(
    'iris_%Y%m%d_%H%M%S'), model)
subprocess.check_call(['gsutil', 'cp', model, model_path], stderr=sys.stdout)
# [END upload-model]



Overwriting trainer/task.py
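
The download and upload steps in the script all follow the same `gsutil cp` pattern. A minimal sketch of factoring that into one helper is shown here; `gsutil_cp` and its `runner` parameter are hypothetical names introduced for illustration, and the injectable runner exists only so the helper can be exercised without calling gsutil.

```python
import subprocess
import sys


def gsutil_cp(src, dst, runner=subprocess.check_call):
    """Copy src to dst with gsutil; either side may be a gs:// URL."""
    cmd = ['gsutil', 'cp', src, dst]
    # gsutil writes its progress to stderr, so route it to stdout
    # the same way the training script does.
    runner(cmd, stderr=sys.stdout)
    return cmd
```

Inside `task.py` this could replace each `subprocess.check_call` block, for example `gsutil_cp(os.path.join(data_dir, iris_data_filename), iris_data_filename)`.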


In [5]:
%%writefile trainer/task.py
# USE THIS FOR THE XGBOOST EXAMPLE
# [START setup]
import datetime
import os
import subprocess
import sys
import pandas as pd
import xgboost as xgb


# Fill in your Cloud Storage bucket name
BUCKET_ID = 'test-project-001-210814-acxiom'
# [END setup]

# [START download-data]
iris_data_filename = 'iris_data.csv'
iris_target_filename = 'iris_target.csv'
data_dir = 'gs://cloud-samples-data/ml-engine/iris'

# gsutil outputs everything to stderr so we need to divert it to stdout.
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_data_filename),
                       iris_data_filename], stderr=sys.stdout)
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_target_filename),
                       iris_target_filename], stderr=sys.stdout)
# [END download-data]


# [START load-into-pandas]
# Load data into pandas
iris_data = pd.read_csv(iris_data_filename).values
iris_target = pd.read_csv(iris_target_filename).values
iris_target = iris_target.reshape((iris_target.size,))
# [END load-into-pandas]

# Load data into DMatrix object
dtrain = xgb.DMatrix(iris_data, label=iris_target)

# Train XGBoost model with default parameters for 20 boosting rounds
bst = xgb.train({}, dtrain, 20)

# Export the trained booster to a file
model = 'model.bst'
bst.save_model(model)

# [START upload-model]
# Upload the saved model file to Cloud Storage
model_path = os.path.join('gs://', BUCKET_ID, datetime.datetime.now().strftime(
    'iris_%Y%m%d_%H%M%S'), model)
subprocess.check_call(['gsutil', 'cp', model, model_path], stderr=sys.stdout)
# [END upload-model]



Overwriting trainer/task.py
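
The upload step builds the destination with `os.path.join`, which uses the separator of the platform it runs on. That works on the Linux workers, but it would produce backslashes if the script were run on Windows. As an alternative sketch, the path can be built with `posixpath` so it always uses forward slashes; `build_model_path` is a hypothetical helper, not part of the original script.

```python
import datetime
import posixpath


def build_model_path(bucket_id, model_filename, now=None):
    """Return gs://<bucket>/iris_<timestamp>/<model file> with '/' separators."""
    now = now or datetime.datetime.now()
    # Same timestamped folder naming as the training script.
    folder = now.strftime('iris_%Y%m%d_%H%M%S')
    return 'gs://' + posixpath.join(bucket_id, folder, model_filename)


# Example bucket id only; use your own BUCKET_ID.
print(build_model_path('my-bucket', 'model.bst'))
```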


# Part 3: Submit Training Job
Next we need to submit the job for training on ML Engine. We'll use gcloud to submit the job; the command takes the following flags:

* `job-name` - A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter). In this case: `iris_$(date -u +%y%m%d_%H%M%S)`
* `job-dir` - The path to a Google Cloud Storage location to use for job output.
* `package-path` - A packaged training application that is staged in a Google Cloud Storage location. If you are using the gcloud command-line tool, this step is largely automated.
* `module-name` - The name of the main module in your trainer package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name argument. Refer to Python Packages to figure out the module name.
* `region` - The Google Cloud Compute region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. Select a region from [here](https://cloud.google.com/ml-engine/docs/regions) or use the default '`us-central1`'.
* `runtime-version` - The version of Cloud ML Engine to use for the job. If you don't specify a runtime version, the training service uses the default Cloud ML Engine runtime version 1.0. See the list of runtime versions for more information.
* `python-version` - The Python version to use for the job. Python 3.5 is available with runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
* `scale-tier` - A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
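
To make the relationship between these flags concrete, the sketch below assembles the submit command as an argument list and checks the job name against the naming rule described above. The helper names, the default values, and the regular expression are assumptions made for illustration; only the flag names come from the list above.

```python
import datetime
import re

# Letters, digits and underscores only, starting with a letter.
JOB_NAME_PATTERN = re.compile(r'^[A-Za-z][A-Za-z0-9_]*$')


def make_job_name(prefix='iris', now=None):
    """Build a timestamped job name and check it against the naming rule."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    name = '{}_{}'.format(prefix, now.strftime('%y%m%d_%H%M%S'))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError('invalid job name: {!r}'.format(name))
    return name


def build_submit_command(job_name, job_dir, region,
                         package_path='./trainer',
                         module_name='trainer.task',
                         runtime_version='1.4',
                         python_version=None,
                         scale_tier=None):
    """Return the gcloud argument list for submitting a training job."""
    cmd = ['gcloud', 'ml-engine', 'jobs', 'submit', 'training', job_name,
           '--job-dir', job_dir,
           '--package-path', package_path,
           '--module-name', module_name,
           '--region', region,
           '--runtime-version', runtime_version]
    # Optional flags are only added when a value is given.
    if python_version:
        cmd += ['--python-version', python_version]
    if scale_tier:
        cmd += ['--scale-tier', scale_tier]
    return cmd


job_name = make_job_name()
print(' '.join(build_submit_command(
    job_name, 'gs://my-bucket/' + job_name, 'us-central1')))
```

The resulting list could be passed to `subprocess.check_call` instead of using a `%%bash` cell; the notebook itself submits the job from bash.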

Note: Check to make sure gcloud is set to the current PROJECT_ID
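
A minimal sketch of that check from Python is shown here. It reads the active project with `gcloud config get-value project` and compares it with the expected id. The function names and the injectable `runner` parameter are hypothetical and exist only to keep the sketch testable without calling gcloud.

```python
import subprocess


def configured_project(runner=subprocess.check_output):
    """Return the project id that gcloud is currently configured to use."""
    out = runner(['gcloud', 'config', 'get-value', 'project'])
    # check_output returns bytes unless told otherwise.
    if isinstance(out, bytes):
        out = out.decode('utf-8')
    return out.strip()


def project_matches(expected, runner=subprocess.check_output):
    """True if gcloud's active project equals the expected project id."""
    return configured_project(runner) == expected
```

If `project_matches(PROJECT)` returns `False`, running `gcloud config set project` with your project id switches the active project.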

In [6]:
%%bash
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=trainer \
   -- \
   --output_dir='./output'

Copying gs://cloud-samples-data/ml-engine/iris/iris_data.csv...
/ [1 files][  2.4 KiB/  2.4 KiB]
Operation completed over 1 objects/2.4 KiB.
Copying gs://cloud-samples-data/ml-engine/iris/iris_target.csv...
/ [1 files][  302.0 B/  302.0 B]
Operation completed over 1 objects/302.0 B.
Copying file://model.bst [Content-Type=application/octet-stream]...
/ [1 files][ 19.4 KiB/ 19.4 KiB]
Operation completed over 1 objects/19.4 KiB.


[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=4
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=3
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=5
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 8 extra nodes, 0 pruned nodes, max_depth=4
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=5
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 0 pruned nodes, max_depth=4
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 0 pruned nodes, max_depth=5
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 0 pruned nodes, max_depth=6
[16:19:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 0 pruned nodes, max_dep

In [7]:
%%bash
JOBNAME=iris_$(date -u +%y%m%d_%H%M%S)

gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=./trainer \
   --job-dir=$GCS_BUCKET/$JOBNAME/ \
   --runtime-version 1.4 \
   -- \
   --output_dir=$GCS_BUCKET/$JOBNAME/output

jobId: iris_181106_161941
state: QUEUED


Job [iris_181106_161941] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe iris_181106_161941

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs iris_181106_161941
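
If you would rather wait for the job from Python than re-run `describe` by hand, a polling loop along the following lines is one option. This is a sketch under assumptions: it relies on gcloud's general `--format` flag to print only the `state` field, and the set of terminal state names is an assumption to check against the job reference. The `runner` and `sleep` parameters are hypothetical hooks that make the loop testable.

```python
import subprocess
import time

# Assumed terminal job states; verify against the ML Engine job reference.
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}


def job_state(job_id, runner=subprocess.check_output):
    """Return the current state string of a training job."""
    out = runner(['gcloud', 'ml-engine', 'jobs', 'describe', job_id,
                  '--format', 'value(state)'])
    if isinstance(out, bytes):
        out = out.decode('utf-8')
    return out.strip()


def wait_for_job(job_id, poll_seconds=30,
                 runner=subprocess.check_output, sleep=time.sleep):
    """Poll the job until it reaches a terminal state, then return that state."""
    while True:
        state = job_state(job_id, runner)
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

For the job submitted above, the call would be `wait_for_job('iris_181106_161941')`.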


# [Optional] Stackdriver Logging
You can view the logs for your training job:
1. Go to https://console.cloud.google.com/
1. Select "Logging" in the left-hand pane
1. Select the "Cloud ML Job" resource from the drop-down
1. In the filter-by-prefix box, enter the value of $JOBNAME to view the logs

# [Optional] Verify Model File in GCS
View the contents of the destination model folder to verify that the model file has been uploaded to GCS.

Note: The model can take a few minutes to train and show up in GCS.

In [153]:
! gsutil ls gs://$BUCKET_ID/iris_*

gs://test-project-001-210814-acxiom/iris_181103_051152/:
gs://test-project-001-210814-acxiom/iris_181103_051152/packages/

gs://test-project-001-210814-acxiom/iris_181103_052251/:
gs://test-project-001-210814-acxiom/iris_181103_052251/packages/

gs://test-project-001-210814-acxiom/iris_181103_053305/:
gs://test-project-001-210814-acxiom/iris_181103_053305/packages/

gs://test-project-001-210814-acxiom/iris_20181103_052244/:
gs://test-project-001-210814-acxiom/iris_20181103_052244/model.joblib

gs://test-project-001-210814-acxiom/iris_20181103_052529/:
gs://test-project-001-210814-acxiom/iris_20181103_052529/model.joblib

gs://test-project-001-210814-acxiom/iris_20181103_053254/:
gs://test-project-001-210814-acxiom/iris_20181103_053254/model.bst

gs://test-project-001-210814-acxiom/iris_20181103_053437/:
gs://test-project-001-210814-acxiom/iris_20181103_053437/model.bst

gs://test-project-001-210814-acxiom/iris_20181103_054856/:
gs://test-project-001-210814-acxiom/iris_20181103_054856/m
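
A listing like the one above mixes folder headers, `packages/` sub-folders, and model files. As a small sketch, the helper below keeps only the lines that end in a model file extension; `find_model_files` is a hypothetical name, and the sample listing uses an example bucket id.

```python
def find_model_files(listing, suffixes=('.bst', '.joblib')):
    """Return the gs:// paths in a `gsutil ls` listing that look like models."""
    models = []
    for line in listing.splitlines():
        line = line.strip()
        # Folder headers end with ':' and sub-folders end with '/',
        # so only real model files match the suffix check.
        if line.endswith(tuple(suffixes)):
            models.append(line)
    return models


sample_listing = """
gs://my-bucket/iris_181103_051152/:
gs://my-bucket/iris_181103_051152/packages/

gs://my-bucket/iris_20181103_052244/:
gs://my-bucket/iris_20181103_052244/model.joblib

gs://my-bucket/iris_20181103_053254/:
gs://my-bucket/iris_20181103_053254/model.bst
"""
for path in find_model_files(sample_listing):
    print(path)
```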

# Next Steps:
The Cloud Machine Learning Engine online prediction service manages computing resources in the cloud to run your models. Check out the [documentation pages](https://cloud.google.com/ml-engine/docs/scikit/) that describe the process to get online predictions from these exported models using Cloud Machine Learning Engine.