# 1. Overview

**This notebook follows the [Codelabs](https://codelabs.developers.google.com/?cat=Machine+Learning) tutorial [Build, train, and deploy an XGBoost model on Cloud AI Platform](https://codelabs.developers.google.com/codelabs/xgb-caip-e2e/index.html?index=..%2F..index#0).**

In this lab, you will walk through a complete ML workflow on GCP. From a Cloud AI Platform Notebooks environment, you'll ingest data from a BigQuery public dataset, build and train an XGBoost model, and deploy the model to AI Platform for prediction.

### What you'll learn

You'll learn how to:
- Ingest and analyze a BigQuery dataset in AI Platform Notebooks
- Build an XGBoost model
- Deploy the XGBoost model to AI Platform and get predictions

The total cost to run this lab on Google Cloud is about **$1**.

# 2. Set up your environment

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the [instructions here](https://cloud.google.com/resource-manager/docs/creating-managing-projects).

### Step 1: Enable the Cloud AI Platform Models API

Navigate to the [AI Platform Models section](https://console.cloud.google.com/ai-platform/models?folder=&organizationId=&project=poc-cit) of your Cloud Console and click **Enable** if it isn't already enabled.

![](https://codelabs.developers.google.com/codelabs/xgb-caip-e2e/img/d0d38662851c6af3.png)

### Step 2: Enable the Compute Engine API

Navigate to [Compute Engine](https://console.cloud.google.com/marketplace/product/google/compute.googleapis.com?project=poc-cit&folder=&organizationId=) and select **Enable** if it isn't already enabled. You'll need this to create your notebook instance.

### Step 3: Create an AI Platform Notebooks instance

Navigate to the [AI Platform Notebooks section](https://console.cloud.google.com/ai-platform/notebooks/list/instances?project=poc-cit&folder=&organizationId=) of your Cloud Console and click **New Instance**. Then select the latest **Python** instance type:

![](https://codelabs.developers.google.com/codelabs/xgb-caip-e2e/img/a81c82876c6c16f9.png)

Use the default options and then click **Create**. Once the instance has been created, select **Open JupyterLab**.

### Step 4: Install XGBoost

Once your JupyterLab instance has opened, you'll need to add the XGBoost package.

To do this, select Terminal from the launcher:

![](https://codelabs.developers.google.com/codelabs/xgb-caip-e2e/img/28dcf2790ce77c96.png)

Then run the following to install the latest version of XGBoost supported by AI Platform:

`pip3 install xgboost==0.82`

After this completes, open a Python 3 Notebook instance from the launcher. You're ready to get started in your notebook!

### Step 5: Import Python packages

>For the rest of this codelab, run all the code snippets from your Jupyter notebook.

In the first cell of your notebook, add the following imports and run the cell. You can run it by pressing the right arrow button in the top menu or pressing command-enter:

In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from google.cloud import bigquery

# 3. Exploring the BigQuery dataset

BigQuery has made many datasets publicly available for your exploration. For this lab, we'll be using the natality dataset. It contains data on nearly every birth in the US over a 40-year period, including the birth weight of the child and demographic information on the baby's parents. We'll be using a subset of the features to predict a baby's birth weight.

### Step 1: Download the BigQuery data to our notebook

We'll be using the Python client library for BigQuery to download the data into a Pandas DataFrame. The original dataset is 21GB and contains 123M rows. To keep things simple we'll only be using 10,000 rows from the dataset.

Construct the query and preview the resulting DataFrame with the following code. Here we're getting 4 features from the original dataset, along with baby weight (the thing our model will predict). The dataset goes back many years but for this model we'll use only data from after 2000:

In [2]:
query="""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > 2000
LIMIT 10000
"""
df = bigquery.Client().query(query).to_dataframe()
df.head()

| | weight_pounds | is_male | mother_age | plurality | gestation_weeks |
|---|---|---|---|---|---|
| 0 | 6.68662 | True | 18 | 1 | 43.0 |
| 1 | 9.360828 | True | 32 | 1 | 41.0 |
| 2 | 8.437091 | False | 30 | 1 | 39.0 |
| 3 | 6.124442 | False | 24 | 1 | 40.0 |
| 4 | 7.12534 | False | 26 | 1 | 41.0 |
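
To experiment with a different year cutoff or sample size, the query string can be generated by a small helper. The sketch below is one way to do that; `build_natality_query` and its parameters are hypothetical names, not part of the BigQuery client library, and both values are coerced to integers before being placed in the SQL text.

```python
def build_natality_query(min_year=2000, limit=10000):
    """Return the natality query used above with a configurable year cutoff and row limit.

    Hypothetical helper for illustration only.
    """
    # Coerce to int so that only plain numbers end up in the SQL text.
    min_year = int(min_year)
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks
FROM
  publicdata.samples.natality
WHERE year > {min_year}
LIMIT {limit}
"""


query = build_natality_query(min_year=2000, limit=10000)
print(query)
```

The resulting string can be passed to `bigquery.Client().query(...)` in the same way as the hand-written query.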


To get a summary of the numeric features in our dataset, run:

In [3]:
df.describe()

| | weight_pounds | mother_age | plurality | gestation_weeks |
|---|---|---|---|---|
| count | 9989.0 | 10000.0 | 10000.0 | 9890.0 |
| mean | 7.297602 | 27.2989 | 1.0344 | 38.699798 |
| std | 1.291685 | 6.165838 | 0.192926 | 2.539957 |
| min | 0.612885 | 12.0 | 1.0 | 17.0 |
| 25% | 6.624891 | 22.0 | 1.0 | 38.0 |
| 50% | 7.374463 | 27.0 | 1.0 | 39.0 |
| 75% | 8.124034 | 32.0 | 1.0 | 40.0 |
| max | 12.257702 | 50.0 | 3.0 | 47.0 |


This shows the mean, standard deviation, minimum, and other metrics for our numeric columns. Finally, let's get some data on our boolean column indicating the baby's gender. We can do this with Pandas' `value_counts` method:

In [4]:
df['is_male'].value_counts()  # get some data on our boolean column indicating the baby's gender

True     5150
False    4850
Name: is_male, dtype: int64

Looks like the dataset is nearly balanced 50/50 by gender.

# 4. Prepare the data for training

In this section, we'll divide the data into train and test sets to prepare it for training our model.

### Step 1: Extract the label column

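First, drop rows with null values from the dataset and shuffle the data. The `count` row of the summary above is below 10,000 for `weight_pounds` and `gestation_weeks`, so some rows are missing those values: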

In [5]:
df = df.dropna()
df = shuffle(df, random_state=2)

Next, extract the label column into a separate variable and create a DataFrame with only our features:

In [6]:
labels = df['weight_pounds']
data = df.drop(columns=['weight_pounds'])

Now if you preview our dataset by running `data.head()`, you should see the four features we'll be using for training.

In [7]:
data.head()

| | is_male | mother_age | plurality | gestation_weeks |
|---|---|---|---|---|
| 39 | True | 32 | 1 | 41.0 |
| 6132 | False | 28 | 1 | 30.0 |
| 5986 | False | 44 | 1 | 38.0 |
| 7682 | False | 34 | 1 | 38.0 |
| 4910 | True | 31 | 1 | 40.0 |


### Step 2: Convert categorical features to integers

Since XGBoost requires all data to be numeric, we'll need to change how we're representing the data in the `is_male` column, which currently holds boolean `True` / `False` values. We can do that simply by casting the column to integers, which maps `True` to 1 and `False` to 0:

In [8]:
data['is_male'] = data['is_male'].astype(int)

### Step 3: Split data into train and test sets

We'll use scikit-learn's `train_test_split` utility, which we imported at the beginning of the notebook, to split our data into train and test sets:

In [9]:
x, y = data, labels
x_train, x_test, y_train, y_test = train_test_split(x, y)
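
`train_test_split` is called here with its defaults, which hold out 25% of the rows for testing and shuffle with a different seed on each run. The sketch below imitates that behavior with the standard library only, to make the mechanics visible. `simple_train_test_split` is a hypothetical stand-in, not scikit-learn's implementation, and the toy rows and labels are made up.

```python
import math
import random


def simple_train_test_split(features, labels, test_size=0.25, seed=None):
    """Shuffle row indices, then hold out a `test_size` fraction of rows for testing."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = math.ceil(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy rows in the same column order as `data`:
# is_male, mother_age, plurality, gestation_weeks
rows = [[1, 32, 1, 41.0], [0, 28, 1, 30.0], [0, 44, 1, 38.0], [0, 34, 1, 38.0]]
weights = [7.5, 3.1, 6.9, 7.2]  # made-up labels, for illustration only
x_tr, x_te, y_tr, y_te = simple_train_test_split(rows, weights, seed=2)
print(len(x_tr), len(x_te))  # 3 training rows, 1 test row
```

With scikit-learn itself, passing `random_state` (and optionally `test_size`) to `train_test_split` gives the same kind of repeatable split.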

Now we're ready to build and train our model!

# 5. A quick XGBoost primer

XGBoost is a machine learning framework that uses decision trees and gradient boosting to build predictive models. It works by ensembling multiple decision trees together based on the score associated with different leaf nodes in a tree.

The diagram below is a simplified visualization of an ensemble tree network for a model that evaluates whether or not someone will like a specific computer game (this is from the XGBoost docs):

![](https://codelabs.developers.google.com/codelabs/xgb-caip-e2e/img/fb061cd8c8f69999.png)
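
As a rough illustration of how an ensemble combines trees, the sketch below hard-codes two tiny trees modeled loosely on the computer-game diagram and sums their leaf scores. The split conditions, leaf values, and function names are assumptions chosen for illustration. Real XGBoost learns both the splits and the leaf scores from data.

```python
def tree_age_gender(person):
    """First toy tree: split on age, then on gender."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0


def tree_computer_use(person):
    """Second toy tree: split on daily computer use."""
    return 0.9 if person["uses_computer_daily"] else -0.9


def ensemble_score(person, trees):
    """An additive ensemble predicts the sum of the leaf scores from every tree."""
    return sum(tree(person) for tree in trees)


trees = [tree_age_gender, tree_computer_use]
child = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandparent = {"age": 70, "is_male": True, "uses_computer_daily": False}

for name, person in [("child", child), ("grandparent", grandparent)]:
    print(name, round(ensemble_score(person, trees), 2))
```

Each person lands in exactly one leaf per tree, and the final score is the sum of those leaves, so adding a tree refines the prediction rather than replacing it.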

**Why are we using XGBoost for this model?** While traditional neural networks have been shown to perform best on unstructured data like images and text, decision trees often perform extremely well on structured data like the natality dataset we'll be using in this codelab.

# 6. Build, train, and evaluate an XGBoost model

### Step 1: Define and train the XGBoost model

Creating a model in XGBoost is simple. We'll use the `XGBRegressor` class to create the model, and just need to pass the right `objective` parameter for our specific task. Here we're using a regression model since we're predicting a numerical value (baby's weight). If we were instead bucketing our data to determine if a baby weighed more or less than 6 pounds, we'd use a classification model.
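
For comparison, the classification variant mentioned above would need binary labels instead of raw weights. A minimal sketch of that bucketing, using made-up weights and a hypothetical `to_binary_labels` helper:

```python
def to_binary_labels(weights, threshold=6.0):
    """Label each weight 1 if it is above the threshold (in pounds), else 0."""
    return [1 if w > threshold else 0 for w in weights]


sample_weights = [6.69, 9.36, 5.2, 7.13]  # made-up values in pounds
print(to_binary_labels(sample_weights))  # [1, 1, 0, 1]
```

Labels like these would be paired with a classifier such as `xgb.XGBClassifier` rather than `XGBRegressor`.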

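In this case we'll use `reg:linear` as our model's objective. This is the squared-error regression objective in XGBoost 0.82, the version installed earlier; later XGBoost releases rename it to `reg:squarederror`.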

The following code will create an XGBoost model:

In [10]:
model = xgb.XGBRegressor(
    objective='reg:linear'
)

You can train the model with one line of code, calling the `fit()` method and passing it the training data and labels.

In [11]:
model.fit(x_train, y_train)

XGBRegressor()

### Step 2: Evaluate your model on test data

We can now use our trained model to generate predictions on our test data with the `predict()` function:

In [12]:
y_pred = model.predict(x_test)

Let's see how the model performed on the first 20 values from our test set. Below we'll print the predicted baby weight along with the actual baby weight for each test example:

In [13]:
for i in range(20):
    print('Predicted weight: ', y_pred[i])
    print('Actual weight: ', y_test.iloc[i])
    print()

Predicted weight:  8.052519
Actual weight:  6.75055446244

Predicted weight:  5.799312
Actual weight:  6.37576861704

Predicted weight:  7.8249702
Actual weight:  9.12934226942

Predicted weight:  7.8249702
Actual weight:  8.6310975573

Predicted weight:  7.661703
Actual weight:  6.4992274837599995

Predicted weight:  6.906344
Actual weight:  6.4374980503999994

Predicted weight:  7.4862857
Actual weight:  7.916799828419999

Predicted weight:  7.875356
Actual weight:  7.23998068408

Predicted weight:  6.0144997
Actual weight:  7.12313568522

Predicted weight:  7.583114
Actual weight:  7.3744626639

Predicted weight:  7.875356
Actual weight:  7.7492485093

Predicted weight:  7.5436783
Actual weight:  6.75055446244

Predicted weight:  7.0714707
Actual weight:  5.9635041871

Predicted weight:  7.817871
Actual weight:  7.87491199864

Predicted weight:  7.4339094
Actual weight:  8.062304921339999

Predicted weight:  7.808158
Actual weight:  7.0988848364

Predicted weight:  7.808158
Actual w
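
Eyeballing pairs only goes so far, so a single summary number is often useful. The sketch below computes mean absolute error and root mean squared error with the standard library. The helper names are hypothetical, and the sample values are the first three pairs printed above.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# First three predicted/actual pairs from the output above
predicted = [8.052519, 5.799312, 7.8249702]
actual = [6.75055446244, 6.37576861704, 9.12934226942]

print("MAE: ", round(mean_absolute_error(actual, predicted), 3))
print("RMSE:", round(root_mean_squared_error(actual, predicted), 3))
```

In the notebook, the same functions can be called on the full test set, for example `root_mean_squared_error(list(y_test), list(y_pred))`. Both metrics are in pounds, the same unit as the label.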

### Step 3: Save your model

In order to deploy the model, run the following code to save it to a local file:

In [14]:
model.save_model('model.bst')

# 7. Deploy model to Cloud AI Platform

We've got our model working locally, but it would be nice if we could make predictions on it from anywhere (not just this notebook!). In this step we'll deploy it to the cloud.

### Step 1: Create a Cloud Storage bucket for our model

Let's first define some variables that we'll be using throughout the rest of the codelab. Fill in the values below with the name of your Google Cloud project, the name of the Cloud Storage bucket you'd like to create (must be globally unique), the version name for the first version of your model, and a name for the model:

In [25]:
# Update these to your own GCP project, model, and version names
GCP_PROJECT = 'poc-cit'
MODEL_BUCKET = 'gs://poc_ml_storage_bucket'
VERSION_NAME = 'v1'
MODEL_NAME = 'xgb_baby_weight'

Now we're ready to create a storage bucket to store our XGBoost model file. We'll point Cloud AI Platform at this file when we deploy.

Run this `gsutil` command from within your notebook to create a bucket:

In [23]:
!gsutil mb $MODEL_BUCKET

Creating gs://poc_ml_storage_bucket/...


### Step 2: Copy the model file to Cloud Storage

Next, we'll copy our XGBoost saved model file to Cloud Storage. Run the following gsutil command:

In [24]:
!gsutil cp ./model.bst $MODEL_BUCKET

Copying file://./model.bst [Content-Type=application/octet-stream]...
/ [1 files][ 67.0 KiB/ 67.0 KiB]                                                
Operation completed over 1 objects/67.0 KiB.                                     


Head over to the [storage browser](https://console.cloud.google.com/storage/browser?project=poc-cit&prefix=) in your Cloud Console to confirm the file has been copied.

### Step 3: Create and deploy the model

The following `gcloud ai-platform` command will create a new model in your project, named with the value of `MODEL_NAME` (here, `xgb_baby_weight`):

In [26]:
!gcloud ai-platform models create $MODEL_NAME

Using endpoint [https://ml.googleapis.com/]

Learn more about regional endpoints and see a list of available regions: https://cloud.google.com/ai-platform/prediction/docs/regional-endpoints
Created ml engine model [projects/poc-cit/models/xgb_baby_weight].


Now it's time to deploy the model. We can do that with this gcloud command:

In [27]:
!gcloud ai-platform versions create $VERSION_NAME \
--model=$MODEL_NAME \
--framework='XGBOOST' \
--runtime-version=1.15 \
--origin=$MODEL_BUCKET \
--python-version=3.7 \
--project=$GCP_PROJECT

Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......done.                    


While this is running, check the [models section](https://console.cloud.google.com/ai-platform/models) of your AI Platform console. You should see your new version deploying there.

When the deployment completes successfully, you'll see a green check mark where the loading spinner was. The deployment should take 2-3 minutes.

### Step 4: Test the deployed model

To make sure your deployed model is working, test it out using gcloud to make a prediction. First, save a JSON file with two examples from our test set:

In [28]:
%%writefile predictions.json
[0.0, 33.0, 1.0, 27.0]
[1.0, 26.0, 1.0, 40.0]

Writing predictions.json
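
The `--json-instances` flag reads newline-delimited JSON, one instance per line, which is why the file above holds two bare lists rather than one enclosing array. The same file could be produced from Python, as in this sketch. `write_instances` is a hypothetical helper, and each row follows the training column order (`is_male`, `mother_age`, `plurality`, `gestation_weeks`).

```python
import json


def write_instances(rows, path):
    """Write one JSON-encoded instance per line (newline-delimited JSON)."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Same two examples as the %%writefile cell above
instances = [
    [0.0, 33.0, 1.0, 27.0],
    [1.0, 26.0, 1.0, 40.0],
]
write_instances(instances, "predictions.json")

# Read the file back to confirm its contents
with open("predictions.json") as f:
    print(f.read())
```

Generating the file this way makes it easy to send rows taken directly from `x_test` instead of typing them by hand.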


Test your model by saving the output of the following gcloud command to a variable and printing it:

In [29]:
prediction = !gcloud ai-platform predict --model=$MODEL_NAME --json-instances=predictions.json --version=$VERSION_NAME
print(prediction.s)

Using endpoint [https://ml.googleapis.com/] [3.320957660675049, 7.90995979309082]


You should see your model's prediction in the output. The actual baby weight for these two examples is 1.9 and 8.1 pounds respectively.
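
To compare the predictions with the actual weights programmatically, the numbers can be pulled out of the captured text. The sketch below assumes the output looks like the line printed above, with the predictions as the last bracketed list. `parse_predictions` is a hypothetical helper, not part of `gcloud` or IPython.

```python
import json


def parse_predictions(output_text):
    """Extract the trailing JSON list of numbers from the captured gcloud output."""
    start = output_text.rindex("[")  # the predictions are the last bracketed list
    return json.loads(output_text[start:])


# Captured output copied from the cell above
output_text = "Using endpoint [https://ml.googleapis.com/] [3.320957660675049, 7.90995979309082]"
predicted = parse_predictions(output_text)
actual = [1.9, 8.1]  # actual weights quoted in the text

for p, a in zip(predicted, actual):
    print(f"predicted {p:.2f} lb, actual {a:.1f} lb, error {p - a:+.2f} lb")
```

In the notebook, `parse_predictions(prediction.s)` would apply the same parsing to the live output, provided the format matches.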

# 8. Cleanup

If you'd like to continue using this notebook, it is recommended that you turn it off when not in use. From the [Notebooks UI](https://console.cloud.google.com/ai-platform/notebooks/list/instances?project=poc-cit) in your Cloud Console, select the notebook and then select Stop:

![](https://codelabs.developers.google.com/codelabs/xgb-caip-e2e/img/879147427150b6c7.png)

If you'd like to delete all resources you've created in this lab, simply delete the notebook instance instead of stopping it.

Using the Navigation menu in your Cloud Console, browse to Storage and delete the bucket you created to store your model assets.
