# Deploying an XGBoost Model to a SageMaker Endpoint that Accepts JSON Payloads

This notebook shows how to deploy an XGBoost model to an Amazon SageMaker endpoint that accepts a JSON payload.

For that purpose, two [SKLearn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html) models are built:

* A transformer model that converts JSON payloads into numeric arrays
* A wrapper for the XGBoost model


These two models are combined into a [PipelineModel](https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html) and deployed to a SageMaker endpoint.

The approach is inspired by the pre-processing example found [here](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.ipynb).

To run this notebook, set up a conda environment on your Amazon SageMaker notebook instance using the specifications in *environment.yaml*.

# Imports & Accessing SageMaker

In [None]:
import json
import sagemaker
import pickle

import scipy
import pandas as pd
import numpy as np
import xgboost as xgb

from time import gmtime, strftime

from sklearn.metrics import roc_auc_score, log_loss
from sklearn.datasets import fetch_openml

from sagemaker.sklearn.estimator import SKLearn
from sagemaker.pipeline import PipelineModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import CSVDeserializer

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Downloading & Pre-Processing Example Training Data

We use the [*adult*](https://www.openml.org/d/1590) dataset from OpenML, which contains a mix of categorical and continuous predictors.

In [None]:
adult_data = fetch_openml(name = 'adult',as_frame = True,version = 1)
X = adult_data['data'].drop(columns = ['education'])
y = adult_data['target'].apply(lambda x: 0 if x == "<=50K" else 1).astype(float)

continuous_vars = X.dtypes[(X.dtypes == "float")].index
categorical_vars = X.dtypes[(X.dtypes == "category")].index

# introduce new category for missing values in categorical variables
X[categorical_vars] = X[categorical_vars].apply(lambda x: x.cat.add_categories("missing")).fillna("missing").astype(str)

pd.concat([X,y],axis =1).head(10)

In [None]:
# dummify categorical variables
categorical_dummified = pd.get_dummies(X[categorical_vars])
data_train = pd.concat([X.drop(columns = categorical_vars),categorical_dummified,y],axis = 1)
feature_names = data_train.drop(columns = ['class']).columns.tolist()
data_train.head(10)

# Training the XGBoost model

In this step we train the XGBoost model and save it to a binary file that is read back in when the model is deployed. To keep the example simple, there is no cross-validation or hyperparameter tuning.

In [None]:
xgboost_params =  {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "max_depth": 11,
    "eta": 0.052,
    "nthread": 4,
    "booster": "gbtree",
    "subsample": 0.87,
    "min_child_weight": 2,
    "colsample_bytree": 0.71,
    "colsample_bylevel": 0.64
  }
    
dtrain = xgb.DMatrix(scipy.sparse.csr_matrix(data_train[feature_names]),label = data_train['class'].to_numpy(),feature_names = feature_names) 
xgb_model = xgb.train(dtrain = dtrain, params = xgboost_params,num_boost_round = 500)

# in-sample metrics only: this example has no hold-out set
train_predictions = xgb_model.predict(dtrain)
print("log loss: {0}".format(log_loss(data_train['class'],train_predictions)))
print("AUC: {0}".format(roc_auc_score(data_train['class'],train_predictions)))

xgb_model.save_model("sklearn_xgboost_wrapper/xgboost_model.bin")

# Creating the JSON Transformer Model

To make predictions with the XGBoost model, we need to transform the JSON payload into a numeric array whose entries correspond, in the same order, to the columns of the training dataset.

For this purpose we wrote a helper class (see file *sklearn_json_transformer/json_parser.py*) which needs to be initialised with specific information about the training data (categorical/continuous variables, observed values of categorical variables, order of features in the training data).

This information is persisted as a pickled dictionary that will be passed on to the model creation process.
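The contents of *json_parser.py* are not reproduced in this notebook, so the following is only a minimal sketch of the kind of encoding such a helper might perform. The function name `encode_record`, the toy feature definitions, and the handling of missing or unseen values are all assumptions made for illustration. The one part taken from the notebook is the indicator column naming `<variable>_<value>`, which is what `pd.get_dummies` produces by default.

```python
import json


def encode_record(record, feature_definitions):
    """Hypothetical helper: turn one JSON-decoded record into a list of
    floats ordered like the training columns ('target_columns')."""
    known_values = feature_definitions["categorical_variables_values"]

    # Start with every training column set to 0.0, keyed by column name.
    encoded = {name: 0.0 for name in feature_definitions["target_columns"]}

    for name in feature_definitions["continuous_variables"]:
        value = record.get(name)
        # Assumption: an absent continuous value is passed on as NaN.
        encoded[name] = float("nan") if value is None else float(value)

    for name in feature_definitions["categorical_variables"]:
        value = record.get(name)
        # Mirrors the notebook: missing categoricals get the "missing" level.
        value = "missing" if value is None else str(value)
        # Assumption: categories unseen in training leave all indicators at 0.0.
        if value in known_values[name]:
            encoded[f"{name}_{value}"] = 1.0

    return [encoded[name] for name in feature_definitions["target_columns"]]


# Toy feature definitions with the same keys as the pickled dictionary.
toy_definitions = {
    "continuous_variables": ["age", "hours-per-week"],
    "categorical_variables": ["sex"],
    "target_columns": ["age", "hours-per-week", "sex_Female", "sex_Male", "sex_missing"],
    "categorical_variables_values": {"sex": ["Male", "Female", "missing"]},
}

payload = json.loads('{"age": 39, "hours-per-week": 40, "sex": "Male"}')
row = encode_record(payload, toy_definitions)
print(row)  # [39.0, 40.0, 0.0, 1.0, 0.0]
```

Building the row from `target_columns` rather than from the payload's key order means the output matches the training layout regardless of how the client orders the JSON fields.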

In [None]:
categorical_features_values = {k: X[k].unique().tolist() for k in categorical_vars}
feature_definitions = {'continuous_variables': continuous_vars.tolist(),
                      'categorical_variables': categorical_vars.tolist(),
                      'target_columns': feature_names,
                      'categorical_variables_values': categorical_features_values}
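# use a context manager so the file is flushed and closed before the source directory is uploaded
with open("sklearn_json_transformer/features.pkl", "wb") as f:
    pickle.dump(feature_definitions, f)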

In [None]:
sklearn_json_transformer = SKLearn(
            entry_point = "json_transformer.py",
            role = role,
            instance_type = 'ml.c4.xlarge',
            sagemaker_session = sagemaker_session,
            source_dir = 'sklearn_json_transformer',
            framework_version = "0.23-1")
sklearn_json_transformer.fit()
sklearn_json_transformer = sklearn_json_transformer.create_model(env={"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv"})

# Creating the XGBoost Wrapper Model

In the same way as the JSON transformer, the XGBoost model is provided as an SKLearn model.
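The SageMaker scikit-learn serving container looks for optional functions named `model_fn`, `input_fn`, `predict_fn` and `output_fn` in the entry point script. The file *xgboost_wrapper.py* is not shown in this notebook, so the sketch below is a hypothetical illustration of only the request and response halves (`input_fn` and `output_fn`) for `text/csv`, the content type exchanged between the two containers here. Loading the booster and running the prediction are deliberately left out, and how the return value of `output_fn` is wrapped into an HTTP response may depend on the container version.

```python
import csv
import io


def input_fn(input_data, content_type):
    """Parse a text/csv request body into a list of rows of floats."""
    if content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {content_type}")
    if isinstance(input_data, bytes):
        input_data = input_data.decode("utf-8")
    reader = csv.reader(io.StringIO(input_data))
    # Skip empty lines; float() also accepts the literal "nan".
    return [[float(cell) for cell in row] for row in reader if row]


def output_fn(prediction, accept):
    """Format predictions as text/csv, one value per line."""
    if accept != "text/csv":
        raise ValueError(f"Unsupported accept type: {accept}")
    # repr() of a Python float round-trips exactly when parsed back.
    return "\n".join(repr(float(p)) for p in prediction)


# Local check with made-up values; no SageMaker container is involved.
rows = input_fn(b"1.0,2.5\n3,4\n", "text/csv")
body = output_fn([0.25, 0.5], "text/csv")
print(rows)  # [[1.0, 2.5], [3.0, 4.0]]
print(body)
```

Writing each probability with `repr` keeps full floating-point precision in the CSV text, which matters if endpoint results are later compared against local predictions with a tight tolerance.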

In [None]:
sklearn_xgboost_wrapper_model = SKLearn(entry_point = "xgboost_wrapper.py",
                                        role = role,
                                        instance_type = "ml.c5.xlarge",
                                        sagemaker_session = sagemaker_session,
                                        source_dir = "sklearn_xgboost_wrapper",
                                        py_version = "py3",
                                        framework_version = "0.23-1")
sklearn_xgboost_wrapper_model.fit()
xgboost_wrapper_model = sklearn_xgboost_wrapper_model.create_model(env={"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv"})

# Deploying the Pipeline

In [None]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

model_name = 'json-to-xgboost-pipeline-' + timestamp_prefix
endpoint_name = 'json-to-xgboost-pipeline-ep-' + timestamp_prefix
sm_model = PipelineModel(
    name=model_name, 
    role=role,
    models=[
        sklearn_json_transformer, 
        xgboost_wrapper_model
    ])

sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)

# Testing the Endpoint

To make sure the deployment has succeeded and the JSON transformation is working correctly, we generate JSON payloads from the training data and compare the predictions of the local model with those returned by the endpoint.
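The comparison can be rehearsed locally before calling the endpoint. The sketch below uses only the standard library and made-up stand-in values instead of real predictions. `parse_csv_response` is a hypothetical helper that mimics what a CSV deserializer returns: one list of strings per response line, which is why each prediction has to be read as `float(row[0])`.

```python
import csv
import json
import math


def parse_csv_response(body):
    """Hypothetical stand-in for a CSV deserializer: list of lists of strings."""
    return list(csv.reader(body.splitlines()))


# Two toy records; a JSON serializer would send them as a JSON array of objects.
records = [{"age": 39.0, "sex": "Male"}, {"age": 50.0, "sex": None}]
payload = json.dumps(records)

# Stand-in values, not real model output.
offline = [0.0312, 0.8754]
response_body = "0.0312\n0.8754"

endpoint = [float(row[0]) for row in parse_csv_response(response_body)]
matches = all(
    math.isclose(a, b, abs_tol=1e-10) for a, b in zip(offline, endpoint)
)
print(matches)  # True
```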

In [None]:
sample_observations = X.sample(n=100)
# select rows by index label (.loc) so the lookup does not rely on data_train having a default RangeIndex
test_predictions = pd.DataFrame({"prediction_offline":xgb_model.predict(xgb.DMatrix(scipy.sparse.csr_matrix(data_train.loc[sample_observations.index].drop(columns = ['class'])),feature_names = feature_names))})
predictor_test = Predictor(
            endpoint_name=endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=JSONSerializer(),
            deserializer=CSVDeserializer())
test_predictions = pd.concat([test_predictions,pd.DataFrame({"prediction_endpoint":[float(p[0]) for p in  predictor_test.predict(sample_observations.T.apply(lambda x: dict(x)).tolist())]})],axis = 1)
assert np.allclose(test_predictions['prediction_offline'],test_predictions['prediction_endpoint'],atol = 1e-10), "Mismatch between offline and endpoint predictions"
test_predictions.head(10)

# Tidying up

Delete the endpoint to avoid unnecessary costs.

In [None]:
sm_client = sagemaker_session.boto_session.client('sagemaker')
sm_client.delete_endpoint(EndpointName=endpoint_name)