This example shows how an XGBoost model trained on AWS SageMaker can be stored and executed in MemSQL as a user-defined function (UDF).
	
### Prerequisites
To run the code in this notebook, you'll need:
* A MemSQL instance.  You can get a **free** trial cloud cluster at https://portal.memsql.com/
* An AWS account. You can get one at https://aws.amazon.com/

In [1]:
import os
import boto3
import xgboost
import sagemaker
import numpy as np
import pandas as pd
from datetime import datetime

from memsql.common import database
import lib.memsql_s3 as memsql_sagemaker
from lib.memsql_csv import load_csv_to_table

Instructions on how to acquire AWS credentials can be found at https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html

If you change the **region**, you must also update the **container** to match, because each region has its own image registry. You can find the container for your region at https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

In [2]:
AWS_PUBLIC_KEY = "<your AWS access key>"
AWS_SECRET_KEY = "<your AWS secret key>"
ROLE = "<your AWS sagemaker role>"

REGION = 'eu-central-1'
CONTAINER = '492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3'
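The region/container pairing described above is easy to get wrong, so a small sanity check can help. The sketch below assumes the image URI follows the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`; the helper name `container_region` is hypothetical and not part of this notebook's `lib` package.

```python
import re

# Values copied from the configuration cell above.
REGION = 'eu-central-1'
CONTAINER = '492215442770.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3'


def container_region(image_uri):
    """Return the region embedded in an ECR image URI, or None if the URI
    does not follow the assumed <account>.dkr.ecr.<region>.amazonaws.com layout."""
    match = re.match(r'^\d+\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/', image_uri)
    return match.group(1) if match else None


# Fail early, before any training job is started, if the two settings disagree.
if container_region(CONTAINER) != REGION:
    raise ValueError(
        f'CONTAINER points at {container_region(CONTAINER)}, but REGION is {REGION}')
```

With the values used in this notebook the check passes silently; it only raises if you change one setting without the other.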

In [3]:
session = boto3.Session(
    aws_access_key_id=AWS_PUBLIC_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name=REGION,
)

# Creating an S3 bucket

In [4]:
s3 = session.resource('s3')
try:
    BUCKET = f'sagemaker-test-memsql-{hash(datetime.now())}'
    if REGION != 'us-east-1':
        s3.create_bucket(Bucket=BUCKET, CreateBucketConfiguration={ 'LocationConstraint': REGION })
    else:
        s3.create_bucket(Bucket=BUCKET)
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)

S3 bucket created successfully


# Preparing data for learning

In [5]:
try:
    model_data = pd.read_csv('data/bank_clean.csv',index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)

Success: Data loaded into dataframe.


In [6]:
features = list(model_data.drop(['y_yes', 'y_no'], axis=1).columns)
print(features)

['age', 'campaign', 'pdays', 'previous', 'no_previous_contact', 'not_working', 'job_admin', 'job_blue_collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self_employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'job_unknown', 'marital_divorced', 'marital_married', 'marital_single', 'marital_unknown', 'education_basic_4y', 'education_basic_6y', 'education_basic_9y', 'education_high_school', 'education_illiterate', 'education_professional_course', 'education_university_degree', 'education_unknown', 'default_no', 'default_unknown', 'default_yes', 'housing_no', 'housing_unknown', 'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure', 'poutcome_nonexiste

In [7]:
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])

In [8]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('ex2_train.csv', index=False)
session.resource('s3').Bucket(BUCKET).Object('train/train.csv').upload_file('ex2_train.csv')
s3_input_train = sagemaker.s3_input(s3_data=f's3://{BUCKET}/train', content_type='csv')
print("Success: Data loaded to S3 bucket")

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


Success: Data loaded to S3 bucket


# Training the Model

In [9]:
sess = sagemaker.Session(session)
xgb = sagemaker.estimator.Estimator(
    CONTAINER,
    ROLE,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    output_path=f's3://{BUCKET}/models',
    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,eta=0.2,gamma=4,min_child_weight=6,subsample=0.8,silent=0,
                        objective='binary:logistic',num_round=100)
xgb.fit({'train': s3_input_train})



2020-09-30 15:34:09 Starting - Starting the training job...
2020-09-30 15:34:11 Starting - Launching requested ML instances...
2020-09-30 15:35:11 Starting - Preparing the instances for training......
2020-09-30 15:36:21 Downloading - Downloading input data
2020-09-30 15:36:21 Training - Downloading the training image...
2020-09-30 15:37:07 Uploading - Uploading generated training model
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[15:36:58] 28832x59 matrix with 1701088 entries loaded from /opt/ml/input/d

# Connecting to MemSQL
Specify the credentials for your MemSQL instance below in order to connect.

In [10]:
memsql_host="YOUR MEMSQL HOST HERE"
memsql_port=3306  # YOUR MEMSQL PORT HERE
memsql_user="YOUR USERNAME HERE"
memsql_password="YOUR PASSWORD HERE"

memsql_conn = database.connect(
    host=memsql_host, port=memsql_port, 
    user=memsql_user, password=memsql_password)

memsql_conn.query('CREATE DATABASE IF NOT EXISTS testsm');
memsql_conn.query('USE testsm');
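If you would rather not keep credentials in the notebook itself, one option is to read them from environment variables. This is only a sketch: the variable names (`MEMSQL_HOST`, `MEMSQL_PORT`, `MEMSQL_USER`, `MEMSQL_PASSWORD`) and the helper `memsql_settings` are hypothetical, chosen for illustration.

```python
import os


def memsql_settings(env=os.environ):
    """Collect MemSQL connection settings from environment variables.

    The variable names are hypothetical; a missing host, user, or password
    raises KeyError, and the port falls back to 3306 as in the cell above.
    """
    return {
        'host': env['MEMSQL_HOST'],
        'port': int(env.get('MEMSQL_PORT', '3306')),
        'user': env['MEMSQL_USER'],
        'password': env['MEMSQL_PASSWORD'],
    }


# Usage with the same connect call as in the cell above:
# memsql_conn = database.connect(**memsql_settings())
```

The returned keys match the keyword arguments passed to `database.connect` above, so the dictionary can be unpacked directly into that call.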

# Deploying Model to MemSQL

In [11]:
xgb.model_data

's3://sagemaker-test-memsql-6122066355281312115/models/sagemaker-xgboost-2020-09-30-15-34-06-144/output/model.tar.gz'
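`xgb.model_data` is a plain S3 URI, so anything that fetches the model artifact first has to split it into a bucket name and an object key. The sketch below shows one way to do that with the standard library. The helper `split_s3_uri` is hypothetical, and this is not necessarily how `lib.memsql_s3` handles the path internally.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not an S3 URI: {uri!r}')
    return parsed.netloc, parsed.path.lstrip('/')


# The URI printed by the cell above.
bucket, key = split_s3_uri(
    's3://sagemaker-test-memsql-6122066355281312115/models/'
    'sagemaker-xgboost-2020-09-30-15-34-06-144/output/model.tar.gz')
print(bucket)
print(key)

# With the boto3 session created earlier, the artifact could then be fetched with:
# session.resource('s3').Bucket(bucket).download_file(key, 'model.tar.gz')
```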

In [12]:
memsql_sagemaker.xgb_model_path_to_memsql('predict_yes', xgb.model_data, memsql_conn, session,
                                          feature_names=features,  allow_overwrite=True)

# Cross Check

In [13]:
memsql_conn.query("DROP TABLE IF EXISTS bank")
load_csv_to_table('data/bank_clean.csv', "bank", ["id"] + features + ["y_yes", "y_no"], memsql_conn)

In [14]:
rows = memsql_conn.query(f"SELECT {','.join(features)} FROM bank ORDER BY id LIMIT 10")

In [15]:
arr = np.array([list(row.values()) for row in rows])
arr.shape

(10, 59)

In [16]:
actual_model = memsql_sagemaker.load_xgboost_from_s3(xgb.model_data, session)
actual_model.predict(xgboost.DMatrix(arr))

array([0.034729  , 0.02179087, 0.02758407, 0.03368836, 0.02679826,
       0.02990467, 0.07324626, 0.02542989, 0.03137181, 0.02972417],
      dtype=float32)

Now run the same prediction using the UDF in MemSQL:

In [17]:
memsql_conn.query(f"SELECT predict_yes({','.join(features)}) as res FROM bank ORDER BY id LIMIT 10")

[Row({'res': 0.034728985990970206}),
 Row({'res': 0.02179086283853907}),
 Row({'res': 0.027584073278675514}),
 Row({'res': 0.0336883398009049}),
 Row({'res': 0.026798250088515257}),
 Row({'res': 0.029904634599325696}),
 Row({'res': 0.07324624161027454}),
 Row({'res': 0.025429872943719366}),
 Row({'res': 0.03137180130238585}),
 Row({'res': 0.02972417099313468})]
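The two result sets above agree by eye, but the comparison can also be done programmatically. Here is a minimal sketch using the values printed above: the model returns `float32` values while the UDF returns doubles, so the comparison uses a tolerance rather than exact equality. The helper `predictions_match` and the tolerance `1e-6` are assumptions chosen for illustration.

```python
import math

# Predictions copied from the two outputs above.
model_preds = [0.034729, 0.02179087, 0.02758407, 0.03368836, 0.02679826,
               0.02990467, 0.07324626, 0.02542989, 0.03137181, 0.02972417]
udf_preds = [0.034728985990970206, 0.02179086283853907, 0.027584073278675514,
             0.0336883398009049, 0.026798250088515257, 0.029904634599325696,
             0.07324624161027454, 0.025429872943719366, 0.03137180130238585,
             0.02972417099313468]


def predictions_match(a, b, abs_tol=1e-6):
    """Return True if both sequences have equal length and agree element-wise within abs_tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b))


print(predictions_match(model_preds, udf_preds))  # prints True
```

In the notebook itself, the same check could be applied to the array returned by `actual_model.predict(...)` and to `[row['res'] for row in ...]` from the UDF query, instead of the hard-coded lists.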