# Amazon SageMaker Workshop - Direct Marketing with XGBoost
_**A basic Amazon SageMaker workshop demonstrating the machine learning workflow from data preparation to model deployment using direct marketing data.**_

The code is based on the [XGBoost Direct Marketing example](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/xgboost_direct_marketing) for Amazon SageMaker.

<div class="alert alert-block alert-info">
    <b>Note:</b> You may be asked to choose a kernel for this notebook. Please use the following:    
    <ul><li>Standard Amazon SageMaker notebook instances - <i>conda_python3</i></li>
        <li>Amazon SageMaker Studio notebooks - <i>Python 3 (Data Science)</i></li></ul>
</div>

## Contents

1. [Introduction](#intro)
1. [Setup](#setup)
1. [Data preparation](#data-prep)
1. [Training and hyperparameter tuning](#hyperparam1)
1. [Evaluate model](#eval1)
1. [Deploy](#deploy)
1. [Conclusion](#conclusion)


---

## Introduction <a class="anchor" id="intro"></a>

Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action. In this workshop, you will run through the process end-to-end to understand how SageMaker helps you develop models faster from concept to production.

The process of building models in Amazon SageMaker can be divided into four stages:
1. **Ground Truth**: Label your data for supervised learning tasks. Set up data labeling jobs using public or private teams. Ground Truth takes care of dividing the data between the annotators and consolidating the results. 
1. **Notebook**: Use Jupyter notebooks to explore and process your data. This is your environment for experimentation. Notebooks run on fully managed EC2 servers.
1. **Training**: Run training and hyperparameter tuning jobs at any scale. The underlying resources are automatically spun up as required, and automatically spun down when the job is finished. 
1. **Inference**: Deploy your model in the real world. Make it available 24/7 through inference endpoints, or process data in batches.

<img src="img/sagemaker_process.PNG" alt="The SageMaker process" width="800"/>

This notebook will train a model which can be used to predict if a customer will enroll for a term deposit at a bank, after one or more phone calls, based on data from the [Bank Marketing data set](https://archive.ics.uci.edu/ml/datasets/bank+marketing). You will work through the steps of the common machine learning process depicted below, with the exception of data labeling, since annotated data is already available. Although the process of developing a good model is iterative, this notebook will only cover one iteration of model creation and deployment.

<img src="img/ml_process.png" alt="The SageMaker process" width="600"/>

---

## Setup <a class="anchor" id="setup"></a>

First, set up some of the basic resources required to run SageMaker. This includes:

- The S3 bucket and prefix that you want to use for your training and model data. This should be within the same region as your SageMaker notebook instance.
- The IAM role used to give training access to your data.

<div class="alert alert-block alert-info">
    <b>Note:</b> If you used the CloudFormation template to create the resources for this workshop in your account, or if you are running this notebook as part of an AWS-hosted workshop, an S3 bucket has already been created in your account. Identify the right S3 bucket and copy/paste this name in the code below.

If you do not have an existing S3 bucket in your account for this workshop, use `bucket = sagemaker_session.default_bucket()` to have SageMaker create a bucket for you.
</div>

In [None]:
import sagemaker
import boto3

# Get the role associated with this SageMaker notebook.
role = sagemaker.get_execution_role()
print("Role name: {}".format(role))

# Start a session
sagemaker_session = sagemaker.Session()

# Specify an S3 bucket for storing the training data.
# !ACTION REQUIRED! Replace <TODO> with the name of the S3 bucket created by the CloudFormation template.
# If no S3 bucket has been created, use bucket = sagemaker_session.default_bucket()
bucket = '<TODO>'
print("Bucket name: {}".format(bucket))

# Set a prefix for storing your data - this will look like a folder in the S3 bucket.
prefix = 'sagemaker-workshop-marketing'

## Data preparation <a class="anchor" id="data-prep"></a>
Start by downloading the [Bank Marketing data set](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository. Then read in the data and display it as a table. It contains records of direct marketing campaigns, where each row is a phone call made to a customer. The goal is to predict whether the customer will subscribe to a term deposit, which is represented by the target variable `y`.

In [None]:
# Download and unzip the marketing data set.
!wget -N https://sagemaker-sample-data-us-west-2.s3-us-west-2.amazonaws.com/autopilot/direct_marketing/bank-additional.zip
path_to_zip_file = 'bank-additional.zip'
directory_to_extract_to = '.'

import zipfile
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

In [None]:
import pandas as pd
import numpy as np

# Read and display the marketing data set.
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=',')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 50)         # Keep the output on one page
data

Let's take a look at the data.  At a high level, we can see:

* We have a little over 40K customer records, and 20 features for each customer.
* The features are mixed; some numeric, some categorical.
* The data appears to be sorted, at least by `time` and `contact`.

_**Specifics on each of the features:**_

*Demographics:*
* `age`: Customer's age (numeric)
* `job`: Type of job (categorical: 'admin.', 'services', ...)
* `marital`: Marital status (categorical: 'married', 'single', ...)
* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)

*Past customer events:*
* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)
* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)
* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)

*Past direct marketing contacts:*
* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)
* `month`: Last contact month of year (categorical: 'may', 'nov', ...)
* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)
* `duration`: Last contact duration, in seconds (numeric). Important note: If duration = 0 then `y` = 'no'.
 
*Campaign information:*
* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* `pdays`: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
* `previous`: Number of contacts performed before this campaign and for this client (numeric)
* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

*External environment factors:*
* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)
* `cons.price.idx`: Consumer price index - monthly indicator (numeric)
* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)
* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)
* `nr.employed`: Number of employees - quarterly indicator (numeric)

*Target variable:*
* `y`: Has the client subscribed to a term deposit? (binary: 'yes','no')

Let's start exploring the data. First, let's understand how the features are distributed.

In [None]:
# Frequency tables for each categorical feature
for column in data.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=data[column], columns='% observations', normalize='columns'))

# Histograms for each numeric feature
display(data.describe())
%matplotlib inline
hist = data.hist(bins=30, sharey=True, figsize=(10, 10))

Notice that:

* Almost 90% of the values for our target variable y are "no", so most customers did not subscribe to a term deposit.
* Many of the predictive features take on values of "unknown". Some are more common than others. We should think carefully about what causes a value of "unknown" (are these customers non-representative in some way?) and how that should be handled.
    * Even if "unknown" is included as its own distinct category, what does it mean given that, in reality, those observations likely fall within one of the other categories of that feature?
* Many of the predictive features have categories with very few observations in them. If we find a small category to be highly predictive of our target outcome, do we have enough evidence to make a generalization about that?
* Contact timing is particularly skewed: almost a third of contacts were made in May and less than 1% in December. What does this mean for predicting our target variable next December?
* There are no missing values in our numeric features, or missing values have already been imputed.
    * `pdays` takes a value near 1000 for almost all customers, likely a placeholder value signifying no previous contact.
* Several numeric features have a very long tail. Do we need to handle these few observations with extremely large values differently?
* Several numeric features (particularly the macroeconomic ones) occur in distinct buckets. Should these be treated as categorical?

Next, let's look at how our features relate to the target that we are attempting to predict.

In [None]:
import matplotlib.pyplot as plt

for column in data.select_dtypes(include=['object']).columns:
    if column != 'y':
        display(pd.crosstab(index=data[column], columns=data['y'], normalize='columns'))

for column in data.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = data[[column, 'y']].hist(by='y', bins=30)
    plt.show()

Notice that:

* Customers who are "blue-collar", "married", have an "unknown" default status, were contacted by "telephone", and/or were contacted in "may" make up a substantially lower portion of the "yes" group than of the "no" group for subscribing.
* Distributions for numeric variables are different across "yes" and "no" subscribing groups, but the relationships may not be straightforward or obvious.

Now let's look at how our features relate to one another.

In [None]:
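# Correlations are only defined for numeric columns, so select those explicitly
display(data.select_dtypes(include='number').corr())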
pd.plotting.scatter_matrix(data, figsize=(12, 12))
plt.show()

Notice that:
* Features vary widely in their relationship with one another. Some pairs have a highly negative correlation, others a highly positive correlation.
* Relationships between features are non-linear and discrete in many cases.

Normally, you won't be able to use the data as-is for training machine learning models. Every algorithm has specific requirements on the input it accepts, so you will need to transform the data based on these requirements. In addition, your data set will often contain incomplete data, duplicate data, inconsistent data, irrelevant data, or outliers. These problems will affect the performance of the model, so data cleaning is a crucial step in the machine learning process. Several common techniques include:

* Handling missing values: Some machine learning algorithms are capable of handling missing values, but most would rather not. Options include:
    * Removing observations with missing values: This works well if only a very small fraction of observations have incomplete information.
    * Removing features with missing values: This works well if there are a small number of features which have a large number of missing values.
    * Imputing missing values: Entire books have been written on this topic, but common choices are replacing the missing value with the mode or mean of that column's non-missing values.
* Converting categorical to numeric: The most common method is one hot encoding, which for each feature maps every distinct value of that column to its own feature which takes a value of 1 when the categorical feature is equal to that value, and 0 otherwise.
* Oddly distributed data: Although for non-linear models like Gradient Boosted Trees, this has very limited implications, parametric models like regression can produce wildly inaccurate estimates when fed highly skewed data. In some cases, simply taking the natural log of the features is sufficient to produce more normally distributed data. In others, bucketing values into discrete ranges is helpful. These buckets can then be treated as categorical variables and included in the model when one hot encoded.
* Handling more complicated data types: Manipulating images, text, or data at varying grains is left for other notebook templates.
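
To make the techniques above concrete, here is a minimal sketch on a small made-up table. The column names and values are hypothetical and are not taken from the bank marketing data set. It imputes missing values with the mean (numeric) or mode (categorical), applies `np.log1p` (the natural log of 1 + x, used here so that zeros are handled) to a long-tailed column, and one-hot encodes the categorical column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the bank marketing data set.
toy = pd.DataFrame({
    'income': [20_000.0, 35_000.0, np.nan, 1_000_000.0],
    'contact': ['cellular', None, 'cellular', 'telephone'],
})

# Impute: mean for the numeric column, mode for the categorical column.
toy['income'] = toy['income'].fillna(toy['income'].mean())
toy['contact'] = toy['contact'].fillna(toy['contact'].mode()[0])

# Compress the long right tail with a log transform, log(1 + x).
toy['log_income'] = np.log1p(toy['income'])

# One-hot encode the categorical column into 0/1 indicator columns.
toy_encoded = pd.get_dummies(toy, columns=['contact'], dtype=int)
print(toy_encoded)
```

Whether mean or mode imputation is appropriate depends on why the values are missing, so treat this as a starting point rather than a recommendation.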

To get started, one question to ask yourself is whether certain features will add value in your final use case. For example, if your goal is to deliver the best prediction, then will you have access to that data at the moment of prediction? Knowing it's raining is highly predictive for umbrella sales, but forecasting weather far enough out to plan inventory on umbrellas is probably just as difficult as forecasting umbrella sales without knowledge of the weather. So, including this in your model may give you a false sense of precision. Following this logic, let's remove the economic features and duration from our data as they would need to be forecasted with high precision to use as inputs in future predictions.

In [None]:
# Drop columns requiring accurate future forecasts
clean_data = data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

The cell below has been left blank for you to apply your own data cleaning steps. Below are some ideas if you are not sure what to do:
1. Many records have the value of "999" for pdays, the number of days that passed after a client was last contacted. It is very likely a magic number representing that no contact was made before. Considering this, we can create a new column called "no_previous_contact", then give it a value of "1" when pdays is 999 and "0" otherwise.
1. In the "job" column, there are categories that mean the customer is not working, e.g., "student", "retired", and "unemployed". Since whether or not a customer is working is very likely to affect their decision to enroll in the term deposit, we can generate a new column to show whether the customer is working based on the "job" column.

Alternatively, you can choose to continue without additional data cleaning. After training and evaluating a model based on the current data set, you can iterate on this by coming back to this data cleaning step.
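
If you want a starting point, the two ideas above could be sketched as follows. This runs on a small hypothetical stand-in table (`toy_clean`) rather than on `clean_data`, and the column name `not_working` is our own choice for illustration. Adapt it to `clean_data` in the cell below if you find it useful.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for clean_data with only the two relevant columns.
toy_clean = pd.DataFrame({
    'pdays': [999, 3, 999, 6],
    'job': ['admin.', 'student', 'retired', 'unemployed'],
})

# Idea 1: flag customers with no previous contact (pdays == 999).
toy_clean['no_previous_contact'] = np.where(toy_clean['pdays'] == 999, 1, 0)

# Idea 2: flag customers who are not working, based on the job category.
toy_clean['not_working'] = np.where(
    toy_clean['job'].isin(['student', 'retired', 'unemployed']), 1, 0)

print(toy_clean)
```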

In [None]:
# !ACTION REQUIRED!
# Apply your own data cleaning steps below on the clean_data variable

In this workshop, you will use the [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to predict the likelihood of a customer enrolling for a term deposit. More information about this algorithm will be provided in the training and tuning section, but for data preparation it is important to know the input requirements of the algorithm. One of the data formats accepted by XGBoost is CSV (comma-separated value). The CSV should be structured such that the rows are observations, the first column is the target variable, and the remaining columns are the features. In addition, the CSV file should contain only numeric data and no header. For more details, [read the documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) on the input and output format of XGBoost. If the data is not formatted correctly, you can expect to see an error similar to the image below.

<img src="img/workbook_screenshot2.PNG" alt="Error message for incorrect data input." width="600"/>
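
To catch formatting problems before starting a training job, you could run a quick local check along these lines. `check_xgboost_csv` is a hypothetical helper written for this illustration and is not part of the SageMaker SDK. It assumes a binary target encoded as 0/1 in the first column, as in this workshop.

```python
import csv
import io


def check_xgboost_csv(text):
    """Return a list of problems found in CSV text meant for training.

    Hypothetical helper for illustration only. Checks that every field is
    numeric (which also rules out a header row), that all rows have the same
    number of columns, and that the first column holds a 0/1 target.
    """
    problems = []
    n_cols = None
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if n_cols is None:
            n_cols = len(row)
        elif len(row) != n_cols:
            problems.append('line {}: expected {} columns, found {}'.format(
                line_no, n_cols, len(row)))
        try:
            values = [float(field) for field in row]
        except ValueError:
            problems.append('line {}: non-numeric value (header or text?)'.format(line_no))
            continue
        if values and values[0] not in (0.0, 1.0):
            problems.append('line {}: target in first column is not 0 or 1'.format(line_no))
    return problems


good = "1,35,0,1\n0,52,1,0\n"
bad = "y_yes,age,a,b\n1,35,0,1\nno,52,1,0\n"
print(check_xgboost_csv(good))
print(check_xgboost_csv(bad))
```

The first call reports no problems. The second flags the header row and the row whose target is the text `no`.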

First, let's ensure that the data only contains numeric values. This means converting all categorical variables (e.g. `marital` column) into sets of indicators. A new column is created for each value (e.g. marital_divorced, marital_married, marital_single, and marital_unknown), and for each user (i.e. row) a 1 marks the relevant value while the rest of the columns contain 0's. This method is known as one-hot encoding.

In [None]:
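# Convert categorical variables to sets of indicators.
# dtype=int keeps the indicators as 0/1 (recent pandas versions default to True/False).
model_data = pd.get_dummies(clean_data, dtype=int)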
model_data

Since our target variable was categorical (i.e. 'yes' or 'no'), the one-hot encoding method resulted in two columns: `y_yes` and `y_no`, which are exact opposites of each other. XGBoost only accepts one target variable column and expects that to be the first column in the data set, so the data set needs to be transformed to reflect these requirements.

When building a model whose primary goal is to predict a target value on new data, it is important to understand overfitting. Supervised learning models are designed to minimize error between their predictions of the target value and actuals, in the data they are given. This last part is key, as frequently in their quest for greater accuracy, machine learning models bias themselves toward picking up on minor idiosyncrasies within the data they are shown. These idiosyncrasies then don't repeat themselves in subsequent data, so the model gains accuracy on the training data at the expense of less accurate predictions on new data.

The most common way of preventing this is to build models with the concept that a model shouldn't only be judged on its fit to the data it was trained on, but also on "new" data. There are several different ways of operationalizing this: holdout validation, cross-validation, leave-one-out validation, etc. For our purposes, we'll simply randomly split the data into 3 uneven groups. The model will be trained on 70% of the data. It will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to have on "new" data, and the remaining 10% will be held back as a final testing dataset which will be used later on.

In [None]:
# Remove excess target variable column and copy the remaining target variable column to be the first column
model_data = pd.concat([model_data['y_yes'], model_data.drop(['y_no', 'y_yes'], axis=1)], axis=1)

# Randomly sort the data then split out first 70%, second 20%, and last 10%
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), 
                                                  [int(0.7 * len(model_data)), 
                                                   int(0.9 * len(model_data))])

# Save to CSV format. Drop headers and index column.
train_data.to_csv('train.csv', index=False, header=False)
validation_data.to_csv('validation.csv', index=False, header=False)

To build a SageMaker model, the training and validation data need to be stored in an S3 bucket. Remember that the bucket should be in the same AWS region as your SageMaker instance.

In [None]:
# Set the file locations
s3train = 's3://{}/{}/train/'.format(bucket, prefix)
s3validation = 's3://{}/{}/validation/'.format(bucket, prefix)

# Upload the data to S3
!aws s3 cp train.csv $s3train
!aws s3 cp validation.csv $s3validation

After uploading the data to S3, we need to tell SageMaker where to find the data and where to save the output.

In [None]:
# Tell SageMaker where to find the input data and where to store the output
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
s3_input_train = sagemaker.inputs.TrainingInput(s3train, distribution='FullyReplicated', 
                        content_type='text/csv', s3_data_type='S3Prefix')
s3_input_validation = sagemaker.inputs.TrainingInput(s3validation, distribution='FullyReplicated', 
                             content_type='text/csv', s3_data_type='S3Prefix')

## Training and hyperparameter tuning <a class="anchor" id="hyperparam1"></a>

In this workshop, you will use XGBoost, also known as Extreme Gradient Boosting. XGBoost is a built-in algorithm in SageMaker which can be used for regression and classification on structured (tabular) data. It works by building an ensemble of decision trees. During training, decision trees are continuously added to predict and correct the errors of the previous decision trees, until no further improvements can be made (i.e. boosting). This algorithm has been shown to work well with minimal data cleaning, performing well despite missing data points or outliers. With the large number of hyperparameters contained in this algorithm, tuning the model is crucial to achieving success.
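
To build some intuition for boosting, here is a deliberately simplified sketch in plain Python. It is not how XGBoost is implemented: it uses one-split trees ("stumps"), a single feature, and squared-error loss, whereas this workshop uses the `binary:logistic` objective and XGBoost adds regularization and many optimizations. The function names are our own. The point is only to show each new tree being fitted to the errors left by the previous ones, scaled by a step size `eta`.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits the residuals (squared error)."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=20, eta=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    predictions = [0.0] * len(xs)
    stumps, history = [], []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + eta * predict_stump(stump, x)
                       for x, p in zip(xs, predictions)]
        # Track the training mean squared error after each round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return stumps, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1, 1, 0]
stumps, history = boost(xs, ys)
print('MSE after round 1:  {:.4f}'.format(history[0]))
print('MSE after round 20: {:.4f}'.format(history[-1]))
```

The training error shrinks as rounds are added. A smaller `eta` makes each step more cautious, which is why `eta` and `num_round` are usually considered together.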

First, you need to specify how a training job should be run:
* Provide the container image for the algorithm (XGBoost)
* The location where the output should be saved (S3 bucket)
* The values of any algorithm hyperparameters that are not tuned in the tuning job (static hyperparameters)
* The type and number of instances to use for the training jobs
* The stopping condition for the training jobs

In [None]:
# Fetch the XGBoost container for your region
my_region = sagemaker_session.boto_region_name
container = sagemaker.image_uris.retrieve('xgboost', my_region, '0.90-2')

# Set up an XGBoost trainer
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path=s3_output_location,
                                    sagemaker_session=sagemaker_session)

# Set initial values for all hyperparameters
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

With just these settings from the cell above, you could go ahead and train a model. However, XGBoost has a large number of hyperparameters, so it is unlikely that we chose the best values above.

SageMaker automatic model tuning helps with automating the hyperparameter tuning process. It allows you to define a range for each hyperparameter, and SageMaker will automatically run multiple training jobs with different hyperparameter values, evaluating the results based on a predefined objective metric. Results from the previous training jobs are used to choose values for subsequent training jobs. You can set a tuning budget (maximum number of training jobs) to stop the tuning process. Often the tuning process is iterative and requires running multiple tuning jobs after analyzing the results to get the best objective metric.

Now we configure the hyperparameter tuning job by specifying the following information:
* The ranges of hyperparameters we want to tune
* Number of training jobs to run in total and how many training jobs should be run simultaneously. More parallel jobs will finish tuning sooner, but may sacrifice accuracy. We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short).
* The objective metric that will be used to evaluate training results. In this example, we select *validation:auc* as the objective metric, and the goal is to maximize its value throughout the hyperparameter tuning process. Note that the objective metric has to be among the metrics that are emitted by the algorithm during training. The built-in XGBoost algorithm emits several metrics, and *validation:auc* is one of them. If you bring your own algorithm to SageMaker, then you need to make sure that it emits the objective metric you select.
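
To make the objective metric less abstract: AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This makes it a more informative metric than plain accuracy when the classes are as imbalanced as they are here. The sketch below computes it by direct pairwise comparison on made-up labels and scores. It is written for clarity, not speed, and is not the code XGBoost uses internally.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive example outscores a negative one;
    ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 2 positives and 3 negatives.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
print(auc(labels, scores))
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 is what random scores would achieve on average.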

We will tune four hyperparameters in this example:
* *eta*: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative. 
* *alpha*: L1 regularization term on weights. Increasing this value makes models more conservative. 
* *min_child_weight*: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. 
* *max_depth*: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted. 

In the following cell you are required to enter some code following the instructions.

In [None]:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# !ACTION REQUIRED! In the code below, you need to replace the <TODO>'s!

# Set the search range for each hyperparameter
# This is a dictionary with 4 members defining the ranges of the hyperparameters to search:
# 'eta' must be set as a continuous parameter ranging from 0 to 1
# 'alpha' must be set as a continuous parameter ranging from 0 to 2
# 'min_child_weight' must be set as a continuous parameter ranging from 1 to 10
# 'max_depth' must be set as an integer parameter ranging from 1 to 10
hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                         'alpha': <TODO>,
                         'min_child_weight': <TODO>,
                         'max_depth': <TODO>}

# Choose the objective metric to maximize and set up the tuner
objective_metric_name = 'validation:auc'
# Create the HyperparameterTuner with the following arguments:
# xgb - the estimator we created previously,
# the objective_metric_name
# an objective_type = 'Maximize'
# 5 max_jobs
# 3 max_parallel_jobs
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type='Maximize',
                            max_jobs=5,
                            max_parallel_jobs=3)

Time to launch the hyperparameter tuning job! You need to point the tuner to the training and validation sets you prepared earlier.

In [None]:
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False)

In the SageMaker console, you can navigate to *Hyperparameter tuning jobs* to see the job in progress. In the column *Training completed / total*, you should at first see the value 0/3. SageMaker starts with 3 parallel training jobs because that is what we specified as the maximum number of parallel jobs. As the first three training jobs finish, you will see SageMaker start two more training jobs, for a total of 5 jobs, as we specified. Click on the name of your tuning job to see more details. 

<img src="img/tuning.PNG" width="800" />

You can also navigate to *Training jobs* to view the individual training jobs started by the hyperparameter tuner.

<img src="img/training.PNG" width="800" />

## Evaluate model <a class="anchor" id="eval1"></a>

<div class="alert alert-block alert-info"><b>Note:</b> You will be unable to successfully run this section until the tuning job completes. Check the progress of the tuning job in the console, or run the first cell below to receive the status. While waiting, feel free to experiment with the data some more and apply additional cleaning.</div>

In [None]:
# Run this cell to check current status of hyperparameter tuning job
sage_client = sagemaker_session.sagemaker_client
job_name = tuner._current_job_name
tuning_job_result = sage_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)

status = tuning_job_result['HyperParameterTuningJobStatus']
print("Current status: {}".format(status))
if status != 'Completed':
    print('Reminder: the tuning job has not been completed.')
job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']
print("{} training jobs have completed".format(job_count))

Once the tuning job finishes, we can bring in a table of metrics.

In [None]:
tuning_job_name = tuner._current_job_name

tuner_parent_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
# Fetch the results once; sort only if at least one training job has reported metrics
df_parent = tuner_parent_metrics.dataframe()
if not df_parent.empty:
    df_parent = df_parent.sort_values(['FinalObjectiveValue'], ascending=False)

df_parent

If you are not interested in the full list of results, but just want to know which training job was the best, use 
'BestTrainingJob' from the Automatic Model Tuning describe API.

In [None]:
best_overall_training_job = sage_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name)['BestTrainingJob']

best_overall_training_job

You can analyze the results in more depth by using the [HPO_Analyze_TuningJob_Results.ipynb notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/analyze_results). Here, we will just plot how the objective metric changes over time as the tuning progresses.

In [None]:
import bokeh
import bokeh.io
bokeh.io.output_notebook()
from bokeh.plotting import figure, show
from bokeh.models import HoverTool

import pandas as pd

df_parent_objective_value = df_parent[df_parent['FinalObjectiveValue'] > -float('inf')]

p = figure(plot_width=900, plot_height=400, x_axis_type='datetime',x_axis_label='datetime', y_axis_label=objective_metric_name)
p.circle(source=df_parent_objective_value, x='TrainingStartTime', y='FinalObjectiveValue', color='black')

show(p)

Depending on how your first hyperparameter tuning job went, you may or may not want to try another tuning job to see whether the model quality can be further improved. If you decide to run another tuning job, you will want to leverage what has been learned about the search space from the completed tuning job. In that case, you can create a new hyperparameter tuning job using a [warm start](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-warm-start.html). We won't cover this feature in this workshop, but it is good to know that it exists.

## Deploy <a class="anchor" id="deploy"></a>

There are two ways to deploy your models and get predictions in Amazon SageMaker:
* Endpoints: This deploys a persistent HTTPS endpoint containing one or more models. Use it to send individual inference requests which require near real-time responses.
* Batch Transform: This spins up the resources required to run inference on a batch of data, and automatically shuts down the resources when finished. Use this to process large datasets or to run inference requests which do not require an immediate response. 

This notebook contains example code for both batch transform and endpoint deployment.

### Batch Transform

Let's test our best model by running a batch transform job. The advantage of running a batch transform job is that you only pay for the compute resources required to complete the inference request - Amazon SageMaker takes care of starting up and shutting down the resources.

From the hyperparameter tuning job run in the previous section, we fetch only the best performing model.

In [None]:
best_xgb = tuner.best_estimator()
best_xgb

We can see the specific hyperparameter values of this best performing model.

In [None]:
best_xgb.hyperparameters()

To run batch transforms using this model, we need to load it into a [Transformer](https://sagemaker.readthedocs.io/en/stable/transformer.html). You can specify the type and number of compute resources (GPU is usually not required for inference), as well as various parameters on the input and output format of the data. Batch transform allows you to filter the input data, and associate input records with inference results. We don't cover those features in this workshop, but it is worth reading the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html). 

For now, let's create a transformer on a single small general-purpose instance. Remember that creating the transformer does not start any compute resources - the instance is only started when a transform job is requested.

In [None]:
transformer = best_xgb.transformer(
    instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    strategy = 'SingleRecord',
    output_path = s3_output_location,
    assemble_with = 'Line',
)
transformer

Remember that we split off 10% from our dataset into a test set at the start of this workshop?

First remove the target variable column from this dataset - this will be compared with the predictions produced by the model later. Then apply the same processing used for the train and validation sets: save the data in CSV format and upload it to the S3 bucket.

In [None]:
# Drop the target variable column from the dataset.
test_data_wo_target = test_data.drop(['y_yes'], axis=1)

# Save to CSV format. Drop headers and index column.
test_data_wo_target.to_csv('test.csv', index=False, header=False)

# Set the file locations
s3test = 's3://{}/{}/test/'.format(bucket, prefix)

# Upload the data to S3
!aws s3 cp test.csv $s3test

Now you are ready to run a batch transform on your test data! This will take around 5 minutes to complete - it needs to create the compute resources and run the whole dataset through the model.

In [None]:
transformer.transform(s3test, content_type="text/csv", split_type="Line")
transformer.wait()

As soon as you start the batch transform job, you should see it appear in the 'Batch transform jobs' section of the SageMaker console.

<img src="img/batch_transform.PNG" width="800" />

Batch transform saves the results as a CSV file in the S3 bucket. This is great in production, but we want to take a look at the results in this notebook, so we copy the file onto the instance running this notebook. Then we load the data from the file and apply some transformations to get a 1-dimensional array of values.
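
The "1-dimensional array" step is easier to follow with a tiny example. The sketch below assumes that, with `assemble_with = 'Line'`, the output file holds one score per line; the helper name `read_scores` and the sample scores are made up for illustration and are not part of the workshop code.

```python
import io


def read_scores(lines):
    """Parse one float score per non-empty line into a flat list."""
    stripped = (line.strip() for line in lines)
    return [float(value) for value in stripped if value]


# Hypothetical stand-in for the contents of 'test.csv.out'
sample_output = io.StringIO("0.0312\n0.8745\n0.1120\n")
scores = read_scores(sample_output)
print(scores)
```

A quick sanity check at this point is that the number of scores equals the number of rows in the test set, since each input line should produce exactly one prediction.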

Using the transformed prediction results, we can create a confusion matrix to compare the model's predictions with the ground truth data.

In [None]:
# Copy the output file from S3 to the instance underlying this notebook
!aws s3 cp --recursive $transformer.output_path ./

In [None]:
predictions = pd.read_csv('test.csv.out', header=None)

# Transpose to a single row, convert to a NumPy array, then select that row to get a 1-D array
predictions = predictions.T.to_numpy()[0]
predictions

In [None]:
# Generate a confusion matrix
# predictions is already a 1-D array of scores, so round it directly to get 0/1 labels
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

The results may vary depending on the data cleaning steps you applied before training, but in general you will see that the model is slightly biased towards '0', i.e. people who did not subscribe to a term deposit. Still, the results are not bad.
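
To put a number on that bias, the confusion matrix can be reduced to precision and recall for the positive class. The following is a minimal sketch in plain Python; the function names, the sample labels, and the sample scores are illustrative assumptions, and the 0.5 cut-off mirrors the rounding used above.

```python
def confusion_counts(actuals, scores, threshold=0.5):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, score in zip(actuals, scores):
        predicted = 1 if score > threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall) for the positive class, using 0.0 when undefined."""
    predicted_positive = counts["tp"] + counts["fp"]
    actual_positive = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_positive if predicted_positive else 0.0
    recall = counts["tp"] / actual_positive if actual_positive else 0.0
    return precision, recall


# Made-up example data: 3 subscribers out of 10 customers
actuals = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
scores = [0.1, 0.2, 0.05, 0.3, 0.6, 0.1, 0.8, 0.4, 0.2, 0.15]

counts = confusion_counts(actuals, scores)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

A low recall with a reasonable precision is what a bias towards '0' looks like in these terms. Lowering `threshold` trades some precision for more recall, which may be worthwhile if missing a likely subscriber costs more than contacting an uninterested customer.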

### [Optional] Hosting an Endpoint

Another method for deploying your model is through endpoints. This part of the workshop is optional, because an endpoint takes around 10 minutes to start (remember, these are endpoints suitable for production workloads). 

Again, we go back to our hyperparameter tuning job. You can deploy an endpoint straight from the hyperparameter tuner, which automatically selects the best training job to deploy.

First we'll need to determine how we pass data into and receive data from our endpoint. Our data is currently stored as NumPy arrays in the memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the CSV response. Therefore, we set the serializer for the endpoint to the CSV serializer provided by the SageMaker library.

In [None]:
my_predictor = tuner.deploy(
    initial_instance_count=1, 
    instance_type='ml.t2.medium',
    serializer=sagemaker.serializers.CSVSerializer()
)
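
To make the serialization step concrete, the sketch below shows roughly what travels over the wire: rows become a CSV string on the way in, and the CSV response body is decoded back into floats on the way out. It is a local stand-in written with the standard library, not the SageMaker serializer itself; the function names and sample values are hypothetical.

```python
import csv
import io


def rows_to_csv_payload(rows):
    """Encode a list of rows as CSV text: one row per line, no header."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue().rstrip("\n")


def parse_csv_response(body):
    """Decode a bytes response holding comma- or newline-separated scores."""
    text = body.decode("utf-8")
    return [float(value) for value in text.replace("\n", ",").split(",") if value.strip()]


# Two made-up feature rows, without the target column
payload = rows_to_csv_payload([[1, 0.5, 0], [2, 1.5, 1]])
print(payload)

# A made-up response body of the kind an endpoint might return
print(parse_csv_response(b"0.1,0.9"))
```

Accepting both commas and newlines as separators in the parser is a defensive choice for this sketch, since the exact response layout can depend on the container version.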

Navigate to the console to see the effects of running this one line of code. In the *Models* section, you should see that the deploy step created a model from the best training job found during the tuning process.

<img src="img/model.PNG" width="800" />

In the *Endpoints* section, you will see that an API endpoint has been created for you. Similarly, its corresponding configuration can be seen in the *Endpoint configurations* section.

<img src="img/deploy.PNG" width="800" />

Now that we have an API endpoint, we of course want to test it! Let's run our test set through the API endpoint, letting it predict if a customer will subscribe to a term deposit (1) or not (0). We will generate a confusion matrix to see how well the model performs.

<div class="alert alert-block alert-info"><b>Note:</b> For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.</div>

Now, we'll use a simple function to:
* Loop over our test dataset
* Split it into mini-batches of rows
* Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
* Retrieve mini-batch predictions by invoking the XGBoost endpoint
* Collect predictions and convert from the CSV output our model provides into a NumPy array

In [None]:
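def predict(data, rows=500):
    """Send `data` to the endpoint in mini-batches and return all predictions as a 1-D NumPy array."""
    # Split into roughly equal mini-batches of at most `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # The predictor serializes each batch to CSV; the response is CSV-encoded bytes
        predictions = ','.join([predictions, my_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma left by the first join, then parse the scores
    return np.fromstring(predictions[1:], sep=',')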

predictions = predict(test_data.drop(['y_yes'], axis=1).to_numpy())
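
The batch count in `predict` is `int(n / rows + 1)`, and `np.array_split` then divides the rows as evenly as possible across that many batches. The sketch below reproduces the resulting batch sizes in plain Python so you can see them without calling the endpoint; the helper name `batch_sizes` and the row counts are illustrative assumptions.

```python
def batch_sizes(n_rows, rows=500):
    """Return the mini-batch sizes produced by the splitting rule used in predict()."""
    n_batches = int(n_rows / float(rows) + 1)
    # Same distribution as np.array_split: the first `extra` batches get one more row
    base, extra = divmod(n_rows, n_batches)
    return [base + 1] * extra + [base] * (n_batches - extra)


for n in (100, 1000, 1200):
    sizes = batch_sizes(n)
    print(n, len(sizes), sizes)
```

Note that a row count that is an exact multiple of `rows` still gets one extra batch (1000 rows are split into three batches, not two). This is harmless and guarantees that no batch exceeds `rows` rows.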

Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [None]:
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

These results should be the same as the results we received from the batch transform step. If you are up for the challenge, see if you can improve these results by playing with the code provided in this notebook.
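
One way to check that claim is to compare the two sets of scores element by element with a small tolerance, since CSV formatting can introduce tiny rounding differences. This is a minimal sketch; the function name, the tolerance, and the sample values are assumptions, and in the notebook you would pass in the arrays from the batch transform and endpoint steps (saved under separate variable names).

```python
import math


def predictions_match(first, second, tolerance=1e-6):
    """Return True if both sequences have equal length and agree within `tolerance`."""
    if len(first) != len(second):
        return False
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance)
        for a, b in zip(first, second)
    )


# Made-up scores standing in for the batch transform and endpoint outputs
batch_scores = [0.0312, 0.8745, 0.1120]
endpoint_scores = [0.0312, 0.87450001, 0.1120]
print(predictions_match(batch_scores, endpoint_scores))
```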

## Conclusion <a class="anchor" id="conclusion"></a>

In this notebook, you learned how to use SageMaker in an end-to-end machine learning workflow. SageMaker is a powerful tool, with many additional options not covered in this workshop, so we recommend reading the [documentation](https://docs.aws.amazon.com/sagemaker/index.html) to find out more. There are also plenty of [detailed blogs](https://aws.amazon.com/blogs/?filtered-posts.q=sagemaker&filtered-posts.q_operator=AND) which provide advice and ideas on using SageMaker in your projects.

**IMPORTANT: Don't forget to clean up the resources created during this workshop! Otherwise, you will continue to incur costs.**
Always go through the following steps when you're finished:
1. Run the cell below to delete the model, the endpoint configuration, and the endpoint itself.
1. Close the notebook through File -> Close and Halt
1. Stop the notebook instance by selecting the instance in the *Notebook instances* section of the console, then selecting Actions -> Stop. Stopping the instance means you are no longer paying for the underlying EC2 server, but the notebook information is still persisted in storage (EBS), so you do continue to pay for storage.
1. When you are completely finished experimenting with the notebook and you are happy for it to be completely deleted, select the instance in the *Notebook instances* section of the console, then select Actions -> Delete. You will no longer incur any costs for the notebook.

In [None]:
my_predictor.delete_model()
my_predictor.delete_endpoint(delete_endpoint_config=True)