# Training ML Models using the new SageMaker XGBoost 


A new version of the XGBoost library is available in the Amazon SageMaker Python
SDK. [XGBoost](https://xgboost.readthedocs.io/en/latest/) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. It is widely regarded as one of the best decision tree-based solutions for structured/tabular data. 

Along with upgrading to [XGBoost v0.90](https://github.com/dmlc/xgboost/releases/tag/v0.90), the new release supports two modes of operation. You can use the algorithm either:
- as an Amazon SageMaker built-in algorithm, the same as with the previous 0.72-based version, or 
- as a framework that runs your training scripts much as you would run them in a local environment, similar to how you would use, for example, the TensorFlow deep learning framework. 

Using XGBoost as a framework provides more flexibility than using it as a built-in algorithm, because it lets you incorporate customized pre-processing and post-processing steps into the training script. 

Compared to the previous 0.72-based version, this implementation has a smaller
memory footprint, better logging, improved hyperparameter validation, and an
expanded set of metrics. 

With this new release we also provide model checkpointing, which facilitates adoption of 
[Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html),
a feature that uses Amazon EC2 Spot instances instead of on-demand
instances to run training jobs. You can specify which training jobs use Spot
instances and a stopping condition that specifies how long Amazon SageMaker
waits for a job to run using Amazon EC2 Spot instances.

In this blog post, we show you how to use SageMaker XGBoost to build, train, and 
deploy a regression model using the two modes of operation, and we highlight the savings obtained from using Managed Spot Training.

## Training an XGBoost model

In this example, we use the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone). The dataset has more than 4000 samples of abalone, each described by several physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a tedious and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

We use the [libsvm-converted version](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), in which the nominal feature (Male/Female/Infant) has been converted into a real-valued feature, as XGBoost requires. The goal is to predict the age from the other physical measurements. 
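
To make the libsvm format concrete, here is a minimal sketch of how one record could be read in plain Python. Each line holds a label followed by `index:value` pairs. The helper name `parse_libsvm_line` and the sample record are illustrative assumptions; in the rest of this post XGBoost reads the file directly.

```python
def parse_libsvm_line(line):
    """Split one libsvm-formatted line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Illustrative record made up for this sketch (not copied from the dataset).
example_label, example_features = parse_libsvm_line("15 1:1 2:0.455 3:0.365")
print(example_label, example_features)
```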

## Use XGBoost as a built-in algorithm

The existing SageMaker Python SDK lets you use the `Estimator` class to run the XGBoost container directly. We first ensure that we have the latest SDK on our system.

In [None]:
!pip install -qU awscli boto3 sagemaker

### Set up variables

In [None]:
%%time

import os
import boto3
import re
import sagemaker

# role = sagemaker.get_execution_role()
role = "arn:aws:iam::398557468931:role/service-role/AmazonSageMaker-ExecutionRole-20190626T094141"

region = boto3.Session().region_name
# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
# bucket = sagemaker.Session().default_bucket()
bucket = 'xgboost-examples-1'

prefix = 'sagemaker/DEMO-xgboost-regression'

# customize to your bucket where you have stored the data
bucket_path = 's3://{}'.format(bucket)

### Fetching the dataset

We read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done in situ by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous.

In [None]:
%%time

import urllib.request

# Download the dataset and upload it to the S3 prefix used as the 'train' channel
FILE_DATA = 'abalone'
urllib.request.urlretrieve("https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone", FILE_DATA)
sagemaker.Session().upload_data(FILE_DATA, bucket=bucket, key_prefix=prefix + '/train')
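
The training script shown later in this post can also read a validation channel. As one possible preprocessing step, the sketch below splits the lines of a libsvm file into training and validation subsets. The helper `split_lines`, the 80/20 ratio, the fixed seed, and the synthetic demo lines are all assumptions made for illustration; this step is not part of the workflow above.

```python
import random


def split_lines(lines, validation_fraction=0.2, seed=0):
    """Shuffle the records and return (train_lines, validation_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Synthetic libsvm-style records, used only to exercise the helper.
demo_lines = ["{} 1:{}".format(i, i / 10) for i in range(10)]
train_lines, validation_lines = split_lines(demo_lines)
print(len(train_lines), len(validation_lines))
```

Each subset could then be written to its own file and uploaded under a separate S3 prefix, one per channel.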

### Obtaining the latest XGBoost container

We obtain the new container by specifying the framework version (`0.90-1`). This version specifies the upstream XGBoost framework version (`0.90`) and an additional SageMaker version (`1`). If you have an existing XGBoost workflow based on the previous (`0.72`) container, this would be the only change necessary to get the same workflow working with the new container.
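
As a small illustration of that naming scheme, the sketch below splits a version string into its two parts. The helper `split_framework_version` is hypothetical and exists only to show how the string is composed; the SDK does not require you to parse it.

```python
def split_framework_version(framework_version):
    """Return (upstream XGBoost version, SageMaker revision) from e.g. '0.90-1'."""
    upstream, _, sagemaker_revision = framework_version.partition("-")
    return upstream, sagemaker_revision


upstream_version, sagemaker_revision = split_framework_version("0.90-1")
print(upstream_version, sagemaker_revision)
```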

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(region, 'xgboost', '0.90-1')

### Training the XGBoost model

After setting the training parameters, we start the training job and poll for status until it completes, which in this example takes a few minutes.

To run the training job on SageMaker, we construct a `sagemaker.estimator.Estimator`, which accepts several constructor arguments:

* __image_name__: The URI of the XGBoost container obtained above.
* __role__: Role ARN
* __hyperparameters__: A dictionary of hyperparameters passed to the training job.
* __train_instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: This particular mode does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.

In [None]:
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "silent":"0",
        "objective":"reg:linear",
        "num_round":"50"}

instance_type = ""
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb')
content_type = "libsvm"

If Spot instances are used, the training job can be interrupted, causing it to take longer to start or finish.  If a training job is interrupted, a checkpointed snapshot can be used to resume from a previously saved point and can save training time (and cost). 

To enable checkpointing for Managed Spot Training using SageMaker XGBoost, we need to configure three things:
1. Enable the `train_use_spot_instances` constructor arg, a self-explanatory boolean.
2. Set the `train_max_wait` constructor arg. This is an int arg representing the amount of time, in seconds, you are willing to wait for Spot infrastructure to become available. Some instance types are harder to get at Spot prices and you may have to wait longer. You are not charged for time spent waiting for Spot infrastructure to become available; you're only charged for actual compute time spent once Spot instances have been successfully procured.
3. Set up a `checkpoint_s3_uri` constructor arg. This arg tells SageMaker the S3 location where checkpoints are saved. While not strictly necessary, checkpointing is highly recommended for Managed Spot Training jobs because Spot instances can be interrupted on short notice, and resuming from the last checkpoint ensures you don't lose progress made before the interruption.

Feel free to toggle the `train_use_spot_instances` variable to see the effect of running the same job using regular (a.k.a. "On Demand") infrastructure.

Note that `train_max_wait` can be set if and only if `train_use_spot_instances` is enabled and **must** be greater than or equal to `train_max_run`. 
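
The rule above can be checked before a job is submitted. The following is a minimal sketch of such a check, written directly from the rule as stated in this post. The function `validate_spot_settings` is a hypothetical helper, not part of the SageMaker SDK.

```python
def validate_spot_settings(use_spot, max_run, max_wait):
    """Raise ValueError if the settings break the rule described above."""
    if not use_spot:
        if max_wait is not None:
            raise ValueError("train_max_wait may only be set when Spot training is enabled")
        return
    if max_wait is None:
        raise ValueError("train_max_wait must be set when Spot training is enabled")
    if max_wait < max_run:
        raise ValueError("train_max_wait must be greater than or equal to train_max_run")


# The values used in this post satisfy the rule, so no exception is raised.
validate_spot_settings(use_spot=True, max_run=3600, max_wait=7200)
validate_spot_settings(use_spot=False, max_run=3600, max_wait=None)
```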

In [None]:
from time import gmtime, strftime

job_name = 'DEMO-xgboost-regression-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)

train_use_spot_instances = True
train_max_run = 3600
train_max_wait = 7200 if train_use_spot_instances else None
checkpoint_s3_uri = ('{}/{}/checkpoints/{}'.format(bucket_path, prefix, job_name) if train_use_spot_instances 
                      else None)
print("Checkpoint path:", checkpoint_s3_uri)

estimator = sagemaker.estimator.Estimator(container, 
                                          role, 
                                          hyperparameters=hyperparameters,
                                          train_instance_count=1, 
                                          train_instance_type='ml.m5.2xlarge', 
                                          train_volume_size=5,         # 5 GB 
                                          output_path=output_path, 
                                          sagemaker_session=sagemaker.Session(),
                                          train_use_spot_instances=train_use_spot_instances, 
                                          train_max_run=train_max_run, 
                                          train_max_wait=train_max_wait,
                                          checkpoint_s3_uri=checkpoint_s3_uri
                                         )
train_input = sagemaker.s3_input(s3_data='s3://{}/{}/{}'.format(bucket, prefix, 'train'), content_type='libsvm')
estimator.fit({'train': train_input}, job_name=job_name)

### Savings
Towards the end of the job you should see two lines of output printed:

- `Training seconds: X` : This is the actual compute-time your training job spent
- `Billable seconds: Y` : This is the time you will be billed for after Spot discounting is applied.

If you enabled `train_use_spot_instances`, you should see a notable difference between `X` and `Y`, which reflects the cost savings you get from choosing Managed Spot Training. This should be reflected in an additional line:
- `Managed Spot Training savings: (1-Y/X)*100 %`
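
The savings formula above is simple enough to compute by hand. Here is a minimal sketch, where `spot_savings_percent` is a hypothetical helper and the input numbers are illustrative rather than measured results. The guard against a zero training time is an addition of this sketch.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Return the savings (1 - Y/X) * 100, or 0.0 if no training time was recorded."""
    if training_seconds <= 0:
        return 0.0
    return (1 - billable_seconds / training_seconds) * 100


# Illustrative values only: X = 200 training seconds, Y = 50 billable seconds.
example_savings = spot_savings_percent(200, 50)
print("Managed Spot Training savings: {:.1f}%".format(example_savings))
```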

## Use XGBoost as a framework

An additional mode of operation is to run customizable scripts as part of the training and inference jobs. 

### Entry-point script

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script that we run in this post is provided as the accompanying file (`abalone.py`) [here](?).

Let's look at the main elements of the script. Starting with the main guard, we use a parser to read the hyperparameters passed to our Amazon SageMaker Estimator when creating the training job. These hyperparameters are made available as arguments to our input script. We also parse a number of Amazon SageMaker-specific environment variables to get information about the training environment, such as the location of input data and location where we want to save the model.

```
if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here
    parser.add_argument('--num_round', type=int)
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--eta', type=float, default=0.2)
    parser.add_argument('--gamma', type=float, default=4)
    parser.add_argument('--min_child_weight', type=float, default=6)
    parser.add_argument('--subsample', type=float, default=0.7)
    parser.add_argument('--silent', type=int, default=0)
    parser.add_argument('--objective', type=str, default='reg:squarederror')
    
    # SageMaker-specific arguments. Defaults are read from environment variables.
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    # The validation channel is optional, so a missing variable yields None
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    
    args = parser.parse_args()
    
    train_hp = {
        'max_depth': args.max_depth,
        'eta': args.eta,
        'gamma': args.gamma,
        'min_child_weight': args.min_child_weight,
        'subsample': args.subsample,
        'silent': args.silent,
        'objective': args.objective
    }
```
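
The hyperparameters dictionary defined earlier in this post holds string values, while the script works with typed values. The sketch below illustrates how `argparse` performs that conversion when hyperparameters arrive as `--name value` arguments. The helper `hyperparameters_to_argv` is a hypothetical stand-in for what the training environment does, and the parser covers only a subset of the arguments.

```python
import argparse


def build_parser():
    """Build a parser for a few of the hyperparameters used in this post."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--objective", type=str, default="reg:squarederror")
    return parser


def hyperparameters_to_argv(hyperparameters):
    """Turn {'name': 'value'} into ['--name', 'value', ...] (illustrative only)."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend(["--" + name, str(value)])
    return argv


simulated_argv = hyperparameters_to_argv({"max_depth": "7", "eta": "0.3"})
parsed = build_parser().parse_args(simulated_argv)
print(parsed.max_depth, parsed.eta, parsed.objective)
```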

After all hyperparameters are set up, the data is loaded as an [XGBoost DMatrix](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix). We place the data into a `watchlist` to obtain evaluation metrics for each round as training proceeds. Additionally, we perform checkpointing using the `add_checkpointing` function (described below). 

```
    dtrain = xgb.DMatrix(args.train)
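    # Only build the validation matrix when a validation channel was provided
    dval = xgb.DMatrix(args.validation) if args.validation else None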
    watchlist = [(dtrain, 'train'), (dval, 'validation')] if dval is not None else [(dtrain, 'train')]

    callbacks = []
    prev_checkpoint, n_iterations_prev_run = add_checkpointing(callbacks)
    # If checkpoint is found then we reduce num_boost_round by previously run number of iterations
```

Now we’re ready to train our model. This is as simple as calling `xgboost.train` from the [XGBoost Learning API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training). An advantage of framework mode is that we can replace this call with other parts of the XGBoost API, including the [Scikit-learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) and the [cross-validation functionality](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.cv). 

```
    bst = xgb.train(
        params=train_hp,
        dtrain=dtrain,
        evals=watchlist,
        num_boost_round=(args.num_round - n_iterations_prev_run),
        xgb_model=prev_checkpoint,
        callbacks=callbacks
    )
```
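
The `num_boost_round` argument above is the number of rounds still to run after resuming from a checkpoint. A minimal sketch of that arithmetic follows. The helper `remaining_rounds` is hypothetical, and clamping the result at zero is an extra safeguard of this sketch rather than something the script does.

```python
def remaining_rounds(num_round, completed_rounds):
    """Rounds left to train after resuming; never negative."""
    return max(num_round - completed_rounds, 0)


# Fresh start: all 50 rounds remain. After 20 completed rounds: 30 remain.
print(remaining_rounds(50, 0), remaining_rounds(50, 20))
```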

Finally, we save our model.

```
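    model_location = os.path.join(args.model_dir, 'xgboost-model')
    # Use a context manager so the file is flushed and closed after writing
    with open(model_location, 'wb') as model_file:
        pkl.dump(bst, model_file)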
    logging.info("Stored trained model at {}".format(model_location))
```

Checkpointing in framework mode for SageMaker XGBoost can be performed using two convenient functions: 
- `save_checkpoint`: This returns a callback function that checkpoints the model after each round. It is passed to XGBoost as part of the [`callbacks`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train) argument. 
- `load_checkpoint`: This loads an existing checkpoint so that training resumes from where it previously stopped. 

Both functions take the checkpoint directory as input, which in the example below is set to `/opt/ml/checkpoints`. 

```
CHECKPOINTS_DIR = '/opt/ml/checkpoints'   # default location for Checkpoints
def add_checkpointing(callbacks):
    if os.path.exists(CHECKPOINTS_DIR):
        callbacks.append(save_checkpoint(CHECKPOINTS_DIR))
    # If there are no previous checkpoints, load_checkpoint() returns (None, 0).
    # If there are previous checkpoints:  
    #       load_checkpoint() will return ('/path/to/xgboost-checkpoint', M)
    #       assuming the job was interrupted after iteration M, for example, 
    previous_checkpoint, start_iteration = load_checkpoint(CHECKPOINTS_DIR)
    return previous_checkpoint, start_iteration
```
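
The internals of `load_checkpoint` are not shown in this post. As a rough illustration only, the sketch below shows one way a resume helper could locate the most recent checkpoint in a directory. The function `find_latest_checkpoint`, the `xgboost-checkpoint.<N>` file naming, and the zero-based round numbering are all assumptions of this sketch, not the container's documented behavior.

```python
import os
import re
import tempfile

# Assumed naming scheme: one file per round, suffixed with a zero-based index.
CHECKPOINT_PATTERN = re.compile(r"^xgboost-checkpoint\.(\d+)$")


def find_latest_checkpoint(checkpoint_dir):
    """Return (path, completed_rounds), or (None, 0) if no checkpoint exists."""
    if not os.path.isdir(checkpoint_dir):
        return None, 0
    latest_index, latest_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > latest_index:
            latest_index = int(match.group(1))
            latest_path = os.path.join(checkpoint_dir, name)
    if latest_path is None:
        return None, 0
    return latest_path, latest_index + 1


with tempfile.TemporaryDirectory() as demo_dir:
    empty_result = find_latest_checkpoint(demo_dir)
    for index in (0, 1, 2):
        open(os.path.join(demo_dir, "xgboost-checkpoint.{}".format(index)), "w").close()
    resumed_path, completed_rounds = find_latest_checkpoint(demo_dir)
    resumed_name = os.path.basename(resumed_path)
print(empty_result, resumed_name, completed_rounds)
```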

### Using the SageMaker XGBoost Estimator

Now that we’ve prepared the training data and our script, the `XGBoost` estimator class in the SageMaker Python SDK allows us to run that script as a training job on the Amazon SageMaker managed training infrastructure. We’ll also pass the estimator our IAM role, the type of instance we want to use, and a dictionary of the hyperparameters that we want to pass to our script. 

In [None]:
job_name = 'DEMO-xgboost-regression-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Training job", job_name)
checkpoint_s3_uri = ('{}/{}/checkpoints/{}'.format(bucket_path, prefix, job_name) if train_use_spot_instances 
                      else None)
print("Checkpoint path:", checkpoint_s3_uri)

In [None]:
from sagemaker.session import s3_input
from sagemaker.xgboost.estimator import XGBoost

xgb_script_mode_estimator = XGBoost(
    entry_point="abalone.py",
    hyperparameters=hyperparameters,
    image_name=container,
    role=role, 
    train_instance_count=1,
    train_instance_type="ml.m5.2xlarge",
    framework_version="0.90-1",
    output_path="s3://{}/{}/{}/output".format(bucket, prefix, "xgboost-script-mode"),
    train_use_spot_instances=train_use_spot_instances,
    train_max_run=train_max_run,
    train_max_wait=train_max_wait,
    checkpoint_s3_uri=checkpoint_s3_uri
)

After we’ve constructed our XGBoost estimator, we can fit it by passing in the data we uploaded to Amazon S3. Amazon SageMaker makes sure our data is available in the local filesystem of the training cluster, so our XGBoost script can simply read the data from disk.

In [None]:
xgb_script_mode_estimator.fit({"train": train_input}, 
                               job_name=job_name)

In [None]:
client = boto3.client('sagemaker')
training_job_description = client.describe_training_job(TrainingJobName=job_name)
training_time = training_job_description["TrainingTimeInSeconds"]
bill_time = training_job_description["BillableTimeInSeconds"]
savings = (1 - bill_time/training_time) * 100
print("Training Time = {} seconds, Billable Time = {} seconds, Savings = {:.1f}%".format(training_time, bill_time, savings))

## Deploying the XGBoost model

After training, we can use the estimator to create an Amazon SageMaker endpoint – a hosted and managed prediction service that we can use to perform inference.

You can also optionally define functions that customize how the input request is deserialized (`input_fn()`), how predictions are serialized (`output_fn()`), and how predictions are made (`predict_fn()`). The defaults work for our current use case, so we don’t need to define them.
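
If you did want to customize these hooks, a sketch might look like the following. The argument names, the CSV and JSON content types, and the `MeanModel` stand-in are assumptions made for illustration. A real `predict_fn` would receive the trained XGBoost model and would typically convert the input to a `DMatrix` first.

```python
import json


def input_fn(request_body, request_content_type):
    """Deserialize a CSV request body into a list of feature rows."""
    if request_content_type != "text/csv":
        raise ValueError("Unsupported content type: {}".format(request_content_type))
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    return [[float(v) for v in line.split(",")]
            for line in request_body.strip().splitlines()]


def predict_fn(input_data, model):
    """Run the model on the deserialized input."""
    return model.predict(input_data)


def output_fn(prediction, accept):
    """Serialize predictions as JSON."""
    if accept != "application/json":
        raise ValueError("Unsupported accept type: {}".format(accept))
    return json.dumps({"predictions": list(prediction)})


class MeanModel:
    """Hypothetical stand-in for a trained model: predicts the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


features = input_fn("1.0,3.0\n2.0,4.0", "text/csv")
response = output_fn(predict_fn(features, MeanModel()), "application/json")
print(response)
```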

In [None]:
predictor = xgb_script_mode_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

In [None]:
import xgboost
test_data = xgboost.DMatrix(FILE_DATA)
predictor.predict(test_data)

After you have finished with this example, remember to delete the prediction endpoint to release the instances associated with it.

In [None]:
xgb_script_mode_estimator.delete_endpoint()

## Conclusion

In this blog post, we described recent upgrades to the Amazon SageMaker built-in XGBoost container. We also showed how to use the updated container in two modes: a built-in algorithm mode and a framework mode. Framework mode lets you use customized scripts for a multitude of workflows and works seamlessly with SageMaker training and deployment capabilities. Finally, we showed how Managed Spot Training can be employed to obtain significant savings on long-running training jobs. 