<h1>Model Training</h1>

In this notebook, we will use the Amazon SageMaker built-in XGBoost algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train a simple binary classification model, using the pre-processed data generated in the previous step by the AWS Glue job. Let's define some variables first.

<span style="color: red">Please replace [your-initials] in the bucket_name variable defined in the next cell with your own initials.</span>

In [1]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)

# replace [your-initials] according to the bucket name you have defined.
bucket_name = 'endtoendml-workshop-[your-initials]'
key_prefix = 'data/preprocessed'

eu-west-1
arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-endtoendml
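
Since a forgotten placeholder in bucket_name only shows up later as an S3 error, a quick check right after the cell above can help. The following is a minimal sketch, not part of the original workshop: the helper name `check_bucket_name` and the sample value are hypothetical, and the pattern encodes only the basic S3 bucket naming rules (3 to 63 characters, lowercase letters, digits, dots and hyphens, starting and ending with a letter or digit).

```python
import re

# Basic S3 bucket naming rules: 3-63 characters, lowercase letters, digits,
# dots and hyphens only, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$')


def check_bucket_name(name):
    """Return the name unchanged, or raise ValueError if it looks wrong."""
    if '[' in name or ']' in name:
        raise ValueError('Replace [your-initials] in bucket_name: {0}'.format(name))
    if not _BUCKET_RE.match(name):
        raise ValueError('Not a valid S3 bucket name: {0}'.format(name))
    return name


# Hypothetical value, for illustration only.
print(check_bucket_name('endtoendml-workshop-abc'))
```

In the notebook, calling `check_bucket_name(bucket_name)` before any S3 access would stop early with a readable message if the placeholder is still there.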


Now we take a look at our preprocessed data, which have already been split into training and validation sets by the AWS Glue job. In order to do that, we first download the preprocessed training file from Amazon S3 to the local notebook file system.

In [2]:
import boto3

# The file name was set by the AWS Glue job in the previous notebook.
train_file_name = 'part-00000'
train_file_key = '{0}/train/{1}'.format(key_prefix, train_file_name)

s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file(train_file_key, train_file_name)
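
The cell above hardcodes a single file name. Spark-based jobs such as AWS Glue typically write their output as one or more `part-` files, so if the job produced several of them, the keys could be selected from a listing instead. This is a sketch under that assumption: the helper `select_part_files` and the sample keys are hypothetical, and the listing itself is only indicated in a comment.

```python
import posixpath


def select_part_files(keys, prefix):
    """Return the keys under `prefix` whose file name starts with 'part-'."""
    prefix = prefix.rstrip('/') + '/'
    return sorted(
        key for key in keys
        if key.startswith(prefix) and posixpath.basename(key).startswith('part-')
    )


# Hypothetical listing. In the notebook, the keys could come from
# (obj.key for obj in s3.Bucket(bucket_name).objects.filter(Prefix=prefix)).
sample_keys = [
    'data/preprocessed/train/_SUCCESS',
    'data/preprocessed/train/part-00000',
    'data/preprocessed/train/part-00001',
    'data/preprocessed/val/part-00000',
]
print(select_part_files(sample_keys, 'data/preprocessed/train'))
```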

After the file has been downloaded, we can use Pandas to read the CSV and display the first 10 rows. Note that the categorical features have been one-hot encoded by the feature engineering steps executed in the previous notebook.

In [4]:
import pandas
import numpy

df = pandas.read_csv(train_file_name, header=None)
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,12.0,39.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,64.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,63.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,13.0,85.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,77.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,15.0,49.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,8.0,81.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,13.0,40.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,9.0,69.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
9,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,22.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
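
For CSV input, the built-in XGBoost algorithm expects a file without a header row and with the label in the first column, which is why the file is read with header=None. Before training a binary classifier it can be useful to look at how the two label values are distributed. The sketch below uses only the standard library and a tiny made-up sample, not the real dataset; the helper name `label_counts` is hypothetical. With the DataFrame above, `df[0].value_counts()` should give the same counts.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file):
    """Count the values in the first column of a headerless CSV file object."""
    reader = csv.reader(csv_file)
    return Counter(float(row[0]) for row in reader if row)


# Tiny made-up sample, not taken from the real dataset.
sample = io.StringIO('0.0,1.0,12.0\n1.0,0.0,4.0\n0.0,0.0,13.0\n')
counts = label_counts(sample)
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print('label={0} count={1} share={2:.2f}'.format(label, n, n / total))
```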


In order to start training, we need to specify the location of the Docker container that will be used for training.
Docker registry paths for the Amazon built-in algorithms are listed here: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

Alternatively, we can use a utility function of the Amazon SageMaker Python SDK to get the path for the current region.

In [5]:
from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version="latest")
print(container)

	get_image_uri(region, 'xgboost', '0.90-1').


685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest
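
The printed path follows the usual Amazon ECR pattern of account ID, region, repository and tag. As a minimal sketch of how such a path is composed (the helper `build_ecr_image_uri` is hypothetical; the account ID is copied from the output above and is only known to apply to eu-west-1, since the SDK looks up a different account per region):

```python
def build_ecr_image_uri(account_id, region, repository, tag):
    """Compose an ECR image URI of the form shown in the cell output."""
    return '{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}'.format(
        account_id, region, repository, tag)


# Values taken from the output above; other regions use other account IDs.
print(build_ecr_image_uri('685385470294', 'eu-west-1', 'xgboost', 'latest'))
```

This is for illustration only; in the notebook, the SDK utility function remains the safer way to obtain the path.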


We can now start training by specifying the input and output settings and the required hyperparameters. You can find the list of hyperparameters supported by the XGBoost algorithm here: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html.

You can also try running the following cell multiple times, changing the hyperparameters or other settings such as the number of instances used for training, since XGBoost training can be distributed across several instances.

In [6]:
import sagemaker

output_location = 's3://{0}/output'.format(bucket_name)

est = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m5.2xlarge',
                                    output_path=output_location,
                                    base_job_name='predmain-train-xgb')

est.set_hyperparameters(objective='reg:logistic',
                        num_round=20)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(
    bucket_name, key_prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(
    bucket_name, key_prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

2019-09-05 23:20:55 Starting - Starting the training job...
2019-09-05 23:20:56 Starting - Launching requested ML instances...
2019-09-05 23:21:52 Starting - Preparing the instances for training......
2019-09-05 23:22:53 Downloading - Downloading input data
2019-09-05 23:22:53 Training - Training image download completed. Training in progress..
Arguments: train
[2019-09-05:23:22:53:INFO] Running standalone xgboost training.
[2019-09-05:23:22:53:INFO] File size need to be processed in the node: 117.53mb. Available memory size in the node: 23674.17mb
[2019-09-05:23:22:53:INFO] Determined delimiter of CSV input is ','
[23:22:53] S3DistributionType set as FullyReplicated
[23:22:54] 799931x28 matrix with 22398068 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-09-05:23:22:54:INFO] Determined delimiter of CSV input is ','
[23:22:54] S3DistributionType set as FullyReplicated
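
To follow the earlier suggestion of rerunning training with different hyperparameters, the combinations to try can be generated up front. The sketch below is one possible way to do that, not the workshop's method: the helper `hyperparameter_grid` is hypothetical, the values are purely illustrative, and the calls on `est` are left as comments because they launch real training jobs. `max_depth`, `eta` and `num_round` are standard XGBoost hyperparameters.

```python
import itertools


def hyperparameter_grid(**options):
    """Yield one dict per combination of the given hyperparameter values."""
    names = sorted(options)
    for values in itertools.product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Illustrative values only; they are not tuned for this dataset.
grid = list(hyperparameter_grid(max_depth=[3, 6], eta=[0.1, 0.3], num_round=[20]))
for params in grid:
    print(params)
    # In the notebook, each combination could be trained with:
    # est.set_hyperparameters(objective='reg:logistic', **params)
    # est.fit({'train': train_config, 'validation': val_config})
```

Each uncommented fit call would start a separate, billable training job, so a small grid is advisable.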

After the training is completed, the serialized model will be saved in the S3 output_location defined above.
You can now move to the next notebook in the **04_deploy_model** folder to see how to use that model for inference.
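
Before moving on, it may help to know where exactly the model ends up. SageMaker training jobs store the model artifact as model.tar.gz in an output folder named after the training job, under the configured output path; after training, the estimator exposes this location as `est.model_data`. The sketch below only illustrates how that path is composed: the helper `model_artifact_uri`, the bucket name and the job name suffix are hypothetical.

```python
def model_artifact_uri(output_path, job_name):
    """Compose the S3 URI where a training job stores its model artifact."""
    return '{0}/{1}/output/model.tar.gz'.format(output_path.rstrip('/'), job_name)


# Hypothetical bucket and job name, for illustration only.
print(model_artifact_uri(
    's3://endtoendml-workshop-abc/output',
    'predmain-train-xgb-2019-09-05-23-20-55-123'))
```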