<h1>Model Training</h1>

In this notebook, we will use the Amazon SageMaker built-in Linear Learner algorithm to train a binary classification model, using the preprocessed data generated in step 1.

First, let's take a look at our preprocessed data.

In [None]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)

# Replace username placeholder.
username = '[username]'
bucket_name = '{0}-sm-workshop-lux'.format(username)
prefix = '05'

In [None]:
import boto3

file_name = 'windturbine_raw_data.csv.out'

s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file('{0}/data-bt/{1}'.format(prefix, file_name), file_name)

In [None]:
import pandas
import numpy

df = pandas.read_csv(file_name, header=None)
df.head(10)

Let's split the data into training and test sets, and then copy them back to Amazon S3 so that training can start.

In [None]:
train_set = df[:800000]
test_set = df[800000:]

train_set.to_csv('windturbine_data_train.csv', header=False, index=False)
test_set.to_csv('windturbine_data_test.csv', header=False, index=False)
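
The split above is positional: the first 800,000 rows become the training set and the remainder the test set. If the rows happen to be ordered (by time or by turbine, for example), a shuffled split may give more representative sets. The following is a minimal sketch of that idea, not part of the original workshop; `split_indices`, `train_fraction` and `seed` are hypothetical names chosen for illustration.

```python
import random


def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx): shuffled, disjoint lists of row positions."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


# Small demonstration on 10 row positions.
train_idx, test_idx = split_indices(10, train_fraction=0.8)
print(len(train_idx), len(test_idx))
```

With the DataFrame loaded earlier, the positions could be applied as `df.iloc[train_idx]` and `df.iloc[test_idx]`, with `n_rows=len(df)`.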

In [None]:
import boto3

s3 = boto3.resource('s3')
target_bucket = s3.Bucket(bucket_name)

with open('windturbine_data_train.csv', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/data-bt/train/windturbine_data_train.csv'.format(prefix))
    
with open('windturbine_data_test.csv', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/data-bt/test/windturbine_data_test.csv'.format(prefix))
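
For CSV training data, the SageMaker built-in algorithms expect files without a header row and with the label in the first column, so the number of remaining columns is what the `feature_dim` hyperparameter has to match. As a rough sanity check, and assuming the preprocessed files follow that layout, the column count can be verified locally. `count_feature_columns` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def count_feature_columns(csv_text, label_columns=1):
    """Return the number of feature columns, requiring every row to have the same width."""
    widths = {len(row) for row in csv.reader(io.StringIO(csv_text)) if row}
    if len(widths) != 1:
        raise ValueError("inconsistent column counts: {0}".format(sorted(widths)))
    return widths.pop() - label_columns


# Made-up sample: one label column followed by three feature columns.
sample = "1,0.5,0.1,7.2\n0,0.4,0.3,6.9\n"
print(count_feature_columns(sample))
```

Passing the first few lines of `windturbine_data_train.csv` to this function should return the value later used for `feature_dim`.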

To start training, we need to specify the location of the Docker container image that will be used for training.
The Docker registry paths for the Amazon algorithms are listed here: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

Alternatively, the Amazon SageMaker Python SDK provides a utility function that returns the path for the current region.

In [None]:
import boto3
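# Note: get_image_uri belongs to version 1 of the SageMaker Python SDK;
# version 2 replaces it with sagemaker.image_uris.retrieve.
from sagemaker.amazon.amazon_estimator import get_image_uri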

container = get_image_uri(boto3.Session().region_name, 'linear-learner', repo_version="latest")
print(container)
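
The printed value is an Amazon ECR image URI. In the standard AWS regions these follow the pattern `account.dkr.ecr.region.amazonaws.com/repository:tag`. The sketch below only illustrates that structure: `ecr_image_uri` is a hypothetical helper, and the account ID is a placeholder, not the real registry account for Linear Learner (the real ones are listed on the documentation page linked above).

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Assemble an ECR image URI from its parts (standard AWS regions only)."""
    return "{0}.dkr.ecr.{1}.amazonaws.com/{2}:{3}".format(
        account_id, region, repository, tag)


# '123456789012' is a placeholder account ID used purely for illustration.
print(ecr_image_uri("123456789012", "eu-west-1", "linear-learner"))
```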

We can now start training, by specifying the input and output settings and the required hyperparameters. You can find the list of the supported hyperparameters for the linear learner algorithm here: https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html.

You can also try running the following cell multiple times, changing the hyperparameters or other settings such as the number of instances used for training.

In [None]:
import sagemaker

output_location = 's3://{0}/{1}/output'.format(bucket_name, prefix)

est = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c5.4xlarge',
                                    output_path=output_location,
                                    base_job_name='pred-main-train-ll-{0}'.format(username))

est.set_hyperparameters(feature_dim=28,
                        predictor_type='binary_classifier',
                        mini_batch_size=200,
                        normalize_data=False,
                        normalize_label=False,
                        unbias_data=False,
                        unbias_label=False)

# Each channel points at an S3 prefix; the objects under it are read as CSV.
train_config = sagemaker.session.s3_input('s3://{0}/{1}/data-bt/train/'.format(
    bucket_name, prefix), content_type='text/csv')
test_config = sagemaker.session.s3_input('s3://{0}/{1}/data-bt/test/'.format(
    bucket_name, prefix), content_type='text/csv')

est.fit({'train': train_config, 'test': test_config })
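
When re-running the training cell with different hyperparameters, it can help to enumerate the combinations up front. The following is a minimal sketch of that, not part of the original workshop: `hyperparameter_grid` is a hypothetical helper, and the candidate values for `mini_batch_size` and `learning_rate` are illustrative rather than recommended settings.

```python
import itertools


def hyperparameter_grid(fixed, search_space):
    """Yield one dict per combination of the values in search_space, merged with fixed."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        params = dict(fixed)
        params.update(zip(names, values))
        yield params


# Settings kept constant, taken from the training cell above.
fixed = {'feature_dim': 28, 'predictor_type': 'binary_classifier'}
# Illustrative candidate values only.
search_space = {'mini_batch_size': [200, 1000], 'learning_rate': [0.01, 0.1]}

for params in hyperparameter_grid(fixed, search_space):
    print(params)
```

Each dictionary could then be applied with `est.set_hyperparameters(**params)` before calling `est.fit(...)`. Keep in mind that every call to `fit` launches a separate training job, which is billed separately.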