<h1>Introduction</h1>

This notebook demonstrates Amazon SageMaker’s built-in linear learner algorithm by building a basic binary classification model for a predictive maintenance use case.

The implementation is provided for educational purposes only. It deliberately omits several optimizations to keep the code simple and easy to follow during a lab.

Let's start by importing some libraries and retrieving the AWS Region and IAM role we will use.
We also need to set the username, which is used as the prefix of the name of the bucket that will contain the wind turbine training data file.

In [None]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)

# Replace username placeholder.
username = '[username]'
bucket_name = '{0}-sm-workshop-lux'.format(username)
prefix = '02'

<h2>Data Preparation</h2>

We first copy the dataset from the public S3 bucket that stores it to your bucket, and then download it to the notebook instance. After running the cell below, you can optionally check through the Jupyter notebook file browser that the file was downloaded to the notebook instance.

In [None]:
import boto3

s3 = boto3.resource('s3')

copy_source = {
    'Bucket': 'gianpo-public',
    'Key': 'windturbine_data.csv'
}

file_name = 'windturbine_data.csv'
file_key = '{0}/data/{1}'.format(prefix, file_name)
s3.Bucket(bucket_name).copy(copy_source, file_key)
s3.Bucket(bucket_name).download_file(file_key, file_name)

In [None]:
import pandas
import numpy

df = pandas.read_csv(file_name)
df.head(10)

Let's display some descriptive statistics for this dataset.

In [None]:
df.describe()

In [None]:
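# Rows without a breakdown form the negative class (label 0).
df_ok = df[df['breakdown'] == 0]
print('Number of negative examples: ' + str(df_ok.shape[0]))

# Rows with a breakdown form the positive class (label 1).
df_nok = df[df['breakdown'] == 1]
print('Number of positive examples: ' + str(df_nok.shape[0]))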
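
The two counts above can also be obtained in one step, together with the share of each class, which gives a quick view of how imbalanced the labels are. The following is a minimal sketch on a toy DataFrame; the `wind_speed` column and all values are made up for illustration and are not taken from the wind turbine file.

```python
import pandas as pd

# Toy stand-in for the wind turbine data; column 'wind_speed' and all
# values are hypothetical and only serve the illustration.
toy = pd.DataFrame({
    'wind_speed': [5, 7, 6, 9, 4, 8, 5, 6],
    'breakdown':  [0, 0, 1, 0, 0, 1, 0, 0],
})

# Absolute number of rows per class, indexed by the label value (0 or 1).
class_counts = toy['breakdown'].value_counts().sort_index()

# Fraction of rows per class.
class_share = class_counts / len(toy)

print(class_counts.to_dict())
print(class_share.to_dict())
```

On the real data, the same two lines could be applied to `df['breakdown']`.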

We now split the input file into training and test files (80/20). We also have to reorder the columns: the Amazon SageMaker linear learner algorithm expects the target variable in the first column of CSV input, whereas in the input data it is the last one.

In [None]:
# Move the target column 'breakdown' from the last to the first position.
target_column = df['breakdown']
df.drop(labels=['breakdown'], axis=1, inplace=True)
df.insert(0, 'breakdown', target_column)

train_set = df[:800000]
test_set = df[800000:]

train_set.to_csv('windturbine_data_train.csv', header=False, index=False)
test_set.to_csv('windturbine_data_test.csv', header=False, index=False)
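
The cell above uses a fixed row index, which yields an 80/20 split only if the file has the expected number of rows. As a minimal sketch, the split point could instead be derived from the row count. The toy DataFrame, its `f1` and `f2` columns, and the names `ordered`, `train_toy`, and `test_toy` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy data with the target in the last column, as in the input file.
# Feature names 'f1' and 'f2' are hypothetical.
toy = pd.DataFrame({
    'f1': range(10),
    'f2': range(10, 20),
    'breakdown': [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
})

# Put the target column first without modifying the original frame.
feature_columns = [c for c in toy.columns if c != 'breakdown']
ordered = toy[['breakdown'] + feature_columns]

# Derive the split index from the row count (80% train, 20% test).
split_index = int(len(ordered) * 0.8)
train_toy = ordered.iloc[:split_index]
test_toy = ordered.iloc[split_index:]

print(list(ordered.columns))
print(len(train_toy), len(test_toy))
```

Like the original cell, this sketch splits the rows in file order and does not shuffle them.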

We now upload the transformed files back to S3, since that is where Amazon SageMaker expects to find the training data.

In [None]:
import boto3

s3 = boto3.resource('s3')
target_bucket = s3.Bucket(bucket_name)

with open('windturbine_data_train.csv', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/data/train/windturbine_data_train.csv'.format(prefix))
    
with open('windturbine_data_test.csv', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/data/test/windturbine_data_test.csv'.format(prefix))

<h2>Model Training</h2>

To start training, we need to specify the location of the Docker container image that will be used for training.
Docker registry paths for Amazon algorithms are listed here: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

Alternatively, we can use a utility function of the Amazon SageMaker Python SDK to get the path.

In [None]:
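# get_image_uri belongs to version 1.x of the SageMaker Python SDK;
# version 2.x provides sagemaker.image_uris.retrieve for the same purpose.
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner', repo_version="latest")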
print(container)

We can now start training by specifying the input and output settings and the required hyperparameters.
You can find the list of hyperparameters supported by the linear learner algorithm here: https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html.

You can also try running the following cell multiple times, changing the hyperparameters or other settings such as the number of instances used for training.

In [None]:
import sagemaker

output_location = 's3://{0}/{1}/output'.format(bucket_name, prefix)

est = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m5.2xlarge',
                                    output_path=output_location,
                                    base_job_name='pred-main-ll-{0}'.format(username))

est.set_hyperparameters(feature_dim=10,
                        num_models=1,
                        predictor_type='binary_classifier',
                        mini_batch_size=200,
                        normalize_data=True,
                        normalize_label=False,
                        unbias_data=True,
                        unbias_label=False)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/data/train/'.format(
    bucket_name, prefix), content_type='text/csv')
test_config = sagemaker.session.s3_input('s3://{0}/{1}/data/test/'.format(
    bucket_name, prefix), content_type='text/csv')

est.fit({'train': train_config, 'test': test_config })

<h2>Model Deployment</h2>

Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (inferences) from the model dynamically.

In [None]:
import time

endpoint_name = 'pred-main-ll-{0}-'.format(username) + str(int(time.time()))
pred = est.deploy(initial_instance_count=1, 
                  endpoint_name=endpoint_name,
                  instance_type='ml.m5.xlarge')

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker.predictor import RealTimePredictor

# Uncomment the following line to connect to an existing endpoint.
# pred = RealTimePredictor('[endpoint-name]')

pred.content_type = 'text/csv'
pred.serializer = csv_serializer
pred.deserializer = json_deserializer

test_values = [6,90,61,49,28,82,35,7,61,6]
result = pred.predict(test_values)
print(result)

test_values = [9,20,56,39,15,38,38,10,30,5]
result = pred.predict(test_values)
print(result)
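
The printed results are deserialized JSON. As a minimal sketch, assuming the endpoint returns an object with a `predictions` list whose entries carry `score` and `predicted_label` (the shape documented for the linear learner binary classifier), the values could be extracted as follows. The helper `extract_predictions` and the sample response are hypothetical; the numbers are made up and are not real model output.

```python
# Illustrative response only; these numbers are made up and do not come
# from the deployed model.
sample_result = {
    'predictions': [
        {'score': 0.12, 'predicted_label': 0},
        {'score': 0.87, 'predicted_label': 1},
    ]
}


def extract_predictions(result):
    """Return (label, score) pairs from a linear learner style response.

    Hypothetical helper: assumes each entry has the keys
    'predicted_label' and 'score'.
    """
    return [
        (int(p['predicted_label']), float(p['score']))
        for p in result.get('predictions', [])
    ]


for label, score in extract_predictions(sample_result):
    status = 'breakdown expected' if label == 1 else 'no breakdown expected'
    print('{0} (score={1:.2f})'.format(status, score))
```

With a live endpoint, the same helper could be called on the `result` variable returned by `pred.predict`.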

<h2>Cleanup</h2>

Once we have completed the experimentation, we can delete the real-time endpoint to avoid incurring unexpected charges.

In [None]:
pred.delete_endpoint()