<h1>Introduction</h1>

This notebook demonstrates the use of Amazon SageMaker’s built-in SKLearn container to build a basic binary classification model for a predictive maintenance use-case.

The implementation is provided for educational purposes only. It deliberately omits several optimizations to keep the code simple and easy to follow during a lab.

Let's start by importing some libraries and choosing the AWS Region and IAM role we will use.
You also need to replace the username placeholder: the username is the prefix of the name of the bucket that will contain the wind turbine training data file.

In [None]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)

# Replace username placeholder.
username = '[username]'
bucket_name = '{0}-sm-workshop-lux'.format(username)
prefix = '04'

<h2>Data Preparation</h2>

We first copy the dataset from the public S3 bucket that stores it to your own bucket, and then download it to the notebook instance. After running the cell below, you can optionally check through the Jupyter notebook file browser that the file was downloaded to the notebook instance.

In [None]:
import boto3

s3 = boto3.resource('s3')

copy_source = {
    'Bucket': 'gianpo-public',
    'Key': 'windturbine_data.csv'
}

file_name = 'windturbine_data.csv'
file_key = '{0}/data/{1}'.format(prefix, file_name)
s3.Bucket(bucket_name).copy(copy_source, file_key)
s3.Bucket(bucket_name).download_file(file_key, file_name)

In [None]:
import pandas
import numpy

df = pandas.read_csv('windturbine_data.csv')
df.head(10)

Let's display some descriptive statistics for this dataset.

In [None]:
df.describe()

In [None]:
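# Rows without a breakdown (label 0) are the negative class.
df_ok = df[df['breakdown'] == 0]
print('Number of negative examples (no breakdown): ' + str(df_ok.shape[0]))

# Rows with a breakdown (label 1) are the positive class.
df_nok = df[df['breakdown'] == 1]
print('Number of positive examples (breakdown): ' + str(df_nok.shape[0]))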
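
The raw counts above are easier to interpret as proportions, since a strongly imbalanced dataset affects how accuracy should be read. The following is a minimal sketch, not part of the original lab: `class_balance` is a hypothetical helper, and the toy labels stand in for the `breakdown` column.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples per label (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only; in the notebook you could pass df['breakdown'].
toy_labels = [0, 0, 0, 1]
balance = class_balance(toy_labels)
print(balance)
```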

We now move the target variable to the first column for convenience (it is the last column in the input data), and then split the data into training and validation files (80/20).

In [None]:
target_column = df['breakdown']
df.drop(labels=['breakdown'], axis=1, inplace=True)
df.insert(0, 'breakdown', target_column)

train_set = df[:800000]
val_set = df[800000:]

train_set.to_csv('windturbine_data_train.csv', header=False, index=False)
val_set.to_csv('windturbine_data_val.csv', header=False, index=False)
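
The cell above hard-codes the split at row 800000, which only yields an 80/20 split when the dataset has exactly one million rows. As a minimal sketch, assuming you prefer to derive the split point from the data size, the hypothetical helper `split_train_val` below computes it from a fraction. Like the original cell, it keeps the row order and does not shuffle.

```python
def split_train_val(rows, train_fraction=0.8):
    """Split a sliceable collection into a leading training part and a
    trailing validation part (hypothetical helper, no shuffling)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError('train_fraction must be strictly between 0 and 1')
    split_point = int(len(rows) * train_fraction)
    return rows[:split_point], rows[split_point:]


# Toy usage with a list; a pandas DataFrame supports the same len() and slicing.
sample_rows = list(range(10))
train_part, val_part = split_train_val(sample_rows)
print(len(train_part), len(val_part))
```

In the notebook this could be called as `train_set, val_set = split_train_val(df)`.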

We now upload the transformed files back to S3, since that is where Amazon SageMaker expects to find the training data.

In [None]:
import boto3

s3 = boto3.resource('s3')
target_bucket = s3.Bucket(bucket_name)

with open('windturbine_data_train.csv', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/data/train/windturbine_data_train.csv'.format(prefix))
    
with open('windturbine_data_val.csv', 'rb') as data:
    target_bucket.upload_fileobj(data, '{0}/data/val/windturbine_data_val.csv'.format(prefix))


<h2>Model Training</h2>

Next, we train the model using the Amazon SageMaker built-in SKLearn container. First, let's have a look at the script that defines our model.

In [None]:
!pygmentize 'pred_main_sklearn_script.py'
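
The content of `pred_main_sklearn_script.py` is not reproduced here, so the following is only a sketch of how a script-mode entry point commonly reads its configuration, not necessarily what the lab script does. SageMaker passes hyperparameters to the script as command-line arguments and exposes the model and data channel directories through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The function name `parse_args` and the argument names `--model-dir`, `--train`, and `--val` are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided directories (sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --max_depth 3.
    parser.add_argument('--max_leaf_nodes', type=int, default=None)
    parser.add_argument('--max_depth', type=int, default=None)
    # Directories default to environment variables set inside the container;
    # outside the container they are simply None.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--val', type=str,
                        default=os.environ.get('SM_CHANNEL_VAL'))
    return parser.parse_args(argv)


# Simulate the arguments a training job would pass for this lab's hyperparameters.
args = parse_args(['--max_leaf_nodes', '5', '--max_depth', '3'])
print(args.max_leaf_nodes, args.max_depth)
```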

We are now ready to run the training using the SKLearn estimator object of the SageMaker Python SDK.

In [None]:
from sagemaker.sklearn.estimator import SKLearn

output_location = 's3://{0}/{1}/output'.format(bucket_name, prefix)
code_location = 's3://{0}/{1}/code'.format(bucket_name, prefix)

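# Note: train_instance_count and train_instance_type are the SageMaker Python SDK v1
# parameter names; SDK v2 renamed them to instance_count and instance_type.
est = SKLearn(
    entry_point='pred_main_sklearn_script.py',
    role=role,
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge",
    output_path=output_location,
    base_job_name='pred-main-skl-{0}'.format(username),
    code_location=code_location,
    hyperparameters={'max_leaf_nodes': 5, 'max_depth': 3})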

inputs = {'train': 's3://{0}/{1}/data/train/'.format(bucket_name, prefix),
          'val': 's3://{0}/{1}/data/val/'.format(bucket_name, prefix)}

est.fit(inputs)
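
Each key of the `inputs` dictionary names a data channel, and each value is the S3 prefix under which the corresponding file was uploaded earlier. The sketch below shows one way to build those URIs from a single template so that channel names and upload locations cannot drift apart. `channel_uri` and `build_inputs` are hypothetical helpers, and the bucket name is a made-up example.

```python
def channel_uri(bucket, prefix, channel):
    """Return the S3 prefix holding the data for one channel (hypothetical helper)."""
    return 's3://{0}/{1}/data/{2}/'.format(bucket, prefix, channel)


def build_inputs(bucket, prefix, channels=('train', 'val')):
    """Map each channel name to its S3 prefix, mirroring the inputs dictionary."""
    return {name: channel_uri(bucket, prefix, name) for name in channels}


# Example bucket name for illustration only.
example_inputs = build_inputs('example-sm-workshop-lux', '04')
print(example_inputs)
```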

<h2>Model Deployment</h2>

Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (inferences) from the model dynamically.

In [None]:
import time

endpoint_name = 'pred-main-skl-{0}-'.format(username) + str(int(time.time()))
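# Pass the generated name so the endpoint is actually created with it.
pred = est.deploy(initial_instance_count=1,
                  instance_type='ml.m5.xlarge',
                  endpoint_name=endpoint_name)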

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer
from sagemaker.predictor import RealTimePredictor

# Uncomment the following line to connect to an existing endpoint.
# pred = RealTimePredictor('[endpoint-name]')

test_values = [[6,56,61,49,28,82,35,7,61,6]]
result = pred.predict(test_values)
print(result)

test_values = [[9,20,56,39,15,38,38,10,30,5]]
result = pred.predict(test_values)
print(result)

<h2>Cleanup</h2>

Once we have completed the experimentation, we can delete the real-time endpoint to avoid incurring unexpected charges.

In [None]:
pred.delete_endpoint()