<h1>Introduction</h1>

This notebook demonstrates the use of Amazon SageMaker and SKLearn to pre-process a purpose-built wind turbine dataset to simulate a predictive maintenance use-case.

The implementation is provided for educational purposes only. It omits certain optimizations to keep it simple and easy to follow during a lab.

Let's start by importing some libraries and choosing the AWS Region and IAM role we will use.
We also need to replace the username placeholder so that bucket_name points to the bucket that will hold the wind turbine training data file.

In [None]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)

# Replace username placeholder.
username = '[username]'
bucket_name = '{0}-sm-workshop-lux'.format(username)
prefix = '05'

<h2>Data Exploration</h2>

We first copy the dataset from the public S3 bucket to your bucket, and then download it to the notebook instance. After running the cell below, you can optionally check through the Jupyter notebook file browser that the file was downloaded to the notebook instance.

In [None]:
import boto3

s3 = boto3.resource('s3')

copy_source = {
    'Bucket': 'gianpo-public',
    'Key': 'windturbine_raw_data.csv'
}

file_name = 'windturbine_raw_data.csv'
file_key = '{0}/data/{1}'.format(prefix, file_name)
s3.Bucket(bucket_name).copy(copy_source, file_key)
s3.Bucket(bucket_name).download_file(file_key, file_name)
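
As an alternative to the file browser, the download can also be checked from code. The following is a minimal sketch that uses only the Python standard library; the helper name downloaded_size is hypothetical and not part of the workshop material.

```python
import os


def downloaded_size(path):
    """Return the file size in bytes, or None if the file does not exist."""
    if not os.path.isfile(path):
        return None
    return os.path.getsize(path)


# File name used by the download cell above.
size = downloaded_size('windturbine_raw_data.csv')
print('File not found' if size is None else 'Downloaded {0} bytes'.format(size))
```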

In [None]:
import pandas

df = pandas.read_csv('windturbine_raw_data.csv', header=None)
df.columns = ['turbine_id', 'turbine_type', 'wind_speed', 'RPM_blade', 'oil_temperature', 'oil_level', 'temperature', 
              'humidity', 'vibrations_frequency', 'pressure', 'wind_direction', 'breakdown']
df.head(10)

In [None]:
df.dtypes

Let's display some descriptive statistics for this dataset.

In [None]:
df.describe()

In [None]:
# Positive examples are the rows where a breakdown occurred.
df_breakdown = df[df['breakdown'] == 'yes']
print('Number of positive examples: ' + str(df_breakdown.shape[0]))

df_no_breakdown = df[df['breakdown'] == 'no']
print('Number of negative examples: ' + str(df_no_breakdown.shape[0]))

In [None]:
df.isnull().sum()

In [None]:
# List the turbines whose turbine_type is missing.
df.loc[df.turbine_type.isnull(), 'turbine_id'].unique()

Let's summarize our findings:
<ul>
    <li><b>turbine_id</b> is a string identifier that we choose to keep as a model feature, so we need to encode it.</li>
    <li><b>turbine_type</b> is a categorical attribute with some missing values. More specifically, all values for turbine TID006 are missing. In this specific case we can choose to replace the missing values with a constant.</li>
    <li><b>oil_temperature</b> is a numeric attribute with some missing values.</li>
    <li><b>wind_direction</b> is a categorical string attribute that we need to encode.</li>
    <li><b>breakdown</b> is our target variable, which we need to encode.</li>
</ul>
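
To make these findings concrete, here is a minimal pandas sketch of the kind of transformations they imply. It is not the workshop's preprocessing script: the sample rows (including the turbine type 'HAWT'), the fill constant 'unknown', the choice of the median for oil_temperature, and one-hot encoding via get_dummies are all assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample; the values are NOT taken from the workshop dataset.
sample = pd.DataFrame({
    'turbine_id': ['TID001', 'TID006', 'TID001', 'TID006'],
    'turbine_type': ['HAWT', None, 'HAWT', None],
    'oil_temperature': [35.0, None, 39.0, 41.0],
    'wind_direction': ['N', 'E', 'N', 'SW'],
    'breakdown': ['no', 'yes', 'no', 'yes'],
})


def preprocess(frame):
    """Apply illustrative imputation and encoding steps to a copy of frame."""
    out = frame.copy()
    # Replace missing turbine types with a constant (assumed value).
    out['turbine_type'] = out['turbine_type'].fillna('unknown')
    # Impute missing oil temperatures with the column median (assumed strategy).
    out['oil_temperature'] = out['oil_temperature'].fillna(
        out['oil_temperature'].median())
    # Encode the target variable as 0/1.
    out['breakdown'] = out['breakdown'].map({'no': 0, 'yes': 1})
    # One-hot encode the string and categorical attributes.
    return pd.get_dummies(
        out, columns=['turbine_id', 'turbine_type', 'wind_direction'])


processed = preprocess(sample)
print(processed.head())
```

The script used in the next section may make different choices, for example by using scikit-learn transformers so that the same steps can be reapplied at inference time.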

<h2>Data Preprocessing</h2>

Let's preprocess our data. We will use the Amazon SageMaker built-in SKLearn container to do this, with a script as the entry point. The script is very similar to one you might run outside of SageMaker, but it can read useful properties of the SageMaker environment from environment variables.
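
As a rough illustration of how an entry-point script can pick up those environment variables, consider the sketch below. It is not the workshop script. SM_MODEL_DIR, SM_OUTPUT_DATA_DIR and SM_CHANNEL_PREP are variables set by the SageMaker container (the last one corresponds to an input channel named 'prep'), while the function name and argument names are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Read SageMaker locations from the environment, with CLI overrides."""
    parser = argparse.ArgumentParser()
    # Directory whose contents are saved as the model artifact.
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR'))
    # Directory for any additional output data.
    parser.add_argument('--output-data-dir',
                        default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    # Local directory holding the data of the 'prep' input channel.
    parser.add_argument('--prep', default=os.environ.get('SM_CHANNEL_PREP'))
    return parser.parse_args(argv)


# An empty list is passed so the sketch also runs inside a notebook;
# a real script would call parse_args() to read sys.argv instead.
args = parse_args([])
print(args.model_dir, args.output_data_dir, args.prep)
```

Outside of a SageMaker container these variables are usually unset, so the defaults are None unless the paths are passed on the command line.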

In [None]:
!pygmentize '1-predmain-expprep-sklearn-script.py'

In [None]:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

entry_point = '1-predmain-expprep-sklearn-script.py'
output_location = 's3://{0}/{1}/output'.format(bucket_name, prefix)
code_location = 's3://{0}/{1}/code'.format(bucket_name, prefix)

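# Note: the argument names below follow version 1 of the SageMaker Python SDK.
# In version 2, train_instance_count and train_instance_type were renamed to
# instance_count and instance_type, and s3_input became sagemaker.inputs.TrainingInput.
sklearn_preprocessor = SKLearn(
    entry_point=entry_point,
    role=role,
    output_path=output_location,
    code_location=code_location,
    base_job_name='pred-main-prep-skl-{0}'.format(username),
    train_instance_count=1,
    train_instance_type='ml.m5.2xlarge')

# The channel name 'prep' is how the script locates this input inside the container.
preprocessing_input = sagemaker.session.s3_input(
    's3://{0}/{1}/data/'.format(bucket_name, prefix), content_type='text/csv')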

sklearn_preprocessor.fit({'prep': preprocessing_input})

<h2>Batch Transform</h2>

Once the preprocessing model has been fit, we can use Amazon SageMaker Batch Transform to apply it to our input data.

In [None]:
output_location = 's3://{0}/{1}/data-bt/'.format(bucket_name, prefix)

transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    output_path=output_location,
    assemble_with='Line',
    accept='text/csv')

transformer.transform('s3://{0}/{1}/data/'.format(bucket_name, prefix),
                      content_type='text/csv', split_type='Line')

print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
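
Batch Transform writes one output object per input object, named after the input file with '.out' appended. The sketch below shows how the expected output location could be derived. The helper name transform_output_uri is hypothetical, and the bucket name in the example is a placeholder.

```python
import posixpath


def transform_output_uri(output_location, input_key):
    """Build the S3 URI where Batch Transform is expected to write its result."""
    file_name = posixpath.basename(input_key)
    return output_location.rstrip('/') + '/' + file_name + '.out'


# Placeholder bucket name; substitute the bucket_name used in this notebook.
uri = transform_output_uri('s3://example-bucket/05/data-bt/',
                           '05/data/windturbine_raw_data.csv')
print(uri)
```

The resulting key can then be downloaded with the same boto3 download_file call used earlier for the raw data.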