<h1>Introduction</h1>

This notebook demonstrates the use of Amazon SageMaker and SKLearn to pre-process a purpose-built wind turbine dataset to simulate a predictive maintenance use-case.

The implementation is provided for educational purposes only. It deliberately skips certain optimizations to keep the code simple and easy to follow during a lab.

Let's start by importing some libraries and retrieving the AWS Region and IAM role we will use.
We also need to set bucket_name to the name of the bucket that contains the wind turbine training data file.

In [1]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)

bucket_name = 'gianpo-predictive-maintenance'

eu-west-1
arn:aws:iam::825935527263:role/gianpo-path/SageMaker-Notebook-Role


<h2>Data Exploration</h2>

We first download the dataset from the S3 bucket to the notebook instance. After running the cell below, you can optionally check through the Jupyter notebook file browser that the file was downloaded to the notebook instance.

In [2]:
import boto3

s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file('data/windturbine_raw_data.csv', 'windturbine_raw_data.csv')

In [3]:
import pandas

df = pandas.read_csv('windturbine_raw_data.csv', header=None)
df.columns = ['turbine_id', 'turbine_type', 'wind_speed', 'RPM_blade', 'oil_temperature', 'oil_level', 'temperature', 
              'humidity', 'vibrations_frequency', 'pressure', 'wind_direction', 'breakdown']
df.head(10)

Unnamed: 0,turbine_id,turbine_type,wind_speed,RPM_blade,oil_temperature,oil_level,temperature,humidity,vibrations_frequency,pressure,wind_direction,breakdown
0,TID003,HAWT,80,61,,34,33,26,1,77,E,no
1,TID010,HAWT,85,78,36.0,28,35,43,15,62,NE,yes
2,TID007,HAWT,47,31,31.0,23,46,62,15,32,N,no
3,TID008,VAWT,73,70,38.0,8,17,66,6,80,SW,yes
4,TID003,HAWT,16,23,46.0,9,76,53,14,29,W,no
5,TID001,HAWT,78,71,30.0,11,66,79,1,81,SW,no
6,TID009,HAWT,80,25,37.0,31,40,75,4,56,NW,no
7,TID002,VAWT,59,29,37.0,10,25,83,13,55,SE,no
8,TID009,HAWT,58,16,48.0,10,43,17,4,44,NE,no
9,TID001,HAWT,23,38,31.0,28,26,32,11,75,S,no


In [5]:
df.dtypes

turbine_id               object
turbine_type             object
wind_speed                int64
RPM_blade                 int64
oil_temperature         float64
oil_level                 int64
temperature               int64
humidity                  int64
vibrations_frequency      int64
pressure                  int64
wind_direction           object
breakdown                object
dtype: object

Let's display some descriptive statistics for this dataset.

In [6]:
df.describe()

Unnamed: 0,wind_speed,RPM_blade,oil_temperature,oil_level,temperature,humidity,vibrations_frequency,pressure
count,1000000.0,1000000.0,961703.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0
mean,49.990414,50.010095,37.435021,19.998577,50.02357,50.014965,7.994064,49.98596
std,20.486019,20.498963,7.640262,8.944855,20.496239,20.483369,4.319314,20.501076
min,15.0,15.0,25.0,5.0,15.0,15.0,1.0,15.0
25%,32.0,32.0,31.0,12.0,32.0,32.0,4.0,32.0
50%,50.0,50.0,37.0,20.0,50.0,50.0,8.0,50.0
75%,68.0,68.0,44.0,28.0,68.0,68.0,12.0,68.0
max,85.0,85.0,50.0,35.0,85.0,85.0,15.0,85.0


In [7]:
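# Positive examples are the rows where a breakdown occurred.
df_pos = df[df['breakdown'] == 'yes']
print('Number of positive examples: ' + str(df_pos.shape[0]))

# Negative examples are the rows without a breakdown.
df_neg = df[df['breakdown'] == 'no']
print('Number of negative examples: ' + str(df_neg.shape[0]))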

Number of positive examples: 136579
Number of negative examples: 863421
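
The two counts show a clear class imbalance. As a quick worked check (the counts are copied from the output above; the variable names are illustrative), the share of breakdowns and the negative-to-positive ratio can be computed directly:

```python
# Counts copied from the cell output above.
n_positive = 136579  # rows with breakdown == 'yes'
n_negative = 863421  # rows with breakdown == 'no'

n_total = n_positive + n_negative
positive_rate = n_positive / n_total
neg_to_pos_ratio = n_negative / n_positive

print('Total rows: {}'.format(n_total))
print('Share of breakdowns: {:.1%}'.format(positive_rate))
print('Negatives per positive: {:.2f}'.format(neg_to_pos_ratio))
```

Since only about 14% of the rows are breakdowns, a model that always predicts 'no' would already be right about 86% of the time, which is worth keeping in mind when judging accuracy figures.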


In [8]:
df.isnull().sum()

turbine_id                   0
turbine_type            100107
wind_speed                   0
RPM_blade                    0
oil_temperature          38297
oil_level                    0
temperature                  0
humidity                     0
vibrations_frequency         0
pressure                     0
wind_direction               0
breakdown                    0
dtype: int64
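
To put these counts in proportion, here is a short worked calculation. The numbers are copied from the outputs above and the variable names are illustrative. It also cross-checks the oil_temperature count reported by df.describe(), which excludes missing values.

```python
# Row count and missing-value counts copied from the outputs above.
n_rows = 1000000
missing = {'turbine_type': 100107, 'oil_temperature': 38297}

# Share of rows with a missing value, per column.
for column, n_missing in missing.items():
    print('{}: {:.2%} missing'.format(column, n_missing / n_rows))

# describe() counts only non-null values, so this should match its 'count' row.
oil_temperature_count = n_rows - missing['oil_temperature']
print('Non-null oil_temperature values: {}'.format(oil_temperature_count))
```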

In [9]:
df.where(df.turbine_type.isnull()).turbine_id.unique()

array([nan, 'TID006'], dtype=object)
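
df.where keeps the shape of the frame and fills the non-matching rows with NaN, which is why nan shows up in the array above. A sketch of a more direct check uses boolean indexing, then tests per turbine whether every turbine_type value is missing. It runs on a small made-up frame, not on rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from the real dataset.
toy = pd.DataFrame({
    'turbine_id': ['TID001', 'TID006', 'TID003', 'TID006', 'TID001'],
    'turbine_type': ['HAWT', np.nan, 'VAWT', np.nan, 'HAWT'],
})

# Boolean indexing selects only the rows with a missing turbine_type.
ids_with_missing_type = toy.loc[toy['turbine_type'].isnull(), 'turbine_id'].unique()
print(ids_with_missing_type)

# For each turbine, is turbine_type missing in every one of its rows?
all_missing = toy['turbine_type'].isnull().groupby(toy['turbine_id']).all()
print(all_missing)
```

Running the same two statements against df instead of toy would show whether TID006 is the only affected turbine and whether all of its rows lack a turbine_type.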

Let's summarize our findings:
<ul>
    <li><b>turbine_id</b> is a string identifier that we choose to keep as a model feature, so we need to encode it.</li>
    <li><b>turbine_type</b> is a categorical attribute with some missing values. More specifically, the missing values all belong to turbine TID006, so in this case we can replace them with a constant.</li>
    <li><b>oil_temperature</b> is a numeric attribute and has some missing values.</li>
    <li><b>wind_direction</b> is a categorical string attribute that we need to encode.</li>
    <li><b>breakdown</b> is our target variable, which we need to encode.</li>
</ul>
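
The findings above map naturally onto a scikit-learn ColumnTransformer. The following is a minimal sketch on made-up rows, not the preprocessing script used in this notebook. The median strategy for oil_temperature, the 'UNKNOWN' fill value, and the choice of one-hot encoding are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'turbine_id': ['TID001', 'TID006', 'TID003', 'TID006'],
    'turbine_type': ['HAWT', np.nan, 'VAWT', np.nan],
    'oil_temperature': [36.0, np.nan, 31.0, 38.0],
    'wind_direction': ['E', 'NE', 'N', 'E'],
    'breakdown': ['no', 'yes', 'no', 'no'],
})

numeric_features = ['oil_temperature']
categorical_features = ['turbine_id', 'turbine_type', 'wind_direction']

# Assumption: fill numeric gaps with the column median.
numeric_transformer = SimpleImputer(strategy='median')

# Assumption: fill missing categories with the constant 'UNKNOWN',
# then one-hot encode every categorical column.
categorical_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='UNKNOWN')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(toy)

# Encode the target as 1 for a breakdown and 0 otherwise.
y = (toy['breakdown'] == 'yes').astype(int)

print(X.shape)
print(y.tolist())
```

Keeping the imputers and encoders inside one fitted object means the same transformation can later be applied unchanged to new data.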

<h2>Data Preprocessing</h2>

Let's now preprocess our data. We will use the Amazon SageMaker built-in SKLearn container, with a script as the entry point. The script is very similar to one you might run outside of SageMaker, but it can read useful properties of the SageMaker environment from environment variables.
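
As a hedged illustration of how an entry-point script can read those environment variables, the sketch below parses SM_MODEL_DIR, SM_OUTPUT_DATA_DIR and SM_CHANNEL_PREP. The channel variable name assumes an input channel called 'prep', matching the key passed to fit later in this notebook. The local fallback paths are hypothetical, so the snippet also runs outside SageMaker. It is not the notebook's actual script.

```python
import argparse
import os


def parse_args(argv=None):
    """Read SageMaker-provided locations, falling back to local folders."""
    parser = argparse.ArgumentParser()
    # SM_MODEL_DIR: where the script should save the fitted artifact.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    # SM_OUTPUT_DATA_DIR: where any additional output files should go.
    parser.add_argument('--output-data-dir', type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR', './output'))
    # SM_CHANNEL_PREP: local copy of the input channel named 'prep'.
    parser.add_argument('--prep', type=str,
                        default=os.environ.get('SM_CHANNEL_PREP', './data'))
    return parser.parse_args(argv)


args = parse_args([])
print(args.model_dir, args.output_data_dir, args.prep)
```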

In [10]:
!pygmentize '1-predmain-expprep-sklearn-script.py'

from __future__ import print_function

import time
import sys
from io import StringIO
import os
import shutil

import argparse
import csv
import json
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.externals import joblib
from sklearn.impute import SimpleImpute

In [11]:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

entry_point = 'predmain-expprep-sklearn-script.py'
output_location = 's3://{0}/output'.format(bucket_name)
code_location = 's3://{0}/code'.format(bucket_name)

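# Note: train_instance_count, train_instance_type and s3_input are SageMaker Python SDK v1 names.
# SDK v2 renames them to instance_count, instance_type and sagemaker.inputs.TrainingInput,
# and also requires a framework_version argument for SKLearn.
sklearn_preprocessor = SKLearn(
    entry_point=entry_point,
    role=role,
    output_path=output_location,
    code_location=code_location,
    base_job_name='predmain-expprep-sklearn',
    train_instance_count=1,
    train_instance_type="ml.c5.2xlarge")

# The raw CSV file is passed to the script as an input channel (named 'prep' in the fit call).
preprocessing_input = sagemaker.session.s3_input('s3://{0}/data/windturbine_raw_data.csv'.format(bucket_name), content_type='text/csv')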

sklearn_preprocessor.fit({'prep': preprocessing_input})

INFO:sagemaker:Creating training-job with name: predmain-expprep-sklearn-2019-05-02-18-50-10-249


2019-05-02 18:50:10 Starting - Starting the training job...
2019-05-02 18:50:12 Starting - Launching requested ML instances......
2019-05-02 18:51:12 Starting - Preparing the instances for training......
2019-05-02 18:52:36 Downloading - Downloading input data
2019-05-02 18:52:36 Training - Training image download completed. Training in progress.
2019-05-02 18:52:36 Uploading - Uploading generated training model
2019-05-02 18:52:26,697 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-05-02 18:52:26,700 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-05-02 18:52:26,711 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-05-02 18:52:26,944 sagemaker-containers INFO     Module predmain-expprep-sklearn-script does not provide a setup.py.
Generating setup.py
2019-05-02 18:52:26,944 sagemaker-containers INFO     Generating setup.cfg


<h2>Batch Transform</h2>

Once the preprocessor has been fit, we can use Amazon SageMaker Batch Transform to apply it to our input data.
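
For Batch Transform to work, the entry-point script has to tell the SKLearn container how to load the fitted object and how to handle each request. The container looks for the functions model_fn, input_fn, predict_fn and output_fn. The sketch below shows one possible shape for these hooks. The artifact name 'model.joblib', the headerless CSV payload and the plain-string CSV response are assumptions, and this is not the script used in this notebook.

```python
import csv
import os
from io import StringIO

import joblib
import pandas as pd


def model_fn(model_dir):
    """Load the fitted preprocessor saved by the training step."""
    # Assumption: the training script saved it under this file name.
    return joblib.load(os.path.join(model_dir, 'model.joblib'))


def input_fn(input_data, content_type):
    """Parse a headerless CSV payload into a DataFrame."""
    if content_type != 'text/csv':
        raise ValueError('Unsupported content type: {}'.format(content_type))
    if isinstance(input_data, bytes):
        input_data = input_data.decode('utf-8')
    return pd.read_csv(StringIO(input_data), header=None)


def predict_fn(input_data, model):
    """Apply the fitted preprocessor to the parsed rows."""
    return model.transform(input_data)


def output_fn(prediction, accept):
    """Serialize the transformed rows back to CSV text."""
    if accept != 'text/csv':
        raise ValueError('Unsupported accept type: {}'.format(accept))
    buffer = StringIO()
    csv.writer(buffer, lineterminator='\n').writerows(prediction)
    return buffer.getvalue()
```

With hooks like these in place, the content_type and accept values passed to the transformer decide which branch of input_fn and output_fn is exercised.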

In [12]:
output_location = 's3://{0}/data'.format(bucket_name)

transformer = sklearn_preprocessor.transformer(
    instance_count=1, 
    instance_type='ml.c5.4xlarge',
    output_path=output_location,
    assemble_with = 'Line',
    accept='text/csv')

transformer.transform('s3://{0}/data/'.format(bucket_name), content_type='text/csv', split_type='Line')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

INFO:sagemaker:Creating model with name: predmain-expprep-sklearn-2019-05-02-18-50-10-249
INFO:sagemaker:Creating transform job with name: predmain-expprep-sklearn-2019-05-02-18-53-37-756


Waiting for transform job: predmain-expprep-sklearn-2019-05-02-18-53-37-756
.....................................!
