<h1>Data exploration, preprocessing and feature engineering</h1>

In this and the following notebooks we will demonstrate how to build an ML pipeline with SKLearn feature transformers and the SageMaker XGBoost algorithm. After the model is trained, we will deploy the pipeline (feature transformer and XGBoost) as a SageMaker Inference Pipeline behind a single endpoint for real-time inference.

In particular, in this notebook we will tackle the first steps related to data exploration and preparation. We will use [Amazon Athena](https://aws.amazon.com/athena/) to query our dataset and get a first look at data quality and available features, [AWS Glue](https://aws.amazon.com/glue/) to create a Data Catalog and [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to build the feature transformer model with SKLearn.

<span style="color: red"><strong>To get started, in the cell below replace [your-initials] in the bucket_name variable with your initials, so that it matches the name of the bucket you created in the previous steps.</strong></span>

In [None]:
import boto3
import sagemaker
import time

role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print(region)
print(role)
 
# replace [your-initials] according to the bucket name you have defined.
bucket_name = 'endtoendml-workshop-[your-initials]'

print(bucket_name)
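
A leftover `[your-initials]` placeholder only fails later, when S3 rejects the name, so a quick local check can catch it early. The following is a minimal sketch: `check_bucket_name` is a hypothetical helper (not part of the workshop code), and it covers only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit).

```python
import re

# Basic S3 bucket naming pattern: 3-63 characters, lowercase letters, digits,
# dots and hyphens, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$')


def check_bucket_name(name):
    """Return the name unchanged if it looks valid, otherwise raise ValueError."""
    if '[' in name or ']' in name:
        raise ValueError('Replace the [your-initials] placeholder in bucket_name.')
    if not _BUCKET_NAME_RE.match(name):
        raise ValueError('Not a valid S3 bucket name: {0!r}'.format(name))
    return name
```

For example, `check_bucket_name(bucket_name)` could be called right after the variable is assigned.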

We can now copy the dataset used for this use case to our bucket. We will use the `windturbine_raw_data.csv` file made available for this workshop in the `gianpo-public` public S3 bucket. In this notebook, we copy the file from that bucket to your bucket so that AWS services can access the data.

In [None]:
import boto3

s3 = boto3.resource('s3')

file_key = 'data/raw/windturbine_raw_data.csv'
copy_source = {
    'Bucket': 'gianpo-public',
    'Key': 'endtoendml/{0}'.format(file_key)
}

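# Caution: this deletes every object (and every object version) already in your bucket.
s3.Bucket(bucket_name).object_versions.delete()
# Copy the dataset from the public bucket into your bucket.
s3.Bucket(bucket_name).copy(copy_source, file_key)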

The first thing we need now is to infer a schema for our dataset. Thanks to its [integration with AWS Glue](https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html), we will later use Amazon Athena to run SQL queries against our data stored in S3 without importing it into a relational database. Athena uses the AWS Glue Data Catalog as a central location to store and retrieve table metadata throughout an AWS account: the Athena execution engine requires table metadata that tells it where to read the data, how to read it, and other information necessary to process it.

To organize our Glue Data Catalog we create a new database named `endtoendml-db`. To do so, we create a Glue client via Boto and invoke the `create_database` method.

However, we first want to make sure these AWS resources do not exist yet, to avoid errors.

In [None]:
from notebook_utilities import cleanup_glue_resources
cleanup_glue_resources()
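
The implementation of `cleanup_glue_resources` in `notebook_utilities` is not shown in this notebook and may work differently. As a rough illustration of what such a cleanup could look like, the hypothetical `cleanup_glue` function below deletes a crawler and a database through the Glue client and ignores the "not found" error raised when a resource does not exist. It assumes a boto3 Glue client, which exposes modeled exceptions under `client.exceptions`.

```python
def cleanup_glue(glue_client, database_name, crawler_name):
    """Delete the crawler and the database, ignoring those that do not exist.

    Returns the list of resource names that were actually deleted.
    """
    deleted = []
    # boto3 clients expose service exceptions as attributes of client.exceptions.
    not_found = glue_client.exceptions.EntityNotFoundException

    try:
        glue_client.delete_crawler(Name=crawler_name)
        deleted.append(crawler_name)
    except not_found:
        pass  # Nothing to clean up for the crawler.

    try:
        glue_client.delete_database(Name=database_name)
        deleted.append(database_name)
    except not_found:
        pass  # Nothing to clean up for the database.

    return deleted
```

With the names used in this notebook, a call would look like `cleanup_glue(boto3.client('glue'), 'endtoendml-db', 'endtoendml-crawler')`.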

In [None]:
glue_client = boto3.client('glue')

response = glue_client.create_database(DatabaseInput={'Name': 'endtoendml-db'})
response = glue_client.get_database(Name='endtoendml-db')
response
assert response['Database']['Name'] == 'endtoendml-db'

Now we define a Glue crawler and point it to the S3 path where the dataset resides; the crawler creates table definitions in the Data Catalog.
To grant the correct set of access permissions to the crawler, we use one of the roles created before (`GlueServiceRole-endtoendml`), whose policy grants AWS Glue access to data stored in your S3 buckets.

In [None]:
response = glue_client.create_crawler(
    Name='endtoendml-crawler',
    Role='service-role/GlueServiceRole-endtoendml', 
    DatabaseName='endtoendml-db',
    Targets={'S3Targets': [{'Path': '{0}/data/raw/'.format(bucket_name)}]}
)

We are ready to run the crawler with the `start_crawler` API and to monitor its progress until completion through the `get_crawler_metrics` API.

In [None]:
glue_client.start_crawler(Name='endtoendml-crawler')

while glue_client.get_crawler_metrics(CrawlerNameList=['endtoendml-crawler'])['CrawlerMetricsList'][0]['TablesCreated'] == 0:
    print('RUNNING')
    time.sleep(15)
    
assert glue_client.get_crawler_metrics(CrawlerNameList=['endtoendml-crawler'])['CrawlerMetricsList'][0]['TablesCreated'] == 1
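
The loop above never ends if the crawler stops without creating a table (for example because of a wrong path or missing permissions). One possible alternative, sketched below, is to poll the crawler state with `get_crawler` and give up after a timeout. `wait_for_crawler` and its parameters are hypothetical names introduced here; the sketch assumes the crawler reports `RUNNING` or `STOPPING` while busy and returns to `READY` when done.

```python
import time


def wait_for_crawler(glue_client, name, poll_seconds=15, timeout_seconds=900,
                     sleep=time.sleep):
    """Poll get_crawler until the crawler is back in the READY state.

    Raises TimeoutError if the crawler is still busy after timeout_seconds.
    Returns the number of polls that were needed.
    """
    waited = 0
    polls = 0
    while True:
        state = glue_client.get_crawler(Name=name)['Crawler']['State']
        polls += 1
        if state == 'READY':
            return polls
        if waited >= timeout_seconds:
            raise TimeoutError(
                'Crawler {0} still {1} after {2}s'.format(name, state, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

Called as `wait_for_crawler(glue_client, 'endtoendml-crawler')` right after `start_crawler`, it would block until the crawler is idle again; the `TablesCreated` assertion can then be checked once.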


When the crawler has finished its job, we can retrieve the table definition for the newly created table.
As you can see, the crawler has correctly identified 12 fields, inferred a type for each column and assigned a name.

In [None]:
table = glue_client.get_table(DatabaseName='endtoendml-db', Name='raw')
table

Based on our knowledge of the dataset, we can assign more specific names to columns.

In [None]:
table['Table']['StorageDescriptor']['Columns'] = [{'Name': 'turbine_id', 'Type': 'string'},
                                                  {'Name': 'turbine_type', 'Type': 'string'},
                                                  {'Name': 'wind_speed', 'Type': 'double'},
                                                  {'Name': 'rpm_blade', 'Type': 'double'},
                                                  {'Name': 'oil_temperature', 'Type': 'double'},
                                                  {'Name': 'oil_level', 'Type': 'double'},
                                                  {'Name': 'temperature', 'Type': 'double'},
                                                  {'Name': 'humidity', 'Type': 'double'},
                                                  {'Name': 'vibrations_frequency', 'Type': 'double'},
                                                  {'Name': 'pressure', 'Type': 'double'},
                                                  {'Name': 'wind_direction', 'Type': 'string'},
                                                  {'Name': 'breakdown', 'Type': 'string'}]
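# get_table returns read-only fields that update_table's TableInput rejects, so drop them.
updated_table = table['Table']
updated_table.pop('DatabaseName', None)
updated_table.pop('CreateTime', None)
updated_table.pop('UpdateTime', None)
updated_table.pop('CreatedBy', None)
updated_table.pop('IsRegisteredWithLakeFormation', None)
# Newer API responses may also include these read-only fields.
updated_table.pop('CatalogId', None)
updated_table.pop('VersionId', None)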

glue_client.update_table(
    DatabaseName='endtoendml-db',
    TableInput=updated_table
)
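
Removing read-only fields one by one means the list has to grow whenever `get_table` starts returning a new field. A possible alternative is to keep only the keys that `TableInput` accepts. The sketch below uses a hypothetical helper, `to_table_input`, and a key list taken from the Glue API reference as an assumption; it is worth checking against the boto3 version in use.

```python
# Keys accepted by the TableInput structure of Glue's update_table API
# (assumed from the Glue API reference; verify against your boto3 version).
TABLE_INPUT_KEYS = {
    'Name', 'Description', 'Owner', 'LastAccessTime', 'LastAnalyzedTime',
    'Retention', 'StorageDescriptor', 'PartitionKeys', 'ViewOriginalText',
    'ViewExpandedText', 'TableType', 'Parameters', 'TargetTable',
}


def to_table_input(table):
    """Return a copy of a get_table 'Table' dict restricted to TableInput keys."""
    return {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}
```

With this helper, the update call could be written as `glue_client.update_table(DatabaseName='endtoendml-db', TableInput=to_table_input(table['Table']))`.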

<h2>Data exploration with Amazon Athena</h2>

For data exploration, let's install PyAthena, a Python client for Amazon Athena. Note: PyAthena is not maintained by AWS; see https://pypi.org/project/PyAthena/ for additional information.

In [None]:
!pip install pyathena

In [None]:
import pyathena
from pyathena import connect
import pandas as pd

athena_cursor = connect(s3_staging_dir='s3://{0}/staging/'.format(bucket_name), 
                        region_name=region).cursor()

athena_cursor.execute('SELECT * FROM "endtoendml-db".raw limit 8;')
pd.read_csv(athena_cursor.output_location)
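
Reading the CSV at `output_location` is one way to get the results. Since PyAthena cursors follow the Python DB-API 2.0 interface, the rows can also be fetched from the cursor itself. The sketch below shows the idea with a hypothetical helper, `fetch_as_dicts`, which works with any DB-API cursor and relies only on `execute`, `description` and `fetchall`.

```python
def fetch_as_dicts(cursor, sql):
    """Run sql on a DB-API 2.0 cursor and return the rows as a list of dicts."""
    cursor.execute(sql)
    # Each entry of cursor.description is a sequence whose first item is the column name.
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

The resulting list can be passed to `pd.DataFrame(...)`, for example `pd.DataFrame(fetch_as_dicts(athena_cursor, 'SELECT * FROM "endtoendml-db".raw limit 8;'))`.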

Another SQL query counts how many records we have.

In [None]:
athena_cursor.execute('SELECT COUNT(*) FROM "endtoendml-db".raw;')
pd.read_csv(athena_cursor.output_location)

Let's see which values the field "breakdown" takes and how frequently each of them occurs over the entire dataset.

In [None]:
athena_cursor.execute('SELECT breakdown, (COUNT(breakdown) * 100.0 / (SELECT COUNT(*) FROM "endtoendml-db".raw)) \
            AS percent FROM "endtoendml-db".raw GROUP BY breakdown;')
pd.read_csv(athena_cursor.output_location)
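
To make the arithmetic of the query above explicit, here is the same share computation in plain Python: count each distinct value, then divide by the total number of rows and multiply by 100. `value_percentages` is a hypothetical helper, and the sample labels are toy values for illustration only, not taken from the dataset.

```python
from collections import Counter


def value_percentages(values):
    """Return {value: percentage of occurrences} for an iterable of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count * 100.0 / total for value, count in counts.items()}


# Toy labels for illustration only; these are not values from the dataset.
sample = ['no', 'no', 'no', 'yes']
print(value_percentages(sample))
```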

In [None]:
athena_cursor.execute('SELECT breakdown, COUNT(breakdown) AS bd_count FROM "endtoendml-db".raw GROUP BY breakdown;')
df = pd.read_csv(athena_cursor.output_location)

%matplotlib inline
import matplotlib.pyplot as plt

plt.bar(df.breakdown, df.bd_count)

We have discovered that the dataset is quite unbalanced, although we are not going to try balancing it.

In [None]:
athena_cursor.execute('SELECT DISTINCT(turbine_type) FROM "endtoendml-db".raw')
pd.read_csv(athena_cursor.output_location)

In [None]:
athena_cursor.execute('SELECT COUNT(*) FROM "endtoendml-db".raw WHERE oil_temperature IS NULL')
pd.read_csv(athena_cursor.output_location)

We also realized there are a few null values that need to be managed during the data preparation steps.

For the purpose of keeping the data exploration step short during the workshop, we are not going to execute additional queries. However, feel free to explore the dataset more if you have time.

**Note**: you can go to the Amazon Athena console and check query duration under the History tab. Queries usually execute in a few seconds; it then takes some time for Pandas to load the results into a dataframe.

<h2>Preprocessing and Feature Engineering with Amazon SageMaker Processing</h2>

The preprocessing and feature engineering code is implemented in the `preprocessor.py` file. You can go through the code and see that several categorical columns require indexing and one-hot encoding.
Once the SKLearn fit() and transform() are done, the script splits the dataset 80-20 into train and validation sets, and the results are uploaded to S3 so that they can be used with XGBoost for training.

In [None]:
!pygmentize source_dir/preprocessor.py

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(role=role,
                                     instance_type='ml.m5.large',
                                     instance_count=1,
                                     framework_version='0.20.0')

raw_data_path = 's3://{0}/data/raw/'.format(bucket_name)
train_data_path = 's3://{0}/data/preprocessed/train/'.format(bucket_name)
val_data_path = 's3://{0}/data/preprocessed/val/'.format(bucket_name)

In [None]:
sklearn_processor.run(code='source_dir/preprocessor.py',
                      inputs=[ProcessingInput(input_name='raw_data', source=raw_data_path, destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data', source='/opt/ml/processing/train', destination=train_data_path),
                               ProcessingOutput(output_name='val_data', source='/opt/ml/processing/val', destination=val_data_path)],
                      arguments=['--train-test-split-ratio', '0.2'])
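
SageMaker Processing passes the `arguments` list to the script as command-line arguments. The sketch below illustrates how a script could read `--train-test-split-ratio` and apply it. It is a hypothetical, standard-library-only example: `parse_args` and `split_rows` are names introduced here, and the actual `preprocessor.py` is not reproduced and most likely uses scikit-learn utilities instead.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the command-line arguments passed by the processing job."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.2)
    return parser.parse_args(argv)


def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows reproducibly and split them into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * test_ratio))
    return shuffled[n_val:], shuffled[:n_val]
```

With a ratio of `0.2`, a dataset of 10 rows would yield 8 training rows and 2 validation rows, matching the 80-20 split described earlier.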

After the preprocessing and feature engineering are completed, you can move to the next notebook in the **03_train_model** folder to start model training.