## Capstone Project: Near real-time monitoring of a manufacturing production line
**Keywords:** <font color='green'>SQL, AWS (S3 bucket, SageMaker XGBoost, Lambda function, API Gateway), Power BI (streaming dashboard, API)</font>

**Background:** The objective of this capstone project was to build and deploy a model that monitors, in near real time, a customer-critical attribute of a product manufactured at one of my company's facilities. For a variety of reasons, such as limited resources and testing capabilities, this attribute can only be measured every 12 hours. Finished products are manufactured at a rate of 400-600 per minute, so a failure to meet the attribute means putting 12 hours of production on hold (and often scrapping it). During each 12-hour period, several quality and production checks are performed at each stage of the manufacturing process. The model uses the data from these intermediary checks to predict the customer-critical attribute during the intervals in which it is not measured directly. If a failure is predicted, a notification is sent to the appropriate personnel so they can take immediate action.
#### Project structure:
- **Part I: ETL** The data used for training, validation and testing is hosted on two SQL servers. One SQL server hosts the product quality data and the other hosts the machine state data. The first step is to extract the data from the SQL servers, transform it and load it into an AWS S3 bucket.
- **Part II:** 
 - **II.1 Build, Train and Deploy the model** With the data in AWS S3, I used SageMaker to train and deploy an XGBoost model. Deploying the model creates an endpoint that can be called for predictions.
 - **II.2 Lambda function & API Gateway** I created a Lambda function and an API Gateway API to expose the model for predictions. The API lets me send the data for prediction as a POST request and get a low-latency response. This front end is cost effective because I am charged only when a request is sent to the API.
- **Part III: Predict near real-time and stream to Power BI dashboard** With the model deployed and the API in service, I scheduled a task on one of our on-premises servers to send the latest intermediary check to the model and get a prediction of the customer-critical attribute. This prediction is then passed to the Power BI dashboard (also as a POST request). The Power BI streaming dashboard visualizes the predictions in near real time.
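
The Part III flow (send the latest check to the API, then forward the prediction to Power BI) could look roughly like the sketch below. This is a minimal illustration and not the scheduled task itself: both URLs, the `features` request key, the `prediction` response key and the `timestamp`/`prediction` dashboard columns are assumptions made for the example. The `post` argument exists only so the flow can be exercised without network access.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical URLs: replace with the real API Gateway stage URL and the
# push URL that Power BI displays for the streaming dataset.
API_URL = "https://example.invalid/predict"
POWERBI_URL = "https://example.invalid/powerbi-push"


def post_json(url, payload):
    """POST `payload` as JSON and return the decoded JSON reply (None if empty)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body else None


def predict_and_stream(features, post=post_json):
    """Send one intermediary check to the model API, then push the result to Power BI."""
    # Assumed request/response shapes: {"features": [...]} -> {"prediction": ...}
    result = post(API_URL, {"features": features})
    row = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prediction": float(result["prediction"]),
    }
    # The row is sent as a one-element JSON array of row objects.
    post(POWERBI_URL, [row])
    return row
```

A scheduler (for example Windows Task Scheduler or cron) would call `predict_and_stream` with the most recent row of intermediary checks each time new data arrives.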

### <font color='brown'>Part II:</font> Model training and deployment
* Python script to load the training data prepared in Part I, train the XGBoost model and deploy it.

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [48]:
df = pd.read_csv('axial_for_training.csv').drop('Unnamed: 0', axis=1)

In [49]:
df.columns

Index(['CREW_A', 'LINE_A', 'MACHINE_A', 'STATION_A', 'OPER_A', 'MS_NUMBER_A',
       'FBA__I', 'FBH__1', 'FBH__2', 'FBH__3',
       ...
       'FBL9RNG_I', 'FBL10RNG_I', 'FBL11RNG_I', 'FBL12RNG_I', 'FBL13RNG_I',
       'FBL14RNG_I', 'FBL15RNG_I', 'FBL16RNG_I', 'FBL17RNG_I', 'FBL18RNG_I'],
      dtype='object', length=122)

In [50]:
df_f = df.drop('FBAXIAL_I', axis=1)
df_t = df.FBAXIAL_I
df = pd.concat([df_t, df_f], axis=1)

In [51]:
data = pd.get_dummies(df)
data.shape

(6991, 1513)
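
One-hot encoding expands the 122 original columns to 1,513, so any record sent to the model later has to be encoded into exactly the same columns, in the same order. The sketch below shows one way to do that with `reindex`. The column names `FBAXIAL_I` and `CREW_A` come from the data above, but the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the training frame (values are illustrative only).
training = pd.DataFrame({"FBAXIAL_I": [1.0, 2.0, 3.0], "CREW_A": ["A", "B", "C"]})
encoded = pd.get_dummies(training)

# Feature columns the model was trained on (everything except the target).
feature_columns = encoded.columns.drop("FBAXIAL_I")

# A new record only contains the categories it happens to have, so its dummy
# columns must be aligned to the training layout; missing ones are filled with 0.
new_row = pd.DataFrame({"CREW_A": ["B"]})
aligned = (
    pd.get_dummies(new_row)
    .reindex(columns=feature_columns, fill_value=0)
    .astype(int)
)
print(aligned)
```

Saving `feature_columns` alongside the model is one simple way to make the same layout available to whatever code prepares data for prediction.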

In [52]:
train0, test = train_test_split(data, test_size=.15)
train, validation = train_test_split(train0, test_size=.15)
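
Because the second split is applied to the 85% that remains after the first, the final proportions are roughly 72.25% train, 12.75% validation and 15% test. The sketch below works out the row counts for the 6,991 rows above. It assumes the held-out part of each split is rounded up, which is my understanding of how scikit-learn handles a fractional `test_size`. `split_sizes` is a helper written for this example only.

```python
import math


def split_sizes(n_rows, test_fraction=0.15, validation_fraction=0.15):
    """Return (train, validation, test) row counts for two successive splits."""
    n_test = math.ceil(test_fraction * n_rows)            # first split: hold out test
    n_remaining = n_rows - n_test
    n_validation = math.ceil(validation_fraction * n_remaining)  # second split
    n_train = n_remaining - n_validation
    return n_train, n_validation, n_test


n_train, n_validation, n_test = split_sizes(6991)
print(n_train, n_validation, n_test)  # 5050 892 1049
```

Neither call to `train_test_split` fixes `random_state`, so the exact rows in each set change from run to run even though the counts stay the same.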

In [53]:
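# SageMaker's built-in XGBoost expects CSV input with no header row and the
# target in the first column (FBAXIAL_I was moved to the front above).
train.to_csv('train.csv', index=False, header=False)
test.to_csv('test.csv', index=False, header=False)
validation.to_csv('validation.csv', index=False, header=False)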

In [54]:
%%time

import os
import boto3
import re
import sagemaker

CPU times: user 11 µs, sys: 0 ns, total: 11 µs
Wall time: 15.5 µs


In [55]:
bucket = 'axial-load-s3-sagemaker'
training_key = 'axialtrain/train.csv'
validation_key = 'axialtrain/validation.csv'
test_key = 'axialtrain/test.csv'

s3_model_output_location = r's3://{0}/axialtrain/model'.format(bucket)
s3_training_file_location = r's3://{0}/{1}'.format(bucket, training_key)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket, validation_key)
s3_test_file_location = r's3://{0}/{1}'.format(bucket, test_key)

In [56]:
print(s3_model_output_location)
print(s3_training_file_location)
print(s3_validation_file_location)
print(s3_test_file_location)

s3://axial-load-s3-sagemaker/axialtrain/model
s3://axial-load-s3-sagemaker/axialtrain/train.csv
s3://axial-load-s3-sagemaker/axialtrain/validation.csv
s3://axial-load-s3-sagemaker/axialtrain/test.csv


In [57]:
def write_to_s3(filename, bucket, key):
    with open(filename, 'rb') as f:
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)    

In [59]:
write_to_s3('train.csv', bucket, training_key)
write_to_s3('validation.csv', bucket, validation_key)
write_to_s3('test.csv', bucket, test_key)

In [60]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost', '1.0-1')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


In [61]:
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::157248718313:role/service-role/AmazonSageMaker-ExecutionRole-20200728T214764


In [62]:
sess = sagemaker.Session()

In [63]:
estimator = sagemaker.estimator.Estimator(container,
                                         role,
                                         train_instance_count=1,
                                         train_instance_type='ml.m4.xlarge',
                                         output_path=s3_model_output_location, 
                                         sagemaker_session=sess,
                                         base_job_name='xgboost-axial-load-v1')

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.


In [64]:
estimator.set_hyperparameters(max_depth=5, objective="reg:linear", eta=0.1, subsample=0.7, num_round=150)

In [65]:
estimator.hyperparameters()

{'max_depth': 5,
 'objective': 'reg:linear',
 'eta': 0.1,
 'subsample': 0.7,
 'num_round': 150}

In [66]:
training_input_config = sagemaker.session.s3_input(s3_data=s3_training_file_location, content_type='csv')
validation_input_config = sagemaker.session.s3_input(s3_data=s3_validation_file_location, content_type='csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [67]:
print(training_input_config.config)
print(validation_input_config.config)

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://axial-load-s3-sagemaker/axialtrain/train.csv', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://axial-load-s3-sagemaker/axialtrain/validation.csv', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'csv'}


In [68]:
estimator.fit({'train':training_input_config, 'validation':validation_input_config})

2020-07-29 16:47:46 Starting - Starting the training job...
2020-07-29 16:47:48 Starting - Launching requested ML instances......
2020-07-29 16:48:52 Starting - Preparing the instances for training...
2020-07-29 16:49:45 Downloading - Downloading input data...
2020-07-29 16:50:10 Training - Downloading the training image...
2020-07-29 16:50:41 Training - Training image download completed. Training in progress...
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:linear to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','

In [69]:
predictor = estimator.deploy(initial_instance_count=1, 
                             instance_type='ml.m4.xlarge', 
                             endpoint_name='xgboost-axial-load-v2')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!
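
Once the endpoint is in service, a Lambda function behind API Gateway can forward incoming requests to it. The handler below is a minimal sketch of that idea, not the function actually deployed for this project. It assumes an API Gateway proxy integration whose request body is JSON of the form `{"features": [...]}`, with the values already one-hot encoded in the training column order, and it assumes the endpoint returns the prediction as plain text. The optional `runtime` argument is there only so the handler can be tried with a stub client.

```python
import json

ENDPOINT_NAME = "xgboost-axial-load-v2"  # endpoint name used in the deploy call


def lambda_handler(event, context, runtime=None):
    """Forward one feature row to the SageMaker endpoint and return its prediction."""
    if runtime is None:
        import boto3  # imported lazily so the handler can be tested with a stub
        runtime = boto3.client("sagemaker-runtime")

    # Assumed request shape: proxy integration with body '{"features": [...]}'.
    payload = json.loads(event["body"])
    csv_row = ",".join(str(value) for value in payload["features"])

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=csv_row,
    )
    # Assumed response shape: the predicted value as text; take the first value.
    raw = response["Body"].read().decode("utf-8")
    prediction = float(raw.strip().split(",")[0])

    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```

The Lambda execution role would also need permission to call `sagemaker:InvokeEndpoint` on this endpoint.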