# Predictive Maintenance using Machine Learning on SageMaker
*Part 4 - Model training*

## Initialization
---
Directory structure to run this notebook:
```
nasa-turbofan-rul-lstm
|
+--- data
|   |
|   +--- interim: intermediate data we can manipulate and process
|   |
|   \--- raw: *immutable* data downloaded from the source website
|
+--- notebooks: all the notebooks are positioned here
|
+--- src: the training script will be located here, along with some utility scripts
```
### Imports

In [1]:
%load_ext autoreload

import os
import errno
import random
import sys
import numpy as np
import mxnet as mx
import mxnet.gluon as G
import matplotlib.pyplot as plt
import seaborn as sns

# Initializing random seeds:
random.seed(42)
np.random.seed(42)
sns.set_style('darkgrid')
sys.path.append('../src')

%matplotlib inline
%autoreload 2

import lstm_utils

INTERIM_DATA = '../data/interim'
PROCESSED_DATA = '../data/processed'
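
The `processed` folder referenced by `PROCESSED_DATA` is not listed in the directory tree at the top of this notebook. The following is a minimal sketch, assuming the folder may not exist yet, of how it could be created before anything is written to it. The `ensure_dir` helper is hypothetical (it is not part of `lstm_utils`), and the demo runs in a temporary directory so that the project tree is left untouched:

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` (and any missing parent folders) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrated in a throwaway directory rather than in the real project tree:
demo_root = tempfile.mkdtemp()
processed_dir = ensure_dir(os.path.join(demo_root, 'data', 'processed'))

# Calling the helper a second time is harmless thanks to exist_ok=True:
ensure_dir(processed_dir)
print(os.path.isdir(processed_dir))
```

In this notebook, the equivalent call would be `ensure_dir(PROCESSED_DATA)`.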

Importing SageMaker and AWS-specific libraries. We also obtain the IAM role ARN that gives training and hosting access to your data: the `get_execution_role` function returns the role ARN that was specified when the notebook instance was created. The container image is not set explicitly here; the `MXNet` estimator defined below selects it based on the framework version and the region we are running in.

In [2]:
import boto3
import sagemaker
from sagemaker import get_execution_role

# S3 bucket for saving code and model artifacts:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Bucket location where custom code will be saved in the tar.gz format:
custom_code_upload_location = 's3://{}/nasa-rul-lstm/code'.format(bucket)

# Bucket location where results of model training will be saved:
model_artifacts_location = 's3://{}/nasa-rul-lstm/artifacts'.format(bucket)

# S3 Data location:
train_data_location = 's3://{}/nasa-rul-lstm/data'.format(bucket)

# IAM execution role that gives SageMaker access to resources in your AWS account.
role = get_execution_role()

### Loading data from the previous notebook

In [3]:
# Load data from the notebook local storage:
%store -r data_channels

success_msg = 'Loaded "data_channels"'
if 'data_channels' not in locals():
    print('Nothing in notebook store, trying to load from disk.')
    try:
        local_path = '../data/processed'
        with open(os.path.join(local_path, 'data_channels.txt'), 'r') as f:
            data_channels = eval(f.readline())
            
        print(success_msg)
            
    except Exception as e:
        print(e)
        # Not every exception carries an errno attribute, hence the getattr:
        if getattr(e, 'errno', None) == errno.ENOENT:
            print('File not found to load data channels from: you need to execute the previous notebook.')
            
else:
    print('Data channel found in notebook environment.')
    print(success_msg)

Data channel found in notebook environment.
Loaded "data_channels"
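
The fallback branch above rebuilds `data_channels` by calling `eval` on the first line of a text file, which executes whatever code that line contains. The following is a hedged alternative, assuming the file only holds a plain Python literal such as a dictionary of S3 URIs (the actual file content is not shown in this notebook). The `load_channels` helper and the example dictionary are hypothetical:

```python
import ast
import os
import tempfile

def load_channels(path):
    """Parse the first line of `path` as a Python literal, without executing code."""
    with open(path, 'r') as f:
        return ast.literal_eval(f.readline())

# Hypothetical content, written to a temporary file for the sake of the demo:
example = {'train': 's3://my-bucket/nasa-rul-lstm/data/train'}
tmp_path = os.path.join(tempfile.mkdtemp(), 'data_channels.txt')
with open(tmp_path, 'w') as f:
    f.write(str(example))

channels = load_channels(tmp_path)
print(channels)
```

`ast.literal_eval` only accepts literals (strings, numbers, tuples, lists, dicts, sets, booleans and `None`), so it raises an error instead of running arbitrary code. It would not work if the file stored something other than a literal.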


## Training
---
### Training script
The `nasa_rul_training.py` script provides all the code we need for training and hosting a SageMaker model. The script also checkpoints the model at the end of every epoch and saves the model graph, params and optimizer state in the model artifacts archive:

In [4]:
from sagemaker.mxnet import MXNet

training_job_name = 'nasa-rul-lstm'

rul_estimator = MXNet(
    entry_point='../src/nasa_rul_training.py',
    role=role,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    framework_version='1.6.0',
    py_version='py3',
    distributions={'parameter_server': {'enabled': True}},
    hyperparameters={
        'batch-size': 16,
        'learning-rate': 0.0034711926221743473,
        'sequence-length': 20,
        'hidden-size': 122,
        'num-layers': 3,
        'epoch': 50,
    },
    metric_definitions=[
        # Raw strings avoid Python treating \S as an (invalid) string escape:
        {'Name': 'train:loss', 'Regex': r'training loss: (\S+)'},
        {'Name': 'train:rmse', 'Regex': r'training rmse: (\S+)'}
    ],
    base_job_name=training_job_name
)
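
In SageMaker script mode, the entries of the `hyperparameters` dictionary are typically handed to the entry point as command-line arguments (for instance `--batch-size 16`). The content of `nasa_rul_training.py` is not shown in this notebook, so the following is only a sketch of how such a script might read them. The `parse_args` function and its default values are assumptions:

```python
import argparse
import os

def parse_args(argv=None):
    """Read the hyperparameters passed on the command line (defaults are illustrative)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch-size', type=int, default=16)
    parser.add_argument('--learning-rate', type=float, default=1e-3)
    parser.add_argument('--sequence-length', type=int, default=20)
    parser.add_argument('--hidden-size', type=int, default=100)
    parser.add_argument('--num-layers', type=int, default=3)
    parser.add_argument('--epoch', type=int, default=50)
    # SageMaker exposes the model output folder through the SM_MODEL_DIR variable:
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # parse_known_args ignores any extra argument this sketch does not declare:
    args, _ = parser.parse_known_args(argv)
    return args

# Simulating part of the command line built from the estimator above:
args = parse_args(['--batch-size', '16',
                   '--learning-rate', '0.0034711926221743473',
                   '--hidden-size', '122'])
print(args.batch_size, args.learning_rate, args.hidden_size, args.num_layers)
```

Note that argparse converts dashes to underscores, so `--batch-size` becomes `args.batch_size`.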

To launch a single training job, you can use the following commands:
```python
rul_estimator.fit(inputs=data_channels)
job_name = rul_estimator._current_job_name
```

In [5]:
# rul_estimator.fit(inputs=data_channels)
# job_name = rul_estimator._current_job_name

2020-06-16 09:12:37 Starting - Starting the training job...
2020-06-16 09:12:39 Starting - Launching requested ML instances......
2020-06-16 09:13:47 Starting - Preparing the instances for training......
2020-06-16 09:15:04 Downloading - Downloading input data
2020-06-16 09:15:04 Training - Downloading the training image...
2020-06-16 09:15:25 Training - Training image download completed. Training in progress.
2020-06-16 09:15:27,244 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2020-06-16 09:15:27,248 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-16 09:15:27,264 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch-size":16,"epoch":200,"hidden-size":122,"learning-rate":0.0034711926221743473,"num-layers":3,"sequence-length":20}', 'SM_USER_ENTRY_POINT': 'nasa_rul_training.py', 'SM_FRAMEWORK

### Running a hyperparameter tuning job

We continue by demonstrating that a hyperparameter tuning job also works with a custom script.

In [5]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter, CategoricalParameter

rul_lstm_tuner = HyperparameterTuner(
    estimator = rul_estimator,                    # The estimator object to use as the basis for the training jobs.
    metric_definitions=[{                         # We want to minimize the RMSE of the predicted RUL
        'Name': 'train:rmse', 
        'Regex': r'training rmse: (\S+)'
    }],
    objective_metric_name = 'train:rmse',         # The metric used to compare trained models (RMSE).
    objective_type = 'Minimize',                  # We wish to minimize this error metric.
    max_jobs = 20,                                # The total number of models to train
    max_parallel_jobs = 2,                        # The number of models to train in parallel
    hyperparameter_ranges = {
        'num-layers': IntegerParameter(3, 8),
        'hidden-size': IntegerParameter(10, 200),
        'learning-rate': ContinuousParameter(1e-4, 1e-2),
        'batch-size': CategoricalParameter([16, 32, 64, 128, 256]),
    },
    base_tuning_job_name = training_job_name + '-tuner'
)
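
The objective metric is extracted from the training logs with the regular expression given in `metric_definitions`: the first capture group is read as the metric value. The following sketch shows that matching logic on a made-up log line. The exact log format printed by `nasa_rul_training.py` is not shown here, so the sample line and the `extract_metric` helper are assumptions:

```python
import re

RMSE_REGEX = r'training rmse: (\S+)'

def extract_metric(log_line, pattern=RMSE_REGEX):
    """Return the first captured value as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None

# Hypothetical log line, for illustration only:
sample_line = 'Epoch 12: training loss: 0.0153 training rmse: 0.1237'

print(extract_metric(sample_line))
print(extract_metric('a line without any metric'))
```

Because `\S+` captures every non-whitespace character after the label, the script must print the value followed by whitespace or a line break: a trailing comma or unit glued to the number would end up in the captured text.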

In [6]:
rul_lstm_tuner.fit(inputs=data_channels, logs=True)
rul_lstm_tuner._current_job_name

'nasa-rul-lstm-tuner-200615-1857'

In [5]:
from lstm_utils import get_tuner_results
#tuner_results = get_tuner_results(rul_lstm_tuner._current_job_name)
tuner_results = get_tuner_results('nasa-rul-lstm-tuner-200615-1857')
tuner_results

20 training jobs have completed
Number of training jobs with valid objective: 19
{'lowest': 0.11708906292915344, 'highest': 0.3347015976905823}


| Unnamed: 0 | batch-size | hidden-size | learning-rate | num-layers | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 8 | "16" | 122.0 | 0.003471 | 3.0 | nasa-rul-lstm-tuner-200615-1857-012-4f6e2849 | Completed | 0.117089 | 2020-06-15 21:51:14+00:00 | 2020-06-15 22:15:28+00:00 | 1454.0 |
| 12 | "16" | 76.0 | 0.000341 | 7.0 | nasa-rul-lstm-tuner-200615-1857-008-15f61318 | Completed | 0.122048 | 2020-06-15 20:29:27+00:00 | 2020-06-15 20:59:19+00:00 | 1792.0 |
| 13 | "64" | 182.0 | 0.002367 | 3.0 | nasa-rul-lstm-tuner-200615-1857-007-97894b06 | Completed | 0.122478 | 2020-06-15 20:23:02+00:00 | 2020-06-15 20:52:11+00:00 | 1749.0 |
| 1 | "16" | 144.0 | 0.008509 | 6.0 | nasa-rul-lstm-tuner-200615-1857-019-d3a754ca | Completed | 0.123188 | 2020-06-15 23:59:29+00:00 | 2020-06-16 01:12:51+00:00 | 4402.0 |
| 4 | "32" | 18.0 | 0.001126 | 3.0 | nasa-rul-lstm-tuner-200615-1857-016-0a91147c | Completed | 0.123287 | 2020-06-15 23:15:22+00:00 | 2020-06-15 23:21:17+00:00 | 355.0 |
| 7 | "32" | 162.0 | 0.000984 | 7.0 | nasa-rul-lstm-tuner-200615-1857-013-91555df5 | Completed | 0.123455 | 2020-06-15 21:55:34+00:00 | 2020-06-15 23:12:13+00:00 | 4599.0 |
| 15 | "32" | 85.0 | 0.002666 | 4.0 | nasa-rul-lstm-tuner-200615-1857-005-a19076d2 | Completed | 0.12348 | 2020-06-15 19:58:26+00:00 | 2020-06-15 20:20:29+00:00 | 1323.0 |
| 5 | "64" | 30.0 | 0.004591 | 7.0 | nasa-rul-lstm-tuner-200615-1857-015-44664b8f | Completed | 0.124638 | 2020-06-15 23:10:33+00:00 | 2020-06-15 23:21:29+00:00 | 656.0 |
| 17 | "128" | 147.0 | 0.003274 | 6.0 | nasa-rul-lstm-tuner-200615-1857-003-79ab1c18 | Completed | 0.125434 | 2020-06-15 19:08:06+00:00 | 2020-06-15 19:52:33+00:00 | 2667.0 |
| 6 | "256" | 157.0 | 0.007255 | 5.0 | nasa-rul-lstm-tuner-200615-1857-014-b11beeb8 | Completed | 0.125607 | 2020-06-15 22:20:11+00:00 | 2020-06-15 23:01:31+00:00 | 2480.0 |


The table above is sorted by increasing RMSE, so its first row corresponds to the training job with the lowest (best) RMSE:

In [6]:
job_name = tuner_results.iloc[0]['TrainingJobName']
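
Taking the first row only works because the table is already sorted by objective value. As a sketch that does not rely on this ordering, the snippet below picks the job with the lowest objective among those that reported one. It uses plain dictionaries instead of the DataFrame returned by `get_tuner_results`. The first three rows are copied from the table above; the last row and the `best_job` helper are hypothetical:

```python
rows = [
    {'TrainingJobName': 'nasa-rul-lstm-tuner-200615-1857-008-15f61318',
     'FinalObjectiveValue': 0.122048},
    {'TrainingJobName': 'nasa-rul-lstm-tuner-200615-1857-012-4f6e2849',
     'FinalObjectiveValue': 0.117089},
    {'TrainingJobName': 'nasa-rul-lstm-tuner-200615-1857-007-97894b06',
     'FinalObjectiveValue': 0.122478},
    # Hypothetical row standing in for a job that reported no objective value:
    {'TrainingJobName': 'hypothetical-job-without-objective',
     'FinalObjectiveValue': None},
]

def best_job(rows):
    """Return the name of the job with the lowest objective, ignoring missing values."""
    valid = [r for r in rows if r['FinalObjectiveValue'] is not None]
    if not valid:
        return None
    return min(valid, key=lambda r: r['FinalObjectiveValue'])['TrainingJobName']

print(best_job(rows))
```

Filtering out missing objectives first mirrors the "valid objective" count printed with the tuner results.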

## Cleanup
---

### Storing data for the next notebook

In [6]:
print(job_name)

%store job_name

nasa-rul-lstm-2020-06-16-09-12-37-314
Stored 'job_name' (str)


### Persisting these data to disk
This is useful if you want to execute each notebook independently (from one session to another) and don't want to re-execute every notebook whenever you focus on a particular step.

In [7]:
# Write the best training job name to disk:
with open(os.path.join(PROCESSED_DATA, 'job_name.txt'), 'w') as f:
    f.write(job_name)
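
A later notebook can then recover the job name from this file when the notebook store is empty. The following round-trip sketch assumes the file contains nothing but the job name. The `save_job_name` and `load_job_name` helpers are hypothetical, and the demo writes to a temporary directory instead of `PROCESSED_DATA`:

```python
import os
import tempfile

def save_job_name(directory, job_name):
    """Write the training job name to job_name.txt inside `directory`."""
    with open(os.path.join(directory, 'job_name.txt'), 'w') as f:
        f.write(job_name)

def load_job_name(directory):
    """Read the training job name back, dropping any surrounding whitespace."""
    with open(os.path.join(directory, 'job_name.txt'), 'r') as f:
        return f.read().strip()

# Round trip in a throwaway directory, using the job name printed above:
demo_dir = tempfile.mkdtemp()
save_job_name(demo_dir, 'nasa-rul-lstm-2020-06-16-09-12-37-314')
restored = load_job_name(demo_dir)
print(restored)
```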