# Microsoft Azure automated ML Demo - v2

Azure ML & Azure Databricks notebooks by Parashar Shah.

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

## Purpose and Challenge

The purpose of this notebook is for the user to build and deploy a Machine Learning (ML) application using Azure Machine Learning (AML) service. It is a predictive maintenance scenario based on https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan.

This notebook has the complete code to load, prep, train and deploy the model. We chose a small public data set for this demo so that the entire process runs in only a few minutes.

The high-level steps are:

1. Create AML Workspace
2. Acquire and Prepare Data
3. Automated ML
4. Deploy Model as webservice
5. Predictions

## 1. Create cluster (in this lab it is pre-created)

When working with your customers, please follow the instructions in the Microsoft documentation: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment#azure-databricks

## 2. Acquire and Prepare Data
For this notebook, we will use the NASA Prognostics Center's Turbo-Fan Failure dataset.  It is located here: https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan

Download and unzip the data.

In [8]:
import logging
import os
import random
import time

from matplotlib import pyplot as plt
from matplotlib.pyplot import imshow
import numpy as np
import pandas as pd

In [9]:
# import needed libraries for downloading and unzipping the file
import urllib.request
from zipfile import ZipFile

In [10]:
# download from url
with urllib.request.urlopen("https://ti.arc.nasa.gov/c/6/") as response:
    with open('CMAPSSData.zip', 'wb') as output:    # note the flag "wb": write bytes
        output.write(response.read())

In [11]:
# unzip files
with ZipFile("CMAPSSData.zip") as archive:
    archive.extract("train_FD001.txt")

Next we read our data into a Pandas DataFrame.
Note that the space-separated txt file has no header row, so we assign the column names from the ReadMe in the zip file. In pandas we use read_csv with the delimiter option.

In [13]:
df = pd.read_csv("train_FD001.txt", delimiter=r"\s|\s\s", index_col=False, engine='python', names=['unit','cycle','os1','os2','os3','sm1','sm2','sm3','sm4','sm5','sm6','sm7','sm8','sm9','sm10','sm11','sm12','sm13','sm14','sm15','sm16','sm17','sm18','sm19','sm20','sm21'])
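
As a small illustration of reading headerless, whitespace-delimited text, the sketch below parses a three-row sample from memory instead of the real file. The sample values and the shortened column list are made up for illustration, and `sep=r"\s+"` is shown as one possible alternative, not the delimiter this notebook actually ran with.

```python
import io

import pandas as pd

# Made-up sample in the same spirit as train_FD001.txt: no header row,
# columns separated by one or more spaces.
sample = "1 1 -0.0007 518.67\n1 2 0.0019  518.67\n2 1 0.0003 518.67\n"

# Shortened, illustrative column list (the real file has 26 columns).
cols = ["unit", "cycle", "os1", "sm1"]

# header=None tells pandas the first line is data; names supplies the header.
toy = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=cols)
print(toy)
```

Passing `names` explicitly keeps the first data row from being consumed as a header.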

Take a quick look at the data

In [15]:
df.head(5)

Our dataset has a number of units in it, with each engine flight listed as a cycle. The cycles count up until the engine fails. What we would like to predict is the number of cycles until failure.
So we need to calculate a new column called RUL, or Remaining Useful Life. For each row, it is the last cycle value of that unit minus the row's own cycle value.

In [17]:
# Assign ground truth
def assignrul(dft):
    maxi = dft['cycle'].max()
    dft['rul'] = maxi - dft['cycle']
    return dft
    

df_new = df.groupby('unit').apply(assignrul) # derive label column
df_new = df_new.drop(['unit'], axis=1) # remove unit column because it won't help us do prediction
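
To make the RUL definition concrete, here is a minimal sketch on a made-up frame with two units. It uses `groupby(...).transform("max")`, which is an alternative way to express the same per-unit calculation, not the approach used in the cell above.

```python
import pandas as pd

# Toy data (made-up): unit 1 runs for 3 cycles, unit 2 for 2 cycles.
toy = pd.DataFrame({"unit": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})

# RUL = last cycle of the unit minus the current cycle.
toy["rul"] = toy.groupby("unit")["cycle"].transform("max") - toy["cycle"]
print(toy)
```

The final cycle of each unit gets an RUL of 0, and earlier cycles count down towards it.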

In [18]:
#download file using https://eastus2.azuredatabricks.net/files/df_new.csv
df_new.to_csv("/dbfs/FileStore/df_new.csv")

In [19]:
# put training data into X and y DataFrames
# hold out the first row (used as X_eval / y_eval below) for evaluation later on
X = df_new.drop(['rul'], axis=1)[1:]
y = df_new[['rul']][1:]

features = X.columns #derive features

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

X_train = pd.DataFrame(X_train, columns=features)
X_test = pd.DataFrame(X_test, columns=features)
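
The split above is random, so the rows that land in each part change from run to run; `train_test_split` accepts a `random_state` argument to make it repeatable. The sketch below shows the same idea of a seeded 80/20 split using only pandas on a made-up frame. The frame and the seed value 42 are illustrative assumptions.

```python
import pandas as pd

# Made-up frame with 10 rows standing in for the prepared data.
toy = pd.DataFrame({"cycle": range(1, 11), "rul": range(9, -1, -1)})

# Draw 20% of the rows for testing with a fixed seed; the rest is training data.
test_part = toy.sample(frac=0.2, random_state=42)
train_part = toy.drop(test_part.index)

print(len(train_part), len(test_part))
```

Because the seed is fixed, rerunning the cell selects the same test rows.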

In [21]:
X_train.head(5)

In [22]:
y_train.head(5)

In [23]:
X_eval = df_new.drop(['rul'], axis=1)[0:1]
y_eval = df_new['rul'][0:1]
print (X_eval)
print (y_eval)

## 3. Azure Automated ML

Here we use Azure's AutoML package to automate the scaling of the sensor readings, the selection of sensors, and the training and evaluation of many different types of ML models.

In [26]:
import azureml.core

# Check core SDK version number - based on build number of preview/master.
print("SDK version:", azureml.core.VERSION)

username = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user').split("@")[0]
print("Your username is {0}".format(username))

In [27]:
%sh  
/databricks/python/bin/pip freeze > /tmp/python_packages.txt
ls -lrt /tmp/python_packages.txt
cat /tmp/python_packages.txt

In [28]:
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun

![Workspace](https://github.com/parasharshah/automl-handson/raw/master/image1.JPG)

Provide your Machine Learning Workspace credentials to run AutoML. You will need to perform Microsoft's MFA. Please follow the manual auth instructions.

In [31]:
subscription_id = "<Your SubscriptionId>" #you should be owner or contributor
resource_group = "<Resource group - new or existing>" #you should be owner or contributor
workspace_name = "<workspace to be created>" #your workspace name

In [32]:
subscription_id = "ba7979f7-d040-49c9-af1a-7414402bf622" #you should be owner or contributor
resource_group = "automl_ps_newrg" #you should be owner or contributor
workspace_name = "AutoML_ws_pasha"              # your workspace name - needs to be unique - can be anything

The Workspace class supports more options than shown here; see https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py

For authentication options, see https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/manage-azureml-service/authentication-in-azureml/authentication-in-azure-ml.ipynb

In [34]:
ws = Workspace.get(name = workspace_name,
                      subscription_id = subscription_id,
                      resource_group = resource_group)

In [35]:
# Choose a name for the experiment and specify the project folder.
experiment_name = 'automl-predictive-rul'
project_folder = './sample_projects/automl-demo-predmain'

experiment = Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data = output, index = ['']).T

In [36]:
import azureml.dataprep as dprep
import uuid

Xtrain_dflow = dprep.read_pandas_dataframe(X_train, temp_folder='/dbfs/tmp'+str(uuid.uuid4()))
ytrain_dflow = dprep.read_pandas_dataframe(y_train, temp_folder='/dbfs/tmp'+str(uuid.uuid4()))
Xtest_dflow = dprep.read_pandas_dataframe(X_test, temp_folder='/dbfs/tmp'+str(uuid.uuid4()))
ytest_dflow = dprep.read_pandas_dataframe(y_test, temp_folder='/dbfs/tmp'+str(uuid.uuid4()))

In [37]:
ytrain_dflow.get_profile()

Now we are ready to configure Azure Automated ML.  We provide necessary information on: what we want to predict, what accuracy metric we want to use, how many models we want to try, and many other parameters. Automated ML will also automatically scale the data for us.

![Workspace](https://github.com/parasharshah/automl-handson/raw/master/image6b.jpg)

## Configure Automated ML

The table below lists the parameters used in this notebook. The full list is in the Azure documentation: https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py

|Property|Description|
|-|-|
|**task**|classification or regression or forecasting|
|**primary_metric** (classification)|This is the metric that you want to optimize. Classification supports the following primary metrics: <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>average_precision_score_weighted</i><br><i>norm_macro_recall</i><br><i>precision_score_weighted</i>|
|**primary_metric** (regression)|This is the metric that you want to optimize. Regression supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**iteration_timeout_minutes**|Time limit in minutes for each iteration.|
|**iterations**|Number of iterations. In each iteration AutoML trains a specific pipeline with the data.|
|**max_cores_per_iteration**|Maximum number of cores used per iteration. Defaults to 1 if not specified; otherwise set it to the number of cores on your VM. Not every algorithm will use multiple cores.|
|**n_cross_validations**|Number of cross validation splits. Do not use this when explicit validation set is provided.|
|**spark_context**|Spark Context object. for Databricks, use spark_context=sc|
|**max_concurrent_iterations**|Maximum number of iterations to execute in parallel. This should be <= number of worker nodes in your Azure Databricks cluster.|
|**X**|(sparse) array-like, shape = [n_samples, n_features]. For Azure Databricks, this has to be a dataflow.|
|**y**|(sparse) array-like, shape = [n_samples, ], Multi-class targets. For Azure Databricks, this has to be a dataflow.|
|**X_valid**|(sparse) array-like, shape = [n_samples, n_features]. For Azure Databricks, this has to be a dataflow.|
|**y_valid**|(sparse) array-like, shape = [n_samples, ], Multi-class targets. For Azure Databricks, this has to be a dataflow.|
|**model_explainability**|Whether to explain each trained pipeline. Requires a validation set. Set as True or False.|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
|**preprocess**|Set this to True to enable pre-processing of data, e.g. string to numeric using one-hot encoding. Set as True or False.|
|**experiment_exit_score**|Target score for the experiment, associated with the primary metric. E.g. experiment_exit_score=0.995 exits the experiment once that score is reached.|
|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term. Set as True or False.|

In [41]:
automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automl_errors_regression.log',
                             primary_metric = 'r2_score',
                             iteration_timeout_minutes = 10, #some runs may take 10+ mins hence limiting it for workshop
                             iterations = 50, #you may change this to a higher number and see what happens                             
                             verbosity = logging.INFO,
                             max_cores_per_iteration=4, #each VM has 4 cores
                             max_concurrent_iterations = 2, #change it based on number of worker nodes
                             spark_context=sc, #databricks/spark related
                             #n_cross_validations = 4, #only needed for small datasets and if validation size is not set
                             X = Xtrain_dflow,
                             y = ytrain_dflow,
                             X_valid = Xtest_dflow, #either provide this or use cross validation
                             y_valid = ytest_dflow, #either provide this or use cross validation
                             model_explainability = False, #enable only if doing model explain
                             enable_early_stopping = True,
                             preprocess=True, #preprocess
                             path = project_folder)
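
For planning workshop time, the settings above imply a rough upper bound on how long the experiment can run. The sketch below is a back-of-the-envelope estimate only: it assumes every iteration runs until its timeout and that iterations are perfectly packed into parallel batches. The helper name `worst_case_minutes` is hypothetical and not part of the Azure ML SDK.

```python
import math


def worst_case_minutes(iterations, timeout_minutes, concurrency):
    """Upper bound on wall-clock minutes if every iteration hits its timeout."""
    # Iterations run in batches of `concurrency`; each batch lasts at most one timeout.
    batches = math.ceil(iterations / concurrency)
    return batches * timeout_minutes


# Values taken from the configuration above: 50 iterations, 10-minute timeout, 2 in parallel.
print(worst_case_minutes(50, 10, 2))
```

In practice most iterations finish well before the timeout, and early stopping can end the experiment sooner, so this figure is a ceiling rather than a forecast.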

Finally we are ready to submit the experiment to Automated ML service. This step can take longer depending on the settings. AutoML will give us updates as models are trained and evaluated by the metric we specified above. The information from each ML model training will be stored in the Experiment section of the Azure ML Workspace in Azure Portal.

In [43]:
local_run = experiment.submit(automl_config, show_output = False) # you may set it to True to see results here but portal experience is better.

In [44]:
displayHTML("<a href={} target='_blank'>Your experiment in Azure Portal: {}</a>".format(local_run.get_portal_url(), local_run.id))

In [45]:
# find the run with the best value of the primary metric (r2_score for this regression task).
best_run, fitted_model = local_run.get_output()
print(best_run)
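
The best run is chosen by the primary metric, which is `r2_score` in this configuration. As a reminder of what that number means, here is a minimal pure-Python sketch of the usual coefficient of determination. It assumes AutoML follows the standard definition; the RUL values are made up, and the function name `r2` is illustrative.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [191, 190, 189, 188]  # made-up RUL labels

# A perfect model scores 1.0; always predicting the mean scores 0.0.
print(r2(actual, actual))
print(r2(actual, [189.5] * 4))
```

Scores below 0 are possible and indicate a model that does worse than predicting the mean RUL.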

In [46]:
#The fitted_model is a python object and you can read the different properties of the object. The following shows printing hyperparameters for each step in the pipeline.

from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0]+ ' - ')
        else:
            pprint(step[1].get_params())
            print()
            
print_model(fitted_model)

In [47]:
#from azureml.train.automl.automlexplainer import retrieve_model_explanation

#shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
#   retrieve_model_explanation(best_run)

In [48]:
#print(overall_summary)
#print(overall_imp)

## 4. Deploy Model

![Workspace](https://github.com/parasharshah/automl-handson/raw/master/image4-automl.jpg)

In [51]:
# register model in workspace & use the same in your score.py file
description = 'AutoML-RUL-Regression-20190514'
tags = None
model=local_run.register_model(description=description, tags=tags)
local_run.model_id # Use this id to deploy the model as a web service in Azure. Update score file with the output.

After we register the model in our AML Workspace, it should be visible in Azure Portal.

Now we want to deploy the model as a REST API (real-time webservice) that we can feed a row or rows of "X" data to, and that returns the predicted 'RUL' value. To accomplish this, we will build a container image in our AML Workspace and deploy that image as a container instance in Azure's ACI service. We will then obtain a scoring URI where we can submit data and receive back the predicted 'RUL' value.

There are 3 things we need: 
1. A score.py file that contains the init() and run() functions with instructions on how to load and score with the model. Update the model name in this file.
2. A mydeployenv.yml file that contains information on the python environment in which the model needs to run
3. Configurations for our images and our services, using functions provided by AzureML service.

The cells below help you set these up.

In [53]:
scorefilename = (('score'+str(uuid.uuid4()))[0:10]) + ".py"
print(scorefilename) # use this filename in the %%writefile line of the next cell

In [54]:
%%writefile scoree9690.py
# Change the name based on the randomly generated filename
# Scoring Script will need model id from registered model
import json
import numpy as np
import os
import pickle
import pandas as pd
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression

from azureml.core.model import Model

import azureml.train.automl

def init():
    global model
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path('AutoML979330f1fbest') # update this based on previously registered model
    print(model_path)
    model = joblib.load(model_path)
    

def run(raw_data):
    # grab and prepare the data
    #data = (np.array(json.loads(raw_data)['data'])).reshape(1,-1)
    data = (pd.DataFrame(np.array(json.loads(raw_data)['data']), columns=['cycle', 'os1', 'os2', 'os3', 'sm1', 'sm2', 'sm3', 'sm4', 'sm5', 'sm6', 'sm7', 'sm8', 'sm9', 'sm10', 'sm11', 'sm12', 'sm13', 'sm14', 'sm15', 'sm16', 'sm17', 'sm18', 'sm19', 'sm20', 'sm21']))
    # make prediction
    y_hat = model.predict(data)
    return json.dumps(y_hat.tolist())
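
Before building an image, it can help to exercise the request/response contract of `run()` locally. The sketch below repeats the payload handling from the scoring script but swaps in a `DummyModel` that returns a constant. `DummyModel`, the `FEATURES` list and the all-zero payload are assumptions made for illustration and stand in for the registered AutoML model and real sensor readings.

```python
import json

import numpy as np
import pandas as pd

# Same 25 feature names the scoring script expects, built programmatically.
FEATURES = ["cycle", "os1", "os2", "os3"] + ["sm%d" % i for i in range(1, 22)]


class DummyModel:
    """Hypothetical stand-in for the registered model: always predicts 100.0."""

    def predict(self, frame):
        return np.full(len(frame), 100.0)


model = DummyModel()


def run(raw_data):
    # Parse the JSON body and rebuild a DataFrame with the training column names.
    data = pd.DataFrame(np.array(json.loads(raw_data)["data"]), columns=FEATURES)
    y_hat = model.predict(data)
    return json.dumps(y_hat.tolist())


# One row of zeros is enough to check the shape of the contract.
payload = json.dumps({"data": [[0.0] * len(FEATURES)]})
print(run(payload))
```

A row with the wrong number of values fails when the DataFrame is constructed, which is easier to debug here than inside a deployed container.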

In [55]:
condafilename = (('mydeploy'+str(uuid.uuid4()))[0:14]) + ".yml"
print(condafilename) # name of the conda environment file used for the deployment

In [56]:
from azureml.core.conda_dependencies import CondaDependencies

myenv = CondaDependencies.create(conda_packages=['numpy','scikit-learn==0.19.1'], pip_packages=['azureml-sdk[automl]'])

conda_env_file_name = condafilename
myenv.save_to_file('.', conda_env_file_name)

In [57]:
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=2, 
                                               memory_gb=5, 
                                               tags={"data": "RUL",  "method" : "sklearn"}, 
                                               description='Predict RUL with Azure AutoML')

In [58]:
# this will take 10-15 minutes to finish

service_name = "rul-pred-demo" #put your name as suffix - no capital letter/special characters
runtime = "python" 
driver_file = scorefilename #uses the name generated earlier - do not change it
my_conda_file = conda_env_file_name #uses the name generated earlier - do not change it

# image creation
from azureml.core.image import ContainerImage
myimage_config = ContainerImage.image_configuration(execution_script = driver_file, 
                                    runtime = runtime, 
                                    conda_file = my_conda_file)

# Webservice creation
myservice = AciWebservice.deploy_from_model(
  workspace=ws, 
  name=service_name,
  deployment_config = aciconfig,
  models = [model],
  image_config = myimage_config
    )

myservice.wait_for_deployment(show_output=True)

In [59]:
print(myservice.scoring_uri)

In [60]:
import requests
import json

headers = {'Content-Type':'application/json'}

#this is same as X_eval
input_data = "{\"data\": [[1.0, -0.0007, -0.0004, 100.0, 518.67, 641.82, 1589.7, 1400.6, 14.62, 21.61, 554.36, 2388.06, 9046.19, 1.3, 47.47, 521.66, 2388.02, 8138.62, 8.4195, 0.03, 392.0, 2388.0, 100.0, 39.06, 23.419]]}"

#this is same as y_eval
testlabel = '191'

resp = requests.post(myservice.scoring_uri, input_data, headers=headers)

print("POST to url", myservice.scoring_uri)
print("input data:", input_data)
print("label:", testlabel)
print("prediction:", resp.text)
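
The request body above is a hand-escaped JSON string. A less error-prone option is to build it with `json.dumps`, sketched below with the same 25 values copied into a Python list. In the notebook the row could instead come from `X_eval.values.tolist()`; the variable name `row` is illustrative.

```python
import json

# Same values as the request body above, as a plain Python list.
row = [1.0, -0.0007, -0.0004, 100.0, 518.67, 641.82, 1589.7, 1400.6, 14.62,
       21.61, 554.36, 2388.06, 9046.19, 1.3, 47.47, 521.66, 2388.02, 8138.62,
       8.4195, 0.03, 392.0, 2388.0, 100.0, 39.06, 23.419]

# The scoring script expects {"data": [[...], ...]}: a list of rows.
input_data = json.dumps({"data": [row]})
print(input_data)
```

Sending several rows at once only requires adding more lists inside `"data"`.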

To avoid any runaway Azure costs, we always delete unnecessary services when we are done.

In [62]:
myservice.delete()

![Workspace](https://github.com/parasharshah/automl-handson/raw/master/image5.JPG)

In [64]:
from azureml.core.image import Image
myimage = Image(workspace=ws, name=service_name) # image is based on the service name provided earlier for ACI

In [65]:
#create AKS compute
#it may take 20-25 minutes to create a new cluster

from azureml.core.compute import AksCompute, ComputeTarget

# Use the default configuration (can also provide parameters to customize)
prov_config = AksCompute.provisioning_configuration()

aks_name = 'ps-aks-demo' 

# Create the cluster
aks_target = ComputeTarget.create(workspace = ws, 
                                  name = aks_name, 
                                  provisioning_configuration = prov_config)

aks_target.wait_for_completion(show_output = True)

print(aks_target.provisioning_state)
print(aks_target.provisioning_errors)

In [66]:
from azureml.core.webservice import Webservice, AksWebservice
from azureml.core.image import ContainerImage

#Set the web service configuration (using default here with app insights)
aks_config = AksWebservice.deploy_configuration(enable_app_insights=True)

#unique service name
service_name_aks ='ps-aks-service-demo'

# Webservice creation from the existing image; there is a variant that deploys from the model directly as well.
aks_service = Webservice.deploy_from_image(
  workspace=ws, 
  name=service_name_aks,
  deployment_config = aks_config,
  image = myimage,
  deployment_target = aks_target
    )

aks_service.wait_for_deployment(show_output=True)

In [67]:
#for using the Web HTTP API 
print(aks_service.scoring_uri)
print(aks_service.get_keys())

In [68]:
import requests
import json


input_data = "{\"data\": [[1.0, -0.0007, -0.0004, 100.0, 518.67, 641.82, 1589.7, 1400.6, 14.62, 21.61, 554.36, 2388.06, 9046.19, 1.3, 47.47, 521.66, 2388.02, 8138.62, 8.4195, 0.03, 392.0, 2388.0, 100.0, 39.06, 23.419]]}"
testlabel = '191'

headers = {'Content-Type':'application/json'}

# for AKS deployment you need to pass the service key in the header as well
api_key = aks_service.get_keys()[0] # primary key, as printed by the previous cell; avoid hard-coding it
headers = {'Content-Type':'application/json',  'Authorization':('Bearer '+ api_key)} 

resp = requests.post(aks_service.scoring_uri, input_data, headers=headers)

print("POST to url", aks_service.scoring_uri)
print("input data:", input_data)
print("label:", testlabel)
print("prediction:", resp.text)
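
To compare the returned prediction with the label numerically, the response text has to be parsed first. The sketch below is a tolerant parser under the assumption that the body is either a JSON list or a JSON string that itself contains a JSON list (which can happen when `run()` returns an already-serialized string). The helper name `parse_prediction` and the sample response value are made up for illustration.

```python
import json


def parse_prediction(text):
    """Return the list of predictions from a response body."""
    value = json.loads(text)
    # If the service wrapped an already-serialized string, decode it once more.
    if isinstance(value, str):
        value = json.loads(value)
    return value


# Made-up response body; a real call would pass resp.text instead.
predictions = parse_prediction("[188.2]")
abs_error = abs(predictions[0] - 191)  # 191 is the label used above
print(predictions, abs_error)
```

The same helper works for multi-row requests, where the returned list has one prediction per input row.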

## 5. Conclusions

We have executed an end-to-end Azure ML Service project with a real-life example. We started with a problem at hand, created an Azure ML Workspace, downloaded a predictive maintenance dataset, processed the data, trained a sophisticated model with Azure Automated ML, and deployed that model quickly and easily as a production-level service on AKS using Azure's Machine Learning service.