# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [22]:
import azureml.core

from azureml.core import Workspace, Experiment, Datastore, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.automl import AutoMLConfig

import os  # needed later for os.makedirs / os.path.join

import pandas as pd

import joblib

In [2]:
azureml.core.VERSION

'1.19.0'

## Dataset

### Overview

Heart failure is a common event caused by cardiovascular diseases. People who have cardiovascular disease, or who are at high cardiovascular risk because of one or more risk factors such as hypertension, diabetes, hyperlipidemia, or an already established disease, need early diagnosis.

A machine learning model could predict whether a patient is at risk, which could be of great help to medical personnel trying to save patients in serious condition.

The dataset used to train the model covers 299 patients with heart failure and was collected in 2015. It contains 12 features that can be used to predict heart failure mortality.

In [4]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'heart-failure-clinical-data'

experiment = Experiment(ws, experiment_name)

In [5]:
dataset = Dataset.get_by_name(ws, name='Heart Failure Prediction')

In [6]:
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [7]:
# Checking if the dataset is imbalanced
df = dataset.to_pandas_dataframe()

df['DEATH_EVENT'].value_counts()

0    203
1     96
Name: DEATH_EVENT, dtype: int64
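
To make the imbalance check concrete, the counts above can be turned into class shares. This is a minimal sketch that copies the two counts by hand from the `value_counts()` output; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 203, 1: 96}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

for label, share in shares.items():
    print(f"DEATH_EVENT={label}: {share:.1%}")

# The minority class is the label with the fewest rows.
minority_label = min(class_counts, key=class_counts.get)
# Ratio of majority to minority rows, a common way to express imbalance.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"minority class: {minority_label}, ratio: {imbalance_ratio:.2f}")
```

With these counts the minority class (`DEATH_EVENT = 1`) accounts for roughly a third of the rows.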

In [19]:
cpu_cluster_name = "cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_v2',
                                                           max_nodes=10)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)
    
# For a more detailed view of current AmlCompute status, use get_status().

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

The model predicts whether the patient is at risk of death. The target variable is `DEATH_EVENT`, a binary variable (1 or 0), so this is a classification problem. The dataset is also slightly imbalanced (the minority class is about 32%). AutoML should handle this imbalance automatically, but to be safe it is prudent to use *AUC_weighted* as the primary metric. Early stopping is enabled to avoid spending compute budget on runs that are no longer improving.
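
As a rough illustration of the metric, the sketch below computes a weighted AUC in plain Python. It assumes the metric is the per-class one-vs-rest AUC averaged with weights proportional to the number of true instances of each class. The helper names (`binary_auc`, `auc_weighted`) and the toy labels and probabilities are made up for this example; this is not the AutoML implementation.

```python
def binary_auc(y_true, scores, positive):
    """One-vs-rest AUC: the probability that a random positive example
    scores higher than a random negative one (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == positive]
    neg = [s for y, s in zip(y_true, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted(y_true, proba):
    """proba: one dict per row mapping class label -> predicted probability."""
    total = len(y_true)
    result = 0.0
    for label in sorted(set(y_true)):
        support = sum(1 for y in y_true if y == label)
        scores = [row[label] for row in proba]
        # Weight each class AUC by its share of the true labels.
        result += (support / total) * binary_auc(y_true, scores, label)
    return result


# Toy data (hypothetical): four survivors, two deaths.
y_true = [0, 0, 0, 0, 1, 1]
p_death = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
proba = [{0: 1 - p, 1: p} for p in p_death]

score = auc_weighted(y_true, proba)
print(round(score, 3))
```

Because the metric is based on how well the scores rank the two classes rather than on a fixed decision threshold, it is less sensitive to class proportions than plain accuracy.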

In [23]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="DEATH_EVENT",
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

In [24]:
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

As the following picture shows, the best-performing individual models are the RandomForest and GradientBoosting classifiers. Both are examples of *ensemble learning*: RandomForest follows the *bagging* method, whereas GradientBoosting follows the *boosting* method. Ensemble learning methods combine individual models into a new, higher-performing model. With bagging, the ensemble model tends to have lower variance. With boosting, each new base model tries to correct the errors of the previous ones, so performance is usually very high (although the result is prone to overfitting).

In the end, the two best models are the voting and stacking ensembles built from the previously trained RandomForest and GradientBoosting models. They perform better because they are ensembles of models that are themselves ensembles.

In [26]:
from azureml.widgets import RunDetails

RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

In [27]:
remote_run.wait_for_completion(show_output=False)

{'runId': 'AutoML_f6bfdd02-a717-4b51-901a-fac26a0d0b3f',
 'target': 'cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-05T14:54:38.3532Z',
 'endTimeUtc': '2021-01-05T15:21:42.582874Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"a16fb1f9-9cf0-4cfb-9da3-18aa3bf20fff\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"ml_data_cool__data\\\\\\", \\\\\\"path\\\\\\": \\\\\\"heart-failure-prediction-dataset/heart_failure_clinical_records_dataset.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"demo\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"bcbf34a7-1936-4783-8840

## Best Model

If the experiment has already been run, you can reference the existing AutoML run with the following code (make sure `automl_run_id` is correct):

In [5]:
from azureml.train.automl.run import AutoMLRun

automl_run_id = 'AutoML_f6bfdd02-a717-4b51-901a-fac26a0d0b3f'
automl_run = AutoMLRun(experiment, run_id = automl_run_id)
remote_run = automl_run # this ensures that the subsequent code works

In [6]:
best_run, best_model = remote_run.get_output()

In [7]:
run_details = best_run.get_details()

model_details = {
    'RunID': [run_details['runId']],
    'Iteration': [run_details['properties']['iteration']],
    'Primary metric': [run_details['properties']['primary_metric']],
    'Score': [run_details['properties']['score']],
    'Algorithm': [best_model.steps[1][0]],
    'Hyper-parameters': [best_model.steps[1][1]]
}

model_details_df = pd.DataFrame(model_details,
                  columns = ['RunID','Iteration','Primary metric','Score','Algorithm','Hyper-parameters'],
                  index=[run_details['properties']['model_name']])

pd.options.display.max_colwidth = None  # show full cell contents (-1 is deprecated since pandas 1.0)

model_details_df

Unnamed: 0,RunID,Iteration,Primary metric,Score,Algorithm,Hyper-parameters
AutoMLf6bfdd02a52,AutoML_f6bfdd02-a717-4b51-901a-fac26a0d0b3f_52,52,AUC_weighted,0.9182536552734494,prefittedsoftvotingclassifier,"PreFittedSoftVotingClassifier(classification_labels=None,\n estimators=[('33',\n Pipeline(memory=None,\n steps=[('standardscalerwrapper',\n <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f20089ebba8>),\n ('randomforestclassifier',\n RandomForestClassifier(bootstrap=True,\n ccp_alpha=0.0,\n class_weight='balanced',\n criterion='entropy',\n max_dept...\n min_weight_fraction_leaf=0.0,\n n_estimators=25,\n n_jobs=1,\n oob_score=False,\n random_state=None,\n verbose=0,\n warm_start=False))],\n verbose=False))],\n flatten_transform=None,\n weights=[0.07692307692307693, 0.07692307692307693,\n 0.15384615384615385, 0.07692307692307693,\n 0.15384615384615385, 0.07692307692307693,\n 0.07692307692307693, 0.07692307692307693,\n 0.15384615384615385,\n 0.07692307692307693])"


In [15]:
from notebook.services.config import ConfigManager
cm = ConfigManager().update('notebook', {'limit_output': 4000})

In [20]:
best_model.steps[1][1].estimators

[('33',
  Pipeline(memory=None,
           steps=[('standardscalerwrapper',
                   <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f20089ebba8>),
                  ('randomforestclassifier',
                   RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                          class_weight='balanced',
                                          criterion='entropy', max_depth=None,
                                          max_features=0.3, max_leaf_nodes=None,
                                          max_samples=None,
                                          min_impurity_decrease=0.0,
                                          min_impurity_split=None,
                                          min_samples_leaf=0.01,
                                          min_samples_split=0.056842105263157895,
                                          min_weight_fraction_leaf=0.0,
                                          n_estim

In [21]:
best_model.steps[1][1].weights

[0.07692307692307693,
 0.07692307692307693,
 0.15384615384615385,
 0.07692307692307693,
 0.15384615384615385,
 0.07692307692307693,
 0.07692307692307693,
 0.07692307692307693,
 0.15384615384615385,
 0.07692307692307693]
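
The weights above can be read as the contribution of each base pipeline to the final vote. The sketch below shows soft voting in its usual form: a weighted average of the class probabilities from each model, followed by an argmax. It assumes `PreFittedSoftVotingClassifier` behaves like a standard soft-voting ensemble; the `soft_vote` helper and the per-model probabilities are hypothetical.

```python
# Weights copied from the output above (ten base pipelines).
weights = [0.07692307692307693, 0.07692307692307693, 0.15384615384615385,
           0.07692307692307693, 0.15384615384615385, 0.07692307692307693,
           0.07692307692307693, 0.07692307692307693, 0.15384615384615385,
           0.07692307692307693]


def soft_vote(probabilities, weights):
    """probabilities: one [p_class0, p_class1] list per base model.
    Returns the weighted-average probabilities and the winning class index."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per base model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return averaged, predicted


# Hypothetical P(DEATH_EVENT = 1) from each of the ten base models for one patient.
p_death = [0.9, 0.4, 0.7, 0.3, 0.8, 0.6, 0.2, 0.55, 0.65, 0.45]
model_probs = [[1 - p, p] for p in p_death]

averaged, predicted = soft_vote(model_probs, weights)
print(averaged, predicted)
```

Note that the weights sum to 1 and that three pipelines carry twice the weight of the others, so their predictions count double in the average.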

In [26]:
best_run.get_file_names()

['accuracy_table',
 'automl_driver.py',
 'azureml-logs/55_azureml-execution-tvmps_cd6ebea37fb15074a9b33916d913f72114927518eb03885be1ffe9bf88f2a6f6_d.txt',
 'azureml-logs/65_job_prep-tvmps_cd6ebea37fb15074a9b33916d913f72114927518eb03885be1ffe9bf88f2a6f6_d.txt',
 'azureml-logs/70_driver_log.txt',
 'azureml-logs/75_job_post-tvmps_cd6ebea37fb15074a9b33916d913f72114927518eb03885be1ffe9bf88f2a6f6_d.txt',
 'azureml-logs/process_info.json',
 'azureml-logs/process_status.json',
 'confusion_matrix',
 'explanation/1c7c5ead/classes.interpret.json',
 'explanation/1c7c5ead/eval_data_viz.interpret.json',
 'explanation/1c7c5ead/expected_values.interpret.json',
 'explanation/1c7c5ead/features.interpret.json',
 'explanation/1c7c5ead/global_names/0.interpret.json',
 'explanation/1c7c5ead/global_rank/0.interpret.json',
 'explanation/1c7c5ead/global_values/0.interpret.json',
 'explanation/1c7c5ead/local_importance_values.interpret.json',
 'explanation/1c7c5ead/per_class_names/0.interpret.json',
 'explanati

In [23]:
OUTPUT_DIR='./outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

model_file_name = 'heart_failure_automl.pkl'
joblib.dump(value=best_model, filename=os.path.join(OUTPUT_DIR, model_file_name))

['./outputs/heart_failure_automl.pkl']

In [27]:
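OUTPUT_DIR='./deploy'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Download the scoring script and the conda environment that AutoML generated
# for the best run (assumed to sit under 'outputs/' in the run's files).
# Dumping best_model into these paths with joblib would only write pickled
# models under the wrong file names.
scoring_file_name = 'scoring_file_v_1_0_0.py'
best_run.download_file('outputs/' + scoring_file_name, os.path.join(OUTPUT_DIR, scoring_file_name))

env_file_name = 'conda_env_v_1_0_0.yml'
best_run.download_file('outputs/' + env_file_name, os.path.join(OUTPUT_DIR, env_file_name))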

## Model Deployment

Because the best performer was this AutoML model, the endpoint had to be created with the files *conda_env_v_1_0_0.yml* and *scoring_file_v_1_0_0.py* generated in the output folder of the best model. I renamed the model, the yml file and the py file respectively as follows:

* heart_failure_automl.pkl
* myenv.yml
* score.py

Then I deployed the model to ACI.

In [4]:
from azureml.core.model import Model

model = Model.register(model_path = "./outputs/heart_failure_automl.pkl",
                       model_name = "heart_failure",
                       description = "Heart failure mortality model",
                       workspace = ws)

Registering model heart_failure


In [37]:
from azureml.core import Environment

# Instantiate environment
myenv = Environment.from_conda_specification(name = "automl-env",
                                             file_path = "deploy/myenv.yml")
myenv

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20200821.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": "2g"
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "automl-env",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"
 

In [38]:
from azureml.core.model import InferenceConfig

inference_config = InferenceConfig(entry_script='./deploy/score.py', environment=myenv)

In [47]:
from azureml.core.webservice import AciWebservice

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, 
                                                memory_gb=1,
                                                description='Heart failure mortality predictions',
                                                auth_enabled=True)

In [48]:
# Delete web service if already exists
from azureml.core.webservice import Webservice

aci_webservice_name = 'predict-heart-failure-aci'

try:
    service = Webservice(name=aci_webservice_name, workspace=ws)
    service.delete()
    
    print("The web service '", aci_webservice_name, "' has been deleted.", sep='')
except Exception as e:
    if (e.args[0].split(':', 1)[0] == 'WebserviceNotFound'):
        print("The web service '", aci_webservice_name, "' doesn't exist.", sep='')
    else:
        print(e.args[0])

The web service 'predict-heart-failure-aci' has been deleted.


In [49]:
aci_service = Model.deploy(ws,
                           name=aci_webservice_name,
                           models=[model],
                           inference_config=inference_config,
                           deployment_config=aci_config)

aci_service.wait_for_deployment(True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running............................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [50]:
print(aci_service.state)

Healthy


TODO: In the cell below, send a request to the web service you deployed to test it.

In [28]:
import requests
import json

In [29]:
scoring_uri = 'http://39a3d309-e0bc-452c-ad44-7a42188ef14b.westeurope.azurecontainer.io/score'
# If the service is authenticated, set the key or token
key = 'xROga569Q9AnvzOZMbSU1RAQqkSrlHId'
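
A key pasted into a notebook is visible to anyone who can read the file. As a minimal sketch of an alternative, the URI and key can be read from environment variables. The variable names `HF_SCORING_URI` and `HF_SERVICE_KEY` and both helper functions are hypothetical, chosen only for this example.

```python
import os


def load_service_credentials(env=os.environ):
    """Read the scoring URI and the (optional) key from environment variables."""
    uri = env.get("HF_SCORING_URI")   # hypothetical variable name
    key = env.get("HF_SERVICE_KEY")   # hypothetical variable name
    if not uri:
        raise RuntimeError("HF_SCORING_URI is not set")
    return uri, key


def build_headers(key=None):
    """Build the request headers; add the bearer token only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return headers


# Usage with a stand-in mapping instead of the real environment.
demo_env = {"HF_SCORING_URI": "http://example.invalid/score",
            "HF_SERVICE_KEY": "not-a-real-key"}
demo_uri, demo_key = load_service_credentials(demo_env)
demo_headers = build_headers(demo_key)
print(demo_uri, sorted(demo_headers))
```

When the deployed service object is still in scope, the same values can also be taken from it directly instead of being copied by hand.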

In [30]:
data = {"data":
        [
          {
            'age': 75,
            'anaemia': 0,
            'creatinine_phosphokinase': 582,
            'diabetes': 0,
            'ejection_fraction': 20,
            'high_blood_pressure': 1,
            'platelets': 265000,
            'serum_creatinine': 1.9,
            'serum_sodium': 130,
            'sex': 1,
            'smoking': 0,
            'time': 4
          },
      ]
    }

# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

In [31]:
# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
headers['Authorization'] = f'Bearer {key}'

In [32]:
# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

{"result": [1]}
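
The printed response uses double quotes, which suggests that the service returned a JSON string that itself contains JSON, so `resp.json()` yields a `str` rather than a `dict`. The sketch below decodes such a body defensively. The `parse_scoring_response` helper is hypothetical, and the `"error"` key is an assumption about how the scoring script reports failures.

```python
import json


def parse_scoring_response(body):
    """Decode a scoring response body that may be JSON-encoded once or twice."""
    decoded = json.loads(body)
    if isinstance(decoded, str):
        # The payload was a JSON string wrapping another JSON document.
        decoded = json.loads(decoded)
    if not isinstance(decoded, dict):
        raise ValueError(f"unexpected response payload: {decoded!r}")
    if "error" in decoded:  # assumed error shape, see lead-in
        raise RuntimeError(f"scoring failed: {decoded['error']}")
    return decoded["result"]


# Both encodings give the same list of predictions.
single = '{"result": [1]}'
double = json.dumps(single)
print(parse_scoring_response(single), parse_scoring_response(double))
```

With a real response, the function would be called as `parse_scoring_response(resp.text)`; a result of `[1]` means the model predicts `DEATH_EVENT = 1` for the submitted patient.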


TODO: In the cell below, print the logs of the web service and delete the service

In [55]:
print(aci_service.get_logs())

2021-01-07T18:49:41,843052109+00:00 - rsyslog/run 
2021-01-07T18:49:41,844312919+00:00 - iot-server/run 
2021-01-07T18:49:41,843658314+00:00 - gunicorn/run 
2021-01-07T18:49:41,844962724+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_9c539d20199ae6be65c41c0382029684/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_9c539d20199ae6be65c41c0382029684/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_9c539d20199ae6be65c41c0382029684/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_9c539d20199ae6be65c41c0382029684/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_9c539d20199ae6be65c41c0382029684/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd

In [None]:
aci_service.delete()

In [35]:
from winmltools import convert_sklearn
from winmltools.convert.common.data_types import FloatTensorType

In [37]:
# initial_types describes the *shape* of the input tensor, not sample values:
# one row with the 12 feature columns.
heart_failure_onnx = convert_sklearn(best_model, 7, name='heart_failure',
                                  initial_types=[('input', FloatTensorType([1, 12]))])

RuntimeError: Unable to find a shape calculator for type '<class 'azureml.automl.runtime.featurization.data_transformer.DataTransformer'>'.