# Azure Machine Learning Pipeline with AutoMLStep 
This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline.

### Capstone Project: Heart Failure Prediction

There are many kinds of machine learning algorithms that you can use to train a model, and it is not always easy to determine the most effective algorithm for your particular data and prediction requirements. Additionally, you can significantly affect the predictive performance of a model by preprocessing the training data, using techniques such as normalization and missing-feature imputation. In your quest to find the best model for your requirements, you may need to try many combinations of algorithms and preprocessing transformations, which takes a lot of time and compute resources.

Azure Machine Learning enables you to automate the comparison of models trained using different algorithms and preprocessing options. You can use the visual interface in Azure Machine Learning studio or the SDK to leverage this capability. The SDK gives you greater control over the settings for the automated machine learning experiment, but the visual interface is easier to use. In this lab, you'll explore automated machine learning using the SDK.

## Introduction
In this example we showcase how you can use AzureML Dataset to load data for AutoML via AML Pipeline. 

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing AmlCompute cluster to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using AmlCompute.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.19.0


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json

The first thing you need to do is connect to your workspace using the Azure ML SDK. If you do not have a workspace yet, you can create one in either of two ways:

- In the Microsoft Azure portal, create a new Machine Learning resource, specifying the subscription, resource group and workspace name.
- Use the Azure Machine Learning Python SDK to run code that creates a workspace (assuming the Azure ML SDK for Python is installed and a valid subscription ID is specified).

This notebook does not create a workspace; `Workspace.from_config()` loads an existing one from `config.json`.
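
`Workspace.from_config()` reads the workspace coordinates from a small JSON file. The sketch below writes and re-reads such a file with the standard library to show its shape. The three key names are the ones the SDK looks for; the values and the temporary directory are placeholders chosen for illustration, not real identifiers.

```python
import json
import os
import tempfile

# Placeholder values: substitute your own subscription, resource group and workspace.
config = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

# Written to a temporary directory so the sketch cannot overwrite a real config.json.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read the file back and check that every required key is present.
with open(config_path) as f:
    loaded = json.load(f)

required_keys = ("subscription_id", "resource_group", "workspace_name")
missing = [key for key in required_keys if key not in loaded]
print("Missing keys:", missing)
```

If `missing` is not empty, the real `config.json` is likely incomplete and should be downloaded again from the workspace overview page in the Azure portal.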


In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-136014
aml-quickstarts-136014
southcentralus
3e42d11f-d64d-4173-af9b-12ecaa1030b3


## Create an Azure ML experiment
Let's create an experiment named "AutoML-Pipeline" and define a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step (only that folder is snapshotted). Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping folders separate also lets a step be reused when nothing in its `source_directory` has changed.

*Udacity Note:* There is no need to create a new Azure ML experiment; re-use the experiment that was already created.


In [3]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing experiment name
experiment_name = 'AutoML-Pipeline'
project_folder = './pipeline-project3'

experiment = Experiment(ws, experiment_name)
experiment


Name,Workspace,Report Page,Docs Page
AutoML-Pipeline,quick-starts-ws-136014,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use `AmlCompute` as your training compute resource.

To create a cluster, you need a compute configuration that specifies the type of machine to use and the scaling behavior. You then choose a cluster name that is unique within the workspace; this name is used to address the cluster later.

The cluster parameters are:
* vm_size - this describes the virtual machine type and size used in the cluster. All machines in the cluster are the same type. You can get the list of VM sizes available in your region by using the CLI command:

```shell
az vm list-skus -o tsv
```
* min_nodes - this sets the minimum size of the cluster.  If you set the minimum to 0 the cluster will shut down all nodes while not in use.  Setting this number to a value higher than 0 will allow for faster start-up times, but you will also be billed when the cluster is not in use.
* max_nodes - this sets the maximum size of the cluster.  Setting this to a larger number allows for more concurrency and a greater distributed processing of scale-out jobs.


To create a **CPU** cluster now, run the cell below. The autoscale settings mean that the cluster will scale down to 0 nodes when inactive and up to 4 nodes when busy.

In [4]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# TODO: Create compute cluster
# max_nodes should be no greater than 4.

# choose a name for your cluster
cluster_name = "project-compute"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_V2', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

# can poll for a minimum number of nodes and for a specific timeout.
# if no min node count is provided it uses the scale settings for the cluster
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=10)

# use get_status() to get a detailed status for the current cluster.
print(compute_target.get_status().serialize())

Creating a new compute target...
Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-25T17:21:00.503000+00:00', 'errors': None, 'creationTime': '2021-01-25T17:20:54.580024+00:00', 'modifiedTime': '2021-01-25T17:21:10.614943+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS3_V2'}


### Environment
Azure Machine Learning environments specify the Python packages, environment variables, and software settings around your training and scoring scripts. In addition to Python, you can also configure PySpark, Docker and R for environments. Internally, environments result in Docker images that are used to run the training and scoring processes on the compute target. The environments are managed and versioned entities within your Machine Learning workspace that enable reproducible, auditable, and portable machine learning workflows across a variety of compute targets and compute types.

You can use an Environment object to:

- Develop your training script.
- Reuse the same environment on Azure Machine Learning Compute for model training at scale.
- Deploy your model with that same environment without being tied to a specific compute type.

In [5]:
# Define RunConfig for the compute
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Create a new runconfig object
aml_run_config = RunConfiguration()

# Use the aml_compute you created above. 
aml_run_config.target = compute_target

# Enable Docker
aml_run_config.environment.docker.enabled = True

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
aml_run_config.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas','scikit-learn','numpy'], 
    pip_packages=['azureml-sdk[automl,explain]', 'scipy'])

print ("Run configuration created.")

Run configuration created.


## Data

#### Overview
- Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
- Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

- Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

- People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management, where a machine learning model can be of great help.

- The dataset contains 299 training examples in a CSV file.
- https://www.kaggle.com/andrewmvd/heart-failure-clinical-data

- https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv

In [6]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
key = "Heartfailure Dataset"
description_text = "Heart failure DataSet for Kaggle or archive.ics.uci.edu machine-learning"

if key in ws.datasets.keys():
    # Reuse the dataset that is already registered in the Workspace
    dataset = ws.datasets[key]
else:
    # Create an AML Dataset from the public CSV file and register it in the Workspace
    heartfailure_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv'
    dataset = Dataset.Tabular.from_delimited_files(heartfailure_data)
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)

df = dataset.to_pandas_dataframe()
df.describe()

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 0.431438 | 581.839465 | 0.41806 | 38.083612 | 0.351171 | 263358.029264 | 1.39388 | 136.625418 | 0.648829 | 0.32107 | 130.26087 | 0.32107 |
| std | 11.894809 | 0.496107 | 970.287881 | 0.494067 | 11.834841 | 0.478136 | 97804.236869 | 1.03451 | 4.412477 | 0.478136 | 0.46767 | 77.614208 | 0.46767 |
| min | 40.0 | 0.0 | 23.0 | 0.0 | 14.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 51.0 | 0.0 | 116.5 | 0.0 | 30.0 | 0.0 | 212500.0 | 0.9 | 134.0 | 0.0 | 0.0 | 73.0 | 0.0 |
| 50% | 60.0 | 0.0 | 250.0 | 0.0 | 38.0 | 0.0 | 262000.0 | 1.1 | 137.0 | 1.0 | 0.0 | 115.0 | 0.0 |
| 75% | 70.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 303500.0 | 1.4 | 140.0 | 1.0 | 1.0 | 203.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 80.0 | 1.0 | 850000.0 | 9.4 | 148.0 | 1.0 | 1.0 | 285.0 | 1.0 |
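
The `DEATH_EVENT` mean in the summary above also tells us how imbalanced the classes are, which is useful context for an `accuracy` metric. The following sketch uses only the two numbers printed by `describe()` (299 rows, mean 0.32107) to recover the class counts and the accuracy of a trivial model that always predicts the majority class. It is a back-of-the-envelope check, not part of the original notebook.

```python
# Values read from the describe() output: 299 rows, DEATH_EVENT mean of 0.32107.
n_rows = 299
positive_rate = 0.32107

# For a 0/1 column, the mean times the row count gives the number of 1s.
n_positive = round(n_rows * positive_rate)   # rows with DEATH_EVENT == 1
n_negative = n_rows - n_positive             # rows with DEATH_EVENT == 0

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_positive, n_negative) / n_rows

print(n_positive, n_negative, round(baseline_accuracy, 4))
```

Any model worth keeping should score clearly above this baseline on the chosen metric.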


### Review the Dataset Result

You can peek at the result of a TabularDataset at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the TabularDataset, which makes it fast even against large datasets.

`TabularDataset` objects are composed of a list of transformation steps (optional).

In [7]:
dataset.take(5).to_pandas_dataframe()

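| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |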


## AutoML Configuration for the Machine Learning Experiment
This creates a general AutoML settings object. These inputs must match what was used when training in the portal; for example, `label_column_name` has to be `DEATH_EVENT`.

Namespace: `azureml.train.automl.automlconfig.AutoMLConfig`

Use the AutoMLConfig class to configure parameters for automated machine learning training. Automated machine learning iterates over many combinations of machine learning algorithms and hyperparameter settings. It then finds the best-fit model based on your chosen accuracy metric. Configuration allows for specifying:

- Task type (classification, regression, forecasting)
- Number of algorithm iterations and maximum time per iteration
- Accuracy metric to optimize
- Algorithms to blacklist/whitelist
- Number of cross-validations
- Compute targets
- Training data
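
To make the "number of cross-validations" setting concrete, the sketch below shows how 299 rows divide into five folds, the value this notebook passes as `n_cross_validations`. It assumes the common even-split convention (the one scikit-learn's `KFold` uses, where the first `n % k` folds get one extra row). AutoML may additionally shuffle or stratify the rows, which this sketch ignores; `fold_sizes` is a hypothetical helper, not an SDK function.

```python
def fold_sizes(n_samples, n_folds):
    """Split n_samples into n_folds near-equal parts.

    The first (n_samples % n_folds) folds receive one extra row.
    """
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(299, 5)

# Each fold is used once for validation while the remaining rows are used for training.
for fold, n_valid in enumerate(sizes):
    n_train = sum(sizes) - n_valid
    print(f"fold {fold}: train on {n_train} rows, validate on {n_valid} rows")
```

With a dataset this small, every validation fold holds only about 60 rows, so the metric reported for a single fold can vary noticeably; averaging over the folds is what makes the comparison between child runs usable.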

In [8]:
import logging
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes": 30,
    # The cluster above has max_nodes=4; iterations beyond the available nodes may wait in a queue
    "max_concurrent_iterations": 5,
    "primary_metric": 'accuracy',
    "n_cross_validations": 5
}

automl_config = AutoMLConfig(compute_target=compute_target,
                             model_explainability=True,  # Generate feature importance
                             task="classification",
                             training_data=dataset,
                             label_column_name="DEATH_EVENT",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization="auto",
                             debug_log="automl_errors.log",
                             **automl_settings
                            )

print("AutoML config created")

AutoML config created
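
The configuration requests 5 concurrent iterations while the cluster created earlier scales to at most 4 nodes. The sketch below estimates how that interacts, under the assumption that one AutoML iteration occupies one cluster node at a time. The helper names and the figure of 40 iterations are illustrative only and do not come from the SDK or from this run.

```python
import math


def effective_parallelism(max_concurrent_iterations, max_nodes):
    # Assumption: one iteration occupies one node, so the smaller value is the real limit.
    return min(max_concurrent_iterations, max_nodes)


def min_batches(n_iterations, parallel):
    # Lower bound on sequential "waves" needed to run all iterations.
    return math.ceil(n_iterations / parallel)


parallel = effective_parallelism(max_concurrent_iterations=5, max_nodes=4)
batches = min_batches(n_iterations=40, parallel=parallel)  # 40 is a hypothetical count

print(f"{parallel} iterations in parallel, at least {batches} waves for 40 iterations")
```

Under this assumption, raising `max_concurrent_iterations` above the cluster's `max_nodes` does not speed up the experiment; the extra iterations simply wait for a free node.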


### Create Pipeline and AutoMLStep

You can define outputs for the AutoMLStep using TrainingOutput.

In [9]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create an AutoMLStep.

In [10]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [11]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [12]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [ab777b1c][c4445a3e-90e2-46e2-9573-e5781656638f], (This step will run and generate new outputs)
Submitted PipelineRun b998504e-ca66-4435-aa88-b5fe627137c6
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/AutoML-Pipeline/runs/b998504e-ca66-4435-aa88-b5fe627137c6?wsid=/subscriptions/3e42d11f-d64d-4173-af9b-12ecaa1030b3/resourcegroups/aml-quickstarts-136014/workspaces/quick-starts-ws-136014


In [13]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [14]:
pipeline_run.wait_for_completion()

PipelineRunId: b998504e-ca66-4435-aa88-b5fe627137c6
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/AutoML-Pipeline/runs/b998504e-ca66-4435-aa88-b5fe627137c6?wsid=/subscriptions/3e42d11f-d64d-4173-af9b-12ecaa1030b3/resourcegroups/aml-quickstarts-136014/workspaces/quick-starts-ws-136014

PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'b998504e-ca66-4435-aa88-b5fe627137c6', 'status': 'Completed', 'startTimeUtc': '2021-01-25T17:22:29.371108Z', 'endTimeUtc': '2021-01-25T17:56:31.871497Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://mlstrg136014.blob.core.windows.net/azureml/ExperimentRun/dcid.b998504e-ca66-4435-aa88-b5fe627137c6/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=MpSRSMJAfDcSKB3PWwu0gklY%2F4d47e4lvWWBj1ejpg4%3D&st=2021-01-25T17%3A55%3A20Z&se=2021-0

'Finished'

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the above run can be used as inputs to other steps in the pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [15]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/cf895164-f343-4c86-b407-658a28e2c609/metrics_data
Downloaded azureml/cf895164-f343-4c86-b407-658a28e2c609/metrics_data, 1 files out of an estimated total of 1


In [16]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,cf895164-f343-4c86-b407-658a28e2c609_19,cf895164-f343-4c86-b407-658a28e2c609_20,cf895164-f343-4c86-b407-658a28e2c609_7,cf895164-f343-4c86-b407-658a28e2c609_18,cf895164-f343-4c86-b407-658a28e2c609_16,cf895164-f343-4c86-b407-658a28e2c609_14,cf895164-f343-4c86-b407-658a28e2c609_32,cf895164-f343-4c86-b407-658a28e2c609_1,cf895164-f343-4c86-b407-658a28e2c609_23,cf895164-f343-4c86-b407-658a28e2c609_26,...,cf895164-f343-4c86-b407-658a28e2c609_25,cf895164-f343-4c86-b407-658a28e2c609_27,cf895164-f343-4c86-b407-658a28e2c609_24,cf895164-f343-4c86-b407-658a28e2c609_2,cf895164-f343-4c86-b407-658a28e2c609_21,cf895164-f343-4c86-b407-658a28e2c609_11,cf895164-f343-4c86-b407-658a28e2c609_9,cf895164-f343-4c86-b407-658a28e2c609_10,cf895164-f343-4c86-b407-658a28e2c609_17,cf895164-f343-4c86-b407-658a28e2c609_38
recall_score_macro,[0.6902380952380952],[0.6526785714285714],[0.8282994186046512],[0.8214244186046512],[0.7683887043189369],[0.781437569213732],[0.8086953211517164],[0.8060125968992248],[0.8066756644518274],[0.7948857973421927],...,[0.5],[0.8104851882613511],[0.6911309523809523],[0.7993625415282392],[0.7935804263565892],[0.7546601605758584],[0.7891625138427464],[0.7780274086378738],[0.7578107696566999],[0.843095238095238]
norm_macro_recall,[0.3804761904761904],[0.3053571428571428],[0.6565988372093023],[0.6428488372093024],[0.5367774086378738],[0.562875138427464],[0.617390642303433],[0.6120251937984496],[0.6133513289036545],[0.5897715946843853],...,[0.0],[0.620970376522702],[0.3822619047619048],[0.5987250830564783],[0.5871608527131784],[0.5093203211517165],[0.5783250276854928],[0.5560548172757475],[0.5156215393133998],[0.6861904761904762]
recall_score_micro,[0.7861016949152542],[0.7593785310734463],[0.8529943502824858],[0.846271186440678],[0.8194350282485875],[0.8193220338983052],[0.8293220338983052],[0.8259322033898304],[0.836045197740113],[0.7928248587570621],...,[0.67909604519774],[0.8293785310734464],[0.786045197740113],[0.8327683615819209],[0.8225988700564972],[0.8159322033898304],[0.8328248587570621],[0.8228813559322035],[0.8027118644067797],[0.8763276836158193]
weighted_accuracy,[0.8529296284801522],[0.8304304344654222],[0.869763825501613],[0.8648422820161216],[0.8529511549433009],[0.8452847516489623],[0.8436998231407419],[0.8406325321291572],[0.8568042475897724],[0.7919799837030644],...,[0.7971865900207855],[0.8433126633454553],[0.8507805993101212],[0.855844459474665],[0.8430180754201076],[0.8577572539704719],[0.8617803853973852],[0.8527182497239538],[0.8359612739636877],[0.8986147612977045]
precision_score_macro,[0.8345700660525457],[0.7956948016485038],[0.8461362717283162],[0.8430596396357266],[0.8238351931718737],[0.803866660142544],[0.8116719366570274],[0.8086642059265495],[0.8215661708015339],[0.7648876875422452],...,[0.33954802259887],[0.8077294562162983],[0.82360348583878],[0.8131976090676611],[0.8004541617678583],[0.8101739780233569],[0.8229916618441993],[0.8142224331870086],[0.7998373741915941],[0.8904235593903296]
matthews_correlation,[0.5009222091990708],[0.421361295383653],[0.6723401507849285],[0.660918858262345],[0.5843747389719086],[0.5831062191672067],[0.6181304554244639],[0.6132552509330647],[0.6265780901590804],[0.5578968880284726],...,[0.0],[0.6162555043440043],[0.49383169497676366],[0.611274773305597],[0.592055816682359],[0.560987454321857],[0.6088770519165811],[0.5894447917430835],[0.551473312050151],[0.7292901226073087]
average_precision_score_macro,[0.8917444055128378],[0.8110750544063533],[0.8735790968079961],[0.876148199188022],[0.86680656318063],[0.8817166087727746],[0.8902935227122153],[0.8826936933050062],[0.885189519726541],[0.8404352577523998],...,[0.5],[0.8745430199646025],[0.8632870672667259],[0.8780603714207388],[0.8811759437979425],[0.8905099842087789],[0.8504776044109501],[0.8692364108669051],[0.8349810331652406],[0.8992534574255965]
recall_score_weighted,[0.7861016949152542],[0.7593785310734463],[0.8529943502824858],[0.846271186440678],[0.8194350282485875],[0.8193220338983052],[0.8293220338983052],[0.8259322033898304],[0.836045197740113],[0.7928248587570621],...,[0.67909604519774],[0.8293785310734464],[0.786045197740113],[0.8327683615819209],[0.8225988700564972],[0.8159322033898304],[0.8328248587570621],[0.8228813559322035],[0.8027118644067797],[0.8763276836158193]
average_precision_score_weighted,[0.9148328887528863],[0.8523224656903533],[0.9019568505623358],[0.9030074138792704],[0.8988703388534495],[0.9054274856155876],[0.9145246244652776],[0.9103506263314479],[0.9094426231686791],[0.8731336684868654],...,[0.5796089246385138],[0.8983176069060608],[0.8914114913261638],[0.9080181997167036],[0.9105885175339221],[0.9147681542021241],[0.8852746051537626],[0.8977241917784641],[0.8753573638077337],[0.9231224238066558]
precision_score_weighted,[0.8180139899085489],[0.7934134443855058],[0.8673122721133508],[0.8654405638985596],[0.8432280916122483],[0.8330629529595084],[0.846753391640973],[0.8403871377947292],[0.8483801057970274],[0.8192054300204766],...,[0.4689005075169971],[0.8470819543723203],[0.8166657656659652],[0.8402669049699378],[0.8361362517752019],[0.8235012823303697],[0.8412741046615324],[0.8383897677212685],[0.8226124253630431],[0.8950471537693939]
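
The table is wide, so it helps to rank the child runs by a single metric. The sketch below does this with the standard library on a three-run excerpt. The nesting `{run_id: {metric: [value]}}` is inferred from how the DataFrame above was built, the values are copied from the `recall_score_micro` row (which equals accuracy for single-label classification), and the run ids are shortened to their numeric suffix. `rank_runs` is a hypothetical helper.

```python
import json

# Excerpt in the inferred shape of the metrics file: {child_run_id: {metric_name: [value]}}.
metrics_json = json.dumps({
    "_38": {"recall_score_micro": [0.8763276836158193]},
    "_7": {"recall_score_micro": [0.8529943502824858]},
    "_25": {"recall_score_micro": [0.67909604519774]},
})


def rank_runs(raw_json, metric):
    """Return (run_id, score) pairs sorted from best to worst for one metric."""
    metrics = json.loads(raw_json)
    scores = {
        run_id: values[metric][0]          # each metric is stored as a one-element list
        for run_id, values in metrics.items()
        if metric in values
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_runs(metrics_json, "recall_score_micro")
best_run_id, best_score = ranking[0]
print(best_run_id, round(best_score, 4))
```

On the full file, the same function can be called with the string read from the downloaded metrics output instead of the excerpt.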


### Retrieve the Best Model

In [17]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/cf895164-f343-4c86-b407-658a28e2c609/model_data
Downloaded azureml/cf895164-f343-4c86-b407-658a28e2c609/model_data, 1 files out of an estimated total of 1


In [18]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                  min_weight_fraction_leaf=0.0,
                                                                                                  n_estimators=200,
                                                                                                  n_jobs=1,
                                        

In [19]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=None,
                                estimators=[('7',
                                             Pipeline(memory=None,
                                                      steps=[('sparsenormalizer',
                                                              <azureml.automl.runtime.shared.model_wrappers.SparseNormalizer object at 0x7f441f00ec18>),
                                                             ('xgboostclassifier',
                                                              XGBoostClassifier(base_score=0.

In [20]:
# View the details of the AutoML run
from azureml.train.automl.run import AutoMLRun

for step in pipeline_run.get_steps():
    automl_step_run_id = step.id
    print(step.name)
    print(automl_step_run_id)
    break

automl_run = AutoMLRun(experiment = experiment, run_id=automl_step_run_id)
RunDetails(automl_run).show()

automl_module
cf895164-f343-4c86-b407-658a28e2c609


_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

In [21]:
# Get best model
best_run, fitted_model = automl_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: AutoML-Pipeline,
Id: cf895164-f343-4c86-b407-658a28e2c609_38,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                  min_weight_fraction_leaf=0.0,
                                                                                                  n_estimators=200,
                          

### Save best model

In [22]:
# Save best model
import joblib
model_name = 'best_run_automl.pkl'
model_dir = 'outputs'
# Create the output folder if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

# Build the path with os.path.join to avoid a doubled separator
filename = os.path.join(model_dir, model_name)
joblib.dump(fitted_model, filename)

['outputs/best_run_automl.pkl']

### Explore the results

In [25]:
# functions to download output to local and fetch as dataframe
def get_download_path(download_path, output_name):
    output_folder = os.listdir(download_path + '/azureml')[0]
    path =  download_path + '/azureml/' + output_folder + '/' + output_name
    return path

def fetch_df(step, output_name):
    output_data = step.get_output_data(output_name)    
    download_path = './outputs/' + output_name
    output_data.download(download_path, overwrite=True)
    df_path = get_download_path(download_path, output_name) + '/processed.parquet'
    return pd.read_parquet(df_path)

### Test the Model
#### Load Test Data
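The test data should go through the same preparation steps as the training data; otherwise, prediction might fail at the preprocessing step. Note that the cell below reloads the same CSV file that was used for training, so the scores that follow measure fit on data the model has already seen rather than performance on unseen data.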

In [26]:
dataset_test = Dataset.Tabular.from_delimited_files(path='https://archive.ics.uci.edu/ml/machine-learning-databases/00519/heart_failure_clinical_records_dataset.csv')
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['DEATH_EVENT'])]

y_test = df_test['DEATH_EVENT']
X_test = df_test.drop(['DEATH_EVENT'], axis=1)
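
The test data above is loaded from the same URL as the training data. One way to obtain a held-out estimate instead is to set rows aside before training. The following is a minimal standard-library sketch of a stratified split, using only the class counts of this dataset (203 negatives, 96 positives). `stratified_split`, the 20% test fraction and the seed are illustrative assumptions, not something the notebook does.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), keeping class proportions similar in both."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Stand-in labels with the class counts of this dataset.
labels = [0] * 203 + [1] * 96
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
```

The training indices would then be used to build the dataset passed to AutoML, and the test indices kept back for the evaluation cells that follow.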

#### Testing Our Best Fitted Model

We will use a confusion matrix to see how our model performs.

In [28]:
from sklearn.metrics import confusion_matrix
y_predict = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)

In [29]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

| | 0 | 1 |
|---|---|---|
| 0 | 199 | 4 |
| 1 | 17 | 79 |


In [30]:
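from sklearn.metrics import roc_auc_score, accuracy_score
# NOTE: y_predict holds hard class labels, so this is the AUC of a single operating point.
# If the model exposes predict_proba, passing the positive-class probabilities gives a threshold-independent AUC.
print("AUC test AutoML model: " + str(roc_auc_score(y_test, y_predict)))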

AUC test AutoML model: 0.9016061165845648
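
The numbers in the confusion matrix are enough to derive the usual classification metrics by hand. The sketch below does so in plain Python, relying on scikit-learn's convention that rows are true labels and columns are predicted labels. It also shows that an AUC computed from hard 0/1 predictions is the mean of sensitivity and specificity, which reproduces the value printed above.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predicted labels.
cm = [[199, 4],
      [17, 79]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)      # recall for DEATH_EVENT == 1
specificity = tn / (tn + fp)      # recall for DEATH_EVENT == 0
precision = tp / (tp + fp)

# With hard predictions, the ROC curve has one operating point, and its AUC
# is the average of sensitivity and specificity (the balanced accuracy).
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f} "
      f"balanced_accuracy={balanced_accuracy:.4f}")
```

The model misses 17 of the 96 positive cases, so sensitivity is noticeably lower than specificity; for a mortality prediction task that asymmetry matters more than the headline accuracy.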


In [31]:
best_run.properties

{'runTemplate': 'automl_child',
 'pipeline_id': '__AutoML_Ensemble__',
 'pipeline_spec': '{"pipeline_id":"__AutoML_Ensemble__","objects":[{"module":"azureml.train.automl.ensemble","class_name":"Ensemble","spec_class":"sklearn","param_args":[],"param_kwargs":{"automl_settings":"{\'task_type\':\'classification\',\'primary_metric\':\'accuracy\',\'verbosity\':20,\'ensemble_iterations\':15,\'is_timeseries\':False,\'name\':\'placeholder\',\'compute_target\':\'project-compute\',\'subscription_id\':\'3e42d11f-d64d-4173-af9b-12ecaa1030b3\',\'region\':\'southcentralus\',\'spark_service\':None}","ensemble_run_id":"cf895164-f343-4c86-b407-658a28e2c609_38","experiment_name":"AutoML-Pipeline","workspace_name":"quick-starts-ws-136014","subscription_id":"3e42d11f-d64d-4173-af9b-12ecaa1030b3","resource_group_name":"aml-quickstarts-136014"}}]}',
 'training_percent': '100',
 'predicted_cost': None,
 'iteration': '38',
 '_aml_system_scenario_identification': 'Remote.Child',
 '_azureml.ComputeTargetType': 

### Scoring File

In [32]:
with open('inference/atoml_scoring_service.py') as f:
    print(f.read())

import json
import logging
import os
import pickle
import numpy as np
import pandas as pd
import joblib

import azureml.automl.core
from azureml.automl.core.shared import logging_utilities, log_server
from azureml.telemetry import INSTRUMENTATION_KEY

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType


input_sample = pd.DataFrame({"age": pd.Series([0.0], dtype="float64"), "anaemia": pd.Series([0], dtype="int64"), "creatinine_phosphokinase": pd.Series([0], dtype="int64"), "diabetes": pd.Series([0], dtype="int64"), "ejection_fraction": pd.Series([0], dtype="int64"), "high_blood_pressure": pd.Series([0], dtype="int64"), "platelets": pd.Series([0.0], dtype="float64"), "serum_creatinine": pd.Series([0.0], dtype="float64"), "serum_sodium": pd.Series([0], dtype="int64"), "sex": pd.Series([0], dtype=
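
The `input_sample` in the scoring file defines the columns a request must carry. The sketch below builds a JSON request body for one patient, using the first row of the dataset shown earlier. Two things are assumptions: the `{"data": [...]}` envelope, because the `run` function is cut off in the printout above, and the `smoking` and `time` fields, which are taken from the dataset columns because the printed `input_sample` is truncated after `sex`.

```python
import json

# One record with the 12 feature columns (the label DEATH_EVENT is not sent).
record = {
    "age": 75.0,
    "anaemia": 0,
    "creatinine_phosphokinase": 582,
    "diabetes": 0,
    "ejection_fraction": 20,
    "high_blood_pressure": 1,
    "platelets": 265000.0,
    "serum_creatinine": 1.9,
    "serum_sodium": 130,
    "sex": 1,
    "smoking": 0,
    "time": 4,
}

# Assumed envelope: a top-level "data" key holding a list of records.
payload = json.dumps({"data": [record]})

# Round-trip to confirm the payload is valid JSON with the expected structure.
decoded = json.loads(payload)
print(len(decoded["data"]), "record(s),", len(decoded["data"][0]), "features")
```

Once the service is deployed, a string like `payload` would be the body of the scoring request; check it against the full scoring script before relying on the envelope.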

### Environment

In [33]:
import sklearn
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

myenv = Environment.from_conda_specification(name="env", file_path="inference/conda_env.yml")

In [34]:
with open('inference/conda_env.yml') as f:
    print(f.read())

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
- python=3.6.2

- pip:
  - azureml-train-automl-runtime==1.19.0
  - inference-schema
  - azureml-interpret==1.19.0
  - azureml-defaults==1.19.0
- numpy>=1.16.0,<1.19.0
- pandas==0.25.1
- scikit-learn==0.22.1
- joblib==0.14.1
- py-xgboost<=0.90
- fbprophet==0.5
- holidays==0.9.11
- psutil>=5.2.2,<6.0.0
channels:
- anaconda
- conda-forge



### Deploy Model using ACI

### Register Model

In [35]:
from azureml.core import Model
model = Model.register(workspace=ws,
                       model_name="Automl-Heartfailure-Model",
                       model_path='./outputs/best_run_automl.pkl')
print(model.name, model.id, model.version, sep='\t')

Registering model Automl-Heartfailure-Model
Automl-Heartfailure-Model	Automl-Heartfailure-Model:1	1


In [36]:
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Pair the generated scoring script with the environment built from conda_env.yml
inference_config = InferenceConfig(entry_script='inference/atoml_scoring_service.py', environment=myenv)

# Small ACI container with key-based authentication enabled
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                                auth_enabled=True,
                                                tags={'name': 'heartfailure-atoml-service'},
                                                description='Heart service for heart failure Classification model')
# Model.deploy is a static method, so call it on the class rather than the instance
service = Model.deploy(workspace=ws, name="automl-deploy",
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running........................................................................................................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [37]:
service.update(enable_app_insights=True)
print(service.state)

Healthy


In [41]:
print(service.scoring_uri)

http://e3ec101f-a6f6-4384-ae18-9d43a816ad30.southcentralus.azurecontainer.io/score


### Consume Model Endpoint

In [42]:
import requests
import json

# URL for the web service, should be similar to:
# 'http://8530a665-66f3-49c8-a953-b82a2d312917.eastus.azurecontainer.io/score'
scoring_uri = 'http://e3ec101f-a6f6-4384-ae18-9d43a816ad30.southcentralus.azurecontainer.io/score'
# If the service is authenticated, set the key or token
key = '4bqFOooeWbe6nlLl9VFHDY8agbCV7Ext'

# Two sets of data to score, so we get two results back
data = {"data":
        [
            {
                "age": 75,
                "anaemia": 0,
                "creatinine_phosphokinase": 582,
                "diabetes": 0,
                "ejection_fraction": 20,
                "high_blood_pressure": 1,
                "platelets": 265000,
                "serum_creatinine": 1.9,
                "serum_sodium": 130,
                "sex": 1,
                "smoking": 0,
                "time": 4
            },
            {
                "age": 90,
                "anaemia": 1,
                "creatinine_phosphokinase": 60,
                "diabetes": 1,
                "ejection_fraction": 50,
                "high_blood_pressure": 0,
                "platelets": 226000,
                "serum_creatinine": 1,
                "serum_sodium": 134,
                "sex": 1,
                "smoking": 0,
                "time": 30
            },
        ]
        }
# Convert to JSON string
input_data = json.dumps(data)
with open("data.json", "w") as _f:
    _f.write(input_data)

# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
headers['Authorization'] = f'Bearer {key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.json())

{"result": [1, 1]}
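
The request above can be wrapped in a small helper that checks each record for the expected features before anything is sent. This is a sketch under assumptions: `build_scoring_request` and `EXPECTED_FEATURES` are hypothetical names, and the feature list is copied from the payload in the cell above rather than read from the model.

```python
import json

# Feature names copied from the request payload shown in the cell above.
EXPECTED_FEATURES = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
]


def build_scoring_request(records, key=None):
    """Validate the records and return (headers, body) for a POST to the scoring URI."""
    for index, record in enumerate(records):
        missing = [name for name in EXPECTED_FEATURES if name not in record]
        if missing:
            raise ValueError(f"record {index} is missing features: {missing}")
    headers = {"Content-Type": "application/json"}
    if key:
        # Key-based authentication uses a bearer token header.
        headers["Authorization"] = f"Bearer {key}"
    return headers, json.dumps({"data": records})


sample = {
    "age": 75, "anaemia": 0, "creatinine_phosphokinase": 582, "diabetes": 0,
    "ejection_fraction": 20, "high_blood_pressure": 1, "platelets": 265000,
    "serum_creatinine": 1.9, "serum_sodium": 130, "sex": 1, "smoking": 0, "time": 4,
}
headers, body = build_scoring_request([sample], key="<your-key>")
print(headers["Content-Type"], len(body) > 0)
```

The returned pair can be passed straight to `requests.post(scoring_uri, data=body, headers=headers)`. When the `service` object is still in scope, `service.get_keys()` returns the primary and secondary keys, which avoids pasting a key into the notebook.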


### Logs

In [43]:
# Print the logs of the web service
print(service.get_logs())

2021-01-25T18:27:48,690798707+00:00 - gunicorn/run 
2021-01-25T18:27:48,692608726+00:00 - iot-server/run 
2021-01-25T18:27:48,694276235+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_7e68b09b71671c954735ad1b2953002f/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7e68b09b71671c954735ad1b2953002f/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7e68b09b71671c954735ad1b2953002f/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7e68b09b71671c954735ad1b2953002f/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_7e68b09b71671c954735ad1b2953002f/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2021-01-25T18:27:48,696286867+00:00 - rsyslog/run 
rsyslogd

In [None]:
from azureml.core import Model
best_run.register_model(model_path='outputs/model.pkl', model_name='automl_model', #try changing pkl to joblib
                        tags={'Training context':'Auto ML'},
                        properties={'Accuracy': best_run_metrics['accuracy']})

## Publish and run from REST endpoint

Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.

Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.


In [44]:
published_pipeline = pipeline_run.publish_pipeline(
    name="atml-Heartfailure Train", description="Training Heartfailure pipeline", version="1.0")

published_pipeline


| Name | Id | Status | Endpoint |
|---|---|---|---|
| atml-Heartfailure Train | 342c4c9c-f803-4387-8a57-07d4dca8c52b | Active | REST Endpoint |


Authenticate once again to retrieve the `auth_header` so that the endpoint can be used.

In [45]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()


Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header. Additionally, add a JSON payload object with the experiment name. If the pipeline defined any `PipelineParameter` objects in its step configuration, their values would be passed in the same payload; the request here sends only the experiment name.

Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.


In [46]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "atoml-pipeline-rest-endpoint"}
                        )

In [47]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  0556506d-d6f2-4ba2-912f-6a5656d27f0f
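
The JSON payload for the pipeline endpoint can be assembled by a small function so that optional pipeline parameters are included only when present. This is a sketch: `build_pipeline_payload` is a hypothetical helper, and `example_param` is a made-up parameter name, since the pipeline in this notebook is triggered with the experiment name alone. Parameter values are sent under the `ParameterAssignments` key.

```python
def build_pipeline_payload(experiment_name, parameters=None):
    """Build the JSON body for a POST to a published pipeline's REST endpoint."""
    if not experiment_name:
        raise ValueError("experiment_name must be a non-empty string")
    payload = {"ExperimentName": experiment_name}
    if parameters:
        # Values for PipelineParameter objects, keyed by parameter name.
        payload["ParameterAssignments"] = dict(parameters)
    return payload


# Same payload as the request above.
print(build_pipeline_payload("atoml-pipeline-rest-endpoint"))

# With a hypothetical pipeline parameter.
print(build_pipeline_payload("atoml-pipeline-rest-endpoint", {"example_param": 5}))
```

The result is what you would pass as the `json=` argument of `requests.post`, alongside `headers=auth_header`.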


### Delete Service and Compute Cluster Cleanup

In [76]:
service.delete()

In [77]:
compute_target.delete()