Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Introduction
This notebook shows how to use an Azure ML `Dataset` to load data for an AutoML run submitted through an Azure ML pipeline.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.19.0


## Initialize Workspace
Initialize a workspace object from persisted configuration.

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-134667
aml-quickstarts-134667
southcentralus
f5091c60-1c3c-430f-8d81-d802f6bf2414
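`Workspace.from_config()` reads the workspace details from a `config.json` file. The following is a minimal sketch of that file's shape, using only the standard library. The three key names are the ones the Azure ML SDK v1 documents; the values and the helper `load_workspace_config` are placeholders invented for illustration.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Hypothetical helper: load a config.json and check the expected keys."""
    with open(path) as f:
        config = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in config]
    if missing:
        raise KeyError("config.json is missing: " + ", ".join(missing))
    return config


# Placeholder values; substitute your own workspace details.
sample = {
    "subscription_id": "YOUR_SUBSCRIPTION_ID",
    "resource_group": "YOUR_RESOURCE_GROUP",
    "workspace_name": "YOUR_WORKSPACE_NAME",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w") as f:
        json.dump(sample, f, indent=4)
    loaded = load_workspace_config(config_path)

print(sorted(loaded))
```

Checking the keys up front gives a clearer error than a failed workspace lookup later in the notebook.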


## Create an Azure ML experiment

The best practice is to keep each step's script and its dependent files in a separate folder and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, a small, step-specific folder also preserves step reuse when nothing in it has changed.
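The idea that any file change invalidates a folder's snapshot can be pictured with a content hash over the folder. This is only an illustration: the source does not show how Azure ML computes snapshots, and `directory_fingerprint` is a hypothetical helper, not an SDK function.

```python
import hashlib
import os
import tempfile


def directory_fingerprint(folder):
    """Hash the relative path and contents of every file under `folder`."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # keep the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, folder).encode("utf-8"))
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as step_dir:
    script = os.path.join(step_dir, "train.py")
    with open(script, "w") as f:
        f.write("print('v1')\n")
    before = directory_fingerprint(step_dir)
    unchanged = directory_fingerprint(step_dir)

    # Editing any file in the folder changes the fingerprint.
    with open(script, "w") as f:
        f.write("print('v2')\n")
    after = directory_fingerprint(step_dir)

print(before == unchanged, before == after)
```

The fewer unrelated files a step's folder contains, the less often its fingerprint changes for reasons that have nothing to do with the step.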

In [3]:
# Choose a name for the run history container in the workspace.
experiment_name = 'Bank-run'
project_folder = './pipeline-project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
Bank-run,quick-starts-ws-134667,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. Here I am using the default `AmlCompute` as my training compute resource.

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "auto-ml"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
print("Cluster details: ", compute_target.get_status().serialize())

Creating
Succeeded................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"
Cluster details:  {'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-14T13:52:21.181000+00:00', 'errors': None, 'creationTime': '2021-01-14T13:52:14.185495+00:00', 'modifiedTime': '2021-01-14T13:52:30.371611+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 4, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}
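One plausible reading of the "Wait timeout has been reached" message above is that the cell asked for `min_node_count = 1` while the cluster reports zero nodes and a `minNodeCount` of 0, so no node starts until a job is submitted. The provisioning state itself is "Succeeded". The sketch below checks that reading against values copied from the output; `summarize_wait` is a hypothetical helper, not part of the SDK.

```python
# Values copied from the "Cluster details" output above (trimmed).
cluster_status = {
    "currentNodeCount": 0,
    "targetNodeCount": 0,
    "allocationState": "Steady",
    "provisioningState": "Succeeded",
    "scaleSettings": {"minNodeCount": 0, "maxNodeCount": 4},
}


def summarize_wait(status, requested_min_nodes):
    """Hypothetical helper: describe why a wait for nodes may have timed out."""
    provisioned = status["provisioningState"] == "Succeeded"
    enough_nodes = status["currentNodeCount"] >= requested_min_nodes
    if provisioned and enough_nodes:
        return "ready"
    if provisioned:
        return "provisioned, but only {} of {} requested nodes are up".format(
            status["currentNodeCount"], requested_min_nodes)
    return "still provisioning"


print(summarize_wait(cluster_status, requested_min_nodes=1))
print(summarize_wait(cluster_status, requested_min_nodes=0))
```

Under this reading the timeout is harmless here: the pipeline run later in the notebook still completes on the same cluster.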


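## Dataset
This notebook uses the Bank Marketing dataset. The cell below reuses the dataset if it is already registered in the workspace; otherwise it creates one from a public CSV file and registers it.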

In [5]:
key = "BankMarketing Dataset"
description_text = "Bank Marketing DataSet for 2nd project"

if key in ws.datasets:
    # Reuse the dataset that is already registered in the workspace
    dataset = ws.datasets[key]
else:
    # Create an AML Dataset from the public CSV and register it in the workspace
    example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0
mean,40.040212,257.335205,2.56173,962.17478,0.17478,0.076228,93.574243,-40.51868,3.615654,5166.859608
std,10.432313,257.3317,2.763646,187.646785,0.496503,1.572242,0.578636,4.623004,1.735748,72.208448
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,179.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,318.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


### Review the Dataset Result

You can peek at any range of a TabularDataset with `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records through all the steps in the TabularDataset, which keeps it fast even on large datasets.

A `TabularDataset` is composed of an optional list of transformation steps.

In [6]:
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no
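The lazy `skip`/`take` behaviour described above can be pictured with plain Python iterators. This is an analogy only, not how `TabularDataset` is implemented; `clean` and `peek` are hypothetical names.

```python
from itertools import islice

evaluated = []  # records that actually went through the transformation step


def clean(record):
    """Stand-in for a dataset transformation step."""
    evaluated.append(record)
    return record.strip().lower()


def peek(records, skip, take, step):
    """Apply `step` lazily to `take` records after skipping `skip` of them."""
    window = islice(records, skip, skip + take)
    return list(map(step, window))


# A large lazy source: nothing is produced until it is iterated.
raw_records = ("  ROW-{}  ".format(i) for i in range(1_000_000))
preview = peek(raw_records, skip=10, take=3, step=clean)

print(preview)         # ['row-10', 'row-11', 'row-12']
print(len(evaluated))  # 3
```

Only the three requested records pass through the transformation step, even though the source could yield a million.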


## Train
This step creates an `AutoMLConfig` object that collects the general AutoML settings for the run.

In [7]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="y",   
                             path = project_folder,
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )
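The `**automl_settings` argument above is ordinary Python keyword unpacking: each dictionary entry is passed as if it had been written as a named argument. A small sketch with a stand-in function follows; `make_config` is hypothetical and does not mirror the real `AutoMLConfig` signature.

```python
def make_config(task, **settings):
    """Stand-in that simply records the keyword arguments it receives."""
    config = {"task": task}
    config.update(settings)
    return config


automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric": "AUC_weighted",
}

# Explicit keywords and unpacked dictionary entries end up side by side.
config = make_config(task="classification",
                     enable_early_stopping=True,
                     **automl_settings)

print(config["primary_metric"], config["experiment_timeout_minutes"])
```

Keeping tunable values in a dictionary makes them easy to reuse or log; note that passing the same name both explicitly and through the dictionary raises a `TypeError`.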

### Create Pipeline and AutoMLStep

Define the outputs of the `AutoMLStep` with `TrainingOutput`: one `PipelineData` object for the metrics of all child runs and one for the best model.

In [8]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create an AutoMLStep.

In [9]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [10]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [11]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [049a0c83][ccf6c6c9-c789-4603-86e8-54fcfd4fe163], (This step will run and generate new outputs)
Submitted PipelineRun a0dbabe7-69b1-43d5-8f5e-166c66892f0a
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/Bank-run/runs/a0dbabe7-69b1-43d5-8f5e-166c66892f0a?wsid=/subscriptions/f5091c60-1c3c-430f-8d81-d802f6bf2414/resourcegroups/aml-quickstarts-134667/workspaces/quick-starts-ws-134667


In [12]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [13]:
pipeline_run.wait_for_completion()

PipelineRunId: a0dbabe7-69b1-43d5-8f5e-166c66892f0a
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/Bank-run/runs/a0dbabe7-69b1-43d5-8f5e-166c66892f0a?wsid=/subscriptions/f5091c60-1c3c-430f-8d81-d802f6bf2414/resourcegroups/aml-quickstarts-134667/workspaces/quick-starts-ws-134667
PipelineRun Status: NotStarted
PipelineRun Status: Running


This usually indicates a package conflict with one of the dependencies of azureml-core or azureml-pipeline-core.
Please check for package conflicts in your python environment






PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'a0dbabe7-69b1-43d5-8f5e-166c66892f0a', 'status': 'Completed', 'startTimeUtc': '2021-01-14T14:02:41.216873Z', 'endTimeUtc': '2021-01-14T14:41:46.928025Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'outputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://mlstrg134667.blob.core.windows.net/azureml/ExperimentRun/dcid.a0dbabe7-69b1-43d5-8f5e-166c66892f0a/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=LFSBlO3v7AXaz8QZvB0d6Eejxao1E%2F4BiRqDxqUbG3I%3D&st=2021-01-14T14%3A31%3A48Z&se=2021-01-14T22%3A41%3A48Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://mlstrg134667.blob.core.windows.net/azureml/ExperimentRun/dcid.a0dbabe7-69b1-43d5-8f5e-166c66892f0a/logs/azureml/stderrlogs.txt?sv=2019-02-02&sr=b&sig=bIU%2BqG8vrsPHvicvZcNAeUd3O1l7gheVM6QaCOx1WoA%3D&st=2021-01-14T14%3A31%3A48Z&se=2021-01-14T2

'Finished'

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline. Here we examine them by retrieving the output data and running some tests.

In [14]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/64da9b64-7158-40a4-8999-af437556e443/metrics_data
Downloaded azureml/64da9b64-7158-40a4-8999-af437556e443/metrics_data, 1 files out of an estimated total of 1


In [15]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,64da9b64-7158-40a4-8999-af437556e443_6,64da9b64-7158-40a4-8999-af437556e443_3,64da9b64-7158-40a4-8999-af437556e443_17,64da9b64-7158-40a4-8999-af437556e443_21,64da9b64-7158-40a4-8999-af437556e443_0,64da9b64-7158-40a4-8999-af437556e443_9,64da9b64-7158-40a4-8999-af437556e443_13,64da9b64-7158-40a4-8999-af437556e443_31,64da9b64-7158-40a4-8999-af437556e443_32,64da9b64-7158-40a4-8999-af437556e443_38,...,64da9b64-7158-40a4-8999-af437556e443_11,64da9b64-7158-40a4-8999-af437556e443_20,64da9b64-7158-40a4-8999-af437556e443_28,64da9b64-7158-40a4-8999-af437556e443_24,64da9b64-7158-40a4-8999-af437556e443_15,64da9b64-7158-40a4-8999-af437556e443_14,64da9b64-7158-40a4-8999-af437556e443_27,64da9b64-7158-40a4-8999-af437556e443_1,64da9b64-7158-40a4-8999-af437556e443_23,64da9b64-7158-40a4-8999-af437556e443_34
norm_macro_recall,[0.17402893782868123],[0.6486726794814088],[0.4870768940088581],[0.4302394937824976],[0.5026785366965085],[0.4778011177240957],[0.24549085203770704],[0.3293572067641388],[0.7103151448465954],[0.5030684619901566],...,[0.6804464968778192],[0.3455932884687698],[0.31917654446537647],[0.5006761174925489],[0.0],[0.0],[0.5010178809922072],[0.43834549418631563],[0.4996749078905691],[0.48204769129031]
f1_score_weighted,[0.8726207307625555],[0.8382026301305484],[0.799048412467296],[0.9025288323944487],[0.9091539479147899],[0.7897641604164042],[0.885603431576398],[0.8913234547133979],[0.8659766953019384],[0.9127129113935518],...,[0.791785011129367],[0.892406452644354],[0.8885650482895828],[0.9109321212241842],[0.8353395018439429],[0.8353395018439429],[0.9111858226879949],[0.9021127651963996],[0.9118257356044213],[0.9085375444431306]
log_loss,[0.2596054058725865],[0.4998502340785905],[0.5628311238193519],[0.19708712990741808],[0.17775706110025447],[0.5542765350649138],[0.33655623030329523],[0.22570573238764152],[0.40220576608009107],[0.17643959238527002],...,[0.4767748150180446],[0.20678955773307725],[0.2263505948523789],[0.18349425240335102],[0.2831668269866405],[0.25526117735319215],[0.17988220852657136],[0.1874363495858499],[0.17928679602274605],[0.1837737927569083]
AUC_macro,[0.879322752557669],[0.8970634272303077],[0.8400055941776097],[0.9312457974203802],[0.9450464668693166],[0.8374270858224644],[0.9308878256246677],[0.920343171305944],[0.9244406285484591],[0.9469405220367992],...,[0.9041404323817674],[0.9285931939975585],[0.920127369421336],[0.9432709638101167],[0.7919989367357788],[0.8989574823977905],[0.9444889014850504],[0.9392346349984347],[0.9450196074072839],[0.9424603174603176]
precision_score_macro,[0.7998633923384806],[0.6638352103661618],[0.6173753160398245],[0.7778318057957909],[0.7819118765348991],[0.6121396126537064],[0.822098675416211],[0.7685027182120205],[0.6941070079918361],[0.7978014145512364],...,[0.6472431861290286],[0.7646535215263494],[0.7568725346086531],[0.7904154525215538],[0.4440060698027314],[0.4440060698027314],[0.791450436626639],[0.7723958081530135],[0.7949245271879244],[0.7862007073853139]
weighted_accuracy,[0.9771375834608871],[0.8038077468627398],[0.7628746750410486],[0.9563188254464977],[0.9514937218005303],[0.7484937575573688],[0.9766010009385309],[0.9620229034621705],[0.840858614816875],[0.9571278957721506],...,[0.7222015588831772],[0.9596285749796182],[0.9598771482415283],[0.9547730032881345],[0.9843450583187134],[0.9843450583187134],[0.9551094165001369],[0.9537972210153172],[0.9564126440319367],[0.9548124392866705]
balanced_accuracy,[0.5870144689143406],[0.8243363397407044],[0.7435384470044291],[0.7151197468912488],[0.7513392683482543],[0.7389005588620479],[0.6227454260188535],[0.6646786033820694],[0.8551575724232977],[0.7515342309950783],...,[0.8402232484389096],[0.6727966442343849],[0.6595882722326882],[0.7503380587462745],[0.5],[0.5],[0.7505089404961036],[0.7191727470931578],[0.7498374539452846],[0.741023845645155]
AUC_micro,[0.9638390811479204],[0.8786618802111996],[0.8426204231822254],[0.9760318319244913],[0.979695082216353],[0.8331540177903247],[0.9758990146932517],[0.9698484621708064],[0.9106347272848685],[0.9804067873105202],...,[0.857361846362148],[0.9746105401802059],[0.9696466573485829],[0.9789753638773053],[0.9460887305684569],[0.9673620536012398],[0.9794997248325392],[0.9781770788959222],[0.9798592155770112],[0.9790438448838423]
recall_score_macro,[0.5870144689143406],[0.8243363397407044],[0.7435384470044291],[0.7151197468912488],[0.7513392683482543],[0.7389005588620479],[0.6227454260188535],[0.6646786033820694],[0.8551575724232977],[0.7515342309950783],...,[0.8402232484389096],[0.6727966442343849],[0.6595882722326882],[0.7503380587462745],[0.5],[0.5],[0.7505089404961036],[0.7191727470931578],[0.7498374539452846],[0.741023845645155]
average_precision_score_micro,[0.965078326782299],[0.8597242995912042],[0.815491413703218],[0.977152754319198],[0.9806603102489483],[0.8125359188787321],[0.9766643355999638],[0.9676905965105704],[0.8832542473889691],[0.9813214949741682],...,[0.8699905321750996],[0.9757189583187845],[0.9676109897928908],[0.9799809298079264],[0.9368716121268706],[0.968004695488651],[0.9804687100273154],[0.9791945367231853],[0.9807939838040495],[0.9800362147080485]
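The metrics table can also be queried programmatically. The sketch below assumes the deserialized JSON maps each child run id to a dictionary of metric names, each holding a one-element list, which is how the table above displays it. The `AUC_macro` values are copied from three of the columns shown; `best_run` is a hypothetical helper.

```python
RUN_PREFIX = "64da9b64-7158-40a4-8999-af437556e443"

# AUC_macro values copied from three columns of the table above.
metrics = {
    RUN_PREFIX + "_6": {"AUC_macro": [0.879322752557669]},
    RUN_PREFIX + "_0": {"AUC_macro": [0.9450464668693166]},
    RUN_PREFIX + "_38": {"AUC_macro": [0.9469405220367992]},
}


def best_run(metrics_by_run, metric_name):
    """Return (run_id, value) for the highest value of `metric_name`."""
    scored = {
        run_id: values[metric_name][0]
        for run_id, values in metrics_by_run.items()
        if metric_name in values
    }
    run_id = max(scored, key=scored.get)
    return run_id, scored[run_id]


run_id, value = best_run(metrics, "AUC_macro")
print(run_id, value)
```

With the full file, the same function can be called on `deserialized_metrics_output` with whichever metric name is of interest; for metrics such as `log_loss`, where lower is better, `min` would replace `max`.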


### Retrieve the Best Model

In [16]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/64da9b64-7158-40a4-8999-af437556e443/model_data
Downloaded azureml/64da9b64-7158-40a4-8999-af437556e443/model_data, 1 files out of an estimated total of 1


In [17]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=None,
                                                                  enable_feature_sweeping=None,
                                                                  feature_sweeping_config=None,
                                                                  feature_sweeping_timeout=None,
                                                                  featurization_config=None,
                                                                  force_text_dnn=None,
                                                                  is_cross_validation=None,
                                                                  is_onnx_compatible=None,
                                                                  logger=None,
                                                              

In [18]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=None,
                                estimators=[('0',
                                             Pipeline(memory=None,
                                                      steps=[('maxabsscaler',
                                                              MaxAbsScaler(copy=True)),
                                                             ('lightgbmclassifier',
                                                              LightGBMClassifier(boosting_type='gbdt',
                                                          

### Test the Model
#### Load Test Data
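The test data should go through the same preparation steps as the training data. Note that the cell below loads the same `bankmarketing_train.csv` file that was used for training, so the results that follow describe performance on data the model has already seen rather than on a held-out set.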

In [19]:
dataset_test = Dataset.Tabular.from_delimited_files(path='https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv')
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)

#### Testing Our Best Fitted Model

We will use a confusion matrix to see how well the model performs.

In [20]:
from sklearn.metrics import confusion_matrix
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [21]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

Unnamed: 0,0,1
0,28920,338
1,919,2773
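The counts in the confusion matrix can be turned into summary metrics. This sketch assumes the scikit-learn convention (rows are true labels, columns are predictions) and treats index 1 as the positive class; the numbers are copied from the matrix above.

```python
# Counts copied from the confusion matrix above.
# Assumption: rows are true labels, columns are predictions,
# and index 1 is the positive class.
cm = [[28920, 338],
      [919, 2773]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all records classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are correct
recall = tp / (tp + fn)        # share of actual positives that were found

print("records:   {}".format(total))
print("accuracy:  {:.4f}".format(accuracy))
print("precision: {:.4f}".format(precision))
print("recall:    {:.4f}".format(recall))
```

The record total matches the 32,950 rows reported by `df.describe()` for the training data, which is consistent with the test cell loading the training CSV.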


## Publish and run from REST endpoint

Publishing the pipeline exposes a REST endpoint, so the pipeline can be rerun from any HTTP library on any platform.


In [22]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", description="Training bankmarketing pipeline", version="1.0")

published_pipeline


Name,Id,Status,Endpoint
Bankmarketing Train,9f608c78-9b05-4b17-bab1-861cecfcf137,Active,REST Endpoint


Authenticate again to retrieve the `auth_header` needed to call the endpoint.

In [23]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()



Get the REST URL from the `endpoint` property of the published pipeline object. Build an HTTP POST request to the endpoint, passing the authentication header and a JSON payload with the experiment name. Make the request to trigger the run, then read the `Id` key from the response dictionary to get the run ID.


In [24]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

In [25]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  59e2c496-7be5-4964-8e28-434f00f2a574
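Because the endpoint accepts a plain HTTP POST, the same request can be built with other HTTP clients. The sketch below uses the standard library's `urllib` and only constructs the request without sending it. The endpoint URL and token are hypothetical placeholders, and the `Authorization: Bearer ...` header shape is an assumption about what `get_authentication_header()` returns.

```python
import json
import urllib.request

# Hypothetical placeholders; in practice use published_pipeline.endpoint
# and the header returned by get_authentication_header().
rest_endpoint = "https://example.invalid/pipelines/PIPELINE_ID"
auth_header = {"Authorization": "Bearer TOKEN"}

payload = json.dumps({"ExperimentName": "pipeline-rest-endpoint"}).encode("utf-8")
headers = dict(auth_header)
headers["Content-Type"] = "application/json"

request = urllib.request.Request(rest_endpoint, data=payload,
                                 headers=headers, method="POST")

# The request is only constructed here. Sending it would be
# urllib.request.urlopen(request), which needs a real endpoint and token.
print(request.get_method(), request.full_url)
print(json.loads(request.data))
```

Unlike `requests.post(..., json=...)`, `urllib` does not serialize the payload or set the content type, so both are done explicitly.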


Use the run id to monitor the status of the new run. 

In [26]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …