Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Azure Machine Learning Pipeline with AutoMLStep (Udacity Course 2)
This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline.

## Introduction
This notebook demonstrates the following:  

- Create an `Experiment` in an existing `Workspace`.
- Create or attach an existing AmlCompute cluster to the workspace.
- Define data loading in a `TabularDataset`.
- Configure AutoML using `AutoMLConfig`.
- Submit the AutoML experiment.
- Create an `AutoMLStep` to be used in an Azure Machine Learning pipeline.
- Create the pipeline and submit it.
- Explore the results.
- Test the best fitted model.
- Publish the pipeline and run it.


## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-136596
aml-quickstarts-136596
southcentralus
9e65f93e-bdd8-437b-b1e8-0647cd6098f7
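The `config.json` file read by `Workspace.from_config()` is a small JSON document with the keys `subscription_id`, `resource_group`, and `workspace_name`. The following is a minimal sketch that writes a placeholder file and checks it for those keys before connecting; the helper name `check_workspace_config` and the placeholder values are assumptions made for illustration, not part of the SDK.

```python
import json
import os
import tempfile

# Keys that a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path):
    """Return the required keys that are missing or empty in a config.json file."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Write a placeholder config to a temporary folder and validate it.
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example_config, f, indent=2)

missing = check_workspace_config(config_path)
print("Missing keys:", missing)
```

An empty list means the file has the expected shape; it does not verify that the values point at a real workspace.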


## Create an Azure ML experiment
Let's create an experiment named "automlstep-classification" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to use a separate folder for each step's scripts and dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping the folders separate also preserves step reuse when nothing in a step's `source_directory` has changed.
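To make the reuse argument concrete, here is a small sketch that fingerprints a folder by hashing its file paths and contents. It is only an illustration of why an untouched folder looks "unchanged" while a single edit does not; it is not how Azure ML computes snapshots, and the function name `folder_fingerprint` and the file `train.py` are hypothetical.

```python
import hashlib
import os
import tempfile


def folder_fingerprint(folder):
    """Hash relative paths and file contents under `folder` in a stable order."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, folder).encode("utf-8"))
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


# Hypothetical per-step folder containing a single script.
step_folder = tempfile.mkdtemp()
script_path = os.path.join(step_folder, "train.py")
with open(script_path, "w") as f:
    f.write("print('training')\n")

before = folder_fingerprint(step_folder)
unchanged = folder_fingerprint(step_folder)

# Any edit inside the folder changes the fingerprint.
with open(script_path, "a") as f:
    f.write("# edited\n")
after_edit = folder_fingerprint(step_folder)

print("Same while untouched:", before == unchanged)
print("Same after an edit:", before == after_edit)
```

Files that live outside a step's folder never enter its fingerprint, which is the reason for giving each step its own folder.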

*Udacity Note:* There is no need to create a new Azure ML experiment here. Reuse the experiment that was already created.

In [3]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing experiment name
experiment_name = 'azureml-autoML-2'
project_folder = './pipeline-project'  # passed as `path` to AutoMLConfig below

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
azureml-autoML-2,quick-starts-ws-136596,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
The code below checks whether a compute cluster with the specified name already exists. If so, it uses that cluster; otherwise, it creates a new one.

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "notebook-experiments"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',  # for GPU, use "STANDARD_NC6"
                                                           # vm_priority='lowpriority',  # optional
                                                           min_nodes=1,
                                                           max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, timeout_in_minutes = 10)
# For a more detailed view of current AmlCompute status, use get_status().

## Data


The code below checks whether the dataset is already registered in the workspace under the name stored in `key`. If not, it creates the dataset from the original data location and registers it in the workspace.

In [5]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
found = False
key = "Bank-marketing"
description_text = "Bank Marketing DataSet for Udacity Course 2"

if key in ws.datasets.keys(): 
        found = True
        dataset = ws.datasets[key] 

if not found:
        # Create AML Dataset and register it into Workspace
        example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
        dataset = Dataset.Tabular.from_delimited_files(example_data)        
        #Register Dataset in Workspace
        dataset = dataset.register(workspace=ws,
                                   name=key,
                                   description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0
mean,40.040212,257.335205,2.56173,962.17478,0.17478,0.076228,93.574243,-40.51868,3.615654,5166.859608
std,10.432313,257.3317,2.763646,187.646785,0.496503,1.572242,0.578636,4.623004,1.735748,72.208448
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,179.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,318.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


### Review the Dataset Result

We can peek at the result of a `TabularDataset` at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the `TabularDataset`, which makes it fast even against large datasets.

A `TabularDataset` object is composed of an optional list of transformation steps.

In [6]:
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no
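The cell above uses only `take`. As a small sketch of the `skip(i)` plus `take(j)` combination mentioned earlier, the helper below returns a window of rows starting at an arbitrary offset. The helper name `peek` is an assumption for illustration; it relies only on the `skip`, `take`, and `to_pandas_dataframe` methods of the dataset object passed in.

```python
def peek(dataset, start, count):
    """Return `count` rows of `dataset`, beginning at row offset `start`."""
    # skip() and take() each return a new dataset definition; rows are only
    # materialized by the final to_pandas_dataframe() call.
    return dataset.skip(start).take(count).to_pandas_dataframe()


# Usage with the dataset registered above (not executed here):
# peek(dataset, 100, 5)
```

Because only the requested window is materialized, this stays cheap even when the underlying file is large.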


## Train
This creates a general AutoML settings object. We limit the experiment to 20 minutes to speed up experiment completion.

In [7]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="y",   
                             path = project_folder,
                             enable_early_stopping= True,
                             n_cross_validations=5, 
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )

## Submit the autoML experiment

In [8]:
autoML_run = experiment.submit(config=automl_config)

Running on remote.


## View details of the autoML run

In [9]:
from azureml.widgets import RunDetails

In [10]:
run_details = RunDetails(run_instance=autoML_run)
run_details.show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

# Create, publish and consume a pipeline

## Create a pipeline

#### Create Pipeline and AutoMLStep

We define outputs for the AutoMLStep using `TrainingOutput` so that we can inspect them after the step has completed.

In [13]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Here, we create a step that consists of an AutoML run, using the same configuration that we used for the autoML run (`automl_config`): 

In [14]:
from azureml.pipeline.steps import AutoMLStep

automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

Now, we construct the pipeline. Note that generally, a pipeline is composed of multiple steps. In this case, we simplify it to have only one step, which is the `automl_step` constructed above. 

In [15]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [16]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [98fd8037][0bc99115-8954-44e7-80fd-1146c63020d7], (This step will run and generate new outputs)
Submitted PipelineRun 43381cd0-6f54-433b-9036-cce8a4bd6b1d
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/azureml-autoML-2/runs/43381cd0-6f54-433b-9036-cce8a4bd6b1d?wsid=/subscriptions/9e65f93e-bdd8-437b-b1e8-0647cd6098f7/resourcegroups/aml-quickstarts-136596/workspaces/quick-starts-ws-136596


In [17]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [None]:
pipeline_run.wait_for_completion()

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [20]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/d5d96b18-aca5-4c8e-ab34-15e9d0c703de/metrics_data
Downloaded azureml/d5d96b18-aca5-4c8e-ab34-15e9d0c703de/metrics_data, 1 files out of an estimated total of 1


With a comprehensive list of metrics obtained for all child runs, it is possible to compose a customized metric based on existing ones and select the best model according to that metric.

In [21]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_7,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_10,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_25,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_0,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_5,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_15,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_32,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_4,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_13,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_11,...,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_21,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_22,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_28,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_38,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_2,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_20,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_33,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_6,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_27,d5d96b18-aca5-4c8e-ab34-15e9d0c703de_31
f1_score_macro,[0.4854783332983595],[0.6291065296051844],[0.4703248283762755],[0.7695433632192865],[0.6172687984240224],[0.4703248283762755],[0.7598906058833063],[0.4703248283762755],[0.7513317602193598],[0.6344538009685821],...,[0.7690673092994462],[0.7721885881660048],[0.7608791839089427],[0.7727145627585372],[0.5281282069224706],[0.7291723226750826],[0.7628201241538763],[0.6156115596783753],[0.7713728375650223],[0.7728848323679821]
AUC_micro,[0.9556914025711464],[0.8583231686396596],[0.9756630660793357],[0.9810792321100854],[0.9692925502151832],[0.9591182943762219],[0.9787314158344482],[0.9658617254726778],[0.9780340793173083],[0.8652616393533219],...,[0.9810948717535422],[0.9810372040222806],[0.978043349812679],[0.9815005353676536],[0.9681184348382729],[0.9771356748280493],[0.9806279966197],[0.9658473845275294],[0.9810128280997787],[0.9811252990575227]
AUC_weighted,[0.8405652551728098],[0.8658032448788451],[0.9408119985130078],[0.9489629240236944],[0.9065483049817346],[0.8576717754642983],[0.94233421821892],[0.8916149832279873],[0.9363012475203197],[0.8829086705301071],...,[0.9489231290410352],[0.9485839901963683],[0.9402230263593465],[0.9501665473435155],[0.9023288231485566],[0.9366377787576201],[0.9478565334703399],[0.8891483668044258],[0.9482575381168437],[0.9486850394481486]
norm_macro_recall,[0.014961526238924749],[0.5292854209265944],[0.0],[0.5037139284731673],[0.17159047634232652],[0.0],[0.48941554818057026],[0.0],[0.4478239026936099],[0.5830367443214306],...,[0.49875066298085724],[0.504191212813036],[0.495769406818915],[0.5037092742190772],[0.06097691353548389],[0.3859508722990703],[0.48457654698391595],[0.16910121282178178],[0.5004413075931486],[0.5073704668647414]
recall_score_micro,[0.8892564491654023],[0.7526858877086495],[0.8879514415781486],[0.9144157814871017],[0.9],[0.8879514415781486],[0.9101365705614567],[0.8879514415781486],[0.9132018209408195],[0.7487708649468893],...,[0.9149924127465857],[0.916176024279211],[0.909650986342944],[0.9166312594840667],[0.8925644916540213],[0.9102579666160849],[0.9132321699544764],[0.8998786039453718],[0.9162670713201821],[0.9160849772382397]
weighted_accuracy,[0.9840913697786888],[0.7498480612688843],[0.9843197680605863],[0.9548042339535432],[0.9780410176201378],[0.9843197680605863],[0.9512364275567677],[0.9843197680605863],[0.9602502413061001],[0.7380660598086445],...,[0.9561223188432031],[0.9569343321102222],[0.9498454652186054],[0.9575569453453916],[0.9824705494533511],[0.9642224591194022],[0.9556996160834956],[0.9782005341004988],[0.9575065618703371],[0.9564221015674651]
AUC_macro,[0.8405652551728098],[0.8658032448788451],[0.9408119985130078],[0.9489629240236944],[0.9065483049817347],[0.8576717754642983],[0.9423342183434613],[0.8916149832279873],[0.9363012475203197],[0.8829086705301071],...,[0.9489231290410354],[0.9485839901963683],[0.9402230798027988],[0.9501665473435155],[0.9023288231485566],[0.9366377787576201],[0.9478565245423409],[0.8891483668044259],[0.9482575381168437],[0.9486850394481487]
precision_score_macro,[0.6921588033602111],[0.623541101431799],[0.4439757207890743],[0.7914212857326756],[0.8089952761433304],[0.4439757207890743],[0.7782204395625276],[0.4439757207890743],[0.7973952001665008],[0.6316345833528316],...,[0.7940765936867068],[0.7979426012886874],[0.7761828481232318],[0.7995425368986979],[0.8326944499019758],[0.794563971145181],[0.7894440629848487],[0.8090203303072038],[0.7987184597168863],[0.7970041622953407]
accuracy,[0.8892564491654023],[0.7526858877086495],[0.8879514415781486],[0.9144157814871017],[0.9],[0.8879514415781486],[0.9101365705614567],[0.8879514415781486],[0.9132018209408195],[0.7487708649468893],...,[0.9149924127465857],[0.916176024279211],[0.909650986342944],[0.9166312594840667],[0.8925644916540213],[0.9102579666160849],[0.9132321699544764],[0.8998786039453718],[0.9162670713201821],[0.9160849772382397]
recall_score_weighted,[0.8892564491654023],[0.7526858877086495],[0.8879514415781486],[0.9144157814871017],[0.9],[0.8879514415781486],[0.9101365705614567],[0.8879514415781486],[0.9132018209408195],[0.7487708649468893],...,[0.9149924127465857],[0.916176024279211],[0.909650986342944],[0.9166312594840667],[0.8925644916540213],[0.9102579666160849],[0.9132321699544764],[0.8998786039453718],[0.9162670713201821],[0.9160849772382397]
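As a sketch of composing a customized metric, the snippet below averages `AUC_weighted` and `f1_score_macro` per child run and picks the run with the highest blend. It assumes the deserialized metrics have the shape run id, then metric name, then a one-element list, as the table above suggests. The values are copied from three of the child runs shown; the shortened run ids and the choice of a plain average are assumptions for illustration.

```python
# Per-run metrics: run id -> metric name -> one-element list.
# Values are taken from child runs _0, _7 and _10 in the table above;
# the short run ids are placeholders for the full ids.
metrics_by_run = {
    "run_0": {"AUC_weighted": [0.9489629240236944],
              "f1_score_macro": [0.7695433632192865]},
    "run_7": {"AUC_weighted": [0.8405652551728098],
              "f1_score_macro": [0.4854783332983595]},
    "run_10": {"AUC_weighted": [0.8658032448788451],
               "f1_score_macro": [0.6291065296051844]},
}


def custom_score(run_metrics):
    """Plain average of AUC_weighted and f1_score_macro (an arbitrary example blend)."""
    return (run_metrics["AUC_weighted"][0] + run_metrics["f1_score_macro"][0]) / 2


scores = {run_id: custom_score(m) for run_id, m in metrics_by_run.items()}
best_run_id = max(scores, key=scores.get)
print("Best run by custom score:", best_run_id)
```

The same function can be applied to the full `deserialized_metrics_output` dictionary, provided every child run reports both metrics.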


### Retrieve the Best Model

In [22]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/d5d96b18-aca5-4c8e-ab34-15e9d0c703de/model_data
Downloaded azureml/d5d96b18-aca5-4c8e-ab34-15e9d0c703de/model_data, 1 files out of an estimated total of 1


In [28]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

PipelineWithYTransformations(Pipeline={'memory': None,
                                       'steps': [('datatransformer',
                                                  DataTransformer(enable_dnn=None,
                                                                  enable_feature_sweeping=None,
                                                                  feature_sweeping_config=None,
                                                                  feature_sweeping_timeout=None,
                                                                  featurization_config=None,
                                                                  force_text_dnn=None,
                                                                  is_cross_validation=None,
                                                                  is_onnx_compatible=None,
                                                                  logger=None,
                                                              

In [29]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=None,
                                estimators=[('0',
                                             Pipeline(memory=None,
                                                      steps=[('maxabsscaler',
                                                              MaxAbsScaler(copy=True)),
                                                             ('lightgbmclassifier',
                                                              LightGBMClassifier(boosting_type='gbdt',
                                                          

### Test the Model
#### Load Test Data
The test data should go through the same preparation steps as the training data. Otherwise, it might fail at the preprocessing step.

In [None]:
# NOTE: 'train' reloads the training file, so the scores below are not held-out estimates.
dataset_type = 'train'
dataset_test = Dataset.Tabular.from_delimited_files(
    path='https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_{}.csv'.format(dataset_type)
)
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)
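One inexpensive way to catch a preparation mismatch before calling `predict` is to compare the feature columns of the training and test data. The sketch below does this on plain lists of column names; the helper name `compare_columns` and the toy column lists are assumptions for illustration.

```python
def compare_columns(train_columns, test_columns, label_column):
    """Return (missing, unexpected) feature columns of the test set."""
    train_features = [c for c in train_columns if c != label_column]
    test_features = [c for c in test_columns if c != label_column]
    missing = [c for c in train_features if c not in test_features]
    unexpected = [c for c in test_features if c not in train_features]
    return missing, unexpected


# Toy example: the test set lacks the "marital" feature.
train_cols = ["age", "job", "marital", "y"]
test_cols = ["age", "job", "y"]
missing, unexpected = compare_columns(train_cols, test_cols, "y")
print("Missing from test:", missing)
print("Unexpected in test:", unexpected)
```

With the notebook's objects, the inputs could be the column lists of the training DataFrame and of `df_test`.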

#### Testing Our Best Fitted Model

We will use a confusion matrix to see how our model performs.

In [None]:
from sklearn.metrics import confusion_matrix
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [None]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)
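To read the matrix, recall that scikit-learn puts true labels on the rows and predicted labels on the columns, with labels in sorted order. The following pure-Python sketch rebuilds that layout on a toy example and derives precision and recall for the `yes` class; the labels and predictions are made-up data, not output of the model above.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels):
    """Entry [i][j] counts rows whose true label is labels[i] and prediction is labels[j]."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


# Toy data using the same two classes as the bank marketing label column.
y_true = ["no", "no", "no", "yes", "yes", "no"]
y_pred = ["no", "no", "yes", "yes", "no", "no"]
labels = ["no", "yes"]

matrix = confusion_counts(y_true, y_pred, labels)
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)  # share of predicted "yes" that are correct
recall = tp / (tp + fn)     # share of actual "yes" that are found
print(matrix, precision, recall)
```

On an imbalanced dataset such as this one, precision and recall for the minority class are usually more informative than overall accuracy.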

## Publish the pipeline

Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.

Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.


In [31]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing autoML model train", description="Training bankmarketing pipeline", version="1.0")

published_pipeline


Name,Id,Status,Endpoint
Bankmarketing autoML model train,152f117c-81e2-4e95-8682-dea470b8ad2b,Active,REST Endpoint


## Consume the published pipeline

Authenticate once again to retrieve the `auth_header` so that the endpoint can be used.

In [35]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()



Get the REST url from the endpoint property of the published pipeline object. We can also find the REST url in our workspace in the portal. 

In [36]:
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)


https://southcentralus.api.azureml.ms/pipelines/v1.0/subscriptions/9e65f93e-bdd8-437b-b1e8-0647cd6098f7/resourceGroups/aml-quickstarts-136596/providers/Microsoft.MachineLearningServices/workspaces/quick-starts-ws-136596/PipelineRuns/PipelineSubmit/152f117c-81e2-4e95-8682-dea470b8ad2b


Build an HTTP POST request to the endpoint, specifying our authentication header. Additionally, add a JSON payload object with the experiment name.

Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.

In [37]:
import requests
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

In [38]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  598550c0-b23f-4fcb-a78c-d3a916506ecc
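The request above sends only the experiment name. If a pipeline is built with `PipelineParameter` objects, the same endpoint also accepts a `ParameterAssignments` object in the JSON body. The pipeline in this notebook defines no parameters, so the `batch_size` entry below is purely hypothetical, as is the helper name `build_submit_payload`; treat this as a sketch of the payload shape rather than something to run against this pipeline.

```python
def build_submit_payload(experiment_name, parameters=None):
    """Build the JSON body for a published-pipeline REST submission."""
    payload = {"ExperimentName": experiment_name}
    if parameters:
        # Only meaningful if the pipeline was built with PipelineParameter objects.
        payload["ParameterAssignments"] = dict(parameters)
    return payload


# Body equivalent to the request made in this notebook.
payload = build_submit_payload("pipeline-rest-endpoint")

# Body for a hypothetical pipeline that exposes a `batch_size` parameter.
payload_with_params = build_submit_payload("pipeline-rest-endpoint", {"batch_size": 20})

print(payload)
print(payload_with_params)
# To submit: requests.post(rest_endpoint, headers=auth_header, json=payload)
```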


Use the run ID to monitor the status of the new run. This takes another 10-15 minutes and looks similar to the previous pipeline run, so you can skip watching the full output if you don't need to see another pipeline run.

In [39]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …
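When the widget is not available, for example in a plain script, the run can be monitored by polling its status instead. The sketch below polls any zero-argument callable until a terminal state is reported. It assumes the terminal states are `Finished`, `Failed`, and `Canceled`; the helper name `wait_until_terminal` is hypothetical, and a canned sequence of statuses stands in for `published_pipeline_run.get_status` so that the example runs offline.

```python
import time

# Assumed terminal states of a run.
TERMINAL_STATES = {"Finished", "Failed", "Canceled"}


def wait_until_terminal(get_status, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll get_status() until a terminal state is returned or polls run out."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    return status  # last status seen; not terminal if the polls ran out


# Stand-in for published_pipeline_run.get_status, so no workspace is needed.
fake_statuses = iter(["NotStarted", "Running", "Running", "Finished"])
final_status = wait_until_terminal(lambda: next(fake_statuses), poll_seconds=0)
print("Final status:", final_status)
```

In the notebook, passing `published_pipeline_run.get_status` as the callable would poll the real run; `wait_for_completion()`, used earlier, is the built-in blocking alternative.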