Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.


# Azure Machine Learning Pipeline with AutoMLStep (Udacity Course 2)
This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline.

## Introduction
This example shows how to use an Azure ML `Dataset` to load data for AutoML in an Azure ML pipeline.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` cluster to the workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using `AmlCompute`.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
from matplotlib import pyplot as plt
from sklearn import datasets
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.pipeline.steps import AutoMLStep
from azureml.core.environment import Environment 
from azureml.core.model import InferenceConfig 
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model

In [2]:
import logging
import os
import csv

import numpy as np
import pandas as pd
import pkg_resources

import azureml.core

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.56.0


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.

In [3]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

Udacity_1
cloud_shell
northcentralus
d2d90bd8-e567-4097-88c9-9532cc375686


## Create an Azure ML experiment
This section sets the experiment name and a folder to hold the training scripts. Script runs are recorded under the experiment in Azure. The original sample names the experiment "automlstep-classification"; the cell below instead reuses an existing experiment, `ml-experiment-1`.

The best practice is to keep each step's scripts and dependent files in a separate folder and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. A change to any file in the `source_directory` triggers a re-upload of the snapshot, so separate folders also preserve step reuse when nothing in a step's `source_directory` has changed.

*Udacity Note:* There is no need to create a new Azure ML experiment; reuse the experiment that was already created.
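
As a hedged illustration of the folder-per-step practice described above, the sketch below creates one subfolder per step under a project folder. The helper `make_step_folders` and the step names `prep` and `train` are hypothetical and do not appear in this notebook; each returned path is what you would pass as that step's `source_directory`.

```python
from pathlib import Path
import tempfile


def make_step_folders(project_folder, step_names):
    """Create one subfolder per pipeline step and return {step_name: Path}."""
    root = Path(project_folder)
    folders = {}
    for name in step_names:
        folder = root / name
        # exist_ok=True makes the call safe to repeat when a notebook is re-run.
        folder.mkdir(parents=True, exist_ok=True)
        folders[name] = folder
    return folders


# Demo in a temporary directory so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp) / "pipeline-project"
    step_folders = make_step_folders(project_root, ["prep", "train"])  # hypothetical step names
    created = sorted(p.name for p in project_root.iterdir())
    print(created)
```

Keeping the mapping from step name to folder in one place makes it harder to point two steps at the same `source_directory` by accident.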


In [4]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing experiment name
experiment_name = 'ml-experiment-1'
project_folder = './pipeline-project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
ml-experiment-1,Udacity_1,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.

**Udacity Note:** There is no need to create a new compute target; reuse the existing cluster.

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "auto-ml"

# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_D2_V2',  # for GPU, use "STANDARD_NC6"
        # vm_priority='lowpriority',  # optional
        max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

# Wait until at least min_node_count nodes are provisioned or the timeout elapses
compute_target.wait_for_completion(show_output=True, min_node_count=1, timeout_in_minutes=10)
# For a more detailed view of current AmlCompute status, use get_status().

Found existing cluster, use it.
Succeeded.....................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


## Data

**Udacity note:** Make sure `key` matches the name of the uploaded dataset and that the description matches. If the name is unknown or hard to find, loop over `ws.datasets.keys()` and `print()` each key.
If the dataset *isn't* found because it was deleted, the cell below recreates it from the CSV URL.
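
The lookup suggested above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper `suggest_dataset_keys` is hypothetical, and a plain list stands in for `ws.datasets.keys()` so the example runs without a workspace. The names in `registered` other than "BankMarketing Dataset" are made up for the demo.

```python
import difflib


def suggest_dataset_keys(available_keys, wanted, max_suggestions=3):
    """Return [wanted] on an exact match, else the closest registered names."""
    names = list(available_keys)
    if wanted in names:
        return [wanted]
    # Compare case-insensitively so a differently capitalised name still matches.
    lowered = {name.lower(): name for name in names}
    close = difflib.get_close_matches(
        wanted.lower(), list(lowered), n=max_suggestions, cutoff=0.6
    )
    return [lowered[c] for c in close]


# Stand-in for ws.datasets.keys(); only the first name comes from this notebook.
registered = ["BankMarketing Dataset", "iris-sample", "bankmarketing_test"]
for name in registered:
    print(name)

suggestions = suggest_dataset_keys(registered, "Bankmarketing dataset")
print(suggestions)
```

In the notebook itself, the first argument would be `ws.datasets.keys()` and the second the value of `key`.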

In [6]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
# NOTE: update the key to match the dataset name
key = "BankMarketing Dataset"
description_text = "Bank Marketing DataSet for Udacity Course 2"

found = key in ws.datasets.keys()

if found:
    dataset = ws.datasets[key]
else:
    # Create AML Dataset and register it into Workspace
    example_data = 'https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv'
    dataset = Dataset.Tabular.from_delimited_files(example_data)
    # Register Dataset in Workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)

df = dataset.to_pandas_dataframe()
df.describe()

{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe'}
{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe', 'activityApp': 'TabularDataset'}


Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0,32950.0
mean,40.040212,257.335205,2.56173,962.17478,0.17478,0.076228,93.574243,-40.51868,3.615654,5166.859608
std,10.432313,257.3317,2.763646,187.646785,0.496503,1.572242,0.578636,4.623004,1.735748,72.208448
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,38.0,179.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,47.0,318.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1


### Review the Dataset Result

You can peek at any range of a `TabularDataset` using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records through all the steps in the `TabularDataset`, which keeps it fast even on large datasets.

A `TabularDataset` is defined by an optional list of transformation steps.

In [7]:
dataset.take(5).to_pandas_dataframe()

{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe'}
{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe', 'activityApp': 'TabularDataset'}


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,57,technician,married,high.school,no,no,yes,cellular,may,mon,...,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1,no
1,55,unknown,married,unknown,unknown,yes,no,telephone,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,no
2,33,blue-collar,married,basic.9y,no,no,no,cellular,may,fri,...,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1,no
3,36,admin.,married,high.school,no,no,no,telephone,jun,fri,...,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1,no
4,27,housemaid,married,high.school,no,yes,no,cellular,jul,fri,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1,no


## Train
The next cell creates a general AutoML settings object.

**Udacity note:** These inputs must match those used when training in the portal. For example, `label_column_name` must be `y`.

In [8]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric": 'AUC_weighted'
}
automl_config = AutoMLConfig(compute_target=compute_target,
                             task="classification",
                             training_data=dataset,
                             label_column_name="y",
                             path=project_folder,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log="automl_errors.log",
                             **automl_settings)

In [9]:
remote_run = experiment.submit(automl_config, show_output= True)
remote_run.wait_for_completion()

Submitting remote run.
No run_configuration provided, running on auto-ml with default configuration
Running on remote compute: auto-ml


Experiment,Id,Type,Status,Details Page,Docs Page
ml-experiment-1,AutoML_d63bbd16-25bb-486c-a434-f1f0e6cbb6d1,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Train-Test data split
STATUS:       DONE
DESCRIPTION:  In order to accurately evaluate the model(s) trained by AutoML, we leverage a dataset that the model is not trained on. Hence, if the user doesn't provide an explicit validation dataset, a part of the training dataset is used to achieve this. For smaller datasets (fewer than 20,000 samples), cross-validation is leveraged, else a single hold-out set is split from the training data to serve as the validation dataset. Hence, your input data has been split into a training dataset and a holdout validation dataset.
      

{'runId': 'AutoML_d63bbd16-25bb-486c-a434-f1f0e6cbb6d1',
 'target': 'auto-ml',
 'status': 'Completed',
 'startTimeUtc': '2024-09-07T19:15:24.756237Z',
 'endTimeUtc': '2024-09-07T19:47:42.104501Z',
 'services': {},
   'message': 'No scores improved over last 10 iterations, so experiment stopped early. This early stopping behavior can be disabled by setting enable_early_stopping = False in AutoMLConfig for notebook/python SDK runs.'}],
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'auto-ml',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"47ad08b4-5a55-4e8a-b1b3-a5a432d09a84\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'classification',
  'dependencies_versions': '{"azureml-accel-models

#### Create Pipeline and AutoMLStep

You can define outputs for the AutoMLStep using TrainingOutput.

In [14]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                            datastore=ds,
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                          datastore=ds,
                          pipeline_output_name=best_model_output_name,
                          training_output=TrainingOutput(type='Model'))

Create an AutoMLStep.

In [15]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [16]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [17]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [29c892c8][0f30adc5-2e52-4ec6-9bf4-79b1c1a80b34], (This step will run and generate new outputs)
Submitted PipelineRun e93cbeb6-34e3-4874-b69a-85ecec9d2266
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/e93cbeb6-34e3-4874-b69a-85ecec9d2266?wsid=/subscriptions/d2d90bd8-e567-4097-88c9-9532cc375686/resourcegroups/cloud_shell/workspaces/Udacity_1&tid=f3822f31-4d32-4719-a061-c45fac0a64ab


In [23]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

ERROR:azureml.data._dataset_client:[NOT_SUPPORTED_API_USE_ATTEMPT] The [_DatasetClient.get] API has been deprecated and is no longer supported
ERROR:azureml.data._dataset_client:[NOT_SUPPORTED_API_USE_ATTEMPT] The [_DatasetClient.get] API has been deprecated and is no longer supported


In [19]:
pipeline_run.wait_for_completion()

PipelineRunId: e93cbeb6-34e3-4874-b69a-85ecec9d2266
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/e93cbeb6-34e3-4874-b69a-85ecec9d2266?wsid=/subscriptions/d2d90bd8-e567-4097-88c9-9532cc375686/resourcegroups/cloud_shell/workspaces/Udacity_1&tid=f3822f31-4d32-4719-a061-c45fac0a64ab
PipelineRun Status: Running


StepRunId: b13c1a91-0551-4dcb-9338-c227b384df2b
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/b13c1a91-0551-4dcb-9338-c227b384df2b?wsid=/subscriptions/d2d90bd8-e567-4097-88c9-9532cc375686/resourcegroups/cloud_shell/workspaces/Udacity_1&tid=f3822f31-4d32-4719-a061-c45fac0a64ab
StepRun( automl_module ) Status: Running


ERROR:azureml.data._dataset_client:[NOT_SUPPORTED_API_USE_ATTEMPT] The [_DatasetClient.get] API has been deprecated and is no longer supported
ERROR:azureml.data._dataset_client:[NOT_SUPPORTED_API_USE_ATTEMPT] The [_DatasetClient.get] API has been deprecated and is no longer supported


## Examine Results

### Retrieve the metrics of all child runs
Outputs of the above run can be used as inputs to other steps in the pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [24]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

ERROR:azureml.data._dataset_client:[NOT_SUPPORTED_API_USE_ATTEMPT] The [_DatasetClient.get] API has been deprecated and is no longer supported


In [25]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,b13c1a91-0551-4dcb-9338-c227b384df2b_3,b13c1a91-0551-4dcb-9338-c227b384df2b_10,b13c1a91-0551-4dcb-9338-c227b384df2b_7,b13c1a91-0551-4dcb-9338-c227b384df2b_6,b13c1a91-0551-4dcb-9338-c227b384df2b_26,b13c1a91-0551-4dcb-9338-c227b384df2b_31,b13c1a91-0551-4dcb-9338-c227b384df2b_5,b13c1a91-0551-4dcb-9338-c227b384df2b_28,b13c1a91-0551-4dcb-9338-c227b384df2b_33,b13c1a91-0551-4dcb-9338-c227b384df2b_30,...,b13c1a91-0551-4dcb-9338-c227b384df2b_0,b13c1a91-0551-4dcb-9338-c227b384df2b_18,b13c1a91-0551-4dcb-9338-c227b384df2b_23,b13c1a91-0551-4dcb-9338-c227b384df2b_25,b13c1a91-0551-4dcb-9338-c227b384df2b_22,b13c1a91-0551-4dcb-9338-c227b384df2b_16,b13c1a91-0551-4dcb-9338-c227b384df2b_27,b13c1a91-0551-4dcb-9338-c227b384df2b_24,b13c1a91-0551-4dcb-9338-c227b384df2b_29,b13c1a91-0551-4dcb-9338-c227b384df2b_2
accuracy,[0.8880121396054628],[0.9083459787556905],[0.9119878603945372],[0.7918057663125948],[0.9101669195751139],[0.9059180576631259],[0.908649468892261],[0.9089529590288316],[0.9062215477996965],[0.9059180576631259],...,[0.9101669195751139],[0.9132018209408195],[0.7620637329286798],[0.9089529590288316],[0.9110773899848255],[0.8437025796661608],[0.9144157814871017],[0.9162367223065251],[0.9101669195751139],[0.8977238239757208]
precision_score_macro,[0.4440060698027314],[0.7982640315624551],[0.8065451980757572],[0.6656857633231223],[0.782781125204717],[0.7692516154463418],[0.7856451172940535],[0.7775612617754262],[0.7679055228648837],[0.8023620464980331],...,[0.7781441711329087],[0.7881835838009958],[0.6454977783948903],[0.7860925036001488],[0.7806355042016807],[0.6937974944145098],[0.7971214454807336],[0.8096815856013181],[0.7802371076593941],[0.771526544069397]
weighted_accuracy,[0.9843450583187134],[0.9686663170697974],[0.9679994692811393],[0.7765702339612472],[0.9565733773437545],[0.9539215076462723],[0.9619470207827714],[0.9543457608083667],[0.9507300661087603],[0.9724427450812216],...,[0.9512815952194833],[0.9546457273395061],[0.7471295939345365],[0.9616954582031879],[0.9517028590639043],[0.8411526027126678],[0.9598132228328224],[0.9647715810627646],[0.9539274862816189],[0.9730611889183236]
average_precision_score_micro,[0.9665375179041114],[0.9752498455464272],[0.9760584861374465],[0.9090543843572524],[0.9790738281097624],[0.9780978067836411],[0.9789389178388146],[0.9794501554117797],[0.9779349119273264],[0.9793702498898297],...,[0.9805151927136845],[0.9797798706773968],[0.8693492999099242],[0.9773540307790931],[0.9799578736633584],[0.8847988203557676],[0.9810541114897418],[0.9800356198767971],[0.979577567457319],[0.9666998730375083]
balanced_accuracy,[0.5],[0.6653862112783807],[0.6863829010812322],[0.8531718246095653],[0.7232498281920618],[0.7125685610923095],[0.693976256235563],[0.7261186965936646],[0.7269490244458152],[0.6379682576730074],...,[0.7445642005975768],[0.7462730180958679],[0.8222158315226351],[0.6965154015860049],[0.7474451094476768],[0.8539734406229913],[0.7315628316912014],[0.7207468041871123],[0.7339070143948192],[0.5942781010175104]
AUC_macro,[0.8949406035413736],[0.9237121814143637],[0.9290011799639528],[0.9233254977799266],[0.9394485845063509],[0.9360823529629692],[0.9388252597495217],[0.9403909811483624],[0.9369608426091096],[0.9424031253299546],...,[0.9446537630106308],[0.9415278773430249],[0.8946571899075109],[0.9310008206028745],[0.9437433198665548],[0.9229976271054576],[0.9460763883100212],[0.9418122171652339],[0.9415399177915222],[0.8871962796866519]
norm_macro_recall,[0.0],[0.3307724225567614],[0.37276580216246447],[0.7063436492191306],[0.4464996563841237],[0.42513712218461897],[0.38795251247112605],[0.45223739318732914],[0.45389804889163043],[0.2759365153460147],...,[0.48912840119515355],[0.49254603619173576],[0.6444316630452702],[0.3930308031720098],[0.4948902188953537],[0.7079468812459826],[0.4631256633824028],[0.4414936083742247],[0.4678140287896384],[0.18855620203502088]
precision_score_weighted,[0.788565560086672],[0.8950256468849379],[0.9005211086889047],[0.9166498470928354],[0.9021382069947883],[0.8973044160136986],[0.8973758906640772],[0.9015806674147608],[0.8997259014553856],[0.891188856477618],...,[0.9051980543721705],[0.907597716175493],[0.9076430253103694],[0.8979309459394659],[0.9062625859144872],[0.9162625570891886],[0.9067018682678301],[0.907373046539007],[0.903605295208037],[0.877014103638037]
f1_score_weighted,[0.8353395018439429],[0.8953324743236205],[0.9013350533065821],[0.8272976346559117],[0.9048928710960408],[0.9003945609451778],[0.899959550454415],[0.9043103287151534],[0.9023446319059719],[0.8886031510001888],...,[0.9072831557855962],[0.9098016443897835],[0.8039706057178911],[0.900539981658476],[0.9082846027144389],[0.8659213543958487],[0.9091205800396924],[0.9092400519650629],[0.9061241591737821],[0.8734704046383025]
log_loss,[0.253617897385941],[0.21235370304099976],[0.21382270165851136],[0.39271954524550934],[0.1876885204442146],[0.19353624173211753],[0.20462013175825502],[0.18566624756583958],[0.19288051489032065],[0.19986862844075845],...,[0.17851374134751752],[0.19693610296079414],[0.47111417421873736],[0.19873978109892296],[0.18227122039127427],[0.40290230966751145],[0.1802552796174314],[0.20047350822493099],[0.18333103089239522],[0.25345066198734084]
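
As the table above shows, the metrics file maps each child run ID to its metrics, with every value wrapped in a one-element list. The sketch below ranks child runs by a chosen metric. It is an illustration only: `rank_runs` is a hypothetical helper, and `sample_metrics` holds three columns copied from the table, keyed by the run-ID suffix (`_3`, `_0`, `_27`) instead of the full ID.

```python
def rank_runs(metrics_by_run, metric, higher_is_better=True):
    """Return (run_id, value) pairs sorted best-first for one metric."""
    scored = [
        (run_id, metrics[metric][0])  # each value is a one-element list
        for run_id, metrics in metrics_by_run.items()
        if metric in metrics
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)


# Subset of the table above; keys are shortened run-ID suffixes.
sample_metrics = {
    "_3": {
        "accuracy": [0.8880121396054628],
        "AUC_macro": [0.8949406035413736],
        "log_loss": [0.253617897385941],
    },
    "_0": {
        "accuracy": [0.9101669195751139],
        "AUC_macro": [0.9446537630106308],
        "log_loss": [0.17851374134751752],
    },
    "_27": {
        "accuracy": [0.9144157814871017],
        "AUC_macro": [0.9460763883100212],
        "log_loss": [0.1802552796174314],
    },
}

best_by_auc = rank_runs(sample_metrics, "AUC_macro")[0]
# Lower log loss is better, so reverse the sort direction.
best_by_loss = rank_runs(sample_metrics, "log_loss", higher_is_better=False)[0]
print(best_by_auc)
print(best_by_loss)
```

On the real file, the first argument would be `deserialized_metrics_output`. Note that the best run can differ by metric: in this subset, one run leads on `AUC_macro` and another on `log_loss`.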


### Retrieve the Best Model


In [58]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)



In [57]:
best_run, fitted_model = remote_run.get_output()


In [60]:
print(best_model_output._path_on_datastore)


azureml/b13c1a91-0551-4dcb-9338-c227b384df2b/model_data


In [69]:
import os

# Print the current working directory
print(os.getcwd())


/mnt/batch/tasks/shared/LS_root/mounts/clusters/monaejam2/code/Users/monaejam


In [76]:
import pickle

# Use the verified file path to load the model
file_path = '/mnt/batch/tasks/shared/LS_root/mounts/clusters/monaejam2/code/Users/monaejam/pipeline-project/model.pkl'

with open(file_path, 'rb') as f:
    best_model = pickle.load(f)

# Print the loaded model
print(best_model)



Pipeline(steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=True, working_dir='/mnt/batch/tasks/shared/LS_root/mounts/clusters/monaejam2/code/Users/monaejam')),
                ('prefittedsoftvotingclassifier',
                 PreFittedSoftVotingClassifier(classification_labels=array([0, 1]), estimators=[('44', Pipeline(steps=[('sparsenormalizer', Normali...nit_type': 'cpu'}), reg_alpha=1.1458333333333335, reg_lambda=2.3958333333333335, subsample=1, tree_method='hist'))]))], flatten_transform=False, weights=[0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142, 0.14285714285714285, 0.07142857142857142, 0.2857142857142857, 0.07142857142857142, 0.07142857142857142, 0.07142857142857142]))])
Y_transformer(['LabelEncoder', LabelEncoder()])


In [77]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=False, is_onnx_compatible=False, task='classification')),
 ('prefittedsoftvotingclassifier',
  PreFittedSoftVotingClassifier(classification_labels=numpy.array([0, 1]), estimators=[('44', Pipeline(memory=None, steps=[('sparsenormalizer', Normalizer(copy=True, norm='max')), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=0.8, eta=0.3, gamma=5, max_depth=6, max_leaves=15, n_estimators=100, n_jobs=1, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'cpu'}), random_state=0, reg_alpha=0.7291666666666667, reg_lambda=1.5625, subsample=0.8, tree_method='auto'))], verbose=False)), ('31', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False))

### Test the Model
#### Load Test Data
The test data must go through the same preparation steps as the training data; otherwise prediction might fail at the preprocessing step.

In [78]:
dataset_test = Dataset.Tabular.from_delimited_files(path='https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_test.csv')
df_test = dataset_test.to_pandas_dataframe()
df_test = df_test[pd.notnull(df_test['y'])]

y_test = df_test['y']
X_test = df_test.drop(['y'], axis=1)

{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe'}
{'infer_column_types': 'False', 'activity': 'to_pandas_dataframe', 'activityApp': 'TabularDataset'}


#### Testing Our Best Fitted Model

We use a confusion matrix to see how well the model performs.

In [79]:
from sklearn.metrics import confusion_matrix
ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

In [80]:
# Visualize the confusion matrix
pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

Unnamed: 0,0,1
0,3544,92
1,252,232
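
To read the matrix above in terms of familiar metrics, the sketch below derives accuracy, precision, and recall from the four counts. It follows scikit-learn's layout (rows are true labels, columns are predictions) and assumes that index 1 is the positive class; that assumption, and the helper name `binary_metrics`, are not stated in the original notebook.

```python
def binary_metrics(cm):
    """Derive summary metrics from a 2x2 confusion matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),  # share of predicted positives that are correct
        "recall": tp / (tp + fn),     # share of actual positives that are found
    }


# Counts copied from the confusion matrix above.
cm_values = [[3544, 92], [252, 232]]
summary = binary_metrics(cm_values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption, accuracy is about 0.917 while recall is about 0.479, so the model finds fewer than half of the positive cases even though overall accuracy looks high. This is typical of an imbalanced dataset.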


## Publish and run from REST endpoint

Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.

Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.


In [44]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", description="Training bankmarketing pipeline", version="1.0")

published_pipeline


Name,Id,Status,Endpoint
Bankmarketing Train,cf83288b-f7e6-4990-b8be-36c0a401b36a,Active,REST Endpoint


Authenticate again to retrieve the `auth_header` so that the endpoint can be called.

In [81]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()



Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header, and add a JSON payload object with the experiment name. This pipeline defines no `PipelineParameter` objects, so the payload needs no parameter values.

Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.


In [85]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Bankmarketing Train", description="Training bankmarketing pipeline", version="1.0")

published_pipeline

Name,Id,Status,Endpoint
Bankmarketing Train,dd8acd08-bc21-4d1f-a5b6-f797340a5c8d,Active,REST Endpoint


In [91]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

In [92]:
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    # Re-raise with the full response details, keeping the original error as the cause
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content)) from err

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  88e6810d-f11b-4645-bd72-78eb35bb978f


ERROR:azureml.data._dataset_client:[NOT_SUPPORTED_API_USE_ATTEMPT] The [_DatasetClient.get] API has been deprecated and is no longer supported

Use the run ID to monitor the status of the new run. The run takes another 10-15 minutes and looks similar to the previous pipeline run, so you can skip watching the full output if you don't need to see it again.
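
If you prefer to wait for the run in code rather than watch a widget, a simple polling loop is one option. The sketch below is illustrative only: `wait_for_run`, `TERMINAL_STATES`, and the fake status sequence are hypothetical names introduced here, and the set of terminal states is an assumption about which statuses you want to stop on.

```python
import time

# Assumption: these are the statuses at which polling should stop.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_run(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Run still in state {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for a real status lookup so the sketch runs without a workspace.
fake_statuses = iter(["NotStarted", "Running", "Running", "Completed"])
final_status = wait_for_run(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

With a real run you would pass something like `lambda: some_run.get_status()`, where `some_run` is a `PipelineRun` object for the submitted run ID.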

In [99]:
from azureml.core import Workspace

# Load the workspace from the config.json file
ws = Workspace.from_config()

# List all available experiments
experiment_names = ws.experiments.keys()

# Print the experiment names
print("Available Experiments:")
for experiment in experiment_names:
    print(experiment)


Available Experiments:
banking




In [100]:
from azureml.core import Workspace
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

# Load the Azure ML workspace (assumes a config.json file is available)
ws = Workspace.from_config()

# Run ID printed by the REST call above (the ID itself, not the endpoint URL)
run_id = "88e6810d-f11b-4645-bd72-78eb35bb978f"

# Create a PipelineRun object to track the run
experiment_name = "banking"  # Replace with the name of your experiment
published_pipeline_run = PipelineRun(ws.experiments[experiment_name], run_id)

# Show run details in a widget
RunDetails(published_pipeline_run).show()


_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [101]:
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(ws, 'banking')
runs = experiment.get_runs()
for run in runs:
    print(run.status)

Completed


In [10]:
best_run, fitted_model = remote_run.get_output()


2024-09-07 20:07:40.378130: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-09-07 20:07:40.378209: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2024-09-07 20:07:50.817637: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2024-09-07 20:07:56.764872: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-09-07 20:07:56.765023: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (monaejam2): /proc/driver/nvidia/version does not exist


In [None]:
best_run


In [None]:
best_run.register_model(model_name='best_run_automl', model_path='./outputs/')


In [54]:
# Register the model so it can be deployed
model = remote_run.register_model(model_name='best_run_automl.pkl')

# Reuse the run's environment and download the AutoML-generated scoring script
environment = best_run.get_environment()
entry_script = 'inference/scoring.py'
best_run.download_file('outputs/scoring_file_v_2_0_0.py', entry_script)

inference_config = InferenceConfig(entry_script=entry_script, environment=environment)

# Deploy to Azure Container Instances with key-based authentication enabled
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1,
                                                       auth_enabled=True,
                                                       enable_app_insights=False)

service = Model.deploy(ws, "aciservices", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)

In [48]:
%run endpoint.py


Response from urllib request:
{"Results": ["no", "no"]}
Response from requests:
{'Results': ['no', 'no']}
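
The contents of `endpoint.py` are not shown in this notebook. As a rough sketch of what a scoring request to a key-protected endpoint could look like, the helper below assembles the URL, headers, and JSON body without sending anything. Everything here is an assumption for illustration: the function name `build_scoring_request`, the placeholder URI and key, the `{"Inputs": {"data": [...]}}` payload shape, and the two feature columns. The real schema is defined by the generated scoring script, and a real request needs every feature column the model was trained on.

```python
import json


def build_scoring_request(scoring_uri, key, payload):
    """Return the URL, headers, and serialized body for a POST to a scoring endpoint."""
    headers = {"Content-Type": "application/json"}
    if key:
        # Key-based authentication is sent as a bearer token.
        headers["Authorization"] = f"Bearer {key}"
    return {"url": scoring_uri, "headers": headers, "data": json.dumps(payload)}


# Hypothetical values: replace with your service's scoring URI, key, and full feature set.
example_records = [{"age": 35, "job": "technician"}, {"age": 52, "job": "retired"}]
example_request = build_scoring_request(
    scoring_uri="http://<your-service>.azurecontainer.io/score",
    key="<primary-key>",
    payload={"Inputs": {"data": example_records}},  # assumed payload shape
)

print(example_request["headers"])
print(example_request["data"])
```

The resulting dictionary can be passed to an HTTP client, for example `requests.post(req["url"], headers=req["headers"], data=req["data"])`. The key can typically be obtained from the deployed service with `service.get_keys()`.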


In [50]:
logs = service.get_logs()
for line in logs.split('\n'):
    print(line)


2024-09-03T01:02:44,264405792+00:00 - rsyslog/run 
2024-09-03T01:02:44,269045245+00:00 - gunicorn/run 
2024-09-03T01:02:44,273232937+00:00 | gunicorn/run | 
2024-09-03T01:02:44,275265753+00:00 | gunicorn/run | ###############################################
2024-09-03T01:02:44,276986729+00:00 | gunicorn/run | AzureML Container Runtime Information
2024-09-03T01:02:44,280194589+00:00 | gunicorn/run | ###############################################
2024-09-03T01:02:44,281285726+00:00 | gunicorn/run | 
2024-09-03T01:02:44,283092811+00:00 - nginx/run 
2024-09-03T01:02:44,288257246+00:00 | gunicorn/run | 
2024-09-03T01:02:44,291294691+00:00 | gunicorn/run | AzureML image information: openmpi4.1.0-ubuntu20.04, Materializaton Build:20240709.v1
2024-09-03T01:02:44,293011033+00:00 | gunicorn/run | 
2024-09-03T01:02:44,298973454+00:00 | gunicorn/run | 
2024-09-03T01:02:44,301742174+00:00 | gunicorn/run | PATH environment variable: /azureml-envs/azureml-automl/bin:/opt/miniconda/bin:/usr/local/sbi

In [49]:
from azureml.core import Workspace, Webservice

# Load the workspace from the config file
ws = Workspace.from_config()

# List all deployed web services in the workspace
services = Webservice.list(ws)

# Print details of each service
for service in services:
    print(f"Name: {service.name}")
    print(f"Scoring URI: {service.scoring_uri}")
    print(f"Region: {service.location}")
    print(f"State: {service.state}")
    print("-" * 40)


Name: aciservice
Scoring URI: http://1b069481-9607-4f37-b5c7-5fab51c30dac.northcentralus.azurecontainer.io/score
Region: northcentralus
State: None
----------------------------------------
