Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Azure Machine Learning Pipeline with AutoMLStep
This notebook demonstrates the use of AutoMLStep in Azure Machine Learning Pipeline.

## Introduction
This example shows how to use an Azure ML `Dataset` to load data for AutoML inside an AML Pipeline.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or Attach existing AmlCompute to a workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using AmlCompute.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [120]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.dataset import Dataset
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

from azureml.train.automl.runtime import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.83


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at .\config.json

In [121]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

cesardl-automl-ncentralus-demo-ws
cesardl-automl-ncentralus-demo-ws-resgrp
northcentralus
381b38e9-9840-4719-a5a0-61d9585e1e91
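
`Workspace.from_config()` reads the workspace coordinates from a JSON file. As a minimal sketch, assuming the file carries the three keys `subscription_id`, `resource_group`, and `workspace_name`, the standard-library helper below reports which keys are missing before the SDK is called. The helper name `check_workspace_config` is hypothetical and not part of the SDK, and the sample values are placeholders.

```python
import json

# Keys assumed to be required in config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(text):
    """Return the sorted list of required keys that are missing or empty."""
    config = json.loads(text)
    return sorted(key for key in REQUIRED_KEYS if not config.get(key))


# Placeholder payload for illustration only; workspace_name is left out on purpose
sample = json.dumps({
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
})
missing = check_workspace_config(sample)
print(missing)
```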


## Create an Azure ML experiment
Let's create an experiment named "automlstep-classif-porto" and define the project folder (`./project`) that AutoML will use. Runs will be recorded under this experiment in Azure.

The best practice is to keep the scripts and dependent files for each step in a separate folder and to specify that folder as the `source_directory` for the step. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping the folder small and specific also allows the step to be reused when nothing in its `source_directory` has changed.
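
The reuse idea can be pictured as a content fingerprint over the step folder: if no file changes, the fingerprint stays the same and the previous result can be reused. The sketch below is conceptual and uses only the standard library; it is not how the service implements snapshots, and `fingerprint_directory` is a hypothetical helper.

```python
import hashlib
import os
import tempfile


def fingerprint_directory(root):
    """Hash relative paths and file bytes under root in a stable order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            digest.update(os.path.relpath(full, root).encode("utf-8"))
            with open(full, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as step_dir:
    script = os.path.join(step_dir, "train.py")
    with open(script, "w") as f:
        f.write("print('v1')\n")
    before = fingerprint_directory(step_dir)
    unchanged = fingerprint_directory(step_dir)  # nothing edited in between
    with open(script, "w") as f:
        f.write("print('v2')\n")
    after = fingerprint_directory(step_dir)  # script content changed

print(before == unchanged, before == after)
```

A folder that holds only the files one step needs changes less often, so its fingerprint is stable across more submissions.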

In [122]:
# Choose a name for the run history container in the workspace.
experiment_name = 'automlstep-classif-porto'
project_folder = './project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
automlstep-classif-porto,cesardl-automl-ncentralus-demo-ws,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use an `AmlCompute` cluster as the training compute resource: the code reuses the cluster if it already exists in the workspace and creates it otherwise.

In [123]:
from azureml.core.compute import AmlCompute, ComputeTarget
# Define remote compute target to use
# Further docs on Remote Compute Target: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-remote

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
     found = True
     print('Found existing training cluster.')
     # Get existing cluster
     # Method 1:
     aml_remote_compute = cts[amlcompute_cluster_name]
     # Method 2:
     # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
    
if not found:
     print('Creating a new training cluster...')
     provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D13_V2", # for GPU, use "STANDARD_NC12"
                                                                 #vm_priority = 'lowpriority', # optional
                                                                 max_nodes = 6)
     # Create the cluster.
     aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)

# For a more detailed view of current AmlCompute status, use get_status().

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned


In [124]:
# NOT NEEDED, USE DEFAULTS IN AML COMPUTE

# create a new RunConfig object
# conda_run_config = RunConfiguration(framework="python")

# conda_run_config.environment.docker.enabled = True
# conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE

# cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'], 
#                               conda_packages=['numpy', 'py-xgboost<=0.80'])
# conda_run_config.environment.python.conda_dependencies = cd

# print('run config is ready')

## Data

### (Optional) Upload the dataset files to the Datastore (Azure Blob storage under the covers)

In [125]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir='../../../data/', 
                 target_path='Datasets/porto_seguro_safe_driver_prediction', overwrite=True, show_progress=True)

Uploading an estimated of 2 files
Uploading ../../../data/porto_seguro_safe_driver_prediction_test.csv
Uploading ../../../data/porto_seguro_safe_driver_prediction_train.csv
Uploaded ../../../data/porto_seguro_safe_driver_prediction_train.csv, 1 files out of an estimated total of 2
Uploaded ../../../data/porto_seguro_safe_driver_prediction_test.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_10525a9da43d465897b88f19823a0313

## Load data into Azure ML Dataset and Register into Workspace

In [126]:
# Try to load the dataset from the Workspace. Otherwise, create it from a file in the Datastore (Option A) or from an HTTP URL (Option B).
found = False
aml_dataset_name = "porto_seguro_safe_driver_prediction_train"

if aml_dataset_name in ws.datasets.keys(): 
       found = True
       dataset = ws.datasets[aml_dataset_name] 
       print("Dataset loaded from the Workspace")
       
if not found:
        # Create AML Dataset and register it into Workspace
        print("Dataset does not exist in the current Workspace. It will be imported and registered.")
        
        # Option A: Create AML Dataset from file in AML DataStore
        datastore = ws.get_default_datastore()
        aml_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path('Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv'))
        data_origin_type = 'AMLDataStore'
        
        # Option B: Create AML Dataset from file in HTTP URL
        # data_url = 'https://url/porto_seguro_safe_driver_prediction_train.csv'
        # aml_dataset = Dataset.Tabular.from_delimited_files(data_url)  
        # data_origin_type = 'HttpUrl'
        
        print(aml_dataset)
                
        #Register Dataset in Workspace
        registration_method = 'SDK'  # or 'UI'
        dataset = aml_dataset.register(workspace=ws,
                                           name=aml_dataset_name,
                                           description='Porto Seguro Safe Driver Prediction Train dataset file',
                                           tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                           create_new_version=True)
        
        print("Dataset created from file and registered in the Workspace")

Dataset loaded from the Workspace


In [127]:
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


### Set aside a Test dataset for later testing and creating a confusion matrix
Split the original AML Tabular Dataset into a train dataset (about 90%) and a test dataset (about 10%) using the `random_split` method of `TabularDataset`.

In [128]:
# Split using Azure Tabular Datasets (Better for Remote Compute)
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py#random-split-percentage--seed-none-

train_dataset, test_dataset = dataset.random_split(0.9, seed=1)

#Register Train Dataset (90%) after Split in Workspace
registration_method = 'SDK'  # or 'UI'
data_origin_type = 'SPLIT'
train_dataset = train_dataset.register(workspace=ws,
                                       name="porto_seguro_safe_driver_prediction_train90",
                                       description='Porto Seguro Safe Driver Prediction Train dataset file (90%)',
                                       tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                       create_new_version=True)

# Load from Workspace
train_dataset = ws.datasets["porto_seguro_safe_driver_prediction_train90"] 
train_dataset

{
  "source": [
    "('workspaceblobstore', 'Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes",
    "RandomSplit"
  ],
  "registration": {
    "id": "17771274-8b00-4ccf-a090-e0c9821eabdc",
    "name": "porto_seguro_safe_driver_prediction_train90",
    "version": 2,
    "description": "Porto Seguro Safe Driver Prediction Train dataset file (90%)",
    "tags": {
      "Registration-Method": "SDK",
      "Data-Origin-Type": "SPLIT"
    },
    "workspace": "Workspace.create(name='cesardl-automl-ncentralus-demo-ws', subscription_id='381b38e9-9840-4719-a5a0-61d9585e1e91', resource_group='cesardl-automl-ncentralus-demo-ws-resgrp')"
  }
}
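
The call `dataset.random_split(0.9, seed=1)` returns two datasets, the first holding roughly 90% of the records. The sketch below illustrates the general idea of a seeded, approximate percentage split with the standard library only. It is a conceptual stand-in, not the SDK's implementation, and the local `random_split` function is hypothetical.

```python
import random


def random_split(rows, percentage, seed=None):
    """Assign each row to the first part with probability `percentage`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second


rows = list(range(1000))
train_rows, test_rows = random_split(rows, 0.9, seed=1)
train_again, _ = random_split(rows, 0.9, seed=1)  # same seed, same split

print(len(train_rows), len(test_rows))
print(train_rows == train_again)
```

Because each row is assigned independently, the split sizes are approximate rather than exact, and fixing the seed makes the split reproducible.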

### List possible metrics to optimize for (primary metric) in Classification using AutoML

In [129]:
from azureml.train import automl

# List of possible primary metrics is here:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric
    
# Get a list of valid metrics for your given task
automl.utilities.get_primary_metrics('classification')

['precision_score_weighted',
 'average_precision_score_weighted',
 'AUC_weighted',
 'accuracy',
 'norm_macro_recall']

## Train
This creates a general AutoML settings object.

In [130]:
# You can provide additional settings as a **kwargs parameter for the AutoMLConfig object
automl_settings = {
     "whitelist_models": ['LightGBM']
}

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='classification',
                             primary_metric='AUC_weighted',                            
                             training_data=train_dataset,
                             # validation_data=validation_dataset,
                             path = project_folder,
                             label_column_name="target",
                             enable_early_stopping= True,
                             # blacklist_models=['LinearSVMClassifier', 'MultinomialNaiveBayes'], 
                             # iteration_timeout_minutes= 5,
                             # iterations=2,
                             # experiment_exit_score= 0.65,
                             featurization= 'auto',
                             debug_log='automated_ml_errors.log',
                             verbosity= logging.INFO,
                             enable_onnx_compatible_models=False,
                             **automl_settings
                             )
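
The `**automl_settings` argument unpacks the dictionary into keyword arguments. The sketch below shows that mechanism with a hypothetical stand-in function, `build_config`, rather than the real `AutoMLConfig`. It also shows a common pitfall: a key passed both explicitly and through the dictionary raises a `TypeError`.

```python
def build_config(task, primary_metric, **extra_settings):
    """Hypothetical stand-in for a constructor that accepts extra keyword settings."""
    config = {"task": task, "primary_metric": primary_metric}
    config.update(extra_settings)
    return config


settings = {"whitelist_models": ["LightGBM"]}
config = build_config(task="classification", primary_metric="AUC_weighted", **settings)
print(config)

# Passing the same key explicitly and through the dictionary is rejected by Python.
clashing = {"task": "regression"}
try:
    build_config(task="classification", primary_metric="AUC_weighted", **clashing)
    duplicate_rejected = False
except TypeError:
    duplicate_rejected = True
print(duplicate_rejected)
```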

### Create Pipeline
You can define outputs for the AutoMLStep using TrainingOutput.

In [131]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

### Create an AutoMLStep.

In [132]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [133]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [134]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [6f85a331][778f415c-d69e-40c9-9be2-bdf17377d57a], (This step will run and generate new outputs)
Submitted PipelineRun b7dd76c5-5fba-44d9-890d-d680bb9b36a3
Link to Azure Machine Learning studio: https://ml.azure.com/experiments/automlstep-classif-porto/runs/b7dd76c5-5fba-44d9-890d-d680bb9b36a3?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/cesardl-automl-ncentralus-demo-ws-resgrp/workspaces/cesardl-automl-ncentralus-demo-ws


In [135]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [None]:
pipeline_run.wait_for_completion()

PipelineRunId: b7dd76c5-5fba-44d9-890d-d680bb9b36a3
Link to Portal: https://ml.azure.com/experiments/automlstep-classif-porto/runs/b7dd76c5-5fba-44d9-890d-d680bb9b36a3?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/cesardl-automl-ncentralus-demo-ws-resgrp/workspaces/cesardl-automl-ncentralus-demo-ws
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: 55d8bc5e-17da-4b5f-848d-cad63e426b48
Link to Portal: https://ml.azure.com/experiments/automlstep-classif-porto/runs/55d8bc5e-17da-4b5f-848d-cad63e426b48?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/cesardl-automl-ncentralus-demo-ws-resgrp/workspaces/cesardl-automl-ncentralus-demo-ws
StepRun( automl_module ) Status: NotStarted
StepRun( automl_module ) Status: Running


## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [None]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

In [None]:
import json
with open(metrics_output._path_on_datastore) as f:  
   metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df
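
Once the metrics are deserialized, a typical next step is to pick the child run with the best primary metric. The sketch below assumes the JSON maps each child run ID to a dictionary of metric names, each holding a one-element list. That shape is an assumption, the payload is made up for illustration, and `best_run` is a hypothetical helper.

```python
import json

# Made-up payload: run IDs and values are illustrative only
payload = json.dumps({
    "run_0": {"AUC_weighted": [0.61], "accuracy": [0.96]},
    "run_1": {"AUC_weighted": [0.63], "accuracy": [0.96]},
})


def best_run(metrics_json, metric):
    """Return the run ID with the highest value for `metric`."""
    metrics = json.loads(metrics_json)
    scores = {
        run_id: values[metric][0]
        for run_id, values in metrics.items()
        if metric in values
    }
    return max(scores, key=scores.get)


best = best_run(payload, "AUC_weighted")
print(best)
```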

### Retrieve the Best Model

In [None]:
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

In [None]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

### Test the Model
#### Load Test Data
The test data must go through the same preparation steps as the training data. Otherwise, prediction may fail at the preprocessing step.

In [None]:
# dataset_test = Dataset.Tabular.from_delimited_files(path='https://url/dataset-test.csv')
test_df = test_dataset.to_pandas_dataframe()
print(test_df.shape)

test_df = test_df[pd.notnull(test_df['target'])]

if 'target' in test_df.columns:
    y_test = test_df[['target']]
    X_test = test_df.drop(['target'], axis=1)

# Method 2:
# if 'target' in test_df.columns:
#     y_test = test_df.pop('target')
#
# X_test = test_df

print(y_test.shape)
print(X_test.shape)

X_test.describe()

#### Test the Best Fitted Model

In [None]:
# Try the best model making predictions with the test dataset
y_predictions = best_model.predict(X_test)

print('10 predictions: ')
print(y_predictions[:10])

### Calculate Accuracy

In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy with Scikit-Learn model:')
print(accuracy_score(y_test, y_predictions))
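
Accuracy should be read with care when the positive class is rare. The sketch below uses synthetic labels (not taken from this notebook's dataset) and a hand-written `accuracy` helper to show that a model predicting the majority class for every row can still score high.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Synthetic labels: 4 positives out of 100 rows
y_true = [0] * 96 + [1] * 4
always_negative = [0] * 100  # a "model" that never predicts the positive class

baseline = accuracy(y_true, always_negative)
print(baseline)
```

This is one reason the AutoML configuration above optimizes `AUC_weighted` rather than accuracy.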


### Calculate AUC

In [None]:
from sklearn import metrics

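# Note: y_predictions holds hard class labels from predict(), so the ROC curve has a single
# operating point. Passing class scores or probabilities instead usually gives a more informative AUC.
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_predictions)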
print('AUC (Area Under the Curve) with Scikit-Learn model:')
metrics.auc(fpr, tpr)

# AUC with plain LightGBM was: 0.6374553321494826 
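
AUC can be read as the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, with ties counted as one half. The standard-library sketch below uses synthetic scores and a hypothetical helper, `auc_from_scores`, to show how thresholding scores into hard labels before computing AUC discards ranking information.

```python
def auc_from_scores(y_true, scores):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic example: five rows with made-up probabilities
y_true = [0, 0, 0, 1, 1]
proba = [0.1, 0.3, 0.45, 0.4, 0.8]
hard_labels = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = auc_from_scores(y_true, proba)
auc_labels = auc_from_scores(y_true, hard_labels)
print(auc_proba, auc_labels)
```

In this example the two values differ, because the hard labels tie one positive with all three negatives.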

## Show Confusion Matrix
We use a confusion matrix to see how the model's predictions compare with the true labels for each class.

In [None]:
from pandas_ml import ConfusionMatrix

cm = ConfusionMatrix(y_test['target'], y_predictions)

print(cm)

cm.plot()
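
The four cells of a binary confusion matrix can also be counted directly. This is a minimal standard-library sketch on synthetic labels; `confusion_counts` is a hypothetical helper and does not replace the plotting shown above.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for 0/1 labels."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }


# Synthetic labels for illustration only
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]
cm_counts = confusion_counts(y_true, y_pred)
print(cm_counts)
```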