# Azure Machine Learning Pipeline with AutoMLStep (AutoML training in pipeline)
This notebook demonstrates how to use **AutoMLStep** for training in an Azure Machine Learning Pipeline.
As a second pipeline step, it also uses a **PythonScriptStep** to register the model.

## Introduction
In this example we showcase how you can use an AzureML Dataset to load data for AutoML via an AML Pipeline.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing AmlCompute cluster to a workspace.
3. Define data loading in a **TabularDataset**.
4. Configure AutoML using **AutoMLConfig**.
5. Configure an **AutoMLStep** for training.
6. Configure a **PythonScriptStep** for registering the model in the Workspace.
7. Run the AML pipeline using AmlCompute.
8. Explore the results.
9. Test the best fitted model.
10. Publish the Pipeline in the Workspace.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.dataset import Dataset
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

from azureml.train.automl.runtime import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.83


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

cesardl-automl-ncentralus-demo-ws
cesardl-automl-ncentralus-demo-ws-resgrp
northcentralus
381b38e9-9840-4719-a5a0-61d9585e1e91


## Create an Azure ML experiment
Let's create an experiment named "automlstep-classif-porto" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step (only that folder is snapshotted). Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, it also lets the step be reused when nothing in its `source_directory` has changed.

In [3]:
# Choose a name for the run history container in the workspace.
experiment_name = 'automlstep-classif-porto'
project_folder = './project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
automlstep-classif-porto,cesardl-automl-ncentralus-demo-ws,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.

In [4]:
from azureml.core.compute import AmlCompute, ComputeTarget
# Define remote compute target to use
# Further docs on Remote Compute Target: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-remote

# Choose a name for your cluster.
amlcompute_cluster_name = "cpu-cluster"

# CODE TBD

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
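
One possible way to fill in the placeholder above is sketched below: reuse the cluster when it already exists, otherwise create it. The helper name `get_or_create_amlcompute`, the VM size `STANDARD_D2_V2`, and `max_nodes=4` are assumptions for illustration, not values taken from this notebook.

```python
def get_or_create_amlcompute(ws, cluster_name, vm_size="STANDARD_D2_V2", max_nodes=4):
    """Return the named AmlCompute cluster, creating it when it does not exist yet."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Looking up an unknown name raises ComputeTargetException.
        cluster = ComputeTarget(workspace=ws, name=cluster_name)
        print("Found existing training cluster.")
    except ComputeTargetException:
        print("Creating a new training cluster...")
        config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                       max_nodes=max_nodes)
        cluster = ComputeTarget.create(ws, cluster_name, config)

    print("Checking cluster status...")
    cluster.wait_for_completion(show_output=True)
    return cluster
```

Calling `aml_remote_compute = get_or_create_amlcompute(ws, amlcompute_cluster_name)` would then provide the `aml_remote_compute` variable that later cells expect.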


## Data

### (***Optional***) Submit dataset file into DataStore (Azure Blob under the covers)

In [5]:
datastore = ws.get_default_datastore()
datastore.upload(src_dir='../../../data/', 
                 target_path='Datasets/porto_seguro_safe_driver_prediction', overwrite=True, show_progress=True)

Uploading an estimated of 2 files
Uploading ../../../data/Place_here_the_Dataset_files.txt
Uploading ../../../data/porto_seguro_safe_driver_prediction_train.csv
Uploaded ../../../data/Place_here_the_Dataset_files.txt, 1 files out of an estimated total of 2
Uploaded ../../../data/porto_seguro_safe_driver_prediction_train.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_c40209a0f453435fa6f4329cc95f717e

## Load data into Azure ML Dataset and Register into Workspace

In [6]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file in the AML DataStore
aml_dataset_name = "porto_seguro_safe_driver_prediction_train"

if aml_dataset_name in ws.datasets.keys():
    dataset = ws.datasets[aml_dataset_name]
    print("Dataset found and loaded from the Workspace")
else:
    # Create AML Dataset and register it into Workspace
    print("Dataset does not exist in the current Workspace. It will be imported and registered.")

    # Create AML Dataset from file in AML DataStore
    datastore = ws.get_default_datastore()
    aml_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path('Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv'))
    data_origin_type = 'AMLDataStore'

    print(aml_dataset)

    # Register Dataset in Workspace
    registration_method = 'SDK'  # or 'UI'
    dataset = aml_dataset.register(workspace=ws,
                                   name=aml_dataset_name,
                                   description='Porto Seguro Safe Driver Prediction Train dataset file',
                                   tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                   create_new_version=True)

    print("Dataset created from file and registered in the Workspace")

Dataset found and loaded from the Workspace


In [7]:
print(dataset.take(1).to_pandas_dataframe().head())

   id  target  ps_ind_01  ps_ind_02_cat  ps_ind_03  ps_ind_04_cat  \
0   7       0          2              2          5              1   

   ps_ind_05_cat  ps_ind_06_bin  ps_ind_07_bin  ps_ind_08_bin       ...        \
0              0              0              1              0       ...         

   ps_calc_11  ps_calc_12  ps_calc_13  ps_calc_14  ps_calc_15_bin  \
0           9           1           5           8               0   

   ps_calc_16_bin  ps_calc_17_bin  ps_calc_18_bin  ps_calc_19_bin  \
0               1               1               0               0   

   ps_calc_20_bin  
0               1  

[1 rows x 59 columns]


In [8]:
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,id,target,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,...,ps_calc_11,ps_calc_12,ps_calc_13,ps_calc_14,ps_calc_15_bin,ps_calc_16_bin,ps_calc_17_bin,ps_calc_18_bin,ps_calc_19_bin,ps_calc_20_bin
0,7,0,2,2,5,1,0,0,1,0,...,9,1,5,8,0,1,1,0,0,1
1,9,0,1,1,7,0,0,0,0,1,...,3,1,1,9,0,1,1,0,1,0
2,13,0,5,4,9,1,0,0,0,1,...,4,2,7,7,0,1,1,0,1,0
3,16,0,0,1,2,0,0,1,0,0,...,2,2,4,9,0,0,0,0,0,0
4,17,0,0,2,0,1,0,1,0,0,...,3,1,1,3,0,0,0,1,1,0


### Segregate a Test dataset for later testing and creating a confusion matrix
Split the original AML Tabular Dataset into train and test AML Tabular Datasets (using the AML Dataset `random_split` function).

In [9]:
# The name of the train Dataset to create
train_dataset_name = "porto_seguro_safe_driver_prediction_train90"

In [10]:
# Split using Azure Tabular Datasets (Better for Remote Compute)
# https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py#random-split-percentage--seed-none-

train_dataset, test_dataset = dataset.random_split(0.9, seed=1)

#Register Train Dataset (90%) after Split in Workspace
registration_method = 'SDK'  # or 'UI'
data_origin_type = 'SPLIT'
train_dataset = train_dataset.register(workspace=ws,
                                       name=train_dataset_name,
                                       description='Porto Seguro Safe Driver Prediction Train dataset file (90%)',
                                       tags={'Registration-Method': registration_method, 'Data-Origin-Type': data_origin_type},
                                       create_new_version=True)

# Load from Workspace
train_dataset = ws.datasets[train_dataset_name] 
train_dataset

{
  "source": [
    "('workspaceblobstore', 'Datasets/porto_seguro_safe_driver_prediction/porto_seguro_safe_driver_prediction_train.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes",
    "RandomSplit"
  ],
  "registration": {
    "id": "17771274-8b00-4ccf-a090-e0c9821eabdc",
    "name": "porto_seguro_safe_driver_prediction_train90",
    "version": 2,
    "description": "Porto Seguro Safe Driver Prediction Train dataset file (90%)",
    "tags": {
      "Registration-Method": "SDK",
      "Data-Origin-Type": "SPLIT"
    },
    "workspace": "Workspace.create(name='cesardl-automl-ncentralus-demo-ws', subscription_id='381b38e9-9840-4719-a5a0-61d9585e1e91', resource_group='cesardl-automl-ncentralus-demo-ws-resgrp')"
  }
}

## Train configuration in AutoMLConfig class
This creates a general AutoML settings object.

In [12]:
# Initialize your AutoMLConfig object
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
                              # CODE TBD
                            )
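
A hedged sketch of what the configuration might look like for this binary classification task follows. The settings (metric, iteration count, timeout, folds) are illustrative assumptions, `build_automl_config` is a hypothetical helper, and the parameter names follow the 1.0.x SDK printed earlier, so they may differ in other releases.

```python
# Illustrative AutoML settings; every value here is an assumption to tune.
automl_settings = {
    "primary_metric": "AUC_weighted",   # metric AutoML optimizes
    "iterations": 5,                    # number of candidate pipelines to try
    "iteration_timeout_minutes": 10,    # cap per candidate
    "n_cross_validations": 3,
    "enable_early_stopping": True,
}


def build_automl_config(training_data, compute_target, label_column_name="target",
                        settings=None):
    """Create an AutoMLConfig for classification on a TabularDataset."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.train.automl import AutoMLConfig

    chosen = settings if settings is not None else automl_settings
    return AutoMLConfig(task="classification",
                        training_data=training_data,
                        label_column_name=label_column_name,
                        compute_target=compute_target,
                        **chosen)
```

With the variables defined earlier, `automl_config = build_automl_config(train_dataset, aml_remote_compute)` would produce the object the AutoMLStep needs.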

# The AML Pipeline

### PipelineData objects

The **PipelineData** object is a special kind of data reference used for interim storage locations that can be passed between pipeline steps, so we'll create one and use it as the output of the first step and the input of the second step. Note that we also need to pass it as a script argument so our code can access the datastore location referenced by the data reference.

In [13]:
# Create your PipelineData objects needed to communicate data out of the steps:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                            datastore=ds,
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput(type='Metrics'))

model_data = PipelineData(name='model_data',
                          datastore=ds,
                          pipeline_output_name=best_model_output_name,
                          training_output=TrainingOutput(type='Model'))

print(model_data.get_env_variable_name())

$AZUREML_DATAREFERENCE_model_data


## Create an AutoMLStep for training.
Pipelines consist of one or more *steps*, which can be Python scripts, or specialized steps like an AutoMLStep for training or a data transfer step that copies data from one location to another. 
Each step can run in its own compute context.

In [14]:
# Create your AutoMLStep object providing your automl_config and outputs=[metrics_data, model_data] as parameters.

# CODE TBD
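
As a sketch of the step described in the comment above, the helper below wires an `AutoMLConfig` and the two `PipelineData` outputs into an `AutoMLStep`. The helper name `build_automl_step` is hypothetical; the step name `automl_module` mirrors the name that appears in the submission output later in this notebook, and the import path follows the import cell at the top (other SDK releases may expose the class elsewhere).

```python
def build_automl_step(automl_config, outputs, name="automl_module", allow_reuse=True):
    """Wrap an AutoMLConfig in a pipeline step that emits the given PipelineData outputs."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.train.automl.runtime import AutoMLStep

    return AutoMLStep(name=name,
                      automl_config=automl_config,
                      outputs=outputs,
                      allow_reuse=allow_reuse)
```

Later cells refer to the step as `automl_step`, so a call would look like `automl_step = build_automl_step(automl_config, [metrics_data, model_data])`.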

## Create a PythonScriptStep to register the model in the Workspace.

Write the Python code that registers the model to a file named register_model.py.
The script for the second step of the pipeline loads the model from where it was saved and then registers it in the workspace. It includes a single **model_folder** parameter that contains the path where the model was saved.

### Environment, Conda-Dependencies and RunConfiguration for PythonScriptStep

In [16]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE, RunConfiguration

# Create an Environment for future usage
custom_env = Environment("python-script-step-env")
custom_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
custom_env.docker.enabled = True 
custom_env.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE 

conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk[automl]', 'applicationinsights'], #'azureml-explain-model'
                                              conda_packages=['numpy==1.16.2'], 
                                              pin_sdk_version=False)

# Add the dependencies to the environment
custom_env.python.conda_dependencies = conda_dependencies

# Register the environment (To use it again)
custom_env.register(workspace=ws)
registered_env = Environment.get(ws, 'python-script-step-env')

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to AmlCompute
conda_run_config.target = aml_remote_compute

# Assign the environment to the run configuration
conda_run_config.environment = registered_env

print('Run config is ready')

Run config is ready


## Register model step
Script to register the model to the workspace.

In [17]:
import os
import shutil
scripts_folder="Scripts"
os.makedirs(scripts_folder, exist_ok=True)

### Create the register_model.py script file

In [18]:
# Copy here the contents of the register_model.py.txt file provided in the Challenge
# so you will generate the code to register the Model into the "Scripts/register_model.py" file when running this cell

# CODE TBD

Writing Scripts/register_model.py
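
The Challenge file itself is not reproduced here. As a rough idea of what a registration script of this kind could contain, the sketch below parses its arguments and registers the model from inside the run. The argument names `--model_name`, `--model_folder`, and `--ds_name` are assumptions chosen to match the pipeline parameters and the **model_folder** parameter mentioned above; they are not taken from the provided file.

```python
import argparse


def parse_args(argv=None):
    """Parse the (assumed) command-line arguments passed by the pipeline step."""
    parser = argparse.ArgumentParser(description="Register a trained model in the workspace")
    parser.add_argument("--model_name", required=True, help="Name to register the model under")
    parser.add_argument("--model_folder", required=True, help="Path where the model was saved")
    parser.add_argument("--ds_name", default=None, help="Training dataset to link to the model")
    return parser.parse_args(argv)


def register_model(args):
    """Register the model found at args.model_folder in the run's workspace."""
    # Imported inside the function so that parsing can be tried without the SDK.
    from azureml.core import Dataset, Run
    from azureml.core.model import Model

    run = Run.get_context()
    ws = run.experiment.workspace

    datasets = None
    if args.ds_name:
        # Link the registered model to the dataset it was trained on.
        training_ds = Dataset.get_by_name(ws, args.ds_name)
        datasets = [(Dataset.Scenario.TRAINING, training_ds)]

    model = Model.register(workspace=ws,
                           model_path=args.model_folder,
                           model_name=args.model_name,
                           datasets=datasets)
    print("Registered model", model.name, "version", model.version)
    return model
```

In a script file, the last lines would call `register_model(parse_args())` under the usual `if __name__ == "__main__":` guard.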


### PythonScriptStep to run register_model.py script

In [19]:
# Parameters to use in the Pipeline
from azureml.pipeline.core import PipelineParameter

# The model name with which to register the trained model in the workspace.
model_name = "porto-model-from-automlstep"
model_name_param = PipelineParameter("model_name", default_value=model_name)

# The Dataset name to relate with the model to register in the workspace.
dataset_name_param = PipelineParameter(name="ds_name", default_value=train_dataset_name)

In [20]:
# Write your code to create the PythonScriptStep for Model registration
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

register_model_step = PythonScriptStep(
                                       # CODE TBD
                                      )

register_model_step.run_after(automl_step)

print("Pipeline steps defined")

Pipeline steps defined
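
A minimal sketch of how the registration step could be defined is shown below, assuming the script accepts `--model_name`, `--model_folder`, and `--ds_name` (hypothetical argument names). The step name `register_model` mirrors the name shown in the submission output further down; both helper names are made up for this example.

```python
def register_step_arguments(model_data, model_name_param, dataset_name_param):
    """Command-line arguments handed to register_model.py (argument names are assumed)."""
    return ["--model_name", model_name_param,
            "--model_folder", model_data,
            "--ds_name", dataset_name_param]


def build_register_model_step(model_data, model_name_param, dataset_name_param,
                              compute_target, runconfig, source_directory="Scripts"):
    """Create the PythonScriptStep that runs register_model.py."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.pipeline.steps import PythonScriptStep

    return PythonScriptStep(
        name="register_model",
        script_name="register_model.py",
        source_directory=source_directory,
        arguments=register_step_arguments(model_data, model_name_param, dataset_name_param),
        inputs=[model_data],          # makes the AutoML model output available to the script
        compute_target=compute_target,
        runconfig=runconfig,
        allow_reuse=False,            # always re-register instead of reusing a cached run
    )
```

Passing `model_data` both in `arguments` and in `inputs` follows the earlier note that the data reference must also be handed to the script as an argument.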


This is a simple example, designed to demonstrate the principle. In reality, you could build more sophisticated logic into the pipeline steps - for example, evaluating the model against some test data to calculate a performance metric like AUC or accuracy, comparing the metric to that of any previously registered versions of the model, and only registering the new model if it performs better.

## Create Pipeline and add the multiple steps into it

In [21]:
from azureml.pipeline.core import Pipeline

# CODE TBD 

Pipeline is built.
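
One way the pipeline could be assembled from the two steps is sketched here. The helper name `build_pipeline` and the description text are assumptions; `validate()` is called so that configuration problems surface before submission.

```python
def build_pipeline(ws, steps, description="AutoML training and model registration"):
    """Assemble the given steps into an Azure ML Pipeline and validate it."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.pipeline.core import Pipeline

    pipeline = Pipeline(workspace=ws, steps=steps, description=description)
    pipeline.validate()
    print("Pipeline is built.")
    return pipeline
```

With the steps defined earlier, `pipeline = build_pipeline(ws, [automl_step, register_model_step])` would yield the `pipeline` object submitted in the next cell.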


In [22]:
# Submit the Pipeline to start the run

pipeline_run = experiment.submit(pipeline, pipeline_parameters={
        "ds_name": train_dataset_name, "model_name": model_name})

print("Pipeline submitted for execution.")

Created step automl_module [34b9fc4a][b2f9cb13-7a2c-4d79-ab3e-22da8c671c62], (This step will run and generate new outputs)
Created step register_model [76f8bc02][b5ea4c91-a5fc-4a0b-829e-73542a6c6a9b], (This step will run and generate new outputs)

Submitted PipelineRun 7df6727a-aabc-4428-bd0c-3fea0b07379d
Link to Azure Machine Learning studio: https://ml.azure.com/experiments/automlstep-classif-porto/runs/7df6727a-aabc-4428-bd0c-3fea0b07379d?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/cesardl-automl-ncentralus-demo-ws-resgrp/workspaces/cesardl-automl-ncentralus-demo-ws
Pipeline submitted for execution.


In [23]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [24]:
pipeline_run.wait_for_completion()

PipelineRunId: 7df6727a-aabc-4428-bd0c-3fea0b07379d
Link to Portal: https://ml.azure.com/experiments/automlstep-classif-porto/runs/7df6727a-aabc-4428-bd0c-3fea0b07379d?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/cesardl-automl-ncentralus-demo-ws-resgrp/workspaces/cesardl-automl-ncentralus-demo-ws
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: e1670d4f-690d-45d5-af45-cc8da0a935e6
Link to Portal: https://ml.azure.com/experiments/automlstep-classif-porto/runs/e1670d4f-690d-45d5-af45-cc8da0a935e6?wsid=/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/cesardl-automl-ncentralus-demo-ws-resgrp/workspaces/cesardl-automl-ncentralus-demo-ws
StepRun( automl_module ) Status: NotStarted
StepRun( automl_module ) Status: Running

StepRun(automl_module) Execution Summary
StepRun( automl_module ) Status: Finished




StepRunId: c8fcd14f-a1c2-4207-8e31-9cd0c7cadd18
Link to Portal: https://ml.azure.com/experiments/automlstep-classif-po

'Finished'

## Examine Results from Pipeline

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other pipeline steps. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [None]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

In [None]:
import json
with open(metrics_output._path_on_datastore) as f:  
   metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

### Retrieve info about the trained model

In [None]:
print(pipeline_run.get_file_names())

### Retrieve the Best Model

In [None]:
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

In [None]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

### Test the Model
#### Prepare the Test DataFrame

In [None]:
test_df = test_dataset.to_pandas_dataframe()
print(test_df.shape)

test_df = test_df[pd.notnull(test_df['target'])]

if 'target' in test_df.columns:
    y_test = test_df[['target']]
    X_test = test_df.drop(['target'], axis=1)

print(y_test.shape)
print(X_test.shape)

X_test.describe()

#### Testing Our Best Fitted Model

In [None]:
# Try the best model making predictions with the test dataset
y_predictions = best_model.predict(X_test)

print('10 predictions: ')
print(y_predictions[:10])

### Calculate Accuracy

In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy with Scikit-Learn model:')
print(accuracy_score(y_test, y_predictions))


### Calculate AUC with Test Dataset

In [None]:
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_predictions)
print('AUC (Area Under the Curve) with Test dataset:')
metrics.auc(fpr, tpr)
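
Note that `roc_curve` above receives hard class labels from `predict`, so the resulting curve has a single operating point. AUC is usually computed from scores or probabilities (for scikit-learn style classifiers, typically the positive-class column of `predict_proba`, assuming the fitted model exposes it). The self-contained sketch below uses made-up toy values and a hypothetical `pairwise_auc` helper to show how thresholding can discard ranking information.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: the model ranks most positives above negatives, but no
# probability reaches the 0.5 threshold, so every hard prediction is 0.
toy_true = [0, 0, 1, 1]
toy_prob = [0.1, 0.3, 0.2, 0.4]
toy_hard = [1 if p >= 0.5 else 0 for p in toy_prob]

auc_from_probabilities = pairwise_auc(toy_true, toy_prob)
auc_from_hard_labels = pairwise_auc(toy_true, toy_hard)
print(auc_from_probabilities, auc_from_hard_labels)
```

On this toy data the probabilities rank three of the four positive/negative pairs correctly, while the thresholded labels are all ties and look no better than chance.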

## Show Confusion Matrix
We will use a confusion matrix to see how the model performs.

In [None]:
from pandas_ml import ConfusionMatrix

cm = ConfusionMatrix(y_test['target'], y_predictions)

print(cm)

cm.plot()
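
`pandas_ml` is a third-party package that may not be installed in every environment. As a dependency-free illustration of what the matrix contains, the sketch below counts the four outcomes for binary labels using only the standard library. `confusion_counts` is a hypothetical helper and the toy labels are made up for the example; scikit-learn's `sklearn.metrics.confusion_matrix` returns the same counts as an array.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    pairs = Counter((bool(t == positive), bool(p == positive))
                    for t, p in zip(y_true, y_pred))
    return {
        "tn": pairs[(False, False)],  # actual negative, predicted negative
        "fp": pairs[(False, True)],   # actual negative, predicted positive
        "fn": pairs[(True, False)],   # actual positive, predicted negative
        "tp": pairs[(True, True)],    # actual positive, predicted positive
    }


toy_true = [0, 0, 1, 1, 0, 1]
toy_pred = [0, 1, 1, 0, 0, 1]
counts = confusion_counts(toy_true, toy_pred)
print(counts)
```

The same helper could be applied to the notebook's data as `confusion_counts(y_test['target'], y_predictions)`.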

## Publish the Pipeline
Now that you've created a pipeline and verified it works, you can publish it as a REST service.

In [None]:
# CODE TBD
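
A hedged sketch of the publishing step follows: it publishes the pipeline behind a completed run and returns the published pipeline together with its REST endpoint. The helper name `publish_from_run` and the default version string are assumptions.

```python
def publish_from_run(pipeline_run, name, description, version="1.0"):
    """Publish the pipeline behind a run and return (published_pipeline, endpoint)."""
    published_pipeline = pipeline_run.publish_pipeline(name=name,
                                                       description=description,
                                                       version=version)
    # The endpoint property holds the REST URI used to trigger the pipeline.
    return published_pipeline, published_pipeline.endpoint
```

For example, `published_pipeline, rest_endpoint = publish_from_run(pipeline_run, "porto-automl-pipeline", "AutoML training and registration")`, where the name and description are placeholders to replace with your own.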


# Trigger the AML Pipeline by using the Pipeline REST Endpoint

To use the endpoint, client applications need to make a REST call over HTTP. This request must be authenticated, so an authorization header is required. A real application would require a service principal with which to be authenticated, but to test this out, we'll use the authorization header from your current connection to your Azure workspace, which you can get using the following code:

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
print(auth_header)
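
For the service principal case mentioned above, a sketch along the following lines could produce the same kind of header without an interactive login. The environment variable names and both helper names are hypothetical; only the three `ServicePrincipalAuthentication` arguments come from the SDK.

```python
import os

# Hypothetical environment variable names; use whatever your deployment defines.
SP_ENV_VARS = {
    "tenant_id": "AML_TENANT_ID",
    "service_principal_id": "AML_SP_APP_ID",
    "service_principal_password": "AML_SP_SECRET",
}


def read_sp_settings(env=None):
    """Collect service principal settings from the environment, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in SP_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in SP_ENV_VARS.items()}


def service_principal_header(env=None):
    """Build an authorization header from service principal credentials."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.core.authentication import ServicePrincipalAuthentication

    sp_auth = ServicePrincipalAuthentication(**read_sp_settings(env))
    return sp_auth.get_authentication_header()
```

Keeping the secret in the environment rather than in the notebook avoids committing credentials alongside the code.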

Now you're ready to call the REST interface. The pipeline runs asynchronously, so you'll get an identifier back, which you can use to track the pipeline experiment as it runs:

### The REST Endpoint
Note that the published pipeline has an endpoint, which you can see in the Endpoints page (on the Pipeline Endpoints tab) in Azure Machine Learning studio. You can also find its URI as a property of the published pipeline object.
So, you could also copy that REST endpoint from the AML portal and paste it here:

rest_endpoint = "Your copied REST Endpoint here"
    

In [None]:
import requests

response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": experiment_name})
run_id = response.json()["Id"]
run_id
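
The request above relies on the pipeline parameters' default values. As a sketch, the body can also carry a `ParameterAssignments` entry to override them, and the response can be checked before reading the run ID. The helper names and the 30-second timeout are assumptions; the parameter names `model_name` and `ds_name` are the ones defined for this pipeline.

```python
def build_submit_payload(experiment_name, parameters=None):
    """Build the JSON body for triggering a published pipeline."""
    payload = {"ExperimentName": experiment_name}
    if parameters:
        # Overrides for PipelineParameter defaults, keyed by parameter name.
        payload["ParameterAssignments"] = dict(parameters)
    return payload


def submit_published_pipeline(rest_endpoint, auth_header, payload, timeout=30):
    """POST the payload to the pipeline endpoint and return the new run ID."""
    import requests  # imported here so the payload helper works without it

    response = requests.post(rest_endpoint, headers=auth_header,
                             json=payload, timeout=timeout)
    response.raise_for_status()  # surface authentication or endpoint errors early
    return response.json()["Id"]


example_payload = build_submit_payload(
    "automlstep-classif-porto",
    {"model_name": "porto-model-from-automlstep",
     "ds_name": "porto_seguro_safe_driver_prediction_train90"},
)
print(example_payload)
```

A call such as `run_id = submit_published_pipeline(rest_endpoint, auth_header, example_payload)` would then replace the request in the cell above.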

Since you have the run ID, you can use the RunDetails widget to view the experiment as it runs.

In [None]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

# CODE TBD TO CREATE PipelineRun
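
A minimal sketch for the placeholder above: rebuild a `PipelineRun` handle from the experiment name and the run ID returned by the REST call. The helper name `attach_to_pipeline_run` is hypothetical.

```python
def attach_to_pipeline_run(ws, experiment_name, run_id):
    """Return a PipelineRun handle for a run that was started through the REST endpoint."""
    # Imported inside the helper so that defining it does not require the SDK.
    from azureml.core import Experiment
    from azureml.pipeline.core.run import PipelineRun

    return PipelineRun(Experiment(ws, experiment_name), run_id)
```

The handle can then be passed to the widget, for example `RunDetails(attach_to_pipeline_run(ws, experiment_name, run_id)).show()`.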


# Next Steps!
You can use the Azure Machine Learning extension for Azure DevOps to combine Azure ML pipelines with Azure DevOps pipelines (yes, it is confusing that they have the same name!) and integrate model retraining into a continuous integration/continuous deployment (CI/CD) process. For example, you could use an Azure DevOps build pipeline to trigger an Azure ML pipeline that trains and registers a model. When the model is registered, it could trigger an Azure DevOps release pipeline that deploys the model as a web service, along with the application or service that consumes the model.