## Azure Machine Learning - Model Training Pipeline
This notebook demonstrates creation and execution of an Azure ML pipeline designed to load data from an AML-linked blob storage account, split the data into testing and training subsets, train a classification model, and evaluate and register the model. For the final evaluation step a champion vs. challenger A/B test is performed using a target metric of interest so that the best performing model is always reflected in the registry.

Note: This notebook builds on the Iris sample dataset available in Scikit-Learn.

### Import Required Packages

In [1]:
# Import required packages
from azureml.core import Workspace, Experiment, Datastore, Environment, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute, DataFactoryCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.pipeline.core import Pipeline, PipelineParameter, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.output_dataset_config import OutputTabularDatasetConfig, OutputDatasetConfig, OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.data.data_reference import DataReference
from azureml.data.sql_data_reference import SqlDataReference
from azureml.pipeline.steps import DataTransferStep
import logging
import os  # needed for os.getenv in the workspace fallback below

### Connect to Azure ML Workspace, Provision Compute Resources, and get References to Datastores
Connect to the workspace using its associated config file. Get a reference to your pre-existing AML compute cluster, or provision a new cluster to facilitate processing. Finally, get a reference to your default blob datastore.

In [2]:
# Connect to AML Workspace
ws = None
try:
    ws = Workspace.from_config()
except Exception:
    ws = Workspace(subscription_id=os.getenv('SUBSCRIPTION_ID'),  resource_group = os.getenv('RESOURCE_GROUP'), workspace_name = os.getenv('WORKSPACE_NAME'))


#Select AML Compute Cluster
cpu_cluster_name = 'cluster001'

# Reuse the cluster if it already exists; otherwise provision a new one
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found an existing cluster, using it instead.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2',
                                                           min_nodes=0,
                                                           max_nodes=1)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    cpu_cluster.wait_for_completion(show_output=True)
    
#Get default datastore
default_ds = ws.get_default_datastore()

Found an existing cluster, using it instead.


### Get Sample Toy Dataset and Save to Default (Blob) Datastore
Load data into a pandas dataframe directly from Scikit-Learn and save as a CSV to predefined location inside the default Azure ML datastore.

Note: These data can be written to any AML-linked datastore using the SDK commands shown below.

In [3]:
from sklearn.datasets import load_iris
import pandas as pd
import os
import shutil

data = load_iris()

input_df = pd.DataFrame(data.data, columns = data.feature_names)
output_df = pd.DataFrame(data.target, columns = ['target'])

merged_df = pd.concat([input_df, output_df], axis=1)

os.makedirs('./tmp', exist_ok=True)
merged_df.to_csv('./tmp/iris_data.csv', index=False)

default_ds.upload(src_dir='./tmp',
                 target_path='iris_data_training',
                 overwrite=True)

shutil.rmtree('./tmp')

"Datastore.upload" is deprecated after version 1.0.69. Please use "Dataset.File.upload_directory" to upload your files             from a local directory and create FileDataset in single method call. See Dataset API change notice at https://aka.ms/dataset-deprecation.


Uploading an estimated of 1 files
Uploading ./tmp/iris_data.csv
Uploaded ./tmp/iris_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


### Create Run Configuration
The `RunConfiguration` defines the environment used across all Python steps. You can optionally add further conda or pip packages to this environment. [More details here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py).

Here, we also register the environment to the AML workspace so that it can be used for future retraining and inferencing operations.

In [4]:
run_config = RunConfiguration()
run_config.docker.use_docker = True
run_config.environment = Environment(name='sample_env')
run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
run_config.environment.python.conda_dependencies = CondaDependencies.create()
run_config.environment.python.conda_dependencies.set_pip_requirements([
    'requests==2.26.0',
    'pandas==0.25.3',
    'numpy==1.19.2',
    'scikit-learn==0.22.1',
    'joblib==0.14.1',
    'azureml-defaults==1.43.0',
    'azureml-mlflow==1.43.0',
    'mlflow==1.28.0',
    'scipy==1.5.3'
])
run_config.environment.python.conda_dependencies.set_python_version('3.8.10')

#Register environment for reuse 
run_config.environment.register(ws)


{
    "assetId": "azureml://locations/eastus/workspaces/22675d30-4b65-4e39-8ce9-3f081650e29a/environments/sample_env/versions/8",
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:20220708.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "buildContext": null,
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "sample_env",
    "python": 

### Define Output Datasets
Below we define the configuration for datasets that will be passed between steps in our pipeline. In all cases we specify the datastore that should hold the dataset and register the dataset once the step completes. Registration can be disabled by removing the `register_on_complete()` call.

In [5]:
raw_data = OutputFileDatasetConfig(name='Raw_Data', destination=(default_ds, 'raw_data/{run-id}')).read_delimited_files().register_on_complete(name='Raw_Data')
training_data = OutputFileDatasetConfig(name='Training_Data', destination=(default_ds, 'training_data/{run-id}')).read_delimited_files().register_on_complete(name='Training_Data')
testing_data = OutputFileDatasetConfig(name='Testing_Data', destination=(default_ds, 'testing_data/{run-id}')).read_delimited_files().register_on_complete(name='Testing_Data')

### Define Pipeline Data
Pipeline data represents intermediate files/data in an Azure Machine Learning pipeline that need to be shuttled between steps. In our case we are fitting a MinMax scaler and training a classification model, which we intend to evaluate in a subsequent step and register if it performs better than the current model. In this scenario, a `PipelineData` object can be used to move these files between steps. [More details can be found here](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py).

In [6]:
# train_to_evaluate_pipeline_data = PipelineData(name='Training_Outputs', datastore=default_ds)

### Define Pipeline Parameters
`PipelineParameter` objects serve as variable inputs to an Azure ML pipeline and can be specified at runtime. Below we specify the fraction of data (0.0-1.0) that should be added to our testing dataset, the target column name, and the name and description under which the model is registered, and pass these as variable parameters into the pipeline at runtime.

In [7]:
testing_size = PipelineParameter(name='testing_size', default_value=0.3)
target_column = PipelineParameter(name='target_column', default_value='target')
model_name = PipelineParameter(name='model_name', default_value='iris-classification')
model_description = PipelineParameter(name='model_description', default_value='Scikit-Learn K-Neighbors Classifier for Iris Dataset')
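
Inside a step script, each `PipelineParameter` arrives as an ordinary command-line argument. The sketch below shows one way a script such as split_data.py might read the arguments passed by the split step. The function name `parse_args`, the range check, and the example paths are assumptions for illustration, since the actual scripts are not shown in this notebook.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments a split step might receive (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--training_data', type=str, help='Output path for training data')
    parser.add_argument('--testing_data', type=str, help='Output path for testing data')
    parser.add_argument('--testing_size', type=float, default=0.3,
                        help='Fraction of rows assigned to the testing dataset')
    args = parser.parse_args(argv)

    # Assumed validation: the fraction must leave rows on both sides of the split
    if not 0.0 < args.testing_size < 1.0:
        parser.error('--testing_size must be strictly between 0.0 and 1.0')
    return args


# Example invocation with placeholder paths
args = parse_args(['--training_data', '/tmp/train',
                   '--testing_data', '/tmp/test',
                   '--testing_size', '0.25'])
print(args.testing_size)
```

Passing `argv` explicitly keeps the parser easy to exercise outside a pipeline run; when `argv` is omitted, `argparse` falls back to the real command line.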

### Define Pipeline Steps
The pipeline below consists of five distinct steps, each of which executes an associated Python script located in the ./pipeline_step_scripts directory. First, get_data.py retrieves data from the registered blob datastore and registers it as the Raw_Data dataset. Next, split_data.py splits the raw data into training and testing datasets according to the `testing_size` parameter; both are subsequently registered. The training and testing datasets are then passed to train_model.py, which trains the iris classifier and computes and registers a set of metrics. After that, evaluate_and_register.py loads both the new model (challenger) and the current best model (champion) and evaluates them against the provided test dataset. Based on the `accuracy` metric, the challenger is registered in the workspace if it performs better than the champion or if no model has been registered to date. Finally, package_model.py packages the registered champion model in a container for deployment to target endpoints.

In [8]:
# Get raw data from AML-linked datastore
# Register tabular dataset after retrieval
get_data_step = PythonScriptStep(
    name='Get Data from Blob Storage',
    script_name='get_data.py',
    arguments =['--raw_data', raw_data],
    outputs=[raw_data],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)

# Load raw data and split into test and train
# datasets according to the specified split percentage
split_data_step = PythonScriptStep(
    name='Split Train and Test Data',
    script_name='split_data.py',
    arguments =['--training_data', training_data,
                '--testing_data', testing_data,
                '--testing_size', testing_size],
    inputs=[raw_data.as_input(name='Raw_Data')],
    outputs=[training_data, testing_data],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)

# Train iris classification model using split
# test and train datasets. Both the scaler and trained model
# will be saved as PipelineData
train_model_step = PythonScriptStep(
    name='Train Model',
    script_name='train_model.py',
    arguments =[
                '--target_column', target_column
    ],
    inputs=[training_data.as_input(name='Training_Data'),
            testing_data.as_input(name='Testing_Data')
           ],
    outputs=[],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)

#Evaluate and register model here
#Compare metrics from current model and register if better than current
#best model
evaluate_and_register_step = PythonScriptStep(
    name='Evaluate and Register Model',
    script_name='evaluate_and_register.py',
    arguments=[
               '--target_column', target_column,
               '--model_name', model_name,
               '--model_description', model_description],
    inputs=[training_data.as_input(name='Training_Data'),
            testing_data.as_input(name='Testing_Data')],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)
evaluate_and_register_step.run_after(train_model_step)

# Package model step
# Containerize the registered champion model for deployment to target
# endpoints
package_model_step = PythonScriptStep(
    name='Package Model',
    script_name='package_model.py',
    arguments=[
               '--model_name', model_name
    ],
    inputs=[testing_data.as_input(name='Testing_Data')],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)
package_model_step.run_after(evaluate_and_register_step)
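
The registration rule described for evaluate_and_register.py (register the challenger when it beats the champion on `accuracy`, or when no model has been registered yet) can be sketched in a few lines. This is a minimal illustration, not the contents of the actual script: the function name `should_register` is hypothetical, and treating a tie as "keep the champion" is an assumption.

```python
from typing import Optional


def should_register(challenger_accuracy: float,
                    champion_accuracy: Optional[float]) -> bool:
    """Decide whether the challenger model should replace the champion.

    champion_accuracy is None when no model has been registered yet.
    """
    if champion_accuracy is None:
        # No champion exists, so the first trained model is always registered
        return True
    # Assumption: a tie keeps the existing champion
    return challenger_accuracy > champion_accuracy


print(should_register(0.95, None))   # no champion registered yet
print(should_register(0.95, 0.93))   # challenger is better
print(should_register(0.93, 0.93))   # tie, champion is kept
```

Both accuracies should be computed on the same test dataset, as the step description states; otherwise the comparison is not like for like.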

### Create Pipeline
Create an Azure ML Pipeline by specifying the steps to be executed. Note: execution order follows the dataset dependencies between steps, so no step will execute until all of its input datasets have been generated.

In [9]:
pipeline = Pipeline(workspace=ws, steps=[get_data_step, split_data_step, train_model_step, evaluate_and_register_step, package_model_step])
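
To make the dependency-driven ordering concrete, the sketch below resolves an execution order for the five steps using a topological sort from the Python standard library (`graphlib`, Python 3.9+). It illustrates the idea only and is not how Azure ML schedules steps internally; the short step names and the dependency map are assumptions derived from the `inputs`, `outputs`, and `run_after` calls in the step definitions.

```python
from graphlib import TopologicalSorter

# Each key maps a step to the set of steps it depends on
dependencies = {
    'get_data': set(),
    'split_data': {'get_data'},                               # consumes Raw_Data
    'train_model': {'split_data'},                            # consumes train/test data
    'evaluate_and_register': {'split_data', 'train_model'},   # run_after(train_model)
    'package_model': {'split_data', 'evaluate_and_register'}, # run_after(evaluate)
}

# static_order() yields each step only after all of its dependencies
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because every step here depends on the one before it, only one valid order exists. In a pipeline with independent branches, several orders would be valid and those branches could run in parallel.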

### Optional: Trigger a Pipeline Execution from the Notebook
You can create an Experiment (a logical collection of runs) and submit a pipeline run directly from this notebook by running the commands below.

In [10]:
experiment = Experiment(ws, 'sample-training-pipeline-run')
run = experiment.submit(pipeline)
run.wait_for_completion(show_output=True)

Created step Get Data from Blob Storage [d2e838d9][1e42476a-6796-4adc-9ac8-4f457b3274b2], (This step will run and generate new outputs)
Created step Split Train and Test Data [c8869eb1][0b427a66-3f17-4460-b34c-803654fd589e], (This step will run and generate new outputs)
Created step Train Model [0d5dd162][f9be0a6a-0473-4b04-aa63-46df1332cb8b], (This step will run and generate new outputs)
Created step Evaluate and Register Model [b0e3a96a][a3a41539-47c0-40ec-9a9f-55860e6135cd], (This step will run and generate new outputs)
Created step Package Model [39c0675f][81cc101a-72b0-4564-8082-72c52df83e28], (This step will run and generate new outputs)

Submitted PipelineRun defd61c6-656e-4f9e-8f33-f22e8600a0ec
Link to Azure Machine Learning Portal: https://ml.azure.com/runs/defd61c6-656e-4f9e-8f33-f22e8600a0ec?wsid=/subscriptions/f3e38aaa-dd9c-4f17-95c1-ef3ff472da61/resourcegroups/acuity-secure-aml-rg/workspaces/mlw-actynwk-hhya&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
PipelineRunId: defd61c6-6

'Finished'

### Create a Published PipelineEndpoint
Once we have created our pipeline we will look to retrain our model periodically as new data becomes available. By publishing our pipeline to a `PipelineEndpoint` we can iterate on our pipeline definition but maintain a consistent REST API endpoint. 

In [11]:
# from azureml.pipeline.core import PipelineEndpoint

# def published_pipeline_to_pipeline_endpoint(
#     workspace,
#     published_pipeline,
#     pipeline_endpoint_name,
#     pipeline_endpoint_description="Endpoint to my pipeline",
# ):
#     try:
#         pipeline_endpoint = PipelineEndpoint.get(
#             workspace=workspace, name=pipeline_endpoint_name
#         )
#         print("using existing PipelineEndpoint...")
#         pipeline_endpoint.add_default(published_pipeline)
#     except Exception as ex:
#         print(ex)
#         # create PipelineEndpoint if it doesn't exist
#         print("PipelineEndpoint does not exist, creating one for you...")
#         pipeline_endpoint = PipelineEndpoint.publish(
#             workspace=workspace,
#             name=pipeline_endpoint_name,
#             pipeline=published_pipeline,
#             description=pipeline_endpoint_description
#         )


# pipeline_endpoint_name = 'Classification Model Training Pipeline'
# pipeline_endpoint_description = 'Sample pipeline for training, evaluating, and registering a classification model based on the Iris Setosa dataset'

# published_pipeline = pipeline.publish(name=pipeline_endpoint_name,
#                                      description=pipeline_endpoint_description,
#                                      continue_on_step_failure=False)

# published_pipeline_to_pipeline_endpoint(
#     workspace=ws,
#     published_pipeline=published_pipeline,
#     pipeline_endpoint_name=pipeline_endpoint_name,
#     pipeline_endpoint_description=pipeline_endpoint_description
# )

### Sample Pipeline Trigger Request (REST API)
You can trigger your published pipeline remotely by making an authenticated call to the PipelineEndpoint's REST API. The sample request code below requires creation of a service principal and assignment of that SP to your AML workspace as a Contributor ([more details here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication)). When triggering a pipeline using a REST API call, you are required to provide an Experiment Name and can optionally update the default pipeline parameter values.