## Azure Machine Learning - Model Training Pipeline
This notebook demonstrates creation and execution of an Azure ML pipeline designed to load data from an AML-linked blob storage account, split the data into testing and training subsets, train a classification model, and evaluate and register the model. For the final evaluation step a champion vs. challenger A/B test is performed using a target metric of interest so that the best performing model is always reflected in the registry.

Note: This notebook builds from the Iris Setosa sample dataset available in Scikit-Learn.

### Import Required Packages

In [None]:
# Import required packages
from azureml.core import Workspace, Experiment, Datastore, Environment, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.core import Pipeline, PipelineParameter, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.output_dataset_config import OutputTabularDatasetConfig, OutputDatasetConfig, OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.data.data_reference import DataReference
from azureml.data.sql_data_reference import SqlDataReference
from azureml.pipeline.steps import DataTransferStep
import logging
import os

### Connect to Azure ML Workspace, Provision Compute Resources, and get References to Datastores
Connect to the workspace using the associated config file. Get a reference to your pre-existing AML compute cluster or provision a new cluster to facilitate processing. Finally, get a reference to your default blob datastore.

In [None]:
# Connect to AML Workspace
ws=None
try:
    ws = Workspace.from_config()
except Exception:
    # Fall back to environment variables when no config file is available
    ws = Workspace(subscription_id=os.getenv('SUBSCRIPTION_ID'),
                   resource_group=os.getenv('RESOURCE_GROUP'),
                   workspace_name=os.getenv('WORKSPACE_NAME'))

#Select AML Compute Cluster
cpu_cluster_name = 'cpucluster'

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found an existing cluster, using it instead.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D3_V2',
                                                           min_nodes=1,
                                                           max_nodes=1)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
    cpu_cluster.wait_for_completion(show_output=True)
    
#Get default datastore
default_ds = ws.get_default_datastore()

### Create Run Configuration
The `RunConfiguration` defines the environment used across all Python steps. You can optionally add further conda or pip packages to your environment. [More details here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py).

Here, we also register the environment to the AML workspace so that it can be used for future retraining and inferencing operations.

In [None]:
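run_env = Environment.get(ws, 'TF_Autoencoder_Env')
# Read the Dockerfile with a context manager so the file handle is closed
with open('./Dockerfile', 'r') as dockerfile:
    run_env.docker.base_dockerfile = dockerfile.read()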

run_config = RunConfiguration()
run_config.docker.use_docker = True
run_config.environment  = run_env

# Register environment for reuse
run_config.environment.register(ws)

### Define Output Datasets
Below we define the configuration for the datasets that are passed between steps in our pipeline. In each case we specify the datastore and path that should hold the dataset, and we register the dataset once the step completes. Registration can be disabled by removing the `register_on_complete()` call.

In [None]:
raw_data = OutputFileDatasetConfig(name='Raw_Data', destination=(default_ds, 'raw_data/{run-id}')).read_delimited_files().register_on_complete(name='Raw_Data')
training_data = OutputFileDatasetConfig(name='Training_Data', destination=(default_ds, 'training_data/{run-id}')).read_delimited_files().register_on_complete(name='Training_Data')
validation_data = OutputFileDatasetConfig(name='Validation_Data', destination=(default_ds, 'validation_data/{run-id}')).read_delimited_files().register_on_complete(name='Validation_Data')
testing_data = OutputFileDatasetConfig(name='Testing_Data', destination=(default_ds, 'testing_data/{run-id}')).read_delimited_files().register_on_complete(name='Testing_Data')

### Define Pipeline Data
Pipeline data represents intermediate files/data in an Azure Machine Learning pipeline that need to be shuttled between steps. In our case we are fitting a MinMax scaler and training a model which we intend to evaluate in a subsequent step, and register if it performs better than the current model. In this scenario, we use a `PipelineData` object to move these files between steps. [More details can be found here](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py).

In [None]:
scale_to_train_pipeline_data = PipelineData(name='SplitScale_Outputs', datastore=default_ds)
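
As a rough illustration of how a fitted scaler might travel between steps through a `PipelineData` directory, the sketch below fits a min-max scaler in plain Python, writes it to an output folder, and reloads it. The function names and the `scaler.json` file name are hypothetical; the actual step scripts are not shown in this notebook and may well use scikit-learn and a different serialization format.

```python
import json
import os
import tempfile


def fit_min_max(rows):
    """Compute the per-column minimum and maximum from a list of numeric rows."""
    columns = list(zip(*rows))
    return {"min": [min(c) for c in columns], "max": [max(c) for c in columns]}


def transform(rows, scaler):
    """Scale each value to [0, 1]; constant columns map to 0.0."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, lo, hi in zip(row, scaler["min"], scaler["max"]):
            span = hi - lo
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


def save_scaler(scaler, output_dir):
    """Write the scaler into the directory provided for the step output."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "scaler.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump(scaler, f)
    return path


def load_scaler(input_dir):
    """Read the scaler back in a downstream step."""
    with open(os.path.join(input_dir, "scaler.json")) as f:
        return json.load(f)


# A temporary directory stands in for the path passed via --splitscale_outputs
with tempfile.TemporaryDirectory() as tmp_dir:
    train_rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
    save_scaler(fit_min_max(train_rows), tmp_dir)
    restored = load_scaler(tmp_dir)
    scaled_rows = transform(train_rows, restored)
```

The key point is that the producing step only writes files into the directory it is given, and the consuming step only reads from the directory it is given; the pipeline takes care of moving the data between them.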

### Define Pipeline Parameters
`PipelineParameter` objects serve as variable inputs to an Azure ML pipeline and can be specified at runtime. Below we specify the manufacturing site ID and the fraction of data (0.0-1.0) that should be added to our testing dataset, along with the model name and description, and pass these as variable parameters into the pipeline at runtime.

In [None]:
mfg_site_id = PipelineParameter(name='mfg_site_id', default_value='mfg001')
testing_size = PipelineParameter(name='testing_size', default_value=0.15)
model_name = PipelineParameter(name='model_name', default_value='mfg-001_anomaly-detector')
model_description = PipelineParameter(name='model_description', default_value='Tensorflow/Keras Autoencoder for Anomaly Detection of Water Pump Sensor Data')
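
Pipeline parameters reach a step script as ordinary command-line arguments. The sketch below shows one way a script such as `split_data.py` might parse them with `argparse`. The `parse_args` helper and its validation rule are assumptions for illustration; the real scripts in `./pipeline_step_scripts` are not shown here.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the split step receives on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", type=str)
    parser.add_argument("--testing_data", type=str)
    parser.add_argument("--validation_data", type=str)
    parser.add_argument("--splitscale_outputs", type=str)
    # Command-line values arrive as strings, so convert explicitly
    parser.add_argument("--testing_size", type=float, default=0.15)
    args = parser.parse_args(argv)
    # Assumed validation rule: the fraction must lie strictly between 0 and 1
    if not 0.0 < args.testing_size < 1.0:
        parser.error("--testing_size must be between 0.0 and 1.0 (exclusive)")
    return args


# Simulate the command line for a run that overrides testing_size
args = parse_args(["--training_data", "/tmp/train", "--testing_size", "0.2"])
```

Inside a real step, calling `parse_args()` with no argument reads `sys.argv`, which is how the values supplied at submission time would arrive.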

### Define Pipeline Steps
The pipeline below consists of three steps, each of which executes an associated Python script located in the `./pipeline_step_scripts` directory. First, we call `get_data.py` to retrieve data for the given `mfg_site_id` from the AML-linked datastore and register this dataset as `Raw_Data`. Next, `split_data.py` splits the raw data into training, validation, and testing datasets according to the `testing_size` parameter, registers all three, and writes its scaler outputs to `PipelineData`. Then, we pass the datasets and scaler outputs into a step that runs `train_model.py`, which trains the autoencoder. A final step running `evaluate_and_register.py`, which would load both the new model (challenger) and the current best model (champion), evaluate them on the test dataset, and register the challenger if it performs better on the `accuracy` metric or if no model has been registered to date, is not defined in this notebook.
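
To make the role of `testing_size` concrete, here is a minimal sketch of a fraction-based split in plain Python. The `split_rows` function, the `validation_size` argument, and the fixed seed are all assumptions; the notebook does not show how `split_data.py` sizes the validation set or whether it shuffles, and for time-ordered sensor data a chronological split may be more appropriate.

```python
import random


def split_rows(rows, testing_size, validation_size=0.0, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if not 0.0 <= testing_size + validation_size < 1.0:
        raise ValueError("testing_size + validation_size must be in [0.0, 1.0)")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * testing_size))
    n_val = int(round(len(shuffled) * validation_size))
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


rows = list(range(100))
train, validation, test = split_rows(rows, testing_size=0.15, validation_size=0.15)
```

With 100 rows this yields 70 training, 15 validation, and 15 testing rows, and every row lands in exactly one subset.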

In [None]:
# Get raw data from AML-linked datastore
# Register tabular dataset after retrieval
get_data_step = PythonScriptStep(
    name='Get Data from Delta Table',
    script_name='get_data.py',
    arguments=['--raw_data', raw_data, '--mfg_site_id', mfg_site_id],
    outputs=[raw_data],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=True,
    runconfig=run_config
)

# Load raw data and split into test and train
# datasets according to the specified split percentage
split_data_step = PythonScriptStep(
    name='Split Train and Test Data',
    script_name='split_data.py',
    arguments=['--training_data', training_data,
               '--testing_data', testing_data,
               '--validation_data', validation_data,
               '--testing_size', testing_size,
               '--splitscale_outputs', scale_to_train_pipeline_data],
    inputs=[raw_data.as_input(name='Raw_Data')],
    outputs=[training_data, testing_data, validation_data, scale_to_train_pipeline_data],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)

# Train the autoencoder using the split training, testing,
# and validation datasets, together with the scaler outputs
# passed from the split step as PipelineData
train_model_step = PythonScriptStep(
    name='Train Autoencoder',
    script_name='train_model.py',
    arguments=['--model_name', model_name, '--scaler_outputs', scale_to_train_pipeline_data],
    inputs=[training_data.as_input(name='Training_Data'),
            testing_data.as_input(name='Testing_Data'),
            validation_data.as_input(name='Validation_Data'),
            scale_to_train_pipeline_data.as_input('Scaler')
           ],
    compute_target=cpu_cluster,
    source_directory='./pipeline_step_scripts',
    allow_reuse=True,
    runconfig=run_config
)
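
The champion vs. challenger decision described earlier reduces to a small comparison. The sketch below is one possible form of it; `should_register` and the `higher_is_better` flag are hypothetical, and the metric values are made up purely to exercise the function.

```python
def should_register(challenger_metric, champion_metric=None, higher_is_better=True):
    """Return True when the challenger should replace the champion.

    A champion_metric of None means no model has been registered yet,
    in which case the challenger is always registered.
    """
    if champion_metric is None:
        return True
    if higher_is_better:
        return challenger_metric > champion_metric
    return challenger_metric < champion_metric


decisions = [
    should_register(0.93, None),                            # no champion yet
    should_register(0.93, 0.95),                            # challenger is worse
    should_register(0.012, 0.020, higher_is_better=False),  # lower error wins
]
```

The `higher_is_better` flag is included because a metric such as `accuracy` improves upward, whereas an error-based metric, which may suit an autoencoder better, improves downward.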

### Create Pipeline
Create an Azure ML Pipeline by specifying the steps to be executed. Note: execution order follows the dataset dependencies between steps, so no step will execute until all of its input datasets have been generated.

In [None]:
pipeline = Pipeline(workspace=ws, steps=[get_data_step, split_data_step, train_model_step])

### Optional: Trigger a Pipeline Execution from the Notebook
You can create an Experiment (a logical collection of runs) and submit a pipeline run directly from this notebook by uncommenting and running the commands below.

In [None]:
# Experiment names may only contain letters, numbers, underscores, and dashes
# experiment = Experiment(ws, 'Model-Training-Pipeline')
# run = experiment.submit(pipeline)
# run.wait_for_completion(show_output=True)

### Create a Published PipelineEndpoint
Once we have created our pipeline we will look to retrain our model periodically as new data becomes available. By publishing our pipeline to a `PipelineEndpoint` we can iterate on our pipeline definition but maintain a consistent REST API endpoint. 

In [None]:
from azureml.pipeline.core import PipelineEndpoint

def published_pipeline_to_pipeline_endpoint(
    workspace,
    published_pipeline,
    pipeline_endpoint_name,
    pipeline_endpoint_description="Endpoint to my pipeline",
):
    try:
        pipeline_endpoint = PipelineEndpoint.get(
            workspace=workspace, name=pipeline_endpoint_name
        )
        print("using existing PipelineEndpoint...")
        pipeline_endpoint.add_default(published_pipeline)
    except Exception as ex:
        print(ex)
        # create PipelineEndpoint if it doesn't exist
        print("PipelineEndpoint does not exist, creating one for you...")
        pipeline_endpoint = PipelineEndpoint.publish(
            workspace=workspace,
            name=pipeline_endpoint_name,
            pipeline=published_pipeline,
            description=pipeline_endpoint_description
        )
    # Return the endpoint so callers can inspect or submit to it
    return pipeline_endpoint


pipeline_endpoint_name = 'Autoencoder (Anomaly Detection) Model Training Pipeline'
pipeline_endpoint_description = 'Sample pipeline for training, evaluating, and registering an autoencoder using manufacturing sensor data'

published_pipeline = pipeline.publish(name=pipeline_endpoint_name,
                                     description=pipeline_endpoint_description,
                                     continue_on_step_failure=False)

published_pipeline_to_pipeline_endpoint(
    workspace=ws,
    published_pipeline=published_pipeline,
    pipeline_endpoint_name=pipeline_endpoint_name,
    pipeline_endpoint_description=pipeline_endpoint_description
)
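
Once published, the endpoint can be triggered over REST with a JSON body that names the experiment and, optionally, overrides pipeline parameters. The sketch below only builds the request headers and body; the `build_submit_request` helper, the placeholder token, and the experiment name are assumptions, and the endpoint URL and Azure AD token must be obtained from your own workspace.

```python
import json


def build_submit_request(aad_token, experiment_name, parameter_overrides=None):
    """Build headers and a JSON body for triggering a published pipeline."""
    headers = {
        "Authorization": f"Bearer {aad_token}",
        "Content-Type": "application/json",
    }
    body = {
        "ExperimentName": experiment_name,
        # Keys must match the PipelineParameter names defined in the pipeline
        "ParameterAssignments": dict(parameter_overrides or {}),
    }
    return headers, json.dumps(body)


headers, body = build_submit_request(
    "<aad-token>",              # placeholder, not a real token
    "Model-Training-Pipeline",  # hypothetical experiment name
    {"mfg_site_id": "mfg001", "testing_size": 0.2},
)
# To submit, POST the body to the endpoint's REST URL, for example with the
# third-party requests package:
#   requests.post(endpoint_url, headers=headers, data=body)
```

Any parameter left out of `ParameterAssignments` should fall back to the `default_value` given when the `PipelineParameter` was defined.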