# Creating an ML pipeline

To take our first steps towards an MLOps approach for training and deploying our model, we need to create a [Machine Learning pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines) in Azure ML. 

In essence, a Machine Learning Pipeline is an independently executable workflow of a complete machine learning task. Each subtask (e.g., training the model, evaluating the trained model, etc.) is implemented as a step (or series of steps) within the pipeline.

In principle, you can do whatever you want within a pipeline, as pipeline steps can also include generic tasks such as running Python scripts. However, we would recommend that pipelines focus on specific machine learning tasks such as:

* Data preparation (validating, cleaning, munging data, etc.)
* Training and validating machine learning model(s)
* Deploying trained machine learning models

## Getting started

In this exercise, we would like to build a pipeline that includes the following steps:

* Preprocess (which does some simple preprocessing on the dataset)
* Train (which takes training data and produces a model)
* Evaluate (which logs metrics/parameters for the trained model)
* Register (which registers the trained model in Azure ML)

For each of these steps, we also provide a script (entrypoint) in the model. You can find these scripts in the model/scripts directory. Note that you can also run the scripts locally if you want to try them out!

For example, you can train the model locally using:

   `python model/scripts/train.py --input_dir ../../datasets/titanic --input_file train.csv --model_dir outputs`
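
The command above suggests that the entrypoint scripts take their paths as command-line flags. As a minimal sketch (the actual `train.py` may define its arguments differently; only the three flag names are taken from the command above, while the `parse_args` helper, the defaults, and the help texts are assumptions), such an entrypoint could parse them with `argparse`:

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the flags used in the example command (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Train the model.")
    parser.add_argument("--input_dir", type=Path, required=True,
                        help="Directory containing the training data.")
    # The defaults below are assumptions, not taken from the real script.
    parser.add_argument("--input_file", default="train.csv",
                        help="Name of the training file inside input_dir.")
    parser.add_argument("--model_dir", type=Path, default=Path("outputs"),
                        help="Directory to write the trained model to.")
    return parser.parse_args(argv)


# Same flags as in the command above, passed as a list instead of via the shell.
args = parse_args([
    "--input_dir", "../../datasets/titanic",
    "--input_file", "train.csv",
    "--model_dir", "outputs",
])
train_path = args.input_dir / args.input_file
print(train_path)
```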

## Building the initial pipeline

To build our ML pipeline, we have to do the following:

* Specify the steps to include in the pipeline
* Define dependencies between steps using inputs/outputs
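
To illustrate how dependencies follow from inputs/outputs: a step has to run after whichever step produces its inputs. The sketch below is plain Python rather than the Azure ML API. The step and data names mirror this exercise, but the `steps` dictionary, the wiring of the *evaluate* step, and the `execution_order` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step declares the data it consumes and produces (illustrative wiring).
steps = {
    "preprocess": {"inputs": ["train_data"], "outputs": ["preprocessed_data"]},
    "train": {"inputs": ["preprocessed_data"], "outputs": ["model_data"]},
    "evaluate": {"inputs": ["model_data"], "outputs": []},
}


def execution_order(steps):
    """Order steps so that every step runs after the producers of its inputs."""
    # Map each piece of data to the step that produces it.
    producer = {
        output: name for name, spec in steps.items() for output in spec["outputs"]
    }
    # Inputs without a producer (e.g. train_data) come from outside the pipeline.
    graph = {
        name: {producer[i] for i in spec["inputs"] if i in producer}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
```

Nothing in this sketch states the order explicitly; it falls out of which step consumes which output, which is the same idea behind the inputs/outputs of the pipeline steps.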

Besides this, we need to configure each step to tell Azure ML what to run and where. For a PythonScriptStep, this includes the following configuration options:

* RunConfig (= environment to run in terms of Python dependencies, etc.)
* Source + entrypoints (= code to run in the step)
* Compute target (= compute resource to run on, e.g., a VM cluster, Databricks, etc.)
* Inputs/outputs (= datasets/paths to use for inputs/outputs)

### Setting up a run config

Let's start by getting our ML workspace:

In [None]:
from azureml.core import Workspace

workspace = Workspace.from_config()

Once we have the workspace, we can set up the config for our steps by defining our RunConfig. This RunConfig essentially defines the runtime environment in which our steps will be run. In principle, you can define a different RunConfig for each step, but we'll stick to using one common environment for now.

To keep our RunConfig reproducible, we'll base it on our (model) environment.yml file:

In [None]:
from pathlib import Path

from azureml.core import RunConfiguration, Environment

model_dir = Path("../model")

run_config = RunConfiguration()
run_config.environment = Environment.from_conda_specification(
    "model-env", model_dir / "environment.yml"
)

### Setting up a compute target

Next, we'll set up our compute target using a helper function, which we'll similarly share across steps. This helper function essentially creates a cluster of VMs in Azure ML, which uses a specific type of VM and supports auto-scaling between a min/max number of nodes:

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException


def _get_or_create_cluster(
    workspace,
    name,
    vm_size="STANDARD_D2_V2",
    vm_priority="lowpriority",
    min_nodes=0,
    max_nodes=4,
    idle_seconds_before_scaledown=300,
    wait=False,
):
    """Helper function for creating a cluster of VMs as compute target."""

    try:
        # pylint: disable=abstract-class-instantiated
        target = ComputeTarget(workspace=workspace, name=name)
        print("Using existing cluster %s" % name)
    except ComputeTargetException: 
        print("Creating cluster %s" % name)
        config = AmlCompute.provisioning_configuration(
            vm_size=vm_size,
            vm_priority=vm_priority,
            min_nodes=min_nodes,
            max_nodes=max_nodes,
            idle_seconds_before_scaledown=idle_seconds_before_scaledown,
        )
        target = ComputeTarget.create(workspace, name, config)

        if wait:
            target.wait_for_completion(show_output=False)

    return target


compute_target = _get_or_create_cluster(workspace, name="my-cluster", wait=True)

### Defining data references

Now that we have our basic config set up, we also need to define where our data comes from. In this case, we'll use an instance of the [DataReference](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.data_reference.datareference?view=azure-ml-py) class to reference the data that we uploaded to our datastore:

In [None]:
from azureml.data.data_reference import DataReference

datastore = workspace.get_default_datastore()

train_data = DataReference(
    datastore=datastore,
    data_reference_name="train_data",
    path_on_datastore="titanic",
)

This allows our pipeline steps to reference the dataset on the datastore, so that Azure ML can fetch the dataset before running the step.

Besides this, we also need to have a place to store intermediate output data from our pipeline. This includes, for example, the preprocessed version of the train dataset produced by the *preprocess* step, as well as the model pickle produced by the *train* step.

We can create intermediate storage for our pipeline using the [PipelineData](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedata?view=azure-ml-py) class:

In [None]:
from azureml.pipeline.core import PipelineData

preprocessed_data = PipelineData(
    "preprocessed_data", 
    datastore=datastore,
)

model_data = PipelineData(
    "model_data", 
    datastore=datastore, 
    pipeline_output_name="model",
)

### Defining the pipeline + steps

Now, to finally start building our actual pipeline, we can define our first *preprocess* step:

In [None]:
from azureml.pipeline.steps import PythonScriptStep

preprocess_step = PythonScriptStep(
    name="preprocess",
    source_directory=str(model_dir),
    script_name="scripts/preprocess.py",
    arguments=["--input_dir", train_data, "--output_dir", preprocessed_data],
    inputs=[train_data],
    outputs=[preprocessed_data],
    compute_target=compute_target,
    runconfig=run_config,
    allow_reuse=False,
)

Note that this PythonScriptStep definition does the following:

* Names the step *'preprocess'*.
* Tells the step to use the code in our model directory and call the *scripts/preprocess.py* script.
* Defines the arguments to pass to the script, which reference our input data and pipeline storage.
* Defines the inputs/outputs, which point to our input data and pipeline storage (note: Azure ML uses these to determine the dependencies between steps).
* Sets the compute target + run config for the step.
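
When the step runs, Azure ML replaces the `train_data` and `preprocessed_data` objects in `arguments` with concrete paths on the compute target, so the script only sees ordinary directories. A minimal sketch of the work such a script could do with those two paths follows. It is not the actual `preprocess.py`: the `preprocess` function, the `train.csv` file name, and the rule of dropping rows with empty fields are assumptions for illustration. Creating the output directory explicitly is a defensive choice, since it may not exist yet when the script starts.

```python
import csv
import tempfile
from pathlib import Path


def preprocess(input_dir, output_dir, file_name="train.csv"):
    """Copy a CSV from input_dir to output_dir, skipping rows with empty fields."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # may not exist yet

    kept = 0
    with open(input_dir / file_name, newline="") as src, \
            open(output_dir / file_name, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # copy the header row unchanged
        for row in reader:
            if all(field.strip() for field in row):
                writer.writerow(row)
                kept += 1
    return kept


# Local demo: temporary directories stand in for the paths Azure ML would pass
# as the values of --input_dir and --output_dir.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "in"
    in_dir.mkdir()
    (in_dir / "train.csv").write_text("name,age\nAnn,29\nBob,\n")

    out_dir = Path(tmp) / "out" / "preprocessed"
    rows_kept = preprocess(in_dir, out_dir)
    result_lines = (out_dir / "train.csv").read_text().splitlines()

print(rows_kept, result_lines)
```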

Now that we have our first step, we can create an initial version of the pipeline as follows:

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(
    workspace=workspace,
    steps=[
        preprocess_step, 
    ],
)
pipeline.validate()

Note that the *validate* method only checks the pipeline definition for potential errors (such as unconnected inputs); it does not run any of the steps.

### Running the pipeline

To run our pipeline, we first need to publish it to Azure ML:

In [None]:
published_pipeline = pipeline.publish(
    name="my-first-pipeline",
    description="Description of my first pipeline.",
    continue_on_step_failure=False,
) 

print(published_pipeline)

Once published, you should be able to see the pipeline in your Azure ML workspace (under Pipelines > Endpoints). (Check if you can find your pipeline!) 

You can retrieve your published pipeline later using the following code:

In [None]:
# A published pipeline can be fetched by its ID (e.g. published_pipeline.id) as follows:
# from azureml.pipeline.core import PublishedPipeline
# published_pipeline = PublishedPipeline.get(workspace=workspace, id=pipeline_id)

We can trigger our pipeline by sending a POST request to the pipeline's HTTP endpoint. Note that we also need to define an experiment to run our pipeline in, as the experiment will be responsible for tracking model parameters etc.

To trigger our pipeline, you can perform the required request as follows:

In [None]:
import time

import requests

from azureml.pipeline.core import PipelineRun
from azureml.core.authentication import InteractiveLoginAuthentication


def _wait_for_run_completion(pipeline_run):
    """Helper function that waits for the pipeline to finish."""
    
    JOB_STATUS = {
        "not_started": {0, "NotStarted"},
        "running": {1, "Running"},
        "failed": {2, "Failed"},
        "cancelled": {3, "Cancelled"},
        "finished": {4, "Finished"},
    }
    
    print("Waiting for job to start...")
    status = pipeline_run.get_status()
    while status in JOB_STATUS["not_started"]:
        time.sleep(1)
        status = pipeline_run.get_status()

    if status in JOB_STATUS["running"]:
        print("Job started, waiting for completion...")
        while status in JOB_STATUS["running"]:
            time.sleep(1)
            status = pipeline_run.get_status()

    if status in JOB_STATUS["finished"]:
        print("Job finished successfully!")
    elif status in JOB_STATUS["failed"]:
        print("ERROR: Job failed!")
    elif status in JOB_STATUS["cancelled"]:
        print("WARNING: Job was cancelled.")
    else:
        raise ValueError(f"Unexpected status '{status}'")
        
    
# Define the experiment to run in.
experiment_name = "my-first-experiment"

# Get the required authentication token for the endpoint.
auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()

# Define parameters for our request.
request_payload = {
    "ExperimentName": experiment_name,
    "ParameterAssignments": {},
}

# Perform request + check response.
response = requests.post(published_pipeline.endpoint, headers=aad_token, json=request_payload)
response.raise_for_status()

run_id = response.json()["Id"]
print("Job ID: %s" % run_id)
 
# Retrieve the corresponding pipeline run and wait for it to finish.
pipeline_run = PipelineRun.get(workspace, run_id)
_wait_for_run_completion(pipeline_run) 

Running this should trigger a pipeline run and wait for it to finish. Note that you should be able to monitor the progress from the Azure ML interface as well (under Experiments).
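
The helper above polls until the run ends, however long that takes. As a hedged variation, the sketch below adds a timeout and a configurable polling interval. It uses a stand-in `get_status` callable instead of a real `PipelineRun`, so it runs without Azure ML. The `wait_for_status` name, the `TERMINAL_STATES` set (which mirrors the status strings used in the helper above), and the default timings are assumptions.

```python
import time

# Status strings mirror the helper above; treat them as assumptions.
TERMINAL_STATES = {"Finished", "Failed", "Cancelled"}


def wait_for_status(get_status, timeout=3600.0, poll_interval=10.0, sleep=time.sleep):
    """Poll get_status() until a terminal state is reached or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Run still in state '{status}' after {timeout} seconds")
        sleep(poll_interval)


# Demo with a fake run that reports a fixed sequence of statuses.
fake_statuses = iter(["NotStarted", "Running", "Running", "Finished"])
final_status = wait_for_status(
    lambda: next(fake_statuses), poll_interval=0, sleep=lambda _: None
)
print(final_status)
```

With a real run, this could be called as `wait_for_status(pipeline_run.get_status)`.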

## Building the rest of the pipeline

Now that we have a first step in place, we can build the rest of our pipeline. The idea is to build a pipeline that includes the following steps:

* Preprocess (which we already implemented)
* Train (which trains the model)
* Evaluate (which logs metrics/parameters for the model)
* Register (which registers the model in Azure ML)

Altogether, the pipeline should look something like this:

<img src="images/pipeline.png" width="500"/>

### Exercise

* Implement the above pipeline by adding extra PythonScriptSteps that implement each of the described steps.
* Try publishing the new pipeline and running the pipeline to train and register a new model.

In [None]:
%load answers/pipeline.py

Assignment: See if you can find the running pipeline in your Azure ML portal. 

### Extra exercises

- Try building the above example using the Dataset API (instead of using the *DataReference* class).
- Try using other, more specific pipeline steps (such as the sklearn or Estimator related steps).
- Try adding AutoML or hyperparameter optimization to the pipeline (using the AutoMLStep or HyperDriveStep, respectively).