# Azure Machine Learning - AutoML Pipeline Sample
## 02 - AutoML Regression Training

This notebook shows how to use an AutoML regression task inside a pipeline by consuming the dataset we previously registered as `Regression_HousingData`. The pipeline consists of multiple steps: it first retrieves the raw training data, then splits it into train and test sets before AutoML training. We configure the AutoML job to produce an MLflow model for easy consumption downstream.

After training, we perform a champion vs. challenger evaluation by pitting our newly trained model against the current champion (if it exists) to see which model performs better against a common test dataset. If our new model performs better, the pipeline continues and the model is added to the registry. If not, the pipeline ends gracefully.
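
The decision rule described above can be sketched in a few lines. This is a minimal illustration, not the contents of `evaluate.py` (which this notebook does not show). It assumes both models are scored with R² on the shared test set, that a missing champion means the challenger is promoted, and that ties keep the existing champion. The function name `challenger_wins` is hypothetical.

```python
from typing import Optional


def challenger_wins(challenger_r2: float, champion_r2: Optional[float]) -> bool:
    """Return True when the newly trained model should replace the champion."""
    # Assumption: with no registered champion, the challenger is promoted.
    if champion_r2 is None:
        return True
    # Assumption: higher R² is better and a tie keeps the existing champion.
    return challenger_r2 > champion_r2


print(challenger_wins(0.81, None))  # first run, no champion registered yet
print(challenger_wins(0.81, 0.78))  # challenger scores higher
print(challenger_wins(0.75, 0.78))  # champion scores higher
```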

### Import required libraries

In [None]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input, command, Output
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.automl import regression
from azure.ai.ml.entities._job.automl.tabular import TabularFeaturizationSettings
from azure.ai.ml.entities import (
    AmlCompute,
    Environment,
)
from azure.ai.ml.constants import AssetTypes, InputOutputModes

### Get a handle to the AML workspace

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()
    
ml_client = MLClient.from_config(credential=credential)
ml_client

### Create a compute cluster for training if it does not exist

In [None]:
compute_name = 'cpucluster'

try:
    compute_target = ml_client.compute.get(compute_name)
except Exception:
    # Define the compute configuration
    cpu_compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="STANDARD_D3_V2",  # Specify the VM size
        min_instances=0,
        max_instances=4,
        idle_time_before_scale_down=120,  # seconds
    )

    # Create the compute cluster. begin_create_or_update returns a poller,
    # so wait on .result() to get the provisioned compute target.
    compute_target = ml_client.compute.begin_create_or_update(cpu_compute_config).result()

compute_target

### Define reusable environments

Below, we create environments for data preparation and for model evaluation. Each is defined by a Conda YAML file layered on a base Docker image, which ensures that our steps execute in Python runtimes containing all necessary dependencies.

In [None]:
dataprep_env = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environments/dataprep.yml",
    name="preprocessing-environment",
    description="Preprocessing environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(dataprep_env)

evaluation_env = Environment(
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environments/evaluation.yml",
    name="evaluation-environment",
    description="Evaluation environment created from a Docker image plus Conda environment.",
)
ml_client.environments.create_or_update(evaluation_env)


### Get a handle to the registered regression dataset

In [None]:
raw_dataset = ml_client.data.get("Regression_HousingData", label="latest")

### Build AutoML pipeline with steps for data splitting, model training, model evaluation, and model registration

All of these steps point to Python files included in this repository. These Python scripts execute in sequence and carry out all the steps needed to train and register a new regression model using AutoML.

Note: the AutoML configuration settings have been adjusted for rapid development. For production runs you will likely want to increase the number of trials that are run (`max_trials`).

In [None]:
# Define pipeline
@pipeline(
    description="AutoML Regression Pipeline",
)
def automl_regression():
    
    # define command function for preprocessing the data (retrieve and split)
    preprocessing_command_func = command(
        inputs=dict(
            raw_data=Input(
                path=raw_dataset.id,
                type=AssetTypes.URI_FILE,
                mode=InputOutputModes.RO_MOUNT,
            )
        ),
        outputs=dict(
            preprocessed_train_data=Output(type="mltable"),
            preprocessed_test_data=Output(type="mltable"),
        ),
        code="./preprocess.py",
        command="python preprocess.py "
        + "--raw_data ${{inputs.raw_data}} "
        + "--preprocessed_train_data ${{outputs.preprocessed_train_data}} "
        + "--preprocessed_test_data ${{outputs.preprocessed_test_data}}",
        environment="preprocessing-environment@latest",
        display_name='Get and Split Data'
    )
    preprocess_node = preprocessing_command_func()

    # define the AutoML regression task with AutoML function
    regression_node = regression(
        primary_metric="r2_score",
        target_column_name="MedHouseVal",
        training_data=preprocess_node.outputs.preprocessed_train_data,
        test_data=preprocess_node.outputs.preprocessed_test_data,
        featurization=TabularFeaturizationSettings(mode="auto"),
        # currently need to specify outputs "mlflow_model" explicitly to reference it in following nodes
        outputs={"best_model": Output(type="mlflow_model")},
        display_name='Train Models'
    )
    # set limits & training
    regression_node.set_limits(max_trials=4, max_concurrent_trials=4)
    regression_node.set_training(
        enable_stack_ensemble=False, enable_vote_ensemble=False, enable_model_explainability=True
    )
    
    # define command function for evaluating the newly trained model (champion v. challenger test)
    evaluate_func = command(
        inputs=dict(
            model_input_path=Input(type="mlflow_model"),
            model_base_name='HomePricePredictionModel',
            test_data=Input(type="mltable"),
            target_column='MedHouseVal'
        ),
        outputs=dict(
            comparison_metrics=Output(type="uri_folder"),
        ),
        code="./evaluate.py",
        command="python evaluate.py "
        + "--model_input_path ${{inputs.model_input_path}} "
        + "--model_base_name ${{inputs.model_base_name}} "
        + "--test_data ${{inputs.test_data}} "
        + "--target_column ${{inputs.target_column}} "
        + "--comparison_metrics ${{outputs.comparison_metrics}}",
        environment="evaluation-environment@latest",
        display_name='Evaluate Model (Champion vs. Challenger)'
    )
    evaluate_model = evaluate_func(
        test_data=preprocess_node.outputs.preprocessed_test_data,
        model_input_path=regression_node.outputs.best_model,
    )

    # define command function for registering the model
    command_func = command(
        inputs=dict(
            model_input_path=Input(type="mlflow_model"),
            model_base_name='HomePricePredictionModel',
            comparison_metrics=Input(type="uri_folder"),
        ),
        outputs=dict(
            registered_model_details=Output(type="uri_folder")
        ),
        code="./register.py",
        command="python register.py "
        + "--model_input_path ${{inputs.model_input_path}} "
        + "--model_base_name ${{inputs.model_base_name}} "
        + "--registered_model_details ${{outputs.registered_model_details}} "
        + "--comparison_metrics ${{inputs.comparison_metrics}}",
        environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:1",
        display_name='Register Model'
    )
    register_model = command_func(
        model_input_path=regression_node.outputs.best_model,
        comparison_metrics=evaluate_model.outputs.comparison_metrics,
    )

pipeline_regression = automl_regression()

# set pipeline level compute
pipeline_regression.settings.default_compute = compute_name
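
Each `command` string above passes its inputs and outputs to a script as command-line flags, with Azure ML substituting the `${{inputs...}}` and `${{outputs...}}` placeholders with resolved paths at run time. As a rough sketch of the receiving side, a script such as `preprocess.py` could parse those flags as follows. The flag names come from the command string; the `parse_args` helper and the example paths are hypothetical, and the actual script in the repository may differ.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags that the 'Get and Split Data' step passes to its script."""
    parser = argparse.ArgumentParser(description="Split raw data into train and test sets")
    parser.add_argument("--raw_data", type=str, required=True)
    parser.add_argument("--preprocessed_train_data", type=str, required=True)
    parser.add_argument("--preprocessed_test_data", type=str, required=True)
    return parser.parse_args(argv)


# Hypothetical paths standing in for the values Azure ML would inject.
args = parse_args([
    "--raw_data", "/mnt/inputs/raw_data/housing.csv",
    "--preprocessed_train_data", "/mnt/outputs/train",
    "--preprocessed_test_data", "/mnt/outputs/test",
])
print(args.raw_data)
print(args.preprocessed_train_data)
print(args.preprocessed_test_data)
```

Passing `argv` explicitly keeps the parser easy to exercise locally; when the script runs inside the pipeline, calling `parse_args()` with no argument reads `sys.argv` as usual.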

### Submit pipeline job

The cell below submits the pipeline job and streams its logs until the run completes.

In [None]:
# submit the pipeline job
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_regression, experiment_name="AutoML_Regression_Test"
)
pipeline_job

# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)
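
Once streaming finishes, it can be useful to see which steps actually ran, for example whether the registration step executed. The sketch below summarizes child job statuses using `jobs.list(parent_job_name=...)`. The `summarize_steps` helper, the `FakeJobs` stand-in, and the job name are hypothetical and exist only so the example runs without a workspace; in the notebook you would call `summarize_steps(ml_client.jobs, pipeline_job.name)` instead.

```python
from types import SimpleNamespace
from typing import Dict


def summarize_steps(jobs_client, pipeline_job_name: str) -> Dict[str, str]:
    """Map each child step's display name to its reported status."""
    children = jobs_client.list(parent_job_name=pipeline_job_name)
    return {child.display_name: child.status for child in children}


class FakeJobs:
    """Stand-in for ml_client.jobs that returns canned child jobs."""

    def list(self, parent_job_name):
        return [
            SimpleNamespace(display_name="Get and Split Data", status="Completed"),
            SimpleNamespace(display_name="Train Models", status="Completed"),
            SimpleNamespace(
                display_name="Evaluate Model (Champion vs. Challenger)",
                status="Completed",
            ),
            SimpleNamespace(display_name="Register Model", status="Completed"),
        ]


summary = summarize_steps(FakeJobs(), "hypothetical_pipeline_job_name")
for step, status in summary.items():
    print(f"{step}: {status}")
```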

### Confirm model registration

After the first run of the pipeline, a model should be added to your workspace model registry, since no champion exists yet for the new model to be compared against. You can visit the registry to confirm a new model has been added as shown below:

![Azure ML Model](img/aml_model.png "Registered Model")
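
Registration can also be checked from code. The sketch below looks up the highest registered version of a model through `models.list(name=...)`. It assumes `register.py` registers the model under exactly `HomePricePredictionModel` (the notebook only shows this as a base name, so the real name may differ) and that versions are numeric strings. The `latest_version` helper and the `FakeModels` stand-in are hypothetical; in the notebook you would pass `ml_client.models` instead.

```python
from types import SimpleNamespace
from typing import Optional


def latest_version(models_client, model_name: str) -> Optional[int]:
    """Return the highest numeric version registered under model_name, if any."""
    versions = [int(model.version) for model in models_client.list(name=model_name)]
    return max(versions) if versions else None


class FakeModels:
    """Stand-in for ml_client.models that returns canned model versions."""

    def __init__(self, versions):
        self._versions = versions

    def list(self, name):
        return [SimpleNamespace(name=name, version=v) for v in self._versions]


registered = latest_version(FakeModels(["1", "2", "10"]), "HomePricePredictionModel")
missing = latest_version(FakeModels([]), "HomePricePredictionModel")
print(registered)
print(missing)
```

Versions are converted to integers before comparison so that version "10" ranks above version "2", which a plain string comparison would get wrong.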