# AutoML: Train "the best" classifier model - With end-to-end MLFlow model experience.

## Project Setup

_**Make sure you use a compute instance that has the same sklearn version as the AutoML sklearn version.
This is important when we work with fitting the explainer object.**_


# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1. Import the required libraries

In [None]:
# Import required libraries
from azure.identity import DefaultAzureCredential
from azure.identity import AzureCliCredential
from azure.ai.ml import automl, Input, MLClient
import mltable

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.automl import (
    classification,
    ClassificationPrimaryMetrics,
    ClassificationModels,
)

# Entities used later for model registration and deployment
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    ProbeSettings,
)
from azure.ai.ml.constants import ModelType

import shap 
import numpy as np
import joblib
import json
import pandas as pd 


## 1.2. Workspace details

To connect to a workspace, we need its identifying parameters: a subscription, a resource group and a workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](../../configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

By default, we try to use the default workspace configuration (available out-of-the-box in Compute Instances) or any `config.json` file you might have copied into the folder structure.
If no `config.json` is found, you need to manually provide the `subscription_id`, `resource_group` and `workspace_name` when creating the `MLClient`.

In [None]:
credential = DefaultAzureCredential()
ml_client = None

try:
    ml_client = MLClient.from_config(credential=credential)
except Exception as ex:
    # NOTE: Update the following workspace information if it has not been configured correctly
    client_config = {
        "subscription_id": "<>",
        "resource_group": "<>",
        "workspace_name": "<>",
    }
    if client_config["subscription_id"].startswith("<"):
        print(
            "Please update subscription_id, resource_group and workspace_name in this notebook cell"
        )
        raise ex
    else:  # write and reload from config file
        import json, os
        config_path = "../azureml/config.json"
        os.makedirs(os.path.dirname(config_path), exist_ok=True)
        with open(config_path, "w") as fo:
            fo.write(json.dumps(client_config))
        ml_client = MLClient.from_config(credential=credential, path=config_path)
print(ml_client)
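
As a minimal sketch of the fallback branch above: `MLClient.from_config` reads a JSON file containing the three workspace identifiers. The snippet below only writes and re-reads such a file with the standard library, so no Azure call is made. The helper name `write_workspace_config` and the placeholder values are assumptions for illustration.

```python
import json
import os
import tempfile


def write_workspace_config(config_dir, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys used in the cell above (hypothetical helper)."""
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    with open(config_path, "w") as fo:
        json.dump(config, fo, indent=2)
    return config_path


# Placeholder values, not real identifiers
demo_dir = tempfile.mkdtemp()
demo_path = write_workspace_config(
    demo_dir, "my-subscription-id", "my-resource-group", "my-workspace"
)

# Read the file back to confirm its structure
with open(demo_path) as fi:
    loaded_config = json.load(fi)
print(sorted(loaded_config))
```

A file written this way could then be passed to `MLClient.from_config` through its `path` argument, as the cell above does.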

# 2. Configure and run the AutoML classification job
In this section we will configure and run the AutoML classification job.

## 2.1 Configure the job through the classification() factory function

### classification() parameters:

The `classification()` factory function allows users to configure AutoML for the classification task for the most common scenarios with the following properties.

- `target_column_name` - The name of the column to target for predictions. It must always be specified. This parameter is applicable to 'training_data', 'validation_data' and 'test_data'.
- `primary_metric` - The metric that AutoML will optimize for Classification model selection.
- `training_data` - The data to be used for training. It should contain both training feature columns and a target column. Optionally, this data can be split for segregating a validation or test dataset. 
You can use a registered MLTable in the workspace using the format '<mltable_name>:<version>', or you can use a local file or folder as an MLTable. For example, Input(mltable='my_mltable:1') or Input(mltable=MLTable(local_path="./data")).
The parameter 'training_data' must always be provided.
- `compute` - The compute on which the AutoML job will run. In this example we are using a compute called 'cpu-cluster' present in the workspace. You can replace it with any other compute in the workspace.
- `name` - The name of the Job/Run. This is an optional property. If not specified, a random name will be generated.
- `experiment_name` - The name of the Experiment. An Experiment is like a folder with multiple runs in Azure ML Workspace that should be related to the same logical machine learning experiment.

### set_limits() parameters:
This is an optional configuration method to configure limits parameters such as timeouts.     
    
- timeout_minutes - Maximum amount of time in minutes that the whole AutoML job can take before the job terminates. This timeout includes setup, featurization and training runs but does not include the ensembling and model explainability runs at the end of the process since those actions need to happen once all the trials (children jobs) are done. If not specified, the default job's total timeout is 6 days (8,640 minutes). To specify a timeout less than or equal to 1 hour (60 minutes), make sure your dataset's size is not greater than 10,000,000 (rows times column) or an error results.

- trial_timeout_minutes - Maximum time in minutes that each trial (child job) can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
    
- max_trials - The maximum number of trials/runs each with a different combination of algorithm and hyperparameters to try during an AutoML job. If not specified, the default is 1000 trials. If using 'enable_early_termination' the number of trials used can be smaller.
    
- max_concurrent_trials - The maximum number of trials (children jobs) that will be executed in parallel. It's a good practice to match this number with the number of nodes in your cluster.
    
- enable_early_termination - Whether to enable early termination if the score is not improving in the short term. 
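
The limits described above interact with each other, so it can help to sanity-check them before submitting a job. The following is a minimal sketch and not part of the Azure ML SDK: `check_automl_limits` is a hypothetical helper. It encodes the dataset-size rule stated above for timeouts of 60 minutes or less, plus one assumption of ours (a single trial should not be allowed to outlive the whole job).

```python
def check_automl_limits(timeout_minutes, trial_timeout_minutes, n_rows, n_columns):
    """Return a list of warnings for a proposed set_limits() configuration.

    Hypothetical helper: it only encodes the rules described in the text above.
    """
    warnings = []
    dataset_size = n_rows * n_columns

    # Rule from the text: timeouts of 60 minutes or less require a dataset
    # of at most 10,000,000 cells (rows times columns).
    if timeout_minutes <= 60 and dataset_size > 10_000_000:
        warnings.append(
            f"timeout_minutes={timeout_minutes} is too short for {dataset_size} cells"
        )

    # Assumption (not a documented constraint): a trial should fit inside the job timeout.
    if trial_timeout_minutes > timeout_minutes:
        warnings.append("trial_timeout_minutes exceeds timeout_minutes")

    return warnings


# The limits used later in this notebook, with a made-up dataset shape
print(check_automl_limits(600, 20, n_rows=100_000, n_columns=50))
```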
    

In [None]:
# Create MLTables for training dataset
named_train_df = ml_client.data.get(name="<>", version="5")
named_test_df = ml_client.data.get(name="<>", version="5")

# Remote MLTable definition
my_training_data_input  = Input(type=AssetTypes.MLTABLE , path=named_train_df.path)# new SDK dataset artefact
my_testing_data_input  = Input(type=AssetTypes.MLTABLE, path=named_test_df.path)

In [None]:
# General job parameters
compute_name = "<>"
max_trials = 5
exp_name = "<>"
target = "<>"

In [None]:
# Create the AutoML classification job with the related factory-function.

classification_job = automl.classification(
    compute=compute_name,
    experiment_name=exp_name,
    training_data=my_training_data_input,
    validation_data=my_testing_data_input,
    target_column_name=target,
    primary_metric="accuracy",
    enable_model_explainability=True,
    tags={
        "use_case": "<>",
        "label": "<>",
    },
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600,
    trial_timeout_minutes=20,
    max_trials=max_trials,
    # max_concurrent_trials = 4,
    # max_cores_per_trial=-1,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    enable_onnx_compatible_models=True,
)

## 2.2 Run the job
Using the `MLClient` created earlier, we will now submit this AutoML job to the workspace.

In [None]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

### Wait until the AutoML job is finished
`ml_client.jobs.stream(returned_job.name)` streams the logs and waits until the specified job is finished.

In [None]:
ml_client.jobs.stream(returned_job.name)

In [None]:
print(returned_job.name)

# 3.A. Retrieve the Best Trial (Best Model's trial/run)
Use the `MlflowClient` to access the results (such as models, artifacts and metrics) of a previously completed AutoML trial.

## 3.1 Initialize MLFlow Client
The models and artifacts that are produced by AutoML can be accessed via the MLFlow interface. 
Initialize the MLFlow client here and set Azure ML as its tracking backend.

*IMPORTANT*, you need to have installed the latest MLFlow packages with:

    pip install azureml-mlflow

    pip install mlflow

### Obtain the tracking URI for MLFlow

In [None]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

In [None]:
# Set the MLFLOW TRACKING URI

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))

In [None]:
from mlflow.tracking.client import MlflowClient

# Initialize MLFlow client
mlflow_client = MlflowClient()

### Get the AutoML parent Job

In [None]:
job_name = returned_job.name

# Example of providing a specific Job name/ID
#job_name = "<>"

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(job_name)

In [None]:
# Print parent run tags. 'automl_best_child_run_id' tag should be there.
print(mlflow_parent_run.data.tags)

## 3.2 Get the AutoML best child run

In [None]:
# Get the best model's child run

#best_child_run_id = mlflow_parent_run.data.tags["automl_best_child_run_id"] # THIS IS THE BEST MODEL!!
best_child_run_id = "<>" #THIS IS NOT THE BEST MODEL BUT THE BEST TREE BASED MODEL, YOU CAN RETRIEVE IT FROM THE OVERVIEW PAGE OF THE MODEL
print("Found best child run id: ", best_child_run_id)

best_run = mlflow_client.get_run(best_child_run_id)
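
The choice made in the cell above (use AutoML's best child run, or override it with a manually chosen tree-based run) can be sketched as a small function over the parent run's tags. `pick_best_child_run_id` is a hypothetical helper, and the run ids in the example are made up.

```python
def pick_best_child_run_id(parent_tags, override_run_id=None):
    """Return the child run id to explain (hypothetical helper).

    parent_tags is the dict from mlflow_parent_run.data.tags. If override_run_id
    is given (for example the best tree-based model), it takes precedence.
    """
    if override_run_id:
        return override_run_id
    try:
        return parent_tags["automl_best_child_run_id"]
    except KeyError:
        raise KeyError(
            "The parent run has no 'automl_best_child_run_id' tag; has the job finished?"
        ) from None


# Made-up tags and run ids, for illustration only
example_tags = {"automl_best_child_run_id": "parent_run_3"}
print(pick_best_child_run_id(example_tags))
print(pick_best_child_run_id(example_tags, override_run_id="parent_run_1"))
```

The override matters here because `shap.TreeExplainer`, used later in this notebook, only supports tree-based models.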

## 3.3 Get best model run's metrics

Access the results (such as Models, Artifacts, Metrics) of a previously completed AutoML Run.

In [None]:
run_metrics = best_run.data.metrics #returns dict

## 3.4 Download the best model locally

Download the best run's `outputs` folder, which contains the trained model and the MLFlow model artifacts, to a local directory.

In [None]:
import os

# Create local folder
local_dir = "../artifact_downloads"
if not os.path.exists(local_dir):
    os.mkdir(local_dir)

In [None]:
# Download run's artifacts/outputs
local_path = mlflow_client.download_artifacts(
    best_run.info.run_id, "outputs", local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

In [None]:
# Show the contents of the MLFlow model folder
os.listdir(f"{local_dir}/outputs/mlflow-model")

# 3.B. Fit Shap Explainer 

In [None]:
model_local = joblib.load(f"{local_dir}/outputs/model.pkl")

#AutoML mapping for feature names after transformation
with open(f"{local_dir}/outputs/engineered_feature_names.json") as f:
    feature_names = json.load(f)

In [None]:
test_df_local = pd.read_csv("./data.csv")
X_test = test_df_local.drop(columns=[target])

## Prepare data for explainer

We need to apply the pre-processing steps of the AutoML pipeline before passing the data to the SHAP explainer.

In [None]:
# preprocess the data using the automl transformer object
X_test_proc_1 = model_local[0].transform(X_test)
X_test_proc = model_local[1].transform(X_test_proc_1)

# Grab the model from the AutoML pipeline, we will use it to fit the SHAP explainer
model_new_ = model_local[2]  # first, isolate the model from the pipeline
model_new = model_new_.get_model() #then get the model out of the AutoML wrapper

## Fit SHAP explainer object

In [None]:
%%time

explainer = shap.TreeExplainer(model_new) # we are passing no background dataset to increase computational speed

## Test explainer

In [None]:
# WE NEED TO INDEX THE ROW FIRST AND THEN THE PREDICTED VALUE
n_row = 0 # ROW INDEX
sample_ = X_test.iloc[n_row:n_row+1]

predicted_probas = model_local.predict_proba(sample_).T.reset_index() #[0] #PREDICTED VALUE
idx = predicted_probas[0].idxmax()


In [None]:
# preprocess the data using the automl transformer object
X_test_proc_1 = model_local[0].transform(sample_)
X_test_proc = model_local[1].transform(X_test_proc_1)
shap_values = explainer.shap_values(X_test_proc)

In [None]:
shap.initjs() #this is needed to visualise the shap force plot

shap.force_plot(explainer.expected_value[idx], shap_values[idx][n_row], feature_names= feature_names, link="logit") #GET EXPLANATION FOR PREDICTED VALUE WITH THE HIGHEST PROBABILITY
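
To make the role of `link="logit"` concrete: SHAP values are additive, so the base value plus the sum of a row's SHAP values reproduces the model's raw output for that row. For tree models explained on their margin (log-odds) output, which is an assumption about the model selected here, the logit link maps that sum back to a probability through the sigmoid function. The numbers below are made up purely for illustration.

```python
import math


def sigmoid(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up values for one row and one class
base_value = -0.4                   # plays the role of explainer.expected_value[idx]
row_shap_values = [0.9, -0.2, 0.3]  # plays the role of shap_values[idx][n_row]

# Additivity: base value + contributions = raw model output (log-odds)
margin = base_value + sum(row_shap_values)

# With link="logit", the force plot displays this value on the probability scale
probability = sigmoid(margin)
print(f"log-odds: {margin:.3f}, probability: {probability:.3f}")
```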

## Copy explainer and model into the same folder for registration

In [None]:
# Create sub directory within artefacts
local_dir_models = "../artifact_downloads/outputs/models"
if not os.path.exists(local_dir_models):
    os.mkdir(local_dir_models)

# save the explainer object
filename_explainer = f"{local_dir_models}/explainer.pkl" # the scoring script loads the explainer under this file name
joblib.dump(explainer, filename=filename_explainer)

# save the full AutoML model pipeline
filename_model = f"{local_dir_models}/model.pkl"
joblib.dump(model_local, filename=filename_model)
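
Because the scoring script later expects both files in the same folder, a quick check before registration can catch a missing artefact early. This is a minimal sketch using a temporary folder. `missing_artifacts` is a hypothetical helper, and the file names mirror the ones saved above.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("model.pkl", "explainer.pkl")


def missing_artifacts(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir (hypothetical helper)."""
    folder = Path(model_dir)
    return [name for name in required if not (folder / name).is_file()]


# Demo on a temporary folder that contains only an empty stand-in for the model
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "model.pkl").write_bytes(b"")
print(missing_artifacts(demo_dir))
```

In the notebook this could be called as `missing_artifacts(local_dir_models)` right before creating the `Model` entity.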

## Register the model and explainer artefacts under the same registered model

In [None]:
model_explainer_name = "test-model-explainer-route-automl"
model = Model(
    path=local_dir_models,
    name=model_explainer_name,
    description="my sample mlflow model + explainer",
    type=AssetTypes.CUSTOM_MODEL
)

registered_model = ml_client.models.create_or_update(model)


# 4. Deploy Model

## 4.1 Create managed online endpoint

In [None]:
online_endpoint_name = "<>"

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is a sample online endpoint for AUTOML model",
    auth_mode="key",
    tags={
        "type": "automl",
        "use case": "<>",
        "label": "<>",
    },
)

In [None]:
ml_client.begin_create_or_update(endpoint).result()

## 4.2 Deploy Model & Explainer

## Write scoring file

The scoring script loads the AutoML model and the SHAP explainer in the `init` function. In the `run` function, each request is converted to a DataFrame and scored by the model. The request data is then pre-processed with the AutoML transformers and passed to the explainer, which returns the SHAP values for the class with the highest predicted probability.

In [None]:
print(local_dir_models)
os.listdir(local_dir_models)

Open the AutoML scoring file `../artifact_downloads/outputs/scoring_file_v_2_0_0.py` and copy the data sample definition.

In [None]:
%%writefile {local_dir}/outputs/custom_scoring_script_automl_explainer.py

import pandas as pd
import joblib
from azureml.core.model import Model
import scipy as sp
import shap
import sklearn.pipeline
from pathlib import Path
import os
import json

from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.numpy_parameter_type import NumpyParameterType
from inference_schema.parameter_types.pandas_parameter_type import PandasParameterType
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType

#COPY FROM AUTOML SCORING FILE scoring_file_v_2_0_0.py
data_sample = PandasParameterType({"": "..."})

input_sample = StandardPythonParameterType({'data': data_sample})

def init():

    global automl_model
    global scoring_explainer

    # AZUREML_MODEL_DIR points to the registered model; the artefacts were
    # registered from a folder named "models" that holds both pickle files
    model_dir = Path(os.getenv("AZUREML_MODEL_DIR")) / "models"
    print(os.listdir(model_dir))
    automl_model = joblib.load(f"{model_dir}/model.pkl")
    scoring_explainer = joblib.load(f"{model_dir}/explainer.pkl")

@input_schema('Inputs', input_sample)
def run(Inputs):
    ###-----------------
    # read data and convert to original schema
    data = pd.DataFrame(Inputs["data"])

    pred_probs = automl_model.predict_proba(data).T.reset_index() #PREDICTED VALUE
    pred_probs.columns = ["class", "probability"]
    idx = pred_probs["probability"].idxmax()  # INDEX OF THE MAXIMUM PREDICTED PROBABILITY
    
    pred_label  = pred_probs.iloc[idx].astype("string")[0]
    pred_probs = pred_probs.to_dict(orient="records")


    # PREPARE THE DATA FOR SHAP EXPLAINER
    # preprocess the data using the automl transformer object
    X_test_proc_1 = automl_model[0].transform(data)
    X_test_proc = automl_model[1].transform(X_test_proc_1)
    shap_values = scoring_explainer.shap_values(X_test_proc)
    shap_values_idx = shap_values[idx][0].astype('float64',casting='same_kind')
    base_value = scoring_explainer.expected_value[idx].astype('float64',casting='same_kind')

    return {'pred_label': pred_label,
    "pred_probs":pred_probs, 
    'shap_values_idx': shap_values_idx.tolist(), 
    "Shap base value" : base_value.tolist()}
    

### Configure environment

In [None]:
# Include the correct shap version by checking with current shap version used to fit explainer locally
print(shap.__version__)
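
The version check above can be taken one step further by comparing the locally installed version with the version pinned in the conda file. The sketch below is a minimal, standard-library-only illustration. `pinned_version` and `installed_version` are hypothetical helpers, and the regular expression only handles simple `name==version` or `name=version` entries.

```python
import re
from importlib import metadata


def pinned_version(conda_yaml_text, package):
    """Return the version pinned for package in a conda YAML text, or None (hypothetical helper)."""
    pattern = rf"^\s*-\s*{re.escape(package)}\s*==?\s*([\w.]+)\s*$"
    match = re.search(pattern, conda_yaml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def installed_version(package):
    """Return the locally installed version of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Excerpt of the pins used in this notebook's environment file
sample_yaml = """
dependencies:
- python=3.8.13
- shap==0.39.0
- scikit-learn==0.22.1
"""

for pkg in ("shap", "scikit-learn"):
    print(pkg, "pinned:", pinned_version(sample_yaml, pkg), "installed:", installed_version(pkg))
```

A mismatch between the two values for `shap` or `scikit-learn` is a likely cause of unpickling errors at deployment time.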

Write a new .yml environment file. Make sure you use Python 3.8 or higher to prevent an error in unpickling the shap explainer.

In [None]:
%%writefile {local_dir}/outputs/conda_env_custom.yml

# Conda environment specification. The dependencies defined in this file will
# be automatically provisioned for runs with userManagedDependencies=False.

# Details about the Conda environment file format:
# https://conda.io/docs/user-guide/tasks/manage-environments.html#create-env-file-manually

name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.8 and later.
- python=3.8.13

- pip:
  - azureml-train-automl-runtime==1.46.1
  - inference-schema
  - azureml-interpret==1.46.0
  - azureml-defaults==1.46.0
- numpy==1.21.6
- pandas==1.1.5
- scikit-learn==0.22.1
- py-xgboost==1.3.3
- holidays==0.10.3
- psutil==5.9.0
- shap==0.39.0
- numba==0.55.2
channels:
- anaconda
- conda-forge


In [None]:
env = Environment(
    name="automl-tabular-env",
    description="environment for automl inference",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file="../artifact_downloads/outputs/conda_env_custom.yml",
)

### Deploy

In [None]:
code_configuration = CodeConfiguration(
        code=f"{local_dir}/outputs", scoring_script="custom_scoring_script_automl_explainer.py"
        )

deployment = ManagedOnlineDeployment(
    name=online_endpoint_name, 
    endpoint_name=online_endpoint_name,
    model= registered_model,
    environment=env,
    code_configuration=code_configuration,
    instance_type="Standard_DS2_V2",
    instance_count=1,
)

ml_client.online_deployments.begin_create_or_update(deployment).result()

In [None]:
# automl deployment to take 100% traffic
endpoint.traffic = {online_endpoint_name: 100}
ml_client.begin_create_or_update(endpoint).result()

### Test the deployment

In [None]:
%%time

for i in range(50):

    #select samples to predict
    tt = X_test.iloc[i:i+1].astype(str)

    #define schema to pass into the payload 
    df_schema = dict(zip(tt.columns, tt.dtypes.values.astype("str")))

    #construct the payload object to call the API
    payload = {"Inputs": {"data": tt.to_dict('records')}, "schema": df_schema}
    #payload = json.dumps(payload)

    request_file_name = "sample-request.json"

    with open(request_file_name, "w") as request_file:
        json.dump(payload, request_file)

    resp = ml_client.online_endpoints.invoke(
        endpoint_name=online_endpoint_name,
        deployment_name=online_endpoint_name,
        request_file=request_file_name)
    
    print("Finished request # ", i)
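
The request body built in the loop above has a simple shape: a list of records under `Inputs.data`, which is what the scoring script's `run` function reads. The sketch below builds the same structure from a plain dictionary instead of a DataFrame. `build_payload` is a hypothetical helper, and the column names are made up.

```python
import json


def build_payload(record):
    """Wrap one row (a dict of column name -> value) in the structure read by run()."""
    # Values are cast to str, mirroring .astype(str) in the loop above
    row = {column: str(value) for column, value in record.items()}
    return {"Inputs": {"data": [row]}}


# Made-up feature names and values
example_record = {"feature_a": 1.5, "feature_b": "blue"}
payload_json = json.dumps(build_payload(example_record))
print(payload_json)
```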

### Consume response output

In [None]:
shap.initjs() #this is needed to visualise the shap force plot

# invoke() returns the response body as a JSON string, so parse it before indexing
resp_dict = json.loads(resp)

shap.force_plot(resp_dict["Shap base value"], np.array(resp_dict["shap_values_idx"]), feature_names= feature_names, link="logit") #GET EXPLANATION FOR PREDICTED VALUE WITH THE HIGHEST PROBABILITY
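
The response can also be consumed without a plot, for example by ranking features by the absolute size of their SHAP value. The sketch below assumes the endpoint returns a JSON string with the keys produced by the scoring script (`pred_label`, `pred_probs`, `shap_values_idx`, `Shap base value`). `top_features` is a hypothetical helper, and the response text and feature names are made up for illustration.

```python
import json


def top_features(response_text, feature_names, k=3):
    """Return the predicted label and the k features with the largest absolute SHAP value."""
    result = json.loads(response_text)
    pairs = zip(feature_names, result["shap_values_idx"])
    ranked = sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)
    return result["pred_label"], ranked[:k]


# Made-up response with the same keys as the scoring script's return value
example_response = json.dumps(
    {
        "pred_label": "1",
        "pred_probs": [
            {"class": "0", "probability": 0.2},
            {"class": "1", "probability": 0.8},
        ],
        "shap_values_idx": [0.10, -0.75, 0.40],
        "Shap base value": -0.3,
    }
)
example_features = ["feature_a", "feature_b", "feature_c"]

label, ranked = top_features(example_response, example_features, k=2)
print(label, ranked)
```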

### Get endpoint details

In [None]:
# Get the details for online endpoint
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

# existing traffic details
print(endpoint.traffic)

# Get the scoring URI
print(endpoint.scoring_uri)

In [None]:
# Delete the deployment & endpoint
ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()