## Leveraging Databricks Feature Store Along with AML Capabilities Using AML Pipelines

In this notebook, I demonstrate how to use Azure ML Datasets to support a data mesh strategy for model development across different compute targets, including Databricks and Azure ML, by using `Azure ML Pipelines`. This notebook extends the second [notebook](adbstep_and_automl.ipynb) by adding the Databricks Feature Store as the data source for the AutoML step and a model registration step as the final step.

<div style="text-align:center; width: 1000px"><img src="./assets/FeatureStorePipeline.jpg" /></div>

*The AML Pipeline Image*

To run this example, you need an AML Workspace with a Compute Cluster. You also need an Azure Databricks cluster with the **ML Runtime**, and the `azureml-sdk[databricks]` package must be installed on that cluster.

The overall idea is to create data lineage for the entire life cycle of a model, which starts with data processing and ends with model registration and deployment.
A simple training exercise was chosen so that the focus stays on the flow of the experiment.

In this example, there are two `DatabricksSteps`. The first prepares the base tables and feature tables with synthetic data, and the second prepares the training data by using the Databricks `FeatureLookup` capability. The prepared training and test datasets are then registered in the `Databricks FeatureStore` as well as an `AML Dataset`, and are passed to the `AutoMLStep` for training. Finally, the best model is registered as an AML Model in the AML Model Registry through a `PythonScriptStep`.

The first step generates and registers three base tables: `network`, `customer` and `ground_truth`. The base tables are then registered as feature tables in the `Databricks Feature Store` so that the next Databricks step can generate a training set from them. The final training and test sets are registered in both Databricks and AML, as feature tables and AML Datasets respectively. The final dataframes (training and test) are then saved as `Parquet` tables. Finally, the saved data is registered as an AML `TabularDataset` in `Parquet` file format.

Every time the DatabricksSteps are executed, two new AML TabularDatasets called `feature_network_fail_train` and `feature_network_fail_test` are generated and then passed to the AutoMLStep. If the `allow_reuse` parameter of the `DatabricksStep` constructor is set to `True` and the step is re-run with the same settings, the output datasets registered by the previous run are reused for the next step.
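
The two Databricks scripts are not reproduced in this notebook. As a minimal sketch of how such a script might read its inputs, the parser below accepts the flags that the second `DatabricksStep` passes through `python_script_params` later in this notebook. The function name `parse_step_args` and the sample values are hypothetical. `parse_known_args` is used on the assumption that the pipeline may append arguments the script does not declare, such as an output location named after the `output_train` pipeline output.

```python
import argparse


def parse_step_args(argv=None):
    """Parse the flags passed to the feature-preparation script (sketch)."""
    # allow_abbrev=False matters here: an undeclared "--output_train" would
    # otherwise be read as an abbreviation of "--output_train_feature_set_name".
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--ground-truth-tbl-name", dest="ground_truth_tbl_name", required=True)
    parser.add_argument("--output_datastore_name", required=True)
    parser.add_argument("--output_train_feature_set_name", required=True)
    parser.add_argument("--output_test_feature_set_name", required=True)
    # Collect undeclared arguments instead of exiting with an error.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# Hypothetical command line, including one argument the script does not declare.
args, unknown = parse_step_args([
    "--ground-truth-tbl-name", "ground_truth",
    "--output_datastore_name", "generalpurposeaccount",
    "--output_train_feature_set_name", "feature_network_fail_train",
    "--output_test_feature_set_name", "feature_network_fail_test",
    "--output_train", "dbfs:/tmp/output_train",
])
print(args.ground_truth_tbl_name, args.output_train_feature_set_name)
print("Undeclared arguments:", unknown)
```

Disabling abbreviations is a deliberate choice in this sketch, because several of the flag names share a common prefix.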

<div style="text-align:center; width: 500px"><img src="./assets/ADBStep_automl_ft.jpg" /></div>

*ADB Step details page; the input and output datasets.*

Below is the output dataset, which is registered in the Databricks Feature Store:

The registered `AML Dataset`s are passed to the subsequent `AutoMLStep`, which trains and tests the AutoML model. The data is read according to the incoming dataset type. Currently, AutoML supports CSV and Parquet files for tabular datasets; Delta is expected to be supported as an input type later.

<div style="text-align:center; width: 1000px"><img src="./assets/AutoMLStepFT.jpg" /></div>

*AML Step details page; the input and output datasets.*
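
To illustrate what "read according to the incoming dataset type" can mean in practice, the sketch below maps a file extension to the name of the parsing method that would be called on the intermediate pipeline dataset. This notebook itself calls `parse_parquet_files()` directly; the helper `parser_method_for` and the lookup table are hypothetical and only show how the choice between the CSV and Parquet parsers could be made explicit.

```python
import os

# Parsing method on the intermediate pipeline dataset, keyed by file extension.
PARSER_METHODS = {
    ".parquet": "parse_parquet_files",
    ".csv": "parse_delimited_files",
}


def parser_method_for(path):
    """Return the name of the parsing method for a data file path."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in PARSER_METHODS:
        raise ValueError("Unsupported tabular file type: {!r}".format(extension))
    return PARSER_METHODS[extension]


print(parser_method_for("output_train/part-0000.parquet"))
print(parser_method_for("output_train/data.csv"))
```

With a pipeline dataset such as `ds_step_1_train`, the result could be applied as `getattr(ds_step_1_train, parser_method_for(path))()`.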

Once the AutoMLStep has completed, the best model is passed to a subsequent step that registers it. To register the model, the `AML Dataset` objects (one for training and one for testing) are passed as parameters to the `Model.register` function. This links the model to the datasets that were used in the AutoML experiment.

<div style="text-align:center; width: 1000px"><img src="./assets/Model_AutoMLFT.jpg" /></div>

*Registered Model data tab; link to the `feature_network_fail_train` and `feature_network_fail_test` AML Datasets.*
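
The `register_model.py` script is not shown in this notebook. The following is a rough sketch of what such a script might contain, assuming it receives the four arguments that the `PythonScriptStep` passes later in this notebook. The function names are hypothetical, and the dataset key `featurized test data` is an assumption made for illustration; only `featurized training data` appears elsewhere in this notebook. The Azure ML imports are placed inside the function so that the argument parsing can be tried without the SDK installed.

```python
import argparse


def parse_register_args(argv=None):
    """Parse the arguments passed by the model registration step (sketch)."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--saved-model", dest="saved_model", required=True)
    parser.add_argument("--model-name", dest="model_name", required=True)
    parser.add_argument("--featureset-name-train", dest="featureset_name_train", required=True)
    parser.add_argument("--featureset-name-test", dest="featureset_name_test", required=True)
    return parser.parse_args(argv)


def register_best_model(args):
    """Register the model and link it to the train and test datasets."""
    # Imported lazily; this function only works inside an Azure ML run.
    from azureml.core import Dataset, Model, Run

    run = Run.get_context()
    ws = run.experiment.workspace
    train_ds = Dataset.get_by_name(ws, args.featureset_name_train)
    test_ds = Dataset.get_by_name(ws, args.featureset_name_test)
    return Model.register(
        workspace=ws,
        model_path=args.saved_model,
        model_name=args.model_name,
        # Each entry pairs a descriptive key with a dataset object.
        datasets=[
            ("featurized training data", train_ds),
            ("featurized test data", test_ds),  # assumed key, for illustration
        ],
        tags={"run_id": run.id},
    )


# Argument parsing only; the model path below is a placeholder.
args = parse_register_args([
    "--saved-model", "/tmp/model_data",
    "--model-name", "network_failure_model",
    "--featureset-name-train", "feature_network_fail_train",
    "--featureset-name-test", "feature_network_fail_test",
])
print(args.model_name, args.featureset_name_train, args.featureset_name_test)
```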

The link works in the other direction as well: it connects each `AML Dataset` to the models that were trained on it.

<div style="text-align:center; width: 1000px"><img src="./assets/DatasetToModelAutoMLFT.jpg" /></div>

*Model tab of the Featurized AML Dataset; link to the `network_failure_model` AML Model.*

Throughout the lifecycle of the model and dataset, we used the `tags` parameter of the `register` function of `AML Datasets` and `AML Models`. This lets us attach important parameters, such as `dataset schema`, `input dataset` and `run_id`, to the model and dataset objects and keep them available.

<div style="text-align:center; width: 500px"><img src="./assets/DatasetTagsFT.jpg" /></div>

*Tags of the `feature_network_fail` train and test datasets. They identify the input datasets, the Databricks Feature Store, the data types of the final pandas dataframe, etc.*
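
To make the tagging step concrete, here is a small sketch of how such a tag dictionary could be assembled before it is handed to the `tags` parameter of a `register` call. The helper `build_dataset_tags`, its tag keys and the sample values are illustrative assumptions, not the ones used by the scripts in this example. Every value is converted to a string on the assumption that tags are stored as plain string pairs.

```python
import json


def build_dataset_tags(schema, input_datasets, run_id):
    """Flatten dataset metadata into a str -> str dictionary for tagging."""
    return {
        # The column-to-dtype mapping is serialised so it fits in one tag value.
        "dataset_schema": json.dumps(schema, sort_keys=True),
        "input_datasets": ",".join(input_datasets),
        "run_id": str(run_id),
    }


# Hypothetical schema and run id, for illustration only.
tags = build_dataset_tags(
    schema={"customer_id": "int64", "label": "int64"},
    input_datasets=["network", "customer", "ground_truth"],
    run_id="example-run-id",
)
print(tags)
```

Serialising the schema as JSON keeps it readable in the studio UI while still letting a consumer restore it with `json.loads`.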


In [None]:
import os
import azureml.core
import pandas as pd
from azureml.core import Workspace, Environment, Experiment, Datastore, Dataset, ScriptRunConfig
from azureml.core.compute import ComputeTarget, DatabricksCompute
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import JarLibrary, RunConfiguration
from azureml.data.data_reference import DataReference
from azureml.exceptions import ComputeTargetException
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput
from azureml.pipeline.steps import AutoMLStep, DatabricksStep, PythonScriptStep
from azureml.train.automl import AutoMLConfig

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

In [None]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

In [None]:
db_compute_name = "ADBCluster" # Databricks compute name

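try:
    # Look up the Databricks compute that is already attached to the workspace
    databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)
    print('Compute target {} already exists'.format(db_compute_name))
except ComputeTargetException:
    print('Compute target {} was not found in the workspace'.format(db_compute_name))
    raise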


In [None]:
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.core.pipeline_output_dataset import PipelineOutputAbstractDataset

def_blob_store = Datastore(ws, "generalpurposeaccount")
print('Datastore {} will be used'.format(def_blob_store.name))

In [None]:
ds_step_1_train = PipelineData("output_train", datastore=def_blob_store).as_dataset()
ds_step_1_test = PipelineData("output_test", datastore=def_blob_store).as_dataset()

In [None]:
source_directory = "./scripts"

prep_databricks_script_name = "prepare_adb_feature_store.py"
databricks_script_name = "adb_run_automl_featurestore.py"

dataset_name_train = "feature_network_fail_train"
dataset_name_test = "feature_network_fail_test"
ground_truth_name = "ground_truth"

In [None]:
cluster_name = "cpu-clu-pts"
compute_target_automl = ComputeTarget(workspace=ws, name=cluster_name)

In [None]:
reg_comp_name = "<<compute_name>>"
reg_compute_target = ComputeTarget(workspace=ws, name=reg_comp_name)

In [None]:
db_prep_step = DatabricksStep(
    name="ADB_Create_Base_Feature_Tables",
    compute_target=databricks_compute,
    existing_cluster_id="<<existing_cluster_id>>",
    python_script_params=["--ground-truth-tbl-name", ground_truth_name],
    permit_cluster_restart=True,
    python_script_name=prep_databricks_script_name,
    source_directory=source_directory,
    run_name='ADB_Create_Base_Feature_Tables',
    allow_reuse=True
)


In [None]:
# Builds the training and test sets from the feature tables and writes them
# to the two pipeline outputs consumed by the AutoML step.
db_feature_prep = DatabricksStep(
    name="ADB_Feature_Prep",
    outputs=[ds_step_1_train, ds_step_1_test],
    compute_target=databricks_compute,
    existing_cluster_id="<<existing_cluster_id>>",
    python_script_params=["--ground-truth-tbl-name", ground_truth_name,
                          "--output_datastore_name", def_blob_store.name,
                          "--output_train_feature_set_name", dataset_name_train,
                          "--output_test_feature_set_name", dataset_name_test],
    permit_cluster_restart=True,
    python_script_name=databricks_script_name,
    source_directory=source_directory,
    run_name='ADB_Feature_Prep',
    allow_reuse=True
)
# The feature tables must exist before the training set can be built
db_feature_prep.run_after(db_prep_step)


In [None]:
# Change iterations to a reasonable number (50) to get better accuracy
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 2,
    "primary_metric" : 'AUC_weighted',
    "n_cross_validations": 5
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automated_ml_errors.log',
                             compute_target = compute_target_automl,
                             featurization = 'auto',
                             training_data = ds_step_1_train.parse_parquet_files(),
                             test_data = ds_step_1_test.parse_parquet_files(),
                             label_column_name = 'label',
                             **automl_settings)

print("AutoML config created.")


In [None]:
ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))

model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))


In [None]:
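# Expose both the metrics and the best model as step outputs;
# metrics_data was defined above and is attached here as well.
train_automlStep = AutoMLStep(name='AutoML_Classification',
                              automl_config=automl_config,
                              outputs=[metrics_data, model_data],
                              allow_reuse=True)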

print("trainWithAutomlStep created.")

In [None]:
conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-sdk")

rcfg = RunConfiguration(conda_dependencies=conda_dep)

register_model_step = PythonScriptStep(script_name='register_model.py',
                                       source_directory=source_directory,
                                       name="Register_Best_Model",
                                       inputs=[model_data],
                                       compute_target=reg_compute_target,
                                       arguments=["--saved-model", model_data, 
                                                  '--model-name' , 'network_failure_model', 
                                                  '--featureset-name-train', dataset_name_train, 
                                                  '--featureset-name-test', dataset_name_test],
                                       allow_reuse=True,
                                       runconfig=rcfg)

In [None]:
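# Only the final step is listed; the upstream steps are pulled in
# through their data dependencies and the run_after link.
steps = [register_model_step]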
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'AutoML_ADB_FeatureStore').submit(pipeline)


In [None]:
pipeline_run


In [None]:
pipeline_run.wait_for_completion()


## How to access the model dataset properties in the production setting for deployment or model consumption

Once the pipeline has completed, you can retrieve the dataset information from the registered model through its `datasets` property. In this example, the property returns a dictionary whose keys are the names supplied when the model was registered, such as `featurized training data` in the cells below.

This is useful when the deployment setting needs to know the characteristics of the dataset the model was trained on. You can also use this approach to access the model and dataset information from outside AML, for example from Databricks or Kubernetes.

In [None]:
from azureml.core import Model

model = Model(ws, name='network_failure_model')
model

In [None]:
model_datasets = model.datasets
featurized_data = model_datasets['featurized training data'][0]


In [None]:
featurized_data.tags

In [None]:
pdf = featurized_data.to_pandas_dataframe()

In [None]:
pdf.head()

## Accessing the Dataset from outside of AML

In some use cases, you might want to access the AML Dataset from outside AML, for example from Databricks. To do this, you can either read the registered data from the `Databricks Feature Store`, as written by the `DatabricksStep` runs above, or simply call the `Dataset.get_by_name` function to retrieve the dataset object and start exploring.

In [None]:
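# Any registered dataset name can be used; the training feature set defined above serves as the example
ds_feature_dataset = Dataset.get_by_name(ws, dataset_name_train)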

pdf_feature_dataset = ds_feature_dataset.to_pandas_dataframe()
pdf_feature_dataset.head()