## Use of Azure ML Dataset as an approach to data mesh

In this notebook, I'll demonstrate how to use Azure ML Datasets as an approach to a data mesh strategy for model development across different compute targets, including Databricks and Azure ML, by leveraging `Azure ML Pipelines`. This notebook extends the first [notebook](pipeline_def.ipynb) by adding AutoML and model registration steps after the `DatabricksStep`.

<div style="text-align:center; width: 1000px"><img src="./assets/pipeline_automl.jpg" /></div>

*The AML Pipeline Image*

To run this example, you need an AML Workspace with a Compute Cluster. In addition, you need an Azure Databricks cluster with the ML Runtime. The cluster needs to have the `azureml-sdk[databricks]` package installed.

The overall idea is to create data lineage for the entire life cycle of a model, which starts with data processing and ends with model registration and deployment.
A simple training exercise is picked so that the focus stays on the use of AML Datasets.

In this example, the data preprocessing happens on Databricks through `DatabricksStep`, and the model training takes place on an AML Compute through `AutoMLStep`, followed by a `PythonScriptStep` that registers the best model.

The first step receives three input AML Datasets and prepares them for a model training exercise in the `DatabricksStep`. The final dataframe is then saved as `Parquet`, and the saved data is registered as an AML `TabularDataset` in `Parquet` file format. The Spark dataframe is also registered in the Azure Databricks `Feature Store` so that it can be retrieved natively within Databricks.
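
The dataset names reach the Databricks script as command-line arguments (the `python_script_params` of the `DatabricksStep` defined further down). The script itself is not shown in this notebook, so the following is only a minimal sketch of how it might parse those arguments, assuming plain `argparse`. The function name `parse_args` is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments listed in python_script_params (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Feature engineering step")
    # Names of the three registered input AML Datasets.
    for index in (1, 2, 3):
        parser.add_argument(f"--feature_set_{index}", required=True)
    # Where and under which names the outputs should be registered.
    parser.add_argument("--output_datastore_name", required=True)
    parser.add_argument("--output_train_feature_set_name", required=True)
    parser.add_argument("--output_test_feature_set_name", required=True)
    # parse_known_args ignores any extra arguments the step may append itself.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    print(parse_args())
```

Using `parse_known_args` rather than `parse_args` is a defensive choice here: it keeps the script from failing on arguments it does not declare.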

Every time the `DatabricksStep` is executed, two new AML TabularDatasets, `feature_titanic_train` and `feature_titanic_test`, are generated and then passed to the `AutoMLStep`. If the `allow_reuse` parameter on the `DatabricksStep` constructor is set to `True`, the output datasets registered from the previous run will be reused for the next step.

<div style="text-align:center; width: 500px"><img src="./assets/ADBStep_automl.jpg" /></div>

*ADB Step details page; the input and output datasets.*

Below is the output dataset, which is registered in the Databricks Feature Store:

<div style="text-align:center; width: 500px"><img src="./assets/DatabricksFeatureStoreAutoML.jpg" /></div>

*Feature Titanic dataset registered as an Azure Databricks Feature Store*

The registered `AML Dataset`s are passed to the subsequent `AutoMLStep`, which trains and tests the AutoML model. The data is read based on the incoming dataset type. Currently, AutoML supports CSV and Parquet for tabular datasets; Delta support as an input data type is expected later.
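
To illustrate reading the data based on the incoming dataset type, the sketch below dispatches to one of the two parsing methods exposed by the object that `PipelineData(...).as_dataset()` returns. The helper name `to_tabular` and its `file_format` argument are assumptions made for this example, not part of the pipeline code.

```python
def to_tabular(file_dataset, file_format):
    """Turn a pipeline output file dataset into a tabular one (hypothetical helper).

    `file_dataset` is expected to expose `parse_parquet_files()` and
    `parse_delimited_files()`.
    """
    fmt = file_format.strip().lower()
    if fmt == "parquet":
        return file_dataset.parse_parquet_files()
    if fmt == "csv":
        return file_dataset.parse_delimited_files()
    # Delta and any other format are rejected explicitly rather than guessed at.
    raise ValueError(f"Unsupported format for AutoML input: {file_format!r}")
```

With this helper, the `training_data` argument of `AutoMLConfig` could be written as `to_tabular(ds_step_1_train, "parquet")`.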

<div style="text-align:center; width: 1000px"><img src="./assets/AutoMLStep.jpg" /></div>

*AML Step details page; the input and output datasets.*

Once the `AutoMLStep` is completed, the best model is passed to a subsequent step that registers it. To register the model, the `AML Dataset` objects (one for training and one for testing) are passed as parameters to the `Model.register` function. This links the model to the datasets that were used in the AutoML experiment.
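
The registration script is not shown in this notebook, so here is a minimal sketch of how the dataset link could be passed to `Model.register` through its `datasets` argument. The function names are hypothetical, and the link name `featurized test data` is an assumption chosen to mirror the `featurized training data` key used later in this notebook.

```python
def build_dataset_links(train_dataset, test_dataset):
    """Return the (link name, dataset) pairs that tie a model to its data."""
    return [
        ("featurized training data", train_dataset),
        ("featurized test data", test_dataset),  # assumed link name
    ]


def register_best_model(workspace, model_path, model_name,
                        train_dataset, test_dataset, tags=None):
    """Register the model and attach both datasets to it (hypothetical helper)."""
    # Imported lazily so the helper above stays usable without the SDK installed.
    from azureml.core import Model

    return Model.register(
        workspace=workspace,
        model_path=model_path,
        model_name=model_name,
        tags=tags,
        datasets=build_dataset_links(train_dataset, test_dataset),
    )
```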

<div style="text-align:center; width: 1000px"><img src="./assets/Model_AutoML.jpg" /></div>

*Registered Model data tab; link to the feature_titanic AML Dataset.*

This also lets us navigate in the other direction, from the `AML Dataset` to the models that use it.

<div style="text-align:center; width: 1000px"><img src="./assets/DatasetToModelAutoML.jpg" /></div>

*Model tab of the Featurized AML Dataset; link to the titanic_model AML Model.*

Throughout the lifecycle of the model and dataset, we used the `tags` parameter of the `register` function for both `AML Datasets` and `AML Models`. This lets us attach important parameters, such as `dataset schema`, `input dataset`, and `run_id`, to the model and dataset objects.

<div style="text-align:center; width: 500px"><img src="./assets/DatasetTags.jpg" /></div>

*Tags of the feature_titanic train and test datasets. These identify the input datasets, the Databricks feature store, the data types of the final pandas dataframe, etc.*
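
As a rough sketch of how such tags could be assembled, the helper below flattens the lineage details into string key/value pairs. The helper name, the tag keys, and the sample values are all assumptions for illustration; the actual keys used by the pipeline scripts may differ.

```python
import json


def build_dataset_tags(input_dataset_names, column_types, run_id):
    """Build a flat str -> str tag dictionary (hypothetical helper)."""
    return {
        "input_datasets": ",".join(input_dataset_names),
        # Serialise the schema so it fits into a single string value.
        "dataset_schema": json.dumps(column_types, sort_keys=True),
        "run_id": str(run_id),
    }


tags = build_dataset_tags(
    ["titanic_1", "titanic_2", "titanic_3"],
    {"Survived": "int64", "Age": "float64"},  # sample schema, not the real one
    "example-run-id",
)
print(tags)
```

The resulting dictionary can then be passed as `tags=tags` to the dataset's `register` call or to `Model.register`.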


In [3]:
import os
import azureml.core
import pandas as pd
from azureml.core.runconfig import JarLibrary
from azureml.core.compute import ComputeTarget, DatabricksCompute
from azureml.exceptions import ComputeTargetException
from azureml.core import Workspace, Environment, Experiment, Datastore, Dataset, ScriptRunConfig
from azureml.pipeline.core import Pipeline, PipelineData, TrainingOutput
from azureml.pipeline.steps import AutoMLStep, DatabricksStep, PythonScriptStep
from azureml.data.data_reference import DataReference
from azureml.core.conda_dependencies import CondaDependencies
from azureml.train.automl import AutoMLConfig
from azureml.core.runconfig import RunConfiguration

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.40.0


In [4]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')



In [None]:
db_compute_name = "ADBCluster" # Databricks compute name

databricks_compute = DatabricksCompute(workspace=ws, name=db_compute_name)
print('Compute target {} already exists'.format(db_compute_name))


In [None]:
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.core.pipeline_output_dataset import PipelineOutputAbstractDataset

def_blob_store = Datastore(ws, "generalpurposeaccount")
print('Datastore {} will be used'.format(def_blob_store.name))

In [None]:
def register_dataset(datastore, dataset_name):
    """Upload the local Titanic CSV and register it as a versioned TabularDataset."""
    remote_path = f'dataset-demo/{dataset_name}/'
    local_path = './data/titanic.csv'
    # Upload the file to the datastore, replacing any earlier copy.
    datastore.upload_files(files=[local_path],
                           target_path=remote_path,
                           overwrite=True,
                           show_progress=False)

    # Relies on the notebook-level `ws` workspace object.
    dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, remote_path)])
    dataset = dataset.register(ws, name=dataset_name, create_new_version=True)
    return dataset

In [None]:
ds_titanic_1 = register_dataset(def_blob_store, 'titanic_1')
ds_titanic_2 = register_dataset(def_blob_store, 'titanic_2')
ds_titanic_3 = register_dataset(def_blob_store, 'titanic_3')

In [None]:
ds_step_1_train = PipelineData("output_train", datastore=def_blob_store).as_dataset()
ds_step_1_test = PipelineData("output_test", datastore=def_blob_store).as_dataset()

In [None]:
source_directory = "./scripts"

databricks_script_name = "adb_run_automl.py"

feature_dataset_name_train = "feature_titanic_train"
feature_dataset_name_test = "feature_titanic_test"

In [None]:
dbNbStep = DatabricksStep(
    name="ADB_Feature_Eng",
    outputs=[ds_step_1_train, ds_step_1_test],
    compute_target=databricks_compute,
    existing_cluster_id="<<cluster_id>>",
    python_script_params=["--feature_set_1", ds_titanic_1.name,
                          "--feature_set_2", ds_titanic_2.name,
                          "--feature_set_3", ds_titanic_3.name,
                          '--output_datastore_name', def_blob_store.name,
                          "--output_train_feature_set_name", feature_dataset_name_train, 
                          "--output_test_feature_set_name", feature_dataset_name_test],
    permit_cluster_restart=True,
    python_script_name=databricks_script_name,
    source_directory=source_directory,
    run_name='ADB_Feature_Eng',
    allow_reuse=True
)


In [None]:
cluster_name = "<<compute_name>>"
compute_target = ComputeTarget(workspace=ws, name=cluster_name)

In [None]:
# Change iterations to a reasonable number (50) to get better accuracy
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 2,
    "primary_metric" : 'AUC_weighted',
    "n_cross_validations": 5
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automated_ml_errors.log',
                             compute_target = compute_target,
                             featurization = 'auto',
                             training_data = ds_step_1_train.parse_parquet_files(),
                             test_data = ds_step_1_test.parse_parquet_files(),
                             label_column_name = 'Survived',
                             **automl_settings)
                             
print("AutoML config created.")


In [None]:
ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))


In [None]:
train_automlStep = AutoMLStep(name='AutoML_Classification',
                                 automl_config=automl_config,
                                 outputs=[metrics_data, model_data],
                                 allow_reuse=True)

print("trainWithAutomlStep created.")

In [None]:
reg_comp_name = "<<compute_name>>"
reg_compute_target = ComputeTarget(workspace=ws, name=reg_comp_name)

In [None]:
source_directory

In [None]:
conda_dep = CondaDependencies()
conda_dep.add_pip_package("azureml-sdk")

rcfg = RunConfiguration(conda_dependencies=conda_dep)

register_model_step = PythonScriptStep(script_name='register_model.py',
                                       source_directory=source_directory,
                                       name="Register_Best_Model",
                                       inputs=[model_data],
                                               # ds_step_1_train.parse_parquet_files().as_named_input('input_train'), 
                                               # ds_step_1_test.parse_parquet_files().as_named_input('input_test')],
                                       compute_target=reg_compute_target,
                                       arguments=["--saved-model", model_data, 
                                                  '--model-name' , 'titanic_model', 
                                                  '--featureset-name-train', feature_dataset_name_train, 
                                                  '--featureset-name-test', feature_dataset_name_test],
                                       allow_reuse=True,
                                       runconfig=rcfg)

# register_model_step.run_after(train_automlStep)

In [None]:
steps = [dbNbStep, train_automlStep, register_model_step]
pipeline = Pipeline(workspace=ws, steps=steps)
pipeline_run = Experiment(ws, 'DB_FeatureStore_AutoML_Register').submit(pipeline)


In [None]:
pipeline_run

In [None]:
pipeline_run.wait_for_completion()


## How to access the model dataset properties in the production setting for deployment or model consumption

Once the pipeline has completed, you can access the dataset information from the registered model through its `datasets` property. In this example, you'll receive a dictionary whose keys are the names provided when the model was registered, `featurized training data` in this case.

This is helpful when the deployment setting needs to retrieve the characteristics of the dataset. In addition, you can use this method if you'd like to access the model and dataset information from outside of AML, for example from Databricks or Kubernetes.
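
Since the `datasets` property maps each link name to a list of datasets, a small lookup helper can make a missing key easier to diagnose. This is only a sketch; `first_linked_dataset` is a hypothetical name and works on any dictionary of lists.

```python
def first_linked_dataset(model_datasets, link_name):
    """Return the first dataset stored under `link_name` (hypothetical helper)."""
    try:
        linked = model_datasets[link_name]
    except KeyError:
        # List the link names that do exist to make typos easy to spot.
        available = ", ".join(sorted(model_datasets)) or "<none>"
        raise KeyError(
            f"No dataset linked as {link_name!r}; available: {available}"
        ) from None
    if not linked:
        raise ValueError(f"Link {link_name!r} has no datasets attached")
    return linked[0]
```

For the model in this notebook, the call would look like `first_linked_dataset(model.datasets, 'featurized training data')`.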

In [None]:
from azureml.core import Model

model = Model(ws, name='titanic_model')
model

In [None]:
model_datasets = model.datasets
featurized_data = model_datasets['featurized training data'][0]


In [None]:
featurized_data.tags

In [None]:
pdf = featurized_data.to_pandas_dataframe()

In [None]:
pdf.head()

## Accessing the Dataset from outside of AML

In some use cases, you might want to access the AML Dataset from outside of AML, for example from Databricks. To do this, you can either access the registered data from the `Databricks Feature Store`, as provided in the first step by the `DatabricksStep`, or simply call the `Dataset.get_by_name` function to retrieve the dataset object and start exploring.
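
Outside of AML there is no `config.json` to load the workspace from, so the `ws` object has to be obtained in another way. One option, sketched below under the assumption that a service principal is available, is `Workspace.get` with `ServicePrincipalAuthentication`. The environment variable names and both helper functions are hypothetical.

```python
import os

# Hypothetical environment variable names; adapt them to your own setup.
REQUIRED_SETTINGS = (
    "AML_TENANT_ID",
    "AML_CLIENT_ID",
    "AML_CLIENT_SECRET",
    "AML_SUBSCRIPTION_ID",
    "AML_RESOURCE_GROUP",
    "AML_WORKSPACE_NAME",
)


def load_settings(env=None):
    """Collect the connection settings and fail early if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_SETTINGS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_SETTINGS}


def connect_workspace(settings):
    """Return a Workspace handle using service principal authentication."""
    # Imported lazily so load_settings can be used without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(
        tenant_id=settings["AML_TENANT_ID"],
        service_principal_id=settings["AML_CLIENT_ID"],
        service_principal_password=settings["AML_CLIENT_SECRET"],
    )
    return Workspace.get(
        name=settings["AML_WORKSPACE_NAME"],
        auth=auth,
        subscription_id=settings["AML_SUBSCRIPTION_ID"],
        resource_group=settings["AML_RESOURCE_GROUP"],
    )
```

With the settings in place, `ws = connect_workspace(load_settings())` yields a workspace object that can be passed to `Dataset.get_by_name`.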

In [None]:
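# `feature_dataset_name_train` is defined earlier in this notebook; use the test name for the test split.
ds_feature_dataset = Dataset.get_by_name(ws, feature_dataset_name_train)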

pdf_feature_dataset = ds_feature_dataset.to_pandas_dataframe()
pdf_feature_dataset.head()