# Azure Purview for AI - Demo

## Initial Setup

In [None]:

import os

import dotenv
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

dotenv.load_dotenv()

# authenticate and instantiate client
tenant_id = os.environ["TENANT_ID"]
client_id = os.environ["SP_ID"]
client_secret = os.environ["SP_SECRET"]

auth = ServicePrincipalAuthentication(
    tenant_id=tenant_id,
    client_id=client_id,
    client_secret=client_secret
)

client = PurviewClient(account_name="ml-purview", authentication=auth)
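
The setup cell raises a bare `KeyError` if any of the three variables is missing from the environment or the `.env` file. A minimal sketch of an up-front check is shown below. The helper `missing_env_vars` is a hypothetical name introduced here for illustration; it is not part of `dotenv` or `pyapacheatlas`.

```python
import os

# Variable names read by the setup cell.
REQUIRED_VARS = ("TENANT_ID", "SP_ID", "SP_SECRET")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"TENANT_ID": "my-tenant", "SP_ID": "my-app-id"}
print(missing_env_vars(env=example_env))  # ['SP_SECRET']
```

Calling `missing_env_vars()` with no arguments checks the real environment, so it could run right after `dotenv.load_dotenv()`.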




## Problem setting

The vast majority of enterprise AI/ML use cases begin with raw data being copied into cloud storage. Usually this is the staging area of the enterprise data lake. 

In this demo, the storage account `topsecretdata` stands in for the enterprise data lake, and the container `data` is its staging area. At the beginning of the demo, we assume that the raw data has just been copied there and that the same files will be overwritten on future runs.

The raw data consists of two files: `iris-train.csv`, which contains a subset of rows from the iris dataset, and `iris-score.csv`, which contains the remaining rows with the `Species` column removed.

We can view this by navigating to Browse Assets > Azure Blob Storage > topsecretdata > data. Click on `iris-train.csv` and examine all the details. Click on "edit" and add/update the description.

Click on Lineage. If the lineage is not empty:
- Click on the box marked TrainETL > switch to asset.
- Copy the GUID from the URL
- Paste it in the cell below and run.
- Navigate back to the `iris-train.csv` asset lineage, hit refresh, and the lineage should be empty.

![](./.img/train-lineage.png)
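
Copying the GUID by hand is error-prone. As a sketch, it can be pulled out of a pasted URL with a regular expression. The helper name `extract_guid` and the example URL are illustrative assumptions. The actual Purview URL layout may differ, so the pattern only looks for the standard 8-4-4-4-12 hexadecimal GUID shape anywhere in the string.

```python
import re

# Standard GUID shape: 8-4-4-4-12 hexadecimal characters.
GUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_guid(url):
    """Return the first GUID-shaped substring of `url`, lowercased."""
    match = GUID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No GUID found in: {url!r}")
    return match.group(0).lower()


# Hypothetical URL for illustration only; not a real Purview address.
example_url = (
    "https://example.invalid/asset"
    "?guid=bc8b1732-df57-4776-ac79-884784928aa2&section=lineage"
)
guid = extract_guid(example_url)
print(guid)
```

The function returns the first match, so if a URL carries several GUIDs (for example, a subscription ID ahead of the entity GUID), check the result before deleting anything.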



In [None]:

# Deletes the TrainETL entity for the demo

guid="bc8b1732-df57-4776-ac79-884784928aa2"
client.delete_entity(guid=guid)

## Lineage with Azure Data Factory

Typically, before data analysis, the raw data is transformed and loaded into a relational database. This could be a DataVault 2.0 model in the enterprise data warehouse, or a small database dedicated to data and ML use cases. 

Go to the [data factory](https://adf.azure.com/en-us/home?factory=%2Fsubscriptions%2Fd50ade7c-2587-4da8-9c63-fc828541722c%2FresourceGroups%2Frgp-show-weu-aml-databricks%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2Fadf-mlops-demo), select the pipeline `pl_train_etl` and launch a debug run. The run completes in about 15 seconds. 

Navigate to the `iris-train.csv` asset again, click on Lineage, and refresh. The full lineage should be visible, as shown below.

![](./.img/train-lineage-full.png)

This is an extremely powerful concept on Azure: we can see that the `traindata` database table is loaded from the `iris-train.csv` file. A change in the structure of that file would break the ETL, so both the TrainETL Copy Activity and the `traindata` database table would need to be modified.
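
The impact analysis described above amounts to walking the lineage graph downstream from the changed asset. The sketch below illustrates the idea on the three nodes of this demo. The edge list is written by hand and the function `downstream_assets` is a hypothetical helper; nothing here queries Purview.

```python
from collections import deque

# Lineage edges from the demo, written as (upstream, downstream).
LINEAGE_EDGES = [
    ("iris-train.csv", "TrainETL"),
    ("TrainETL", "traindata"),
]


def downstream_assets(start, edges):
    """Return every asset reachable from `start`, in breadth-first order."""
    children = {}
    for upstream, downstream in edges:
        children.setdefault(upstream, []).append(downstream)

    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            # Skip nodes already visited so cycles cannot loop forever.
            if child not in seen and child != start:
                seen.append(child)
                queue.append(child)
    return seen


# Everything that may need changes if the CSV structure changes.
print(downstream_assets("iris-train.csv", LINEAGE_EDGES))  # ['TrainETL', 'traindata']
```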

## Customized lineage in Azure Purview

Purview is based on Apache Atlas, a free and open-source metadata management tool. As a result, we can extend the functionality of Purview by creating new entity types.

For example, Azure ML pipelines are not a built-in type in Purview. The code below creates a custom type definition to represent an Azure ML pipeline. We can then instantiate this type to represent the specific pipeline that we are deploying. Because the ML pipeline type inherits from the built-in `Process` type, we can specify an array of input and output `Dataset` elements to represent data flowing in and out of the pipeline.

There is no need to run this code.

```python
# Define ML Pipeline Process Type

from pyapacheatlas.core import EntityTypeDef, AtlasAttributeDef
from pyapacheatlas.core.typedef import Cardinality


pipeline_name = AtlasAttributeDef(
    name="pipeline_name",
    displayName="Pipeline Name",
    description="Name of the Azure ML pipeline"
)

pipeline_owner = AtlasAttributeDef(
    name="pipeline_owner",
    displayName="Pipeline Owner",
    description="Name of the main developer of the ML pipeline"
)

process = EntityTypeDef(
    name="azureml_pipeline",
    superTypes=["Process"],
    attributeDefs = [pipeline_name, pipeline_owner]
)

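# Inspect the JSON payload that will be sent to Purview
print(process.to_json())

# Upload the definition; type definitions are passed by category keyword
client.upload_typedefs(entityDefs=[process])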
```
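
Once the type exists, a specific pipeline can be registered as an entity of that type. The sketch below builds such an entity as a plain dictionary in the Apache Atlas v2 entity JSON shape, so it runs without a Purview connection. The helper names `dataset_ref` and `build_pipeline_entity`, every qualified name, the owner, and the input and output type names are placeholder assumptions, not values taken from this demo.

```python
import json


def dataset_ref(type_name, qualified_name):
    """Reference an existing asset by its unique qualifiedName."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_pipeline_entity(name, qualified_name, owner, inputs, outputs):
    """Build an entity of the custom `azureml_pipeline` type."""
    return {
        "typeName": "azureml_pipeline",
        "guid": "-1",  # a negative guid marks an entity that does not exist yet
        "attributes": {
            "name": name,
            "qualifiedName": qualified_name,
            "pipeline_name": name,
            "pipeline_owner": owner,
            "inputs": inputs,    # inherited from the built-in Process type
            "outputs": outputs,  # inherited from the built-in Process type
        },
    }


# All values below are hypothetical placeholders.
entity = build_pipeline_entity(
    name="iris-score-pipeline",
    qualified_name="azureml://example/iris-score-pipeline",
    owner="Jane Doe",
    inputs=[dataset_ref("azure_blob_path", "https://example.invalid/data/iris-score.csv")],
    outputs=[dataset_ref("azure_blob_path", "https://example.invalid/data/iris-predictions.csv")],
)
print(json.dumps(entity, indent=2))
```

With `pyapacheatlas`, the same structure can be expressed with the `AtlasProcess` class and sent with `client.upload_entities`; the dictionary form is used here only to keep the example self-contained.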


Navigate to the `iris-score.csv` asset in the asset browser. Verify that the schema does not contain the label.  

Navigate to the lineage, copy the GUID from the URL, and delete the entity by running the cell below, just as we did before for the `TrainETL` Copy Activity.

In [None]:

# Deletes the lineage entity of iris-score.csv for the demo

guid="427896aa-371d-407c-9eb9-5bb95598cf4e"
client.delete_entity(guid=guid)

Navigate back to the `iris-score.csv` asset, click on Lineage, and refresh. No lineage should be available.


In `iris/score_pipeline.py`, modify the `description` keyword argument near line 68. Make sure you are on the `master` branch, then commit and push. The CI pipeline step that publishes the pipeline also updates the lineage in Purview.