# Tutorial: Use the Microsoft Purview Python SDK
This notebook is based on parts of the Microsoft Learn tutorial.
Link: https://learn.microsoft.com/en-us/purview/tutorial-using-python-sdk

It extends that tutorial by showing how to work with the query results in PySpark.

## Intro

This tutorial introduces the Microsoft Purview Python SDK. You can use the SDK to perform the most common Microsoft Purview operations programmatically, rather than through the Microsoft Purview governance portal.

In this tutorial, you'll learn how to use the SDK to:

- Grant the required rights to work programmatically with Microsoft Purview
- Register a Blob Storage container as a data source in Microsoft Purview
- Instantiate the Purview Clients in Python
- Define and run a scan
- Search the catalog
- Delete a data source

## Prerequisites
For this tutorial, you'll need:

- Python 3.6 or higher
- An active Azure Subscription. If you don't have one, you can create one for free.
- A Microsoft Entra tenant associated with your subscription.
- An Azure Storage account. If you don't already have one, you can follow our quickstart guide to create one.
- A Microsoft Purview account. If you don't already have one, you can follow our quickstart guide to create one.
- A service principal with a client secret.



## Install Python Packages

In [0]:
pip install azure-identity

In [0]:
pip install azure-purview-scanning

In [0]:
pip install azure-purview-administration

In [0]:
pip install azure-purview-catalog

In [0]:
pip install azure-purview-account

In [0]:
pip install azure-core

In [0]:
dbutils.library.restartPython()
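
After the restart, it can help to confirm that the packages are visible to the interpreter. The following is a minimal sketch using only the standard library (`importlib.metadata`, which requires Python 3.8 or later); the helper name `installed_versions` is an assumption made for illustration and is not part of any Purview package.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The distributions installed in the cells above
required = [
    "azure-identity",
    "azure-purview-scanning",
    "azure-purview-administration",
    "azure-purview-catalog",
    "azure-purview-account",
    "azure-core",
]

for package, version in installed_versions(required).items():
    print(f"{package}: {version or 'NOT INSTALLED'}")
```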

## Instantiate a Scanning, Catalog, and Administration client
In this section, you learn how to instantiate:

- A scanning client, used to register data sources, create and manage scan rules, trigger scans, etc.
- A catalog client, used to interact with the catalog by searching, browsing the discovered assets, identifying the sensitivity of your data, etc.
- An administration client, used to interact with the Microsoft Purview Data Map itself, for operations like listing collections.

In [0]:
from azure.purview.scanning import PurviewScanningClient
from azure.purview.catalog import PurviewCatalogClient
from azure.purview.administration.account import PurviewAccountClient
from azure.identity import ClientSecretCredential 
from azure.core.exceptions import HttpResponseError
import json


In [0]:
# Service principal details (replace the placeholders with your own values)
client_id = "client_id"  # application (client) ID
client_secret = "client_secret"
tenant_id = "tenant_id"  # also called the directory ID
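
Hardcoding a client secret in a notebook is risky once the notebook is shared. As one possible alternative, the values can be read from environment variables. The sketch below is an assumption, not part of the original tutorial: the helper `load_service_principal` is hypothetical, and the variable names `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, and `AZURE_TENANT_ID` were chosen because azure-identity's `EnvironmentCredential` reads the same names.

```python
import os


def load_service_principal(env=None):
    """Read the service principal settings from environment variables.

    The variable names match the ones azure-identity's EnvironmentCredential
    reads; using them here is a choice made for this sketch.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "AZURE_CLIENT_ID",
        "client_secret": "AZURE_CLIENT_SECRET",
        "tenant_id": "AZURE_TENANT_ID",
    }
    # Report every missing variable at once instead of failing on the first
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage once the variables are set:
# settings = load_service_principal()
# client_id = settings["client_id"]
# client_secret = settings["client_secret"]
# tenant_id = settings["tenant_id"]
```

On Databricks, reading the values from a secret scope with `dbutils.secrets.get` is another option.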

### Different endpoints depending on Microsoft Purview Portal (new vs. classic)
Your endpoint value depends on which Microsoft Purview portal you are using.

- Endpoint for the classic Microsoft Purview governance portal: https://{your_purview_account_name}.purview.azure.com/
- Endpoint for the new Microsoft Purview portal: https://api.purview-service.microsoft.com
- Scan endpoint for the classic Microsoft Purview governance portal: https://{your_purview_account_name}.scan.purview.azure.com/
- Scan endpoint for the new Microsoft Purview portal: https://api.scan.purview-service.microsoft.com
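
The choice between the two endpoint pairs can also be wrapped in a small helper. This is a minimal sketch built only from the URLs listed above; the function name `purview_endpoints` and the account name `contoso-purview` are made up for illustration.

```python
def purview_endpoints(account_name=None, new_portal=False):
    """Return (purview_endpoint, purview_scan_endpoint) for the chosen portal."""
    if new_portal:
        # The new portal uses fixed endpoints that do not include the account name
        return (
            "https://api.purview-service.microsoft.com",
            "https://api.scan.purview-service.microsoft.com",
        )
    if not account_name:
        raise ValueError("account_name is required for the classic portal")
    return (
        f"https://{account_name}.purview.azure.com/",
        f"https://{account_name}.scan.purview.azure.com/",
    )


# "contoso-purview" is a made-up account name used only for illustration
classic_endpoint, classic_scan_endpoint = purview_endpoints("contoso-purview")
print(classic_endpoint)
print(classic_scan_endpoint)
```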

In [0]:
purview_endpoint = "purview_endpoint"
purview_scan_endpoint = "purview_scan_endpoint"

In [0]:
def get_credentials():
    credentials = ClientSecretCredential(client_id=client_id, client_secret=client_secret, tenant_id=tenant_id)
    return credentials

def get_purview_client():
    credentials = get_credentials()
    client = PurviewScanningClient(endpoint=purview_scan_endpoint, credential=credentials, logging_enable=True)  
    return client

def get_catalog_client():
    credentials = get_credentials()
    client = PurviewCatalogClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
    return client

def get_admin_client():
    credentials = get_credentials()
    client = PurviewAccountClient(endpoint=purview_endpoint, credential=credentials, logging_enable=True)
    return client

## Search the catalog for specific entries

In [0]:
from azure.purview.catalog import PurviewCatalogClient
from azure.identity import ClientSecretCredential 
from azure.core.exceptions import HttpResponseError

In [0]:
try:
    client_catalog = get_catalog_client()
except ValueError as e:
    print(e)  

Search the whole catalog for the keyword "*" to return all scanned assets.

In [0]:
keywords = "*"

body_input={
    "keywords": keywords
}

Example of more advanced filtering:

```
classifications = "MICROSOFT.PERSONAL.NAME"
collectionId1 = "gl7byr" #can be retrieved with another python call
collectionId2 = "ehmcsl"

body_input={
    "keywords": keywords,
    "filter": {
        "classification": classifications,
        "collectionId": collectionId1
    }
}
```
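
The `collectionId` values in the example above are collection names as stored by the service, not the display names shown in the portal. One way to look them up, following the approach in the Microsoft Learn tutorial, is to list the collections with the administration client and match on the friendly name. The sketch below assumes each listed collection is a dictionary with `friendlyName` and `name` keys; the helper `find_collection_id`, the friendly names, and the sample data are made up for illustration.

```python
def find_collection_id(collections, friendly_name):
    """Return the name of the collection whose friendly name matches, else None."""
    wanted = friendly_name.lower()
    for collection in collections:
        if collection.get("friendlyName", "").lower() == wanted:
            return collection.get("name")
    return None


# Stand-in data shaped like the assumed listing; real values come from the service
sample_collections = [
    {"name": "gl7byr", "friendlyName": "Sales"},
    {"name": "ehmcsl", "friendlyName": "Finance"},
]
print(find_collection_id(sample_collections, "finance"))

# With a live account, the listing would come from the administration client:
# collections = get_admin_client().collections.list_collections()
# collectionId1 = find_collection_id(collections, "Sales")
```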

In [0]:
try:
    response = client_catalog.discovery.query(search_request=body_input)
    # Format the response for better readability
    formatted_response = json.dumps(response, indent=4)
    
    # Print the formatted response
    print(formatted_response)
except HttpResponseError as e:
    print(e)
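
A single query returns only one page of results. The search request body also accepts `limit` and `offset` fields, so larger catalogs can be read page by page. The sketch below assumes the response is a dictionary whose `value` key holds the list of assets (the same key the PySpark cells in this notebook explode); `iter_search_results`, the page size, and the stand-in `fake_query` are assumptions made for illustration.

```python
def iter_search_results(query, body, page_size=50):
    """Yield assets page by page until a page comes back short or empty.

    `query` is any callable that takes a request body and returns the
    response dictionary, for example a wrapper around discovery.query.
    """
    offset = 0
    while True:
        # Copy the caller's body and add the paging fields for this request
        page_body = dict(body, limit=page_size, offset=offset)
        page = query(page_body).get("value", [])
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in for the service so the sketch runs without a Purview account
fake_assets = [{"id": str(i)} for i in range(5)]


def fake_query(request_body):
    start = request_body["offset"]
    return {"value": fake_assets[start:start + request_body["limit"]]}


all_assets = list(iter_search_results(fake_query, {"keywords": "*"}, page_size=2))
print(len(all_assets))

# Against the real catalog client the call would look like:
# assets = list(iter_search_results(
#     lambda b: client_catalog.discovery.query(search_request=b), body_input))
```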

# Work with the Query Results in PySpark

In [0]:
# Import PySpark and the column functions used in the cells below
import pyspark
from pyspark.sql.functions import array_contains, col, explode_outer, when

In [0]:
# Convert the query result to a Spark DataFrame
df = spark.read.json(sc.parallelize([formatted_response]))

# Explode the nested entries to create one row per entry
df_explode = df.select(explode_outer('value').alias('values'))

# Show the first row of the resulting DataFrame
df_explode.head()


In [0]:
# Create a new DataFrame with one top-level column per nested field
asset_fields = [
    'assetType', 'classification', 'collectionId', 'createBy', 'createTime',
    'displayText', 'entityType', 'id', 'isIndexed', 'name', 'objectType',
    'qualifiedName', 'updateBy', 'updateTime', 'owner',
]
# Keep the original struct and add each nested field under its own name;
# fields that are null in the struct simply stay null in the new column
df_withColumn = df_explode.select(
    'values', *[col(f'values.{field}').alias(field) for field in asset_fields]
)

df_withColumn.display()
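
The column extraction above relies on schema inference: a field such as `classification` or `owner` only becomes part of the `values` struct if it appears in the JSON. One way to make the cell more robust is to give every asset the same keys before handing the data to Spark. The sketch below is plain Python under that assumption; `normalize_assets`, the shortened field list, and the sample response are illustrative and not part of the SDK.

```python
import json

# A shortened field list for the sketch; list-valued fields default to []
EXPECTED_FIELDS = ["id", "name", "qualifiedName", "entityType", "collectionId", "owner"]
LIST_FIELDS = ["classification"]


def normalize_assets(response, fields=EXPECTED_FIELDS, list_fields=LIST_FIELDS):
    """Return one dict per asset with every expected key present."""
    rows = []
    for asset in response.get("value", []):
        # Missing scalar fields become None, missing list fields become []
        row = {field: asset.get(field) for field in fields}
        for field in list_fields:
            row[field] = asset.get(field) or []
        rows.append(row)
    return rows


# Made-up response with one asset that lacks a classification
sample_response = {
    "value": [
        {"id": "1", "name": "customers.csv", "classification": ["MICROSOFT.PERSONAL.NAME"]},
        {"id": "2", "name": "readme.txt"},
    ]
}
rows = normalize_assets(sample_response)
print(json.dumps(rows[1]))

# Each row can then be serialized for Spark, one JSON document per record:
# df_flat = spark.read.json(sc.parallelize([json.dumps(row) for row in rows]))
```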

In [0]:
# Filter on a classification
# "HighlyConfidential" is a custom classification that I created and added to an asset; it is kept here for later filtering
highlyConfidentialClassification = "HighlyConfidential"
classificationFilter = "MICROSOFT.PERSONAL.NAME"

df_classificationFilter = df_withColumn.filter(array_contains(df_withColumn.classification, classificationFilter))
df_classificationFilter.display()

In [0]:
df_classificationFilter.select('qualifiedName').display()

# Other operations

## Get information about one entity by GUID

In [0]:
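# Get the GUID of the first asset that matched the classification filter
first_match = df_classificationFilter.select('id').first()
if first_match is None:
    raise ValueError("No asset matched the classification filter")
guid = str(first_match[0])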

In [0]:

try:
    response = client_catalog.entity.get_by_guid(guid)
    # Format the response for better readability
    formatted_response = json.dumps(response, indent=4)
    
    # Print the formatted response
    print(formatted_response)

except HttpResponseError as e:
    print(e)
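
The full entity payload is verbose. If only a few attributes are needed, they can be pulled out of the response dictionary. The sketch below assumes the Apache Atlas style layout, with an `entity` object that carries `guid`, `typeName`, and an `attributes` dictionary; `summarize_entity` and the sample payload are made up for illustration, so check the printed response above for the actual shape.

```python
def summarize_entity(response):
    """Pick a few commonly used fields out of a get_by_guid style response."""
    entity = response.get("entity", {})
    attributes = entity.get("attributes", {})
    return {
        "guid": entity.get("guid"),
        "typeName": entity.get("typeName"),
        "name": attributes.get("name"),
        "qualifiedName": attributes.get("qualifiedName"),
    }


# Made-up payload in the assumed layout
sample_entity_response = {
    "entity": {
        "guid": "00000000-0000-0000-0000-000000000000",
        "typeName": "azure_blob_path",
        "attributes": {
            "name": "customers.csv",
            "qualifiedName": "https://example.blob.core.windows.net/data/customers.csv",
        },
    }
}
print(summarize_entity(sample_entity_response))
```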

## Get Classifications by GUID

In [0]:
try:
    response = client_catalog.entity.get_classifications(guid)
    
    # Format the response for better readability
    formatted_response = json.dumps(response, indent=4)
    
    # Print the formatted response
    print(formatted_response)

except HttpResponseError as e:
    print(e)
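
To check whether an asset carries a particular classification, such as the custom `HighlyConfidential` one defined earlier, the type names can be extracted from the response. The sketch below assumes the response holds the classifications under a `list` key, each with a `typeName`; `classification_names` and the sample payload are illustrative, so compare them with the printed response above.

```python
def classification_names(response):
    """Return the typeName of every classification in the response."""
    return [item["typeName"] for item in response.get("list", []) if "typeName" in item]


# Made-up payload in the assumed layout
sample_classifications = {
    "list": [
        {"typeName": "MICROSOFT.PERSONAL.NAME"},
        {"typeName": "HighlyConfidential"},
    ]
}
names = classification_names(sample_classifications)
print("HighlyConfidential" in names)

# With the live response from the cell above:
# is_confidential = highlyConfidentialClassification in classification_names(response)
```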