# Deploy Dataplex Profiling and Quality

- Luis Gerardo Baeza, Customer Engineer
- Sep 23rd

## 1. Preparation
**For the principal that runs this script (e.g. the Vertex AI Workbench service account):**
- Grant the BigQuery Data Editor IAM role, or the [minimum permissions](https://cloud.google.com/dataplex/docs/use-data-profiling#permissions) listed in the docs, on each of the tables, datasets and GCP projects configured below
- Grant the Dataplex DataScan Editor role (roles/dataplex.dataScanEditor), or the [minimum permissions](https://cloud.google.com/dataplex/docs/use-data-profiling#roles-permissions) listed in the docs

**What this script does:**
- Creates Dataplex scan jobs in the GCP project in context
- Runs the jobs immediately
- In the last section, modifies the jobs to run on a schedule

This is sample code, not meant for production. Use it with caution.

## 2. Setup

In [23]:
pip install --upgrade --quiet google-cloud-dataplex

Note: you may need to restart the kernel to use updated packages.


In [None]:
import google.api_core.exceptions
from google.cloud import bigquery
from google.cloud import dataplex_v1
import os
from typing import List
from google.protobuf import field_mask_pb2
import time

In [None]:
# BigQuery projects to include in the profiling and quality scans
GOOGLE_CLOUD_PROJECT_ID_BQ_SOURCE = ['']
GOOGLE_CLOUD_DATASET_TEST = ['test_dataset']  # Datasets to process; use an empty list to process every dataset
DATAPLEX_LOCATION = "us-central1"
DPLX_EXPORT_DATASET_ID = "dplex_at_scale"
DPLEX_EXPORT_TABLE_NAME_profiling = "profiling"
DPLEX_EXPORT_TABLE_NAME_quality = "quality"
DPLEX_CRON_SCHEDULE = "0 2 * * *"
sampling_percent = 100.0

In [80]:
GOOGLE_CLOUD_PROJECT_DATAPLEX = os.environ.get( 'GOOGLE_CLOUD_PROJECT' )
BQ_DPLEX_EXPORT_TABLE = f"projects/{GOOGLE_CLOUD_PROJECT_DATAPLEX}/datasets/{DPLX_EXPORT_DATASET_ID}/tables/{DPLEX_EXPORT_TABLE_NAME_profiling}"
BQ_DPLEX_EXPORT_TABLE_quality = f"projects/{GOOGLE_CLOUD_PROJECT_DATAPLEX}/datasets/{DPLX_EXPORT_DATASET_ID}/tables/{DPLEX_EXPORT_TABLE_NAME_quality}"
dataplex_client = dataplex_v1.DataScanServiceClient()
bq_client = bigquery.Client()
dplex_service_client  = dataplex_v1.DataplexServiceClient()
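
`os.environ.get` returns `None` when `GOOGLE_CLOUD_PROJECT` is not set, in which case the f-strings above would quietly produce paths such as `projects/None/...`. A minimal sketch of a fail-fast alternative follows. The helper name `export_table_path` is hypothetical (it is not part of this notebook), and it builds the same path format as the cell above.

```python
import os


def export_table_path(project_id, dataset_id, table_name):
    """Build an export table path in the same format used in the cell above."""
    if not project_id:
        # Fail early instead of producing "projects/None/..."
        raise ValueError(
            "Project ID is empty; set GOOGLE_CLOUD_PROJECT before running the notebook"
        )
    return f"projects/{project_id}/datasets/{dataset_id}/tables/{table_name}"


# Illustrative values only; in the notebook the first argument would be
# os.environ.get("GOOGLE_CLOUD_PROJECT").
print(export_table_path("my-project", "dplex_at_scale", "profiling"))
```

With this helper, a missing environment variable surfaces as a `ValueError` in the setup cell rather than as a failed scan creation later on.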

In [50]:
# Create dataset for export
SQL = f"""
    CREATE SCHEMA IF NOT EXISTS  {DPLX_EXPORT_DATASET_ID};
"""

In [51]:
%%bigquery
$SQL

## 3. Dataplex deployment

### 3.1 Utility functions

#### 3.1.1 Create all dataplex profiling and quality jobs - main

In [None]:
def create_dataplex_table_profiling_jobs():
    """
    Creates and runs a Dataplex profiling scan for every table in the configured BigQuery projects,
    then labels each table so that its profile results are published.

    Takes no arguments. Reads the settings defined in section 2: GOOGLE_CLOUD_PROJECT_ID_BQ_SOURCE
    (source projects), GOOGLE_CLOUD_DATASET_TEST (optional dataset filter),
    GOOGLE_CLOUD_PROJECT_DATAPLEX and DATAPLEX_LOCATION.
    """
    for bigquery_project_id in GOOGLE_CLOUD_PROJECT_ID_BQ_SOURCE:
        print(f"\nProcessing BigQuery project: {bigquery_project_id}")
        
        dplx_parent = dataplex_client.common_location_path(
            GOOGLE_CLOUD_PROJECT_DATAPLEX, DATAPLEX_LOCATION
        )

        try:
            # Initialize BigQuery client for the current project
            bigquery_client = bigquery.Client(project=bigquery_project_id)
            datasets = list(bigquery_client.list_datasets())

            if not datasets:
                print(f"No datasets found in the BigQuery project '{bigquery_project_id}'.")
                continue

            for dataset in datasets:
                dataset_id = dataset.dataset_id
                time.sleep(0.5)

                # An empty GOOGLE_CLOUD_DATASET_TEST means every dataset is processed
                if GOOGLE_CLOUD_DATASET_TEST and dataset_id not in GOOGLE_CLOUD_DATASET_TEST:
                    print(f"\nSkipping dataset: {dataset_id}")
                    continue
                
                print(f"\nProcessing dataset: {dataset_id}")
                tables = list(bigquery_client.list_tables(dataset))

                if not tables:
                    print(f"No tables found in dataset: {dataset.dataset_id}")
                    continue

                for table in tables:
                    table_id = table.table_id
                    data_scan_name = f"profile-{bigquery_project_id}-{dataset.dataset_id}-{table_id}".replace("_", "-")

                    # Create profiling job
                    create_profiling_job(dplx_parent, data_scan_name, bigquery_project_id, dataset_id, table_id)

                    # Run Profiling Job
                    run_dplex_job("profiling", data_scan_name)

                    # Label the source table so the profile results are published for it
                    apply_profiling_export_to_bq(data_scan_name, f"{bigquery_project_id}.{dataset_id}.{table_id}")

                    ## Create Quality  Job

                    ## Update publishing DQ to BQ
                    # apply_quality_export_to_bq()

                    ## Run Quality  Job
                    #run_dplex_job("quality", data_q_scan_name)

        except Exception as e:
            print(f"An error occurred while processing project '{bigquery_project_id}': {e}")
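
The scan name above is built by concatenating the project, dataset and table IDs and replacing underscores. As far as the Dataplex documentation describes, data scan IDs are limited to lowercase letters, digits and hyphens, and to 63 characters, so long or mixed-case table names could be rejected at creation time. The following is a minimal sketch of a more defensive builder. `make_scan_id` and `MAX_SCAN_ID_LENGTH` are hypothetical names, and the hash-suffix truncation is an assumption of this sketch, not something the notebook does.

```python
import hashlib
import re

MAX_SCAN_ID_LENGTH = 63  # Assumed limit, taken from the Dataplex scan ID rules


def make_scan_id(prefix, project_id, dataset_id, table_id):
    """Build a lowercase, hyphen-separated scan ID of at most 63 characters."""
    raw = f"{prefix}-{project_id}-{dataset_id}-{table_id}".lower()
    # Replace every character that is not a lowercase letter, digit or hyphen
    scan_id = re.sub(r"[^a-z0-9-]", "-", raw).strip("-")
    if len(scan_id) > MAX_SCAN_ID_LENGTH:
        # Keep long IDs unique by appending a short hash of the full name
        digest = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:8]
        scan_id = f"{scan_id[:MAX_SCAN_ID_LENGTH - 9].rstrip('-')}-{digest}"
    return scan_id


print(make_scan_id("profile", "bq-sample-project", "bank", "accounts"))
```

For short names the result matches the notebook's own naming. If a helper like this were adopted, the scheduling function in 3.1.7 would need to use it too, because it rebuilds the scan name with the same formula.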

#### 3.1.2 Run data scan job

In [29]:
def run_dplex_job(job_type, data_scan_name):
    try:
        run_response = dataplex_client.run_data_scan(
            name = f"projects/{GOOGLE_CLOUD_PROJECT_DATAPLEX}/locations/{DATAPLEX_LOCATION}/dataScans/{data_scan_name}"
        )
        print(f"    - Successfully started {job_type} job")

    except Exception as e:
        print(f"    - Failed to run {job_type} job. Error: {e}")

#### 3.1.3 Create profiling job

In [75]:
def create_profiling_job(dplx_parent, data_scan_name, bigquery_project_id, dataset_id, table_id):
    bq_uri = f"//bigquery.googleapis.com/projects/{bigquery_project_id}/datasets/{dataset_id}/tables/{table_id}"
    
    # Define the Data Profile Scan configuration
    data_scan = dataplex_v1.DataScan(
        display_name = data_scan_name,
        data={
            "resource": bq_uri
        },
        data_profile_spec={
            "sampling_percent": sampling_percent,
            "post_scan_actions": {
                "bigquery_export": {
                    "results_table": BQ_DPLEX_EXPORT_TABLE
                }
            }
        },
        execution_spec={
            "trigger": {"on_demand": {}}, 
        },
        labels = {
            "source_project": bigquery_project_id,
            "source_dataset": dataset_id,
            "created_by": "dplex_at_scale",
        }
    )

    # Create Profiling Job
    try:
        request = dataplex_client.create_data_scan(
            parent=dplx_parent,
            data_scan_id=data_scan_name,
            data_scan=data_scan,
        )
        request.result()
        
        print(f"  ** Created Dataplex profiling job for table: {table_id}")
    except google.api_core.exceptions.AlreadyExists:
        print(f"    - Job '{data_scan_name}' already exists. Skipping creation.")
    except Exception as e:
        print(f"    - Failed to create job for table {table_id}. Error: {e}")

#### 3.1.4 Apply profiling export to job

In [None]:
def apply_profiling_export_to_bq(SCAN_ID, table_id):
    # table_id is the fully qualified table reference: "project.dataset.table"
    try:
        # Labels that point the table at the Dataplex scan publishing its results
        labels_to_set = {
            "dataplex-dp-published-scan": SCAN_ID.lower(),
            "dataplex-data-documentation-published-location": DATAPLEX_LOCATION,
            "dataplex-dp-published-project": GOOGLE_CLOUD_PROJECT_DATAPLEX.lower(),
            "dataplex-dp-published-location": DATAPLEX_LOCATION,
            "dataplex-data-documentation-published-project": GOOGLE_CLOUD_PROJECT_DATAPLEX.lower(),
            "dataplex-data-documentation-published-scan": SCAN_ID.lower(),
        }
        table = bq_client.get_table(table_id)
        if table.labels:
            updated_labels = table.labels.copy()
            updated_labels.update(labels_to_set)
        else:
            updated_labels = labels_to_set

        table.labels = updated_labels
        bq_client.update_table(table, ["labels"])
        print(f"    - Successfully updated profile export config for {table_id}")
    except Exception as e:
        print(f"    - Failed to update publishing job for table {table_id}. Error: {e}")
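
The label handling above (copy the existing labels, then overlay the new ones) is repeated in the quality export function. It could be factored into a small pure function, sketched below. `merge_labels` is a hypothetical helper, not part of the notebook, and lowercasing every new value is an assumption based on BigQuery accepting only lowercase label values.

```python
def merge_labels(existing, new):
    """Return a new dict with `new` labels layered over `existing` ones."""
    merged = dict(existing or {})  # Copy, so the caller's dict is not mutated
    # Lowercase the new values; existing labels are left untouched
    merged.update({key: str(value).lower() for key, value in new.items()})
    return merged


current = {"team": "data"}
print(merge_labels(current, {"dataplex-dp-published-scan": "Profile-Bank-Accounts"}))
```

In the notebook this would replace the `if table.labels: ... else: ...` branch with `table.labels = merge_labels(table.labels, labels_to_set)`.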

#### 3.1.5 Create quality job

In [None]:
# 
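
The cell above is left empty in this notebook. As a starting point only, the sketch below builds a quality scan configuration as a plain dictionary that mirrors the structure of the profiling scan in 3.1.3. Everything here is an assumption and has not been run against a live project: the function name `build_quality_scan_config`, the choice of a single non-null rule per column, and the 1.0 threshold are illustrative. The field names follow the `DataQualitySpec` and `DataQualityRule` messages of `google-cloud-dataplex`.

```python
def build_quality_scan_config(bq_uri, results_table, non_null_columns, sampling_percent=100.0):
    """Build keyword arguments for a DataScan with simple completeness rules."""
    rules = [
        {
            "column": column,
            "dimension": "COMPLETENESS",
            "non_null_expectation": {},  # Rule type: the column must not be null
            "threshold": 1.0,  # Illustrative: every row must pass
        }
        for column in non_null_columns
    ]
    return {
        "data": {"resource": bq_uri},
        "data_quality_spec": {
            "sampling_percent": sampling_percent,
            "rules": rules,
            "post_scan_actions": {
                "bigquery_export": {"results_table": results_table}
            },
        },
        "execution_spec": {"trigger": {"on_demand": {}}},
    }


config = build_quality_scan_config(
    "//bigquery.googleapis.com/projects/p/datasets/d/tables/t",
    "projects/p/datasets/dplex_at_scale/tables/quality",
    ["id"],
)
print(config["data_quality_spec"]["rules"])
```

The resulting dictionary could then be passed as `dataplex_v1.DataScan(**config)` and created with `create_data_scan`, in the same way as the profiling job. Real rules would need to be chosen per table.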

#### 3.1.6 Apply quality export to job

In [None]:
def apply_quality_export_to_bq(SCAN_ID, table_id):
    try:
        labels_to_set = {
            "dataplex-dq-published-scan": SCAN_ID.lower(),
            "dataplex-dq-published-project": GOOGLE_CLOUD_PROJECT_DATAPLEX.lower(),
            "dataplex-dq-published-location": DATAPLEX_LOCATION
        }
        table = bq_client.get_table(table_id)
        if table.labels:
            updated_labels = table.labels.copy()
            updated_labels.update(labels_to_set)
        else:
            updated_labels = labels_to_set

        table.labels = updated_labels
        bq_client.update_table(table, ["labels"])

        print(f"    - Successfully updated quality export config for {table_id}")
    except Exception as e:
        print(f"    - Failed to update publishing job for table {table_id}. Error: {e}")

#### 3.1.7 Schedule or disable the schedule for both jobs: quality and profiling

In [None]:
def update_schedule_table_profiling_jobs(schedule):
    """
    Switches the profiling scan of every table in the configured BigQuery projects
    between scheduled and on-demand execution.

    Args:
        schedule: If True, run the scans on DPLEX_CRON_SCHEDULE; if False, set them back to on-demand.
    """
    for bigquery_project_id in GOOGLE_CLOUD_PROJECT_ID_BQ_SOURCE:
        print(f"\nProcessing BigQuery project: {bigquery_project_id}")
        
        dplx_parent = dataplex_client.common_location_path(
            GOOGLE_CLOUD_PROJECT_DATAPLEX, DATAPLEX_LOCATION
        )
        
        if schedule:
            trigger = { 
                "schedule": { 
                    "cron": DPLEX_CRON_SCHEDULE
                }
            }
            trigger_description = f"scheduled at {DPLEX_CRON_SCHEDULE}"
        else:
            trigger = {"on_demand": {}}
            trigger_description = "disabled scheduling"

        try:
            # Initialize BigQuery client for the current project
            bigquery_client = bigquery.Client(project=bigquery_project_id)
            datasets = list(bigquery_client.list_datasets())

            if not datasets:
                print(f"No datasets found in the BigQuery project '{bigquery_project_id}'.")
                continue

            for dataset in datasets:
                dataset_id = dataset.dataset_id
                time.sleep(0.5)

                # An empty GOOGLE_CLOUD_DATASET_TEST means every dataset is processed
                if GOOGLE_CLOUD_DATASET_TEST and dataset_id not in GOOGLE_CLOUD_DATASET_TEST:
                    print(f"\nSkipping dataset: {dataset_id}")
                    continue
                    
                print(f"\nProcessing dataset: {dataset_id}")
                tables = list(bigquery_client.list_tables(dataset))

                if not tables:
                    print(f"No tables found in dataset: {dataset.dataset_id}")
                    continue

                for table in tables:
                    table_id = table.table_id
                    data_scan_name = f"{bigquery_project_id}-{dataset.dataset_id}-{table_id}".replace("_", "-")
                    scan_name = f"projects/{GOOGLE_CLOUD_PROJECT_DATAPLEX}/locations/{DATAPLEX_LOCATION}/dataScans"

                    # Update profiling job
                    dataplex_client.update_data_scan(
                        data_scan = dataplex_v1.DataScan(
                            name=f"{scan_name}/profile-{data_scan_name}",
                            execution_spec={
                                "trigger": trigger
                            },
                            data_profile_spec = {}
                        ),
                        update_mask = field_mask_pb2.FieldMask(
                            paths=["execution_spec.trigger"]
                        )
                    )
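                    print(f"Updated schedule of profiling job {scan_name}/profile-{data_scan_name}: {trigger_description}")

                    # Quality jobs are not created yet (see 3.1.5). Once they are, and assuming
                    # they use the scan ID "quality-<project>-<dataset>-<table>", update them the same way:
                    # dataplex_client.update_data_scan(
                    #     data_scan = dataplex_v1.DataScan(
                    #         name=f"{scan_name}/quality-{data_scan_name}",
                    #         execution_spec={
                    #             "trigger": trigger
                    #         },
                    #         data_quality_spec = {}
                    #     ),
                    #     update_mask = field_mask_pb2.FieldMask(
                    #         paths=["execution_spec.trigger"]
                    #     )
                    # )
                    # print(f"Updated schedule of quality job {data_scan_name}")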

        except Exception as e:
            print(f"An error occurred while processing project '{bigquery_project_id}': {e}")
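
The trigger selection at the top of the function above can be isolated into a small function that is easy to check without calling Dataplex. The sketch below returns the same two dictionaries the notebook uses. `build_trigger` is a hypothetical name, and the five-field cron check (with an optional `CRON_TZ=` or `TZ=` prefix) is an added assumption, not something the notebook validates.

```python
def build_trigger(schedule, cron):
    """Return (trigger, description) for scheduled or on-demand execution."""
    if not schedule:
        return {"on_demand": {}}, "disabled scheduling"

    fields = cron.split()
    # Allow an optional time zone prefix such as "CRON_TZ=America/New_York"
    if fields and fields[0].startswith(("CRON_TZ=", "TZ=")):
        fields = fields[1:]
    if len(fields) != 5:
        raise ValueError(f"Expected a 5-field cron expression, got: {cron!r}")
    return {"schedule": {"cron": cron}}, f"scheduled at {cron}"


print(build_trigger(True, "0 2 * * *"))
print(build_trigger(False, "0 2 * * *"))
```

Validating the cron string up front means a malformed schedule fails once, before the loop starts updating scans.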

### 3.2 Run the job creation

In [76]:
create_dataplex_table_profiling_jobs( )


Processing BigQuery project: bq-sample-project

Processing dataset: bank
  ** Created Dataplex profiling job for table: accounts
    - Successfully started profiling job
    - Successfully updated profile export config for bq-sample-project.bank.accounts
  ** Created Dataplex profiling job for table: customers
    - Successfully started profiling job
    - Successfully updated profile export config for bq-sample-project.bank.customers

Processing dataset: retail
  ** Created Dataplex profiling job for table: products
    - Successfully started profiling job
    - Successfully updated profile export config for bq-sample-project.retail.products
  ** Created Dataplex profiling job for table: sales
    - Successfully started profiling job
    - Successfully updated profile export config for bq-sample-project.retail.sales


## 4. Schedule

Wait for the job runs to complete before changing the schedule.
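
The notebook does not include code for this wait. One way to do it is a simple polling loop, sketched below under stated assumptions: `wait_for_terminal_state` is a hypothetical helper, and `get_state` is any callable that returns the current job state as a string. The terminal state names follow the `DataScanJob.State` enum.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_terminal_state(get_state, poll_seconds=10.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll get_state() until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state {state} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Demonstration with a fake job that finishes on the third poll
fake_states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_terminal_state(lambda: next(fake_states), poll_seconds=0))
```

Against Dataplex, `get_state` could wrap something like `dataplex_client.get_data_scan_job(name=job_name).state.name`, where `job_name` comes from the `job` field of the `run_data_scan` response. That would require `run_dplex_job` to return its response, which it does not do today.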

In [67]:
# Set to True to run the jobs on DPLEX_CRON_SCHEDULE, or False to revert to on-demand
schedule = True

In [70]:
update_schedule_table_profiling_jobs(schedule = schedule)


Processing BigQuery project: bq-sample-project

Processing dataset: bank
Update schedule of profiling job projects/bq-dataplex_project/locations/us-central1/dataScans. disabled scheduling
Update schedule of profiling job projects/bq-dataplex_project/locations/us-central1/dataScans. disabled scheduling

Processing dataset: retail
Update schedule of profiling job projects/bq-dataplex_project/locations/us-central1/dataScans. disabled scheduling
Update schedule of profiling job projects/bq-dataplex_project/locations/us-central1/dataScans. disabled scheduling
