# Exporting Dataplex Metadata

You can run a **metadata export job** to get a snapshot of your Dataplex Universal Catalog metadata (which consists of entries and aspects) for use in external systems.

### Defining the Export Scope

Every export job requires a **job scope** to define exactly what metadata to export. You must choose one of the following primary scopes:

- `Organization`: Export all metadata belonging to your organization.
- `Projects`: Export metadata from one or more specified projects.
- `Entry groups`: Export metadata from one or more specified entry groups.

You can further refine the scope by specifying the entry types or aspect types to include, ensuring the job only exports the specific entries and aspects you need.
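As a minimal sketch of the refinement described above, the helper below builds an export request body scoped to projects and optionally narrowed by entry type or aspect type. The helper name `build_export_request`, the bucket, and the project ID are hypothetical. The `entryTypes` and `aspectTypes` field names and the `bigquery-table` resource name reflect my understanding of the Dataplex API and should be checked against the current API reference before use.

```python
from typing import Any, Dict, List, Optional


def build_export_request(
    output_path: str,
    projects: List[str],
    entry_types: Optional[List[str]] = None,
    aspect_types: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build an EXPORT job body scoped to projects, optionally narrowed by type."""
    scope: Dict[str, Any] = {"projects": [f"projects/{p}" for p in projects]}
    # Filters are added only when supplied; without them everything in scope is exported.
    if entry_types:
        scope["entryTypes"] = list(entry_types)      # assumed field name
    if aspect_types:
        scope["aspectTypes"] = list(aspect_types)    # assumed field name
    return {
        "type": "EXPORT",
        "export_spec": {"output_path": output_path, "scope": scope},
    }


example_body = build_export_request(
    output_path="gs://my-export-bucket/",  # hypothetical bucket
    projects=["my-project"],               # hypothetical project ID
    entry_types=[
        # Assumed resource name for the BigQuery table entry type
        "projects/dataplex-types/locations/global/entryTypes/bigquery-table"
    ],
)
print(example_body["export_spec"]["scope"])
```

The resulting dictionary has the same shape as the `request_body` examples later in this notebook, so it could be passed to the same API call.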

In [None]:
from google.cloud import storage
from google.cloud import bigquery
import os
import json


In [None]:
# Configuration
PROJECT_ID = "bq-sme-governance-build"
LOCATION = "us-central1"
EXPORT_BUCKET_NAME = f"{PROJECT_ID}-lab-data-export"
EXISTING_EXPORT_BUCKET = "gs://"  # Set to the gs:// path of an existing export

In [None]:
import google.auth
from google.auth.transport.requests import AuthorizedSession
from requests import HTTPError
from typing import Any, Optional, Dict

def call_google_api(
    url: str,
    http_verb: str,
    request_body: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    creds, project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    authed_session = AuthorizedSession(creds)
    try:
        response = authed_session.request(
            method=http_verb,
            url=url,
            json=request_body  # requests handles None for json param gracefully
        )

        response.raise_for_status()

        if response.status_code == 204:
            return {}

        return response.json()

    except HTTPError as e:
        # Provide more structured error information
        error_message = f"API call failed with status {e.response.status_code}: {e.response.text}"
        print(error_message) # Or use logging
        raise RuntimeError(error_message) from e

In [None]:
# Create the export bucket if it does not already exist
def create_storage_bucket():
    storage_client = storage.Client(project=PROJECT_ID)
    bucket = storage_client.bucket(EXPORT_BUCKET_NAME)

    if not bucket.exists():
        try:
            bucket = storage_client.create_bucket(EXPORT_BUCKET_NAME)
            print(f"Bucket {bucket.name} created.")
        except Exception as e:
            print(f"Error creating bucket: {e}")
    else:
        print(f"Bucket {EXPORT_BUCKET_NAME} already exists.")

create_storage_bucket()

In [None]:
##### This example exports the metadata for a specific entry group or groups ######
# request_body = {
#   "type": "EXPORT",
#   "export_spec": {
#     "output_path": f"gs://{EXPORT_BUCKET_NAME}/",
#     "scope": {
#       "entryGroups": [
#         # Entry groups are given as full resource names
#         f"projects/{PROJECT_ID}/locations/{LOCATION}/entryGroups/@bigquery",
#         # Additional entry groups
#       ],
#     },
#   }
# }

##### This example exports the metadata for a project or projects ######
# request_body = {
#   "type": "EXPORT",
#   "export_spec": {
#     "output_path": f"gs://{EXPORT_BUCKET_NAME}/",
#     "scope": {
#       "projects": [
#         f"projects/{PROJECT_ID}"
#       ]
#     }
#   }
# }

##### This example exports the metadata for your organization ######
request_body = {
  "type": "EXPORT",
  "export_spec": {
    "output_path": f"gs://{EXPORT_BUCKET_NAME}/",
    "scope": {
      # Boolean flag; serialized as JSON true
      "organizationLevel": True,
    },
  }
}

In [None]:
url = f"https://dataplex.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/metadataJobs"
response = call_google_api(url, "POST", request_body)
metadata_job_target = response['metadata']['target']
pretty_json = json.dumps(response, indent=4, sort_keys=True)
print(pretty_json)

The metadata export takes approximately 20-25 minutes to complete. Rerun the following cell to monitor its progress.

Feel free to move on to the next section of the notebook. Because of time constraints, a completed export is provided for the next section of the lab.

In [None]:
status_url = f"https://dataplex.googleapis.com/v1/{metadata_job_target}"
response = call_google_api(status_url, "GET")
pretty_json = json.dumps(response, indent=4, sort_keys=True)
print(pretty_json)
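Instead of rerunning the status cell by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach, not part of the original lab. The function name `wait_for_metadata_job` is hypothetical, and it assumes the job resource reports its state under `status.state` with the terminal values listed in `TERMINAL_STATES`. Verify both against an actual response before relying on it. The fetch function is passed in, so the loop can be exercised with fake responses.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states of a metadata job
TERMINAL_STATES = {"SUCCEEDED", "SUCCEEDED_WITH_ERRORS", "FAILED", "CANCELED"}


def wait_for_metadata_job(
    fetch_job: Callable[[], Dict[str, Any]],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_job repeatedly until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_job()
        state = job.get("status", {}).get("state", "STATE_UNSPECIFIED")
        if state in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in state {state} after {timeout_seconds}s")
        sleep(poll_seconds)


# Demonstration with fake responses; no API call is made here.
_fake_responses = iter([
    {"status": {"state": "RUNNING"}},
    {"status": {"state": "SUCCEEDED"}},
])
final_job = wait_for_metadata_job(
    lambda: next(_fake_responses), poll_seconds=0, sleep=lambda _: None
)
print(final_job["status"]["state"])
```

In this notebook, the fetch function could be `lambda: call_google_api(status_url, "GET")`, reusing the helper and URL defined in the cells above.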

## Analyzing Dataplex Metadata in BigQuery

We've just exported our Dataplex metadata to GCS. When you want to analyze this metadata in BigQuery, you can create an external table. This lets you query the data directly from its exported location without needing to load or transform it first.

### Why a Business Would Import Dataplex Metadata into BigQuery

There are several key reasons why a business would want to bring its Dataplex metadata into BigQuery for analysis:

* **Advanced Querying and Analysis**: By having the metadata in BigQuery, you can run SQL queries to gain deeper insights.
    * *Example*: Count the number of entries by entry group, or find all entries that have a specific aspect (like data quality scores).
    ```sql
    -- Example: Count entries per entry group
    SELECT
      entry_group,
      COUNT(entry_id) AS number_of_entries
    FROM
      `your_project.your_dataset.dataplex_metadata_external_table`
    GROUP BY
      entry_group
    ORDER BY
      number_of_entries DESC;
    ```

* **Integration with Analytics Tools**: Importing the metadata to BigQuery allows you to analyze your metadata alongside other business data, or visualize it in tools like Looker Studio.

* **Programmatic Processing**: For businesses that need to process large volumes of metadata, exporting it allows for programmatic manipulation using SQL. This processed metadata can then be imported back into Dataplex via API if needed.

* **Custom Applications and Third-Party Tools**: You can integrate your metadata into custom-built applications (like a data governance dashboard) or other third-party tools that connect with BigQuery, extending the functionality and use of your metadata.
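For programmatic processing outside BigQuery, the exported files can also be read directly. The sketch below assumes each exported line is a JSON object with a top-level `entry` whose `aspects` field maps aspect keys to objects carrying an `aspectType`, which matches the table schema and the query used later in this notebook. The function name `count_aspect_types` and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_aspect_types(lines: Iterable[str]) -> Counter:
    """Count how often each aspect type appears across exported entries."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line).get("entry", {})
        aspects = entry.get("aspects") or {}
        # Every aspect attached to the entry is counted, not only the first one.
        for aspect in aspects.values():
            aspect_type = aspect.get("aspectType")
            if aspect_type:
                counts[aspect_type] += 1
    return counts


# Hypothetical sample records in the assumed export shape
sample_lines = [
    '{"entry": {"name": "e1", "aspects": {"k1": {"aspectType": "type-a"}, "k2": {"aspectType": "type-b"}}}}',
    '{"entry": {"name": "e2", "aspects": {"k1": {"aspectType": "type-a"}}}}',
]
print(count_aspect_types(sample_lines))
```

The same function would work on an open file handle for a downloaded export file, since a file object iterates line by line.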

In [None]:
import os
from google.cloud import bigquery
from google.cloud.exceptions import NotFound
from google.api_core.exceptions import Conflict

def create_hive_partitioned_external_table(project_id: str, export_bucket_name: str) -> None:
    """
    Creates a Hive-partitioned external table in BigQuery.

    Checks if the dataset exists and creates it if necessary before attempting
    to create the table. The table's data is stored in newline-delimited JSON
    format in a Google Cloud Storage bucket with a Hive-style directory structure.

    Args:
        project_id (str): Your Google Cloud project ID.
        export_bucket_name (str): The GCS bucket name containing the source data.
    """
    # Set these variables
    dataset_id = "dataplex_metadata"
    table_id = "metadata_export"
    location = "US"


    client = bigquery.Client(project=project_id)
    # Client.dataset() is deprecated; build the reference explicitly
    dataset_ref = bigquery.DatasetReference(project_id, dataset_id)
    table_ref = dataset_ref.table(table_id)

    # Check for and create the dataset if it doesn't exist
    try:
        client.get_dataset(dataset_ref)
        print(f"Dataset '{dataset_id}' already exists.")
    except NotFound:
        print(f"Dataset '{dataset_id}' not found. Creating it in location '{location}'.")
        try:
            dataset = bigquery.Dataset(dataset_ref)
            dataset.location = location
            client.create_dataset(dataset, timeout=30)
            print(f"Successfully created dataset '{dataset_id}'.")
        except Exception as e:
            print(f"Failed to create dataset '{dataset_id}': {e}")
            return

    # Table schema
    schema = [
        bigquery.SchemaField(
            "entry", "RECORD", "NULLABLE",
            fields=[
                bigquery.SchemaField("name", "STRING", "NULLABLE"),
                bigquery.SchemaField("entryType", "STRING", "NULLABLE"),
                bigquery.SchemaField("createTime", "STRING", "NULLABLE"),
                bigquery.SchemaField("updateTime", "STRING", "NULLABLE"),
                bigquery.SchemaField("aspects", "JSON", "NULLABLE"),
                bigquery.SchemaField("parentEntry", "STRING", "NULLABLE"),
                bigquery.SchemaField("fullyQualifiedName", "STRING", "NULLABLE"),
                bigquery.SchemaField(
                    "entrySource", "RECORD", "NULLABLE",
                    fields=[
                        bigquery.SchemaField("resource", "STRING", "NULLABLE"),
                        bigquery.SchemaField("system", "STRING", "NULLABLE"),
                        bigquery.SchemaField("platform", "STRING", "NULLABLE"),
                        bigquery.SchemaField("displayName", "STRING", "NULLABLE"),
                        bigquery.SchemaField("description", "STRING", "NULLABLE"),
                        bigquery.SchemaField("labels", "JSON", "NULLABLE"),
                        bigquery.SchemaField(
                            "ancestors", "RECORD", "REPEATED",
                            fields=[
                                bigquery.SchemaField("name", "STRING", "NULLABLE"),
                                bigquery.SchemaField("type", "STRING", "NULLABLE"),
                            ],
                        ),
                        bigquery.SchemaField("createTime", "STRING", "NULLABLE"),
                        bigquery.SchemaField("updateTime", "STRING", "NULLABLE"),
                        bigquery.SchemaField("location", "STRING", "NULLABLE"),
                    ],
                ),
            ],
        )
    ]

    external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
    gcs_uri = f"gs://{export_bucket_name}/*"
    external_config.source_uris = [gcs_uri]

    hive_partitioning_options = bigquery.HivePartitioningOptions()
    hive_partitioning_options.mode = "AUTO"
    hive_partitioning_options.source_uri_prefix = f"gs://{export_bucket_name}/"
    external_config.hive_partitioning = hive_partitioning_options

    table = bigquery.Table(table_ref, schema=schema)
    table.external_data_configuration = external_config

    try:
        created_table = client.create_table(table)
        print(
            f"Successfully created external table: {created_table.project}.{created_table.dataset_id}.{created_table.table_id}"
        )
    except Conflict:
        print(f"Table '{table_id}' already exists.")
    except Exception as e:
        print(f"An unexpected error occurred while creating the table: {e}")



create_hive_partitioned_external_table(PROJECT_ID, EXPORT_BUCKET_NAME)
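Hive partitioning means that BigQuery derives extra columns from `key=value` segments in each object's path, which is where columns such as `year` and `project` in the queries below come from. The sketch below illustrates the idea in plain Python. The function name `parse_hive_partitions` and the example path are hypothetical, and the actual directory layout of your export may contain different or additional keys.

```python
from typing import Dict


def parse_hive_partitions(object_path: str) -> Dict[str, str]:
    """Extract key=value path segments, the way Hive partitioning names columns."""
    partitions: Dict[str, str] = {}
    for segment in object_path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:  # ignore segments without "=" (for example, the file name)
            partitions[key] = value
    return partitions


# Hypothetical object path under the export bucket
example_path = (
    "year=2025/month=01/day=15/project=my-project/entry_group=@bigquery/part-0.jsonl"
)
print(parse_hive_partitions(example_path))
```

This parser keeps every value as a string. With `mode = "AUTO"`, BigQuery additionally infers a type for each partition key, which is why the query below can compare `year` with a number.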

In [None]:
%%bigquery
-- List the top 10 projects with the highest number of unique entry source resources.
SELECT
  PROJECT,
  COUNT(DISTINCT entry.entrySource.resource) AS unique_resources
FROM
  `dataplex_metadata.metadata_export`
WHERE
  year = 2025
GROUP BY
  PROJECT
ORDER BY
  unique_resources DESC
LIMIT
  10;

In [None]:
%%bigquery
-- Show the unique aspect types and counts (only the first aspect key of each entry is inspected)
CREATE TEMP FUNCTION extract_aspect_info(json_str STRING)
RETURNS STRUCT<aspectType STRING, creator STRING, assetType STRING, createTime TIMESTAMP, updateTime TIMESTAMP>
LANGUAGE js AS """
  try {
    const obj = JSON.parse(json_str);
    const dynamicKey = Object.keys(obj)[0];
    if (dynamicKey) {
      const aspectData = obj[dynamicKey];
      return {
        aspectType: aspectData.aspectType,
        creator: aspectData.data.creatorIamPrincipal,
        assetType: aspectData.data.type,
        createTime: new Date(aspectData.createTime),
        updateTime: new Date(aspectData.updateTime)
      };
    }
  } catch (e) {
    return null;
  }
  return null;
""";

SELECT
  (extract_aspect_info(TO_JSON_STRING(entry.aspects))).aspectType AS aspect_type,
  COUNT(1) AS count
FROM
  `bq-sme-governance-build.dataplex_metadata.metadata_export`
WHERE (extract_aspect_info(TO_JSON_STRING(entry.aspects))).aspectType IS NOT NULL
GROUP BY
  aspect_type
ORDER BY
  count DESC;