# BigQuery Data Pipeline: Bronze → Silver → Gold with Data Quality Issues

This notebook demonstrates a complete data pipeline with intentional data quality issues that flow through all layers for detection by Dataplex Data Quality.

## Pipeline Stages

**1. Bronze Layer**: Raw data with injected quality issues
   - Native BigQuery table (customers)
   - BigLake external table with CSV (products)
   - BigLake Iceberg table (orders)
   - Issues injected: duplicates, nulls, inconsistent casing, future dates

**2. Silver Layer**: Structured data with issues preserved
   - Parse JSON fields
   - Convert data types
   - Does NOT clean quality issues - they flow forward

**3. Gold Layer**: Dimensional model with issues still present
   - Fact table: fct_sales
   - Dimensions: dim_customers, dim_products, dim_date
   - Wide view: vw_sales_wide (One Big Table pattern)
   - All quality issues preserved for Dataplex detection

## Data Quality Issues Included

This pipeline intentionally includes:
1. Duplicate records - customer_id = 20 appears twice (injected only when `NUM_CUSTOMERS` is greater than 20)
2. NULL values - ~5% of customers have NULL emails
3. Inconsistent naming - Categories with mixed case ('Electronics' vs 'electronics')
4. Invalid dates - ~5% of orders have future timestamps
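
As a rough illustration of what each of the four issues looks like in the data, the sketch below checks a few hand-written records using only the standard library. The sample rows and the `find_issues` helper are hypothetical and are not part of the notebook, which relies on Dataplex for detection.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_issues(customers, products, orders, now):
    """Return one entry per injected issue type (hypothetical helper)."""
    id_counts = Counter(c["customer_id"] for c in customers)

    # Group the category spellings that differ only by letter case.
    spellings = {}
    for p in products:
        spellings.setdefault(p["category"].lower(), set()).add(p["category"])

    return {
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "null_emails": sum(1 for c in customers if c["email"] is None),
        "mixed_case_categories": sorted(k for k, v in spellings.items() if len(v) > 1),
        "future_orders": sorted(o["order_id"] for o in orders if o["order_timestamp"] > now),
    }


# Illustrative data only; a fixed "now" keeps the example deterministic.
now = datetime(2025, 1, 15)
customers = [
    {"customer_id": 19, "email": "a@example.com"},
    {"customer_id": 20, "email": None},
    {"customer_id": 20, "email": "b@example.com"},
]
products = [
    {"product_id": 1, "category": "Electronics"},
    {"product_id": 2, "category": "electronics"},
    {"product_id": 3, "category": "Books"},
]
orders = [
    {"order_id": 1001, "order_timestamp": now - timedelta(days=3)},
    {"order_id": 1002, "order_timestamp": now + timedelta(days=10)},
]

issues = find_issues(customers, products, orders, now)
print(issues)
```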

## Workflow

1. Run this notebook to create Bronze → Silver → Gold with quality issues
2. Use Dataplex Data Quality to detect issues in Gold layer tables
3. Trace issues back from Gold → Silver → Bronze
4. Implement fixes in Silver layer based on Dataplex findings
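
Step 4 is left to the reader. A minimal sketch of what such fixes might look like is shown here in plain Python rather than SQL. The function names are hypothetical, and the policy choices (keep the first row per `customer_id`, title-case the category, drop future-dated orders) are assumptions. The right policy depends on business rules that the notebook does not specify.

```python
from datetime import datetime


def deduplicate_customers(rows):
    """Keep the first row seen for each customer_id (assumed policy)."""
    seen, kept = set(), []
    for row in rows:
        if row["customer_id"] not in seen:
            seen.add(row["customer_id"])
            kept.append(row)
    return kept


def normalize_category(category):
    """Collapse casing variants such as 'home goods' into 'Home Goods'."""
    return category.strip().title()


def drop_future_orders(rows, now):
    """Remove orders whose timestamp lies after 'now'."""
    return [r for r in rows if r["order_timestamp"] <= now]


# Illustrative data only.
rows = [
    {"customer_id": 20, "name": "first"},
    {"customer_id": 20, "name": "second"},
    {"customer_id": 21, "name": "other"},
]
deduped = deduplicate_customers(rows)

now = datetime(2025, 1, 15)
orders = [
    {"order_id": 1001, "order_timestamp": datetime(2025, 1, 10)},
    {"order_id": 1002, "order_timestamp": datetime(2025, 2, 1)},
]
kept_orders = drop_future_orders(orders, now)

print([r["name"] for r in deduped])
print(normalize_category("home goods"))
print([o["order_id"] for o in kept_orders])
```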

## How to Use

1. Run the installation cell to install required packages
2. Configure parameters using the form fields in the configuration cell (in Google Colab, click the cell to reveal the form)
3. Run all subsequent cells in order
4. Observe quality issues in the query results

## Cleanup

At the end of the notebook, you'll find a cleanup cell to delete all created resources.

In [None]:
! pip install --upgrade google-cloud-bigquery google-cloud-storage google-cloud-dataplex faker

## Setup and Configuration

The following cells handle authentication and configuration. If running in Google Colab, uncomment the authentication cell.

In [None]:
#from google.colab import auth
#auth.authenticate_user()

In [None]:
import pandas as pd
from faker import Faker
import random
import json
from datetime import datetime, timedelta
from typing import List, Dict
import os
import subprocess
from google.cloud import bigquery, storage
from google.api_core.exceptions import NotFound, Conflict

# --- Configuration ---
# @title Pipeline Configuration { display-mode: "form" }

# GCP Settings
PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "your-gcp-project-id") #@param {type:"string"}
REGION = "us-central1" #@param ["us-central1", "us-east1", "us-west1", "europe-west1", "asia-southeast1"]
BIGQUERY_CONNECTION_ID = "gcs-biglake-connection" #@param {type:"string"}

# Dataset Names
BRONZE_DATASET = "bronze" #@param {type:"string"}
SILVER_DATASET = "silver" #@param {type:"string"}
GOLD_DATASET = "gold" #@param {type:"string"}

# Data Generation Parameters
NUM_CUSTOMERS = 200 #@param {type:"slider", min:10, max:1000, step:10}
NUM_PRODUCTS = 50 #@param {type:"slider", min:10, max:500, step:10}
NUM_ORDERS = 500 #@param {type:"slider", min:50, max:5000, step:50}
MAX_ITEMS_PER_ORDER = 5 #@param {type:"slider", min:1, max:10, step:1}

# Auto-generated
BIGQUERY_DATASET = BRONZE_DATASET
BUCKET_NAME = f"{PROJECT_ID}-lab-data-source"
CONNECTION_NAME = f"{PROJECT_ID}.{REGION}.{BIGQUERY_CONNECTION_ID}"

# Initialize Faker
fake = Faker()

print("Configuration loaded:")
print(f"   Project: {PROJECT_ID}")
print(f"   Region: {REGION}")
print(f"   Datasets: {BRONZE_DATASET}, {SILVER_DATASET}, {GOLD_DATASET}")
print(f"   Data: {NUM_CUSTOMERS} customers, {NUM_PRODUCTS} products, {NUM_ORDERS} orders")

## Helper, Generation, and Cleanup Functions

Utility functions for infrastructure setup, data generation, and cleanup.

In [None]:
# --- Helper Functions ---

def validate_config():
    """Validates that required configuration is set."""
    if PROJECT_ID == "your-gcp-project-id":
        raise ValueError("Please update the PROJECT_ID variable before running.")

def ensure_dataset(bq_client: bigquery.Client, dataset_name: str) -> str:
    """Creates a BigQuery dataset if it doesn't exist."""
    dataset_id = f"{PROJECT_ID}.{dataset_name}"
    dataset = bigquery.Dataset(dataset_id)
    dataset.location = REGION
    bq_client.create_dataset(dataset, exists_ok=True)
    print(f"Ensured dataset '{dataset_id}' exists in location '{REGION}'.")
    return dataset_id

def setup_gcs_bucket(storage_client: storage.Client) -> storage.Bucket:
    """Checks for a GCS bucket and creates it if it doesn't exist."""
    print(f"--- Checking for GCS Bucket: {BUCKET_NAME} ---")
    try:
        bucket = storage_client.get_bucket(BUCKET_NAME)
        print(f"Bucket '{bucket.name}' already exists.")
    except NotFound:
        print(f"Bucket '{BUCKET_NAME}' not found. Creating it...")
        bucket = storage_client.create_bucket(BUCKET_NAME, location=REGION)
        print(f"Bucket '{bucket.name}' created successfully in region {REGION}.")
    return bucket

def setup_bigquery_connection(bq_client: bigquery.Client, storage_client: storage.Client):
    """Creates a BigQuery external connection and grants GCS permissions."""
    print(f"--- Setting up BigQuery Connection: {CONNECTION_NAME} ---")

    full_connection_id = f"{PROJECT_ID}.{REGION}.{BIGQUERY_CONNECTION_ID}"

    # Check if connection exists
    show_command = ["bq", "show", "--connection", full_connection_id]
    result = subprocess.run(show_command, capture_output=True, text=True)

    if result.returncode != 0:
        print("Connection not found. Creating it...")
        create_command = [
            "bq", "mk", "--connection",
            f"--location={REGION}",
            f"--project_id={PROJECT_ID}",
            "--connection_type=CLOUD_RESOURCE",
            BIGQUERY_CONNECTION_ID
        ]
        subprocess.run(create_command, capture_output=True, text=True, check=True)
        print("Connection created successfully.")
    else:
        print("BigQuery connection already exists.")

    # Get service account and grant permissions
    show_json_command = ["bq", "show", "--connection", "--format=json", full_connection_id]
    result = subprocess.run(show_json_command, capture_output=True, text=True, check=True)
    connection_info = json.loads(result.stdout)
    service_account = connection_info["cloudResource"]["serviceAccountId"]
    print(f"Found service account: {service_account}")

    # Grant IAM permissions
    bucket = storage_client.get_bucket(BUCKET_NAME)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    # NOTE: roles/storage.admin is a broad grant; consider a narrower role outside a lab setting.
    role = "roles/storage.admin"
    member = f"serviceAccount:{service_account}"

    binding_exists = any(
        b["role"] == role and member in b.get("members", set())
        for b in policy.bindings
    )

    if not binding_exists:
        policy.bindings.append({"role": role, "members": {member}})
        bucket.set_iam_policy(policy)
        print(f"Granted '{role}' to service account on bucket '{BUCKET_NAME}'.")
    else:
        print(f"Service account already has '{role}' on bucket '{BUCKET_NAME}'.")


In [None]:
# --- Data Generation Functions ---

def generate_customers(num_customers: int) -> list:
    """Generates fake customer records with intentional data quality issues."""
    customers = []
    customer_tiers = ['Bronze', 'Silver', 'Gold', 'Platinum']
    signup_channels = ['Web', 'Mobile App', 'In-Store', 'Phone', 'Social Media']

    for i in range(1, num_customers + 1):
        tier = random.choices(customer_tiers, weights=[40, 30, 20, 10])[0]  # Bronze is the most likely tier (40%)

        # 5% chance of missing email
        profile_details = {
            "name": fake.name(),
            "email": fake.email() if random.random() > 0.05 else None,
            "address": fake.address().replace('\n', ', '),
            "join_date": fake.date_between(start_date='-2y', end_date='today').isoformat(),
            "customer_tier": tier,
            "signup_channel": random.choice(signup_channels)
        }
        customers.append({
            "customer_id": i,
            "profile": json.dumps(profile_details)
        })

    # Introduce duplicate customer ID
    if num_customers > 20:
        customers.append({
            "customer_id": 20,
            "profile": json.dumps({
                "name": fake.name(),
                "email": fake.email(),
                "address": fake.address().replace('\n', ', '),
                "join_date": fake.date_between(start_date='-1y', end_date='today').isoformat(),
                "customer_tier": random.choice(customer_tiers),
                "signup_channel": random.choice(signup_channels)
            })
        })

    return customers

def generate_products(num_products: int) -> list:
    """Generates fake product records with inconsistent category naming and business attributes."""
    categories = ['Electronics', 'Books', 'Home Goods', 'Apparel', 'Toys', 'home goods', 'electronics']
    brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'Generic']

    products = []
    for i in range(1, num_products + 1):
        category = random.choice(categories)
        unit_price = round(random.uniform(5.5, 299.99), 2)
        # Cost is 40-70% of price (margin varies)
        unit_cost = round(unit_price * random.uniform(0.40, 0.70), 2)

        products.append({
            "product_id": i,
            "product_name": f"Product_{fake.word().capitalize()}_{i}",
            "category": category,
            "brand": random.choice(brands),
            "sku": f"SKU-{i:05d}",
            "unit_price": unit_price,
            "unit_cost": unit_cost
        })
    return products

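def generate_orders(num_orders: int, customer_ids: list, product_ids: list, customers_data: list, products_data: list) -> list:
    """Generates fake order records with tier-based discounts and shipping costs.

    About 5% of orders get a future timestamp (an intentional quality issue).
    `products_data` is currently unused.
    """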
    orders = []

    # Create lookup for customer tiers
    customer_tier_map = {}
    for customer in customers_data:
        profile = json.loads(customer['profile'])
        customer_tier_map[customer['customer_id']] = profile.get('customer_tier', 'Bronze')

    for i in range(1, num_orders + 1):
        customer_id = random.choice(customer_ids)
        order_date = fake.date_time_between(start_date='-1y', end_date='now')

        # 5% chance of future order date
        if random.random() < 0.05:
            order_date = datetime.now() + timedelta(days=random.randint(2, 30))

        # Discount based on customer tier
        tier = customer_tier_map.get(customer_id, 'Bronze')
        if tier == 'Platinum':
            discount_percent = random.uniform(0, 0.15)  # Up to 15% off
        elif tier == 'Gold':
            discount_percent = random.uniform(0, 0.10)  # Up to 10% off
        elif tier == 'Silver':
            discount_percent = random.uniform(0, 0.05)  # Up to 5% off
        else:
            discount_percent = random.uniform(0, 0.02)  # Up to 2% off

        # Shipping cost
        shipping_cost = round(random.uniform(0, 15.99), 2)

        # Create line items
        line_items = []
        products_in_order = random.sample(product_ids, k=random.randint(1, MAX_ITEMS_PER_ORDER))

        for product_id in products_in_order:
            line_items.append({
                "product_id": product_id,
                "quantity": random.randint(1, 5)
            })

        orders.append({
            "order_id": 1000 + i,
            "customer_id": customer_id,
            "order_timestamp": order_date.isoformat(),
            "discount_percent": round(discount_percent, 4),
            "shipping_cost": shipping_cost,
            "line_items": json.dumps(line_items)
        })
    return orders

In [None]:
# --- Cleanup Functions ---

def cleanup_all_resources(bq_client: bigquery.Client, storage_client: storage.Client,
                         delete_datasets: bool = True, delete_bucket: bool = True,
                         delete_connection: bool = True):
    """
    Deletes all Google Cloud resources created by this notebook.

    Args:
        bq_client: BigQuery client instance
        storage_client: Storage client instance
        delete_datasets: If True, deletes bronze, silver, and gold datasets
        delete_bucket: If True, deletes the GCS bucket
        delete_connection: If True, deletes the BigQuery connection
    """
    print("\n--- Starting Cleanup ---")

    # Delete BigQuery datasets
    if delete_datasets:
        for dataset_name in [BRONZE_DATASET, SILVER_DATASET, GOLD_DATASET]:
            dataset_id = f"{PROJECT_ID}.{dataset_name}"
            try:
                bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)
                print(f"Deleted dataset '{dataset_id}' and all its tables.")
            except Exception as e:
                print(f"Error deleting dataset '{dataset_id}': {e}")

    # Delete GCS bucket
    if delete_bucket:
        try:
            bucket = storage_client.bucket(BUCKET_NAME)
            # Delete all blobs in the bucket first
            blobs = list(bucket.list_blobs())
            for blob in blobs:
                blob.delete()
            # Delete the bucket
            bucket.delete()
            print(f"Deleted GCS bucket '{BUCKET_NAME}' and all its contents.")
        except NotFound:
            print(f"Bucket '{BUCKET_NAME}' not found, skipping.")
        except Exception as e:
            print(f"Error deleting bucket '{BUCKET_NAME}': {e}")

    # Delete BigQuery connection
    if delete_connection:
        full_connection_id = f"{PROJECT_ID}.{REGION}.{BIGQUERY_CONNECTION_ID}"
        try:
            delete_command = ["bq", "rm", "--connection", full_connection_id]
            result = subprocess.run(delete_command, capture_output=True, text=True)
            if result.returncode == 0:
                print(f"Deleted BigQuery connection '{full_connection_id}'.")
            else:
                print(f"Connection '{full_connection_id}' not found or already deleted.")
        except Exception as e:
            print(f"Error deleting connection '{full_connection_id}': {e}")

    print("\n--- Cleanup Complete! ---")

---

## Data Pipeline Execution

The following cells create the Bronze → Silver → Gold pipeline with intentional data quality issues.

In [None]:
# --- Create Bronze Layer ---

validate_config()

# Initialize GCP clients
bq_client = bigquery.Client(project=PROJECT_ID)
storage_client = storage.Client(project=PROJECT_ID)

# Setup infrastructure
setup_gcs_bucket(storage_client)
setup_bigquery_connection(bq_client, storage_client)
ensure_dataset(bq_client, BIGQUERY_DATASET)

# Generate fake data
print("\n--- Generating Fake Data (with quality issues) ---")
customers_data = generate_customers(NUM_CUSTOMERS)
products_data = generate_products(NUM_PRODUCTS)
orders_data = generate_orders(
    NUM_ORDERS,
    [c['customer_id'] for c in customers_data],
    [p['product_id'] for p in products_data],
    customers_data,
    products_data
)

customers_df = pd.DataFrame(customers_data)
products_df = pd.DataFrame(products_data)
orders_df = pd.DataFrame(orders_data)
print("Data generation complete.")

# Create bronze tables
print("\n--- Creating Bronze Layer Tables ---")

# 1. Native BigQuery table for customers
table_id = f"{PROJECT_ID}.{BIGQUERY_DATASET}.raw_customers"
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
job = bq_client.load_table_from_dataframe(customers_df, table_id, job_config=job_config)
job.result()
print(f"Created native table 'raw_customers' with {job.output_rows} rows.")

# 2. BigLake external table for products (CSV)
table_id = f"{PROJECT_ID}.{BIGQUERY_DATASET}.raw_products"
gcs_uri = f"gs://{BUCKET_NAME}/bronze/raw_products/products.csv"

blob = storage.Blob.from_string(gcs_uri, client=storage_client)
blob.upload_from_string(products_df.to_csv(index=False), 'text/csv')

ddl = f"""
CREATE OR REPLACE EXTERNAL TABLE `{table_id}`
(
    product_id INT64,
    product_name STRING,
    category STRING,
    brand STRING,
    sku STRING,
    unit_price FLOAT64,
    unit_cost FLOAT64
)
WITH CONNECTION `{PROJECT_ID}.{REGION}.{BIGQUERY_CONNECTION_ID}`
OPTIONS (
    format = 'CSV',
    uris = ['{gcs_uri}'],
    skip_leading_rows = 1,
    max_staleness = INTERVAL 30 MINUTE,
    metadata_cache_mode = 'AUTOMATIC'
);
"""
bq_client.query(ddl).result()
print("Created BigLake external table 'raw_products'.")

# 3. BigLake table with Iceberg format for orders
table_id = f"{PROJECT_ID}.{BIGQUERY_DATASET}.raw_orders"
gcs_uri = f"gs://{BUCKET_NAME}/bronze/raw_orders/"

ddl = f"""
CREATE OR REPLACE TABLE `{table_id}`
(
    order_id INT64,
    customer_id INT64,
    order_timestamp STRING,
    discount_percent FLOAT64,
    shipping_cost FLOAT64,
    line_items STRING
)
WITH CONNECTION `{PROJECT_ID}.{REGION}.{BIGQUERY_CONNECTION_ID}`
OPTIONS (
    table_format = 'ICEBERG',
    storage_uri = '{gcs_uri}'
);
"""
bq_client.query(ddl).result()

job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
job = bq_client.load_table_from_dataframe(orders_df, table_id, job_config=job_config)
job.result()
print(f"Created Iceberg table 'raw_orders' with {job.output_rows} rows.")

print("\n--- Bronze Layer Complete! ---")

In [None]:
# --- Create Silver Layer ---

ensure_dataset(bq_client, SILVER_DATASET)

print("\n--- Creating Silver Layer Tables ---")
print("NOTE: Silver layer performs basic type conversions but PRESERVES data quality issues")
print("      Quality issues will be detected later using Dataplex Data Quality checks")

# 1. Customers: Parse JSON, convert types (DO NOT deduplicate or filter nulls)
customers_silver_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{SILVER_DATASET}.customers_silver` AS
SELECT
    customer_id,
    JSON_VALUE(profile, '$.name') AS name,
    JSON_VALUE(profile, '$.email') AS email,
    JSON_VALUE(profile, '$.address') AS address,
    CAST(JSON_VALUE(profile, '$.join_date') AS DATE) AS join_date,
    JSON_VALUE(profile, '$.customer_tier') AS customer_tier,
    JSON_VALUE(profile, '$.signup_channel') AS signup_channel
FROM `{PROJECT_ID}.{BRONZE_DATASET}.raw_customers`;
"""
bq_client.query(customers_silver_sql).result()
print("Created 'customers_silver' table (duplicates and nulls preserved).")

# 2. Products: Convert types (DO NOT standardize category naming)
products_silver_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{SILVER_DATASET}.products_silver` AS
SELECT
    product_id,
    product_name,
    category,  -- Preserving original casing (quality issue)
    brand,
    sku,
    CAST(unit_price AS NUMERIC) AS unit_price,
    CAST(unit_cost AS NUMERIC) AS unit_cost
FROM `{PROJECT_ID}.{BRONZE_DATASET}.raw_products`;
"""
bq_client.query(products_silver_sql).result()
print("Created 'products_silver' table (inconsistent category casing preserved).")

# 3. Orders: Cast timestamp strings to TIMESTAMP (DO NOT filter future dates)
orders_silver_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{SILVER_DATASET}.orders_silver` AS
SELECT
    order_id,
    customer_id,
    CAST(order_timestamp AS TIMESTAMP) AS order_timestamp,
    CAST(discount_percent AS NUMERIC) AS discount_percent,
    CAST(shipping_cost AS NUMERIC) AS shipping_cost
FROM `{PROJECT_ID}.{BRONZE_DATASET}.raw_orders`;
"""
bq_client.query(orders_silver_sql).result()
print("Created 'orders_silver' table (future-dated orders preserved).")

# 4. Order Items: Unnest JSON line items
order_items_silver_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{SILVER_DATASET}.order_items_silver` AS
SELECT
    raw.order_id,
    CAST(JSON_VALUE(item_json, '$.product_id') AS INT64) AS product_id,
    CAST(JSON_VALUE(item_json, '$.quantity') AS INT64) AS quantity
FROM `{PROJECT_ID}.{BRONZE_DATASET}.raw_orders` raw,
    UNNEST(JSON_QUERY_ARRAY(line_items)) AS item_json;
"""
bq_client.query(order_items_silver_sql).result()
print("Created 'order_items_silver' table.")

print("\n--- Silver Layer Complete! ---")
print("Data quality issues preserved for Dataplex detection")

In [None]:
# --- Create Gold Layer ---

ensure_dataset(bq_client, GOLD_DATASET)

print("\n--- Creating Gold Layer Dimensional Model ---")
print("NOTE: Gold layer creates dimensional model but PRESERVES data quality issues")
print("      Issues will flow into dimensions and fact table for Dataplex detection")

# 1. Customer dimension with surrogate key
dim_customers_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{GOLD_DATASET}.dim_customers` AS
SELECT
    GENERATE_UUID() AS customer_key,
    customer_id AS customer_natural_key,
    name,
    email,
    address,
    join_date,
    customer_tier,
    signup_channel,
    CURRENT_TIMESTAMP() AS effective_date,
    TRUE AS is_current
FROM `{PROJECT_ID}.{SILVER_DATASET}.customers_silver`;
"""
bq_client.query(dim_customers_sql).result()
print("Created 'dim_customers' (duplicate customer_id=20 preserved).")

# 2. Product dimension with surrogate key
dim_products_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{GOLD_DATASET}.dim_products` AS
SELECT
    GENERATE_UUID() AS product_key,
    product_id AS product_natural_key,
    product_name,
    category,
    brand,
    sku,
    unit_price,
    unit_cost,
    ROUND(unit_price - unit_cost, 2) AS unit_margin,
    ROUND(SAFE_DIVIDE(unit_price - unit_cost, unit_price) * 100, 2) AS margin_percent
FROM `{PROJECT_ID}.{SILVER_DATASET}.products_silver`;
"""
bq_client.query(dim_products_sql).result()
print("Created 'dim_products' (inconsistent category casing preserved).")

# 3. Date dimension
dim_date_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{GOLD_DATASET}.dim_date` AS
SELECT
    date,
    EXTRACT(YEAR FROM date) AS year,
    EXTRACT(MONTH FROM date) AS month,
    EXTRACT(DAY FROM date) AS day,
    EXTRACT(QUARTER FROM date) AS quarter,
    EXTRACT(WEEK FROM date) AS week_of_year,
    EXTRACT(DAYOFWEEK FROM date) AS day_of_week_num,
    FORMAT_DATE('%A', date) AS day_of_week_name,
    FORMAT_DATE('%B', date) AS month_name,
    -- Weekend flag
    CASE WHEN EXTRACT(DAYOFWEEK FROM date) IN (1, 7) THEN TRUE ELSE FALSE END AS is_weekend,
    -- Fiscal year (assuming July 1 start)
    CASE
        WHEN EXTRACT(MONTH FROM date) >= 7 THEN EXTRACT(YEAR FROM date) + 1
        ELSE EXTRACT(YEAR FROM date)
    END AS fiscal_year,
    -- Fiscal quarter
    CASE
        WHEN EXTRACT(MONTH FROM date) IN (7, 8, 9) THEN 1
        WHEN EXTRACT(MONTH FROM date) IN (10, 11, 12) THEN 2
        WHEN EXTRACT(MONTH FROM date) IN (1, 2, 3) THEN 3
        ELSE 4
    END AS fiscal_quarter,
    -- Month start/end flags
    CASE WHEN EXTRACT(DAY FROM date) = 1 THEN TRUE ELSE FALSE END AS is_month_start,
    CASE WHEN date = LAST_DAY(date, MONTH) THEN TRUE ELSE FALSE END AS is_month_end,
    -- Year start/end flags
    CASE WHEN FORMAT_DATE('%m-%d', date) = '01-01' THEN TRUE ELSE FALSE END AS is_year_start,
    CASE WHEN FORMAT_DATE('%m-%d', date) = '12-31' THEN TRUE ELSE FALSE END AS is_year_end
FROM (
    SELECT DISTINCT EXTRACT(DATE FROM order_timestamp) AS date
    FROM `{PROJECT_ID}.{SILVER_DATASET}.orders_silver`
);
"""
bq_client.query(dim_date_sql).result()
print("Created 'dim_date' (includes future dates from bad data).")

# 4. Sales fact table
fct_sales_sql = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{GOLD_DATASET}.fct_sales` AS
SELECT
    -- Surrogate keys
    dc.customer_key,
    dp.product_key,
    EXTRACT(DATE FROM o.order_timestamp) AS order_date,

    -- Degenerate dimension (fact grain identifier)
    oi.order_id,

    -- Order-level attributes
    o.discount_percent AS order_discount_percent,
    o.shipping_cost,

    -- Line item measures
    oi.quantity,
    dp.unit_price,
    dp.unit_cost,

    -- Calculated measures
    ROUND(oi.quantity * dp.unit_price, 2) AS gross_revenue,
    ROUND((oi.quantity * dp.unit_price) * o.discount_percent, 2) AS discount_amount,
    ROUND((oi.quantity * dp.unit_price) * (1 - o.discount_percent), 2) AS net_revenue,
    ROUND(oi.quantity * dp.unit_cost, 2) AS cost_of_goods_sold,
    ROUND((oi.quantity * dp.unit_price) * (1 - o.discount_percent) - (oi.quantity * dp.unit_cost), 2) AS gross_margin,
    ROUND(SAFE_DIVIDE(
        (oi.quantity * dp.unit_price) * (1 - o.discount_percent) - (oi.quantity * dp.unit_cost),
        (oi.quantity * dp.unit_price) * (1 - o.discount_percent)
    ) * 100, 2) AS margin_percent

FROM `{PROJECT_ID}.{SILVER_DATASET}.order_items_silver` AS oi
INNER JOIN `{PROJECT_ID}.{SILVER_DATASET}.orders_silver` AS o
    ON oi.order_id = o.order_id
INNER JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_products` AS dp
    ON oi.product_id = dp.product_natural_key
INNER JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_customers` AS dc
    ON o.customer_id = dc.customer_natural_key;
"""
bq_client.query(fct_sales_sql).result()
print("Created 'fct_sales' (all quality issues preserved).")

# 5. Wide table view (One Big Table)
vw_sales_wide_sql = f"""
CREATE OR REPLACE VIEW `{PROJECT_ID}.{GOLD_DATASET}.vw_sales_wide` AS
SELECT
    -- Fact measures
    f.order_id,
    f.quantity,
    f.gross_revenue,
    f.discount_amount,
    f.net_revenue,
    f.cost_of_goods_sold,
    f.gross_margin,
    f.margin_percent,
    f.order_discount_percent,
    f.shipping_cost,

    -- Customer attributes
    c.customer_natural_key AS customer_id,
    c.name AS customer_name,
    c.email AS customer_email,
    c.customer_tier,
    c.signup_channel,
    c.join_date AS customer_join_date,

    -- Product attributes
    p.product_natural_key AS product_id,
    p.product_name,
    p.category AS product_category,
    p.brand AS product_brand,
    p.sku,
    p.unit_price,
    p.unit_cost,
    p.unit_margin,

    -- Date attributes
    d.date AS order_date,
    d.year,
    d.month,
    d.day,
    d.quarter,
    d.fiscal_year,
    d.fiscal_quarter,
    d.week_of_year,
    d.day_of_week_num,
    d.day_of_week_name,
    d.month_name,
    d.is_weekend,
    d.is_month_start,
    d.is_month_end,
    d.is_year_start,
    d.is_year_end

FROM `{PROJECT_ID}.{GOLD_DATASET}.fct_sales` AS f
INNER JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_customers` AS c
    ON f.customer_key = c.customer_key
INNER JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_products` AS p
    ON f.product_key = p.product_key
INNER JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_date` AS d
    ON f.order_date = d.date;
"""
bq_client.query(vw_sales_wide_sql).result()
print("Created 'vw_sales_wide' view (contains all data quality issues).")

print("\n--- Gold Layer Complete! ---")
print("\nData Quality Issues Present in Gold Layer:")
print("  1. Duplicate customers (customer_id=20 appears twice)")
print("  2. NULL emails (~5% of customer records)")
print("  3. Inconsistent category casing ('Electronics' vs 'electronics')")
print("  4. Future-dated orders (~5% of orders)")
print("\nNext Step: Use Dataplex Data Quality to detect these issues in Gold,")
print("           then trace them back to Bronze layer for remediation.")

---

## Query Examples

The following queries demonstrate the data in the Gold layer. Notice that quality issues are present in the results.

In [None]:
# Query: Current month's revenue and margin analysis
# NOTE: Results contain quality issues that will be detected by Dataplex:
#     - Future-dated orders inflate current month revenue
#     - Revenue calculations based on dirty data

query = f"""
SELECT
    CONCAT('$', FORMAT('%.2f', SUM(s.gross_revenue))) AS gross_revenue,
    CONCAT('$', FORMAT('%.2f', SUM(s.discount_amount))) AS total_discounts,
    CONCAT('$', FORMAT('%.2f', SUM(s.net_revenue))) AS net_revenue,
    CONCAT('$', FORMAT('%.2f', SUM(s.cost_of_goods_sold))) AS total_cogs,
    CONCAT('$', FORMAT('%.2f', SUM(s.gross_margin))) AS gross_margin,
    FORMAT('%.1f%%', SAFE_DIVIDE(SUM(s.gross_margin), SUM(s.net_revenue)) * 100) AS margin_percent
FROM `{PROJECT_ID}.{GOLD_DATASET}.fct_sales` AS s
JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_date` AS d ON s.order_date = d.date
WHERE
    d.year = EXTRACT(YEAR FROM CURRENT_DATE())
    AND d.month = EXTRACT(MONTH FROM CURRENT_DATE())
"""

df = bq_client.query(query).to_dataframe()
display(df)

In [None]:
# Query: Top customers by lifetime value with tier analysis
# NOTE: Results contain quality issues that will be detected by Dataplex:
#     - Duplicate customer_id=20 fans out the fact join, so that customer's orders appear under both dimension rows
#     - NULL emails mean some customers can't receive marketing

query = f"""
SELECT
    c.name,
    c.customer_tier,
    c.signup_channel,
    COUNT(DISTINCT s.order_id) AS total_orders,
    SUM(s.quantity) AS total_items,
    CONCAT('$', FORMAT('%.2f', SUM(s.net_revenue))) AS lifetime_value,
    CONCAT('$', FORMAT('%.2f', AVG(s.net_revenue))) AS avg_line_item_value,
    FORMAT('%.1f%%', AVG(s.margin_percent)) AS avg_margin_pct
FROM `{PROJECT_ID}.{GOLD_DATASET}.fct_sales` AS s
JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_customers` AS c ON s.customer_key = c.customer_key
GROUP BY c.customer_key, c.name, c.customer_tier, c.signup_channel
ORDER BY SUM(s.net_revenue) DESC
LIMIT 10
"""

df = bq_client.query(query).to_dataframe()
display(df)

In [None]:
# Query: Product category performance with margin analysis
# NOTE: Results contain quality issues that will be detected by Dataplex:
#     - Inconsistent category casing splits same category into multiple rows
#     - 'Electronics' and 'electronics' appear as separate categories

query = f"""
SELECT
    p.category,
    p.brand,
    COUNT(DISTINCT s.order_id) AS orders,
    SUM(s.quantity) AS units_sold,
    CONCAT('$', FORMAT('%.2f', SUM(s.gross_revenue))) AS gross_revenue,
    CONCAT('$', FORMAT('%.2f', SUM(s.gross_margin))) AS gross_margin,
    FORMAT('%.1f%%', SAFE_DIVIDE(SUM(s.gross_margin), SUM(s.net_revenue)) * 100) AS margin_pct
FROM `{PROJECT_ID}.{GOLD_DATASET}.fct_sales` AS s
JOIN `{PROJECT_ID}.{GOLD_DATASET}.dim_products` AS p ON s.product_key = p.product_key
GROUP BY p.category, p.brand
ORDER BY SUM(s.gross_revenue) DESC
LIMIT 15
"""

df = bq_client.query(query).to_dataframe()
display(df)

## Using the Wide Table View

The `vw_sales_wide` view simplifies queries by pre-joining all dimensions.

**Note**: This view contains all data quality issues from the underlying tables. Use Dataplex Data Quality to detect and remediate these issues before using in production analytics.
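
One consequence of the duplicate `customer_id` is worth spelling out: because `fct_sales` joins `dim_customers` on the natural key, every order line for a duplicated customer matches two dimension rows. The sketch below reproduces that fan-out with plain Python lists. The rows and key values are made up for illustration.

```python
# Dimension with a duplicated natural key (illustrative values).
dim_customers = [
    {"customer_key": "k1", "customer_natural_key": 19},
    {"customer_key": "k2", "customer_natural_key": 20},
    {"customer_key": "k3", "customer_natural_key": 20},  # duplicate natural key
]
order_lines = [
    {"order_id": 1001, "customer_id": 19, "net_revenue": 50.0},
    {"order_id": 1002, "customer_id": 20, "net_revenue": 100.0},
]

# Equivalent of: INNER JOIN dim_customers dc ON o.customer_id = dc.customer_natural_key
fact_rows = [
    {**line, "customer_key": dim["customer_key"]}
    for line in order_lines
    for dim in dim_customers
    if line["customer_id"] == dim["customer_natural_key"]
]

true_revenue = sum(line["net_revenue"] for line in order_lines)
joined_revenue = sum(row["net_revenue"] for row in fact_rows)
print(len(fact_rows), true_revenue, joined_revenue)  # prints: 3 150.0 250.0
```

Two order lines become three fact rows, and the revenue for customer 20 is counted twice in any aggregate over the fact table or the wide view.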

In [None]:
query = f"""
SELECT
    customer_tier,
    product_category,
    fiscal_year,
    fiscal_quarter,
    CASE WHEN is_weekend THEN 'Weekend' ELSE 'Weekday' END AS day_type,
    COUNT(DISTINCT order_id) AS orders,
    SUM(quantity) AS units,
    CONCAT('$', FORMAT('%.2f', SUM(net_revenue))) AS revenue,
    FORMAT('%.1f%%', AVG(margin_percent)) AS avg_margin
FROM `{PROJECT_ID}.{GOLD_DATASET}.vw_sales_wide`
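-- Current fiscal year (July 1 start), matching the fiscal_year logic in dim_date
WHERE fiscal_year = EXTRACT(YEAR FROM CURRENT_DATE()) + IF(EXTRACT(MONTH FROM CURRENT_DATE()) >= 7, 1, 0)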
GROUP BY customer_tier, product_category, fiscal_year, fiscal_quarter, day_type
ORDER BY SUM(net_revenue) DESC
LIMIT 20
"""

df = bq_client.query(query).to_dataframe()
display(df)

# Dataplex Data Quality - Gold Layer

Now we'll use **Dataplex Data Profiling and Data Quality** to detect the issues we injected into our Gold layer.

## Approach

**Phase 1: Data Profiling**
- Statistical analysis of all columns
- Detects NULL percentages, min/max values, distributions, outliers
- No rules required - pure profiling using `DataProfileSpec`

**Phase 2: Data Quality Validation**
- Custom validation rules based on business logic
- Completeness, validity, uniqueness, conformity checks
- Validates cross-column accuracy (e.g., revenue calculations)
- Uses `DataQualitySpec` with defined rules

**Phase 3: Trace to Source**
- Use Data Lineage to trace each data quality problem back to its source table
- Add checks that catch these issues before they reach the Gold layer

Let's start with Phase 1 and 2 on the Gold layer.
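
To make the idea of a row-level validation rule concrete, here is a small standard-library sketch. It treats a rule as a predicate plus a minimum passing ratio, which is an assumption about how such rules can be modeled and not Dataplex's implementation. The rule names, sample rows, and thresholds are hypothetical. Set-level checks such as uniqueness need the whole column and are not covered by this row-by-row form.

```python
def pass_ratio(rows, predicate):
    """Fraction of rows for which the predicate holds (1.0 for empty input)."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if predicate(r)) / len(rows)


# Illustrative fact-like rows; the last one has an inconsistent gross_revenue.
rows = [
    {"email": "a@example.com", "quantity": 2, "unit_price": 10.0, "gross_revenue": 20.0},
    {"email": None, "quantity": 1, "unit_price": 5.0, "gross_revenue": 5.0},
    {"email": "c@example.com", "quantity": 3, "unit_price": 4.0, "gross_revenue": 12.0},
    {"email": "d@example.com", "quantity": 1, "unit_price": 7.5, "gross_revenue": 9.0},
]

# (rule name, row predicate, minimum passing ratio)
rules = [
    ("email_not_null", lambda r: r["email"] is not None, 1.0),
    ("quantity_positive", lambda r: r["quantity"] > 0, 1.0),
    (
        "gross_revenue_matches",
        lambda r: abs(r["quantity"] * r["unit_price"] - r["gross_revenue"]) < 0.01,
        0.5,
    ),
]

results = {}
for name, predicate, threshold in rules:
    ratio = pass_ratio(rows, predicate)
    results[name] = {"ratio": ratio, "passed": ratio >= threshold}

for name, outcome in results.items():
    print(name, outcome)
```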

---


In [None]:
from google.cloud import dataplex_v1
from google.protobuf import field_mask_pb2
from google.api_core.exceptions import ResourceExhausted, AlreadyExists
from typing import List, Optional  # List is used by create_custom_dq_scan below
import yaml
import time

# Initialize Dataplex client
dataplex_client = dataplex_v1.DataScanServiceClient()

# Configuration
DATAPLEX_REGION = REGION

print("=" * 80)
print("DATAPLEX DATA PROFILING & DATA QUALITY SETUP")
print("=" * 80)
print(f"\nProject: {PROJECT_ID}")
print(f"Region: {DATAPLEX_REGION}")
print(f"Gold Dataset: {GOLD_DATASET}")
print("\nThis will create Dataplex scans for:")
print("\n  Phase 1 - Data Profiling (DataProfileSpec):")
print("    - dim_customers")
print("    - dim_products")
print("    - dim_date")
print("    - fct_sales")
print("\n  Phase 2 - Data Quality Validation (DataQualitySpec with rules):")
print("    - dim_customers (4 rules)")
print("    - dim_products (4 rules)")
print("    - dim_date (1 rule)")
print("    - fct_sales (4 rules)")
print("\n" + "=" * 80)

## Phase 1: Data Profiling

Data Profiling automatically analyzes every column and reports statistics such as:
- NULL percentages
- Distinct-value percentages (cardinality)
- Most common values
- Min, max, and average for numeric columns, plus standard deviation and quartiles
- Min, max, and average length for string columns

This runs **without any rule definitions** - it's pure statistical profiling using `DataProfileSpec`.
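
To make these statistics concrete, here is a minimal local sketch of the kind of per-column numbers a profile contains. It is not how Dataplex computes them; `profile_column` is a hypothetical helper, the sample values are made up, and dividing the distinct count by all rows (including NULLs) is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def profile_column(values: List[Optional[Any]], top_n: int = 3) -> Dict[str, Any]:
    """Compute simple profile statistics for one column held in memory."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": total,
        "null_ratio": (total - len(non_null)) / total if total else 0.0,
        "distinct_ratio": len(counts) / total if total else 0.0,
        "top_values": counts.most_common(top_n),
    }


# Made-up sample mirroring the mixed-case category issue
categories = ["Electronics", "electronics", "Books", None, "Books", "Toys", "Books", "Electronics"]
profile = profile_column(categories)
print(profile)
```

Note how `Electronics` and `electronics` show up as separate values: a profile can hint at inconsistent casing without any rule being defined.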

In [None]:
def create_profiling_scan(table_name: str, description: str) -> None:
    """
    Creates a Dataplex Data Profiling scan for a BigQuery table with retry logic.

    Args:
        table_name: Name of the table (e.g., 'dim_customers')
        description: Human-readable description of the scan
    """
    scan_id = f"profile-{table_name.replace('_', '-')}"
    parent = f"projects/{PROJECT_ID}/locations/{DATAPLEX_REGION}"

    # Define the data source
    data_source = dataplex_v1.DataSource(
        resource=f"//bigquery.googleapis.com/projects/{PROJECT_ID}/datasets/{GOLD_DATASET}/tables/{table_name}"
    )

    # Data Profile spec - statistical analysis without rules
    data_profile_spec = dataplex_v1.DataProfileSpec(
        sampling_percent=100.0,  # Profile 100% of data
        row_filter=None,  # No filter, profile all rows
    )

    # Create the DataScan
    data_scan = dataplex_v1.DataScan(
        description=f"{description} - Data Profiling",
        display_name=f"Profile: {table_name}",
        data=data_source,
        data_profile_spec=data_profile_spec,
    )

    request = dataplex_v1.CreateDataScanRequest(
        parent=parent,
        data_scan=data_scan,
        data_scan_id=scan_id,
    )

    # Retry with exponential backoff for the create call only
    max_retries = 5
    base_delay = 2  # seconds

    for attempt in range(max_retries):
        try:
            operation = dataplex_client.create_data_scan(request=request)
            print(f"Created Data Profiling scan for '{table_name}': {scan_id}")
            # Don't wait for operation to complete - just return after successful creation
            return None
        except AlreadyExists:
            print(f"Data Profiling scan already exists for '{table_name}': {scan_id}")
            return None
        except ResourceExhausted as e:
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)  # Exponential backoff: 2, 4, 8, 16 seconds (no sleep after the final attempt)
                print(f"Rate limit hit. Retrying in {delay} seconds... (attempt {attempt + 1}/{max_retries})")
                time.sleep(delay)
            else:
                print(f"Error creating scan for '{table_name}' after {max_retries} attempts: {e}")
                raise
        except Exception as e:
            print(f"Error creating scan for '{table_name}': {e}")
            raise

    return None

print("\n--- Creating Data Profiling Scans for Gold Layer Tables ---\n")

# Create Data Profiling scans for each Gold table
gold_tables = [
    ("dim_customers", "Customer dimension with duplicate and NULL email issues"),
    ("dim_products", "Product dimension with inconsistent category casing"),
    ("dim_date", "Date dimension with future dates"),
    ("fct_sales", "Sales fact table with quality issues from dimensions"),
]

for i, (table_name, description) in enumerate(gold_tables):
    create_profiling_scan(table_name, description)
    # Add delay between scans to avoid rate limiting (except after last scan)
    if i < len(gold_tables) - 1:
        time.sleep(2)

print("\nData Profiling scans created!")
print("\nNext steps:")
print("1. Go to Dataplex in GCP Console")
print("2. Navigate to Data Profiling > Scans")
print("3. Run the scans manually (they are created on demand, with no schedule)")
print("4. Review the profiling results to see column statistics")

## Phase 2: Custom Business Rules

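Based on business requirements, we'll define custom validation rules to catch specific issues. The requirements are listed below; the next cell implements a subset of them (four rules each for `dim_customers`, `dim_products`, and `fct_sales`, and one for `dim_date`):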

**For dim_customers:**
- Uniqueness: customer_natural_key must be unique
- Completeness: email, name, customer_tier cannot be NULL
- Validity: join_date must be <= today

**For dim_products:**
- Uniqueness: product_natural_key, sku must be unique
- Conformity: category must match approved list
- Range: unit_price, unit_cost must be >= 0

**For dim_date:**
- Validity: date must be <= today (no future dates)

**For fct_sales:**
- Range: revenue/margin metrics must be >= 0
- Accuracy: net_revenue = gross_revenue - discount_amount
- Referential Integrity: Keys must exist in dimensions
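
Row-level rules take a `threshold`, the minimum ratio of passing rows to total rows for the rule to pass. The sketch below works through that arithmetic locally for a uniqueness check. It assumes every row sharing a duplicated key counts as failing, uses 1,000 made-up customers, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Hashable, List


def uniqueness_pass_ratio(keys: List[Hashable]) -> float:
    """Fraction of rows whose key occurs exactly once."""
    counts = Counter(keys)
    passing = sum(1 for key in keys if counts[key] == 1)
    return passing / len(keys)


def rule_passes(pass_ratio: float, threshold: float) -> bool:
    """A rule passes when the passing-row ratio meets the threshold."""
    return pass_ratio >= threshold


# 1,000 made-up customers, with customer 20 appended a second time
keys = list(range(1, 1001)) + [20]
ratio = uniqueness_pass_ratio(keys)
print(f"pass ratio: {ratio:.4f}")
print("passes at 0.99:", rule_passes(ratio, 0.99))
print("passes at 1.0: ", rule_passes(ratio, 1.0))
```

Under these assumptions a single duplicated key slips through a 0.99 threshold, so choose the threshold with the expected table size in mind.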

In [None]:
def create_custom_dq_scan(table_name: str, description: str, rules: List[dataplex_v1.DataQualityRule]) -> None:
    """
    Creates a Dataplex Data Quality scan with custom validation rules and retry logic.

    Args:
        table_name: Name of the table
        description: Human-readable description
        rules: List of DataQualityRule objects
    """
    scan_id = f"custom-dq-{table_name.replace('_', '-')}"
    parent = f"projects/{PROJECT_ID}/locations/{DATAPLEX_REGION}"

    # Define the data source
    data_source = dataplex_v1.DataSource(
        resource=f"//bigquery.googleapis.com/projects/{PROJECT_ID}/datasets/{GOLD_DATASET}/tables/{table_name}"
    )

    # Custom DQ spec with rules
    data_quality_spec = dataplex_v1.DataQualitySpec(
        sampling_percent=100.0,
        rules=rules,
    )

    # Create the DataScan
    data_scan = dataplex_v1.DataScan(
        description=f"{description} - Custom Business Rules",
        display_name=f"Custom DQ: {table_name}",
        data=data_source,
        data_quality_spec=data_quality_spec,
    )

    request = dataplex_v1.CreateDataScanRequest(
        parent=parent,
        data_scan=data_scan,
        data_scan_id=scan_id,
    )

    # Retry with exponential backoff for the create call only
    max_retries = 5
    base_delay = 2  # seconds

    for attempt in range(max_retries):
        try:
            operation = dataplex_client.create_data_scan(request=request)
            print(f"Created Custom DQ scan for '{table_name}': {scan_id}")
            print(f"  Rules: {len(rules)} validation rules defined")
            # Don't wait for operation to complete - just return after successful creation
            return None
        except AlreadyExists:
            print(f"Custom DQ scan already exists for '{table_name}': {scan_id}")
            return None
        except ResourceExhausted as e:
            if attempt < max_retries - 1:
                delay = base_delay * (2 ** attempt)  # Exponential backoff: 2, 4, 8, 16 seconds (no sleep after the final attempt)
                print(f"Rate limit hit. Retrying in {delay} seconds... (attempt {attempt + 1}/{max_retries})")
                time.sleep(delay)
            else:
                print(f"Error creating scan for '{table_name}' after {max_retries} attempts: {e}")
                raise
        except Exception as e:
            print(f"Error creating scan for '{table_name}': {e}")
            raise

    return None

print("\n--- Creating Custom DQ Scans with Business Rules ---\n")

# Define rules for dim_customers
dim_customers_rules = [
    # Uniqueness: customer_natural_key must be unique (will catch duplicate customer_id=20)
    dataplex_v1.DataQualityRule(
        column="customer_natural_key",
        dimension="UNIQUENESS",
        uniqueness_expectation=dataplex_v1.DataQualityRule.UniquenessExpectation(),
        description="Customer natural key must be unique",
        threshold=0.99,  # Passes if >= 99% of rows are unique; use 1.0 to fail on any duplicate
    ),
    # Completeness: email cannot be NULL (will catch ~5% NULL emails)
    dataplex_v1.DataQualityRule(
        column="email",
        dimension="COMPLETENESS",
        non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
        description="Customer email must not be NULL",
        threshold=0.95,  # Expect 95% completeness
    ),
    # Completeness: name cannot be NULL
    dataplex_v1.DataQualityRule(
        column="name",
        dimension="COMPLETENESS",
        non_null_expectation=dataplex_v1.DataQualityRule.NonNullExpectation(),
        description="Customer name must not be NULL",
        threshold=1.0,
    ),
    # Validity: join_date must be <= today (SqlAssertion returns FAILING rows)
    dataplex_v1.DataQualityRule(
        column="join_date",
        dimension="VALIDITY",
        sql_assertion=dataplex_v1.DataQualityRule.SqlAssertion(
            sql_statement=f"SELECT join_date FROM `{PROJECT_ID}.{GOLD_DATASET}.dim_customers` WHERE join_date > CURRENT_DATE()"
        ),
        description="Customer join date cannot be in the future",
    ),
]

create_custom_dq_scan("dim_customers", "Customer dimension validation", dim_customers_rules)
time.sleep(2)  # Pause between scans to avoid rate limiting

# Define rules for dim_products
dim_products_rules = [
    # Uniqueness: product_natural_key must be unique
    dataplex_v1.DataQualityRule(
        column="product_natural_key",
        dimension="UNIQUENESS",
        uniqueness_expectation=dataplex_v1.DataQualityRule.UniquenessExpectation(),
        description="Product natural key must be unique",
        threshold=1.0,
    ),
    # Uniqueness: SKU must be unique
    dataplex_v1.DataQualityRule(
        column="sku",
        dimension="UNIQUENESS",
        uniqueness_expectation=dataplex_v1.DataQualityRule.UniquenessExpectation(),
        description="Product SKU must be unique",
        threshold=1.0,
    ),
    # Conformity: category must be in approved list (will catch lowercase variants)
    dataplex_v1.DataQualityRule(
        column="category",
        dimension="VALIDITY",
        set_expectation=dataplex_v1.DataQualityRule.SetExpectation(
            values=['Electronics', 'Books', 'Home Goods', 'Apparel', 'Toys']
        ),
        description="Product category must match approved list (case-sensitive)",
        threshold=1.0,
    ),
    # Range: unit_price must be positive
    dataplex_v1.DataQualityRule(
        column="unit_price",
        dimension="VALIDITY",
        range_expectation=dataplex_v1.DataQualityRule.RangeExpectation(
            min_value="0",
            strict_min_enabled=True,
        ),
        description="Product unit price must be greater than 0",
        threshold=1.0,
    ),
]

create_custom_dq_scan("dim_products", "Product dimension validation", dim_products_rules)
time.sleep(2)  # Pause between scans to avoid rate limiting

# Define rules for dim_date
dim_date_rules = [
    # Validity: date cannot be in the future (SqlAssertion returns FAILING rows)
    dataplex_v1.DataQualityRule(
        column="date",
        dimension="VALIDITY",
        sql_assertion=dataplex_v1.DataQualityRule.SqlAssertion(
            sql_statement=f"SELECT date FROM `{PROJECT_ID}.{GOLD_DATASET}.dim_date` WHERE date > CURRENT_DATE()"
        ),
        description="Order date cannot be in the future",
    ),
]

create_custom_dq_scan("dim_date", "Date dimension validation", dim_date_rules)
time.sleep(2)  # Pause between scans to avoid rate limiting

# Define rules for fct_sales
fct_sales_rules = [
    # Range: quantity must be positive
    dataplex_v1.DataQualityRule(
        column="quantity",
        dimension="VALIDITY",
        range_expectation=dataplex_v1.DataQualityRule.RangeExpectation(
            min_value="0",
            strict_min_enabled=True,
        ),
        description="Order quantity must be greater than 0",
        threshold=1.0,
    ),
    # Range: gross_revenue must be non-negative
    dataplex_v1.DataQualityRule(
        column="gross_revenue",
        dimension="VALIDITY",
        range_expectation=dataplex_v1.DataQualityRule.RangeExpectation(
            min_value="0",
        ),
        description="Gross revenue must be >= 0",
        threshold=1.0,
    ),
    # Range: net_revenue must be non-negative
    dataplex_v1.DataQualityRule(
        column="net_revenue",
        dimension="VALIDITY",
        range_expectation=dataplex_v1.DataQualityRule.RangeExpectation(
            min_value="0",
        ),
        description="Net revenue must be >= 0",
        threshold=1.0,
    ),
    # Accuracy: net_revenue should equal gross_revenue - discount_amount (SqlAssertion returns FAILING rows)
    dataplex_v1.DataQualityRule(
        column="net_revenue",
        dimension="ACCURACY",
        sql_assertion=dataplex_v1.DataQualityRule.SqlAssertion(
            sql_statement=f"SELECT net_revenue, gross_revenue, discount_amount FROM `{PROJECT_ID}.{GOLD_DATASET}.fct_sales` WHERE ABS(net_revenue - (gross_revenue - discount_amount)) >= 0.01"
        ),
        description="Net revenue must equal gross revenue minus discount (within rounding)",
    ),
]

create_custom_dq_scan("fct_sales", "Sales fact table validation", fct_sales_rules)

print("\nCustom DQ scans created with business validation rules!")
print("\nExpected Failures:")
print("  - dim_customers: Uniqueness (duplicate customer_id=20)")
print("  - dim_customers: Completeness (NULL emails)")
print("  - dim_products: Set membership (lowercase categories)")
print("  - dim_date: Validity (future dates)")
print("  - fct_sales: (inherits issues from dimensions)")
print("\nNext: Run these scans in Dataplex Console to see the actual failures!")

## Summary: Dataplex Scans Created

You now have **8 Dataplex scans** configured on your Gold layer:

### Data Profiling Scans (Phase 1 - Statistical Analysis)
1. `profile-dim-customers` - Profiles all customer dimension columns
2. `profile-dim-products` - Profiles all product dimension columns  
3. `profile-dim-date` - Profiles all date dimension columns
4. `profile-fct-sales` - Profiles all fact table columns

### Data Quality Scans (Phase 2 - Business Validation)
5. `custom-dq-dim-customers` - 4 validation rules (uniqueness, completeness, validity)
6. `custom-dq-dim-products` - 4 validation rules (uniqueness, conformity, range)
7. `custom-dq-dim-date` - 1 validation rule (validity)
8. `custom-dq-fct-sales` - 4 validation rules (range, accuracy)
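
The eight scan IDs follow one naming pattern: a `profile-` or `custom-dq-` prefix plus the table name with underscores replaced by hyphens. A small sketch that derives them, which can be handy when looping over scans to run or delete them (`scan_ids_for` is a hypothetical helper, not part of the notebook):

```python
from typing import List

GOLD_TABLES = ["dim_customers", "dim_products", "dim_date", "fct_sales"]


def scan_ids_for(tables: List[str]) -> List[str]:
    """Derive profiling and DQ scan IDs using the notebook's naming pattern."""
    hyphenated = [table.replace("_", "-") for table in tables]
    profiling = [f"profile-{name}" for name in hyphenated]
    quality = [f"custom-dq-{name}" for name in hyphenated]
    return profiling + quality


scan_ids = scan_ids_for(GOLD_TABLES)
for scan_id in scan_ids:
    print(scan_id)
```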

### How to Run and View Results

**Option 1: GCP Console**
```
1. Go to: https://console.cloud.google.com/dataplex/data-quality
2. Select your project
3. Click on each scan
4. Click "RUN NOW" to execute
5. View results after scan completes
```

**Option 2: Command Line**
```bash
# List all scans (replace us-central1 with the region you configured)
gcloud dataplex datascans list --location=us-central1

# Run a specific profiling scan
gcloud dataplex datascans run profile-dim-customers --location=us-central1

# Run a specific DQ scan
gcloud dataplex datascans run custom-dq-dim-customers --location=us-central1

# Get scan results (--view=FULL includes the latest job's results)
gcloud dataplex datascans describe profile-dim-customers --location=us-central1 --view=FULL
```

**Option 3: Python API** (see next cell for example)
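
Triggering a scan only starts a job; results are available once the job reaches a terminal state. The sketch below is a generic polling loop, shown with a simulated job so it runs anywhere. `wait_for_terminal_state` is a hypothetical helper. With Dataplex, `get_state` could wrap something like `dataplex_client.get_data_scan_job(name=response.job.name).state.name`, which is an assumption to verify against your client version.

```python
import time
from typing import Callable, Iterable


def wait_for_terminal_state(
    get_state: Callable[[], str],
    terminal_states: Iterable[str] = ("SUCCEEDED", "FAILED", "CANCELLED"),
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns a terminal state or the timeout elapses."""
    terminal = set(terminal_states)
    waited = 0.0
    while True:
        state = get_state()
        if state in terminal:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still in state {state!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated job that reports PENDING, then RUNNING, then SUCCEEDED
fake_states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda _seconds: None)
print(final_state)
```

Injecting `sleep` keeps the loop testable without real waiting; in the notebook the default `time.sleep` is fine.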

In [None]:
# Optional: Run scans programmatically

def run_data_scan(scan_id: str) -> Optional[dataplex_v1.RunDataScanResponse]:
    """Triggers a Dataplex scan to run."""
    name = f"projects/{PROJECT_ID}/locations/{DATAPLEX_REGION}/dataScans/{scan_id}"

    request = dataplex_v1.RunDataScanRequest(name=name)

    try:
        response = dataplex_client.run_data_scan(request=request)
        print(f"Scan '{scan_id}' started. Job: {response.job.name}")
        return response
    except Exception as e:
        print(f"Error running scan '{scan_id}': {e}")
        return None

# Running all scans programmatically
print("Running all Data Profiling scans...\n")
run_data_scan("profile-dim-customers")
run_data_scan("profile-dim-products")
run_data_scan("profile-dim-date")
run_data_scan("profile-fct-sales")

print("\nRunning all Data Quality scans...\n")
run_data_scan("custom-dq-dim-customers")
run_data_scan("custom-dq-dim-products")
run_data_scan("custom-dq-dim-date")
run_data_scan("custom-dq-fct-sales")

print("\nView results in the GCP Console:")
print(f"https://console.cloud.google.com/dataplex/govern/data-profiling-and-quality?project={PROJECT_ID}")

Next, we're going to investigate the data quality failures using Dataplex and Lineage.  

Open the Dataplex console using the URL above. Click on the "Custom DQ: dim_products" scan and view the results:

![Dataplex scan](images/dq01.png)

Copy the query that finds the failed rows and paste it into BigQuery:

![Failed rows](images/dq02.png)

The failed rows are all in the category column - next, let's use column-level lineage to figure out where they came from!

**Note:** Lineage may take 10 minutes or more to appear. Please move on to the next notebook in the meantime, and check back later for lineage results.

![CLL](images/dq03.png)

Finally, implement your own data quality checks in the bronze layer to detect these anomalies. The cell below provides a basic Python skeleton for creating them. Alternatively, you can create them in the Dataplex UI.
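
The Gold-layer `SqlAssertion` rules all share one shape: a `SELECT` that returns only the rows violating the check, so an empty result means the rule passes. A minimal sketch of a helper that builds such statements is shown below. `failing_rows_sql` is a hypothetical name and the project and dataset are placeholders; `raw_orders` and `order_timestamp` come from the hints in the skeleton.

```python
from typing import Sequence


def failing_rows_sql(table: str, columns: Sequence[str], failing_condition: str) -> str:
    """Return a SELECT that yields only the rows violating a check."""
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM `{table}` WHERE {failing_condition}"


# Placeholder project and dataset names for illustration only
sql = failing_rows_sql(
    table="my-project.bronze.raw_orders",
    columns=["order_timestamp"],
    failing_condition="order_timestamp > CURRENT_TIMESTAMP()",
)
print(sql)
```

The resulting string can be passed as `sql_statement` to `dataplex_v1.DataQualityRule.SqlAssertion`, in the same way the Gold-layer rules do.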

---

## Investigating Data Quality Issues

In [None]:
# --- Bronze Layer Data Quality Checks ---
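# TODO: Complete this skeleton to create DQ scans for the bronze layer
# Note: create_custom_dq_scan() builds its table resource from GOLD_DATASET,
# so parameterize the dataset before pointing it at bronze tables.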

print("\n--- Creating Data Quality Scans for Bronze Layer ---\n")

# Hint: You traced the quality issues back to the bronze layer using lineage.
# Now create scans to detect those issues at the source!

# TODO: Create a scan for raw_customers
# Hint: What issues did you find in dim_customers that came from raw_customers?
# - Duplicate customer_id?
# - NULL values in the JSON profile field?
# - Check the profile JSON structure itself?

bronze_customers_rules = [
    # TODO: Add uniqueness check for customer_id
    # dataplex_v1.DataQualityRule(
    #     column="???",
    #     dimension="???",
    #     ???_expectation=dataplex_v1.DataQualityRule.???Expectation(),
    #     description="???",
    #     threshold=???,
    # ),

    # TODO: Add completeness check for profile field
    # Hint: Can profile be NULL?

    # TODO: Add validation for JSON structure
    # Hint: Use SqlAssertion to check if JSON_VALUE can extract expected fields
]

# create_custom_dq_scan("raw_customers", "Bronze customer data validation", bronze_customers_rules)


# TODO: Create a scan for raw_products
# Hint: What issues did you find in dim_products that came from raw_products?
# - Category casing inconsistency?
# - Invalid category values?
# - Price validation?

bronze_products_rules = [
    # TODO: Add category conformity check
    # Hint: Should you check for the exact approved list, or something else at bronze layer?

    # TODO: Add price range validation
    # Hint: Prices should be positive, costs should be less than prices

    # TODO: What about product_name or SKU format?
]

# create_custom_dq_scan("raw_products", "Bronze product data validation", bronze_products_rules)


# TODO: Create a scan for raw_orders
# Hint: What issues did you find in dim_date/fct_sales that came from raw_orders?
# - Future order_timestamp?
# - Invalid discount_percent range?
# - Negative shipping_cost?

bronze_orders_rules = [
    # TODO: Add timestamp validation
    # Hint: Use SqlAssertion to find future timestamps

    # TODO: Add discount_percent range check
    # Hint: Should be between 0 and 1 (or 0% to 100%)

    # TODO: Add shipping_cost validation
]

# create_custom_dq_scan("raw_orders", "Bronze order data validation", bronze_orders_rules)


print("\nNext steps:")
print("1. Review the data quality failures you found in the Gold layer")
print("2. Trace them back to Bronze using column-level lineage")
print("3. Complete the rules above to catch issues at the source")
print("4. Can we implement data scans on BigLake tables?  What about BQ Omni tables?")
print("5. Consider: where should we be implementing these checks? Bronze? Silver? Why?")
print("6. Think about: How would you handle these issues - reject, quarantine, or alert/flag?")


## Cleanup

The following cell will delete all resources created in this notebook:
- Dataplex data scans (profiling and data quality)
- BigQuery datasets (bronze, silver, gold) and all their tables
- GCS bucket and all its contents
- BigQuery connection

**WARNING:** This action is irreversible. Uncomment and run only when you want to completely remove all resources.

---


In [None]:
def cleanup_dataplex_scans() -> None:
    """Deletes all Dataplex scans created by this notebook."""
    print("\n--- Cleaning up Dataplex Scans ---")

    scan_ids = [
        # Data Profiling scans
        "profile-dim-customers",
        "profile-dim-products",
        "profile-dim-date",
        "profile-fct-sales",
        # Data Quality scans
        "custom-dq-dim-customers",
        "custom-dq-dim-products",
        "custom-dq-dim-date",
        "custom-dq-fct-sales",
    ]

    # Imported here so this cell stays self-contained
    from google.api_core.exceptions import NotFound

    for scan_id in scan_ids:
        name = f"projects/{PROJECT_ID}/locations/{DATAPLEX_REGION}/dataScans/{scan_id}"
        try:
            request = dataplex_v1.DeleteDataScanRequest(name=name)
            dataplex_client.delete_data_scan(request=request)
            print(f"Deleted Dataplex scan: {scan_id}")
        except NotFound:
            # Catch the typed exception instead of matching on the error message
            print(f"Scan not found (already deleted): {scan_id}")
        except Exception as e:
            print(f"Error deleting scan '{scan_id}': {e}")

# Uncomment to delete all resources including Dataplex scans
# cleanup_dataplex_scans()
# cleanup_all_resources(bq_client, storage_client)

# Or selectively delete resources:
# cleanup_all_resources(bq_client, storage_client,
#                       delete_datasets=True,
#                       delete_bucket=True,
#                       delete_connection=False)