# Module 5: Data Governance Implementation Labs
## Unity Catalog and Enterprise Security

Welcome to the Module 5 hands-on labs! These exercises will guide you through implementing a comprehensive data governance solution using Unity Catalog.

**Total Duration**: 4.5 hours
**Prerequisites**: 
- Completed Modules 1-4
- Unity Catalog enabled workspace
- Admin privileges

## Lab Overview
1. Unity Catalog Setup and Configuration (45 min)
2. Fine-Grained Security Implementation (60 min)
3. Privacy and Compliance Controls (60 min)
4. Audit and Monitoring (45 min)
5. Governance Automation (60 min)

---
## Initial Setup
Run this cell first to set up your environment variables and helper functions.

In [0]:
# Initialize environment
import pyspark.sql.functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta
import json
import time

# Set your unique identifier (replace with your initials or ID)
USER_ID = "student"  # CHANGE THIS!

# Catalog names for this module
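# Underscores keep the name usable as an unquoted SQL identifier (a hyphen would need backtick quoting)
MAIN_CATALOG = "sm_training"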
DEV_CATALOG = f"dev_{USER_ID}"
PROD_CATALOG = f"prod_{USER_ID}"

# Helper function for logging
def log_progress(message):
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(f"[{timestamp}] {message}")
    
# Helper function for validation
def validate_step(condition, success_msg, failure_msg):
    if condition:
        print(f"✅ {success_msg}")
        return True
    else:
        print(f"❌ {failure_msg}")
        return False

log_progress("Environment initialized successfully")
print(f"Your catalogs will be: {MAIN_CATALOG}, {DEV_CATALOG}, {PROD_CATALOG}")
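
Every statement in these labs interpolates catalog and schema names into SQL text, so a name containing a hyphen or another special character only parses when it is backtick-quoted. The following is a minimal sketch of a quoting helper; `quote_ident` and `qualified_name` are hypothetical names introduced for illustration and are not part of the lab's helper functions.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(*parts: str) -> str:
    """Join catalog, schema, and object names into a quoted dotted name."""
    return ".".join(quote_ident(part) for part in parts)


# Example: a hyphenated catalog name becomes safe to embed in a statement
print(qualified_name("sm-training", "silver", "customers"))
```

Quoting every part is slightly noisier than quoting only when needed, but it avoids having to decide which names are "safe".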

---
## Lab 1: Unity Catalog Setup and Configuration (45 minutes)

### Objectives
- Create and configure catalogs
- Set up schemas with appropriate permissions
- Implement basic access controls
- Configure external storage locations

### Background
GlobalMart needs a governance structure that supports development, staging, and production environments while maintaining security and compliance.

### Exercise 1.1: Create Catalog Hierarchy

In [0]:
# Create the main catalog for governance examples.
# Uncomment the next line if the catalog does not exist yet (requires the CREATE CATALOG privilege).
#spark.sql(f"CREATE CATALOG IF NOT EXISTS {MAIN_CATALOG} COMMENT 'Main catalog for governance examples'")
spark.sql(f"USE CATALOG {MAIN_CATALOG}")

# Verify that the catalog is visible to the current user
catalogs = spark.sql("SHOW CATALOGS").collect()
catalog_names = [row.catalog for row in catalogs]

validate_step(
    MAIN_CATALOG in catalog_names,
    f"Catalog {MAIN_CATALOG} is available",
    f"Catalog {MAIN_CATALOG} not found"
)

In [0]:
# Create schemas within the catalog
schemas = [
    ("raw", "Raw data ingestion layer"),
    ("bronze", "Bronze layer - raw data with metadata"),
    ("silver", "Silver layer - cleaned and validated data"),
    ("gold", "Gold layer - business-ready aggregates"),
    ("governance", "Governance metadata and audit tables"),
    ("sandbox", "User experimentation area")
]

for schema_name, comment in schemas:
    spark.sql(f"""
        CREATE SCHEMA IF NOT EXISTS {MAIN_CATALOG}.{schema_name}
        COMMENT '{comment}'
    """)
    log_progress(f"Created schema: {schema_name}")

# Display schema structure
display(spark.sql(f"SHOW SCHEMAS IN {MAIN_CATALOG}"))

### Exercise 1.2: Set Up Role-Based Access Control

In [0]:
# Note: In a real environment, you would use actual groups
# For this lab, we'll demonstrate the GRANT syntax

# Define access patterns
access_patterns = f"""
-- Data Engineers: Full access to all schemas
-- GRANT ALL PRIVILEGES ON CATALOG {MAIN_CATALOG} TO `data-engineers`;

-- Analysts: Read access to silver and gold, write to sandbox
-- GRANT USE CATALOG ON CATALOG {MAIN_CATALOG} TO `analysts`;
-- GRANT USE SCHEMA ON SCHEMA {MAIN_CATALOG}.silver TO `analysts`;
-- GRANT SELECT ON SCHEMA {MAIN_CATALOG}.silver TO `analysts`;
-- GRANT USE SCHEMA ON SCHEMA {MAIN_CATALOG}.gold TO `analysts`;
-- GRANT SELECT ON SCHEMA {MAIN_CATALOG}.gold TO `analysts`;
-- GRANT ALL PRIVILEGES ON SCHEMA {MAIN_CATALOG}.sandbox TO `analysts`;

-- Business Users: Read access to gold only
-- GRANT USE CATALOG ON CATALOG {MAIN_CATALOG} TO `business-users`;
-- GRANT USE SCHEMA ON SCHEMA {MAIN_CATALOG}.gold TO `business-users`;
-- GRANT SELECT ON SCHEMA {MAIN_CATALOG}.gold TO `business-users`;
"""

print("Access control patterns (uncomment and modify for your environment):")
print(access_patterns)

# Create a sample access control table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.access_matrix (
        role STRING,
        catalog STRING,
        schema STRING,
        table_pattern STRING,
        permissions ARRAY<STRING>,
        business_justification STRING,
        last_reviewed DATE
    )
""")

log_progress("Access control matrix table created")
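
The role patterns above can also be kept as data and turned into GRANT statements programmatically, which makes them easier to review and diff. This is a sketch under assumptions: the `ACCESS_MATRIX` dictionary, the `build_grants` function, and the `example_catalog` name are hypothetical, and the generated statements still need to be reviewed before being passed to `spark.sql`.

```python
from typing import Dict, List

# Hypothetical role-to-privilege mapping mirroring the commented patterns above
ACCESS_MATRIX: Dict[str, Dict[str, List[str]]] = {
    "analysts": {
        "silver": ["USE SCHEMA", "SELECT"],
        "gold": ["USE SCHEMA", "SELECT"],
        "sandbox": ["ALL PRIVILEGES"],
    },
    "business-users": {"gold": ["USE SCHEMA", "SELECT"]},
}


def build_grants(catalog: str, matrix: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return one GRANT statement per (role, schema, privilege) combination."""
    statements = []
    for role, schemas in matrix.items():
        # Every role needs USE CATALOG before schema-level privileges take effect
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{role}`")
        for schema, privileges in schemas.items():
            for privilege in privileges:
                statements.append(
                    f"GRANT {privilege} ON SCHEMA {catalog}.{schema} TO `{role}`"
                )
    return statements


for statement in build_grants("example_catalog", ACCESS_MATRIX):
    print(statement)
```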

### Exercise 1.3: Create Sample Tables for Governance Testing

In [0]:
# Create sample customer table with PII
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.silver.customers (
        customer_id BIGINT,
        first_name STRING COMMENT 'PII - Personal Name',
        last_name STRING COMMENT 'PII - Personal Name',
        email STRING COMMENT 'PII - Email Address',
        phone STRING COMMENT 'PII - Phone Number',
        date_of_birth DATE COMMENT 'PII - Sensitive',
        ssn STRING COMMENT 'PII - Highly Sensitive',
        address_line1 STRING COMMENT 'PII - Address',
        city STRING,
        state STRING,
        country STRING,
        postal_code STRING,
        customer_since DATE,
        lifetime_value DECIMAL(10,2),
        preferred_contact_method STRING,
        marketing_consent BOOLEAN,
        last_updated TIMESTAMP
    )
    USING DELTA
    COMMENT 'Customer master data with PII - requires governance controls'
""")

# Create sample transactions table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.silver.transactions (
        transaction_id STRING,
        customer_id BIGINT,
        transaction_date DATE,
        amount DECIMAL(10,2),
        currency STRING,
        payment_method STRING,
        merchant_name STRING,
        merchant_category STRING,
        transaction_status STRING,
        region STRING COMMENT 'Used for row-level security'
    )
    USING DELTA
    PARTITIONED BY (transaction_date)
    COMMENT 'Transaction data requiring row-level security by region'
""")

log_progress("Sample tables created successfully")

In [0]:
# Insert sample data
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.silver.customers VALUES
    (1001, 'John', 'Doe', 'john.doe@email.com', '555-0101', '1980-01-15', '123-45-6789', 
     '123 Main St', 'New York', 'NY', 'US', '10001', '2020-01-15', 5432.10, 'email', true, current_timestamp()),
    (1002, 'Jane', 'Smith', 'jane.smith@email.com', '555-0102', '1985-05-20', '987-65-4321', 
     '456 Oak Ave', 'Los Angeles', 'CA', 'US', '90001', '2019-03-10', 8921.50, 'phone', true, current_timestamp()),
    (1003, 'Robert', 'Johnson', 'r.johnson@email.com', '555-0103', '1978-11-30', '456-78-9012', 
     '789 Pine Rd', 'London', 'LDN', 'UK', 'SW1A 1AA', '2021-06-01', 3210.75, 'email', false, current_timestamp()),
    (1004, 'Maria', 'Garcia', 'maria.g@email.com', '555-0104', '1990-03-25', '321-54-9876', 
     '321 Elm St', 'Madrid', 'MAD', 'ES', '28001', '2020-09-15', 6789.25, 'email', true, current_timestamp()),
    (1005, 'Li', 'Wang', 'li.wang@email.com', '555-0105', '1982-07-12', '654-32-1098', 
     '654 Bamboo Ln', 'Singapore', 'SG', 'SG', '238839', '2018-11-20', 12543.80, 'phone', true, current_timestamp())
""")

# Insert sample transactions
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.silver.transactions VALUES
    ('TXN001', 1001, '2024-01-15', 150.00, 'USD', 'credit_card', 'GlobalMart NYC', 'retail', 'completed', 'US'),
    ('TXN002', 1002, '2024-01-15', 89.50, 'USD', 'debit_card', 'GlobalMart LA', 'retail', 'completed', 'US'),
    ('TXN003', 1003, '2024-01-16', 220.00, 'GBP', 'credit_card', 'GlobalMart London', 'retail', 'completed', 'UK'),
    ('TXN004', 1004, '2024-01-16', 175.50, 'EUR', 'paypal', 'GlobalMart Madrid', 'retail', 'completed', 'EU'),
    ('TXN005', 1005, '2024-01-17', 320.00, 'SGD', 'credit_card', 'GlobalMart Singapore', 'retail', 'completed', 'APAC')
""")

log_progress("Sample data inserted")
display(spark.sql(f"SELECT COUNT(*) as customer_count FROM {MAIN_CATALOG}.silver.customers"))
display(spark.sql(f"SELECT COUNT(*) as transaction_count FROM {MAIN_CATALOG}.silver.transactions"))

---
## Lab 2: Fine-Grained Security Implementation (60 minutes)

### Objectives
- Implement column-level security with data masking
- Set up row-level security filters
- Create tag-based access policies
- Test security controls

### Exercise 2.1: Implement Column-Level Security with Data Masking

In [0]:
# Create masking functions for different data types
spark.sql(f"""
    CREATE OR REPLACE FUNCTION {MAIN_CATALOG}.governance.mask_email(email STRING)
    RETURNS STRING
    RETURN 
      CASE 
        WHEN is_member('data-engineers') THEN email
        WHEN is_member('analysts') THEN concat(left(email, 3), '****@****')
        ELSE 'REDACTED'
      END
""")

spark.sql(f"""
    CREATE OR REPLACE FUNCTION {MAIN_CATALOG}.governance.mask_ssn(ssn STRING)
    RETURNS STRING
    RETURN 
      CASE 
        WHEN is_member('data-engineers') THEN ssn
        WHEN is_member('hr-authorized') THEN concat('XXX-XX-', right(ssn, 4))
        ELSE 'XXX-XX-XXXX'
      END
""")

spark.sql(f"""
    CREATE OR REPLACE FUNCTION {MAIN_CATALOG}.governance.mask_phone(phone STRING)
    RETURNS STRING  
    RETURN
      CASE
        WHEN is_member('data-engineers') OR is_member('customer-service') THEN phone
        ELSE concat(left(phone, 3), '-***-****')
      END
""")

log_progress("Masking functions created")
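
Because the masking functions branch on group membership, their output is hard to check from a single account. One option is to mirror the same CASE logic in plain Python and test it with explicit group sets. The functions below are an illustrative sketch only: they take the caller's groups as a parameter instead of calling `is_member`, and they are not what Unity Catalog executes.

```python
from typing import Set


def mask_email(email: str, groups: Set[str]) -> str:
    """Mirror of governance.mask_email: full value, partial value, or redacted."""
    if "data-engineers" in groups:
        return email
    if "analysts" in groups:
        return email[:3] + "****@****"
    return "REDACTED"


def mask_ssn(ssn: str, groups: Set[str]) -> str:
    """Mirror of governance.mask_ssn: only the last four digits for HR."""
    if "data-engineers" in groups:
        return ssn
    if "hr-authorized" in groups:
        return "XXX-XX-" + ssn[-4:]
    return "XXX-XX-XXXX"


def mask_phone(phone: str, groups: Set[str]) -> str:
    """Mirror of governance.mask_phone: keep the first three characters."""
    if groups & {"data-engineers", "customer-service"}:
        return phone
    return phone[:3] + "-***-****"


print(mask_email("john.doe@email.com", {"analysts"}))
print(mask_ssn("123-45-6789", {"hr-authorized"}))
print(mask_phone("555-0101", set()))
```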

In [0]:
# Apply masking functions to columns
# Note: In production, you would apply these masks. For demo, we'll show the syntax

masking_commands = f"""
-- Apply email masking
ALTER TABLE {MAIN_CATALOG}.silver.customers
ALTER COLUMN email SET MASK {MAIN_CATALOG}.governance.mask_email;

-- Apply SSN masking
ALTER TABLE {MAIN_CATALOG}.silver.customers
ALTER COLUMN ssn SET MASK {MAIN_CATALOG}.governance.mask_ssn;

-- Apply phone masking  
ALTER TABLE {MAIN_CATALOG}.silver.customers
ALTER COLUMN phone SET MASK {MAIN_CATALOG}.governance.mask_phone;
"""

print("To apply masking in production, run:")
print(masking_commands)

# Create a view demonstrating masked data
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.gold.customers_masked AS
    SELECT 
        customer_id,
        first_name,
        last_name,
        {MAIN_CATALOG}.governance.mask_email(email) as email,
        {MAIN_CATALOG}.governance.mask_phone(phone) as phone,
        date_of_birth,
        {MAIN_CATALOG}.governance.mask_ssn(ssn) as ssn,
        city,
        state,
        country,
        customer_since,
        lifetime_value,
        marketing_consent
    FROM {MAIN_CATALOG}.silver.customers
""")

log_progress("Masked view created")
display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.gold.customers_masked LIMIT 5"))

### Exercise 2.2: Implement Row-Level Security

In [0]:
# Create row filter function based on region
spark.sql(f"""
    CREATE OR REPLACE FUNCTION {MAIN_CATALOG}.governance.region_filter(region STRING)
    RETURNS BOOLEAN
    RETURN 
      CASE
        WHEN is_member('global-admins') THEN TRUE
        WHEN is_member('us-team') AND region = 'US' THEN TRUE
        WHEN is_member('eu-team') AND region = 'EU' THEN TRUE
        WHEN is_member('uk-team') AND region = 'UK' THEN TRUE
        WHEN is_member('apac-team') AND region = 'APAC' THEN TRUE
        ELSE FALSE
      END
""")

# Create a mapping table for row-level security
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.user_region_access (
        user_email STRING,
        allowed_regions ARRAY<STRING>,
        access_level STRING,
        valid_from DATE,
        valid_until DATE
    )
""")

# Insert sample access mappings
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.user_region_access VALUES
    ('admin@globalmart.com', array('US', 'EU', 'UK', 'APAC'), 'FULL', '2024-01-01', '2025-12-31'),
    ('us.manager@globalmart.com', array('US'), 'READ', '2024-01-01', '2025-12-31'),
    ('eu.analyst@globalmart.com', array('EU', 'UK'), 'READ', '2024-01-01', '2025-12-31'),
    ('apac.lead@globalmart.com', array('APAC'), 'READ', '2024-01-01', '2025-12-31')
""")

log_progress("Row-level security functions created")

In [0]:
# Create filtered views demonstrating row-level security
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.gold.transactions_regional AS
    SELECT t.*
    FROM {MAIN_CATALOG}.silver.transactions t
    WHERE {MAIN_CATALOG}.governance.region_filter(t.region)
""")

# Create a view showing what each user can access
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.governance.effective_permissions AS
    WITH user_context AS (
        SELECT 
            current_user() as current_user,
            current_date() as check_date
    )
    SELECT 
        u.current_user,
        ura.allowed_regions,
        ura.access_level,
        COUNT(DISTINCT t.transaction_id) as accessible_transactions
    FROM user_context u
    LEFT JOIN {MAIN_CATALOG}.governance.user_region_access ura
        ON u.current_user = ura.user_email
        AND u.check_date BETWEEN ura.valid_from AND ura.valid_until
    LEFT JOIN {MAIN_CATALOG}.silver.transactions t
        ON array_contains(ura.allowed_regions, t.region)
    GROUP BY u.current_user, ura.allowed_regions, ura.access_level
""")

log_progress("Regional access views created")
display(spark.sql(f"SELECT region, COUNT(*) as transaction_count FROM {MAIN_CATALOG}.silver.transactions GROUP BY region"))
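
The view above applies `region_filter` in a WHERE clause. Unity Catalog can also bind a boolean function directly to a table as a row filter, so the restriction holds however the table is queried. The sketch below only builds the statement text; `row_filter_statement` is a hypothetical helper, the table and function names are placeholders, and it assumes the workspace supports row filters on the target table.

```python
from typing import Sequence


def row_filter_statement(table: str, function: str, columns: Sequence[str]) -> str:
    """Build an ALTER TABLE statement that binds a boolean UDF as a row filter."""
    if not columns:
        raise ValueError("A row filter needs at least one input column")
    column_list = ", ".join(columns)
    return f"ALTER TABLE {table} SET ROW FILTER {function} ON ({column_list})"


statement = row_filter_statement(
    "example_catalog.silver.transactions",
    "example_catalog.governance.region_filter",
    ["region"],
)
print(statement)
```

Printing the statement first, as the masking exercise does, lets you review it before running it with `spark.sql`.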

### Exercise 2.3: Implement Tag-Based Access Control

In [0]:
# Apply tags to classify data sensitivity
tagging_commands = f"""
-- Tag entire table
ALTER TABLE {MAIN_CATALOG}.silver.customers 
SET TAGS ('contains_pii' = 'true', 'data_classification' = 'confidential');

-- Tag specific columns with PII classification
ALTER TABLE {MAIN_CATALOG}.silver.customers
ALTER COLUMN email SET TAGS ('pii_type' = 'email', 'sensitivity' = 'medium');

ALTER TABLE {MAIN_CATALOG}.silver.customers
ALTER COLUMN ssn SET TAGS ('pii_type' = 'national_id', 'sensitivity' = 'high', 'regulated' = 'true');

ALTER TABLE {MAIN_CATALOG}.silver.customers
ALTER COLUMN date_of_birth SET TAGS ('pii_type' = 'birthdate', 'sensitivity' = 'medium');

-- Tag transaction table
ALTER TABLE {MAIN_CATALOG}.silver.transactions
SET TAGS ('contains_pii' = 'false', 'data_classification' = 'internal');
"""

# Execute tagging (requires a Unity Catalog-enabled workspace)
for command in tagging_commands.strip().split(';'):
    if command.strip():
        try:
            spark.sql(command)
            print(f"✅ Executed: {command.strip()[:50]}...")
        except Exception as e:
            # Report the failure so a real error is not mistaken for a skipped demo step
            print(f"ℹ️ Tag command not applied: {command.strip()[:50]}... ({e})")

# Create a tag inventory table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.data_classification_tags (
        tag_name STRING,
        tag_value STRING,
        description STRING,
        compliance_requirement STRING,
        handling_instructions STRING
    )
""")

spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.data_classification_tags VALUES
    ('sensitivity', 'high', 'Highly sensitive data requiring maximum protection', 'GDPR, CCPA', 'Encrypt at rest and in transit, audit all access'),
    ('sensitivity', 'medium', 'Moderately sensitive data', 'GDPR, CCPA', 'Encrypt at rest, mask in non-prod'),
    ('sensitivity', 'low', 'Low sensitivity data', 'None', 'Standard controls'),
    ('pii_type', 'national_id', 'National identification numbers (SSN, etc)', 'GDPR Article 87', 'Never display in logs, maximum encryption'),
    ('pii_type', 'email', 'Email addresses', 'GDPR, CAN-SPAM', 'Hash for analytics, mask in non-prod'),
    ('regulated', 'true', 'Subject to regulatory compliance', 'Various', 'Maintain audit trail, restricted access')
""")

log_progress("Tag-based classification implemented")
display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.governance.data_classification_tags"))
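
Splitting a SQL script on `;` leaves the leading `--` comment attached to each statement, which is why the progress messages above start with comment text. A small pre-processing step can drop comment lines first. This is a sketch, not a SQL parser: `split_sql_statements` is a hypothetical helper and it assumes no semicolon or `--` sequence appears inside a string literal.

```python
from typing import List


def split_sql_statements(script: str) -> List[str]:
    """Split a script on ';' after dropping blank lines and full-line '--' comments."""
    statements = []
    for chunk in script.split(";"):
        lines = [
            line.strip()
            for line in chunk.splitlines()
            if line.strip() and not line.strip().startswith("--")
        ]
        if lines:
            # Collapse each statement onto one line for readable logging
            statements.append(" ".join(lines))
    return statements


sample = """
-- Tag entire table
ALTER TABLE t
SET TAGS ('contains_pii' = 'true');

-- Tag a column
ALTER TABLE t
ALTER COLUMN email SET TAGS ('pii_type' = 'email');
"""
for statement in split_sql_statements(sample):
    print(statement)
```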

---
## Lab 3: Privacy and Compliance Controls (60 minutes)

### Objectives
- Implement GDPR compliance features (right to be forgotten)
- Create data retention policies
- Build consent management system
- Implement cross-border data transfer controls

### Exercise 3.1: Implement Right to be Forgotten

In [0]:
# Create deletion request tracking table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.deletion_requests (
        request_id STRING,
        customer_id BIGINT,
        request_date TIMESTAMP,
        requestor_email STRING,
        reason STRING,
        status STRING,
        completed_date TIMESTAMP,
        tables_affected ARRAY<STRING>,
        records_deleted INT
    )
""")

# Create a function to handle GDPR deletion
def process_deletion_request(customer_id, reason="GDPR Request"):
    """
    Process a customer deletion request in compliance with GDPR
    """
    import uuid
    request_id = str(uuid.uuid4())
    
    # Log the request
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.deletion_requests 
        (request_id, customer_id, request_date, requestor_email, reason, status)
        VALUES ('{request_id}', {customer_id}, current_timestamp(), current_user(), '{reason}', 'PENDING')
    """)
    
    affected_tables = []
    total_deleted = 0
    
    try:
        # Delete from every table that stores data for this customer,
        # counting matching rows first so the request log records what was removed.
        # Note: DELETE affects the current table version only; earlier Delta versions
        # keep the rows until VACUUM removes the underlying files.
        for table in [f"{MAIN_CATALOG}.silver.customers", f"{MAIN_CATALOG}.silver.transactions"]:
            matching = spark.sql(f"""
                SELECT COUNT(*) AS cnt FROM {table}
                WHERE customer_id = {customer_id}
            """).collect()[0].cnt
            spark.sql(f"""
                DELETE FROM {table}
                WHERE customer_id = {customer_id}
            """)
            affected_tables.append(table)
            total_deleted += matching
        
        # Update request status
        spark.sql(f"""
            UPDATE {MAIN_CATALOG}.governance.deletion_requests
            SET status = 'COMPLETED',
                completed_date = current_timestamp(),
                tables_affected = array({','.join([f"'{t}'" for t in affected_tables])}),
                records_deleted = {total_deleted}
            WHERE request_id = '{request_id}'
        """)
        
        log_progress(f"Successfully processed deletion request {request_id} for customer {customer_id}")
        return request_id
        
    except Exception:
        # Update request status to failed, then re-raise with the original traceback
        spark.sql(f"""
            UPDATE {MAIN_CATALOG}.governance.deletion_requests
            SET status = 'FAILED'
            WHERE request_id = '{request_id}'
        """)
        raise

# Test deletion request (don't actually run this on important data!)
# request_id = process_deletion_request(1001, "Test GDPR deletion")
print("✅ Deletion request function created")
print("⚠️ WARNING: Do not run the deletion function on important data!")
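
The deletion function interpolates `reason` directly into SQL text, so a value containing a single quote breaks the statement. One way to avoid this is to pass values as named parameters instead. The sketch below builds the statement and its arguments separately; `deletion_request_insert` is a hypothetical helper, and it assumes PySpark 3.4 or later, where `spark.sql` accepts an `args` dictionary for `:name` parameter markers.

```python
import uuid
from typing import Any, Dict, Tuple


def deletion_request_insert(table: str, customer_id: int, reason: str) -> Tuple[str, Dict[str, Any]]:
    """Return an INSERT statement with named parameter markers plus its arguments."""
    if not isinstance(customer_id, int) or isinstance(customer_id, bool):
        raise TypeError("customer_id must be an integer")
    query = f"""
        INSERT INTO {table}
        (request_id, customer_id, request_date, requestor_email, reason, status)
        VALUES (:request_id, :customer_id, current_timestamp(), current_user(), :reason, 'PENDING')
    """
    args = {
        "request_id": str(uuid.uuid4()),
        "customer_id": customer_id,
        "reason": reason,
    }
    return query, args


query, args = deletion_request_insert(
    "example_catalog.governance.deletion_requests", 1001, "Customer's GDPR request"
)
# In a notebook session (assumption: PySpark 3.4+): spark.sql(query, args=args)
print(args["reason"])
```

The table name is still interpolated because identifiers cannot be supplied through value parameters; it should come from trusted configuration such as `MAIN_CATALOG`, never from user input.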

### Exercise 3.2: Implement Data Retention Policies

In [0]:
# Create retention policy table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.retention_policies (
        policy_id STRING,
        table_name STRING,
        retention_days INT,
        deletion_column STRING,
        policy_type STRING COMMENT 'HARD_DELETE or ANONYMIZE',
        business_justification STRING,
        compliance_requirement STRING,
        is_active BOOLEAN,
        last_run_date DATE
    )
""")

# Insert retention policies
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.retention_policies VALUES
    ('POL001', '{MAIN_CATALOG}.silver.customers', 2555, 'last_updated', 'ANONYMIZE', 
     'Anonymize customer PII after 7 years of inactivity (storage limitation)', 'GDPR Article 5(1)(e)', true, null),
    ('POL002', '{MAIN_CATALOG}.silver.transactions', 2190, 'transaction_date', 'HARD_DELETE',
     'Financial records retention for 6 years', 'SOX Compliance', true, null),
    ('POL003', '{MAIN_CATALOG}.governance.custom_audit_logs', 2920, 'event_date', 'HARD_DELETE',
     'Audit logs retained for 8 years', 'Internal Policy', true, null)
""")

# Create a function to enforce retention
def enforce_retention_policy(policy_id):
    """
    Enforce a specific retention policy
    """
    policy = spark.sql(f"""
        SELECT * FROM {MAIN_CATALOG}.governance.retention_policies
        WHERE policy_id = '{policy_id}' AND is_active = true
    """).collect()
    
    if not policy:
        raise ValueError(f"Policy {policy_id} not found or inactive")
    
    p = policy[0]
    cutoff_date = f"date_sub(current_date(), {p.retention_days})"
    
    # Count the rows past the retention window before changing anything
    affected = spark.sql(f"""
        SELECT COUNT(*) AS cnt FROM {p.table_name}
        WHERE {p.deletion_column} < {cutoff_date}
    """).collect()[0].cnt
    
    if p.policy_type == 'HARD_DELETE':
        # Delete query
        action_query = f"""
            DELETE FROM {p.table_name}
            WHERE {p.deletion_column} < {cutoff_date}
        """
        
    elif p.policy_type == 'ANONYMIZE':
        # Anonymization query; the column list matches silver.customers,
        # so other tables need their own column mapping
        action_query = f"""
            UPDATE {p.table_name}
            SET first_name = 'ANONYMIZED',
                last_name = 'ANONYMIZED', 
                email = 'anonymized@anonymized.com',
                phone = '000-000-0000',
                ssn = '000-00-0000'
            WHERE {p.deletion_column} < {cutoff_date}
        """
        
    else:
        raise ValueError(f"Unsupported policy type: {p.policy_type}")
    
    # Apply the policy
    spark.sql(action_query)
    
    # Update last run date
    spark.sql(f"""
        UPDATE {MAIN_CATALOG}.governance.retention_policies
        SET last_run_date = current_date()
        WHERE policy_id = '{policy_id}'
    """)
    
    return f"Policy {policy_id} enforced successfully ({affected} rows affected)"

log_progress("Retention policy system created")
display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.governance.retention_policies"))
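
The retention check compares each row's date column against a cutoff of today minus `retention_days`. The same arithmetic can be sketched in plain Python to sanity-check a policy before it runs against real data; `retention_cutoff` and `is_past_retention` are hypothetical helper names.

```python
from datetime import date, timedelta


def retention_cutoff(today: date, retention_days: int) -> date:
    """Return the oldest date that is still inside the retention window."""
    return today - timedelta(days=retention_days)


def is_past_retention(record_date: date, today: date, retention_days: int) -> bool:
    """True when the record is strictly older than the cutoff date."""
    return record_date < retention_cutoff(today, retention_days)


today = date(2024, 3, 1)
print(retention_cutoff(today, 30))
print(is_past_retention(date(2024, 1, 30), today, 30))
```

The strict `<` comparison matches the SQL predicate in the policy function, so a record dated exactly on the cutoff is kept.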

### Exercise 3.3: Implement Consent Management

In [0]:
# Create consent tracking table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.consent_records (
        consent_id STRING,
        customer_id BIGINT,
        consent_type STRING,
        consent_given BOOLEAN,
        consent_date TIMESTAMP,
        expiry_date DATE,
        collection_method STRING,
        ip_address STRING,
        withdrawn_date TIMESTAMP
    )
""")

# Insert sample consent records
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.consent_records VALUES
    ('CON001', 1001, 'marketing_emails', true, '2024-01-15 10:00:00', '2025-01-15', 'website_form', '192.168.1.1', null),
    ('CON002', 1001, 'data_analytics', true, '2024-01-15 10:00:00', '2025-01-15', 'website_form', '192.168.1.1', null),
    ('CON003', 1002, 'marketing_emails', true, '2024-01-10 14:30:00', '2025-01-10', 'mobile_app', '192.168.1.2', null),
    ('CON004', 1002, 'data_analytics', false, '2024-01-10 14:30:00', null, 'mobile_app', '192.168.1.2', null),
    ('CON005', 1003, 'marketing_emails', false, '2024-01-16 09:15:00', null, 'email_link', '192.168.1.3', null)
""")

# Create view for current consent status
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.gold.customer_consent_status AS
    WITH latest_consent AS (
        SELECT 
            customer_id,
            consent_type,
            consent_given,
            consent_date,
            expiry_date,
            ROW_NUMBER() OVER (PARTITION BY customer_id, consent_type ORDER BY consent_date DESC) as rn
        FROM {MAIN_CATALOG}.governance.consent_records
    )
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name,
        MAX(CASE WHEN lc.consent_type = 'marketing_emails' THEN lc.consent_given ELSE false END) as marketing_consent,
        MAX(CASE WHEN lc.consent_type = 'data_analytics' THEN lc.consent_given ELSE false END) as analytics_consent,
        MAX(CASE WHEN lc.consent_type = 'third_party_sharing' THEN lc.consent_given ELSE false END) as third_party_consent
    FROM {MAIN_CATALOG}.silver.customers c
    LEFT JOIN latest_consent lc ON c.customer_id = lc.customer_id AND lc.rn = 1
    GROUP BY c.customer_id, c.first_name, c.last_name
""")

log_progress("Consent management system created")
display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.gold.customer_consent_status"))
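
The consent view relies on `ROW_NUMBER()` to keep only the most recent record per customer and consent type. The same "latest record wins" rule can be sketched in plain Python, which is useful for checking edge cases such as a consent that is later withdrawn. The tuple layout and the `latest_consent` function are assumptions made for this illustration.

```python
from typing import Dict, Iterable, Tuple

# (customer_id, consent_type, consent_given, consent_date as an ISO-style string)
ConsentRecord = Tuple[int, str, bool, str]


def latest_consent(records: Iterable[ConsentRecord]) -> Dict[Tuple[int, str], bool]:
    """Return the most recent consent decision per (customer_id, consent_type)."""
    latest: Dict[Tuple[int, str], ConsentRecord] = {}
    for record in records:
        key = (record[0], record[1])
        # ISO-formatted timestamps sort chronologically as plain strings
        if key not in latest or record[3] > latest[key][3]:
            latest[key] = record
    return {key: record[2] for key, record in latest.items()}


records = [
    (1001, "marketing_emails", True, "2024-01-15 10:00:00"),
    (1001, "marketing_emails", False, "2024-06-01 09:00:00"),  # later withdrawal
    (1002, "data_analytics", False, "2024-01-10 14:30:00"),
]
print(latest_consent(records))
```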

---
## Lab 4: Audit and Monitoring (45 minutes)

### Objectives
- Create comprehensive audit logging system
- Build data lineage tracking
- Implement compliance dashboards
- Set up alerting for policy violations

### Exercise 4.1: Create Audit Logging System

In [0]:
# Create custom audit log table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.custom_audit_logs (
        event_id STRING,
        event_time TIMESTAMP,
        user_name STRING,
        user_email STRING,
        action STRING,
        object_type STRING,
        object_name STRING,
        object_id STRING,
        result STRING,
        error_message STRING,
        client_ip STRING,
        session_id STRING,
        additional_info MAP<STRING, STRING>
    )
    PARTITIONED BY (event_date DATE)
""")

# Function to log audit events
def log_audit_event(action, object_type, object_name, object_id=None, result="SUCCESS", error_message=None, **kwargs):
    """
    Log custom audit event
    """
    import uuid
    from datetime import datetime
    
    event_id = str(uuid.uuid4())
    event_time = datetime.now()
    
    # Create additional info map
    additional_info = {k: str(v) for k, v in kwargs.items()}
    
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.custom_audit_logs
        VALUES (
            '{event_id}',
            '{event_time}',
            current_user(),
            current_user(),
            '{action}',
            '{object_type}',
            '{object_name}',
            {'NULL' if object_id is None else f"'{object_id}'"},
            '{result}',
            {'NULL' if error_message is None else f"'{error_message}'"},
            '127.0.0.1',
            '{spark.sparkContext.applicationId}',
            map({','.join([f"'{k}', '{v}'" for k, v in additional_info.items()])}),
            date('{event_time}')
        )
    """)
    
    return event_id

# Test audit logging
test_event_id = log_audit_event(
    action="READ",
    object_type="TABLE",
    object_name=f"{MAIN_CATALOG}.silver.customers",
    rows_accessed="5",
    query_type="SELECT"
)

log_progress(f"Audit logging system created. Test event ID: {test_event_id}")

In [0]:
# Create audit analysis views
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.governance.audit_summary AS
    SELECT 
        event_date,
        user_name,
        action,
        object_type,
        COUNT(*) as event_count,
        SUM(CASE WHEN result = 'SUCCESS' THEN 1 ELSE 0 END) as success_count,
        SUM(CASE WHEN result = 'FAILURE' THEN 1 ELSE 0 END) as failure_count
    FROM {MAIN_CATALOG}.governance.custom_audit_logs
    GROUP BY event_date, user_name, action, object_type
""")

# Create suspicious activity detector
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.governance.suspicious_activity AS
    SELECT 
        user_name,
        event_date,
        COUNT(DISTINCT object_name) as unique_objects_accessed,
        COUNT(*) as total_accesses,
        COUNT(DISTINCT client_ip) as unique_ips,
        COLLECT_SET(object_name) as accessed_objects
    FROM {MAIN_CATALOG}.governance.custom_audit_logs
    WHERE action IN ('READ', 'DOWNLOAD', 'EXPORT')
        AND event_date >= current_date() - 7
    GROUP BY user_name, event_date
    HAVING COUNT(*) > 100 OR COUNT(DISTINCT object_name) > 20
""")

log_progress("Audit analysis views created")
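
The `suspicious_activity` view flags a user and day when read-type events exceed 100 accesses or touch more than 20 distinct objects. The thresholds can be exercised without a cluster using a plain Python mirror of that HAVING clause. This is a sketch: `flag_suspicious` and the event tuple layout are hypothetical, and the 7-day window from the view is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (user_name, event_date, action, object_name)
AuditEvent = Tuple[str, str, str, str]
READ_ACTIONS = {"READ", "DOWNLOAD", "EXPORT"}


def flag_suspicious(
    events: Iterable[AuditEvent], max_accesses: int = 100, max_objects: int = 20
) -> List[Tuple[str, str]]:
    """Return (user, day) pairs whose read activity exceeds either threshold."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    objects: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for user, day, action, obj in events:
        if action in READ_ACTIONS:
            counts[(user, day)] += 1
            objects[(user, day)].add(obj)
    return sorted(
        key
        for key in counts
        if counts[key] > max_accesses or len(objects[key]) > max_objects
    )


events = (
    [("alice", "2024-01-15", "READ", "t1")] * 101
    + [("bob", "2024-01-15", "READ", "t1")] * 5
    + [("carol", "2024-01-15", "EXPORT", f"t{i}") for i in range(21)]
)
print(flag_suspicious(events))
```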

### Exercise 4.2: Implement Data Lineage Tracking

In [0]:
# Create lineage tracking table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.data_lineage (
        lineage_id STRING,
        source_table STRING,
        target_table STRING,
        transformation_type STRING,
        transformation_logic STRING,
        created_by STRING,
        created_date TIMESTAMP,
        job_id STRING,
        is_active BOOLEAN
    )
""")

# Create sample lineage records
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.data_lineage VALUES
    ('LIN001', '{MAIN_CATALOG}.bronze.raw_customers', '{MAIN_CATALOG}.silver.customers', 
     'CLEANSE', 'Remove duplicates, validate emails, standardize formats', 
     'etl_job_user', '2024-01-01 00:00:00', 'JOB_001', true),
    
    ('LIN002', '{MAIN_CATALOG}.bronze.raw_transactions', '{MAIN_CATALOG}.silver.transactions',
     'ENRICH', 'Join with customer data, calculate derived fields',
     'etl_job_user', '2024-01-01 00:00:00', 'JOB_002', true),
    
    ('LIN003', '{MAIN_CATALOG}.silver.customers', '{MAIN_CATALOG}.gold.customer_360',
     'AGGREGATE', 'Create customer 360 view with lifetime metrics',
     'analytics_user', '2024-01-05 00:00:00', 'JOB_003', true),
    
    ('LIN004', '{MAIN_CATALOG}.silver.transactions', '{MAIN_CATALOG}.gold.sales_summary',
     'AGGREGATE', 'Daily sales rollup by region and product',
     'analytics_user', '2024-01-05 00:00:00', 'JOB_004', true)
""")

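# Create lineage visualization view
# Note: WITH RECURSIVE needs a runtime that supports recursive CTEs; older Spark and Databricks Runtime versions reject it
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.governance.lineage_graph AS
    WITH RECURSIVE lineage_tree AS (
        -- Base case: edges whose source is a bronze table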
        SELECT 
            target_table as table_name,
            source_table,
            target_table,
            transformation_type,
            1 as level,
            CAST(target_table AS STRING) as path
        FROM {MAIN_CATALOG}.governance.data_lineage
        WHERE source_table LIKE '%bronze%'
        
        UNION ALL
        
        -- Recursive case
        SELECT 
            dl.target_table as table_name,
            dl.source_table,
            dl.target_table,
            dl.transformation_type,
            lt.level + 1 as level,
            CONCAT(lt.path, ' -> ', dl.target_table) as path
        FROM {MAIN_CATALOG}.governance.data_lineage dl
        JOIN lineage_tree lt ON dl.source_table = lt.target_table
        WHERE lt.level < 5  -- Prevent infinite recursion
    )
    SELECT DISTINCT * FROM lineage_tree
    ORDER BY level, table_name
""")

log_progress("Data lineage tracking implemented")
display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.governance.data_lineage"))
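
Where recursive SQL is not available, the same lineage paths can be derived in Python from the rows of `data_lineage` with a breadth-first walk. The sketch below works on plain (source, target) pairs; `downstream_paths` is a hypothetical helper, and the shortened table names stand in for the fully qualified names stored in the table.

```python
from collections import deque
from typing import Dict, List, Tuple


def downstream_paths(edges: List[Tuple[str, str]], root: str, max_depth: int = 5) -> List[str]:
    """Return every path from root to a downstream table, at most max_depth hops long."""
    children: Dict[str, List[str]] = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    paths: List[str] = []
    queue = deque([(root, [root])])
    while queue:
        table, path = queue.popleft()
        for child in children.get(table, []):
            # Skip cycles and chains deeper than the limit
            if child in path or len(path) > max_depth:
                continue
            new_path = path + [child]
            paths.append(" -> ".join(new_path))
            queue.append((child, new_path))
    return paths


edges = [
    ("bronze.raw_customers", "silver.customers"),
    ("bronze.raw_transactions", "silver.transactions"),
    ("silver.customers", "gold.customer_360"),
    ("silver.transactions", "gold.sales_summary"),
]
for path in downstream_paths(edges, "bronze.raw_customers"):
    print(path)
```

In a notebook, the edge list could come from `spark.sql(...).collect()` on the lineage table; the cycle check plays the same role as the level limit in the SQL view.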

### Exercise 4.3: Create Compliance Dashboard

In [0]:
# Create compliance metrics table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.compliance_metrics (
        metric_date DATE,
        metric_name STRING,
        metric_value DOUBLE,
        metric_status STRING,
        threshold_value DOUBLE,
        measurement_details STRING
    )
""")

# Function to calculate compliance metrics
def calculate_compliance_metrics():
    """
    Calculate daily compliance metrics
    """
    metrics_date = datetime.now().date()
    
    # Metric 1: Tables with proper classification
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.compliance_metrics
        SELECT 
            current_date() as metric_date,
            'tables_classified_percentage' as metric_name,
            100.0 as metric_value,  -- In real scenario, calculate from information_schema
            CASE WHEN 100.0 >= 95 THEN 'PASS' ELSE 'FAIL' END as metric_status,
            95.0 as threshold_value,
            'Percentage of tables with data classification tags' as measurement_details
    """)
    
    # Metric 2: Active consent percentage
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.compliance_metrics
        WITH consent_stats AS (
            SELECT 
                COUNT(DISTINCT customer_id) as total_customers,
                SUM(CASE WHEN marketing_consent = true THEN 1 ELSE 0 END) as consented_customers
            FROM {MAIN_CATALOG}.gold.customer_consent_status
        )
        SELECT 
            current_date() as metric_date,
            'active_consent_percentage' as metric_name,
            (consented_customers * 100.0 / NULLIF(total_customers, 0)) as metric_value,
            'INFO' as metric_status,
            0.0 as threshold_value,
            'Percentage of customers with active marketing consent' as measurement_details
        FROM consent_stats
    """)
    
    # Metric 3: Data retention compliance
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.compliance_metrics
        SELECT 
            current_date() as metric_date,
            'retention_policies_current' as metric_name,
            100.0 as metric_value,  -- Simplified for demo
            CASE WHEN 100.0 = 100 THEN 'PASS' ELSE 'FAIL' END as metric_status,
            100.0 as threshold_value,
            'Percentage of retention policies run within SLA' as measurement_details
    """)
    
    log_progress("Compliance metrics calculated")

# Calculate metrics
calculate_compliance_metrics()

# Create compliance dashboard view
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.governance.compliance_dashboard AS
    SELECT 
        metric_date,
        metric_name,
        metric_value,
        threshold_value,
        metric_status,
        CASE 
            WHEN metric_status = 'PASS' THEN '✅'
            WHEN metric_status = 'FAIL' THEN '❌'
            ELSE 'ℹ️'
        END as status_icon,
        measurement_details
    FROM {MAIN_CATALOG}.governance.compliance_metrics
    WHERE metric_date >= current_date() - 30
    ORDER BY metric_date DESC, metric_name
""")

display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.governance.compliance_dashboard WHERE metric_date = current_date()"))
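
The demo hard-codes `100.0` for the classification and retention metrics. A minimal sketch of how such a percentage and its status might be derived from raw counts is shown below, in plain Python so it runs without a cluster. The `Metric` dataclass, the `build_metric` helper, and the sample counts are hypothetical and not part of the lab; reporting `None` with status `INFO` for an empty denominator is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    name: str
    value: Optional[float]  # None when there is nothing to measure
    status: str             # 'PASS', 'FAIL', or 'INFO'
    threshold: float


def build_metric(name: str, passing: int, total: int, threshold: float) -> Metric:
    """Turn raw counts into a percentage metric with a PASS/FAIL status.

    Hypothetical helper: a threshold of 0.0 is treated as informational,
    mirroring the 'active_consent_percentage' row in the lab.
    """
    if total == 0:
        # Avoid dividing by zero when the source table is empty.
        return Metric(name, None, "INFO", threshold)
    value = passing * 100.0 / total
    if threshold == 0.0:
        status = "INFO"
    else:
        status = "PASS" if value >= threshold else "FAIL"
    return Metric(name, value, status, threshold)


# Illustrative counts only; in the lab these would come from catalog queries.
classified = build_metric("tables_classified_percentage", passing=19, total=20, threshold=95.0)
consent = build_metric("active_consent_percentage", passing=30, total=120, threshold=0.0)
empty = build_metric("retention_policies_current", passing=0, total=0, threshold=100.0)
```

Each resulting `Metric` maps directly onto one row of `compliance_metrics`, so the counts could be swapped for query results without changing the status logic.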

---
## Lab 5: Governance Automation (60 minutes)

### Objectives
- Implement policy as code
- Create automated compliance checks
- Build self-service data access workflows
- Set up governance health monitoring

### Exercise 5.1: Implement Policy as Code

In [0]:
# Create policy definitions table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.policy_definitions (
        policy_id STRING,
        policy_name STRING,
        policy_type STRING,
        policy_rule STRING,
        enforcement_level STRING COMMENT 'BLOCK, WARN, or AUDIT',
        is_active BOOLEAN,
        created_date TIMESTAMP,
        created_by STRING
    )
""")

# Insert policy definitions
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.policy_definitions VALUES
    ('P001', 'PII Access Restriction', 'ACCESS_CONTROL', 
     'Users must be in pii-readers group to access tables tagged with pii=true',
     'BLOCK', true, current_timestamp(), 'security_admin'),
     
    ('P002', 'Cross-Region Data Transfer', 'DATA_MOVEMENT',
     'Data tagged with region-specific cannot be accessed outside that region',
     'BLOCK', true, current_timestamp(), 'security_admin'),
     
    ('P003', 'Bulk Data Export Limit', 'EXPORT_CONTROL',
     'Exports exceeding 10000 rows require approval',
     'WARN', true, current_timestamp(), 'security_admin'),
     
    ('P004', 'Sensitive Data Masking', 'DATA_PROTECTION',
     'SSN and credit card fields must be masked for non-privileged users',
     'BLOCK', true, current_timestamp(), 'security_admin'),
     
    ('P005', 'Audit Trail Requirement', 'COMPLIANCE',
     'All access to financial data must be logged',
     'AUDIT', true, current_timestamp(), 'compliance_officer')
""")

# Create policy evaluation function
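def evaluate_policy(policy_id, context):
    """
    Evaluate a policy against a request context.

    context keys read: 'table_tags' (dict), 'user_groups' (list), 'export_rows' (int).
    Only P001 and P003 have evaluation logic here; every other policy passes.
    Returns: (result: bool, message: str)
    """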
    policy = spark.sql(f"""
        SELECT * FROM {MAIN_CATALOG}.governance.policy_definitions
        WHERE policy_id = '{policy_id}' AND is_active = true
    """).collect()
    
    if not policy:
        return True, "Policy not found or inactive"
    
    p = policy[0]
    
    # Simplified policy evaluation logic
    if p.policy_id == 'P001':
        # Check PII access
        if context.get('table_tags', {}).get('pii') == 'true':
            if 'pii-readers' not in context.get('user_groups', []):
                return False, "Access denied: PII data requires pii-readers group membership"
    
    elif p.policy_id == 'P003':
        # Check export limits
        if context.get('export_rows', 0) > 10000:
            return False, "Warning: Large export requires approval"
    
    return True, "Policy check passed"

# Test policy evaluation
test_context = {
    'table_tags': {'pii': 'true'},
    'user_groups': ['analysts'],
    'export_rows': 5000
}

result, message = evaluate_policy('P001', test_context)
print(f"Policy P001 evaluation: {result} - {message}")

log_progress("Policy as code framework created")
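
`evaluate_policy` returns a bare pass/fail and does not consult the `enforcement_level` column. One way the three levels might be interpreted is sketched below: `BLOCK` denies, `WARN` allows but surfaces a warning, and `AUDIT` allows and expects the caller to record the event. The `Decision` type, the `apply_enforcement` function, and these exact semantics are assumptions for illustration, not behavior defined by the lab or by Unity Catalog.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    action: str   # 'ALLOW', 'DENY', 'WARN', or 'LOG'
    message: str


def apply_enforcement(rule_violated: bool, enforcement_level: str, message: str) -> Decision:
    """Map a rule violation onto an outcome based on the policy's enforcement level."""
    if not rule_violated:
        return Decision(True, "ALLOW", "Policy check passed")
    level = enforcement_level.upper()
    if level == "BLOCK":
        return Decision(False, "DENY", message)
    if level == "WARN":
        # The request proceeds, but the caller should show the warning.
        return Decision(True, "WARN", message)
    if level == "AUDIT":
        # The request proceeds; the caller is expected to write an audit record.
        return Decision(True, "LOG", message)
    raise ValueError(f"Unknown enforcement level: {enforcement_level}")


# P001 is BLOCK: a user outside pii-readers touching a pii-tagged table is denied.
pii_decision = apply_enforcement(True, "BLOCK", "PII data requires pii-readers group membership")
# P003 is WARN: a 25,000-row export exceeds the 10,000-row limit and is flagged.
export_decision = apply_enforcement(25000 > 10000, "WARN", "Large export requires approval")
```

Separating "was the rule violated" from "what happens next" keeps the rule checks reusable if a policy's enforcement level is later changed in the table.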

### Exercise 5.2: Create Automated Compliance Checks

In [0]:
# Create compliance check results table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.compliance_check_results (
        check_id STRING,
        check_name STRING,
        check_type STRING,
        object_name STRING,
        check_status STRING,
        check_message STRING,
        severity STRING,
        check_timestamp TIMESTAMP,
        remediation_required BOOLEAN
    )
""")

# Automated compliance check function
def run_compliance_checks():
    """
    Run automated compliance checks across the catalog
    """
    import uuid
    check_timestamp = datetime.now()
    
    # Check 1: Tables without classification tags
    untagged_tables = spark.sql(f"""
        SELECT table_name 
        FROM {MAIN_CATALOG}.information_schema.tables
        WHERE table_schema NOT IN ('information_schema', 'governance')
        LIMIT 10  -- For demo purposes
    """).collect()
    
    # Simulate checking for missing tags
    for table in untagged_tables[:2]:  # Simulate some tables missing tags
        spark.sql(f"""
            INSERT INTO {MAIN_CATALOG}.governance.compliance_check_results VALUES
            ('{uuid.uuid4()}', 'Missing Data Classification', 'TAGGING', 
             '{table.table_name}', 'FAIL', 'Table missing required classification tags',
             'MEDIUM', TIMESTAMP '{check_timestamp}', true)
        """)
    
    # Check 2: Unused access grants
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.compliance_check_results
        SELECT
            '{uuid.uuid4()}' as check_id,
            'Unused Access Grant Review' as check_name,
            'ACCESS_REVIEW' as check_type,
            'Multiple Tables' as object_name,
            'WARN' as check_status,
            '5 users have table access but no activity in 90 days' as check_message,
            'LOW' as severity,
            TIMESTAMP '{check_timestamp}' as check_timestamp,
            false as remediation_required
    """)
    
    # Check 3: Sensitive data without masking
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.compliance_check_results VALUES
        ('{uuid.uuid4()}', 'Sensitive Data Protection', 'DATA_PROTECTION',
         '{MAIN_CATALOG}.silver.customers.ssn', 'PASS', 
         'SSN column has appropriate masking function applied',
         'HIGH', TIMESTAMP '{check_timestamp}', false)
    """)
    
    log_progress("Compliance checks completed")

# Run compliance checks
run_compliance_checks()

# Display results
display(spark.sql(f"""
    SELECT check_name, check_status, object_name, check_message, severity
    FROM {MAIN_CATALOG}.governance.compliance_check_results
    WHERE check_timestamp >= current_timestamp() - INTERVAL 1 HOUR
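    -- Rank severity explicitly; sorting the strings alphabetically would put MEDIUM before HIGH
    ORDER BY CASE severity WHEN 'HIGH' THEN 1 WHEN 'MEDIUM' THEN 2 ELSE 3 END, check_status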
"""))

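Check 1 in this exercise simulates missing tags by flagging the first two tables returned. A sketch of the actual comparison is shown below, assuming you have already fetched the list of table names and a mapping from table name to its tag names (for example, from the catalog's information schema). The function name `find_unclassified`, the default tag name `data_classification`, and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_unclassified(
    tables: Iterable[str],
    tags_by_table: Dict[str, Set[str]],
    required_tag: str = "data_classification",
) -> List[str]:
    """Return tables that lack the required tag, sorted for stable output."""
    return sorted(
        t for t in tables
        if required_tag not in tags_by_table.get(t, set())
    )


# Illustrative inputs: one table is tagged, one has other tags only, one has none.
all_tables = ["silver.customers", "silver.orders", "gold.sales_summary"]
tags = {
    "silver.customers": {"data_classification", "pii"},
    "gold.sales_summary": {"owner"},
}
missing = find_unclassified(all_tables, tags)
print(missing)
```

Each name in `missing` would then become one `FAIL` row in `compliance_check_results`, replacing the `untagged_tables[:2]` placeholder.
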
### Exercise 5.3: Build Self-Service Data Access

In [0]:
# Create access request workflow tables
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.access_requests (
        request_id STRING,
        requestor_email STRING,
        requested_object STRING,
        access_type STRING,
        business_justification STRING,
        risk_score INT,
        request_status STRING,
        request_date TIMESTAMP,
        approved_by STRING,
        approval_date TIMESTAMP,
        expiry_date DATE,
        auto_approved BOOLEAN
    )
""")

# Create approval rules table
spark.sql(f"""
    CREATE OR REPLACE TABLE {MAIN_CATALOG}.governance.approval_rules (
        rule_id STRING,
        object_pattern STRING,
        access_type STRING,
        max_risk_score INT,
        auto_approve BOOLEAN,
        approval_group STRING,
        max_duration_days INT
    )
""")

# Insert approval rules
spark.sql(f"""
    INSERT INTO {MAIN_CATALOG}.governance.approval_rules VALUES
    ('RULE001', '%.sandbox.%', 'SELECT', 100, true, NULL, 90),
    ('RULE002', '%.gold.%', 'SELECT', 30, true, NULL, 30),
    ('RULE003', '%.silver.%', 'SELECT', 50, false, 'data-stewards', 30),
    ('RULE004', '%.bronze.%', 'ALL', 0, false, 'data-engineers', 7)
""")

# Self-service access request function
def request_data_access(requested_object, access_type, justification):
    """
    Submit a self-service data access request
    """
    import uuid
    request_id = str(uuid.uuid4())
    
    # Calculate risk score (simplified)
    risk_score = 0
    if 'customer' in requested_object.lower() or 'pii' in requested_object.lower():
        risk_score += 40
    if access_type != 'SELECT':
        risk_score += 30
    if 'bronze' in requested_object.lower():
        risk_score += 20
    
    # Check approval rules
    rules = spark.sql(f"""
        SELECT * FROM {MAIN_CATALOG}.governance.approval_rules
        WHERE '{requested_object}' LIKE object_pattern
          AND access_type = '{access_type}'
          AND max_risk_score >= {risk_score}
    """).collect()
    
    auto_approved = False
    approval_status = 'PENDING'
    approved_by = None
    
    if rules and rules[0].auto_approve:
        auto_approved = True
        approval_status = 'APPROVED'
        approved_by = 'SYSTEM_AUTO_APPROVAL'
    
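    # Insert request. Escape backslashes and single quotes so a free-text
    # justification such as "Q3's review" cannot break the SQL string literal.
    safe_justification = justification.replace("\\", "\\\\").replace("'", "\\'")
    spark.sql(f"""
        INSERT INTO {MAIN_CATALOG}.governance.access_requests VALUES
        ('{request_id}', current_user(), '{requested_object}', '{access_type}',
         '{safe_justification}', {risk_score}, '{approval_status}', current_timestamp(),
         {f"'{approved_by}'" if approved_by else 'NULL'},
         {"current_timestamp()" if auto_approved else 'NULL'},
         current_date() + 30, {auto_approved})
    """)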
    
    return request_id, auto_approved, risk_score

# Test self-service access
request_id, auto_approved, risk_score = request_data_access(
    f"{MAIN_CATALOG}.gold.sales_summary",
    "SELECT",
    "Need access for quarterly business review dashboard"
)

print(f"Access request submitted: {request_id}")
print(f"Risk score: {risk_score}")
print(f"Auto-approved: {auto_approved}")

# Display requests submitted in the last hour (pending and approved)
display(spark.sql(f"""
    SELECT request_id, requested_object, access_type, risk_score, 
           request_status, auto_approved
    FROM {MAIN_CATALOG}.governance.access_requests
    WHERE request_date >= current_timestamp() - INTERVAL 1 HOUR
"""))

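The scoring and rule lookup in `request_data_access` can be exercised without a cluster. The sketch below re-implements both steps in plain Python, using the same weights and the four rules from the `approval_rules` insert. `like_match` is a hypothetical helper that approximates SQL `LIKE` (only the `%` and `_` wildcards, no escape handling), the catalog name `main` is a placeholder, and preferring the matching rule with the lowest `max_risk_score` is an assumption, since the SQL version takes whichever row is returned first.

```python
import re
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    rule_id: str
    object_pattern: str
    access_type: str
    max_risk_score: int
    auto_approve: bool


RULES = [
    Rule("RULE001", "%.sandbox.%", "SELECT", 100, True),
    Rule("RULE002", "%.gold.%", "SELECT", 30, True),
    Rule("RULE003", "%.silver.%", "SELECT", 50, False),
    Rule("RULE004", "%.bronze.%", "ALL", 0, False),
]


def like_match(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def risk_score(requested_object: str, access_type: str) -> int:
    """Same weights as the lab: +40 customer/PII, +30 non-SELECT, +20 bronze."""
    name = requested_object.lower()
    score = 0
    if "customer" in name or "pii" in name:
        score += 40
    if access_type != "SELECT":
        score += 30
    if "bronze" in name:
        score += 20
    return score


def match_rule(requested_object: str, access_type: str) -> Optional[Rule]:
    """Return the applicable rule, or None when the request needs manual review."""
    score = risk_score(requested_object, access_type)
    candidates = [
        r for r in RULES
        if like_match(requested_object, r.object_pattern)
        and r.access_type == access_type
        and r.max_risk_score >= score
    ]
    # Assumption: prefer the tightest rule when several match.
    return min(candidates, key=lambda r: r.max_risk_score, default=None)


summary_rule = match_rule("main.gold.sales_summary", "SELECT")
customer_rule = match_rule("main.gold.customer_360", "SELECT")
```

Here `summary_rule` is `RULE002`, so the request would be auto-approved, while `customer_rule` is `None` because the customer keyword raises the score to 40, above the gold-layer limit of 30, leaving the request pending.
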
### Exercise 5.4: Governance Health Dashboard

In [0]:
# Create governance health metrics view
spark.sql(f"""
    CREATE OR REPLACE VIEW {MAIN_CATALOG}.governance.health_dashboard AS
    WITH metrics AS (
        -- Metric 1: Data Classification Coverage
        SELECT 
            'Data Classification Coverage' as metric_name,
            95.0 as metric_value,  -- In production, calculate from actual tags
            'PERCENT' as metric_unit,
            CASE WHEN 95.0 >= 90 THEN 'GREEN' ELSE 'RED' END as status
        
        UNION ALL
        
        -- Metric 2: Share of defined policies that are active (status is fixed at GREEN for the demo)
        SELECT 
            'Active Policy Compliance' as metric_name,
            COUNT(CASE WHEN is_active THEN 1 END) * 100.0 / COUNT(*) as metric_value,
            'PERCENT' as metric_unit,
            'GREEN' as status
        FROM {MAIN_CATALOG}.governance.policy_definitions
        
        UNION ALL
        
        -- Metric 3: Access Request Processing Time
        SELECT 
            'Avg Access Request Time' as metric_name,
            2.5 as metric_value,  -- In production, calculate actual average
            'HOURS' as metric_unit,
            CASE WHEN 2.5 <= 4 THEN 'GREEN' ELSE 'YELLOW' END as status
        
        UNION ALL
        
        -- Metric 4: Failed Compliance Checks
        SELECT 
            'Failed Compliance Checks (24h)' as metric_name,
            COUNT(*) as metric_value,
            'COUNT' as metric_unit,
            CASE 
                WHEN COUNT(*) = 0 THEN 'GREEN'
                WHEN COUNT(*) <= 5 THEN 'YELLOW'
                ELSE 'RED'
            END as status
        FROM {MAIN_CATALOG}.governance.compliance_check_results
        WHERE check_status = 'FAIL'
          AND check_timestamp >= current_timestamp() - INTERVAL 24 HOURS
    )
    SELECT 
        metric_name,
        metric_value,
        metric_unit,
        status,
        CASE 
            WHEN status = 'GREEN' THEN '🟢'
            WHEN status = 'YELLOW' THEN '🟡'
            WHEN status = 'RED' THEN '🔴'
        END as status_indicator
    FROM metrics
""")

# Display governance health dashboard
display(spark.sql(f"SELECT * FROM {MAIN_CATALOG}.governance.health_dashboard"))

log_progress("Governance automation complete!")
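
Metric 3 in the health view uses a placeholder of 2.5 hours. A sketch of how the average might be computed from `(request_date, approval_date)` pairs, skipping requests that are still pending, is shown below. The function names and sample timestamps are hypothetical; the 4-hour GREEN/YELLOW cut-off is the one used in the view.

```python
from datetime import datetime
from typing import Iterable, Optional, Tuple


def average_approval_hours(
    requests: Iterable[Tuple[datetime, Optional[datetime]]]
) -> Optional[float]:
    """Average hours from request to approval, ignoring pending requests."""
    durations = [
        (approved - requested).total_seconds() / 3600.0
        for requested, approved in requests
        if approved is not None
    ]
    if not durations:
        return None  # nothing has been approved yet
    return sum(durations) / len(durations)


def request_time_status(avg_hours: float) -> str:
    """Apply the view's rule: 4 hours or less is GREEN, otherwise YELLOW."""
    return "GREEN" if avg_hours <= 4 else "YELLOW"


# Illustrative data only: two approved requests and one still pending.
sample_requests = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 10, 0)),   # 1 hour
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 13, 0)),   # 4 hours
    (datetime(2024, 1, 8, 11, 0), None),                          # pending
]
avg = average_approval_hours(sample_requests)
status = request_time_status(avg)
```

In the lab's schema the same pairs are available as `request_date` and `approval_date` in `access_requests`, so the calculation could also be expressed as an aggregate inside the view.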

---
## Lab Summary and Cleanup

Congratulations! You've successfully implemented a comprehensive data governance solution including:

- ✅ Unity Catalog setup with proper hierarchy
- ✅ Fine-grained security controls (column/row level)
- ✅ Privacy compliance features (GDPR, consent management)
- ✅ Comprehensive audit and lineage tracking
- ✅ Automated governance with self-service capabilities

### Key Takeaways
1. Unity Catalog provides centralized governance across all data assets
2. Fine-grained controls enable secure data sharing
3. Automation reduces governance overhead
4. Compliance requires continuous monitoring
5. Self-service empowers users while maintaining security

### Next Steps
- Review the governance patterns implemented
- Consider how these would apply to your organization
- Prepare for Module 6's capstone project
- Explore Unity Catalog documentation for advanced features

In [0]:
# Optional: Cleanup (DO NOT RUN if you want to keep your work)
# spark.sql(f"DROP CATALOG IF EXISTS {MAIN_CATALOG} CASCADE")
# print("Cleanup complete")

print("\n🎉 Module 5 Labs Complete! 🎉")
print(f"\nYour governance catalog '{MAIN_CATALOG}' contains all the implemented features.")
print("Ready for Module 6: Capstone Project!")