# Databricks Compliance & Audit Trail Monitor

## Overview

This notebook provides **comprehensive compliance monitoring and audit trail analysis** for your Databricks workspace. It identifies sensitive data, tracks access patterns, monitors data retention compliance, validates regulatory requirements (GDPR, CCPA, SOX), detects anomalies, and generates executive compliance reports.

**✨ Enterprise-grade compliance monitoring with PII detection, audit trail analysis, retention tracking, regulatory compliance checks, and anomaly detection.**

---

## Features

### Sensitive Data Discovery
* **PII Detection**: Automatic identification of sensitive columns (SSN, email, phone, etc.)
* **Sensitivity Classification**: HIGH/MEDIUM/LOW risk categorization
* **Table-Level Aggregation**: Sensitive column counts per table
* **Column Pattern Matching**: Keyword-based PII detection
* **Data Type Analysis**: Type-based sensitivity assessment
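
As a rough illustration of how keyword-based detection and the HIGH/MEDIUM/LOW classification could fit together, the sketch below matches column names against two keyword lists. The function name `classify_column` and the shortened keyword lists are assumptions made for this example, not the notebook's exact configuration.

```python
# Illustrative keyword lists (assumed subset; the real list is configurable).
HIGH_RISK_KEYWORDS = ["ssn", "email", "phone", "credit_card", "passport"]
MEDIUM_RISK_KEYWORDS = ["name", "user", "customer", "employee"]


def classify_column(column_name: str) -> str:
    """Return HIGH, MEDIUM, or LOW based on substring matches in the column name."""
    lowered = column_name.lower()
    # HIGH is checked first so that "customer_email" is not downgraded to MEDIUM.
    if any(keyword in lowered for keyword in HIGH_RISK_KEYWORDS):
        return "HIGH"
    if any(keyword in lowered for keyword in MEDIUM_RISK_KEYWORDS):
        return "MEDIUM"
    return "LOW"


columns = ["customer_email", "first_name", "order_total"]
levels = {column: classify_column(column) for column in columns}
print(levels)
```

Substring matching is deliberately broad, so it can produce false positives; results are best treated as candidates for review rather than a confirmed inventory.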

### Audit Trail Analysis
* **Workspace Activity Tracking**: Complete audit log analysis
* **Event Categorization**: Data access, modifications, permission changes, exports
* **User Activity Monitoring**: Access patterns by user
* **Permission Change Tracking**: Grant/revoke operations
* **Data Export Monitoring**: Download and export activities
* **Failed Access Attempts**: Security incident detection

### Data Retention Compliance
* **Retention Policy Validation**: Tables exceeding maximum retention (7 years)
* **Stale Data Identification**: Tables not modified in 180+ days
* **Compliance Status**: COMPLIANT, WITHIN_POLICY, EXCEEDS_POLICY
* **Sensitive Data Retention**: Cross-reference with PII tables
* **Staleness Categories**: 6mo-1yr, 1-2yrs, 2-5yrs, 5+yrs
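
The staleness categories above can be read as simple day-count buckets. The following is a minimal sketch under assumed boundaries (180, 365, 730, and 1825 days); the function name `staleness_category` and the `ACTIVE` and `UNKNOWN` labels are hypothetical and only serve the example.

```python
from typing import Optional


def staleness_category(days_since_modified: Optional[int]) -> str:
    """Map days since last modification to a staleness bucket."""
    if days_since_modified is None:
        return "UNKNOWN"      # no modification timestamp available
    if days_since_modified < 180:
        return "ACTIVE"       # below the 180-day stale threshold
    if days_since_modified < 365:
        return "6mo-1yr"
    if days_since_modified < 730:
        return "1-2yrs"
    if days_since_modified < 1825:
        return "2-5yrs"
    return "5+yrs"


for days in (30, 200, 400, 1000, 3000):
    print(days, staleness_category(days))
```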

### Regulatory Compliance
* **GDPR Compliance**: Right to erasure, data portability, consent tracking, retention limits
* **CCPA Compliance**: Right to know, right to delete, opt-out tracking
* **SOX Compliance**: Financial data controls, audit trail completeness, segregation of duties
* **HIPAA Considerations**: Healthcare data protection (if applicable)
* **Violation Detection**: Automated compliance gap identification

### Anomaly Detection
* **After-Hours Access**: Activity outside 6 AM - 8 PM
* **High-Frequency Access**: Users with >1000 events/day
* **Failed Access Attempts**: Unauthorized access detection
* **Geographic Analysis**: Source IP tracking and multi-user IPs
* **Unusual Patterns**: Behavioral anomaly identification
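
To make the first two rules concrete, here is a small sketch in plain Python. It assumes event timestamps are already expressed in the local business timezone and that each event is a `(user, timestamp)` pair; the helper names are hypothetical and the notebook's own implementation may differ.

```python
from collections import Counter
from datetime import datetime

BUSINESS_START_HOUR = 6           # 6 AM
BUSINESS_END_HOUR = 20            # 8 PM
HIGH_FREQUENCY_THRESHOLD = 1000   # events per user per day


def is_after_hours(event_time: datetime) -> bool:
    """True when the event falls outside the 6 AM - 8 PM window."""
    return not (BUSINESS_START_HOUR <= event_time.hour < BUSINESS_END_HOUR)


def high_frequency_users(events, threshold=HIGH_FREQUENCY_THRESHOLD):
    """Return {(user, date): count} for user-days with more than `threshold` events."""
    counts = Counter((user, timestamp.date()) for user, timestamp in events)
    return {key: count for key, count in counts.items() if count > threshold}


events = [
    ("alice@example.com", datetime(2026, 2, 17, 23, 15)),
    ("bob@example.com", datetime(2026, 2, 17, 10, 0)),
]
after_hours_users = [user for user, timestamp in events if is_after_hours(timestamp)]
print(after_hours_users)
```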

### Compliance Reporting
* **Compliance Score**: Weighted overall score (0-100%)
* **Metric Breakdown**: Retention, sensitive data, audit coverage, access control
* **Executive Summary**: Key findings and recommendations
* **Visualizations**: Charts for compliance metrics, retention status, audit events
* **Export Capabilities**: Delta tables, Excel, JSON formats
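
One plausible way to combine the metrics into a single score is a weighted average. The sketch below uses the weights stated in the version notes (retention 30%, sensitive data 25%, audit coverage 25%, access control 20%); the metric keys and the function name `compliance_score` are assumptions for illustration, and each metric is assumed to be a percentage between 0 and 100.

```python
WEIGHTS = {
    "retention": 0.30,
    "sensitive_data": 0.25,
    "audit_coverage": 0.25,
    "access_control": 0.20,
}


def compliance_score(metrics: dict) -> float:
    """Weighted average of per-area scores, each expected in the range 0-100."""
    missing = set(WEIGHTS) - set(metrics)
    if missing:
        raise ValueError(f"Missing metrics: {sorted(missing)}")
    return round(sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS), 2)


example_metrics = {
    "retention": 90.0,
    "sensitive_data": 80.0,
    "audit_coverage": 100.0,
    "access_control": 70.0,
}
print(compliance_score(example_metrics))
```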

---

## Version Control

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-02-17 | Assistant | Comprehensive compliance and audit trail monitoring system with enterprise-grade features. Includes: sensitive data discovery with PII detection across all catalogs using keyword matching (SSN, email, phone, credit card, passport, medical, etc.) and sensitivity classification (HIGH/MEDIUM/LOW), table-level aggregation of sensitive columns, audit trail analysis via system.access.audit with event categorization (data access, modifications, permission changes, exports), user activity monitoring, permission change tracking, data export monitoring, failed access attempt detection, data retention compliance validation against 7-year policy with stale data identification (180+ days), compliance status tracking (COMPLIANT/WITHIN_POLICY/EXCEEDS_POLICY), cross-reference with sensitive tables, regulatory compliance checks for GDPR (right to erasure, data portability, consent tracking, retention limits), CCPA (right to know/delete, opt-out tracking), SOX (financial data controls, audit trail completeness), HIPAA considerations, automated violation detection, anomaly detection for after-hours access (outside 6 AM-8 PM), high-frequency access (>1000 events/day), failed access attempts, geographic analysis by source IP, compliance reporting with weighted scoring (retention 30%, sensitive data 25%, audit coverage 25%, access control 20%), metric breakdown, executive summary with key findings and recommendations, interactive visualizations (compliance metrics, retention status, audit events, sensitive data distribution), multiple export formats (Delta tables with historical tracking, Excel workbooks, JSON), job mode support with automatic configuration, serverless compute optimization, execution time tracking, DataFrame caching, error handling, and comprehensive logging. |

---

## Configuration

### Analysis Period:
* `days_back = 30` - Days of audit logs to analyze (default: 30)
* `start_date` / `end_date` - Automatically calculated from days_back
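
The date window follows directly from `days_back`. As a minimal sketch, the hypothetical helper `analysis_window` below takes the current time as a parameter so the result is reproducible; the notebook itself derives "now" from the `America/New_York` timezone.

```python
from datetime import datetime, timedelta


def analysis_window(days_back: int, now: datetime) -> tuple:
    """Return (start_date, end_date) as YYYY-MM-DD strings."""
    start = now - timedelta(days=days_back)
    return start.strftime("%Y-%m-%d"), now.strftime("%Y-%m-%d")


# Fixed timestamp used only to keep the example deterministic.
start_date, end_date = analysis_window(30, datetime(2026, 3, 2, 9, 0))
print(start_date, end_date)
```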

### Compliance Thresholds:
* `max_retention_days = 2555` - Maximum retention period (7 years)
* `min_retention_days = 90` - Minimum retention period
* `sensitive_access_threshold = 100` - Alert threshold for sensitive data access

### PII Keywords:
* Customizable list of sensitive data patterns
* Default: SSN, email, phone, address, credit card, passport, DOB, salary, account numbers, tax ID, medical, health

### Performance Settings:
* `MAX_WORKERS = 10` - Parallel processing threads
* `ENABLE_CACHING = False` - DataFrame caching (disabled by default because it is not beneficial on serverless compute)

### Export Settings:
* `ENABLE_EXCEL_EXPORT = False` - Excel workbook generation
* `ENABLE_DELTA_EXPORT = False` - Delta table for historical tracking
* `ENABLE_JSON_EXPORT = False` - JSON report generation
* `ENABLE_VISUALIZATIONS = True` - Generate charts
* `DELTA_CATALOG = 'main'` - Target catalog for exports
* `DELTA_SCHEMA = 'compliance_audit'` - Target schema for exports

---

## Usage

### Interactive Mode
1. Configure analysis period and thresholds in Cell 2
2. Customize PII keywords for your organization
3. Run all cells to perform compliance analysis
4. Review sensitive data inventory and audit findings
5. Examine compliance score and recommendations
6. Enable exports if needed for reporting

### Job Mode
1. Schedule as a Databricks job (monthly recommended)
2. Set `ENABLE_DELTA_EXPORT = True` for historical tracking
3. Automatically analyzes compliance posture
4. Exports to Delta tables for trend analysis
5. Returns execution summary for orchestration
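
For step 5, one common pattern is to serialize a small summary as JSON and hand it back through `dbutils.notebook.exit`. The sketch below only builds the JSON string; the field names and the function `build_execution_summary` are hypothetical, and the call to `dbutils.notebook.exit` is left as a comment because `dbutils` exists only inside Databricks.

```python
import json


def build_execution_summary(status, compliance_score, violation_count, elapsed_seconds):
    """Serialize a run summary that an orchestrator can parse."""
    summary = {
        "status": status,
        "compliance_score": compliance_score,
        "violation_count": violation_count,
        "elapsed_seconds": round(elapsed_seconds, 1),
    }
    return json.dumps(summary, sort_keys=True)


summary_json = build_execution_summary("SUCCESS", 86.0, 3, 412.53)
print(summary_json)

# Inside a Databricks job, the string could then be returned to the caller:
# dbutils.notebook.exit(summary_json)
```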

---

## Data Sources

| Data Source | Purpose |
|-------------|----------|
| `system.access.audit` | Workspace audit logs, access patterns |
| `system.information_schema.tables` | Table metadata, retention analysis |
| `system.information_schema.columns` | Column-level PII detection |
| Unity Catalog Tags | Sensitivity classification (future) |

---

## Key Features

✓ **Sensitive Data Discovery**: Automatic PII detection across all catalogs  
✓ **Audit Trail Analysis**: Complete workspace activity tracking  
✓ **Retention Compliance**: 7-year policy validation  
✓ **Regulatory Compliance**: GDPR, CCPA, SOX checks  
✓ **Anomaly Detection**: After-hours, high-frequency, failed access  
✓ **Compliance Scoring**: Weighted metrics (0-100%)  
✓ **Executive Reporting**: Key findings and recommendations  
✓ **Multiple Export Formats**: Delta, Excel, JSON  
✓ **Interactive Visualizations**: Compliance dashboards  
✓ **Job Mode Support**: Automated scheduled execution  
✓ **Serverless Optimized**: Compute-aware optimizations  
✓ **Performance Tracking**: Execution time logging  
✓ **Historical Tracking**: Delta table with append mode

In [0]:
# ============================================================================
# IMPORTS
# ============================================================================

# Standard library
import time
import os
from datetime import datetime, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed

# Third-party
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pytz

# Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound, PermissionDenied

# PySpark
from pyspark.sql import functions as F
from pyspark.sql.types import *

# ============================================================================
# CONFIGURATION
# ============================================================================

# Execution mode detection
try:
    dbutils.widgets.get('run_mode')
    is_job_mode = True
except Exception:
    # Widget not defined: treat this as an interactive run
    is_job_mode = False

# Timezone
eastern = pytz.timezone('America/New_York')

# Analysis period (default: last 30 days)
days_back = 30
start_date = (datetime.now(eastern) - timedelta(days=days_back)).strftime('%Y-%m-%d')
end_date = datetime.now(eastern).strftime('%Y-%m-%d')

# Compliance thresholds
max_retention_days = 2555  # 7 years for regulatory compliance
min_retention_days = 90     # Minimum retention period
sensitive_access_threshold = 100  # Alert if sensitive data accessed more than this

# PII/Sensitive data keywords (customize based on your organization)
pii_keywords = ['ssn', 'social_security', 'email', 'phone', 'address', 'credit_card', 
                'passport', 'driver_license', 'dob', 'date_of_birth', 'salary', 
                'account_number', 'tax_id', 'national_id', 'medical', 'health']

# Sensitive data discovery optimization
SCAN_SPECIFIC_CATALOGS = True  # Set to False to scan all catalogs
CATALOGS_TO_SCAN = ['main', 'rai_prod_uc', 'rai_qa_uc', 'rai_dev_uc']  # Customize for your environment

# Schema exclusions (skip known non-sensitive schemas)
EXCLUDE_SCHEMAS = ['information_schema', 'default', 'tmp', 'temp', 'test', 'sandbox', 
                   'dev_scratch', 'adhoc', 'samples']  # Customize for your environment
ENABLE_SCHEMA_EXCLUSIONS = True  # Set to False to scan all schemas

USE_SPARK_CLASSIFICATION = True  # Use Spark instead of pandas for classification (faster)

# Incremental scanning configuration
ENABLE_INCREMENTAL_SCAN = True  # Set to False for full scan every time
FORCE_FULL_SCAN = False  # Set to True to force a full scan this run (overrides incremental)
INCREMENTAL_TABLE_CATALOG = 'main'
INCREMENTAL_TABLE_SCHEMA = 'compliance_audit'
INCREMENTAL_TABLE_NAME = 'sensitive_columns_history'
INCREMENTAL_LOOKBACK_DAYS = 7  # Only scan tables modified in last N days (incremental mode)

# Service principals to exclude from user-focused analysis
# These are automated accounts that create noise in compliance monitoring
SERVICE_PRINCIPALS = [
    'System-User',
    'System user',
    'unknown',
    ''  # Empty string for null emails
]

# Filter service principals by pattern (UUIDs, system accounts)
FILTER_SERVICE_PRINCIPALS = True  # Set to False to include all accounts
FILTER_UUID_ACCOUNTS = True       # Filter accounts that look like UUIDs
FILTER_GROUP_ACCOUNTS = True      # Filter accounts starting with 'Developers-'

# Performance settings
MAX_WORKERS = 10  # Parallel processing threads
ENABLE_CACHING = False  # Disable caching on serverless (not beneficial)

# Export settings
ENABLE_EXCEL_EXPORT = False  # Set to True to export Excel reports
ENABLE_DELTA_EXPORT = False  # Set to True to save to Delta tables
ENABLE_JSON_EXPORT = False   # Set to True to export JSON reports
ENABLE_VISUALIZATIONS = True  # Set to False to skip visualizations

# Export paths
EXPORT_BASE_PATH = '/Workspace/Users/85055763@bat.com/exports/compliance'
DELTA_CATALOG = 'main'
DELTA_SCHEMA = 'compliance_audit'

# Initialize Workspace Client
wc = WorkspaceClient()

# Logging function
def log(message):
    """Print timestamped log message"""
    timestamp = datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S')
    print(f"[{timestamp}] {message}")

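# Exports are controlled by the ENABLE_*_EXPORT flags above, not by the run mode;
# their actual state is logged further down.
if is_job_mode:
    log("🤖 Job mode detected ('run_mode' widget is set)")
else:
    log("💻 Interactive mode detected")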

log("")
log("="*60)
log("COMPLIANCE & AUDIT TRAIL MONITOR")
log("="*60)
log(f"Execution mode: {'JOB' if is_job_mode else 'INTERACTIVE'}")
log(f"Compute: Serverless (DataFrame caching {'enabled' if ENABLE_CACHING else 'disabled'})")
log(f"Timezone: {eastern.zone}")
log(f"Analysis Period: {start_date} to {end_date}")
log(f"Retention Policy: {min_retention_days} - {max_retention_days} days")
log(f"Sensitive Access Threshold: {sensitive_access_threshold} accesses")
if SCAN_SPECIFIC_CATALOGS:
    log(f"Scanning catalogs: {', '.join(CATALOGS_TO_SCAN)}")
else:
    log(f"Scanning: ALL catalogs")
if ENABLE_SCHEMA_EXCLUSIONS:
    log(f"Excluding schemas: {', '.join(EXCLUDE_SCHEMAS)}")
log(f"Incremental scanning: {'ENABLED' if ENABLE_INCREMENTAL_SCAN else 'DISABLED'}")
if FORCE_FULL_SCAN:
    log(f"⚠️  FORCE_FULL_SCAN = True - Will perform full scan this run")
if ENABLE_INCREMENTAL_SCAN:
    log(f"Incremental table: {INCREMENTAL_TABLE_CATALOG}.{INCREMENTAL_TABLE_SCHEMA}.{INCREMENTAL_TABLE_NAME}")
    log(f"Incremental lookback: {INCREMENTAL_LOOKBACK_DAYS} days")
log(f"Service principal filtering: {'ENABLED' if FILTER_SERVICE_PRINCIPALS else 'DISABLED'}")
log(f"Excel export: {'ENABLED' if ENABLE_EXCEL_EXPORT else 'DISABLED'}")
log(f"Delta export: {'ENABLED' if ENABLE_DELTA_EXPORT else 'DISABLED'}")
log(f"JSON export: {'ENABLED' if ENABLE_JSON_EXPORT else 'DISABLED'}")
log("="*60)

In [0]:
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def log_execution_time(cell_name, start_time):
    """Log execution time for a cell"""
    elapsed = time.time() - start_time
    log(f"⏱️  {cell_name} completed in {elapsed:.2f} seconds")

def validate_dataframe_exists(df_name, df):
    """Validate that a DataFrame exists and has data"""
    if df is None:
        log(f"⚠️  {df_name} is None, skipping dependent operations")
        return False
    try:
        count = df.count()
        if count == 0:
            log(f"⚠️  {df_name} is empty (0 rows)")
            return False
        log(f"✓ {df_name} validated: {count:,} rows")
        return True
    except Exception as e:
        log(f"❌ Error validating {df_name}: {str(e)}")
        return False

def safe_cache(df, df_name):
    """Safely cache a DataFrame if caching is enabled and not on serverless"""
    if not ENABLE_CACHING or df is None:
        return df
    
    try:
        # Try to cache - if it fails due to serverless, catch and skip silently
        df.cache()
        log(f"💾 Cached {df_name}")
    except Exception as e:
        error_msg = str(e)
        # Check if it's a serverless-related error
        if 'PERSIST' in error_msg or 'serverless' in error_msg.lower():
            log(f"ℹ️  Skipping cache for {df_name} (serverless compute)")
        else:
            # For other errors, log the actual error
            log(f"⚠️  Could not cache {df_name}: {error_msg}")
    
    return df

def is_service_principal(email):
    """Determine if an email/user is a service principal"""
    if not email or pd.isna(email):
        return True  # Treat null/empty as service principal
    
    email_str = str(email).strip()
    
    # Check explicit list
    if email_str in SERVICE_PRINCIPALS:
        return True
    
    # Check UUID pattern (8-4-4-4-12 hex digits)
    if FILTER_UUID_ACCOUNTS:
        import re
        uuid_pattern = r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
        if re.match(uuid_pattern, email_str.lower()):
            return True
    
    # Check group accounts (case-insensitive, consistent with the Spark filter below)
    if FILTER_GROUP_ACCOUNTS:
        if email_str.lower().startswith('developers-'):
            return True
    
    return False

def filter_service_principals_spark(df, email_column='user_email'):
    """Filter out service principals from a Spark DataFrame"""
    if not FILTER_SERVICE_PRINCIPALS:
        return df
    
    # Filter explicit list
    df_filtered = df.filter(~F.col(email_column).isin(SERVICE_PRINCIPALS))
    
    # Filter UUIDs
    if FILTER_UUID_ACCOUNTS:
        # Lowercase first so uppercase UUIDs are filtered too, as in is_service_principal()
        df_filtered = df_filtered.filter(
            ~F.lower(F.col(email_column)).rlike('^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$')
        )
    
    # Filter group accounts
    if FILTER_GROUP_ACCOUNTS:
        df_filtered = df_filtered.filter(
            ~F.lower(F.col(email_column)).startswith('developers-')
        )
    
    return df_filtered

def classify_sensitivity(column_name, data_type):
    """Classify column sensitivity based on its name.

    data_type is accepted for type-based rules but is not used by the current logic.
    """
    column_lower = column_name.lower()
    for keyword in pii_keywords:
        if keyword in column_lower:
            return 'HIGH'
    if any(term in column_lower for term in ['name', 'user', 'customer', 'employee']):
        return 'MEDIUM'
    return 'LOW'

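def calculate_retention_status(last_modified_days):
    """Determine retention compliance status from days since last modification.

    EXCEEDS_POLICY: older than max_retention_days
    WITHIN_POLICY:  newer than min_retention_days
    COMPLIANT:      between the two thresholds (inclusive)
    """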
    if last_modified_days is None:
        return 'UNKNOWN'
    if last_modified_days > max_retention_days:
        return 'EXCEEDS_POLICY'
    elif last_modified_days < min_retention_days:
        return 'WITHIN_POLICY'
    else:
        return 'COMPLIANT'

def categorize_action(action_name):
    """Categorize audit actions into compliance-relevant groups"""
    # Guard against null action names in the audit log
    action_lower = (action_name or '').lower()
    
    # Authentication & Authorization (tokenLogin, oidcTokenAuthorization, mintOAuthToken)
    if any(term in action_lower for term in ['login', 'auth', 'token', 'oauth', 'authenticate']):
        return 'AUTHENTICATION'
    
    # Credential Generation (generateTemporaryPathCredential, generateTemporaryTableCredential)
    if any(term in action_lower for term in ['credential', 'getsessioncredentials']):
        return 'CREDENTIAL_GENERATION'
    
    # Metadata Access (getTable, getSchema, metadataSnapshot, tableExists, getPipeline)
    if any(term in action_lower for term in ['gettable', 'getschema', 'metadata', 'tableexists', 'getpipeline', 'getcatalog', 'listschemas', 'listtables', 'gettablebyid']):
        return 'METADATA_ACCESS'
    
    # Job/Cluster Execution (runCommand, runStart, runSucceeded, submitRun)
    if any(term in action_lower for term in ['runcommand', 'runstart', 'runsucceeded', 'runfailed', 'runtriggered', 'submitrun', 'submitcommand']):
        return 'JOB_EXECUTION'
    
    # Cluster Operations (create, delete, resize clusters)
    if 'cluster' in action_lower and any(term in action_lower for term in ['create', 'delete', 'resize', 'start', 'terminate']):
        return 'CLUSTER_OPERATIONS'
    
    # Secret Access (getSecret)
    if 'secret' in action_lower:
        return 'SECRET_ACCESS'
    
    # Permission Changes (grant, revoke, permission, changeJobAcl)
    # Checked before the data categories so that a name containing both
    # 'update' and 'permission' is classified as a permission change.
    if any(term in action_lower for term in ['grant', 'revoke', 'permission', 'acl']):
        return 'PERMISSION_CHANGE'
    
    # Data Export (export, download)
    # Checked before DATA_ACCESS so that a download whose name also
    # contains 'query' or 'read' still counts as an export.
    if any(term in action_lower for term in ['export', 'download']):
        return 'DATA_EXPORT'
    
    # Data Access (read, select, query, scan)
    if any(term in action_lower for term in ['read', 'select', 'query', 'scan']):
        return 'DATA_ACCESS'
    
    # Data Modification (create, update, delete, drop, alter)
    if any(term in action_lower for term in ['create', 'update', 'delete', 'drop', 'alter', 'insert', 'merge', 'modify']):
        return 'DATA_MODIFICATION'
    
    # Monitoring & Metrics (PutMetrics, getRunOutput)
    if any(term in action_lower for term in ['metric', 'getrunoutput', 'monitoring']):
        return 'MONITORING'
    
    # Everything else
    return 'OTHER'

log("✓ Helper functions loaded")


---

## 1. Sensitive Data Discovery & Classification

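This section identifies tables and columns that may contain PII or other sensitive information, based on:
* Column naming patterns (keyword matching against `pii_keywords`)
* Unity Catalog tags (planned; listed as a future data source above)
* Data types commonly associated with sensitive data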


---

## 📖 Incremental Scanning Guide

### How It Works

**First Run (Automatic):**
* No history table exists → Full scan of all tables (~15-20 minutes)
* Creates `main.compliance_audit.sensitive_columns_history`
* Saves all results for future reference

**Subsequent Runs (Automatic):**
* History table exists → Incremental scan (1-3 minutes) ⚡
* Only scans tables modified in last 7 days
* Merges with existing history
* Returns complete inventory
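
The choice between a full and an incremental scan described above can be summarized as a small decision function. This is a simplified sketch of that logic, not the notebook's code: `plan_scan` and `ScanPlan` are hypothetical names, and a `lookback_days` of `None` stands for "no modification-date limit".

```python
from typing import NamedTuple, Optional


class ScanPlan(NamedTuple):
    mode: str                      # "FULL" or "INCREMENTAL"
    lookback_days: Optional[int]   # None means every table is scanned


def plan_scan(force_full_scan: bool, incremental_enabled: bool,
              history_has_rows: bool, lookback_days: int = 7) -> ScanPlan:
    """Pick the scan mode from the configuration flags and the history state."""
    # A forced scan, disabled incremental mode, or missing/empty history
    # all lead to a full scan.
    if force_full_scan or not incremental_enabled or not history_has_rows:
        return ScanPlan("FULL", None)
    return ScanPlan("INCREMENTAL", lookback_days)


print(plan_scan(False, True, False))   # first run: no history yet
print(plan_scan(False, True, True))    # later runs: history available
```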

### Configuration Options

#### Normal Operation (Default)
```python
ENABLE_INCREMENTAL_SCAN = True   # Incremental mode enabled
FORCE_FULL_SCAN = False          # Use incremental logic
INCREMENTAL_LOOKBACK_DAYS = 7    # Scan last 7 days of changes
```

#### Force Full Scan (Monthly/Quarterly Refresh)
```python
ENABLE_INCREMENTAL_SCAN = True   # Keep history
FORCE_FULL_SCAN = True           # Override incremental, scan everything
INCREMENTAL_LOOKBACK_DAYS = 7    # Ignored when FORCE_FULL_SCAN = True
```
**Use when:** You want a comprehensive refresh while keeping the history table in place. Its contents are replaced with the results of the full scan.

#### Disable Incremental (Always Full Scan)
```python
ENABLE_INCREMENTAL_SCAN = False  # No history tracking
FORCE_FULL_SCAN = False          # Not applicable
```
**Use when:** Testing or one-time analysis

### When to Force a Full Scan

✅ **Monthly/Quarterly Compliance Reports** - Comprehensive refresh

✅ **After Major Schema Changes** - Ensure all new tables are captured

✅ **Audit Requirements** - Complete inventory verification

✅ **Suspect Missing Data** - Validate incremental logic

### How to Force a Full Scan

**Option 1: Use FORCE_FULL_SCAN (Recommended)**
1. Go to Cell 2 (Configuration)
2. Set `FORCE_FULL_SCAN = True`
3. Run Cell 2 and Cell 5 (Incremental)
4. After completion, set `FORCE_FULL_SCAN = False` and rerun Cell 2

**Option 2: Drop History Table (Nuclear Option)**
```python
spark.sql('DROP TABLE IF EXISTS main.compliance_audit.sensitive_columns_history')
```
* Loses all historical scan data
* Next run will be a fresh start

### Performance Expectations

| Scan Type | Duration | When It Runs |
|-----------|----------|-------------|
| **First Run** | 15-20 min | Only once (no history table) |
| **Incremental** | 1-3 min | Every run after first |
| **Force Full** | 15-20 min | When FORCE_FULL_SCAN = True |

### Schema Exclusions

To further optimize, exclude non-sensitive schemas:
```python
ENABLE_SCHEMA_EXCLUSIONS = True
EXCLUDE_SCHEMAS = ['information_schema', 'default', 'tmp', 'temp', 'test', 'sandbox']
```

### Viewing History

Run Cell 6 (View Incremental Scan History) to see:
* Scan dates and statistics
* Sensitive data by schema
* Tables not scanned recently
* Management commands

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("IDENTIFYING SENSITIVE COLUMNS (INCREMENTAL)")
log("="*60)

try:
    # Build full table name for incremental tracking
    incremental_table_full = f"{INCREMENTAL_TABLE_CATALOG}.{INCREMENTAL_TABLE_SCHEMA}.{INCREMENTAL_TABLE_NAME}"
    
    # Check if force full scan is enabled
    if FORCE_FULL_SCAN:
        log("⚠️  FORCE_FULL_SCAN enabled - performing full scan regardless of history")
        is_first_run = False  # Not a first run; the history table is replaced with the full results below
        lookback_days = 36500  # ~100 years (effectively all tables)
    # Check if incremental mode is enabled
    elif ENABLE_INCREMENTAL_SCAN:
        log(f"Incremental mode ENABLED")
        log(f"Checking for existing history table: {incremental_table_full}")
        
        # Check if history table exists
        try:
            history_exists = spark.catalog.tableExists(incremental_table_full)
        except Exception:
            # Catalog or schema may not exist yet: treat as no history
            history_exists = False
        
        if history_exists:
            # Get the last scan date
            last_scan_df = spark.sql(f"""
                SELECT MAX(scan_date) as last_scan_date
                FROM {incremental_table_full}
            """)
            last_scan_date = last_scan_df.collect()[0]['last_scan_date']
            
            if last_scan_date:
                log(f"✓ History table found. Last scan: {last_scan_date}")
                log(f"Scanning only tables modified in last {INCREMENTAL_LOOKBACK_DAYS} days...")
                is_first_run = False
                lookback_days = INCREMENTAL_LOOKBACK_DAYS
            else:
                log("⚠️  History table exists but is empty. Performing full scan...")
                is_first_run = True
                lookback_days = 36500  # ~100 years (effectively all tables)
        else:
            log("ℹ️  No history table found. Performing initial full scan...")
            is_first_run = True
            lookback_days = 36500  # ~100 years (effectively all tables)
    else:
        log("Incremental mode DISABLED. Performing full scan...")
        is_first_run = True
        lookback_days = 36500
    
    # Build catalog filter
    if SCAN_SPECIFIC_CATALOGS:
        catalog_list = "', '".join(CATALOGS_TO_SCAN)
        catalog_filter = f"AND t.table_catalog IN ('{catalog_list}')"
        log(f"Scanning catalogs: {', '.join(CATALOGS_TO_SCAN)}")
    else:
        catalog_filter = ""
    
    # Build schema exclusion filter
    if ENABLE_SCHEMA_EXCLUSIONS:
        schema_list = "', '".join(EXCLUDE_SCHEMAS)
        schema_filter = f"AND t.table_schema NOT IN ('{schema_list}')"
        log(f"Excluding schemas: {', '.join(EXCLUDE_SCHEMAS)}")
    else:
        schema_filter = ""
    
    # Build PII pattern for SQL RLIKE
    pii_pattern = '|'.join(pii_keywords)
    medium_pattern = 'name|user|customer|employee'
    
    # Query with incremental filter
    log(f"Executing query (lookback: {lookback_days} days)...")
    query = f"""
    SELECT 
        c.table_catalog,
        c.table_schema,
        c.table_name,
        c.column_name,
        c.data_type,
        c.comment,
        CONCAT(c.table_catalog, '.', c.table_schema, '.', c.table_name) as full_table_name,
        CASE 
            WHEN LOWER(c.column_name) RLIKE '{pii_pattern}' THEN 'HIGH'
            WHEN LOWER(c.column_name) RLIKE '{medium_pattern}' THEN 'MEDIUM'
            ELSE 'LOW'
        END as sensitivity_level,
        CURRENT_DATE() as scan_date,
        t.last_altered as table_last_altered
    FROM system.information_schema.columns c
    INNER JOIN system.information_schema.tables t
        ON c.table_catalog = t.table_catalog
        AND c.table_schema = t.table_schema
        AND c.table_name = t.table_name
    WHERE c.table_catalog NOT IN ('system', '__databricks_internal')
        AND t.table_type IN ('MANAGED', 'EXTERNAL')
        AND DATEDIFF(CURRENT_DATE(), DATE(t.last_altered)) <= {lookback_days}
        {catalog_filter}
        {schema_filter}
    """

    sensitive_columns_classified = spark.sql(query)
    
    # Filter to HIGH and MEDIUM only
    log("Filtering to HIGH and MEDIUM sensitivity columns...")
    sensitive_spark = sensitive_columns_classified.filter(
        F.col('sensitivity_level').isin(['HIGH', 'MEDIUM'])
    )
    
    # Get counts
    log("Calculating sensitivity statistics...")
    sensitivity_counts = sensitive_spark.groupBy('sensitivity_level').count().collect()
    count_dict = {row['sensitivity_level']: row['count'] for row in sensitivity_counts}
    high_count = count_dict.get('HIGH', 0)
    medium_count = count_dict.get('MEDIUM', 0)
    total_sensitive = high_count + medium_count
    
    log(f"\n📊 Sensitive columns identified in this scan:")
    log(f"  HIGH sensitivity: {high_count:,}")
    log(f"  MEDIUM sensitivity: {medium_count:,}")
    log(f"  Total sensitive: {total_sensitive:,}")
    
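    # Save to history table if incremental mode is enabled.
    # This also runs when the scan found nothing new, so that the complete
    # inventory is still loaded from the history table below.
    if ENABLE_INCREMENTAL_SCAN: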
        log(f"\nSaving results to history table: {incremental_table_full}")
        try:
            # Create schema if it doesn't exist
            spark.sql(f"CREATE SCHEMA IF NOT EXISTS {INCREMENTAL_TABLE_CATALOG}.{INCREMENTAL_TABLE_SCHEMA}")
            
            if is_first_run:
                # First run: create table
                log("Creating new history table...")
                sensitive_spark.write.mode('overwrite').saveAsTable(incremental_table_full)
                log("✓ History table created successfully")
            else:
                # Subsequent runs or forced full scan: merge new/updated data
                if FORCE_FULL_SCAN:
                    log("Full scan mode: Replacing entire history table...")
                    sensitive_spark.write.mode('overwrite').saveAsTable(incremental_table_full)
                    log("✓ History table replaced with full scan results")
                else:
                    log("Merging with existing history...")
                    
                    # Delete old records for every table that was rescanned, including
                    # tables that no longer contain any sensitive columns
                    tables_scanned = sensitive_columns_classified.select('full_table_name').distinct()
                    tables_scanned.createOrReplaceTempView('tables_scanned_temp')
                    
                    spark.sql(f"""
                        DELETE FROM {incremental_table_full}
                        WHERE full_table_name IN (SELECT full_table_name FROM tables_scanned_temp)
                    """)
                    
                    # Append new records
                    sensitive_spark.write.mode('append').saveAsTable(incremental_table_full)
                    log("✓ History table updated successfully")
                
            # Load complete history for analysis
            log("Loading complete history from Delta table...")
            sensitive_spark = spark.table(incremental_table_full).filter(
                F.col('sensitivity_level').isin(['HIGH', 'MEDIUM'])
            )
            
            # Recalculate totals from complete history
            sensitivity_counts = sensitive_spark.groupBy('sensitivity_level').count().collect()
            count_dict = {row['sensitivity_level']: row['count'] for row in sensitivity_counts}
            high_count = count_dict.get('HIGH', 0)
            medium_count = count_dict.get('MEDIUM', 0)
            total_sensitive = high_count + medium_count
            
            log(f"\n📊 Complete inventory (from history):")
            log(f"  HIGH sensitivity: {high_count:,}")
            log(f"  MEDIUM sensitivity: {medium_count:,}")
            log(f"  Total sensitive: {total_sensitive:,}")
            
        except Exception as e:
            log(f"⚠️  Could not save to history table: {str(e)}")
            log("Continuing with current scan results...")
    
    # Convert to pandas for downstream analysis
    log("\nConverting to pandas for analysis...")
    if total_sensitive > 50000:
        log(f"⚠️  Large dataset ({total_sensitive:,} rows), limiting to 50,000")
        sensitive_data = sensitive_spark.limit(50000).toPandas()
    else:
        sensitive_data = sensitive_spark.toPandas()
    
    log(f"Loaded {len(sensitive_data):,} sensitive columns into memory")
    display(sensitive_data.head(20))
    
    # Reset FORCE_FULL_SCAN flag reminder
    if FORCE_FULL_SCAN:
        log("\n" + "="*60)
        log("⚠️  REMINDER: FORCE_FULL_SCAN is still enabled")
        log("Set FORCE_FULL_SCAN = False in the configuration cell")
        log("to return to incremental mode for faster scans.")
        log("="*60)
    
except Exception as e:
    log(f"❌ Error in incremental scan: {str(e)}")
    import traceback
    log(traceback.format_exc())
    sensitive_data = None
    
log_execution_time("Identify Sensitive Columns (Incremental)", cell_start_time)

In [0]:
# Optional: View incremental scan history and statistics

log("\n" + "="*60)
log("INCREMENTAL SCAN HISTORY")
log("="*60)

incremental_table_full = f"{INCREMENTAL_TABLE_CATALOG}.{INCREMENTAL_TABLE_SCHEMA}.{INCREMENTAL_TABLE_NAME}"

try:
    # Check if history table exists
    if spark.catalog.tableExists(incremental_table_full):
        log(f"✓ History table exists: {incremental_table_full}")
        
        # Get scan history statistics
        history_stats = spark.sql(f"""
            SELECT 
                scan_date,
                COUNT(*) as total_columns,
                COUNT(DISTINCT full_table_name) as total_tables,
                SUM(CASE WHEN sensitivity_level = 'HIGH' THEN 1 ELSE 0 END) as high_sensitivity,
                SUM(CASE WHEN sensitivity_level = 'MEDIUM' THEN 1 ELSE 0 END) as medium_sensitivity
            FROM {incremental_table_full}
            GROUP BY scan_date
            ORDER BY scan_date DESC
            LIMIT 10
        """)
        
        log("\nRecent scan history:")
        display(history_stats)
        
        # Get table-level summary
        table_summary = spark.sql(f"""
            SELECT 
                table_catalog,
                table_schema,
                COUNT(DISTINCT table_name) as table_count,
                COUNT(*) as sensitive_column_count,
                MAX(scan_date) as last_scanned
            FROM {incremental_table_full}
            GROUP BY table_catalog, table_schema
            ORDER BY sensitive_column_count DESC
        """)
        
        log("\nSensitive data by schema:")
        display(table_summary)
        
        # Show total size
        total_rows = spark.table(incremental_table_full).count()
        log(f"\nTotal rows in history table: {total_rows:,}")
        
        log("\n" + "="*60)
        log("MANAGEMENT OPTIONS")
        log("="*60)
        log("\nTo reset the incremental scan history (force full rescan):")
        log(f"  spark.sql('DROP TABLE IF EXISTS {incremental_table_full}')")
        log("\nTo view specific tables:")
        log(f"  spark.sql('SELECT * FROM {incremental_table_full} WHERE table_name = \"your_table\"').display()")
        log("\nTo see tables not scanned recently:")
        log(f"  spark.sql('SELECT DISTINCT full_table_name, MAX(scan_date) as last_scan FROM {incremental_table_full} GROUP BY full_table_name HAVING DATEDIFF(CURRENT_DATE(), MAX(scan_date)) > 30').display()")
        
    else:
        log(f"ℹ️  No history table found: {incremental_table_full}")
        log("Run the incremental scan cell to create it.")
        
except Exception as e:
    log(f"❌ Error viewing history: {str(e)}")

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("IDENTIFYING SENSITIVE COLUMNS (SAMPLE-BASED)")
log("="*60)

try:
    # PERFORMANCE OPTIMIZATION: Only scan recently modified tables
    # This reduces scan from 1.5M columns to ~100K columns
    
    # Build catalog filter
    if SCAN_SPECIFIC_CATALOGS:
        catalog_list = "', '".join(CATALOGS_TO_SCAN)
        catalog_filter = f"AND t.table_catalog IN ('{catalog_list}')"
        log(f"Scanning specific catalogs: {', '.join(CATALOGS_TO_SCAN)}")
    else:
        catalog_filter = ""
        log("Scanning ALL catalogs")
    
    # Build PII pattern for SQL RLIKE
    pii_pattern = '|'.join(pii_keywords)
    medium_pattern = 'name|user|customer|employee'
    
    log("Using sample-based approach: scanning only recently modified tables (last 180 days)...")
    
    # Join columns with tables to filter by last_altered date
    query = f"""
    SELECT 
        c.table_catalog,
        c.table_schema,
        c.table_name,
        c.column_name,
        c.data_type,
        c.comment,
        CONCAT(c.table_catalog, '.', c.table_schema, '.', c.table_name) as full_table_name,
        CASE 
            WHEN LOWER(c.column_name) RLIKE '{pii_pattern}' THEN 'HIGH'
            WHEN LOWER(c.column_name) RLIKE '{medium_pattern}' THEN 'MEDIUM'
            ELSE 'LOW'
        END as sensitivity_level
    FROM system.information_schema.columns c
    INNER JOIN system.information_schema.tables t
        ON c.table_catalog = t.table_catalog
        AND c.table_schema = t.table_schema
        AND c.table_name = t.table_name
    WHERE c.table_catalog NOT IN ('system', '__databricks_internal')
        AND t.table_type IN ('MANAGED', 'EXTERNAL')
        AND DATEDIFF(CURRENT_DATE(), DATE(t.last_altered)) <= 180
        {catalog_filter}
    """

    log("Executing optimized query...")
    sensitive_columns_classified = spark.sql(query)
    
    # Filter to HIGH and MEDIUM only
    log("Filtering to HIGH and MEDIUM sensitivity columns...")
    sensitive_spark = sensitive_columns_classified.filter(
        F.col('sensitivity_level').isin(['HIGH', 'MEDIUM'])
    )
    
    # Get counts using single-pass aggregation
    log("Calculating sensitivity statistics...")
    sensitivity_counts = sensitive_spark.groupBy('sensitivity_level').count().collect()
    
    # Extract counts
    count_dict = {row['sensitivity_level']: row['count'] for row in sensitivity_counts}
    high_count = count_dict.get('HIGH', 0)
    medium_count = count_dict.get('MEDIUM', 0)
    total_sensitive = high_count + medium_count
    
    log(f"\n📊 Sensitive columns identified (recently modified tables only):")
    log(f"  HIGH sensitivity: {high_count:,}")
    log(f"  MEDIUM sensitivity: {medium_count:,}")
    log(f"  Total sensitive: {total_sensitive:,}")
    log("\nℹ️  Note: This is a sample based on tables modified in the last 180 days")
    log("     For a full scan, run the 'IDENTIFYING SENSITIVE COLUMNS' cell (slower but comprehensive)")
    
    # Convert to pandas for downstream analysis
    log("Converting to pandas for analysis...")
    if total_sensitive > 50000:
        log(f"⚠️  Large dataset ({total_sensitive:,} rows), limiting to 50,000")
        sensitive_data = sensitive_spark.limit(50000).toPandas()
    else:
        sensitive_data = sensitive_spark.toPandas()
    
    log(f"Loaded {len(sensitive_data):,} sensitive columns into memory")
    display(sensitive_data.head(20))
    
except Exception as e:
    log(f"❌ Error identifying sensitive columns: {str(e)}")
    sensitive_data = None
    
log_execution_time("Identify Sensitive Columns (Sample-Based)", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("IDENTIFYING SENSITIVE COLUMNS")
log("="*60)

try:
    # Build catalog filter for performance
    if SCAN_SPECIFIC_CATALOGS:
        catalog_list = "', '".join(CATALOGS_TO_SCAN)
        catalog_filter = f"AND table_catalog IN ('{catalog_list}')"
        log(f"Scanning specific catalogs: {', '.join(CATALOGS_TO_SCAN)}")
    else:
        catalog_filter = ""
        log("Scanning ALL catalogs (this may take longer)")
    
    # Build PII pattern for SQL RLIKE
    pii_pattern = '|'.join(pii_keywords)
    medium_pattern = 'name|user|customer|employee'
    
    # Use SQL-based classification (fastest approach - single query)
    log("Querying and classifying columns using SQL...")
    query = f"""
    SELECT 
        table_catalog,
        table_schema,
        table_name,
        column_name,
        data_type,
        comment,
        CONCAT(table_catalog, '.', table_schema, '.', table_name) as full_table_name,
        CASE 
            WHEN LOWER(column_name) RLIKE '{pii_pattern}' THEN 'HIGH'
            WHEN LOWER(column_name) RLIKE '{medium_pattern}' THEN 'MEDIUM'
            ELSE 'LOW'
        END as sensitivity_level
    FROM system.information_schema.columns
    WHERE table_catalog NOT IN ('system', '__databricks_internal')
    {catalog_filter}
    """

    sensitive_columns_classified = spark.sql(query)
    
    # Filter to HIGH and MEDIUM only (push-down filter)
    log("Filtering to HIGH and MEDIUM sensitivity columns...")
    sensitive_spark = sensitive_columns_classified.filter(
        F.col('sensitivity_level').isin(['HIGH', 'MEDIUM'])
    )
    
    # Get counts using single-pass aggregation
    log("Calculating sensitivity statistics...")
    sensitivity_counts = sensitive_spark.groupBy('sensitivity_level').count().collect()
    
    # Extract counts from results
    count_dict = {row['sensitivity_level']: row['count'] for row in sensitivity_counts}
    high_count = count_dict.get('HIGH', 0)
    medium_count = count_dict.get('MEDIUM', 0)
    total_sensitive = high_count + medium_count
    
    log(f"\n📊 Sensitive columns identified:")
    log(f"  HIGH sensitivity: {high_count:,}")
    log(f"  MEDIUM sensitivity: {medium_count:,}")
    log(f"  Total sensitive: {total_sensitive:,}")
    
    # Convert only a sample to pandas for display and downstream analysis
    log("Converting sample to pandas for analysis...")
    if total_sensitive > 100000:
        log(f"⚠️  Large dataset detected ({total_sensitive:,} rows), limiting to 100,000 for memory efficiency")
        sensitive_data = sensitive_spark.limit(100000).toPandas()
    else:
        sensitive_data = sensitive_spark.toPandas()
    
    log(f"Loaded {len(sensitive_data):,} sensitive columns into memory")
    display(sensitive_data.head(20))
    
except Exception as e:
    log(f"❌ Error identifying sensitive columns: {str(e)}")
    sensitive_data = None
    
log_execution_time("Identify Sensitive Columns", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("AGGREGATING SENSITIVE TABLES")
log("="*60)

if sensitive_data is not None and len(sensitive_data) > 0:
    try:
        # Aggregate sensitive columns by table
        log("Grouping sensitive columns by table...")
        sensitive_tables = sensitive_data.groupby(['table_catalog', 'table_schema', 'table_name', 'full_table_name']).agg({
            'column_name': 'count',
            'sensitivity_level': lambda x: 'HIGH' if 'HIGH' in x.values else 'MEDIUM'
        }).reset_index()

        sensitive_tables.columns = ['catalog', 'schema', 'table', 'full_table_name', 'sensitive_column_count', 'max_sensitivity']
        sensitive_tables = sensitive_tables.sort_values('sensitive_column_count', ascending=False)

        high_tables = len(sensitive_tables[sensitive_tables['max_sensitivity'] == 'HIGH'])
        medium_tables = len(sensitive_tables[sensitive_tables['max_sensitivity'] == 'MEDIUM'])
        
        log(f"\n📊 Tables with sensitive data: {len(sensitive_tables):,}")
        log(f"  HIGH sensitivity tables: {high_tables:,}")
        log(f"  MEDIUM sensitivity tables: {medium_tables:,}")

        display(sensitive_tables.head(20))
        
    except Exception as e:
        log(f"❌ Error aggregating sensitive tables: {str(e)}")
        sensitive_tables = None
else:
    log("⚠️  No sensitive data to aggregate")
    sensitive_tables = None
    
log_execution_time("Sensitive Tables Summary", cell_start_time)


---

## 2. Audit Trail Analysis

Analyzing workspace audit logs to track:
* Access to sensitive tables
* Data modification events
* Permission changes
* Data export activities
* Unusual access patterns
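
The categorization cell in this section calls a `categorize_action` helper whose definition is not shown here. As a rough illustration only, a keyword-based categorizer could look like the sketch below. The function name `sketch_categorize_action`, the keyword lists, and the `DATA_MODIFICATION` and `OTHER` labels are assumptions made for this example; only `DATA_ACCESS`, `PERMISSION_CHANGE`, and `DATA_EXPORT` appear in the cells that follow.

```python
from typing import Optional

# Hypothetical keyword rules (assumption, not the notebook's actual rules).
# Order matters: the first category with a matching keyword wins.
_SKETCH_RULES = [
    ("PERMISSION_CHANGE", ("permission", "grant", "revoke")),
    ("DATA_EXPORT", ("export", "download")),
    ("DATA_MODIFICATION", ("create", "update", "delete", "alter", "drop")),
    ("DATA_ACCESS", ("get", "read", "select", "list", "query")),
]


def sketch_categorize_action(action_name: Optional[str]) -> str:
    """Map an audit action name to a coarse category via substring matching."""
    if not action_name:
        return "OTHER"
    lowered = action_name.lower()
    for category, keywords in _SKETCH_RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "OTHER"
```

Because the first matching rule wins, an action such as `updatePermissions` is treated as a permission change rather than a generic modification. Substring matching is deliberately crude, so real rules would likely need an explicit allow-list per service.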

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("LOADING AUDIT LOGS")
log("="*60)

try:
    # Query audit logs for the analysis period
    log(f"Querying system.access.audit from {start_date} to {end_date}...")
    audit_query = f"""
    SELECT 
        event_time,
        event_date,
        workspace_id,
        user_identity.email as user_email,
        service_name,
        action_name,
        request_id,
        request_params,
        response.status_code as status_code,
        response.error_message as error_message,
        source_ip_address
    FROM system.access.audit
    WHERE event_date >= '{start_date}'
        AND event_date <= '{end_date}'
        AND action_name IS NOT NULL
    ORDER BY event_time DESC
    """

    audit_logs_df = spark.sql(audit_query)

    # Cache before the first action so count() materializes the cache
    # and later cells do not re-run the audit query
    audit_logs_df = safe_cache(audit_logs_df, "audit_logs_df")
    total_events = audit_logs_df.count()

    log(f"✓ Total audit events: {total_events:,}")

    # Show sample
    display(audit_logs_df.limit(10))
    
except Exception as e:
    log(f"❌ Error loading audit logs: {str(e)}")
    audit_logs_df = None
    total_events = 0
    
log_execution_time("Load Audit Logs", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("CATEGORIZING AUDIT EVENTS")
log("="*60)

if validate_dataframe_exists("audit_logs_df", audit_logs_df):
    try:
        # For large datasets, use Spark UDF instead of pandas
        from pyspark.sql.functions import udf
        from pyspark.sql.types import StringType
        
        # Create UDF for categorization
        categorize_udf = udf(categorize_action, StringType())
        
        log("Categorizing events by action type using Spark...")
        audit_logs_categorized = audit_logs_df.withColumn('event_category', categorize_udf(F.col('action_name')))
        
        # Aggregate summary using Spark
        event_summary_spark = audit_logs_categorized.groupBy('event_category').agg(
            F.count('request_id').alias('event_count'),
            F.countDistinct('user_email').alias('unique_users')
        ).orderBy(F.desc('event_count'))
        
        # Convert only the summary to pandas (small dataset)
        event_summary = event_summary_spark.toPandas()
        
        log("\n📊 Audit Event Summary by Category:")
        for _, row in event_summary.iterrows():
            log(f"  {row['event_category']}: {row['event_count']:,} events ({row['unique_users']} users)")
        
        display(event_summary)
        
        # Store the categorized Spark DataFrame for downstream use
        audit_logs_categorized.createOrReplaceTempView("audit_logs_categorized")
        
    except Exception as e:
        log(f"❌ Error categorizing audit events: {str(e)}")
        audit_logs_categorized = None
        event_summary = None
else:
    log("⚠️  No audit logs to categorize")
    audit_logs_categorized = None
    event_summary = None
    
log_execution_time("Categorize Audit Events", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("FILTERING SERVICE PRINCIPALS")
log("="*60)

if 'audit_logs_categorized' in locals() and audit_logs_categorized is not None:
    try:
        # Get counts before filtering
        total_before = audit_logs_categorized.count()
        users_before = audit_logs_categorized.select('user_email').distinct().count()
        
        if FILTER_SERVICE_PRINCIPALS:
            log("Applying service principal filters...")
            
            # Apply filtering
            audit_logs_users_only = filter_service_principals_spark(audit_logs_categorized, 'user_email')
            
            # Get counts after filtering
            total_after = audit_logs_users_only.count()
            users_after = audit_logs_users_only.select('user_email').distinct().count()
            
            events_filtered = total_before - total_after
            users_filtered = users_before - users_after
            # Guard against division by zero when the audit log is empty
            filtered_pct = (events_filtered / total_before * 100) if total_before > 0 else 0.0
            
            log(f"\n📊 Filtering Results:")
            log(f"  Events before: {total_before:,}")
            log(f"  Events after: {total_after:,}")
            log(f"  Events filtered: {events_filtered:,} ({filtered_pct:.1f}%)")
            log(f"  Users before: {users_before:,}")
            log(f"  Users after: {users_after:,}")
            log(f"  Service principals filtered: {users_filtered:,}")
            
            # Create temp view for downstream analysis
            audit_logs_users_only.createOrReplaceTempView("audit_logs_users_only")
            log("\n✓ Created view: audit_logs_users_only (for user-focused analysis)")
            log("✓ Original view: audit_logs_categorized (includes all accounts)")
        else:
            log("ℹ️  Service principal filtering is DISABLED")
            audit_logs_users_only = audit_logs_categorized
            audit_logs_users_only.createOrReplaceTempView("audit_logs_users_only")
            
    except Exception as e:
        log(f"❌ Error filtering service principals: {str(e)}")
        audit_logs_users_only = audit_logs_categorized
else:
    log("⚠️  No audit logs to filter")
    audit_logs_users_only = None

log_execution_time("Filter Service Principals", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("SENSITIVE DATA ACCESS TRACKING")
log("="*60)

if 'audit_logs_categorized' in locals() and audit_logs_categorized is not None:
    try:
        # Filter for data access events using Spark
        log("Filtering data access events...")
        data_access_events_spark = audit_logs_categorized.filter(F.col('event_category') == 'DATA_ACCESS')
        data_access_count = data_access_events_spark.count()
        
        log(f"Total data access events: {data_access_count:,}")

        if data_access_count > 0:
            # Analyze access patterns using Spark aggregations
            log("Analyzing access patterns by user...")
            access_by_user_spark = data_access_events_spark.groupBy('user_email').agg(
                F.count('request_id').alias('access_count'),
                F.min('event_date').alias('first_access'),
                F.max('event_date').alias('last_access')
            ).orderBy(F.desc('access_count'))
            
            # Convert only top results to pandas for display
            access_by_user = access_by_user_spark.limit(100).toPandas()
            
            log(f"\nUnique users with data access: {access_by_user_spark.count():,}")
            log(f"Top data accessors:")
            display(access_by_user.head(20))
        else:
            log("No data access events found in the audit logs for this period.")
            
    except Exception as e:
        log(f"❌ Error tracking sensitive data access: {str(e)}")
else:
    log("⚠️  No categorized audit logs available")

log_execution_time("Sensitive Data Access Tracking", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("PERMISSION CHANGES AUDIT")
log("="*60)

if 'audit_logs_categorized' in locals() and audit_logs_categorized is not None:
    try:
        # Track permission changes using Spark
        log("Filtering permission change events...")
        permission_changes_spark = audit_logs_categorized.filter(F.col('event_category') == 'PERMISSION_CHANGE')
        perm_change_count = permission_changes_spark.count()
        
        log(f"Total permission change events: {perm_change_count:,}")

        if perm_change_count > 0:
            # Analyze permission changes
            log("Analyzing permission changes...")
            perm_summary_spark = permission_changes_spark.groupBy('user_email', 'action_name').agg(
                F.count('request_id').alias('change_count'),
                F.min('event_date').alias('first_change'),
                F.max('event_date').alias('last_change')
            ).orderBy(F.desc('change_count'))
            
            # Convert top results to pandas
            perm_summary = perm_summary_spark.limit(100).toPandas()
            perm_summary.columns = ['user_email', 'action', 'change_count', 'first_change', 'last_change']
            
            log("\nPermission Changes Summary:")
            display(perm_summary.head(20))
            
            # Recent permission changes (last 7 days)
            from datetime import datetime, timedelta
            recent_cutoff = (datetime.now(eastern) - timedelta(days=7)).strftime('%Y-%m-%d')
            recent_perms_spark = permission_changes_spark.filter(F.col('event_date') >= recent_cutoff)
            recent_count = recent_perms_spark.count()
            
            log(f"\nRecent permission changes (last 7 days): {recent_count:,}")
            if recent_count > 0:
                recent_perms = recent_perms_spark.select('event_time', 'user_email', 'action_name', 'service_name').limit(20).toPandas()
                display(recent_perms)
        else:
            log("No permission change events found in the audit logs.")
            
    except Exception as e:
        log(f"❌ Error auditing permission changes: {str(e)}")
else:
    log("⚠️  No categorized audit logs available")

log_execution_time("Permission Changes Audit", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("DATA EXPORT MONITORING")
log("="*60)

if 'audit_logs_categorized' in locals() and audit_logs_categorized is not None:
    try:
        # Monitor data export activities using Spark
        log("Filtering data export events...")
        export_events_spark = audit_logs_categorized.filter(F.col('event_category') == 'DATA_EXPORT')
        export_count = export_events_spark.count()
        
        log(f"Total data export events: {export_count:,}")

        if export_count > 0:
            # Analyze exports
            log("Analyzing export patterns...")
            export_summary_spark = export_events_spark.groupBy('user_email', 'action_name').agg(
                F.count('request_id').alias('export_count'),
                F.min('event_date').alias('first_export'),
                F.max('event_date').alias('last_export')
            ).orderBy(F.desc('export_count'))
            
            # Convert top results to pandas
            export_summary = export_summary_spark.limit(100).toPandas()
            export_summary.columns = ['user_email', 'export_action', 'export_count', 'first_export', 'last_export']
            
            log("\nData Export Summary:")
            display(export_summary.head(20))
            
            # Flag high-volume exporters
            high_volume_exporters = export_summary[export_summary['export_count'] > 50]
            if len(high_volume_exporters) > 0:
                log(f"\n⚠️ HIGH VOLUME EXPORTERS (>50 exports): {len(high_volume_exporters)}")
                display(high_volume_exporters)
            else:
                log("✅ No high-volume exporters detected")
        else:
            log("No data export events found in the audit logs.")
            
    except Exception as e:
        log(f"❌ Error monitoring data exports: {str(e)}")
else:
    log("⚠️  No categorized audit logs available")

log_execution_time("Data Export Monitoring", cell_start_time)


---

## 3. Data Retention Compliance

Monitoring table retention against organizational policies:
* Tables exceeding maximum retention period
* Tables below minimum retention requirements
* Stale data identification
* Retention policy recommendations
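
The retention cell below applies a `calculate_retention_status` helper that is defined elsewhere in the notebook. The sketch here shows one plausible classification, assuming the 180-day staleness threshold and 7-year maximum retention listed under Features. The function name `sketch_retention_status`, the mapping of status names to thresholds, and the `UNKNOWN` label for missing values are assumptions for illustration, not the notebook's actual logic.

```python
import math
from typing import Optional


def sketch_retention_status(
    days_since_modified: Optional[float],
    stale_days: int = 180,              # assumed staleness threshold
    max_retention_days: int = 7 * 365,  # assumed 7-year maximum (2,555 days)
) -> str:
    """Classify a table by how long ago it was last modified."""
    # pandas passes NaN for missing values, so treat None and NaN alike
    if days_since_modified is None or math.isnan(days_since_modified):
        return "UNKNOWN"
    if days_since_modified > max_retention_days:
        return "EXCEEDS_POLICY"
    if days_since_modified >= stale_days:
        return "WITHIN_POLICY"
    return "COMPLIANT"
```

Handling NaN explicitly matters here: comparisons against NaN are always false, so without the check a table with no `last_altered` value would silently fall through to `COMPLIANT`.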

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("ANALYZING TABLE RETENTION")
log("="*60)

try:
    # Query table metadata for retention analysis
    log("Querying system.information_schema.tables...")
    retention_query = """
    SELECT 
        table_catalog,
        table_schema,
        table_name,
        table_type,
        CONCAT(table_catalog, '.', table_schema, '.', table_name) as full_table_name,
        created,
        last_altered,
        DATEDIFF(CURRENT_DATE(), DATE(last_altered)) as days_since_modified,
        DATEDIFF(CURRENT_DATE(), DATE(created)) as days_since_created,
        comment
    FROM system.information_schema.tables
    WHERE table_catalog NOT IN ('system', '__databricks_internal')
        AND table_type IN ('MANAGED', 'EXTERNAL')
    ORDER BY days_since_modified DESC
    """

    retention_df = spark.sql(retention_query)

    # Cache before the first action so count() and toPandas() share one scan
    retention_df = safe_cache(retention_df, "retention_df")
    total_tables = retention_df.count()

    log(f"✓ Total tables analyzed: {total_tables:,}")

    # Convert to pandas for analysis
    log("Converting to pandas for retention analysis...")
    retention_pd = retention_df.toPandas()

    # Apply retention status classification
    log("Classifying retention status...")
    retention_pd['retention_status'] = retention_pd['days_since_modified'].apply(calculate_retention_status)

    display(retention_pd.head(20))
    
except Exception as e:
    log(f"❌ Error analyzing table retention: {str(e)}")
    retention_pd = None
    total_tables = 0
    
log_execution_time("Analyze Table Retention", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("RETENTION COMPLIANCE SUMMARY")
log("="*60)

if retention_pd is not None and len(retention_pd) > 0:
    try:
        # Summarize retention compliance
        retention_summary = retention_pd.groupby('retention_status').agg({
            'full_table_name': 'count'
        }).reset_index()
        retention_summary.columns = ['retention_status', 'table_count']
        retention_summary = retention_summary.sort_values('table_count', ascending=False)

        log("\n📊 Data Retention Compliance Summary:")
        display(retention_summary)

        # Calculate percentages
        retention_summary['percentage'] = (retention_summary['table_count'] / total_tables * 100).round(2)
        log("\nRetention Status Distribution:")
        for _, row in retention_summary.iterrows():
            status_icon = "✅" if row['retention_status'] in ['COMPLIANT', 'WITHIN_POLICY'] else "⚠️"
            log(f"  {status_icon} {row['retention_status']}: {row['table_count']:,} tables ({row['percentage']}%)")
            
    except Exception as e:
        log(f"❌ Error summarizing retention compliance: {str(e)}")
        retention_summary = None
else:
    log("⚠️  No retention data to summarize")
    retention_summary = None
    
log_execution_time("Retention Compliance Summary", cell_start_time)

In [0]:
# Identify tables exceeding maximum retention period (compliance risk)
# retention_pd is None when the retention analysis cell failed, so guard before indexing
if retention_pd is not None and len(retention_pd) > 0:
    exceeds_policy = retention_pd[retention_pd['retention_status'] == 'EXCEEDS_POLICY'].copy()
    exceeds_policy = exceeds_policy.sort_values('days_since_modified', ascending=False)

    print(f"⚠️ Tables exceeding retention policy ({max_retention_days} days): {len(exceeds_policy):,}")

    if len(exceeds_policy) > 0:
        print("\nOldest tables (by last modification):")
        display(exceeds_policy[['full_table_name', 'table_type', 'days_since_modified', 'days_since_created', 'last_altered']].head(30))

        # Check if any sensitive tables exceed retention
        # (sensitive_tables is None when the aggregation cell found nothing or failed)
        if 'sensitive_tables' in locals() and sensitive_tables is not None:
            sensitive_exceeds = exceeds_policy[exceeds_policy['full_table_name'].isin(sensitive_tables['full_table_name'])]
            if len(sensitive_exceeds) > 0:
                print(f"\n⚠️⚠️ CRITICAL: {len(sensitive_exceeds)} SENSITIVE tables exceed retention policy!")
                display(sensitive_exceeds[['full_table_name', 'days_since_modified', 'last_altered']].head(20))
    else:
        print("✅ No tables exceed the retention policy.")
else:
    print("⚠️  No retention data available; run the retention analysis cell first")

In [0]:
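# Identify stale tables (not modified in 180+ days)
stale_threshold = 180

# retention_pd is None when the retention analysis cell failed, so guard before indexing
if retention_pd is not None and len(retention_pd) > 0:
    stale_tables = retention_pd[retention_pd['days_since_modified'] >= stale_threshold].copy()
    stale_tables = stale_tables.sort_values('days_since_modified', ascending=False)

    print(f"Stale tables (not modified in {stale_threshold}+ days): {len(stale_tables):,}")

    if len(stale_tables) > 0:
        # Categorize by staleness. right=False makes each bin left-inclusive
        # ([180, 365), [365, 730), ...), so a table at exactly 180 days is
        # categorized instead of falling outside the first bin.
        stale_tables['staleness_category'] = pd.cut(
            stale_tables['days_since_modified'],
            bins=[180, 365, 730, 1825, float('inf')],
            labels=['6mo-1yr', '1-2yrs', '2-5yrs', '5+yrs'],
            right=False
        )

        # observed=False keeps empty categories in the summary
        staleness_summary = stale_tables.groupby('staleness_category', observed=False).size().reset_index(name='count')
        print("\nStaleness Distribution:")
        display(staleness_summary)

        print("\nTop 30 stale tables:")
        display(stale_tables[['full_table_name', 'table_type', 'days_since_modified', 'last_altered', 'staleness_category']].head(30))
    else:
        print("✅ No stale tables found.")
else:
    print("⚠️  No retention data available; run the retention analysis cell first")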


---

## 4. Regulatory Compliance Tracking

Compliance checks for major regulations:
* **GDPR** - Right to erasure, data portability, consent tracking
* **CCPA** - Consumer data rights, opt-out tracking
* **SOX** - Financial data access controls, audit trails
* **HIPAA** - Healthcare data protection (if applicable)
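
The checks in this section infer compliance capabilities from column names, for example `deleted_at` for erasure or `opt_out` for CCPA opt-out tracking. The sketch below shows that name-matching idea in plain Python. The function name `tables_with_matching_columns` and the sample rows are hypothetical; the two patterns are the ones used in the GDPR and CCPA cells.

```python
import re
from typing import Iterable, List, Optional, Tuple


def tables_with_matching_columns(
    columns: Iterable[Tuple[str, Optional[str]]],
    pattern: str,
) -> List[str]:
    """Return the sorted table names that have a column name matching pattern.

    columns holds (full_table_name, column_name) pairs; matching is
    case-insensitive and null column names are skipped.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted({table for table, column in columns if column and regex.search(column)})


# Hypothetical sample rows, for illustration only
sample_columns = [
    ("main.crm.users", "deleted_at"),
    ("main.crm.users", "email"),
    ("main.sales.orders", "opt_out_flag"),
    ("main.sales.orders", None),
]

deletion_tables = tables_with_matching_columns(sample_columns, r"delet|remov|erasure")
optout_tables = tables_with_matching_columns(sample_columns, r"opt.?out|consent|preference")
```

A match only shows that a column with a suggestive name exists. It does not demonstrate that deletion or opt-out requests are actually honored, so results are best read as leads for manual review.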

In [0]:
# GDPR Compliance: Check for tables with EU personal data
print("=== GDPR Compliance Analysis ===")
print("\nKey Requirements:")
print("  • Right to erasure (data deletion capability)")
print("  • Data portability (export capability)")
print("  • Consent tracking")
print("  • Data retention limits")

# Identify tables likely containing EU personal data
if 'sensitive_tables' in locals() and sensitive_tables is not None and len(sensitive_tables) > 0:
    gdpr_relevant_tables = sensitive_tables[sensitive_tables['max_sensitivity'] == 'HIGH'].copy()
    print(f"\nTables with HIGH sensitivity (potential GDPR scope): {len(gdpr_relevant_tables):,}")
    
    # Cross-reference with retention compliance
    if 'retention_pd' in locals() and retention_pd is not None:
        gdpr_retention = gdpr_relevant_tables.merge(
            retention_pd[['full_table_name', 'days_since_modified', 'retention_status']],
            on='full_table_name',
            how='left'
        )
        
        # Flag GDPR violations (sensitive data exceeding retention)
        gdpr_violations = gdpr_retention[gdpr_retention['retention_status'] == 'EXCEEDS_POLICY']
        
        if len(gdpr_violations) > 0:
            print(f"\n⚠️ GDPR RISK: {len(gdpr_violations)} sensitive tables exceed retention policy")
            display(gdpr_violations[['full_table_name', 'sensitive_column_count', 'days_since_modified']].head(20))
        else:
            print("\n✅ No GDPR retention violations detected")
    
    # Check for deletion tracking (look for deleted_at, is_deleted columns)
    # Note: sensitive_data only holds HIGH/MEDIUM columns, so deletion columns
    # whose names match no sensitivity keyword are not visible to this check
    if 'sensitive_data' in locals() and sensitive_data is not None:
        deletion_columns = sensitive_data[
            sensitive_data['column_name'].str.lower().str.contains('delet|remov|erasure', na=False)
        ]
        tables_with_deletion = deletion_columns['full_table_name'].nunique()
        print(f"\nTables with deletion tracking columns: {tables_with_deletion:,}")
        if tables_with_deletion > 0:
            print("✅ Deletion tracking columns found (a name-based signal, not proof of erasure capability)")
        else:
            print("⚠️ Consider implementing soft-delete columns for GDPR compliance")
else:
    print("\nNo sensitive tables identified for GDPR analysis.")

In [0]:
# CCPA Compliance: California Consumer Privacy Act
print("=== CCPA Compliance Analysis ===")
print("\nKey Requirements:")
print("  • Right to know (data access)")
print("  • Right to delete")
print("  • Right to opt-out of sale")
print("  • Non-discrimination")

# Check for opt-out tracking
if 'sensitive_data' in locals() and sensitive_data is not None:
    opt_out_columns = sensitive_data[
        sensitive_data['column_name'].str.lower().str.contains('opt.?out|consent|preference', na=False, regex=True)
    ]
    tables_with_optout = opt_out_columns['full_table_name'].nunique()
    
    print(f"\nTables with opt-out/consent tracking: {tables_with_optout:,}")
    if tables_with_optout > 0:
        print("✅ Opt-out tracking appears to be implemented")
        display(opt_out_columns[['full_table_name', 'column_name', 'data_type']].head(20))
    else:
        print("⚠️ Consider implementing opt-out tracking for CCPA compliance")

# Check for California-specific data
if 'sensitive_data' in locals() and sensitive_data is not None:
    ca_columns = sensitive_data[
        sensitive_data['column_name'].str.lower().str.contains('california|ca_resident|state', na=False)
    ]
    if len(ca_columns) > 0:
        print(f"\nTables with California-related columns: {ca_columns['full_table_name'].nunique():,}")
        print("⚠️ These tables may require CCPA compliance measures")

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("SOX COMPLIANCE CHECK")
log("="*60)

try:
    # SOX Compliance: Sarbanes-Oxley Act (financial data controls)
    log("=== SOX Compliance Analysis ===")
    log("\nKey Requirements:")
    log("  • Financial data access controls")
    log("  • Audit trail completeness")
    log("  • Segregation of duties")
    log("  • Change management tracking")

    # Identify financial data tables
    if 'sensitive_data' in locals() and sensitive_data is not None:
        financial_keywords = ['revenue', 'invoice', 'payment', 'transaction', 'financial', 
                             'accounting', 'ledger', 'balance', 'cost', 'price', 'amount']
        
        financial_columns = sensitive_data[
            sensitive_data['column_name'].str.lower().str.contains('|'.join(financial_keywords), na=False)
        ]
        financial_tables = financial_columns['full_table_name'].unique()
        
        log(f"\nTables with financial data: {len(financial_tables):,}")
        
        if len(financial_tables) > 0:
            log("\nFinancial tables requiring SOX controls:")
            financial_summary = financial_columns.groupby('full_table_name').agg({
                'column_name': 'count'
            }).reset_index()
            financial_summary.columns = ['full_table_name', 'financial_column_count']
            financial_summary = financial_summary.sort_values('financial_column_count', ascending=False)
            display(financial_summary.head(20))
            
            # Confirm that audit events exist for the period. This checks overall logging only;
            # it does not verify per-table coverage of the financial tables above.
            audit_count = 0
            if 'audit_logs_categorized' in locals() and audit_logs_categorized is not None:
                audit_count = audit_logs_categorized.count()
            if audit_count > 0:
                log("\n✅ Audit logging is active (verify per-table coverage separately for SOX)")
                log(f"   Total audit events in period: {audit_count:,}")
            else:
                log("\n⚠️ Limited audit data available - ensure comprehensive logging for SOX")
        else:
            log("\nNo financial data tables identified.")
    else:
        log("\nNo sensitive data available for SOX analysis.")
        
except Exception as e:
    log(f"❌ Error in SOX compliance check: {str(e)}")

log_execution_time("SOX Compliance Check", cell_start_time)


---

## 5. Access Pattern Anomaly Detection

Identifying unusual access patterns that may indicate:
* Unauthorized access attempts
* Data exfiltration
* Insider threats
* Compromised accounts
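
The first two checks in this section (after-hours access and high-frequency access) can be illustrated on a toy pandas DataFrame. This is a minimal sketch, not the notebook's Spark implementation: the function names, the sample rows, and the `event_time` string format are assumptions made for illustration. It treats 8 PM onward as after-hours, following the 6 AM - 8 PM window described in the next cell.

```python
import pandas as pd


def flag_after_hours(events: pd.DataFrame, start_hour: int = 6, end_hour: int = 20) -> pd.DataFrame:
    """Return events whose hour falls outside [start_hour, end_hour)."""
    hours = pd.to_datetime(events["event_time"]).dt.hour
    return events[(hours < start_hour) | (hours >= end_hour)]


def events_per_active_day(events: pd.DataFrame) -> pd.Series:
    """Average number of events per active day for each user."""
    dates = pd.to_datetime(events["event_time"]).dt.date
    grouped = events.assign(event_date=dates).groupby("user_email")
    return grouped.size() / grouped["event_date"].nunique()


# Hypothetical sample data (not taken from any real audit log)
toy_events = pd.DataFrame({
    "user_email": ["a@example.com", "a@example.com", "b@example.com", "b@example.com"],
    "event_time": [
        "2024-01-01 05:30:00",  # before 6 AM -> after-hours
        "2024-01-01 09:00:00",
        "2024-01-01 20:15:00",  # 8 PM or later -> after-hours
        "2024-01-02 12:00:00",
    ],
})

after_hours = flag_after_hours(toy_events)
rates = events_per_active_day(toy_events)
```

Dividing by active days rather than calendar days means a user who was active on a single day is not diluted by idle days, which is the same choice the Spark aggregation below makes with `countDistinct('event_date')`.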

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("DETECTING UNUSUAL ACCESS PATTERNS")
log("="*60)

try:
    # Anomaly detection: Unusual access patterns (USER ACCOUNTS ONLY)
    log("=== Access Pattern Anomaly Detection (Users Only) ===")

    if 'audit_logs_users_only' in locals() and audit_logs_users_only is not None:
        # 1. After-hours access (outside 6 AM - 8 PM) using Spark.
        # F.hour() uses the Spark session time zone, so confirm it matches local business hours.
        log("\nAnalyzing after-hours access patterns...")
        audit_with_hour = audit_logs_users_only.withColumn('event_hour', F.hour(F.col('event_time')))
        after_hours_spark = audit_with_hour.filter((F.col('event_hour') < 6) | (F.col('event_hour') >= 20))
        
        after_hours_count = after_hours_spark.count()
        total_count = audit_logs_users_only.count()
        after_hours_pct = (after_hours_count / total_count * 100) if total_count > 0 else 0
        
        log(f"\nAfter-hours access events: {after_hours_count:,} ({after_hours_pct:.1f}%)")
        
        if after_hours_count > 0:
            after_hours_users_spark = after_hours_spark.groupBy('user_email').agg(
                F.count('request_id').alias('after_hours_count')
            ).orderBy(F.desc('after_hours_count'))
            
            after_hours_users = after_hours_users_spark.limit(20).toPandas()
            log("\nTop after-hours users:")
            display(after_hours_users)
        
        # 2. High-frequency access (potential automated scraping)
        log("\nAnalyzing high-frequency access patterns...")
        user_activity_spark = audit_logs_users_only.groupBy('user_email').agg(
            F.count('request_id').alias('total_events'),
            F.countDistinct('event_date').alias('active_days')
        )
        user_activity_spark = user_activity_spark.withColumn(
            'events_per_day', 
            F.round(F.col('total_events') / F.col('active_days'), 1)
        )
        
        high_frequency_spark = user_activity_spark.filter(F.col('events_per_day') > 1000)
        high_freq_count = high_frequency_spark.count()
        
        if high_freq_count > 0:
            log(f"\n⚠️ High-frequency users (>1000 events/day): {high_freq_count}")
            high_frequency = high_frequency_spark.orderBy(F.desc('events_per_day')).limit(20).toPandas()
            display(high_frequency)
        else:
            log("\n✅ No unusually high-frequency access detected")
        
        # 3. Failed access attempts
        log("\nAnalyzing failed access attempts...")
        failed_attempts_spark = audit_logs_users_only.filter(
            F.col('status_code').isNotNull() & (F.col('status_code') >= 400)
        )
        failed_count = failed_attempts_spark.count()
        
        if failed_count > 0:
            failed_pct = (failed_count / total_count * 100) if total_count > 0 else 0
            log(f"\nFailed access attempts: {failed_count:,} ({failed_pct:.1f}%)")
            
            failed_by_user_spark = failed_attempts_spark.groupBy('user_email').agg(
                F.count('request_id').alias('failed_count')
            ).orderBy(F.desc('failed_count'))
            
            failed_by_user = failed_by_user_spark.limit(20).toPandas()
            log("\nUsers with most failed attempts:")
            display(failed_by_user)
        else:
            log("\n✅ No failed access attempts detected")
    else:
        log("\nInsufficient audit data for anomaly detection.")
        
except Exception as e:
    log(f"❌ Error detecting unusual access patterns: {str(e)}")

log_execution_time("Detect Unusual Access Patterns", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("GEOGRAPHIC ACCESS ANALYSIS")
log("="*60)

try:
    # Analyze access by source IP (geographic anomalies) - USER ACCOUNTS ONLY
    log("=== Geographic Access Analysis (Users Only) ===")

    if 'audit_logs_users_only' in locals() and audit_logs_users_only is not None:
        log("\nAnalyzing source IP patterns...")
        ip_summary_spark = audit_logs_users_only.filter(F.col('source_ip_address').isNotNull()).groupBy('source_ip_address').agg(
            F.count('request_id').alias('event_count'),
            F.countDistinct('user_email').alias('unique_users')
        ).orderBy(F.desc('event_count'))
        
        unique_ips = ip_summary_spark.count()
        log(f"Unique source IPs: {unique_ips:,}")
        
        # Convert top IPs to pandas for display
        ip_summary = ip_summary_spark.limit(20).toPandas()
        ip_summary.columns = ['source_ip', 'event_count', 'unique_users']
        
        log("\nTop source IPs by activity:")
        display(ip_summary)
        
        # Flag IPs with multiple users (potential proxy/VPN)
        multi_user_ips_spark = ip_summary_spark.filter(F.col('unique_users') > 5)
        multi_user_count = multi_user_ips_spark.count()
        
        if multi_user_count > 0:
            log(f"\nIPs with multiple users (>5): {multi_user_count:,}")
            log("(May indicate corporate proxy, VPN, or shared infrastructure)")
            multi_user_ips = multi_user_ips_spark.limit(10).toPandas()
            multi_user_ips.columns = ['source_ip', 'event_count', 'unique_users']
            display(multi_user_ips)
        else:
            log("\n✅ No unusual multi-user IP patterns detected")
    else:
        log("\nAudit data not available for source IP analysis.")
        
except Exception as e:
    log(f"❌ Error in geographic access analysis: {str(e)}")

log_execution_time("Geographic Access Analysis", cell_start_time)
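
The cell above groups events by source IP but does not resolve addresses to locations. A possible first step before any geolocation lookup is to separate private (internal) addresses from public ones using Python's standard `ipaddress` module. This is a minimal sketch: `classify_ip` and the sample addresses are hypothetical, and it assumes `source_ip_address` holds plain IPv4 or IPv6 strings. Actual geolocation would require an external IP database, which is not shown here.

```python
import ipaddress
from collections import Counter


def classify_ip(ip_string: str) -> str:
    """Classify an address string as 'private', 'public', or 'invalid'."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        # Not a parseable IPv4/IPv6 address (e.g. empty or malformed value)
        return "invalid"
    # is_private also covers loopback and link-local ranges
    return "private" if ip.is_private else "public"


# Hypothetical sample values for illustration only
sample_ips = ["10.0.0.5", "192.168.1.20", "8.8.8.8", "not-an-ip"]
ip_classes = Counter(classify_ip(ip) for ip in sample_ips)
```

Public addresses that appear for only a single user are usually the more interesting ones to review, since shared corporate egress IPs tend to show many users.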


---

## 6. Compliance Reporting & Visualizations

Executive dashboards and detailed reports for compliance stakeholders.

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("CALCULATING COMPLIANCE SCORE")
log("="*60)

try:
    compliance_metrics = {}

    # 1. Retention compliance (weight: 30%)
    if 'retention_summary' in locals() and retention_summary is not None and 'total_tables' in locals():
        compliant_tables = retention_summary[
            retention_summary['retention_status'].isin(['COMPLIANT', 'WITHIN_POLICY'])
        ]['table_count'].sum()
        retention_score = (compliant_tables / total_tables * 100) if total_tables > 0 else 0
        compliance_metrics['Retention Compliance'] = retention_score
    else:
        compliance_metrics['Retention Compliance'] = 0

    # 2. Sensitive data protection (weight: 25%)
    if 'sensitive_tables' in locals() and sensitive_tables is not None and len(sensitive_tables) > 0:
        # Check if sensitive tables have proper controls (simplified check)
        sensitive_score = 75  # Baseline score if sensitive data is identified
        compliance_metrics['Sensitive Data Protection'] = sensitive_score
    else:
        compliance_metrics['Sensitive Data Protection'] = 100  # No sensitive data = compliant

    # 3. Audit trail coverage (weight: 25%)
    if 'audit_logs_pd' in locals() and audit_logs_pd is not None and len(audit_logs_pd) > 0:
        audit_score = min(100, (len(audit_logs_pd) / (days_back * 100)) * 100)  # Expect ~100 events/day
        compliance_metrics['Audit Trail Coverage'] = audit_score
    else:
        compliance_metrics['Audit Trail Coverage'] = 0

    # 4. Access control compliance (weight: 20%)
    if 'permission_changes' in locals() and permission_changes is not None:
        # Penalize excessive permission changes
        perm_change_rate = len(permission_changes) / days_back
        access_score = max(0, 100 - (perm_change_rate * 2))  # Deduct 2 points per change/day
        compliance_metrics['Access Control'] = access_score
    else:
        compliance_metrics['Access Control'] = 100

    # Calculate weighted overall score
    weights = {
        'Retention Compliance': 0.30,
        'Sensitive Data Protection': 0.25,
        'Audit Trail Coverage': 0.25,
        'Access Control': 0.20
    }

    overall_score = sum(compliance_metrics[k] * weights[k] for k in compliance_metrics.keys())

    log("\n📊 Compliance Metrics:")
    for metric, score in compliance_metrics.items():
        status = "✅" if score >= 80 else "⚠️" if score >= 60 else "❌"
        log(f"  {status} {metric}: {score:.1f}%")

    log(f"\n{'='*60}")
    log(f"Overall Compliance Score: {overall_score:.1f}%")
    log(f"{'='*60}")

    if overall_score >= 80:
        log("✅ GOOD - Compliance posture is strong")
    elif overall_score >= 60:
        log("⚠️ FAIR - Some compliance gaps need attention")
    else:
        log("❌ POOR - Significant compliance issues require immediate action")
        
except Exception as e:
    log(f"❌ Error calculating compliance score: {str(e)}")
    overall_score = 0
    compliance_metrics = {}
    
log_execution_time("Compliance Score Calculation", cell_start_time)
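
The weighted score computed above can be isolated into a small helper, which makes the arithmetic easy to check by hand. This is a sketch under assumptions: the function names and the example metric values are illustrative and are not part of the notebook; the weights and the 80/60 thresholds are the ones used in the cell above.

```python
def weighted_compliance_score(metrics: dict, weights: dict) -> float:
    """Combine per-metric scores (0-100) into a single weighted score."""
    unknown = set(metrics) - set(weights)
    if unknown:
        raise KeyError(f"No weight defined for: {sorted(unknown)}")
    return sum(score * weights[name] for name, score in metrics.items())


def rating(score: float) -> str:
    """Map a score to the GOOD / FAIR / POOR bands used in the report."""
    if score >= 80:
        return "GOOD"
    if score >= 60:
        return "FAIR"
    return "POOR"


# Illustrative metric values (assumed, not real results)
example_metrics = {
    "Retention Compliance": 90.0,
    "Sensitive Data Protection": 75.0,
    "Audit Trail Coverage": 100.0,
    "Access Control": 60.0,
}
example_weights = {
    "Retention Compliance": 0.30,
    "Sensitive Data Protection": 0.25,
    "Audit Trail Coverage": 0.25,
    "Access Control": 0.20,
}

example_score = weighted_compliance_score(example_metrics, example_weights)
example_rating = rating(example_score)
```

With these values the score is 27 + 18.75 + 25 + 12 = 82.75, which falls in the GOOD band. Because the weights are not renormalized, omitting a metric would lower the maximum achievable score below 100.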

In [0]:
cell_start_time = time.time()

if not is_job_mode and ENABLE_VISUALIZATIONS:
    log("\n" + "="*60)
    log("GENERATING VISUALIZATIONS")
    log("="*60)
    
    try:
        # Create compliance visualizations
        import matplotlib.pyplot as plt
        import seaborn as sns

        sns.set_style('whitegrid')
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))

        # 1. Compliance Score Breakdown
        if 'compliance_metrics' in locals() and compliance_metrics:
            ax1 = axes[0, 0]
            metrics_df = pd.DataFrame(list(compliance_metrics.items()), columns=['Metric', 'Score'])
            colors = ['green' if s >= 80 else 'orange' if s >= 60 else 'red' for s in metrics_df['Score']]
            ax1.barh(metrics_df['Metric'], metrics_df['Score'], color=colors, alpha=0.7)
            ax1.set_xlabel('Score (%)', fontsize=12)
            ax1.set_title('Compliance Metrics Breakdown', fontsize=14, fontweight='bold')
            ax1.set_xlim(0, 100)
            for i, v in enumerate(metrics_df['Score']):
                ax1.text(v + 2, i, f'{v:.1f}%', va='center', fontsize=10)

        # 2. Retention Status Distribution
        if 'retention_summary' in locals() and retention_summary is not None:
            ax2 = axes[0, 1]
            colors_retention = {'COMPLIANT': 'green', 'WITHIN_POLICY': 'lightgreen', 
                               'EXCEEDS_POLICY': 'red', 'UNKNOWN': 'gray'}
            retention_colors = [colors_retention.get(status, 'blue') for status in retention_summary['retention_status']]
            ax2.pie(retention_summary['table_count'], labels=retention_summary['retention_status'], 
                    autopct='%1.1f%%', colors=retention_colors, startangle=90)
            ax2.set_title('Data Retention Status', fontsize=14, fontweight='bold')

        # 3. Audit Event Categories
        if 'event_summary' in locals() and event_summary is not None:
            ax3 = axes[1, 0]
            top_events = event_summary.head(8)
            ax3.bar(range(len(top_events)), top_events['event_count'], color='steelblue', alpha=0.7)
            ax3.set_xticks(range(len(top_events)))
            ax3.set_xticklabels(top_events['event_category'], rotation=45, ha='right')
            ax3.set_ylabel('Event Count', fontsize=12)
            ax3.set_title('Audit Events by Category', fontsize=14, fontweight='bold')
            ax3.ticklabel_format(style='plain', axis='y')

        # 4. Sensitive Data Distribution
        if 'sensitive_data' in locals() and sensitive_data is not None and len(sensitive_data) > 0:
            ax4 = axes[1, 1]
            sensitivity_counts = sensitive_data['sensitivity_level'].value_counts()
            colors_sens = {'HIGH': 'red', 'MEDIUM': 'orange', 'LOW': 'yellow'}
            sens_colors = [colors_sens.get(level, 'blue') for level in sensitivity_counts.index]
            ax4.pie(sensitivity_counts.values, labels=sensitivity_counts.index, 
                    autopct='%1.1f%%', colors=sens_colors, startangle=90)
            ax4.set_title('Sensitive Column Distribution', fontsize=14, fontweight='bold')

        plt.tight_layout()
        plt.show()

        log("✓ Compliance visualizations generated successfully")
        
    except Exception as e:
        log(f"❌ Error generating visualizations: {str(e)}")
else:
    log("ℹ️ Visualizations disabled (job mode or ENABLE_VISUALIZATIONS=False)")
    
log_execution_time("Compliance Visualizations", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*70)
log(" "*15 + "COMPLIANCE & AUDIT TRAIL REPORT")
log(" "*20 + f"Period: {start_date} to {end_date}")
log("="*70)

try:
    if 'overall_score' in locals():
        log(f"\n📊 OVERALL COMPLIANCE SCORE: {overall_score:.1f}%\n")
    else:
        log(f"\n📊 OVERALL COMPLIANCE SCORE: N/A\n")

    log("📋 KEY FINDINGS:\n")

    # Finding 1: Sensitive Data
    if 'sensitive_tables' in locals() and sensitive_tables is not None:
        log(f"1. SENSITIVE DATA INVENTORY")
        log(f"   • {len(sensitive_tables):,} tables contain sensitive/PII data")
        if 'sensitive_data' in locals() and sensitive_data is not None:
            high_count = len(sensitive_data[sensitive_data['sensitivity_level']=='HIGH'])
            medium_count = len(sensitive_data[sensitive_data['sensitivity_level']=='MEDIUM'])
            log(f"   • {high_count:,} HIGH sensitivity columns identified")
            log(f"   • {medium_count:,} MEDIUM sensitivity columns identified")
        log("")

    # Finding 2: Audit Activity
    if 'total_events' in locals() and total_events > 0:
        log(f"2. AUDIT TRAIL ACTIVITY")
        log(f"   • {total_events:,} total audit events recorded")
        if 'audit_logs_categorized' in locals() and audit_logs_categorized is not None:
            unique_users = audit_logs_categorized.select('user_email').distinct().count()
            log(f"   • {unique_users:,} unique users active")
        if 'event_summary' in locals() and event_summary is not None:
            for _, row in event_summary.iterrows():
                log(f"   • {row['event_count']:,} {row['event_category']} events")
        log("")

    # Finding 3: Retention Compliance
    if 'retention_summary' in locals() and retention_summary is not None:
        log(f"3. DATA RETENTION COMPLIANCE")
        log(f"   • {total_tables:,} total tables analyzed")
        for _, row in retention_summary.iterrows():
            status_icon = "✅" if row['retention_status'] in ['COMPLIANT', 'WITHIN_POLICY'] else "⚠️"
            log(f"   {status_icon} {row['retention_status']}: {row['table_count']:,} tables")
        log("")

    # Finding 4: Anomalies
    log(f"4. SECURITY ANOMALIES")
    anomalies_found = False
    
    # Check for after-hours access
    if 'after_hours_count' in locals() and after_hours_count > 0:
        log(f"   ⚠️ {after_hours_count:,} after-hours access events detected")
        anomalies_found = True
    
    # Check for high-frequency users
    if 'high_freq_count' in locals() and high_freq_count > 0:
        log(f"   ⚠️ {high_freq_count} users with high-frequency access (>1000 events/day)")
        anomalies_found = True
    
    # Check for failed attempts
    if 'failed_count' in locals() and failed_count > 0:
        log(f"   ⚠️ {failed_count:,} failed access attempts")
        anomalies_found = True
    
    if not anomalies_found:
        log(f"   ✅ No significant anomalies detected")
    log("")

    log("⚙️ RECOMMENDATIONS:\n")

    recommendations = []

    if 'exceeds_policy' in locals() and exceeds_policy is not None and len(exceeds_policy) > 0:
        recommendations.append(f"Review and archive/delete {len(exceeds_policy):,} tables exceeding retention policy")

    if 'sensitive_tables' in locals() and sensitive_tables is not None and len(sensitive_tables) > 0:
        recommendations.append(f"Implement encryption and access controls for {len(sensitive_tables):,} sensitive tables")

    if 'high_freq_count' in locals() and high_freq_count > 0:
        recommendations.append(f"Investigate {high_freq_count} users with unusually high access frequency")

    if 'after_hours_count' in locals() and after_hours_count > 0:
        recommendations.append(f"Review after-hours access patterns ({after_hours_count:,} events)")

    if 'overall_score' in locals() and overall_score < 80:
        recommendations.append("Develop remediation plan to improve compliance score to 80%+")

    if not recommendations:
        log("   ✅ No critical recommendations - maintain current compliance posture")
    else:
        # Number sequentially so that skipped recommendations leave no gaps
        for i, rec in enumerate(recommendations, start=1):
            log(f"   {i}. {rec}")

    log("\n" + "="*70)
    log(f"Report generated: {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S')}")
    log("="*70)
    
except Exception as e:
    log(f"❌ Error generating executive summary: {str(e)}")

log_execution_time("Executive Summary Report", cell_start_time)

In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("EXPORTING COMPLIANCE REPORT")
log("="*60)

if ENABLE_DELTA_EXPORT or ENABLE_JSON_EXPORT or ENABLE_EXCEL_EXPORT:
    try:
        report_timestamp = datetime.now(eastern)

        # 1. Prepare sensitive tables inventory
        if 'sensitive_tables' in locals() and sensitive_tables is not None and len(sensitive_tables) > 0:
            sensitive_export = sensitive_tables.copy()
            sensitive_export['report_date'] = report_timestamp
            sensitive_export['analysis_period_start'] = start_date
            sensitive_export['analysis_period_end'] = end_date
            log(f"  Sensitive tables inventory: {len(sensitive_export):,} records")
        else:
            sensitive_export = None

        # 2. Prepare retention compliance status
        if 'retention_pd' in locals() and retention_pd is not None:
            retention_export = retention_pd.copy()
            retention_export['report_date'] = report_timestamp
            retention_export['analysis_period_start'] = start_date
            retention_export['analysis_period_end'] = end_date
            log(f"  Retention compliance data: {len(retention_export):,} records")
        else:
            retention_export = None

        # 3. Prepare compliance metrics
        if 'compliance_metrics' in locals() and compliance_metrics:
            metrics_export = pd.DataFrame([
                {
                    'report_date': report_timestamp,
                    'analysis_period_start': start_date,
                    'analysis_period_end': end_date,
                    'overall_compliance_score': overall_score if 'overall_score' in locals() else 0,
                    'retention_compliance_score': compliance_metrics.get('Retention Compliance', 0),
                    'sensitive_data_protection_score': compliance_metrics.get('Sensitive Data Protection', 0),
                    'audit_trail_coverage_score': compliance_metrics.get('Audit Trail Coverage', 0),
                    'access_control_score': compliance_metrics.get('Access Control', 0),
                    'total_tables_analyzed': total_tables if 'total_tables' in locals() else 0,
                    'sensitive_tables_count': len(sensitive_tables) if 'sensitive_tables' in locals() and sensitive_tables is not None else 0,
                    'audit_events_count': len(audit_logs_pd) if 'audit_logs_pd' in locals() and audit_logs_pd is not None else 0,
                    'retention_violations_count': len(exceeds_policy) if 'exceeds_policy' in locals() and exceeds_policy is not None else 0
                }
            ])
            log(f"  Compliance metrics summary: {len(metrics_export):,} record")
        else:
            metrics_export = None

        # Export to Delta tables
        if ENABLE_DELTA_EXPORT:
            log("\n💾 Exporting to Delta tables...")
            
            if sensitive_export is not None:
                sensitive_df = spark.createDataFrame(sensitive_export)
                table_name = f"{DELTA_CATALOG}.{DELTA_SCHEMA}.sensitive_tables_inventory"
                sensitive_df.write.mode('append').saveAsTable(table_name)
                log(f"  ✓ Exported to {table_name}")
            
            if retention_export is not None:
                retention_df = spark.createDataFrame(retention_export)
                table_name = f"{DELTA_CATALOG}.{DELTA_SCHEMA}.retention_compliance"
                retention_df.write.mode('append').saveAsTable(table_name)
                log(f"  ✓ Exported to {table_name}")
            
            if metrics_export is not None:
                metrics_df = spark.createDataFrame(metrics_export)
                table_name = f"{DELTA_CATALOG}.{DELTA_SCHEMA}.compliance_metrics"
                metrics_df.write.mode('append').saveAsTable(table_name)
                log(f"  ✓ Exported to {table_name}")

        # Export to Excel
        if ENABLE_EXCEL_EXPORT:
            log("\n📊 Exporting to Excel...")
            excel_path = f"{EXPORT_BASE_PATH}/compliance_report_{report_timestamp.strftime('%Y%m%d_%H%M%S')}.xlsx"
            
            with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
                if metrics_export is not None:
                    metrics_export.to_excel(writer, sheet_name='Compliance Metrics', index=False)
                if sensitive_export is not None:
                    sensitive_export.to_excel(writer, sheet_name='Sensitive Tables', index=False)
                if retention_export is not None:
                    retention_export.head(1000).to_excel(writer, sheet_name='Retention Compliance', index=False)
            
            log(f"  ✓ Exported to {excel_path}")

        # Export to JSON
        if ENABLE_JSON_EXPORT:
            log("\n📝 Exporting to JSON...")
            json_path = f"{EXPORT_BASE_PATH}/compliance_report_{report_timestamp.strftime('%Y%m%d_%H%M%S')}.json"
            
            export_data = {
                'report_metadata': {
                    'report_date': report_timestamp.isoformat(),
                    'analysis_period_start': start_date,
                    'analysis_period_end': end_date
                },
                'compliance_metrics': compliance_metrics if 'compliance_metrics' in locals() else {},
                'overall_score': overall_score if 'overall_score' in locals() else 0
            }
            
            import json
            with open(json_path, 'w') as f:
                json.dump(export_data, f, indent=2, default=str)
            
            log(f"  ✓ Exported to {json_path}")

        log("\n✅ Compliance report export completed successfully")
        
        # Display metrics summary
        if metrics_export is not None:
            display(metrics_export)
            
    except Exception as e:
        log(f"❌ Error exporting compliance report: {str(e)}")
else:
    log("ℹ️ Export disabled (ENABLE_DELTA_EXPORT, ENABLE_JSON_EXPORT, and ENABLE_EXCEL_EXPORT are all False)")
    log("\n💡 To enable exports, set one or more of these flags to True in the configuration cell")
    
log_execution_time("Export Compliance Report", cell_start_time)
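
The JSON export above passes `default=str`, which converts any value the encoder cannot handle into its string form. A stricter alternative, sketched here under assumptions, is an explicit converter that handles known types and fails on anything unexpected. The `to_jsonable` helper and the sample report are hypothetical and not part of the notebook.

```python
import json
from datetime import datetime, timezone


def to_jsonable(value):
    """Convert known non-JSON types; raise on anything unexpected."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Hypothetical report payload for illustration
report = {
    "report_date": datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc),
    "overall_score": 82.75,
}

payload = json.dumps(report, default=to_jsonable, indent=2)
restored = json.loads(payload)
```

The trade-off is that `default=str` never fails but may hide a type problem (for example, a number written as a quoted string), while an explicit converter surfaces such cases at export time.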

In [0]:
# ============================================================================
# EXECUTION SUMMARY
# ============================================================================

log("\n" + "="*70)
log(" "*20 + "EXECUTION SUMMARY")
log("="*70)

try:
    # Total execution time requires notebook_start_time to be set in the first cell.
    # cell_start_time only covers the most recent cell, so it is not used here.
    if 'notebook_start_time' in locals():
        total_time = time.time() - notebook_start_time
        log(f"\n⏱️  Total execution time: {total_time:.2f} seconds ({total_time/60:.1f} minutes)")
    
    # Summary statistics
    log("\n📊 Key Metrics:")
    
    if 'total_columns' in locals():
        log(f"  • Total columns analyzed: {total_columns:,}")
    
    if 'sensitive_data' in locals() and sensitive_data is not None:
        log(f"  • Sensitive columns identified: {len(sensitive_data):,}")
    
    if 'sensitive_tables' in locals() and sensitive_tables is not None:
        log(f"  • Sensitive tables identified: {len(sensitive_tables):,}")
    
    if 'total_tables' in locals():
        log(f"  • Total tables analyzed: {total_tables:,}")
    
    if 'total_events' in locals():
        log(f"  • Audit events analyzed: {total_events:,}")
    
    if 'overall_score' in locals():
        log(f"  • Overall compliance score: {overall_score:.1f}%")
    
    # Export status
    log("\n💾 Export Status:")
    log(f"  • Excel export: {'ENABLED' if ENABLE_EXCEL_EXPORT else 'DISABLED'}")
    log(f"  • Delta export: {'ENABLED' if ENABLE_DELTA_EXPORT else 'DISABLED'}")
    log(f"  • JSON export: {'ENABLED' if ENABLE_JSON_EXPORT else 'DISABLED'}")
    
    log("\n✅ Compliance & Audit Trail Monitor completed successfully")
    
except Exception as e:
    log(f"\n⚠️  Error generating execution summary: {str(e)}")

log("="*70)
log(f"Report generated: {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}")
log("="*70)