# Databricks Security Review Export

## Overview

This notebook provides a **comprehensive security audit** of your Databricks workspace by extracting and analyzing identity, access, and permission configurations. It exports detailed reports in multiple formats (Excel, CSV, JSON, HTML, and Delta tables) for compliance reviews, access audits, and security assessments.

**✨ Enterprise-grade security auditing with complete coverage, change tracking, compliance reporting, and performance optimization.**

---

## Performance Modes

### 🚀 Quick Mode (5-10 minutes)
**Recommended for**: Daily monitoring, quick audits, testing, **interactive development**

* Core permissions: Jobs, Warehouses, Pipelines
* Unity Catalog: Catalogs and schemas
* Security essentials: Secrets, IP access lists
* **Skips**: Clusters, Workspace objects, Models, Repos, Pools, SQL assets, Volumes, Tokens
* **Limits**: MAX_RESOURCES_PER_TYPE = 100

### 🔍 Full Mode (20-60 minutes)
**Recommended for**: Compliance audits, quarterly reviews, comprehensive analysis, **scheduled jobs**

* **ALL** resource types enabled
* **Maximum** limits (MAX_RESOURCES_PER_TYPE = 999, which the notebook treats as no limit; MAX_WORKSPACE_OBJECTS = 2000)
* Complete workspace coverage
* All compliance and risk analysis features
* **🤖 AUTOMATIC IN JOB MODE**: Jobs always run in Full Mode for comprehensive audits

### ⚙️ Custom Mode (Variable)
**Recommended for**: Specific use cases, targeted audits, **interactive sessions**

* Configure individual enable/disable flags in Cell 2
* Set custom limits for each resource type
* Fine-tune performance vs coverage trade-offs

**To select a mode**: 
* **Interactive**: Edit Cell 3 and uncomment your preferred preset
* **Job Mode**: Automatically uses Full Mode (cannot be overridden)

---

## Job Mode Behavior

### 🤖 **Automatic Full Mode in Jobs**

When running as a scheduled job, the notebook **automatically enables Full Mode** regardless of preset selection:

* **All resource types enabled** (ignores ENABLE_* flags from Cell 2)
* **No per-type limit** (MAX_RESOURCES_PER_TYPE = 999)
* **Workspace object cap raised** (MAX_WORKSPACE_OBJECTS = 2000, up from the Cell 2 default of 500)
* **Comprehensive audit** every time

**Rationale**: Scheduled jobs are typically for compliance, auditing, and governance purposes that require complete coverage. Interactive sessions can use Quick Mode for development and testing.

**To customize job behavior**: Modify Cell 3 to adjust the Full Mode configuration when `is_job_mode = True`
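
The precedence described above (job mode first, then Quick, then Full, then Custom) can be sketched as a small pure function. This is a minimal illustration, not the notebook's code: `AuditLimits` and `resolve_mode` are hypothetical names, and the limit values are the ones documented for each preset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditLimits:
    mode: str
    max_resources_per_type: int
    max_workspace_objects: int


def resolve_mode(is_job_mode, use_quick=False, use_full=False,
                 custom_resources=100, custom_objects=500):
    """Return the limits that apply to a run (hypothetical helper)."""
    if is_job_mode:
        # Scheduled jobs always get Full Mode, whatever preset was selected
        return AuditLimits("full", 999, 2000)
    if use_quick:
        return AuditLimits("quick", 100, 100)
    if use_full:
        return AuditLimits("full", 999, 2000)
    # No preset selected: fall back to the Cell 2 values
    return AuditLimits("custom", custom_resources, custom_objects)


print(resolve_mode(is_job_mode=True, use_quick=True))
print(resolve_mode(is_job_mode=False, use_quick=True))
```

Keeping the decision in one function makes the override easy to test without a Databricks runtime.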

---

## What This Code Does

### 1. **Identity Management**
* Extracts all **users** with display names and active status
* Extracts all **groups** with membership information
* Extracts all **service principals** with their group associations
* Maps **user-to-group relationships**
* Identifies **workspace admins** and **account admins**

### 2. **Workspace Resource Permissions** (Complete Coverage)
* **Jobs**: Workflow job permissions (CAN_VIEW, CAN_MANAGE_RUN, IS_OWNER, CAN_MANAGE)
* **SQL Warehouses**: Compute warehouse permissions (CAN_VIEW, CAN_MONITOR, CAN_USE, CAN_MANAGE)
* **Clusters**: Interactive cluster permissions (excludes ephemeral job clusters)
* **Pipelines**: Delta Live Tables pipeline permissions (CAN_VIEW, CAN_RUN, CAN_MANAGE)
* **Workspace Objects**: Folder and notebook permissions (CAN_READ, CAN_RUN, CAN_EDIT, CAN_MANAGE)
* **Repos**: Git integration permissions (CAN_READ, CAN_RUN, CAN_EDIT, CAN_MANAGE)
* **Instance Pools**: Compute pool permissions (CAN_ATTACH_TO, CAN_MANAGE)
* **Model Registry**: ML model permissions (CAN_READ, CAN_EDIT, CAN_MANAGE_STAGING, CAN_MANAGE_PRODUCTION)
* **SQL Dashboards**: Legacy dashboard permissions
* **SQL Queries**: Saved query permissions

### 3. **Unity Catalog Governance**
* **Catalogs**: All catalogs with owners and types
* **Schemas**: Schemas within catalogs (sampled)
* **Volumes**: External storage volumes
* **Grants**: Privileges on catalogs, schemas, and volumes (SELECT, MODIFY, USE CATALOG, CREATE SCHEMA, READ_VOLUME, WRITE_VOLUME, etc.)

### 4. **Secrets Management**
* **Secret Scopes**: All secret scopes (Databricks-backed or Azure Key Vault-backed)
* **Secret ACLs**: Who has READ, WRITE, or MANAGE access to each scope

### 5. **Network Security**
* **IP Access Lists**: Allow/block lists for workspace access
* **Workspace Settings**: Token creation and IP restriction status

### 6. **Token Management & Audit**
* **Active Tokens**: Personal access tokens and service principal tokens
* **Token Expiration**: Identifies tokens without expiration dates
* **Token Ownership**: Who created each token
* **Security Alerts**: Flags tokens without expiry
* **Token Age Analysis**: Identifies old tokens requiring rotation
* **Service Principal Tokens**: Dedicated audit for non-human identities
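
As a rough sketch of the token checks listed above, the function below classifies a single token record. It assumes each token is a plain dict with `creation_time` and `expiry_time` in epoch milliseconds, and that a missing or negative `expiry_time` means the token never expires; the 90-day rotation threshold and the finding labels are hypothetical, not values taken from the notebook.

```python
MS_PER_DAY = 86_400_000
MAX_TOKEN_AGE_DAYS = 90  # hypothetical rotation threshold


def audit_token(token, now_ms):
    """Return a list of findings for one token record (illustrative only)."""
    findings = []
    expiry = token.get("expiry_time")
    if expiry is None or expiry < 0:
        # Assumption: a missing or negative expiry means "never expires"
        findings.append("NO_EXPIRY")
    elif expiry <= now_ms:
        findings.append("EXPIRED")
    age_days = (now_ms - token["creation_time"]) / MS_PER_DAY
    if age_days > MAX_TOKEN_AGE_DAYS:
        findings.append("ROTATION_DUE")
    return findings


now_ms = 1_800_000_000_000
old_token = {"creation_time": now_ms - 200 * MS_PER_DAY, "expiry_time": -1}
fresh_token = {"creation_time": now_ms - 10 * MS_PER_DAY,
               "expiry_time": now_ms + 30 * MS_PER_DAY}
print(audit_token(old_token, now_ms))
print(audit_token(fresh_token, now_ms))
```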

### 7. **Permission Analysis**
* **Direct vs Inherited**: Distinguishes between permissions assigned directly to users vs inherited from groups
* **User-Group Associations**: Maintains group membership context throughout all dataframes
* **Permission Reference**: Comprehensive definitions of all permission levels from Databricks documentation
* **Permission Concentration**: Identifies users with excessive permissions
* **Cross-Resource Analysis**: Detects broad access patterns
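
One way to separate direct from inherited permissions is to expand group grants through the user-to-group mapping. The sketch below uses pandas with made-up sample rows; the column names (`principal`, `principal_type`, `user_name`, `group_name`) are assumptions for illustration and may differ from the notebook's DataFrames.

```python
import pandas as pd

permissions = pd.DataFrame([
    {"resource_name": "nightly_etl", "principal": "alice@example.com",
     "principal_type": "user", "permission_level": "CAN_MANAGE"},
    {"resource_name": "nightly_etl", "principal": "data-engineers",
     "principal_type": "group", "permission_level": "CAN_VIEW"},
])
user_groups = pd.DataFrame([
    {"user_name": "alice@example.com", "group_name": "data-engineers"},
    {"user_name": "bob@example.com", "group_name": "data-engineers"},
])

# Direct grants: the principal is the user
direct = permissions[permissions["principal_type"] == "user"].assign(
    user_name=lambda d: d["principal"], source="direct", via_group="")

# Inherited grants: expand each group grant to every member of that group
inherited = (
    permissions[permissions["principal_type"] == "group"]
    .merge(user_groups, left_on="principal", right_on="group_name")
    .assign(source="inherited", via_group=lambda d: d["group_name"])
)

cols = ["resource_name", "user_name", "permission_level", "source", "via_group"]
effective = pd.concat([direct[cols], inherited[cols]], ignore_index=True)
print(effective)
```

Nested groups are not expanded here; handling them would require resolving group-in-group membership first.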

### 8. **Compliance & Security Reporting**
* **Inactive Users**: Identifies inactive users with active permissions
* **External Users**: Flags non-company domain users with access
* **Over-Privileged Users**: Identifies users with excessive permissions
* **Segregation of Duties**: Detects potential SOD violations
* **Orphaned Permissions**: Flags permissions for deleted users/groups
* **Admin Identification**: Lists all workspace and account admins
* **Token Security**: Analyzes token usage and expiration
* **Security Configuration**: Workspace security settings audit
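
The inactive-user and external-user checks above can be approximated by joining permissions to the user list. This is a sketch under assumptions: `COMPANY_DOMAIN`, the column names, and the sample rows are all hypothetical.

```python
import pandas as pd

COMPANY_DOMAIN = "example.com"  # hypothetical company domain

users = pd.DataFrame([
    {"user_name": "alice@example.com", "active": True},
    {"user_name": "bob@example.com", "active": False},
    {"user_name": "carol@partner.io", "active": True},
])
permissions = pd.DataFrame([
    {"principal": "bob@example.com", "resource_name": "finance_wh",
     "permission_level": "CAN_USE"},
    {"principal": "carol@partner.io", "resource_name": "nightly_etl",
     "permission_level": "CAN_VIEW"},
])

# A user is external when the address does not end with the company domain
users["is_external"] = ~users["user_name"].str.lower().str.endswith("@" + COMPANY_DOMAIN)

joined = permissions.merge(users, left_on="principal", right_on="user_name", how="inner")
inactive_with_access = joined[~joined["active"]]
external_with_access = joined[joined["is_external"]]

print(inactive_with_access[["principal", "resource_name", "permission_level"]])
print(external_with_access[["principal", "resource_name", "permission_level"]])
```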

### 9. **Change Detection & Tracking**
* **Incremental Changes**: Compares current run with previous audit
* **New Permissions**: Identifies permissions added since last run
* **Removed Permissions**: Identifies permissions removed since last run
* **Change Summary**: Quantifies permission changes over time
* **Trend Analysis**: Historical permission evolution
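
Comparing two audit runs amounts to a set difference on a permission key. The sketch below does this in pandas with an outer merge and `indicator=True`; the key columns and the in-memory sample data are assumptions, and the notebook itself may perform the comparison against its Delta history instead.

```python
import pandas as pd

key_cols = ["resource_type", "resource_id", "principal", "permission_level"]

previous = pd.DataFrame([
    {"resource_type": "job", "resource_id": "1", "principal": "alice@example.com",
     "permission_level": "CAN_MANAGE"},
    {"resource_type": "job", "resource_id": "1", "principal": "bob@example.com",
     "permission_level": "CAN_VIEW"},
])
current = pd.DataFrame([
    {"resource_type": "job", "resource_id": "1", "principal": "alice@example.com",
     "permission_level": "CAN_MANAGE"},
    {"resource_type": "job", "resource_id": "1", "principal": "carol@example.com",
     "permission_level": "CAN_MANAGE_RUN"},
])

# Outer merge on the full key; the _merge column records which side each row came from
diff = previous[key_cols].drop_duplicates().merge(
    current[key_cols].drop_duplicates(), on=key_cols, how="outer", indicator=True)

added = diff[diff["_merge"] == "right_only"].drop(columns="_merge")
removed = diff[diff["_merge"] == "left_only"].drop(columns="_merge")

print(f"added={len(added)}, removed={len(removed)}")
```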

### 10. **Long-Term Retention**
* **Delta Table Export**: Historical accumulation of all audit runs
* **Snapshot Tables**: Current state snapshots for all security entities (15+ tables)
* **Change History**: Tracks permission changes over time
* **Historical Queries**: Query permission changes and trends

### 11. **Data Quality & Validation**
* **Duplicate Detection**: Identifies duplicate permission entries
* **Orphaned Permissions**: Flags permissions for non-existent users/groups
* **Empty DataFrame Checks**: Validates data before export
* **Security Alerts**: Identifies users with excessive permissions
* **Risk Analysis**: Comprehensive security risk assessment
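
Duplicate and orphaned-permission checks can be expressed in a few lines of pandas. The example below is illustrative only: the key columns, the `known_principals` set, and the sample rows are assumptions rather than the notebook's actual structures.

```python
import pandas as pd

key_cols = ["resource_id", "principal", "permission_level"]

permissions = pd.DataFrame([
    {"resource_id": "1", "principal": "alice@example.com", "permission_level": "CAN_MANAGE"},
    {"resource_id": "1", "principal": "alice@example.com", "permission_level": "CAN_MANAGE"},
    {"resource_id": "1", "principal": "data-engineers", "permission_level": "CAN_VIEW"},
    {"resource_id": "2", "principal": "dave@example.com", "permission_level": "CAN_VIEW"},
])
# Principals that still exist in the workspace (users, groups, service principals)
known_principals = {"alice@example.com", "data-engineers"}

# Rows that repeat an earlier row on the full key
duplicates = permissions[permissions.duplicated(subset=key_cols, keep="first")]
# Rows whose principal no longer exists
orphaned = permissions[~permissions["principal"].isin(known_principals)]
deduplicated = permissions.drop_duplicates(subset=key_cols)

print(f"duplicates={len(duplicates)}, orphaned={len(orphaned)}, clean={len(deduplicated)}")
```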

### 12. **Executive Reporting**
* **Risk Scoring**: Automated 0-100 risk score calculation
* **Executive Summary**: Key metrics dashboard for leadership
* **Security Alerts Summary**: Prioritized list of security concerns
* **Compliance Metrics**: SOX, GDPR, SOD violation tracking

---

## Key Features

✓ **Automatic Job Mode**: Jobs always run Full Mode for comprehensive audits  
✓ **Performance Presets**: Quick Mode (5-10 min) or Full Mode (20-60 min)  
✓ **Selective Collection**: Enable/disable individual resource types  
✓ **Complete Coverage**: All workspace resources, Unity Catalog, and security settings  
✓ **Parallelized API Calls**: Uses ThreadPoolExecutor with configurable workers  
✓ **Retry Logic**: Automatic retry with exponential backoff for transient API failures  
✓ **Progress Tracking**: Real-time progress updates during long-running operations  
✓ **Execution Time Tracking**: Per-cell execution time monitoring  
✓ **Configuration Validation**: Validates all configuration parameters before execution  
✓ **Error Handling**: Comprehensive error handling with detailed logging  
✓ **Data Quality Checks**: Validates data integrity and identifies issues  
✓ **Compliance Reporting**: Pre-built SOX and security compliance reports  
✓ **Change Detection**: Tracks permission changes between audit runs  
✓ **Risk Analysis**: Identifies over-privileged users and security risks  
✓ **Multiple Export Formats**: Excel, CSV, JSON, HTML, and Delta tables  
✓ **Memory Optimization**: Monitors memory usage and provides warnings  
✓ **Configurable Limits**: Fine-tune performance with multiple limit settings  
✓ **Execution Statistics**: Tracks API calls, failures, retries, success rates, skipped resources  
✓ **DataFrame Caching**: Performance optimization for frequently accessed data  
✓ **Token Expiration Audit**: Critical security check for token lifecycle  
✓ **Inactive User Detection**: Compliance requirement for access reviews  
✓ **External User Analysis**: Identifies non-company domain users  
✓ **Risk Scoring**: Automated 0-100 risk assessment  
✓ **Executive Dashboard**: Summary metrics for leadership visibility  
✓ **Timezone Configuration**: All timestamps in Eastern Time (configurable)  
✓ **Serverless Optimization**: Automatic compute type detection and optimization  
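
The combination of parallel calls and retry with exponential backoff can be sketched independently of the Databricks SDK. In the example below, `call_with_backoff` and `fetch_all` are hypothetical helpers, and `fn` stands in for whatever API call fetches permissions for one resource; the injectable `sleep` argument exists only to make the behavior easy to test.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_with_backoff(fn, *args, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: let the caller record the failure
            sleep(base_delay * (2 ** attempt))


def fetch_all(fn, resource_ids, max_workers=10, **retry_kwargs):
    """Run fn for every resource id in parallel; return (results, failures)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(call_with_backoff, fn, rid, **retry_kwargs): rid
                   for rid in resource_ids}
        for future in as_completed(futures):
            rid = futures[future]
            try:
                results[rid] = future.result()
            except Exception as exc:
                failures[rid] = str(exc)
    return results, failures


def fake_permissions_call(resource_id):
    # Stand-in for a real API call
    return [f"{resource_id}:CAN_VIEW"]


results, failures = fetch_all(fake_permissions_call, ["job-1", "job-2"], max_workers=2)
print(results, failures)
```

Collecting failures separately, rather than returning an empty list for them, keeps a failed lookup distinguishable from a resource that has no permissions.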

---

## Configuration

### Performance Settings (Cell 2):
* `MAX_RESOURCES_PER_TYPE = 100` - Limit resources checked per type (999 = no limit)
* `MAX_WORKERS = 10` - Parallel API calls (1-50)
* `MAX_RETRIES = 3` - Retries for failed API calls
* `RETRY_DELAY = 2` - Base delay in seconds before the first retry (later retries wait longer)
* `MAX_WORKSPACE_OBJECTS = 500` - Limit workspace objects scanned
* `MAX_SCHEMAS_PER_CATALOG = 10` - Limit schemas per catalog for volumes
* `TIMEZONE = 'America/New_York'` - Timezone for all timestamp displays

**Note**: In job mode, `MAX_RESOURCES_PER_TYPE` and `MAX_WORKSPACE_OBJECTS` are automatically raised to their Full Mode values (999 and 2000) for complete coverage.
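
A configuration check for the settings above might look like the following sketch. Only the 1-50 range for `MAX_WORKERS` comes from this document; the other bounds, and the `validate_config` helper itself, are assumptions made for illustration.

```python
def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means the config looks valid."""
    errors = []
    if not 1 <= cfg["MAX_WORKERS"] <= 50:
        errors.append("MAX_WORKERS must be between 1 and 50")
    # Assumed bound: every limit must allow at least one item
    for name in ("MAX_RESOURCES_PER_TYPE", "MAX_WORKSPACE_OBJECTS", "MAX_SCHEMAS_PER_CATALOG"):
        if cfg[name] < 1:
            errors.append(f"{name} must be at least 1")
    # Assumed bound: retry settings cannot be negative
    for name in ("MAX_RETRIES", "RETRY_DELAY"):
        if cfg[name] < 0:
            errors.append(f"{name} must not be negative")
    return errors


default_config = {
    "MAX_RESOURCES_PER_TYPE": 100,
    "MAX_WORKERS": 10,
    "MAX_RETRIES": 3,
    "RETRY_DELAY": 2,
    "MAX_WORKSPACE_OBJECTS": 500,
    "MAX_SCHEMAS_PER_CATALOG": 10,
}
print(validate_config(default_config))
print(validate_config({**default_config, "MAX_WORKERS": 0}))
```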

### Resource Type Selection (Cell 2):
**Core Resources:**
* `ENABLE_JOBS = True`
* `ENABLE_WAREHOUSES = True`
* `ENABLE_CLUSTERS = True`
* `ENABLE_PIPELINES = True`

**Extended Resources:**
* `ENABLE_WORKSPACE_OBJECTS = True` - Folders/notebooks (can be slow)
* `ENABLE_MODEL_REGISTRY = True`
* `ENABLE_REPOS = True`
* `ENABLE_INSTANCE_POOLS = True`
* `ENABLE_SQL_ASSETS = True` - Dashboards and queries
* `ENABLE_VOLUMES = True` - Unity Catalog volumes

**Security & Compliance:**
* `ENABLE_SERVICE_PRINCIPALS = True`
* `ENABLE_SECRET_SCOPES = True`
* `ENABLE_IP_ACCESS_LISTS = True`
* `ENABLE_TOKEN_AUDIT = True`
* `ENABLE_UC_PERMISSIONS = True`

**Note**: In job mode, ALL resource types are automatically enabled regardless of these flags.

### Export Format Settings:
* `ENABLE_EXCEL_EXPORT = True` - Excel workbook generation
* `ENABLE_CSV_EXPORT = False` - CSV file generation (set in Cell 35)
* `ENABLE_JSON_EXPORT = False` - JSON file generation (set in Cell 36)
* `ENABLE_HTML_REPORT = False` - HTML summary report (set in Cell 42)
* `ENABLE_DELTA_EXPORT = False` - Delta table long-term retention

**Recommendation for Jobs**: Set `ENABLE_DELTA_EXPORT = True` and `ENABLE_EXCEL_EXPORT = False` for optimal job performance.

### Delta Export Settings (Cell 2):
* `DELTA_TABLE_NAME = "main.default.security_audit_history"`
  * Format: `"catalog.schema.table"`
  * Requires CREATE TABLE permissions
  * Creates 15+ snapshot tables automatically
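
Checking the three-part name before writing can catch a misconfigured `DELTA_TABLE_NAME` early. The helper below is a hypothetical sketch: it accepts only plain identifiers (letters, digits, underscores), which is a deliberately conservative assumption and would reject names that need backtick quoting.

```python
import re

_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_THREE_PART = re.compile(rf"^({_IDENT})\.({_IDENT})\.({_IDENT})$")


def split_table_name(full_name):
    """Split 'catalog.schema.table' into its three parts or raise ValueError."""
    match = _THREE_PART.match(full_name)
    if match is None:
        raise ValueError(f"Expected 'catalog.schema.table', got {full_name!r}")
    return match.groups()


catalog, schema, table = split_table_name("main.default.security_audit_history")
print(catalog, schema, table)
```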

### Job Parameters (Optional):
* `max_resources_per_type` - Ignored in job mode (always uses 999)
* `export_path` - Custom export path

---

## Performance Optimization Tips

### For Faster Execution (Interactive Only):
1. **Use Quick Mode** (Cell 3): Uncomment `USE_QUICK_MODE = True`
2. **Disable slow resources**: Set `ENABLE_WORKSPACE_OBJECTS = False`
3. **Reduce limits**: Set `MAX_RESOURCES_PER_TYPE = 50`
4. **Skip clusters**: Set `ENABLE_CLUSTERS = False` (if you have many)
5. **Disable exports**: Set `ENABLE_EXCEL_EXPORT = False` for Delta-only

### For Complete Coverage:
1. **Use Full Mode** (Cell 3): Uncomment `USE_FULL_MODE = True`
2. **Enable all resources**: All ENABLE_* flags = True
3. **Remove limits**: Set `MAX_RESOURCES_PER_TYPE = 999`
4. **Increase workers**: Set `MAX_WORKERS = 20` (if cluster can handle it)
5. **Enable Delta export**: For change tracking over time

### For Scheduled Jobs:
* **No configuration needed** - Jobs automatically run in Full Mode
* **Recommended**: Set `ENABLE_DELTA_EXPORT = True` in Cell 2
* **Recommended**: Set `ENABLE_EXCEL_EXPORT = False` in Cell 2 (faster)
* Jobs will always perform comprehensive audits with all resources

### Memory Management:
* Monitor memory warnings in Cell 24
* If memory issues occur, reduce `MAX_WORKSPACE_OBJECTS`
* Disable `ENABLE_WORKSPACE_OBJECTS` if not needed (interactive only)
* Run on larger cluster if needed

---

## Execution Guide

### Quick Start (5-10 minutes) - Interactive Only:
1. **Cell 3**: Uncomment `USE_QUICK_MODE = True`
2. **Run All**: Execute all cells
3. **Review**: Check Cell 24 for execution summary
4. **Export**: Excel file ready in `/dbfs/tmp/permissions_export/`

### Full Audit (20-60 minutes) - Interactive or Job:
1. **Cell 3**: Uncomment `USE_FULL_MODE = True` (or run as job)
2. **Cell 2**: Set `ENABLE_DELTA_EXPORT = True` for change tracking
3. **Run All**: Execute all cells (monitor progress)
4. **Review**: Check Cells 24, 26, 27, 28 for comprehensive analysis
5. **Export**: Multiple formats available

### Custom Execution - Interactive Only:
1. **Cell 3**: Use `USE_CUSTOM_MODE = True` (default)
2. **Cell 2**: Configure individual ENABLE_* flags
3. **Run selectively**: Execute only cells for enabled resources
4. **Monitor**: Watch execution times and adjust as needed

### Running as a Scheduled Job:

**Setup:**
1. Create a new job in Databricks
2. Add this notebook as a task
3. **No preset configuration needed** - Job mode automatically uses Full Mode
4. (Optional) Add job parameters:
   - `export_path`: "/dbfs/mnt/security-exports" for custom location
5. Configure schedule (daily, weekly, monthly)
6. Set up notifications for job success/failure
7. **Recommended in Cell 2**:
   - Set `ENABLE_DELTA_EXPORT = True` for historical tracking
   - Set `ENABLE_EXCEL_EXPORT = False` for faster execution

**Automatic Job Behavior:**
* 🤖 **Forces Full Mode** - Complete in-depth review every time
* 🔄 **All resources enabled** - Ignores ENABLE_* flags
* ♾️ **Maximum limits** - MAX_RESOURCES_PER_TYPE = 999 (no limit), MAX_WORKSPACE_OBJECTS = 2000
* 📊 **Complete coverage** - All 15 `ENABLE_*` resource types scanned
* ⏱️ **Expected time**: 20-60 minutes depending on workspace size
* 💾 **Recommended**: Enable Delta export for change tracking

**Why Full Mode for Jobs?**
* Scheduled audits are for compliance and governance
* Requires complete, comprehensive coverage
* Change detection needs consistent full scans
* Historical trends require complete data
* Thoroughness does not depend on manual configuration

**Monitoring:**
* Check job run output for JSON summary with execution stats
* Monitor export location for generated files
* Query Delta tables for historical trends and changes
* Review API failure rates and retry counts
* Check compliance alerts in execution summary
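
A job completion summary of this kind could be assembled from the `execution_stats` dictionary roughly as follows. The field names and the `build_job_summary` helper are assumptions, not the notebook's actual output schema; in a Databricks job, the resulting string could be handed back to the orchestrator with `dbutils.notebook.exit(summary_json)`.

```python
import json
import time


def build_job_summary(stats, status="SUCCESS", now=None):
    """Serialize execution statistics to a JSON string (hypothetical schema)."""
    now = time.time() if now is None else now
    calls = stats.get("api_calls", 0)
    failures = stats.get("api_failures", 0)
    summary = {
        "status": status,
        "elapsed_seconds": round(now - stats["start_time"], 2),
        "api_calls": calls,
        "api_failures": failures,
        "api_retries": stats.get("api_retries", 0),
        # Same formula as the execution summary: successful calls over total calls
        "success_rate_pct": round((calls - failures) / calls * 100, 1) if calls else None,
    }
    return json.dumps(summary, sort_keys=True)


sample_stats = {"start_time": 100.0, "api_calls": 200, "api_failures": 5, "api_retries": 7}
summary_json = build_job_summary(sample_stats, now=160.0)
print(summary_json)
```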

---

## Important Notes

⚠ **Job Mode**: Always runs Full Mode (20-60 min) for comprehensive audits  
⚠ **Performance**: Quick Mode (5-10 min) only available in interactive sessions  
⚠ **Workspace Objects**: Most resource-intensive - automatically included in job mode  
⚠ **Cluster Permissions**: Filtered to interactive clusters only  
⚠ **Memory Usage**: Monitor warnings in Cell 24, jobs use larger limits  
⚠ **Permissions Required**: Requires admin access to view all permissions  
⚠ **Data Privacy**: Export files contain sensitive security information  
⚠ **API Rate Limits**: Retry logic handles transient failures automatically  
⚠ **Token Access**: Token data requires specific admin permissions  
⚠ **Delta Tables**: Requires CREATE TABLE permissions  

---

## Version Control

| Version | Date | Author | Changes |
|---------|------|--------|----------|
| 1.0 | 2026-02-09 | Brandon Croom | Initial creation of comprehensive security review script; Added users, groups, and workspace resource permissions (jobs, warehouses, pipelines); Implemented parallelized API calls for performance optimization; Added user-group association tracking throughout all dataframes; Created flattened exports for Excel compatibility; Added Eastern Time timezone conversion for all timestamps; Implemented MAX_RESOURCES_PER_TYPE=999 flag for unlimited data pull; Added service principals extraction; Added secret scopes and ACLs; Added Unity Catalog permissions (catalogs, schemas, grants); Added IP access lists and workspace settings; Added permission level reference documentation from Databricks docs; Optimized cluster permissions to filter interactive clusters only; Created comprehensive security workbook with 17+ sheets |
| 1.1 | 2026-02-10 | Brandon Croom | Added automatic job mode detection to support scheduled job execution; Implemented conditional print statements and display() calls (enabled in interactive mode, suppressed in job mode); Job mode detection added at notebook start for global availability; Added configurable job parameters (max_resources_per_type, export_path); Added job completion summary with JSON output for orchestration; Updated export paths to use configurable EXPORT_PATH variable; Added proper error handling with exceptions raised in job mode for alerting; Optimized to skip unnecessary operations in job mode |
| 1.2 | 2026-02-11 | Brandon Croom | Major feature expansion and optimization: Added Delta table export with historical accumulation and 15+ snapshot tables; Implemented log() function for consistent message handling; Added ENABLE_EXCEL_EXPORT configuration; Added retry logic with exponential backoff; Implemented progress tracking and execution time tracking per cell; Added configuration validation and data quality checks (duplicate detection, orphaned permissions); Added execution statistics tracking; Enhanced error handling and export path validation; Added complete workspace coverage (workspace folders/notebooks, model registry, repos, instance pools, SQL dashboards/queries, Unity Catalog volumes); Added token management audit; Implemented workspace admin identification; Added comprehensive compliance reporting (inactive users, external users, over-privileged users, segregation of duties); Implemented incremental change detection comparing with previous runs; Added permission recommendations and risk analysis; Added CSV, JSON, and HTML export format options; Implemented performance presets (Quick Mode, Full Mode, Custom Mode); Added selective resource collection with 14 ENABLE_* flags; Added performance limits (MAX_WORKSPACE_OBJECTS, MAX_SCHEMAS_PER_CATALOG); Optimized workspace object scanning with parallelization; Parallelized SQL asset permission fetching; Added memory usage monitoring; Implemented automatic Full Mode for job execution; Fixed is_job_mode dependency issue |
| 1.3 | 2026-02-12 | Brandon Croom | Enhanced security analysis and performance: Added DataFrame caching for users, groups, and user_groups (20-30% performance improvement); Implemented workspace security configuration checks (10 security flags with risk assessment); Added security configuration recommendations with priority levels; Implemented comprehensive token expiration audit (tokens without expiry, expiring soon, age analysis); Added inactive user permissions analysis for compliance (SOX, GDPR); Implemented permission concentration analysis (excessive admin permissions detection, distribution statistics); Added external user detection and analysis (configurable company domain); Implemented cross-resource permission analysis (broad access detection, SOD violations); Added service principal token audit (expiration and rotation analysis); Created executive summary dashboard with risk scoring (0-100 scale); Prepared summary data for Excel export (executive summary, security alerts, top users, execution metadata); Added 8 new analysis sections for comprehensive security posture assessment; Added timezone configuration with TIMEZONE constant set to 'America/New_York' (Eastern Time); Added get_current_time_in_timezone() helper function; All timestamps displayed in configured timezone; Serverless vs traditional cluster detection with compute-aware optimizations; Conditional DataFrame caching based on compute type; Generic configuration with no company-specific references |

---

## Support

For questions or issues:
* **Performance**: Use Quick Mode for faster execution (interactive only)
* **Job Mode**: Jobs automatically run Full Mode - no configuration needed
* **Memory**: Monitor Cell 24 warnings, reduce MAX_WORKSPACE_OBJECTS if needed
* Review execution summary in Cell 24 for data quality issues
* Review compliance report in Cell 26 for security alerts
* For job execution issues, check JSON summary with execution stats
* For Delta table issues, verify catalog/schema permissions
* Contact your Databricks administrator for permission issues

In [0]:
from databricks.sdk import WorkspaceClient
import pandas as pd
from pyspark.sql import functions as F
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import os
import re
from datetime import datetime

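# Detect if running in job mode or interactive mode (MUST BE FIRST)
try:
    # currentRunId() is an Option that is expected to be defined only for job runs,
    # so use the result of isDefined() rather than the absence of an exception
    is_job_mode = bool(
        dbutils.notebook.entry_point.getDbutils().notebook().getContext().currentRunId().isDefined()
    )
except Exception:
    # Notebook context not available: assume an interactive session
    is_job_mode = False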

# Detect if running on serverless compute (most reliable method: try caching)
try:
    test_df = spark.range(1)
    test_df.cache()
    test_df.count()
    test_df.unpersist()
    is_serverless = False
except Exception as e:
    # If caching fails with serverless error, we're on serverless
    is_serverless = 'SERVERLESS' in str(e) or 'PERSIST TABLE is not supported' in str(e)

# Initialize Workspace Client
wc = WorkspaceClient()

# ============================================================================
# PERFORMANCE CONFIGURATION
# ============================================================================
MAX_RESOURCES_PER_TYPE = 100  # Limit resources checked per type (set to 999 for no limit)
MAX_WORKERS = 10  # Parallel API calls (1-50)
MAX_RETRIES = 3  # Number of retries for failed API calls
RETRY_DELAY = 2  # Seconds between retries
MAX_WORKSPACE_OBJECTS = 500  # Limit workspace objects scanned
MAX_SCHEMAS_PER_CATALOG = 10  # Limit schemas per catalog for volumes
TIMEZONE = 'America/New_York'  # Timezone for all timestamp displays

# ============================================================================
# RESOURCE TYPE SELECTION
# ============================================================================
ENABLE_JOBS = True
ENABLE_WAREHOUSES = True
ENABLE_CLUSTERS = True
ENABLE_PIPELINES = True
ENABLE_WORKSPACE_OBJECTS = True
ENABLE_MODEL_REGISTRY = True
ENABLE_REPOS = True
ENABLE_INSTANCE_POOLS = True
ENABLE_SQL_ASSETS = True
ENABLE_VOLUMES = True
ENABLE_SERVICE_PRINCIPALS = True
ENABLE_SECRET_SCOPES = True
ENABLE_IP_ACCESS_LISTS = True
ENABLE_TOKEN_AUDIT = True
ENABLE_UC_PERMISSIONS = True

# ============================================================================
# EXPORT SETTINGS
# ============================================================================
ENABLE_EXCEL_EXPORT = True
ENABLE_DELTA_EXPORT = False
EXPORT_PATH = '/dbfs/tmp/permissions_export'
DELTA_TABLE_NAME = 'main.default.security_audit_history'

# ============================================================================
# EXECUTION STATISTICS
# ============================================================================
execution_stats = {
    'start_time': time.time(),
    'api_calls': 0,
    'api_failures': 0,
    'api_retries': 0,
    'resources_checked': 0,
    'resources_skipped': 0,
    'permissions_collected': 0
}

# ============================================================================
# UTILITY FUNCTIONS
# ============================================================================

def log(message):
    """Print only in interactive mode"""
    if not is_job_mode:
        print(message)

def log_execution_time(operation, start_time):
    """Log execution time for an operation (printed in interactive mode only)"""
    elapsed = time.time() - start_time
    log(f"⏱️  {operation} completed in {elapsed:.2f} seconds")

def validate_dataframe_exists(name, df):
    """Validate that a DataFrame exists and has data"""
    try:
        if df is None:
            log(f"  ❌ {name}: Not created")
            return False
        count = df.count()
        if count == 0:
            log(f"  ⚠️  {name}: Empty (0 rows)")
            return False
        log(f"  ✓ {name}: {count} rows")
        return True
    except Exception as e:
        log(f"  ❌ {name}: Error - {str(e)[:100]}")
        return False

def print_execution_summary():
    """Print execution statistics summary"""
    elapsed = time.time() - execution_stats['start_time']
    
    log(f"\n{'='*60}")
    log("EXECUTION SUMMARY")
    log(f"{'='*60}")
    log(f"Total execution time: {elapsed:.2f} seconds ({elapsed/60:.1f} minutes)")
    log(f"API calls: {execution_stats['api_calls']}")
    log(f"API failures: {execution_stats['api_failures']}")
    log(f"API retries: {execution_stats['api_retries']}")
    log(f"Resources checked: {execution_stats['resources_checked']}")
    log(f"Resources skipped: {execution_stats['resources_skipped']}")
    log(f"Permissions collected: {execution_stats['permissions_collected']}")
    
    if execution_stats['api_calls'] > 0:
        success_rate = ((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls']) * 100
        log(f"Success rate: {success_rate:.1f}%")
    log(f"{'='*60}\n")

# ============================================================================
# PERMISSION FETCHING FUNCTION
# ============================================================================

def get_permissions(resource_type, resource_id, resource_name, api_path):
    """Fetch permissions for a resource with retry logic"""
    for attempt in range(MAX_RETRIES):
        try:
            execution_stats['api_calls'] += 1
            perms = wc.permissions.get(request_object_type=api_path, request_object_id=resource_id)
            
            results = []
            if perms.access_control_list:
                for acl in perms.access_control_list:
                    principal = acl.user_name or acl.group_name or acl.service_principal_name
                    principal_type = 'user' if acl.user_name else ('group' if acl.group_name else 'service_principal')
                    
                    for perm in (acl.all_permissions or []):  # all_permissions may be None
                        results.append({
                            'resource_type': resource_type,
                            'resource_id': resource_id,
                            'resource_name': resource_name,
                            'principal': principal,
                            'principal_type': principal_type,
                            'permission_level': perm.permission_level.value if perm.permission_level else 'UNKNOWN'
                        })
            
            execution_stats['resources_checked'] += 1
            return results
            
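        except Exception:
            if attempt < MAX_RETRIES - 1:
                # Transient failure: back off exponentially (RETRY_DELAY, 2x, 4x, ...) before retrying
                execution_stats['api_retries'] += 1
                time.sleep(RETRY_DELAY * (2 ** attempt))
            else:
                # Retries exhausted: record the failure and return no permissions for this resource
                execution_stats['api_failures'] += 1
                return []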
    
    return []

# ============================================================================
# INITIALIZATION
# ============================================================================

# Initialize permissions list
all_permissions = []

log("✓ Setup complete")
log(f"Configuration: MAX_RESOURCES_PER_TYPE={MAX_RESOURCES_PER_TYPE}, MAX_WORKERS={MAX_WORKERS}")
log(f"Retry settings: MAX_RETRIES={MAX_RETRIES}, RETRY_DELAY={RETRY_DELAY}s")
log(f"Workspace limits: MAX_WORKSPACE_OBJECTS={MAX_WORKSPACE_OBJECTS}")
log(f"Compute type: {'SERVERLESS' if is_serverless else 'TRADITIONAL CLUSTER'}")
log(f"Execution mode: {'JOB' if is_job_mode else 'INTERACTIVE'}")

if is_serverless:
    log("\n⚡ Serverless optimizations enabled:")
    log("  - DataFrame caching disabled (not supported)")
    log("  - Automatic memory management")
    log("  - Optimized for fast startup and scaling")
else:
    log("\n🔧 Traditional cluster optimizations enabled:")
    log("  - DataFrame caching available")
    log("  - Manual memory management")
    log("  - Persistent compute resources")

enabled_resources = []
if ENABLE_JOBS: enabled_resources.append('Jobs')
if ENABLE_WAREHOUSES: enabled_resources.append('Warehouses')
if ENABLE_CLUSTERS: enabled_resources.append('Clusters')
if ENABLE_PIPELINES: enabled_resources.append('Pipelines')
if ENABLE_WORKSPACE_OBJECTS: enabled_resources.append('Workspace Objects')
if ENABLE_MODEL_REGISTRY: enabled_resources.append('Models')
if ENABLE_REPOS: enabled_resources.append('Repos')
if ENABLE_INSTANCE_POOLS: enabled_resources.append('Instance Pools')
if ENABLE_SQL_ASSETS: enabled_resources.append('SQL Assets')
if ENABLE_VOLUMES: enabled_resources.append('Volumes')

log(f"\nEnabled resource types ({len(enabled_resources)}): {', '.join(enabled_resources)}")

if ENABLE_EXCEL_EXPORT:
    log("📊 Excel export enabled")
if ENABLE_DELTA_EXPORT:
    log(f"💾 Delta export enabled: {DELTA_TABLE_NAME}")

✓ Setup complete
Configuration: MAX_RESOURCES_PER_TYPE=100, MAX_WORKERS=10
Retry settings: MAX_RETRIES=3, RETRY_DELAY=2s
Workspace limits: MAX_WORKSPACE_OBJECTS=500
Compute type: SERVERLESS
Execution mode: INTERACTIVE
Timezone: America/New_York

⚡ Serverless optimizations enabled:
  - DataFrame caching disabled (not supported)
  - Automatic memory management
  - Optimized for fast startup and scaling

Enabled resource types (10): Jobs, Warehouses, Clusters, Pipelines, Workspace Objects, Models, Repos, Instance Pools, SQL Assets, Volumes
📊 Excel export enabled


In [0]:
# ============================================================================
# PERFORMANCE PRESETS: Choose your execution mode
# ============================================================================
# Uncomment ONE of the following presets, or customize individual flags in Cell 2

# PRESET 1: QUICK MODE (5-10 minutes) - Core permissions only
# Recommended for: Daily monitoring, quick audits, testing
# USE_QUICK_MODE = True

# PRESET 2: FULL MODE (20-60 minutes) - Complete coverage
# Recommended for: Compliance audits, quarterly reviews, comprehensive analysis
# USE_FULL_MODE = True

# PRESET 3: CUSTOM MODE - Use individual flags in Cell 2
USE_CUSTOM_MODE = True  # Default: use custom configuration from Cell 2

# ============================================================================
# AUTOMATIC JOB MODE OVERRIDE: Jobs always run in Full Mode
# ============================================================================
if is_job_mode:
    log("\n🤖 JOB MODE DETECTED - Forcing FULL MODE for comprehensive audit")
    log("="*60)
    USE_FULL_MODE = True
    USE_QUICK_MODE = False
    USE_CUSTOM_MODE = False
    log("Job mode automatically enables complete in-depth review")
    log("="*60)

# ============================================================================

# Apply preset configurations
if 'USE_QUICK_MODE' in dir() and USE_QUICK_MODE:
    log("\n🚀 QUICK MODE ENABLED - Core permissions only (5-10 minutes)")
    log("="*60)
    
    # Core resources only
    ENABLE_JOBS = True
    ENABLE_WAREHOUSES = True
    ENABLE_CLUSTERS = False  # Skip clusters for speed
    ENABLE_PIPELINES = True
    
    # Disable extended resources
    ENABLE_WORKSPACE_OBJECTS = False
    ENABLE_MODEL_REGISTRY = False
    ENABLE_REPOS = False
    ENABLE_INSTANCE_POOLS = False
    ENABLE_SQL_ASSETS = False
    ENABLE_VOLUMES = False
    
    # Keep security essentials
    ENABLE_SERVICE_PRINCIPALS = True
    ENABLE_SECRET_SCOPES = True
    ENABLE_IP_ACCESS_LISTS = True
    ENABLE_TOKEN_AUDIT = False
    ENABLE_UC_PERMISSIONS = True
    
    # Reduce limits
    MAX_RESOURCES_PER_TYPE = 100
    MAX_WORKSPACE_OBJECTS = 100
    
    log("Quick mode: Core resources only (Jobs, Warehouses, Pipelines, UC, Secrets)")
    log("Skipped: Clusters, Workspace Objects, Models, Repos, Pools, SQL Assets, Volumes, Tokens")
    log("="*60)

elif 'USE_FULL_MODE' in dir() and USE_FULL_MODE:
    log("\n🔍 FULL MODE ENABLED - Complete coverage (20-60 minutes)")
    log("="*60)
    
    # Enable everything
    ENABLE_JOBS = True
    ENABLE_WAREHOUSES = True
    ENABLE_CLUSTERS = True
    ENABLE_PIPELINES = True
    ENABLE_WORKSPACE_OBJECTS = True
    ENABLE_MODEL_REGISTRY = True
    ENABLE_REPOS = True
    ENABLE_INSTANCE_POOLS = True
    ENABLE_SQL_ASSETS = True
    ENABLE_VOLUMES = True
    ENABLE_SERVICE_PRINCIPALS = True
    ENABLE_SECRET_SCOPES = True
    ENABLE_IP_ACCESS_LISTS = True
    ENABLE_TOKEN_AUDIT = True
    ENABLE_UC_PERMISSIONS = True
    
    # Maximum limits
    MAX_RESOURCES_PER_TYPE = 999
    MAX_WORKSPACE_OBJECTS = 2000
    
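    if is_job_mode:
        log(f"Full mode: ALL resources enabled, no per-type limit; workspace objects capped at {MAX_WORKSPACE_OBJECTS} (JOB MODE - COMPREHENSIVE AUDIT)")
    else:
        log(f"Full mode: ALL resources enabled, no per-type limit; workspace objects capped at {MAX_WORKSPACE_OBJECTS}")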
    log("⚠ Warning: This may take 20-60 minutes depending on workspace size")
    log("="*60)

else:
    log("\n⚙️ CUSTOM MODE - Using configuration from Cell 2")
    log("="*60)


⚙️ CUSTOM MODE - Using configuration from Cell 2
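
The preset logic above can also be expressed as a lookup that returns overrides, which avoids the `'USE_..._MODE' in dir()` checks. The following is a minimal sketch, not the notebook's implementation: `resolve_preset`, `ALL_FLAGS`, and `QUICK_DISABLED` are hypothetical names, while the flag names and limit values are copied from the cell above.

```python
from typing import Any, Dict

# Flag names copied from the preset cell; the helper names are hypothetical.
ALL_FLAGS = [
    "ENABLE_JOBS", "ENABLE_WAREHOUSES", "ENABLE_CLUSTERS", "ENABLE_PIPELINES",
    "ENABLE_WORKSPACE_OBJECTS", "ENABLE_MODEL_REGISTRY", "ENABLE_REPOS",
    "ENABLE_INSTANCE_POOLS", "ENABLE_SQL_ASSETS", "ENABLE_VOLUMES",
    "ENABLE_SERVICE_PRINCIPALS", "ENABLE_SECRET_SCOPES",
    "ENABLE_IP_ACCESS_LISTS", "ENABLE_TOKEN_AUDIT", "ENABLE_UC_PERMISSIONS",
]

# Flags that Quick Mode switches off in the cell above.
QUICK_DISABLED = {
    "ENABLE_CLUSTERS", "ENABLE_WORKSPACE_OBJECTS", "ENABLE_MODEL_REGISTRY",
    "ENABLE_REPOS", "ENABLE_INSTANCE_POOLS", "ENABLE_SQL_ASSETS",
    "ENABLE_VOLUMES", "ENABLE_TOKEN_AUDIT",
}


def resolve_preset(mode: str, is_job_mode: bool = False) -> Dict[str, Any]:
    """Return overrides for a preset; an empty dict means 'keep Cell 2 values'."""
    if is_job_mode:
        mode = "full"  # jobs always run the full audit
    if mode == "quick":
        config: Dict[str, Any] = {flag: flag not in QUICK_DISABLED for flag in ALL_FLAGS}
        config.update(MAX_RESOURCES_PER_TYPE=100, MAX_WORKSPACE_OBJECTS=100)
        return config
    if mode == "full":
        config = {flag: True for flag in ALL_FLAGS}
        config.update(MAX_RESOURCES_PER_TYPE=999, MAX_WORKSPACE_OBJECTS=2000)
        return config
    if mode == "custom":
        return {}
    raise ValueError(f"Unknown preset: {mode!r}")
```

In a notebook, the result could be applied with `globals().update(resolve_preset("quick", is_job_mode))`, so that an unknown mode name fails loudly instead of silently falling through to Custom Mode.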


In [0]:
# Optional: Add job parameters for scheduled execution
# These can be configured when creating a job in the Databricks UI

try:
    # Get parameter for max resources (defaults to value in cell 2 if not provided)
    job_max_resources = dbutils.widgets.get("max_resources_per_type")
    if job_max_resources:
        MAX_RESOURCES_PER_TYPE = int(job_max_resources)
        log(f"Using job parameter: MAX_RESOURCES_PER_TYPE = {MAX_RESOURCES_PER_TYPE}")
except Exception:
    # Widget doesn't exist or the value is not an integer - use default from cell 2
    pass

try:
    # Get parameter for export path (defaults to /dbfs/tmp/permissions_export)
    job_export_path = dbutils.widgets.get("export_path")
    if job_export_path:
        EXPORT_PATH = job_export_path
        log(f"Using job parameter: EXPORT_PATH = {EXPORT_PATH}")
    else:
        EXPORT_PATH = '/dbfs/tmp/permissions_export'
except Exception:
    EXPORT_PATH = '/dbfs/tmp/permissions_export'

log(f"\nConfiguration:")
log(f"  MAX_RESOURCES_PER_TYPE: {MAX_RESOURCES_PER_TYPE}")
log(f"  EXPORT_PATH: {EXPORT_PATH}")

# Validate and create export directory if needed
if ENABLE_EXCEL_EXPORT:
    try:
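        # Create directory if it doesn't exist (convert only a leading /dbfs to the dbfs: scheme)
        dbfs_uri = 'dbfs:' + EXPORT_PATH[len('/dbfs'):] if EXPORT_PATH.startswith('/dbfs/') else EXPORT_PATH
        dbutils.fs.mkdirs(dbfs_uri)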
        log(f"\n✓ Export directory ready: {EXPORT_PATH}")
        
        # Check if writable
        if check_export_path_writable(EXPORT_PATH):
            log(f"  ✓ Export path is writable")
    except Exception as e:
        log(f"\n⚠️  Warning: Could not validate export path: {str(e)}")
        log(f"  Excel export may fail if path is not writable")


Configuration:
  MAX_RESOURCES_PER_TYPE: 100
  EXPORT_PATH: /dbfs/tmp/permissions_export

✓ Export directory ready: /dbfs/tmp/permissions_export
Wrote 4 bytes.
  ✓ Export path is writable
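
The parameter handling above relies on broad `try`/`except` blocks. A minimal sketch of the same idea with the fallbacks made explicit follows. `read_int_param` and `to_dbfs_uri` are hypothetical helpers, and the widget getter is passed in as a callable so the logic can be exercised outside Databricks; in the notebook it would be `dbutils.widgets.get`.

```python
from typing import Callable


def read_int_param(get_widget: Callable[[str], str], name: str, default: int) -> int:
    """Read an integer job parameter, falling back to `default` if absent or invalid."""
    try:
        raw = get_widget(name)
    except Exception:
        # Assumption: the getter raises when the widget is not defined.
        return default
    raw = (raw or "").strip()
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def to_dbfs_uri(path: str) -> str:
    """Convert a leading '/dbfs/' FUSE prefix to 'dbfs:/'; leave other paths unchanged."""
    if path.startswith("/dbfs/"):
        return "dbfs:" + path[len("/dbfs"):]
    return path
```

Restricting the conversion to the path prefix matters because a plain string replacement would also rewrite a `/dbfs` segment that appears later in the path.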


In [0]:
log("Fetching users and groups...")

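# Get all users (explicit schema so an empty result does not break schema inference)
users = list(wc.users.list())
users_df = spark.createDataFrame(
    [
        {'user_name': u.user_name, 'display_name': u.display_name, 'active': u.active}
        for u in users
    ],
    'user_name STRING, display_name STRING, active BOOLEAN'
)

# Get all groups
groups = list(wc.groups.list())
groups_df = spark.createDataFrame(
    [
        {'group_name': g.display_name, 'group_id': g.id}
        for g in groups
    ],
    'group_name STRING, group_id STRING'
)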

# Get group memberships
user_groups_data = []
for user in users:
    if user.groups:
        for group in user.groups:
            user_groups_data.append({
                'user_name': user.user_name,
                'group_name': group.display
            })

user_groups_df = spark.createDataFrame(user_groups_data) if user_groups_data else spark.createDataFrame([], 'user_name STRING, group_name STRING')

log(f"✓ Found {users_df.count()} users")
log(f"✓ Found {groups_df.count()} groups")
log(f"✓ Found {user_groups_df.count()} user-group memberships")

Fetching users and groups...
✓ Found 312 users
✓ Found 45 groups
✓ Found 925 user-group memberships
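
The membership flattening in the cell above can be isolated into a pure function, which makes it easy to check without a workspace. This is a sketch under stated assumptions: `UserStub` and `GroupRef` are hypothetical stand-ins for the SDK objects, modelling only the `user_name`, `groups`, and `display` attributes the cell reads.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class GroupRef:
    """Stand-in for a group reference on a user (only `display` is modelled)."""
    display: Optional[str]


@dataclass
class UserStub:
    """Stand-in for a user object (only the fields the cell reads)."""
    user_name: str
    groups: Optional[List[GroupRef]] = None


def flatten_memberships(users: Iterable[UserStub]) -> List[Dict[str, str]]:
    """Return one row per (user, group) pair, skipping groups without a display name."""
    rows: List[Dict[str, str]] = []
    for user in users:
        for group in user.groups or []:
            if group.display:
                rows.append({"user_name": user.user_name, "group_name": group.display})
    return rows
```

Skipping references without a display name is a design choice of this sketch; the original cell keeps them with a null `group_name`.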


In [0]:
# Optimize DataFrame reuse based on compute type
# Serverless: Automatic optimization (caching not supported)
# Traditional: Explicit caching for performance

log("Optimizing DataFrame reuse...")

if is_serverless:
    # Serverless compute: Just materialize with count (caching not supported)
    log("  Running on SERVERLESS compute - using automatic optimization")
    
    users_count = users_df.count()
    groups_count = groups_df.count()
    user_groups_count = user_groups_df.count()
    
    log(f"✓ Materialized users_df: {users_count} rows")
    log(f"✓ Materialized groups_df: {groups_count} rows")
    log(f"✓ Materialized user_groups_df: {user_groups_count} rows")
    log("  Serverless will automatically optimize DataFrame reuse")
else:
    # Traditional cluster: Use explicit caching
    log("  Running on TRADITIONAL cluster - applying explicit caching")
    
    users_df.cache()
    groups_df.cache()
    user_groups_df.cache()
    
    # Force materialization
    users_count = users_df.count()
    groups_count = groups_df.count()
    user_groups_count = user_groups_df.count()
    
    log(f"✓ Cached users_df: {users_count} rows")
    log(f"✓ Cached groups_df: {groups_count} rows")
    log(f"✓ Cached user_groups_df: {user_groups_count} rows")
    log("  DataFrames cached to avoid recomputation on subsequent joins")

Optimizing DataFrame reuse...
  Running on SERVERLESS compute - using automatic optimization
✓ Materialized users_df: 312 rows
✓ Materialized groups_df: 45 rows
✓ Materialized user_groups_df: 925 rows
  Serverless will automatically optimize DataFrame reuse


In [0]:
cell_start_time = time.time()
log("Fetching workspace security configuration settings...")

security_config_keys = {
    'enableTokensConfig': 'Personal access token creation enabled',
    'maxTokenLifetimeDays': 'Maximum token lifetime (days)',
    'enableIpAccessLists': 'IP access list restrictions enabled',
    'enableVerboseAuditLogs': 'Detailed audit logging enabled',
    'enforceUserIsolation': 'User isolation enforcement enabled',
    'enableDeprecatedClusterNamedInitScripts': 'Deprecated cluster init scripts allowed',
    'enableDeprecatedGlobalInitScripts': 'Deprecated global init scripts allowed',
    'enableWebTerminal': 'Web terminal access enabled',
    'enableDbfsFileBrowser': 'DBFS file browser enabled',
    'storeInteractiveNotebookResultsInCustomerAccount': 'Notebook results stored in customer account'
}

security_config_data = []

for key, description in security_config_keys.items():
    try:
        result = wc.workspace_conf.get_status(keys=key)
        if key in result:
            value = result[key]
            
            # Determine security impact
            high_impact_keys = ['enableIpAccessLists', 'enforceUserIsolation', 'enableVerboseAuditLogs']
            security_impact = 'High' if key in high_impact_keys else 'Medium'
            
            # Determine if this is a potential security risk
            risky_enabled = key in ['enableDeprecatedClusterNamedInitScripts', 
                                    'enableDeprecatedGlobalInitScripts'] and str(value).lower() == 'true'
            risky_disabled = key in ['enableIpAccessLists', 'enforceUserIsolation', 
                                     'enableVerboseAuditLogs'] and str(value).lower() != 'true'
            
            security_config_data.append({
                'config_key': key,
                'config_value': str(value) if value is not None else 'null',
                'enabled': str(value).lower() in ['true', '1', 'enabled'],
                'description': description,
                'security_impact': security_impact,
                'potential_risk': risky_enabled or risky_disabled
            })
    except Exception as e:
        if not is_job_mode:
            log(f"  ⚠️  Failed to check {key}: {str(e)[:100]}")

if security_config_data:
    security_config_df = spark.createDataFrame(security_config_data)
    
    log(f"✓ Found {len(security_config_data)} security configuration settings")
    log(f"  High-impact settings: {security_config_df.filter(F.col('security_impact') == 'High').count()}")
    log(f"  Enabled settings: {security_config_df.filter(F.col('enabled') == True).count()}")
    log(f"  Potential risks identified: {security_config_df.filter(F.col('potential_risk') == True).count()}")
    
    # Display ordered by security impact and risk
    display(security_config_df.orderBy(
        F.col('potential_risk').desc(), 
        F.col('security_impact').desc(), 
        F.col('config_key')
    ))
    
    # Add to all_permissions for export
    all_permissions.extend(security_config_data)
else:
    log("⚠️  No security configuration settings could be retrieved")
    security_config_df = spark.createDataFrame([], schema="config_key STRING, config_value STRING, enabled BOOLEAN, description STRING, security_impact STRING, potential_risk BOOLEAN")

log_execution_time("Get workspace security configuration settings", cell_start_time)

Fetching workspace security configuration settings...
✓ Found 10 security configuration settings
  High-impact settings: 3
  Enabled settings: 4
  Potential risks identified: 2


config_key,config_value,description,enabled,potential_risk,security_impact
enableIpAccessLists,,IP access list restrictions enabled,False,True,High
enforceUserIsolation,False,User isolation enforcement enabled,False,True,High
enableDbfsFileBrowser,True,DBFS file browser enabled,True,False,Medium
enableDeprecatedClusterNamedInitScripts,,Deprecated cluster init scripts allowed,False,False,Medium
enableDeprecatedGlobalInitScripts,,Deprecated global init scripts allowed,False,False,Medium
enableTokensConfig,True,Personal access token creation enabled,True,False,Medium
enableWebTerminal,True,Web terminal access enabled,True,False,Medium
maxTokenLifetimeDays,,Maximum token lifetime (days),False,False,Medium
storeInteractiveNotebookResultsInCustomerAccount,,Notebook results stored in customer account,False,False,Medium
enableVerboseAuditLogs,True,Detailed audit logging enabled,True,False,High


⏱️  Get workspace security configuration settings completed in 1.81 seconds
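
The per-setting classification in the cell above mixes API calls with decision logic. The decision logic alone can be sketched as a pure function. `classify_setting` and the three set names are hypothetical; the keys and rules are taken from the cell.

```python
from typing import Dict, Optional

# Key lists copied from the cell above; the constant names are hypothetical.
HIGH_IMPACT = {"enableIpAccessLists", "enforceUserIsolation", "enableVerboseAuditLogs"}
RISKY_WHEN_ENABLED = {
    "enableDeprecatedClusterNamedInitScripts",
    "enableDeprecatedGlobalInitScripts",
}
RISKY_WHEN_DISABLED = HIGH_IMPACT


def classify_setting(key: str, value: Optional[str]) -> Dict[str, object]:
    """Classify one workspace setting using the same rules as the cell above."""
    normalized = str(value).lower()
    is_true = normalized == "true"
    risky = (key in RISKY_WHEN_ENABLED and is_true) or (
        key in RISKY_WHEN_DISABLED and not is_true
    )
    return {
        "config_key": key,
        "config_value": "null" if value is None else str(value),
        "enabled": normalized in {"true", "1", "enabled"},
        "security_impact": "High" if key in HIGH_IMPACT else "Medium",
        "potential_risk": risky,
    }
```

Note that an unset value (`None`) on a high-impact key counts as a potential risk, which matches the `enableIpAccessLists` row in the output above.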


In [0]:
cell_start_time = time.time()
log("Analyzing security configuration and generating recommendations...")

recommendations = []

# Check for high-risk configurations
if validate_dataframe_exists('security_config_df', security_config_df):
    high_risk_configs = security_config_df.filter(
        (F.col('potential_risk') == True) & 
        (F.col('security_impact') == 'High')
    ).collect()
    
    for config in high_risk_configs:
        recommendations.append({
            'priority': 'High',
            'category': 'Security Configuration',
            'issue': f"{config.config_key}: {config.config_value}",
            'recommendation': f"Review and remediate: {config.description}",
            'impact': 'High security risk',
            'resource_type': 'Workspace Config',
            'resource_name': config.config_key
        })
    
    # Check for missing IP access lists
    ip_config = security_config_df.filter(F.col('config_key') == 'enableIpAccessLists').collect()
    if ip_config and not ip_config[0].enabled:
        recommendations.append({
            'priority': 'High',
            'category': 'Network Security',
            'issue': 'IP Access Lists not enabled',
            'recommendation': 'Configure IP access lists to restrict workspace access by network',
            'impact': 'Unrestricted network access to workspace',
            'resource_type': 'Workspace Config',
            'resource_name': 'enableIpAccessLists'
        })
    
    # Check for user isolation
    isolation_config = security_config_df.filter(F.col('config_key') == 'enforceUserIsolation').collect()
    if isolation_config and not isolation_config[0].enabled:
        recommendations.append({
            'priority': 'Medium',
            'category': 'Compute Security',
            'issue': 'User isolation not enforced',
            'recommendation': 'Enable enforceUserIsolation for enhanced security on shared clusters',
            'impact': 'Users may access each other\'s data on shared clusters',
            'resource_type': 'Workspace Config',
            'resource_name': 'enforceUserIsolation'
        })
    
    # Check for verbose audit logs
    audit_config = security_config_df.filter(F.col('config_key') == 'enableVerboseAuditLogs').collect()
    if audit_config and not audit_config[0].enabled:
        recommendations.append({
            'priority': 'High',
            'category': 'Compliance',
            'issue': 'Verbose audit logs not enabled',
            'recommendation': 'Enable verbose audit logs for compliance and security monitoring',
            'impact': 'Limited audit trail for compliance reviews',
            'resource_type': 'Workspace Config',
            'resource_name': 'enableVerboseAuditLogs'
        })
    
    # Check for deprecated features enabled
    deprecated_configs = security_config_df.filter(
        F.col('config_key').contains('Deprecated') & 
        (F.col('enabled') == True)
    ).collect()
    
    for config in deprecated_configs:
        recommendations.append({
            'priority': 'Low',
            'category': 'Maintenance',
            'issue': f"{config.config_key} is enabled",
            'recommendation': 'Migrate away from deprecated features before they are removed',
            'impact': 'Future compatibility issues',
            'resource_type': 'Workspace Config',
            'resource_name': config.config_key
        })

# Create recommendations DataFrame
if recommendations:
    recommendations_df = spark.createDataFrame(recommendations)
    recommendations_df = recommendations_df.orderBy(
        F.when(F.col('priority') == 'High', 0)
         .when(F.col('priority') == 'Medium', 1)
         .otherwise(2),
        F.col('category')
    )
    
    log(f"\n⚠️  Found {len(recommendations)} security recommendations:")
    log(f"  High priority: {recommendations_df.filter(F.col('priority') == 'High').count()}")
    log(f"  Medium priority: {recommendations_df.filter(F.col('priority') == 'Medium').count()}")
    log(f"  Low priority: {recommendations_df.filter(F.col('priority') == 'Low').count()}")
    
    display(recommendations_df)
else:
    log("\n✓ No security configuration issues identified")
    log("   Your workspace security configuration follows best practices")
    recommendations_df = spark.createDataFrame([], schema="priority STRING, category STRING, issue STRING, recommendation STRING, impact STRING, resource_type STRING, resource_name STRING")

log_execution_time("Security configuration analysis and recommendations", cell_start_time)

Analyzing security configuration and generating recommendations...
  ✓ security_config_df: 10 rows

⚠️  Found 4 security recommendations:
  High priority: 3
  Medium priority: 1
  Low priority: 0


category,impact,issue,priority,recommendation,resource_name,resource_type
Network Security,Unrestricted network access to workspace,IP Access Lists not enabled,High,Configure IP access lists to restrict workspace access by network,enableIpAccessLists,Workspace Config
Security Configuration,High security risk,enableIpAccessLists: null,High,Review and remediate: IP access list restrictions enabled,enableIpAccessLists,Workspace Config
Security Configuration,High security risk,enforceUserIsolation: false,High,Review and remediate: User isolation enforcement enabled,enforceUserIsolation,Workspace Config
Compute Security,Users may access each other's data on shared clusters,User isolation not enforced,Medium,Enable enforceUserIsolation for enhanced security on shared clusters,enforceUserIsolation,Workspace Config


⏱️  Security configuration analysis and recommendations completed in 1.62 seconds
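
The output above lists two recommendations for each of `enableIpAccessLists` and `enforceUserIsolation`, because both the generic high-risk loop and the targeted checks fire. One possible way to merge them, sketched here with the hypothetical helper `merge_recommendations`, keeps the targeted wording and carries over the highest priority seen for that setting. This is an illustration, not part of the notebook.

```python
from typing import Dict, List

PRIORITY_RANK = {"High": 0, "Medium": 1, "Low": 2}
GENERIC_CATEGORY = "Security Configuration"  # category used by the generic loop


def merge_recommendations(recs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one recommendation per resource_name, sorted by priority then category."""
    merged: Dict[str, Dict[str, str]] = {}
    for rec in recs:
        key = rec["resource_name"]
        kept = merged.get(key)
        if kept is None:
            merged[key] = dict(rec)  # copy so the input rows are not mutated
            continue
        # Highest priority seen so far for this setting.
        top = min(kept["priority"], rec["priority"], key=lambda p: PRIORITY_RANK.get(p, 3))
        if kept["category"] == GENERIC_CATEGORY and rec["category"] != GENERIC_CATEGORY:
            kept = dict(rec)  # prefer the targeted wording over the generic one
        kept["priority"] = top
        merged[key] = kept
    return sorted(
        merged.values(),
        key=lambda r: (PRIORITY_RANK.get(r["priority"], 3), r["category"]),
    )
```

Applied to the four rows shown above, this would leave one row per setting, with `enforceUserIsolation` reported once at High priority under Compute Security.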


In [0]:
if ENABLE_JOBS:
    cell_start_time = time.time()
    
    if MAX_RESOURCES_PER_TYPE == 999:
        log(f"Fetching job permissions (all resources, no limit)...")
    else:
        log(f"Fetching job permissions (up to {MAX_RESOURCES_PER_TYPE})...")
    
    try:
        jobs_list = list(wc.jobs.list())
        
        if MAX_RESOURCES_PER_TYPE == 999:
            jobs_to_check = jobs_list
            log(f"  Found {len(jobs_list)} total jobs, checking ALL (no limit)")
        else:
            jobs_to_check = jobs_list[:MAX_RESOURCES_PER_TYPE]
            log(f"  Found {len(jobs_list)} total jobs, checking {len(jobs_to_check)}")
        
        if len(jobs_to_check) > 0:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [
                    executor.submit(get_permissions, 'jobs', str(job.job_id), 
                                  job.settings.name if job.settings else '', 'jobs')
                    for job in jobs_to_check
                ]
                
                completed = 0
                for future in as_completed(futures):
                    all_permissions.extend(future.result())
                    completed += 1
                    if not is_job_mode and completed % 20 == 0:
                        progress_pct = (completed / len(jobs_to_check)) * 100
                        log(f"  Progress: {completed}/{len(jobs_to_check)} ({progress_pct:.1f}%)")
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        else:
            log("⚠️  No jobs found to check")
            
    except Exception as e:
        log(f"❌ Error fetching job permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get job permissions", cell_start_time)
else:
    log("⏭️  Job permissions collection disabled (ENABLE_JOBS=False)")
    execution_stats['resources_skipped'] += 1

Fetching job permissions (up to 100)...
  Found 735 total jobs, checking 100
  Progress: 20/100 (20.0%)
  Progress: 40/100 (40.0%)
  Progress: 60/100 (60.0%)
  Progress: 80/100 (80.0%)
  Progress: 100/100 (100.0%)
✓ Collected 453 total permission entries so far
⏱️  Get job permissions completed in 5.59 seconds
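
In the cell above, a single failing `future.result()` aborts the whole collection. A minimal sketch of the same fan-out pattern with per-item error isolation follows. `collect_parallel` is a hypothetical helper, and `fetch` stands in for a call such as `get_permissions` that returns a list of rows.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def collect_parallel(
    items: Iterable[T],
    fetch: Callable[[T], list],
    max_workers: int = 8,
) -> Tuple[list, List[Tuple[T, str]]]:
    """Run fetch(item) concurrently; return (rows, failures) instead of raising."""
    rows: list = []
    failures: List[Tuple[T, str]] = []
    pending = list(items)
    if not pending:
        return rows, failures
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(fetch, item): item for item in pending}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                rows.extend(future.result())
            except Exception as exc:
                # Record the failure and keep going with the remaining items.
                failures.append((item, str(exc)))
    return rows, failures
```

Returning the failures lets the caller decide whether to log them, retry them, or raise in job mode once the rest of the batch has been collected.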


In [0]:
if ENABLE_WAREHOUSES:
    cell_start_time = time.time()
    
    if MAX_RESOURCES_PER_TYPE == 999:
        log(f"Fetching warehouse permissions (all resources, no limit)...")
    else:
        log(f"Fetching warehouse permissions (up to {MAX_RESOURCES_PER_TYPE})...")
    
    try:
        warehouses_list = list(wc.warehouses.list())
        
        if MAX_RESOURCES_PER_TYPE == 999:
            warehouses_to_check = warehouses_list
            log(f"  Found {len(warehouses_list)} total warehouses, checking ALL (no limit)")
        else:
            warehouses_to_check = warehouses_list[:MAX_RESOURCES_PER_TYPE]
            log(f"  Found {len(warehouses_list)} total warehouses, checking {len(warehouses_to_check)}")
        
        if len(warehouses_to_check) > 0:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [
                    executor.submit(get_permissions, 'warehouses', warehouse.id, 
                                  warehouse.name, 'sql/warehouses')
                    for warehouse in warehouses_to_check
                ]
                
                completed = 0
                for future in as_completed(futures):
                    all_permissions.extend(future.result())
                    completed += 1
                    if not is_job_mode and completed % 10 == 0:
                        progress_pct = (completed / len(warehouses_to_check)) * 100
                        log(f"  Progress: {completed}/{len(warehouses_to_check)} ({progress_pct:.1f}%)")
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        else:
            log("⚠️  No warehouses found to check")
            
    except Exception as e:
        log(f"❌ Error fetching warehouse permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get warehouse permissions", cell_start_time)
else:
    log("⏭️  Warehouse permissions collection disabled (ENABLE_WAREHOUSES=False)")
    execution_stats['resources_skipped'] += 1

Fetching warehouse permissions (up to 100)...
  Found 30 total warehouses, checking 30
  Progress: 10/30 (33.3%)
  Progress: 20/30 (66.7%)
  Progress: 30/30 (100.0%)
✓ Collected 587 total permission entries so far
⏱️  Get warehouse permissions completed in 0.37 seconds
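
The job, warehouse, cluster, and pipeline cells all repeat the `MAX_RESOURCES_PER_TYPE == 999` branch. A small sketch of how that selection could be shared is shown below. `select_resources` is a hypothetical helper, and it uses `None` for "no limit" as an assumption; the notebook itself uses the value 999 as the sentinel.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_resources(
    items: Sequence[T], limit: Optional[int], label: str
) -> Tuple[List[T], str]:
    """Return the resources to check plus a log message describing the selection."""
    if limit is None:
        selected = list(items)
        message = f"Found {len(items)} total {label}, checking ALL (no limit)"
    else:
        selected = list(items[:limit])
        message = f"Found {len(items)} total {label}, checking {len(selected)}"
    return selected, message
```

With an explicit `None`, a limit of exactly 999 resources stays a real limit instead of being read as "unlimited".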


In [0]:
if ENABLE_CLUSTERS:
    cell_start_time = time.time()
    
    if MAX_RESOURCES_PER_TYPE == 999:
        log(f"Fetching cluster permissions (interactive clusters only, all resources, no limit)...")
    else:
        log(f"Fetching cluster permissions (interactive clusters only, up to {MAX_RESOURCES_PER_TYPE})...")
    
    try:
        all_clusters = list(wc.clusters.list())
        
        interactive_clusters = [
            cluster for cluster in all_clusters 
            if cluster.cluster_source and cluster.cluster_source.value != 'JOB'
        ]
        
        log(f"  Found {len(all_clusters)} total clusters")
        log(f"  Filtered to {len(interactive_clusters)} interactive clusters (excluding job clusters)")
        
        if MAX_RESOURCES_PER_TYPE == 999:
            clusters_to_check = interactive_clusters
            log(f"  Checking ALL interactive clusters (no limit)")
        else:
            clusters_to_check = interactive_clusters[:MAX_RESOURCES_PER_TYPE]
            log(f"  Checking {len(clusters_to_check)} interactive clusters")
        
        if len(clusters_to_check) > 0:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [
                    executor.submit(get_permissions, 'clusters', cluster.cluster_id, 
                                  cluster.cluster_name, 'clusters')
                    for cluster in clusters_to_check
                ]
                
                completed = 0
                for future in as_completed(futures):
                    all_permissions.extend(future.result())
                    completed += 1
                    if not is_job_mode and completed % 10 == 0:
                        progress_pct = (completed / len(clusters_to_check)) * 100
                        log(f"  Progress: {completed}/{len(clusters_to_check)} ({progress_pct:.1f}%)")
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        else:
            log("⚠️  No interactive clusters found to check")
            
    except Exception as e:
        log(f"❌ Error fetching cluster permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get cluster permissions", cell_start_time)
else:
    log("⏭️  Cluster permissions collection disabled (ENABLE_CLUSTERS=False)")
    execution_stats['resources_skipped'] += 1

Fetching cluster permissions (interactive clusters only, up to 100)...
  Found 103220 total clusters
  Filtered to 449 interactive clusters (excluding job clusters)
  Checking 100 interactive clusters
  Progress: 10/100 (10.0%)
  Progress: 20/100 (20.0%)
  Progress: 30/100 (30.0%)
  Progress: 40/100 (40.0%)
  Progress: 50/100 (50.0%)
  Progress: 60/100 (60.0%)
  Progress: 70/100 (70.0%)
  Progress: 80/100 (80.0%)
  Progress: 90/100 (90.0%)
  Progress: 100/100 (100.0%)
✓ Collected 974 total permission entries so far
⏱️  Get cluster permissions completed in 2225.00 seconds
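
The run above lists 103220 clusters to check 100 of them. When only the first N interactive clusters are needed, the listing can be consumed lazily and stopped early. The following is a sketch under stated assumptions: `ClusterStub` and `ClusterSource` are hypothetical stand-ins for the SDK types, and any saving depends on the underlying listing being a lazily paginated iterator rather than a fully materialized list.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import islice
from typing import Iterable, List, Optional


class ClusterSource(Enum):
    """Stand-in for the SDK enum; only the values used here are modelled."""
    UI = "UI"
    API = "API"
    JOB = "JOB"


@dataclass
class ClusterStub:
    """Stand-in for a cluster record (only the fields the filter reads)."""
    cluster_id: str
    cluster_source: Optional[ClusterSource]


def first_interactive(clusters: Iterable[ClusterStub], limit: Optional[int]) -> List[ClusterStub]:
    """Return up to `limit` non-job clusters, stopping iteration once enough are found."""
    interactive = (
        c for c in clusters
        if c.cluster_source and c.cluster_source.value != "JOB"
    )
    # islice with limit=None consumes the whole stream, i.e. no limit.
    return list(islice(interactive, limit))
```

The trade-off is that the total cluster count is no longer known, so the "Found N total clusters" log line would have to be dropped or reported only in unlimited runs.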


In [0]:
if ENABLE_PIPELINES:
    cell_start_time = time.time()
    
    if MAX_RESOURCES_PER_TYPE == 999:
        log(f"Fetching pipeline permissions (all resources, no limit)...")
    else:
        log(f"Fetching pipeline permissions (up to {MAX_RESOURCES_PER_TYPE})...")
    
    try:
        pipelines_list = list(wc.pipelines.list_pipelines())
        
        if MAX_RESOURCES_PER_TYPE == 999:
            pipelines_to_check = pipelines_list
            log(f"  Found {len(pipelines_list)} total pipelines, checking ALL (no limit)")
        else:
            pipelines_to_check = pipelines_list[:MAX_RESOURCES_PER_TYPE]
            log(f"  Found {len(pipelines_list)} total pipelines, checking {len(pipelines_to_check)}")
        
        if len(pipelines_to_check) > 0:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [
                    executor.submit(get_permissions, 'pipelines', pipeline.pipeline_id, 
                                  pipeline.name, 'pipelines')
                    for pipeline in pipelines_to_check
                ]
                
                completed = 0
                for future in as_completed(futures):
                    all_permissions.extend(future.result())
                    completed += 1
                    if not is_job_mode and completed % 10 == 0:
                        progress_pct = (completed / len(pipelines_to_check)) * 100
                        log(f"  Progress: {completed}/{len(pipelines_to_check)} ({progress_pct:.1f}%)")
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        else:
            log("⚠️  No pipelines found to check")
            
    except Exception as e:
        log(f"❌ Error fetching pipeline permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get pipeline permissions", cell_start_time)
else:
    log("⏭️  Pipeline permissions collection disabled (ENABLE_PIPELINES=False)")
    execution_stats['resources_skipped'] += 1

Fetching pipeline permissions (up to 100)...
  Found 8 total pipelines, checking 8
✓ Collected 1002 total permission entries so far
⏱️  Get pipeline permissions completed in 0.23 seconds


In [0]:
if ENABLE_WORKSPACE_OBJECTS:
    cell_start_time = time.time()
    
    log("Fetching workspace folder and notebook permissions...")
    log(f"  Limit: {MAX_WORKSPACE_OBJECTS} objects")
    
    workspace_perms_data = []
    objects_scanned = 0
    
    try:
        # Define key folders to check (customize as needed)
        key_folders = [
            '/Shared',
            '/Users',
            '/Repos'
        ]
        
        def collect_workspace_objects(path, current_count, depth=0, max_depth=2):
            """Collect workspace objects up to max_depth"""
            if depth > max_depth or current_count >= MAX_WORKSPACE_OBJECTS:
                return [], current_count
            
            objects_to_check = []
            try:
                objects = list(wc.workspace.list(path))
                
                for obj in objects:
                    if current_count >= MAX_WORKSPACE_OBJECTS:
                        break
                    
                    objects_to_check.append(obj)
                    current_count += 1
                    
                    # Recurse into directories
                    if obj.object_type and obj.object_type.value == 'DIRECTORY' and depth < max_depth:
                        sub_objects, current_count = collect_workspace_objects(obj.path, current_count, depth + 1, max_depth)
                        objects_to_check.extend(sub_objects)
            except Exception as e:
                log(f"  ⚠️ Could not list {path}: {str(e)[:100]}")
            
            return objects_to_check, current_count
        
        # Collect objects from key folders
        all_objects = []
        for folder in key_folders:
            try:
                log(f"  Scanning {folder}...")
                folder_objects, objects_scanned = collect_workspace_objects(folder, objects_scanned, depth=0, max_depth=2)
                all_objects.extend(folder_objects)
                if objects_scanned >= MAX_WORKSPACE_OBJECTS:
                    log(f"  ⚠️ Reached MAX_WORKSPACE_OBJECTS limit ({MAX_WORKSPACE_OBJECTS})")
                    break
            except Exception as e:
                log(f"  ⚠️ Could not scan {folder}: {str(e)}")
        
        log(f"  Collected {len(all_objects)} workspace objects to check permissions")
        
        # Parallelize permission fetching for collected objects
        if len(all_objects) > 0:
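            # Map workspace object types to the request_object_type the Permissions API expects
            permission_object_types = {
                'DIRECTORY': 'directories',
                'NOTEBOOK': 'notebooks',
                'FILE': 'files',
                'REPO': 'repos'
            }
            
            def get_workspace_object_permissions(obj):
                """Get permissions for a single workspace object"""
                try:
                    object_type = obj.object_type.value if obj.object_type else None
                    request_type = permission_object_types.get(object_type)
                    if request_type is None:
                        return []  # Unsupported object type (e.g. LIBRARY)
                    perms = wc.permissions.get(request_type, str(obj.object_id))
                    results = []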
                    
                    if perms.access_control_list:
                        for acl in perms.access_control_list:
                            principal = acl.user_name or acl.group_name or acl.service_principal_name
                            principal_type = 'user' if acl.user_name else ('group' if acl.group_name else 'service_principal')
                            permissions = [p.permission_level.value for p in acl.all_permissions] if acl.all_permissions else []
                            
                            for perm in permissions:
                                results.append({
                                    'resource_type': 'workspace_object',
                                    'object_type': obj.object_type.value if obj.object_type else 'UNKNOWN',
                                    'resource_id': str(obj.object_id),
                                    'resource_path': obj.path,
                                    'principal': principal,
                                    'principal_type': principal_type,
                                    'permission_level': perm
                                })
                    return results
                except Exception:
                    return []
            
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [executor.submit(get_workspace_object_permissions, obj) for obj in all_objects]
                
                completed = 0
                for future in as_completed(futures):
                    results = future.result()
                    workspace_perms_data.extend(results)
                    completed += 1
                    if not is_job_mode and completed % 50 == 0:
                        progress_pct = (completed / len(all_objects)) * 100
                        log(f"  Progress: {completed}/{len(all_objects)} ({progress_pct:.1f}%)")
            
            # Add to all_permissions
            all_permissions.extend([{
                'resource_type': p['resource_type'],
                'resource_id': p['resource_id'],
                'resource_name': p['resource_path'],
                'principal': p['principal'],
                'principal_type': p['principal_type'],
                'permission_level': p['permission_level']
            } for p in workspace_perms_data])
        
        if workspace_perms_data:
            workspace_permissions_df = spark.createDataFrame(workspace_perms_data)
        else:
            workspace_permissions_df = spark.createDataFrame([], 'resource_type STRING, object_type STRING, resource_id STRING, resource_path STRING, principal STRING, principal_type STRING, permission_level STRING')
        
        log(f"✓ Found {len(workspace_perms_data)} workspace object permissions")
        log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        
    except Exception as e:
        log(f"❌ Error fetching workspace permissions: {str(e)}")
        workspace_permissions_df = spark.createDataFrame([], 'resource_type STRING, object_type STRING, resource_id STRING, resource_path STRING, principal STRING, principal_type STRING, permission_level STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get workspace folder and notebook permissions", cell_start_time)
else:
    log("⏭️  Workspace object permissions collection disabled (ENABLE_WORKSPACE_OBJECTS=False)")
    workspace_permissions_df = spark.createDataFrame([], 'resource_type STRING, object_type STRING, resource_id STRING, resource_path STRING, principal STRING, principal_type STRING, permission_level STRING')
    execution_stats['resources_skipped'] += 1

Fetching workspace folder and notebook permissions...
  Limit: 500 objects
  Scanning /Shared...
  ⚠️ Reached MAX_WORKSPACE_OBJECTS limit (500)
  Collected 500 workspace objects to check permissions
  Progress: 50/500 (10.0%)
  Progress: 100/500 (20.0%)
  Progress: 150/500 (30.0%)
  Progress: 200/500 (40.0%)
  Progress: 250/500 (50.0%)
  Progress: 300/500 (60.0%)
  Progress: 350/500 (70.0%)
  Progress: 400/500 (80.0%)
  Progress: 450/500 (90.0%)
  Progress: 500/500 (100.0%)
✓ Found 299 workspace object permissions
✓ Collected 1301 total permission entries so far
⏱️  Get workspace folder and notebook permissions completed in 8.51 seconds
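
The fan-out pattern used in the cell above (submit one task per object, then drain results with `as_completed` while logging progress) can be sketched in isolation. This is a minimal sketch, not the notebook's implementation: `fetch_all`, `fetch_one`, and `report` are hypothetical names, and `fetch_one` stands in for a per-object helper such as `get_workspace_object_permissions`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(items, fetch_one, max_workers=8, report_every=50, report=print):
    """Call fetch_one(item) concurrently and flatten the returned lists.

    fetch_one is a hypothetical stand-in for a per-object permission helper;
    it must return a list of row dicts.
    """
    rows = []
    total = len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_one, item) for item in items]
        # as_completed yields futures in completion order, not submission order
        for completed, future in enumerate(as_completed(futures), start=1):
            rows.extend(future.result())  # re-raises any exception from fetch_one
            if completed % report_every == 0:
                pct = completed / total * 100
                report(f"  Progress: {completed}/{total} ({pct:.1f}%)")
    return rows
```

Because results arrive in completion order, the row order is not deterministic between runs; sort downstream if a stable order matters.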


In [0]:
if ENABLE_MODEL_REGISTRY:
    cell_start_time = time.time()
    
    log("Fetching model registry permissions...")
    
    try:
        # Note: Workspace Model Registry (legacy) does not support permissions API
        # Unity Catalog models use grants API instead (covered in UC permissions cell)
        
        # Check for workspace models
        workspace_models = list(wc.model_registry.list_models())
        log(f"  Found {len(workspace_models)} workspace model registry models (legacy)")
        log(f"  Note: Workspace model registry does not support permissions API")
        log(f"  Workspace models are managed through workspace-level access controls")
        
        # Get Unity Catalog models and their grants
        try:
            uc_models = list(wc.registered_models.list())
            
            if MAX_RESOURCES_PER_TYPE == 999:
                models_to_check = uc_models
                log(f"  Found {len(uc_models)} Unity Catalog models, checking ALL (no limit)")
            else:
                models_to_check = uc_models[:MAX_RESOURCES_PER_TYPE]
                log(f"  Found {len(uc_models)} Unity Catalog models, checking {len(models_to_check)}")
            
            uc_model_grants = []
            
            if len(models_to_check) > 0:
                completed = 0
                for model in models_to_check:
                    try:
                        # Get grants for UC model
                        grants = wc.grants.get_effective(securable_type='function', full_name=model.full_name)
                        
                        if grants.privilege_assignments:
                            for grant in grants.privilege_assignments:
                                for privilege in grant.privileges:
                                    uc_model_grants.append({
                                        'model_full_name': model.full_name,
                                        'model_name': model.name,
                                        'catalog_name': model.catalog_name,
                                        'schema_name': model.schema_name,
                                        'principal': grant.principal,
                                        'privilege': privilege.value
                                    })
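                    except Exception as e:
                        # Surface the failure so an empty result is not mistaken for "no grants"
                        log(f"  ⚠️ Could not fetch grants for {model.full_name}: {str(e)}")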
                    
                    completed += 1
                    if not is_job_mode and completed % 50 == 0:
                        progress_pct = (completed / len(models_to_check)) * 100
                        log(f"  Progress: {completed}/{len(models_to_check)} ({progress_pct:.1f}%)")
                
                if uc_model_grants:
                    uc_model_grants_df = spark.createDataFrame(uc_model_grants)
                    # len() avoids triggering a Spark job just to report the row count
                    log(f"  ✓ Found {len(uc_model_grants)} Unity Catalog model grants")
                else:
                    uc_model_grants_df = spark.createDataFrame([], 'model_full_name STRING, model_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
                    log("  No Unity Catalog model grants found")
            else:
                uc_model_grants_df = spark.createDataFrame([], 'model_full_name STRING, model_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        
        except Exception as e:
            log(f"  ⚠️ Could not fetch Unity Catalog models: {str(e)}")
            uc_model_grants_df = spark.createDataFrame([], 'model_full_name STRING, model_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
            
    except Exception as e:
        log(f"❌ Error fetching model registry permissions: {str(e)}")
        uc_model_grants_df = spark.createDataFrame([], 'model_full_name STRING, model_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get model registry permissions", cell_start_time)
else:
    log("⏭️  Model registry permissions collection disabled (ENABLE_MODEL_REGISTRY=False)")
    uc_model_grants_df = spark.createDataFrame([], 'model_full_name STRING, model_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
    execution_stats['resources_skipped'] += 1

Fetching model registry permissions...
  Found 37 workspace model registry models (legacy)
  Note: Workspace model registry does not support permissions API
  Workspace models are managed through workspace-level access controls

  Found 1677 Unity Catalog models, checking 100
  Progress: 50/100 (50.0%)
  Progress: 100/100 (100.0%)
  No Unity Catalog model grants found
✓ Collected 1301 total permission entries so far
⏱️  Get model registry permissions completed in 30.33 seconds
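
If per-model lookup failures are swallowed, an empty result such as the one above cannot be told apart from a run in which every lookup failed. One way to keep the scan resilient while still surfacing failures is to tally exceptions by type. A minimal sketch, assuming a hypothetical `fetch_one` callable that returns a list of grant rows for one model:

```python
from collections import Counter


def collect_with_error_tally(items, fetch_one):
    """Collect rows from fetch_one(item) and count failures by exception type.

    fetch_one is a hypothetical stand-in for a per-model grants lookup.
    """
    rows = []
    errors = Counter()
    for item in items:
        try:
            rows.extend(fetch_one(item))
        except Exception as exc:
            # Keep scanning, but remember what went wrong
            errors[type(exc).__name__] += 1
    return rows, errors
```

Logging `errors` once at the end of the loop keeps the output short while making it clear whether "no grants found" reflects the workspace or a permissions problem.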


In [0]:
if ENABLE_REPOS:
    cell_start_time = time.time()
    
    log("Fetching repos (Git integration) permissions...")
    
    try:
        repos_list = list(wc.repos.list())
        
        if MAX_RESOURCES_PER_TYPE == 999:
            repos_to_check = repos_list
            log(f"  Found {len(repos_list)} repos, checking ALL (no limit)")
        else:
            repos_to_check = repos_list[:MAX_RESOURCES_PER_TYPE]
            log(f"  Found {len(repos_list)} repos, checking {len(repos_to_check)}")
        
        if len(repos_to_check) > 0:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [
                    executor.submit(get_permissions, 'repos', str(repo.id), 
                                  repo.path, 'repos')
                    for repo in repos_to_check
                ]
                
                completed = 0
                for future in as_completed(futures):
                    all_permissions.extend(future.result())
                    completed += 1
                    if not is_job_mode and completed % 10 == 0:
                        progress_pct = (completed / len(repos_to_check)) * 100
                        log(f"  Progress: {completed}/{len(repos_to_check)} ({progress_pct:.1f}%)")
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        else:
            log("⚠️ No repos found to check")
            
    except Exception as e:
        log(f"❌ Error fetching repos permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get repos permissions", cell_start_time)
else:
    log("⏭️  Repos permissions collection disabled (ENABLE_REPOS=False)")
    execution_stats['resources_skipped'] += 1

Fetching repos (Git integration) permissions...
  Found 155 repos, checking 100
  Progress: 10/100 (10.0%)
  Progress: 20/100 (20.0%)
  Progress: 30/100 (30.0%)
  Progress: 40/100 (40.0%)
  Progress: 50/100 (50.0%)
  Progress: 60/100 (60.0%)
  Progress: 70/100 (70.0%)
  Progress: 80/100 (80.0%)
  Progress: 90/100 (90.0%)
  Progress: 100/100 (100.0%)
✓ Collected 1572 total permission entries so far
⏱️  Get repos permissions completed in 1.40 seconds
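
Several cells repeat the same `MAX_RESOURCES_PER_TYPE` check, with `999` acting as a "no limit" sentinel. That logic could be factored into one helper. The following is only a sketch: `apply_resource_limit` and `NO_LIMIT` are hypothetical names that do not exist in the notebook.

```python
NO_LIMIT = 999  # sentinel value the notebook uses to mean "check everything"


def apply_resource_limit(resources, max_per_type, label):
    """Return (resources_to_check, log_message) for one resource type."""
    if max_per_type == NO_LIMIT:
        return resources, f"  Found {len(resources)} {label}, checking ALL (no limit)"
    selected = resources[:max_per_type]
    return selected, f"  Found {len(resources)} {label}, checking {len(selected)}"
```

For example, `apply_resource_limit(repos_list, 100, 'repos')` with 155 repos would return the first 100 and a message matching the log line above.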


In [0]:
if ENABLE_INSTANCE_POOLS:
    cell_start_time = time.time()
    
    log("Fetching instance pool permissions...")
    
    try:
        pools_list = list(wc.instance_pools.list())
        
        if MAX_RESOURCES_PER_TYPE == 999:
            pools_to_check = pools_list
            log(f"  Found {len(pools_list)} instance pools, checking ALL (no limit)")
        else:
            pools_to_check = pools_list[:MAX_RESOURCES_PER_TYPE]
            log(f"  Found {len(pools_list)} instance pools, checking {len(pools_to_check)}")
        
        if len(pools_to_check) > 0:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = [
                    executor.submit(get_permissions, 'instance_pools', pool.instance_pool_id, 
                                  pool.instance_pool_name, 'instance-pools')
                    for pool in pools_to_check
                ]
                
                completed = 0
                for future in as_completed(futures):
                    all_permissions.extend(future.result())
                    completed += 1
                    if not is_job_mode and completed % 5 == 0:
                        progress_pct = (completed / len(pools_to_check)) * 100
                        log(f"  Progress: {completed}/{len(pools_to_check)} ({progress_pct:.1f}%)")
            
            log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        else:
            log("⚠️ No instance pools found to check")
            
    except Exception as e:
        log(f"❌ Error fetching instance pool permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get instance pool permissions", cell_start_time)
else:
    log("⏭️  Instance pool permissions collection disabled (ENABLE_INSTANCE_POOLS=False)")
    execution_stats['resources_skipped'] += 1

Fetching instance pool permissions...
  Found 4 instance pools, checking 4
✓ Collected 1582 total permission entries so far
⏱️  Get instance pool permissions completed in 0.19 seconds


In [0]:
if ENABLE_TOKEN_AUDIT:
    cell_start_time = time.time()
    
    log("Fetching token management audit data...")
    
    try:
        tokens_data = []
        
        try:
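            # token_management lists tokens for all users (admin only) and returns the
            # created_by_* fields read below; wc.tokens.list() covers only the caller's tokens
            tokens = list(wc.token_management.list())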
            
            for token in tokens:
                tokens_data.append({
                    'token_id': token.token_id,
                    'created_by_username': token.created_by_username,
                    'created_by_id': token.created_by_id,
                    'creation_time': token.creation_time,
                    'expiry_time': token.expiry_time,
                    'comment': token.comment
                })
        except Exception as e:
            log(f"  ⚠️ Could not fetch tokens (may require additional permissions): {str(e)}")
        
        if tokens_data:
            tokens_df = spark.createDataFrame(tokens_data)
            log(f"✓ Found {len(tokens_data)} active tokens")
            
            # Treat NULL and non-positive expiry values as "no expiration"
            no_expiry = tokens_df.filter(
                F.col('expiry_time').isNull() | (F.col('expiry_time') <= 0)
            ).count()
            if no_expiry > 0:
                log(f"  ⚠️ Security Alert: {no_expiry} tokens have no expiration date")
        else:
            tokens_df = spark.createDataFrame([], 'token_id STRING, created_by_username STRING, created_by_id STRING, creation_time BIGINT, expiry_time BIGINT, comment STRING')
            log("  No token data available (may require admin permissions)")
        
    except Exception as e:
        log(f"❌ Error fetching token data: {str(e)}")
        tokens_df = spark.createDataFrame([], 'token_id STRING, created_by_username STRING, created_by_id STRING, creation_time BIGINT, expiry_time BIGINT, comment STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get token management audit data", cell_start_time)
else:
    log("⏭️  Token audit disabled (ENABLE_TOKEN_AUDIT=False)")
    tokens_df = spark.createDataFrame([], 'token_id STRING, created_by_username STRING, created_by_id STRING, creation_time BIGINT, expiry_time BIGINT, comment STRING')
    execution_stats['resources_skipped'] += 1

Fetching token management audit data...
  No token data available (may require admin permissions)
⏱️  Get token management audit data completed in 0.14 seconds


In [0]:
if ENABLE_TOKEN_AUDIT and 'tokens_df' in dir() and tokens_df.count() > 0:
    cell_start_time = time.time()
    
    log("\n" + "="*60)
    log("TOKEN EXPIRATION AUDIT")
    log("="*60)
    
    # Identify tokens without expiration dates (CRITICAL SECURITY RISK)
    # The Token API reports a non-expiring token as -1, so treat any non-positive value as "no expiry"
    tokens_no_expiry = tokens_df.filter(
        (F.col('expiry_time').isNull()) | (F.col('expiry_time') <= 0)
    )
    
    log(f"\n⚠️  CRITICAL: Tokens without expiration dates")
    log(f"  Found {tokens_no_expiry.count()} tokens with no expiry")
    
    if tokens_no_expiry.count() > 0:
        log("\n  Tokens without expiry (SECURITY RISK):")
        tokens_no_expiry_summary = tokens_no_expiry.groupBy('created_by_username', 'comment') \
            .agg(F.count('*').alias('token_count')) \
            .orderBy(F.desc('token_count'))
        
        if not is_job_mode:
            display(tokens_no_expiry_summary)
        
        log(f"\n  ⚠️  RECOMMENDATION: All tokens should have expiration dates")
        log(f"     Set maxTokenLifetimeDays in workspace settings to enforce expiration")
    else:
        log("  ✓ All tokens have expiration dates")
    
    # Identify tokens expiring soon (within 30 days); already-expired tokens are excluded
    from pyspark.sql.functions import current_timestamp
    
    now_ms = F.unix_timestamp(current_timestamp()) * 1000
    thirty_days_ms = 30 * 24 * 60 * 60 * 1000
    
    tokens_expiring_soon = tokens_df.filter(
        (F.col('expiry_time').isNotNull()) & 
        (F.col('expiry_time') >= now_ms) &
        (F.col('expiry_time') < now_ms + thirty_days_ms)
    )
    
    log(f"\nℹ️  Tokens expiring within 30 days: {tokens_expiring_soon.count()}")
    
    if tokens_expiring_soon.count() > 0 and not is_job_mode:
        log("\n  Tokens expiring soon:")
        display(tokens_expiring_soon.select(
            'created_by_username', 
            'comment',
            F.from_unixtime(F.col('expiry_time') / 1000).alias('expiry_date')
        ).orderBy('expiry_time'))
    
    # Token age analysis
    tokens_with_age = tokens_df.withColumn(
        'age_days',
        (F.unix_timestamp(current_timestamp()) - (F.col('creation_time') / 1000)) / (24 * 60 * 60)
    )
    
    old_tokens = tokens_with_age.filter(F.col('age_days') > 365)
    log(f"\nℹ️  Tokens older than 1 year: {old_tokens.count()}")
    
    if old_tokens.count() > 0:
        log("  ⚠️  RECOMMENDATION: Review and rotate old tokens regularly")
    
    # Summary statistics
    log(f"\n=== TOKEN SUMMARY ===")
    log(f"Total tokens: {tokens_df.count()}")
    log(f"Tokens without expiry: {tokens_no_expiry.count()} (❌ CRITICAL)")
    log(f"Tokens expiring soon (30 days): {tokens_expiring_soon.count()}")
    log(f"Tokens older than 1 year: {old_tokens.count()}")
    
    log_execution_time("Token expiration audit", cell_start_time)
else:
    log("⏭️  Token expiration audit skipped (no token data available)")

⏭️  Token expiration audit skipped (no token data available)
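
The three token checks in the cell above (no expiry, expiring within 30 days, older than one year) reduce to arithmetic on epoch-millisecond timestamps. A plain-Python sketch of that arithmetic follows. It assumes timestamps are in epoch milliseconds and that a missing or non-positive `expiry_time` means "no expiry"; `classify_token` is a hypothetical helper, not part of the notebook.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def classify_token(expiry_ms, creation_ms, now_ms, soon_days=30, old_days=365):
    """Classify one token from its epoch-millisecond timestamps."""
    # Assumption: None or a non-positive value means the token never expires
    no_expiry = expiry_ms is None or expiry_ms <= 0
    expiring_soon = (
        not no_expiry
        and now_ms <= expiry_ms < now_ms + soon_days * MS_PER_DAY
    )
    age_days = (now_ms - creation_ms) / MS_PER_DAY
    return {
        'no_expiry': no_expiry,
        'expiring_soon': expiring_soon,
        'older_than_limit': age_days > old_days,
    }
```

Keeping every comparison in milliseconds avoids mixing units; the Spark filters mix second-based `unix_timestamp` values with millisecond columns, which is where the `* 1000` and `/ 1000` conversions come from.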


In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("INACTIVE USER PERMISSIONS ANALYSIS")
log("="*60)

# Check if required DataFrames exist
if 'users_df' not in dir() or 'permissions_df' not in dir():
    log("\n⏭️  Skipping inactive user analysis - permissions data not yet available")
    log("   This analysis will run after permissions are collected (Cell 23+)")
    inactive_user_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')
    log_execution_time("Inactive user permissions analysis", cell_start_time)
else:
    # Identify inactive users with active permissions
    if validate_dataframe_exists('users_df', users_df) and validate_dataframe_exists('permissions_df', permissions_df):
        
        inactive_users = users_df.filter(F.col('active') == False)
        
        log(f"\nInactive users in workspace: {inactive_users.count()}")
        
        if inactive_users.count() > 0:
            # Find permissions for inactive users (left_semi keeps only the
            # permissions_df columns, so the schema matches the empty fallbacks)
            inactive_user_permissions = permissions_df.filter(
                F.col('principal_type') == 'user'
            ).join(
                inactive_users.select('user_name'),
                permissions_df.principal == inactive_users.user_name,
                'left_semi'
            )
            
            inactive_perm_count = inactive_user_permissions.count()
            
            log(f"\n⚠️  COMPLIANCE ISSUE: Inactive users with permissions")
            log(f"  Found {inactive_perm_count} permission entries for inactive users")
            
            if inactive_perm_count > 0:
                # Summary by user
                inactive_summary = inactive_user_permissions.groupBy('principal') \
                    .agg(
                        F.count('*').alias('permission_count'),
                        F.countDistinct('resource_type').alias('resource_types'),
                        F.collect_set('resource_type').alias('resource_type_list')
                    ) \
                    .orderBy(F.desc('permission_count'))
                
                log(f"\n  Inactive users with permissions: {inactive_summary.count()}")
                
                if not is_job_mode:
                    display(inactive_summary)
                
                # Detailed breakdown by resource type
                inactive_by_resource = inactive_user_permissions.groupBy('resource_type') \
                    .agg(
                        F.countDistinct('principal').alias('inactive_users'),
                        F.count('*').alias('permission_entries')
                    ) \
                    .orderBy(F.desc('permission_entries'))
                
                log(f"\n  Permissions by resource type:")
                if not is_job_mode:
                    display(inactive_by_resource)
                
                log(f"\n  ⚠️  RECOMMENDATION: Remove permissions for inactive users")
                log(f"     Inactive users should not retain access to workspace resources")
                log(f"     This is a common compliance requirement (SOX, GDPR, etc.)")
                
                # Store for export
                inactive_user_permissions_df = inactive_user_permissions
            else:
                log("  ✓ No permissions found for inactive users")
                inactive_user_permissions_df = spark.createDataFrame([], permissions_df.schema)
        else:
            log("\n✓ No inactive users in workspace")
            inactive_user_permissions_df = spark.createDataFrame([], permissions_df.schema)
    else:
        log("⚠️  Cannot analyze inactive users - missing required data")
        inactive_user_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')
    
    log_execution_time("Inactive user permissions analysis", cell_start_time)


INACTIVE USER PERMISSIONS ANALYSIS

⏭️  Skipping inactive user analysis - permissions data not yet available
   This analysis will run after permissions are collected (Cell 23+)
⏱️  Inactive user permissions analysis completed in 0.04 seconds
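
The core of this analysis is a semi-join: keep permission rows whose principal is an inactive user, then count entries per user. The same logic in plain Python, as a sketch over lists of dicts rather than Spark DataFrames (both function names are hypothetical):

```python
from collections import Counter


def permissions_for_inactive_users(permissions, users):
    """Return permission rows whose principal is an inactive user (a semi-join)."""
    inactive = {u['user_name'] for u in users if not u['active']}
    return [
        p for p in permissions
        if p['principal_type'] == 'user' and p['principal'] in inactive
    ]


def summarize_by_principal(rows):
    """Count permission entries per principal, highest first."""
    return Counter(r['principal'] for r in rows).most_common()
```

Filtering on `principal_type` first matters: a group that happens to share a name with an inactive user must not be reported as that user's permission.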


In [0]:
if ENABLE_SQL_ASSETS:
    cell_start_time = time.time()
    
    log("Fetching SQL dashboard and query permissions...")
    
    sql_dashboard_perms = []
    sql_query_perms = []
    
    try:
        # Get SQL dashboards with parallelization
        try:
            dashboards = list(wc.dashboards.list())
            
            if MAX_RESOURCES_PER_TYPE == 999:
                dashboards_to_check = dashboards
                log(f"  Found {len(dashboards)} SQL dashboards, checking ALL (no limit)")
            else:
                dashboards_to_check = dashboards[:MAX_RESOURCES_PER_TYPE]
                log(f"  Found {len(dashboards)} SQL dashboards, checking {len(dashboards_to_check)}")
            
            if len(dashboards_to_check) > 0:
                def get_dashboard_permissions(dashboard):
                    """Get permissions for a single dashboard"""
                    results = []
                    try:
                        perms = wc.permissions.get('dashboards', dashboard.id)
                        
                        if perms.access_control_list:
                            for acl in perms.access_control_list:
                                principal = acl.user_name or acl.group_name or acl.service_principal_name
                                principal_type = 'user' if acl.user_name else ('group' if acl.group_name else 'service_principal')
                                permissions = [p.permission_level.value for p in acl.all_permissions] if acl.all_permissions else []
                                
                                for perm in permissions:
                                    results.append({
                                        'resource_type': 'sql_dashboard',
                                        'resource_id': dashboard.id,
                                        'resource_name': dashboard.name,
                                        'principal': principal,
                                        'principal_type': principal_type,
                                        'permission_level': perm
                                    })
                    except Exception:
                        pass
                    return results
                
                with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                    futures = [executor.submit(get_dashboard_permissions, dash) for dash in dashboards_to_check]
                    
                    completed = 0
                    for future in as_completed(futures):
                        sql_dashboard_perms.extend(future.result())
                        completed += 1
                        if not is_job_mode and completed % 10 == 0:
                            progress_pct = (completed / len(dashboards_to_check)) * 100
                            log(f"  Dashboard progress: {completed}/{len(dashboards_to_check)} ({progress_pct:.1f}%)")
                
                log(f"  ✓ Collected {len(sql_dashboard_perms)} SQL dashboard permissions")
        except Exception as e:
            log(f"  ⚠️ Could not fetch SQL dashboards: {str(e)}")
        
        # Get SQL queries with parallelization
        try:
            queries = list(wc.queries.list())
            
            if MAX_RESOURCES_PER_TYPE == 999:
                queries_to_check = queries
                log(f"  Found {len(queries)} SQL queries, checking ALL (no limit)")
            else:
                queries_to_check = queries[:MAX_RESOURCES_PER_TYPE]
                log(f"  Found {len(queries)} SQL queries, checking {len(queries_to_check)}")
            
            if len(queries_to_check) > 0:
                def get_query_permissions(query):
                    """Get permissions for a single query"""
                    results = []
                    try:
                        perms = wc.permissions.get('queries', query.id)
                        
                        if perms.access_control_list:
                            for acl in perms.access_control_list:
                                principal = acl.user_name or acl.group_name or acl.service_principal_name
                                principal_type = 'user' if acl.user_name else ('group' if acl.group_name else 'service_principal')
                                permissions = [p.permission_level.value for p in acl.all_permissions] if acl.all_permissions else []
                                
                                for perm in permissions:
                                    results.append({
                                        'resource_type': 'sql_query',
                                        'resource_id': query.id,
                                        'resource_name': query.name,
                                        'principal': principal,
                                        'principal_type': principal_type,
                                        'permission_level': perm
                                    })
                    except Exception:
                        pass
                    return results
                
                with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                    futures = [executor.submit(get_query_permissions, query) for query in queries_to_check]
                    
                    completed = 0
                    for future in as_completed(futures):
                        sql_query_perms.extend(future.result())
                        completed += 1
                        if not is_job_mode and completed % 20 == 0:
                            progress_pct = (completed / len(queries_to_check)) * 100
                            log(f"  Query progress: {completed}/{len(queries_to_check)} ({progress_pct:.1f}%)")
                
                log(f"  ✓ Collected {len(sql_query_perms)} SQL query permissions")
        except Exception as e:
            log(f"  ⚠️ Could not fetch SQL queries: {str(e)}")
        
        # Add to all_permissions
        all_permissions.extend(sql_dashboard_perms)
        all_permissions.extend(sql_query_perms)
        
        log(f"✓ Collected {len(all_permissions)} total permission entries so far")
        
    except Exception as e:
        log(f"❌ Error fetching SQL dashboard/query permissions: {str(e)}")
        if is_job_mode:
            raise
    
    log_execution_time("Get SQL dashboard and query permissions", cell_start_time)
else:
    log("⏭️  SQL asset permissions collection disabled (ENABLE_SQL_ASSETS=False)")
    execution_stats['resources_skipped'] += 1
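
The dashboard and query helpers above share one step: flattening an access control list into one row per principal and permission level. That step can be sketched on its own. `StubAcl` and `StubPermission` are hypothetical stand-ins for the SDK response objects, and `permission_level` is a plain string here, whereas the cell above reads `.value` from an enum.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubPermission:
    permission_level: str  # plain string in this sketch (the cell above reads an enum's .value)


@dataclass
class StubAcl:
    user_name: Optional[str] = None
    group_name: Optional[str] = None
    service_principal_name: Optional[str] = None
    all_permissions: List[StubPermission] = field(default_factory=list)


def flatten_acl(resource_type, resource_id, resource_name, acls):
    """Emit one row per (principal, permission level) pair."""
    rows = []
    for acl in acls or []:
        if acl.user_name:
            principal, principal_type = acl.user_name, 'user'
        elif acl.group_name:
            principal, principal_type = acl.group_name, 'group'
        else:
            principal, principal_type = acl.service_principal_name, 'service_principal'
        for perm in acl.all_permissions or []:
            rows.append({
                'resource_type': resource_type,
                'resource_id': resource_id,
                'resource_name': resource_name,
                'principal': principal,
                'principal_type': principal_type,
                'permission_level': perm.permission_level,
            })
    return rows
```

Sharing one flattening function between dashboards and queries would keep the six output columns identical across resource types, which the later `all_permissions` DataFrame relies on.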



In [0]:
if ENABLE_VOLUMES:
    cell_start_time = time.time()
    
    log("Fetching Unity Catalog volume permissions...")
    
    try:
        volume_grants = []
        
        if 'uc_catalogs_df' in dir() and uc_catalogs_df.count() > 0:
            catalogs = [row['catalog_name'] for row in uc_catalogs_df.select('catalog_name').collect()]
            
            log(f"  Scanning {len(catalogs)} catalogs for volumes (limit: {MAX_SCHEMAS_PER_CATALOG} schemas per catalog)...")
            
            for catalog_name in catalogs:
                try:
                    schemas = list(wc.schemas.list(catalog_name=catalog_name))
                    schemas_to_check = schemas[:MAX_SCHEMAS_PER_CATALOG]
                    
                    for schema in schemas_to_check:
                        try:
                            volumes = list(wc.volumes.list(catalog_name=catalog_name, schema_name=schema.name))
                            
                            for volume in volumes:
                                try:
                                    grants = wc.grants.get_effective(securable_type='volume', full_name=volume.full_name)
                                    
                                    if grants.privilege_assignments:
                                        for grant in grants.privilege_assignments:
                                            for privilege in grant.privileges:
                                                volume_grants.append({
                                                    'volume_full_name': volume.full_name,
                                                    'volume_name': volume.name,
                                                    'catalog_name': catalog_name,
                                                    'schema_name': schema.name,
                                                    'principal': grant.principal,
                                                    'privilege': privilege.value
                                                })
                                except Exception:
                                    pass
                        except Exception:
                            pass
                except Exception:
                    pass
        
        if volume_grants:
            uc_volume_grants_df = spark.createDataFrame(volume_grants)
            log(f"✓ Found {uc_volume_grants_df.count()} volume grants")
        else:
            uc_volume_grants_df = spark.createDataFrame([], 'volume_full_name STRING, volume_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
            log("  No volume permissions found")
        
    except Exception as e:
        log(f"❌ Error fetching volume permissions: {str(e)}")
        uc_volume_grants_df = spark.createDataFrame([], 'volume_full_name STRING, volume_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get Unity Catalog volume permissions", cell_start_time)
else:
    log("⏭️  Volume permissions collection disabled (ENABLE_VOLUMES=False)")
    uc_volume_grants_df = spark.createDataFrame([], 'volume_full_name STRING, volume_name STRING, catalog_name STRING, schema_name STRING, principal STRING, privilege STRING')
    execution_stats['resources_skipped'] += 1



In [0]:
# Report the approximate in-memory size of the all_permissions list
# so that an oversized collection is flagged before the DataFrame is built

cell_start_time = time.time()

log("\nOptimizing memory usage...")

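# Check current memory usage
# sys.getsizeof is shallow, so add the per-entry dict sizes; strings inside
# each dict are still not counted, making this a lower-bound estimate
import sys
permissions_list_size_mb = (
    sys.getsizeof(all_permissions) + sum(sys.getsizeof(p) for p in all_permissions)
) / (1024 * 1024)
log(f"  Approximate all_permissions list size: {permissions_list_size_mb:.2f} MB ({len(all_permissions)} entries)")

if permissions_list_size_mb > 100:
    log(f"  ⚠️ Large permissions list detected ({permissions_list_size_mb:.0f} MB, threshold 100 MB)")
    log("  Consider reducing MAX_RESOURCES_PER_TYPE or disabling some resource types")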

# No action needed - permissions will be converted to DataFrame in next cell
log("✓ Memory check complete")

log_execution_time("Memory optimization check", cell_start_time)
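
`sys.getsizeof` reports only the size of the list object itself, not the dictionaries and strings it refers to, so a list of a few thousand permission rows measures as a few kilobytes. A closer estimate walks the rows and counts each distinct key and value once. This is a sketch for flat dicts only, and `approx_size_bytes` is a hypothetical helper:

```python
import sys


def approx_size_bytes(rows):
    """Approximate the memory held by a list of flat dicts.

    Counts the list, each dict, and each distinct key/value object once.
    Nested containers inside the values are not traversed.
    """
    total = sys.getsizeof(rows)
    seen = set()
    for row in rows:
        total += sys.getsizeof(row)
        for obj in list(row.keys()) + list(row.values()):
            if id(obj) not in seen:  # interned or shared strings are counted once
                seen.add(id(obj))
                total += sys.getsizeof(obj)
    return total
```

Dividing the result by `1024 * 1024` gives a figure in MB that can be compared against the 100 MB warning threshold.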



In [0]:
cell_start_time = time.time()

log("Creating permissions DataFrame...")

# Create permissions DataFrame
if all_permissions:
    permissions_df = spark.createDataFrame(all_permissions)
else:
    permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')
    log("⚠️  Warning: No permissions collected - permissions_df is empty")

# Data Quality Check: Check for duplicates (count once and reuse the total)
total_permission_count = permissions_df.count()
if total_permission_count > 0:
    duplicate_count = total_permission_count - permissions_df.dropDuplicates().count()
    if duplicate_count > 0:
        log(f"⚠️  Found {duplicate_count} duplicate permission entries (will be kept for analysis)")

# Add user-group associations to permissions
log("\nEnriching permissions with user-group associations...")

# For user permissions, add their group memberships
user_perms_with_groups = permissions_df.filter(F.col('principal_type') == 'user') \
    .join(user_groups_df, permissions_df.principal == user_groups_df.user_name, 'left') \
    .select(
        permissions_df['*'],
        F.col('group_name').alias('user_groups')
    )

# For group permissions, keep as-is but add null for user_groups
group_perms = permissions_df.filter(F.col('principal_type') == 'group') \
    .withColumn('user_groups', F.lit(None).cast('string'))

# Service principal permissions
sp_perms = permissions_df.filter(F.col('principal_type') == 'service_principal') \
    .withColumn('user_groups', F.lit(None).cast('string'))

# Union all permissions with group associations
permissions_with_groups_df = user_perms_with_groups.union(group_perms).union(sp_perms)

# Create a comprehensive user permissions view that includes inherited group permissions
log("Creating comprehensive user permissions view (direct + inherited from groups)...")

# Direct user permissions
direct_user_perms = permissions_df.filter(F.col('principal_type') == 'user') \
    .withColumn('permission_source', F.lit('direct')) \
    .withColumn('source_group', F.lit(None).cast('string'))

# Inherited permissions from groups
inherited_perms = permissions_df.filter(F.col('principal_type') == 'group') \
    .join(user_groups_df, permissions_df.principal == user_groups_df.group_name, 'inner') \
    .select(
        F.col('resource_type'),
        F.col('resource_id'),
        F.col('resource_name'),
        F.col('user_name').alias('principal'),
        F.lit('user').alias('principal_type'),
        F.col('permission_level'),
        F.lit('inherited').alias('permission_source'),
        F.col('group_name').alias('source_group')
    )

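# Combine direct and inherited permissions.
# unionByName matches columns by name; union() matches by position and would
# silently misalign values if the column order of permissions_df differs from
# the order used in the select() above.
user_all_permissions_df = direct_user_perms.unionByName(inherited_perms)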

# Data Quality Check: Identify orphaned permissions
log("\nRunning data quality checks...")

# Check for user permissions where user doesn't exist
orphaned_user_perms = permissions_df.filter(F.col('principal_type') == 'user') \
    .join(users_df, permissions_df.principal == users_df.user_name, 'left_anti')

orphaned_user_count = orphaned_user_perms.count()
if orphaned_user_count > 0:
    log(f"⚠️  Found {orphaned_user_count} permissions for users that no longer exist")

# Check for group permissions where group doesn't exist
orphaned_group_perms = permissions_df.filter(F.col('principal_type') == 'group') \
    .join(groups_df, permissions_df.principal == groups_df.group_name, 'left_anti')

orphaned_group_count = orphaned_group_perms.count()
if orphaned_group_count > 0:
    log(f"⚠️  Found {orphaned_group_count} permissions for groups that no longer exist")

log(f"\n{'='*60}")
log("DATA COLLECTION SUMMARY")
log(f"{'='*60}")
log(f"  users_df: {users_df.count()} rows")
log(f"  groups_df: {groups_df.count()} rows")
log(f"  user_groups_df: {user_groups_df.count()} rows")
log(f"  permissions_df: {permissions_df.count()} rows (original)")
log(f"  permissions_with_groups_df: {permissions_with_groups_df.count()} rows (with group associations)")
log(f"  user_all_permissions_df: {user_all_permissions_df.count()} rows (direct + inherited)")
if orphaned_user_count > 0 or orphaned_group_count > 0:
    log(f"  orphaned_permissions: {orphaned_user_count + orphaned_group_count} rows (cleanup recommended)")
log(f"{'='*60}")
log(f"\n✓ All data collection complete!")
log(f"\nAvailable DataFrames:")
log(f"  - permissions_df: Original permissions")
log(f"  - permissions_with_groups_df: Permissions with user-group associations")
log(f"  - user_all_permissions_df: All user permissions (direct + inherited from groups)")

log_execution_time("Create permissions DataFrame", cell_start_time)
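
The inherited-permissions join above can be pictured without Spark. The following is a minimal plain-Python sketch of the same idea, using hypothetical sample data: each group grant is repeated once per member of that group and tagged with the group it came from. Like the join, it expands a single level of membership, so nested groups would only be covered if the membership data already lists them transitively.

```python
# Hypothetical sample data (not taken from a real workspace).
group_permissions = [
    # (resource_type, resource_id, resource_name, group_name, permission_level)
    ("jobs", "101", "nightly-etl", "data-engineers", "CAN_MANAGE"),
    ("warehouses", "w1", "bi-warehouse", "analysts", "CAN_USE"),
]
user_groups = [
    # (user_name, group_name)
    ("alice@example.com", "data-engineers"),
    ("bob@example.com", "data-engineers"),
    ("bob@example.com", "analysts"),
]


def expand_inherited(group_permissions, user_groups):
    """Inner-join group grants to memberships: one row per (user, grant)."""
    members_by_group = {}
    for user_name, group_name in user_groups:
        members_by_group.setdefault(group_name, []).append(user_name)

    rows = []
    for rtype, rid, rname, group_name, level in group_permissions:
        # Groups without members contribute no rows, as with an inner join.
        for user_name in members_by_group.get(group_name, []):
            rows.append({
                "resource_type": rtype,
                "resource_id": rid,
                "resource_name": rname,
                "principal": user_name,
                "principal_type": "user",
                "permission_level": level,
                "permission_source": "inherited",
                "source_group": group_name,
            })
    return rows


inherited = expand_inherited(group_permissions, user_groups)
for row in inherited:
    print(row["principal"], row["resource_name"], row["source_group"])
```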



In [0]:
# This cell executes AFTER permissions_df is created
cell_start_time = time.time()

log("\n" + "="*60)
log("INACTIVE USER PERMISSIONS ANALYSIS")
log("="*60)

# Identify inactive users with active permissions
if validate_dataframe_exists('users_df', users_df) and validate_dataframe_exists('permissions_df', permissions_df):
    
    inactive_users = users_df.filter(F.col('active') == False)
    
    log(f"\nInactive users in workspace: {inactive_users.count()}")
    
    if inactive_users.count() > 0:
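        # Find permissions for inactive users. A left_semi join keeps only the
        # permissions_df columns, so the result matches permissions_df.schema
        # (the schema used for the empty fallbacks below).
        inactive_user_permissions = permissions_df.filter(
            F.col('principal_type') == 'user'
        ).join(
            inactive_users.select('user_name'),
            permissions_df.principal == inactive_users.user_name,
            'left_semi'
        )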
        
        inactive_perm_count = inactive_user_permissions.count()
        
        # Only report a compliance issue when there is something to report
        if inactive_perm_count > 0:
            log(f"\n⚠️  COMPLIANCE ISSUE: Inactive users with permissions")
            log(f"  Found {inactive_perm_count} permission entries for inactive users")
            
            # Summary by user
            inactive_summary = inactive_user_permissions.groupBy('principal') \
                .agg(
                    F.count('*').alias('permission_count'),
                    F.countDistinct('resource_type').alias('resource_types'),
                    F.collect_set('resource_type').alias('resource_type_list')
                ) \
                .orderBy(F.desc('permission_count'))
            
            log(f"\n  Inactive users with permissions: {inactive_summary.count()}")
            
            if not is_job_mode:
                display(inactive_summary)
            
            # Detailed breakdown by resource type
            inactive_by_resource = inactive_user_permissions.groupBy('resource_type') \
                .agg(
                    F.countDistinct('principal').alias('inactive_users'),
                    F.count('*').alias('permission_entries')
                ) \
                .orderBy(F.desc('permission_entries'))
            
            log(f"\n  Permissions by resource type:")
            if not is_job_mode:
                display(inactive_by_resource)
            
            log(f"\n  ⚠️  RECOMMENDATION: Remove permissions for inactive users")
            log(f"     Inactive users should not retain access to workspace resources")
            log(f"     This is a common compliance requirement (SOX, GDPR, etc.)")
            
            # Store for export
            inactive_user_permissions_export_df = inactive_user_permissions
        else:
            log("  ✓ No permissions found for inactive users")
            inactive_user_permissions_export_df = spark.createDataFrame([], permissions_df.schema)
    else:
        log("\n✓ No inactive users in workspace")
        inactive_user_permissions_export_df = spark.createDataFrame([], permissions_df.schema)
else:
    log("⚠️  Cannot analyze inactive users - missing required data")
    inactive_user_permissions_export_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')

log_execution_time("Inactive user permissions analysis", cell_start_time)

In [0]:
# Print execution summary with statistics and data quality report

print_execution_summary()

# Data Quality Report
log(f"\n{'='*60}")
log("DATA QUALITY REPORT")
log(f"{'='*60}")

# Check for empty DataFrames
log("\nDataFrame Validation:")
validate_dataframe_exists("users_df", users_df)
validate_dataframe_exists("groups_df", groups_df)
validate_dataframe_exists("user_groups_df", user_groups_df)
validate_dataframe_exists("permissions_df", permissions_df)

# Check for suspicious patterns
if permissions_df.count() > 0:
    log("\nPermission Distribution:")
    
    # Count by resource type
    resource_type_counts = permissions_df.groupBy('resource_type').count().orderBy(F.desc('count'))
    log("  By resource type:")
    for row in resource_type_counts.collect():
        log(f"    - {row['resource_type']}: {row['count']} permissions")
    
    # Count by principal type
    principal_type_counts = permissions_df.groupBy('principal_type').count().orderBy(F.desc('count'))
    log("\n  By principal type:")
    for row in principal_type_counts.collect():
        log(f"    - {row['principal_type']}: {row['count']} permissions")
    
    # Identify users with excessive permissions (potential security risk)
    if user_all_permissions_df.count() > 0:
        user_perm_counts = user_all_permissions_df.groupBy('principal').count().orderBy(F.desc('count'))
        top_user = user_perm_counts.first()
        if top_user and top_user['count'] > 100:
            log(f"\n⚠️  Security Alert: User '{top_user['principal']}' has {top_user['count']} permissions (review recommended)")

log(f"\n{'='*60}")
log("✓ Data quality checks complete")
log(f"{'='*60}")



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("ADMIN IDENTIFICATION")
log("="*60)

# Identify workspace admins (members of 'admins' group)
admin_group_name = 'admins'

workspace_admins = user_groups_df.filter(F.col('group_name') == admin_group_name) \
    .join(users_df, 'user_name', 'inner') \
    .select(
        'user_name',
        'display_name',
        'active'
    )

admin_count = workspace_admins.count()
log(f"\nWorkspace Admins: {admin_count}")

if admin_count > 0:
    log("\nAdmin Users:")
    for row in workspace_admins.collect():
        status = '✓ Active' if row['active'] else '⚠️ Inactive'
        log(f"  - {row['display_name']} ({row['user_name']}) - {status}")
    
    # Check for inactive admins
    inactive_admins = workspace_admins.filter(F.col('active') == False).count()
    if inactive_admins > 0:
        log(f"\n⚠️ Security Alert: {inactive_admins} inactive users still have admin permissions")
else:
    log("  No admin users found (or 'admins' group doesn't exist)")

# Identify users with IS_OWNER permissions (high privilege)
if permissions_df.count() > 0:
    owners = permissions_df.filter(F.col('permission_level') == 'IS_OWNER') \
        .select('principal', 'resource_type', 'resource_name') \
        .distinct()
    
    owner_count = owners.select('principal').distinct().count()
    log(f"\nResource Owners: {owner_count} principals with IS_OWNER permissions")
    
    # Count by resource type
    owner_by_type = owners.groupBy('resource_type').count().orderBy(F.desc('count'))
    log("\nOwnership by resource type:")
    for row in owner_by_type.collect():
        log(f"  - {row['resource_type']}: {row['count']} resources")

log("="*60)
log_execution_time("Identify workspace admins and resource owners", cell_start_time)



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("COMPLIANCE & SECURITY REPORTING")
log("="*60)

# 1. Inactive users with permissions
log("\n1. Inactive Users with Active Permissions:")

inactive_users_with_perms = users_df.filter(F.col('active') == False) \
    .join(permissions_df.filter(F.col('principal_type') == 'user'), 
          users_df.user_name == permissions_df.principal, 'inner') \
    .select(
        users_df.user_name,
        users_df.display_name,
        F.col('resource_type'),
        F.col('resource_name'),
        F.col('permission_level')
    )

inactive_count = inactive_users_with_perms.select('user_name').distinct().count()
if inactive_count > 0:
    log(f"   ⚠️ ALERT: {inactive_count} inactive users still have permissions")
    log(f"   Total permission entries: {inactive_users_with_perms.count()}")
    
    # Show top inactive users by permission count
    top_inactive = inactive_users_with_perms.groupBy('user_name', 'display_name').count() \
        .orderBy(F.desc('count')).limit(10)
    
    log("   Top inactive users by permission count:")
    for row in top_inactive.collect():
        log(f"     - {row['display_name']} ({row['user_name']}): {row['count']} permissions")
else:
    log("   ✓ No inactive users with permissions found")

# 2. External users (non-company domain)
log("\n2. External User Access:")

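# Identify external users (customize domain list as needed)
import re
company_domains = ['@bat.com', '@example.com']  # Add your company domains

# Anchor the pattern at the end of the address so that look-alike domains
# such as 'user@bat.com.evil.org' are not treated as internal, and compare
# case-insensitively.
domain_pattern = '(' + '|'.join(re.escape(domain.lower()) for domain in company_domains) + ')$'

external_users = users_df.filter(
    ~F.lower(F.col('user_name')).rlike(domain_pattern)
)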

external_count = external_users.count()
if external_count > 0:
    log(f"   ⚠️ Found {external_count} external users (non-company domain)")
    
    # Check if external users have permissions
    external_with_perms = external_users.join(
        permissions_df.filter(F.col('principal_type') == 'user'),
        external_users.user_name == permissions_df.principal,
        'inner'
    ).select('user_name', 'display_name').distinct()
    
    external_with_perms_count = external_with_perms.count()
    if external_with_perms_count > 0:
        log(f"   ⚠️ ALERT: {external_with_perms_count} external users have permissions")
else:
    log("   ✓ No external users found")

# 3. Users with excessive permissions
log("\n3. Users with Excessive Permissions:")

if user_all_permissions_df.count() > 0:
    excessive_threshold = 100  # Customize threshold
    
    excessive_perms = user_all_permissions_df.groupBy('principal').count() \
        .filter(F.col('count') > excessive_threshold) \
        .orderBy(F.desc('count'))
    
    excessive_count = excessive_perms.count()
    if excessive_count > 0:
        log(f"   ⚠️ ALERT: {excessive_count} users have more than {excessive_threshold} permissions")
        log("   Top users:")
        for row in excessive_perms.limit(10).collect():
            log(f"     - {row['principal']}: {row['count']} permissions")
    else:
        log(f"   ✓ No users with excessive permissions (>{excessive_threshold})")

# 4. Service principals with high privileges
log("\n4. Service Principal Access:")

if 'service_principals_df' in dir() and service_principals_df.count() > 0:
    sp_count = service_principals_df.count()
    active_sp_count = service_principals_df.filter(F.col('active') == True).count()
    
    log(f"   Total service principals: {sp_count}")
    log(f"   Active service principals: {active_sp_count}")
    
    # Check for SPs with admin group membership
    sp_with_admin = service_principals_df.filter(
        F.array_contains(F.col('groups'), 'admins')
    )
    
    sp_admin_count = sp_with_admin.count()
    if sp_admin_count > 0:
        log(f"   ⚠️ ALERT: {sp_admin_count} service principals have admin group membership")
else:
    log("   No service principal data available")

# 5. Orphaned permissions summary
log("\n5. Orphaned Permissions (Cleanup Recommended):")

if 'orphaned_user_count' in dir() and 'orphaned_group_count' in dir():
    total_orphaned = orphaned_user_count + orphaned_group_count
    if total_orphaned > 0:
        log(f"   ⚠️ Found {total_orphaned} orphaned permissions")
        log(f"     - User permissions: {orphaned_user_count}")
        log(f"     - Group permissions: {orphaned_group_count}")
        log("   Recommendation: Remove permissions for deleted users/groups")
    else:
        log("   ✓ No orphaned permissions found")

log("\n" + "="*60)
log("✓ Compliance reporting complete")
log("="*60)

log_execution_time("Compliance reporting", cell_start_time)
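
External-user detection depends on how the domain pattern is matched. An unanchored pattern treats any address that merely contains the company domain as internal. The sketch below, in plain Python with hypothetical helper names and placeholder domains, shows an end-anchored, case-insensitive check.

```python
import re


def build_internal_pattern(domains):
    """Compile a regex matching addresses that end in one of the domains."""
    alternatives = "|".join(re.escape(d.lower()) for d in domains)
    return re.compile(rf"(?:{alternatives})$")


def is_external(user_name, pattern):
    """True when the address does not end in an internal domain."""
    return pattern.search(user_name.lower()) is None


# Placeholder domains; each entry includes the leading '@'.
internal = build_internal_pattern(["@example.com", "@example.org"])

for address in [
    "alice@example.com",
    "Bob@EXAMPLE.ORG",
    "mallory@example.com.evil.test",
    "eve@notexample.com",
]:
    print(address, "external" if is_external(address, internal) else "internal")
```

Keeping the leading `@` in each domain entry matters: without it, `eve@notexample.com` would match `example.com` at the end of the string.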



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("INCREMENTAL CHANGE DETECTION")
log("="*60)

# Only run if Delta export is enabled and history table exists
if ENABLE_DELTA_EXPORT:
    try:
        # Check if history table exists
        table_exists = spark.catalog.tableExists(DELTA_TABLE_NAME)
        
        if table_exists:
            log(f"\nComparing with previous audit run from {DELTA_TABLE_NAME}...")
            
            # Get the most recent previous run (not including current run)
            previous_run = spark.sql(f"""
                SELECT MAX(audit_run_timestamp) as last_run
                FROM {DELTA_TABLE_NAME}
                WHERE audit_run_timestamp < current_timestamp() - INTERVAL 1 MINUTE
            """).first()
            
            if previous_run and previous_run['last_run']:
                last_run_time = previous_run['last_run']
                log(f"Previous run found: {last_run_time}")
                
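                # Get previous permissions. Filter with a typed literal instead of
                # interpolating the timestamp into the SQL text.
                previous_perms = spark.table(DELTA_TABLE_NAME) \
                    .filter(F.col('audit_run_timestamp') == F.lit(last_run_time)) \
                    .select('resource_type', 'resource_id', 'resource_name', 'principal', 'permission_level')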
                
                # Current permissions
                current_perms = permissions_df.select(
                    'resource_type', 'resource_id', 'resource_name', 'principal', 'permission_level'
                )
                
                # Find new permissions (in current but not in previous)
                new_perms = current_perms.subtract(previous_perms)
                new_count = new_perms.count()
                
                # Find removed permissions (in previous but not in current)
                removed_perms = previous_perms.subtract(current_perms)
                removed_count = removed_perms.count()
                
                log(f"\nChange Summary:")
                log(f"  ➕ New permissions: {new_count}")
                log(f"  ➖ Removed permissions: {removed_count}")
                log(f"  🔄 Total changes: {new_count + removed_count}")
                
                if new_count > 0:
                    log(f"\n  Top new permissions by principal:")
                    new_by_principal = new_perms.groupBy('principal').count() \
                        .orderBy(F.desc('count')).limit(10)
                    for row in new_by_principal.collect():
                        log(f"    - {row['principal']}: {row['count']} new permissions")
                
                if removed_count > 0:
                    log(f"\n  Top removed permissions by principal:")
                    removed_by_principal = removed_perms.groupBy('principal').count() \
                        .orderBy(F.desc('count')).limit(10)
                    for row in removed_by_principal.collect():
                        log(f"    - {row['principal']}: {row['count']} removed permissions")
                
                # Store change DataFrames for export
                new_permissions_df = new_perms
                removed_permissions_df = removed_perms
                
            else:
                log("\n  No previous audit run found for comparison")
                log("  This appears to be the first run")
                new_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
                removed_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
        else:
            log(f"\n  History table {DELTA_TABLE_NAME} does not exist yet")
            log("  Change detection needs at least one earlier Delta export to compare against")
            new_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
            removed_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
    
    except Exception as e:
        log(f"\n❌ Error during change detection: {str(e)}")
        new_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
        removed_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
else:
    log("\n  Change detection disabled (ENABLE_DELTA_EXPORT=False)")
    log("  Enable Delta export to track changes over time")
    new_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')
    removed_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, permission_level STRING')

log("="*60)
log_execution_time("Incremental change detection", cell_start_time)
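
The change detection above is a two-way set difference over permission rows. A minimal plain-Python sketch with hypothetical data follows. One consequence worth knowing when reading the summary: a changed permission level appears as one removed row plus one new row, not as a single modification.

```python
# Each permission is a tuple of
# (resource_type, resource_id, resource_name, principal, permission_level).
# The rows below are hypothetical.
previous_run = {
    ("jobs", "101", "nightly-etl", "alice@example.com", "CAN_VIEW"),
    ("jobs", "102", "hourly-sync", "bob@example.com", "CAN_MANAGE"),
}
current_run = {
    ("jobs", "101", "nightly-etl", "alice@example.com", "CAN_MANAGE"),  # level changed
    ("jobs", "102", "hourly-sync", "bob@example.com", "CAN_MANAGE"),    # unchanged
    ("pipelines", "p1", "ingest", "carol@example.com", "CAN_RUN"),      # newly granted
}


def diff_permissions(previous, current):
    """Return (new, removed) permission rows between two audit runs."""
    return current - previous, previous - current


new_rows, removed_rows = diff_permissions(previous_run, current_run)
print(f"new: {len(new_rows)}, removed: {len(removed_rows)}")
```

Because `resource_name` is part of the compared tuple, renaming a resource would also show up as a removal plus an addition for every grant on it.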



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("PERMISSION RECOMMENDATIONS & RISK ANALYSIS")
log("="*60)

# 1. Over-privileged users (users with CAN_MANAGE on many resources)
log("\n1. Over-Privileged Users:")

if permissions_df.count() > 0:
    manage_perms = permissions_df.filter(
        F.col('permission_level').isin(['CAN_MANAGE', 'IS_OWNER'])
    )
    
    over_privileged = manage_perms.groupBy('principal', 'principal_type') \
        .agg(
            F.count('*').alias('manage_permission_count'),
            F.countDistinct('resource_type').alias('resource_types'),
            F.collect_set('resource_type').alias('resource_type_list')
        ) \
        .filter(F.col('manage_permission_count') > 20) \
        .orderBy(F.desc('manage_permission_count'))
    
    over_priv_count = over_privileged.count()
    if over_priv_count > 0:
        log(f"   ⚠️ Found {over_priv_count} principals with >20 CAN_MANAGE/IS_OWNER permissions")
        log("   Top over-privileged principals:")
        for row in over_privileged.limit(10).collect():
            log(f"     - {row['principal']} ({row['principal_type']}): {row['manage_permission_count']} manage permissions across {row['resource_types']} resource types")
        log("   Recommendation: Review if all manage permissions are necessary")
    else:
        log("   ✓ No over-privileged users detected")

# 2. Segregation of duties - users with conflicting permissions
log("\n2. Segregation of Duties Analysis:")

# Name-based heuristic: flag users who hold permissions on resources whose names
# carry both 'prod' and 'dev' markers (customize based on your naming conventions)
if user_all_permissions_df.count() > 0:
    # Case-insensitive match. 'prod' also covers 'production' and 'dev' covers
    # 'development'; substring matching can over-match (e.g. 'device').
    resource_name_lc = F.lower(F.col('resource_name'))
    prod_and_dev = user_all_permissions_df.filter(resource_name_lc.contains('prod')) \
        .select('principal').distinct() \
        .intersect(
            user_all_permissions_df.filter(resource_name_lc.contains('dev'))
            .select('principal').distinct()
        )
    
    sod_count = prod_and_dev.count()
    if sod_count > 0:
        log(f"   ⚠️ Found {sod_count} users with both production and development access")
        log("   Recommendation: Review segregation of duties policies")
    else:
        log("   ✓ No obvious segregation of duties violations detected")

# 3. Group membership recommendations
log("\n3. Group Membership Optimization:")

# Find users with many direct permissions (should use groups instead)
if user_all_permissions_df.count() > 0:
    direct_perm_heavy = user_all_permissions_df.filter(F.col('permission_source') == 'direct') \
        .groupBy('principal').count() \
        .filter(F.col('count') > 30) \
        .orderBy(F.desc('count'))
    
    direct_heavy_count = direct_perm_heavy.count()
    if direct_heavy_count > 0:
        log(f"   ⚠️ Found {direct_heavy_count} users with >30 direct permissions")
        log("   Recommendation: Consider using groups for permission management")
        log("   Top users with direct permissions:")
        for row in direct_perm_heavy.limit(5).collect():
            log(f"     - {row['principal']}: {row['count']} direct permissions")
    else:
        log("   ✓ Good group usage - no users with excessive direct permissions")

# 4. Unused groups (groups with no permissions)
log("\n4. Unused Groups:")

groups_with_perms = permissions_df.filter(F.col('principal_type') == 'group') \
    .select('principal').distinct()

unused_groups = groups_df.join(
    groups_with_perms,
    groups_df.group_name == groups_with_perms.principal,
    'left_anti'
)

unused_count = unused_groups.count()
if unused_count > 0:
    log(f"   ⚠️ Found {unused_count} groups with no permissions assigned")
    log("   Recommendation: Review if these groups are still needed")
else:
    log("   ✓ All groups have permissions assigned")

# 5. Token security analysis
log("\n5. Token Security:")

if 'tokens_df' in dir() and tokens_df.count() > 0:
    total_tokens = tokens_df.count()
    no_expiry_tokens = tokens_df.filter(F.col('expiry_time').isNull()).count()
    
    log(f"   Total active tokens: {total_tokens}")
    if no_expiry_tokens > 0:
        log(f"   ⚠️ ALERT: {no_expiry_tokens} tokens have no expiration date")
        log("   Recommendation: Set expiration dates for all tokens")
    else:
        log("   ✓ All tokens have expiration dates")
else:
    log("   No token data available")

log("\n" + "="*60)
log("✓ Risk analysis complete")
log("="*60)

log_execution_time("Permission recommendations and risk analysis", cell_start_time)
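
The segregation-of-duties check classifies resources by substrings in their names, which can produce false positives such as `device-inventory` counting as development. One possible refinement, sketched here in plain Python, is to split names into tokens first. The token sets and the `classify_environment` helper are assumptions for illustration and would need to follow your own naming conventions.

```python
import re

# Assumed environment markers; adjust to local naming conventions.
PROD_TOKENS = {"prod", "production"}
DEV_TOKENS = {"dev", "development"}


def name_tokens(resource_name):
    """Split a resource name into lower-case alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", resource_name.lower()) if t}


def classify_environment(resource_name):
    """Return 'prod', 'dev', or None; 'prod' wins if both markers appear."""
    tokens = name_tokens(resource_name)
    if tokens & PROD_TOKENS:
        return "prod"
    if tokens & DEV_TOKENS:
        return "dev"
    return None


for name in ["sales_prod_etl", "Dev-Sandbox", "device-inventory", "product-catalog"]:
    print(name, "->", classify_environment(name))
```

Token matching is stricter than substring matching, so names like `proddb` without a separator would no longer be classified; whether that trade-off is acceptable depends on how resources are actually named.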



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("PERMISSION CONCENTRATION ANALYSIS")
log("="*60)

if validate_dataframe_exists('permissions_df', permissions_df):
    
    # Analyze users with high permission counts
    log("\n1. Users with Most Permissions:")
    
    user_permission_counts = permissions_df.filter(
        F.col('principal_type') == 'user'
    ).groupBy('principal') \
        .agg(
            F.count('*').alias('total_permissions'),
            F.countDistinct('resource_type').alias('resource_types'),
            F.countDistinct('resource_id').alias('unique_resources'),
            F.sum(F.when(F.col('permission_level').isin(['IS_OWNER', 'CAN_MANAGE']), 1).otherwise(0)).alias('admin_permissions')
        ) \
        .orderBy(F.desc('total_permissions'))
    
    top_users = user_permission_counts.limit(10)
    
    log(f"  Top 10 users by permission count:")
    if not is_job_mode:
        display(top_users)
    
    # Identify users with excessive admin permissions
    excessive_admin_threshold = 20  # Users with 20+ admin permissions
    
    excessive_admins = user_permission_counts.filter(
        F.col('admin_permissions') >= excessive_admin_threshold
    )
    
    log(f"\n2. Users with Excessive Admin Permissions (>={excessive_admin_threshold}):")
    log(f"  Found {excessive_admins.count()} users")
    
    if excessive_admins.count() > 0:
        log(f"\n  ⚠️  SECURITY CONCERN: Users with excessive admin permissions")
        if not is_job_mode:
            display(excessive_admins.orderBy(F.desc('admin_permissions')))
        log(f"\n  ⚠️  RECOMMENDATION: Review and reduce admin permissions")
        log(f"     Users should follow principle of least privilege")
    else:
        log("  ✓ No users with excessive admin permissions")
    
    # Analyze permission distribution
    log(f"\n3. Permission Distribution Statistics:")
    
    perm_stats = user_permission_counts.agg(
        F.avg('total_permissions').alias('avg_permissions'),
        F.max('total_permissions').alias('max_permissions'),
        F.min('total_permissions').alias('min_permissions'),
        F.percentile_approx('total_permissions', 0.5).alias('median_permissions'),
        F.percentile_approx('total_permissions', 0.9).alias('p90_permissions')
    ).collect()[0]
    
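    # The aggregates are None when there are no user-level permissions,
    # and formatting None with ':.1f' would raise a TypeError
    if perm_stats.avg_permissions is not None:
        log(f"  Average permissions per user: {perm_stats.avg_permissions:.1f}")
        log(f"  Median permissions per user: {perm_stats.median_permissions}")
        log(f"  90th percentile: {perm_stats.p90_permissions}")
        log(f"  Max permissions (single user): {perm_stats.max_permissions}")
    else:
        log("  No user-level permissions to summarize")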
    
    # Store for export
    permission_concentration_df = user_permission_counts
    excessive_admin_permissions_df = excessive_admins
    
    log_execution_time("Permission concentration analysis", cell_start_time)
else:
    log("⚠️  Cannot analyze permission concentration - missing permissions data")
    permission_concentration_df = spark.createDataFrame([], 'principal STRING, total_permissions LONG, resource_types LONG, unique_resources LONG, admin_permissions LONG')
    excessive_admin_permissions_df = spark.createDataFrame([], 'principal STRING, total_permissions LONG, resource_types LONG, unique_resources LONG, admin_permissions LONG')
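
The distribution statistics are easier to interpret with a small worked example. The sketch below uses plain Python and a nearest-rank percentile, which is an assumption chosen for simplicity and not necessarily the exact algorithm Spark applies. It shows how one heavily privileged user pulls the average well above the median, and it returns `None` for empty input instead of failing.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, fraction):
    """Nearest-rank percentile: always returns a value present in the data."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def summarize(counts):
    """Summary statistics for per-user permission counts, or None if empty."""
    if not counts:
        return None
    return {
        "avg": mean(counts),
        "median": nearest_rank_percentile(counts, 0.5),
        "p90": nearest_rank_percentile(counts, 0.9),
        "max": max(counts),
    }


# Hypothetical per-user permission counts with one outlier.
permission_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = summarize(permission_counts)
print(summary)
```

In this sample the average is 9.5 while the median is 5, which is why the median and 90th percentile are usually the better baseline for choosing a threshold.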



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("EXTERNAL USER DETECTION")
log("="*60)

if validate_dataframe_exists('users_df', users_df):
    
    # Define company domain (customize this for your organization)
    COMPANY_DOMAIN = 'bat.com'  # Change to your company domain
    
    log(f"\nChecking for users outside company domain: @{COMPANY_DOMAIN}")
    
    # Identify external users (non-company domain); compare case-insensitively
    external_users = users_df.filter(
        ~F.lower(F.col('user_name')).endswith(f'@{COMPANY_DOMAIN.lower()}')
    )
    
    external_count = external_users.count()
    active_external = external_users.filter(F.col('active') == True).count()
    
    log(f"\nExternal users found: {external_count}")
    log(f"  Active external users: {active_external}")
    log(f"  Inactive external users: {external_count - active_external}")
    
    if external_count > 0:
        log(f"\n⚠️  SECURITY REVIEW: External users detected")
        
        # Show external user details
        if not is_job_mode:
            display(external_users.select('user_name', 'display_name', 'active'))
        
        # Check permissions for external users
        if validate_dataframe_exists('permissions_df', permissions_df):
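            # left_semi keeps only the permissions_df columns, so the result
            # matches the schema of the empty fallbacks used below
            external_user_permissions = permissions_df.filter(
                F.col('principal_type') == 'user'
            ).join(
                external_users.select('user_name'),
                permissions_df.principal == external_users.user_name,
                'left_semi'
            )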
            
            external_perm_count = external_user_permissions.count()
            
            log(f"\n  External user permissions: {external_perm_count} entries")
            
            if external_perm_count > 0:
                # Summary by external user
                external_perm_summary = external_user_permissions.groupBy('principal') \
                    .agg(
                        F.count('*').alias('permission_count'),
                        F.countDistinct('resource_type').alias('resource_types'),
                        F.collect_set('resource_type').alias('resource_type_list')
                    ) \
                    .orderBy(F.desc('permission_count'))
                
                if not is_job_mode:
                    log("\n  External user permission summary:")
                    display(external_perm_summary)
                
                log(f"\n  ⚠️  RECOMMENDATION: Review external user access")
                log(f"     Ensure external users have appropriate business justification")
                log(f"     Consider using service principals for external integrations")
                log(f"     Verify external users comply with data sharing agreements")
            
            # Store for export
            external_user_permissions_df = external_user_permissions
        else:
            external_user_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')
    else:
        log("\n✓ No external users detected")
        log(f"   All users belong to @{COMPANY_DOMAIN} domain")
        external_user_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')
    
    # Store for export
    external_users_df = external_users
    
    log_execution_time("External user detection", cell_start_time)
else:
    log("⚠️  Cannot detect external users - missing user data")
    external_users_df = spark.createDataFrame([], 'user_name STRING, display_name STRING, active BOOLEAN')
    external_user_permissions_df = spark.createDataFrame([], 'resource_type STRING, resource_id STRING, resource_name STRING, principal STRING, principal_type STRING, permission_level STRING')



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("CROSS-RESOURCE PERMISSION ANALYSIS")
log("="*60)

if validate_dataframe_exists('permissions_df', permissions_df):
    
    # Identify users with permissions across multiple resource types
    log("\n1. Users with Broad Access Across Resource Types:")
    
    cross_resource_users = permissions_df.filter(
        F.col('principal_type') == 'user'
    ).groupBy('principal') \
        .agg(
            F.countDistinct('resource_type').alias('resource_type_count'),
            F.collect_set('resource_type').alias('resource_types'),
            F.count('*').alias('total_permissions')
        ) \
        .filter(F.col('resource_type_count') >= 5) \
        .orderBy(F.desc('resource_type_count'))
    
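    # Count once: every count() call re-runs the aggregation
    cross_resource_user_count = cross_resource_users.count()
    log(f"  Users with access to 5+ resource types: {cross_resource_user_count}")
    
    if cross_resource_user_count > 0 and not is_job_mode:
        display(cross_resource_users.limit(20))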
    
    # Identify potential segregation of duties violations
    log("\n2. Potential Segregation of Duties (SOD) Violations:")
    
    # Users with both development (notebooks/repos) and production (jobs/pipelines) access
    dev_resources = ['workspace_objects', 'repos']
    prod_resources = ['jobs', 'pipelines']
    
    dev_users = permissions_df.filter(
        (F.col('principal_type') == 'user') &
        (F.col('resource_type').isin(dev_resources))
    ).select('principal').distinct()
    
    prod_users = permissions_df.filter(
        (F.col('principal_type') == 'user') &
        (F.col('resource_type').isin(prod_resources))
    ).select('principal').distinct()
    
    sod_violations = dev_users.join(prod_users, 'principal', 'inner')
    
    # Count once: every count() call re-runs the join
    sod_violation_count = sod_violations.count()
    log(f"  Users with both dev and prod access: {sod_violation_count}")
    
    if sod_violation_count > 0:
        log(f"\n  ⚠️  COMPLIANCE CONCERN: Potential SOD violations")
        
        # Get detailed permissions for SOD violators
        sod_details = permissions_df.filter(
            (F.col('principal_type') == 'user') &
            (F.col('resource_type').isin(dev_resources + prod_resources))
        ).join(sod_violations, 'principal', 'inner') \
            .groupBy('principal') \
            .agg(
                F.collect_set(F.when(F.col('resource_type').isin(dev_resources), F.col('resource_type'))).alias('dev_access'),
                F.collect_set(F.when(F.col('resource_type').isin(prod_resources), F.col('resource_type'))).alias('prod_access'),
                F.count('*').alias('total_permissions')
            ) \
            .orderBy(F.desc('total_permissions'))
        
        if not is_job_mode:
            display(sod_details.limit(20))
        
        log(f"\n  ⚠️  RECOMMENDATION: Review segregation of duties")
        log(f"     Consider separating development and production access")
        log(f"     Use separate accounts or groups for dev vs prod environments")
    else:
        log("  ✓ No obvious SOD violations detected")
        sod_details = spark.createDataFrame([], 'principal STRING, dev_access ARRAY<STRING>, prod_access ARRAY<STRING>, total_permissions LONG')
    
    # Store for export
    cross_resource_permissions_df = cross_resource_users
    sod_violations_df = sod_details
    
    log_execution_time("Cross-resource permission analysis", cell_start_time)
else:
    log("⚠️  Cannot analyze cross-resource permissions - missing permissions data")
    cross_resource_permissions_df = spark.createDataFrame([], 'principal STRING, resource_type_count LONG, resource_types ARRAY<STRING>, total_permissions LONG')
    sod_violations_df = spark.createDataFrame([], 'principal STRING, dev_access ARRAY<STRING>, prod_access ARRAY<STRING>, total_permissions LONG')
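

The segregation-of-duties check above reduces to a set intersection per user. The following is a minimal pure-Python sketch of that logic, assuming rows shaped like `permissions_df`; the sample principals and the helper name `find_sod_candidates` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (principal, principal_type, resource_type)
SAMPLE_PERMISSIONS = [
    ("alice@example.com", "user", "repos"),
    ("alice@example.com", "user", "jobs"),
    ("bob@example.com", "user", "workspace_objects"),
    ("carol@example.com", "user", "pipelines"),
    ("data-eng", "group", "repos"),
    ("data-eng", "group", "jobs"),
]

DEV_RESOURCES = {"workspace_objects", "repos"}
PROD_RESOURCES = {"jobs", "pipelines"}


def find_sod_candidates(rows):
    """Return users that hold at least one dev and one prod resource type."""
    types_by_user = defaultdict(set)
    for principal, principal_type, resource_type in rows:
        if principal_type == "user":  # groups are ignored, as in the cell above
            types_by_user[principal].add(resource_type)
    return {
        user: {
            "dev_access": sorted(types & DEV_RESOURCES),
            "prod_access": sorted(types & PROD_RESOURCES),
        }
        for user, types in types_by_user.items()
        if types & DEV_RESOURCES and types & PROD_RESOURCES
    }


sod_candidates = find_sod_candidates(SAMPLE_PERMISSIONS)
print(sod_candidates)
```

Like the Spark version, this only considers direct user grants; a user who reaches both dev and prod resources through group membership would not be flagged by either.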



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("SERVICE PRINCIPAL TOKEN AUDIT")
log("="*60)

if ENABLE_TOKEN_AUDIT and 'tokens_df' in dir() and tokens_df.count() > 0:
    
    # Identify service principal tokens
    sp_tokens = tokens_df.filter(
        F.col('created_by_username').isNull() | 
        F.col('created_by_username').contains('ServicePrincipal')
    )
    
    log(f"\nService principal tokens: {sp_tokens.count()}")
    
    if sp_tokens.count() > 0:
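        # Check for tokens without expiration. The Token API may report "no expiry"
        # as -1, so treat any non-positive value (or NULL) as never expiring.
        sp_tokens_no_expiry = sp_tokens.filter(
            (F.col('expiry_time').isNull()) | (F.col('expiry_time') <= 0)
        )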
        
        log(f"  Service principal tokens without expiry: {sp_tokens_no_expiry.count()}")
        
        if sp_tokens_no_expiry.count() > 0:
            log(f"\n  ⚠️  CRITICAL: Service principal tokens without expiration")
            if not is_job_mode:
                display(sp_tokens_no_expiry.select('token_id', 'comment', 'creation_time'))
            log(f"\n  ⚠️  RECOMMENDATION: Set expiration dates for all service principal tokens")
            log(f"     Service principals should use short-lived tokens or OAuth")
        else:
            log("  ✓ All service principal tokens have expiration dates")
        
        # Token age for service principals (creation_time is in epoch milliseconds)
        sp_tokens_with_age = sp_tokens.withColumn(
            'age_days',
            (F.unix_timestamp(F.current_timestamp()) - (F.col('creation_time') / 1000)) / (24 * 60 * 60)
        )
        
        old_sp_tokens = sp_tokens_with_age.filter(F.col('age_days') > 180)
        
        log(f"\n  Service principal tokens older than 6 months: {old_sp_tokens.count()}")
        
        if old_sp_tokens.count() > 0:
            log(f"  ⚠️  RECOMMENDATION: Rotate old service principal tokens")
            log(f"     Tokens should be rotated regularly (every 90-180 days)")
        
        # Store for export
        sp_tokens_df = sp_tokens
        sp_tokens_no_expiry_df = sp_tokens_no_expiry
    else:
        log("\nℹ️  No service principal tokens found")
        sp_tokens_df = spark.createDataFrame([], tokens_df.schema)
        sp_tokens_no_expiry_df = spark.createDataFrame([], tokens_df.schema)
    
    log_execution_time("Service principal token audit", cell_start_time)
else:
    log("⏭️  Service principal token audit skipped (no token data available)")
    sp_tokens_df = spark.createDataFrame([], 'token_id STRING, created_by_username STRING, created_by_id STRING, creation_time LONG, expiry_time LONG, comment STRING')
    sp_tokens_no_expiry_df = spark.createDataFrame([], 'token_id STRING, created_by_username STRING, created_by_id STRING, creation_time LONG, expiry_time LONG, comment STRING')
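

The token checks above come down to two per-record tests: does the token ever expire, and how old is it. The sketch below shows that arithmetic in plain Python. It assumes timestamps are epoch milliseconds and that a never-expiring token may be reported as `None`, `0`, or `-1`; the helper name `classify_token` and the sentinel tuple are hypothetical.

```python
import time

# Assumption: values that may mean "this token never expires"
NEVER_EXPIRES_SENTINELS = (None, 0, -1)
MS_PER_DAY = 24 * 60 * 60 * 1000


def classify_token(creation_time_ms, expiry_time_ms, now_ms=None, max_age_days=180):
    """Return (has_no_expiry, age_days, needs_rotation) for one token record."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    has_no_expiry = expiry_time_ms in NEVER_EXPIRES_SENTINELS
    age_days = (now_ms - creation_time_ms) / MS_PER_DAY
    return has_no_expiry, age_days, age_days > max_age_days


# A fixed "now" keeps the example reproducible
NOW_MS = 1_700_000_000_000
old_token = classify_token(NOW_MS - 200 * MS_PER_DAY, -1, now_ms=NOW_MS)
fresh_token = classify_token(NOW_MS - 10 * MS_PER_DAY, NOW_MS + 30 * MS_PER_DAY, now_ms=NOW_MS)
print(old_token)
print(fresh_token)
```

The 180-day default mirrors the threshold used in the cell above; pass a different `max_age_days` to match a stricter rotation policy.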



In [0]:
log("=== User Permissions Summary (Direct + Inherited) ===")
log(f"Total user permission entries: {user_all_permissions_df.count()}")

# Summary by user showing direct vs inherited
user_perm_summary = user_all_permissions_df.groupBy('principal').agg(
    F.count('*').alias('total_permissions'),
    F.sum(F.when(F.col('permission_source') == 'direct', 1).otherwise(0)).alias('direct_permissions'),
    F.sum(F.when(F.col('permission_source') == 'inherited', 1).otherwise(0)).alias('inherited_permissions'),
    F.countDistinct('resource_type').alias('resource_types_count'),
    F.collect_set('permission_level').alias('permission_levels'),
    F.collect_set('source_group').alias('inherited_from_groups')
).orderBy(F.desc('total_permissions'))

log(f"\nUsers with permissions: {user_perm_summary.count()}")
log("\nTop users by total permissions:")

if not is_job_mode:
    display(user_perm_summary.limit(50))

# Detailed view - showing permission source
user_permissions_detailed = user_all_permissions_df.select(
    'principal',
    'resource_type',
    'resource_name',
    'permission_level',
    'permission_source',
    'source_group'
).orderBy('principal', 'resource_type', 'permission_source')

if not is_job_mode:
    log("\n=== Detailed User Permissions (Top 100) ===")
    display(user_permissions_detailed.limit(100))

# Show users with their group memberships
log("\n=== Users and Their Groups ===")

user_with_groups = users_df.join(user_groups_df, users_df.user_name == user_groups_df.user_name, 'left') \
    .select(
        users_df.user_name,
        users_df.display_name,
        users_df.active,
        user_groups_df.group_name
    ) \
    .groupBy(users_df.user_name, users_df.display_name, users_df.active).agg(
        F.collect_set('group_name').alias('groups')
    ).orderBy('user_name')

log(f"Total users: {user_with_groups.count()}")

if not is_job_mode:
    display(user_with_groups.limit(50))
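

The per-user summary above tallies direct and inherited grants side by side. As a rough illustration of what that aggregation produces, here is a pure-Python sketch over hypothetical rows shaped like `user_all_permissions_df`; the helper name `summarize_by_user` and the sample data are assumptions.

```python
from collections import Counter

# Hypothetical rows:
# (principal, resource_type, permission_level, permission_source, source_group)
SAMPLE_ROWS = [
    ("alice@example.com", "jobs", "CAN_MANAGE", "direct", None),
    ("alice@example.com", "jobs", "CAN_VIEW", "inherited", "analysts"),
    ("alice@example.com", "repos", "CAN_EDIT", "inherited", "data-eng"),
    ("bob@example.com", "pipelines", "CAN_RUN", "direct", None),
]


def summarize_by_user(rows):
    """Count direct vs inherited grants and collect the granting groups per user."""
    summary = {}
    for principal, resource_type, _level, source, group in rows:
        entry = summary.setdefault(
            principal,
            {"sources": Counter(), "resource_types": set(), "groups": set()},
        )
        entry["sources"][source] += 1
        entry["resource_types"].add(resource_type)
        if group is not None:  # direct grants carry no source group
            entry["groups"].add(group)
    return summary


user_summary = summarize_by_user(SAMPLE_ROWS)
print(user_summary["alice@example.com"]["sources"])
```

A user whose inherited count dominates is usually easier to remediate by adjusting group membership than by editing individual grants.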



In [0]:
log("=== Group Permissions Summary ===")

# Filter permissions for groups only
group_permissions = permissions_df.filter(permissions_df.principal_type == 'group')

log(f"Total group permission entries: {group_permissions.count()}")

# Summary by group with member count
group_perm_summary = group_permissions.groupBy('principal').agg(
    F.count('*').alias('total_permissions'),
    F.countDistinct('resource_type').alias('resource_types_count'),
    F.collect_set('permission_level').alias('permission_levels')
)

# Add member count to groups
group_members_count = user_groups_df.groupBy('group_name').agg(
    F.count('*').alias('member_count')
)

group_summary_with_members = group_perm_summary \
    .join(group_members_count, group_perm_summary.principal == group_members_count.group_name, 'left') \
    .select(
        F.col('principal').alias('group_name'),
        F.col('total_permissions'),
        F.col('resource_types_count'),
        F.col('permission_levels'),
        F.coalesce(F.col('member_count'), F.lit(0)).alias('member_count')
    ).orderBy(F.desc('total_permissions'))

log(f"\nGroups with permissions: {group_summary_with_members.count()}")
log("\nTop groups by total permissions:")

if not is_job_mode:
    display(group_summary_with_members.limit(50))

# Detailed view - group permissions
group_permissions_detailed = group_permissions.select(
    'principal',
    'resource_type',
    'resource_name',
    'permission_level'
).orderBy('principal', 'resource_type')

if not is_job_mode:
    log("\n=== Detailed Group Permissions (Top 100) ===")
    display(group_permissions_detailed.limit(100))

# Show groups with their members
log("\n=== Groups and Their Members ===")

groups_with_members = groups_df.join(user_groups_df, groups_df.group_name == user_groups_df.group_name, 'left') \
    .select(
        groups_df.group_name,
        groups_df.group_id,
        user_groups_df.user_name
    ) \
    .groupBy(groups_df.group_name, groups_df.group_id).agg(
        F.collect_set('user_name').alias('members'),
        F.count('user_name').alias('member_count')
    ).orderBy(F.desc('member_count'))

log(f"Total groups: {groups_with_members.count()}")

if not is_job_mode:
    display(groups_with_members.limit(50))



In [0]:
# Permission Levels Reference - Based on Databricks Documentation

permission_definitions = {
    'Jobs': {
        'CAN_VIEW': 'View job details, settings, and results',
        'CAN_MANAGE_RUN': 'View results, run now, cancel runs, view Spark UI and logs',
        'IS_OWNER': 'Full control including edit, delete, and modify permissions (creator default)',
        'CAN_MANAGE': 'Edit job settings, delete job, and modify permissions'
    },
    'Clusters': {
        'CAN_ATTACH_TO': 'Attach notebooks, view Spark UI and metrics',
        'CAN_RESTART': 'Attach notebooks, view metrics, start/stop/restart cluster',
        'CAN_MANAGE': 'Full control: edit, resize, attach libraries, modify permissions, view driver logs'
    },
    'SQL Warehouses': {
        'CAN_VIEW': 'View warehouse details, query history, and monitoring (cannot run queries)',
        'CAN_MONITOR': 'Run queries, view query history and profiles for troubleshooting',
        'CAN_USE': 'Start warehouse and run queries',
        'IS_OWNER': 'Full control (creator default)',
        'CAN_MANAGE': 'Stop, delete, edit warehouse, and modify permissions'
    },
    'Pipelines': {
        'CAN_VIEW': 'View pipeline details, list pipelines, view Spark UI and driver logs',
        'CAN_RUN': 'Start/stop pipeline updates, stop pipeline clusters',
        'CAN_MANAGE': 'Edit settings, delete pipeline, purge runs, modify permissions',
        'IS_OWNER': 'Full control (creator default)'
    },
    'Notebooks/Files': {
        'CAN_VIEW': 'Read file and add comments (view-only access)',
        'CAN_RUN': 'Read, comment, attach/detach, and run file interactively',
        'CAN_EDIT': 'Read, run, and edit file',
        'CAN_MANAGE': 'Full control including modify permissions'
    },
    'Folders/Directories': {
        'CAN_VIEW': 'List and view objects in folder',
        'CAN_EDIT': 'View, clone, and export items',
        'CAN_RUN': 'View, clone, export, and run objects',
        'CAN_MANAGE': 'Full control: create, import, delete, move, rename items, modify permissions'
    },
    'Instance Pools': {
        'CAN_ATTACH_TO': 'Attach clusters to the pool',
        'CAN_MANAGE': 'Delete, edit pool, and modify permissions'
    }
}

print("="*80)
print("DATABRICKS PERMISSION LEVELS REFERENCE")
print("="*80)

for resource_type, permissions in permission_definitions.items():
    print(f"\n{resource_type.upper()}:")
    print("-" * 80)
    for perm_level, description in permissions.items():
        print(f"  {perm_level:20} {description}")

print("\n" + "="*80)
print("NOTES:")
print("="*80)
print("  • IS_OWNER: Automatically assigned to resource creator")
print("  • Workspace admins: Automatically inherit CAN_MANAGE on all resources")
print("  • Job runs: Execute as the job's 'Run as' identity (the owner by default), not the user who clicked 'Run Now'")
print("  • Job clusters: Inherit permissions from their parent job")
print("  • API names: CAN_VIEW may appear as CAN_READ in API responses")
print("="*80)

# Create a flattened reference DataFrame for export
reference_data = []
for resource_type, permissions in permission_definitions.items():
    for perm_level, description in permissions.items():
        reference_data.append({
            'resource_type': resource_type,
            'permission_level': perm_level,
            'description': description
        })

permission_reference_df = spark.createDataFrame(reference_data)
print(f"\n✓ Created permission_reference_df with {permission_reference_df.count()} permission definitions")
print("  (Available for export to Excel)")



In [0]:
log("\n=== Permission Levels Reference Table ===")
log("This table explains what each permission level allows users to do\n")

if not is_job_mode:
    display(permission_reference_df.orderBy('resource_type', 'permission_level'))



In [0]:
if ENABLE_SERVICE_PRINCIPALS:
    cell_start_time = time.time()
    
    log("Fetching service principals...")
    
    try:
        service_principals = list(wc.service_principals.list())
        
        sp_data = []
        for sp in service_principals:
            sp_data.append({
                'sp_id': sp.id,
                'sp_application_id': sp.application_id,
                'sp_display_name': sp.display_name,
                'active': sp.active,
                'groups': [g.display for g in sp.groups] if sp.groups else []
            })
        
        # Use an explicit schema: type inference fails when every 'groups' list is empty,
        # and the same call also covers the case where no service principals exist
        service_principals_df = spark.createDataFrame(
            sp_data,
            'sp_id STRING, sp_application_id STRING, sp_display_name STRING, active BOOLEAN, groups ARRAY<STRING>'
        )
        
        log(f"✓ Found {service_principals_df.count()} service principals")
        
        sp_export = service_principals_df.select(
            'sp_id',
            'sp_application_id', 
            'sp_display_name',
            'active',
            F.explode_outer('groups').alias('group_name')
        )
        
        log(f"✓ Created sp_export: {sp_export.count()} rows (flattened)")
        
        if not is_job_mode:
            display(service_principals_df.limit(20))
        
    except Exception as e:
        log(f"❌ Error fetching service principals: {str(e)}")
        service_principals_df = spark.createDataFrame([], 'sp_id STRING, sp_application_id STRING, sp_display_name STRING, active BOOLEAN, groups ARRAY<STRING>')
        sp_export = spark.createDataFrame([], 'sp_id STRING, sp_application_id STRING, sp_display_name STRING, active BOOLEAN, group_name STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get service principals", cell_start_time)
else:
    log("⏭️  Service principals collection disabled (ENABLE_SERVICE_PRINCIPALS=False)")
    service_principals_df = spark.createDataFrame([], 'sp_id STRING, sp_application_id STRING, sp_display_name STRING, active BOOLEAN, groups ARRAY<STRING>')
    sp_export = spark.createDataFrame([], 'sp_id STRING, sp_application_id STRING, sp_display_name STRING, active BOOLEAN, group_name STRING')
    execution_stats['resources_skipped'] += 1



In [0]:
if ENABLE_SECRET_SCOPES:
    cell_start_time = time.time()
    
    log("Fetching secret scopes and their ACLs...")
    
    try:
        secret_scopes = wc.secrets.list_scopes()
        
        secret_scope_data = []
        secret_acl_data = []
        
        for scope in secret_scopes:
            secret_scope_data.append({
                'scope_name': scope.name,
                'backend_type': scope.backend_type.value if scope.backend_type else 'UNKNOWN'
            })
            
            try:
                acls = wc.secrets.list_acls(scope=scope.name)
                for acl in acls:
                    secret_acl_data.append({
                        'scope_name': scope.name,
                        'principal': acl.principal,
                        'permission': acl.permission.value if acl.permission else 'UNKNOWN'
                    })
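            except Exception as e:
                # Log instead of silently dropping: a scope with unreadable ACLs is an audit gap
                log(f"  ⚠️ Could not list ACLs for scope {scope.name}: {str(e)}")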
        
        if secret_scope_data:
            secret_scopes_df = spark.createDataFrame(secret_scope_data)
        else:
            secret_scopes_df = spark.createDataFrame([], 'scope_name STRING, backend_type STRING')
        
        if secret_acl_data:
            secret_acls_df = spark.createDataFrame(secret_acl_data)
        else:
            secret_acls_df = spark.createDataFrame([], 'scope_name STRING, principal STRING, permission STRING')
        
        log(f"✓ Found {secret_scopes_df.count()} secret scopes")
        log(f"✓ Found {secret_acls_df.count()} secret ACL entries")
        
        if not is_job_mode:
            log("\nSecret Scopes:")
            display(secret_scopes_df)
            
            if secret_acls_df.count() > 0:
                log("\nSecret ACLs:")
                display(secret_acls_df.limit(50))
        
    except Exception as e:
        log(f"❌ Error fetching secret scopes: {str(e)}")
        secret_scopes_df = spark.createDataFrame([], 'scope_name STRING, backend_type STRING')
        secret_acls_df = spark.createDataFrame([], 'scope_name STRING, principal STRING, permission STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get secret scopes and ACLs", cell_start_time)
else:
    log("⏭️  Secret scopes collection disabled (ENABLE_SECRET_SCOPES=False)")
    secret_scopes_df = spark.createDataFrame([], 'scope_name STRING, backend_type STRING')
    secret_acls_df = spark.createDataFrame([], 'scope_name STRING, principal STRING, permission STRING')
    execution_stats['resources_skipped'] += 1



In [0]:
if ENABLE_UC_PERMISSIONS:
    cell_start_time = time.time()
    
    log("Fetching Unity Catalog permissions...")
    
    try:
        catalogs = list(wc.catalogs.list())
        
        uc_catalog_data = []
        uc_schema_data = []
        uc_catalog_grants = []
        uc_schema_grants = []
        
        log(f"Found {len(catalogs)} catalogs")
        
        for catalog in catalogs:
            uc_catalog_data.append({
                'catalog_name': catalog.name,
                'catalog_owner': catalog.owner,
                'catalog_type': catalog.catalog_type.value if catalog.catalog_type else 'UNKNOWN',
                'created_at': catalog.created_at,
                'updated_at': catalog.updated_at
            })
            
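            try:
                grants = wc.grants.get_effective(securable_type='catalog', full_name=catalog.name)
                for grant in grants.privilege_assignments or []:
                    for privilege in grant.privileges or []:
                        # get_effective() wraps each Privilege enum in an EffectivePrivilege
                        # object; unwrap it, but also accept a bare enum
                        priv = getattr(privilege, 'privilege', privilege)
                        uc_catalog_grants.append({
                            'catalog_name': catalog.name,
                            'principal': grant.principal,
                            'privilege': priv.value if priv is not None else 'UNKNOWN'
                        })
            except Exception as e:
                # Log instead of silently dropping: a missing grant list is an audit gap
                log(f"  ⚠️ Could not fetch grants for catalog {catalog.name}: {str(e)}")
            
            try:
                schemas = list(wc.schemas.list(catalog_name=catalog.name))
                # Sampling cap: only the first 20 schemas per catalog are audited
                for schema in schemas[:20]:
                    uc_schema_data.append({
                        'catalog_name': catalog.name,
                        'schema_name': schema.name,
                        'schema_owner': schema.owner,
                        'full_name': schema.full_name
                    })
                    
                    try:
                        schema_grants = wc.grants.get_effective(securable_type='schema', full_name=schema.full_name)
                        for grant in schema_grants.privilege_assignments or []:
                            for privilege in grant.privileges or []:
                                priv = getattr(privilege, 'privilege', privilege)
                                uc_schema_grants.append({
                                    'schema_full_name': schema.full_name,
                                    'principal': grant.principal,
                                    'privilege': priv.value if priv is not None else 'UNKNOWN'
                                })
                    except Exception as e:
                        log(f"  ⚠️ Could not fetch grants for schema {schema.full_name}: {str(e)}")
            except Exception as e:
                log(f"  ⚠️ Could not list schemas in catalog {catalog.name}: {str(e)}")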
        
        uc_catalogs_df = spark.createDataFrame(uc_catalog_data) if uc_catalog_data else spark.createDataFrame([], 'catalog_name STRING, catalog_owner STRING, catalog_type STRING, created_at BIGINT, updated_at BIGINT')
        uc_schemas_df = spark.createDataFrame(uc_schema_data) if uc_schema_data else spark.createDataFrame([], 'catalog_name STRING, schema_name STRING, schema_owner STRING, full_name STRING')
        uc_catalog_grants_df = spark.createDataFrame(uc_catalog_grants) if uc_catalog_grants else spark.createDataFrame([], 'catalog_name STRING, principal STRING, privilege STRING')
        uc_schema_grants_df = spark.createDataFrame(uc_schema_grants) if uc_schema_grants else spark.createDataFrame([], 'schema_full_name STRING, principal STRING, privilege STRING')
        
        log(f"\n✓ Found {uc_catalogs_df.count()} catalogs")
        log(f"✓ Found {uc_schemas_df.count()} schemas (sampled)")
        log(f"✓ Found {uc_catalog_grants_df.count()} catalog grants")
        log(f"✓ Found {uc_schema_grants_df.count()} schema grants")
        
        if not is_job_mode:
            log("\nCatalogs:")
            display(uc_catalogs_df)
            
            if uc_catalog_grants_df.count() > 0:
                log("\nCatalog Grants (Top 50):")
                display(uc_catalog_grants_df.limit(50))
        
    except Exception as e:
        log(f"❌ Error fetching Unity Catalog permissions: {str(e)}")
        uc_catalogs_df = spark.createDataFrame([], 'catalog_name STRING, catalog_owner STRING, catalog_type STRING, created_at BIGINT, updated_at BIGINT')
        uc_schemas_df = spark.createDataFrame([], 'catalog_name STRING, schema_name STRING, schema_owner STRING, full_name STRING')
        uc_catalog_grants_df = spark.createDataFrame([], 'catalog_name STRING, principal STRING, privilege STRING')
        uc_schema_grants_df = spark.createDataFrame([], 'schema_full_name STRING, principal STRING, privilege STRING')
        if is_job_mode:
            raise
    
    log_execution_time("Get Unity Catalog permissions", cell_start_time)
else:
    log("⏭️  Unity Catalog permissions collection disabled (ENABLE_UC_PERMISSIONS=False)")
    uc_catalogs_df = spark.createDataFrame([], 'catalog_name STRING, catalog_owner STRING, catalog_type STRING, created_at BIGINT, updated_at BIGINT')
    uc_schemas_df = spark.createDataFrame([], 'catalog_name STRING, schema_name STRING, schema_owner STRING, full_name STRING')
    uc_catalog_grants_df = spark.createDataFrame([], 'catalog_name STRING, principal STRING, privilege STRING')
    uc_schema_grants_df = spark.createDataFrame([], 'schema_full_name STRING, principal STRING, privilege STRING')
    execution_stats['resources_skipped'] += 1
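

Effective-permission responses nest each privilege inside a wrapper that also records where the grant was inherited from. The sketch below shows one way to flatten such a response into rows. The dataclasses are local stand-ins written for this example, assumed to mirror the shape of the SDK's response objects; they are not imported from the SDK, and `flatten_grants` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Privilege(Enum):  # stand-in for the SDK privilege enum
    USE_CATALOG = "USE_CATALOG"
    SELECT = "SELECT"


@dataclass
class EffectivePrivilege:  # stand-in: wraps the enum plus inheritance info
    privilege: Optional[Privilege] = None
    inherited_from_name: Optional[str] = None


@dataclass
class EffectivePrivilegeAssignment:  # stand-in: one principal and its privileges
    principal: Optional[str] = None
    privileges: Optional[List[EffectivePrivilege]] = None


def flatten_grants(securable_name, assignments):
    """Turn nested privilege assignments into one flat row per privilege."""
    rows = []
    for grant in assignments or []:
        for item in grant.privileges or []:
            # Unwrap the wrapper object, but also accept a bare enum value
            priv = getattr(item, "privilege", item)
            rows.append({
                "securable": securable_name,
                "principal": grant.principal,
                "privilege": priv.value if priv is not None else "UNKNOWN",
                "inherited_from": getattr(item, "inherited_from_name", None),
            })
    return rows


sample_assignments = [
    EffectivePrivilegeAssignment("analysts", [
        EffectivePrivilege(Privilege.SELECT, "main"),
        EffectivePrivilege(Privilege.USE_CATALOG),
    ]),
    EffectivePrivilegeAssignment("nobody", None),  # principal without privileges
]
grant_rows = flatten_grants("main.sales", sample_assignments)
print(grant_rows)
```

Keeping `inherited_from` in the output makes it possible to tell a grant made directly on a schema from one that flows down from its catalog.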



In [0]:
if ENABLE_IP_ACCESS_LISTS:
    cell_start_time = time.time()
    
    log("Fetching workspace settings and IP access lists...")
    
    try:
        ip_access_lists = list(wc.ip_access_lists.list())
        
        ip_acl_data = []
        for ip_list in ip_access_lists:
            ip_acl_data.append({
                'list_id': ip_list.list_id,
                'label': ip_list.label,
                'list_type': ip_list.list_type.value if ip_list.list_type else 'UNKNOWN',
                'enabled': ip_list.enabled,
                'ip_addresses': str(ip_list.ip_addresses) if ip_list.ip_addresses else '[]',
                'created_at': ip_list.created_at,
                'created_by': ip_list.created_by
            })
        
        if ip_acl_data:
            ip_access_lists_df = spark.createDataFrame(ip_acl_data)
        else:
            ip_access_lists_df = spark.createDataFrame([], 'list_id STRING, label STRING, list_type STRING, enabled BOOLEAN, ip_addresses STRING, created_at BIGINT, created_by STRING')
        
        log(f"✓ Found {ip_access_lists_df.count()} IP access lists")
        
        if not is_job_mode:
            if ip_access_lists_df.count() > 0:
                display(ip_access_lists_df)
            else:
                log("  No IP access lists configured")
        
    except Exception as e:
        log(f"❌ Error fetching IP access lists: {str(e)}")
        ip_access_lists_df = spark.createDataFrame([], 'list_id STRING, label STRING, list_type STRING, enabled BOOLEAN, ip_addresses STRING, created_at BIGINT, created_by STRING')
        if is_job_mode:
            raise
    
    log("\nWorkspace Configuration:")
        
    try:
        workspace_conf = wc.workspace_conf.get_status(keys='enableTokensConfig,enableIpAccessLists')
        log(f"  Token creation enabled: {workspace_conf.get('enableTokensConfig', 'Unknown')}")
        log(f"  IP access lists enabled: {workspace_conf.get('enableIpAccessLists', 'Unknown')}")
    except Exception as e:
        log(f"  ⚠️ Could not fetch workspace config: {str(e)}")
    
    log_execution_time("Get workspace settings and IP access lists", cell_start_time)
else:
    log("⏭️  IP access lists collection disabled (ENABLE_IP_ACCESS_LISTS=False)")
    ip_access_lists_df = spark.createDataFrame([], 'list_id STRING, label STRING, list_type STRING, enabled BOOLEAN, ip_addresses STRING, created_at BIGINT, created_by STRING')
    execution_stats['resources_skipped'] += 1



In [0]:
log("Updating export lists to include all security data...\n")

# Add all dataframes to export list
additional_exports = [
    ('service_principals', sp_export),
    ('secret_scopes', secret_scopes_df),
    ('secret_acls', secret_acls_df),
    ('uc_catalogs', uc_catalogs_df),
    ('uc_schemas', uc_schemas_df),
    ('uc_catalog_grants', uc_catalog_grants_df),
    ('uc_schema_grants', uc_schema_grants_df),
    ('ip_access_lists', ip_access_lists_df)
]

# Add new resource types if they exist
if 'workspace_permissions_df' in dir():
    additional_exports.append(('workspace_permissions', workspace_permissions_df))
if 'tokens_df' in dir():
    additional_exports.append(('tokens', tokens_df))
if 'uc_volume_grants_df' in dir():
    additional_exports.append(('uc_volume_grants', uc_volume_grants_df))
if 'new_permissions_df' in dir() and new_permissions_df.count() > 0:
    additional_exports.append(('new_permissions', new_permissions_df))
if 'removed_permissions_df' in dir() and removed_permissions_df.count() > 0:
    additional_exports.append(('removed_permissions', removed_permissions_df))
if 'workspace_admins' in dir():
    additional_exports.append(('workspace_admins', workspace_admins))
if 'inactive_users_with_perms' in dir():
    additional_exports.append(('inactive_users_with_perms', inactive_users_with_perms))

log("Additional security data available for export:")
for name, df in additional_exports:
    log(f"  - {name}: {df.count()} rows")

log(f"\n{'='*60}")
log("COMPLETE SECURITY COVERAGE SUMMARY")
log(f"{'='*60}")
log("\n✓ Identity & Access:")
log(f"  - Users: {users_df.count()}")
log(f"  - Groups: {groups_df.count()}")
log(f"  - Service Principals: {service_principals_df.count() if 'service_principals_df' in dir() else 0}")
log(f"  - User-Group Memberships: {user_groups_df.count()}")
if 'workspace_admins' in dir():
    log(f"  - Workspace Admins: {workspace_admins.count()}")

# Report what was actually collected; resource types skipped in Quick or Custom Mode
# do not appear in permissions_df and are therefore not listed
log("\n✓ Workspace Permissions (entries collected, by resource type):")
resource_type_counts = permissions_df.groupBy('resource_type').count().orderBy('resource_type').collect()
for row in resource_type_counts:
    log(f"  - {row['resource_type']}: {row['count']}")
if 'workspace_permissions_df' in dir():
    log(f"  - Workspace Objects (folders/notebooks): {workspace_permissions_df.count()}")
log(f"  - Total permission entries: {permissions_df.count()}")

log("\n✓ Unity Catalog:")
log(f"  - Catalogs: {uc_catalogs_df.count() if 'uc_catalogs_df' in dir() else 0}")
log(f"  - Schemas: {uc_schemas_df.count() if 'uc_schemas_df' in dir() else 0}")
log(f"  - Catalog Grants: {uc_catalog_grants_df.count() if 'uc_catalog_grants_df' in dir() else 0}")
log(f"  - Schema Grants: {uc_schema_grants_df.count() if 'uc_schema_grants_df' in dir() else 0}")
if 'uc_volume_grants_df' in dir():
    log(f"  - Volume Grants: {uc_volume_grants_df.count()}")

log("\n✓ Secrets Management:")
log(f"  - Secret Scopes: {secret_scopes_df.count() if 'secret_scopes_df' in dir() else 0}")
log(f"  - Secret ACLs: {secret_acls_df.count() if 'secret_acls_df' in dir() else 0}")

log("\n✓ Network Security:")
log(f"  - IP Access Lists: {ip_access_lists_df.count() if 'ip_access_lists_df' in dir() else 0}")

log("\n✓ Token Management:")
if 'tokens_df' in dir():
    log(f"  - Active Tokens: {tokens_df.count()}")
else:
    log(f"  - Active Tokens: Not available")

log("\n✓ Change Detection:")
if 'new_permissions_df' in dir() and 'removed_permissions_df' in dir():
    log(f"  - New Permissions: {new_permissions_df.count()}")
    log(f"  - Removed Permissions: {removed_permissions_df.count()}")
else:
    log(f"  - Change detection: Not enabled")

log(f"\n{'='*60}")
log("\nNote: Run remaining cells to export all data to Excel and/or Delta tables")
log(f"{'='*60}")



In [0]:
# Export all security data to individual Excel files
# This creates one Excel file per dataframe

if ENABLE_EXCEL_EXPORT:
    cell_start_time = time.time()
    import os
    import pandas as pd
    from datetime import datetime
    import pytz
    
    # Create timestamp for filenames in Eastern Time
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    base_path = EXPORT_PATH if 'EXPORT_PATH' in dir() else '/dbfs/tmp/permissions_export'
    
    # Make sure the export directory exists; to_excel() does not create it
    os.makedirs(base_path, exist_ok=True)
    
    log(f"\nExporting ALL security data to Excel...\n")
    log(f"Export location: {base_path}")
    log(f"Timestamp (Eastern Time): {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}\n")
    
    # Complete export list with all security data
    complete_export_list = [
        ('permission_reference', permission_reference_df),
        ('users_with_groups', users_export),
        ('groups_with_members', groups_export),
        ('service_principals', sp_export),
        ('user_groups', user_groups_export),
        ('permissions', permissions_export),
        ('permissions_with_groups', permissions_with_groups_export),
        ('user_all_permissions', user_all_permissions_export),
        ('user_permissions_summary', user_perm_summary_export),
        ('group_permissions_summary', group_summary_export),
        ('secret_scopes', secret_scopes_df),
        ('secret_acls', secret_acls_df),
        ('uc_catalogs', uc_catalogs_df),
        ('uc_schemas', uc_schemas_df),
        ('uc_catalog_grants', uc_catalog_grants_df),
        ('uc_schema_grants', uc_schema_grants_df),
        ('ip_access_lists', ip_access_lists_df)
    ]
    
    export_success = 0
    export_failed = 0
    
    for name, df in complete_export_list:
        try:
            # Validate DataFrame before export
            if df is None:
                log(f"⏭️  Skipping {name}: DataFrame is None")
                continue
            
            row_count = df.count()
            if row_count == 0:
                log(f"⏭️  Skipping {name}: DataFrame is empty (0 rows)")
                continue
            
            # Convert to pandas
            pdf = df.toPandas()
            
            # Convert timestamp columns to Eastern Time, then drop the tz info:
            # Excel cannot store timezone-aware datetimes, and to_excel() raises on them
            for col in pdf.columns:
                if pd.api.types.is_datetime64_any_dtype(pdf[col]):
                    if pdf[col].dt.tz is None:
                        localized = pdf[col].dt.tz_localize('UTC')  # naive values are assumed to be UTC
                    else:
                        localized = pdf[col]
                    pdf[col] = localized.dt.tz_convert(eastern).dt.tz_localize(None)
            
            # Create filename
            filename = f"{base_path}/{name}_{timestamp}.xlsx"
            
            # Export to Excel
            pdf.to_excel(filename, index=False, engine='openpyxl')
            
            log(f"✓ Exported {name}: {len(pdf)} rows")
            export_success += 1
        except Exception as e:
            log(f"✗ Error exporting {name}: {str(e)}")
            export_failed += 1
            # In job mode, raise the exception to fail the job
            if is_job_mode:
                raise
    
    log(f"\n{'='*60}")
    log(f"Individual file export complete: {export_success} succeeded, {export_failed} failed")
    log(f"{'='*60}")
    log_execution_time("Export all security data to Excel", cell_start_time)
else:
    log("\n⏭️  Excel export disabled")
    log("   Set ENABLE_EXCEL_EXPORT = True in Cell 2 to enable Excel file generation")
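
Excel has no timezone-aware datetime type, and pandas raises an error when asked to write tz-aware columns, so timestamps are best shifted to local wall-clock time and then made naive before export. The sketch below shows that conversion with the standard library only. `to_excel_local` is a hypothetical helper, and the fixed UTC-5 offset is an assumption that stands in for `America/New_York` so the example has no tz-database dependency; it ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for America/New_York here.
# Real code would use pytz or zoneinfo so that daylight saving is handled.
EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")


def to_excel_local(naive_utc, tz=EASTERN_STANDARD):
    """Interpret a naive datetime as UTC, shift it to tz, and drop tzinfo."""
    aware_utc = naive_utc.replace(tzinfo=timezone.utc)
    local = aware_utc.astimezone(tz)
    return local.replace(tzinfo=None)


stamp = to_excel_local(datetime(2024, 1, 15, 17, 30, 0))
```

The pandas equivalent of the last step is `.dt.tz_localize(None)` applied after `.dt.tz_convert(...)`.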



In [0]:
# Create comprehensive security workbook with all data in one Excel file
# This creates a single Excel file with multiple sheets

if ENABLE_EXCEL_EXPORT:
    cell_start_time = time.time()
    import pandas as pd
    from datetime import datetime
    import pytz
    
    # Create timestamp for filename in Eastern Time
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    base_path = EXPORT_PATH if 'EXPORT_PATH' in dir() else '/dbfs/tmp/permissions_export'
    combined_filename = f"{base_path}/complete_security_review_{timestamp}.xlsx"
    
    log(f"\nCreating comprehensive security review workbook...")
    log(f"Timestamp (Eastern Time): {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}\n")
    
    try:
        with pd.ExcelWriter(combined_filename, engine='openpyxl') as writer:
            
            # All sheets in logical order
            all_sheets = [
                ('Reference', permission_reference_df),
                ('Users_Groups', users_export),
                ('Groups_Members', groups_export),
                ('Service_Principals', sp_export),
                ('User_Groups', user_groups_export),
                ('Permissions', permissions_export),
                ('Perms_With_Groups', permissions_with_groups_export),
                ('User_All_Perms', user_all_permissions_export),
                ('User_Summary', user_perm_summary_export),
                ('Group_Summary', group_summary_export),
                ('Secret_Scopes', secret_scopes_df),
                ('Secret_ACLs', secret_acls_df),
                ('UC_Catalogs', uc_catalogs_df),
                ('UC_Schemas', uc_schemas_df),
                ('UC_Catalog_Grants', uc_catalog_grants_df),
                ('UC_Schema_Grants', uc_schema_grants_df),
                ('IP_Access_Lists', ip_access_lists_df)
            ]
            
            # Add new resource types if they exist
            if 'workspace_permissions_df' in dir():
                all_sheets.append(('Workspace_Perms', workspace_permissions_df))
            if 'tokens_df' in dir():
                all_sheets.append(('Tokens', tokens_df))
            if 'uc_volume_grants_df' in dir():
                all_sheets.append(('UC_Volume_Grants', uc_volume_grants_df))
            if 'workspace_admins' in dir():
                all_sheets.append(('Workspace_Admins', workspace_admins))
            if 'inactive_users_with_perms' in dir():
                all_sheets.append(('Inactive_Users', inactive_users_with_perms))
            if 'new_permissions_df' in dir() and new_permissions_df.count() > 0:
                all_sheets.append(('New_Permissions', new_permissions_df))
            if 'removed_permissions_df' in dir() and removed_permissions_df.count() > 0:
                all_sheets.append(('Removed_Permissions', removed_permissions_df))
            
            sheets_added = 0
            sheets_skipped = 0
            
            for sheet_name, df in all_sheets:
                try:
                    # Validate DataFrame before export
                    if df is None:
                        log(f"⏭️  Skipping sheet {sheet_name}: DataFrame is None")
                        sheets_skipped += 1
                        continue
                    
                    row_count = df.count()
                    if row_count == 0:
                        log(f"⏭️  Skipping sheet {sheet_name}: DataFrame is empty")
                        sheets_skipped += 1
                        continue
                    
                    pdf = df.toPandas()
                    
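                    # Convert timestamp columns to Eastern Time, then drop the tz info:
                    # Excel cannot store timezone-aware datetimes, and to_excel() raises on them
                    for col in pdf.columns:
                        if pd.api.types.is_datetime64_any_dtype(pdf[col]):
                            if pdf[col].dt.tz is None:
                                localized = pdf[col].dt.tz_localize('UTC')  # naive values are assumed to be UTC
                            else:
                                localized = pdf[col]
                            pdf[col] = localized.dt.tz_convert(eastern).dt.tz_localize(None)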
                    
                    sheet_name_clean = sheet_name[:31]
                    pdf.to_excel(writer, sheet_name=sheet_name_clean, index=False)
                    
                    log(f"✓ Added sheet '{sheet_name_clean}': {len(pdf)} rows")
                    sheets_added += 1
                except Exception as e:
                    log(f"✗ Error adding sheet {sheet_name}: {str(e)}")
                    sheets_skipped += 1
        
        log(f"\n{'='*60}")
        log("COMPREHENSIVE SECURITY WORKBOOK CREATED!")
        log(f"{'='*60}")
        log(f"\nFile: {combined_filename}")
        log(f"Sheets added: {sheets_added}, Sheets skipped: {sheets_skipped}")
        log(f"\nContains sheets covering:")
        log("• Users, Groups, Service Principals")
        log("• Workspace Resource Permissions (Jobs, Warehouses, Clusters, Pipelines)")
        log("• Workspace Objects (Folders, Notebooks, Repos)")
        log("• Model Registry Permissions")
        log("• Unity Catalog Permissions (Catalogs, Schemas, Volumes)")
        log("• Secret Scopes and ACLs")
        log("• IP Access Lists and Network Security")
        log("• Token Management")
        log("• Compliance Reports (Admins, Inactive Users, Change Detection)")
        log("• Permission Level Reference")
        log(f"\nTo download:")
        log(f"  databricks fs cp {combined_filename.replace('/dbfs', 'dbfs:', 1)} ./")
        
        log_execution_time("Create comprehensive security workbook", cell_start_time)
        
    except Exception as e:
        log(f"\n✗ Error creating workbook: {str(e)}")
        # In job mode, raise the exception to fail the job
        if is_job_mode:
            raise
else:
    log("\n⏭️  Comprehensive Excel workbook creation skipped")
    log("   Set ENABLE_EXCEL_EXPORT = True in Cell 2 to enable")
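
The workbook cell truncates sheet names to 31 characters, which is Excel's limit. Excel also rejects the characters `[ ] : * ? / \` in sheet names and treats names as case-insensitive, so two long names that share a 31-character prefix would collide. A minimal sketch of a stricter sanitizer follows; `safe_sheet_name` is a hypothetical helper and the sample names are invented for illustration.

```python
import re

INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_SHEET_NAME_LEN = 31


def safe_sheet_name(name, used):
    """Return an Excel-safe sheet name that is unique within `used`.

    `used` is a set of lower-cased names already taken; it is updated in place.
    """
    base = INVALID_SHEET_CHARS.sub("_", name)[:MAX_SHEET_NAME_LEN] or "Sheet"
    candidate = base
    suffix = 2
    while candidate.lower() in used:
        tail = f"_{suffix}"
        # Trim the base so the numbered name still fits in 31 characters.
        candidate = base[: MAX_SHEET_NAME_LEN - len(tail)] + tail
        suffix += 1
    used.add(candidate.lower())
    return candidate


used_names = set()
first = safe_sheet_name("User_Permissions_Summary_By_Resource_Type", used_names)
second = safe_sheet_name("User_Permissions_Summary_By_Resource_Owner", used_names)
third = safe_sheet_name("Grants: prod/finance", used_names)
```

With the current sheet list every name is already short and unique, so this only matters if longer or user-derived names are added later.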



In [0]:
# Export security data to CSV format for easy import into other tools
# CSV files are smaller and faster to generate than Excel

# Configuration: Enable CSV export
ENABLE_CSV_EXPORT = False  # Set to True to enable CSV export

if ENABLE_CSV_EXPORT:
    cell_start_time = time.time()
    from datetime import datetime
    import pytz
    
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    csv_path = f"{EXPORT_PATH}/csv_{timestamp}"
    
    log(f"\nExporting security data to CSV format...")
    log(f"Export location: {csv_path}")
    
    # Create CSV export directory
    dbutils.fs.mkdirs(csv_path.replace('/dbfs', 'dbfs:'))
    
    # List of DataFrames to export
    csv_export_list = [
        ('users_with_groups', users_export),
        ('groups_with_members', groups_export),
        ('permissions', permissions_export),
        ('user_all_permissions', user_all_permissions_export),
        ('workspace_admins', workspace_admins) if 'workspace_admins' in dir() else None,
        ('inactive_users_with_perms', inactive_users_with_perms) if 'inactive_users_with_perms' in dir() else None,
        ('new_permissions', new_permissions_df) if 'new_permissions_df' in dir() and new_permissions_df.count() > 0 else None,
        ('removed_permissions', removed_permissions_df) if 'removed_permissions_df' in dir() and removed_permissions_df.count() > 0 else None
    ]
    
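    # Filter out None entries (optional DataFrames that were not created).
    # Test the entry itself: unpacking a None entry into (name, df) would raise TypeError
    csv_export_list = [entry for entry in csv_export_list if entry is not None]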
    
    csv_success = 0
    for name, df in csv_export_list:
        try:
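            if validate_dataframe_exists(name, df):
                # Spark needs the dbfs: URI rather than the /dbfs FUSE path; the output is a
                # directory named {name}.csv that contains a single part file
                csv_dir = f"{csv_path}/{name}.csv".replace('/dbfs', 'dbfs:', 1)
                df.coalesce(1).write.mode('overwrite').option('header', 'true').csv(csv_dir)
                log(f"✓ Exported {name}: {df.count()} rows")
                csv_success += 1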
        except Exception as e:
            log(f"✗ Error exporting {name} to CSV: {str(e)}")
    
    log(f"\n✓ CSV export complete: {csv_success} files created")
    log(f"Location: {csv_path}")
    log_execution_time("Export to CSV format", cell_start_time)
else:
    log("\n⏭️  CSV export disabled")
    log("   Set ENABLE_CSV_EXPORT = True to enable CSV export")
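
The CSV cell depends on two small transformations: translating a `/dbfs/...` FUSE path (used by local file APIs such as pandas) into the `dbfs:/...` URI form (used by Spark and `dbutils.fs`), and dropping `None` placeholders for optional DataFrames. The sketch below isolates both. `to_dbfs_uri` and `drop_missing` are hypothetical helpers, and the sample paths are made up.

```python
def to_dbfs_uri(path):
    """Translate a /dbfs FUSE path into the dbfs: URI form.

    Only the leading mount prefix is rewritten, so a later path segment
    that happens to start with "dbfs" is left untouched.
    """
    if path == "/dbfs":
        return "dbfs:/"
    if path.startswith("/dbfs/"):
        return "dbfs:" + path[len("/dbfs"):]
    return path  # already a URI, or not under the DBFS mount


def drop_missing(entries):
    """Keep only real (name, value) pairs, discarding None placeholders."""
    return [entry for entry in entries if entry is not None]


uri = to_dbfs_uri("/dbfs/tmp/permissions_export/csv_20240115/users.csv")
unchanged = to_dbfs_uri("dbfs:/tmp/permissions_export")
nested = to_dbfs_uri("/dbfs/tmp/dbfs_backup/users.csv")
kept = drop_missing([("users", 1), None, ("groups", 2)])
```

Rewriting only the prefix matters for paths like the `nested` example, where a plain `str.replace('/dbfs', 'dbfs:')` would also rewrite the `/dbfs_backup` segment.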



In [0]:
cell_start_time = time.time()

log("\n" + "="*60)
log("EXECUTIVE SUMMARY DASHBOARD")
log("="*60)

# Collect all key metrics for executive summary
summary_metrics = []

# Identity metrics
if 'users_df' in dir():
    total_users = users_df.count()
    active_users = users_df.filter(F.col('active') == True).count()
    inactive_users = total_users - active_users
    summary_metrics.extend([
        {'Category': 'Identity', 'Metric': 'Total Users', 'Value': str(total_users), 'Status': '✓'},
        {'Category': 'Identity', 'Metric': 'Active Users', 'Value': str(active_users), 'Status': '✓'},
        {'Category': 'Identity', 'Metric': 'Inactive Users', 'Value': str(inactive_users), 'Status': '⚠️' if inactive_users > 0 else '✓'}
    ])

if 'groups_df' in dir():
    total_groups = groups_df.count()
    summary_metrics.append({'Category': 'Identity', 'Metric': 'Total Groups', 'Value': str(total_groups), 'Status': '✓'})

if 'service_principals_df' in dir():
    total_sps = service_principals_df.count()
    summary_metrics.append({'Category': 'Identity', 'Metric': 'Service Principals', 'Value': str(total_sps), 'Status': '✓'})

# Permission metrics
if 'permissions_df' in dir():
    total_permissions = permissions_df.count()
    unique_resources = permissions_df.select('resource_id').distinct().count()
    resource_types = permissions_df.select('resource_type').distinct().count()
    summary_metrics.extend([
        {'Category': 'Permissions', 'Metric': 'Total Permission Entries', 'Value': str(total_permissions), 'Status': '✓'},
        {'Category': 'Permissions', 'Metric': 'Unique Resources', 'Value': str(unique_resources), 'Status': '✓'},
        {'Category': 'Permissions', 'Metric': 'Resource Types Covered', 'Value': str(resource_types), 'Status': '✓'}
    ])

# Security alerts
security_alert_count = 0

if 'tokens_no_expiry' in dir():
    tokens_no_expiry_count = tokens_no_expiry.count() if hasattr(tokens_no_expiry, 'count') else 0
    summary_metrics.append({
        'Category': 'Security Alerts', 
        'Metric': 'Tokens Without Expiry', 
        'Value': str(tokens_no_expiry_count), 
        'Status': '❌' if tokens_no_expiry_count > 0 else '✓'
    })
    # Count this as one alert category, consistent with the other alert checks below
    security_alert_count += (1 if tokens_no_expiry_count > 0 else 0)

if 'inactive_user_permissions_df' in dir():
    inactive_perm_count = inactive_user_permissions_df.count() if hasattr(inactive_user_permissions_df, 'count') else 0
    summary_metrics.append({
        'Category': 'Security Alerts', 
        'Metric': 'Inactive User Permissions', 
        'Value': str(inactive_perm_count), 
        'Status': '❌' if inactive_perm_count > 0 else '✓'
    })
    security_alert_count += (1 if inactive_perm_count > 0 else 0)

if 'external_users_df' in dir():
    external_user_count = external_users_df.count() if hasattr(external_users_df, 'count') else 0
    summary_metrics.append({
        'Category': 'Security Alerts', 
        'Metric': 'External Users', 
        'Value': str(external_user_count), 
        'Status': '⚠️' if external_user_count > 0 else '✓'
    })
    security_alert_count += (1 if external_user_count > 0 else 0)

if 'excessive_admin_permissions_df' in dir():
    excessive_admin_count = excessive_admin_permissions_df.count() if hasattr(excessive_admin_permissions_df, 'count') else 0
    summary_metrics.append({
        'Category': 'Security Alerts', 
        'Metric': 'Users with Excessive Admin Permissions', 
        'Value': str(excessive_admin_count), 
        'Status': '⚠️' if excessive_admin_count > 0 else '✓'
    })
    security_alert_count += (1 if excessive_admin_count > 0 else 0)

if 'sod_violations_df' in dir():
    sod_count = sod_violations_df.count() if hasattr(sod_violations_df, 'count') else 0
    summary_metrics.append({
        'Category': 'Compliance', 
        'Metric': 'Potential SOD Violations', 
        'Value': str(sod_count), 
        'Status': '⚠️' if sod_count > 0 else '✓'
    })

# Execution statistics
if 'execution_stats' in dir():
    summary_metrics.extend([
        {'Category': 'Execution', 'Metric': 'Total API Calls', 'Value': str(execution_stats.get('api_calls', 0)), 'Status': '✓'},
        {'Category': 'Execution', 'Metric': 'API Failures', 'Value': str(execution_stats.get('api_failures', 0)), 'Status': '⚠️' if execution_stats.get('api_failures', 0) > 0 else '✓'},
        {'Category': 'Execution', 'Metric': 'Resources Skipped', 'Value': str(execution_stats.get('resources_skipped', 0)), 'Status': 'ℹ️'}
    ])

# Calculate risk score (0-100, lower is better)
risk_score = 0
if 'tokens_no_expiry' in dir():
    risk_score += min(tokens_no_expiry.count() * 5, 30)  # Up to 30 points
if 'inactive_user_permissions_df' in dir():
    risk_score += min(inactive_user_permissions_df.count() * 0.1, 20)  # Up to 20 points
if 'external_users_df' in dir():
    risk_score += min(external_users_df.count() * 2, 20)  # Up to 20 points
if 'excessive_admin_permissions_df' in dir():
    risk_score += min(excessive_admin_permissions_df.count() * 3, 30)  # Up to 30 points

risk_score = min(risk_score, 100)
risk_level = 'LOW' if risk_score < 30 else 'MEDIUM' if risk_score < 60 else 'HIGH'
risk_status = '✓' if risk_score < 30 else '⚠️' if risk_score < 60 else '❌'

summary_metrics.append({
    'Category': 'Risk Assessment', 
    'Metric': 'Overall Risk Score (0-100)', 
    'Value': f"{risk_score:.0f} ({risk_level})", 
    'Status': risk_status
})

# Create summary DataFrame
executive_summary_df = spark.createDataFrame(summary_metrics)

log("\n📊 Executive Summary Dashboard:")
log(f"  Total metrics: {len(summary_metrics)}")
log(f"  Security alerts: {security_alert_count}")
log(f"  Risk score: {risk_score:.0f}/100 ({risk_level})")

if not is_job_mode:
    display(executive_summary_df.orderBy('Category', 'Metric'))

log_execution_time("Prepare executive summary dashboard", cell_start_time)
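
The risk score above is a capped weighted sum: each finding type contributes a per-item weight up to a fixed ceiling, and the ceilings add up to 100. The sketch below restates that formula as a pure function so it can be checked by hand. `compute_risk_score` is a hypothetical name and the input counts are invented.

```python
def compute_risk_score(tokens_no_expiry, inactive_permissions,
                       external_users, excessive_admins):
    """Return (score, level) using the same weights and caps as the cell above."""
    score = 0.0
    score += min(tokens_no_expiry * 5, 30)        # up to 30 points
    score += min(inactive_permissions * 0.1, 20)  # up to 20 points
    score += min(external_users * 2, 20)          # up to 20 points
    score += min(excessive_admins * 3, 30)        # up to 30 points
    score = min(score, 100)
    level = "LOW" if score < 30 else "MEDIUM" if score < 60 else "HIGH"
    return score, level


# 2 tokens -> 10, 50 inactive permissions -> 5, 3 external users -> 6, 1 admin -> 3
score, level = compute_risk_score(2, 50, 3, 1)
capped_score, capped_level = compute_risk_score(10, 1000, 20, 20)
```

The first call yields 24 points (LOW); the second hits every cap and yields 100 (HIGH).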



In [0]:
# Prepare summary dashboard data for Excel export
# This will be added as the FIRST sheet in the comprehensive workbook

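# Imported here as well: earlier cells import these only when ENABLE_EXCEL_EXPORT is True
import pandas as pd
from datetime import datetime

log("\nPreparing summary dashboard for Excel export...")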

# Convert executive summary to pandas for Excel export
if 'executive_summary_df' in dir():
    executive_summary_pandas = executive_summary_df.toPandas()
    
    # Add additional summary tables
    
    # 1. Permissions by resource type
    if 'permissions_df' in dir():
        perms_by_type = permissions_df.groupBy('resource_type') \
            .agg(
                F.count('*').alias('permission_count'),
                F.countDistinct('principal').alias('unique_principals'),
                F.countDistinct('resource_id').alias('unique_resources')
            ) \
            .orderBy(F.desc('permission_count')) \
            .toPandas()
    else:
        perms_by_type = pd.DataFrame()
    
    # 2. Top users by permission count
    if 'permission_concentration_df' in dir():
        top_users_summary = permission_concentration_df.limit(10).toPandas()
    else:
        top_users_summary = pd.DataFrame()
    
    # 3. Security alerts summary
    security_alerts_data = []
    
    if 'tokens_no_expiry' in dir():
        tokens_no_expiry_count = tokens_no_expiry.count() if hasattr(tokens_no_expiry, 'count') else 0
        if tokens_no_expiry_count > 0:
            security_alerts_data.append({
                'Alert Type': 'Tokens Without Expiry',
                'Severity': 'CRITICAL',
                'Count': tokens_no_expiry_count,
                'Recommendation': 'Set expiration dates for all tokens'
            })
    
    if 'inactive_user_permissions_df' in dir():
        inactive_perm_count = inactive_user_permissions_df.count() if hasattr(inactive_user_permissions_df, 'count') else 0
        if inactive_perm_count > 0:
            security_alerts_data.append({
                'Alert Type': 'Inactive User Permissions',
                'Severity': 'HIGH',
                'Count': inactive_perm_count,
                'Recommendation': 'Remove permissions for inactive users'
            })
    
    if 'external_users_df' in dir():
        external_user_count = external_users_df.count() if hasattr(external_users_df, 'count') else 0
        if external_user_count > 0:
            security_alerts_data.append({
                'Alert Type': 'External Users',
                'Severity': 'MEDIUM',
                'Count': external_user_count,
                'Recommendation': 'Review external user access justification'
            })
    
    if 'excessive_admin_permissions_df' in dir():
        excessive_admin_count = excessive_admin_permissions_df.count() if hasattr(excessive_admin_permissions_df, 'count') else 0
        if excessive_admin_count > 0:
            security_alerts_data.append({
                'Alert Type': 'Excessive Admin Permissions',
                'Severity': 'HIGH',
                'Count': excessive_admin_count,
                'Recommendation': 'Review and reduce admin permissions'
            })
    
    if 'sod_violations_df' in dir():
        sod_count = sod_violations_df.count() if hasattr(sod_violations_df, 'count') else 0
        if sod_count > 0:
            security_alerts_data.append({
                'Alert Type': 'SOD Violations',
                'Severity': 'MEDIUM',
                'Count': sod_count,
                'Recommendation': 'Separate development and production access'
            })
    
    security_alerts_pandas = pd.DataFrame(security_alerts_data) if security_alerts_data else pd.DataFrame()
    
    # 4. Execution metadata
    execution_metadata = pd.DataFrame([{
        'Metric': 'Execution Date',
        'Value': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }, {
        'Metric': 'Execution Mode',
        'Value': 'JOB' if is_job_mode else 'INTERACTIVE'
    }, {
        'Metric': 'Total Execution Time',
        'Value': f"{execution_stats.get('total_execution_time', 0):.2f} seconds" if 'execution_stats' in dir() else 'N/A'
    }, {
        'Metric': 'API Calls',
        'Value': str(execution_stats.get('api_calls', 0)) if 'execution_stats' in dir() else 'N/A'
    }, {
        'Metric': 'Success Rate',
        'Value': f"{execution_stats.get('success_rate', 0):.1f}%" if 'execution_stats' in dir() else 'N/A'
    }])
    
    log(f"✓ Summary dashboard data prepared")
    log(f"  Executive summary: {len(executive_summary_pandas)} metrics")
    log(f"  Permissions by type: {len(perms_by_type)} resource types")
    log(f"  Security alerts: {len(security_alerts_pandas)} alerts")
    log(f"  Top users: {len(top_users_summary)} users")
    
    # Store for Excel export
    excel_summary_data = {
        'executive_summary': executive_summary_pandas,
        'permissions_by_type': perms_by_type,
        'top_users': top_users_summary,
        'security_alerts': security_alerts_pandas,
        'execution_metadata': execution_metadata
    }
    
    log("\nℹ️  NOTE: To add these to Excel export, update Cell 43 (comprehensive workbook)")
    log("   Add these sheets FIRST in the workbook for executive visibility:")
    log("   1. Executive Summary")
    log("   2. Security Alerts")
    log("   3. Permissions by Type")
    log("   4. Top Users")
    log("   5. Execution Metadata")
else:
    log("⚠️  Executive summary data not available")
    excel_summary_data = {}



In [0]:
# Export security data to JSON format for API consumption and programmatic access

# Configuration: Enable JSON export
ENABLE_JSON_EXPORT = False  # Set to True to enable JSON export

if ENABLE_JSON_EXPORT:
    cell_start_time = time.time()
    from datetime import datetime
    import pytz
    import json
    
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    json_path = f"{EXPORT_PATH}/json_{timestamp}"
    
    log(f"\nExporting security data to JSON format...")
    log(f"Export location: {json_path}")
    
    # Create JSON export directory
    dbutils.fs.mkdirs(json_path.replace('/dbfs', 'dbfs:'))
    
    # Export key DataFrames to JSON
    json_export_list = [
        ('permissions', permissions_export),
        ('users', users_export),
        ('groups', groups_export),
        ('compliance_report', {
            'inactive_users_with_permissions': inactive_users_with_perms.count() if 'inactive_users_with_perms' in dir() else 0,
            'external_users': external_count if 'external_count' in dir() else 0,
            'workspace_admins': workspace_admins.count() if 'workspace_admins' in dir() else 0,
            'orphaned_permissions': (orphaned_user_count + orphaned_group_count) if 'orphaned_user_count' in dir() and 'orphaned_group_count' in dir() else 0,
            'tokens_without_expiry': tokens_df.filter(F.col('expiry_time').isNull()).count() if 'tokens_df' in dir() and tokens_df.count() > 0 else 0
        })
    ]
    
    json_success = 0
    for name, data in json_export_list:
        try:
            json_file = f"{json_path}/{name}.json"
            
            if isinstance(data, dict):
                # Export dictionary as JSON
                dbutils.fs.put(json_file.replace('/dbfs', 'dbfs:'), json.dumps(data, indent=2), overwrite=True)
                log(f"✓ Exported {name}: compliance summary")
                json_success += 1
            else:
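                # Export DataFrame as JSON. Spark needs the dbfs: URI rather than the /dbfs
                # FUSE path, and writes a directory containing a single part file
                if validate_dataframe_exists(name, data):
                    data.coalesce(1).write.mode('overwrite').json(json_file.replace('/dbfs', 'dbfs:', 1))
                    log(f"✓ Exported {name}: {data.count()} rows")
                    json_success += 1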
        except Exception as e:
            log(f"✗ Error exporting {name} to JSON: {str(e)}")
    
    log(f"\n✓ JSON export complete: {json_success} files created")
    log(f"Location: {json_path}")
    log_execution_time("Export to JSON format", cell_start_time)
else:
    log("\n⏭️  JSON export disabled")
    log("   Set ENABLE_JSON_EXPORT = True to enable JSON export")



In [0]:
# Export security audit data to Delta tables for long-term retention and historical analysis
# This enables tracking changes over multiple audit runs

if ENABLE_DELTA_EXPORT:
    cell_start_time = time.time()
    from pyspark.sql.functions import current_timestamp, lit
    
    log("\n💾 Exporting Security Audit Data to Delta Tables")
    log(f"   Target table: {DELTA_TABLE_NAME}")
    log("="*60)
    
    try:
        # Validate that permissions_export exists and has data
        if not validate_dataframe_exists("permissions_export", permissions_export):
            log("❌ Cannot export to Delta: permissions_export is empty or invalid")
            if is_job_mode:
                raise ValueError("permissions_export DataFrame is empty or invalid")
        else:
            # Prepare the main permissions export with metadata
            permissions_export_delta = permissions_export \
                .withColumn('audit_run_timestamp', current_timestamp()) \
                .withColumn('max_resources_checked', lit(MAX_RESOURCES_PER_TYPE))
            
            # Write main permissions history to Delta table
            log(f"Writing {permissions_export_delta.count()} records to {DELTA_TABLE_NAME}...")
            permissions_export_delta.write \
                .format('delta') \
                .mode('append') \
                .option('mergeSchema', 'true') \
                .saveAsTable(DELTA_TABLE_NAME)
            
            log(f"\n✓ Successfully exported permissions history")
            log(f"   Table: {DELTA_TABLE_NAME}")
            log(f"   Mode: APPEND (historical accumulation)")
            log(f"   Schema merge: ENABLED")
        
        # Create snapshot tables for current state
        snapshot_base = DELTA_TABLE_NAME.rsplit('.', 1)[0]  # Get catalog.schema
        snapshot_count = 0
        snapshot_tables = []
        
        # Core snapshots
        snapshot_configs = [
            ('users_export', 'users_snapshot'),
            ('groups_export', 'groups_snapshot'),
            ('user_perm_summary_export', 'user_permissions_summary_snapshot'),
            ('group_summary_export', 'group_permissions_summary_snapshot'),
            ('sp_export', 'service_principals_snapshot'),
            ('uc_catalog_grants_export', 'uc_catalog_grants_snapshot'),
            ('secret_acls_export', 'secret_acls_snapshot'),
            ('workspace_permissions_df', 'workspace_permissions_snapshot'),
            ('tokens_df', 'tokens_snapshot'),
            ('uc_volume_grants_df', 'uc_volume_grants_snapshot'),
            ('workspace_admins', 'workspace_admins_snapshot'),
            ('inactive_users_with_perms', 'inactive_users_snapshot')
        ]
        
        for df_name, table_suffix in snapshot_configs:
            # Look the DataFrame up by name instead of eval()-ing the string
            snapshot_df = globals().get(df_name)
            if snapshot_df is not None and validate_dataframe_exists(df_name, snapshot_df):
                try:
                    snapshot_table = f"{snapshot_base}.{table_suffix}"
                    snapshot_df \
                        .withColumn('snapshot_timestamp', current_timestamp()) \
                        .write \
                        .format('delta') \
                        .mode('overwrite') \
                        .option('overwriteSchema', 'true') \
                        .saveAsTable(snapshot_table)
                    snapshot_tables.append(snapshot_table)
                    snapshot_count += 1
                except Exception as e:
                    log(f"  ⚠️ Could not create snapshot for {df_name}: {str(e)}")
        
        # Export change detection results if available.
        # Both DataFrames must exist, and either one having rows is enough to export
        if ('new_permissions_df' in dir() and 'removed_permissions_df' in dir()
                and (new_permissions_df.count() > 0 or removed_permissions_df.count() > 0)):
            try:
                change_table = f"{snapshot_base}.permission_changes_snapshot"
                new_permissions_df \
                    .withColumn('change_type', lit('NEW')) \
                    .withColumn('snapshot_timestamp', current_timestamp()) \
                    .unionByName(
                        removed_permissions_df \
                            .withColumn('change_type', lit('REMOVED')) \
                            .withColumn('snapshot_timestamp', current_timestamp())
                    ) \
                    .write \
                    .format('delta') \
                    .mode('overwrite') \
                    .option('overwriteSchema', 'true') \
                    .saveAsTable(change_table)
                snapshot_tables.append(change_table)
                snapshot_count += 1
            except Exception as e:
                log(f"  ⚠️ Could not create change detection snapshot: {str(e)}")
        
        log(f"\n✓ Successfully created {snapshot_count} snapshot tables")
        if snapshot_tables:
            log("\nSnapshot tables created:")
            for table in snapshot_tables:
                log(f"   - {table}")
        
        log("\n📊 Query your security audit history:")
        log(f"   -- View all permissions from today's audit run")
        log(f"   SELECT * FROM {DELTA_TABLE_NAME}")
        log(f"   WHERE audit_run_timestamp >= current_date()")
        log(f"   ORDER BY audit_run_timestamp DESC;")
        log(f"")
        log(f"   -- Compare permissions over time")
        log(f"   SELECT principal, resource_type, resource_name, permission_level,")
        log(f"          audit_run_timestamp")
        log(f"   FROM {DELTA_TABLE_NAME}")
        log(f"   WHERE principal = 'user@example.com'")
        log(f"   ORDER BY audit_run_timestamp DESC;")
        log(f"")
        log(f"   -- View permission changes")
        log(f"   SELECT * FROM {snapshot_base}.permission_changes_snapshot")
        log(f"   WHERE change_type = 'NEW';")
        
        log_execution_time("Export to Delta tables", cell_start_time)
        
    except Exception as e:
        log(f"\n❌ Failed to export to Delta tables: {str(e)}")
        log(f"   Please verify:")
        log(f"   1. You have CREATE TABLE permissions")
        log(f"   2. The catalog and schema exist: {DELTA_TABLE_NAME.rsplit('.', 1)[0]}")
        log(f"   3. The table name is valid")
        if is_job_mode:
            raise
else:
    log("\n⏭️  Delta table export disabled")
    log("   Set ENABLE_DELTA_EXPORT = True in Cell 2 to enable long-term retention")
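
The `permission_changes_snapshot` table holds rows tagged `NEW` or `REMOVED`. The cell that computes `new_permissions_df` and `removed_permissions_df` is not shown in this section, so the following is only a sketch of one way such a comparison can work: a set difference between two audit runs keyed on the columns used in the sample queries above. `diff_permissions` and the sample rows are hypothetical.

```python
def diff_permissions(previous, current):
    """Return (new, removed) permission rows between two audit runs.

    Each row is a tuple of
    (principal, resource_type, resource_name, permission_level).
    """
    prev, curr = set(previous), set(current)
    new = sorted(curr - prev)       # present now, absent in the previous run
    removed = sorted(prev - curr)   # present before, absent now
    return new, removed


previous_run = [
    ("alice@example.com", "job", "nightly_etl", "CAN_MANAGE"),
    ("bob@example.com", "warehouse", "bi", "CAN_USE"),
]
current_run = [
    ("alice@example.com", "job", "nightly_etl", "CAN_MANAGE"),
    ("carol@example.com", "job", "nightly_etl", "CAN_VIEW"),
]
new_rows, removed_rows = diff_permissions(previous_run, current_run)
```

In Spark the same idea is usually expressed with a left anti join between the two runs on the key columns.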



In [0]:
cell_start_time = time.time()

log("Preparing dataframes for Excel export (flattening nested columns)...\n")

# 1. Users with groups - left join users to their group memberships (one row per user-group pair)
users_export = users_df.join(user_groups_df, users_df.user_name == user_groups_df.user_name, 'left') \
    .select(
        users_df.user_name,
        users_df.display_name,
        users_df.active,
        user_groups_df.group_name
    )

log(f"✓ users_export: {users_export.count()} rows (users with their groups, one row per user-group)")

# 2. Groups with members - left join groups to their user memberships (one row per group-member pair)
groups_export = groups_df.join(user_groups_df, groups_df.group_name == user_groups_df.group_name, 'left') \
    .select(
        groups_df.group_name,
        groups_df.group_id,
        user_groups_df.user_name.alias('member_user_name')
    )

log(f"✓ groups_export: {groups_export.count()} rows (groups with their members, one row per group-member)")

# 3. User-group memberships (already flat)
user_groups_export = user_groups_df
log(f"✓ user_groups_export: {user_groups_export.count()} rows (already flat)")

# 4. Permissions (already flat)
permissions_export = permissions_df
log(f"✓ permissions_export: {permissions_df.count()} rows (already flat)")

# 5. Permissions with groups (already flat)
permissions_with_groups_export = permissions_with_groups_df
log(f"✓ permissions_with_groups_export: {permissions_with_groups_df.count()} rows (already flat)")

# 6. User all permissions (direct + inherited) - already flat
user_all_permissions_export = user_all_permissions_df
log(f"✓ user_all_permissions_export: {user_all_permissions_df.count()} rows (already flat)")

# 7. User permissions summary - per-principal counts, then joined to distinct permission levels and source groups
user_perm_summary_flat = user_all_permissions_df.groupBy('principal').agg(
    F.count('*').alias('total_permissions'),
    F.sum(F.when(F.col('permission_source') == 'direct', 1).otherwise(0)).alias('direct_permissions'),
    F.sum(F.when(F.col('permission_source') == 'inherited', 1).otherwise(0)).alias('inherited_permissions'),
    F.countDistinct('resource_type').alias('resource_types_count')
)

# Add permission levels as separate rows
user_perm_levels = user_all_permissions_df.select('principal', 'permission_level').distinct()
user_perm_summary_export = user_perm_summary_flat.join(
    user_perm_levels, 'principal', 'left'
)

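# Add inherited groups as separate rows.
# NOTE: together with the permission-level join above, this produces one row per
# (principal, permission_level, source_group) combination, so the count columns
# repeat on every row for the same principal.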
user_inherited_groups = user_all_permissions_df \
    .filter(F.col('permission_source') == 'inherited') \
    .select('principal', 'source_group').distinct()

user_perm_summary_export = user_perm_summary_export.join(
    user_inherited_groups, 'principal', 'left'
)

log(f"✓ user_perm_summary_export: {user_perm_summary_export.count()} rows (flattened)")

# 8. Group permissions summary - flatten
group_perm_summary_flat = group_permissions.groupBy('principal').agg(
    F.count('*').alias('total_permissions'),
    F.countDistinct('resource_type').alias('resource_types_count')
)

# Add permission levels
group_perm_levels = group_permissions.select('principal', 'permission_level').distinct()
group_summary_export = group_perm_summary_flat.join(
    group_perm_levels, 'principal', 'left'
)

# Add member count
group_members_count = user_groups_df.groupBy('group_name').agg(
    F.count('*').alias('member_count')
)

group_summary_export = group_summary_export \
    .join(group_members_count, group_summary_export.principal == group_members_count.group_name, 'left') \
    .select(
        F.col('principal').alias('group_name'),
        F.col('total_permissions'),
        F.col('resource_types_count'),
        F.col('permission_level'),
        F.coalesce(F.col('member_count'), F.lit(0)).alias('member_count')
    )

log(f"✓ group_summary_export: {group_summary_export.count()} rows (flattened)")

# 9. Optional exports (already flat; only present when the matching collection cells ran)
if 'uc_catalog_grants_df' in dir():
    uc_catalog_grants_export = uc_catalog_grants_df
    log(f"✓ uc_catalog_grants_export: {uc_catalog_grants_export.count()} rows")

if 'uc_schema_grants_df' in dir():
    uc_schema_grants_export = uc_schema_grants_df
    log(f"✓ uc_schema_grants_export: {uc_schema_grants_export.count()} rows")

if 'secret_acls_df' in dir():
    secret_acls_export = secret_acls_df
    log(f"✓ secret_acls_export: {secret_acls_export.count()} rows")

log(f"\n{'='*60}")
log("All dataframes prepared for export!")
log(f"{'='*60}")

log_execution_time("Flatten and prepare dataframes", cell_start_time)
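
Step 7 in the cell above joins the per-principal counts to the distinct permission levels and then to the distinct source groups, both on `principal`. Two successive joins on the same key act like a cross product within each principal. The sketch below uses plain Python and hypothetical sample data (the user, levels, and group names are made up) to show the resulting row count, plus one possible compact alternative that keeps a single row per principal. It illustrates the shape of the data only and is not the notebook's implementation.

```python
from itertools import product

# Hypothetical sample data for a single principal
principal = 'alice@example.com'
levels = ['CAN_VIEW', 'CAN_RUN', 'CAN_MANAGE']
source_groups = ['analysts', 'admins']

# Joining levels and groups on the same key yields every combination
joined_rows = [
    {'principal': principal, 'permission_level': lvl, 'source_group': grp}
    for lvl, grp in product(levels, source_groups)
]
print(len(joined_rows))  # 6

# Alternative shape: one row per principal with delimited values
compact_row = {
    'principal': principal,
    'permission_levels': ', '.join(sorted(levels)),
    'source_groups': ', '.join(sorted(source_groups)),
}
print(compact_row['permission_levels'])  # CAN_MANAGE, CAN_RUN, CAN_VIEW
```

When reading the `User_Summary` sheet, filter or de-duplicate on `principal` before summing the count columns, otherwise totals are inflated by the repeated rows.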



In [0]:
import os
import pandas as pd
from datetime import datetime
import pytz

# Create timestamp for filenames in Eastern Time
eastern = pytz.timezone('America/New_York')
timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
base_path = EXPORT_PATH if 'EXPORT_PATH' in dir() else '/dbfs/tmp/permissions_export'

# Make sure the export directory exists before writing files into it
os.makedirs(base_path, exist_ok=True)

if not is_job_mode:
    print(f"Exporting dataframes to Excel files...\n")
    print(f"Export location: {base_path}")
    print(f"Timestamp (Eastern Time): {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}\n")

# Export each dataframe
export_list = [
    ('permission_reference', permission_reference_df),
    ('users_with_groups', users_export),
    ('groups_with_members', groups_export),
    ('user_groups', user_groups_export),
    ('permissions', permissions_export),
    ('permissions_with_groups', permissions_with_groups_export),
    ('user_all_permissions', user_all_permissions_export),
    ('user_permissions_summary', user_perm_summary_export),
    ('group_permissions_summary', group_summary_export)
]

for name, df in export_list:
    try:
        # Convert to pandas
        pdf = df.toPandas()
        
        # Convert any timestamp columns to Eastern Time
        for col in pdf.columns:
            if pd.api.types.is_datetime64_any_dtype(pdf[col]):
                if pdf[col].dt.tz is None:
                    # Naive timestamps are assumed to be UTC
                    pdf[col] = pdf[col].dt.tz_localize('UTC')
                # Convert to Eastern, then drop the tz info: Excel cannot store
                # timezone-aware datetimes and to_excel() raises ValueError on them
                pdf[col] = pdf[col].dt.tz_convert(eastern).dt.tz_localize(None)
        
        # Create filename
        filename = f"{base_path}/{name}_{timestamp}.xlsx"
        
        # Export to Excel
        pdf.to_excel(filename, index=False, engine='openpyxl')
        
        if not is_job_mode:
            print(f"✓ Exported {name}: {len(pdf)} rows → {filename}")
    except Exception as e:
        if not is_job_mode:
            print(f"✗ Error exporting {name}: {str(e)}")
        # In job mode, raise the exception to fail the job
        if is_job_mode:
            raise

if not is_job_mode:
    print(f"\n{'='*60}")
    print("Export complete!")
    print(f"{'='*60}")
    print(f"\nFiles saved to: {base_path}/")
    print(f"\nTo download files, use the Databricks file browser or CLI:")
    print(f"  databricks fs cp -r {base_path.replace('/dbfs', 'dbfs:')} ./local_folder/")
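
The download hints in the export cells turn a `/dbfs/...` FUSE path into a `dbfs:/...` URI with a plain string replace. An unanchored replace rewrites every `/dbfs` substring, which matters if a folder name further down the path also starts with `dbfs`. The sketch below shows a prefix-only conversion. The helper `to_dbfs_uri` and the sample paths are hypothetical and not part of the notebook.

```python
def to_dbfs_uri(local_path: str) -> str:
    """Convert a '/dbfs/...' FUSE path to a 'dbfs:/...' URI; leave other paths unchanged."""
    prefix = '/dbfs/'
    if local_path.startswith(prefix):
        return 'dbfs:/' + local_path[len(prefix):]
    return local_path


print(to_dbfs_uri('/dbfs/tmp/permissions_export'))  # dbfs:/tmp/permissions_export
print(to_dbfs_uri('/dbfs/exports/dbfs_audit'))      # dbfs:/exports/dbfs_audit

# For comparison, the unanchored replace also rewrites the second match
print('/dbfs/exports/dbfs_audit'.replace('/dbfs', 'dbfs:'))  # dbfs:/exportsdbfs:_audit
```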



In [0]:
import pandas as pd
from datetime import datetime
import pytz

# Create timestamp for filename in Eastern Time
eastern = pytz.timezone('America/New_York')
timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
base_path = EXPORT_PATH if 'EXPORT_PATH' in dir() else '/dbfs/tmp/permissions_export'
combined_filename = f"{base_path}/all_permissions_{timestamp}.xlsx"

if not is_job_mode:
    print(f"Creating combined Excel workbook with all sheets...")
    print(f"Timestamp (Eastern Time): {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}\n")

try:
    # Create Excel writer
    with pd.ExcelWriter(combined_filename, engine='openpyxl') as writer:
        
        # Export each dataframe as a separate sheet
        sheets = [
            ('Reference', permission_reference_df),
            ('Users_Groups', users_export),
            ('Groups_Members', groups_export),
            ('User_Groups', user_groups_export),
            ('Permissions', permissions_export),
            ('Perms_With_Groups', permissions_with_groups_export),
            ('User_All_Perms', user_all_permissions_export),
            ('User_Summary', user_perm_summary_export),
            ('Group_Summary', group_summary_export)
        ]
        
        for sheet_name, df in sheets:
            # Convert to pandas
            pdf = df.toPandas()
            
            # Convert any timestamp columns to Eastern Time
            for col in pdf.columns:
                if pd.api.types.is_datetime64_any_dtype(pdf[col]):
                    if pdf[col].dt.tz is None:
                        # Naive timestamps are assumed to be UTC
                        pdf[col] = pdf[col].dt.tz_localize('UTC')
                    # Convert to Eastern, then drop the tz info: Excel cannot store
                    # timezone-aware datetimes and to_excel() raises ValueError on them
                    pdf[col] = pdf[col].dt.tz_convert(eastern).dt.tz_localize(None)
            
            # Write to sheet (Excel sheet names limited to 31 chars)
            sheet_name_clean = sheet_name[:31]
            pdf.to_excel(writer, sheet_name=sheet_name_clean, index=False)
            
            if not is_job_mode:
                print(f"✓ Added sheet '{sheet_name_clean}': {len(pdf)} rows")
    
    if not is_job_mode:
        print(f"\n{'='*60}")
        print("Combined workbook created successfully!")
        print(f"{'='*60}")
        print(f"\nFile: {combined_filename}")
        print(f"\nTo download:")
        print(f"  databricks fs cp {combined_filename.replace('/dbfs', 'dbfs:', 1)} ./")
    
except Exception as e:
    if is_job_mode:
        # Re-raise so the job fails instead of silently skipping the workbook
        raise
    print(f"\n✗ Error creating combined workbook: {str(e)}")
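
The combined workbook cell truncates sheet names to Excel's 31-character limit. Excel also rejects the characters `[ ] : * ? / \` in sheet names and treats names as case-insensitive when checking for duplicates, so truncation alone can still produce an invalid or colliding name. The sketch below covers those cases. The helper `safe_sheet_names` and the sample names are hypothetical and not part of the notebook.

```python
import re

INVALID_SHEET_CHARS = re.compile(r'[\[\]:*?/\\]')
MAX_SHEET_NAME_LEN = 31


def safe_sheet_names(names):
    """Return Excel-safe sheet names: invalid characters replaced, truncated, unique."""
    seen = set()
    result = []
    for name in names:
        base = INVALID_SHEET_CHARS.sub('_', name)[:MAX_SHEET_NAME_LEN]
        candidate = base
        counter = 1
        # Excel compares sheet names case-insensitively
        while candidate.lower() in seen:
            suffix = f"_{counter}"
            candidate = base[:MAX_SHEET_NAME_LEN - len(suffix)] + suffix
            counter += 1
        seen.add(candidate.lower())
        result.append(candidate)
    return result


print(safe_sheet_names(['Perms/With:Groups', 'User_Summary', 'User_Summary']))
# ['Perms_With_Groups', 'User_Summary', 'User_Summary_1']
```

The sheet names used in the notebook are already short and valid, so this only becomes relevant if sheet names are generated from resource or catalog names.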



In [0]:
# Job completion summary with execution statistics
import json
from datetime import datetime
import pytz

# Calculate total execution time
total_execution_time = time.time() - execution_stats['start_time']

# Create completion summary
eastern = pytz.timezone('America/New_York')
completion_time = datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')

summary = {
    'status': 'SUCCESS',
    'completion_time': completion_time,
    'execution_time_seconds': round(total_execution_time, 2),
    'execution_time_minutes': round(total_execution_time / 60, 2),
    'mode': 'JOB' if is_job_mode else 'INTERACTIVE',
    'configuration': {
        'max_resources_per_type': MAX_RESOURCES_PER_TYPE,
        'max_workers': MAX_WORKERS,
        'max_retries': MAX_RETRIES,
        'export_path': EXPORT_PATH if 'EXPORT_PATH' in dir() else '/dbfs/tmp/permissions_export',
        'excel_export_enabled': ENABLE_EXCEL_EXPORT,
        'delta_export_enabled': ENABLE_DELTA_EXPORT
    },
    'data_collected': {
        'users': users_df.count() if 'users_df' in dir() else 0,
        'groups': groups_df.count() if 'groups_df' in dir() else 0,
        'user_groups': user_groups_df.count() if 'user_groups_df' in dir() else 0,
        'permissions': permissions_df.count() if 'permissions_df' in dir() else 0,
        'service_principals': service_principals_df.count() if 'service_principals_df' in dir() else 0,
        'secret_scopes': secret_scopes_df.count() if 'secret_scopes_df' in dir() else 0,
        'uc_catalogs': uc_catalogs_df.count() if 'uc_catalogs_df' in dir() else 0,
        'workspace_permissions': workspace_permissions_df.count() if 'workspace_permissions_df' in dir() else 0,
        'tokens': tokens_df.count() if 'tokens_df' in dir() else 0,
        'uc_volumes': uc_volume_grants_df.count() if 'uc_volume_grants_df' in dir() else 0,
        'workspace_admins': workspace_admins.count() if 'workspace_admins' in dir() else 0
    },
    'execution_stats': {
        'api_calls': execution_stats['api_calls'],
        'api_failures': execution_stats['api_failures'],
        'api_retries': execution_stats['api_retries'],
        'resources_processed': execution_stats['resources_processed'],
        'success_rate_percent': round(((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls'] * 100), 2) if execution_stats['api_calls'] > 0 else 0
    },
    'compliance_alerts': {
        'inactive_users_with_permissions': inactive_users_with_perms.select('user_name').distinct().count() if 'inactive_users_with_perms' in dir() else 0,
        'orphaned_permissions': (orphaned_user_count + orphaned_group_count) if 'orphaned_user_count' in dir() and 'orphaned_group_count' in dir() else 0,
        'tokens_without_expiry': tokens_df.filter(F.col('expiry_time').isNull()).count() if 'tokens_df' in dir() and tokens_df.count() > 0 else 0,
        'permission_changes': (new_permissions_df.count() + removed_permissions_df.count()) if 'new_permissions_df' in dir() and 'removed_permissions_df' in dir() else 0
    }
}

if is_job_mode:
    # In job mode, output JSON for programmatic consumption
    print(json.dumps(summary, indent=2))
    
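    # Return the summary to the job orchestrator.
    # NOTE: dbutils.notebook.exit() ends the notebook run here, so any cells
    # below this one do not execute in job mode.
    dbutils.notebook.exit(json.dumps(summary))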
else:
    # In interactive mode, show friendly summary
    log("\n" + "="*60)
    log("NOTEBOOK EXECUTION COMPLETED SUCCESSFULLY")
    log("="*60)
    log(f"\nCompletion Time: {completion_time}")
    log(f"Total Execution Time: {summary['execution_time_minutes']:.2f} minutes")
    log(f"Mode: {summary['mode']}")
    
    log(f"\nData Collected:")
    for key, value in summary['data_collected'].items():
        log(f"  {key}: {value:,}")
    
    log(f"\nExecution Statistics:")
    log(f"  API calls: {summary['execution_stats']['api_calls']}")
    log(f"  Resources processed: {summary['execution_stats']['resources_processed']}")
    log(f"  API failures: {summary['execution_stats']['api_failures']}")
    log(f"  API retries: {summary['execution_stats']['api_retries']}")
    log(f"  Success rate: {summary['execution_stats']['success_rate_percent']:.1f}%")
    
    log(f"\nCompliance Alerts:")
    for key, value in summary['compliance_alerts'].items():
        alert_icon = '⚠️' if value > 0 else '✓'
        log(f"  {alert_icon} {key}: {value}")
    
    log(f"\nExport Configuration:")
    log(f"  Excel export: {'ENABLED' if ENABLE_EXCEL_EXPORT else 'DISABLED'}")
    log(f"  Delta export: {'ENABLED' if ENABLE_DELTA_EXPORT else 'DISABLED'}")
    log(f"  Export location: {summary['configuration']['export_path']}")
    
    log("\n" + "="*60)
    log("✓ Security audit complete!")
    log("="*60)
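
In job mode the summary is handed to `dbutils.notebook.exit()` as a JSON string, which a calling notebook receives as the return value of `dbutils.notebook.run()`. The sketch below shows how a caller might parse that string and total the `compliance_alerts` section. The helper `count_alerts` and the sample payload are hypothetical; only the key names come from the summary dictionary built above.

```python
import json


def count_alerts(summary_json: str) -> int:
    """Total of all compliance alert counters in the exit payload."""
    summary = json.loads(summary_json)
    alerts = summary.get('compliance_alerts', {})
    return sum(value for value in alerts.values() if value)


# Hypothetical payload with the same shape as the notebook's summary
sample = json.dumps({
    'status': 'SUCCESS',
    'compliance_alerts': {
        'inactive_users_with_permissions': 2,
        'orphaned_permissions': 0,
        'tokens_without_expiry': 1,
        'permission_changes': 0,
    },
})
print(count_alerts(sample))  # 3
```

A downstream task could use a non-zero total to trigger a notification while still treating the audit run itself as successful.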



In [0]:
# Generate HTML summary report for executive viewing
# This creates a human-readable HTML report with key findings

ENABLE_HTML_REPORT = False  # Set to True to enable HTML report generation

if ENABLE_HTML_REPORT:
    cell_start_time = time.time()
    from datetime import datetime
    import pytz
    
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    report_time = datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')
    html_file = f"{EXPORT_PATH}/security_audit_report_{timestamp}.html"
    
    log("\nGenerating HTML summary report...")
    
    # Build HTML report
    html_content = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Databricks Security Audit Report</title>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; background-color: #f5f5f5; }}
            .container {{ max-width: 1200px; margin: 0 auto; background-color: white; padding: 30px; box-shadow: 0 0 10px rgba(0,0,0,0.1); }}
            h1 {{ color: #FF3621; border-bottom: 3px solid #FF3621; padding-bottom: 10px; }}
            h2 {{ color: #333; margin-top: 30px; border-bottom: 2px solid #ddd; padding-bottom: 5px; }}
            .metric {{ display: inline-block; margin: 15px 20px; padding: 15px; background-color: #f9f9f9; border-left: 4px solid #FF3621; }}
            .metric-label {{ font-size: 12px; color: #666; text-transform: uppercase; }}
            .metric-value {{ font-size: 28px; font-weight: bold; color: #333; }}
            .alert {{ background-color: #fff3cd; border-left: 4px solid #ffc107; padding: 15px; margin: 10px 0; }}
            .success {{ background-color: #d4edda; border-left: 4px solid #28a745; padding: 15px; margin: 10px 0; }}
            .warning {{ background-color: #f8d7da; border-left: 4px solid #dc3545; padding: 15px; margin: 10px 0; }}
            table {{ width: 100%; border-collapse: collapse; margin: 20px 0; }}
            th {{ background-color: #FF3621; color: white; padding: 12px; text-align: left; }}
            td {{ padding: 10px; border-bottom: 1px solid #ddd; }}
            tr:hover {{ background-color: #f5f5f5; }}
            .footer {{ margin-top: 40px; padding-top: 20px; border-top: 2px solid #ddd; color: #666; font-size: 12px; }}
        </style>
    </head>
    <body>
        <div class="container">
            <h1>🔒 Databricks Security Audit Report</h1>
            <p><strong>Generated:</strong> {report_time}</p>
            <p><strong>Workspace:</strong> Databricks Production Environment</p>
            
            <h2>📊 Executive Summary</h2>
            
            <div class="metric">
                <div class="metric-label">Total Users</div>
                <div class="metric-value">{users_df.count() if 'users_df' in dir() else 0}</div>
            </div>
            
            <div class="metric">
                <div class="metric-label">Total Groups</div>
                <div class="metric-value">{groups_df.count() if 'groups_df' in dir() else 0}</div>
            </div>
            
            <div class="metric">
                <div class="metric-label">Total Permissions</div>
                <div class="metric-value">{permissions_df.count() if 'permissions_df' in dir() else 0}</div>
            </div>
            
            <div class="metric">
                <div class="metric-label">Workspace Admins</div>
                <div class="metric-value">{workspace_admins.count() if 'workspace_admins' in dir() else 0}</div>
            </div>
            
            <h2>⚠️ Security Alerts</h2>
    """
    
    # Add compliance alerts
    alerts_found = False
    
    if 'inactive_users_with_perms' in dir():
        inactive_count = inactive_users_with_perms.select('user_name').distinct().count()
        if inactive_count > 0:
            html_content += f"""
            <div class="warning">
                <strong>⚠️ Inactive Users with Permissions:</strong> {inactive_count} inactive users still have active permissions. 
                <em>Recommendation: Remove permissions for inactive users.</em>
            </div>
            """
            alerts_found = True
    
    if 'orphaned_user_count' in dir() and 'orphaned_group_count' in dir():
        total_orphaned = orphaned_user_count + orphaned_group_count
        if total_orphaned > 0:
            html_content += f"""
            <div class="warning">
                <strong>⚠️ Orphaned Permissions:</strong> {total_orphaned} permissions exist for deleted users/groups.
                <em>Recommendation: Clean up orphaned permissions.</em>
            </div>
            """
            alerts_found = True
    
    if 'tokens_df' in dir() and tokens_df.count() > 0:
        no_expiry = tokens_df.filter(F.col('expiry_time').isNull()).count()
        if no_expiry > 0:
            html_content += f"""
            <div class="warning">
                <strong>⚠️ Tokens Without Expiration:</strong> {no_expiry} tokens have no expiration date.
                <em>Recommendation: Set expiration dates for all tokens.</em>
            </div>
            """
            alerts_found = True
    
    if 'new_permissions_df' in dir() and 'removed_permissions_df' in dir():
        changes = new_permissions_df.count() + removed_permissions_df.count()
        if changes > 0:
            html_content += f"""
            <div class="alert">
                <strong>🔄 Permission Changes Detected:</strong> {new_permissions_df.count()} new permissions, {removed_permissions_df.count()} removed permissions since last audit.
            </div>
            """
            alerts_found = True
    
    if not alerts_found:
        html_content += """
        <div class="success">
            <strong>✓ No Critical Security Alerts</strong> - All security checks passed.
        </div>
        """
    
    # Add resource coverage summary
    html_content += f"""
        <h2>📋 Resource Coverage</h2>
        <table>
            <tr><th>Resource Type</th><th>Count</th><th>Status</th></tr>
            <tr><td>Jobs</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>SQL Warehouses</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>Clusters (Interactive)</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>Pipelines</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>Workspace Objects</td><td>{workspace_permissions_df.count() if 'workspace_permissions_df' in dir() else 0}</td><td>✓ Covered</td></tr>
            <tr><td>Model Registry</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>Repos (Git)</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>Instance Pools</td><td>Included</td><td>✓ Covered</td></tr>
            <tr><td>Unity Catalog</td><td>{uc_catalogs_df.count() if 'uc_catalogs_df' in dir() else 0} catalogs</td><td>✓ Covered</td></tr>
            <tr><td>Secret Scopes</td><td>{secret_scopes_df.count() if 'secret_scopes_df' in dir() else 0}</td><td>✓ Covered</td></tr>
            <tr><td>Tokens</td><td>{tokens_df.count() if 'tokens_df' in dir() else 0}</td><td>✓ Covered</td></tr>
        </table>
        
        <h2>📈 Execution Statistics</h2>
        <table>
            <tr><th>Metric</th><th>Value</th></tr>
            <tr><td>Total Execution Time</td><td>{round(total_execution_time / 60, 2)} minutes</td></tr>
            <tr><td>API Calls Made</td><td>{execution_stats['api_calls']}</td></tr>
            <tr><td>Resources Processed</td><td>{execution_stats['resources_processed']}</td></tr>
            <tr><td>API Failures</td><td>{execution_stats['api_failures']}</td></tr>
            <tr><td>API Retries</td><td>{execution_stats['api_retries']}</td></tr>
            <tr><td>Success Rate</td><td>{round(((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls'] * 100), 2) if execution_stats['api_calls'] > 0 else 0}%</td></tr>
        </table>
        
        <div class="footer">
            <p>This report was generated by the Databricks Security Review Export notebook.</p>
            <p>For detailed data, refer to the Excel workbook or Delta tables.</p>
        </div>
        </div>
    </body>
    </html>
    """
    
    # Write HTML file
    try:
        dbutils.fs.put(html_file.replace('/dbfs', 'dbfs:'), html_content, overwrite=True)
        log(f"✓ HTML report generated: {html_file}")
        log(f"\nTo view the report:")
        log(f"  1. Download: databricks fs cp {html_file.replace('/dbfs', 'dbfs:')} ./")
        log(f"  2. Open in web browser")
        log_execution_time("Generate HTML summary report", cell_start_time)
    except Exception as e:
        log(f"❌ Error generating HTML report: {str(e)}")
else:
    log("\n⏭️  HTML report generation disabled")
    log("   Set ENABLE_HTML_REPORT = True to enable HTML summary report")
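
The HTML report interpolates values directly into the markup. That is safe for the integer counts used today, but any free-text value (for example a user, group, or workspace name) should be escaped before it is embedded. The sketch below uses the standard library's `html.escape`. The helper `render_alert` is hypothetical and only mirrors the alert `<div>` structure used in the report cell.

```python
from html import escape


def render_alert(css_class: str, title: str, message: str) -> str:
    """Build one alert block with every interpolated value HTML-escaped."""
    return (
        f'<div class="{escape(css_class, quote=True)}">'
        f'<strong>{escape(title)}:</strong> {escape(message)}'
        f'</div>'
    )


print(render_alert(
    'warning',
    'Orphaned Permissions',
    '3 permissions exist for <deleted> users & groups.',
))
# <div class="warning"><strong>Orphaned Permissions:</strong> 3 permissions exist for &lt;deleted&gt; users &amp; groups.</div>
```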



In [0]:
print("\n" + "="*60)
print("EXPORT SUMMARY")
print("="*60)

print("\nDataframes exported:")
print("\n0. permission_reference.xlsx")
print("   - Definitions of all permission levels by resource type")
print(f"   - {permission_reference_df.count()} rows")

print("\n1. users_with_groups.xlsx")
print("   - User name, display name, active status, and group membership")
print("   - One row per user-group relationship")
print(f"   - {users_export.count()} rows")

print("\n2. groups_with_members.xlsx")
print("   - Group name, ID, and member users")
print("   - One row per group-member relationship")
print(f"   - {groups_export.count()} rows")

print("\n3. user_groups.xlsx")
print("   - Simple user-to-group mapping")
print(f"   - {user_groups_export.count()} rows")

print("\n4. permissions.xlsx")
print("   - All permissions (users, groups, service principals)")
print("   - Resource type, ID, name, principal, permission level")
print(f"   - {permissions_export.count()} rows")

print("\n5. permissions_with_groups.xlsx")
print("   - Permissions enriched with user group associations")
print(f"   - {permissions_with_groups_export.count()} rows")

print("\n6. user_all_permissions.xlsx")
print("   - All user permissions (direct + inherited from groups)")
print("   - Shows permission source and source group")
print(f"   - {user_all_permissions_export.count()} rows")

print("\n7. user_permissions_summary.xlsx")
print("   - Summary by user with permission counts and levels")
print(f"   - {user_perm_summary_export.count()} rows")

print("\n8. group_permissions_summary.xlsx")
print("   - Summary by group with permission counts and member count")
print(f"   - {group_summary_export.count()} rows")

print("\n" + "="*60)
print(f"\nAll files are located in: {base_path}/")
print("\nCombined workbook: all_permissions_<timestamp>.xlsx")
print("  (Contains all 9 sheets in one file, including permission reference)")
print("\n" + "="*60)
print("\nTIP: Open the 'Reference' sheet first to understand permission levels")

