# Workspace Asset Scanner - Default Naming Convention Detector

## Overview

This notebook audits a Databricks workspace by scanning notebooks, SQL queries, dashboards, Genie spaces, jobs, SQL alerts, DLT pipelines, and MLflow experiments for assets that still carry default names (e.g., "Untitled Notebook", "Untitled Query", "New Dashboard", "Untitled Job", or the "Default" experiment). It runs either interactively or as a scheduled job.

**✨ Covers 8 asset types, with owner tracking, multiple export formats, and configurable performance presets.**

---

## Features

### Core Functionality
* Scans **8 asset types** across entire workspace:
  * Notebooks
  * SQL Queries
  * Dashboards
  * Genie Spaces
  * Jobs/Workflows
  * SQL Alerts
  * DLT Pipelines (Lakeflow Spark Declarative Pipelines)
  * MLflow Experiments
* Identifies assets with default naming patterns
* Extracts owner/user information for all assets
* Flexible execution mode (interactive vs job)
* Path exclusion filters (e.g., skip /Repos)
* Age-based filtering (optional)

### Performance & Reliability
* **Retry Logic**: Automatic retry with exponential backoff for transient API failures
* **Progress Tracking**: Real-time progress updates during long-running operations
* **Execution Time Tracking**: Per-cell execution time monitoring
* **Configuration Validation**: Validates all configuration parameters before execution
* **Error Handling**: Comprehensive error handling with detailed logging
* **Execution Statistics**: Tracks API calls, failures, retries, success rates
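
A minimal sketch of the retry behavior listed above, assuming that any raised exception counts as a transient failure. `call_with_backoff` and its injectable `sleep` parameter are illustrative names rather than part of the notebook; `sleep` is a parameter only so the delays can be inspected without waiting.

```python
import time


def call_with_backoff(func, *args, max_retries=3, base_delay=2.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying on exceptions with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error to the caller
            # The delay doubles on every retry: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** attempt))
```

Note that `requests.get` does not raise on HTTP error statuses, so a wrapper like this only retries connection-level errors unless the wrapped function raises for 429/5xx responses itself.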

### Data Quality
* **Owner Information**: Captures owner/user for all assets
  * Notebooks in /Users/ paths: Extracted from path
  * Notebooks in shared folders: Fetched via workspace API
  * SQL queries: Extracted from query metadata
  * Dashboards: Extracted from dashboard metadata
  * Genie spaces: Extracted from space metadata
  * Jobs: Extracted from creator_user_name
  * SQL Alerts: Extracted from owner or user metadata
  * DLT Pipelines: Extracted from creator_user_name
  * MLflow Experiments: Extracted from tags or artifact location
* **Data Validation**: Validates DataFrame integrity before export
* **Quality Metrics**: Reports on data completeness and coverage
* **Deduplication**: Automatic duplicate detection and removal
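
The path-based owner extraction and the deduplication step could look roughly like the sketch below. The helper names are hypothetical, and using `(asset_type, object_id)` as the duplicate key is an assumption; the notebook may key on something else.

```python
import re

# Notebooks under /Users/<email>/... carry the owner in the path itself.
USER_PATH = re.compile(r"^/Users/([^/]+)/")


def owner_from_path(path):
    """Return the user segment of a /Users/ path, or None for shared paths."""
    match = USER_PATH.match(path)
    return match.group(1) if match else None


def dedupe_assets(rows):
    """Keep the first row seen for each (asset_type, object_id) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["asset_type"], row["object_id"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```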

### Export Formats
* **Delta Table**: Historical accumulation with schema evolution (append mode)
* **Excel Workbook**: Multi-sheet workbook with:
  * All Assets sheet
  * Summary sheet with metrics
  * By Type breakdown
  * By Owner analysis
  * Cleanup Recommendations (stale assets)
  * Execution Statistics
* **HTML Report**: Styled web report with summary, analysis, and recommendations
* **Timezone Support**: All timestamps converted to Eastern Time

### Analysis & Reporting
* **Summary Reports**: Comprehensive statistics and breakdowns
* **Owner Analysis**: Top users with most default-named assets
* **Recent Activity**: Assets modified in last 90 days
* **Age Distribution**: Histogram of asset ages by bucket
* **Cleanup Recommendations**: Identifies stale assets for deletion
* **Data Quality Metrics**: Coverage and completeness reporting
* **Execution Summary**: API performance and success rates
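
As a rough illustration of the age distribution and cleanup logic, the sketch below buckets an asset by days since its last modification and flags it as stale past a threshold. The bucket boundaries and function names are assumptions chosen for the example; only the 180-day default comes from the configuration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical bucket boundaries (inclusive upper bound in days, label).
AGE_BUCKETS = [
    (30, "0-30 days"),
    (90, "31-90 days"),
    (180, "91-180 days"),
    (365, "181-365 days"),
]


def age_bucket(modified_at, now):
    """Return the label of the first bucket whose upper bound covers the age."""
    age_days = (now - modified_at).days
    for upper, label in AGE_BUCKETS:
        if age_days <= upper:
            return label
    return "365+ days"


def is_stale(modified_at, now, stale_threshold_days=180):
    """An asset is stale when it was last modified more than the threshold ago."""
    return (now - modified_at) > timedelta(days=stale_threshold_days)
```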

---

## Version Control

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-02-16 | Assistant | Comprehensive workspace asset scanner with complete coverage across 8 asset types (Notebooks, SQL Queries, Dashboards, Genie Spaces, Jobs/Workflows, SQL Alerts, DLT Pipelines, MLflow Experiments). Features include: owner tracking for all asset types, performance presets (Quick/Full mode), widget parameters (output_catalog, output_schema, exclude_repos, min_age_days, stale_threshold_days), concurrent API calls with ThreadPoolExecutor, retry logic with exponential backoff, memory usage monitoring, data deduplication, incremental scan mode, multiple export formats (Delta table with append mode, Excel multi-sheet workbook, HTML report), execution statistics tracking, asset age distribution analysis, cleanup recommendations for stale assets, email notifications for job mode, configuration validation, and comprehensive error handling. |

---

## Configuration

### Widget Parameters (Dynamic Configuration):
* `execution_mode` - Interactive or Job mode (dropdown)
* `output_catalog` - Target catalog for Delta table (default: main)
* `output_schema` - Target schema for Delta table (default: default)
* `exclude_repos` - Exclude /Repos folder from scan (Yes/No dropdown)
* `min_age_days` - Only include assets last modified more than X days ago (leave empty to include all)
* `stale_threshold_days` - Days threshold for cleanup recommendations (default: 180)
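
Widget values always arrive as strings, so the numeric parameters need parsing before use. The helper below is a sketch of one way to do that with a clearer error than a bare `int()` call would give; `parse_optional_days` is a hypothetical name, not a function in the notebook.

```python
def parse_optional_days(raw, default=None):
    """Parse a widget string into a non-negative day count, or return default if blank."""
    text = (raw or "").strip()
    if not text:
        return default
    try:
        days = int(text)
    except ValueError:
        raise ValueError(f"Expected a whole number of days, got: {raw!r}") from None
    if days < 0:
        raise ValueError(f"Day count must not be negative, got: {days}")
    return days
```

For example, `parse_optional_days("", default=180)` yields the stale threshold default, while an empty `min_age_days` parses to `None` and disables age filtering.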

### Performance Settings (Cell 2):
* `MAX_RETRIES = 3` - Retries for failed API calls
* `RETRY_DELAY = 2` - Base delay in seconds between retries (the delay grows with each attempt)
* `MAX_WORKERS = 10` - Parallel threads for API calls
* `MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 100` - Limit for detailed owner lookup in shared folders

### Performance Presets (Cell 3):
* **Quick Mode**: Limits to 1,000 notebooks, skips shared folder owner lookup (∼5-10 min)
* **Full Mode**: No limits, complete owner info (∼15-30 min)
* **Custom Mode**: Use manual configuration from Cell 2
* **Auto-enables Full Mode** when running as scheduled job
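
The precedence between presets can be summarized in a small function: job mode wins, then an explicit preset, then the custom values. This is an illustrative sketch; `ScanLimits` and `resolve_limits` are hypothetical names, and `None` is used here to mean "no limit".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanLimits:
    max_notebooks: Optional[int]       # None means no limit
    owner_lookup_limit: Optional[int]  # None means no limit
    shared_folder_owner_lookup: bool


def resolve_limits(preset, is_job_mode, custom):
    """Pick the effective limits; scheduled jobs always get Full Mode."""
    if is_job_mode or preset == "full":
        return ScanLimits(None, None, True)
    if preset == "quick":
        return ScanLimits(1000, 0, False)
    return custom  # Custom Mode: use the manually configured values
```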

### Export Format Settings:
* `ENABLE_EXCEL_EXPORT = True` - Excel workbook generation (multi-sheet)
* `ENABLE_HTML_EXPORT = True` - HTML report generation
* `ENABLE_DELTA_EXPORT = True` - Delta table for long-term retention
* `ENABLE_EMAIL_NOTIFICATIONS = False` - Email alerts for job mode
* `ENABLE_INCREMENTAL_SCAN = False` - Only scan changed assets
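
With incremental scanning enabled, the idea is to keep only assets changed since the previous run. A minimal sketch, assuming each asset record carries a millisecond timestamp in a field called `modified_at_ms` (a hypothetical name) and that assets with no timestamp should be kept rather than silently dropped:

```python
def filter_modified_since(assets, last_scan_ms):
    """Return assets modified after last_scan_ms; everything if there was no prior scan."""
    if last_scan_ms is None:
        return list(assets)
    kept = []
    for asset in assets:
        modified = asset.get("modified_at_ms")
        # Keep assets with an unknown timestamp so they are never missed.
        if modified is None or modified > last_scan_ms:
            kept.append(asset)
    return kept
```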

---

## Usage

### Interactive Mode
1. Configure widget parameters at the top of the notebook
2. Optionally uncomment a performance preset in Cell 3 (Quick/Full Mode)
3. Run all cells to scan your workspace
4. View detailed analysis and visualizations
5. Download exported files from `/dbfs/tmp/workspace_scan_export/`

### Job Mode
1. Schedule as a Databricks job
2. Set widget parameters in job configuration
3. Automatically runs in 'job' mode with Full Mode enabled
4. Exports results to configured locations
5. Returns JSON summary for orchestration
6. Optionally sends email notifications
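
The JSON summary returned in job mode might be assembled along these lines. The field names are assumptions made for the example, not the notebook's actual output schema.

```python
import json
from collections import Counter


def build_job_summary(assets, elapsed_seconds):
    """Serialize scan totals so a calling job or workflow can parse them."""
    counts = Counter(asset["asset_type"] for asset in assets)
    summary = {
        "status": "success",
        "total_default_named_assets": len(assets),
        "by_type": dict(counts),
        "elapsed_seconds": round(elapsed_seconds, 2),
    }
    return json.dumps(summary, sort_keys=True)
```

In a Databricks job, the resulting string would typically be handed back with `dbutils.notebook.exit(summary)` so that downstream tasks can read it.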

---

## Asset Types Scanned

| Asset Type | API Endpoint | Default Patterns Detected |
|------------|--------------|---------------------------|
| Notebooks | /api/2.0/workspace/list | "Untitled Notebook", "Untitled", "New Notebook" |
| SQL Queries | /api/2.0/preview/sql/queries | "Untitled Query", "New Query", "Untitled" |
| Dashboards | /api/2.0/preview/sql/dashboards | "New Dashboard", "Untitled Dashboard", "Untitled" |
| Genie Spaces | /api/2.0/genie/spaces | "New Genie Space", "Untitled Space", "Untitled", "New Space" |
| Jobs | /api/2.1/jobs/list | "Untitled Job", "New Job", "Untitled" |
| SQL Alerts | /api/2.0/alerts (+ legacy) | "New Alert", "Untitled Alert", "Untitled" |
| DLT Pipelines | /api/2.0/pipelines | "Untitled Pipeline", "New Pipeline", "Untitled" |
| MLflow Experiments | /api/2.0/mlflow/experiments/search | "Default", "Untitled Experiment", "Untitled" |
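
The scanner matches these patterns as substrings (`pattern in name`), so a broad pattern such as "Default" or "Untitled" also matches names like "DefaultRiskModel". If that produces too many false positives, a stricter check is one option. The sketch below accepts a name only when it equals a pattern, optionally followed by a numeric or date-like suffix; the suffix format is an assumption about how default names are generated, and `is_default_name` is a hypothetical helper.

```python
import re


def is_default_name(name, patterns):
    """True if name is a default pattern, optionally with a date/counter suffix."""
    candidate = name.strip()
    for pattern in patterns:
        # Allow suffixes such as " 2026-02-16 10:30:00" or " (1)" after the pattern.
        regex = rf"{re.escape(pattern)}(\s+[\d\-: ()]+)?"
        if re.fullmatch(regex, candidate, flags=re.IGNORECASE):
            return True
    return False
```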

---

## Key Features

✓ **8 Asset Types**: Complete coverage across notebooks, queries, dashboards, Genie spaces, jobs, alerts, pipelines, experiments  
✓ **Automatic Job Mode Detection**: Detects scheduled job execution  
✓ **Performance Presets**: Quick Mode (fast) vs Full Mode (comprehensive)  
✓ **Widget Parameters**: Dynamic configuration without code edits  
✓ **Configuration Validation**: Validates all parameters before execution  
✓ **Concurrent API Calls**: Parallel processing with ThreadPoolExecutor  
✓ **Execution Statistics**: Tracks API calls, failures, retries, success rates  
✓ **Memory Monitoring**: Tracks memory usage and warns on high consumption  
✓ **Progress Tracking**: Real-time progress updates during scans  
✓ **Retry Logic**: Automatic retry with exponential backoff  
✓ **Data Deduplication**: Automatic duplicate detection and removal  
✓ **Multiple Export Formats**: Excel (multi-sheet), HTML report, Delta table  
✓ **Timezone Handling**: All timestamps in Eastern Time  
✓ **Data Quality Checks**: Validates data integrity before export  
✓ **Owner Tracking**: Captures owner/user for all asset types  
✓ **Incremental Scan**: Only process changed assets for faster daily runs  
✓ **Age Distribution**: Histogram analysis of asset ages  
✓ **Cleanup Recommendations**: Identifies stale assets for deletion  
✓ **Historical Retention**: Delta table with append mode for trend analysis  
✓ **Email Notifications**: Automated alerts in job mode  
✓ **Error Handling**: Comprehensive error handling with detailed logging

In [0]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import requests
from datetime import datetime, timedelta
import json
import pytz
import re
import os

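# Detect job vs. interactive mode. currentRunId() returns an Option, so use the
# result of isDefined() instead of treating any successful call as "job mode".
try:
    is_job_mode = bool(
        dbutils.notebook.entry_point.getDbutils().notebook().getContext().currentRunId().isDefined()
    )
except Exception:
    is_job_mode = False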

# Get parameters from widgets. Job runs read the same parameters from the job
# configuration and fall back to these defaults when one is not supplied.
PARAM_DEFAULTS = {
    "output_catalog": "main",
    "output_schema": "default",
    "exclude_repos": "Yes",
    "min_age_days": "",
    "stale_threshold_days": "180",
}

def get_param(name):
    """Return the widget value, or the documented default if it is unavailable."""
    try:
        return dbutils.widgets.get(name)
    except Exception:
        return PARAM_DEFAULTS[name]

execution_mode = 'job' if is_job_mode else dbutils.widgets.get("execution_mode")
output_catalog = get_param("output_catalog")
output_schema = get_param("output_schema")
exclude_repos_param = get_param("exclude_repos")
min_age_days_param = get_param("min_age_days")
stale_threshold_days_param = get_param("stale_threshold_days")

# ============================================================================
# PERFORMANCE CONFIGURATION
# ============================================================================
MAX_RETRIES = 3
RETRY_DELAY = 2
MAX_WORKERS = 10  # Parallel API calls for owner lookup
MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 100  # Limit notebooks for detailed owner lookup in shared folders
# ============================================================================

# ============================================================================
# EXPORT PATH CONFIGURATION
# ============================================================================
EXPORT_PATH = '/dbfs/tmp/workspace_scan_export'
# ============================================================================

# ============================================================================
# EXCEL EXPORT: Export scan results to Excel files
# ============================================================================
ENABLE_EXCEL_EXPORT = True
# ============================================================================

# ============================================================================
# HTML EXPORT: Export scan results to HTML report
# ============================================================================
ENABLE_HTML_EXPORT = True
# ============================================================================

# ============================================================================
# LONG-TERM RETENTION: Export scan results to Delta table
# ============================================================================
ENABLE_DELTA_EXPORT = True
DELTA_TABLE_NAME = f'{output_catalog}.{output_schema}.workspace_default_named_assets'
# ============================================================================

# ============================================================================
# EMAIL NOTIFICATIONS: Send email when job completes
# ============================================================================
ENABLE_EMAIL_NOTIFICATIONS = False
EMAIL_RECIPIENTS = []  # List of email addresses
# ============================================================================

# ============================================================================
# INCREMENTAL SCAN: Only scan assets modified since last run
# ============================================================================
ENABLE_INCREMENTAL_SCAN = False
# ============================================================================

# Default naming patterns to detect
DEFAULT_PATTERNS = {
    'notebooks': ['Untitled Notebook', 'Untitled', 'New Notebook'],
    'queries': ['Untitled Query', 'New Query', 'Untitled'],
    'dashboards': ['New Dashboard', 'Untitled Dashboard', 'Untitled'],
    'genie_spaces': ['New Genie Space', 'Untitled Space', 'Untitled', 'New Space'],
    'jobs': ['Untitled Job', 'New Job', 'Untitled'],
    'alerts': ['New Alert', 'Untitled Alert', 'Untitled'],
    'pipelines': ['Untitled Pipeline', 'New Pipeline', 'Untitled'],
    'experiments': ['Default', 'Untitled Experiment', 'Untitled']
}

# Filtering options (from widgets)
EXCLUDE_PATHS = ['/Repos'] if exclude_repos_param == 'Yes' else []
MIN_AGE_DAYS = int(min_age_days_param) if min_age_days_param and min_age_days_param.strip() else None
STALE_THRESHOLD_DAYS = int(stale_threshold_days_param) if stale_threshold_days_param and stale_threshold_days_param.strip() else 180

# ============================================================================
# VALIDATION: Validate configuration
# ============================================================================
def validate_config():
    """Validate configuration parameters"""
    errors = []
    
    if not isinstance(MAX_RETRIES, int) or MAX_RETRIES < 1:
        errors.append("MAX_RETRIES must be a positive integer")
    
    if not isinstance(RETRY_DELAY, (int, float)) or RETRY_DELAY < 0:
        errors.append("RETRY_DELAY must be a non-negative number")
    
    if ENABLE_DELTA_EXPORT:
        if not re.match(r'^[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+$', DELTA_TABLE_NAME):
            errors.append(f"DELTA_TABLE_NAME must be in format 'catalog.schema.table', got: {DELTA_TABLE_NAME}")
    
    if MIN_AGE_DAYS is not None and MIN_AGE_DAYS < 0:
        errors.append("MIN_AGE_DAYS must be a non-negative integer or None")
    
    if STALE_THRESHOLD_DAYS < 1:
        errors.append("STALE_THRESHOLD_DAYS must be a positive integer")
    
    return errors

config_errors = validate_config()
if config_errors:
    error_msg = "Configuration validation failed:\n" + "\n".join(f"  - {e}" for e in config_errors)
    raise ValueError(error_msg)

# ============================================================================
# UTILITIES: Helper functions
# ============================================================================

# Execution statistics tracking
execution_stats = {
    'start_time': time.time(),
    'api_calls': 0,
    'api_failures': 0,
    'api_retries': 0,
    'resources_processed': 0,
    'resources_skipped': 0,
    'memory_usage_mb': 0
}

def log(message):
    """Print a message; output appears in the notebook and in job run logs."""
    print(message)

def log_execution_time(cell_name, start_time):
    """Log execution time for a cell"""
    elapsed = time.time() - start_time
    log(f"⏱️  {cell_name} completed in {elapsed:.2f} seconds")

def validate_dataframe_exists(df_name, df):
    """Validate that a DataFrame exists and has data"""
    if df is None:
        log(f"⚠️  Warning: {df_name} is None")
        return False
    try:
        count = df.count()
        if count == 0:
            log(f"⚠️  Warning: {df_name} is empty (0 rows)")
            return False
        return True
    except Exception as e:
        log(f"⚠️  Warning: Error checking {df_name}: {str(e)}")
        return False

def get_memory_usage():
    """Get current memory usage in MB"""
    try:
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / 1024 / 1024
    except Exception:  # psutil missing or process info unavailable
        return 0

def print_execution_summary():
    """Print execution statistics summary"""
    elapsed = time.time() - execution_stats['start_time']
    log(f"\n{'='*60}")
    log("EXECUTION SUMMARY")
    log(f"{'='*60}")
    log(f"Total execution time: {elapsed:.2f} seconds ({elapsed/60:.2f} minutes)")
    log(f"API calls made: {execution_stats['api_calls']}")
    log(f"Resources processed: {execution_stats['resources_processed']}")
    log(f"Resources skipped: {execution_stats['resources_skipped']}")
    log(f"API failures: {execution_stats['api_failures']}")
    log(f"API retries: {execution_stats['api_retries']}")
    if execution_stats['api_calls'] > 0:
        success_rate = ((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls']) * 100
        log(f"Success rate: {success_rate:.1f}%")
    if execution_stats['memory_usage_mb'] > 0:
        log(f"Peak memory usage: {execution_stats['memory_usage_mb']:.2f} MB")
    log(f"{'='*60}")

log("✓ Setup complete")
log(f"Configuration: MAX_RETRIES={MAX_RETRIES}, RETRY_DELAY={RETRY_DELAY}s, MAX_WORKERS={MAX_WORKERS}")
log(f"Execution Mode: {execution_mode}")
log(f"Output Location: {output_catalog}.{output_schema}")
log(f"Export Path: {EXPORT_PATH}")
log(f"Excluded Paths: {EXCLUDE_PATHS}")
if MIN_AGE_DAYS:
    log(f"Age Filter: Assets older than {MIN_AGE_DAYS} days")
log(f"Stale Asset Threshold: {STALE_THRESHOLD_DAYS} days")
log("")
log("Export formats enabled:")
if ENABLE_EXCEL_EXPORT:
    log("📊 Excel export enabled")
if ENABLE_HTML_EXPORT:
    log("🌐 HTML export enabled")
if ENABLE_DELTA_EXPORT:
    log(f"💾 Delta export enabled: {DELTA_TABLE_NAME}")
if ENABLE_EMAIL_NOTIFICATIONS:
    log(f"📧 Email notifications enabled: {len(EMAIL_RECIPIENTS)} recipients")

In [0]:
# ============================================================================
# PERFORMANCE PRESETS: Choose your execution mode
# ============================================================================
# Uncomment ONE of the following presets, or use custom configuration from Cell 2

# PRESET 1: QUICK MODE (5-10 minutes) - Fast scanning with limits
# Recommended for: Daily monitoring, quick audits, testing, interactive development
# USE_QUICK_MODE = True

# PRESET 2: FULL MODE (15-30 minutes) - Complete scanning without limits
# Recommended for: Compliance audits, weekly reviews, comprehensive analysis, scheduled jobs
# USE_FULL_MODE = True

# PRESET 3: CUSTOM MODE - Use configuration from Cell 2
# Recommended for: Specific use cases, targeted audits
# (Default if no preset is uncommented)

# ============================================================================
# Apply preset configurations
# ============================================================================

if 'USE_QUICK_MODE' in dir() and USE_QUICK_MODE:
    log("\n🚀 QUICK MODE ENABLED")
    log("="*60)
    MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 0  # Skip shared folder owner lookup
    MAX_WORKSPACE_SCAN_LIMIT = 1000  # Limit notebooks scanned
    ENABLE_SHARED_FOLDER_OWNER_LOOKUP = False
    log("  Notebook scan limit: 1,000")
    log("  Shared folder owner lookup: DISABLED")
    log("  Estimated time: 5-10 minutes")
    log("="*60)
    
elif 'USE_FULL_MODE' in dir() and USE_FULL_MODE:
    log("\n🔍 FULL MODE ENABLED")
    log("="*60)
    MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 999999  # Effectively no limit
    MAX_WORKSPACE_SCAN_LIMIT = 999999  # No limit
    ENABLE_SHARED_FOLDER_OWNER_LOOKUP = True
    log("  Notebook scan limit: NONE (complete scan)")
    log("  Shared folder owner lookup: ENABLED")
    log("  Estimated time: 15-30 minutes")
    log("="*60)
    
else:
    log("\n⚙️ CUSTOM MODE - Using configuration from Cell 2")
    log("="*60)
    # Use values from Cell 2
    if 'MAX_WORKSPACE_SCAN_LIMIT' not in dir():
        MAX_WORKSPACE_SCAN_LIMIT = 999999  # Default: no limit
    if 'ENABLE_SHARED_FOLDER_OWNER_LOOKUP' not in dir():
        ENABLE_SHARED_FOLDER_OWNER_LOOKUP = True  # Default: enabled

# ============================================================================
# JOB MODE OVERRIDE: Always use Full Mode for scheduled jobs
# ============================================================================
if is_job_mode:
    log("\n🤖 JOB MODE DETECTED - Forcing Full Mode")
    log("="*60)
    MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 999999  # Effectively no limit
    MAX_WORKSPACE_SCAN_LIMIT = 999999
    ENABLE_SHARED_FOLDER_OWNER_LOOKUP = True
    log("  All limits removed for comprehensive audit")
    log("  Shared folder owner lookup: ENABLED")
    log("="*60)

In [0]:
# Create export directory if it doesn't exist (using dbutils.fs for serverless compatibility)
try:
    dbutils.fs.ls(EXPORT_PATH.replace('/dbfs', 'dbfs:'))
    log(f"✓ Export directory ready: {EXPORT_PATH}")
except Exception:  # Listing fails when the directory does not exist yet
    dbutils.fs.mkdirs(EXPORT_PATH.replace('/dbfs', 'dbfs:'))
    log(f"✓ Created export directory: {EXPORT_PATH}")

# Test write permissions
try:
    test_file = f"{EXPORT_PATH}/.test".replace('/dbfs', 'dbfs:')
    dbutils.fs.put(test_file, 'test', overwrite=True)
    dbutils.fs.rm(test_file)
    log("  ✓ Export path is writable")
except Exception as e:
    log(f"  ⚠️ Warning: Export path may not be writable: {e}")

In [0]:
# Check if incremental scan is enabled and get last scan timestamp
last_scan_timestamp = None

if ENABLE_INCREMENTAL_SCAN and ENABLE_DELTA_EXPORT:
    log("\nChecking for incremental scan...")
    try:
        # Query Delta table for last scan timestamp
        last_scan_df = spark.sql(f"""
            SELECT MAX(scan_timestamp) as last_scan
            FROM {DELTA_TABLE_NAME}
        """)
        
        last_scan_row = last_scan_df.collect()
        if last_scan_row and last_scan_row[0]['last_scan']:
            last_scan_timestamp = last_scan_row[0]['last_scan']
            log(f"  ✓ Last scan found: {last_scan_timestamp}")
            log(f"  ✓ Incremental mode: Only scanning assets modified after {last_scan_timestamp}")
        else:
            log("  ℹ️ No previous scan found, performing full scan")
    except Exception as e:
        log(f"  ℹ️ Could not query previous scan (table may not exist): {str(e)[:100]}")
        log("  ✓ Performing full scan")
else:
    if not ENABLE_INCREMENTAL_SCAN:
        log("\n⏭️ Incremental scan disabled, performing full scan")
    else:
        log("\n⏭️ Incremental scan requires Delta export, performing full scan")

In [0]:
def get_api_client():
    """Get Databricks API client configuration"""
    try:
        ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
        api_url = ctx.apiUrl().get()
        api_token = ctx.apiToken().get()
        return api_url, api_token
    except Exception as e:
        log(f"Error getting API client: {e}")
        return None, None

def api_call_with_retry(func, *args, **kwargs):
    """Execute an API call with retries, exponential backoff, and stats tracking.

    Only raised exceptions (e.g. connection errors, timeouts) trigger a retry;
    callers still need to check response.status_code themselves.
    """
    execution_stats['api_calls'] += 1
    
    for attempt in range(MAX_RETRIES):
        try:
            result = func(*args, **kwargs)
            execution_stats['resources_processed'] += 1
            return result
        except Exception as e:
            if attempt < MAX_RETRIES - 1:
                execution_stats['api_retries'] += 1
                log(f"  ⚠️ API call failed (attempt {attempt + 1}/{MAX_RETRIES}): {e}")
                # Exponential backoff: RETRY_DELAY, 2x RETRY_DELAY, 4x RETRY_DELAY, ...
                time.sleep(RETRY_DELAY * (2 ** attempt))
            else:
                # Count one failure per call so the success rate stays within 0-100%
                execution_stats['api_failures'] += 1
                log(f"  ✗ API call failed after {MAX_RETRIES} attempts: {e}")
                raise

def should_exclude_path(path: str) -> bool:
    """Check if path should be excluded from scanning"""
    return any(path.startswith(excluded) for excluded in EXCLUDE_PATHS)

log("✓ API helper functions loaded")

In [0]:
cell_start_time = time.time()

log("Scanning notebooks...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    notebooks = []
    user_id_to_email = {}
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    
    # First, build a mapping of user IDs to emails
    log("  Building user ID to email mapping...")
    user_id_to_email = {}
    try:
        response = requests.get(
            f"{api_url}/api/2.0/preview/scim/v2/Users",
            headers=headers,
            params={"count": 10000}
        )
        if response.status_code == 200:
            users_data = response.json()
            for user in users_data.get("Resources", []):
                user_id = user.get("id")
                emails = user.get("emails", [])
                if user_id and emails:
                    user_id_to_email[user_id] = emails[0].get("value")
            log(f"  ✓ Mapped {len(user_id_to_email)} users")
    except Exception as e:
        log(f"  ⚠️ Warning: Could not fetch user mapping: {e}")
    
    # Now scan notebooks
    all_notebooks = []
    directories_scanned = 0
    
    def recurse_path(current_path):
        global directories_scanned, all_notebooks
        
        if should_exclude_path(current_path):
            return
        
        try:
            response = requests.get(
                f"{api_url}/api/2.0/workspace/list",
                headers=headers,
                params={"path": current_path}  # GET request: pass the path as a query parameter
            )
            
            if response.status_code == 200:
                data = response.json()
                if "objects" in data:
                    for obj in data["objects"]:
                        if obj.get("object_type") == "NOTEBOOK":
                            all_notebooks.append(obj)
                        
                        if obj.get("object_type") == "DIRECTORY":
                            directories_scanned += 1
                            if directories_scanned % 100 == 0:
                                log(f"  Progress: Scanned {directories_scanned} directories, found {len(all_notebooks)} notebooks...")
                            recurse_path(obj["path"])
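            elif response.status_code == 403:
                # No permission to list this directory; count it and move on
                execution_stats['resources_skipped'] += 1
        except Exception as e:
            # Do not fail the whole scan on one directory, but leave a trace
            execution_stats['resources_skipped'] += 1
            log(f"  ⚠️ Skipping {current_path}: {e}")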
    
    recurse_path("/")
    notebooks = all_notebooks

# Calculate age threshold if filtering is enabled
age_threshold_ms = None
if MIN_AGE_DAYS:
    age_threshold_ms = int((datetime.now() - timedelta(days=MIN_AGE_DAYS)).timestamp() * 1000)
    log(f"  Age filter: Assets older than {MIN_AGE_DAYS} days")

# Filter notebooks with default naming
notebook_data = []
shared_folder_notebooks = []

for nb in notebooks:
    name = nb.get("path", "").split("/")[-1]
    modified_at = nb.get("modified_at")
    path = nb.get("path", "")
    
    if age_threshold_ms and modified_at and int(modified_at) > age_threshold_ms:
        continue
    
    is_default = any(pattern in name for pattern in DEFAULT_PATTERNS['notebooks'])
    
    if is_default:
        owner_match = re.search(r"/Users/([^/]+)/", path)
        owner = owner_match.group(1) if owner_match else None
        
        notebook_entry = {
            "asset_type": "NOTEBOOK",
            "asset_name": name,
            "asset_path": path,
            "object_id": str(nb.get("object_id")),
            "owner": owner,
            "created_at": None,
            "modified_at": str(modified_at) if modified_at else None
        }
        
        notebook_data.append(notebook_entry)
        
        if not owner:
            shared_folder_notebooks.append((len(notebook_data) - 1, path))

# Fetch owner info for shared folder notebooks using ThreadPoolExecutor
if shared_folder_notebooks and user_id_to_email and ENABLE_SHARED_FOLDER_OWNER_LOOKUP:
    notebooks_to_lookup = shared_folder_notebooks[:MAX_NOTEBOOKS_FOR_OWNER_LOOKUP]
    log(f"  Fetching owner information for {len(notebooks_to_lookup)} notebooks in shared folders (parallel)...")
    
    def fetch_notebook_owner(data_idx, path):
        try:
            status_response = requests.get(
                f"{api_url}/api/2.0/workspace/get-status",
                headers=headers,
                params={"path": path}
            )
            if status_response.status_code == 200:
                status_data = status_response.json()
                created_by_id = status_data.get("created_by")
                if created_by_id and str(created_by_id) in user_id_to_email:
                    return (data_idx, user_id_to_email[str(created_by_id)])
        except Exception:
            pass  # Best-effort lookup: leave the owner unset on any error
        return (data_idx, None)
    
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = [executor.submit(fetch_notebook_owner, idx, path) for idx, path in notebooks_to_lookup]
        
        completed = 0
        for future in as_completed(futures):
            data_idx, owner = future.result()
            if owner:
                notebook_data[data_idx]["owner"] = owner
            completed += 1
            if completed % 10 == 0:
                log(f"    Progress: {completed}/{len(notebooks_to_lookup)} lookups completed")
    
    log(f"  ✓ Completed owner lookup for shared folder notebooks")
elif shared_folder_notebooks and not ENABLE_SHARED_FOLDER_OWNER_LOOKUP:
    log(f"  ⏭️ Skipped owner lookup for {len(shared_folder_notebooks)} shared folder notebooks (disabled in Quick Mode)")

log(f"✓ Found {len(notebook_data)} notebooks with default naming")
log_execution_time("Scan notebooks", cell_start_time)

In [0]:
cell_start_time = time.time()

log("Scanning SQL queries...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    queries = []
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    all_queries = []
    page_token = None
    page_count = 0
    
    try:
        while True:
            params = {"page_size": 100}
            if page_token:
                params["page_token"] = page_token
            
            response = api_call_with_retry(
                requests.get,
                f"{api_url}/api/2.0/preview/sql/queries",
                headers=headers,
                params=params
            )
            
            if response.status_code == 200:
                data = response.json()
                if "results" in data:
                    all_queries.extend(data["results"])
                    page_count += 1
                    if page_count % 5 == 0:
                        log(f"  Progress: Fetched {page_count} pages ({len(all_queries)} queries)...")
                
                page_token = data.get("next_page_token")
                if not page_token:
                    break
            else:
                log(f"  ⚠️ Query listing stopped: HTTP {response.status_code}")
                break
    except Exception as e:
        log(f"  ⚠️ Error listing SQL queries: {e}")
    
    queries = all_queries

# Filter queries with default naming
query_data = []
for query in queries:
    name = query.get("name", "")
    updated_at = query.get("updated_at")
    
    # Apply age filter if enabled
    if age_threshold_ms and updated_at:
        try:
            query_time_ms = int(datetime.fromisoformat(updated_at.replace('Z', '+00:00')).timestamp() * 1000)
            if query_time_ms > age_threshold_ms:
                continue
        except (ValueError, AttributeError):
            pass  # Unparseable timestamp: keep the asset rather than drop it
    
    # Check if name matches default patterns
    is_default = any(pattern in name for pattern in DEFAULT_PATTERNS['queries'])
    
    if is_default:
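        query_data.append({
            "asset_type": "QUERY",
            "asset_name": name,
            "asset_path": None,
            "object_id": query.get("id"),
            # Owner from query metadata; assumes the response exposes user.email
            "owner": (query.get("user") or {}).get("email"),
            "created_at": query.get("created_at"),
            "modified_at": updated_at
        })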

log(f"✓ Found {len(query_data)} SQL queries with default naming")
log_execution_time("Scan SQL queries", cell_start_time)

In [0]:
cell_start_time = time.time()

log("Scanning dashboards...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    dashboards = []
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    all_dashboards = []
    page_token = None
    page_count = 0
    
    try:
        while True:
            params = {"page_size": 100}
            if page_token:
                params["page_token"] = page_token
            
            response = api_call_with_retry(
                requests.get,
                f"{api_url}/api/2.0/preview/sql/dashboards",
                headers=headers,
                params=params
            )
            
            if response.status_code == 200:
                data = response.json()
                if "results" in data:
                    all_dashboards.extend(data["results"])
                    page_count += 1
                    if page_count % 5 == 0:
                        log(f"  Progress: Fetched {page_count} pages ({len(all_dashboards)} dashboards)...")
                
                page_token = data.get("next_page_token")
                if not page_token:
                    break
            else:
                log(f"  ✗ Failed to fetch dashboards: HTTP {response.status_code}")
                break
    except Exception as e:
        log(f"  ⚠️ Error listing dashboards: {e}")
    
    dashboards = all_dashboards

# Filter dashboards with default naming
dashboard_data = []
for dashboard in dashboards:
    name = dashboard.get("name", "")
    updated_at = dashboard.get("updated_at")
    
    # Apply age filter if enabled
    if age_threshold_ms and updated_at:
        try:
            dashboard_time_ms = int(datetime.fromisoformat(updated_at.replace('Z', '+00:00')).timestamp() * 1000)
            if dashboard_time_ms > age_threshold_ms:
                continue
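        except (AttributeError, TypeError, ValueError):
            # Unparseable timestamp: keep the dashboard rather than dropping it
            pass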
    
    # Check if name matches default patterns
    is_default = any(pattern in name for pattern in DEFAULT_PATTERNS['dashboards'])
    
    if is_default:
        dashboard_data.append({
            "asset_type": "DASHBOARD",
            "asset_name": name,
            "asset_path": None,
            "object_id": dashboard.get("id"),
            "created_at": dashboard.get("created_at"),
            "modified_at": updated_at
        })

log(f"✓ Found {len(dashboard_data)} dashboards with default naming")
log(f"⏱️  Scan dashboards completed in {time.time() - cell_start_time:.2f} seconds")

In [0]:
cell_start_time = time.time()

log("Fetching Genie spaces...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    genie_spaces = []
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    all_genie_spaces = []
    page_token = None
    page_count = 0
    
    try:
        while True:
            params = {"page_size": 100}
            if page_token:
                params["page_token"] = page_token
            
            def fetch_genie_spaces():
                return requests.get(
                    f"{api_url}/api/2.0/genie/spaces",
                    headers=headers,
                    params=params,
                    timeout=30
                )
            
            response = api_call_with_retry(fetch_genie_spaces)
            
            if response and response.status_code == 200:
                data = response.json()
                spaces = data.get('spaces', [])
                all_genie_spaces.extend(spaces)
                page_count += 1
                
                if page_count % 5 == 0:
                    log(f"  Fetched {len(all_genie_spaces)} Genie spaces so far...")
                
                page_token = data.get('next_page_token')
                if not page_token:
                    break
            else:
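                # requests.Response is falsy for 4xx/5xx, so test against None to report the real status
                log(f"  ✗ Failed to fetch Genie spaces: {response.status_code if response is not None else 'No response'}")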
                break
        
        log(f"  ✓ Fetched {len(all_genie_spaces)} total Genie spaces")
        execution_stats['resources_processed'] += len(all_genie_spaces)
        
    except Exception as e:
        log(f"  ✗ Error fetching Genie spaces: {str(e)}")
        all_genie_spaces = []
    
    # Filter for default naming patterns
    genie_spaces = []
    for space in all_genie_spaces:
        space_title = space.get('title', '')
        space_id = space.get('space_id', '')
        
        # Check if title matches default patterns
        is_default_name = any(
            pattern.lower() in space_title.lower() 
            for pattern in DEFAULT_PATTERNS.get('genie_spaces', ['New Genie Space', 'Untitled Space', 'Untitled'])
        )
        
        if is_default_name:
            # Extract owner from created_by or owner_user_name field
            owner = space.get('owner_user_name', space.get('created_by', 'Unknown'))
            
            # Get timestamps
            created_at = space.get('created_at')
            updated_at = space.get('updated_at')
            
            # Convert timestamps from milliseconds to datetime if present
            eastern = pytz.timezone('America/New_York')
            created_timestamp = None
            modified_timestamp = None
            
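            if created_at:
                try:
                    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
                except (TypeError, ValueError, OverflowError, OSError):
                    pass  # Leave as None when the value is not valid epoch milliseconds

            if updated_at:
                try:
                    modified_timestamp = datetime.fromtimestamp(updated_at / 1000, tz=eastern)
                except (TypeError, ValueError, OverflowError, OSError):
                    pass  # Leave as None when the value is not valid epoch milliseconds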
            
            # Apply age filter if configured
            if MIN_AGE_DAYS and modified_timestamp:
                age_days = (datetime.now(eastern) - modified_timestamp).days
                if age_days < MIN_AGE_DAYS:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            # Apply incremental scan filter if enabled
            if last_scan_timestamp and modified_timestamp:
                if modified_timestamp <= last_scan_timestamp:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            genie_spaces.append({
                'asset_type': 'genie_space',
                'asset_name': space_title,
                'asset_id': space_id,
                'asset_path': f"/genie/spaces/{space_id}",  # Construct path
                'owner': owner,
                'created_timestamp': created_timestamp,
                'modified_timestamp': modified_timestamp
            })

log(f"  Found {len(genie_spaces)} Genie spaces with default naming")
genie_space_data = genie_spaces

log_execution_time("Scan Genie Spaces", cell_start_time)

In [0]:
cell_start_time = time.time()

log("Fetching jobs...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    jobs = []
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    all_jobs = []
    page_token = None
    page_count = 0
    
    try:
        while True:
            # requests serializes Python booleans as "True"/"False"; send the lowercase literal instead
            params = {"limit": 25, "expand_tasks": "false"}
            if page_token:
                params["page_token"] = page_token
            
            def fetch_jobs():
                return requests.get(
                    f"{api_url}/api/2.1/jobs/list",
                    headers=headers,
                    params=params,
                    timeout=30
                )
            
            response = api_call_with_retry(fetch_jobs)
            
            if response and response.status_code == 200:
                data = response.json()
                jobs_list = data.get('jobs', [])
                all_jobs.extend(jobs_list)
                page_count += 1
                
                if page_count % 5 == 0:
                    log(f"  Fetched {len(all_jobs)} jobs so far...")
                
                # Check for pagination
                if data.get('has_more', False):
                    page_token = data.get('next_page_token')
                    if not page_token:
                        break
                else:
                    break
            else:
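                # requests.Response is falsy for 4xx/5xx, so test against None to report the real status
                log(f"  ✗ Failed to fetch jobs: {response.status_code if response is not None else 'No response'}")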
                break
        
        log(f"  ✓ Fetched {len(all_jobs)} total jobs")
        execution_stats['resources_processed'] += len(all_jobs)
        
    except Exception as e:
        log(f"  ✗ Error fetching jobs: {str(e)}")
        all_jobs = []
    
    # Filter for default naming patterns
    jobs = []
    for job in all_jobs:
        job_settings = job.get('settings', {})
        job_name = job_settings.get('name', '')
        job_id = job.get('job_id', '')
        
        # Check if name matches default patterns
        is_default_name = any(
            pattern.lower() in job_name.lower() 
            for pattern in DEFAULT_PATTERNS.get('jobs', ['Untitled Job', 'New Job', 'Untitled'])
        )
        
        if is_default_name:
            # Extract owner from creator_user_name
            owner = job.get('creator_user_name', 'Unknown')
            
            # Get timestamps
            created_at = job.get('created_time')
            
            # Convert timestamps from milliseconds to datetime if present
            eastern = pytz.timezone('America/New_York')
            created_timestamp = None
            modified_timestamp = None
            
            if created_at:
                try:
                    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
                    # Jobs don't have a separate modified timestamp, use created as fallback
                    modified_timestamp = created_timestamp
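                except (TypeError, ValueError, OverflowError, OSError):
                    pass  # Leave as None when the value is not valid epoch milliseconds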
            
            # Apply age filter if configured
            if MIN_AGE_DAYS and modified_timestamp:
                age_days = (datetime.now(eastern) - modified_timestamp).days
                if age_days < MIN_AGE_DAYS:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            # Apply incremental scan filter if enabled
            if last_scan_timestamp and modified_timestamp:
                if modified_timestamp <= last_scan_timestamp:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            jobs.append({
                'asset_type': 'job',
                'asset_name': job_name,
                'asset_id': str(job_id),
                'asset_path': f"/jobs/{job_id}",
                'owner': owner,
                'created_timestamp': created_timestamp,
                'modified_timestamp': modified_timestamp
            })

log(f"  Found {len(jobs)} jobs with default naming")
jobs_data = jobs

log_execution_time("Scan Jobs", cell_start_time)

In [0]:
cell_start_time = time.time()

log("Fetching SQL alerts...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    alerts = []
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    all_alerts = []
    
    # Try both legacy and new alerts APIs
    try:
        # Try legacy alerts API first (more commonly used)
        def fetch_legacy_alerts():
            return requests.get(
                f"{api_url}/api/2.0/preview/sql/alerts",
                headers=headers,
                timeout=30
            )
        
        response = api_call_with_retry(fetch_legacy_alerts)
        
        if response and response.status_code == 200:
            data = response.json()
            legacy_alerts = data if isinstance(data, list) else data.get('results', [])
            all_alerts.extend(legacy_alerts)
            log(f"  ✓ Fetched {len(legacy_alerts)} legacy alerts")
        else:
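            # requests.Response is falsy for 4xx/5xx, so test against None to report the real status
            log(f"  ⚠️  Legacy alerts API returned: {response.status_code if response is not None else 'No response'}")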
        
        execution_stats['resources_processed'] += len(all_alerts)
        
    except Exception as e:
        log(f"  ⚠️  Error fetching legacy alerts: {str(e)}")
    
    # Try new alerts API with pagination
    try:
        page_token = None
        page_count = 0
        new_alerts_count = 0
        
        while True:
            params = {"page_size": 100}
            if page_token:
                params["page_token"] = page_token
            
            def fetch_new_alerts():
                return requests.get(
                    f"{api_url}/api/2.0/alerts",
                    headers=headers,
                    params=params,
                    timeout=30
                )
            
            response = api_call_with_retry(fetch_new_alerts)
            
            if response and response.status_code == 200:
                data = response.json()
                alerts_list = data.get('alerts', [])
                all_alerts.extend(alerts_list)
                new_alerts_count += len(alerts_list)
                page_count += 1
                
                page_token = data.get('next_page_token')
                if not page_token:
                    break
            else:
                # New alerts API might not be available
                break
        
        if new_alerts_count > 0:
            log(f"  ✓ Fetched {new_alerts_count} new alerts")
            execution_stats['resources_processed'] += new_alerts_count
        
    except Exception as e:
        log(f"  ⚠️  Error fetching new alerts: {str(e)}")
    
    log(f"  ✓ Fetched {len(all_alerts)} total alerts")
    
    # Filter for default naming patterns
    alerts = []
    for alert in all_alerts:
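        # Legacy alerts expose 'name'; newer alert payloads use 'display_name'
        alert_name = alert.get('name') or alert.get('display_name') or ''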
        alert_id = alert.get('id', '')
        
        # Check if name matches default patterns
        is_default_name = any(
            pattern.lower() in alert_name.lower() 
            for pattern in DEFAULT_PATTERNS.get('alerts', ['New Alert', 'Untitled Alert', 'Untitled'])
        )
        
        if is_default_name:
            # Extract owner - try multiple fields
            # 'user' may be present but null, so guard before reading the email
            owner = (
                alert.get('owner_user_name')
                or (alert.get('user') or {}).get('email')
                or alert.get('created_by')
                or 'Unknown'
            )
            
            # Get timestamps
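            # Legacy alerts use created_at/updated_at; newer payloads use create_time/update_time
            created_at = alert.get('created_at') or alert.get('create_time')
            updated_at = alert.get('updated_at') or alert.get('update_time')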
            
            # Convert timestamps to datetime if present
            eastern = pytz.timezone('America/New_York')
            created_timestamp = None
            modified_timestamp = None
            
            # Handle different timestamp formats (milliseconds or ISO string)
            if created_at:
                try:
                    if isinstance(created_at, (int, float)):
                        created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
                    else:
                        created_timestamp = datetime.fromisoformat(str(created_at).replace('Z', '+00:00')).astimezone(eastern)
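                except (TypeError, ValueError, OverflowError, OSError):
                    pass  # Leave as None when the timestamp cannot be parsed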
            
            if updated_at:
                try:
                    if isinstance(updated_at, (int, float)):
                        modified_timestamp = datetime.fromtimestamp(updated_at / 1000, tz=eastern)
                    else:
                        modified_timestamp = datetime.fromisoformat(str(updated_at).replace('Z', '+00:00')).astimezone(eastern)
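                except (TypeError, ValueError, OverflowError, OSError):
                    pass  # Leave as None when the timestamp cannot be parsed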
            
            # Use created as fallback for modified
            if not modified_timestamp and created_timestamp:
                modified_timestamp = created_timestamp
            
            # Apply age filter if configured
            if MIN_AGE_DAYS and modified_timestamp:
                age_days = (datetime.now(eastern) - modified_timestamp).days
                if age_days < MIN_AGE_DAYS:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            # Apply incremental scan filter if enabled
            if last_scan_timestamp and modified_timestamp:
                if modified_timestamp <= last_scan_timestamp:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            alerts.append({
                'asset_type': 'alert',
                'asset_name': alert_name,
                'asset_id': str(alert_id),
                'asset_path': f"/alerts/{alert_id}",
                'owner': owner,
                'created_timestamp': created_timestamp,
                'modified_timestamp': modified_timestamp
            })

log(f"  Found {len(alerts)} alerts with default naming")
alerts_data = alerts

log_execution_time("Scan SQL Alerts", cell_start_time)

In [0]:
cell_start_time = time.time()

log("Fetching DLT pipelines...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    pipelines = []
else:
    headers = {"Authorization": f"Bearer {api_token}"}
    all_pipelines = []
    page_token = None
    page_count = 0
    
    try:
        while True:
            params = {"max_results": 100}
            if page_token:
                params["page_token"] = page_token
            
            def fetch_pipelines():
                return requests.get(
                    f"{api_url}/api/2.0/pipelines",
                    headers=headers,
                    params=params,
                    timeout=30
                )
            
            response = api_call_with_retry(fetch_pipelines)
            
            if response and response.status_code == 200:
                data = response.json()
                pipelines_list = data.get('statuses', [])
                all_pipelines.extend(pipelines_list)
                page_count += 1
                
                if page_count % 5 == 0:
                    log(f"  Fetched {len(all_pipelines)} pipelines so far...")
                
                # Check for pagination
                page_token = data.get('next_page_token')
                if not page_token:
                    break
            else:
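                # requests.Response is falsy for 4xx/5xx, so test against None to report the real status
                log(f"  ✗ Failed to fetch pipelines: {response.status_code if response is not None else 'No response'}")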
                break
        
        log(f"  ✓ Fetched {len(all_pipelines)} total pipelines")
        execution_stats['resources_processed'] += len(all_pipelines)
        
    except Exception as e:
        log(f"  ✗ Error fetching pipelines: {str(e)}")
        all_pipelines = []
    
    # Filter for default naming patterns
    pipelines = []
    for pipeline in all_pipelines:
        pipeline_name = pipeline.get('name', '')
        pipeline_id = pipeline.get('pipeline_id', '')
        
        # Check if name matches default patterns
        is_default_name = any(
            pattern.lower() in pipeline_name.lower() 
            for pattern in DEFAULT_PATTERNS.get('pipelines', ['Untitled Pipeline', 'New Pipeline', 'Untitled'])
        )
        
        if is_default_name:
            # Extract owner from creator_user_name or created_by
            owner = pipeline.get('creator_user_name', pipeline.get('created_by', 'Unknown'))
            
            # Get timestamps - pipelines may have creation_time
            created_at = pipeline.get('creation_time')
            
            # Convert timestamps from milliseconds to datetime if present
            eastern = pytz.timezone('America/New_York')
            created_timestamp = None
            modified_timestamp = None
            
            if created_at:
                try:
                    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
                    # Pipelines don't have a separate modified timestamp, use created as fallback
                    modified_timestamp = created_timestamp
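                except (TypeError, ValueError, OverflowError, OSError):
                    pass  # Leave as None when the value is not valid epoch milliseconds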
            
            # Apply age filter if configured
            if MIN_AGE_DAYS and modified_timestamp:
                age_days = (datetime.now(eastern) - modified_timestamp).days
                if age_days < MIN_AGE_DAYS:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            # Apply incremental scan filter if enabled
            if last_scan_timestamp and modified_timestamp:
                if modified_timestamp <= last_scan_timestamp:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            pipelines.append({
                'asset_type': 'pipeline',
                'asset_name': pipeline_name,
                'asset_id': str(pipeline_id),
                'asset_path': f"/pipelines/{pipeline_id}",
                'owner': owner,
                'created_timestamp': created_timestamp,
                'modified_timestamp': modified_timestamp
            })

log(f"  Found {len(pipelines)} pipelines with default naming")
pipelines_data = pipelines

log_execution_time("Scan DLT Pipelines", cell_start_time)

In [0]:
cell_start_time = time.time()

log("Fetching MLflow experiments...")

api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    experiments = []
else:
    headers = {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json"
    }
    all_experiments = []
    page_token = None
    page_count = 0
    
    try:
        while True:
            payload = {
                "max_results": 1000,
                "view_type": "ACTIVE_ONLY"
            }
            if page_token:
                payload["page_token"] = page_token
            
            def fetch_experiments():
                return requests.post(
                    f"{api_url}/api/2.0/mlflow/experiments/search",
                    headers=headers,
                    json=payload,
                    timeout=30
                )
            
            response = api_call_with_retry(fetch_experiments)
            
            if response and response.status_code == 200:
                data = response.json()
                experiments_list = data.get('experiments', [])
                all_experiments.extend(experiments_list)
                page_count += 1
                
                if page_count % 5 == 0:
                    log(f"  Fetched {len(all_experiments)} experiments so far...")
                
                # Check for pagination
                page_token = data.get('next_page_token')
                if not page_token:
                    break
            else:
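                # requests.Response is falsy for 4xx/5xx, so test against None to report the real status
                log(f"  ✗ Failed to fetch experiments: {response.status_code if response is not None else 'No response'}")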
                break
        
        log(f"  ✓ Fetched {len(all_experiments)} total experiments")
        execution_stats['resources_processed'] += len(all_experiments)
        
    except Exception as e:
        log(f"  ✗ Error fetching experiments: {str(e)}")
        all_experiments = []
    
    # Filter for default naming patterns
    experiments = []
    for experiment in all_experiments:
        experiment_name = experiment.get('name', '')
        experiment_id = experiment.get('experiment_id', '')
        
        # Check if name matches default patterns
        is_default_name = any(
            pattern.lower() in experiment_name.lower() 
            for pattern in DEFAULT_PATTERNS.get('experiments', ['Default', 'Untitled Experiment', 'Untitled'])
        )
        
        if is_default_name:
            # Extract owner from tags or artifact_location path
            tags = experiment.get('tags', [])
            owner = 'Unknown'
            
            # Try to extract owner from tags
            for tag in tags:
                if tag.get('key') in ['mlflow.user', 'owner', 'created_by']:
                    owner = tag.get('value', 'Unknown')
                    break
            
            # If no owner in tags, try to extract from artifact_location
            if owner == 'Unknown':
                artifact_location = experiment.get('artifact_location', '')
                if '/Users/' in artifact_location:
                    try:
                        owner = artifact_location.split('/Users/')[1].split('/')[0]
                    except IndexError:
                        pass
            
            # Get timestamps
            created_at = experiment.get('creation_time')
            updated_at = experiment.get('last_update_time')
            
            # Convert timestamps from milliseconds to datetime if present
            eastern = pytz.timezone('America/New_York')
            created_timestamp = None
            modified_timestamp = None
            
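            # int() tolerates epoch-millisecond values that arrive as strings
            if created_at:
                try:
                    created_timestamp = datetime.fromtimestamp(int(created_at) / 1000, tz=eastern)
                except (TypeError, ValueError, OverflowError, OSError):
                    pass

            if updated_at:
                try:
                    modified_timestamp = datetime.fromtimestamp(int(updated_at) / 1000, tz=eastern)
                except (TypeError, ValueError, OverflowError, OSError):
                    pass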
            
            # Use created as fallback for modified
            if not modified_timestamp and created_timestamp:
                modified_timestamp = created_timestamp
            
            # Apply age filter if configured
            if MIN_AGE_DAYS and modified_timestamp:
                age_days = (datetime.now(eastern) - modified_timestamp).days
                if age_days < MIN_AGE_DAYS:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            # Apply incremental scan filter if enabled
            if last_scan_timestamp and modified_timestamp:
                if modified_timestamp <= last_scan_timestamp:
                    execution_stats['resources_skipped'] += 1
                    continue
            
            experiments.append({
                'asset_type': 'experiment',
                'asset_name': experiment_name,
                'asset_id': str(experiment_id),
                'asset_path': f"/ml/experiments/{experiment_id}",
                'owner': owner,
                'created_timestamp': created_timestamp,
                'modified_timestamp': modified_timestamp
            })

log(f"  Found {len(experiments)} experiments with default naming")
experiments_data = experiments

log_execution_time("Scan MLflow Experiments", cell_start_time)

In [0]:
log("\n" + "="*60)
log("CREATING CONSOLIDATED DATAFRAME")
log("="*60)

# Track memory usage
memory_before = get_memory_usage()

# Combine all results
all_assets = notebook_data + query_data + dashboard_data + genie_space_data + jobs_data + alerts_data + pipelines_data + experiments_data

log(f"Total assets with default naming: {len(all_assets)}")
log(f"  - Notebooks: {len(notebook_data)}")
log(f"  - SQL Queries: {len(query_data)}")
log(f"  - Dashboards: {len(dashboard_data)}")
log(f"  - Genie Spaces: {len(genie_space_data)}")
log(f"  - Jobs: {len(jobs_data)}")
log(f"  - SQL Alerts: {len(alerts_data)}")
log(f"  - DLT Pipelines: {len(pipelines_data)}")
log(f"  - MLflow Experiments: {len(experiments_data)}")

if len(all_assets) == 0:
    log("\n✓ No assets with default naming found!")
    df_assets = None
else:
    # Create DataFrame
    schema = StructType([
        StructField("asset_type", StringType(), True),
        StructField("asset_name", StringType(), True),
        StructField("asset_id", StringType(), True),
        StructField("asset_path", StringType(), True),
        StructField("owner", StringType(), True),
        StructField("created_timestamp", TimestampType(), True),
        StructField("modified_timestamp", TimestampType(), True)
    ])
    
    df_assets = spark.createDataFrame(all_assets, schema)
    
    # Add scan timestamp
    eastern = pytz.timezone('America/New_York')
    scan_timestamp = datetime.now(eastern)
    df_assets = df_assets.withColumn("scan_timestamp", F.lit(scan_timestamp))
    
    # Deduplicate based on asset_type + asset_id
    initial_count = df_assets.count()
    df_assets = df_assets.dropDuplicates(["asset_type", "asset_id"])
    final_count = df_assets.count()
    
    if initial_count > final_count:
        log(f"\n⚠️  Removed {initial_count - final_count} duplicate entries")
    
    log(f"\n✓ Created DataFrame with {final_count} unique assets")
    
    # Validate DataFrame
    if not validate_dataframe_exists("df_assets", df_assets):
        log("⚠️  Warning: DataFrame validation failed")
    
    # Track memory usage
    memory_after = get_memory_usage()
    memory_delta = memory_after - memory_before
    execution_stats['memory_usage_mb'] = max(execution_stats['memory_usage_mb'], memory_after)
    
    if memory_delta > 100:
        log(f"\n⚠️  Memory usage increased by {memory_delta:.2f} MB")
    
    # Show sample
    log("\nSample of assets found:")
    display(df_assets.limit(10))

In [0]:
if ENABLE_DELTA_EXPORT and df_assets is not None:
    cell_start_time = time.time()
    
    log("\n" + "="*60)
    log("EXPORTING TO DELTA TABLE (LONG-TERM RETENTION)")
    log("="*60)
    
    # Parse catalog, schema, table from DELTA_TABLE_NAME
    parts = DELTA_TABLE_NAME.split('.')
    if len(parts) != 3:
        log(f"✗ Invalid DELTA_TABLE_NAME format: {DELTA_TABLE_NAME}")
        log("  Expected format: catalog.schema.table")
    else:
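        # Distinct names avoid overwriting the StructType `schema` defined for df_assets
        catalog_name, schema_name, table_name = parts
        log(f"Target: {DELTA_TABLE_NAME}")
        log(f"  Catalog: {catalog_name}")
        log(f"  Schema: {schema_name}")
        log(f"  Table: {table_name}")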
        
        try:
            # Add timestamp for historical tracking
            eastern = pytz.timezone('America/New_York')
            df_with_timestamp = df_assets.withColumn(
                "export_timestamp",
                F.lit(datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z'))
            )
            
            # Write to Delta table
            df_with_timestamp.write \
                .format("delta") \
                .mode("append") \
                .option("mergeSchema", "true") \
                .saveAsTable(DELTA_TABLE_NAME)
            
            row_count = df_assets.count()
            log(f"✓ Delta table export successful")
            log(f"  Rows written: {row_count}")
            log(f"  Mode: append (historical accumulation)")
            log(f"  Schema merge: enabled")
            log_execution_time("Delta table export", cell_start_time)
            
            # Verify table exists
            try:
                table_info = spark.sql(f"DESCRIBE EXTENDED {DELTA_TABLE_NAME}")
                log(f"  ✓ Table verified: {DELTA_TABLE_NAME}")
            except Exception as verify_error:
                log(f"  ⚠️ Warning: Could not verify table: {verify_error}")
                
        except Exception as e:
            log(f"✗ Delta table export failed: {str(e)[:300]}")
            if is_job_mode:
                raise  # Fail job if Delta export fails in job mode
            log("  Continuing with other export formats...")
else:
    if not ENABLE_DELTA_EXPORT:
        log("\n⊘ Delta export disabled (ENABLE_DELTA_EXPORT=False)")
        log("   Set ENABLE_DELTA_EXPORT = True to enable long-term retention")
    else:
        log("\n⊘ Delta export skipped (no data)")

In [0]:
if ENABLE_EXCEL_EXPORT and df_assets is not None:
    cell_start_time = time.time()
    
    log("\nExporting to Excel...")
    
    # Create timestamp for filename
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    excel_path = f"{EXPORT_PATH}/workspace_default_named_assets_{timestamp}.xlsx"
    
    try:
        # Convert to Pandas
        pdf_assets = df_assets.toPandas()
        
        # Convert timestamp columns to Eastern Time, then drop the timezone:
        # Excel cannot store timezone-aware datetimes and to_excel() rejects them
        for col_name in pdf_assets.columns:
            if pd.api.types.is_datetime64_any_dtype(pdf_assets[col_name]):
                series = pdf_assets[col_name]
                if series.dt.tz is None:
                    series = series.dt.tz_localize('UTC')
                pdf_assets[col_name] = series.dt.tz_convert(eastern).dt.tz_localize(None)
        
        # Create Excel writer
        with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
            # Main sheet with all assets
            pdf_assets.to_excel(writer, sheet_name='All Assets', index=False)
            
            # Summary sheet
            summary_data = {
                'Metric': ['Total Assets', 'Notebooks', 'SQL Queries', 'Dashboards', 'Scan Date (ET)', 'Execution Time (min)'],
                'Value': [
                    len(all_assets),
                    len(notebook_data),
                    len(query_data),
                    len(dashboard_data),
                    datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z'),
                    round((time.time() - execution_stats['start_time']) / 60, 2)
                ]
            }
            pd.DataFrame(summary_data).to_excel(writer, sheet_name='Summary', index=False)
            
            # Breakdown by type
            type_breakdown = pdf_assets.groupby('asset_type').size().reset_index(name='count')
            type_breakdown.to_excel(writer, sheet_name='By Type', index=False)
            
            # Owner analysis
            owner_breakdown = pdf_assets[pdf_assets['owner'].notna() & (pdf_assets['owner'] != '')] \
                .groupby('owner').size().reset_index(name='count') \
                .sort_values('count', ascending=False)
            owner_breakdown.to_excel(writer, sheet_name='By Owner', index=False)
            
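            # Cleanup recommendations (stale assets)
            # Note: df_stale_assets is created by the Cleanup Recommendations cell, which
            # appears later in this notebook; run that cell first to include this sheet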
            if 'df_stale_assets' in globals() and df_stale_assets is not None:
                pdf_stale = df_stale_assets.toPandas()
                if len(pdf_stale) > 0:
                    pdf_stale.to_excel(writer, sheet_name='Cleanup Recommendations', index=False)
            
            # Execution stats sheet
            stats_data = {
                'Metric': ['API Calls', 'Resources Processed', 'Resources Skipped', 'API Failures', 'API Retries', 'Success Rate (%)'],
                'Value': [
                    execution_stats['api_calls'],
                    execution_stats['resources_processed'],
                    execution_stats['resources_skipped'],
                    execution_stats['api_failures'],
                    execution_stats['api_retries'],
                    round(((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls'] * 100), 2) if execution_stats['api_calls'] > 0 else 0
                ]
            }
            pd.DataFrame(stats_data).to_excel(writer, sheet_name='Execution Stats', index=False)
        
        # Count sheets
        sheet_count = 5 + (1 if 'df_stale_assets' in globals() and df_stale_assets is not None and df_stale_assets.count() > 0 else 0)
        
        log(f"✓ Excel export successful: {excel_path}")
        log(f"  File size: {os.path.getsize(excel_path) / 1024:.2f} KB")
        log(f"  Sheets: {sheet_count} (All Assets, Summary, By Type, By Owner, Execution Stats" + (", Cleanup Recommendations" if sheet_count > 5 else "") + ")")
        log(f"  Rows: {len(pdf_assets)}")
        log_execution_time("Excel export", cell_start_time)
    except Exception as e:
        log(f"✗ Excel export failed: {e}")
        if is_job_mode:
            raise
else:
    if not ENABLE_EXCEL_EXPORT:
        log("\n⊘ Excel export disabled (ENABLE_EXCEL_EXPORT=False)")
    else:
        log("\n⊘ Excel export skipped (no data)")
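
The success-rate expression in the Execution Stats sheet guards against division by zero when no API calls were made. A minimal standalone sketch of that calculation follows. The function name `success_rate_pct` is hypothetical and not part of the notebook, and the sample numbers are illustrative only.

```python
def success_rate_pct(api_calls: int, api_failures: int, ndigits: int = 2) -> float:
    """Return the percentage of API calls that did not fail.

    Mirrors the expression used for the 'Success Rate (%)' metric:
    returns 0 when no calls were made, to avoid dividing by zero.
    """
    if api_calls <= 0:
        return 0.0
    return round((api_calls - api_failures) / api_calls * 100, ndigits)


# Illustrative values only (not real scan results)
example_stats = {"api_calls": 200, "api_failures": 5}
rate = success_rate_pct(example_stats["api_calls"], example_stats["api_failures"])
print(f"Success rate: {rate}%")
```

A helper like this could replace the inline expression wherever the same metric is reported, so the zero-call guard lives in one place.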

In [0]:
# Calculate total execution time
total_execution_time = time.time() - execution_stats['start_time']

log("\n" + "="*60)
log("SUMMARY REPORT")
log("="*60)

# Create timestamp
eastern = pytz.timezone('America/New_York')
completion_time = datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')

log(f"Scan completed at: {completion_time}")
log(f"Total execution time: {total_execution_time:.2f} seconds ({total_execution_time/60:.2f} minutes)")
log(f"Execution mode: {execution_mode}")
log(f"\nTotal assets found: {len(all_assets)}")
log(f"  - Notebooks: {len(notebook_data)}")
log(f"  - SQL Queries: {len(query_data)}")
log(f"  - Dashboards: {len(dashboard_data)}")

if df_assets is not None:
    log("\nBreakdown by asset type:")
    df_assets.groupBy("asset_type").count().orderBy("asset_type").show()
    
    # Validate DataFrame before displaying
    if validate_dataframe_exists("df_assets", df_assets):
        if not is_job_mode:
            log("\nSample of assets with default naming (first 20):")
            display(df_assets.select("asset_type", "asset_name", "owner", "asset_path", "modified_timestamp").limit(20))
else:
    log("\n⚠️ No assets found with default naming conventions")

log("\n" + "="*60)
log("EXPORT LOCATIONS")
log("="*60)
if ENABLE_DELTA_EXPORT:
    log(f"💾 Delta Table: {DELTA_TABLE_NAME}")
    log(f"   Query: SELECT * FROM {DELTA_TABLE_NAME}")
if ENABLE_EXCEL_EXPORT:
    log(f"📊 Excel Files: {EXPORT_PATH}/workspace_default_named_assets_*.xlsx")
if ENABLE_HTML_EXPORT:
    log(f"🌐 HTML Reports: {EXPORT_PATH}/workspace_default_named_assets_*.html")
log("="*60)

log("\n✓ Workspace scan complete!")

In [0]:
if execution_mode == 'interactive' and df_assets is not None:
    log("\n" + "="*60)
    log("ADDITIONAL ANALYSIS (Interactive Mode)")
    log("="*60)
    
    # Top users with most default-named assets
    log("\nTop 10 Users with Most Default-Named Assets:")
    from pyspark.sql.functions import regexp_extract
    
    # Use the owner column directly if available, otherwise extract from path
    if 'owner' in df_assets.columns:
        df_user_summary = df_assets \
            .filter(F.col("owner").isNotNull()) \
            .groupBy("owner") \
            .count() \
            .orderBy(F.col("count").desc()) \
            .limit(10)
    else:
        df_with_users = df_assets.withColumn(
            "user",
            regexp_extract(F.col("asset_path"), r"/Users/([^/]+)/", 1)
        )
        
        df_user_summary = df_with_users \
            .filter(F.col("user") != "") \
            .groupBy("user") \
            .count() \
            .orderBy(F.col("count").desc()) \
            .limit(10)
    
    display(df_user_summary)
    
    # Recent activity
    log("\nRecent Activity (Assets modified in last 90 days):")
    from pyspark.sql.functions import datediff
    
    df_recent = df_assets \
        .filter(F.col("modified_timestamp").isNotNull()) \
        .withColumn("days_since_modified", datediff(F.current_timestamp(), F.col("modified_timestamp"))) \
        .filter(F.col("days_since_modified") <= 90) \
        .select("asset_type", "asset_name", "owner", "asset_path", "days_since_modified") \
        .orderBy("days_since_modified")
    
    recent_count = df_recent.count()
    log(f"Found {recent_count} recently modified assets with default names")
    
    if recent_count > 0:
        display(df_recent.limit(20))
elif execution_mode == 'job':
    log("\n✓ Job mode: Scan complete. Results exported to configured locations.")
    
    # Create job completion summary
    summary = {
        'status': 'SUCCESS',
        'completion_time': completion_time,
        'execution_time_seconds': round(total_execution_time, 2),
        'execution_time_minutes': round(total_execution_time / 60, 2),
        'mode': 'JOB',
        'data_collected': {
            'total_assets': len(all_assets),
            'notebooks': len(notebook_data),
            'queries': len(query_data),
            'dashboards': len(dashboard_data)
        },
        'execution_stats': execution_stats,
        'exports': {
            'delta_enabled': ENABLE_DELTA_EXPORT,
            'delta_table': DELTA_TABLE_NAME if ENABLE_DELTA_EXPORT else None,
            'excel_enabled': ENABLE_EXCEL_EXPORT
        }
    }
    
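    # Return summary for job orchestration.
    # Note: dbutils.notebook.exit() ends the notebook run immediately, so any cells
    # placed after this one will not execute in job mode.
    dbutils.notebook.exit(json.dumps(summary, default=str))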
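
In job mode the notebook returns its summary as a JSON string through `dbutils.notebook.exit`. A calling notebook or orchestrator can parse that string. The sketch below assumes the caller already holds the returned string (for example, the return value of `dbutils.notebook.run`) and uses made-up values in place of a real run. The helper name `summarize_scan_result` is hypothetical.

```python
import json


def summarize_scan_result(result_json: str) -> str:
    """Parse the JSON summary string and return a one-line status message."""
    summary = json.loads(result_json)
    counts = summary["data_collected"]
    return (
        f"{summary['status']}: {counts['total_assets']} default-named assets "
        f"({counts['notebooks']} notebooks, {counts['queries']} queries, "
        f"{counts['dashboards']} dashboards) in "
        f"{summary['execution_time_minutes']} min"
    )


# Illustrative payload using the same keys as the job summary; values are made up
sample = json.dumps({
    "status": "SUCCESS",
    "execution_time_minutes": 3.5,
    "data_collected": {"total_assets": 12, "notebooks": 7, "queries": 3, "dashboards": 2},
})
print(summarize_scan_result(sample))
```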

In [0]:
# Print detailed execution statistics
print_execution_summary()

# Additional statistics
if df_assets is not None:
    log("\nData Quality Metrics:")
    
    # Count assets with and without owner information (count the total only once)
    total_count = df_assets.count()
    assets_with_owner = df_assets.filter(F.col("owner").isNotNull() & (F.col("owner") != "")).count()
    assets_without_owner = total_count - assets_with_owner
    
    if total_count > 0:
        log(f"  Assets with owner info: {assets_with_owner} ({assets_with_owner/total_count*100:.1f}%)")
        log(f"  Assets without owner info: {assets_without_owner} ({assets_without_owner/total_count*100:.1f}%)")
    else:
        log("  No assets to analyze")
    
    # Count by asset type
    log("\nAssets by Type:")
    type_counts = df_assets.groupBy("asset_type").count().collect()
    for row in type_counts:
        log(f"  {row['asset_type']}: {row['count']}")

In [0]:
if execution_mode == 'interactive' and df_assets is not None:
    log("\n" + "="*60)
    log("OWNER ANALYSIS BY ASSET TYPE")
    log("="*60)
    
    # Notebooks by owner
    log("\nTop 10 Notebook Owners:")
    df_notebook_owners = df_assets \
        .filter(F.col("asset_type") == "NOTEBOOK") \
        .filter(F.col("owner").isNotNull() & (F.col("owner") != "")) \
        .groupBy("owner") \
        .count() \
        .orderBy(F.col("count").desc()) \
        .limit(10)
    
    display(df_notebook_owners)
    
    # Queries by owner
    log("\nTop 10 Query Owners:")
    df_query_owners = df_assets \
        .filter(F.col("asset_type") == "QUERY") \
        .filter(F.col("owner").isNotNull() & (F.col("owner") != "")) \
        .groupBy("owner") \
        .count() \
        .orderBy(F.col("count").desc()) \
        .limit(10)
    
    if df_query_owners.count() > 0:
        display(df_query_owners)
    else:
        log("  No query owner information available")
    
    # Assets without owner (need attention)
    log("\nAssets Without Owner Information:")
    df_no_owner = df_assets \
        .filter(F.col("owner").isNull() | (F.col("owner") == "")) \
        .select("asset_type", "asset_name", "asset_path") \
        .orderBy("asset_type", "asset_name")
    
    no_owner_count = df_no_owner.count()
    log(f"Found {no_owner_count} assets without owner information")
    
    if no_owner_count > 0:
        display(df_no_owner.limit(20))

In [0]:
if execution_mode == 'interactive' and df_assets is not None:
    log("\n" + "="*60)
    log("ASSET AGE DISTRIBUTION ANALYSIS")
    log("="*60)
    
    from pyspark.sql.functions import datediff
    
    # Calculate age in days
    df_with_age = df_assets \
        .filter(F.col("modified_timestamp").isNotNull()) \
        .withColumn("age_days", datediff(F.current_timestamp(), F.col("modified_timestamp")))
    
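    # Create age buckets, plus a numeric sort key so buckets order chronologically
    # (sorting the labels as strings would place "181-365 days" before "31-90 days")
    df_age_buckets = df_with_age.withColumn(
        "age_bucket",
        F.when(F.col("age_days") <= 30, "0-30 days")
        .when(F.col("age_days") <= 90, "31-90 days")
        .when(F.col("age_days") <= 180, "91-180 days")
        .when(F.col("age_days") <= 365, "181-365 days")
        .otherwise("365+ days")
    ).withColumn(
        "bucket_order",
        F.when(F.col("age_days") <= 30, 1)
        .when(F.col("age_days") <= 90, 2)
        .when(F.col("age_days") <= 180, 3)
        .when(F.col("age_days") <= 365, 4)
        .otherwise(5)
    )
    
    log("\nAge Distribution of Default-Named Assets:")
    age_distribution = df_age_buckets.groupBy("bucket_order", "age_bucket").count() \
        .orderBy("bucket_order").drop("bucket_order")
    display(age_distribution)
    
    log("\nAge Distribution by Asset Type:")
    age_by_type = df_age_buckets.groupBy("asset_type", "bucket_order", "age_bucket").count() \
        .orderBy("asset_type", "bucket_order").drop("bucket_order")
    display(age_by_type)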
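
The bucket thresholds used in this cell can be sketched in plain Python, which makes the boundaries easy to check without a Spark session. The names `AGE_BUCKETS`, `age_bucket`, and `bucket_sort_key` are hypothetical helpers for illustration. The sort key is included because the labels do not sort chronologically as strings.

```python
# (upper bound in days, label) pairs, checked in order; same thresholds as the Spark cell
AGE_BUCKETS = [
    (30, "0-30 days"),
    (90, "31-90 days"),
    (180, "91-180 days"),
    (365, "181-365 days"),
]
OLDEST_BUCKET = "365+ days"


def age_bucket(age_days: int) -> str:
    """Return the bucket label for an asset age in days."""
    for upper_bound, label in AGE_BUCKETS:
        if age_days <= upper_bound:
            return label
    return OLDEST_BUCKET


def bucket_sort_key(label: str) -> int:
    """Return the chronological position of a bucket label."""
    ordered = [name for _, name in AGE_BUCKETS] + [OLDEST_BUCKET]
    return ordered.index(label)


# Illustrative ages in days (not real scan data)
labels = sorted({age_bucket(days) for days in [400, 10, 200, 45]}, key=bucket_sort_key)
print(labels)
```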

In [0]:
if df_assets is not None:
    log("\n" + "="*60)
    log("CLEANUP RECOMMENDATIONS")
    log("="*60)
    
    from pyspark.sql.functions import datediff
    
    # Use configurable threshold from widget parameter
    log(f"\nIdentifying stale assets (not modified in {STALE_THRESHOLD_DAYS}+ days)...")
    
    df_stale = df_assets \
        .filter(F.col("modified_timestamp").isNotNull()) \
        .withColumn("age_days", datediff(F.current_timestamp(), F.col("modified_timestamp"))) \
        .filter(F.col("age_days") >= STALE_THRESHOLD_DAYS) \
        .select("asset_type", "asset_name", "owner", "asset_path", "age_days", "modified_timestamp") \
        .orderBy(F.col("age_days").desc())
    
    stale_count = df_stale.count()
    
    log(f"\n🗑️ Stale Assets (not modified in {STALE_THRESHOLD_DAYS}+ days): {stale_count}")
    
    if stale_count > 0:
        log(f"\nRecommendation: Consider reviewing and renaming or deleting these {stale_count} assets.")
        log(f"These assets have default names and haven't been modified in {STALE_THRESHOLD_DAYS}+ days.")
        
        # Breakdown by type
        stale_by_type = df_stale.groupBy("asset_type").count().collect()
        log("\nStale assets by type:")
        for row in stale_by_type:
            log(f"  {row['asset_type']}: {row['count']}")
        
        if execution_mode == 'interactive':
            log("\nTop 20 Oldest Stale Assets:")
            display(df_stale.limit(20))
        
        # Store for export
        globals()['df_stale_assets'] = df_stale
    else:
        log(f"\n✓ No stale assets found - all default-named assets modified within {STALE_THRESHOLD_DAYS} days!")
        globals()['df_stale_assets'] = None
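
The stale-asset rule (skip assets with no modification timestamp, keep those at least `STALE_THRESHOLD_DAYS` old, oldest first) can be sketched without Spark as follows. The function `find_stale` and the sample records are hypothetical. Ages are computed in whole calendar days, which is how Spark's `datediff` behaves.

```python
from datetime import datetime, timedelta


def find_stale(assets, threshold_days, now):
    """Return (asset, age_days) pairs for assets at least `threshold_days` old, oldest first.

    Assets without a modification timestamp are skipped, as in the Spark filter.
    """
    stale = []
    for asset in assets:
        modified = asset.get("modified_timestamp")
        if modified is None:
            continue
        # Difference in calendar days, ignoring the time of day
        age_days = (now.date() - modified.date()).days
        if age_days >= threshold_days:
            stale.append((asset, age_days))
    return sorted(stale, key=lambda pair: pair[1], reverse=True)


# Illustrative records only (not real scan results)
now = datetime(2024, 6, 1, 12, 0)
assets = [
    {"asset_name": "Untitled Notebook", "modified_timestamp": now - timedelta(days=200)},
    {"asset_name": "Untitled Query", "modified_timestamp": now - timedelta(days=10)},
    {"asset_name": "New Dashboard", "modified_timestamp": None},
]
for asset, age in find_stale(assets, threshold_days=90, now=now):
    print(asset["asset_name"], age)
```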

In [0]:
if ENABLE_HTML_EXPORT and df_assets is not None:
    cell_start_time = time.time()
    
    log("\nExporting to HTML report...")
    
    # Create timestamp for filename
    eastern = pytz.timezone('America/New_York')
    timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')
    html_path = f"{EXPORT_PATH}/workspace_default_named_assets_{timestamp}.html"
    
    try:
        # Convert to Pandas
        pdf_assets = df_assets.toPandas()
        
        # Build HTML report
        html_content = f"""
<!DOCTYPE html>
<html>
<head>
    <title>Workspace Default Naming Scanner Report</title>
    <style>
        body {{ font-family: Arial, sans-serif; margin: 20px; background-color: #f5f5f5; }}
        h1 {{ color: #FF3621; }}
        h2 {{ color: #333; border-bottom: 2px solid #FF3621; padding-bottom: 5px; }}
        .summary {{ background-color: white; padding: 20px; border-radius: 5px; margin-bottom: 20px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}
        .metric {{ display: inline-block; margin: 10px 20px; }}
        .metric-label {{ font-weight: bold; color: #666; }}
        .metric-value {{ font-size: 24px; color: #FF3621; }}
        table {{ border-collapse: collapse; width: 100%; background-color: white; margin-bottom: 20px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }}
        th {{ background-color: #FF3621; color: white; padding: 12px; text-align: left; }}
        td {{ padding: 10px; border-bottom: 1px solid #ddd; }}
        tr:hover {{ background-color: #f5f5f5; }}
        .footer {{ text-align: center; color: #666; margin-top: 40px; font-size: 12px; }}
        .config {{ background-color: #fff3cd; padding: 10px; border-radius: 5px; margin-bottom: 20px; }}
    </style>
</head>
<body>
    <h1>🔍 Workspace Default Naming Scanner Report</h1>
    
    <div class="config">
        <strong>Configuration:</strong> Stale Asset Threshold = {STALE_THRESHOLD_DAYS} days | 
        Age Filter = {MIN_AGE_DAYS if MIN_AGE_DAYS else 'None'} | 
        Exclude Repos = {exclude_repos_param}
    </div>
    
    <div class="summary">
        <h2>Summary</h2>
        <div class="metric">
            <div class="metric-label">Total Assets</div>
            <div class="metric-value">{len(all_assets)}</div>
        </div>
        <div class="metric">
            <div class="metric-label">Notebooks</div>
            <div class="metric-value">{len(notebook_data)}</div>
        </div>
        <div class="metric">
            <div class="metric-label">SQL Queries</div>
            <div class="metric-value">{len(query_data)}</div>
        </div>
        <div class="metric">
            <div class="metric-label">Dashboards</div>
            <div class="metric-value">{len(dashboard_data)}</div>
        </div>
        <p><strong>Scan Date:</strong> {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}</p>
        <p><strong>Execution Time:</strong> {round((time.time() - execution_stats['start_time']) / 60, 2)} minutes</p>
    </div>
    
    <div class="summary">
        <h2>Top 10 Users with Most Default-Named Assets</h2>
        <table>
            <tr><th>Owner</th><th>Count</th></tr>
"""
        
        # Add top owners
        owner_counts = pdf_assets[pdf_assets['owner'].notna() & (pdf_assets['owner'] != '')] \
            .groupby('owner').size().reset_index(name='count') \
            .sort_values('count', ascending=False).head(10)
        
        import html  # stdlib; escapes <, > and & so owner values cannot break the markup
        for _, row in owner_counts.iterrows():
            html_content += f"            <tr><td>{html.escape(str(row['owner']))}</td><td>{row['count']}</td></tr>\n"
        
        html_content += """
        </table>
    </div>
    
    <div class="summary">
        <h2>Recent Activity (Last 90 Days)</h2>
        <table>
            <tr><th>Asset Type</th><th>Asset Name</th><th>Owner</th><th>Days Since Modified</th></tr>
"""
        
        # Add recent activity
        recent_assets = pdf_assets[pdf_assets['modified_timestamp'].notna()].copy()
        recent_assets['days_since_modified'] = (pd.Timestamp.now() - pd.to_datetime(recent_assets['modified_timestamp'])).dt.days
        recent_assets = recent_assets[recent_assets['days_since_modified'] <= 90] \
            .sort_values('days_since_modified').head(20)
        
        import html  # stdlib; escape asset metadata before embedding it in the report
        for _, row in recent_assets.iterrows():
            owner_display = row['owner'] if pd.notna(row['owner']) and row['owner'] != '' else 'Unknown'
            html_content += f"            <tr><td>{html.escape(str(row['asset_type']))}</td><td>{html.escape(str(row['asset_name']))}</td><td>{html.escape(str(owner_display))}</td><td>{row['days_since_modified']}</td></tr>\n"
        
        html_content += """
        </table>
    </div>
"""
        
        # Add stale assets section if available
        if 'df_stale_assets' in globals() and df_stale_assets is not None:
            pdf_stale = df_stale_assets.limit(20).toPandas()
            stale_count = df_stale_assets.count()
            html_content += f"""
    <div class="summary">
        <h2>🗑️ Cleanup Recommendations (Stale Assets)</h2>
        <p>Assets not modified in {STALE_THRESHOLD_DAYS}+ days: <strong>{stale_count}</strong></p>
        <table>
            <tr><th>Asset Type</th><th>Asset Name</th><th>Owner</th><th>Age (Days)</th><th>Path</th></tr>
"""
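            import html  # stdlib; escape asset metadata before embedding it in the report
            for _, row in pdf_stale.iterrows():
                owner_display = row['owner'] if pd.notna(row['owner']) and row['owner'] != '' else 'Unknown'
                html_content += f"            <tr><td>{html.escape(str(row['asset_type']))}</td><td>{html.escape(str(row['asset_name']))}</td><td>{html.escape(str(owner_display))}</td><td>{row['age_days']}</td><td>{html.escape(str(row['asset_path']))}</td></tr>\n"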
            
            html_content += """        </table>
    </div>
"""
        
        html_content += f"""
    <div class="footer">
        <p>Generated by Databricks Workspace Default Naming Scanner</p>
        <p>Report generated at {datetime.now(eastern).strftime('%Y-%m-%d %H:%M:%S %Z')}</p>
    </div>
</body>
</html>
"""
        
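        # Write HTML file (dbutils.fs expects a dbfs:/ URI rather than the /dbfs FUSE path)
        html_file_dbfs = ('dbfs:' + html_path[len('/dbfs'):]) if html_path.startswith('/dbfs/') else html_path
        dbutils.fs.put(html_file_dbfs, html_content, overwrite=True)
        
        log(f"✓ HTML report export successful: {html_path}")
        log(f"  File size: {len(html_content.encode('utf-8')) / 1024:.2f} KB")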
        log(f"  Sections: Summary, Top Owners, Recent Activity, Cleanup Recommendations")
        log_execution_time("HTML report export", cell_start_time)
    except Exception as e:
        log(f"✗ HTML report export failed: {e}")
        if is_job_mode:
            raise
else:
    if not ENABLE_HTML_EXPORT:
        log("\n⊘ HTML export disabled (ENABLE_HTML_EXPORT=False)")
    else:
        log("\n⊘ HTML export skipped (no data)")

In [0]:
if is_job_mode and ENABLE_EMAIL_NOTIFICATIONS and EMAIL_RECIPIENTS and df_assets is not None:
    cell_start_time = time.time()
    
    log("\nSending email notification...")
    
    try:
        # Build email body
        email_subject = f"Workspace Scan Complete - {len(all_assets)} Default-Named Assets Found"
        
        email_body = f"""
        <h2>Workspace Default Naming Scanner - Scan Complete</h2>
        
        <h3>Summary</h3>
        <ul>
            <li><strong>Total Assets Found:</strong> {len(all_assets)}</li>
            <li><strong>Notebooks:</strong> {len(notebook_data)}</li>
            <li><strong>SQL Queries:</strong> {len(query_data)}</li>
            <li><strong>Dashboards:</strong> {len(dashboard_data)}</li>
            <li><strong>Scan Date:</strong> {completion_time}</li>
            <li><strong>Execution Time:</strong> {round(total_execution_time / 60, 2)} minutes</li>
        </ul>
        
        <h3>Export Locations</h3>
        <ul>
            <li><strong>Delta Table:</strong> {DELTA_TABLE_NAME}</li>
            <li><strong>Excel File:</strong> {EXPORT_PATH}/workspace_default_named_assets_*.xlsx</li>
            <li><strong>HTML Report:</strong> {EXPORT_PATH}/workspace_default_named_assets_*.html</li>
        </ul>
        
        <h3>Execution Statistics</h3>
        <ul>
            <li><strong>API Calls:</strong> {execution_stats['api_calls']}</li>
            <li><strong>API Failures:</strong> {execution_stats['api_failures']}</li>
            <li><strong>Success Rate:</strong> {round(((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls'] * 100), 1) if execution_stats['api_calls'] > 0 else 0}%</li>
        </ul>
        
        <p>View the full report in the Delta table or download the Excel/HTML files from the export path.</p>
        """
        
        # Send one email per recipient through a helper notebook.
        # The notebook path below is a placeholder, not a built-in Databricks API.
        for recipient in EMAIL_RECIPIENTS:
            try:
                dbutils.notebook.run(
                    "/System/send_email",  # Placeholder - adjust to your email sending mechanism
                    timeout_seconds=60,
                    arguments={
                        "to": recipient,
                        "subject": email_subject,
                        "body": email_body
                    }
                )
                log(f"  ✓ Email sent to {recipient}")
            except Exception as email_error:
                log(f"  ⚠️ Failed to send email to {recipient}: {email_error}")
        
        log_execution_time("Email notification", cell_start_time)
    except Exception as e:
        log(f"✗ Email notification failed: {e}")
else:
    if not is_job_mode:
        log("\n⏭️ Email notifications only available in job mode")
    elif not ENABLE_EMAIL_NOTIFICATIONS:
        log("\n⊘ Email notifications disabled (ENABLE_EMAIL_NOTIFICATIONS=False)")
    elif not EMAIL_RECIPIENTS:
        log("\n⊘ Email notifications skipped (no recipients configured)")
    else:
        log("\n⊘ Email notifications skipped (no data)")
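
The `/System/send_email` notebook path in the email cell is a placeholder, so the sending mechanism has to be supplied. One possible approach, sketched here with the Python standard library only, is to build the HTML message with `email.message.EmailMessage` and hand it to an SMTP relay. The function name `build_scan_email`, the addresses, and the SMTP host are all hypothetical, and whether a relay is reachable from the cluster depends on the environment.

```python
from email.message import EmailMessage


def build_scan_email(sender, recipient, subject, html_body):
    """Build an HTML email message; sending is left to the caller."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    # Plain-text fallback for clients that do not render HTML
    message.set_content("This report is best viewed in an HTML-capable mail client.")
    message.add_alternative(html_body, subtype="html")
    return message


msg = build_scan_email(
    sender="scanner@example.com",   # hypothetical address
    recipient="team@example.com",   # hypothetical address
    subject="Workspace Scan Complete - 12 Default-Named Assets Found",
    html_body="<h2>Workspace Default Naming Scanner - Scan Complete</h2>",
)
print(msg["Subject"])

# Sending would need an SMTP relay reachable from the cluster, for example:
# import smtplib
# with smtplib.SMTP("smtp.example.com", 587) as server:  # hypothetical host
#     server.starttls()
#     server.send_message(msg)
```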

## Download Exported Files

To download the exported files from DBFS to your local machine:

### Option 1: Using Databricks CLI
```bash
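# List the export folder to find the exact timestamped filenames
# (databricks fs cp does not expand * wildcards in DBFS paths)
databricks fs ls dbfs:/tmp/workspace_scan_export/

# Download the Excel report (replace YYYYMMDD_HHMMSS with the timestamp from the listing)
databricks fs cp dbfs:/tmp/workspace_scan_export/workspace_default_named_assets_YYYYMMDD_HHMMSS.xlsx ./

# Download the HTML report
databricks fs cp dbfs:/tmp/workspace_scan_export/workspace_default_named_assets_YYYYMMDD_HHMMSS.html ./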
```
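
Because every run writes a new file with a `YYYYMMDD_HHMMSS` timestamp in its name, the export folder accumulates reports over time. The sketch below picks the newest report from a list of filenames, such as the output of a directory listing. The helper name `latest_export` and the sample filenames are hypothetical.

```python
import re
from datetime import datetime

# Matches the filenames written by the export cells, e.g.
# workspace_default_named_assets_20240601_093000.xlsx
EXPORT_NAME = re.compile(r"workspace_default_named_assets_(\d{8}_\d{6})\.(xlsx|html)$")


def latest_export(filenames, extension):
    """Return the newest export with the given extension, or None if there is none."""
    candidates = []
    for name in filenames:
        match = EXPORT_NAME.search(name)
        if match and match.group(2) == extension:
            written_at = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
            candidates.append((written_at, name))
    return max(candidates)[1] if candidates else None


# Illustrative listing only (not real export files)
listing = [
    "workspace_default_named_assets_20240501_081500.xlsx",
    "workspace_default_named_assets_20240601_093000.xlsx",
    "workspace_default_named_assets_20240601_093005.html",
    "notes.txt",
]
print(latest_export(listing, "xlsx"))
```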

### Option 2: Using the Workspace UI
1. Navigate to **Data** → **DBFS** in the left sidebar
2. Browse to `/tmp/workspace_scan_export/`
3. Click on the file you want to download (Excel or HTML)
4. Click the **Download** button

### Option 3: Query the Delta Table
If the Delta export succeeded, you can query the table directly. Because each scan appends to the table, the result includes rows from every previous scan:
```sql
SELECT * FROM main.default.workspace_default_named_assets
ORDER BY modified_timestamp DESC
```

### Option 4: Download All Files
```bash
databricks fs cp -r dbfs:/tmp/workspace_scan_export/ ./workspace_scan_export/
```

### Option 5: View HTML Report in Browser
After downloading the HTML file, simply open it in any web browser to view the formatted report with tables and styling.