# Coding Standards Documentation
## Databricks Workspace Audit Notebooks

This document outlines the coding standards, conventions, and best practices used across Databricks workspace audit notebooks including the Workspace Asset Scanner, Security Review Export, and Workspace Features Audit. These standards ensure consistency, maintainability, and enterprise-grade quality.

---

## Table of Contents

### 📚 Foundational Standards (Start Here)
1. [Code Organization](#code-organization) - Cell structure, ordering, logical grouping
2. [Import Organization](#import-organization) - Import order, aliases, package installation *(See section 20)*
3. [Naming Conventions](#naming-conventions) - Variables, functions, constants, DataFrames
4. [Documentation Standards](#documentation-standards) - Docstrings, comments, markdown *(Includes code comments)*
5. [Configuration Management](#configuration-management) - Structure, widgets, validation

### 🔧 Core Patterns (Essential Techniques)
6. [Error Handling](#error-handling) - Try-except, graceful degradation, validation
7. [Security and Secrets Management](#security-and-secrets-management) - Tokens, secrets, sanitization *(See section 21)*
8. [Logging and Output](#logging-and-output) - Centralized logging, formatting *(Includes string formatting from section 24)*
9. [Data Processing](#data-processing) - Data structures, filtering, timestamps *(Includes Spark DataFrame best practices from section 25)*
10. [Testing and Validation](#testing-and-validation) - DataFrame validation, config checks *(Includes data quality from section 23)*

### 🔌 Integration & APIs
11. [API Integration](#api-integration) - REST API patterns, retry logic, pagination
12. [Databricks SDK Integration](#databricks-sdk-integration) - WorkspaceClient, SDK patterns
13. [Parallel Processing](#parallel-processing-with-threadpoolexecutor) - ThreadPoolExecutor, concurrent operations *(See section 19)*

### ⚡ Advanced Patterns (Environment & Optimization)
14. [Compute Type Detection](#compute-type-detection-and-optimization) - Serverless vs traditional *(See section 12)*
15. [Job Mode and Widget Parameters](#job-mode-and-widget-parameters) - Job detection, overrides *(See section 13)*
16. [Execution Mode Patterns](#execution-mode-patterns) - Quick/Full modes, feature flags *(See section 18)*
17. [Performance Optimization](#performance-optimization) - Execution tracking, memory monitoring *(See section 9)*

### 📊 Analysis & Reporting
18. [Health Scoring and Risk Assessment](#health-scoring-and-risk-assessment) - Scoring systems, risk factors *(See section 14)*
19. [Recommendation Generation](#recommendation-generation) - Structured recommendations, priorities *(See section 15)*
20. [Visualization Standards](#visualization-standards) - Matplotlib, charts, conditional display *(See section 16)*
21. [Export Format Flexibility](#export-format-flexibility) - Excel, CSV, JSON, Delta *(See section 17)*

### ✅ Summary
22. [Summary: Key Best Practices](#summary-key-best-practices) - Checklist, quick reference

---

## 📖 Reading Guide

**For New Developers**: Read sections 1-10 (Foundational + Core Patterns)  
**For API Integration**: Focus on sections 11-13  
**For Performance Tuning**: Read sections 14-17  
**For Reporting Features**: Read sections 18-21  

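**Note**: The numbers above give the recommended reading order. Section headings in the body keep their original notebook numbering, so they may differ from the numbers in this list; the *(See section N)* notes point to the corresponding body heading.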

---

## Version Control

| Version | Date | Author | Changes |
|---------|------|--------|---------|  
| 1.0.0 | 2026-02-16 | Assistant | Comprehensive coding standards documentation with 22 conceptual sections organized into 5 logical categories: Foundational Standards (code organization, imports, naming, documentation, configuration), Core Patterns (error handling, security, logging, data processing, testing), Integration & APIs (REST API, Databricks SDK, parallel processing), Advanced Patterns (compute detection, job mode, execution modes, performance optimization), and Analysis & Reporting (health scoring, recommendations, visualizations, exports). Includes merged topics for better cohesion and cross-references for navigation. |

---

## Scope

These standards apply to:
* **Workspace Asset Scanner** - Default naming convention detection
* **Security Review Export** - Comprehensive security audit
* **Workspace Features Audit** - Feature inventory and configuration
* **Other workspace audit and governance notebooks**

---

## Key Principles

✓ **Consistency**: Use the same patterns across all notebooks  
✓ **Maintainability**: Write self-documenting code with clear structure  
✓ **Enterprise-Grade**: Include error handling, logging, and validation  
✓ **Performance**: Optimize for both serverless and traditional compute  
✓ **Flexibility**: Support multiple execution modes and export formats  
✓ **Security**: Never hardcode credentials, use secrets management  
✓ **Documentation**: Comprehensive inline and notebook-level documentation  
✓ **Data Quality**: Validate inputs, handle nulls, check for duplicates

## 1. Code Organization

### Cell Structure

**Principle**: One logical unit per cell with clear, descriptive titles

* **Cell Titles**: Use descriptive display names that clearly indicate purpose
  * Good: `"Scan Notebooks"`, `"API Helper Functions"`, `"Export to Delta Table"`
  * Avoid: `"Cell 1"`, `"Code"`, `"Untitled"`

* **Cell Ordering**:
  1. Documentation header (markdown)
  2. Setup and configuration
  3. Helper functions
  4. Data collection/scanning cells
  5. Data processing and consolidation
  6. Export and reporting
  7. Analysis and visualization

### Logical Grouping

* Group related functionality together
* Use markdown cells as section dividers
* Keep cells focused on a single responsibility
* Typical cell size: 50-150 lines (exceptions for complex logic)

### Example Structure

```python
# ============================================================================
# SECTION HEADER: Brief description
# ============================================================================

# Implementation code here

# ============================================================================
```

## 2. Naming Conventions

### Variables

**Constants** (Configuration values):
* `UPPER_SNAKE_CASE` for all constants
* Examples: `MAX_RETRIES`, `ENABLE_DELTA_EXPORT`, `EXPORT_PATH`

**Variables**:
* `snake_case` for all variables
* Examples: `api_url`, `api_token`, `all_notebooks`, `page_count`

**Data Collections**:
* Plural nouns for lists/collections: `notebooks`, `queries`, `dashboards`
* Suffix `_data` for processed results: `notebook_data`, `query_data`
* Prefix `all_` for complete collections: `all_notebooks`, `all_queries`

**DataFrames**:
* Prefix with `df_`: `df_assets`, `df_stale`
* Use descriptive names: `df_stale_assets` not `df1`

### Functions

* `snake_case` for all function names
* Use verb-noun pattern: `get_api_client()`, `validate_config()`, `log_execution_time()`
* Boolean functions and flags start with `is_`, `has_`, `should_`: `is_default_name`, `is_job_mode`

### API-Related

* Consistent naming for API variables:
  * `api_url` - Base API URL
  * `api_token` - Authentication token
  * `headers` - Request headers dictionary
  * `response` - API response object
  * `page_token` - Pagination token

### Examples

```python
# Good
MAX_RETRIES = 3
api_url, api_token = get_api_client()
all_notebooks = []
df_assets = spark.createDataFrame(all_assets, schema)

# Avoid
maxRetries = 3  # Wrong case
url, token = get_client()  # Not descriptive
notebooks_list = []  # Redundant suffix
df = spark.createDataFrame(data)  # Not descriptive
```

## 3. Documentation Standards

### Notebook-Level Documentation

**Required Elements**:
1. Title and overview
2. Feature list with categories
3. Version control table
4. Configuration documentation
5. Usage instructions
6. Asset types or data sources table

### Docstrings

**Functions**: Use triple-quoted docstrings

```python
def get_api_client():
    """Get Databricks API client configuration"""
    # Implementation

def api_call_with_retry(func, *args, **kwargs):
    """Execute API call with retry logic and stats tracking"""
    # Implementation

def validate_dataframe_exists(df_name, df):
    """Validate that a DataFrame exists and has data"""
    # Implementation
```

### Inline Comments

**Section Headers**: Use banner-style comments

```python
# ============================================================================
# PERFORMANCE CONFIGURATION
# ============================================================================
MAX_RETRIES = 3
RETRY_DELAY = 2
# ============================================================================
```

**Code Comments**: Comment the *why*, not the *what*
* Explain the reasoning behind the code rather than restating it
* Place above the code block
* Use complete sentences

```python
# Good: Explains reasoning
# Convert timestamps from milliseconds to datetime if present
# Databricks APIs return timestamps in epoch milliseconds
if created_at:
    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)

# Avoid: States the obvious
# Convert timestamp
if created_at:  # if created_at exists
    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
```

### Complex Logic Comments

**Explain complex algorithms**:

```python
# Calculate risk score using weighted factors:
# - High-risk items: +20 points each (security vulnerabilities)
# - Medium-risk items: +10 points each (best practice violations)
# - Low-risk items: +5 points each (optimization opportunities)
# Score is capped at 100 to maintain consistent scale
risk_score = 0
for factor in risk_factors:
    if factor['severity'] == 'HIGH':
        risk_score += 20
    elif factor['severity'] == 'MEDIUM':
        risk_score += 10
    else:
        risk_score += 5
risk_score = min(risk_score, 100)
```
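
The same weighting can also be expressed as a lookup table, which keeps the weights in one place. This is a minimal sketch; `SEVERITY_WEIGHTS`, `DEFAULT_WEIGHT`, `MAX_RISK_SCORE`, and `calculate_risk_score` are hypothetical names used for illustration and are not taken from the notebooks.

```python
# Hypothetical names for illustration; the weights mirror the comments above.
SEVERITY_WEIGHTS = {'HIGH': 20, 'MEDIUM': 10}
DEFAULT_WEIGHT = 5    # Any other severity is treated as low risk
MAX_RISK_SCORE = 100  # Cap keeps the score on a consistent scale


def calculate_risk_score(risk_factors):
    """Sum per-factor weights and cap the total at MAX_RISK_SCORE."""
    total = sum(
        SEVERITY_WEIGHTS.get(factor['severity'], DEFAULT_WEIGHT)
        for factor in risk_factors
    )
    return min(total, MAX_RISK_SCORE)


example_factors = [
    {'severity': 'HIGH'},
    {'severity': 'MEDIUM'},
    {'severity': 'LOW'},
]
example_score = calculate_risk_score(example_factors)  # 20 + 10 + 5 = 35
```

Wrapping the logic in a function also makes the cap easy to check in isolation.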

### TODO and FIXME Comments

**Use standard markers for future work**:

```python
# TODO: Add support for custom date ranges
# TODO: Implement incremental refresh for large datasets
# TODO(username): Review performance optimization for serverless

# FIXME: Handle edge case where user has no groups
# HACK: Temporary workaround for API pagination bug
```

### Deprecation Warnings

**Document deprecated code**:

```python
def old_function():
    """
    DEPRECATED: Use new_function() instead.
    This function will be removed in version 2.0.
    """
    import warnings
    warnings.warn(
        "old_function is deprecated, use new_function instead",
        DeprecationWarning,
        stacklevel=2
    )
    # ... implementation ...
```
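
When several functions are deprecated at once, the warning can be factored into a decorator. The sketch below assumes a hypothetical `deprecated` decorator that is not part of the notebooks; it relies only on the standard `warnings` and `functools` modules.

```python
import functools
import warnings


def deprecated(replacement):
    """Mark a function as deprecated and point callers at its replacement."""
    def decorator(func):
        @functools.wraps(func)  # Preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated, use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # Attribute the warning to the caller
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("new_function")
def old_function():
    """DEPRECATED: Use new_function() instead."""
    return "legacy result"
```

The decorated function still runs normally; callers only see an additional `DeprecationWarning`.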

### Markdown Formatting

* Use `**bold**` for emphasis
* Use `*italic*` for feature names
* Use `` `code` `` for inline code/variables
* Use `✓`, `✗`, `⚠️`, `📊`, `🚀` emojis sparingly for visual clarity
* Use tables for structured information
* Use horizontal rules (`---`) to separate major sections

## 4. Configuration Management

### Configuration Structure

**Organize configurations into logical sections**:

```python
# ============================================================================
# PERFORMANCE CONFIGURATION
# ============================================================================
MAX_RETRIES = 3
RETRY_DELAY = 2
MAX_WORKERS = 10
# ============================================================================

# ============================================================================
# EXPORT PATH CONFIGURATION
# ============================================================================
EXPORT_PATH = '/dbfs/tmp/workspace_scan_export'
# ============================================================================

# ============================================================================
# FEATURE FLAGS: Enable/disable features
# ============================================================================
ENABLE_EXCEL_EXPORT = True
ENABLE_HTML_EXPORT = True
ENABLE_DELTA_EXPORT = True
# ============================================================================
```

### Widget Parameters

**Pattern**: Check for job mode, then get widget values with defaults

```python
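# Detect if running in job mode or interactive mode
try:
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    # isDefined() returns a boolean instead of raising, so use its result
    is_job_mode = bool(ctx.currentRunId().isDefined())
except Exception:
    # Context not accessible: treat the run as interactive
    is_job_mode = False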

# Get parameters from widgets
if not is_job_mode:
    execution_mode = dbutils.widgets.get("execution_mode")
    output_catalog = dbutils.widgets.get("output_catalog")
else:
    execution_mode = 'job'
    output_catalog = 'main'
```

### Configuration Dictionaries

**Use dictionaries for related configuration**:

```python
DEFAULT_PATTERNS = {
    'notebooks': ['Untitled Notebook', 'Untitled', 'New Notebook'],
    'queries': ['Untitled Query', 'New Query', 'Untitled'],
    'dashboards': ['New Dashboard', 'Untitled Dashboard', 'Untitled']
}

execution_stats = {
    'start_time': time.time(),
    'api_calls': 0,
    'api_failures': 0,
    'api_retries': 0
}
```

### Validation

**Always validate configuration before execution**:

```python
import re

def validate_config():
    """Validate configuration parameters"""
    errors = []
    
    if not isinstance(MAX_RETRIES, int) or MAX_RETRIES < 1:
        errors.append("MAX_RETRIES must be a positive integer")
    
    if ENABLE_DELTA_EXPORT:
        if not re.match(r'^[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+$', DELTA_TABLE_NAME):
            errors.append("DELTA_TABLE_NAME must be in format 'catalog.schema.table'")
    
    return errors

config_errors = validate_config()
if config_errors:
    error_msg = "Configuration validation failed:\n" + "\n".join(f"  - {e}" for e in config_errors)
    raise ValueError(error_msg)
```
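
To see which names the format check accepts, the pattern can be exercised on its own. This is a small sketch; `TABLE_NAME_PATTERN` and `is_valid_table_name` are hypothetical names, and the table names in the usage lines are made-up examples.

```python
import re

# Same pattern as in validate_config(): three dot-separated identifiers.
TABLE_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+$')


def is_valid_table_name(table_name):
    """Return True if table_name looks like 'catalog.schema.table'."""
    return bool(TABLE_NAME_PATTERN.match(table_name))


three_part_ok = is_valid_table_name('main.audit.workspace_assets')
two_part_ok = is_valid_table_name('audit.workspace_assets')
hyphen_ok = is_valid_table_name('main.audit.workspace-assets')
```

Two-part names and names containing hyphens are rejected, so identifiers that need backtick quoting would have to be handled separately.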

## 5. Error Handling

### Try-Except Patterns

**Graceful Degradation**: Continue execution when possible

```python
# Pattern 1: Return safe defaults on error
api_url, api_token = get_api_client()
if not api_url or not api_token:
    log("  ✗ Failed to get API client")
    notebooks = []
else:
    # Proceed with API calls
    pass

# Pattern 2: Catch and log, continue processing
try:
    # Risky operation
    response = requests.get(url, headers=headers, timeout=30)
except Exception as e:
    log(f"  ✗ Error fetching data: {str(e)}")
    all_data = []
```

**Timestamp Conversion**: Handle multiple formats

```python
# Handle different timestamp formats (milliseconds or ISO string)
created_timestamp = None
if created_at:
    try:
        if isinstance(created_at, (int, float)):
            created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
        else:
            created_timestamp = datetime.fromisoformat(str(created_at).replace('Z', '+00:00')).astimezone(eastern)
    except (ValueError, OverflowError, OSError):
        pass  # Unparseable value: timestamp remains None
```
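
The two-format conversion can be wrapped in a reusable helper. The sketch below is an illustration rather than notebook code: `parse_api_timestamp` is a hypothetical name, and it defaults to UTC from the standard library instead of the `eastern` timezone used in these notebooks so that it runs without extra packages.

```python
from datetime import datetime, timezone


def parse_api_timestamp(value, tz=timezone.utc):
    """Convert epoch milliseconds or an ISO 8601 string to an aware datetime.

    Returns None when the value is missing or cannot be parsed.
    """
    if value is None or value == '':
        return None
    try:
        if isinstance(value, (int, float)):
            # Databricks APIs return epoch milliseconds
            return datetime.fromtimestamp(value / 1000, tz=tz)
        iso_text = str(value).replace('Z', '+00:00')
        return datetime.fromisoformat(iso_text).astimezone(tz)
    except (ValueError, OverflowError, OSError):
        return None


from_millis = parse_api_timestamp(1700000000000)
from_iso = parse_api_timestamp('2024-01-15T10:30:00Z')
from_garbage = parse_api_timestamp('not a date')
```

Passing `tz=eastern` would reproduce the timezone handling shown above.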

### Validation Checks

**Check for None/Empty before processing**:

```python
if df_assets is not None:
    # Process DataFrame
    pass
else:
    log("⚠️  No assets found, skipping export")

if len(all_assets) == 0:
    log("\n✓ No assets with default naming found!")
    df_assets = None
else:
    # Create DataFrame
    pass
```

### Error Messages

**Use consistent symbols**:
* `✓` - Success
* `✗` - Failure
* `⚠️` - Warning

```python
log("  ✓ Fetched 150 notebooks")
log("  ✗ Failed to fetch queries: 404")
log("  ⚠️ Warning: Export path may not be writable")
```

## 6. Logging and Output

### Logging Function

**Centralized logging with mode awareness**:

```python
def log(message):
    """Print a message in both interactive and job mode.

    Job runs capture stdout, so printed messages also appear in the job logs.
    """
    print(message)
```

### Log Message Patterns

**Section Headers**:
```python
log("\n" + "="*60)
log("CREATING CONSOLIDATED DATAFRAME")
log("="*60)
```

**Progress Updates**:
```python
log("Fetching notebooks...")
log(f"  Fetched {len(all_notebooks)} notebooks so far...")
log(f"  ✓ Fetched {len(all_notebooks)} total notebooks")
```

**Execution Timing**:
```python
def log_execution_time(cell_name, start_time):
    """Log execution time for a cell"""
    elapsed = time.time() - start_time
    log(f"⏱️  {cell_name} completed in {elapsed:.2f} seconds")

# Usage
cell_start_time = time.time()
# ... cell logic ...
log_execution_time("Scan Notebooks", cell_start_time)
```

**Summary Reports**:
```python
log(f"Total assets with default naming: {len(all_assets)}")
log(f"  - Notebooks: {len(notebook_data)}")
log(f"  - SQL Queries: {len(query_data)}")
log(f"  - Dashboards: {len(dashboard_data)}")
```

### Display vs Print

**Use `display()` for DataFrames**:
```python
# Good
display(df_assets.limit(10))

# Avoid
print(df_assets.limit(10))  # Poor formatting
df_assets.limit(10)  # May not render in all contexts
```

### F-Strings

**Always use f-strings for string formatting**:
```python
# Good
log(f"Fetched {count} items in {elapsed:.2f} seconds")
log(f"Export path: {EXPORT_PATH}")

# Avoid
log("Fetched " + str(count) + " items")  # Concatenation
log("Fetched %d items" % count)  # Old-style formatting
```

## 7. API Integration

### API Client Pattern

**Centralized API configuration**:

```python
def get_api_client():
    """Get Databricks API client configuration"""
    try:
        ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
        api_url = ctx.apiUrl().get()
        api_token = ctx.apiToken().get()
        return api_url, api_token
    except Exception as e:
        log(f"Error getting API client: {e}")
        return None, None
```

### Retry Logic

**Implement exponential backoff**:

```python
def api_call_with_retry(func, *args, **kwargs):
    """Execute API call with retry logic and stats tracking.

    Returns the 200 response, or the last failed response (None if the
    final attempt raised) so that callers can log the status code.
    """
    response = None
    for attempt in range(MAX_RETRIES):
        try:
            execution_stats['api_calls'] += 1
            response = func(*args, **kwargs)
            # Compare against None: a requests.Response is falsy for 4xx/5xx
            if response is not None and response.status_code == 200:
                return response
        except Exception as e:
            response = None
            log(f"  ⚠️ API call raised on attempt {attempt + 1}: {e}")
        execution_stats['api_failures'] += 1
        if attempt < MAX_RETRIES - 1:
            execution_stats['api_retries'] += 1
            time.sleep(RETRY_DELAY * (2 ** attempt))  # Exponential backoff
    return response
```
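
With the defaults used in this document (`MAX_RETRIES = 3`, `RETRY_DELAY = 2`), the sleep after a failed attempt is `RETRY_DELAY * 2 ** attempt`, and no sleep follows the final attempt. The hypothetical helper below (not part of the notebooks) computes that schedule so the worst-case wait is easy to see.

```python
def backoff_delays(max_retries, retry_delay):
    """Return the sleep in seconds after each failed attempt except the last."""
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


delays = backoff_delays(max_retries=3, retry_delay=2)  # Sleeps of 2 s, then 4 s
total_wait = sum(delays)                               # 6 s in the worst case
```

Raising `MAX_RETRIES` roughly doubles the worst-case wait with each extra attempt, which is worth keeping in mind for paginated scans.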

### Pagination Pattern

**Standard pagination loop**:

```python
all_items = []
page_token = None
page_count = 0

try:
    while True:
        params = {"page_size": 100}
        if page_token:
            params["page_token"] = page_token
        
        def fetch_items():
            return requests.get(
                f"{api_url}/api/2.0/endpoint",
                headers=headers,
                params=params,
                timeout=30
            )
        
        response = api_call_with_retry(fetch_items)
        
        if response is not None and response.status_code == 200:
            data = response.json()
            items = data.get('items', [])
            all_items.extend(items)
            page_count += 1
            
            # Progress logging every 5 pages
            if page_count % 5 == 0:
                log(f"  Fetched {len(all_items)} items so far...")
            
            # Check for next page
            page_token = data.get('next_page_token')
            if not page_token:
                break
        else:
            # A requests.Response is falsy for 4xx/5xx, so test against None
            log(f"  ✗ Failed to fetch items: {response.status_code if response is not None else 'No response'}")
            break
            
except Exception as e:
    log(f"  ✗ Error fetching items: {str(e)}")
    all_items = []
```
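
The loop above can also be packaged as a generator so that each scan cell only supplies the page-fetching logic. This is a sketch under assumptions: `iter_pages` and `fetch_page` are hypothetical names, and `fetch_page(page_token)` is assumed to return the decoded JSON dictionary for one page, or `None` on failure. Unlike the loop above, it keeps items from pages that were already fetched when a later page fails.

```python
def iter_pages(fetch_page):
    """Yield items from a token-paginated API until no next token remains."""
    page_token = None
    while True:
        data = fetch_page(page_token)
        if data is None:
            break  # Stop on failure; items yielded so far are kept
        yield from data.get('items', [])
        page_token = data.get('next_page_token')
        if not page_token:
            break


# Usage with an in-memory stand-in for the API, keyed by page token
fake_pages = {
    None: {'items': [1, 2], 'next_page_token': 'p2'},
    'p2': {'items': [3], 'next_page_token': None},
}
all_items = list(iter_pages(fake_pages.get))
```

In a notebook, `fetch_page` would wrap the `requests.get` call and `api_call_with_retry` shown earlier.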

### Request Headers

**Consistent header structure**:

```python
# Standard headers
headers = {"Authorization": f"Bearer {api_token}"}

# For POST requests with JSON
headers = {
    "Authorization": f"Bearer {api_token}",
    "Content-Type": "application/json"
}
```
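
Both header variants can come from one helper so that the token is formatted in a single place. `build_headers` is a hypothetical name for illustration, and the token in the usage lines is a placeholder, not a real credential.

```python
def build_headers(api_token, json_body=False):
    """Return request headers, adding Content-Type for JSON POST bodies."""
    headers = {"Authorization": f"Bearer {api_token}"}
    if json_body:
        headers["Content-Type"] = "application/json"
    return headers


get_headers = build_headers("example-token")
post_headers = build_headers("example-token", json_body=True)
```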

## 8. Data Processing

### Data Structure Pattern

**Consistent dictionary structure for all asset types**:

```python
assets.append({
    'asset_type': 'notebook',
    'asset_name': notebook_name,
    'asset_id': str(notebook_id),
    'asset_path': notebook_path,
    'owner': owner,
    'created_timestamp': created_timestamp,
    'modified_timestamp': modified_timestamp
})
```

### DataFrame Creation

**Define schema explicitly**:

```python
schema = StructType([
    StructField("asset_type", StringType(), True),
    StructField("asset_name", StringType(), True),
    StructField("asset_id", StringType(), True),
    StructField("asset_path", StringType(), True),
    StructField("owner", StringType(), True),
    StructField("created_timestamp", TimestampType(), True),
    StructField("modified_timestamp", TimestampType(), True)
])

df_assets = spark.createDataFrame(all_assets, schema)
```

### Filtering Pattern

**Apply filters consistently**:

```python
# Check if name matches default patterns for this asset type
# (asset_type must be a DEFAULT_PATTERNS key such as 'notebooks')
is_default_name = any(
    pattern.lower() in asset_name.lower()
    for pattern in DEFAULT_PATTERNS.get(asset_type, [])
)

if is_default_name:
    # Apply age filter if configured
    if MIN_AGE_DAYS and modified_timestamp:
        age_days = (datetime.now(eastern) - modified_timestamp).days
        if age_days < MIN_AGE_DAYS:
            execution_stats['resources_skipped'] += 1
            continue
    
    # Apply incremental scan filter if enabled
    if last_scan_timestamp and modified_timestamp:
        if modified_timestamp <= last_scan_timestamp:
            execution_stats['resources_skipped'] += 1
            continue
    
    # Add to results
    assets.append({...})
```
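
The three checks above (name pattern, minimum age, incremental scan) can be combined into one predicate that is testable outside the scan loop. This is a sketch, not notebook code: `should_include_asset` and its parameters are hypothetical, and it uses UTC from the standard library where the notebooks use the `eastern` timezone.

```python
from datetime import datetime, timedelta, timezone


def should_include_asset(asset_name, modified_timestamp, patterns,
                         min_age_days=None, last_scan_timestamp=None,
                         now=None):
    """Return True if the asset passes the name, age, and incremental filters."""
    name = asset_name.lower()
    if not any(pattern.lower() in name for pattern in patterns):
        return False  # Name does not match a default pattern

    if modified_timestamp is not None:
        now = now or datetime.now(timezone.utc)
        # Age filter: skip assets modified more recently than min_age_days
        if min_age_days and (now - modified_timestamp).days < min_age_days:
            return False
        # Incremental filter: skip assets unchanged since the last scan
        if last_scan_timestamp and modified_timestamp <= last_scan_timestamp:
            return False
    return True


# Usage with fixed timestamps so the result is reproducible
scan_time = datetime(2026, 2, 16, tzinfo=timezone.utc)
stale = scan_time - timedelta(days=90)
keep_stale = should_include_asset(
    'Untitled Notebook 3', stale, ['Untitled Notebook', 'Untitled'],
    min_age_days=30, now=scan_time,
)
```

As in the loop above, assets without a modification timestamp skip the age and incremental filters and are included if the name matches.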

### Timestamp Handling

**Consistent timezone conversion**:

```python
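# Define timezone once
eastern = pytz.timezone('America/New_York')

# Convert from milliseconds
created_timestamp = None
if created_at:
    try:
        created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
    except (TypeError, ValueError, OverflowError, OSError):
        pass  # Leave as None if the value cannot be converted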

# Use created as fallback for modified
if not modified_timestamp and created_timestamp:
    modified_timestamp = created_timestamp
```

### Deduplication

**Remove duplicates based on composite key**:

```python
# Deduplicate based on asset_type + asset_id
initial_count = df_assets.count()
df_assets = df_assets.dropDuplicates(["asset_type", "asset_id"])
final_count = df_assets.count()

if initial_count > final_count:
    log(f"\n⚠️  Removed {initial_count - final_count} duplicate entries")
```

## 9. Performance Optimization

### Execution Tracking

**Track performance metrics**:

```python
execution_stats = {
    'start_time': time.time(),
    'api_calls': 0,
    'api_failures': 0,
    'api_retries': 0,
    'resources_processed': 0,
    'resources_skipped': 0,
    'memory_usage_mb': 0
}

# Update throughout execution
execution_stats['resources_processed'] += len(all_items)
execution_stats['resources_skipped'] += 1
```

**Memory Monitoring**:

```python
def get_memory_usage():
    """Get current memory usage in MB"""
    try:
        import psutil
        process = psutil.Process()
        return process.memory_info().rss / 1024 / 1024
    except Exception:  # psutil missing or process info unavailable
        return 0

# Track memory deltas
memory_before = get_memory_usage()
# ... operations ...
memory_after = get_memory_usage()
memory_delta = memory_after - memory_before
execution_stats['memory_usage_mb'] = max(execution_stats['memory_usage_mb'], memory_after)
```

### Cell Timing

**Time each major operation**:

```python
cell_start_time = time.time()

# ... cell logic ...

log_execution_time("Cell Name", cell_start_time)
```
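
The start/stop pair can also be wrapped in a context manager so that the timing line is logged even when the cell body raises. This is a sketch: `timed_cell` is a hypothetical helper, and its `log` parameter defaults to `print` so the example runs without the notebook's `log()` function.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_cell(cell_name, log=print):
    """Log the elapsed time of the enclosed block, including on error."""
    start_time = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - start_time
        log(f"⏱️  {cell_name} completed in {elapsed:.2f} seconds")


# Usage: collect messages in a list instead of printing them
messages = []
with timed_cell("Scan Notebooks", log=messages.append):
    total = sum(range(1000))  # Stand-in for the cell logic
```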

### Progress Reporting

**Report progress for long operations**:

```python
if page_count % 5 == 0:
    log(f"  Fetched {len(all_items)} items so far...")
```

### Performance Presets

**Provide configurable performance modes**:

```python
if USE_QUICK_MODE:
    MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 0
    MAX_WORKSPACE_SCAN_LIMIT = 1000
    ENABLE_SHARED_FOLDER_OWNER_LOOKUP = False
    
elif USE_FULL_MODE:
    MAX_NOTEBOOKS_FOR_OWNER_LOOKUP = 999
    MAX_WORKSPACE_SCAN_LIMIT = 999999
    ENABLE_SHARED_FOLDER_OWNER_LOOKUP = True
```
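
An alternative to the `if`/`elif` chain is a preset dictionary keyed by mode name, which makes it harder to forget a setting when a new mode is added. This is a sketch; `PERFORMANCE_PRESETS` and `get_preset` are hypothetical names, and the values are copied from the block above.

```python
PERFORMANCE_PRESETS = {
    'quick': {
        'MAX_NOTEBOOKS_FOR_OWNER_LOOKUP': 0,
        'MAX_WORKSPACE_SCAN_LIMIT': 1000,
        'ENABLE_SHARED_FOLDER_OWNER_LOOKUP': False,
    },
    'full': {
        'MAX_NOTEBOOKS_FOR_OWNER_LOOKUP': 999,
        'MAX_WORKSPACE_SCAN_LIMIT': 999999,
        'ENABLE_SHARED_FOLDER_OWNER_LOOKUP': True,
    },
}


def get_preset(mode):
    """Return a copy of the settings for the given mode name."""
    try:
        return dict(PERFORMANCE_PRESETS[mode])
    except KeyError:
        valid = ', '.join(sorted(PERFORMANCE_PRESETS))
        raise ValueError(f"Unknown mode '{mode}'. Valid modes: {valid}") from None


settings = get_preset('quick')
```

An unknown mode raises a `ValueError` listing the valid names instead of silently leaving the limits undefined.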

### Conditional Processing

**Skip expensive operations when not needed**:

```python
if ENABLE_EXCEL_EXPORT and df_assets is not None:
    # Export to Excel
    pass

if execution_mode == 'interactive':
    # Show detailed visualizations
    pass
```

## 10. Testing and Validation

### Data Validation

**Validate DataFrames before use**:

```python
def validate_dataframe_exists(df_name, df):
    """Validate that a DataFrame exists and has data"""
    if df is None:
        log(f"⚠️  Warning: {df_name} is None")
        return False
    try:
        count = df.count()
        if count == 0:
            log(f"⚠️  Warning: {df_name} is empty (0 rows)")
            return False
        return True
    except Exception as e:
        log(f"⚠️  Warning: Error checking {df_name}: {str(e)}")
        return False

# Usage
if not validate_dataframe_exists("df_assets", df_assets):
    log("⚠️  Warning: DataFrame validation failed")
```

### Configuration Validation

**Validate before execution**:

```python
import re

def validate_config():
    """Validate configuration parameters"""
    errors = []
    
    # Type checks
    if not isinstance(MAX_RETRIES, int) or MAX_RETRIES < 1:
        errors.append("MAX_RETRIES must be a positive integer")
    
    # Format checks
    if ENABLE_DELTA_EXPORT:
        if not re.match(r'^[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+$', DELTA_TABLE_NAME):
            errors.append("DELTA_TABLE_NAME must be in format 'catalog.schema.table'")
    
    # Range checks
    if MIN_AGE_DAYS is not None and MIN_AGE_DAYS < 0:
        errors.append("MIN_AGE_DAYS must be a non-negative integer or None")
    
    return errors

config_errors = validate_config()
if config_errors:
    error_msg = "Configuration validation failed:\n" + "\n".join(f"  - {e}" for e in config_errors)
    raise ValueError(error_msg)
```

### Path Validation

**Test write permissions**:

```python
try:
    test_file = f"{EXPORT_PATH}/.test".replace('/dbfs', 'dbfs:')
    dbutils.fs.put(test_file, 'test', overwrite=True)
    dbutils.fs.rm(test_file)
    log("  ✓ Export path is writable")
except Exception as e:
    log(f"  ⚠️ Warning: Export path may not be writable: {e}")
```

### Execution Summary

**Report comprehensive statistics**:

```python
def print_execution_summary():
    """Print execution statistics summary"""
    elapsed = time.time() - execution_stats['start_time']
    log(f"\n{'='*60}")
    log("EXECUTION SUMMARY")
    log(f"{'='*60}")
    log(f"Total execution time: {elapsed:.2f} seconds ({elapsed/60:.2f} minutes)")
    log(f"API calls made: {execution_stats['api_calls']}")
    log(f"Resources processed: {execution_stats['resources_processed']}")
    log(f"Resources skipped: {execution_stats['resources_skipped']}")
    log(f"API failures: {execution_stats['api_failures']}")
    log(f"API retries: {execution_stats['api_retries']}")
    if execution_stats['api_calls'] > 0:
        success_rate = ((execution_stats['api_calls'] - execution_stats['api_failures']) / execution_stats['api_calls']) * 100
        log(f"Success rate: {success_rate:.1f}%")
    log(f"{'='*60}")
```
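
The success-rate arithmetic in the summary can be checked in isolation. In the sketch below, `success_rate` is a hypothetical helper and the counts are made-up example values. Because `api_calls` is incremented on every attempt in the retry logic, the rate is per attempt rather than per logical request.

```python
def success_rate(stats):
    """Return the percentage of successful API attempts, or None if no calls."""
    calls = stats['api_calls']
    if calls == 0:
        return None  # Avoid dividing by zero when nothing was called
    return (calls - stats['api_failures']) / calls * 100


example_rate = success_rate({'api_calls': 8, 'api_failures': 2})  # 6 of 8 succeeded
```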

## 11. Databricks SDK Integration

### SDK Client Pattern

**Use Databricks SDK for modern API access**:

```python
from databricks.sdk import WorkspaceClient

# Initialize client (automatically uses notebook context)
wc = WorkspaceClient()

# Use SDK methods instead of raw REST API calls
users = list(wc.users.list())
groups = list(wc.groups.list())
jobs = list(wc.jobs.list())
warehouses = list(wc.warehouses.list())
```

### SDK vs REST API

**When to use SDK**:
* Modern, type-safe Python interface
* Automatic authentication from notebook context
* Built-in pagination handling
* Better error messages and type hints
* Recommended for new code

**When to use REST API**:
* SDK doesn't support the endpoint yet
* Need fine-grained control over requests
* Working with legacy code
* Custom retry logic required

### SDK Error Handling

```python
from databricks.sdk.errors import NotFound, PermissionDenied

try:
    job = wc.jobs.get(job_id)
except NotFound:
    log(f"Job {job_id} not found")
except PermissionDenied:
    log(f"No permission to access job {job_id}")
except Exception as e:
    log(f"Error fetching job: {str(e)}")
```

### SDK List Comprehensions

**Convert SDK objects to DataFrames**:

```python
# Pattern: Extract relevant fields from SDK objects
users = list(wc.users.list())
users_df = spark.createDataFrame([
    {
        'user_name': u.user_name,
        'display_name': u.display_name,
        'active': u.active
    }
    for u in users
])

# Pattern: Handle optional fields safely
jobs_data = [
    {
        'job_id': j.job_id,
        'job_name': j.settings.name if j.settings else 'Unknown',
        'creator': j.creator_user_name or 'Unknown'
    }
    for j in jobs
]
```
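
The optional-field handling can be moved into a small row-building function. The sketch below uses stand-in objects built with `types.SimpleNamespace`, not real SDK classes, and `job_to_row` is a hypothetical helper; the attribute names are the ones used in the pattern above.

```python
from types import SimpleNamespace


def job_to_row(job):
    """Flatten a job-like object into a dict, defaulting missing fields."""
    settings = getattr(job, 'settings', None)
    return {
        'job_id': job.job_id,
        'job_name': settings.name if settings and settings.name else 'Unknown',
        'creator': getattr(job, 'creator_user_name', None) or 'Unknown',
    }


# Stand-ins for SDK objects: one complete, one with missing optional fields
complete_job = SimpleNamespace(
    job_id=1,
    settings=SimpleNamespace(name='Nightly Audit'),
    creator_user_name='alice@example.com',
)
sparse_job = SimpleNamespace(job_id=2, settings=None, creator_user_name=None)

jobs_data = [job_to_row(j) for j in (complete_job, sparse_job)]
```

Filling defaults before calling `spark.createDataFrame` avoids columns that contain only `None`, which can make schema inference fail.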

### SDK Pagination

**SDK handles pagination automatically**:

```python
# Good: SDK handles pagination internally
all_jobs = list(wc.jobs.list())

# Avoid: Manual pagination (SDK does this for you)
# Don't implement manual pagination with SDK
```

## 12. Compute Type Detection and Optimization

### Serverless Detection

**Detect compute type at notebook start**:

```python
# Detect if running on serverless compute (most reliable method: try caching)
try:
    test_df = spark.range(1)
    test_df.cache()
    test_df.count()
    test_df.unpersist()
    is_serverless = False
    log("Running on TRADITIONAL compute")
except Exception as e:
    if 'PERSIST' in str(e).upper() or 'CACHE' in str(e).upper():
        is_serverless = True
        log("Running on SERVERLESS compute")
    else:
        is_serverless = False
        log(f"Compute type detection inconclusive: {e}")
```

### Conditional Caching

**Cache only on traditional compute**:

```python
if is_serverless:
    # Serverless: Just materialize with count (automatic optimization)
    log("  Using automatic optimization (serverless)")
    users_count = users_df.count()
    groups_count = groups_df.count()
    log(f"  Materialized {users_count} users, {groups_count} groups")
else:
    # Traditional: Explicit caching for performance
    log("  Caching DataFrames for reuse (traditional compute)")
    users_df.cache()
    groups_df.cache()
    users_count = users_df.count()  # Materialize the cache
    groups_count = groups_df.count()  # Materialize the cache
    log(f"  Cached {users_count} users, {groups_count} groups")
```

### Serverless-Compatible File Operations

**Use temporary directories for serverless**:

```python
import os
import tempfile

if is_serverless:
    # Serverless: Use temp directory (DBFS not writable)
    temp_dir = tempfile.mkdtemp()
    export_path = temp_dir
    log(f"Using temporary directory: {export_path}")
else:
    # Traditional: Use DBFS
    export_path = '/dbfs/tmp/exports'
    os.makedirs(export_path, exist_ok=True)
    log(f"Using DBFS directory: {export_path}")
```

### Compute-Aware Optimizations

**Adjust settings based on compute type**:

```python
# Adjust parallelism based on compute type
if is_serverless:
    MAX_WORKERS = 5  # Lower for serverless
    BATCH_SIZE = 100
    log("Optimized for serverless: MAX_WORKERS=5, BATCH_SIZE=100")
else:
    MAX_WORKERS = 20  # Higher for traditional clusters
    BATCH_SIZE = 500
    log("Optimized for traditional: MAX_WORKERS=20, BATCH_SIZE=500")
```

### DBFS Path Handling

**Convert paths for serverless compatibility**:

```python
# Use dbutils.fs for serverless-compatible operations
try:
    dbutils.fs.ls(EXPORT_PATH.replace('/dbfs', 'dbfs:'))
except Exception:  # Path does not exist yet
    dbutils.fs.mkdirs(EXPORT_PATH.replace('/dbfs', 'dbfs:'))

# Write files using dbutils.fs
dbutils.fs.put(
    f"{EXPORT_PATH}/file.txt".replace('/dbfs', 'dbfs:'),
    content,
    overwrite=True
)
```
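
Note that `str.replace('/dbfs', 'dbfs:')` rewrites every occurrence of `/dbfs` in the path, not only the leading mount prefix. A stricter conversion could look like the sketch below; `to_dbfs_uri` is a hypothetical helper, not part of the notebooks or of `dbutils`.

```python
def to_dbfs_uri(path):
    """Convert a '/dbfs/...' local path to a 'dbfs:/...' URI.

    Only a leading '/dbfs' segment is rewritten; other paths are returned
    unchanged.
    """
    if path == '/dbfs':
        return 'dbfs:/'
    if path.startswith('/dbfs/'):
        return 'dbfs:' + path[len('/dbfs'):]
    return path


export_uri = to_dbfs_uri('/dbfs/tmp/workspace_scan_export')
nested_uri = to_dbfs_uri('/dbfs/tmp/backup/dbfs/file.txt')
```

The second call shows the difference: the inner `/dbfs` directory name is left intact, whereas a plain `replace` would corrupt it.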

## 13. Job Mode and Widget Parameters

### Job Mode Detection

**Detect at the very start of notebook (MUST BE FIRST)**:

```python
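# MUST BE FIRST - before any other code
try:
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    # isDefined() returns a boolean instead of raising, so use its result
    is_job_mode = bool(ctx.currentRunId().isDefined())
except Exception:
    # Context not accessible: treat the run as interactive
    is_job_mode = False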

log(f"Execution mode: {'JOB' if is_job_mode else 'INTERACTIVE'}")
```

### Widget Parameter Handling

**Optional parameters with try-except**:

```python
# Pattern: Try to get widget parameter, fall back to default
try:
    job_max_resources = dbutils.widgets.get("max_resources_per_type")
    if job_max_resources:
        MAX_RESOURCES_PER_TYPE = int(job_max_resources)
        log(f"Using job parameter: MAX_RESOURCES_PER_TYPE = {MAX_RESOURCES_PER_TYPE}")
except Exception:
    # Widget doesn't exist - use default from configuration
    pass

try:
    export_path_param = dbutils.widgets.get("export_path")
    if export_path_param:
        EXPORT_PATH = export_path_param
        log(f"Using job parameter: EXPORT_PATH = {EXPORT_PATH}")
except Exception:
    pass
```
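
The repeated try-except blocks above could be folded into one helper. This is a sketch under stated assumptions: `get_param` is a hypothetical name, and the widget getter is passed in as an argument so the function can be exercised outside a notebook (in a notebook it would be `dbutils.widgets.get`).

```python
from typing import Callable, TypeVar

V = TypeVar("V")


def get_param(get_widget: Callable[[str], str], name: str, default: V,
              cast: Callable[[str], V] = str) -> V:
    """Return the widget value converted with `cast`, or `default`."""
    try:
        raw = get_widget(name)
    except Exception:
        # Widget is not defined: fall back to the configured default
        return default
    if raw is None or raw == "":
        return default
    try:
        return cast(raw)
    except (TypeError, ValueError):
        # Value is present but malformed: fall back rather than fail
        return default


# Stand-in for dbutils.widgets.get so the sketch runs anywhere
fake_widgets = {"max_resources_per_type": "250", "export_path": ""}


def fake_get(name: str) -> str:
    return fake_widgets[name]  # Raises KeyError for unknown widgets


print(get_param(fake_get, "max_resources_per_type", 100, int))
print(get_param(fake_get, "export_path", "/dbfs/tmp/exports"))
```

In a notebook the call would look like `get_param(dbutils.widgets.get, "max_resources_per_type", MAX_RESOURCES_PER_TYPE, int)`.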

**Required parameters with validation**:

```python
if not is_job_mode:
    # Interactive: Get from widgets
    try:
        output_catalog = dbutils.widgets.get("output_catalog")
        output_schema = dbutils.widgets.get("output_schema")
    except Exception as e:
        raise ValueError("Required widgets not found. Run cell 1 to create widgets.") from e
else:
    # Job: Use defaults or job parameters
    output_catalog = 'main'
    output_schema = 'default'
```

### Job Mode Overrides

**Force comprehensive settings in job mode**:

```python
if is_job_mode:
    log("\n🤖 JOB MODE DETECTED - Forcing Full Mode")
    log("="*60)
    
    # Override all limits for comprehensive audit
    MAX_RESOURCES_PER_TYPE = 999
    MAX_WORKSPACE_OBJECTS = 2000
    
    # Enable all resource types
    ENABLE_JOBS = True
    ENABLE_WAREHOUSES = True
    ENABLE_CLUSTERS = True
    ENABLE_PIPELINES = True
    
    log("  All limits removed for comprehensive audit")
    log("  All resource types enabled")
    log("="*60)
```

### Job Completion Summary

**Return JSON summary for orchestration**:

```python
if is_job_mode:
    # Create summary for job output
    summary = {
        'status': 'success',
        'execution_time_seconds': execution_time,
        'resources_processed': execution_stats['resources_processed'],
        'api_calls': execution_stats['api_calls'],
        'export_path': EXPORT_PATH,
        'timestamp': datetime.now().isoformat()
    }
    
    # Return as JSON for job orchestration
    import json
    dbutils.notebook.exit(json.dumps(summary))
```
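
On the receiving side, an orchestrating notebook gets this summary back as a string. The sketch below shows one way to parse it defensively; `parse_job_summary` is a hypothetical helper, and the notebook path and timeout in the comment are placeholders.

```python
import json


def parse_job_summary(raw):
    """Parse the JSON string passed to dbutils.notebook.exit()."""
    try:
        summary = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON (or None): report an unknown status instead of raising
        return {"status": "unknown", "raw": raw}
    if not isinstance(summary, dict):
        return {"status": "unknown", "raw": raw}
    return summary


# In a parent notebook (path and timeout are placeholders):
# raw = dbutils.notebook.run("./workspace_audit", 3600)
example = '{"status": "success", "resources_processed": 42}'
summary = parse_job_summary(example)
print(summary["status"], summary["resources_processed"])
```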

## 14. Health Scoring and Risk Assessment

### Health Score Calculation

**Weighted scoring system (0-100)**:

```python
# Define category weights (must sum to 100)
weights = {
    'security': 30,
    'governance': 25,
    'compliance': 25,
    'performance': 20
}

# Calculate scores per category
health_score = 0
max_score = 100

# Security score (tiered: 12, 18, or 24 of a possible 30 points)
security_score = 0
if security_features_enabled >= 4:
    security_score = 24
elif security_features_enabled >= 3:
    security_score = 18
else:
    security_score = 12

health_score += security_score

# Governance score (0-25 points)
governance_score = 0
if unity_catalog_enabled:
    governance_score += 15
if audit_logs_enabled:
    governance_score += 10

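health_score += governance_score

# compliance_score and performance_score are assumed to be calculated
# the same way (capped at their category weights) before being added
health_score += compliance_score + performance_score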

# Display results with emojis
log(f"\n🏥 Overall Health Score: {health_score}/{max_score} ({health_score/max_score*100:.0f}%)")
log(f"\n🔒 Security: {security_score}/{weights['security']} points")
log(f"📋 Governance: {governance_score}/{weights['governance']} points")
log(f"⚖️ Compliance: {compliance_score}/{weights['compliance']} points")
log(f"⚡ Performance: {performance_score}/{weights['performance']} points")
```
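
The weighting rule can also be enforced in code rather than by comment. The following is a minimal sketch, assuming per-category scores have already been computed; `compute_health_score` is a hypothetical helper that checks the weights sum to 100 and clamps each category to its weight.

```python
def compute_health_score(category_scores: dict, weights: dict) -> int:
    """Sum category scores, clamping each one to its category weight."""
    if sum(weights.values()) != 100:
        raise ValueError("Category weights must sum to 100")
    total = 0
    for category, weight in weights.items():
        score = category_scores.get(category, 0)  # Missing category scores 0
        total += max(0, min(score, weight))
    return total


weights = {"security": 30, "governance": 25, "compliance": 25, "performance": 20}
scores = {"security": 24, "governance": 25, "compliance": 10, "performance": 40}
print(compute_health_score(scores, weights))
```

Clamping means an over-reported category (here `performance`) cannot push the total past 100.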

### Risk Scoring

**0-100 risk scale (higher = more risk)**:

```python
risk_score = 0

# High-risk configurations (+20 points each)
if tokens_without_expiry > 0:
    risk_score += 20
    
if admin_count > 10:
    risk_score += 20

# Medium-risk issues (+10 points each)
if inactive_users_with_permissions > 0:
    risk_score += 10

if external_users_count > 5:
    risk_score += 10

# Low-risk issues (+5 points each)
if over_privileged_users > 0:
    risk_score += 5

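# Cap at 100. Note: the five checks above total at most 65 points, so the
# HIGH RISK band below is only reachable once further checks are added.
risk_score = min(risk_score, 100)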

# Categorize risk level with emojis
if risk_score >= 70:
    risk_level = "🔴 HIGH RISK"
elif risk_score >= 40:
    risk_level = "🟡 MEDIUM RISK"
else:
    risk_level = "🟢 LOW RISK"

log(f"\n⚠️ Risk Score: {risk_score}/100 - {risk_level}")
```
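
As the number of checks grows, the point values and thresholds can be kept in one place. This sketch is an illustration only: `SEVERITY_POINTS`, `score_risk`, and `risk_level` are hypothetical names, while the point values and band thresholds are the ones used above.

```python
SEVERITY_POINTS = {"HIGH": 20, "MEDIUM": 10, "LOW": 5}


def score_risk(severities) -> int:
    """Add up points for each detected issue and cap the total at 100."""
    raw = sum(SEVERITY_POINTS.get(severity, 0) for severity in severities)
    return min(raw, 100)


def risk_level(score: int) -> str:
    """Map a 0-100 score to the risk bands used in this section."""
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    return "LOW"


detected = ["HIGH", "HIGH", "MEDIUM", "MEDIUM", "LOW"]
score = score_risk(detected)
print(score, risk_level(score))
```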

### Risk Factors Tracking

**Track individual risk contributors**:

```python
risk_factors = []

if tokens_without_expiry > 0:
    risk_factors.append({
        'category': 'Token Security',
        'severity': 'HIGH',
        'issue': f'{tokens_without_expiry} tokens without expiration',
        'impact': 'Permanent access tokens pose security risk',
        'recommendation': 'Set expiration dates on all tokens',
        'affected_count': tokens_without_expiry
    })

if admin_count > 10:
    risk_factors.append({
        'category': 'Access Control',
        'severity': 'HIGH',  # Matches the +20 (high-risk) weighting in Risk Scoring
        'issue': f'{admin_count} workspace admins',
        'impact': 'Too many admins increase the attack surface',
        'recommendation': 'Review and reduce admin count to 10 or fewer',
        'affected_count': admin_count
    })

# Create DataFrame for reporting
if risk_factors:
    risk_factors_df = spark.createDataFrame(risk_factors)
    display(risk_factors_df)
```

### Severity Levels

**Consistent severity definitions**:

* **CRITICAL**: Immediate action required, active security threat
* **HIGH**: Significant risk, should be addressed soon
* **MEDIUM**: Moderate risk, address in next review cycle
* **LOW**: Minor issue, address when convenient
* **INFO**: Informational, no action required
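
One way to keep these labels consistent across notebooks is to define them once as an ordered enumeration. This is a sketch, not an established part of the audit notebooks; the `Severity` class and `parse_severity` function are hypothetical names.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity labels ordered from least to most urgent."""
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def parse_severity(label: str) -> Severity:
    """Convert a label such as 'high' to a Severity, rejecting unknown values."""
    try:
        return Severity[label.strip().upper()]
    except KeyError:
        raise ValueError(f"Unknown severity: {label!r}") from None


findings = [
    {"issue": "Unused cluster policy", "severity": "LOW"},
    {"issue": "Tokens without expiry", "severity": "HIGH"},
    {"issue": "Public admin endpoint", "severity": "CRITICAL"},
]
# Most urgent first
findings.sort(key=lambda f: parse_severity(f["severity"]), reverse=True)
print([f["severity"] for f in findings])
```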

## 15. Recommendation Generation

### Recommendation Structure

**Consistent recommendation format**:

```python
recommendations = []

# Pattern: Check condition, add recommendation
if condition_detected:
    recommendations.append({
        'priority': 'HIGH',  # CRITICAL, HIGH, MEDIUM, LOW, INFO
        'category': 'Security',
        'title': 'Enable IP Access Lists',
        'description': 'IP access lists are not enabled',
        'impact': 'Workspace accessible from any IP address',
        'recommendation': 'Enable IP access lists to restrict access to trusted networks',
        'documentation': 'https://docs.databricks.com/security/network/ip-access-list.html',
        'affected_count': 1
    })
```

### Priority Levels

**Consistent priority definitions**:

* **CRITICAL**: Immediate action required, active security threat, compliance violation
* **HIGH**: Significant risk, should be addressed within days
* **MEDIUM**: Moderate risk, address in next review cycle (weeks)
* **LOW**: Minor issue, address when convenient (months)
* **INFO**: Informational, no action required

### Recommendation Display

**User-friendly output with priority grouping**:

```python
if recommendations:
    log(f"\n💡 Found {len(recommendations)} recommendations:\n")
    
    # Group by priority
    critical = [r for r in recommendations if r['priority'] == 'CRITICAL']
    high_priority = [r for r in recommendations if r['priority'] == 'HIGH']
    medium_priority = [r for r in recommendations if r['priority'] == 'MEDIUM']
    low_priority = [r for r in recommendations if r['priority'] == 'LOW']
    
    # Display critical first
    if critical:
        log("🔴 CRITICAL:")
        for i, rec in enumerate(critical, 1):
            log(f"  {i}. {rec['title']}")
            log(f"     {rec['description']}")
            log(f"     Impact: {rec['impact']}")
            log(f"     → {rec['recommendation']}\n")
    
    # Display high priority
    if high_priority:
        log("🟠 HIGH PRIORITY:")
        for i, rec in enumerate(high_priority, 1):
            log(f"  {i}. {rec['title']}")
            log(f"     → {rec['recommendation']}\n")
    
    # Display medium priority
    if medium_priority:
        log("🟡 MEDIUM PRIORITY:")
        for i, rec in enumerate(medium_priority, 1):
            log(f"  {i}. {rec['title']}")
            log(f"     → {rec['recommendation']}\n")
else:
    log("\n✅ No recommendations - configuration looks good!")
```
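
The display code above builds a `low_priority` list but never prints it, and `INFO` items are not grouped at all. A loop over an ordered priority list covers every level with less repetition. The sketch below returns the lines instead of logging them so it can run anywhere; `format_recommendations` and `PRIORITY_ORDER` are hypothetical names.

```python
from collections import defaultdict

PRIORITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"]


def format_recommendations(recommendations):
    """Return display lines grouped by priority, highest priority first."""
    grouped = defaultdict(list)
    for rec in recommendations:
        grouped[rec["priority"]].append(rec)

    lines = []
    for priority in PRIORITY_ORDER:
        recs = grouped.get(priority, [])
        if not recs:
            continue  # Skip empty groups
        lines.append(f"{priority}:")
        for i, rec in enumerate(recs, 1):
            lines.append(f"  {i}. {rec['title']}")
            lines.append(f"     → {rec['recommendation']}")
    return lines


sample = [
    {"priority": "LOW", "title": "Tag clusters", "recommendation": "Add owner tags"},
    {"priority": "CRITICAL", "title": "Rotate token", "recommendation": "Revoke and reissue"},
]
for line in format_recommendations(sample):
    print(line)
```

Each returned line can be passed to `log()`. Recommendations whose priority is not in `PRIORITY_ORDER` are silently omitted in this sketch, so validate priorities first if that matters.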

### Actionable Recommendations

**Include specific, measurable actions**:

```python
# Good: Specific and actionable
recommendation = "Set MAX_TOKEN_LIFETIME_DAYS to 90 in workspace settings"
recommendation = "Remove CAN_MANAGE permission from 5 inactive users"
recommendation = "Enable Unity Catalog for data governance"

# Avoid: Vague and generic
recommendation = "Improve token security"  # Too vague
recommendation = "Fix permissions"  # Not specific
recommendation = "Review settings"  # No clear action
```

### Recommendation DataFrame

**Create DataFrame for export**:

```python
if recommendations:
    recommendations_df = spark.createDataFrame(recommendations)
    
    # Sort by priority (CRITICAL first, INFO last)
    recommendations_df = recommendations_df.withColumn(
        'priority_order',
        F.when(F.col('priority') == 'CRITICAL', 1)
         .when(F.col('priority') == 'HIGH', 2)
         .when(F.col('priority') == 'MEDIUM', 3)
         .when(F.col('priority') == 'LOW', 4)
         .otherwise(5)
    ).orderBy('priority_order')
    
    display(recommendations_df.drop('priority_order'))
```

## 16. Visualization Standards

### Matplotlib Configuration

**Standard chart setup**:

```python
import matplotlib.pyplot as plt

# Create figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Chart 1: Bar chart
# Note: df is a pandas DataFrame here; call .toPandas() on a Spark DataFrame first
ax1 = axes[0]
category_counts = df.groupby('category').size().sort_values(ascending=False)
category_counts.plot(kind='bar', ax=ax1, color='steelblue')
ax1.set_title('Distribution by Category', fontsize=12, fontweight='bold')
ax1.set_xlabel('Category')
ax1.set_ylabel('Count')
ax1.grid(axis='y', alpha=0.3)

# Chart 2: Pie chart
ax2 = axes[1]
status_counts = df.groupby('status').size()
ax2.pie(status_counts.values, labels=status_counts.index, autopct='%1.1f%%', startangle=90)
ax2.set_title('Status Distribution', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()
```

### Chart Best Practices

**Consistent styling**:

```python
# Define color palette
COLORS = {
    'primary': 'steelblue',
    'success': 'green',
    'warning': 'orange',
    'danger': 'red',
    'info': 'skyblue'
}

# Set figure size appropriately (create the axes before styling them)
fig, ax = plt.subplots(figsize=(10, 6))  # Single chart
# fig, axes = plt.subplots(2, 2, figsize=(14, 10))  # Multiple charts

# ... plot onto ax here ...

# Add grid for readability
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Rotate x-axis labels if needed
plt.xticks(rotation=45, ha='right')

# Add value labels on bars
for container in ax.containers:
    ax.bar_label(container, fmt='%d')
```

### Conditional Visualization

**Only show charts in interactive mode**:

```python
if not is_job_mode and df is not None:
    log("\n" + "="*60)
    log("VISUALIZATION")
    log("="*60)
    
    # Create charts
    fig, ax = plt.subplots(figsize=(10, 6))
    df.groupby('category').size().plot(kind='bar', ax=ax)
    ax.set_title('Distribution by Category')
    plt.tight_layout()
    plt.show()
else:
    log("ℹ️  Visualization skipped (job mode or no data)")
```

### Display vs Show

**Use appropriate display method**:

```python
# DataFrames: Use display()
display(df.limit(10))

# Matplotlib charts: Use plt.show()
plt.show()

# Pandas DataFrames in interactive mode
if not is_job_mode:
    display(pandas_df.head(10))

# Don't mix them
# display(plt)  # Wrong - doesn't work
# plt.show(df)  # Wrong - not a chart
```

### Chart Types

**Choose appropriate chart type**:

```python
# Bar chart: Comparing categories
df.groupby('category').size().plot(kind='bar')

# Pie chart: Showing proportions
df.groupby('status').size().plot(kind='pie', autopct='%1.1f%%')

# Line chart: Showing trends over time
df.groupby('date').size().plot(kind='line')

# Histogram: Showing distributions
df['age_days'].plot(kind='hist', bins=20)

# Horizontal bar: Long category names
df.groupby('category').size().plot(kind='barh')
```

## 17. Export Format Flexibility

### Multiple Export Formats

**Support Excel, CSV, JSON, Delta with feature flags**:

```python
# Configuration flags
ENABLE_EXCEL_EXPORT = True
ENABLE_CSV_EXPORT = False
ENABLE_JSON_EXPORT = False
ENABLE_DELTA_EXPORT = True

# Export logic
if df is not None:
    if ENABLE_EXCEL_EXPORT:
        log("Exporting to Excel...")
        # Excel export code
    
    if ENABLE_CSV_EXPORT:
        log("Exporting to CSV...")
        # CSV export code
    
    if ENABLE_JSON_EXPORT:
        log("Exporting to JSON...")
        # JSON export code
    
    if ENABLE_DELTA_EXPORT:
        log("Exporting to Delta table...")
        # Delta export code
```

### Excel Export with Multiple Sheets

**Use ExcelWriter for multi-sheet workbooks**:

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment

# Create Excel file with multiple sheets
with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
    # Write data sheets
    df_summary.toPandas().to_excel(writer, sheet_name='Summary', index=False)
    df_details.toPandas().to_excel(writer, sheet_name='Details', index=False)
    df_recommendations.toPandas().to_excel(writer, sheet_name='Recommendations', index=False)
    df_stats.toPandas().to_excel(writer, sheet_name='Statistics', index=False)

log(f"✓ Excel workbook created with {len(writer.sheets)} sheets")
```

### Excel Formatting

**Apply professional styling with openpyxl**:

```python
# Load workbook for formatting
wb = load_workbook(excel_path)

for sheet_name in wb.sheetnames:
    ws = wb[sheet_name]
    
    # Format header row (row 1)
    for cell in ws[1]:
        cell.font = Font(bold=True, color='FFFFFF')
        cell.fill = PatternFill(start_color='366092', end_color='366092', fill_type='solid')
        cell.alignment = Alignment(horizontal='center', vertical='center')
    
    # Auto-adjust column widths
    for column in ws.columns:
        max_length = 0
        column_letter = column[0].column_letter
        for cell in column:
            if cell.value:
                max_length = max(max_length, len(str(cell.value)))
        # Set width with min/max bounds
        adjusted_width = min(max_length + 2, 50)
        ws.column_dimensions[column_letter].width = max(adjusted_width, 10)
    
    # Freeze header row
    ws.freeze_panes = 'A2'

wb.save(excel_path)
log("✓ Excel formatting applied")
```

### CSV Export

**Simple CSV export (collects to the driver, so use it only for results that fit in driver memory)**:

```python
if ENABLE_CSV_EXPORT:
    csv_path = f"{EXPORT_PATH}/data_{timestamp}.csv"
    
    # Convert to Pandas once and reuse it for the row count
    pandas_df = df.toPandas()
    pandas_df.to_csv(csv_path, index=False)
    
    log(f"✓ CSV exported: {csv_path}")
    log(f"  Rows: {len(pandas_df)}")
```

### JSON Export

**Structured JSON for API integration**:

```python
if ENABLE_JSON_EXPORT:
    json_path = f"{EXPORT_PATH}/data_{timestamp}.json"
    
    # Convert to JSON with metadata
    export_data = {
        'metadata': {
            'export_timestamp': datetime.now().isoformat(),
            'record_count': df.count(),
            'version': '1.0',
            'execution_time_seconds': execution_time
        },
        'data': df.toPandas().to_dict(orient='records')
    }
    
    with open(json_path, 'w') as f:
        json.dump(export_data, f, indent=2, default=str)
    
    log(f"✓ JSON exported: {json_path}")
```

### Delta Table Export

**Historical accumulation with append mode**:

```python
if ENABLE_DELTA_EXPORT:
    # Add audit metadata
    df_export = df.withColumn('audit_timestamp', F.current_timestamp())
    df_export = df_export.withColumn('execution_time_seconds', F.lit(execution_time))
    df_export = df_export.withColumn('execution_mode', F.lit(execution_mode))
    
    # Write to Delta table (append mode for history)
    df_export.write \
        .format('delta') \
        .mode('append') \
        .option('mergeSchema', 'true') \
        .saveAsTable(DELTA_TABLE_NAME)
    
    log(f"✓ Delta table updated: {DELTA_TABLE_NAME}")
    log(f"  Mode: append (historical retention)")
    log(f"  Schema evolution: enabled")
```

### Export Path Handling

**Consistent path construction**:

```python
# Create timestamp for filenames
eastern = pytz.timezone('America/New_York')
timestamp = datetime.now(eastern).strftime('%Y%m%d_%H%M%S')

# Construct export paths
excel_path = f"{EXPORT_PATH}/report_{timestamp}.xlsx"
csv_path = f"{EXPORT_PATH}/data_{timestamp}.csv"
json_path = f"{EXPORT_PATH}/data_{timestamp}.json"

log(f"Export files will be saved to: {EXPORT_PATH}")
```
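
To keep every file in a run on the same timestamp and avoid double slashes when `EXPORT_PATH` ends in `/`, the path construction can live in one function. This is a minimal sketch; `build_export_paths` is a hypothetical helper, and the caller supplies the (timezone-aware) datetime.

```python
from datetime import datetime


def build_export_paths(export_dir: str, now: datetime) -> dict:
    """Build one path per export format, all sharing a single timestamp."""
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    base = export_dir.rstrip("/")  # Avoid '//' when the directory ends in '/'
    return {
        "excel": f"{base}/report_{timestamp}.xlsx",
        "csv": f"{base}/data_{timestamp}.csv",
        "json": f"{base}/data_{timestamp}.json",
    }


paths = build_export_paths("/dbfs/tmp/exports/", datetime(2024, 1, 2, 3, 4, 5))
print(paths["excel"])
```

In a notebook the second argument would be `datetime.now(eastern)` from the snippet above.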

## 18. Execution Mode Patterns

### Mode-Aware Output

**Conditional display based on execution mode**:

```python
# Interactive mode: Show detailed output and visualizations
if not is_job_mode:
    log("\nDetailed Analysis:")
    display(df.limit(20))
    
    # Show visualizations
    plt.figure(figsize=(10, 6))
    # df is a Spark DataFrame here, so convert before using pandas plotting
    df.toPandas().groupby('category').size().plot(kind='bar')
    plt.title('Distribution by Category')
    plt.show()
else:
    # Job mode: Minimal output, focus on metrics
    log(f"Processed {df.count()} records")
    log(f"Execution time: {execution_time:.2f} seconds")
```

### Performance Presets

**Quick, Full, and Custom modes**:

```python
# ============================================================================
# PERFORMANCE PRESETS: Choose your execution mode
# ============================================================================
# Uncomment ONE of the following presets

# PRESET 1: QUICK MODE (5-10 minutes) - Fast scanning with limits
# Recommended for: Daily monitoring, quick audits, testing
# USE_QUICK_MODE = True

# PRESET 2: FULL MODE (20-60 minutes) - Complete coverage
# Recommended for: Compliance audits, weekly reviews, comprehensive analysis
# USE_FULL_MODE = True

# PRESET 3: CUSTOM MODE (default)
# Recommended for: Specific use cases, targeted audits
# (Default if no preset is uncommented)

# ============================================================================
# Apply preset configurations
# ============================================================================

if 'USE_QUICK_MODE' in dir() and USE_QUICK_MODE:
    log("\n🚀 QUICK MODE ENABLED")
    log("="*60)
    MAX_RESOURCES = 100
    ENABLE_EXPENSIVE_OPERATIONS = False
    ENABLE_DETAILED_ANALYSIS = False
    log("  Resource limit: 100 per type")
    log("  Expensive operations: DISABLED")
    log("  Estimated time: 5-10 minutes")
    log("="*60)
    
elif 'USE_FULL_MODE' in dir() and USE_FULL_MODE:
    log("\n🔍 FULL MODE ENABLED")
    log("="*60)
    MAX_RESOURCES = 999
    ENABLE_EXPENSIVE_OPERATIONS = True
    ENABLE_DETAILED_ANALYSIS = True
    log("  Resource limit: NONE (complete scan)")
    log("  Expensive operations: ENABLED")
    log("  Estimated time: 20-60 minutes")
    log("="*60)
    
else:
    log("\n⚙️ CUSTOM MODE - Using configuration from Cell 2")
    log("="*60)

# Job mode override: Always use Full Mode
if is_job_mode:
    log("\n🤖 JOB MODE - Forcing Full Mode")
    MAX_RESOURCES = 999
    ENABLE_EXPENSIVE_OPERATIONS = True
    ENABLE_DETAILED_ANALYSIS = True
```

### Conditional Feature Flags

**Enable/disable features based on configuration**:

```python
# Configuration
ENABLE_JOBS = True
ENABLE_WAREHOUSES = True
ENABLE_CLUSTERS = False
ENABLE_PIPELINES = True

# Execution with skip messages
if ENABLE_JOBS:
    log("Processing jobs...")
    # ... jobs processing code ...
else:
    log("ℹ️  Jobs scanning disabled (ENABLE_JOBS=False)")

if ENABLE_WAREHOUSES:
    log("Processing warehouses...")
    # ... warehouses processing code ...
else:
    log("ℹ️  Warehouses scanning disabled (ENABLE_WAREHOUSES=False)")
```

### Skip Messages

**Informative skip messages with instructions**:

```python
if not ENABLE_FEATURE:
    log("ℹ️  Feature skipped (ENABLE_FEATURE=False)")
    log("   Set ENABLE_FEATURE=True in Cell 2 to enable")
else:
    # Process feature
    pass

# For conditional features
if not condition_met:
    log("ℹ️  Feature skipped (condition not met)")
    log(f"   Reason: {reason}")
```

### Resource Limits

**Apply limits with clear logging**:

```python
from itertools import islice

if MAX_RESOURCES_PER_TYPE == 999:
    log("Fetching all resources (no limit)...")
    resources = list(wc.resource.list())
else:
    log(f"Fetching resources (up to {MAX_RESOURCES_PER_TYPE})...")
    # islice stops the iterator early instead of listing everything and slicing
    resources = list(islice(wc.resource.list(), MAX_RESOURCES_PER_TYPE))
    
    if len(resources) == MAX_RESOURCES_PER_TYPE:
        log(f"  ⚠️  Limit reached: {MAX_RESOURCES_PER_TYPE} resources")
        log("     Set MAX_RESOURCES_PER_TYPE=999 for complete scan")

log(f"  ✓ Fetched {len(resources)} resources")
```
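
The cost of applying a limit depends on where it is applied: slicing a fully built list still pays for every API page, whereas `itertools.islice` stops pulling from the iterator once the limit is reached. The sketch below demonstrates this with a stand-in generator; `take_limited`, `NO_LIMIT`, and `fake_listing` are hypothetical names.

```python
from itertools import islice

NO_LIMIT = 999  # Sentinel this document uses for "no limit"


def take_limited(items, limit: int) -> list:
    """Return up to `limit` items, consuming no more than needed."""
    if limit == NO_LIMIT:
        return list(items)
    return list(islice(items, limit))


consumed = []


def fake_listing(n: int):
    """Stand-in for a paginated SDK list call; records each item fetched."""
    for i in range(n):
        consumed.append(i)
        yield i


subset = take_limited(fake_listing(1000), 5)
print(len(subset), len(consumed))
```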

## Summary: Key Best Practices

### Code Quality Checklist

✓ **Naming**
- [ ] Constants in `UPPER_SNAKE_CASE`
- [ ] Variables and functions in `snake_case`
- [ ] DataFrames prefixed with `df_`
- [ ] Descriptive, meaningful names

✓ **Documentation**
- [ ] Notebook header with overview and version control
- [ ] Docstrings for all functions
- [ ] Section headers with banner comments
- [ ] Inline comments explain *why*, not *what*
- [ ] TODO/FIXME comments for future work

✓ **Error Handling**
- [ ] Graceful degradation on errors
- [ ] Consistent error symbols (✓, ✗, ⚠️)
- [ ] Validation before processing
- [ ] Try-except with specific error handling

✓ **Configuration**
- [ ] Organized into logical sections with banners
- [ ] Validation function implemented
- [ ] Widget parameters with defaults
- [ ] Feature flags for optional functionality

✓ **API Integration**
- [ ] Centralized API client function (REST or SDK)
- [ ] Retry logic with exponential backoff
- [ ] Pagination pattern for large datasets
- [ ] Progress logging for long operations

✓ **Performance**
- [ ] Execution timing for major operations
- [ ] Memory usage monitoring
- [ ] Statistics tracking
- [ ] Conditional processing based on mode
- [ ] Compute-aware optimizations (serverless vs traditional)
- [ ] Parallel processing with ThreadPoolExecutor where appropriate

✓ **Data Processing**
- [ ] Consistent data structure across asset types
- [ ] Explicit schema definition
- [ ] Timezone-aware timestamp handling
- [ ] Deduplication logic
- [ ] Null handling and validation
- [ ] Data quality checks

✓ **Logging**
- [ ] Centralized logging function
- [ ] Consistent message formatting with f-strings
- [ ] Progress updates for long operations
- [ ] Summary reports with statistics
- [ ] Appropriate emoji usage for visual clarity

✓ **Security**
- [ ] Never hardcode credentials or tokens
- [ ] Use dbutils.secrets for sensitive data
- [ ] Mask sensitive information in logs
- [ ] Sanitize user input
- [ ] Set appropriate file permissions

✓ **Advanced Patterns**
- [ ] Databricks SDK integration where appropriate
- [ ] Serverless compute detection and optimization
- [ ] Job mode detection and overrides
- [ ] Health scoring and risk assessment (if applicable)
- [ ] Recommendation generation with priorities
- [ ] Conditional visualizations
- [ ] Multiple export format support
- [ ] Performance presets (Quick/Full/Custom)

✓ **Code Quality**
- [ ] Imports organized (stdlib → third-party → local)
- [ ] Standard import aliases (pd, F, T)
- [ ] Spark DataFrame best practices
- [ ] Proper string formatting
- [ ] Data quality validation

---

## Quick Reference

### Common Patterns

```python
# Job mode detection (MUST BE FIRST)
try:
    ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
    is_job_mode = bool(ctx.currentRunId().isDefined())
except Exception:
    is_job_mode = False

# Serverless detection (relies on cache() failing on serverless compute)
try:
    spark.range(1).cache().count()
    is_serverless = False
except Exception:
    is_serverless = True

# API client (REST)
api_url, api_token = get_api_client()

# SDK client
from databricks.sdk import WorkspaceClient
wc = WorkspaceClient()

# Logging with symbols
log(f"✓ Success: {count} items processed")
log(f"✗ Error: {error_message}")
log(f"⚠️ Warning: {warning_message}")
log(f"ℹ️ Info: {info_message}")

# Execution timing
cell_start_time = time.time()
# ... code ...
log_execution_time("Cell Name", cell_start_time)

# Conditional execution
if ENABLE_FEATURE and df is not None:
    # Process feature
    pass
else:
    log("ℹ️  Feature skipped")

# DataFrame validation
if not validate_dataframe_exists("df_name", df):
    log("⚠️  Validation failed")

# Parallel processing
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(func, item) for item in items]
    for future in as_completed(futures):
        result = future.result()
```

---

## Additional Resources

* **PEP 8**: Python style guide - https://pep8.org/
* **Databricks Best Practices**: Official documentation
* **Spark Programming Guide**: DataFrame optimization
* **Databricks SDK Documentation**: Modern API patterns
* **The Zen of Python**: `import this`

---

## Maintenance

This document should be updated when:
* New patterns are established
* Standards are refined
* New features require new conventions
* Team feedback suggests improvements
* New notebooks are added to the audit suite
* Databricks releases new features or deprecates old ones

## 19. Parallel Processing with ThreadPoolExecutor

### Concurrent API Calls

**Use ThreadPoolExecutor for parallel operations**:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Configuration
MAX_WORKERS = 10  # Adjust based on compute type

def fetch_permissions(resource_id):
    """Fetch permissions for a single resource"""
    try:
        response = api_call_with_retry(lambda: wc.permissions.get(resource_id))
        return {'resource_id': resource_id, 'permissions': response}
    except Exception as e:
        log(f"Error fetching permissions for {resource_id}: {e}")
        return None

# Parallel execution
results = []
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    # Submit all tasks
    future_to_resource = {
        executor.submit(fetch_permissions, resource_id): resource_id 
        for resource_id in resource_ids
    }
    
    # Collect results as they complete
    for future in as_completed(future_to_resource):
        result = future.result()
        if result:
            results.append(result)
```

### Progress Tracking for Parallel Operations

**Track progress with counters**:

```python
results = []
completed = 0
total = len(resource_ids)

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    future_to_resource = {executor.submit(fetch_data, rid): rid for rid in resource_ids}
    
    for future in as_completed(future_to_resource):
        completed += 1
        
        # Log progress every 10% or every 20 items
        if completed % max(1, total // 10) == 0 or completed % 20 == 0:
            progress_pct = (completed / total) * 100
            log(f"  Progress: {completed}/{total} ({progress_pct:.1f}%)")
        
        result = future.result()
        if result:
            results.append(result)

log(f"✓ Completed {completed}/{total} operations")
```

### Error Handling in Parallel Operations

**Handle exceptions gracefully**:

```python
from concurrent.futures import TimeoutError as FuturesTimeoutError

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(process_item, item) for item in items]
    
    try:
        # Futures yielded by as_completed() have already finished, so the
        # timeout belongs on as_completed() (it covers the whole batch)
        for future in as_completed(futures, timeout=30):
            try:
                result = future.result()
                if result:
                    results.append(result)
            except Exception as e:
                log(f"⚠️  Error in parallel operation: {str(e)}")
                execution_stats['api_failures'] += 1
    except FuturesTimeoutError:
        log("⚠️  Operation timed out")
        execution_stats['api_failures'] += 1
```

### When to Use Parallel Processing

**Use for**:
* Multiple independent API calls
* Fetching permissions for many resources
* Processing independent data chunks
* I/O-bound operations

**Avoid for**:
* CPU-bound operations (use Spark instead)
* Operations with shared state
* Very fast operations (overhead not worth it)
* When order matters
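
If the work is I/O-bound but results must line up with their inputs, `Executor.map` may be worth considering before giving up on threads, since it returns results in input order regardless of completion order. A small sketch, with `slow_square` as a stand-in for an API call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(n: int) -> int:
    """Stand-in for an I/O-bound call; later inputs finish sooner."""
    time.sleep(0.05 * (5 - n))
    return n * n


with ThreadPoolExecutor(max_workers=5) as executor:
    # map() yields results in the order of the inputs, not completion order
    ordered = list(executor.map(slow_square, range(5)))

print(ordered)
```

Unlike the `as_completed` pattern, an exception raised by one task surfaces when its result is reached during iteration, so per-item error handling needs to live inside the worker function.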

## 20. Import Organization

### Import Order

**Standard library → Third-party → Local**:

```python
# Standard library imports
import os
import re
import time
import json
from datetime import datetime, timedelta
from concurrent.futures import ThreadPoolExecutor, as_completed

# Third-party imports
import pandas as pd
import pytz
import requests
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment

# Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound, PermissionDenied

# PySpark imports
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Local/project imports (if any)
# from my_module import my_function
```

### Import Aliases

**Use standard aliases**:

```python
# Good: Standard aliases
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Avoid: Non-standard aliases
import pandas as p  # Not standard
from pyspark.sql import functions as func  # Too verbose
```

### Conditional Imports

**Import only when needed**:

```python
# Import at top for always-used packages
import pandas as pd

# Import conditionally for optional features
if ENABLE_EXCEL_EXPORT:
    from openpyxl import load_workbook
    from openpyxl.styles import Font, PatternFill

if ENABLE_VISUALIZATION:
    import matplotlib.pyplot as plt
```

### Package Installation

**Use %pip for notebook package installation**:

```python
# Good: Use %pip magic command
%pip install openpyxl --quiet

# Avoid: Using !pip
# !pip install openpyxl  # Less reliable in notebooks
```

### Import Error Handling

**Handle missing optional dependencies**:

```python
try:
    import psutil
    PSUTIL_AVAILABLE = True
except ImportError:
    PSUTIL_AVAILABLE = False
    log("⚠️  psutil not available, memory monitoring disabled")

# Use conditional logic
if PSUTIL_AVAILABLE:
    memory_usage = get_memory_usage()
else:
    memory_usage = 0
```

## 21. Security and Secrets Management

### API Token Handling

**Never hardcode tokens**:

```python
# Good: Get from notebook context
def get_api_client():
    """Get Databricks API client configuration"""
    try:
        ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
        api_url = ctx.apiUrl().get()
        api_token = ctx.apiToken().get()
        return api_url, api_token
    except Exception as e:
        log(f"Error getting API client: {e}")
        return None, None

# Avoid: Hardcoded tokens
# api_token = "dapi123456789"  # NEVER DO THIS
```

### Secrets Management

**Use Databricks Secrets for sensitive data**:

```python
# Good: Use secrets
try:
    api_key = dbutils.secrets.get(scope="my-scope", key="api-key")
    db_password = dbutils.secrets.get(scope="my-scope", key="db-password")
except Exception as e:
    log(f"Error retrieving secrets: {e}")
    raise

# Avoid: Hardcoded credentials
# api_key = "abc123"  # NEVER
# db_password = "password123"  # NEVER
```

### Sensitive Data in Logs

**Mask sensitive information**:

```python
# Good: Mask tokens and passwords (show only the last few characters)
log(f"API token: ...{api_token[-4:]}")
log(f"Using user: {username}")

# Avoid: Logging full credentials
# log(f"API token: {api_token}")  # Exposes full token
# log(f"Password: {password}")  # Never log passwords
```
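
Masking is easier to apply consistently when it lives in one function. The sketch below is a stricter variant that reveals only the last few characters and nothing at all for short values; `mask_secret` is a hypothetical helper name.

```python
def mask_secret(value, visible: int = 4) -> str:
    """Return a log-safe form of a secret, showing at most `visible` characters."""
    if not value:
        return "<empty>"
    if len(value) <= visible * 2:
        # Too short to reveal any part safely
        return "********"
    return "********" + value[-visible:]


print(mask_secret("dapi1234567890abcd"))
print(mask_secret("short"))
```

A fixed-width mask also avoids leaking the length of the secret.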

### Secure File Permissions

**Set appropriate permissions on exported files**:

```python
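# For sensitive exports (apply to files only; 0o600 on a directory removes
# the execute bit needed to access its contents)
if os.path.isfile(export_path):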
    os.chmod(export_path, 0o600)  # Owner read/write only
    log(f"✓ Set secure permissions on {export_path}")
```

### Data Sanitization

**Sanitize user input and file paths**:

```python
import re

def sanitize_filename(filename):
    """Remove unsafe characters from filename"""
    # Remove or replace unsafe characters
    safe_name = re.sub(r'[^a-zA-Z0-9_.-]', '_', filename)
    return safe_name

# Usage
user_input = dbutils.widgets.get("filename")
safe_filename = sanitize_filename(user_input)
export_path = f"{EXPORT_PATH}/{safe_filename}.xlsx"
```
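
The allow-list above still passes values such as `..`, an empty string, or a very long name. A stricter variant might look like the sketch below; `sanitize_filename_strict`, its default name, and the 100-character limit are assumptions chosen for illustration.

```python
import re

def sanitize_filename_strict(filename, default="export", max_length=100):
    """Hypothetical stricter variant of sanitize_filename()."""
    # Replace anything outside the allow-list with an underscore
    safe_name = re.sub(r"[^a-zA-Z0-9_.-]", "_", filename or "")
    # Truncate so a long widget value cannot exceed file system name limits
    safe_name = safe_name[:max_length]
    # Strip leading/trailing dots so the result is never ".", "..", or a hidden file
    safe_name = safe_name.strip(".")
    # Fall back to a default when nothing usable remains
    return safe_name or default
```

The fallback keeps the export path valid when the widget is left blank.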

## 22. Code Comments and Documentation

### When to Comment

**Comment the WHY, not the WHAT**:

```python
# Good: Explains reasoning
# Convert timestamps from milliseconds to datetime if present
# Databricks APIs return timestamps in epoch milliseconds
if created_at:
    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)

# Avoid: States the obvious
# Convert created_at to timestamp
if created_at:
    created_timestamp = datetime.fromtimestamp(created_at / 1000, tz=eastern)
```

### Complex Logic Comments

**Explain complex algorithms**:

```python
# Calculate risk score using weighted factors:
# - High-risk items: +20 points each (security vulnerabilities)
# - Medium-risk items: +10 points each (best practice violations)
# - Low-risk items: +5 points each (optimization opportunities)
# Score is capped at 100 to maintain consistent scale
risk_score = 0
for factor in risk_factors:
    if factor['severity'] == 'HIGH':
        risk_score += 20
    elif factor['severity'] == 'MEDIUM':
        risk_score += 10
    else:
        risk_score += 5
risk_score = min(risk_score, 100)
```
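
The weights described in the comment can also live in a lookup table, so each number appears once and the comment cannot drift from the code. This is a sketch of that alternative, equivalent to the loop above; `SEVERITY_WEIGHTS`, `DEFAULT_WEIGHT`, `MAX_RISK_SCORE`, and `calculate_risk_score` are illustrative names.

```python
# Points added per risk factor, keyed by severity
SEVERITY_WEIGHTS = {"HIGH": 20, "MEDIUM": 10}
DEFAULT_WEIGHT = 5    # Low-risk items and any unrecognized severity
MAX_RISK_SCORE = 100  # Cap keeps the score on a consistent 0-100 scale

def calculate_risk_score(risk_factors):
    """Sum the weight of each factor and cap the total at MAX_RISK_SCORE."""
    total = sum(
        SEVERITY_WEIGHTS.get(factor["severity"], DEFAULT_WEIGHT)
        for factor in risk_factors
    )
    return min(total, MAX_RISK_SCORE)
```

For example, one HIGH, one MEDIUM, and one LOW factor add up to 20 + 10 + 5 = 35.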

### TODO Comments

**Use TODO for future improvements**:

```python
# TODO: Add support for custom date ranges
# TODO: Implement incremental refresh for large datasets
# TODO(username): Review performance optimization for serverless

# FIXME: Handle edge case where user has no groups
# HACK: Temporary workaround for API pagination bug
```

### Deprecation Warnings

**Document deprecated code**:

```python
import warnings

def old_function():
    """
    DEPRECATED: Use new_function() instead.
    This function will be removed in version 2.0.
    """
    warnings.warn(
        "old_function is deprecated, use new_function instead",
        DeprecationWarning,
        stacklevel=2
    )
    # ... implementation ...
```
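
When several functions are deprecated at once, the warning call can be factored into a decorator. The following is a minimal sketch; the `deprecated` decorator is a hypothetical helper, not something the standard library or the notebooks provide under this name.

```python
import functools
import warnings

def deprecated(replacement):
    """Mark a function as deprecated in favor of `replacement` (a function name)."""
    def decorator(func):
        @functools.wraps(func)  # Preserve the original name and docstring
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated, use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # Attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("new_function")
def old_function():
    return "result"
```

The decorated function behaves as before and emits one `DeprecationWarning` per call.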

### Section Dividers

**Use consistent section dividers**:

```python
# ============================================================================
# MAJOR SECTION: Brief description
# ============================================================================

# --- Subsection ---

# Minor grouping (no divider needed, just comment)
```

## 23. Data Quality and Validation

### Null Handling

**Check for nulls before processing**:

```python
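from pyspark.sql import functions as F

# Count null values per column in a single pass
null_counts = df.select([
    F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in df.columns
]).first()  # One Spark job; the result is a single Row

# Keep only columns with nulls (sum() returns None for an empty DataFrame)
columns_with_nulls = {c: null_counts[c] for c in df.columns if null_counts[c]}

if columns_with_nulls:
    log("⚠️  Null values detected:")
    for column_name, null_count in columns_with_nulls.items():
        log(f"  - {column_name}: {null_count} nulls")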
```

### Data Type Validation

**Validate expected data types**:

```python
# Validate schema matches expectations
expected_schema = {
    'asset_id': 'string',
    'asset_name': 'string',
    'created_timestamp': 'timestamp',
    'owner': 'string'
}

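for field in df.schema.fields:
    expected_type = expected_schema.get(field.name)
    # simpleString() yields names such as 'string' and 'timestamp';
    # str(field.dataType) yields a class-style name and would never match
    actual_type = field.dataType.simpleString()
    if expected_type and actual_type != expected_type:
        log(f"⚠️  Schema mismatch: {field.name} is {actual_type}, expected {expected_type}")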
```
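
The loop above only inspects columns that exist in the DataFrame, so an expected column that is missing goes unreported. A sketch of a comparison that covers both cases follows. It works on plain `{column_name: type_string}` dictionaries, and `find_schema_mismatches` is a hypothetical helper.

```python
def find_schema_mismatches(actual_types, expected_types):
    """Compare two {column_name: type_string} dicts and describe every difference."""
    problems = []
    for name, expected in expected_types.items():
        actual = actual_types.get(name)
        if actual is None:
            # The expected column is not present at all
            problems.append(f"{name}: missing (expected {expected})")
        elif actual != expected:
            problems.append(f"{name}: is {actual}, expected {expected}")
    return problems
```

In a notebook, `dict(df.dtypes)` produces the `actual_types` argument, and each returned message can be passed to `log()`.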

### Empty DataFrame Checks

**Always check before processing**:

```python
if df is None:
    log("⚠️  DataFrame is None, skipping processing")
else:
    row_count = df.count()  # Count once; every count() call runs a Spark job
    if row_count == 0:
        log("⚠️  DataFrame is empty (0 rows), skipping processing")
    else:
        # Process DataFrame
        log(f"✓ Processing {row_count} rows")
```

### Data Range Validation

**Validate data is within expected ranges**:

```python
# Check for reasonable date ranges (one aggregation instead of two Spark jobs)
min_date, max_date = df.agg(
    F.min('created_timestamp'),
    F.max('created_timestamp')
).first()

if min_date and max_date:
    date_range_days = (max_date - min_date).days
    if date_range_days > 3650:  # 10 years
        log(f"⚠️  Unusual date range: {date_range_days} days")
    else:
        log(f"✓ Date range: {date_range_days} days")
```

### Duplicate Detection

**Check for and handle duplicates**:

```python
# Check for duplicates
initial_count = df.count()
df_deduped = df.dropDuplicates(['asset_type', 'asset_id'])
final_count = df_deduped.count()

if initial_count > final_count:
    duplicates = initial_count - final_count
    log(f"⚠️  Removed {duplicates} duplicate entries ({duplicates/initial_count*100:.1f}%)")
else:
    log("✓ No duplicates found")

df = df_deduped
```

## 24. String Formatting and Output

### F-String Best Practices

**Use f-strings for all formatting**:

```python
# Good: F-strings (Python 3.6+)
log(f"Processed {count} items in {elapsed:.2f} seconds")
log(f"Success rate: {success_rate:.1f}%")
log(f"Path: {catalog}.{schema}.{table}")

# Avoid: Old-style formatting
# log("Processed %d items" % count)  # Old
# log("Processed {} items".format(count))  # Verbose
# log("Processed " + str(count) + " items")  # Concatenation
```

### Number Formatting

**Consistent number formatting**:

```python
# Integers: No decimal places
log(f"Count: {count:,}")  # 1,234,567

# Floats: 1-2 decimal places
log(f"Percentage: {pct:.1f}%")  # 85.3%
log(f"Time: {elapsed:.2f} seconds")  # 12.45 seconds

# Large numbers: Use K, M, B suffixes
if count >= 1_000_000_000:
    log(f"Count: {count/1_000_000_000:.1f}B")
elif count >= 1_000_000:
    log(f"Count: {count/1_000_000:.1f}M")
elif count >= 1_000:
    log(f"Count: {count/1_000:.1f}K")
else:
    log(f"Count: {count}")
```
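
The suffix rule can be wrapped in a helper so that every notebook formats counts the same way. This is a minimal sketch, assuming non-negative inputs; `format_count` is a hypothetical name.

```python
def format_count(count):
    """Format a non-negative number with a K, M, or B suffix."""
    thresholds = ((1_000_000_000, "B"), (1_000_000, "M"), (1_000, "K"))
    # Check the largest threshold first so 2,000,000 becomes "2.0M", not "2000.0K"
    for threshold, suffix in thresholds:
        if count >= threshold:
            return f"{count / threshold:.1f}{suffix}"
    # Small values keep their exact digits, with thousands separators
    return f"{count:,}"
```

A call such as `log(f"Count: {format_count(count)}")` then replaces the inline `if`/`elif` chain.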

### Multi-line Strings

**Use triple quotes for SQL and long text**:

```python
# Good: Triple quotes for SQL
query = """
    SELECT 
        asset_type,
        COUNT(*) as count
    FROM assets
    WHERE modified_timestamp > current_date() - INTERVAL 30 DAYS
    GROUP BY asset_type
    ORDER BY count DESC
"""

# Good: Triple-quoted f-string for long messages
message = f"""
Workspace scan completed successfully.
Found {count} assets with default naming.
Results exported to {path}.
"""
```

### Path Formatting

**Consistent path construction**:

```python
import os

# Use f-strings for dotted table names
full_path = f"{catalog}.{schema}.{table}"

# Use os.path.join for file system paths (handles separators consistently)
file_path = os.path.join(EXPORT_PATH, f"{filename}_{timestamp}.xlsx")
```

### Unicode and Emojis

**Use emojis consistently for visual clarity**:

```python
# Status indicators
log("✓ Success")  # Checkmark
log("✗ Failure")  # X mark
log("⚠️  Warning")  # Warning sign
log("ℹ️  Info")  # Information

# Progress indicators
log("⏱️  Timing information")
log("🚀 Quick mode enabled")
log("🔍 Full mode enabled")
log("🤖 Job mode detected")

# Category indicators
log("🔒 Security")
log("📋 Governance")
log("💡 Recommendations")
log("📊 Statistics")
```

## 25. Spark DataFrame Best Practices

### DataFrame Creation

**Always define schema explicitly**:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Good: Explicit schema
schema = StructType([
    StructField("id", StringType(), False),  # Not nullable
    StructField("name", StringType(), True),  # Nullable
    StructField("count", IntegerType(), True),
    StructField("timestamp", TimestampType(), True)
])

df = spark.createDataFrame(data, schema)

# Avoid: Schema inference (slower, less reliable)
# df = spark.createDataFrame(data)  # Schema inferred
```

### Column Operations

**Use F.col() for column references**:

```python
from pyspark.sql import functions as F

# Good: Use F.col()
df = df.filter(F.col('status') == 'active')
df = df.withColumn('age_days', F.datediff(F.current_date(), F.col('created_date')))

# Acceptable: String column names in simple cases
df = df.select('id', 'name', 'status')
df = df.groupBy('category').count()
```

### Avoid Collect on Large DataFrames

**Use display() or limit() instead**:

```python
# Good: Display limited results
display(df.limit(100))

# Good: Aggregate before collecting
summary = df.groupBy('category').count().collect()

# Avoid: Collecting large DataFrames
# all_data = df.collect()  # Can cause OOM on large data
```

### Column Naming

**Use snake_case for column names**:

```python
# Good: snake_case
df = df.withColumnRenamed('AssetType', 'asset_type')
df = df.withColumnRenamed('CreatedAt', 'created_at')

# Avoid: camelCase or PascalCase in Spark
# df = df.withColumn('assetType', ...)  # Inconsistent
```
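
Renaming columns one at a time does not scale to wide DataFrames. A sketch of a pure-Python converter that can be applied to every column name follows; `to_snake_case` is a hypothetical helper, and the regular expressions cover common camelCase, PascalCase, and acronym patterns rather than every possible name.

```python
import re

def to_snake_case(name):
    """Convert a camelCase or PascalCase column name to snake_case."""
    # Split an acronym from a following capitalized word: "HTTPResponse" -> "HTTP_Response"
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Split a lowercase letter or digit from a following capital: "assetType" -> "asset_Type"
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Replace runs of spaces or hyphens with one underscore, then lowercase
    return re.sub(r"[\s\-]+", "_", name).lower()
```

All columns can then be renamed in one step with `df = df.toDF(*[to_snake_case(c) for c in df.columns])`.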

### Chaining Operations

**Chain operations for readability**:

```python
# Good: Chained operations
df_result = (df
    .filter(F.col('status') == 'active')
    .withColumn('age_days', F.datediff(F.current_date(), F.col('created_date')))
    .groupBy('category')
    .agg(
        F.count('*').alias('count'),
        F.avg('age_days').alias('avg_age')
    )
    .orderBy(F.desc('count'))
)

# Avoid: Multiple intermediate variables
# df1 = df.filter(F.col('status') == 'active')
# df2 = df1.withColumn('age_days', ...)
# df3 = df2.groupBy('category')
# df_result = df3.agg(...)
```

### Null-Safe Operations

**Handle nulls explicitly**:

```python
# Use coalesce for null defaults
df = df.withColumn('owner', F.coalesce(F.col('owner'), F.lit('Unknown')))

# Use null-safe equality
df = df.filter(F.col('status').eqNullSafe('active'))

# Filter out nulls explicitly
df = df.filter(F.col('required_field').isNotNull())
```