# 10 — Bronze Load Worker

This notebook defines the worker function that loads a **single table** from parquet into a Bronze Delta table.

## Architecture: Bronze History Pattern

Bronze keeps load history so that Silver can apply CDC logic. The write strategy depends on the load mode:
- **Snapshot/Window tables**: Overwrite the entire table on each run
- **Incremental tables**: Append each run, partitioned by the run timestamp (`_bronze_load_ts`)
- From the appended history, Silver can reconstruct current state and detect deletes

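How Silver might rebuild current state from the appended history can be illustrated in plain Python. This is a minimal sketch, not the project's Silver logic: the business key `id`, the sample rows, and the helper name `latest_state` are hypothetical. Delete detection is not shown, since it depends on how the source signals removed rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def latest_state(history: List[Row], key: str = "id") -> Dict[Any, Row]:
    """Keep, per key, the row with the highest _bronze_load_ts."""
    current: Dict[Any, Row] = {}
    for row in history:
        seen = current.get(row[key])
        if seen is None or row["_bronze_load_ts"] > seen["_bronze_load_ts"]:
            current[row[key]] = row
    return current


# Hypothetical history: two loads, where id=1 changed in the second load.
# ISO-8601 strings in the same format sort chronologically.
history = [
    {"id": 1, "name": "a", "_bronze_load_ts": "2024-01-01T00:00:00"},
    {"id": 2, "name": "b", "_bronze_load_ts": "2024-01-01T00:00:00"},
    {"id": 1, "name": "a2", "_bronze_load_ts": "2024-01-02T00:00:00"},
]
state = latest_state(history)
```

On a Spark DataFrame the same idea is usually expressed with a window function over the key, ordered by `_bronze_load_ts` descending.
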
## Key Features
- Does **not** write to the log table; it returns a metrics dict instead
- Reads parquet from auto-detected path (Fabric or Local)
- Adds metadata columns: `_bronze_load_ts`, `_bronze_filename`
- Handles missing files, empty data, corrupt Delta tables
- Returns comprehensive metrics for logging

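Because the worker returns its metrics instead of writing to the log table, the caller can log all results in one batch. The sketch below shows what such a return value could look like when the parquet directory is missing. The field names (`status`, `rows_loaded`, `error`) and the helper `check_source` are assumptions for illustration; the actual dict is produced by `modules.bronze_processor` and may differ.

```python
import os
import tempfile
from typing import Any, Dict


def check_source(parquet_dir: str, table_name: str, run_id: str) -> Dict[str, Any]:
    """Return a metrics dict describing whether the source is loadable."""
    metrics: Dict[str, Any] = {
        "run_id": run_id,
        "table_name": table_name,
        "status": "ready",      # hypothetical status value
        "rows_loaded": 0,
        "error": None,
    }
    if not os.path.isdir(parquet_dir):
        # Assumed design: a missing input is reported in the metrics
        # rather than raised, so the caller decides how to proceed.
        metrics["status"] = "skipped"
        metrics["error"] = f"parquet directory not found: {parquet_dir}"
    return metrics


with tempfile.TemporaryDirectory() as tmp:
    present = check_source(tmp, "customers", "run_demo")
    missing = check_source(os.path.join(tmp, "nope"), "orders", "run_demo")
```

Both calls return a dict, so a results list built from them always has one entry per table.
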
## Load Modes
- **snapshot**: Complete table refresh (overwrite)
- **incremental**: Delta append with run_ts (CDC history)
- **window**: Overwrite with partitioning support

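The load modes above reduce to two Spark save modes. A minimal sketch of that mapping follows; the names `WRITE_MODE_BY_LOAD_MODE` and `resolve_write_mode` are hypothetical, and the real dispatch inside `process_bronze_table()` may be organised differently.

```python
from typing import Dict

# Hypothetical lookup derived from the load mode list above.
WRITE_MODE_BY_LOAD_MODE: Dict[str, str] = {
    "snapshot": "overwrite",
    "incremental": "append",
    "window": "overwrite",
}


def resolve_write_mode(load_mode: str) -> str:
    """Translate a load mode into a Spark DataFrameWriter save mode."""
    try:
        return WRITE_MODE_BY_LOAD_MODE[load_mode.lower()]
    except KeyError:
        raise ValueError(f"Unsupported load mode: {load_mode!r}") from None


# With a DataFrame `df` and a target path, the result could be used as:
# df.write.format("delta").mode(resolve_write_mode("incremental")).save(target_path)
```
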
This notebook is imported via `%run` from the master orchestrator.

In [None]:
# Parameters (set by orchestrator, not by Papermill in this case)
# These are here for documentation and can be overridden when %run is called
RUN_ID = None  # Will be set by orchestrator
#DEBUG = False  # Enable debug output

## [1] Imports and Path Detection

In [None]:
from pyspark.sql import functions as F
from pyspark.sql.functions import (
    lit, input_file_name, col, year, month
)
from datetime import datetime, timezone
from typing import Dict, Any
from uuid import uuid4
import os
from modules.path_utils import build_parquet_dir, get_base_path

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

logger.info("✓ Imports loaded")


In [None]:
# Auto-detect base path (Fabric vs Cluster)
# The base path for Files/Tables is taken centrally from modules.path_utils

try:
    # If 02_utils_config has already run, BASE_PATH is already in the globals
    BASE_PATH  # type: ignore[name-defined]
    logger.info(f"✓ Base path (bronze worker): {BASE_PATH}")
    env_type = (
        "Fabric" if "/lakehouse/default/Files" in BASE_PATH
        else "Custom Cluster" if "/data/lakehouse/" in BASE_PATH
        else "Local/Relative"
    )
    logger.info(f"✓ Environment (bronze worker): {env_type}")
except NameError:
    # Fallback so that 10_bronze_load can also be run stand-alone
    BASE_PATH = get_base_path()
    env_type = (
        "Fabric" if "/lakehouse/default/Files" in BASE_PATH
        else "Custom Cluster" if "/data/lakehouse/" in BASE_PATH
        else "Local/Relative"
    )
    logger.info(f"✓ Base path (bronze worker – fallback): {BASE_PATH}")
    logger.info(f"✓ Environment (bronze worker – fallback): {env_type}")

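The two branches above repeat the same classification. A small helper, sketched here under the hypothetical name `classify_environment`, would keep the labels in one place. The marker substrings are taken from the cell above; the example paths are invented.

```python
def classify_environment(base_path: str) -> str:
    """Label the runtime environment from markers in the base path."""
    if "/lakehouse/default/Files" in base_path:
        return "Fabric"
    if "/data/lakehouse/" in base_path:
        return "Custom Cluster"
    return "Local/Relative"


# Invented example paths, one per label.
fabric = classify_environment("/lakehouse/default/Files")
cluster = classify_environment("/data/lakehouse/files")
local = classify_environment("./data")
```

The order of the checks matters only if a path could contain both markers; the Fabric marker is tested first, as in the original expression.
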

## [2] Helper Functions

In [None]:
# Import Bronze processing function from module
from modules.bronze_processor import process_bronze_table

logger.info("✓ Bronze processor function imported from modules.bronze_processor")
logger.info("✓ Helper functions (error detection, Delta utils) included in module")

## [3] Core Worker Function

This is the main function that processes a single table.

In [None]:
# Bronze processor function is now imported from modules.bronze_processor
# 
# The process_bronze_table() function handles:
# - Reading parquet files
# - Adding metadata columns (_bronze_load_ts, _bronze_filename)
# - Writing to Delta tables (snapshot/incremental/window modes)
# - Error recovery for corrupt Delta tables
# - Partitioning support
#
# Usage in orchestrator:
# result = process_bronze_table(
#     spark=spark,
#     table_def=table,
#     source_name=source,
#     run_id=RUN_ID,
#     run_ts=run_ts,
#     run_date=run_date,
#     base_files=base_files,
#     debug=True
# )

logger.info("✓ Bronze worker function ready (imported from module)")

## [4] Function Ready

The `process_bronze_table()` function is now available for use by the orchestrator.

**Usage pattern:**

```python
# In orchestrator notebook:
%run "10_bronze_load"

# Set RUN_ID
RUN_ID = f"run_{run_ts}"

# Process tables
results = []
for table in tables:
    result = process_bronze_table(
        spark=spark,
        table_def=table,
        source_name=source,
        run_id=RUN_ID,
        run_ts=run_ts,
        run_date=run_date,
        base_files=base_files,
        debug=True
    )
    results.append(result)

# Log results in batch
bronze_summary = {...}  # build from results when ready
run_log_id = log_summary(bronze_summary, layer="bronze")
log_batch(results, layer="bronze", run_log_id=run_log_id)
```

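The usage pattern assumes the orchestrator has already defined `run_ts` and derived `RUN_ID` from it. One way to produce these values from a single UTC timestamp is sketched below. The helper name `make_run_context` and both timestamp formats are assumptions; only the `run_{run_ts}` pattern comes from the example above.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def make_run_context(now: Optional[datetime] = None) -> Dict[str, str]:
    """Derive run_ts, run_date and a run id from a single UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    run_ts = now.strftime("%Y%m%dT%H%M%SZ")  # assumed format
    return {
        "run_ts": run_ts,
        "run_date": now.strftime("%Y-%m-%d"),  # assumed format
        "run_id": f"run_{run_ts}",  # same pattern as RUN_ID above
    }


# A fixed timestamp makes the result reproducible.
ctx = make_run_context(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
```
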
In [None]:
logger.info("\n" + "=" * 80)
logger.info("BRONZE WORKER READY")
logger.info("=" * 80)
logger.info(f"Base path: {BASE_PATH}")
logger.info(f"Environment: {env_type}")
logger.info("\nFunction available: process_bronze_table(table_def, source_name, run_ts, ...)")
logger.info("\n⚠️  Remember to set RUN_ID before calling process_bronze_table()")
logger.info("✓ Bronze worker notebook loaded successfully")