# 20 — Silver CDC Merge Worker

This notebook performs a Change Data Capture (CDC) merge from the Bronze layer to the Silver layer.

## CDC Architecture: Bronze History Pattern

### How DELETE Detection Works

Bronze incremental tables are **append-only and partitioned by the run timestamp** (`_bronze_load_ts`), so every load adds a new partition:
```
bronze.Dim_Relatie/
  ├── _bronze_load_ts=20251101T060000/  -- 1000 rows (initial)
  ├── _bronze_load_ts=20251108T060000/  -- 50 rows (delta)
  └── _bronze_load_ts=20251115T060000/  -- 100 rows (delta)
```

The Silver CDC merge then works in three steps:
1. **Reconstruct current Bronze state** (latest row per business key)
2. **Calculate row hash** for change detection
3. **Compare with Silver**:
   - Keys in Bronze but not Silver → **INSERT**
   - Keys in both with different hash → **UPDATE**
   - Keys in Silver but not Bronze → **DELETE** (marked as deleted)
   - Keys in both with same hash → **UNCHANGED**
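
The three steps above can be illustrated with a minimal plain-Python sketch. It is not the code in `modules.cdc_utils` or `modules.silver_processor`, which operate on Spark DataFrames; the function names `latest_per_key` and `classify`, the `id` and `name` columns, and the use of the `name` value as a stand-in for the row hash are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def latest_per_key(rows: Iterable[Row], key: str, ts: str) -> Dict[object, Row]:
    """Step 1: keep only the most recent row per business key."""
    current: Dict[object, Row] = {}
    for row in rows:
        seen = current.get(row[key])
        if seen is None or row[ts] > seen[ts]:
            current[row[key]] = row
    return current


def classify(bronze: Dict[object, str], silver: Dict[object, str]) -> Dict[str, List[object]]:
    """Step 3: compare key -> row-hash maps and bucket every key."""
    return {
        "insert": sorted(k for k in bronze if k not in silver),
        "update": sorted(k for k in bronze if k in silver and bronze[k] != silver[k]),
        "delete": sorted(k for k in silver if k not in bronze),
        "unchanged": sorted(k for k in bronze if k in silver and bronze[k] == silver[k]),
    }


# Hypothetical Bronze history: key 1 was loaded twice, key 2 once.
bronze_rows = [
    {"id": 1, "name": "A", "_bronze_load_ts": "20251101T060000"},
    {"id": 1, "name": "B", "_bronze_load_ts": "20251108T060000"},
    {"id": 2, "name": "C", "_bronze_load_ts": "20251101T060000"},
]
state = latest_per_key(bronze_rows, key="id", ts="_bronze_load_ts")

# Step 2 stand-in: the "name" value plays the role of the row hash here.
bronze_hashes = {k: str(r["name"]) for k, r in state.items()}
silver_hashes = {1: "A", 3: "Z"}

result = classify(bronze_hashes, silver_hashes)
print(result)
```

Timestamps in the `yyyymmddThhmmss` form shown in the partition names sort correctly as plain strings, which is why the sketch compares them directly.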

## Key Features

- Full CDC: INSERT, UPDATE, DELETE, UNCHANGED
- Soft deletes (is_deleted flag, not physical deletion)
- Row hash calculation using hash_utils
- Delta MERGE operations (atomic)
- Returns comprehensive metrics for logging
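
To make the row-hash feature concrete, here is a small sketch of one common way to build such a hash. The actual `hash_utils` implementation is not shown in this notebook, so the function name `row_hash`, the choice of SHA-256, the separator, and the NULL sentinel are assumptions.

```python
import hashlib
from typing import Mapping, Sequence


def row_hash(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Hash the business columns of one row into a hex digest.

    Columns are sorted so the hash does not depend on column order, and
    None gets its own sentinel so it never collides with an empty string.
    """
    parts = []
    for name in sorted(columns):
        value = row.get(name)
        parts.append("\x00NULL" if value is None else str(value))
    joined = "\x1f".join(parts)  # unit separator keeps "ab"+"c" distinct from "a"+"bc"
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical rows: same values in a different column order, then a changed value.
a = row_hash({"name": "Jansen", "city": "Utrecht"}, ["name", "city"])
b = row_hash({"city": "Utrecht", "name": "Jansen"}, ["city", "name"])
c = row_hash({"name": "Jansen", "city": "Zwolle"}, ["name", "city"])
print(a == b, a == c)
```

Only business columns should feed the hash; including load metadata such as `_bronze_load_ts` would make every reloaded row look like an UPDATE.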

## Load Modes

- **snapshot/window**: Simple overwrite (no CDC needed)
- **incremental**: Full CDC with delete detection
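
The mapping from load mode to Silver strategy could be expressed roughly as below. This is a sketch only; the function name `choose_silver_strategy` and the returned labels are hypothetical, and the real dispatch lives inside `modules.silver_processor`.

```python
def choose_silver_strategy(load_mode: str) -> str:
    """Map a table's load mode to the Silver write strategy."""
    mode = load_mode.strip().lower()
    if mode in ("snapshot", "window"):
        return "overwrite"   # full replace, no CDC comparison needed
    if mode == "incremental":
        return "cdc_merge"   # full CDC with delete detection
    raise ValueError(f"Unknown load mode: {load_mode!r}")


print(choose_silver_strategy("snapshot"))
print(choose_silver_strategy("incremental"))
```

Raising on an unknown mode, rather than falling back to a default, keeps a misconfigured table from being silently overwritten.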

This notebook is imported via `%run` from the master orchestrator.

In [None]:
# Parameters (set by orchestrator)
RUN_ID = None  # Will be set by orchestrator
DEBUG = False  # Enable debug output

## [1] Imports

In [None]:
# Module: modules.fabric_bootstrap
# --------------------------------
# This cell enables a flexible module loading strategy:
#
# PRODUCTION (default): The `Files/code` directory is empty. This function does nothing,
# and Python imports all modules from the stable, versioned Wheel in the Environment.
#
# DEVELOPMENT / HOTFIX: To bypass the 15-20 minute Fabric publish cycle for urgent fixes,
# upload individual .py files to `Files/code` in the Lakehouse. This function prepends
# that path to sys.path, so Python finds the override files first. All other modules
# continue to load from the Wheel - only the uploaded files are replaced.
#
# Usage: Keep `Files/code` empty for production stability. Use it only for rapid
# iteration during development or emergency hotfixes.

from modules.fabric_bootstrap import ensure_module_path
ensure_module_path()  # Now Python can find the rest

In [None]:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import (
    col, lit, current_timestamp, row_number, coalesce
)
from pyspark.sql.window import Window
from delta.tables import DeltaTable
from datetime import datetime
from typing import Dict, Any, List, Optional
from uuid import uuid4

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


# Import hash utilities
import sys
sys.path.append('/home/sparkadmin/source/repos/dwh_spark_processing/modules')  # Adjust if needed

try:
    from hash_utils import add_row_hash, compare_hash_differences
    logger.info("✓ hash_utils imported successfully")
except ImportError as e:
    # Log at WARNING level so a failed import is not lost among INFO messages
    logger.warning(f"⚠️  Could not import hash_utils: {e}")
    logger.warning("   Make sure modules/hash_utils.py is in sys.path")

logger.info("✓ Imports loaded")

## [2] Helper Functions

In [None]:
# Import CDC utility functions from modules
from modules.cdc_utils import (
    reconstruct_bronze_current_state,
    get_business_columns
)

logger.info("✓ CDC helper functions imported from modules.cdc_utils")

## [3] Core Silver CDC Merge Function

In [None]:
# Import Silver CDC processor from module
from modules.silver_processor import process_silver_cdc_merge

# Silver CDC merge function handles:
# - Bronze state reconstruction (for incremental tables)
# - Row hash calculation for change detection
# - MERGE operations (INSERT + UPDATE)
# - DELETE detection (soft deletes with is_deleted flag)
# - Silver metadata management (_silver_inserted_ts, _silver_updated_ts, _silver_deleted_ts)
#
# Usage in orchestrator:
# result = process_silver_cdc_merge(
#     spark=spark,
#     table_def=table,
#     source_name=source,
#     run_id=RUN_ID,
#     run_ts=run_ts,
#     debug=True
# )

logger.info("✓ Silver CDC merge function imported from modules.silver_processor")

## [4] Function Ready

The `process_silver_cdc_merge()` function is now available for use by the orchestrator.

**Usage pattern:**

```python
# In orchestrator notebook:
%run "20_silver_cdc_merge"

# Set RUN_ID
RUN_ID = f"run_{run_ts}"

# Process tables that have been successfully loaded to Bronze
silver_results = []
for table in tables_with_business_keys:
    result = process_silver_cdc_merge(
        spark=spark,
        table_def=table,
        source_name=source,
        run_id=RUN_ID,
        run_ts=run_ts,
        debug=True
    )
    silver_results.append(result)

# Log results in batch
log_batch(silver_results, layer="silver")
```

In [None]:
logger.info("\n" + "=" * 80)
logger.info("SILVER CDC MERGE WORKER READY")
logger.info("=" * 80)
logger.info("\nFunction available: process_silver_cdc_merge(table_def, source_name, run_ts, ...)")
logger.info("\nCDC Capabilities:")
logger.info("  ✓ INSERT detection (new keys)")
logger.info("  ✓ UPDATE detection (changed rows via hash)")
logger.info("  ✓ DELETE detection (missing keys from Bronze)")
logger.info("  ✓ UNCHANGED tracking (same hash)")
logger.info("\n⚠️  Remember to:")
logger.info("  1. Set RUN_ID before calling the function")
logger.info("  2. Ensure business_keys are defined in the table config")
logger.info("  3. Run the Bronze load successfully first")
logger.info("\n✓ Silver CDC merge notebook loaded successfully")