# Load bronze: Parquet to Delta Table

**Purpose:** Load parquet files from greenhouse sources into Delta Tables in the bronze layer

**Version:** 1.0  
**Author:** Data Engineering Team  
**Last Updated:** 2025-10-27

## Notebook Overview

This notebook processes parquet files from the greenhouse source layer and loads them into Delta Tables in the bronze layer.

**Key Features:**
- Parses DAG JSON configuration for table metadata
- Supports three load modes: snapshot, window, and incremental
- Automatic logging to `logs.bronze_processing_log` Delta Table
- Retry capability for failed tables
- Preserves Spark session for downstream processing

**Input Parameters:**
- `source`: Source system name (e.g., "anva_concern")
- `run_ts`: Run timestamp identifier (e.g., "20250923T183119772")
- `dag_path`: Path to DAG JSON configuration file
- `retry_tables`: Optional list of specific tables to retry (default: process all)

**Output:**
- Delta Tables in format: `{source}.{table_name}`
- Processing logs in: `logs.bronze_processing_log`
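
As a rough illustration of how these parameters fit together, the sketch below derives the dated parquet folder for a run from `source` and `run_ts`. It assumes the timestamp format shown above (an 8-digit date followed by `T` and the time) and the `{base_files}/{source}/{year}/{month}/{day}/{run_ts}` layout used later in this notebook. The function name `build_parquet_path` and its default arguments are hypothetical.

```python
def build_parquet_path(source: str, run_ts: str,
                       base_files: str = "greenhouse_sources",
                       root: str = "/lakehouse/default/Files") -> str:
    """Derive the dated parquet folder for one run (illustrative helper)."""
    # run_ts is assumed to start with an 8-digit date, e.g. "20250923T183119772"
    date_part = run_ts[:8]
    if len(date_part) != 8 or not date_part.isdigit():
        raise ValueError(f"run_ts does not start with yyyyMMdd: {run_ts!r}")
    year, month, day = date_part[:4], date_part[4:6], date_part[6:8]
    return f"{root}/{base_files}/{source}/{year}/{month}/{day}/{run_ts}"


path = build_parquet_path("anva_concern", "20250923T183119772")
print(path)
```

Validating the date prefix up front means a malformed `run_ts` fails immediately, not later as a missing-folder error.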

## Step 1: Setup Logging Infrastructure

Create or verify the existence of the bronze processing log table. This table tracks all processing activities including success/failure status, row counts, and error messages.

**Log Table Schema:**
- Captures run metadata (source, table, run_ts, load_mode)
- Records processing metrics (rows read/written, duration)
- Stores error details for troubleshooting

In [103]:
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, LongType, BooleanType
from pyspark.sql import functions as F
from datetime import datetime

# Define logging table schema
log_schema = StructType([
    StructField("log_id", StringType(), False),
    StructField("run_ts", StringType(), False),
    StructField("source", StringType(), False),
    StructField("table_name", StringType(), False),
    StructField("load_mode", StringType(), True),
    StructField("status", StringType(), False),  # SUCCESS, FAILED, RUNNING
    StructField("rows_read", LongType(), True),
    StructField("rows_written", LongType(), True),
    StructField("start_time", TimestampType(), False),
    StructField("end_time", TimestampType(), True),
    StructField("duration_seconds", LongType(), True),
    StructField("error_message", StringType(), True),
    StructField("parquet_path", StringType(), True),
    StructField("delta_table", StringType(), True)
])

# Ensure logs schema exists
print("Checking logs schema...")
spark.sql("CREATE SCHEMA IF NOT EXISTS logs")
print("Schema 'logs' verified")

# Check if log table exists
log_table_name = "logs.bronze_processing_log"

try:
    existing_log = spark.table(log_table_name)
    print(f"Log table exists: {log_table_name}")
    print(f"Total log records: {existing_log.count():,}")
except Exception:  # spark.table raises when the table does not exist yet
    print(f"Creating log table: {log_table_name}")
    
    # Create empty DataFrame with schema
    empty_log = spark.createDataFrame([], log_schema) \
        .withColumn("run_date", F.lit(None).cast("date"))

    # Write to Delta
    (empty_log.write
        .format("delta")
        .partitionBy("run_date", "table_name")
        .mode("overwrite")
        .saveAsTable(log_table_name)
    )
    
    print(f"Log table created: {log_table_name}")

print(f"Logging infrastructure ready")


⚙️  Checking logs schema...
✓ Schema 'logs' verified
✓ Log table exists: logs.bronze_processing_log
  Total log records: 1,459

✓ Logging infrastructure ready


## Step 1b: Configure Spark for Ancient Dates

Set Spark configuration to handle dates before 1582-10-15 correctly. This prevents calendar conversion issues when writing to Delta Tables.
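
The cutoff matters because 1582-10-15 is the first day of the Gregorian calendar. Earlier dates can be labelled either in the Julian calendar or in the proleptic Gregorian calendar, and the two labellings drift apart by several days. The sketch below is not part of the notebook. It uses the standard integer Julian Day Number formulas to show the size of that drift; the function names are illustrative.

```python
def jdn_gregorian(year: int, month: int, day: int) -> int:
    """Julian Day Number of a (proleptic) Gregorian calendar date."""
    a = (14 - month) // 12          # 1 for January/February, else 0
    y = year + 4800 - a
    m = month + 12 * a - 3          # March = 0 ... February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045


def jdn_julian(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


# Gregorian 1582-10-15 and Julian 1582-10-05 label the same physical day.
same_day = jdn_gregorian(1582, 10, 15) == jdn_julian(1582, 10, 5)

# The same label "1000-01-01" points at different days in the two calendars.
gap_in_year_1000 = jdn_julian(1000, 1, 1) - jdn_gregorian(1000, 1, 1)
print(same_day, gap_in_year_1000)
```

A writer and a reader that disagree on the calendar will therefore shift old dates by a few days; setting the rebase mode explicitly avoids that ambiguity.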

In [104]:
# Configure Spark to handle ancient dates
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.caseSensitive", "true")
# Disable Delta's retention-duration safety check so VACUUM may use a short
# retention period (bronze needs no time travel)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

debug = True
print("Spark configuration set")


✓ Spark configuration set


## Step 2: Parse Input Parameters and DAG Configuration

Read the DAG JSON file to extract table configurations including load modes, queries, and metadata. The DAG structure contains all necessary information for processing each table.

**Expected DAG Structure:**
- `source`: Source system name
- `run_ts`: Run timestamp
- `run_id`: Run identifier (e.g., "run_20250923T183119772")
- `tables[]`: Array of table configurations with load_mode, name, etc.
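
To make the structure concrete, here is a minimal sketch of a DAG document and the filtering this step applies to it. The JSON content is made up for illustration (including the table name `Dim_Example_Disabled`); only the field names come from this notebook, and a real DAG file may carry additional fields.

```python
import json
from collections import Counter

# Hypothetical DAG content; values are illustrative only.
sample_dag_json = """
{
  "source": "anva_meeus",
  "run_ts": "20250923T183119772",
  "run_id": "run_20250923T183119772",
  "tables": [
    {"name": "Dim_Agent", "load_mode": "snapshot"},
    {"name": "Fact_Polissen", "load_mode": "window", "enabled": true},
    {"name": "Dim_Example_Disabled", "load_mode": "snapshot", "enabled": false}
  ]
}
"""

dag_config = json.loads(sample_dag_json)

# Tables without an explicit "enabled" flag are treated as enabled.
enabled_tables = [t for t in dag_config["tables"] if t.get("enabled", True)]

# Count tables per load mode, falling back to "unknown" when the field is absent.
load_mode_counts = Counter(t.get("load_mode", "unknown") for t in enabled_tables)
print(dict(load_mode_counts))
```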

In [None]:
if debug:
    # Enable Spark UI metrics
    spark.sparkContext.setLogLevel("WARN")

    # Track start time
    import time
    pipeline_start = time.time()

    print(f"Spark Configuration:")
    print(f"Executors: {spark.sparkContext._conf.get('spark.executor.instances', 'default')}")
    print(f"Executor memory: {spark.sparkContext._conf.get('spark.executor.memory', 'default')}")
    print(f"Driver memory: {spark.sparkContext._conf.get('spark.driver.memory', 'default')}")
    print(f"Cores per executor: {spark.sparkContext._conf.get('spark.executor.cores', 'default')}")

In [105]:
import json
from typing import Dict, List, Optional
from datetime import datetime
import os

# ============================================================================
# INPUT PARAMETERS - Set these when running the notebook
# ============================================================================
# source = "anva_meeus"  # Source system name
# run_ts = "20250923T183119772"  # Run timestamp
# dag_path = "/lakehouse/default/Files/config/dag_anva_meeus_week.json"  # Path to DAG JSON
# retry_tables = None  # Optional: ['Dim_Agent', 'Fact_Polissen'] or None for all tables
# drop_existing_tables = False  # Set to True to drop and recreate all tables (fixes case issues)

# Convert string to boolean (parameters come as strings from pipeline)
drop_existing_tables = drop_existing_tables.lower() == "true" if isinstance(drop_existing_tables, str) else drop_existing_tables

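# Convert retry_tables from a comma-separated string to a list if needed
# (whitespace around names is stripped, empty entries are dropped)
if isinstance(retry_tables, str) and retry_tables and retry_tables != "None":
    retry_tables = [name.strip() for name in retry_tables.split(",") if name.strip()]
elif retry_tables == "None" or not retry_tables:
    retry_tables = None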

print(f"Parsing DAG configuration...")
print(f"Source: {source}")
print(f"Run TS: {run_ts}")
print(f"DAG Path: {dag_path}")
print(f"Drop existing: {drop_existing_tables}")
print(f"Retry tables: {retry_tables}")

# ============================================================================
# Parse DAG JSON
# ============================================================================
# Read DAG JSON file
with open(dag_path, 'r') as f:
    dag_config = json.load(f)

# Verify source matches
if dag_config['source'] != source:
    print(f"Warning: DAG source ({dag_config['source']}) doesn't match input source ({source})")

# Extract table configurations
all_tables = dag_config['tables']
print(f"DAG parsed successfully")
print(f"Total tables in DAG: {len(all_tables)}")

# Filter tables if retry_tables is specified
if retry_tables:
    tables_to_process = [t for t in all_tables if t['name'] in retry_tables]
    print(f"Retry mode: Processing {len(tables_to_process)} specific tables")
else:
    tables_to_process = [t for t in all_tables if t.get('enabled', True)]
    print(f"Processing {len(tables_to_process)} enabled tables")

# Display table summary
print(f"Tables to process:")
load_mode_counts = {}
for table in tables_to_process:
    load_mode = table.get('load_mode', 'unknown')
    load_mode_counts[load_mode] = load_mode_counts.get(load_mode, 0) + 1

for mode, count in load_mode_counts.items():
    print(f"  - {mode}: {count} tables")

# Construct base parquet path
base_files = dag_config.get('base_files', 'greenhouse_sources')
run_id = f"{run_ts}"

# Extract date from run_ts (format: 20250923T183119772)
year = run_ts[:4]
month = run_ts[4:6]
day = run_ts[6:8]

base_parquet_path = f"/lakehouse/default/Files/{base_files}/{source}/{year}/{month}/{day}/{run_id}"
print(f"Base parquet path: {base_parquet_path}")

# Verify parquet path exists - STOP if not found
if not os.path.exists(base_parquet_path):
    error_msg = f"Parquet path does not exist: {base_parquet_path}"
    print(f"{error_msg}")
    raise FileNotFoundError(error_msg)

# Verify table folders exist
table_folders = [d for d in os.listdir(base_parquet_path) if os.path.isdir(os.path.join(base_parquet_path, d))]

if len(table_folders) == 0:
    error_msg = f"No table folders found in: {base_parquet_path}"
    print(f"{error_msg}")
    raise FileNotFoundError(error_msg)

print(f"Found {len(table_folders)} table folders")


📖 Parsing DAG configuration...
  Source: anva_meeus
  Run TS: 20250923T183119772
  DAG Path: /lakehouse/default/Files/config/dag_anva_meeus_week.json
  Drop existing: False

✓ DAG parsed successfully
  Total tables in DAG: 74
  Processing 58 enabled tables

📋 Tables to process:
  - snapshot: 57 tables
  - window: 1 tables

📂 Base parquet path:
  /lakehouse/default/Files/greenhouse_sources/anva_meeus/2025/09/23/run_20250923T183119772
  ✓ Path exists
  ✓ Found 58 table folders


## Step 3: Define Helper Functions

Create reusable functions for:
- **Logging**: Track processing status (start, success, failure)
- **Load Operations**: Handle different load modes (snapshot, window, incremental)
- **Metrics**: Calculate row counts and processing duration
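
The logging helper in this step writes to a shared Delta table from several threads, so its MERGE can hit concurrent-write conflicts; the code handles this by retrying with exponential backoff. The pattern in isolation looks roughly like the sketch below. `run_with_retry`, `FakeConflict`, and the injectable `sleep` argument are illustrative names, not part of the notebook.

```python
import time


def run_with_retry(action, retryable=(Exception,), max_retries=5,
                   base_sleep=0.2, sleep=time.sleep):
    """Call action(); on a retryable error wait base_sleep * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return action()
        except retryable:
            sleep(base_sleep * (2 ** attempt))
    # Final attempt: any exception now propagates to the caller.
    return action()


class FakeConflict(Exception):
    """Stand-in for a concurrent-modification error."""


calls = {"n": 0}
waits = []  # records the requested sleep durations without actually sleeping


def flaky_merge():
    # Fails on the first two calls, then succeeds.
    calls["n"] += 1
    if calls["n"] <= 2:
        raise FakeConflict("simulated concurrent write")
    return "merged"


outcome = run_with_retry(flaky_merge, retryable=(FakeConflict,), sleep=waits.append)
print(outcome, calls["n"], waits)
```

Passing the sleep function as an argument keeps the example fast to run and makes the backoff schedule easy to inspect.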

In [106]:
import uuid
from datetime import datetime
from pyspark.sql.utils import AnalysisException
from delta.tables import DeltaTable
from pyspark.sql import functions as F
import time
from delta.exceptions import ConcurrentAppendException, DeltaConcurrentModificationException

# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

# def log_table_processing(log_id: str, run_ts: str, source: str, table_name: str, 
#                          status: str, load_mode: str = None, rows_read: int = None, 
#                          rows_written: int = None, start_time: datetime = None, 
#                          end_time: datetime = None, error_message: str = None,
#                          parquet_path: str = None, delta_table: str = None):
#     """Write a log entry to the bronze processing log table"""
#     duration_seconds = None
#     if start_time and end_time:
#         duration_seconds = int((end_time - start_time).total_seconds())
    
#     log_data = [(
#         log_id, run_ts, source, table_name, load_mode, status,
#         rows_read, rows_written, start_time, end_time, duration_seconds,
#         error_message, parquet_path, delta_table
#     )]
    
#     log_df = spark.createDataFrame(log_data, schema=log_schema)
#     log_df.write.format("delta").mode("append").saveAsTable(log_table_name)

def log_table_processing(log_id: str, run_ts: str, source: str, table_name: str, 
                         status: str, load_mode: str = None, rows_read: int = None, 
                         rows_written: int = None, start_time: datetime = None, 
                         end_time: datetime = None, error_message: str = None,
                         parquet_path: str = None, delta_table: str = None):

    def merge_with_retry(merge_fn, max_retries=5, base_sleep=0.2):
        for i in range(max_retries):
            try:
                merge_fn()
                return
            except (ConcurrentAppendException, DeltaConcurrentModificationException) as e:
                time.sleep(base_sleep * (2 ** i))
        # final attempt: let it fail once more (any exception propagates)
        merge_fn()

    # compute the duration
    duration_seconds = None
    if start_time and end_time:
        duration_seconds = int((end_time - start_time).total_seconds())

    # single row as a DataFrame
    log_df = spark.createDataFrame(
        [(
            log_id, run_ts, source, table_name, load_mode, status,
            rows_read, rows_written, start_time, end_time, duration_seconds,
            error_message, parquet_path, delta_table
        )],
        schema=log_schema
    )

    log_df = log_df.withColumn("run_date", F.to_date(F.col("run_ts").substr(1, 8), "yyyyMMdd"))

    # Delta MERGE: update the same row based on the key (here: log_id)
    dt = DeltaTable.forName(spark, log_table_name)

    # COALESCE on matched update: keep existing values when the new ones are None
    set_expr = {
        "run_ts":          F.col("s.run_ts"),
        "source":          F.coalesce(F.col("s.source"), F.col("t.source")),
        "table_name":      F.coalesce(F.col("s.table_name"), F.col("t.table_name")),
        "load_mode":       F.coalesce(F.col("s.load_mode"), F.col("t.load_mode")),
        "status":          F.col("s.status"),
        "rows_read":       F.coalesce(F.col("s.rows_read"), F.col("t.rows_read")),
        "rows_written":    F.coalesce(F.col("s.rows_written"), F.col("t.rows_written")),
        "start_time":      F.coalesce(F.col("s.start_time"), F.col("t.start_time")),
        "end_time":        F.coalesce(F.col("s.end_time"), F.col("t.end_time")),
        "duration_seconds":F.coalesce(F.col("s.duration_seconds"), F.col("t.duration_seconds")),
        "error_message":   F.coalesce(F.col("s.error_message"), F.col("t.error_message")),
        "parquet_path":    F.coalesce(F.col("s.parquet_path"), F.col("t.parquet_path")),
        "delta_table":     F.coalesce(F.col("s.delta_table"), F.col("t.delta_table")),
        "run_date":        F.col("s.run_date")
    }

    insert_vals = { c: F.col(f"s.{c}") for c in log_df.columns }

    # (dt.alias("t")
    #   .merge(log_df.alias("s"), "t.log_id = s.log_id AND t.table_name = s.table_name AND t.run_date = s.run_date")           # <-- your key
    #   .whenMatchedUpdate(set=set_expr)                           # update the same row
    #   .whenNotMatchedInsert(values=insert_vals)                  # or create it
    #   .execute())
    merge_with_retry(lambda: (
        dt.alias("t")
        .merge(log_df.alias("s"), "t.log_id=s.log_id AND t.table_name=s.table_name AND t.run_date=s.run_date")
        .whenMatchedUpdate(set=set_expr)
        .whenNotMatchedInsert(values=insert_vals)
        .execute()
    ))


def load_parquet_to_delta(table_config: dict, parquet_base_path: str, source: str, run_ts: str, drop_existing: bool = False) -> dict:
    """
    Load parquet to Delta using file:// protocol
    """
    table_name = table_config['name']
    load_mode = table_config.get('load_mode', 'snapshot')
    target_table = table_config.get('delta_table', table_name)
    delta_table_name = f"`{source}`.`{target_table}`"
    
    table_folder = table_name.replace('.', '_')
    parquet_path = f"{parquet_base_path}/{table_folder}"
    
    log_id = str(uuid.uuid4())
    start_time = datetime.now()
    
    result = {
        'status': 'SUCCESS', 'rows_read': 0, 'rows_written': 0,
        'error_message': None, 'log_id': log_id, 'start_time': start_time,
        'parquet_path': parquet_path, 'delta_table': f"{source}.{target_table}"
    }
    
    try:
        log_table_processing(
            log_id=log_id, run_ts=run_ts, source=source, table_name=table_name,
            status='RUNNING', load_mode=load_mode, start_time=start_time,
            parquet_path=parquet_path, delta_table=f"{source}.{target_table}"
        )
        
        spark.sql(f"CREATE SCHEMA IF NOT EXISTS `{source}`")
        
        if drop_existing:
            try:
                spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")
                spark.sql(f"DROP TABLE IF EXISTS `{source}`.`{target_table.lower()}`")
            except Exception:
                pass
        
        # Use file:// protocol - bypasses OneLake connector issues
        file_protocol_path = f"file://{parquet_path}"
        df = spark.read.parquet(file_protocol_path)
        rows_read = df.count()
        result['rows_read'] = rows_read
        
        if rows_read == 0:
            result['error_message'] = "Empty table - schema only"
        
        # Write to Delta
        if load_mode in ['snapshot', 'window']:
            df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(delta_table_name)
        elif load_mode == 'incremental':
            table_exists = False
    
            try:
                # Check both: exists in catalog AND has readable data
                if spark.catalog.tableExists(f"{source}.{target_table}"):
                    # Try to read - if it fails, table is corrupt
                    test_df = spark.table(delta_table_name).limit(1)
                    test_df.count()  # Force evaluation
                    table_exists = True
                    print(f"  ℹ️  Incremental append to existing {target_table}")
            except Exception as check_error:
                # Table exists but is corrupt/unreadable
                print(f"Table {target_table} exists but is corrupt - recreating")
                print(f"Error: {str(check_error)[:100]}")
                
                # Drop corrupt table
                try:
                    spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")
                except Exception:
                    pass
                
                table_exists = False

            if not table_exists:
                # First load or recreate: create table with overwrite
                print(f"Creating new incremental table {target_table}")
                df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(delta_table_name)
            else:
                # Subsequent loads: append
                df.write.format("delta").mode("append").saveAsTable(delta_table_name)
        else:
            raise ValueError(f"Unknown load_mode: {load_mode}")
        
        result['rows_written'] = rows_read
        result['status'] = 'SUCCESS'
        
        # Vacuum immediately
        # try:
        #     spark.sql(f"VACUUM {delta_table_name} RETAIN 0 HOURS")
        # except AnalysisException as e:
        #     if "not found" not in str(e).lower():
        #         print(f"Vacuum warning for {target_table}: {str(e)[:100]}")
        
    except Exception as e:
        result['status'] = 'FAILED'
        error_type = type(e).__name__
        error_msg = str(e)
        result['error_message'] = f"[{error_type}] {error_msg[:400]}"
    
    end_time = datetime.now()
    result['end_time'] = end_time  # lets the caller report the actual load duration
    log_table_processing(
        log_id=log_id, run_ts=run_ts, source=source, table_name=table_name,
        status=result['status'], load_mode=load_mode,
        rows_read=result['rows_read'], rows_written=result['rows_written'],
        start_time=start_time, end_time=end_time,
        error_message=result['error_message'],
        parquet_path=parquet_path, delta_table=f"{source}.{target_table}"
    )
    
    return result

# Example parquet file location:
# Files/greenhouse_sources/anva_meeus/2025/10/05/20251005T142752505/Fact_PremieFacturen/sql_query_Boek_Datum_11042024120000_12052024000000_01_00000.parquet


print("Helper functions defined.")


✓ Helper functions defined (with drop_existing support)


## Step 4: Process Tables - Parquet to Delta

Iterate through all configured tables and load them into Delta Tables based on their load mode:

**Load Modes:**
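- **Snapshot**: Full table overwrite; the table schema is replaced by the schema of the incoming data (`overwriteSchema`)
- **Window**: Full table overwrite with time-windowed data
- **Incremental**: Append new records to the existing table; the table is created on the first load or recreated if it is unreadable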

Progress is tracked in real-time with success/failure counts.
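
The mapping from load mode to Delta write mode can be summarised in a small pure-Python sketch. The helper name `choose_write_mode` is hypothetical; the notebook implements the same decision inline in `load_parquet_to_delta`.

```python
def choose_write_mode(load_mode: str, table_exists: bool) -> str:
    """Return the Delta write mode for a table, given its load mode."""
    if load_mode in ("snapshot", "window"):
        # Full reload every run.
        return "overwrite"
    if load_mode == "incremental":
        # Append to an existing table; the first load creates it via overwrite.
        return "append" if table_exists else "overwrite"
    raise ValueError(f"Unknown load_mode: {load_mode}")


cases = [("snapshot", True), ("window", False),
         ("incremental", False), ("incremental", True)]
for mode, exists in cases:
    print(f"{mode:<12} table_exists={exists!s:<5} -> {choose_write_mode(mode, exists)}")
```

Raising on an unknown mode, as the notebook does, surfaces typos in the DAG configuration as a failed table instead of a silent default.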

In [107]:
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

# ============================================================================
# PARALLEL PROCESSING CONFIGURATION
# ============================================================================

max_parallel_workers = 4
print(f"⚙️  Parallel workers: {max_parallel_workers}")

# Thread-safe counters
lock = threading.Lock()
success_count = 0
failed_count = 0
empty_count = 0
total_rows_processed = 0
processed_count = 0

def update_counters(result):
    """Thread-safe counter updates"""
    global success_count, failed_count, empty_count, total_rows_processed, processed_count
    with lock:
        processed_count += 1
        if result['status'] == 'SUCCESS':
            success_count += 1
            total_rows_processed += result['rows_written']
            if result['rows_written'] == 0:
                empty_count += 1
        else:
            failed_count += 1

# ============================================================================
# MAIN PARALLEL PROCESSING
# ============================================================================

print(f"Starting parallel bronze loading process...")
print(f"Tables to process: {len(tables_to_process)}")
print(f"Parallel workers: {max_parallel_workers}")
print(f"Source: {source}")
print(f"Run TS: {run_ts}")
print(f"=" * 80)

results = []
processing_start = datetime.now()

with ThreadPoolExecutor(max_workers=max_parallel_workers) as executor:
    future_to_table = {
        executor.submit(load_parquet_to_delta, table_config, base_parquet_path, source, run_ts, drop_existing_tables): table_config
        for table_config in tables_to_process
    }
    
    for future in as_completed(future_to_table):
        table_config = future_to_table[future]
        table_name = table_config['name']
        load_mode = table_config.get('load_mode', 'snapshot')
        
        try:
            result = future.result()
            results.append({
                'table_name': table_name,
                'load_mode': load_mode,
                **result
            })
            
            update_counters(result)
            
            with lock:
                current = processed_count
            
            if result['status'] == 'SUCCESS':
                duration = (result.get('end_time', datetime.now()) - result['start_time']).total_seconds()
                if result['rows_written'] == 0:
                    print(f"[{current}/{len(tables_to_process)}] ○ {table_name}: Empty table (schema only) in {duration:.1f}s")
                else:
                    print(f"[{current}/{len(tables_to_process)}] ✓ {table_name}: {result['rows_written']:,} rows in {duration:.1f}s")
            else:
                print(f"[{current}/{len(tables_to_process)}] ✗ {table_name}: {result['error_message']}")
                
        except Exception as e:
            print(f"[?/{len(tables_to_process)}] ✗ {table_name}: Unexpected error: {str(e)}")
            results.append({
                'table_name': table_name,
                'load_mode': load_mode,
                'status': 'FAILED',
                'error_message': str(e)
            })
            with lock:
                failed_count += 1
                processed_count += 1

processing_end = datetime.now()
total_duration = (processing_end - processing_start).total_seconds()

# ============================================================================
# SUMMARY REPORT
# ============================================================================

print(f"\n" + "=" * 80)
print(f"🏁 Processing Complete!")
print(f"=" * 80)
print(f"  Total tables: {len(tables_to_process)}")
print(f"  ✓ Success: {success_count}")
print(f"    - With data: {success_count - empty_count}")
print(f"    - Empty (schema only): {empty_count}")
print(f"  ✗ Failed: {failed_count}")
print(f"  Total rows: {total_rows_processed:,}")
print(f"  Duration: {total_duration:.1f} seconds ({total_duration/60:.1f} minutes)")
if total_rows_processed > 0 and total_duration > 0:
    print(f"  Throughput: {total_rows_processed/total_duration:,.0f} rows/second")
if tables_to_process:  # avoid division by zero when no tables were selected
    print(f"  Average: {total_duration/len(tables_to_process):.1f} seconds/table")

# Show failed tables if any
if failed_count > 0:
    print(f"\n⚠️  Failed tables ({failed_count}):")
    failed_tables_list = []
    for result in results:
        if result['status'] == 'FAILED':
            print(f"  - {result['table_name']}: {result.get('error_message', 'Unknown error')[:80]}")
            failed_tables_list.append(result['table_name'])
    
    print(f"\n💡 To retry failed tables, run with:")
    print(f"  retry_tables = {failed_tables_list}")

print(f"\n✓ All logs saved to: {log_table_name}")



⚙️  Parallel workers: 10

🚀 Starting parallel bronze loading process...
  Tables to process: 58
  Parallel workers: 10
  Source: anva_meeus
  Run TS: 20250923T183119772
[1/58] ✓ Dim_Agent: 3,193 rows in 104.2s
[2/58] ✓ Dim_Branche: 617 rows in 104.6s
[3/58] ✓ Dim_DekkingCode: 2,575 rows in 107.7s
[4/58] ✓ Dim_FactuurSoort: 24 rows in 110.2s
[5/58] ✓ Dim_HoofdBranche: 23 rows in 111.8s
[6/58] ✓ Dim_Incassowijze: 168 rows in 117.5s
[7/58] ✓ Dim_Calamiteit: 14 rows in 119.1s
[8/58] ✓ Dim_DetailMaatschappij: 2,289 rows in 120.8s
[9/58] ✓ Dim_DekkingVariabel: 93,997 rows in 122.6s
[10/58] ✓ Dim_Collectiviteit: 10,212 rows in 123.9s
[11/58] ✓ Dim_Kantoor: 3 rows in 83.9s
[12/58] ✓ Dim_Medewerker: 9,861 rows in 81.0s
[13/58] ✓ Dim_Maatschappij: 2,289 rows in 85.1s
[14/58] ○ Dim_PolisVariabel_VrijeLabels: Empty table (schema only) in 75.9s
[15/58] ✓ Dim_MeldingRDW: 7 rows in 107.0s
[16/58] ✓ Dim_PolisProducent: 1,473 rows in 107.4s
[17/58] ✓ Dim_Polisvo

In [None]:
# Return structured result for master notebook
result_payload = {
    "success_count": success_count,
    "failed_count": failed_count,
    "total_rows": total_rows_processed,
    "duration_seconds": total_duration,
    "failed_tables": [r['table_name'] for r in results if r['status'] == 'FAILED']
}

print(f"Result payload:")
print(json.dumps(result_payload, indent=2))

# Exit with payload for orchestration
mssparkutils.notebook.exit(json.dumps(result_payload))

In [None]:
# # Check the Delta versions of a table
# table_name = "anva_meeus.Dim_Agent"

# print(f"🔍 Delta Table History for {table_name}")
# print("=" * 80)

# # Option 1: SQL
# history_df = spark.sql(f"DESCRIBE HISTORY {table_name}")
# history_df.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

# print(f"\n📊 Total versions: {history_df.count()}")

# # Option 2: Python API
# from delta.tables import DeltaTable

# delta_table = DeltaTable.forName(spark, table_name)
# history = delta_table.history()

# print(f"\n📋 Version details:")
# history.select("version", "timestamp", "operation", "operationMetrics.numFiles", "operationMetrics.numOutputRows").show(10, truncate=False)
