# Data Processing Notebook Template

Use this scaffold to build repeatable PySpark pipelines with logging, validations, and optional Delta rollbacks. Replace placeholders with your sources, business logic, and checks.


## Notebook guidelines

- Name notebooks in clear snake_case (e.g., `orders_enriched.ipynb` or `domain_orders_enriched.ipynb`); keep one primary table per notebook.
- Make runs idempotent: deterministic transforms, safe overwrites/merges, and repeatable partition logic so reruns do not create duplicates.
- Keep scopes clean: functions in snake_case, classes in PascalCase, constants UPPER_SNAKE, modules/files in snake_case; avoid implicit globals beyond parameters.
- Encapsulate helpers in small functions inside the notebook when reusable; prefer pure functions and pass Spark/DataFrames explicitly.
- Document inputs/outputs near the top and ensure the notebook owns producing one curated table/view, not many.
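
The idempotency guideline can be illustrated without a Spark session. The sketch below is a plain-Python stand-in for a keyed merge or partition overwrite, not the notebook's actual write path; `upsert_by_key` is a hypothetical helper introduced only for this illustration.

```python
from typing import Iterable

Row = dict[str, object]


def upsert_by_key(target: list[Row], batch: Iterable[Row], key: str) -> list[Row]:
    """Return a new target in which each batch row replaces any row sharing its key."""
    merged = {row[key]: row for row in target}
    for row in batch:
        merged[row[key]] = row  # insert new keys, overwrite existing ones
    return list(merged.values())


batch = [
    {"order_id": "O-1001", "order_total": 42.5},
    {"order_id": "O-1002", "order_total": 18.0},
]

first_run = upsert_by_key([], batch, key="order_id")
second_run = upsert_by_key(first_run, batch, key="order_id")  # rerun with the same input

blind_append = first_run + batch  # what a non-idempotent append would produce
```

Rerunning the keyed upsert leaves the target unchanged, while the blind append doubles the rows. The same reasoning is why reruns should merge on business keys or overwrite whole partitions rather than append.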


In [None]:
from app.config.settings import get_settings

import datetime as _dt

from pyspark.sql import functions as F, types as T

from spark_fuse.spark import create_session
from spark_fuse.utils import change_tracking  # noqa: F401 (enables .change_tracking accessors)
from spark_fuse.utils.dataframe import ensure_columns, preview
from spark_fuse.utils.progress import (
    console,
    create_progress_tracker,
    enable_spark_logging,
    log_end as log_end_step,
    log_error as log_error_step,
    log_info as log_info_step,
    log_warn as log_warn_step,
)

progress_tracker = create_progress_tracker(total_steps=10)
log = console()
SHOW_HTML = True  # Render HTML log cards in notebooks by default.

def log_info(label: str, *, advance: int = 1, show_html: bool = SHOW_HTML) -> None:
    log_info_step(progress_tracker, log, label, advance=advance, show_html=show_html)


def log_warn(label: str, *, advance: int = 1, show_html: bool = SHOW_HTML) -> None:
    log_warn_step(progress_tracker, log, label, advance=advance, show_html=show_html)


def log_error(label: str, *, advance: int = 1, show_html: bool = SHOW_HTML) -> None:
    log_error_step(progress_tracker, log, label, advance=advance, show_html=show_html)


def log_end(label: str, *, advance: int = 1, show_html: bool = SHOW_HTML) -> None:
    log_end_step(progress_tracker, log, label, advance=advance, show_html=show_html)

# Set any reusable parameters here
job_ts = _dt.datetime.now().replace(microsecond=0).isoformat()  # naive local-time timestamp; pass a timezone to now() if you need UTC


## Load configuration

Settings are layered in priority order (`config/base.yaml` -> `config/{env}.yaml` -> `.env` -> environment variables).
Set `APP_ENV` to `local`, `staging`, or `prod`, and keep secrets in `.env` or real environment variables.
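
To make the layering concrete, here is a rough plain-Python sketch of how such a merge might behave. It assumes that sources further right in the chain override earlier ones, which the text above does not state explicitly, and it does not reproduce the real `get_settings` implementation; `merge_layers` and the sample values are hypothetical.

```python
from typing import Any, Mapping


def merge_layers(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Deep-merge mappings left to right; values in later layers win (assumed precedence)."""
    merged: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, Mapping):
                base = merged.get(key)
                # Merge nested sections key by key instead of replacing them wholesale.
                merged[key] = merge_layers(base if isinstance(base, dict) else {}, value)
            else:
                merged[key] = value
    return merged


# Hypothetical layer contents, listed in the same order as the chain above.
base_yaml = {"env": "local", "logging": {"level": "WARN", "json": False}}
env_yaml = {"env": "staging", "logging": {"level": "INFO"}}
dotenv: dict[str, Any] = {}
env_vars = {"logging": {"level": "DEBUG"}}

resolved = merge_layers(base_yaml, env_yaml, dotenv, env_vars)
```

Under that assumption, `resolved["logging"]["level"]` ends up as `"DEBUG"` while `json` keeps its base value, because nested sections are merged rather than overwritten.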


In [2]:

settings = get_settings()
log_info(f"Loaded settings for env={settings.env}", advance=0)


INFO: Loaded settings for env=local:   0%|          | 0/10 [00:00<?, ?it/s, +0.01s, total 0.01s]

> Why `functions as F` and `types as T`? Aliasing keeps chained expressions concise, matches common Spark style, and avoids polluting the global namespace with hundreds of column functions and type classes.


## Create a session

Adjust `app_name`, `master`, and configs for your environment.


In [3]:
log_info("Starting Spark session...", advance=0)
spark = create_session(
    app_name=f"data-processing-template-{settings.env}",
    master="local[*]",
    extra_configs={"spark.some.credential": "value"},
)
log_info("Spark session ready")
spark


INFO: Starting Spark session...:   0%|          | 0/10 [00:00<?, ?it/s, +0.03s, total 0.03s]    

:: loading settings :: url = jar:file:/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/kevin/.ivy2.5.2/cache
The jars for the packages stored in: /Users/kevin/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2656c19a-5cfc-40c4-83ec-6300ef0baac1;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.1 in central
	found io.delta#delta-storage;4.0.1 in central
	found org.antlr#antlr4-runtime;4.13.1 in central
:: resolution report :: resolve 110ms :: artifacts dl 2ms
	:: modules in use:
	io.delta#delta-spark_2.13;4.0.1 from central in [default]
	io.delta#delta-storage;4.0.1 from central in [default]
	org.antlr#antlr4-runtime;4.13.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules      

INFO: Spark session ready:  10%|█         | 1/10 [00:03<00:33,  3.74s/it, +3.76s, total 3.76s]

## Start logging

Raise Spark log verbosity while you iterate so shuffle and scheduler details show up in the driver logs.


In [4]:
enable_spark_logging(spark, level=settings.logging.level)
log_info(f"Spark logging enabled at {settings.logging.level}.", advance=0)
log_info("Logging configured")


INFO: Spark logging enabled at INFO.:  10%|█         | 1/10 [00:03<00:33,  3.74s/it, +0.14s, total 3.90s]

INFO: Logging configured:  20%|██        | 2/10 [00:03<00:12,  1.62s/it, +0.14s, total 3.90s]

## Load relevant data

Declare input locations and load dataframes. Swap formats and options for your sources.


In [5]:
log_info("Loading input data (dummy samples; replace with real sources)...", advance=0)

orders_schema = T.StructType(
    [
        T.StructField("order_id", T.StringType(), False),
        T.StructField("order_ts", T.StringType(), False),
        T.StructField("order_total", T.DoubleType(), False),
        T.StructField("customer_id", T.StringType(), False),
    ]
)
orders_data = [
    ("O-1001", "2024-01-05", 42.50, "C001"),
    ("O-1002", "2024-01-06", 18.00, "C002"),
    ("O-1003", "2024-01-06", 120.75, "C003"),
]
orders_df = spark.createDataFrame(orders_data, schema=orders_schema)

customers_schema = T.StructType(
    [
        T.StructField("customer_id", T.StringType(), False),
        T.StructField("segment", T.StringType(), True),
        T.StructField("country", T.StringType(), True),
    ]
)
customers_data = [
    ("C001", "retail", "US"),
    ("C002", "enterprise", "CA"),
    ("C003", "retail", "UK"),
]
customers_df = spark.createDataFrame(customers_data, schema=customers_schema)

log_info("Input data loaded")
log_info(f"Orders sample:\n{preview(orders_df)}", advance=0)


INFO: Loading input data (dummy samples; replace with real sources)...:  20%|██        | 2/10 [00:03<00:12,  1.62s/it, +0.00s, total 3.91s]

INFO: Input data loaded:  30%|███       | 3/10 [00:04<00:07,  1.08s/it, +0.44s, total 4.34s]

26/01/21 20:21:12 INFO DAGScheduler: Got job 0 (collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20) with 1 output partitions
26/01/21 20:21:12 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20)
26/01/21 20:21:12 INFO DAGScheduler: Parents of final stage: List()
26/01/21 20:21:12 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:12 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[11] at collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20), which has no missing parents
26/01/21 20:21:12 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
26/01/21 20:21:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 13.6 KiB, free 434.4 MiB)
26/01/21 20:21:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.1 KiB, free 434.4 MiB)
26/01/21 20:21:12 INFO DAGScheduler: Submitti

INFO: Orders sample:
rows=[{'order_id': 'O-1001', 'order_ts': '2024-01-05', 'order_total': 42.5, 'customer_id': 'C001'}, {'order_id': 'O-1002', 'order_ts': '2024-01-06', 'order_total': 18.0, 'customer_id': 'C002'}, {'order_id': 'O-1003', 'order_ts': '2024-01-06', 'order_total': 120.75, 'customer_id': 'C003'}]

## Process data

Apply your business logic: filtering, casting, enrichment, and derived columns.


In [6]:
log_info("Curating datasets...", advance=0)
curated_orders_df = (
    orders_df
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
    .withColumn("processing_ts", F.lit(job_ts))
)

curated_customers_df = customers_df.select("customer_id", "segment", "country")
log_info("Curated orders and customers dataframes ready.", advance=0)
log_info("Curated datasets ready")


INFO: Curating datasets...:  30%|███       | 3/10 [00:05<00:07,  1.08s/it, +1.55s, total 5.88s]

INFO: Curated orders and customers dataframes ready.:  30%|███       | 3/10 [00:05<00:07,  1.08s/it, +1.59s, total 5.93s]

INFO: Curated datasets ready:  40%|████      | 4/10 [00:05<00:07,  1.28s/it, +1.60s, total 5.94s]

## Do joins

Join curated datasets and pick the right join strategy for your domain (inner/left/anti).


In [7]:
log_info("Joining curated datasets...", advance=0)
joined_df = (
    curated_orders_df.alias("o")
    .join(curated_customers_df.alias("c"), on="customer_id", how="left")
)

log_info(f"Join complete. Sample:\n{preview(joined_df)}", advance=0)
log_info("Join complete")


INFO: Joining curated datasets...:  40%|████      | 4/10 [00:05<00:07,  1.28s/it, +0.00s, total 5.94s]

26/01/21 20:21:14 INFO DAGScheduler: Registering RDD 13 (collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20) as input to shuffle 0
26/01/21 20:21:14 INFO DAGScheduler: Got map stage job 3 (collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20) with 8 output partitions
26/01/21 20:21:14 INFO DAGScheduler: Final stage: ShuffleMapStage 3 (collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20)
26/01/21 20:21:14 INFO DAGScheduler: Parents of final stage: List()
26/01/21 20:21:14 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:14 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[13] at collect at /Users/kevin/Github/spark-fuse/src/spark_fuse/utils/dataframe.py:20), which has no missing parents
26/01/21 20:21:14 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 16.8 KiB, free 434.4 MiB)
26/01/21 20:21:14 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes

INFO: Join complete. Sample:
rows=[{'customer_id': 'C001', 'order_id': 'O-1001', 'order_ts': '2024-01-05', 'order_total': 42.5, 'order_date': datetime.date(2024, 1, 5), 'order_month': '2024-01', 'processing_ts': '2026-01-21T20:21:07', 'segment': 'retail', 'country': 'US'}, {'customer_id': 'C002', 'order_id': 'O-1002', 'order_ts': '2024-01-06', 'order_total': 18.0, 'order_date': datetime.date(2024, 1, 6), 'order_month': '2024-01', 'processing_ts': '2026-01-21T20:21:07', 'segment': 'enterprise', 'country': 'CA'}, {'customer_id': 'C003', 'order_id': 'O-1003', 'order_ts': '2024-01-06', 'order_total': 120.75, 'order_date': datetime.date(2024, 1, 6), 'order_month': '2024-01', 'processing_ts': '2026-01-21T20:21:07', 'segment': 'retail', 'country': 'UK'}]

INFO: Join complete:  50%|█████     | 5/10 [00:06<00:05,  1.02s/it, +0.54s, total 6.48s]
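
To see how the inner, left, and anti strategies differ, the following plain-Python sketch mimics them on lists of dicts. It assumes `customer_id` is unique on the customer side; `join_orders` and the unmatched order `O-1004` are hypothetical and exist only for this illustration.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def join_orders(orders: list[Row], customers_by_id: dict[str, Row], how: str) -> list[Row]:
    """Mimic inner/left/anti join semantics for a unique right-side key."""
    if how not in {"inner", "left", "anti"}:
        raise ValueError(f"unsupported join type: {how}")
    joined: list[Row] = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        if customer is None:
            if how == "left":
                # Keep the order and fill the customer columns with nulls.
                joined.append({**order, "segment": None, "country": None})
            elif how == "anti":
                # Keep only orders that have no matching customer.
                joined.append(dict(order))
        elif how != "anti":
            joined.append({**order, **customer})
    return joined


orders = [
    {"order_id": "O-1001", "customer_id": "C001"},
    {"order_id": "O-1004", "customer_id": "C999"},  # hypothetical order without a customer
]
customers_by_id = {"C001": {"segment": "retail", "country": "US"}}

inner_rows = join_orders(orders, customers_by_id, "inner")
left_rows = join_orders(orders, customers_by_id, "left")
anti_rows = join_orders(orders, customers_by_id, "anti")
```

A left join keeps every order and leaves `segment` and `country` null for unmatched customers, which is why the template uses it for enrichment. An anti join is handy as a diagnostic for orphaned keys.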

## Do data tests

Add lightweight checks so issues surface early during development.


In [8]:
log_info("Running in-memory data tests...", advance=0)
# Schema/column guardrails
ensure_columns(joined_df, ["order_id", "customer_id", "order_date"])

# Null/uniqueness/data quality checks (expand as needed)
assert joined_df.filter(F.col("order_id").isNull()).count() == 0, "order_id should be populated"
assert joined_df.filter(F.col("customer_id").isNull()).count() == 0, "customer_id should be populated"
assert joined_df.dropDuplicates(["order_id"]).count() == joined_df.count(), "order_id should be unique"

# Domain-specific rule example; swap column names for your metric
invalid_states = joined_df.filter(F.col("order_total") < 0).count()
assert invalid_states == 0, f"Found {invalid_states} negative order totals"
log_info("In-memory data tests passed")


INFO: Running in-memory data tests...:  50%|█████     | 5/10 [00:06<00:05,  1.02s/it, +0.01s, total 6.48s]

26/01/21 20:21:14 INFO DAGScheduler: Registering RDD 23 (run at ThreadPoolExecutor.java:1144) as input to shuffle 2
26/01/21 20:21:14 INFO DAGScheduler: Got job 7 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768) with 1 output partitions
26/01/21 20:21:14 INFO DAGScheduler: Final stage: ResultStage 10 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768)
26/01/21 20:21:14 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 9)
26/01/21 20:21:14 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:14 INFO DAGScheduler: Submitting ResultStage 10 (MapPartitionsRDD[26] at $anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768), which has no missing parents
26/01/21 20:21:14 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 13.6 KiB, free 432.3 MiB)
26/01/21 20:21:14 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 6.5 KiB, free 432.3 MiB)
26/01/21 20:21:14 INFO DAGSch

INFO: In-memory data tests passed:  60%|██████    | 6/10 [00:07<00:04,  1.04s/it, +1.09s, total 7.57s]
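
The assertions in the data-test cell stop at the first failure. One possible variation, sketched here in plain Python, is to gather the violation counts first and report every failing check at once. `summarize_violations`, `assert_no_violations`, and the sample counts are hypothetical; in the notebook the counts would come from `filter(...).count()` calls like those above.

```python
def summarize_violations(violations: dict[str, int]) -> list[str]:
    """Describe every check whose offending-row count is above zero."""
    return [
        f"{name}: {count} offending row(s)"
        for name, count in violations.items()
        if count > 0
    ]


def assert_no_violations(violations: dict[str, int]) -> None:
    """Raise one AssertionError that lists all failing checks together."""
    failures = summarize_violations(violations)
    if failures:
        raise AssertionError("Data tests failed: " + "; ".join(failures))


# Hypothetical counts standing in for Spark count() results.
example_counts = {"order_id is null": 0, "duplicate order_id": 2, "negative order_total": 1}
report = summarize_violations(example_counts)
```

The trade-off is that all counts are computed even when the first check already fails, which costs extra Spark jobs but gives a complete picture in a single run.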

## Write data & post-write tests

Persist curated results, run post-write validations, and attempt rollback when Delta Lake is available.


In [9]:
output_path = f"/tmp/spark_fuse/{settings.env}/orders_enriched_ct"  # dummy local path; replace with real target (e.g., s3://bucket/silver/orders)
target_table = "orders_enriched_ct"  # metastore table name if you want one registered
log_info(f"Preparing to write dataset to {output_path}", advance=0)

delta_supported = False
pre_write_version = None
output_format = "delta"
table_exists = False
try:
    from delta.tables import DeltaTable
    delta_supported = True
    try:
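        existing_table = DeltaTable.forPath(spark, output_path)
        table_exists = True
        # Record the current version so a failed post-write validation can roll back to it.
        pre_write_version = existing_table.history(1).select("version").first()["version"]
        log_info("Existing Delta table found; skipping DDL.", advance=0)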
    except Exception:
        log_info("No existing Delta table found at output path; creating with change-tracking columns.", advance=0)
        (
            DeltaTable.createIfNotExists(spark)
            .tableName(target_table)
            .location(output_path)
            .addColumn("order_id", T.StringType())
            .addColumn("order_ts", T.StringType())
            .addColumn("order_total", T.DoubleType())
            .addColumn("customer_id", T.StringType())
            .addColumn("order_date", T.DateType())
            .addColumn("order_month", T.StringType())
            .addColumn("processing_ts", T.StringType())
            .addColumn("segment", T.StringType())
            .addColumn("country", T.StringType())
            .addColumn("effective_start_ts", T.TimestampType())
            .addColumn("effective_end_ts", T.TimestampType())
            .addColumn("is_current", T.BooleanType())
            .addColumn("version", T.LongType())
            .addColumn("row_hash", T.StringType())
            .addColumn("load_ts", T.TimestampType())
            .execute()
        )
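except Exception:
    # Keep the flags consistent: without Delta there is no change tracking and no rollback.
    delta_supported = False
    output_format = "parquet"
    log_warn("Delta Lake not available; falling back to parquet for write and disabling rollback.", advance=0)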

log_info(f"Writing data to final path with format={output_format}...", advance=0)
log_info("Write prerequisites complete")

if delta_supported:
    change_tracking_options = {
        "business_keys": ["order_id"],
        "tracked_columns": [
            "order_id",
            "customer_id",
            "order_date",
            "order_month",
            "processing_ts",
            "segment",
            "country",
            "order_total",
        ],
        "load_ts_expr": "current_timestamp()",
        "create_if_not_exists": not table_exists,
        "allow_schema_evolution": True,
    }
    joined_df.write.change_tracking.options(
        change_tracking_mode="track_history",
        change_tracking_options=change_tracking_options,
    ).table(output_path)
else:
    (
        joined_df.write
        .option("mergeSchema", "true")
        .mode("overwrite")
        .format(output_format)
        .partitionBy("order_month")
        .save(output_path)
    )

log_info("Write complete")

log_info("Running post-write validations on persisted data...", advance=0)
persisted_df = spark.read.format(output_format).load(output_path)
current_df = persisted_df.filter(F.col("is_current") == F.lit(True)) if delta_supported else persisted_df

try:
    ensure_columns(current_df, ["order_id", "customer_id", "order_date", "order_month"])
    assert current_df.count() > 0, "Persisted dataset is empty"
    assert current_df.filter(F.col("order_id").isNull()).count() == 0, "order_id should be populated"
    assert current_df.dropDuplicates(["order_id"]).count() == current_df.count(), "order_id should be unique"
    invalid_persisted_states = current_df.filter(F.col("order_total") < 0).count()
    assert invalid_persisted_states == 0, f"Found {invalid_persisted_states} negative order totals after write"
    log_info("Post-write validations passed")
except Exception as exc:
    log_error(f"Post-write validation failed: {exc}", advance=0)
    if delta_supported and pre_write_version is not None:
        try:
            log_info(f"Attempting Delta rollback to version {pre_write_version} ...", advance=0)
            spark.sql(f"RESTORE TABLE delta.`{output_path}` TO VERSION AS OF {pre_write_version}")
            log_info("Rollback succeeded.", advance=0)
        except Exception as rollback_exc:
            log_error(f"Rollback attempt failed: {rollback_exc}", advance=0)
    else:
        log_warn("No rollback available; inspect persisted data manually.", advance=0)
    raise


INFO: Preparing to write dataset to /tmp/spark_fuse/local/orders_enriched_ct:  60%|██████    | 6/10 [00:07<00:04,  1.04s/it, +0.02s, total 7.59s]

INFO: Existing Delta table found; skipping DDL.:  60%|██████    | 6/10 [00:07<00:04,  1.04s/it, +0.25s, total 7.82s]

INFO: Writing data to final path with format=delta...:  60%|██████    | 6/10 [00:07<00:04,  1.04s/it, +0.25s, total 7.82s]

INFO: Write prerequisites complete:  70%|███████   | 7/10 [00:07<00:02,  1.27it/s, +0.25s, total 7.83s]

26/01/21 20:21:15 INFO DAGScheduler: Registering RDD 68 (run at ThreadPoolExecutor.java:1144) as input to shuffle 12
26/01/21 20:21:15 INFO DAGScheduler: Got job 21 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768) with 1 output partitions
26/01/21 20:21:15 INFO DAGScheduler: Final stage: ResultStage 35 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768)
26/01/21 20:21:15 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 34)
26/01/21 20:21:15 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:15 INFO DAGScheduler: Submitting ResultStage 35 (MapPartitionsRDD[71] at $anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768), which has no missing parents
26/01/21 20:21:15 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 13.6 KiB, free 430.4 MiB)
26/01/21 20:21:15 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 6.5 KiB, free 430.4 MiB)
26/01/21 20:21:15 INFO D

INFO: Write complete:  80%|████████  | 8/10 [00:14<00:05,  2.52s/it, +6.23s, total 14.05s]

INFO: Running post-write validations on persisted data...:  80%|████████  | 8/10 [00:14<00:05,  2.52s/it, +0.00s, total 14.05s]

26/01/21 20:21:22 INFO DAGScheduler: Got job 55 ($anonfun$recordDeltaOperationInternal$1 at DatabricksLogging.scala:128) with 50 output partitions
26/01/21 20:21:22 INFO DAGScheduler: Final stage: ResultStage 93 ($anonfun$recordDeltaOperationInternal$1 at DatabricksLogging.scala:128)
26/01/21 20:21:22 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 92)
26/01/21 20:21:22 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:22 INFO DAGScheduler: Submitting ResultStage 93 (MapPartitionsRDD[202] at $anonfun$recordDeltaOperationInternal$1 at DatabricksLogging.scala:128), which has no missing parents
26/01/21 20:21:22 INFO MemoryStore: Block broadcast_74 stored as values in memory (estimated size 678.8 KiB, free 427.0 MiB)
26/01/21 20:21:22 INFO MemoryStore: Block broadcast_74_piece0 stored as bytes in memory (estimated size 154.0 KiB, free 426.8 MiB)
26/01/21 20:21:22 INFO DAGScheduler: Submitting 50 missing tasks from ResultStage 93 (MapPartitionsRDD[202] at $anonfun$r

INFO: Post-write validations passed:  90%|█████████ | 9/10 [00:15<00:02,  2.04s/it, +1.01s, total 15.06s]
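
The write cell follows a capture-version, write, validate, restore-on-failure flow. The sketch below isolates that control flow in plain Python with an in-memory list standing in for the Delta table history. `write_with_rollback` and the fake table are hypothetical and only illustrate the ordering of the steps.

```python
from typing import Callable, Optional


def write_with_rollback(
    get_version: Callable[[], Optional[int]],
    write: Callable[[], None],
    validate: Callable[[], None],
    restore: Callable[[int], None],
) -> None:
    """Write, validate, and restore the pre-write version if validation fails."""
    pre_write_version = get_version()  # None when the table does not exist yet
    write()
    try:
        validate()
    except Exception:
        if pre_write_version is not None:
            restore(pre_write_version)
        raise  # surface the validation failure after the rollback attempt


# Fake table history: the list index plays the role of the Delta version number.
versions: list[list[str]] = [["v0-row"]]


def failing_validate() -> None:
    raise AssertionError("Persisted dataset is empty")


outcome = "no failure"
try:
    write_with_rollback(
        get_version=lambda: len(versions) - 1,
        write=lambda: versions.append([]),  # a bad write that produces no rows
        validate=failing_validate,
        restore=lambda v: versions.append(list(versions[v])),  # restore adds a new version
    )
except AssertionError as exc:
    outcome = f"rolled back after: {exc}"
```

In the notebook, `get_version` could read the latest entry of the Delta history and `restore` could issue the `RESTORE TABLE` statement already used in the cell. The key point is that the version has to be captured before the write, otherwise there is nothing to roll back to.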

## Post-write Delta log

Show the ten most recent Delta commits, then pivot the `operationMetrics` of each MERGE commit into one row per version.


In [10]:
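from pathlib import Path

# Path.exists() only works for local filesystems; adapt this check for object stores.
log_path = Path(output_path) / "_delta_log"
if delta_supported and log_path.exists():
    try:
        # Import here so the cell still runs when Delta Lake is not installed.
        from delta.tables import DeltaTable

        dt = DeltaTable.forPath(spark, output_path)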
        history_df = dt.history(10)
        merge_ops = history_df.filter(F.col('operation') == 'MERGE')
        history_df.select('version','timestamp','operation','operationParameters','operationMetrics').show(truncate=False)
        pivoted = (
            merge_ops
            .select('version', F.explode('operationMetrics').alias('metric','value'))
            .where(F.col('value').isNotNull())
            .groupBy('version')
            .pivot('metric')
            .agg(F.first('value'))
            .orderBy('version')
        )
        pivoted.show(truncate=False)
    except Exception as exc:
        log_warn(f"Delta history not available: {exc}", advance=0)
else:
    log_warn("No Delta log found after write; ensure output_path is correct.", advance=0)


26/01/21 20:21:23 INFO DAGScheduler: Got job 69 (getHistory at DeltaTableOperations.scala:60) with 8 output partitions
26/01/21 20:21:23 INFO DAGScheduler: Final stage: ResultStage 118 (getHistory at DeltaTableOperations.scala:60)
26/01/21 20:21:23 INFO DAGScheduler: Parents of final stage: List()
26/01/21 20:21:23 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:23 INFO DAGScheduler: Submitting ResultStage 118 (MapPartitionsRDD[255] at getHistory at DeltaTableOperations.scala:60), which has no missing parents
26/01/21 20:21:23 INFO MemoryStore: Block broadcast_93 stored as values in memory (estimated size 188.9 KiB, free 432.6 MiB)
26/01/21 20:21:23 INFO MemoryStore: Block broadcast_93_piece0 stored as bytes in memory (estimated size 60.6 KiB, free 432.5 MiB)
26/01/21 20:21:23 INFO DAGScheduler: Submitting 8 missing tasks from ResultStage 118 (MapPartitionsRDD[255] at getHistory at DeltaTableOperations.scala:60) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7

+-------+-----------------------+------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|version|timestamp              |operation   |operationPa

26/01/21 20:21:23 INFO DAGScheduler: Registering RDD 258 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768) as input to shuffle 35
26/01/21 20:21:23 INFO DAGScheduler: Got map stage job 70 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768) with 2 output partitions
26/01/21 20:21:23 INFO DAGScheduler: Final stage: ShuffleMapStage 119 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768)
26/01/21 20:21:23 INFO DAGScheduler: Parents of final stage: List()
26/01/21 20:21:23 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:23 INFO DAGScheduler: Submitting ShuffleMapStage 119 (MapPartitionsRDD[258] at $anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768), which has no missing parents
26/01/21 20:21:23 INFO MemoryStore: Block broadcast_94 stored as values in memory (estimated size 27.1 KiB, free 432.5 MiB)
26/01/21 20:21:23 INFO MemoryStore: Block broadcast_94_piece0 stored as bytes in memory (estimated size 11.8 KiB, free 

+-------+---------------+-----------------------+-------------+-------------+-------------------+---------------------+-------------------------+-----------------------------+-------------------------------+-------------------------------+-------------------+---------------------+-------------------+--------------------+---------------------+---------------------------+---------------------------+--------------------------------------+--------------------------------------+--------------------+-------------+----------+
|version|executionTimeMs|materializeSourceTimeMs|numOutputRows|numSourceRows|numTargetBytesAdded|numTargetBytesRemoved|numTargetChangeFilesAdded|numTargetDeletionVectorsAdded|numTargetDeletionVectorsRemoved|numTargetDeletionVectorsUpdated|numTargetFilesAdded|numTargetFilesRemoved|numTargetRowsCopied|numTargetRowsDeleted|numTargetRowsInserted|numTargetRowsMatchedDeleted|numTargetRowsMatchedUpdated|numTargetRowsNotMatchedBySourceDeleted|numTargetRowsNotMatchedBySourceUpdat

26/01/21 20:21:23 INFO DAGScheduler: Registering RDD 271 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768) as input to shuffle 37
26/01/21 20:21:23 INFO DAGScheduler: Got map stage job 74 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768) with 2 output partitions
26/01/21 20:21:23 INFO DAGScheduler: Final stage: ShuffleMapStage 127 ($anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768)
26/01/21 20:21:23 INFO DAGScheduler: Parents of final stage: List()
26/01/21 20:21:23 INFO DAGScheduler: Missing parents: List()
26/01/21 20:21:23 INFO DAGScheduler: Submitting ShuffleMapStage 127 (MapPartitionsRDD[271] at $anonfun$withThreadLocalCaptured$2 at CompletableFuture.java:1768), which has no missing parents
26/01/21 20:21:23 INFO MemoryStore: Block broadcast_98 stored as values in memory (estimated size 36.4 KiB, free 432.3 MiB)
26/01/21 20:21:23 INFO MemoryStore: Block broadcast_98_piece0 stored as bytes in memory (estimated size 13.0 KiB, free 

## Stop session

Shut down the session once the job completes.


In [11]:
log_info("Stopping Spark session.", advance=0)
spark.stop()
log_end("Spark session stopped")


INFO: Stopping Spark session.:  90%|█████████ | 9/10 [00:15<00:02,  2.04s/it, +0.75s, total 15.81s]

26/01/21 20:21:23 INFO MemoryStore: MemoryStore cleared
26/01/21 20:21:23 INFO BlockManager: BlockManager stopped
26/01/21 20:21:23 INFO BlockManagerMaster: BlockManagerMaster stopped
26/01/21 20:21:23 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!


END: Spark session stopped: 100%|██████████| 10/10 [00:16<00:00,  1.66s/it, +1.57s, total 16.63s]
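
When the same steps run as a script rather than cell by cell, the session should also stop if a step raises. A minimal sketch of that pattern, using a fake session object so it runs without Spark, might look like this; `managed_session` and `FakeSession` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Protocol, TypeVar


class Stoppable(Protocol):
    def stop(self) -> None: ...


S = TypeVar("S", bound=Stoppable)


@contextmanager
def managed_session(factory: Callable[[], S]) -> Iterator[S]:
    """Create a session and guarantee stop() is called, even when the body raises."""
    session = factory()
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a Spark session; only records whether stop() was called."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


fake = FakeSession()
try:
    with managed_session(lambda: fake) as session:
        raise RuntimeError("pipeline step failed")
except RuntimeError:
    pass  # the failure still propagates; the session has been stopped by now
```

With a real pipeline, the factory would wrap the `create_session(...)` call from the session cell, so cleanup no longer depends on reaching the final cell.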
