# Data Processing Notebook Template

Use this scaffold to build repeatable PySpark pipelines with logging, validations, and optional Delta rollbacks. Replace placeholders with your sources, business logic, and checks.


## Notebook guidelines

- Name notebooks in clear snake_case (e.g., `orders_enriched.ipynb` or `domain_orders_enriched.ipynb`); keep one primary table per notebook.
- Make runs idempotent: deterministic transforms, safe overwrites/merges, and repeatable partition logic so reruns do not create duplicates.
- Keep naming and scope consistent: functions in snake_case, classes in PascalCase, constants in UPPER_SNAKE_CASE, modules/files in snake_case; avoid implicit globals beyond parameters.
- Encapsulate helpers in small functions inside the notebook when reusable; prefer pure functions and pass Spark/DataFrames explicitly.
- Document inputs and outputs near the top, and make each notebook responsible for exactly one curated table or view.
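
The idempotency and pure-function guidelines above can be illustrated without Spark. The following is a minimal sketch in plain Python; `upsert_by_key` is a hypothetical helper (not part of spark_fuse) that merges a batch by business key and returns a new mapping, so replaying the same batch leaves the result unchanged.

```python
from typing import Dict, Iterable, Mapping


def upsert_by_key(
    target: Mapping[str, dict], incoming: Iterable[dict], key: str
) -> Dict[str, dict]:
    """Return a new mapping with the incoming rows merged in by business key.

    Pure function: the inputs are not mutated, and replaying the same batch
    yields the same result, which is the property an idempotent run needs.
    """
    merged = dict(target)
    for row in incoming:
        merged[row[key]] = row  # last write wins for a repeated key
    return merged


batch = [
    {"order_id": "O-1001", "order_total": 42.50},
    {"order_id": "O-1002", "order_total": 18.00},
]
first_run = upsert_by_key({}, batch, key="order_id")
rerun = upsert_by_key(first_run, batch, key="order_id")
print(len(first_run), len(rerun))  # same row count after the rerun
```

An append-only write would double the rows on the second run; keying the write on `order_id` is what keeps the rerun safe.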


In [1]:
from spark_fuse.spark import create_session
from spark_fuse.utils import change_tracking  # noqa: F401 (enables .change_tracking accessors)
from spark_fuse.utils.logging import (
    console,
    create_progress_tracker,
    enable_spark_logging,
    log_end as log_end_step,
    log_info as log_info_step,
)
from spark_fuse.utils.dataframe import ensure_columns, preview
from pyspark.sql import functions as F, types as T
import datetime as _dt

progress_tracker = create_progress_tracker(total_steps=10)
log = console()

def log_info(label: str) -> None:
    log_info_step(progress_tracker, log, label)


def log_end(label: str) -> None:
    log_end_step(progress_tracker, log, label)

# Set any reusable parameters here
job_ts = _dt.datetime.now(_dt.timezone.utc).replace(microsecond=0).isoformat()  # UTC timestamp; override if needed




> Why `functions as F` and `types as T`? Aliasing keeps chained expressions concise, matches common Spark style, and avoids polluting the global namespace with hundreds of column functions and type classes.
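
Reruns and backfills usually need to pin `job_ts` rather than take the current time. The sketch below shows one way to do that, assuming a hypothetical `resolve_job_ts` helper and a hypothetical `JOB_TS` environment variable; neither is part of spark_fuse. Naive overrides are treated as UTC, which is an assumption you may want to change.

```python
import datetime as dt
import os
from typing import Optional


def resolve_job_ts(override: Optional[str] = None) -> str:
    """Return an ISO-8601 UTC timestamp, honoring an explicit override."""
    raw = override or os.environ.get("JOB_TS")  # JOB_TS is a hypothetical variable name
    if raw:
        parsed = dt.datetime.fromisoformat(raw)
        if parsed.tzinfo is None:
            # Assumption: overrides without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=dt.timezone.utc)
        parsed = parsed.astimezone(dt.timezone.utc)
    else:
        parsed = dt.datetime.now(dt.timezone.utc)
    return parsed.replace(microsecond=0).isoformat()


job_ts = resolve_job_ts("2024-01-06T00:00:00")
print(job_ts)  # 2024-01-06T00:00:00+00:00
```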


## Create a session

Adjust `app_name`, `master`, and configs for your environment.


In [2]:
log.log("[INFO] Starting Spark session...")
spark = create_session(
    app_name="data-processing-template",
    master="local[*]",
    extra_configs={"spark.some.credential": "value"},  # placeholder key/value; replace with real Spark configs
)
log.log("[INFO] Spark session ready.")
log_info("Spark session ready")
spark


:: loading settings :: url = jar:file:/Users/kevin/Github/spark-fuse/.venv/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/kevin/.ivy2.5.2/cache
The jars for the packages stored in: /Users/kevin/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6d8d5dc2-5bae-41d5-8f91-8fa8ffca8a51;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.1 in central
	found io.delta#delta-storage;4.0.1 in central
	found org.antlr#antlr4-runtime;4.13.1 in central
:: resolution report :: resolve 89ms :: artifacts dl 14ms
	:: modules in use:
	io.delta#delta-spark_2.13;4.0.1 from central in [default]
	io.delta#delta-storage;4.0.1 from central in [default]
	org.antlr#antlr4-runtime;4.13.1 from central in [default]

Spark session ready:   0%|          | 0/10 [00:00<?, ?it/s, +3.37s, total 3.37s]

## Start logging

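Set the Spark log level for the session. `WARN` keeps driver output quiet for routine runs; switch `level` to `INFO` or `DEBUG` while you iterate so shuffle and scheduler details show up in the driver logs.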


In [3]:
enable_spark_logging(spark, level="WARN")
log.log("[INFO] Spark logging enabled at WARN.")
log_info("Logging configured")


Logging configured:  20%|██        | 2/10 [00:00<00:00,  8.37it/s, +0.27s, total 3.64s]

## Load relevant data

Declare input locations and load dataframes. Swap formats and options for your sources.
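
One way to keep input declarations in a single place is a small spec object per source. The sketch below is illustrative only: `SourceSpec`, `load_source`, the `SOURCES` registry, and the example path are hypothetical names, not part of spark_fuse. The reader calls used (`format`, `options`, `schema`, `load`) are the standard `DataFrameReader` methods.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SourceSpec:
    """Declarative description of one input (all names here are illustrative)."""

    name: str
    path: str
    fmt: str = "parquet"
    options: Dict[str, str] = field(default_factory=dict)
    schema: Optional[Any] = None  # a pyspark StructType or a DDL string


def load_source(spark: Any, spec: SourceSpec) -> Any:
    """Build a DataFrameReader from the spec and load the data."""
    reader = spark.read.format(spec.fmt).options(**spec.options)
    if spec.schema is not None:
        reader = reader.schema(spec.schema)
    return reader.load(spec.path)


SOURCES = {
    "orders": SourceSpec(
        name="orders",
        path="/tmp/example/orders.csv",  # placeholder location
        fmt="csv",
        options={"header": "true"},
        schema="order_id STRING, order_ts STRING, order_total DOUBLE, customer_id STRING",
    ),
}

# In the notebook: orders_df = load_source(spark, SOURCES["orders"])
```

Passing `spark` explicitly keeps the loader free of globals, and swapping a format or option becomes a change to the spec rather than to the loading code.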


In [4]:
log.log("[INFO] Loading input data (dummy samples; replace with real sources)...")

orders_schema = T.StructType(
    [
        T.StructField("order_id", T.StringType(), False),
        T.StructField("order_ts", T.StringType(), False),
        T.StructField("order_total", T.DoubleType(), False),
        T.StructField("customer_id", T.StringType(), False),
    ]
)
orders_data = [
    ("O-1001", "2024-01-05", 42.50, "C001"),
    ("O-1002", "2024-01-06", 18.00, "C002"),
    ("O-1003", "2024-01-06", 120.75, "C003"),
]
orders_df = spark.createDataFrame(orders_data, schema=orders_schema)

customers_schema = T.StructType(
    [
        T.StructField("customer_id", T.StringType(), False),
        T.StructField("segment", T.StringType(), True),
        T.StructField("country", T.StringType(), True),
    ]
)
customers_data = [
    ("C001", "retail", "US"),
    ("C002", "enterprise", "CA"),
    ("C003", "retail", "UK"),
]
customers_df = spark.createDataFrame(customers_data, schema=customers_schema)

log.log("[INFO] Input data loaded.")
log_info("Input data loaded")
log.log(f"[INFO] Orders sample:\n{preview(orders_df)}")


Input data loaded:  30%|███       | 3/10 [00:00<00:02,  2.61it/s, +0.75s, total 4.39s] 

                                                                                

## Process data

Apply your business logic: filtering, casting, enrichment, and derived columns.


In [5]:
log.log("[INFO] Curating datasets...")
curated_orders_df = (
    orders_df
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
    .withColumn("processing_ts", F.lit(job_ts))
)

curated_customers_df = customers_df.select("customer_id", "segment", "country")
log.log("[INFO] Curated orders and customers dataframes ready.")
log_info("Curated datasets ready")


Curated datasets ready:  40%|████      | 4/10 [00:02<00:05,  1.16it/s, +1.67s, total 6.06s]

## Do joins

Join curated datasets and pick the right join strategy for your domain (inner/left/anti).
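
The difference between the three strategies is easiest to see on a handful of rows. The sketch below emulates the semantics in plain Python so it runs without a cluster; it is not Spark code, and order `O-1004` with customer `C999` is a made-up row added to show an unmatched key. In PySpark the corresponding `how` values are `"inner"`, `"left"`, and `"left_anti"`.

```python
from typing import Dict, List

orders = [
    {"order_id": "O-1001", "customer_id": "C001"},
    {"order_id": "O-1002", "customer_id": "C002"},
    {"order_id": "O-1004", "customer_id": "C999"},  # hypothetical order with no customer record
]
customers = {"C001": {"segment": "retail"}, "C002": {"segment": "enterprise"}}


def inner_join(orders: List[dict], customers: Dict[str, dict]) -> List[dict]:
    """Keep only orders that have a matching customer."""
    return [
        {**order, **customers[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in customers
    ]


def left_join(orders: List[dict], customers: Dict[str, dict]) -> List[dict]:
    """Keep every order; customer attributes are None when there is no match."""
    joined = []
    for order in orders:
        customer = customers.get(order["customer_id"])
        joined.append({**order, "segment": customer["segment"] if customer else None})
    return joined


def anti_join(orders: List[dict], customers: Dict[str, dict]) -> List[dict]:
    """Keep only orders whose customer_id has no match."""
    return [order for order in orders if order["customer_id"] not in customers]


print(len(inner_join(orders, customers)))  # 2
print(len(left_join(orders, customers)))  # 3
print([o["order_id"] for o in anti_join(orders, customers)])  # ['O-1004']
```

A left join preserves the order count, which suits enrichment; the anti join is a quick way to list the keys that enrichment failed to match.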


In [6]:
log.log("[INFO] Joining curated datasets...")
joined_df = (
    curated_orders_df.alias("o")
    .join(curated_customers_df.alias("c"), on="customer_id", how="left")
)

log.log(f"[INFO] Join complete. Sample:\n{preview(joined_df)}")
log_info("Join complete")


Join complete:  50%|█████     | 5/10 [00:03<00:04,  1.18it/s, +0.82s, total 6.88s]         

## Do data tests

Add lightweight checks so issues surface early during development.
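
A chain of bare `assert` statements stops at the first failure, and assertions are skipped entirely when Python runs with `-O`. A possible alternative, sketched here with a hypothetical `run_checks` helper, runs every check and reports all failures at once. The sample rows are plain dicts so the block runs without Spark; in the notebook each lambda would wrap a DataFrame expression such as a filtered `count()`.

```python
from typing import Callable, Dict, List


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception as exc:
            # A check that raises counts as a failure instead of aborting the run.
            passed = False
            name = f"{name} (raised {type(exc).__name__})"
        if not passed:
            failures.append(name)
    return failures


rows = [
    {"order_id": "O-1001", "order_total": 42.5},
    {"order_id": "O-1002", "order_total": -1.0},  # deliberately invalid sample row
]
failures = run_checks(
    {
        "order_id populated": lambda: all(r["order_id"] is not None for r in rows),
        "order_id unique": lambda: len({r["order_id"] for r in rows}) == len(rows),
        "order_total non-negative": lambda: all(r["order_total"] >= 0 for r in rows),
    }
)
print(failures)  # ['order_total non-negative']
```

Raising a `ValueError` when `failures` is non-empty gives the same fail-fast behavior for the pipeline while still listing every broken rule.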


In [7]:
log.log("[INFO] Running in-memory data tests...")
# Schema/column guardrails
ensure_columns(joined_df, ["order_id", "customer_id", "order_date"])

# Null/uniqueness/data quality checks (expand as needed)
assert joined_df.filter(F.col("order_id").isNull()).count() == 0, "order_id should be populated"
assert joined_df.filter(F.col("customer_id").isNull()).count() == 0, "customer_id should be populated"
assert joined_df.dropDuplicates(["order_id"]).count() == joined_df.count(), "order_id should be unique"

# Domain-specific rule example; swap column names for your metric
invalid_states = joined_df.filter(F.col("order_total") < 0).count()
assert invalid_states == 0, f"Found {invalid_states} negative order totals"
log.log("[INFO] In-memory data tests passed.")
log_info("In-memory data tests passed")


In-memory data tests passed:  60%|██████    | 6/10 [00:05<00:04,  1.12s/it, +1.66s, total 8.54s]

## Write data & post-write tests

Persist curated results, run post-write validations, and attempt rollback when Delta Lake is available.
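
The control flow is: record the table version before writing, write, validate, and restore the recorded version if validation fails. The sketch below isolates that flow behind callables so it can be exercised without Delta Lake; `write_with_rollback` and the in-memory `versions` list are hypothetical stand-ins. With Delta, the version lookup could come from `DeltaTable.history(1)` and the restore step from a `RESTORE TABLE ... TO VERSION AS OF` statement.

```python
from typing import Callable, Optional


def write_with_rollback(
    current_version: Callable[[], Optional[int]],
    write: Callable[[], None],
    validate: Callable[[], None],
    restore: Callable[[int], None],
) -> None:
    """Write, validate, and restore the pre-write version if validation fails."""
    pre_write_version = current_version()  # None means no table existed yet
    write()
    try:
        validate()
    except Exception:
        if pre_write_version is not None:
            restore(pre_write_version)
        raise  # still surface the validation failure to the caller


# In-memory stand-in for a versioned table: one list entry per version.
versions = [["O-1001"]]


def failing_validation() -> None:
    raise ValueError("Persisted dataset is empty")


try:
    write_with_rollback(
        current_version=lambda: len(versions) - 1,
        write=lambda: versions.append([]),  # a bad write that empties the table
        validate=failing_validation,
        restore=lambda v: versions.append(list(versions[v])),  # restore adds a new version
    )
except ValueError as exc:
    print(f"validation failed: {exc}; latest version now holds {versions[-1]}")
```

The important detail is that the version is captured before the write; if it is never recorded, the rollback branch has nothing to restore to.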


In [8]:
output_path = "/tmp/spark_fuse/orders_enriched_ct"  # dummy local path; replace with real target (e.g., s3://bucket/silver/orders)
target_table = "orders_enriched_ct"  # metastore table name if you want one registered
log.log(f"[INFO] Preparing to write dataset to {output_path}")

delta_supported = False
pre_write_version = None
output_format = "delta"
table_exists = False
try:
    from delta.tables import DeltaTable
    delta_supported = True
    try:
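        existing_table = DeltaTable.forPath(spark, output_path)
        # Record the current version so a failed post-write validation can be rolled back.
        pre_write_version = existing_table.history(1).select("version").first()["version"]
        table_exists = True
        log.log(f"[INFO] Existing Delta table found at version {pre_write_version}; skipping DDL.")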
    except Exception:
        log.log("[INFO] No existing Delta table found at output path; creating with change-tracking columns.")
        (
            DeltaTable.createIfNotExists(spark)
            .tableName(target_table)
            .location(output_path)
            .addColumn("order_id", T.StringType())
            .addColumn("order_ts", T.StringType())
            .addColumn("order_total", T.DoubleType())
            .addColumn("customer_id", T.StringType())
            .addColumn("order_date", T.DateType())
            .addColumn("order_month", T.StringType())
            .addColumn("processing_ts", T.StringType())
            .addColumn("segment", T.StringType())
            .addColumn("country", T.StringType())
            .addColumn("effective_start_ts", T.TimestampType())
            .addColumn("effective_end_ts", T.TimestampType())
            .addColumn("is_current", T.BooleanType())
            .addColumn("version", T.LongType())
            .addColumn("row_hash", T.StringType())
            .addColumn("load_ts", T.TimestampType())
            .execute()
        )
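except Exception:
    # Keep the flags consistent: without Delta there is no change tracking and no rollback.
    delta_supported = False
    pre_write_version = None
    output_format = "parquet"
    log.log("[INFO] Delta Lake not available; falling back to parquet for write and disabling rollback.")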

log.log(f"[INFO] Writing data to final path with format={output_format}...")
log_info("Write prerequisites complete")

if delta_supported:
    change_tracking_options = {
        "business_keys": ["order_id"],
        "tracked_columns": [
            "order_id",
            "customer_id",
            "order_date",
            "order_month",
            "processing_ts",
            "segment",
            "country",
            "order_total",
        ],
        "load_ts_expr": "current_timestamp()",
        "create_if_not_exists": not table_exists,
        "allow_schema_evolution": True,
    }
    joined_df.write.change_tracking.options(
        change_tracking_mode="track_history",
        change_tracking_options=change_tracking_options,
    ).table(output_path)
else:
    (
        joined_df.write
        .option("mergeSchema", "true")
        .mode("overwrite")
        .format(output_format)
        .partitionBy("order_month")
        .save(output_path)
    )

log_info("Write complete")

log.log("[INFO] Running post-write validations on persisted data...")
persisted_df = spark.read.format(output_format).load(output_path)
current_df = persisted_df.filter(F.col("is_current") == F.lit(True)) if delta_supported else persisted_df

try:
    ensure_columns(current_df, ["order_id", "customer_id", "order_date", "order_month"])
    assert current_df.count() > 0, "Persisted dataset is empty"
    assert current_df.filter(F.col("order_id").isNull()).count() == 0, "order_id should be populated"
    assert current_df.dropDuplicates(["order_id"]).count() == current_df.count(), "order_id should be unique"
    invalid_persisted_states = current_df.filter(F.col("order_total") < 0).count()
    assert invalid_persisted_states == 0, f"Found {invalid_persisted_states} negative order totals after write"
    log.log("[INFO] Post-write validations passed.")
    log_info("Post-write validations passed")
except Exception as exc:
    log.log(f"[ERROR] Post-write validation failed: {exc}")
    if delta_supported and pre_write_version is not None:
        try:
            log.log(f"[INFO] Attempting Delta rollback to version {pre_write_version} ...")
            spark.sql(f"RESTORE TABLE delta.`{output_path}` TO VERSION AS OF {pre_write_version}")
            log.log("[INFO] Rollback succeeded.")
        except Exception as rollback_exc:
            log.log(f"[ERROR] Rollback attempt failed: {rollback_exc}")
    else:
        log.log("[INFO] No rollback available; inspect persisted data manually.")
    raise


Write prerequisites complete:  70%|███████   | 7/10 [00:05<00:03,  1.02s/it, +0.81s, total 9.35s]

26/01/20 12:53:31 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
26/01/20 12:53:35 WARN MapPartitionsRDD: RDD 94 was locally checkpointed, its lineage has been truncated and cannot be recomputed after unpersisting


Write complete:  80%|████████  | 8/10 [00:14<00:06,  3.47s/it, +8.80s, total 18.15s]             

Post-write validations passed:  90%|█████████ | 9/10 [00:16<00:02,  2.81s/it, +1.33s, total 19.48s]

## Post-write Delta log

Inspect the recent Delta history after the write, then pivot the `MERGE` operation metrics so each version is one row.


In [9]:
from pathlib import Path

# Path.exists() only sees local filesystems; adapt this check for object stores.
log_path = Path(output_path) / '_delta_log'
if delta_supported and log_path.exists():
    try:
        # Import lazily so this cell still runs when Delta Lake is not installed.
        from delta.tables import DeltaTable
        dt = DeltaTable.forPath(spark, output_path)
        history_df = dt.history(10)
        merge_ops = history_df.filter(F.col('operation') == 'MERGE')
        history_df.select('version','timestamp','operation','operationParameters','operationMetrics').show(truncate=False)
        pivoted = (
            merge_ops
            .select('version', F.explode('operationMetrics').alias('metric','value'))
            .where(F.col('value').isNotNull())
            .groupBy('version')
            .pivot('metric')
            .agg(F.first('value'))
            .orderBy('version')
        )
        pivoted.show(truncate=False)
    except Exception as exc:
        log.log(f'[INFO] Delta history not available: {exc}')
else:
    log.log('[INFO] No Delta log found after write; ensure output_path is correct.')



## Stop session

Shut down the session once the job completes.
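
If an earlier cell raises, a final stop cell never runs and the session stays alive. When the notebook is converted to a script or scheduled job, a context manager can guarantee the shutdown. The sketch below assumes a hypothetical `managed_session` helper and uses a fake session object so it runs without Spark; in practice the factory would wrap the `create_session(...)` call from the session cell.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_session(factory: Callable[[], Any]) -> Iterator[Any]:
    """Create a session and always stop it, even if the body raises."""
    session = factory()
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    """Stand-in for a SparkSession; only the stop() method is modeled."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


fake = FakeSession()
try:
    with managed_session(lambda: fake) as session:
        raise RuntimeError("pipeline step failed")
except RuntimeError:
    pass
print(fake.stopped)  # True: the session was stopped despite the error
```

Inside a notebook, where a `with` block cannot span cells, wrapping the final steps in `try`/`finally` gives the same guarantee.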


In [10]:
log.log("[INFO] Stopping Spark session.")
spark.stop()
log.log("[INFO] Spark session stopped.")
log_end("Spark session stopped")


Spark session stopped: 100%|██████████| 10/10 [00:17<00:00,  1.77s/it, +1.60s, total 21.08s]       
