# Data Processing Notebook Template

Use this scaffold to build repeatable PySpark pipelines with logging, validations, and optional Delta rollbacks. Replace placeholders with your sources, business logic, and checks.


In [1]:
from spark_fuse.spark import create_session
from spark_fuse.utils.logging import enable_spark_logging, console
from spark_fuse.utils.dataframe import ensure_columns, preview
from pyspark.sql import functions as F, types as T

log = console()
# Set any reusable parameters here
job_date = "2024-01-01"
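
A hard-coded `job_date` is fine for a first run, but a scheduled job usually receives it from outside. The following is a minimal sketch, assuming a hypothetical `JOB_DATE` environment variable (not part of this template or of spark_fuse): read it, fall back to a default, and fail fast on a malformed date.

```python
import os
from datetime import date


def resolve_job_date(default: str = "2024-01-01") -> str:
    """Return the job date as an ISO string, validating the format.

    JOB_DATE is a hypothetical environment variable used for illustration.
    """
    raw = os.environ.get("JOB_DATE", default)
    # date.fromisoformat raises ValueError on malformed or impossible dates
    return date.fromisoformat(raw).isoformat()


job_date = resolve_job_date()
```

Validating here means a typo in the date surfaces before any Spark work starts, rather than as an empty result after the filter step.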


> Why `functions as F` and `types as T`? Aliasing keeps chained expressions concise, matches common Spark style, and avoids a star import, which would add hundreds of names to the global namespace and shadow Python built-ins such as `sum`, `min`, and `max`.


## Create a session

Adjust `app_name`, `master`, and configs for your environment.


In [2]:
log.log("[INFO] Starting Spark session...")
spark = create_session(
    app_name="data-processing-template",
    master="local[*]",
    # extra_configs={"spark.some.credential": "value"},
)
log.log("[INFO] Spark session ready.")
spark



## Start logging

Raise Spark log verbosity while you iterate so shuffle and scheduler details show up in the driver logs.


In [3]:
enable_spark_logging(spark, level="INFO")
log.log("[INFO] Spark logging enabled at INFO.")
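
Once a pipeline is stable, INFO-level output tends to drown out your own messages. Below is a minimal sketch for dialing verbosity back down with PySpark's `SparkContext.setLogLevel`. Whether `enable_spark_logging` accepts other levels is not shown in this template, so the helper name `set_driver_log_level` and its validation are assumptions made for illustration.

```python
# Log levels accepted by SparkContext.setLogLevel
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def set_driver_log_level(spark, level: str) -> str:
    """Validate `level` and apply it to the active Spark context."""
    normalized = level.upper()
    if normalized not in VALID_LEVELS:
        raise ValueError(f"Unsupported log level: {level!r}")
    spark.sparkContext.setLogLevel(normalized)
    return normalized


# Example usage once iteration is done:
# set_driver_log_level(spark, "WARN")
```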


## Load relevant data

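Declare input locations and load DataFrames. The cell below builds small in-memory samples with explicit schemas so the template runs anywhere; swap in your own sources, formats, and reader options.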


In [4]:
log.log("[INFO] Loading input data (dummy samples; replace with real sources)...")

orders_schema = T.StructType(
    [
        T.StructField("order_id", T.StringType(), False),
        T.StructField("order_ts", T.StringType(), False),
        T.StructField("order_total", T.DoubleType(), False),
        T.StructField("customer_id", T.StringType(), False),
    ]
)
orders_data = [
    ("O-1001", "2024-01-05", 42.50, "C001"),
    ("O-1002", "2024-01-06", 18.00, "C002"),
    ("O-1003", "2024-01-06", 120.75, "C003"),
]
orders_df = spark.createDataFrame(orders_data, schema=orders_schema)

customers_schema = T.StructType(
    [
        T.StructField("customer_id", T.StringType(), False),
        T.StructField("segment", T.StringType(), True),
        T.StructField("country", T.StringType(), True),
    ]
)
customers_data = [
    ("C001", "retail", "US"),
    ("C002", "enterprise", "CA"),
    ("C003", "retail", "UK"),
]
customers_df = spark.createDataFrame(customers_data, schema=customers_schema)

log.log("[INFO] Input data loaded.")
log.log(f"[INFO] Orders sample:\n{preview(orders_df)}")
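
To swap the dummy rows for real inputs, a reader call along these lines may be enough. This is a sketch under assumptions: `read_source` is an illustrative helper (not a spark_fuse function), the paths in the usage comments are placeholders, and the schema can be a `StructType` such as `orders_schema` or a DDL string. It relies only on the standard `DataFrameReader` methods `format`, `options`, `schema`, and `load`.

```python
from typing import Any, Optional


def read_source(spark, path: str, fmt: str = "parquet",
                schema: Optional[Any] = None, **options: Any):
    """Load a DataFrame from `path` via spark.read (illustrative helper)."""
    reader = spark.read.format(fmt).options(**options)
    if schema is not None:
        # Passing an explicit schema avoids schema inference
        reader = reader.schema(schema)
    return reader.load(path)


# Example usage (paths are placeholders):
# orders_df = read_source(spark, "s3://bucket/bronze/orders", schema=orders_schema)
# customers_df = read_source(spark, "/data/customers.csv", fmt="csv", header="true")
```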



## Process data

Apply your business logic: filtering, casting, enrichment, and derived columns.


In [5]:
log.log("[INFO] Curating datasets...")
curated_orders_df = (
    orders_df
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("order_month", F.date_format("order_date", "yyyy-MM"))
    .filter(F.col("order_date") >= F.to_date(F.lit(job_date)))  # compare date to date rather than date to string
)

curated_customers_df = customers_df.select("customer_id", "segment", "country")
log.log("[INFO] Curated orders and customers dataframes ready.")


## Do joins

Join the curated datasets and pick the join type that fits your domain (for example `inner`, `left`, or `left_anti`).


In [6]:
log.log("[INFO] Joining curated datasets...")
joined_df = (
    curated_orders_df.alias("o")
    .join(curated_customers_df.alias("c"), on="customer_id", how="left")
)

log.log(f"[INFO] Join complete. Sample:\n{preview(joined_df)}")
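
With `how="left"` and `on="customer_id"`, the key column in the result comes from the orders side, so orders without a matching customer survive the join with null `segment` and `country`. One way to surface them is a left anti join, sketched below. The helper name `orders_without_customer` is illustrative and not part of spark_fuse.

```python
def orders_without_customer(orders_df, customers_df, key: str = "customer_id"):
    """Return the orders whose key has no match in customers (left anti join)."""
    # Only the key is needed from the right side of an anti join
    return orders_df.join(customers_df.select(key), on=key, how="left_anti")


# Example usage with the dataframes above:
# orphans_df = orders_without_customer(curated_orders_df, curated_customers_df)
# assert orphans_df.count() == 0, "every order should reference a known customer"
```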



## Do data tests

Add lightweight checks so issues surface early during development.


In [7]:
log.log("[INFO] Running in-memory data tests...")
# Schema/column guardrails
ensure_columns(joined_df, ["order_id", "customer_id", "order_date"])

# Null/uniqueness/data quality checks (expand as needed)
assert joined_df.filter(F.col("order_id").isNull()).count() == 0, "order_id should be populated"
# customer_id is the join key taken from orders, so this check does not detect orders without a matching customer
assert joined_df.filter(F.col("customer_id").isNull()).count() == 0, "customer_id should be populated"
assert joined_df.dropDuplicates(["order_id"]).count() == joined_df.count(), "order_id should be unique"

# Domain-specific rule example; swap the column and condition for your metric
negative_totals = joined_df.filter(F.col("order_total") < 0).count()
assert negative_totals == 0, f"Found {negative_totals} negative order totals"
log.log("[INFO] In-memory data tests passed.")
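
Bare `assert` statements stop at the first failure, which hides any later problems until the next run. The sketch below collects violation counts and reports them together. `run_checks` and the check names are illustrative assumptions; each check is a zero-argument callable that returns the number of offending rows.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], int]]) -> None:
    """Run every check, then raise once with all failures listed."""
    failures = {}
    for name, count_violations in checks.items():
        violations = count_violations()
        if violations > 0:
            failures[name] = violations
    if failures:
        details = ", ".join(f"{name}: {n} rows" for name, n in failures.items())
        raise AssertionError(f"Data tests failed ({details})")


# Example usage with the dataframe above:
# run_checks({
#     "order_id is null": lambda: joined_df.filter(F.col("order_id").isNull()).count(),
#     "order_total is negative": lambda: joined_df.filter(F.col("order_total") < 0).count(),
# })
```

Each check still triggers its own Spark job, so this changes the reporting, not the cost.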



## Write data & post-write tests

Persist the curated results, run post-write validations, and attempt a rollback when Delta Lake is available and an earlier table version exists.


In [8]:
output_path = "/tmp/spark_fuse/orders_enriched_dummy"  # dummy local path; replace with real target (e.g., s3://bucket/silver/orders)
log.log(f"[INFO] Preparing to write dataset to {output_path}")

delta_supported = False
pre_write_version = None
try:
    from delta.tables import DeltaTable
    delta_supported = True
    try:
        existing_table = DeltaTable.forPath(spark, output_path)
        pre_write_version = existing_table.history(1).select("version").head()[0]
        log.log(f"[INFO] Captured pre-write Delta version {pre_write_version} for rollback.")
    except Exception:
        log.log("[INFO] No existing Delta table found at output path; there is no earlier version to roll back to.")
except ImportError:
    log.log("[INFO] Delta Lake Python API not available; rollback is disabled. The write below still needs the Delta data source on the session.")

log.log("[INFO] Writing data to final path...")
(
    joined_df.write
    .mode("overwrite")
    .format("delta")
    .partitionBy("order_month")
    .save(output_path)
)

log.log("[INFO] Running post-write validations on persisted data...")
persisted_df = spark.read.format("delta").load(output_path)

try:
    ensure_columns(persisted_df, ["order_id", "customer_id", "order_date", "order_month"])
    assert persisted_df.count() > 0, "Persisted dataset is empty"
    assert persisted_df.filter(F.col("order_id").isNull()).count() == 0, "order_id should be populated"
    assert persisted_df.dropDuplicates(["order_id"]).count() == persisted_df.count(), "order_id should be unique"
    invalid_persisted_states = persisted_df.filter(F.col("order_total") < 0).count()
    assert invalid_persisted_states == 0, f"Found {invalid_persisted_states} negative order totals after write"
    log.log("[INFO] Post-write validations passed.")
except Exception as exc:
    log.log(f"[ERROR] Post-write validation failed: {exc}")
    if delta_supported and pre_write_version is not None:
        try:
            log.log(f"[INFO] Attempting Delta rollback to version {pre_write_version} ...")
            spark.sql(f"RESTORE TABLE delta.`{output_path}` TO VERSION AS OF {pre_write_version}")
            log.log("[INFO] Rollback succeeded.")
        except Exception as rollback_exc:
            log.log(f"[ERROR] Rollback attempt failed: {rollback_exc}")
    else:
        log.log("[INFO] No rollback available; inspect persisted data manually.")
    raise
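
When no earlier Delta version exists, the cell above leaves the failed output in place for manual inspection. If deleting it is acceptable for your target, a best-effort cleanup could look like the sketch below. It assumes a local filesystem path such as the dummy `/tmp/...` target; object stores (`s3://`, `abfss://`) need their own client, and `best_effort_delete` is an illustrative name rather than a spark_fuse helper.

```python
import shutil
from pathlib import Path


def best_effort_delete(output_path: str) -> bool:
    """Remove a local output directory; return True only if it was deleted."""
    if "://" in output_path:
        # Remote URIs are out of scope for this local-only sketch
        return False
    target = Path(output_path)
    if not target.exists():
        return False
    shutil.rmtree(target, ignore_errors=True)
    return not target.exists()


# Example usage inside the `else` branch of the rollback logic:
# if best_effort_delete(output_path):
#     log.log("[INFO] Removed failed output.")
```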



## Stop session

Shut down the session once the job completes.


In [9]:
log.log("[INFO] Stopping Spark session.")
spark.stop()
log.log("[INFO] Spark session stopped.")
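
In a notebook, this cell never runs if an earlier cell raises, so the session stays alive until the kernel stops. When the template moves into a script, a `try`/`finally` wrapper guarantees shutdown. The sketch below uses an illustrative helper name, `run_with_session`, and assumes `job` is any callable that takes the session.

```python
def run_with_session(spark, job):
    """Call job(spark) and stop the session even if the job raises."""
    try:
        return job(spark)
    finally:
        spark.stop()


# Example usage, where `main` is a hypothetical function wrapping the steps above:
# run_with_session(spark, main)
```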


