**⭐ 1. What This Pattern Solves**

Incremental processing handles only the rows that are new or changed since the last ETL run, instead of reprocessing the entire dataset.
Use cases include:

Daily ingestion of transactional logs

Updating aggregates for only new sales records

Maintaining slowly changing dimensions with minimal recomputation

Because the work scales with the size of the delta rather than the full table, this reduces runtime, I/O, and cloud cost.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
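-- COALESCE handles the first run, when the target is empty and MAX() is NULL
SELECT s.*
FROM staging_table s
WHERE s.updated_at >
      (SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01 00:00:00') FROM target_table)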
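
To try the query without a Spark cluster, here is a minimal sketch that runs the same watermark filter against an in-memory SQLite database. The sample rows and the `'0001-01-01 00:00:00'` floor value are assumptions for illustration; timestamps are stored as ISO-8601 strings, which compare correctly as text. The `COALESCE` floor lets the first run, with an empty target, return every staging row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_table (id INTEGER, updated_at TEXT);
    CREATE TABLE target_table  (id INTEGER, updated_at TEXT);
    INSERT INTO staging_table VALUES
        (1, '2024-01-01 00:00:00'),
        (2, '2024-01-02 00:00:00'),
        (3, '2024-01-03 00:00:00');
""")

# Same shape as the query above, with a floor for the empty-target case
INCREMENTAL_QUERY = """
    SELECT s.id
    FROM staging_table s
    WHERE s.updated_at > (
        SELECT COALESCE(MAX(updated_at), '0001-01-01 00:00:00') FROM target_table
    )
    ORDER BY s.id
"""

def new_ids(conn):
    """Return the ids of staging rows newer than the target's max updated_at."""
    return [row[0] for row in conn.execute(INCREMENTAL_QUERY)]

first_run = new_ids(conn)  # target is empty, so every staging row qualifies

# Simulate a previous load that already copied ids 1 and 2 into the target
conn.execute("INSERT INTO target_table SELECT * FROM staging_table WHERE id <= 2")
second_run = new_ids(conn)  # only the row newer than the target's max remains
```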

**⭐ 3. Core Idea**

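Track a high-watermark (e.g., max timestamp or max ID) from the last successful batch

Filter the source to rows whose watermark column is strictly greater than that value

Process the filtered rows and append them to the target (use a merge/upsert instead if changed rows must replace earlier versions)

Update the watermark only after the write succeeds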
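
The four steps above can be sketched in plain Python before moving to Spark. This is an illustration only: `run_incremental`, the `state` dict, and the sample rows are hypothetical names, and in-memory lists stand in for the source and target tables.

```python
from datetime import datetime

def run_incremental(source_rows, target_rows, state):
    """One batch: filter by watermark, append, then advance the watermark."""
    watermark = state.get("watermark")  # None on the very first run
    batch = [r for r in source_rows
             if watermark is None or r["updated_at"] > watermark]
    if not batch:
        return 0
    target_rows.extend(batch)  # "process and append"
    # Advance the watermark only after the append has succeeded
    state["watermark"] = max(r["updated_at"] for r in batch)
    return len(batch)

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 2)},
]
target, state = [], {}

first = run_incremental(source, target, state)   # loads both rows
source.append({"id": 3, "updated_at": datetime(2024, 1, 3)})
second = run_incremental(source, target, state)  # loads only id 3
third = run_incremental(source, target, state)   # nothing new to load
```

Keeping the watermark update as the last statement means a failure during the append leaves the old watermark in place, so the next run retries the same rows.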

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
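from pyspark.sql.functions import col, max as Fmax

# Read last watermark from target (None if the target has no rows yet)
last_watermark = spark.read.parquet("/path/to/target").agg(Fmax("updated_at")).collect()[0][0]

# Read incremental data; on the first run (no watermark) take the whole source
source_df = spark.read.parquet("/path/to/source")
incremental_df = source_df if last_watermark is None else source_df.filter(col("updated_at") > last_watermark)

# Process & write
incremental_df.write.mode("append").parquet("/path/to/target")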


**⭐ 5. Detailed Example**

In [0]:
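from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as Fmax, sum as Fsum

spark = SparkSession.builder.getOrCreate()

# Target table: its max updated_at is the watermark (None if the table is empty)
target_df = spark.read.parquet("/tmp/target_sales")
last_watermark = target_df.agg(Fmax("updated_at")).collect()[0][0]

# Source data: keep only rows newer than the watermark
source_df = spark.read.parquet("/tmp/staging_sales")
incremental_df = source_df if last_watermark is None else source_df.filter(col("updated_at") > last_watermark)

# Process (e.g., aggregation). Carry max(updated_at) through so the target
# keeps the column that the next run reads its watermark from.
agg_df = incremental_df.groupBy("product_id").agg(
    Fsum("sales_amount").alias("total_sales"),
    Fmax("updated_at").alias("updated_at"),
)

# Append this batch's per-product totals to the target
agg_df.write.mode("append").parquet("/tmp/target_sales")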

**Step-by-step:**

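Get the max updated_at from the target; this is the watermark

Filter the source to rows newer than the watermark

Apply transformations to the incremental rows only

Append the results to the target

The watermark advances on the next run, when max updated_at is read from the target again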

**⭐ 6. Mini Practice Problems**

Process only new customer orders since last run.

Incrementally update product inventory counts.

Compute only the latest user events per day for analytics.

**⭐ 7. Full Data Engineering Problem**

Scenario: A retail platform ingests 100M+ transaction logs daily. Reprocessing all data is expensive.

Solution Approach:

Maintain updated_at or ingest_time watermark

Filter staging/source tables by watermark

Aggregate and join only incremental rows

Append results to Silver/Gold tables

Update watermark in a metadata table for next run

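Performance tip: Partition the source by date and include the partition column in the filter, so Spark can prune partitions instead of scanning the full source to apply the watermark.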
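
The metadata-table step can be sketched with SQLite standing in for the real metadata store. The table names `etl_watermarks` and `silver_transactions`, the pipeline name `daily_txn`, and the `fail` flag are hypothetical, used only to simulate a crash. Here the data write and the watermark update share one transaction; a Parquet write in Spark is not covered by a database transaction, so in that setting the watermark should be written only after the data write has completed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_watermarks (pipeline TEXT PRIMARY KEY, watermark TEXT);
    CREATE TABLE silver_transactions (txn_id INTEGER PRIMARY KEY, ingest_time TEXT);
""")

def get_watermark(conn, pipeline):
    """Return the stored watermark for a pipeline, or None before its first run."""
    row = conn.execute(
        "SELECT watermark FROM etl_watermarks WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else None

def load_batch(conn, pipeline, rows, fail=False):
    """Append new rows and advance the watermark in a single transaction."""
    wm = get_watermark(conn, pipeline)
    batch = [r for r in rows if wm is None or r[1] > wm]
    if not batch:
        return 0
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO silver_transactions VALUES (?, ?)", batch)
            if fail:
                raise RuntimeError("simulated failure before the watermark update")
            conn.execute(
                "INSERT OR REPLACE INTO etl_watermarks VALUES (?, ?)",
                (pipeline, max(r[1] for r in batch)),
            )
    except RuntimeError:
        return 0  # nothing was committed, so the next run retries this batch
    return len(batch)

rows = [(1, "2024-01-01T00:00:00"), (2, "2024-01-01T06:00:00")]
failed = load_batch(conn, "daily_txn", rows, fail=True)  # rolled back
after_fail = get_watermark(conn, "daily_txn")
loaded = load_batch(conn, "daily_txn", rows)             # retry succeeds
after_ok = get_watermark(conn, "daily_txn")
row_count = conn.execute("SELECT COUNT(*) FROM silver_transactions").fetchone()[0]
```

Because the failed attempt is rolled back as a whole, the retry loads each row exactly once instead of duplicating the rows written before the crash.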

**⭐ 8. Time & Space Complexity**

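Time: O(N_inc), where N_inc is the number of new or changed rows, provided the source scan can be pruned (partitioning or file statistics); an unpruned filter still reads all N source rows

Space: O(N_inc) for intermediate transformations

Work scales linearly with the incremental data, not the full dataset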
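
As a rough worked example, assume a constant daily volume of n rows and a source that is pruned by date. Both are illustrative assumptions; the scenario in section 7 only states 100M+ rows per day. After d days, a full reprocess and an incremental run compare as:

```latex
N_{\text{full}}(d) = d \cdot n, \qquad
N_{\text{inc}} = n, \qquad
\frac{N_{\text{full}}(d)}{N_{\text{inc}}} = d
```

With n = 10^8 and d = 365, a full reprocess touches 3.65 × 10^10 rows against 10^8 for the incremental run, so the gap grows with the age of the dataset.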

**⭐ 9. Common Pitfalls**

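Not updating the watermark → the same data is reprocessed on every run

Filtering on the wrong column (e.g., created_at instead of updated_at) → updates are missed or rows are duplicated

Large partition scans → the filter is slow if the source is not partitioned on the watermark column

Watermark based on unreliable timestamps (clock skew, late-arriving rows) → data gaps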
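
One common mitigation for the last pitfall, not taken from the text above, is to re-read a small lookback window before the watermark and upsert by key so re-read rows do not duplicate. The sketch below is a plain-Python illustration under that assumption; `incremental_with_lookback`, the two-hour window, and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta

def incremental_with_lookback(source_rows, target_by_id, watermark, lookback):
    """Re-read a window before the watermark and upsert by key to avoid duplicates."""
    cutoff = watermark - lookback
    batch = [r for r in source_rows if r["updated_at"] > cutoff]
    for r in batch:
        target_by_id[r["id"]] = r  # upsert: re-read rows overwrite themselves
    # The watermark never moves backwards, even if the batch holds only old rows
    return max([watermark] + [r["updated_at"] for r in batch])

watermark = datetime(2024, 1, 2, 12, 0)
row1 = {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)}
row3 = {"id": 3, "updated_at": datetime(2024, 1, 2, 12, 0)}
# Row 2 carries an 11:00 timestamp but reached the source after the 12:00 run
late_row = {"id": 2, "updated_at": datetime(2024, 1, 2, 11, 0)}

source = [row1, late_row, row3]
target_by_id = {1: row1, 3: row3}  # state left behind by the previous run

# A strict watermark filter never sees the late row
strict_batch = [r for r in source if r["updated_at"] > watermark]

new_watermark = incremental_with_lookback(
    source, target_by_id, watermark, lookback=timedelta(hours=2)
)
loaded_ids = sorted(target_by_id)
```

The lookback window trades some repeated reads for fewer gaps, and it only helps with rows that arrive less than one window late.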