**⭐ 1. What This Pattern Solves**

Change Data Feed (CDF) lets you read only the row-level changes (inserts, updates, deletes) recorded between Delta table versions.

It enables incremental ETL pipelines that do not need to scan full tables.

It reduces compute and I/O by processing only new or changed data.

It is well suited to CDC pipelines, auditing, and near-real-time analytics.

Used for:

Incremental ingestion from Bronze → Silver

Streaming dashboards based on Silver/Gold changes

Auditing updates and deletions

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Enable CDF on table
ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

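-- Query changes between versions 1 and 3 (both inclusive)
SELECT * FROM table_changes('my_table', 1, 3);

-- table_changes(table, startVersion, endVersion) returns inserted, updated, and deleted rows
-- plus the metadata columns _change_type, _commit_version, and _commit_timestamp; endVersion is optional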

**⭐ 3. Core Idea**

Once CDF is enabled, Delta Lake records row-level change events for each subsequent commit.

You can read the changes between any two versions recorded after enablement instead of reading the full table.

CDF = Time Travel + diff extraction

Reusability: use it for incremental ETL or for syncing downstream systems efficiently.
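To make the "diff extraction" idea concrete, here is a minimal plain-Python sketch of the kind of rows a change feed exposes after a row-level update. It is a conceptual model only: Delta records change rows at write time rather than diffing snapshots, and the `diff_snapshots` helper and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

# (id, amount, _change_type), mirroring the shape of a change feed row
ChangeRow = Tuple[str, int, str]


def diff_snapshots(old: Dict[str, int], new: Dict[str, int]) -> List[ChangeRow]:
    """Model the change rows between two table snapshots keyed by id."""
    changes: List[ChangeRow] = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append((key, new[key], "insert"))
        elif key not in new:
            changes.append((key, old[key], "delete"))
        elif old[key] != new[key]:
            # An update is reported as two rows: the old and the new value
            changes.append((key, old[key], "update_preimage"))
            changes.append((key, new[key], "update_postimage"))
    return changes


snapshot_v1 = {"A": 100, "B": 50}
snapshot_v2 = {"A": 200, "C": 300}
changes = diff_snapshots(snapshot_v1, snapshot_v2)
for row in changes:
    print(row)
```

Unchanged rows never appear in the result, which is why downstream jobs that consume the feed touch only the data that moved.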

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
# Enable CDF
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes between versions 1 and 3 (both inclusive)
df_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .option("endingVersion", 3)
    .table("my_table")  # the same table CDF was enabled on above
)

**⭐ 5. Detailed Example**

In [0]:
# Version 0: create the initial table
data_v1 = [("A", 100), ("B", 50)]
df_v1 = spark.createDataFrame(data_v1, ["id", "amount"])
df_v1.write.format("delta").mode("overwrite").save("/delta/cdf_table")

# Version 1: enable CDF (changes are recorded only from this version onward)
spark.sql("ALTER TABLE delta.`/delta/cdf_table` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Version 2: replace the table contents (an overwrite, not a row-level update)
data_v2 = [("A", 200), ("C", 300)]
df_v2 = spark.createDataFrame(data_v2, ["id", "amount"])
df_v2.write.format("delta").mode("overwrite").save("/delta/cdf_table")

# Read changes from v1 → v2; starting at v0 would fail because CDF was not yet enabled there
df_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .option("endingVersion", 2)
    .load("/delta/cdf_table")
)
# Because overwrite replaces every row, expect delete rows for A and B and insert rows for A and C
df_changes.show()

**Step-by-step:**

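Create the initial table (version 0)

Enable CDF on the table (version 1)

Write the new data (version 2)

Query only the inserted/updated/deleted rows between the versions recorded after CDF was enabled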

**⭐ 6. Mini Practice Problems**

Enable CDF on a Bronze table and extract all changes from the last 24 hours.

Use CDF to select only the new customer records in a Silver table.

Apply CDF output to update a downstream Gold table incrementally.
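As a starting point for the first problem, the sketch below builds the reader options for a time-based window. It assumes the `startingTimestamp` reader option with a `yyyy-MM-dd HH:mm:ss` string; the `cdf_options_since` helper and the `bronze_table` name are hypothetical, and no SparkSession is created here.

```python
from datetime import datetime, timedelta
from typing import Dict


def cdf_options_since(now: datetime, hours: int = 24) -> Dict[str, str]:
    """Build change feed reader options covering the last `hours` hours."""
    start = now - timedelta(hours=hours)
    return {
        "readChangeFeed": "true",
        "startingTimestamp": start.strftime("%Y-%m-%d %H:%M:%S"),
    }


# A fixed "now" keeps the example reproducible; use datetime.now() in a real job
options = cdf_options_since(datetime(2024, 1, 2, 12, 0, 0))
print(options)

# With a live SparkSession, the options could be applied like this (assumption):
# df_changes = spark.read.format("delta").options(**options).table("bronze_table")
```

Filtering the resulting DataFrame on `_change_type = 'insert'` is one way to approach the second problem.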

**⭐ 7. Full Data Engineering Problem**

Scenario: A retail company updates product prices daily:

Bronze receives raw feeds

Silver maintains curated inventory

Gold tracks daily sales totals

Instead of scanning the full Silver table every day, use CDF to extract only changed rows and update Gold efficiently.
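The update logic can be sketched in plain Python before writing it as a Spark job. This is a simplified model, not the pipeline itself: the downstream table is reduced to a hypothetical `product_id → price` lookup, the change rows are hand-written samples, and the sketch assumes at most one net change per key within a single commit.

```python
from typing import Any, Dict, Iterable

ChangeRow = Dict[str, Any]


def apply_changes(target: Dict[str, float], changes: Iterable[ChangeRow]) -> Dict[str, float]:
    """Return a copy of `target` with the latest change per product applied."""
    latest: Dict[str, ChangeRow] = {}
    for row in changes:
        # The pre-update image carries the old value, so it is not needed here
        if row["_change_type"] == "update_preimage":
            continue
        current = latest.get(row["product_id"])
        if current is None or row["_commit_version"] >= current["_commit_version"]:
            latest[row["product_id"]] = row

    result = dict(target)
    for product_id, row in latest.items():
        if row["_change_type"] == "delete":
            result.pop(product_id, None)
        else:  # insert or update_postimage
            result[product_id] = row["price"]
    return result


gold_prices = {"p1": 9.99, "p2": 4.50}
silver_changes = [
    {"product_id": "p1", "price": 9.99, "_change_type": "update_preimage", "_commit_version": 5},
    {"product_id": "p1", "price": 10.99, "_change_type": "update_postimage", "_commit_version": 5},
    {"product_id": "p2", "price": 4.50, "_change_type": "delete", "_commit_version": 6},
    {"product_id": "p3", "price": 2.00, "_change_type": "insert", "_commit_version": 6},
    {"product_id": "p1", "price": 10.99, "_change_type": "update_preimage", "_commit_version": 7},
    {"product_id": "p1", "price": 11.49, "_change_type": "update_postimage", "_commit_version": 7},
]
updated_prices = apply_changes(gold_prices, silver_changes)
print(updated_prices)
```

In Spark, the same three steps (drop pre-images, keep the latest change per key, then upsert or delete) would typically map to a filter, a window or aggregation, and a `MERGE INTO` against the Gold table.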

**⭐ 8. Time & Space Complexity**

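Time: O(changes) → cost scales roughly linearly with the number of changed rows rather than with table size

Space: Extra storage for change data files under the table's _change_data directory (usually small compared to the full table, but it grows with update-heavy workloads)

More efficient than full table scans for incremental pipelines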

**⭐ 9. Common Pitfalls**

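Forgetting to enable CDF → changes cannot be queried, and changes made before enablement are never recorded

Using an outdated Delta version → CDF is not supported in older releases (it requires Delta Lake 2.0+ or Databricks Runtime 8.4+)

Querying too far back → older change data may already have been removed by VACUUM

Forgetting that deleted rows appear with _change_type = 'delete' and that each update produces two rows (update_preimage and update_postimage)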