## Capturing Data Changes

Capturing and replicating changes from source systems is a fundamental component of building reliable ETL pipelines. This ensures that target datasets stay up to date with source modifications. The technique used for this purpose is known as Change Data Capture (CDC).

### What is Change Data Capture (CDC)?
**Change Data Capture (CDC)** refers to the process of identifying and capturing changes—such as inserts, updates, and deletes—in a data source and delivering them to a target system. It enables systems to remain synchronized and accurately reflect source updates.

**CDC typically detects the following row-level changes:**

- **Insertions** - New records added to the source and inserted into the target.
- **Updates** - Modified records in the source that need updating in the target.
- **Deletions** - Records removed from the source and deleted from the target.

### CDC Feed Structure
CDC feeds log changes at the source as events, each including affected record data and metadata such as:

- **Operation Type** - Specifies whether the record was inserted, updated, or deleted.
- **Timestamp or Version Number** - Ensures operations are applied in the correct order.
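
To see why the sequencing metadata matters, here is a minimal plain-Python sketch. The `ChangeEvent` class and `apply_events` helper are hypothetical names used only for illustration and are not part of any CDC tool. It applies the same two events in arrival order and then in sequence order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    key: int                # primary key of the affected record
    operation: str          # "INSERT", "UPDATE", or "DELETE"
    sequence: int           # version number (a timestamp would also work)
    data: Optional[dict]    # record payload; None for deletes


def apply_events(events):
    """Apply events to an empty in-memory target in the order given."""
    target = {}
    for event in events:
        if event.operation == "DELETE":
            target.pop(event.key, None)
        else:  # INSERT and UPDATE both upsert the row
            target[event.key] = event.data
    return target


# The update (sequence 2) arrives before the insert (sequence 1).
arrived = [
    ChangeEvent(1, "UPDATE", 2, {"name": "Alice Smith"}),
    ChangeEvent(1, "INSERT", 1, {"name": "Alice"}),
]

in_arrival_order = apply_events(arrived)
in_sequence_order = apply_events(sorted(arrived, key=lambda e: e.sequence))

print(in_arrival_order)   # the stale insert overwrites the newer update
print(in_sequence_order)  # the newer update is applied last
```

Sorting by the sequence field before applying is the simplest way to get a correct result in a batch; a streaming consumer would instead need to compare each event against the last sequence value it has seen for that key.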

### CDC Data Sources
CDC data typically originates from two types of sources:

**1. Databases with Built-in CDC Features**

Many modern databases support native CDC by maintaining logs that track every data change. These logs record operation types and affected rows.

  - Example: Microsoft SQL Server offers native CDC, logging change operations on tables.
  - Delta Lake also provides Change Data Feed (CDF) for tracking modifications.

**2. CDC Agents**

These are third-party tools that monitor databases for changes and capture before/after data along with operation types.

  - Example: Debezium, an open-source CDC platform, supports databases like MySQL, PostgreSQL, SQL Server, and MongoDB. It streams change events in real time.

### CDC Feed Delivery Methods
CDC data can be delivered from the source in different ways:

- **Streaming Feeds** - Continuous delivery of change events for near real-time synchronization.
- **JSON Files** - Batch-captured changes written to JSON files, which are later processed to update the target.

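Both methods deliver the same change events and differ mainly in latency: streaming feeds propagate changes continuously, while file-based batches update the target only when each file is processed.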
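
As a rough illustration of the file-based option, the sketch below parses a batch of change events with Python's standard `json` module. It assumes newline-delimited JSON (one event per line) and uses made-up field names and records; real CDC tools define their own file layout.

```python
import json

# Hypothetical batch file contents: one JSON change event per line.
batch = """\
{"customer_id": 1, "name": "Alice", "operation": "INSERT", "version": 1}
{"customer_id": 1, "name": "Alice Smith", "operation": "UPDATE", "version": 2}
{"customer_id": 2, "name": null, "operation": "DELETE", "version": 3}
"""


def read_change_events(text):
    """Parse newline-delimited JSON into a list of change-event dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


events = read_change_events(batch)
operations = [event["operation"] for event in events]
print(operations)
```

Each parsed event still carries its operation type and version, so the downstream step can order and apply the events the same way it would for a streaming feed.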

### CDC with Delta Live Tables (DLT)
DLT provides native support for processing CDC feeds through the `APPLY CHANGES INTO` command. This command simplifies applying changes from a source feed to a target table.

**Example Syntax**

```sql
APPLY CHANGES INTO LIVE.target_table
FROM STREAM(LIVE.cdc_feed_table)
KEYS (key_field)
APPLY AS DELETE WHEN operation_field = "DELETE"
SEQUENCE BY sequence_field
COLUMNS *
```

**Explanation:**
- **LIVE.target_table** – The DLT table receiving the changes. Must be created beforehand.
- **STREAM(LIVE.cdc_feed_table)** – Specifies the streaming CDC feed source.
- **KEYS** – Defines primary key(s) to detect existing records.
- **APPLY AS DELETE WHEN** – Specifies conditions for deletions.
- **SEQUENCE BY** – Orders events to apply them correctly.
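- **COLUMNS** – Selects which columns are written to the target. `COLUMNS *` includes all of them; you can instead list specific columns or exclude some with `COLUMNS * EXCEPT (...)`.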
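
The clauses above can be mimicked in plain Python to make their roles concrete. This is only a sketch of the semantics under simplifying assumptions (an in-memory list of dicts, latest-event-wins behavior); it is not how DLT is implemented, and the `apply_changes` function, its parameters, and the sample feed are hypothetical.

```python
def apply_changes(events, keys, sequence_by, delete_when, columns=None):
    """Reduce a list of change-event dicts to the current target rows.

    keys        -- column names identifying a record (like KEYS)
    sequence_by -- column name used for ordering (like SEQUENCE BY)
    delete_when -- predicate marking delete events (like APPLY AS DELETE WHEN)
    columns     -- columns to keep; None keeps all (like COLUMNS *)
    """
    latest = {}
    for event in events:
        key = tuple(event[k] for k in keys)
        current = latest.get(key)
        # Keep only the event with the highest sequence value per key.
        if current is None or event[sequence_by] > current[sequence_by]:
            latest[key] = event

    rows = []
    for key in sorted(latest):
        event = latest[key]
        if delete_when(event):
            continue  # the latest event is a delete, so the row is absent
        keep = columns if columns is not None else list(event)
        rows.append({c: event[c] for c in keep})
    return rows


# Hypothetical feed; events arrive out of sequence order.
feed = [
    {"id": 1, "city": "Oslo", "op": "INSERT", "seq": 1},
    {"id": 2, "city": "Rome", "op": "INSERT", "seq": 2},
    {"id": 2, "city": None, "op": "DELETE", "seq": 4},
    {"id": 1, "city": "Bergen", "op": "UPDATE", "seq": 3},
]

result = apply_changes(
    feed,
    keys=["id"],
    sequence_by="seq",
    delete_when=lambda e: e["op"] == "DELETE",
    columns=["id", "city"],
)
print(result)
```

Record 2 is dropped because its latest event is a delete, and record 1 keeps the values from its highest-sequence event, with the operation and sequence columns left out of the target.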

### Set Up Sample CDC Feed Data

```sql
%sql
USE CATALOG hive_metastore;
DROP DATABASE IF EXISTS cdc_demo CASCADE;
CREATE DATABASE cdc_demo;
USE cdc_demo;
```

```python
dbutils.fs.rm("dbfs:/tmp/pipeline/tables", True)
```

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Define schema for CDC data
cdc_schema = StructType([
  StructField("customer_id", IntegerType(), True),
  StructField("name", StringType(), True),
  StructField("email", StringType(), True),
  StructField("operation", StringType(), True),  # INSERT, UPDATE, DELETE
  StructField("version", IntegerType(), True)
])

# Sample CDC records
cdc_data = [
  (1, "Alice", "alice@example.com", "INSERT", 1),
  (2, "Bob", "bob@example.com", "INSERT", 1),
  (1, "Alice Smith", "alice.smith@example.com", "UPDATE", 2),
  (2, None, None, "DELETE", 3),
  (3, "Charlie", "charlie@example.com", "INSERT", 4),
]

# Create a static DataFrame
df = spark.createDataFrame(cdc_data, schema=cdc_schema)

# Write to Delta table to simulate a feed source
df.write.format("delta").mode("overwrite").saveAsTable("cdc_demo.cdc_feed_table")
```

```python
spark.table("cdc_demo.cdc_feed_table").show()
```

`APPLY CHANGES INTO` in Delta Live Tables does not hard-delete rows by default. It applies a soft delete by:

- Retaining the row
- Nullifying values
- Marking `__DeleteVersion` with the version of the delete event

This is done to support SCD (Slowly Changing Dimension) logic and auditability.
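
The soft-delete idea can be illustrated with a small plain-Python sketch. This is not DLT's implementation: the `apply_event` and `visible_rows` helpers, the `last_version` field, and the dict layout are hypothetical, and only the `__DeleteVersion` name comes from the description above. One practical benefit of keeping such a marker, shown here, is that a late-arriving older event cannot bring a deleted record back.

```python
def apply_event(store, event):
    """Apply one event to `store`, keeping a tombstone for deletes."""
    key, version = event["customer_id"], event["version"]
    existing = store.get(key)
    # Ignore events older than what is already recorded for this key.
    if existing is not None and version <= existing["last_version"]:
        return
    if event["operation"] == "DELETE":
        # Soft delete: retain the row, null its values, record the delete version.
        store[key] = {"name": None, "email": None,
                      "__DeleteVersion": version, "last_version": version}
    else:
        store[key] = {"name": event["name"], "email": event["email"],
                      "__DeleteVersion": None, "last_version": version}


def visible_rows(store):
    """Rows a reader would see: everything without a delete marker."""
    return {k: v for k, v in store.items() if v["__DeleteVersion"] is None}


store = {}
events = [
    {"customer_id": 2, "name": "Bob", "email": "bob@example.com",
     "operation": "INSERT", "version": 1},
    {"customer_id": 2, "name": None, "email": None,
     "operation": "DELETE", "version": 3},
    # A late update from before the delete; the tombstone lets us reject it.
    {"customer_id": 2, "name": "Bobby", "email": "bob@example.com",
     "operation": "UPDATE", "version": 2},
]
for event in events:
    apply_event(store, event)

print(store[2]["__DeleteVersion"])
print(visible_rows(store))
```

The row for customer 2 is still physically present with its delete version recorded, but it is filtered out of what readers see.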

```python
spark.table("customer_table").display()
```

```python
spark.table("customer_table_mv").show(truncate=False)
```

```python
cdc_data = [
    (3, "Charles", "charles@example.com", "UPDATE", 5),
    (4, "Diana", "diana@example.com", "INSERT", 6),
    (2, "Bob Reborn", "bob.new@example.com", "INSERT", 8),
    (3, None, None, "DELETE", 9),
    (1, "Alice Smith", "alice.smith@newdomain.com", "UPDATE", 10),
    (4, "Diana", "diana123@example.com", "UPDATE", 7)
]

df = spark.createDataFrame(cdc_data, schema=cdc_schema)

# Write to Delta table to simulate a feed source
df.write.format("delta").mode("append").saveAsTable("cdc_demo.cdc_feed_table")
```

```python
spark.table("customer_table_mv").show(truncate=False)
```
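
To sanity-check the pipeline output, the expected active rows can be worked out by hand from the two batches written to `cdc_feed_table`. The sketch below does this in plain Python, assuming the pipeline uses `customer_id` as the key, `version` as the sequence field, treats `operation = "DELETE"` as a delete, and keeps only the latest state per key (SCD type 1). The pipeline definition itself is not shown in this section, so these are assumptions.

```python
# All events written to cdc_feed_table across both batches:
# (customer_id, name, email, operation, version)
feed = [
    (1, "Alice", "alice@example.com", "INSERT", 1),
    (2, "Bob", "bob@example.com", "INSERT", 1),
    (1, "Alice Smith", "alice.smith@example.com", "UPDATE", 2),
    (2, None, None, "DELETE", 3),
    (3, "Charlie", "charlie@example.com", "INSERT", 4),
    (3, "Charles", "charles@example.com", "UPDATE", 5),
    (4, "Diana", "diana@example.com", "INSERT", 6),
    (2, "Bob Reborn", "bob.new@example.com", "INSERT", 8),
    (3, None, None, "DELETE", 9),
    (1, "Alice Smith", "alice.smith@newdomain.com", "UPDATE", 10),
    (4, "Diana", "diana123@example.com", "UPDATE", 7),
]

# Replay in version order; later events overwrite earlier ones per customer_id.
state = {}
for cid, name, email, op, version in sorted(feed, key=lambda row: row[4]):
    state[cid] = (name, email, op, version)

# Active rows are those whose latest event is not a delete.
expected = {cid: (name, email)
            for cid, (name, email, op, _) in state.items() if op != "DELETE"}

for cid in sorted(expected):
    print(cid, *expected[cid])
```

Under these assumptions, customers 1, 2, and 4 remain active, customer 3 ends as deleted, and customer 4 carries the email from version 7 even though that event was listed after version 10 in the batch.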