---
title: "Change Data Capture (CDC) using Pandas"

description: "This notebook demonstrates how to implement Change Data Capture (CDC) using pandas to capture inserts, updates, and deletions between two datasets."

---

## **Step 1: Import Required Libraries**
Import the pandas library for data manipulation.

In [8]:
import pandas as pd

## **Step 2: Create Simulated "Old" and "New" DataFrames**
We will create two DataFrames: one representing the "old" version and another representing the "new" version of the data.

Old version of the dataset

In [13]:
data_old = {
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df_old = pd.DataFrame(data_old)

# Display the old DataFrame
print("Old DataFrame:")
print(df_old)

# New version of the dataset
data_new = {
    'id': [2, 3, 4, 5],
    'name': ['Bob', 'Charlie', 'David', 'Eva'],
    'age': [31, 35, 42, 22],  # Notice Bob's and David's ages have changed
    'city': ['San Francisco', 'Chicago', 'Houston', 'Seattle']  # Notice Bob's city has changed
}
df_new = pd.DataFrame(data_new)

# Display the new DataFrame
print("\n\nNew DataFrame:")
print(df_new)

Old DataFrame:
   id     name  age         city
0   1    Alice   25     New York
1   2      Bob   30  Los Angeles
2   3  Charlie   35      Chicago
3   4    David   40      Houston


New DataFrame:
   id     name  age           city
0   2      Bob   31  San Francisco
1   3  Charlie   35        Chicago
2   4    David   42        Houston
3   5      Eva   22        Seattle


## **Step 3: Identify Inserts, Updates, and Deletions**
### **Step 3a: Identify Insertions**
Insertions are rows whose `id` appears in the "new" DataFrame but not in the "old" DataFrame.


In [14]:
df_inserts = df_new[~df_new['id'].isin(df_old['id'])]
print("Inserts:\n", df_inserts)

Inserts:
    id name  age     city
3   5  Eva   22  Seattle


### **Step 3b: Identify Deletions**
Deletions are rows whose `id` appears in the "old" DataFrame but not in the "new" DataFrame.


In [15]:
df_deletes = df_old[~df_old['id'].isin(df_new['id'])]
print("\nDeletions:\n", df_deletes)


Deletions:
    id   name  age      city
0   1  Alice   25  New York
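
The two `isin` filters in Steps 3a and 3b can also be expressed as a single outer merge on the key. The following is a minimal sketch, not part of the original notebook: it relies on the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking each key as `left_only`, `right_only`, or `both`. The names `key_status`, `df_inserts_alt`, and `df_deletes_alt` are illustrative, and the DataFrames are trimmed to two columns to keep the example short.

```python
import pandas as pd

# Reduced copies of the notebook's data (only the columns needed here)
df_old = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
})
df_new = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'name': ['Bob', 'Charlie', 'David', 'Eva'],
})

# Outer merge on the key only; indicator=True adds a `_merge` column
# whose values are 'left_only', 'right_only', or 'both'
key_status = pd.merge(
    df_old[['id']], df_new[['id']],
    on='id', how='outer', indicator=True
)

# Keys found only in the old data were deleted; keys found only in the new data were inserted
deleted_ids = key_status.loc[key_status['_merge'] == 'left_only', 'id']
inserted_ids = key_status.loc[key_status['_merge'] == 'right_only', 'id']

df_deletes_alt = df_old[df_old['id'].isin(deleted_ids)]
df_inserts_alt = df_new[df_new['id'].isin(inserted_ids)]

print("Deletions:\n", df_deletes_alt)
print("\nInserts:\n", df_inserts_alt)
```

This should select the same rows as the `isin` approach (Alice as a deletion, Eva as an insertion). Which form reads better is largely a matter of taste; the merge version classifies every key in one pass.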


### **Step 3c: Identify Updates**
To capture updates, merge the two DataFrames on the primary key (`id`) and compare each non-key column. `pd.merge` performs an inner join by default, so only ids present in both DataFrames are compared; inserted and deleted rows are excluded.


In [17]:
df_merged = pd.merge(df_old, df_new, on='id', suffixes=('_old', '_new'))

# Identifying rows where any column values differ between old and new
df_updates = df_merged[(df_merged['name_old'] != df_merged['name_new']) |
                       (df_merged['age_old'] != df_merged['age_new']) |
                       (df_merged['city_old'] != df_merged['city_new'])]

print("\nUpdates:\n", df_updates)


Updates:
    id name_old  age_old     city_old name_new  age_new       city_new
0   2      Bob       30  Los Angeles      Bob       31  San Francisco
2   4    David       40      Houston    David       42        Houston
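
The comparison above hard-codes the column names and uses `!=`, which treats two missing values as different because `NaN != NaN` evaluates to `True`. The sketch below is one possible generalization, not the notebook's own method: `find_updates` is a hypothetical helper that loops over all shared non-key columns and ignores pairs where both values are missing. It also passes `validate='one_to_one'` to `pd.merge` so that duplicate keys raise an error instead of silently multiplying rows. The sample data with a missing age is made up for illustration.

```python
import numpy as np
import pandas as pd

def find_updates(df_old, df_new, key='id'):
    """Return merged rows whose shared non-key columns differ (hypothetical helper)."""
    merged = pd.merge(
        df_old, df_new, on=key,
        suffixes=('_old', '_new'),
        validate='one_to_one',  # raise if the key is duplicated on either side
    )
    # Compare only the columns that exist in both DataFrames
    compare_cols = [c for c in df_old.columns if c != key and c in df_new.columns]

    changed = pd.Series(False, index=merged.index)
    for col in compare_cols:
        old_vals = merged[f'{col}_old']
        new_vals = merged[f'{col}_new']
        # A pair of missing values counts as "unchanged"
        both_missing = old_vals.isna() & new_vals.isna()
        changed |= (old_vals != new_vals) & ~both_missing
    return merged[changed]

# Illustrative data: id 2 has a missing age in both versions, id 3 really changed
old = pd.DataFrame({'id': [1, 2, 3], 'age': [25.0, np.nan, 35.0]})
new = pd.DataFrame({'id': [1, 2, 3], 'age': [25.0, np.nan, 36.0]})

updates = find_updates(old, new)
print(updates)
```

With this data only id 3 should be reported, whereas a plain `!=` comparison would also flag id 2.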


## **Step 4: Display the Captured Changes**
Print the inserts, deletions, and updates identified above in one place.


In [18]:
print("\nSummary of Changes:")
print("\nInserts:\n", df_inserts)
print("\nDeletions:\n", df_deletes)
print("\nUpdates:\n", df_updates)


Summary of Changes:

Inserts:
    id name  age     city
3   5  Eva   22  Seattle

Deletions:
    id   name  age      city
0   1  Alice   25  New York

Updates:
    id name_old  age_old     city_old name_new  age_new       city_new
0   2      Bob       30  Los Angeles      Bob       31  San Francisco
2   4    David       40      Houston    David       42        Houston
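
Downstream consumers often prefer a single change log over three separate DataFrames. The following sketch shows one way to build it, assuming an `operation` column (a name chosen here for illustration) with the values `insert`, `update`, and `delete`. For updates it keeps the new version of each row; that is a design choice, not something the notebook prescribes.

```python
import pandas as pd

df_old = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})
df_new = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'name': ['Bob', 'Charlie', 'David', 'Eva'],
    'age': [31, 35, 42, 22],
    'city': ['San Francisco', 'Chicago', 'Houston', 'Seattle'],
})

# Inserts and deletes: same isin-based filters as in Step 3, tagged with an operation label
inserts = df_new[~df_new['id'].isin(df_old['id'])].assign(operation='insert')
deletes = df_old[~df_old['id'].isin(df_new['id'])].assign(operation='delete')

# Updates: find ids whose non-key columns differ, then keep the new version of those rows
merged = pd.merge(df_old, df_new, on='id', suffixes=('_old', '_new'))
changed = pd.Series(False, index=merged.index)
for col in ['name', 'age', 'city']:
    changed |= merged[f'{col}_old'] != merged[f'{col}_new']
updated_ids = merged.loc[changed, 'id']
updates = df_new[df_new['id'].isin(updated_ids)].assign(operation='update')

# Stack the three sets into one log, ordered by id
change_log = (
    pd.concat([inserts, updates, deletes], ignore_index=True)
    .sort_values('id')
    .reset_index(drop=True)
)
print(change_log)
```

Each row of `change_log` carries the affected record plus its operation label, which makes it straightforward to filter by operation or to apply the changes to a target table in order.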
