# Data Warehousing - Part 11: Slowly Changing Dimensions (SCD) Type 2

## 1. What is SCD Type 2?
**SCD Type 2** is the most common technique for preserving history in a Data Warehouse. Unlike SCD 1 (which overwrites), SCD 2 **adds a new row** for every change.

*   **Behavior:** When a record changes in the source, the old record in the DW is marked as "expired," and a new record is inserted as "active."
*   **Result:** We have a full history of changes over time.
*   **Key Columns Required:**
    1.  **Surrogate Key:** Unique ID for every row (version) in the dimension.
    2.  **Natural Key:** The business ID (e.g., Employee ID) linking all versions.
    3.  **Effective Start Date:** When this version became active.
    4.  **Effective End Date:** When this version expired (active records usually carry a far-future placeholder such as `9999-12-31`).
    5.  **Active Flag:** (Optional but recommended) 'Y' for current, 'N' for history.
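
To make the five columns concrete, here is a minimal sketch of one dimension row as a Python dataclass. The class name `EmployeeDimRow`, the lowercase field names, and the sample values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # placeholder end date for the current version


@dataclass
class EmployeeDimRow:  # hypothetical name, for illustration only
    employee_key: int     # surrogate key: unique per row (version)
    employee_id: str      # natural key: shared by all versions of one employee
    address: str          # the attribute whose history is tracked
    start_date: datetime  # when this version became active
    end_date: datetime    # when it expired (HIGH_DATE while still current)
    is_active: str        # 'Y' for the current version, 'N' for history


# Two versions of the same employee (illustrative values)
rows = [
    EmployeeDimRow(1, "E001", "Bengaluru", datetime(2023, 1, 1),
                   datetime(2023, 1, 9, 23, 59, 59), "N"),
    EmployeeDimRow(2, "E001", "Kolkata", datetime(2023, 1, 10), HIGH_DATE, "Y"),
]

# The natural key links both versions; the active flag picks out the current one.
current = [r for r in rows if r.employee_id == "E001" and r.is_active == "Y"]
print(current[0].address)  # Kolkata
```

Both rows share the natural key `E001`, but each has its own surrogate key, which is what a fact table would reference.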

---

## 2. Python Simulation: Implementing SCD Type 2
We will continue with the Employee example.

### Initial State (Day 1: Jan 1st)
Employee `E001` (Shubham) lives in `Bengaluru`.

```python
import pandas as pd
from datetime import datetime

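# Define high date for active records.
# Note: 9999-12-31 lies beyond pandas' nanosecond Timestamp range (max 2262-04-11),
# so, depending on the pandas version, this column may be stored as object dtype.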
HIGH_DATE = datetime(9999, 12, 31)

# --- 1. Current State of Dimension (Before Change) ---
dim_data = {
    'Employee_Key': [1],            # Surrogate Key
    'Employee_ID': ['E001'],        # Natural Key
    'Address': ['Bengaluru'],
    'Start_Date': [datetime(2023, 1, 1)],
    'End_Date': [HIGH_DATE],
    'Is_Active': ['Y']
}

df_dim = pd.DataFrame(dim_data)

print("--- Dimension Table (Day 1) ---")
display(df_dim)
```

### The Change (Day 10: Jan 10th)
On Jan 10th, the source system reports that Shubham has moved to `Kolkata`.
Also, a new employee `E002` (Rohan) joins in `Indore`.

```python
# --- 2. Incoming Source Data (Day 10) ---
source_data = {
    'Employee_ID': ['E001', 'E002'],
    'Address': ['Kolkata', 'Indore'] # E001 Changed, E002 New
}
df_source = pd.DataFrame(source_data)
current_date = datetime(2023, 1, 10) # The date of the load

print("\n--- Incoming Changes ---")
display(df_source)
```

### Implementing SCD 2 Logic
The logic is more complex than SCD 1:
1.  **Identify Changes:** Compare Source vs. Target on Natural Key.
2.  **Expire Old Records:** If a record has changed, find the *active* row in the dimension and set its `End_Date` to (Current Date - 1 second), so that the old and new validity ranges do not overlap, and its `Is_Active` to 'N'.
3.  **Insert New Records:**
    *   For *Changed* records: Insert a new row with the new address, `Start_Date` = Current Date, `End_Date` = 9999-12-31, `Is_Active` = 'Y'.
    *   For *New* records: Insert a row with the same `Start_Date`, `End_Date`, and `Is_Active` values; there is no old row to expire.

```python
# --- 3. SCD Type 2 Logic Simulation ---

# Merge to compare
merged = pd.merge(
    df_source,
    df_dim[df_dim['Is_Active'] == 'Y'], # Compare only against ACTIVE records
    on='Employee_ID',
    how='left',
    suffixes=('_Source', '_Target')
)

# A. Detect Changes (Old vs New Address)
# Logic: ID exists AND Address is different
changes = merged[
    (merged['Employee_Key'].notna()) & 
    (merged['Address_Source'] != merged['Address_Target'])
].copy()

# B. Detect New Inserts
# Logic: ID does not exist in Dimension
new_inserts = merged[merged['Employee_Key'].isna()].copy()

print(f"Detected {len(changes)} changes and {len(new_inserts)} new records.")

# --- Step 4: Expire Old Records ---
# "Close" the active version of every changed employee (here: E001).
# End_Date is set to one second before the load date, as described above,
# so the expired range does not overlap the new version's Start_Date.
mask = df_dim['Employee_ID'].isin(changes['Employee_ID']) & (df_dim['Is_Active'] == 'Y')
df_dim.loc[mask, 'End_Date'] = current_date - pd.Timedelta(seconds=1)
df_dim.loc[mask, 'Is_Active'] = 'N'  # Mark as history

# --- Step 5: Insert New Versions & New Records ---
new_rows_list = []
max_key = df_dim['Employee_Key'].max()

# Prepare Changed Records (New Version)
for index, row in changes.iterrows():
    max_key += 1
    new_row = {
        'Employee_Key': max_key,
        'Employee_ID': row['Employee_ID'],
        'Address': row['Address_Source'], # New Address
        'Start_Date': current_date,
        'End_Date': HIGH_DATE,
        'Is_Active': 'Y'
    }
    new_rows_list.append(new_row)

# Prepare Completely New Records
for index, row in new_inserts.iterrows():
    max_key += 1
    new_row = {
        'Employee_Key': max_key,
        'Employee_ID': row['Employee_ID'],
        'Address': row['Address_Source'],
        'Start_Date': current_date,
        'End_Date': HIGH_DATE,
        'Is_Active': 'Y'
    }
    new_rows_list.append(new_row)

# Append to Dimension
if new_rows_list:
    df_dim = pd.concat([df_dim, pd.DataFrame(new_rows_list)], ignore_index=True)

print("\n--- Final Dimension Table (SCD 2 Applied) ---")
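# Notice E001 now has TWO rows: one history (Is_Active='N') and one current (Is_Active='Y').
# Sorting by Start_Date as well keeps each employee's versions in chronological order.
display(df_dim.sort_values(['Employee_ID', 'Start_Date']))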
```
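
The same expire-and-insert logic can also be sketched as a single reusable function. The version below uses plain Python dicts rather than pandas so that it is easy to test in isolation; the function name `apply_scd2` is hypothetical, and it assumes each batch contains at most one row per `Employee_ID`. It also shows a property worth checking in any SCD 2 load: re-running the same source data should change nothing.

```python
from datetime import datetime, timedelta

HIGH_DATE = datetime(9999, 12, 31)


def apply_scd2(dim_rows, source_rows, load_date):
    """Return a new list of dimension rows with SCD Type 2 applied (sketch)."""
    result = [dict(r) for r in dim_rows]  # copy so the input is left untouched
    active = {r["Employee_ID"]: r for r in result if r["Is_Active"] == "Y"}
    next_key = max((r["Employee_Key"] for r in result), default=0) + 1

    for src in source_rows:
        current = active.get(src["Employee_ID"])
        if current is not None and current["Address"] == src["Address"]:
            continue  # unchanged: nothing to do
        if current is not None:  # changed: expire the old version
            current["End_Date"] = load_date - timedelta(seconds=1)
            current["Is_Active"] = "N"
        result.append({  # new version of a changed record, or a brand-new record
            "Employee_Key": next_key,
            "Employee_ID": src["Employee_ID"],
            "Address": src["Address"],
            "Start_Date": load_date,
            "End_Date": HIGH_DATE,
            "Is_Active": "Y",
        })
        next_key += 1
    return result


dim = [{"Employee_Key": 1, "Employee_ID": "E001", "Address": "Bengaluru",
        "Start_Date": datetime(2023, 1, 1), "End_Date": HIGH_DATE, "Is_Active": "Y"}]
source = [{"Employee_ID": "E001", "Address": "Kolkata"},
          {"Employee_ID": "E002", "Address": "Indore"}]

after_first = apply_scd2(dim, source, datetime(2023, 1, 10))
after_second = apply_scd2(after_first, source, datetime(2023, 1, 11))

print(len(after_first))             # 3
print(after_second == after_first)  # True
```

The second call finds that both source rows match their active versions, so no row is expired or inserted.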

---

## 3. Why is this useful?
Imagine a Sales Fact table linked to this dimension.
*   **Sales on Jan 5th:** Will link to `Employee_Key: 1` (Bengaluru).
*   **Sales on Jan 15th:** Will link to `Employee_Key: 2` (Kolkata).

This allows us to accurately report: *"Shubham sold $500 while in Bengaluru and $1000 while in Kolkata."*
If we used SCD 1, all of Shubham's sales would appear under Kolkata, because the overwrite discards the Bengaluru address.
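
As a rough sketch of how a fact load could pick the right surrogate key, the snippet below looks up the dimension version whose validity range contains the sale date. The helper `lookup_employee_key` and the sample sales rows are assumptions made for illustration; the expired row is assumed to end one second before the Jan 10th load, as in the expire step described earlier.

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)

dim_rows = [
    {"Employee_Key": 1, "Employee_ID": "E001", "Address": "Bengaluru",
     "Start_Date": datetime(2023, 1, 1), "End_Date": datetime(2023, 1, 9, 23, 59, 59)},
    {"Employee_Key": 2, "Employee_ID": "E001", "Address": "Kolkata",
     "Start_Date": datetime(2023, 1, 10), "End_Date": HIGH_DATE},
]


def lookup_employee_key(employee_id, as_of):
    """Return the surrogate key of the version valid at `as_of`, or None."""
    for row in dim_rows:
        if (row["Employee_ID"] == employee_id
                and row["Start_Date"] <= as_of <= row["End_Date"]):
            return row["Employee_Key"]
    return None


# Hypothetical source sales, still carrying only the natural key
sales = [
    {"Employee_ID": "E001", "Sale_Date": datetime(2023, 1, 5), "Amount": 500},
    {"Employee_ID": "E001", "Sale_Date": datetime(2023, 1, 15), "Amount": 1000},
]

# Resolve each sale to the dimension version that was valid on the sale date
fact_rows = [
    {**s, "Employee_Key": lookup_employee_key(s["Employee_ID"], s["Sale_Date"])}
    for s in sales
]

# Report sales by the address that was current at the time of each sale
address_by_key = {r["Employee_Key"]: r["Address"] for r in dim_rows}
totals = {}
for f in fact_rows:
    city = address_by_key[f["Employee_Key"]]
    totals[city] = totals.get(city, 0) + f["Amount"]

print(totals)  # {'Bengaluru': 500, 'Kolkata': 1000}
```

Because the two validity ranges do not overlap, every sale date matches at most one version of the employee.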

---

## 4. Summary: SCD Types Comparison

| Type | Name | Logic | History? | Complexity | Use Case |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Type 1** | Overwrite | Update existing row | No | Low | Corrections, Current State |
| **Type 2** | Add Row | Expire old, Insert new | **Full** | Medium | Tracking changes over time (Default for DW) |
| **Type 3** | Add Column | Add `Prev_Column` | Limited | Low | Only need immediate previous value |
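
The three rows of the table can be contrasted on a single address change. This is a simplified sketch: the function names `scd1`, `scd2`, and `scd3` are hypothetical, and the Type 2 version omits the date columns and surrogate key for brevity.

```python
row = {"Employee_ID": "E001", "Address": "Bengaluru"}


def scd1(row, new_address):
    """Type 1: overwrite in place; the old value is lost."""
    return {**row, "Address": new_address}


def scd2(rows, new_address):
    """Type 2: flag existing versions as history and append a new active version."""
    history = [{**r, "Is_Active": "N"} for r in rows]
    return history + [{**rows[-1], "Address": new_address, "Is_Active": "Y"}]


def scd3(row, new_address):
    """Type 3: keep only the immediately previous value in an extra column."""
    return {**row, "Prev_Address": row["Address"], "Address": new_address}


print(scd1(row, "Kolkata"))
# {'Employee_ID': 'E001', 'Address': 'Kolkata'}

print(scd2([{**row, "Is_Active": "Y"}], "Kolkata"))
# [{'Employee_ID': 'E001', 'Address': 'Bengaluru', 'Is_Active': 'N'},
#  {'Employee_ID': 'E001', 'Address': 'Kolkata', 'Is_Active': 'Y'}]

print(scd3(row, "Kolkata"))
# {'Employee_ID': 'E001', 'Address': 'Kolkata', 'Prev_Address': 'Bengaluru'}
```

Only the Type 2 result still contains every past address after repeated changes; Type 3 retains one previous value and Type 1 retains none.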

---

## 5. Next Steps
We have now covered SCD Types 1 and 2. In the next video, we will briefly touch upon **SCD Type 3** and wrap up the dimension discussion before moving to Fact Tables.