# Data Warehousing - Part 10: Slowly Changing Dimensions (SCD) Type 1

## 1. What is SCD Type 1?
**Slowly Changing Dimension Type 1 (SCD 1)** is a method used to handle changes in dimension tables where **history is NOT preserved**.

*   **Behavior:** If a record changes in the source, we simply **overwrite** the existing record in the Data Warehouse.
*   **Result:** The Data Warehouse always reflects the **current state** of the world. We lose knowledge of what the data looked like in the past.
*   **Use Case:** Correcting data errors (e.g., spelling mistakes in names) or when history is irrelevant (e.g., "Current Marital Status" might not need tracking for some businesses).

---

## 2. Key Concepts: Surrogate Keys vs. Natural Keys
Before implementing SCD, we must distinguish between two types of keys:

1.  **Natural Key (Business Key):** The unique ID from the source system (e.g., `Employee_ID: E001`). This is how the business identifies the entity.
2.  **Surrogate Key:** An artificial, auto-incrementing integer created within the Data Warehouse (e.g., `Employee_Key: 1`).
    *   *Why?* It decouples the DW from the source system's key format, usually improves join performance (integer comparisons are typically cheaper than string comparisons), and makes history tracking possible (in SCD Type 2, one Natural Key can map to multiple Surrogate Keys).
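
To make the distinction concrete, here is a minimal sketch of surrogate key assignment. The helper `assign_surrogate_keys` and the in-memory `key_map` are hypothetical names used only for illustration; a real warehouse would normally rely on a database sequence or identity column.

```python
from itertools import count

def assign_surrogate_keys(natural_keys, key_map=None):
    """Map each natural key to a stable integer surrogate key.

    `key_map` is a hypothetical lookup (natural key -> surrogate key)
    standing in for what a database sequence would provide.
    """
    key_map = dict(key_map or {})
    # Continue numbering after the highest key already issued.
    next_key = count(max(key_map.values(), default=0) + 1)
    for nk in natural_keys:
        if nk not in key_map:          # existing natural keys keep their surrogate key
            key_map[nk] = next(next_key)
    return key_map

key_map = assign_surrogate_keys(['E001', 'E002'])
key_map = assign_surrogate_keys(['E002', 'E003'], key_map)
print(key_map)  # {'E001': 1, 'E002': 2, 'E003': 3}
```

Note that `E002` keeps its surrogate key on the second load; only the unseen `E003` receives a new one.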

---

## 3. Python Simulation: Implementing SCD Type 1
Let's simulate the scenario described in the video: an Employee/Customer table where addresses change.

### Initial State
We load the initial data into our Dimension table.

```python
import pandas as pd
# Note: display() is provided by Jupyter/IPython; in a plain script, use print() instead.

# --- 1. Current State of the Dimension Table (Target) ---
# This is what exists in the Data Warehouse
current_dim_data = {
    'Employee_Key': [1, 2],       # Surrogate Key
    'Employee_ID': ['E001', 'E002'], # Natural Key
    'Name': ['Shubham', 'Rohan'],
    'Address': ['Kolkata', 'Bengaluru']
}

df_dim_employee = pd.DataFrame(current_dim_data)

print("--- Existing Dimension Table ---")
display(df_dim_employee)
```

### Incoming Data (The Change)
Now, new data arrives from the Source system.
1.  **E001 (Shubham):** Address changed from 'Kolkata' to 'Bengaluru'.
2.  **E002 (Rohan):** No change.
3.  **E003 (Rakesh):** New employee (Insert).

```python
# --- 2. Incoming Source Data ---
source_data = {
    'Employee_ID': ['E001', 'E002', 'E003'],
    'Name': ['Shubham', 'Rohan', 'Rakesh'],
    'Address': ['Bengaluru', 'Bengaluru', 'Indore'] # E001 Changed, E003 New
}

df_source = pd.DataFrame(source_data)

print("\n--- Incoming Source Data ---")
display(df_source)
```

### Applying Logic: Upsert (Update + Insert)
SCD 1 logic performs an "Upsert":
*   **Match** records based on `Employee_ID` (Natural Key).
*   **Update** attributes if the ID exists.
*   **Insert** new row if the ID does not exist.

```python
# --- 3. Implementing SCD Type 1 Logic ---

# Merge Source with Target on Natural Key to identify changes
merged = pd.merge(
    df_source, 
    df_dim_employee, 
    on='Employee_ID', 
    how='left', 
    suffixes=('_Source', '_Target')
)

# A. Identify New Records (Insert)
# Rows where Employee_Key is NaN (didn't exist in Target)
new_records = merged[merged['Employee_Key'].isna()].copy()
new_records['Name'] = new_records['Name_Source']
new_records['Address'] = new_records['Address_Source']

# Assign new Surrogate Keys (In a real DB, this is a sequence)
max_key = df_dim_employee['Employee_Key'].max()
new_records['Employee_Key'] = range(max_key + 1, max_key + 1 + len(new_records))

# Clean up columns for insertion
cols = ['Employee_Key', 'Employee_ID', 'Name', 'Address']
to_insert = new_records[cols]

print("--- Records to Insert ---")
display(to_insert)

# B. Identify Updates (Overwrite)
# Rows whose Natural Key already exists in the Target (E001, E002).
# Note: this simulation overwrites every matched row with the SOURCE values,
# whether or not anything changed (E002 is rewritten with identical data).
# It also assumes every existing employee appears in the incoming batch;
# a Target row missing from the Source would be dropped by the left merge above.

updated_dim = merged[merged['Employee_Key'].notna()].copy()
# The left merge introduced NaN for E003, which turned the key column into float
updated_dim['Employee_Key'] = updated_dim['Employee_Key'].astype(int)
updated_dim['Name'] = updated_dim['Name_Source'] # Overwrite Name
updated_dim['Address'] = updated_dim['Address_Source'] # Overwrite Address

to_update = updated_dim[cols]

print("\n--- Records Updated (Overwritten) ---")
display(to_update)

# --- 4. Final State of Dimension ---
# Combine Updated records + New records
df_dim_final = pd.concat([to_update, to_insert], ignore_index=True).sort_values('Employee_Key')

print("\n--- Final Dimension Table (SCD 1 Applied) ---")
# Notice E001's address is now 'Bengaluru'. The history 'Kolkata' is GONE.
display(df_dim_final)
```
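
The walkthrough above works because every existing employee appears in the incoming batch. As a hedged alternative, the sketch below wraps the same upsert idea in a function that also keeps dimension rows absent from the batch. The function name `scd1_upsert` and its parameters are assumptions made for this example, not part of any library.

```python
import pandas as pd

def scd1_upsert(dim, source, natural_key, surrogate_key, attrs):
    """Return a new dimension with SCD Type 1 applied (overwrite, no history)."""
    dim = dim.set_index(natural_key)
    # If the batch repeats a natural key, keep only its last occurrence.
    src = source.drop_duplicates(natural_key, keep='last').set_index(natural_key)

    # Overwrite: for keys present in both, copy source attributes over the target.
    common = dim.index.intersection(src.index)
    dim.loc[common, attrs] = src.loc[common, attrs]

    # Insert: keys found only in the source get the next surrogate key values.
    new = src.loc[src.index.difference(dim.index), attrs].copy()
    start = int(dim[surrogate_key].max()) + 1 if len(dim) else 1
    new[surrogate_key] = range(start, start + len(new))

    result = pd.concat([dim, new]).reset_index()
    columns = [surrogate_key, natural_key, *attrs]
    return result[columns].sort_values(surrogate_key, ignore_index=True)

dim = pd.DataFrame({
    'Employee_Key': [1, 2],
    'Employee_ID': ['E001', 'E002'],
    'Name': ['Shubham', 'Rohan'],
    'Address': ['Kolkata', 'Bengaluru'],
})
batch = pd.DataFrame({
    'Employee_ID': ['E001', 'E003'],    # E002 is absent from this batch
    'Name': ['Shubham', 'Rakesh'],
    'Address': ['Bengaluru', 'Indore'],
})

result = scd1_upsert(dim, batch, 'Employee_ID', 'Employee_Key', ['Name', 'Address'])
print(result)
```

Here `E001` is overwritten, `E002` is left untouched even though it is missing from the batch, and `E003` is inserted with the next surrogate key.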

---

## 4. Pros and Cons of SCD 1

| Feature | Description | Impact |
| :--- | :--- | :--- |
| **History** | **None.** Old data is lost forever. | You cannot report on "Sales by Region" for last year correctly if the salesperson moved regions today. |
| **Complexity** | Low. Simple update/insert. | Easy to maintain and build. |
| **Storage** | Low. No extra rows added for updates. | Smallest dimension table size. |
| **Use Case** | Corrections, current status tracking. | Fixing typos (e.g., `Kolkta` corrected to `Kolkata`) or tracking current phone numbers. |
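
The history limitation in the first row can be seen in a small sketch. The tables, column names, and figures below (`fact_sales`, `dim_salesperson`, the regions and amounts) are invented purely for illustration.

```python
import pandas as pd

# Hypothetical fact table: the 2023 sale happened while the salesperson was in 'East'.
fact_sales = pd.DataFrame({
    'Salesperson_Key': [1, 1],
    'Sale_Year': [2023, 2024],
    'Amount': [100, 250],
})

# Dimension after an SCD 1 overwrite: the region is now 'West'; 'East' is gone.
dim_salesperson = pd.DataFrame({
    'Salesperson_Key': [1],
    'Region': ['West'],
})

# "Sales by Region" report: join facts to the current dimension and aggregate.
report = (
    fact_sales.merge(dim_salesperson, on='Salesperson_Key')
    .groupby(['Sale_Year', 'Region'], as_index=False)['Amount'].sum()
)
print(report)
```

Both years are attributed to `West`, including the 2023 sale that was actually made in `East`, because the dimension only stores the current region.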

---

## 5. Next Steps
SCD 1 is simple but destroys history. What if we need to know *where* Shubham lived when he made a purchase last year? For that, we need **SCD Type 2**, which we will cover in the next session.