# Data Warehousing - Part 16: Handling Early Facts & Late Dimensions

## 1. The Problem Scenario
In a perfect world, Dimension data (like Products) always arrives before Transaction data (like Orders).
*   *Ideal:* Product 'P-100' is created in the system at 9:00 AM. Order for 'P-100' comes at 10:00 AM. The ETL joins them perfectly.

In the real world, systems are asynchronous or buggy.
*   *Reality:* Order for 'P-100' arrives at 10:00 AM. The Product system is slow and sends the 'P-100' details at 11:00 AM.
*   **Result:** When the Order loads, the surrogate-key lookup against `Dim_Product` finds no match because 'P-100' doesn't exist there yet. This situation is called an **Early Arriving Fact** (equivalently, a **Late Arriving Dimension**).

---

## 2. Python Simulation: The Failure
Let's simulate an ETL process failing because of this issue.

```python
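import pandas as pd
from IPython.display import display  # built into Jupyter; imported explicitly so the code also runs as a script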

# --- 1. Current Dimension Table (Target) ---
# Notice: Product 'P-New' is missing
dim_product = pd.DataFrame({
    'Product_Key': [1, 2], # Surrogate Key
    'Product_ID': ['P-001', 'P-002'], # Natural Key
    'Name': ['Old Pen', 'Old Pencil']
})

# --- 2. Incoming Fact Data (Source) ---
# An order arrives for 'P-New'
fact_source = pd.DataFrame({
    'Order_ID': [1001, 1002],
    'Product_ID': ['P-001', 'P-New'], # 'P-New' is the early fact
    'Amount': [10, 50]
})

print("--- Dimension ---")
display(dim_product)
print("\n--- Incoming Fact ---")
display(fact_source)

# --- 3. Naive ETL (The Failure) ---
# We try to lookup the Surrogate Key
fact_loaded = pd.merge(fact_source, dim_product, on='Product_ID', how='left')

print("\n--- Loaded Fact (with Missing Keys) ---")
# The Product_Key for Order 1002 is NaN (Null). This breaks referential integrity.
display(fact_loaded)
```
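
Note that the left join above does not raise an error; it quietly produces a `NaN` key. A minimal sketch of how the unmatched rows could be flagged explicitly, using the `indicator` option of `pd.merge` (the variable names `checked` and `early_facts` are illustrative, not part of the original pipeline):

```python
import pandas as pd

# Same sample data as in the simulation above
dim_product = pd.DataFrame({
    'Product_Key': [1, 2],
    'Product_ID': ['P-001', 'P-002'],
    'Name': ['Old Pen', 'Old Pencil'],
})
fact_source = pd.DataFrame({
    'Order_ID': [1001, 1002],
    'Product_ID': ['P-001', 'P-New'],
    'Amount': [10, 50],
})

# indicator=True adds a '_merge' column; for a left join its values are
# 'both' (match found) or 'left_only' (no matching dimension row)
checked = pd.merge(
    fact_source,
    dim_product[['Product_Key', 'Product_ID']],
    on='Product_ID',
    how='left',
    indicator=True,
)

# Rows whose dimension member has not arrived yet
early_facts = checked[checked['_merge'] == 'left_only']
print(early_facts[['Order_ID', 'Product_ID', 'Amount']])
```

Flagging these rows up front lets the ETL decide what to do with them instead of loading a `NaN` key by accident.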

---

## 3. Solution 1: The "Unknown" Record (Standard Approach)
Instead of leaving the key as `NULL` (a `NULL` foreign key drops the row from any inner join to the dimension, so report totals silently come up short), we link it to a special "Unknown" or "Dummy" record in the dimension table (often Key = 0 or -1).

*   **Process:**
    1.  Perform lookup.
    2.  If Key is found, use it.
    3.  If Key is NOT found, use `0` (The "Unknown" Member).

```python
# --- Implementing Solution 1: Default to Unknown ---

# 1. Ensure Dimension has a default row (Key=0)
unknown_row = pd.DataFrame({
    'Product_Key': [0],
    'Product_ID': ['UNKNOWN'],
    'Name': ['Unknown Product']
})

# In reality, this is done once during setup
dim_product_with_default = pd.concat([unknown_row, dim_product], ignore_index=True)

print("--- Dimension with Default ---")
display(dim_product_with_default)

# 2. ETL Logic
# Fill NaN with 0
fact_loaded_fixed = fact_loaded.copy()
fact_loaded_fixed['Product_Key'] = fact_loaded_fixed['Product_Key'].fillna(0).astype(int)

print("\n--- Fact Table with Integrity Maintained ---")
display(fact_loaded_fixed)
```
*   **Pros:** Simple, keeps the pipeline running.
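*   **Cons:** Reports show "Unknown Product" for these orders until the dimension row arrives. The fact rows must then be re-keyed in a follow-up job, which is only possible if the natural key (`Product_ID`) was kept on the fact or in a staging table.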
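
The "fix the fact table later" step might look roughly like the sketch below. It assumes the natural key `Product_ID` was kept on the fact rows; the late-arriving product (surrogate key 3, name 'New Marker') and the names `dim_product_late` and `key_lookup` are made up for illustration.

```python
import pandas as pd

# Fact table as loaded under Solution 1: order 1002 points at the Unknown member (Key=0).
# Assumption: the natural key Product_ID is still available on the fact row.
fact_table = pd.DataFrame({
    'Order_ID': [1001, 1002],
    'Product_ID': ['P-001', 'P-New'],
    'Product_Key': [1, 0],
    'Amount': [10, 50],
})

# Dimension after the late 'P-New' record has finally been loaded
# (key 3 and the name are hypothetical values)
dim_product_late = pd.DataFrame({
    'Product_Key': [0, 1, 2, 3],
    'Product_ID': ['UNKNOWN', 'P-001', 'P-002', 'P-New'],
    'Name': ['Unknown Product', 'Old Pen', 'Old Pencil', 'New Marker'],
})

# Lookup: natural key -> surrogate key
key_lookup = dim_product_late.set_index('Product_ID')['Product_Key']

# Re-key only the rows that still point at the Unknown member
is_unknown = fact_table['Product_Key'] == 0
resolved = fact_table.loc[is_unknown, 'Product_ID'].map(key_lookup)

# Products that are still missing stay on Key=0
fact_table.loc[is_unknown, 'Product_Key'] = resolved.fillna(0).astype(int)

print(fact_table)
```

In a real warehouse this would be an `UPDATE` against the fact table, which can be expensive on large tables; that cost is the main trade-off of this approach.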

---

## 4. Solution 2: Inferring the Dimension (Advanced Approach)
If we see a Product ID 'P-New' in the Fact that doesn't exist in the Dimension, we **create it on the fly**.

*   **Process:**
    1.  Identify missing keys in the incoming batch (`P-New`).
    2.  **Insert** `P-New` into `Dim_Product` immediately.
        *   Set attributes to "Pending" or "Inferred".
        *   Assign a new Surrogate Key (e.g., 3).
    3.  Proceed with Fact Load using this new key.
    4.  Later, when the real Product data arrives, **Update** the "Pending" record with real details (Name, Category).

```python
# --- Implementing Solution 2: Inferred Dimension ---

# 1. Identify missing IDs
missing_products = fact_source[~fact_source['Product_ID'].isin(dim_product['Product_ID'])]['Product_ID'].unique()
print(f"Missing Products Detected: {missing_products}")

# 2. Create Inferred Rows
new_keys_start = dim_product['Product_Key'].max() + 1
inferred_rows = []

for i, pid in enumerate(missing_products):
    inferred_rows.append({
        'Product_Key': new_keys_start + i,
        'Product_ID': pid,
        'Name': 'Inferred - Waiting for Data' # Placeholder
    })

# 3. Update Dimension Immediately
dim_product_updated = pd.concat([dim_product, pd.DataFrame(inferred_rows)], ignore_index=True)

print("\n--- Dimension Table (After Inference) ---")
display(dim_product_updated)

# 4. Load Fact (Now the join works!)
fact_final = pd.merge(fact_source, dim_product_updated, on='Product_ID', how='left')
print("\n--- Fact Table Loaded Successfully ---")
display(fact_final[['Order_ID', 'Product_Key', 'Amount']])
```

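*   **Pros:** Facts are linked to their final surrogate key immediately, so no fact rows need re-keying later; only the dimension row is updated when the real data arrives.
*   **Cons:** More complex ETL logic. Reports show the placeholder name in the meantime, and the dimension load must recognize inferred rows and update them instead of inserting duplicates.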
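
Step 4 of the process (updating the "Pending" record once the real product data arrives) is not covered by the code above. A minimal sketch, assuming the late feed is a DataFrame called `late_products` (a hypothetical name, with a made-up product name) and that inferred rows are recognized by their placeholder text:

```python
import pandas as pd

PLACEHOLDER = 'Inferred - Waiting for Data'

# Dimension as it looks after the inference step
dim_product_updated = pd.DataFrame({
    'Product_Key': [1, 2, 3],
    'Product_ID': ['P-001', 'P-002', 'P-New'],
    'Name': ['Old Pen', 'Old Pencil', PLACEHOLDER],
})

# Late-arriving product feed (hypothetical data)
late_products = pd.DataFrame({
    'Product_ID': ['P-New'],
    'Name': ['New Marker'],
})

# Lookup: natural key -> real name
real_names = late_products.set_index('Product_ID')['Name']

# Overwrite only rows that are still inferred; the surrogate key is untouched,
# so fact rows that already reference Key=3 remain valid
is_inferred = dim_product_updated['Name'] == PLACEHOLDER
new_names = dim_product_updated.loc[is_inferred, 'Product_ID'].map(real_names)
dim_product_updated.loc[is_inferred, 'Name'] = new_names.fillna(PLACEHOLDER)

print(dim_product_updated)
```

Matching on the placeholder text keeps the sketch short. A dedicated boolean column (for example `Is_Inferred`) would be a more robust way to mark such rows.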

---

## 5. Course Final Wrap-Up

This concludes the Data Warehousing series! We have journeyed through:
1.  **Foundations:** OLAP vs OLTP, Schemas.
2.  **Modeling:** Dimensions, Facts, Granularity.
3.  **ETL:** SCD Types, Loading Strategies.
4.  **Advanced:** Handling architectural challenges like Early Facts.

You now have a complete toolkit for designing and understanding Data Warehouse systems.