# Data Warehousing - Part 4: Data Loading Strategies (ETL & ELT)

## 1. The Data Flow: Upstream to Downstream

In the Data Warehousing ecosystem, we constantly move data between two types of systems:

1.  **Upstream (Source):** The operational systems creating data (OLTP, E-commerce sites, CRM).
2.  **Downstream (Target):** The analytical systems receiving data (Data Warehouse, Data Marts).

### The ETL Process
To move data, we use a process called **ETL**:
*   **E**xtract: Read data from the Source (e.g., MySQL).
*   **T**ransform: Clean, format, and map the data.
*   **L**oad: Write data to the Target (e.g., Data Warehouse).

### Scenario: The "Country Code" Problem
As discussed in the lecture, user input in Source systems can be messy. Let's simulate a scenario where users type their country manually.

```python
import pandas as pd
from datetime import datetime

# 1. UPSTREAM (Source System - Raw Data)
# Users typed their country by hand: 'IN', 'IND', and 'INDIA' all mean India.
source_data = {
    'UserID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'User_Input_Country': ['IN', 'IND', 'INDIA', 'AUS'],
    'Signup_Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
}

df_source = pd.DataFrame(source_data)

print("--- Upstream (Source) Data ---")
display(df_source)
```

```python
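# 2. TRANSFORMATION (the 'T' in ETL)
# We map these variations to one standardized country name.

def transform_country(country_raw):
    """Return a standardized country name, or 'UNKNOWN' if unrecognized."""
    # Guard against missing or non-text values (e.g., NaN) before string ops.
    if not isinstance(country_raw, str):
        return 'UNKNOWN'
    c = country_raw.strip().upper()
    if c in ['IN', 'IND', 'INDIA']:
        return 'INDIA'
    elif c in ['AUS', 'AUSTRALIA']:
        return 'AUSTRALIA'
    return 'UNKNOWN'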

# Apply transformation
df_transformed = df_source.copy()
df_transformed['Standardized_Country'] = df_transformed['User_Input_Country'].apply(transform_country)

# Drop the messy column (optional, depending on requirements)
df_transformed = df_transformed.drop(columns=['User_Input_Country'])

print("--- Transformed Data ---")
display(df_transformed)

# 3. LOAD (Downstream)
# This clean dataframe is what gets written to the Data Warehouse table.
df_warehouse = df_transformed.copy()
```
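
As the list of spellings grows, an `if/elif` chain becomes hard to maintain. A minimal sketch of the same transformation driven by a lookup table follows; the `COUNTRY_LOOKUP` dictionary and the sample values are illustrative assumptions, not part of the lecture data.

```python
import pandas as pd

# Hypothetical lookup table: raw spelling -> standardized name.
COUNTRY_LOOKUP = {
    'IN': 'INDIA', 'IND': 'INDIA', 'INDIA': 'INDIA',
    'AUS': 'AUSTRALIA', 'AUSTRALIA': 'AUSTRALIA',
}

# Sample input with mixed case, stray spaces, a missing value, and an unknown code.
raw = pd.Series(['IN', ' ind ', 'India', 'AUS', None, 'XYZ'])

# Normalize whitespace and case, look each value up,
# and label anything unmatched (including missing values) as 'UNKNOWN'.
standardized = (
    raw.str.strip()
       .str.upper()
       .map(COUNTRY_LOOKUP)
       .fillna('UNKNOWN')
)

print(standardized.tolist())
```

In practice such a mapping would likely live in a reference table rather than in code, so that new spellings can be added without redeploying the job.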

---

## 2. ETL vs. ELT

Both approaches share the same goal. They differ in the **order** of operations and in **where** the transformation runs, a choice usually driven by performance and cost.

### A. ETL (Extract -> Transform -> Load)
*   **Process:** Data is extracted, transformed in a separate processing engine (like an ETL tool or Python script), and *then* loaded into the Warehouse.
*   **Use Case:** When the transformation logic is complex or the Data Warehouse compute is expensive.

### B. ELT (Extract -> Load -> Transform)
*   **Process:** Data is extracted and loaded *immediately* into a temporary/staging table in the Warehouse. The transformation happens **inside the database** (using SQL) before moving to the final table.
*   **Advantage:** Leverages the processing power of modern cloud Data Warehouses (Snowflake, Redshift, BigQuery) to do the heavy lifting.
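
The ELT flow can be sketched with Python's built-in `sqlite3` module. This is only an illustration: an in-memory SQLite database stands in for the warehouse, and the table names `stg_users` and `dim_users` are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(':memory:')

# E + L: land the raw rows, untouched, in a staging table.
conn.execute(
    "CREATE TABLE stg_users (user_id INTEGER, name TEXT, country_raw TEXT)"
)
raw_rows = [
    (101, 'Alice', 'IN'),
    (102, 'Bob', 'IND'),
    (103, 'Charlie', 'INDIA'),
    (104, 'David', 'AUS'),
]
conn.executemany("INSERT INTO stg_users VALUES (?, ?, ?)", raw_rows)

# T: transform inside the database with SQL and write the final table.
conn.execute("""
    CREATE TABLE dim_users AS
    SELECT user_id,
           name,
           CASE
               WHEN UPPER(TRIM(country_raw)) IN ('IN', 'IND', 'INDIA') THEN 'INDIA'
               WHEN UPPER(TRIM(country_raw)) IN ('AUS', 'AUSTRALIA') THEN 'AUSTRALIA'
               ELSE 'UNKNOWN'
           END AS standardized_country
    FROM stg_users
""")

final_rows = conn.execute(
    "SELECT user_id, standardized_country FROM dim_users ORDER BY user_id"
).fetchall()
print(final_rows)
conn.close()
```

The raw values stay available in the staging table, so the SQL transformation can be corrected and re-run without extracting from the source again.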

---



## 3. Data Loading Strategies

How do we actually move the data? Do we move everything every day? There are two main strategies.

### Strategy A: Full Load (Truncate Load)
*   **Definition:** Every time the job runs, we delete (**Truncate**) everything in the target table and reload all data from the source.
*   **Use Case:** Small reference tables (e.g., a list of States, Categories) or initial data migration.
*   **Pros:** Simple to implement. No need to track changes.
*   **Cons:** Very slow and expensive for large datasets (e.g., Transaction tables with millions of rows).

```python
# Simulation of Full Load

# Current state of Data Warehouse
dw_table = pd.DataFrame({'ID': [1, 2], 'Data': ['A', 'B']})
print("DW Before Load:", dw_table.values)

# The Source now holds the complete dataset, including one new row (ID 3)
source_data_new = pd.DataFrame({'ID': [1, 2, 3], 'Data': ['A', 'B', 'C']})

# --- FULL LOAD PROCESS ---
# 1. Truncate (empty) the target: remove all rows but keep the columns
dw_table = dw_table.iloc[0:0]

# 2. Load everything
dw_table = source_data_new.copy()

print("DW After Full Load:", dw_table.values)
```

### Strategy B: Incremental Load (Delta Load)
*   **Definition:** We only load the data that has **changed** (inserted or updated) since the last load.
*   **Mechanism:** We rely on **Audit Columns** (Watermark columns) like `created_at` or `updated_at`.
*   **Use Case:** Large Fact tables (Sales, Orders, Logs).

#### Simulation: Identifying the Delta
Imagine our Warehouse was last updated on **Jan 10th**. We only want records created or updated *after* Jan 10th, which we find by comparing the source's `Updated_At` column against that watermark.

```python
# 1. State of Data Warehouse (Last updated Jan 10)
warehouse_data = {
    'Order_ID': [1, 2],
    'Amount': [100, 200],
    'Updated_At': [
        datetime(2023, 1, 9), 
        datetime(2023, 1, 10)
    ]
}
df_dw = pd.DataFrame(warehouse_data)

# Identify the "Watermark" (Max Date in DW)
watermark_date = df_dw['Updated_At'].max()
print(f"--- Watermark Date (Last Load): {watermark_date} ---")

# 2. Source System (Contains old AND new data)
source_data = {
    'Order_ID': [1, 2, 3, 4], # 3 and 4 are new
    'Amount': [100, 200, 150, 300],
    'Updated_At': [
        datetime(2023, 1, 9), 
        datetime(2023, 1, 10),
        datetime(2023, 1, 11), # New
        datetime(2023, 1, 12)  # New
    ]
}
df_source_oltp = pd.DataFrame(source_data)

# 3. INCREMENTAL LOGIC
# Filter Source where date > watermark
delta_records = df_source_oltp[df_source_oltp['Updated_At'] > watermark_date]

print("\n--- Delta Records (Incremental Feed) ---")
display(delta_records)

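# 4. Append to Warehouse
# A plain append is safe here because every delta row is a new Order_ID.
# If an existing order had been updated, appending would create a duplicate,
# so a real pipeline would merge (upsert) on the key instead.
df_dw_updated = pd.concat([df_dw, delta_records], ignore_index=True)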
print("\n--- Final Warehouse State ---")
display(df_dw_updated)
```
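
The simulation above only covers inserts. When the delta also contains *updated* rows, appending would leave two versions of the same order in the warehouse. One possible way to handle this in pandas is sketched below: keep only the latest version of each key. The delta values here are made up for illustration, and a real warehouse would typically do this with a SQL `MERGE` or equivalent upsert statement.

```python
import pandas as pd
from datetime import datetime

# Warehouse state as of the Jan 10 watermark.
df_dw = pd.DataFrame({
    'Order_ID': [1, 2],
    'Amount': [100, 200],
    'Updated_At': [datetime(2023, 1, 9), datetime(2023, 1, 10)],
})

# Hypothetical delta: order 2 was modified after the watermark, order 3 is new.
delta = pd.DataFrame({
    'Order_ID': [2, 3],
    'Amount': [250, 150],
    'Updated_At': [datetime(2023, 1, 11), datetime(2023, 1, 11)],
})

# Upsert: stack old and new rows, then keep the most recent row per Order_ID.
merged = (
    pd.concat([df_dw, delta], ignore_index=True)
      .sort_values('Updated_At', kind='stable')
      .drop_duplicates(subset='Order_ID', keep='last')
      .sort_values('Order_ID')
      .reset_index(drop=True)
)

print(merged)
```

Order 2 ends up with its new amount and order 3 is inserted, while order 1 is left untouched.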

---

## 4. Summary Table

| Feature | Full Load | Incremental Load |
| :--- | :--- | :--- |
| **Method** | Erase and Replace | Find changes & Append/Merge |
| **Volume** | Loads 100% of the data on every run | Loads only new or changed rows (deltas) |
| **Performance** | Slows down as the table grows | Scales with the size of the change, not the table |
| **Complexity** | Low | Higher (must track a watermark and handle updates/duplicates) |
| **Best For** | Master Data, Lookup Tables | Transactional Data, Big Data |

---

## 5. Next Steps
Now that we know how to move data, we need to understand how to structure it. In the next session, we will discuss the building blocks of Data Modeling: **Measures and Attributes**.