# Data Warehousing - Part 7: Demo - Designing a Star Schema

## 1. The Scenario: The Billing Invoice
In this demo, we will act as Data Architects. Our client has given us a raw **Billing Invoice** (OLTP Data) and asked us to build a Data Warehouse to analyze their sales.

### Step 1: Analyze the Raw Data
Let's look at the data we receive from the source system. An invoice extract like this is typically flattened: every line repeats the order, customer, store, and product details alongside the numbers.

```python
import pandas as pd

# Simulating the raw data from a Billing System (OLTP)
# Notice how everything is mixed together: Order info, Customer info, Product info, Store info.
raw_data = {
    'Order_ID': ['ORD-101', 'ORD-101', 'ORD-102'],
    'Order_Date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'Customer_Name': ['Alice Smith', 'Alice Smith', 'Bob Jones'],
    'Customer_Address': ['123 Maple St', '123 Maple St', '456 Oak Ave'],
    'Store_Name': ['NY-Main', 'NY-Main', 'SF-Bay'],
    'Store_Manager': ['John Doe', 'John Doe', 'Jane Smith'],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Notebook'],
    'Category': ['Stationery', 'Stationery', 'Paper'],
    'Quantity': [10, 5, 2],         # Measure
    'Unit_Price': [1.50, 1.50, 5.00], # Attribute (of product) / Measure context
    'Line_Total': [15.00, 7.50, 10.00], # Measure
    'Tax': [1.50, 0.75, 1.00]       # Measure
}

df_source = pd.DataFrame(raw_data)

print("--- Raw Source Data (The Invoice) ---")
display(df_source)
```

---

## 2. Step 2: Define KPIs and Grain
Before designing tables, we must answer two questions:

1.  **What are the KPIs?**
    *   *Client Request:* "I want to see Total Sales per Store, per Day, and per Product."
2.  **What is the Grain?**
    *   To answer "per Product", we cannot summarize by Order. We need the lowest level of detail.
    *   **Grain:** One row per **Line Item** in an invoice.
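
A declared grain can be checked against the data before any tables are designed. The sketch below assumes the grain is identified by the pair (`Order_ID`, `Product_Name`), which is an assumption of this example: a real invoice would more likely carry an explicit line number. The `invoice_lines` frame is a hypothetical miniature of the source data.

```python
import pandas as pd

# Hypothetical miniature of the invoice: only the columns needed to test the grain.
invoice_lines = pd.DataFrame({
    "Order_ID": ["ORD-101", "ORD-101", "ORD-102"],
    "Product_Name": ["Red Pen", "Blue Pen", "Notebook"],
    "Quantity": [10, 5, 2],
})

# Assumed grain: one row per (Order_ID, Product_Name).
grain_cols = ["Order_ID", "Product_Name"]

# keep=False marks every row that shares its grain columns with another row.
grain_violations = invoice_lines[invoice_lines.duplicated(subset=grain_cols, keep=False)]
grain_is_unique = grain_violations.empty

print("Grain is unique:", grain_is_unique)
```

If `grain_violations` is not empty, either the grain is defined too coarsely or the source contains duplicate lines that need to be resolved first.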

---

## 3. Step 3: Identify Measures vs. Attributes
We categorize the columns from our source data:

| Column | Type | Destination |
| :--- | :--- | :--- |
| `Quantity` | Measure | Fact Table |
| `Line_Total` | Measure | Fact Table |
| `Tax` | Measure | Fact Table |
| `Customer_Name`, `Customer_Address` | Attribute | Customer Dimension |
| `Store_Name`, `Store_Manager` | Attribute | Store Dimension |
| `Product_Name`, `Category`, `Unit_Price` | Attribute | Product Dimension |
| `Order_Date` | Attribute | Fact Table (date key; usually links to a Date Dimension) |
| `Order_ID` | Attribute | Fact Table (Degenerate Dimension) |
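
One way to keep this classification honest is to record it in code and check that no source column was left out. This is only a sketch: the `column_roles` dictionary and the `source_columns` list are hypothetical names, and the destination of any column not listed in the table above is an assumption based on the entity that column describes.

```python
# Role and destination for each source column.
# Entries not present in the table are assumptions of this sketch.
column_roles = {
    "Order_ID": ("Attribute", "Fact Table (Degenerate Dimension)"),
    "Order_Date": ("Attribute", "Fact Table (date key)"),
    "Customer_Name": ("Attribute", "Customer Dimension"),
    "Customer_Address": ("Attribute", "Customer Dimension"),
    "Store_Name": ("Attribute", "Store Dimension"),
    "Store_Manager": ("Attribute", "Store Dimension"),
    "Product_Name": ("Attribute", "Product Dimension"),
    "Category": ("Attribute", "Product Dimension"),
    "Unit_Price": ("Attribute", "Product Dimension"),
    "Quantity": ("Measure", "Fact Table"),
    "Line_Total": ("Measure", "Fact Table"),
    "Tax": ("Measure", "Fact Table"),
}

# Column header of the invoice extract (stand-in for list(df_source.columns)).
source_columns = [
    "Order_ID", "Order_Date", "Customer_Name", "Customer_Address",
    "Store_Name", "Store_Manager", "Product_Name", "Category",
    "Quantity", "Unit_Price", "Line_Total", "Tax",
]

# Any column that appears in the source but has no role yet.
unclassified = sorted(set(source_columns) - set(column_roles))
measures = [col for col, (kind, _) in column_roles.items() if kind == "Measure"]

print("Unclassified columns:", unclassified)
print("Measures:", measures)
```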

---

## 4. Step 4: Create Dimension Tables
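We extract the attributes into separate tables, with one row per unique entity. In a real warehouse, each dimension row is assigned a **Surrogate Key** (an auto-incrementing integer such as `Customer_Key`) instead of being identified by strings such as names. Integer keys are cheaper to join on, and they stay stable even if a descriptive value in the source is renamed.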

```python
# --- 1. Create Product Dimension ---
# Extract unique products
df_product_dim = df_source[['Product_Name', 'Category', 'Unit_Price']].drop_duplicates().reset_index(drop=True)
# Assign Surrogate Key
df_product_dim['Product_Key'] = df_product_dim.index + 1

print("--- Product Dimension ---")
display(df_product_dim)

# --- 2. Create Customer Dimension ---
df_customer_dim = df_source[['Customer_Name', 'Customer_Address']].drop_duplicates().reset_index(drop=True)
df_customer_dim['Customer_Key'] = df_customer_dim.index + 1

print("\n--- Customer Dimension ---")
display(df_customer_dim)

# --- 3. Create Store Dimension ---
df_store_dim = df_source[['Store_Name', 'Store_Manager']].drop_duplicates().reset_index(drop=True)
df_store_dim['Store_Key'] = df_store_dim.index + 1

print("\n--- Store Dimension ---")
display(df_store_dim)
```
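
The client's KPI also slices by day, so a date dimension is a natural fourth table, although this demo keeps `Order_Date` as a plain column. The sketch below shows one possible shape for it. The `Date_Key` in `YYYYMMDD` integer form and the attribute columns (`Year`, `Month`, `Day_Name`) are assumptions chosen for illustration, not something the source system provides.

```python
import pandas as pd

# The order dates as they appear in the invoice extract.
order_dates = pd.Series(["2023-01-01", "2023-01-01", "2023-01-02"], name="Order_Date")

# One row per distinct calendar day, in chronological order.
dates = (
    pd.to_datetime(order_dates)
    .drop_duplicates()
    .sort_values()
    .reset_index(drop=True)
)

df_date_dim = pd.DataFrame({
    "Date_Key": dates.dt.strftime("%Y%m%d").astype(int),  # assumed key format, e.g. 20230101
    "Full_Date": dates,
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day_Name": dates.dt.day_name(),
})

print(df_date_dim)
```

A production date dimension is usually generated for a whole calendar range up front rather than derived from the dates that happen to appear in the sales data.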

---

## 5. Step 5: Create the Fact Table
Now we create the central table. This involves:
1.  Taking the source data.
2.  Joining with Dimensions to get the **Keys** (Product_Key, Customer_Key, etc.).
3.  Keeping only the **Keys** and **Measures**.

*Note: `Order_ID` remains in the Fact Table but doesn't link to a separate dimension. This is called a **Degenerate Dimension**.*

```python
# Start with source
df_fact = df_source.copy()

# Join to get Product Key
df_fact = pd.merge(df_fact, df_product_dim, on=['Product_Name', 'Category', 'Unit_Price'])

# Join to get Customer Key
df_fact = pd.merge(df_fact, df_customer_dim, on=['Customer_Name', 'Customer_Address'])

# Join to get Store Key
df_fact = pd.merge(df_fact, df_store_dim, on=['Store_Name', 'Store_Manager'])

# Select only Keys and Measures for the final Fact Table
cols_to_keep = [
    'Order_Date',       # Date Key (usually links to a Date Dim)
    'Order_ID',         # Degenerate Dimension
    'Product_Key',      # Foreign Key
    'Customer_Key',     # Foreign Key
    'Store_Key',        # Foreign Key
    'Quantity',         # Measure
    'Line_Total',       # Measure
    'Tax'               # Measure
]

df_fact_final = df_fact[cols_to_keep]

print("--- Final Sales Fact Table ---")
display(df_fact_final)
```
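
Key lookups are a common place to lose data silently: `pd.merge` defaults to an inner join, so a source row with no matching dimension row simply disappears. The sketch below shows a few sanity checks on a hypothetical miniature (`df_src`, `df_store`), using a left join so that unmatched rows would show up as missing keys instead of vanishing.

```python
import pandas as pd

# Hypothetical miniature of the source and of one dimension.
df_src = pd.DataFrame({
    "Order_ID": ["ORD-101", "ORD-101", "ORD-102"],
    "Store_Name": ["NY-Main", "NY-Main", "SF-Bay"],
    "Line_Total": [15.00, 7.50, 10.00],
})
df_store = pd.DataFrame({
    "Store_Key": [1, 2],
    "Store_Name": ["NY-Main", "SF-Bay"],
})

# how="left" keeps every source row; validate="many_to_one" raises a MergeError
# if the dimension contains duplicate Store_Name values.
df_fact_check = df_src.merge(df_store, on="Store_Name", how="left", validate="many_to_one")

# 1. No rows were added or dropped by the lookup.
rows_preserved = len(df_fact_check) == len(df_src)
# 2. Every fact row found its dimension key.
missing_keys = int(df_fact_check["Store_Key"].isna().sum())
# 3. The measures still add up to the source total.
totals_match = abs(df_fact_check["Line_Total"].sum() - df_src["Line_Total"].sum()) < 1e-9

print(rows_preserved, missing_keys, totals_match)
```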

---

## 6. The Result: Star Schema
We have successfully transformed a flat, messy invoice list into a structured **Star Schema**:

*   **Center:** `Sales_Fact` (`df_fact_final` in the code; contains the measures and the keys).
*   **Points of the Star:** `Product_Dim`, `Customer_Dim`, `Store_Dim` (the three `df_*_dim` frames).
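
With the star in place, the client's original request ("Total Sales per Store, per Day, and per Product") becomes a join from the fact table out to the dimensions followed by a group-by. The sketch below uses hypothetical miniature tables whose keys match what the demo data would produce; `Total_Sales` is a name chosen here for the aggregated `Line_Total`.

```python
import pandas as pd

# Hypothetical miniature star schema.
sales_fact = pd.DataFrame({
    "Order_Date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "Product_Key": [1, 2, 3],
    "Store_Key": [1, 1, 2],
    "Line_Total": [15.00, 7.50, 10.00],
})
product_dim = pd.DataFrame({
    "Product_Key": [1, 2, 3],
    "Product_Name": ["Red Pen", "Blue Pen", "Notebook"],
})
store_dim = pd.DataFrame({
    "Store_Key": [1, 2],
    "Store_Name": ["NY-Main", "SF-Bay"],
})

# Join the fact table to the dimensions it points at, then aggregate the measure.
report = (
    sales_fact
    .merge(store_dim, on="Store_Key", how="left")
    .merge(product_dim, on="Product_Key", how="left")
    .groupby(["Store_Name", "Order_Date", "Product_Name"], as_index=False)["Line_Total"]
    .sum()
    .rename(columns={"Line_Total": "Total_Sales"})
)

# Rolling up to a coarser level is just another group-by on the same result.
sales_by_store = report.groupby("Store_Name")["Total_Sales"].sum()

print(report)
print(sales_by_store)
```

Because the fact table is stored at line-item grain, the same data can answer questions at any coarser level (per store, per day, per category) without redesigning the tables.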

### Why did we do this?
Now, if the client asks: *"Change the Store Manager for NY-Main to 'Mike'"*:
*   **Old Way (Flat file):** We have to update every sales row ever recorded for NY-Main, which could be millions of rows.
*   **New Way (Star Schema):** We update **1 row** in the `Store_Dim`. The Fact table only stores the `Store_Key`, so any query that joins to the dimension picks up the new manager. Whether past sales should show the new manager or the old one depends on the history strategy we choose.
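
The single-row update can be sketched on a hypothetical miniature of the schema. Note that this example overwrites the manager in place, which is only one possible strategy: the previous manager is lost, and all historical sales are attributed to 'Mike'.

```python
import pandas as pd

# Hypothetical miniature of the store dimension and the fact table.
store_dim = pd.DataFrame({
    "Store_Key": [1, 2],
    "Store_Name": ["NY-Main", "SF-Bay"],
    "Store_Manager": ["John Doe", "Jane Smith"],
})
sales_fact = pd.DataFrame({
    "Order_ID": ["ORD-101", "ORD-101", "ORD-102"],
    "Store_Key": [1, 1, 2],
    "Line_Total": [15.00, 7.50, 10.00],
})

# Overwrite the manager on the one dimension row for NY-Main.
is_ny_main = store_dim["Store_Name"] == "NY-Main"
rows_updated = int(is_ny_main.sum())
store_dim.loc[is_ny_main, "Store_Manager"] = "Mike"

# The fact table is untouched; the join picks up the new value.
report = sales_fact.merge(store_dim, on="Store_Key", how="left")
managers = report["Store_Manager"].tolist()

print("Dimension rows updated:", rows_updated)
print(managers)
```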

---

## 7. Next Steps
We have built the structure, but the world changes: prices change and customers move to new addresses. How do we handle changes in data over time? In the next video, we will cover **Slowly Changing Dimensions (SCD)**.