# Data Warehousing - Part 15: Schema Architecture (Star vs. Snowflake)

## 1. Introduction to Schemas
How do we arrange our Fact and Dimension tables? The arrangement is called a **Schema**.
The two most dominant architectures in Data Warehousing are:
1.  **Star Schema:** The gold standard for performance and simplicity.
2.  **Snowflake Schema:** A variation that normalizes dimensions to save storage, at the cost of extra joins and complexity.

---

## 2. The Star Schema
*   **Structure:** A central Fact table connected directly to multiple Dimension tables.
*   **Visual:** Looks like a star.
*   **Characteristics:**
    *   Dimensions are **Denormalized**. (e.g., Product, Category, and Brand are all in one `Dim_Product` table).
    *   Every Dimension attribute is only **one join** away from the Fact table.
    *   **Pros:** Simple queries and typically fast read performance (fewer joins).
    *   **Cons:** Data redundancy (e.g., the "Stationery" category is repeated for every pen).

### Python Simulation: Star Schema
In a Star Schema, the Product Dimension is flat.

```python
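import pandas as pd
# display() is built into Jupyter/IPython notebooks; swap in print() when running as a plain script.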

# --- 1. Denormalized Dimension (Star Schema Style) ---
# Notice 'Category' and 'Brand' are repeated. This is redundancy.
dim_product_star = pd.DataFrame({
    'Product_Key': [1, 2, 3],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Notebook'],
    'Category_Name': ['Stationery', 'Stationery', 'Paper'], # Redundant
    'Brand_Name': ['Camlin', 'Camlin', 'Classmate']          # Redundant
})

print("--- Star Schema: Denormalized Dimension ---")
display(dim_product_star)

# --- 2. Fact Table ---
fact_sales = pd.DataFrame({
    'Order_ID': [101, 102],
    'Product_Key': [1, 3],
    'Amount': [10, 20]
})

# --- 3. Querying (Simple Join) ---
# Only ONE join needed to get Category sales
report_star = pd.merge(fact_sales, dim_product_star, on='Product_Key')
print("\n--- Star Schema Query Result ---")
display(report_star[['Order_ID', 'Category_Name', 'Amount']])
```
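
The join above returns one row per order. A category-level sales report usually adds an aggregation step on top of that single join. The sketch below shows one way to do it with `groupby`; it rebuilds the same sample tables so it runs on its own, and the name `category_sales` is introduced here purely for illustration.

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell runs independently.
dim_product_star = pd.DataFrame({
    'Product_Key': [1, 2, 3],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Notebook'],
    'Category_Name': ['Stationery', 'Stationery', 'Paper'],
    'Brand_Name': ['Camlin', 'Camlin', 'Classmate']
})
fact_sales = pd.DataFrame({
    'Order_ID': [101, 102],
    'Product_Key': [1, 3],
    'Amount': [10, 20]
})

# One join (Fact -> Dimension), then sum the measure per category.
category_sales = (
    pd.merge(fact_sales, dim_product_star, on='Product_Key', how='left')
      .groupby('Category_Name', as_index=False)['Amount']
      .sum()
)
print(category_sales)
```

Because the category name already lives in `dim_product_star`, no second lookup is needed before grouping.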

---

## 3. The Snowflake Schema
*   **Structure:** Dimensions are normalized into multiple related tables.
*   **Visual:** Looks like a snowflake (branching out).
*   **Characteristics:**
    *   Dimensions are **Normalized**. (e.g., `Dim_Product` holds foreign keys to separate `Dim_Category` and `Dim_Brand` lookup tables).
    *   **Multiple joins** are required to reach attributes stored in the outer tables.
    *   **Pros:** Less storage, easier to maintain consistency (update a Category name in one place).
    *   **Cons:** More complex queries and typically slower reads (more joins).

### Python Simulation: Snowflake Schema
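In a Snowflake Schema, we split the Product dimension. To keep the example short, only Category is moved into its own lookup table here; Brand would be split out the same way.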

```python
# --- 1. Normalized Dimensions (Snowflake Style) ---
# Reuses `pd` and `fact_sales` from the Star Schema example above.

# Table A: Category (Lookup)
dim_category_snow = pd.DataFrame({
    'Category_ID': [10, 20],
    'Category_Name': ['Stationery', 'Paper']
})

# Table B: Product (Links to Category)
dim_product_snow = pd.DataFrame({
    'Product_Key': [1, 2, 3],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Notebook'],
    'Category_ID': [10, 10, 20] # Foreign Key to Category
})

print("--- Snowflake: Product Dimension ---")
display(dim_product_snow)
print("\n--- Snowflake: Category Dimension ---")
display(dim_category_snow)

# --- 2. Querying (Complex Join) ---
# We need TWO joins to get the same result: Fact -> Product -> Category
report_snow = pd.merge(fact_sales, dim_product_snow, on='Product_Key')
report_snow = pd.merge(report_snow, dim_category_snow, on='Category_ID')

print("\n--- Snowflake Schema Query Result ---")
display(report_snow[['Order_ID', 'Category_Name', 'Amount']])
```
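
The consistency benefit listed under Pros can be illustrated with a rename. This is a minimal sketch under assumed data: the rename from "Stationery" to "Office Supplies" is hypothetical, and the counters `star_rows_touched` and `snow_rows_touched` exist only for this illustration.

```python
import pandas as pd

# Star: the category name is stored on every product row.
dim_product_star = pd.DataFrame({
    'Product_Key': [1, 2, 3],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Notebook'],
    'Category_Name': ['Stationery', 'Stationery', 'Paper']
})

# Snowflake: the category name is stored once, in a lookup table.
dim_category_snow = pd.DataFrame({
    'Category_ID': [10, 20],
    'Category_Name': ['Stationery', 'Paper']
})

# Hypothetical rename: 'Stationery' -> 'Office Supplies'.
star_mask = dim_product_star['Category_Name'] == 'Stationery'
snow_mask = dim_category_snow['Category_Name'] == 'Stationery'

star_rows_touched = int(star_mask.sum())  # one row per matching product
snow_rows_touched = int(snow_mask.sum())  # a single lookup row

dim_product_star.loc[star_mask, 'Category_Name'] = 'Office Supplies'
dim_category_snow.loc[snow_mask, 'Category_Name'] = 'Office Supplies'

print("Star rows updated:", star_rows_touched)
print("Snowflake rows updated:", snow_rows_touched)
```

In the Star version every affected product row has to be rewritten, so a missed row would leave two spellings of the same category. In the Snowflake version there is only one row to change.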

---

## 4. Which one to choose?

| Feature | Star Schema | Snowflake Schema |
| :--- | :--- | :--- |
| **Simplicity** | High (Easy for humans and BI tools) | Low (Complex relationships) |
| **Performance** | **Fast** (Fewer joins) | Slower (More joins) |
| **Storage** | Higher (Redundancy) | Lower (Normalized) |
| **ETL Complexity** | Higher (Need to denormalize) | Lower (Load tables as-is) |
| **Modern Usage** | **Preferred** (Storage is cheap, compute is fast) | Less common (reserved for specific cases, such as large, shared hierarchies) |
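
The "Need to denormalize" entry in the ETL Complexity row refers to a flattening step that runs before the dimension is loaded. The sketch below shows one possible version of that step, assuming source tables shaped like the Snowflake example. `validate='many_to_one'` is a standard `pandas.merge` option that raises an error if the lookup table contains duplicate keys.

```python
import pandas as pd

# Normalized source tables (assumed to mirror the Snowflake example).
dim_category_snow = pd.DataFrame({
    'Category_ID': [10, 20],
    'Category_Name': ['Stationery', 'Paper']
})
dim_product_snow = pd.DataFrame({
    'Product_Key': [1, 2, 3],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Notebook'],
    'Category_ID': [10, 10, 20]
})

# Flatten: pull the category name onto each product row, then drop the
# foreign key that the Star-style dimension no longer needs.
dim_product_star = (
    pd.merge(dim_product_snow, dim_category_snow,
             on='Category_ID', how='left', validate='many_to_one')
      .drop(columns='Category_ID')
)
print(dim_product_star)
```

The join cost is paid once during the load, so report queries against the flattened table do not have to repeat it.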

**Industry Trend:**
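With modern columnar cloud warehouses (Snowflake, BigQuery, Redshift), storage is cheap and columnar compression shrinks repeated values, so the redundancy in denormalized dimensions costs little. (Snowflake the product is unrelated to the Snowflake Schema.) As a result, **Star Schema is generally preferred**: it simplifies queries for business users, and BI tools such as Tableau and Power BI work best with it.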
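
As a rough analogy for why repeated dimension values compress well, the sketch below uses the pandas `category` dtype, which stores each distinct string once plus a small integer code per row. This is an illustration only: the row count and values are assumed, and the numbers say nothing about the storage format of any particular warehouse.

```python
import pandas as pd

# Assumed sample: 100,000 rows drawn from only two category names.
n_rows = 100_000
category_names = pd.Series(['Stationery', 'Paper'] * (n_rows // 2))

# Memory used when every row carries its own string value.
raw_bytes = category_names.memory_usage(index=False, deep=True)

# Dictionary-style encoding: distinct values stored once, rows hold codes.
encoded = category_names.astype('category')
encoded_bytes = encoded.memory_usage(index=False, deep=True)

print(f"Plain strings:      {raw_bytes:,} bytes")
print(f"Dictionary-encoded: {encoded_bytes:,} bytes")
```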

---

## 5. Course Conclusion
This brings us to the end of the Data Warehousing Fundamentals course.

You have learned the journey from:
1.  **Source Systems (OLTP)**
2.  **Loading Strategies (ETL/ELT)**
3.  **Dimension Modeling (SCD 1/2/3)**
4.  **Fact Modeling (Transactional/Snapshot)**
5.  **Schema Design (Star/Snowflake)**

**Final Challenge:**
Take a dataset (like the Kaggle Retail Dataset), define KPIs, sketch a Bus Matrix and a Star Schema on paper, and implement the ETL using Python.

Thank you for completing the course! Happy Data Engineering.