# Data Warehousing - Part 5: Measures and Attributes

## 1. The Building Blocks of Data
In a Data Warehouse, every piece of data generally falls into one of two categories:
1.  **Measures:** The numbers we want to analyze (Quantitative).
2.  **Attributes (Dimensions):** The context for those numbers (Descriptive).

### What makes a Measure?
A common misconception is that "any column with numbers is a measure." This is **FALSE**.
A measure must have a **logical meaning after aggregation**.

*   **Sales Amount:** If I add Sales from today and yesterday ($100 + $200 = $300), does it make sense? **Yes.** -> **Measure**.
*   **Room Number / User ID:** If I add Room 101 and Room 102 (101 + 102 = 203), does it make sense? **No.** -> **Attribute**.

```python
import pandas as pd

# Data Sample
data = {
    'Transaction_ID': [1001, 1002], # Number, but Attribute
    'Room_Number': [101, 102],      # Number, but Attribute
    'Sales_Amount': [500, 700]      # Number, Measure
}
df = pd.DataFrame(data)

print("--- Data Sample ---")
print(df)  # print() also works outside a notebook, where display() is undefined

# Logical Test
print(f"\nSum of Sales: {df['Sales_Amount'].sum()} (Logical - We made $1200)")
print(f"Sum of Room Numbers: {df['Room_Number'].sum()} (Illogical - Meaningless number)")
```

---

## 2. Types of Measures
Not all measures behave the same way. We categorize them based on how they can be aggregated (Summed) across different dimensions (like Time, Store, Product).

### A. Additive Measures
These are the easiest to handle. They can be summed across **ALL** dimensions.
*   **Example:** `Total Sales`.
*   You can sum sales across Products, across Stores, and across Dates.

### B. Semi-Additive Measures
These can be summed across **SOME** dimensions, but usually **NOT across Time (Date)**.
*   **Example:** `Inventory Balance` (Stock Quantity).
*   *Scenario:*
    *   Jan 1st Stock: 50 items.
    *   Jan 2nd Stock: 60 items.
*   *Logic:*
    *   Sum across Stores? Yes. (Store A has 10 + Store B has 20 = Total 30 items today).
    *   Sum across Dates? **NO.** (50 items on Jan 1 + 60 items on Jan 2 != 110 items). The stock is a snapshot.

### C. Non-Additive Measures
These cannot be summed across **ANY** dimension.
*   **Example:** `Profit Margin %`, `Temperature`, `Unit Price`.
*   *Logic:*
    *   Product A Margin: 10%
    *   Product B Margin: 20%
    *   Total Margin is **NOT** 30%. You must recalculate the margin based on total Cost and Revenue.
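
The recalculation described above can be sketched in a few lines. The revenue and cost figures below are hypothetical, chosen only so that Product A comes out at a 10% margin and Product B at 20%. The `products` table and its column names are assumptions for illustration, not part of the fact table used later.

```python
import pandas as pd

# Hypothetical figures: A earns 10% on 1000, B earns 20% on 200.
products = pd.DataFrame({
    'Product': ['A', 'B'],
    'Revenue': [1000, 200],
    'Cost': [900, 160],
})
products['Margin_Percent'] = (
    (products['Revenue'] - products['Cost']) / products['Revenue'] * 100
)

# Summing or plainly averaging the percentages ignores how much each product sold.
summed_margin = products['Margin_Percent'].sum()
simple_average = products['Margin_Percent'].mean()

# Recalculate from the additive components (Revenue and Cost) instead.
total_revenue = products['Revenue'].sum()
total_cost = products['Cost'].sum()
total_margin = (total_revenue - total_cost) / total_revenue * 100

print(products)
print(f"Summed margins:    {summed_margin:.2f}%")
print(f"Simple average:    {simple_average:.2f}%")
print(f"Recalculated:      {total_margin:.2f}%")
```

With these assumed numbers the recalculated margin sits much closer to Product A's 10% than to the simple average, because Product A contributes most of the revenue.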

---


## 3. Python Simulation: The Behavior of Measures

Let's create a simulated Fact Table to demonstrate these three behaviors.

```python
import pandas as pd

# Creating a Fact Table for Inventory and Sales
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Store': ['Store_A', 'Store_B', 'Store_A', 'Store_B'],
    'Sales_Amount': [100, 200, 150, 250],        # Additive
    'Stock_Level_End_Of_Day': [50, 30, 45, 25],  # Semi-Additive (Snapshot)
    'Margin_Percent': [0.10, 0.20, 0.10, 0.20]   # Non-Additive
}

df_fact = pd.DataFrame(data)

print("--- Fact Table ---")
print(df_fact)  # print() also works outside a notebook, where display() is undefined
```

### Scenario 1: Additive (Sales)
We want the total sales for the whole dataset.

```python
# Valid Operation: Summing across all dates and stores
total_sales = df_fact['Sales_Amount'].sum()
print(f"Total Sales: ${total_sales} (Correct)")
```
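
One way to check that a measure really is additive is to confirm that subtotals along each dimension roll up to the same grand total. The sketch below rebuilds only the sales columns of the fact table so it can run on its own. The variable names `sales_by_date` and `sales_by_store` are illustrative.

```python
import pandas as pd

# Same sales figures as the fact table above, rebuilt so this block runs standalone.
df_fact = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Store': ['Store_A', 'Store_B', 'Store_A', 'Store_B'],
    'Sales_Amount': [100, 200, 150, 250],
})

# Subtotals along each dimension.
sales_by_date = df_fact.groupby('Date')['Sales_Amount'].sum()
sales_by_store = df_fact.groupby('Store')['Sales_Amount'].sum()
grand_total = df_fact['Sales_Amount'].sum()

print(sales_by_date)
print(sales_by_store)

# For an additive measure, every set of subtotals adds up to the grand total.
print(sales_by_date.sum() == grand_total)
print(sales_by_store.sum() == grand_total)
```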

### Scenario 2: Semi-Additive (Stock Level)
If we want to know "How much stock do we have right now (Jan 2nd)?", we can sum across stores.
But if we simply `sum()` the whole column, we get a wrong number because we are adding yesterday's stock to today's stock.

```python
# INCORRECT WAY: Summing Stock across Time
wrong_stock = df_fact['Stock_Level_End_Of_Day'].sum()
print(f"Sum of ALL Stock column: {wrong_stock} (WRONG - Double counting items)")

# CORRECT WAY: Filter for specific time snapshot, then sum across stores
jan_2_stock = df_fact[df_fact['Date'] == '2023-01-02']['Stock_Level_End_Of_Day'].sum()
print(f"Total Stock on Jan 2nd: {jan_2_stock} (Correct)")
```
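
When a report does span several dates, a common approach is to sum across stores within each date first, then roll up over time with something other than a sum, such as the closing (latest) balance or the average balance. The following is a minimal sketch of that pattern. Which time aggregation is right depends on the business question, so treat the choice of "closing" and "average" here as assumptions.

```python
import pandas as pd

# Same stock figures as the fact table above, rebuilt so this block runs standalone.
df_fact = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Store': ['Store_A', 'Store_B', 'Store_A', 'Store_B'],
    'Stock_Level_End_Of_Day': [50, 30, 45, 25],
})
df_fact['Date'] = pd.to_datetime(df_fact['Date'])

# Step 1: summing across stores is valid within a single date.
stock_by_date = (
    df_fact.groupby('Date')['Stock_Level_End_Of_Day'].sum().sort_index()
)

# Step 2: across time, use a non-sum aggregation.
closing_stock = stock_by_date.iloc[-1]   # balance on the latest date
average_stock = stock_by_date.mean()     # average daily balance

print(stock_by_date)
print(f"Closing stock: {closing_stock}")
print(f"Average daily stock: {average_stock}")
```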

### Scenario 3: Non-Additive (Margin %)
We want the overall Profit Margin for the company.

```python
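# INCORRECT WAY: Summing percentages
sum_margin = df_fact['Margin_Percent'].sum() * 100
print(f"Sum of Margins: {sum_margin:.0f}% (WRONG - Not a meaningful figure)")

# CORRECT WAY: Recalculation (Weighted Average)
# Usually, DW would store Profit_Amt and Sales_Amt, and BI tools calculate % on the fly.
# Margin % = Sum(Profit) / Sum(Sales)
# This sample table has no profit column, so we derive one here for illustration.
profit_amt = df_fact['Sales_Amount'] * df_fact['Margin_Percent']
overall_margin = profit_amt.sum() / df_fact['Sales_Amount'].sum() * 100
print(f"Overall Margin: {overall_margin:.2f}% (Correct)")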
```

---

## 4. Attributes (Dimensions)
Attributes are the context. Without attributes, measures are just numbers floating in space.

*   **Fact:** "We sold 100 units."
*   **Question:** Who? When? Where? What?
*   **Context (Attributes):** "We sold 100 units **(Measure)** of Red Pens **(Product Attribute)** in New York **(Store Attribute)** on Jan 1st **(Date Attribute)**."

Attributes are used in the `GROUP BY` clause of SQL or Pandas queries.

```python
# Analysis using Attributes (Context)
report = df_fact.groupby(['Store'])['Sales_Amount'].sum()

print("--- Sales by Store (Attribute) ---")
print(report)
```
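
Attributes can also be combined or counted. The sketch below groups by two attributes at once and counts distinct stores. It rebuilds a small copy of the fact table so it runs on its own, and the variable names are illustrative.

```python
import pandas as pd

# Same sales figures as the fact table above, rebuilt so this block runs standalone.
df_fact = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Store': ['Store_A', 'Store_B', 'Store_A', 'Store_B'],
    'Sales_Amount': [100, 200, 150, 250],
})

# Group by two attributes: one row per (Date, Store) combination.
sales_by_date_store = df_fact.groupby(['Date', 'Store'])['Sales_Amount'].sum()

# Attributes are counted, not summed: how many distinct stores appear?
store_count = df_fact['Store'].nunique()

print(sales_by_date_store)
print(f"Number of distinct stores: {store_count}")
```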

---

## 5. Summary Table

| Concept | Definition | Example | Aggregation Rule |
| :--- | :--- | :--- | :--- |
| **Attribute** | Context / Descriptive | Date, Product Name, User ID | Group By / Count |
| **Measure (Additive)** | Fully summable | Sales, Quantity Sold, Cost | Sum over Date, Store, Product |
| **Measure (Semi-Additive)** | Summable over some dims | Inventory Balance, Account Balance | Sum over Store (Yes), Time (No) |
| **Measure (Non-Additive)** | Not summable | Unit Price, Margin %, Temperature | Never sum; average, or recalculate ratios from additive components (weighted avg) |



---

## 6. Next Steps
Now that we understand the types of data, we need to understand how to organize them into tables. In the next session, we will cover the core of dimensional modeling: **Fact Tables and Dimension Tables**.