# Data Warehousing - Part 3: OLAP & Analytical Reporting

## 1. What is OLAP?
**OLAP (Online Analytical Processing)** is a computing method that enables users to easily and selectively extract and view data from different points of view.

While OLTP (discussed in the previous notebook) is designed for **capturing** data, OLAP is designed for **querying** and **analyzing** data.

### Key Characteristics of OLAP:
1.  **Historical Data:** Unlike OLTP, which might keep only the last 6-12 months of active data, OLAP systems store data from the "origin" (e.g., from 2001 to present) to enable trend analysis.
2.  **Read-Optimized:** The access pattern is "Write Once, Read Many." Data is loaded (via ETL) typically once a day/hour, but read thousands of times.
3.  **Denormalized:** To speed up reads, we reduce the number of joins by combining tables. This increases data redundancy but drastically improves read performance.
4.  **Columnar Storage:** Modern OLAP databases (like Amazon Redshift, Snowflake, Google BigQuery) often store data by **column** rather than by row, so aggregation queries (SUM, AVG) scan only the columns they need and run much faster.

---

## 2. The "Why": Solving the Dirty Data Problem

A major advantage of moving data to an OLAP system is **Data Cleaning**. In an OLTP system (like a website form), users might input data inconsistently. If we report directly on this raw input, our numbers will be wrong.

### Simulation: The "Country" Input Problem
Imagine an e-commerce checkout form where the "Country" field is a text box.

```python
import pandas as pd

# 1. Simulate Raw OLTP Data (User Input)
# Notice the inconsistency in the 'Country' column.
oltp_data = {
    'OrderID': [1, 2, 3, 4, 5, 6],
    'Customer': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Amount': [100, 200, 150, 300, 120, 250],
    'Country': ['IN', 'India', 'ind', 'INDIA', 'US', 'USA']
}

df_oltp = pd.DataFrame(oltp_data)

print("--- Raw OLTP Data ---")
display(df_oltp)

# 2. Trying to generate a report on OLTP
# Result: The data is fragmented. 'India' is split into 4 categories and 'USA' into 2.
print("\n--- Failed Report on Raw Data ---")
report_fail = df_oltp.groupby('Country')['Amount'].sum()
display(report_fail)
```

### The OLAP Solution (ETL Process)
Before the data enters the Data Warehouse (OLAP), it goes through an **ETL (Extract, Transform, Load)** process where these inconsistencies are mapped to a standard value.

```python
# 3. Simulate ETL Process (Transformation)
def clean_country(country_code):
    code = country_code.upper().strip()
    if code in ['IN', 'IND', 'INDIA', 'HINDUSTAN']:
        return 'India'
    elif code in ['US', 'USA', 'UNITED STATES']:
        return 'USA'
    return 'Unknown'

# Apply transformation
df_olap = df_oltp.copy()
df_olap['Country'] = df_olap['Country'].apply(clean_country)

print("--- Cleaned OLAP Data ---")
display(df_olap)

# 4. Generate Analytical Report
# Result: Accurate aggregation.
print("\n--- Successful Analytical Report ---")
report_success = df_olap.groupby('Country')['Amount'].sum()
display(report_success)
```
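
The `if`/`elif` chain above can also be written as a lookup table, which tends to be easier to extend as new spellings show up. The following is a minimal sketch rather than the notebook's own method: `COUNTRY_MAP` and `standardize_country` are hypothetical names, the sample data is re-created so the cell runs on its own, and the extra `' Bharat '` row is made up to show what happens to a spelling that is not in the table.

```python
import pandas as pd

# Hypothetical lookup table: normalized raw spelling -> standard value.
COUNTRY_MAP = {
    'IN': 'India', 'IND': 'India', 'INDIA': 'India', 'HINDUSTAN': 'India',
    'US': 'USA', 'USA': 'USA', 'UNITED STATES': 'USA',
}

def standardize_country(series):
    """Trim whitespace, upper-case, then map to a standard name.
    Spellings missing from COUNTRY_MAP become 'Unknown'."""
    normalized = series.astype(str).str.strip().str.upper()
    return normalized.map(COUNTRY_MAP).fillna('Unknown')

# Same amounts and countries as the OLTP example, plus one made-up unmapped row.
raw = pd.DataFrame({
    'Amount': [100, 200, 150, 300, 120, 250, 80],
    'Country': ['IN', 'India', 'ind', 'INDIA', 'US', 'USA', ' Bharat '],
})
raw['Country_Clean'] = standardize_country(raw['Country'])

# Aggregate on the cleaned column.
report = raw.groupby('Country_Clean')['Amount'].sum()
print(report)

# Raw spellings that fell through to 'Unknown', worth reviewing before the next load.
unmapped = raw.loc[raw['Country_Clean'] == 'Unknown', 'Country'].unique()
print(unmapped)
```

Keeping the raw column next to the cleaned one makes it possible to list the unmapped spellings and add them to the table later, instead of silently losing them in an `Unknown` bucket.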

---



## 3. The "Why": Solving the Performance Problem

### The Denormalization Advantage
In the previous notebook, we saw that to get a simple invoice in OLTP, we had to join **4 tables**. In OLAP, we denormalize this into "Wide Tables" or "Star Schemas", so that most reports need few or no joins.

Let's simulate a Denormalized table. Instead of 4 tables, we keep everything together.

```python
# Simulating a Denormalized OLAP Table
# Note: Data redundancy (category and region values repeat on every row), but Joins are eliminated.

olap_wide_table = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-03'],
    'Product_Category': ['Stationery', 'Stationery', 'Electronics', 'Stationery'],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Mouse', 'Red Pen'],
    'Customer_Region': ['NY', 'NY', 'SF', 'NY'],
    'Sales_Amount': [10, 10, 50, 20]
}

# Build the DataFrame directly; pandas stands in for the warehouse in this simulation.
df_wide = pd.DataFrame(olap_wide_table)

print("--- Denormalized Wide Table (OLAP) ---")
display(df_wide)

# No joins are required: the report is a single filter plus a sum.
# Manager asks: "Show me Red Pen sales in NY"
analytics_query = df_wide[
    (df_wide['Product_Name'] == 'Red Pen') &
    (df_wide['Customer_Region'] == 'NY')
]['Sales_Amount'].sum()

print(f"\nTotal Red Pen Sales in NY: ${analytics_query}")
```
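
For contrast, here is a rough sketch of how an ETL job might build such a wide table from normalized tables. The three source tables and their key columns (`ProductID`, `CustomerID`) are hypothetical, since the previous notebook's actual schema is not reproduced here; only the resulting rows mirror the example above.

```python
import pandas as pd

# Hypothetical normalized (OLTP-style) tables.
products = pd.DataFrame({
    'ProductID': [1, 2, 3],
    'Product_Name': ['Red Pen', 'Blue Pen', 'Mouse'],
    'Product_Category': ['Stationery', 'Stationery', 'Electronics'],
})
customers = pd.DataFrame({
    'CustomerID': [10, 20],
    'Customer_Region': ['NY', 'SF'],
})
orders = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-03'],
    'ProductID': [1, 2, 3, 1],
    'CustomerID': [10, 10, 20, 10],
    'Sales_Amount': [10, 10, 50, 20],
})

# The joins are paid once, at load time, instead of on every report.
wide = (
    orders
    .merge(products, on='ProductID', how='left')
    .merge(customers, on='CustomerID', how='left')
    .drop(columns=['ProductID', 'CustomerID'])
)
print(wide)

# The manager's question is then a filter and a sum on one table.
red_pen_ny = wide.loc[
    (wide['Product_Name'] == 'Red Pen') & (wide['Customer_Region'] == 'NY'),
    'Sales_Amount',
].sum()
print(red_pen_ny)
```

The trade-off is the one described above: product and region attributes are copied onto every order row, in exchange for reports that read a single table.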

### Row vs. Columnar Storage
*   **Row Store (OLTP like MySQL/Postgres):** Data is stored row-by-row on the disk. Great for fetching *one specific user's* order.
*   **Column Store (OLAP like Redshift/BigQuery):** Data is stored column-by-column.
    *   *Scenario:* If you want the **Average Sales Amount** for 1 billion rows.
    *   *Row Store:* Must read the entire row (Name, Date, Address, **Amount**) for 1 billion records. Heavy I/O.
    *   *Column Store:* Only reads the **Amount** column block. Massive performance gain.
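
The I/O difference can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions: real engines work with disk pages, compression, and vectorized execution, and the sample records and the "values touched" counters are made up for illustration.

```python
# Toy model: count how many stored values each layout must touch
# to compute the average Amount.

rows = [
    {'Name': 'Alice', 'Date': '2023-01-01', 'Address': 'NY', 'Amount': 100},
    {'Name': 'Bob', 'Date': '2023-01-01', 'Address': 'SF', 'Amount': 200},
    {'Name': 'Charlie', 'Date': '2023-01-02', 'Address': 'NY', 'Amount': 150},
    {'Name': 'David', 'Date': '2023-01-03', 'Address': 'LA', 'Amount': 300},
]

# Row store: each record is stored as a unit, so reading Amount
# pulls in the other fields of that record as well.
row_values_touched = 0
row_total = 0
for record in rows:
    row_values_touched += len(record)
    row_total += record['Amount']
row_avg = row_total / len(rows)

# Column store: the same data, kept as one list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}
amounts = columns['Amount']
col_values_touched = len(amounts)
col_avg = sum(amounts) / len(amounts)

print(row_avg, row_values_touched)
print(col_avg, col_values_touched)
```

Both layouts give the same average, but the row layout touches every field of every record while the column layout touches only the `Amount` values. The gap grows with the number of columns in the table.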

---

## 4. Summary: OLTP vs. OLAP

| Feature | OLTP (Transactional) | OLAP (Analytical) |
| :--- | :--- | :--- |
| **Purpose** | Run the business (Day-to-day) | Analyze the business (Trends) |
| **Data Source** | Live User Inputs | Aggregated & Cleaned from OLTP |
| **Data History** | Recent (e.g., 6 months) | Historic (e.g., 10 years) |
| **Normalization** | Highly Normalized (3NF) | Denormalized (Star Schema) |
| **Operations** | High Read/Write (CRUD) | Mostly Read (Complex Selects) |
| **Users** | Thousands (Customers/Clerks) | Few (Analysts/Managers/CEOs) |
| **Query Speed** | Fast for single records | Fast for aggregations (Sum/Avg) |

### The Verdict
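We **should not** run heavy analytical queries directly on OLTP systems because:
1.  **Performance:** Scanning millions of rows competes with live transactions for I/O, CPU, and memory (and, depending on the engine and isolation level, for locks), which can slow or block customers placing new orders (Read/Write contention).
2.  **Data Quality:** OLTP data is "dirty" (raw input). OLAP data has been standardized during ETL, so it can serve as the "Source of Truth" for reporting.
3.  **Complexity:** Business users cannot be expected to write queries with 15 Joins. OLAP simplifies the structure.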