# Data Warehousing - Part 1: Introduction & High-Level Architecture

## 1. Course Agenda
In this series, we will take a deep dive into the concepts of Data Warehousing. The roadmap for this course includes:

1.  **Introduction:** Understanding the core concepts.
2.  **OLTP vs. OLAP:** Differentiating between transactional and analytical systems.
3.  **Data Modeling:** Measures, Attributes, Fact Tables, and Dimension Tables.
4.  **Advanced Topics:** A look into complex warehousing scenarios.
5.  **Demo:** A practical walkthrough of designing a warehouse schema.

---

## 2. What is Data Warehousing?

A **Data Warehouse (DW)** is essentially a **central repository**.

In a real-world enterprise, data is often scattered across many systems (sales apps, inventory logs, website tracking). Because these systems differ in technology and data format, they are known as **heterogeneous systems**. A Data Warehouse brings all of this data together in a single location to facilitate analysis.

### The Goal: Informed Decision Making
The primary purpose of a Data Warehouse is not necessarily to run complex AI or Machine Learning models immediately, but to look at historical data trends to make **informed business decisions**.

### Case Study: The Pen Store Logic
Let's illustrate this with an example. Imagine a retail business that sells pens through stores in **New York (NY)** and **San Francisco (SF)**.

**The Observation (Data Analysis):**
*   **New York Store:** In December, Red pens outsell Blue pens significantly.
*   **San Francisco Store:** In December, Red pen sales drop, while other colors perform better.

**The Informed Decision:**
Based on this historical trend, the business decides to modify its inventory distribution for the next December:
*   Move 90% of the **Red Pen** stock to New York.
*   Keep only 10% of the **Red Pen** stock in San Francisco.

### Simulating this in Python
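Data Warehousing is an architectural concept, but we can simulate the "Central Repository" idea on a small scale using Pandas. (The `display` function used below is provided by Jupyter/IPython; outside a notebook, use `print` instead.)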

```python
import pandas as pd

# 1. Simulate Heterogeneous Sources (Upstream Data)

# Data from New York Point-of-Sale System
ny_sales_data = {
    'Date': ['2023-12-01', '2023-12-05', '2023-12-10'],
    'Store_Location': ['NY', 'NY', 'NY'],
    'Product': ['Red Pen', 'Red Pen', 'Blue Pen'],
    'Units_Sold': [100, 150, 20]
}

# Data from San Francisco Point-of-Sale System
sf_sales_data = {
    'Date': ['2023-12-02', '2023-12-06', '2023-12-12'],
    'Store_Location': ['SF', 'SF', 'SF'],
    'Product': ['Red Pen', 'Blue Pen', 'Red Pen'],
    'Units_Sold': [10, 50, 5]
}

df_ny = pd.DataFrame(ny_sales_data)
df_sf = pd.DataFrame(sf_sales_data)

print("--- Source System: NY ---")
display(df_ny)
print("\n--- Source System: SF ---")
display(df_sf)
```

```python
# 2. The Data Warehouse (Central Repository)
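# We concatenate the rows from each source system into one table.
# (These toy sources already share a schema; real heterogeneous sources
# usually need a transform step before they can be combined.)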

df_warehouse = pd.concat([df_ny, df_sf], ignore_index=True)

print("--- Data Warehouse (Central Repository) ---")
display(df_warehouse)

# 3. Business Intelligence (The Analysis)
# Aggregating data to find trends

report = df_warehouse.groupby(['Store_Location', 'Product'])['Units_Sold'].sum().reset_index()

print("\n--- Analytical Report ---")
display(report)
```

*Note: In the output above, NY sold 250 Red Pens in December while SF sold only 15. This is the trend that shows NY needs far more Red Pens than SF, enabling the "Informed Decision."*
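
To connect the report to the stock split in the case study, each store's share of Red Pen sales can be computed directly. This is a minimal sketch that rebuilds the same toy figures so it runs on its own; the names `sales`, `red_by_store`, and `red_share_pct` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Same toy figures as the NY and SF source tables, rebuilt so this cell is self-contained.
sales = pd.DataFrame({
    'Store_Location': ['NY', 'NY', 'NY', 'SF', 'SF', 'SF'],
    'Product': ['Red Pen', 'Red Pen', 'Blue Pen', 'Red Pen', 'Blue Pen', 'Red Pen'],
    'Units_Sold': [100, 150, 20, 10, 50, 5],
})

# Total Red Pen units sold per store.
red_by_store = (
    sales[sales['Product'] == 'Red Pen']
    .groupby('Store_Location')['Units_Sold']
    .sum()
)

# Each store's share of all Red Pen sales, as a percentage.
red_share_pct = (red_by_store / red_by_store.sum() * 100).round(1)
print(red_share_pct)
```

With this toy data the split comes out at roughly 94% for NY and 6% for SF, which points the same way as the 90/10 allocation. A real allocation would likely look at more than one month of history.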

---



## 3. Business Intelligence (BI)

**Business Intelligence** refers to the technical infrastructure and processes used to collect, store, and analyze data.

*   **Role:** It acts as the bridge between raw data and business decisions.
*   **Relation to DW:** BI relies heavily on the Data Warehouse as its source of truth when generating reports (like the one generated in the Python code above).
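
As a rough illustration of a BI-style report, the sketch below reshapes warehouse rows into a store-by-product cross-tab with `pivot_table`. The DataFrame is rebuilt here so the cell stands alone, and the name `report_pivot` is an assumption made for this example.

```python
import pandas as pd

# Stand-in for the central repository, using the same toy December figures.
df_warehouse = pd.DataFrame({
    'Store_Location': ['NY', 'NY', 'NY', 'SF', 'SF', 'SF'],
    'Product': ['Red Pen', 'Red Pen', 'Blue Pen', 'Red Pen', 'Blue Pen', 'Red Pen'],
    'Units_Sold': [100, 150, 20, 10, 50, 5],
})

# One row per store, one column per product, total units in each cell.
report_pivot = df_warehouse.pivot_table(
    index='Store_Location',
    columns='Product',
    values='Units_Sold',
    aggfunc='sum',
    fill_value=0,
)
print(report_pivot)
```

A cross-tab like this is closer to what a dashboard would show than the long-format `groupby` output, although both are built from the same warehouse data.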

---

## 4. The Architecture: Upstream vs. Downstream

When designing data pipelines, we use specific terminology to describe the flow of data.

### Upstream (The Source)
This is where the data originates. It is the "top" of the river.
*   **Examples:** Point of Sale (POS) systems, CRM software, user application logs.
*   **Characteristics:** These are usually transactional systems (capturing data as it happens).

### Downstream (The Destination)
This is where the data flows to.
*   **Examples:** The Data Warehouse, Analytical Reporting Systems, Dashboards.
*   **Characteristics:** These are analytical systems (reading data to understand history).
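
The two access patterns can be contrasted in a small sketch using Python's built-in `sqlite3` module. The table `pos_sales` and the helper `record_sale` are hypothetical names chosen for this example. Both patterns run against one in-memory database only to keep the sketch short; in the architecture described here, the summary query would run on the warehouse, not on the source system.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE pos_sales (sale_date TEXT, store TEXT, product TEXT, units INTEGER)"
)

def record_sale(sale_date, store, product, units):
    """Upstream-style write: one small insert per business event."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pos_sales VALUES (?, ?, ?, ?)",
            (sale_date, store, product, units),
        )

# Sales are captured one at a time, as they happen.
record_sale('2023-12-01', 'NY', 'Red Pen', 100)
record_sale('2023-12-05', 'NY', 'Red Pen', 150)
record_sale('2023-12-02', 'SF', 'Red Pen', 10)

# Downstream-style read: scan the accumulated history and summarize it.
rows = conn.execute(
    "SELECT store, SUM(units) FROM pos_sales GROUP BY store ORDER BY store"
).fetchall()
print(rows)
```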

### The Connector: ETL (Extract, Transform, Load)
Between Upstream and Downstream lies the **ETL** process.

1.  **Extract:** Pull data from NY and SF sources.
2.  **Transform:** Clean the data (e.g., standardizing date formats, calculating totals).
3.  **Load:** Push the clean data into the Data Warehouse.

*(Note: Sometimes this is done as ELT - Extract, Load, and then Transform, depending on the technology stack).*
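
The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: it assumes the NY source stores dates as `YYYY-MM-DD` and the SF source as `MM/DD/YYYY`, and it uses an in-memory SQLite database as a stand-in for the warehouse. The function names `extract`, `transform`, and `load` and the table name `sales` are made up for this example.

```python
import sqlite3
import pandas as pd

def extract():
    """Extract: pull raw rows from each source (hard-coded here for illustration)."""
    ny = pd.DataFrame({
        'Date': ['2023-12-01', '2023-12-05'],   # assumed format: YYYY-MM-DD
        'Store_Location': ['NY', 'NY'],
        'Product': ['Red Pen', 'Red Pen'],
        'Units_Sold': [100, 150],
    })
    sf = pd.DataFrame({
        'Date': ['12/02/2023', '12/06/2023'],   # assumed format: MM/DD/YYYY
        'Store_Location': ['SF', 'SF'],
        'Product': ['Red Pen', 'Blue Pen'],
        'Units_Sold': [10, 50],
    })
    return ny, sf

def transform(ny, sf):
    """Transform: standardize the date format, then combine the sources."""
    ny = ny.assign(Date=pd.to_datetime(ny['Date'], format='%Y-%m-%d'))
    sf = sf.assign(Date=pd.to_datetime(sf['Date'], format='%m/%d/%Y'))
    combined = pd.concat([ny, sf], ignore_index=True)
    # Store dates as ISO strings so every row uses the same representation.
    combined['Date'] = combined['Date'].dt.strftime('%Y-%m-%d')
    return combined

def load(df, conn):
    """Load: write the cleaned rows into the warehouse table."""
    df.to_sql('sales', conn, if_exists='replace', index=False)

conn = sqlite3.connect(':memory:')  # stand-in for the Data Warehouse
load(transform(*extract()), conn)

warehouse = pd.read_sql_query("SELECT * FROM sales ORDER BY Date", conn)
print(warehouse)
```

In an ELT variant, the raw rows would be loaded first and the date standardization would run inside the warehouse, for example as a SQL statement.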

### Visualizing the Flow

```mermaid
graph LR
    subgraph Upstream
    A[Source: NY Store]
    B[Source: SF Store]
    end

    subgraph Process
    C((ETL Layer))
    end

    subgraph Downstream
    D[(Data Warehouse)]
    E[BI Reports]
    end

    A --> C
    B --> C
    C --> D
    D --> E
```

---

## 5. Key Interview Question Teaser

A common interview question is:
> *"If we have the data in the Source Systems (Transactional/OLTP), why do we need to copy it to a Data Warehouse? Why can't we just run reports directly on the Source?"*

We will answer this in the next notebook by exploring the differences between **OLTP** (Online Transaction Processing) and **OLAP** (Online Analytical Processing).