# Data Warehousing - Part 2: Architecture & OLTP Systems

## 1. The "Big Three" of Data Storage
Before understanding transactional systems, we must clarify a common interview question: **What is the difference between a Data Lake, Data Warehouse, and Data Mart?**

### A. Data Lake
*   **Origin:** Evolved with the rise of Big Data.
*   **Content:** Stores **all** types of data:
    *   **Structured:** CSV, Relational tables.
    *   **Semi-Structured:** JSON, XML, Logs, HTML.
    *   **Unstructured:** Images (JPEG), PDFs, Audio files.
*   **Purpose:** A vast "ocean" or holding area for raw data. It usually sits on low-cost object storage (e.g., AWS S3, Azure Blob Storage).

### B. Data Warehouse (DW)
*   **Content:** Stores primarily **Structured** data (sometimes Semi-Structured).
*   **Characteristics:**
    *   **Cleaned & Aligned:** Data is processed before entry.
    *   **Optimized:** Designed for fast reading and analysis.
*   **Purpose:** The central "Source of Truth" for the entire organization.

### C. Data Mart
*   **Definition:** A subset of a Data Warehouse.
*   **Characteristics:** **Subject-Oriented**.
*   **Examples:**
    *   **Sales Data Mart:** Contains only sales, products, and customer data.
    *   **HR Data Mart:** Contains employee, payroll, and absence data.
*   **Purpose:** To serve specific business lines without exposing them to the complexity of the entire Warehouse.
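
One simple way to picture a Data Mart as a "subset of the Warehouse" is a view that exposes a single subject area. The sketch below is only an illustration: the table names `sales` and `employees` and the view name `sales_mart` are hypothetical, and real data marts are often separate physical databases rather than views.

```python
import sqlite3

# Hypothetical, tiny "warehouse" holding two subject areas side by side.
dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product TEXT, amount REAL);
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT, salary REAL);
    INSERT INTO sales VALUES (1, 'Red Pen', 15.0), (2, 'Blue Pen', 7.5);
    INSERT INTO employees VALUES (1, 'Bob', 50000.0);

    -- The "Sales Data Mart": a subject-oriented slice that exposes only sales data.
    CREATE VIEW sales_mart AS
        SELECT product, SUM(amount) AS total_sales
        FROM sales
        GROUP BY product;
""")

# A sales analyst queries the mart and never sees the HR tables.
mart_rows = dw.execute(
    "SELECT product, total_sales FROM sales_mart ORDER BY product"
).fetchall()
print(mart_rows)
```

The sales team works against `sales_mart` alone, which matches the stated purpose: serving one business line without exposing the rest of the Warehouse.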

---

## 2. OLTP: Online Transaction Processing

**OLTP** systems are the operational systems that run the day-to-day business.

### Key Characteristics:
1.  **Online & Recent:** Handles current data (e.g., last 6-18 months). Old data is often archived to maintain performance.
2.  **Read/Write Intensive:** Supports thousands of concurrent users reading and writing data (e.g., placing orders on Amazon).
3.  **Fast Response:** Response times must be in **milliseconds**.
4.  **Highly Normalized:**
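    *   Data is split into many tables to reduce redundancy and keep each fact in exactly one place. (ACID properties are a separate guarantee: they describe how transactions behave, not how tables are designed.)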
    *   Follows database normalization rules (1NF, 2NF, 3NF, BCNF).
    *   Heavy use of **Foreign Keys** and relationships.
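
To see why normalization matters for a write-heavy system, here is a minimal sketch of an update anomaly. The table `FlatOrders` and the sample emails are hypothetical and exist only for this comparison.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Denormalized: the customer's email is repeated on every order row.
    CREATE TABLE FlatOrders (OrderID INTEGER PRIMARY KEY, CustomerName TEXT, Email TEXT);
    INSERT INTO FlatOrders VALUES (1, 'Alice', 'alice@old.example'),
                                  (2, 'Alice', 'alice@old.example');

    -- Normalized: the email lives in exactly one row.
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Email TEXT);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY,
                         CustomerID INTEGER REFERENCES Customers(CustomerID));
    INSERT INTO Customers VALUES (1, 'Alice', 'alice@old.example');
    INSERT INTO Orders VALUES (1, 1), (2, 1);
""")

# A careless update that touches only one of the duplicated rows.
db.execute("UPDATE FlatOrders SET Email = 'alice@new.example' WHERE OrderID = 1")
flat_emails = {row[0] for row in db.execute("SELECT Email FROM FlatOrders")}

# The same change in the normalized schema is a single-row update.
db.execute("UPDATE Customers SET Email = 'alice@new.example' WHERE CustomerID = 1")
normalized_emails = {row[0] for row in db.execute(
    "SELECT c.Email FROM Orders o JOIN Customers c ON o.CustomerID = c.CustomerID"
)}

# Count the distinct emails each design now reports for the same customer.
print(len(flat_emails), len(normalized_emails))
```

The flat table now holds two different emails for the same customer, while the normalized schema cannot disagree with itself because the value is stored once.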

### The "Amazon Order" Example
When you place a single order on Amazon, the data isn't just dumped into one big row. It is split across multiple tables (Order info, Customer info, Shipping info, Billing info, Order Items).

Let's simulate an OLTP system using Python and SQLite to see how normalized data looks.

### Simulation: E-Commerce OLTP Database

```python
import sqlite3
import pandas as pd

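# Create an in-memory SQLite database to simulate an OLTP system
conn = sqlite3.connect(':memory:')
# SQLite does not enforce FOREIGN KEY constraints unless this is enabled per connection
conn.execute('PRAGMA foreign_keys = ON')
cursor = conn.cursor()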

# 1. Create Normalized Tables (3NF Schema)
# Notice how data is separated. Customer details are NOT in the Order table.

# Table: Customers
cursor.execute('''
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        Name TEXT,
        Email TEXT,
        Address TEXT
    )
''')

# Table: Products
cursor.execute('''
    CREATE TABLE Products (
        ProductID INTEGER PRIMARY KEY,
        ProductName TEXT,
        Price REAL,
        StockQuantity INTEGER
    )
''')

# Table: Orders (The Transaction Header)
cursor.execute('''
    CREATE TABLE Orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER,
        OrderDate TEXT,
        Status TEXT,
        FOREIGN KEY(CustomerID) REFERENCES Customers(CustomerID)
    )
''')

# Table: OrderDetails (The Transaction Items)
cursor.execute('''
    CREATE TABLE OrderDetails (
        DetailID INTEGER PRIMARY KEY,
        OrderID INTEGER,
        ProductID INTEGER,
        Quantity INTEGER,
        FOREIGN KEY(OrderID) REFERENCES Orders(OrderID),
        FOREIGN KEY(ProductID) REFERENCES Products(ProductID)
    )
''')

print("OLTP Database Schema Created.")
```

### Simulating a Transaction (Write Operation)
In an OLTP system, a single user action (such as buying two products in one order) triggers inserts into multiple tables.

```python
# 1. Register a Customer
cursor.execute("INSERT INTO Customers VALUES (1, 'Alice Smith', 'alice@example.com', '123 Maple St, NY')")

# 2. Add Products
cursor.execute("INSERT INTO Products VALUES (101, 'Red Pen', 1.50, 100)")
cursor.execute("INSERT INTO Products VALUES (102, 'Blue Pen', 1.50, 100)")

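# 3. Customer Places an Order
# Python's sqlite3 module opened a transaction implicitly at the first INSERT above;
# nothing is permanent until conn.commit() below.
# Insert into Order Header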
cursor.execute("INSERT INTO Orders VALUES (5001, 1, '2023-12-01', 'Shipped')")

# Insert into Order Details (Buying 10 Red Pens and 5 Blue Pens)
cursor.execute("INSERT INTO OrderDetails VALUES (1, 5001, 101, 10)")
cursor.execute("INSERT INTO OrderDetails VALUES (2, 5001, 102, 5)")

conn.commit()
print("Transaction Completed: Order #5001 placed.")
```
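
The multi-table write above only makes sense if it is all-or-nothing. The following sketch, assuming a reduced two-table schema and a hypothetical order `5002`, shows how a failed detail insert rolls back the order header as well. It relies on the `sqlite3` connection acting as a context manager, which commits on success and rolls back on an exception.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
db.executescript("""
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Status TEXT);
    CREATE TABLE OrderDetails (
        DetailID INTEGER PRIMARY KEY,
        OrderID INTEGER NOT NULL REFERENCES Orders(OrderID),
        Quantity INTEGER NOT NULL CHECK (Quantity > 0)
    );
""")

try:
    # "with db" commits if the block succeeds and rolls back if it raises.
    with db:
        db.execute("INSERT INTO Orders VALUES (5002, 'Pending')")
        db.execute("INSERT INTO OrderDetails VALUES (1, 5002, 0)")  # violates the CHECK
except sqlite3.IntegrityError as exc:
    print("Rolled back:", exc)

# The header insert succeeded on its own, but it was undone with the failed detail row.
order_count = db.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print("Orders stored:", order_count)
```

Without the rollback, the database would keep an order header with no line items, which is exactly the kind of inconsistency ACID transactions are meant to prevent.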

### The Challenge of OLTP Reporting
Because the data is **normalized** (spread across tables), reading even a simple invoice requires several **joins**. This design is efficient for writing, since each fact is updated in one place, but it becomes expensive when a query has to read and join large volumes of data.

```python
# To see "Who bought what?", we must join 4 tables!
complex_query = '''
SELECT 
    o.OrderID,
    o.OrderDate,
    c.Name AS CustomerName,
    p.ProductName,
    od.Quantity,
    p.Price,
    (od.Quantity * p.Price) AS TotalLineItemCost
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
JOIN OrderDetails od ON o.OrderID = od.OrderID
JOIN Products p ON od.ProductID = p.ProductID
WHERE o.OrderID = 5001
'''

df_invoice = pd.read_sql_query(complex_query, conn)
print("--- Invoice Generated via Complex Joins ---")
display(df_invoice)
```

---

## 3. Why not use OLTP for Analytics?
This brings us to the core problem solved by Data Warehousing.

If we tried to run a report: *"Show me total sales of Red Pens in NY for the last 5 years"* on this OLTP system:
1.  **Performance Hit:** The database has to join millions of rows across 4+ tables. This consumes CPU/Memory, slowing down the website for users trying to place new orders.
2.  **Data Availability:** OLTP systems often archive data older than 1-2 years to keep the tables light. Historical data might not even be there.
3.  **Complexity:** Business analysts would need to understand the full data model and write large SQL queries, often with 10+ joins, to answer a single question.
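
To make the three problems concrete, here is a rough sketch of what the "Red Pens in NY" report looks like against a normalized schema. The sample rows (a second customer, orders `5002` and `5003`) and the fixed cutoff date are assumptions chosen to keep the example deterministic. A real report would compute the cutoff from the current date.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Address TEXT);
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, OrderDate TEXT);
    CREATE TABLE OrderDetails (DetailID INTEGER PRIMARY KEY, OrderID INTEGER,
                               ProductID INTEGER, Quantity INTEGER);

    INSERT INTO Customers VALUES (1, 'Alice Smith', '123 Maple St, NY'),
                                 (2, 'Bob Jones', '9 Oak Ave, CA');
    INSERT INTO Products VALUES (101, 'Red Pen', 1.50), (102, 'Blue Pen', 1.50);
    INSERT INTO Orders VALUES (5001, 1, '2023-12-01'), (5002, 2, '2023-12-02'),
                              (5003, 1, '2017-01-15');
    INSERT INTO OrderDetails VALUES (1, 5001, 101, 10), (2, 5001, 102, 5),
                                    (3, 5002, 101, 4), (4, 5003, 101, 20);
""")

# Even this one-number report touches all four tables.
report_sql = """
SELECT SUM(od.Quantity * p.Price) AS RedPenSalesNY
FROM Orders o
JOIN Customers c     ON o.CustomerID = c.CustomerID
JOIN OrderDetails od ON o.OrderID = od.OrderID
JOIN Products p      ON od.ProductID = p.ProductID
WHERE p.ProductName = 'Red Pen'
  AND c.Address LIKE '%, NY'     -- the state is buried in a free-text column
  AND o.OrderDate >= :cutoff     -- ISO-formatted dates compare correctly as text
"""
total = db.execute(report_sql, {"cutoff": "2019-01-01"}).fetchone()[0]
print(total)
```

On four rows this is instant. On millions of rows, the same joins compete with live order traffic, and the `LIKE` filter on a free-text address is a reminder that the schema was designed for transactions, not for reporting.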

**Solution:** We move this data to an **OLAP** (Online Analytical Processing) system, which we will discuss in the next notebook.