# Architecture and Data Warehouse Data Model

## 1. Solution Architecture

Based on the project requirements, we have designed the following high-level architecture:

### **Flow Components:**
1.  **Source Layer:** 
    *   Each store's CRM system exports its data as CSV or JSON files.
    *   Data includes: Orders, Customers, Stores, Products.
2.  **Storage Layer (Data Lake):** 
    *   **Amazon S3** acts as the centralized storage.
    *   Raw files land in an `Input` bucket or folder.
3.  **Processing Layer (ETL):** 
    *   **Apache Spark (PySpark)** is the core processing engine.
    *   It reads raw data, applies transformations, and handles Slowly Changing Dimensions (SCD).
4.  **Serving Layer (Data Warehouse):**
    *   Processed data is written back to **S3** in **Delta Lake** format (or plain Parquet).
    *   **Symlink Manifest** files are generated for the Delta tables so that external query engines can locate the current data files.
5.  **Reporting Layer:**
    *   **Amazon Athena** reads the manifest files to query the data stored in S3 for reporting and KPI generation.
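
As a rough sketch of how the serving and reporting layers connect: Delta Lake can write a symlink manifest (for example through `DeltaTable.generate("symlink_format_manifest")`), and Athena then reads the table through an external table whose location is the `_symlink_format_manifest/` folder. The helper below only renders that DDL as a string. The function name, bucket path, and column list are hypothetical placeholders, not part of the project.

```python
def athena_manifest_ddl(table_name, columns, delta_table_path):
    """Render a CREATE EXTERNAL TABLE statement that reads a Delta table
    through its symlink manifest. `columns` is a list of (name, athena_type)."""
    column_sql = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    manifest_location = delta_table_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table_name} (\n  {column_sql}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{manifest_location}'"
    )

# Hypothetical bucket and columns, for illustration only.
ddl = athena_manifest_ddl(
    "dim_store",
    [("store_key", "string"), ("store_name", "string")],
    "s3://example-bucket/warehouse/dim_store",
)
print(ddl)
```

Keeping the DDL in a helper like this makes it easy to register every warehouse table the same way; the manifest itself still has to be regenerated whenever the Delta table changes.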

---

## 2. Data Warehouse Model: Snowflake Schema

We will implement a **Snowflake Schema**. In a Star schema, every dimension connects directly to the fact table. A Snowflake schema additionally allows dimensions to be normalized, so one dimension can reference another. In this model, `DIM_CUSTOMER` references `DIM_PLAN_TYPE`.

### **Table Definitions & SCD Types:**

| Table Name | Type | SCD Type | Description |
| :--- | :--- | :--- | :--- |
| **FACT_SALES** | Fact | N/A | Transactional sales data. Links directly to the date, store, product, and customer dimensions. |
| **DIM_DATE** | Dimension | Type 1 | Date attributes (Day, Month, Year, etc.). No history tracking needed. |
| **DIM_STORE** | Dimension | Type 1 | Store details. Updates overwrite old values. |
| **DIM_PRODUCT** | Dimension | **Type 2** | Product details. History is preserved (Start Date, End Date, Is Current). |
| **DIM_CUSTOMER** | Dimension | **Type 2** | Customer details. History is preserved. Links to Plan Type. |
| **DIM_PLAN_TYPE** | Dimension | Type 1 | Specific details about subscription plans. Linked from Customer Dim. |
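
To make the difference between the two SCD types concrete, here is a minimal pure-Python sketch that works on lists of dictionaries rather than Spark DataFrames. The function names and sample rows are hypothetical, and surrogate key generation is left out for brevity; the actual PySpark implementation may look quite different.

```python
from datetime import datetime

def apply_scd1(rows, incoming, key):
    """Type 1: overwrite the matching row in place; no history is kept."""
    for row in rows:
        if row[key] == incoming[key]:
            row.update(incoming)
            return rows
    rows.append(dict(incoming))
    return rows

def apply_scd2(rows, incoming, key, now):
    """Type 2: close the current version if attributes changed, then append a new one."""
    for row in rows:
        if row[key] == incoming[key] and row["is_current"]:
            tracked = {k: v for k, v in row.items() if k in incoming}
            if tracked == incoming:
                return rows  # nothing changed, keep the current version
            row["effective_end_date"] = now
            row["is_current"] = False
    rows.append({**incoming, "effective_start_date": now,
                 "effective_end_date": None, "is_current": True})
    return rows

# Type 1: the manager change overwrites the existing store row.
stores = [{"store_id": "S1", "manager_name": "Ana"}]
apply_scd1(stores, {"store_id": "S1", "manager_name": "Ben"}, "store_id")

# Type 2: the price change closes the old version and adds a new one.
products = []
apply_scd2(products, {"product_id": "P1", "price": 2.5}, "product_id", datetime(2024, 1, 1))
apply_scd2(products, {"product_id": "P1", "price": 3.0}, "product_id", datetime(2024, 6, 1))
print(len(stores), len(products))
```

After these calls the store list still holds a single row, while the product list holds two versions of `P1`, of which only the latest has `is_current` set.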

---

## 3. Schema Definition in PySpark

Before we ingest data, we need to define explicit schemas for the target Data Warehouse tables. Explicit schemas keep data types consistent across loads and give us a reference to validate incoming data against.

In [None]:
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType,
    FloatType, DateType, BooleanType, TimestampType,
)

# 1. Dimension: DIM_DATE (SCD Type 1)
schema_dim_date = StructType([
    StructField("date_id", TimestampType(), False), # PK
    StructField("full_date", DateType(), True),
    StructField("day_of_week", StringType(), True),
    StructField("day_num_in_month", IntegerType(), True),
    StructField("day_num_overall", IntegerType(), True),
    StructField("month", IntegerType(), True),
    StructField("month_name", StringType(), True),
    StructField("year", IntegerType(), True),
    StructField("quarter", IntegerType(), True),
    StructField("fiscal_quarter", StringType(), True)
])

# 2. Dimension: DIM_STORE (SCD Type 1)
schema_dim_store = StructType([
    StructField("store_key", StringType(), False), # Surrogate Key
    StructField("store_id", StringType(), True),   # Natural Key
    StructField("store_name", StringType(), True),
    StructField("address", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("zip_code", StringType(), True),
    StructField("phone_number", StringType(), True),
    StructField("manager_name", StringType(), True),
    StructField("insert_date", TimestampType(), True),
    StructField("update_date", TimestampType(), True)
])

# 3. Dimension: DIM_PRODUCT (SCD Type 2)
# Includes effective dates and current flag for history tracking
schema_dim_product = StructType([
    StructField("product_key", StringType(), False), # Surrogate Key
    StructField("product_id", StringType(), True),   # Natural Key
    StructField("product_name", StringType(), True),
    StructField("brand", StringType(), True),
    StructField("flavor", StringType(), True),
    StructField("size", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("image_url", StringType(), True),
    # SCD Type 2 Columns
    StructField("effective_start_date", TimestampType(), True),
    StructField("effective_end_date", TimestampType(), True),
    StructField("is_current", BooleanType(), True)
])

# 4. Dimension: DIM_PLAN_TYPE (SCD Type 1 - Extension of Customer)
schema_dim_plan = StructType([
    StructField("plan_key", StringType(), False),
    StructField("plan_id", StringType(), True),
    StructField("plan_name", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("features", StringType(), True),
    StructField("insert_date", TimestampType(), True),
    StructField("update_date", TimestampType(), True)
])

# 5. Dimension: DIM_CUSTOMER (SCD Type 2)
# Links to Plan Dimension
schema_dim_customer = StructType([
    StructField("customer_key", StringType(), False), # Surrogate Key
    StructField("customer_id", StringType(), True),   # Natural Key
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("phone", StringType(), True),
    StructField("address", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("zip_code", StringType(), True),
    StructField("plan_key", StringType(), True), # FK to Plan Dimension
    # SCD Type 2 Columns
    StructField("effective_start_date", TimestampType(), True),
    StructField("effective_end_date", TimestampType(), True),
    StructField("is_current", BooleanType(), True)
])

# 6. Fact Table: FACT_SALES
schema_fact_sales = StructType([
    StructField("sales_key", StringType(), False), # PK
    StructField("order_id", StringType(), True),
    StructField("date_id", TimestampType(), True),    # FK to DIM_DATE
    StructField("store_key", StringType(), True),     # FK to DIM_STORE
    StructField("product_key", StringType(), True),   # FK to DIM_PRODUCT
    StructField("customer_key", StringType(), True),  # FK to DIM_CUSTOMER
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", FloatType(), True),
    StructField("discount", FloatType(), True),
    StructField("tax", FloatType(), True),
    StructField("total_amount", FloatType(), True),
    StructField("is_returned", BooleanType(), True)
])

print("Schemas defined successfully.")

## 4. Conclusion
We have mapped the Snowflake schema model to code using PySpark `StructType` definitions, one per table.

**Note on Foreign Keys:** 
*   `FACT_SALES` connects to `DIM_CUSTOMER`, `DIM_STORE`, `DIM_PRODUCT`, and `DIM_DATE`.
*   `DIM_CUSTOMER` connects to `DIM_PLAN_TYPE`.
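
The join path implied by these relationships can be tried out on a toy dataset. The sketch below uses an in-memory SQLite database purely as a stand-in for the warehouse, with invented sample rows and only a few columns per table. It shows that plan-level reporting needs two hops: from the fact table to the customer dimension, and from there to the plan dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_plan_type (plan_key TEXT PRIMARY KEY, plan_name TEXT);
CREATE TABLE dim_customer (customer_key TEXT PRIMARY KEY, customer_id TEXT,
                           plan_key TEXT, is_current INTEGER);
CREATE TABLE fact_sales (sales_key TEXT PRIMARY KEY, customer_key TEXT,
                         total_amount REAL);

INSERT INTO dim_plan_type VALUES ('PL1', 'Basic'), ('PL2', 'Premium');
-- Two SCD Type 2 versions of the same customer, each with its own surrogate key.
INSERT INTO dim_customer VALUES ('C1-v1', 'C1', 'PL1', 0), ('C1-v2', 'C1', 'PL2', 1);
INSERT INTO fact_sales VALUES ('S1', 'C1-v1', 10.0), ('S2', 'C1-v2', 25.0),
                              ('S3', 'C1-v2', 5.0);
""")

# Fact -> customer dimension -> plan dimension (the snowflaked hop).
revenue_by_plan = conn.execute("""
    SELECT p.plan_name, SUM(f.total_amount)
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_key = c.customer_key
    JOIN dim_plan_type p ON c.plan_key = p.plan_key
    GROUP BY p.plan_name
    ORDER BY p.plan_name
""").fetchall()
print(revenue_by_plan)
```

Because each fact row points at a specific customer version, sales are attributed to the plan that version was on, which is the main benefit of tracking `DIM_CUSTOMER` as Type 2.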

In the next notebook, we will begin the **Environment Setup**, where we simulate the S3 bucket structure and define helper functions.