# Data Warehousing with PySpark: Project Overview

## 1. Introduction
In this project series, we design and implement a Data Warehouse using **PySpark**. The goal is to simulate a real-world scenario that runs from raw data ingestion through to analytics reporting.

### Prerequisites
*   Basic understanding of Data Warehousing concepts.
*   Basic knowledge of Python and Apache Spark.

---

## 2. Problem Statement

We are acting as data engineers for a **Retail Pet Food Company**.

*   **Context:** The company has stores at multiple locations in India.
*   **Current State:** The company is growing rapidly.
*   **Business Requirement:** The company wants to track specific Key Performance Indicators (KPIs) to better understand its business performance.

---

## 3. Business Requirements (KPIs)

To address the business needs, our data warehouse pipeline must be able to answer the following questions:

1.  **Sales Analysis:** Calculate total sales for each store, by day and by month.
2.  **Top Performers:** Identify the top-selling products per store in various geographic regions.
3.  **Under-performers:** Identify the least-selling products per store in various geographic regions.

---

## 4. Technical Solution & Architecture

### Data Source (Ingestion)
The individual stores export their CRM data as flat files. These files are dumped into a shared location in a Data Lake.
*   **Storage:** AWS S3 (Simple Storage Service).
*   **Datasets:**
    *   Orders Data
    *   Customer Data
    *   Store Data
    *   Product Data

### Data Processing (ETL)
We will use **Apache Spark (PySpark)** to read the data from the Data Lake.
*   **Transformation:** We will perform necessary cleaning, joining, and aggregation logic to calculate the KPIs.
*   **Loading:** The processed data will be loaded into the Data Warehouse schema.

### Architecture Flow
1.  **Source:** Stores generate CSV/JSON files.
2.  **Landing Zone:** Files land in AWS S3.
3.  **Processing Layer:** PySpark reads the files from S3 and processes them.
4.  **Serving Layer:** Data is written to the Data Warehouse for Analytics Reporting.
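
One way to keep the layers in this flow separate is to give each its own path prefix and build paths through a small helper. The sketch below is only an illustration: the bucket name, the prefixes, and the idea of storing warehouse tables as files under the same bucket are all assumptions, and `LakeLayout` is a hypothetical helper rather than part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakeLayout:
    """Hypothetical layout: one prefix per architecture layer."""

    bucket: str
    landing_prefix: str = "landing"      # raw CSV/JSON files from the stores
    warehouse_prefix: str = "warehouse"  # processed tables for reporting

    def landing_path(self, dataset: str) -> str:
        """Location of the raw files for one dataset, e.g. 'orders'."""
        return f"s3a://{self.bucket}/{self.landing_prefix}/{dataset}/"

    def warehouse_path(self, table: str) -> str:
        """Location of one processed table in the serving layer."""
        return f"s3a://{self.bucket}/{self.warehouse_prefix}/{table}/"


# The bucket name is a placeholder, not a real bucket.
layout = LakeLayout(bucket="example-petfood-lake")
print(layout.landing_path("orders"))
print(layout.warehouse_path("sales_by_store"))
```

The `s3a://` scheme is the one used by Hadoop's S3A connector, which Spark commonly relies on for S3 access; for local experiments the same helper could return plain directory paths instead.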

```python
# Environment Setup
# Since this is the initialization notebook, let's verify our Spark environment is ready for the upcoming tasks.

import pyspark
from pyspark.sql import SparkSession


def init_spark():
    spark = (
        SparkSession.builder
        .appName("PetFood_DataWarehouse_Project")
        .master("local[*]")
        .getOrCreate()
    )
    return spark


if __name__ == "__main__":
    spark = init_spark()
    print(f"Spark Version: {spark.version}")
    print("Project Environment Initialized successfully.")
```

## 5. Next Steps

In the upcoming notebooks, we will:
1.  Set up the folder structure (mocking AWS S3 locally or connecting to actual S3).
2.  Ingest the raw CRM data files (Orders, Customers, etc.).
3.  Build the transformation logic to satisfy the KPIs defined above.