# 🧩 Team 2 – APEX  
## Retail Lakehouse – Implementation Plan

---

## 🧭 Overview
This project involves building a **Lakehouse architecture** for e-commerce transactions on AWS using **Delta Lake**.  
The goal is to combine data-lake scalability with warehouse reliability using Spark, AWS Glue, and Athena, enabling transactional consistency, schema evolution, and simplified ingestion pipelines.

---

## ⚙️ Tech Stack Used
- **Apache Spark (PySpark)** – Core data processing engine  
- **AWS Glue (3.0)** – ETL orchestration for Spark jobs and Delta table creation  
- **Amazon S3** – Storage for raw, archive, and Lakehouse zones  
- **Amazon Athena** – SQL querying on Delta Lake data  
- **Amazon Redshift** – Analytical querying and integration  
- **Docker** – Local testing environment for Spark-Delta integration  

---

## 📁 Datasets
- **products.csv** – Product catalog data  
- **orders.csv** – Order-level transaction data  
- **order_items.csv** – Line-item level data per order  

These files are stored in S3’s **raw zone**, organized by date folders (`2024-06-06`, `2024-06-07`).

---

## 🧩 Step-by-Step Plan

### 1️⃣ Lakehouse and Delta Overview
- A **Lakehouse** blends a data lake’s scalability with a data warehouse’s structure.  
- It supports **ACID transactions**, **time travel**, and **schema evolution**.  
- Delta Lake acts as a **reliable storage layer** for Spark, simplifying ingestion and updates.
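
As a small illustration of time travel, the sketch below builds the reader options Delta Lake accepts for reading an earlier snapshot (`versionAsOf` or `timestampAsOf`). The helper names `delta_read_options` and `read_delta` are hypothetical and not part of the project script; Spark is only touched when `read_delta` is called.

```python
from typing import Dict, Optional


def delta_read_options(version: Optional[int] = None,
                       timestamp: Optional[str] = None) -> Dict[str, str]:
    """Build Delta reader options for time travel (at most one selector)."""
    if version is not None and timestamp is not None:
        raise ValueError("pass either version or timestamp, not both")
    if version is not None:
        return {"versionAsOf": str(version)}
    if timestamp is not None:
        return {"timestampAsOf": timestamp}
    return {}  # no selector: read the latest snapshot


def read_delta(spark, path: str, **selector):
    """Read a Delta table, optionally at an earlier version or timestamp.

    `spark` is an existing SparkSession with Delta support configured.
    """
    return (spark.read.format("delta")
            .options(**delta_read_options(**selector))
            .load(path))
```

For example, `read_delta(spark, path, version=0)` would load the table as it was after its first commit, which can help when comparing data before and after an upsert.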

---

### 2️⃣ Local Spark and Docker Setup
- Build and test the Spark job locally before AWS deployment.  
- Dockerfile uses the AWS Glue 3.0 base image (`amazon/aws-glue-libs:glue_libs_3.0.0_image_01`).  
- Install the required libraries, pinned to match Glue 3.0 (Spark 3.1):
  ```bash
  pip install pyspark==3.1.1 "delta-spark==1.0.*"
  ```
- Mount the `delta-core.jar` file (the build matching that Delta release) inside the container for Delta support.  
- Commands:
  ```bash
  docker build -t spark-delta-lake-job .
  docker run -v "$(pwd)":/app spark-delta-lake-job
  ```
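
For the local test, the Spark session needs the Delta SQL extension and catalog enabled. The following is a minimal sketch of that configuration, assuming the JAR is mounted at a path passed in by the caller; `delta_spark_conf` and `build_spark` are illustrative names, not taken from the project script.

```python
from typing import Dict, Optional


def delta_spark_conf(jar_path: Optional[str] = None) -> Dict[str, str]:
    """Spark settings that enable Delta Lake SQL support."""
    conf = {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }
    if jar_path:
        # Assumption: the delta-core JAR is mounted into the container here.
        conf["spark.jars"] = jar_path
    return conf


def build_spark(app_name: str = "spark-delta-lake-job",
                jar_path: Optional[str] = None):
    """Create a SparkSession with Delta enabled (imports pyspark lazily)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_spark_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a plain dictionary makes it easy to reuse the same values locally and in the Glue job.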

---

### 3️⃣ Spark Job Details
- Reads raw CSV data from S3 `raw/` directory.  
- Defines **primary keys**:
  - `orders` → `order_id`
  - `order_items` → `order_id`, `id`
  - `products` → `product_id`
- Derives `order_date` (the partition key) from the order timestamp.  
- Chooses the write path:
  - If the Delta table already exists → perform an **upsert (merge)** on the primary keys using the Delta APIs.  
  - Otherwise → write a new Delta table to S3 (`df.write.format("delta")`).  
- Moves processed files from `raw/` → `archive/` to prevent reprocessing.
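
The upsert-or-create logic above can be sketched as follows. This is not the project's actual script: the helper names are hypothetical, partitioning only `orders` by `order_date` is an assumption, and the `raw/` to `archive/` key mapping assumes a simple prefix swap. Delta and Spark are imported lazily inside the function that needs them.

```python
from typing import Dict, List

# Primary keys per table, as listed in the plan.
PRIMARY_KEYS: Dict[str, List[str]] = {
    "orders": ["order_id"],
    "order_items": ["order_id", "id"],
    "products": ["product_id"],
}


def merge_condition(table_name: str, target: str = "t", source: str = "s") -> str:
    """Build the merge join condition from the table's primary keys."""
    keys = PRIMARY_KEYS[table_name]
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def archive_key(raw_key: str) -> str:
    """Map raw/<rest> to archive/<rest> (assumed layout)."""
    prefix = "raw/"
    if not raw_key.startswith(prefix):
        raise ValueError(f"not a raw-zone key: {raw_key}")
    return "archive/" + raw_key[len(prefix):]


def upsert_or_create(spark, df, table_name: str, path: str) -> None:
    """Merge df into the Delta table at path, or create the table if absent."""
    from delta.tables import DeltaTable

    if DeltaTable.isDeltaTable(spark, path):
        (DeltaTable.forPath(spark, path).alias("t")
            .merge(df.alias("s"), merge_condition(table_name))
            .whenMatchedUpdateAll()      # existing keys: overwrite the row
            .whenNotMatchedInsertAll()   # new keys: insert the row
            .execute())
    else:
        writer = df.write.format("delta")
        if table_name == "orders":
            # Assumption: only orders carries the order_date partition column.
            writer = writer.partitionBy("order_date")
        writer.save(path)
```

Deriving the join condition from `PRIMARY_KEYS` keeps the composite key for `order_items` in one place, so the same function serves all three tables.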

---

### 4️⃣ AWS Glue Job Deployment
- Upload script (`spark_transactional_delta_lake.py`) and JAR to `s3://<bucket>/jars/`.  
- Create Glue job:
  - **Version:** 3.0  
  - **IAM Role:** Custom Glue Role (S3 + CloudWatch access)  
  - **Workers:** 2  
  - **Timeout:** 10 minutes  
  - **Max Concurrency:** 3  
- Add the `--table_name` job parameter, using one value per run:
  ```bash
  --table_name orders
  --table_name order_items
  --table_name products
  ```
- Run the job once per table, sequentially, and validate the outputs under the `lakehouse_dw/` prefix.
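
The job settings above could also be expressed with boto3 rather than the console. The sketch below is one possible shape under stated assumptions: the worker type `G.1X` is not given in the plan, the role ARN and S3 locations are supplied by the caller, and the helper names are hypothetical. boto3 is imported only when a run is actually started.

```python
from typing import Any, Dict, Iterable


def glue_job_definition(name: str, role_arn: str,
                        script_s3: str, jar_s3: str) -> Dict[str, Any]:
    """Keyword arguments for glue.create_job matching the plan's settings."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "3.0",
        "WorkerType": "G.1X",          # assumption: not specified in the plan
        "NumberOfWorkers": 2,
        "Timeout": 10,                 # minutes
        "ExecutionProperty": {"MaxConcurrentRuns": 3},
        "Command": {"Name": "glueetl", "ScriptLocation": script_s3,
                    "PythonVersion": "3"},
        "DefaultArguments": {"--extra-jars": jar_s3},
    }


def run_arguments(table_name: str) -> Dict[str, str]:
    """Per-run arguments: one table per job run."""
    return {"--table_name": table_name}


def run_sequentially(job_name: str,
                     tables: Iterable[str] = ("orders", "order_items", "products"),
                     poll_seconds: int = 30) -> None:
    """Start one run per table and wait for each to finish before the next."""
    import time
    import boto3

    glue = boto3.client("glue")
    for table in tables:
        run_id = glue.start_job_run(
            JobName=job_name, Arguments=run_arguments(table))["JobRunId"]
        while True:
            state = glue.get_job_run(
                JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
            if state not in ("STARTING", "RUNNING", "STOPPING", "WAITING"):
                break
            time.sleep(poll_seconds)
        if state != "SUCCEEDED":
            raise RuntimeError(f"{table}: job run ended in state {state}")
```

A job would be created once with `boto3.client("glue").create_job(**glue_job_definition(...))` and then driven with `run_sequentially`.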

---

### 5️⃣ Output Verification in S3
- Confirm directories created in S3:
  - `lakehouse_dw/orders/` – partitioned by `order_date`  
  - `archive/orders/` – contains processed files  
- Validate using Athena once the table is registered in the Glue Data Catalog (step 6):
  ```sql
  SELECT * FROM orders LIMIT 10;
  ```
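
The S3 check can also be scripted. The sketch below summarizes a list of object keys to confirm that a `_delta_log/` directory and `order_date=` partitions exist under the table prefix; the function names are hypothetical, and the key layout assumes Spark's default `column=value` partition folders.

```python
from typing import Dict, Iterable, Iterator


def summarize_delta_layout(keys: Iterable[str], table_prefix: str) -> Dict[str, object]:
    """Report whether a Delta log exists and which order_date partitions were written."""
    has_log = False
    partitions = set()
    for key in keys:
        if not key.startswith(table_prefix):
            continue  # ignore objects outside the table directory
        first = key[len(table_prefix):].split("/", 1)[0]
        if first == "_delta_log":
            has_log = True
        elif first.startswith("order_date="):
            partitions.add(first.split("=", 1)[1])
    return {"has_delta_log": has_log, "partitions": sorted(partitions)}


def list_keys(bucket: str, prefix: str) -> Iterator[str]:
    """Yield every object key under a prefix (imports boto3 lazily)."""
    import boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Calling `summarize_delta_layout(list_keys(bucket, "lakehouse_dw/orders/"), "lakehouse_dw/orders/")` gives a quick pass/fail view before moving on to the crawler step.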

---

### 6️⃣ AWS Glue Crawlers and Athena
- Create crawlers for each dataset (orders, order_items, products).  
- Select data source: **Delta Lake**.  
- Target path: `s3://<bucket>/lakehouse_dw/<dataset>/`  
- Database: `supermarket_transactions_db`  
- Run crawler → Confirm tables created in Glue Data Catalog.  
- Validate in Athena with queries:
  ```sql
  SELECT COUNT(*) AS total_orders, SUM(total_amount) AS total_revenue FROM orders;
  ```
- Demonstrate the Delta upsert by uploading `orders_updated.csv` to the raw zone and re-running the job.
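
The crawler setup can be scripted in the same way. This sketch builds one crawler definition per dataset with a Delta Lake target; the crawler naming scheme, the role ARN, and the choice of `CreateNativeDeltaTable` over manifest files are assumptions rather than settings stated in the plan.

```python
from typing import Any, Dict

DATASETS = ["orders", "order_items", "products"]


def crawler_definition(dataset: str, bucket: str, role_arn: str,
                       database: str = "supermarket_transactions_db") -> Dict[str, Any]:
    """Keyword arguments for glue.create_crawler with a Delta Lake target."""
    return {
        "Name": f"{dataset}_delta_crawler",  # hypothetical naming scheme
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "DeltaTargets": [{
                "DeltaTables": [f"s3://{bucket}/lakehouse_dw/{dataset}/"],
                "WriteManifest": False,
                # Assumption: register native Delta tables for Athena.
                "CreateNativeDeltaTable": True,
            }]
        },
    }


def create_and_start_crawlers(bucket: str, role_arn: str) -> None:
    """Create and start one crawler per dataset (imports boto3 lazily)."""
    import boto3

    glue = boto3.client("glue")
    for dataset in DATASETS:
        definition = crawler_definition(dataset, bucket, role_arn)
        glue.create_crawler(**definition)
        glue.start_crawler(Name=definition["Name"])
```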

---

### 7️⃣ Redshift Integration (Optional)
- Connect Redshift → Select **Federated User**.  
- Access Glue Data Catalog database `supermarket_transactions_db`.  
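- Redshift cannot query Delta tables directly, so materialize a snapshot table via Athena:
  ```sql
  DROP TABLE IF EXISTS delta_lake_orders;
  CREATE TABLE delta_lake_orders AS SELECT * FROM orders;
  ```
- Schedule these statements daily to refresh the snapshot table for Redshift queries (Athena runs one statement per query, and the CTAS fails if the table already exists).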
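
A scheduled refresh could be driven by a small script such as the sketch below, for example from a daily cron job or Lambda. It drops the previous snapshot before re-creating it, since a CTAS into an existing table name fails. The function names and the `output_location` parameter (an S3 path for Athena query results) are assumptions.

```python
from typing import List


def refresh_statements(source: str = "orders",
                       snapshot: str = "delta_lake_orders") -> List[str]:
    """Statements that rebuild the snapshot table, in execution order."""
    return [
        f"DROP TABLE IF EXISTS {snapshot}",
        f"CREATE TABLE {snapshot} AS SELECT * FROM {source}",
    ]


def run_refresh(database: str, output_location: str, poll_seconds: int = 2) -> None:
    """Run each statement in Athena and wait for it to finish."""
    import time
    import boto3

    athena = boto3.client("athena")
    for sql in refresh_statements():
        query_id = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        while True:
            state = athena.get_query_execution(
                QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(poll_seconds)
        if state != "SUCCEEDED":
            raise RuntimeError(f"Athena query ended in state {state}: {sql}")
```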

---

## 📆 Sprint Alignment

| **Sprint** | **Duration** | **Focus Area** | **Key Deliverables** |
|-------------|---------------|----------------|----------------------|
| Sprint 1 | 16-Oct-2025 to 18-Oct-2025 (3 Days) | Local Spark Setup and Delta Testing | Docker validation, Delta JAR configuration |
| Sprint 2 | 23-Oct-2025 to 27-Oct-2025 (5 Days) | Glue Job Deployment & S3 Validation | Glue job success, Archive validation |
| Sprint 3 | 28-Oct-2025 to 31-Oct-2025 (4 Days) | Crawlers, Athena & Redshift Integration | Upsert testing, Query validation |

---