# 🧩 Team_2_APEX

## 📘 Project Title
**Retail Lakehouse**

## 🎯 Objective
To design and implement a data lakehouse architecture using AWS Glue (Delta Lake), AWS Glue Crawlers & Data Catalog, Amazon S3, and Amazon Athena that unifies transactional, inventory, and customer data into a single analytical system for business reporting and analysis.

## 🧠 Tech Stack
AWS Glue (Delta Lake) · Glue Crawlers & Data Catalog · Amazon S3 · Amazon Athena

## 👥 Team Members
- Sayyam K Nahar
- Yashaswini D S
- Nikhil Hitnalli
- Priyanka D L
- Tinku Kumar

## 🧩 Project Modules & Backlogs
### Module 1: Data Ingestion
- Load transactional and product datasets into S3, organized into raw, bronze, silver, and gold zones.
- Use Glue Crawlers to infer schema and update the Data Catalog.
- Validate data accessibility and structure.
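The zone layout and crawler setup above could be sketched as follows. This is a minimal illustration, not the team's actual configuration: it assumes a single bucket with one prefix per zone, and the bucket, dataset, role, and database names are hypothetical placeholders. The returned dictionary uses the `Name`, `Role`, `DatabaseName`, and `Targets` parameters of boto3's `glue.create_crawler`.

```python
# Zones named in the module description; everything else here is illustrative.
ZONES = ("raw", "bronze", "silver", "gold")


def zone_path(bucket: str, zone: str, dataset: str) -> str:
    """Return the S3 prefix for one dataset in one zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"s3://{bucket}/{zone}/{dataset}/"


def crawler_request(bucket: str, dataset: str, role_arn: str, database: str) -> dict:
    """Build keyword arguments for glue.create_crawler targeting the raw zone."""
    return {
        "Name": f"{dataset}-raw-crawler",  # hypothetical naming scheme
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": zone_path(bucket, "raw", dataset)}]},
    }


# Hypothetical values, for illustration only.
request = crawler_request(
    bucket="example-retail-lakehouse",
    dataset="sales",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="retail_raw",
)
print(request["Targets"]["S3Targets"][0]["Path"])
```

With AWS credentials configured, the request could then be submitted with `boto3.client("glue").create_crawler(**request)`.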

### Module 2: ETL Development
- Develop Glue jobs using Delta Lake to merge and clean data.
- Implement transformations for sales and customer metrics.
- Handle deduplication and schema evolution.
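The deduplication rule can be illustrated in plain Python. This sketch assumes "keep the most recently updated record per key", which is one common rule and not necessarily the one the team chose; the field names `order_id` and `updated_at` are hypothetical. In a Glue job the same logic would normally be expressed with Spark operations rather than a Python loop.

```python
def deduplicate(records, key, updated_at):
    """Keep the most recently updated record for each key value.

    Timestamps are compared as-is, so they must be mutually comparable
    (for example ISO 8601 strings in one consistent format).
    """
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[updated_at] > current[updated_at]:
            latest[record[key]] = record
    return list(latest.values())


# Hypothetical sample rows: order 1 appears twice with different timestamps.
rows = [
    {"order_id": 1, "amount": 10.0, "updated_at": "2025-10-16T09:00:00"},
    {"order_id": 2, "amount": 25.0, "updated_at": "2025-10-16T09:05:00"},
    {"order_id": 1, "amount": 12.0, "updated_at": "2025-10-16T10:00:00"},
]
clean = deduplicate(rows, key="order_id", updated_at="updated_at")
```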

### Module 3: Table Optimization
- Partition and optimize Delta tables for Athena queries.
- Implement upsert operations using Delta Lake features.
- Configure versioning for data rollback.
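An upsert in Delta Lake is usually written as a `MERGE INTO` statement. The helper below only builds the SQL text, so it can be reviewed without a Spark session; the table and column names are hypothetical. It assumes the source and target share the same columns, which is what `UPDATE SET *` and `INSERT *` rely on.

```python
def merge_statement(target, source, key_columns):
    """Build a Delta Lake MERGE INTO statement that upserts source rows into target."""
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Rows match when every key column is equal in target (t) and source (s).
    condition = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {condition}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table names, for illustration only.
sql = merge_statement("silver.sales", "staging_sales", ["order_id"])
print(sql)
```

In a Glue job with the Delta Lake extensions enabled, a statement like this would typically be run through `spark.sql(sql)`.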

### Module 4: Access & Security
- Define IAM policies for Glue, S3, and Athena access.
- Ensure restricted access to production data folders.
- Validate permissions using the IAM Policy Simulator.
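A restricted-access policy of the kind described above might look like the following sketch, which grants read-only access to a single zone prefix. The bucket name and the choice of read-only access are assumptions made for illustration; the project's actual policies may differ.

```python
import json


def read_only_zone_policy(bucket, zone):
    """Return an IAM policy document allowing read-only access to one zone prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, limited here by key prefix.
                "Sid": "ListZone",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{zone}/*"]}},
            },
            {
                # Object reads are limited to keys under the zone prefix.
                "Sid": "ReadZoneObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{zone}/*",
            },
        ],
    }


# Hypothetical bucket name, for illustration only.
policy = read_only_zone_policy("example-retail-lakehouse", "gold")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM Policy Simulator to check which actions are allowed or denied.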

### Module 5: Reporting & Analysis
- Create Athena views to join transactional and product data.
- Build queries for daily revenue, product performance, and regional insights.
- Prepare data visualizations in Amazon QuickSight (optional).
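A daily revenue query that joins transactional and product data could be prototyped locally before it is run in Athena. The sketch below uses Python's built-in `sqlite3` as a stand-in engine; the table and column names (`sales`, `products`, `order_date`, `quantity`, `unit_price`, `category`) are hypothetical, and the query sticks to plain SQL that should carry over to Athena with little or no change.

```python
import sqlite3

# Revenue per day and product category, joining sales to the product catalog.
DAILY_REVENUE_SQL = """
SELECT s.order_date,
       p.category,
       SUM(s.quantity * p.unit_price) AS revenue
FROM sales AS s
JOIN products AS p ON p.product_id = s.product_id
GROUP BY s.order_date, p.category
ORDER BY s.order_date, p.category
"""


def daily_revenue(sales_rows, product_rows):
    """Load sample rows into an in-memory database and run the revenue query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sales (order_date TEXT, product_id INTEGER, quantity INTEGER)"
        )
        conn.execute(
            "CREATE TABLE products (product_id INTEGER, category TEXT, unit_price REAL)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales_rows)
        conn.executemany("INSERT INTO products VALUES (?, ?, ?)", product_rows)
        return conn.execute(DAILY_REVENUE_SQL).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, small enough to verify by hand.
sample_sales = [("2025-10-16", 1, 2), ("2025-10-16", 2, 1), ("2025-10-17", 1, 3)]
sample_products = [(1, "grocery", 5.0), (2, "toys", 20.0)]
result = daily_revenue(sample_sales, sample_products)
```

Keeping the sample data this small makes it easy to compare each returned row against a manual calculation.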

### Module 6: Validation & Testing
- Test Delta Lake transactions and ensure ACID compliance.
- Validate query results with sample calculations.
- Document schema definitions and data lineage.



## 🚀 Sprint 1 – AWS Setup & Initialization
**Dates:** 16–18 Oct 2025 (3 days)

**Goal:** Establish S3 structure, IAM roles, and Glue Data Catalog for the Lakehouse setup.

**Tasks:**
- Create S3 buckets for raw, bronze, silver, and gold data layers.
- Configure IAM roles and permissions for Glue and Athena.
- Run Glue Crawlers to populate the Data Catalog.
- Validate S3 access and schema creation.
- Upload initial CSV data for testing.

**Deliverable:**
Base Lakehouse environment created with working Glue Crawlers and Data Catalog.

## 🚀 Sprint 2 – Pipeline Development & Execution
**Dates:** 23–26 Oct 2025 (4 days)

**Goal:** Develop Glue ETL scripts and integrate Delta Lake functionality.

**Tasks:**
- Create and run Glue jobs using Delta Lake format.
- Implement transformations for sales and inventory data.
- Configure partitioning for efficient queries.
- Run sample Athena queries to validate processed data.
- Store transformed data in the gold layer.

**Deliverable:**
ETL jobs developed and Delta tables successfully created in the gold zone.

## 🚀 Sprint 3 – Integration, Testing & Demo
**Dates:** 27–31 Oct 2025 (5 days)

**Goal:** Test, validate, and prepare the Lakehouse project for final demonstration.

**Tasks:**
- Run full ETL workflow from raw to gold layers.
- Execute validation queries in Athena for KPIs.
- Document Delta Lake versioning and schema evolution tests.
- Prepare demo queries and project presentation.
- Review IAM policies and finalize documentation.

**Deliverable:**
Validated and documented Lakehouse pipeline with Athena-based analytical queries.