# 🧩 Team_1_CODE_FLUX

## 📘 Project Title
**Rental Data Pipeline**

## 🎯 Objective
To design and implement an end-to-end data pipeline using Amazon S3, Amazon EMR (PySpark), AWS Glue Crawlers, Amazon Athena, and AWS Step Functions that ingests, processes, and analyzes rental vehicle data to support data-driven business decisions.

## 🧠 Tech Stack
Amazon S3 · Amazon EMR (PySpark) · AWS Glue Crawlers · Amazon Athena · AWS Step Functions

## 👥 Team Members
- Vijay Kumar E
- Smita Sudhakar Hegde
- Sakshath K Shetty
- Mallika Shree K C
- Gokul Raj S

## 🧩 Project Modules & Backlogs
### Module 1: Data Ingestion
- Upload raw CSV files (locations, users, vehicles, transactions) to an S3 bucket.
- Set up a folder structure (S3 prefixes) for raw, processed, and analytics data.
- Confirm the uploaded files are readable and that bucket permissions are applied correctly.
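
The ingestion steps above could be scripted roughly as follows. This is a minimal sketch, not the team's actual script: the `raw/<dataset>/<file>` key layout, the helper names `raw_key` and `upload_raw_files`, and the assumption that each dataset lives in a local file named `<dataset>.csv` are all illustrative.

```python
from pathlib import Path

# The four raw datasets named in Module 1.
DATASETS = ("locations", "users", "vehicles", "transactions")


def raw_key(dataset: str, filename: str) -> str:
    """Build the S3 object key for a raw file (assumed layout: raw/<dataset>/<file>)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    return f"raw/{dataset}/{filename}"


def upload_raw_files(bucket: str, local_dir: str) -> list[str]:
    """Upload <local_dir>/<dataset>.csv for every dataset and return the keys used."""
    import boto3  # imported lazily so raw_key() works without AWS credentials

    s3 = boto3.client("s3")
    keys = []
    for dataset in DATASETS:
        local_path = Path(local_dir) / f"{dataset}.csv"
        key = raw_key(dataset, local_path.name)
        s3.upload_file(str(local_path), bucket, key)
        keys.append(key)
    return keys
```

Keeping one prefix per dataset under `raw/` lets downstream jobs read a whole dataset by pointing at its prefix.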

### Module 2: Data Processing
- Create an EMR cluster for running PySpark jobs.
- Write and run PySpark scripts for data cleaning and basic aggregations.
- Write the processed output back to S3 in a structured format (CSV or Parquet).
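
A cleaning-and-aggregation job of the kind described above might look like the sketch below. The column names `vehicle_id` and `total_amount`, the output table names, and the `raw/` and `processed/` prefixes are assumptions for illustration; the real datasets may use different names.

```python
import sys


def processed_path(bucket: str, table: str) -> str:
    """Return the S3 URI where a processed table is written (assumed layout)."""
    return f"s3://{bucket}/processed/{table}/"


def run(bucket: str) -> None:
    # Imported here so the helper above can be used without a Spark install.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rental-transactions-clean").getOrCreate()

    # Read the raw transactions CSV; a header row is assumed.
    tx = spark.read.csv(
        f"s3://{bucket}/raw/transactions/", header=True, inferSchema=True
    )

    # Basic cleaning: drop exact duplicates and rows missing key fields.
    clean = tx.dropDuplicates().dropna(subset=["vehicle_id", "total_amount"])

    # Basic aggregation: rental count and revenue per vehicle.
    per_vehicle = clean.groupBy("vehicle_id").agg(
        F.count("*").alias("rental_count"),
        F.sum("total_amount").alias("total_revenue"),
    )

    # Parquet is used here; CSV would work the same way via .csv(...).
    clean.write.mode("overwrite").parquet(processed_path(bucket, "transactions"))
    per_vehicle.write.mode("overwrite").parquet(
        processed_path(bucket, "vehicle_summary")
    )
    spark.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    run(sys.argv[1])
```

Parquet is usually the better choice of the two formats for Athena, since it is columnar and carries its own schema.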

### Module 3: Workflow Control
- Use AWS Step Functions to sequence the data ingestion and processing jobs.
- Add simple success/failure states and basic logging.
- Test workflow execution with sample data.
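
One possible shape for such a workflow, expressed as an Amazon States Language definition built in Python, is sketched below. It submits a single Spark step to an existing EMR cluster and routes to a success or failure state. The state names, the step name, and the `build_state_machine` helper are hypothetical; the cluster ID and script location are passed in rather than assumed.

```python
import json


def build_state_machine(cluster_id: str, script_uri: str) -> dict:
    """Return a state machine definition that runs one Spark step on EMR."""
    return {
        "Comment": "Run the PySpark job on EMR, then report success or failure",
        "StartAt": "RunSparkStep",
        "States": {
            "RunSparkStep": {
                "Type": "Task",
                # The .sync suffix makes Step Functions wait for the step to finish.
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId": cluster_id,
                    "Step": {
                        "Name": "clean-and-aggregate",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": [
                                "spark-submit",
                                "--deploy-mode",
                                "cluster",
                                script_uri,
                            ],
                        },
                    },
                },
                # Any error in the step is routed to the failure state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "PipelineFailed"}],
                "Next": "PipelineSucceeded",
            },
            "PipelineSucceeded": {"Type": "Succeed"},
            "PipelineFailed": {
                "Type": "Fail",
                "Error": "SparkStepFailed",
                "Cause": "The EMR Spark step did not complete successfully",
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values; print the JSON to paste into the Step Functions console.
    print(json.dumps(build_state_machine("j-EXAMPLE", "s3://example-bucket/scripts/job.py"), indent=2))
```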

### Module 4: Access & Permissions
- Attach the correct IAM roles to EMR and Step Functions using the provided policy files.
- Verify permissions for reading and writing to S3.
- Ensure each role follows the principle of least privilege.
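
To make the least-privilege idea concrete, the sketch below builds a trust policy and a narrowly scoped S3 policy as Python dictionaries. It is illustrative only and is not the provided policy files; the `raw/` and `processed/` prefixes and both helper names are assumptions.

```python
import json


def trust_policy(service: str) -> dict:
    """Trust policy letting one AWS service (e.g. states.amazonaws.com) assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_pipeline_policy(bucket: str) -> dict:
    """Read-only on raw/, read-write on processed/, nothing else in the bucket."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": arn,
            },
            {
                "Sid": "ReadRaw",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{arn}/raw/*",
            },
            {
                "Sid": "ReadWriteProcessed",
                "Effect": "Allow",
                # Delete is included because Spark's overwrite mode removes old files.
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{arn}/processed/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(s3_pipeline_policy("example-bucket"), indent=2))
```

Scoping write access to the `processed/` prefix means a faulty job cannot overwrite the raw source files.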

### Module 5: Query & Reporting
- Set up an Athena database and tables for the processed data.
- Run sample queries to check record counts and summaries.
- Prepare a simple report or chart showing top-performing vehicles or locations.
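
A sample "top-performing vehicles" query could be run from Python along these lines. This is a sketch under assumptions: the table and column names (`vehicle_summary`, `vehicle_id`, `total_revenue`) are hypothetical, and the database name and results location are supplied by the caller.

```python
import time


def top_vehicles_sql(table: str, limit: int = 10) -> str:
    """Build a query ranking vehicles by revenue (column names are assumed)."""
    return (
        f"SELECT vehicle_id, total_revenue FROM {table} "
        f"ORDER BY total_revenue DESC LIMIT {int(limit)}"
    )


def run_query(sql: str, database: str, output_location: str) -> str:
    """Start an Athena query, poll until it finishes, and return its final state."""
    import boto3  # imported lazily so the SQL helper works without AWS access

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)  # simple fixed-interval polling
```

A record-count check follows the same pattern with a `SELECT COUNT(*)` query string.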

### Module 6: Validation & Testing
- Use the provided local script for basic testing and validation.
- Check schema consistency and data accuracy.
- Confirm end-to-end data flow works as expected.
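
A schema consistency check of the kind listed above can be done with the standard library alone. The sketch below is not the provided local script; the `check_schema` helper and the column names in the usage example are assumptions.

```python
import csv
import io


def check_schema(csv_text: str, expected: list[str]) -> list[str]:
    """Return a list of problems: missing columns, or rows with empty required values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    missing = [col for col in expected if col not in header]
    if missing:
        # Without the required columns, row-level checks are meaningless.
        return [f"missing columns: {missing}"]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        empty = [col for col in expected if not (row[col] or "").strip()]
        if empty:
            problems.append(f"line {line_no}: empty values in {empty}")
    return problems


if __name__ == "__main__":
    sample = "transaction_id,vehicle_id,total_amount\n1,v1,10.5\n2,,7.0\n"
    for problem in check_schema(sample, ["transaction_id", "vehicle_id", "total_amount"]):
        print(problem)
```

An empty result means every required column is present and populated in every row.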



## 🚀 Sprint 1 – AWS Setup & Initialization
**Dates:** 16–18 Oct 2025 (3 days)

**Goal:** Get all AWS services and datasets ready for pipeline development.

**Tasks:**
- Create an S3 bucket and folder structure for raw and processed data.
- Upload sample datasets (CSV files) into the raw folder.
- Configure IAM roles and attach trust and execution policies.
- Test connectivity between S3, EMR, and Step Functions.
- Validate permissions and access policies.

**Deliverable:**
AWS environment and dataset ready for development.

## 🚀 Sprint 2 – Pipeline Development & Execution
**Dates:** 23–26 Oct 2025 (4 days)

**Goal:** Build and run the PySpark-based data pipeline using EMR.

**Tasks:**
- Write PySpark scripts for data transformation and aggregation.
- Deploy PySpark jobs on EMR and validate job execution.
- Make the EMR output in S3 queryable from Athena (for example, by cataloguing it with an AWS Glue Crawler).
- Test Step Functions flow with EMR job trigger.
- Ensure transformed data is available in S3 for analysis.
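
One way to make the EMR output queryable in Athena is to point a Glue Crawler at the processed prefix so that it registers tables in the Glue Data Catalog. The sketch below assumes a `processed/` prefix; the crawler name, role ARN, and database name are supplied by the caller, and both helper names are hypothetical.

```python
def crawler_config(name: str, role_arn: str, database: str, bucket: str) -> dict:
    """Build the arguments for creating a crawler over the processed prefix."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/processed/"}]},
    }


def create_and_start_crawler(config: dict) -> None:
    """Create the crawler in Glue and start a first run."""
    import boto3  # imported lazily so crawler_config() works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

Once the crawler run completes, the tables it discovered appear under the chosen database in Athena.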

**Deliverable:**
Working data pipeline that processes and stores clean, structured data.

## 🚀 Sprint 3 – Integration, Testing & Demo
**Dates:** 27–31 Oct 2025 (5 days)

**Goal:** Combine all modules, test end-to-end flow, and prepare for final presentation.

**Tasks:**
- Run the entire pipeline from ingestion to reporting.
- Execute Athena queries to verify data accuracy and completeness.
- Document steps, workflow diagrams, and data flow.
- Prepare a short demo showing the results and key insights.
- Review IAM roles and optimize permissions where needed.

**Deliverable:**
Tested and documented data pipeline ready for demo and review.