# 🧩 Team 1 – CODE FLUX
## Rental Data Pipeline - Implementation Plan

---

## 🧭 Overview
This project focuses on building a **big data processing pipeline** for a rental vehicle marketplace using AWS services.  
We will implement an **end-to-end Spark-based workflow** leveraging **Amazon EMR**, **AWS Glue**, **Athena**, and **Step Functions** to process, transform, and analyze vehicle rental data stored in **Amazon S3**.

---

## ⚙️ Tech Stack Used
- **Amazon EMR** – Managed Hadoop/Spark cluster for distributed data processing
- **Amazon S3** – Storage layer and data lake for raw and transformed datasets
- **AWS Glue** – Schema inference and cataloging using Crawlers and Data Catalog
- **Amazon Athena** – SQL-based query engine for S3 data
- **AWS Step Functions** – Workflow orchestration for EMR cluster automation
- **Docker** – Local Spark testing environment before EMR deployment

---

## 📁 Datasets
All datasets reside in the **S3 raw zone**:
1. `vehicles.csv` – List of all available rental vehicles
2. `users.csv` – Platform user data
3. `locations.csv` – Master location reference data
4. `rental_transactions.csv` – Transactional rental data with start/end dates, pickup/drop locations, vehicle IDs, and total amount

---

## 🧩 Step-by-Step Plan

### 1️⃣ EMR and Spark Introduction
- Understand EMR as a managed platform for big data frameworks like Spark and Hadoop.
- Learn scaling, pricing, and IAM-based access control concepts.
- Recognize EMR’s integration with AWS services for secure, cost-effective processing.

---

### 2️⃣ Local Spark Job Development
- Develop two Spark jobs locally using Docker for validation.
- **Spark Job 1** – Uses `rental_transactions` and `users` datasets.
  - Convert rental start/end time to timestamps.  
  - Derive a `duration` column.  
  - Join users and transactions to calculate:
    - Total transactions & revenue  
    - Avg. transaction value  
    - Max/min rental duration  
    - User-level metrics (spending, revenue, etc.)  
- **Spark Job 2** – Uses `rental_transactions`, `locations`, and `vehicles`.
  - Perform multi-table joins.  
  - Derive:
    - Location-level metrics (revenue, unique vehicles, avg. transaction value)  
    - Vehicle-type-level metrics (revenue, count, duration, etc.)  
- Validate both Spark jobs locally by printing DataFrames (`.show()`) before EMR execution.  
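One way to make the local `.show()` check concrete is to compare the Spark output against a small reference calculation. The sketch below is plain Python, not the Spark job itself: it computes the Job 1 transaction metrics for a three-row sample. The column names (`rental_start_time`, `rental_end_time`, `total_amount`, `user_id`), the timestamp format, and the sample rows are all assumptions for illustration; the join with `users` is omitted.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical column names, timestamp format, and sample rows.
# Adjust them to match the real rental_transactions.csv headers.
SAMPLE_TRANSACTIONS = """rental_id,user_id,rental_start_time,rental_end_time,total_amount
r1,u1,2025-01-01 08:00:00,2025-01-01 12:00:00,40.0
r2,u1,2025-01-02 09:00:00,2025-01-02 10:00:00,15.0
r3,u2,2025-01-03 07:30:00,2025-01-03 13:30:00,60.0
"""
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def duration_hours(start: str, end: str) -> float:
    """Parse both timestamps and return the rental duration in hours."""
    delta = datetime.strptime(end, TS_FORMAT) - datetime.strptime(start, TS_FORMAT)
    return delta.total_seconds() / 3600


def reference_metrics(csv_text: str) -> dict:
    """Compute the Job 1 transaction metrics for a small sample, without Spark."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    amounts = [float(r["total_amount"]) for r in rows]
    durations = [
        duration_hours(r["rental_start_time"], r["rental_end_time"]) for r in rows
    ]

    # User-level spending: sum of total_amount per user_id.
    spend_by_user = defaultdict(float)
    for r in rows:
        spend_by_user[r["user_id"]] += float(r["total_amount"])

    return {
        "total_transactions": len(rows),
        "total_revenue": sum(amounts),
        "avg_transaction_value": sum(amounts) / len(rows),
        "max_duration_hours": max(durations),
        "min_duration_hours": min(durations),
        "spend_by_user": dict(spend_by_user),
    }


metrics = reference_metrics(SAMPLE_TRANSACTIONS)
print(metrics)
```

Running the Spark job against the same sample rows and comparing its output with these values gives a quick correctness check before moving to EMR.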

---

### 3️⃣ EMR Cluster Setup and Job Execution
- Create an EMR cluster (release `emr-7.1.0`, which ships Spark 3.5.0).
- Configure IAM roles:
  - `EMR_Service_Role`
  - `EMR_EC2_Instance_Profile_Role`
- Upload Spark scripts to S3 bucket path:  
  `s3://<your-bucket>/spark-scripts/`  
- Modify PySpark scripts:
  - Uncomment S3 write commands  
  - Comment out local `.show()`  
- Add steps for both Spark jobs in EMR console:
  - **Step 1** – Execute `spark_job_1.py`
  - **Step 2** – Execute `spark_job_2.py`
- Verify outputs in S3 folder `output/` containing:
  - `location_performance_metrics`
  - `transaction_metrics`
  - `user_metrics`
  - `vehicle_performance_metrics`
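The two console steps can also be written down as data. The following is a minimal sketch, assuming the scripts live under the `spark-scripts/` prefix from the plan and are launched with `spark-submit` through `command-runner.jar`. The step names and the `CONTINUE` failure action are illustrative choices, not settings taken from the project.

```python
import json

# Placeholder from the plan; replace <your-bucket> with the real bucket name.
SCRIPT_PREFIX = "s3://<your-bucket>/spark-scripts/"


def spark_step(name: str, script: str) -> dict:
    """Build one EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        # Assumption: keep the cluster alive if a step fails so logs can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode",
                "cluster",
                SCRIPT_PREFIX + script,
            ],
        },
    }


steps = [
    spark_step("Spark Job 1", "spark_job_1.py"),
    spark_step("Spark Job 2", "spark_job_2.py"),
]

print(json.dumps(steps, indent=2))
```

The resulting list has the shape that boto3's EMR client accepts as the `Steps` argument of `add_job_flow_steps`, so the same definition could be submitted from a script instead of the console.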

---

### 4️⃣ AWS Glue Crawlers and Athena Querying
- Create four AWS Glue Crawlers (one per output dataset).  
- Assign IAM role with:
  - `AWSGlueServiceRole`  
  - `CloudWatchLogsFullAccess`
- Configure each crawler with source path (S3 output folders).  
- Create a new database: `rental_vehicles_db`.  
- Run crawlers and confirm table creation in the Glue Data Catalog.  
- In Athena:
  - Select `rental_vehicles_db`  
  - Preview tables (`SELECT * FROM table LIMIT 10;`)  
  - Run analytical queries for insights.
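The crawler and preview-query setup above is repetitive across the four outputs, so it can be generated from one list. This is a sketch under assumptions: the output bucket placeholder, the crawler naming scheme, and the role name `GlueCrawlerRole` are hypothetical, and it assumes each crawler names its table after the S3 folder it scans.

```python
# Assumption: outputs sit under output/ in the same placeholder bucket as the scripts.
OUTPUT_PREFIX = "s3://<your-bucket>/output/"
DATABASE = "rental_vehicles_db"
DATASETS = [
    "location_performance_metrics",
    "transaction_metrics",
    "user_metrics",
    "vehicle_performance_metrics",
]


def crawler_config(dataset: str, role_name: str) -> dict:
    """Keyword arguments describing one Glue crawler for one output folder."""
    return {
        "Name": f"{dataset}_crawler",  # hypothetical naming scheme
        "Role": role_name,
        "DatabaseName": DATABASE,
        "Targets": {"S3Targets": [{"Path": f"{OUTPUT_PREFIX}{dataset}/"}]},
    }


def preview_query(table: str, limit: int = 10) -> str:
    """Athena preview query for one catalog table."""
    return f'SELECT * FROM "{DATABASE}"."{table}" LIMIT {limit};'


# "GlueCrawlerRole" is a hypothetical name for the IAM role described above.
crawlers = [crawler_config(d, "GlueCrawlerRole") for d in DATASETS]
queries = [preview_query(d) for d in DATASETS]

for q in queries:
    print(q)
```

Each dictionary in `crawlers` matches the keyword arguments of boto3's Glue `create_crawler` call, and each string in `queries` can be pasted into the Athena editor once the tables exist.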

---

### 5️⃣ AWS Step Functions Workflow
- Create JSON definition `stepfunctions_emr.json` with 4 states:
  1. **Create EMR Cluster**  
  2. **Execute Spark Job 1**  
  3. **Execute Spark Job 2**  
  4. **Terminate Cluster**  
- Define transitions (`Next`, `Catch`) to handle job success/failure.  
- IAM Role Setup:
  - Create Step Functions Execution Role  
  - Attach inline policy from `execution_policy_stepfunctions.json`  
  - Attach the managed policies `CloudWatchLogsFullAccess` and `AmazonS3FullAccess`
- Deploy and monitor execution flow:  
  - Validate EMR cluster creation → Spark jobs → termination.
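To show how the four states and their `Next`/`Catch` transitions fit together, the sketch below builds a state machine definition in Python and prints it as JSON. It is not the contents of `stepfunctions_emr.json`. The cluster name, instance types and counts, and the `$.cluster` result path are assumptions, and the role names are the ones listed in the EMR setup step.

```python
import json

SCRIPT_PREFIX = "s3://<your-bucket>/spark-scripts/"  # placeholder bucket


def spark_task(name: str, script: str, next_state: str) -> dict:
    """Task state that adds a spark-submit step and waits for it to finish."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
        "Parameters": {
            "ClusterId.$": "$.cluster.ClusterId",
            "Step": {
                "Name": name,
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             SCRIPT_PREFIX + script],
                },
            },
        },
        "ResultPath": "$.step",
        # On any error, jump to termination so the cluster is not left running.
        "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error",
                   "Next": "Terminate Cluster"}],
        "Next": next_state,
    }


definition = {
    "Comment": "Create cluster, run two Spark jobs, terminate cluster",
    "StartAt": "Create EMR Cluster",
    "States": {
        "Create EMR Cluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": {
                "Name": "rental-pipeline-cluster",  # hypothetical name
                "ReleaseLabel": "emr-7.1.0",
                "Applications": [{"Name": "Spark"}],
                "ServiceRole": "EMR_Service_Role",
                "JobFlowRole": "EMR_EC2_Instance_Profile_Role",
                "Instances": {
                    "KeepJobFlowAliveWhenNoSteps": True,
                    # Assumed instance types and counts, for illustration only.
                    "InstanceGroups": [
                        {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                         "InstanceCount": 1},
                        {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                         "InstanceCount": 2},
                    ],
                },
            },
            "ResultPath": "$.cluster",  # keeps ClusterId available to later states
            "Next": "Execute Spark Job 1",
        },
        "Execute Spark Job 1": spark_task("Spark Job 1", "spark_job_1.py",
                                          "Execute Spark Job 2"),
        "Execute Spark Job 2": spark_task("Spark Job 2", "spark_job_2.py",
                                          "Terminate Cluster"),
        "Terminate Cluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
            "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

In this sketch a failed Spark job still ends at `Terminate Cluster`, so the execution itself would be reported as succeeded. Adding a `Fail` state after termination on the error path would surface the failure if that matters for monitoring.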

---

### 6️⃣ EMR Serverless Execution
- Create an **EMR Serverless application** and launch it via EMR Studio.  
- Create runtime IAM Role with trust policy (`emr_serverless_trust_policy.json`).  
- Configure:
  - Runtime Role → `EMRServerlessRole`
  - Script location → Spark job in S3
  - Application type → Spark (release `emr-7.1.0`)
- Submit the job via “Submit Batch Job Run”.
- Confirm that the job run succeeds; it is expected to finish in under 2 minutes.
- Note: compare cost-effectiveness of provisioned EMR vs EMR Serverless.
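For reference, the runtime role's trust relationship and the batch job submission can be sketched as plain data. The trust policy below is the usual shape for letting EMR Serverless assume a role and is not a copy of `emr_serverless_trust_policy.json`. The application ID, account ID, and job name are placeholders or hypothetical values.

```python
import json

# Trust policy allowing the EMR Serverless service to assume the runtime role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "emr-serverless.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def job_run_request(application_id: str, role_arn: str, script_uri: str) -> dict:
    """Keyword arguments for an EMR Serverless StartJobRun request."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": script_uri}},
        "name": "spark-job-1",  # hypothetical job name
    }


request = job_run_request(
    application_id="<application-id>",  # placeholder
    role_arn="arn:aws:iam::<account-id>:role/EMRServerlessRole",  # placeholder account
    script_uri="s3://<your-bucket>/spark-scripts/spark_job_1.py",
)

print(json.dumps(trust_policy, indent=2))
print(json.dumps(request, indent=2))
```

The `request` dictionary matches the keyword arguments of boto3's `emr-serverless` client method `start_job_run`, which is the scripted equivalent of the console's batch job submission.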

---

## 📆 Sprint Alignment

| **Sprint** | **Duration** | **Focus Area** | **Key Deliverables** |
|-------------|---------------|----------------|----------------------|
| Sprint 1 | 16-Oct-2025 to 18-Oct-2025  (3 Days) | Environment Setup & Local Validation | EMR Roles, Local Spark Validation |
| Sprint 2 | 23-Oct-2025 to 27-Oct-2025  (4 Days) | EMR Execution & Glue Integration | Cluster Jobs, Output Validation, Crawlers Setup |
| Sprint 3 | 28-Oct-2025 to 31-Oct-2025  (4 Days) | Step Functions & Serverless Execution | Workflow Automation, Final Testing |

---