# 🧩 Team 3 – CLOUD CODERS  
## Music Stream – AWS Glue + Airflow Orchestration Plan

---

## 🧭 Overview
This project focuses on using **Airflow for orchestration** and **AWS Glue (Spark + Python Shell)** for data transformation, with **DynamoDB** as the target NoSQL database.  
It demonstrates how Airflow can coordinate data validation, transformation, and ingestion workflows while delegating the heavy computation to Glue.

---

## ⚙️ Tech Stack Used
- **Apache Airflow** – Orchestration and workflow scheduling  
- **AWS Glue (Spark & Python Shell)** – Data transformation and ingestion  
- **AWS S3** – Storage for raw, processed, and archived data  
- **AWS DynamoDB** – Target NoSQL database for metrics storage  
- **Docker** – Local development for Spark testing  
- **IAM & CloudWatch** – Security and logging

---

## 📁 Datasets
- **songs.csv** – Song catalog data  
- **streams.csv** – User stream data (listens, duration, etc.)  
- **users.csv** – User profile and demographic data  

These files are placed under `s3://<bucket>/spotify_data/` with subfolders:  
`songs/`, `user_streams/`, and `users/`.

---

## 🧩 Step-by-Step Plan

### 1️⃣ Airflow for Orchestration
- Airflow DAG (`dag_glue_workflow.py`) runs every **5 minutes**.  
- DAG Structure:
  1. **Check Files Task** – Verifies input files (songs, users, streams) exist in S3.  
  2. **Trigger Glue Spark Job** – If valid, triggers Spark job for metric calculation.  
  3. **Wait for Spark Completion** – Ensures ETL is finished before proceeding.  
  4. **Trigger Glue Python Shell Job** – Loads metrics into DynamoDB.  
  5. **Archive Task** – Moves processed files from `/user_streams/` to `/archive/`.
- If any required file is missing, the DAG branches to a **Skip Execution** dummy operator (`EmptyOperator` in Airflow 2.3+) and the run ends without triggering any Glue job.  
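
The file check and the skip branch can be sketched as a plain function that returns the next task ID, which is the contract Airflow's `BranchPythonOperator` expects from its `python_callable`. This is a minimal sketch, not the project's actual DAG code: the task IDs `trigger_glue_spark_job` and `skip_execution` are hypothetical names, and the S3 client is passed in so the logic can be tested without AWS access.

```python
# Prefixes taken from the dataset layout described in this plan.
REQUIRED_PREFIXES = (
    "spotify_data/songs/",
    "spotify_data/user_streams/",
    "spotify_data/users/",
)


def prefix_has_files(s3_client, bucket, prefix):
    """Return True if at least one real object (not a folder marker) is under prefix."""
    # Only the first few keys are inspected, which is enough for an existence check.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=10)
    # "Contents" is absent when nothing matches; keys ending in "/" are folder markers.
    return any(not obj["Key"].endswith("/") for obj in response.get("Contents", []))


def choose_branch(s3_client, bucket):
    """Return the task_id to follow next (both IDs are hypothetical names)."""
    missing = [p for p in REQUIRED_PREFIXES if not prefix_has_files(s3_client, bucket, p)]
    return "skip_execution" if missing else "trigger_glue_spark_job"
```

In a DAG, the callable would receive a real client, for example `choose_branch(boto3.client("s3"), "<bucket>")`.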

---

### 2️⃣ Spark Job – Calculate Metrics ETL
**Job Name:** `calculate_metrics_etl`  
- **Purpose:** Aggregates listening metrics from streams and songs data.  
- Reads data from S3 (`spotify_data/songs/` and `spotify_data/user_streams/`).  
- Performs:  
  - Typecasting and null removal on `track_id`.  
  - Aggregation by `track_id` and `report_date` to compute:  
    - Total Listens  
    - Unique Users  
    - Total Listening Time  
    - Average Listening Time per User  
- Joins aggregated data with song metadata.  
- Uses **Window Functions** (`rank`, `partitionBy`, `orderBy`) to:  
  - Get **Top 3 Songs per Genre per Day**.  
  - Get **Top 5 Genres** by total listens.  
- Writes final results to `s3://<bucket>/spotify_data/output/song_kpis/`.
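
To make the intended KPI semantics concrete, here is a small pure-Python reference of the aggregation and the per-genre ranking. It is not the Spark job itself. It mirrors what a `groupBy` on `track_id` and `report_date`, followed by `rank()` over a window partitioned by genre and date, would be expected to produce, so it could serve as an oracle in local tests. The column names `user_id`, `listen_seconds`, and `genre` are assumptions, as the plan does not list the CSV schemas.

```python
from collections import defaultdict


def aggregate_streams(streams):
    """Aggregate stream rows into KPIs per (track_id, report_date)."""
    groups = defaultdict(lambda: {"listens": 0, "users": set(), "seconds": 0.0})
    for row in streams:
        if row.get("track_id") is None:  # null removal on track_id
            continue
        group = groups[(str(row["track_id"]), row["report_date"])]  # typecast key
        group["listens"] += 1
        group["users"].add(row["user_id"])                 # assumed column name
        group["seconds"] += float(row["listen_seconds"])   # assumed column name
    return [
        {
            "track_id": track_id,
            "report_date": report_date,
            "total_listens": g["listens"],
            "unique_users": len(g["users"]),
            "total_listening_time": g["seconds"],
            "avg_listening_time_per_user": g["seconds"] / len(g["users"]),
        }
        for (track_id, report_date), g in sorted(groups.items())
    ]


def top_n_per_genre(kpis, genre_by_track, n=3):
    """Keep rows whose rank by total_listens within (genre, report_date) is <= n."""
    partitions = defaultdict(list)
    for kpi in kpis:
        genre = genre_by_track.get(kpi["track_id"])
        if genre is not None:  # inner join with song metadata
            partitions[(genre, kpi["report_date"])].append(kpi)
    ranked = []
    for (genre, _date), rows in sorted(partitions.items()):
        rows.sort(key=lambda r: r["total_listens"], reverse=True)
        rank, previous = 0, None
        for position, row in enumerate(rows, start=1):
            if row["total_listens"] != previous:  # ties share a rank, like rank()
                rank, previous = position, row["total_listens"]
            if rank <= n:
                ranked.append({**row, "genre": genre, "rank": rank})
    return ranked
```

Because ties share a rank, a partition can return more than `n` rows; if exactly three rows per genre are required, `row_number()` semantics would be the alternative.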

---

### 3️⃣ Python Shell Job – Insert Metrics Dynamo
**Job Name:** `insert_metrics_dynamo`  
- Reads Spark output (`song_kpis/`) from S3.  
- Inserts or updates records into **DynamoDB table** `track_level_reports`.  
- Uses **Upsert (UpdateItem)** operation based on composite key:  
  - `track_id` (Partition Key)  
  - `report_date` (Sort Key)  
- If an item with the same key already exists, its metrics are overwritten with the latest values; otherwise a new item is created.
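
One way to express this upsert is to build the arguments for DynamoDB's `UpdateItem` from each KPI record. The sketch below only constructs the keyword arguments, which would then be passed to a boto3 table resource as `table.update_item(**kwargs)`. The helper name `build_upsert_kwargs` is hypothetical, and the metric attribute names depend on the Spark output schema, which this plan does not spell out.

```python
from decimal import Decimal

KEY_ATTRS = ("track_id", "report_date")  # partition key and sort key


def build_upsert_kwargs(record):
    """Build update_item keyword arguments that SET every non-key attribute."""
    key = {name: record[name] for name in KEY_ATTRS}
    metrics = {k: v for k, v in record.items() if k not in KEY_ATTRS}
    if not metrics:
        raise ValueError("record has no metric attributes to write")
    # Placeholders avoid clashes with DynamoDB reserved words in attribute names.
    names = {f"#m{i}": attr for i, attr in enumerate(metrics)}
    values = {
        # The boto3 resource API rejects Python floats, so convert them to Decimal.
        f":v{i}": Decimal(str(v)) if isinstance(v, float) else v
        for i, v in enumerate(metrics.values())
    }
    assignments = ", ".join(f"#m{i} = :v{i}" for i in range(len(metrics)))
    return {
        "Key": key,
        "UpdateExpression": "SET " + assignments,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
```

`UpdateItem` creates the item when the key does not exist yet, so no separate existence check is needed.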

---

### 4️⃣ Local Testing with Docker
- Run Glue Spark job locally before AWS deployment using Glue Docker image:
  ```bash
  docker run -it \
    -v ~/.aws:/home/glue_user/.aws \
    -v "$(pwd)":/home/glue_user/workspace \
    -e AWS_PROFILE=default \
    -p 4040:4040 -p 18080:18080 \
    --name glue_spark_submit \
    amazon/aws-glue-libs:glue_libs_4.0.0_image_01 \
    spark-submit /home/glue_user/workspace/glue_pyspark.py
  ```
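- Replace `df.show()` calls with writes through `df.write` (a property, for example `df.write.parquet(...)`) for production runs, since `show()` only prints a sample of rows to the console.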

---

### 5️⃣ AWS Glue Job Deployment
- Create **IAM Role** with:  
  - `AmazonS3FullAccess`  
  - `AmazonDynamoDBFullAccess`  
  - `CloudWatchLogsFullAccess`
- Deploy both Glue jobs in console:  
  - **Spark Job:** `calculate_metrics_etl` (2 workers, 15 min timeout)  
  - **Python Job:** `insert_metrics_dynamo` (1/16 DPU, 20 min timeout)
- Attach the IAM role created above to both jobs.
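
As an alternative to clicking through the console, the same two job definitions could be captured in code and passed to boto3's Glue `create_job`. This is a sketch under assumptions: the script locations, the `G.1X` worker type, the Glue version, and the Python versions are not stated in this plan and are placeholders here. The job names, worker count, capacity, and timeouts (in minutes) follow the settings above.

```python
def glue_job_specs(role_arn, script_bucket):
    """Return create_job keyword arguments for both Glue jobs."""
    spark_job = {
        "Name": "calculate_metrics_etl",
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            # Assumed script path, not given in the plan.
            "ScriptLocation": f"s3://{script_bucket}/scripts/calculate_metrics_etl.py",
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",   # assumption, matching the local Docker image
        "WorkerType": "G.1X",   # assumption
        "NumberOfWorkers": 2,
        "Timeout": 15,          # minutes
    }
    python_shell_job = {
        "Name": "insert_metrics_dynamo",
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # Python Shell job type
            # Assumed script path, not given in the plan.
            "ScriptLocation": f"s3://{script_bucket}/scripts/insert_metrics_dynamo.py",
            "PythonVersion": "3.9",  # assumption
        },
        "MaxCapacity": 0.0625,  # 1/16 DPU
        "Timeout": 20,          # minutes
    }
    return [spark_job, python_shell_job]
```

Each spec could then be applied with `boto3.client("glue").create_job(**spec)`, using the ARN of the IAM role created above.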

---

### 6️⃣ Airflow Setup and Permissions
- Upload DAG to Airflow environment → `/dags/dag_glue_workflow.py`  
- Update Airflow’s **Execution Role** to include:  
  - `AWSGlueConsoleFullAccess` (covers the `glue:StartJobRun` and `glue:GetJobRun` permissions needed to trigger and monitor Glue jobs)  
- DAG checks input files and executes Glue jobs sequentially.
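
The sequential "trigger, then wait" behaviour can be illustrated with the two Glue API calls it relies on, `start_job_run` and `get_job_run`. The sketch below is an assumption about how the DAG tasks might be implemented, not the project's code. In practice the Airflow Amazon provider's `GlueJobOperator` can do the same job. The Glue client and the `sleep` function are injected so the loop can be exercised without AWS.

```python
import time

# Glue run states after which no further progress is expected.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and block until it reaches a terminal state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            break
        sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Glue job {job_name} run {run_id} ended in state {state}")
    return run_id


def run_pipeline(glue_client, **kwargs):
    """Run the Spark job first; load DynamoDB only if it succeeded."""
    for job_name in ("calculate_metrics_etl", "insert_metrics_dynamo"):
        run_glue_job(glue_client, job_name, **kwargs)
```

Raising on any non-successful state keeps a failed Spark run from being followed by a DynamoDB load of stale or partial output.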

---

### 7️⃣ DynamoDB Table Configuration
- **Table Name:** `track_level_reports`  
- **Partition Key:** `track_id`  
- **Sort Key:** `report_date`  
- Uses the default capacity mode with auto-scaling enabled.  
- Used for storing song-level metrics per day.  

---

### 8️⃣ Validation and Execution
1. Place sample data files in S3:  
   - `/spotify_data/songs/sample_songs.csv`  
   - `/spotify_data/user_streams/sample_streams.csv`  
   - `/spotify_data/users/sample_users.csv`  
2. Trigger DAG manually from Airflow UI.  
3. Monitor Glue job execution logs in CloudWatch.  
4. Verify final results:
   - `s3://<bucket>/spotify_data/output/song_kpis/`  
   - DynamoDB table `track_level_reports`, using a PartiQL query (a full table scan, so suitable only for small test tables):  
     ```sql
     SELECT * FROM track_level_reports;
     ```
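
For a targeted check of a single record instead of a full scan, the same PartiQL interface is available programmatically through the DynamoDB client's `execute_statement`. The sketch below assumes both key attributes are stored as strings, which this plan does not confirm, and the helper name `fetch_report` is hypothetical.

```python
def fetch_report(dynamodb_client, track_id, report_date):
    """Return the item for one (track_id, report_date) key, or None if absent."""
    response = dynamodb_client.execute_statement(
        Statement=(
            'SELECT * FROM "track_level_reports" '
            "WHERE track_id = ? AND report_date = ?"
        ),
        # Low-level attribute values; "S" assumes string-typed keys.
        Parameters=[{"S": track_id}, {"S": report_date}],
    )
    items = response.get("Items", [])
    return items[0] if items else None
```

With a real client this would be called as `fetch_report(boto3.client("dynamodb"), "<track_id>", "<report_date>")`, using a key known to be in the sample data.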

---

## 📆 Sprint Alignment

| **Sprint** | **Duration** | **Focus Area** | **Key Deliverables** |
|-------------|---------------|----------------|----------------------|
| Sprint 1 | 16-Oct-2025 to 18-Oct-2025 (3 Days) | Airflow DAG & Local Spark Testing | DAG structure, local PySpark validation |
| Sprint 2 | 23-Oct-2025 to 27-Oct-2025 (4 Days) | Glue Job Deployment & DynamoDB Integration | Spark + Python Glue Jobs working end-to-end |
| Sprint 3 | 28-Oct-2025 to 31-Oct-2025 (4 Days) | Automation, Logging & Validation | Airflow to Glue orchestration validated, DynamoDB populated |

---