# 🧩 Team 4 – DATA FORGE  
## Signal Stream – Real-Time Data Transformation & Streaming Dashboard

---

## 🧭 Overview
This project demonstrates **real-time data transformation** using **Spark Streaming on AWS Glue**, ingesting from **AWS Kinesis Data Streams**, transforming and writing to **S3**, and visualizing insights through a **Streamlit dashboard deployed on ECS**.  
The workflow simulates a telecommunications company analyzing network metrics like signal strength, GPS precision, and network status in near real-time.

---

## ⚙️ Tech Stack Used
- **AWS Kinesis Data Streams** – Real-time ingestion source  
- **AWS Glue (Spark Streaming)** – Real-time data transformation engine  
- **AWS S3** – Data Lake for storing transformed metrics  
- **AWS Glue Crawlers** – Automatic schema detection and catalog creation  
- **Amazon Athena** – SQL-based querying over S3 data  
- **Streamlit** – Real-time visualization layer  
- **Amazon ECS (Fargate)** – Containerized dashboard deployment  
- **Docker** – Local testing and containerization  
- **IAM & CloudWatch** – Security, monitoring, and logging

---

## 🗃 Datasets & Data Flow
- **Source Data:** `mobile_logs.csv` containing signal, network, GPS, operator, and timestamp data.  
- **Streaming Source:** AWS Kinesis Stream → `mobile_coverage_logs`  
- **Intermediate Storage:** Transformed data is written to S3 as Snappy-compressed Parquet files, partitioned by `hour`.  
- **Analytics Layer:** Glue Crawlers create tables in Data Catalog → queried via Athena.  
- **Dashboard:** Streamlit app reads from Athena → hosted on ECS Fargate.

---

## 🧩 Step-by-Step Plan

### 1️⃣ Spark Streaming Job on AWS Glue
- **Script:** The Spark Streaming job reads from the **Kinesis Data Stream** (`mobile_coverage_logs`) using `glueContext.create_data_frame_from_options`.  
- Defines schema using **StructType** and **StructField** for ~15 attributes.  
- Reads JSON payloads → applies schema → converts timestamps → extracts `hour` as partition key.  
- Applies a **10-minute watermark** so that late-arriving records are still aggregated within a bounded delay.  
- Performs real-time aggregations:
  - Average Signal Strength by Operator  
  - Average GPS Precision by Provider  
  - Count of Network Status by Postal Code  
- Writes partitioned outputs to S3 through `writeStream` every 20 seconds (`trigger(processingTime="20 seconds")`).  
- Output folders on S3:  
  - `/aggregations/signal_strength_by_operator/`  
  - `/aggregations/gps_precision_by_provider/`  
  - `/aggregations/status_count/`  
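
The steps above can be sketched for one of the three aggregations as follows. This is a minimal sketch under stated assumptions, not the team's actual script: the column names `timestamp`, `operator`, and `signal`, the one-minute window, and the checkpoint location are all hypothetical, and `events` is taken to be a streaming DataFrame already parsed from the Kinesis JSON payloads. PySpark is imported inside the function so the path helper can be used without a Spark installation.

```python
def output_path(bucket: str, aggregation: str) -> str:
    """Build the S3 folder for one aggregation, e.g. .../aggregations/status_count/."""
    return f"s3://{bucket}/aggregations/{aggregation}/"


def write_signal_strength_by_operator(events, bucket: str, window: str = "1 minute"):
    """Start a streaming query that averages signal strength per operator.

    `events` is assumed to be a streaming DataFrame with the hypothetical
    columns `timestamp` (string), `operator` (string) and `signal` (numeric).
    """
    from pyspark.sql import functions as F  # provided by the Glue runtime

    aggregated = (
        events
        .withColumn("event_time", F.to_timestamp("timestamp"))
        # Rows more than 10 minutes behind the latest event time may be dropped.
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", window), F.col("operator"))
        .agg(F.avg("signal").alias("avg_signal"))
        # Flatten the window struct and derive the hour used as partition key.
        .select(
            F.col("window.start").alias("window_start"),
            F.hour(F.col("window.start")).alias("hour"),
            "operator",
            "avg_signal",
        )
    )

    path = output_path(bucket, "signal_strength_by_operator")
    return (
        aggregated.writeStream
        .format("parquet")
        .outputMode("append")  # the file sink only supports append mode
        .option("path", path)
        # Hypothetical checkpoint folder next to the aggregation outputs.
        .option("checkpointLocation", path.replace("/aggregations/", "/checkpoints/"))
        .partitionBy("hour")
        .trigger(processingTime="20 seconds")
        .start()
    )
```

Grouping on an event-time window together with the watermark is what lets Spark emit finalized rows in append mode; the other two aggregations would follow the same shape with a different grouping column and aggregate.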

#### Deployment on AWS Glue
- Job Name: `etl_streaming_mobile_logs`  
- Engine: Spark Streaming | Version: Glue 4.0 | Workers: 2  
- IAM Role: Custom Glue Role with permissions – `AmazonS3FullAccess`, `AmazonKinesisFullAccess`, `CloudWatchLogsFullAccess`  
- Run continuously (no timeout).
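
For reference, the same job settings could be expressed as a Boto3 `create_job` call. This is a sketch, assuming the job is created programmatically rather than through the console; the `G.1X` worker type is an assumption (it is not stated above), and the role ARN and script location are placeholders supplied by the caller.

```python
def glue_streaming_job_config(role_arn: str, script_location: str) -> dict:
    """Keyword arguments for boto3's glue.create_job, mirroring the settings above."""
    return {
        "Name": "etl_streaming_mobile_logs",
        "Role": role_arn,
        "Command": {
            "Name": "gluestreaming",  # job type used for Spark Streaming jobs
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",  # assumption: worker type is not given in this README
        "NumberOfWorkers": 2,
    }


def create_job(role_arn: str, script_location: str) -> dict:
    """Create the job in the account configured for boto3."""
    import boto3  # imported lazily so the config helper works without AWS access

    glue = boto3.client("glue")
    return glue.create_job(**glue_streaming_job_config(role_arn, script_location))
```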

---

### 2️⃣ Glue Crawlers & Athena Configuration
- Create 3 crawlers to catalog transformed data:  
  1. **crawler_gps_precision_by_provider**  
  2. **crawler_signal_strength_by_operator**  
  3. **crawler_status_count**
- Each crawler targets the S3 path of the matching Spark output folder.  
- Database: `mobile_network_aggregations`  
- Schedule: On-Demand  
- Result: 3 tables auto-created in **Glue Data Catalog**.  
- Validate via Athena queries:  
  ```sql
  SELECT * FROM gps_precision_by_provider LIMIT 10;
  SELECT * FROM signal_strength_by_operator LIMIT 10;
  SELECT * FROM status_count LIMIT 10;
  ```
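
The three crawlers differ only in name and S3 path, so they could also be created in a loop. The sketch below assumes Boto3 is used instead of the console; the bucket name and role ARN are placeholders, and leaving out the `Schedule` key keeps each crawler on-demand.

```python
DATABASE = "mobile_network_aggregations"
AGGREGATIONS = [
    "gps_precision_by_provider",
    "signal_strength_by_operator",
    "status_count",
]


def crawler_configs(bucket: str, role_arn: str) -> list[dict]:
    """One glue.create_crawler keyword-argument dict per aggregation folder."""
    return [
        {
            "Name": f"crawler_{name}",
            "Role": role_arn,
            "DatabaseName": DATABASE,
            "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/aggregations/{name}/"}]},
            # No "Schedule" key: the crawler only runs when started explicitly.
        }
        for name in AGGREGATIONS
    ]


def create_and_run_crawlers(bucket: str, role_arn: str) -> None:
    """Create each crawler and start it once."""
    import boto3  # imported lazily so crawler_configs works without AWS access

    glue = boto3.client("glue")
    for config in crawler_configs(bucket, role_arn):
        glue.create_crawler(**config)
        glue.start_crawler(Name=config["Name"])
```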

---

### 3️⃣ Streamlit App – Local & Docker Deployment
#### Local App Structure
- Folder: `docker_streamlit/` containing:
  - `app.py` – Streamlit application logic.  
  - `requirements.txt` – Dependencies.  
  - `Dockerfile` – Container image definition.  
- The app connects to Athena through a Boto3 client and queries the tables dynamically.  
- Displays metrics as tables that auto-refresh every 600 seconds (10 minutes).
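
One way the 600-second refresh could be structured is shown below. This is an illustrative sketch rather than the contents of `app.py`: the page title is a placeholder, `fetch_table` is a hypothetical callable returning the rows of one Athena table, and the `streamlit` module is passed in as `st` so the function can be exercised without a running Streamlit server.

```python
import time

TABLES = ["gps_precision_by_provider", "signal_strength_by_operator", "status_count"]
REFRESH_SECONDS = 600


def render_dashboard(st, fetch_table, sleep=time.sleep):
    """Draw one table per aggregation, wait, then ask Streamlit to rerun the script.

    `st` is the streamlit module; `fetch_table(name)` is a hypothetical callable
    that returns the rows of one Athena table.
    """
    st.title("Mobile Network Metrics")  # placeholder title
    for name in TABLES:
        st.subheader(name)
        st.dataframe(fetch_table(name))
    st.caption(f"Refreshes every {REFRESH_SECONDS} seconds")
    sleep(REFRESH_SECONDS)
    st.rerun()  # available in recent Streamlit releases
```

Inside `app.py` this would be called as `render_dashboard(st, fetch_table)` after `import streamlit as st`.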

#### Example Code (app.py)
```python
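def query_athena(client, query):
    """Start an Athena query and return its execution ID."""
    # `client` is a Boto3 Athena client, e.g. boto3.client('athena').
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': 'mobile_network_aggregations'},
        ResultConfiguration={'OutputLocation': 's3://<bucket>/athena_output/'}
    )
    # The ID is needed to poll for completion and to fetch the results.
    return response['QueryExecutionId']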
```
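
Starting a query does not wait for it to finish, so the app also has to poll Athena and then read the results. The sketch below shows one way to do that, assuming the execution ID returned by `start_query_execution` is available; the helper names are hypothetical, and only the first page of results is read.

```python
import time


def wait_for_query(client, query_execution_id, poll_seconds=1.0, sleep=time.sleep):
    """Poll Athena until the query leaves QUEUED/RUNNING; return the final state."""
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state not in ("QUEUED", "RUNNING"):
            return state  # SUCCEEDED, FAILED or CANCELLED
        sleep(poll_seconds)


def rows_to_dicts(results):
    """Convert a get_query_results response into a list of dicts keyed by column name."""
    rows = results["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT queries the first row holds the column headers.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def fetch_rows(client, query_execution_id):
    """Wait for the query, then return the first page of results as dicts."""
    state = wait_for_query(client, query_execution_id)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return rows_to_dicts(client.get_query_results(QueryExecutionId=query_execution_id))
```

Athena returns every value as a string (or omits it for NULL), so numeric columns would need converting before charting.
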
#### Build & Run Locally
```bash
cd docker_streamlit
docker build -t mobile_signal_app .
docker run -p 8501:8501 -v ~/.aws:/root/.aws mobile_signal_app
```
- Access locally via: **http://localhost:8501**

---

### 4️⃣ ECR & ECS Deployment
#### Step 1: Push Image to ECR
1. Create repository: `streamlit_app`
2. Authenticate and push:
   ```bash
   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.us-east-1.amazonaws.com
   docker tag mobile_signal_app <account>.dkr.ecr.us-east-1.amazonaws.com/streamlit_app:latest
   docker push <account>.dkr.ecr.us-east-1.amazonaws.com/streamlit_app:latest
   ```

#### Step 2: Create ECS Cluster & Task Definition
- **Cluster Name:** `streamlit_deployments`  
- **Task Definition:** `streamlit_mobile_metrics`  
- Launch Type: Fargate | OS: Linux ARM64  
- Resources: 4 vCPU | 10 GB Memory  
- IAM Role: `custom_streamlit_ecs_role` with:
  - `AmazonAthenaFullAccess`
  - `AmazonS3FullAccess`
  - `CloudWatchLogsFullAccess`
- Container Name: `streamlit_mobile_container`
- Image URI: `<ECR repo>/streamlit_app:latest`
- Port Mapping: 8501
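
The task definition above maps onto Boto3's `register_task_definition` roughly as follows. This is a sketch, assuming programmatic registration; the README names a single role, so whether it serves as the task role, the execution role, or both is an assumption left to the caller, and the image URI is a placeholder.

```python
def task_definition_config(image_uri: str, task_role_arn: str, execution_role_arn: str) -> dict:
    """Keyword arguments for boto3's ecs.register_task_definition."""
    return {
        "family": "streamlit_mobile_metrics",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # required for Fargate tasks
        "cpu": "4096",      # 4 vCPU, expressed in CPU units
        "memory": "10240",  # 10 GB, expressed in MiB
        "runtimePlatform": {
            "operatingSystemFamily": "LINUX",
            "cpuArchitecture": "ARM64",
        },
        "taskRoleArn": task_role_arn,            # used by the app to call Athena and S3
        "executionRoleArn": execution_role_arn,  # used by ECS to pull the image and write logs
        "containerDefinitions": [
            {
                "name": "streamlit_mobile_container",
                "image": image_uri,
                "essential": True,
                "portMappings": [{"containerPort": 8501, "protocol": "tcp"}],
            }
        ],
    }
```

It could then be registered with `boto3.client("ecs").register_task_definition(**task_definition_config(...))`. Because the platform is ARM64, the image pushed to ECR needs to be built for that architecture.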

#### Step 3: Run Task
- Deploy via ECS Console → select cluster → **Run Task**
- Once the task status is `Running`, open **Logs → External URL** to launch the Streamlit dashboard.
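
The console's **Run Task** action has a Boto3 equivalent, sketched below. The subnet and security group IDs are placeholders that depend on the VPC in use, and assigning a public IP is only one possible way to make port 8501 reachable.

```python
def run_task_config(subnet_ids: list[str], security_group_ids: list[str]) -> dict:
    """Keyword arguments for boto3's ecs.run_task on Fargate."""
    return {
        "cluster": "streamlit_deployments",
        "taskDefinition": "streamlit_mobile_metrics",  # resolves to the latest active revision
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnet_ids,
                "securityGroups": security_group_ids,
                # Assumption: the dashboard is reached over a public IP.
                "assignPublicIp": "ENABLED",
            }
        },
    }
```

The task would be started with `boto3.client("ecs").run_task(**run_task_config([...], [...]))`; the security group must allow inbound traffic on port 8501.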

---

### 5️⃣ Validation
1. Ingest data via Kinesis producer script (`kinesis_producer.py`) locally:
   ```bash
   python3 kinesis_producer.py
   ```
2. Verify that the Glue Spark job status is `Running`.  
3. Monitor real-time updates in the Streamlit dashboard hosted on ECS.  
4. The dashboard refreshes every 600 seconds with the latest Athena query results.
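
For orientation, a producer along the lines of `kinesis_producer.py` might look like the sketch below. The actual script is not shown in this README, so this is an assumption about its shape: it sends each CSV row as a JSON record, and the choice of `operator` as partition key is hypothetical.

```python
import csv
import json
import time


def send_rows(client, csv_file, stream_name="mobile_coverage_logs",
              partition_column="operator", delay_seconds=0.0):
    """Send each CSV row to Kinesis as a JSON record; return the number sent."""
    sent = 0
    for row in csv.DictReader(csv_file):
        client.put_record(
            StreamName=stream_name,
            Data=json.dumps(row).encode("utf-8"),
            # Records sharing a partition key are routed to the same shard.
            # Fall back to the row counter if the column is missing or empty.
            PartitionKey=str(row.get(partition_column) or sent),
        )
        sent += 1
        if delay_seconds:
            time.sleep(delay_seconds)  # throttle to imitate a live feed
    return sent
```

With AWS credentials configured, it could be driven by opening `mobile_logs.csv` with `newline=""` and passing the file object together with `boto3.client("kinesis")`.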

---

## 📆 Sprint Alignment

| **Sprint** | **Duration** | **Focus Area** | **Key Deliverables** |
|-------------|---------------|----------------|----------------------|
| Sprint 1 | 16-Oct-2025 to 18-Oct-2025 (3 Days) | Spark Streaming Setup & Kinesis Integration | Kinesis Stream, Glue Streaming Job, Schema Validation |
| Sprint 2 | 23-Oct-2025 to 27-Oct-2025 (5 Days) | Crawlers, Athena & Streamlit App | Glue Crawlers, Data Catalog, Streamlit Local App |
| Sprint 3 | 28-Oct-2025 to 31-Oct-2025 (4 Days) | ECS Deployment & Dashboard Validation | ECS Deployment, Dashboard Auto-Refresh Integration |

---