# 📰 AI-Powered News Aggregator – Project Breakdown

## 📌 Summary
This project is a self-contained news aggregator for Artificial Intelligence coverage. It scrapes articles, converts them to semantic embeddings, summarizes them, and generates audio for passive listening. Designed as a learning project for a solo developer, it emphasizes hands-on practice with data pipelines, NLP, and AI agent orchestration.

---

## 🧱 Implementation Steps

### 🛠️ 1. Project Setup
- [ ] Set up Python environment (venv or Docker).
- [ ] Organize folder structure:
    - `/ingestion`
    - `/processing`
    - `/storage`
    - `/ai`
    - `/orchestration`
    - `/audio`
    - `/interface`


---

### 🌐 2. Data Ingestion: Web Scraping
- [ ] Use **Playwright** to scrape JavaScript-heavy news sites.
- [ ] Use **BeautifulSoup + lxml** for parsing.
- [ ] Extract:
    - Title
    - Author (optional)
    - Publication date
    - Main content
- [ ] Store both `content_raw` and `content_cleaned`.
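
The extraction and storage fields above can be sketched as follows. This is a minimal, standard-library-only stand-in built on `html.parser`, not the Playwright + BeautifulSoup stack the plan calls for; the names `ScrapedArticle` and `parse_article` are hypothetical, and author and publication date are left out because their selectors are site-specific. It assumes reasonably well-formed markup with closed `<p>` tags.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ScrapedArticle:
    title: str
    content_raw: str      # original HTML, kept so cleaning can be re-run later
    content_cleaned: str  # visible paragraph text only


class _ArticleParser(HTMLParser):
    """Collects <title> text and the text inside <p> elements."""

    TRACKED = {"title", "p", "script", "style"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._stack = []    # open tags we care about
        self._current = []  # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._stack.append(tag)
            if tag == "p":
                self._current = []

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            if tag == "p":
                # Collapse whitespace runs inside the paragraph.
                text = " ".join("".join(self._current).split())
                if text:
                    self.paragraphs.append(text)

    def handle_data(self, data):
        if not self._stack:
            return
        top = self._stack[-1]
        if top == "title":
            self.title_parts.append(data)
        elif top == "p":
            self._current.append(data)
        # Data inside <script>/<style> is deliberately ignored.


def parse_article(html: str) -> ScrapedArticle:
    parser = _ArticleParser()
    parser.feed(html)
    parser.close()
    return ScrapedArticle(
        title="".join(parser.title_parts).strip(),
        content_raw=html,
        content_cleaned="\n\n".join(parser.paragraphs),
    )
```

Keeping `content_raw` alongside `content_cleaned` means the cleaning rules can change later without re-scraping the site.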

#### ✅ Ethical Practices
- [ ] Respect `robots.txt`
- [ ] Apply rate limiting
- [ ] Avoid reproducing copyrighted material verbatim
- [ ] Attribute and link to the original source when presenting summaries
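
As a rough illustration of the first two practices, the sketch below checks a URL against `robots.txt` rules with the standard library's `urllib.robotparser` and enforces a minimum delay between requests. The user-agent string and the `RateLimiter` class are assumptions made for this example; the clock and sleep functions are injectable only so the behavior can be tested without real waiting.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ai-news-aggregator-bot"  # hypothetical bot name


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Ensures at least `min_interval_s` seconds between consecutive calls."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

A scraper loop would call `rp.can_fetch(USER_AGENT, url)` before each request and `limiter.wait()` between requests.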

---

### 💾 3. Data Storage
- [ ] Use **PostgreSQL + pgvector**
- [ ] Create tables:

#### `articles`
- `article_id`
- `url`
- `title`
- `author`
- `publication_date`
- `content_raw`
- `content_cleaned`
- `embedding_vector`
- `summary_text`
- `audio_path`
- `created_at`, `updated_at`

#### `user_reading_history`
- `interaction_id`
- `user_id`
- `article_id`
- `last_accessed_timestamp`
- `read_progress_seconds`
- `is_read_complete`
- `created_at`, `updated_at`
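
One possible shape for these two tables is sketched below as SQL held in Python strings. The column names come from the lists above; the column types, constraints, the 384-dimension default (which matches `all-MiniLM-L6-v2`), and the helper `to_pgvector_literal` are assumptions for illustration. `<=>` is pgvector's cosine-distance operator.

```python
EMBEDDING_DIM = 384  # assumption: matches all-MiniLM-L6-v2; adjust per model

SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS articles (
    article_id        BIGSERIAL PRIMARY KEY,
    url               TEXT NOT NULL UNIQUE,
    title             TEXT NOT NULL,
    author            TEXT,
    publication_date  TIMESTAMPTZ,
    content_raw       TEXT,
    content_cleaned   TEXT,
    embedding_vector  vector({EMBEDDING_DIM}),
    summary_text      TEXT,
    audio_path        TEXT,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS user_reading_history (
    interaction_id           BIGSERIAL PRIMARY KEY,
    user_id                  TEXT NOT NULL,
    article_id               BIGINT NOT NULL REFERENCES articles(article_id),
    last_accessed_timestamp  TIMESTAMPTZ,
    read_progress_seconds    INTEGER NOT NULL DEFAULT 0,
    is_read_complete         BOOLEAN NOT NULL DEFAULT FALSE,
    created_at               TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at               TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

# Nearest-neighbour lookup by cosine distance; parameters are the query
# vector (as a pgvector text literal) and the number of rows to return.
NEAREST_SQL = """
SELECT article_id, title
FROM articles
ORDER BY embedding_vector <=> %s::vector
LIMIT %s;
"""


def to_pgvector_literal(vec) -> str:
    """Format a sequence of numbers as pgvector's text form, e.g. '[1.0,2.5]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"
```

With a DB-API driver, `SCHEMA_SQL` would be executed once at setup and `NEAREST_SQL` would be run with `(to_pgvector_literal(query_vec), k)` as parameters.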

---

### 🤖 4. AI/ML Services

#### a. Text Embedding
- [ ] Choose a model:
    - `Gemini Embedding` (API-based, up to 3072-dimensional vectors)
    - `SentenceTransformers` (runs locally; e.g., `all-MiniLM-L6-v2`, 384-dimensional vectors)
- [ ] Generate embeddings from cleaned content
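
To make the purpose of the embeddings concrete, here is a small pure-Python sketch of ranking articles by cosine similarity to a query vector. The function names are hypothetical and the vectors in any real run would come from the chosen embedding model; in production the same ranking would more likely be delegated to pgvector.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, article_vecs, top_k=3):
    """Return up to top_k (article_id, score) pairs, most similar first.

    `article_vecs` maps an article id to its embedding vector.
    """
    scored = [
        (article_id, cosine_similarity(query_vec, vec))
        for article_id, vec in article_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For example, with toy two-dimensional vectors `{"a": [1, 0], "b": [0, 1], "c": [1, 1]}` and the query `[1, 0]`, article `a` ranks first, then `c`, then `b`.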

#### b. Summarization
- [ ] Use **abstractive summarization**
- [ ] Use Hugging Face (`pipeline("summarization")`)
- [ ] Prefer a sequence-to-sequence model such as `T5`
- [ ] Generate summaries for unread/missed articles
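
Summarization models generally accept only a bounded input length, so long articles usually need to be split first. The sketch below shows one simple approach: summarize word-based chunks, then summarize the joined partial summaries. The `summarize` argument is a placeholder callable (an assumption of this example) that would wrap whichever model is chosen; the 400-word default is arbitrary, and word counts are only a rough proxy for model tokens.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_long(
    text: str,
    summarize: Callable[[str], str],
    max_words: int = 400,
) -> str:
    """Two-pass summary: per-chunk summaries, then a summary of those."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # Second pass condenses the partial summaries into one.
    return summarize(" ".join(partials))
```

Passing the model in as a callable also makes the function easy to unit test with a fake summarizer.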

#### c. Text-to-Speech (TTS)
- [ ] Choose a model:
    - Lightweight: `Kokoro`, `Dia`, `Chatterbox`
    - Realistic: `Orpheus`, `Sesame CSM`
    - Simple: `eSpeak` (fallback)
- [ ] Store audio file path in `audio_path`
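
A minimal sketch of the fallback path and the `audio_path` bookkeeping follows. It assumes the `espeak` command-line tool is installed and uses its `-w` option to write a WAV file; the directory layout and function names are hypothetical, and the neural TTS models listed above would need their own wrappers.

```python
import shutil
import subprocess
from pathlib import Path


def audio_path_for(article_id: int, base_dir: str = "audio") -> Path:
    """Deterministic file location, suitable for storing in `audio_path`."""
    return Path(base_dir) / f"article_{article_id}.wav"


def espeak_command(text: str, out_path: Path) -> list:
    """Build the eSpeak invocation that writes speech for `text` to a WAV file."""
    return ["espeak", "-w", str(out_path), text]


def synthesize_with_espeak(text: str, out_path: Path) -> bool:
    """Run eSpeak if it is on PATH; return False when it is unavailable."""
    if shutil.which("espeak") is None:
        return False
    out_path.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(espeak_command(text, out_path), check=True)
    return True
```

Deriving the path from `article_id` keeps regeneration idempotent: re-running TTS for an article overwrites the same file.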

---

### 🧠 5. AI Agent Orchestration (LangChain)
- [ ] Use **LangChain** for:
    - Deciding when to summarize
    - Running summary + TTS in sequence
    - Integrating with vector DB for search
- [ ] Enable dynamic workflows:
    - Check missed days
    - Retrieve relevant articles
    - Summarize and convert to audio
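
The workflow above can be sketched in plain Python before any agent framework is involved. This is not LangChain code: `catch_up` and its three injected callables (`fetch_articles`, `summarize`, `synthesize`) are hypothetical names that stand in for the storage, summarization, and TTS steps, each of which could later be exposed to an agent as a tool.

```python
from datetime import date, timedelta


def missed_days(last_read: date, today: date) -> list:
    """Dates strictly between the last read day and today."""
    gap = (today - last_read).days
    return [last_read + timedelta(days=i) for i in range(1, gap)]


def catch_up(last_read, today, fetch_articles, summarize, synthesize):
    """For each missed day: retrieve articles, summarize, convert to audio.

    fetch_articles(day)          -> list of dicts with 'article_id' and 'content_cleaned'
    summarize(text)              -> summary string
    synthesize(article_id, text) -> path of the generated audio file
    """
    results = []
    for day in missed_days(last_read, today):
        for article in fetch_articles(day):
            summary = summarize(article["content_cleaned"])
            audio_path = synthesize(article["article_id"], summary)
            results.append(
                {
                    "article_id": article["article_id"],
                    "summary_text": summary,
                    "audio_path": audio_path,
                }
            )
    return results
```

Starting with a deterministic function like this makes it easier to judge later whether an agent deciding the steps dynamically adds value.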

---

### 🧪 6. Testing
- [ ] Unit test:
    - Scraper
    - Parser
    - Embedding generator
    - Summarizer
    - TTS generator
- [ ] Integration test: end-to-end flow (scrape → store → embed → summarize → TTS)
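
As a starting point for the unit tests, the sketch below uses the standard library's `unittest` against a deliberately tiny, hypothetical `clean_text` helper. The real parser and cleaner will differ; the pattern of small, deterministic test cases is the part meant to carry over.

```python
import unittest


def clean_text(raw: str) -> str:
    """Hypothetical cleaner: collapse all whitespace runs to single spaces."""
    return " ".join(raw.split())


class CleanTextTests(unittest.TestCase):
    def test_collapses_whitespace(self):
        self.assertEqual(clean_text("  AI \n news\t today "), "AI news today")

    def test_empty_input(self):
        self.assertEqual(clean_text("   "), "")


def run_tests() -> unittest.TestResult:
    """Run the suite programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Components that call external models (embedding, summarization, TTS) are easiest to test by injecting fakes in place of the model calls.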

---

### 🎧 7. Optional User Interface
- [ ] Build CLI or Web UI for:
    - Playback
    - Reading history
    - Missed article notifications

---

## 🧠 Skills & Tools You'll Practice
- Web scraping with Playwright & BeautifulSoup
- PostgreSQL + pgvector
- Embeddings & Semantic Search
- Abstractive Summarization with Hugging Face
- Text-to-Speech synthesis
- AI agent orchestration via LangChain
- Ethical scraping & data handling

---

## ✅ Architecture Recommendation
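Use a **Kappa architecture**:
- A single stream-processing pipeline handles both new and historical data
- Reprocessing is done by replaying the event log, with no separate batch layer
- Only one code path to maintain, which keeps the setup simple for solo development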
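
The replayable-log idea can be illustrated at toy scale with an append-only JSON Lines file. This is only a sketch under that assumption (a real deployment might use a message broker instead); `append_event`, `replay`, and `count_by_source` are hypothetical names, and the `source` field is an invented example attribute.

```python
import json
from pathlib import Path


def append_event(log_path, event: dict) -> None:
    """Append one event to the log as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def replay(log_path, handler, state=None):
    """Rebuild state by feeding every logged event through `handler`, in order."""
    state = {} if state is None else state
    path = Path(log_path)
    if not path.exists():
        return state
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                handler(state, json.loads(line))
    return state


def count_by_source(state: dict, event: dict) -> None:
    """Example handler: count scraped articles per source site."""
    source = event["source"]
    state[source] = state.get(source, 0) + 1
```

Because derived data is always a function of the log, changing a handler (for instance, switching embedding models) only requires replaying the log through the new version.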

---

## 📁 Suggested File Structure
    /ingestion # Playwright scrapers
    /processing # Cleaners & parsers
    /storage # DB models and logic
    /ai/embedding # Embedding generators
    /ai/summarization # Hugging Face models
    /ai/tts # Audio generation
    /orchestration # LangChain logic
    /interface # CLI or Web UI
    /config # API keys, db config

    
---

## 📅 Next Steps
1. [ ] Set up scraper for 1 AI news site
2. [ ] Create database and tables
3. [ ] Generate embeddings and summaries
4. [ ] Add audio conversion
5. [ ] Connect LangChain agent
6. [ ] Optional: build interface for consumption

---

## 📘 References
Check original documentation for citations and detailed explanations.
