<a href="https://colab.research.google.com/github/jhryals/el-roi-intelligence-triage-system/blob/main/data_ingestion/rss_ingestion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [15]:
# ==============================================================
# 📂 Google Drive Integration for EL ROI
# ==============================================================

from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Define base project path inside Google Drive
PROJECT_PATH = "/content/drive/MyDrive/el-roi"
DATA_PATH = os.path.join(PROJECT_PATH, "data")

# Create folders if they don't exist
os.makedirs(DATA_PATH, exist_ok=True)

print(f"✅ Project directory set to: {PROJECT_PATH}")
print(f"✅ Data directory set to: {DATA_PATH}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Project directory set to: /content/drive/MyDrive/el-roi
✅ Data directory set to: /content/drive/MyDrive/el-roi/data


In [16]:
# ==============================================================
# 📦 MODULE SETUP: Install Required Packages
# ==============================================================
# Run this cell before any cell that imports feedparser.
# Colab resets its environment on new sessions, so packages
# installed here persist only for the current runtime.

!pip install feedparser



In [17]:
# STEP 1: Import required libraries
import feedparser  # For parsing RSS
import pandas as pd
from datetime import datetime

# STEP 2: Define Spanish-language RSS feeds
SPANISH_FEEDS = {
    "El País": "https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/section/internacional/portada",
    "BBC Mundo": "https://feeds.bbci.co.uk/mundo/rss.xml",
    "RTVE": "https://www.rtve.es/rss/noticias.xml",
    "El Mundo (Internacional)": "https://e00-elmundo.uecdn.es/elmundo/rss/internacional.xml",
    "Infobae América": "https://www.infobae.com/america/rss/",
    "20 Minutos (España)": "https://www.20minutos.es/rss/internacional/"
}

# STEP 3: RSS parsing function
def parse_rss_feed(feed_url, source_name):
    """
    Parses an RSS feed and returns a list of articles as dictionaries.
    Each article includes: source, title, link, published date, and summary.
    """
    feed = feedparser.parse(feed_url)
    articles = []

    for entry in feed.entries:
        article = {
            "source": source_name,
            "title": entry.get("title", "").strip(),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", "").strip()
        }
        articles.append(article)

    return articles

# STEP 4: Ingest from all feeds
def collect_all_articles(feed_dict):
    """
    Collects and aggregates articles from all defined RSS feeds.
    Returns a Pandas DataFrame.
    """
    all_articles = []

    for source, url in feed_dict.items():
        #print(f"📡 Fetching from: {source}")
        articles = parse_rss_feed(url, source)
        all_articles.extend(articles)

    df = pd.DataFrame(all_articles)

    # Optional: Convert published column to datetime (if format is valid)
    if 'published' in df.columns:
        df['published'] = pd.to_datetime(df['published'], errors='coerce')

    return df

# STEP 5: Run the ingestion and preview the results
df_articles = collect_all_articles(SPANISH_FEEDS)

print(f"✅ Ingested {len(df_articles)} articles.")
df_articles.head()



✅ Ingested 145 articles.


| | source | title | link | published | summary |
|---|---|---|---|---|---|
| 0 | El País | Trump apura el plazo que dio a Putin para dete... | https://elpais.com/internacional/2025-08-06/el... | 2025-08-06 16:25:49 | El enviado de la Casa Blanca se ve con Putin e... |
| 1 | El País | Estados Unidos dobla los aranceles a la India ... | https://elpais.com/internacional/2025-08-06/ee... | 2025-08-06 14:43:44 | Las exportaciones indias estarán sujetas a un ... |
| 2 | El País | La detención de Bolsonaro tensa las negociacio... | https://elpais.com/america/2025-08-06/la-deten... | 2025-08-06 03:45:01 | El Gobierno de Lula se plantea incluir la expo... |
| 3 | El País | La sumisión de Bruselas a Trump da alas a la e... | https://elpais.com/internacional/2025-08-03/la... | 2025-08-03 03:40:00 | Los aliados del estadounidense en Europa logra... |
| 4 | El País | Trump ‘hackea’ el sistema económico internacional | https://elpais.com/internacional/2025-08-01/tr... | 2025-08-01 16:33:18 | La subida de barreras al comercio más intensa ... |


In [18]:
import os

# Ensure the Drive-backed data directory exists (DATA_PATH is defined in the first cell)
os.makedirs(DATA_PATH, exist_ok=True)

# Save raw articles to JSONL for downstream modules
raw_articles_path = os.path.join(DATA_PATH, "raw_articles.jsonl")
df_articles.to_json(raw_articles_path, orient="records", lines=True, force_ascii=False)

print(f"✅ Saved raw articles to {raw_articles_path}")

✅ Saved raw articles to /content/drive/MyDrive/el-roi/data/raw_articles.jsonl


## 📌 Phase 2 Expansion Plan – RSS Ingestion Enhancements

The following enhancements are planned for post-MVP scaling of the `rss_ingestion` module:

### 🌍 1. Expand Language Coverage
- Add additional RSS feeds in English, Portuguese, Russian, Chinese, and Arabic.
- Store multilingual feed sources in an external config file (`.json`, `.yaml`, or `.csv`) for easy updates and management.
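
A minimal sketch of how an external JSON config might be loaded, assuming a file that groups feeds by language code. The file layout, the `load_feed_config` and `feeds_for_language` names, and the `language` field are hypothetical choices for illustration, not part of the current module.

```python
import json


def load_feed_config(config_path):
    """Load feeds from a JSON file shaped like {language: {source_name: url}}.

    Returns a flat dict: {source_name: {"url": ..., "language": ...}}.
    The file layout is an assumed convention, not an existing project format.
    """
    with open(config_path, encoding="utf-8") as fh:
        config = json.load(fh)

    feeds = {}
    for language, sources in config.items():
        for source_name, url in sources.items():
            feeds[source_name] = {"url": url, "language": language}
    return feeds


def feeds_for_language(feeds, language):
    """Return a {source_name: url} dict in the shape collect_all_articles() expects."""
    return {
        name: meta["url"]
        for name, meta in feeds.items()
        if meta["language"] == language
    }
```

With a config whose `"es"` key holds the entries currently hard-coded in `SPANISH_FEEDS`, `collect_all_articles(feeds_for_language(load_feed_config(path), "es"))` would reproduce the present Spanish-only run.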

### 🔁 2. Refresh Scheduling
- Enable time-based feed refresh (e.g., every 3–6 hours) using a scheduled Colab run or, in production, a cron job.
- Support delta ingestion by tracking already-processed articles (e.g., by GUID or timestamp).
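
Delta ingestion could be sketched as follows, assuming each article dict carries an `id` key holding the entry GUID (feedparser exposes it as `entry.id`; the current `parse_rss_feed` does not capture it yet) and falling back to `link` otherwise. The function names and the JSON state file are hypothetical.

```python
import json
import os


def load_seen_ids(path):
    """Read the set of already-processed article keys, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh))


def save_seen_ids(path, seen_ids):
    """Persist the processed keys as a sorted JSON list."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sorted(seen_ids), fh, ensure_ascii=False)


def article_key(article):
    """Prefer the GUID ("id"); fall back to the link when no GUID was captured."""
    return article.get("id") or article.get("link", "")


def filter_new_articles(articles, seen_ids):
    """Return only unseen articles and record their keys in seen_ids (mutated in place).

    Articles with no usable key are kept, since they cannot be deduplicated.
    """
    new_articles = []
    for article in articles:
        key = article_key(article)
        if key and key in seen_ids:
            continue
        if key:
            seen_ids.add(key)
        new_articles.append(article)
    return new_articles
```

A run would call `load_seen_ids`, pass each batch through `filter_new_articles`, and finish with `save_seen_ids` so the next scheduled run skips what was already ingested.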

### 💾 3. Persistent Storage
- Extend the current `raw_articles.jsonl` output with `.csv` or `.parquet` exports for archival, reprocessing, or handoff to analysts.

### 🧪 4. Source Health Monitoring
- Log failed or empty feeds.
- Track feed reliability and freshness across multiple runs.
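
One possible shape for the health check, assuming it inspects the dict-like object returned by `feedparser.parse()`: `bozo` flags a feed that did not parse cleanly, `status` is present only for feeds fetched over HTTP, and `bozo_exception` carries the underlying error. The four health labels and the function names are illustrative choices.

```python
import logging

logger = logging.getLogger("rss_ingestion")


def classify_feed_health(feed):
    """Classify a parsed feed as 'ok', 'degraded', 'empty', or 'failed'.

    The labels are an assumed convention for this sketch.
    """
    entries = feed.get("entries", [])
    status = feed.get("status")                # absent when no HTTP response was received
    malformed = bool(feed.get("bozo", False))  # set by feedparser on parse problems

    if status is not None and status >= 400:
        return "failed"
    if not entries:
        return "failed" if malformed else "empty"
    return "degraded" if malformed else "ok"


def log_feed_health(source_name, feed):
    """Log the health of one feed and return the label for later aggregation."""
    health = classify_feed_health(feed)
    if health == "ok":
        logger.info("Feed %s is ok (%d entries)", source_name, len(feed.get("entries", [])))
    else:
        logger.warning(
            "Feed %s is %s (status=%s, error=%r)",
            source_name, health, feed.get("status"), feed.get("bozo_exception"),
        )
    return health
```

Storing the returned label per source and per run would give the reliability history mentioned above.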

### 🗃️ 5. Feed Metadata Tracking
- Capture metadata such as feed title, last updated timestamp, and entry GUIDs.
- Log and version feed source definitions.
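
A rough sketch of both bullets, assuming the parsed feed is the dict-like result of `feedparser.parse()` (channel fields under `feed`, GUIDs under each entry's `id`). The output field names and the 12-character hash used as a version tag are hypothetical.

```python
import hashlib
import json


def extract_feed_metadata(source_name, feed):
    """Collect feed-level metadata and entry GUIDs from a parsed feed."""
    channel = feed.get("feed", {})
    return {
        "source": source_name,
        "feed_title": channel.get("title", ""),
        "feed_updated": channel.get("updated", ""),
        # Fall back to the link when an entry has no GUID.
        "entry_ids": [
            entry.get("id") or entry.get("link", "")
            for entry in feed.get("entries", [])
        ],
    }


def feed_definitions_version(feed_dict):
    """Return a short, order-independent fingerprint of a {source_name: url} dict."""
    canonical = json.dumps(feed_dict, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Logging `feed_definitions_version(SPANISH_FEEDS)` with each run would make it possible to tell which set of source definitions produced a given batch.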

### 📊 6. Add Feed Categories
- Annotate each feed with attributes like:
  - Region (e.g., Latin America, MENA, East Asia)
  - Domain (e.g., political unrest, cyber, disinformation)
  - Risk tier or monitoring priority
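
The annotations could be modeled as a small record per feed. This is only a sketch: the `FeedSource` class, its field names, and the region, domain, and priority values assigned below are illustrative assumptions, while the two URLs come from `SPANISH_FEEDS`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedSource:
    name: str
    url: str
    region: str
    domain: str
    priority: int  # assumed convention: 1 = highest monitoring priority


# Illustrative annotations only; real values would come from analysts.
FEED_SOURCES = [
    FeedSource("Infobae América", "https://www.infobae.com/america/rss/",
               region="Latin America", domain="general news", priority=1),
    FeedSource("RTVE", "https://www.rtve.es/rss/noticias.xml",
               region="Europe", domain="general news", priority=2),
]


def select_feeds(sources, region=None, max_priority=None):
    """Filter annotated feeds and return a {name: url} dict for collect_all_articles()."""
    selected = {}
    for source in sources:
        if region is not None and source.region != region:
            continue
        if max_priority is not None and source.priority > max_priority:
            continue
        selected[source.name] = source.url
    return selected
```

For example, `select_feeds(FEED_SOURCES, region="Latin America")` would restrict a run to feeds tagged with that region.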

### 🧵 7. Future Integrations (Cross-Source Ingestion)
- Introduce additional ingestion modules for:
  - Reddit (via `praw`)
  - Telegram (via `telethon`)
  - X/Twitter (via `snscrape`)
- Unify all source types under a shared `IngestItem` schema for standardized downstream processing.
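
As a starting point, the shared schema might look like the sketch below. The `IngestItem` field names and the `from_rss_article` adapter are assumptions; the adapter maps the dictionaries produced by `parse_rss_feed` and uses the link as the identifier when no GUID has been captured.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class IngestItem:
    source_type: str            # e.g. "rss", "reddit", "telegram"
    source_name: str
    item_id: str
    url: str
    title: str
    text: str
    published: Optional[str] = None


def from_rss_article(article):
    """Convert one parse_rss_feed() article dict into an IngestItem."""
    return IngestItem(
        source_type="rss",
        source_name=article["source"],
        item_id=article.get("id") or article.get("link", ""),
        url=article.get("link", ""),
        title=article.get("title", ""),
        text=article.get("summary", ""),
        published=article.get("published") or None,
    )
```

Each future ingestion module would supply its own adapter, and `asdict(item)` yields a plain dictionary that can be written to JSONL alongside the RSS records.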

---

**✅ Current Status:**  
MVP ingestion is complete for Spanish-language RSS feeds.  
Ready to proceed to: **Module 2 – Language Detection + Translation**
