# 📰 RPP News Retrieval System with ChromaDB & LangChain

## Objective
Ingest the latest news from RPP Perú (https://rpp.pe/rss), embed the items using SentenceTransformers, and build a retrieval system on ChromaDB orchestrated with LangChain.

## Pipeline Overview
0. **Setup & Imports** - Load all required libraries and modules
1. **Load Data** - Extract the 50 latest news items from the RPP RSS feed
2. **Tokenization** - Tokenize with tiktoken and analyze token counts
3. **Embedding** - Generate embeddings with sentence-transformers/all-MiniLM-L6-v2
4. **ChromaDB Storage** - Store documents with metadata in ChromaDB (chroma_db/)
5. **Query Results** - Semantic similarity search with DataFrame output
6. **LangChain Orchestration** - End-to-end modular pipeline

---

In [1]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
from datetime import datetime

# Import custom modules
from rss_loader import load_rss_feed, format_news_for_embedding
from tokenizer import tokenize_text, count_tokens, should_chunk
from embeddings import EmbeddingGenerator
from vector_store import ChromaDBStore
from langchain_pipeline import NewsRetrievalPipeline
from utils import create_results_dataframe, display_results

## 1. Load Data from RPP RSS Feed

Load the 50 latest news items from the RPP Perú RSS feed (https://rpp.pe/rss).

**Requirements:**
- Use `feedparser` to extract news items
- Each record includes: `title`, `description`, `link`, `published` (date)
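
The `rss_loader` module itself is not shown in this notebook, so the following is only a sketch of how the four-field records could be produced with `feedparser`. The names `normalize_entry` and `load_rss_feed_sketch` are hypothetical and are not the project's actual functions; `feedparser` is imported lazily so that the normalization step can be tried on plain dictionaries.

```python
from typing import Any, Dict, List, Mapping

# The four fields each record is expected to carry.
FIELDS = ("title", "description", "link", "published")


def normalize_entry(entry: Mapping[str, Any]) -> Dict[str, str]:
    """Keep only the four expected fields; missing ones become empty strings."""
    record = {field: str(entry.get(field, "") or "").strip() for field in FIELDS}
    # feedparser normalizes the RSS <description> element to 'summary',
    # so fall back to that key when 'description' is absent.
    if not record["description"]:
        record["description"] = str(entry.get("summary", "") or "").strip()
    return record


def load_rss_feed_sketch(url: str, max_items: int = 50) -> List[Dict[str, str]]:
    """Hypothetical stand-in for rss_loader.load_rss_feed (an assumption)."""
    import feedparser  # third-party; imported lazily on purpose

    feed = feedparser.parse(url)
    return [normalize_entry(entry) for entry in feed.entries[:max_items]]
```

With this shape, `load_rss_feed_sketch("https://rpp.pe/rss", max_items=50)` would return a list of dictionaries that can be passed directly to `pd.DataFrame`.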

In [19]:
# Load RSS feed
print("📡 Loading RSS feed from RPP Perú...")
news_items = load_rss_feed(url="https://rpp.pe/rss", max_items=50)

print(f"✅ Loaded {len(news_items)} news items")
print("\nFirst 3 news items:")
for i, item in enumerate(news_items[:3], 1):
    print(f"\n{i}. {item['title']}")
    print(f"   Published: {item['published']}")
    print(f"   Link: {item['link']}")
    print(f"   Description: {item['description'][:100]}...")

📡 Loading RSS feed from RPP Perú...
✅ Loaded 50 news items

First 3 news items:

1. Flamengo vs. Racing Club EN VIVO: ¿a qué hora juegan y dónde ver hoy por la semifinal de Copa Libertadores?
   Published: Wed, 22 Oct 2025 17:30:11 -0500
   Link: https://rpp.pe/futbol/copa-libertadores/flamengo-vs-racing-club-en-vivo-ver-espn-transmision-gratis-desde-maracana-ida-semifinal-copa-libertadores-2025-link-stream-partidos-de-hoy-noticia-1660341
   Description: En el Maracaná, Flamengo y Racing chocarán en un partido imperdible por la semifinal ida de la Copa ...

2. JNJ determina que Delia Espinoza no retornará como Fiscal de la Nación
   Published: Wed, 22 Oct 2025 16:58:30 -0500
   Link: https://rpp.pe/politica/judiciales/delia-espinoza-ya-no-sera-fiscal-de-la-nacion-jnj-noticia-1660534
   Description: La JNJ enfatizó se mantiene vigente la medida cautelar de suspensión preventiva en el cargo a Espino...

3. Ala Este de la Casa Blanca será demolida totalmente para construir salón de baile 

In [20]:
# Create DataFrame for visualization
df_news = pd.DataFrame(news_items)
print("\n📊 News DataFrame:")
print(df_news.head())
print(f"\nShape: {df_news.shape}")


📊 News DataFrame:
                                               title  \
0  Flamengo vs. Racing Club EN VIVO: ¿a qué hora ...   
1  JNJ determina que Delia Espinoza no retornará ...   
2  Ala Este de la Casa Blanca será demolida total...   
3  Moquegua: pobladores piden nuevo tamizaje para...   
4  Myriam Hernández en Lima: setlist de canciones...   

                                         description  \
0  En el Maracaná, Flamengo y Racing chocarán en ...   
1  La JNJ enfatizó se mantiene vigente la medida ...   
2  De acuerdo con el diario The New York Times, e...   
3  Rotafono de RPP | En el 2023, la Dirección Reg...   
4  La 'Baladista de América' llega a Lima para do...   

                                                link  \
0  https://rpp.pe/futbol/copa-libertadores/flamen...   
1  https://rpp.pe/politica/judiciales/delia-espin...   
2  https://rpp.pe/mundo/estados-unidos/ala-este-d...   
3  https://rpp.pe/rotafono/servicios-publicos/moq...   
4  https://rpp.pe/musica/co

In [None]:
# Save to CSV file
output_file = "../outputs/rpp_news_50.csv"
df_news.to_csv(output_file, index=False, encoding='utf-8')
print(f"\n✅ All {len(df_news)} news items saved to: {output_file}")
print(f"📊 Total news downloaded: {len(news_items)}")


📰 Complete List of 50 News Items from RPP:


1. Flamengo vs. Racing Club EN VIVO: ¿a qué hora juegan y dónde ver hoy por la semifinal de Copa Libertadores?
   📅 Wed, 22 Oct 2025 17:30:11 -0500
   📝 En el Maracaná, Flamengo y Racing chocarán en un partido imperdible por la semifinal ida de la Copa Libertadores 2025.
   🔗 https://rpp.pe/futbol/copa-libertadores/flamengo-vs-racing-club-en-vivo-ver-espn-transmision-gratis-desde-maracana-ida-semifinal-copa-libertadores-2025-link-stream-partidos-de-hoy-noticia-1660341
------------------------------------------------------------------------------------------------------------------------

2. JNJ determina que Delia Espinoza no retornará como Fiscal de la Nación
   📅 Wed, 22 Oct 2025 16:58:30 -0500
   📝 La JNJ enfatizó se mantiene vigente la medida cautelar de suspensión preventiva en el cargo a Espinoza Valenzuela por una resolución emitida en setiembre.
   🔗 https://rpp.pe/politica/judiciales/delia-espinoza-ya-no-sera-fiscal-de-la-nacion-jn

## 2. Tokenization with tiktoken

Tokenize articles to understand token counts and determine if chunking is needed.

**Requirements:**
- Use `tiktoken` for tokenization
- Compute `num_tokens` for sample articles
- Decide if chunking is needed based on a token limit (8192 tokens is used here; note that `all-MiniLM-L6-v2` itself truncates inputs longer than 256 word pieces by default)
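
The `tokenizer` module is not shown, so the sketch below only illustrates the chunking decision on a list of token IDs. The function names and the `cl100k_base` encoding are assumptions, not the project's code; the splitting logic is plain Python and works on any integer sequence, while `encode_with_tiktoken` needs the third-party `tiktoken` package.

```python
from typing import List, Sequence


def encode_with_tiktoken(text: str, encoding_name: str = "cl100k_base") -> List[int]:
    """Encode text to token IDs; the encoding name is an assumption."""
    import tiktoken  # third-party; imported lazily on purpose

    return tiktoken.get_encoding(encoding_name).encode(text)


def should_chunk_tokens(token_ids: Sequence[int], max_tokens: int = 8192) -> bool:
    """True when the token sequence exceeds the limit."""
    return len(token_ids) > max_tokens


def chunk_tokens(token_ids: Sequence[int], max_tokens: int, overlap: int = 0) -> List[List[int]]:
    """Split token IDs into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_tokens]))
        if start + max_tokens >= len(token_ids):
            break  # the last window already reaches the end of the sequence
    return chunks
```

For articles as short as the ones in this feed, `should_chunk_tokens` returns `False` and `chunk_tokens` yields a single window.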

In [4]:
# Select a sample article
sample_article = format_news_for_embedding(news_items[0])

print("📝 Sample Article:")
print(sample_article)

📝 Sample Article:
Flamengo vs. Racing Club EN VIVO: ¿a qué hora juegan y dónde ver hoy por la semifinal de Copa Libertadores?. En el Maracaná, Flamengo y Racing chocarán en un partido imperdible por la semifinal ida de la Copa Libertadores 2025.


In [5]:
# Tokenize and count tokens
tokens = tokenize_text(sample_article)
num_tokens = count_tokens(sample_article)

print(f"\n🔢 Token Analysis:")
print(f"   Number of tokens: {num_tokens}")
print(f"   First 10 token IDs: {tokens[:10]}")

# Check if chunking is needed
needs_chunking = should_chunk(sample_article, max_tokens=8192)
print(f"\n   Chunking needed (>8192 tokens): {needs_chunking}")


🔢 Token Analysis:
   Number of tokens: 67
   First 10 token IDs: [3968, 309, 65753, 6296, 13, 33382, 10349, 5301, 650, 68321]

   Chunking needed (>8192 tokens): False


In [6]:
# Analyze token counts for all articles
token_counts = [count_tokens(format_news_for_embedding(item)) for item in news_items]

print("\n📊 Token Statistics Across All Articles:")
print(f"   Average tokens: {np.mean(token_counts):.2f}")
print(f"   Min tokens: {np.min(token_counts)}")
print(f"   Max tokens: {np.max(token_counts)}")
print(f"   Median tokens: {np.median(token_counts):.2f}")


📊 Token Statistics Across All Articles:
   Average tokens: 68.70
   Min tokens: 38
   Max tokens: 97
   Median tokens: 68.00


## 3. Generate Embeddings with SentenceTransformers

Use the `sentence-transformers/all-MiniLM-L6-v2` model to generate embeddings.

**Requirements:**
- Model: `sentence-transformers/all-MiniLM-L6-v2`
- Generate 384-dimensional embeddings
- Store embeddings alongside text and metadata
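
To make concrete what the embeddings are used for at query time, here is a minimal pure-Python sketch of ranking vectors by cosine similarity. It is an illustration of the idea only: `cosine_similarity` and `top_k` are hypothetical helpers, and the distance metric ChromaDB actually applies depends on how the collection is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 10) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the notebook, the query would be the embedding of the search text and `vectors` the 384-dimensional article embeddings.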

In [7]:
# Initialize embedding generator
print("🤖 Initializing SentenceTransformer model...")
embedding_generator = EmbeddingGenerator(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("✅ Model loaded!")

🤖 Initializing SentenceTransformer model...




✅ Model loaded!


In [8]:
# Generate embeddings for all news items
print("\n🔄 Generating embeddings for all news items...")
texts = [format_news_for_embedding(item) for item in news_items]
embeddings = embedding_generator.embed_texts(texts)

print(f"✅ Generated {len(embeddings)} embeddings")
print(f"   Embedding dimension: {embeddings[0].shape[0]}")
print(f"   Sample embedding (first 10 values): {embeddings[0][:10]}")


🔄 Generating embeddings for all news items...
✅ Generated 50 embeddings
   Embedding dimension: 384
   Sample embedding (first 10 values): [-0.01191772  0.06891479  0.00700469 -0.04329996 -0.00070988  0.01612812
 -0.01539836 -0.01328336  0.02911671  0.00764781]


## 4. ChromaDB Storage (chroma_db/)

Store documents, metadata, and embeddings in ChromaDB.

**Requirements:**
- Use ChromaDB to store documents, metadata, and embeddings
- Persist to the `../chroma_db` directory only
- Implement upsert operation for updates
- Support similarity search by keyword or description
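
The `vector_store` module is not shown, so this is a hedged sketch of what an upsert against a persistent collection might look like. `stable_id` and `upsert_news_sketch` are hypothetical names. Deriving the ID from the article link is an assumption that links are unique per article; it is one way to keep repeated upserts of a changing feed from overwriting unrelated entries, which positional IDs cannot guarantee.

```python
import hashlib
from typing import Mapping, Sequence


def stable_id(link: str, prefix: str = "news_") -> str:
    """Derive a deterministic ID from the article link (assumed unique)."""
    digest = hashlib.sha1(link.strip().encode("utf-8")).hexdigest()
    return f"{prefix}{digest[:16]}"


def upsert_news_sketch(
    items: Sequence[Mapping[str, str]],
    documents: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    persist_directory: str = "../chroma_db",
    collection_name: str = "rpp_news",
) -> int:
    """Hypothetical upsert helper; returns the collection size afterwards."""
    import chromadb  # third-party; imported lazily on purpose

    client = chromadb.PersistentClient(path=persist_directory)
    collection = client.get_or_create_collection(name=collection_name)
    # upsert inserts new IDs and updates existing ones in place
    collection.upsert(
        ids=[stable_id(item["link"]) for item in items],
        documents=list(documents),
        metadatas=[dict(item) for item in items],
        embeddings=[list(vec) for vec in embeddings],
    )
    return collection.count()
```

Because the same link always maps to the same ID, running the upsert twice on the same feed leaves the document count unchanged.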

In [9]:
# Initialize ChromaDB store
print("💾 Initializing ChromaDB store...")
chroma_store = ChromaDBStore(
    collection_name="rpp_news",
    persist_directory="../chroma_db"
)
print("✅ ChromaDB store initialized!")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


💾 Initializing ChromaDB store...
✅ ChromaDB store initialized!


In [10]:
# Prepare metadata
metadatas = [
    {
        'title': item['title'],
        'description': item['description'],
        'link': item['link'],
        'published': item['published']
    }
    for item in news_items
]

# Generate unique IDs
ids = [f"news_{i}" for i in range(len(news_items))]

print(f"📝 Prepared {len(metadatas)} metadata entries")

📝 Prepared 50 metadata entries


In [11]:
# Upsert documents to ChromaDB
print("\n⬆️  Upserting documents to ChromaDB...")
chroma_store.upsert_documents(
    documents=texts,
    metadatas=metadatas,
    embeddings=embeddings.tolist(),
    ids=ids
)

collection_count = chroma_store.get_collection_count()
print(f"✅ Collection now contains {collection_count} documents")


⬆️  Upserting documents to ChromaDB...
✅ Collection now contains 50 documents


## 5. Query and Retrieve Results

Query the system with various topics and display results in a DataFrame.

**Requirements:**
- Query with prompts like "Últimas noticias de economía"
- Display results in pandas DataFrame
- Columns: `title | description | link | date_published`
- Show top 10 most relevant results
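
The `utils.create_results_dataframe` helper is not shown. As a sketch of the conversion it presumably performs, the function below flattens the nested result dictionary returned by a Chroma query (one inner list per query text) into rows with the four required columns. `results_to_rows` is a hypothetical name, and mapping the stored `published` key to `date_published` is an assumption based on the metadata and column names used in this notebook.

```python
from typing import Any, Dict, List

# Column order required for the results table.
COLUMNS = ("title", "description", "link", "date_published")


def results_to_rows(results: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten the metadata of the first query into one row per hit."""
    metadatas = (results.get("metadatas") or [[]])[0]
    rows: List[Dict[str, str]] = []
    for meta in metadatas:
        meta = meta or {}  # tolerate hits stored without metadata
        rows.append(
            {
                "title": meta.get("title", ""),
                "description": meta.get("description", ""),
                "link": meta.get("link", ""),
                "date_published": meta.get("published", ""),
            }
        )
    return rows
```

The rows can then be turned into a table with `pd.DataFrame(results_to_rows(results), columns=COLUMNS)`.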

In [12]:
# Query the collection
query_text = "Últimas noticias de economía"
print(f"🔍 Querying: '{query_text}'")

results = chroma_store.query(
    query_texts=[query_text],
    n_results=10
)

print(f"✅ Found {len(results['metadatas'][0])} results")

2025-10-22 17:18:45.027 python[25621:1482628] 2025-10-22 17:18:45.027101 [W:onnxruntime:, coreml_execution_provider.cc:113 GetCapability] CoreMLExecutionProvider::GetCapability, number of partitions supported by CoreML: 49 number of nodes in the graph: 323 number of nodes supported by CoreML: 232


🔍 Querying: 'Últimas noticias de economía'


Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


✅ Found 10 results


In [13]:
# Create and display results DataFrame
df_results = create_results_dataframe(results)

print("\n📊 Query Results:")
print(f"   Found {len(df_results)} relevant articles")
display(df_results)

# Save to CSV
output_path = "../outputs/query_results_economia.csv"
df_results.to_csv(output_path, index=False)
print(f"\n💾 Results saved to: {output_path}")


📊 Query Results:
   Found 10 relevant articles


| | title | description | link | date_published |
|---|---|---|---|---|
| 0 | Consejo Fiscal pide que el TC revise las más d... | Alonso Segura alerta sobre la "avalancha enorm... | https://rpp.pe/economia/economia/consejo-fisca... | Wed, 22 Oct 2025 10:15:26 -0500 |
| 1 | Estados Unidos anunciará un "aumento sustancia... | "Vamos a anunciar después del cierre (de los m... | https://rpp.pe/mundo/estados-unidos/estados-un... | Wed, 22 Oct 2025 16:33:24 -0500 |
| 2 | JNJ determina que Delia Espinoza no retornará ... | La JNJ enfatizó se mantiene vigente la medida ... | https://rpp.pe/politica/judiciales/delia-espin... | Wed, 22 Oct 2025 16:58:30 -0500 |
| 3 | Policía en presunto estado de ebriedad es dete... | El agente es de la Unidad de Servicios Especia... | https://rpp.pe/peru/junin/huancayo-detienen-a-... | Wed, 22 Oct 2025 16:05:27 -0500 |
| 4 | Agua Marina se retira temporalmente los escena... | El cantante Lucho Granda señaló que la decisió... | https://rpp.pe/musica/nacional/agua-marina-se-... | Wed, 22 Oct 2025 11:59:24 -0500 |
| 5 | Isabel Preysler publica íntimas cartas de amor... | Isabel Preysler publicó su libro 'Mi verdadera... | https://rpp.pe/famosos/celebridades/isabel-pre... | Wed, 22 Oct 2025 08:47:59 -0500 |
| 6 | ¿Será necesario prorrogar el estado de emergen... | En Ampliación de Noticias, Rubén Cano consider... | https://rpp.pe/lima/seguridad/estado-de-emerge... | Wed, 22 Oct 2025 13:34:50 -0500 |
| 7 | Ala Este de la Casa Blanca será demolida total... | De acuerdo con el diario The New York Times, e... | https://rpp.pe/mundo/estados-unidos/ala-este-d... | Wed, 22 Oct 2025 17:15:10 -0500 |
| 8 | Soda Stereo en Lima: precios y cómo comprar en... | El trío argentino regresa en 2026, seis años d... | https://rpp.pe/musica/conciertos/soda-stereo-e... | Wed, 22 Oct 2025 17:00:37 -0500 |
| 9 | Cronograma del octavo retiro de AFP 2025 HOY: ... | El desembolso se efectuará hasta en cuatro arm... | https://rpp.pe/economia/economia/octavo-retiro... | Wed, 22 Oct 2025 16:30:03 -0500 |



💾 Results saved to: ../outputs/query_results_economia.csv


## 6. LangChain Orchestration Pipeline

Implement the complete end-to-end pipeline using LangChain.

**Requirements:**
- End-to-end pipeline: Load RSS → Tokenize → Embed → Store → Retrieve
- Each step should be modular (functions or LangChain chains)
- Use the same `chroma_db/` directory for consistency
- Demonstrate complete pipeline execution with query results
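
The `langchain_pipeline` module is not shown, so the following is a sketch of how news items might be turned into LangChain documents as one modular step. `to_document_fields` and `to_langchain_documents` are hypothetical names. The `"title. description"` content format and the metadata keys are inferred from the sample outputs in this notebook, not taken from the project's code.

```python
from typing import Any, Dict, List, Mapping, Sequence, Tuple

# Metadata keys as they appear in the sample document metadata.
METADATA_KEYS = ("title", "link", "published", "description")


def to_document_fields(item: Mapping[str, str]) -> Tuple[str, Dict[str, str]]:
    """Build the text to embed and the metadata to store for one news item."""
    page_content = f"{item.get('title', '')}. {item.get('description', '')}"
    metadata = {key: item.get(key, "") for key in METADATA_KEYS}
    return page_content, metadata


def to_langchain_documents(items: Sequence[Mapping[str, str]]) -> List[Any]:
    """Wrap each item in a LangChain Document (requires langchain-core)."""
    from langchain_core.documents import Document  # third-party; lazy import

    documents = []
    for item in items:
        page_content, metadata = to_document_fields(item)
        documents.append(Document(page_content=page_content, metadata=metadata))
    return documents
```

Keeping the field-building step separate from the `Document` wrapping means the same function can feed both the direct ChromaDB path and the LangChain path.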

In [14]:
# Initialize LangChain pipeline
print("🔗 Initializing LangChain Pipeline...")
langchain_pipeline = NewsRetrievalPipeline(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    persist_directory="../chroma_db"
)
print("✅ LangChain pipeline initialized!")

🔗 Initializing LangChain Pipeline...




✅ LangChain pipeline initialized!


In [15]:
# Load fresh RSS feed for LangChain demo
print("\n📡 Loading fresh RSS feed...")
fresh_news = load_rss_feed(url="https://rpp.pe/rss", max_items=50)
print(f"✅ Loaded {len(fresh_news)} fresh news items")


📡 Loading fresh RSS feed...
✅ Loaded 50 fresh news items


In [16]:
# Step 1: Load and Process with LangChain
print("\n🔄 Step 1: Loading and processing documents...")
documents = langchain_pipeline.load_and_process(fresh_news)
print(f"✅ Created {len(documents)} LangChain documents")
print(f"   Sample document content: {documents[0].page_content[:100]}...")
print(f"   Sample metadata: {documents[0].metadata}")


🔄 Step 1: Loading and processing documents...
✅ Created 50 LangChain documents
   Sample document content: Flamengo vs. Racing Club EN VIVO: ¿a qué hora juegan y dónde ver hoy por la semifinal de Copa Libert...
   Sample metadata: {'title': 'Flamengo vs. Racing Club EN VIVO: ¿a qué hora juegan y dónde ver hoy por la semifinal de Copa Libertadores?', 'link': 'https://rpp.pe/futbol/copa-libertadores/flamengo-vs-racing-club-en-vivo-ver-espn-transmision-gratis-desde-maracana-ida-semifinal-copa-libertadores-2025-link-stream-partidos-de-hoy-noticia-1660341', 'published': 'Wed, 22 Oct 2025 17:30:11 -0500', 'description': 'En el Maracaná, Flamengo y Racing chocarán en un partido imperdible por la semifinal ida de la Copa Libertadores 2025.'}


In [17]:
# Step 2: Create Vector Store
print("\n💾 Creating vector store in chroma_db/...")
langchain_pipeline.create_vectorstore(documents)
print("✅ Vector store created and persisted to ../chroma_db/")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



💾 Creating vector store in chroma_db/...
✅ Vector store created and persisted to ../chroma_db/


In [18]:
# Step 3: Query with LangChain
print("\n🔄 Step 3: Querying vector store...")
query = "Últimas noticias de economía"
df_langchain_results = langchain_pipeline.query(query, k=10)

print(f"\n📊 LangChain Query Results for: '{query}'")
display(df_langchain_results)

# Save results
output_path_lc = "../outputs/langchain_query_results.csv"
df_langchain_results.to_csv(output_path_lc, index=False)
print(f"\n💾 Results saved to: {output_path_lc}")


🔄 Step 3: Querying vector store...


Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given



📊 LangChain Query Results for: 'Últimas noticias de economía'


| | title | description | link | date_published |
|---|---|---|---|---|
| 0 | Consejo Fiscal pide que el TC revise las más d... | Alonso Segura alerta sobre la "avalancha enorm... | https://rpp.pe/economia/economia/consejo-fisca... | Wed, 22 Oct 2025 10:15:26 -0500 |
| 1 | Estados Unidos anunciará un "aumento sustancia... | "Vamos a anunciar después del cierre (de los m... | https://rpp.pe/mundo/estados-unidos/estados-un... | Wed, 22 Oct 2025 16:33:24 -0500 |
| 2 | JNJ determina que Delia Espinoza no retornará ... | La JNJ enfatizó se mantiene vigente la medida ... | https://rpp.pe/politica/judiciales/delia-espin... | Wed, 22 Oct 2025 16:58:30 -0500 |
| 3 | Policía en presunto estado de ebriedad es dete... | El agente es de la Unidad de Servicios Especia... | https://rpp.pe/peru/junin/huancayo-detienen-a-... | Wed, 22 Oct 2025 16:05:27 -0500 |
| 4 | Agua Marina se retira temporalmente los escena... | El cantante Lucho Granda señaló que la decisió... | https://rpp.pe/musica/nacional/agua-marina-se-... | Wed, 22 Oct 2025 11:59:24 -0500 |
| 5 | Isabel Preysler publica íntimas cartas de amor... | Isabel Preysler publicó su libro 'Mi verdadera... | https://rpp.pe/famosos/celebridades/isabel-pre... | Wed, 22 Oct 2025 08:47:59 -0500 |
| 6 | ¿Será necesario prorrogar el estado de emergen... | En Ampliación de Noticias, Rubén Cano consider... | https://rpp.pe/lima/seguridad/estado-de-emerge... | Wed, 22 Oct 2025 13:34:50 -0500 |
| 7 | Ala Este de la Casa Blanca será demolida total... | De acuerdo con el diario The New York Times, e... | https://rpp.pe/mundo/estados-unidos/ala-este-d... | Wed, 22 Oct 2025 17:15:10 -0500 |
| 8 | Soda Stereo en Lima: precios y cómo comprar en... | El trío argentino regresa en 2026, seis años d... | https://rpp.pe/musica/conciertos/soda-stereo-e... | Wed, 22 Oct 2025 17:00:37 -0500 |
| 9 | Cronograma del octavo retiro de AFP 2025 HOY: ... | El desembolso se efectuará hasta en cuatro arm... | https://rpp.pe/economia/economia/octavo-retiro... | Wed, 22 Oct 2025 16:30:03 -0500 |



💾 Results saved to: ../outputs/langchain_query_results.csv
