# 📰 RPP News Retrieval System with ChromaDB & LangChain

## Objective
Ingest the latest news from RPP Perú (https://rpp.pe/rss), embed it with SentenceTransformers, and build a retrieval system on ChromaDB orchestrated with LangChain.

## Pipeline Overview
0. **Setup & Imports** - Load all required libraries and modules
1. **Load Data** - Extract the 50 latest news items from the RPP RSS feed
2. **Tokenization** - Tokenize with tiktoken and analyze token counts
3. **Embedding** - Generate embeddings with sentence-transformers/all-MiniLM-L6-v2
4. **ChromaDB Storage** - Store documents with metadata in ChromaDB (chroma_db/)
5. **Query Results** - Semantic similarity search with DataFrame output
6. **LangChain Orchestration** - End-to-end modular pipeline

---

In [116]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
from datetime import datetime

# Import custom modules
from rss_loader import load_rss_feed, format_news_for_embedding
from tokenizer import tokenize_text, count_tokens, should_chunk
from embeddings import EmbeddingGenerator
from vector_store import ChromaDBStore
from langchain_pipeline import NewsRetrievalPipeline
from utils import create_results_dataframe, display_results

## 1. Load Data from RPP RSS Feed

Load the 50 latest news items from the RPP Perú RSS feed (https://rpp.pe/rss).

**Requirements:**
- Use `feedparser` to extract news items
- Each record includes: `title`, `description`, `link`, `published` (date)
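
The `rss_loader` module is not shown in this notebook, so the following is only a minimal sketch of what `load_rss_feed` might look like under the requirements above. The helper name `entries_to_records` is hypothetical; `feedparser.parse` and the `entries` attribute are part of the real `feedparser` API.

```python
from typing import Any, Dict, Iterable, List


def entries_to_records(entries: Iterable[Any], max_items: int = 50) -> List[Dict[str, str]]:
    """Keep only the four fields the notebook uses from each feed entry."""
    records = []
    for entry in list(entries)[:max_items]:
        records.append({
            "title": entry.get("title", "").strip(),
            "description": entry.get("description", "").strip(),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
        })
    return records


def load_rss_feed(url: str = "https://rpp.pe/rss", max_items: int = 50) -> List[Dict[str, str]]:
    """Fetch and parse the feed, then normalize its entries (assumed behavior)."""
    import feedparser  # imported lazily so entries_to_records works without it

    feed = feedparser.parse(url)
    return entries_to_records(feed.entries, max_items)
```

Separating the normalization step from the network call keeps the field mapping easy to check against plain dictionaries.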

In [117]:
# Load RSS feed
print("📡 Loading RSS feed from RPP Perú...")
news_items = load_rss_feed(url="https://rpp.pe/rss", max_items=50)

print(f"✅ Loaded {len(news_items)} news items")
print("\nFirst 3 news items:")
for i, item in enumerate(news_items[:3], 1):
    print(f"\n{i}. {item['title']}")
    print(f"   Published: {item['published']}")
    print(f"   Link: {item['link']}")
    print(f"   Description: {item['description'][:100]}...")

📡 Loading RSS feed from RPP Perú...
✅ Loaded 50 news items

First 3 news items:

1. JNJ determina que Delia Espinoza no retornará como Fiscal de la Nación
   Published: Wed, 22 Oct 2025 16:58:30 -0500
   Link: https://rpp.pe/politica/judiciales/delia-espinoza-ya-no-sera-fiscal-de-la-nacion-jnj-noticia-1660534
   Description: La JNJ enfatizó se mantiene vigente la medida cautelar de suspensión preventiva en el cargo a Espino...

2. Flamengo vs. Racing Club EN VIVO: ¿a qué hora juegan y dónde ver hoy por la semifinal de Copa Libertadores?
   Published: Wed, 22 Oct 2025 17:00:11 -0500
   Link: https://rpp.pe/futbol/copa-libertadores/flamengo-vs-racing-club-en-vivo-ver-espn-transmision-gratis-desde-maracana-ida-semifinal-copa-libertadores-2025-link-stream-partidos-de-hoy-noticia-1660341
   Description: En el Maracaná, Flamengo y Racing chocarán en un partido imperdible por la semifinal ida de la Copa ...

3. Soda Stereo en Lima: precios y cómo comprar entradas para su conci

In [118]:
# Create DataFrame for visualization
df_news = pd.DataFrame(news_items)
print("\n📊 News DataFrame:")
print(df_news.head())
print(f"\nShape: {df_news.shape}")


📊 News DataFrame:
                                               title  \
0  JNJ determina que Delia Espinoza no retornará ...   
1  Flamengo vs. Racing Club EN VIVO: ¿a qué hora ...   
2  Soda Stereo en Lima: precios y cómo comprar en...   
3  Argentina: dólar blue hoy a cuánto cotiza este...   
4  México confirma arresto en Cuba de Zhi Dong Zh...   

                                         description  \
0  La JNJ enfatizó se mantiene vigente la medida ...   
1  En el Maracaná, Flamengo y Racing chocarán en ...   
2  El trío argentino regresa en 2026, seis años d...   
3  La cotización del dólar blue, hoy miércoles 22...   
4  Zhi Dong Zhang, alias 'Brother Wang', había es...   

                                                link  \
0  https://rpp.pe/politica/judiciales/delia-espin...   
1  https://rpp.pe/futbol/copa-libertadores/flamen...   
2  https://rpp.pe/musica/conciertos/soda-stereo-e...   
3  https://rpp.pe/mundo/argentina/argentina-dolar...   
4  https

## 2. Tokenization with tiktoken

Tokenize articles to understand token counts and determine if chunking is needed.

**Requirements:**
- Use `tiktoken` for tokenization
- Compute `num_tokens` for sample articles
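- Decide if chunking is needed against the 8192-token limit used in this notebook (note that `all-MiniLM-L6-v2` itself truncates inputs longer than 256 word pieces, and tiktoken counts only approximate that model's own tokenization)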
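
The `tokenizer` module is not shown here. A minimal sketch of the three helpers, plus a hypothetical `chunk_tokens` for the case where chunking is needed, could look like the following. The `cl100k_base` encoding and the injectable `counter` parameter are assumptions made for illustration, not something the notebook confirms.

```python
from typing import Callable, List


def tokenize_text(text: str, encoding_name: str = "cl100k_base") -> List[int]:
    """Encode text into tiktoken token IDs (encoding name is an assumption)."""
    import tiktoken  # lazy import; the encoding data is fetched on first use

    return tiktoken.get_encoding(encoding_name).encode(text)


def count_tokens(text: str) -> int:
    """Number of tiktoken tokens in the text."""
    return len(tokenize_text(text))


def should_chunk(text: str, max_tokens: int = 8192,
                 counter: Callable[[str], int] = count_tokens) -> bool:
    """True when the text exceeds the token budget."""
    return counter(text) > max_tokens


def chunk_tokens(tokens: List[int], max_tokens: int, overlap: int = 0) -> List[List[int]]:
    """Split token IDs into windows of at most max_tokens that share `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end
        start += max_tokens - overlap
    return chunks
```

With RSS titles and descriptions this short, `should_chunk` is expected to return `False`; the chunking helper only matters if full article bodies are ingested later.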

In [119]:
# Select a sample article
sample_article = format_news_for_embedding(news_items[0])

print("📝 Sample Article:")
print(sample_article)

📝 Sample Article:
JNJ determina que Delia Espinoza no retornará como Fiscal de la Nación. La JNJ enfatizó se mantiene vigente la medida cautelar de suspensión preventiva en el cargo a Espinoza Valenzuela por una resolución emitida en setiembre.


In [120]:
# Tokenize and count tokens
tokens = tokenize_text(sample_article)
num_tokens = count_tokens(sample_article)

print("\n🔢 Token Analysis:")
print(f"   Number of tokens: {num_tokens}")
print(f"   First 10 token IDs: {tokens[:10]}")

# Check if chunking is needed
needs_chunking = should_chunk(sample_article, max_tokens=8192)
print(f"\n   Chunking needed (>8192 tokens): {needs_chunking}")


🔢 Token Analysis:
   Number of tokens: 64
   First 10 token IDs: [41, 88086, 6449, 64, 1744, 7462, 689, 27612, 3394, 4458]

   Chunking needed (>8192 tokens): False


In [121]:
# Analyze token counts for all articles
token_counts = [count_tokens(format_news_for_embedding(item)) for item in news_items]

print("\n📊 Token Statistics Across All Articles:")
print(f"   Average tokens: {np.mean(token_counts):.2f}")
print(f"   Min tokens: {np.min(token_counts)}")
print(f"   Max tokens: {np.max(token_counts)}")
print(f"   Median tokens: {np.median(token_counts):.2f}")


📊 Token Statistics Across All Articles:
   Average tokens: 69.56
   Min tokens: 38
   Max tokens: 136
   Median tokens: 68.00


## 3. Generate Embeddings with SentenceTransformers

Use the `sentence-transformers/all-MiniLM-L6-v2` model to generate embeddings.

**Requirements:**
- Model: `sentence-transformers/all-MiniLM-L6-v2`
- Generate 384-dimensional embeddings
- Store embeddings alongside text and metadata
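
The `embeddings` module is not shown here, so this is only a sketch of how an `EmbeddingGenerator` wrapper might be written. The lazy `model` property and the `batch_size` parameter are assumptions; `SentenceTransformer(...).encode(...)` is the real sentence-transformers call.

```python
from typing import List


class EmbeddingGenerator:
    """Thin wrapper that loads the SentenceTransformer model on first use."""

    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self._model = None  # populated by the `model` property

    @property
    def model(self):
        if self._model is None:
            # Imported lazily so constructing the wrapper stays cheap
            from sentence_transformers import SentenceTransformer

            self._model = SentenceTransformer(self.model_name)
        return self._model

    def embed_texts(self, texts: List[str], batch_size: int = 32):
        """Return a NumPy array of shape (len(texts), 384) for all-MiniLM-L6-v2."""
        return self.model.encode(texts, batch_size=batch_size, convert_to_numpy=True)
```

Returning a NumPy array matches the later cells, which read `embeddings[0].shape[0]` and call `embeddings.tolist()`.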

In [122]:
# Initialize embedding generator
print("🤖 Initializing SentenceTransformer model...")
embedding_generator = EmbeddingGenerator(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("✅ Model loaded!")

🤖 Initializing SentenceTransformer model...
✅ Model loaded!


In [123]:
# Generate embeddings for all news items
print("\n🔄 Generating embeddings for all news items...")
texts = [format_news_for_embedding(item) for item in news_items]
embeddings = embedding_generator.embed_texts(texts)

print(f"✅ Generated {len(embeddings)} embeddings")
print(f"   Embedding dimension: {embeddings[0].shape[0]}")
print(f"   Sample embedding (first 10 values): {embeddings[0][:10]}")


🔄 Generating embeddings for all news items...
✅ Generated 50 embeddings
   Embedding dimension: 384
   Sample embedding (first 10 values): [-0.03629785 -0.00173203 -0.02642466 -0.04860017 -0.02594501  0.06199325
  0.11210272  0.02435867 -0.03372595  0.05717984]


## 4. ChromaDB Storage (chroma_db/)

Store documents, metadata, and embeddings in ChromaDB.

**Requirements:**
- Use ChromaDB to store documents, metadata, and embeddings
- Persist to `../chroma_db` directory only
- Implement upsert operation for updates
- Support similarity search by keyword or description
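
The `vector_store` module is not shown here. The sketch below illustrates one way a `ChromaDBStore` wrapper could satisfy these requirements with `chromadb.PersistentClient`, `get_or_create_collection`, `upsert`, `count`, and `query`. The optional `collection` argument and the `stable_news_id` helper are hypothetical additions: the first allows a stand-in collection, the second derives IDs from the article URL so that an upsert after a feed refresh updates the same article instead of whichever item now sits at a given position.

```python
import hashlib
from typing import Any, Dict, List, Optional, Sequence


def stable_news_id(link: str) -> str:
    """Derive a deterministic ID from the article URL (hypothetical helper)."""
    digest = hashlib.sha1(link.encode("utf-8")).hexdigest()[:16]
    return f"news_{digest}"


class ChromaDBStore:
    """Minimal persistent store around a single Chroma collection (assumed design)."""

    def __init__(self, collection_name: str = "rpp_news",
                 persist_directory: str = "../chroma_db",
                 collection: Optional[Any] = None):
        if collection is None:
            import chromadb  # lazy import

            client = chromadb.PersistentClient(path=persist_directory)
            collection = client.get_or_create_collection(name=collection_name)
        self.collection = collection

    def upsert_documents(self, documents: Sequence[str],
                         metadatas: Sequence[Dict[str, Any]],
                         embeddings: Sequence[Sequence[float]],
                         ids: Sequence[str]) -> None:
        # upsert inserts new IDs and overwrites existing ones
        self.collection.upsert(ids=list(ids), documents=list(documents),
                               metadatas=list(metadatas), embeddings=list(embeddings))

    def get_collection_count(self) -> int:
        return self.collection.count()

    def query(self, query_texts: List[str], n_results: int = 10) -> Dict[str, Any]:
        return self.collection.query(query_texts=list(query_texts), n_results=n_results)
```

Chroma embeds `query_texts` with the collection's own embedding function, so the ranking is only meaningful if that function matches the model used for the stored vectors. Passing `query_embeddings` computed by the same generator is one way to remove that dependency.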

In [124]:
# Initialize ChromaDB store
print("💾 Initializing ChromaDB store...")
chroma_store = ChromaDBStore(
    collection_name="rpp_news",
    persist_directory="../chroma_db"
)
print("✅ ChromaDB store initialized!")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given


💾 Initializing ChromaDB store...
✅ ChromaDB store initialized!


In [125]:
# Prepare metadata
metadatas = [
    {
        'title': item['title'],
        'description': item['description'],
        'link': item['link'],
        'published': item['published']
    }
    for item in news_items
]

# Generate unique IDs
ids = [f"news_{i}" for i in range(len(news_items))]

print(f"📝 Prepared {len(metadatas)} metadata entries")

📝 Prepared 50 metadata entries


In [126]:
# Upsert documents to ChromaDB
print("\n⬆️  Upserting documents to ChromaDB...")
chroma_store.upsert_documents(
    documents=texts,
    metadatas=metadatas,
    embeddings=embeddings.tolist(),
    ids=ids
)

collection_count = chroma_store.get_collection_count()
print(f"✅ Collection now contains {collection_count} documents")


⬆️  Upserting documents to ChromaDB...
✅ Collection now contains 50 documents


## 5. Query and Retrieve Results

Query the system with various topics and display results in a DataFrame.

**Requirements:**
- Query with prompts like "Últimas noticias de economía"
- Display results in pandas DataFrame
- Columns: `title | description | link | date_published`
- Show top 10 most relevant results
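
The `utils` module is not shown here. As a sketch, `create_results_dataframe` could flatten the metadata of the first query in a Chroma result (whose `metadatas` field is a list of lists, one inner list per query) into the four required columns. The helper name `results_to_rows` and the mapping of `published` to `date_published` are assumptions based on the column names above.

```python
from typing import Any, Dict, List

COLUMNS = ["title", "description", "link", "date_published"]


def results_to_rows(results: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten the first query's metadata into one row per hit."""
    metadatas = results.get("metadatas") or [[]]
    rows = []
    for meta in metadatas[0]:
        rows.append({
            "title": meta.get("title", ""),
            "description": meta.get("description", ""),
            "link": meta.get("link", ""),
            "date_published": meta.get("published", ""),
        })
    return rows


def create_results_dataframe(results: Dict[str, Any]):
    """Build the results DataFrame with a fixed column order (assumed behavior)."""
    import pandas as pd  # lazy import so results_to_rows works without pandas

    return pd.DataFrame(results_to_rows(results), columns=COLUMNS)
```

Chroma returns hits ordered by distance, so the row order of the resulting DataFrame already reflects relevance.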

In [127]:
# Query the collection
query_text = "Últimas noticias de economía"
print(f"🔍 Querying: '{query_text}'")

results = chroma_store.query(
    query_texts=[query_text],
    n_results=10
)

print(f"✅ Found {len(results['metadatas'][0])} results")

🔍 Querying: 'Últimas noticias de economía'
✅ Found 10 results


In [128]:
# Create and display results DataFrame
df_results = create_results_dataframe(results)

print("\n📊 Query Results:")
print(f"   Found {len(df_results)} relevant articles")
display(df_results)

# Save to CSV
output_path = "../outputs/query_results_economia.csv"
df_results.to_csv(output_path, index=False)
print(f"\n💾 Results saved to: {output_path}")


📊 Query Results:
   Found 10 relevant articles


| | title | description | link | date_published |
|---|---|---|---|---|
| 0 | Consejo Fiscal pide que el TC revise las más d... | Alonso Segura alerta sobre la "avalancha enorm... | https://rpp.pe/economia/economia/consejo-fisca... | Wed, 22 Oct 2025 10:15:26 -0500 |
| 1 | Estados Unidos anunciará un "aumento sustancia... | "Vamos a anunciar después del cierre (de los m... | https://rpp.pe/mundo/estados-unidos/estados-un... | Wed, 22 Oct 2025 16:33:24 -0500 |
| 2 | JNJ determina que Delia Espinoza no retornará ... | La JNJ enfatizó se mantiene vigente la medida ... | https://rpp.pe/politica/judiciales/delia-espin... | Wed, 22 Oct 2025 16:58:30 -0500 |
| 3 | Policía en presunto estado de ebriedad es dete... | El agente es de la Unidad de Servicios Especia... | https://rpp.pe/peru/junin/huancayo-detienen-a-... | Wed, 22 Oct 2025 16:05:27 -0500 |
| 4 | Agua Marina se retira temporalmente los escena... | El cantante Lucho Granda señaló que la decisió... | https://rpp.pe/musica/nacional/agua-marina-se-... | Wed, 22 Oct 2025 11:59:24 -0500 |
| 5 | Álvarez dice que posible ampliación del Reinfo... | El jefe del Gabinete Ministerial indicó que el... | https://rpp.pe/politica/gobierno/alvarez-sobre... | Wed, 22 Oct 2025 14:08:40 -0500 |
| 6 | ¿Será necesario prorrogar el estado de emergen... | En Ampliación de Noticias, Rubén Cano consider... | https://rpp.pe/lima/seguridad/estado-de-emerge... | Wed, 22 Oct 2025 13:34:50 -0500 |
| 7 | Soda Stereo en Lima: precios y cómo comprar en... | El trío argentino regresa en 2026, seis años d... | https://rpp.pe/musica/conciertos/soda-stereo-e... | Wed, 22 Oct 2025 17:00:37 -0500 |
| 8 | Cronograma del octavo retiro de AFP 2025 HOY: ... | El desembolso se efectuará hasta en cuatro arm... | https://rpp.pe/economia/economia/octavo-retiro... | Wed, 22 Oct 2025 16:30:03 -0500 |
| 9 | Argentina: dólar blue hoy a cuánto cotiza este... | La cotización del dólar blue, hoy miércoles 22... | https://rpp.pe/mundo/argentina/argentina-dolar... | Wed, 22 Oct 2025 05:59:39 -0500 |



💾 Results saved to: ../outputs/query_results_economia.csv


## 6. LangChain Orchestration Pipeline

Implement the complete end-to-end pipeline using LangChain.

**Requirements:**
- End-to-end pipeline: Load RSS → Tokenize → Embed → Store → Retrieve
- Each step should be modular (functions or LangChain chains)
- Use the same `chroma_db/` directory for consistency
- Demonstrate complete pipeline execution with query results
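
The `langchain_pipeline` module is not shown here, so the following is a sketch of the same steps under stated assumptions rather than the actual `NewsRetrievalPipeline`. The `"title. description"` format is inferred from the sample article printed earlier; `to_payloads`, `build_vectorstore`, and the collection name `rpp_news_langchain` are hypothetical. The LangChain imports are the `langchain_core` and `langchain_community` classes, which newer releases also ship in `langchain_chroma` and `langchain_huggingface`.

```python
from typing import Dict, List, Tuple

METADATA_KEYS = ("title", "link", "published", "description")


def format_news_for_embedding(item: Dict[str, str]) -> str:
    """Join title and description into the text that gets embedded (inferred format)."""
    return f"{item['title']}. {item['description']}"


def to_payloads(news_items: List[Dict[str, str]]) -> List[Tuple[str, Dict[str, str]]]:
    """Pair each embedding text with the metadata stored next to it."""
    return [
        (format_news_for_embedding(item), {key: item[key] for key in METADATA_KEYS})
        for item in news_items
    ]


def build_vectorstore(news_items: List[Dict[str, str]],
                      model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
                      persist_directory: str = "../chroma_db",
                      collection_name: str = "rpp_news_langchain"):
    """Embed the items and persist them to Chroma through LangChain."""
    # Lazy imports so the pure helpers above work without LangChain installed
    from langchain_core.documents import Document
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma

    payloads = to_payloads(news_items)
    documents = [Document(page_content=text, metadata=meta) for text, meta in payloads]
    # Explicit IDs (assumes links are unique within the batch) let a re-run
    # overwrite existing entries instead of appending new copies
    ids = [meta["link"] for _, meta in payloads]
    return Chroma.from_documents(
        documents,
        embedding=HuggingFaceEmbeddings(model_name=model_name),
        ids=ids,
        collection_name=collection_name,
        persist_directory=persist_directory,
    )
```

A query would then be `build_vectorstore(news_items).similarity_search("Últimas noticias de economía", k=10)`, which returns a list of `Document` objects whose `metadata` can be turned into the results DataFrame.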

In [129]:
# Initialize LangChain pipeline
print("🔗 Initializing LangChain Pipeline...")
langchain_pipeline = NewsRetrievalPipeline(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    persist_directory="../chroma_db"
)
print("✅ LangChain pipeline initialized!")

🔗 Initializing LangChain Pipeline...
✅ LangChain pipeline initialized!


In [130]:
# Load fresh RSS feed for LangChain demo
print("\n📡 Loading fresh RSS feed...")
fresh_news = load_rss_feed(url="https://rpp.pe/rss", max_items=50)
print(f"✅ Loaded {len(fresh_news)} fresh news items")


📡 Loading fresh RSS feed...
✅ Loaded 50 fresh news items


In [131]:
# Step 1: Load and Process with LangChain
print("\n🔄 Step 1: Loading and processing documents...")
documents = langchain_pipeline.load_and_process(fresh_news)
print(f"✅ Created {len(documents)} LangChain documents")
print(f"   Sample document content: {documents[0].page_content[:100]}...")
print(f"   Sample metadata: {documents[0].metadata}")


🔄 Step 1: Loading and processing documents...
✅ Created 50 LangChain documents
   Sample document content: JNJ determina que Delia Espinoza no retornará como Fiscal de la Nación. La JNJ enfatizó se mantiene ...
   Sample metadata: {'title': 'JNJ determina que Delia Espinoza no retornará como Fiscal de la Nación', 'link': 'https://rpp.pe/politica/judiciales/delia-espinoza-ya-no-sera-fiscal-de-la-nacion-jnj-noticia-1660534', 'published': 'Wed, 22 Oct 2025 16:58:30 -0500', 'description': 'La JNJ enfatizó se mantiene vigente la medida cautelar de suspensión preventiva en el cargo a Espinoza Valenzuela por una resolución emitida en setiembre.'}


In [132]:
# Step 2: Create Vector Store
print("\n💾 Creating vector store in chroma_db/...")
langchain_pipeline.create_vectorstore(documents)
print("✅ Vector store created and persisted to ../chroma_db/")

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given



💾 Creating vector store in chroma_db/...
✅ Vector store created and persisted to ../chroma_db/


In [133]:
# Step 3: Query with LangChain
print("\n🔄 Step 3: Querying vector store...")
query = "Últimas noticias de economía"
df_langchain_results = langchain_pipeline.query(query, k=10)

print(f"\n📊 LangChain Query Results for: '{query}'")
display(df_langchain_results)

# Save results
output_path_lc = "../outputs/langchain_query_results.csv"
df_langchain_results.to_csv(output_path_lc, index=False)
print(f"\n💾 Results saved to: {output_path_lc}")


🔄 Step 3: Querying vector store...

📊 LangChain Query Results for: 'Últimas noticias de economía'


| | title | description | link | date_published |
|---|---|---|---|---|
| 0 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 1 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 2 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 3 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 4 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 5 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 6 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 7 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 8 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |
| 9 | Breña: familia busca a adulto mayor con Alzhei... | Rotafono de RPP \| La última vez que lo vieron ... | https://rpp.pe/rotafono/servicios-a-la-comunid... | Wed, 22 Oct 2025 14:06:06 -0500 |



💾 Results saved to: ../outputs/langchain_query_results.csv
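
All ten rows returned by the LangChain query are the same article. One plausible explanation, stated as an assumption because the source of `langchain_pipeline` is not shown, is that the collection holds repeated copies of each document, for example after the vector store cell has been run several times without fixed IDs. A minimal sketch of a query-side guard, using a hypothetical `dedupe_by_link` helper:

```python
from typing import Any, Dict, List


def dedupe_by_link(rows: List[Dict[str, Any]], k: int = 10) -> List[Dict[str, Any]]:
    """Keep the first (most relevant) row per link, up to k rows."""
    seen = set()
    unique: List[Dict[str, Any]] = []
    for row in rows:
        link = row.get("link")
        if link in seen:
            continue  # a copy of an article that is already in the output
        seen.add(link)
        unique.append(row)
        if len(unique) == k:
            break
    return unique
```

To use it, query with a larger `k` than needed and pass `df_langchain_results.to_dict("records")` through the helper. This only hides duplicates at read time; if the assumption holds, writing documents with stable IDs at ingestion is what keeps them out of the collection.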
