<a href="https://colab.research.google.com/github/nalpata/proyecto_aplicado_preservantes/blob/main/notebooks/Proyecto_1_Hito_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Baseline - Applied Project: Preservatives

In this notebook we build the baseline of a RAG system over a set of PDFs
about preservatives. It covers:

1. Loading and ingesting the PDFs
2. Basic preprocessing and chunking
3. Embedding generation
4. Creating a vector store
5. Retriever (similarity search)
6. Benchmark (Precision@k over a set of questions)


In [1]:
## Library installation (code cell)
!pip install -q langchain langchain-community langchain-text-splitters \
               chromadb sentence-transformers pypdf



In [2]:
## Imports and basic configuration
import os
from pathlib import Path

# LangChain imports
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# For basic evaluation
from typing import List, Dict
import numpy as np

# For displaying results
from pprint import pprint


In [3]:
# RESET EVERYTHING TO START CLEAN IN COLAB

import os, shutil

# 1) Go to /content
%cd /content

# 2) Delete any previous duplicate clone
if os.path.exists("proyecto_aplicado_preservantes"):
    shutil.rmtree("proyecto_aplicado_preservantes")
    print("🗑️ Carpeta borrada: proyecto_aplicado_preservantes")

# 3) Clone the repo again from GitHub
!git clone https://github.com/nalpata/proyecto_aplicado_preservantes.git

# 4) Enter the right folder
%cd proyecto_aplicado_preservantes

print("\n🎉 Listo. Ahora estamos en el repo correcto sin duplicados.")
!ls


/content
Cloning into 'proyecto_aplicado_preservantes'...
remote: Enumerating objects: 160, done.
remote: Counting objects: 100% (160/160), done.
remote: Compressing objects: 100% (149/149), done.
remote: Total 160 (delta 61), reused 77 (delta 10), pack-reused 0 (from 0)
Receiving objects: 100% (160/160), 19.38 MiB | 43.15 MiB/s, done.
Resolving deltas: 100% (61/61), done.
/content/proyecto_aplicado_preservantes

🎉 Listo. Ahora estamos en el repo correcto sin duplicados.
 02_chunking.ipynb	    INSTRUCCIONES_PRUEBA.md   RESUMEN_HITO_1.md
 app.py			    MEJORAS_CHUNKING.md       RESUMEN_PROBLEMAS.md
 CHECKLIST_INSTALACION.md   notebooks		      run_pipeline.py
 data			   'Pauta proyecto.pdf'       SOLUCION_INSTALACION.md
 DIAGRAMAS_HITO_1.md	    PROBLEMAS_COMUNES.md      src
 examples.py		    QUICK_START.md	      streamlit_app.py
 FIX_CHROMADB.md	    README.md		      test_improvements.py
 install_dependencies.sh    requirements.txt	      TESTING_LOCAL.md
 install_fix.sh		    RE

In [4]:
# Project base path in Colab
BASE_PATH = Path("/content/proyecto_aplicado_preservantes")

DATA_PDF_DIR = BASE_PATH / "data" / "pdfs"          # the preservative PDFs live here
CHROMA_DIR   = BASE_PATH / "chroma_preservantes"   # folder where the vector store is saved

BASE_PATH.mkdir(parents=True, exist_ok=True)
CHROMA_DIR.mkdir(parents=True, exist_ok=True)

print("Base path:", BASE_PATH)
print("PDF dir:", DATA_PDF_DIR)
print("Chroma dir:", CHROMA_DIR)


Base path: /content/proyecto_aplicado_preservantes
PDF dir: /content/proyecto_aplicado_preservantes/data/pdfs
Chroma dir: /content/proyecto_aplicado_preservantes/chroma_preservantes


In [5]:
## Document loading (PDF ingestion)
def load_pdfs(pdf_dir: Path):
    """
    Load every PDF in a folder using LangChain.
    Returns a list of Documents (PyPDFLoader yields one Document per page).
    """
    loader = DirectoryLoader(
        str(pdf_dir),
        glob="*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True
    )
    docs = loader.load()
    return docs

raw_docs = load_pdfs(DATA_PDF_DIR)
len(raw_docs), raw_docs[0]


100%|██████████| 19/19 [00:22<00:00,  1.19s/it]


(475,
 Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'Elsevier', 'creationdate': '2025-11-03T03:51:39+00:00', 'crossmarkdomains[1]': 'elsevier.com', 'creationdate--text': '3rd November 2025', 'robots': 'noindex', 'elsevierwebpdfspecifications': '7.0.1', 'moddate': '2025-11-03T04:07:06+00:00', 'doi': '10.1016/j.fbio.2025.107864', 'title': 'Mechanisms, applications and challenges of natural antimicrobials in food system', 'keywords': 'Natural antimicrobials,Food preservation,Bioactive compounds,Food safety,Clean label', 'subject': 'Food Bioscience, 74 (2025) 107864. doi:10.1016/j.fbio.2025.107864', 'crossmarkdomains[2]': 'sciencedirect.com', 'author': 'Anand Kumar', 'source': '/content/proyecto_aplicado_preservantes/data/pdfs/1-s2.0-S2212429225020413-main.pdf', 'total_pages': 20, 'page': 0, 'page_label': '1'}, page_content='Mechanisms, applications and challenges of natural antimicrobials in \nfood system\nAnand Kumar\na , 1\n, Suprativ Das\nb , 1\n, Sada

In [6]:
## Preprocessing
def clean_metadata(docs):
    """
    Normalize the metadata: keep only the relevant fields
    ('source' and 'page') and drop everything else.
    """
    cleaned = []
    for d in docs:
        meta = d.metadata or {}
        source = meta.get("source", "")
        # Keep a minimal metadata dict
        new_meta = {
            "source": source,
            "page": meta.get("page", None)
        }
        d.metadata = new_meta
        cleaned.append(d)
    return cleaned

docs = clean_metadata(raw_docs)
len(docs), docs[0].metadata


(475,
 {'source': '/content/proyecto_aplicado_preservantes/data/pdfs/1-s2.0-S2212429225020413-main.pdf',
  'page': 0})

**Chunking baseline**

We use RecursiveCharacterTextSplitter with (sizes measured in characters):
- chunk_size = 800
- chunk_overlap = 200
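
To make the two parameters concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (the real splitter also tries to break on separators such as paragraphs and line breaks); the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 2000-character text, used only for illustration
sample = "".join(chr(65 + i % 26) for i in range(2000))
windows = sliding_window_chunks(sample)
print(len(windows), [len(w) for w in windows])
```

With these settings each window advances 600 characters, so the last 200 characters of one chunk are repeated at the start of the next.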



In [7]:
## Chunking
## We use RecursiveCharacterTextSplitter as the baseline.

CHUNK_SIZE = 800
CHUNK_OVERLAP = 200

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
)

chunks = text_splitter.split_documents(docs)
len(chunks), chunks[0]


(2790,
 Document(metadata={'source': '/content/proyecto_aplicado_preservantes/data/pdfs/1-s2.0-S2212429225020413-main.pdf', 'page': 0}, page_content='Mechanisms, applications and challenges of natural antimicrobials in \nfood system\nAnand Kumar\na , 1\n, Suprativ Das\nb , 1\n, Sadaqat Ali\na\n, Swapnil Ganesh Jaiswal\nc\n,  \nAhmad Rabbani\nd\n, Syed Mohammad Ehsanur Rahman\ne , f\n, Ramachandran Chelliah\nf\n,  \nDeog-Hwan Oh\nf\n, Shucheng Liu\na , g\n, Shuai Wei\na , g , *\na\nCollege of Food Science and Technology, Guangdong Ocean University, Guangdong Provincial Key Laboratory of Aquatic Products Processing and Safety, Guangdong \nProvince Engineering Laboratory for Marine Biological Products, Guangdong Provincial Engineering Technology Research Center of Seafood, Key Laboratory of Advanced \nProcessing of Aquatic Product of Guangdong Higher Education Institution, Zhanjiang, 524088, China\nb'))

In [8]:
## How many PDFs there are and which file each chunk comes from
from collections import Counter

print("N° de documentos originales:", len(raw_docs))
print("Fuentes (PDFs) originales:")
for src in sorted({d.metadata.get("source") for d in raw_docs}):
    print(" -", src)

print("\nN° de chunks:", len(chunks))
print("N° de chunks por PDF:")
conteo = Counter(d.metadata.get("source") for d in chunks)
for src, c in conteo.items():
    print(f"{src}: {c}")


N° de documentos originales: 475
Fuentes (PDFs) originales:
 - /content/proyecto_aplicado_preservantes/data/pdfs/1-s2.0-S2212429225020413-main.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/1-s2.0-S2405844023042287-main.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/AntimicrobialActivityofSpiceextracts.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/Effects of Acidification and Preservatives on Microbial Growth Puree.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/FTB-61-212.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/Food aditives1.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/IJFS2018-8410747.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/Molecules to food, preservatives.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/Preservantes.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/Prop Ca y s. potasio.pdf
 - /content/proyecto_aplicado_preservantes/data/pdfs/The Scientific World Journal - 2022 - Tes

We switch to hierarchical chunking because the chunks are not balanced across sources: one PDF ends up dominating the collection.
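
One way to quantify that imbalance is to compute each source's share of the chunks. The sketch below is only illustrative: `source_share` is a hypothetical helper and the file names are made up, not taken from the corpus.

```python
from collections import Counter


def source_share(sources):
    """Return (source, chunk_count, fraction_of_total) tuples, largest source first."""
    counts = Counter(sources)
    total = sum(counts.values())
    return [(src, n, n / total) for src, n in counts.most_common()]


# Made-up sources, used only for illustration
example_sources = ["a.pdf"] * 6 + ["b.pdf"] * 3 + ["c.pdf"]
for src, n, share in source_share(example_sources):
    print(f"{src}: {n} chunks ({share:.0%})")
```

In the notebook the input would be something like `[d.metadata["source"] for d in chunks]`.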

In [9]:
## Hierarchical chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter
from uuid import uuid4

# High-level splitter (large blocks)
high_level_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    length_function=len,
)

# Low-level splitter (chunks that go into the vector store)
low_level_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=150,
    length_function=len,
)

def hierarchical_chunk(docs):
    """
    1. Split into large blocks (level 1)
    2. Subdivide each large block into small chunks (level 2)
    3. Add hierarchy metadata (parent_id, level1_index)
    """
    level1_docs = high_level_splitter.split_documents(docs)

    final_chunks = []
    for idx, d in enumerate(level1_docs):
        parent_id = str(uuid4())  # unique id of the large block

        # subdivide this block
        sub_docs = low_level_splitter.split_documents([d])

        for s in sub_docs:
            meta = dict(s.metadata)
            meta["parent_id"] = parent_id
            meta["level1_index"] = idx
            s.metadata = meta
            final_chunks.append(s)

    return final_chunks

hier_chunks = hierarchical_chunk(docs)
len(hier_chunks), hier_chunks[0].metadata


(3519,
 {'source': '/content/proyecto_aplicado_preservantes/data/pdfs/1-s2.0-S2212429225020413-main.pdf',
  'page': 0,
  'parent_id': 'e068cb77-f992-4c99-bc5a-d90ced0f0177',
  'level1_index': 0})
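
The metadata above records a `parent_id`, but the level-1 blocks themselves are not kept anywhere, so a retrieved chunk cannot be expanded back to its larger context. The sketch below shows one possible way to use the id, assuming a plain dictionary as the parent store. It is not what the notebook does: `Chunk`, `split_fixed`, `chunk_with_parent_store` and `expand_to_parents` are hypothetical names, and the fixed-size splitter stands in for the LangChain splitters.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_fixed(text, size):
    """Simplified stand-in for a text splitter: fixed-size pieces, no overlap."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_with_parent_store(text, parent_size=2000, child_size=700):
    """Return the small chunks plus a dict that maps parent_id to the large block."""
    parent_store = {}
    children = []
    for idx, parent_text in enumerate(split_fixed(text, parent_size)):
        parent_id = str(uuid4())
        parent_store[parent_id] = parent_text
        for child_text in split_fixed(parent_text, child_size):
            children.append(Chunk(child_text, {"parent_id": parent_id, "level1_index": idx}))
    return children, parent_store


def expand_to_parents(hits, parent_store):
    """Map retrieved small chunks to their distinct parent blocks, keeping hit order."""
    seen, parents = set(), []
    for hit in hits:
        pid = hit.metadata["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents
```

The small chunks would still be what gets embedded and searched; the parent blocks would only be looked up afterwards to give the LLM more surrounding text.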

In [11]:
from pathlib import Path
from langchain_community.vectorstores import Chroma

BASE_PATH = Path("/content/proyecto_aplicado_preservantes")
CHROMA_HIER_DIR = BASE_PATH / "chroma_preservantes_hier"

import shutil
shutil.rmtree(CHROMA_HIER_DIR, ignore_errors=True)
CHROMA_HIER_DIR.mkdir(parents=True, exist_ok=True)

vector_store_hier = Chroma.from_documents(
    documents=hier_chunks,
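    embedding=embeddings,                  # same multilingual model (defined in cell In [10], shown further down)
    persist_directory=str(CHROMA_HIER_DIR)
)

# With Chroma 0.4+ the data is persisted automatically; persist() only matters for older versions
vector_store_hier.persist()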
print("Vector store HIER creado con", vector_store_hier._collection.count(), "documentos")


Vector store HIER creado con 3519 documentos




In [12]:
retriever_hier = vector_store_hier.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)


In [13]:
def inspeccionar_query_con(retriever, query: str, k: int = 5):
    docs = retriever.invoke(query)[:k]
    print("Query:", query, "\n")
    for i, d in enumerate(docs, 1):
        print(f"--- Documento {i} ---")
        print("Source:", d.metadata.get("source"), "| Page:", d.metadata.get("page"),
              "| level1_index:", d.metadata.get("level1_index"))
        print(d.page_content[:500], "...\n")

inspeccionar_query_con(retriever_hier, "¿What are the antimicrobial effects of sodium benzoate, sodium nitrite, and potassium sorbate?")


Query: ¿What are the antimicrobial effects of sodium benzoate, sodium nitrite, and potassium sorbate? 

--- Documento 1 ---
Source: /content/proyecto_aplicado_preservantes/data/pdfs/Molecules to food, preservatives.pdf | Page: 3 | level1_index: 481
teria, and fungi. It acts through membrane disruption and inhi-
bition of metabolic reactions, stress, and accumulation of toxic
anions inside the microbial cell (Brul and Coote1999). It may
be coupled to calcium, potassium, or sodium for different an-
timicrobial targets and effects. The main applications of sodium ...

--- Documento 2 ---
Source: /content/proyecto_aplicado_preservantes/data/pdfs/IJFS2018-8410747.pdf | Page: 7 | level1_index: 154
usedalone.Forinstance,antimicrobialactivityagainst E.coli
w a se nh a n c e dth r o u ghth ec o m b i n e du seo f0 . 1%po ta s s i u m
sorbateand0.1%sodiumbenzoateat8
∘Cwi thsurvi valtime
being reduced by 50 % compared with the one with 0.1 %
sodiumbenzoatealone[29].
T e m p e r a t u r ei sa l s 

**Embedding model**

In [10]:
EMBEDDING_MODEL_NAME = "sentence-transformers/distiluse-base-multilingual-cased-v2"

embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME
)

print("Modelo cargado:", EMBEDDING_MODEL_NAME)



Modelo cargado: sentence-transformers/distiluse-base-multilingual-cased-v2


**Vector Store**

In [14]:
CHROMA_DIR.mkdir(parents=True, exist_ok=True)
# When using the hierarchical store
CHROMA_HIER_DIR.mkdir(parents=True, exist_ok=True)


In [22]:
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=str(CHROMA_DIR)
)

vector_store.persist()
print("Vector store creado con", vector_store._collection.count(), "documentos")


Vector store creado con 2790 documentos


In [23]:
from langchain_community.vectorstores import Chroma

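# No persist_directory here, so this store lives in memory only.
# Note: the count printed below (7038) is exactly 2 x len(hier_chunks) (3519), which suggests
# the chunks were added twice to the same in-memory collection (for example, by re-running the cell).
vector_store_hier = Chroma.from_documents(
    documents=hier_chunks,    # hierarchical chunks
    embedding=embeddings      # same multilingual model
)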

print("Vector store jerárquico creado en memoria con:",
      vector_store_hier._collection.count(), "chunks")


Vector store jerárquico creado en memoria con: 7038 chunks
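
The count above is twice `len(hier_chunks)` (3519), which is consistent with the same chunks having been added to the collection twice. A minimal sketch of one way to guard against that is to derive a deterministic id from each chunk's content and deduplicate on it. The helper names `chunk_id` and `dedupe_by_id` are hypothetical, and the records are made up; passing such ids to the vector store (for example through the `ids` argument of `Chroma.from_documents`) so that re-runs overwrite instead of append is an assumption to verify, not something the notebook does.

```python
import hashlib


def chunk_id(source, page, text):
    """Deterministic id: the same source, page and text always give the same hash."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()


def dedupe_by_id(records):
    """Keep the first record seen for each deterministic id."""
    unique = {}
    for rec in records:
        unique.setdefault(chunk_id(rec["source"], rec["page"], rec["text"]), rec)
    return unique


# Made-up records, duplicated on purpose to mimic a cell that was run twice
records = [
    {"source": "a.pdf", "page": 0, "text": "first chunk"},
    {"source": "a.pdf", "page": 1, "text": "second chunk"},
]
deduped = dedupe_by_id(records * 2)
print(len(records * 2), "->", len(deduped))
```

Note that two chunks with identical text on the same page would collapse into one under this scheme, which may or may not be desirable.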


**Hierarchical retriever**

In [24]:
# Hierarchical retriever
retriever_hier = vector_store_hier.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)

query_ejemplo = "¿Qué es un preservante y qué función cumple en alimentos?"
resultados_hier = retriever_hier.invoke(query_ejemplo)

len(resultados_hier), resultados_hier[0]


(5,
 Document(metadata={'parent_id': '29801b3b-1bde-4187-ba7d-8c12d3502613', 'page': 60, 'source': '/content/proyecto_aplicado_preservantes/data/pdfs/Food aditives1.pdf', 'level1_index': 1002}, page_content='food additives.\n14. What is the importance of gels in processed foods? Give two examples\nof substances used to form gels in food systems.\n15. What is rum caviar and how is it made?\n16. Give two examples of emulsiﬁers and describe their use in food systems.\n17. What is a fat replacer? Give an example.\n18. Give an example of enzymes used as a food additives.\n19. Deﬁne the terms toxin and toxicant. Give an example of each.\n20. Give an example of a toxicant that results from the Maillard reaction.\n21. What is trypsin inhibitor and where is it found?\n22. Name 2 sources of trypsin inhibitor.\n23. What is the difference between direct and indirect food additives?'))

In [25]:
for i, d in enumerate(resultados_hier, 1):
    print(f"\n### Documento {i} ###")
    print("Source:", d.metadata.get("source"), "| Page:", d.metadata.get("page"))
    print(d.page_content[:300], "...\n")



### Documento 1 ###
Source: /content/proyecto_aplicado_preservantes/data/pdfs/Food aditives1.pdf | Page: 60
food additives.
14. What is the importance of gels in processed foods? Give two examples
of substances used to form gels in food systems.
15. What is rum caviar and how is it made?
16. Give two examples of emulsiﬁers and describe their use in food systems.
17. What is a fat replacer? Give an example ...


### Documento 2 ###
Source: /content/proyecto_aplicado_preservantes/data/pdfs/Food aditives1.pdf | Page: 16
and reduction of nutritional value. Additives included in processed foods
perform their antioxidant function either as free radical scavengers or che-
lators of pro-oxidant metal ions.
What are antioxidants? How do they work?
Antioxidants are compounds that inhibit or terminate free radical reaction ...


### Documento 3 ###
Source: /content/proyecto_aplicado_preservantes/data/pdfs/Prop Ca y s. potasio.pdf | Page: 16
vida útil del pan precocido u horneado almacenado en re

In [26]:
def evaluate_retriever(retriever, eval_queries, k=5, nombre="Evaluación"):
    print(f"\n=== Evaluando retriever: {nombre} ===\n")

    scores = []

    for item in eval_queries:
        query = item["query"]
        keywords = item["relevant_keywords"]

        # Retrieve documents
        docs = retriever.invoke(query)[:k]

        # Manual Precision@k: a document counts as a hit if it contains any keyword
        hits = 0
        for doc in docs:
            text = doc.page_content.lower()
            if any(kw.lower() in text for kw in keywords):
                hits += 1

        precision = hits / k
        scores.append(precision)

        print(f"Query: {query}")
        print(f"Precision@{k}: {precision:.2f}\n")

    print(f"Precision@{k} promedio: {sum(scores)/len(scores):.2f}")
    return scores


**Naive RAG**

In [27]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)


In [28]:
## Baseline: just show the retrieved texts
def show_retrieval(query: str, k: int = 5, retriever=retriever):
    # Recent LangChain versions invoke the retriever like this:
    docs = retriever.invoke(query)
    docs = docs[:k]

    print(f"Query: {query}\n")
    for i, d in enumerate(docs, start=1):
        print(f"--- Documento {i} ---")
        print("Source:", d.metadata.get("source"), "Page:", d.metadata.get("page"))
        print(d.page_content[:500], "...")
        print()

# Test
show_retrieval("Tipos de preservantes utilizados en bebidas", retriever=retriever_hier)


Query: Tipos de preservantes utilizados en bebidas

--- Documento 1 ---
Source: /content/proyecto_aplicado_preservantes/data/pdfs/j.ijfoodmicro.2013.06.025.pdf Page: 1
USA
6.55 8 533
6 Z. bailii Spoilage, bottled ice tea
USA
7.46 9.12 545
7 Z. bailii Spoilage, preserved
fruit punch USA
6.67 8.13 475
8 Z. bailii Spoilage, soft drink USA 6.68 8.5 467
9 Z. bailii Spoilage, carbonated
orange drink USA
8.04 8.13 468
10 Z. bailii Spoilage, soft drink USA 6.35 8.33 483
11 Z. bailii Spoilage, soft drink USA 7 9.13 466
12 Z. bailii Spoilage, carbonated
orange drink USA
8.09 9.75 468
13 Z. bailii Spoilage, soft drink USA 7.06 10.12 467
15 Z. bailii Spoilage, salad dress ...

--- Documento 2 ---
Source: /content/proyecto_aplicado_preservantes/data/pdfs/Prop Ca y s. potasio.pdf Page: 13
permitiendo la formación de la miga suave, pero no  de la corteza crujiente. El  
producto obtenido es almacenado para su distribución en supermercados  en ...

--- Documento 3 ---
Source: /content/proyecto_aplicad

**Integrating an LLM to generate answers**

In [29]:
!pip install -q langchain-openai langchain-community openai tiktoken



In [30]:
!pip install -q langchain langchain-openai langchain-community langchain-text-splitters
!pip install -q langchain-core
!pip install -q langchain-experimental
!pip install -q langchainhub
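# NOTE: the next install fails because pip finds no distribution named lc-retrieval; the line can be dropped
!pip install -q lc-retrieval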


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-adk 1.20.0 requires opentelemetry-api<=1.37.0,>=1.37.0, but you have opentelemetry-api 1.39.1 which is incompatible.
google-adk 1.20.0 requires opentelemetry-sdk<=1.37.0,>=1.37.0, but you have opentelemetry-sdk 1.39.1 which is incompatible.
ERROR: Could not find a version that satisfies the requirement lc-retrieval (from versions: none)
ERROR: No matching distribution found for lc-retrieval

In [31]:
import os
import getpass

from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough


In [34]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Ingresa tu OPENAI_API_KEY: ")


Ingresa tu OPENAI_API_KEY: ··········


In [35]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)


In [36]:
def ask_rag(query: str, k: int = 5, retriever=retriever_hier):
    # 1. Retrieve the relevant documents
    docs = retriever.invoke(query)
    docs = docs[:k]

    # 2. Build the context from the chunks
    context = "\n\n---\n\n".join(d.page_content for d in docs)

    # 3. Assemble the prompt for the LLM
    prompt = f"""
Eres un asistente experto en preservantes de alimentos.
Responde usando EXCLUSIVAMENTE la información del contexto.

Contexto:
{context}

Pregunta: {query}

Respuesta en español, clara y concisa:
"""

    # 4. Call the model
    response = llm.invoke(prompt)

    # 5. Show the result and the sources
    print("Pregunta:", query)
    print("\n Respuesta:\n")
    print(response.content)

    print("\n Fuentes:")
    for d in docs:
        print("-", d.metadata.get("source"), "| page", d.metadata.get("page"))


In [37]:
ask_rag("Tipos de preservantes utilizados en bebidas", retriever=retriever_hier)


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

**Benchmark (Precision@k)**

In [38]:
retriever_hier = vector_store_hier.as_retriever(
    search_type="mmr",          # diversified search (maximal marginal relevance)
    search_kwargs={
        "k": 5,                 # final number of documents returned
        "fetch_k": 20           # number of candidates fetched before re-ranking
    }
)
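
To illustrate what `search_type="mmr"` is doing with `k` and `fetch_k`, here is a minimal pure-Python sketch of maximal marginal relevance: from the fetched candidates it greedily picks the one that is most similar to the query while being least similar to what has already been picked. This is a simplified illustration, not LangChain's implementation; `cosine` and `mmr_select` are hypothetical helpers, the toy 2-D vectors are made up, and `lambda_mult=0.5` is assumed as the relevance/diversity trade-off.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, cand_vecs, k=5, lambda_mult=0.5):
    """Greedy MMR: return the indices of up to k candidates, in selection order."""
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, cand_vecs[i])
            # Similarity to the closest already-selected candidate (0 if none yet)
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy example: candidates 0 and 1 are identical, candidate 2 is less similar but different
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.1], [1.0, -0.5]]
print(mmr_select(query, candidates, k=2))                   # MMR skips the duplicate
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # pure relevance keeps it
```

With duplicated or heavily overlapping chunks in the store, this diversity term is what keeps the top results from being near-copies of each other.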


In [39]:
eval_queries = [
{
  "query": "¿Qué es un preservante antimicrobiano?",
  "relevant_keywords": [
    "preservante antimicrobiano",
    "conservante antimicrobiano",
    "inhibición microbiana",
    "inhibe el crecimiento microbiano",
    "sustancia antimicrobiana",
    "agente antimicrobiano",
    "inhibición de microorganismos",

    "antimicrobial preservative",
    "antimicrobial agent",
    "microbial growth inhibition",
    "inhibits microbial growth"
  ]
},
{
  "query": "¿Cuáles son los factores que afectan la efectividad de los preservantes?",
  "relevant_keywords": [
    "efectividad de los preservantes",
    "factores que afectan la efectividad",
    "actividad de agua",
    "aw",
    "concentración del conservante",
    "concentración inhibitoria",
    "pKa del conservante",
    "interacción con composición del alimento",

    "preservative effectiveness",
    "factors influencing preservative efficacy",
    "water activity",
    "aw value",
    "preservative concentration",
    "food composition interaction",
    "minimum inhibitory concentration"
  ]
},
{
  "query": "¿Qué se entiende por vida útil de un alimento?",
  "relevant_keywords": [
    "vida útil del alimento",
    "vida útil",
    "deterioro microbiano",
    "estabilidad del alimento",
    "seguridad alimentaria",
    "calidad durante el almacenamiento",

    "shelf life",
    "food shelf life",
    "food spoilage",
    "microbial spoilage",
    "quality stability",
    "storage stability"
  ]
}
]


In [40]:
scores_hier = evaluate_retriever(
    retriever_hier,
    eval_queries,
    k=5,
    nombre="Jerárquico (MMR + chunking estructural)"
)



=== Evaluando retriever: Jerárquico (MMR + chunking estructural) ===

Query: ¿Qué es un preservante antimicrobiano?
Precision@5: 0.20

Query: ¿Cuáles son los factores que afectan la efectividad de los preservantes?
Precision@5: 0.40

Query: ¿Qué se entiende por vida útil de un alimento?
Precision@5: 0.40

Precision@5 promedio: 0.33


In [41]:
from typing import List
import numpy as np

def precision_at_k(query: str, retrieved_docs: List, keywords: List[str], k: int = 5):
    """
    Compute Precision@k by checking whether the retrieved documents contain relevant keywords.
    """
    hits = 0
    for doc in retrieved_docs[:k]:
        text = doc.page_content.lower()
        # If any keyword appears in the text => HIT
        if any(keyword.lower() in text for keyword in keywords):
            hits += 1

    return hits / k  # Precision@k


def evaluate_retriever_precision(retriever, eval_queries, k: int = 5, nombre: str = "Modelo"):
    """
    Apply Precision@k to a set of queries and print the results.
    """
    print(f"\n=== Evaluando retriever: {nombre} ===\n")

    scores = []
    for item in eval_queries:
        query = item["query"]
        keywords = item["relevant_keywords"]

        # Retrieve documents
        retrieved = retriever.invoke(query)

        # Compute Precision@k
        score = precision_at_k(query, retrieved, keywords, k)
        scores.append(score)

        print(f"Query: {query}")
        print(f"Precision@{k}: {score:.2f}\n")

    print(f"Precision@{k} promedio: {np.mean(scores):.2f}")
    return scores


In [42]:
scores_hier = evaluate_retriever_precision(
    retriever_hier,
    eval_queries,
    k=5,
    nombre="Jerárquico (MMR)"
)



=== Evaluando retriever: Jerárquico (MMR) ===

Query: ¿Qué es un preservante antimicrobiano?
Precision@5: 0.20

Query: ¿Cuáles son los factores que afectan la efectividad de los preservantes?
Precision@5: 0.40

Query: ¿Qué se entiende por vida útil de un alimento?
Precision@5: 0.40

Precision@5 promedio: 0.33


**CONCLUSIONS**

Heterogeneous corpus (English/Spanish): The documents express the relevant concepts in different languages, which hurts measured retrieval when the evaluation depends on keywords that exist only in Spanish or as exact translations.

Keyword-matching evaluation: Precision@k, as computed here, penalizes documents that are conceptually relevant but do not literally contain the defined keywords.
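
Literal matching can also err in the other direction: the evaluation uses substring tests, and one of the keywords is the short token "aw", which is also a substring of unrelated words. The sketch below contrasts the substring test with a whole-word test. `substring_hit` mirrors the check used in the notebook; `whole_word_hit` is a hypothetical alternative and the sample sentences are made up.

```python
import re


def substring_hit(text, keywords):
    """Same test as in the notebook: any keyword appears anywhere in the text."""
    text = text.lower()
    return any(kw.lower() in text for kw in keywords)


def whole_word_hit(text, keywords):
    """Stricter variant: the keyword must not be glued to other word characters."""
    text = text.lower()
    for kw in keywords:
        pattern = r"(?<!\w)" + re.escape(kw.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False


# Made-up sentences, used only for illustration
unrelated = "Raw milk was stored at 8 C"
related = "The aw value of the puree was 0.95"
print(substring_hit(unrelated, ["aw"]), whole_word_hit(unrelated, ["aw"]))
print(substring_hit(related, ["aw"]), whole_word_hit(related, ["aw"]))
```

Multi-word keywords such as "vida útil" still match under the whole-word test, since the boundaries are only checked at both ends of the phrase.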

Hard conceptual questions: "What is…?" queries need explicit definitions, which may not appear as such in the corpus or may be phrased in technical vocabulary, lowering effective retrieval.

Corpus size and quality: Although the corpus is valuable, several sources are not pedagogically structured and contain tables, formulas, or long paragraphs, which makes optimal segmentation difficult.

**POSSIBLE IMPROVEMENTS FOR THE NEXT MILESTONE**

Improve the embeddings: adopt a more robust multilingual model suited to scientific text.

Optimize the chunking process: use hybrid chunking (structure + semantics + size).

Include explicit metadata (subheadings, figures, sections) to improve hierarchical context.

Improve the evaluation: expand the keywords with synonyms and technical variants.
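
Besides adding synonyms, a cheap complementary step is to normalize both the keywords and the chunk text before matching. The retrieved chunks shown earlier contain PDF ligatures (for example "emulsiﬁers" and "Deﬁne"), and Spanish keywords carry accents that extracted text may lack. The sketch below is one possible normalization, assuming accent-insensitive matching is acceptable for this evaluation; `normalize` is a hypothetical helper.

```python
import unicodedata


def normalize(text):
    """Lowercase, expand ligatures, strip accents and collapse whitespace."""
    # NFKD splits accented letters into base letter + combining mark
    # and expands compatibility characters such as the "ﬁ" ligature.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


print(normalize("Vida  Útil"))
print(normalize("emulsiﬁers"))
print(normalize("inhibición microbiana") in normalize("La INHIBICION  microbiana depende del pH"))
```

One trade-off to keep in mind: this also maps "ñ" to "n", so it should be applied to both sides of the comparison and only for matching, not for display.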



In [49]:
# MILESTONE 1: frozen results


BASE_K = 5

baseline_scores = evaluate_retriever_precision(
    retriever,
    eval_queries,
    k=BASE_K,
    nombre="Hito 1 - Baseline Naive"
)

baseline_precision = float(np.mean(baseline_scores))

print("\n BASELINE CONGELADO")
print(f"Precision@{BASE_K} = {baseline_precision:.4f}")



=== Evaluando retriever: Hito 1 - Baseline Naive ===

Query: ¿Qué es un preservante antimicrobiano?
Precision@5: 0.20

Query: ¿Cuáles son los factores que afectan la efectividad de los preservantes?
Precision@5: 0.20

Query: ¿Qué se entiende por vida útil de un alimento?
Precision@5: 0.40

Precision@5 promedio: 0.27

 BASELINE CONGELADO
Precision@5 = 0.2667


In [50]:
import pandas as pd
import os

os.makedirs("results", exist_ok=True)

pd.DataFrame([{
    "modelo": "Hito 1 - Baseline Naive",
    "Precision@5": baseline_precision
}]).to_csv("results/hito1_baseline.csv", index=False)

print(" Baseline guardado en results/hito1_baseline.csv")


 Baseline guardado en results/hito1_baseline.csv
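
For reference, the per-query scores printed in this notebook can be put side by side with a few lines of Python. The numbers below are copied from the outputs above (naive baseline and hierarchical MMR retriever, three queries, k = 5); the variable and function names are hypothetical.

```python
from statistics import mean

# Per-query Precision@5 as printed above, in the order of eval_queries
recorded_baseline = [0.20, 0.20, 0.40]
recorded_hier = [0.20, 0.40, 0.40]


def per_query_delta(before, after):
    """Difference in Precision@k for each query (after minus before)."""
    return [round(b - a, 2) for a, b in zip(before, after)]


print("baseline mean:", round(mean(recorded_baseline), 4))
print("hierarchical mean:", round(mean(recorded_hier), 4))
print("per-query delta:", per_query_delta(recorded_baseline, recorded_hier))
```

With k = 5, a difference of 0.20 on a query corresponds to exactly one additional chunk that matched a keyword, so with only three queries the gap between the two means rests on a single hit.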
