# **Two-Stage-Retrieval LLM RAG**


## **Table of Contents**


* [1. Introduction & Key Features](#1-introduction--key-features)
* [2. Project Structure](#2-project-structure)
* [3. Workflow Summary](#3-workflow-summary)
* [4. Demonstration](#4-demonstration)
    * [4.1 Data Acquisition and Preparation](#41-data-acquisition-and-preparation)
    * [4.2 Methodology and Pipeline Design](#42-methodology-and-pipeline-design)
    * [4.3 Execution and Results](#43-execution-and-results)
* [5. Data Analysis and Inference](#5-data-analysis-and-inference)
    * [5.1. Defining Metrics and Data Collection](#51-defining-metrics-and-data-collection)
    * [5.2. The Evaluation Challenge: Defining "Likeness"](#52-the-evaluation-challenge-defining-likeness)
    * [5.3. Pipeline Performance and Inference](#53-pipeline-performance-and-inference)
    * [5.4. Conclusion: Optimal Pipeline and Conditions](#54-conclusion-optimal-pipeline-and-conditions)

# 1. Introduction & Key Features


This project implements a question-answering (Q&A) system over a technical research paper from Meta SuperIntelligence Labs. The goal is to compare how different retrieval-augmented generation (RAG) setups help a large language model (LLM) answer multiple-choice questions.

The following retrieval pipelines are compared:

- **LLM Baseline** – Direct generation without retrieval.

- **BM25** – Classic sparse keyword search (TF-IDF based).

- **Dense Retrieval** – Embedding-based semantic retrieval.

- **(Bonus) Hybrid Retrieval** – Combination of BM25 and Dense Retrieval.

- **(Bonus) Hybrid Retrieval + Cross-Encoder** – A SOTA approach where a high-recall retriever fetches candidates (Hybrid, k=20) and a Cross-Encoder reranks them for high precision (k=5).
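
The last pipeline can be pictured as a funnel. The sketch below is only an illustration under stated assumptions, not the project's implementation: it assumes the hybrid stage merges the BM25 and dense rankings with reciprocal rank fusion (the fusion rule actually used is not described here), and `rerank_score` is a hypothetical stand-in for a cross-encoder. All function names are made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into a single ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def two_stage_retrieve(
    query: str,
    sparse_ranking: List[str],
    dense_ranking: List[str],
    rerank_score: Callable[[str, str], float],
    k_candidates: int = 20,
    k_final: int = 5,
) -> List[str]:
    # Stage 1 (recall): keep a wide candidate set from the fused ranking.
    candidates = reciprocal_rank_fusion([sparse_ranking, dense_ranking])[:k_candidates]
    # Stage 2 (precision): re-order the candidates with the more expensive scorer.
    return sorted(candidates, key=lambda d: -rerank_score(query, d))[:k_final]


def toy_rerank_score(query: str, doc_id: str) -> float:
    # Hypothetical scorer: pretends that only "d4" is truly relevant.
    return 1.0 if doc_id == "d4" else 0.0


sparse = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse, dense])
top = two_stage_retrieve("q", sparse, dense, toy_rerank_score, k_candidates=4, k_final=2)
print(fused)
print(top)
```

In this toy run the document ranked last by the first stage is promoted to the top by the second stage, which is the behaviour the funnel is meant to allow.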


To ensure robustness, reproducibility, and scientific rigor, this project integrates several software engineering patterns:

- **Singleton Pattern for Retrieval Engine:** The `RetrievalEngine` class uses a Singleton pattern with **Lazy Loading**. This prevents memory overhead by loading heavy models only when necessary and ensures consistent database connections across the pipeline.

- **Smart Ingestion Strategy:** We compare **Recursive Chunking** (optimized with high overlap to preserve scientific context) against **Semantic Chunking** (experimental), which uses embedding distances to segment text by topic.

- **Automated "Judge" Evaluation:** Instead of relying solely on the final answer accuracy, we implement a **Ground Truth Verification** system. It compares the retrieved chunks against the source reference in the dataset to detect lucky guesses.

- **Two-Stage Architecture:** A funnel approach that prioritizes *Recall* in the first stage (Hybrid Search) and *Precision* in the second stage (Cross-Encoder Re-ranking).
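
As a rough illustration of the Singleton and lazy-loading idea, the sketch below uses a hypothetical `LazyEngine` class. It is not the project's `RetrievalEngine`; only the `get_instance()` entry point mirrors the name used in this project, and the "model" is a placeholder dictionary standing in for a heavy resource.

```python
from typing import Optional


class LazyEngine:
    """Hypothetical stand-in for a retrieval engine (not the project's class)."""

    _instance: Optional["LazyEngine"] = None

    def __init__(self) -> None:
        self._model = None   # heavy resource, deliberately not loaded yet
        self.load_count = 0  # counts how many times the resource was loaded

    @classmethod
    def get_instance(cls) -> "LazyEngine":
        # Singleton: create the object once and hand out the same one afterwards.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    @property
    def model(self) -> dict:
        # Lazy loading: pay the loading cost only on first access.
        if self._model is None:
            self._model = self._load_model()
        return self._model

    def _load_model(self) -> dict:
        self.load_count += 1
        return {"name": "placeholder-model"}


a = LazyEngine.get_instance()
b = LazyEngine.get_instance()
print(a is b, a.load_count)
```

Nothing is loaded when the instance is created; the first access to `model` triggers the load, and later accesses reuse the cached object.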

# 2. Project Structure

The project is organized into multiple files, most of them under the `src/` folder, with a clear separation of responsibilities. Here's how it works:

1. **Input Data**

  All of the following are in the `data/` directory:
- **`data/questions.json`** - Contains 50 multiple-choice questions extracted from a technical research paper.
Each question includes the correct answer, three distractors, and an optional reference to the source paper.

- **`data/chroma_db`** - Embedded vector database (ChromaDB).
- **`data/enunciado.pdf`** - The problem statement for the project.
- **`data/paper_refrag.pdf`** - The technical paper from which questions and answers are extracted.


2. **Source Code**
- **`main.py`** - Entry point of the project.
Supports *Local* mode (`results/local_results/`) and *Persistent* mode (`results/persistent_results/<test_name>/`).  
    - Calls the pipeline, saves final results, and generates plots.
  

  All of the following are in the `src/` directory:
- **`launcher.py`** - Sets up the environment for experiments.
    - Initializes the vector database (ChromaDB).
    - Clears previous results if needed.
    - Creates all necessary directories (plots, final CSVs, etc.).

- **`ingestion.py`** - Prepares and creates the database. Invoked by the launcher.

- **`queries.py`** - Main logic to execute questions.
    - For each question and method it:
        - Sends the query to the LLM and retrieves documents.
        - Computes accuracy and evidence verification.
        - Stores partial results in `results/resultados_parciales.csv`.
        - Returns a DataFrame with all results for further evaluation.

- **`rag_pipeline.py`** - Implements the RAG logic for the different retrieval methods.
    - Contains functions to verify ground truth against retrieved documents.
    - Computes retrieval scores and status tags for each answer.

- **`retrieval.py`** - Implements the retrieval engine.
    - Provides a singleton engine to handle different retrieval methods efficiently.

- **`evaluation.py`** - Evaluates the results and generates dashboards.
    - Accuracy per method  
    - RAG quality distribution  
    - Response latency  
    - Retrieval fidelity

3. **Output Data**

    Found in the `results/` directory:

- **Partial results** - Always stored in `results/resultados_parciales.csv`.
    - Updated after each question is processed.

- **Final results** - Stored in `results/local_results/` or `results/persistent_results/<test_name>`.
    - File name: `resultados_finales.csv`.

- **Plots/Dashboard** - Stored in `plots/` inside the corresponding results folder.
    - Include:
        - Bar charts for accuracy and RAG quality
        - Boxplots for response latency
        - Violin plots for retrieval fidelity

> **Tip:** Run `main.py try_n` in the terminal to save the results of try number `n` in their own persistent results directory,<br> so that other results are not overwritten.
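
The choice between *Local* and *Persistent* mode could be resolved along these lines. This is a hedged sketch: it assumes `main.py` treats its first command-line argument as the test name, and `resolve_results_dir` is a hypothetical helper, not a function taken from the project.

```python
from pathlib import Path
from typing import List


def resolve_results_dir(argv: List[str], root: Path = Path("results")) -> Path:
    """Pick the output folder from the command-line arguments.

    No extra argument -> local mode; one argument -> persistent mode under that name.
    """
    if len(argv) > 1:
        return root / "persistent_results" / argv[1]
    return root / "local_results"


print(resolve_results_dir(["main.py"]))
print(resolve_results_dir(["main.py", "try_3"]))
```

In a real entry point the list passed in would be `sys.argv`.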

# 3. Workflow Summary

1. Load questions dataset (questions.json).

2. Initialize vector database (ChromaDB).

3. For each question:
    - Retrieve relevant documents (BM25 / Dense / Hybrid / Hybrid + Cross-Encoder).
    - Query LLM for an answer.
    - Verify against ground truth.
    - Save partial results.

4. After all questions are processed:
    - Concatenate all results.
    - Save final results CSV.
    - Generate plots/dashboard for comparison and analysis.
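
The loop in steps 3 and 4 can be sketched as follows. This is a simplified illustration, not the code in `queries.py`: `run_all` and `fake_answer` are hypothetical names, the column names are assumptions, the ground-truth verification step is left out, and a temporary file stands in for the partial results CSV.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def run_all(
    questions: List[Dict[str, str]],
    methods: List[str],
    answer_fn: Callable[[str, str], str],
    partial_csv: Path,
) -> List[Dict[str, object]]:
    """Answer every question with every method, saving each row as soon as it exists."""
    fieldnames = ["question_id", "method", "predicted", "correct"]
    rows: List[Dict[str, object]] = []
    with partial_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for q in questions:
            for method in methods:
                predicted = answer_fn(q["question"], method)
                row = {
                    "question_id": q["id"],
                    "method": method,
                    "predicted": predicted,
                    "correct": predicted == q["answer"],
                }
                writer.writerow(row)
                fh.flush()  # keep the partial file usable if the run is interrupted
                rows.append(row)
    return rows


def fake_answer(question: str, method: str) -> str:
    # Hypothetical stand-in for "retrieve documents, then ask the LLM".
    return "A" if method == "bm25" else "B"


questions = [{"id": "q1", "question": "Which option is right?", "answer": "A"}]
with tempfile.TemporaryDirectory() as tmp:
    partial_path = Path(tmp) / "partial_results.csv"
    results = run_all(questions, ["baseline", "bm25"], fake_answer, partial_path)
    saved_lines = partial_path.read_text(encoding="utf-8").splitlines()
print(results)
```

Writing and flushing one row per answered question means an interrupted run still leaves a readable partial file behind.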

```python
# --- IMPORTS AND SETUP ---
import sys
import os
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv

# Add the project root to the path so that 'src' can be imported
sys.path.append(os.path.abspath("."))

# Import the project's modules
from src.retrieval import RetrievalEngine
from src.rag_pipeline import query_rag
from src.evaluation import generate_dashboard
from src.queries import run_questions

# Load the API key
load_dotenv()
API_KEY = os.getenv("GOOGLE_API_KEY")

# Display settings for pandas
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

print("✅ Environment configured and libraries loaded.")
```

Output:

```text
✅ Environment configured and libraries loaded.
```


## Data Ingestion and Chunking Strategy

The initial stage of the RAG pipeline focuses on **Data Pre-processing**. This involves breaking down the source document (`paper_refrag.pdf`) into discrete units of knowledge (**chunks**) that are stored in **ChromaDB**.

### Chunking Methodologies

We implemented two primary chunking strategies, selectable via the `CHUNKING_METHOD` variable in `src/ingestion.py`:

1.  **Recursive Chunking (Recursive Character Splitter):** This is the baseline method, optimized for technical context. It splits the text using fixed-length rules (e.g., **1200 characters** with **350 characters of overlap**).

2.  **Semantic Chunking:** The advanced method. It utilizes an **embedding model** to compute the similarity between consecutive sentences, identifying natural **topic shifts** to define chunk boundaries.
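
To make the size and overlap numbers concrete, here is a deliberately simplified sketch. It is a plain sliding window over characters, not a recursive character splitter (which also tries to break on separators such as paragraphs and sentences), and `sliding_window_chunks` is a hypothetical helper rather than code from `src/ingestion.py`.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1200, overlap: int = 350) -> List[str]:
    """Cut text into fixed-length windows where each window repeats the tail of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(3000))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

With these settings each window advances by 850 characters, so the last 350 characters of one chunk are repeated at the start of the next.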

### Why Semantic Chunking?

We chose to focus on **Semantic Chunking** because it addresses a primary weakness of RAG in technical domains: **false negatives** in retrieval.

In dense scientific papers, naive splitting often cuts conclusions, cross-references, or complex sentences in half. This leads to:

* **Loss of Context:** The embedding model creates an inaccurate vector because the sentence is fragmented.

* **Lucky guesses:** The system answers correctly, but the strict evaluation judge fails to find the complete reference phrase because it was broken up.

By using **Semantic Chunking**, we aim to give each unit of information passed to the retriever high **semantic coherence**, which is expected to improve the **Retrieval Recall** and **Fidelity** metrics in our evaluation dashboard.
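
The boundary rule behind semantic chunking can be sketched with a toy example. The assumptions are significant: `toy_embed` is a bag-of-words counter standing in for a real embedding model, the threshold of 0.2 is arbitrary, and none of these names come from the project.

```python
import math
import re
from collections import Counter
from typing import List


def toy_embed(sentence: str) -> Counter:
    # Stand-in for an embedding model: counts lowercase word tokens.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str], threshold: float = 0.2) -> List[str]:
    """Start a new chunk whenever consecutive sentences are less similar than the threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            groups.append([cur])    # topic shift: open a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


sentences = [
    "The retriever ranks chunks by similarity.",
    "The retriever returns the top chunks.",
    "Latency was measured in seconds.",
]
result = semantic_chunks(sentences)
print(result)
```

The first two sentences share vocabulary and stay together, while the third starts a new chunk. A real embedding model would detect topic shifts that do not depend on shared words.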



---

Once the text is chunked, the system proceeds to **initialize the Retriever** in a **Lazy Loading** state, ensuring the database connection is established efficiently right before the first query is made.

```python
# --- DATA MANAGEMENT AND VERIFICATION ---
from src.ingestion import db_setup
from src.retrieval import RetrievalEngine

try:
    # 1. MAKE SURE THE DB EXISTS (manual ingestion)
    db_setup(rebuild_db=True, chunking_method="semantic")  # Options: "recursive", "semantic"

    # 2. CONNECT TO THE ENGINE (Singleton)
    print("\n⚙️ Connecting to the retrieval engine...")
    engine = RetrievalEngine.get_instance()

    # 3. RETRIEVER TEST
    # This ensures that ChromaDB and the embeddings are loaded into RAM
    test_retriever = engine.get_retriever("hybrid", k=1)  # "hybrid" makes sure both dense and BM25 are initialized

    # Fetch real statistics from the DB
    if engine.db:
        doc_count = engine.db._collection.count()
        print("✅ SYSTEM OPERATIONAL")
        print("   - Status: Database connected")
        print("   - Path: ./data/chroma_db")
        print(f"   - Indexed chunks: {doc_count}")
        print("   - Embedding model: all-MiniLM-L6-v2")

except Exception as e:
    print(f"\n❌ CRITICAL INITIALIZATION ERROR: {e}")
    print("   - Check that the database was built correctly.")
```

Output:

```text
📂 Verifying data integrity...

🔌 Disconnecting the search engine...
📄 Loading PDF...
   -> PDF loaded: 30 pages.
✂️ Processing chunks
   -> Generated 81 chunks.
🧠 Saving vectors to disk...
```


# 5. Data Analysis and Inference


The **project** generates four plots:

1. **Accuracy** of the **selected answer**
2. RAG quality
3. Latency
4. Retrieval fidelity
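
As an illustration of how two of these plots could be fed, the sketch below aggregates per-question records into accuracy and median latency per method. The record fields (`method`, `correct`, `latency_s`) and the `summarize` helper are assumptions made for the example, and the numbers are invented; they are not the columns or values of `resultados_finales.csv`.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def summarize(records: List[dict]) -> Dict[str, Dict[str, float]]:
    """Group records by retrieval method and compute accuracy and median latency."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        grouped[record["method"]].append(record)
    summary: Dict[str, Dict[str, float]] = {}
    for method, rows in grouped.items():
        summary[method] = {
            "accuracy": sum(1 for r in rows if r["correct"]) / len(rows),
            "median_latency_s": median([r["latency_s"] for r in rows]),
        }
    return summary


# Invented records, used only to exercise the function.
records = [
    {"method": "bm25", "correct": True, "latency_s": 1.0},
    {"method": "bm25", "correct": False, "latency_s": 3.0},
    {"method": "dense", "correct": True, "latency_s": 2.0},
]
summary = summarize(records)
print(summary)
```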

## 5.1. Defining Metrics and Data Collection
*TODO: discuss how each data point is collected and, where applicable, how it is scored.*

## 5.2. The Evaluation Challenge: Defining "Likeness"
Before discussing each **pipeline's performance** on the metrics described, we must explain the **methodology** behind the RAG quality and retrieval fidelity judge. Both metrics depend on how we evaluate likeness and on the threshold for a "lucky answer." The following comparison illustrates the idea:

- The plot in `results/persistent_results/27-11-1/plots/2_rag_quality_pct.png` was obtained with a judge based on text matching:


<center>
    <img src="results/persistent_results/27-11-1/plots/2_rag_quality_pct.png" alt="Text Matching Judge (Conservative)" width="45%"/>
</center>

- Whereas the one in `results/persistent_results/28-11-1/plots/2_rag_quality_pct.png` uses an **LLM judge** that searches for a semantic match:


<center>
    <img src="results/persistent_results/28-11-1/plots/2_rag_quality_pct.png" alt="LLM Judge (Semantic)" width="45%"/>
</center>


With the LLM judge, all the pipelines appear to improve. However, when **manually supervised**, neither judge performs **flawlessly** at this task. The text-matching judge **incorrectly** marks as "lucky answers" a group of answers that are, in fact, "correct answers," and the LLM judge does the opposite.

These two kinds of error are the **false negatives** and **false positives** familiar from statistics and other scientific fields, and the general recommendation for this kind of problem is to be **conservative**.

For this reason, we will mention the other methodologies we have developed but only show and explore the ones with a risk-averse approach.
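
A conservative text-matching judge of this kind could look roughly like the sketch below. It is an assumption-heavy illustration: the token-overlap score, the 0.8 threshold, and the status tags are all hypothetical and are not taken from `rag_pipeline.py`.

```python
import re
from typing import List


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def evidence_score(reference: str, chunks: List[str]) -> float:
    """Best fraction of reference tokens that appear in any single retrieved chunk."""
    ref = _tokens(reference)
    if not ref:
        return 0.0
    best = 0.0
    for chunk in chunks:
        chunk_tokens = set(_tokens(chunk))
        found = sum(1 for t in ref if t in chunk_tokens)
        best = max(best, found / len(ref))
    return best


def judge(is_correct: bool, reference: str, chunks: List[str], threshold: float = 0.8) -> str:
    # Conservative rule: evidence counts only if most reference tokens are literally present.
    supported = evidence_score(reference, chunks) >= threshold
    if is_correct:
        return "correct_with_evidence" if supported else "lucky_guess"
    return "wrong_despite_evidence" if supported else "wrong"


reference = "the encoder compresses each chunk"
literal = ["The encoder compresses each chunk into one embedding."]
paraphrase = ["Every passage is compressed by an encoder."]
print(judge(True, reference, literal))
print(judge(True, reference, paraphrase))
```

The second call shows the failure mode discussed above: the paraphrased chunk does support the answer, yet a literal matcher tags it as a lucky guess.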


## 5.3. Pipeline Performance and Inference

*TODO: explain how well each pipeline performs on each metric, and state which one is best and under what conditions.*

## 5.4. Conclusion: Optimal Pipeline and Conditions

## Authors
- Jorge Barbero Morán – UCM, Faculty of Mathematics
- David Marcos Jimeno – UCM, Faculty of Mathematics


## License
This project is licensed under the **MIT License**