# Evaluation of a Local RAG System for Research Paper Question Answering

This notebook evaluates a **local Retrieval-Augmented Generation (RAG)** system
implemented using FAISS, Ollama, and custom modular components.

The evaluation uses the *same retrieval and generation pipeline* as `app/main.py`
to ensure faithful, end-to-end assessment.

In [None]:
import json
import numpy as np
from typing import List, Dict

import sys
from pathlib import Path

_cwd = Path.cwd()
PROJECT_ROOT = _cwd.parent if _cwd.name == "notebooks" else _cwd
sys.path.append(str(PROJECT_ROOT))

from app.vector_store import VectorStore
from app.embeddings import embed_texts
from app.generator import generate_answer, rewrite_query
from app.config import VECTOR_DIR, TOP_K
from app.utils import rerank_chunks

## System Configuration

- **Embedding model:** nomic-embed-text (local)
- **LLM:** LLaMA 3.2 via Ollama
- **Vector store:** FAISS
- **Chunk size:** 500 tokens
- **Chunk overlap:** 50 tokens
- **Retrieval strategy:** Top-k similarity search (k = 5)
- **Execution:** Fully local, offline
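
The chunking code itself is not part of this notebook. As a rough illustration of what a 500-token window with a 50-token overlap means, the sketch below splits a token list with a sliding window. The function name `chunk_tokens` and the use of a plain list of strings as "tokens" are assumptions made for the example, not the project's actual implementation.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 500, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each new window starts 450 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Illustrative input: 1200 placeholder tokens (hypothetical data).
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

With these settings, consecutive chunks repeat 50 tokens, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.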

In [None]:
# PROJECT_ROOT was resolved in the setup cell above.
VECTOR_DIR_RESOLVED = PROJECT_ROOT / "data" / "vector_store"

print("Loading vector store...")
vs = VectorStore.load(VECTOR_DIR_RESOLVED)
print(f"Loaded {len(vs.texts)} chunks.")

Loading vector store...
Loaded 260 chunks.


## Evaluation Questions

The questions below are designed to:
- Require **document-level understanding**
- Test **retrieval correctness**
- Penalize hallucination
- Reflect real research assistant use cases

In [None]:
PAPERS_DIR = PROJECT_ROOT / "data" / "papers"
PDF_NAMES = sorted(p.name for p in PAPERS_DIR.glob("*.pdf"))

evaluation_questions = [
    {
        "question": "Why does the Chinchilla paper argue that increasing model size alone is insufficient?",
        "expected_sources": ["Chinchilla-paper.pdf"]
    },
    {
        "question": "How does ReAct combine reasoning and action more effectively than chain-of-thought?",
        "expected_sources": ["ReAct-paper.pdf"]
    },
    {
        "question": "What problem does Toolformer address in large language models?",
        "expected_sources": ["Toolformer-paper.pdf"]
    },
    {
        "question": "Why does RAG outperform parametric-only models on knowledge-intensive NLP tasks?",
        "expected_sources": ["NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks-Paper.pdf"]
    },
    {
        "question": "How does BERT’s architecture differ fundamentally from autoregressive models like GPT?",
        "expected_sources": ["BERT-Paper-2.pdf"]
    }
]

## Query Rewriting Inspection

Before retrieval, queries are deterministically expanded to improve recall.
This cell verifies the rewritten queries.
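
The implementation of `rewrite_query` lives in `app/generator.py` and is not shown here. Judging only from the output this cell prints, the expansion looks like a fixed prefix placed in front of the question. The sketch below reproduces that behavior under that assumption; `expand_query` and `EXPANSION_PREFIX` are hypothetical names.

```python
# Assumption: the prefix is inferred from the notebook's printed output,
# not read from the project's source code.
EXPANSION_PREFIX = "definition explanation background of"


def expand_query(question: str) -> str:
    """Prepend fixed expansion terms; the same input always yields the same output."""
    return f"{EXPANSION_PREFIX} {question.strip()}"


q = "What problem does Toolformer address in large language models?"
print(expand_query(q))
```

Because no model call is involved, a rewrite of this kind is cheap and repeatable, at the cost of adding the same terms to every query regardless of its content.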

In [7]:
for item in evaluation_questions:
    q = item["question"]
    print(f"Original:  {q}")
    print(f"Rewritten: {rewrite_query(q)}\n")

Original:  Why does the Chinchilla paper argue that increasing model size alone is insufficient?
Rewritten: definition explanation background of Why does the Chinchilla paper argue that increasing model size alone is insufficient?

Original:  How does ReAct combine reasoning and action more effectively than chain-of-thought?
Rewritten: definition explanation background of How does ReAct combine reasoning and action more effectively than chain-of-thought?

Original:  What problem does Toolformer address in large language models?
Rewritten: definition explanation background of What problem does Toolformer address in large language models?

Original:  Why does RAG outperform parametric-only models on knowledge-intensive NLP tasks?
Rewritten: definition explanation background of Why does RAG outperform parametric-only models on knowledge-intensive NLP tasks?

Original:  How does BERT’s architecture differ fundamentally from autoregressive models like GPT?
Rewritten: definition explanation 

## Retrieval Evaluation

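We evaluate whether at least one chunk from the expected source paper
appears among the retrieved chunks (Hit@k). Note that the evaluation loop
below applies this check after reranking, so it covers the 3 chunks kept
from the top-k candidates rather than all k of them.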

In [8]:
def hit_at_k(retrieved_sources: List[str], expected_sources: List[str]) -> int:
    return int(any(src in retrieved_sources for src in expected_sources))
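
To make the metric concrete, the following sketch repeats the function and applies it to a few made-up retrieval results. The retrieved lists are illustrative only and are not taken from the actual run.

```python
from typing import List


def hit_at_k(retrieved_sources: List[str], expected_sources: List[str]) -> int:
    # 1 if any expected file name appears among the retrieved sources, else 0.
    return int(any(src in retrieved_sources for src in expected_sources))


# Hypothetical retrieval result: duplicates are fine, only membership matters.
retrieved = ["BERT-Paper-2.pdf", "Chinchilla-paper.pdf", "BERT-Paper-2.pdf"]

print(hit_at_k(retrieved, ["Chinchilla-paper.pdf"]))  # expected source present
print(hit_at_k(retrieved, ["ReAct-paper.pdf"]))       # expected source absent
print(hit_at_k(retrieved, ["chinchilla-paper.pdf"]))  # matching is case-sensitive
```

The comparison is an exact string match on file names, so the names in `expected_sources` have to agree character for character with the sources stored in the vector store.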

## LLM-as-Judge (Local)

LLaMA 3.2 (local) is used as the evaluation judge to score:
- Faithfulness to retrieved context
- Relevance to the question
- Completeness of the answer

In [9]:
import requests
from app.config import OLLAMA_URL

def judge_answer(question: str, context: str, answer: str) -> Dict:
    prompt = f"""
You are a strict evaluator of a RAG system.

Rules:
- Judge ONLY based on the provided context.
- Penalize hallucinations.
- If the answer correctly says "I don't know", reward faithfulness.

Question:
{question}

Context:
{context}

Answer:
{answer}

Return JSON only with integer scores from 1 to 5:
{{
  "faithfulness": int,
  "relevance": int,
  "completeness": int
}}
"""

    response = requests.post(
        OLLAMA_URL,
        json={
            "model": "llama3.2",
            "prompt": prompt,
            "stream": False,
        },
        timeout=600,
    )

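    # Ollama returns the generated text under the "response" key.
    data = response.json().get("response", "")
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        # The judge did not return valid JSON; record missing scores so the
        # aggregation step can skip this judgment instead of crashing.
        return {
            "faithfulness": None,
            "relevance": None,
            "completeness": None
        }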
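
Small local models often wrap the requested JSON in extra prose, in which case `json.loads` on the whole reply fails and the judgment is lost. One possible mitigation, sketched below, is to extract the first `{...}` span and validate the three scores. `parse_judgment` and `METRICS` are hypothetical helpers and are not part of the project code. The sketch assumes the flat, non-nested schema requested in the prompt.

```python
import json
import re
from typing import Dict, Optional

METRICS = ("faithfulness", "relevance", "completeness")


def parse_judgment(raw: str) -> Dict[str, Optional[int]]:
    """Pull the first {...} span out of a model reply and keep only valid 1-5 integer scores."""
    empty = {m: None for m in METRICS}
    # Non-greedy match: first "{" up to the first "}" (sufficient for a flat object).
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    result: Dict[str, Optional[int]] = {}
    for m in METRICS:
        value = parsed.get(m)
        valid = isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 5
        result[m] = value if valid else None
    return result


# Made-up reply with surrounding chatter and one out-of-range score.
reply = 'Sure! Here are the scores: {"faithfulness": 5, "relevance": 3, "completeness": 9} Hope this helps.'
print(parse_judgment(reply))
print(parse_judgment("I cannot evaluate this."))
```

Out-of-range or missing scores come back as `None`, which matches the fallback convention `judge_answer` already uses.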

## End-to-End RAG Evaluation

This loop follows the production pipeline:
1. Rewrite query
2. Embed rewritten query
3. Retrieve top-k chunks
4. Rerank the retrieved chunks and keep the top 3
5. Generate grounded answer
6. Evaluate retrieval and generation quality

In [10]:
results = []

for item in evaluation_questions:
    question = item["question"]
    expected_sources = item["expected_sources"]

    rewritten = rewrite_query(question)

    # Embed rewritten query
    query_vec = embed_texts([rewritten])[0].reshape(1, -1)

    # Retrieve the top-k candidate chunks as (source, chunk) pairs
    retrieved_raw = vs.search(query_vec, top_k=TOP_K)
    chunks = [chunk for _, chunk in retrieved_raw]

    # Re-embed the candidates and keep the top 3 after reranking
    chunk_vecs = embed_texts(chunks)
    retrieved_with_vecs = [
        (src, chunk, vec) for (src, chunk), vec in zip(retrieved_raw, chunk_vecs)
    ]
    reranked = rerank_chunks(query_vec, retrieved_with_vecs, top_n=3)
    retrieved = [(src, chunk) for src, chunk, _ in reranked]
    retrieved_sources = [src for src, _ in retrieved]

    # Generate answer (production code)
    answer = generate_answer(question, retrieved)

    # Judge answer on a truncated context to keep the judge prompt short and fast
    MAX_JUDGE_CHARS = 2000
    context = "\n".join(chunk for _, chunk in retrieved)
    context = context[:MAX_JUDGE_CHARS]

    judgment = judge_answer(question, context, answer)

    results.append({
        "question": question,
        "hit_at_k": hit_at_k(retrieved_sources, expected_sources),
        "judgment": judgment
    })

    print(f"Evaluated: {question}")

Evaluated: Why does the Chinchilla paper argue that increasing model size alone is insufficient?
Evaluated: How does ReAct combine reasoning and action more effectively than chain-of-thought?
Evaluated: What problem does Toolformer address in large language models?
Evaluated: Why does RAG outperform parametric-only models on knowledge-intensive NLP tasks?
Evaluated: How does BERT’s architecture differ fundamentally from autoregressive models like GPT?
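
The loop calls `rerank_chunks(query_vec, retrieved_with_vecs, top_n=3)` from `app/utils.py`, whose body is not shown in this notebook. A common way to implement such a step is to sort the candidates by cosine similarity to the query, which is what the sketch below does. This is an assumption about the technique, not a copy of the project's function. `rerank_by_cosine` and `cosine` are hypothetical names, and vectors are taken to be flat sequences of floats (a 2-D NumPy query like the one in the loop would need flattening first).

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[str, str, Sequence[float]]  # (source, chunk text, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank_by_cosine(query_vec: Sequence[float], candidates: List[Candidate], top_n: int = 3) -> List[Candidate]:
    """Return the top_n candidates ordered by descending similarity to the query."""
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[2]), reverse=True)
    return ranked[:top_n]


# Toy 2-D embeddings (hypothetical data) to show the ordering.
query = [1.0, 0.0]
candidates = [
    ("a.pdf", "orthogonal chunk", [0.0, 1.0]),
    ("b.pdf", "aligned chunk", [1.0, 0.0]),
    ("c.pdf", "partly aligned chunk", [1.0, 1.0]),
    ("d.pdf", "opposite chunk", [-1.0, 0.0]),
]
top = rerank_by_cosine(query, candidates, top_n=3)
print([src for src, _, _ in top])
```

Keeping only 3 of the retrieved chunks shortens the generation prompt, but it also means a relevant chunk that ranks fourth or fifth never reaches the generator or the Hit@k check.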


## Aggregate Metrics

In [11]:
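def average(metric: str):
    # Skip judgments the judge failed to return as valid JSON (None scores).
    vals = [
        r["judgment"][metric]
        for r in results
        if r["judgment"][metric] is not None
    ]
    # Avoid ZeroDivisionError when every judgment failed to parse.
    return sum(vals) / len(vals) if vals else float("nan")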

hit_rate = sum(r["hit_at_k"] for r in results) / len(results)

print(f"Hit@{TOP_K}: {hit_rate:.2f}")
print("Avg Faithfulness:", average("faithfulness"))
print("Avg Relevance:", average("relevance"))
print("Avg Completeness:", average("completeness"))

Hit@5: 0.80
Avg Faithfulness: 4.4
Avg Relevance: 2.8
Avg Completeness: 2.8
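
Because unparsable judgments are dropped silently, an average may rest on fewer than five scores. A small variation, sketched below with a hypothetical `summarize` helper and made-up demo data, reports the number of valid scores next to each mean so that gaps are visible.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def summarize(results: List[Dict], metric: str) -> Tuple[Optional[float], int]:
    """Return (mean score, number of valid scores) for one metric."""
    vals = [
        r["judgment"].get(metric)
        for r in results
        if r["judgment"].get(metric) is not None
    ]
    return (mean(vals) if vals else None, len(vals))


# Illustrative results only: the third judgment failed to parse.
demo = [
    {"judgment": {"faithfulness": 5}},
    {"judgment": {"faithfulness": 4}},
    {"judgment": {"faithfulness": None}},
]
print(summarize(demo, "faithfulness"))
print(summarize(demo, "relevance"))
```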


## Hallucination Stress Test

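This query cannot be answered from the indexed papers.
The correct behavior is an explicit refusal.
Unlike the evaluation loop, this cell passes the top-k chunks straight to
the generator without the reranking step.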

In [12]:
hallucination_query = "What optimizer was used to train GPT-5?"

query_vec = embed_texts([rewrite_query(hallucination_query)])[0].reshape(1, -1)
retrieved = vs.search(query_vec, top_k=TOP_K)

answer = generate_answer(hallucination_query, retrieved)

print("Answer:")
print(answer)

Answer:
I don't know. The context only talks about the differences between Chinchilla and Gopher models, as well as BERT model, but it does not mention GPT-5 or the optimizer used to train it.
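
The refusal above was checked by reading the output. If more unanswerable queries are added, a simple automated check could flag refusals instead. The sketch below uses a keyword heuristic. `is_refusal` and `REFUSAL_PATTERNS` are hypothetical, and the pattern list is an assumption that would need tuning to the generator's actual phrasing.

```python
import re

# Assumption: phrases that signal an explicit refusal; extend as needed.
REFUSAL_PATTERNS = (
    r"\bI don'?t know\b",
    r"\bdo(es)? not mention\b",
)


def is_refusal(answer: str) -> bool:
    """Return True if the answer contains one of the refusal phrases."""
    text = answer.replace("\u2019", "'")  # normalize curly apostrophes
    return any(re.search(p, text, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS)


# First string: the answer printed by the cell above.
# Second string: a made-up hallucinated answer for contrast.
refused = "I don't know. The context only talks about the differences between Chinchilla and Gopher models, as well as BERT model, but it does not mention GPT-5 or the optimizer used to train it."
hallucinated = "GPT-5 was trained with the AdamW optimizer."

print(is_refusal(refused))
print(is_refusal(hallucinated))
```

A heuristic like this only detects the presence of a refusal phrase. An answer that says "I don't know" and then goes on to guess would still pass, so it complements manual inspection rather than replacing it.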


## Limitations of Using LLaMA 3.2

While the RAG pipeline is fully functional and modular, the **quality of the generated answers appears to be limited mainly by the LLaMA 3.2 model itself**, rather than by the implementation or retrieval pipeline.

In the evaluation above (five questions, scored by LLaMA 3.2 acting as judge), the average scores were:

- **Faithfulness:** 4.4 / 5
- **Relevance:** 2.8 / 5
- **Completeness:** 2.8 / 5

### Key Observations

- LLaMA 3.2 sometimes produces answers that **miss certain aspects of the question** or include minor inaccuracies in context interpretation.  
- Even with high-quality retrieved chunks and reranking, the model may not fully combine all retrieved information into a comprehensive answer.  
- Improving these scores would require either:
  - Using a more powerful LLM, or  
  - Fine-tuning LLaMA 3.2 on domain-specific data.  

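**Conclusion:** The results point to the LLM’s inherent capabilities as the bottleneck rather than the RAG pipeline, which retrieved the expected source for 80% of the questions and produced answers the judge rated as faithful to the retrieved context (4.4 / 5).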