#### Problem: arXiv RAG Research Assistant

Goal:
Build a research assistant that answers questions using real arXiv papers, with:
- metadata-aware retrieval
- citations
- LCEL pipeline
- zero hallucination

#### Tasks

1) Load the PDFs with metadata
- paper_title
- year
- arxiv_category
- source

2) Chunking
- chunk_size = 1000
- chunk_overlap = 200

3) Embedding + Vector DB
- Use FAISS or ChromaDB
- Store embeddings with metadata

4) Metadata-filtered retrieval
- Example query like: 'What improvements were proposed after 2021 for RAG?', where filter would be year>=2022

5) LCEL RAG Pipeline
Pipelne must:
Question:

  -  -> Retriever 
  -  -> Context Formatter ([Paper: Self-RAG | Year: 2023 | Page: 4] \n <chunk_text>)
  -  -> Prompt
  -  -> LLM
  -  -> JSON Output

6) Output
{
  "answer": "...",
  "sources": [
    {
      "paper": "Self-RAG",
      "year": 2023,
      "page": 4
    }
  ]
}


#### Task 1) Load the PDFs

In [1]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Map Filename
meta_map = {
    'rag_2020.pdf': {'paper_title': 'RAG', 'year': 2020, 'arxiv_category': 'cs.CL'},
    'dpr_2020.pdf': {'paper_title': 'DPR', 'year': 2020, 'arxiv_category': 'cs.CL'},
    'cot_2022.pdf': {'paper_title': 'CoT', 'year': 2022, 'arxiv_category': 'cs.CL'},
    'hyde_2022.pdf': {'paper_title': 'HyDE', 'year': 2022, 'arxiv_category': 'cs.IR'},
    'self_rag_2023.pdf': {'paper_title': 'Self-RAG', 'year': 2023, 'arxiv_category': 'cs.CL'},
}

In [3]:
loader = DirectoryLoader(
    'data/arxiv_papers', 
    '**/*.pdf',
    loader_cls= PyPDFLoader
)

In [4]:
import os

In [5]:
docs= loader.load()

for d in docs:
    fname = os.path.basename(d.metadata.get('source', ''))
    d.metadata['sourcefile'] = fname

    meta = meta_map.get(fname, {})
    d.metadata['paper_title'] = meta.get('paper_tile', fname.replace('.pdf', ''))
    d.metadata["year"] = meta.get("year", None)
    d.metadata["arxiv_category"] = meta.get("arxiv_category", None)


In [6]:
print('Loaded docs:', len(docs))
print('\n','Sample metadata:', docs[0].metadata)

Loaded docs: 116

 Sample metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-10-19T00:28:18+00:00', 'author': '', 'keywords': '', 'moddate': '2023-10-19T00:28:18+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data/arxiv_papers/self_rag_2023.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1', 'sourcefile': 'self_rag_2023.pdf', 'paper_title': 'self_rag_2023', 'year': 2023, 'arxiv_category': 'cs.CL'}


#### Task 2) Creating Chunks

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [8]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)

splits = splitter.split_documents(docs)

In [9]:
print('Chunks:', len(splits))
print('\n', 'Chunk sample metadata:', splits[0].metadata)
print('\n', 'Chunk Preview:', splits[0].page_content[:500])

Chunks: 535

 Chunk sample metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-10-19T00:28:18+00:00', 'author': '', 'keywords': '', 'moddate': '2023-10-19T00:28:18+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data/arxiv_papers/self_rag_2023.pdf', 'total_pages': 30, 'page': 0, 'page_label': '1', 'sourcefile': 'self_rag_2023.pdf', 'paper_title': 'self_rag_2023', 'year': 2023, 'arxiv_category': 'cs.CL'}

 Chunk Preview: Preprint.
SELF -RAG: L EARNING TO RETRIEVE , G ENERATE , AND
CRITIQUE THROUGH SELF -R EFLECTION
Akari Asai†, Zeqiu Wu†, Yizhong Wang†§, Avirup Sil‡, Hannaneh Hajishirzi†§
†University of Washington §Allen Institute for AI ‡IBM Research AI
{akari,zeqiuwu,yizhongw,hannaneh}@cs.washington.edu, avi@us.ibm.com
ABSTRACT
Despite their remarkable capabilities, large language models (LLMs) often produce
respo

#### Task 3) Creating embeddings + Store the embeddings in Chroma DB

In [10]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

In [11]:
embeddings = HuggingFaceEmbeddings(
    model_name = 'sentence-transformers/all-MiniLM-L6-v2'
)

persist_dir = 'chroma_arxiv_db'

vector_db = Chroma.from_documents(
    documents= splits,
    embedding= embeddings,
    persist_directory= persist_dir,
    collection_name= "arxiv_rag_practice"
)

print('\n','Chroma Vector DB ready.')

  embeddings = HuggingFaceEmbeddings(
Loading weights: 100%|██████████| 103/103 [00:00<00:00, 2981.38it/s, Materializing param=pooler.dense.weight]                             
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



 Chroma Vector DB ready.


#### Task 4) Metadata-filtered retriever (year>=2022)

In [12]:
retriever = vector_db.as_retriever(
    search_kwargs = {
        'k': 4,
        'filter': {'year': {'$gte': 2022}} # Chroma-style filter
    }
)

In [13]:
query = 'What improvements were proposed after 2021 for RAG?'

top_docs = retriever.invoke(query)

for i, d in enumerate(top_docs, 1):
    print('='*80)
    print('Rank:', i)
    print('Paper:', d.metadata.get('paper_title'), 'Year:', d.metadata.get('year'),
          'Page:', d.metadata.get('page'))
    print(d.page_content[:300])

Rank: 1
Paper: self_rag_2023 Year: 2023 Page: 0
indicate the need for retrieval and its generation quality respectively (Figure 1 right). In particular,
given an input prompt and preceding generations, SELF -RAG first determines if augmenting the
continued generation with retrieved passages would be helpful. If so, it outputs a retrieval token th
Rank: 2
Paper: self_rag_2023 Year: 2023 Page: 0
indicate the need for retrieval and its generation quality respectively (Figure 1 right). In particular,
given an input prompt and preceding generations, SELF -RAG first determines if augmenting the
continued generation with retrieved passages would be helpful. If so, it outputs a retrieval token th
Rank: 3
Paper: hyde_2022 Year: 2022 Page: 6
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, Parker Schuh, Kensen Shi,
Sasha Tsvyashchenko, Joshua Maynez, Abhishek
Rao, Parker Barnes, Yi Tay,

In [14]:
query = 'What improved retrieval-augmented generation after 2021?'

top_docs = retriever.invoke(query)

print('\n', query)
for i, d in enumerate(top_docs, 1):
    print('='*80)
    print('Rank:', i)
    print('Paper:', d.metadata.get('paper_title'), 'Year:', d.metadata.get('year'),
          'Page:', d.metadata.get('page'))
    print(d.page_content[:300])


 What improved retrieval-augmented generation after 2021?
Rank: 1
Paper: self_rag_2023 Year: 2023 Page: 11
Transactions on Machine Learning Research , 2022a. URL https://openreview.net/
forum?id=jKN1pXi7b0.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane
Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with
retrieval augmente
Rank: 2
Paper: self_rag_2023 Year: 2023 Page: 11
Transactions on Machine Learning Research , 2022a. URL https://openreview.net/
forum?id=jKN1pXi7b0.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane
Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with
retrieval augmente
Rank: 3
Paper: self_rag_2023 Year: 2023 Page: 1
retrieval-augmented ChatGPT on four tasks, Llama2-chat (Touvron et al., 2023) and Alpaca (Dubois
et al., 2023) on all tasks. Our analysis demonstrates the effectiveness of training and inferenc

#### Task 5) Building LCEL RAG Pipeline

A) Choosing an LLM 

In [15]:
from dotenv import load_dotenv

In [16]:
os.environ['LANGCHAIN_TRACKING_V2'] = 'true'
os.environ['LANGCHAIN_PROJECT'] = os.getenv('LANGCHAIN_PROJECT')
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')


load_dotenv()

True

In [17]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-4o-mini', temperature=0)


B) Building LCEL Pipeline

In [18]:
import json
from typing import List, Dict, Any
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough, RunnableParallel

def format_context(docs):
    # Context formatting with citations
    lines = []
    for d in docs:
        paper = d.metadata.get('paper_title', 'Unknown')
        year = d.metadata.get('year', 'Unknown')
        page = d.metadata.get('page', 'Unknown')
        lines.append(f"[Paper : {paper} | Year: {year} | Page: {page} \n{d.page_content}")

    return '\n\n'.join(lines)

def build_sources(docs) -> List[Dict[str, Any]]:
    # Deduplication citations
    seen = set()
    sources = []
    for d in docs:
        paper = d.metadata.get('paper_title', 'Unknown')
        year = d.metadata.get('year', None)
        page = d.metadata.get('page', None)
        key = (paper, page)
        if key in seen:
            continue
        seen.add(key)
        sources.append({'paper':paper, 'year':year, 'page':page})
    return sources

prompt = ChatPromptTemplate([
    ('system', 
     'You are a research assistant. Use ONLY provided contextto answer.\n'
     'If the answer is not in the context, say: \Insufficientcontext\.\n'
     'Return only valid JSON with keys: answer, sources.\n'
     'No extra text.'),
     ('human', 
      'Question:\n{question}\n\n'
      'Context:\n{context}\n\n'
      'Return JSON only.')

])

  'If the answer is not in the context, say: \Insufficientcontext\.\n'


In [19]:
## Step 1) fetch relevant docs + keep answers

fetch_docs = RunnableParallel(
    question = RunnablePassthrough(), 
    docs = retriever
)

## Step 2) format context
add_context = RunnableLambda(lambda x: {
    'question': x['question'],
    'docs': x['docs'],
    'context': format_context(x['docs'])
})

## Step 3) Call LLM
call_llm = RunnableLambda(lambda x: {
    "raw_llm": (prompt | llm).invoke({"question": x["question"], "context": x["context"]}),
    "docs": x["docs"]
})

# Step 4: enforce strict JSON + attach sources from docs (grounded)
def finalize(x):
    # LLM output (string-like)
    content = x["raw_llm"].content if hasattr(x["raw_llm"], "content") else str(x["raw_llm"])
    # Parse JSON safely
    try:
        parsed = json.loads(content)
    except Exception:
        # Hard fallback: still return strict structure
        parsed = {"answer": "Insufficient context", "sources": []}

    # If model answered insufficient context, keep sources empty
    answer = parsed.get("answer", "")
    if isinstance(answer, str) and answer.strip().lower() == "insufficient context":
        return {"answer": "Insufficient context", "sources": []}

    # Always ground sources from retrieved docs (not hallucinated)
    return {
        "answer": parsed.get("answer", ""),
        "sources": build_sources(x["docs"])
    }

In [20]:
rag_chain = fetch_docs | add_context | call_llm | RunnableLambda(finalize)

In [25]:
## Testing 1

result1 = rag_chain.invoke("What is HyDE and why does it help retrieval?")
print(json.dumps(result1, indent=2))

{
  "answer": "HyDE is a retrieval model that shows better performance than other models like ANCE and DPR, even though those models are fine-tuned on specific datasets. It provides sizable improvements in retrieval tasks, particularly in low-resource scenarios, and demonstrates strong performance compared to fine-tuned models. HyDE's architecture allows it to operate without the need for relevance labels, making it practical for early stages of search systems. As the search log grows, it can be gradually replaced by a supervised dense retriever.",
  "sources": [
    {
      "paper": "hyde_2022",
      "year": 2022,
      "page": 4
    },
    {
      "paper": "hyde_2022",
      "year": 2022,
      "page": 5
    },
    {
      "paper": "hyde_2022",
      "year": 2022,
      "page": 6
    }
  ]
}


In [29]:
result1['answer']

"HyDE is a retrieval model that shows better performance than other models like ANCE and DPR, even though those models are fine-tuned on specific datasets. It provides sizable improvements in retrieval tasks, particularly in low-resource scenarios, and demonstrates strong performance compared to fine-tuned models. HyDE's architecture allows it to operate without the need for relevance labels, making it practical for early stages of search systems. As the search log grows, it can be gradually replaced by a supervised dense retriever."

In [22]:
result = rag_chain.invoke("What improvements were proposed after 2021 for retrieval-augmented generation?")
print(json.dumps(result, indent=2))

{
  "answer": "Improvements proposed after 2021 for retrieval-augmented generation include the effectiveness of training and inference with reflection tokens for overall performance improvements and test-time model customizations, such as balancing the trade-off between citation precision and completeness.",
  "sources": [
    {
      "paper": "self_rag_2023",
      "year": 2023,
      "page": 11
    },
    {
      "paper": "self_rag_2023",
      "year": 2023,
      "page": 1
    }
  ]
}


In [23]:
result = rag_chain.invoke("What is LangChain's revenue in 2025?")
print(json.dumps(result, indent=2))


{
  "answer": "Insufficient context",
  "sources": []
}
