# Week 4: Retrieval-Augmented Generation (RAG) with arXiv Papers
This week marks a major shift in your AI agent's capabilities: you’ll build the foundation for a Retrieval-Augmented Generation (RAG) system tailored to scientific research. Rather than relying on an LLM’s memory alone, RAG architectures allow your agent to search a structured knowledge base and generate grounded, document-aware answers.

Your task is to create a RAG pipeline using recent arXiv cs.CL papers, converting them into searchable chunks, embedding them, and indexing them with FAISS. You’ll then implement a simple query interface that takes a user question, retrieves the top relevant chunks, and displays them for further processing.

It is also the start of your agent’s private research knowledge base: a semantic index that you’ll evolve into a full-featured hybrid database in Week 5.


## 📚 Learning Objectives

* Understand the components of a Retriever-Reader QA pipeline.
* Explore document chunking strategies (e.g., sections vs. sliding windows) and their impact on retrieval performance.
* Index scientific text using vector embeddings and FAISS.
* Build and query a semantic index via a FastAPI endpoint that returns relevant passages.


## Project Design

The project will guide you through building a RAG pipeline on arXiv cs.CL papers:

1. **Data Collection:** Obtain 50 arXiv cs.CL PDFs (you can scrape via the arXiv API or use a provided sample set).
2. **Text Extraction:** Extract raw text from each PDF (for example, using PyMuPDF's `get_text()` on each page). Clean and concatenate the page text into full-document strings.
3. **Text Chunking:** Split each paper into chunks (≤ 512 tokens each). You might split at section boundaries or use a sliding-window approach (e.g., 500-token windows with overlap). Chunking into smaller, meaningful segments (around 250–512 tokens) often yields better retrieval precision.
4. **Embedding Generation:** Compute dense vector embeddings for each chunk. For instance, using the `sentence-transformers` library:

In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(list_of_chunks)  # embeds each text chunk into a 384-d vector


   (Alternatively, you can use a Hugging Face Transformer model and apply pooling manually to get chunk embeddings.)
5. **Indexing with FAISS:** Build a FAISS index of the chunk embeddings. For example, use a simple index like `IndexFlatL2` with the same dimensionality as your embeddings. Add all chunk vectors to the index (e.g., `index.add(np.array(embeddings))`).
6. **Notebook Demo:** Create a notebook where a user query is embedded and passed to the index (`index.search(query_embedding, k)`) to retrieve the top-3 matching chunks. Display the original chunk text for these results.
7. **FastAPI Service:** Build a simple FastAPI app. Define an endpoint (e.g., `@app.get("/search")`) that accepts a query parameter `q`. In the handler, embed `q`, perform the FAISS search, and return the top passages as JSON.
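
For step 1, one way to list recent cs.CL papers is the public arXiv API, which returns an Atom feed. The sketch below only builds the query URL and turns a feed into PDF links. The helper names (`build_query_url`, `parse_pdf_links`) are illustrative, and rewriting `/abs/` to `/pdf/` is an assumption about arXiv's URL layout rather than something the API guarantees.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import List

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(category: str = "cs.CL", max_results: int = 50) -> str:
    """Build an arXiv API URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def parse_pdf_links(atom_xml: str) -> List[str]:
    """Return one PDF URL per <entry> in an Atom feed string."""
    root = ET.fromstring(atom_xml)
    links = []
    for entry in root.findall("atom:entry", ATOM_NS):
        abs_url = entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip()
        if abs_url:
            # Assumption: the PDF lives at the same path with /abs/ replaced by /pdf/.
            links.append(abs_url.replace("/abs/", "/pdf/"))
    return links


# Fetching is left commented out so the sketch runs offline:
# import urllib.request
# with urllib.request.urlopen(build_query_url()) as resp:
#     pdf_urls = parse_pdf_links(resp.read().decode("utf-8"))
```

If you download in bulk, pause between requests and cache the PDFs locally so the rest of the pipeline can be rerun without hitting the network.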



## Starter Code Snippets

Below are skeleton code templates. Fill in the details (indicated by comments or ellipses).

**Data Extraction (PDF → Text):**

In [None]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Open a PDF and extract all text as a single string.
    """
    pages = []
    # The context manager closes the file handle even if extraction fails
    with fitz.open(pdf_path) as doc:
        for page in doc:
            page_text = page.get_text()  # get raw text from page
            # (Optional) clean page_text here (remove headers/footers)
            pages.append(page_text)
    full_text = "\n".join(pages)
    return full_text




In [3]:
pdf_paths = ["data/2508.15711v1.pdf", "data/2508.15721v1.pdf", "data/2508.15746v1.pdf", "data/2508.15754v1.pdf", "data/2508.15760v1.pdf"]
corpus_texts = [extract_text_from_pdf(p) for p in pdf_paths]
full_corpus = "\n\n".join(corpus_texts)
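
The optional cleaning step mentioned in the extraction function can start small. The sketch below is one possible helper, with the hypothetical name `clean_page_text`, that rejoins words hyphenated across line breaks and collapses whitespace. It does not attempt header or footer removal, which usually needs per-corpus heuristics.

```python
import re


def clean_page_text(page_text: str) -> str:
    """Light cleanup for text extracted from a PDF page."""
    # Rejoin words split across lines, e.g. "retrie-\nval" -> "retrieval".
    # Caveat: a genuine hyphen at a line break (e.g. "state-of-\nthe-art") is lost too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", page_text)
    # Treat remaining single newlines as spaces but keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

It could be called on `page_text` inside `extract_text_from_pdf` before the page is appended.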

**Chunking Logic (Sliding Window):**

In [5]:
from typing import List

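def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of whitespace-delimited tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()  # whitespace words, only an approximation of model tokens
    chunks = []
    step = max_tokens - overlap  # how far the window advances each iteration
    for i in range(0, len(tokens), step):
        chunk = tokens[i:i + max_tokens]
        chunks.append(" ".join(chunk))
        if i + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks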


In [6]:
list_of_chunks = []
for doc in corpus_texts:
    list_of_chunks.extend(chunk_text(doc))
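
A sliding window ignores document structure. The other strategy named in the learning objectives, splitting at section boundaries, can be approximated with a heading heuristic. The sketch below assumes numbered headings on their own line (for example "1 Introduction" or "3.2 Experimental Setup"). Both the pattern and the name `split_by_sections` are illustrative, and extracted PDF text will often need a different rule.

```python
import re
from typing import List

# Assumption: a heading is a line of the form "<number> <Capitalized title>".
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?[ \t]+[A-Z][^\n]{0,80}$", re.MULTILINE)


def split_by_sections(text: str) -> List[str]:
    """Split text at lines that look like numbered section headings."""
    starts = [m.start() for m in HEADING_RE.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    # Keep any front matter (title, abstract) that precedes the first heading.
    bounds = ([0] if starts[0] > 0 else []) + starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]
```

Sections longer than the size limit can still be passed through `chunk_text`, so the two strategies combine rather than compete.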

**Embedding Generation (Sentence-Transformers):**

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(list_of_chunks)  # embeds each text chunk into a 384-d vector



**FAISS Indexing and Search:**


In [12]:

import faiss
import numpy as np

# Assume embeddings is a 2D numpy array of shape (num_chunks, dim)
embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32 input
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)  # exact (brute-force) L2 index
index.add(embeddings)  # add all chunk vectors

# Example: search for a query embedding
query = "What is retrieval augmented generation?"
query_embedding = model.encode([query]) # get embedding for the query (shape: [1, dim])
k = 3
distances, indices = index.search(query_embedding, k)
# indices[0] holds the top-k chunk indices

for i in indices[0]:
    print(list_of_chunks[i])

Heng Sia, Chai Rick Soh, Joshua Yi Min Tung, Jasmine Chiat Ling Ong, Chang Fu Kuo, Shao-Chun Wu, Vesela P. Kovacheva, and Daniel Shu Wei Ting. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. NPJ Digital Medicine, 8, 2025. 1 [7] Lameck Mbangula Amugongo, Pietro Mascheroni, Steven Brooks, Stefan Doering, and Jan Seidel. Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health, 4(6):e0000877, 2025. 1 [8] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, and Yixin Wang. MMedAgent: Learning to use medical tools with multi-modal agent. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, Miami, Florida, USA, November 2024. Association for Computational Linguistics. 1 [9] Shanghua Gao, Richard Zhu, Zhenglun Kon
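
For intuition about what the flat index does: `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The NumPy sketch below reproduces that computation on toy vectors. `brute_force_l2_search` is an illustrative name, not part of FAISS, and is meant for understanding and sanity checks rather than as a replacement.

```python
import numpy as np


def brute_force_l2_search(vectors: np.ndarray, queries: np.ndarray, k: int):
    """Return (squared distances, indices) of the k nearest vectors per query."""
    # Pairwise differences: shape (num_queries, num_vectors, dim).
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Sort each row by distance and keep the k closest.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order


vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")
distances, indices = brute_force_l2_search(vectors, queries, k=2)
```

Smaller distances mean closer matches, so results come back in ascending order of distance.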



**FastAPI Route Skeleton:**


In [None]:

from fastapi import FastAPI
import numpy as np

app = FastAPI()

@app.get("/search")
async def search(q: str):
    """
    Receive a query 'q', embed it, retrieve top-3 passages, and return them.
    """
    # TODO: Embed the query 'q' using your embedding model
    query_vector = ...  # e.g., model.encode([q])[0]
    # Perform FAISS search (assuming 'faiss_index' is the index built earlier)
    k = 3
    distances, indices = faiss_index.search(np.array([query_vector]), k)
    # Retrieve the corresponding chunks (assuming 'chunks' list and 'indices' shape [1, k])
    results = []
    for idx in indices[0]:
        results.append(chunks[idx])
    return {"query": q, "results": results}
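
When filling in the handler, it can help to return scores and chunk ids alongside the text. The helper below is a sketch with a hypothetical name (`format_results`) and a response shape chosen for illustration. It takes the `distances` and `indices` arrays returned by the index search (plain nested lists work too) and skips `-1` entries, which FAISS uses to pad results when fewer than `k` neighbors exist.

```python
from typing import Any, Dict, List, Sequence


def format_results(
    query: str,
    distances: Sequence[Sequence[float]],
    indices: Sequence[Sequence[int]],
    chunks: List[str],
) -> Dict[str, Any]:
    """Build a JSON-serializable response from one query's search output."""
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        idx = int(idx)  # NumPy integers are not JSON-serializable
        if idx < 0:
            continue  # FAISS pads missing neighbors with -1
        results.append(
            {"chunk_id": idx, "distance": float(dist), "text": chunks[idx]}
        )
    return {"query": query, "results": results}
```

Returning the chunk id makes it easier to trace a passage back to its source paper later.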


In [2]:
def search(query: str, top_k: int = 3):
    q_emb = model.encode([query])
    distances, indices = index.search(np.array(q_emb), top_k)
    results = [list_of_chunks[i] for i in indices[0]]
    return results

print(search("What is retrieval-augmented generation?"))



## Deliverables

* **Code Notebook / Script:** Complete code for the RAG pipeline (PDF extraction, chunking, embedding, indexing, retrieval).
* **Data & Index:** The FAISS index file and the processed chunks from all 50 papers (e.g., as JSON or pickled objects).
* **Retrieval Report:** A brief report showing at least 5 example queries and the top-3 retrieved passages for each, to demonstrate system performance.
* **FastAPI Service:** The FastAPI app code (e.g. `main.py`) and instructions on how to run it. The `/search` endpoint should be demonstrable (e.g. returning top-3 passages in JSON for sample queries).
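
For the data deliverable, chunk text can be stored as JSON next to the index file. The sketch below uses a record layout assumed for illustration (`paper_id`, `chunk_id`, `text`) and hypothetical helper names. The FAISS index itself can be written separately with `faiss.write_index` and reloaded with `faiss.read_index`.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_chunks(chunks_by_paper: Dict[str, List[str]], path: str) -> int:
    """Write one record per chunk and return the number of records."""
    records = [
        {"paper_id": paper_id, "chunk_id": i, "text": text}
        for paper_id, chunks in chunks_by_paper.items()
        for i, text in enumerate(chunks)
    ]
    Path(path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)


def load_chunks(path: str) -> List[dict]:
    """Read the records back in the order they were written."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

The record order should match the order in which vectors were added to the index, so that a position returned by a search maps back to the right record.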

## Student Exploration Tips

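* Experiment with different chunk sizes and overlaps. Smaller chunks (∼250 tokens) often give more precise retrieval, while larger chunks include more context. Check your embedding model's input limit as well: per its model card, `all-MiniLM-L6-v2` truncates input beyond 256 word pieces by default, so the tail of a longer chunk would not influence its embedding.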
* Try different embedding models (e.g. using `'all-mpnet-base-v2'` or `'paraphrase-MiniLM-L6-v2'`) to see how retrieval results change.
* Implement a simple reranking step: for example, after retrieving candidates with FAISS, re-score them with a cross-encoder model for finer ranking.
* Use metadata: consider filtering or weighting chunks by paper metadata (e.g. year, authors, keywords) to improve relevance if needed.
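
The reranking tip can be prototyped without committing to a particular model. In the sketch below, `rerank` is a hypothetical helper that accepts any scoring function. A cross-encoder could be wrapped to fit (for example, `CrossEncoder.predict` from `sentence-transformers` scores `(query, passage)` pairs), and the word-overlap scorer is only a stand-in so the example runs without downloads.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score retrieved passages and return the top_k, best first."""
    scored = [(text, float(score_fn(query, text))) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable sort keeps tie order
    return scored[:top_k]


def word_overlap(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words that also appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


candidates = [
    "FAISS builds vector indexes",
    "Retrieval augmented generation grounds answers in documents",
    "Generation without retrieval can hallucinate",
]
top = rerank("retrieval augmented generation", candidates, word_overlap, top_k=2)
```

In practice you would retrieve more candidates than you plan to show (say, the top 20 from FAISS) and rerank down to the top 3.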
