# RAG System for Research Paper Q&A

## Problem Statement

**Objective**: Build a Retrieval-Augmented Generation (RAG) system that can answer questions about research papers by combining semantic search with Large Language Model (LLM) generation.

**Challenge**: Research papers contain dense technical information. Traditional keyword search fails to capture semantic meaning, and LLMs have limited context windows. This RAG system addresses both challenges by:
- Efficiently retrieving relevant document chunks using semantic embeddings
- Augmenting LLM prompts with retrieved context for accurate answers
- Providing source citations for transparency

**Use Cases**:
- Academic research and literature review
- Technical document Q&A
- Knowledge base querying

## Dataset / Knowledge Source

**Type of Data**: PDF research papers

**Data Source**: Public research papers from arXiv:
1. **1706.03762v7.pdf** - "Attention Is All You Need" (Transformer architecture)
2. **1810.04805v2.pdf** - "BERT: Pre-training of Deep Bidirectional Transformers"
3. **1908.10084v1.pdf** - "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"
4. **2005.11401v4.pdf** - "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
5. **2401.08281v4.pdf** - "The Faiss library"

**Total Papers**: 5 research papers covering foundational and recent AI/NLP research

## RAG Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     RAG PIPELINE ARCHITECTURE                    │
└─────────────────────────────────────────────────────────────────┘

1. DATA INGESTION
   ┌──────────┐
   │ PDF Files│
   └────┬─────┘
        │
        ▼
   ┌─────────────┐
   │ PyPDF Loader│  ← Load and extract text from PDFs
   └──────┬──────┘
          │
          ▼
2. TEXT PROCESSING
   ┌──────────────────┐
   │ Text Chunking    │  ← RecursiveCharacterTextSplitter
   │ Size: 1000 chars │  ← Chunk size = 1000
   │ Overlap: 200     │  ← Overlap = 200
   └────────┬─────────┘
            │
            ▼
3. EMBEDDING GENERATION
   ┌────────────────────┐
   │ Sentence Transform │  ← all-MiniLM-L6-v2 model
   │ 384-dim embeddings │  ← Generate vector representations
   └─────────┬──────────┘
             │
             ▼
4. VECTOR STORAGE
   ┌──────────────────┐
   │ FAISS Vector DB  │  ← Fast similarity search
   │ Index & Store    │  ← Persistent storage
   └─────────┬────────┘
             │
             ▼
5. QUERY PROCESSING
   ┌──────────────┐
   │ User Query   │
   └──────┬───────┘
          │
          ▼
   ┌──────────────────┐
   │ Embed Query      │  ← Convert query to embedding
   └────────┬─────────┘
            │
            ▼
6. RETRIEVAL
   ┌───────────────────┐
   │ Similarity Search │  ← FAISS retrieves top-k chunks
   │ Top 4 chunks      │  ← k=4 most relevant documents
   └─────────┬─────────┘
             │
             ▼
7. GENERATION
   ┌────────────────────┐
   │ Context + Query    │
   └──────┬─────────────┘
          │
          ▼
   ┌────────────────────┐
   │ Gemini LLM         │  ← Google Gemini Pro
   │ Generate Answer    │  ← Context-aware response
   └──────┬─────────────┘
          │
          ▼
   ┌────────────────────┐
   │ Final Answer       │
   │ + Source Citations │
   └────────────────────┘
```

## Text Chunking Strategy

**Chunk Size**: 1000 characters

**Chunk Overlap**: 200 characters

**Reason for Chosen Strategy**:
1. **Optimal Context Size**: 1000 characters (roughly 150-200 English words, usually one or two paragraphs) gives each chunk enough context to stay semantically coherent
2. **Overlap Prevents Information Loss**: A 200-character overlap keeps information that spans a chunk boundary available in both neighboring chunks
3. **Balance Performance & Accuracy**: Smaller chunks give more precise retrieval but may lose context; larger chunks carry more context but retrieve less precisely. 1000 characters is a reasonable middle ground for these papers
4. **Embedding Model Compatibility**: all-MiniLM-L6-v2 truncates input beyond 256 tokens. At a rough 4 characters per token, a 1000-character chunk sits near that limit, so token-dense chunks (equations, citation lists) may be partly truncated when embedded
5. **Recursive Splitting**: Uses RecursiveCharacterTextSplitter which respects natural text boundaries (paragraphs, sentences) rather than arbitrary cuts
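
To make the size/overlap arithmetic concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification for illustration: `chunk_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it cuts at fixed character offsets rather than at paragraph or sentence boundaries.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where neighbors share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2500-character text made of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_with_overlap(sample)
print([len(p) for p in pieces])
```

With a step of 800 characters (1000 minus 200), the 2500-character sample yields three chunks, and the last 200 characters of each chunk reappear at the start of the next.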

## Embedding Details

**Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2`

**Model Specifications**:
- Dimensions: 384
- Max Sequence Length: 256 tokens
- Model Size: ~80MB
- License: Apache 2.0

**Reason for Selecting the Model**:
1. **Efficiency**: Small, fast model that runs well on CPU without requiring GPU
2. **Performance**: Performs well for its size on semantic similarity and retrieval benchmarks
3. **Pre-trained**: Trained on 1B+ sentence pairs, no fine-tuning needed
4. **Open Source**: Free to use with no API costs
5. **Community Support**: Widely used in production RAG systems with extensive documentation
6. **Balanced Trade-off**: Best balance between speed, accuracy, and resource requirements for this use case

## Vector Database

**Vector Store Used**: FAISS (Facebook AI Similarity Search)

**Key Features**:
- Exact and approximate nearest neighbor search (the LangChain wrapper used here builds an exact flat index by default)
- Supports billion-scale vector databases
- Multiple index types (Flat, IVF, HNSW)
- Can be saved/loaded from disk
- No external server required (embedded)

**Advantages**:
1. **Speed**: Optimized for fast similarity search
2. **Scalability**: Handles large document collections efficiently
3. **Persistence**: Can save/load index from disk
4. **No Setup**: No database server setup required
5. **Production-Ready**: Battle-tested by Meta in production systems
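
Conceptually, a flat index compares the query against every stored vector and keeps the best matches. The sketch below shows that idea in plain Python. It is not how FAISS is implemented (FAISS uses optimized native code), and `top_k` is a hypothetical helper name.

```python
import heapq


def top_k(query, vectors, k=4):
    """Return (index, score) pairs for the k vectors with the highest dot product."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    best = heapq.nlargest(k, scored)  # sorted from highest to lowest score
    return [(idx, score) for score, idx in best]


# Toy 2-dimensional "embeddings"; real ones in this project have 384 dimensions
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
result = top_k([1.0, 0.0], vectors, k=2)
print(result)
```

Exhaustive comparison costs time proportional to the number of stored vectors, which is fine for five papers. Index types such as IVF or HNSW trade some accuracy for speed on much larger collections.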

---
# Implementation: Step-by-Step RAG Pipeline
---

### Step 1: Install Required Libraries

In [None]:
# Install required packages
!pip install langchain langchain-community langchain-google-genai
!pip install pypdf sentence-transformers faiss-cpu
!pip install google-generativeai

### Step 2: Import Libraries

In [None]:
# Import necessary libraries
import os
import glob
from typing import List

# LangChain components
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Google Generative AI
import google.generativeai as genai

print("✓ All libraries imported successfully!")

### Step 3: Configure API Keys

In [None]:
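# Set up Gemini API key: read it from the environment instead of hardcoding it.
# A key committed to a notebook should be treated as leaked and revoked.
from getpass import getpass

GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY") or getpass("Enter your Gemini API key: ")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY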

# Configure Gemini
genai.configure(api_key=GOOGLE_API_KEY)

print("✓ API key configured successfully!")

### Step 4: Load PDF Documents

In [None]:
# Define the path to PDF folder
pdf_folder = "research_papers"

# Get all PDF files
pdf_files = glob.glob(os.path.join(pdf_folder, "*.pdf"))

print(f"Found {len(pdf_files)} PDF files:")
for pdf in pdf_files:
    print(f"  - {os.path.basename(pdf)}")

# Load all PDFs
documents = []
for pdf_path in pdf_files:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    documents.extend(docs)
    print(f"✓ Loaded {len(docs)} pages from {os.path.basename(pdf_path)}")

print(f"\n✓ Total documents loaded: {len(documents)} pages")

### Step 5: Split Documents into Chunks

In [None]:
# Initialize text splitter with our chosen strategy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # 1000 characters per chunk
    chunk_overlap=200,      # 200 character overlap
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Respect natural text boundaries
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)

print(f"✓ Split {len(documents)} documents into {len(chunks)} chunks")
print(f"\nSample chunk (first 200 chars):")
print(chunks[0].page_content[:200] + "...")

### Step 6: Generate Embeddings

In [None]:
# Initialize embedding model
print("Loading embedding model... (this may take a minute)")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

print("✓ Embedding model loaded successfully!")
print(f"  Model: sentence-transformers/all-MiniLM-L6-v2")
print(f"  Embedding dimensions: 384")
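
Setting `normalize_embeddings=True` scales every embedding to unit length. For unit vectors the dot product equals cosine similarity, and squared L2 distance is `2 - 2 * cosine`, so ranking by either measure gives the same order. A small worked check with toy 2-dimensional vectors (the helper names are illustrative):

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])  # becomes [0.6, 0.8]
b = normalize([4.0, 3.0])  # becomes [0.8, 0.6]

cosine = dot(a, b)
l2_squared = sum((x - y) ** 2 for x, y in zip(a, b))

print(round(cosine, 4), round(l2_squared, 4), round(2 - 2 * cosine, 4))
```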

### Step 7: Create FAISS Vector Store

In [None]:
# Create FAISS vector store from documents
print("Creating FAISS vector store... (this may take a few minutes)")
vectorstore = FAISS.from_documents(chunks, embeddings)

print(f"✓ Vector store created with {len(chunks)} document chunks")

# Save the vector store for future use
vectorstore.save_local("faiss_index")
print("✓ Vector store saved to 'faiss_index' folder")

### Step 8: Initialize Gemini LLM

In [None]:
# Initialize Gemini model
llm = ChatGoogleGenerativeAI(
    model="gemini-pro",  # model names change over time; use one currently listed by the Gemini API
    temperature=0.3,  # Lower temperature for more factual answers
    google_api_key=GOOGLE_API_KEY
)

print("✓ Gemini LLM initialized successfully!")
print("  Model: gemini-pro")
print("  Temperature: 0.3 (more factual)")

### Step 9: Create Custom Prompt Template

In [None]:
# Define custom prompt template for better answers
template = """You are an AI assistant helping with questions about research papers.
Use the following context to answer the question. If you cannot find the answer in the context, 
say "I cannot find this information in the provided documents."

Context:
{context}

Question: {question}

Provide a detailed and accurate answer based on the context above:
"""

PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

print("✓ Custom prompt template created")

### Step 10: Create RAG Retrieval Chain

In [None]:
# Create retriever from vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Retrieve top 4 most relevant chunks
)

# Create RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Stuff all retrieved docs into prompt
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

print("✓ RAG chain created successfully!")
print("  Retrieval: Top 4 similar chunks")
print("  Chain type: Stuff (all context in prompt)")
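
The "stuff" strategy simply concatenates all retrieved chunks into one context string and fills the prompt template once. The sketch below reproduces that idea with plain string formatting. `TEMPLATE` is a shortened stand-in for the notebook's prompt, `build_stuffed_prompt` is a hypothetical helper, and the blank-line separator is an assumption about how chunks are joined.

```python
# Shortened stand-in for the notebook's prompt template (illustration only)
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\n"


def build_stuffed_prompt(chunks, question, separator="\n\n"):
    """Concatenate every retrieved chunk and place the result in a single prompt."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder text standing in for retrieved chunks
retrieved = ["Placeholder chunk one.", "Placeholder chunk two."]
prompt = build_stuffed_prompt(retrieved, "What is the Transformer?")
print(prompt)
```

Because every chunk lands in one prompt, the prompt grows with `k` and the chunk size: four chunks of up to 1000 characters add up to roughly 4000 characters of context per question.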

### Step 11: Helper Function for Queries

In [None]:
def ask_question(question: str):
    """
    Ask a question to the RAG system and display the answer with sources.
    
    Args:
        question: The question to ask
    """
    print("="*80)
    print(f"QUESTION: {question}")
    print("="*80)
    
    # Get answer from RAG chain
    result = qa_chain.invoke({"query": question})
    
    # Display answer
    print("\nANSWER:")
    print("-" * 80)
    print(result['result'])
    
    # Display source documents
    print("\n" + "="*80)
    print("SOURCE DOCUMENTS:")
    print("="*80)
    for i, doc in enumerate(result['source_documents'], 1):
        print(f"\nSource {i}:")
        print(f"File: {doc.metadata.get('source', 'Unknown')}")
        print(f"Page: {doc.metadata.get('page', 'Unknown')}")
        print(f"Content preview: {doc.page_content[:200]}...")
        print("-" * 80)

print("✓ Helper function created")
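
When several retrieved chunks come from the same page, the source list repeats itself. A possible refinement is to collapse the metadata into unique citations. The sketch below works on plain dictionaries shaped like the `doc.metadata` values printed above. `format_citations` is a hypothetical helper, and it assumes page numbers are stored zero-based, so it adds 1 for display.

```python
import os


def format_citations(metadatas):
    """Return one 'file, page' string per unique (file, page) pair, in first-seen order."""
    seen = set()
    citations = []
    for meta in metadatas:
        name = os.path.basename(meta.get("source", "Unknown"))
        page = meta.get("page")
        key = (name, page)
        if key in seen:
            continue  # skip chunks from a page that is already cited
        seen.add(key)
        # Assumption: pages are zero-based in the metadata, so add 1 for readers
        page_label = f"p. {page + 1}" if isinstance(page, int) else "page unknown"
        citations.append(f"{name}, {page_label}")
    return citations


example = [
    {"source": "research_papers/1706.03762v7.pdf", "page": 2},
    {"source": "research_papers/1706.03762v7.pdf", "page": 2},
    {"source": "research_papers/1810.04805v2.pdf", "page": 0},
]
citations = format_citations(example)
print(citations)
```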

---
# Test Queries
---

Let's test the RAG system with 3 different queries to demonstrate its capabilities.

### Test Query 1: Understanding Transformer Architecture

In [None]:
ask_question("What is the Transformer architecture and what are its key components?")

### Test Query 2: BERT Model Details

In [None]:
ask_question("What is BERT and how does it differ from previous language models?")

### Test Query 3: Attention Mechanism

In [None]:
ask_question("Explain the attention mechanism in neural networks. Why is it important?")

---
# Future Improvements
---

## 1. Better Chunking Strategies

**Current Limitation**: Fixed-size chunking may split sentences or concepts awkwardly.

**Improvements**:
- **Semantic Chunking**: Use NLP to identify topic boundaries and chunk by semantic units
- **Hierarchical Chunking**: Create multi-level chunks (sections → paragraphs → sentences)
- **Metadata-Aware Chunking**: Preserve document structure (headings, sections, figures)
- **Adaptive Chunk Sizing**: Vary chunk size based on content density and complexity

**Expected Impact**: An estimated 15-20% improvement in retrieval accuracy (a target to verify against a ground-truth question set, not a measured result)
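
As a first step toward the strategies above, chunks can at least be kept on sentence boundaries. The sketch below is sentence-boundary packing, not true semantic chunking: it splits on end-of-sentence punctuation with a simple regular expression and fills each chunk up to a character budget. `chunk_by_sentences` is a hypothetical helper, and the regex will mis-split abbreviations such as "et al.".

```python
import re


def chunk_by_sentences(text, max_chars=1000):
    """Pack whole sentences into chunks of at most max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence      # an over-long single sentence stays whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Attention is all you need. BERT is bidirectional. FAISS indexes vectors."
sentence_chunks = chunk_by_sentences(text, max_chars=60)
print(sentence_chunks)
```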

## 2. Reranking / Hybrid Search

**Current Limitation**: Pure vector search may miss exact keyword matches.

**Improvements**:
- **Hybrid Search**: Combine dense (vector) and sparse (BM25) retrieval
- **Cross-Encoder Reranking**: Use models like `cross-encoder/ms-marco-MiniLM-L-6-v2` to rerank top-k results
- **Reciprocal Rank Fusion**: Merge rankings from multiple retrievers
- **Query Expansion**: Expand user query with synonyms and related terms

**Implementation**:
```python
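# BM25Retriever needs the rank_bm25 package (pip install rank_bm25)
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever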

bm25_retriever = BM25Retriever.from_documents(chunks)
ensemble_retriever = EnsembleRetriever(
    retrievers=[vectorstore.as_retriever(), bm25_retriever],
    weights=[0.5, 0.5]
)
```

**Expected Impact**: An estimated 25-30% improvement in retrieval precision (a target to verify against a ground-truth question set, not a measured result)
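
Reciprocal Rank Fusion itself is small enough to sketch directly. Each document receives `1 / (k + rank)` from every ranking it appears in, and the sums decide the fused order. The constant `k=60` is the value commonly used in the literature and is an assumption here, as are the chunk IDs and the helper name.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a stable result
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["c1", "c2", "c3"]   # hypothetical vector-search order
sparse_ranking = ["c3", "c1", "c4"]  # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Chunks that appear in both rankings (`c1`, `c3`) rise above chunks found by only one retriever, which is the behavior hybrid search relies on.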

## 3. Metadata Filtering

**Current Limitation**: Cannot filter by paper, author, date, or section.

**Improvements**:
- **Rich Metadata Extraction**: Extract paper title, authors, publication date, section headers
- **Filtered Retrieval**: Allow users to filter by metadata before vector search
- **Faceted Search**: Enable multi-dimensional filtering (e.g., "papers from 2023 about transformers")
- **Citation Tracking**: Track and display which specific paper each answer comes from

**Implementation**:
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 4,
        # PyPDFLoader stores the path it was given, so match the full path
        "filter": {"source": "research_papers/1706.03762v7.pdf"}
    }
)
```

**Expected Impact**: Better user control and more targeted results
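
The filtering logic is an exact match on metadata fields. The sketch below applies it to plain dictionaries, with `filter_by_metadata` and the candidate records being hypothetical. Note that the `source` values include the folder, because the loader records the path it was given rather than the bare file name.

```python
def filter_by_metadata(candidates, required):
    """Keep candidates whose metadata matches every key/value pair in `required`."""
    return [
        doc for doc in candidates
        if all(doc["metadata"].get(key) == value for key, value in required.items())
    ]


# Hypothetical retrieved chunks with loader-style metadata
candidates = [
    {"text": "chunk A", "metadata": {"source": "research_papers/1706.03762v7.pdf", "page": 1}},
    {"text": "chunk B", "metadata": {"source": "research_papers/1810.04805v2.pdf", "page": 0}},
    {"text": "chunk C", "metadata": {"source": "research_papers/1706.03762v7.pdf", "page": 4}},
]
matches = filter_by_metadata(candidates, {"source": "research_papers/1706.03762v7.pdf"})
print([doc["text"] for doc in matches])
```

Filtering on the bare file name `1706.03762v7.pdf` would match nothing in this example, which is a common reason a metadata filter silently returns no results.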

## 4. UI Integration (Already Implemented!)

**Current State**: Notebook-based interaction, with a companion Streamlit app (`app.py`)

**Implemented Improvements**:
- ✓ **Streamlit Web Interface**: User-friendly web UI (see `app.py`)
- ✓ **Chat History**: Maintains conversation context
- ✓ **Source Display**: Shows document sources for each answer
- ✓ **Easy Deployment**: Can be hosted on Streamlit Cloud, Hugging Face Spaces

**Future UI Enhancements**:
- Multi-turn conversations with memory
- Document upload interface for new PDFs
- Visualization of retrieved chunks and similarity scores
- Export conversation to PDF/Markdown
- Mobile-responsive design

## 5. Additional Improvements

### Performance Optimization
- **Quantized Embeddings**: Use int8 quantization to reduce memory (about 75% smaller than float32 embeddings)
- **Batch Processing**: Process multiple queries in parallel
- **Caching**: Cache frequently asked questions and their answers
- **GPU Acceleration**: Use CUDA for faster embedding generation
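
The memory saving from int8 quantization follows from the storage width: one byte per value instead of four. The sketch below uses simple symmetric scalar quantization for values in [-1, 1], which holds for normalized embeddings. The function names are illustrative, and production systems would typically rely on the quantizers built into the vector library instead.

```python
from array import array


def quantize_int8(vec):
    """Map floats in [-1, 1] to signed 8-bit integers in [-127, 127]."""
    return array("b", (max(-127, min(127, round(x * 127))) for x in vec))


def dequantize_int8(qvec):
    """Recover approximate floats from the quantized values."""
    return [q / 127 for q in qvec]


vec = [0.5, -0.25, 1.0, 0.0] * 96  # 384 values, like one all-MiniLM-L6-v2 embedding
quantized = quantize_int8(vec)

float_bytes = array("f", vec).itemsize * len(vec)  # 4 bytes per float32 value
int8_bytes = quantized.itemsize * len(quantized)   # 1 byte per int8 value
max_error = max(abs(x - y) for x, y in zip(vec, dequantize_int8(quantized)))

print(float_bytes, int8_bytes, max_error)
```

The cost is a small rounding error per dimension (at most half a quantization step, about 0.004 here), so retrieval quality should be re-checked after quantizing.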

### Accuracy Enhancements
- **Few-Shot Examples**: Include example Q&A pairs in prompt
- **Confidence Scoring**: Return confidence scores for answers
- **Hallucination Detection**: Detect when LLM generates info not in context
- **Multi-Query Retrieval**: Generate multiple query variations for better retrieval

### Evaluation & Monitoring
- **Ground Truth Dataset**: Create test questions with known answers
- **Retrieval Metrics**: Track precision@k, recall@k, MRR
- **Answer Quality**: Use LLM-as-judge to evaluate answer quality
- **User Feedback**: Collect thumbs up/down on answers
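
The retrieval metrics listed above can be computed from a list of retrieved chunk IDs and a set of known-relevant IDs. A minimal sketch follows, with hypothetical chunk IDs standing in for a real ground-truth dataset:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant item over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)


retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c2", "c4", "c5"}

p_at_4 = precision_at_k(retrieved, relevant, k=4)
r_at_4 = recall_at_k(retrieved, relevant, k=4)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["c1", "c3"], {"c1"})])
print(p_at_4, r_at_4, mrr)
```

In this example two of the four retrieved chunks are relevant (precision@4 = 0.5), two of three relevant chunks were found (recall@4 ≈ 0.67), and the first relevant hits sit at ranks 2 and 1 (MRR = 0.75).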

---
# Conclusion

This RAG system successfully demonstrates:
- ✓ Loading and processing multiple research papers
- ✓ Intelligent text chunking with overlap
- ✓ Semantic embedding generation
- ✓ Fast vector similarity search with FAISS
- ✓ Context-aware answer generation with Gemini
- ✓ Source citation and transparency

The system can be extended with the improvements listed above to create a production-ready research assistant.

---