# RAG System for Research Paper Q&A
**Harshit Arora & Aditi Jha**

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline that answers questions about research papers by combining semantic search with LLM generation.

## Pipeline Overview
1. Load PDF documents
2. Split text into chunks
3. Generate embeddings (MiniLM)
4. Store in FAISS vector database
5. Query with similarity search
6. Generate answers with Groq (Llama 3.3 70B)

## Step 1: Install Dependencies

In [None]:
!pip install -q langchain langchain-classic langchain-community langchain-text-splitters langchain-groq faiss-cpu sentence-transformers pypdf

## Step 2: Import Libraries

In [None]:
import os
import time
import warnings
warnings.filterwarnings('ignore')

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain_classic.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

print("All libraries imported successfully!")

## Step 3: Configure Groq API Key
Set your Groq API key below. You can create one in the [Groq Console](https://console.groq.com/keys).

In [None]:
# Read the Groq API key from the environment; otherwise replace the placeholder with your key.
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "your-groq-api-key-here")
os.environ["GROQ_API_KEY"] = GROQ_API_KEY

print("API key configured.")
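
As one way to keep the key out of the saved notebook, the sketch below reads it from the environment and prompts only when it is missing or still set to the placeholder. `resolve_api_key` and `PLACEHOLDER` are hypothetical names introduced for illustration; they are not part of the Groq or LangChain APIs.

```python
import os
from getpass import getpass

# Hypothetical sentinel: the placeholder string used in this notebook.
PLACEHOLDER = "your-groq-api-key-here"

def resolve_api_key(env_var="GROQ_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only if it is unset."""
    key = os.environ.get(env_var, "")
    if not key or key == PLACEHOLDER:
        # getpass hides the typed value, so it never shows up in cell output.
        key = prompt(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `resolve_api_key()` once before the LLM is created would leave `os.environ["GROQ_API_KEY"]` populated for the later cells.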

## Step 4: Load PDF Documents
Load every PDF in the `research_papers/` directory (5 research papers here) with `PyPDFLoader`, which returns one document per page.

In [None]:
PDF_FOLDER = "research_papers"

all_documents = []
pdf_files = sorted([f for f in os.listdir(PDF_FOLDER) if f.endswith(".pdf")])

print(f"Found {len(pdf_files)} PDF files:")
for pdf_file in pdf_files:
    pdf_path = os.path.join(PDF_FOLDER, pdf_file)
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    all_documents.extend(documents)
    print(f"  - {pdf_file}: {len(documents)} pages")

print(f"\nTotal pages loaded: {len(all_documents)}")
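
The listing step checks for a lowercase `.pdf` suffix and assumes the folder exists. A slightly more defensive variant could look like the sketch below; `find_pdfs` is a hypothetical helper, not something this notebook or LangChain provides.

```python
from pathlib import Path

def find_pdfs(folder):
    """Return sorted PDF paths in `folder`, matching the extension case-insensitively."""
    root = Path(folder)
    if not root.is_dir():
        # Fail early with a clear message instead of a bare OSError from listdir.
        raise FileNotFoundError(f"PDF folder not found: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path can be passed to `PyPDFLoader(str(path))` in place of the `os.path.join` result.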

## Step 5: Split Text into Chunks
Use `RecursiveCharacterTextSplitter` with:
- **Chunk size**: 500 characters (~1 paragraph for higher precision)
- **Chunk overlap**: 100 characters (prevents information loss at boundaries)
- **Separators**: paragraph → newline → space → character (respects natural text boundaries)

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(all_documents)

print(f"Total chunks created: {len(chunks)}")
print(f"Average chunk length: {sum(len(c.page_content) for c in chunks) // len(chunks)} characters")
print(f"\nSample chunk (first 300 chars):")
print(f"---")
print(chunks[0].page_content[:300])
print(f"---")
print(f"Metadata: {chunks[0].metadata}")
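
To make the effect of `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is only an illustration of overlap: `RecursiveCharacterTextSplitter` additionally tries to break on the configured separators, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# Demo on synthetic text: the tail of one chunk repeats at the head of the next.
demo_text = "".join(chr(97 + i % 26) for i in range(1200))
demo_chunks = sliding_chunks(demo_text)
print([len(c) for c in demo_chunks])
print(demo_chunks[0][-100:] == demo_chunks[1][:100])
```

With a 500-character window and 100-character overlap, each new chunk starts 400 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.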

## Step 6: Generate Embeddings with MiniLM
Use `sentence-transformers/all-MiniLM-L6-v2` to convert each chunk into a 384-dimensional vector.

| Parameter | Value |
|-----------|-------|
| Model | all-MiniLM-L6-v2 |
| Dimensions | 384 |
| Max Sequence Length | 256 tokens |
| Model Size | ~80 MB |

In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

# Test embedding
test_embedding = embeddings.embed_query("test sentence")
print(f"Embedding model loaded successfully!")
print(f"Embedding dimensions: {len(test_embedding)}")
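
Setting `normalize_embeddings` to `True` scales every vector to unit length. For unit vectors, squared L2 distance and cosine similarity are tied by the identity ‖a − b‖² = 2 − 2(a · b), so ranking by L2 distance (as a Flat L2 index does) gives the same order as ranking by cosine similarity. The pure-Python sketch below checks that identity on toy 2-dimensional vectors; the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([1.0, 0.0])

# For unit vectors the dot product is the cosine similarity.
cosine = dot(a, b)
print(math.isclose(squared_l2(a, b), 2 - 2 * cosine))
```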

## Step 7: Create FAISS Vector Store
Store all chunk embeddings in a FAISS index for efficient similarity search.

| Parameter | Value |
|-----------|-------|
| Vector Store | FAISS |
| Index Type | Flat L2 (exact search) |
| Persistence | Saved to disk (`faiss_index/`) |
| Search Type | Similarity search (top-k) |
| k Value | 10 |

In [None]:
FAISS_INDEX_PATH = "faiss_index"

print("Creating FAISS vector store...")
start_time = time.time()

vectorstore = FAISS.from_documents(chunks, embeddings)

elapsed = time.time() - start_time
print(f"Vector store created in {elapsed:.2f}s")
print(f"Total vectors stored: {vectorstore.index.ntotal}")

# Save to disk for reuse
vectorstore.save_local(FAISS_INDEX_PATH)
print(f"Index saved to '{FAISS_INDEX_PATH}/'")
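
A flat index performs exact search: the query is compared against every stored vector and the closest `k` are returned. The sketch below shows that idea in plain Python on toy vectors. It is not how FAISS is implemented (FAISS uses optimized native code), and `flat_l2_search` is a hypothetical name.

```python
import heapq

def flat_l2_search(query, vectors, k):
    """Exact nearest-neighbour search: return the k closest (index, squared distance) pairs."""
    scored = (
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # nsmallest orders by distance first, then by index on ties.
    return [(idx, dist) for dist, idx in heapq.nsmallest(k, scored)]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_l2_search([0.9, 0.1], stored, k=2)
print([idx for idx, _ in hits])
```

Because every vector is scored, the cost grows linearly with the number of chunks, which is usually acceptable for a handful of papers.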

## Step 8: Initialize Groq LLM
Set up the Groq-hosted Llama 3.3 70B model (`llama-3.3-70b-versatile`) for answer generation. Groq offers a free API tier, subject to rate limits.

In [None]:
model_name = "llama-3.3-70b-versatile"

llm = ChatGroq(
    model_name=model_name,
    temperature=0.3,
    groq_api_key=os.environ["GROQ_API_KEY"]
)

print(f"Groq LLM initialized: {model_name}")

## Step 9: Create Prompt Template
Design a prompt that instructs the LLM to answer questions based only on the retrieved context, with source citations.

In [None]:
prompt_template = """You are a helpful research assistant. Use the following pieces of context from research papers to answer the question. 
If you don't know the answer based on the context, say "I don't have enough information in the provided papers to answer this question."

Always cite which paper(s) you're referencing in your answer.

Context:
{context}

Question: {question}

Answer (with citations):"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

print("Prompt template created.")

## Step 10: Build RetrievalQA Chain
Combine the retriever (FAISS with top-k=10) and the LLM (Groq) into a LangChain RetrievalQA chain.

In [None]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

print("RetrievalQA chain built successfully!")
print(f"  - Retriever: FAISS (top-k=10)")
print(f"  - LLM: {model_name}")
print(f"  - Chain type: stuff")
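
The `stuff` chain type places all retrieved chunks into a single prompt. The sketch below approximates that step with plain string formatting, assuming chunks are joined with a blank line; it stands in for the chain's behaviour and does not call LangChain. `stuff_prompt` and the shortened `TEMPLATE` are hypothetical.

```python
# Shortened stand-in for the notebook's prompt template.
TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer (with citations):"
)

def stuff_prompt(chunk_texts, question, template=TEMPLATE, separator="\n\n"):
    """Join all retrieved chunk texts and substitute them into the template."""
    context = separator.join(chunk_texts)
    return template.format(context=context, question=question)

print(stuff_prompt(["chunk A", "chunk B"], "What is BERT?"))
```

With `k=10` and chunks of at most 500 characters, the context portion stays near 5,000 characters, which is why a single-prompt strategy is workable here.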

## Step 11: Run Test Queries
Test the RAG pipeline with sample questions about the research papers.

In [None]:
def ask_question(query):
    """Ask a question and display the answer with source citations."""
    print(f"\n{'='*80}")
    print(f"Q: {query}")
    print(f"{'='*80}")
    
    start_time = time.time()
    result = qa_chain.invoke({"query": query})
    elapsed = time.time() - start_time
    
    print(f"\nA: {result['result']}")
    print(f"\n[Response time: {elapsed:.2f}s]")
    
    print(f"\n--- Source Documents ---")
    for i, doc in enumerate(result['source_documents'], 1):
        source = doc.metadata.get('source', 'Unknown')
        page = doc.metadata.get('page', 'N/A')
        print(f"  [{i}] {os.path.basename(source)} (Page {page})")
        print(f"      {doc.page_content[:150]}...")
    
    return result
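
Several retrieved chunks often come from the same page, so the source list can repeat itself. The sketch below collapses the sources into unique (file, page) pairs. It assumes the `page` value stored by `PyPDFLoader` is zero-based and therefore adds 1 for display; `unique_sources` and the file names in the demo are hypothetical.

```python
import os
from types import SimpleNamespace

def unique_sources(source_documents):
    """Return ordered, de-duplicated (file name, 1-based page) pairs."""
    seen = []
    for doc in source_documents:
        name = os.path.basename(doc.metadata.get("source", "Unknown"))
        page = doc.metadata.get("page")
        # Assumption: integer page numbers are zero-based, so shift for display.
        label = page + 1 if isinstance(page, int) else "N/A"
        if (name, label) not in seen:
            seen.append((name, label))
    return seen

# Stand-in objects with the same `.metadata` attribute as LangChain documents.
demo_docs = [
    SimpleNamespace(metadata={"source": "research_papers/bert.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "research_papers/bert.pdf", "page": 0}),
    SimpleNamespace(metadata={"source": "research_papers/rag.pdf", "page": 4}),
    SimpleNamespace(metadata={}),
]
print(unique_sources(demo_docs))
```

In the notebook this would be called as `unique_sources(result["source_documents"])` on the value returned by `ask_question`.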

In [None]:
# Query 1: Transformer Architecture
result1 = ask_question("What is the Transformer architecture and its key components?")

In [None]:
# Query 2: BERT
result2 = ask_question("What is BERT and how does it differ from previous models?")

In [None]:
# Query 3: Attention Mechanism
result3 = ask_question("Explain the attention mechanism in neural networks.")

In [None]:
# Additional test queries
result4 = ask_question("What is Retrieval-Augmented Generation (RAG) and how does it work?")

In [None]:
result5 = ask_question("How does Sentence-BERT generate sentence embeddings?")

## Summary

This RAG pipeline successfully:
1. **Ingested** 5 research papers from PDF format
2. **Chunked** text into overlapping segments for optimal retrieval
3. **Embedded** chunks using MiniLM (384-dim vectors)
4. **Stored** embeddings in FAISS for fast similarity search
5. **Retrieved** top-10 relevant chunks per query
6. **Generated** grounded answers with source citations using Llama 3.3 70B via Groq

For an interactive web interface, run the Streamlit app:
```bash
streamlit run app.py
```