# Semantic Q&A System on Research Documents

## Introduction:
This project implements a Retrieval-Augmented Generation (RAG) system that combines semantic search with a generative model to answer questions about a research paper. It uses FAISS for dense vector retrieval and TinyLLaMA for lightweight answer generation, so responses are grounded in the retrieved passages. The document used is a scientific research paper (RAGPAPER.pdf), and the system supports conversational memory, so follow-up questions are interpreted in the context of earlier turns.

## Aim:
To build a lightweight and efficient Retrieval-Augmented Generation (RAG) pipeline that uses document embeddings and a small LLM to semantically understand, retrieve, and respond to questions about a given PDF document.

## Objectives:
- Load and parse research PDFs using LangChain document loaders.
- Split documents into manageable chunks for semantic search.
- Generate embeddings using HuggingFace Embedding models.
- Store and retrieve document vectors using FAISS vector store.
- Integrate Ollama to run the TinyLLaMA language model locally.
- Build a ConversationalRetrievalChain for multi-turn Q&A.
- Rephrase follow-up questions into standalone prompts using prompt engineering.
- Provide accurate, contextually grounded answers to user queries.

## Model & Configurations:
- LLM: TinyLLaMA – a compact large language model suitable for low-resource inference.
- Embedding Model: HuggingFaceEmbeddings, called without a model_name, so LangChain's default applies (sentence-transformers/all-mpnet-base-v2)
- Vector Store: FAISS – Facebook AI Similarity Search, used for fast retrieval of top relevant chunks.
- Prompt Engineering: Custom prompt using PromptTemplate to rephrase follow-up questions to standalone ones.

## TinyLLaMA Parameters:
- Model Name: TinyLLaMA-1.1B
- Number of Parameters: 1.1 Billion
- Architecture: Decoder-only Transformer (similar to LLaMA)
- Trained on: SlimPajama (a cleaned, deduplicated derivative of RedPajama) and Starcoderdata
- Tokenizer: LLaMA-compatible tokenizer with 32k vocab size

In [84]:
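# Integrations (LLM, vector store, embeddings, loaders) live in langchain_community;
# the older langchain.* paths for them are deprecated and emit warnings.
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate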

In [85]:
# Load and chunk PDF
pdf_loader = PyPDFLoader('RAGPAPER.pdf')
documents = pdf_loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
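
To make the effect of `chunk_size=1024` and `chunk_overlap=200` concrete, here is a minimal pure-Python sketch of overlapping fixed-size windows. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this sketch ignores. The helper name `chunk_text` is hypothetical and not part of LangChain.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Simplified illustration only: no separator-aware splitting.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 2,500-character sample standing in for a page of the PDF
sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The last 200 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully contained in at least one chunk.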

In [86]:
# Embedding and Vector Store
embedding_model = HuggingFaceEmbeddings()
vector_store = FAISS.from_documents(documents=chunks, embedding=embedding_model)
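
Conceptually, retrieval from the vector store is a nearest-neighbour search over the chunk embeddings. The sketch below shows that idea with an exhaustive Euclidean (L2) search in plain Python. It assumes the store uses a flat L2 index, which is believed to be the LangChain FAISS default but is not confirmed here; the vectors are toy 3-dimensional values, not real embeddings, and `top_k` is a hypothetical helper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k closest vectors, nearest first."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "chunk embeddings" (assumption: real ones have hundreds of dimensions)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
nearest = top_k(query, toy_vectors, k=2)
print(nearest)
```

FAISS performs the same comparison with optimized native code, which is what keeps lookup fast as the number of chunks grows.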


In [87]:
# LLM Model
llm_model = Ollama(model='tinyllama')

In [88]:
# Prompt Template for standalone question
question_prompt = PromptTemplate.from_template("""
Given the following conversation and follow-up question, rephrase the follow-up question to a standalone question.
Chat History: {chat_history}
Follow-up Input: {question}
Standalone Question:""")
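
To see what the LLM actually receives during the rephrasing step, the template can be filled in by hand. This sketch uses plain `str.format` rather than `PromptTemplate`, and the `render_history` helper and its "Human:/Assistant:" layout are assumptions for illustration; the chain's own history formatting may differ.

```python
CONDENSE_TEMPLATE = """
Given the following conversation and follow-up question, rephrase the follow-up question to a standalone question.
Chat History: {chat_history}
Follow-up Input: {question}
Standalone Question:"""


def render_history(turns):
    """Flatten (question, answer) tuples into text (hypothetical layout)."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in turns)


# Illustrative history; the answer text here is made up for the example
history = [("What is RAG?", "RAG combines retrieval with generation.")]
prompt_text = CONDENSE_TEMPLATE.format(
    chat_history=render_history(history),
    question="What are its main stages?",
)
print(prompt_text)
```

The prompt ends with "Standalone Question:" so that the model's completion is the rewritten question itself, which is then used as the retrieval query.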

In [89]:
# Conversational Retrieval Chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm_model,
    retriever=vector_store.as_retriever(),
    condense_question_prompt=question_prompt,
    return_source_documents=True,
    verbose=False
)

In [90]:
# Chat history and sample query
chat_history = []
query = "What are different Indexing Optimization methods used in this paper?"
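# .invoke() is the current way to run a chain; calling the chain object directly is deprecated
results = qa_chain.invoke({'question': query, 'chat_history': chat_history})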

In [91]:
# Output answer
print("Answer:", results['answer'])

Answer: Different Indexing Optimization methods used in this paper are Chunking Strategy, Enhancinig Data Granularity, Adding Metadata, Alignmnet Optimization, and Post-Retrieval Process. These methods aim to enhance the quality of the content being indexed by strategically optimizing index structures, optimizing query structures, aligning queries with retrieved information, re-ranking retrieved information to relate most relevant documents to edges of prompts, feeding metadata from original documents directly into LLMs for direct relevance retrieval, and establishing hierarchical structures for documents. These methods aid in the swift traversal of data and assist RAG systems in determining which documents are pertinent to a user's original question.
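
The run above uses an empty `chat_history`, so it does not yet exercise the conversational memory. A minimal sketch of a multi-turn loop follows: each turn is appended to the history as a `(question, answer)` tuple before the next question is asked. The `ask` helper and `stub_chain` are hypothetical; the stub stands in for the real chain so the sketch runs without Ollama or FAISS.

```python
def ask(run_turn, question, chat_history):
    """Run one turn and record it so the next question can refer back to it."""
    result = run_turn({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


def stub_chain(inputs):
    """Hypothetical stand-in for the chain; reports how much history it saw."""
    turns_seen = len(inputs["chat_history"])
    return {"answer": f"(answer to '{inputs['question']}' after {turns_seen} earlier turn(s))"}


chat_history = []
first = ask(stub_chain, "What are different Indexing Optimization methods used in this paper?", chat_history)
second = ask(stub_chain, "Which of them involve metadata?", chat_history)
print(second)
print(len(chat_history))
```

With the real pipeline, passing `qa_chain.invoke` in place of `stub_chain` should behave the same way, assuming the chain returns its answer under the `'answer'` key as it does above.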


### Limitations 
- Model Capacity: TinyLLaMA (1.1B params) may struggle with complex or highly technical queries compared to larger LLMs.
- Context Window: TinyLLaMA's 2,048-token context limits how many retrieved chunks and how much chat history fit into a single prompt, so long documents are only ever seen in small pieces.
- System Dependency: Performance depends on local CPU/GPU; low-end systems may face slowdowns.
- General-Purpose Model: Not fine-tuned for academic/research texts, affecting precision.
- Static Data Scope: Answers are limited to the uploaded PDF content only.
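
The context-window limitation can be made concrete with a rough budget estimate. The figures below are assumptions, not measurements: about 4 characters per token for English text, 4 retrieved chunks per query (believed to be the retriever default), and a 2,048-token context for TinyLLaMA.

```python
import math

CHARS_PER_TOKEN = 4     # rough heuristic for English text (assumption)
CONTEXT_TOKENS = 2048   # TinyLLaMA context length (assumption stated above)
CHUNK_SIZE_CHARS = 1024  # chunk_size used by the text splitter in this notebook
RETRIEVED_CHUNKS = 4    # assumed number of chunks returned per query


def estimate_tokens(num_chars: int) -> int:
    """Very rough character-to-token conversion."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


retrieved_tokens = RETRIEVED_CHUNKS * estimate_tokens(CHUNK_SIZE_CHARS)
remaining_tokens = CONTEXT_TOKENS - retrieved_tokens
print("Tokens used by retrieved chunks:", retrieved_tokens)
print("Tokens left for instructions, question, and answer:", remaining_tokens)
```

Under these assumptions the retrieved chunks alone take roughly half of the window, which is why larger chunks or more retrieved passages quickly crowd out the question and the generated answer.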

### Future Work
- Model Upgrade: Integrate larger or fine-tuned models (e.g., LLaMA-2, GPT-3.5) for better understanding and accuracy.
- Domain Adaptation: Fine-tune the model on academic corpora to improve performance on technical queries.
- Multi-Document Support: Expand system to handle and cross-reference multiple research papers.
- Web-Based UI: Develop a user-friendly web interface using Streamlit or Gradio for broader usability.
- Hybrid Retrieval: Combine vector search with keyword-based methods for more robust question answering.
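
One possible way to realize the hybrid retrieval item is reciprocal rank fusion (RRF), which merges a vector-search ranking and a keyword ranking using only the rank positions. This is a sketch of one option, not part of the current system; the chunk IDs are made up, and `k=60` is a commonly used smoothing constant chosen here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers for the same query
vector_ranking = ["chunk_3", "chunk_1", "chunk_7"]
keyword_ranking = ["chunk_1", "chunk_9", "chunk_3"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, while chunks found by only one retriever are still kept, which is the robustness the hybrid approach is after.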



### Conclusion
This project demonstrates the power of combining retrieval-based search with generative models, even with lightweight LLMs like TinyLLaMA. It bridges the gap between static document search and intelligent conversational interfaces. The use of FAISS ensures fast, scalable vector search, while LangChain’s tools enable modular and extendable pipeline development. This system can be adapted for document Q&A, legal or research assistants, and educational tools.