<a href="https://colab.research.google.com/github/inafees14/domain_specific_qna/blob/main/Domain_Specific_Q%26A_System_using_Generative_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Domain-Specific Question Answering System using Retrieval-Augmented Generation (RAG)**

Large Language Models (LLMs) often generate fluent but factually incorrect answers (hallucinations), especially when queried about domain-specific or private documents such as resumes, textbooks, or manuals.

This project addresses that limitation by building a Retrieval-Augmented Generation (RAG) system that:

- Grounds answers in retrieved documents
- Prioritizes faithfulness over creativity
- Runs efficiently on CPU-only environments


## **Objective**
To design and implement a document-grounded question-answering system that:
- Retrieves relevant context using semantic search
- Generates answers strictly based on retrieved evidence
- Reduces hallucination through architectural and decoding constraints

### RAG Architecture

The following figure illustrates the architecture of the proposed
domain-specific question-answering system.

[Figure: RAG architecture, `rag_architecture.gv.svg`]

In [None]:
!pip install -U \
  transformers \
  sentence-transformers \
  langchain \
  langchain-community \
  langchain-text-splitters \
  faiss-cpu \
  pypdf \
  accelerate

Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-text-splitters
  Downloading langchain_text_splitters-1.1.0-py3-none-any.whl.metadata (2.7 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Collecting pypdf
  Downloading pypdf-6.5.0-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-classic<2.0.0,>=1.0.0 (from langchain-community)
  Downloading langchain_classic-1.0.1-py3-none-any.whl.metadata (4.2 kB)
Collecting requests (from transformers)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading marshmallow-3.26.2-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import os

BASE_DIR = "/content/drive/MyDrive/rag_project"
MODEL_DIR = f"{BASE_DIR}/models"
DATA_DIR = f"{BASE_DIR}/data"

os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)

print("Folders ready")

Folders ready


## **Core Components**

### **Document Loader**

**Input:** PDF documents (resume, books, notes)

**Tool Used:**
- `PyPDFLoader` (LangChain Community)


**Mathematical View:** Let a document $D$ consist of $n$ pages:
$$D = \{p_1, p_2, p_3, \ldots, p_n\}$$

Each page is treated as raw unstructured text.

### Text Chunking Strategy

**Why chunking matters:**

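- Keeps related sentences together, preserving semantic locality
- Enables efficient retrieval over small, focused units
- Reduces hallucination by passing only relevant evidence to the generator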

**Method:** Recursive Character Splitting
$$D \rightarrow \{c_1, c_2, \ldots, c_m\}$$

Where:
- $|c_i| \leq 800$ characters
- Consecutive chunks overlap, so text that spans a chunk boundary remains retrievable

**Design Choice:**

- Smaller chunks → better retrieval precision
- Larger chunks → more surrounding context per chunk
- The final choice (at most 800 characters per chunk, with overlap) balances both
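
To make the size limit and the overlap concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is a simplification, not the recursive character splitting used in this project: it cuts at fixed character offsets and ignores paragraph and sentence separators. The function name `chunk_text` and the overlap of 100 characters are assumptions chosen for illustration; only the 800-character limit comes from the text above.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters (assumed value).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 2000  # stand-in for the text of one page
chunks = chunk_text(sample, chunk_size=800, overlap=100)
print([len(c) for c in chunks])
```

A recursive splitter would additionally prefer to cut at paragraph, line, and word boundaries before falling back to raw character offsets.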

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-base"
NEW_MODEL_DIR = f"{BASE_DIR}/models/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

tokenizer.save_pretrained(NEW_MODEL_DIR)
model.save_pretrained(NEW_MODEL_DIR)

print("FLAN-T5-BASE saved successfully")

FLAN-T5-BASE saved successfully


In [None]:
from langchain_community.document_loaders import PyPDFLoader

DATA_DIR = f"{BASE_DIR}/data"
PDF_PATH = f"{DATA_DIR}/nafees_resume_ai.pdf"  # change this

loader = PyPDFLoader(PDF_PATH)
documents = loader.load()

print(f"Loaded {len(documents)} pages")

Loaded 1 pages


### **Embedding Model (Semantic Representation)**

- Model : `sentence-transformers/all-MiniLM-L6-v2`

Each chunk $c_i$ is mapped to a dense vector:

$$\phi(c_i) \in \mathbb{R}^{384}$$

Semantic similarity is computed using cosine similarity:

$$\text{sim}(q, c_i) = \frac{\phi(q) \cdot \phi(c_i)}{\|\phi(q)\| \|\phi(c_i)\|}$$
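
As a small illustration of the formula above, the following pure-Python sketch computes cosine similarity for two toy vectors. The function name `cosine_similarity` and the 3-dimensional vectors are illustrative assumptions; the actual embeddings have 384 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


query_vec = [1.0, 0.0, 1.0]  # toy stand-in for phi(q)
chunk_vec = [1.0, 1.0, 0.0]  # toy stand-in for phi(c_i)
print(cosine_similarity(query_vec, chunk_vec))  # approximately 0.5
```

The score depends only on direction, not magnitude, so scaling either vector leaves it unchanged.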


### **Vector Database (FAISS)**

- Purpose: Efficient nearest-neighbor search over the chunk embeddings. FAISS supports both exact (flat) and Approximate Nearest Neighbor (ANN) indexes.

- Operation:
$\text{Retrieve } \arg\max_{c_i} \text{sim}(q, c_i)$

The top-$k$ chunks are selected:

$$C_q = \{c_{i_1}, c_{i_2}, ..., c_{i_k}\}$$
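
The top-$k$ selection can be sketched as a brute-force scan that scores every chunk against the query and keeps the $k$ best. This is only an illustration of the ranking step, not how the notebook retrieves: FAISS performs the search with optimized index structures. The name `top_k_chunks` and the 2-dimensional toy vectors are assumptions.

```python
import heapq
import math


def _cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_chunks(query_vec, chunk_vecs, k=4):
    """Return (index, score) pairs for the k most similar chunks, best first."""
    scored = ((i, _cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings: chunk 0 points the same way as the query, chunk 3 the opposite way.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

for index, score in top_k_chunks(query_vec, chunk_vecs, k=2):
    print(index, round(score, 3))
```

A full scan costs one similarity computation per chunk, which is fine for a one-page resume but is the step a vector index is meant to speed up on larger corpora.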

In [None]:
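from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# `chunks` must exist before indexing. The 800-character limit comes from the
# chunking strategy above; chunk_overlap=100 is an assumed value, not taken
# from the original notebook.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
chunks = splitter.split_documents(documents)

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Embed every chunk and build the FAISS index in memory.
vectorstore = FAISS.from_documents(chunks, embedding_model)

# Persist the index to Drive so it can be reloaded in a later session.
vectorstore.save_local(f"{BASE_DIR}/faiss_index")
print("New FAISS index created")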

New FAISS index created


In [None]:
from transformers import pipeline

# Greedy decoding (do_sample=False) is already deterministic, so no
# temperature is passed; sampling-only flags are ignored in this mode.
llm = pipeline(
    "text2text-generation",
    model=NEW_MODEL_DIR,
    tokenizer=NEW_MODEL_DIR,
    max_new_tokens=256,
    do_sample=False
)

print("FLAN-T5-BASE loaded (deterministic)")

Device set to use cpu


FLAN-T5-BASE loaded (deterministic)


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(NEW_MODEL_DIR)

MAX_INPUT_TOKENS = 480  # context budget, kept below T5's 512-token input length

In [None]:
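def answer_question(query, k=4):
    """Answer `query` from the top-k retrieved chunks, within a token budget."""
    docs = vectorstore.similarity_search(query, k=k)

    if not docs:
        return "The information is not available in the provided document."

    context = ""
    total_tokens = 0

    # Add chunks in ranked order until the context budget would be exceeded.
    # The budget covers only the context, not the instructions or the question.
    for doc in docs:
        doc_tokens = len(tokenizer.encode(doc.page_content, add_special_tokens=False))
        if total_tokens + doc_tokens > MAX_INPUT_TOKENS:
            break
        context += doc.page_content + "\n\n"
        total_tokens += doc_tokens

    # Refuse to answer when too little context was retrieved.
    if total_tokens < 50:
        return "The information is not available in the provided document."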
    prompt = f"""
You are a factual question-answering system.

RULES:
- Answer ONLY using the context.
- If the answer is not explicitly stated, say:
  "The information is not available in the provided document."
- Do NOT guess or infer.

Context:
{context}

Question:
{query}

Answer:
"""

    return llm(prompt)[0]["generated_text"].strip()

In [None]:
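# Note: this redefinition replaces the version above and is the one the calls
# below use. It joins all k chunks without enforcing MAX_INPUT_TOKENS.
def answer_question(query, k=4):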
    docs = vectorstore.similarity_search(query, k=k)

    if not docs:
        return "The information is not available in the provided document."

    context = "\n\n".join([doc.page_content for doc in docs])

    if len(context.strip()) < 200:
        return "The information is not available in the provided document."

    prompt = f"""
You are a factual question-answering system.

RULES:
- Answer ONLY using the context.
- If the answer is not explicitly stated, say:
  "The information is not available in the provided document."
- Do NOT guess or infer.

Context:
{context}

Question:
{query}

Answer:
"""

    return llm(prompt)[0]["generated_text"].strip()

In [None]:
print(answer_question("What is his name?"))
print(answer_question("What are his skills?"))
print(answer_question("How many stories are there?"))

Mohammad Nafees Iqbal
AI & MACHINE LEARNING •Machine Learning- Regression (Linear, Logistic), Classification (SVM, Naive Bayes), Decision Trees, Ensemble Methods (Random Forests, Gradient Boosting) (Random Forests, Gradient Boosting) •Deep Learning- Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transfer Learning, Hyper- parameter Tuning •Generative AI- Familiarity with Transformer Models, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Prompt Engineering techniques. •Mathematical Foundations- Statistical Modeling, Linear Algebra, Hypothesis Testing, Optimization TOOLS & LIBRARIES •Programming & Databases- Python, R, SQL, MySQL •ML/Data Science Libraries- TensorFlow, Keras, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn •Cloud & DevOps- AWS, GCP (F
The information is not available in the provided document.
