# TASK 4 — Context-Aware RAG Chatbot

**1. Problem Statement & Objective**

**Problem Statement:**

Standard chatbots can hallucinate and cannot answer questions about private documents that were not part of their training data.

**Objective:**

Build a Retrieval-Augmented Generation (RAG) chatbot grounded in custom PDFs.

**2. Dataset Loading & Preprocessing**

**Steps:**

1. Loaded PDF documents
2. Split text into chunks
3. Generated embeddings using SentenceTransformers
4. Stored vectors in FAISS
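
The chunking step can be illustrated with a small standard-library sketch. It assumes character-based windows of 500 characters with a 50-character overlap, the same defaults as `chunk_text` in the code cell further down. The helper name `chunk_spans` is hypothetical, and unlike `chunk_text` this sketch stops as soon as a window reaches the end of the text.

```python
def chunk_spans(text_len, chunk_size=500, overlap=50):
    """Return (start, end) character offsets of overlapping windows.

    Hypothetical helper for illustration; it only computes offsets.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    spans = []
    start = 0
    step = chunk_size - overlap  # each window advances by this many characters

    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:  # this window already covers the tail of the text
            break
        start += step

    return spans


# A 1200-character page yields three windows; neighbours share 50 characters
print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.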


**3. Model Development & Training**

**Pipeline:**

1. Convert documents to vectors
2. Store in vector database
3. Retrieve top-k relevant chunks
4. Generate answer using LLM

Because the LLM is prompted with the retrieved chunks, its answers are steered toward the content of the documents.
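
The retrieve-then-generate flow can be sketched without any external libraries. This is a toy illustration, not the notebook's implementation: the two-dimensional vectors and chunk strings are made up, and the helpers `squared_l2`, `top_k`, and `build_prompt` are hypothetical names. The notebook itself uses SentenceTransformer embeddings and a FAISS `IndexFlatL2` index, which likewise ranks by L2 distance.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunk vectors closest to the query, nearest first."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: squared_l2(query_vec, chunk_vecs[i]))
    return order[:min(k, len(chunk_vecs))]


def build_prompt(question, retrieved_chunks):
    """Place the retrieved chunks ahead of the question in the LLM prompt."""
    context = "\n\n".join(retrieved_chunks)
    return f"Context:\n{context}\n\nUser Question:\n{question}\n\nAnswer clearly:"


# Toy data (assumed for illustration only)
chunks = ["chunk A", "chunk B", "chunk C"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)  # [0, 2]
print(build_prompt("What is this about?", [chunks[i] for i in best]))
```

Only the selected chunks reach the generator, so the choice of `k` trades context coverage against prompt length.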

**4. Evaluation**

**Evaluated using:**

- Answer relevance
- Context awareness
- Multi-turn consistency
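
The notebook does not say how these criteria were scored. As one possible starting point, the sketch below computes a simple lexical heuristic: the fraction of distinct answer tokens that also occur in the retrieved context. The function names and the example strings are assumptions for illustration, and a word-overlap score like this is only a rough proxy for grounding, not a measure of correctness.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_ratio(answer, context):
    """Fraction of distinct answer tokens that also appear in the context.

    Hypothetical heuristic: 1.0 means every answer token occurs in the
    context, 0.0 means none do (or the answer is empty).
    """
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)


# Toy strings (assumed for illustration only)
context = "All tools used in the project are open-source and freely available."
print(support_ratio("The tools are open-source.", context))  # 1.0
print(support_ratio("It requires paid licenses.", context))  # 0.0
```

A low ratio flags answers worth reviewing by hand; a high ratio can still hide an answer that merely copies context without addressing the question.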

**5. Visualizations**

The chatbot is run through a Streamlit interface.

**6. Final Insights**

RAG improves chatbot reliability by combining information retrieval with generative AI.

In [None]:
# Install required libraries

!pip install -U \
transformers \
sentence-transformers \
faiss-cpu \
pypdf


Collecting transformers
  Downloading transformers-4.57.6-py3-none-any.whl.metadata (43 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Collecting pypdf
  Downloading pypdf-6.6.0-py3-none-any.whl.metadata (7.1 kB)
Downloading transformers-4.57.6-py3-none-any.whl (12.0 MB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Downloading pypdf-6.6.0-py3-none-any.whl (328 kB)

In [None]:
# Task 4: RAG chatbot

# Import required libraries

import os
import faiss
import numpy as np

from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
from transformers import pipeline

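# Load the dataset PDFs

def load_pdfs(folder_path):
    """Return the extracted text of every page of every PDF in folder_path."""
    texts = []

    # Sort file names so the page order is reproducible across runs
    for file in sorted(os.listdir(folder_path)):
        if file.lower().endswith(".pdf"):
            reader = PdfReader(os.path.join(folder_path, file))
            for page in reader.pages:
                # Fall back to an empty string if a page has no extractable text
                texts.append(page.extract_text() or "")

    return texts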


documents = load_pdfs("/content/pdfs")
print(f"Loaded {len(documents)} pages")

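# Chunk the page text

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap

    return chunks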

# Embed the chunks and store them in FAISS

all_chunks = []
for doc in documents:
    all_chunks.extend(chunk_text(doc))

print(f"Total chunks: {len(all_chunks)}")


embedder = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = embedder.encode(
    all_chunks,
    show_progress_bar=True
)

embeddings = np.array(embeddings).astype("float32")


dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

index.add(embeddings)

print("FAISS index ready")


generator = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_new_tokens=256
)

# Apply RAG

chat_history = []

def ask(question, k=3):
    # Embed question
    q_embedding = embedder.encode([question]).astype("float32")

    # Vector search: returns the k nearest chunks by L2 distance
    distances, indices = index.search(q_embedding, k)

    # Retrieve context (FAISS pads with -1 when the index holds fewer than k vectors)
    context = "\n\n".join([all_chunks[i] for i in indices[0] if i >= 0])

    # Format chat history
    history_text = ""
    for q, a in chat_history[-3:]:  # last 3 turns
        history_text += f"User: {q}\nBot: {a}\n"

    # Prompt
    prompt = f"""
You are a helpful assistant. Answer using the context and conversation history.

Conversation History:
{history_text}

Context:
{context}

User Question:
{question}

Answer clearly:
"""

    answer = generator(prompt)[0]["generated_text"]

    # Save memory
    chat_history.append((question, answer))

    return answer

# Verify results

print(ask("What is this document about?"))
print(ask("Explain it in simple words"))




Loaded 53 pages
Total chunks: 198


FAISS index ready


Device set to use cpu


PDC Project Documentation 16 Architecture Diagram: PDC Report Writing 30 Architecture Diagram: PDC Report Writing 6 Given the clear objectives and limited scope, the project can be completed within one academic semester. The use of established frameworks reduces development time, allowing sufficient time for performance evaluation and report writing. Therefore, the project is feasible within the given schedule constraints. 1.1.5 Specification Feasibility Specification feasibility examines whether the project requirements are clearly defined and achievable.
The project is economically f easible as it does not involve significant financial costs. All tools and technologies used in the project, such as Apache Spark, Python, and SQL, are open-source and freely available. The experiments can be conducted on existing laboratory systems or cloud-based platforms with minimal or no cost. Since there is no requirement for licensed software or proprietary hardware, the overall f User


In [None]:
# Install Streamlit

!pip install streamlit


Collecting streamlit
  Downloading streamlit-1.53.0-py3-none-any.whl.metadata (10 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.53.0-py3-none-any.whl (9.1 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Installing collected packages: pydeck, streamlit
Successfully installed pydeck-0.9.1 streamlit-1.53.0
