# CTSE Assignment 02

## General Introduction

---

**👤 IT Number:** IT21304088
**📛 Name:** SHABEER M.S.M

---

### 📘 Project Overview: CTSE Chatbot 🤖

This project is an **AI-powered chatbot** that answers questions about **CTSE (Current Trends in Software Engineering)** lecture notes. It runs in a **Google Colab (Jupyter Notebook)** environment and follows a **Retrieval-Augmented Generation (RAG)** architecture.

### 🔧 Key Features & Technologies:

* 📄 **PDF-Based QA**: Ingests lecture notes in PDF format for context-aware question answering.
* 🧠 **RAG Pipeline**: Combines semantic search with a generative model so that answers are grounded in the retrieved lecture-note chunks.
* 📦 **Chroma Vector Store**: Stores embeddings of lecture-note chunks for fast similarity search.
* 🌐 **Google Gemini 2.0 Flash**: Generates the final answer from the retrieved context.
* ⚡ **Caching**: Keeps answers in an in-memory dictionary so repeated queries skip the API call.
* 🐞 **Verbose Debugging Mode**: Appending `--verbose` to a question prints the source chunks and page numbers behind the answer.
* 📝 **Markdown-Inspired Output**: Clean and readable formatting for better user experience.

## PHASE 1: Dependency Installation

In [1]:
!pip install langchain langchain-community chromadb sentence-transformers pypdf langchain-google-genai gdown



## PHASE 2: Importing Libraries

In [2]:
"""
📦 Loads all essential libraries for:

📁 File handling & Google Drive integration
📄 PDF text extraction & document processing
🧠 Embeddings & vector database (Chroma) setup
🤖 LLM configuration (Google Gemini 2.0 Flash)
🔍 Building a Retrieval-Augmented Generation (RAG) pipeline

Everything you need to power the CTSE Chatbot! 🚀
"""

import os
import gdown
from google.colab import drive
from getpass import getpass
from langchain_community.document_loaders import PyPDFLoader  # langchain.document_loaders is deprecated
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_google_genai import ChatGoogleGenerativeAI
import google.generativeai as genai

## PHASE 3: Environment Setup

In [3]:
"""
🚀 Environment Setup & Initialization

This block performs the following tasks:
📂 1. Mounts Google Drive to enable persistent file storage across sessions.
📁 2. Defines important file paths for accessing and saving project data.
📥 3. Downloads the CTSE lecture notes PDF for processing.
🧠 4. Initializes an in-memory cache to store responses for repeated queries, improving performance.
"""

In [4]:
# Mount Google Drive for persistent Chroma database storage
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Define file paths and cache

# Path for downloaded PDF
DATA_PATH = "/content/CTSE_Lecture_Notes.pdf"
# Path for Chroma vector database
CHROMA_PATH = "/content/drive/MyDrive/chroma_db_ctse_gemini"
# Dictionary to store cached query responses
CACHE = {}
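
The cache above is a plain dictionary keyed by the cleaned-up query string. The sketch below shows one way such a cache can be consulted together with the `--verbose` flag handling used later in the chat loop. The helper names `parse_query` and `answer_with_cache`, and the stand-in `fake_chain`, are hypothetical and not part of the notebook; a real chain call would replace the stand-in.

```python
def parse_query(raw: str) -> tuple[str, bool]:
    """Strip the --verbose flag and collapse whitespace; return (query, verbose)."""
    verbose = "--verbose" in raw
    query = " ".join(raw.replace("--verbose", "").split())
    return query, verbose


def answer_with_cache(query: str, cache: dict, invoke) -> tuple[str, bool]:
    """Return (answer, was_cached), calling `invoke` only on a cache miss."""
    if query in cache:
        return cache[query], True
    cache[query] = invoke(query)
    return cache[query], False


calls = []  # records how often the (hypothetical) chain is actually invoked


def fake_chain(q: str) -> str:
    """Stand-in for the real RAG chain, used only for illustration."""
    calls.append(q)
    return f"answer to: {q}"


cache = {}
q, verbose = parse_query("  what   is docker?  --verbose ")
print(q, verbose)                                # what is docker? True
print(answer_with_cache(q, cache, fake_chain))   # ('answer to: what is docker?', False)
print(answer_with_cache(q, cache, fake_chain))   # ('answer to: what is docker?', True)
print(len(calls))                                # 1
```

Note that keys are compared exactly, so "What is docker?" and "what is docker?" would be cached as separate entries unless the query is also lower-cased.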

In [6]:
# Download lecture notes PDF from Google Drive
file_id = "1fy4CWBFzfS1cxMxRlPiRIVdS0e4zihzQ"
gdown.download(f"https://drive.google.com/uc?id={file_id}", DATA_PATH, quiet=True)

'/content/CTSE_Lecture_Notes.pdf'

In [7]:
# Verify PDF download
if not os.path.exists(DATA_PATH):
    print(f"Error: Failed to download lecture notes to {DATA_PATH}")
else:
    print(f"Lecture notes downloaded to {DATA_PATH}")

Lecture notes downloaded to /content/CTSE_Lecture_Notes.pdf
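
An existence check alone does not confirm that the downloaded file is really a PDF. A slightly stricter check, sketched here under the assumption that a failed download might leave an empty or non-PDF file behind, also verifies the file size and the `%PDF-` signature at the start of the file. The helper name `looks_like_pdf` is hypothetical and not part of the notebook.

```python
import os


def looks_like_pdf(path: str) -> bool:
    """Return True if `path` exists, is non-empty, and starts with the PDF signature."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        # Every PDF file begins with the bytes "%PDF-" followed by a version number
        return f.read(5) == b"%PDF-"
```

In the notebook this could be called as `looks_like_pdf(DATA_PATH)` before loading the document.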


In [8]:
# Set up Google API key for Gemini
print("Enter your Google API key for Gemini:")
api_key = getpass("API Key: ")
os.environ["GOOGLE_API_KEY"] = api_key
genai.configure(api_key=api_key)

Enter your Google API key for Gemini:
API Key: ··········


## PHASE 4: Load and Split Document

In [9]:
"""
📘 PDF Loading & Chunking

This section handles the following:
📄 1. Loads the CTSE lecture notes PDF into memory.
✂️ 2. Splits the document into manageable text chunks for processing.
📏 3. Uses a 1,000-character chunk size with a 200-character overlap to retain contextual continuity.
🎯 4. Enhances the accuracy of information retrieval during query handling.
"""

In [10]:
# Load PDF using PyPDFLoader
print("Loading lecture notes...")
loader = PyPDFLoader(DATA_PATH)
documents = loader.load()

Loading lecture notes...


In [11]:
# Verify document loading
if not documents:
    print("Error: No documents loaded from the PDF.")
else:
    print(f"Loaded {len(documents)} document pages.")

Loaded 408 document pages.


In [12]:
# Split documents into chunks for embedding
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Larger chunk size to retain context
    chunk_overlap=200  # Overlap to ensure continuity between chunks
)
docs = splitter.split_documents(documents)
print(f"Split into {len(docs)} chunks.")

Split into 384 chunks.
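
To illustrate what `chunk_size` and `chunk_overlap` mean, the sketch below splits a string with a fixed sliding window. This is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its real chunks will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = sliding_chunks(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, which is what lets a sentence that straddles a boundary remain intact in at least one chunk.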


## PHASE 5: Embeddings and Vector Store Creation

In [13]:
"""
🧠 Embedding Generation & Vector Storage

This block performs the following tasks:
🔍 1. Generates text embeddings using Google's `embedding-001` model.
📦 2. Stores the embeddings in a Chroma vector database for efficient retrieval.
💾 3. Persists the vector database to Google Drive to ensure data is retained across sessions.
🧲 4. Configures a retriever to fetch relevant document chunks based on similarity search.
"""

In [14]:
# Initialize embedding model
embedding_model_name = "models/embedding-001"
embeddings = GoogleGenerativeAIEmbeddings(model=embedding_model_name)
print(f"Initialized embeddings: {embedding_model_name}")

Initialized embeddings: models/embedding-001


In [15]:
# Load or create Chroma vector store
if os.path.exists(CHROMA_PATH):
    print(f"Loading vector store from {CHROMA_PATH}")
    vectorstore = Chroma(
        persist_directory=CHROMA_PATH,
        embedding_function=embeddings
    )
else:
    print(f"Creating vector store in {CHROMA_PATH}")
    vectorstore = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory=CHROMA_PATH
    )
print("Vector store created.")

Loading vector store from /content/drive/MyDrive/chroma_db_ctse_gemini


Vector store created.


In [16]:
# Configure retriever to fetch top 5 relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
print("Retriever configured.")

Retriever configured.
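
Conceptually, a retriever with `k=5` embeds the question and returns the five chunks whose vectors are closest to it. The sketch below shows the idea with hand-made three-dimensional vectors and cosine similarity. The vectors, the `top_k` helper, and the choice of cosine similarity are assumptions for illustration; the metric and index Chroma actually uses depend on its collection settings and are not reproduced here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the names of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings": real embedding vectors have hundreds of dimensions
toy_vectors = {
    "docker": [1.0, 0.1, 0.0],
    "kubernetes": [0.8, 0.6, 0.0],
    "agile": [0.0, 0.1, 1.0],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))  # ['docker', 'kubernetes']
```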


## PHASE 6: Language Model Initialization

In [17]:
"""
✨ Gemini Model Setup

This step includes:
🤖 1. Initializes the `gemini-2.0-flash` model for fast and efficient response generation.
⚙️ 2. Prepares the model for integration with the retrieval pipeline.
🚀 3. Enables high-performance question answering using retrieved context.
"""

In [18]:
# Initialize Gemini model
model_id = "gemini-2.0-flash"
print(f"Initializing model: {model_id}")
llm = ChatGoogleGenerativeAI(
    model=model_id,
    temperature=0.7,
    max_output_tokens=512,
    top_p=0.95,
)
print("LLM initialized.")

Initializing model: gemini-2.0-flash
LLM initialized.


## PHASE 7: RAG Chain Configuration

In [19]:
"""
🛠️ Sets up the Retrieval-Augmented Generation (RAG) pipeline:

🧾 Uses a custom prompt to ensure answers stay grounded in the provided context
🔍 Connects the retriever (for relevant chunks) with the LLM (for smart responses)
🤖 Enables context-aware, accurate answers based on lecture note content

Making your chatbot both intelligent and trustworthy! ✅
"""

In [20]:
# Define custom prompt for RAG
prompt = ChatPromptTemplate.from_template(
    "Context from Lecture Notes:\n"
    "{context}\n\n"
    "Based on the above context, provide a concise answer (up to 3 sentences) to the following question.\n"
    "Summarize relevant information (e.g., bullet points, definitions) and answer ONLY the question asked.\n"
    "If the context does not contain the answer, respond with: "
    # This fallback text must match the string the chat loop checks before serving a cached answer
    "\"I cannot answer this question based on the provided notes.\"\n\n"
    "Question: {input}"
)
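
To make the "stuff" step concrete: the retrieved chunks are joined into one string that replaces `{context}`, and the user's question replaces `{input}`. The sketch below imitates this with plain string formatting. The shortened template, the `build_prompt` helper, and the blank-line separator are assumptions for illustration, not a reproduction of LangChain's internals.

```python
# Shortened stand-in for the notebook's prompt template
TEMPLATE = (
    "Context from Lecture Notes:\n"
    "{context}\n\n"
    "Question: {input}"
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and substitute them, with the question, into the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, input=question)


prompt_text = build_prompt(
    ["Docker is a container engine.", "Containers are loosely isolated environments."],
    "What is Docker?",
)
print(prompt_text)
```

Because every retrieved chunk is placed in the prompt verbatim, the value of `k` and the chunk size together determine how much text the model receives per question.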

In [21]:
# Create document chain to process retrieved documents
document_chain = create_stuff_documents_chain(llm, prompt)

In [22]:
# Create RAG chain combining retriever and document chain
rag_chain = create_retrieval_chain(retriever, document_chain)
print("RAG chain created. Ready to answer questions.")

RAG chain created. Ready to answer questions.


## PHASE 8: Realtime Chat Loop

In [23]:
"""
💬 Implements an interactive chatbot loop:

⚡ Supports caching to speed up repeated queries
📝 Enables verbose mode to show source documents for transparency
🖋️ Uses Markdown-style formatting for clean, readable responses

Ask questions and get smart, context-aware answers instantly! 🚀
"""

In [24]:
# Print chatbot introduction
print("\n🚀========================================🚀")
print("         📚 Welcome to CTSE Bot 🤖         ")
print("===========================================")
print("💬 Ask me anything about your lecture notes!")
print("🔍 Add '--verbose' to view source details.")
print("❌ Type 'exit' anytime to leave the chat.\n")

# Main loop for user interaction
while True:
    query = input("❓ Question: ")

    # Handle exit command
    if query.strip().lower() == 'exit':
        print("\n🤖 Session ended. See you next time!\n")
        break

    # Handle verbose mode and normalize query
    verbose = False
    if '--verbose' in query:
        verbose = True
        query = query.replace('--verbose', '').strip()

    # Validate and normalize input
    query = ' '.join(query.split())  # Remove extra spaces
    if not query:
        print("\n⚠️ Error: Please enter a valid question.")
        continue

    # Log processing
    print("\n⏳ Processing...")

    try:
        # Serve from the cache unless the cached answer is the fallback response;
        # in that case fall through and re-invoke the chain once below
        cached = CACHE.get(query)
        if cached and cached['answer'].strip() != "I cannot answer this question based on the provided notes.":
            cached_answer = cached['answer']
            print(f"\nQuestion: {query}\n")
            print("Answer (Cached):")
            print(f"{cached_answer}\n")
            if verbose:
                print("📚 Source Documents (Cached):")
                for i, doc in enumerate(cached['context'], 1):
                    print(f"- Source {i} (Page: {doc.metadata.get('page', 'N/A')}):")
                    print(f"{doc.page_content[:300]}{'...' if len(doc.page_content) > 300 else ''}")
                    print("-" * 100)
            print("\n🤖 Need more help? Enter your next question or type 'exit' to sign off.\n")
            continue

        # Invoke RAG chain
        result = rag_chain.invoke({"input": query})

        # Cache response
        CACHE[query] = {
            'answer': result['answer'],
            'context': result['context']
        }

        # Print refined terminal-friendly output
        print(f"\nQuestion: {query}\n")
        print("Answer:")
        print(f"{result['answer']}\n")
        if verbose:
            print("📚 Source Documents:\n")
            for i, doc in enumerate(result['context'], 1):
                print(f"- Source {i} (Page: {doc.metadata.get('page', 'N/A')}):")
                print(f"{doc.page_content[:300]}{'...' if len(doc.page_content) > 300 else ''}")
                print("" + "-" * 100)
        print("\n🤖 I'm ready! Ask your next question or type 'exit' to sign off.\n")

    except Exception as e:
        print(f"\n⚠️ Error: {e}")
        print("Please check your input and ensure the PDF is accessible.")


         📚 Welcome to CTSE Bot 🤖         
💬 Ask me anything about your lecture notes!
🔍 Add '--verbose' to view source details.
❌ Type 'exit' anytime to leave the chat.

❓ Question: what is docker?

⏳ Processing...

Question: what is docker?

Answer:
Docker is a container engine that packages and runs applications in loosely isolated environments. It provides tooling and a platform to manage the lifecycle of containers, allowing developers to develop, distribute, test, and deploy applications as containers. Docker as a concept can refer to a company, product, platform, CLI tool, or computer program.


🤖 I'm ready! Ask your next question or type 'exit' to sign off.

❓ Question: What is kubernetes?

⏳ Processing...

Question: What is kubernetes?

Answer:
Kubernetes (k8s) is an open-source platform for automating deployment, scaling, and management of containers at scale. It's a container orchestration platform originally created by Google and now managed by the CNCF. The current stable 