# Retrieval-Augmented Generation (RAG) Chatbot over Course Materials

## Project Overview
This project implements a **Retrieval-Augmented Generation (RAG)** application that lets users ask questions about university course materials (PDF, TXT, and Markdown files).

The system retrieves relevant document chunks using semantic search and generates answers using an open-source Hugging Face language model.
If the answer is not present in the documents, the model is instructed to respond **"I don't know"** to reduce hallucinations.


##  System Architecture

The RAG system follows a standard pipeline:

1. Loading PDF and text documents
2. Text chunking
3. Embedding with Sentence-Transformers
4. Vector storage using Chroma
5. Retrieval of relevant chunks
6. Answer generation using a Hugging Face LLM

This architecture separates **retrieval** from **generation**, so answers can be grounded in the retrieved text rather than in the model's internal knowledge alone.
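
The separation of the two stages can be illustrated with a minimal, library-free sketch. The names `rag_answer`, `toy_retrieve`, and `toy_generate` are hypothetical stand-ins and do not appear in the notebook; the word-overlap retriever and the echo "generator" only mimic the roles of the Chroma retriever and the LLM. The early refusal on an empty retrieval is an assumption of this sketch, not something the notebook's chain does.

```python
from typing import Callable, List

REFUSAL = "I don't know"


def rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Run retrieval first, then generate from the retrieved chunks only."""
    chunks = retrieve(question)
    if not chunks:
        # Assumption of this sketch: refuse when nothing was retrieved.
        return REFUSAL
    return generate(question, chunks)


def toy_retrieve(question: str) -> List[str]:
    """Hypothetical retriever: keep chunks sharing at least one word with the question."""
    corpus = [
        "An intelligent agent perceives its environment and acts on it.",
        "Gradient descent minimizes a loss function step by step.",
    ]
    words = set(question.lower().split())
    return [chunk for chunk in corpus if words & set(chunk.lower().split())]


def toy_generate(question: str, chunks: List[str]) -> str:
    """Hypothetical generator: echo the first chunk instead of calling a model."""
    return chunks[0]


print(rag_answer("What is an intelligent agent", toy_retrieve, toy_generate))
print(rag_answer("xyz", toy_retrieve, toy_generate))
```

Because `rag_answer` only depends on two callables, either stage can be swapped (a different retriever, a different model) without touching the other.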


In [None]:
!pip install -q \
  langchain==0.1.16 \
  langchain-community==0.0.36 \
  langchain-core==0.1.48 \
  langchain-text-splitters==0.0.1 \
  chromadb sentence-transformers transformers pypdf accelerate gradio


In [None]:
from pathlib import Path
from google.colab import drive

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.llms import HuggingFacePipeline

from langchain.chains import RetrievalQA

print("RetrievalQA imported successfully ✅")


##  Document Collection

Course materials are stored in a Google Drive folder; every PDF, TXT, and Markdown file under it is loaded, including files in subfolders.


In [None]:
drive.mount("/content/drive")

DATA_DIR = "/content/drive/MyDrive/Course_Materials_RAG"
CHROMA_DIR = "/content/chroma_db"

Path(DATA_DIR).mkdir(parents=True, exist_ok=True)
Path(CHROMA_DIR).mkdir(parents=True, exist_ok=True)

In [None]:
# Load every supported file under `folder` (searched recursively).
def load_documents(folder):
    docs = []
    for file in Path(folder).glob("**/*"):
        suffix = file.suffix.lower()
        if suffix == ".pdf":
            # PyPDFLoader returns one Document per PDF page.
            docs.extend(PyPDFLoader(str(file)).load())
        elif suffix in [".txt", ".md"]:
            docs.extend(TextLoader(str(file), encoding="utf-8").load())
    return docs

documents = load_documents(DATA_DIR)
# The count is per page for PDFs and per file for text files.
print(f"Loaded {len(documents)} documents")


##  Text Chunking

Documents are split into chunks of at most 800 characters with a 150-character overlap, so that text near a chunk boundary keeps some of its surrounding context.
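
To show what chunk size and overlap mean, here is a simplified sketch of fixed-window chunking. It is not what `RecursiveCharacterTextSplitter` does internally: that splitter first tries to break at natural separators such as paragraph and line breaks, while this hypothetical `sliding_chunks` helper cuts purely by character position. The small sizes are chosen only to keep the output readable.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, which is the role the 150-character overlap plays in the notebook.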


In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150
)

splits = text_splitter.split_documents(documents)
print(f"Created {len(splits)} chunks")

##  Embeddings and Vector Database

Chunks are embedded using a Sentence-Transformer model and stored in Chroma.
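
Semantic search over embeddings comes down to ranking chunk vectors by their similarity to the query vector. The sketch below uses cosine similarity on made-up 3-dimensional vectors; real `all-mpnet-base-v2` embeddings have 768 dimensions. The helpers `cosine_similarity` and `top_k` are hypothetical, and Chroma's distance metric is configurable and may not be cosine by default, so this is an illustration of the idea rather than of Chroma's internals.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for embedded chunks and an embedded question.
chunk_vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vector = [1.0, 0.1, 0.0]

for index, score in top_k(query_vector, chunk_vectors, k=2):
    print(index, round(score, 3))
```

Chunk 0 points in almost the same direction as the query and ranks first; chunk 2 is orthogonal to it and is left out when `k=2`.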

In [None]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

In [None]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=CHROMA_DIR
)

# Return the 3 most similar chunks for each query.
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

##  Language Model

An open-source instruction-tuned Hugging Face model is used for generation.


In [None]:
LLM_NAME = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(LLM_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(LLM_NAME)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    do_sample=False
)

llm = HuggingFacePipeline(pipeline=pipe)

##  Hallucination Control

The prompt instructs the model to answer **only from the retrieved context** and to reply "I don't know" when the context does not contain the answer.
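
How the retrieved chunks reach the model can be sketched without LangChain: the chunks are concatenated and substituted into the `{context}` placeholder, and the user's question into `{question}`. The `TEMPLATE` below is a shortened stand-in for the notebook's prompt, and `build_prompt` is a hypothetical helper; joining chunks with a blank line is assumed here to mirror the "stuff" strategy.

```python
from typing import List

# Shortened stand-in for the notebook's prompt; only the two placeholders matter here.
TEMPLATE = (
    "Answer the QUESTION using ONLY the CONTEXT below.\n"
    'If the answer is not in the CONTEXT, reply exactly: "I don\'t know"\n\n'
    "CONTEXT:\n{context}\n\n"
    "QUESTION:\n{question}\n\n"
    "ANSWER:\n"
)


def build_prompt(chunks: List[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


print(build_prompt(
    ["An agent perceives its environment.", "An agent acts through actuators."],
    "What does an agent do?",
))
```

The instruction lives entirely in the prompt text, so how well the model follows it depends on the model; the prompt reduces out-of-context answers but cannot guarantee their absence.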


In [None]:
from langchain.prompts import PromptTemplate

RAG_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""
You are an academic assistant.
Answer the QUESTION using ONLY the CONTEXT below.

Rules:
- Answer in the same language as the context.
- Do NOT use external knowledge.
- Do NOT invent information.
- If the answer is NOT explicitly contained in the CONTEXT, reply exactly:
  "I don't know"

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""
)


In [None]:
# Core RAG chain: the "stuff" chain type places all retrieved chunks into a single prompt.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": RAG_PROMPT},
    return_source_documents=False
)

##  System Evaluation
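
The notebook evaluates the system with a single example question. One possible way to extend this, sketched below under stated assumptions, is to run a small labeled set of in-scope and out-of-scope questions and measure how often the system answers or refuses as expected. The helpers `is_refusal` and `evaluate`, the `fake_ask` stub, and the test questions are all hypothetical; in the notebook, `ask_rag` would be passed in place of `fake_ask`. This checks refusal behavior only, not whether in-scope answers are correct.

```python
from typing import Callable, Dict, List, Tuple


def is_refusal(answer: str) -> bool:
    """Detect the refusal string, ignoring case, surrounding quotes, and final punctuation."""
    normalized = answer.strip().strip("\"'").rstrip(".!").lower()
    return normalized == "i don't know"


def evaluate(ask: Callable[[str], str], cases: List[Tuple[str, bool]]) -> Dict[str, float]:
    """cases holds (question, in_scope) pairs; returns answer and refusal rates."""
    n_in = sum(1 for _, in_scope in cases if in_scope)
    n_out = len(cases) - n_in
    answered_in_scope = 0
    refused_out_of_scope = 0
    for question, in_scope in cases:
        refused = is_refusal(ask(question))
        if in_scope and not refused:
            answered_in_scope += 1
        elif not in_scope and refused:
            refused_out_of_scope += 1
    return {
        "in_scope_answer_rate": answered_in_scope / n_in if n_in else 0.0,
        "out_of_scope_refusal_rate": refused_out_of_scope / n_out if n_out else 0.0,
    }


def fake_ask(question: str) -> str:
    """Stub standing in for ask_rag so the sketch runs without a model."""
    if "agent" in question.lower():
        return "An agent perceives its environment and acts on it."
    return "I don't know"


cases = [
    ("C'est quoi un agent intelligent ?", True),
    ("Who won the 2018 World Cup?", False),
]
print(evaluate(fake_ask, cases))
```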


In [None]:
def ask_rag(question):
    return qa_chain.run(question)

In [None]:
print(ask_rag("C'est quoi un agent intelligent ?"))

In [None]:
import gradio as gr

def chat_rag(user_message, history):
    """
    Simple RAG chat:
    - Question
    - Retrieval
    - Strict answer from documents
    """
    try:
        answer = qa_chain.run(user_message)
    except Exception as e:
        answer = f"Error: {str(e)}"

    history = history + [(user_message, answer)]
    return history, history, ""

with gr.Blocks(title="📘 Course RAG Assistant") as demo:

    gr.Markdown("""
    # 📘 Course RAG Assistant
    """)

    chatbot = gr.Chatbot(height=400)

    with gr.Row():
        msg = gr.Textbox(
            placeholder="Ask a question about the course...",
            show_label=False
        )

    with gr.Row():
        send = gr.Button("Send")
        clear = gr.Button("Clear")

    send.click(
        chat_rag,
        inputs=[msg, chatbot],
        outputs=[chatbot, chatbot, msg]
    )

    msg.submit(
        chat_rag,
        inputs=[msg, chatbot],
        outputs=[chatbot, chatbot, msg]
    )

    clear.click(lambda: ([], ""), outputs=[chatbot, msg])

demo.launch()

##  Conclusion

This project demonstrates a complete vanilla RAG pipeline built from open-source tools.
By combining semantic retrieval with a prompt that restricts the model to the retrieved context, the system aims to give answers grounded in the course materials.

**Limitations:**  
The system depends on the quality and coverage of the provided documents; questions outside that scope are meant to be answered with "I don't know".

