# Multilingual Retrieval-Augmented Generation (RAG) System

Welcome! This notebook implements a simple RAG pipeline capable of answering queries in both Bangla and English by retrieving relevant information from a PDF corpus and generating grounded answers.

---

### How to Use This Notebook

1. **Run all cells sequentially from top to bottom** — this will install dependencies, load and process the PDF, build the knowledge base, start the API server, and set up the chat client.

2. After running all setup cells, **use the last cell to interact with the chat client.**  
   You can ask questions in Bangla or English and receive answers grounded in the document.

---

> Make sure to follow the instructions in each cell carefully. If you upload a new PDF, rerun the knowledge base build cell.

> If you only want to evaluate the notebook, you can simply review the saved cell outputs.

Enjoy exploring the RAG system!


**Required packages to install**

In [1]:
!pip install pdfplumber sentence-transformers chromadb transformers accelerate groq
!pip install fastapi nest-asyncio pyngrok uvicorn python-multipart
!pip install pytesseract pdf2image



In [2]:
!apt-get install tesseract-ocr -y
!apt-get install tesseract-ocr-ben -y
!apt-get install poppler-utils -y


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr-ben is already the newest version (1:4.00~git30-7274cfa-1.1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [3]:
!ngrok authtoken "<paste your ngrok authtoken here>"

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


**Necessary imports**

In [4]:
import pdfplumber
import re
from sentence_transformers import SentenceTransformer
import chromadb
from transformers import pipeline
import torch
import shutil
import os
import pytesseract
from pdf2image import convert_from_path
from groq import Groq
from typing import List
from google.colab import files
from fastapi import FastAPI, Request, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from collections import defaultdict, deque
import uvicorn
from pyngrok import ngrok
import nest_asyncio
import threading

### Text Extraction and Cleaning

This cell defines two important functions for processing the PDF documents:

- **`extract_text_from_pdf(pdf_path)`**:  
  Converts each page of the PDF into an image and then uses Tesseract OCR to extract Bengali text from those images. This is especially useful for scanned PDFs or PDFs without selectable text.

- **`clean_text(text)`**:  
  Cleans the raw extracted text by removing unwanted patterns such as page headers, question numbering, multiple-choice options, answer hints, and non-Bangla characters. It then collapses all whitespace, including line breaks, into single spaces. This cleaning helps improve the quality of the data before chunking and embedding for retrieval.


In [5]:
def extract_text_from_pdf(pdf_path):
    images = convert_from_path(pdf_path)

    texts = []
    for i, img in enumerate(images):
        text = pytesseract.image_to_string(img, lang='ben')
        texts.append(text)


    return texts



def clean_text(text):

    text = re.sub(r'=+ Page \d+ =+', '', text)
    text = re.sub(r'\b[০-৯]{1,3}।', '', text)
    text = re.sub(r'[\(\[]?[কখগঘ][)\].]', '', text)
    text = re.sub(r'[কখগঘ][)\.:\s]', '', text)
    text = re.sub(r'উ[ত্তরঃ:\s]*[কখগঘ]', '', text)
    text = re.sub(r'^\s*[কখগঘ][)\.]', '', text, flags=re.MULTILINE)
    text = re.sub(r'[^\u0980-\u09FF০-৯\s.,!?;:\-]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()

    return text
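
Because the final `\s+` substitution in `clean_text` also replaces newlines, blank-line paragraph boundaries do not survive cleaning. The sketch below shows one possible alternative for that last step; `normalize_keep_paragraphs` is a hypothetical helper for illustration and is not part of the notebook.

```python
import re

def normalize_keep_paragraphs(text: str) -> str:
    """Collapse whitespace inside paragraphs but keep blank-line breaks.

    Hypothetical helper; the notebook's clean_text collapses everything.
    """
    # Split on blank lines, the same boundary a paragraph chunker looks for.
    paragraphs = re.split(r'\n\s*\n', text)
    # Inside each paragraph, squeeze every whitespace run to a single space.
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    # Drop empty paragraphs and rejoin with exactly one blank line.
    return "\n\n".join(p for p in cleaned if p)

sample = "প্রথম  লাইন\nএকই অনুচ্ছেদ\n\n\nদ্বিতীয়   অনুচ্ছেদ"
print(normalize_keep_paragraphs(sample))
```

With this variant, text passed on to a paragraph-based chunker would still contain the `\n\n` separators it splits on.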


### Paragraph-Based Chunking

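Splits text into paragraphs at blank lines and merges consecutive paragraphs into chunks of up to 200 words.  
A single paragraph longer than the limit is kept whole, so a chunk can exceed 200 words. The resulting chunks are what get embedded and retrieved.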


In [6]:
def chunk_by_paragraph(text, max_words=200):
    paragraphs = re.split(r'\n\s*\n', text)

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue

        if len(current_chunk.split()) + len(para.split()) <= max_words:
            current_chunk += " " + para
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks
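
`chunk_by_paragraph` never splits inside a paragraph, so very long paragraphs pass through as oversized chunks. A minimal sketch of one way to cap chunk size, assuming a sliding word window is acceptable for the corpus; `split_by_words` and its `overlap` parameter are hypothetical names, not part of the notebook.

```python
from typing import List

def split_by_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears in full in one of the two chunks.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

demo = " ".join(f"w{i}" for i in range(10))
print(split_by_words(demo, max_words=4, overlap=1))
```

One possible use is to apply it only to chunks whose word count exceeds `max_words`, leaving shorter paragraph chunks untouched.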

### Initialize Embedding Model and Vector Store

- Load the multilingual SentenceTransformer model (`intfloat/multilingual-e5-base`) for text embeddings.
- Set up a Chroma vector database client.
- Delete any existing collection named `pdf_collection`.
- Create a new collection to store document embeddings.


In [7]:
embed_model_name = "intfloat/multilingual-e5-base"
embed_model = SentenceTransformer(embed_model_name)

client = chromadb.Client()
collection_name = "pdf_collection"

if collection_name in [col.name for col in client.list_collections()]:
    client.delete_collection(collection_name)

collection = client.create_collection(name=collection_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
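
The model card for the E5 family asks for inputs to be prefixed with `query: ` (for search queries) or `passage: ` (for indexed text), including for non-English text. The notebook encodes raw strings, so the following is only a sketch of how the prefixes could be added; `as_passages` and `as_query` are hypothetical helper names.

```python
from typing import List

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def as_passages(chunks: List[str]) -> List[str]:
    """Prefix document chunks before they are embedded for indexing."""
    return [PASSAGE_PREFIX + chunk for chunk in chunks]

def as_query(question: str) -> str:
    """Prefix a user question before it is embedded for search."""
    return QUERY_PREFIX + question.strip()

print(as_passages(["প্রথম অংশ", "second part"]))
print(as_query("  who is the writer of the story?  "))
```

Under this assumption, the prefixed strings would be passed to `embed_model.encode(...)` while the original, unprefixed chunks are stored as the documents.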


### Upload PDF and Build Knowledge Base

- Upload your PDF file by running the next code cell.  
- A file picker will appear; select the PDF document to upload (e.g., `HSC26-Bangla1st-Paper.pdf`).  
- The uploaded file is automatically saved in the `/content` directory in Colab.  

This code will:  
- Extract text from the uploaded PDF using OCR,  
- Clean each page and split it into chunks (because `clean_text` removes line breaks, each non-empty page ends up as a single chunk, which is why the run below reports 49 pages and 49 chunks),  
- Generate embeddings for each chunk using the SentenceTransformer model,  
- Store the embeddings and chunks in the Chroma vector database to create the knowledge base.  




In [8]:
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]
pdf_path = os.path.join("/content", pdf_path)

def build_knowledge_base(pdf_path):
    pages = extract_text_from_pdf(pdf_path)
    print(f"Extracted {len(pages)} pages")
    all_chunks = []
    for page in pages:
        cleaned = clean_text(page)
        chunks = chunk_by_paragraph(cleaned)
        all_chunks.extend(chunks)

    print(f"Created {len(all_chunks)} chunks")

    embeddings = embed_model.encode(all_chunks).tolist()

    collection.add(
        documents=all_chunks,
        embeddings=embeddings,
        ids=[f"chunk_{i}" for i in range(len(all_chunks))]
    )
    print("Knowledge base built and stored in Chroma.")

build_knowledge_base(pdf_path)


Saving HSC26-Bangla1st-Paper.pdf to HSC26-Bangla1st-Paper.pdf
Extracted 49 pages
Created 49 chunks
Knowledge base built and stored in Chroma.
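
The build step flattens all pages into one list, so a chunk's source page is not recorded. A sketch of how page numbers could be carried along, assuming they would be supplied through the `metadatas` argument of `collection.add`; `label_chunks` and the ID format are hypothetical.

```python
from typing import Dict, List, Tuple

def label_chunks(
    pages_chunks: List[List[str]],
) -> Tuple[List[str], List[str], List[Dict[str, int]]]:
    """Flatten per-page chunk lists into documents, ids, and metadata.

    pages_chunks[i] holds the chunks extracted from page i + 1.
    """
    documents, ids, metadatas = [], [], []
    for page_no, chunks in enumerate(pages_chunks, start=1):
        for chunk_no, chunk in enumerate(chunks):
            documents.append(chunk)
            # The ID encodes the page so it stays unique and traceable.
            ids.append(f"page_{page_no}_chunk_{chunk_no}")
            metadatas.append({"page": page_no})
    return documents, ids, metadatas

docs, ids, metas = label_chunks([["a", "b"], [], ["c"]])
print(ids)
print(metas)
```

The three lists line up index by index, which is the shape `collection.add(documents=..., ids=..., metadatas=...)` expects alongside the embeddings.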


### Querying the Knowledge Base

This function takes a user query, converts it to an embedding, and searches the vector database (`Chroma`) for the most relevant document chunks.

- `top_k`: Number of top results to return.
- `min_score`: Minimum similarity score threshold to filter out low-relevance results.

Chroma returns distances (smaller means closer), not similarities, so the function converts each distance to a similarity before applying the threshold. The conversion assumes the collection's default squared-L2 distance and unit-length embeddings. Only chunks whose similarity is at least `min_score` are returned.


In [9]:
def query_knowledge_base(query, top_k=5, min_score=0.3):
    query_emb = embed_model.encode([query]).tolist()

    results = collection.query(
        query_embeddings=query_emb,
        n_results=top_k,
        include=["documents", "distances"]
    )

    documents = results.get("documents", [[]])[0]
    distances = results.get("distances", [[]])[0]

    # Chroma reports distances, where lower means more similar.
    # Assumption: default squared-L2 space and unit-length embeddings,
    # so cosine similarity = 1 - distance / 2.
    filtered = [
        doc for doc, dist in zip(documents, distances)
        if 1 - dist / 2 >= min_score
    ]

    return filtered
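

Chroma's `query` reports distances, where a smaller value means a closer match. Assuming the collection uses the default squared Euclidean (`l2`) space and the embeddings have unit length, a distance d and a cosine similarity s are related by d = 2 - 2s. The sketch below checks that relation with plain Python; all helper names are hypothetical.

```python
import math
from typing import List

def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def distance_to_similarity(distance: float) -> float:
    """Convert a squared-L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])
d = squared_l2(a, b)
print(d, distance_to_similarity(d), cosine(a, b))
```

If the embeddings are not unit length, or the collection was created with a different distance space, this conversion does not apply.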


### Setting Up the LLM Client

- Initialize the Groq client using your API key.
- Specify the language model to be used for generating answers.

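This setup connects your code to the Groq API and selects the `llama-4-scout-17b-16e-instruct` model for multilingual question answering. Note that the cell reassigns the name `client`, which previously referred to the Chroma client; the `collection` object created earlier is unaffected.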


In [None]:
GROQ_API_KEY = "<paste your api key here>"

client = Groq(api_key=GROQ_API_KEY)

model_name = "meta-llama/llama-4-scout-17b-16e-instruct"


### Generating Answers with Short-Term Memory

- Maintains a short-term conversation history of the last 3 exchanges using a `deque`.
- Constructs a prompt combining retrieved context and recent conversation history.
- Sends the prompt to the Groq LLM, instructing it to answer **in the same language as the query**.
- If the answer is not present in the context, the model is instructed to reply with "I don't know" in the query's language.
- Updates the short-term memory with the current query and response to maintain context across turns.


In [12]:
from collections import deque

SHORT_TERM_MEMORY = deque(maxlen=3)

def generate_answer_groq(query, retrieved_chunks):
    context = "\n".join(retrieved_chunks)
    history_prompt = ""
    if SHORT_TERM_MEMORY:
        history_prompt = "\n\nRecent Conversation History:\n" + "\n".join(
            [f"User: {q}\nAssistant: {a}" for q, a in SHORT_TERM_MEMORY]
        )
    prompt = f"""Read the context below and recent conversation history to answer the question **in the same language as the question**.
If the answer is not present, say "I don't know" in the query's language.

Context:
{context}
{history_prompt}

Question: {query}
Answer:"""
    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that replies in the same language as the question."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.4,
        max_tokens=1024,
    )
    answer = response.choices[0].message.content.strip()
    SHORT_TERM_MEMORY.append((query, answer))
    return answer
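
The short-term memory relies on `deque(maxlen=3)` discarding the oldest exchange automatically. A small standalone sketch of that behavior and of the history text that ends up in the prompt; `format_history` is a hypothetical helper that mirrors the formatting used above, and the questions and answers are placeholders.

```python
from collections import deque

def format_history(memory) -> str:
    """Render (question, answer) pairs the way the prompt expects them."""
    return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in memory)

memory = deque(maxlen=3)
for i in range(1, 6):
    # Once three items are stored, each append drops the oldest one.
    memory.append((f"question {i}", f"answer {i}"))

print(format_history(memory))
```

Only exchanges 3 to 5 remain after five appends, so the prompt never carries more than three previous turns.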


### Sample Queries and Answers

This cell runs a set of example queries in both Bangla and English against the RAG system.  
For each query, it retrieves relevant document chunks and generates an answer using the Groq model.  
If no relevant context is found, it prints a Bangla message saying that no relevant information was found instead of calling the model.


In [19]:
queries = [
    "অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?",
    "কাকে অনুপমের ভাগ্য দেবতা বলে উল্লেখ করা হয়েছে?",
    "বিয়ের সময় কল্যাণীর প্রকৃত বয়স কত ছিল?",
    "who is the writer of the story?",
    "when was the writer was born?",
    "father name of kollani?"
]

for query in queries:
    retrieved_chunks = query_knowledge_base(query)
    if not retrieved_chunks:
        print(f" Q: {query}")
        print("প্রাসঙ্গিক তথ্য পাওয়া যায়নি।\n")
        continue

    answer = generate_answer_groq(query, retrieved_chunks)
    print(f" প্রশ্ন: {query}")
    print(f"উত্তর: {answer}\n")


 প্রশ্ন: অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?
উত্তর: গজাননের কার্তিকেয়কে

 প্রশ্ন: কাকে অনুপমের ভাগ্য দেবতা বলে উল্লেখ করা হয়েছে?
উত্তর: মামাকে

 প্রশ্ন: বিয়ের সময় কল্যাণীর প্রকৃত বয়স কত ছিল?
উত্তর: ১৬ বছর

 প্রশ্ন: who is the writer of the story?
উত্তর: রবীন্দ্রনাথ ঠাকুর

 প্রশ্ন: when was the writer was born?
উত্তর: মে ৭, ১৮৬১

 প্রশ্ন: father name of kollani?
উত্তর: শস্তুনাথ সেন



### REST API Server for Multilingual RAG

This cell sets up a FastAPI application exposing a `/ask` endpoint.  
- Accepts POST requests with a JSON payload containing `query` (user question) and optional `session_id` (for session-based memory).  
- Maintains short-term conversational memory (last 3 exchanges) per session.  
- Retrieves relevant chunks from the knowledge base and generates an answer grounded in the context and conversation history.  
- Uses the Groq model for multilingual answer generation.  
- Returns the answer along with the session history as JSON.  
- CORS is enabled to allow requests from any origin.

This API enables integration with frontend clients or other applications for real-time Q&A interaction.


In [14]:
app = FastAPI(title="Multilingual RAG API")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

session_histories = defaultdict(lambda: deque(maxlen=3))

class QueryRequest(BaseModel):
    query: str
    session_id: str = "default"

class QueryResponse(BaseModel):
    answer: str
    session_id: str
    history: list

@app.post("/ask", response_model=QueryResponse)
async def ask_question(request: Request, query_req: QueryRequest):
    try:
        retrieved_chunks = query_knowledge_base(query_req.query)

        memory = session_histories[query_req.session_id]

        answer = generate_answer_with_memory(
            query_req.query,
            retrieved_chunks,
            memory
        )

        memory.append((query_req.query, answer))

        history = [{"query": q, "answer": a} for q, a in memory]

        return QueryResponse(
            answer=answer,
            session_id=query_req.session_id,
            history=history
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

def generate_answer_with_memory(query, retrieved_chunks, memory):
    context = "\n".join(retrieved_chunks) if retrieved_chunks else "No relevant context found."

    history_context = ""
    if memory:
        history_context = "\n\nConversation History:\n" + "\n".join(
            f"User: {q}\nAssistant: {a}" for q, a in memory
        )

    prompt = f"""Answer the question based on the context and conversation history.
Respond in the SAME LANGUAGE as the question. If the answer isn't in the context,
say "I don't know" in the question's language.

Context:
{context}
{history_context}

Question: {query}
Answer:"""

    response = client.chat.completions.create(
        model=model_name,
        messages=[
            {"role": "system", "content": "You are a helpful multilingual assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.4,
        max_tokens=1024,
    )

    return response.choices[0].message.content.strip()

print("FastAPI application created with /ask endpoint")

FastAPI application created with /ask endpoint
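
The per-session memory comes from `defaultdict(lambda: deque(maxlen=3))`: the first access to an unseen `session_id` creates an empty bounded history for it. A standalone sketch of that pattern, without FastAPI; `record` is a hypothetical helper and the session names are placeholders.

```python
from collections import defaultdict, deque

# Each new session ID gets its own history capped at three exchanges.
histories = defaultdict(lambda: deque(maxlen=3))

def record(session_id: str, query: str, answer: str) -> list:
    """Store one exchange and return the session history as JSON-ready dicts."""
    histories[session_id].append((query, answer))
    return [{"query": q, "answer": a} for q, a in histories[session_id]]

record("alice", "q1", "a1")
record("bob", "q1", "b1")
alice_history = record("alice", "q2", "a2")
print(alice_history)
print(len(histories["bob"]))
```

Sessions do not see each other's exchanges. The dictionary itself is never pruned, so in this pattern the number of stored sessions grows with every new `session_id`.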


### Exposing the FastAPI Server Publicly with Ngrok

This cell starts the FastAPI server on port 8000 inside a background thread and exposes it publicly using Ngrok.  
- `ngrok.connect(8000)` creates a public URL tunnel to your local server.  
- `uvicorn.run` launches the FastAPI app.  
- Running the server in a separate thread allows the notebook to remain interactive.  
- The public URL is printed so you can access the API endpoint `/ask` from anywhere.

Use this URL to send POST requests or interact with the RAG system via the REST API.


In [15]:
public_url = ngrok.connect(8000).public_url
print(f"Public URL: {public_url}")


def run_server():
    uvicorn.run(app, host="0.0.0.0", port=8000)

thread = threading.Thread(target=run_server, daemon=True)
thread.start()

print("Server is running! Use the public URL to access the API")
print(f"API endpoint: {public_url}/ask")

Public URL: https://ea1fff4faa21.ngrok-free.app
Server is running! Use the public URL to access the API
API endpoint: https://ea1fff4faa21.ngrok-free.app/ask
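
A POST request to `/ask` carries a JSON body with `query` and, optionally, `session_id`. The sketch below builds such a request with the standard library only; `build_ask_request` is a hypothetical helper, and `http://localhost:8000` is an assumed base URL that stands in for the public URL printed above.

```python
import json
import urllib.request

def build_ask_request(base_url: str, query: str,
                      session_id: str = "default") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /ask endpoint."""
    body = json.dumps({"query": query, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ask_request("http://localhost:8000", "who is the writer of the story?")
print(req.get_method(), req.full_url)

# To send it, the server must be running:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.loads(resp.read().decode("utf-8"))["answer"])
```

The response body follows the `QueryResponse` model: `answer`, `session_id`, and `history`.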


### Interactive Chat Client for the RAG API

This cell provides a simple command-line chat interface that connects to the deployed FastAPI RAG system.

- You enter a session ID (or press Enter for the default session).
- The client sends your queries to the `/ask` endpoint using HTTP POST requests.
- The assistant’s answer is printed below each query.
- The recent conversation history (last 3 exchanges) is displayed for context.
- Type `exit` or `quit` to end the chat session.

This allows you to easily test and interact with the RAG model in a conversational manner directly from the notebook.


In [16]:
from IPython.display import clear_output
import requests

def chat():
    session_id = input("Enter session ID (press Enter for default): ") or "default"
    url = f"{public_url}/ask"
    clear_output()

    print(f"Chat session started: {session_id}")
    print("Type 'exit' to quit\n")

    while True:
        query = input("You: ")
        if query.lower() in ['exit', 'quit']:
            print("Session ended.")
            break

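        resp = requests.post(
            url,
            json={"query": query, "session_id": session_id},
            timeout=60,
        )
        # The API reports failures as HTTP errors with a "detail" field,
        # so check the status before reading the answer.
        if not resp.ok:
            print(f"\nRequest failed ({resp.status_code}): {resp.text}\n")
            continue
        response = resp.json()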

        print(f"\nAssistant:\n{response['answer']}\n")

        history = response.get("history", [])
        if history:
            print("Recent History:")
            for i, exchange in enumerate(history[-3:], 1):
                print(f"\n  {i}) Q: {exchange['query']}")
                print(f"     A: {exchange['answer']}")

        print("\n" + "=" * 60 + "\n")

chat()


Chat session started: default
Type 'exit' to quit

You: অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?",
INFO:     34.91.73.162:0 - "POST /ask HTTP/1.1" 200 OK

Assistant:
শস্তুনাথকে

Recent History:

  1) Q: অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?",
     A: শস্তুনাথকে


You: কাকে অনুপমের ভাগ্য দেবতা বলে উল্লেখ করা হয়েছে?
INFO:     34.91.73.162:0 - "POST /ask HTTP/1.1" 200 OK

Assistant:
মামাকে

Recent History:

  1) Q: অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?",
     A: শস্তুনাথকে

  2) Q: কাকে অনুপমের ভাগ্য দেবতা বলে উল্লেখ করা হয়েছে?
     A: মামাকে


You: বিয়ের সময় কল্যাণীর প্রকৃত বয়স কত ছিল?
INFO:     34.91.73.162:0 - "POST /ask HTTP/1.1" 200 OK

Assistant:
১৬ বছর

Recent History:

  1) Q: অনুপমের ভাষায় সুপুরুষ কাকে বলা হয়েছে?",
     A: শস্তুনাথকে

  2) Q: কাকে অনুপমের ভাগ্য দেবতা বলে উল্লেখ করা হয়েছে?
     A: মামাকে

  3) Q: বিয়ের সময় কল্যাণীর প্রকৃত বয়স কত ছিল?
     A: ১৬ বছর


You:  who is the writer of the story?
INFO:     34.91.73.162:0 - "POST /ask HTTP/1.1" 200 OK

Assista