<a href="https://colab.research.google.com/github/kumarsirish/rag-workshop/blob/main/scholarships-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Implementation Chatbot to get information about the scholarships provided by Govt. of India
## https://scholarships.gov.in/


In this notebook we will build a simple RAG application based on a scholarship dataset from the Govt. of India. It has the following sections:
* Load the dataset
* Chunking - Splitting of the Data
* Vector Database
* Generating Embedding
* Encoding user query, Creating the prompt and generating the similarity score
* Generate the output and channel it through the LLM for the proper response


### Key Components and Workflow:

1.  **Dataset Loading**: The project utilizes a dataset of Indian government scholarships, sourced from scholarships.gov.in and made available on Hugging Face (`NetraVerse/indian-govt-scholarships`).

2.  **Data Chunking**: The raw text data from the scholarships is split into smaller, manageable chunks to improve the relevance and efficiency of information retrieval.

3.  **Vector Database**: [Qdrant](https://qdrant.tech/) is employed as the vector database to store and manage the embeddings of these document chunks.

4.  **Embedding Generation**: [SentenceTransformer](https://www.sbert.net/) (`all-MiniLM-L6-v2`) is used to convert both the scholarship document chunks and user queries into dense vector representations (embeddings). This model maps sentences and paragraphs into a 384-dimensional vector space, enabling semantic similarity search.

5.  **Retrieval**: When a user poses a query, its embedding is generated and used to search the Qdrant vector database. The system retrieves the most semantically similar document chunks to the query.

6.  **Language Model (LLM)**: [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) is the chosen Large Language Model. It's a smaller, efficient model capable of generating coherent responses.

7.  **Response Generation (RAG)**: The retrieved document chunks are provided as context to TinyLlama. The LLM then generates a factual and relevant answer to the user's query, grounded in the provided scholarship information.

8.  **Interactive Interface**: The entire RAG pipeline is encapsulated within a [Gradio](https://www.gradio.app/) interface, allowing users to interact with the chatbot in real-time.

### Load the Dataset.
The Govt. of India publishes scholarship data at https://scholarships.gov.in/. That data has already been uploaded to Hugging Face.
In this cell we download the dataset from Hugging Face.

In [None]:
# huggingface_hub provides the hf:// filesystem that pandas uses below
! pip install pandas huggingface_hub
import pandas as pd
from pprint import pprint

# Read scholarship data from parquet file
df = pd.read_parquet("hf://datasets/NetraVerse/indian-govt-scholarships/data/train-00000-of-00001.parquet")
df = df[['label', 'text']]

# Convert to dict format
data = df.to_dict('records')
print(f"Loaded {len(data)} scholarship documents")
pprint(data[:1])

### Chunk the Data - Splitting into smaller pieces
* We will split the data into smaller chunks to make it easier to process and retrieve relevant information.
* Interactive chunking experience is available at https://chunkviz.up.railway.app/
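
To see how a chunk size and an overlap translate into character ranges, here is a minimal sketch. The helper name `chunk_spans` is hypothetical (it is not part of the notebook), and the 1200-character length is just an illustrative value. The window advances by `chunk_size - overlap` characters each time, so consecutive chunks share `overlap` characters.

```python
def chunk_spans(text_length, chunk_size, overlap):
    """Return (start, end) character offsets for overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # this window already reaches the end of the text
        start += step
    return spans

# Illustrative values: a 1200-character text, 500-character chunks, 100-character overlap
spans = chunk_spans(text_length=1200, chunk_size=500, overlap=100)
for start, end in spans:
    print(f"chars {start:>4} to {end:>4} ({end - start} characters)")
```

With these values, each chunk starts 400 characters after the previous one, and the last 100 characters of one chunk reappear at the start of the next.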

In [None]:
# Set this to True to enable chunking, False to disable
ENABLE_CHUNKING = True

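def chunk_text(text, chunk_size, overlap):
    '''Split text into overlapping chunks of at most chunk_size characters'''
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this chunk reaches the end; avoid a redundant tail chunk
        start += chunk_size - overlap  # advance by the stride so chunks overlap
    return chunks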

if ENABLE_CHUNKING:
    # Create chunked version of data
    chunked_data = []
    for doc in data:
        text = doc['text']  #data under 'text' key
        chunks = chunk_text(text, chunk_size=500, overlap=100)  #500 characters with 100 characters overlap

        for i, chunk in enumerate(chunks):
            chunked_data.append({
                'label': doc['label'],
                'text': chunk,
                'chunk_id': i,
                'total_chunks': len(chunks)
            })

    # Reassign data with chunked data
    data = chunked_data

    print("CHUNKING ENABLED")
    print(f"Chunked into {len(data)} pieces")
    # Display the first chunk as an example (chunk_id is zero-based)
    print("FIRST CHUNK EXAMPLE:")
    print(f"Chunk {data[0]['chunk_id'] + 1} of {data[0]['total_chunks']}")
else:
    print("CHUNKING DISABLED - Using full documents")
    print(f"Total documents: {len(data)}")
    print("FIRST DOCUMENT EXAMPLE:")

print(f"Label: {data[0]['label']}")
print(f"Text Length: {len(data[0]['text'])} characters")
print(f"FULL TEXT:\n{data[0]['text']}")


### 📦 Install required dependencies for vector database, embeddings, and deep learning
* The vector database is Qdrant.
* The embedding model comes from Sentence Transformers. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
* The deep learning framework is PyTorch (`torch`), which runs the Hugging Face models.



In [None]:
! pip install qdrant-client
! pip install sentence-transformers
! pip install torch

### 📦 Initialize Qdrant vector database client and SentenceTransformer embedding encoder
* Vector database is used to store and retrieve document chunks based on their semantic similarity to the query.
* SentenceTransformer is used to convert text into dense vector representations (embeddings).

In [None]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

encoder_model = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

### 📦 Initialize vector database for storing scholarship embeddings with cosine similarity
Other distance functions supported by Qdrant:
* Dot product (`models.Distance.DOT`)
* Euclidean (`models.Distance.EUCLID`)
* Manhattan (`models.Distance.MANHATTAN`)
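
As a rough illustration of how these measures differ, the sketch below computes each one in plain Python on two made-up 2-dimensional vectors. The function names and toy vectors are assumptions for illustration only and do not come from the Qdrant client. For cosine and dot product a higher value means "more similar"; for Euclidean and Manhattan a lower value means "closer".

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product of the vectors after length normalization."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy vectors (illustrative only)
query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, larger magnitude
orthogonal = [0.0, 1.0]      # unrelated direction

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        f"{name}: cosine={cosine(query, vec):.3f}, dot={dot(query, vec):.3f}, "
        f"euclidean={euclidean(query, vec):.3f}, manhattan={manhattan(query, vec):.3f}"
    )
```

Cosine similarity ignores vector length, so `same_direction` scores a perfect match even though it is three times longer than the query. Euclidean distance, by contrast, ranks `orthogonal` as closer to the query than `same_direction`.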

In [None]:
# Create collection to store the scholarship data
collection_name="scholarships"

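# recreate_collection is deprecated in recent qdrant-client releases;
# drop any existing collection explicitly so this cell can be re-run
if qdrant.collection_exists(collection_name):
    qdrant.delete_collection(collection_name)

qdrant.create_collection(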
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=encoder_model.get_sentence_embedding_dimension(), # Vector size is as defined in the used model
        distance=models.Distance.COSINE
    )
)

### 📦 Generate embeddings for each document and upload to vector database
* In a vector database, each data point is represented as a vector in a high-dimensional space.
* For the all-MiniLM-L6-v2 model, each vector has 384 dimensions.

In [None]:
points_to_upload = []
for idx, doc in enumerate(data):
    points_to_upload.append(
        models.PointStruct(
            id=idx,
            vector=encoder_model.encode(doc["text"]).tolist(),  # Use 'text' field for scholarship data
            payload=doc
        )
    )

# Upload the pre-computed vectors and their payloads to Qdrant
qdrant.upload_points(
    collection_name=collection_name,
    points=points_to_upload
)

### 📦 Check the embeddings

In [None]:
# Display first document's text and embedding
first_doc = data[0]
first_text = first_doc['text']
first_vector = encoder_model.encode(first_text).tolist()

print("DOCUMENT TEXT:")
print(f"Text (first 100 chars): {first_text[:100]}...")
print("EMBEDDING VECTOR:")
print(f"Vector dimension: {len(first_vector)}")
print(f"First 20 values: {first_vector[:20]}")


### Check the number of points (embeddings) in the Qdrant collection

In [None]:
count_result = qdrant.count(collection_name=collection_name, exact=True)
print(f"Number of points in collection '{collection_name}': {count_result.count}")

### Retrieve and display a specific point (embedding and its payload)

You can retrieve a point by its `id`. For example, let's look at the point with `id=0` (which corresponds to the first chunk of data).

In [None]:
point_id = 0
retrieved_point = qdrant.retrieve(collection_name=collection_name, ids=[point_id], with_vectors=True, with_payload=True)

if retrieved_point:
    print(f"--- Retrieved Point ID: {retrieved_point[0].id} ---")
    print(f"Payload: ")
    pprint(retrieved_point[0].payload)
    print(f"Vector (first 2 values): {retrieved_point[0].vector[:2]}...")
    print(f"Vector dimension: {len(retrieved_point[0].vector)}")
else:
    print(f"Point with ID {point_id} not found.")

### 📦 User query and searching the database
 * Define the user query.
 * Convert the user query to an embedding using the same SentenceTransformer model.



In [None]:
user_prompt = "what is the percentage reservation for women in NSPG Scheme"
# SentenceTransformer returns a NumPy array by default, but Qdrant
# expects a plain Python list.
query_vector = encoder_model.encode(user_prompt).tolist()
print(f"query_vector: {query_vector}")
print(f"Query Vector Dimension: {len(query_vector)}")

### 🎯 Search vector database
* Search the vector database for the top 3 (top k) most similar document chunks based on cosine similarity.
* Display the retrieved document chunks with metadata and similarity scores.
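
Conceptually, a top-k search scores every stored vector against the query vector and keeps the k best. The sketch below does this by brute force in plain Python. The helper names, the 3-dimensional toy vectors, and the payload strings are made up for illustration. It shows the idea only and is not how Qdrant is implemented internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, points, k=3):
    """Score every (payload, vector) pair and return the k highest-scoring ones."""
    scored = [(cosine_similarity(query_vector, vec), payload) for payload, vec in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return scored[:k]

# Toy "collection": payload text paired with a made-up 3-dimensional embedding
toy_points = [
    ("chunk about eligibility", [0.9, 0.1, 0.0]),
    ("chunk about deadlines", [0.1, 0.9, 0.0]),
    ("chunk about documents", [0.0, 0.2, 0.9]),
    ("chunk about income limits", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_points, k=3)
for score, payload in results:
    print(f"{score:.3f}  {payload}")
```

The real collection holds 384-dimensional vectors, but the ranking principle is the same: the chunks whose embeddings point in nearly the same direction as the query embedding come back first.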

In [None]:
# Retrieve the top 3 chunks most similar to the query vector

hits = qdrant.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=3
)

#  save the search results
search_results = []
for hit in hits.points:
    search_results.append(hit.payload)
    pprint(hit)


### 🤖 Load TinyLlama model
* TinyLlama is a compact 1.1B-parameter model that follows the Llama 2 architecture, designed to be more efficient while still performing well on various NLP tasks.
* We will use TinyLlama to generate responses based on the retrieved document chunks.
* https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0



In [None]:
# For Hugging Face models
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Set up device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load TinyLlama model and tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

### 🤖 Generate response using TinyLlama without search results
* Generate a response to the user query using TinyLlama without incorporating any retrieved document chunks.
* This serves as a baseline to compare against the RAG approach.
* `return_tensors="pt"` makes the tokenizer return PyTorch tensors.
* `max_new_tokens`: the maximum number of new tokens to generate in the response.
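
For a decoder-only model, the generated sequence contains the prompt tokens followed by the new tokens, which is why the cell below slices the output at the prompt length before decoding. The toy function below mimics that output shape and the effect of a `max_new_tokens` cap. The name `toy_generate` and all token ids are made up for illustration, and no real model is involved.

```python
EOS = 2  # made-up end-of-sequence id for this toy example

def toy_generate(prompt_ids, candidate_ids, max_new_tokens):
    """Mimic a decoder-only generate call: return prompt ids followed by new ids."""
    new_ids = []
    for token_id in candidate_ids:
        if len(new_ids) >= max_new_tokens:
            break  # the cap on new tokens has been reached
        new_ids.append(token_id)
        if token_id == EOS:
            break  # the model signalled that it has finished
    return prompt_ids + new_ids

prompt_ids = [1, 15, 27, 42]
output_ids = toy_generate(prompt_ids, candidate_ids=[7, 8, 9, EOS, 11], max_new_tokens=3)

# Keep only the tokens that come after the prompt
answer_ids = output_ids[len(prompt_ids):]
print(answer_ids)
```

With a cap of 3 the answer is cut off before the end-of-sequence token. With a larger cap, generation stops at the end-of-sequence token instead.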

In [None]:
prompt = [
    {"role": "system", "content": "You are a helpful chatbot. Your top priority is to help users and guide them with their queries. "},
    {"role": "user","content": user_prompt},
]

print(prompt)
inputs = tokenizer.apply_chat_template(
	prompt,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt", #pt, np, tf
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
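pprint("Response without RAG and with TinyLlama:")
# Decode only the newly generated tokens and drop special tokens such as </s>
pprint(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))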

### ✨ Generate response using TinyLlama WITH search results (RAG-ENHANCED)
* Generate a response to the user query using TinyLlama, incorporating the retrieved document chunks.
* `max_new_tokens`: the maximum number of new tokens to generate in the response.
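
The cell below passes `str(search_results)` into the system prompt, which includes Python dict syntax and chunk metadata. One possible alternative, sketched here under assumptions, is to render the retrieved payloads as numbered plain text. The helper `format_context` and its `max_chars` cap are hypothetical and not part of the notebook, and the payloads are toy data shaped like the ones stored in the collection.

```python
def format_context(search_results, max_chars=1500):
    """Render retrieved payloads as a numbered, labelled plain-text block."""
    sections = []
    for number, doc in enumerate(search_results, start=1):
        sections.append(f"[{number}] {doc['label']}\n{doc['text'].strip()}")
    context = "\n\n".join(sections)
    return context[:max_chars]  # crude cap so the prompt does not grow unbounded

# Toy payloads with the same keys as the chunked data (label, text, chunk_id, total_chunks)
toy_results = [
    {"label": "Scheme A", "text": "  Eligibility text.  ", "chunk_id": 0, "total_chunks": 2},
    {"label": "Scheme B", "text": "Deadline text.", "chunk_id": 1, "total_chunks": 3},
]

context = format_context(toy_results)
print(context)
```

Whether this improves answers from a small model would need to be checked empirically. The numbering also makes it possible to ask the model to cite which document it used.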

In [None]:
# No need to reload the model - just create a new prompt with RAG context

prompt = [
    {"role": "system", "content": f"You are a helpful chatbot specializing in Indian government scholarships. Use the following retrieved documents to answer the user's question accurately. ONLY use information from the retrieved documents.\n\nRetrieved Documents:\n{str(search_results)}"},
    {"role": "user", "content": user_prompt},
]

print(prompt)
inputs = tokenizer.apply_chat_template(
	prompt,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)
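pprint("Response with RAG and with TinyLlama:")

outputs = model.generate(**inputs, max_new_tokens=500)
# Decode only the newly generated tokens and drop special tokens such as </s>
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)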
pprint(response)

# Print source documents in one line
sources = " | ".join([f"{doc['label']}" for doc in search_results])
print(f"\n📚 Sources: {sources}")

<h4>🌐 Launch interactive Gradio chatbot interface with full RAG pipeline</h4>

In [None]:
import gradio as gr

def scholarship_chatbot(message, history):
    # Encode user query
    query_vector = encoder_model.encode(message).tolist()

    # Search for relevant scholarships
    hits = qdrant.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=3
    )

    search_results = []
    for hit in hits.points:
        search_results.append(hit.payload)

    # Generate response with LLM
    prompt = [
        {"role": "system", "content": f"You are a helpful chatbot specializing in Indian government scholarships. Use the following retrieved documents to answer accurately:\n\nRetrieved Documents:\n{str(search_results)}"},
        {"role": "user", "content": message}
    ]

    inputs = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=1024)
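    # Decode only the newly generated tokens and drop special tokens such as </s>
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)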

    # Add source documents in one line
    sources_list = []
    for doc in search_results:
        sources_list.append(f"{doc['label']}")
    sources = " | ".join(sources_list)
    response_with_sources = f"{response}\n\n Sources: {sources}"

    return response_with_sources

# Launch Gradio interface
demo = gr.ChatInterface(
    scholarship_chatbot,
    title="🎓 Indian Government Scholarship Chatbot",
    description="Ask me about Indian government scholarships!",
    examples=[
        "What scholarships are available for engineering students?",
        "Tell me about AICTE scholarships",
        "Are there scholarships for women in STEM? Summarize the answer"
    ]
)

demo.launch()