<h4>📥 Load scholarship data from Hugging Face dataset (parquet format)</h4>

In [1]:
! pip install pandas



In [2]:
import pandas as pd
from pprint import pprint

# Read scholarship data from parquet file
df = pd.read_parquet("hf://datasets/NetraVerse/indian-govt-scholarships/data/train-00000-of-00001.parquet")

# Select only text and label columns 
df = df[['label', 'text']]

# Convert to records format
data = df.to_dict('records')

print(f"Loaded {len(data)} scholarship documents")
pprint(data[:2])

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Loaded 10 scholarship documents
[{'label': 'AICTE_2010_F',
  'text': 'Page 1 of 8 \n'
          '  \n'
          ' \n'
          'PRAGATI SCHOLARSHIP SCHEME  \n'
          'Frequently Asked Questions (FAQs)  \n'
          'Q.1 Who is eligible for PRAGATI Scholarship?  \n'
          'Ans: Eligibility criteria underPRAGATI Scholarship scheme:  \n'
          'EligibilityforPragati-DegreeLevel '
          'EligibilityforPragati–DiplomaLevel \n'
          '1. Upto TwoGirls per family. 1. Upto TwoGirls perfamily. \n'
          '2. \n'
          ' Familyincomeshouldbelessthan Rs.8Lakh perannu\n'
          'm. 2. Familyincomeshouldbelessthan Rs.8Lakh perannum. \n'
          '3.  Studentsadmittedin UGDegreeLevel \n'
          'Programme/CourseinAICTEApprovedInstitutions. 3.Studentsadmittedin '
          'DiplomaLevel Programme/CourseinAIC\n'
          'TE ApprovedInstitutions. \n'
          '4.Thestudentsadmitted in  first year of their Degree \n'
          'Course OR Second year oftheir \n'
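
<p>Before chunking or embedding, it can help to confirm that every record carries usable <code>label</code> and <code>text</code> fields. The following is a minimal sketch, not part of the original notebook: <code>validate_records</code> and the hand-made <code>sample</code> list are hypothetical stand-ins for a check you might run on <code>data</code>.</p>

```python
def validate_records(records, required_keys=("label", "text")):
    """Return (index, problem) tuples for records that look malformed."""
    problems = []
    for i, rec in enumerate(records):
        for key in required_keys:
            value = rec.get(key)
            if not isinstance(value, str):
                problems.append((i, f"'{key}' is missing or not a string"))
            elif not value.strip():
                problems.append((i, f"'{key}' is empty"))
    return problems


# Hand-made sample standing in for `data` (hypothetical records)
sample = [
    {"label": "DOC_A", "text": "Some scholarship text."},
    {"label": "DOC_B", "text": "   "},
    {"label": "DOC_C"},
]

issues = validate_records(sample)
for index, problem in issues:
    print(f"record {index}: {problem}")
```

<p>In the notebook itself, <code>validate_records(data)</code> returning an empty list would suggest all ten documents are safe to pass on to the next step.</p>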


<h4>✅ Split each document into overlapping chunks</h4>

In [3]:
# ============================================
# CHUNKING (Enable/Disable by setting flag)
# ============================================
# Set this to True to enable chunking, False to disable
ENABLE_CHUNKING = True  # Change to False to disable chunking

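def chunk_text(text, chunk_size=500, overlap=100):
    '''Split text into chunk_size-character chunks; consecutive chunks share `overlap` characters'''
    if overlap >= chunk_size:
        # Otherwise the window would never advance and the loop below would not end
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks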

if ENABLE_CHUNKING:
    # Create chunked version of data
    chunked_data = []
    for doc in data:
        text = doc['text']
        chunks = chunk_text(text, chunk_size=500, overlap=100)
        
        for i, chunk in enumerate(chunks):
            chunked_data.append({
                'label': doc['label'],
                'text': chunk,
                'chunk_id': i,
                'total_chunks': len(chunks)
            })
    
    # Replace original data with chunked data
    data = chunked_data
    print(f"\n{'='*50}")
    print("✅ CHUNKING ENABLED")
    print(f"Chunked into {len(data)} pieces")
    print(f"{'='*50}\n")
    
    # Display first chunk example - FULL TEXT
    print("FIRST CHUNK EXAMPLE:")
    print(f"Label: {data[0]['label']}")
    print(f"Chunk ID: {data[0]['chunk_id']} of {data[0]['total_chunks']}")
    print(f"Text Length: {len(data[0]['text'])} characters")
    print(f"FULL TEXT:\n{data[0]['text']}")
    print(f"\n{'='*50}\n")
else:
    print(f"\n{'='*50}")
    print("❌ CHUNKING DISABLED - Using full documents")
    print(f"Total documents: {len(data)}")
    print(f"{'='*50}\n")
    print("FIRST DOCUMENT EXAMPLE:")
    print(f"Label: {data[0]['label']}")
    print(f"Text Length: {len(data[0]['text'])} characters")
    print(f"FULL TEXT:\n{data[0]['text']}")
    print(f"\n{'='*50}\n")


✅ CHUNKING ENABLED
Chunked into 447 pieces

FIRST CHUNK EXAMPLE:
Label: AICTE_2010_F
Chunk ID: 0 of 33
Text Length: 500 characters
FULL TEXT:
Page 1 of 8 
  
 
PRAGATI SCHOLARSHIP SCHEME  
Frequently Asked Questions (FAQs)  
Q.1 Who is eligible for PRAGATI Scholarship?  
Ans: Eligibility criteria underPRAGATI Scholarship scheme:  
EligibilityforPragati-DegreeLevel EligibilityforPragati–DiplomaLevel 
1. Upto TwoGirls per family. 1. Upto TwoGirls perfamily. 
2. 
 Familyincomeshouldbelessthan Rs.8Lakh perannu
m. 2. Familyincomeshouldbelessthan Rs.8Lakh perannum. 
3.  Studentsadmittedin UGDegreeLevel 
Programme/CourseinAICTEApprovedInstit
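
<p>The chunk counts above follow from simple arithmetic: each window starts <code>chunk_size - overlap</code> characters after the previous one. The sketch below re-implements the same sliding window in compact form and checks the count and the overlap on a synthetic string. <code>expected_chunk_count</code> is a hypothetical helper introduced only for illustration.</p>

```python
import math


def chunk_text(text, chunk_size=500, overlap=100):
    """Same sliding-window logic as the notebook cell, written with range()."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def expected_chunk_count(text_len, chunk_size=500, overlap=100):
    """Number of window start positions that fall inside the text."""
    return math.ceil(text_len / (chunk_size - overlap))


# Synthetic 1000-character text with varying content
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text)

print(len(chunks), expected_chunk_count(len(text)))
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next
print(chunks[0][-100:] == chunks[1][:100])
```

<p>By the same arithmetic, the 33 chunks reported for the first document imply a text of roughly 12,800 to 13,200 characters. Note that the final chunk can be much shorter than <code>chunk_size</code>.</p>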




<h4>📦 Install required dependencies for vector database, embeddings, and deep learning</h4>

In [4]:
! pip install qdrant-client
! pip install sentence-transformers
! pip install torch



<h4>🔧 Initialize Qdrant vector database client and SentenceTransformer embedding encoder</h4>

In [5]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

# create the vector database client
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance


# Create the embedding encoder
encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

<h4>🗄️ Create vector collection for storing scholarship embeddings with cosine similarity</h4>

In [6]:
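# Create collection to store the scholarship data
collection_name = "scholarships"

# recreate_collection is deprecated in recent qdrant-client releases,
# so drop any existing collection and create it explicitly
if qdrant.collection_exists(collection_name):
    qdrant.delete_collection(collection_name)

qdrant.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # Vector size is defined by the embedding model
        distance=models.Distance.COSINE
    )
)
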


True
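
<p>The collection is configured with <code>Distance.COSINE</code>, so search scores reflect the angle between vectors rather than their length. As a rough illustration of what that metric computes, here is a plain-Python sketch. It is not Qdrant's implementation, and <code>cosine_similarity</code> is a name chosen for this example.</p>

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel, different lengths
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])       # pointing the other way

print(same_direction, orthogonal, opposite)
```

<p>Parallel vectors score close to 1 regardless of magnitude, orthogonal vectors score 0, and opposite vectors score -1.</p>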

<h4>⬆️ Generate embeddings for each document and upload to vector database</h4>

In [7]:
points_to_upload = []
for idx, doc in enumerate(data):
    points_to_upload.append(
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["text"]).tolist(),  # Use 'text' field for scholarship data
            payload=doc
        )
    )

# vectorize!
qdrant.upload_points(
    collection_name=collection_name,
    points=points_to_upload
)
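
<p>Encoding one chunk per call works, but it can be slow for several hundred chunks. <code>SentenceTransformer.encode</code> also accepts a list of strings, so the texts can be embedded in batches. The sketch below shows the batching pattern with the embedding function passed in as a parameter. <code>batched</code>, <code>embed_in_batches</code>, and <code>fake_embed</code> are hypothetical names, and the fake embedder only stands in for the real model.</p>

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(docs, embed_fn, batch_size=64):
    """Return (id, vector, payload) triples, calling embed_fn once per batch."""
    triples = []
    next_id = 0
    for batch in batched(docs, batch_size):
        vectors = embed_fn([doc["text"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            triples.append((next_id, list(vector), doc))
            next_id += 1
    return triples


# Stand-in embedder that records how many texts each call received
calls = []

def fake_embed(texts):
    calls.append(len(texts))
    return [[float(len(t)), 0.0] for t in texts]


docs = [{"label": "DOC", "text": "a" * n} for n in range(1, 6)]
triples = embed_in_batches(docs, fake_embed, batch_size=2)

print(calls)
print(triples[0])
```

<p>With the notebook's encoder, <code>embed_fn</code> could be <code>lambda texts: encoder.encode(texts).tolist()</code>, and each triple maps onto the <code>id</code>, <code>vector</code>, and <code>payload</code> arguments of <code>models.PointStruct</code>.</p>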

<h4>⬆️ Inspect the embedding of the first chunk</h4>

In [8]:
# Display first document's text and embedding
first_doc = data[0]
first_text = first_doc['text']
first_vector = encoder.encode(first_text).tolist()

print("DOCUMENT TEXT:")
print(f"Text (first 500 chars): {first_text[:500]}...")
print("EMBEDDING VECTOR:")
print(f"Vector dimension: {len(first_vector)}")
print(f"First 20 values: {first_vector[:20]}")


DOCUMENT TEXT:
Text (first 500 chars): Page 1 of 8 
  
 
PRAGATI SCHOLARSHIP SCHEME  
Frequently Asked Questions (FAQs)  
Q.1 Who is eligible for PRAGATI Scholarship?  
Ans: Eligibility criteria underPRAGATI Scholarship scheme:  
EligibilityforPragati-DegreeLevel EligibilityforPragati–DiplomaLevel 
1. Upto TwoGirls per family. 1. Upto TwoGirls perfamily. 
2. 
 Familyincomeshouldbelessthan Rs.8Lakh perannu
m. 2. Familyincomeshouldbelessthan Rs.8Lakh perannum. 
3.  Studentsadmittedin UGDegreeLevel 
Programme/CourseinAICTEApprovedInstit...
EMBEDDING VECTOR:
Vector dimension: 384
First 20 values: [-0.0012908128555864096, 0.012660644017159939, 0.002040782244876027, -0.03014940395951271, 0.0076779816299676895, 0.030943244695663452, -0.0029246823396533728, 0.025981172919273376, -0.04518456012010574, 0.0067618186585605145, 0.05067143961787224, -0.07189879566431046, -0.01719922386109829, -0.021779561415314674, -0.0004332665994297713, -0.051407407969236374, -0.06701146811246872, -0.05509680509

<h4>💬 Define user query for testing the RAG system</h4>

In [9]:
user_prompt = "what is the percetnage reservations for women in NSPG Scheme"

<h4>🔍 Convert user query into embedding vector for semantic search</h4>

In [10]:
query_vector = encoder.encode(user_prompt).tolist()

<h4>🎯 Search vector database for top 5 most relevant scholarship documents</h4>

In [11]:
# Retrieve the stored chunks most similar to the query vector

hits = qdrant.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=5
)

<h4>📄 Display retrieved search results with metadata and similarity scores</h4>

In [12]:
for hit in hits.points:  # query_points returns a response object; the ScoredPoint results are in .points
    pprint(hit)

ScoredPoint(id=133, version=0, score=0.5962271669195092, payload={'label': 'FAQ_NSPG', 'text': 'FAQs for NATIONAL SCHOLARSHIP FOR POSTGRADUATE STUDIES\n1\n. \nWhat is NSPG Scheme?\nReply: \nNational Scholarship for Post Graduate Studies (NSPGS) is an umbrella\nscheme by merging four different scholarship schemes for postgraduate studies\ni.e. (i) PG Indira Gandhi single girl child scholarship, (ii) PG scholarship for\nuniversity rank holders, (iii) PG scholarship for SC/ST students pursuing\nprofessional courses and (iv) Post Graduate Scholarship for GATE/GPAT.\n \n2\n. \nHow can I apply online for scho', 'chunk_id': 0, 'total_chunks': 14}, vector=None, shard_key=None, order_value=None)
ScoredPoint(id=140, version=0, score=0.59211491907457, payload={'label': 'FAQ_NSPG', 'text': 'marked for Science, Engineering &\nTechnology, medical, technical, agriculture, forestry programmes. The selection is\ndone purely on merit basis. The slots are allocated as per Govt. of India\nreservation pol

<h4>🤖 Load TinyLlama model and generate response WITHOUT retrieval context (baseline)</h4>

In [13]:
# For Hugging Face models
import os

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from huggingface_hub import login


# Optionally log in to the Hugging Face Hub. The token is read from the HF_TOKEN
# environment variable (an assumption; adapt this to wherever you store the token).
# You can get a token from https://huggingface.co/settings/tokens.
try:
    hf_token = os.environ.get("HF_TOKEN", "")
    if hf_token:
        login(token=hf_token)
        print("Successfully logged into Hugging Face Hub.")
    else:
        print("Warning: Hugging Face token not found. Some models might require authentication")
except Exception as e:
    print(f"Error during Hugging Face login: {e}. Some models might not load.")


# Set up device (GPU if available, else CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load TinyLlama model and tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompt = [
    {"role": "system", "content": "You are a helpful chatbot specializing in Indian government scholarships. Your top priority is to help users find relevant scholarship information and guide them with their queries. ONLY use information from the retrieved documents"},
    {"role": "user","content": user_prompt},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
pprint("Response without RAG and with TinyLlama:")
pprint(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Using device: cuda
'Response without RAG and with TinyLlama:'
('According to the given material, the percentage of reservations for women in '
 'NSPG Scheme is not mentioned. However, the text mentions that the scheme '
 'aims to provide financial assistance to women students for pursuing higher '
 'education. Therefore, it is safe to assume that the scheme provides '
 'financial assistance to women students as well.</s>')


<h4>📋 Extract payload data from search results for RAG augmentation</h4>

In [14]:
# define a variable to hold the search results
search_results = [hit.payload for hit in hits.points]
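
<p>Passing <code>str(search_results)</code> to the model embeds the raw Python representation of the payloads, including dictionary syntax and escaped newlines. One alternative, sketched here under the assumption that a small model may cope better with plainly formatted passages, is to join the payloads into numbered blocks with a character budget. <code>format_context</code> and the hand-made <code>sample_results</code> are hypothetical and only mimic the payload shape used in this notebook.</p>

```python
def format_context(results, max_chars=2000):
    """Join retrieved payloads into numbered passages, stopping at a character budget."""
    passages = []
    used = 0
    for rank, payload in enumerate(results, start=1):
        header = f"[{rank}] {payload['label']} (chunk {payload.get('chunk_id', 'N/A')})"
        passage = f"{header}\n{payload['text'].strip()}"
        if used + len(passage) > max_chars:
            break  # keep the prompt within budget; lower-ranked passages are dropped
        passages.append(passage)
        used += len(passage)
    return "\n\n".join(passages)


# Hand-made payloads in the same shape as `search_results`
sample_results = [
    {"label": "FAQ_NSPG", "text": "What is NSPG Scheme?\n", "chunk_id": 0, "total_chunks": 14},
    {"label": "FAQ_NSPG", "text": "The selection is done purely on merit basis.", "chunk_id": 7, "total_chunks": 14},
]

context = format_context(sample_results)
print(context)
```

<p>The resulting string could replace <code>str(search_results)</code> in the system prompt. The budget matters because the retrieved text, the question, and the generated answer all share the model's context window.</p>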

<h4>✨ Generate response WITH retrieval context (RAG-enhanced)</h4>

In [15]:
# Use the already-loaded model and tokenizer from the previous cell
# No need to reload the model - just create a new prompt with RAG context

prompt = [
    {"role": "system", "content": f"You are a helpful chatbot specializing in Indian government scholarships. Use the following retrieved documents to answer the user's question accurately. ONLY use information from the retrieved documents.\n\nRetrieved Documents:\n{str(search_results)}"},
    {"role": "user", "content": user_prompt},
]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
pprint("Response with RAG and with TinyLlama:")

outputs = model.generate(**inputs, max_new_tokens=500)
pprint(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

'Response with RAG and with TinyLlama:'
('According to the retrieved documents, the NSPG Scheme reserves 30% slots for '
 'women candidates. This means that women candidates are given priority over '
 'men candidates in the selection process.</s>')


In [16]:
# ============================================
# DEEP DIVE: Analyze Retrieved Chunks
# ============================================

print("🔍 RETRIEVED CHUNKS ANALYSIS:")
print("=" * 80)

for i, result in enumerate(search_results, 1):
    print(f"\n📄 CHUNK {i}:")
    print(f"Label: {result['label']}")
    print(f"Chunk ID: {result.get('chunk_id', 'N/A')} of {result.get('total_chunks', 'N/A')}")
    print(f"Text length: {len(result['text'])} characters")
    print("\n📝 FIRST 500 CHARACTERS OF TEXT:")
    print(result['text'][:500])
    print("\n📝 LAST 200 CHARACTERS OF TEXT:")
    print(result['text'][-200:])
    print("-" * 80)

🔍 RETRIEVED CHUNKS ANALYSIS:

📄 CHUNK 1:
Label: FAQ_NSPG
Chunk ID: 0 of 14
Text length: 500 characters

📝 FIRST 500 CHARACTERS OF TEXT:
FAQs for NATIONAL SCHOLARSHIP FOR POSTGRADUATE STUDIES
1
. 
What is NSPG Scheme?
Reply: 
National Scholarship for Post Graduate Studies (NSPGS) is an umbrella
scheme by merging four different scholarship schemes for postgraduate studies
i.e. (i) PG Indira Gandhi single girl child scholarship, (ii) PG scholarship for
university rank holders, (iii) PG scholarship for SC/ST students pursuing
professional courses and (iv) Post Graduate Scholarship for GATE/GPAT.
 
2
. 
How can I apply online for scho

📝 LAST 200 CHARACTERS OF TEXT:
 scholarship for
university rank holders, (iii) PG scholarship for SC/ST students pursuing
professional courses and (iv) Post Graduate Scholarship for GATE/GPAT.
 
2
. 
How can I apply online for scho
--------------------------------------------------------------------------------

📄 CHUNK 2:
Label: FAQ_NSPG
Chun

In [17]:
! pip install datasets



In [18]:
from datasets import load_dataset

# Load the dataset using Hugging Face datasets library
print("Loading dataset from Hugging Face...")
dataset = load_dataset("NetraVerse/indian-govt-scholarships", split="train")

print("\n📊 DATASET INFORMATION:")
print("=" * 80)
print(f"Number of rows: {len(dataset)}")
print(f"Features/Columns: {dataset.features}")
print("\n🔍 First record:")
print("-" * 80)
pprint(dataset[0])
print("\n" + "=" * 80)

Loading dataset from Hugging Face...




📊 DATASET INFORMATION:
Number of rows: 10
Features/Columns: {'file_name': Value('string'), 'label': Value('string'), 'text': Value('string')}

🔍 First record:
--------------------------------------------------------------------------------
{'file_name': 'AICTE_2010_F.pdf',
 'label': 'AICTE_2010_F',
 'text': 'Page 1 of 8 \n'
         '  \n'
         ' \n'
         'PRAGATI SCHOLARSHIP SCHEME  \n'
         'Frequently Asked Questions (FAQs)  \n'
         'Q.1 Who is eligible for PRAGATI Scholarship?  \n'
         'Ans: Eligibility criteria underPRAGATI Scholarship scheme:  \n'
         'EligibilityforPragati-DegreeLevel '
         'EligibilityforPragati–DiplomaLevel \n'
         '1. Upto TwoGirls per family. 1. Upto TwoGirls perfamily. \n'
         '2. \n'
         ' Familyincomeshouldbelessthan Rs.8Lakh perannu\n'
         'm. 2. Familyincomeshouldbelessthan Rs.8Lakh perannum. \n'
         '3.  Studentsadmittedin UGDegreeLevel \n'
         'Programme/CourseinAICTEApprovedInstit

<h4>🔍 Verify RAG Output - Check for Hallucinations</h4>
<p>Compare the retrieved documents with the model's response to ensure accuracy</p>
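<p>One crude way to start such a comparison is to check whether every number quoted in the response also appears in the retrieved text. The sketch below is a heuristic under that assumption, not a full hallucination detector. <code>unsupported_numbers</code> is a hypothetical helper, and the context strings are made-up examples rather than text from the dataset.</p>

```python
import re


def unsupported_numbers(response, contexts):
    """Return numbers that appear in the response but in none of the retrieved texts."""
    number_pattern = re.compile(r"\d+(?:\.\d+)?")
    context_numbers = set()
    for text in contexts:
        context_numbers.update(number_pattern.findall(text))
    return [n for n in number_pattern.findall(response) if n not in context_numbers]


# Paraphrase of a model answer, checked against two made-up contexts
response = "The scheme reserves 30% slots for women candidates."

with_support = unsupported_numbers(response, ["30% of the slots are reserved for women."])
without_support = unsupported_numbers(response, ["The selection is done purely on merit basis."])

print(with_support)
print(without_support)
```

<p>In the notebook, <code>contexts</code> could be <code>[r['text'] for r in search_results]</code>. A non-empty result flags figures worth checking by hand. An empty result does not prove the answer is correct, since claims without numbers are not examined at all.</p>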

<h4>🌐 Launch interactive Gradio chatbot interface with full RAG pipeline</h4>

In [19]:
import gradio as gr

def scholarship_chatbot(message, history):
    # Encode user query
    query_vector = encoder.encode(message).tolist()
    
    # Search for relevant scholarships
    hits = qdrant.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=3
    )
    
    search_results = [hit.payload for hit in hits.points]
    
    # Generate response with LLM
    prompt = [
        {"role": "system", "content": f"You are a helpful chatbot specializing in Indian government scholarships. Use the following retrieved documents to answer accurately:\n\nRetrieved Documents:\n{str(search_results)}"},
        {"role": "user", "content": message}
    ]
    
    inputs = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    
    outputs = model.generate(**inputs, max_new_tokens=250)
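    # skip_special_tokens drops markers such as </s> from the text shown in the chat
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)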
    
    return response

# Launch Gradio interface
demo = gr.ChatInterface(
    scholarship_chatbot,
    title="🎓 Indian Government Scholarship Chatbot",
    description="Ask me about Indian government scholarships!",
    examples=[
        "What scholarships are available for engineering students?",
        "Tell me about AICTE scholarships",
        "Are there scholarships for women in STEM?"
    ]
)

demo.launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://4b519101e36a4eb5fa.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


