<a href="https://colab.research.google.com/github/lovnishverma/Python-Getting-Started/blob/main/NIELIT_Student_Helpdesk_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NIELIT Student Helpdesk Chatbot (Fine-Tuning + Testing)

# Step 1 : Install Dependencies

In [1]:
!pip install transformers datasets sentencepiece accelerate bitsandbytes -q

# Step 2 : Upload the Dataset (CSV)

➡ Upload the file : [nielit_helpdesk_dataset.csv](https://drive.google.com/file/d/1UpEDu_fiDhcUPZwY6VPKbTeAaX1fEkwg/view?usp=sharing)

In [2]:
from google.colab import files
uploaded = files.upload()

import pandas as pd
csv_file = list(uploaded.keys())[0]
df = pd.read_csv(csv_file)
df.head()

Saving nielit_helpdesk_dataset.csv to nielit_helpdesk_dataset.csv


| Unnamed: 0 | question | answer |
|---|---|---|
| 0 | What is the eligibility for the Cyber Security... | The eligibility criteria for the Cyber Securit... |
| 1 | What is the fee for the Artificial Intelligenc... | The fee for the Artificial Intelligence course... |
| 2 | My name is incorrect on the certificate. What ... | If your name is incorrect on the certificate, ... |
| 3 | How do I download my course certificate? | You can download your course certificate from ... |
| 4 | When will certificates be issued after course ... | Certificates are usually issued within 2–4 wee... |
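
The preview shows an `Unnamed: 0` column, which usually means the CSV was written together with its index. Before training, it may be worth checking for empty or duplicated question/answer pairs. The following is a minimal sketch using only the standard library; `audit_rows` is a hypothetical helper and the sample rows are made up for illustration, not taken from the dataset.

```python
import csv
import io

def audit_rows(csv_text):
    """Count empty and duplicated question/answer pairs in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    stats = {"total": 0, "empty": 0, "duplicates": 0}
    for row in reader:
        stats["total"] += 1
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if not question or not answer:
            stats["empty"] += 1
            continue
        key = (question.lower(), answer.lower())
        if key in seen:
            stats["duplicates"] += 1
        seen.add(key)
    return stats

# Made-up sample with the same three columns as the preview above
sample = (
    ",question,answer\n"
    "0,How do I download my certificate?,Use the student portal.\n"
    "1,How do I download my certificate?,Use the student portal.\n"
    "2,What is the fee?,\n"
)
print(audit_rows(sample))
```

With the DataFrame already loaded in the notebook, `df[["question", "answer"]].duplicated().sum()` and `df.isna().sum()` give comparable counts.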


# Step 3 : Prepare Dataset for Training

In [3]:
from datasets import Dataset

dataset = Dataset.from_pandas(df[["question","answer"]])

# Convert Q/A into single training text format
def format_row(row):
    return f"User: {row['question']}\nBot: {row['answer']}"

dataset = dataset.map(lambda x: {"text": format_row(x)})
dataset = dataset.remove_columns(["question","answer"])
dataset

Dataset({
    features: ['text'],
    num_rows: 5000
})
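
Each training text ends right after the answer, so nothing in the data marks where a reply should stop. One common variation is to append GPT-2's end-of-text token, `<|endoftext|>`, to every example. The sketch below assumes that approach; `format_row_with_eos` is a hypothetical variant, not the function used in this notebook, and the example row is made up. One caveat to verify: because this notebook later sets `pad_token = eos_token`, the language-modeling collator is likely to mask those positions in the labels, in which case a separate pad token would be needed for the end marker to be learned.

```python
EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text token (tokenizer.eos_token)

def format_row_with_eos(row, eos_token=EOS_TOKEN):
    """Hypothetical variant of format_row that marks where the answer ends."""
    return f"User: {row['question']}\nBot: {row['answer']}{eos_token}"

# Made-up example row
example = {
    "question": "How do I download my course certificate?",
    "answer": "Use the student portal.",
}
print(format_row_with_eos(example))
```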

# Step 4 : Load Tokenizer + Base Model (GPT-2)

In [4]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_name = "gpt2"  # You may try: "distilgpt2" (faster) or "gpt2-medium" (better)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained(model_name)


# Step 5 : Tokenize the Dataset

In [5]:
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized_ds = dataset.map(tokenize, batched=True, remove_columns=["text"])

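
With `padding="max_length"`, every example is padded to 256 tokens, even when the text is much shorter. The sketch below illustrates the difference between that and padding only to the longest example in a batch. It is plain Python for illustration, not the tokenizer's implementation; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in this notebook

def pad_batch(sequences, pad_id=PAD_ID, max_length=None):
    """Right-pad token-id lists to max_length, or to the longest sequence if None."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate, mirroring truncation=True
        pad = target - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[10, 11, 12], [20, 21]]        # made-up token ids
fixed = pad_batch(batch, max_length=8)  # like padding="max_length"
dynamic = pad_batch(batch)              # pad only to the longest example
print(len(fixed["input_ids"][0]), len(dynamic["input_ids"][0]))
```

Shorter padded sequences mean less wasted computation per step. Whether the data collator pads unpadded examples per batch in the installed `transformers` version is worth verifying before dropping `padding="max_length"`.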

# Step 6 : Training Setup

In [9]:
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Alternative (commented out): without report_to, the Trainer falls back to its default experiment-tracking integrations
# training_args = TrainingArguments(
#     output_dir="nielit-helpdesk-model",
#     per_device_train_batch_size=2,
#     gradient_accumulation_steps=2,
#     num_train_epochs=3,
#     logging_steps=50,
#     save_steps=200,
#     fp16=True,
# )

# Training without any external logging (W&B, TensorBoard, MLflow, etc.)
training_args = TrainingArguments(
    output_dir="nielit-helpdesk-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    logging_steps=50,
    save_steps=200,
    fp16=True,
    # Disable W&B, TensorBoard, MLflow, etc.
    # https://wandb.ai/authorize is the login page for Weights & Biases (W&B),
    # a tool used to track machine learning experiments (loss curves, metrics, GPU usage, etc.).
    report_to="none",
)

# Step 7 : Train the Model

In [10]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
    data_collator=data_collator,
)

trainer.train()



| Step | Training Loss |
|---|---|
| 50 | 1.5515 |
| 100 | 0.2152 |
| 150 | 0.1463 |
| 200 | 0.1285 |
| 250 | 0.1212 |
| 300 | 0.1135 |
| 350 | 0.1044 |
| 400 | 0.1025 |
| 450 | 0.1006 |
| 500 | 0.0984 |


TrainOutput(global_step=3750, training_loss=0.11551635335286459, metrics={'train_runtime': 1439.7116, 'train_samples_per_second': 10.419, 'train_steps_per_second': 2.605, 'total_flos': 1959690240000000.0, 'train_loss': 0.11551635335286459, 'epoch': 3.0})
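
The `global_step=3750` in the training output follows from the dataset size and the training arguments. The arithmetic below reproduces it, along with the reported throughput figures. The single-GPU setting and the absence of a `save_total_limit` are assumptions; the other numbers are taken from the notebook.

```python
import math

num_examples = 5000      # num_rows reported for the dataset
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 2           # gradient_accumulation_steps
epochs = 3               # num_train_epochs
save_steps = 200         # save_steps
num_devices = 1          # assumption: a single Colab GPU
train_runtime = 1439.7116  # seconds, from TrainOutput

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
checkpoints = total_steps // save_steps  # assumes no save_total_limit

samples_per_second = round(num_examples * epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(effective_batch, steps_per_epoch, total_steps)  # 4 1250 3750
print(checkpoints)                                    # 18
print(samples_per_second, steps_per_second)           # 10.419 2.605
```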

### You only need W&B if you want features like:

* Training graphs
* Model comparison over time
* Experiment logging
* Team collaboration features

### For your case (student helpdesk chatbot fine-tuning):

* You can **completely ignore it**
* The notebook will run fine without logging in
* Skipping the login will not stop training

---

### Recommendation for your project:

* For students and beginners: **skip it**
* Keep training simple and focused
* Enable W&B only if you are teaching experiment monitoring

---


# Step 8 : Save Model + Tokenizer

In [11]:
trainer.save_model("nielit-helpdesk-chatbot")
tokenizer.save_pretrained("nielit-helpdesk-chatbot")

print("Model Saved Successfully!")

Model Saved Successfully!
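
The saved folder lives on the Colab runtime's disk, so it may be worth archiving it for download. A minimal sketch using the standard library follows; `zip_model_dir` is a hypothetical helper, and the demo runs on a throwaway folder with a placeholder file rather than the real model.

```python
import shutil
import tempfile
from pathlib import Path

def zip_model_dir(model_dir, archive_base):
    """Create <archive_base>.zip from model_dir and return the archive path."""
    model_dir = Path(model_dir).resolve()
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model folder not found: {model_dir}")
    archive_base = Path(archive_base).resolve()
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))

# Demo on a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "nielit-helpdesk-chatbot"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")  # placeholder file
    archive_path = zip_model_dir(demo_dir, Path(tmp) / "nielit-helpdesk-chatbot")
    archive_exists = Path(archive_path).exists()
    print(Path(archive_path).name, archive_exists)
```

In Colab, this would be called as `zip_model_dir("nielit-helpdesk-chatbot", "nielit-helpdesk-chatbot")`, followed by `files.download("nielit-helpdesk-chatbot.zip")` from `google.colab`.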


# Part 2: Load Model & Test Chatbot

You can restart the runtime and jump directly to this section anytime, provided the saved `nielit-helpdesk-chatbot` folder is still present under `/content`.

# Step 1 : Load Trained Model

In [12]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

model_path = "/content/nielit-helpdesk-chatbot"

tokenizer = GPT2Tokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained(model_path)

# Step 2 : Chat Function

In [13]:
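import torch

def chat(prompt):
    # Use the same "User: ...\nBot:" template the model was fine-tuned on
    input_text = f"User: {prompt}\nBot:"
    inputs = tokenizer(input_text, return_tensors="pt")

    output = model.generate(
        **inputs,
        max_length=200,   # total length: prompt tokens plus generated tokens
        do_sample=True,   # sample instead of greedy decoding
        top_p=0.92,       # nucleus sampling threshold
        temperature=0.7   # values below 1.0 make sampling more conservative
    )

    reply = tokenizer.decode(output[0], skip_special_tokens=True)
    # Keep only the text after the last "Bot:" marker
    return reply.split("Bot:")[-1].strip()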
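
`max_length=200` counts the prompt tokens as well as the generated ones, and nothing in the sampling settings discourages repetition. The sketch below collects an alternative set of `generate()` arguments. `build_generation_kwargs` is a hypothetical helper, and the values for `max_new_tokens`, `repetition_penalty`, and `no_repeat_ngram_size` are assumed starting points, not tuned results.

```python
def build_generation_kwargs(eos_token_id, max_new_tokens=80):
    """Hypothetical helper: keyword arguments for model.generate()."""
    return {
        "max_new_tokens": max_new_tokens,  # limits generated tokens only, not the prompt
        "do_sample": True,
        "top_p": 0.92,
        "temperature": 0.7,
        "repetition_penalty": 1.2,         # assumed starting value, tune as needed
        "no_repeat_ngram_size": 3,         # block exact 3-gram repeats
        "pad_token_id": eos_token_id,      # set explicitly for open-ended generation
        "eos_token_id": eos_token_id,
    }

gen_kwargs = build_generation_kwargs(eos_token_id=50256)  # 50256 is GPT-2's end-of-text id
print(sorted(gen_kwargs))
# In the notebook: output = model.generate(**inputs, **gen_kwargs)
```

Passing `tokenizer.eos_token_id` instead of the literal id keeps the helper independent of the model choice.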

# Step 3 : Test the Chatbot

In [14]:
print(chat("How can I apply for admission?"))
print(chat("What is the course fee for Data Science?"))
print(chat("Is hostel available for students?"))
print(chat("Where is the NIELIT Chandigarh campus located?"))




The eligibility criteria for the Full Stack Development course at NIELIT Chandigarh can be found on the official admission portal. Typically, it requires completion of Diploma or equivalent qualification. Typically, it requires completion of a recognized Bachelor's degree. Typically, it requires completion of a recognized Bachelor's degree. Typically, it requires completion of a recognized Bachelor's degree. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics.




The fee for the Data Science course at NIELIT Chandigarh is available on the fee structure page of the institute website. Typically, it requires completion of a recognized Bachelor's degree. Typically, it requires completion of a recognized Bachelor's degree. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion of 10+2 with Mathematics. Typically, it requires completion: 10+2 with Mathematics. Typically, it requires completion: 10+2 with Mathematics. Typically, it requires completion: 10+2 with Mathematics. Typically: 10




Yes
NIELIT Chandigarh campus is located at Plot No. 1, 2nd Floor, Sampark Building, Sector 42B, Chandigarh. NIELIT Chandigarh campus is located at Plot No. 1, 2nd Floor, Sampark Building, Sector 42B, Chandigarh. NIELIT Chandigarh campus is located at Plot No. 1, 2nd Floor, Sampark Building, Sector 42B, Chandigarh. NIELIT Chandigarh campus is located at Plot No: 1, 2nd Floor, Sampark Building, Sector 42B, Chandigarh. NIELIT Chandigarh campus is located at Plot: 1, Sampark Building, Sector 42B, Chandigarh. NIELIT Chandigarh campus is located at Plot No: 1, Sampark Building, Sector 42
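
The replies above repeat the same sentence many times and sometimes end mid-sentence. A post-processing step can tidy the text that is shown to the user, although it does not address why the model keeps generating. The sketch below is one possible approach; `clean_reply` is a hypothetical helper, the sample text is a shortened stand-in for the replies above, and the naive sentence splitting will also break on abbreviations such as "No.".

```python
import re

def clean_reply(text, stop_marker="User:"):
    """Trim a generated reply: cut at the next turn, drop repeated and unfinished sentences."""
    text = text.split(stop_marker)[0].strip()
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept, seen = [], set()
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        # In a multi-sentence reply, a fragment without end punctuation was cut off
        if sentence[-1] not in ".!?" and kept:
            continue
        key = sentence.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return " ".join(kept)

raw = (
    "The fee is listed on the website. "
    "Typically, it requires completion of 10+2 with Mathematics. "
    "Typically, it requires completion of 10+2 with Mathematics. "
    "Typically: 10"
)
print(clean_reply(raw))
```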


# Done!

You now have a fine-tuned NIELIT Student Helpdesk Chatbot.


**What's Next?**

* Web Chat UI with Gradio
* WhatsApp / Telegram bot version
* Student Lab Manual (PDF)
* Classroom PPT (30–40 slides)
* FastAPI / Flask backend API
* Model export to GGUF (for LM Studio / CPU inference)
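
As a starting point for the first item in the list, the sketch below wraps a chat function in a Gradio chat UI. It assumes Gradio is installed (`pip install gradio`) and that the notebook's `chat()` function is passed in. `respond` and `launch_ui` are hypothetical names, and the placeholder reply exists only so the sketch runs without the model loaded.

```python
def respond(message, history, chat_fn=None):
    """Adapter with the (message, history) signature that gr.ChatInterface passes in."""
    if chat_fn is None:
        # Placeholder so the sketch runs without the fine-tuned model loaded
        chat_fn = lambda prompt: f"(model reply to: {prompt})"
    message = message.strip()
    if not message:
        return "Please type a question."
    return chat_fn(message)

def launch_ui(chat_fn):
    """Start a Gradio chat UI around chat_fn."""
    import gradio as gr  # imported lazily so respond() can be tried without Gradio
    demo = gr.ChatInterface(
        fn=lambda message, history: respond(message, history, chat_fn),
        title="NIELIT Student Helpdesk",
    )
    demo.launch()

print(respond("How can I apply for admission?", []))
# In the notebook, after defining chat(): launch_ui(chat)
```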
