# Kaggle Execution Checklist Notebook

**Project:** Fine-Tuning LLaMA 3.1-8B-Instruct on Bengali Empathetic Conversations
**Environment:** Kaggle Free GPU (T4 / P100)

---

## 0. Notebook Metadata (Fill Before Running)

* **Author:** Md Islam
* **Date:**
* **GPU Type:**
* **Strategy Used:** ☐ LoRA ☐ Unsloth
* **Model:** LLaMA 3.1-8B-Instruct

---

## 1. Environment Setup

### 1.1 Enable GPU

* [ ] Kaggle Notebook → Settings → Accelerator = GPU
* [ ] Confirm CUDA availability

### 1.2 Install Dependencies

* [ ] transformers
* [ ] datasets
* [ ] torch
* [ ] peft
* [ ] accelerate
* [ ] bitsandbytes
* [ ] evaluate
* [ ] nltk
* [ ] rouge-score

### 1.3 Hugging Face Authentication

* [ ] `huggingface-cli login`
* [ ] Verify access to LLaMA 3.1-8B-Instruct

---

## 2. Dataset Preparation

### 2.1 Load Dataset

* [ ] Load Bengali Empathetic Conversations dataset
* [ ] Inspect columns and sample records

### 2.2 Data Cleaning

* [ ] Remove null / empty rows
* [ ] Normalize Bengali Unicode text
* [ ] Remove duplicates

### 2.3 Instruction Formatting

* [ ] Convert to instruction–response format
* [ ] Validate format consistency

### 2.4 Tokenization

* [ ] Load LLaMA tokenizer
* [ ] Full-sequence tokenization (NO truncation)
* [ ] Verify maximum token length

### 2.5 Dataset Split

* [ ] Train set
* [ ] Validation set
* [ ] Test prompts set

---

## 3. Code Architecture (OOP)

### 3.1 DatasetProcessor

* [ ] Load dataset
* [ ] Clean data
* [ ] Instruction formatting
* [ ] Tokenization

### 3.2 Fine-Tuning Strategy (Strategy Pattern)

* [ ] Define FineTuningStrategy interface
* [ ] Implement LoRAStrategy
* [ ] (Optional) Implement UnslothStrategy

---

## 4. Model Training

### 4.1 Model Setup

* [ ] Load LLaMA 3.1-8B-Instruct
* [ ] Enable gradient checkpointing
* [ ] Enable mixed precision (fp16 / bf16)

### 4.2 Apply Fine-Tuning Strategy

* [ ] Apply LoRA / Unsloth
* [ ] Confirm trainable parameters

### 4.3 Training Configuration

* [ ] Batch size
* [ ] Gradient accumulation
* [ ] Learning rate
* [ ] Optimizer (AdamW)
* [ ] Scheduler
* [ ] Epoch count

### 4.4 Training Execution

* [ ] Start training
* [ ] Monitor training loss
* [ ] Monitor validation loss
* [ ] Save checkpoints

---

## 5. Experiment Logging

### 5.1 LLAMAExperiments

* [ ] Experiment ID
* [ ] Model name
* [ ] Strategy configuration
* [ ] Train loss
* [ ] Validation loss
* [ ] Metrics
* [ ] Timestamp

### 5.2 GeneratedResponses

* [ ] Experiment ID reference
* [ ] Input text
* [ ] Generated response
* [ ] Timestamp
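
The two log tables listed above could be kept in a small SQLite database. The following is a minimal sketch under stated assumptions, not the project's actual logging code: the column names, the JSON encoding of the strategy configuration and metrics, and the helper names `create_tables`, `log_experiment`, and `log_response` are all hypothetical, and the logged values are illustrative.

```python
import json
import sqlite3
from datetime import datetime, timezone


def _now():
    """UTC timestamp in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()


def create_tables(conn):
    """Create both log tables if they do not exist yet (assumed schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS LLAMAExperiments (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               model_name TEXT, strategy_config TEXT,
               train_loss REAL, val_loss REAL,
               metrics TEXT, timestamp TEXT)"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS GeneratedResponses (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               experiment_id INTEGER REFERENCES LLAMAExperiments(id),
               input_text TEXT, response_text TEXT, timestamp TEXT)"""
    )


def log_experiment(conn, model_name, strategy_config, train_loss, val_loss, metrics):
    """Insert one experiment row and return its ID."""
    cur = conn.execute(
        "INSERT INTO LLAMAExperiments "
        "(model_name, strategy_config, train_loss, val_loss, metrics, timestamp) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (model_name, json.dumps(strategy_config), train_loss, val_loss,
         json.dumps(metrics), _now()),
    )
    conn.commit()
    return cur.lastrowid


def log_response(conn, experiment_id, input_text, response_text):
    """Insert one generated response linked to an experiment."""
    cur = conn.execute(
        "INSERT INTO GeneratedResponses "
        "(experiment_id, input_text, response_text, timestamp) VALUES (?, ?, ?, ?)",
        (experiment_id, input_text, response_text, _now()),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")  # use a file path to persist between sessions
create_tables(conn)
exp_id = log_experiment(
    conn, "example-model", {"strategy": "lora", "r": 8},
    train_loss=0.71, val_loss=None, metrics={"perplexity": 1.95},
)
log_response(conn, exp_id, "example prompt", "example response")
```

Storing the configuration and metrics as JSON text keeps the schema fixed when a different strategy (for example Unsloth) brings different hyperparameters.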

---

## 6. Evaluation

### 6.1 Automatic Metrics

* [ ] Perplexity
* [ ] BLEU
* [ ] ROUGE-1
* [ ] ROUGE-2
* [ ] ROUGE-L

### 6.2 Human Evaluation

* [ ] Select 20–50 test prompts
* [ ] Generate responses
* [ ] Rate empathy quality
* [ ] Rate naturalness
* [ ] Rate relevance
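
One way to summarise the human ratings is a mean score per criterion. This is only a sketch: the 1 to 5 scale, the dictionary layout, and the name `summarize_ratings` are assumptions, and the ratings shown are made up for illustration.

```python
from statistics import mean

CRITERIA = ("empathy", "naturalness", "relevance")


def summarize_ratings(ratings):
    """Average each criterion over all rated responses.

    ratings: list of dicts, one per response, each holding a score
    (assumed 1-5) for every criterion in CRITERIA.
    """
    return {c: round(mean(r[c] for r in ratings), 2) for c in CRITERIA}


# Illustrative ratings for two responses
ratings = [
    {"empathy": 4, "naturalness": 3, "relevance": 5},
    {"empathy": 2, "naturalness": 4, "relevance": 3},
]
summary = summarize_ratings(ratings)
print(summary)
```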

---

## 7. Analysis

* [ ] Compare baseline vs fine-tuned outputs
* [ ] Identify empathy improvements
* [ ] Identify failure cases
* [ ] (Optional) Compare LoRA vs Unsloth

---

## 8. Deliverables

* [ ] Preprocessing notebook/script
* [ ] Fine-tuning notebook/script
* [ ] Evaluation notebook/script
* [ ] Sample Bengali responses
* [ ] Metrics table
* [ ] Analysis summary
* [ ] Documentation (strategy, challenges, solutions)

---

## 9. Final Verification

* [ ] All requirements satisfied
* [ ] No sequence length reduction
* [ ] Kaggle GPU limits respected
* [ ] Code is modular and clean
* [ ] Submission ready

---

**Status:** ☐ In Progress ☐ Completed


In [1]:
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))
print("Torch version:", torch.__version__)


CUDA available: True
GPU: Tesla T4
Torch version: 2.8.0+cu126


In [2]:
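# Log in to Hugging Face. The access token is read from Kaggle Secrets instead
# of being hard-coded in the notebook (the secret name "HF_TOKEN" is an assumption).
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

login(token=UserSecretsClient().get_secret("HF_TOKEN"))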


In [3]:
from datasets import load_dataset

In [4]:
import pandas as pd

# Path to the corpus CSV in the attached Kaggle dataset
df = pd.read_csv(
    "/kaggle/input/bengali-empathetic-conversations-corpus/bengali-empathetic-conversations-corpus.csv"
)

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (38233, 4)


Unnamed: 0,Topics,Question-Title,Questions,Answers
0,পারিবারিক দ্বন্দ্ব,মা ও স্ত্রীর মধ্যে মতানৈক্য বৃদ্ধি,আমার স্ত্রী এবং মায়ের মধ্যে টানটান মতবিরোধ চ...,"আপনি যা বর্ণনা করছেন তাকে মনোবিজ্ঞানীরা ""ত্রি..."
1,"পদার্থের অপব্যবহার, আসক্তি",আমি ধূমপানে আসক্ত। আমি কিভাবে থামাতে পারি?,"আমি বাচ্চা নেওয়ার পরিকল্পনা করছি, তাই আমাকে ...",হাই। আপনার শিশুর (এবং নিজের) জন্য যা স্বাস্থ্...
2,পারিবারিক দ্বন্দ্ব,আমার পরিবারের কাছ থেকে গোপন রাখা,"আমার মনের মধ্যে গোপন আছে, এবং আমি জানি না তাদ...",মনে হচ্ছে গোপন রাখা এখন আপনার জন্য একটি সমস্য...
3,"আচরণগত পরিবর্তন, সামাজিক সম্পর্ক",অধিকারী হওয়ার অন্তর্নিহিত কারণ,আমি আমার সম্পর্কের ক্ষেত্রে অত্যন্ত অধিকারসূচক...,হ্যালো। এটা দুর্দান্ত যে আপনি উপলব্ধি করতে সক...
4,দুশ্চিন্তা,আমি কি ওষুধ ছাড়া উদ্বেগ নিয়ন্ত্রণ করতে পারি?,কয়েক বছর আগে আমার মাথায় আঘাত লেগেছিল এবং আমা...,আপনি বলেননি কি বা কত ওষুধ আপনি চেষ্টা করেছেন।...


In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38233 entries, 0 to 38232
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Topics          38222 non-null  object
 1   Question-Title  37639 non-null  object
 2   Questions       38215 non-null  object
 3   Answers         38228 non-null  object
dtypes: object(4)
memory usage: 1.2+ MB


In [6]:
df.sample(5)

Unnamed: 0,Topics,Question-Title,Questions,Answers
26511,ক্ষিপ্ত,মানুষ ক্লান্ত,পাশের রাস্তায় ৫০-৬০ জন লোক যেতে ক্লান্ত।,কত দ্রুত তারা যেতে অনুমিত হয়?
8468,আনন্দদায়ক,কাজ করতে দারুণ,আমিও তাই করেছি। বাচ্চাদের সাথেও কাজ করা খুব ভালো,"হ্যাঁ, বাচ্চাদের শেখানো এবং কীভাবে কার্যকর মান..."
1577,PTSD,পার্শ্ব প্রতিক্রিয়া সত্যিই খারাপ এবং যৌন প্রভ...,আমার PTSD আছে। পার্শ্ব প্রতিক্রিয়া সত্যিই খা...,ধীরে ধীরে যে হারে আপনি আপনার জীবন ফিরে পাবেন।...
10989,দুঃখজনক,তাদের জীবন নষ্ট করে,হ্যাঁ...এটা খুবই দুঃখজনক বিশেষ করে যখন আপনি জা...,আপনাকে এই লোকদের যেতে দিতে হবে
19471,উত্তেজিত,গিল্ড পয়েন্ট,আমি আমার গিল্ড পয়েন্ট মেইলে আসার জন্য অপেক্ষা...,গিল্ড সদস্যদের জন্য পয়েন্ট কি?


In [7]:
print(df.iloc[0])

Topics                                           পারিবারিক দ্বন্দ্ব
Question-Title                   মা ও স্ত্রীর মধ্যে মতানৈক্য বৃদ্ধি
Questions          আমার স্ত্রী এবং মায়ের মধ্যে টানটান মতবিরোধ চ...
Answers            আপনি যা বর্ণনা করছেন তাকে মনোবিজ্ঞানীরা "ত্রি...
Name: 0, dtype: object


In [8]:
# Check dataset quality: missing values per column
print(df.isnull().sum())

# Check duplicates
print("Duplicate rows:", df.duplicated().sum())


Topics             11
Question-Title    594
Questions          18
Answers             5
dtype: int64
Duplicate rows: 25


In [9]:
df.columns

Index(['Topics', 'Question-Title', 'Questions', 'Answers'], dtype='object')

In [10]:
df = df[["Topics", "Questions", "Answers"]]
print(df.shape)


(38233, 3)


In [11]:
df = df.dropna(subset=["Questions", "Answers"])

df = df[
    (df["Questions"].str.strip() != "") &
    (df["Answers"].str.strip() != "")
]

df = df.reset_index(drop=True)

print("Cleaned dataset shape:", df.shape)


Cleaned dataset shape: (38210, 3)


In [12]:
def normalize_text(text: str) -> str:
    text = text.strip()
    text = " ".join(text.split())  # normalize whitespace
    return text

df["Questions"] = df["Questions"].apply(normalize_text)
df["Answers"] = df["Answers"].apply(normalize_text)
df["Topics"] = df["Topics"].fillna("").apply(normalize_text)
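
The cell above only normalizes whitespace, and the 25 duplicate rows reported earlier are not dropped. The sketch below shows one possible way to cover both checklist items with the standard library. The function names are hypothetical, and choosing NFC (rather than another Unicode normalization form) is an assumption.

```python
import unicodedata


def normalize_bengali(text: str) -> str:
    """Apply Unicode NFC normalization, then collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def drop_duplicate_pairs(pairs):
    """Keep the first occurrence of each (question, answer) pair, in order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


# The same Bengali letter written as one code point and as base letter + nukta
single_code_point = "\u09df"
base_plus_nukta = "\u09af\u09bc"
print(normalize_bengali(single_code_point) == normalize_bengali(base_plus_nukta))
```

With the DataFrame used in this notebook, the same ideas would translate to `df["Questions"].apply(normalize_bengali)` followed by `df.drop_duplicates(subset=["Questions", "Answers"])`. Normalizing before deduplicating lets visually identical rows with different encodings collapse into one.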


### Instruction:
Respond empathetically in Bengali to the following message.

### Context:
Topic: {topic}

### User:
{question}

### Assistant:
{answer}

In [13]:
def format_instruction(row):
    return f"""### Instruction:
Respond empathetically in Bengali to the following message.

### Context:
Topic: {row['Topics']}

### User:
{row['Questions']}

### Assistant:
{row['Answers']}"""

df["text"] = df.apply(format_instruction, axis=1)
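
The checklist asks for format consistency to be validated. A minimal sketch of such a check follows: it verifies that each section marker of the template occurs exactly once, in template order, and that the answer is not empty. The helper name `is_well_formed` is hypothetical.

```python
REQUIRED_MARKERS = ["### Instruction:", "### Context:", "### User:", "### Assistant:"]


def is_well_formed(text: str) -> bool:
    """True if all markers appear once, in order, followed by a non-empty answer."""
    position = -1
    for marker in REQUIRED_MARKERS:
        if text.count(marker) != 1:
            return False
        index = text.find(marker)
        if index < position:
            return False
        position = index
    answer = text.split("### Assistant:", 1)[1]
    return bool(answer.strip())


example = (
    "### Instruction:\nRespond empathetically in Bengali to the following message.\n\n"
    "### Context:\nTopic: t\n\n### User:\nq\n\n### Assistant:\na"
)
print(is_well_formed(example))
```

Applied to the notebook's data, `[t for t in df["text"] if not is_well_formed(t)]` would list the rows to inspect. A row whose question or answer happens to contain one of the marker strings is also flagged, which is usually what one wants here.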


In [14]:
print(df["text"].iloc[0])


### Instruction:
Respond empathetically in Bengali to the following message.

### Context:
Topic: পারিবারিক দ্বন্দ্ব

### User:
আমার স্ত্রী এবং মায়ের মধ্যে টানটান মতবিরোধ চলছে। অতীতে, তাদের মধ্যে ছোটখাটো পার্থক্য ছিল। উদাহরণস্বরূপ, আমার স্ত্রী আমার কাছে অভিযোগ করবে যে আমার মা খুব কর্তৃত্বপ্রয়াসী; আমার মা অভিযোগ করবেন আমার স্ত্রী অলস। তবে ইদানীং তা তীব্রতর হয়েছে । আমি মনে করি, এর কারণ হচ্ছে আমার স্ত্রী তার সাথে একবার কথার প্রতিত্তর করেছিল

In [15]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(
    df,
    test_size=0.1,
    random_state=42
)

print("Train size:", len(train_df))
print("Validation size:", len(val_df))


Train size: 34389
Validation size: 3821
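
The checklist also lists a separate test-prompt set, while the cell above produces only train and validation splits. A three-way split could look like the sketch below. The function name, the 1% test fraction, and the use of the standard-library shuffler instead of scikit-learn are assumptions made to keep the example self-contained.

```python
import random


def three_way_split(items, val_fraction=0.1, test_fraction=0.01, seed=42):
    """Shuffle once with a fixed seed, then carve out test, validation, and train."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


# Stand-in for row indices of the cleaned DataFrame
train_idx, val_idx, test_idx = three_way_split(range(1000))
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting on row indices keeps the three sets disjoint, so prompts used for human evaluation are never seen during training or validation.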


In [16]:
# Note: this run loads Llama 2 7B Chat, not the LLaMA 3.1-8B-Instruct named in the checklist.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    use_fast=True
)

# LLaMA requires explicit padding token
tokenizer.pad_token = tokenizer.eos_token



In [17]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df[["text"]])
val_dataset   = Dataset.from_pandas(val_df[["text"]])


In [18]:
MAX_LEN = 512   # critical for memory: examples longer than this are truncated

def tokenize_function(examples):
    outputs = tokenizer(
        examples["text"],
        truncation=True,        # cut sequences longer than MAX_LEN
        max_length=MAX_LEN,
        padding=False,
        return_attention_mask=True,
    )
    outputs["labels"] = outputs["input_ids"].copy()
    return outputs



In [19]:
tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

tokenized_val = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)



In [20]:
lengths = [len(x) for x in tokenized_train["input_ids"]]

print("Max tokens:", max(lengths))
print("Average tokens:", sum(lengths) // len(lengths))


Max tokens: 512
Average tokens: 240
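
Because the lengths above are measured after truncation, the maximum can never exceed `MAX_LEN`, so the output does not show how much text is being cut. A small report like the sketch below could be run on lengths obtained from an untruncated tokenization pass. The function name and the sample lengths are hypothetical.

```python
def truncation_report(lengths, max_len):
    """Summarise how many examples a max_len token cap would truncate."""
    over = [n for n in lengths if n > max_len]
    return {
        "examples": len(lengths),
        "over_limit": len(over),
        "share_over_limit": len(over) / len(lengths) if lengths else 0.0,
        "longest": max(lengths, default=0),
    }


# Illustrative token counts, as if measured without truncation
report = truncation_report([167, 240, 512, 900, 1300], max_len=512)
print(report)
```

If the share over the limit turns out to be large, the "no sequence length reduction" item in the final verification list is not met with a 512-token cap.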


In [21]:
sample = tokenized_train[0]
print("Input IDs length:", len(sample["input_ids"]))


Input IDs length: 167


**Step 5**

In [22]:
from abc import ABC, abstractmethod

class FineTuningStrategy(ABC):

    @abstractmethod
    def apply(self, model):
        pass

    @abstractmethod
    def get_config(self):
        pass


In [23]:
from peft import LoraConfig, get_peft_model

class LoRAStrategy(FineTuningStrategy):

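    def __init__(
        self,
        r=8,               # lowered from 16
        lora_alpha=16,     # lowered from 32
        lora_dropout=0.05
    ):
        # Build the config from the constructor arguments rather than
        # hard-coded values, so callers can actually change them.
        self.config = LoraConfig(
            r=r,
            lora_alpha=lora_alpha,
            target_modules=["q_proj", "v_proj"],
            lora_dropout=lora_dropout,
            bias="none",
            task_type="CAUSAL_LM"
        )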

    def apply(self, model):
        return get_peft_model(model, self.config)

    def get_config(self):
        return self.config



In [24]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model



In [25]:
# 8-bit quantization config (bitsandbytes). The model-loading cell keeps
# quantization_config commented out, so this config is not used in this run.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False
)

**Step 6**

In [26]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)


In [27]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-bengali-empathetic",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,   # effective batch size = 1
    learning_rate=2e-4,
    max_steps=2500,                  # fixed step budget instead of full epochs
    fp16=True,                       # mixed-precision training
    logging_steps=25,
    eval_strategy="no",
    save_strategy="no",
    report_to="none",
)



In [28]:
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    # quantization_config=bnb_config,
    dtype=torch.float16,  # `torch_dtype` is deprecated in this transformers version
    device_map="auto",
    low_cpu_mem_usage=True
)


`torch_dtype` is deprecated! Use `dtype` instead!



In [29]:
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is not used with gradient checkpointing

# Apply LoRA through the strategy class defined earlier
# (r=8 and lora_alpha=16 on q_proj / v_proj, lowered from r=16 and lora_alpha=32).
lora_strategy = LoRAStrategy(r=8, lora_alpha=16, lora_dropout=0.05)
model = lora_strategy.apply(model)
model.print_trainable_parameters()


trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
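
The trainable-parameter count can be checked by hand. Each adapted linear layer gains two matrices, A (rank × in_features) and B (out_features × rank). With 32 decoder layers, two target modules per layer (`q_proj` and `v_proj`), 4096 input and output features, and rank 8, the arithmetic below reproduces the printed figure. The function name is hypothetical.

```python
def lora_param_count(num_layers, targets_per_layer, in_features, out_features, rank):
    """Number of parameters LoRA adds across all adapted linear layers."""
    per_module = rank * in_features + out_features * rank  # A plus B
    return num_layers * targets_per_layer * per_module


trainable = lora_param_count(
    num_layers=32, targets_per_layer=2, in_features=4096, out_features=4096, rank=8
)
total = 6_742_609_920  # "all params" as printed above
share = round(100 * trainable / total, 4)
print(trainable, share)
```

Doubling the rank doubles the adapter size, which is why lowering `r` from 16 to 8 halves the trainable parameters.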


In [30]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # newer name for the deprecated tokenizer= argument
    data_collator=data_collator
)




In [31]:
from peft import prepare_model_for_kbit_training


In [32]:
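# LoRA was already applied when the model was set up, so the model is not
# wrapped a second time here: calling get_peft_model() again would nest one
# PeftModel inside another. prepare_model_for_kbit_training() is also skipped
# because the base model was loaded in fp16, not quantized.

# 1. Make the embedding outputs require grad so that gradient checkpointing
#    can backpropagate into the LoRA adapters
model.enable_input_require_grads()

# 2. Keep the KV cache disabled during training
model.config.use_cache = False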




In [33]:
model.print_trainable_parameters()


trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622


In [34]:
trainer.train()


The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Step,Training Loss
25,1.4172
50,0.9274
75,0.9436
100,0.8637
125,0.8095
150,0.7839
175,0.8168
200,0.8265
225,0.7645
250,0.8005


TrainOutput(global_step=2500, training_loss=0.7110343650817871, metrics={'train_runtime': 2669.7065, 'train_samples_per_second': 0.936, 'train_steps_per_second': 0.936, 'total_flos': 2.40770392971264e+16, 'train_loss': 0.7110343650817871, 'epoch': 0.07269766495100177})
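
The `epoch` value in the output above follows directly from the training configuration: 2,500 steps at batch size 1 with no gradient accumulation touch 2,500 of the 34,389 training examples. A short sketch of the arithmetic is shown here. The function name is hypothetical, and a single device is an assumption that happens to match the logged value.

```python
def epochs_covered(max_steps, per_device_batch, grad_accum, num_devices, train_size):
    """Fraction of one pass over the training set completed by a step budget."""
    samples_seen = max_steps * per_device_batch * grad_accum * num_devices
    return samples_seen / train_size


fraction = epochs_covered(
    max_steps=2500, per_device_batch=1, grad_accum=1, num_devices=1, train_size=34389
)
print(f"{fraction:.4f} of an epoch")
```

In other words, the model saw roughly 7% of the training data once. A full epoch at this batch size would need 34,389 steps.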

**Deliverables**
 1. Preprocessing notebook/script
 2. Fine-tuning notebook/script
 3. Evaluation notebook/script
 4. Sample Bengali responses
 5. Metrics table
 6. Analysis summary
 7. Documentation (strategy, challenges, solutions)

**Generate Bengali Empathetic Responses**

In [35]:
model.eval()



In [41]:
prompt = """### Instruction:
Solve the following calculation and explain briefly.

### User:
125 × 24 + 18 কত?

### Assistant:
"""


inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))


### Instruction:
Solve the following calculation and explain briefly.

### User:
125 × 24 + 18 কত?

### Assistant:
আপনি এখন এটা কি চাকরি আছে? আপনি সব ধরনের কাজ কি আছে? এটা আপনার জন্য একটি চাকরি না কি? আপনি কি এটা আপনার সা


In [37]:
model.save_pretrained("./lora-bengali-empathetic")
tokenizer.save_pretrained("./lora-bengali-empathetic")


('./lora-bengali-empathetic/tokenizer_config.json',
 './lora-bengali-empathetic/special_tokens_map.json',
 './lora-bengali-empathetic/chat_template.jinja',
 './lora-bengali-empathetic/tokenizer.model',
 './lora-bengali-empathetic/added_tokens.json',
 './lora-bengali-empathetic/tokenizer.json')

**Metrics Table**

In [42]:
# !pip install evaluate nltk rouge-score


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=c819849073f430fbea9e8258947093bdfb4dceb1c8812a4b3a3c49a7c40669e2
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score, evaluate
Successfully installed evaluate-0.4.6 rouge-score-0.1.2


In [43]:
def extract_prompt_and_answer(text):
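    # Split only at the first marker so an answer containing the marker text does not break unpacking
    prompt, answer = text.split("### Assistant:", 1)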
    return prompt + "### Assistant:", answer.strip()

val_prompts = []
val_references = []

for t in val_df["text"].tolist():
    p, a = extract_prompt_and_answer(t)
    val_prompts.append(p)
    val_references.append(a)


In [44]:
import torch

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            eos_token_id=tokenizer.eos_token_id
        )

    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return decoded.split("### Assistant:")[-1].strip()


In [45]:
N = 50  # number of validation prompts to generate for (kept small for Kaggle runtime)
predictions = []

for i in range(N):
    predictions.append(generate_response(val_prompts[i]))


In [46]:
import evaluate
import nltk
nltk.download("punkt")

bleu = evaluate.load("bleu")

bleu_score = bleu.compute(
    predictions=predictions,
    references=[[r] for r in val_references[:N]]
)

print("BLEU:", bleu_score["bleu"])


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



BLEU: 0.0
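
A BLEU of exactly 0.0 does not have to mean that the outputs share nothing with the references. Standard BLEU without smoothing is a geometric mean of the 1-gram to 4-gram precisions, so it collapses to zero as soon as one order has no match at all. The sketch below illustrates this on a single made-up sentence pair with whitespace tokens. It is a simplified illustration, not the `evaluate` implementation, and the function name is hypothetical.

```python
from collections import Counter


def ngram_precision(pred_tokens, ref_tokens, n):
    """Clipped n-gram precision of a prediction against one reference."""
    pred = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(pred.values())
    return sum((pred & ref).values()) / total if total else 0.0


# Illustrative pair: four of five words match, but no 3-gram or 4-gram does
pred = "আপনি কি এটা চেষ্টা করেছেন".split()
ref = "আপনি কি ওষুধ চেষ্টা করেছেন".split()
precisions = [ngram_precision(pred, ref, n) for n in range(1, 5)]
print(precisions)
```

For sampled, free-form responses, exact 4-gram matches with the reference are rare, so reporting the individual n-gram precisions (or a smoothed variant) may be more informative than the single BLEU number.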


In [47]:
rouge = evaluate.load("rouge")

rouge_scores = rouge.compute(
    predictions=predictions,
    references=val_references[:N]
)

print(rouge_scores)



{'rouge1': np.float64(0.0), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.0), 'rougeLsum': np.float64(0.0)}
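
All-zero ROUGE scores on Bengali text are most likely a tokenization artifact rather than a model result: the default tokenizer of the `rouge_score` package keeps only lowercase ASCII letters and digits, so Bengali strings are reduced to no tokens at all. The sketch below reproduces that effect with a comparable regular expression and contrasts it with whitespace tokenization. The functions are simplified stand-ins, not the library code, and the sentences are illustrative.

```python
import re
from collections import Counter


def ascii_alnum_tokens(text):
    """ASCII-only tokenizer: anything outside a-z and 0-9 becomes a separator."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def rouge1_f1(prediction, reference, tokenize=str.split):
    """Unigram-overlap F1 between a prediction and one reference."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "আপনি কি ওষুধ চেষ্টা করেছেন"
prediction = "আপনি কি চেষ্টা করেছেন"

print(ascii_alnum_tokens(reference))                                  # no tokens survive
print(rouge1_f1(prediction, reference, tokenize=ascii_alnum_tokens))  # ASCII-only view
print(rouge1_f1(prediction, reference))                               # whitespace tokens
```

If the installed `evaluate` ROUGE metric accepts a `tokenizer` argument, passing a whitespace tokenizer such as `lambda s: s.split()` to `rouge.compute` should give non-trivial scores. That is worth verifying against the installed version before relying on it.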


In [48]:
import math

losses = []

model.eval()

# Score the first 100 validation examples one at a time
for example in tokenized_val.select(range(100)):
    input_ids = torch.tensor(example["input_ids"]).unsqueeze(0).to(model.device)

    with torch.no_grad():
        # With labels supplied, the model returns the mean token-level cross-entropy
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss

    losses.append(loss.item())

# Perplexity = exp(mean of the per-example losses); prompt tokens are included
perplexity = math.exp(sum(losses) / len(losses))
print("Perplexity:", perplexity)


Perplexity: 1.9473086080606958
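
The figure above averages one loss per example, so a short example counts as much as a long one. Corpus-level perplexity is usually computed over tokens instead, which weights each example by the number of tokens it predicts. The sketch below contrasts the two on made-up numbers. The function names are hypothetical.

```python
import math


def perplexity_mean_of_means(per_example_losses):
    """exp of the unweighted mean of per-example mean losses (as in the cell above)."""
    return math.exp(sum(per_example_losses) / len(per_example_losses))


def perplexity_token_weighted(per_example_losses, token_counts):
    """exp of the total negative log-likelihood divided by the total predicted tokens."""
    total_nll = sum(loss * n for loss, n in zip(per_example_losses, token_counts))
    return math.exp(total_nll / sum(token_counts))


# Illustrative values: a short, easy example and a long, harder one
losses = [0.5, 1.0]
predicted_tokens = [100, 300]  # tokens each loss was averaged over

unweighted = perplexity_mean_of_means(losses)
weighted = perplexity_token_weighted(losses, predicted_tokens)
print(round(unweighted, 3), round(weighted, 3))
```

Both variants here still score the instruction and question tokens along with the answer. Masking the prompt portion of the labels would be needed to measure perplexity on the responses alone.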


In [49]:
import pandas as pd

metrics_table = pd.DataFrame({
    "Metric": ["Perplexity", "BLEU", "ROUGE-1", "ROUGE-2", "ROUGE-L"],
    "Score": [
        round(perplexity, 2),
        round(bleu_score["bleu"], 3),
        round(rouge_scores["rouge1"], 3),
        round(rouge_scores["rouge2"], 3),
        round(rouge_scores["rougeL"], 3)
    ]
})

metrics_table


Unnamed: 0,Metric,Score
0,Perplexity,1.95
1,BLEU,0.0
2,ROUGE-1,0.0
3,ROUGE-2,0.0
4,ROUGE-L,0.0


In [51]:
metrics_table.to_csv("metrics_table.csv", index=False)
