# BanglaBERT Hate Speech Classification - Optimized Training

This notebook fine-tunes `csebuetnlp/banglabert` (loaded as an ELECTRA sequence classifier) for Bengali hate speech classification, using tuned hyperparameters and memory-efficient training.

**Optimized Configuration Based on Previous Results:**
- Target Accuracy: 78%
- Optimal hyperparameters from successful runs
- Memory-efficient training through dynamic padding and gradient accumulation (gradient checkpointing is currently disabled in section 8)

## 1. Install Required Libraries

In [1]:
# Install required packages
!pip install transformers datasets accelerate torch scikit-learn pandas numpy --quiet
!pip install torchaudio --quiet

ERROR: Could not find a version that satisfies the requirement torch-audio (from versions: none)
ERROR: No matching distribution found for torch-audio

In [4]:
!wget https://raw.githubusercontent.com/md-fahad-ali/do-your-home-work/refs/heads/main/blp25_hatespeech_subtask_1A_train.tsv
!wget https://raw.githubusercontent.com/AridHasan/blp25_task1/refs/heads/main/data/subtask_1A/blp25_hatespeech_subtask_1A_dev.tsv
!wget https://raw.githubusercontent.com/AridHasan/blp25_task1/refs/heads/main/data/subtask_1A/blp25_hatespeech_subtask_1A_dev_test.tsv

--2025-09-03 09:11:44--  https://raw.githubusercontent.com/md-fahad-ali/do-your-home-work/refs/heads/main/blp25_hatespeech_subtask_1A_train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8016549 (7.6M) [text/plain]
Saving to: ‘blp25_hatespeech_subtask_1A_train.tsv’


2025-09-03 09:11:45 (203 MB/s) - ‘blp25_hatespeech_subtask_1A_train.tsv’ saved [8016549/8016549]

--2025-09-03 09:11:45--  https://raw.githubusercontent.com/AridHasan/blp25_task1/refs/heads/main/data/subtask_1A/blp25_hatespeech_subtask_1A_dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awai

## 2. Import Libraries and Setup

In [5]:
import os
import pandas as pd
import numpy as np
import torch
import logging
from datasets import Dataset, DatasetDict
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
import torch.nn.functional as F
from accelerate import Accelerator

# Disable wandb
os.environ["WANDB_DISABLED"] = "true"
# Make CUDA kernel launches synchronous so errors surface at the failing call (slows training)
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

PyTorch version: 2.8.0+cu126
CUDA available: True
GPU: Tesla T4
GPU Memory: 14.7 GB


## 3. Load and Prepare Dataset

In [6]:
# Load datasets
train_df = pd.read_csv('blp25_hatespeech_subtask_1A_train.tsv', sep='\t')
dev_df = pd.read_csv('blp25_hatespeech_subtask_1A_dev.tsv', sep='\t')
test_df = pd.read_csv('blp25_hatespeech_subtask_1A_dev_test.tsv', sep='\t')

print(f"Train dataset: {len(train_df)} samples")
print(f"Dev dataset: {len(dev_df)} samples")
print(f"Test dataset: {len(test_df)} samples")

# Check class distribution (value_counts() omits rows whose label is missing/NaN)
print("\nClass distribution in training data:")
print(train_df['label'].value_counts())

# Check text length distribution
train_lengths = train_df['text'].str.len()
print(f"\nText length stats:")
print(f"Mean: {train_lengths.mean():.1f}")
print(f"Max: {train_lengths.max()}")
print(f"95th percentile: {train_lengths.quantile(0.95):.1f}")
print(f"Samples > 128 chars: {(train_lengths > 128).sum()} ({(train_lengths > 128).mean()*100:.1f}%)")
print(f"Samples > 256 chars: {(train_lengths > 256).sum()} ({(train_lengths > 256).mean()*100:.1f}%)")

Train dataset: 35637 samples
Dev dataset: 2512 samples
Test dataset: 2512 samples

Class distribution in training data:
label
Abusive           8212
Political Hate    4232
Profane           2365
Religious Hate     722
Sexism             152
Name: count, dtype: int64

Text length stats:
Mean: 78.1
Max: 3710
95th percentile: 210.0
Samples > 128 chars: 4813 (13.5%)
Samples > 256 chars: 1109 (3.1%)
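
The five class counts printed above sum to 15,683, well short of the 35,637 training rows, so the remaining 19,954 rows have a missing (NaN) label. One possible cause, stated here as an assumption rather than something the output confirms, is that those rows carry the literal string `None` as their label: recent pandas releases include `None` in the default list of strings parsed as NaN. A minimal sketch on a hypothetical three-row sample shows how `keep_default_na=False` keeps such strings intact:

```python
import io
import pandas as pd

# Hypothetical three-row sample in an id/text/label TSV layout (not the real data).
sample_tsv = "id\ttext\tlabel\n1\tfoo\tAbusive\n2\tbar\tNone\n3\tbaz\tNone\n"

# Default parsing: strings in pandas' default NA list are converted to NaN.
default_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
n_missing_default = int(default_df["label"].isna().sum())

# Disabling the default NA list keeps every label as the string found in the file.
literal_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", keep_default_na=False)
label_counts = literal_df["label"].value_counts().to_dict()

print("Missing labels with default parsing:", n_missing_default)
print("Label counts with keep_default_na=False:", label_counts)
```

If the real files do use `None` as a class name, reading them this way would add a sixth class, and `num_labels` and the class weights would need to account for it.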


## 4. Load Model and Tokenizer

In [7]:
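# Model configuration - OPTIMAL SETTINGS FROM MEMORY
model_name = "csebuetnlp/banglabert"
max_seq_length = 256  # Counted in tokens; 13.5% of training texts exceed 128 characters
num_labels = 2  # Must equal the number of classes; section 6 computes weights for 5 labels, which a 2-way head cannot use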

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model with optimizations
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None
)

print(f"Model loaded: {model_name}")
print(f"Max sequence length: {max_seq_length}")
print(f"Model dtype: {model.dtype}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()):,}")

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at csebuetnlp/banglabert and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded: csebuetnlp/banglabert
Max sequence length: 256
Model dtype: torch.float16
Number of parameters: 110,618,882


## 5. Tokenize Data

In [8]:
# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,  # Dynamic padding with DataCollator
        max_length=max_seq_length,
        return_tensors=None
    )

# Convert to HuggingFace datasets
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(dev_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenize datasets
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=['text'])
eval_dataset = eval_dataset.map(tokenize_function, batched=True, remove_columns=['text'])
test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=['text'])

print(f"Tokenized datasets prepared")
print(f"Train: {len(train_dataset)}, Eval: {len(eval_dataset)}, Test: {len(test_dataset)}")

# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Tokenized datasets prepared
Train: 35637, Eval: 2512, Test: 2512
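
The `label` column still holds class names as strings at this point, while the loss in section 7 indexes classes by integer id. A minimal sketch of one way to encode them, using a hypothetical toy frame in place of `train_df`. The names `label2id`, `id2label`, and the `labels` column are illustrative choices, and dropping unlabeled rows is an assumption about how they should be handled:

```python
import pandas as pd

# Hypothetical miniature frame standing in for train_df.
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "label": ["Abusive", "Profane", None, "Abusive"],
})

# Drop rows without a label, then map the sorted class names to integer ids.
labeled_df = toy_df.dropna(subset=["label"]).reset_index(drop=True)
label_names = sorted(labeled_df["label"].unique())
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Integer class ids in a separate column, leaving the original names untouched.
labeled_df["labels"] = labeled_df["label"].map(label2id)
num_labels_from_data = len(label_names)

print(label2id)
print(labeled_df["labels"].tolist())
```

Sorting the names keeps the ids aligned with the order that `np.unique` produces in section 6, so index `i` of the class-weight tensor refers to the same class as label id `i`.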


## 6. Compute Class Weights

In [10]:
# Calculate class weights for imbalanced dataset
# Handle NaN values in labels before calculating weights
train_df_cleaned = train_df.dropna(subset=['label'])
train_labels = train_df_cleaned['label'].values

unique_labels = np.unique(train_labels)
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=unique_labels,
    y=train_labels
)

# Convert to tensor
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32)
if torch.cuda.is_available():
    class_weights_tensor = class_weights_tensor.cuda()

print(f"Class weights: {dict(zip(unique_labels, class_weights))}")
# compute_class_weight returns the weights in the order of unique_labels, which np.unique sorts
# alphabetically: index 0 is 'Abusive' and index 4 is 'Sexism'.
# Integer label ids passed to the loss must follow this same order.
print(f"Class weights tensor: {class_weights_tensor}")

Class weights: {'Abusive': np.float64(0.38195323916220164), 'Political Hate': np.float64(0.7411625708884688), 'Profane': np.float64(1.3262579281183933), 'Religious Hate': np.float64(4.3443213296398895), 'Sexism': np.float64(20.635526315789473)}
Class weights tensor: tensor([ 0.3820,  0.7412,  1.3263,  4.3443, 20.6355], device='cuda:0')
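
The `balanced` weights printed above follow the formula `n_samples / (n_classes * count)`. As a check, the same numbers can be reproduced in plain Python from the class counts reported in section 3:

```python
# Class counts copied from the training-data output in section 3.
counts = {
    "Abusive": 8212,
    "Political Hate": 4232,
    "Profane": 2365,
    "Religious Hate": 722,
    "Sexism": 152,
}

n_samples = sum(counts.values())   # labeled rows only
n_classes = len(counts)

# Balanced weighting: rarer classes receive proportionally larger weights.
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label}: {weight:.4f}")
```

The rarest class, Sexism, ends up weighted roughly 54 times more heavily than Abusive, which is worth keeping in mind when the weights are combined with focal loss.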


## 7. Define Metrics and Custom Trainer

In [11]:
# Define metrics computation
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(labels, predictions)
    f1_macro = f1_score(labels, predictions, average='macro')
    f1_weighted = f1_score(labels, predictions, average='weighted')
    precision_macro = precision_score(labels, predictions, average='macro')
    recall_macro = recall_score(labels, predictions, average='macro')

    return {
        'accuracy': accuracy,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro
    }

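# Custom Trainer with class-weighted focal loss
class FocalTrainer(Trainer):
    def __init__(self, class_weights=None, alpha=1.0, gamma=2.0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights
        self.alpha = alpha
        self.gamma = gamma

    # **kwargs absorbs extra arguments (e.g. num_items_in_batch) that newer
    # transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        # Compute the loss in float32 even though the model was loaded in float16
        logits = outputs.get('logits').float()

        # Focal loss: pt is the predicted probability of the true class, taken from
        # the unweighted cross-entropy so the class weights do not distort it
        ce_loss = F.cross_entropy(logits, labels, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        if self.class_weights is not None:
            # Scale each example by the weight of its true class
            focal_loss = focal_loss * self.class_weights[labels]
        loss = focal_loss.mean()

        return (loss, outputs) if return_outputs else loss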

print("Custom FocalTrainer with weighted loss defined")

Custom FocalTrainer with weighted loss defined
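
With `gamma=2.0`, the factor `(1 - pt)**gamma` shrinks the loss of examples the model already classifies confidently, so hard examples dominate the gradient. A small numeric sketch of that effect in plain Python, for a single example and without class weights (`focal_loss` here is an illustrative helper, not the trainer's method):

```python
import math

def focal_loss(p_true, alpha=1.0, gamma=2.0):
    """Focal loss for one example, given the predicted probability of its true class."""
    ce = -math.log(p_true)                       # ordinary cross-entropy
    return alpha * (1.0 - p_true) ** gamma * ce  # down-weighted by (1 - p)^gamma

# Compare a confident, an uncertain, and a badly wrong prediction.
for p in (0.9, 0.5, 0.1):
    ce = -math.log(p)
    print(f"p={p:.1f}  cross-entropy={ce:.4f}  focal={focal_loss(p):.4f}")
```

At `p=0.9` the cross-entropy is multiplied by 0.01, while at `p=0.1` it is multiplied by 0.81. Setting `gamma=0` recovers plain cross-entropy.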


## 8. Training Configuration - OPTIMAL HYPERPARAMETERS

In [21]:
# OPTIMAL HYPERPARAMETERS FROM MEMORY - PROVEN TO WORK
training_args = TrainingArguments(
    output_dir="./results_optimized",

    # OPTIMAL SETTINGS FROM SUCCESSFUL RUNS
    learning_rate=2e-5,  # Optimal: 2e-5 (not 3e-5)
    num_train_epochs=6,  # Optimal: 5-7 epochs
    warmup_ratio=0.06,  # Optimal: 0.06 (not 0.0)
    lr_scheduler_type="cosine",  # Optimal: "cosine" (not "linear")
    weight_decay=0.01,  # Optimal: 0.01 (not 0.0)
    label_smoothing_factor=0.05,  # For regularization; has no effect with FocalTrainer, whose compute_loss override bypasses the built-in label smoother

    # Batch size and gradient accumulation
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,

    # Evaluation and saving
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,
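    load_best_model_at_end=True,  # Reload the best checkpoint after training; older transformers versions require this for EarlyStoppingCallback (eval and save strategies match)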

    metric_for_best_model="accuracy",
    greater_is_better=True,

    # Memory optimization
    fp16=False,  # Mixed precision stays off: the weights are already float16 (section 4), and the AMP grad scaler cannot unscale float16 gradients
    # gradient_checkpointing=True, # Disabled to resolve conflict
    dataloader_pin_memory=False,
    remove_unused_columns=True,

    # Optimizer
    optim="adamw_torch",

    # Logging
    logging_dir="./logs_optimized",
    logging_steps=100,
    report_to="none",  # Disable all logging integrations (None falls back to the default set)

    # Reproducibility
    seed=42,
    data_seed=42,
)

print("Training arguments configured with OPTIMAL hyperparameters:")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Warmup ratio: {training_args.warmup_ratio}")
print(f"LR scheduler: {training_args.lr_scheduler_type}")
print(f"Weight decay: {training_args.weight_decay}")
print(f"Label smoothing: {training_args.label_smoothing_factor}")
print(f"Max sequence length: {max_seq_length}")

TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'
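
The batch and schedule settings above imply a concrete step budget. The arithmetic below is a sketch that assumes a single GPU and that all 35,637 training rows are used (fewer steps result if unlabeled rows are dropped). The warmup-then-cosine formula mirrors the usual shape of this schedule, and `lr_at` is an illustrative helper rather than a library function:

```python
import math

n_train = 35637            # training rows reported in section 3
per_device_batch = 16
grad_accum = 2
epochs = 6
warmup_ratio = 0.06
peak_lr = 2e-5

effective_batch = per_device_batch * grad_accum
batches_per_epoch = math.ceil(n_train / per_device_batch)
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # optimizer updates per epoch
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step}: lr = {lr_at(step):.2e}")
```

Under these assumptions, `eval_steps=500` allows 13 evaluations over the run, and an early-stopping patience of 3 ends training after 1,500 steps without an accuracy improvement.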

## 9. Initialize Trainer and Start Training

In [22]:
# Initialize trainer with optimizations
trainer = FocalTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    class_weights=class_weights_tensor,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

print("FocalTrainer initialized with:")
print(f"- Focal loss for better class imbalance handling")
print(f"- Weighted loss with class weights")
print(f"- Early stopping (patience=3)")
print(f"- Optimal hyperparameters from previous successful runs")
print(f"- Target accuracy: 78%")

# Show memory usage before training
if torch.cuda.is_available():
    print(f"\nGPU Memory before training:")
    print(f"Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

NameError: name 'training_args' is not defined

In [None]:
# Start training
print("Starting training with OPTIMAL configuration...")
print("Expected improvements:")
print("- Better convergence with cosine scheduler")
print("- Reduced overfitting with proper regularization")
print("- Improved class balance handling with Focal Loss")
print("- Target: 78% accuracy")
print("\n" + "="*50)

trainer.train()

print("\n" + "="*50)
print("Training completed!")

## 10. Evaluation and Results

In [None]:
# Evaluate on validation set
print("Evaluating on validation set...")
eval_results = trainer.evaluate()

print("\nValidation Results:")
print(f"Accuracy: {eval_results['eval_accuracy']:.4f} ({eval_results['eval_accuracy']*100:.2f}%)")
print(f"F1-macro: {eval_results['eval_f1_macro']:.4f}")
print(f"F1-weighted: {eval_results['eval_f1_weighted']:.4f}")
print(f"Precision-macro: {eval_results['eval_precision_macro']:.4f}")
print(f"Recall-macro: {eval_results['eval_recall_macro']:.4f}")

# Check if target accuracy achieved
target_accuracy = 0.78
achieved_accuracy = eval_results['eval_accuracy']

if achieved_accuracy >= target_accuracy:
    print(f"\n🎉 SUCCESS! Target accuracy of {target_accuracy*100:.1f}% achieved!")
    print(f"Achieved: {achieved_accuracy*100:.2f}%")
else:
    print(f"\n📈 Progress: {achieved_accuracy*100:.2f}% (Target: {target_accuracy*100:.1f}%)")
    gap = target_accuracy - achieved_accuracy
    print(f"Gap to target: {gap*100:.2f} percentage points")

In [None]:
# Evaluate on test set
print("Evaluating on test set...")
test_results = trainer.evaluate(eval_dataset=test_dataset)

print("\nTest Results:")
print(f"Accuracy: {test_results['eval_accuracy']:.4f} ({test_results['eval_accuracy']*100:.2f}%)")
print(f"F1-macro: {test_results['eval_f1_macro']:.4f}")
print(f"F1-weighted: {test_results['eval_f1_weighted']:.4f}")
print(f"Precision-macro: {test_results['eval_precision_macro']:.4f}")
print(f"Recall-macro: {test_results['eval_recall_macro']:.4f}")
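
Because the classes are heavily imbalanced, accuracy and macro-F1 can tell very different stories, which is why both are reported. A self-contained sketch with hypothetical labels (8 majority-class examples, 2 minority-class examples, and a classifier that always predicts the majority class) illustrates the gap. The helper `f1_for_class` is illustrative and returns 0 when a class has no true positives:

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical data: class 0 is the majority, class 1 the minority.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10          # always predict the majority class

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
per_class_f1 = [f1_for_class(y_true, y_pred, cls) for cls in (0, 1)]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print(f"Accuracy: {accuracy:.4f}")
print(f"Macro-F1: {macro_f1:.4f}")
```

Accuracy looks respectable here even though the minority class is never predicted, while macro-F1 drops sharply because each class counts equally.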

## 11. Save Model

In [None]:
# Save the fine-tuned model
model_save_path = "./bangla_hate_speech_optimized"
trainer.save_model(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"Model saved to: {model_save_path}")

# Save training results
results_summary = {
    'model_name': model_name,
    'optimization': 'Focal Loss + Optimal Hyperparameters',
    'max_seq_length': max_seq_length,
    'learning_rate': training_args.learning_rate,
    'num_epochs': training_args.num_train_epochs,
    'warmup_ratio': training_args.warmup_ratio,
    'lr_scheduler': training_args.lr_scheduler_type,
    'weight_decay': training_args.weight_decay,
    'label_smoothing': training_args.label_smoothing_factor,
    'validation_accuracy': eval_results['eval_accuracy'],
    'validation_f1_macro': eval_results['eval_f1_macro'],
    'test_accuracy': test_results['eval_accuracy'],
    'test_f1_macro': test_results['eval_f1_macro'],
    'target_achieved': achieved_accuracy >= target_accuracy
}

import json
with open(f"{model_save_path}/training_results.json", 'w') as f:
    json.dump(results_summary, f, indent=2)

print("Training results saved to training_results.json")

## 12. Training Summary

In [None]:
# Show final memory usage
if torch.cuda.is_available():
    print("Final GPU Memory Usage:")
    print(f"Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")
    print(f"Max allocated during training: {torch.cuda.max_memory_allocated()/1024**3:.2f} GB")

print("\n" + "="*60)
print("TRAINING SUMMARY")
print("="*60)
print(f"Model: {model_name}")
print(f"Optimization: Focal Loss + Optimal Hyperparameters")
print(f"Max Sequence Length: {max_seq_length}")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Scheduler: {training_args.lr_scheduler_type}")
print(f"Warmup Ratio: {training_args.warmup_ratio}")
print(f"Weight Decay: {training_args.weight_decay}")
print(f"Label Smoothing: {training_args.label_smoothing_factor}")
print(f"\nValidation Accuracy: {eval_results['eval_accuracy']*100:.2f}%")
print(f"Test Accuracy: {test_results['eval_accuracy']*100:.2f}%")
print(f"Target (78%): {'✅ ACHIEVED' if achieved_accuracy >= target_accuracy else '❌ NOT YET'}")
print("="*60)