# WebSafety Custom Model Training on Kaggle

**Novel Multi-Indic Language Web Safety Dataset with Hinglish & Tenglish**

This notebook trains a custom model on the WebSafety dataset with:
- 7 primary categories (Safe, Phishing, Malware, Hate Speech, Cyberbullying, Sexual Content, Violence)
- Multi-lingual support (English + Hinglish + Tenglish)
- Rich contextual metadata

## Setup Instructions:

1. **Upload your dataset files** to Kaggle:
   - Upload `train.jsonl`, `validation.jsonl`, `test.jsonl` as a Kaggle Dataset
   - Or add them as notebook input

2. **Enable GPU**:
   - Go to Settings → Accelerator → GPU T4 x2 (FREE!)

3. **Run all cells**

In [None]:
# Install required packages
!pip install transformers datasets torch scikit-learn -q

In [None]:
import json
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import numpy as np

print("✓ Imports successful")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Load Dataset

Update the paths below to match the location of your uploaded files.

In [None]:
# UPDATE THESE PATHS!
TRAIN_FILE = "/kaggle/input/websafety-dataset/train.jsonl"
VAL_FILE = "/kaggle/input/websafety-dataset/validation.jsonl"
TEST_FILE = "/kaggle/input/websafety-dataset/test.jsonl"

# Or if files are in working directory:
# TRAIN_FILE = "train.jsonl"
# VAL_FILE = "validation.jsonl"
# TEST_FILE = "test.jsonl"

In [None]:
# Label mapping
LABEL2ID = {
    'safe': 0,
    'phishing': 1,
    'malware': 2,
    'hate_speech': 3,
    'cyberbullying': 4,
    'sexual_content': 5,
    'violence': 6
}
ID2LABEL = {v: k for k, v in LABEL2ID.items()}

print("Label mapping:")
for label, idx in LABEL2ID.items():
    print(f"  {idx}: {label}")
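Before training, it can help to check that every record has the fields the Dataset class reads (`text` and `primary_label`) and that each label appears in the mapping. The helper below is a minimal sketch using only the standard library; the name `find_bad_records` and its return format are illustrative assumptions, not part of the dataset tooling.

```python
import json


def find_bad_records(path, valid_labels, text_key="text", label_key="primary_label"):
    """Return (line_number, reason) pairs for records that would break loading.

    Hypothetical helper: the field names mirror those read by the Dataset class.
    """
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_no, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_no, "invalid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "not a JSON object"))
                continue
            label = record.get(label_key)
            if not isinstance(record.get(text_key), str):
                problems.append((line_no, f"missing or non-string '{text_key}'"))
            elif not isinstance(label, str) or label not in valid_labels:
                problems.append((line_no, f"unknown label: {label!r}"))
    return problems
```

A call such as `find_bad_records(TRAIN_FILE, LABEL2ID)` should return an empty list for a clean file; any entries point to the lines to fix before training.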

In [None]:
class WebSafetyDataset(Dataset):
    """Custom Dataset for WebSafety"""
    
    def __init__(self, filepath, tokenizer, max_length=512):
        self.samples = []
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                if line.strip():  # skip blank lines so a stray empty line does not break json.loads
                    self.samples.append(json.loads(line))
        
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        sample = self.samples[idx]
        text = sample['text']
        label = LABEL2ID[sample['primary_label']]
        
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

In [None]:
# Load tokenizer and model
MODEL_NAME = "distilbert-base-uncased"  # Small, fast baseline; pretrained on English text only
# For better Hinglish/Tenglish coverage, consider: "bert-base-multilingual-cased"

print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABEL2ID),  # 7 classes; derived from the mapping so the two stay in sync
    id2label=ID2LABEL,
    label2id=LABEL2ID
)
print("✓ Model loaded")

In [None]:
# Create datasets
print("Loading datasets...")
train_dataset = WebSafetyDataset(TRAIN_FILE, tokenizer)
val_dataset = WebSafetyDataset(VAL_FILE, tokenizer)
test_dataset = WebSafetyDataset(TEST_FILE, tokenizer)

print(f"✓ Train: {len(train_dataset)} samples")
print(f"✓ Validation: {len(val_dataset)} samples")
print(f"✓ Test: {len(test_dataset)} samples")

In [None]:
def compute_metrics(pred):
    """Compute evaluation metrics"""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
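The metrics above use `average='weighted'`, which weights each class by its support. On an imbalanced dataset this can look healthy even when a minority class is never predicted. The sketch below computes macro and weighted F1 by hand on toy labels to show the gap; the toy data and function names are illustrative assumptions.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """F1 per class; a class with no true or predicted samples scores 0."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred, labels):
    f1 = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(labels)  # every class counts equally
    weighted = sum(f1[c] * support[c] for c in labels) / len(y_true)  # weighted by class size
    return macro, weighted


# Toy case: 8 "safe" (0) and 2 "phishing" (1) samples; the model predicts "safe" for all.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred, labels=[0, 1])
print(f"macro={macro:.3f} weighted={weighted:.3f}")  # macro=0.444 weighted=0.711
```

If minority classes matter for your use case, reporting macro F1 next to the weighted score gives a fuller picture.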

## Training Configuration

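Adjust these parameters to fit your data and hardware:
- `num_train_epochs`: More epochs give the model more passes over the data, which often helps up to a point but takes longer and can overfit; watch the validation F1.
- `per_device_train_batch_size`: Larger batches usually train faster but need more GPU memory.
- `learning_rate`: Lower values train more stably; higher values converge faster but can diverge.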
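To judge whether a fixed warmup (for example 100 steps) suits your run, estimate the total number of optimizer steps first. The sketch below is an approximation: the dataset size of 10,000 is a made-up example, and the exact step count reported by the Trainer may differ slightly depending on the library version.

```python
import math


def estimate_optimizer_steps(num_samples, per_device_batch_size, num_epochs,
                             num_devices=1, grad_accum_steps=1):
    """Rough count of optimizer updates over the whole training run."""
    # With several GPUs, each step consumes one per-device batch on every device.
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * num_epochs


# Hypothetical dataset of 10,000 samples, batch size 16 per device, 2 GPUs, 3 epochs.
total = estimate_optimizer_steps(10_000, 16, 3, num_devices=2)
warmup_fraction = 100 / total
print(total, round(warmup_fraction, 3))  # 939 0.106
```

If the warmup turns out to be a large share of a short run, lowering `warmup_steps` is a reasonable adjustment.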

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./websafety-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    save_total_limit=2,
    report_to="none",  # Disable wandb
    fp16=torch.cuda.is_available(),  # Mixed precision for speed; requires a CUDA GPU
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

print("✓ Trainer configured")

## Start Training!

Training typically takes 5-15 minutes, depending on the GPU and the size of your dataset.

In [None]:
print("=" * 60)
print("STARTING TRAINING")
print("=" * 60)

trainer.train()

print("\n" + "="*60)
print("TRAINING COMPLETE!")
print("="*60)

## Evaluation

In [None]:
# Evaluate on test set
print("Evaluating on test set...")
results = trainer.evaluate(test_dataset)

print("\n" + "="*60)
print("TEST SET RESULTS")
print("="*60)
for key, value in results.items():
    print(f"{key}: {value:.4f}")
print("="*60)

In [None]:
# Generate predictions for detailed analysis
predictions = trainer.predict(test_dataset)
pred_labels = predictions.predictions.argmax(-1)
true_labels = predictions.label_ids

# Classification report
print("\nDetailed Classification Report:")
print("="*60)
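label_ids = sorted(ID2LABEL)
print(classification_report(
    true_labels,
    pred_labels,
    labels=label_ids,  # pin all classes so the report works even if one is absent from the test set
    target_names=[ID2LABEL[i] for i in label_ids],
    zero_division=0
))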
print("="*60)

## Save Model

Save your trained model for later use

In [None]:
# Save model and tokenizer
output_dir = "./websafety-final-model"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Save label mapping
with open(f"{output_dir}/label_mapping.json", 'w') as f:
    json.dump({'label2id': LABEL2ID, 'id2label': ID2LABEL}, f, indent=2)

print(f"✓ Model saved to {output_dir}")
print("✓ Download the folder to use in your application!")

## Test with Custom Examples

In [None]:
def predict_text(text):
    """Predict label for custom text"""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    model.eval()  # make sure dropout is disabled at inference time
    with torch.no_grad():
        outputs = model(**inputs)
    
    probs = torch.softmax(outputs.logits, dim=-1)
    pred_label = probs.argmax(-1).item()
    confidence = probs.max().item()
    
    return ID2LABEL[pred_label], confidence

# Test examples
test_examples = [
    "This is a great movie, everyone should watch it!",
    "Click here to claim your prize: http://fake-site.tk",
    "You're such a loser, nobody likes you",
    "Yaar, ye website safe hai kya?",  # Hinglish
    "Abbai, ee link click cheyakandi",  # Tenglish
]

print("\nTesting with custom examples:")
print("="*60)
for text in test_examples:
    label, conf = predict_text(text)
    preview = text if len(text) <= 50 else text[:50] + "..."  # add an ellipsis only when truncated
    print(f"Text: {preview}")
    print(f"Prediction: {label} (confidence: {conf:.3f})")
    print("-"*60)
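A single label with its softmax probability can hide close calls between classes. One option is to look at the top few classes and treat low-confidence predictions as undecided. The sketch below works on a plain list of logits; the `0.5` threshold, the `"uncertain"` fallback, and the function names are assumptions for illustration, not part of the trained model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=2, min_confidence=0.5):
    """Return (decision, [(label, prob), ...]) for the k most likely classes."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]
    top = [(id2label[i], p) for i, p in ranked]
    # Fall back to "uncertain" when even the best class is below the threshold.
    decision = top[0][0] if top[0][1] >= min_confidence else "uncertain"
    return decision, top


# Hypothetical three-class mapping and logits, for illustration only.
demo_labels = {0: "safe", 1: "phishing", 2: "malware"}
print(top_k_labels([2.0, 0.0, 0.0], demo_labels)[0])  # safe
print(top_k_labels([0.1, 0.0, 0.0], demo_labels)[0])  # uncertain
```

With the trained model, the logits could come from `outputs.logits[0].tolist()` and the mapping from `ID2LABEL`.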

## 🎉 SUCCESS!

Your model is now trained! 

**Next steps:**
1. Download the `websafety-final-model` folder
2. Use it in your WebSafety application
3. Document the results for your research paper

**For your paper, report:**
- Test accuracy, F1, precision, recall
- Per-class performance
- Training time and resources
- Model architecture (DistilBERT fine-tuned)
- Dataset size and distribution
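
For the dataset size and distribution figures, the label counts can be read straight from the JSONL files. This is a minimal sketch that assumes each record carries a `primary_label` field, as the Dataset class does; the function name is illustrative.

```python
import json
from collections import Counter


def label_distribution(path, label_key="primary_label"):
    """Map each label to (count, share of all records), most frequent first."""
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # ignore blank lines
                counts[json.loads(line)[label_key]] += 1
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}
```

Running `label_distribution(TRAIN_FILE)` for each split gives the per-class counts and proportions to report.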