<a href="https://colab.research.google.com/github/kdadobe/toxicity_identification_exercise/blob/main/Jigsaw_Toxic_Identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Installing Dependencies



In [None]:
!pip install transformers datasets evaluate --quiet
!pip install -q kaggle


# Imports

In [None]:
import os
import zipfile
import torch
import numpy as np
from google.colab import files
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from transformers import DistilBertForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer


# Uploading kaggle.json to set up the Kaggle access token

In [None]:
files.upload()  # Upload the kaggle.json file here

# Move kaggle.json to the location the Kaggle CLI expects and restrict its permissions
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


# 1. Dataset Selection

## Public Dataset
### Downloading the dataset from Kaggle

In [None]:
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge

Downloading jigsaw-toxic-comment-classification-challenge.zip to /content
  0% 0.00/52.6M [00:00<?, ?B/s]
100% 52.6M/52.6M [00:00<00:00, 887MB/s]


### Unzip the dataset file

In [None]:
!unzip jigsaw-toxic-comment-classification-challenge.zip -d jigsaw_data

Archive:  jigsaw-toxic-comment-classification-challenge.zip
  inflating: jigsaw_data/sample_submission.csv.zip  
  inflating: jigsaw_data/test.csv.zip  
  inflating: jigsaw_data/test_labels.csv.zip  
  inflating: jigsaw_data/train.csv.zip  


### Unzipping the csv.zip files for the dataset

In [None]:


# Unzip all CSV files
with zipfile.ZipFile("/content/jigsaw_data/train.csv.zip", 'r') as zip_ref:
    zip_ref.extractall()  # This will extract train.csv

with zipfile.ZipFile("/content/jigsaw_data/test.csv.zip", 'r') as zip_ref:
    zip_ref.extractall()

with zipfile.ZipFile("/content/jigsaw_data/test_labels.csv.zip", 'r') as zip_ref:
    zip_ref.extractall()


### Loading train.csv

Note: the label columns in train.csv are already binary (0 or 1).

In [None]:

df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


## Task Framing
### Data clean up

If any label column has a value other than 0 or 1, we discard the row. The six label columns are then merged into a single binary `label` column by taking the maximum across them: if any column has the value 1, the comment text is considered toxic.

In [None]:
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

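def clean_and_merge_labels(df, label_columns, text_col="comment_text"):
  # Keep only rows where every label column is 0 or 1
  for col in label_columns:
    df = df[df[col].isin([0, 1])]
  # Copy so the new column is not assigned to a slice of the filtered frame
  df = df.copy()
  # A comment is toxic (label 1) if any of the label columns is 1
  df['label'] = df[label_columns].max(axis=1)
  df = df.rename(columns={text_col: 'text'})
  return df[["text", "label"]].reset_index(drop=True)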



In [None]:

df = clean_and_merge_labels(df, label_columns, text_col="comment_text")


In [None]:
df.head()

Unnamed: 0,text,label
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
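
To make the merge rule concrete, here is a minimal sketch that applies the same filter-then-max logic to three invented rows. The toy comments and the -1 values are assumptions made for illustration (the -1 row stands in for any out-of-range label); they are not rows from train.csv.

```python
import pandas as pd

label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Invented rows: one clean comment, one toxic comment, one row with invalid labels.
toy = pd.DataFrame({
    'comment_text': ['fine comment', 'rude comment', 'unscored comment'],
    'toxic':         [0, 1, -1],
    'severe_toxic':  [0, 0, -1],
    'obscene':       [0, 1, -1],
    'threat':        [0, 0, -1],
    'insult':        [0, 0, -1],
    'identity_hate': [0, 0, -1],
})

# Keep only rows where every label column is 0 or 1.
valid_rows = toy[label_columns].isin([0, 1]).all(axis=1)
toy_clean = toy[valid_rows].copy()

# A comment is toxic (1) if any of the six labels is 1.
toy_clean['label'] = toy_clean[label_columns].max(axis=1)
toy_result = (
    toy_clean.rename(columns={'comment_text': 'text'})[['text', 'label']]
    .reset_index(drop=True)
)
print(toy_result)
```

The row with -1 values is dropped, and the remaining two rows get labels 0 and 1 respectively.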


### Train/validation split (80/20)

In [None]:


train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
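
Because toxic comments are the minority class, `stratify` matters here: it keeps the share of toxic labels the same in both splits. The sketch below illustrates this on an invented label list (10 positives out of 100); the toy data is an assumption and is not taken from the dataset.

```python
from sklearn.model_selection import train_test_split

# Invented data: 100 items, 10% of them labelled 1.
toy_texts = [f"comment {i}" for i in range(100)]
toy_labels = [1] * 10 + [0] * 90

x_train, x_val, y_train, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# With stratification, both splits keep the 10% positive rate.
print(sum(y_train) / len(y_train))
print(sum(y_val) / len(y_val))
```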


### Tokenizing Text Data for Transformer Models

In [None]:


tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

train_encodings = tokenizer(
    list(train_texts), truncation=True, padding=True, max_length=256
)
val_encodings = tokenizer(
    list(val_texts), truncation=True, padding=True, max_length=256
)





### Creating a Custom PyTorch Dataset for Toxic Comment Classification

In [None]:

class ToxicCommentsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Wrap the tokenized inputs and labels in the custom Dataset class defined above
# so they can be fed into a transformer model like DistilBERT.
train_dataset = ToxicCommentsDataset(train_encodings, list(train_labels))
val_dataset = ToxicCommentsDataset(val_encodings, list(val_labels))


# 2. Model Selection

In [None]:
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Configuring Training Arguments


In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    num_train_epochs=2, # number of training epochs
    weight_decay=0.01,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=1,
    report_to=[], # disable external logging integrations such as Weights & Biases
)

### Compute Metrics

In [None]:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
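
As a worked example of what `compute_metrics` reports, the sketch below computes the same four metrics by hand from confusion counts. The logits and labels are invented for illustration, and the computation uses plain NumPy rather than the scikit-learn helpers.

```python
import numpy as np

# Invented logits for five comments: column 0 = safe, column 1 = toxic.
logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.2, 0.4]])
labels = np.array([0, 1, 1, 1, 0])

# Same decision rule as compute_metrics: pick the class with the larger logit.
preds = np.argmax(logits, axis=1)

tp = int(np.sum((preds == 1) & (labels == 1)))  # toxic, predicted toxic
fp = int(np.sum((preds == 1) & (labels == 0)))  # safe, predicted toxic
fn = int(np.sum((preds == 0) & (labels == 1)))  # toxic, predicted safe

accuracy = float(np.mean(preds == labels))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With `average='binary'`, precision, recall and F1 refer to the positive class only (label 1, toxic), which is why true negatives do not appear in those three formulas.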


### Defining the trainer using the training arguments, train dataset, validation dataset and compute metrics

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)


## Training The model

In [None]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.1457,0.163964,0.963622,0.862057,0.764561,0.810387
2,0.2132,0.125177,0.966536,0.864169,0.795994,0.828681


TrainOutput(global_step=31914, training_loss=0.11983835054703322, metrics={'train_runtime': 6239.086, 'train_samples_per_second': 40.921, 'train_steps_per_second': 5.115, 'total_flos': 1.6910258242830336e+16, 'train_loss': 0.11983835054703322, 'epoch': 2.0})
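
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation (neither is overridden in the `TrainingArguments` above), so one optimizer step corresponds to one batch of 8 examples; the variable names are illustrative.

```python
# Values copied from the TrainOutput above and from TrainingArguments.
global_step = 31914
num_train_epochs = 2
per_device_train_batch_size = 8
train_runtime = 6239.086  # seconds

# Steps in one pass over the training split.
steps_per_epoch = global_step // num_train_epochs

# Upper bound on the number of training examples (the last batch may be partial).
max_train_examples = steps_per_epoch * per_device_train_batch_size

# Should agree with the reported train_steps_per_second.
steps_per_second = global_step / train_runtime

print(steps_per_epoch, max_train_examples, round(steps_per_second, 3))
```

Under these assumptions the training split holds at most 127,656 comments, and the computed throughput matches the reported 5.115 steps per second.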

## Evaluating the model on the validation set

In [None]:
trainer.evaluate()


{'eval_loss': 0.12517701089382172,
 'eval_accuracy': 0.9665361115462948,
 'eval_precision': 0.8641686182669789,
 'eval_recall': 0.7959938366718028,
 'eval_f1': 0.8286814244465832,
 'eval_runtime': 202.2611,
 'eval_samples_per_second': 157.791,
 'eval_steps_per_second': 9.863,
 'epoch': 2.0}

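## Evaluating the model on the test set provided with the Jigsaw toxic comment dataset

Load the test data, join it with its labels on `id`, discard rows whose label columns contain values other than 0 or 1, and merge the labels into a single binary label. In `test_labels.csv`, rows that were not used for scoring are marked with -1, so this filter drops them.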

In [None]:
# Load test data
test_df = pd.read_csv("test.csv")
test_labels_df = pd.read_csv("test_labels.csv")

# Merge test text with labels
merged_test = test_df.merge(test_labels_df, on="id")

merged_test = clean_and_merge_labels(merged_test, label_columns, text_col="comment_text")

In [None]:
# Note: max_length=128 here is shorter than the 256 used for the training and validation encodings
test_encodings = tokenizer(
    list(merged_test['text']), truncation=True, padding=True, max_length=128
)

test_dataset = ToxicCommentsDataset(test_encodings, list(merged_test['label']))
trainer.evaluate(eval_dataset=test_dataset)

{'eval_loss': 0.3312896490097046,
 'eval_accuracy': 0.9181281065366219,
 'eval_precision': 0.5499850790808714,
 'eval_recall': 0.8856319077366651,
 'eval_f1': 0.6785714285714286,
 'eval_runtime': 213.3948,
 'eval_samples_per_second': 299.81,
 'eval_steps_per_second': 18.74,
 'epoch': 2.0}
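
The reported F1 values can be verified from precision and recall, since F1 is their harmonic mean. The sketch below does this for both evaluation outputs in this notebook; `f1_from` is a hypothetical helper name introduced for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the validation and test evaluation outputs.
val_f1 = f1_from(0.8641686182669789, 0.7959938366718028)
test_f1 = f1_from(0.5499850790808714, 0.8856319077366651)

print(f"validation F1: {val_f1:.4f}")
print(f"test F1:       {test_f1:.4f}")

# Share of comments flagged as toxic on the test set that were not actually toxic.
test_false_alarm_share = 1 - 0.5499850790808714
print(f"flagged but not toxic on test: {test_false_alarm_share:.1%}")
```

The test-set precision of about 0.55 means that roughly 45% of the comments flagged as toxic on that set were not labelled toxic, which is the main reason the test F1 is lower than the validation F1.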

### Testing with ad hoc sentences

In [None]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # make sure dropout is disabled for inference

def classify_prompt(text):
    # Tokenize and move to the same device as model
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.nn.functional.softmax(outputs.logits, dim=1)
        # Pick the most likely class for the single input sentence
        pred = torch.argmax(probs, dim=1).item()

    return "unsafe" if pred == 1 else "safe"


**Inference Example:** A few example prompts and the model’s classification of each as “safe” or “unsafe.”

In [None]:
print(classify_prompt("You're an idiot."))
print(classify_prompt("How do I reset my password?"))
print(classify_prompt("This code is not throwing error"))
print(classify_prompt("Why there is no email sent yet"))

unsafe
safe
safe
safe
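
For two classes, taking the argmax is the same as thresholding the "unsafe" probability at 0.5. If a different precision/recall trade-off is wanted, the threshold can be made configurable. The following is a minimal sketch operating on raw logits with NumPy; `label_from_logits` is a hypothetical helper, not part of the notebook, and the logits are invented.

```python
import numpy as np

def label_from_logits(logits, threshold=0.5):
    """Map [safe_logit, unsafe_logit] to a label using the unsafe probability."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return "unsafe" if probs[1] > threshold else "safe"

# Invented logits where the unsafe probability is roughly 0.67.
default_label = label_from_logits([0.2, 0.9])
strict_label = label_from_logits([0.2, 0.9], threshold=0.8)
print(default_label, strict_label)
```

Raising the threshold makes the classifier flag fewer prompts, which would typically trade recall for precision.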


Notes :

**Summary** :

Train Results -

| Metric              | Epoch 1  | Epoch 2  | 📈 Trend               |
| ------------------- | -------- | -------- | ---------------------- |
| **Training Loss**   | 0.1457   | 0.2132   | 🔺 Increased           |
| **Validation Loss** | 0.163964 | 0.125177 | 🔻 Decreased           |
| **Accuracy**        | 96.36%   | 96.65%   | 🔺 Slight improvement  |
| **Precision**       | 0.8621   | 0.8642   | ⬆️ Slight improvement  |
| **Recall**          | 0.7646   | 0.7960   | ⬆️ Notable improvement |
| **F1 Score**        | 0.8104   | 0.8287   | ⬆️ Improvement         |


Observations -
1. Validation loss decreased from 0.1640 to 0.1252, indicating improved generalization.
2. Accuracy and F1 score both improved in the second epoch.
3. Recall improved the most (0.7646 to 0.7960), which means the model catches more of the toxic comments in epoch 2.
4. Precision stayed stable, which means the quality of positive predictions is maintained even as recall improves.
5. The reported training loss is higher in epoch 2. With `logging_steps=10` this value most likely reflects only the latest logging window, so it is noisy; it could also suggest slight overfitting or variance, which is worth monitoring in further training.

On the validation split the model shows strong performance in classifying safe vs. unsafe prompts, with an F1 score of about 83% and accuracy of 96.65% by the second epoch, which makes it a reasonable candidate for integration into a prompt guardrail system.

Validation Results -

The metrics below are the output of `trainer.evaluate()` on the 20% validation split, not on the Jigsaw test set.

{'eval_loss': 0.12517701089382172,
 'eval_accuracy': 0.9665361115462948,
 'eval_precision': 0.8641686182669789,
 'eval_recall': 0.7959938366718028,
 'eval_f1': 0.8286814244465832,
 'eval_runtime': 202.2611,
 'eval_samples_per_second': 157.791,
 'eval_steps_per_second': 9.863,
 'epoch': 2.0}

Observations-
1. On the validation split the classifier performs strongly, with reasonably balanced precision (0.864) and recall (0.796) and an accuracy of 96.65%.
2. On the Jigsaw-provided test set (evaluated earlier in this notebook), accuracy is 91.81% and F1 is 0.679: recall rises to 0.886 but precision drops to 0.550, so the model produces more false positives on that data.
3. Evaluation runtime seems fine, but for a scalable application, batch size and GPU optimization will be required.

**Improvements:**

1. The model could be trained for more epochs; only 2 were run because of time constraints.
2. Hyperparameter tuning (learning rate, weight decay, batch size, etc.).
3. Multi-label classification could also be done, predicting each of the six label columns separately.
4. The classifier could be made multilingual, considering that different regions at Adobe use different languages.
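
To illustrate how the multi-label variant would differ at prediction time, the sketch below applies an independent sigmoid to each of the six label logits instead of a softmax over two classes. The logits are invented and the 0.5 threshold is an assumption; a real setup would also need the model to be trained with six outputs.

```python
import numpy as np

label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Invented logits for one comment, one value per label column.
multi_logits = np.array([[2.1, -3.0, 1.2, -4.0, 0.3, -2.5]])

# Each label gets its own probability; they do not need to sum to 1.
multi_probs = 1.0 / (1.0 + np.exp(-multi_logits))

# Report every label whose probability exceeds the (assumed) 0.5 threshold.
predicted_labels = [
    [name for name, p in zip(label_columns, row) if p > 0.5]
    for row in multi_probs
]

# The binary safe/unsafe decision used in this notebook can still be derived.
is_unsafe = [len(names) > 0 for names in predicted_labels]
print(predicted_labels, is_unsafe)
```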


**Potential Extensions**:

We can wrap the fine-tuned DistilBERT model in a lightweight API (e.g., using Flask, FastAPI, or Django REST Framework) and deploy it on Hugging Face or a similar cloud solution. This API will expose an endpoint (e.g., /identifysafe) that accepts a user prompt as input.
The API endpoint will handle:
1. Receiving the prompt.
2. Tokenizing the prompt using the same DistilBERT tokenizer used during training.
3. Passing the tokenized input to the loaded fine-tuned DistilBERT model.
4. Receiving the model's prediction (safe/unsafe).
5. Returning the prediction in a structured format (e.g., JSON).

For ease of use, the fine-tuned model can also be uploaded to the Hugging Face Hub, from where it can be loaded whenever required.
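
The request flow listed above can be sketched independently of any web framework. In the following, `handle_identifysafe` and `fake_classify` are hypothetical names introduced for illustration; in a real service the classifier argument would be the notebook's `classify_prompt`, and the function would be called from a Flask or FastAPI route.

```python
import json
from typing import Callable

def handle_identifysafe(request_body: str, classify: Callable[[str], str]) -> str:
    """Hypothetical handler for /identifysafe: JSON request in, JSON response out."""
    try:
        payload = json.loads(request_body)
    except json.JSONDecodeError:
        return json.dumps({"error": "body must be valid JSON"})

    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return json.dumps({"error": "field 'prompt' must be a non-empty string"})

    # Tokenization and model inference happen inside the injected classifier.
    return json.dumps({"prompt": prompt, "label": classify(prompt)})

# Stand-in classifier for illustration only; it is not the fine-tuned model.
def fake_classify(text: str) -> str:
    return "unsafe" if "idiot" in text.lower() else "safe"

response = handle_identifysafe('{"prompt": "How do I reset my password?"}', fake_classify)
print(response)
```

Passing the classifier in as an argument keeps the handler testable without loading the model.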

**Challenges Faced**:
1. Training and fine-tuning the model took a long time.
2. Identifying the correct dataset and analysing it for preprocessing (considering the dataset has about 1.5 lakh, i.e. roughly 150,000, rows).
3. Identifying suitable training arguments to run the training.



