# Hey! That Wasn't Nice!

by Janet Ahn and Jooyoung (Julia) Lee

## Motivation

Our project focuses on improving interpretability and fairness in toxic speech classification. Existing hate speech and toxicity classifiers, including those in the Jigsaw Toxic Comment Classification Challenge and HateXplain datasets, often produce accurate but opaque predictions. As a result, it becomes difficult to verify why a model labels a comment as toxic and whether those decisions rely on inappropriate cues such as identity terms (e.g., “gay”, “Muslim”, “Black”).

Recent work highlights the importance of explainability in this domain. In particular, the HateXplain dataset demonstrates that even high-performing hate speech models can fail to produce meaningful explanations and may exhibit unintended bias toward target communities [7]. HateXplain provides token-level human rationales alongside traditional classification labels, offering a benchmark for evaluating interpretability and bias in automated systems. This motivates our focus on evaluating and improving the transparency of our model using post-hoc interpretability methods such as LIME and SHAP.

We chose to improve the explainability and bias detection aspects of the task for three main reasons:

1. **Safety concerns**: Toxicity classifiers are increasingly deployed in online moderation systems, so misclassification has real-world consequences. A lack of transparency can amplify harm toward marginalized communities.

2. **Model reliability**: Deep learning models, especially BERT-based architectures, may overfit to misleading patterns. Exposing rationales helps diagnose when models rely on identity terms rather than genuinely harmful content.

3. **Dataset suitability**: HateXplain includes human-annotated rationales, making it a natural benchmark for measuring interpretability. Jigsaw provides large-scale, diverse training data that boosts classifier performance.

By combining the breadth of Jigsaw with the transparency features of HateXplain, our project aims to produce a toxic speech classifier that is not only accurate but also interpretable and less biased.


## Approach

Our approach integrates three key components: fine-tuned BERT classification, zero-shot LLM comparison, and post-hoc interpretability with bias/robustness evaluation.

We fine-tuned a pre-trained BERT model (bert-base-uncased) [1] on the Jigsaw Toxic Comment Classification dataset [5] to predict six toxicity categories: toxic, severe toxic, obscene, threat, insult, and identity hate. The model was trained for multi-label classification using binary cross-entropy loss [2], allowing it to simultaneously assign probabilities to each label.
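As a rough illustration of the multi-label objective described above, the sketch below computes binary cross-entropy directly from logits in plain Python. It is not our training code: the underscore label names follow the Jigsaw column naming, and the logits and targets are made-up values. The loss is averaged over every (example, label) pair, which we understand to match the default mean reduction of `torch.nn.BCEWithLogitsLoss`.

```python
import math

# Six Jigsaw labels (underscore spelling assumed from the dataset columns).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (example, label) pair."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# One hypothetical comment labeled toxic, obscene, and insult.
logits = [[2.0, -3.0, 1.5, -4.0, 0.5, -2.0]]
targets = [[1, 0, 1, 0, 1, 0]]

loss = bce_with_logits(logits, targets)
probabilities = [sigmoid(x) for x in logits[0]]
for label, p in zip(LABELS, probabilities):
    print(f"{label:>14}: {p:.3f}")
print(f"loss = {loss:.4f}")
```

Because each label gets its own sigmoid, the probabilities do not need to sum to one, which is what lets a single comment be flagged as, say, both obscene and insulting.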

To compare task-specific fine-tuning with a general-purpose large language model, we evaluated a zero-shot Qwen2.5-7B-Instruct model. The LLM was prompted with textual instructions to classify the same comments as BERT, and its textual responses were parsed into multi-label predictions using a simple rule-based method. This setup allowed us to measure how well a general-purpose LLM performs on multi-label toxicity classification without fine-tuning, in direct comparison to a task-specific BERT model.

To understand model reasoning and potential biases, we applied two post-hoc interpretability methods to the BERT predictions: LIME (Local Interpretable Model-Agnostic Explanations) [3,8] and SHAP (SHapley Additive exPlanations) [4,9]. LIME generates human-readable highlights by perturbing the input text and fitting a local interpretable surrogate model, while SHAP provides quantitative token-level contributions based on cooperative game theory. These methods enable us to identify which words influence toxicity predictions and whether the model over-relies on identity-related terms.
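To make the game-theoretic idea behind SHAP concrete, the sketch below computes exact Shapley values for a five-token sentence by averaging each token's marginal contribution over all token orderings. This is only feasible for very short inputs (the SHAP library relies on approximations), and `toy_toxicity_score` is a hypothetical hand-written scorer standing in for the classifier, not our BERT model.

```python
from itertools import permutations


def toy_toxicity_score(tokens):
    """Hypothetical stand-in for a classifier's toxic score."""
    present = set(tokens)
    score = 0.0
    if "hate" in present:
        score += 0.6
    if "you" in present:
        score += 0.1
    if {"hate", "you"} <= present:
        score += 0.2  # interaction: hostility aimed at a person
    return score


def exact_shapley(tokens, score_fn):
    """Average each token's marginal contribution over all orderings."""
    n = len(tokens)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        kept = set()
        previous = score_fn([])  # score of the empty input
        for idx in order:
            kept.add(idx)
            current = score_fn([tokens[i] for i in sorted(kept)])
            totals[idx] += current - previous
            previous = current
    return [total / len(orderings) for total in totals]


tokens = ["i", "hate", "you", "so", "much"]
values = exact_shapley(tokens, toy_toxicity_score)
for token, value in zip(tokens, values):
    print(f"{token:>5}: {value:+.3f}")
```

The attributions sum to the difference between the full-sentence score and the empty-input score, and the interaction bonus is shared equally between the two tokens that jointly cause it.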

Finally, we conducted a bias and robustness evaluation. Model-generated rationales were compared against human-annotated rationales from HateXplain [6,7], and performance was examined on ambiguous or borderline-toxic examples. This combination of supervised modeling, zero-shot comparison, interpretability, and fairness assessment provides a comprehensive evaluation of predictive performance and transparency.


## Data

Our project used two primary datasets. The first is the Jigsaw Toxic Comment Classification Challenge dataset [5], which contains public comments labeled for six categories of toxicity: toxic, severe toxic, obscene, threat, insult, and identity hate. We used it as the primary training dataset for our BERT classifier and loaded it directly from HuggingFace Datasets. To reduce computation time, our preprocessing can optionally subsample 10–30% of the full dataset. We manually divided the original training split into 80% train, 10% validation, and 10% test; the original test split was not used.
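A minimal sketch of the subsample-then-split logic described above, working on row indices only. The function name, the fixed seed, and the choice to subsample before splitting are assumptions made for illustration; the actual logic lives in `preprocessing/jigsaw.py` and may differ.

```python
import random


def split_indices(n_rows, subset_frac=1.0, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle row indices, optionally subsample, then split 80/10/10."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Optional subsampling (e.g. subset_frac=0.1 keeps 10% of the rows).
    indices = indices[: int(n_rows * subset_frac)]

    n_train = int(len(indices) * train_frac)
    n_val = int(len(indices) * val_frac)
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        # Whatever remains after train and validation becomes the test set.
        "test": indices[n_train + n_val:],
    }


splits = split_indices(1000)
print({name: len(rows) for name, rows in splits.items()})

small = split_indices(1000, subset_frac=0.1)
print({name: len(rows) for name, rows in small.items()})
```

Fixing the seed keeps the held-out test rows identical across runs, which matters when the same test comments are later reused to compare two models.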

The second dataset is HateXplain [6], which consists of posts annotated by human annotators for labels (normal, offensive, hatespeech), rationales (token indices highlighted as contributing to the label), and target communities. HateXplain was used exclusively for interpretability evaluation, enabling us to compare model-generated explanations to human rationales.

**Note that HateXplain contains offensive and hateful content, as it is intended for toxicity research.**

No additional external datasets were used in this project.


## Code

All core scripts in our project were written by our group, including dataset preprocessing (preprocessing/jigsaw.py and preprocessing/hatexplain.py), model training (models/train_bert.py), and evaluation (models/evaluate_bert.py). While no homework code was directly reused, we referenced prior homework implementations for guidance on model structure, tokenizer handling, and training loops. For example, our approach to initializing a Transformer-based model with a classification head and managing training batches was inspired by the homework tagger code, which utilized AutoModel and AutoTokenizer from HuggingFace [1], PyTorch optimization routines [2], and subword token alignment strategies. These references informed our implementation of BERT fine-tuning for multi-label classification, but all code in this project was independently written and adapted to our datasets and multi-label task.

The BERT model was loaded and fine-tuned using the HuggingFace Transformers library [1], specifically the AutoModelForSequenceClassification and AutoTokenizer classes. Dataset handling and preprocessing were performed with HuggingFace Datasets [5,6], which allowed efficient loading, splitting, and PyTorch formatting of both the Jigsaw and HateXplain datasets. Model training and optimization were conducted using PyTorch [2], with DataLoader for batching, AdamW for gradient updates, and standard backpropagation. The tqdm library provided progress bars during training. For interpretability, we applied LIME [3,8] and SHAP [4,9], drawing on the original papers and library documentation, to generate token-level explanations for BERT predictions. These libraries allowed us to provide both human-readable highlights (LIME) and quantitative contribution scores (SHAP) in our analysis.


## Experimental Setup

Our experimental setup is designed to evaluate both the predictive performance and interpretability of toxicity classifiers, while directly comparing a fine-tuned BERT model with a zero-shot large language model (LLM). The goal is to assess not only how well each model performs the multi-label classification task, but also how transparent, reliable, and human-aligned their explanations are.

The BERT model was fine-tuned on the Jigsaw dataset using PyTorch [2] and the HuggingFace Transformers library [1]. Input comments were tokenized with AutoTokenizer, and classification was performed using AutoModelForSequenceClassification. The linear classification head outputs one logit per label; training used BCEWithLogitsLoss [2], and a sigmoid converts each logit into an independent per-label probability at prediction time. We trained BERT on the full training set. To reduce computation time, the direct comparison with the LLM used a 200-sample subset of the held-out test set.

For zero-shot evaluation, we used Qwen2.5-7B-Instruct. Each test comment was prompted with a simple instruction asking the model to list all applicable toxicity labels. Responses were parsed with a rule-based approach to produce multi-label predictions aligned with the six Jigsaw categories. This setup ensures a fair, controlled comparison between task-specific fine-tuning and zero-shot general-purpose LLM classification.
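The rule-based parsing step can be sketched roughly as follows. This is an illustrative substring matcher, not the exact parser we ran: the function name, the normalization rules, and the handling of a bare "none" answer are assumptions.

```python
import re

# Output order of the multi-label vector (underscore spelling assumed).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def parse_llm_response(response):
    """Map a free-form LLM answer to a six-element 0/1 vector."""
    # Lowercase and treat underscores/hyphens as spaces so that
    # "Severe_Toxic" and "identity-hate" both match their label names.
    text = re.sub(r"[_\-]+", " ", response.lower())
    text = re.sub(r"\s+", " ", text).strip()

    # A bare "none" answer means no label applies.
    if text.rstrip(".") == "none":
        return [0] * len(LABELS)

    # Plain substring matching: 1 if the label name appears anywhere.
    return [int(label.replace("_", " ") in text) for label in LABELS]


print(parse_llm_response("toxic, insult"))
print(parse_llm_response("None."))
print(parse_llm_response("Labels: severe_toxic, threat"))
```

Substring matching is simple but brittle. In this sketch "severe_toxic" also switches on "toxic", and a reply such as "non-toxic" would be misread as "toxic", so a stricter output format or word-boundary matching would be worth considering.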

Because the LLM produces free-form text rather than token-level probabilities, interpretability evaluation was conducted exclusively on BERT. LIME [3] and SHAP [4] were applied to individual predictions to highlight influential tokens, which were then compared against human-annotated rationales from HateXplain [6,7]. Metrics included token-level Intersection-over-Union (IoU) and precision/recall for rationale alignment. We also qualitatively inspected high-toxicity, negative-but-non-toxic, and polite comments to assess how well explanations capture meaningful features and avoid over-reliance on identity terms.
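The rationale-alignment metrics named above reduce to set operations over token positions. The sketch below shows one way to compute them; the function name and the convention of returning 0.0 for empty sets are assumptions.

```python
def rationale_overlap(predicted, human):
    """Token-level IoU, precision, and recall between two sets of positions."""
    predicted, human = set(predicted), set(human)
    intersection = len(predicted & human)
    union = len(predicted | human)
    return {
        "iou": intersection / union if union else 0.0,
        "precision": intersection / len(predicted) if predicted else 0.0,
        "recall": intersection / len(human) if human else 0.0,
    }


# Token positions highlighted by the explainer vs. by human annotators.
scores = rationale_overlap(predicted=[0, 2, 3, 7], human=[2, 3])
print(scores)  # {'iou': 0.5, 'precision': 0.5, 'recall': 1.0}
```

In this example the explainer recovers every human-marked token (recall 1.0), yet the two extra highlighted tokens halve both precision and IoU.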

This experimental setup allows direct comparison of fine-tuned BERT and zero-shot LLM performance on classification accuracy, as well as detailed evaluation of interpretability and fairness for the BERT model.



## Results

### Baseline Performance

We fine-tuned bert-base-uncased on the Jigsaw Toxic Comment Classification dataset using our own 80/10/10 train/validation/test split. The model was trained for 3 epochs with a batch size of 16 and a learning rate of 2e-5. Training proceeded smoothly, with training loss decreasing consistently across epochs (0.049 → 0.034 → 0.027), indicating stable convergence.

On the held-out test set (15,958 comments), the fine-tuned BERT model achieved a macro-averaged F1 score of 0.65 across the six toxicity labels.
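For reference, macro-averaged F1 over multi-label predictions can be computed as in the sketch below. The 0.5 probability cutoff and the convention that a label with no true or predicted positives scores 0 are assumptions, not confirmed settings of our evaluation script, and the example uses three labels and made-up probabilities for brevity.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [[int(p >= cutoff) for p in row] for row in probabilities]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # Convention: F1 is 0 when a label has no positives at all.
        per_label.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_label) / n_labels


y_true = [[1, 0, 1], [0, 0, 0], [1, 0, 0]]
probabilities = [[0.9, 0.2, 0.7], [0.1, 0.1, 0.6], [0.4, 0.3, 0.2]]
y_pred = threshold(probabilities)
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because every label counts equally, rare labels such as threat weigh as much as frequent ones, and under this convention a label with no positive examples in a small evaluation sample contributes 0 to the average.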

### Model Performance: Fine-Tuned BERT vs Zero-Shot LLM

To compare task-specific fine-tuning with zero-shot prompting, we evaluated both our fine-tuned BERT classifier and a zero-shot Qwen2.5-7B-Instruct model on the same 200-sample subset of the Jigsaw test set.

* Fine-tuned BERT macro F1: 0.435

* Zero-shot Qwen2.5-7B-Instruct macro F1: 0.180

The LLM received only an instruction prompt and produced free-form text that we converted to multi-label predictions using a substring-based parser. Even under this setup, the LLM frequently defaulted to predicting “none.” The fine-tuned BERT model outperformed the zero-shot LLM on all labels.

### Interpretability: LIME and SHAP

To evaluate how the model makes decisions, we applied two complementary interpretability methods:

* LIME (Local Interpretable Model-Agnostic Explanations), which highlights token importance by approximating local decision boundaries.

* SHAP (SHapley Additive exPlanations), which provides quantitative attribution scores based on cooperative game theory.

Across a diverse set of example sentences (highly toxic, negative-but-non-toxic, and positive comments), both methods showed strong agreement. In high-toxicity samples, LIME and SHAP consistently highlighted abusive or identity-related tokens (e.g., *hate*, *racist*, *garbage*). For non-toxic or polite sentences, both methods assigned negative or near-zero contributions to all tokens, suggesting that, at least on these examples, the model does not rely on spurious cues.

### Interpretability Results (LIME + SHAP)

To better understand how our fine-tuned BERT model makes toxicity predictions, we examined the LIME and SHAP explanations for individual sentences. These explanations reveal which tokens contribute most strongly to the model’s decisions and whether those tokens align with intuitive linguistic cues.

SHAP values indicate how much each token pushes the model’s toxicity prediction up or down relative to a neutral baseline: positive values push the prediction toward the toxic label, and negative values push it toward the non-toxic label. LIME, in turn, highlights the words that most strongly influence the model’s prediction for a specific example, showing which tokens the classifier relies on most when deciding whether a comment is toxic or non-toxic.

We evaluated five representative sentences that span clearly toxic, negative-but-non-toxic, and clearly positive language. For each example, we provide the LIME visualization, the SHAP bar plot, and a concise interpretation of what the model appears to be focusing on.

1. High-Toxicity Examples

    **“I hate you so much.”**

    <img src="example_plots/ihateyousomuch.png" width="500" height="450" />

    Interpretation:
    Both LIME and SHAP agree that *hate* and *you* are the primary drivers of the toxic classification in this sentence. SHAP shows a large positive contribution from *hate*, and LIME visually highlights the same tokens in the text. This indicates that the model identifies direct hostility and personal targeting as key signals of toxic language, which is consistent with human intuition.

    **“You racist piece of garbage.”**

    <img src="example_plots/youracistpieceofgarbage.png" width="500" height="450" />

    Interpretation:
    In this example, the model picks up on severely toxic expressions. Both LIME and SHAP highlight *racist* and *garbage* as the strongest contributors to the prediction. SHAP further shows that *You* also increases the toxic score, suggesting the model has learned that insults directed at a person tend to be especially harmful. The clear alignment across both methods suggests that the model relies on linguistically meaningful tokens rather than spurious patterns.

2. Negative but Non-Toxic Examples

    These sentences contain negative sentiment, but they are not abusive. They test whether the model mistakenly interprets criticism as harassment.

    **“This is absolutely terrible work.”**

    <img src="example_plots/thisisabsolutelyterriblework.png" width="500" height="450" />

    Interpretation:
    Even though the sentence contains negative sentiment, both LIME and SHAP show that the model does not treat it as toxic. Words like *terrible* are highlighted, but their contributions remain minimal or negative. SHAP confirms that all tokens either reduce or barely influence the toxic probability. This suggests that the model is capable of distinguishing criticism from harassment, a key requirement for fairness.

3. Positive / Polite Examples

    These examples test whether the model correctly recognizes supportive, grateful, or polite language.

    **“Thank you for your help.”**

    <img src="example_plots/thankyouforyourhelp.png" width="500" height="450" />

    Interpretation:
    LIME assigns negative contributions to positive expressions like *Thank* and *help*, and SHAP shows clearly negative values for the same words. This means that the model interprets polite language as strongly non-toxic. The model’s ability to detect this distinction is important for avoiding false positives.

    **“I really appreciate your kindness.”**

    <img src="example_plots/ireallyappreciateyourkindness.png" width="500" height="450" />

    Interpretation:
    Both interpretability methods highlight positive expressions such as *appreciate* and *kindness*, with these tokens reducing the predicted toxicity. The consistency between LIME and SHAP shows that the model handles positive sentiment reliably and does not mistakenly attribute toxicity to polite or uplifting language.

**Alignment with Human Rationales (HateXplain)**

To further evaluate interpretability, we compared LIME explanations with human-annotated rationales from the HateXplain dataset. Using token-level Intersection-over-Union (IoU), we computed how closely LIME’s highlighted tokens matched human-marked spans.

* Overall Mean IoU: 0.17

* Hate Speech Class: 0.1716 (70 samples)

* Offensive Class: 0.1588 (30 samples)

These low IoU scores indicate only partial overlap between LIME explanations and human rationales.


## Analysis of Results

Our fine-tuned BERT classifier achieved a macro F1 score of about 0.65 on the full Jigsaw test set, providing the strongest baseline in this study. Since our goal is interpretability rather than maximizing accuracy, the analysis focuses on what the explanations reveal about model behavior.

A major finding is the substantial performance gap between supervised BERT and the zero-shot LLM. Although large language models are strong general reasoners, they lack the task-specific supervision required to distinguish overlapping toxicity categories such as toxic, obscene, insult, and identity hate. This limitation is most visible in cases with implicit or indirect toxicity, where the LLM frequently defaulted to predicting “none.” Without fine-tuning, it struggled to map subtle linguistic cues to discrete toxicity labels.

BERT, in contrast, benefits from thousands of annotated examples. Fine-tuning provides clearer decision boundaries and more precise distinctions between toxicity categories, which results in stronger performance on both explicit and subtle toxic language.

Interpretability methods further clarify how BERT makes these decisions. LIME produces localized, human-readable token importance scores, while SHAP assigns contribution values. Both consistently highlight meaningful tokens such as insults, identity terms, and direct expressions of hostility. They also help identify potential weaknesses, including occasional over-reliance on identity terms in borderline cases.

However, alignment between model explanations and human rationales is low, with a mean IoU of about 0.17. This should not by itself be read as evidence of poor explanation quality, because several structural factors depress the score. Human rationales in HateXplain are often minimal spans, sometimes a single pivotal token, and annotators vary in what they highlight. LIME returns a fixed number of influential tokens (we set it to 10), which enlarges the union in the IoU denominator. WordPiece tokenization further fragments many expressions into subword units, making exact token-level matching difficult.
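The denominator effect can be quantified with a small calculation. Assuming all 10 tokens returned by LIME count as highlighted, the best achievable IoU is reached when one token set fully contains the other, so it depends only on the two set sizes. The helper name below is hypothetical.

```python
def best_case_iou(num_explainer_tokens, num_human_tokens):
    """Upper bound on IoU, reached when one token set contains the other."""
    smaller = min(num_explainer_tokens, num_human_tokens)
    larger = max(num_explainer_tokens, num_human_tokens)
    return smaller / larger if larger else 0.0


# With 10 explainer tokens, short human rationales cap the score.
for human_len in (1, 2, 5, 10):
    print(f"human rationale of {human_len:>2} tokens -> "
          f"max IoU {best_case_iou(10, human_len):.2f}")
```

Under this assumption, an explanation that contains a single-token human rationale exactly still scores at most 0.10, so the averages reported here sit closer to their ceiling than the raw numbers suggest.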

Importantly, similar limitations are reported in prior work. In the TU Wien seminar “Measuring explainability in hate speech detection using the HateXplain dataset” by Markus Reichel (TUW NLP Seminar, Advisors: Gábor Recski and Allan Hanbury), plausibility evaluations show that even models trained directly on HateXplain rationales achieve IoU F1 scores in the range of 0.11 to 0.22 [13]. BERT models with LIME explanations achieve IoU values around 0.118, and even the best-performing attention-based architectures reach only about 0.22. These findings indicate that low rationale alignment is a known and expected outcome when using token-level plausibility metrics on HateXplain. Our results are therefore consistent with previously reported numbers, and they reflect the structural challenges of comparing post-hoc explanations with minimal human spans.

Taken together, these results support two conclusions:

1. In our experiments, a zero-shot general-purpose LLM was not a substitute for supervised fine-tuning on fine-grained toxicity classification, particularly when multiple labels must be predicted simultaneously.
2. Post-hoc interpretability remains a useful but imperfect tool for analyzing model behavior, and improvements in explanation quality will depend on better alignment methods, more consistent annotation schemes, and interpretability techniques that handle subword tokenization more effectively.

## Future Work

Given the relatively low alignment scores observed between model-generated explanations and human rationales in this study, future work should focus on strategies to improve interpretability metrics and the quality of token-level explanations. LIME and SHAP could be more effectively integrated with careful tuning of parameters, sampling strategies, and feature selection to produce explanations that better capture the truly influential words driving toxicity predictions. Additional preprocessing, such as merging subword tokens from BERT’s WordPiece tokenizer, could also help improve token-level alignment with human-annotated rationales.
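The subword-merging idea could look roughly like the sketch below, which joins WordPiece continuation pieces (marked with a `##` prefix) back into whole words and sums their attribution scores. Summing is one possible aggregation (taking the maximum is another), and the token split and scores shown are made up rather than actual tokenizer or explainer output.

```python
def merge_wordpieces(tokens, scores):
    """Merge '##' continuation pieces into words, summing their scores."""
    words, word_scores = [], []
    for token, score in zip(tokens, scores):
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word.
            words[-1] += token[2:]
            word_scores[-1] += score
        else:
            words.append(token)
            word_scores.append(score)
    return words, word_scores


# Hypothetical subword split and attribution scores.
tokens = ["you", "are", "un", "##bel", "##ievable"]
scores = [0.10, 0.02, 0.05, 0.20, 0.15]

words, word_scores = merge_wordpieces(tokens, scores)
for word, score in zip(words, word_scores):
    print(f"{word:>12}: {score:.2f}")
```

After merging, explanations and human rationales can be compared at the word level, which removes one source of mismatch in the IoU calculation.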

Complementary interpretability methods, such as Integrated Gradients, could be explored to provide more faithful token attributions, potentially yielding higher Intersection-over-Union (IoU) and precision/recall scores against HateXplain rationales. Expanding evaluation to include more complex or ambiguous examples would help identify cases where current methods underperform and allow targeted improvements in explanation quality.

On the modeling side, stronger architectures like RoBERTa-base [10] or DeBERTa-v3 [11], or joint training on both Jigsaw and HateXplain, may strengthen the link between classification predictions and explanatory signals. Fairness-oriented techniques such as counterfactual data augmentation, adversarial training, or identity-term masking could reduce spurious associations in token importance.

Going further, adversarial robustness remains an important consideration. Techniques such as automatic decipherment, as proposed by Wu et al. (2018) [12], could improve the model’s ability to detect intentionally disguised toxic content, ensuring that interpretability methods reflect robust and reliable reasoning even under adversarial input. Overall, these strategies aim to raise explanation quality and alignment scores while maintaining strong predictive performance.

## References

[1] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)

[2] PyTorch Documentation: torch.nn.BCEWithLogitsLoss. [https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)

[3] Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?” Explaining the Predictions of Any Classifier. KDD. [https://arxiv.org/abs/1602.04938](https://arxiv.org/abs/1602.04938)

[4] Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. NeurIPS. [https://arxiv.org/abs/1705.07874](https://arxiv.org/abs/1705.07874)

[5] Jigsaw Toxic Comment Classification Challenge, HuggingFace Datasets. [https://huggingface.co/datasets/thesofakillers/jigsaw-toxic-comment-classification-challenge](https://huggingface.co/datasets/thesofakillers/jigsaw-toxic-comment-classification-challenge)

[6] HateXplain Dataset, HuggingFace Datasets. [https://huggingface.co/datasets/Hate-speech-CNERG/hatexplain](https://huggingface.co/datasets/Hate-speech-CNERG/hatexplain)

[7] Mathew, B., Saha, P., Yimam, S. M., Biemann, C., Goyal, P., & Mukherjee, A. (2021). HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection. [https://arxiv.org/abs/2012.10289](https://arxiv.org/abs/2012.10289)

[8] LIME Documentation. [https://lime-ml.readthedocs.io/en/latest/lime.html](https://lime-ml.readthedocs.io/en/latest/lime.html)

[9] SHAP Documentation. [https://shap.readthedocs.io/en/latest/generated/shap.Explainer.html](https://shap.readthedocs.io/en/latest/generated/shap.Explainer.html)

[10] Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. [https://arxiv.org/abs/1907.11692](https://arxiv.org/abs/1907.11692)

[11] He, P., et al. (2021). DeBERTaV3: Improving DeBERTa using ELECTRA-style Pretraining with Gradient-Disentangled Embedding Sharing. [https://arxiv.org/abs/2111.09543](https://arxiv.org/abs/2111.09543)

[12] Wu, Z., Kambhatla, N., & Sarkar, A. (2018). Decipherment for Adversarial Offensive Language Detection. ACL Workshop W18-5119. [https://aclanthology.org/W18-5119/](https://aclanthology.org/W18-5119/)

[13] Reichel, M. (2023). Measuring Explainability in Hate Speech Detection Using the HateXplain Dataset. TU Wien NLP Seminar, Advisors: Gábor Recski and Allan Hanbury. Seminar talk.