## **Introduction**

This notebook demonstrates how to use BERT for toxic content detection and leverage LIME (Local Interpretable Model-Agnostic Explanations) to explain the predictions made by the BERT model. We will walk through the process of classifying toxic content in text using a pre-trained BERT model and then visualize which words contributed the most to the model's decision using LIME.


Author: Lennox Anderson

Date Modified: September 29th, 2024.

---

## **Dependencies**

In [1]:
!pip install shap
!pip install transformers
!pip install lime



## **Section 1: Toxic Content Detection using BERT**

In this section, we load a pre-trained BERT model that is fine-tuned for toxic content detection. The BERT model will predict categories such as toxic, severe toxic, obscene, threat, insult, and identity hate based on the input text.

In [2]:
text = "You are a dumb asshole"

In [3]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
import warnings

# handle warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# initialize the BERT toxicity-detection model and its tokenizer
model_name = "unitary/toxic-bert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

# tokenize and convert to tensors
# we convert the tokenizer output to tensors because tensors are the format BERT requires for processing
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)

# run inference inside torch.no_grad(): gradients are not needed for prediction, so skipping them saves time and memory
# pass the tokenized tensors to BERT and extract the raw, unnormalized output scores (logits)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# apply softmax, which converts the raw scores (logits) into probabilities that sum to 1 across the six labels
probabilities = torch.softmax(logits, dim=-1).numpy()

# labels from BERT
class_names = ['toxic', 'severe toxic', 'obscene', 'threat', 'insult', 'identity hate']

predicted_class = class_names[probabilities.argmax()]
print(f"Predicted class: {predicted_class}")
print(f"Class probabilities: {probabilities}")


Predicted class: toxic
Class probabilities: [[7.3768127e-01 8.5911941e-04 1.0318284e-01 7.2450780e-06 1.5816484e-01
  1.0469716e-04]]
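
The raw array above is easier to read once each probability is paired with its label. The following is a minimal sketch that reuses the values printed above; `rank_labels` is a hypothetical helper introduced here for illustration, not part of the original notebook.

```python
# Probabilities copied from the printed output above, in the model's output order.
class_names = ['toxic', 'severe toxic', 'obscene', 'threat', 'insult', 'identity hate']
probabilities = [7.3768127e-01, 8.5911941e-04, 1.0318284e-01,
                 7.2450780e-06, 1.5816484e-01, 1.0469716e-04]

def rank_labels(names, probs):
    """Return (label, probability) pairs sorted from most to least likely."""
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)

ranked = rank_labels(class_names, probabilities)
for label, prob in ranked:
    print(f"{label:<14} {prob:.4f}")
```

Read this way, `toxic` leads at roughly 0.74, followed by `insult` at roughly 0.16 and `obscene` at roughly 0.10, while the remaining labels are close to zero.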


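The softmax step used in the cell above can also be written out by hand. The sketch below compares it with an element-wise sigmoid, which is the usual choice when labels are not mutually exclusive. The label set here (a comment can be both `toxic` and `insult`) suggests that setting, but whether this checkpoint was trained that way is an assumption to check against the model card. The logits are made up for illustration.

```python
import math

def softmax(logits):
    """Normalize scores so they are positive and sum to 1 (labels compete)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each score into (0, 1) independently (labels do not compete)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Made-up logits for illustration only; these are not outputs of the model.
example_logits = [2.0, -3.0, 1.0]

print("softmax:", [round(p, 3) for p in softmax(example_logits)])
print("sigmoid:", [round(p, 3) for p in sigmoid(example_logits)])
```

With softmax, raising one label's score necessarily lowers the others; with sigmoid, several labels can be close to 1 at once.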
## **Section 2: Explaining Model Predictions with LIME**

LIME helps make model predictions interpretable by showing which parts of the text (words or phrases) contributed most to the final classification. In this section, we use LIME to generate an explanation for the BERT model's toxic content prediction.
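
To make the idea concrete, the following is a simplified sketch of the perturbation step that LIME relies on: it creates variants of the input with some words removed, records which words were kept, and would then query the model on each variant. `perturb_text` is a hypothetical helper written for illustration; the library's own sampling differs in detail.

```python
import random

def perturb_text(text, num_samples=5, seed=0):
    """Return (mask, perturbed_text) pairs; mask[i] is 1 if word i was kept.

    Assumes the text contains at least two words.
    """
    rng = random.Random(seed)
    words = text.split()
    samples = [([1] * len(words), text)]  # the first sample is the unmodified text
    for _ in range(num_samples - 1):
        num_removed = rng.randint(1, len(words) - 1)  # always keep at least one word
        removed = set(rng.sample(range(len(words)), num_removed))
        mask = [0 if i in removed else 1 for i in range(len(words))]
        kept = " ".join(w for w, keep in zip(words, mask) if keep)
        samples.append((mask, kept))
    return samples

for mask, variant in perturb_text("the quick brown fox jumps"):
    print(mask, repr(variant))
```

LIME then fits a simple weighted linear model that maps these masks to the classifier's probabilities; the coefficients of that model are the per-word weights shown in the explanation.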

In [4]:
text = "you are a great big beautiful bitch"

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
import lime
from lime.lime_text import LimeTextExplainer
import numpy as np

# initialize BERT model for toxicity detection and tokenizing
model_name = "unitary/toxic-bert"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()

# BERT's labels
class_names = ['toxic', 'severe toxic', 'obscene', 'threat', 'insult', 'identity hate']

inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=-1).numpy()

# initialize LIME text explainer
explainer = LimeTextExplainer(class_names=class_names)

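# classifier function for LIME: takes a list of perturbed texts and returns
# an array of shape (n_texts, n_classes) holding softmax probabilities
def predict_proba(texts, batch_size=32):
    probs = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        enc = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
        with torch.no_grad():
            probs.append(torch.softmax(model(**enc).logits, dim=-1).numpy())
    return np.concatenate(probs, axis=0)

# LIME perturbs the text by removing words, calls predict_proba on each variant,
# and fits a local linear model whose weights rank the words.
# top_labels=1 explains the predicted class; by default LIME explains only label index 1 ('severe toxic')
explanation = explainer.explain_instance(text, predict_proba, top_labels=1, num_features=10)

# show the bar chart explanation with the input text highlighted
explanation.show_in_notebook(text=True)


In [None]:
# raw explanation: (word, weight) pairs for the explained label
label = explanation.available_labels()[0]
print(f"Explained label: {class_names[label]}")
for feature, weight in explanation.as_list(label=label):
    print(f"Feature: {feature}, Weight: {weight}")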
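
The sign of each weight indicates direction: a positive weight means the word raises the probability of the explained label, and a negative weight means it lowers it. The sketch below separates the two groups. The input pairs are invented placeholders in the same shape as the output of `explanation.as_list()`, not real LIME output, and `split_by_direction` is a hypothetical helper.

```python
# Hypothetical (word, weight) pairs in the shape returned by explanation.as_list();
# the numbers are invented for illustration and are not real LIME output.
example_explanation = [("alpha", 0.42), ("beta", -0.17), ("gamma", 0.05), ("delta", -0.30)]

def split_by_direction(pairs):
    """Separate words that raise the explained label's score from those that lower it."""
    supporting = sorted((p for p in pairs if p[1] > 0), key=lambda p: -p[1])
    opposing = sorted((p for p in pairs if p[1] < 0), key=lambda p: p[1])
    return supporting, opposing

supporting, opposing = split_by_direction(example_explanation)
print("pushes toward the label:", supporting)
print("pushes away from the label:", opposing)
```

Each group is ordered from strongest to weakest effect, so the first entry of each list is the word with the largest influence in that direction.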