In this notebook, we use the toxicity classifier trained by the Skolkovo Institute ([link](https://huggingface.co/s-nlp/roberta_toxicity_classifier)) to retrieve both a toxicity label (True/False) and a toxicity metric (a float) for a given text sample. The model is a RoBERTa model fine-tuned on a toxicity classification task.

- The toxicity metric is computed from the model's output-layer logits as the non-toxic logit minus the toxic logit (the smaller the value, the more toxic the text).

- The toxicity label is assigned the same way the model itself classifies, i.e. by taking the larger of the two logits, which corresponds to the sign of the metric (positive => non-toxic, negative => toxic).
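
The relation between the two logits, the metric, and the label can be sketched without loading the model. This is a minimal illustration, assuming the class order used later in this notebook (index 0 = non-toxic, index 1 = toxic). The helper names and the example logits are hypothetical. The last helper relies on the fact that a softmax over two logits reduces to a sigmoid of their difference.

```python
import math
from typing import List, Tuple


def score_and_label(logits: List[float]) -> Tuple[float, bool]:
    """Hypothetical helper: logits = [non_toxic_logit, toxic_logit]."""
    non_toxic, toxic = logits
    score = non_toxic - toxic      # the toxicity metric: smaller = more toxic
    is_toxic = toxic > non_toxic   # same decision as taking the larger logit
    return score, is_toxic


def toxic_probability(score: float) -> float:
    """Softmax probability of the toxic class, written in terms of the score."""
    return 1.0 / (1.0 + math.exp(score))


# Made-up logits, for illustration only.
score, is_toxic = score_and_label([2.0, -1.0])
print(score, is_toxic)
print(toxic_probability(0.0))
```

A score of exactly zero corresponds to a toxic probability of 0.5, which is why the sign of the metric and the label agree.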

In [4]:
import numpy as np

from tqdm import tqdm
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from typing import List

In [5]:
import torch


def measure_toxicity(texts: List[str], batch_size: int = 32):
    """Return (labels, scores); label 1.0 = toxic, score = non-toxic logit - toxic logit."""
    res_labels = []
    res_scores = []

    tokenizer = RobertaTokenizer.from_pretrained('SkolkovoInstitute/roberta_toxicity_classifier')
    model = RobertaForSequenceClassification.from_pretrained('SkolkovoInstitute/roberta_toxicity_classifier')
    model.eval()  # make sure dropout is disabled for inference

    for i in tqdm(range(0, len(texts), batch_size)):
        batch = tokenizer(texts[i:i + batch_size], return_tensors='pt', padding=True)

        # Run the model once per batch and reuse the logits for both outputs.
        with torch.no_grad():
            logits = model(**batch).logits.float()

        # Label: index of the larger logit (0 = non-toxic, 1 = toxic).
        res_labels.extend(logits.argmax(1).float().tolist())

        # Score: non-toxic logit minus toxic logit, stored as one-element lists
        # so that callers can read it as toxicity_score[0].
        scores = logits[:, 0] - logits[:, 1]
        res_scores.extend(scores.view(-1, 1).tolist())

    return res_labels, res_scores
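
The batching in the loop above can be illustrated on its own. The sketch below uses the same slicing pattern on placeholder strings; `split_into_batches` and `dummy_texts` are hypothetical names introduced only for this example. With seven inputs and the default `batch_size` of 32, everything fits into a single batch.

```python
import math
from typing import List


def split_into_batches(texts: List[str], batch_size: int = 32) -> List[List[str]]:
    """Hypothetical helper using the same slicing as the loop in measure_toxicity."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


dummy_texts = [f"sample {n}" for n in range(7)]  # placeholder inputs

default_batches = split_into_batches(dummy_texts)
small_batches = split_into_batches(dummy_texts, batch_size=3)

print(len(default_batches))             # number of batches with batch_size=32
print([len(b) for b in small_batches])  # batch sizes with batch_size=3
print(math.ceil(len(dummy_texts) / 3))  # batch count is ceil(n / batch_size)
```

The last batch is simply shorter when the number of texts is not a multiple of `batch_size`; `padding=True` in the tokenizer call pads within each batch, so no extra handling is needed.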

In [6]:
samples = [
    "What a great sunny day, I miss it.",
    "hello there! i'm a piece of a non-toxic text",
    "I didn't screw him",
    "I didn't fuck him",
    "I'm going to hit you in all directions, civil and criminal, on all counts.",
    "What a fucked rainy day, goddamnit.",
    "hello there! i'm a piece of shit :)",
]

In [9]:
toxicity_labels, toxicity_scores = measure_toxicity(samples)

print()
for sample, toxicity_label, toxicity_score in zip(samples, toxicity_labels, toxicity_scores):
    toxicity_label = "toxic" if toxicity_label else "non toxic"

    print(f"'{sample}'")
    print(f"toxicity score: {toxicity_score[0]:.2f}\t({toxicity_label})" + "\n")

Some weights of the model checkpoint at SkolkovoInstitute/roberta_toxicity_classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 1/1 [00:00<00:00,  3.12it/s]


'What a great sunny day, I miss it.'
toxicity score: 10.05	(non toxic)

'hello there! i'm a piece of a non-toxic text'
toxicity score: 5.66	(non toxic)

'I didn't screw him'
toxicity score: -0.94	(toxic)

'I didn't fuck him'
toxicity score: -5.59	(toxic)

'I'm going to hit you in all directions, civil and criminal, on all counts.'
toxicity score: -4.71	(toxic)

'What a fucked rainy day, goddamnit.'
toxicity score: -6.52	(toxic)

'hello there! i'm a piece of shit :)'
toxicity score: -6.13	(toxic)






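On these seven samples, the model separates the clearly neutral sentences (scores 10.05 and 5.66) from the profane ones (scores below -5) by a wide margin, while the milder "I didn't screw him" lands close to the decision boundary at -0.94. For the data considered, the labels match what we would expect.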
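
As a quick sanity check, the scores and labels printed above can be compared against the sign rule from the introduction. The snippet below is a sketch that copies the seven printed values by hand, in the order of `samples`; `MARGIN` is a hypothetical threshold chosen only to show how borderline cases could be flagged, not something the model defines.

```python
from typing import List, Tuple

# (score, label) pairs copied from the printed output, in the order of `samples`.
reported: List[Tuple[float, str]] = [
    (10.05, "non toxic"),
    (5.66, "non toxic"),
    (-0.94, "toxic"),
    (-5.59, "toxic"),
    (-4.71, "toxic"),
    (-6.52, "toxic"),
    (-6.13, "toxic"),
]

# Every label should agree with the sign of its score (negative => toxic).
label_matches_sign = all((score < 0) == (label == "toxic") for score, label in reported)

# Hypothetical margin: scores this close to zero are treated as borderline.
MARGIN = 1.0
borderline = [idx for idx, (score, _) in enumerate(reported) if abs(score) < MARGIN]

# Sample indices ordered from most toxic (lowest score) to least toxic.
most_toxic_first = sorted(range(len(reported)), key=lambda idx: reported[idx][0])

print(label_matches_sign)
print(borderline)
print(most_toxic_first)
```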