# Exercise 8: Hate Speech Detection with BERT

In this exercise, you will finetune a BERT model to do hate speech detection on tweets. You will also modify the training dataset to make training more efficient.

You should complete the parts of the exercise that are marked as **TODO**.
A correctly completed **TODO** gives 2 bonus points. Partially correct answers give 1 bonus point.
Some **TODO**s are inside a comment in a code block: Here, you should complete the line of code.
Other **TODO**s are inside a text block: Here, you should write a few sentences to answer the question.

**Important:** Some students were under the impression that you have to complete a TODO in a _single_ line of code. That is not the case, you can use as many lines as you need.

**Submission deadline:** 11.01.2022

**Instructions for submission:** After completing the exercise, save a copy of the notebook as exercise8_twitterhate_MATRIKELNUMMER.ipynb, where MATRIKELNUMMER is your student ID number. Then upload the notebook to moodle (submission exercise sheet 8).

In order to understand the code, it can be helpful to experiment a bit during development, e.g., to print tensors or their shapes. But please remove these changes before submitting the notebook. If we cannot run your notebook, or if a print statement is congesting stdout too much, then we cannot grade it. 

To make the most of this exercise, you should try to read and understand the entire code, not just the parts that contain a **TODO**. If you have questions, write them down for the exercise, which will happen in the week after the submission deadline.

**CUDA:** You can use a GPU for this exercise (on colab: Runtime -> Change Runtime Type -> GPU). This is not mandatory, but it will speed up training epochs, thereby allowing you to test more hyperparameters.

# Required libraries
When working with 🤗 transformers, or any fast-changing software library, you should be extra careful to fix the library versions when you begin your project, and not change versions while you're developing.

In [1]:
!pip install transformers==4.15.0
!pip install datasets==1.16.0
!pip install tensorflow

Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 2.6 MB/s eta 0:00:01
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.11.4
    Uninstalling tokenizers-0.11.4:
      Successfully uninstalled tokenizers-0.11.4
Successfully installed tokenizers-0.10.3
Collecting datasets==1.16.0
  Using cached datasets-1.16.0-py3-none-any.whl (298 kB)
Collecting tqdm>=4.62.1
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Installing collected packages: tqdm, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.49.0
    Uninstalling tqdm-4.49.0:
      Successfully uninstalled tqdm-4.49.0
  Attempting uninstall: datasets
    Found existing installation: datasets 1.2.0
    Uninstalling datasets-1.2.0:
      Successfully uninstalled



In [2]:
from transformers import BertForSequenceClassification, DistilBertTokenizerFast, Trainer, TrainingArguments, EarlyStoppingCallback, BertTokenizerFast
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import torch

2022-01-20 18:37:04.732925: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-01-20 18:37:04.732952: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

In [4]:
NUM_EPOCHS = 2
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 64
WARMUP_STEPS = 50
WEIGHT_DECAY = 0.01
LOGGING_STEPS = 50
LEARNING_RATE = 5e-05

# Data
In this exercise, we will finetune a BERT model to perform hate speech detection on data from twitter. Hate speech detection is the task of classifying sentences, or in this case, tweets, as hate speech or not hate speech, for example so that we can automatically report it or filter it out. The dataset we're using is from the 🤗/datasets library, so we can load it very easily: https://huggingface.co/datasets/tweets_hate_speech_detection 
As the dataset currently only contains a training portion, we are going to use the Slicing API (https://huggingface.co/docs/datasets/splits.html) to divide it into a training and a development set.


In [5]:
train_dataset = load_dataset("tweets_hate_speech_detection", split='train[:80%]') #TODO: create the train dataset, using the first 80% of the dataset
dev_dataset = load_dataset("tweets_hate_speech_detection", split='train[:-20%]')#TODO: create the dev dataset, using the last 20% of the dataset

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/home/min20120907/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/3e953745870454cf8ff15cc48097dbb5ff459596e0a089867c2a29cee63984ec)
Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/home/min20120907/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/3e953745870454cf8ff15cc48097dbb5ff459596e0a089867c2a29cee63984ec)


Now let's look at some examples of tweets containing hate speech.

In [6]:
hate_speech_examples = [ a['tweet'] for a in train_dataset if a['label']==1] #TODO: create a list with all tweets that are marked as hate speech
print('\n'.join(hate_speech_examples[15:20]))
# print(hate_speech_examples)

you might be a libtard if... #libtard  #sjw #liberal #politics 
@user take out the #trash america...  - i voted against #hate - i voted against  - i voted against  - i votâ¦ 
if you hold open a door for a woman because she's a woman and not because it's a nice thing to do, that's . don't even try to deny it
@user this man ran for governor of ny, the state with the biggest african-american population    #â¦
#stereotyping #prejudice  offer no #hope or solutions but create the same old repetitive #hate #conflictâ¦ 


# Tokenization
Now that we have the datasets loaded, we need to tokenize them. This is very easy with 🤗 transformers, but to make our model faster we are first going to find out the smallest sequence length that we can comfortably work with. Tweets are very short, so we should be able to choose a sequence length that is a lot shorter than the standard 512 that most BERT models run with. We are going to tokenize the whole dataset with a very generous sequence length, choose our new sequence length so that at least 95% of all tweets are within this length, and then tokenize again while truncating those that are longer.

In [7]:
def run_tokenizer(train_dataset, dev_dataset):
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased') #actually the same as BertTokenizerFast

    def get_sequence_len(tokenizer, train_dataset, dev_dataset):

        def tokenize_for_lengths(batch):
            return tokenizer(batch['tweet'], padding=False, truncation=True, max_length=128, return_length = True)

        train_dataset_for_lengths = train_dataset.map(tokenize_for_lengths, batched=True, batch_size=len(train_dataset))

        tweet_lengths = np.array(train_dataset_for_lengths[:]['length'])
        tweet_lengths.sort()
        chosen_sequence_len = tweet_lengths[95]
        #TODO: find out what sequence length is at the 95th percentile of the tweet_lengths list and add 1 so we're on the safe side
        
        return chosen_sequence_len

    chosen_sequence_len = get_sequence_len(tokenizer, train_dataset, dev_dataset)

    def tokenize(batch, sequence_len):
        return tokenizer(batch['tweet'], padding="max_length", truncation=True, max_length=sequence_len)

    train_dataset = train_dataset.map(tokenize, fn_kwargs={'sequence_len': chosen_sequence_len}, batched=True, batch_size=len(train_dataset))
    dev_dataset = dev_dataset.map(tokenize, fn_kwargs={'sequence_len': chosen_sequence_len}, batched=True, batch_size=len(dev_dataset))

    train_dataset = train_dataset.remove_columns('tweet')
    dev_dataset = dev_dataset.remove_columns('tweet')
    
    return (train_dataset, dev_dataset)

train_dataset, dev_dataset = run_tokenizer(train_dataset, dev_dataset)

Loading cached processed dataset at /home/min20120907/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/3e953745870454cf8ff15cc48097dbb5ff459596e0a089867c2a29cee63984ec/cache-4aca46e39acebe7a.arrow


  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [8]:
def set_format(train_dataset, dev_dataset):
    train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    dev_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

    return (train_dataset, dev_dataset)

train_dataset, dev_dataset = set_format(train_dataset, dev_dataset)

# Model Definition
To make training as fast as possible, we are going to load the Distilbert model. To do finetuning, we are going to load our BERT model, add classification heads on top of it and then train with our dataset and specific task. In this case, we're doing binary sequence classification: we're classifying sequences, tweets, as either hate speech (label 1) or not hate speech (label 0). Luckily for us, in 🤗 transformers, we only need to instantiate a BertForSequenceClassification model from a pretrained generic BERT model and specify how many labels we want for the classification. We will get a warning that some of the weights (those of the classification heads) have not been trained yet, but that's fine.

In [9]:
def define_model():
    model = BertForSequenceClassification.from_pretrained('distilbert-base-uncased') #TODO: instantiate a BertForSequenceClassification model from distilbert-base-uncased for classification with two possible labels

    return model

model = define_model()

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertForSequenceClassification: ['distilbert.transformer.layer.0.output_layer_norm.weight', 'distilbert.transformer.layer.1.sa_layer_norm.weight', 'distilbert.transformer.layer.0.output_layer_norm.bias', 'distilbert.transformer.layer.1.attention.k_lin.weight', 'distilbert.transformer.layer.2.ffn.lin2.bias', 'distilbert.transformer.layer.4.sa_layer_norm.bias', 'distilbert.transformer.layer.4.ffn.lin2.weight', 'distilbert.transformer.layer.5.ffn.lin1.bias', 'distilbert.transformer.layer.1.output_layer_norm.weight', 'distilbert.transformer.layer.2.ffn.lin1.bias', 'distilbert.transformer.layer.5.ffn.lin2.bias', 'distilbert.transformer.layer.2.ffn.lin1.weight', 'distilbert.transformer.layer.3.sa_layer_norm.weight', 'distilbert.transformer.layer.

Here, we are going to set up two things. The first is a function that we can then pass to the Trainer class to tell it what kinds of metrics we want to compute on our development set, and the other is a Early Stopping Callback so that just like in the last exercise sheet, we can stop training if the development performance isn't increasing.

In [10]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

early_stopping_callback = EarlyStoppingCallback(early_stopping_patience= 2, early_stopping_threshold = 0.0)

Now, we are going to define some training arguments that we are going to pass to the Trainer class which will handle the training for us. Very important here is that we have set the metric for best model to F1-measure and load_best_model_at_end to True, so that F1-measure is used for early stopping and we load the best model at the end, not the one for which F1-measure has already decreased.

In [11]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    warmup_steps=WARMUP_STEPS,
    weight_decay=WEIGHT_DECAY,
    logging_dir='./logs/',
    evaluation_strategy="steps",
    logging_steps=LOGGING_STEPS,
    learning_rate=LEARNING_RATE,
    metric_for_best_model="f1",
    load_best_model_at_end=True
)

Here we instantiate the Trainer class using our model, the training args, and so on

In [12]:
def define_trainer(model, training_args, train_dataset, dev_dataset, compute_metrics, early_stopping_callback):
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        compute_metrics=compute_metrics,
        callbacks = [early_stopping_callback]
    )
    
    return trainer

trainer = define_trainer(model, training_args, train_dataset, dev_dataset, compute_metrics, early_stopping_callback)

# Training

In [13]:
def train(trainer):
    #TODO: tell the Trainer object to train 
    return trainer

trainer = train(trainer)

In [14]:
def evaluate(trainer):
    return trainer.evaluate()

evaluate(trainer)

***** Running Evaluation *****
  Num examples = 25570
  Batch size = 64


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.5520108342170715,
 'eval_accuracy': 0.9296441141963239,
 'eval_f1': 0.0,
 'eval_precision': 0.0,
 'eval_recall': 0.0,
 'eval_runtime': 76.189,
 'eval_samples_per_second': 335.613,
 'eval_steps_per_second': 5.25}

# What happened?

It looks like our accuracy is really good, but F1 is not where we want it to be. How could we improve that?

Let's take a look at our dataset again. How are the classes distributed?


In [15]:
hate_speech_dataset = load_dataset('tweets_hate_speech_detection', split='train')
print(np.mean(np.array(hate_speech_dataset['label'])))

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (/home/min20120907/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/3e953745870454cf8ff15cc48097dbb5ff459596e0a089867c2a29cee63984ec)


0.07014579813528565


It looks like only about 7% of the training dataset is actually hate speech. This extreme imbalance means that the model takes the path of least resistance for the loss, which is to predict "not hate speech" much more than it should.

Let's do something about that!

# Rebalancing the dataset

To help with this, we are simply going to rebalance the dataset so that it contains all the examples for hate speech, but only as many negative examples, so that the dataset is balanced 50-50.

In [16]:
def rebalance_dataset(dataset):
    num_label_1 = len([a for a in hate_speech_dataset if a['label']==1])#TODO: find out how often the label 1 appears in hate_speech_dataset's label column
    num_label_0 = len([a for a in hate_speech_dataset if a['label']==0]) #TODO: find out how often the label 0 appears in hate_speech_dataset's label column
    sorted_dataset = dataset.sort('label')
    balanced_dataset = sorted_dataset.select(list(range(0,num_label_1)) + list(range(num_label_0, num_label_0 + num_label_1)))
    return balanced_dataset.shuffle(seed=42)

balanced_dataset = rebalance_dataset(hate_speech_dataset)
balanced_train_dataset = balanced_dataset[:int(len(balanced_dataset)*0.8)]#using the dataset.select method, select the first 80% of the balanced dataset
balanced_dev_dataset = balanced_dataset[:-int(len(balanced_dataset)*0.2)]#using the dataset.select method, select the last 20% of the balanced dataset
print(balanced_train_dataset)
print(balanced_dev_dataset)

Loading cached sorted indices for dataset at /home/min20120907/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/3e953745870454cf8ff15cc48097dbb5ff459596e0a089867c2a29cee63984ec/cache-a9bafc97a07fd8cc.arrow
Loading cached shuffled indices for dataset at /home/min20120907/.cache/huggingface/datasets/tweets_hate_speech_detection/default/0.0.0/3e953745870454cf8ff15cc48097dbb5ff459596e0a089867c2a29cee63984ec/cache-219df93c8ac8c00c.arrow




Now that we have balanced the dataset, let's run everything again. 

In [None]:
balanced_train_dataset, balanced_dev_dataset = run_tokenizer(balanced_train_dataset, balanced_dev_dataset)
balanced_train_dataset, balanced_dev_dataset = set_format(balanced_train_dataset, balanced_dev_dataset)
balanced_model = define_model()
balanced_trainer = define_trainer(balanced_model, training_args, balanced_train_dataset, balanced_dev_dataset, compute_metrics, early_stopping_callback)
balanced_trainer = train(balanced_trainer)
evaluate(balanced_trainer)

That's better! **TODO**: Write a few sentences about how much F1-measure has improved, and why.