<a href="https://colab.research.google.com/github/oliverguhr/htw-nlp-lecture/blob/master/assignments/transformer/nlp_2_transformer_offensive_language_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Offensive Language Classification


In [1]:
!pip install datasets transformers accelerate sentencepiece

Defaulting to user installation because normal site-packages is not writeable


In [2]:
!mkdir data
!wget -c https://raw.githubusercontent.com/oliverguhr/htw-nlp-lecture/master/assignments/germeval2018.training.txt -O data/germeval2018.train.txt
!wget -c https://raw.githubusercontent.com/oliverguhr/htw-nlp-lecture/master/assignments/germeval2018.test.txt -O data/germeval2018.test.txt

mkdir: cannot create directory ‘data’: File exists
--2024-06-04 20:46:26--  https://raw.githubusercontent.com/oliverguhr/htw-nlp-lecture/master/assignments/germeval2018.training.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

--2024-06-04 20:46:26--  https://raw.githubusercontent.com/oliverguhr/htw-nlp-lecture/master/assignments/germeval2018.test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.

In [3]:
import pandas as pd
import numpy as np

In [4]:
# check if we have a GPU
!nvidia-smi

Tue Jun  4 20:46:27 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-PCIE-16GB           On  | 00000001:00:00.0 Off |                    0 |
| N/A   25C    P0              24W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Preparing the data

In the next step, we load the data and adjust it slightly. The data is available as a tab-delimited text file. pandas is a convenient choice for this kind of simple processing, but it could also be done with Python's standard library.

In [5]:
test_df = pd.read_csv("./data/germeval2018.test.txt", sep='\t', header=0,encoding="utf-8")
train_df = pd.read_csv("./data/germeval2018.train.txt", sep='\t', header=0,encoding="utf-8")
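
For comparison with the pandas calls above, here is a minimal sketch of reading the same tab-delimited layout with Python's standard `csv` module. The two sample rows and the `read_tsv` helper are made up for illustration; with the real files you would pass an opened file handle (UTF-8 encoded) instead of `io.StringIO`.

```python
import csv
import io

# Hypothetical two-row sample in the same tab-separated layout (not real data).
sample = (
    "\ttext\tlabel\tlabel2\n"
    "0\tGuten Morgen zusammen\tOTHER\tOTHER\n"
    "1\tDu bist ein Idiot\tOFFENSE\tINSULT\n"
)

def read_tsv(handle):
    """Read tab-delimited rows into a list of dicts keyed by the header row."""
    # QUOTE_NONE keeps quotation marks inside tweets as ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_tsv(io.StringIO(sample))
texts = [row["text"] for row in rows]
labels = [row["label"] for row in rows]
print(texts, labels)
```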

In [6]:
train_df.head()

Unnamed: 0,text,label,label2
0,"@corinnamilborn Liebe Corinna, wir würden dich...",OTHER,OTHER
1,@Martin28a Sie haben ja auch Recht. Unser Twee...,OTHER,OTHER
2,@ahrens_theo fröhlicher gruß aus der schönsten...,OTHER,OTHER
3,@dushanwegner Amis hätten alles und jeden gewä...,OTHER,OTHER
4,@spdde kein verläßlicher Verhandlungspartner. ...,OFFENSE,INSULT


In [7]:
# We do not need the label2 column, so we drop it.
test_df.drop(columns=['label2'], inplace=True)
train_df.drop(columns=['label2'], inplace=True)

In [8]:
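def clean_text(text):
    # Optional cleaning steps for a pandas Series of strings; all are disabled by default.
    #text = text.str.lower() # lowercase
    #text = text.str.replace(r"#", "", regex=True) # remove hashtag signs
    #text = text.str.replace(r"http\S+", "URL", regex=True)  # replace URLs with a placeholder
    #text = text.str.replace(r"@", "", regex=True) # remove @ signs
    #text = text.str.replace(r"[^A-Za-z0-9öäüÖÄÜß()!?]", " ", regex=True) # replace all other characters with spaces
    #text = text.str.replace(r"\s{2,}", " ", regex=True) # collapse repeated whitespace
    return text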

def convert_label(label):
    return 1 if label == "OFFENSE" else 0

In [9]:
train_df["text"]=clean_text(train_df["text"])
test_df["text"]=clean_text(test_df["text"])
train_df["label"]=train_df["label"].map(convert_label)
test_df["label"]=test_df["label"].map(convert_label)
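
The cleaning steps in `clean_text` are commented out, so the text reaches the model unchanged. If you want to experiment with them, the following sketch applies similar steps to a single string using `re`. The helper name `clean_single_text` and the sample tweet are hypothetical, and whether cleaning helps a cased subword model is something to test rather than assume.

```python
import re

def clean_single_text(text):
    """Apply cleaning steps similar to the commented-out ones to one string."""
    text = re.sub(r"http\S+", "URL", text)                 # replace links with a placeholder
    text = text.replace("#", "").replace("@", "")          # drop hashtag and mention signs
    text = re.sub(r"[^A-Za-z0-9öäüÖÄÜß()!?]", " ", text)   # other characters become spaces
    text = re.sub(r"\s{2,}", " ", text)                    # collapse repeated whitespace
    return text.strip()

cleaned = clean_single_text("@user Schau mal: https://example.com  #Wahnsinn!!")
print(cleaned)
```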

In [10]:
# this is how our dataset looks now
train_df.head() 

Unnamed: 0,text,label
0,"@corinnamilborn Liebe Corinna, wir würden dich...",0
1,@Martin28a Sie haben ja auch Recht. Unser Twee...,0
2,@ahrens_theo fröhlicher gruß aus der schönsten...,0
3,@dushanwegner Amis hätten alles und jeden gewä...,0
4,@spdde kein verläßlicher Verhandlungspartner. ...,1


In [11]:
len(train_df.loc[train_df["label"]==1])

1688
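
The count above means the classes are imbalanced. As a small worked example, the share of offensive tweets and the accuracy of a trivial "always predict the majority class" baseline follow directly from the two numbers in the notebook (1688 offensive rows out of 5009). This is one reason to look at macro F1 and not only at accuracy.

```python
n_total = 5009      # rows in the training file (from the notebook)
n_offense = 1688    # rows labelled OFFENSE (from the cell above)

offense_share = n_offense / n_total
# A classifier that always predicts the larger class reaches this accuracy.
majority_baseline_acc = max(offense_share, 1 - offense_share)

print(f"OFFENSE share: {offense_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline_acc:.1%}")
```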

In [12]:
from sklearn.utils import shuffle
train_df = shuffle(train_df)

How many examples do we have in our train/validation/test sets?

In [13]:
print(f"Test examples \t {len(test_df) }")
print(f"Train examples \t {len(train_df[500:])}")
print(f"Valid examples \t {len(train_df[:500])}")

Test examples 	 3398
Train examples 	 4509
Valid examples 	 500


In the next step, we convert the data into a format that our ML library (Hugging Face `datasets`) can use.

In [97]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [98]:
# What is the shape of our dataset?
train_dataset

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 5009
})
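
Note that `train_dataset` above contains all 5009 rows, so the 500 validation examples counted earlier are not actually held out, and the `Trainer` further down evaluates on the test set. Here is a minimal sketch of keeping a separate validation split, assuming the same slicing as in the count above. `split_train_valid` is a hypothetical helper, and the list of integers only stands in for the shuffled `train_df`.

```python
def split_train_valid(rows, n_valid=500):
    """Hold out the first n_valid rows for validation; works on lists and DataFrames."""
    valid = rows[:n_valid]
    train = rows[n_valid:]
    return train, valid

# Stand-in for the shuffled train_df (5009 rows in the notebook).
rows = list(range(5009))
train_rows, valid_rows = split_train_valid(rows)
print(len(train_rows), len(valid_rows))
```

With the real data, each part could be wrapped with `Dataset.from_pandas(...)` and the validation part passed as `eval_dataset`, which leaves the test set for the final `trainer.predict` call.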

## Encoding of the data 

We convert our texts into tokens that our model can process.

In [75]:
from transformers import AutoTokenizer


# try out different models :) 

model_checkpoint ="distilbert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [76]:
!rm -rf ./test-offsive-language/checkpoint*

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [77]:
demo_tokens = tokenizer(["Mehr Daten führen oftmals zu besseren Ergebnissen.", "And this is a second sentence"],add_special_tokens=True, truncation=True)
demo_tokens

{'input_ids': [[101, 74658, 42635, 32092, 18496, 42488, 10304, 53978, 10136, 47419, 10115, 119, 102], [101, 12689, 10531, 10124, 169, 11132, 49219, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

In [78]:
tokenizer.convert_ids_to_tokens(demo_tokens['input_ids'][0])

['[CLS]',
 'Mehr',
 'Daten',
 'führen',
 'oft',
 '##mals',
 'zu',
 'besser',
 '##en',
 'Ergebnisse',
 '##n',
 '.',
 '[SEP]']
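
The `##` prefix marks a piece that continues the previous token, so `oft` + `##mals` is the word `oftmals`. As a small illustration, the following sketch glues the pieces from the output above back into words. `merge_wordpieces` is a hypothetical helper written for this example, not part of the tokenizer API.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Join '##' continuation pieces onto the preceding token and drop special tokens."""
    words = []
    for token in tokens:
        if token in special:
            continue
        if token.startswith("##") and words:
            words[-1] += token[2:]   # continuation piece: append without the marker
        else:
            words.append(token)      # start of a new word
    return words

tokens = ["[CLS]", "Mehr", "Daten", "führen", "oft", "##mals", "zu",
          "besser", "##en", "Ergebnisse", "##n", ".", "[SEP]"]
words = merge_wordpieces(tokens)
print(words)
```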

In [99]:
def example_tokenizer(examples):
    return tokenizer(examples["text"], truncation=True,padding=True)

In [100]:
encoded_train_dataset = train_dataset.map(example_tokenizer, batched=True)
encoded_test_dataset = test_dataset.map(example_tokenizer, batched=True)

Map: 100%|██████████| 5009/5009 [00:00<00:00, 10008.60 examples/s]
Map: 100%|██████████| 3398/3398 [00:00<00:00, 5326.26 examples/s]
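
`padding=True` pads every sequence to the length of the longest one in the batch it is given and marks the padded positions with 0 in the attention mask. The following pure-Python sketch reproduces that idea on the two token-id lists from the demo sentences above. `pad_batch` is a hypothetical helper, and the pad id 0 is assumed to match this tokenizer's `[PAD]` token.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 74658, 42635, 32092, 18496, 42488, 10304, 53978, 10136, 47419, 10115, 119, 102],
    [101, 12689, 10531, 10124, 169, 11132, 49219, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```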


## The training \o/

Now we can train our model. To do this, we need to define a number of settings (hyperparameters):

In [106]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

batch_size = 16

args = TrainingArguments(
    "test-offsive-language",
    evaluation_strategy = "steps",
    save_strategy= "steps",
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    eval_steps=0.2,
    save_steps=0.2,
    warmup_steps=50,
    logging_steps=10,
    load_best_model_at_end=True,
    overwrite_output_dir=True,
    metric_for_best_model="f1",
    save_total_limit=2,    
    bf16=True    
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
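
To see how these settings interact, here is a small worked calculation of the effective batch size and the number of optimizer steps. It assumes a single GPU and the 5009 training rows from above. The rounding rules are my reading of how the `Trainer` counts steps; they are consistent with the `global_step=156` and the evaluation steps 32, 64, 96, 128 logged in the training output.

```python
import math

n_train = 5009          # rows in encoded_train_dataset
per_device_batch = 16   # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
epochs = 2              # num_train_epochs
eval_ratio = 0.2        # eval_steps given as a fraction of all training steps

effective_batch = per_device_batch * grad_accum             # examples per optimizer step
batches_per_epoch = math.ceil(n_train / per_device_batch)   # forward/backward passes per epoch
updates_per_epoch = batches_per_epoch // grad_accum         # optimizer steps per epoch
total_steps = updates_per_epoch * epochs
eval_every = math.ceil(total_steps * eval_ratio)

print(effective_batch, batches_per_epoch, updates_per_epoch, total_steps, eval_every)
```

Since `warmup_steps=50`, roughly the first third of the 156 steps is spent ramping up the learning rate, which is worth keeping in mind when changing the number of epochs.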


In [93]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  f1 = f1_score(labels, preds, average="macro")
  acc = accuracy_score(labels, preds)
  return {"accuracy": acc, "f1": f1}
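
`average="macro"` means that F1 is computed per class and then averaged with equal weight, so the smaller OFFENSE class counts as much as OTHER. The following pure-Python sketch shows the calculation on a made-up toy example; `macro_f1` is a hypothetical helper for illustration, and in the notebook `sklearn.metrics.f1_score` does this work.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: class 0 gets F1 = 2/3, class 1 gets F1 = 1/2.
toy_labels = [0, 0, 0, 1, 1]
toy_preds = [0, 0, 1, 1, 0]
score = macro_f1(toy_labels, toy_preds)
print(round(score, 4))
```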

In [107]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_test_dataset,        
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [108]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,F1
32,0.6371,0.635779,0.662154,0.53982
64,0.5713,0.552119,0.722484,0.647282
96,0.458,0.621595,0.729253,0.629988
128,0.474,0.552303,0.745438,0.679744


TrainOutput(global_step=156, training_loss=0.5110889337001703, metrics={'train_runtime': 188.6745, 'train_samples_per_second': 53.097, 'train_steps_per_second': 0.827, 'total_flos': 587903510230116.0, 'train_loss': 0.5110889337001703, 'epoch': 1.9872611464968153})

In [None]:
#tensorboard --logdir runs
%load_ext tensorboard
#%reload_ext tensorboard
%tensorboard --logdir /content/test-offsive-language/runs

## Testing the model

The next step is to test the model with the provided test data.

In [96]:
result = trainer.predict(encoded_test_dataset)
result.metrics["test_f1"]

0.6942384090366289

In [87]:
import torch

#trainer.prediction_step(trainer.model,tokenizer("das ist ein test"),False)
trainer.model.cpu()  # move the model to the CPU for this small demo
#trainer.model.num_parameters()
encoded_texts = tokenizer(["du bist so dumm", "du bist toll"], padding=True, return_tensors="pt")
print(encoded_texts)
outputs = trainer.model(**encoded_texts)
# outputs.logits has shape (batch_size, num_labels); softmax turns each row into class probabilities
probabilities = torch.softmax(outputs.logits, dim=1)
print(probabilities)
class_label = torch.argmax(probabilities, dim=1)  # 0 = OTHER, 1 = OFFENSE
print(class_label)

{'input_ids': tensor([[  101, 10168, 10467, 10123, 10380, 54892, 10147,   102],
        [  101, 10168, 10467, 10123, 81754,   102,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0]])}
tensor([[0.5488, 0.4512],
        [0.8414, 0.1586]], grad_fn=<SoftmaxBackward0>)
tensor([0, 0])
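
The step from logits to a class label can be spelled out without torch. The sketch below is a minimal pure-Python version of softmax plus argmax, using the label mapping from `convert_label` (0 = OTHER, 1 = OFFENSE). The logit values and the `decode` helper are made up for illustration.

```python
import math

ID2LABEL = {0: "OTHER", 1: "OFFENSE"}   # same mapping as convert_label

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

label, prob = decode([-1.0, 2.0])   # hypothetical logits for one text
print(label, round(prob, 4))
```

A probability close to 0.5, such as the 0.5488 for the first text above, indicates that the model is nearly undecided for that input.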


How can we predict a single test example, and how long does it take on a CPU?

In [88]:
def predict(text):
    trainer.model.cpu()
    encoded_texts = tokenizer(text, return_tensors="pt")
    outputs = trainer.model(**encoded_texts)
    probabilities = torch.softmax(outputs.logits, dim=1)
    # argmax over the class dimension; 0 = OTHER, 1 = OFFENSE
    class_label = torch.argmax(probabilities, dim=1)
    return class_label

%timeit predict("du bist so toll")



18.5 ms ± 906 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
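
`%timeit` only works inside IPython or Jupyter. If you want a comparable measurement in a plain Python script, the standard `timeit` module offers similar functionality. The sketch below is one possible way to do this; `time_call` is a hypothetical helper, and `dummy_predict` only stands in for the `predict` function above.

```python
import statistics
import timeit

def time_call(fn, *args, repeat=7, number=10):
    """Return (mean, stdev) of seconds per call, similar in spirit to %timeit."""
    totals = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)

def dummy_predict(text):
    # Stand-in for predict(); it only does some cheap string work.
    return len(text.split())

mean_s, std_s = time_call(dummy_predict, "du bist so toll")
print(f"{mean_s * 1e3:.4f} ms ± {std_s * 1e3:.4f} ms per call")
```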


# Tutorial:

Our results are already a reasonable start (a macro F1 of about 0.69 on the test set), but they can still be improved. First, get familiar with the notebook: change a few parameters, such as the learning rate and the number of epochs, and see how they affect the results.

**Your task is to improve the classification score.**

Here are some ideas for improving the score:

* Test different models. The [Model Hub](https://huggingface.co/models) lists a number of German models with which you can improve the results. 

* With about 5,000 samples, the dataset is comparatively small for this problem. You may find additional datasets that you can add to the current training data.

* A number of multilingual models are available in the [Model Hub](https://huggingface.co/models). These models have been trained on several languages. You could also try adding English data to the German dataset to train a multilingual model, which may also perform better on the German data.

Data augmentation is a technique for creating new training examples by modifying existing ones. It is important that the meaning does not change (the class remains the same).

* You can replace words with synonyms and thus generate new examples. An example:

> "Can you still believe all this crap?" -> "Can you still believe all this garbage?"

* Everything is allowed here. Try translating texts from German to English and back to German. If the meaning is preserved, the result can also be used for training. A small example with Google Translate:

> German: "Kann man diesen ganzen Scheiß noch glauben?"

> English: "Can you still believe all this shit?"

> German: "Kannst du all diese Scheiße noch glauben?"
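
The synonym idea from the list above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written synonym table; `SYNONYMS` and `replace_synonyms` are hypothetical, and every generated sentence should be checked manually to confirm that the label still applies.

```python
import random

# Tiny hand-written synonym table (hypothetical; a real one needs manual review).
SYNONYMS = {"Scheiß": ["Mist", "Quatsch"]}

def replace_synonyms(text, synonyms, rng):
    """Replace each word found in the table with a randomly chosen synonym."""
    out = []
    for word in text.split():
        core = word.strip(".,!?\"")   # ignore surrounding punctuation for the lookup
        if core in synonyms:
            word = word.replace(core, rng.choice(synonyms[core]))
        out.append(word)
    return " ".join(out)

rng = random.Random(0)   # fixed seed so runs are repeatable
augmented = replace_synonyms("Kann man diesen ganzen Scheiß noch glauben?", SYNONYMS, rng)
print(augmented)
```

The augmented sentence keeps the original label, so it could be appended to the training data as an additional row.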


