<a href="https://colab.research.google.com/github/oliverguhr/htw-nlp-lecture/blob/master/assignments/transformer/nlp_2_transformer_offensive_language_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Offensive Language Classification


In [None]:
!pip install datasets transformers accelerate sentencepiece

In [None]:
!mkdir -p data
!wget -c https://raw.githubusercontent.com/oliverguhr/htw-nlp-lecture/master/assignments/germeval2018.training.txt -O data/germeval2018.train.txt
!wget -c https://raw.githubusercontent.com/oliverguhr/htw-nlp-lecture/master/assignments/germeval2018.test.txt -O data/germeval2018.test.txt

In [None]:
import pandas as pd
import numpy as np

In [None]:
# check if we have a GPU
!nvidia-smi

## Preparing the data

In the next step we load the data and adjust it a bit. The data is available as a tab-delimited CSV file. Pandas is a good choice for simple processing, but the same could be done with Python's built-in tools.

In [None]:
test_df = pd.read_csv("./data/germeval2018.test.txt", sep='\t', header=0,encoding="utf-8")
train_df = pd.read_csv("./data/germeval2018.train.txt", sep='\t', header=0,encoding="utf-8")

In [None]:
train_df.head()

In [None]:
# We do not need the label2 column, so we can drop it.
test_df.drop(columns=['label2'], inplace=True)
train_df.drop(columns=['label2'], inplace=True)

In [None]:
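def clean_text(text):
    # Optional cleaning steps on a pandas Series; all disabled by default.
    #text = text.str.lower() # lowercase
    #text = text.str.replace("#", "", regex=False) # removes the hashtag symbol
    #text = text.str.replace(r"http\S+", "URL", regex=True)  # replaces URLs with a placeholder
    #text = text.str.replace("@", "", regex=False) # removes the mention symbol
    #text = text.str.replace(r"[^A-Za-z0-9öäüÖÄÜß()!?]", " ", regex=True)
    #text = text.str.replace(r"\s{2,}", " ", regex=True) # collapses repeated whitespace
    return text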

def convert_label(label):
    return 1 if label == "OFFENSE" else 0
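
The commented-out steps in `clean_text` operate on a whole pandas Series. To illustrate what they would do to a single tweet, here is a minimal sketch using only the standard `re` module. `clean_single_text` is a hypothetical helper and the example tweet is made up. Lowercasing is left out because the notebook uses a cased model, and the final `strip()` is an addition that is not part of the original steps.

```python
import re

def clean_single_text(text: str) -> str:
    """Apply the optional cleaning steps from clean_text to one string."""
    text = text.replace("#", "")                           # drop the hashtag symbol, keep the word
    text = re.sub(r"http\S+", "URL", text)                 # replace URLs with a placeholder
    text = text.replace("@", "")                           # drop the mention symbol, keep the name
    text = re.sub(r"[^A-Za-z0-9öäüÖÄÜß()!?]", " ", text)   # keep letters, digits and some punctuation
    text = re.sub(r"\s{2,}", " ", text)                    # collapse repeated whitespace
    return text.strip()                                    # assumption: also trim the ends

example = "@user Das ist   #toll! https://example.org/abc"
print(clean_single_text(example))
```

Whether such cleaning helps is an empirical question: pretrained tokenizers already cope with hashtags and mentions, so it is worth comparing scores with and without it.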

In [None]:
train_df["text"]=clean_text(train_df["text"])
test_df["text"]=clean_text(test_df["text"])
train_df["label"]=train_df["label"].map(convert_label)
test_df["label"]=test_df["label"].map(convert_label)

In [None]:
# This is how our data set looks now
train_df.head()

In [None]:
len(train_df.loc[train_df["label"]==1])

In [None]:
from sklearn.utils import shuffle
train_df = shuffle(train_df, random_state=42)  # fixed seed so the shuffle is reproducible

How many examples do we have in our train/validation/test sets?

In [None]:
print(f"Test examples \t {len(test_df) }")
print(f"Train examples \t {len(train_df[500:])}")
print(f"Valid examples \t {len(train_df[:500])}")
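
The slices above are computed only for counting and are not stored anywhere. A minimal sketch of how such a hold-out split could be materialized is shown below. It uses plain Python lists so that it runs without the notebook's data; `holdout_split` and the toy rows are hypothetical names introduced for illustration.

```python
import random

def holdout_split(rows, n_valid, seed=42):
    """Shuffle a copy of rows with a fixed seed, then slice off n_valid rows."""
    shuffled = list(rows)                   # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    return shuffled[n_valid:], shuffled[:n_valid]

# Made-up rows standing in for the tweets
rows = [{"text": f"tweet {i}", "label": i % 2} for i in range(10)]
train_rows, valid_rows = holdout_split(rows, n_valid=3)
print(len(train_rows), len(valid_rows))
```

With the already shuffled DataFrame, the pandas equivalent would be along the lines of `valid_df, train_split_df = train_df[:500], train_df[500:]`, where both variable names are again assumptions.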

In the next step we convert the data into a format that our ML library can use.

In [None]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [None]:
# What is the shape of our dataset?
train_dataset

## Encoding of the data 

We convert our texts into tokens that our model can process.

In [None]:
from transformers import AutoTokenizer


# try out different models :) 

model_checkpoint ="distilbert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
!rm -rf ./test-offsive-language/checkpoint*

In [None]:
demo_tokens = tokenizer(["Mehr Daten führen oftmals zu besseren Ergebnissen.", "And this is a second sentence"],add_special_tokens=True, truncation=True)
demo_tokens

In [None]:
tokenizer.convert_ids_to_tokens(demo_tokens['input_ids'][0])
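
DistilBERT tokenizers split words into subword pieces with the WordPiece scheme. The following is a rough sketch of the greedy longest-match-first idea behind it, assuming a tiny hand-written vocabulary. `wordpiece_tokenize` and `toy_vocab` are hypothetical; the real tokenizer also normalizes and pre-splits the text and uses a far larger vocabulary, so its actual output will differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, made up for this example
toy_vocab = {"Ergebnis", "##sen", "##se", "Daten", "besser", "##en"}
print(wordpiece_tokenize("Ergebnissen", toy_vocab))
print(wordpiece_tokenize("besseren", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover arbitrary input text.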

In [None]:
def example_tokenizer(examples):
    return tokenizer(examples["text"], truncation=True,padding=True)

In [None]:
encoded_train_dataset = train_dataset.map(example_tokenizer, batched=True)
encoded_test_dataset = test_dataset.map(example_tokenizer, batched=True)

## The training \o/

Now we can train our model. To do this, we need to define a number of settings (hyperparameters):

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

batch_size = 16

args = TrainingArguments(
    "test-offsive-language",
    evaluation_strategy="steps",  # newer transformers releases call this eval_strategy
    save_strategy="steps",
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size*4,
    gradient_accumulation_steps=4,  # effective train batch size: batch_size * 4
    num_train_epochs=2,
    eval_steps=0.2,  # a float below 1 is read as a fraction of all training steps
    save_steps=0.2,
    warmup_steps=50,
    logging_steps=10,
    load_best_model_at_end=True,
    overwrite_output_dir=True,
    metric_for_best_model="f1",
    save_total_limit=2,
    bf16=True,  # needs a GPU with bfloat16 support; disable it otherwise
)
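
To get a feeling for how these settings interact, the sketch below estimates the number of optimizer steps and the evaluation interval for a single device. The figure of 5000 training examples is an approximation and `training_schedule` is a hypothetical helper; the exact step count can differ slightly between `transformers` versions.

```python
import math

def training_schedule(num_examples, batch_size, grad_accum, epochs, eval_ratio):
    """Estimate optimizer steps and the evaluation interval for one device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update happens every grad_accum batches
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = math.ceil(epochs * updates_per_epoch)
    # A ratio below 1 is turned into an absolute step interval
    eval_every = math.ceil(total_steps * eval_ratio)
    return {
        "effective_batch_size": batch_size * grad_accum,
        "total_steps": total_steps,
        "eval_every": eval_every,
    }

# Values taken from the TrainingArguments above; 5000 is an assumed data set size
print(training_schedule(num_examples=5000, batch_size=16, grad_accum=4, epochs=2, eval_ratio=0.2))
```

Under these assumptions the run has only around 156 optimizer steps, so `warmup_steps=50` covers roughly a third of training. That is worth keeping in mind when changing the number of epochs or the batch size.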

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="macro")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
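
Macro F1 is the unweighted mean of the per-class F1 scores, which makes it stricter than accuracy when the classes are imbalanced. The sketch below computes it by hand on made-up labels to show the difference; `macro_f1` is a hypothetical helper, not the scikit-learn implementation.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up example: 6 harmless tweets (0), 4 offensive ones (1)
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
always_other = [0] * 10  # a useless model that never predicts "offensive"
accuracy = sum(y == p for y, p in zip(labels, always_other)) / len(labels)
print(accuracy, macro_f1(labels, always_other))
```

The useless model still reaches 60% accuracy, but its macro F1 is only 0.375 because the offensive class contributes an F1 of zero.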

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_test_dataset,  # note: the test set also drives best-model selection here
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
#tensorboard --logdir runs
%load_ext tensorboard
#%reload_ext tensorboard
%tensorboard --logdir /content/test-offsive-language/runs

## Testing the model

The next step is to test the model with the provided test data.

In [None]:
result = trainer.predict(encoded_test_dataset)
result.metrics["test_f1"]

In [None]:
import torch

#trainer.prediction_step(trainer.model,tokenizer("das ist ein test"),False)
trainer.model.cpu()
#trainer.model.num_parameters()
encoded_texts = tokenizer(["du bist so dumm", "du bist toll"], padding=True, return_tensors="pt")
print(encoded_texts)
with torch.no_grad():  # inference only, no gradients needed
    outputs = trainer.model(**encoded_texts)
probabilities = torch.softmax(outputs.logits, dim=1)
print(probabilities)
class_label = torch.argmax(probabilities, dim=1)
print(class_label)
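
The model returns one raw score (logit) per class and text. Softmax turns each row of logits into probabilities, and the argmax picks the predicted class. A minimal sketch in plain Python, using made-up logits rather than real model output:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits, one row per text: [score for class 0, score for class 1]
hypothetical_logits = [[-1.2, 2.3], [1.8, -0.7]]
for row in hypothetical_logits:
    probs = softmax(row)
    label = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    print([round(p, 3) for p in probs], label)
```

Since softmax preserves the ordering of the logits, the argmax could be taken on the logits directly; the probabilities are mainly useful for inspecting how confident the model is.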

How can we predict a single test example, and how long does it take on a CPU?

In [None]:
def predict(text):
    trainer.model.cpu()
    encoded_texts = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = trainer.model(**encoded_texts)
    probabilities = torch.softmax(outputs.logits, dim=1)
    class_label = torch.argmax(probabilities, dim=1)
    return class_label.item()  # plain Python int: 1 = offensive, 0 = other

%timeit predict("du bist so toll")



# Tutorial:

Our results are already quite good, but there is still room for improvement. First, get familiar with the notebook: change a few parameters, such as the learning rate and the number of epochs, and see how they affect the results.

**Your task is to improve the classification score.**

Here are some ideas how you can improve the score.

* Test different models. The [Model Hub](https://huggingface.co/models) lists a number of German models with which you can improve the results. 

* About 5000 samples are comparatively few for this problem. You may find additional data sets that you can add to the current training data.

* A number of multilingual models are available in the [Model Hub](https://huggingface.co/models). These models have been trained on several languages. You could also try adding English data to the German data set to train a multilingual model, which may also perform better on the German data.

Data augmentation is a technique for creating new training examples by modifying existing ones. It is important that the meaning does not change, so that the class remains the same.

* You can replace words with synonyms and thus generate new examples. An example:

> "Can you still believe all this crap?" -> "Can you still believe all this rubbish?"

* Everything is allowed here. Try translating texts from German to English and back to German. If the meaning is preserved, the result can also be used for training. A small example with Google Translate:

> German: "Kann man diesen ganzen Scheiß noch glauben?" 

> English: "Can you still believe all this shit?"

> German: "Kannst du all diese Scheiße noch glauben?"
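
The synonym replacement idea from the list above could be sketched as follows. The synonym table is hand-written and hypothetical; a real project would need a proper German synonym resource and a check that the meaning, and therefore the label, is preserved.

```python
import random

# Hypothetical, hand-written synonym table for illustration only
SYNONYMS = {
    "Scheiß": ["Mist", "Quatsch"],
}

def augment_with_synonyms(text, synonyms, rng):
    """Replace each word that has an entry in the table with a random synonym."""
    out = []
    for word in text.split():
        core = word.rstrip("?!.,")   # separate trailing punctuation from the word
        tail = word[len(core):]
        if core in synonyms:
            core = rng.choice(synonyms[core])
        out.append(core + tail)
    return " ".join(out)

rng = random.Random(0)  # seeded so the augmentation is reproducible
original = "Kann man diesen ganzen Scheiß noch glauben?"
augmented = augment_with_synonyms(original, SYNONYMS, rng)
print(augmented)
```

The augmented sentence keeps the label of the original one, so it can be appended to the training data as an additional example.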


