# Assignment 7
## GermEval 2018 shared task: identification of offensive language

GermEval task: https://projects.fzai.h-da.de/iggsa/germeval-2018/

Assignment: https://snlp2018.github.io/assignments.html

### Task 1
The goal of Task 1 is to detect offensive language in social media posts in German: tweets have to be classified as either `OFFENSE` or `OTHER`.

Data: https://github.com/uds-lsv/GermEval-2018-Data

The file `germeval2018.training.txt` is a tab-separated list of labelled tweets; this is the development set to be used in training and tuning of model(s). The file `germeval2018.test.txt` contains the "gold-standard" data against which the final model should be (and has been, in the actual shared task) evaluated.

#### The data
First of all, let's read and take a quick look at the data:

In [1]:
import pandas as pd

In [51]:
df = pd.read_csv("data/germeval2018.training.txt", sep = "\t", encoding = "utf-8", header = None)

In [52]:
df.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | @corinnamilborn Liebe Corinna, wir würden dich... | OTHER | OTHER |
| 1 | @Martin28a Sie haben ja auch Recht. Unser Twee... | OTHER | OTHER |
| 2 | @ahrens_theo fröhlicher gruß aus der schönsten... | OTHER | OTHER |
| 3 | @dushanwegner Amis hätten alles und jeden gewä... | OTHER | OTHER |
| 4 | @spdde kein verläßlicher Verhandlungspartner. ... | OFFENSE | INSULT |


We can observe that the first column contains the tweet and the second column the label `OFFENSE` or `OTHER`; the third column contains a finer-grained labelling which we are not interested in at the moment. Let's drop the latter and give the other two columns user-friendly names:

In [53]:
df.drop(df.columns[2], axis = 1, inplace = True)
df.columns = ["tweet", "category"]

In [54]:
df.head()

| | tweet | category |
|---|---|---|
| 0 | @corinnamilborn Liebe Corinna, wir würden dich... | OTHER |
| 1 | @Martin28a Sie haben ja auch Recht. Unser Twee... | OTHER |
| 2 | @ahrens_theo fröhlicher gruß aus der schönsten... | OTHER |
| 3 | @dushanwegner Amis hätten alles und jeden gewä... | OTHER |
| 4 | @spdde kein verläßlicher Verhandlungspartner. ... | OFFENSE |


Is the data balanced? Let's find out:

In [55]:
df.groupby("category").count()

| category | tweet |
|---|---|
| OFFENSE | 1688 |
| OTHER | 3321 |


Not quite: there are about twice as many tweets labelled `OTHER` as tweets labelled `OFFENSE` (3321 vs. 1688). Is this a problem? It can be, so 1) we won't rely on accuracy but look at precision and recall instead, and 2) we might need to weight the loss function(s) of our model(s). We'll come back to this later.
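To make the two points concrete, here is a small sketch that uses only the class counts from the table above. It computes the accuracy of a trivial classifier that always answers `OTHER`, and one common inverse-frequency weighting of the classes. The variable names are illustrative. The weighting formula is the same heuristic that scikit-learn documents for `class_weight="balanced"`; whether it actually helps on this data is an open question at this point.

```python
# Class counts taken from the groupby output above.
counts = {"OFFENSE": 1688, "OTHER": 3321}
n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a classifier that always predicts the majority class.
majority_accuracy = max(counts.values()) / n_samples

# Inverse-frequency weights: n_samples / (n_classes * class_count).
# Rare classes get weights above 1, frequent classes below 1.
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"majority-class accuracy: {majority_accuracy:.3f}")
for label, weight in class_weight.items():
    print(f"weight for {label}: {weight:.3f}")
```

A model that never detects a single offensive tweet would therefore already score well above 50% accuracy, which is why accuracy alone is a poor yardstick here.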

#### Preprocessing

What do the tweets look like?

In [56]:
import random
print(random.sample(df["tweet"].to_list(), 10))

['#SPD für Harz4 |LBR| #AfD für Mittelschicht |LBR| #cdu für selbsternannte Eliten ARD/ZDF |LBR| #rest für die Tonne', 'Merkel redet grad mit Kennerblick von |LBR| KÜNSTLICHER INTELLIGENZ |LBR| Werte Bunteskanzlerin, |LBR| wie wäre es denn zur Abwechslung mal mit |LBR| MENSCHLICHER INTELLIGENZ |LBR| ??', 'Harpyie Murksel drückt "Verbundenheit" für Manchester aus &gt; Wie falsch muss man eigentlich sein, wenn man d. Terror Tür u. Tor öffnete 🤢🤢🤢', '@NoBiTwt Der Abschaum der Menschheit bekommt deutsche Pässe! Diese schuldigen Behörden würden in einem Rechtsstaat verklagt!', '@jot_el Nein, das bestimmen die Idioten in den Konzernen! Aldi macht z.B. nicht mit!', '@StephanJBauer @soskinderdorf Auch in Deutschland hungern Kinder.', '@HolgerEwald1 @Opposition24de Ab sofort sollten Hammer verboten werden! 😜', '@focusonline Die FDP hat doch Eier in der Hose.', '@albastone72 @Ralf_Stegner Stimmt...das solche erbärmliche Kommentare überhaupt notwendig sind verdanken wir ja Stegner und Schulz...gl

As we'd expect, there are Twitter user handles (`@...`), hashtags (`#...`) and emojis, besides more or less usual punctuation. As a first step, we can remove the handles (Twitter usernames are typically not German words) but keep hashtags and emojis.

In [57]:
import re

For example:

In [60]:
string = "@handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4."

In [61]:
tmp = re.sub(r"\@\w+","", string)
print(tmp)

some other words or #hashtags  and finally the end .


Then we can remove leftover multiple blank spaces:

In [62]:
tmp1 = re.sub(r"\s+", " ", tmp)
print(tmp1)

some other words or #hashtags and finally the end .


Let's write a simple function to do this cleaning and apply it to the `tweet` column:

In [63]:
def tweet_cleaner(string):
    # remove @handles, then collapse runs of whitespace into a single space
    clean_string = re.sub(r"\s+", " ", re.sub(r"\@\w+", "", string))
    return clean_string

In [64]:
print("Before: " + string)
print("After: " + tweet_cleaner(string))

Before: @handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4.
After:  some other words or #hashtags and finally the end .
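The output above still starts with a space, and the sample tweets also contain `|LBR|` tokens and HTML entities such as `&gt;`. The following is a possible stricter variant, not the cleaner used in the rest of this notebook. The name `tweet_cleaner_extended` is hypothetical, and treating `|LBR|` as a line-break placeholder is an assumption based on how it appears in the samples.

```python
import html
import re

HANDLE_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def tweet_cleaner_extended(text):
    """Hypothetical stricter cleaner: handles, |LBR|, HTML entities, whitespace."""
    text = HANDLE_RE.sub("", text)        # drop @handles
    text = text.replace("|LBR|", " ")     # assumed line-break placeholder
    text = html.unescape(text)            # e.g. "&gt;" becomes ">"
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse and trim whitespace


example = "@handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4."
print(tweet_cleaner_extended(example))
print(tweet_cleaner_extended("Merkel redet |LBR| von &gt; INTELLIGENZ"))
```

Hashtags and emojis are left untouched, in line with the decision above.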


In [65]:
df["clean_tweet"] = df["tweet"].apply(tweet_cleaner)  # clean every tweet

In [67]:
df.head()

| | tweet | category | clean_tweet |
|---|---|---|---|
| 0 | @corinnamilborn Liebe Corinna, wir würden dich... | OTHER | Liebe Corinna, wir würden dich gerne als Mode... |
| 1 | @Martin28a Sie haben ja auch Recht. Unser Twee... | OTHER | Sie haben ja auch Recht. Unser Tweet war etwa... |
| 2 | @ahrens_theo fröhlicher gruß aus der schönsten... | OTHER | fröhlicher gruß aus der schönsten stadt der w... |
| 3 | @dushanwegner Amis hätten alles und jeden gewä... | OTHER | Amis hätten alles und jeden gewählt...nur Hil... |
| 4 | @spdde kein verläßlicher Verhandlungspartner. ... | OFFENSE | kein verläßlicher Verhandlungspartner. Nachka... |


#### Baseline models

Let's start quick and easy and see how vanilla logistic regression, naive Bayes and support vector machines behave with `tf-idf` features, using `sklearn`'s pipelines.

First, we split the `df` into training and testing data:

In [73]:
from sklearn.model_selection import train_test_split

In [75]:
X = df.clean_tweet # features are the cleaned tweets
y = df.category # labels are the categories
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42) # split
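The split above is random and not stratified, so the share of `OFFENSE` tweets in the two parts is only approximately equal. `train_test_split` accepts a `stratify` argument for this purpose. The idea behind it can be sketched with the standard library alone. The function `stratified_split` and the toy labels below are illustrative assumptions, not part of the assignment code.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx), sampling the same fraction from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with roughly the 1:2 imbalance of the data set (not the real tweets).
labels = ["OFFENSE"] * 20 + ["OTHER"] * 40
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Both parts keep the 1:2 ratio of the toy labels, which makes metrics computed on the held-out part easier to compare across runs.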

Next, we define a pipeline consisting of a vectorizer, a tf-idf transformer and a logistic regression model:

In [76]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

In [77]:
logreg = Pipeline([('vect', CountVectorizer()), # transform text into matrix of token counts
                   ('tfidf', TfidfTransformer()), # from token counts to normalized tf-idf
                   ('clf', LogisticRegression()), # logistic regression classifier
                  ])
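To see what the `vect` and `tfidf` steps compute, here is a minimal pure-Python sketch of tf-idf weighting. It follows the formula documented for `TfidfTransformer` with its default settings, that is, smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. Tokenization is simplified to whitespace splitting, and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Smoothed tf-idf with L2-normalized rows for pre-tokenized documents."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    rows = []
    for doc in docs:
        counts = Counter(doc)                                # raw term counts
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows


# Toy documents, not taken from the data set.
docs = [
    "merkel redet von intelligenz".split(),
    "merkel redet".split(),
    "die idioten in den konzernen".split(),
]
idf, rows = tfidf(docs)
print(idf["merkel"], idf["idioten"])
print(rows[1])
```

Terms that occur in many documents (here `merkel`) receive a lower idf than terms confined to a single document (here `idioten`), so rarer words weigh more in the feature vectors passed to the classifier.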

In [78]:
logreg.fit(X_train, y_train) # training
y_pred = logreg.predict(X_test) # predict labels of test data

We can quickly explore the model's performance with the help of `sklearn`'s built-in metrics:

In [80]:
from sklearn.metrics import classification_report, accuracy_score

In [83]:
# print report
print(classification_report(y_test, y_pred, target_names = ["OFFENSE", "OTHER"]))

              precision    recall  f1-score   support

     OFFENSE       0.76      0.27      0.40       497
       OTHER       0.73      0.96      0.83      1006

    accuracy                           0.73      1503
   macro avg       0.74      0.61      0.61      1503
weighted avg       0.74      0.73      0.69      1503
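As a sanity check on how the columns of this report relate to each other, the sketch below recomputes the F1 scores, the macro and weighted averages, and the accuracy from the per-class precision, recall and support printed above. Because the inputs are already rounded to two decimals, the results only agree with the report up to rounding.

```python
# Per-class values copied from the report above (rounded to two decimals).
report = {
    "OFFENSE": {"precision": 0.76, "recall": 0.27, "support": 497},
    "OTHER": {"precision": 0.73, "recall": 0.96, "support": 1006},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {c: f1(r["precision"], r["recall"]) for c, r in report.items()}
total = sum(r["support"] for r in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1_scores[c] * report[c]["support"] for c in report) / total
# Accuracy: correctly classified tweets (recall * support per class) over all tweets.
accuracy = sum(r["recall"] * r["support"] for r in report.values()) / total

for name, value in [("macro F1", macro_f1), ("weighted F1", weighted_f1), ("accuracy", accuracy)]:
    print(f"{name}: {value:.2f}")
```

The low recall for `OFFENSE` pulls the macro F1 well below the accuracy: the baseline misses most offensive tweets, which the accuracy figure alone would hide.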

