# Assignment 7
## GermEval 2018 shared task: identification of offensive language

GermEval task: https://projects.fzai.h-da.de/iggsa/germeval-2018/

Assignment: https://snlp2018.github.io/assignments.html

### Task 1
The goal of Task 1 is to detect offensive language in social media posts in German: tweets have to be classified as either `OFFENSE` or `OTHER`.

Data: https://github.com/uds-lsv/GermEval-2018-Data

The file `germeval2018.training.txt` is a tab-separated list of labelled tweets; this is the development set to be used in training and tuning of model(s). The file `germeval2018.test.txt` contains the "gold-standard" data against which the final model should be (and has been, in the actual shared task) evaluated.

#### The data
First of all, let's read and take a quick look at the data:

In [30]:
import pandas as pd

In [31]:
df = pd.read_csv("data/germeval2018.training.txt", sep = "\t", encoding = "utf-8", header = None)

In [32]:
df.head()

Unnamed: 0,0,1,2
0,"@corinnamilborn Liebe Corinna, wir würden dich...",OTHER,OTHER
1,@Martin28a Sie haben ja auch Recht. Unser Twee...,OTHER,OTHER
2,@ahrens_theo fröhlicher gruß aus der schönsten...,OTHER,OTHER
3,@dushanwegner Amis hätten alles und jeden gewä...,OTHER,OTHER
4,@spdde kein verläßlicher Verhandlungspartner. ...,OFFENSE,INSULT


We can observe: first column contains tweet and second column the labels `OFFENSE` or `OTHER`; third column contains finer-grained labeling which we are not interesed in at the moment. Let's drop the latter and give the other two columns user-friendly names:

In [33]:
df.drop(df.columns[2], axis = 1, inplace = True)
df.columns = ["tweet", "category"]

In [34]:
df.head()

Unnamed: 0,tweet,category
0,"@corinnamilborn Liebe Corinna, wir würden dich...",OTHER
1,@Martin28a Sie haben ja auch Recht. Unser Twee...,OTHER
2,@ahrens_theo fröhlicher gruß aus der schönsten...,OTHER
3,@dushanwegner Amis hätten alles und jeden gewä...,OTHER
4,@spdde kein verläßlicher Verhandlungspartner. ...,OFFENSE


Is the data balanced? Let's find out:

In [35]:
df.groupby("category").count()

Unnamed: 0_level_0,tweet
category,Unnamed: 1_level_1
OFFENSE,1688
OTHER,3321


Not quite, there are twice as many tweets labelled as `OTHER` than tweets labelled as `OFFENSE`. Is this a problem? It can be; so: 1) we won't use accuracy but precision and recall metrics and 2) we might need to weight the loss function(s) of our model(s). We'll come back to this later.

#### Preprocessing

What do the tweets look like?

In [36]:
import random
print(random.sample(df["tweet"].to_list(), 10))

['@NeudF Hast Du nichts anderes drauf?', 'User die melden sind einfach Schmutz', '@IQ_Stimulator @Boelscheline Andere mögen es ignorieren.....wir nicht ..!!', "Gleich geht's los! Wir drücken die Daumen für @imriziv, der in Kürze als 18. für #Israel an den Start geht. #Eurovision #ESC2017 🇮🇱🇮🇱🇮🇱", '@ShakRiet @correctiv_org Das wurde in der DDR und BRD getan &gt; BIS HEUTE!', '@Christo75554365 Doch das hat mit dem religiösen Weltbild zu tun, nachdem der Körper einer Frau per se etwas Sündiges ist. Aber für jeden vernünftig denkenden Zeitgenossen ist klar: der Körper ist neutral. Nur Gedanken können schmutzig sein.', '#Mallorca, einheimische Arbeitslose bekommen nichts. Wirtschaftsmigranten bekommen Kindergeld und Sozialhilfe.', '@EstoyLimpia @KumaAndrea @Moni4950 @loriotfehlt Ja. Aber eine Islamisierung findet nicht statt. Wenn wir alle brav Blockflöte spielen 😘', 'WIE WÄRE ES MAL MIT GENERALVERDACHT GEGENÜBER MUSLIMISCHISLAMISTISCHER EROBERER ??? Jeden Tag mehrere Kapitalverbrechen von 

As we'd expect, there are Twitter handles to users (`@...`), hashtags (`#...`) and emojis, beside more or less usual punctuation. For a starter, we can perhaps remove the handles (Twitter's usernames are typically not German words) but keep hashtags and emojis.

In [37]:
import re

For example:

In [38]:
string = "@handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4."

In [39]:
tmp = re.sub(r"\@\w+","", string)
print(tmp)

some other words or #hashtags  and finally the end .


Then we can remove leftover multiple blank spaces:

In [40]:
tmp1 = re.sub(r"\s+", " ", tmp)
print(tmp1)

some other words or #hashtags and finally the end .


Let's a write a simple function to do this cleaning and apply it to the `tweet` column:

In [41]:
def tweet_cleaner(string):
    clean_string = re.sub(r"\s+", " ", re.sub(r"\@\w+","", string))
    return(clean_string)

In [42]:
print("Before: " + string)
print("After: " + tweet_cleaner(string))

Before: @handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4.
After:  some other words or #hashtags and finally the end .


In [43]:
df.loc[:, "clean_tweet"] = df.apply(lambda row : tweet_cleaner(row.tweet), axis = 1)

In [44]:
df.head()

Unnamed: 0,tweet,category,clean_tweet
0,"@corinnamilborn Liebe Corinna, wir würden dich...",OTHER,"Liebe Corinna, wir würden dich gerne als Mode..."
1,@Martin28a Sie haben ja auch Recht. Unser Twee...,OTHER,Sie haben ja auch Recht. Unser Tweet war etwa...
2,@ahrens_theo fröhlicher gruß aus der schönsten...,OTHER,fröhlicher gruß aus der schönsten stadt der w...
3,@dushanwegner Amis hätten alles und jeden gewä...,OTHER,Amis hätten alles und jeden gewählt...nur Hil...
4,@spdde kein verläßlicher Verhandlungspartner. ...,OFFENSE,kein verläßlicher Verhandlungspartner. Nachka...


#### Baseline models

Let's start quick and easy: let's see how vanilla Logistic Regression, Naive Bayes, Support Vector Machines behave with `tf-idf` features, using `sklearn`'s pipelines.

First, we split the `df` into training and testing data:

In [45]:
from sklearn.model_selection import train_test_split

In [253]:
X = df.clean_tweet # features are the cleaned tweets
y = df.category # labels are the categories
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42) # split

As mentioned above, let's also compute sample weights in order to "cure" unbalanced target labels:

In [254]:
from sklearn.utils import class_weight

In [255]:
weights = class_weight.compute_sample_weight(class_weight = "balanced", y = df.category) # weights for the full dataset

Next, define a pipeline with vectorizer, tf-idf encoder and logistic regression model:

In [250]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

In [311]:
logreg = Pipeline([('vect', CountVectorizer()), # transform text into matrix of token counts
                   ('tfidf', TfidfTransformer()), # from token counts to normalized tf-idf
                   ('clf', LogisticRegression()), # logistic regression classifier
                  ])

In [312]:
logreg.fit(X_train, y_train, clf__sample_weight = weights[0 : len(y_train)]) # training, with weighted samples
y_pred = logreg.predict(X_test) # predict labels of test data

We can quickly explore model's performance with the help of `sklearn`'s built-in metrics:

In [313]:
from sklearn.metrics import classification_report, accuracy_score

In [314]:
# print report
print(classification_report(y_test, y_pred))

precision    recall  f1-score   support

     OFFENSE       0.55      0.65      0.60       497
       OTHER       0.81      0.74      0.77      1006

    accuracy                           0.71      1503
   macro avg       0.68      0.70      0.69      1503
weighted avg       0.73      0.71      0.71      1503



Let's try with word-level n-gram `tf-idf`:

In [356]:
logreg = Pipeline([('vect', CountVectorizer(analyzer = "word", token_pattern = r"\w{1,}", ngram_range = (2,3))), # ngrams counts
                   ('tfidf', TfidfTransformer()), # normalized tf-idf
                   ('clf', LogisticRegression()), # logistic regression
                  ])

In [357]:
logreg.fit(X_train, y_train, clf__sample_weight = weights[0 : len(y_train)]) # training, with sample weights
y_pred = logreg.predict(X_test) # predict labels of test data

In [358]:
print(classification_report(y_test, y_pred))

precision    recall  f1-score   support

     OFFENSE       0.54      0.37      0.44       497
       OTHER       0.73      0.85      0.78      1006

    accuracy                           0.69      1503
   macro avg       0.64      0.61      0.61      1503
weighted avg       0.67      0.69      0.67      1503



Let's try with n-gram character-level `tf-idf`:

In [385]:
logreg = Pipeline([('vect', CountVectorizer(analyzer = "char", token_pattern = r"\w{1,}", ngram_range = (2,3))), # char-level ngrams counts
                   ('tfidf', TfidfTransformer()), # normalized tf-idf
                   ('clf', LogisticRegression()), # logistic regression
                  ])

In [386]:
logreg.fit(X_train, y_train, clf__sample_weight = weights[0 : len(y_train)]) # training, with sample weights
y_pred = logreg.predict(X_test) # predict labels of test data

In [387]:
print(classification_report(y_test, y_pred))

precision    recall  f1-score   support

     OFFENSE       0.60      0.71      0.65       497
       OTHER       0.84      0.77      0.80      1006

    accuracy                           0.75      1503
   macro avg       0.72      0.74      0.73      1503
weighted avg       0.76      0.75      0.75      1503



Slightly better!

How about Naive Bayes classifier?

In [298]:
from sklearn.naive_bayes import MultinomialNB

In [341]:
nb = Pipeline([('vect', CountVectorizer()), # token counts
               ('tfidf', TfidfTransformer()), # normalized tf-idf
               ('clf', MultinomialNB()), # naive bayes classifier
               ])

In [342]:
nb.fit(X_train, y_train, clf__sample_weight = weights[0 : len(y_train)]) # training
y_pred = nb.predict(X_test) # predict labels of test data

In [343]:
print(classification_report(y_test, y_pred))

precision    recall  f1-score   support

     OFFENSE       0.55      0.68      0.61       497
       OTHER       0.82      0.72      0.77      1006

    accuracy                           0.71      1503
   macro avg       0.69      0.70      0.69      1503
weighted avg       0.73      0.71      0.72      1503



Finally, let's try Naive Bayes with different features:

In [392]:
nb = Pipeline([('vect', CountVectorizer(analyzer = "char", token_pattern = r"\w{1,}", ngram_range = (2,3))), # char-level ngrams counts
               ('tfidf', TfidfTransformer()), # normalized tf-idf
               ('clf', MultinomialNB()), # naive bayes classifier
               ])

In [393]:
nb.fit(X_train, y_train, clf__sample_weight = weights[0 : len(y_train)]) # training
y_pred = nb.predict(X_test) # predict labels of test data

In [394]:
print(classification_report(y_test, y_pred))

precision    recall  f1-score   support

     OFFENSE       0.51      0.87      0.64       497
       OTHER       0.90      0.59      0.71      1006

    accuracy                           0.68      1503
   macro avg       0.71      0.73      0.68      1503
weighted avg       0.77      0.68      0.69      1503



Conclusion: the best result was obtained with character-level n-grams and logistic regression. Can we improve on that?