# Assignment 7
## GermEval 2018 shared task: identification of offensive language

GermEval task: https://projects.fzai.h-da.de/iggsa/germeval-2018/

Assignment: https://snlp2018.github.io/assignments.html

### Task 1
The goal of Task 1 is to detect offensive language in social media posts in German: tweets have to be classified as either `OFFENSE` or `OTHER`.

Data: https://github.com/uds-lsv/GermEval-2018-Data

The file `germeval2018.training.txt` is a tab-separated list of labelled tweets; this is the development set to be used in training and tuning of model(s). The file `germeval2018.test.txt` contains the "gold-standard" data against which the final model should be (and has been, in the actual shared task) evaluated.

#### The data
First of all, let's read and take a quick look at the data:

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/germeval2018.training.txt", sep = "\t", encoding = "utf-8", header = None)

In [3]:
df.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | @corinnamilborn Liebe Corinna, wir würden dich... | OTHER | OTHER |
| 1 | @Martin28a Sie haben ja auch Recht. Unser Twee... | OTHER | OTHER |
| 2 | @ahrens_theo fröhlicher gruß aus der schönsten... | OTHER | OTHER |
| 3 | @dushanwegner Amis hätten alles und jeden gewä... | OTHER | OTHER |
| 4 | @spdde kein verläßlicher Verhandlungspartner. ... | OFFENSE | INSULT |


We can observe that the first column contains the tweet and the second column the label, `OFFENSE` or `OTHER`; the third column contains a finer-grained labelling, which we are not interested in at the moment. Let's drop the latter and give the other two columns user-friendly names:

In [4]:
df.drop(df.columns[2], axis = 1, inplace = True)
df.columns = ["tweet", "category"]

In [5]:
df.head()

| | tweet | category |
|---|---|---|
| 0 | @corinnamilborn Liebe Corinna, wir würden dich... | OTHER |
| 1 | @Martin28a Sie haben ja auch Recht. Unser Twee... | OTHER |
| 2 | @ahrens_theo fröhlicher gruß aus der schönsten... | OTHER |
| 3 | @dushanwegner Amis hätten alles und jeden gewä... | OTHER |
| 4 | @spdde kein verläßlicher Verhandlungspartner. ... | OFFENSE |


Is the data balanced? Let's find out:

In [6]:
df.groupby("category").count()

| category | tweet |
|---|---|
| OFFENSE | 1688 |
| OTHER | 3321 |


Not quite: there are about twice as many tweets labelled `OTHER` (3321) as tweets labelled `OFFENSE` (1688). Is this a problem? It can be, since a model that always predicts `OTHER` would already reach about 66% accuracy. So: 1) we won't rely on accuracy alone but look at precision and recall, and 2) we might need to weight the loss function(s) of our model(s). We'll come back to this later.
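To make the weighting idea concrete, here is a minimal sketch in plain Python of the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula `sklearn` documents for `class_weight = "balanced"`. The counts are copied from the table above; note that in an actual pipeline the weights would be computed on the training split only, so the numbers there would differ slightly.

```python
# Class counts copied from the groupby output above (full development set).
counts = {"OFFENSE": 1688, "OTHER": 3321}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: the rarer a class, the larger its weight in the loss.
class_weights = {
    label: n_samples / (n_classes * n) for label, n in counts.items()
}

for label, weight in class_weights.items():
    share = counts[label] / n_samples
    print(f"{label}: share = {share:.3f}, weight = {weight:.3f}")
```

With these weights, a misclassified `OFFENSE` tweet costs roughly twice as much as a misclassified `OTHER` tweet.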

#### Preprocessing

What do the tweets look like?

In [7]:
import random
print(random.sample(df["tweet"].to_list(), 10))

['Steuermehreinnahmen. Wofür? Blöde Frage. Es gibt so viele Löcher, |LBR| die gestopft werden müssen. Hört auf mit Steuergeschenken.', '@bettisweb @dannytastisch Hi, Marktplatz 40213 Düsseldorf um 14 Uhr.', '@X9donernesto @petpanther0 @machtjanix23 @NoHerrman @teite99 @info2099 @lifetrend @ThomasGBauer @SchmiddieMaik @AthinaMala @charlie_silve @willjrosenblatt @feldenfrizz @nasanasal @_macmike @ellibisathide @MD_Franz Kann ich Dir nachher schicken, habe zu Hause hebräische Tastatur.', '@Beatrix_vStorch @krippmarie CDU soll mit Merkel und SPD weiterhin mit Schulz...!! |LBR| Nur so kann das endgültig beendet werden.....😃😃😃😃', '@SJW_MadBunny @machtjanix23 @Steffmann45 @troll_putin @petpanther0 @mastermikeg @Norbinator2403 @ennof_ @AthinaMala @NancyPeggyMandy @info2099 @lifetrend @ThomasGBauer @SchmiddieMaik @charlie_silve @NoHerrman @willjrosenblatt @feldenfrizz @nasanasal @_macmike @ellibisathide @MD_Franz Das ist ihre Spezialität. Mehr kommt dann aber auch nicht', 'Das Wahlsystem in die

As we'd expect, there are Twitter handles (`@...`), hashtags (`#...`) and emojis, besides more or less usual punctuation. For a start, we can remove the handles (Twitter usernames are typically not German words) but keep hashtags and emojis.

In [8]:
import re

For example:

In [9]:
string = "@handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4."

In [10]:
tmp = re.sub(r"\@\w+","", string)
print(tmp)

  some other words or #hashtags  and finally the end .


Then we can remove leftover multiple blank spaces:

In [11]:
tmp1 = re.sub(r"\s+", " ", tmp)
print(tmp1)

 some other words or #hashtags and finally the end .


Let's write a simple function to do this cleaning and apply it to the `tweet` column:

In [12]:
def tweet_cleaner(string):
    # remove @handles, then collapse runs of whitespace into a single space
    clean_string = re.sub(r"\s+", " ", re.sub(r"@\w+", "", string))
    return clean_string

In [13]:
print("Before: " + string)
print("After: " + tweet_cleaner(string))

Before: @handle1 @handle2 some other words or #hashtags @handle3 and finally the end @handle4.
After:  some other words or #hashtags and finally the end .
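The cleaned string still starts with a blank, and the sampled tweets also contain `|LBR|` tokens. A slightly stricter variant could look like the sketch below. The function name `tweet_cleaner_strict` is hypothetical, and treating `|LBR|` as a line-break marker that can be replaced by a space is an assumption about the data, not something verified here.

```python
import re

HANDLE_RE = re.compile(r"@\w+")   # Twitter handles such as @handle1
LBR_RE = re.compile(r"\|LBR\|")   # assumed line-break marker in the data
SPACE_RE = re.compile(r"\s+")     # any run of whitespace


def tweet_cleaner_strict(text):
    """Remove handles and |LBR| markers, collapse and trim whitespace."""
    text = HANDLE_RE.sub("", text)
    text = LBR_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


example = "@a @b hello |LBR| world  #tag @c."
print(repr(tweet_cleaner_strict(example)))
```

Hashtags and emojis are left untouched, as in the original cleaner.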


In [14]:
df["clean_tweet"] = df["tweet"].apply(tweet_cleaner) # apply the cleaner to every tweet

In [15]:
df.head()

| | tweet | category | clean_tweet |
|---|---|---|---|
| 0 | @corinnamilborn Liebe Corinna, wir würden dich... | OTHER | Liebe Corinna, wir würden dich gerne als Mode... |
| 1 | @Martin28a Sie haben ja auch Recht. Unser Twee... | OTHER | Sie haben ja auch Recht. Unser Tweet war etwa... |
| 2 | @ahrens_theo fröhlicher gruß aus der schönsten... | OTHER | fröhlicher gruß aus der schönsten stadt der w... |
| 3 | @dushanwegner Amis hätten alles und jeden gewä... | OTHER | Amis hätten alles und jeden gewählt...nur Hil... |
| 4 | @spdde kein verläßlicher Verhandlungspartner. ... | OFFENSE | kein verläßlicher Verhandlungspartner. Nachka... |


#### Baseline models

Let's start quick and easy: let's see how Logistic Regression behaves with `tf-idf` features, using `sklearn`'s pipelines.

First, we split the `df` into training and testing data:

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X = df.clean_tweet # features are the cleaned tweets
y = df.category # labels are the categories
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42) # split
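The split above samples rows without looking at the labels. With unbalanced classes one may prefer a stratified split, which keeps the `OFFENSE`/`OTHER` ratio the same in both parts; `train_test_split` has a `stratify` argument for this. The plain-Python sketch below only illustrates the idea on toy labels. The helper `stratified_split` is hypothetical, and switching to a stratified split would change the numbers reported further down.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_indices, test_indices) with per-class proportions kept."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same share for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = ["OFFENSE"] * 10 + ["OTHER"] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```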

Next, define a pipeline with a vectorizer, a tf-idf encoder and a logistic regression model:

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

In [19]:
logreg = Pipeline([('vect', CountVectorizer()), # transform text into matrix of token counts
                   ('tfidf', TfidfTransformer()), # from token counts to normalized tf-idf
                   ('clf', LogisticRegression(class_weight = "balanced")), # logistic regression classifier, class_weight param set to "cure" unbalanced labels
                  ])
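To see what the `tfidf` step does to the counts, here is a small self-contained sketch of the weighting on three toy documents. It assumes the smoothed formula that `TfidfTransformer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization; the whitespace tokenizer and the toy sentences are simplifications for illustration only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), tokenized on whitespace.
docs = ["du bist toll", "du bist doof", "toll toll toll"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency.
idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def tfidf_vector(tokens):
    """Term counts times idf, scaled to unit Euclidean length."""
    raw = {tok: count * idf[tok] for tok, count in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {tok: v / norm for tok, v in raw.items()}


vec = tfidf_vector(tokenized[1])
print({tok: round(v, 3) for tok, v in vec.items()})
```

In the second document the rare token `doof` ends up with a larger weight than `du` or `bist`, which occur in two documents each.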

In [20]:
logreg.fit(X_train, y_train) # training
y_pred = logreg.predict(X_test) # predict labels of test data

We can quickly explore the model's performance with the help of `sklearn`'s built-in metrics:

In [21]:
from sklearn.metrics import classification_report, accuracy_score

In [22]:
# print report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.56      0.64      0.59       497
       OTHER       0.81      0.75      0.78      1006

    accuracy                           0.71      1503
   macro avg       0.68      0.69      0.69      1503
weighted avg       0.72      0.71      0.72      1503
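For reference, the per-class columns of this report can be reproduced by hand. The sketch below computes precision, recall and F1 per class, plus macro F1 and accuracy, on a tiny made-up label list (not the GermEval data); the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration.
toy_true = ["OFFENSE", "OFFENSE", "OTHER", "OTHER", "OTHER", "OTHER"]
toy_pred = ["OFFENSE", "OTHER", "OFFENSE", "OTHER", "OTHER", "OTHER"]

scores = {c: precision_recall_f1(toy_true, toy_pred, c) for c in ("OFFENSE", "OTHER")}
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(scores)
print(f"macro F1 = {macro_f1:.3f}, accuracy = {accuracy:.3f}")
```

Macro F1 averages the classes with equal weight, so it is lower than accuracy whenever the minority class is handled worse than the majority class.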



Let's try with word-level n-gram `tf-idf`:

In [23]:
logreg = Pipeline([('vect', CountVectorizer(analyzer = "word", token_pattern = r"\w{1,}", ngram_range = (2,3))), # ngrams counts
                   ('tfidf', TfidfTransformer()), # normalized tf-idf
                   ('clf', LogisticRegression(class_weight = "balanced")), # logistic regression
                  ])
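To make the feature set explicit: with `ngram_range = (2,3)` the vectorizer emits word bigrams and trigrams only, so single words are no longer features. The sketch below mimics this in plain Python, using the same `\w{1,}` token pattern and lowercasing as `CountVectorizer` does by default; the helper `word_ngrams` is hypothetical and the real vectorizer additionally builds a vocabulary and a sparse count matrix.

```python
import re


def word_ngrams(text, ngram_range=(2, 3)):
    """List the word n-grams of `text` for every n in the inclusive range."""
    tokens = re.findall(r"\w{1,}", text.lower())
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(word_ngrams("Das ist ein Test"))
```

A tweet consisting of a single word yields no features at all under this setting.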

In [24]:
logreg.fit(X_train, y_train) # training
y_pred = logreg.predict(X_test) # predict labels of test data

In [25]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.56      0.36      0.44       497
       OTHER       0.73      0.86      0.79      1006

    accuracy                           0.69      1503
   macro avg       0.64      0.61      0.61      1503
weighted avg       0.67      0.69      0.67      1503



Let's try with n-gram character-level `tf-idf`:

In [26]:
logreg = Pipeline([('vect', CountVectorizer(analyzer = "char", ngram_range = (2,4))), # char-level n-gram counts
                   ('tfidf', TfidfTransformer()), # normalized tf-idf
                   ('clf', LogisticRegression(class_weight = "balanced")), # logistic regression
                  ])
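One plausible reason to try character n-grams is that inflected or slightly misspelled forms of a word still share most of their features. The simplified sketch below lists character n-grams for `ngram_range = (2,4)`; the helper `char_ngrams` is hypothetical, and its whitespace handling only approximates what `CountVectorizer(analyzer = "char")` does.

```python
def char_ngrams(text, ngram_range=(2, 4)):
    """List the character n-grams of `text` for every n in the inclusive range."""
    text = " ".join(text.lower().split())  # simplified whitespace normalization
    lo, hi = ngram_range
    return [
        text[i:i + n]
        for n in range(lo, hi + 1)
        for i in range(len(text) - n + 1)
    ]


singular = set(char_ngrams("Idiot"))
plural = set(char_ngrams("Idioten"))
print(sorted(singular & plural))
print(len(singular), len(plural))
```

Every n-gram of the shorter form also occurs in the longer one, whereas a word-level vectorizer would treat the two forms as unrelated tokens.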

In [27]:
logreg.fit(X_train, y_train) # training
y_pred = logreg.predict(X_test) # predict labels of test data

In [28]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.62      0.70      0.66       497
       OTHER       0.84      0.78      0.81      1006

    accuracy                           0.76      1503
   macro avg       0.73      0.74      0.73      1503
weighted avg       0.77      0.76      0.76      1503



Slightly better: macro F1 rises to 0.73, from 0.69 with single-word features.

Conclusion: the best result was obtained with character-level n-grams and logistic regression. Can we improve on that?

#### Support Vector Machines

In [29]:
from sklearn.svm import SVC

In [30]:
svcl = Pipeline([('vect', CountVectorizer()), # transform text into matrix of token counts
                   ('tfidf', TfidfTransformer()), # from token counts to normalized tf-idf
                   ('clf', SVC(class_weight = "balanced")), # support vector classifier, default params
                  ])

In [31]:
svcl.fit(X_train, y_train) # training
y_pred = svcl.predict(X_test) # predict labels of test data

In [32]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.64      0.53      0.58       497
       OTHER       0.78      0.85      0.82      1006

    accuracy                           0.75      1503
   macro avg       0.71      0.69      0.70      1503
weighted avg       0.74      0.75      0.74      1503



Already not too far from the best LogReg above (macro F1 of 0.70 vs. 0.73). How about n-gram features?

In [33]:
svcl = Pipeline([('vect', CountVectorizer(analyzer = "word", token_pattern = r"\w{1,}", ngram_range = (2,3))), # word-level n-grams
                   ('tfidf', TfidfTransformer()), # from n-gram counts to normalized tf-idf
                   ('clf', SVC(class_weight = "balanced")), # support vector classifier
                  ])

In [34]:
svcl.fit(X_train, y_train) # training
y_pred = svcl.predict(X_test) # predict labels of test data

In [35]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.73      0.09      0.15       497
       OTHER       0.69      0.98      0.81      1006

    accuracy                           0.69      1503
   macro avg       0.71      0.54      0.48      1503
weighted avg       0.70      0.69      0.59      1503



Very poor recall of `OFFENSE` (0.09): the model labels almost every tweet as `OTHER`, which is puzzling. How about character-level n-grams?

In [36]:
svcl = Pipeline([('vect', CountVectorizer(analyzer = "char", ngram_range = (2,4))), # character-level n-grams
                   ('tfidf', TfidfTransformer()), # from n-gram counts to normalized tf-idf
                   ('clf', SVC(class_weight = "balanced")), # support vector classifier
                  ])

In [37]:
svcl.fit(X_train, y_train) # training
y_pred = svcl.predict(X_test) # predict labels of test data

In [38]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.71      0.62      0.66       497
       OTHER       0.82      0.87      0.85      1006

    accuracy                           0.79      1503
   macro avg       0.76      0.75      0.75      1503
weighted avg       0.78      0.79      0.79      1503



This is the best result so far. We can take it as our new baseline and try to improve from here, e.g. by tuning the parameters of `SVC`.

To do so, we try with `sklearn`'s `GridSearchCV`:

In [39]:
from sklearn.model_selection import GridSearchCV

In [40]:
# same pipeline as before
svcl = Pipeline([('vect', CountVectorizer(analyzer = "char", ngram_range = (2,4))), # character-level n-grams
                   ('tfidf', TfidfTransformer()), # from n-gram counts to normalized tf-idf
                   ('clf', SVC(class_weight = "balanced")), # support vector classifier
                  ])

In [48]:
import numpy as np # to define intervals of values

In [50]:
# grid of parameter values
param_grid = {"clf__C" : np.linspace(0.1, 1, num = 10), # regularization
              "clf__gamma" : np.linspace(0.1, 1, num = 10) # kernel coefficient
             }

In [51]:
search = GridSearchCV(svcl, param_grid, verbose = 10, n_jobs=-1) # n_jobs=-1 uses all available processors; default scoring is the classifier's accuracy
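The cost of this search is easy to estimate in advance: every combination of the two parameters is one candidate, and every candidate is fitted once per cross-validation fold. The sketch below enumerates the same 10 x 10 grid in plain Python, assuming 5 folds. The log-spaced alternative at the end is only a commonly used suggestion, not something this notebook ran.

```python
from itertools import product

# Same values as np.linspace(0.1, 1, num = 10), built without numpy.
c_values = [k / 10 for k in range(1, 11)]
gamma_values = [k / 10 for k in range(1, 11)]

candidates = [
    {"clf__C": c, "clf__gamma": g} for c, g in product(c_values, gamma_values)
]
n_folds = 5  # assumed number of cross-validation folds
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")

# Hypothetical alternative: a coarse log-spaced grid covers several orders
# of magnitude with fewer candidates.
log_grid = [10.0 ** e for e in range(-3, 4)]
print(len(log_grid) ** 2, "candidates for a 7 x 7 log-spaced grid")
```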

In [52]:
search.fit(X_train, y_train) # this will take a while

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed: 12.4min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 14.3min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed: 16.7min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 18.6min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed: 21.6min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 24.2min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 27

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='char',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(2, 4),
                                                        p

In [53]:
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Best parameter (CV score=0.758):
{'clf__C': 1.0, 'clf__gamma': 0.7000000000000001}


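`C=1` is the default value, and it is also the upper edge of the grid, so larger values of `C` were never tried. For `gamma`, the default in scikit-learn 0.22 and later is `"scale"`, i.e. `1 / (n_features * X.var())`, a value that the grid above does not necessarily contain.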
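To get a feeling for what `gamma` controls: `SVC` uses an RBF kernel by default, `K(x, y) = exp(-gamma * ||x - y||^2)`. Since the tf-idf vectors are L2-normalized by default, the squared distance between two tweets lies between 0 (identical) and 2 (no shared n-grams). The sketch below evaluates the kernel for two orthogonal unit vectors; the toy vectors are made up for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two unit vectors with nothing in common: squared distance is 2.
x = [1.0, 0.0]
y = [0.0, 1.0]

for gamma in (0.1, 0.7, 1.0):
    print(f"gamma = {gamma}: K(x, y) = {rbf_kernel(x, y, gamma):.3f}")
```

A larger `gamma` makes the similarity fall off faster with distance, so each training tweet influences a smaller neighbourhood.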

In [54]:
svcl = Pipeline([('vect', CountVectorizer(analyzer = "char", ngram_range = (2,4))), # character-level n-grams
                   ('tfidf', TfidfTransformer()), # from n-gram counts to normalized tf-idf
                   ('clf', SVC(C = 1, gamma = 0.7, class_weight = "balanced")), # support vector classifier
                  ])

In [55]:
svcl.fit(X_train, y_train) # training
y_pred = svcl.predict(X_test) # predict labels of test data

In [56]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     OFFENSE       0.65      0.67      0.66       497
       OTHER       0.84      0.82      0.83      1006

    accuracy                           0.77      1503
   macro avg       0.74      0.75      0.75      1503
weighted avg       0.78      0.77      0.77      1503



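Compared to the model with default parameters, accuracy is lower (0.77 vs. 0.79) and macro F1 is unchanged (0.75): `OFFENSE` recall improves (0.67 vs. 0.62) at the cost of precision (0.65 vs. 0.71). So the grid search did not improve on the defaults...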