## Tuning features towards simple language
In this notebook we modify a common feature-extraction method in an attempt to improve
classification performance on simple language.
We first set up the feature extraction and then run two experiments.
Our feature is a modified version of tf-idf: before applying the vectorizer,
we drop all tokens that aren't nouns, proper nouns, verbs, adjectives or adverbs. We also use the lemmas
instead of the tokens themselves.

As a control, we train and test a classifier on tf-idf vectors from the out-of-the-box vectorizer,
once on simple language and once, in a separate trial, on conventional language.
We then train and test for both cases again, using our modified features.

If our approach is successful, the modified feature will improve the classification of simple language,
while improving the classification of conventional language only slightly, if at all.

In [12]:
import pandas as pd
import spacy
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

We use the TfidfVectorizer from scikit-learn, both for the control and internally in our modified feature.
We had attempted to implement the tf-idf computation manually, but ran into problems with the dimensionality
of the resulting vectors, so we resorted to the ready-made implementation.
We also use spaCy and its pretrained language model for lemmatization and POS tagging.
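
To illustrate the dimensionality issue mentioned above, here is a minimal sketch of a manual tf-idf computation. It is not our original attempt. It assumes whitespace tokenization and the smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is what scikit-learn documents as the default behaviour of TfidfVectorizer. The names manual_tfidf and toy_docs are made up for this example. Every document becomes a vector with one entry per vocabulary term, so the vector length grows with the vocabulary of the whole corpus.

```python
import math
from collections import Counter


def manual_tfidf(documents):
    """Return (vocabulary, vectors) for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed idf (assumed to mirror scikit-learn's default smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocabulary]
        # L2-normalize each document vector; guard against an all-zero vector
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        vectors.append([value / norm for value in raw])
    return vocabulary, vectors


toy_docs = ["der Hund bellt", "die Katze schläft", "der Hund schläft"]
vocabulary, vectors = manual_tfidf(toy_docs)
# the vector length equals the vocabulary size
print(len(vocabulary), len(vectors[0]))
```

On a real corpus the vocabulary easily reaches tens of thousands of terms, which is why a dense list-of-lists representation like this one becomes impractical.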

In [13]:
# we use spaCy's pretrained German model to obtain POS tags for our data and to lemmatize it
nlp = spacy.load("de_core_news_md")

This notebook requires the cleaned data produced by my cleaning and metadata notebook to be in the same directory, so that the files can be read here.

In [14]:
df_simple = pd.read_csv("clean_leichte_sprache.csv")
df_poli = pd.read_csv("clean_politik_normal.csv")
df_cult = pd.read_csv("clean_kultur_normal.csv")
df_sport = pd.read_csv("clean_sport_normal.csv")
df_normal = pd.concat([df_poli, df_cult, df_sport])

We define our modified feature as a preprocessing function, custom_featurizer, plus two wrappers:
fit_transform, which calls the fit_transform of the tf-idf vectorizer, and transform, which calls its transform.

In [22]:
def custom_featurizer(series_in):
    # keep only tokens whose coarse POS tag is one of the following
    wanted_tags = {"ADJ", "ADV", "NOUN", "PROPN", "VERB"}

    # the filtered, lemmatized documents
    stripped_documents = []

    for document in series_in:
        doc = nlp(document)
        # replace each kept token by its lemma, which reduces the vocabulary size
        lemmas = [token.lemma_ for token in doc if token.pos_ in wanted_tags]
        stripped_documents.append(" ".join(lemmas))

    # each document now consists only of the lemmas of tokens with a wanted POS tag
    return stripped_documents

def fit_transform(series_in, vectorizer) :
    stripped_documents = custom_featurizer(series_in)
    tf_idf_vectors = vectorizer.fit_transform(stripped_documents)
    
    return tf_idf_vectors.toarray()

def transform(series_in, vectorizer) :
    stripped_documents = custom_featurizer(series_in)
    tf_idf_vectors = vectorizer.transform(stripped_documents)
    
    return tf_idf_vectors.toarray()
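
The filtering step of custom_featurizer can be illustrated without loading the spaCy model. The sketch below assumes the tokens have already been tagged: the (text, lemma, POS) triples are written by hand for this example and are not spaCy output, and filter_and_lemmatize is a hypothetical helper name.

```python
# hand-written (text, lemma, POS) triples standing in for spaCy tokens (assumption)
tagged_sentence = [
    ("Die", "der", "DET"),
    ("Spieler", "Spieler", "NOUN"),
    ("gewannen", "gewinnen", "VERB"),
    ("das", "der", "DET"),
    ("wichtige", "wichtig", "ADJ"),
    ("Spiel", "Spiel", "NOUN"),
    (".", ".", "PUNCT"),
]

WANTED_TAGS = {"ADJ", "ADV", "NOUN", "PROPN", "VERB"}


def filter_and_lemmatize(tagged_tokens, wanted_tags=WANTED_TAGS):
    """Keep the lemmas of tokens whose POS tag is wanted, joined by spaces."""
    return " ".join(lemma for _, lemma, pos in tagged_tokens if pos in wanted_tags)


stripped = filter_and_lemmatize(tagged_sentence)
print(stripped)
```

Determiners and punctuation are dropped, and inflected forms such as "gewannen" and "wichtige" collapse onto their lemmas before the text reaches the vectorizer.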
    
        

We use a standard train-test split, and a multilayer perceptron as the classifier.

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
import numpy as np

First our control. We split the data into train and test sets using a 70-30 split, and then vectorize using the
standard tf-idf vectorizer.

In [None]:
def vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data + transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    return train_tfidf, test_tfidf

# train test split and vectorize simple texts
train_text_simple, test_text_simple, train_label_simple, test_label_simple = train_test_split(df_simple["text"], df_simple["label"], train_size=0.7, random_state=0)
train_vec_simple, test_vec_simple = vectorize(train_text_simple, test_text_simple)
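
The vectorizer is fitted on the training texts only, so the test vectors are expressed in the training vocabulary and terms that occur only in the test data are ignored. Below is a minimal sketch of that behaviour, using plain term counts instead of tf-idf weights. fit_vocabulary and count_vector are hypothetical helpers for illustration, not scikit-learn functions.

```python
def fit_vocabulary(train_documents):
    """Map every term seen in the training documents to a column index."""
    terms = sorted({term for doc in train_documents for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def count_vector(document, vocabulary):
    """Count in-vocabulary terms; terms unseen during fitting are skipped."""
    vector = [0] * len(vocabulary)
    for term in document.lower().split():
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector


train_docs = ["das Spiel beginnt", "das Museum öffnet"]
test_doc = "das Konzert beginnt"

vocabulary = fit_vocabulary(train_docs)
test_vector = count_vector(test_doc, vocabulary)
print(vocabulary)
print(test_vector)
```

Here "Konzert" never appears in the training documents, so it contributes nothing to the test vector, and the test vector has the same length as the training vectors.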

Now we can train and test our classifier, and show our evaluation.

In [None]:
# train the MLP classifier on the simple-language vectors
clf_simple = MLPClassifier()
clf_simple.fit(train_vec_simple, train_label_simple)

In [None]:
# predict the labels and evaluate the results for simple language
predictions_simple = clf_simple.predict(test_vec_simple)
print(classification_report(test_label_simple, predictions_simple))

In [8]:
def vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data + transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    return train_tfidf, test_tfidf

# train test split and vectorize normal texts
train_text_normal, test_text_normal, train_label_normal, test_label_normal = train_test_split(df_normal["text"], df_normal["label"], train_size=0.7, random_state=0)
train_vec_normal, test_vec_normal = vectorize(train_text_normal, test_text_normal)

In [9]:
clf_normal = MLPClassifier()
clf_normal.fit(train_vec_normal, train_label_normal)

MLPClassifier()

In [10]:
# predict the labels and evaluate the results for normal language
predictions_normal = clf_normal.predict(test_vec_normal)
print(classification_report(test_label_normal, predictions_normal))

              precision    recall  f1-score   support

      Kultur       0.96      0.96      0.96       704
 Nachrichten       0.95      0.96      0.96       673
       Sport       0.99      0.99      0.99       798

    accuracy                           0.97      2175
   macro avg       0.97      0.97      0.97      2175
weighted avg       0.97      0.97      0.97      2175



Now we can do the same setup, but with our modified features.

In [23]:
def vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data + transform
    train_modified = fit_transform(train_data, vectorizer)
    # transform test data
    test_modified = transform(test_data, vectorizer) 
    return train_modified, test_modified

# train test split and vectorize simple texts
train_text_simple, test_text_simple, train_label_simple, test_label_simple = train_test_split(df_simple["text"], df_simple["label"], train_size=0.7, random_state=0)
train_vec_simple, test_vec_simple = vectorize(train_text_simple, test_text_simple)

4032    Wichtiger Jahres-Tag Deutschland und das Land ...
2976    Neue Regierung in Italien Italien hat eine neu...
461     Thomas-Mann-Haus  Thomas Mann war ein berühmte...
3565    Naher Osten Der Präsident von dem Land USA hei...
1767    Chris Froome gewinnt Der Rad-Fahrer Christophe...
                              ...                        
1033    Ausstellung über Deutschland In einem Museum i...
3264    Urteil erwartet Ein Gericht in München will sc...
1653    Champions League: Achtel-Finale In der Fußball...
2607    Neuer US-Präsident im Amt Der Demokrat Joe Bid...
2732    USA ziehen mehr Soldaten ab Das Land USA will ...
Name: text, Length: 3187, dtype: object

In [24]:
# train the MLP classifier on the simple-language vectors
clf_simple = MLPClassifier()
clf_simple.fit(train_vec_simple, train_label_simple)

MLPClassifier()

In [25]:
# predict the labels and evaluate the results for simple language
predictions_simple = clf_simple.predict(test_vec_simple)
print(classification_report(test_label_simple, predictions_simple))

              precision    recall  f1-score   support

      Kultur       0.93      0.91      0.92       397
 Nachrichten       0.93      0.95      0.94       595
       Sport       0.99      0.98      0.99       375

    accuracy                           0.95      1367
   macro avg       0.95      0.95      0.95      1367
weighted avg       0.95      0.95      0.95      1367



In [27]:
def vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data + transform
    train_modified = fit_transform(train_data, vectorizer)
    # transform test data
    test_modified = transform(test_data, vectorizer)
    return train_modified, test_modified

# train test split and vectorize normal texts
train_text_normal, test_text_normal, train_label_normal, test_label_normal = train_test_split(df_normal["text"], df_normal["label"], train_size=0.7, random_state=0)
train_vec_normal, test_vec_normal = vectorize(train_text_normal, test_text_normal)

In [28]:
clf_normal = MLPClassifier()
clf_normal.fit(train_vec_normal, train_label_normal)

MLPClassifier()

In [29]:
# predict the labels and evaluate the results for normal language
predictions_normal = clf_normal.predict(test_vec_normal)
print(classification_report(test_label_normal, predictions_normal))

              precision    recall  f1-score   support

      Kultur       0.95      0.96      0.96       704
 Nachrichten       0.96      0.96      0.96       673
       Sport       0.99      0.99      0.99       798

    accuracy                           0.97      2175
   macro avg       0.97      0.97      0.97      2175
weighted avg       0.97      0.97      0.97      2175
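
To check the hypothesis from the introduction, the accuracies of the control and the modified runs can be compared per language variety. The sketch below uses only the accuracies printed in this notebook. The output of the control run on simple language is not included above, so that entry is left as None rather than guessed. accuracy_gain is a hypothetical helper.

```python
reported_accuracy = {
    "normal": {"control": 0.97, "modified": 0.97},
    "simple": {"control": None, "modified": 0.95},  # control output not shown above
}


def accuracy_gain(scores):
    """Return modified minus control accuracy, or None if either value is missing."""
    if scores["control"] is None or scores["modified"] is None:
        return None
    return round(scores["modified"] - scores["control"], 2)


for variety, scores in reported_accuracy.items():
    print(variety, accuracy_gain(scores))
```

Once the control cells for simple language have been run, the missing value can be filled in from their classification report.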



Author: Henri Thölke