## Tuning features towards simple language
In this notebook we modify a common feature-extraction method, tf-idf, in an attempt to improve
classification performance on simple language.
We first set up the feature extraction and then run two experiments.
Our feature is a modified version of tf-idf: before applying the vectorizer,
we drop all tokens that aren't adjectives, adverbs, nouns, proper nouns or verbs. We also use the lemmas
instead of the tokens themselves.
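
The filtering step can be illustrated without loading a language model. In the sketch below, the tagged tokens are hand-written stand-ins for what a POS tagger and lemmatizer would produce. The helper name `keep_content_lemmas` and the example sentence are hypothetical, and the tag set mirrors the list used in the featurizer code of this notebook.

```python
# Hand-written (token, lemma, POS) triples standing in for tagger output.
# In the notebook these values would come from spaCy; here they are assumptions.
TAGGED_SENTENCE = [
    ("Die", "der", "DET"),
    ("Kinder", "Kind", "NOUN"),
    ("spielten", "spielen", "VERB"),
    ("im", "in", "ADP"),
    ("großen", "groß", "ADJ"),
    ("Garten", "Garten", "NOUN"),
    (".", ".", "PUNCT"),
]

WANTED_TAGS = frozenset({"ADJ", "ADV", "NOUN", "PROPN", "VERB"})


def keep_content_lemmas(tagged_tokens, wanted_tags=WANTED_TAGS):
    """Return the lemmas of all tokens whose POS tag is in wanted_tags."""
    return [lemma for _token, lemma, pos in tagged_tokens if pos in wanted_tags]


lemmas = keep_content_lemmas(TAGGED_SENTENCE)
print(lemmas)  # ['Kind', 'spielen', 'groß', 'Garten']
```

Function words and punctuation disappear, and inflected forms such as "spielten" collapse onto their lemma, so the vocabulary seen by the vectorizer shrinks.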

We then train and test a classifier on tf-idf vectors from the out-of-the-box vectorizer,
once on simple language and once, in a separate trial, on conventional language.
We then train and test for both cases again, this time using our modified features.

If our approach is successful, the modified feature will improve the classification of simple language,
while improving the classification of conventional language only slightly, if at all.
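
The comparison could be organised roughly as in the following sketch. It is only an outline under several assumptions: the classifier (`LogisticRegression`), the 50/50 split, the topic labels, the toy sentences and the helper name `run_trial` are all hypothetical, and the `preprocess` argument is a placeholder for the POS filter plus lemmatization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_trial(texts, labels, preprocess=None, seed=0):
    """Train and test one classifier; return accuracy on the held-out half.

    preprocess is an optional function mapping a list of texts to a list of
    texts; None means the out-of-the-box baseline.
    """
    if preprocess is not None:
        texts = preprocess(texts)
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, stratify=labels, random_state=seed
    )
    vectorizer = TfidfVectorizer()
    # learn vocabulary and idf weights on the training half only
    train_vectors = vectorizer.fit_transform(x_train)
    test_vectors = vectorizer.transform(x_test)
    classifier = LogisticRegression(max_iter=1000)  # assumed classifier
    classifier.fit(train_vectors, y_train)
    return accuracy_score(y_test, classifier.predict(test_vectors))


# toy stand-in data; the notebook itself reads its corpora from CSV files
toy_texts = [
    "der verein gewinnt das spiel",
    "die mannschaft verliert das spiel",
    "der trainer lobt die mannschaft",
    "der verein kauft einen spieler",
    "das parlament beschliesst das gesetz",
    "die regierung plant ein gesetz",
    "der minister kritisiert die regierung",
    "das parlament waehlt den minister",
]
toy_labels = ["sport"] * 4 + ["politik"] * 4

baseline = run_trial(toy_texts, toy_labels)
# placeholder preprocessing: returns the texts unchanged
modified = run_trial(toy_texts, toy_labels, preprocess=lambda docs: list(docs))
print(baseline, modified)
```

Running `run_trial` once per corpus and once per feature variant gives the four scores that the experiment compares.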

In [1]:
import pandas as pd
import spacy
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

We use the TfidfVectorizer from scikit-learn, both for the control and internally in our modified feature.
We had attempted to implement the tf-idf computation manually, but ran into problems with the dimensionality
of the resulting vectors, so we fell back on the ready-made implementation.
We also use spaCy and its pretrained German language model for lemmatization and POS tagging.
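
For orientation, here is a minimal sketch of a tf-idf computation in plain Python. It is not the manual implementation mentioned above, which is not shown in this notebook. It uses the smoothed idf that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, and it stores each document as a dictionary of non-zero weights so that no column is needed for absent terms. The function name `tfidf_sparse` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_sparse(tokenized_docs):
    """Return one {term: weight} dict per document.

    Uses the smoothed idf  ln((1 + n) / (1 + df)) + 1  and L2 normalization.
    """
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # raw term frequency
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = [["Hund", "bellen", "laut"], ["Katze", "schlafen"], ["Hund", "schlafen", "Hund"]]
vectors = tfidf_sparse(docs)
vocabulary = sorted({t for doc in docs for t in doc})
print(len(vocabulary), [len(v) for v in vectors])  # 5 [3, 2, 2]
```

A dense table needs one column per vocabulary entry for every document, whereas the dictionaries only hold the terms that actually occur. scikit-learn's vectorizer returns a sparse matrix for the same reason.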

In [2]:
# we use spaCy's pretrained model to obtain POS tags for our data and to lemmatize it
nlp = spacy.load("de_core_news_md")

In [32]:
df_simple = pd.read_csv("clean_leichte_sprache.csv")
df_poli = pd.read_csv("clean_politik_normal.csv")
df_cult = pd.read_csv("clean_kultur_normal.csv")
df_sport = pd.read_csv("clean_sport_normal.csv")

We define our modified feature with two functions: fit_transform, which preprocesses the documents and then calls
the fit_transform of the tf-idf vectorizer, and transform, which does the same with the vectorizer's standard transform.

In [37]:
def custom_featurizer(list_in):
    # our very first step is to throw out all words that don't have one of the
    # following POS tags
    wanted_tags = {"ADJ", "ADV", "NOUN", "PROPN", "VERB"}

    # for saving the stripped documents
    stripped_documents = []

    for document in list_in:
        doc = nlp(document)
        # we keep the lemmas instead of the words for less variety, speeding up the computation
        stripped_document = [token.lemma_ for token in doc if token.pos_ in wanted_tags]
        stripped_documents.append(stripped_document)

    # now we only have the lemmas of the desired POS tags, and have already
    # greatly reduced our number of words
    return stripped_documents


def fit_transform(list_in, vectorizer):
    stripped_documents = custom_featurizer(list_in)
    # the vectorizer expects one string per document, so we join the lemmas again
    joined_documents = [" ".join(lemmas) for lemmas in stripped_documents]
    tf_idf_vectors = vectorizer.fit_transform(joined_documents)

    return tf_idf_vectors.toarray()


def transform(list_in, vectorizer):
    stripped_documents = custom_featurizer(list_in)
    # reuse the vocabulary and idf weights learned in fit_transform
    joined_documents = [" ".join(lemmas) for lemmas in stripped_documents]
    tf_idf_vectors = vectorizer.transform(joined_documents)

    return tf_idf_vectors.toarray()
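
The split into a fitting and a non-fitting function follows the usual scikit-learn pattern: vocabulary and idf weights are learned from the training documents and then reused for the test documents, so both sets end up with the same columns. The sketch below shows this with hand-written lemma lists that stand in for the output of the POS filter; the example documents are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical, already lemmatized and POS-filtered documents (one lemma list each)
train_lemmas = [["Hund", "bellen", "laut"], ["Katze", "schlafen", "gern"]]
test_lemmas = [["Hund", "schlafen"], ["Vogel", "singen"]]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and idf weights from the training documents
train_vectors = vectorizer.fit_transform(" ".join(doc) for doc in train_lemmas)
# transform reuses them, so unseen lemmas such as "Vogel" are simply ignored
test_vectors = vectorizer.transform(" ".join(doc) for doc in test_lemmas)

print(train_vectors.shape, test_vectors.shape)  # (2, 6) (2, 6)
print(test_vectors[1].nnz)  # 0
```

The second test document contains only lemmas that were never seen during fitting, so its vector has no non-zero entries.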
    
        

In [38]:
vectors_test = custom_featurizer(df_simple)

In [39]:
df_storage = pd.DataFrame(vectors_test)

In [40]:
df_storage = df_storage.fillna(0)

In [41]:
display(df_storage)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22898,22899,22900,22901,22902,22903,22904,22905,22906,22907
0,910.8,650.571429,4.4,2.844472,2277.0,1.968872,3.247059,146.903226,130.114286,650.571429,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.000000,0.0,7.875486,1.623529,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,2.844472,0.0,0.000000,6.494118,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,2.844472,0.0,0.000000,3.247059,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,5.688944,0.0,3.937743,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4549,0.0,0.000000,4.4,8.533417,0.0,1.968872,1.623529,0.000000,0.000000,0.000000,...,4554.0,9108.0,4554.0,4554.0,0.0,0.0,0.0,0.0,0.0,0.0
4550,0.0,0.000000,0.0,2.844472,0.0,1.968872,8.117647,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,9108.0,4554.0,0.0,0.0,0.0,0.0
4551,0.0,0.000000,0.0,0.000000,0.0,0.000000,1.623529,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,9108.0,0.0,0.0,0.0
4552,0.0,0.000000,13.2,0.000000,0.0,3.937743,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4554.0,4554.0,0.0


In [42]:
df_storage.to_csv("tf_idf_vectors_henri_simple.csv", index=False)

In [43]:
df_testread = pd.read_csv("tf_idf_vectors_henri_simple.csv", dtype="float")
display(df_testread)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22898,22899,22900,22901,22902,22903,22904,22905,22906,22907
0,910.8,650.571429,4.4,2.844472,2277.0,1.968872,3.247059,146.903226,130.114286,650.571429,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.0,0.000000,0.0,7.875486,1.623529,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.0,2.844472,0.0,0.000000,6.494118,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.0,2.844472,0.0,0.000000,3.247059,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.0,5.688944,0.0,3.937743,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4549,0.0,0.000000,4.4,8.533417,0.0,1.968872,1.623529,0.000000,0.000000,0.000000,...,4554.0,9108.0,4554.0,4554.0,0.0,0.0,0.0,0.0,0.0,0.0
4550,0.0,0.000000,0.0,2.844472,0.0,1.968872,8.117647,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,9108.0,4554.0,0.0,0.0,0.0,0.0
4551,0.0,0.000000,0.0,0.000000,0.0,0.000000,1.623529,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,9108.0,0.0,0.0,0.0
4552,0.0,0.000000,13.2,0.000000,0.0,3.937743,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4554.0,4554.0,0.0
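
Rather than comparing the two displayed tables by eye, the write and read round trip can be checked programmatically. The following is a small sketch that uses an in-memory buffer instead of a file on disk; the DataFrame contents are made-up stand-in values.

```python
import io

import pandas as pd

# small stand-in for the stored vectors; the values are made up
df_small = pd.DataFrame([[0.5, 0.0, 1.25], [0.0, 2.0, 0.0]])

buffer = io.StringIO()  # in-memory replacement for a file on disk
df_small.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer, dtype="float")

# column labels come back as strings ("0", "1", ...), so compare values only
print((df_small.to_numpy() == df_back.to_numpy()).all())  # True
```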
