# NLP Project - Leichte Sprache

Quick guide to the notebook:  
1. Create a `./data` folder and place the following .csv files inside:  
  1. Kultur_normal.csv  
  2. leicht_nachricht.csv  
  3. Politik_normal.csv
  4. Sport_normal.csv
2. Install the spaCy transformer model (see below) and restart your Python runtime.
3. Run the cells for global variables and for data loading and cleansing.
4. The code is split into sections based on which data is used for training and which data is used for testing.  
 Every section is divided into three parts:
  1. `vectorization`: vectorizes the loaded data and prepares it for the `classifiers` cells of the same section. Only run one of these cells, because they all set the same variables. The cell run last determines the vectorization method used when running a classifier.
  2. `classifiers`: uses the previously set vectors to train a classifier. You should only run one of the cells in this subsection. Across sections these cells are identical, so any classifier can be used for the task, even if the current section has no corresponding cell. The classifier run last is the one used for evaluation.
  3. `evaluation`: cells in this subsection use the trained classifier and test its performance on the test set or on other data.

## Pickle Save/Load

In [11]:
# helpers to save and load objects with pickle
import pickle

def save_object(object_to_save, filename='pickled_data.pkl'):
  with open(filename, 'wb') as file:
     pickle.dump(object_to_save, file)

def load_object(file_name_to_load):
  with open(file_name_to_load, 'rb') as file:
     obj = pickle.load(file)
  return obj

#save_object(df)
#df = load_object('pickled_data.pkl')

## Install spacy dependencies

In [None]:
# run cell and restart runtime before continuing

!pip install --upgrade spacy
!pip install spacy-transformers
!python -m spacy download de_dep_news_trf
!pip install textstat


## Main Code

### Global variables

In [None]:
# set which column to analyse on
# choose one of: 
# - haupt_text 
# - kurz_text 
# - article 
# - all_text
data_column = 'haupt_text'

### Data Loading and Cleansing

In [None]:
import spacy
import pandas as pd

In [None]:
import io

# load Kultur data non-Leichte Sprache
df_culture = pd.read_csv('./data/Kultur_normal.csv').drop(['Line_ID', 'year', 'month', 'day'], axis=1)
df_culture['category'] = 'Kultur'

# load Sport data non-Leichte Sprache
df_sport = pd.read_csv('./data/Sport_normal.csv').drop(['Line_ID', 'year', 'month', 'day'], axis=1)
df_sport['category'] = 'Sport'

# load Politik data non-Leichte Sprache
df_politic = pd.read_csv('./data/Politik_normal.csv').drop(['Line_ID', 'year', 'month', 'day'], axis=1)
df_politic['category'] = 'Nachrichten'

# combine non-Leichte Sprache data
df_not_leichte_sprache = pd.concat([df_culture, df_sport, df_politic])
# set Leichte Sprache identifier to 0 -> non-Leichte Sprache
df_not_leichte_sprache['is_leichte_sprache'] = 0

# load Leichte Sprache data
df_leichte_sprache = pd.read_csv('./data/leicht_nachricht.csv').drop(['audio_link', 'Line_ID', 'year', 'month', 'day'], axis=1)
df_leichte_sprache = df_leichte_sprache[df_leichte_sprache['category'] != 'Vermischtes']
# set Leichte Sprache identifier to 1 -> Leichte Sprache
df_leichte_sprache['is_leichte_sprache'] = 1

# concat dataframes
df = pd.concat([df_not_leichte_sprache, df_leichte_sprache])
df = df.reset_index(drop=True)

In [None]:
import re
import string

# replace formatting characters with spaces and then collapse multiple spaces into single spaces
# finally .strip() the text to remove leading and trailing blanks
def data_cleansing(text):
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\r', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text

# cleanse data
df['article'] = df['article'].apply(lambda x: data_cleansing(str(x)))
df['kurz_text'] = df['kurz_text'].apply(lambda x: data_cleansing(str(x)))
df['haupt_text'] = df['haupt_text'].apply(lambda x: data_cleansing(str(x)))

# concat article title, short text and main text into one field
df['all_text'] = df.apply(lambda row: '. '.join([row['article'], row['kurz_text'], row['haupt_text']]), axis=1)

df.head()

Unnamed: 0,category,article,kurz_text,haupt_text,is_leichte_sprache,all_text
0,Kultur,Gerhard Richter erklärt Kirchenfenster zu sein...,Im Kloster Tholey werden diese Woche neue Fens...,Seine abstrakten Bilder werden in den wichtigs...,0,Gerhard Richter erklärt Kirchenfenster zu sein...
1,Kultur,Das sind unsere Buchempfehlungen für die Ferien,"Alberne Eltern, fliegende Brötchen, Trolle und...","Egal wie heiß es ist, auf Bücher ist Verlass. ...",0,Das sind unsere Buchempfehlungen für die Ferie...
2,Kultur,Grauenhafte Leerstelle,"War der Autor des Welterfolgs ""Alice im Wunder...","Unstrittig ist, dass ""Alice im Wunderland"" nic...",0,Grauenhafte Leerstelle. War der Autor des Welt...
3,Kultur,HBO Max erteilt Deutschland für 2021 eine Absage,WarnerMedia kommt mit seiner Streamingplattfor...,Deutsche Film- und Serienfans werden auch künf...,0,HBO Max erteilt Deutschland für 2021 eine Absa...
4,Kultur,Javicia Leslie ist die neue Batwoman,Vor zwei Monaten stieg Ruby Rose überraschend ...,Sie wurde unter anderem als erste homosexuelle...,0,Javicia Leslie ist die neue Batwoman. Vor zwei...
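
As a quick check of the cleansing step, the following sketch applies the same regular-expression logic to a single string. The variable `sample` and its contents are made up for illustration and are not taken from the dataset.

```python
import re

def data_cleansing(text):
    # replace line breaks and carriage returns with spaces
    text = re.sub(r'[\r\n]', ' ', text)
    # collapse any run of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # drop leading and trailing blanks
    return text.strip()

# hypothetical input, not taken from the dataset
sample = '  Das ist\r\nein  Test.\n '
cleaned = data_cleansing(sample)
print(repr(cleaned))
```

Because `\s` already matches line breaks and carriage returns, the whitespace collapse alone should give the same result; the explicit replacements mainly document the intent.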


### Feature Extraction

In [None]:
# attempt to use word embeddings
import spacy

# problem with batch size of doc vectors -> sometimes 1x768, sometimes 2x768 (depending on number of batches)
# (without preprocessing)
def feature_extraction(row, nlp):
    feature = nlp(row[data_column])._.trf_data.tensors[-1]
    return feature

# load transformer
nlp = spacy.load('de_dep_news_trf')

# focus on important data, leave out rest
# copy the slice so that the assignments below do not trigger a SettingWithCopyWarning
df_embedding = df[[data_column, 'category']].copy()

# find max text length
max_string_length = df_embedding[data_column].map(len).max()
# pad every text to match the max text length
df_embedding[data_column] = df_embedding[data_column].apply(lambda x: x.ljust(max_string_length))

# get embedding vector for every entry in dataframe
df_embedding['feature'] = df_embedding.apply(lambda row: feature_extraction(row, nlp), axis=1)

save_object(df_embedding, 'word_embedding_df.pkl')

df_embedding.head()


In [15]:
df_embedding = load_object('word_embedding_df.pkl')
# -> vectors not sized correctly
df_embedding.apply(lambda row: print(row["feature"].shape), axis=1)

The last 5000 lines of the streaming output were truncated.
(1, 768)
(1, 768)
(2, 768)
(1, 768)
...

0        None
1        None
2        None
3        None
4        None
         ... 
11798    None
11799    None
11800    None
11801    None
11802    None
Length: 11803, dtype: object
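
One possible way to deal with the varying first dimension is to average over it, so that every text ends up with a single 768-dimensional vector. The sketch below shows the idea on plain nested lists. `mean_pool` is a hypothetical helper that is not part of the notebook, and whether averaging the batches suits this task is an assumption that would need to be checked.

```python
def mean_pool(feature):
    """Average a (n_batches, dim) feature over its first axis.

    `feature` can be any sequence of equally long rows, for example a
    nested list; the result is a flat list of length dim.
    """
    rows = [list(row) for row in feature]
    if not rows:
        raise ValueError('feature must contain at least one row')
    # zip(*rows) groups the values column by column
    return [sum(column) / len(rows) for column in zip(*rows)]

# toy features with 3 instead of 768 dimensions
one_batch = [[1.0, 2.0, 3.0]]
two_batches = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

pooled_one = mean_pool(one_batch)
pooled_two = mean_pool(two_batches)
print(pooled_one)
print(pooled_two)
```

If the stored features are NumPy arrays, the same reduction can be written as `feature.mean(axis=0)`.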

In [None]:
# Feature extraction testing ground

#from sklearn.model_selection import train_test_split
#from sklearn.feature_extraction.text import CountVectorizer
#import numpy as np
#import textstat

#textstat.set_lang("de")

#def vectorize(train_data, test_data):
#    def feature_extraction(text):
#        ### reading ease
#        reading_ease = textstat.flesch_reading_ease(text)
#        ### comma count
#        #comma_count = text.count(',')
#        ### dot count
#        #dot_count = text.count('.')
#        ### max word length
#        #words = re.split(r'\s|,', text)
#        #word_lengths = [len(a) for a in words]
#        #max_word_length = np.max(word_lengths)

#        ### todo: number of verbs at the end of a sentence?
#        return [reading_ease]

#    train_tfidf = [feature_extraction(a) for a in train_data]
#    test_tfidf = [feature_extraction(a) for a in test_data]
#    return train_tfidf, test_tfidf

#train_x, test_x, train_y, test_y = train_test_split(df['haupt_text'], df['is_leichte_sprache'], train_size=0.7, random_state=0)
#train_vec, test_vec = vectorize(train_x, test_x)
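
The surface features tried in the commented-out cell can be computed without any extra library. The following is a minimal sketch, assuming that a word is a run of word characters. `surface_features` is a hypothetical name, and the reading-ease score from `textstat` is left out.

```python
import re

def surface_features(text):
    """Return [comma_count, dot_count, max_word_length, word_count]."""
    comma_count = text.count(',')
    dot_count = text.count('.')
    # assumption: a word is a run of word characters
    words = re.findall(r'\w+', text)
    max_word_length = max((len(word) for word in words), default=0)
    return [comma_count, dot_count, max_word_length, len(words)]

# made-up example sentence
features = surface_features('Das ist gut. Ja, sehr gut.')
print(features)
```

Such a list could be returned from `feature_extraction` in place of `[reading_ease]` to feed the classifiers with a handful of numeric features per text.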

In [None]:
# testing ground for transformer

#nlp = spacy.load('de_dep_news_trf')
#text = df[data_column][0]
#doc = nlp(text)

#print(text)
#for sent in doc.sents:
#    token = sent[-2]
#    print(token.pos_, token.tag_)
#    print(sent.start_char,sent.end_char) 


#def count_end_of_sentence_pos(text):
#    vec = [0,0]
#    doc = nlp(text)

#    for sent in doc.sents:
#      token = sent[-2]
#      if token.tag_ == 'NN':
#          vec[0] = vec[0] + 1
#      elif token.tag_ == 'VB':
#          vec[1] = vec[1] + 1
    
#    return vec 


    

Seine abstrakten Bilder werden in den wichtigsten Museen weltweit ausgestellt und erzielen auf internationalen Kunstauktionen Rekordpreise. Jetzt verabschiedet sich Gerhard Richter mit drei Fenstern in einer Abtei von der hohen Welt der Gegenwartskunst.
NOUN NN
0 139
NOUN NN
140 253


### Mixed data classification

#### Vectorization

Cells in this segment create feature vectors.  
Only run one of these cells, corresponding to the vectorization you want to use

In [None]:
# Count vectorizer - no separation of Leichte Sprache and non-Leichte Sprache

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(train_data, test_data):
    vectorizer = CountVectorizer()
    # fit on train data + transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    return train_tfidf, test_tfidf

# train test split and vectorize texts
train_x, test_x, train_y, test_y = train_test_split(df[data_column], df['is_leichte_sprache'], train_size=0.7, random_state=0)
train_vec, test_vec = vectorize(train_x, test_x)

In [None]:
# tfidf vectorizer - no separation of Leichte Sprache and non-Leichte Sprache

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorize train_data and test_data
def vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data + transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    return train_tfidf, test_tfidf

# train test split and vectorize texts
train_x, test_x, train_y, test_y = train_test_split(df[data_column], df['is_leichte_sprache'], train_size=0.7, random_state=0)
train_vec, test_vec = vectorize(train_x, test_x)

#### Classifiers

Cells in this segment train a classifier based on the previously created train and test vectors.  
Only run one of these cells, corresponding to the classifier you want to use

In [None]:
# Multi-Layer Perceptron

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# train MLP classifier
clf = MLPClassifier()
clf.fit(train_vec, train_y)

              precision    recall  f1-score   support

           0       0.88      0.88      0.88      2160
           1       0.82      0.81      0.81      1381

    accuracy                           0.85      3541
   macro avg       0.85      0.85      0.85      3541
weighted avg       0.85      0.85      0.85      3541



In [None]:
# DecisionTree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# train DecisionTree classifier
clf = DecisionTreeClassifier()
clf.fit(train_vec, train_y)

              precision    recall  f1-score   support

           0       0.94      0.95      0.95      2160
           1       0.92      0.91      0.91      1381

    accuracy                           0.93      3541
   macro avg       0.93      0.93      0.93      3541
weighted avg       0.93      0.93      0.93      3541



In [None]:
# Naive Bayes

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# train Naive Bayes classifier
clf = GaussianNB()
clf.fit(train_vec, train_y)

              precision    recall  f1-score   support

           0       0.92      0.80      0.85      2160
           1       0.74      0.89      0.81      1381

    accuracy                           0.83      3541
   macro avg       0.83      0.84      0.83      3541
weighted avg       0.85      0.83      0.84      3541



In [None]:
# SVM

from sklearn.svm import SVC
from sklearn.metrics import classification_report

# train SVM classifier
clf = SVC()
clf.fit(train_vec, train_y)

              precision    recall  f1-score   support

           0       0.89      0.83      0.86      2160
           1       0.76      0.84      0.80      1381

    accuracy                           0.83      3541
   macro avg       0.82      0.83      0.83      3541
weighted avg       0.84      0.83      0.83      3541



#### Evaluation

Evaluate the previously run vectorization-classifier pair

In [None]:
# predict on test set
predictions = clf.predict(test_vec)

# print a classification report
print(classification_report(test_y, predictions))
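
To make the columns of the report easier to read, here is a plain-Python sketch of how precision, recall and F1 are defined per class. The helper `per_class_scores` and the toy labels are illustrative assumptions; the notebook itself relies on `classification_report`.

```python
def per_class_scores(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # guard against division by zero for labels that never occur
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {'precision': precision, 'recall': recall,
                         'f1': f1, 'support': tp + fn}
    return scores

# toy labels: 0 = non-Leichte Sprache, 1 = Leichte Sprache
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
toy_scores = per_class_scores(toy_true, toy_pred)
for label, values in toy_scores.items():
    print(label, values)
```

In this toy case one non-Leichte Sprache text is misclassified, which lowers the recall of class 0 and the precision of class 1.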

### Classification of Leichte Sprache without training on Leichte Sprache

#### Vectorization

Cells in this segment create feature vectors.  
Only run one of these cells, corresponding to the vectorization you want to use

In [None]:
# attempt to vectorize data

#from sklearn.model_selection import train_test_split
#from sklearn.feature_extraction.text import CountVectorizer
#import numpy as np
#import textstat

#textstat.set_lang("de")

#def vectorize(train_data, test_data=[]):
#    def feature_extraction(text):
#        ### reading ease
#        reading_ease = textstat.flesch_reading_ease(text)
#        ### comma count
#        comma_count = text.count(',')
#        ### dot count
#        dot_count = text.count('.')
#        ### max word length
#        words = re.split(r'\s|,', text)
#        word_lengths = [len(a) for a in words]
#        max_word_length = np.max(word_lengths)
#
#        ### word count
#        words = text.split()
#        word_count = len(words)
#
#        ### todo: number of verbs at the end of a sentence?
#        return [reading_ease, comma_count, dot_count, max_word_length, word_count]

#    train_tfidf = [feature_extraction(a) for a in train_data]
#    test_tfidf = [feature_extraction(a) for a in test_data]
#    return train_tfidf, test_tfidf


#df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
#df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

#train_x, test_x, train_y, test_y = train_test_split(df_non_leichte_sprache[data_column], df_non_leichte_sprache['category'], train_size=0.7, random_state=0)
#train_vec, test_vec = vectorize(train_x, test_x)

In [None]:
# Vectorizing data
# count vectorizer
# train/test vectors on non-Leichte Sprache
# further vectorization of Leichte Sprache data

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# count vectorizer method for train and test data
def count_vectorize(train_data, test_data):
    vectorizer = CountVectorizer()
    # fit on train data and transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    # return vectorized data and vocabulary
    return train_tfidf, test_tfidf, vectorizer.vocabulary_

# count vectorization with given vocabulary
def count_vectorize_with_vocab(data, vocab):
    vectorizer = CountVectorizer(vocabulary=vocab)
    # transform data by given vocabulary
    data_vec = vectorizer.transform(data).toarray()
    return data_vec

# separate Leichte Sprache and non-Leichte Sprache
df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

# train test split on non-Leichte Sprache and vectorize texts
train_x, test_x, train_y, test_y = train_test_split(df_non_leichte_sprache[data_column], df_non_leichte_sprache['category'], train_size=0.7, random_state=0)
train_vec, test_vec, vocabulary = count_vectorize(train_x, test_x)

# vectorize Leichte Sprache separately
vectorized_leichte_sprache = count_vectorize_with_vocab(df_leichte_sprache[data_column], vocabulary)
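
The idea of reusing the training vocabulary for a second corpus can be sketched in plain Python, without scikit-learn. The tokenization below (lowercased runs of at least two word characters) is an assumption meant to resemble the default behaviour of `CountVectorizer`, and the helper names and example texts are hypothetical.

```python
import re
from collections import Counter

# assumption: tokens are lowercased runs of at least two word characters
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(texts):
    """Map every token seen in the training texts to a column index."""
    terms = sorted({token for text in texts for token in tokenize(text)})
    return {term: index for index, term in enumerate(terms)}

def transform(texts, vocabulary):
    """Count vocabulary terms per text; unknown tokens are ignored."""
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        row = [0] * len(vocabulary)
        for term, index in vocabulary.items():
            row[index] = counts[term]
        vectors.append(row)
    return vectors

# made-up training texts and one text from another corpus
train_texts = ['Der Ball rollt', 'Der Hund']
toy_vocabulary = fit_vocabulary(train_texts)
other_vectors = transform(['Der Ball und der Ball'], toy_vocabulary)
print(toy_vocabulary)
print(other_vectors)
```

Words that occur only in the second corpus, such as `und` here, get no column. The classifier therefore only sees the part of a Leichte Sprache text that overlaps with the training vocabulary.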

In [None]:
# Vectorizing data
# tf-idf vectorizer
# train/test vectors on non-Leichte Sprache
# further vectorization of Leichte Sprache data

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf vectorization method for train and test data
def tfidf_vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data and transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    # return vectors and fitted vectorizer
    return train_tfidf, test_tfidf, vectorizer

# tfidf vectorization method with given vectorizer
def tfidf_vectorize_with_vectorizer(data, vectorizer):
    data_vec = vectorizer.transform(data).toarray()
    return data_vec

# separate Leichte Sprache and non-Leichte Sprache
df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

# train test split on non-Leichte Sprache and vectorize texts
train_x, test_x, train_y, test_y = train_test_split(df_non_leichte_sprache[data_column], df_non_leichte_sprache['category'], train_size=0.7, random_state=0)
train_vec, test_vec, vectorizer = tfidf_vectorize(train_x, test_x)

# vectorize Leichte Sprache with vectorizer fitted on training data
vectorized_leichte_sprache = tfidf_vectorize_with_vectorizer(df_leichte_sprache[data_column], vectorizer)
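
Unlike the count-based cell, this cell passes the fitted vectorizer itself and not only the vocabulary, because the inverse document frequencies learned from the training texts are needed to weight new texts. The sketch below illustrates this with a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1. The formula choice, the helper names and the toy tokens are assumptions for illustration, and the L2 normalization that `TfidfVectorizer` applies by default is omitted.

```python
import math
from collections import Counter

def fit_idf(tokenized_texts):
    """Learn one smoothed IDF weight per term from the training texts."""
    n_docs = len(tokenized_texts)
    # count in how many texts each term occurs at least once
    doc_freq = Counter(term for tokens in tokenized_texts for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Weight the term counts of one text with previously learned IDF values."""
    counts = Counter(tokens)
    # terms without a learned IDF weight are dropped
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

# made-up, already tokenized training texts
toy_train = [['der', 'ball'], ['der', 'hund']]
toy_idf = fit_idf(toy_train)
toy_weights = tfidf(['der', 'ball', 'ball', 'katze'], toy_idf)
print(toy_idf)
print(toy_weights)
```

A term that appears in every training text (`der`) receives the lowest weight, while a term that was never seen during fitting (`katze`) is dropped entirely.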

In [None]:
# Experimental vectorizer

#from sklearn.model_selection import train_test_split
#import spacy

#nlp = spacy.load('de_dep_news_trf')

#def vectorize(text):
#    data_vec = []
    
#    doc = nlp(text)

#    num_of_sents = len(list(doc.sents))+1
#    ### comma count
#    comma_count = text.count(',') + 1
#    comma_count_normalized = comma_count / num_of_sents
#    data_vec += [comma_count_normalized]
#    ### dot count
#    dot_count = text.count('.') + 1
#    dot_count_normalized = dot_count / num_of_sents
#    data_vec += [dot_count_normalized]
#    ### max word length
#    #words = re.split('\s|,', text)
#    #word_lengths = [len(a) for a in words]
#    #max_word_length = np.max(word_lengths)

#    return data_vec

#df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
#df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

#train_x, test_x, train_y, test_y = train_test_split(df_non_leichte_sprache[data_column], df_non_leichte_sprache['category'], train_size=0.7, random_state=0)
#train_vec = [vectorize(a) for a in train_x]
#test_vec = [vectorize(a) for a in test_x]

#vectorized_leichte_sprache = [vectorize(a) for a in df_leichte_sprache[data_column]]

In [None]:
# not used / bad/incomplete code

#from sklearn.model_selection import train_test_split
#from sklearn.feature_extraction.text import CountVectorizer
#import spacy
#import numpy as np
#import textstat

#textstat.set_lang("de")

#nlp = spacy.load('de_dep_news_trf')

#def vectorize(train_data, test_data=[]):
#    vocab = {}

#    def preprocess(text):
#        ## text to lower
#        text = text.lower()
#        ## remove numbers
#        text = re.sub(r'\d+', '', text)
#        ## split compound words by dashes
#        #text = re.sub('-', ' ', text)
#        #text = re.sub(r'\s+', ' ', text)
#        ## remove punctuation
#        text = text.translate(str.maketrans('', '', string.punctuation))
#        return text


#    def feature_extraction(text):
#        text = preprocess(text)
#        doc = nlp(text)
#        lemmas = [a.lemma_ for a in doc]
#        
#        return []

#    train_tfidf = [feature_extraction(a) for a in train_data]
#    test_tfidf = [feature_extraction(a) for a in test_data]
#    return train_tfidf, test_tfidf


#df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
#df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

#train_x, test_x, train_y, test_y = train_test_split(df_non_leichte_sprache[data_column], df_non_leichte_sprache['category'], train_size=0.7, random_state=0)
#train_vec, test_vec = vectorize(train_x, test_x)

#### Classifiers

Cells in this segment train a classifier based on the previously created train and test vectors.  
Only run one of these cells, corresponding to the classifier you want to use

In [None]:
# Classifier fitting
# Multi-Layer Perceptron

from sklearn.neural_network import MLPClassifier

classifier_name = 'Multi-Layer Perceptron'

# train MLP classifier
clf = MLPClassifier()
clf.fit(train_vec, train_y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [None]:
# Classifier fitting
# Naive Bayes

from sklearn.naive_bayes import GaussianNB

classifier_name = 'Naive Bayes'

# train Naive Bayes classifier
clf = GaussianNB()
clf.fit(train_vec, train_y)

              precision    recall  f1-score   support

      Kultur       0.84      0.85      0.85       708
 Nachrichten       0.87      0.86      0.87       655
       Sport       0.95      0.94      0.94       812

    accuracy                           0.89      2175
   macro avg       0.89      0.89      0.89      2175
weighted avg       0.89      0.89      0.89      2175



In [None]:
# Classifier fitting
# Random Forest

from sklearn.ensemble import RandomForestClassifier

classifier_name = 'Random Forest'

# train RandomForest classifier
clf = RandomForestClassifier()
clf.fit(train_vec, train_y)

              precision    recall  f1-score   support

      Kultur       0.76      0.91      0.83       708
 Nachrichten       0.91      0.81      0.86       655
       Sport       0.96      0.87      0.91       812

    accuracy                           0.87      2175
   macro avg       0.87      0.86      0.86      2175
weighted avg       0.88      0.87      0.87      2175



#### Evaluation

Evaluate the previously run vectorization-classifier pair

In [None]:
# Prediction and Evaluation
# trained on non-Leichte Sprache + test set non-Leichte Sprache 

from sklearn.metrics import classification_report

training_data_name = "non-Leichte Sprache"
evaluation_data_name = "non-Leichte Sprache"

print("Trained on", training_data_name)
print("Evaluated on", evaluation_data_name)

print("classifier:", classifier_name)

# predict on test set
predictions = clf.predict(test_vec)

print('Prediction on test set')
# show classification report
print(classification_report(test_y, predictions))

classifier: Multi-Layer Perceptron
Prediction on test set
              precision    recall  f1-score   support

      Kultur       0.89      0.93      0.91       708
 Nachrichten       0.94      0.91      0.92       655
       Sport       0.98      0.96      0.97       812

    accuracy                           0.94      2175
   macro avg       0.94      0.93      0.93      2175
weighted avg       0.94      0.94      0.94      2175

Prediction for Leichte Sprache
              precision    recall  f1-score   support

      Kultur       0.73      0.86      0.79      1304
 Nachrichten       0.89      0.81      0.85      2020
       Sport       0.98      0.94      0.96      1230

    accuracy                           0.86      4554
   macro avg       0.87      0.87      0.87      4554
weighted avg       0.87      0.86      0.86      4554



In [None]:
# Prediction and Evaluation
# trained on non-Leichte Sprache + test on Leichte Sprache

from sklearn.metrics import classification_report

training_data_name = "non-Leichte Sprache"
evaluation_data_name = "Leichte Sprache"

print("Trained on", training_data_name)
print("Evaluated on", evaluation_data_name)

print("classifier:", classifier_name)

# predict on Leichte Sprache
predictions = clf.predict(vectorized_leichte_sprache)

print('Prediction for Leichte Sprache')
# show classification report
print(classification_report(df_leichte_sprache['category'], predictions))

### Classification of Leichte Sprache with training on Leichte Sprache

#### Vectorization

Cells in this segment create feature vectors.  
Only run one of these cells, corresponding to the vectorization you want to use

In [None]:
# Vectorizing data
# count vectorizer
# train/test vectors on Leichte Sprache
# further vectorization of non-Leichte Sprache data

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# count vectorization method for train and test data
def count_vectorize(train_data, test_data):
    vectorizer = CountVectorizer()
    # fit on train data and transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    # return vectorized data and vocabulary
    return train_tfidf, test_tfidf, vectorizer.vocabulary_

# count vectorization method with given vocabulary
def count_vectorize_with_vocab(data, vocab):
    vectorizer = CountVectorizer(vocabulary=vocab)
    # transform data with given vocabulary
    data_vec = vectorizer.transform(data).toarray()
    return data_vec

# separate Leichte Sprache and non-Leichte Sprache
df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

# train test split on Leichte Sprache and vectorize Texts
train_x, test_x, train_y, test_y = train_test_split(df_leichte_sprache[data_column], df_leichte_sprache['category'], train_size=0.7, random_state=0)
train_vec, test_vec, vocabulary = count_vectorize(train_x, test_x)

# vectorize non-Leichte Sprache with given vocabulary from fitting train_data
vectorized_non_leichte_sprache = count_vectorize_with_vocab(df_non_leichte_sprache[data_column], vocabulary)

In [None]:
# Vectorizing data
# tf-idf vectorizer
# train/test vectors on Leichte Sprache
# further vectorization of non-Leichte Sprache data

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf vectorization method for train and test data
def tfidf_vectorize(train_data, test_data):
    vectorizer = TfidfVectorizer()
    # fit on train data and transform
    train_tfidf = vectorizer.fit_transform(train_data).toarray()
    # transform test data
    test_tfidf = vectorizer.transform(test_data).toarray() 
    # return vectors and vectorizer itself
    return train_tfidf, test_tfidf, vectorizer

# tfidf vectorization method with given vectorizer
def tfidf_vectorize_with_vectorizer(data, vectorizer):
    # transform data using given vectorizer
    data_vec = vectorizer.transform(data).toarray()
    return data_vec

# separate Leichte Sprache and non-Leichte Sprache
df_non_leichte_sprache = df[df['is_leichte_sprache'] == 0]
df_leichte_sprache = df[df['is_leichte_sprache'] == 1]

# train test split on Leichte Sprache and vectorize texts
train_x, test_x, train_y, test_y = train_test_split(df_leichte_sprache[data_column], df_leichte_sprache['category'], train_size=0.7, random_state=0)
train_vec, test_vec, vectorizer = tfidf_vectorize(train_x, test_x)

# vectorize non-Leichte Sprache by using the previously created vectorizer
vectorized_non_leichte_sprache = tfidf_vectorize_with_vectorizer(df_non_leichte_sprache[data_column], vectorizer)

#### Classifiers

Cells in this segment train a classifier based on the previously created train and test vectors.  
Only run one of these cells, corresponding to the classifier you want to use

In [None]:
# Classifier fitting
# Multi-Layer Perceptron

from sklearn.neural_network import MLPClassifier

classifier_name = 'Multi-Layer Perceptron'

# train MLP classifier
clf = MLPClassifier()
clf.fit(train_vec, train_y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

#### Evaluation

Evaluate the previously run vectorization-classifier pair

In [None]:
# Prediction and Evaluation
# only on Leichte Sprache

from sklearn.metrics import classification_report

training_data_name = "Leichte Sprache"
evaluation_data_name = "Leichte Sprache"

print("Trained on", training_data_name)
print("Evaluated on", evaluation_data_name)

print("Classifier:", classifier_name)

# predict on test set
predictions = clf.predict(test_vec)

print('Prediction on test set')
# show classification report
print(classification_report(test_y, predictions))

Trained on Leichte Sprache
Evaluated on Leichte Sprache
Classifier: Multi-Layer Perceptron
Prediction on test set
              precision    recall  f1-score   support

      Kultur       0.91      0.90      0.90       397
 Nachrichten       0.92      0.95      0.93       595
       Sport       0.99      0.96      0.98       375

    accuracy                           0.94      1367
   macro avg       0.94      0.94      0.94      1367
weighted avg       0.94      0.94      0.94      1367
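
The `macro avg` and `weighted avg` rows can be checked by hand: the macro average is the plain mean over the classes, while the weighted average weights each class by its support. The sketch below recomputes both for the precision column from the per-class values printed above. Since those values are already rounded to two decimals, the result only agrees with the report up to rounding.

```python
# per-class precision and support copied from the report above (rounded values)
precision = {"Kultur": 0.91, "Nachrichten": 0.92, "Sport": 0.99}
support = {"Kultur": 397, "Nachrichten": 595, "Sport": 375}

total = sum(support.values())
# macro average: every class counts equally
macro_precision = sum(precision.values()) / len(precision)
# weighted average: every class counts in proportion to its support
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(f"macro avg precision:    {macro_precision:.2f}")
print(f"weighted avg precision: {weighted_precision:.2f}")
```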



In [None]:
# Prediction and Evaluation
# train on Leichte Sprache, test on non-Leichte Sprache

from sklearn.metrics import classification_report

training_data_name = "Leichte Sprache"
evaluation_data_name = "non-Leichte Sprache"

print("Trained on", training_data_name)
print("Evaluated on", evaluation_data_name)

print("Classifier:", classifier_name)

# predict on non-Leichte Sprache
predictions = clf.predict(vectorized_non_leichte_sprache)

print('Prediction on test set')
# show classification report
print(classification_report(df_non_leichte_sprache['category'], predictions))

Trained on Leichte Sprache
Evaluated on non-Leichte Sprache
Classifier: Multi-Layer Perceptron
Prediction on test set
              precision    recall  f1-score   support

      Kultur       0.73      0.80      0.76      2357
 Nachrichten       0.80      0.75      0.78      2233
       Sport       0.93      0.89      0.91      2659

    accuracy                           0.82      7249
   macro avg       0.82      0.82      0.82      7249
weighted avg       0.82      0.82      0.82      7249



### Classification of Leichte Sprache with training on mixed data

#### Vectorization

Cells in this segment create feature vectors.  
Run only the cell for the vectorization you want to use.

In [None]:
# Vectorizing data
# count vectorizer
# train/test vectors on mixed language
# remove non-Leichte Sprache from test set

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# count vectorization method
def count_vectorize(train_data, test_data):
    vectorizer = CountVectorizer()
    # fit on train data and transform
    train_counts = vectorizer.fit_transform(train_data).toarray()
    # transform test data with the vocabulary learned from the train data
    test_counts = vectorizer.transform(test_data).toarray()
    return train_counts, test_counts, vectorizer.vocabulary_

# lowered train_size, because the test set shrinks when non-Leichte Sprache rows are removed
train_x, test_x, train_y, test_y = train_test_split(df[data_column], df[['category', 'is_leichte_sprache']], train_size=0.6, random_state=0)
# keep only the label column as a 1-D Series, which is the shape sklearn expects for y
train_y = train_y['category']

# remove entries from test set that are not Leichte Sprache
# (test_x and test_y share the same index, so one boolean mask filters both)
is_leichte_sprache = test_y['is_leichte_sprache'] == 1
test_x = test_x[is_leichte_sprache]
test_y = test_y.loc[is_leichte_sprache, 'category']

# vectorize data
train_vec, test_vec, vocabulary = count_vectorize(train_x, test_x)
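
The split-then-filter step above can be tried in isolation. The sketch below is a minimal illustration on a made-up DataFrame: it splits mixed data and then keeps only the Leichte Sprache rows in the test part. The column name `text` and all toy values are assumptions; the notebook itself uses `data_column` and the real data frame `df`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical mixed data: 1 marks Leichte Sprache, 0 marks everything else
toy_df = pd.DataFrame({
    "text": [f"satz {i}" for i in range(10)],
    "category": ["Sport", "Kultur"] * 5,
    "is_leichte_sprache": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

# split the mixed data; the train part keeps both kinds of text
toy_train, toy_test = train_test_split(toy_df, train_size=0.6, random_state=0)

# keep only Leichte Sprache rows in the test part
mask = toy_test["is_leichte_sprache"] == 1
toy_test_x = toy_test.loc[mask, "text"]
toy_test_y = toy_test.loc[mask, "category"]

print(len(toy_train), "training rows,", len(toy_test_x), "of", len(toy_test), "test rows kept")
```

Filtering after the split is why the cell above lowers `train_size`: part of the held-out rows is discarded, so a larger held-out share is needed to retain enough test examples.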


#### Classifiers

Cells in this segment train a classifier on the previously created training vectors.  
Run only the cell for the classifier you want to use; the cell run last sets the `clf` that is evaluated.

In [None]:
# Classifier fitting
# Multi-Layer Perceptron

from sklearn.neural_network import MLPClassifier

classifier_name = 'Multi-Layer Perceptron'

# train MLP classifier
clf = MLPClassifier()
clf.fit(train_vec, train_y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

#### Evaluation

Evaluate the vectorization-classifier pair that was run last.
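
The evaluation cells in this notebook repeat the same print and report calls with different data. One possible way to bundle them is a small helper such as the hypothetical `evaluate` function sketched below; it is not part of the notebook, and the logistic regression and toy vectors only serve to make the example runnable.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def evaluate(clf, vectors, labels, classifier_name, training_data_name, evaluation_data_name):
    """Print the run description and the classification report; return the predictions."""
    print("Trained on", training_data_name)
    print("Evaluated on", evaluation_data_name)
    print("Classifier:", classifier_name)
    predictions = clf.predict(vectors)
    print(classification_report(labels, predictions))
    return predictions

# hypothetical toy data and classifier, only for demonstration
toy_vec = [[0, 2], [2, 0], [0, 3], [3, 0]]
toy_y = ["Kultur", "Sport", "Kultur", "Sport"]
toy_clf = LogisticRegression().fit(toy_vec, toy_y)

toy_predictions = evaluate(toy_clf, toy_vec, toy_y, "Logistic Regression", "toy data", "toy data")
```

With the notebook's variables, a call would look like `evaluate(clf, test_vec, test_y, classifier_name, training_data_name, evaluation_data_name)`.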

In [None]:
# Prediction and Evaluation
# train on mixed data, test on Leichte Sprache

from sklearn.metrics import classification_report

training_data_name = "Mixed data"
evaluation_data_name = "Leichte Sprache"

print("Trained on", training_data_name)
print("Evaluated on", evaluation_data_name)

print("Classifier:", classifier_name)

# predict on test set
predictions = clf.predict(test_vec)

print('Prediction on test set')
# show classification report
print(classification_report(test_y, predictions))

Trained on Mixed data
Evaluated on Leichte Sprache
Classifier: Multi-Layer Perceptron
Prediction on test set
              precision    recall  f1-score   support

      Kultur       0.92      0.85      0.89       536
 Nachrichten       0.90      0.96      0.93       790
       Sport       0.98      0.96      0.97       497

    accuracy                           0.93      1823
   macro avg       0.94      0.92      0.93      1823
weighted avg       0.93      0.93      0.93      1823

