# Datacleaning and metadata generation

This notebook contains a script to clean and process our simple language dataset, in order to
do some minor analysis and gather some metadata.

In [30]:
import pandas as pd
import spacy

## Cleaning
Our csv files contains many categories we will not use, and splits the article into headline, short-text and body.
We will create a file that outputs the labeled data with the entire article in one piece.

We set the file we want to work with here, along with a couple of output parameters.

In [64]:
csv_ext = ".csv"
txt_ext = ".txt"
simple_in = "leicht_nachricht"
simple_out = "leichte_sprache"
norm_cult_in = "Kultur_normal"
norm_cult_out = "kultur_normal"
norm_sport_in = "Sport_normal"
norm_sport_out = "sport_normal"
norm_poli_in = "Politik_normal"
norm_poli_out = "politik_normal"

input_filename = norm_poli_in + csv_ext
output_filename_csv = "clean_" + norm_poli_out + csv_ext
metadata_output_filename = norm_poli_out + "_meta" + txt_ext
metadata_string = ""

df = pd.read_csv(input_filename)
display(df)

Unnamed: 0,Line_ID,day,month,year,category,article,kurz_text,haupt_text
0,0,5,5,2021,Politik,Spahn kündigt digitalen Impfpass zum Sommer an,\r\nDie EU arbeitet an einem digitalen Impfzer...,\r\nNachdem die Bundesregierung Lockerungen fü...
1,1,25,1,2021,Politik,»Durchhalteparolen helfen Familien nicht«,"\r\nSchulen dicht, Kitas im Notbetrieb – Famil...","\r\nSPIEGEL: Frau Midyatli, Sie haben zwei Sch..."
2,2,10,2,2021,Politik,Wen Sie in Rheinland-Pfalz wählen wollen,\r\nWer regiert künftig in Mainz? In Rheinland...,"\r\nIn Rheinland-Pfalz gibt es sie noch, die s..."
3,3,3,4,2021,Politik,Dieser »Witz« ist Rassismus aus Bequemlichkeit,\r\nDer Satiriker Helmut Schleich malt sein Ge...,\r\nSatire darf ziemlich viel. Vor allem darf ...
4,4,4,6,2021,Politik,Union und SPD lehnen Hilfe für Italien bei Boo...,\r\nZuletzt erreichten wieder deutlich mehr Bo...,\r\nPolitiker der Union und SPD haben einem Au...
...,...,...,...,...,...,...,...,...
2228,2228,20,4,2021,Politik,\r\n\r\n\r\n,\r\nArmin Laschet wird die Union in den Bundes...,\r\nArmin Laschet ist Kanzlerkandidat der Unio...
2229,2229,31,5,2021,Politik,Politiker empören sich nach Wahl von Max Otte ...,\r\nNach der Wahl Max Ottes zum neuen Vorsitze...,\r\nDer rechtsgerichtete Ökonom Max Otte ist n...
2230,2230,7,5,2021,Politik,\r\n\r\n\r\n,\r\nDie Konferenz zur Zukunft Europas ist gere...,\r\nDas monatelange Gezerre um die Ausgestaltu...
2231,2231,10,6,2021,Politik,Was vom G7-Treffen zu erwarten ist,"\r\nLiebe Leserin, lieber Leser, guten Morgen,...",\r\nBoris Johnson darf wieder spielen – den Ga...


In [65]:
cleanArray = []
for index in df.loc[:,"Line_ID"]:
    if index in [1285, 3781, 5648, 5663]:
        #there are four articles without a body. We simply exclude those from our data for simplicity
        pass
    else:
        #first we extract the article
        article = df.loc[index,"article"]
    
        #next up the short text
        short_text = df.loc[index,"kurz_text"]
    
        #and finally the main body of the article
    
        body = df.loc[index,"haupt_text"]
        if type(body) is float:
            print(index)
            break
    
        #our resulting text contains some unicode symbols, namely \xa0, that we dont want, so we replace it with
        #a space. We also remove line breaks
        
        text = article + " " + short_text + body
        text = text.replace(u'\xa0', u' ')
        text = text.replace("\n", "")
        text = text.replace("\r", "")
    
    
        newEntry = [text, df.loc[index, "category"]]
    
    
    cleanArray.append(newEntry)
    
    
cleanDf_with_misc = pd.DataFrame(cleanArray, columns=["text", "label"])
#Our data contained articles with the category "miscellaneous", which we dont intend to use
#We therefore exclude it from this file
is_not_misc = cleanDf_with_misc["label"]!="Vermischtes"
print(is_not_misc)
cleanDf = cleanDf_with_misc[is_not_misc]

0       True
1       True
2       True
3       True
4       True
        ... 
2228    True
2229    True
2230    True
2231    True
2232    True
Name: label, Length: 2233, dtype: bool


In [66]:
display(cleanDf)

Unnamed: 0,text,label
0,Spahn kündigt digitalen Impfpass zum Sommer an...,Politik
1,»Durchhalteparolen helfen Familien nicht« Schu...,Politik
2,Wen Sie in Rheinland-Pfalz wählen wollen Wer r...,Politik
3,Dieser »Witz« ist Rassismus aus Bequemlichkeit...,Politik
4,Union und SPD lehnen Hilfe für Italien bei Boo...,Politik
...,...,...
2228,Armin Laschet wird die Union in den Bundestag...,Politik
2229,Politiker empören sich nach Wahl von Max Otte ...,Politik
2230,Die Konferenz zur Zukunft Europas ist gerette...,Politik
2231,Was vom G7-Treffen zu erwarten ist Liebe Leser...,Politik


In [67]:
cleanDf.to_csv(output_filename_csv)

### Now we have cleaned data, and we can do a little analysis for gathering metadata.
We begin by counting how many articles we have total, and split by label

In [68]:
is_kultur = cleanDf["label"] == "Kultur"
kultur = cleanDf[is_kultur]
is_nachrichten = cleanDf["label"] == "Nachrichten"
nachrichten = cleanDf[is_nachrichten]
is_sport = cleanDf["label"] == "Sport"
sport = cleanDf[is_sport]

In [69]:
print("Number of articles labelled \"culture\": " + str(len(kultur)))
print("Number of articles labelled \"news\": " + str(len(nachrichten)))
print("Number of articles labelled \"sport\": " + str(len(sport)))
print("Total number of articles: " + str(len(kultur) + len(nachrichten) + len(sport)))
metadata_string += "Number of articles labelled \"culture\": " + str(len(kultur)) + "\n"
metadata_string += "Number of articles labelled \"news\": " + str(len(nachrichten)) + "\n"
metadata_string += "Number of articles labelled \"sport\": " + str(len(sport)) + "\n"
metadata_string += "Total number of articles: " + str(len(kultur) + len(nachrichten) + len(sport)) + "\n"

Number of articles labelled "culture": 0
Number of articles labelled "news": 0
Number of articles labelled "sport": 0
Total number of articles: 0


Now we get a little more detailed. We determine the number of words and types in our entire collection, the average length of an article, and we compute the average flesch reading ease of our texts, using the textstat package

In [70]:
import textstat

In [71]:
textstat.set_lang("de") #we're dealing with german text


input_string = ""
aggregate_reading_ease = 0
aggregate_article_length = 0
for text in cleanDf["text"] :
    input_string += text + " "
    aggregate_reading_ease += textstat.flesch_reading_ease(text)
    aggregate_article_length += textstat.lexicon_count(text)
    
#for some things, we want all the articles together, for some global metrics

lexicon = textstat.lexicon_count(input_string, removepunct=True)
print("Number of unique words:", lexicon)
metadata_string += "Number of unique words:" + str(lexicon) + "\n"
    
average_reading_ease = aggregate_reading_ease / len(cleanDf["text"])
print("average flesch reading ease of our articles:", average_reading_ease)
metadata_string += "average flesch reading ease of our articles:" + str(average_reading_ease) + "\n"

average_article_length = aggregate_article_length / len(cleanDf["text"])
print("average length of an article (words):", average_article_length)
metadata_string += "average length of an article (words):" + str(average_article_length) + "\n"

Number of unique words: 200674
average flesch reading ease of our articles: 33.060528437080166
average length of an article (words): 89.86744290192566


In [72]:
input_string = input_string.lower()

#rewrite input to all lower case, to avoid counting the same type twice due to capitalization
input_words = input_string.split(" ")

In [73]:
types = 0
alreadySeen = []
#initialize counters for types and tokens
#initialize a list to keep track of the types already counted
for word in input_words:
    if word[:-1] == ".":
        word = word[:-1]
    if word in alreadySeen:
        pass
    else:
        alreadySeen.append(word)
        types = types + 1
        
#go through each word, counting the token up each time, and counting the types each time a new one occurs


In [74]:
print("Number of types in the collection:", types)
metadata_string += "Number of types in the collection:" + str(types) + "\n"

Number of types in the collection: 33140


In [75]:
aggregate_article_length = 0
for text in kultur["text"] :
    aggregate_article_length += textstat.lexicon_count(text)
    
if len(kultur["text"]) > 0:
    average_article_length = aggregate_article_length / len(kultur["text"])
    print("culture: average length of an article (words):", average_article_length)
    metadata_string += "culture: average length of an article (words):" + str(average_article_length) + "\n"

In [76]:
aggregate_article_length = 0
for text in sport["text"] :
    aggregate_article_length += textstat.lexicon_count(text)
    
if len(sport["text"]) > 0 :
    average_article_length = aggregate_article_length / len(sport["text"])
    print("sport: average length of an article (words):", average_article_length)
    metadata_string += "sport: average length of an article (words):" + str(average_article_length) + "\n"

In [77]:
aggregate_article_length = 0
for text in nachrichten["text"] :
    aggregate_article_length += textstat.lexicon_count(text)
    
if len(nachrichten["text"]) > 0:   
    average_article_length = aggregate_article_length / len(nachrichten["text"])
    print("politics: average length of an article (words):", average_article_length)
    metadata_string += "politics: average length of an article (words):" + str(average_article_length) + "\n"

In [78]:
with open(metadata_output_filename, "a") as meta_file:
    meta_file.write(metadata_string)
    print("wrote metadata into a file")

wrote metadata into a file
