# Datacleaning and metadata generation

This notebook contains a script to clean and process our simple language dataset, in order to
do some minor analysis and gather some metadata.

In [1]:
import pandas as pd
import spacy
from collections import Counter

## Cleaning
Our csv files contains many categories we will not use, and splits the article into headline, short-text and body.
We will create a file that outputs the labeled data with the entire article in one piece.

We set the file we want to work with here, along with a couple of output parameters.

In [2]:
csv_ext = ".csv"
txt_ext = ".txt"
in_directory = "raw_data/"
out_directory = "clean_data/"
meta_directory = "metadata/"

simple_in = "leicht_nachricht"
simple_out = "leichte_sprache"
norm_cult_in = "Kultur_normal"
norm_cult_out = "kultur_normal"
norm_sport_in = "Sport_normal"
norm_sport_out = "sport_normal"
norm_poli_in = "Politik_normal"
norm_poli_out = "politik_normal"

input_filename = in_directory + simple_in + csv_ext
output_filename_csv = out_directory + "clean_" + norm_poli_out + csv_ext
metadata_output_filename = meta_directory + norm_poli_out + "_meta" + txt_ext
metadata_string = ""

df = pd.read_csv(input_filename)
display(df)

Unnamed: 0,Line_ID,day,month,year,category,article,kurz_text,haupt_text,audio_link
0,0,14,Mai,2021,Kultur,\nFernseh-Preise \n,Die Grimme-Preise sind die wichtigsten Fernseh...,\nDie Moderatorin Caren Miosga wird besonders ...,['https://ondemand-mp3.dradio.de/file/dradio/2...
1,1,14,Mai,2021,Kultur,\nFilm-Festival in Berlin\n,Jedes Jahr gibt es in Berlin ein großes Festiv...,\nDie Chefin von der Berlinale heißt Mariette ...,['https://ondemand-mp3.dradio.de/file/dradio/2...
2,2,14,Mai,2021,Kultur,\nPreise für Schulen\n,7 Schulen in Deutschland haben einen Preis bek...,\nDer Bundes-Präsident heißt Frank-Walter Stei...,['https://ondemand-mp3.dradio.de/file/dradio/2...
3,3,7,Mai,2021,Kultur,\nParty in Liverpool\n,In der Stadt Liverpool in dem Land England hat...,\nDie Menschen durften zusammen ohne Maske und...,['https://ondemand-mp3.dradio.de/file/dradio/2...
4,4,7,Mai,2021,Kultur,\nDeutschland gibt Kunst zurück\n,Das afrikanische Land Nigeria fordert seit lan...,\nBei den Kunst-Werken handelt es sich um Meta...,['https://ondemand-mp3.dradio.de/file/dradio/2...
...,...,...,...,...,...,...,...,...,...
5819,5819,16,Februar,2013,Nachrichten,\nNeue Bildungs-Ministerin\n,Johanna Wanka ist neue Ministerin für Bildung ...,\nDie CDU-Politikerin Johanna Wanka ist die ne...,
5820,5820,16,Februar,2013,Nachrichten,\nHomo-Ehe in Frankreich\n,In Frankreich können schwule und lesbische Paa...,\n\n\nIn Frankreich sollen gleich-geschlechtli...,
5821,5821,16,Februar,2013,Nachrichten,\nNord-Korea testet Atom-Bombe \n,Nord-Korea hat eine Atom-Bombe getestet. Fast ...,\nDie Regierung von Nord-Korea hat zum dritten...,
5822,5822,16,Februar,2013,Nachrichten,\nObama hält wichtige Rede\n,"Der Präsident der USA, Barack Obama, hat eine ...","\nIn seiner Rede sagte Obama, dass in den USA ...",


In [3]:
cleanArray = []
for index in df.loc[:,"Line_ID"]:
    if index in [1285, 3781, 5648, 5663]:
        #there are four articles without a body. We simply exclude those from our data for simplicity
        pass
    else:
        #first we extract the article
        article = df.loc[index,"article"]
    
        #next up the short text
        short_text = df.loc[index,"kurz_text"]
    
        #and finally the main body of the article
    
        body = df.loc[index,"haupt_text"]
        if type(body) is float:
            print(index)
            break
    
        #our resulting text contains some unicode symbols, namely \xa0, that we dont want, so we replace it with
        #a space. We also remove line breaks
        
        text = article + " " + short_text + body
        text = text.replace(u'\xa0', u' ')
        text = text.replace("\n", "")
        text = text.replace("\r", "")
    
    
        newEntry = [text, df.loc[index, "category"]]
    
    
    cleanArray.append(newEntry)
    
    
cleanDf_with_misc = pd.DataFrame(cleanArray, columns=["text", "label"])
#Our data contained articles with the category "miscellaneous", which we dont intend to use
#We therefore exclude it from this file
is_not_misc = cleanDf_with_misc["label"]!="Vermischtes"
print(is_not_misc)
cleanDf = cleanDf_with_misc[is_not_misc]

0       True
1       True
2       True
3       True
4       True
        ... 
5819    True
5820    True
5821    True
5822    True
5823    True
Name: label, Length: 5824, dtype: bool


In [4]:
display(cleanDf)

Unnamed: 0,text,label
0,Fernseh-Preise Die Grimme-Preise sind die wic...,Kultur
1,Film-Festival in Berlin Jedes Jahr gibt es in ...,Kultur
2,Preise für Schulen 7 Schulen in Deutschland ha...,Kultur
3,Party in Liverpool In der Stadt Liverpool in d...,Kultur
4,Deutschland gibt Kunst zurück Das afrikanische...,Kultur
...,...,...
5819,Neue Bildungs-Ministerin Johanna Wanka ist neu...,Nachrichten
5820,Homo-Ehe in Frankreich In Frankreich können sc...,Nachrichten
5821,Nord-Korea testet Atom-Bombe Nord-Korea hat e...,Nachrichten
5822,Obama hält wichtige Rede Der Präsident der USA...,Nachrichten


In [5]:
cleanDf.to_csv(output_filename_csv)

### Now we have cleaned data, and we can do a little analysis for gathering metadata.
We begin by counting how many articles we have total, and split by label

In [6]:
is_kultur = cleanDf["label"] == "Kultur"
kultur = cleanDf[is_kultur]
is_nachrichten = cleanDf["label"] == "Nachrichten"
nachrichten = cleanDf[is_nachrichten]
is_sport = cleanDf["label"] == "Sport"
sport = cleanDf[is_sport]

In [7]:
print("Number of articles labelled \"culture\": " + str(len(kultur)))
print("Number of articles labelled \"news\": " + str(len(nachrichten)))
print("Number of articles labelled \"sport\": " + str(len(sport)))
print("Total number of articles: " + str(len(kultur) + len(nachrichten) + len(sport)))
metadata_string += "Number of articles labelled \"culture\": " + str(len(kultur)) + "\n"
metadata_string += "Number of articles labelled \"news\": " + str(len(nachrichten)) + "\n"
metadata_string += "Number of articles labelled \"sport\": " + str(len(sport)) + "\n"
metadata_string += "Total number of articles: " + str(len(kultur) + len(nachrichten) + len(sport)) + "\n"

Number of articles labelled "culture": 1304
Number of articles labelled "news": 2020
Number of articles labelled "sport": 1230
Total number of articles: 4554


Now we get a little more detailed. We determine the number of words and types in our entire collection, the average length of an article, and we compute the average flesch reading ease of our texts, using the textstat package

In [8]:
import textstat

In [9]:
textstat.set_lang("de") #we're dealing with german text


input_string = ""
aggregate_reading_ease = 0
aggregate_article_length = 0
for text in cleanDf["text"] :
    input_string += text + " "
    aggregate_reading_ease += textstat.flesch_reading_ease(text)
    aggregate_article_length += textstat.lexicon_count(text)
    
#for some things, we want all the articles together, for some global metrics

lexicon = textstat.lexicon_count(input_string, removepunct=True)
print("Number of unique words:", lexicon)
metadata_string += "Number of unique words:" + str(lexicon) + "\n"
    
average_reading_ease = aggregate_reading_ease / len(cleanDf["text"])
print("average flesch reading ease of our articles:", average_reading_ease)
metadata_string += "average flesch reading ease of our articles:" + str(average_reading_ease) + "\n"

average_article_length = aggregate_article_length / len(cleanDf["text"])
print("average length of an article (words):", average_article_length)
metadata_string += "average length of an article (words):" + str(average_article_length) + "\n"

Number of unique words: 716642
average flesch reading ease of our articles: 62.76925779534482
average length of an article (words): 157.36539306104524


In [10]:
input_string = input_string.lower()

#rewrite input to all lower case, to avoid counting the same type twice due to capitalization
input_words = input_string.split(" ")

In [11]:
testcounter = Counter(input_words)

types = len(testcounter.keys())
# we use the counter class to obtain unique words

In [12]:
print("Number of types in the collection:", types)
metadata_string += "Number of types in the collection:" + str(types) + "\n"

Number of types in the collection: 47602


In [13]:
aggregate_article_length = 0
for text in kultur["text"] :
    aggregate_article_length += textstat.lexicon_count(text)
    
if len(kultur["text"]) > 0:
    average_article_length = aggregate_article_length / len(kultur["text"])
    print("culture: average length of an article (words):", average_article_length)
    metadata_string += "culture: average length of an article (words):" + str(average_article_length) + "\n"

culture: average length of an article (words): 150.82898773006136


In [14]:
aggregate_article_length = 0
for text in sport["text"] :
    aggregate_article_length += textstat.lexicon_count(text)
    
if len(sport["text"]) > 0 :
    average_article_length = aggregate_article_length / len(sport["text"])
    print("sport: average length of an article (words):", average_article_length)
    metadata_string += "sport: average length of an article (words):" + str(average_article_length) + "\n"

sport: average length of an article (words): 137.24471544715448


In [15]:
aggregate_article_length = 0
for text in nachrichten["text"] :
    aggregate_article_length += textstat.lexicon_count(text)
    
if len(nachrichten["text"]) > 0:   
    average_article_length = aggregate_article_length / len(nachrichten["text"])
    print("politics: average length of an article (words):", average_article_length)
    metadata_string += "politics: average length of an article (words):" + str(average_article_length) + "\n"

politics: average length of an article (words): 173.83663366336634


We gathered metadata about our sets in the metadata_string, which we now want to write to a file for easy access.

In [16]:
with open(metadata_output_filename, "a") as meta_file:
    meta_file.write(metadata_string)
    print("wrote metadata into a file")

wrote metadata into a file


Author: Henri Thölke