# Data Preprocessing

The best results in machine translation are still achieved by a good and big dataset with aligned sentences in source and target language.

From [Tatoabe](https://tatoeba.org/) you can download already aligned sentences in German and Low German which were created by the community. As the dataset contains a relatively small amount of ~17.000 sentences and moreover with different spelling and grammar, it is just a first step for building a databasis.

Moreover there is [Low German Wikipedia](https://nds.wikipedia.org/wiki/Wikipedia:H%C3%B6%C3%B6ftsiet) with over 60.000 articles (April 2020). An idea would be to align suitable sentences with the German wikipedia. Luckily Facebook's research team did this already for [all Wikipedial languages](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix) including as a side product Low German-German. A first look into that datasets shows that there are many mismatched sentences. Moreover there is still the same problem of different spelling and grammar. Still this could be a good approach to extend the databasis.


The goal of this notebook is to get clean datasets for training the different translation models. To achieve that we have to correct as much as possible the spelling and have to delete mismatched sentences.

In [224]:
import pandas as pd
import numpy as np

## Tatoabe dataset

As a first step we load in the dataset from Tatoabe. You can download the tsv files from the website. Moreover you will need the "links.csv" from the website to match the sentences from German and Low German by their sentence id.

In [225]:
# creating column names for the tables
column_names_platt = ["id", "language", "nds"]
column_names_deu = ["id", "language", "deu"]

nds_sentences = pd.read_csv("nds_sentences.tsv", sep= "\t", header = None, names=column_names_platt)
deu_sentences = pd.read_csv("deu_sentences.tsv", sep= "\t", header = None, names=column_names_deu)
link_sentences = pd.read_csv("links.csv", sep= "\t", header = None, names=["origin","translation"])

In [226]:
nds_sentences.sample(5)

Unnamed: 0,id,language,nds
12580,1022693,nds,Hest du en Fraag?
8679,926514,nds,Männigmaal is dat Verbetern sworer as dat Schr...
7573,899881,nds,Nadem hebbt wi em nich mehr sehn.
2891,835145,nds,Ik drink nich so veel Beer.
143,791267,nds,He is mien Fründ.


In [227]:
deu_sentences.head(2)

Unnamed: 0,id,language,deu
0,77,deu,Lass uns etwas versuchen!
1,78,deu,Ich muss schlafen gehen.


In [228]:
link_sentences.head(2)

Unnamed: 0,origin,translation
0,1,1276
1,1,2481


In [229]:
df = link_sentences.merge(deu_sentences
                     , left_on = "origin"
                     , right_on = "id").merge(nds_sentences
                                              , left_on="translation", right_on="id")

In [230]:
df = df[["deu", "nds"]]

In [231]:
df[df["nds"].str.contains('\(')]

Unnamed: 0,deu,nds
16769,"Es ist schon fast sieben, wir müssen zur Schul...",Dat is dicht an söven. Wi mööt na School (gahn).
16770,"Es ist schon fast sieben Uhr, wir müssen zur S...",Dat is dicht an söven. Wi mööt na School (gahn).
16771,"Es ist schon fast sieben, wir müssen zur Schule.",Dat is dicht an söven. Wi mööt na School (gahn).
16894,"Ich mag wxWidgets, weil es im Gegensatz zu and...",Ik mag wxWidgets vonwegen dat dat anners as an...
17483,Wir rufen euch an.,Wü röp juu ön (frr)
17484,Spielst du Baseball?,Spölest Baseball? (frr)
17485,Sie wird es dir sagen.,Jü wel et di sii. (frr)
17486,Tom brachte mir eine Tasse Tee.,Tom braacht mi en Kop Tee. (frr)
17487,Wir hatten ein gutes Gespräch.,Wü her en gur Snak. (frr)
17488,Wir wollten mit ihr sprechen.,Wü wil me höör snaki (frr)


In [232]:
df = df.drop(df[df["nds"].str.contains('\(frr')].index)

In [233]:
df.sample(5)

Unnamed: 0,deu,nds
17419,Was soll ich tun? Ich habe zu nichts Lust!,Wat schall ik doon? Ik heff to nix Lust!
3345,Ich habe drei Hunde.,Ik heff dree Hunnen.
10619,Ich habe hier gewohnt.,Ik heff hier wahnt.
8892,"Ist es nicht so, dass wir alle nackt in unsere...","Is dat nich so, dat wi all naakt in ins Kledaa..."
1876,"Die Leute waren es früher gewohnt, zu Fuß zu r...","De Lüüd weren dat fröher so wennt, to Foot to ..."


In [234]:
df[df["nds"].str.contains('(?i) ick')]

Unnamed: 0,deu,nds
1094,Jetzt habe ich Zeit.,Nu häb ick Tied.
4949,"Rede du nur, ich tue, was ich will.","Küer man to, ick do wat ick wil."
15997,"Gut, dann bedanke ich mich auch.","Na, denn bedank ick mi ok."
16138,In diesem Stadium der Weltgeschichte gibt es s...,Op dat hierste Stadium vun de Werltgeschicht g...
16532,Wann soll ich kommen?,Üm wölke Klock mutt ick kamen?
16746,Hier kann ich dir nicht zustimmen.,Dorup kann ick nich mit di tostimmen.
16802,Wann soll ich einchecken?,Wo lååt mutt ick inchecken?


In [235]:
print(len(df))
df = df[(df.deu.str.len() < 70) & (df.nds.str.len() < 70)]
print(len(df))

17483
16895


## Wikipedia parallel sentences

Facebook aligned through all languages of Wikipedia suitable sentences and published it on: https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix

Luckily there is a lower german version of Wikipedia with over 60.000 articles. Of course many of them are specific for lower german topics and therefore there is no german version of the article. Still Facebook managed it to align 70.000 sentences.
We will use this dataset to have a bigger databasis

In [17]:

wiki_deu = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds-de.txt",
                       sep="\n+",engine='python', encoding="utf-8",
                       header=None, names=["deu"])



In [18]:
wiki_nds = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds-nds.txt"
                       , sep="\n+",engine='python'
                       , encoding="utf-8", names=["nds"])

In [19]:
wiki_df = wiki_deu.join(wiki_nds)
print(len(wiki_df))
# exclude long sentences for faster training
wiki_df = wiki_df[(wiki_df.deu.str.len() < 70) & (wiki_df.nds.str.len() < 70)]
print(len(wiki_df))

df = df.append(wiki_df, ignore_index=True)

75590
19083


In [20]:
len(df)

35978

# Generalizing the low german words

Here we try to correct the sentences with the "official" spelling according to [Sass wordbook](http://sass-platt.de/)



In [1]:
import docx
import re

The available dictionary is not ordered. But the German words are written in bold. Further German subwords are written in italic. The translated words are following the bold or the italic words. First we will try to get the index range of each translation.
For that we need the paragraph in which the word is and the range inside the paragraph.


In [39]:
# function to find only the first translation
# easier to understand in combination when used inside the loop
def find_first_translation(translation_id, runs):
        translation = ''
        first_row_of_ones = False
        for idx,i in enumerate(translation_id):
            if i == 1:
                translation += runs[idx].text
                first_row_of_ones = True
            elif i == 0 and first_row_of_ones:
                break
        return translation

In [46]:
def get_word_list(document, word_list, translation_font = "Rockwell"):
    idx_col = 0
    idx_row = 0
    string = ""
    for paragraph in document.paragraphs:
        # grab only the lines which start with a word / other lines are additional content or page numbers
        new_line = re.match(r'\A[A-Za-zÄäÜüÖö]', paragraph.text)
        if new_line:
            # write the word into the column 0 with the format bold and not italic which stands for the basis word
            word_list.iloc[idx_row,0] = string.join([x.text for x in paragraph.runs 
                                                               if x.bold and not x.italic])
            # now we want to grab only the translation without catching additional content with the
            # same font. When the Font changes, it means, that the translation is finished.
            font_list = [1 if x.font.name == translation_font else 0 for x in paragraph.runs]
            word_list.iloc[idx_row,1] = find_first_translation(font_list, paragraph.runs)
            # jump to next row for next entry
            idx_row += 1


        
        

In [57]:
# first we will load the deu-nds word_list

deu_nds_raw = docx.Document("gathering_data/sass/hd-nd-wortliste_absatz_nach_wort.docx")
# prepare a word list only containing one translation
deu_nds_word_list = pd.DataFrame(index=np.arange(20000), columns= ["deu","nds"])

# pass into function
get_word_list(deu_nds_raw, deu_nds_word_list)



In [58]:
# now the nds-deu word_list

nds_deu_raw = docx.Document("gathering_data/sass/nd-hd-wortliste_absatz_nach_wort.docx")
# prepare a word list only containing one translation
nds_deu_word_list = pd.DataFrame(index=np.arange(20000), columns= ["nds","deu"])

# pass into function
get_word_list(nds_deu_raw, nds_deu_word_list)


In [59]:
deu_nds_word_list.head(5)

Unnamed: 0,deu,nds
0,A,
1,Aal ;,Aal
2,Aas,Aas
3,"ab ;; ;;,",af
4,abändern,afännern


In [60]:
nds_deu_word_list.head(5)

Unnamed: 0,nds,deu
0,A,
1,Aadboor,Storch
2,Aadler,Adler
3,aadlig,
4,Aaft,Obst


Having the data now in our pandas dataframe, we can do some regex and clean the words.

In [84]:
nds_deu_word_list[(nds_deu_word_list.deu == "") | (nds_deu_word_list.nds == "")].index

Int64Index([    0,     3,    36,    61,   398,   399,   570,  1588,  2129,
             2169,
            ...
            11293, 11301, 11477, 11568, 11784, 11790, 11793, 11850, 11873,
            11876],
           dtype='int64', length=119)

In [89]:
#drop nan and uncomplete rows first

def drop_uncomplete(df):
    df.dropna(inplace=True)
    drop_index = df[(df.deu == "") | (df.nds == "")].index
    df.drop(index=drop_index, inplace=True)

drop_uncomplete(deu_nds_word_list)
print("Entries DEU-NDS version:",len(deu_nds_word_list))
drop_uncomplete(nds_deu_word_list)
print("Entries NDS-DEU version:",len(nds_deu_word_list))

Entries DEU-NDS version: 6383
Entries NDS-DEU version: 11774


We have now two tables with the reverse translation. We can see already that they have different length. Maybe this is because of the original data or because a the reading of one document was better.
Nevertheless as we have a computer and don't have to turn pages, we need only one table.

After doing some data cleaning we will drop duplicate entries

In [436]:
# lets make one table out of it

nds_deu = nds_deu.append(deu_nds, sort=False, ignore_index = True)


In [122]:
# now we know how big our new vocabulary could be.
# still there are many wrong signs included as we can see only in the first 10 entries

string = "Test ~Test WetterW~asser  Abgabe4 1. 2. ,,; ~weg A~weg end/ jetzt!"

def clear_string(string):
    # delete everything after these characters
    string = re.split(r'[\d.,;!?/]',string)[0]
    # delete alternative spelling as there is no easy automatic solution for taking them into the list
    string = re.split(r'(([A-ZÖÜÄa-zäöü]|\s)~)',string)[0]
    return string

clear_string(string)

# apply the cleaning function
nds_deu = nds_deu_word_list.applymap(lambda x: clear_string(str(x)))


'Test'

In [429]:

# drop paranthesis including the content as they are as well an additional explanation we don't want to have
def drop_parenthesis_dash(string):
    string = re.sub(r'\([a-zöäü.~]*\)?',"",string)
    string = re.sub(r'\|',"",string)
    #empty spaces at the beginning and end
    string = re.sub("\s+$","",string)
    string = re.sub("^\s+","",string)
    return string

nds_deu = nds_deu.applymap(drop_parenthesis_dash)

In [430]:
# especially in the Low German column we have a lot of -. These are mostly format problems.
# still some belong there. We assume that the minus is right, if it occurs in both columns

def sub_minus(string):
    return re.sub(r"-","",string)


nds_deu.loc[~(nds_deu.nds.str.contains("-{1}") & nds_deu.deu.str.contains("-{1}")),:] = nds_deu.loc[~(nds_deu.nds.str.contains("-{1}") & nds_deu.deu.str.contains("-{1}")),:].applymap(sub_minus)





In [431]:
# delete the words "sik", "wat" in Low German as this information is about the grammar
# same in German with "sich", 

In [432]:
def replace_pre_words(df):
    id_sich = df[df.deu == "sich"].index
    id_was = df[df.deu == "was"].index
    df.nds = df.nds.str.replace("(wat|sik)","")
    df.deu = df.deu.str.replace("(was|sich)","")
    df.nds[id_sich] = "sik"
    df.deu[id_sich] = "sich"
    df.nds[id_was] = "wat"
    df.deu[id_was] = "was"
    
replace_pre_words(nds_deu)

In [433]:
# In some rows there is only one letter or nothing, so a further read in error.
# We will drop these lines
def drop_read_in_errors(df):
    drop_index = df[(df.deu.str.len() <= 1 )|(df.nds.str.len() <= 1 )].index
    df.drop(index=drop_index,inplace=True)
    
    
drop_read_in_errors(deu_nds)
drop_read_in_errors(nds_deu)


In [434]:
len(deu_nds),len(nds_deu)

(6215, 10406)

In [435]:
deu_nds.to_csv("data/deu_nds_dictionary.csv")
nds_deu.to_csv("data/nds_deu_dictionairy.csv")

## Using the dictionairy

Finally we have the dictionairy and can try to correct the sentences.



In [437]:
# replace "ik" with "ick"


df.nds = df.nds.str.replace("Ick", "Ik")
df.nds = df.nds.str.replace("ick", "ik")


In [438]:
nds_deu.reset_index(drop=True, inplace=True)

In [440]:
sum(nds_deu.nds.str.endswith(" ")), sum(nds_deu.deu.str.startswith(" "))


(6, 255)

In [444]:
if len(count) == 0:
    count = []
    for text in nds_deu.nds:
        count.append(sum(df.nds.str.contains(re.escape(text))))
    

In [457]:
sorted(count,reverse=True)
nds_deu["corpus_frq"] = count

nds_deu.sort_values("corpus_frq",ascending=False)
print("Words not used: ", sum(nds_deu["corpus_frq"] == 0))
print("Words used: ", len(nds_deu) - sum(nds_deu["corpus_frq"] == 0))

Words not used:  10692
Words used:  5929


As we don't have a frequency vector for Low German, but it is very closely related to German.
Therefore we can use the German word_frequency for Low German as well.


In [445]:
from wordfreq import word_frequency
word_frq = np.empty(len(nds_deu))
for text in nds_deu.deu:
    np.append(word_frq, word_frequency(text, 'de'))

In [446]:
word_frq = pd.Series(word_frq, name="frq", index=np.arange(1,len(nds_deu)+1) )
nds_dic = nds_deu[["nds"]].join(word_frq)

nds_dic.dropna( inplace=True)

In [447]:
nds_dic.sort_values("frq", ascending = False)

Unnamed: 0,nds,frq
1,Aadler,2.168225e-314
11102,Naricht,2.168225e-314
11072,bienanner,2.168225e-314
11073,Bifohrer,2.168225e-314
11074,Bifall,2.168225e-314
11075,Biel,2.168225e-314
11076,bilöpig,2.168225e-314
11077,bi’n,2.168225e-314
11078,Been,2.168225e-314
11079,scheve Been,2.168225e-314


In [346]:
with open('data/nds_dic.dic', "w+") as f:
    f.write("nds_dic.to_dict")

{'nds': {1: 'Aadler ',
  2: 'Aaft ',
  3: 'Aal  ',
  4: 'Aal',
  5: 'Aant ',
  6: 'Aantenflott ',
  7: 'Aantenbraden',
  8: 'Aap ',
  9: 'Aapner ',
  10: 'aapsch ',
  11: 'Aas  ',
  12: 'Aaskraam  ',
  13: 'Aastüüch',
  14: 'abasig ',
  15: 'ach',
  16: 'acheln',
  17: 'acht',
  18: 'acht',
  19: 'acht… ',
  20: 'acht',
  21: 'acht',
  22: 'Acht ',
  23: 'Acht',
  24: 'achtbor',
  25: 'Achtel',
  26: 'Achten',
  27: 'achter ',
  28: 'achter',
  29: 'achtern ',
  30: 'achteran ',
  31: 'achterher',
  32: 'achterankamen',
  33: 'achterdaal  ',
  34: 'Achterdöör',
  35: 'Achtern ',
  36: 'achtern',
  37: 'achtern',
  38: 'achterna',
  39: 'achterop',
  40: 'achtero',
  41: 'achtersinnig ',
  42: 'Achtersitt ',
  43: 'Achterste',
  44: 'Achtersteven',
  45: 'achter’t ',
  46: 'achterto ',
  47: 'achterüm ',
  48: 'achterut ',
  49: 'Achterutspegel  ',
  50: 'achtteihn',
  51: 'achttig ',
  52: 'achtunachttig',
  53: 'Acker',
  54: 'Ackermann ',
  55: 'Ackersnacker',
  56: 'Ackewars ',
  57

In [347]:
from spellchecker import SpellChecker

spell = SpellChecker(local_dictionary="data/nds_dic.dic")  # loads nds word frequency list

word = "Kiken"
spell.candidates(word)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [22]:
df[df.nds.str.contains("ebe")]

Unnamed: 0,deu,nds
3214,Ist der Fisch noch lebendig?,Is de Fisch noch lebennig?
4323,Sie müssen auf die Bibel schwören.,Se mööt op de Biebel sweren.
5813,"Dieses Buch ist für mich das, was für dich die...","Dit Book is för mi dat, wat för di de Biebel is."
6350,Im Februar schlachten wir das fette Schwein.,In ’n Feberwoor slacht wi dat fette Swien.
6674,Er wurde dort von den Rebellen attackiert.,He is dor von de Rebellers angrepen worrn.
9215,Ich war im Februar in Sao Paulo.,Ick weer in ’n Feberwoor in São Paulo.
9216,Ich war im Februar in São Paulo.,Ick weer in ’n Feberwoor in São Paulo.
10835,"Ich weiß nicht, ob er tot oder lebendig ist.","Ick weet nich, wat he dood oder lebennig is."
11442,Justin Bieber ist berühmter als Katie Holmes.,Justin Bieber is beröhmter as Katie Holmes.
11786,Ich habe noch nie einen lebendigen Tintenfisch...,Ick heff noch nien en lebennigen Blackfisch eten.


In [23]:
test_str = "De Stevel is Biebel"

In [24]:
import re

# According to Arbatzat "Basiswortschatz Plattdeutsch" if it is possible, b should be replaced with v
#assumption from my side: if there is a vowel in front and after

df.nds = df.nds.str.replace(r"([AEOUÄÜÖaeouäöü])b([AEUOaeouäöü])", r"\1v\2")

# by chance I saw that Bibel is spelled wrong a lot of times

df.nds = df.nds.str.replace(r"Bi(e|)bel", "Bivel")

# replacement of "sch" to "s"

df.nds = df.nds.str.replace(r"((sch)([lmn]))", r"s\2")
df.nds = df.nds.str.replace(r"((Sch)([lmn]))", r"S\2")


In [25]:
df.to_csv(path_or_buf="data/data.csv", index=False)

In [26]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)
train_data, valid_data = train_test_split(train_data, test_size=0.1, random_state=42)

train_data.to_csv(path_or_buf="data/train_data.csv", index=False)
valid_data.to_csv(path_or_buf="data/valid_data.csv", index=False)
test_data.to_csv(path_or_buf="data/test_data.csv", index=False)

print("Numbers of training samples: " , len(train_data))
print("Number of validation samples: ",len(valid_data))
print("Number of test samples: ",len(test_data))

Numbers of training samples:  29142
Number of validation samples:  3238
Number of test samples:  3598
