# Data Preprocessing

The best results in machine translation still come from a large, high-quality dataset of aligned sentences in the source and target language.

From [Tatoeba](https://tatoeba.org/) you can download community-created sentence pairs that are already aligned in German and Low German. The dataset is relatively small (~17,000 sentences) and its spelling and grammar are inconsistent, so it is only a first step towards building a data basis.

There is also the [Low German Wikipedia](https://nds.wikipedia.org/wiki/Wikipedia:H%C3%B6%C3%B6ftsiet) with over 60,000 articles (April 2020). One idea would be to align suitable sentences with the German Wikipedia. Luckily, Facebook's research team has already done this for [all Wikipedia languages](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix), which includes Low German-German as a side product. A first look at that dataset shows many mismatched sentences, and it has the same problem of inconsistent spelling and grammar. Still, it could be a good way to extend the data basis.

The goal of this notebook is to get clean datasets for training the different translation models. To achieve that, we have to normalize the spelling as far as possible and delete mismatched sentences.

In [126]:
import pandas as pd
import numpy as np

## Tatoeba dataset

As a first step we load the dataset from Tatoeba. You can download the TSV files from the website. You will also need "links.csv" from the website to match the German and Low German sentences by their sentence id.

In [127]:
# creating column names for the tables
column_names_platt = ["id", "language", "nds"]
column_names_deu = ["id", "language", "deu"]

nds_sentences = pd.read_csv("nds_sentences.tsv", sep= "\t", header = None, names=column_names_platt)
deu_sentences = pd.read_csv("deu_sentences.tsv", sep= "\t", header = None, names=column_names_deu)
link_sentences = pd.read_csv("links.csv", sep= "\t", header = None, names=["origin","translation"])

In [128]:
nds_sentences.sample(5)

Unnamed: 0,id,language,nds
8193,906561,nds,Uns Vörfohren kemen vör 150 Johr in dit Land.
1387,810834,nds,"Se verlöövt uns nich, in ’e Disko to gahn."
13769,1083539,nds,Ik harr dat Geld nehmen schullt.
1516,811082,nds,"As he de Naricht hören deed, weer he verbaast."
3472,839586,nds,He schreet veel.


In [129]:
deu_sentences.head(2)

Unnamed: 0,id,language,deu
0,77,deu,Lass uns etwas versuchen!
1,78,deu,Ich muss schlafen gehen.


In [130]:
link_sentences.head(2)

Unnamed: 0,origin,translation
0,1,1276
1,1,2481


In [131]:
df = link_sentences.merge(deu_sentences
                     , left_on = "origin"
                     , right_on = "id").merge(nds_sentences
                                              , left_on="translation", right_on="id")

In [132]:
df = df[["deu", "nds"]]

In [133]:
# show Low German sentences that contain an opening parenthesis
df[df["nds"].str.contains(r'\(')]

Unnamed: 0,deu,nds
16769,"Es ist schon fast sieben, wir müssen zur Schul...",Dat is dicht an söven. Wi mööt na School (gahn).
16770,"Es ist schon fast sieben Uhr, wir müssen zur S...",Dat is dicht an söven. Wi mööt na School (gahn).
16771,"Es ist schon fast sieben, wir müssen zur Schule.",Dat is dicht an söven. Wi mööt na School (gahn).
16894,"Ich mag wxWidgets, weil es im Gegensatz zu and...",Ik mag wxWidgets vonwegen dat dat anners as an...
17483,Wir rufen euch an.,Wü röp juu ön (frr)
17484,Spielst du Baseball?,Spölest Baseball? (frr)
17485,Sie wird es dir sagen.,Jü wel et di sii. (frr)
17486,Tom brachte mir eine Tasse Tee.,Tom braacht mi en Kop Tee. (frr)
17487,Wir hatten ein gutes Gespräch.,Wü her en gur Snak. (frr)
17488,Wir wollten mit ihr sprechen.,Wü wil me höör snaki (frr)


In [134]:
# drop the sentences tagged "(frr)" (North Frisian), as they are not Low German
df = df.drop(df[df["nds"].str.contains(r'\(frr')].index)

In [135]:
df.sample(5)

Unnamed: 0,deu,nds
2132,Ich habe dir dein Abendessen im Ofen gelassen.,Ik heff di dien Avendeten in ’n Aven laten.
16796,"Wir beide wissen, dass ihr beiden lügt.",Wi beide weet dat ji beide leegt.
10115,Das war sein eigener Fehler.,Dat weer sien egen Schuld.
3531,Warum nimmst du dir den Tag nicht frei?,Worüm nimmst du di den Dag nich free?
7807,"Wir haben eine Tochter, die mit einem Franzose...","Wi hebbt en Dochter, de mit en Franzoos verhei..."


In [136]:
df[df["nds"].str.contains('(?i) ick')]

Unnamed: 0,deu,nds
1094,Jetzt habe ich Zeit.,Nu häb ick Tied.
4949,"Rede du nur, ich tue, was ich will.","Küer man to, ick do wat ick wil."
15997,"Gut, dann bedanke ich mich auch.","Na, denn bedank ick mi ok."
16138,In diesem Stadium der Weltgeschichte gibt es s...,Op dat hierste Stadium vun de Werltgeschicht g...
16532,Wann soll ich kommen?,Üm wölke Klock mutt ick kamen?
16746,Hier kann ich dir nicht zustimmen.,Dorup kann ick nich mit di tostimmen.
16802,Wann soll ich einchecken?,Wo lååt mutt ick inchecken?


In [137]:
print(len(df))
df = df[(df.deu.str.len() < 70) & (df.nds.str.len() < 70)]
print(len(df))

17483
16895


## Wikipedia parallel sentences

Facebook aligned suitable sentences across all language editions of Wikipedia and published the result as WikiMatrix: https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix

Luckily there is a Low German edition of Wikipedia with over 60,000 articles. Of course many of them cover specifically Low German topics, so there is no German version of the article. Still, Facebook managed to align around 70,000 sentence pairs.
We will use this dataset to enlarge our data basis.
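Since a first look showed many mismatched pairs, one option is to filter on the alignment score before loading the sentences. The sketch below assumes that the published WikiMatrix file is tab-separated with a margin score, the source sentence and the target sentence on each line. The sample rows, the function name `filter_pairs` and the threshold value are made up for illustration and would need tuning on the real data.

```python
import csv
import io

# Hypothetical sample in the assumed layout: score <TAB> German <TAB> Low German
SAMPLE = (
    "1.21\tDas ist ein Haus.\tDat is en Huus.\n"
    "1.02\tEr wohnt in Hamburg.\tDe Stadt liggt an de Elv.\n"
)


def filter_pairs(handle, threshold=1.04):
    """Keep only sentence pairs whose margin score reaches the threshold."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(deu, nds) for score, deu, nds in reader if float(score) >= threshold]


pairs = filter_pairs(io.StringIO(SAMPLE))
print(pairs)
```

A higher threshold keeps fewer but cleaner pairs, so it trades dataset size against alignment quality.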

In [138]:

wiki_deu = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds-de.txt",
                       sep="\n+",engine='python', encoding="utf-8",
                       header=None, names=["deu"])



In [139]:
wiki_nds = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds-nds.txt"
                       , sep="\n+",engine='python'
                       , encoding="utf-8", names=["nds"])

In [140]:
wiki_df = wiki_deu.join(wiki_nds)
print(len(wiki_df))
# exclude long sentences for faster training
wiki_df = wiki_df[(wiki_df.deu.str.len() < 70) & (wiki_df.nds.str.len() < 70)]
print(len(wiki_df))

df = pd.concat([df, wiki_df], ignore_index=True)

75590
19083


In [141]:
len(df)

35978

# Generalizing the Low German words

Here we try to correct the sentences to the "official" spelling according to the [Sass dictionary](http://sass-platt.de/).



In [142]:
import docx
import re

The available dictionary is not ordered, but the German headwords are written in bold and further German subwords in italic. The translations follow the bold or italic words. First we will try to get the index range of each translation.
For that we need the paragraph that contains the word and the range of runs inside that paragraph.
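The idea of taking only the first unbroken stretch of runs in the translation font can be sketched without a Word file. `Run` here is a hypothetical stand-in for a python-docx run with just the two attributes that matter, and the paragraph content is made up.

```python
from collections import namedtuple
from itertools import dropwhile, takewhile

# Stand-in for a python-docx run: only the attributes used in this sketch
Run = namedtuple("Run", ["text", "font_name"])


def first_span(runs, font_name):
    """Join the text of the first unbroken sequence of runs set in font_name."""

    def in_font(run):
        return run.font_name == font_name

    # skip everything before the first matching run, then stop at the first mismatch
    span = takewhile(in_font, dropwhile(lambda run: not in_font(run), runs))
    return "".join(run.text for run in span)


paragraph = [
    Run("Haus ", "Arial"),
    Run("Hu", "Rockwell"),
    Run("us", "Rockwell"),
    Run(" n. ", "Arial"),
    Run("Hüüs", "Rockwell"),
]
print(first_span(paragraph, "Rockwell"))  # Huus
```

The second stretch in the same font ("Hüüs") is ignored, which is what keeps additional content out of the translation column.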


In [143]:
# function to find only the first translation
# easier to understand in combination when used inside the loop
def find_first_translation(translation_id, runs):
        translation = ''
        first_row_of_ones = False
        for idx,i in enumerate(translation_id):
            if i == 1:
                translation += runs[idx].text
                first_row_of_ones = True
            elif i == 0 and first_row_of_ones:
                break
        return translation

In [144]:
def get_word_list(document, word_list, translation_font = "Rockwell"):
    idx_row = 0
    string = ""  # empty separator used to join the bold runs
    for paragraph in document.paragraphs:
        # grab only the lines which start with a word / other lines are additional content or page numbers
        new_line = re.match(r'\A[A-Za-zÄäÜüÖö]', paragraph.text)
        if new_line:
            # write the word into the column 0 with the format bold and not italic which stands for the basis word
            word_list.iloc[idx_row,0] = string.join([x.text for x in paragraph.runs 
                                                               if x.bold and not x.italic])
            # now we want to grab only the translation without catching additional content with the
            # same font. When the Font changes, it means, that the translation is finished.
            font_list = [1 if x.font.name == translation_font else 0 for x in paragraph.runs]
            word_list.iloc[idx_row,1] = find_first_translation(font_list, paragraph.runs)
            # jump to next row for next entry
            idx_row += 1


        
        

In [None]:
# first we will load the deu-nds word_list

deu_nds_raw = docx.Document("gathering_data/sass/hd-nd-wortliste_absatz_nach_wort.docx")
# prepare a word list only containing one translation
deu_nds_word_list = pd.DataFrame(index=np.arange(20000), columns= ["deu","nds"])

# pass into function
get_word_list(deu_nds_raw, deu_nds_word_list)



In [None]:
# now the nds-deu word_list

nds_deu_raw = docx.Document("gathering_data/sass/nd-hd-wortliste_absatz_nach_wort.docx")
# prepare a word list only containing one translation
nds_deu_word_list = pd.DataFrame(index=np.arange(20000), columns= ["nds","deu"])

# pass into function
get_word_list(nds_deu_raw, nds_deu_word_list)


In [None]:
deu_nds_word_list.head(5)

In [None]:
nds_deu_word_list.head(5)

Having the data now in our pandas dataframe, we can do some regex and clean the words.

In [None]:
nds_deu_word_list[(nds_deu_word_list.deu == "") | (nds_deu_word_list.nds == "")].index

In [None]:
# drop NaN and incomplete rows first

def drop_uncomplete(df):
    df.dropna(inplace=True)
    drop_index = df[(df.deu == "") | (df.nds == "")].index
    df.drop(index=drop_index, inplace=True)

drop_uncomplete(deu_nds_word_list)
print("Entries DEU-NDS version:",len(deu_nds_word_list))
drop_uncomplete(nds_deu_word_list)
print("Entries NDS-DEU version:",len(nds_deu_word_list))

## Making one wordbook

We now have two tables with the translation in opposite directions. We can already see that they differ in length. This may be due to the original data, or because one document was read in more cleanly than the other.
Nevertheless, as we have a computer and don't have to turn pages, we need only one table.

After some data cleaning we will drop duplicate entries.
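As a small illustration of the merge, the sketch below stacks two toy tables whose columns are in opposite order and then drops exact duplicates. The entries are made up and the variable names are hypothetical.

```python
import pandas as pd

# Two toy word lists with the same columns in reverse order (made-up entries)
nds_first = pd.DataFrame({"nds": ["Huus", "Water"], "deu": ["Haus", "Wasser"]})
deu_first = pd.DataFrame({"deu": ["Haus", "Buch"], "nds": ["Huus", "Book"]})

# concat aligns columns by name, so the different column order does not matter
combined = pd.concat([nds_first, deu_first], sort=False, ignore_index=True)

# only rows that are identical in both columns are removed
unique = combined.drop_duplicates().reset_index(drop=True)
print(unique)
```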

In [None]:
# let's make one table out of it

nds_deu = pd.concat([nds_deu_word_list, deu_nds_word_list], sort=False, ignore_index=True)


In [None]:
# now we know how big our new vocabulary could be.
# still, there are many stray characters included, as we can see in just the first 10 entries

string = "Test ~Test WetterW~asser  Abgabe4 1. 2. ,,; ~weg A~weg end/ jetzt!"

def clear_string(string):
    # delete everything after these characters
    string = re.split(r'[\d.,;!?/]',string)[0]
    # delete alternative spelling as there is no easy automatic solution for taking them into the list
    string = re.split(r'(([A-ZÖÜÄa-zäöü]|\s)~)',string)[0]
    return string

clear_string(string)

# apply the cleaning function
nds_deu = nds_deu.applymap(lambda x: clear_string(str(x)))
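To see what the two splits do, the sketch below repeats `clear_string` from the cell above and runs it on a few of the test fragments. Note that the second pattern also consumes the character directly before `~`, which is why "WetterW~asser" is cut back to "Wetter".

```python
import re


def clear_string(string):
    # same two splits as in the cell above
    string = re.split(r'[\d.,;!?/]', string)[0]
    string = re.split(r'(([A-ZÖÜÄa-zäöü]|\s)~)', string)[0]
    return string


for raw in ["Abgabe4", "end/ jetzt!", "WetterW~asser", "Test ~Test"]:
    print(repr(raw), "->", repr(clear_string(raw)))
```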


In [None]:
# especially in the Low German column there are many hyphens, mostly formatting problems.
# Some still belong there: we assume a hyphen is correct if it occurs in both columns.

def sub_minus(string):
    return re.sub(r"-","",string)


# rows in which NOT both columns contain a hyphen
no_shared_hyphen = ~(nds_deu.nds.str.contains("-") & nds_deu.deu.str.contains("-"))
nds_deu.loc[no_shared_hyphen,:] = nds_deu.loc[no_shared_hyphen,:].applymap(sub_minus)
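The hyphen rule can be stated for a single dictionary entry as follows. This is a plain-Python sketch of the same assumption, with a hypothetical function name and made-up entries.

```python
def strip_hyphens(nds, deu):
    """Remove hyphens unless both the Low German and the German entry contain one."""
    if "-" in nds and "-" in deu:
        # a hyphen on both sides is assumed to be part of the word
        return nds, deu
    return nds.replace("-", ""), deu.replace("-", "")


print(strip_hyphens("Hoch-tiet", "Hochzeit"))
print(strip_hyphens("Nord-Oostsee-Kanaal", "Nord-Ostsee-Kanal"))
```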





In [None]:
# delete the words "sik" and "wat" in Low German as they only carry grammatical information
# same in German with "sich" and "was"

def replace_pre_words(df):
    df.nds = df.nds.str.replace(r"(wat\s|sik\s)","", regex=True)
    df.deu = df.deu.str.replace(r"(was\s|sich\s)","", regex=True)

    
replace_pre_words(nds_deu)

In [None]:

# drop parentheses including their content, as they are additional explanations we don't want to keep
def drop_parenthesis_dash(string):
    string = re.sub(r'\([a-zöäü.~]*\)?',"",string)
    string = re.sub(r'\|',"",string)
    #empty spaces at the beginning or end
    string = re.sub(r'^\s|\s$',"",string)
    return string

nds_deu = nds_deu.applymap(drop_parenthesis_dash)

In [None]:
# In some rows there is only one letter or nothing, i.e. a further read-in error.
# We will drop these rows.
# Moreover, Low German has some genuine two-character words, but many short entries are read-in errors.
def drop_read_in_errors(df):
    drop_index = df[(df.deu.str.len() <= 1 )|(df.nds.str.len() <= 1 )].index
    df.drop(index=drop_index,inplace=True)
    drop_different_lengths_one_index = df[(df.nds.str.len() <= 2) & (df.deu.str.len() > 4)].index
    df.drop(index=drop_different_lengths_one_index,inplace=True)
    drop_different_lengths_two_index = df[(df.nds.str.len() <= 3) & (df.deu.str.len() > 7)].index
    df.drop(index=drop_different_lengths_two_index,inplace=True)

    drop_different_lengths_reverse_index = df[(df.deu.str.len() == 2) & (df.nds.str.len() > 4)].index
    df.drop(index=drop_different_lengths_reverse_index,inplace=True)

    # there are also some read-in problems with the German article "der"
    # we will insert it later with the additional text
    drop_der = df[df.deu == "der"].index
    df.drop(index=drop_der,inplace=True)
    
    
drop_read_in_errors(nds_deu)


In [None]:
# load some additional hand-picked, cleaned data and append it

hand_picked_words = pd.read_csv("gathering_data/sass/additional_words.txt", sep="\t", header=None, names = ["nds","deu"] )
nds_deu = pd.concat([nds_deu, hand_picked_words], sort=False)

In [None]:
# and drop only exact duplicates as a Low German word could have several meanings in German and vice versa
print("Entries before dropping duplicates: ", len(nds_deu))
nds_deu.drop_duplicates(inplace=True)
print("Entries after dropping duplicates: ", len(nds_deu))

# as we did some cleaning, we may have created some additional empty elements
# so we use our drop_uncomplete function again

drop_uncomplete(nds_deu)

print("Entries after dropping empty entries: ", len(nds_deu))



In [None]:
# resetting index
nds_deu.reset_index(drop=True, inplace=True)

In [None]:
nds_deu.to_csv("data/nds_deu_dictionary.csv")

# Using the dictionary

Finally we have the dictionary and can try to correct the sentences.



#### The following Markdown cells should be code, but they take too long to run

###### count how often words from our dictionary occur in the text already
if 'count' not in globals():

    count = [df.nds.str.count(re.escape(text)).sum() for text in nds_deu.nds]



#[nds_deu.nds.str.count(re.escape(sentence)).sum() for sentence in df.nds]

nds_deu["corpus_count"] = count


print("Words not used: ", sum(nds_deu["corpus_count"] == 0))
print("Words used: ", len(nds_deu) - sum(nds_deu["corpus_count"] == 0))

nds_deu.sort_values("corpus_count",ascending=False).head(10)

In [None]:
nds_deu.nds.str.len()[8441]

nds_deu[nds_deu.nds =="ik"]

We don't have a word frequency list for Low German, but the language is very closely related to German.
Therefore we use the German `word_frequency` of each entry's translation as an approximation for the Low German word.


In [None]:
from wordfreq import word_frequency



In [None]:

def get_frq(text):
    return word_frequency(text, "de")
word_frq = nds_deu.deu.apply(get_frq).rename("frq")


In [None]:
nds_dic = nds_deu.join(word_frq)

nds_dic.dropna( inplace=True)
nds_dic.shape

In [None]:
nds_dic.sort_values("frq", ascending = False).head(5)

nds_dic.isna().sum()
nds_dic[nds_dic.nds == "ik"]
nds_dic[nds_dic.duplicated("nds", keep=False)].head(2)



In [None]:
nds_dic = nds_dic[["nds","deu","frq"]]
nds_dic.reset_index(drop=True, inplace=True)
print("Before removing duplicates: ",len(nds_dic))

nds_dic = nds_dic.sort_values("frq", ascending = True)
nds_dic = nds_dic.drop_duplicates(subset='nds', keep="last")
print("After removing duplicates: ",len(nds_dic))
nds_dic.to_csv("data/nds_deu_count.csv", index=False)
nds_dic.isna().sum()
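The effect of sorting by frequency and keeping the last duplicate is that each Low German word keeps its most frequent German translation. A toy illustration, with made-up entries and frequencies:

```python
import pandas as pd

toy = pd.DataFrame({
    "nds": ["lopen", "lopen", "Huus"],
    "deu": ["laufen", "rennen", "Haus"],
    "frq": [5.0e-05, 2.0e-05, 3.0e-04],  # made-up frequencies
})

# ascending sort puts the most frequent translation last within each nds group
best = (toy.sort_values("frq", ascending=True)
           .drop_duplicates(subset="nds", keep="last")
           .sort_index())
print(best)
```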

In [None]:
# to get full integers
nds_dic.frq = nds_dic.frq * 100000000
nds_dic.frq = nds_dic.frq.astype("int")
nds_tojson = nds_dic.set_index("nds")
nds_tojson = nds_tojson.to_dict(orient="dict")["frq"]

In [None]:
import json

# write UTF-8 explicitly, as the umlauts are stored unescaped
with open('data/nds_dic.txt', 'w', encoding='utf-8') as fp:
    json.dump(nds_tojson, fp, ensure_ascii=False)
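The resulting file is a single JSON object that maps each word to an integer count. The sketch below shows the same scaling on two made-up frequencies. It uses `round` instead of a plain integer cast, which is an assumption on my side to avoid truncating values that land just below a whole number.

```python
import json

# made-up relative frequencies for two Low German words
frequencies = {"ik": 0.004, "söven": 0.00025}

# scale to whole numbers, as the dictionary file stores counts rather than fractions
counts = {word: round(frq * 100_000_000) for word, frq in frequencies.items()}

# ensure_ascii=False keeps umlauts readable instead of escaping them
text = json.dumps(counts, ensure_ascii=False)
print(text)
```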

Now we load the SpellChecker module, which can correct words based on the dictionary we created.
After that we will see how many words from our sentences are in the dictionary and try to automate the correction.
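To make the idea behind frequency-based correction concrete, here is a minimal sketch: generate all strings one edit away from the input and pick the known one with the highest count. It is not the library's implementation, it only looks at edit distance one, and the word counts are made up.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyzäöüß"


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


counts = {"ik": 400000, "mal": 9000, "maal": 1200}  # made-up counts
print(correct("ick", counts))   # ik
print(correct("maol", counts))  # mal
```

The second example shows why the frequencies matter: both "mal" and "maal" are one edit away from "maol", and the count decides between them.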

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker(local_dictionary="data/nds_dic.txt")  # loads nds word frequency list

spell.export("data/example_dict", gzipped=False)

In [None]:
word = "maol"
spell.correction(word)

In [None]:
spell.correction("ick")

In [None]:
spell.known(["ik", "gehe"])

First we tokenize the sentences so that we can work with single words.
A simple tokenizer that splits on whitespace and punctuation would probably be enough, but we use spaCy to have the chance of getting better tokens.
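For comparison, such a simple tokenizer could look like the sketch below. The function name is hypothetical, and the regular expression simply keeps runs of letters.

```python
import re


def simple_tokenize(sentence):
    """Return the runs of letters in a sentence, dropping digits and punctuation."""
    # [^\W\d_] matches any Unicode letter, so umlauts stay inside the token
    return re.findall(r"[^\W\d_]+", sentence)


print(simple_tokenize("Se verlöövt uns nich, in ’e Disko to gahn."))
```

One limitation is visible in this example: the contraction "’e" loses its apostrophe and becomes the token "e".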

In [None]:

import spacy

# German pipeline; only its tokenizer is used below
spacy_de = spacy.load('de_core_news_sm')

In [None]:
spell.known(["ik", "verdenen"])

In [None]:
def get_spell(series, dic):
    # remove digits and punctuation before tokenizing
    series = series.str.replace(r"[\d?!.\",]+","", regex=True)
    #tokenize
    sent_tok = series.apply(spacy_de.tokenizer)
    # get as list
    complete = sent_tok.apply(lambda x: [a.text for a in x])
    # get unknown and known words
    unknown_words = complete.apply(dic.unknown)
    known_words = complete.apply(dic.known)
    return unknown_words, known_words, complete

def hit_quote(series, dic):
    unknown_words,known_words,complete = get_spell(series,dic)
    total_words = complete.apply(len).sum()
    # each row is saved as spell_object, therefore we have to do the following workaround
    known_count = known_words.apply(lambda x: [1 for i in x]).apply(len).sum()
    unknown_count = unknown_words.apply(lambda x: [1 for i in x]).apply(len).sum()
    print("Total words: ", total_words)
    print("Known words: ", known_count)
    print("Unknown words: ", unknown_count)
    print("Hit quote: ", round(100*known_count/total_words))

hit_quote(df.nds, spell)
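The same measurement can be sketched in plain Python on a toy vocabulary. Unlike the set-based lookup above, which to my understanding returns each distinct word of a sentence only once, this version counts every token, so known and unknown tokens always add up to the total. Sentences and vocabulary are made up.

```python
def hit_rate(sentences, vocabulary):
    """Return (known tokens, total tokens) for whitespace-tokenized sentences."""
    tokens = [word.lower() for sentence in sentences for word in sentence.split()]
    known = sum(1 for word in tokens if word in vocabulary)
    return known, len(tokens)


vocabulary = {"ik", "heff", "tied", "nu"}
known, total = hit_rate(["Ik heff Tied", "Nu häb ick Tied"], vocabulary)
print(f"{known} of {total} tokens known ({round(100 * known / total)} %)")
```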

# Regex correction

We have many different spellings, but for some common words or patterns, we will define the "right" spelling.
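A minimal sketch of such rules with the standard `re` module is shown below. The rule list and the function name are hypothetical. The patterns use word boundaries so that only whole words are touched, and raw strings so that `\1` is read as a backreference.

```python
import re

# (pattern, replacement) pairs; the capture group preserves the original capitalization
RULES = [
    (re.compile(r"\b([Ii])ck\b"), r"\1k"),   # "ick" -> "ik"
    (re.compile(r"\b([Uu])s\b"), r"\1ns"),   # "us"  -> "uns"
]


def normalize(sentence):
    """Apply every spelling rule to the sentence in order."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


print(normalize("Nu häb ick Tied."))
print(normalize("Ick segg dat to us."))
```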

In [None]:
# replace "ick" with "ik" (whole words only)
def replace_ik(df):    
    print(df.nds.str.count(r"\b(I|i)ck\b").sum())
    df.nds = df.nds.str.replace(r"\b(I|i)ck\b", r"\1k", regex=True)

# replace "us" with "uns"
def replace_uns(df):
    # the surrounding whitespace is matched too, so it is written back
    df.nds = df.nds.str.replace(r"(\s)(U|u)s(\s)", r"\1\2ns\3", regex=True)


# "sch" at the start of a word and before a consonant will be replaced with "s"
def replace_s(df):
    print(df.nds.str.count(r"\s(S|s)ch[lmknbwvpt]").sum())
    # assign the result back to the column
    df.nds = df.nds.str.replace(r"(\s)(S|s)ch([lmknbwvpt])", r"\1\2\3", regex=True)
    
def regex_all(df):
    replace_ik(df)
    replace_uns(df)
    replace_s(df)

regex_all(df)


# Preparing sentences

In this section we want to select and preprocess the sentences for our model.

As explained, the sentences are not perfectly aligned. To keep only high-quality pairs, we use our dictionary and check for which pairs it finds matching translations. This is not a perfect approach, as we might exclude valid translations that cannot be translated word by word.
We will evaluate later whether it gives us a gain.
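One way to turn this into a number per sentence pair is to compute the share of Low German words whose dictionary translation appears in the German sentence. The sketch below is only an illustration under that assumption. The function name, the toy dictionary and the example pairs are made up, and any cut-off applied to the score would still have to be chosen.

```python
def dictionary_overlap(nds_sentence, deu_sentence, nds_to_deu):
    """Share of translatable Low German words whose translation occurs in the German sentence."""
    deu_words = {word.lower() for word in deu_sentence.split()}
    nds_words = [word.lower() for word in nds_sentence.split()]
    # only words found in the dictionary can support or contradict the alignment
    translatable = [word for word in nds_words if word in nds_to_deu]
    if not translatable:
        return 0.0
    hits = sum(1 for word in translatable if nds_to_deu[word].lower() in deu_words)
    return hits / len(translatable)


nds_to_deu = {"ik": "ich", "tied": "Zeit", "nu": "jetzt"}  # toy dictionary
print(dictionary_overlap("Nu häb ick Tied", "Jetzt habe ich Zeit", nds_to_deu))
print(dictionary_overlap("Nu häb ick Tied", "Er wohnt in Hamburg", nds_to_deu))
```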

In [None]:

unknown_nds , known_nds , _ = get_spell(df.nds,spell)



In [None]:
for word in known_nds[28]:
    print(word)

In [None]:
dic = pd.read_csv("data/nds_deu_count.csv")
print(len(dic))

dic = dic.drop_duplicates(subset='nds', keep="last")
dic[dic.nds == "ik "]

In [None]:
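# lookup table from Low German word to its German dictionary entry
nds_to_deu = dic.set_index("nds")["deu"].to_dict()

def german_word(dic_list):
    # translate every word of the list that has a dictionary entry
    return [nds_to_deu[a] for a in dic_list if a in nds_to_deu]

In [None]:
known_nds.apply(german_word)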

In [None]:
df[df.nds.str.contains(r"[lpgtrda]+aa[tzrsdfgbnm]*")]


In [None]:
hit_quote(df.nds, spell)

In [None]:
test_str = "De Stevel is Biebel"

In [None]:
import re

# According to Arbatzat "Basiswortschatz Plattdeutsch", b should be replaced with v where possible
# assumption from my side: only if there is a vowel before and after it

df.nds = df.nds.str.replace(r"([AEOUÄÜÖaeouäöü])b([AEUOaeouäöü])", r"\1v\2", regex=True)

# by chance I saw that Bibel is often misspelled
# "Biebel" has already become "Bievel" by the rule above, so the pattern accepts b or v

df.nds = df.nds.str.replace(r"Bie?[bv]el", "Bivel", regex=True)

# replacement of "sch" with "s" before l, m or n

df.nds = df.nds.str.replace(r"sch([lmn])", r"s\1", regex=True)
df.nds = df.nds.str.replace(r"Sch([lmn])", r"S\1", regex=True)
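The b-to-v rule can be tried on single words with the standard `re` module. The sketch uses the same character classes as the cell above, so "i" does not count as a surrounding vowel here. The helper name and the example words are chosen only for illustration.

```python
import re

# b between two vowels, with the vowel sets used in the cell above
B_BETWEEN_VOWELS = re.compile(r"([AEOUÄÜÖaeouäöü])b([AEUOaeouäöü])")


def b_to_v(text):
    """Replace b with v when it stands between two of the listed vowels."""
    return B_BETWEEN_VOWELS.sub(r"\1v\2", text)


for word in ["leben", "aber", "Arbeit", "Bibel"]:
    print(word, "->", b_to_v(word))
```

"Arbeit" stays unchanged because the b follows a consonant, and "Bibel" stays unchanged because "i" is not in the first vowel set, which is why it gets its own rule.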


In [None]:
df.to_csv(path_or_buf="data/data.csv", index=False)

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)
train_data, valid_data = train_test_split(train_data, test_size=0.1, random_state=42)

train_data.to_csv(path_or_buf="data/train_data.csv", index=False)
valid_data.to_csv(path_or_buf="data/valid_data.csv", index=False)
test_data.to_csv(path_or_buf="data/test_data.csv", index=False)

print("Number of training samples: " , len(train_data))
print("Number of validation samples: ",len(valid_data))
print("Number of test samples: ",len(test_data))