# Data Preprocessing

The best results in machine translation are still achieved by a good and big dataset with aligned sentences in source and target language.

From [Tatoeba](https://tatoeba.org/) you can download already aligned sentences in German and Low German which were created by the community. As the dataset contains a relatively small amount of ~17.000 sentences and moreover with different spelling and grammar, it is just a first step for building a databasis.

Moreover there is [Low German Wikipedia](https://nds.wikipedia.org/wiki/Wikipedia:H%C3%B6%C3%B6ftsiet) with over 60.000 articles (April 2020). An idea would be to align suitable sentences with the German wikipedia. Luckily Facebook's research team did this already for [all Wikipedial languages](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix) including as a side product Low German-German. A first look into that datasets shows that there are many mismatched sentences. Moreover there is still the same problem of different spelling and grammar. Still this could be a good approach to extend the databasis.


The goal of this notebook is to get clean datasets for training the different translation models. To achieve that we have to correct as much as possible the spelling and have to delete mismatched sentences.

In [58]:
import pandas as pd
import numpy as np
import re

import plotly.graph_objects as go
import matplotlib.pyplot as plt

## Tatoeba dataset

As a first step we load in the dataset from tatoeba. You can download the tsv files from the [website](https://tatoeba.org/eng/downloads). The data is provided under the [CC BY 2.0 FR license](https://creativecommons.org/licenses/by/2.0/de/#). Moreover you will need the "links.csv" from the website to match the sentences from German and Low German by their sentence id.

In [59]:
# creating column names for the tables
column_names_platt = ["id", "language", "nds"]
column_names_deu = ["id", "language", "deu"]

nds_sentences = pd.read_csv("data/tatoeba/nds_sentences.tsv", sep= "\t", header = None, names=column_names_platt)
deu_sentences = pd.read_csv("data/tatoeba/deu_sentences.tsv", sep= "\t", header = None, names=column_names_deu)
link_sentences = pd.read_csv("data/tatoeba/links.csv", sep= "\t", header = None, names=["origin","translation"])

Not surprisingly we have way more german sentences in the dataset than in low german.

In [60]:
nds_sentences.shape, deu_sentences.shape

((17597, 3), (499720, 3))

The datasets include an unique id for each sentence. The with the information from the links.csv we can match for each sentence the corresponding translation. The links.csv contains the translations from one sentence to every other available language. Therefore we have to see how many direct translations from german to low german we will find.

In [61]:
nds_sentences.head(2)

Unnamed: 0,id,language,nds
0,639306,nds,Viël to vötälen giëwt viël totolustern.
1,639321,nds,"Wenn de Rüe griäst, giëwt Riängen."


In [62]:
link_sentences.head(4)

Unnamed: 0,origin,translation
0,1,1276
1,1,2481
2,1,5350
3,1,5972


In [63]:
tatoeba_raw = link_sentences.merge(deu_sentences
                     , left_on = "origin"
                     , right_on = "id").merge(nds_sentences
                                              , left_on="translation", right_on="id")

Nearly every low german sentence is also available in german. Perfect!

In [64]:
tatoeba_raw = tatoeba_raw[["deu", "nds"]]

print("Percentage of direct translations from German to Low German: ",round(100* tatoeba_raw.shape[0] / nds_sentences.shape[0],2))

Percentage of direct translations from German to Low German:  99.43


There are some other dialects of Low German included and marked by parenthesis

In [65]:
tatoeba_raw[tatoeba_raw["nds"].str.contains('\(')].sample(3)

Unnamed: 0,deu,nds
17483,Wir rufen euch an.,Wü röp juu ön (frr)
17493,Mir ist Wasser ins Ohr gekommen.,Weeter es mi iin ön Uar kemen (frr)
17494,"Weißt du, wo meine Schlüssel sind?","Weest, hur min Kaien sen? (frr)"


We will drop them in order to reduce diversity in spelling

In [66]:
 tatoeba_raw = tatoeba_raw.drop(tatoeba_raw[tatoeba_raw["nds"].str.contains('\(frr')].index)

Have a look at a sample of the translations -  if you understand german and/or low german

In [67]:
tatoeba_raw.sample(5)

Unnamed: 0,deu,nds
7285,Ich habe sie vor zwei Jahren irgendwo gesehen.,Ik heff ehr vör twee Johr jichtens sehn.
11869,Die Kuhhirten trieben die Herde zusammen.,De Köhers bröchen de Kudd bieen.
17022,Dieser Junge redet wie ein Erwachsener.,De Jung snackt as en grootwussen Minsch.
5924,Der Minirock ist wieder in Mode.,De Minirock is wedder Mood.
2905,"Ich habe gehört, dass du ein guter Tennisspiel...","Ik heff höört, dat du en goden Tennisspeler büst."


## Wikipedia parallel sentences

Facebook aligned through all languages of Wikipedia suitable sentences and published it on: https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
The corresponding paper was published from: Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, [WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia arXiv](https://arxiv.org/abs/1907.05791), July 11 2019. The data is provided under the Creative Commons Attribution-ShareAlike license.

Luckily there is a lower german version of Wikipedia with over 60.000 articles. Of course many of them are specific for lower german topics and therefore there is no german version of the article. Still Facebook managed it to align 70.000 sentences.
We will use this dataset to have a bigger databasis.

Facebook did already a ranking how convinced they are about the quality of the translation. As we will see later, this ranking is just an indication for this dataset as there are many wrong translations even among the sentences with a high ranking.

In [70]:
wiki_complete = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds.tsv.gz",sep="\t+"
                            , engine="python", header= None
                            ,encoding="utf-8", compression="gzip",
                           names = ["threshold","deu","nds"])

# function to select a subset of the complete dataset
def wiki_selection(df, boundary):
    '''returns a copy of wikipedia dataframe only containing values above boundary'''
    df = df.copy()
    # a manual view of the dataset shows that the sentences above a threshold of 1.2 are nonsense
    df = df[(df.threshold < 1.2) & (df.threshold > boundary)]
    return df[["deu","nds"]]


# with an under boundary we can control how big and faulty our dataset should be
# we use as boundary 1.04 to find a trade of between a big dataset and the chance to find good sentences
wiki_raw = wiki_selection(wiki_complete, 1.04)
wiki_raw.shape

(75582, 2)

All translations above 1.2 are wrong or doesn't fit very well. Maybe there is a small bug in the algorithm. Below 1.2 the quality is pretty good but decrease as we go further down.

In [71]:
wiki_complete.head(20)

Unnamed: 0,threshold,deu,nds
0,8.0,Variety (11. März 2009).,"Gautier betekent Gautier, Oort in Mississippi,..."
1,1.250414,Ich werde meinen Herrn für dich um Vergebung b...,"19,19), üm em wat rünner to maken (Joh."
2,1.236675,Lobet den Herrn; denn lobsingen ist gut).,Siel (de oder dat) is ene lüttje Diekslüüs.
3,1.225012,"Kalle jedoch bezichtigt ihn der Lüge, da der S...",Vun dat Amt is he denn aber wegen eenig Affäre...
4,1.213181,Noach (männlich) Noe (männlich) Noé (männlich)...,"So geev dat een Bund mit Adam, een annern Bund..."
5,1.213065,Nun also fürchtet euch nicht!,So bruukt se denn nich to friegen.
6,1.210924,Daraufhin habe er gerufen: “This is the shit!,Dat bedutt „na dat (rabbinsche) jöödsche Wett“.
7,1.201537,Tulifant tritt ihnen in Gestalt eines furchter...,Se weert in FEU (Forty-foot Equivalent Unit) m...
8,1.195482,Kann ich dir helfen?,Woans kann ik Se helpen?
9,1.192203,"Und wir wissen (wohl) darüber Bescheid, wie es...","Wi schöllt Beides, wat wi schöllt un wat wi ni..."


At the tail of the complete dataset we see that almost no sentence fit anymore. With the threshold of 1.04 we have more right translations. Still they are not quite convincing.

In [72]:
wiki_complete.tail(5)

Unnamed: 0,threshold,deu,nds
188406,1.02,"Im Osten des Dorfes liegt ein tiefes Tobel, du...",In’n Oosten flütt de Ried dör dat Rebeed von W...
188407,1.02,Es wurde 1889 erneuert.,Dat weer he bet 1889.
188408,1.02,Das Prokopfeinkommen betrug 17.077 USD.,Dat Inkamen je Kopp vun de hele Inwahnertall w...
188409,1.02,"Der OT Rockenberg zählt 2.297 Einwohner, der O...","Sylt hett 21.190 Inwahner, dorvan leevt 9.072 ..."
188410,1.02,Der Gruß geht auf die Wandertätigkeit der zünf...,"De Beteken Böhnhaas kümmt van de Warksteed,«» ..."


In [73]:
wiki_raw.tail(5)

Unnamed: 0,deu,nds
75585,Michael Starkl ist leidenschaftlicher Tänzer u...,Michael McCulley is verheiradt un hett söss Ki...
75586,Sie denkt laut und trägt das Herz auf der Zunge.,Se snacken aber ok en Dialekt vun de Kalispel.
75587,"Beide besuchen eine Ausstellung, auf der sich ...",Tosamen besöökt de beiden Fronslüüd en Utstell...
75588,Aadahl ist der Familienname folgender Personen...,Aadahl is de Familiennaam von Erik Aadahl (* 1...
75589,"Die Wagen bestehen aus sechs Sitzen, jeweils z...","De Texten kriegt elk enen Kood, de ut bet to d..."


Tn english the enumeration is for example 2nd, 5th, 7th, ...
In german it is common to set a point after the number: 2., 5., 7.,
The wikipedia dataset uses mainly points to identify sentences, but shortens therefore when enumeration is used.
These half sentences mostly doesn't make sense, so they will be dropped.

In [74]:
def delete_wrong_enumeration(df):
    
    len_before = len(df)
    drop_index = df[df.nds.str.contains("\d\.") | df.deu.str.contains("\d\.")].index
    df.drop(index=drop_index, inplace = True)
    print("Deleted wrong enumerations: ", len_before - len(df))
    
delete_wrong_enumeration(wiki_raw)

Deleted wrong enumerations:  7313


## Subset and joining datasets

First let's have a very general overview about both datasets.

We see that the Wikipedia Dataset is much larger than the other dataset. But we already know that the content of the Tatoeba dataset is correct as we aligened by id (and a quick look into the data proved it). For the Wikipedia dataset we can't be sure which sentence is correctly translated and which isn't.

In [78]:
fig = go.Figure([go.Bar(x=["Tatoeba","Wikipedia"], y = [len(tatoeba_raw),len(wiki_raw)])])
fig.update_layout(
    title_text='Total sentences in the datasets', # title of plot
    yaxis_title_text='Number of Sentences', # xaxis label
    xaxis_title_text='Dataset', # yaxis label
)
fig.show()

A little deeper look at the length of each sentence will help to understand further differences in the datasets.

For example the number of words (or more precisely tokens) in each sentence.
For a good performance in the neural net we will need to limit the sentence length.

In [79]:
# functions to calculate the lenght of sentences and to choose a range which should be included in the dataset
def get_length(df):
    df_output = df.copy()
    # split by a simple tokenizer which uses regex and return the length of tokens
    df_output.nds = df_output.nds.str.split(r"[\s.,;:?!-\"\']+")
    df_output.deu = df_output.deu.str.split(r"[\s.,;:?!-\"\']+")
    return df_output.applymap(len)

def get_range(df, start, end):
    df_length = get_length(df)
    df_length = df_length[df_length.nds.ge(start) & df_length.nds.le(end)]
    df_length = df_length[df_length.deu.ge(start) & df_length.deu.le(end)]

    return df.loc[df_length.index,:]



But how long is a good lenght without loosing to many examples?

In [80]:
def plot_length_dis(df):
    df_plot = get_length(df)
    fig = go.Figure()
    fig.add_trace(go.Box(y=df_plot.deu, name = "Hochdeutsch"))
    fig.add_trace(go.Box(y=df_plot.nds, name="Niederdeutsch"))

    fig.show()

print("Wikipedia-data:",get_length(wiki_raw).describe())
print("Tatoeba-data:",get_length(tatoeba_raw).describe())
plot_length_dis(wiki_raw)

Wikipedia-data:                 deu           nds
count  68269.000000  68269.000000
mean      17.950387     18.673117
std       12.854764     14.423439
min        1.000000      1.000000
25%       11.000000     11.000000
50%       16.000000     17.000000
75%       23.000000     23.000000
max     1464.000000   1908.000000
Tatoeba-data:                 deu           nds
count  17483.000000  17483.000000
mean       7.246983      7.425785
std        3.300230      3.478409
min        2.000000      2.000000
25%        5.000000      6.000000
50%        7.000000      7.000000
75%        8.000000      9.000000
max      221.000000    226.000000


As we see above we have some extreme outliers in the wikipedia dataset. Moreover we see already that there seems to be a big difference in the length of the datasets. We will first limit the total length of the datasets to get rid of the outliers.

As the median and quantiles differ so heavily between the datasets, lets calculate how many sentences we can keep with a given max. length. You could play a bit with the numbers, but we will go with 25 as it seems a fair compromise

In [103]:
max_length = 25
round(100 * sum(get_length(wiki_raw).deu <= max_length) / len(wiki_raw),1)

82.3

Let's filter our dataset and see how much we loose. For our purpuses it is good.

In [104]:
# get the length 
wiki_df = get_range(wiki_raw, 1, 25)
tatoeba_df = get_range(tatoeba_raw, 1, 25)
print("Wikipedia-Sentences above range: ", len(wiki_raw) - len(wiki_df))
print("tatoeba-Sentences above range: ", len(tatoeba_raw) - len(tatoeba_df))



Wikipedia-Sentences above range:  15029
tatoeba-Sentences above range:  47


We already saw that we have differences in length in the dataset. The following stats show it.

In [107]:
# count words and calculate the differences between german and low german

wiki_words_per_sent = get_length(wiki_df)
tatobae_words_per_sent = get_length(tatoeba_df)
total_words_wiki = wiki_words_per_sent.sum()
total_words_tatoeba = tatobae_words_per_sent.sum()
print("Low German Tokens in Wikipedia Dataset: ", wiki_words_per_sent.nds.sum())
print("German Tokens in Wikipedia Dataset: ", wiki_words_per_sent.deu.sum())
print("Token Ratio Low-German/German in Wikipedia Dataset: ",wiki_words_per_sent.nds.sum() / wiki_words_per_sent.deu.sum())
print("Low German Tokens in tatoeba Dataset: ", tatobae_words_per_sent.nds.sum())
print("German Tokens in tatoeba Dataset: ", tatobae_words_per_sent.deu.sum())
print("Token Ratio Low-German/German in tatoeba Dataset: ",tatobae_words_per_sent.nds.sum() / tatobae_words_per_sent.deu.sum())




Low German Tokens in Wikipedia Dataset:  776801
German Tokens in Wikipedia Dataset:  744952
Token Ratio Low-German/German in Wikipedia Dataset:  1.0427530901319817
Low German Tokens in tatoeba Dataset:  127933
German Tokens in tatoeba Dataset:  124951
Token Ratio Low-German/German in tatoeba Dataset:  1.023865355219246


In [108]:
fig = go.Figure()

fig.add_trace(go.Histogram(x=tatobae_words_per_sent.nds, name = "Tatoeba Dataset"))
fig.add_trace(go.Histogram(x=wiki_words_per_sent.nds, name="Wikipedia Dataset"))

# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(
    title_text='Histogram of Tokens per Sentence in Low German', # title of plot
    xaxis_title_text='Number of Tokens / Sentence', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.02, # gap between bars of adjacent location coordinates
    bargroupgap=0.01 # gap between bars of the same location coordinates
)
fig.show()

We see that there are big differences in the datasets. Also if you look into the datasets, you will see that the tatoeba data contains more "simple" and short sentences as they describe sentences in daily use. The wikipedia sentences describe more complex facts and are therefore longer.

Between each language the difference is not very significant. Sentences in low german and german have fairly a similar length.

In [19]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=wiki_words_per_sent.deu, name = "German"))

fig.add_trace(go.Histogram(x=wiki_words_per_sent.nds, name="Low German"))

# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(
    title_text='Wikipedia Dataset Histogram of Tokens per Sentence in Low German and German', # title of plot
    xaxis_title_text='Tokens', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.02, # gap between bars of adjacent location coordinates
    bargroupgap=0.01 # gap between bars of the same location coordinates
)
fig.show()

## Most common words in datasets

To understand both languages better, we will see how the words are distributed in each language.


In [109]:
import spacy

# take for german the spacy tokenizer.
spacy_de = spacy.load('de')

def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

# Low German sometimes has apostrophs ' in between words but they are abbreviations and count as one
# threrfore we will take a simple custom tokenizer but with which we cover these specialities.
def tokenize_nds(text):
    """
    Tokenizes Low German text from a string into a list of strings (tokens)
    """
    text = re.sub(r"([.,\"\-;*:%?!&#])", r" \1", text)
    text = re.split(r"[\s]", text)
    text = [a for a in text if len(a)>0]
    return text

In [21]:
def get_word_count(df):
    # first we will get the tokens
    words = df.copy()
    words.deu = words.deu.apply(tokenize_de)
    words.nds = words.nds.apply(tokenize_nds)
    # we will put all in one big list
    words_deu = []
    for ls in words.deu:
        words_deu += ls
    words_nds = []
    for ls in words.nds:
        words_nds += ls
    
    words_count_deu = pd.Series(words_deu, name="deu")
    words_count_nds = pd.Series(words_nds, name="nds")

    words_count_deu = words_count_deu.str.replace(r"\W|\d","")
    words_count_deu = words_count_deu.value_counts()

    words_count_nds = words_count_nds.str.replace(r"\W|\d","")
    words_count_nds = words_count_nds.value_counts()
    words_count_nds = pd.DataFrame({"nds": words_count_nds.index, "nds_count": words_count_nds.values.tolist()})
    words_count_deu = pd.DataFrame({"deu": words_count_deu.index, "deu_count": words_count_deu.values.tolist()})
    words_count_deu.drop([0], inplace=True)
    words_count_nds.drop([0], inplace=True)

    return words_count_deu, words_count_nds

In [22]:
wiki_deu, wiki_nds = get_word_count(wiki_df)
tatoeba_deu, tatoeba_nds = get_word_count(tatoeba_df)

In [23]:
print("Unique words in German: ",len(wiki_deu))
print("Unique words in Low German: ",len(wiki_nds))

Unique words in German:  96419
Unique words in Low German:  86369


In [24]:
single_words_wiki_before = sum(wiki_nds.nds_count == 1)
single_words_tatoeba_before = sum(tatoeba_nds.nds_count == 1)


print("Number of Low German Words which appears only once in Wikipedia: ",single_words_wiki_before)
print("Percentage of Low German Words which appears only once in Wikipedia: ",100*single_words_wiki_before / len(wiki_nds))
print("Number of Low German Words which appears only once in Tatoeba: ",single_words_tatoeba_before)
print("Percentage of Low German Words which appears only once: ",100* single_words_tatoeba_before / len(tatoeba_nds))

Number of Low German Words which appears only once in Wikipedia:  53910
Percentage of Low German Words which appears only once in Wikipedia:  62.41822876263474
Number of Low German Words which appears only once in Tatoeba:  4712
Percentage of Low German Words which appears only once:  50.96809085992428


In [25]:
wiki_nds.groupby("nds_count").count()

Unnamed: 0_level_0,nds
nds_count,Unnamed: 1_level_1
1,53910
2,11621
3,5756
4,3116
5,1972
6,1421
7,1000
8,756
9,653
10,541


In [26]:
fig = go.Figure()
#fig.add_trace(go.Histogram(x=wiki_deu.deu_count.values, name = "German"))

fig.add_trace(go.Histogram(x=wiki_nds.groupby("nds_count").count().index, name="Low German"))

# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(
    title_text='Wikipedia Dataset Histogram of Tokens per Sentence in Low German and German', # title of plot
    xaxis_title_text='Tokens', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.02, # gap between bars of adjacent location coordinates
    bargroupgap=0.01 # gap between bars of the same location coordinates
)
fig.show()

# Regex correction

We have many different spellings, but for some common words or patterns, we will define the "right" spelling according to the word-book of [Peter Hansen](https://www.niederdeutsche-literatur.de/dwn/index.php).

In the notebook "word_replacement" you can see how the list was created. It covers the most frequent alternative spellings in the bigger Wikipedia dataset.

In [27]:
dict_hansen = pd.read_csv("preprocessed_data/hansen/replacement_list_viewed.csv", index_col = 0, sep=";")


dict_hansen.dropna(inplace=True)

dict_hansen = dict_hansen[dict_hansen["count"] > 0][["word","replaced_by"]]

dict_hansen.reset_index(drop=True,inplace=True)
dict_hansen.head(4)


Unnamed: 0,word,replaced_by
0,von,vun
1,up,op
2,bit,bet
3,Johr,Jahr


For the replacement we will tokenize the words, compare each token if it is in the alternatives. It could happen that we have overlap with all the different spellings for several words. To be sure that we only replace words with the same meaning, we will only replace those where we have one clear match.

In [28]:
ignore_str = "weer ween weern weerst ween  weer ween weren ween hebben harr hatt harrst hatt harr hatt harrn hatt warrn wöör worrn wöörst worrn wöör worrn wörrn worrn doon harr doon harrst doon harr doon harrn doon"

ignore_str = ignore_str + "bün heff du büst hest  hett sünd hebbt warrn warr warrst warrt warrt doon do deist deit doot"

In [32]:
def apply_replace(df, row):
    word = row[0]
    correction = row[1]
    count = df.nds.str.count(rf"\b{word}\b").sum()
    print("Replacements of " , word , "to ", correction)
    print("Number of replacements: ", count)

    df.nds = df.nds.str.replace(rf"\b{word}\b", correction, case = True)
    
    
def replace_schrieven(df):
    print("Replacements of 'schreven' to 'schrieven': ",df.nds.str.count(r"\bschreven\b").sum())
    df.nds = df.nds.str.replace(r"\bschreven\b", "schrieven")


In [33]:
replace_schrieven(wiki_df)
replace_schrieven(tatoeba_df)



dict_hansen.apply(lambda x: apply_replace(wiki_df,x), axis=1)
dict_hansen.apply(lambda x: apply_replace(tatoeba_df,x), axis=1)


Replacements of 'schreven' to 'schrieven':  0
Replacements of 'schreven' to 'schrieven':  81
Replacements of  von to  vun
Number of replacements:  0
Replacements of  up to  op
Number of replacements:  0
Replacements of  bit to  bet
Number of replacements:  0
Replacements of  Johr to  Jahr
Number of replacements:  0
Replacements of  van to  vun
Number of replacements:  0
Replacements of  aber to  aver
Number of replacements:  0
Replacements of  dor to  dar
Number of replacements:  0
Replacements of  disse to  düsse
Number of replacements:  0
Replacements of  blots to  bloot
Number of replacements:  0
Replacements of  hüm to  em
Number of replacements:  0
Replacements of  annern to  ännern
Number of replacements:  0
Replacements of  um to  üm
Number of replacements:  0
Replacements of  annere to  anner
Number of replacements:  0
Replacements of  kregen to  kriegen
Number of replacements:  0
Replacements of  dee to  de
Number of replacements:  0
Replacements of  wunn to  wunnern
Number of

Replacements of  Leer to  Ledder
Number of replacements:  0
Replacements of  jedes to  jeder
Number of replacements:  0
Replacements of  Norden to  Noord
Number of replacements:  0
Replacements of  was to  wat
Number of replacements:  0
Replacements of  Gemeend to  Gemeen
Number of replacements:  0
Replacements of  Kort to  Kaart
Number of replacements:  0
Replacements of  Örgel to  Orgel
Number of replacements:  0
Replacements of  Red to  Reed
Number of replacements:  0
Replacements of  dragen to  dregen
Number of replacements:  0
Replacements of  düt to  dit
Number of replacements:  0
Replacements of  maaken to  maken
Number of replacements:  0
Replacements of  Kindheit to  Kindheet
Number of replacements:  0
Replacements of  vor to  vör
Number of replacements:  0
Replacements of  kloor to  klaar
Number of replacements:  0
Replacements of  Pastoor to  Paster
Number of replacements:  0
Replacements of  Osten to  Oost
Number of replacements:  0
Replacements of  just to  jüst
Number of 

Replacements of  Nakamen to  Nakaam
Number of replacements:  0
Replacements of  Poor to  Paar
Number of replacements:  0
Replacements of  maal to  mal
Number of replacements:  0
Replacements of  roden to  raden
Number of replacements:  0
Replacements of  twalf to  twölf
Number of replacements:  0
Replacements of  möglich to  mööglich
Number of replacements:  0
Replacements of  lieken to  glieken
Number of replacements:  0
Replacements of  Miss to  Mist
Number of replacements:  0
Replacements of  El to  Ell
Number of replacements:  0
Replacements of  Flach to  Flaag
Number of replacements:  0
Replacements of  düster to  düüster
Number of replacements:  0
Replacements of  Kate to  Kaat
Number of replacements:  0
Replacements of  seven to  söven
Number of replacements:  0
Replacements of  ohn to  ahn
Number of replacements:  0
Replacements of  Aante to  Aant
Number of replacements:  0
Replacements of  Woord to  Woort
Number of replacements:  0
Replacements of  sonst to  faten
Number of re

Replacements of  geboren to  baren
Number of replacements:  0
Replacements of  gullen to  gollen
Number of replacements:  0
Replacements of  Roscher to  Rüüsch
Number of replacements:  0
Replacements of  leden to  leiden
Number of replacements:  0
Replacements of  mennig to  männich
Number of replacements:  0
Replacements of  Liev to  Lief
Number of replacements:  0
Replacements of  bei to  beid
Number of replacements:  0
Replacements of  Stern to  Steern
Number of replacements:  0
Replacements of  Lööw to  Lööv
Number of replacements:  0
Replacements of  dröge to  dröög
Number of replacements:  0
Replacements of  Garden to  Gaarn
Number of replacements:  0
Replacements of  Dame to  Daam
Number of replacements:  0
Replacements of  Inn to  Enn
Number of replacements:  0
Replacements of  Deele to  Deel
Number of replacements:  0
Replacements of  Di to  Pie
Number of replacements:  0
Replacements of  Ühr to  Öhr
Number of replacements:  0
Replacements of  Leefste to  Leevster
Number of re

Replacements of  kunnig to  künnig
Number of replacements:  0
Replacements of  Schinkel to  Schenkel
Number of replacements:  0
Replacements of  koold to  koolt
Number of replacements:  0
Replacements of  Ehpoor to  Ehpaar
Number of replacements:  0
Replacements of  Garde to  Gaarn
Number of replacements:  0
Replacements of  Teen to  Tehn
Number of replacements:  0
Replacements of  Wunder to  Wunner
Number of replacements:  0
Replacements of  dulle to  dull
Number of replacements:  0
Replacements of  enkeld to  enkelt
Number of replacements:  0
Replacements of  bilden to  billen
Number of replacements:  0
Replacements of  lüttjet to  lütt
Number of replacements:  0
Replacements of  Var to  Vadder
Number of replacements:  0
Replacements of  Goorn to  Gaarn
Number of replacements:  0
Replacements of  Bispill to  Bispeel
Number of replacements:  0
Replacements of  domals to  domaals
Number of replacements:  0
Replacements of  uprichten to  oprichten
Number of replacements:  0
Replacements

Replacements of  Mag to  Maag
Number of replacements:  0
Replacements of  flaak to  flach
Number of replacements:  0
Replacements of  kenn to  kennen
Number of replacements:  0
Replacements of  Lech to  Lechel
Number of replacements:  0
Replacements of  Huusfru to  Huusfro
Number of replacements:  0
Replacements of  Kabel to  Kavel
Number of replacements:  0
Replacements of  unnerscheedlich to  ünnerscheedlich
Number of replacements:  0
Replacements of  well to  wedder
Number of replacements:  0
Replacements of  Bain to  Been
Number of replacements:  0
Replacements of  Be to  Been
Number of replacements:  0
Replacements of  nahe to  na
Number of replacements:  0
Replacements of  schlicht to  slicht
Number of replacements:  0
Replacements of  Re to  Reeg
Number of replacements:  0
Replacements of  Flanke to  Flank
Number of replacements:  0
Replacements of  Alt to  Boord
Number of replacements:  0
Replacements of  Flagge to  Flagg
Number of replacements:  0
Replacements of  Muster to  M

Replacements of  Düstern to  Düüster
Number of replacements:  0
Replacements of  Ort to  Oort
Number of replacements:  0
Replacements of  Knie to  Knee
Number of replacements:  0
Replacements of  Full to  Foolt
Number of replacements:  0
Replacements of  wohin to  wohen
Number of replacements:  0
Replacements of  Deinst to  Deenst
Number of replacements:  0
Replacements of  Wurt to  Wuurt
Number of replacements:  0
Replacements of  döör to  dör
Number of replacements:  0
Replacements of  gen to  mögen
Number of replacements:  0
Replacements of  sweer to  swaar
Number of replacements:  0
Replacements of  Knütten to  Knütt
Number of replacements:  0
Replacements of  wert to  weert
Number of replacements:  0
Replacements of  Tull to  Toll
Number of replacements:  0
Replacements of  uplösen to  oplösen
Number of replacements:  0
Replacements of  dansen to  danzen
Number of replacements:  0
Replacements of  Doot to  Dood
Number of replacements:  0
Replacements of  Nahber to  Naver
Number of

Replacements of  Ruggen to  Rüch
Number of replacements:  0
Replacements of  Berne to  Beer
Number of replacements:  0
Replacements of  ass to  as
Number of replacements:  0
Replacements of  Mus to  Muus
Number of replacements:  0
Replacements of  Veh to  Veeh
Number of replacements:  0
Replacements of  Lüü to  Lüüd
Number of replacements:  0
Replacements of  Wäsch to  Wasch
Number of replacements:  0
Replacements of  Wacht to  Wach
Number of replacements:  0
Replacements of  oll to  oolt
Number of replacements:  0
Replacements of  Gerät to  Gereed
Number of replacements:  0
Replacements of  Broen to  Braden
Number of replacements:  0
Replacements of  Grootmudder to  Grootmoder
Number of replacements:  0
Replacements of  Loov to  Loof
Number of replacements:  0
Replacements of  Fruunslüüd to  Froonslüüd
Number of replacements:  0
Replacements of  Wilt to  Welt
Number of replacements:  0
Replacements of  mein to  menen
Number of replacements:  0
Replacements of  lieks to  liek
Number of

Replacements of  mörgens to  morgens
Number of replacements:  0
Replacements of  nerrn to  nedden
Number of replacements:  0
Replacements of  Wurm to  Worm
Number of replacements:  0
Replacements of  Sang to  Gesang
Number of replacements:  0
Replacements of  Deev to  Deef
Number of replacements:  0
Replacements of  Schut to  Schutt
Number of replacements:  0
Replacements of  ümmers to  ümmer
Number of replacements:  0
Replacements of  Flick to  Flicken
Number of replacements:  0
Replacements of  Hannschen to  Handsch
Number of replacements:  0
Replacements of  Bruder to  Broder
Number of replacements:  0
Replacements of  Leber to  Lever
Number of replacements:  0
Replacements of  Handschen to  Handsch
Number of replacements:  0
Replacements of  Jümfer to  Jumfer
Number of replacements:  0
Replacements of  vuneenanner to  vuneen
Number of replacements:  0
Replacements of  dritte to  drütt
Number of replacements:  0
Replacements of  Plantage to  Plantaasch
Number of replacements:  0
Rep

Replacements of  Töhn to  Tehn
Number of replacements:  0
Replacements of  Öber to  Över
Number of replacements:  0
Replacements of  frie to  free
Number of replacements:  0
Replacements of  Dänzer to  Danzer
Number of replacements:  0
Replacements of  dür to  dör
Number of replacements:  0
Replacements of  Falt to  Foolt
Number of replacements:  0
Replacements of  Frage to  Fraag
Number of replacements:  0
Replacements of  dütt to  dit
Number of replacements:  0
Replacements of  herum to  rüm
Number of replacements:  0
Replacements of  Döntje to  Dööntje
Number of replacements:  0
Replacements of  Köhm to  Kööm
Number of replacements:  0
Replacements of  li to  lehren
Number of replacements:  0
Replacements of  Kök to  Köök
Number of replacements:  0
Replacements of  Köken to  Köök
Number of replacements:  0
Replacements of  Büxen to  Büx
Number of replacements:  0
Replacements of  Plünn to  Plünnen
Number of replacements:  0
Replacements of  Dress to  Dresp
Number of replacements:  0

Replacements of  daun to  doon
Number of replacements:  0
Replacements of  verstohn to  verstahn
Number of replacements:  0
Replacements of  Bug to  Boog
Number of replacements:  0
Replacements of  Ganter to  Ganner
Number of replacements:  0
Replacements of  kN to  spreken
Number of replacements:  0
Replacements of  Eckel to  Ecker
Number of replacements:  0
Replacements of  iut to  ut
Number of replacements:  0
Replacements of  Verlöff to  Verlööf
Number of replacements:  0
Replacements of  Ruhe to  Roh
Number of replacements:  0
Replacements of  Rage to  Roh
Number of replacements:  0
Replacements of  Gassen to  Garsten
Number of replacements:  0
Replacements of  Baut to  Boot
Number of replacements:  0
Replacements of  Borch to  Borg
Number of replacements:  0
Replacements of  fromm to  fraam
Number of replacements:  0
Replacements of  Börst to  Böst
Number of replacements:  0
Replacements of  weihn to  weihen
Number of replacements:  0
Replacements of  loosgahn to  losgahn
Number 

Replacements of  düchtich to  düchtig
Number of replacements:  0
Replacements of  Piepe to  Piep
Number of replacements:  0
Replacements of  Pipe to  Piep
Number of replacements:  0
Replacements of  wann to  wenn
Number of replacements:  0
Replacements of  Roor to  Roder
Number of replacements:  0
Replacements of  Noom to  Naam
Number of replacements:  0
Replacements of  Dum to  Dumen
Number of replacements:  0
Replacements of  niks to  nix
Number of replacements:  0
Replacements of  allein to  alleen
Number of replacements:  0
Replacements of  Hem to  Hemd
Number of replacements:  0
Replacements of  Nahr to  Narr
Number of replacements:  0
Replacements of  dalrieten to  daalrieten
Number of replacements:  0
Replacements of  fievteihn to  föffteihn
Number of replacements:  0
Replacements of  String to  Strang
Number of replacements:  0
Replacements of  utrauhn to  utrohen
Number of replacements:  0
Replacements of  stur to  stuur
Number of replacements:  0
Replacements of  Dachs to  Da

Replacements of  bloß to  bloot
Number of replacements:  0
Replacements of  künn to  könen
Number of replacements:  0
Replacements of  ännert to  ännern
Number of replacements:  15
Replacements of  Eer to  Eerd
Number of replacements:  41
Replacements of  fiev to  fief
Number of replacements:  55
Replacements of  Perfesser to  Professer
Number of replacements:  7
Replacements of  dorvun to  darvun
Number of replacements:  0
Replacements of  unner to  ünner
Number of replacements:  0
Replacements of  swoor to  swaar
Number of replacements:  45
Replacements of  neben to  neven
Number of replacements:  0
Replacements of  sogor to  sogaar
Number of replacements:  3
Replacements of  Westen to  West
Number of replacements:  4
Replacements of  poor to  paar
Number of replacements:  73
Replacements of  uptreden to  optreden
Number of replacements:  0
Replacements of  Ernst to  Eernst
Number of replacements:  0
Replacements of  beide to  beid
Number of replacements:  7
Replacements of  reck to 

Replacements of  können to  könen
Number of replacements:  0
Replacements of  düern to  duern
Number of replacements:  0
Replacements of  fre to  free
Number of replacements:  3
Replacements of  holen to  halen
Number of replacements:  1
Replacements of  Sellschopp to  Sellschop
Number of replacements:  0
Replacements of  im to  em
Number of replacements:  0
Replacements of  lieke to  liek
Number of replacements:  2
Replacements of  grünn to  grünnen
Number of replacements:  0
Replacements of  Siegen to  Sied
Number of replacements:  0
Replacements of  Leben to  Leven
Number of replacements:  0
Replacements of  olle to  all
Number of replacements:  0
Replacements of  eenfack to  eenfach
Number of replacements:  0
Replacements of  Arbeid to  Arbeit
Number of replacements:  69
Replacements of  Fohrt to  Fahrt
Number of replacements:  5
Replacements of  Halle to  Hall
Number of replacements:  0
Replacements of  Gewicht to  Wicht
Number of replacements:  0
Replacements of  höörn to  hören


Replacements of  hoog to  hooch
Number of replacements:  36
Replacements of  Schill to  Schell
Number of replacements:  0
Replacements of  sichtbor to  sichtbar
Number of replacements:  3
Replacements of  Doktor to  Dokter
Number of replacements:  0
Replacements of  Kinnergoorn to  Kinnergaarn
Number of replacements:  2
Replacements of  Brügge to  Brüch
Number of replacements:  0
Replacements of  Uhl to  Uul
Number of replacements:  0
Replacements of  duchtig to  düchtig
Number of replacements:  0
Replacements of  Henkel to  Henk
Number of replacements:  0
Replacements of  jedet to  jeder
Number of replacements:  1
Replacements of  wied to  wiet
Number of replacements:  41
Replacements of  vgl to  düselig
Number of replacements:  0
Replacements of  süden to  süüdlich
Number of replacements:  2
Replacements of  vullständig to  vullstännig
Number of replacements:  0
Replacements of  südlich to  süüdlich
Number of replacements:  0
Replacements of  mehrstendeels to  meistendeels
Number of 

Replacements of  Suldat to  Soldaat
Number of replacements:  0
Replacements of  Meister to  Meester
Number of replacements:  0
Replacements of  gresig to  gräsig
Number of replacements:  16
Replacements of  akkerat to  akkeraat
Number of replacements:  0
Replacements of  hüde to  hüüt
Number of replacements:  0
Replacements of  eben to  even
Number of replacements:  0
Replacements of  Bohr to  Bahr
Number of replacements:  0
Replacements of  sen to  sien
Number of replacements:  0
Replacements of  Paus to  Poos
Number of replacements:  3
Replacements of  domols to  domaals
Number of replacements:  0
Replacements of  Vaders to  Vadder
Number of replacements:  0
Replacements of  Först to  Fast
Number of replacements:  0
Replacements of  Gasten to  Garsten
Number of replacements:  0
Replacements of  Flöte to  Stroom
Number of replacements:  0
Replacements of  polietsch to  plietsch
Number of replacements:  0
Replacements of  Blöte to  Blööt
Number of replacements:  0
Replacements of  wur 

Replacements of  Haut to  Hoot
Number of replacements:  0
Replacements of  Pupp to  Popp
Number of replacements:  0
Replacements of  ruut to  rut
Number of replacements:  1
Replacements of  Link to  Lenk
Number of replacements:  0
Replacements of  Maas to  Moos
Number of replacements:  0
Replacements of  Warktüüch to  Warktüüg
Number of replacements:  0
Replacements of  ernst to  eernst
Number of replacements:  0
Replacements of  ehe to  ehr
Number of replacements:  0
Replacements of  egal to  egaal
Number of replacements:  0
Replacements of  swach to  swack
Number of replacements:  0
Replacements of  Kerke to  Kark
Number of replacements:  0
Replacements of  Kerk to  Kark
Number of replacements:  0
Replacements of  bewohren to  wahren
Number of replacements:  0
Replacements of  Mode to  Mood
Number of replacements:  0
Replacements of  Esk to  Esch
Number of replacements:  0
Replacements of  geheim to  geheem
Number of replacements:  0
Replacements of  Dans to  Danz
Number of replaceme

Replacements of  dreih to  dreihen
Number of replacements:  0
Replacements of  Holz to  Holt
Number of replacements:  0
Replacements of  Etage to  Etaasch
Number of replacements:  0
Replacements of  Schnee to  Snee
Number of replacements:  0
Replacements of  Flege to  Fleeg
Number of replacements:  0
Replacements of  dartig to  dörtig
Number of replacements:  0
Replacements of  Eel to  Ell
Number of replacements:  1
Replacements of  How to  Hoff
Number of replacements:  0
Replacements of  murken to  marken
Number of replacements:  0
Replacements of  Flood to  Floot
Number of replacements:  2
Replacements of  Schillen to  Schilling
Number of replacements:  0
Replacements of  runne to  rund
Number of replacements:  1
Replacements of  apenbor to  apenbar
Number of replacements:  0
Replacements of  enge to  eng
Number of replacements:  1
Replacements of  Dune to  Duun
Number of replacements:  0
Replacements of  Uproop to  Oproop
Number of replacements:  0
Replacements of  sogar to  sogaar


Replacements of  Bu to  Bo
Number of replacements:  0
Replacements of  Kerken to  Kark
Number of replacements:  0
Replacements of  Kiärke to  Kark
Number of replacements:  0
Replacements of  Sturm to  Storm
Number of replacements:  0
Replacements of  Schoolmester to  Schoolmeester
Number of replacements:  0
Replacements of  Fohrtüüg to  Fahrtüüg
Number of replacements:  1
Replacements of  Utdrück to  Utdruck
Number of replacements:  0
Replacements of  Emmer to  Ammer
Number of replacements:  1
Replacements of  Deer to  Deert
Number of replacements:  0
Replacements of  Huud to  Huut
Number of replacements:  7
Replacements of  uppassen to  oppassen
Number of replacements:  0
Replacements of  Eest to  Nest
Number of replacements:  0
Replacements of  Störm to  Storm
Number of replacements:  0
Replacements of  proten to  praten
Number of replacements:  0
Replacements of  fleeten to  fleten
Number of replacements:  0
Replacements of  vorbi to  vörbi
Number of replacements:  0
Replacements of

Replacements of  din to  dien
Number of replacements:  1
Replacements of  höören to  hören
Number of replacements:  0
Replacements of  Laage to  Laag
Number of replacements:  0
Replacements of  väl to  veel
Number of replacements:  0
Replacements of  Heimot to  Heimat
Number of replacements:  0
Replacements of  App to  Appel
Number of replacements:  0
Replacements of  Wertschaft to  Weertschop
Number of replacements:  0
Replacements of  kapabel to  kumpabel
Number of replacements:  0
Replacements of  Mallöör to  Malöör
Number of replacements:  0
Replacements of  Lind to  Linn
Number of replacements:  0
Replacements of  Stünden to  Stünn
Number of replacements:  0
Replacements of  slaapen to  slapen
Number of replacements:  0
Replacements of  teeknen to  teken
Number of replacements:  0
Replacements of  Stunn to  Stünn
Number of replacements:  0
Replacements of  Blom to  Bloom
Number of replacements:  0
Replacements of  vörholen to  vörhollen
Number of replacements:  1
Replacements of  

Replacements of  Kerkhoff to  Karkhoff
Number of replacements:  0
Replacements of  Schiete to  Schiet
Number of replacements:  0
Replacements of  Morrn to  Morgen
Number of replacements:  0
Replacements of  Fähre to  Fähr
Number of replacements:  0
Replacements of  Kasper to  Karkspeel
Number of replacements:  0
Replacements of  heran to  ran
Number of replacements:  0
Replacements of  Falle to  Fall
Number of replacements:  0
Replacements of  Kaspeel to  Karkspeel
Number of replacements:  0
Replacements of  Sark to  Sarg
Number of replacements:  1
Replacements of  Gemüt to  Gemööt
Number of replacements:  0
Replacements of  waaken to  waken
Number of replacements:  0
Replacements of  Keunig to  König
Number of replacements:  0
Replacements of  Kook to  Koken
Number of replacements:  0
Replacements of  Huurn to  Hoorn
Number of replacements:  0
Replacements of  eerstmaal to  eerstmal
Number of replacements:  0
Replacements of  verkopen to  verköpen
Number of replacements:  0
Replacemen

Replacements of  Härte to  Hart
Number of replacements:  0
Replacements of  Fat to  Fatt
Number of replacements:  0
Replacements of  Saffran to  Safran
Number of replacements:  0
Replacements of  Lill to  Lilg
Number of replacements:  0
Replacements of  Feber to  Fever
Number of replacements:  0
Replacements of  Oller to  Öller
Number of replacements:  0
Replacements of  Olldag to  Alldag
Number of replacements:  0
Replacements of  langsaam to  langsam
Number of replacements:  1
Replacements of  langsame to  langsam
Number of replacements:  0
Replacements of  Stolz to  Stolt
Number of replacements:  0
Replacements of  harut to  rut
Number of replacements:  0
Replacements of  Middeweken to  Middeweek
Number of replacements:  0
Replacements of  Schanne to  Schann
Number of replacements:  0
Replacements of  Beil to  Biel
Number of replacements:  0
Replacements of  truerig to  trurig
Number of replacements:  0
Replacements of  Dogge to  Dogg
Number of replacements:  0
Replacements of  Warm

Replacements of  Fak to  Fack
Number of replacements:  0
Replacements of  afhaun to  afhauen
Number of replacements:  0
Replacements of  daarvan to  darvun
Number of replacements:  0
Replacements of  eun to  een
Number of replacements:  0
Replacements of  Dannebrog to  Dannbrog
Number of replacements:  0
Replacements of  mitunner to  mitünner
Number of replacements:  0
Replacements of  Furcht to  Forcht
Number of replacements:  0
Replacements of  Reu to  Rü
Number of replacements:  0
Replacements of  waarn to  warden
Number of replacements:  0
Replacements of  wern to  warden
Number of replacements:  0
Replacements of  mui to  mi
Number of replacements:  0
Replacements of  Eike to  Eek
Number of replacements:  0
Replacements of  Grenze to  Grenz
Number of replacements:  0
Replacements of  Slecht to  Geslecht
Number of replacements:  0
Replacements of  handeln to  hanneln
Number of replacements:  0
Replacements of  nargens to  narms
Number of replacements:  0
Replacements of  Besittersc

0       None
1       None
2       None
3       None
4       None
5       None
6       None
7       None
8       None
9       None
10      None
11      None
12      None
13      None
14      None
15      None
16      None
17      None
18      None
19      None
20      None
21      None
22      None
23      None
24      None
25      None
26      None
27      None
28      None
29      None
        ... 
1545    None
1546    None
1547    None
1548    None
1549    None
1550    None
1551    None
1552    None
1553    None
1554    None
1555    None
1556    None
1557    None
1558    None
1559    None
1560    None
1561    None
1562    None
1563    None
1564    None
1565    None
1566    None
1567    None
1568    None
1569    None
1570    None
1571    None
1572    None
1573    None
1574    None
Length: 1575, dtype: object

def replace_schrieven(df):
    print("Replacements of 'schreven' to 'schrieven': ",df.nds.str.count(r"\bschreven\b").sum())
    df.nds = df.nds.str.replace(r"\bschreven\b", "schrieven")
replace_schrieven(wiki_df)


replacements = pd.DataFrame(np.nan, index=np.arange(0,10000), columns=['word', 'replaced_by', 'count'])

def replace(df, word, correction):
    count = df.nds.str.count(rf"\b{word}\b").sum()
    print("Replacements of " , word , "to ", correction)
    print("Number of replacements: ", count)
    df.nds = df.nds.str.replace(rf"\b{word}\b", correction, case = True)
    return count
total_count = 0
idx_df = 0
for i in dict_hansen.index:
    correction = dict_hansen.loc[i,"replaced_by"]
    word = dict_hansen.loc[i,"word"]
    if word not in ignore_str:
        count = replace(wiki_df, word, correction)
        total_count += count
        print("Total replacements: ",total_count)
        print("--------------------------------")
        replacements.loc[idx_df,:] = [word, correction,count]
        idx_df += 1

In [34]:
print("Total number of replacements: ", replacements["count"].sum())
print("Percentage Replacement: ",round(100 * replacements["count"].sum() / total_words_wiki.nds), "%")

Total number of replacements:  0.0
Percentage Replacement:  0.0 %


replacements_tatoeba = pd.DataFrame(np.nan, index=np.arange(0,10000), columns=['word', 'replaced_by', 'count'])
total_count_t = 0
idx_df = 0
for i in dict_hansen.index:
    correction = dict_hansen.loc[i,"replaced_by"]
    word = dict_hansen.loc[i,"word"]
    if word not in ignore_str:
        count = replace(tatoeba_df, word, correction)
        total_count_t += count
        print("Total replacements: ",total_count_t)
        print("--------------------------------")
        replacements_tatoeba.loc[idx_df,:] = [word, correction,count]
        idx_df += 1

In [35]:
print("Total number of replacements: ", replacements_tatoeba["count"].sum())
print("Percentage Replacement: ",round(100 * replacements_tatoeba["count"].sum() / total_words_tatoeba.nds), "%")

NameError: name 'replacements_tatoabe' is not defined

In [36]:
wiki_deu_after, wiki_nds_after = get_word_count(wiki_df)
tatoeba_deu_after, tatoeba_nds_after = get_word_count(tatoeba_df)

single_words_tatoeba_after = sum(tatoeba_nds_after.nds_count == 1)
single_words_wiki_after = sum(wiki_nds_after.nds_count == 1)


print("Number of Low German Words which appears only once in Wikipedia: ",single_words_wiki_before)
print("Percentage of Low German Words which appears only once in Wikipedia: ",100*single_words_wiki_before / len(wiki_nds))
print("Number of Low German Words which appears only once in Wikipedia after correction: ",single_words_wiki_after)
print("Percentage of Low German Words which appears only once in Wikipedia after correction: ",100*single_words_wiki_after / len(wiki_nds))

print("Number of Low German Words which appears only once in Tatoeba: ",single_words_tatoeba_before)
print("Percentage of Low German Words which appears only once: ",100* single_words_tatoeba_before / len(tatoeba_nds))
print("Number of Low German Words which appears only once in Tatoeba after correction: ",single_words_tatoeba_after)
print("Percentage of Low German Words which appears only once after correction: ",100* single_words_tatoeba_after / len(tatoeba_nds_after))

Number of Low German Words which appears only once in Wikipedia:  53910
Percentage of Low German Words which appears only once in Wikipedia:  62.41822876263474
Number of Low German Words which appears only once in Wikipedia after correction:  53353
Percentage of Low German Words which appears only once in Wikipedia after correction:  61.77332144635228
Number of Low German Words which appears only once in Tatoeba:  4712
Percentage of Low German Words which appears only once:  50.96809085992428
Number of Low German Words which appears only once in Tatoeba after correction:  4601
Percentage of Low German Words which appears only once after correction:  51.15632643984879


In [None]:
# replace "ik" with "ick"
def replace_ik(df):    
    print("Replacements of 'ick' to 'ik': ",df.nds.str.count(r"(I|i)ck").sum())
    df.nds = df.nds.str.replace("(I|i)ck", "\1k")

# replace us with uns
def replace_uns(df):
    print("Replacements of 'us' to 'uns': ", df.nds.str.count("\s(U|u)s\s").sum())
    df.nds = df.nds.str.replace("\s(U|u)s\s", "\1ns")
    
# replace up with op
def replace_op_cap(df):
    print("Replacements of 'Up' to 'Op'", df.nds.str.count(r"\bUp([sdklptrmnbv]?[a-zäöü]*)\b").sum())
    df.nds = df.nds.str.replace(rf"\bUp([sdklptrmnbv]?[a-zäöü]*)\b", "Op\1", case = True)
def replace_op(df):
    print("Replacements of 'up' to 'op'", df.nds.str.count(rf"\bup([sdklptrmnbv]?[a-zäöü]*)\b").sum())
    df.nds = df.nds.str.replace(r"\bup([sdklptrmnbv]?[a-zäöü]*)\b", "op\1", case = True)
    
# replace von,van with vun
def replace_vun(df):
    print("Replacements of 'van''von' to 'vun'", df.nds.str.count(r"\bvan|von\b").sum())
    df.nds = df.nds.str.replace(r"\bvan\b|\bvon\b", "vun")

# "sch" before a consonant will be replaced with s 
def replace_s(df):
    print("Replacements of 'sch' to 's'",df.nds.str.count(r"\s(S|s)ch[lmknbwv][a-zäöü]*").sum())
    df.nds = df.nds.str.replace(r"((S|s)ch)([lmknbwvptb])", r"\2\3")

def regex_all(df):
    replace_ik(df)
    replace_uns(df)
    replace_s(df)
    replace_op(df)
    replace_op_cap(df)
    replace_vun(df)   

In [124]:
regex_all(wiki_df)
regex_all(tatoeba_df)

Replacements of 'ick' to 'ik':  0
Replacements of 'us' to 'uns':  0
Replacements of 'sch' to 's' 0
Replacements of 'up' to 'op' 0
Replacements of 'Up' to 'Op' 0
Replacements of 'van''von' to 'vun' 30
Replacements of 'ick' to 'ik':  182
Replacements of 'us' to 'uns':  0
Replacements of 'sch' to 's' 7
Replacements of 'up' to 'op' 3
Replacements of 'Up' to 'Op' 2
Replacements of 'van''von' to 'vun' 7


In [126]:
tatoeba_df.to_csv("preprocessed_data/tatoeba/tatoeba_sentences.csv", index = False)

In [127]:
wiki_df.to_csv("preprocessed_data/fb-wiki/wiki_sentences.csv", index = False)

In [128]:
for i in range(8,9):
    print(i)

8


In [None]:
# general tokenization
def tokenize_nds(text):
    """
    Tokenizes Low German text from a string into a list of strings (tokens)
    """
    text = re.sub(r"([.,\"\-;*:%?!&#])", r" \1", text)
    text = re.split(r"[\s]", text)
    text = [a for a in text if len(a)>0]
    return text
# function which checks according to dictionary and returns the replaced sentences as well as the number of replacements

correction_collection = pd.DataFrame(np.zeros([10000,2]),columns = ["original","correction"])
corr_idx = 0
def correction(text):
    text = tokenize_nds(text)
    global correction_collection
    global corr_idx
    for word in text:

        match = dict_hansen.nds[dict_hansen.alternatives.str.contains(rf"\b{re.escape(word)}\b")].tolist()
        if len(match) == 1 and str(word) not in ignore_str:    
            correction_collection.loc[corr_idx,:] = [word,match[0]]
            corr_idx += 1           
        else:
            pass





In [None]:
wiki_df.nds.sample(5000, random_state=42).apply(correction)

In [None]:
correction_collection.drop(correction_collection[correction_collection.original == 0].index, inplace = True)

In [None]:
correction_collection.sort_values("correction").groupby(["correction","original"]).count().to_csv("dictionairy/possible_corrections.csv")

In [None]:
corr_diff = correction_collection[correction_collection.correction != correction_collection.original]

corr_diff = corr_diff.drop_duplicates()

In [None]:
len(corr_diff)

In [None]:
corr_diff.to_csv("dictionairy/short_replacement.csv", index = False)

In [None]:
word_test = "hör"
corr_test = "qzatsch"
wiki_df.nds.str.count(rf"\b{word_test}\b").sum()

In [None]:
def replace_schrieven(df):
    print("Replacements of 'schreven' to 'schrieven': ",df.nds.str.count(r"\bschreven\b").sum())
    df.nds = df.nds.str.replace(r"\bschreven\b", "schrieven")
replace_schrieven(wiki_df)





def replace(df, word, correction):
    count = df.nds.str.count(rf"\b{word}\b").sum()
    print("Replacements of: " , word , ": " , count)
    df.nds = df.nds.str.replace(rf"\b{word}\b", correction)
    return count
total_count = 0
for i in corr_diff.index:
    word = corr_diff.loc[i,"original"]
    correction = corr_diff.loc[i,"correction"]
    total_count += replace(wiki_df, word, correction)
    print("Total replacements: ",total_count)
    print("--------------------------------")


up mit Konsonant in op mit konsonant
von, van, in vun


In [116]:
# replace "ik" with "ick"
def replace_ik(df):    
    print("Replacements of 'ick' to 'ik': ",df.nds.str.count(r"(I|i)ck").sum())
    df.nds = df.nds.str.replace("(I|i)ck", "\1k")

# replace us with uns
def replace_uns(df):
    print("Replacements of 'us' to 'uns': ", df.nds.str.count("\s(U|u)s\s").sum())
    df.nds = df.nds.str.replace("\s(U|u)s\s", "\1ns")
    
# replace up with op
def replace_op_cap(df):
    print("Replacements of 'Up' to 'Op'", df.nds.str.count(r"\bUp([sdklptrmnbv]?[a-zäöü]*)\b").sum())
    df.nds = df.nds.str.replace(rf"\bUp([sdklptrmnbv]?[a-zäöü]*)\b", "Op\1", case = True)
def replace_op(df):
    print("Replacements of 'up' to 'op'", df.nds.str.count(rf"\bup([sdklptrmnbv]?[a-zäöü]*)\b").sum())
    df.nds = df.nds.str.replace(r"\bup([sdklptrmnbv]?[a-zäöü]*)\b", "op\1", case = True)
    
# replace von,van with vun
def replace_vun(df):
    print("Replacements of 'van''von' to 'vun'", df.nds.str.count(r"\bvan|von\b").sum())
    df.nds = df.nds.str.replace(r"\bvan\b|\bvon\b", "vun")

# "sch" before a consonant will be replaced with s 
def replace_s(df):
    print("Replacements of 'sch' to 's'",df.nds.str.count(r"\s(S|s)ch[lmknbwv][a-zäöü]*").sum())
    df.nds = df.nds.str.replace(r"((S|s)ch)([lmknbwvptb])", r"\2\3")

def regex_all(df):
    replace_ik(df)
    replace_uns(df)
    replace_s(df)
    replace_op(df)
    replace_op_cap(df)
    replace_vun(df)    
    


In [118]:
def regex_all(df):
    replace_ik(df)
    replace_uns(df)
    replace_s(df)
    replace_op(df)
    replace_op_cap(df)
    replace_vun(df)
regex_all(wiki_df)
replace_schrieven(wiki_df)

Replacements of 'ick' to 'ik':  1092
Replacements of 'us' to 'uns':  0
Replacements of 'sch' to 's' 330
Replacements of 'up' to 'op' 632
Replacements of 'Up' to 'Op' 625
Replacements of 'van''von' to 'vun' 30
Replacements of 'schreven' to 'schrieven':  0


In [None]:
df_test = df_all
delete_wrong_enumeration(df_test)




# Generalizing the low german words

Here we try to correct the sentences with the "official" spelling according to [Sass wordbook](http://sass-platt.de/)



In [None]:
import docx
import re

The available dictionary is not ordered. But the German words are written in bold. Further German subwords are written in italic. The translated words are following the bold or the italic words. First we will try to get the index range of each translation.
For that we need the paragraph in which the word is and the range inside the paragraph.


In [None]:
# function to find only the first translation
# easier to understand in combination when used inside the loop
def find_first_translation(translation_id, runs):
        translation = ''
        first_row_of_ones = False
        for idx,i in enumerate(translation_id):
            if i == 1:
                translation += runs[idx].text
                first_row_of_ones = True
            elif i == 0 and first_row_of_ones:
                break
        return translation

In [None]:
def get_word_list(document, word_list, translation_font = "Rockwell"):
    idx_col = 0
    idx_row = 0
    string = ""
    for paragraph in document.paragraphs:
        # grab only the lines which start with a word / other lines are additional content or page numbers
        new_line = re.match(r'\A[A-Za-zÄäÜüÖö]', paragraph.text)
        if new_line:
            # write the word into the column 0 with the format bold and not italic which stands for the basis word
            word_list.iloc[idx_row,0] = string.join([x.text for x in paragraph.runs 
                                                               if x.bold and not x.italic])
            # now we want to grab only the translation without catching additional content with the
            # same font. When the Font changes, it means, that the translation is finished.
            font_list = [1 if x.font.name == translation_font else 0 for x in paragraph.runs]
            word_list.iloc[idx_row,1] = find_first_translation(font_list, paragraph.runs)
            # jump to next row for next entry
            idx_row += 1


        
        

In [None]:
# first we will load the deu-nds word_list

deu_nds_raw = docx.Document("gathering_data/sass/hd-nd-wortliste_absatz_nach_wort.docx")
# prepare a word list only containing one translation
deu_nds_word_list = pd.DataFrame(index=np.arange(20000), columns= ["deu","nds"])

# pass into function
get_word_list(deu_nds_raw, deu_nds_word_list)



In [None]:
# now the nds-deu word_list

nds_deu_raw = docx.Document("gathering_data/sass/nd-hd-wortliste_absatz_nach_wort.docx")
# prepare a word list only containing one translation
nds_deu_word_list = pd.DataFrame(index=np.arange(20000), columns= ["nds","deu"])

# pass into function
get_word_list(nds_deu_raw, nds_deu_word_list)


In [None]:
deu_nds_word_list.head(5)

In [None]:
nds_deu_word_list.head(5)

Having the data now in our pandas dataframe, we can do some regex and clean the words.

In [None]:
nds_deu_word_list[(nds_deu_word_list.deu == "") | (nds_deu_word_list.nds == "")].index

In [None]:
#drop nan and uncomplete rows first

def drop_uncomplete(df):
    df.dropna(inplace=True)
    drop_index = df[(df.deu == "") | (df.nds == "")].index
    df.drop(index=drop_index, inplace=True)

drop_uncomplete(deu_nds_word_list)
print("Entries DEU-NDS version:",len(deu_nds_word_list))
drop_uncomplete(nds_deu_word_list)
print("Entries NDS-DEU version:",len(nds_deu_word_list))

## Making one wordbook

We have now two tables with the reverse translation. We can see already that they have different length. Maybe this is because of the original data or because a the reading of one document was better.
Nevertheless as we have a computer and don't have to turn pages, we need only one table.

After doing some data cleaning we will drop duplicate entries

In [None]:
# lets make one table out of it

nds_deu = nds_deu_word_list.append(deu_nds_word_list, sort=False, ignore_index = True)


In [None]:
# now we know how big our new vocabulary could be.
# still there are many wrong signs included as we can see only in the first 10 entries

string = "Test ~Test WetterW~asser  Abgabe4 1. 2. ,,; ~weg A~weg end/ jetzt!"

def clear_string(string):
    # delete everything after these characters
    string = re.split(r'[\d.,;!?/]',string)[0]
    # delete alternative spelling as there is no easy automatic solution for taking them into the list
    string = re.split(r'(([A-ZÖÜÄa-zäöü]|\s)~)',string)[0]
    return string

clear_string(string)

# apply the cleaning function
nds_deu = nds_deu_word_list.applymap(lambda x: clear_string(str(x)))


In [None]:
# especially in the Low German column we have a lot of -. These are mostly format problems.
# still some belong there. We assume that the minus is right, if it occurs in both columns

def sub_minus(string):
    return re.sub(r"-","",string)


nds_deu.loc[~(nds_deu.nds.str.contains("-{1}") & nds_deu.deu.str.contains("-{1}")),:] = nds_deu.loc[~(nds_deu.nds.str.contains("-{1}") & nds_deu.deu.str.contains("-{1}")),:].applymap(sub_minus)





In [None]:
# delete the words "sik", "wat" in Low German as this information is about the grammar
# same in German with "sich", 

def replace_pre_words(df):
    df.nds = df.nds.str.replace("(wat\s|sik\s)","")
    df.deu = df.deu.str.replace("(was\s|sich\s)","")

    
replace_pre_words(nds_deu)

In [None]:

# drop paranthesis including the content as they are as well an additional explanation we don't want to have
def drop_parenthesis_dash(string):
    string = re.sub(r'\([a-zöäü.~]*\)?',"",string)
    string = re.sub(r'\|',"",string)
    #empty spaces at the beginning or end
    string = re.sub(r'^\s|\s$',"",string)
    string = re.sub(r'^\s|\s$',"",string)
    return string

nds_deu = nds_deu.applymap(drop_parenthesis_dash)

In [None]:
# In some rows there is only one letter or nothing, so a further read in error.
# We will drop these lines
# moreover in Low German there are some words with two characters but as well many read in errors
def drop_read_in_errors(df):
    drop_index = df[(df.deu.str.len() <= 1 )|(df.nds.str.len() <= 1 )].index
    df.drop(index=drop_index,inplace=True)
    drop_different_lengths_one_index = df[(df.nds.str.len() <= 2) & (df.deu.str.len() > 4)].index
    df.drop(index=drop_different_lengths_one_index,inplace=True)
    drop_different_lengths_two_index = df[(df.nds.str.len() <= 3) & (df.deu.str.len() > 7)].index
    df.drop(index=drop_different_lengths_two_index,inplace=True)

    drop_different_lengths_reverse_index = df[(df.deu.str.len() == 2) & (df.nds.str.len() > 4)].index
    df.drop(index=drop_different_lengths_reverse_index,inplace=True)

    # there are as well some read in problems with german article "der"
    # we will insert it later in the additional text
    drop_der = df[df.deu == "der"].index
    df.drop(index=drop_der,inplace=True)
    
    
drop_read_in_errors(nds_deu)


In [None]:
# load in some additonal hand picked cleaned data and append

hand_picked_words = pd.read_csv("gathering_data/sass/additional_words.txt", sep="\t", header=None, names = ["nds","deu"] )
nds_deu = nds_deu.append(hand_picked_words, sort = False)

In [None]:
# and drop only exact duplicates as a Low German word could have several meanings in German and vice versa
print("Entries before dropping duplicates: ", len(nds_deu))
nds_deu.drop_duplicates(inplace=True)
print("Entries after dropping duplicates: ", len(nds_deu))

# as we did some cleaning, maybe we have created some addtional empty elements
# we are using our drop_uncomplete function again

drop_uncomplete(nds_deu)

print("Entries after dropping empty entries: ", len(nds_deu))



In [None]:
# resetting index
nds_deu.reset_index(drop=True, inplace=True)

In [None]:
nds_deu.to_csv("data/nds_deu_dictionary.csv")

# Using the dictionairy

Finally we have the dictionairy and can try to correct the sentences.



#### Now some Markdown cells which should be code, but take too long

###### count how often words from our dictionary occur in the text already
if 'count' not in globals():

    count = [df.nds.str.count(re.escape(text)).sum() for text in nds_deu.nds]



#[nds_deu.nds.str.count(re.escape(sentence)).sum() for sentence in df.nds]

nds_deu["corpus_count"] = count


print("Words not used: ", sum(nds_deu["corpus_count"] == 0))
print("Words used: ", len(nds_deu) - sum(nds_deu["corpus_count"] == 0))

nds_deu.sort_values("corpus_count",ascending=False).head(10)

In [None]:
nds_deu.nds.str.len()[8441]

nds_deu[nds_deu.nds =="ik"]

We don't have a frequency vector for Low German, but it is very closely related to German.
Therefore we can use the German word_frequency for Low German as well.


In [None]:
from wordfreq import word_frequency



In [None]:

def get_frq(text):
    return word_frequency(text, "de")
word_frq = nds_deu.deu.apply(get_frq).rename("frq")


In [None]:
nds_dic = nds_deu.join(word_frq)

nds_dic.dropna( inplace=True)
nds_dic.shape

In [None]:
nds_dic.sort_values("frq", ascending = False).head(5)

nds_dic.isna().sum()
nds_dic[nds_dic.nds == "ik"]
nds_dic[nds_dic.duplicated("nds", keep=False)].head(2)



In [None]:
nds_dic = nds_dic[["nds","deu","frq"]]
nds_dic.reset_index(drop=True, inplace=True)
print("Before removing duplicates: ",len(nds_dic))

nds_dic = nds_dic.sort_values("frq", ascending = True)
nds_dic = nds_dic.drop_duplicates(subset='nds', keep="last")
print("After removing duplicates: ",len(nds_dic))
nds_dic.to_csv("data/nds_deu_count.csv", index=False)
nds_dic.isna().sum()

In [None]:
# to get full integers
nds_dic.frq = nds_dic.frq * 100000000
nds_dic.frq = nds_dic.frq.astype("int")
nds_tojson = nds_dic.set_index("nds")
nds_tojson = nds_tojson.to_dict(orient="dict")["frq"]

In [None]:
import json

with open('data/nds_dic.txt', 'w') as fp:
    json.dump(nds_tojson, fp, ensure_ascii=False, )

Now we are loading the module Spellchecker which can correct words based on our created dictionary.
After that we are going to see how many words from our sentences are in the dictionairy and try to automize the correction.

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker(local_dictionary="data/nds_dic.txt")  # loads nds word frequency list

spell.export("data/example_dict", gzipped=False)

In [None]:
word = "maol"
spell.correction(word)

In [None]:
spell.correction("ick")

In [None]:
spell.known(["ik", "gehe"])

First we will tokenize the words, so we can work with single words.
Probably a simple tokenizer which splits by whitespace and punctuation would be enough, but we will use Spacy, so we will have the chance to get better tokens.

In [None]:

import spacy

spacy_de = spacy.load('de')

In [None]:
spell.known(["ik", "verdenen"])

In [None]:
def extract_signs(series): 
    series = series.str.replace(r"[^a-zäßöü\s]*","")
    return series
def get_spell(series, dic):
    series = extract_signs(series)
    #tokenize
    sent_tok = series.apply(spacy_de.tokenizer)
    # get as list
    complete = sent_tok.apply(lambda x: [a.text for a in x])
    # get unknown and known words
    unknown_words = complete.apply(dic.unknown)
    known_words = complete.apply(dic.known)
    return unknown_words, known_words, complete
def hit_quote(series, dic):
    unknown_words,known_words,complete = get_spell(series,dic)
    total_words = complete.apply(len).sum()
    # each row is saved as spell_object, therefore we have to do the following workaround
    known_count = known_words.apply(lambda x: [1 for i in x]).apply(len).sum()
    unknown_count = unknown_words.apply(lambda x: [1 for i in x]).apply(len).sum()
    print("Total words: ", total_words)
    print("Known words: ", known_count)
    print("Unknown words: ", unknown_count)
    print("Hit quote: ", round(100*known_count/total_words))

hit_quote(df.nds, spell)

# Prepairing sentences

In this section we want to select and preprocess the sentences for our model.

As explained the sentences are not perfectly aligned. To find only high-quality sentences we use our wordbook and see in which we find translations with our wordbook. Actually, this is not the perfect way as we might exclude translations, which can't be translated word by word.
We will evaluate later if we have a gain.

In [None]:
# for this task we will convert all sentences in lowercase

df.nds = df.nds.str.lower()
df.deu = df.deu.str.lower()


unknown_nds , known_nds , _ = get_spell(df.nds,spell)



In [None]:
dic = pd.read_csv("data/nds_deu_count.csv")

dic.nds = dic.nds.str.lower()
dic.deu = dic.deu.str.lower()
dic = dic.sort_values("frq",ascending=False).drop_duplicates(subset="nds")
dic.set_index("nds",inplace=True)


In [None]:
# delete the rows where we don't have any match
def delete_unknown(known_df):

    del_id = known_df[known_df.apply(len) < 1].index
    # in some sets there might be NaN values saved as string
    known_df_list = known_df.apply(lambda x: [a for a in x if a != "nan"])
    print("Rows with no match with dictionary: ", len(del_id) )
    
    known_df_list.drop(del_id, inplace =True)
    return known_df_list
    
known_nds = delete_unknown(known_nds)

In [None]:
def german_word(dic_list):
    strings = ""

    strings = [dic.loc[a,"deu"] for a in dic_list]

    return strings
def frq_per_word(dic_list):
    strings = [dic.loc[a,"frq"] for a in dic_list]
    return strings
translated_words = known_nds.apply(german_word)
translated_frq = known_nds.apply(frq_per_word)

In [None]:
df.deu = extract_signs(df.deu)
transl_df = pd.DataFrame(translated_words)
transl_df = transl_df.join(df.deu)

transl_df.head(10)

In [None]:
matches = pd.DataFrame().astype('object')

for i in transl_df.index:
    searchfor = transl_df.nds[i]
    match_list = ' '
    for a in searchfor:
        
        match = re.findall(re.escape(a),transl_df.deu[i])
        match_list = match_list.join(match)
    matches.loc[i,"matches"] = match_list
        

In [None]:
matches.head(3)

In [None]:
df[df.nds.str.contains(r"[lpgtrda]+aa[tzrsdfgbnm]*")].head(5)


In [None]:
#hit_quote(df.nds, spell)

In [None]:
test_str = "De Stevel is Biebel"

In [None]:
import re

# According to Arbatzat "Basiswortschatz Plattdeutsch" if it is possible, b should be replaced with v
#assumption from my side: if there is a vowel in front and after

#df.nds = df.nds.str.replace(r"([AEOUÄÜÖaeouäöü])b([AEUOaeouäöü])", r"\1v\2")

# by chance I saw that Bibel is spelled wrong a lot of times

#df.nds = df.nds.str.replace(r"Bi(e|)bel", "Bivel")

# replacement of "sch" to "s"

#df.nds = df.nds.str.replace(r"((sch)([lmn]))", r"s\2")
#df.nds = df.nds.str.replace(r"((Sch)([lmn]))", r"S\2")


In [None]:
df.to_csv(path_or_buf="data/data.csv", index=False)

# Filter long sentences

The lenght of the sentences is crucial for a good performance and quality of the NN. For that we look at tokens only as the length of one word is irrevelant since it will be transformed into a token and then into a number.
Therefore we will see,what the optimum

In [None]:
import dtale
import plotly.graph_objects as go
import matplotlib.pyplot as plt


In [None]:
df_all = pd.read_csv("data/data_all.csv")

df_all.head(3)

In [None]:
def get_length(df):
    df_output = df.copy()
    df_output.nds = df_output.nds.str.split(r"[\s.,;:?!-\"\']*")
    df_output.deu = df_output.deu.str.split(r"[\s.,;:?!-\"\']*")
    return df_output.applymap(len)


def plot_length_dis(df):
    df_plot = get_length(df)
    fig = go.Figure()
    fig.add_trace(go.Box(y=df_plot.deu, name = "Hochdeutsch"))
    fig.add_trace(go.Box(y=df_plot.nds, name="Niederdeutsch"))

    fig.show()
    
def get_range(df, start, end):
    df_length = get_length(df)
    df_length = df_length[df_length.nds.ge(start) & df_length.nds.le(end)]
    df_length = df_length[df_length.deu.ge(start) & df_length.deu.le(end)]

    return df.loc[df_length.index,:]

plot_length_dis(df_all)
#get_length(df_all)


It seems that we have a lot of extreme outliers. The median is around 14 the first and third quartile around 5-7 below and above. The upper fence is around 40, so we will have a look at the extreme outliers and see if we can split these sentences. 

In [None]:
# the dtale function takes a while but makes it easier to read the whole sentences.
# You can activate it by deleting the #

df_very_long = df_all[(get_length(df_all).deu > 40) |( get_length(df_all).nds > 41)]
print(len(df_very_long))
#dtale.show(df_very_long)

The above data seem to be accurate pair alignments. Still it will cost a lot of computer power later and will lower even the quality of the output of some models, because the shorter sentences need to be padded, containing a lot of unnecessary information.
Moreover the loss of datapoints is in an acceptable range (Compared to loosing half of the datapoints.)

In [None]:
# For better perfomance these will be dropped

idx_to_drop = df_very_long.index

df_all.drop(idx_to_drop, inplace = True)

plot_length_dis(df_all)


The automaticaly aligned sentences from wikipedia could have a problem of matching for very short sentences. Therefore we will have a look especially at them.
We choose here the all up to 4 words which should be only very short sentences.

In [None]:

df_short = df_all[(get_length(df_all).deu < 8) | (get_length(df_all).nds < 8)]



In [None]:
df_short.iloc[df_short.index > 18000,:].sample(50).sort_index().head(4)

# Selecting the right wikipedia dataset



In [None]:
wiki_complete = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds.tsv.gz",sep="\t+"
                            , engine="python", header= None
                            ,encoding="utf-8", compression="gzip",
                           names = ["threshold","deu","nds"])
wiki_complete.shape

In [None]:
# sentences with values above 1.2 seem to be all wrong

def wiki_selection(df, boundary):
    '''returns a copy of wikipedia dataframe only containing values above boundary'''
    df = df.copy()
    df = df[(df.threshold < 1.2) & (df.threshold > boundary)]
    return df[["deu","nds"]]




In [None]:
wiki_complete[(wiki_complete.threshold < 1.2) & (wiki_complete.threshold > 1.07)]

Maybe you are not into the language, but we have quality problem at least in the last third of the dataset.
Facebook provides a threshold for the certainty that the data is aligned right. We will try to grap a new dataset with a higher threshold.

In [None]:
def read_wiki(threshold):
    '''Reads from the folder fb-wiki-'THRESHOLD' the files from the Facebook LASER project.
    threshold should be a number as string between 1000 and 1010 and folder must be created by hand right now'''
    wiki_threshold_nds = pd.read_csv("data/fb-wiki-" + threshold + "/WikiMatrix.de-nds.txt.nds",
                           sep="\n+",engine='python', encoding="utf-8",
                           header=None, names=["nds"])

    wiki_threshold_deu = pd.read_csv("data/fb-wiki-"+ threshold +"/WikiMatrix.de-nds.txt.de",
                           sep="\n+",engine='python', encoding="utf-8",
                           header=None, names=["deu"])
    return wiki_threshold_deu.join(wiki_threshold_nds)
    

In [None]:
wiki_thre_1055 = read_wiki("1055")
wiki_thre_1070 = read_wiki("1070")

wiki_thre_1055.shape, wiki_thre_1070.shape

In [None]:
get_range(wiki_thre_1055, 0, 5).tail(5)

In [None]:
get_range(wiki_thre_1070, 0, 5).tail(5)

In [None]:
from sklearn.model_selection import train_test_split

def save_train_test_split(df, path):
    train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)
    train_data, valid_data = train_test_split(train_data, test_size=0.1, random_state=42)

    train_data.to_csv(path_or_buf= path + "train_data.csv", index=False)
    valid_data.to_csv(path_or_buf= path + "valid_data.csv", index=False)
    test_data.to_csv(path_or_buf= path + "test_data.csv", index=False)

    print("Numbers of training samples: " , len(train_data))
    print("Number of validation samples: ",len(valid_data))
    print("Number of test samples: ",len(test_data))

# Testing if pretrained model can select right wiki pairings

We will try to use our tatoeba dataset with right aligned sentences for predicting, if the wiki sentences are aligned right.

We will save a wikipedia example and the tatoeba datasets, both preprocessed in the same way.

In [None]:
df_tatoeba.head(2)

In [None]:
#limit up to third quantil for testing
wiki_pairs = get_range(wiki_thre_1070, 0 , 20)
tatoeba_pairs = get_range(df_tatoeba, 0 , 20)

In [None]:
regex_all(wiki_pairs)

In [None]:
regex_all(tatoeba_pairs)

In [None]:
index_range = np.arange(len(tatoeba_pairs),len(wiki_pairs) + len(tatoeba_pairs))
tatoeba_pairs.reset_index(inplace=True, drop=True)

In [None]:
tatoeba_pairs.tail(1)

In [None]:
wiki_pairs.set_index(index_range, inplace = True)

In [None]:
wiki_pairs.to_csv("proof_of_concept/index_test.csv")

In [None]:
wiki_pairs.reset_index(inplace=True,drop=True)

In [None]:
wiki_pairs.to_csv("data/wiki_pairs.tsv", sep="\t", index=False)

In [None]:
save_train_test_split(tatoeba_pairs, "data/tatoeba/")

In [None]:
# with the attention model we calculated the loss on every sentence. The model was trained on the tatoeba dataset
wiki_loss = pd.read_csv("data/wiki_loss.csv")
wiki_eval = wiki_pairs.join(wiki_loss)


In [None]:
wiki_eval.loss.hist()

In [None]:
quantile = wiki_eval.loss.quantile(0.25)
print(quantile)
print(len(wiki_eval) // 4)
wiki_eval[(wiki_eval.loss < quantile)].sort_values("loss", ascending = False)

In [None]:
pd.DataFrame(np.zeros([3,20]), columns = [ str(i) + "_ID" for i in range(20)])


In [None]:

wiki_eval_mean = wiki_eval.groupby(wiki_eval.index // 500).loss.mean()

In [None]:
fig = go.Figure([go.Bar(x=wiki_eval_mean.index, y=wiki_eval_mean )])
fig.show()

In [None]:
def find_enm(text):
    return re.match(r"$\d+\.", text)
wiki_eval[["deu","nds"]].applymap(find_enm)

wiki_eval[wiki_eval.deu.str.endswith(r"\d+\.")]


In [None]:
print("Remaining sentences: ", len(df_all))
plot_length_dis(df_all)

In [None]:
# we can see already that Low German seems to be shorter in general than German.

df_above_third_quartile = get_length(df_all)
df_above_third_quartile = df_above_third_quartile[df_above_third_quartile.deu > 135]
print(len(df_above_third_quartile))
plot_length_dis(df_all.loc[df_above_third_quartile.index,:])

# OLD STUFF

In [None]:

wiki_deu = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds-de.txt",
                       sep="\n+",engine='python', encoding="utf-8",
                       header=None, names=["deu"])



In [None]:
wiki_nds = pd.read_csv("data/fb-wiki/WikiMatrix.de-nds-nds.txt"
                       , sep="\n+",engine='python'
                       , encoding="utf-8", names=["nds"])

In [None]:
wiki_df = wiki_deu.join(wiki_nds)
print(len(wiki_df))
# there are some specific mistakes in the wiki dataset.
def delete_wrong_enumeration(df):
    len_before = len(df)
    drop_index = df[df.nds.str.contains("\d\.") | df.deu.str.contains("\d\.")].index
    df.drop(index=drop_index, inplace = True)
    print("Deleted wrong enumerations: ", len_before - len(df))

delete_wrong_enumeration(wiki_df)

# save all sentences in our df_all
df_all = df_all.append(wiki_df, ignore_index = True)
df_all.to_csv("data/data_all.csv", index=False)
# exclude long sentences for faster training
wiki_df = wiki_df[(wiki_df.deu.str.len() < 70) & (wiki_df.nds.str.len() < 70)]
print(len(wiki_df))


df = df.append(wiki_df, ignore_index=True)