<h3> Overview </h3>

- 1) Load and pre-process EuroParl data (as dataframe)
- 2) Load and pre-process ManyThings data (as dataframe)
- 3) Combine the two dataframes & process sentences
- 4) Create German and English corpuses
- 5) Create German and English tokenizers 
- 6) Save df_complete.csv
- 7) Create df_complete_30 (sentence pairs capped at 30 words) and save df_complete_30.csv



In [1]:
import pandas as pd
import numpy as np

In [2]:
from tokenizer_helpers import preprocess_sentence, create_train_tokenizer, load_tokenizer

<h3> 1) EuroParl Data: Load & Pre-Process </h3>

In [8]:
# load doc into memory
def load_doc(filename):
    # open the file read-only as UTF-8 text; the context manager closes it
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()

# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

In [9]:
filename_en = 'data/europarl_de-en/europarl-v7.de-en.en'
text_en = load_doc(filename_en)
doc_en = to_sentences(text_en)

filename_de = 'data/europarl_de-en/europarl-v7.de-en.de'
text_de = load_doc(filename_de)
doc_de = to_sentences(text_de)
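
The two Europarl files are paired line by line, and `zip` silently truncates to the shorter input, so it may be worth confirming that both sentence lists have the same length before combining them. A minimal sketch on toy lists; the helper name `check_aligned` is hypothetical and not part of this notebook:

```python
def check_aligned(sentences_a, sentences_b):
    """Raise ValueError if two line-aligned sentence lists differ in length."""
    if len(sentences_a) != len(sentences_b):
        raise ValueError(
            f"line counts differ: {len(sentences_a)} vs {len(sentences_b)}"
        )
    return len(sentences_a)

# toy stand-ins for doc_de and doc_en
toy_de = ["Guten Morgen", "", "Danke"]
toy_en = ["Good morning", "", "Thank you"]
n_pairs = check_aligned(toy_de, toy_en)
```

On the real data the call would be `check_aligned(doc_de, doc_en)`.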

In [10]:
both_languages = zip(doc_de, doc_en)
df_euro_parl = pd.DataFrame(both_languages, columns = ['german', 'english'])

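# replace empty strings with NaN
# (assign the result back; inplace=True on a selected column triggers a warning in recent pandas)
df_euro_parl['german'] = df_euro_parl['german'].replace('', np.nan)
df_euro_parl['english'] = df_euro_parl['english'].replace('', np.nan)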

# check number of NaN values
df_euro_parl['german'].isna().sum(), df_euro_parl['english'].isna().sum()

(2923, 8366)
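
The counts above mean that a sentence pair is dropped as soon as either side is empty. A small sketch of the same replace-then-drop pattern on toy data (the sentences are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy sentence pairs; an empty string marks a missing translation
toy = pd.DataFrame(
    {
        "german": ["Guten Morgen", "", "Danke"],
        "english": ["Good morning", "Hello", ""],
    }
)

# turn empty strings into NaN so that dropna() can find them
toy = toy.replace("", np.nan)
n_missing = (int(toy["german"].isna().sum()), int(toy["english"].isna().sum()))

# keep only the pairs where both sides are present
toy_clean = toy.dropna()
```

Here one row is missing on each side, so only the first pair survives.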

In [11]:
# drop NaN values
df_euro_parl = df_euro_parl.dropna()

<h3> 2) ManyThings Data: Load & Pre-Process </h3>

In [12]:
df_ManyThings = pd.read_table('data/deu-eng/deu.txt', names=['eng', 'deu', 'attr'])
df_ManyThings = df_ManyThings.drop('attr',axis = 1).rename(columns = {'eng':'english', 'deu':'german'})

# switch order of columns
df_ManyThings = df_ManyThings[['german', 'english']]

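# replace empty strings with NaN
# (assign the result back; inplace=True on a selected column triggers a warning in recent pandas)
df_ManyThings['german'] = df_ManyThings['german'].replace('', np.nan)
df_ManyThings['english'] = df_ManyThings['english'].replace('', np.nan)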

In [13]:
# check number of NaN values
df_ManyThings['german'].isna().sum(), df_ManyThings['english'].isna().sum()

(0, 0)
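
For readers without the ManyThings file at hand, the loading steps can be reproduced on an in-memory stand-in. The sketch assumes the file has three tab-separated fields per line (English, German, attribution), as the column names above suggest; the two sample lines are invented:

```python
import io
import pandas as pd

# toy stand-in for deu.txt: English, German, attribution, separated by tabs
toy_file = io.StringIO(
    "Hi.\tHallo!\tattribution 1\n"
    "Thank you.\tDanke.\tattribution 2\n"
)

toy = pd.read_table(toy_file, names=["eng", "deu", "attr"])
toy = toy.drop(columns="attr").rename(columns={"eng": "english", "deu": "german"})
toy = toy[["german", "english"]]  # same column order as the Europarl dataframe
```

On the real file, sentences that contain double quotes may need `quoting=csv.QUOTE_NONE` in `read_table`; that is an untested assumption about the data, not something this notebook checks.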

<h3> 3) Combine the two dataframes + further pre-process </h3>

In [14]:
df_complete = pd.concat([df_euro_parl, df_ManyThings], axis = 0)
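
Note that `pd.concat` keeps each input's original row labels, so the combined dataframe contains duplicate index values. That is harmless here because the later cells use positional access (`iloc`) and save without the index, but passing `ignore_index=True` would renumber the rows if label-based access were needed. A toy illustration:

```python
import pandas as pd

toy_a = pd.DataFrame({"german": ["Hallo"], "english": ["Hello"]})
toy_b = pd.DataFrame({"german": ["Danke"], "english": ["Thanks"]})

# default: row labels of both inputs are kept, giving 0 and 0
stacked = pd.concat([toy_a, toy_b], axis=0)

# ignore_index=True: rows are renumbered, giving 0 and 1
renumbered = pd.concat([toy_a, toy_b], axis=0, ignore_index=True)
```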

In [15]:
# preprocess data 
# Here I'm using my "preprocess_sentence" function, imported from "tokenizer_helpers"
df_complete['german'] = df_complete['german'].apply(preprocess_sentence)
df_complete['english'] = df_complete['english'].apply(preprocess_sentence)

run time: 57 secs
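
The source of `preprocess_sentence` is not shown in this notebook. Judging only from the outputs printed here (for example "moechte" and "grosse", and collapsed runs of spaces), it appears to transliterate German umlauts and normalize whitespace. The following is a hypothetical stand-in written under that assumption; the real helper may do more or behave differently:

```python
import re

# transliteration table inferred from the notebook's outputs, not from the real helper
UMLAUT_MAP = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}
)

def preprocess_sentence_sketch(sentence):
    """Hypothetical stand-in: transliterate umlauts, then collapse whitespace."""
    sentence = sentence.translate(UMLAUT_MAP)
    return re.sub(r"\s+", " ", sentence).strip()

example = preprocess_sentence_sketch("In diesem      Sinne   , möchte ich Sie bitten")
```

The sketch leaves the space before the comma untouched, which matches the separate ',' token that appears in the tokenizer test further down.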

In [16]:
# check some random examples

for i in range(100, 102):
    print(df_complete.iloc[i, 0])
    print(df_complete.iloc[i, 1])

Wir haben Vertrauen zur Kommission, zu Romano Prodi, und die ganz grosse Mehrheit unserer Fraktion hat Romano Prodi und der Kommission nach einem schwierigen Prozess, wie jeder weiss, das Vertrauen ausgesprochen.
We have confidence in the Commission and in Romano Prodi and, after a difficult procedure, as everyone knows, the vast majority of our Group supported the vote of confidence in Romano Prodi and the Commission.
Aber wir sind auch der Meinung, dass wir eine Debatte ueber diese Strategie der Kommission in einem geordneten Verfahren fuehren muessen, nicht nur aufgrund einer muendlichen Erklaerung hier im Europaeischen Parlament, sondern auch aufgrund eines Dokumentes, das in der Kommission beschlossen ist und dieses Programm fuer fuenf Jahre beschreibt.
We believe, however, that the Commission ' s strategic plan needs to be debated within a proper procedural framework, not only on the basis of an oral statement here in the European Parliament, but also on the basis of a document w

<h3> 4) Create English and German corpuses </h3>

- Create one plain-text corpus each for English and German to train the BPE tokenizers on

In [17]:
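# write one sentence per line; set the encoding explicitly so the result does not depend on the platform default
with open('corpus/en_corpus.txt', 'w', encoding='utf-8') as f:
    for idx, row in df_complete.iterrows():
        f.write(row['english'] + ' \n')

with open('corpus/de_corpus.txt', 'w', encoding='utf-8') as f:
    for idx, row in df_complete.iterrows():
        f.write(row['german'] + ' \n')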

- run time: 1m 16 s
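
Most of that run time probably comes from `iterrows`, which builds a Series object for every row. Iterating over a single column directly is usually much faster and produces the same file. A sketch on toy data, writing to a temporary directory; the helper name `write_corpus` is hypothetical:

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"german": ["Hallo", "Danke"], "english": ["Hello", "Thanks"]})

def write_corpus(sentences, path):
    """Write one sentence per line, each followed by a space and a newline."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(sentence + " \n" for sentence in sentences)

corpus_dir = tempfile.mkdtemp()  # throwaway directory for the demo
en_path = os.path.join(corpus_dir, "en_corpus.txt")
write_corpus(toy["english"], en_path)

# read the file back to check its contents
with open(en_path, encoding="utf-8") as f:
    written = f.read()
```

On the real data the calls would be `write_corpus(df_complete['english'], 'corpus/en_corpus.txt')` and the German equivalent.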

<h3> 5) Create German and English tokenizers based on the corpuses </h3>

In [3]:
# Create tokenizers 
# Here I'm using my "create_train_tokenizer" function from "tokenizer_helpers"
en_tokenizer = create_train_tokenizer(['corpus/en_corpus.txt'])
de_tokenizer = create_train_tokenizer(['corpus/de_corpus.txt'])




json file saved at: tokenizers/tokenizer_en_corpus.json



json file saved at: tokenizers/tokenizer_de_corpus.json
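
The internals of `create_train_tokenizer` are not shown here. For intuition only, the core of BPE training is to repeatedly merge the most frequent adjacent symbol pair in the corpus. The toy sketch below works on characters of whole words and is not the helper's implementation; the function name `learn_bpe_merges` is hypothetical:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (character level)."""
    # each word starts as a tuple of single characters
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # most frequent pair; ties are broken by the pair itself for determinism
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
```

With this toy corpus the two learned merges are ('o', 'w') followed by ('l', 'ow'), so "low" becomes a single symbol after two steps.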


In [4]:
# test tokenizers
# Here I'm using my "load_tokenizer" function imported from "tokenizer_helpers"

# load tokenizers
en_tokenizer, en_word_index = load_tokenizer('tokenizers/tokenizer_en_corpus.json')
de_tokenizer, de_word_index = load_tokenizer('tokenizers/tokenizer_de_corpus.json')

# random German sentence 
sentence = "In diesem      Sinne   , möchte ich Sie bitten, nur auf kurzerer Zeit eine Mail zu schicken."
# pre-process
sentence = preprocess_sentence(sentence)

# encode using my loaded tokenizer
output = de_tokenizer.encode(sentence)

# view tokens and ids
output.tokens, output.ids


(['start_',
  'In',
  'Ġdiesem',
  'ĠSinne',
  'Ġ,',
  'Ġmoechte',
  'Ġich',
  'ĠSie',
  'Ġbitten',
  ',',
  'Ġnur',
  'Ġauf',
  'Ġkurz',
  'erer',
  'ĠZeit',
  'Ġeine',
  'ĠMa',
  'il',
  'Ġzu',
  'Ġschicken',
  '.',
  '_end'],
 [1,
  761,
  630,
  3086,
  2126,
  585,
  406,
  589,
  3201,
  14,
  633,
  359,
  1876,
  8181,
  842,
  390,
  977,
  472,
  300,
  12817,
  16,
  2])
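
The 'Ġ' prefix in the tokens above is how byte-level BPE vocabularies in the GPT-2 style represent a leading space, which suggests (but does not prove) that the helper builds a byte-level tokenizer. Under that assumption, the text can be reassembled from the tokens by hand:

```python
# first few tokens from the output above, without the start_/_end markers
tokens = ["In", "Ġdiesem", "ĠSinne", "Ġ,", "Ġmoechte", "Ġich"]

# in byte-level BPE vocabularies, "Ġ" stands for a leading space
text = "".join(tokens).replace("Ġ", " ")
```

Tokens without the prefix, such as 'erer' after 'Ġkurz' or 'il' after 'ĠMa', continue the previous word, which is how rarer words get split into subword units.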

<h3> 6) Save df_complete as csv file </h3>

In [20]:
df_complete.to_csv('data/df_complete.csv', index = False)

In [21]:
df_complete = pd.read_csv('data/df_complete.csv')

In [22]:
df_complete.isna().sum()

german     2
english    2
dtype: int64

In [23]:
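# After the CSV round trip there are two NaN values per column, probably because
# read_csv parses empty fields and strings such as 'NA' or 'null' as missing.
# We simply drop them.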
df_complete = df_complete.dropna()
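
One plausible explanation for these NaN values is that `pd.read_csv` converts empty fields and a default list of strings (including 'NA' and 'null') to NaN. Whether that is what happened to these particular rows is an assumption; the toy round trip below only demonstrates the mechanism:

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"german": ["Hallo", "null", ""], "english": ["Hello", "NA", "Thanks"]}
)
csv_text = toy.to_csv(index=False)

# default parsing: '', 'NA' and 'null' all come back as NaN
default_nans = pd.read_csv(io.StringIO(csv_text)).isna().sum().tolist()

# keep_default_na=False reads every field back as a plain string
strict = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
strict_nans = strict.isna().sum().tolist()
```

With `keep_default_na=False` the sentences survive the round trip unchanged, at the cost of having to handle empty strings explicitly.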

In [24]:
# overwrite the file saved above with the cleaned dataframe
df_complete.to_csv('data/df_complete.csv', index = False)

<h3> 7) Create dataframe with capped sentence length </h3>

In [25]:
df_complete_30 = df_complete.copy(deep = True)
df_complete_30['german_length'] = df_complete['german'].apply(lambda x: len(x.split(' ')))
df_complete_30['english_length'] = df_complete['english'].apply(lambda x: len(x.split(' ')))

In [26]:
df_complete_30.head()

,german,english,german_length,english_length
0,Wiederaufnahme der Sitzungsperiode,Resumption of the session,3,4
1,"Ich erklaere die am Freitag, dem 17. Dezember ...",I declare resumed the session of the European ...,29,38
2,"Wie Sie feststellen konnten, ist der gefuercht...","Although, as you will have seen, the dreaded '...",22,31
3,Im Parlament besteht der Wunsch nach einer Aus...,You have requested a debate on this subject in...,16,19
4,Heute moechte ich Sie bitten - das ist auch de...,"In the meantime, I should like to observe a mi...",33,40


In [27]:
mask_30 = (df_complete_30['german_length'] <= 30) & (df_complete_30['english_length'] <= 30)
df_complete_30 = df_complete_30[mask_30]

In [28]:
len(df_complete), len(df_complete_30)

(2160638, 1562141)
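
The cap keeps 1,562,141 of 2,160,638 pairs, roughly 72%. The filter can be tried on toy data as follows; the cap of 4 is chosen only so that the toy example drops a row, and `word_count` is a hypothetical helper that mirrors the lambda above:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "german": ["Hallo", "Ich moechte Sie bitten"],
        "english": ["Hello", "I would like to ask you"],
    }
)

def word_count(sentence):
    # same counting rule as the notebook: split on single spaces
    return len(sentence.split(" "))

max_len = 4  # hypothetical cap for the toy data; the notebook uses 30
lengths = toy.apply(lambda col: col.map(word_count))

# keep a pair only if both sides are within the cap
mask = (lengths["german"] <= max_len) & (lengths["english"] <= max_len)
toy_capped = toy[mask]
```

These lengths count space-separated words, not BPE tokens, so a sentence within the cap can still encode to more token ids than the cap.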

In [29]:
df_complete_30.to_csv('data/df_complete_30.csv', index = False)