<h2> Notebook Overview </h2>

- 1) Load the data from the txt file
- 2) Pre-process sentences and split data into source / target arrays
- 3) Create tokenizers for source and target languages using Keras' Tokenizer module
- 4) Save tokenizers as json files
- 5) Split data into training and test data: source_train_tensor, source_test_tensor, target_train_tensor, target_test_tensor
- 6) Save training and test data (numpy arrays) as CSV files
- 7) Create embedding matrices for source and target languages
- 8) Save embedding matrices


In [18]:
import pandas as pd
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
import io
import json
from sklearn.model_selection import train_test_split
import os
import numpy as np

# Import from my own module
from preprocessing import preprocess_sentence



<h3> 1) Load data from txt file </h3>

In [19]:
df_en_de = pd.read_table('deu-eng/deu.txt', names=['eng', 'deu', 'attr'])
df_en_de = df_en_de.drop('attr',axis = 1).rename(columns = {'eng':'english', 'deu':'german'})

<h3> 2) Pre-process sentences and divide data into source / target </h3>

In [20]:
# shuffle all rows; frac=1.0 keeps every sentence pair and sample() returns a new DataFrame
pairs = df_en_de.sample(frac = 1.00)
pairs['german'] = pairs['german'].apply(preprocess_sentence)
pairs['english'] = pairs['english'].apply(preprocess_sentence)

source = pairs['german']
target = pairs['english']
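
- `preprocess_sentence` comes from the author's own module and is not shown in this notebook. The sketch below is only a guess at what such a function might do (lowercasing, spacing out punctuation, adding start/end markers); the name `preprocess_sentence_sketch` and every step in it are assumptions, not the actual implementation. If the real function does add `<start>` / `<end>` markers, that would explain why the tokenizers later use `filters=''`: Keras' default filters include `<` and `>`.

```python
import re
import unicodedata


def preprocess_sentence_sketch(sentence: str) -> str:
    """Hypothetical stand-in for preprocess_sentence; the real module may differ."""
    # Normalise Unicode so characters such as umlauts use one composed form.
    sentence = unicodedata.normalize("NFC", sentence.lower().strip())
    # Put spaces around punctuation so each mark becomes its own token.
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    # Collapse the runs of whitespace left behind by the previous step.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    # Wrap the sentence in start/end markers (an assumption about the real function).
    return f"<start> {sentence} <end>"


example = preprocess_sentence_sketch("  Hello,   world! ")
print(example)
```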

In [32]:
len(pairs)

251720

<h3> 3) Create word-based tokenizers + tokenize the sentences </h3>

In [21]:
# create tokenizer for source language
source_sentence_tokenizer = Tokenizer(filters='')
# fit to source data
source_sentence_tokenizer.fit_on_texts(source)
# create tensor for source language -- every row is sequence of integers
source_tensor = source_sentence_tokenizer.texts_to_sequences(source)
# add zero padding to each sequence
source_tensor = tf.keras.preprocessing.sequence.pad_sequences(source_tensor, padding='post' )

# repeat for target language
target_sentence_tokenizer = Tokenizer(filters='')
target_sentence_tokenizer.fit_on_texts(target)
target_tensor = target_sentence_tokenizer.texts_to_sequences(target)
target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, padding='post' )
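
- To make the cell above easier to follow, here is a rough pure-Python imitation of the three steps (fit, convert to sequences, post-pad). It is a simplified sketch, not Keras' code: the helper names are made up for illustration, and it assumes whitespace splitting, no filtering, and no out-of-vocabulary token.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [value] * (max_len - len(seq)) for seq in sequences]


texts = ["<start> ich bin müde <end>", "<start> hallo <end>"]
word_index = build_word_index(texts)
padded = pad_post(texts_to_sequences(texts, word_index))
print(word_index)
print(padded)
```

- Index 0 never appears in `word_index`, so it is free to act as the padding value.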

<h3> 4) Save tokenizers </h3>

- save tokenizers as json files

In [22]:
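# to_json() already returns a JSON string; json.dumps() below encodes it once more,
# so read the file back with json.load() before passing the result to tokenizer_from_json()
source_sentence_tokenizer_json = source_sentence_tokenizer.to_json()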
with io.open('tokenizers/source_sentence_tokenizer.json', 'w', encoding = 'utf-8') as f:
    f.write(json.dumps(source_sentence_tokenizer_json, ensure_ascii = False))

target_sentence_tokenizer_json = target_sentence_tokenizer.to_json()
with io.open('tokenizers/target_sentence_tokenizer.json', 'w', encoding = 'utf-8') as f:
    f.write(json.dumps(target_sentence_tokenizer_json, ensure_ascii = False))
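
- Because `to_json()` returns a string and the cell wraps it in `json.dumps()`, the file holds a JSON-encoded JSON string. The sketch below shows the round trip with the standard library only; the small dictionary is a made-up stand-in for a real tokenizer config. In Keras versions that still ship `tokenizer_from_json`, the string recovered by the first decode is what that function expects.

```python
import json

# Stand-in for Tokenizer.to_json(), which already returns a JSON string.
tokenizer_json = json.dumps({"class_name": "Tokenizer", "config": {"filters": ""}})

# What the cell above writes to disk: the JSON string encoded once more.
file_contents = json.dumps(tokenizer_json, ensure_ascii=False)

# The first decode (json.load on the file) gives back the original string ...
restored_json = json.loads(file_contents)
# ... and only a second decode gives the underlying dictionary.
restored_config = json.loads(restored_json)

print(type(restored_json).__name__, type(restored_config).__name__)
```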

- Create word-to-index and index-to-word mappings for source and target languages

In [23]:
source_word_index = source_sentence_tokenizer.word_index
target_word_index = target_sentence_tokenizer.word_index

source_index_word = source_sentence_tokenizer.index_word
target_index_word = target_sentence_tokenizer.index_word
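
- The index-to-word mapping is what turns a padded row of integers back into text, for example when inspecting model output. A minimal sketch, assuming a toy `index_word` dictionary (the values here are invented) and a hypothetical helper `sequence_to_text`:

```python
# Toy mapping in the same shape as Tokenizer.index_word (values are illustrative).
index_word = {1: "<start>", 2: "<end>", 3: "ich", 4: "bin", 5: "müde"}


def sequence_to_text(sequence, index_word):
    """Join the words of a sequence, skipping the 0 used for padding."""
    return " ".join(index_word[i] for i in sequence if i != 0)


decoded = sequence_to_text([1, 3, 4, 5, 2, 0, 0], index_word)
print(decoded)
```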

<h3> 5) Split data into train + test sets </h3>

- split data into train and test sets

In [24]:
# Split into train and test sets
source_train_tensor, source_test_tensor, target_train_tensor, target_test_tensor = train_test_split(
    source_tensor, target_tensor, test_size=0.2
)
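
- The split above is not seeded, so each run produces different train and test sets; `train_test_split` accepts a `random_state` argument for repeatable results. The sketch below illustrates the underlying idea with the standard library: one shuffled index list is applied to both arrays so the sentence pairs stay aligned. `paired_split` is a hypothetical helper, not scikit-learn's implementation, and its rounding of the test size may differ.

```python
import random


def paired_split(source_rows, target_rows, test_size=0.2, seed=42):
    """Shuffle two aligned lists with one permutation, then split them."""
    if len(source_rows) != len(target_rows):
        raise ValueError("source and target must have the same length")
    indices = list(range(len(source_rows)))
    # A seeded generator makes the shuffle repeatable across runs.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    # Same return order as train_test_split: X_train, X_test, y_train, y_test.
    return (pick(source_rows, train_idx), pick(source_rows, test_idx),
            pick(target_rows, train_idx), pick(target_rows, test_idx))


src = list(range(10))
tgt = [i * 10 for i in src]
src_train, src_test, tgt_train, tgt_test = paired_split(src, tgt)
print(len(src_train), len(src_test))
```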

<h3> 6) Save train and test sets (numpy arrays) as CSV files </h3>

- save numpy arrays as csv files

In [25]:
# save numpy arrays as csv files; fmt = '%d' writes the token ids as integers
# (the default format is floating-point, e.g. 1.000000000000000000e+00)

np.savetxt('tensors/source_train_tensor.csv', source_train_tensor, delimiter = ',', fmt = '%d')
np.savetxt('tensors/source_test_tensor.csv', source_test_tensor, delimiter = ',', fmt = '%d')
np.savetxt('tensors/target_train_tensor.csv', target_train_tensor, delimiter = ',', fmt = '%d')
np.savetxt('tensors/target_test_tensor.csv', target_test_tensor, delimiter = ',', fmt = '%d')
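
- `np.savetxt` formats values as `'%.18e'` unless `fmt` is given, which inflates the files and stores token ids as floats. A small round-trip sketch with `fmt='%d'`, using an in-memory buffer and a toy array (both are illustrative assumptions, not the notebook's data):

```python
import io

import numpy as np

# Toy padded token ids standing in for one of the tensors.
token_ids = np.array([[1, 3, 4, 5, 2], [1, 6, 2, 0, 0]], dtype=np.int32)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
np.savetxt(buffer, token_ids, delimiter=",", fmt="%d")
csv_text = buffer.getvalue()

# Integer text can be read back directly as an integer array.
restored = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.int32)
print(csv_text)
```

- For arrays of this size, the binary `np.save` / `np.load` pair is another option that preserves dtype and shape without any text formatting.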

<h3> 7) Create embedding matrices for source and target languages </h3>

- import German and English pipelines from spaCy 

In [26]:
# note: spaCy's small ("sm") pipelines ship without static word vectors, unlike "md" / "lg"
#!python -m spacy download de_core_news_sm
import de_core_news_sm

#!python -m spacy download en_core_web_lg
import en_core_web_lg

- load models

In [27]:
nlp_source = de_core_news_sm.load()
nlp_target = en_core_web_lg.load()

- define "vocab_len_source" and "vocab_len_target" variables

In [28]:
vocab_len_source = len(source_word_index)
vocab_len_target = len(target_word_index)
print(vocab_len_source, vocab_len_target)

37347 17409


- add 1 to each vocabulary size: the tokenizer's word indices start at 1, and index 0 is reserved for zero padding

In [29]:
num_tokens_source = vocab_len_source + 1
num_tokens_target = vocab_len_target + 1

- Create embedding matrices

In [30]:
# source language embedding dimensions
embedding_dim_source = len(nlp_source('Der').vector)
# initialise embedding matrix for source language
# number of rows = number of tokens in source language
embedding_matrix_source = np.zeros((num_tokens_source, embedding_dim_source))
# for every word in source language
for word, i in source_word_index.items():
    # retrieve embedding vector
    embedding_vector = nlp_source(word).vector
    # .vector always returns an array (never None), so this check always passes;
    # a word's row is all-zeros only if spaCy returns a zero vector for it
    if embedding_vector is not None:
        # insert embedding vector into row of embedding matrix
        embedding_matrix_source[i] = embedding_vector

# target language embedding dimensions
embedding_dim_target = len(nlp_target('The').vector)
# initialise embedding matrix for target language
embedding_matrix_target = np.zeros((num_tokens_target, embedding_dim_target))
for word, i in target_word_index.items():
    embedding_vector = nlp_target(word).vector
    if embedding_vector is not None:
        # a word's row is all-zeros only if spaCy returns a zero vector for it
        embedding_matrix_target[i] = embedding_vector
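
- Since the `is not None` check cannot tell which words actually received a useful vector, it can help to measure how many rows of the matrix are non-zero. The sketch below rebuilds a tiny embedding matrix from a hypothetical lookup table (`toy_vectors` stands in for the spaCy pipeline; all values are invented) and computes that coverage.

```python
import numpy as np

# Hypothetical toy data standing in for the tokenizer vocabulary and spaCy vectors.
word_index = {"<start>": 1, "<end>": 2, "hallo": 3, "welt": 4}
toy_vectors = {
    "hallo": np.array([0.1, 0.2, 0.3]),
    "welt": np.array([0.4, 0.5, 0.6]),
}
embedding_dim = 3

# One extra row because index 0 is reserved for padding.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = toy_vectors.get(word)
    # Words missing from the lookup keep their all-zero row.
    if vector is not None:
        embedding_matrix[i] = vector

# Fraction of vocabulary words whose row holds a non-zero vector.
coverage = np.count_nonzero(embedding_matrix.any(axis=1)) / len(word_index)
print(embedding_matrix.shape, coverage)
```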

- run time for entire dataset: 2m 33s

<h3> 8) Save embedding matrices </h3>

In [31]:
# save embedding matrices (numpy arrays) as csv files:
np.savetxt('embeddings/embedding_matrix_source.csv', embedding_matrix_source, delimiter = ',')
np.savetxt('embeddings/embedding_matrix_target.csv', embedding_matrix_target, delimiter = ',')

In [17]:
# load embedding matrices (the values are floats, so read them with a float dtype)
embedding_matrix_source = np.loadtxt('embeddings/embedding_matrix_source.csv', delimiter = ',', dtype = 'float32')
embedding_matrix_target = np.loadtxt('embeddings/embedding_matrix_target.csv', delimiter = ',', dtype = 'float32')