This sequence-to-sequence experiment aims to learn a joint representation that maps a document in one language (e.g. English) to a document in another language (e.g. Dutch). For this first stage of the experiment, we use a parallel corpus from Europarl. The sample available in data/europarl_*_test is the first paragraph of the original Europarl text, split out for the purpose of this tutorial.

The main code for running on raw Europarl data is available in scripts/train_bi_europarl.py (please pay attention to the folder structure described on this page).

In [1]:
from __future__ import print_function
import os
import sys
import numpy as np

In [2]:
from scripts.text_preprocessing import *

In [3]:
from scripts.language_models import *

The data available here have already been preprocessed. The preprocessing code is in text_preprocessing.py, and the details of the preprocessing stage, together with a tutorial, are available in 'preprocessing.ipynb'.

An example of the pre-processed Europarl data follows:

In [4]:
vocab = readPickle('data/europarl_en_nl_test/vocabulary')

The vocabulary is a Python dictionary with 2 keys:

[0] for the English ('en') vocabulary, which is the look-up dictionary for the input.

[1] for the Dutch ('nl') vocabulary, which is the dictionary for the text output of the sequence-to-sequence model.

In [5]:
vocab.keys()

[0, 1]

In [6]:
vocab[0].items()[:5]

[(0, 'zero'), (1, 'represent'), (2, 'all'), (3, 'plenary'), (4, 'month')]

To access a word by its index:

In [7]:
vocab[0][0]

'zero'

The following is the vocabulary for Dutch (nl):

In [8]:
vocab[1].items()[:5]

[(0, 'zero'), (1, 'mensen'), (2, 'doel'), (3, 'Hervatting'), (4, 'inderdaad')]

The two dictionaries are not aligned with each other: the same index points to unrelated words in English and Dutch. So this is not a translation dictionary, but a separate index-to-word look-up for each language.
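
To make the structure concrete, here is a minimal sketch with a toy vocabulary. The name `toy_vocab` and the inverted table `en_word_to_index` are hypothetical and only for illustration; the first entries mirror the outputs shown above. Note that in Python 3 the result of `.items()` cannot be sliced directly, so it is wrapped in `list()` first.

```python
# Toy stand-in for the pickled vocabulary (hypothetical, for illustration only).
toy_vocab = {
    0: {0: 'zero', 1: 'represent', 2: 'all'},  # English: index -> word
    1: {0: 'zero', 1: 'mensen', 2: 'doel'},    # Dutch: index -> word
}

# Python 3: dict views are not sliceable, so convert to a list before slicing.
first_two_en = list(toy_vocab[0].items())[:2]

# Invert one language's table to obtain a word -> index look-up.
en_word_to_index = {word: idx for idx, word in toy_vocab[0].items()}

print(first_two_en)
print(en_word_to_index['all'])
```

The same index (for example 1) gives 'represent' in English and 'mensen' in Dutch, which are not translations of each other.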

In [9]:
len(vocab[0])

461

Since the data here is a sample of the original document, the English vocabulary is relatively small (461 words), while the Dutch vocabulary contains 483 words.

In [10]:
len(vocab[1])

483

In [11]:
documents = readPickle('data/europarl_en_nl_test/documents')

In [12]:
documents.keys()

['en', 'nl']

Each language in this sample corpus consists of only a single document. The English document contains 1197 words in total.

In [13]:
len(documents['en'][0])

1197

The sequence of words in each document has been encoded numerically: every word is replaced by its index in the vocabulary of its language.

In [14]:
documents['en'][0][:5]

[398, 92, 355, 240, 347]

To convert the document back into text, you can use indexToWords(vocab,numSentence) from scripts/text_preprocessing.py

In [15]:
enTxt = indexToWords(vocab[0],documents['en'][0])

In [16]:
enTxt[:5]

['Resumption', 'of', 'the', 'session', 'I']
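
Conceptually, decoding is a plain look-up of each index in the vocabulary. The sketch below illustrates that idea; `index_to_words_sketch` is a hypothetical name, and the real indexToWords in scripts/text_preprocessing.py may be implemented differently. The five index-to-word pairs are taken from the outputs above.

```python
def index_to_words_sketch(index_to_word, encoded_doc):
    """Map every integer in encoded_doc to its word (hypothetical helper)."""
    return [index_to_word[i] for i in encoded_doc]

# Only the five entries visible in the outputs above; the real table has 461.
toy_index_to_word = {398: 'Resumption', 92: 'of', 355: 'the', 240: 'session', 347: 'I'}

decoded = index_to_words_sketch(toy_index_to_word, [398, 92, 355, 240, 347])
print(decoded)
```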

The parallel Dutch document consists of 1189 words.

In [17]:
len(documents['nl'][0])

1189

In [18]:
documents['nl'][0][:5]

[3, 434, 284, 447, 364]

In [19]:
nlTxt = indexToWords(vocab[1],documents['nl'][0])

In [20]:
nlTxt[:5]

['Hervatting', 'van', 'de', 'zitting', 'Ik']

The file scripts/train_bi_europarl.py can also be run on data that has not been preprocessed (e.g. Europarl data downloaded directly from the corresponding website), but to use this code you need to place the files in the following folder structure:

```
main_data_path
│   
└───subfolder (if any)
    │   
    └───en
    │      europarl.en 
    │
    └───nl
           europarl.nl
        
```
You can then specify the folder location (line 36 of train_bi_europarl.py) as:

PATH = 'main_data_path/subfolder'

For instance:

PATH = 'data/multilingual/europarl'

With this layout, the code can generate the Python dictionary for the language pair.
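
Before running the script on raw data, it can help to check that the layout matches the tree above. The following is a minimal sketch, not part of the repository: `europarl_files` and `missing_files` are hypothetical helpers, and the file names are assumed to follow the `europarl.<lang>` pattern shown in the tree.

```python
import os

def europarl_files(path, languages=('en', 'nl')):
    """Return the expected corpus file for each language under `path`."""
    return {lang: os.path.join(path, lang, 'europarl.' + lang)
            for lang in languages}

def missing_files(path, languages=('en', 'nl')):
    """List the expected corpus files that do not exist on disk."""
    expected = europarl_files(path, languages)
    return [p for p in expected.values() if not os.path.isfile(p)]

PATH = 'data/multilingual/europarl'
for path in missing_files(PATH):
    print('[WARN] missing corpus file: ' + path)
```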



Otherwise, you can use the stored pre-processed data and disregard/comment out lines 41-47 in the main function of train_bi_europarl.py

In [21]:
X_vocab_len = len(vocab[0])
y_vocab_len = len(vocab[1])

Line 54 splits the document's word sequence into sentences. What we have here, however, is the encoded version of the document (a numeric sequence instead of a word sequence), so we first convert it back to words with another function in text_preprocessing.py

sentToWordsBi()

as follows:

In [22]:
worddocs = sentToWordsBi(documents,vocab)   

The result is a document represented as a sequence of words, as shown in the next cell. We will split this tokenized document into an array of sentences by using splitSentences() in function getSentencesClass(). Basically, this function splits a document into sentences at sentence-ending punctuation ('.', '?', '!').

In [23]:
worddocs['en'][0][:5]

['Resumption', 'of', 'the', 'session', 'I']
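
The splitting idea can be sketched as follows. This is an illustration only: `split_sentences_sketch` is a hypothetical function, it assumes punctuation marks are separate tokens, and it keeps the terminator at the end of each sentence. The real splitSentences() may handle these details differently.

```python
def split_sentences_sketch(tokens, terminators=('.', '?', '!')):
    """Group a flat token list into sentences, closing one at each terminator."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in terminators:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack a final terminator
        sentences.append(current)
    return sentences

toy_tokens = ['Resumption', 'of', 'the', 'session', '.',
              'Is', 'it', 'resumed', '?', 'Yes']
toy_sentences = split_sentences_sketch(toy_tokens)
print(len(toy_sentences))
```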

From here, you can just follow the code in scripts/train_bi_europarl.py from line 54

In [24]:
sentences = getSentencesClass(worddocs)

In [25]:
len(sentences[0][0]) # number of sentences in doc-id[0] of language-id[0]

47

In [26]:
numSentences = sentToNumBi(sentences,vocab)
nSentences, nWords, minSent, maxSent, sumSent, avgSent, minWords, maxWords, sumWords, avgWords = getStatClass(numSentences)
X_max_len = maxWords[0][0]
y_max_len = maxWords[1][0]

from keras.preprocessing.sequence import pad_sequences

print('[INFO] Zero padding...')

X = pad_sequences(numSentences[0][0], maxlen=X_max_len, dtype='int32')
y = pad_sequences(numSentences[1][0], maxlen=y_max_len, dtype='int32')
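
By default, pad_sequences pads with zeros at the start of each sequence and, for sequences longer than maxlen, drops items from the start (padding='pre', truncating='pre'). The NumPy sketch below mimics that default behaviour so the effect is easy to see on toy data; `pad_left_sketch` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def pad_left_sketch(sequences, maxlen):
    """Left-pad with zeros; over-long sequences keep their last maxlen items."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]
        if trimmed:  # leave the row as zeros for an empty sequence
            out[row, -len(trimmed):] = trimmed
    return out

padded = pad_left_sketch([[5, 6], [7, 8, 9, 10]], maxlen=3)
print(padded)
```

Every row now has the same length, which lets the sentences be stacked into a single integer matrix for the model.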

In [27]:
EMBEDDING_DIM = 200
HIDDEN_DIM = 200
LAYER_NUM = 3
BATCH_SIZE = 100
NB_EPOCH = 20
MODE = 'train'

In [None]:
print('[INFO] Compiling model...')
model = seqEncoderDecoder(X_vocab_len, X_max_len, y_vocab_len, y_max_len, EMBEDDING_DIM, HIDDEN_DIM, LAYER_NUM)

In this tutorial I skip the steps that check for stored weights, load them if available, and store weights while training the model.

In [None]:
_N = X_max_len  # number of sentences per training chunk
_start = 1
for k in range(_start, NB_EPOCH+1):
    # X and y must be reordered identically so sentence pairs stay aligned
    xShuffled = shuffleSentences(X)
    yShuffled = shuffleSentences(y)
    
    for i in range(0, len(xShuffled), _N):
        # The last chunk may be shorter than _N
        i_end = min(i + _N, len(xShuffled))
        
        # Vectorize only the current chunk of targets to limit memory use
        yEncoded = sentenceMatrixVectorization(yShuffled[i:i_end], y_max_len, y_vocab_len)
        print('[INFO] Training model: epoch {}th {}/{} samples'.format(k, i, len(X)))
        model.fit(xShuffled[i:i_end], yEncoded, batch_size=BATCH_SIZE, nb_epoch=1, verbose=2)
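
Two details of this loop are worth illustrating. First, if shuffleSentences draws a new random order on each call (an assumption; its implementation is not shown here), shuffling X and y separately would break the sentence pairing, and one way to avoid that is to apply a single permutation to both arrays. Second, the targets are presumably turned into one-hot vectors over the output vocabulary. The sketch below shows both ideas with hypothetical helpers `shuffle_pairs_sketch` and `one_hot_sketch`. It assumes X and y have the same number of rows and that padding positions (index 0) are encoded as class 0, which may differ from sentenceMatrixVectorization.

```python
import numpy as np

def shuffle_pairs_sketch(X, y, seed=None):
    """Shuffle X and y with the same permutation so row pairs stay aligned."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    return X[order], y[order]

def one_hot_sketch(batch, max_len, vocab_len):
    """Encode a (n, max_len) index matrix as a (n, max_len, vocab_len) tensor."""
    encoded = np.zeros((len(batch), max_len, vocab_len), dtype='float32')
    for i, sentence in enumerate(batch):
        for j, word_index in enumerate(sentence):
            encoded[i, j, word_index] = 1.0
    return encoded

# Toy padded data: three sentence pairs, target vocabulary of four indices.
X_toy = np.array([[0, 1, 2], [0, 0, 3], [4, 5, 6]])
y_toy = np.array([[0, 1], [0, 2], [3, 1]])

X_s, y_s = shuffle_pairs_sketch(X_toy, y_toy, seed=0)
y_encoded = one_hot_sketch(y_s, max_len=2, vocab_len=4)
print(y_encoded.shape)
```

Encoding one chunk at a time, as the loop above does, keeps the one-hot tensor small: its size grows with chunk size × y_max_len × y_vocab_len.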
        
        