# Practical 6.2

# Word-level Sequence-to-Sequence (Seq2Seq) Model

# Application-1: Machine Translation  

The architecture and objectives are similar to those of the character-level practical, but here we train the translation model on word sequences.

![Image](rnn_word_translation.png?raw=true)

In [3]:
from __future__ import print_function

import os
import sys
import numpy as np
import nltk
import string
from string import punctuation
import re

from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding, GRU, Lambda, Bidirectional, concatenate

Using TensorFlow backend.


In [None]:
import gensim
from gensim.models import Word2Vec

## 1. Data preprocessing

In [0]:
file_to_read = 'nld.txt'

Function to tokenize text into a list of words. Note that we discard all punctuation from the original text.

In [0]:
def tokenizeWords(text):
    # remove all punctuation, then split on whitespace and lowercase each token
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    clean_text = regex.sub('', text)
    tokens = clean_text.split()

    return [t.lower() for t in tokens]
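
As a quick check of what this tokenizer produces, the sketch below re-creates the same punctuation-stripping logic as a standalone function (called `tokenize_words_demo` here purely for illustration) and applies it to two made-up sentences that are not taken from `nld.txt`. Note that apostrophes are removed as well, so contractions collapse into a single token.

```python
import re
import string

def tokenize_words_demo(text):
    """Illustrative copy of tokenizeWords: drop punctuation, split, lowercase."""
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    clean_text = regex.sub('', text)
    return [t.lower() for t in clean_text.split()]

# Made-up example sentences (assumption, not from the dataset)
english_tokens = tokenize_words_demo("I don't know, Tom!")
dutch_tokens = tokenize_words_demo("Ik weet het niet.")
print(english_tokens)  # ['i', 'dont', 'know', 'tom']
print(dutch_tokens)    # ['ik', 'weet', 'het', 'niet']
```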

In [0]:
def indexingVocabulary(array_of_words):
    
    # frequency of word across document corpus
    tf = nltk.FreqDist(array_of_words)
    wordIndex = list(tf.keys())
    
    wordIndex.insert(0,'<pad>')
    wordIndex.append('<start>')
    wordIndex.append('<end>')
    wordIndex.append('<unk>')
    # indexing word vocabulary : pairs of (index,word)
    vocab=dict([(i,wordIndex[i]) for i in range(len(wordIndex))])
    
    return vocab
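
To make the resulting index layout concrete, here is a minimal sketch that mirrors `indexingVocabulary` on a toy word list. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` so that it runs without NLTK; the function name `indexing_vocabulary_demo` and the toy words are assumptions for illustration only.

```python
from collections import Counter

def indexing_vocabulary_demo(array_of_words):
    """Same layout as indexingVocabulary, with Counter standing in for nltk.FreqDist."""
    tf = Counter(array_of_words)
    word_index = list(tf.keys())       # unique words in first-seen order (Python 3.7+)
    word_index.insert(0, '<pad>')      # index 0 is reserved for padding
    word_index.extend(['<start>', '<end>', '<unk>'])
    return {i: w for i, w in enumerate(word_index)}

vocab = indexing_vocabulary_demo(['the', 'cat', 'saw', 'the', 'dog'])
print(vocab)
# {0: '<pad>', 1: 'the', 2: 'cat', 3: 'saw', 4: 'dog', 5: '<start>', 6: '<end>', 7: '<unk>'}
```

Placing `'<pad>'` at index 0 matters later: the training arrays are created with `np.zeros`, so every position that is not filled with a word index already holds the padding index.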

In [0]:
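# Read the text line by line.
# NOTE: local_download_path is not defined in this notebook; it is assumed to
# point to the directory that contains nld.txt.
with open(os.path.join(local_download_path, file_to_read), encoding='utf-8') as f:
    lines = f.read().split('\n')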

### Tokenization and vocabulary indexing

Note that we use only 10,000 samples from the data, because training a Seq2Seq model is computationally expensive (especially in memory).

In [0]:
num_samples = 10000  # Number of samples to train on.

input_str_tokens = []
target_str_tokens = []

ind_start = 10000
# cap the end index at the last non-empty line of the file
ind_end = min(ind_start + num_samples, len(lines) - 1)

for line in lines[ind_start : ind_end]:
    input_text, target_text = line.split('\t')
    # tokenize text from source language (english text)
    input_str_tokens.append(tokenizeWords(input_text))
    # tokenize text from target language (dutch text)
    target_str_tokens.append(tokenizeWords(target_text))

* `en_vocab` stores word index for encoder input sequences (english dictionary)
* `nl_vocab` stores word index for decoder target sequences (dutch dictionary)

In [0]:
# build vocabulary index
input_words = []
target_words = []

for i, tokens in enumerate(input_str_tokens):
    input_words.extend(tokens)
# vocabulary index for english text (input)    
en_vocab = indexingVocabulary(input_words)

for i, tokens in enumerate(target_str_tokens):
    target_words.extend(tokens)
# vocabulary index for dutch text (output)
nl_vocab = indexingVocabulary(target_words)

We also need the reversed lookup index (word to integer) to map text sequences into integer format.

In [0]:
en_reversedvocab = dict((v,k) for (k,v) in en_vocab.items())
nl_reversedvocab = dict((v,k) for (k,v) in nl_vocab.items())
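
The two directions of the lookup can be sketched on a toy vocabulary as follows. The helper names `encode_tokens` and `decode_indices` are hypothetical, and the fallback to `'<unk>'` is an assumption: the notebook indexes the dictionary directly, which works for the training data but raises a `KeyError` for a word that was never seen.

```python
# Toy index-to-word vocabulary with the same layout as en_vocab (assumed values)
toy_vocab = {0: '<pad>', 1: 'i', 2: 'like', 3: 'tea',
             4: '<start>', 5: '<end>', 6: '<unk>'}
toy_reversed = {word: index for index, word in toy_vocab.items()}

def encode_tokens(tokens, word_to_index):
    """Map tokens to integers, sending unseen words to '<unk>' (hypothetical helper)."""
    unk_index = word_to_index['<unk>']
    return [word_to_index.get(token, unk_index) for token in tokens]

def decode_indices(indices, index_to_word):
    """Map integers back to tokens (hypothetical helper)."""
    return [index_to_word[i] for i in indices]

encoded = encode_tokens(['i', 'like', 'coffee'], toy_reversed)
decoded = decode_indices(encoded, toy_vocab)
print(encoded)  # [1, 2, 6]
print(decoded)  # ['i', 'like', '<unk>']
```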

### Preparing training sequences


* `seq_int_input`: input sequences for encoder model 
* `seq_int_target`: input sequences for decoder model

In [0]:
# integer format of sequence input 
seq_int_input = []
for tokens in input_str_tokens:
    int_tokens = [en_reversedvocab[w] for w in tokens]
    seq_int_input.append(int_tokens)

For the decoder's input and output sequences, we add a start marker (`'<start>'`) at the beginning and an end marker (`'<end>'`) at the end of each sequence.

In [0]:
seq_int_target = []
for tokens in target_str_tokens:
    # wrap each target sentence in the start and end markers
    targettext = ['<start>'] + list(tokens) + ['<end>']

    int_tokens = [nl_reversedvocab[w] for w in targettext]
    seq_int_target.append(int_tokens)
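
Why both markers are needed becomes clearer when the wrapped sequence is split into what the decoder reads and what it should predict. The sketch below assumes the usual teacher-forcing setup for Seq2Seq training, which the notebook leaves to the exercise cells; the helper `wrap_target` and the example sentence are illustrative only.

```python
def wrap_target(tokens):
    """Add the start and end markers around a tokenized target sentence."""
    return ['<start>'] + list(tokens) + ['<end>']

wrapped = wrap_target(['ik', 'weet', 'het', 'niet'])

# At step t the decoder reads wrapped[t] and should predict wrapped[t + 1]
decoder_reads = wrapped[:-1]
decoder_predicts = wrapped[1:]
pairs = list(zip(decoder_reads, decoder_predicts))
for read, predict in pairs:
    print(read, '->', predict)
```

The first pair starts from `'<start>'`, which gives the decoder something to read before any word has been produced, and the last pair ends in `'<end>'`, which teaches the decoder when to stop.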

## 2. Word embedding

* At the character level, input and output sequences are one-hot vectors stored in 3D numpy arrays of shape (number of samples, sequence length, vocabulary size). 
* At the word level, the input sequences are integers in a 2D array of shape (number of samples, sequence length). Instead of one-hot encoding each word, we use an embedding layer to project each word in a sequence to its embedding vector.
* We train Word2Vec (skip-gram) on our text to provide initial weights for the embedding layer (a pretrained word embedding may also be used).
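
The difference between the two representations can be sketched with plain numpy. The sizes and the random matrix below are toy assumptions, not the notebook's values; the point is only that looking up rows of an embedding matrix by integer index gives the same result as multiplying one-hot vectors by that matrix, without ever building the large one-hot array.

```python
import numpy as np

vocab_size, embedding_dim = 6, 4          # toy sizes, not the notebook's
rng = np.random.RandomState(0)
embedding_matrix = rng.rand(vocab_size, embedding_dim).astype('float32')

# One padded batch of integer sequences: shape (n_samples, seq_len)
int_sequences = np.array([[1, 3, 2], [4, 5, 0]])

# Embedding lookup is plain row indexing -> (n_samples, seq_len, embedding_dim)
embedded = embedding_matrix[int_sequences]

# Equivalent one-hot route: (n_samples, seq_len, vocab_size) @ (vocab_size, embedding_dim)
one_hot = np.eye(vocab_size, dtype='float32')[int_sequences]
embedded_via_one_hot = one_hot @ embedding_matrix

print(int_sequences.shape, one_hot.shape, embedded.shape)
same_result = np.allclose(embedded, embedded_via_one_hot)
print(same_result)
```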

In [0]:
# for english text
# skipgram model with hierarchical softmax and negative sampling
# (gensim < 4.0 argument names; gensim >= 4.0 renamed size to vector_size and iter to epochs)
word2vec_model_en = Word2Vec(size=256, min_count=0, window=5, sg=1,
                             hs=1, negative=5, iter=100)

In [0]:
word2vec_model_en.build_vocab(input_str_tokens)
word2vec_vocab_en = dict([(v.index,k) for k, v in word2vec_model_en.wv.vocab.items()]) 
revert_w2v_vocab_en = dict((v,k) for (k,v) in word2vec_vocab_en.items())

In [0]:
# for dutch text
# skipgram model with hierarchical softmax and negative sampling
# (gensim < 4.0 argument names; gensim >= 4.0 renamed size to vector_size and iter to epochs)
word2vec_model_nl = Word2Vec(size=256, min_count=0, window=5, sg=1,
                             hs=1, negative=5, iter=100)

In [0]:
word2vec_model_nl.build_vocab(target_str_tokens)
word2vec_vocab_nl = dict([(v.index,k) for k, v in word2vec_model_nl.wv.vocab.items()]) 
revert_w2v_vocab_nl = dict((v,k) for (k,v) in word2vec_vocab_nl.items())

In [30]:
print('Training word2vec model...')

# for english text
# number of tokens
n_tokens = sum([len(seq) for seq in input_str_tokens])
# number of sentences/documents
n_examples = len(input_str_tokens)
word2vec_model_en.train(input_str_tokens, total_words=n_tokens, 
                        total_examples=n_examples, epochs=100)

Training word2vec model...


(4257792, 5946200)

In [31]:
# for dutch text
# number of tokens
n_tokens = sum([len(seq) for seq in target_str_tokens])
# number of sentences/documents
n_examples = len(target_str_tokens)
word2vec_model_nl.train(target_str_tokens, total_words=n_tokens, 
                        total_examples=n_examples, epochs=100)

(4297470, 6123900)

The following variables store our word embedding learnt from word2vec skipgram.

In [None]:
# the resulting learnt word embedding 
# (wv.syn0 is the legacy attribute name; newer gensim versions expose the same matrix as wv.vectors)
# for input text sequence (english language)
word2vec_we_en = word2vec_model_en.wv.syn0

# for target text sequence (dutch language)
word2vec_we_nl = word2vec_model_nl.wv.syn0

Notice that the vocabulary learnt by word2vec is smaller than our own vocabulary, because we added four extra tokens: `'<pad>'`, `'<start>'`, `'<end>'`, `'<unk>'`.

In [33]:
word2vec_we_en.shape

(4163, 256)

In [34]:
word2vec_we_nl.shape

(5280, 256)

In [0]:
embedding_en = np.zeros(shape=(len(en_vocab), 256), dtype='float32')
embedding_nl = np.zeros(shape=(len(nl_vocab), 256), dtype='float32')

In [41]:
embedding_en.shape

(4167, 256)

In [42]:
embedding_nl.shape

(5284, 256)

In [0]:
# for input sequences (text in english language)
for i, w in en_vocab.items():
    # tokens unknown to word2vec ('<pad>', '<start>', '<end>', '<unk>') keep the default zero vector
    if w not in revert_w2v_vocab_en:
        continue
    embedding_en[i, :] = word2vec_we_en[revert_w2v_vocab_en[w], :]

In [0]:
# for target output sequences (text in dutch language)
for i, w in nl_vocab.items():
    # tokens unknown to word2vec ('<pad>', '<start>', '<end>', '<unk>') keep the default zero vector
    if w not in revert_w2v_vocab_nl:
        continue
    embedding_nl[i, :] = word2vec_we_nl[revert_w2v_vocab_nl[w], :]

## 3. Word-based Translation model

In [0]:
batch_size = 100  # Batch size for training.
epochs = 100  # Number of epochs to train for.
rnn_dim = 256  # Latent dimensionality of the encoding space.

## 3.1. Encoder model

* For this model, set the embedding layer's parameters to be trainable. 
* You may also try an embedding layer without initialization from the pretrained Word2Vec weights.

In [0]:
# YOUR CODE HERE

## 3.2. Decoder model

In [0]:
# YOUR CODE HERE

In [0]:
model = Model([encoder_inputs, decoder_inputs], prediction_outputs)

In [0]:
# Compile & run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

In [50]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
decoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
embedding_encoder (Embedding)   (None, None, 256)    1066752     encoder_inputs[0][0]             
__________________________________________________________________________________________________
embedding_decoder (Embedding)   (None, None, 256)    1352704     decoder_inputs[0][0]             
__________________________________________________________________________________________________
lstm_encod

## Training translation model

In [0]:
max_encoder_seq_length = max([len(sequences) for sequences in seq_int_input])
max_decoder_seq_length = max([len(sequences) for sequences in seq_int_target])

In [0]:
encoder_input_data = np.zeros((len(seq_int_input), max_encoder_seq_length), dtype='float32')

In [0]:
decoder_input_data = np.zeros((len(seq_int_target), max_decoder_seq_length), dtype='float32')

### Important

Be aware that the decoder's target output needs to be in categorical (one-hot) format, because the model is compiled with the `categorical_crossentropy` loss.
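
One way to build such a categorical target is sketched below on toy data. It follows the common Seq2Seq convention in which the target at each step is the decoder input one step ahead; the variable names, the toy sequences, and the toy vocabulary size are assumptions for illustration and do not come from the exercise solution.

```python
import numpy as np

# Toy stand-ins for seq_int_target and len(nl_vocab)
toy_seq_int_target = [[4, 1, 2, 5], [4, 3, 5]]   # 4 = '<start>', 5 = '<end>', 0 = '<pad>'
toy_vocab_size = 7
max_len = max(len(seq) for seq in toy_seq_int_target)

decoder_target_data = np.zeros(
    (len(toy_seq_int_target), max_len, toy_vocab_size), dtype='float32')

for i, seq in enumerate(toy_seq_int_target):
    for t, word_index in enumerate(seq):
        if t > 0:
            # the target at step t - 1 is the token the decoder reads at step t
            decoder_target_data[i, t - 1, word_index] = 1.0

print(decoder_target_data.shape)                # (2, 4, 7)
print(decoder_target_data[0].argmax(axis=-1))   # [1 2 5 0]
```

Positions after `'<end>'` stay all-zero in this sketch, so the trailing `0` in the printed row is an artifact of `argmax` on an empty row rather than an explicit padding target.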

### Padding input sequences for encoder and decoder

In [0]:
for i, seq_int in enumerate(seq_int_input):
    for j, word_index in enumerate(seq_int):
        encoder_input_data[i][j] = word_index

In [0]:
for i, seq_int in enumerate(seq_int_target):
    for j, word_index in enumerate(seq_int):
        decoder_input_data[i][j] = word_index

### Fitting sequences into model

### Important

Note: creating the full 3D numpy array of one-hot decoder outputs may raise a `MemoryError`, since its shape is (number of samples, sequence length, vocabulary size). 
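
A rough size estimate shows why, and a batch-wise generator is one possible way around it. In the sketch below, the 10,000 samples and the 5,284 Dutch vocabulary entries come from this notebook, while the maximum decoder length of 20 is an assumption; `one_hot_bytes` and `batch_generator` are hypothetical helpers, not part of the exercise solution.

```python
import numpy as np

def one_hot_bytes(n_samples, seq_len, vocab_size, bytes_per_value=4):
    """Size in bytes of a dense float32 one-hot array."""
    return n_samples * seq_len * vocab_size * bytes_per_value

# seq_len=20 is an assumed maximum decoder length
full_size = one_hot_bytes(10000, 20, 5284)
print(full_size / 1e9, 'GB')   # about 4.2 GB

def batch_generator(encoder_input, decoder_input, vocab_size, batch_size):
    """Yield ([encoder_batch, decoder_batch], one_hot_target_batch) forever."""
    n_samples = len(encoder_input)
    while True:
        for start in range(0, n_samples, batch_size):
            enc = encoder_input[start:start + batch_size]
            dec = decoder_input[start:start + batch_size].astype('int64')
            target = np.zeros(dec.shape + (vocab_size,), dtype='float32')
            # the target at step t is the decoder input at step t + 1
            shifted = dec[:, 1:]
            rows, steps = np.nonzero(shifted)   # skip padding (index 0)
            target[rows, steps, shifted[rows, steps]] = 1.0
            yield [enc, dec], target

# Toy padded arrays (assumed values) to show the shapes of one batch
toy_enc = np.array([[1, 2, 0], [3, 0, 0], [2, 1, 3]], dtype='float32')
toy_dec = np.array([[4, 1, 5, 0], [4, 2, 3, 5], [4, 1, 5, 0]], dtype='float32')
inputs, targets = next(batch_generator(toy_enc, toy_dec, vocab_size=6, batch_size=2))
print(targets.shape)   # (2, 4, 6)
```

With this approach only one batch of one-hot targets exists in memory at a time. A generator of this kind could then be passed to Keras' generator-based training (`fit_generator` in older Keras releases, `fit` in recent ones) together with `steps_per_epoch`.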


In [0]:
# YOUR CODE HERE

In [None]:
# Save model
model.save('rnn_word_lstm_translation.h5')

In [0]:
model.save_weights('weights_rnn_word_lstm_translation.hdf5')

## 4. Inference mode

## 4.1. Re-define encoder model

In [0]:
encoder_model = Model(encoder_inputs, encoder_states)

In [0]:
encoder_model.save('encoder_word_lstm_translation.h5')

## 4.2. Re-define decoder model to do the inference

In [0]:
# YOUR CODE HERE

In [None]:
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

decoder_model.save('decoder_word_lstm_translation.h5')

## 4.3. Translate sentence

In [None]:
# YOUR CODE HERE
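
As a starting point for this step, the control flow of greedy decoding can be sketched independently of Keras. The function `greedy_decode` and the callable `decode_step` are hypothetical: with the models defined above, the initial states would come from `encoder_model.predict` and `decode_step` would wrap `decoder_model.predict`. The toy decoder and toy vocabulary below only exercise the loop and do not translate anything.

```python
def greedy_decode(initial_states, decode_step, start_id, end_id, max_len):
    """Generate token ids one at a time, always taking the most probable one.

    decode_step(token_id, states) -> (probabilities, new_states) is a
    hypothetical callable standing in for one decoder prediction step.
    """
    token, states, output = start_id, initial_states, []
    for _ in range(max_len):
        probs, states = decode_step(token, states)
        token = max(range(len(probs)), key=lambda k: probs[k])   # argmax
        if token == end_id:
            break
        output.append(token)
    return output

# Toy vocabulary and decoder (assumptions): always predict "current id + 1"
toy_index_to_word = {0: '<pad>', 1: '<start>', 2: 'ik', 3: 'weet', 4: '<end>'}

def toy_decode_step(token_id, states):
    probs = [0.0] * len(toy_index_to_word)
    probs[min(token_id + 1, 4)] = 1.0
    return probs, states

ids = greedy_decode(None, toy_decode_step, start_id=1, end_id=4, max_len=10)
translation = ' '.join(toy_index_to_word[i] for i in ids)
print(translation)  # ik weet
```

The `max_len` bound guards against a decoder that never emits `'<end>'`; the maximum decoder sequence length computed during training is a natural choice for it.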