# Home 5: Build a seq2seq model for machine translation.

### Name: [Ian Gomez]

### Task: Translate English to [Swedish]

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translation **English** to **German** is not acceptable!!! Try another pair of languages.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the two. By doing both correctly, you will get up to 1 bonus score to the total.

    * Bi-LSTM instead of LSTM.
        
    * Attention. (You are allowed to use existing code.)
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 1 bonus scores to the total.
    
5. Convert the notebook to .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your Google Drive, Dropbox, or Github repo.  (If you submit the file to Google Drive or Dropbox, you must make the file "open-access". The delay caused by "deny of access" may result in late penalty.)

7. Submit the link to the HTML file to Canvas.    


### Hint: 

To implement ```Bi-LSTM```, you will need the following code to build the encoder. Do NOT use Bi-LSTM for the decoder.

In [3]:
# from keras.layers import Bidirectional, Concatenate

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

### 1.1. Load and clean text


In [4]:
import re
import string
from unicodedata import normalize
import numpy

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars form each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)

#### Fill the following blanks:

In [5]:
# e.g., filename = 'Data/deu.txt'
filename = 'Data/swe.txt'

# e.g., n_train = 20000
n_train = 11353

In [6]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

In [7]:
for i in range(3000, 3010):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[this is the end] => [det har ar slutet]
[this is unusual] => [det har ar ovanligt]
[throw it to tom] => [kasta den till tom]
[to err is human] => [att fela ar manskligt]
[tom cut himself] => [tom skar sig]
[tom didnt mind] => [tom hade inget emot det]
[tom exaggerated] => [tom overdrev]
[tom got dressed] => [tom kladde pa sig]
[tom got engaged] => [tom forlovade sig]
[tom got excited] => [tom blev ivrig]


In [8]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(input_texts.shape))

Length of input_texts:  (11353,)
Length of target_texts: (11353,)


In [9]:
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 23
max length of target sentences: 51


**Remark:** To this end, you have two lists of sentences: input_texts and target_texts

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.

In [10]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# encode and pad sequences
def text2sequences(max_len, lines, indices=None):
    tokenizer = Tokenizer(char_level=True, filters='')
    if indices:
        tokenizer.word_index = indices
    else:
        tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)
    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
    return seqs_pad, tokenizer.word_index

encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)

print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))
print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (11353, 23)
shape of input_token_index: 27
shape of decoder_input_seq: (11353, 51)
shape of target_token_index: 29


In [11]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 30


**Remark:** To this end, the input language and target language texts are converted to 2 matrices. 

- Their number of rows are both n_train.
- Their number of columns are respective max_encoder_seq_length and max_decoder_seq_length.

The followings print a sentence and its representation as a sequence.

In [12]:
target_texts[100]

'\tlat det vara\n'

In [13]:
decoder_input_seq[100, :]

array([ 7, 13,  2,  3,  1, 10,  5,  3,  1, 18,  2,  4,  2,  8,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
      dtype=int32)

## 2.2. One-hot encode

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by a $n\times t$ matrix after the tokenization and zero-padding.
- It is represented by a $n\times t \times v$ tensor ($t$ is the number of unique chars) after the one-hot encoding.

In [14]:
from keras.utils import to_categorical

# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(11353, 23, 28)
(11353, 51, 30)


## 3. Build the networks (for training)

- Build encoder, decoder, and connect the two modules to get "model". 

- Fit the model on the bilingual data to train the parameters in the encoder and decoder.

### 3.1. Encoder network

- Input:  one-hot encode of the input language

- Return: 

    -- output (all the hidden states   $h_1, \cdots , h_t$) are always discarded
    
    -- the final hidden state  $h_t$
    
    -- the final conveyor belt $c_t$

In [15]:
from keras.layers import Input, LSTM
from keras.layers import Bidirectional, Concatenate
from keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# set the LSTM layer

encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')

Print a summary and save the encoder network structure to "./encoder.pdf"

In [16]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(encoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=encoder_model, show_shapes=False,
    to_file='encoder.pdf'
)

encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, 512), (None, 583680      encoder_inputs[0][0]             
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 512)          0           bidirectional[0][1]              
                                                                 bidirectional[0][3]              
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional[0][2]        

### 3.2. Decoder network

- Inputs:  

    -- one-hot encode of the target language
    
    -- The initial hidden state $h_t$ 
    
    -- The initial conveyor belt $c_t$ 

- Return: 

    -- output (all the hidden states) $h_1, \cdots , h_t$

    -- the final hidden state  $h_t$ (discarded in the training and used in the prediction)
    
    -- the final conveyor belt $c_t$ (discarded in the training and used in the prediction)

In [17]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(2*latent_dim, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

Print a summary and save the encoder network structure to "./decoder.pdf"

In [18]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(decoder_model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=decoder_model, show_shapes=False,
    to_file='decoder.pdf'
)

decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_input_x (InputLayer)    [(None, None, 30)]   0                                            
__________________________________________________________________________________________________
decoder_input_h (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_input_c (InputLayer)    [(None, 512)]        0                                            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1112064     decoder_input_x[0][0]            
                                                                 decoder_input_h[0][0]      

### 3.3. Connect the encoder and decoder

In [19]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')

In [20]:
print(state_h)
print(decoder_input_h)

KerasTensor(type_spec=TensorSpec(shape=(None, 512), dtype=tf.float32, name=None), name='decoder_lstm/PartitionedCall:2', description="created by layer 'decoder_lstm'")
KerasTensor(type_spec=TensorSpec(shape=(None, 512), dtype=tf.float32, name='decoder_input_h'), name='decoder_input_h', description="created by layer 'decoder_input_h'")


In [21]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot, plot_model

SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

plot_model(
    model=model, show_shapes=False,
    to_file='model_training.pdf'
)

model.summary()

Model: "model_training"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input_x (InputLayer)    [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
decoder_input_x (InputLayer)    [(None, None, 30)]   0                                            
__________________________________________________________________________________________________
encoder (Functional)            [(None, 512), (None, 583680      encoder_input_x[0][0]            
__________________________________________________________________________________________________
decoder_lstm (LSTM)             [(None, None, 512),  1112064     decoder_input_x[0][0]            
                                                                 encoder[0][0]       

### 3.5. Fit the model on the bilingual dataset

- encoder_input_data: one-hot encode of the input language

- decoder_input_data: one-hot encode of the input language

- decoder_target_data: labels (left shift of decoder_input_data)

- tune the hyper-parameters

- stop when the validation loss stop decreasing.

In [22]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(11353, 23, 28)
shape of decoder_input_data(11353, 51, 30)
shape of decoder_target_data(11353, 51, 30)


In [23]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=100, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

## 4. Make predictions


### 4.1. Translate English to XXX

1. Encoder read a sentence (source language) and output its final states, $h_t$ and $c_t$.
2. Take the [star] sign "\t" and the final state $h_t$ and $c_t$ as input and run the decoder.
3. Get the new states and predicted probability distribution.
4. sample a char from the predicted probability distribution
5. take the sampled char and the new states as input and repeat the process (stop if reach the [stop] sign "\n").

In [24]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [25]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)

    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # this line of code is greedy selection
        # try to use multinomial sampling instead (with temperature)
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])
        if(sampled_token_index == 0):
            sampled_token_index = numpy.argsort(output_tokens[0, -1, :])[1]
            
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]

    return decoded_sentence


In [26]:
for seq_index in range(2100, 2120):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('English:       ', input_texts[seq_index])
    print('German (true): ', target_texts[seq_index][1:-1])
    print('German (pred): ', decoded_sentence[0:-1])


-
English:        thats a start
German (true):  det ar en borjan
German (pred):  det ar ett bisstag
-
English:        thats my line
German (true):  det ar min replik
German (pred):  det ar min cdsliva
-
English:        thats my seat
German (true):  det ar min plats
German (pred):  det ar min brokiv
-
English:        thats my wife
German (true):  det ar min fru
German (pred):  det ar min cdsliva
-
English:        thats plastic
German (true):  det dar ar plast
German (pred):  det ar oviktigt
-
English:        thats serious
German (true):  det ar allvarligt
German (pred):  det ar orotigt
-
English:        thats suicide
German (true):  det ar sjalvmord
German (pred):  det ar orojligt
-
English:        thats treason
German (true):  det ar forraderi
German (pred):  det ar sant
-
English:        thats unlucky
German (true):  det var oturligt
German (pred):  det ar orojligt
-
English:        the radio died
German (true):  radion dog
German (pred):  radinn ar dod
-
English:        the radio die

### 4.2. Translate an English sentence to the target language

1. Tokenization
2. One-hot encode
3. Translate

In [27]:
input_sentence = 'I love you'

input_sequence,_ = text2sequences(max_encoder_seq_length, [input_sentence], indices=input_token_index)
input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)
translated_sentence = decode_sequence(input_x)

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)

source sentence is: I love you
translated sentence is: jag alskar dig



## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


**Hint:** 

- Randomly partition the dataset to training, validation, and test. 

- Evaluate the BLEU score using the test set. Report the average.

- A reasonable BLEU score should be 0.1 ~ 0.5.

In [28]:
import numpy as np
n_train = 11353
n_val = 3784
n_test = 3784
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)
all_clean_pairs = clean_data(pairs)
print(all_clean_pairs.shape)
shuffle_inds = np.arange(len(all_clean_pairs))
np.random.shuffle(shuffle_inds)
print(shuffle_inds)
all_clean_pairs = all_clean_pairs[shuffle_inds]
print(all_clean_pairs.shape)
# clean sentences
clean_pairs = all_clean_pairs[0:n_train, :]
print(clean_pairs.shape)
# encode and pad sequences

encoder_input_seq, input_token_index = text2sequences(max_encoder_seq_length, 
                                                      input_texts)
decoder_input_seq, target_token_index = text2sequences(max_decoder_seq_length, 
                                                       target_texts)

num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq, 
                                    max_decoder_seq_length, 
                                    num_decoder_tokens)

(18923, 3)
[ 6808 10591  9586 ... 14595 18646 16862]
(18923, 3)
(11353, 3)


In [34]:
from keras.layers import Input, LSTM
from keras.layers import Bidirectional, Concatenate
from keras.models import Model
# Reset weights for training on new shuffled dataset
latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens), 
                       name='encoder_inputs')

# set the LSTM layer

encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs, 
                      outputs=[state_h, state_c],
                      name='encoder')

from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(2*latent_dim, return_sequences=True, 
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x], 
              outputs=decoder_pred, 
              name='model_training')
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=20, validation_split=0.2)

model.save('seq2seq.h5')

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [35]:
from nltk.translate.bleu_score import sentence_bleu

import numpy as np
# Evaluate the model on the validation set prior to using the final test set
# Get the pairs needed for the validation set and parse the data
val_pairs = all_clean_pairs[n_train:n_train+n_val, :]
val_inputs = val_pairs[:, 0]
val_true = [sentence.split(" ") for sentence in val_pairs[:, 1]]
val_input_seqs,_ = text2sequences(max_encoder_seq_length, val_inputs, indices=input_token_index)
val_input_x = onehot_encode(val_input_seqs, max_encoder_seq_length, num_encoder_tokens)
# Translate validation sentences
val_translations = []
for i, x in enumerate(val_input_x):
    decoded_sentence = decode_sequence(x.reshape(1,*x.shape)).split(" ")
    if decoded_sentence[-1] == '\n':
      decoded_sentence = decoded_sentence[:-1]
    val_translations.append(decoded_sentence)
    print(f"{i}/{len(val_input_x)}", end="")
    print('\r')
print(len(val_translations))
print(len(val_true))


0/3784
1/3784
2/3784
3/3784
4/3784
5/3784
6/3784
7/3784
8/3784
9/3784
10/3784
11/3784
12/3784
13/3784
14/3784
15/3784
16/3784
17/3784
18/3784
19/3784
20/3784
21/3784
22/3784
23/3784
24/3784
25/3784
26/3784
27/3784
28/3784
29/3784
30/3784
31/3784
32/3784
33/3784
34/3784
35/3784
36/3784
37/3784
38/3784
39/3784
40/3784
41/3784
42/3784
43/3784
44/3784
45/3784
46/3784
47/3784
48/3784
49/3784
50/3784
51/3784
52/3784
53/3784
54/3784
55/3784
56/3784
57/3784
58/3784
59/3784
60/3784
61/3784
62/3784
63/3784
64/3784
65/3784
66/3784
67/3784
68/3784
69/3784
70/3784
71/3784
72/3784
73/3784
74/3784
75/3784
76/3784
77/3784
78/3784
79/3784
80/3784
81/3784
82/3784
83/3784
84/3784
85/3784
86/3784
87/3784
88/3784
89/3784
90/3784
91/3784
92/3784
93/3784
94/3784
95/3784
96/3784
97/3784
98/3784
99/3784
100/3784
101/3784
102/3784
103/3784
104/3784
105/3784
106/3784
107/3784
108/3784
109/3784
110/3784
111/3784
112/3784
113/3784
114/3784
115/3784
116/3784
117/3784
118/3784
119/3784
120/3784
121/3784
122/3784
123

In [36]:
from nltk.translate.bleu_score import SmoothingFunction

# Compute and report the average bleu score across the dataset
smoother = SmoothingFunction().method4
score = [sentence_bleu(translation, true, smoothing_function=smoother) for translation, true in zip(val_translations, val_true)]
print(sum(score)/len(score))

0.007658907225465915


In [37]:
from nltk.translate.bleu_score import sentence_bleu
# Evaluate the model on the test set prior to using the final test set
# Get the pairs needed for the test set and parse the data

test_pairs = all_clean_pairs[n_train+n_val:n_train+n_val+n_test, :]
test_inputs = test_pairs[:, 0]
test_true = [sentence.split(" ") for sentence in test_pairs[:, 1]]
test_input_seqs,_ = text2sequences(max_encoder_seq_length, test_inputs, indices=input_token_index)
test_input_x = onehot_encode(test_input_seqs, max_encoder_seq_length, num_encoder_tokens)
# Translate test sentences
test_translations = []
for i, x in enumerate(test_input_x):
    decoded_sentence = decode_sequence(x.reshape(1,*x.shape)).split(" ")
    if decoded_sentence[-1] == '\n':
      decoded_sentence = decoded_sentence[:-1]
    test_translations.append(decoded_sentence)
    print(f"{i}/{len(test_input_x)}", end="")
    print('\r')
print(len(test_translations))
print(len(test_true))


0/3784
1/3784
2/3784
3/3784
4/3784
5/3784
6/3784
7/3784
8/3784
9/3784
10/3784
11/3784
12/3784
13/3784
14/3784
15/3784
16/3784
17/3784
18/3784
19/3784
20/3784
21/3784
22/3784
23/3784
24/3784
25/3784
26/3784
27/3784
28/3784
29/3784
30/3784
31/3784
32/3784
33/3784
34/3784
35/3784
36/3784
37/3784
38/3784
39/3784
40/3784
41/3784
42/3784
43/3784
44/3784
45/3784
46/3784
47/3784
48/3784
49/3784
50/3784
51/3784
52/3784
53/3784
54/3784
55/3784
56/3784
57/3784
58/3784
59/3784
60/3784
61/3784
62/3784
63/3784
64/3784
65/3784
66/3784
67/3784
68/3784
69/3784
70/3784
71/3784
72/3784
73/3784
74/3784
75/3784
76/3784
77/3784
78/3784
79/3784
80/3784
81/3784
82/3784
83/3784
84/3784
85/3784
86/3784
87/3784
88/3784
89/3784
90/3784
91/3784
92/3784
93/3784
94/3784
95/3784
96/3784
97/3784
98/3784
99/3784
100/3784
101/3784
102/3784
103/3784
104/3784
105/3784
106/3784
107/3784
108/3784
109/3784
110/3784
111/3784
112/3784
113/3784
114/3784
115/3784
116/3784
117/3784
118/3784
119/3784
120/3784
121/3784
122/3784
123

In [45]:
# Compute and report the average bleu score across the dataset
score = [sentence_bleu(translation, true, smoothing_function=smoother) for translation, true in zip(test_translations, test_true)]