# Introduction

This notebook presents a **Sequence-to-Sequence** encoder-decoder architecture based on **LSTM** cells. The neural network learns an **English to French** translation task on a small corpus of short sentences.

<img src="assets/seq2seq.png"/>
<center>Sequence to sequence architecture</center>

**Dataset**

* [Udacity NLP Nanodegree](https://eu.udacity.com/course/natural-language-processing-nanodegree--nd892) - the dataset was provided as part of this course
* [Udacity NLP GitHub](https://github.com/udacity/aind2-nlp-capstone) - dataset link

**Code**

* [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) - code with explanation
* [Official Keras Seq2Seq Example](https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py) - code

**Resources**

* [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) (2014) by Ilya Sutskever, Oriol Vinyals, Quoc V. Le
* [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078) (2014) by Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

# Imports

In [2]:
import os
import numpy as np
import matplotlib.pyplot as plt

Limit TensorFlow GPU memory usage

In [3]:
import tensorflow as tf
gpu_options = tf.GPUOptions(allow_growth=True)  # init TF ...
config=tf.ConfigProto(gpu_options=gpu_options)  # w/o taking ...
with tf.Session(config=config): pass            # all GPU memory

# English to French Dataset

Download the dataset from the link in the introduction and point the path below to the folder containing *small_vocab_en* and *small_vocab_fr*.

In [7]:
dataset_location = '/home/marcin/Udacity/NLPND/aind2-nlp-capstone/data/'

*small_vocab_en* contains 137860 short sentences in English. *small_vocab_fr* contains the corresponding sentences in French.

In [20]:
with open(os.path.join(dataset_location, 'small_vocab_en')) as f:
    # line below: 1) reads lines from file, 2) strips the \n char and converts to lowercase, 3) adds special start/end words
    data_en_raw = list(map(lambda x: 'ST '+x.strip().lower()+' EN', f.readlines()))
print('len:', len(data_en_raw))
print('example sentences:')
data_en_raw[0:3]

len: 137860
example sentences:


['ST new jersey is sometimes quiet during autumn , and it is snowy in april . EN',
 'ST the united states is usually chilly during july , and it is usually freezing in november . EN',
 'ST california is usually quiet during march , and it is usually hot in june . EN']

In [21]:
with open(os.path.join(dataset_location, 'small_vocab_fr')) as f:
    # line below: 1) reads lines from file, 2) strips the \n char and converts to lowercase, 3) adds special start/end words
    data_fr_raw = list(map(lambda x: 'ST '+x.strip().lower()+' EN', f.readlines()))
print('len:', len(data_fr_raw))
print('example sentences:')
data_fr_raw[0:3]

len: 137860
example sentences:


["ST new jersey est parfois calme pendant l' automne , et il est neigeux en avril . EN",
 'ST les états-unis est généralement froid en juillet , et il gèle habituellement en novembre . EN',
 'ST california est généralement calme en mars , et il est généralement chaud en juin . EN']

Use the Keras tokenizer to convert sentences to tokens. Each word gets its own unique integer token. The special words ST/EN also get their own tokens.

In [22]:
tok_en = tf.keras.preprocessing.text.Tokenizer(lower=False)
tok_en.fit_on_texts(data_en_raw)
data_en_tok = tok_en.texts_to_sequences(data_en_raw)



In [42]:
print('example tokens for English:')
print('is:', tok_en.word_index['is'], '   ',
      'ST:', tok_en.word_index['ST'], '   ',
      'EN:', tok_en.word_index['EN'], '   ',
      'in:', tok_en.word_index['in'], '   ',
      'it:', tok_en.word_index['it'])
print('example sentences after tokenization:')
data_en_tok[0:3]

example tokens for English:
is: 1     ST: 2     EN: 3     in: 4     it: 5
example sentences after tokenization:


[[2, 19, 25, 1, 10, 69, 6, 41, 9, 5, 1, 57, 4, 46, 3],
 [2, 7, 22, 23, 1, 11, 64, 6, 45, 9, 5, 1, 11, 53, 4, 47, 3],
 [2, 24, 1, 11, 69, 6, 40, 9, 5, 1, 11, 70, 4, 36, 3]]
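The index assignment seen above (the most frequent word gets token 1, and 0 is never assigned) can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: `build_word_index`, `to_sequence` and the toy corpus are hypothetical, and the sketch leaves out the punctuation filtering that the real tokenizer applies by default.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer rank; the most frequent word gets 1."""
    counts = Counter(word for s in sentences for word in s.split())
    # most_common() sorts by count, descending; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(sentence, word_index):
    """Replace each known word with its integer token; skip unknown words."""
    return [word_index[w] for w in sentence.split() if w in word_index]

# toy corpus (made up for this sketch), using the same ST/EN markers
toy_corpus = ['ST it is hot EN', 'ST it is cold EN', 'ST is it EN']
toy_index = build_word_index(toy_corpus)
toy_tokens = [to_sequence(s, toy_index) for s in toy_corpus]

print(toy_index)
print(toy_tokens)
```

Because index 0 is left unused, it stays free to act as the padding value later on.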

In [36]:
tok_fr = tf.keras.preprocessing.text.Tokenizer(lower=False)
tok_fr.fit_on_texts(data_fr_raw)
data_fr_tok = tok_fr.texts_to_sequences(data_fr_raw)

In [40]:
print('example tokens for French:')
print('est:', tok_fr.word_index['est'], '   ',
      'ST:', tok_fr.word_index['ST'], '   ',
      'EN:', tok_fr.word_index['EN'], '   ',
      'en:', tok_fr.word_index['en'], '   ',
      'il:', tok_fr.word_index['il'])
print('example sentences after tokenization:')
data_fr_tok[0:3]

example tokens for French:
est: 1     ST: 2     EN: 3     en: 4     il: 5
example sentences after tokenization:


[[2, 37, 36, 1, 10, 69, 39, 13, 26, 8, 5, 1, 114, 4, 52, 3],
 [2, 6, 34, 33, 1, 14, 21, 4, 51, 8, 5, 97, 71, 4, 53, 3],
 [2, 103, 1, 14, 69, 4, 47, 8, 5, 1, 14, 23, 4, 43, 3]]

In [46]:
max_len_en = len(max(data_en_tok, key=len))
max_len_fr = len(max(data_fr_tok, key=len))
max_len_both = max(max_len_en, max_len_fr)
print('Maximum sentence length in either English or French:', max_len_both, 'tokens (including EN/ST)')

Maximum sentence length in either English or French: 23 tokens (including EN/ST)


Pad both corpora to the longest sentence in each language

In [48]:
data_en = tf.keras.preprocessing.sequence.pad_sequences(data_en_tok, maxlen=max_len_en, padding='post')
data_fr = tf.keras.preprocessing.sequence.pad_sequences(data_fr_tok, maxlen=max_len_fr, padding='post')
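What `padding='post'` does to each sentence can be sketched without Keras. The helper `pad_post` below is hypothetical and simplified: it only appends zeros on the right and refuses sequences longer than `maxlen`, whereas the real function also supports truncation.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence with `value` until it has `maxlen` tokens."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # truncation is out of scope for this sketch
            raise ValueError('sequence longer than maxlen')
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

# two toy token sequences of different length (made up for this sketch)
toy = [[2, 24, 1, 3], [2, 7, 3]]
toy_padded = pad_post(toy, maxlen=5)
print(toy_padded)
```

After padding, all rows have the same length, so the corpus can be stored as one rectangular array.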

In [51]:
n_en_seq = data_en.shape[1]
n_fr_seq = data_fr.shape[1]
n_en_vocab = len(tok_en.word_index)
n_fr_vocab = len(tok_fr.word_index)
max_seq_len = max(n_en_seq, n_fr_seq)
print('Max length English sentence (tokens):   ', n_en_seq)
print('Max length French sentence (tokens):    ', n_fr_seq)
print('Num tokens in English vocabulary:       ', n_en_vocab)
print('Num tokens in French vocabulary:        ', n_fr_vocab)

Max length English sentence (tokens):    17
Max length French sentence (tokens):     23
Num tokens in English vocabulary:        201
Num tokens in French vocabulary:         346


In [58]:
print('English train data')
print('shape:', data_en.shape)
print(data_en[4:7])

English train data
shape: (137860, 17)
[[ 2 31 14 18 15  1  7 84  8 32 14 18  1  7 85  3  0]
 [ 2 33 13 15  1  7 86  8 32 13  1  7 84  3  0  0  0]
 [ 2 20  1 68  6 49  8  5  1 11 64  4 45  3  0  0  0]]


In [61]:
print('French train targets data')
print('shape:', data_fr.shape)
print(data_fr[4:7])

French train targets data
shape: (137860, 23)
[[ 2 42 15 16 18  1 12 84  7 41 15 16  1  9 85  3  0  0  0  0  0  0  0]
 [ 2 22 18 19  1 86  7 41 19  1 12 84  3  0  0  0  0  0  0  0  0  0  0]
 [ 2 31  1 60  4 54  7  5  1 14 21  4 51  3  0  0  0  0  0  0  0  0  0]]


# Simple Model

<img src="assets/rnn_bidirectional.png"/>
<center>Figure from Bidirectional Recurrent Neural Networks (1997) by Mike Schuster and Kuldip K. Paliwal</center>

In [101]:
from tensorflow.keras.layers import Input, Embedding, Bidirectional, GRU, TimeDistributed, Dense, Activation

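# the model emits one French token per input step, so the input length is
# max_seq_len (23, as in the summary below); English data must be padded to it
X_input = Input(shape=(max_seq_len,))
# note: Tokenizer indices run from 1 to n_en_vocab (0 is padding), so
# input_dim=n_en_vocab + 1 would be needed to cover the highest index
X = Embedding(input_dim=n_en_vocab, output_dim=50)(X_input)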
X = Bidirectional( GRU(units=64, return_sequences=True) )(X)
X = TimeDistributed(Dense(units=n_fr_vocab))(X)
X = Activation('softmax')(X)

model = tf.keras.Model(inputs=X_input, outputs=X)
model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
              optimizer=tf.keras.optimizers.Adam(lr=0.001),
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])    
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 23)                0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 23, 50)            10050     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 23, 128)           44160     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 23, 346)           44634     
_________________________________________________________________
activation_1 (Activation)    (None, 23, 346)           0         
Total params: 98,844
Trainable params: 98,844
Non-trainable params: 0
_________________________________________________________________
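The parameter counts in the summary can be checked by hand. The arithmetic below is a sketch that assumes the classic GRU formulation with three gates and one bias vector per gate, which is what the printed count matches; the variable names are introduced here only for illustration.

```python
vocab_en, vocab_fr = 201, 346    # vocabulary sizes printed earlier
emb_dim, gru_units = 50, 64      # layer sizes from the model definition

# embedding: one vector of emb_dim values per vocabulary entry
embedding_params = vocab_en * emb_dim

# one GRU direction: 3 gates, each with input weights,
# recurrent weights and a bias vector
gru_one_direction = 3 * (gru_units * (emb_dim + gru_units) + gru_units)
bidirectional_params = 2 * gru_one_direction

# dense layer sees forward and backward outputs concatenated (2 * gru_units)
dense_params = (2 * gru_units) * vocab_fr + vocab_fr

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```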


In [108]:
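# pad English input to max_seq_len so input and French target lengths match
data_en_pad = tf.keras.preprocessing.sequence.pad_sequences(data_en, maxlen=max_seq_len, padding='post')
hist = model.fit(x=data_en_pad, y=np.expand_dims(data_fr, axis=-1),
                 batch_size=1024, epochs=10, validation_split=0.2)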

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 110288 samples, validate on 27572 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Test Model**

In [13]:
def sequence_to_english(seq):
    words = [tok_en.index_word[x] for x in seq if x in tok_en.index_word]
    return ' '.join(words)
def sequence_to_french(seq):
    words = [tok_fr.index_word[x] for x in seq if x in tok_fr.index_word]
    return ' '.join(words)

In [112]:
index = 234
english_sentence = data_en_raw[index]
french_sentence = data_fr_raw[index]

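# pad the single English sentence to the model's input length (max_seq_len)
x = tf.keras.preprocessing.sequence.pad_sequences(data_en[index:index+1], maxlen=max_seq_len, padding='post')
prediction_prob = model.predict(x)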
prediction_prob = prediction_prob.squeeze()
prediction_tok = prediction_prob.argmax(axis=-1)
predicted_sentence = sequence_to_french(prediction_tok)

print('english:            ', english_sentence)
print('french (original):  ', french_sentence)
print('french (predicted): ', predicted_sentence)

english:             ST we dislike oranges , grapefruit , and bananas . EN
french (original):   ST nous détestons les oranges , le pamplemousse et les bananes . EN
french (predicted):  ST nous détestons les le le pamplemousse et les les EN
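The decoding step used above (argmax over the vocabulary axis, then lookup in the index) is sketched below in plain Python with made-up probabilities and a hypothetical four-word vocabulary. Each position is decoded independently of the words chosen at other positions, which may be one reason for the repeated words in the prediction above.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# hypothetical vocabulary; index 0 is reserved for padding
toy_index_word = {1: 'est', 2: 'ST', 3: 'EN', 4: 'il'}

# one row of probabilities per time step (made-up numbers)
toy_probs = [[0.05, 0.05, 0.80, 0.05, 0.05],
             [0.10, 0.10, 0.10, 0.10, 0.60],
             [0.10, 0.70, 0.10, 0.05, 0.05],
             [0.10, 0.10, 0.10, 0.60, 0.10],
             [0.90, 0.025, 0.025, 0.025, 0.025]]

toy_pred = [argmax(row) for row in toy_probs]
# padding (0) is not in the vocabulary, so it is dropped from the sentence
toy_sentence = ' '.join(toy_index_word[t] for t in toy_pred if t in toy_index_word)
print(toy_pred, toy_sentence)
```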


# Sequence to Sequence

We will use a technique called 'teacher forcing' to train the decoder. Instead of letting the decoder generate one word at a time and feeding each generated word into its next step, we pretend the decoder has produced the correct sequence so far and feed in the correct words. Because we know the correct French translation, we don't have to sample one word at a time during training.

To do this we will need two versions of the French dataset:

* the actual target dataset, with the ST marker removed
* the feed-in dataset, used as input to the decoder, which contains the ST token at the first position

In [60]:
data_fr_noST = np.roll(data_fr, shift=-1, axis=-1)  # shift left by one and pad 0 on the right
data_fr_noST[:,-1] = 0
print('French train targets data')
print('shape:', data_fr_noST.shape)
print(data_fr_noST[4:7])

French train targets data
shape: (137860, 23)
[[42 15 16 18  1 12 84  7 41 15 16  1  9 85  3  0  0  0  0  0  0  0  0]
 [22 18 19  1 86  7 41 19  1 12 84  3  0  0  0  0  0  0  0  0  0  0  0]
 [31  1 60  4 54  7  5  1 14 21  4 51  3  0  0  0  0  0  0  0  0  0  0]]
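The relation between decoder input and decoder target can be seen on a toy array. The sketch below uses made-up token rows and shows why the last column has to be zeroed: `np.roll` wraps the ST token around to the end instead of dropping it.

```python
import numpy as np

# toy padded French sequences (made up); the second row has no padding
toy_fr = np.array([[2, 31, 1, 3, 0, 0],
                   [2, 22, 18, 19, 1, 3]])

decoder_input = toy_fr                        # starts with the ST token (2)
rolled = np.roll(toy_fr, shift=-1, axis=-1)   # ST wraps around to the last column
decoder_target = rolled.copy()
decoder_target[:, -1] = 0                     # overwrite the wrapped ST with padding

print(rolled)
print(decoder_target)
```

At every time step the target is the token that follows the current input token, so `decoder_target[:, :-1]` equals `decoder_input[:, 1:]`.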


Create the following parts of the graph:
* Encoder
  * inputs: whole English sentence
  * outputs: final LSTM hidden and cell states
* Decoder in train mode
  * inputs: encoder LSTM states **and** the French sentence in "teacher forcing" mode (with the ST token at the beginning)
  * outputs: whole French sentence (shifted left by one, without the ST token)

In [20]:
from tensorflow.keras.layers import Input, Embedding, LSTM, TimeDistributed, Dense, Activation

# Encoder
E_input = Input(shape=(n_en_seq,), name='Enc_Input')                                       # encoder input tensor
E = Embedding(input_dim=n_en_vocab, output_dim=50, name='Enc_Embbeding')(E_input)          # encoder embedding layer
_, Eh, Ec = LSTM(units=512, return_state=True, name='Enc_LSTM')(E)                         # encoder LSTM layer

# Decoder layer definitions
decoder_embedding = Embedding(input_dim=n_fr_vocab, output_dim=50, name='Dec_Embbedingg')  # we will need to reuse these
decoder_lstm = LSTM(512, return_sequences=True, return_state=True, name='Dec_LSTM')        # layers in sampling mode
decoder_dense = Dense(n_fr_vocab, activation='softmax', name='Dec_Output')                 # in next section

# Decoder in train mode
D_input = Input(shape=(n_fr_seq,), name='Dec_Target')                                      # decoder input tensor
D = decoder_embedding(D_input)                                                             # decoder embedding layer
D, _, _ = decoder_lstm(D, initial_state=[Eh, Ec])                                          # decoder LSTM layer
D_output = decoder_dense(D)                                                                # decoder dense on output

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Enc_Input (InputLayer)          (None, 17)           0                                            
__________________________________________________________________________________________________
Dec_Target (InputLayer)         (None, 23)           0                                            
__________________________________________________________________________________________________
Enc_Embbeding (Embedding)       (None, 17, 50)       10050       Enc_Input[0][0]                  
__________________________________________________________________________________________________
Dec_Embbedingg (Embedding)      (None, 23, 50)       17300       Dec_Target[0][0]                 
__________________________________________________________________________________________________
Enc_LSTM (

Create the end-to-end Keras model for training. It contains both the encoder and the decoder.

In [None]:
model = tf.keras.Model(inputs=[E_input, D_input], outputs=D_output)                        # full seq-2-seq model
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.001),
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])    
model.summary()

Train model

In [15]:
model.fit(x=[data_en, data_fr], y=np.expand_dims(data_fr_noST, axis=-1),
          batch_size=1024, epochs=10, validation_split=0.2)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 110288 samples, validate on 27572 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fe65a593f28>

**Model for sampling translations**

Create Keras model for encoder as separate unit

In [62]:
encoder = tf.keras.Model(inputs=E_input, outputs=[Eh, Ec])
encoder.summary()


Create the decoder in sampling mode, reusing the layer definitions from the previous section
* Inputs: LSTM states **and** a single French word-token (not the whole French sentence)
* Outputs: single French word-token and the updated LSTM states

In [18]:
Sh_init = Input(shape=(512,))
Sc_init = Input(shape=(512,))
S_input = Input(shape=(1,), name='Sam_Input')
S = decoder_embedding(S_input)
S, Sh, Sc = decoder_lstm(S, initial_state=[Sh_init, Sc_init])
S_output = decoder_dense(S)

Create Keras model for decoder-sampler (one word at a time)

In [23]:
sampler = tf.keras.Model(inputs=[S_input, Sh_init, Sc_init], outputs=[S_output, Sh, Sc])
sampler.summary()

In [50]:
index = 666
english_sentence = data_en_raw[index]
french_sentence = data_fr_raw[index]
print('english:            ', english_sentence)
print('french (original):  ', french_sentence)

english:             ST his least favorite fruit is the pear , but our least favorite is the banana . EN
french (original):   ST son fruit préféré est moins la poire , mais notre moins préféré est la banane . EN


**Actually Sample**

Run input sentence through encoder

In [51]:
st_h, st_c = encoder.predict(data_en[index:index+1])
assert st_h.shape == (1, 512) and st_c.shape == (1, 512)

Create the input variables - these will be fed into the decoder at the first decoding time step

In [52]:
st_input = tok_fr.word_index['ST']
st_input = np.array([[st_input]])  # batch size = 1, seq len = 1
assert st_input.shape == (1, 1)

Generate output words one at a time and feed each one back in at the next time step

In [53]:
prediction_tok = []                                              # list of output tokens, generated one at a time
for i in range(n_fr_seq):
    # feed one word (st_input) into the decoder
    probs, st_h, st_c = sampler.predict([st_input, st_h, st_c])
    assert st_h.shape == (1, 512) and st_c.shape == (1, 512)
    
    # pick maximum probability prediction as next word
    # (but keep shape so we can feed in next step)
    st_input = probs.argmax(axis=-1)
    assert st_input.shape == (1, 1)
    
    # pick maximum probability prediction and append to generate list
    # (this does same as line above, but discards shape)
    token = probs.argmax()
    prediction_tok.append(token)
    
    # if decoder generated special end-word, break
    if token == tok_fr.word_index['EN']:
        break    

Print output sentence tokens

In [54]:
prediction_tok

[22, 18, 19, 1, 15, 9, 90, 7, 22, 15, 19, 1, 9, 91, 3]

Print the input English sentence, the target French sentence and the generated French sentence

In [55]:
print('english:            ', english_sentence)
print('french (original):  ', french_sentence)
predicted_sentence = sequence_to_french(prediction_tok)
print('french (predicted): ', predicted_sentence)

english:             ST his least favorite fruit is the pear , but our least favorite is the banana . EN
french (original):   ST son fruit préféré est moins la poire , mais notre moins préféré est la banane . EN
french (predicted):  son fruit préféré est moins la poire mais son moins préféré est la fraise EN
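Stripped of Keras, the sampling loop used above is plain greedy decoding: take the most probable token, feed it back in, and stop at the end marker or at a length limit. The sketch below is a simplified stand-in under stated assumptions: `greedy_decode` and `toy_step` are hypothetical, and `toy_step` replaces `sampler.predict` with a fixed lookup table.

```python
def greedy_decode(step_fn, state, start_token, end_token, max_len):
    """Feed the previous token back in until end_token or max_len is reached."""
    tokens, token = [], start_token
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        # greedy choice: index of the highest probability
        token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(token)
        if token == end_token:
            break
    return tokens

def toy_step(token, state):
    """Hypothetical one-step decoder: a lookup table instead of an LSTM."""
    next_token = {2: 5, 5: 1, 1: 3}[token]   # ST -> il -> est -> EN
    probs = [0.0] * 6
    probs[next_token] = 1.0
    return probs, state + 1                  # state only counts the steps

toy_output = greedy_decode(toy_step, state=0, start_token=2, end_token=3, max_len=10)
print(toy_output)
```

Greedy decoding commits to one word per step and never revisits it, so an early mistake carries through the rest of the sentence.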
