# SequenceToSequence Model

## Summary
1. [Understanding a bit](#Understanding-a-bit)
2. [Preparing the data for the model](#Preparing-the-data-to-the-model)
3. [Training the data](#Trainning-the-data)
4. [Making Predictions](#Making-Predictions)

...

## Understanding a bit

So far, we have used deep learning to build models that predict results from previously seen data. We used a special type of model, the RNN, to make predictions on data whose elements depend on each other over time. We saw poetry generation, which is a way to use RNNs to predict text; to do so, we used GloVe pre-trained word vectors to turn words into numbers.

Now we are going to see that we can not only predict the syntax of a sentence, but also use that syntax to predict a response.

SeqToSeq is widely used in machine translation and chatbots. It joins two RNNs, named the encoder and the decoder: one captures the syntax of the input, and the other transforms that syntax back into a sentence, but in another domain. That could be language to language, or question to answer.
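
To make the encoder/decoder split concrete, here is a minimal NumPy sketch. It is not the Keras model built later in this notebook: it assumes a plain tanh RNN cell with random, untrained weights, and the names `rnn_step`, `encode`, and `decode` are hypothetical. It only illustrates the data flow, where the encoder compresses the input into one fixed-size state and the decoder unrolls an output sequence from that state.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, emb = 4, 3  # toy sizes, chosen only for illustration

# Random, untrained weights: this shows the data flow, not a useful model
W_x = rng.normal(size=(emb, hidden))
W_h = rng.normal(size=(hidden, hidden))

def rnn_step(x, h):
    """One step of a plain tanh RNN cell."""
    return np.tanh(x @ W_x + h @ W_h)

def encode(inputs):
    """Read the whole input sequence and return only the final state."""
    h = np.zeros(hidden)
    for x in inputs:
        h = rnn_step(x, h)
    return h

def decode(state, steps):
    """Start from the encoder state and emit one hidden vector per step."""
    h, x, outputs = state, np.zeros(emb), []
    for _ in range(steps):
        h = rnn_step(x, h)
        outputs.append(h)
        x = h[:emb]  # feed part of the output back in as the next input
    return np.stack(outputs)

source = rng.normal(size=(5, emb))  # 5 input "words", already embedded
state = encode(source)              # fixed-size summary of the input
decoded = decode(state, steps=7)    # output length can differ from input length
print(state.shape, decoded.shape)
```

Note that the input has 5 steps and the output has 7: the only thing the two halves share is the state vector.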

...

### Where to find the data used here?
1. Follow the [link](http://www.manythings.org/anki/) to find the translation data used in this notebook.
2. [GloVe](http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip) can also be used for pre-trained word vectors.

### Find out More
1. [How does SeqToSeq work - Towards Data Science](https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346)
2. [Attention - One Step Ahead - WildML](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/?source=post_page---------------------------)

### Mistakes fixed
1. ...

## Preparing the data to the model

In [1]:
import os, sys

from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

import numpy as np
import matplotlib.pyplot as plt

Using TensorFlow backend.


In [2]:
# Configuration is set here

batch_size = 64 # Number of samples per gradient update
epochs = 100 # Number of passes over the training data
latent_dim = 256 # Dimensionality of the encoding space
num_samples = 10000 # Number of examples to load
max_sequence_length = 100
max_num_words = 20000
embedding_dim = 100 # Dimensionality of the word vectors

In [3]:
# Place to store the data

input_texts = [] # Original Language
target_texts = [] # Target Language
target_texts_inputs = [] # Same as the above, but offset by 1

In [4]:
# Load in the data from manythings

t = 0
with open('../databases/large_files/translation/por.txt', encoding='utf-8') as f:
    for line in f:
        # Limit the number of samples.
        # Every sample is later padded to the length of the longest one, so a
        # few very long samples waste a lot of computation on the short ones.
        # Keeping the samples as short as possible is a good way to go!
        t += 1
        if t > num_samples:
            break

        # The input and the target are separated by a tab
        if '\t' not in line:
            continue

        # Drop the trailing newline and keep the first two tab-separated
        # fields (newer downloads may carry an extra attribution column)
        input_text, translation = line.rstrip('\n').split('\t')[:2]

        # Using Teacher Forcing (https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/)
        # Build the decoder target and the decoder input by adding the special tokens
        target_text = translation + ' <eos>'
        target_text_input = '<sos> ' + translation

        input_texts.append(input_text)
        target_texts.append(target_text)
        target_texts_inputs.append(target_text_input)
print("Number of Samples: ", len(input_texts))

Number of Samples:  10000
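
The "offset by 1" between the decoder input and the decoder target is easier to see on a single sentence. The sketch below uses a hypothetical sample sentence and hypothetical variable names (`demo_target`, `demo_input`); it mirrors the two string concatenations above and pairs up what the decoder is fed with what it is expected to predict at each step.

```python
demo_translation = "Eu estou bem"  # hypothetical sample sentence

demo_target = demo_translation + ' <eos>'  # what the decoder must output
demo_input = '<sos> ' + demo_translation   # what the decoder is fed

# At step t the decoder sees demo_input[t] and must predict demo_target[t]
pairs = list(zip(demo_input.split(), demo_target.split()))
for fed, expected in pairs:
    print('{0:>6} -> {1}'.format(fed, expected))
```

With teacher forcing, the word fed at each step is the true previous word, not the model's own previous prediction.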


In [5]:
# Tokenization step!

# Inputs
tokenizer_inputs = Tokenizer(num_words=max_num_words)
tokenizer_inputs.fit_on_texts(input_texts)
input_sequences = tokenizer_inputs.texts_to_sequences(input_texts)

# Index Mapping
word2idx_inputs = tokenizer_inputs.word_index
print('Found {0} unique input tokens'.format(len(word2idx_inputs)))

# Determine the maximum input sequence length
max_len_input = max(len(s) for s in input_sequences)

Found 2061 unique input tokens


In [17]:
# Outputs
'''
Be careful not to filter special characters, otherwise <sos> and
<eos> will not appear!!
'''
tokenizer_outputs = Tokenizer(num_words=max_num_words, filters='')
tokenizer_outputs.fit_on_texts(target_texts + target_texts_inputs) # inefficient?
target_sequences = tokenizer_outputs.texts_to_sequences(target_texts)
target_sequences_inputs = tokenizer_outputs.texts_to_sequences(target_texts_inputs) 

# Index Mapping
word2idx_outputs = tokenizer_outputs.word_index
print('Found {0} unique output tokens'.format(len(word2idx_outputs)))

# +1 because word indices start at 1 (index 0 is reserved for padding)
num_words_output = len(word2idx_outputs) + 1

# Determine the maximum target sequence length
max_len_target = max(len(s) for s in target_sequences)

Found 4957 unique output tokens
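
The effect of `filters=''` can be seen with a small pure-Python sketch. `simple_tokenize` is a hypothetical, simplified stand-in for the Tokenizer's text-splitting step, not the Keras implementation; `DEFAULT_FILTERS` copies the default value of the `filters` argument, which includes `<` and `>`.

```python
# Default value of the Keras Tokenizer `filters` argument
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text, filters=DEFAULT_FILTERS):
    """Hypothetical, simplified stand-in for the Tokenizer's splitting step:
    lowercase, replace every filtered character with a space, split on spaces."""
    table = str.maketrans({ch: ' ' for ch in filters})
    cleaned = text.lower().translate(table)
    return [word for word in cleaned.split(' ') if word]

sample = '<sos> Eu estou bem.'
with_filters = simple_tokenize(sample)
without_filters = simple_tokenize(sample, filters='')
print(with_filters)     # the angle brackets are gone, '<sos>' became 'sos'
print(without_filters)  # '<sos>' survives, but so does the final '.'
```

The trade-off is visible in the second result: with no filtering, punctuation stays attached to the words, so `bem.` and `bem` become different tokens.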


In [35]:
# Pad Sequences

encoder_inputs = pad_sequences(input_sequences, maxlen=max_len_input)
print("Encoder data shape: {0}".format(encoder_inputs.shape))
print("Encoder data [0]: {0}".format(encoder_inputs[0]))

decoder_inputs = pad_sequences(target_sequences_inputs, maxlen=max_len_target, padding='post')
print("Decoder data [0]: {0}".format(decoder_inputs[0]))
print("Decoder data shape: {0}".format(decoder_inputs.shape))

decoder_targets = pad_sequences(target_sequences, maxlen=max_len_target, padding='post')

Encoder data shape: (10000, 5)
Encoder data [0]: [ 0  0  0  0 24]
Decoder data [0]: [   2 1390    0    0    0    0    0    0    0]
Decoder data shape: (10000, 9)
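
The two printed rows differ in where the zeros go: the encoder input is padded at the front (the default), while the decoder input uses `padding='post'`. The sketch below reproduces both rows with a hypothetical helper, `pad`, which is a simplified stand-in for `pad_sequences` (it does no truncation).

```python
import numpy as np

def pad(sequences, maxlen, padding='pre'):
    """Hypothetical, simplified stand-in for pad_sequences (no truncation)."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        if padding == 'pre':
            out[row, -len(seq):] = seq  # zeros first, tokens last
        else:
            out[row, :len(seq)] = seq   # tokens first, zeros last
    return out

encoder_row = pad([[24]], maxlen=5)
decoder_row = pad([[2, 1390]], maxlen=9, padding='post')
print(encoder_row[0])
print(decoder_row[0])
```

A common rationale for this choice, stated here as an assumption about the design: the encoder only hands over its final state, so pre-padding keeps the real words closest to that final step, while the decoder starts generating at step 0, so its real tokens should come first.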


In [36]:
# Store all the pre-trained word vectors
print('Loading word vectors...')
word2vec = {}
with open(
    '../databases/large_files/glove.6B/glove.6B.%sd.txt' % embedding_dim,
    encoding='utf-8'
) as f:
    for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found {0} word vectors'.format(len(word2vec)))

Loading word vectors...
Found 400000 word vectors


In [37]:
# Embedding Matrix
print('Filling pre-trained embeddings...')
num_words = min(max_num_words, len(word2idx_inputs)+1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word2idx_inputs.items():
    if i < max_num_words:
        embedding_vector = word2vec.get(word)
        # Words without a pre-trained vector keep their all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

Filling pre-trained embeddings...
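
To see what the filling loop produces, here is the same logic on a toy vocabulary. All names prefixed with `toy_` and all vector values are hypothetical; the point is that row 0 (padding) and the row of any word missing from the pre-trained vectors stay at zero.

```python
import numpy as np

toy_dim = 3  # toy size; the notebook uses embedding_dim = 100
toy_word2vec = {  # hypothetical pre-trained vectors
    'go': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'run': np.array([0.4, 0.5, 0.6], dtype='float32'),
}
toy_word2idx = {'go': 1, 'tom': 2, 'run': 3}  # index 0 is reserved for padding

toy_matrix = np.zeros((len(toy_word2idx) + 1, toy_dim))
for word, i in toy_word2idx.items():
    vector = toy_word2vec.get(word)
    if vector is not None:
        toy_matrix[i] = vector

# Words whose row is still all zeros had no pre-trained vector
missing = [w for w, i in toy_word2idx.items() if not toy_matrix[i].any()]
print(missing)
```

Running the same check on the real `embedding_matrix` is a quick way to find out how much of the vocabulary GloVe actually covers.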


In [38]:
# Embedding Layer

embedding_layer = Embedding(
    num_words,
    embedding_dim,
    weights=[embedding_matrix],
    input_length=max_len_input,
    # trainable=True
)

In [39]:
'''
Since we cannot use sparse categorical cross-entropy with sequences,
let's now create the one-hot targets
'''
decoder_targets_one_hot = np.zeros((
    len(input_texts),
    max_len_target,
    num_words_output
), dtype='float32')

In [40]:
# Finally, assign the values
for i, d in enumerate(decoder_targets):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1
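
The nested loop above can be checked on a toy example. The sketch below uses hypothetical `toy_` names, repeats the loop on two short padded sequences, compares it with a loop-free alternative (indexing an identity matrix), and estimates the memory taken by the real array from the shapes printed earlier in this notebook (10000 samples, 9 steps, 4957 + 1 output words).

```python
import numpy as np

toy_targets = np.array([[2, 1, 0],
                        [3, 0, 0]])  # 2 padded sequences, 0 is padding
toy_vocab = 4

one_hot = np.zeros(toy_targets.shape + (toy_vocab,), dtype='float32')
for i, row in enumerate(toy_targets):
    for t, word in enumerate(row):
        one_hot[i, t, word] = 1

# Same result without Python loops: index an identity matrix by word id
vectorised = np.eye(toy_vocab, dtype='float32')[toy_targets]

# Size of the real array: samples * steps * vocabulary * 4 bytes (float32)
n_bytes = 10000 * 9 * (4957 + 1) * 4
print(n_bytes / 1e9, 'GB')
```

Two things are worth noticing: padded positions are encoded as class 0 like any other word, and the real array takes roughly 1.8 GB, which is why the one-hot representation does not scale well to larger vocabularies.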

## Trainning the data

In [41]:
encoder_inputs_placeholder = Input(shape=(max_len_input,))
x = embedding_layer(encoder_inputs_placeholder)
encoder = LSTM(latent_dim, return_state=True, dropout=0.5)
encoder_outputs, h, c = encoder(x)
# encoder_outputs, h = encoder(x) # GRU

# States to pass to the decoder
encoder_states = [h, c]
# encoder_states = [h] # GRU

# For the decoder, we are going to use [h, c] as initial state
decoder_inputs_placeholder = Input(shape=(max_len_target,))

# Not using pre-trained word vectors
decoder_embedding = Embedding(num_words_output, latent_dim)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)

# The decoder is a to_many model, so we must set return_sequences=True
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, dropout=0.5)
decoder_outputs, _, _ = decoder_lstm(
    decoder_inputs_x,
    initial_state=encoder_states
)

# decoder_outputs, _ = decoder_gru(
#     decoder_inputs_x,
#     initial_state=encoder_states
# )

# Dense layers for predictions
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([
    encoder_inputs_placeholder,
    decoder_inputs_placeholder
], decoder_outputs)
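
As a rough size check on this architecture, the number of trainable weights in the recurrent and dense layers can be worked out by hand. The sketch below assumes the standard LSTM formulation (four weight blocks: input, forget, and output gates plus the cell candidate, each with input weights, recurrent weights, and a bias) and the sizes used in this notebook (`embedding_dim = 100`, `latent_dim = 256`, 4957 + 1 output words). The helper `lstm_params` is hypothetical.

```python
def lstm_params(input_dim, units):
    """Weights of one standard LSTM layer: 4 blocks, each with
    input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

vocab_out = 4957 + 1  # num_words_output, from the tokenizer output above

encoder_params = lstm_params(input_dim=100, units=256)  # GloVe vectors -> latent_dim
decoder_params = lstm_params(input_dim=256, units=256)  # decoder embedding -> latent_dim
dense_params = 256 * vocab_out + vocab_out              # weights + biases

print(encoder_params, decoder_params, dense_params)
```

These figures can be compared against `model.summary()`. The final dense layer is by far the largest of the three, because its size grows with the output vocabulary.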

In [42]:
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

fit = model.fit(
    [encoder_inputs, decoder_inputs],
    decoder_targets_one_hot,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2
)

Train on 8000 samples, validate on 2000 samples
Epoch 1/100
...
Epoch 100/100
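
The "Train on 8000 samples, validate on 2000 samples" line follows from `validation_split=0.2`. In Keras, the validation data is taken from the last samples of the arrays passed to `fit`, before any shuffling. The sketch below reproduces that split on plain indices; the variable names are hypothetical.

```python
import numpy as np

n_samples, split_fraction = 10000, 0.2
indices = np.arange(n_samples)

# The last 20% of the samples, in their original order, become validation data
split_at = int(n_samples * (1 - split_fraction))
train_idx, val_idx = indices[:split_at], indices[split_at:]
print(len(train_idx), len(val_idx))
```

If the data file is ordered in some way (for example from short to long sentences), the validation set built this way is not a random sample, and shuffling the arrays before calling `fit` may give a more representative split.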


In [43]:
# Save Model
model.save('../models/translation-eng-por.h5')
print('Model saved to disk at ../models/translation-eng-por.h5')

# # Load Model
# from keras.models import load_model
# model = load_model('../models/translation-eng-por.h5')

Model saved to disk at ../models/translation-eng-por.h5


## Making Predictions