# Character Sequence to Sequence 
In this notebook, we'll build a model that takes in a sequence of letters, and outputs a sorted version of that sequence. We'll do that using what we've learned so far about Sequence to Sequence models.

<img src="images/sequence-to-sequence.jpg"/>


## Dataset 

The dataset lives in the /data/ folder. At the moment, it is made up of the following files:
 * **letters_source.txt**: The list of input letter sequences. Each sequence is its own line. 
 * **letters_target.txt**: The list of target sequences we'll use in the training process. Each sequence here is a response to the input sequence in letters_source.txt with the same line number.

In [1]:
import helper

source_path = 'data/letters_source.txt'
target_path = 'data/letters_target.txt'

source_sentences = helper.load_data(source_path)
target_sentences = helper.load_data(target_path)

Let's start by examining the current state of the dataset. `source_sentences` contains the entire input sequence file as text delimited by newline symbols.

In [2]:
source_sentences[:50].split('\n')

['bsaqq',
 'npy',
 'lbwuj',
 'bqv',
 'kial',
 'tddam',
 'edxpjpg',
 'nspv',
 'huloz',
 '']

`target_sentences` contains the entire output sequence file as text delimited by newline symbols. Each line corresponds to the line with the same number in `source_sentences` and contains that line's characters in sorted order.

In [3]:
target_sentences[:50].split('\n')

['abqqs',
 'npy',
 'bjluw',
 'bqv',
 'aikl',
 'addmt',
 'degjppx',
 'npsv',
 'hlouz',
 '']
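Since each target line is just the sorted characters of its source line, the pairing can be sanity-checked in plain Python. The sketch below uses the first few lines printed above; `sort_letters` is a hypothetical helper name, not something provided by `helper`.

```python
def sort_letters(line):
    """Return the characters of `line` in ascending order, as a string."""
    return ''.join(sorted(line))

# First few source/target pairs printed above
source_lines = ['bsaqq', 'npy', 'lbwuj', 'bqv', 'kial']
target_lines = ['abqqs', 'npy', 'bjluw', 'bqv', 'aikl']

# Every target should equal its sorted source line
mismatches = [(src, tgt) for src, tgt in zip(source_lines, target_lines)
              if sort_letters(src) != tgt]
print(mismatches)
```

An empty list means every pair is consistent.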

## Preprocess
To do anything useful with the data, we'll need to turn the characters of each line into a list of integers:

In [4]:
def extract_character_vocab(data):
    special_words = ['<pad>', '<unk>', '<s>', '</s>']

    set_words = set([character for line in data.split('\n') for character in line])
    int_to_vocab = {word_i: word for word_i, word in enumerate(special_words + list(set_words))}
    vocab_to_int = {word: word_i for word_i, word in int_to_vocab.items()}

    return int_to_vocab, vocab_to_int

# Build int2letter and letter2int dicts
source_int_to_letter, source_letter_to_int = extract_character_vocab(source_sentences)
target_int_to_letter, target_letter_to_int = extract_character_vocab(target_sentences)

# Convert characters to ids
source_letter_ids = [[source_letter_to_int.get(letter, source_letter_to_int['<unk>']) for letter in line] for line in source_sentences.split('\n')]
target_letter_ids = [[target_letter_to_int.get(letter, target_letter_to_int['<unk>']) for letter in line] for line in target_sentences.split('\n')]

print("Example source sequence")
print(source_letter_ids[:3])
print("\n")
print("Example target sequence")
print(target_letter_ids[:3])

Example source sequence
[[13, 18, 28, 26, 26], [12, 6, 15], [22, 13, 19, 9, 20]]


Example target sequence
[[28, 13, 26, 26, 18], [12, 6, 15], [13, 20, 22, 9, 19]]
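Because the vocabulary is built by enumerating a Python `set`, the id assigned to each letter can differ between runs. Below is a minimal sketch of the same idea with a sorted alphabet, so the mapping is reproducible. `build_vocab` is a hypothetical name, and the sorting step is an assumption of this sketch, not what the cell above does.

```python
SPECIAL_TOKENS = ['<pad>', '<unk>', '<s>', '</s>']

def build_vocab(text):
    """Map every character in `text` (newlines excluded) to an integer id."""
    letters = sorted({ch for line in text.split('\n') for ch in line})
    int_to_vocab = dict(enumerate(SPECIAL_TOKENS + letters))
    vocab_to_int = {token: idx for idx, token in int_to_vocab.items()}
    return int_to_vocab, vocab_to_int

int_to_vocab, vocab_to_int = build_vocab('cab\nabc')
unk = vocab_to_int['<unk>']

# 'z' never appears in the text, so it falls back to the <unk> id
ids = [vocab_to_int.get(ch, unk) for ch in 'cabz']
print(ids)
```

The special tokens always take ids 0 to 3, which is why `<pad>` shows up as `0` in the padded sequences.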


The last step in the preprocessing stage is to determine the longest sequence in the dataset, then pad all the sequences to that length.

In [5]:
def pad_id_sequences(source_ids, source_letter_to_int, target_ids, target_letter_to_int, sequence_length):
    new_source_ids = [sentence + [source_letter_to_int['<pad>']] * (sequence_length - len(sentence)) \
                      for sentence in source_ids]
    new_target_ids = [sentence + [target_letter_to_int['<pad>']] * (sequence_length - len(sentence)) \
                      for sentence in target_ids]

    return new_source_ids, new_target_ids


# Use the longest sequence as sequence length
sequence_length = max(
        [len(sentence) for sentence in source_letter_ids] + [len(sentence) for sentence in target_letter_ids])

# Pad all sequences up to sequence length
source_ids, target_ids = pad_id_sequences(source_letter_ids, source_letter_to_int, 
                                          target_letter_ids, target_letter_to_int, sequence_length)

print("Sequence Length")
print(sequence_length)
print("\n")
print("Input sequence example")
print(source_ids[:3])
print("\n")
print("Target sequence example")
print(target_ids[:3])

Sequence Length
7


Input sequence example
[[13, 18, 28, 26, 26, 0, 0], [12, 6, 15, 0, 0, 0, 0], [22, 13, 19, 9, 20, 0, 0]]


Target sequence example
[[28, 13, 26, 26, 18, 0, 0], [12, 6, 15, 0, 0, 0, 0], [13, 20, 22, 9, 19, 0, 0]]


This is the final shape we need them to be in. We can now proceed to building the model.
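The padding step boils down to appending the `<pad>` id until every list has the same length. A stripped-down sketch on two of the sequences above; `pad_sequences` is a hypothetical name used only for this illustration.

```python
def pad_sequences(sequences, pad_id, length=None):
    """Right-pad each id list with `pad_id` up to `length` (default: the longest)."""
    if length is None:
        length = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]

padded = pad_sequences([[13, 18, 28, 26, 26], [12, 6, 15]], pad_id=0, length=7)
print(padded)
```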

### Sequence to Sequence
The decoder is probably the most complex part of this model. We need to declare a decoder for the training phase, and a decoder for the inference/prediction phase. These two decoders will share their parameters (so that all the weights and biases that are set during the training phase can be used when we deploy the model).


First, we'll need to define the type of cell we'll be using for our decoder RNNs. We opted for LSTM.

Then, we'll need to hook up a fully connected layer to the output of the decoder. The output of this layer tells us which letter the RNN is choosing to output at each time step.

Let's first look at the inference/prediction decoder. It is the one we'll use when we deploy our model to the wild (even though it comes second in the actual code).

<img src="images/sequence-to-sequence-inference-decoder.png"/>

We'll hand our encoder hidden state to the inference decoder and have it process its output. TensorFlow handles most of the logic for us. [`tf.contrib.seq2seq.simple_decoder_fn_inference`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/simple_decoder_fn_inference) and [`tf.contrib.seq2seq.dynamic_rnn_decoder`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/dynamic_rnn_decoder) are one way to build it; the code in this notebook instead calls `tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq` with `feed_previous=True`.

Notice that the inference decoder feeds the output of each time step as an input to the next.
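The feedback loop can be illustrated without TensorFlow. In the sketch below, `toy_step` is a made-up stand-in for one step of a trained decoder, and stopping at `</s>` is an assumption of the sketch (the model in this notebook always decodes a fixed number of steps).

```python
def greedy_decode(step_fn, state, start_token, end_token, max_steps):
    """Run `step_fn` repeatedly, feeding each output back in as the next input."""
    outputs = []
    token = start_token
    for _ in range(max_steps):
        token, state = step_fn(token, state)
        if token == end_token:
            break
        outputs.append(token)
    return outputs

# Toy stand-in for a trained decoder step: the next token depends only on the
# previous one, and the state is passed through untouched.
NEXT_TOKEN = {'<s>': 'a', 'a': 'b', 'b': 'c', 'c': '</s>'}

def toy_step(prev_token, state):
    return NEXT_TOKEN[prev_token], state

decoded = greedy_decode(toy_step, state=None, start_token='<s>',
                        end_token='</s>', max_steps=7)
print(decoded)
```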

As for the training decoder, we can think of it as looking like this:
<img src="images/sequence-to-sequence-training-decoder.png"/>

The training decoder **does not** feed the output of each time step to the next. Rather, the inputs to the decoder time steps are the target sequence from the training dataset (the orange letters).
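In terms of data, the training decoder's inputs are the target sequence shifted right by one step, with `<s>` in the first position. A minimal sketch in plain Python, assuming `<s>` has id 2 (it is the third special token); `make_decoder_inputs` is a hypothetical name.

```python
def make_decoder_inputs(target_ids, go_id):
    """Prepend `go_id` to each target sequence and drop its last id."""
    return [[go_id] + seq[:-1] for seq in target_ids]

targets_example = [[28, 13, 26, 26, 18, 0, 0], [12, 6, 15, 0, 0, 0, 0]]
decoder_inputs = make_decoder_inputs(targets_example, go_id=2)
print(decoder_inputs)
```

At time step `t` the decoder therefore sees the correct letter from step `t - 1`, whatever it predicted itself.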

## Model
#### Check the Version of TensorFlow
This checks that TensorFlow 1.0 or newer is installed. Note that the `tf.contrib` modules used below were removed in TensorFlow 2.x.

In [6]:
from distutils.version import LooseVersion
import tensorflow as tf
tf.reset_default_graph()
# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.0.1


### Hyperparameters

In [7]:
# Number of Epochs
epochs = 60
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 50
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 13
decoding_embedding_size = 13
# Learning Rate
learning_rate = 0.001
# Keep Probability
keep_prob = 1

### Input

In [8]:
input_data = tf.placeholder(tf.int32, [batch_size, sequence_length])
targets = tf.placeholder(tf.int32, [batch_size, sequence_length])
lr = tf.placeholder(tf.float32)

In [9]:
input_data

<tf.Tensor 'Placeholder:0' shape=(128, 7) dtype=int32>

In [10]:
targets

<tf.Tensor 'Placeholder_1:0' shape=(128, 7) dtype=int32>

In [11]:
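# Slice each time step into its own tensor
def slice_inputs(inputs, length):
    # Split [batch, length] into `length` tensors of shape [batch, 1],
    # then drop the size-1 axis so each tensor has shape [batch]
    inputs = tf.split(inputs, length, 1)
    inputs = [tf.squeeze(input_, [1]) for input_ in inputs]
    return inputs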
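The effect of this slicing is easier to see on nested lists: a batch laid out as `[batch, time]` becomes a list with one entry per time step, each holding that step's value for every sequence. A plain-Python sketch of the same rearrangement; `slice_time_steps` is a hypothetical name.

```python
def slice_time_steps(batch):
    """Turn a [batch][time] nested list into a list of per-time-step rows."""
    return [list(step) for step in zip(*batch)]

batch = [[13, 18, 28], [12, 6, 15]]   # 2 sequences, 3 time steps
steps = slice_time_steps(batch)
print(steps)
```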

In [12]:
# slice inputs
inputs = slice_inputs(input_data,sequence_length)
# Process decoder input: drop the last target id, prepend <s>, then slice per time step
ending = tf.strided_slice(targets, [0, 0], [batch_size, -1], [1, 1]) 
dec_input = tf.concat([tf.fill([batch_size, 1], target_letter_to_int['<s>']), ending], 1)
dec_input = slice_inputs(dec_input,sequence_length)
#slice targets
labels = slice_inputs(targets,sequence_length)

In [13]:
# define lstm cell
def basic_cell(rnn_size,keep_prob):  
    lstm = tf.contrib.rnn.DropoutWrapper(
                    tf.contrib.rnn.BasicLSTMCell(rnn_size, state_is_tuple=True),
                    output_keep_prob=keep_prob)
    return lstm

stacked_lstm = tf.contrib.rnn.MultiRNNCell([basic_cell(rnn_size,keep_prob) for _ in range(num_layers)], state_is_tuple=True)

In [14]:
source_vocab_size = len(source_letter_to_int)
target_vocab_size = len(target_letter_to_int)

In [15]:
with tf.variable_scope('decoder') as scope:
    # build the seq2seq model 
    #  inputs : encoder, decoder inputs, LSTM cell type, vocabulary sizes, embedding dimensions
    decode_outputs, decode_states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(inputs,dec_input, stacked_lstm,
                                        source_vocab_size, target_vocab_size, encoding_embedding_size)
    # share parameters
    scope.reuse_variables()
    # testing model, where output of previous timestep is fed as input 
    #  to the next timestep
    decode_outputs_test,decode_states_test = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(inputs,dec_input, stacked_lstm,
                                        source_vocab_size, target_vocab_size, encoding_embedding_size,feed_previous=True)

In [16]:
decode_outputs[0]

<tf.Tensor 'decoder/embedding_rnn_seq2seq/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper/BiasAdd:0' shape=(128, 30) dtype=float32>

### Optimization
Our loss function is `tf.contrib.legacy_seq2seq.sequence_loss`, the list-based counterpart of [`tf.contrib.seq2seq.sequence_loss`](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss) provided by the TensorFlow seq2seq module. It calculates a weighted cross-entropy loss for the output logits.
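To make "weighted cross-entropy" concrete, here is a simplified pure-Python sketch. It assumes the loss is averaged over time steps (weighted) and then over the batch; it is an illustration of the idea, not TensorFlow's implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one logit vector against an integer class label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

def sequence_loss(logits, labels, weights):
    """Weighted cross-entropy, averaged over time steps and then over the batch.

    logits[t][b] is a list of class scores, labels[t][b] an int,
    weights[t][b] a float (0.0 would exclude that position).
    """
    num_steps, batch_size = len(labels), len(labels[0])
    per_example = []
    for b in range(batch_size):
        weighted = sum(weights[t][b] * softmax_cross_entropy(logits[t][b], labels[t][b])
                       for t in range(num_steps))
        total_weight = sum(weights[t][b] for t in range(num_steps))
        per_example.append(weighted / (total_weight + 1e-12))
    return sum(per_example) / batch_size

# Two time steps, one example, three classes, uniform (all-zero) logits
logits = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
labels = [[1], [2]]
weights = [[1.0], [1.0]]
loss = sequence_loss(logits, labels, weights)
print(loss)
```

With uniform logits the model is guessing among three classes, so the loss comes out at approximately `log(3)`.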

In [17]:
# Loss function
loss_weights = [ tf.ones_like(label, dtype=tf.float32) for label in labels ]
cost = tf.contrib.legacy_seq2seq.sequence_loss(decode_outputs,labels,loss_weights)

# Optimizer
optimizer = tf.train.AdamOptimizer(lr)

# Gradient Clipping
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)
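Clipping by value clamps each gradient component into a fixed range independently of the others. A small sketch of that operation on a plain list:

```python
def clip_by_value(values, low, high):
    """Clamp every element of `values` into the range [low, high]."""
    return [max(low, min(high, v)) for v in values]

gradient = [-3.2, -0.4, 0.0, 0.7, 5.0]
clipped = clip_by_value(gradient, -1.0, 1.0)
print(clipped)
```

Because each component is clamped on its own, the direction of the gradient vector can change, unlike clipping by norm, which rescales the whole vector.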

In [19]:
decode_outputs_test

[<tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper/BiasAdd:0' shape=(128, 30) dtype=float32>,
 <tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper_1/BiasAdd:0' shape=(128, 30) dtype=float32>,
 <tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper_2/BiasAdd:0' shape=(128, 30) dtype=float32>,
 <tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper_3/BiasAdd:0' shape=(128, 30) dtype=float32>,
 <tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper_4/BiasAdd:0' shape=(128, 30) dtype=float32>,
 <tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapper_5/BiasAdd:0' shape=(128, 30) dtype=float32>,
 <tf.Tensor 'decoder/embedding_rnn_seq2seq_1/embedding_rnn_decoder/rnn_decoder/output_projection_wrapp

In [20]:
labels

[<tf.Tensor 'Squeeze_14:0' shape=(128,) dtype=int32>,
 <tf.Tensor 'Squeeze_15:0' shape=(128,) dtype=int32>,
 <tf.Tensor 'Squeeze_16:0' shape=(128,) dtype=int32>,
 <tf.Tensor 'Squeeze_17:0' shape=(128,) dtype=int32>,
 <tf.Tensor 'Squeeze_18:0' shape=(128,) dtype=int32>,
 <tf.Tensor 'Squeeze_19:0' shape=(128,) dtype=int32>,
 <tf.Tensor 'Squeeze_20:0' shape=(128,) dtype=int32>]

In [23]:
len(target_batch[0])

7

## Train
We're now ready to train our model. If you run into OOM (out of memory) issues during training, try decreasing `batch_size`.
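The training loop relies on `helper.batch_data`, which is not shown in this notebook. Since the placeholders have a fixed shape of `[batch_size, sequence_length]`, every batch has to contain exactly `batch_size` sequences. The following is a guess at what such a helper could look like, assuming it simply drops a trailing partial batch; the real implementation may differ.

```python
def batch_data(source, target, batch_size):
    """Yield (source_batch, target_batch) pairs, skipping a final partial batch."""
    for start in range(0, len(source) - batch_size + 1, batch_size):
        end = start + batch_size
        yield source[start:end], target[start:end]

# Ten toy sequences with a batch size of 4: two full batches, two items left over
source = [[i] for i in range(10)]
target = [[-i] for i in range(10)]
batches = list(batch_data(source, target, batch_size=4))
print(len(batches))
```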

In [24]:
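def evel_accuracy(logits_list):
    # logits_list is [time][batch][vocab]; reorder it to [batch][time][vocab]
    logits = np.array(logits_list).transpose(1,0,2)
    # Pick the highest-scoring letter id at every position
    pred = np.argmax(logits,2)
    return pred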
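The same transformation in plain Python, for a toy batch of two sequences, two time steps and a vocabulary of two ids. `predictions_from_logits` is a hypothetical name; the logic mirrors the transpose and argmax above.

```python
def predictions_from_logits(logits_list):
    """Convert [time][batch][vocab] logits into [batch][time] predicted ids."""
    # Argmax over the vocabulary axis, still laid out as [time][batch]
    time_major = [[max(range(len(scores)), key=scores.__getitem__) for scores in step]
                  for step in logits_list]
    # Transpose to [batch][time] so each row is one predicted sequence
    return [list(example) for example in zip(*time_major)]

toy_logits = [
    [[0.1, 0.9], [0.8, 0.2]],   # time step 0: scores for sequence 0 and sequence 1
    [[0.7, 0.3], [0.4, 0.6]],   # time step 1
]
toy_predictions = predictions_from_logits(toy_logits)
print(toy_predictions)
```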

In [31]:
import numpy as np

sess = tf.InteractiveSession()

train_source = source_ids[batch_size:]
train_target = target_ids[batch_size:]

valid_source = source_ids[:batch_size]
valid_target = target_ids[:batch_size]

sess.run(tf.global_variables_initializer())

for epoch_i in range(epochs):
    for batch_i, (source_batch, target_batch) in enumerate(
            helper.batch_data(train_source, train_target, batch_size)):
        _, loss = sess.run(
            [train_op, cost],
            {input_data: source_batch, targets: target_batch, lr: learning_rate})
    batch_train_logits = sess.run(
        decode_outputs_test,
        {input_data: source_batch, targets: target_batch})
    batch_valid_logits = sess.run(
        decode_outputs_test,
        {input_data: valid_source, targets: valid_target})

    train_acc = np.mean(np.equal(target_batch, evel_accuracy(batch_train_logits)))
    valid_acc = np.mean(np.equal(valid_target, evel_accuracy(batch_valid_logits)))
    print('Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>6.3f}, Validation Accuracy: {:>6.3f}, Loss: {:>6.3f}'
          .format(epoch_i, batch_i, len(source_ids) // batch_size, train_acc, valid_acc, loss))


Epoch   0 Batch   76/78 - Train Accuracy:  0.425, Validation Accuracy:  0.473, Loss:  2.089
Epoch   1 Batch   76/78 - Train Accuracy:  0.494, Validation Accuracy:  0.533, Loss:  1.638
Epoch   2 Batch   76/78 - Train Accuracy:  0.584, Validation Accuracy:  0.619, Loss:  1.327
Epoch   3 Batch   76/78 - Train Accuracy:  0.636, Validation Accuracy:  0.664, Loss:  1.160
Epoch   4 Batch   76/78 - Train Accuracy:  0.662, Validation Accuracy:  0.680, Loss:  1.047
Epoch   5 Batch   76/78 - Train Accuracy:  0.690, Validation Accuracy:  0.705, Loss:  0.923
Epoch   6 Batch   76/78 - Train Accuracy:  0.706, Validation Accuracy:  0.723, Loss:  0.900
Epoch   7 Batch   76/78 - Train Accuracy:  0.734, Validation Accuracy:  0.747, Loss:  0.790
Epoch   8 Batch   76/78 - Train Accuracy:  0.729, Validation Accuracy:  0.751, Loss:  0.722
Epoch   9 Batch   76/78 - Train Accuracy:  0.742, Validation Accuracy:  0.744, Loss:  0.691
Epoch  10 Batch   76/78 - Train Accuracy:  0.752, Validation Accuracy:  0.790, L
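Note that the accuracy above compares every position, including `<pad>`, so correctly predicted padding counts toward the score. A sketch of a variant that can skip padded positions, which may give a more conservative picture of how well the letters themselves are sorted; `token_accuracy` is a hypothetical name.

```python
def token_accuracy(predicted, expected, pad_id=None):
    """Fraction of matching positions; positions where `expected` is `pad_id` are skipped."""
    matches = total = 0
    for pred_seq, exp_seq in zip(predicted, expected):
        for pred, exp in zip(pred_seq, exp_seq):
            if pad_id is not None and exp == pad_id:
                continue
            total += 1
            matches += int(pred == exp)
    return matches / total if total else 0.0

expected = [[14, 10, 22, 0, 0]]
predicted = [[14, 22, 10, 0, 0]]
with_pads = token_accuracy(predicted, expected)               # 3 of 5 positions match
without_pads = token_accuracy(predicted, expected, pad_id=0)  # 1 of 3 letters match
print(with_pads, without_pads)
```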

## Prediction

In [45]:
input_sentence = 'hello'


input_sentence = [source_letter_to_int.get(word, source_letter_to_int['<unk>']) for word in input_sentence.lower()]
input_sentence = input_sentence + [source_letter_to_int['<pad>']] * (sequence_length - len(input_sentence))
batch_shell = np.zeros((batch_size, sequence_length))
batch_shell[0] = input_sentence
chatbot_logits = sess.run(decode_outputs_test, {input_data: batch_shell,targets: batch_shell})
chatbot_logits = np.argmax(np.array(chatbot_logits).transpose(1,0,2),2)[0,:]

print('Input')
print('  Word Ids:      {}'.format([i for i in input_sentence]))
print('  Input Words: {}'.format([source_int_to_letter[i] for i in input_sentence]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in chatbot_logits]))
print('  Chatbot Answer Words: {}'.format([target_int_to_letter[i] for i in chatbot_logits]))

Input
  Word Ids:      [10, 14, 22, 22, 25, 0, 0]
  Input Words: ['h', 'e', 'l', 'l', 'o', '<pad>', '<pad>']

Prediction
  Word Ids:      [14, 10, 22, 22, 25, 0, 0]
  Chatbot Answer Words: ['e', 'h', 'l', 'l', 'o', '<pad>', '<pad>']
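To turn the predicted ids back into a readable string, the padding can be dropped and the remaining letters joined. A small sketch using only the ids that appear in the output above; `ids_to_text` is a hypothetical name.

```python
def ids_to_text(ids, int_to_letter, pad_token='<pad>'):
    """Join the letters for `ids`, leaving out padding."""
    letters = (int_to_letter[i] for i in ids)
    return ''.join(letter for letter in letters if letter != pad_token)

# Only the ids that appear in the output above; the full mapping is larger
int_to_letter = {0: '<pad>', 10: 'h', 14: 'e', 22: 'l', 25: 'o'}
sorted_word = ids_to_text([14, 10, 22, 22, 25, 0, 0], int_to_letter)
print(sorted_word)
```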
