### This is a re-implementation of the sequence-to-sequence learning assignment

Data sources:
 - **./data/letters_source.txt**: random strings of letters, one per line
 - **./data/letters_target.txt**: the sorted letters of the corresponding line in **letters_source.txt**
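
The script that produced these files is not shown, so the following is only a minimal sketch of how such source/target pairs could be generated. The helper name `make_pair`, the lowercase alphabet, and the length range of 1 to 7 are assumptions made for illustration.

```python
import random
import string

def make_pair(rng, min_len=1, max_len=7):
    """Return (source, target): a random lowercase string and its sorted form."""
    length = rng.randint(min_len, max_len)
    source = ''.join(rng.choice(string.ascii_lowercase) for _ in range(length))
    target = ''.join(sorted(source))
    return source, target

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = [make_pair(rng) for _ in range(5)]
for source, target in pairs:
    print(source, '->', target)
```

Each target is simply the sorted characters of its source line, which is the mapping the model is asked to learn.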


### Sequence to Sequence

The decoder is probably the most complex part of this model. We need to declare a decoder for the training phase, and a decoder for the inference/prediction phase. These two decoders will share their parameters (so that all the weights and biases that are set during the training phase can be used when we deploy the model).

First, we'll need to define the type of cell we'll be using for our decoder RNNs. We opted for LSTM.
Then, we'll need to hook up a fully connected layer to the output of the decoder. The output of this layer tells us which character the RNN chooses to output at each time step.

Let's first look at the inference/prediction decoder. It is the one we'll use when we deploy our model to the wild (even though it comes second in the actual code).

<img src="images/sequence-to-sequence-inference-decoder.png">

We'll hand our encoder hidden state to the inference decoder and have it process its output. TensorFlow handles most of the logic for us. We just have to use
**tf.contrib.seq2seq.simple_decoder_fn_inference** and **tf.contrib.seq2seq.dynamic_rnn_decoder** and supply them with the appropriate inputs.

Notice that the inference decoder feeds the output of each time step as an input to the next.

As for the training decoder, we can think of it as looking like this:

<img src="images/sequence-to-sequence-training-decoder.png">

The training decoder does not feed the output of each time step to the next. Rather, the inputs to the decoder time steps are the target sequence from the training dataset (the orange letters).
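
The difference between the two decoders can be sketched in plain Python. This is an illustration only: `toy_step` is a hypothetical stand-in for one decoder RNN step (it just emits the next letter of the alphabet), not the TensorFlow decoder used in this notebook.

```python
def toy_step(prev_token):
    """Hypothetical decoder step: map the previous token to a prediction."""
    if prev_token == '<s>':
        return 'a'
    return chr((ord(prev_token) - ord('a') + 1) % 26 + ord('a'))

def decode_training(target, start='<s>'):
    """Training: step t receives the ground-truth token from step t-1."""
    inputs = [start] + list(target[:-1])
    outputs = [toy_step(token) for token in inputs]
    return inputs, outputs

def decode_inference(num_steps, start='<s>'):
    """Inference: step t receives the model's own prediction from step t-1."""
    token, inputs, outputs = start, [], []
    for _ in range(num_steps):
        inputs.append(token)
        token = toy_step(token)
        outputs.append(token)
    return inputs, outputs

train_inputs, train_outputs = decode_training('bqv')
infer_inputs, infer_outputs = decode_inference(3)
print(train_inputs, train_outputs)
print(infer_inputs, infer_outputs)
```

In the training case the inputs come from the target sequence regardless of what was predicted; in the inference case each input is the previous output.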

**Encoding**
 - Embed the input data using tf.contrib.layers.embed_sequence
 - Pass the embedded input into a stack of RNNs. Save the RNN state and ignore the output.

In [1]:
def load_data(path):
    with open(path, 'r') as f:
        data = f.read()
    return data

In [2]:
letters_source = load_data('./data/letters_source.txt').split('\n')

In [3]:
letters_source[:5]

['bsaqq', 'npy', 'lbwuj', 'bqv', 'kial']

In [4]:
letters_targets = load_data('./data/letters_target.txt').split('\n')

In [5]:
letters_targets[:5]

['abqqs', 'npy', 'bjluw', 'bqv', 'aikl']

#### Goal: Preprocess the data by turning the characters into ints

In [6]:
def extract_character_vocab(data):
    special_characters = ['<pad>', '<unk>', '<s>', '</s>']
    set_words = set([word for line in data
                          for word in line])
    vocab_to_int = {word : idx for idx, word in enumerate(special_characters + list(set_words))}
    int_to_vocab = {idx : word for word, idx in vocab_to_int.items()}
    return int_to_vocab, vocab_to_int

src_int_to_voc, src_voc_to_int = extract_character_vocab(letters_source)
tar_int_to_voc, tar_voc_to_int = extract_character_vocab(letters_targets)

src_ids = [[src_voc_to_int.get(ch, src_voc_to_int['<unk>']) for ch in line] for line in letters_source]
tar_ids = [[tar_voc_to_int.get(ch, tar_voc_to_int['<unk>']) for ch in line] for line in letters_targets]

print("Example source sequence")
print(src_ids[:3])
print("\n")
print("Example target sequence")
print(tar_ids[:3])

Example source sequence
[[19, 13, 20, 22, 22], [8, 21, 11], [27, 19, 18, 23, 15]]


Example target sequence
[[20, 19, 22, 22, 13], [8, 21, 11], [19, 15, 27, 23, 18]]
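
The ids above depend on the iteration order of a `set`, which can differ between Python processes because string hashing is randomized by default. Below is a minimal sketch of a repeatable variant, assuming that sorting the characters is acceptable. `build_vocab` and `SPECIAL` are hypothetical names, not part of this notebook.

```python
SPECIAL = ['<pad>', '<unk>', '<s>', '</s>']

def build_vocab(lines):
    """Build id mappings with a stable order: special tokens, then sorted characters."""
    chars = sorted({ch for line in lines for ch in line})
    vocab_to_int = {tok: idx for idx, tok in enumerate(SPECIAL + chars)}
    int_to_vocab = {idx: tok for tok, idx in vocab_to_int.items()}
    return int_to_vocab, vocab_to_int

int_to_vocab, vocab_to_int = build_vocab(['bsaqq', 'npy'])
encoded = [vocab_to_int.get(ch, vocab_to_int['<unk>']) for ch in 'bsaqq']
decoded = ''.join(int_to_vocab[i] for i in encoded)
print(encoded, decoded)
```

A stable mapping matters once a trained checkpoint is reloaded in a new session, since the embedding rows are tied to specific ids.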


#### Apply Padding to the inputs and targets based on the longest sequence

In [7]:
seq_len = max(max(map(len, src_ids)), max(map(len, tar_ids)))

In [8]:
src_ids = [line + [src_voc_to_int['<pad>']]*(seq_len - len(line)) for line in src_ids ]
tar_ids = [line + [tar_voc_to_int['<pad>']]*(seq_len - len(line)) for line in tar_ids ]

print("Sequence Length:", seq_len)
print("\n")
print("Input sequence example")
print(src_ids[:3])
print("\n")
print("Target sequence example")
print(tar_ids[:3])

Sequence Length: 7


Input sequence example
[[19, 13, 20, 22, 22, 0, 0], [8, 21, 11, 0, 0, 0, 0], [27, 19, 18, 23, 15, 0, 0]]


Target sequence example
[[20, 19, 22, 22, 13, 0, 0], [8, 21, 11, 0, 0, 0, 0], [19, 15, 27, 23, 18, 0, 0]]


In [9]:
for line in src_ids[:3]:
    print(list(map(src_int_to_voc.get, line)))
    
# decode the targets with the target vocabulary, not the source one
for line in tar_ids[:3]:
    print(list(map(tar_int_to_voc.get, line)))

['b', 's', 'a', 'q', 'q', '<pad>', '<pad>']
['n', 'p', 'y', '<pad>', '<pad>', '<pad>', '<pad>']
['l', 'b', 'w', 'u', 'j', '<pad>', '<pad>']
['a', 'b', 'q', 'q', 's', '<pad>', '<pad>']
['n', 'p', 'y', '<pad>', '<pad>', '<pad>', '<pad>']
['b', 'j', 'l', 'u', 'w', '<pad>', '<pad>']


## Model

Check the TensorFlow version.

In [10]:
from distutils.version import LooseVersion
import tensorflow as tf

tf.reset_default_graph()

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.0.0


### Hyperparameters copied from the original implementation

In [11]:

# Number of Epochs
epochs = 60
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 50
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 13
decoding_embedding_size = 13
# Learning Rate
learning_rate = 0.001

In [12]:
inputs = tf.placeholder(tf.int32, shape=[batch_size, seq_len], name="inputs")
targets = tf.placeholder(tf.int32, shape=[batch_size, seq_len], name="targets")

In [13]:
src_vocab_size = len(src_voc_to_int)

#encoder embedding
with tf.name_scope("encoder_embedding"):
    enc_embedding = tf.contrib.layers.embed_sequence(inputs, vocab_size=src_vocab_size, embed_dim=encoding_embedding_size)

# encoder
with tf.name_scope("encoder_rnn"):
    enc_cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.BasicLSTMCell(rnn_size)] * num_layers)
    _, enc_state = tf.nn.dynamic_rnn(cell = enc_cell, inputs= enc_embedding, dtype=tf.float32)

### create decoding input

This takes the target sequences, drops the last token from the end of each one, and prepends the start tag `<s>` to the beginning.

In [14]:
import numpy as np

with tf.name_scope("decoder_input"):
    ending = tf.strided_slice(targets, begin = [0,0], end=[batch_size, -1],
                              strides = [1,1], name="strided_slice") # removes the last column
    dec_input = tf.concat([tf.fill(dims=(batch_size, 1), value = tar_voc_to_int['<s>']), ending], 1)

demo_outputs = np.reshape(range(batch_size * seq_len), (batch_size, seq_len))

sess = tf.InteractiveSession()

print("Targets")
print(demo_outputs[:2])
print("\n")
print('Ending')
print(sess.run(ending, {targets: demo_outputs})[:2])
print("\n")
print('dec_input')
print(sess.run(dec_input, {targets: demo_outputs})[:2])


Targets
[[ 0  1  2  3  4  5  6]
 [ 7  8  9 10 11 12 13]]


Ending
[[ 0  1  2  3  4  5]
 [ 7  8  9 10 11 12]]


dec_input
[[ 2  0  1  2  3  4  5]
 [ 2  7  8  9 10 11 12]]
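
The same shift can be reproduced in plain Python, which may make the demo output above easier to check by hand. This is a sketch only; `make_decoder_input` is a hypothetical helper, and the start id of 2 is taken from the demo output.

```python
def make_decoder_input(target_ids, start_id):
    """Drop the last id of each row and put start_id in front."""
    return [[start_id] + row[:-1] for row in target_ids]

targets_demo = [[0, 1, 2, 3, 4, 5, 6],
                [7, 8, 9, 10, 11, 12, 13]]
dec_input_demo = make_decoder_input(targets_demo, start_id=2)
print(dec_input_demo)
```

Every row keeps its original length, so the decoder input and the targets stay aligned step by step.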


### Decoding
 - embed the decoding input
 - build the decoding rnns
 - build the output in decoding scope so it can be shared between the training and inference decoders

In [15]:
tar_vocab_size = len(tar_voc_to_int)

# decoder embedding 
# need a separate decoder to handle decoding each vector 
with tf.name_scope("decoder_embedding"):
    dec_embeddings = tf.Variable(tf.random_uniform([tar_vocab_size, decoding_embedding_size]), name="dec_embed")
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

# decoder rnns
with tf.name_scope("decoder_rnns"):
    dec_cell = tf.contrib.rnn.MultiRNNCell([tf.contrib.rnn.BasicLSTMCell(rnn_size)] * num_layers)

with tf.variable_scope("decoding") as decoding_scope:
    output_fn = lambda x : tf.contrib.layers.fully_connected(inputs = x, 
                                                             num_outputs = tar_vocab_size, 
                                                             activation_fn = None, 
                                                             scope = decoding_scope)

#### Decoder During Training

 - Build the training decoder using tf.contrib.seq2seq.simple_decoder_fn_train and tf.contrib.seq2seq.dynamic_rnn_decoder.
 - Apply the output layer to the output of the training decoder.


In [16]:
with tf.name_scope("train_decoder"):
    with tf.variable_scope("decoding") as decoding_scope:
    
    # training decoder

        train_decoder_fn = tf.contrib.seq2seq.simple_decoder_fn_train(encoder_state = enc_state)
        train_pred, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(cell = dec_cell,
                                                                  decoder_fn = train_decoder_fn,
                                                                  inputs = dec_embed_input,
                                                                  sequence_length = seq_len,
                                                                  scope = decoding_scope,
                                                                  name = "train_decoder"
                                                                 )
        train_logits = output_fn(train_pred) # projects the decoder outputs to logits over tar_vocab_size

#### Decoder During Inference

 - Reuse the weights and biases from the training decoder using tf.variable_scope("decoding", reuse=True)
 - Build the inference decoder using tf.contrib.seq2seq.simple_decoder_fn_inference and tf.contrib.seq2seq.dynamic_rnn_decoder.
  - The output function is applied to the output in this step

In [17]:
with tf.name_scope("inference_decoder"):
    with tf.variable_scope("decoding", reuse=True) as decoding_scope:
    
        # inference decoder

        infer_decoder_fn = tf.contrib.seq2seq.simple_decoder_fn_inference(output_fn=output_fn, 
                                                                          encoder_state = enc_state,
                                                                          embeddings = dec_embeddings,
                                                                          start_of_sequence_id = tar_voc_to_int['<s>'],
                                                                          end_of_sequence_id = tar_voc_to_int['</s>'],
                                                                          maximum_length = seq_len - 1, # I believe this is n - 1 because the first step is assumed, but I am not sure
                                                                          num_decoder_symbols = tar_vocab_size
                                                                         )
        infer_logits, _, _ = tf.contrib.seq2seq.dynamic_rnn_decoder(cell = dec_cell,
                                                                    decoder_fn = infer_decoder_fn,
                                                                    scope = decoding_scope,
                                                                    name = "inference_decoder")

#### Optimization

Our loss function is tf.contrib.seq2seq.sequence_loss, provided by the TensorFlow seq2seq module. It calculates a weighted cross-entropy loss for the output logits.
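
To make "weighted cross-entropy" concrete, here is a pure-Python sketch of a weighted mean of per-step cross-entropy. It illustrates the general idea and is not a copy of TensorFlow's implementation. The function names, `PAD_ID`, and the demo logits are assumptions made for the example.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of one time step: -log softmax(logits)[target]."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[target]

def weighted_sequence_loss(logits, targets, weights):
    """Weighted mean of per-step cross-entropy.
    logits: [batch][time][vocab]; targets and weights: [batch][time]."""
    total, weight_sum = 0.0, 0.0
    for seq_logits, seq_targets, seq_weights in zip(logits, targets, weights):
        for step_logits, target, weight in zip(seq_logits, seq_targets, seq_weights):
            total += weight * softmax_cross_entropy(step_logits, target)
            weight_sum += weight
    return total / weight_sum

PAD_ID = 0
logits = [[[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]]
targets = [[1, 2, PAD_ID]]
all_ones = [[1.0, 1.0, 1.0]]  # every position counts, padding included
pad_mask = [[0.0 if t == PAD_ID else 1.0 for t in row] for row in targets]
loss_all = weighted_sequence_loss(logits, targets, all_ones)
loss_masked = weighted_sequence_loss(logits, targets, pad_mask)
print(loss_all, loss_masked)
```

With weights of all ones, padded positions contribute to the loss like any other position. A 0/1 mask is one common way to exclude them.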

In [18]:
# Loss Function

cost = tf.contrib.seq2seq.sequence_loss(
    logits = train_logits, 
    targets = targets, 
    weights = tf.ones([batch_size, seq_len]),
    name = "cost_fn"
)

# Optimizer
#Gradient Clipping
with tf.name_scope("AdamOptimizerClipping"):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    grads = optimizer.compute_gradients(cost)
    clipped_grads = [(tf.clip_by_value(t = grad, clip_value_min = -1.0, clip_value_max=1.0), var) for grad, var in grads if grad is not None]
    train_op = optimizer.apply_gradients(clipped_grads)
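
The effect of clipping by value can be seen on a small list of numbers. This is a plain-Python sketch of the element-wise operation, with a hypothetical helper name and made-up gradient values.

```python
def clip_by_value(values, clip_min, clip_max):
    """Clamp every element independently into [clip_min, clip_max]."""
    return [min(max(v, clip_min), clip_max) for v in values]

grad = [-3.5, -0.2, 0.0, 0.7, 12.0]
clipped = clip_by_value(grad, -1.0, 1.0)
print(clipped)
```

Because each element is clamped on its own, large components shrink while small ones are untouched, so the direction of the gradient vector can change.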


In [19]:
#Performance KPIs
with tf.name_scope("kpis"):
    correct_pred = tf.equal(targets, tf.cast(tf.argmax(infer_logits, 2), tf.int32))
    acc = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

    tf.summary.scalar("acc", acc)
    tf.summary.scalar("cost", cost)
    # merge only after every summary is registered, so "cost" is included
    merged = tf.summary.merge_all()
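
The accuracy above compares every position, including padded ones. The sketch below shows how much that can matter on a single padded row; the `accuracy` helper and the demo rows are hypothetical and only illustrate the metric, not the TensorFlow op.

```python
def accuracy(targets, predictions, pad_id=None):
    """Fraction of matching positions; positions whose target is pad_id are skipped."""
    correct = counted = 0
    for target_row, pred_row in zip(targets, predictions):
        for target, pred in zip(target_row, pred_row):
            if pad_id is not None and target == pad_id:
                continue
            counted += 1
            correct += int(target == pred)
    return correct / counted

targets_demo = [[8, 21, 11, 0, 0, 0, 0]]
all_pad_prediction = [[0, 0, 0, 0, 0, 0, 0]]
acc_with_pads = accuracy(targets_demo, all_pad_prediction)
acc_without_pads = accuracy(targets_demo, all_pad_prediction, pad_id=0)
print(acc_with_pads, acc_without_pads)
```

A prediction consisting only of padding scores 4/7 on this row when pads are counted and 0 when they are ignored, so the unmasked number should be read with that in mind.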


### Train

We're now ready to train our model. If you run into OOM (out of memory) issues during training, try to decrease the batch_size.

In [20]:
len(src_ids), len(tar_ids)

(10000, 10000)

In [21]:
def batch_data(src_ids, tar_ids, batch_size):
    num_batches = len(src_ids)//batch_size
    for batch_idx in range(0, num_batches * batch_size, batch_size):
        src_batch = src_ids[batch_idx : batch_idx + batch_size]
        tar_batch = tar_ids[batch_idx : batch_idx + batch_size]
        yield src_batch, tar_batch
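
Since the placeholders have a fixed shape of [batch_size, seq_len], batch_data only yields full batches and drops any remainder. The sketch below repeats the function so it runs on its own and uses dummy data. The size of 9872 assumes 10000 lines minus one held-out batch of 128.

```python
def batch_data(src_ids, tar_ids, batch_size):
    """Yield consecutive full batches; a trailing partial batch is dropped."""
    num_batches = len(src_ids) // batch_size
    for batch_idx in range(0, num_batches * batch_size, batch_size):
        yield (src_ids[batch_idx:batch_idx + batch_size],
               tar_ids[batch_idx:batch_idx + batch_size])

num_examples, batch_size = 9872, 128  # assumed: 10000 lines minus 128 held out
dummy = list(range(num_examples))
batches = list(batch_data(dummy, dummy, batch_size))
dropped = num_examples - len(batches) * batch_size
print(len(batches), dropped)
```

Under these assumptions there are 77 batches per epoch and the last 16 examples are never used for training.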

In [25]:
import os
from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
ROOT_LOGDIR = "logs"
LOGDIR = "{}/run-{}".format(ROOT_LOGDIR, now)
graph_writer = tf.summary.FileWriter(LOGDIR, tf.get_default_graph())
train_writer = tf.summary.FileWriter(LOGDIR+"/train")
val_writer = tf.summary.FileWriter(LOGDIR+"/val")

In [26]:

with open(os.path.join(LOGDIR, 'metadata.tsv'), 'w') as f:
    for i in range(len(tar_int_to_voc)):
        f.write( tar_int_to_voc[i] + "\n")

In [27]:
from tensorflow.contrib.tensorboard.plugins import projector

# Format: tensorflow/contrib/tensorboard/plugins/projector/projector_config.proto
config = projector.ProjectorConfig()

# You can add multiple embeddings. Here we add only one.
embedding = config.embeddings.add()
embedding.tensor_name = dec_embeddings.name
# Link this tensor to its metadata file (e.g. labels).
embedding.metadata_path = os.path.join(LOGDIR, 'metadata.tsv')

In [29]:
import numpy as np

train_src = src_ids[batch_size:]
train_tar = tar_ids[batch_size:]

valid_src = src_ids[:batch_size]
valid_tar = tar_ids[:batch_size]

saver = tf.train.Saver()
#file_writer = tf.summary.FileWriter("logs", tf.get_default_graph())
sess.run(tf.global_variables_initializer())
step = 0
#epochs = 1
for epoch_i in range(epochs):
    for batch_i, (src_batch, tar_batch) in enumerate(batch_data(train_src, train_tar, batch_size)):
        _, loss = sess.run([train_op, cost], 
                           feed_dict={inputs : src_batch, targets: tar_batch})
        if (batch_i % 5) == 0:
            #batch_train_logits = sess.run(infer_logits, feed_dict={inputs: src_batch})
            #valid_train_logits = sess.run(infer_logits, feed_dict={inputs: valid_src})
            #print(batch_train_logits)
            #batch_train_logits.shape
            #train_acc = np.mean(np.equal(tar_batch, np.argmax(batch_train_logits, 2)))
            #valid_acc = np.mean(np.equal(valid_tar, np.argmax(valid_train_logits, 2)))
            train_acc, train_acc_summ = sess.run([acc, merged], feed_dict={inputs: src_batch, 
                                                                           targets: tar_batch})
            valid_acc, val_acc_summ = sess.run([acc, merged], feed_dict={inputs: valid_src, 
                                                                       targets: valid_tar})
            print("Epoch {:>3} Batch {:>4}/{} - Train Accuracy: {:>.3f}, Validation Accuracy: {:>.3f}, Loss: {:>6.3f}"
                  .format(epoch_i, batch_i, len(train_src)//batch_size, train_acc, valid_acc, loss))
            
            train_writer.add_summary(train_acc_summ, step)
            val_writer.add_summary(val_acc_summ, step)
        step += 1
saver.save(sess, LOGDIR+"/final_model.ckpt")


Epoch   0 Batch    0/77 - Train Accuracy: 0.028, Validation Accuracy: 0.035, Loss:  3.418
Epoch   0 Batch    5/77 - Train Accuracy: 0.445, Validation Accuracy: 0.449, Loss:  3.304
Epoch   0 Batch   10/77 - Train Accuracy: 0.402, Validation Accuracy: 0.449, Loss:  3.156
Epoch   0 Batch   15/77 - Train Accuracy: 0.402, Validation Accuracy: 0.449, Loss:  2.881
Epoch   0 Batch   20/77 - Train Accuracy: 0.478, Validation Accuracy: 0.449, Loss:  2.324
Epoch   0 Batch   25/77 - Train Accuracy: 0.424, Validation Accuracy: 0.449, Loss:  2.428
Epoch   0 Batch   30/77 - Train Accuracy: 0.442, Validation Accuracy: 0.449, Loss:  2.349
Epoch   0 Batch   35/77 - Train Accuracy: 0.430, Validation Accuracy: 0.449, Loss:  2.301
Epoch   0 Batch   40/77 - Train Accuracy: 0.434, Validation Accuracy: 0.449, Loss:  2.232
Epoch   0 Batch   45/77 - Train Accuracy: 0.424, Validation Accuracy: 0.449, Loss:  2.198
Epoch   0 Batch   50/77 - Train Accuracy: 0.362, Validation Accuracy: 0.451, Loss:  2.321
Epoch   0 

'logs/run-20170423225133/final_model.ckpt'

In [30]:
projector.visualize_embeddings(train_writer, config)

In [31]:
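# batch_train_logits is only assigned in commented-out code above, so compute it here
batch_train_logits = sess.run(infer_logits, feed_dict={inputs: src_batch})
batch_train_logits.argmax(2) == tar_batch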

In [None]:
print(np.array(src_batch)[: 10])
print("\n")
print(np.array(tar_batch)[: 10])
print("\n")
print(batch_train_logits.argmax(2)[:10])


In [None]:
inp_sent = 'rebecca'

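# fall back to the id of <unk> (not the string '<unk>') for unseen characters
inp_sent = [src_voc_to_int.get(c, src_voc_to_int['<unk>']) for c in inp_sent]
inp_sent = inp_sent + [src_voc_to_int['<pad>']] * (seq_len - len(inp_sent))

batch = np.zeros([batch_size, seq_len], dtype=np.int32)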
batch[0] = inp_sent

logit = sess.run(infer_logits, {inputs: batch})
logit = logit[0]

print('Input')
print('  Word Ids:      {}'.format([i for i in inp_sent]))
print('  Input Words: {}'.format([src_int_to_voc[i] for i in inp_sent]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i for i in np.argmax(logit, 1)]))
print('  Predicted Letters: {}'.format([tar_int_to_voc[i] for i in np.argmax(logit, 1)]))
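
To read a prediction as a string, the ids can be mapped back to characters and cut at the first padding or end token. The following is a minimal sketch: `ids_to_text`, the small demo vocabulary, and the id lists are made up for illustration and are not output from the trained model.

```python
def ids_to_text(ids, int_to_vocab, stop_tokens=('<pad>', '</s>')):
    """Join predicted tokens into a string, stopping at the first stop token."""
    chars = []
    for idx in ids:
        token = int_to_vocab[idx]
        if token in stop_tokens:
            break
        chars.append(token)
    return ''.join(chars)

# Hypothetical vocabulary and ids, chosen only to demonstrate the helper.
demo_int_to_vocab = {0: '<pad>', 1: '<unk>', 2: '<s>', 3: '</s>',
                     4: 'a', 5: 'b', 6: 'c', 7: 'e', 8: 'r'}
full_text = ids_to_text([4, 5, 6, 6, 7, 7, 8], demo_int_to_vocab)
short_text = ids_to_text([4, 5, 0, 0], demo_int_to_vocab)
print(full_text, short_text)
```

With the notebook's own mappings, the call would take the argmax ids of one row of the logits together with tar_int_to_voc.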