In [2]:
!wget https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/train.de

!wget https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/train.en


!wget https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/vocab.50K.en

!wget https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/vocab.50K.de


!ls -lrt

#### All the necessary files have been downloaded.
#### Now we will begin building the system

--2018-10-21 18:51:36--  https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/train.de
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 717610118 (684M) [text/plain]
Saving to: ‘train.de’


2018-10-21 18:52:31 (12.6 MB/s) - ‘train.de’ saved [717610118/717610118]

--2018-10-21 18:52:32--  https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/train.en
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 644937323 (615M) [text/plain]
Saving to: ‘train.en’


2018-10-21 18:53:46 (8.31 MB/s) - ‘train.en’ saved [644937323/644937323]

--2018-10-21 18:53:46--  https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/vocab.50K.en
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecti

In [0]:
# Hide all the warnings
import warnings
warnings.filterwarnings('ignore')

In [0]:
%matplotlib inline
import math
import numpy as np
import os
import random
import tensorflow as tf
from matplotlib import pylab
from collections import Counter
import csv

# Seq2Seq Items
import tensorflow.contrib.seq2seq as seq2seq
from tensorflow.python.ops.rnn_cell import LSTMCell
from tensorflow.python.ops.rnn_cell import MultiRNNCell
from tensorflow.contrib.seq2seq.python.ops import attention_wrapper
from tensorflow.python.layers.core import Dense

In [0]:
# Building the dictionary required for word lookup.
# Two dictionaries are built: one from word to key and one from key to word.
# The dictionary is built for both source and target language

src_dictionary = dict()
with open("vocab.50K.de", encoding = "utf-8") as f:
  word_count = 0
  for line in f:
    src_dictionary[line.split("\n")[0]] = word_count 
    word_count +=1

src_reverse_dictionary = dict(zip(src_dictionary.values(), src_dictionary.keys()))

In [6]:
# Check whether the above code worked right. 
# The word 'das' is mapped to 16

print(src_reverse_dictionary[16])
print(src_dictionary['das'])
print("Vocabulary:"+ str(len(src_dictionary)))

das
16
Vocabulary:50000
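The lookup tables above map each vocabulary line to its zero-based line number and back. A minimal sketch of the same idea, assuming a hypothetical four-entry vocabulary held in memory (`toy_vocab_lines` stands in for the lines of `vocab.50K.de` and is not part of the notebook):

```python
# Hypothetical stand-in for the lines of a vocabulary file (one token per line).
toy_vocab_lines = ["<unk>\n", "<s>\n", "</s>\n", "das\n"]

# Token -> id: the id is the zero-based line number.
toy_dictionary = {line.rstrip("\n"): idx for idx, line in enumerate(toy_vocab_lines)}

# Id -> token: invert the mapping.
toy_reverse_dictionary = {idx: token for token, idx in toy_dictionary.items()}

print(toy_dictionary["das"])
print(toy_reverse_dictionary[3])
```

The inversion is only safe because every token appears once in the file, so no two tokens share an id.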


In [0]:
tgt_dictionary = dict()
with open("vocab.50K.en", encoding = "utf-8") as f:
  word_count = 0
  for line in f:
    tgt_dictionary[line.split("\n")[0]] = word_count 
    word_count +=1

tgt_reverse_dictionary = dict(zip(tgt_dictionary.values(), tgt_dictionary.keys()))

In [8]:
# Check whether the above code worked right. 
# The word 'you' is mapped to 28

print(tgt_reverse_dictionary[28])
print(tgt_dictionary['you'])
print("Vocabulary:"+ str(len(tgt_dictionary)))

you
28
Vocabulary:50000


#### Now, let's load the German (source) and the English (target) sentences separately.

In [9]:
source_sent = []
no_sentences_to_be_read = 50000

with open("train.de", encoding = "utf-8") as f:
    #count = 1
    for line in f:
        if len(source_sent) >= no_sentences_to_be_read:
            break
        line = line.split("\n")[0]
        source_sent.append(line)
        #count +=1

print(len(source_sent))

50000


In [0]:
# The first 50 lines of train.de are in English, so drop them.
source_sent = source_sent[50:]

In [11]:
target_sent = []

with open("train.en", encoding = "utf-8") as f:
    #count = 1
    for line in f:
        if len(target_sent) >= no_sentences_to_be_read:
            break
        line = line.split("\n")[0]
        target_sent.append(line)
        #count +=1

print(len(target_sent))

50000


In [0]:
# Drop the same 50 lines from the target so both lists have the same number of sentences.
target_sent = target_sent[50:]

In [0]:
assert(len(source_sent) == len(target_sent))
#len(target_sent)
#len(source_sent)

In [14]:
# Let's print some samples of both the source and target languages
for i in range(0, len(source_sent), 10000):
    print("Source (German):", source_sent[i])
    print("Target (English):", target_sent[i])
    print("\n")

Source (German): Heute verstehen sich QuarkXPress ® 8 , Photoshop ® und Illustrator ® besser als jemals zuvor . Dank HTML und CSS ­ können Anwender von QuarkXPress inzwischen alle Medien bedienen , und das unabhängig von Anwendungen der Adobe ® Creative Suite ® wie Adobe Flash ® ( SWF ) und Adobe Dreamweaver ® .
Target (English): Today , QuarkXPress ® 8 has tighter integration with Photoshop ® and Illustrator ® than ever before , and through standards like HTML and CSS , QuarkXPress users can publish across media both independently and alongside Adobe ® Creative Suite ® applications like Adobe Flash ® ( SWF ) and Adobe Dreamweaver ® .


Source (German): Es existieren Busverbindungen in nahezu jeden Ort der Provence ( eventuell mit Umsteigen in Aix ##AT##-##AT## en ##AT##-##AT## Provence ) , allerdings sollte beachtet werden , dass die letzten Busse abends ca. um 19 Uhr fahren .
Target (English): As always in France those highways are expensive but practical , comfortable and fast .


S

In [0]:
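# Split the sentence into space-separated tokens.
# Tokens that are not in the vocabulary are replaced by the unknown token <unk>.
# Note: the corpus already has spaces around punctuation, so the replace calls
# below create double spaces; split(" ") then yields empty-string tokens, which
# also become <unk>.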
def split_to_token(sentence, is_source):
    sentence = sentence.replace(","," ,")
    sentence = sentence.replace("."," .")
    sentence = sentence.replace("\n"," ")
  
    tokens = sentence.split(" ")
  
    for i in range(len(tokens)):
        if is_source:
            if tokens[i] not in src_dictionary:
                tokens[i] = '<unk>'
    
        else:
            if tokens[i] not in tgt_dictionary:
                tokens[i] = '<unk>'
        
    return tokens
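To see how the tokenizer behaves on text that is already space-separated around punctuation (as the corpus samples are), here is a small sketch. It repeats the same steps as `split_to_token` but takes the vocabulary as an argument; `toy_vocab` and `toy_split_to_token` are hypothetical names used only for this illustration.

```python
# Hypothetical toy vocabulary; the real one is loaded from vocab.50K.en.
toy_vocab = {"<unk>", "Today", ",", "we", "translate", "."}

def toy_split_to_token(sentence, vocab):
    # Same steps as split_to_token above, with the vocabulary passed in.
    sentence = sentence.replace(",", " ,").replace(".", " .").replace("\n", " ")
    return [tok if tok in vocab else "<unk>" for tok in sentence.split(" ")]

# Input that already has spaces around punctuation, like the corpus.
tokens = toy_split_to_token("Today , we translate German .", toy_vocab)
print(tokens)
```

Besides the genuinely unknown word, an extra `<unk>` appears before each `,` and `.` because the double space produces an empty token. Calling `sentence.split()` with no argument would collapse repeated whitespace; that would be a possible change, not what the notebook does, and it would alter the length statistics reported later.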

In [16]:
# Finding the mean, max length of the full data
source_len = []
target_len = []

# The number of source and target samples is the same.
for i in range(len(source_sent)):
    source_len.append(len(split_to_token(source_sent[i], True)))
    target_len.append(len(split_to_token(target_sent[i], False)))

print("Mean length in source (German):" , np.mean(source_len))
print("Mean length in target: (English)" , np.mean(target_len))
print("\n")
print("Max length in source (German):" , np.max(source_len))
print("Max length in target: (English)" , np.max(target_len))


Mean length in source (German): 25.35941941941942
Mean length in target: (English) 27.59089089089089


Max length in source (German): 118
Max length in target: (English) 119


In [17]:
train_inputs = []
train_outputs = []
train_inp_lengths = []
train_out_lengths = []

max_src_length = 40
max_tgt_length = 60

for i in range(len(source_sent)):
    src_tokens = split_to_token(source_sent[i], True)
    target_tokens = split_to_token(target_sent[i], False)
  
    src_sentence_numbers = []
    for token in src_tokens:
        src_sentence_numbers.append(src_dictionary[token])
    
    
    target_sentence_numbers = []
    # Prepend '</s>', which marks the end of the source and the beginning of the target.
    target_sentence_numbers.append(tgt_dictionary['</s>'])
  
    for token in target_tokens:
        target_sentence_numbers.append(tgt_dictionary[token])
   
    # Reverse the source sentence, a trick reported to improve translation
    # quality in seq2seq NMT (Sutskever et al., 2014).
  
    src_sentence_numbers = src_sentence_numbers[::-1]

    # Add the start symbol at the start of source.
    src_sentence_numbers.insert(0, src_dictionary['<s>'])
    train_inp_lengths.append(min(len(src_sentence_numbers)+1,max_src_length))
    # Pad with '</s>' or truncate so that every source has max_src_length ids
    # and every target has max_tgt_length ids.


    if len(src_sentence_numbers) < max_src_length:
        src_sentence_numbers.extend([ src_dictionary['</s>'] for i in range( max_src_length - len(src_sentence_numbers) )])

    elif len(src_sentence_numbers) > max_src_length:
        src_sentence_numbers = src_sentence_numbers[:max_src_length]


    if len(target_sentence_numbers) < max_tgt_length:
        target_sentence_numbers.extend([ tgt_dictionary['</s>'] for i in range( max_tgt_length - len(target_sentence_numbers) )])

    elif len(target_sentence_numbers) > max_tgt_length:
        target_sentence_numbers = target_sentence_numbers[:max_tgt_length]


    if len(src_sentence_numbers) == max_src_length and len(target_sentence_numbers) == max_tgt_length:
        train_inputs.append(src_sentence_numbers)
        train_outputs.append(target_sentence_numbers)

train_inp_lengths = np.array(train_inp_lengths, dtype=np.int32)
print("Total number of source sentences:", len(train_inputs))
print("Total number of target sentences:", len(train_outputs))
print("Length of each source sentence:", len(train_inputs[0]))
print("Length of each target sentence:", len(train_outputs[0]))

Total number of source sentences: 49950
Total number of target sentences: 49950
Length of each source sentence: 40
Length of each target sentence: 60
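The source-side preprocessing above (reverse, prepend the start symbol, then pad or truncate to a fixed length) can be sketched in isolation. The helper name `reverse_and_pad` and the ids used here are assumptions for illustration only.

```python
def reverse_and_pad(ids, start_id, pad_id, max_length):
    """Reverse a source id list, prepend start_id, then pad or truncate to max_length."""
    seq = [start_id] + ids[::-1]
    if len(seq) < max_length:
        seq = seq + [pad_id] * (max_length - len(seq))
    return seq[:max_length]

# Hypothetical ids: 1 = <s>, 2 = </s>; 10 and above are ordinary words.
short = reverse_and_pad([10, 11, 12], start_id=1, pad_id=2, max_length=6)
long = reverse_and_pad([10, 11, 12, 13, 14, 15, 16], start_id=1, pad_id=2, max_length=6)
print(short)
print(long)
```

Note that truncation after reversal drops the beginning of the original sentence, not its end.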


In [0]:
# Convert the source and target to numpy values
train_inputs = np.array(train_inputs, dtype=np.int32)
train_outputs = np.array(train_outputs, dtype=np.int32)

In [19]:
# See some sample source and target sentences.
# The source sentence is reversed
print([ src_reverse_dictionary[i] for i in train_inputs[0] ])
print([ tgt_reverse_dictionary[i] for i in train_outputs[0] ])

print([ src_reverse_dictionary[i] for i in train_inputs[100] ])
print([ tgt_reverse_dictionary[i] for i in train_outputs[100] ])

['<s>', '.', '<unk>', '®', '<unk>', 'Adobe', 'und', ')', 'SWF', '(', '®', 'Flash', 'Adobe', 'wie', '®', 'Suite', 'Creative', '®', 'Adobe', 'der', 'Anwendungen', 'von', 'unabhängig', 'das', 'und', ',', '<unk>', 'bedienen', 'Medien', 'alle', 'inzwischen', 'QuarkXPress', 'von', 'Anwender', 'können', '\xad', 'CSS', 'und', 'HTML', 'Dank']
['</s>', 'Today', '<unk>', ',', 'QuarkXPress', '®', '8', 'has', 'tighter', 'integration', 'with', 'Photoshop', '®', 'and', 'Illustrator', '®', 'than', 'ever', 'before', '<unk>', ',', 'and', 'through', 'standards', 'like', 'HTML', 'and', 'CSS', '<unk>', ',', 'QuarkXPress', 'users', 'can', 'publish', 'across', 'media', 'both', 'independently', 'and', 'alongside', 'Adobe', '®', 'Creative', 'Suite', '®', 'applications', 'like', 'Adobe', 'Flash', '®', '(', 'SWF', ')', 'and', 'Adobe', 'Dreamweaver', '®', '<unk>', '.', '</s>']
['<s>', '.', '<unk>', 'Mittelmeer', 'am', 'Katalonien', 'von', 'Zentrum', 'im', 'mitten', 'Halbinsel', 'iberischen', 'der', 'Nordosten', '

In [0]:
class DataGeneratorMT(object):

    def __init__(self,batch_size,num_unroll,is_source):

        self._batch_size = batch_size
        self._num_unroll = num_unroll
        self._cursor = [0 for offset in range(self._batch_size)]

        self._sent_ids = None

        self._is_source = is_source


    def print(self):
        print(self._cursor)


    def next_batch(self, sent_ids):
        if self._is_source:
            max_sent_length = max_src_length
        else:
            max_sent_length = max_tgt_length

        batch_data = np.zeros((self._batch_size),dtype=np.float32)
        batch_labels = np.zeros((self._batch_size),dtype=np.float32)

        for b in range(self._batch_size):
            sent_id = sent_ids[b]

            if self._is_source:
                sent_text = train_inputs[sent_id]
                batch_data[b] = sent_text[self._cursor[b]]
                batch_labels[b] = sent_text[self._cursor[b] + 1]
                
            else:
                sent_text = train_outputs[sent_id]
                batch_data[b] = sent_text[self._cursor[b]]
                batch_labels[b] = sent_text[self._cursor[b] + 1]

            self._cursor[b] = (self._cursor[b] + 1)%(max_sent_length - 1)

        return batch_data, batch_labels


    def unroll_batches(self, sent_ids):
        # Passing sent_ids=None keeps the sentences from the previous call.
        if sent_ids is not None:
            self._sent_ids = sent_ids

        unroll_data, unroll_labels = [], []
        for i in range(self._num_unroll):
            data, labels = self.next_batch(self._sent_ids)
            unroll_data.append(data)
            unroll_labels.append(labels)

        # Note: these are the source lengths (train_inp_lengths), also when
        # is_source is False; target lengths are never recorded above.
        inp_lengths = train_inp_lengths[self._sent_ids]

        return unroll_data, unroll_labels, self._sent_ids, inp_lengths

In [21]:
ob = DataGeneratorMT(5, 40, True)
data, label, _, _ = ob.unroll_batches([0,1,2,3,4])
#print(data)
#print(label)

count = 0
print('Source data - German')
for dta, lbl in zip(data,label):
    if count > 5:
        break
    print([src_reverse_dictionary[w] for w in dta.tolist()])
    print([src_reverse_dictionary[w] for w in lbl.tolist()])
    print("\n")
    count +=1

Source data - German
['<s>', '<s>', '<s>', '<s>', '<s>']
['.', '.', '.', '.', '.']


['.', '.', '.', '.', '.']
['<unk>', '<unk>', '<unk>', '<unk>', '<unk>']


['<unk>', '<unk>', '<unk>', '<unk>', '<unk>']
['®', 'können', 'lässt', 'bietet', 'nutzen']


['®', 'können', 'lässt', 'bietet', 'nutzen']
['<unk>', 'nutzen', 'erschließen', 'Dateiformat', 'optimal']


['<unk>', 'nutzen', 'erschließen', 'Dateiformat', 'optimal']
['Adobe', 'QuarkXPress', 'Software', '##AT##-##AT##', 'Bilder']


['Adobe', 'QuarkXPress', 'Software', '##AT##-##AT##', 'Bilder']
['und', 'mit', '##AT##-##AT##', 'PSD', 'Ihre']




#### The above output looks fine. 
#### In a seq2seq model, the label at each step is the token that follows the input token. For the sequence [1,2,3,4,5], the inputs [1,2,3,4] are paired with the labels [2,3,4,5].
#### Because of this, we have the code:
##### batch_data[b] = sent_text[self._cursor[b]]
##### batch_labels[b] = sent_text[self._cursor[b] + 1]
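The two lines quoted above pair each input token with the token that follows it. A minimal NumPy sketch of that shift-by-one relationship, using a made-up id sequence:

```python
import numpy as np

# Hypothetical padded sentence of token ids.
sentence = np.array([1, 2, 3, 4, 5])

# At cursor position t the input is token t and the label is token t + 1.
inputs = sentence[:-1]
labels = sentence[1:]

for t, (x, y) in enumerate(zip(inputs, labels)):
    print(t, x, y)
```

This is why the cursor in `next_batch` wraps at `max_sent_length - 1`: the last position has no following token to serve as its label.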

In [22]:
ob = DataGeneratorMT(5, 40, False)
data, label, _, _ = ob.unroll_batches([0,1,2,3,4])
#print(data)
#print(label)

count = 0
print('Target data - English')
for dta, lbl in zip(data,label):
    if count > 5:
        break
    print([tgt_reverse_dictionary[w] for w in dta.tolist()])
    print([tgt_reverse_dictionary[w] for w in lbl.tolist()])
    print("\n")
    count +=1

Target data - English
['</s>', '</s>', '</s>', '</s>', '</s>']
['Today', 'Here', 'You', 'QuarkXPress', 'In']


['Today', 'Here', 'You', 'QuarkXPress', 'In']
['<unk>', '<unk>', '’', '8', 'this']


['<unk>', '<unk>', '’', '8', 'this']
[',', ',', 'll', 'is', 'section']


[',', ',', 'll', 'is', 'section']
['QuarkXPress', 'you', 'be', 'considered', 'we']


['QuarkXPress', 'you', 'be', 'considered', 'we']
['®', '’', 'surprised', 'by', '’']


['®', '’', 'surprised', 'by', '’']
['8', 'll', 'how', 'many', 'll']




In [23]:
!wget -O en-embeddings.npy https://www.dropbox.com/s/01e7dndrxopguk6/en-embeddings.npy?dl=0
!wget -O de-embeddings.npy https://www.dropbox.com/s/4204d2kgfknu4ci/de-embeddings.npy?dl=0

--2018-10-21 18:54:17--  https://www.dropbox.com/s/01e7dndrxopguk6/en-embeddings.npy?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.1, 2620:100:6019:1::a27d:401
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/01e7dndrxopguk6/en-embeddings.npy [following]
--2018-10-21 18:54:17--  https://www.dropbox.com/s/raw/01e7dndrxopguk6/en-embeddings.npy
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uce88980a4a1f9f35b2cca7c9195.dl.dropboxusercontent.com/cd/0/inline/ATlqgCMmyAUwMPO7HSSI_EARQ3JiL1CbAIjZEEvs-5A-SzY2HgBlT_EbwvZPx12PIXgSKxo23uYtRzG6-HdLg75eX3IGeDk3hufLOvoo6tSniYPJQPBmdy-CQHBgrRPfk_P9x-GJSo-OWpM15S-iSLdjF29txgqp2-i_5nQjHaAe0mc-qg43YEiQcdwSeJQ-yhE/file [following]
--2018-10-21 18:54:18--  https://uce88980a4a1f9f35b2cca7c9195.dl.dropboxusercontent.com/cd/0/inline/ATlqgCMmyAUwMPO7HSSI_EARQ3JiL1C

In [24]:
!ls -lrt

total 1381552
-rw-r--r-- 1 root root 644937323 Aug 22  2014 train.en
-rw-r--r-- 1 root root 717610118 Aug 22  2014 train.de
-rw-r--r-- 1 root root    403998 Jan 27  2015 vocab.50K.en
-rw-r--r-- 1 root root    505304 Jan 27  2015 vocab.50K.de
drwxr-xr-x 2 root root      4096 Oct 18 16:40 sample_data
-rw-r--r-- 1 root root  25600080 Oct 21 18:54 en-embeddings.npy
-rw-r--r-- 1 root root  25600080 Oct 21 18:54 de-embeddings.npy


In [0]:
tf.reset_default_graph()
# Load pre-trained word embeddings for German and English
encoder_emb_layer = tf.convert_to_tensor(np.load('de-embeddings.npy'))
decoder_emb_layer = tf.convert_to_tensor(np.load('en-embeddings.npy'))

In [26]:
sess = tf.InteractiveSession()  
print("Source (German) Embeddings")
print(encoder_emb_layer.eval()[0])
#print("Target (English) Embeddings")
#print(decoder_emb_layer.eval()[0])

# Check that the source and target embeddings have the same shape.
assert(len(encoder_emb_layer.eval()) == len(decoder_emb_layer.eval()))
assert(len(encoder_emb_layer.eval()[0]) == len(decoder_emb_layer.eval()[0]))

print("Number of Embeddings:", len(encoder_emb_layer.eval()))
print("Source Embedding Size:",len(encoder_emb_layer.eval()[0]))
print("Target Embedding Size:",len(decoder_emb_layer.eval()[0]))
sess.close()

Source (German) Embeddings
[ 0.02807283  0.08129065  0.04325406  0.03861211 -0.04374592  0.08648341
 -0.12594177  0.05204354  0.02891895 -0.00274239  0.0464763  -0.0177199
  0.0113791   0.10005165 -0.13852786  0.07391532  0.14600192 -0.07613634
  0.0165039   0.09500151 -0.09135051  0.06103227 -0.09518221 -0.00840024
  0.1021672  -0.09210443  0.05864106  0.02367448 -0.12617454  0.03162083
 -0.00553827 -0.06233861 -0.09098011  0.04980104 -0.08403688 -0.02544336
 -0.03200009 -0.36211336  0.04187389  0.13499737  0.01335561  0.05164875
  0.07950916  0.04037105  0.07873604  0.01508441 -0.01101452 -0.02970273
 -0.11738252  0.02539947 -0.10869873 -0.04156078  0.0270797   0.09760202
  0.01272728  0.12135591 -0.09459837 -0.08765817  0.04319254 -0.04632879
  0.03997531 -0.09053234 -0.05756423 -0.20108807 -0.04703222 -0.05928725
 -0.00646044  0.14901383  0.05235337 -0.00089508 -0.05842103 -0.01100476
  0.11242474 -0.09033585 -0.04254638 -0.15847538 -0.12196966 -0.10012189
  0.04274154 -0.10345502 

In [27]:
batch_size = 16
enc_train_inputs = []
dec_train_inputs = []
dec_train_labels=[]
dec_label_masks = []

print("Dimension of encoder train inputs:", max_src_length, "x",batch_size)
for i in range(max_src_length):
    enc_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size],name='enc_train_inputs_%d'%i))

print(len(enc_train_inputs))

Dimension of encoder train inputs: 40 x 16
40


In [28]:
print("Dimension of decoder train inputs:", max_tgt_length, "x",batch_size)
for i in range(max_tgt_length):
    dec_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size], name='dec_train_inputs_%d'%i))
    dec_train_labels.append(tf.placeholder(tf.int32, shape=[batch_size], name='dec_train_labels_%d'%i))
    dec_label_masks.append(tf.placeholder(tf.float32, shape=[batch_size], name='dec_label_masks_%d'%i))


Dimension of decoder train inputs: 60 x 16


In [29]:
# Sample code to read and display an embedding
sample_embedding = tf.nn.embedding_lookup(encoder_emb_layer, 0)
sess = tf.InteractiveSession()
print(sample_embedding.eval())
sess.close()

[ 0.02807283  0.08129065  0.04325406  0.03861211 -0.04374592  0.08648341
 -0.12594177  0.05204354  0.02891895 -0.00274239  0.0464763  -0.0177199
  0.0113791   0.10005165 -0.13852786  0.07391532  0.14600192 -0.07613634
  0.0165039   0.09500151 -0.09135051  0.06103227 -0.09518221 -0.00840024
  0.1021672  -0.09210443  0.05864106  0.02367448 -0.12617454  0.03162083
 -0.00553827 -0.06233861 -0.09098011  0.04980104 -0.08403688 -0.02544336
 -0.03200009 -0.36211336  0.04187389  0.13499737  0.01335561  0.05164875
  0.07950916  0.04037105  0.07873604  0.01508441 -0.01101452 -0.02970273
 -0.11738252  0.02539947 -0.10869873 -0.04156078  0.0270797   0.09760202
  0.01272728  0.12135591 -0.09459837 -0.08765817  0.04319254 -0.04632879
  0.03997531 -0.09053234 -0.05756423 -0.20108807 -0.04703222 -0.05928725
 -0.00646044  0.14901383  0.05235337 -0.00089508 -0.05842103 -0.01100476
  0.11242474 -0.09033585 -0.04254638 -0.15847538 -0.12196966 -0.10012189
  0.04274154 -0.10345502  0.00574398  0.07736626  0.

In [30]:
# Each word_int holds 16 ids (one per sentence in the batch). embedding_lookup maps them
# to their 128-dimensional embeddings, so it returns a 16x128 tensor.
# This is repeated 40 times, once per element of enc_train_inputs.
encoder_emb_inp = [tf.nn.embedding_lookup(encoder_emb_layer, word_int) for word_int in enc_train_inputs]
print(encoder_emb_inp[0])
encoder_emb_inp = tf.stack(encoder_emb_inp)
# All the 40 elements are stacked to form a single tensor.
print(encoder_emb_inp)

Tensor("embedding_lookup_1/Identity:0", shape=(16, 128), dtype=float32)
Tensor("stack:0", shape=(40, 16, 128), dtype=float32)
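The shapes printed above can be reproduced without TensorFlow. The following NumPy sketch mimics the lookup-then-stack pattern with small, made-up sizes (the notebook itself uses a 50000-word vocabulary, 128 dimensions, 40 steps and a batch of 16); plain row indexing stands in for `tf.nn.embedding_lookup`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes chosen for readability.
vocab_size, emb_dim, num_steps, batch = 10, 4, 3, 2
embeddings = rng.normal(size=(vocab_size, emb_dim)).astype(np.float32)

# One id vector of length `batch` per time step, like the list of placeholders.
step_ids = [rng.integers(0, vocab_size, size=batch) for _ in range(num_steps)]

# Row lookup per step gives (batch, emb_dim); stacking gives (num_steps, batch, emb_dim).
per_step = [embeddings[ids] for ids in step_ids]
stacked = np.stack(per_step)

print(per_step[0].shape)
print(stacked.shape)
```

The stacked layout is time-major (time first, batch second), which is what `time_major=True` expects further down.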


In [31]:
decoder_emb_inp = [tf.nn.embedding_lookup(decoder_emb_layer, word_int) for word_int in dec_train_inputs]
print(decoder_emb_inp[0])
decoder_emb_inp = tf.stack(decoder_emb_inp)
print(decoder_emb_inp)

Tensor("embedding_lookup_41/Identity:0", shape=(16, 128), dtype=float32)
Tensor("stack_1:0", shape=(60, 16, 128), dtype=float32)


In [32]:
enc_train_inp_lengths = tf.placeholder(tf.int32, shape=[batch_size],name='train_input_lengths')
dec_train_inp_lengths = tf.placeholder(tf.int32, shape=[batch_size],name='train_output_lengths')
print(enc_train_inp_lengths)

Tensor("train_input_lengths:0", shape=(16,), dtype=int32)


#### The input and output layers end here.
#### Next we build the encoder-decoder architecture

### Encoder

In [33]:
num_units = 128 
# num_units: parameter of BasicLSTMCell, the number of hidden units in the cell.
# Here it is set to the embedding size (128); the two do not have to match.

encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=num_units)
initial_state = encoder_cell.zero_state(batch_size=batch_size, dtype=tf.float32)
print(initial_state)

# Reference - https://stackoverflow.com/questions/41885519/tensorflow-dynamic-rnn-parameters-meaning
"""
cell -            The recurrent cell applied at every time step. Here it is a BasicLSTMCell with
                  128 hidden units.

inputs -          The input for the whole sequence.
                  With time_major=True, encoder_emb_inp must have shape
                  (max_time, batch_size, input_size) = (40, 16, 128), where input_size is the embedding size.
        
initial_state  -  The starting LSTM state for each sequence in the batch. Every sequence starts from
                  this (zero) state, i.e. no hidden state is passed between sequences, even if they
                  are in the same batch.
                 
sequence_length - A vector of size batch_size; each element gives the length of the corresponding
                  sequence in the batch.

swap_memory     - If True, tensors produced in the forward pass but needed for back-propagation
                  are swapped from GPU to CPU memory.
"""
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    cell= encoder_cell, inputs = encoder_emb_inp, initial_state=initial_state,
    sequence_length=enc_train_inp_lengths, 
    time_major=True, swap_memory=True)


print("\n")
print(encoder_state)
print("\n\n")
print(encoder_outputs)

Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').
LSTMStateTuple(c=<tf.Tensor 'BasicLSTMCellZeroState/zeros:0' shape=(16, 128) dtype=float32>, h=<tf.Tensor 'BasicLSTMCellZeroState/zeros_1:0' shape=(16, 128) dtype=float32>)


LSTMStateTuple(c=<tf.Tensor 'rnn/while/Exit_3:0' shape=(16, 128) dtype=float32>, h=<tf.Tensor 'rnn/while/Exit_4:0' shape=(16, 128) dtype=float32>)



Tensor("rnn/TensorArrayStack/TensorArrayGatherV3:0", shape=(40, 16, 128), dtype=float32)
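To make the role of `sequence_length` concrete, here is a NumPy sketch of a time-major recurrent loop. It uses a plain tanh cell instead of an LSTM (a deliberate simplification), and all weights and sizes are made up. It mimics the documented behaviour of `tf.nn.dynamic_rnn`: past a sequence's length, outputs are zeroed and the state is carried through unchanged.

```python
import numpy as np

def simple_rnn_time_major(inputs, lengths, w_x, w_h):
    """Run a tanh RNN over time-major inputs of shape (max_time, batch, input_size)."""
    max_time, batch, _ = inputs.shape
    state = np.zeros((batch, w_h.shape[0]), dtype=inputs.dtype)
    outputs = np.zeros((max_time, batch, w_h.shape[0]), dtype=inputs.dtype)
    for t in range(max_time):
        new_state = np.tanh(inputs[t] @ w_x + state @ w_h)
        active = (t < lengths)[:, None]                 # (batch, 1): is step t inside the sequence?
        state = np.where(active, new_state, state)      # copy the state through when finished
        outputs[t] = np.where(active, new_state, 0.0)   # zero the outputs past the length
    return outputs, state

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 2, 3))      # max_time=5, batch=2, input_size=3
lengths = np.array([5, 2])               # the second sequence has only 2 valid steps
w_x = rng.normal(size=(3, 4))
w_h = rng.normal(size=(4, 4))

outputs, final_state = simple_rnn_time_major(inputs, lengths, w_x, w_h)
print(outputs.shape, final_state.shape)
```

For the shorter sequence the final state equals its output at the last valid step, so padding does not leak into the state handed to the decoder.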


### Decoder

In [34]:
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
print(decoder_cell)
vocab_size = 50000

# The projection layer maps each decoder LSTM output (num_units values) to vocab_size logits.
projection_layer = Dense(units=vocab_size, use_bias=True)
print(projection_layer)


<tensorflow.python.ops.rnn_cell_impl.BasicLSTMCell object at 0x7fcdbe66fb00>
<tensorflow.python.layers.core.Dense object at 0x7fcdbe66fda0>


In [35]:
seq_len_vector_for_helper = [max_tgt_length for _ in range(batch_size)]
print(seq_len_vector_for_helper)

[60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60]


In [36]:
type(decoder_emb_inp)

tensorflow.python.framework.ops.Tensor

In [37]:
# TrainingHelper is used to feed the ground truth at every step instead of the decoded output 
# value from the previous step.
# Reference- https://stackoverflow.com/questions/43826784/trouble-understanding-tf-contrib-seq2seq-traininghelper
helper = tf.contrib.seq2seq.TrainingHelper(inputs=decoder_emb_inp, sequence_length= seq_len_vector_for_helper, 
                                           time_major= True)
# The final encoder state becomes the initial state of the decoder.
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection_layer)
print(decoder)

<tensorflow.contrib.seq2seq.python.ops.basic_decoder.BasicDecoder object at 0x7fcdbe66f898>


In [0]:

outputs, final_state, final_sequence_lengths = tf.contrib.seq2seq.dynamic_decode(
    decoder, output_time_major=True, swap_memory=True)

In [0]:
logits = outputs.rnn_output
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=dec_train_labels, logits=logits)
loss = tf.reduce_sum(cross_entropy * tf.stack(dec_label_masks))/(batch_size * max_tgt_length)
train_prediction = outputs.sample_id
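The loss above is a per-token cross-entropy multiplied by a 0/1 mask and divided by `batch_size * max_tgt_length`. A NumPy sketch of that computation, with made-up toy sizes and the hypothetical helper name `masked_cross_entropy`:

```python
import numpy as np

def masked_cross_entropy(logits, labels, mask):
    """logits: (T, B, V); labels: (T, B) ints; mask: (T, B) of 0/1 floats."""
    shifted = logits - logits.max(axis=-1, keepdims=True)      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    t_idx, b_idx = np.indices(labels.shape)
    per_token = -log_probs[t_idx, b_idx, labels]               # (T, B)
    # Same normalisation as the notebook: divide by T * B, not by the mask sum.
    return (per_token * mask).sum() / labels.size

# Hypothetical toy sizes: 2 time steps, 1 sentence, vocabulary of 3.
logits = np.zeros((2, 1, 3))          # uniform predictions
labels = np.array([[0], [2]])
mask = np.array([[1.0], [0.0]])       # the second step is treated as padding

loss = masked_cross_entropy(logits, labels, mask)
print(loss)
```

Because the divisor counts masked positions too, batches with more padding report a lower loss for the same per-token quality; dividing by `mask.sum()` instead would be an alternative design, not what the notebook does.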

In [40]:
print('Defining Optimizer')
# Adam optimizer with gradient clipping.
global_step = tf.Variable(0, trainable=False)
inc_gstep = tf.assign(global_step,global_step + 1)
learning_rate = tf.train.exponential_decay(
    0.01, global_step, decay_steps=10, decay_rate=0.9, staircase=True)

with tf.variable_scope('Adam'):
    adam_optimizer = tf.train.AdamOptimizer(learning_rate)

#Reference - https://www.tensorflow.org/api_docs/python/tf/train/Optimizer    
    
# compute_gradients returns a list of (gradient, variable) pairs.
# zip(*pairs) splits it into one tuple of gradients and one tuple of variables:
# adam_gradients = (v1_grad, v2_grad, v3_grad...) 
# variable = (v1, v2, v3...)
# This lets us clip all the gradients together by their global norm.
adam_gradients, variable = zip(*adam_optimizer.compute_gradients(loss))
adam_gradients, _ = tf.clip_by_global_norm(adam_gradients, 25.0)

# Zip back into [(grad1, v1), (grad2, v2), ...] to apply the gradients.
adam_optimize = adam_optimizer.apply_gradients(zip(adam_gradients, variable))

sess = tf.InteractiveSession()

Defining Optimizer
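Global-norm clipping, as used in the cell above, rescales all gradients by one shared factor so that their joint L2 norm does not exceed the threshold. A NumPy sketch of the rule, with made-up gradients (the function here is an illustrative re-implementation, not TensorFlow's code):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Hypothetical gradients for two variables; joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=6.5)
print(norm)
print(clipped)
```

Since every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude shrinks. When the joint norm is already below the threshold the factor is 1 and nothing changes.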


In [0]:
tf.global_variables_initializer().run()

In [42]:
enc_data_generator = DataGeneratorMT(batch_size=batch_size,num_unroll=max_src_length,is_source=True)
dec_data_generator = DataGeneratorMT(batch_size=batch_size,num_unroll=max_tgt_length,is_source=False)

enc_data_generator.print()

sent_ids = np.random.randint(low=0,high=train_inputs.shape[0],size=(batch_size))
print(sent_ids)
# ====================== ENCODER DATA COLLECTION ================================================

eu_data, eu_labels, _, eu_lengths = enc_data_generator.unroll_batches(sent_ids=sent_ids)


feed_dict = {}
feed_dict[enc_train_inp_lengths] = eu_lengths
for ui,(dat,lbl) in enumerate(zip(eu_data,eu_labels)):     
    #print(ui)
    feed_dict[enc_train_inputs[ui]] = dat                

# ====================== DECODER DATA COLLECTION ===========================

du_data, du_labels, _, du_lengths = dec_data_generator.unroll_batches(sent_ids=sent_ids)
print(du_lengths)

feed_dict[dec_train_inp_lengths] = du_lengths
for ui,(dat,lbl) in enumerate(zip(du_data,du_labels)):            
    feed_dict[dec_train_inputs[ui]] = dat
    feed_dict[dec_train_labels[ui]] = lbl
    feed_dict[dec_label_masks[ui]] = (np.array([ui for _ in range(batch_size)])<du_lengths).astype(np.int32)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[14067 16881 45752 36252 37743  8335 11107 31424 43271 32861 29950 18223
  1644 13117 23985 10072]
[24 18 23 10 40 22 17 40 40 31 40 40 34 40 21 18]
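The label masks built in the loop above mark, for each unroll step, which sentences in the batch are still within their recorded length. The same masks can be produced in one step with NumPy broadcasting; the sizes and lengths below are made up for illustration.

```python
import numpy as np

# Hypothetical values: 4 unroll steps, a batch of 3 sentences with these lengths.
num_steps = 4
lengths = np.array([2, 4, 1])

# Row t is 1.0 where step t is still inside the sentence, else 0.0.
steps = np.arange(num_steps)[:, None]                     # shape (4, 1)
masks = (steps < lengths[None, :]).astype(np.float32)     # shape (4, 3)
print(masks)
```

Row `t` of `masks` corresponds to the value fed to `dec_label_masks[t]` in the notebook's per-step loop.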


In [43]:
eu_data[0]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
      dtype=float32)

In [44]:
eu_labels[0]

array([4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
      dtype=float32)

In [45]:
du_data[0]

array([2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
      dtype=float32)

In [46]:
du_labels[0]

array([  318.,  1754.,    17.,  2567.,   136.,   760.,    17.,     0.,
         177.,    17.,   241.,    49.,  1493.,    91.,    17., 14401.],
      dtype=float32)

In [47]:
loss_over_time = []
num_epochs = 10000  # number of training steps (one random batch each), not passes over the data
enc_data_generator = DataGeneratorMT(batch_size=batch_size,num_unroll=max_src_length,is_source=True)
dec_data_generator = DataGeneratorMT(batch_size=batch_size,num_unroll=max_tgt_length,is_source=False)
avg_loss = 0


for i in range(num_epochs):
    sent_ids = np.random.randint(low=0,high=train_inputs.shape[0],size=(batch_size))
    # ====================== ENCODER DATA COLLECTION ================================================

    eu_data, eu_labels, _, eu_lengths = enc_data_generator.unroll_batches(sent_ids=sent_ids)


    feed_dict = {}
    feed_dict[enc_train_inp_lengths] = eu_lengths
    for ui,(dat,lbl) in enumerate(zip(eu_data,eu_labels)):     
        #print(ui)
        feed_dict[enc_train_inputs[ui]] = dat                

    # ====================== DECODER DATA COLLECTION ===========================

    du_data, du_labels, _, du_lengths = dec_data_generator.unroll_batches(sent_ids=sent_ids)


    feed_dict[dec_train_inp_lengths] = du_lengths
    for ui,(dat,lbl) in enumerate(zip(du_data,du_labels)):            
        feed_dict[dec_train_inputs[ui]] = dat
        feed_dict[dec_train_labels[ui]] = lbl
        feed_dict[dec_label_masks[ui]] = (np.array([ui for _ in range(batch_size)])<du_lengths).astype(np.int32)
        
        
    _,l,tr_pred = sess.run([adam_optimize,loss,train_prediction], feed_dict=feed_dict)
    tr_pred = tr_pred.flatten()
    
    
    if (i+1)%500==0:
        rand_idx = np.random.randint(low=1,high=batch_size)
        print_str = 'Actual: '
        for w in np.concatenate(du_labels,axis=0)[rand_idx::batch_size].tolist():
            print_str += tgt_reverse_dictionary[w] + ' '
            if tgt_reverse_dictionary[w] == '</s>':
                break
        print(print_str)

            
        print()
        print_str = 'Predicted: '
        for w in tr_pred[rand_idx::batch_size].tolist():
            print_str += tgt_reverse_dictionary[w] + ' '
            if tgt_reverse_dictionary[w] == '</s>':
                break
        print(print_str)
        print()        
        
    avg_loss += l
    
    if (i+1)%500==0:
        print('============= Step ', str(i+1), ' =============')
        print('\t Loss: ',avg_loss/500.0)
        #print(avg_loss)

        loss_over_time.append(avg_loss/500.0)

        avg_loss = 0.0
        sess.run(inc_gstep)
    

Actual: unloading of fruit <unk> . A retractable balcony was also installed for safety purposes <unk> . </s> 

Predicted: <unk> <unk> the <unk> , </s> 

	 Loss:  1.5996628906428814
Actual: </s> 

Predicted: </s> 

	 Loss:  1.3155119487941265
Actual: </s> 

Predicted: </s> 

	 Loss:  1.258278371937573
Actual: </s> 

Predicted: </s> 

	 Loss:  1.173017665296793
Actual: </s> 

Predicted: </s> 

	 Loss:  1.1739949850440026
Actual: </s> 

Predicted: </s> 

	 Loss:  1.112998608469963
Actual: sites <unk> , ecommerce ASPs <unk> , purchase systems <unk> , etc ) downloading this ICEcat data ##AT##-##AT## sheet since 23 Mar 2007 <unk> . </s> 

Predicted: <unk> <unk> , the ASPs <unk> , purchase systems <unk> , etc ) downloading this ICEcat data ##AT##-##AT## sheet since 27 Sep 2005 <unk> . </s> 

	 Loss:  1.1558712753653526
Actual: </s> 

Predicted: </s> 

	 Loss:  1.0711237744390965
Actual: &quot; ( Charming Spain Hotels ) <unk> . </s> 

Predicted: , <unk> <unk> ) ) ) <unk> . </s> 

	 Loss:  1.09

In [60]:
rand_idx = np.random.randint(low=1,high=batch_size)
print_str = 'Actual: '
for w in np.concatenate(du_labels,axis=0)[rand_idx::batch_size].tolist():
    print_str += tgt_reverse_dictionary[w] + ' '
    if tgt_reverse_dictionary[w] == '</s>':
        break
print(print_str)


print()
print_str = 'Predicted: '
for w in tr_pred[rand_idx::batch_size].tolist():
    print_str += tgt_reverse_dictionary[w] + ' '
    if tgt_reverse_dictionary[w] == '</s>':
        break
print(print_str)
print()  

Actual: is booted <unk> , it will mount the <unk> as its root filesystem <unk> . </s> 

Predicted: <unk> a <unk> , and is be the ability <unk> a name <unk> <unk> . </s> 
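The printing code above relies on two ideas: flattening a time-major `(steps, batch)` array and taking every `batch_size`-th element picks out one sentence, and the words are emitted until the first `</s>`. A small sketch with a made-up vocabulary and predictions (`ids_to_text` is a hypothetical helper, not defined in the notebook):

```python
import numpy as np

def ids_to_text(ids, reverse_dictionary, end_token="</s>"):
    """Join the tokens for the given ids, stopping after the first end token."""
    words = []
    for idx in ids:
        word = reverse_dictionary[int(idx)]
        words.append(word)
        if word == end_token:
            break
    return " ".join(words)

# Hypothetical toy vocabulary and time-major predictions of shape (steps=4, batch=2).
toy_reverse = {2: "</s>", 5: "hello", 6: "world", 7: "good", 8: "day"}
preds = np.array([[5, 7], [6, 8], [2, 2], [2, 2]])

batch_size = preds.shape[1]
flat = preds.flatten()                      # time-major order: step 0 of all sentences, then step 1, ...
sentence_1 = flat[1::batch_size]            # every batch_size-th element belongs to sentence 1
print(ids_to_text(sentence_1, toy_reverse))
```

Wrapping the decoding in a helper like this would avoid repeating the same loop for the actual and predicted sequences.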

