# Attentive Seq2Seq with copying mechanism for Table-to-Text 

In this notebook, we'll build a deep-learning-based Sequence-to-Sequence (Seq2Seq) model that generates text from tabular data, using PyTorch and TorchText. We work with a small fabricated dataset (student_grade_comments), but the models can be applied to any dataset that has tabular subject (attribute), value (cell) and annotation (caption) columns. This notebook does not discuss every concept or the reasoning behind the network representation and structure; that is left for the final Word document, which will take the form of an exhaustive report. Let's start with the basics.

## Introduction

The most common sequence-to-sequence (seq2seq) models are *encoder-decoder* models, which commonly use a *recurrent neural network* (RNN) to *encode* the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a *context vector*. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

![](/assets/seq2seq1.png)

The above image shows an example translation. The input/source sentence, "guten morgen", is passed through the embedding layer (yellow) and then input into the encoder (green). We also append a *start of sequence* (`<sos>` or `<START>`, whichever token one prefers to trigger the decoder with) and an *end of sequence* (`<eos>` or `<END>`) token to the start and end of the sentence, respectively. At each time-step, the input to the encoder RNN is both the embedding, $e$, of the current word, $e(x_t)$, and the hidden state from the previous time-step, $h_{t-1}$; the encoder RNN outputs a new hidden state $h_t$. We can think of the hidden state as a vector representation of the sentence so far.

Table-to-text can be represented in a similar manner: a combination of table attributes and their corresponding cell values (more on this in the report) is passed as an input sequence to the encoder, and the content description is then produced by the decoder. Mathematically, the encoder RNN is a function of both $e(x_t)$ and $h_{t-1}$:

$$h_t = \text{EncoderRNN}(e(x_t), h_{t-1})$$

We're using the term RNN generally here; it could be any recurrent architecture, such as an *LSTM* (Long Short-Term Memory) or a *GRU* (Gated Recurrent Unit). In our case, we use the *GRU* because of its simpler architecture.

Here, we have $X = \{x_1, x_2, ..., x_T\}$, where $x_1 = \text{<sos>}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually either initialized to zeros or a learned parameter.

Once the final word, $x_T$, has been passed into the RNN via the embedding layer, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$. This is a vector representation of the entire source sentence.
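
As a minimal sketch of the recurrence above, the loop below steps a GRU over a source sequence one token at a time and keeps the final hidden state as the context vector. The vocabulary size, dimensions and token indices are illustrative assumptions, not values from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim, hidden_dim = 10, 8, 8   # illustrative sizes (assumptions)
embedding = nn.Embedding(vocab_size, embed_dim)
gru = nn.GRU(embed_dim, hidden_dim)

source = torch.tensor([1, 4, 7, 2])            # hypothetical token indices x_1..x_T
h = torch.zeros(1, 1, hidden_dim)              # h_0 initialised to zeros

for x_t in source:
    e_t = embedding(x_t).view(1, 1, -1)        # e(x_t): [1 x 1 x embed_dim]
    _, h = gru(e_t, h)                         # h_t = EncoderRNN(e(x_t), h_{t-1})

z = h                                          # context vector z = h_T
print(z.shape)
# torch.Size([1, 1, 8])
```

Whatever the source length, `z` keeps the same fixed size, which is what makes it usable as the decoder's initial hidden state.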

Now that we have our context vector, $z$, we can start decoding it to get the output/target sentence, "good morning". Again, we append start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the embedding, $d$, of the current word, $d(y_t)$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$, i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

$$s_t = \text{DecoderRNN}(d(y_t), s_{t-1})$$

Although the input/source embedding layer, $e$, and the output/target embedding layer, $d$, are both shown in yellow in the diagram, they are two different embedding layers with their own parameters.

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$. 

$$\hat{y}_t = f(s_t)$$

The words in the decoder are always generated one after another, with one per time-step. We always use `<sos>` for the first input to the decoder, $y_1$, but for subsequent inputs, $y_{t>1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$, and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. Feeding the ground truth word is called *teacher forcing*; see a bit more info about it [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/).
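
The choice of the next decoder input can be sketched as below. `next_decoder_input` is a hypothetical helper written for illustration and makes the decision per step; the `train` function later in this notebook makes the same random choice once per sequence.

```python
import random

def next_decoder_input(ground_truth_token, predicted_token, teacher_forcing_ratio):
    """Return the token fed to the decoder at the next time-step."""
    # random.random() lies in [0, 1), so a ratio of 1.0 always teacher-forces
    # and a ratio of 0.0 never does
    use_teacher_forcing = random.random() < teacher_forcing_ratio
    return ground_truth_token if use_teacher_forcing else predicted_token

print(next_decoder_input('morning', 'evening', teacher_forcing_ratio=1.0))
# morning
```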

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference it is common to keep generating words until the model outputs an `<eos>` token or until a certain number of words have been generated.
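
The stopping rule used at inference can be sketched as follows. `greedy_decode` and `step_fn` are hypothetical names: `step_fn` stands in for one decoder step and here simply maps the previous word to the next one.

```python
def greedy_decode(step_fn, sos_token='<sos>', eos_token='<eos>', max_len=20):
    """Generate words until <eos> is produced or max_len words have been generated."""
    generated = []
    previous = sos_token
    for _ in range(max_len):
        previous = step_fn(previous)
        if previous == eos_token:
            break
        generated.append(previous)
    return generated

# toy "decoder": a lookup table from the previous word to the next word
toy_transitions = {'<sos>': 'good', 'good': 'morning', 'morning': '<eos>'}
print(greedy_decode(toy_transitions.get))
# ['good', 'morning']
```

The `max_len` cap matters because an untrained or poorly trained model may never emit `<eos>`.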

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.

## Preparing Data

We'll be coding up the models in PyTorch and using TorchText to help us do all of the pre-processing required. 

In [6]:
'''
Import relevant libraries:
    This exercise needs the libraries listed below.
    If a module is missing, install it with pip install or conda install.
'''

import pandas as pd
import numpy as np
import re
import torch
import torch.nn as nn
import random
import torch.nn.functional as F
#from gensim.models import FastText
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
'''Import the .csv file containing table rows and corresponding annotations'''

df = pd.read_csv('student_grade_comments.csv')
source_df = df.drop(['comments'], axis=1)
source_col = list(source_df.columns)

In [8]:
df.head()

| | name | gender | math | reading | writing | comments |
|---|---|---|---|---|---|---|
| 0 | liam | female | 72 | 72 | 74 | liam performance was decent and she was consis... |
| 1 | noah | female | 69 | 90 | 88 | noah scored good in reading and writing but he... |
| 2 | william | female | 90 | 95 | 93 | william was one of the top performers in the c... |
| 3 | james | male | 47 | 57 | 44 | james performed poorly across all three subjec... |
| 4 | oliver | male | 76 | 78 | 75 | oliver was consistent and with more efforts he... |


In [13]:
'''
converting the tabular attributes and cells into sequential form to feed into encoder and decoder:
eg: Table
    name | gender | math | reading | writing  [attributes]
    noah | female | 69   | 90      | 88       [cell values]
sequence:
    'name noah gender female math 69 reading 90 writing 88' 
    (this representation is important, the argument behind it is presented in the report)
'''

source = list()
for index, row in source_df.iterrows():
    source_seq = list()
    for col in source_col:
        source_seq.append(col)
        source_seq.append(str(row[col]))
    source.append(' '.join(source_seq))
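
The linearization above can also be sketched without pandas. `row_to_sequence` is a hypothetical helper, and the row is the example from the docstring written as a plain dictionary.

```python
def row_to_sequence(row, columns):
    """Interleave attribute names with their cell values: 'attr1 val1 attr2 val2 ...'."""
    tokens = []
    for col in columns:
        tokens.append(col)            # attribute name
        tokens.append(str(row[col]))  # corresponding cell value
    return ' '.join(tokens)

columns = ['name', 'gender', 'math', 'reading', 'writing']
row = {'name': 'noah', 'gender': 'female', 'math': 69, 'reading': 90, 'writing': 88}
print(row_to_sequence(row, columns))
# name noah gender female math 69 reading 90 writing 88
```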

In [9]:
'''Putting the source/input sequence and target/output annotations in two different columns of a dataframe'''

data = pd.DataFrame({'Text': source, 'Summary': df['comments']})

In [10]:
x = data['Text']  # input/source sequence
y = data['Summary'] #output/target sequence

In [14]:
''' define a function to clean the text as suited using regex'''

def clean(text):
    text = str(text)
    text = text.lower()
    '''
    perform other regex operations as per model requirement
    '''
    return text

In [15]:
'''clean the sequences and put them in the list'''

cleaned_source = list(map(clean,x))
cleaned_summary = list(map(clean,y))

'''now add the <START> and <END> tokens at the extremes of the target sequences'''
for i in range(len(cleaned_summary)):
    cleaned_summary[i] = "<START> " + cleaned_summary[i] + " <END>"

print(cleaned_source[10])
print(cleaned_summary[10])

name alexander gender male math 58 reading 54 writing 52
<START> alexander was average but he was consistent. <END>


In [16]:
'''check out the maximum source and target length, to be used in the model later'''

max_source_length = max([len(text.split()) for text in cleaned_source])
max_summary_length = max([len(text.split()) for text in cleaned_summary])

print('Maximum source length is: ', max_source_length)
print('Maximum target length is: ', max_summary_length)

Maximum source length is:  10
Maximum target length is:  23


In [17]:
'''
use sklearn's train_test_split to split the data into training and validation sets.
one can further split the dataset to set aside a small portion of it for testing
'''
from sklearn.model_selection import train_test_split

new_source, test_source, new_summary, test_summary = train_test_split(cleaned_source, cleaned_summary, test_size = 0.2)

## Building the Vocabulary

In this section, we build the source and target vocabularies from the training data. Each vocabulary is a pair of dictionary objects mapping words to their indices and vice versa.

A problem peculiar to the table-to-text task is the representation of low-frequency words such as named entities. Even if we enlarge the decoder vocabulary and build it on the full set of training summaries, the model will still encounter words during validation or testing that are missing from the vocabulary, and the larger lookup table slows down training.

To deal with this problem, we train the decoder on a small vocabulary comprising only the most frequent words and copy the unseen words, which are represented in the encoder vocabulary, through the *copying mechanism*, which we will talk about later.
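
A minimal sketch of such a frequency-thresholded vocabulary is shown below. `build_frequent_vocab` and the toy summaries are assumptions made for illustration; the notebook's own vocabulary code follows in the next cell.

```python
from collections import Counter

def build_frequent_vocab(sentences, min_freq=2, unk_token='<UNK>'):
    """Map each word that occurs at least min_freq times to a serial index."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    word2index = {unk_token: 0}
    for word, count in counts.items():
        if count >= min_freq:
            word2index[word] = len(word2index)
    return word2index

toy_summaries = [
    '<START> liam was consistent <END>',
    '<START> noah was good <END>',
]
vocab = build_frequent_vocab(toy_summaries)
print(vocab)
# {'<UNK>': 0, '<START>': 1, 'was': 2, '<END>': 3}
```

The names `liam` and `noah` fall below the threshold, which is exactly the kind of word the copying mechanism has to recover from the source sequence.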


In [18]:
'''
Vocab objects:
    word2Index_enc: dictionary containing all words in the source corpus and their index
    
    word2Index_dec_big: dictionary containing all words in the target corpus and their index
    word2Index_dec: dictionary containing the most frequent words in the target corpus
                    and their original index from word2Index_dec_big
    word2PsuInd_dec: pseudo dictionary containing the most frequent words in the target corpus
                     but with a new serial index
    
    len(word2PsuInd_dec) == len(word2Index_dec) (length of vocab on which decoder is trained)

'''


from collections import OrderedDict 

decoder_word_freq = 2

word2Index_enc = {}   
word2Index_dec = {}   
word2Index_dec_big = {} 

ind2Word_enc = {}
ind2Word_dec = {}
ind2Word_dec_big = {}

word2PsuInd_dec = {}  
psuInd2Word_dec = {}

encoder_paragraph = list(set((' '.join(new_source)).split()))

decoder_paragraph_list = list((' '.join(new_summary)).split())
decoder_dict = OrderedDict()
for word in decoder_paragraph_list:
    # count occurrences; .get avoids the bare try/except
    decoder_dict[word] = decoder_dict.get(word, 0) + 1

ind2Word_enc[0] = '<UNK>'
ind2Word_dec[0] = '<UNK>'
word2Index_enc['<UNK>'] = 0
word2Index_dec['<UNK>'] = 0
ind2Word_dec_big[0] = '<UNK>'
word2Index_dec_big['<UNK>'] = 0
word2PsuInd_dec['<UNK>'] = 0
psuInd2Word_dec[0] = '<UNK>'

dec_index = 1
for (decoder_dict_word, decoder_dict_number) in decoder_dict.items():
    word2Index_dec_big[decoder_dict_word] = dec_index  
    ind2Word_dec_big[dec_index] = decoder_dict_word
    if decoder_dict_number >= decoder_word_freq:  # keep only the frequent words
        word2Index_dec[decoder_dict_word] = dec_index     
        ind2Word_dec[dec_index] = decoder_dict_word
        psuedo_index = len(word2PsuInd_dec.keys())
        word2PsuInd_dec[decoder_dict_word] = psuedo_index 
        psuInd2Word_dec[psuedo_index] = decoder_dict_word
    dec_index+=1

enc_index = 1
for index,word in enumerate(encoder_paragraph):
    if word != ' ':
        word2Index_enc[word] = enc_index    
        ind2Word_enc[enc_index] = word 
        enc_index+=1
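
Once the dictionaries exist, a sequence is turned into indices by lookup, falling back to `<UNK>` for unseen words. The helper `encode_tokens` and the toy vocabulary below are hypothetical and only illustrate that lookup.

```python
def encode_tokens(sentence, word2index, unk_token='<UNK>'):
    """Convert a whitespace-tokenised sentence into a list of vocabulary indices."""
    unk_index = word2index[unk_token]
    return [word2index.get(word, unk_index) for word in sentence.split()]

toy_vocab = {'<UNK>': 0, 'name': 1, 'math': 2, '69': 3}
print(encode_tokens('name noah math 69', toy_vocab))
# [1, 0, 2, 3]
```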


## Building the Seq2Seq Model

### Encoder

First, we'll build the encoder. We use a single-layer GRU, with the option of making it a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer: a *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before. For simplicity, we will not enable bidirectionality unless required.

![](/assets/seq2seq8.png)

Mathematically, a full bidirectional GRU can be represented by:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

where $x_1^\rightarrow = \text{<sos>}, x_2^\rightarrow = \text{guten}$ and $x_1^\leftarrow = \text{<eos>}, x_2^\leftarrow = \text{morgen}$.

We pass an input (`embedded_outputs`) and an initial hidden state (`prev_hidden_state`) to the RNN. In the code below, the initial hidden state comes from `init_hidden`, which sets the initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$ in the bidirectional case) to a tensor of all zeros; PyTorch does the same by default when no initial hidden state is passed. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.

The RNN returns `output` and `prev_hidden_state`. 

`output` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$, and we can denote all encoder hidden states (forward and backward concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`prev_hidden_state` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$. With a bidirectional encoder, however, we have two: a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function.

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$
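
The shapes described above, and the merge of the two context vectors, can be checked with a short sketch. The sizes and the random stand-in for the embedded source are assumptions; `g` plays the role of the linear layer $g$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_len, batch_size, embed_dim, hid_dim = 5, 1, 8, 8    # illustrative sizes
gru = nn.GRU(embed_dim, hid_dim, bidirectional=True)
g = nn.Linear(hid_dim * 2, hid_dim)                     # the linear layer g

embedded = torch.randn(src_len, batch_size, embed_dim)  # stand-in for the embedded source
output, hidden = gru(embedded)                          # no h_0 passed, so it defaults to zeros

# output: [src len, batch size, hid dim * 2]; hidden: [2, batch size, hid dim]
z_forward, z_backward = hidden[-2], hidden[-1]
s0 = torch.tanh(g(torch.cat((z_forward, z_backward), dim=1)))
print(output.shape, hidden.shape, s0.shape)
# torch.Size([5, 1, 16]) torch.Size([2, 1, 8]) torch.Size([1, 8])
```

Note that the final forward state sits at the last time-step of `output`, while the final backward state sits at the first time-step, since the backward RNN reads the sentence from right to left.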

As we want our model to look back over the whole of the source sentence we return `output`, the stacked forward and backward hidden states for every token in the source sentence. We also return `prev_hidden_state`, which acts as our initial hidden state in the decoder.

With a unidirectional (forward-only) encoder, things are much simpler: every equation above still holds, we just drop the backward direction.

In [21]:
class Encoder(nn.Module):

    '''
    Args:
        input_vocab_size: (int) Size of source vocabulary
        embed_size: (int) Embedding dimensions
        hidden_size: (int) Dimensions of hidden state
        num_layers: (int) Number of stacked GRU layers, default is 1
        bidirectional: (bool) Whether the RNN should be bidirectional, default is False
    '''
  
    def __init__(self,input_vocab_size, embed_size, hidden_size,num_layers=1,bidirectional=False):
        super(Encoder,self).__init__()

        self.bidirectional = bidirectional
        self.num_layers = num_layers

        self.hidden_size = hidden_size
        self.input_vocab_size = input_vocab_size

        self.embedding = nn.Embedding(input_vocab_size, embed_size)

        self.gru_layer = nn.GRU(embed_size, hidden_size, num_layers, bidirectional=bidirectional)

    def forward(self,input_,prev_hidden_state):
        '''Arg:
            input_: Tensor holding a single source word index
                    (the encoder is called one token at a time, so source length = batch size = 1)
            prev_hidden_state: Previous hidden state [n_layers*n direction x batch size x hidden dim]
                                (in this case n_layers, n direction, batch size = 1)
        '''
        input_tensor = input_.view(-1,1)
        #input_tensor = [1 x 1] (source length x batch size)
        embedded_outputs = self.embedding(input_tensor).view(1,1,-1)
        #embedded_outputs = [1 x 1 x embed dim] (source length x batch size x embed dim)

        output, prev_hidden_state = self.gru_layer(embedded_outputs,prev_hidden_state)
        #prev_hidden_state = [n_layers*n direction x batch size x hidden dim]
        #output = [1 x batch size x hidden dim*n direction]

        return output, prev_hidden_state

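    def init_hidden(self):
        #one zero state per layer and direction; equals [1 x 1 x hidden dim] with the defaults
        num_directions = 2 if self.bidirectional else 1
        return torch.zeros(self.num_layers*num_directions,1,self.hidden_size, device=device)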

### AttentionDecoder

**Attention**

The attention mechanism takes the previous hidden state of the decoder, $s_{t-1}$, and the hidden states from the encoder, $H$ (the stacked forward and backward states in the case of a bidirectional encoder). It outputs an attention vector, $a_t$, that is the length of the source sentence, in which each element is between 0 and 1 and the entire vector sums to 1. We enforce both constraints by passing the scores through a $\text{softmax}$ layer. Note that in the `AttentionDecoder` code below, the scores are computed by a linear layer from the embedded decoder input and $s_{t-1}$, and $H$ enters when the weights are applied.

Intuitively, this layer takes what we have decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$. 

This gives us the attention over the source sentence!

Graphically, this looks something like below. This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

![](/assets/seq2seq9.png)

**Decoder**

Next, the attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $w_t$, which is given by `attention_applied`:

$$w_t = \sum_i a_i^t h_i$$
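
As a small worked example in plain Python, the sketch below normalises a set of scores with a softmax and uses the result to form the weighted sum. The scores and the toy hidden states are made-up numbers; the notebook itself performs the same operation on tensors with `torch.bmm`.

```python
import math

def softmax(scores):
    """Turn unnormalised scores into weights in (0, 1) that sum to 1."""
    highest = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, hidden_states):
    """Context vector: sum over source positions of weight * hidden state."""
    dim = len(hidden_states[0])
    return [sum(a * h[d] for a, h in zip(weights, hidden_states)) for d in range(dim)]

scores = [2.0, 0.5, 0.1]                   # hypothetical unnormalised attention scores
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder hidden states h_1..h_3
a_t = softmax(scores)
w_t = weighted_sum(a_t, H)
print(a_t, w_t)
```

The first source position receives the largest weight because it has the largest score, so $w_t$ is dominated by $h_1$.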

The context vector, which can be seen as a fixed-size representation of what has been read from the source for this step, is concatenated with the decoder state $s_t$ and fed through two linear layers to produce the vocabulary distribution $P_{vocab}$:

$$P_{vocab} = \text{softmax}(V^* (V[s_t; w_t] + b) + b^*)$$

where $V$, $V^*$, $b$ and $b^*$ are learnable parameters. $P_{vocab}$ is a probability distribution over all words in the vocabulary, and provides us with our final distribution from which to predict words $w$:

$$P(w) = P_{vocab}(w)$$

The embedded (with dropout) input word, $d(y_t)$, and the weighted source vector, $w_t$, are then concatenated and passed through a linear layer and a ReLU to form `attention_combine_relu`, which in turn is passed with the previous decoder hidden state `prev_hidden_state`, $s_{t-1}$, into the decoder RNN to produce `output` and `hidden`.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass `output` through a linear layer, $\text{Linear}$, and apply $\text{LogSoftmax}$ to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$.

$$\hat{y}_{t+1} = \text{LogSoftmax}(\text{Linear}(\text{output}))$$

The image below shows decoding the first word in an example translation.

![](/assets/seq2seq10.png)


In [24]:
class AttentionDecoder(nn.Module):

    '''
    Args:
        output_vocab_size: (int) Size of target vocabulary
        embed_dim: (int) Embedding dimensions
        hidden_size: (int) Dimensions of hidden state
        max_length_encoder: (int) Maximum length of encoder sequence
        dropout_value: (float) Value between 0 & 1
        num_layers: (int) Number of stacked GRU layers, default is 1
    ''' 

    def __init__(self, output_vocab_size, embed_dim, hidden_size, max_length_encoder, dropout_value, num_layers=1):
        super(AttentionDecoder,self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.output_vocab_size = output_vocab_size
        self.dropout_p = dropout_value
        self.max_length_encoder = max_length_encoder

        self.embedding = nn.Embedding(output_vocab_size, embed_dim) 

        #the attention and combine layers both take [embedding ; hidden-sized vector] as input
        self.attention_layer = nn.Linear(embed_dim + hidden_size, max_length_encoder)
        self.attention_combine = nn.Linear(embed_dim + hidden_size, hidden_size)

        self.s_layer = nn.Linear(hidden_size, 1)
        self.x_layer = nn.Linear(embed_dim, 1)
        self.context_layer = nn.Linear(hidden_size, 1)
        self.linear_pgen = nn.Linear(3, 1)

        #the GRU input is the output of attention_combine, which has hidden_size features
        self.gru_layer = nn.GRU(hidden_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, output_vocab_size)
        self.dropout_layer = nn.Dropout(self.dropout_p)    

    def forward(self,input_tens,prev_hidden_state,encoder_output):
        '''
        Args:
            input_tens = [1 x batch size] (seq length is strictly 1, batch size in our case 1)
            prev_hidden_state = Final encoder state [n_layers x batch size x hidden dim]
                                (in our case, n_layers and batch size is 1)
            encoder_output = encoder final output over source sequence [max_length_encoder x hidden dim]
        '''
        embedded_outputs = self.embedding(input_tens).view(1,1,-1)
        #input_tens = [1 x batch size]
        #embedded_outputs = [1 x batch size x embed dim]

        embeddings_dropout = self.dropout_layer(embedded_outputs)
        #embeddings_dropout = [1 x batch size x embed dim]
        
        '''
        Attention Portion:
        '''
        #prev_hidden_state = [n_layers x batch size x hidden dim]
        attention_layer_output = self.attention_layer(torch.cat((embeddings_dropout[0],prev_hidden_state[0]),1)) 
        
        #cat = [batch size x (embed dim + hidden dim)] = [batch size x 2*(hidden dim)]
        #in our case the embedding dimension is the same as the hidden dimension
        #attention_layer_output = [batch size x max_length_encoder]

        attention_weights = nn.functional.softmax(attention_layer_output,dim=1)
        #attention_weights = [batch size x max_length_encoder]
        
        '''
        Decoder Portion:
        '''
        attention_applied = torch.bmm(attention_weights.unsqueeze(0),encoder_output.unsqueeze(0))
        #attention_weights = [batch size x max_length_encoder], after unsqueezing in 0th dim ==> [1 x batch size x max_length_encoder]
        #encoder_output = [max_length_encoder x hidden dim], after unsqueezing in 0th dim ==> [1 x max_length_encoder x hidden dim] 
        #attention_applied = [1 x batch size x hidden dim]

        attention_combine_logits = self.attention_combine(torch.cat((embeddings_dropout[0],attention_applied[0]),1)).unsqueeze(0)  #since gru requires a batch dimension
        #embeddings_dropout = [1 x batch size x embed dim]
        #attention_applied = [1 x batch size x hidden dim]
        #cat = [batch size x (embed dim + hidden dim)] = [batch size x 2*(hidden dim)]
        #attention_combine_logits = [batch size x hidden dim], after unsqueezing in 0th dim ==> [1 x batch size x hidden dim]

        attention_combine_relu = nn.functional.relu(attention_combine_logits)
        #attention_combine_relu = [1 x batch size x hidden dim]
        
        '''
        Pgen calculation used for copying
        '''
        s_output = self.s_layer(prev_hidden_state[0])
        #prev_hidden_state = [n_layers x batch size x hidden dim]
        #s_output = [batch size x 1] = [1 x 1]

        x_output = self.x_layer(embeddings_dropout[0])
        #embeddings_dropout = [1 x batch size x embed dim]
        #x_output = [batch size x 1] = [1 x 1]

        context_weights = self.context_layer(attention_applied)
        #attention_applied = [1 x batch size x hidden dim]
        #context_weights = [1 x batch size x 1] = [1 x 1 x 1]

        sxc = torch.cat((s_output[0],x_output[0],context_weights[0][0]),0)
        #sxc = [3], the three scalar scores stacked into one vector
        linear_pgen = self.linear_pgen(sxc)
        #linear_pgen = [1]
        pgen = torch.sigmoid(linear_pgen)
        #pgen = [1], generation probability in (0, 1)

        output,hidden = self.gru_layer(attention_combine_relu,prev_hidden_state)
        #attention_combine_relu = [1 x batch size x hidden dim]
        #prev_hidden_state = [n_layers x batch size x hidden dim]
        #output = [1 x batch size x hidden dim] = [1 x 1 x hidden dim]
        #hidden = [n_layers x batch size x hidden dim] = [1 x 1 x hidden dim]

        output_logits = self.output_layer(output)
        #output_logits = [1 x batch size x output vocab size]
        output_softmax = nn.functional.log_softmax(output_logits[0],dim=1)
        #output_softmax = [batch size x output vocab size] = [1 x output vocab size] 
        #softmax applied distribution over target vocab
        return output_softmax,hidden,attention_weights,pgen

    def init_hidden(self):
        return torch.zeros(1,1,self.hidden_size,device=device)

## Basic Seq2Seq with Attention

The model may attend to relevant words in the source text to generate novel words, e.g., to produce the novel word *beat* in the abstractive summary *Germany beat Argentina 2-0*, the model may attend to the words *victorious* and *win* in the source text.

![](/assets/basic_seq2seq_attention.png)

The descriptions in the Encoder and AttentionDecoder sections are enough to model this type of network. But this kind of vanilla seq2seq model has obvious shortcomings, particularly when the full vocabulary contains many low-frequency words. Keeping the full vocabulary results in longer training time, while removing low-frequency words leads to a substantial loss of information. This situation is more likely to arise in table-to-text generation tasks, where tabular contents are very specific and include a large number of named entities. This problem can be overcome using a copying or pointing mechanism.

## Seq2Seq with Attention and Copying Mechanism

For each decoder timestep a generation probability $P_{gen} \in [0,1]$ is calculated, which weights the probability of generating words from the vocabulary versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which we make our prediction. Note that out-of-vocabulary source words such as *2-0* are included in the final distribution.

![](/assets/seq2seq_with_pg.png)

In the pointer-generator model (depicted in the figure above), the attention distribution $a_t$ and context vector $w_t$ are calculated as described in the AttentionDecoder section. In addition, the generation probability $P_{gen} \in [0,1]$ for timestep $t$ is calculated from the context vector $w_t$, the decoder state $s_t$ and the decoder input $x_t$ as follows:

$$P_{gen} = \text{sigmoid}(W_h^T w_t + W_s^T s_t + W_x^T x_t + b_{pg})$$
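
The formula can be evaluated by hand with a few lines of plain Python. The function name and all vectors and weights below are invented for illustration; in the notebook, the corresponding learnable weights live in `s_layer`, `x_layer`, `context_layer` and `linear_pgen`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generation_probability(w_t, s_t, x_t, W_h, W_s, W_x, b_pg):
    """P_gen = sigmoid(W_h . w_t + W_s . s_t + W_x . x_t + b_pg)."""
    return sigmoid(dot(W_h, w_t) + dot(W_s, s_t) + dot(W_x, x_t) + b_pg)

# toy two-dimensional vectors and weights (assumptions)
p_gen = generation_probability(
    w_t=[1.0, 0.0], s_t=[0.0, 1.0], x_t=[1.0, 1.0],
    W_h=[2.0, 0.0], W_s=[0.0, 0.0], W_x=[0.0, 0.0], b_pg=0.0,
)
print(p_gen)  # sigmoid(2.0), a value between 0 and 1
```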

Next, $P_{gen}$ is used as a soft switch to choose between generating a word from the vocabulary by sampling from $P_{vocab}$, or copying a word from the input sequence by sampling from the attention distribution $a_t$. For each source sequence, the extended vocabulary `extended_vocab` is the union of the decoder vocabulary and all words appearing in the source sequence that are absent from the decoder vocabulary. We obtain the following probability distribution over the extended vocabulary, `P_over_extended_vocab`:

$$P(w) = P_{gen}P_{vocab}(w)+(1-P_{gen})\sum_{i:w_i=w}a_i^t$$

If $w$ is an out-of-vocabulary (OOV) word, then $P_{vocab}(w)$ is zero; similarly, if $w$ does not appear in the source sequence, then $\sum_{i:w_i=w}a_i^t$ is zero. The ability to produce OOV words is one of the primary advantages of the pointing or copying mechanism.
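
A minimal sketch of this mixture is given below, using dictionaries keyed by words rather than the tensors and index maps of the `train` function. The function name, the toy distributions and the value of $P_{gen}$ are all assumptions.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on positions holding w)."""
    dist = {word: p_gen * prob for word, prob in p_vocab.items()}
    for token, a_i in zip(source_tokens, attention):
        # source words missing from the decoder vocabulary start from P_vocab(w) = 0
        dist[token] = dist.get(token, 0.0) + (1.0 - p_gen) * a_i
    return dist

p_vocab = {'<UNK>': 0.1, 'was': 0.6, 'consistent': 0.3}   # toy decoder distribution
source_tokens = ['name', 'noah', 'math', '69']            # toy source sequence
attention = [0.1, 0.7, 0.1, 0.1]                          # toy attention weights a_t

dist = final_distribution(0.5, p_vocab, attention, source_tokens)
print(max(dist, key=dist.get))
# noah
```

Here `noah` is absent from the decoder vocabulary, yet it ends up as the most probable word because the attention concentrates on it. The result is still a valid distribution, since both parts sum to one before being weighted by $P_{gen}$ and $1-P_{gen}$.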

In [31]:
'''let us check the vocab size once again'''

print('Size of encoder vocab: ',len(word2Index_enc))
print('Size of decoder vocab: Full {} | Frequent {}'.format(len(word2Index_dec_big),len(word2Index_dec)))

Size of encoder vocab:  155
Size of decoder vocab: Full 236 | Frequent 110


## Train and Validation Loss

Our next step is to define the training and validation loss, which is done in the `train` and `validate` functions. The `train` function is defined below. We let the model randomly choose between teacher forcing and no teacher forcing by imposing a threshold, `teacher_forcing_ratio`:

- if the decoder is teacher forced, the actual reference token (ground truth) is used as the next decoder input
- if not teacher forced, the decoder output produced in the previous step is used as the next decoder input

During training, the loss for timestep $t$ is the negative log likelihood of the target word $w_t^*$ for that
timestep, i.e.:

$$\text{loss}_t = -\log P(w_t^*)$$

and the overall loss for the whole sequence is:

$$\text{loss} = \sum_{t=0}^{T}\text{loss}_t$$
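
The two formulas can be sketched in plain Python as follows. `sequence_nll` is a hypothetical helper, and the small `eps` added inside the logarithm is an assumption made here to avoid `log(0)` when a target word receives zero probability.

```python
import math

def sequence_nll(step_distributions, target_words, eps=1e-12):
    """Sum over timesteps of -log P(w_t*), where P is the final distribution at step t."""
    return sum(
        -math.log(dist.get(word, 0.0) + eps)
        for dist, word in zip(step_distributions, target_words)
    )

# toy per-step distributions over words and the reference words (assumptions)
step_distributions = [{'noah': 0.5, 'was': 0.5}, {'was': 0.25, 'good': 0.75}]
target_words = ['noah', 'was']
loss = sequence_nll(step_distributions, target_words)
print(round(loss, 4))
# 2.0794
```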

The validate step is similar to the train step, but without teacher forcing and without gradient updates.

In [28]:
def train(encoder, decoder, input_tensor, target_tensor, 
          encoder_optimizer, decoder_optimizer, criterion, max_length, iters, 
          teacher_forcing_ratio = 0.4, clip = 0.4):
    '''
    Arg:
        encoder: encoder model to train
        decoder: decoder model to train
        input_tensor: source seq in tensor form [seq length x 1] batch size = 1
        target_tensor: target seq in tensor form [seq length x 1] batch size = 1
        encoder_optimizer: optimizer for encoder
        decoder_optimizer: optimizer for decoder
        criterion: Loss criterion
        max_length: maximum source length 
        iters: number of iterations
        teacher_forcing_ratio: if teacher forcing, actual next token is used as next input
        clip: to prevent gradients from exploding 
    '''
    encoder_optimizer.zero_grad() #initialize encoder_optimizer at zero gradient
    decoder_optimizer.zero_grad() #initialize decoder_optimizer at zero gradient

    #prev_unk_word = ''
    encoder_hidden = encoder.init_hidden()
    #encoder_hidden = [1 x 1 x hidden dim]

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device = device)
    #encoder_outputs = [seq length x hidden dim]

    input_length = input_tensor.size(0)
    output_length = target_tensor.size(0)

    for encoder_index in range(0, input_length):
        encoder_output,encoder_hidden = encoder(input_tensor[encoder_index], encoder_hidden)
        #input_tensor[encoder_index] = [1] (index of a single source word)
        #encoder_hidden = [1 x 1 x hidden dim] {encoder arg inp}
        #encoder_hidden = [n_layers*n direction x 1 x hidden dim] {encoder product}
        #encoder_output = [seq length x 1 x hidden dim*n direction]
        #seq length, n_layers, n direction = 1  

        encoder_outputs[encoder_index] = encoder_output[0,0] # [hidden dim]
        #encoder_outputs = [max_length x hidden dim], filled row by row; rows beyond input_length stay zero

    decoder_input = torch.tensor([word2Index_dec['<START>']],device=device)
    #decoder_input = [1 x 1]
    decoder_hidden = encoder_hidden
    #decoder_hidden = [1 x 1 x hidden dim]
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    extended_vocab = psuInd2Word_dec.copy()
    reverse_extended_vocab = word2PsuInd_dec.copy()
    duplicate_words = {}
    extend_key = len(word2Index_dec.keys())
    input_list = input_tensor.tolist()
    i =0
    for input_word in input_list:
        if ind2Word_enc[input_word[0]] in word2Index_dec.keys():
            duplicate_words[i] = word2PsuInd_dec[ind2Word_enc[input_word[0]]]
        else:
            extended_vocab[extend_key] = ind2Word_enc[input_word[0]]
            reverse_extended_vocab[ind2Word_enc[input_word[0]]] = extend_key
            extend_key += 1
        i = i+1
    '''
    Note:
        - Source positions whose word also exists in the decoder vocabulary are recorded in the
          'duplicate_words' dictionary (source position -> decoder pseudo index).
        - Source words unseen by the decoder, i.e. target OOV words, are appended to the pseudo
          decoder vocabulary, giving 'extended_vocab'.
        - Both dictionaries are needed during copying.
    '''
    
    loss = 0
    for decoder_index in range(output_length):
        decoder_output,decoder_hidden,decoder_attention,pgen = decoder(decoder_input,decoder_hidden,encoder_outputs)
        #decoder_input = [1 x 1]
        #decoder_hidden = [1 x 1 x hidden dim]
        #encoder_outputs = [seq length x hidden dim]

        #decoder_output = [1 x output vocab size]
        #decoder_hidden = [1 x 1 x hidden dim]
        #decoder_attention = [1 x max source length]
        #pgen = [1 x 1]

        P_over_extended_vocab = torch.exp(decoder_output)*pgen.expand_as(torch.exp(decoder_output))
        #P_over_extended_vocab = [1 x output vocab size] (exp(decoder_output)*pgen)

        decoder_attention = decoder_attention.squeeze(0)[0:input_length].unsqueeze(0)
        #restricting decoder attention upto only input length
        #decoder_attention = [1 x input_length]
        p_duplicate_list = torch.zeros([input_length, P_over_extended_vocab.size(1)], device=device)
        #p_duplicate_list = [input_length x output vocab size] 

        p_duplicate_list = p_duplicate_list.tolist()
        for (duplicate_word_key,duplicate_word_value) in duplicate_words.items():
            p_duplicate_list[duplicate_word_key][duplicate_word_value] = 1 #making duplicate key,vals apparent
      
        p_duplicate = torch.tensor(p_duplicate_list, dtype=torch.float, device=device)
        p_diag = torch.mm(decoder_attention, p_duplicate)
        #p_diag = [1 x output vocab size]

        p_diag = p_diag*(torch.tensor([1], device=device).sub(pgen)).expand_as(p_diag)
        #p_diag = p_diag*(1 - pgen)

        p_add_diag = torch.diag(p_diag.squeeze(0),diagonal=0) #p_diag.squeeze(0) ==> [output vocab size]
        #p_add_diag = [output vocab size x output vocab size]

        P_over_extended_vocab = torch.mm(P_over_extended_vocab,p_add_diag).add(P_over_extended_vocab)
        #mm = [1 x output vocab size]
        #P_over_extended_vocab = [1 x output vocab size] (element wise summation)

        for i in range(input_length):
            if not (1 in p_duplicate_list[i]):
                P_over_extended_vocab = torch.cat((P_over_extended_vocab[0], torch.mm(decoder_attention.squeeze(0)[i].unsqueeze(0).unsqueeze(0), torch.tensor([1], device=device).sub(pgen).unsqueeze(0)).squeeze(0)),0).unsqueeze(0)
        
        '''
        The step above appends one slot per out-of-vocabulary source word, so that when <UNK> is the
        best the decoder can produce over the original vocabulary, it can pick the most appropriate
        word from the extended vocabulary instead
        '''
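        try: # Loss calculation: negative log-likelihood of the reference token
            target_word = ind2Word_dec_big[target_tensor[decoder_index].item()]
            step_loss = -torch.log(P_over_extended_vocab[0][reverse_extended_vocab[target_word]] + 1e-12)
            # Backpropagate this step's loss only; calling backward on the running total
            # would add the gradients of all earlier steps again at every step.
            step_loss.backward(retain_graph=True)
            loss += step_loss.detach()
        except KeyError:
            # Reference word is in neither the decoder vocabulary nor the source: no loss for this step
            loss += torch.tensor(0,dtype=torch.float,device=device)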
    
        if use_teacher_forcing:
            '''if decoder is teacher forced, actual reference token (ground truth) is used as next decoder input'''
            next_input = target_tensor[decoder_index]
            if next_input.item() in ind2Word_dec.keys():
                dec_train_word = ind2Word_dec[next_input.item()]
                decoder_input = torch.tensor([word2PsuInd_dec[dec_train_word]], dtype=torch.long, device=device)
            else:
                decoder_input = torch.tensor([0], dtype=torch.long, device=device)
            
            if (decoder_input.item() == word2Index_dec['<END>']):
                break       
        else:
            '''if not teacher forced, decoder output produced in the previous step is used as next decoder input'''
            idx = torch.topk(P_over_extended_vocab, k=1, dim=1)[1]
            if idx.item() < len(word2Index_dec.keys()):   
                decoder_input = torch.tensor([idx.item()],dtype=torch.long,device=device)
            elif idx.item() >= len(word2Index_dec.keys()):
                #prev_unk_word = extended_vocab[idx.item()] # use <UNK> if doesn't work
                decoder_input = torch.tensor([0],dtype=torch.long,device=device)
            
            if (decoder_input.item() == word2Index_dec['<END>']):
                break      


    if iters > 20000:
        torch.nn.utils.clip_grad_norm_(encoder.parameters(), clip)
        torch.nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item()/output_length
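
The extended-vocabulary bookkeeping at the start of `train` is easier to follow in isolation. The sketch below rebuilds it in plain Python on a toy example. The function name `build_extended_vocab` and the toy vocabulary are illustrative assumptions, not part of the notebook, and the sketch assumes the decoder's pseudo indices coincide with its ordinary indices.

```python
def build_extended_vocab(source_words, word2index_dec):
    """Toy version of the bookkeeping in train(); all names here are hypothetical."""
    extended_vocab = {idx: word for word, idx in word2index_dec.items()}
    reverse_extended_vocab = dict(word2index_dec)
    duplicate_words = {}  # source position -> decoder index of an in-vocabulary word
    extend_key = len(word2index_dec)  # first free index after the decoder vocabulary

    for position, word in enumerate(source_words):
        if word in word2index_dec:
            duplicate_words[position] = word2index_dec[word]
        else:
            # Out-of-vocabulary source word: give it a temporary index
            # that is valid for this example only.
            extended_vocab[extend_key] = word
            reverse_extended_vocab[word] = extend_key
            extend_key += 1
    return extended_vocab, reverse_extended_vocab, duplicate_words


toy_vocab = {'<UNK>': 0, '<START>': 1, '<END>': 2, 'math': 3, 'scores': 4}
ext, rev, dup = build_extended_vocab(['name', 'josiah', 'math', '53'], toy_vocab)
print(dup)            # source positions whose word the decoder already knows
print(rev['josiah'])  # temporary index assigned to an out-of-vocabulary word
```

Because the temporary indices are rebuilt for every example, the same out-of-vocabulary word can receive a different index from one source sequence to the next.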

In [32]:
def validate(encoder, decoder, input_tensor, target_tensor, criterion, max_length):
    '''
    Note: the validate step is similar to the train step, but without teacher forcing and without gradient updates
    Args:
        encoder: trained encoder model
        decoder: trained decoder model
        input_tensor: source seq in tensor form [seq length x 1] batch size = 1
        target_tensor: target seq in tensor form [seq length x 1] batch size = 1
        criterion: loss criterion
        max_length: maximum source length
    '''
    with torch.no_grad():
    
        #prev_unk_word = ''

        encoder_hidden = encoder.init_hidden()
        #encoder_hidden = [1 x 1 x hidden dim]

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device = device)
        #encoder_outputs = [seq length x hidden dim]

        input_length = input_tensor.size(0)
        output_length = target_tensor.size(0)

        loss = 0

        for encoder_index in range(0, input_length):
            encoder_output,encoder_hidden = encoder(input_tensor[encoder_index], encoder_hidden)
            #input_tensor[encoder_index] = [1 x 1 x embed dim] (embed dim = hidden dim)
            #encoder_hidden = [1 x 1 x hidden dim] {encoder arg inp}
            #encoder_hidden = [n_layers*n direction x 1 x hidden dim] {encoder product}
            #encoder_output = [seq length x 1 x hidden dim*n direction]
            #seq length, n_layers, n direction = 1  

            encoder_outputs[encoder_index] = encoder_output[0,0] # [1 x hidden dim]
            #encoder_outputs: [seq length x hidden dim] ==> [seq length x hidden dim] (hidden state from all 0 to new)

        decoder_input = torch.tensor([word2Index_dec['<START>']],device=device)   
        #decoder_input = [1 x 1]
        decoder_hidden = encoder_hidden
        #decoder_hidden = [1 x 1 x hidden dim]

        extended_vocab = psuInd2Word_dec.copy()
        reverse_extended_vocab = word2PsuInd_dec.copy()
        duplicate_words = {}
        extend_key = len(word2Index_dec.keys())
        input_list = input_tensor.tolist()
        i =0
        for input_word in input_list:
            if ind2Word_enc[input_word[0]] in word2Index_dec.keys():
                duplicate_words[i] = word2PsuInd_dec[ind2Word_enc[input_word[0]]]
            else:
                extended_vocab[extend_key] = ind2Word_enc[input_word[0]]
                reverse_extended_vocab[ind2Word_enc[input_word[0]]] = extend_key
                extend_key += 1
            i = i+1
        
        for decoder_index in range(output_length):
            decoder_output,decoder_hidden,decoder_attention,pgen = decoder(decoder_input,decoder_hidden,encoder_outputs)
            #decoder_input = [1 x 1]
            #decoder_hidden = [1 x 1 x hidden dim]
            #encoder_outputs = [seq length x hidden dim]

            #decoder_output = [1 x output vocab size]
            #decoder_hidden = [1 x 1 x hidden dim]
            #decoder_attention = [1 x max source length]
            #pgen = [1 x 1]

            P_over_extended_vocab = torch.exp(decoder_output)*pgen.expand_as(torch.exp(decoder_output))
            #P_over_extended_vocab = [1 x output vocab size] (exp(decoder_output)*pgen)

            decoder_attention = decoder_attention.squeeze(0)[0:input_length].unsqueeze(0)
            #restricting decoder attention upto only input length
            #decoder_attention = [1 x input_length]
            p_duplicate_list = torch.zeros([input_length, P_over_extended_vocab.size(1)], device=device)
            #p_duplicate_list = [input_length x output vocab size] 

            p_duplicate_list = p_duplicate_list.tolist()
            for (duplicate_word_key,duplicate_word_value) in duplicate_words.items():
                p_duplicate_list[duplicate_word_key][duplicate_word_value] = 1 #making duplicate key,vals apparent

            p_duplicate = torch.tensor(p_duplicate_list, dtype=torch.float, device=device)
            p_diag = torch.mm(decoder_attention, p_duplicate)
            #p_diag = [1 x output vocab size]

            p_diag = p_diag*(torch.tensor([1], device=device).sub(pgen)).expand_as(p_diag)
            #p_diag = p_diag*(1 - pgen)

            p_add_diag = torch.diag(p_diag.squeeze(0),diagonal=0) #p_diag.squeeze(0) ==> [output vocab size]
            #p_add_diag = [output vocab size x output vocab size]

            P_over_extended_vocab = torch.mm(P_over_extended_vocab,p_add_diag).add(P_over_extended_vocab)
            #mm = [1 x output vocab size]
            #P_over_extended_vocab = [1 x output vocab size] (element wise summation)

            for i in range(input_length):
                if not (1 in p_duplicate_list[i]):
                    P_over_extended_vocab = torch.cat((P_over_extended_vocab[0], torch.mm(decoder_attention.squeeze(0)[i].unsqueeze(0).unsqueeze(0), torch.tensor([1], device=device).sub(pgen).unsqueeze(0)).squeeze(0)),0).unsqueeze(0)

            try:
                loss += -torch.log(P_over_extended_vocab[0][ reverse_extended_vocab[ ind2Word_dec_big[ target_tensor[decoder_index].item() ] ] ] + 1e-12)
            except KeyError:
                loss += torch.tensor(0,dtype=torch.float,device=device)

            idx = torch.topk(P_over_extended_vocab, k=1, dim=1)[1]
            if idx.item() < len(word2Index_dec.keys()):   
                decoder_input = torch.tensor([idx.item()],dtype=torch.long,device=device)
            elif idx.item() >= len(word2Index_dec.keys()):
                #prev_unk_word = extended_vocab[idx.item()] # use <UNK> if doesn't work
                decoder_input = torch.tensor([0],dtype=torch.long,device=device)
            if (decoder_input.item() == word2Index_dec['<END>']):
                break      

    return loss.item()/output_length
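
Reading the tensor operations in `train` and `validate`, the score given to an in-vocabulary word $w$ appears to be $p_{gen} P_{vocab}(w) \left(1 + (1 - p_{gen}) \sum_{i: x_i = w} a_i\right)$, and each out-of-vocabulary source position $i$ gets an extra slot with score $(1 - p_{gen}) a_i$, where $a_i$ is the attention weight on position $i$. The plain-Python sketch below reproduces that arithmetic on lists. The function name `extend_distribution`, the symbol $a_i$ and the toy numbers are assumptions made for illustration.

```python
def extend_distribution(p_vocab, attention, p_gen, duplicate_words):
    """Hypothetical list-based mirror of the copy step in train()/validate().

    p_vocab: probabilities over the decoder vocabulary
    attention: attention weights over the source positions
    p_gen: generation probability in [0, 1]
    duplicate_words: {source position: decoder vocabulary index}
    """
    # Attention mass pointing at each in-vocabulary word.
    copy_mass = [0.0] * len(p_vocab)
    for position, vocab_index in duplicate_words.items():
        copy_mass[vocab_index] += attention[position]

    # Generation score, boosted for words that also occur in the source.
    scores = [p_gen * p * (1.0 + (1.0 - p_gen) * mass)
              for p, mass in zip(p_vocab, copy_mass)]

    # One extra slot per out-of-vocabulary source position, in source order.
    for position, weight in enumerate(attention):
        if position not in duplicate_words:
            scores.append((1.0 - p_gen) * weight)
    return scores


scores = extend_distribution(p_vocab=[0.5, 0.3, 0.2], attention=[0.6, 0.4],
                             p_gen=0.8, duplicate_words={0: 1})
print(scores)
```

The resulting scores do not necessarily sum to one, so they are best read as unnormalized scores over the extended vocabulary rather than as a probability distribution.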

In [33]:
import os

'''make directory for saving the model'''

os.makedirs('checkpoints_table2text/encoder', exist_ok=True)
os.makedirs('checkpoints_table2text/decoder', exist_ok=True)

## Model Training

In this section, we define the training procedure for our model. We use the stochastic gradient descent (`SGD`) optimizer for both the encoder and the decoder, with a small learning rate (0.0005 by default) because the dataset is very small. At every iteration the model is trained on one randomly chosen training pair and validated on one randomly chosen validation pair, and the encoder and decoder are saved to the checkpoint directory whenever the validation loss reaches a new minimum.
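
The checkpointing rule can be illustrated without any model at all. The sketch below, with a hypothetical helper `best_checkpoint_iterations` and made-up loss values, records the iterations at which a save would be triggered.

```python
def best_checkpoint_iterations(val_losses):
    """Return the iterations at which a new minimum validation loss occurs.

    Hypothetical helper for illustration; the notebook saves the encoder
    and decoder at exactly these iterations.
    """
    best_loss = float('inf')
    improved_at = []
    for iteration, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            improved_at.append(iteration)
    return improved_at, best_loss


# Made-up validation losses, one per iteration.
improved_at, best_loss = best_checkpoint_iterations([4.6, 4.7, 4.2, 4.3, 4.1])
print(improved_at, best_loss)
```

Since each validation loss in the notebook comes from a single randomly chosen pair, the minimum is a noisy signal: a low value may reflect an easy example rather than a better model.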

In [45]:
def train_Iters(encoder,decoder,n_iters, print_every=1, plot_every=5,learning_rate = 0.0005):
    '''
    Args:
        encoder: defined encoder
        decoder: defined decoder
        n_iters: int, number of iterations
        print_every: int, prints the predicted text on a randomly selected example from
                     training source every 'print_every' iterations
        plot_every: int, stores the training loss every 'plot_every' iteration
        learning_rate: encoder and decoder optimizer's learning rate
    '''
    train_loss_graph = {}
    val_loss_graph = {}
    plot_losses = []
    
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0
    print_val_loss = 0

    encoder_optimizer = torch.optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = torch.optim.SGD(decoder.parameters(), lr=learning_rate)

    encoder_input = [[word2Index_enc[word] if word in word2Index_enc.keys() else word2Index_enc['<UNK>'] for word in sentence.split()] for sentence in new_source ]
    decoder_input = [[word2Index_dec_big[word] if word in word2Index_dec_big.keys() else word2Index_dec_big['<UNK>'] for word in sentence.split()] for sentence in new_summary ]
    train_pairs = [[enc,dec] for enc,dec in zip(encoder_input,decoder_input)]
    training_pairs = [random.choice(train_pairs) for i in range(n_iters)]

    encoder_val = [[word2Index_enc[word] if word in word2Index_enc.keys() else word2Index_enc['<UNK>'] for word in sentence.split()] for sentence in test_source ]
    decoder_val = [[word2Index_dec_big[word] if word in word2Index_dec_big.keys() else word2Index_dec_big['<UNK>'] for word in sentence.split()] for sentence in test_summary ]
    val_pairs = [[enc,dec] for enc,dec in zip(encoder_val,decoder_val)]
    validation_pairs = [random.choice(val_pairs) for i in range(n_iters)]

    best_model_iters = []
    best_valid_loss = float('inf')
    criterion = nn.NLLLoss()
    for iters in range(n_iters):
        training_pair = training_pairs[iters]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        input_tensor = torch.tensor(input_tensor, dtype=torch.long, device = device).view(-1, 1)
        target_tensor = torch.tensor(target_tensor, dtype=torch.long, device = device).view(-1, 1)

        loss = train(encoder,decoder,input_tensor,target_tensor,
                     encoder_optimizer,decoder_optimizer,criterion,max_source_length, iters=n_iters)
        print_loss_total += loss
        plot_loss_total += loss

        validation_pair = validation_pairs[iters]
        val_input_tensor = validation_pair[0]
        val_target_tensor = validation_pair[1]

        val_input_tensor = torch.tensor(val_input_tensor, dtype=torch.long, device = device).view(-1, 1)
        val_target_tensor = torch.tensor(val_target_tensor, dtype=torch.long, device = device).view(-1, 1)

        val_loss = validate(encoder, decoder, val_input_tensor, val_target_tensor, criterion, max_source_length)
        print_val_loss += val_loss

        if iters % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            val_loss_avg = print_val_loss / print_every
            print_val_loss = 0

            print(f'Iteration: {iters}, Train Loss: {print_loss_avg:.4f}, Val Loss: {val_loss_avg:.4f}') 
            evaluateRandomly(encoder, decoder, new_source, new_summary)
            if iters > 0:
                train_loss_graph[iters] = print_loss_avg
                val_loss_graph[iters] = val_loss_avg

        if iters % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0


        if val_loss < best_valid_loss:
            best_valid_loss = val_loss
            torch.save(encoder, 'checkpoints_table2text/encoder/best_encoder.pt')
            torch.save(decoder, 'checkpoints_table2text/decoder/best_decoder.pt')
            best_model_iters.append(iters)

    print(f'\nbest model found at iteration {max(best_model_iters)}')

In [41]:
'''Define the encoder and decoder model'''

INPUT_VOCAB_SIZE = len(word2Index_enc.keys())
OUTPUT_VOCAB_SIZE = len(word2Index_dec.keys())
MAX_SOURCE_LENGTH = max_source_length
DROPOUT_VAL = 0.2
HIDDEN_SIZE = EMBED_DIM = 128

rnn_encoder = Encoder(INPUT_VOCAB_SIZE, EMBED_DIM, HIDDEN_SIZE).to(device=device)
rnn_decoder = AttentionDecoder(OUTPUT_VOCAB_SIZE, EMBED_DIM, 
                               HIDDEN_SIZE, MAX_SOURCE_LENGTH, DROPOUT_VAL).to(device=device)

In [44]:
def evaluate(encoder, decoder, encoder_tensor, 
             max_source_length=max_source_length, max_summary_length=max_summary_length):
    '''
    Returns a list of decoded tokens for the provided source tensor, along with the attention tensor
    
    Args:
        encoder: trained encoder
        decoder: trained decoder
        encoder_tensor: a LongTensor of sequence, dtype must be torch.long
        max_source_length: int, maximum source length present in training set
        max_summary_length: int, maximum target length present in the training set
    '''
    
    with torch.no_grad():
        input_tensor = encoder_tensor
        input_length = input_tensor.size(0)
        encoder_hidden = encoder.init_hidden()

        #prev_unk_word = ''

        encoder_outputs = torch.zeros(max_source_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei].unsqueeze(0),
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        extended_vocab = psuInd2Word_dec.copy()
        duplicate_words = {}
        extend_key = len(word2Index_dec.keys())
        input_list = input_tensor.tolist()
        i =0
        for input_word in input_list:
            if ind2Word_enc[input_word] in word2Index_dec.keys():
                duplicate_words[i] = word2PsuInd_dec[ind2Word_enc[input_word]]
            else:
                extended_vocab[extend_key] = ind2Word_enc[input_word]
                extend_key += 1
            i = i+1

        decoder_input = torch.tensor([word2Index_dec['<START>']], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_summary_length, max_source_length)

        for di in range(max_summary_length):
            decoder_output, decoder_hidden, decoder_attention,pgen = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data

            P_over_extended_vocab = torch.exp(decoder_output)*pgen.expand_as(torch.exp(decoder_output))

            decoder_attention = decoder_attention.squeeze(0)[0:input_length].unsqueeze(0)
            p_duplicate_list = torch.zeros([input_length, P_over_extended_vocab.size(1)], device=device)
            p_duplicate_list = p_duplicate_list.tolist()
            for (duplicate_word_key,duplicate_word_value) in duplicate_words.items():
                p_duplicate_list[duplicate_word_key][duplicate_word_value] = 1
            p_duplicate = torch.tensor(p_duplicate_list, dtype=torch.float, device=device)
            p_diag = torch.mm(decoder_attention, p_duplicate)
            p_diag = p_diag*(torch.tensor([1], device=device).sub(pgen)).expand_as(p_diag)
            p_add_diag = torch.diag(p_diag.squeeze(0),diagonal=0)
            P_over_extended_vocab = torch.mm(P_over_extended_vocab,p_add_diag).add(P_over_extended_vocab)

            for i in range(input_length):
                if not (1 in p_duplicate_list[i]):
                    P_over_extended_vocab = torch.cat((P_over_extended_vocab[0], torch.mm(decoder_attention.squeeze(0)[i].unsqueeze(0).unsqueeze(0), torch.tensor([1], device=device).sub(pgen).unsqueeze(0)).squeeze(0)),0).unsqueeze(0)

            idx = torch.topk(P_over_extended_vocab, k=1, dim=1)[1]
            if idx.item() < len(word2Index_dec.keys()):   
                decoder_input = torch.tensor([idx.item()],dtype=torch.long,device=device)
                decoded_words.append(extended_vocab[idx.item()])
            elif idx.item() >= len(word2Index_dec.keys()):
                decoder_input = torch.tensor([0],dtype=torch.long,device=device)
                prev_unk_word = extended_vocab[idx.item()] 
                decoded_words.append(prev_unk_word)
            if idx.item() == word2Index_dec['<END>']:
                decoded_words.append('<END>')
                break

        return decoded_words, decoder_attentions[:di + 1]
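
The branch at the end of the decoding loop separates the word that is emitted from the index that is fed back to the decoder. A minimal sketch of that rule follows. The function name `decode_step` and the toy vocabulary are illustrative, and the sketch assumes that index 0, which the notebook feeds back after a copied word, stands for `<UNK>`.

```python
def decode_step(best_index, vocab_size, extended_vocab, unk_index=0):
    """Return (emitted word, index to feed back to the decoder).

    Hypothetical helper: indices below vocab_size are ordinary decoder
    words; larger indices are words copied from the source, for which
    the decoder has no embedding.
    """
    word = extended_vocab[best_index]
    if best_index < vocab_size:
        return word, best_index
    # Copied word: emit it, but feed the fallback index back to the decoder.
    return word, unk_index


toy_extended_vocab = {0: '<UNK>', 1: '<START>', 2: '<END>', 3: 'scores', 4: 'josiah'}
print(decode_step(3, vocab_size=4, extended_vocab=toy_extended_vocab))
print(decode_step(4, vocab_size=4, extended_vocab=toy_extended_vocab))
```

Feeding the fallback index is a practical necessity here, because the decoder's embedding layer only covers the original vocabulary.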

In [43]:
def evaluateRandomly(encoder, decoder, source_seqs, target_seqs=None, n=1):
    '''
    randomly selects a source sequence from a list of sequences and prints the decoded text
    Args:
        encoder: trained encoder model
        decoder: trained decoder model
        source_seqs: List of sequences in text form
        target_seqs: List of corresponding target sequences in text form, default None
        n: int, Number of random selections from source_seqs, default 1
    '''
    for i in range(n):
        idx = random.choice(range(len(source_seqs)))
        source_seq = source_seqs[idx]
        source_inp = [word2Index_enc[word] if word in word2Index_enc.keys() else word2Index_enc['<UNK>'] for word in source_seq.split()]      
        source_tensor = torch.tensor(source_inp,dtype=torch.long,device=device)
        output_words, attentions = evaluate(encoder, decoder, source_tensor)
        output_seq = ' '.join(output_words)
        
        print('   SOURCE: ',source_seq)
        if target_seqs is not None:
            target_seq = target_seqs[idx]
            print('   ACTUAL: ',target_seq)
        print('PREDICTED: ',output_seq)
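
The conversion from a source string to encoder indices, with `<UNK>` as the fallback for unknown words, can be tried on its own. The helper name `encode_sentence` and the toy vocabulary below are assumptions for illustration. The notebook performs the same lookup inline with `word2Index_enc`.

```python
def encode_sentence(sentence, word2index, unk_token='<UNK>'):
    """Map whitespace-separated words to indices, using unk_token for unknown words."""
    unk_index = word2index[unk_token]
    return [word2index.get(word, unk_index) for word in sentence.split()]


# Toy encoder vocabulary; the real one is built from the training sources.
toy_word2index = {'<UNK>': 0, 'name': 1, 'gender': 2, 'male': 3, 'math': 4}
indices = encode_sentence('name josiah gender male math 53', toy_word2index)
print(indices)
```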

In [46]:
train_Iters(rnn_encoder,rnn_decoder,1000, 10)

Iteration: 0, Train Loss: 0.5547, Val Loss: 0.5284
   SOURCE:  name josiah gender male math 53 reading 44 writing 42
   ACTUAL:  <START> josiah scores were really poor and he needs to put in a lot more effort in all three sections. <END>
PREDICTED:  josiah 53 42 53 42 42 42 josiah 42 male 42 42 42 42 male 42 53 53 gender gender 53 42 42

Iteration: 10, Train Loss: 4.7632, Val Loss: 4.6299
   SOURCE:  name lincoln gender male math 57 reading 56 writing 57
   ACTUAL:  <START> lincoln scores were average at best and he should work harder in math.  <END>
PREDICTED:  lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln lincoln
Iteration: 20, Train Loss: 4.2701, Val Loss: 4.6575
   SOURCE:  name hunter gender female math 33 reading 41 writing 43
   ACTUAL:  <START> hunter failed in math and she will have to repeat the course, she nearly failed in other two sections as well. <END>
PREDICTED:  hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter hunter
Iteration: 30, Train Loss: 4.2032, Val Loss: 4.6274
   SOURCE:  name elias gender male math 45 reading 37 writing 37
   ACTUAL:  <START> elias nearly failed 