<h1><center>Title Generation </center></h1>
<h2><center>Sequence-to-Sequence Text Summarization for Academic Journal Articles </center></h2>
<center>Karina Huang, Abhimanyu Vasishth, Phoebe Wong </center>
<center>AC209b: Advanced Topics in Data Science </center>
<center>Spring 2019 </center>

---

In [20]:
#import package dependencies
import re
import sys
import glove
import random
import numpy as np
import pandas as pd
import pickle
from keras.utils import np_utils
from collections import Counter
from keras.models import load_model
from sklearn.model_selection import train_test_split
from tensorflow.contrib import keras
from keras.layers import Bidirectional, Dropout, Dense,LSTM,Input,Activation,Add,TimeDistributed,\
Permute,Flatten,RepeatVector,merge,Lambda,Multiply,Reshape, Concatenate, Dot
from keras.layers.wrappers import TimeDistributed
from keras.preprocessing.sequence import pad_sequences
from keras.layers.embeddings import Embedding
from keras.models import Sequential,Model
from keras.optimizers import RMSprop
from keras import backend as K
import tensorflow as tf
import warnings
warnings.filterwarnings("ignore")

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')


## 1. Introduction

In [None]:
#abhi

## 2. Baseline Model - Nearest Neighbors with TF-IDF

In [None]:
#abhi

---

## 3. Data Preprocessing for Recurrent Neural Network (RNN)

### 3.1 Data Cleaning
The original [NIPS dataset](https://www.kaggle.com/benhamner/nips-2015-papers/version/2) acquired from Kaggle included XXXX journal articles, most of which were missing abstracts. Our first attempt to clean the data was to find and extract abstracts for articles missing the piece of information. The `getAbstract` code performs the search for abstract in two steps. These two steps were results of ad-hoc identification of abstract extractions and recovered 3,250 abstracts. We removed all articles missing abstracts and formatted the text for data cleaning. Due to computational constraints, we subsetted articles with abstract of length 250 in words for the current study. The final dataset used included 4,638 observations without missing data in title or abstract. 

In [2]:
def formatText(x):
    "render space in text and text to lowercase"
    for i in range(len(x)):
        #check for data type
        if type(x[i]) == str:
            try:
                x[i] = x[i].replace('\n', ' ').lower()
            except:
                x[i] = x[i].lower()
    return x

def getAbstract(paper_text, methods = 1):
    "extract abstract from text in two steps"
    #step 1:
    #find 'abstract' in text
    #find the next word/phrase in all cap, wrapped in '\n'
    #extract everything in between as abstract
    if methods == 1:
        try:
            #find abstract
            a1 = re.search('abstract\n', paper_text, re.IGNORECASE)
            paper_text = paper_text[a1.end():]
            #find the next section in all cap
            a2 = re.search(r'\n+[A-Z\s]+\n', paper_text)
            return paper_text[: a2.start()]
        except:
            return np.nan
    #step 2:
    #find abstract in text
    #find next item wrapped between '\n\n' and '\n\n'
    #extract everything in between as abstract
    if methods == 2:
        try:
            a1 = re.search('abstract\n', paper_text, re.IGNORECASE)
            paper_text = paper_text[a1.end():]
            #find the next section in all cap
            a2 = re.search(r'\n\n+.+\n\n', paper_text)
            return paper_text[: a2.start()]
        except:
            return np.nan


def preprocessing(papers, formatCols = ['title', 'abstract','paper_text'], dropnan = False):
    "preliminary data preprocessing for model fitting"
    #avoid modifying original dataset
    papersNew = papers.copy()
    #replace missing values with nan
    papersNew.abstract = papersNew.abstract.apply(lambda x: np.nan if x == 'Abstract Missing' else x)
    #extract missing abstract in two steps
    #steps identified by ad-hoc examination of missing values
    for m in [1, 2]:
        #try searching for abstract in text if value is missing
        papersNew['abstract_new'] = papersNew.paper_text.apply(lambda x: getAbstract(x, methods = m))
        #replace nan in abstract with extracted abstract
        papersNew.loc[papersNew.abstract.isnull(), 'abstract'] = papersNew.abstract_new
        papersNew.drop(['abstract_new'], axis = 1, inplace = True)
    #format columns of interest
    papersNew[formatCols] = papersNew[formatCols].apply(lambda x: formatText(x), axis = 1)
    if dropnan:
        #drop na in abstract
        papersNew = papersNew.dropna(subset = ['abstract'])
        #append abstract and title length to data frame
        papersNew ['aLen'] = papersNew.abstract.apply(lambda x: len(x.split(' ')))
        papersNew ['tLen'] = papersNew.title.apply(lambda x: len(x.split(' ')))
    return papersNew

In [3]:
#load data
data = pd.read_csv('../data/papers.csv')
#preprocessing
dataNew = preprocessing(data, dropnan = True)
#subset articles with a length less than or equal to 250
data250 = dataNew[dataNew.aLen <= 250]
print('Final Dataset Used')
print('==================')
print('Number of observations: ', data250.shape[0])
print('Maximum title length: ', data250.tLen.max())

Final Dataset Used
Number of observations:  4638
Maximum title length:  20


### 3.1 Data Preprocessing for Model Training

After cleaning the data, we processed the titles and abstracts using `processText` below. The class object records and returns:

* number of unique words
* maximum sequence length (should be 250 as titles are shorter than abstracts)
* dictionaries for tokenization
* tokenized vector of titles and abstracts

Note that an important step of our tokenization was the qualification of rare, or unwanted, words. This is because NIPS articles are often written in laTex, and included scientific equations that may compromise the learning of important words. We approximated patterns of unwanted words and replaced them with an `<ign>` tag in the tokenization process. This resulted in 32,468 unique words in our dataset.

In [4]:
def qualify(word):
    '''helper function to select words for tokenization.'''
    #symbols
    symbols = """/?~`!@#$%^&*()_-+=|\{}[];<>"'.,:"""
    #abbreviations
    abb = """e.g.,i.e.,etal.,"""
    #disqualify empty space and words starting with symbol
    if len(word) < 1 or word[0] in symbols:
        return False
    elif len(word) > 2:
        #disqualify abbreviations
        if word in abb:
            return False
        #otherwise count all combinations with length > 2
        else:
            return True
    #if input length is one
    #count only if it is 'a'
    elif len(word) == 1:
        if word in ['a', 'i']:
            return True
    #with input length of 2
    #disqualify those with a symbol as the second character
    elif len(word) == 2:
        if word[1] not in symbols:
            return True
    #otherwise disqualify input
    else:
        return False

class processText:
    '''
    class object for data processing preperation for embedding training.

    Parameters:
    ===========
    1) textVec: list of array-like, vector of text in strings

    Methods:
    ===========
    1) updateMaxLen: count and update maximum sequence length
    2) getDictionary: update dictionaries of words and tokens,
        function called in `tokenize`
    3) tokenize: return tokenized vector of text for model training
    '''
    def __init__(self, textVec):

        #initiate class object
        self.textVec = list()
        for vec in textVec:
            #string to list
            vec = [x.strip().split(' ') for x in vec]
            self.textVec.append(vec)

        #prep  dictionaries for update
        self.word2idx = dict()
        self.idx2word = dict()
        self.maxLen = 0
        self.nUnique = 0

    def updateMaxLen(self):
        for vec in self.textVec:
            for txt in vec:
                #get length of sequence
                cntLen = len(txt)
                #update maximum sequence length
                if self.maxLen < cntLen:
                    self.maxLen = cntLen

    def getDictionary(self):

        if len(self.word2idx) != 0:
            print("Dictionary already updated.")

        else:
            #initiate dictionary updates
            #pad with 0
            #end of sequence as 1
            #ignored/disqualified words as 2
            #start tokenization at 3
            pad = 0
            eos = 1
            ign = 2
            start = 3

            self.word2idx['_'] = pad
            self.word2idx['*'] = eos
            self.word2idx['<ign>'] = ign

            for vec in self.textVec:
                for txt in vec:
                    for w in txt:
                        if qualify(w) == True:
                            if w not in self.word2idx.keys():
                                self.word2idx.update({w: start})
                                start += 1

            #update number of unique words in data set
            self.nUnique = start - 3
            #update idx to word dictionary
            self.idx2word = dict((idx,word) for word,idx in self.word2idx.items())

    def tokenize(self):
        #get dictionaries if function hasn't been called
        if len(self.word2idx) == 0:
            self.getDictionary()
        #cache list for tokenization
        tokenizedVec = list()
        for i in range(len(self.textVec)):
            vec = self.textVec[i]
            #cache list for the tokenized vector
            tempVec = list()
            for txt in vec:
                #cache list for sequence
                sVec = list()
                for w in txt:
                    #if word is in dictionary, tokenize
                    if w in self.word2idx:
                        sVec.append(self.word2idx[w])
                    #if word not in dictionary, tag as ignored
                    else:
                        sVec.append(self.word2idx['<ign>'])
                tempVec.append(sVec)
            tokenizedVec.append(tempVec)

        return tokenizedVec

In [5]:
#tokenize data
prep = processText(data250[['title', 'abstract']].values.T)
#update sequence length
prep.updateMaxLen()
#get dictionaries of word and tags
prep.getDictionary()
word2idx = prep.word2idx
idx2word = prep.idx2word

print('Number of unique words: ', prep.nUnique)
print('Maxmimum sequence length: ', prep.maxLen)
print('='*110)

#get tokenized vector of text
txtTokenized = prep.tokenize()
titles = txtTokenized[0]
abstracts = txtTokenized[1]
print('Example of tokenized title:\n {0} => {1}'.format(titles[0], [prep.idx2word[i] for i in titles[0]]))
print('='*110)
print('Example of tokenized abstract:\n {0} => {1}'.format(abstracts[0],[prep.idx2word[i] for i in abstracts[0]]))

Number of unique words:  32468
Maxmimum sequence length:  250
Example of tokenized title:
 [3, 4, 5, 6, 7, 8, 9] => ['self-organization', 'of', 'associative', 'database', 'and', 'its', 'applications']
Example of tokenized abstract:
 [42, 466, 64, 4, 580, 5, 5497, 431, 5498, 5499, 51, 9, 19, 321, 5500, 5501, 58, 5498, 5497, 176, 5502, 3251, 503, 51, 309, 5503, 75, 58, 619, 5504, 1743, 4, 5505, 42, 61, 4, 3, 431, 5506, 368, 42, 1019, 4, 5507, 1727, 5508, 10, 289, 4072, 4, 21, 5509, 75, 58, 5510, 5504, 5511, 42, 5512, 19, 187, 1181, 92, 7, 122, 19, 42, 319, 320, 321, 159, 1391, 5513] => ['an', 'efficient', 'method', 'of', 'self-organizing', 'associative', 'databases', 'is', 'proposed', 'together', 'with', 'applications', 'to', 'robot', 'eyesight', 'systems.', 'the', 'proposed', 'databases', 'can', 'associate', 'any', 'input', 'with', 'some', 'output.', 'in', 'the', 'first', 'half', 'part', 'of', 'discussion,', 'an', 'algorithm', 'of', 'self-organization', 'is', 'proposed.', 'from', 'an', 

Finally, we split our dataset into training ($\approx$72\%), validation ($\approx$8\%) and test set ($\approx$20\%). For consistency, we set the random state to 209.

In [6]:
#split data into train, validation, and test set
trainX, testX, trainY, testY = train_test_split(abstracts, titles, test_size = 0.2 , random_state = 209)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size = 0.1 , random_state = 209)

print('Number of training samples: ', len(trainX))
print('Number of validation samples: ', len(valX))
print('Number of test samples: ', len(testX))

Number of training samples:  3339
Number of validation samples:  371
Number of test samples:  928


### 3.2 Data Generator for Model Training

Because we need to feed input of consistent shape for model training, our past step of data preprocessing entails creating a generator that pads sequences to the maximum defined length. The code below was adapted from the [Computefest NLP workshop](https://github.com/Harvard-IACS/2019-computefest/blob/master/Friday/train_model.ipnb.ipynb). Due to computational constraint, we could not train our model with batch size larger than 32. 

In [16]:
#params for model training
seed = 209
p_W, p_U, p_dense, p_emb, weight_decay = 0, 0, 0, 0, 0
LR = 1e-4
batch_size = 32
num_train_batches = len(trainX) // batch_size
num_val_samples = len(valX) + len(trainX) - batch_size*num_train_batches
num_val_batches = len(valX) // batch_size
total_entries = (num_train_batches + num_val_batches)*batch_size
#number of unique tags
nUnique = len(word2idx)
#maximum length for title
tMaxLen = 250
#maximum length for abstract
aMaxLen = 250
#total maximum length
maxlen = tMaxLen + aMaxLen
batch_norm=False
embeddDim = embeddMatrix.shape[1]
nUnique = embeddMatrix.shape[0]
hidden_units= embeddDim

learning_rate = 0.002
clip_norm = 1.0

#padding function for abstracts
def padAbstract(x, maxL = aMaxLen, dictionary = word2idx):
    '''pad sequence for abstract'''
    n = len(x)
    #this section shouldn't apply 
    #because we subsetted our data
    #so that the maximum sequence length is 250
    if n > maxL:
        x = x[-maxL:]
        n = maxL
    return [dictionary['_']]*(maxL - n) + x + [dictionary['*']]

#build generator for model
def generator(trainX, trainY, batch_size = batch_size, 
              nb_batches = None, model = None, seed = seed):
    '''randomly shuffle input data'''
    c = nb_batches if nb_batches else 0
    while True:
        titles = list()
        abstracts = list()
        if nb_batches and c >= nb_batches:
            c = 0
        new_seed = random.randint(0, sys.maxsize)
        random.seed(c+123456789+seed)
        
        for b in range(batch_size):
            a = random.randint(0,len(trainX)-1)
            
            #random shuffling of data
            abstract = trainX[a]
            s = random.randint(min(aMaxLen,len(abstract)), max(aMaxLen,len(abstract)))
            abstracts.append(abstract[:s])
            
            title = trainY[a]
            s = random.randint(min(tMaxLen,len(title)), max(tMaxLen,len(title)))
            titles.append(title[:s])

        # undo the seeding before we yield in order not to affect the caller
        c+= 1
        random.seed(new_seed)

        yield conv_seq_labels(abstracts, titles)

#pad sequence and convert title to labels
def conv_seq_labels(abstracts, titles, nflips = None, model = None, dictionary = word2idx):
    """Abstract and titles are converted to padded input vectors. Titles are one-hot encoded to labels."""
    batch_size = len(titles)
    #pad sequence
    x = [padAbstract(a)+t for a,t in zip(abstracts, titles)] 
    x = pad_sequences(x, maxlen = maxlen, value = dictionary['_'], 
                               padding = 'post', truncating = 'post')
    
    #one-hot encode titles for training
    y = np.zeros((batch_size, tMaxLen, nUnique))
    for i, it in enumerate(titles):
        it = it + [dictionary['*']] + [dictionary['_']]*tMaxLen  
        it = it[:tMaxLen]
        y[i,:,:] = np_utils.to_categorical(it, nUnique)
        
    return [x[:,:aMaxLen],x[:,aMaxLen:]], y

In [17]:
#demonstrate generator
demo = next(generator(trainX, trainY, batch_size = batch_size))
print('Encoder Input Shape: ', demo[0][0].shape)
print('Decoder Input Shape: ', demo[0][1].shape)
print('One-hot encoded title shape: ', demo[1].shape)
print('='*110)
print("Padded Abstract:\n", [idx2word[i] for i in demo[0][0][1]])
print('='*110)
print("Padded Title:\n", [idx2word[i] for i in demo[0][1][1]])

Encoder Input Shape:  (32, 250)
Decoder Input Shape:  (32, 250)
One-hot encoded title shape:  (32, 250, 32471)
Padded Abstract:
 ['_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', '_', 'with', 'the', 'increase', 'in', 'available', 'data', 'parallel', 'machine', 'learning', 'has', '<ign>', '<ign>', 'become', 'an', 'increasingly', 'pressing', 'problem.', 'in', 'this', 'paper', 'we', 'present', '<ign>', '<ign>', 'the', 'first', 'parallel', 'stochastic', 'gradient'

## 4. Word Embeddings

In [None]:
#embedding karina
#word2vec phoebe
#glove pre-trained abhi

### 4.3 Self-Trained GloVe Embeddings

One motivation to train our own embedding is that pre-trained word embeddings may not capture well the similarities and co-occurances between words in the current dataset. Because academic journal article come with many technical terms, it is possible that weights using pre-trained embeddings do not apply to these texts. As mentioned above, only half of the unique words in the current dataset were found in the pre-trained GloVe embeddings. Therefore, we experimented with training our own embedding matrix, in hopes that the initialized weights would help our models learn to summarize the abstracts more effectively. Note that the embedding matrix is only trained on the training examples split using the code in the data preprocessing section. This is because realistically we do not have access to the hold-out sets. Again, the trained embedding matrix dimension is (32471, 100); the height corresponds to the sum of the number of unique words and the number of special tags (`<eos>`, `<ign>`, `<pad>`).

In [11]:
#get text for training
#remove ignored/disqualified words
#because we do not want to learn this tag
embedd_trainX = [[idx2word[x] for x in v if x != 2] for v in trainX]
embedd_trainY = [[idx2word[x] for x in v if x != 2] for v in trainY]
embeddTxt = trainX + trainY

#prep dictionary for embedding training
#drop pad, eos, and ignore tag
start = 0
embeddDict = dict()
for vec in embeddTxt:
    for w in vec:
        if w not in embeddDict.keys():
            embeddDict[w] = start
            start += 1

print('Number of unique words for embedding training: ', len(embeddDict))

Number of unique words for embedding training:  27141


In [12]:
#the code below were used for embedding training
#please see rnn_preprocessing.ipynb for the execution history

# #train glove embedding 
# #creating a corpus object
# corpus_ = glove.Corpus(dictionary = embeddDict) 
# #training the corpus to generate the co occurence matrix which is used in GloVe
# corpus_.fit(embeddTxt, window = 10)
# #train embedding using corpus weight matrix created above 
# glove_ = glove.Glove(no_components = 100, learning_rate = 0.01, random_state = 209)
# glove_.fit(corpus_.matrix, epochs=50, no_threads=10, verbose = True)
# glove_.add_dictionary(corpus_.dictionary)

# #embedding matrix
# #initiate a matrix with shape 
# #(number of unique words in our dataset, latent dimension of embedding)
# embeddMatrix = np.zeros((len(word2idx), 100))
# #loop through trained embedding matrix to find weights of trained words
# for i, w in enumerate(word2idx):
#     try:
#         embeddVec = glove_.word_vectors[glove_.dictionary[w]]
#         embeddMatrix[i] = embeddVec
#     except:
#         continue

In [13]:
#load trained glove model
glove_ = glove.Glove.load('rnn_training_history/glove_.model')
#display 10 words most similar to 'stochastic' by glove training
print('Top 10 words most similar to "stochastic"')
print('=========================================')
for (i,j) in glove_.most_similar(word = 'stochastic', number = 10):
    print("word: {0} | cosine similarity: {1}".format(i, j))

Top 10 words most similar to "stochastic"
word: gradient | cosine similarity: 0.8991909690378893
word: three-composite | cosine similarity: 0.8937557315973716
word: descent | cosine similarity: 0.8744804215977376
word: proximal | cosine similarity: 0.76864338053953
word: optimization | cosine similarity: 0.7646848913256339
word: descent. | cosine similarity: 0.7640788665473602
word: accelerated | cosine similarity: 0.7629822276705301
word: gradients | cosine similarity: 0.7624208397216201
word: projected | cosine similarity: 0.7623820982466453


## 5. RNN Model 

The illustration below shows the first Recurrent Neural Network model we trained. The encoder takes in the tokenized abstract, encodes it with one of the embedding matrix trained above, and learns the input text with a bidirectional LSTM. We chose the bidirectional LSTM in hope to better learn the syntax. The output bidirectional LSTM weights are then used to initialize states for LSTM learning of the title, which is encoded using the same embedding matrix choice. We did not think it is necessary to train the title using a bidirectional LSTM because presumably the information is already learned with the abstract, assuming that the combination of words in a title is well-represented in the respective abstract. This decoder LSTM output is then passed into a time-distributed dense layer, from which we get a vector of probabilities for each unique words in the full dataset.  

![](rnn_1.png)

In [24]:
#rnn model 
def getRNNModel(genTrain, genVal, embeddMatrix,
                learning_rate, clip_norm, nUnique,
                embeddDim, hidden_units, encoder_shape = aMaxLen,
                decoder_shape = tMaxLen):
    
    '''
    compile RNN Model.
    
    Parameters:
    ===========
    1) genTrain: training sample generator
    2) genVal: validation sample generator
    3) embeddMatrix: embedding matrix of choice, shape (32471, 100)
    4) learning_rate
    5) clip_norm
    6) nUnique: number of unique words in dataset
    7) embeddDim: number of latent features in embedding matrix
    8) hidden_units: number of hidden units for layer
    9) encoder_shape: should be the maximum length of abstract
    10) decoder_shape: maximum length of title, we padded titles to 
        the same length as encoder_shape
    
    Returns:
    ===========
    compiled rnn model
    '''
    #ENCODER
    #input shape as the vector of sequence, with length padded to 250
    encoder_inputs = Input(shape = (encoder_shape, ), name = 'encoder_input')

    #encode input with embedding layer
    encoder_embedding = Embedding(nUnique, embeddDim,
                                  input_length = encoder_shape,
                                  weights = [embeddMatrix],
                                  mask_zero = True,
                                  name = 'encoder_embedd')(encoder_inputs)

    #1-layer bidirectional LSTM
    #add drop out for regularization
    #return only states
    encoder_lstm = Bidirectional(LSTM(hidden_units, dropout_U = 0.2,
                                      dropout_W = 0.2 , return_state=True),name = 'encoder_bilstm')

    #get states from Bi-LSTM
    encoder_outputs, f_h, f_c, b_h, b_c = encoder_lstm(encoder_embedding)

    #add final states together
    #to initialize weights for decoder
    state_hfinal=Add(name = 'add_hidden_states')([f_h, b_h])
    state_cfinal=Add(name = 'add_cell_states')([f_c, b_c])

    #save encoder states
    encoder_states = [state_hfinal,state_cfinal]

    #DECODER
    decoder_inputs = Input(shape = (decoder_shape, ), name = 'decoder_input')

    #encode decoder input with embedding matrix
    decoder_embedding = Embedding(nUnique, embeddDim,
                                  input_length = decoder_shape,
                                  weights = [embeddMatrix],
                                  mask_zero = True,
                                  name = 'decoder_embedd')

    #1-layer lstm
    decoder_lstm = LSTM(hidden_units,return_sequences = True, return_state=True, name = 'decoder_lstm')

    #save decoder outputs
    decoder_outputs, s_h, s_c = decoder_lstm(decoder_embedding(decoder_inputs), 
                                             initial_state = encoder_states)
    # decoder_dense = Dense(decoder_shape, activation='linear')

    #time distributed layer, probability predictions for all unique words
    decoder_time_distributed = TimeDistributed(Dense(nUnique),name = 'decoder_timedistributed')
    decoder_activation = Activation('softmax', name = 'decoder_activation')
    decoder_outputs = decoder_activation(decoder_time_distributed(decoder_outputs))

    #MODEL
    model = Model(inputs = [encoder_inputs,decoder_inputs], outputs = decoder_outputs)
    rmsprop = RMSprop(lr = learning_rate, clipnorm = clip_norm)
    model.compile(loss = 'categorical_crossentropy', optimizer = rmsprop)
    return model

In [25]:
#generator for training and validation
genTrain = generator(trainX, trainY, batch_size = batch_size)
genVal =  generator(valX, valY, nb_batches = len(valX)// batch_size, batch_size = batch_size)
#load embedding matrix
#this corresponds to the self-trained glove embedding
#for now we will use it for demo
embeddMatrix = np.load('rnn_training_history/embeddMatrix.npy')
#compile rnn model
K.clear_session()
rnn = getRNNModel(genTrain, genVal, embeddMatrix,
                  learning_rate, clip_norm, nUnique,
                  embeddDim, hidden_units)
#output model summary
rnn.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      (None, 250)          0                                            
__________________________________________________________________________________________________
encoder_embedd (Embedding)      (None, 250, 100)     3247100     encoder_input[0][0]              
__________________________________________________________________________________________________
decoder_input (InputLayer)      (None, 250)          0                                            
__________________________________________________________________________________________________
encoder_bilstm (Bidirectional)  [(None, 200), (None, 160800      encoder_embedd[0][0]             
__________________________________________________________________________________________________
decoder_em

## 6. RNN Model with Attention and Context Mechanism

We explored the use of an attention mechanism in our RNN model after training the above model. Intuitively, this mechanism learns where the model should pay attention to in a sequence. The change to the original RNN model is relatively minor, where instead of using the decoder LSTM output for predictions, we combine the states output by the bidirectional LSTM from the encoder with that from the decoder LSTM. However, because the dimensionalities of outputs need to match for the attention mechanism, we chose to pass the forward LSTM output from the encoder into the attention layer for further learning. However, note that the states from decoder is still used for initialization of the decoder LSTM layer. The flowchart below shows a simple demonstration of the learning steps of the model incorporating attention and context mechanism. The operation of the mechanism is also fairly simple: 

* combine states from encoder forward LSTM and decoder LSTM through dot product, which comprises the attention weights
* applies attention weights to encoder outputs through another matrix multiplication for learning with attention
* combine attention outputs with decoder outputs for prediction

The prediction is again passed into a time-distributed dense layer for probability predictions of each unique word at each index of a sequence. 

![](rnn_2.png)

In [26]:
def getAttentionModel(genTrain, genVal, embeddMatrix,
                      learning_rate, clip_norm, nUnique,
                      embeddDim, hidden_units, encoder_shape = aMaxLen,
                      decoder_shape = tMaxLen):

    '''
    RNN Model with added Attention/Context Mechanism.
    Attention model code reference @ https://github.com/wanasit/katakana.git
    '''

    #ENCODER
    #input shape as the vector of sequence, with length padded to 250
    encoder_inputs = Input(shape = (encoder_shape, ), name = 'encoder_input')

    #encode input with embedding layer
    #do not mask 0s because the attention layer does not allow this
    encoder_embedding = Embedding(nUnique, embeddDim,
                                  input_length = encoder_shape,
                                  weights = [embeddMatrix],
                                  mask_zero = True,
                                  name = 'encoder_embedd')(encoder_inputs)

    #forward
    encoder_lstm = LSTM(hidden_units, dropout_U = 0.2, dropout_W = 0.2, 
                        return_sequences = True, return_state=True, name = 'encoder_forward_lstm')
    encoder_lstm_rev = LSTM(hidden_units, dropout_U = 0.2, dropout_W = 0.2,
                            go_backwards = True, return_sequences = True, 
                            return_state=True, name = 'encoder_backward_lstm')

    #get states from LSTM
    encoder_outputs_f, h_f, c_f = encoder_lstm(encoder_embedding)
    encoder_outputs_r, h_r, c_r = encoder_lstm_rev(encoder_embedding)

    #save encoder states
    state_hfinal=Add()([h_f, h_r])
    state_cfinal=Add()([c_f, c_r])

    #save encoder states
    encoder_states = [state_hfinal,state_cfinal]

    #DECODER
    decoder_inputs = Input(shape = (decoder_shape, ), name = 'decoder_input')

    #encode decoder input with embedding matrix
    decoder_embedding = Embedding(nUnique, embeddDim,
                                  input_length = decoder_shape,
                                  weights = [embeddMatrix],
                                  mask_zero = True,
                                  name = 'decoder_embedd')
    
    #1-layer lstm
    decoder_lstm = LSTM(hidden_units,return_sequences = True, return_state=True, name = 'decoder_lstm')

    #save decoder outputs
    decoder_outputs, s_h, s_c = decoder_lstm(decoder_embedding(decoder_inputs), initial_state = encoder_states)
  
    #ATTENTION
    attention = Dot(axes = [2,2], name = 'attention')([decoder_outputs, encoder_outputs_f])
    attention = Activation('softmax')(attention)
    context = Dot(axes = [2,1], name = 'context')([attention, encoder_outputs_f])
    decoder_combined_context = Concatenate(name = 'decoder_added_attention')([context, decoder_outputs])


    #time distributed layer, probability predictions for all unique words
    decoder_time_distributed = TimeDistributed(Dense(nUnique), name = 'decoder_timedistributed')
    decoder_activation = Activation('softmax', name = 'decoder_activation')
    decoder_outputs = decoder_activation(decoder_time_distributed(decoder_combined_context))

    #MODEL
    model = Model(inputs = [encoder_inputs,decoder_inputs], outputs = decoder_outputs)
    rmsprop = RMSprop(lr = learning_rate, clipnorm = clip_norm)
    model.compile(loss = 'categorical_crossentropy',optimizer = rmsprop)
    return model

In [27]:
#compile rnn model
K.clear_session()
attention = getAttentionModel(genTrain, genVal, embeddMatrix,
                              learning_rate, clip_norm, nUnique,
                              embeddDim, hidden_units)
#output model summary
attention.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_input (InputLayer)      (None, 250)          0                                            
__________________________________________________________________________________________________
encoder_embedd (Embedding)      (None, 250, 100)     3247100     encoder_input[0][0]              
__________________________________________________________________________________________________
decoder_input (InputLayer)      (None, 250)          0                                            
__________________________________________________________________________________________________
encoder_forward_lstm (LSTM)     [(None, 250, 100), ( 80400       encoder_embedd[0][0]             
__________________________________________________________________________________________________
encoder_ba

## 7. Results

In [None]:
#phoebe

## 8. Discussion

Incorporating the attention mechanism in our model appeared to have rendered more diverse predictions than the model without such mechanism. However, the predictions are still far from perfect. One interpretation is that perhaps the discrepancy in sequence length in the current model is too large from the attention mechanism to have an effect. Because our titles are all within 20 word of length, we have a lot of paddings in the decoder inputs. Moreover, our abstract sequence is front-padded, whereas the title sequence is back-padded. Perhaps that this design also compromised the performance of the attention mechanism. Another possible caveat in our implementation of the attention mechanism is that we intialized the states of the decoder with encoder states. Perhaps that with the attention mechanism, this step is not necessary as we are combining the outputs in the attention layer. If the attention mechanism is indeed not performing well due to an excessive amount of padding in title, one idea to improve the attention mechanism is to employ a convolutional neural network with pooling to downplay the padding weights. Alternatively, the current attention mechanism might have worked best for translation tasks, where sequences are of roughly equal lengths, and we should explore other attention mechanisms that apply better to summarization tasks.  

In addition to the two models reported above, we also explored using bidirectional layer for the decoder. Our original thought was that it would not be necessary to employ such layer in the decoder part; in fact, using a bidirectional layer for the decoder appeared to break the syntax learning. Our models using bidirectional LSTM in decoder (with and without attention) predicted more diverse words in comparison to the RNN model without attention above. However, the words were mostly disconnected, yielding more of a keyword prediction than an actual sentence prediction. 

In [None]:
#karina
#abhi
#PHOEBE