<h1><center>Title Generation </center></h1>
<h2><center>Sequence-to-Sequence Text Summarization for Academic Journal Articles </center></h2>
<center>Karina Huang, Abhimanyu Vasishth, Phoebe Wong </center>
<center>AC209b: Advanced Topics in Data Science </center>
<center>Spring 2019 </center>

---

In [1]:
#import package dependencies
import re
import glove
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split

from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

## 1. Introduction

In [None]:
#abhi

## 2. Baseline Model - Nearest Neighbors with TF-IDF

In [None]:
#abhi

---

## 3. Data Preprocessing for Recurrent Neural Network (RNN)

### 3.1 Data Cleaning
The original [NIPS dataset](https://www.kaggle.com/benhamner/nips-2015-papers/version/2) acquired from Kaggle included XXXX journal articles, most of which were missing abstracts. Our first attempt to clean the data was to find and extract abstracts for articles missing the piece of information. The `getAbstract` code performs the search for abstract in two steps. These two steps were results of ad-hoc identification of abstract extractions and recovered 3,250 abstracts. We removed all articles missing abstracts and formatted the text for data cleaning. Due to computational constraints, we subsetted articles with abstract of length 250 in words for the current study. The final dataset used included 4,638 observations without missing data in title or abstract. 

In [2]:
def formatText(x):
    "render space in text and text to lowercase"
    for i in range(len(x)):
        #check for data type
        if type(x[i]) == str:
            try:
                x[i] = x[i].replace('\n', ' ').lower()
            except:
                x[i] = x[i].lower()
    return x

def getAbstract(paper_text, methods = 1):
    "extract abstract from text in two steps"
    #step 1:
    #find 'abstract' in text
    #find the next word/phrase in all cap, wrapped in '\n'
    #extract everything in between as abstract
    if methods == 1:
        try:
            #find abstract
            a1 = re.search('abstract\n', paper_text, re.IGNORECASE)
            paper_text = paper_text[a1.end():]
            #find the next section in all cap
            a2 = re.search(r'\n+[A-Z\s]+\n', paper_text)
            return paper_text[: a2.start()]
        except:
            return np.nan
    #step 2:
    #find abstract in text
    #find next item wrapped between '\n\n' and '\n\n'
    #extract everything in between as abstract
    if methods == 2:
        try:
            a1 = re.search('abstract\n', paper_text, re.IGNORECASE)
            paper_text = paper_text[a1.end():]
            #find the next section in all cap
            a2 = re.search(r'\n\n+.+\n\n', paper_text)
            return paper_text[: a2.start()]
        except:
            return np.nan


def preprocessing(papers, formatCols = ['title', 'abstract','paper_text'], dropnan = False):
    "preliminary data preprocessing for model fitting"
    #avoid modifying original dataset
    papersNew = papers.copy()
    #replace missing values with nan
    papersNew.abstract = papersNew.abstract.apply(lambda x: np.nan if x == 'Abstract Missing' else x)
    #extract missing abstract in two steps
    #steps identified by ad-hoc examination of missing values
    for m in [1, 2]:
        #try searching for abstract in text if value is missing
        papersNew['abstract_new'] = papersNew.paper_text.apply(lambda x: getAbstract(x, methods = m))
        #replace nan in abstract with extracted abstract
        papersNew.loc[papersNew.abstract.isnull(), 'abstract'] = papersNew.abstract_new
        papersNew.drop(['abstract_new'], axis = 1, inplace = True)
    #format columns of interest
    papersNew[formatCols] = papersNew[formatCols].apply(lambda x: formatText(x), axis = 1)
    if dropnan:
        #drop na in abstract
        papersNew = papersNew.dropna(subset = ['abstract'])
        #append abstract and title length to data frame
        papersNew ['aLen'] = papersNew.abstract.apply(lambda x: len(x.split(' ')))
        papersNew ['tLen'] = papersNew.title.apply(lambda x: len(x.split(' ')))
    return papersNew

In [3]:
#load data
data = pd.read_csv('../data/papers.csv')
#preprocessing
dataNew = preprocessing(data, dropnan = True)
#subset articles with a length less than or equal to 250
data250 = dataNew[dataNew.aLen <= 250]
print('Final Dataset Used')
print('==================')
print('Number of observations: ', data250.shape[0])
print('Maximum title length: ', data250.tLen.max())

Final Dataset Used
Number of observations:  4638
Maximum title length:  20


### 3.1 Data Preprocessing for Model Training

After cleaning the data, we processed the titles and abstracts using `processText` below. The class object records and returns:

* number of unique words
* maximum sequence length (should be 250 as titles are shorter than abstracts)
* dictionaries for tokenization
* tokenized vector of titles and abstracts

Note that an important step of our tokenization was the qualification of rare, or unwanted, words. This is because NIPS articles are often written in laTex, and included scientific equations that may compromise the learning of important words. We approximated patterns of unwanted words and replaced them with an `<ign>` tag in the tokenization process. This resulted in 32,468 unique words in our dataset.

In [4]:
def qualify(word):
    '''helper function to select words for tokenization.'''
    #symbols
    symbols = """/?~`!@#$%^&*()_-+=|\{}[];<>"'.,:"""
    #abbreviations
    abb = """e.g.,i.e.,etal.,"""
    #disqualify empty space and words starting with symbol
    if len(word) < 1 or word[0] in symbols:
        return False
    elif len(word) > 2:
        #disqualify abbreviations
        if word in abb:
            return False
        #otherwise count all combinations with length > 2
        else:
            return True
    #if input length is one
    #count only if it is 'a'
    elif len(word) == 1:
        if word in ['a', 'i']:
            return True
    #with input length of 2
    #disqualify those with a symbol as the second character
    elif len(word) == 2:
        if word[1] not in symbols:
            return True
    #otherwise disqualify input
    else:
        return False

class processText:
    '''
    class object for data processing preperation for embedding training.

    Parameters:
    ===========
    1) textVec: list of array-like, vector of text in strings

    Methods:
    ===========
    1) updateMaxLen: count and update maximum sequence length
    2) getDictionary: update dictionaries of words and tokens,
        function called in `tokenize`
    3) tokenize: return tokenized vector of text for model training
    '''
    def __init__(self, textVec):

        #initiate class object
        self.textVec = list()
        for vec in textVec:
            #string to list
            vec = [x.strip().split(' ') for x in vec]
            self.textVec.append(vec)

        #prep  dictionaries for update
        self.word2idx = dict()
        self.idx2word = dict()
        self.maxLen = 0
        self.nUnique = 0

    def updateMaxLen(self):
        for vec in self.textVec:
            for txt in vec:
                #get length of sequence
                cntLen = len(txt)
                #update maximum sequence length
                if self.maxLen < cntLen:
                    self.maxLen = cntLen

    def getDictionary(self):

        if len(self.word2idx) != 0:
            print("Dictionary already updated.")

        else:
            #initiate dictionary updates
            #pad with 0
            #end of sequence as 1
            #ignored/disqualified words as 2
            #start tokenization at 3
            pad = 0
            eos = 1
            ign = 2
            start = 3

            self.word2idx['_'] = pad
            self.word2idx['*'] = eos
            self.word2idx['<ign>'] = ign

            for vec in self.textVec:
                for txt in vec:
                    for w in txt:
                        if qualify(w) == True:
                            if w not in self.word2idx.keys():
                                self.word2idx.update({w: start})
                                start += 1

            #update number of unique words in data set
            self.nUnique = start - 3
            #update idx to word dictionary
            self.idx2word = dict((idx,word) for word,idx in self.word2idx.items())

    def tokenize(self):
        #get dictionaries if function hasn't been called
        if len(self.word2idx) == 0:
            self.getDictionary()
        #cache list for tokenization
        tokenizedVec = list()
        for i in range(len(self.textVec)):
            vec = self.textVec[i]
            #cache list for the tokenized vector
            tempVec = list()
            for txt in vec:
                #cache list for sequence
                sVec = list()
                for w in txt:
                    #if word is in dictionary, tokenize
                    if w in self.word2idx:
                        sVec.append(self.word2idx[w])
                    #if word not in dictionary, tag as ignored
                    else:
                        sVec.append(self.word2idx['<ign>'])
                tempVec.append(sVec)
            tokenizedVec.append(tempVec)

        return tokenizedVec

In [5]:
#tokenize data
prep = processText(data250[['title', 'abstract']].values.T)
#update sequence length
prep.updateMaxLen()
#get dictionaries of word and tags
prep.getDictionary()
word2idx = prep.word2idx
idx2word = prep.idx2word

print('Number of unique words: ', prep.nUnique)
print('Maxmimum sequence length: ', prep.maxLen)
print('='*110)

#get tokenized vector of text
txtTokenized = prep.tokenize()
titles = txtTokenized[0]
abstracts = txtTokenized[1]
print('Example of tokenized title:\n {0} => {1}'.format(titles[0], [prep.idx2word[i] for i in titles[0]]))
print('='*110)
print('Example of tokenized abstract:\n {0} => {1}'.format(abstracts[0],[prep.idx2word[i] for i in abstracts[0]]))

Number of unique words:  32468
Maxmimum sequence length:  250
Example of tokenized title:
 [3, 4, 5, 6, 7, 8, 9] => ['self-organization', 'of', 'associative', 'database', 'and', 'its', 'applications']
Example of tokenized abstract:
 [42, 466, 64, 4, 580, 5, 5497, 431, 5498, 5499, 51, 9, 19, 321, 5500, 5501, 58, 5498, 5497, 176, 5502, 3251, 503, 51, 309, 5503, 75, 58, 619, 5504, 1743, 4, 5505, 42, 61, 4, 3, 431, 5506, 368, 42, 1019, 4, 5507, 1727, 5508, 10, 289, 4072, 4, 21, 5509, 75, 58, 5510, 5504, 5511, 42, 5512, 19, 187, 1181, 92, 7, 122, 19, 42, 319, 320, 321, 159, 1391, 5513] => ['an', 'efficient', 'method', 'of', 'self-organizing', 'associative', 'databases', 'is', 'proposed', 'together', 'with', 'applications', 'to', 'robot', 'eyesight', 'systems.', 'the', 'proposed', 'databases', 'can', 'associate', 'any', 'input', 'with', 'some', 'output.', 'in', 'the', 'first', 'half', 'part', 'of', 'discussion,', 'an', 'algorithm', 'of', 'self-organization', 'is', 'proposed.', 'from', 'an', 

Finally, we split our dataset into training ($\approx$72\%), validation ($\approx$8\%) and test set ($\approx$20\%). For consistency, we set the random state to 209.

In [6]:
#split data into train, validation, and test set
trainX, testX, trainY, testY = train_test_split(abstracts, titles, test_size = 0.2 , random_state = 209)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size = 0.1 , random_state = 209)

print('Number of training samples: ', len(trainX))
print('Number of validation samples: ', len(valX))
print('Number of test samples: ', len(testX))

Number of training samples:  3339
Number of validation samples:  371
Number of test samples:  928


## 4. Word Embeddings

In [None]:
#embedding karina
#word2vec phoebe
#glove pre-trained abhi

### 4.1 Word2vec 

Word2vec (by Mikolov et al., 2013[1]) model, similar to other word embedding models, is used to learn about vector representations of words, also known as word embeddings. Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically, the two models are similar with a small differnce in focus. 

CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram does the inverse and predicts source context-words from the target words. In our model, we went with the CBOW approach, because it is recommended to work better for smaller datasets[2].

To make our embeddings comparable with other embeddings, we limit the number of embedding vectors to be 100. Because not all words in our dataset has a feature vector pre-trained in the Word2Vec corpus, we needed to pad in zeros for those words. In other words, for words that are unseen in the word2vec model (2.9% of unique words in our dataset), their embedding vectors are vectors of 0. With 32,471 unique words, we resulted in an embedding matrix with a dimension of (32471, 100).

Finally, word2vec can tell us the similarity of words by calculating the cosine distance between the embedding vectors of the two words

Reference: 

[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).

[2] https://www.tensorflow.org/tutorials/representation/word2vec

In [None]:
# Create a list that each element is an abstract (actual words)
abstracts_list_word = []
for i in range(len(abstracts)):
    abstracts_list_word.append([prep.idx2word[word] for word in abstracts[i]])

# Initiate and train the word2vec model using our dataset
word_model = Word2Vec(abstracts_list_word, size=100, min_count=1, window=5, iter=100) 
# Initiate the model with the documents
word_model.train(abstracts_list_word, total_examples=len(abstracts_list_word), epochs=10, compute_loss = True) 

pretrained_weights = word_model.wv.syn0
vocab_size, emdedding_size = pretrained_weights.shape


In [None]:
# If try to replicate the result, please load the pre-trained weight instead.
# word_model = np.load(histPath+'embeddMatrix_word2vec_0512.npy')

In [None]:
# Add unseen vocab to the embedding matrix
all_unique_words = list(prep.word2idx.keys())
embeddMatrix = np.zeros(shape = (len(all_unique_words), 100)) # initiate with zero, with 32471 unique words

for i, word in enumerate(all_unique_words):
    try:
        embeddMatrix[i] = word_model.wv.word_vec(word) # find the word in the vector space and store the embeddings 
    except KeyError: # unseen vocab stay with 0 by skippig 
        continue

In [42]:
# Example of similar word using the model
print('Checking similar words:')
for word in ['model', 'network', 'convolution', 'learning', 'neural', 'barn']:
    most_similar = ', '.join('%s (%.2f)' % (similar, dist) for similar, dist in word_model.wv.most_similar(word)[:8])
    print('  %s -> %s' % (word, most_similar))


Checking similar words:
  model -> model, (0.72), approach (0.58), method (0.54), models (0.54), framework (0.52), mechanism (0.51), methodology (0.50), formulation (0.47)
  network -> net (0.72), networks (0.70), network, (0.65), nets (0.61), dude, (0.60), networks, (0.58), network. (0.58), signaling. (0.56)
  convolution -> transformation (0.49), de-convolution (0.46), filter, (0.44), operators (0.43), layer (0.43), non-linear (0.42), combinations (0.42), white-noise (0.42)
  learning -> learning, (0.65), learning. (0.60), learning: (0.46), adaptation (0.44), translation (0.44), single-core (0.44), learning-based (0.44), teaching (0.41)
  neural -> kwta (0.63), probabilities." (0.58), l(winner-take-all (0.58), formed; (0.58), context-independent (0.57), pretrain (0.56), disco (0.55), rbf (0.55)
  barn -> owl (0.72), owls (0.58), owl. (0.58), map-like (0.56), heading (0.52), microstimulation (0.52), owl's (0.51), young (0.50)


### 4.3 Self-Trained GloVe Embeddings

One motivation to train our own embedding is that pre-trained word embeddings may not capture well the similarities and co-occurances between words in the current dataset. Because academic journal article come with many technical terms, it is possible that weights using pre-trained embeddings do not apply to these texts. As mentioned above, only half of the unique words in the current dataset were found in the pre-trained GloVe embeddings. Therefore, we experimented with training our own embedding matrix, in hopes that the initialized weights would help our models learn to summarize the abstracts more effectively. Note that the embedding matrix is only trained on the training examples split using the code in the data preprocessing section. This is because realistically we do not have access to the hold-out sets. Again, the trained embedding matrix dimension is (32,471, 100); the height corresponds to the sum of the number of unique words and the number of special tags (`<eos>`, `<ign>`, `<pad>`).

In [7]:
#get text for training
#remove ignored/disqualified words
#because we do not want to learn this tag
embedd_trainX = [[idx2word[x] for x in v if x != 2] for v in trainX]
embedd_trainY = [[idx2word[x] for x in v if x != 2] for v in trainY]
embeddTxt = trainX + trainY

#prep dictionary for embedding training
#drop pad, eos, and ignore tag
start = 0
embeddDict = dict()
for vec in embeddTxt:
    for w in vec:
        if w not in embeddDict.keys():
            embeddDict[w] = start
            start += 1

print('Number of unique words for embedding training: ', len(embeddDict))

Number of unique words for embedding training:  27141


In [8]:
#the code below were used for embedding training
#please see rnn_preprocessing for the execution history

# #train glove embedding 
# #creating a corpus object
# corpus_ = glove.Corpus(dictionary = embeddDict) 
# #training the corpus to generate the co occurence matrix which is used in GloVe
# corpus_.fit(embeddTxt, window = 10)
# #train embedding using corpus weight matrix created above 
# glove_ = glove.Glove(no_components = 100, learning_rate = 0.01, random_state = 209)
# glove_.fit(corpus_.matrix, epochs=50, no_threads=10, verbose = True)
# glove_.add_dictionary(corpus_.dictionary)

# #embedding matrix
# #initiate a matrix with shape 
# #(number of unique words in our dataset, latent dimension of embedding)
# embeddMatrix = np.zeros((len(word2idx), 100))
# #loop through trained embedding matrix to find weights of trained words
# for i, w in enumerate(word2idx):
#     try:
#         embeddVec = glove_.word_vectors[glove_.dictionary[w]]
#         embeddMatrix[i] = embeddVec
#     except:
#         continue

In [27]:
#load trained glove model
glove_ = glove.Glove.load('rnn_training_history/glove_.model')
#display 10 words most similar to 'stochastic' by glove training
print('Top 10 words most similar to "stochastic"')
print('=========================================')
for (i,j) in glove_.most_similar(word = 'stochastic', number = 10):
    print("word: {0} | cosine similarity: {1}".format(i, j))

Top 10 words most similar to "stochastic"
word: gradient | cosine similarity: 0.8991909690378893
word: three-composite | cosine similarity: 0.8937557315973716
word: descent | cosine similarity: 0.8744804215977376
word: proximal | cosine similarity: 0.76864338053953
word: optimization | cosine similarity: 0.7646848913256339
word: descent. | cosine similarity: 0.7640788665473602
word: accelerated | cosine similarity: 0.7629822276705301
word: gradients | cosine similarity: 0.7624208397216201
word: projected | cosine similarity: 0.7623820982466453


## 5. RNN Model 

In [None]:
#karina

## 6. RNN Model with Attention and Context Mechanism

In [None]:
#karina

## 7. Results

### 7.1 Evaluation Metrics
Similar to classical classification task, we can evaluate our model performance using F-score, precision and recall. 

In the field of Natural Language Processing (NLP), it is common to use BLEU and ROUGE to measure precision and recall. 

### 7.1.1  BLEU
BLEU (BiLingual Evaluation Understudy) stands for  measures how well a candidate translation matches a set of reference translations by counting the percentage of n-grams in the candidate translation overlapping with the references. BLEU was first introduced in Papineni et. al. (2001).

### 7.1.2 ROUGE
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It comes with mainly two metrics, ROUGE-N and ROUGE-L. 

ROUGE-N is a recall-related measure because the denominator of the equation is the total sum of the number of n-grams occurring at the reference summary side. 

ROUGE-N: Overlap of N-grams[2] between the system and reference summaries.
- ROUGE-1 refers to the overlap of 1-gram (each word) between the system and reference summaries.
- ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.

ROUGE-L: Longest Common Subsequence (LCS)[3] based statistics. Longest common subsequence problem takes into account sentence level structure similarity naturally and identifies longest co-occurring in sequence n-grams automatically.

Rouge applies in cases with multiple reference summary, however, because we have only one ground truth (i.e., one title), we will simplify the definition of rouge as following:
Recall (ROUGE) = $\frac{Count_{match}(gram_n)}{Count(gram_n)}$

n stands for the length of the n-gram ($gram_n$), and $Count_{match}(gram_n)$ is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries"

### 7.1.3 F-score
$F = 2 * \frac{Precision * Recall}{Precision + Recall}$

### 7.2 Test-set performance

In [None]:
# Relative import of our functions
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)
# import our rouge functions
from myeval import one_gram_recall, ngrams, two_gram_recall

# BLEU
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

In [None]:
# Import our model predictions
# glove self-trained embedding result
with open("glove_self_trained/predictions_base",'rb') as f:
    basePred = pickle.load(f)
with open("glove_self_trained/predictions_attention",'rb') as f:
    attentionPred = pickle.load(f)

# kNN
knn_pred = np.load("kNN/baseline_generated.npy")
knn_truth = np.load("kNN/baseline_true.npy")

# word2vec
with open("word2vec/predictions_base",'rb') as f:
    word2vec_basePred = pickle.load(f)
# with open("word2vec/predictions_attention",'rb') as f:
#     word2vec_attnPred = pickle.load(f)


#### Baseline kNN

In [None]:
models = ["kNN"]
metrics = ['rouge1', 'rouge2', 'bleu']
kNN_eval_dict = {model: {metric: [] for metric in metrics} for model in models}

for i in range(len(knn_pred)):
    for model in models:
        rouge1_res = one_gram_recall(knn_truth[i], knn_pred[i]) # calculate ROGUE with 1-gram
        rouge2_res = two_gram_recall(knn_truth[i], knn_pred[i]) # calculate ROGUE with 2-gram
        bleu = sentence_bleu(knn_truth[i], knn_pred[i], smoothing_function=SmoothingFunction().method3) # calculate BLEU

        kNN_eval_dict[model]['rouge1'].append(rouge1_res)
        kNN_eval_dict[model]['rouge2'].append(rouge2_res)
        kNN_eval_dict[model]['bleu'].append(bleu)

#### GloVe: self-trained embeddings

In [None]:
models = ["Greedy", "Non-Greedy"]
metrics = ['rouge1', 'rouge2', 'bleu']
baseline_eval_dict = {model: {metric: [] for metric in metrics} for model in models}
        
for i in range(len(basePred['Truth'])):
    for model in models:
        rouge2_res = two_gram_recall(basePred['Truth'][i], basePred[model][i])
        bleu = sentence_bleu(basePred['Truth'][i], basePred[model][i], smoothing_function=SmoothingFunction().method3)
        
        baseline_eval_dict[model]['rouge1'].append(rouge1_res)
        baseline_eval_dict[model]['rouge2'].append(rouge2_res)
        baseline_eval_dict[model]['bleu'].append(bleu)

attention_eval_dict = {model: {metric: [] for metric in metrics} for model in models}

for i in range(len(basePred['Truth'])):
    for model in models:
        rouge1_res = one_gram_recall(attentionPred['Truth'][i], attentionPred[model][i])
        rouge2_res = two_gram_recall(attentionPred['Truth'][i], attentionPred[model][i])
        bleu = sentence_bleu(attentionPred['Truth'][i], attentionPred[model][i], smoothing_function=SmoothingFunction().method3)
        
        attention_eval_dict[model]['rouge1'].append(rouge1_res)
        attention_eval_dict[model]['rouge2'].append(rouge2_res)
        attention_eval_dict[model]['bleu'].append(bleu)
    

#### Word2Vec embedding

In [None]:
models = ["Greedy", "Non-Greedy"]
metrics = ['rouge1', 'rouge2', 'bleu']
word2vec_eval_dict = {model: {metric: [] for metric in metrics} for model in models}

for i in range(len(word2vec_basePred['Truth'])):
    for model in models:
        rouge1_res = one_gram_recall(word2vec_basePred['Truth'][i], word2vec_basePred[model][i])
        rouge2_res = two_gram_recall(word2vec_basePred['Truth'][i], word2vec_basePred[model][i])
        bleu = sentence_bleu(word2vec_basePred['Truth'][i], word2vec_basePred[model][i], smoothing_function=SmoothingFunction().method3)
        
        word2vec_eval_dict[model]['rouge1'].append(rouge1_res)
        word2vec_eval_dict[model]['rouge2'].append(rouge2_res)
        word2vec_eval_dict[model]['bleu'].append(bleu)

Here's the summary of our model performance on the test-set. We compared our predicted title to the ground-truth title of the test set (928 titles).

| Model                                      | ROUGE-1 (Recall) | ROUGE-2 (Recall) | BLEU (Precision) | F-score |
|--------------------------------------------|------------------|------------------|------------------|---------|
| k-Nearest-Neighbor                         | .1728            | .0454            | .0135            | .1728   |
| GloVe (self-trained, greedy search)        | .0984            | .0136            | .0137            | .0241   |
| GloVe (self-trained, non-greedy search)    | .1215            | .0058            | .0103            | .0189   |
| Word2vec (self-trained, greedy search)     | .1060            | .0216            | .0116            | .0209   |
| Word2vec (self-trained, non-greedy search) | .1294            | .0064            | .0110            | .0202   |

In general, we can see k-NN model has the highest test-set performance based on the recall, precision and F-score metrics. However, for metrics we picked in this project, k-NN has an advantage over other models because the titles generated from kNN are coming from the dataset, which provides an advantage of having a higher recall (because of the overlap in the keywords as well as common keywords used in title especially when the dataset is in a very particular domain (NeurIPS). Similar reason applies for higher precision as well.

## 8. Discussion

In [None]:
#karina
#abhi
#PHOEBE

### 8.1 Evaluation metrics
In this project, we used ROUGE-N, BLEU and F-score to examine the model performance, because it is commonly used in text summarization task, which title generation can be seen as one. However, there are a couple areas that we want to discuss. 

#### 8.1.1 Comparison to the True Title
In our project, we compared our predicted title with the ground truth and used different metrics to evaluate how similar the predicted titles are with the true title. The main motivation to compare with the ground truth is because it is the only immediately available evaluation metrics. However, is the task about replicating the human-generated title? We argue that comparing the similarity of our generated title and the true title might not be the best approach. First of all, it is hard to know if the true title is created with the idea to summarize the article. It is possible to believe that journal articles title are created with some keywords or buzzwords in it, because it helps with getting attention and citations, which is especially important in academic journal articles. 

#### 8.1.2 Semantic Similarity 
Second, our current metrics only measure word co-occurence between our generated and true titles using precision and recall. The semantic similarity between the generated and true titles is ignored. However, word co-occurrence might not be the best metrics, because as discussed eaerlier, titles of journal articles might be created with the goal of including some popular and commonly used buzzwords which might not always be in the main body of the article or abstract, which therefore, using abstract as the source text to generate the title might result in dissimilar title as the ground truth, although they might still share similar semantic meanings. One metric that is commonly used to evaluate document similarity is `doc2vec`, which similar as `word2vec`, it represents documents in embedding vectors and compare document similarity by calculating the cosine distance of the embedding of documents. We end up did not use `doc2vec` because we ran out of the time to train our own embedding, while the other available pre-trained `doc2vec` embedding was trained with news or Wikipedia text corpus, which we think could be too different than our dataset because of its scientific journal article nature. 

#### 8.1.3 Human Evaluation
Another possible way to evaluate our generated title is to use human evaluation, for example, we can employ human/Amazon Mechanical Turkers to read the abstract of the article and evaluate both the generated title and true title in terms of subjective preference and evaluation of if the title summarizes the abstract. We can then compare the scorings of the two sets of titles and see if there is a big difference between them. However, human evaluation are time-consuming and expensive to collect and can introduce individual biases, which therefore, we did not use human to evaluate the titles