# Abstract

Our goal here is to create the `Keywords_mapping_abstarct` dictionary by going over the papers in the training set. For each paper, we first find the set of keywords of its abstract (see below for the details of how the keywords are found). Let’s say the normalized citation count of the paper is $x$ and its abstract has $n$ keywords. Then we update the `keyword_score` of each keyword in the abstract by adding $\frac{x}{n}$.
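As a minimal sketch of this update rule, consider two hypothetical papers. The citation values and keyword lists below are made up for illustration and do not come from the data set:

```python
# Toy illustration of the keyword_score update: each keyword of a paper
# receives an equal share x / n of that paper's normalized citations x.
# The papers below are hypothetical and exist only for this example.
toy_papers = [
    {"normalized_citations": 3.0, "keywords": ["model", "data", "network"]},
    {"normalized_citations": 1.0, "keywords": ["model", "kernel"]},
]

keyword_score = {}
for paper in toy_papers:
    x = paper["normalized_citations"]
    n = len(paper["keywords"])
    for keyword in paper["keywords"]:
        # Accumulate the share x / n on top of any score seen so far
        keyword_score[keyword] = keyword_score.get(keyword, 0.0) + x / n

print(keyword_score)
```

Here `model` appears in both papers, so it collects $3/3 + 1/2 = 1.5$, while a keyword that appears in only one paper keeps that paper's single share.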


## Setup

In [2]:
import pandas as pd
import numpy as np
import json
import rake
import nltk
import unicodedata
import operator
from nltk.tokenize import RegexpTokenizer
from textblob import TextBlob

def abstract_preprocess(text):
    # Missing abstracts are read by pandas as NaN (a float); use a placeholder
    if isinstance(text, float):
        text = 'a'
    # Remove the escaped byte sequence of a non-breaking space plus ellipsis
    text = text.replace("\\xc2\\xa0\\xe2\\x80\\xa6", '')
    # Remove straight and curly double quotes
    text = text.replace('"','').replace('“','').replace('”','')

    # Decode byte strings to Unicode, then reduce the text to plain ASCII
    if isinstance(text, bytes):
        text = text.decode('utf8')
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

    return text.strip().lower()

## Extracting keywords

The critical part of this task is how we extract the keywords. To extract keywords from an abstract, we use three different methods:
* NLTK library,
* TextBlob library,
* RAKE (Rapid Automatic Keyword Extraction) algorithm.

Below, we describe how the keywords are extracted with each of these techniques:

### Extract keywords based on NLTK library

Natural Language Toolkit (NLTK) is one of the best-known natural language processing libraries for Python. We have already become familiar with this library through the assignments this semester. One of the simplest ideas for keyword extraction is to look only at the nouns in the abstract. To do so, we first break the abstract into sentences. We then apply NLTK’s word tokenizer and POS tagger to each sentence, which gives us the tokens of each sentence in the abstract together with their tags. The list of POS tags is as follows:

* CC coordinating conjunction
* CD cardinal number
* DT determiner
* EX existential there (like: “there is” … think of it like “there exists”)
* FW foreign word
* IN preposition/subordinating conjunction
* JJ adjective ‘big’
* JJR adjective, comparative ‘bigger’
* JJS adjective, superlative ‘biggest’
* LS list marker 1)
* MD modal could, will
* NN noun, singular ‘desk’
* NNS noun plural ‘desks’
* NNP proper noun, singular ‘Harrison’
* NNPS proper noun, plural ‘Americans’
* PDT predeterminer ‘all the kids’
* POS possessive ending parent’s
* PRP personal pronoun I, he, she
* PRP\$ possessive pronoun my, his, hers
* RB adverb very, silently,
* RBR adverb, comparative better
* RBS adverb, superlative best
* RP particle give up
* TO to, as in go ‘to’ the store
* UH interjection, errrrrrrrm
* VB verb, base form take
* VBD verb, past tense took
* VBG verb, gerund/present participle taking
* VBN verb, past participle taken
* VBP verb, sing. present, non-3rd person take
* VBZ verb, 3rd person sing. present takes
* WDT wh-determiner which
* WP wh-pronoun who, what
* WP\$ possessive wh-pronoun whose
* WRB wh-adverb where, when


Among these tags, we choose the noun tags (NN, NNS, NNP, NNPS) to mark the keywords of a sentence. Intuitively, keywords are more likely to be nouns than verbs, adjectives, conjunctions, etc.

Below is the function for keyword extraction using NLTK’s tokenizer and POS tagger:

In [3]:
def keywords_nltk(pure_text):
    sentences = nltk.sent_tokenize(pure_text)  # split the text into sentences
    nouns = []  # list to hold all nouns
    for sentence in sentences:
        # tag every token of the sentence with its part of speech
        for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
            if pos in ('NN', 'NNP', 'NNS', 'NNPS'):  # keep nouns only
                nouns.append(word)
    return nouns

Now, if we pass an abstract to this function, the result is as follows:

In [4]:
keywords = keywords_nltk('The coordinate descent (CD) method is a classical optimization algorithm that has seen a revival of interest because of its competitive performance in machine learning applications. A number of recent papers provided convergence rate estimates for their deterministic (cyclic) and randomized variants that differ in the selection of update coordinates. These estimates suggest randomized coordinate descent (RCD) performs better than cyclic coordinate descent (CCD), although numerical experiments do not provide clear justification for this comparison. In this paper, we provide examples and more generally problem classes for which CCD (or CD with any deterministic order) is faster than RCD in terms of asymptotic worst-case convergence. Furthermore, we provide lower and upper bounds on the amount of improvement on the rate of CCD relative to RCD, which depends on the deterministic order used. We also provide a characterization of the best deterministic order (that leads to the maximum improvement in convergence rate) in terms of the combinatorial properties of the Hessian matrix of the objective function.')
print(keywords)

['coordinate', 'descent', 'CD', 'method', 'optimization', 'algorithm', 'revival', 'interest', 'performance', 'machine', 'learning', 'applications', 'number', 'papers', 'convergence', 'rate', 'estimates', 'variants', 'selection', 'coordinates', 'estimates', 'descent', 'RCD', 'descent', 'CCD', 'experiments', 'justification', 'comparison', 'paper', 'examples', 'problem', 'classes', 'CCD', 'CD', 'order', 'RCD', 'terms', 'convergence', 'bounds', 'amount', 'improvement', 'rate', 'CCD', 'RCD', 'order', 'characterization', 'order', 'improvement', 'convergence', 'rate', 'terms', 'properties', 'Hessian', 'matrix', 'function']


### Extract keywords based on TextBlob library

TextBlob is another Python library that can be used for text processing. It is a relatively new Python NLP toolkit that stands on the shoulders of giants like NLTK and Pattern and provides text mining, text analysis, and text processing modules for Python developers. It offers a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. We implement the same idea of POS tagging and taking the nouns as keywords using TextBlob.

The function is shown below:

In [5]:
def keywords_textblob(pure_text):
    keywords = [w for (w, pos) in TextBlob(pure_text).pos_tags if pos[0] == 'N']
    return keywords

Here is the result if we input the same abstract to this function:

In [6]:
keywords = keywords_textblob('The coordinate descent (CD) method is a classical optimization algorithm that has seen a revival of interest because of its competitive performance in machine learning applications. A number of recent papers provided convergence rate estimates for their deterministic (cyclic) and randomized variants that differ in the selection of update coordinates. These estimates suggest randomized coordinate descent (RCD) performs better than cyclic coordinate descent (CCD), although numerical experiments do not provide clear justification for this comparison. In this paper, we provide examples and more generally problem classes for which CCD (or CD with any deterministic order) is faster than RCD in terms of asymptotic worst-case convergence. Furthermore, we provide lower and upper bounds on the amount of improvement on the rate of CCD relative to RCD, which depends on the deterministic order used. We also provide a characterization of the best deterministic order (that leads to the maximum improvement in convergence rate) in terms of the combinatorial properties of the Hessian matrix of the objective function.')
print(keywords)

['coordinate', 'descent', 'CD', 'method', 'optimization', 'algorithm', 'revival', 'interest', 'performance', 'machine', 'learning', 'applications', 'number', 'papers', 'convergence', 'rate', 'estimates', 'variants', 'selection', 'coordinates', 'estimates', 'descent', 'RCD', 'descent', 'CCD', 'experiments', 'justification', 'comparison', 'paper', 'examples', 'problem', 'classes', 'CCD', 'CD', 'order', 'RCD', 'terms', 'convergence', 'bounds', 'amount', 'improvement', 'rate', 'CCD', 'RCD', 'order', 'characterization', 'order', 'improvement', 'convergence', 'rate', 'terms', 'properties', 'Hessian', 'matrix', 'function']


As you can see, the extracted keywords are the same as those we extracted with the NLTK library. This result is expected, because we applied the same “noun” extraction method with a different library to verify the result.

### Extract keywords based on RAKE algorithm
Another technique for keyword extraction is an algorithm called RAKE, an acronym for Rapid Automatic Keyword Extraction. This algorithm has three main components:
* Candidate selection: in this stage, all potential keywords, including nouns, phrases, terms, concepts, etc., are selected.
* Properties calculation: for each candidate, some properties are calculated that indicate whether the candidate might be a keyword. For example, if a word occurs a few times in the abstract, it might be a keyword. How many times the word co-occurs with other words also matters.
* Scoring and selecting keywords: all candidates are scored by combining these properties into a formula.

Then, a score or probability threshold is used to select the final keywords.
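The three components can be sketched in a few lines of plain Python (written for Python 3). This is a simplified illustration and not the `rake` module used in this notebook: the function name `rake_sketch` and the tiny stop list are assumptions made for the example, and the sketch applies no minimum-length, phrase-length, or frequency filters. It uses the common RAKE word score of degree divided by frequency.

```python
import re
from collections import defaultdict

# A tiny stand-in stop list for illustration only; the notebook uses SmartStoplist.txt.
TOY_STOPWORDS = {"a", "is", "of", "the", "and", "for", "in", "that", "than"}

def rake_sketch(text, stopwords=TOY_STOPWORDS):
    # 1) Candidate selection: split on punctuation, then on stop words.
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9\-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # 2) Properties: word frequency and degree (co-occurrence within phrases).
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # the word itself plus its neighbours

    # 3) Scoring: a phrase scores the sum of degree / frequency over its words.
    scores = {}
    for phrase in phrases:
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rake_sketch("Cyclic coordinate descent is faster than randomized coordinate descent.")
print(ranked)
```

In this toy sentence the two three-word phrases outrank the single word "faster", which shows why RAKE tends to favour multi-word phrases built from words that co-occur often.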

Below you can see the function used for keyword extraction with the RAKE algorithm. As explained in the comments, three main values have to be specified: the minimum number of characters per word, the maximum number of words per phrase, and the minimum number of times a keyword must appear in the abstract. The “SmartStoplist” is a list of stop words used to split the text into important and unimportant words. For example, “the”, “after”, “at”, “again”, etc. are not important and cannot be keywords.

In [9]:
def keywords_rake(pure_text):
    rake_object = rake.Rake("SmartStoplist.txt", 3, 4, 2)
    # Each word has at least 3 characters
    # Each phrase has at most 4 words
    # Each keyword appears in the text at least 2 times
    keywords_score = rake_object.run(pure_text)
    keywords = [i[0] for i in keywords_score]
    return keywords

Here is the output of this function for the same input abstract:


In [10]:
keywords = keywords_rake('The coordinate descent (CD) method is a classical optimization algorithm that has seen a revival of interest because of its competitive performance in machine learning applications. A number of recent papers provided convergence rate estimates for their deterministic (cyclic) and randomized variants that differ in the selection of update coordinates. These estimates suggest randomized coordinate descent (RCD) performs better than cyclic coordinate descent (CCD), although numerical experiments do not provide clear justification for this comparison. In this paper, we provide examples and more generally problem classes for which CCD (or CD with any deterministic order) is faster than RCD in terms of asymptotic worst-case convergence. Furthermore, we provide lower and upper bounds on the amount of improvement on the rate of CCD relative to RCD, which depends on the deterministic order used. We also provide a characterization of the best deterministic order (that leads to the maximum improvement in convergence rate) in terms of the combinatorial properties of the Hessian matrix of the objective function.')
print(keywords)

['deterministic order', 'ccd', 'rcd', 'terms']


The keywords selected by the RAKE algorithm look more reasonable than plain “noun” selection. First, they include multi-word phrases that could really be key concepts of the paper. Moreover, the selected keywords, e.g., “CCD” (cyclic coordinate descent) and “RCD” (randomized coordinate descent), suggest that the keywords are chosen more intelligently. To summarize, RAKE is a simple keyword extraction method that focuses on finding multi-word phrases containing frequent words.

For the rest of this notebook, we use the NLTK library ($mode=1$). To use the TextBlob library, set $mode=2$; to use the RAKE algorithm, set $mode=3$.

In [31]:
mode = 1
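
The `mode` switch can also be expressed as a lookup table that maps each mode to a method name and an extractor function. The sketch below is only one possible arrangement: `extract_nouns_stub`, `EXTRACTORS`, and `select_extractor` are hypothetical names, and the stub stands in for `keywords_nltk`, `keywords_textblob`, and `keywords_rake` so that the example runs on its own.

```python
# Hypothetical stand-in so the sketch runs on its own; in the notebook the
# three entries would point to keywords_nltk, keywords_textblob, keywords_rake.
def extract_nouns_stub(text):
    return text.split()

EXTRACTORS = {
    1: ("nltk", extract_nouns_stub),
    2: ("textblob", extract_nouns_stub),
    3: ("rake", extract_nouns_stub),
}

def select_extractor(mode):
    # Fail loudly on an unknown mode instead of continuing with stale state
    if mode not in EXTRACTORS:
        raise ValueError("Unknown mode: {!r}".format(mode))
    return EXTRACTORS[mode]

name, extractor = select_extractor(1)
print(name, extractor("coordinate descent method"))
```

The method name returned alongside the function could also be used to build the output file names, so that the mode is checked in a single place.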

## Create the dictionary of keywords and their scores for training and test datasets

### Import the training data

In [22]:
df_train = pd.read_csv("./data/data_processed/Abstract_training.csv")

df_train.head()

Unnamed: 0,index,citations,year,Abstract,citations_average
0,0,87,1987,Inverse matrix calculation can be considered a...,2.806452
1,3,5,1987,In the visual cortex of the monkey the horizon...,0.16129
2,4,45,1987,The study of distributed memory systems has pr...,1.451613
3,6,22,1987,A lightness algorithm that separates surface r...,0.709677
4,8,0,1987,The interaction of a set of tropisms is suffic...,0.0


In [33]:
dic_keywords = {}  # "keyword : citation"

for i in range(0,len(df_train)):
    if i % 500 == 0:
        print(i,len(df_train))
    text = df_train.Abstract[i]
    pure_text = abstract_preprocess(text)
    if mode ==1:
        keywords = keywords_nltk(pure_text) # Extracting keywords with NLTK POS and picking nouns
    elif mode ==2:
        keywords = keywords_textblob(pure_text)  # Extracting keywords with TextBlob and picking nouns
    elif mode ==3:
        keywords = keywords_rake(pure_text)  # Extracting keywords with RAKE algorithm
    else:
        print('Wrong mode!!!')
    
    N = len(keywords)
    for word in keywords:
        # add this paper's share x/n to the running score of the keyword
        dic_keywords[word] = dic_keywords.get(word, 0) + df_train.citations_average[i]/N

if mode ==1:    
    with open('./data/data_processed/json/dic_keywords_nltk.json', 'w') as fp:
        json.dump(dic_keywords, fp)
elif mode ==2:    
    with open('./data/data_processed/json/dic_keywords_textblob.json', 'w') as fp:
        json.dump(dic_keywords, fp)
elif mode ==3:    
    with open('./data/data_processed/json/dic_keywords_rake.json', 'w') as fp:
        json.dump(dic_keywords, fp)      
else:
        print('Wrong mode!!!')

(0, 4368)
(500, 4368)
(1000, 4368)
(1500, 4368)
(2000, 4368)
(2500, 4368)
(3000, 4368)
(3500, 4368)
(4000, 4368)


### Score calculation function

In [24]:
def predict(dic_keywords, my_string, mode):

    pure_text = abstract_preprocess(my_string)
    if mode ==1:
        keywords = keywords_nltk(pure_text) # Extracting keywords with NLTK POS and picking nouns
    elif mode ==2:
        keywords = keywords_textblob(pure_text)  # Extracting keywords with TextBlob and picking nouns
    elif mode ==3:
        keywords = keywords_rake(pure_text)  # Extracting keywords with RAKE algorithm
    else:
        print('Wrong mode!!!')

    N = 0  # number of extracted keywords that are found in the dictionary
    score = 0
    for word in keywords:
        if word in dic_keywords:
            score += dic_keywords[word]
            N += 1

    if N == 0:
        return 0  # no known keyword in this abstract
    return score/N  # average over the matched keywords
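
To make the averaging step concrete, here is a small worked example. The score dictionary and the keyword list are made up for illustration; they are not values from the trained dictionary.

```python
# Toy score dictionary and keyword list; both are made up for illustration.
toy_scores = {"model": 1.5, "data": 1.0, "kernel": 0.5}
toy_keywords = ["model", "kernel", "tropism"]  # "tropism" was never seen in training

# Only keywords present in the dictionary contribute to the average
known = [toy_scores[w] for w in toy_keywords if w in toy_scores]
predicted = sum(known) / len(known) if known else 0
print(predicted)  # (1.5 + 0.5) / 2 = 1.0
```

Unknown keywords are skipped rather than counted as zero, so they do not pull the average down. Because the dictionary values are sums accumulated over all training papers, the result is best read as a relative score rather than a value on the scale of `citations_average`.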

In [34]:
if mode == 1:
    with open('./data/data_processed/json/dic_keywords_nltk.json') as f:
        data_dict = json.load(f)
elif mode == 2:
     with open('./data/data_processed/json/dic_keywords_textblob.json') as f:
        data_dict = json.load(f)
elif mode == 3:
     with open('./data/data_processed/json/dic_keywords_rake.json') as f:
        data_dict = json.load(f)
else:
    print('Wrong mode!!!')
        

Below, we find the top 10 keywords in the dictionary, sorted by their `keyword_score`:

In [26]:
sorted_x = sorted(data_dict.items(), key=operator.itemgetter(1), reverse=True)

df = pd.DataFrame(
    {'first_column': list(data_dict.keys()),
     'second_column': list(data_dict.values())
    })
df.sort_values('second_column', ascending=False)[0:10]

Unnamed: 0,first_column,second_column
7819,model,1290.779605
684,data,1056.789386
5924,networks,916.788576
1240,paper,761.949494
3231,learning,754.168289
4348,problem,669.091707
1158,method,667.055361
540,models,646.804264
2438,network,532.958951
1666,algorithm,500.291176


In [27]:
with open('./data/data_processed/json/dic_keywords_textblob.json') as f:
        data_dict = json.load(f)
        
sorted_x = sorted(data_dict.items(), key=operator.itemgetter(1), reverse=True)

df = pd.DataFrame(
    {'first_column': list(data_dict.keys()),
     'second_column': list(data_dict.values())
    })
df.sort_values('second_column', ascending=False)[0:10]


Unnamed: 0,first_column,second_column
7819,model,1290.779605
684,data,1056.789386
5924,networks,916.788576
1240,paper,761.949494
3231,learning,754.168289
4348,problem,669.091707
1158,method,667.055361
540,models,646.804264
2438,network,532.958951
1666,algorithm,500.291176


In [30]:
with open('./data/data_processed/json/dic_keywords_rake.json') as f:
        data_dict = json.load(f)
        
sorted_x = sorted(data_dict.items(), key=operator.itemgetter(1), reverse=True)

df = pd.DataFrame(
    {'first_column': list(data_dict.keys()),
     'second_column': list(data_dict.values())
    })
df.sort_values('second_column', ascending=False)[0:10]

Unnamed: 0,first_column,second_column
1740,gram model,2043.4
1688,lda,1553.728431
1874,gans,813.833333
2393,cifar,638.5
2231,problem,571.457244
367,approach,403.067781
1128,nmf,395.944444
1511,model,366.437725
1414,generative adversarial network,336.0
764,data,310.026527


### Prediction on training

Below we use our extracted keywords to predict citations for the papers in the training set. This may take a few minutes.

In [35]:
df_train['predicted_citations'] = df_train['Abstract'].apply(lambda x: predict(data_dict, x, mode))
df_train.head()

Unnamed: 0,index,citations,year,Abstract,citations_average,predicted_citations
0,0,87,1987,Inverse matrix calculation can be considered a...,2.806452,187.447644
1,3,5,1987,In the visual cortex of the monkey the horizon...,0.16129,83.991048
2,4,45,1987,The study of distributed memory systems has pr...,1.451613,116.162506
3,6,22,1987,A lightness algorithm that separates surface r...,0.709677,136.68866
4,8,0,1987,The interaction of a set of tropisms is suffic...,0.0,61.969787


### Calculate correlation between citations_average and predicted_citations

In [36]:
df_train.citations_average.corr(df_train.predicted_citations)

0.13839492300030556
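
By default, `Series.corr` in pandas computes the Pearson correlation coefficient. As a rough sketch of what that number measures, the coefficient can be computed by hand in plain Python. The helper name `pearson` and the sample values are assumptions made for this example, and unlike pandas the sketch does not skip missing values.

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, not taken from the data set.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
print(pearson(actual, predicted))
```

The coefficient is insensitive to the scale of the predictions: doubling every value, as in this toy example, still gives a perfect correlation. This is why predicted scores in the hundreds can be compared against `citations_average` values in the single digits.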

In [37]:
df_train.fillna(0, inplace=True)

### Save the training data with predicted values

In [38]:
if mode ==1:
    df_train.to_csv('./data/data_processed/Abstract_training_predicted_nltk.csv', index=False)
elif mode ==2:
    df_train.to_csv('./data/data_processed/Abstract_training_predicted_textblob.csv', index=False)
elif mode ==3:
    df_train.to_csv('./data/data_processed/Abstract_training_predicted_rake.csv', index=False)
else:
    print('Wrong mode!!!')


### Import the test data

In [39]:
df_test = pd.read_csv("./data/data_processed/Abstract_test.csv")[0:]

df_test.head()

Unnamed: 0,index,citations,year,Abstract,citations_average
0,1,94,1987,Many connectionist learning models are impleme...,3.032258
1,2,1,1987,Amir F. Atiya (*) and James M. Bower (**)(*) D...,0.032258
2,5,66,1987,This paper generalizes the backpropagation met...,2.129032
3,7,73,1987,Recognizing patterns with temporal context is ...,2.354839
4,9,252,1987,We propose that the back propagation algorithm...,8.129032


### Prediction on test
Below we use our extracted keywords to predict citations for the papers in the test set. This may take a few minutes.

In [40]:
df_test['predicted_citations'] = df_test['Abstract'].apply(lambda x: predict(data_dict, x, mode) if(pd.notnull(x)) else x)

In [41]:
df_test.head()

Unnamed: 0,index,citations,year,Abstract,citations_average,predicted_citations
0,1,94,1987,Many connectionist learning models are impleme...,3.032258,226.145507
1,2,1,1987,Amir F. Atiya (*) and James M. Bower (**)(*) D...,0.032258,42.464439
2,5,66,1987,This paper generalizes the backpropagation met...,2.129032,308.769021
3,7,73,1987,Recognizing patterns with temporal context is ...,2.354839,114.649391
4,9,252,1987,We propose that the back propagation algorithm...,8.129032,215.310316


### Calculate correlation between citations_average and predicted_citations


In [42]:
df_test.citations_average.corr(df_test.predicted_citations)

0.051434255802564804

In [43]:
df_test.fillna(0, inplace=True)

### Save the test data with predicted values

In [44]:
if mode ==1:
    df_test.to_csv('./data/data_processed/Abstract_test_predicted_nltk.csv', index=False)
elif mode ==2:
    df_test.to_csv('./data/data_processed/Abstract_test_predicted_textblob.csv', index=False)
elif mode ==3:
    df_test.to_csv('./data/data_processed/Abstract_test_predicted_rake.csv', index=False)
else:
    print('Wrong mode!!!')