# Title

Our goal here is to build the `Keywords_mapping_title` dictionary by going over the papers in the training set. For each paper, we first find the set of keywords in its title (see below for how the keywords are found). Say the normalized citation count of the paper is $x$ and its title has $n$ keywords. We then update the `keyword_score` of each keyword in the title by adding $\frac{x}{n}$.
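As a minimal sketch of this update rule, the snippet below accumulates $\frac{x}{n}$ per keyword over two made-up papers. The helper name `update_keyword_scores`, the toy keywords, and the citation values are assumptions for illustration and do not come from the notebook.

```python
from collections import defaultdict


def update_keyword_scores(keyword_score, keywords, normalized_citations):
    """Add x/n to the score of each of the n title keywords (hypothetical helper)."""
    n = len(keywords)
    if n == 0:
        return keyword_score  # a title without keywords contributes nothing
    share = normalized_citations / float(n)
    for word in keywords:
        keyword_score[word] += share
    return keyword_score


# Toy data, made up for illustration: (title keywords, normalized citations x)
papers = [
    (["deep", "learning"], 4.0),                # each keyword gains 4.0 / 2 = 2.0
    (["learning", "color", "algorithm"], 3.0),  # each keyword gains 3.0 / 3 = 1.0
]

keyword_score = defaultdict(float)
for keywords, x in papers:
    update_keyword_scores(keyword_score, keywords, x)

print(keyword_score["learning"])  # 3.0 (2.0 from the first paper + 1.0 from the second)
print(keyword_score["deep"])      # 2.0
```

A keyword that appears in many well-cited titles therefore accumulates a large score, since every paper adds its own share.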



## Setup

In [1]:
import pandas as pd
import numpy as np
import operator
from collections import OrderedDict
from operator import itemgetter
import unicodedata
import timeit, time
import scholarly
import os, os.path
import re
import nltk
from nltk.tokenize import RegexpTokenizer
import json
import csv
from nltk import pos_tag

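def title_preprocess(text):
    """Normalize a title: strip quotes and non-ASCII characters, trim, lowercase."""
    if isinstance(text, float):  # pandas reads a missing title as NaN (a float)
        text = 'a'
    # Remove the escaped UTF-8 bytes of a non-breaking space followed by an ellipsis
    if '\\xc2\\xa0\\xe2\\x80\\xa6' in text:
        text = text.replace("\\xc2\\xa0\\xe2\\x80\\xa6", '')
    text = text.replace('"', '').replace('“', '').replace('”', '')

    # Decode byte strings (Python 2 str) so that normalization operates on text
    if isinstance(text, bytes):
        text = text.decode("utf8")
    # Decompose accented characters and drop whatever has no ASCII equivalent
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

    return text.strip().lower()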

## Extracting keywords:

We obtain the keywords by tokenizing the title and removing punctuation. For example, consider the title “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”. This step converts it to the list of words [“LightGBM”, “A”, “Highly”, “Efficient”, “Gradient”, “Boosting”, “Decision”, “Tree”].
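The tokenization step can be sketched as follows. This uses the standard-library `re.findall` with the same `\w+` pattern that the notebook passes to `RegexpTokenizer`, on the assumption that both behave the same way for this pattern. The function name `tokenize_title` is hypothetical.

```python
import re


def tokenize_title(title):
    """Split a title into runs of word characters, which drops punctuation."""
    return re.findall(r"\w+", title)


tokens = tokenize_title("LightGBM: A Highly Efficient Gradient Boosting Decision Tree")
print(tokens)
# ['LightGBM', 'A', 'Highly', 'Efficient', 'Gradient', 'Boosting', 'Decision', 'Tree']
```

The notebook lowercases titles in `title_preprocess` before tokenizing, so the tokens it actually stores are lowercase.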

Next, we detect the part-of-speech tag of each word with NLTK's `pos_tag`. We then ignore words with the following tags:

1. CD cardinal digit
2. CC coordinating conjunction
3. DT determiner
4. EX existential there
5. IN preposition/subordinating conjunction
6. PDT predeterminer
7. POS possessive ending
8. PRP personal pronoun
9. PRP\$ possessive pronoun
10. RB adverb
11. RBR adverb, comparative
12. RBS adverb, superlative
13. RP particle
14. TO "to"
15. UH interjection
16. WRB wh-adverb
17. WP\$ possessive wh-pronoun
18. WP wh-pronoun
19. WDT wh-determiner

In short, we aim to keep only words tagged as nouns, adjectives, or verbs as keywords.

For instance, the keywords of our previous example are:
Keywords = [“LightGBM”, “Efficient”, “Gradient”, “Boosting”, “Decision”, “Tree”]
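The tag-based filter described in this section can be sketched as below. To keep the example self-contained, the tags are assigned by hand rather than produced by a tagger, so they are assumptions: a real tagger such as `nltk.pos_tag` may label some of these words differently. The names `IGNORED_TAGS` and `filter_keywords` are hypothetical.

```python
# Tags to drop, copied from the list of ignored tags in this section.
IGNORED_TAGS = {
    "CD", "CC", "DT", "EX", "IN", "PDT", "POS", "PRP", "PRP$", "RB",
    "RBR", "RBS", "RP", "TO", "UH", "WRB", "WP$", "WP", "WDT",
}


def filter_keywords(tagged_words):
    """Keep the words whose part-of-speech tag is not in IGNORED_TAGS."""
    return [word for word, tag in tagged_words if tag not in IGNORED_TAGS]


# Hand-assigned (word, tag) pairs for illustration only.
tagged_title = [
    ("LightGBM", "NNP"), ("A", "DT"), ("Highly", "RB"), ("Efficient", "JJ"),
    ("Gradient", "NN"), ("Boosting", "NN"), ("Decision", "NN"), ("Tree", "NN"),
]

kept = filter_keywords(tagged_title)
print(kept)
# ['LightGBM', 'Efficient', 'Gradient', 'Boosting', 'Decision', 'Tree']
```

The whole tag string is compared against the set, so a tag such as `PRP$` is matched as a unit rather than character by character.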

## Create the dictionary of keywords and their scores for training and test datasets


### Import the training data

In [2]:
df_train = pd.read_csv("./data/data_processed/Title_training.csv")

df_train.head()

| Unnamed: 0 | index | citations | year | Title | citations_average |
|---|---|---|---|---|---|
| 0 | 0 | 87 | 1987 | An Optimization Network for Matrix Inversion | 2.806452 |
| 1 | 3 | 5 | 1987 | Centric Models of the Orientation Map in Prima... | 0.16129 |
| 2 | 4 | 45 | 1987 | PATTERN CLASS DEGENERACY IN AN UNRESTRICTED ST... | 1.451613 |
| 3 | 6 | 22 | 1987 | Learning a Color Algorithm from Examples | 0.709677 |
| 4 | 8 | 0 | 1987 | On Tropistic Processing and Its Applications | 0.0 |


In [3]:
dic_word = {}  # keyword -> accumulated score, e.g. "scheme": 5
dic_titles = {}  # title -> list of keywords of the title (not filled in below)
ignoring_pos_tag = {'CC':0,'CD':0,'DT':0,'EX':0,'IN':0,'PRP':0,'PRP$':0,'TO':0, 'WRB':0,'WP$':0,'WP':0,'WDT':0,'VBZ':0,'VBP':0,'VBN':0,'VBG':0,'VBD':0,'VB':0,'UH':0}
ignoring_word = ['a','the','an','and','with','for','in','on','based','of', 'from', "to"]
tokenizer = RegexpTokenizer(r'\w+')

for i in range(len(df_train)):
    title = df_train.Title[i]
    pure_title = title_preprocess(title)
    word_list = tokenizer.tokenize(pure_title)
    keywords = []
    for word in word_list:
        if word not in ignoring_word:
            tag_word = nltk.tag.pos_tag([word])  # [(word, tag)]
            FLAG_DO_NOT_CHK_THIS_word = False
            # NOTE: tag_word[0][1] is a tag string such as 'NN', so this loop
            # compares its single characters with the multi-character keys of
            # ignoring_pos_tag and never matches; only ignoring_word filters words.
            for this_tag in tag_word[0][1]:
                if this_tag in ignoring_pos_tag:
                    FLAG_DO_NOT_CHK_THIS_word = True
                    break
            if not FLAG_DO_NOT_CHK_THIS_word:
                keywords.append(word)
                # Each kept word receives the paper's normalized citations
                # divided by the number of tokens in the title.
                share = df_train.citations_average[i] / float(len(word_list))
                dic_word[word] = dic_word.get(word, 0) + share

with open('./data/data_processed/json/title_keywords_dict.json', 'w') as fp:
    json.dump(dic_word, fp)

### Score calculation function

In [4]:
def predict(dic_word, my_string):
    """Return the average score of the words of my_string that appear in dic_word."""
    pure_text = title_preprocess(my_string)
    tokenizer = RegexpTokenizer(r'\w+')
    word_list = tokenizer.tokenize(pure_text)

    N = 0  # number of words of the title found in the dictionary
    score = 0
    for word in word_list:
        if word in dic_word:
            score += dic_word[word]
            N += 1
    if N == 0:
        return 0
    return score / N  # average over the number of matched keywords
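To illustrate the averaging, here is a small self-contained sketch on a made-up dictionary. It replaces `title_preprocess` and `RegexpTokenizer` with a lowercase `re.findall` call, which is a simplifying assumption. The name `average_keyword_score` and the scores in `toy_scores` are hypothetical.

```python
import re


def average_keyword_score(keyword_score, title):
    """Average the stored scores of the title words found in keyword_score."""
    words = re.findall(r"\w+", title.lower())
    known = [keyword_score[w] for w in words if w in keyword_score]
    if not known:
        return 0.0  # no word of the title appears in the dictionary
    return sum(known) / len(known)


# Made-up scores, for illustration only.
toy_scores = {"learning": 6.0, "network": 2.0}

print(average_keyword_score(toy_scores, "Learning on a General Network"))  # 4.0
print(average_keyword_score(toy_scores, "Optimal Spike Classification"))   # 0.0
```

Words missing from the dictionary are skipped rather than counted as zero, so they do not lower the average.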

Below, we list the 10 title keywords with the highest `keyword_score`:

In [5]:
df = pd.DataFrame(
    {'first_column': list(dic_word.keys()),    # keywords
     'second_column': list(dic_word.values())  # accumulated scores
    })
df.sort_values('second_column', ascending=False).head(10)

| Unnamed: 0 | first_column | second_column |
|---|---|---|
| 1823 | learning | 2043.384429 |
| 3449 | networks | 1505.991944 |
| 4165 | neural | 805.105671 |
| 2145 | deep | 664.126336 |
| 586 | latent | 608.495261 |
| 3015 | dirichlet | 576.266503 |
| 1914 | allocation | 535.76168 |
| 1525 | training | 492.706952 |
| 1988 | using | 466.637117 |
| 1093 | models | 443.733432 |


### Prediction on training

Below we use our extracted keywords to predict citations for the papers in the training set.

In [6]:
with open('./data/data_processed/json/title_keywords_dict.json') as f:
    data_dict = json.load(f)

df_train['predicted_citations'] = df_train['Title'].apply(lambda x: predict(data_dict, x))
df_train.head()

| Unnamed: 0 | index | citations | year | Title | citations_average | predicted_citations |
|---|---|---|---|---|---|---|
| 0 | 0 | 87 | 1987 | An Optimization Network for Matrix Inversion | 2.806452 | 180.902423 |
| 1 | 3 | 5 | 1987 | Centric Models of the Orientation Map in Prima... | 0.16129 | 96.113563 |
| 2 | 4 | 45 | 1987 | PATTERN CLASS DEGENERACY IN AN UNRESTRICTED ST... | 1.451613 | 24.349512 |
| 3 | 6 | 22 | 1987 | Learning a Color Algorithm from Examples | 0.709677 | 537.814055 |
| 4 | 8 | 0 | 1987 | On Tropistic Processing and Its Applications | 0.0 | 16.038629 |


### Calculate correlation between citations_average and predicted_citations

In [7]:
df_train.citations_average.corr(df_train.predicted_citations)

0.1412783978472654
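`Series.corr` computes the Pearson correlation coefficient by default. As a rough sketch of what that number measures, the plain-Python function below reproduces the formula on made-up values. The function name `pearson` and the sample lists are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: the predictions are exactly 10 times the actual values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [10.0, 20.0, 30.0, 40.0]

print(pearson(actual, predicted))  # 1.0
```

The coefficient ignores scale, so predictions that are much larger than `citations_average` can still correlate well if they rank the papers similarly. A value near 0 indicates a weak linear relationship.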

### Save the training data with predicted values

In [8]:
df_train.to_csv('./data/data_processed/Title_training_predicted.csv', index=False)


### Import the test data

In [9]:
df_test = pd.read_csv("./data/data_processed/Title_test.csv")

df_test.head()

| Unnamed: 0 | index | citations | year | Title | citations_average |
|---|---|---|---|---|---|
| 0 | 1 | 94 | 1987 | Minkowski-r Back-Propagation: Learning in Conn... | 3.032258 |
| 1 | 2 | 1 | 1987 | Optimal Neural Spike Classification | 0.032258 |
| 2 | 5 | 66 | 1987 | Learning on a General Network | 2.129032 |
| 3 | 7 | 73 | 1987 | A Dynamical Approach to Temporal Pattern Proce... | 2.354839 |
| 4 | 9 | 252 | 1987 | Supervised Learning of Probability Distributio... | 8.129032 |


### Prediction on test

In [10]:
df_test['predicted_citations'] = df_test['Title'].apply(lambda x: predict(data_dict, x) if pd.notnull(x) else x)
df_test.head()

| Unnamed: 0 | index | citations | year | Title | citations_average | predicted_citations |
|---|---|---|---|---|---|---|
| 0 | 1 | 94 | 1987 | Minkowski-r Back-Propagation: Learning in Conn... | 3.032258 | 342.188738 |
| 1 | 2 | 1 | 1987 | Optimal Neural Spike Classification | 0.032258 | 306.043302 |
| 2 | 5 | 66 | 1987 | Learning on a General Network | 2.129032 | 763.662084 |
| 3 | 7 | 73 | 1987 | A Dynamical Approach to Temporal Pattern Proce... | 2.354839 | 34.610457 |
| 4 | 9 | 252 | 1987 | Supervised Learning of Probability Distributio... | 8.129032 | 698.88214 |


### Calculate correlation between citations_average and predicted_citations


In [11]:
df_test.citations_average.corr(df_test.predicted_citations)

0.06332553159923121

### Save the test data with predicted values

In [12]:
df_test.to_csv('./data/data_processed/Title_test_predicted.csv', index=False)