# Research Paper Recommender

### Review article recommender using the PubMed API and keyword extraction from article titles using TF-IDF

During my research career, I switched topics frequently, so I know how painful it is to start studying a new field from scratch. I am therefore developing an educational platform for junior researchers who are just starting out, built on a paper and keyword recommender system.

**Keyword recommendation:** Current research article search engines suggest keywords based on the keywords supplied by article authors, and these are mostly one-word synonyms. I developed a keyword recommender that uses text from titles and abstracts. Combining n-grams with TF-IDF yields intuitive keywords that indicate what to look up next.
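
As a rough illustration of the n-gram plus TF-IDF idea, the sketch below scores bigrams from a few made-up titles using only the standard library. The titles, the tiny stop-word set, and the name `rank_bigrams` are assumptions for illustration. The IDF term follows the smoothed form `ln((1 + N) / (1 + df)) + 1`, but the per-document normalization that scikit-learn's `TfidfVectorizer` applies by default is left out, so the numbers will not match the notebook's.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "for", "in", "and", "on", "with"}  # tiny illustrative set

def bigrams(text):
    """Lowercase, drop stop words, and return adjacent word pairs."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(words, words[1:])]

def rank_bigrams(titles):
    """Sum a simplified TF-IDF score for each bigram over all titles, highest first."""
    docs = [Counter(bigrams(t)) for t in titles]
    n_docs = len(docs)
    doc_freq = Counter(b for d in docs for b in d)  # number of titles containing each bigram
    scores = Counter()
    for d in docs:
        for b, count in d.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[b])) + 1
            scores[b] += count * idf
    return scores.most_common()

# Made-up titles, not PubMed data
titles = [
    "Bioactive glass scaffolds for bone regeneration",
    "Bone regeneration with mesoporous glass",
    "Peptides in bone regeneration",
    "Antioxidant activity of peptides",
]
print(rank_bigrams(titles)[0][0])  # bone regeneration
```

Stop words are removed before the pairs are formed, which is why bigrams can join words that were not adjacent in the original title.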

**Paper recommendation:** The best way to learn a new field is to read a good review article written by a leading scientist in that field. Ranking articles by citation count is heavily biased toward publication date: the earliest articles in a field tend to accumulate the most citations. This recommender instead lists the most active researchers, based on their recent publication and citation counts, and displays their most recent reviews by analyzing the PubMed database.
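
To make the "most active researchers" idea concrete, here is a minimal sketch that counts publications per author within a recent window. The record layout (`year`, `authors`), the sample records, and the cutoff year are assumptions for illustration, and citation counts are left out of this sketch.

```python
from collections import Counter

def most_active_authors(records, since_year, top_n=10):
    """Count papers per author, using only records published in or after since_year."""
    counts = Counter()
    for rec in records:
        if rec["year"] >= since_year:
            counts.update(set(rec["authors"]))  # set(): count each author once per paper
    return counts.most_common(top_n)

# Hypothetical records, not real PubMed data
records = [
    {"year": 2019, "authors": ["A. Author", "B. Author"]},
    {"year": 2018, "authors": ["A. Author"]},
    {"year": 2005, "authors": ["C. Author"]},
    {"year": 2017, "authors": ["B. Author", "A. Author"]},
]
print(most_active_authors(records, since_year=2015))
# [('A. Author', 3), ('B. Author', 2)]
```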

I am currently building a database from searched keywords and developing an abstract analyzer.

# Importing Packages and Functions

In [1]:
from Bio import Entrez
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import itertools

pd.set_option('display.max_colwidth', 1000)

In [2]:
def search(keyword):
    '''
    searches PubMed and returns the parsed search result for the keyword
    
    Arg:
        keyword (str): keyword of interest
        
    return:
        IDList (dict): parsed search result; the publication IDs are
            stored under the 'IdList' key
    '''
    
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='1000',
                            retmode='xml', 
                            term=keyword)
    IDList = Entrez.read(handle)
    handle.close()
    return IDList
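
Since the goal is to surface review articles, one option is to narrow the query on the PubMed side before calling `search`. The sketch below only builds the query string. `build_review_query` is a hypothetical helper, and it assumes PubMed's `[pt]` (publication type) and `[dp]` (date of publication) field tags behave as described in the PubMed help pages.

```python
def build_review_query(keyword, start_year=None, end_year=None):
    """Return a PubMed search term limited to reviews, optionally within a year range."""
    term = f"({keyword}) AND review[pt]"
    if start_year is not None and end_year is not None:
        term += f" AND {start_year}:{end_year}[dp]"
    return term

print(build_review_query("bioactive glass", 2015, 2020))
# (bioactive glass) AND review[pt] AND 2015:2020[dp]
```

The returned string could then be passed as the `keyword` argument of `search`.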

In [3]:
def fetch_details(id_list):
    '''
    returns article information from PubMed
    
    Arg: 
        id_list (list): publication IDs taken from the 'IdList' entry
            of the search function result
        
    return:
        results (dict): full information of articles 
    '''
    ids = ','.join(id_list)
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()
    return results
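
With up to 1000 IDs per search, it can help to fetch records in smaller batches. Below is a minimal, library-free sketch of the chunking step only. `chunked` is a hypothetical helper, and the batch size of 200 is an arbitrary assumption, not a documented limit.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

ids = [str(n) for n in range(1, 451)]  # placeholder IDs, not real PMIDs
batches = list(chunked(ids, 200))
print([len(b) for b in batches])  # [200, 200, 50]
```

Each batch could be passed to `fetch_details`, with the resulting `'PubmedArticle'` lists concatenated afterwards.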

In [4]:
def Author_list(papers):
    '''
    ranks authors by the number of fetched papers they appear on
    
    Arg:
        papers (dict): full information of articles
        
    return:
        top (Pandas DataFrame): top 10 authors with paper counts and a
            Google Scholar search link for their reviews
    '''
    
    paper_author_lst=[i['MedlineCitation']['Article']['AuthorList']\
                      for i in papers['PubmedArticle']]
    dfs=[pd.DataFrame(authors) for authors in paper_author_lst]
    names_dfs=pd.concat(dfs, axis=0, sort=True )
    author_count_df=names_dfs[['ForeName', 'LastName']]\
                    .groupby(['ForeName', 'LastName']).size()\
                    .reset_index(name='count').sort_values(by='count', ascending=False)
    top=author_count_df.head(10).reset_index(drop=True)
    google_url='https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q='
    name=top['ForeName']+' '+top['LastName']
    top['Google Scholar']=google_url+name.str.replace(' ', '+')+'+review&oq='
    
    return top
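
The link above is assembled by replacing spaces with `+`, which does not escape accented or other special characters in names. As a sketch of an alternative, the standard library's `urllib.parse.urlencode` handles the encoding. `scholar_review_url` is a hypothetical helper, and the extra parameters from the original URL (`hl`, `as_sdt`, `oq`) are omitted here for brevity.

```python
from urllib.parse import urlencode

def scholar_review_url(fore_name, last_name):
    """Build a Google Scholar search URL for '<name> review' with percent-encoding."""
    base = "https://scholar.google.co.kr/scholar"
    query = urlencode({"q": f"{fore_name} {last_name} review"})
    return f"{base}?{query}"

print(scholar_review_url("Aldo R", "Boccaccini"))
# https://scholar.google.co.kr/scholar?q=Aldo+R+Boccaccini+review
```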

In [5]:
def key_from_papers(papers):
    '''
    extracting key words provided by the authors
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_word (Pandas Series): list of key words
    '''
    fetch_key_word_papers=[i['MedlineCitation']['KeywordList'] for i in papers['PubmedArticle']]
    lst_key_papers=list(itertools.chain.from_iterable(list(itertools.chain.from_iterable(fetch_key_word_papers))))
    key_from_paper=pd.DataFrame({'key word from papers':lst_key_papers})
    key_word=key_from_paper['key word from papers'].str.lower()
    return key_word

In [6]:
def title_key(papers):
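    '''
    extracting key words from article titles
    
    Note: reads the global variable search_word, which must be defined
    before this function is called
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_rank (Pandas Series): bigrams ranked by summed TF-IDF score
    '''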
    
    
    titles=[i['MedlineCitation']['Article']['ArticleTitle'].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','')\
            .replace(search_word.lower(),'') for i in papers['PubmedArticle']]
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(titles)
    tfidf_df=pd.DataFrame(X.todense(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    
    return key_rank
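
The chained `.replace` calls above can also be expressed with two regular expressions. This is a sketch of the same cleaning steps under the same assumptions (only `<sub>`, `<sup>`, and `<i>` tags and the punctuation `, . : ?` are stripped). `clean_text` is a hypothetical helper. It takes the search word as a parameter instead of reading a global, and it additionally squeezes repeated spaces.

```python
import re

TAG_RE = re.compile(r"</?(?:sub|sup|i)>")  # the inline tags handled in the notebook
PUNCT_RE = re.compile(r"[,.:?]")           # the punctuation removed in the notebook

def clean_text(text, search_word):
    """Lowercase, strip inline tags and punctuation, remove the search word, squeeze spaces."""
    text = TAG_RE.sub("", text.lower())
    text = PUNCT_RE.sub("", text)
    text = text.replace(search_word.lower(), "")
    return " ".join(text.split())

print(clean_text("Bioactive <i>Glass</i>: SiO<sub>2</sub> scaffolds?", "bioactive"))
# glass sio2 scaffolds
```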


In [7]:
def Abstract_key(papers):
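    '''
    extracting key words from article abstracts
    
    Note: reads the global variable search_word; articles without an
    abstract are skipped
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_rank (Pandas Series): bigrams ranked by summed TF-IDF score
    '''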
    
    
    abstract_key=[]
    for i in papers['PubmedArticle']:
        try:
            abstract_key.append(i['MedlineCitation']['Article']['Abstract']['AbstractText'][0].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','').replace(search_word.lower(),''))
        except (KeyError, IndexError):  # article has no abstract
            continue
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(abstract_key)
    tfidf_df=pd.DataFrame(X.todense(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    
    return key_rank
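
The function above keeps only the first element of `AbstractText`. Assuming that structured abstracts arrive as several list elements (one per labelled section), the sketch below joins all of them. `abstract_text` is a hypothetical helper, and the two records are hand-written stand-ins for parsed PubMed entries, not real data.

```python
def abstract_text(article):
    """Return all abstract sections joined into one string, or None if there is no abstract."""
    try:
        sections = article["MedlineCitation"]["Article"]["Abstract"]["AbstractText"]
    except KeyError:
        return None
    return " ".join(str(s) for s in sections) or None

# Hand-written stand-ins for parsed records
with_abstract = {"MedlineCitation": {"Article": {
    "Abstract": {"AbstractText": ["Background text.", "Results text."]}}}}
without_abstract = {"MedlineCitation": {"Article": {}}}

print(abstract_text(with_abstract))     # Background text. Results text.
print(abstract_text(without_abstract))  # None
```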

In [8]:
search_word='bioactive'

In [9]:
results = search(search_word)
id_list = results['IdList']
papers = fetch_details(id_list)

# Author List

In [10]:
result=Author_list(papers)
result

| | ForeName | LastName | count | Google Scholar |
|---|---|---|---|---|
| 0 | Aldo R | Boccaccini | 33 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Aldo+R+Boccaccini+review&oq= |
| 1 | Francesco | Baino | 12 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Francesco+Baino+review&oq= |
| 2 | Jiang | Chang | 10 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jiang+Chang+review&oq= |
| 3 | Mohamed N | Rahaman | 10 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Mohamed+N+Rahaman+review&oq= |
| 4 | Julian R | Jones | 9 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Julian+R+Jones+review&oq= |
| 5 | Chengtie | Wu | 9 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Chengtie+Wu+review&oq= |
| 6 | Wenhai | Huang | 9 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Wenhai+Huang+review&oq= |
| 7 | Bo | Lei | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Bo+Lei+review&oq= |
| 8 | Hui | Wang | 7 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Hui+Wang+review&oq= |
| 9 | Robert G | Hill | 7 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Robert+G+Hill+review&oq= |


In [11]:
result['Google Scholar']

0    https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Aldo+R+Boccaccini+review&oq=
1      https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Francesco+Baino+review&oq=
2          https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jiang+Chang+review&oq=
3    https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Mohamed+N+Rahaman+review&oq=
4       https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Julian+R+Jones+review&oq=
5          https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Chengtie+Wu+review&oq=
6         https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Wenhai+Huang+review&oq=
7               https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Bo+Lei+review&oq=
8             https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Hui+Wang+review&oq=
9        https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Robert+G+Hill+review&oq=
Name: Google Scholar, dtype: object

# Key Word List

In [12]:
key_paper_lst=key_from_papers(papers)

In [13]:
key_paper_lst.value_counts().head(20)

bioactive glass               141
bioactive compounds           117
bioactive peptides             53
bone regeneration              34
bioactivity                    33
antioxidant activity           24
scaffold                       23
bioactive                      21
bioactive glasses              20
angiogenesis                   20
tissue engineering             19
strontium                      19
antioxidant                    18
bone tissue engineering        18
scaffolds                      16
bioactive peptide              16
cytotoxicity                   16
osteogenesis                   16
mechanical properties          15
mesoporous bioactive glass     14
Name: key word from papers, dtype: int64
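
The counts above list some singular and plural forms separately (for example `scaffold` and `scaffolds`, or `bioactive glass` and `bioactive glasses`). One cautious way to merge them, shown here only as a sketch, is to fold a plural into its singular when that singular also appears in the counts. `merge_plurals` is a hypothetical helper, and the sample counts are copied from the output above.

```python
from collections import Counter

def merge_plurals(counts):
    """Fold a key ending in 'es' or 's' into its stem when that stem is also a key."""
    merged = Counter()
    for key, n in counts.items():
        target = key
        for suffix in ("es", "s"):
            stem = key[: -len(suffix)]
            if key.endswith(suffix) and stem in counts:
                target = stem
                break
        merged[target] += n
    return merged

counts = Counter({"scaffold": 23, "scaffolds": 16, "bioactive glass": 141,
                  "bioactive glasses": 20, "angiogenesis": 20})
print(merge_plurals(counts).most_common())
# [('bioactive glass', 161), ('scaffold', 39), ('angiogenesis', 20)]
```

Because a key is only folded when its stem already exists, words such as `angiogenesis` are left untouched.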

# Key Words from Title

In [14]:
key_title=title_key(papers)
key_title[:20]

tissue engineering       7.612911
bone tissue              5.811692
bone regeneration        5.673898
mesoporous glass         5.290012
antioxidant activity     5.029740
mass spectrometry        4.442655
derived peptides         4.416204
glass scaffolds          4.291390
compounds antioxidant    4.109488
glass nanoparticles      3.964210
extraction compounds     3.541104
scaffolds bone           3.540254
stem cells               3.465676
liquid chromatography    3.188447
glass based              3.153830
mechanical properties    3.147539
natural products         3.010828
containing glass         2.935943
composite scaffolds      2.794192
ultrasound assisted      2.785950
dtype: float64

In [15]:
(key_title-key_title.mean())/key_title.std()

tissue engineering             26.116043
bone tissue                    19.610974
bone regeneration              19.113334
mesoporous glass               17.726933
antioxidant activity           16.786967
mass spectrometry              14.666719
derived peptides               14.571193
glass scaffolds                14.120427
compounds antioxidant          13.463494
glass nanoparticles            12.938822
extraction compounds           11.410782
scaffolds bone                 11.407716
stem cells                     11.138376
liquid chromatography          10.137169
glass based                    10.012150
mechanical properties           9.989430
natural products                9.495701
containing glass                9.225256
composite scaffolds             8.713326
ultrasound assisted             8.683560
wound healing                   8.460297
traditional chinese             8.142179
bone repair                     8.065522
sol gel                         7.404070
food derived    
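
The expression in the cell above standardizes each score as z = (score - mean) / std, so the values read as "standard deviations above the average bigram". A minimal standard-library sketch follows. `z_scores` is a hypothetical helper, and because it uses only three scores from the title ranking, its numbers will not match the output above, which is computed over the full vocabulary.

```python
import statistics

def z_scores(scores):
    """Standardize a {phrase: score} mapping: (score - mean) / sample standard deviation."""
    values = list(scores.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # n - 1 denominator; needs at least two values
    return {phrase: (value - mean) / std for phrase, value in scores.items()}

sample = {"tissue engineering": 7.612911, "bone tissue": 5.811692, "bone regeneration": 5.673898}
for phrase, z in z_scores(sample).items():
    print(f"{phrase:20s} {z: .3f}")
```

`statistics.stdev` uses the same n - 1 denominator as the pandas default for `Series.std`.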

# Key Words from Abstract

In [16]:
key_abstract=Abstract_key(papers)

In [17]:
key_abstract[:20]

tissue engineering       4.778234
antioxidant activity     4.454818
bone regeneration        4.332478
bone tissue              4.145507
mechanical properties    4.130609
body fluid               3.497098
anti inflammatory        3.390992
simulated body           3.330787
sol gel                  3.187613
stem cells               3.185462
aim study                3.128373
present study            3.106228
fatty acids              2.984774
glass nanoparticles      2.937820
bone defects             2.912652
mass spectrometry        2.660493
liquid chromatography    2.640712
phenolic compounds       2.565615
natural products         2.550389
scanning electron        2.526887
dtype: float64

In [18]:
(key_abstract-key_abstract.mean())/key_abstract.std()

tissue engineering         41.664670
antioxidant activity       38.776298
bone regeneration          37.683710
bone tissue                36.013902
mechanical properties      35.880850
body fluid                 30.223090
anti inflammatory          29.275479
simulated body             28.737803
sol gel                    27.459142
stem cells                 27.439928
aim study                  26.930079
present study              26.732305
fatty acids                25.647628
glass nanoparticles        25.228288
bone defects               25.003515
mass spectrometry          22.751531
liquid chromatography      22.574876
phenolic compounds         21.904200
natural products           21.768220
scanning electron          21.558330
electron microscopy        21.147168
results showed             21.009006
glass scaffolds            20.961547
derived peptides           20.662426
human health               19.999477
study aimed                19.611376
traditional chinese        19.534592
...

In [24]:
search_word+' '+key_abstract.index[0]

'bioactive tissue engineering'

Keyword used in search + Keyword suggested by user
> search word: tissue engineering

> keyword suggestion: stem cell

> result: **stem cell tissue engineering**
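
The abstract ranking above also contains generic phrases such as `aim study` and `present study`, which would not make useful search terms. The sketch below combines the search word with the top-ranked phrases while skipping a small stoplist. `suggest_queries` and `GENERIC_PHRASES` are hypothetical names, the stoplist is an assumption, and the input is a slice of the abstract ranking shown earlier.

```python
GENERIC_PHRASES = {"aim study", "present study", "results showed", "study aimed"}  # assumed stoplist

def suggest_queries(search_word, ranked_phrases, top_n=3):
    """Combine the search word with the top-ranked phrases, skipping generic ones."""
    picked = [p for p in ranked_phrases if p not in GENERIC_PHRASES][:top_n]
    return [f"{search_word} {p}" for p in picked]

ranked = ["stem cells", "aim study", "present study", "fatty acids", "glass nanoparticles"]
print(suggest_queries("bioactive", ranked))
# ['bioactive stem cells', 'bioactive fatty acids', 'bioactive glass nanoparticles']
```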


LDA
word2vec
pyldavis

textrank
lexrank