# Research Paper Recommender

### Review article recommender using the PubMed API, with keyword extraction from article titles using TF-IDF

During my research career I switched topics frequently, so I know how painful it is to start studying a new field from scratch. I am therefore developing an educational platform for junior researchers who are just starting out, built around a paper and keyword recommender system.

**Keyword recommendation:** Current research article search engines suggest keywords based on the keywords supplied by article authors, and these are mostly one-word synonyms. I developed a keyword recommender that uses text from titles and abstracts instead. Combining n-grams with TF-IDF yields intuitive keywords that indicate what to look up when studying the field.

**Paper recommendation:** The best way to learn a new field is to read a good review article written by a leading scientist in that field. Ranking articles by citation count is heavily biased toward publication date, because the earliest articles in a field tend to accumulate the most citations. This recommender instead lists review articles by the most active researchers, based on recent publication and citation counts, and displays the most recent reviews by analyzing the PubMed database.
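
One way to narrow a PubMed search to recent review articles is to combine a publication-type filter with a relative date window. The sketch below only builds the keyword arguments for `Entrez.esearch`; the helper name `build_review_query`, the five-year window, and `retmax=100` are illustrative assumptions, not part of this notebook.

```python
def build_review_query(keyword, years=5):
    """Return esearch keyword arguments for recent review articles on `keyword`."""
    return {
        "db": "pubmed",
        "term": f"({keyword}) AND review[pt]",  # review[pt]: publication-type filter
        "datetype": "pdat",                     # filter on publication date
        "reldate": years * 365,                 # only items from the last N days
        "sort": "pub_date",                     # newest first
        "retmax": 100,                          # assumed cap on returned IDs
    }


params = build_review_query("nanoparticle")
print(params["term"])  # (nanoparticle) AND review[pt]

# With Biopython installed and Entrez.email set, the query could be run as:
#   handle = Entrez.esearch(**params)
```

Keeping the query construction separate from the network call makes it easy to inspect or log the exact search before sending it.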

I am currently building a database from searched keywords and developing an abstract analyzer.

# Importing Packages and Functions

In [1]:
from Bio import Entrez
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import itertools

pd.set_option('display.max_colwidth', 1000)

In [2]:
def search(keyword):
    '''
    Search PubMed and return the parsed esearch result for a keyword.
    
    Arg:
        keyword (str): keyword of interest
        
    return:
        results (dict): parsed esearch response; results['IdList'] holds
            up to 1000 PubMed IDs sorted by relevance
    '''
    
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='1000',
                            retmode='xml', 
                            term=keyword)
    results = Entrez.read(handle)
    handle.close()
    return results

In [3]:
def fetch_details(id_list):
    '''
    Fetch full article records from PubMed.
    
    Arg: 
        id_list (list): PubMed IDs, i.e. the 'IdList' entry returned by search()
        
    return:
        results (dict): full information of articles 
    '''
    ids = ','.join(id_list)
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()
    return results

In [4]:
def Author_list(papers):
    '''
    Count how often each author appears in the fetched articles and
    return the ten most frequent with a Google Scholar search link.
    
    Arg:
        papers (dict): full information of articles
        
    return:
        result (Pandas DataFrame): ForeName, LastName, count, Google Scholar
    '''
    
    # Articles without an AuthorList contribute an empty list
    paper_author_lst=[i['MedlineCitation']['Article'].get('AuthorList', [])\
                      for i in papers['PubmedArticle']]
    dfs=[pd.DataFrame(authors) for authors in paper_author_lst]
    names_dfs=pd.concat(dfs, axis=0, sort=True )
    author_count_df=names_dfs[['ForeName', 'LastName']]\
                    .groupby(['ForeName', 'LastName']).size()\
                    .reset_index(name='count').sort_values(by='count', ascending=False)
    top=author_count_df.head(10).reset_index(drop=True)
    google_url='https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q='
    name=top['ForeName']+' '+top['LastName']
    # Build one search URL per author: "<name> review"
    result=top.assign(**{'Google Scholar': google_url+name.str.replace(' ', '+')+'+review&oq='})
    
    return result

In [5]:
def key_from_papers(papers):
    '''
    Extract the keywords provided by the article authors.
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_word (Pandas Series): lower-cased keywords, one entry per occurrence
    '''
    fetch_key_word_papers=[i['MedlineCitation']['KeywordList'] for i in papers['PubmedArticle']]
    # KeywordList is a list of lists per article, so flatten two levels
    lst_key_papers=list(itertools.chain.from_iterable(itertools.chain.from_iterable(fetch_key_word_papers)))
    key_from_paper=pd.DataFrame({'key word from papers':lst_key_papers})
    key_word=key_from_paper['key word from papers'].str.lower()
    return key_word

In [6]:
def title_key(papers):
    '''
    Rank two-word phrases (bigrams) in article titles by summed TF-IDF weight.
    
    Note: reads the module-level variable search_word and removes it from
    each title, so search_word must be defined before calling.
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_rank (Pandas Series): bigrams sorted by descending score
    '''
    
    titles=[i['MedlineCitation']['Article']['ArticleTitle'].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','')\
            .replace(search_word.lower(),'') for i in papers['PubmedArticle']]
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(titles)
    # sorted(vocabulary_) lists the terms in column order
    tfidf_df=pd.DataFrame(X.toarray(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    
    return key_rank


In [7]:
def Abstract_key(papers):
    '''
    Rank two-word phrases (bigrams) in abstracts by summed TF-IDF weight.
    
    Note: reads the module-level variable search_word, and uses only the
    first AbstractText element of each article.
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_rank (Pandas Series): bigrams sorted by descending score
    '''
    
    abstract_key=[]
    for i in papers['PubmedArticle']:
        try:
            abstract_key.append(i['MedlineCitation']['Article']['Abstract']['AbstractText'][0].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','').replace(search_word.lower(),''))
        except (KeyError, IndexError):
            # Skip articles that have no abstract
            continue
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(abstract_key)
    # sorted(vocabulary_) lists the terms in column order
    tfidf_df=pd.DataFrame(X.toarray(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    
    return key_rank

In [8]:
search_word='nanoparticle'

In [9]:
results = search(search_word)
id_list = results['IdList']
papers = fetch_details(id_list)
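
The cell above sends every returned ID to `efetch` in a single request. For larger result sets it may be more robust to fetch in batches and merge the results. The helper below is a minimal sketch of the batching step only; the name `chunked` and the batch size of 200 are assumptions for illustration.

```python
def chunked(ids, size=200):
    """Yield successive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Demonstration on placeholder IDs (not real PubMed IDs)
demo_ids = [str(n) for n in range(1, 6)]
print(list(chunked(demo_ids, size=2)))
# [['1', '2'], ['3', '4'], ['5']]

# Possible use with the notebook's fetch_details (requires network access):
#   articles = []
#   for batch in chunked(id_list):
#       articles.extend(fetch_details(batch)['PubmedArticle'])
```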

# Author List

In [10]:
result=Author_list(papers)
result

| | ForeName | LastName | count | Google Scholar |
|---|---|---|---|---|
| 0 | Warren C W | Chan | 6 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Warren+C+W+Chan+review&oq= |
| 1 | Conxita | Solans | 6 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Conxita+Solans+review&oq= |
| 2 | David Julian | McClements | 6 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=David+Julian+McClements+review&oq= |
| 3 | Achim | Aigner | 5 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Achim+Aigner+review&oq= |
| 4 | Nikhil R | Jana | 5 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Nikhil+R+Jana+review&oq= |
| 5 | Christine K | Payne | 4 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Christine+K+Payne+review&oq= |
| 6 | Liangfang | Zhang | 4 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Liangfang+Zhang+review&oq= |
| 7 | Klaus | Langer | 4 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Klaus+Langer+review&oq= |
| 8 | Issei | Takeuchi | 4 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Issei+Takeuchi+review&oq= |
| 9 | Jordan J | Green | 4 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jordan+J+Green+review&oq= |


In [11]:
result['Google Scholar']

0            https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Warren+C+W+Chan+review&oq=
1             https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Conxita+Solans+review&oq=
2    https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=David+Julian+McClements+review&oq=
3               https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Achim+Aigner+review&oq=
4              https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Nikhil+R+Jana+review&oq=
5          https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Christine+K+Payne+review&oq=
6            https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Liangfang+Zhang+review&oq=
7               https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Klaus+Langer+review&oq=
8             https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Issei+Takeuchi+review&oq=
9             https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jordan+J+Green+review&oq=
Name: Google Scholar
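
The ranking above counts how often each (ForeName, LastName) pair appears across the fetched author lists and attaches a search link per author. The same idea can be sketched without pandas on made-up records. The helper names `count_authors` and `scholar_url` are hypothetical, the records are mock data, and the URL here uses only the `q` parameter rather than the full query string used in `Author_list`.

```python
from collections import Counter
from urllib.parse import quote_plus


def count_authors(author_lists):
    """Count (ForeName, LastName) pairs across a list of author lists."""
    counts = Counter()
    for authors in author_lists:
        for author in authors:
            fore, last = author.get("ForeName"), author.get("LastName")
            if fore and last:  # skip entries without a personal name
                counts[(fore, last)] += 1
    return counts


def scholar_url(fore, last):
    """Build a Google Scholar search URL for '<name> review'."""
    return "https://scholar.google.com/scholar?q=" + quote_plus(f"{fore} {last} review")


# Mock records shaped like PubMed author entries (not real data)
mock_author_lists = [
    [{"ForeName": "Jane", "LastName": "Doe"}, {"ForeName": "John", "LastName": "Roe"}],
    [{"ForeName": "Jane", "LastName": "Doe"}, {"CollectiveName": "Example Consortium"}],
]

for (fore, last), n in count_authors(mock_author_lists).most_common(2):
    print(fore, last, n, scholar_url(fore, last))
```

Using `quote_plus` escapes spaces and any special characters in names, which the plain space-to-plus replacement does not cover.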

# Key Word List

In [12]:
key_paper_lst=key_from_papers(papers)

In [13]:
key_paper_lst.value_counts().head(20)

nanoparticles             189
nanoparticle              158
drug delivery              46
chitosan                   33
nanomedicine               21
protein corona             19
cancer                     19
gold nanoparticles         16
toxicity                   14
plga                       14
cytotoxicity               14
apoptosis                  13
macrophage                 12
doxorubicin                12
inflammation               11
biodistribution            11
silver nanoparticles       11
magnetic nanoparticles     11
nanotechnology             10
chemotherapy               10
Name: key word from papers, dtype: int64

# Key Words from Title

In [14]:
key_title=title_key(papers)
key_title[:20]

drug delivery           9.613238
protein corona          6.929557
iron oxide              5.616238
cancer therapy          4.732969
mesoporous silica       3.729772
surface charge          3.584749
cellular uptake         3.345526
core shell              2.877991
self assembled          2.837052
titanium dioxide        2.719085
drug release            2.702896
oral delivery           2.551574
lung cancer             2.529179
breast cancer           2.495949
cell interactions       2.470004
tracking analysis       2.460424
gene delivery           2.365072
photodynamic therapy    2.298330
serum albumin           2.115370
real time               2.088745
dtype: float64
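
Each score above is the sum of one bigram's TF-IDF weights over all titles. To make that number concrete, here is a dependency-free sketch of the same calculation under what I understand to be the `TfidfVectorizer` defaults: stop words are dropped before bigrams are formed, IDF is smoothed as ln((1 + n) / (1 + df)) + 1, and each document vector is L2-normalised. The stop-word set and the three toy titles are assumptions for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word set; scikit-learn's 'english' list is much longer
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def bigrams(text):
    """Lowercase, tokenize, drop stop words, then pair neighbouring tokens."""
    tokens = [t for t in re.findall(r"[a-z0-9]{2,}", text.lower())
              if t not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def rank_bigrams(docs):
    """Sum L2-normalised TF-IDF weights of each bigram over all documents."""
    counts = [Counter(bigrams(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    scores = defaultdict(float)
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            scores[t] += w / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


toy_titles = [
    "Protein corona of gold nanoparticles",
    "Drug delivery with gold nanoparticles",
    "Oral drug delivery",
]
for term, score in rank_bigrams(toy_titles):
    print(f"{term:20s} {score:.3f}")
```

Because every document vector has unit length, a bigram scores highly when it appears in many short, focused titles, which is why phrases like `drug delivery` rise to the top.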

# Key Words from Abstract

In [15]:
key_abstract=Abstract_key(papers)

In [16]:
key_abstract[:20]

drug delivery                 8.766699
cancer cells                  4.797810
protein corona                4.636144
drug release                  4.469569
electron microscopy           4.345457
iron oxide                    4.014953
cellular uptake               3.957454
particle size                 3.482660
zeta potential                3.381668
surface charge                3.202056
delivery systems              3.163817
transmission electron         2.748382
physicochemical properties    2.681125
vitro vivo                    2.623228
polyethylene glycol           2.548597
mesoporous silica             2.543251
present study                 2.367577
self assembly                 2.355023
core shell                    2.328665
gene delivery                 2.272032
dtype: float64
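
`Abstract_key` cleans each abstract with a chain of `replace` calls and reads only the first `AbstractText` element, so the later sections of a structured abstract are not analyzed. A possible alternative, sketched here on a mock record, joins all sections and strips the same tags and punctuation with regular expressions. The helper name `abstract_text` and the mock article are assumptions for illustration.

```python
import re

TAG_RE = re.compile(r"</?(?:sub|sup|i)>")  # same inline tags the notebook removes
PUNCT_RE = re.compile(r"[,.:?]")           # same punctuation the notebook removes


def abstract_text(article, search_word=""):
    """Return the cleaned, lower-cased abstract of one article ('' if absent)."""
    sections = (article.get("MedlineCitation", {})
                       .get("Article", {})
                       .get("Abstract", {})
                       .get("AbstractText", []))
    text = " ".join(str(s) for s in sections).lower()
    text = TAG_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    if search_word:
        text = text.replace(search_word.lower(), "")
    return text


# Mock record with a two-section abstract (not real data)
mock_article = {"MedlineCitation": {"Article": {"Abstract": {"AbstractText": [
    "Gold <i>nano</i>particles, tested.",
    "Results: H<sub>2</sub>O uptake?",
]}}}}

print(abstract_text(mock_article))
# gold nanoparticles tested results h2o uptake
```

Returning an empty string for articles without an abstract avoids the need for a `try`/`except` around the lookup; empty strings can be filtered out before vectorizing.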

Planned next steps: LDA, word vectors, pyLDAvis, TextRank, LexRank.