# Research Paper Recommender

### Review article recommender using the PubMed API and keyword extraction from article titles with TF-IDF

During my research career I switched topics frequently, so I know how painful it is to start studying a new field from scratch. I am therefore developing an educational platform for junior researchers who are just starting out, built around a paper and keyword recommender system.

**Keyword recommendation:** Current research article search engines suggest keywords based on the keywords supplied by article authors, and these are mostly one-word synonyms. I developed a keyword recommender that uses the text of titles and abstracts instead. Combining n-grams with TF-IDF yields intuitive keywords that indicate what to look up when studying the field.
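
To make the n-gram plus TF-IDF idea concrete, here is a minimal pure-Python sketch on three made-up titles. It is not the code used in this notebook (which relies on `TfidfVectorizer`); the helper names `bigrams` and `rank_bigrams`, the tiny stop-word list, and the whitespace tokenizer are assumptions for illustration. The weighting roughly follows the smoothed IDF and L2 normalisation that scikit-learn applies by default.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"for", "in", "of", "the", "and", "a"}

def bigrams(text):
    """Lowercase, drop stop words, then pair up adjacent tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def rank_bigrams(docs):
    """Sum the L2-normalised TF-IDF weight of each bigram over all documents."""
    doc_counts = [Counter(bigrams(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each bigram
    df = Counter(term for counts in doc_counts for term in counts)
    totals = Counter()
    for counts in doc_counts:
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            totals[t] += w / norm
    return totals.most_common()

# Made-up titles, not PubMed data
titles = [
    "stem cells for bone regeneration",
    "stem cells in cartilage repair",
    "hydrogel scaffolds for bone regeneration",
]
for term, score in rank_bigrams(titles)[:3]:
    print(f"{term:20s} {score:.3f}")
```

Bigrams that recur across titles ("stem cells", "bone regeneration") accumulate the highest summed score, which is the same ranking principle used for the title and abstract keywords later in this notebook.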

**Paper recommendation:** The best way to learn a new field is to read a good review article written by a leading scientist in that field. Ranking articles by citation count is heavily biased towards publication date, because the earliest articles in a field tend to accumulate the most citations. This recommender instead lists review articles by the most active researchers, based on their recent publication and citation counts, and displays their most recent reviews by analyzing the PubMed database.
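
One possible way to restrict a PubMed search to recent review articles is to add field tags to the search term. The sketch below only builds the query string; `build_review_query` is a hypothetical helper that does not appear in this notebook, and the year range is arbitrary. It assumes PubMed's `[pt]` (publication type) and `[dp]` (date of publication) tags.

```python
def build_review_query(keyword, start_year, end_year):
    """Build a PubMed search term limited to reviews within a year range."""
    # [pt] = publication type, [dp] = date of publication
    return f"({keyword}) AND review[pt] AND {start_year}:{end_year}[dp]"

query = build_review_query("tissue engineering", 2015, 2019)
print(query)  # (tissue engineering) AND review[pt] AND 2015:2019[dp]
```

The resulting string could be passed as the `term` argument of `Entrez.esearch` in place of the bare keyword.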

I am currently working on building a database from searched keywords and on developing an abstract analyzer.

# Importing Packages and Functions

In [1]:
from Bio import Entrez
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import itertools

pd.set_option('display.max_colwidth', 1000)

In [2]:
def search(keyword):
    '''
    returns the PubMed search result for articles related to the keyword
    
    Arg:
        keyword (str): keyword of interest
        
    return:
        record (dict): parsed search result; record['IdList'] holds the
                       publication IDs related to the keyword
    '''
    
    # NCBI asks for a contact email with every Entrez request
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='1000',
                            retmode='xml', 
                            term=keyword)
    record = Entrez.read(handle)
    handle.close()
    return record

In [3]:
def fetch_details(id_list):
    '''
    returns article information from PubMed
    
    Arg: 
        id_list (list): publication IDs returned by the search function
        
    return:
        results (dict): full information of articles 
    '''
    # efetch expects the IDs as one comma-separated string
    ids = ','.join(id_list)
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()
    return results

In [4]:
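def Author_list(papers):
    '''
    returns the ten most frequent authors with a Google Scholar search link
    
    Arg:
        papers (dict): full information of articles
        
    return:
        result (Pandas DataFrame): ForeName, LastName, count and Google Scholar URL
    '''
    # Records without an AuthorList contribute an empty list instead of raising KeyError
    paper_author_lst=[i['MedlineCitation']['Article'].get('AuthorList', [])\
                      for i in papers['PubmedArticle']]
    dfs=[pd.DataFrame(authors) for authors in paper_author_lst]
    names_dfs=pd.concat(dfs, axis=0, sort=True )
    # Count how many fetched papers each (ForeName, LastName) pair appears on
    author_count_df=names_dfs[['ForeName', 'LastName']]\
                    .groupby(['ForeName', 'LastName']).size()\
                    .reset_index(name='count').sort_values(by='count', ascending=False)
    top=author_count_df.head(10)
    google_url='https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q='
    name=top['ForeName']+' '+top['LastName']
    urls=[google_url+i for i in name.str.replace(' ', '+')+'+review&oq=']
    result=top.reset_index(drop=True).join(pd.DataFrame({'Google Scholar':urls}))
    
    return result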
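
As an aside, the Google Scholar link could also be built with the standard library's `urlencode`, which percent-encodes characters such as accents or commas in author names. This is a sketch, not the code used above; `scholar_review_url` and `SCHOLAR_BASE` are hypothetical names, and the trailing empty `oq=` parameter from the original URL is dropped here on the assumption that it is not needed.

```python
from urllib.parse import urlencode

SCHOLAR_BASE = "https://scholar.google.co.kr/scholar"

def scholar_review_url(fore_name, last_name):
    """Build a Google Scholar search URL for '<author> review'."""
    params = {
        "hl": "ko",
        "as_sdt": "0,5",
        "q": f"{fore_name} {last_name} review",
    }
    # urlencode turns spaces into '+' and escapes reserved characters
    return f"{SCHOLAR_BASE}?{urlencode(params)}"

print(scholar_review_url("Rui L", "Reis"))
# https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Rui+L+Reis+review
```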

In [5]:
def key_from_papers(papers):
    '''
    extracting key words provided by the authors
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_word (Pandas Series): list of key words
    '''
    fetch_key_word_papers=[i['MedlineCitation']['KeywordList'] for i in papers['PubmedArticle']]
    lst_key_papers=list(itertools.chain.from_iterable(list(itertools.chain.from_iterable(fetch_key_word_papers))))
    key_from_paper=pd.DataFrame({'key word from papers':lst_key_papers})
    key_word=key_from_paper['key word from papers'].str.lower()
    return key_word

In [6]:
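def title_key(papers):
    '''
    extracting two-word key phrases from article titles, ranked by summed TF-IDF
    
    Note: relies on the global variable search_word, which is removed from
    each title so that the search term itself does not dominate the ranking
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_rank (Pandas Series): bigrams indexed by text, sorted by summed TF-IDF score
    '''
    
    # Lowercase, then strip punctuation, inline markup and the search term
    titles=[i['MedlineCitation']['Article']['ArticleTitle'].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','')\
            .replace(search_word.lower(),'') for i in papers['PubmedArticle']]
    # Bigrams only; English stop words are removed before the pairs are formed
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(titles)
    # Feature indices follow alphabetical order, so the sorted vocabulary gives the column names
    tfidf_df=pd.DataFrame(X.todense(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    
    return key_rank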


In [7]:
def Abstract_key(papers):
    '''
    extracting two-word key phrases from abstracts, ranked by summed TF-IDF
    
    Note: relies on the global variable search_word, and uses only the first
    AbstractText section of each article
    
    Arg:
        papers (dict): full information of articles
        
    return:
        key_rank (Pandas Series): bigrams indexed by text, sorted by summed TF-IDF score
    '''
    
    abstract_key=[]
    for i in papers['PubmedArticle']:
        try:
            abstract_key.append(i['MedlineCitation']['Article']['Abstract']['AbstractText'][0].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','').replace(search_word.lower(),''))
        except (KeyError, IndexError):
            # Skip articles that have no abstract
            continue
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(abstract_key)
    tfidf_df=pd.DataFrame(X.todense(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    
    return key_rank
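
The title and abstract functions share the same chain of `.replace` calls. One way to factor that out is a small regex-based helper, sketched below. `clean_text` is a hypothetical name not used elsewhere in this notebook; it takes the search word as a parameter instead of reading a global, and it additionally collapses repeated whitespace, which the original chain does not do.

```python
import re

TAG_RE = re.compile(r"</?(?:sub|sup|i)>")   # inline markup seen in PubMed titles
PUNCT_RE = re.compile(r"[,.:?]")            # same punctuation the notebook strips

def clean_text(text, search_word):
    """Lowercase text, strip markup and punctuation, and remove the search word."""
    text = text.lower()
    text = TAG_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    text = text.replace(search_word.lower(), "")
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

print(clean_text("Tissue Engineering of Bone: A Review.", "tissue engineering"))
# of bone a review
```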

In [8]:
search_word='tissue engineering'

In [9]:
results = search(search_word)
id_list = results['IdList']
papers = fetch_details(id_list)

# Author List

In [10]:
result=Author_list(papers)
result

| | ForeName | LastName | count | Google Scholar |
|---|---|---|---|---|
| 0 | Rui L | Reis | 19 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Rui+L+Reis+review&oq= |
| 1 | Ali | Khademhosseini | 10 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Ali+Khademhosseini+review&oq= |
| 2 | Kyriacos A | Athanasiou | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Kyriacos+A+Athanasiou+review&oq= |
| 3 | Masoud | Mozafari | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Masoud+Mozafari+review&oq= |
| 4 | Xiongbiao | Chen | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Xiongbiao+Chen+review&oq= |
| 5 | Jerry C | Hu | 7 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jerry+C+Hu+review&oq= |
| 6 | Jafar | Ai | 7 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jafar+Ai+review&oq= |
| 7 | Hae-Won | Kim | 7 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Hae-Won+Kim+review&oq= |
| 8 | Dilek | Keskin | 6 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Dilek+Keskin+review&oq= |
| 9 | Antonios G | Mikos | 6 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Antonios+G+Mikos+review&oq= |


In [11]:
result['Google Scholar']

0               https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Rui+L+Reis+review&oq=
1       https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Ali+Khademhosseini+review&oq=
2    https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Kyriacos+A+Athanasiou+review&oq=
3          https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Masoud+Mozafari+review&oq=
4           https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Xiongbiao+Chen+review&oq=
5               https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jerry+C+Hu+review&oq=
6                 https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Jafar+Ai+review&oq=
7              https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Hae-Won+Kim+review&oq=
8             https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Dilek+Keskin+review&oq=
9         https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Antonios+G+Mikos+review&oq=
Name: Google Scholar, dtype: object

# Key Word List

In [12]:
key_paper_lst=key_from_papers(papers)

In [13]:
key_paper_lst.value_counts().head(20)

tissue engineering              446
bone tissue engineering         125
scaffold                         98
regenerative medicine            67
scaffolds                        64
biomaterials                     63
electrospinning                  61
stem cells                       59
chitosan                         51
mesenchymal stem cells           43
hydrogel                         41
cartilage tissue engineering     39
bone                             34
drug delivery                    33
biomaterial                      28
extracellular matrix             28
hydrogels                        27
collagen                         26
gelatin                          23
biocompatibility                 23
Name: key word from papers, dtype: int64

# Key Words from Title

In [14]:
key_title=title_key(papers)
key_title[:20]

stem cells               13.291217
scaffolds bone           11.109212
regenerative medicine     9.959245
stem cell                 7.837889
drug delivery             6.995312
mesenchymal stem          6.919429
tissue engineering        6.221868
systematic review         5.786348
extracellular matrix      5.551079
recent advances           4.873178
silk fibroin              4.825905
based scaffolds           4.736733
articular cartilage       4.165468
cell based                4.024800
growth factor             3.819023
bone applications         3.509373
composite scaffolds       3.440955
3d printing               3.345236
animal models             3.305354
composite scaffold        3.283646
dtype: float64

# Key Words from Abstract

In [15]:
key_abstract=Abstract_key(papers)

In [16]:
key_abstract[:20]

stem cells               12.855377
mechanical properties     8.273759
regenerative medicine     7.703138
tissue engineered         7.271873
extracellular matrix      6.677208
stem cell                 6.039579
mesenchymal stem          5.745126
tissue regeneration       5.606731
growth factors            5.311438
tissue engineering        4.251521
cell proliferation        3.900226
3d printing               3.826932
growth factor             3.811594
bone tissue               3.690361
articular cartilage       3.539453
drug delivery             3.474507
adipose tissue            3.426886
cell culture              3.395901
bone regeneration         3.266220
cell adhesion             3.225387
dtype: float64

Keyword used in search + keyword suggested by the recommender
> search word: tissue engineering

> keyword suggestion: stem cell

> result: **stem cell tissue engineering**
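
The combination shown above can be sketched as a one-line helper. `combine_keywords` is a hypothetical name, and placing the suggestion before the original search word is an assumption taken from the single example given here.

```python
def combine_keywords(search_word, suggestion):
    """Prepend the suggested keyword to the original search word."""
    return f"{suggestion.strip()} {search_word.strip()}"

print(combine_keywords("tissue engineering", "stem cell"))
# stem cell tissue engineering
```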


LDA
word2vec
pyLDAvis

TextRank
LexRank