# Research Paper Recommender

### Review article recommender using the PubMed API, with keyword extraction from article titles using TF-IDF

During my research career I switched topics frequently, so I know how painful it is to start studying a new field from scratch. I am therefore developing an educational platform for junior researchers who are just starting out, built on a paper and keyword recommender system.

Keyword recommendation: Current research article search engines suggest keywords based on those supplied by the article authors, and these are mostly one-word synonyms. I developed a keyword recommender that uses the text of titles and abstracts instead. Combining n-grams with TF-IDF yields intuitive keywords that indicate what to look up when studying the field.
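
As a rough illustration of this idea, the sketch below scores bigrams with TF-IDF in plain Python. The function names `bigrams` and `tfidf_bigram_scores` and the three toy titles are made up for this example. The notebook itself uses scikit-learn's `TfidfVectorizer`, which additionally removes stop words and L2-normalizes each document, so its numbers will differ.

```python
import math
from collections import Counter

def bigrams(text):
    """Split text on whitespace and return adjacent word pairs as strings."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf_bigram_scores(docs):
    """Sum a smoothed TF-IDF weight for every bigram over all documents."""
    n_docs = len(docs)
    doc_bigrams = [bigrams(d) for d in docs]
    # Document frequency: in how many documents does each bigram occur?
    df = Counter(b for grams in doc_bigrams for b in set(grams))
    scores = Counter()
    for grams in doc_bigrams:
        for gram, tf in Counter(grams).items():
            # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1
            scores[gram] += tf * idf
    return scores

# Hypothetical titles, for illustration only.
titles = [
    "solid oxide fuel cell cathode",
    "solid oxide electrolyte stability",
    "microbial fuel cell wastewater treatment",
]
ranking = tfidf_bigram_scores(titles).most_common()
```

In this toy corpus `solid oxide` and `fuel cell` end up with the highest summed scores, because each occurs in two of the three titles.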

Paper recommendation: The best way to learn a new field is to read a good review article written by a leading scientist in that field. Finding such articles by citation count is heavily biased towards publication date, because the earliest articles in a field tend to accumulate the most citations. This recommender analyzes the PubMed database to provide a list of review articles by the most active researchers, judged by recent publication count and citations, and displays their most recent reviews.
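
The ranking idea can be sketched without any API calls. The snippet below counts how often each author appears among papers published in or after a cutoff year. The record layout (`year`, `authors`), the function name `most_active_authors`, and the placeholder names are assumptions for illustration, not the PubMed schema, and citation counts are left out.

```python
from collections import Counter

def most_active_authors(records, since_year, top_n=3):
    """Rank authors by the number of records published in or after since_year."""
    counts = Counter()
    for rec in records:
        if rec["year"] >= since_year:
            # set() guards against one paper listing the same name twice
            counts.update(set(rec["authors"]))
    return counts.most_common(top_n)

# Placeholder records, not real publications.
records = [
    {"year": 2012, "authors": ["Author A", "Author B"]},
    {"year": 2018, "authors": ["Author B", "Author C"]},
    {"year": 2019, "authors": ["Author B", "Author C", "Author D"]},
    {"year": 2019, "authors": ["Author C"]},
]
top = most_active_authors(records, since_year=2018)
```

With the cutoff at 2018, the 2012 paper is ignored, so `Author A` does not appear in the ranking at all.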

I am currently building a database from searched keywords and developing an abstract analyzer.

# Importing Packages and Functions

In [1]:
from Bio import Entrez
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import itertools

pd.set_option('display.max_colwidth', 1000)

In [2]:
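def search(query):
    """Return the PubMed esearch result for `query` (up to 1000 IDs, sorted by relevance)."""
    Entrez.email = 'sxxize@gmail.com'  # contact address sent along with each request
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='1000',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    handle.close()
    return results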

In [3]:
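def fetch_details(id_list):
    """Fetch the full PubMed records for a list of PubMed IDs."""
    ids = ','.join(id_list)  # efetch takes a comma-separated ID string
    Entrez.email = 'sxxize@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()
    return results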

In [4]:
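def Author_list(papers):
    """Return the ten most frequent authors in `papers`, each with a Google Scholar search link."""
    # One author list per article; articles without an AuthorList are skipped.
    paper_author_lst = [i['MedlineCitation']['Article']['AuthorList']
                        for i in papers['PubmedArticle']
                        if 'AuthorList' in i['MedlineCitation']['Article']]
    dfs = [pd.DataFrame(authors) for authors in paper_author_lst]
    names_dfs = pd.concat(dfs, axis=0, sort=True)
    # Count how many of the fetched papers each (ForeName, LastName) pair appears on.
    author_count_df = names_dfs[['ForeName', 'LastName']]\
                      .groupby(['ForeName', 'LastName']).size()\
                      .reset_index(name='count').sort_values(by='count', ascending=False)
    top = author_count_df.head(10).reset_index(drop=True)
    google_url = 'https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q='
    name = top['ForeName'] + ' ' + top['LastName']
    # Build a "<name> review" Google Scholar query for each author.
    links = google_url + name.str.replace(' ', '+') + '+review&oq='
    result = top.assign(**{'Google Scholar': links})

    return result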
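
The link construction in `Author_list` can also be done with the standard library, which takes care of escaping names that contain characters other than spaces. This is a sketch, not the notebook's code: `scholar_review_url` is a hypothetical helper, and it only reproduces the `hl`, `as_sdt`, and `q` query parameters seen in the notebook's output (the empty `oq` parameter is dropped).

```python
from urllib.parse import urlencode

SCHOLAR_BASE = "https://scholar.google.co.kr/scholar"

def scholar_review_url(fore_name, last_name):
    """Build a Google Scholar search URL for '<name> review'."""
    params = {
        "hl": "ko",
        "as_sdt": "0,5",  # urlencode turns the comma into %2C
        "q": f"{fore_name} {last_name} review",
    }
    return f"{SCHOLAR_BASE}?{urlencode(params)}"

url = scholar_review_url("San Ping", "Jiang")
```

`urlencode` replaces spaces with `+` by default, so multi-word forenames such as `San Ping` need no special handling.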

In [5]:
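def key_from_papers(papers):
    """Collect the author-supplied keywords of every paper into a one-column DataFrame."""
    fetch_key_word_papers = [i['MedlineCitation']['KeywordList'] for i in papers['PubmedArticle']]
    # Each KeywordList is a list of keyword lists, so flatten two levels.
    per_paper = itertools.chain.from_iterable(fetch_key_word_papers)
    lst_key_papers = list(itertools.chain.from_iterable(per_paper))
    key_from_paper = pd.DataFrame({'key word from papers': lst_key_papers})
    return key_from_paper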
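
To see why `key_from_papers` flattens twice: each paper carries a list of keyword groups, and each group is itself a list of keywords. The toy data below is an assumption that mimics this nesting with plain lists rather than Biopython's parsed objects.

```python
import itertools

# Three hypothetical papers: two keyword groups, one group, and no keywords.
keyword_lists = [
    [["fuel cell", "cathode"], ["electrolyte"]],
    [["biosensor"]],
    [],
]

# First pass: papers -> keyword groups. Second pass: groups -> keywords.
groups = itertools.chain.from_iterable(keyword_lists)
keywords = list(itertools.chain.from_iterable(groups))
```

A paper with an empty keyword list simply contributes nothing to the result.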

In [6]:
def title_key(papers):
    """Rank bigrams in the article titles by their summed TF-IDF weight."""
    # Normalize each title: lower-case, drop punctuation and inline markup,
    # and remove the search term itself (read from the global search_word).
    titles=[i['MedlineCitation']['Article']['ArticleTitle'].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','')\
            .replace(search_word.lower(),'') for i in papers['PubmedArticle']]
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(titles)
    # sorted(vocabulary_) lists the bigrams in the column order of X.
    tfidf_df=pd.DataFrame(X.toarray(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    return key_rank
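
The chain of `.replace()` calls can be condensed with regular expressions. The helper below is a sketch under assumptions: `clean_title` is a hypothetical name, it strips any simple inline tag rather than only `<sub>`, `<sup>`, and `<i>`, and it takes the search term as an argument instead of reading the global `search_word`.

```python
import re

TAG_RE = re.compile(r"</?[a-z]+>")   # inline markup such as <sub>, </i>
PUNCT_RE = re.compile(r"[,.:?]")     # the punctuation removed in title_key

def clean_title(title, search_term):
    """Lower-case a title, drop inline tags, punctuation and the search term."""
    text = title.lower()
    text = TAG_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    text = text.replace(search_term.lower(), "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

# Made-up title, for illustration only.
cleaned = clean_title("A Pt<sub>3</sub>Co cathode for the fuel cell: a review.", "fuel cell")
```

Passing the search term explicitly keeps the function usable before `search_word` has been defined in the notebook.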


In [7]:
def Abstract_key(papers):
    """Rank bigrams in the abstracts by their summed TF-IDF weight."""
    abstract_key=[]
    for i in papers['PubmedArticle']:
        try:
            # Only the first AbstractText element is used; cleaning matches title_key.
            abstract_key.append(i['MedlineCitation']['Article']['Abstract']['AbstractText'][0].lower()\
            .replace(',','').replace('.','').replace(':', '').replace('?','')\
            .replace('<sub>', '').replace('</sub>','').replace('<sup>','').replace('</sup>','')\
            .replace('<i>','').replace('</i>','').replace(search_word.lower(),''))
        except (KeyError, IndexError):
            # Skip papers that have no abstract.
            continue
    tfidf=TfidfVectorizer(ngram_range=(2,2),stop_words='english')
    X=tfidf.fit_transform(abstract_key)
    tfidf_df=pd.DataFrame(X.toarray(), columns=sorted(tfidf.vocabulary_))
    key_rank=tfidf_df.sum().sort_values(ascending=False)
    return key_rank
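
`Abstract_key` reads only the first element of `AbstractText`, so when an abstract is split into several labelled sections the later ones are ignored. One possible way to include them, sketched here as an assumption rather than the notebook's behavior, is to join all sections before cleaning. `join_abstract_sections` is a hypothetical helper and the section texts are placeholders.

```python
def join_abstract_sections(sections):
    """Concatenate all abstract sections into one string."""
    return " ".join(str(s).strip() for s in sections)

# Stand-in for the AbstractText list of a structured abstract.
sections = ["Background text.", "Methods text.", "Results text."]
abstract = join_abstract_sections(sections)
```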

In [8]:
search_word='fuel cell'

In [9]:
results = search(search_word)
id_list = results['IdList']
papers = fetch_details(id_list)

# Author List

In [10]:
result=Author_list(papers)
result

| | ForeName | LastName | count | Google Scholar |
|---|---|---|---|---|
| 0 | Carlo | Santoro | 19 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Carlo+Santoro+review&oq= |
| 1 | Plamen | Atanassov | 16 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Plamen+Atanassov+review&oq= |
| 2 | Alexey | Serov | 16 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Alexey+Serov+review&oq= |
| 3 | Hong | Liu | 13 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Hong+Liu+review&oq= |
| 4 | Ioannis | Ieropoulos | 13 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Ioannis+Ieropoulos+review&oq= |
| 5 | John | Greenman | 12 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=John+Greenman+review&oq= |
| 6 | Wei | Zhou | 9 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Wei+Zhou+review&oq= |
| 7 | San Ping | Jiang | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=San+Ping+Jiang+review&oq= |
| 8 | Kateryna | Artyushkova | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Kateryna+Artyushkova+review&oq= |
| 9 | Zongping | Shao | 8 | https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Zongping+Shao+review&oq= |


In [11]:
result['Google Scholar']

0           https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Carlo+Santoro+review&oq=
1        https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Plamen+Atanassov+review&oq=
2            https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Alexey+Serov+review&oq=
3                https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Hong+Liu+review&oq=
4      https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Ioannis+Ieropoulos+review&oq=
5           https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=John+Greenman+review&oq=
6                https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Wei+Zhou+review&oq=
7          https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=San+Ping+Jiang+review&oq=
8    https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Kateryna+Artyushkova+review&oq=
9           https://scholar.google.co.kr/scholar?hl=ko&as_sdt=0%2C5&q=Zongping+Shao+review&oq=
Name: Google Scholar, dtype: object

# Key Word List

In [12]:
key_paper_lst=key_from_papers(papers)

In [13]:
key_paper_lst['key word from papers'].value_counts().head(20)

Microbial fuel cell          218
fuel cells                    81
Microbial fuel cells          59
microbial fuel cell           48
microbial fuel cells          28
oxygen reduction reaction     28
Wastewater treatment          28
Electricity generation        25
Microbial community           25
solid oxide fuel cells        22
electrochemistry              20
electrocatalysis              19
Power generation              19
Microbial fuel cell (MFC)     18
fuel cell                     18
solid oxide fuel cell         17
Biosensor                     15
Bioelectricity                15
Power density                 14
Photocatalytic fuel cell      13
Name: key word from papers, dtype: int64
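
These counts treat `Microbial fuel cell` and `microbial fuel cell` as different keywords. One possible remedy, sketched here as an assumption rather than something the notebook does, is to case-fold keywords before counting. The example reuses four of the counts from this output; `merge_case_variants` is a hypothetical helper.

```python
from collections import Counter

def merge_case_variants(counts):
    """Add up counts of keywords that differ only in letter case."""
    merged = Counter()
    for keyword, n in counts.items():
        merged[keyword.casefold()] += n
    return merged

# Counts copied from the value_counts() output.
observed = {
    "Microbial fuel cell": 218,
    "microbial fuel cell": 48,
    "Microbial fuel cells": 59,
    "microbial fuel cells": 28,
}
merged = merge_case_variants(observed)
```

In the notebook the same effect could be had by lower-casing the column with `.str.lower()` before calling `value_counts()`. Singular and plural forms remain separate under this approach.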

# Key Words from Title

In [14]:
key_title=title_key(papers)
key_title[:20]

solid oxide                  14.303989
exchange membrane            10.650541
electricity generation       10.439934
proton exchange               9.669568
polymer electrolyte           8.712775
oxygen reduction              7.215937
microbial community           6.839799
performance microbial         6.677906
single chamber                6.477852
direct methanol               6.390811
wastewater treatment          6.201207
high performance              6.169720
air cathode                   6.129212
chamber microbial             6.058198
generation microbial          5.749626
electrolyte membrane          5.145472
temperature solid             5.031403
bioelectricity generation     4.863232
cathode microbial             4.856361
electricity production        4.513672
dtype: float64

# Key Words from Abstract

In [15]:
key_abstract=Abstract_key(papers)

In [16]:
key_abstract[:20]

power density             12.018722
maximum power              7.532582
microbial mfcs             6.470135
microbial mfc              6.401803
cr vi                      6.128704
electron transfer          5.740389
electricity generation     5.507679
mw cm                      5.399763
oxygen reduction           5.336550
solid oxide                5.297186
exchange membrane          5.198596
proton exchange            4.805076
wastewater treatment       4.655052
power output               4.615069
power generation           4.507802
mw m2                      4.274770
current density            4.273477
reduction reaction         4.247751
microbial community        4.078056
polymer electrolyte        4.056608
dtype: float64

LDA
word2vec
pyldavis

textrank
lexrank