** DOCUMENTATION **


Python library repo: https://github.com/gijswobben/pymed


Code example in Python: https://stackoverflow.com/questions/57053378/query-pubmed-with-python-how-to-get-all-article-details-from-query-to-pandas-d (this is very useful!)


Code example in R: https://stackoverflow.com/questions/64103323/extracting-affiliation-information-from-pubmed-search-string-in-r


Other package in R: https://www.data-pulse.com/projects/Rlibs/vignettes/easyPubMed_03_retmax_example.html


Other package in R: https://cran.r-project.org/web/packages/easyPubMed/vignettes/getting_started_with_easyPubMed.html


Useful code for handling 'None' values in a list: https://stackoverflow.com/questions/9327158/get-first-element-of-list-if-list-is-not-none-python




The aim of this notebook is to:
1. define a search query.
2. apply the search query to PubMed.
3. extract the results from the search query.
4. extract information from each result (where each row is a PubMed article). 
5. save the extracted information in a CSV.
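
Steps 2 to 4 can be sketched as one small function. This is a minimal sketch, not the notebook's exact code: `fetch_articles` and `FIELDS` are hypothetical names, and the function only assumes that the client offers `query(query, max_results=...)` and that each result offers `toDict()`, as the pymed objects used below do.

```python
# Hypothetical helper: the names here are illustrative, not part of pymed.
FIELDS = ["pubmed_id", "title", "abstract", "doi"]


def fetch_articles(client, query, max_results=100):
    """Run `query` on `client` and return one plain dict per article."""
    rows = []
    for article in client.query(query, max_results=max_results):
        record = article.toDict()
        # Missing or empty fields become None so every row has the same keys.
        rows.append({field: (record.get(field) or None) for field in FIELDS})
    return rows
```

With pymed installed, `fetch_articles(PubMed(tool='PubMedSearch'), query)` should return rows that `pd.DataFrame(rows)` can turn into a table with one row per article.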

In [469]:
### LOAD LIBRARIES


from pymed import PubMed
import pandas as pd

In [470]:
### GENERATE 'pubmed' OBJECT TO RUN THE SEARCH LATER.


pubmed = PubMed(tool = 'PubMedSearch')

In [471]:
### DEFINE THE SEARCH QUERY. UNCOMMENT EXACTLY ONE OF THE 'query' LINES BELOW BEFORE RUNNING THE NEXT CELL.

#query = '(("diabetes" AND ("USA" OR "United States" OR "United States of America" OR "US")))'
#query = '(("2022/01/01"[Date - Create] : "3000"[Date - Create]))'
#query = '(2017/01/01[Date - Entry] : 2021/12/31[Date - Entry])'



#query = '(("diabetes" AND "Brazil"))'
#query = '(("diabetes" AND "Mexico"))'
#query = '(("diabetes" AND "Chile"))'
#query = '(("diabetes" AND "Argentina"))'
#query = '(("diabetes" AND "Peru"))'
#query = '(("diabetes" AND "Colombia"))'
#query = '(("diabetes" AND "Venezuela"))'
#query = '(("diabetes" AND "Ecuador"))'
#query = '(("diabetes" AND "Guatemala"))'
#query = '(("diabetes" AND "Bolivia"))'
#query = '(("diabetes" AND "Antigua and Barbuda"))'
#query = '(("diabetes" AND "Aruba"))'
#query = '(("diabetes" AND "The Bahamas"))'
#query = '(("diabetes" AND "Barbados"))'
#query = '(("diabetes" AND "Belize"))'
#query = '(("diabetes" AND "British Virgin Islands"))'
#query = '(("diabetes" AND "Cayman Islands"))'
#query = '(("diabetes" AND "Costa Rica"))'
#query = '(("diabetes" AND "Cuba"))'
#query = '(("diabetes" AND "Curacao"))'
#query = '(("diabetes" AND "Dominica"))'
#query = '(("diabetes" AND "Dominican Republic"))'
#query = '(("diabetes" AND "El Salvador"))'
#query = '(("diabetes" AND "Grenada"))'
#query = '(("diabetes" AND "Guyana"))'
#query = '(("diabetes" AND "Haiti"))'
#query = '(("diabetes" AND "Honduras"))'
#query = '(("diabetes" AND "Jamaica"))'
#query = '(("diabetes" AND "Nicaragua"))'
#query = '(("diabetes" AND "Panama"))'
#query = '(("diabetes" AND "Paraguay"))'
#query = '(("diabetes" AND "Puerto Rico"))'
#query = '(("diabetes" AND "Sint Maarten"))'
#query = '(("diabetes" AND ("St. Kitts and Nevis" OR "Saint Kitts and Nevis")))'
#query = '(("diabetes" AND ("St. Lucia" OR "Saint Lucia")))'
#query = '(("diabetes" AND ("St. Martin" OR "Saint Martin")))'
#query = '(("diabetes" AND ("St. Vincent and the Grenadines" OR "Saint Vincent and the Grenadines")))'
#query = '(("diabetes" AND "Suriname"))'
#query = '(("diabetes" AND "Trinidad and Tobago"))'
#query = '(("diabetes" AND "Turks and Caicos Islands"))'
#query = '(("diabetes" AND "Uruguay"))'
#query = '(("diabetes" AND "US Virgin Islands"))'



#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Brazil")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Mexico")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Chile")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Argentina")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Peru")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Colombia")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Venezuela")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Ecuador")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Guatemala")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Bolivia")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Antigua and Barbuda")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Aruba")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "The Bahamas")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Barbados")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Belize")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "British Virgin Islands")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Cayman Islands")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Costa Rica")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Cuba")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Curacao")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Dominica")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Dominican Republic")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "El Salvador")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Grenada")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Guyana")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Haiti")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Honduras")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Jamaica")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Nicaragua")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Panama")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Paraguay")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Puerto Rico")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Sint Maarten")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND ("St. Kitts and Nevis" OR "Saint Kitts and Nevis"))'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND ("St. Lucia" OR "Saint Lucia"))'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND ("St. Martin" OR "Saint Martin"))'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND ("St. Vincent and the Grenadines" OR "Saint Vincent and the Grenadines"))'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Suriname")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Trinidad and Tobago")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Turks and Caicos Islands")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "Uruguay")'
#query = '(("2018/01/01"[Date - Create] : "3000"[Date - Create]) AND "US Virgin Islands")'
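
Rather than keeping one commented line per country, the two query patterns above can be generated from a country name. This is a minimal sketch, assuming the hypothetical helpers `topic_query` and `date_query` (they are not part of pymed); the strings they build follow the same patterns as the lines listed above.

```python
def _country_clause(names):
    """Quote one or more spellings of a country and join them with OR."""
    quoted = [f'"{name}"' for name in names]
    return quoted[0] if len(quoted) == 1 else "(" + " OR ".join(quoted) + ")"


def topic_query(names, topic="diabetes"):
    """Topic pattern, e.g. (("diabetes" AND "Brazil"))."""
    return f'(("{topic}" AND {_country_clause(names)}))'


def date_query(names, start="2018/01/01", end="3000"):
    """Date pattern: records created between `start` and `end`."""
    date_range = f'("{start}"[Date - Create] : "{end}"[Date - Create])'
    return f"({date_range} AND {_country_clause(names)})"


# Countries with two spellings are passed as a list of alternatives.
example_query = date_query(["St. Lucia", "Saint Lucia"])
print(example_query)
```

Assigning the result to `query` instead of `example_query` would make it available to the search cell that follows.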

In [472]:
### RUN THE SEARCH AND SPECIFY THE MAXIMUM NUMBER OF RESULTS YOU WANT.


results = pubmed.query(query, max_results = 1000000)

In [473]:
### CREATE EMPTY LISTS WHERE YOU WILL:
### I- SAVE THE RESULTS FROM THE QUERY;
### II- EXTRACT THE INFORMATION NEEDED FROM EACH ARTICLE IN PUBMED.


articleList = []
articleInfo = []

In [474]:
### EXTRACT EACH RESULT FROM THE SEARCH QUERY AND SAVE IT IN A LIST 'articleList'.


for article in results:
    articleDict = article.toDict()
    articleList.append(articleDict)

In [475]:
### DOUBLE-CHECK THE NUMBER OF RESULTS FROM THE SEARCH. IT SHOULD BE AT MOST THE MAXIMUM NUMBER SPECIFIED (FEWER IF THE QUERY MATCHED FEWER ARTICLES).


len(articleList)

80

In [476]:
### IN SOME OF THE DICTIONARIES IN THE LIST, THERE MAY BE KEYS WITH EMPTY VALUES.
### LET'S REPLACE THOSE EMPTY SLOTS WITH 'None', SO THAT THEY GET EXTRACTED LATER.


for i, article in enumerate(articleList):
    articleList[i] = {k: (v if v else None) for k, v in article.items()}
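
To see what this replacement does, here is a minimal sketch on a made-up record (the values are invented for illustration and are not real PubMed data). Note that every falsy value becomes None, which includes an empty string, an empty list and the number 0. `blank_to_none` is a hypothetical helper name.

```python
def blank_to_none(record):
    """Return a copy of `record` where every falsy value is None."""
    return {key: (value if value else None) for key, value in record.items()}


# Invented sample record with an empty list and an empty string.
sample = {"title": "Example title", "keywords": [], "abstract": "", "doi": None}
cleaned = blank_to_none(sample)
print(cleaned)
```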

In [477]:
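### CREATE A LIST WITH ALL THE PUBMED IDS. THERE SHOULD BE AS MANY AS RESULTS FROM THE SEARCH QUERY.
### 'pubmed_id' MAY HOLD SEVERAL IDS SEPARATED BY NEWLINES; partition('\n')[0] KEEPS ONLY THE FIRST ONE.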


PubMedID_list = []
for article in articleList:
    PubMedID_list.append(article['pubmed_id'].partition('\n')[0])


print(len(PubMedID_list))

80
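
The `partition('\n')[0]` call keeps only the text before the first newline. Below is a minimal sketch with made-up ID strings, assuming (as the code above implies) that `pubmed_id` may hold several newline-separated IDs. `first_pubmed_id` is a hypothetical helper that also tolerates a missing value, which the original one-liner does not.

```python
def first_pubmed_id(raw_id):
    """Return the text before the first newline, or None if there is no ID."""
    if not raw_id:
        return None
    return raw_id.partition("\n")[0].strip()


# Invented IDs, used only to show the behaviour.
print(first_pubmed_id("11111111\n22222222\n33333333"))
print(first_pubmed_id("11111111"))
print(first_pubmed_id(None))
```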


In [478]:
### IN 'articleList' WE HAVE SAVED EACH RESULT FROM THE SEARCH QUERY AS A DICTIONARY. FROM EACH OF THESE DICTIONARIES, EXTRACT THE INFORMATION WE WANT.
### FOR 'keywords' AND 'journal', '.get(...) or [None]' SUPPLIES A PLACEHOLDER WHEN THE VALUE IS MISSING OR 'None'.


for article in articleList:
    pubmedId = article['pubmed_id'].partition('\n')[0]
    #print(pubmedId)
    articleInfo.append({u'pubmed_id':pubmedId,
                        u'title':article['title'],
                        u'keywords':(article.get('keywords') or [None]),
                        u'journal':(article.get('journal') or [None]),
                        u'abstract':article['abstract'],
                        u'copyrights':article['copyrights'],
                        u'doi':article['doi'],
                        u'publication_date':article['publication_date']
                        #u'authors':article['authors']
                        })


articlesPD = pd.DataFrame.from_dict(articleInfo)
print(articlesPD.head(5))

  pubmed_id                                              title  \
0  35858059  Persistence of a sessile benthic organism prom...   
1  35833604  Disruptions in Oncology Care Confronted by Pat...   
2  35588174  A Cross-Sectional Household Survey in the US V...   
3  35516524  A biological condition gradient for Caribbean ...   
4  35501329  Portfolio effects and functional redundancy co...   

                                            keywords  \
0  [Caribbean, Millepora, coral, phenotypic plast...   
1  [Oncology care, Puerto Rico, US Virgin Islands...   
2  [Management practices, St. Croix, St. John, St...   
3  [Biocriteria, Biological Condition Gradient (B...   
4                                             [None]   

                                             journal  \
0                   Proceedings. Biological sciences   
1  Cancer control : journal of the Moffitt Cancer...   
2  Journal of the American Mosquito Control Assoc...   
3                              Ecological 
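
The `[None]` shown in the keywords column comes from the `.get(...) or [None]` pattern used in the cell above. A minimal sketch of that pattern on an invented record (not real PubMed data):

```python
record = {"title": "Example title", "keywords": None}

# .get() returns None for an absent key instead of raising KeyError,
# and `or [None]` swaps any falsy result for the placeholder list.
keywords = record.get("keywords") or [None]
journal = record.get("journal") or [None]
title = record.get("title") or [None]

print(keywords, journal, title)
```

Fields read with plain indexing, such as `article['title']`, have no such fallback and would raise `KeyError` if the key were absent.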

In [479]:
### IN THE DATAFRAME WE JUST CREATED, KEEP ONLY THE ROWS WHOSE PUBMED ID APPEARS IN 'PubMedID_list'.


articlesPD = articlesPD[articlesPD['pubmed_id'].isin(PubMedID_list)]

In [480]:
### THE NUMBER OF ROWS IN THE DATAFRAME WITH THE INFORMATION FROM THE RESULTS SHOULD BE EQUAL TO THE NUMBER OF RESULTS (I.E., ONE ROW PER PAPER).


print(articlesPD.shape)

(80, 8)


In [481]:
### SAVE THE EXTRACTED INFORMATION, AS A CSV, ON YOUR LOCAL PC. UNCOMMENT THE LINE THAT MATCHES THE QUERY YOU RAN.


#articlesPD.to_csv('/Users/manwest/Downloads/Brazil.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Mexico.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Chile.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Argentina.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Peru.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Colombia.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Venezuela.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Ecuador.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Guatemala.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Bolivia.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Antigua and Barbuda.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Aruba.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/The Bahamas.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Barbados.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Belize.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/British Virgin Islands.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Cayman Islands.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Costa Rica.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Cuba.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Curacao.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Dominica.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Dominican Republic.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/El Salvador.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Grenada.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Guyana.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Haiti.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Honduras.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Jamaica.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Nicaragua.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Panama.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Paraguay.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Puerto Rico.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Sint Maarten.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/St Kitts and Nevis.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/St Lucia.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/St Martin.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/St Vincent and the Grenadines.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Suriname.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Trinidad and Tobago.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Turks and Caicos Islands.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/Uruguay.csv', index = False)
#articlesPD.to_csv('/Users/manwest/Downloads/US Virgin Islands.csv', index = False)
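
The output path can also be derived from the country name instead of being hard-coded on each line. This is a minimal sketch, assuming a hypothetical helper `csv_path` and an output folder of your choice; dropping the periods mirrors file names such as 'St Lucia.csv' above.

```python
from pathlib import Path


def csv_path(country, out_dir="."):
    """Build '<out_dir>/<country>.csv', dropping periods from the name."""
    return Path(out_dir) / f"{country.replace('.', '')}.csv"


print(csv_path("St. Kitts and Nevis", "Downloads"))
```

Calling `articlesPD.to_csv(csv_path('Brazil', '/Users/manwest/Downloads'), index=False)` would then stand in for the matching line above.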
