A set of scripts for retrieving bibliometric & scientometric information for batches of publications.


Scientometric API Handler

A tool for batch downloading and inspecting data from a number of scientometric sources.

PMIDs (described below) are used throughout because the tool was originally built for biomedical literature. The code could be adapted to work with DOIs, which may be more suitable for Altmetric records, as these sometimes have a DOI but no PMID.

A PMID (PubMed identifier or PubMed unique identifier) is a unique number assigned to each PubMed record. A PMID is not the same as a PMCID which is the identifier for all works published in the free-to-access PubMed Central. -- PubMed on Wikipedia

The tool calls the chosen API once for each PMID, respecting rate limits, and returns the data from that API.
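As a rough illustration of the per-PMID loop described above, here is a minimal sketch. It is not the code in this repository: `epmc_url`, `fetch_all`, the `fetcher` parameter and the 0.5 second delay are all assumptions made for the example, and the URL pattern is copied from the sample EPMC request in this README.

```ruby
require 'net/http'
require 'uri'

# Base URL copied from the sample EPMC request in this README.
EPMC_BASE = 'http://www.ebi.ac.uk/europepmc/webservices/rest/search/'

# Build the EPMC "core" lookup URL for a single PMID.
def epmc_url(pmid)
  "#{EPMC_BASE}query=EXT_ID:#{pmid}&resulttype=core"
end

# Call the fetcher once per PMID, pausing between requests.
# `delay` (seconds) is an assumed value, not a documented EPMC limit.
# `fetcher` is injectable so the loop can be exercised without network access.
def fetch_all(pmids, delay: 0.5, fetcher: ->(url) { Net::HTTP.get(URI(url)) })
  pmids.each_with_index.map do |pmid, index|
    sleep(delay) if index.positive? # no pause before the first request
    [pmid, fetcher.call(epmc_url(pmid))]
  end.to_h
end
```

Calling `fetch_all(%w[26138067])` with the default fetcher performs a real HTTP request and returns a hash of PMID to raw response body.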

This is one of my first projects in Ruby, so please forgive some of the crimes against good code.


The easiest way to get started: duplicate and alter bin/sample_task.rb. Example commands:

# Specify API
api = 'epmc'
# List of PMIDs to query
input_csv = './input/WTpmids2015_epmc.csv'
output_csv = ''
# Split the big input CSV into blocks of 100 and keep hold of the directory
split_csv_directory = split_csv(input_csv)
# Now churn through each and every CSV in that directory
process_split_csvs(split_csv_directory, :epmc)
# All done? Put it all together
# merge_csv(output_csv, directory_of_chunked_results)
merge_csv('./output/WT2015_all_papers.csv', '../input/WTpmids2015_epmc_SPLIT')
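For readers curious about what the splitting step involves, the following is a hypothetical stand-in, not the repository's `split_csv`. The name `split_into_chunks`, the chunk file names and the `_SPLIT` directory suffix (inferred from the path in the sample above) are assumptions, and the input is assumed to have no header row.

```ruby
require 'csv'
require 'fileutils'

# Hypothetical stand-in for split_csv: writes blocks of `size` rows into
# "<input name>_SPLIT/" next to the input file and returns that directory.
# Assumes the input CSV has no header row.
def split_into_chunks(input_csv, size: 100)
  dir = input_csv.sub(/\.csv\z/, '') + '_SPLIT'
  FileUtils.mkdir_p(dir)
  CSV.read(input_csv).each_slice(size).with_index(1) do |rows, n|
    # Zero-padded names keep the chunks in order when listed.
    CSV.open(File.join(dir, format('chunk_%04d.csv', n)), 'w') do |out|
      rows.each { |row| out << row }
    end
  end
  dir
end
```

Working in small blocks means a failed request only costs one chunk of results rather than the whole run.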



Sample XML: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=EXT_ID:26138067&resulttype=core

Retrieved data:

  • pmid
  • doi
  • title
  • journal
  • cited_by_count
  • pubtypes
  • author_ids: usually a collection of ORCID IDs
  • mesh_headings
  • abstract
  • dateofcreation: one of many date fields in EPMC
  • authorstring: all of the authors, joined into one string
  • firstauthor: first author in list
  • firstauthor_affiliation: institutional affiliation of first author
  • lastauthor: last author in list
  • lastauthor_affiliation: institutional affiliation of last author
  • url
  • affiliations: list of all affiliations. Duplicate values are deleted to save space; we can't necessarily tell which author an affiliation corresponds to, even if the two lists are side by side
  • number_of_grants
  • all_grants
  • WT_grants: all grants with "Wellcome Trust"
  • WT_six_digit_grants: any five- or six-digit values we can find in those WT grants
  • hasTextMinedTerms
  • hasLabsLinks
  • labsLinks
  • hasDbCrossReferences
  • dbCrossReferenceList
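The WT_grants and WT_six_digit_grants fields above could be derived along these lines. This is only a sketch: the method names are hypothetical, and the shape of the grant strings (funder name followed by a grant ID) is an assumption rather than the exact format EPMC returns.

```ruby
# Keep only grant entries that mention the Wellcome Trust.
def wt_grants(all_grants)
  all_grants.select { |grant| grant.match?(/wellcome trust/i) }
end

# Pull out unique standalone runs of five or six digits.
# The lookarounds stop a longer number from being cut into pieces.
def wt_grant_codes(grants)
  grants.flat_map { |grant| grant.scan(/(?<!\d)\d{5,6}(?!\d)/) }.uniq
end

# Example input; the string format here is assumed for illustration.
grants = [
  'Wellcome Trust: 098765/Z/12/Z',
  'Medical Research Council: MR/K000000/1',
  'Wellcome Trust: 098765/Z/12/Z'
]
codes = wt_grant_codes(wt_grants(grants))
```

The same `uniq` call is one simple way to drop duplicate affiliations.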


  • Aim for a command line interface which takes a .csv of PMIDs as input alongside flags for each API to call
  • Look into citations - can we pull in the complete list of citations?
  • Command line interface (in bin, read Pickaxe first)
  • Catch errors with URL handling
  • Raise errors if PMIDs aren't valid
  • Create tests for new ORCID API integration
  • Refactor individual API calls to own Ruby files
  • Process grant information past the first 10
    • Pull out WT-related funding and clean, unique six-digit grant codes
  • Change EPMC to use JSON
  • Abstract method for checking if fields are part of the EPMC metadata (lots of repetition in api_caller.rb)
  • Add EPMC search - ability to download all the results for a query (beyond the first 2000 available via web interface)
  • Add time to completion for all PMIDs currently being processed
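One possible shape for the "raise errors if PMIDs aren't valid" item is sketched below. Nothing here exists in the repository yet: `InvalidPmidError`, `PMID_PATTERN` and `validate_pmid!` are hypothetical names, and the rule (a positive integer of one to eight digits with no leading zero) is an assumption about what counts as a plausible PMID.

```ruby
# Hypothetical error class for rejected identifiers.
class InvalidPmidError < ArgumentError; end

# Assumption: a PMID is a positive integer of 1 to 8 digits, no leading zero.
PMID_PATTERN = /\A[1-9]\d{0,7}\z/

# Return the cleaned PMID as a string, or raise if it does not look like one.
def validate_pmid!(value)
  pmid = value.to_s.strip
  unless pmid.match?(PMID_PATTERN)
    raise InvalidPmidError, "not a valid PMID: #{value.inspect}"
  end
  pmid
end
```

Checking identifiers before any request is made avoids spending rate-limited API calls on rows that can never match, such as PMCIDs pasted into the PMID column.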

Sample search strings

EPMC CORE SEARCH: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=QUERY&resulttype=core
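A small sketch of filling in QUERY in the core search string above. The method name `epmc_core_search_url` is hypothetical; percent-encoding the query with `ERB::Util.url_encode` is an assumption about what the service accepts, chosen because it encodes spaces as `%20` in the same way as the example search URL in this README.

```ruby
require 'erb'

# Base URL copied from the core search string in this README.
EPMC_SEARCH = 'http://www.ebi.ac.uk/europepmc/webservices/rest/search/'

# Percent-encode the query and slot it into the core search URL.
def epmc_core_search_url(query)
  "#{EPMC_SEARCH}query=#{ERB::Util.url_encode(query)}&resulttype=core"
end

url = epmc_core_search_url('PUB_TYPE:"practice guideline" NICE')
```

Here `url` carries the same encoded query as the NICE publications link in this README, wrapped in the web service URL instead of the website search URL.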


Grant lookup

NICE publications: http://europepmc.org/search?query=PUB_TYPE%3A%22practice%20guideline%22%20NICE

Further resources / similar projects