Scientometric API Handler
A tool for batch downloading and inspecting metadata from a number of scientometric sources:
- EuropePMC REST Web Service - access to all publications indexed by EuropePMC
- EuropePMC Grist API - grant data
- Altmetric - social media impact of articles with PMIDs or DOIs
PMIDs (described below) are used throughout because the tool was originally written for biomedical literature. The code could be adapted to work with DOIs, which may be more suitable for Altmetric, as its records sometimes have a DOI but no PMID.
A PMID (PubMed identifier or PubMed unique identifier) is a unique number assigned to each PubMed record. A PMID is not the same as a PMCID which is the identifier for all works published in the free-to-access PubMed Central. -- PubMed on Wikipedia
The tool calls the chosen API once for each PMID, respecting rate limits, and returns the data from that API.
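As a rough illustration of what "respecting rate limits" can look like, here is a minimal sketch. The helper name `each_pmid_throttled` and the `max_per_second` parameter are assumptions made for this example and are not part of the tool; the block stands in for whatever HTTP request the real code makes.

```ruby
# Hypothetical helper (not part of this tool): runs the block once per PMID
# and pauses so that at most `max_per_second` calls are started per second.
def each_pmid_throttled(pmids, max_per_second: 5)
  min_interval = 1.0 / max_per_second
  results = {}
  pmids.each do |pmid|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    results[pmid] = yield(pmid)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # Sleep off whatever is left of this call's time slot.
    sleep(min_interval - elapsed) if elapsed < min_interval
  end
  results
end

# Usage: the block is a stand-in for a real API request.
records = each_pmid_throttled(%w[111 222 333], max_per_second: 100) do |pmid|
  { pmid: pmid }
end
```

The actual limit depends on the API being called, so check each provider's documentation before choosing a value.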
This is one of my first projects in Ruby, so please forgive some of the crimes against good code.
The easiest way to get started is to duplicate and alter the script below:
    # Specify API
    api = 'epmc'

    # List of PMIDs to query
    input_csv = './input/WTpmids2015_epmc.csv'
    output_csv = ''

    # Split the big input CSV into blocks of 100, keep hold of the directory
    split_csv_directory = split_csv(input_csv)

    # Now churn through each and every CSV in that directory
    process_split_csvs(split_csv_directory, :epmc)

    # All done? Put it all together
    # merge_csv(output_csv, directory_of_chunked_results)
    merge_csv('./output/WT2015_all_papers.csv', '../input/WTpmids2015_epmc_SPLIT')
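For anyone curious about the splitting step, the sketch below shows one way a CSV could be broken into blocks of 100 rows. It is not the tool's own `split_csv`: the name `split_csv_sketch`, the `chunk_size` parameter, the chunk file names and the `_SPLIT` directory suffix are all assumptions for illustration, and it assumes the input has no header row.

```ruby
require 'csv'
require 'fileutils'

# Hypothetical stand-in for split_csv: writes the rows of `input_csv` into
# numbered files of at most `chunk_size` rows and returns the directory.
# Assumes the input CSV has no header row.
def split_csv_sketch(input_csv, chunk_size: 100)
  directory = input_csv.sub(/\.csv\z/, '') + '_SPLIT'
  FileUtils.mkdir_p(directory)
  CSV.read(input_csv).each_slice(chunk_size).with_index(1) do |rows, index|
    chunk_path = File.join(directory, format('chunk_%04d.csv', index))
    CSV.open(chunk_path, 'w') { |csv| rows.each { |row| csv << row } }
  end
  directory
end

# Usage (path is illustrative):
#   split_csv_sketch('./input/my_pmids.csv')
```

Working in small blocks means that a failure partway through a long run only costs one block, since earlier results are already on disk.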
- author_ids: usually a collection of ORCID IDs
- dateofcreation: one of many date fields in EPMC
- authorstring: all of the authors, joined into one string
- firstauthor: first author in list
- firstauthor_affiliation: institutional affiliation of first author
- lastauthor: last author in list
- lastauthor_affiliation: institutional affiliation of last author
- affiliations: list of all affiliations, with duplicate values removed. This saves space, and even with the two lists side by side we would not necessarily know which author an affiliation corresponds to
- WT_grants: all grants with "Wellcome Trust"
- WT_six_digit_grants: any five- or six-digit values we can find in those WT grants
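The last two fields can be pictured with a short sketch. The helper names and the sample inputs in the usage comments are made up for illustration, and the tool's own code may clean values differently.

```ruby
# Hypothetical helper: pulls standalone five- or six-digit numbers out of
# grant ID strings and returns each distinct value once.
def extract_grant_numbers(grant_ids)
  # The lookarounds stop a longer run of digits from matching in part.
  grant_ids.flat_map { |id| id.scan(/(?<!\d)\d{5,6}(?!\d)/) }.uniq
end

# Hypothetical helper: trims whitespace, drops blanks and removes duplicates.
def unique_affiliations(affiliations)
  affiliations.map(&:strip).reject(&:empty?).uniq
end

# Usage (illustrative values only):
#   extract_grant_numbers(['WT098051', '098051/Z/12/Z', '12345', '1234567'])
#   unique_affiliations(['Dept A, Univ X', ' Dept A, Univ X ', '', 'Dept B'])
```

Values are kept as strings so that leading zeros in a grant number are not lost.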
- Aim for a command line interface which takes a .csv of PMIDs as input alongside flags for each API to call
- Look into citations - can we pull in the complete list of citations?
- Command line interface (in bin, read Pickaxe first)
- Catch errors with URL handling
- Raise errors if PMIDs aren't valid
- Create tests for new ORCID API integration
- Refactor individual API calls to own Ruby files
- Process grant information past the first 10
- Pull out WT related funding and cleaned, unique six digit grant codes
- Change EPMC to use JSON
- Abstract method for checking if fields are part of the EPMC metadata (lots of repetition in
- Add EPMC search - ability to download all the results for a query (beyond the first 2000 available via web interface)
- Add time to completion for all PMIDs currently being processed
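As a starting point for the "Raise errors if PMIDs aren't valid" item above, here is a minimal sketch. It only checks that the value looks like a positive integer, which is an assumption about what "valid" should mean here; it does not check that the record exists. `InvalidPmidError` and `validate_pmid!` are hypothetical names.

```ruby
# Hypothetical error class for values that do not look like a PMID.
class InvalidPmidError < ArgumentError; end

# Hypothetical helper: returns the PMID as a trimmed string, or raises
# if the value is not a positive integer without leading zeros.
def validate_pmid!(value)
  pmid = value.to_s.strip
  unless pmid.match?(/\A[1-9]\d*\z/)
    raise InvalidPmidError, "Not a valid PMID: #{value.inspect}"
  end
  pmid
end

# Usage:
#   validate_pmid!(' 12345 ')    # returns "12345"
#   validate_pmid!('PMC12345')   # raises InvalidPmidError
```

Running a check like this over the input CSV before any API calls would surface bad rows early rather than partway through a long batch.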