Web scraper to pull metadata (specifically corresponding author) from pages where unavailable through Pubmed's eUtils XML file. Work in progress 2013 / abandoned in 2014 for more useful projects.



Original goal: "Browser automation using Watir (Web Application Testing in Ruby) to pull metadata from pages where unavailable through XML."

The corresponding author (CA) of a paper isn't available through Pubmed's search results, so I'm writing a few Ruby scripts in an attempt to obtain it on demand.

For anyone interested in how this works, the CA is indicated on an article's abstract page by:

  • [most simply] an <xref> tag enclosing an asterisk (*)
  • an <xref> tag with ref-type="corresp"
  • a <contrib> tag with corresp="yes"
  • a <corresp> tag enclosing an email address (sometimes the only thing indicating correspondence)
  • an <email> tag
  • etc.

Due to this inconsistency, even in Pubmed Central's supposedly consistently formatted eUtils XML files, this isn't the simplest situation to code for (although at least in these files the CA can always be determined). The Nokogiri Ruby library, with its XPath support, allows dynamic selection of tags and their attributes while navigating up and down the XML DOM tree; the only requirement is a URL to take this information from.
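
As a rough illustration of the tag checks listed above, here is a minimal sketch. It uses REXML (bundled with Ruby) rather than Nokogiri, purely so that it runs without extra gems. The XML sample and the method names `corresponding_surnames` and `correspondence_emails` are made up for this example and are not taken from pmcscraper.rb.

```ruby
require 'rexml/document'

# Made-up sample in the style of a PMC record; not taken from a real article.
SAMPLE_XML = <<~XML
  <article>
    <contrib-group>
      <contrib contrib-type="author" corresp="yes">
        <name><surname>Smith</surname><given-names>Jane</given-names></name>
      </contrib>
      <contrib contrib-type="author">
        <name><surname>Jones</surname><given-names>Alan</given-names></name>
        <xref ref-type="corresp" rid="cor1">*</xref>
      </contrib>
      <contrib contrib-type="author">
        <name><surname>Brown</surname><given-names>Kim</given-names></name>
      </contrib>
    </contrib-group>
    <author-notes>
      <corresp id="cor1">E-mail: <email>a.jones@example.org</email></corresp>
    </author-notes>
  </article>
XML

# Surnames of <contrib> elements flagged by any of three markers:
# corresp="yes", an <xref ref-type="corresp">, or an <xref> enclosing "*".
def corresponding_surnames(xml)
  doc = REXML::Document.new(xml)
  flagged = REXML::XPath.match(doc, '//contrib').select do |contrib|
    xrefs = contrib.get_elements('xref')
    contrib.attributes['corresp'] == 'yes' ||
      xrefs.any? { |x| x.attributes['ref-type'] == 'corresp' || x.text.to_s.strip == '*' }
  end
  flagged.map { |contrib| contrib.elements['name/surname']&.text }.compact.uniq
end

# Fallback: email addresses inside <corresp>, sometimes the only indicator.
def correspondence_emails(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//corresp//email').map { |e| e.text.to_s.strip }.reject(&:empty?)
end

p corresponding_surnames(SAMPLE_XML)
p correspondence_emails(SAMPLE_XML)
```

With Nokogiri, the same checks could probably be expressed as XPath queries through `Nokogiri::XML(xml).xpath(...)`. Returning an array rather than a single name leaves room for papers with more than one corresponding author.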

A sample of over 500 results of a Pubmed query (.csv output stored in an online Google Spreadsheet with public read/write permissions) is used to test the code. All being well, it would not be too difficult to render the spreadsheet's few formulae (mostly RegEx) in code and integrate this program directly into Pubmed's search API for a functioning web app or something similar! Heath Anderson has written a Ruby version of the Pubmed search API here.

For further details on why this is needed or how it ought to be working, please see the questions I've posted regarding the issue at Biostars, ResearchGate and most recently the Scraperwiki Google group.


The approach taken is:

  1. For the minority of papers (~10%) hosted in Pubmed Central (PMC), simply parse the eUtils XML file (included in Pubmed search output) with Nokogiri for tags as described above and match the name found to surnames already obtained in the search results to confirm CA.
  2. The URL for many of the remaining papers can be accessed from an available Digital Object Identifier (DOI) key, simply as http://dx.doi.org/DOI-goes-here, and the page parsed as text much as for XML (but probably with less ease). A variety of methods can be used for this: names found in HTML tags that match those in the record's search result come first, with brute-force regular expressions likely being the last resort.
  3. Finding the URL for papers with neither a DOI key nor a copy in PMC may then require:
    • finding a URL to the article's page on the journal website within the HTML of the Pubmed abstract page,
    • finding a DOI hidden in the wrong place (searching the whole abstract page's HTML indiscriminately),
    • headless browser automation with Watir, though this is slow and a last resort (despite the repo name!)
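
The DOI handling in steps 2 and 3 might look something like the sketch below. The pattern `DOI_PATTERN` is a deliberately loose guess at what a DOI looks like, not an official grammar, and the helper names `doi_url` and `find_dois` as well as the sample HTML are hypothetical.

```ruby
# Loose DOI pattern (an assumption): "10.", a 4-9 digit prefix, "/", then a
# suffix running up to whitespace, a quote or an angle bracket.
DOI_PATTERN = %r{\b10\.\d{4,9}/[^\s"'<>]+}

# Build the resolver URL in the form described in step 2.
def doi_url(doi)
  "http://dx.doi.org/#{doi}"
end

# Search a page's HTML indiscriminately for DOI-like strings, trimming
# trailing punctuation that usually belongs to the surrounding sentence.
def find_dois(html)
  html.scan(DOI_PATTERN).map { |doi| doi.sub(/[.,;)]+\z/, '') }.uniq
end

# Made-up snippet standing in for an abstract page.
html = '<p>Published online. doi: 10.1000/xyz123. ' \
       '<a href="http://dx.doi.org/10.1000/xyz123">link</a></p>'

find_dois(html).each { |doi| puts doi_url(doi) }
```

A loose pattern like this will produce false positives on some pages, so any DOI found this way is best treated as a candidate to verify rather than a confirmed identifier.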

The end product should also account for papers that state "equally corresponding" authors, and handle multiple corresponding authors accordingly.

![logo](https://raw.github.com/lmmx/watir-paper-scanner/master/scrapertests.png "")
These are the outputs from running the pmcscraper.rb script as it stands (7th Nov '13).
I'm outputting the variables to show what values they take in given instances. The idea is now to match up usernames with surnames.
Lines referenced are those of the spreadsheet.
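
One way the matching step could work is sketched below, assuming "usernames" means the part of an email address before the @. The method name `surnames_matching_email` and the sample data are hypothetical, and the comparison is a plain substring check that ignores case and non-letter characters.

```ruby
# Hypothetical helper: return the surnames that appear inside the local part
# (username) of an email address. Only ASCII letters are compared, so
# accented surnames would need transliterating first.
def surnames_matching_email(email, surnames)
  username = email.split('@').first.to_s.downcase.gsub(/[^a-z]/, '')
  surnames.select do |surname|
    key = surname.downcase.gsub(/[^a-z]/, '')
    !key.empty? && username.include?(key)
  end
end

p surnames_matching_email('a.jones@example.org', %w[Smith Jones Brown])
```

Usernames built from initials or nicknames will not match any surname, so an empty result here means "unknown" rather than "no corresponding author".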

Parlez-vous Ruby?

I'm a life sciences undergraduate with no formal training in code (hopefully it doesn't show) beyond what I've picked up through practice, and am extremely grateful for the patience and assistance of more experienced Ruby programmers.

Once fully functional, I'd love to make this freely available as an online tool (as to my knowledge it's not a great cost to deploy these things) for other academics as well as leaving the code here, and would be happy for it to be put to use in your project - feel free to contact me through naivelocus@gmail.com with any ideas.

Any and all help given on the code in the meantime is greatly appreciated :~)

Louis Maddox
Biochemistry BSc
University of Manchester (UK)