Skip to content

Latest commit

 

History

History
11 lines (6 loc) · 791 Bytes

README.md

File metadata and controls

11 lines (6 loc) · 791 Bytes

Paper Fetcher

Given a list of DOIs, one per line in a file named dois.txt, php fetch.php will read each DOI, attempt to fetch bibliographic JSON from dx.doi.org and HTML from the publisher via dx.doi.org, then parse the HTML to find a PDF URL and fetch that PDF.

If successful, the files are stored in the data folder.

Alongside each file a JSON file is stored that contains metadata for the HTTP request/response.

The XPath selectors to find the PDF URL (or next HTML URL, if it's an interstitial page) are in selectors.json. They work for most publishers, but not all - it will almost certainly be necessary to add domain-specific rules to get 100% coverage.

Note that if page content is generated dynamically or fetched with JavaScript, it won't be captured.