Python utility for efficiently harvesting a large Open Access collection of PDFs (fault tolerant, resumable, with parallel download and ingestion) and for transforming them into structured XML adapted to text mining and information retrieval applications.
Input currently supported:
- list of DOIs in a file, one DOI per line
- metadata CSV input file from the CORD-19 dataset (see the CORD-19 results section below for the capacity of the tool to retrieve more full texts and better data quality than the official dataset)
- list of PMIDs in a file, one PMID per line
- list of PMC IDs in a file, one PMC ID per line
To do:
- list of ISTEX identifiers or ark in a file, one identifier per line
- Apache Airflow for the task workflow
- consolidate/resolve bibliographical references obtained via Pub2TEI
The tool will:
- Perform some metadata enrichment/aggregation via biblio-glutton & the CrossRef web API and output consolidated metadata in a JSON file
- Harvest PDFs from the specification of the article set (list of strong identifiers or basic metadata provided in a CSV file), typically PDFs available in Open Access via the Unpaywall API (and some heuristics)
- Perform full Grobid processing of the PDFs (including bibliographical reference consolidation and OA resolution of the cited references), converting them into structured XML TEI
- For PMC files (Open Access set only), also harvest the XML JATS (NLM) files and convert them into XML TEI (same TEI customization as Grobid) via Pub2TEI
Optionally:
- Generate thumbnails for each article (based on the first page of the PDF), in small/medium/large sizes
- Upload the generated dataset to S3 instead of the local file system
- Generate JSON PDF annotations (with coordinates) for inline reference markers and bibliographical references (see here)
The utility has been tested with Python 3.5+. It is developed for deployment on a POSIX/Linux server (it uses imagemagick as an external process to generate thumbnails, and wget). An S3 account and bucket must have been created beforehand for non-local storage of the data collection.
To install imagemagick:
- on Linux Ubuntu:
sudo apt update
sudo apt build-dep imagemagick
- on macOS:
brew install libmagic
The following tools need to be installed and running, with access information specified in the configuration file (config.json):
- Grobid, for converting PDF into XML TEI
- biblio-glutton, for metadata retrieval and aggregation
- Pub2TEI, for converting PMC XML files into XML TEI
It should be possible to use the public demo instance of biblio-glutton, as configured by default in the config.json file (the service scales to more than 6,000 queries per second). However, for Grobid we strongly recommend installing a local instance, because the online public demo will not scale and won't be reliable given that it is more or less always overloaded.
As biblio-glutton uses dataset dumps, there is a gap of several months in terms of bibliographical data freshness. The CrossRef web API and Unpaywall API services are therefore used in complement to cover this gap. For these two services, you need to indicate your email in the config file (config.json) to follow their etiquette policies.
An important parameter in the config.json file is the number of documents processed in parallel, specified by the attribute batch_size, with a default value of 10 (so at most 10 documents downloaded in parallel with distinct threads/workers and processed by Grobid in parallel). You can set this number according to the number of threads available on your machine.
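As an illustration, here is a minimal config.json sketch covering the parameters discussed in this document. Only batch_size, data_path and cord19_elsevier_pdf_path are named in the text; the keys for the service URLs and the contact email are assumptions and should be checked against the config.json shipped with the tool:

```json
{
  "batch_size": 10,
  "data_path": "./data",
  "cord19_elsevier_pdf_path": "/the/path/to/the/pdf",
  "grobid_base": "http://localhost:8070",
  "biblio_glutton_base": "http://localhost:8080",
  "crossref_email": "you@example.com",
  "unpaywall_email": "you@example.com"
}
```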
These tools require Java 8 or higher.
TBD
usage: harvest.py [-h] [--dois DOIS] [--cord19 CORD19] [--pmids PMIDS]
[--pmcids PMCIDS] [--config CONFIG] [--reset] [--reprocess]
[--thumbnail] [--annotation] [--diagnostic] [--dump]
COVIDataset harvester
optional arguments:
-h, --help show this help message and exit
--dois DOIS path to a file describing a dataset articles as a simple
list of DOI (one per line)
--cord19 CORD19 path to the csv file describing the CORD-19 dataset
articles
--pmids PMIDS path to a file describing a dataset articles as a simple
list of PMID (one per line)
--pmcids PMCIDS path to a file describing a dataset articles as a simple
list of PMC ID (one per line)
--config CONFIG path to the config file, default is ./config.json
--reset ignore previous processing states, and re-init the
harvesting process from the beginning
--reprocess reprocess existing failed entries
--thumbnail generate thumbnail files for the front page of the
harvested PDF
--annotation generate bibliographical annotations with coordinates for
the harvested PDF
--diagnostic perform a full consistency diagnostic on the harvesting and
transformation process
--dump write all the consolidated metadata in json in the file
consolidated_metadata.json
--download only download the raw files (PDF, NLM/JATS) without
processing them
Fill the file config.json with the relevant service URLs and parameters, then install the Python dependencies:
pip3 install -r requirements.txt
For instance, to process a list of DOIs (one DOI per line):
python3 harvest.py --dois test/dois.txt
Similarly for a list of PMIDs or PMC IDs:
python3 harvest.py --pmids test/pmids.txt
python3 harvest.py --pmcids test/pmcids.txt
For the CORD-19 dataset, you can use the metadata.csv file (last tested version from 2020-06-29) by running:
python3 harvest.py --cord19 metadata.csv
This will generate a consolidated metadata file (specified by --out, or consolidated_metadata.json by default) and upload the full text files, converted tei.xml files and other optional files either to the local file system (under the data_path indicated in the config.json file) or to an S3 bucket if the corresponding fields are filled in config.json.
You can set a specific config file name with --config:
python3 harvest.py --cord19 metadata.csv --config my_config_file.json
To resume an interrupted processing, simply re-run the same command.
To re-process the failed articles of a harvesting, use:
python3 harvest.py --reprocess
To entirely reset an existing harvesting and restart from zero:
python3 harvest.py --cord19 metadata.csv --reset
To create a dump of the consolidated metadata of all the processed files (including the UUID identifier and the state of processing), add the parameter --dump:
python3 harvest.py --dump
The generated metadata file is named consolidated_metadata.json.
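The dump can then be post-processed with a few lines of Python. A minimal sketch, assuming the dump is a JSON array of records each carrying a processing state; the field names "id", "DOI" and "state" are hypothetical and should be checked against the actual dump:

```python
import json

# Hypothetical excerpt of consolidated_metadata.json; the field names
# ("id", "DOI", "state") are assumptions, not the documented schema.
dump = json.loads("""
[
  {"id": "98da17ff-bf7e-4d43-bdf2-4d8d831481e5", "DOI": "10.1000/xyz1", "state": "success"},
  {"id": "11111111-2222-3333-4444-555555555555", "DOI": "10.1000/xyz2", "state": "fail"}
]
""")

def count_states(records):
    """Tally the processing state of every harvested entry."""
    counts = {}
    for record in records:
        state = record.get("state", "unknown")
        counts[state] = counts.get(state, 0) + 1
    return counts

print(count_states(dump))  # {'success': 1, 'fail': 1}
```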
For producing thumbnail images of the first page of each article, use the --thumbnail argument. This option requires imagemagick to be installed on your system and will produce 3 PNG files of heights 150, 300 and 500 pixels. These thumbnails can be useful to offer a preview of an article in an application using these data.
python3 harvest.py --cord19 metadata.csv --thumbnail
For producing PDF annotations in JSON format corresponding to the bibliographical information (reference markers in the article body and bibliographical references in the bibliography section), use the argument --annotation. See more information about these annotations here. They enrich the display of PDFs and make them more interactive.
python3 harvest.py --cord19 metadata.csv --annotation
Finally, you can run a short diagnostic/report on the latest harvesting like this:
python3 harvest.py --diagnostic
Structure of the generated files for an article with UUID identifier 98da17ff-bf7e-4d43-bdf2-4d8d831481e5:
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5.pdf
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5.grobid.tei.xml
Optional additional files:
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5.nxml
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5-ref-annotations.json
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5-thumb-small.png
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5-thumb-medium.png
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5-thumb-large.png
The UUID identifier for a particular article is given in the generated consolidated_metadata.json file.
The *.nxml files correspond to the JATS files available for PMC (Open Access set only).
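The nested directory layout above is derived from the first eight hex characters of the article UUID, split into four 2-character levels. A minimal sketch of how such a path can be computed; this helper is illustrative only and not part of the tool's API:

```python
import posixpath

def storage_path(uuid):
    # Split the first 8 hex characters of the UUID into four
    # 2-character directory levels: 98da17ff-... -> 98/da/17/ff/<uuid>
    prefix = uuid.replace("-", "")[:8]
    levels = [prefix[i:i + 2] for i in range(0, 8, 2)]
    return posixpath.join(*levels, uuid)

uid = "98da17ff-bf7e-4d43-bdf2-4d8d831481e5"
print(storage_path(uid) + "/" + uid + ".pdf")
# 98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5.pdf
```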
The CORD-19 dataset includes more than 19k articles corresponding to a set of Elsevier articles on COVID-19 recently put in Open Access. As Unpaywall does not cover these OA articles (as of 23.03.2020 at least), you first need to download these PDFs and indicate to the harvesting tool where the local repository of PDFs is located:
- download the PDF files from the COVID-19 FTP server:
sftp public@coronacontent.np.elsst.com
Indicate beat_corona as password. See the instruction page in case of trouble.
cd pdf
mget *
- indicate the local repository where you have downloaded the dataset in the config.json file:
"cord19_elsevier_pdf_path": "/the/path/to/the/pdf"
That's it. The file ./elsevier_covid_map_28_06_2020.csv.gz contains a map of DOI and PII (the Elsevier article identifiers) for these OA articles.
Here are the results for CORD-19 version 5 (metadata.csv), illustrating the interest of the tool:
| | official CORD-19 | this harvester |
|---|---|---|
| total entries | 45,828 | 45,828 |
| entries with valid OA URL | - | 42,742 |
| entries with successfully downloaded PDF | - | 42,362 |
| entries with structured full texts via GROBID | ~33,000 (JSON) | 41,070 (TEI XML) |
| entries with structured full texts via PMC JATS | - | 15,955 (TEI XML) |
| total entries with at least one structured full text | ~33,000 (JSON) | 41,609 (TEI XML) |
Other main differences include:
- the XML TEI files contain richer structured full text,
- usage of more recent GROBID models (with extra medRxiv and bioRxiv training data),
- additional download of PMC JATS files and conversion with Pub2TEI (normally without information loss, because the TEI customization we are using supersedes the structures covered by JATS). Note that a conversion from PMC JATS files has been introduced in CORD-19 from version 6.
- full consolidation of the bibliographical references with publisher metadata, DOI, PMID, PMC ID, etc. when available
- consolidation of article metadata with CrossRef and PubMed aggregations for the entries
We will try to re-ingest and update these numbers with version 7 of the dataset!
After the harvesting and processing performed by harvest.py, it is possible to convert the PMC XML JATS files into XML TEI. This provides better XML quality than what can be extracted automatically by Grobid from the PDF. The conversion also puts all the documents in the same XML TEI customization format. As the TEI format supersedes JATS, there is normally no loss of information from the JATS file.
To launch the conversion:
python3 nlm2tei.py
If a custom config file is used:
python3 nlm2tei.py --config ./my_config.json
This will apply Pub2TEI (a set of XSLT stylesheets) to all the harvested *.nxml files and add a new TEI file to the document repository:
98/da/17/ff/98da17ff-bf7e-4d43-bdf2-4d8d831481e5/98da17ff-bf7e-4d43-bdf2-4d8d831481e5.pub2tei.tei.xml
Note that Pub2TEI supports many other publishers' XML formats, so the principle could be extended to transform many different XML formats into a single one (TEI), facilitating further ingestion and processing by avoiding the need to write a complicated XML parser for each case.
A recent update (end of October 2018) of imagemagick breaks the normal conversion usage: by default, the converter no longer processes PDFs, for security reasons related to server usage. For non-server usage, as in our module, it is not a problem to allow PDF conversion. To do so, simply edit the file /etc/ImageMagick-6/policy.xml (or /etc/ImageMagick/policy.xml) and comment out the following line:
<!-- <policy domain="coder" rights="none" pattern="PDF" /> -->
Distributed under Apache 2.0 license.
Main author and contact: Patrice Lopez (patrice.lopez@science-miner.com)