Store and process the Crossref Database

This repository downloads Crossref metadata using the Crossref API. The items retrieved are stored in MongoDB to preserve their raw structure. This design allows for flexible downstream analyses.


MongoDB is run via Docker. It's available on the host machine at http://localhost:27017/.

docker run \
  --name=mongo-crossref \
  --publish=27017:27017 \
  --volume=`pwd`/mongo.db:/data/db \
  --rm \

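Before running the steps below, it can help to confirm that something is listening on the published port. The following is a minimal sketch using only the Python standard library. The function name `mongo_port_open` is hypothetical and not part of this repository. It only checks that a TCP connection succeeds, not that the listener is actually MongoDB.

```python
import socket


def mongo_port_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper: a successful connection only shows that some
    process is listening on the port, not that it is MongoDB.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False
```

Calling `mongo_port_open()` with the defaults checks `localhost:27017`, the port published by the Docker command above.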


With MongoDB running, run the following commands:

# Download all works
# To start fresh, use `--cursor=*`
# If querying fails midway, you can extract the cursor of the
# last successful query from the tail of query-works.log.
# Then rerun, passing the intermediate cursor
# to --cursor instead of *.
python \
  --component=works \
  --batch-size=550 \
  --log=logs/query-works.log \

# Export mongodb works collection to JSON
mongoexport \
  --db=crossref \
  --collection=works \
  | xz > data/mongo-export/crossref-works.json.xz
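The cursor-based resumption described in the comments above can be sketched as follows. This is an illustrative outline of Crossref's cursor paging, not the download script used by this repository. The names `fetch_page` and `iter_batches` are hypothetical, and mapping `--batch-size` to the API's `rows` parameter is an assumption.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.crossref.org/works"


def fetch_page(cursor, rows=550):
    """Fetch one page of works and return the API's `message` object.

    Hypothetical helper; `rows=550` mirrors --batch-size=550 above,
    which is an assumption about how the script uses that option.
    """
    query = urllib.parse.urlencode({"cursor": cursor, "rows": rows})
    with urllib.request.urlopen(f"{API_URL}?{query}") as response:
        return json.load(response)["message"]


def iter_batches(cursor="*", fetch=fetch_page):
    """Yield (cursor, items) pairs until the API returns an empty page.

    The yielded cursor is the one that produced the batch, so logging
    it records a point from which the download can be resumed.
    """
    while True:
        message = fetch(cursor)
        items = message.get("items", [])
        if not items:
            break
        yield cursor, items
        cursor = message["next-cursor"]
```

Passing `cursor="*"` starts from the beginning, while passing a cursor recovered from a log resumes from that batch. The `fetch` argument is injectable so the paging logic can be exercised without network access.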

See data/mongo-export for more information on crossref-works.json.xz. Note that creating this file from the Crossref API takes several weeks. Users are encouraged to use the cached version available on figshare (see also Other resources below).

A Jupyter notebook in this repository extracts tabular datasets of works (TSVs), which are tracked using Git LFS:

  • doi.tsv.xz: a table where each row is a work, with columns for the DOI, type, and issued date.
  • doi-to-issn.tsv.xz: a table where each row is a work (DOI) to journal (ISSN) mapping.
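The xz-compressed TSVs listed above can be read without decompressing them to disk. The following is a minimal sketch using the Python standard library. The function name `read_tsv_xz` is hypothetical, and the sketch assumes each file starts with a header row, from which the column names are taken.

```python
import csv
import lzma


def read_tsv_xz(path):
    """Yield each row of an xz-compressed TSV as a dict keyed by header.

    Hypothetical helper: assumes the first line of the file is a
    header row and that fields are tab-separated.
    """
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")
```

Because the function is a generator, rows are streamed one at a time, which keeps memory use low for tables with tens of millions of rows.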


With MongoDB running, run the following command:

python \
  --component=types \


This repository uses conda to manage its environment as specified in environment.yml. Install the environment with:

conda env create --file=environment.yml

Then use source activate crossref and source deactivate to activate or deactivate the environment. On Windows, use activate crossref and deactivate instead.

Other resources

Ideally, Crossref would provide a complete database dump, rather than requiring users to go through the inefficient process of querying the API for all works: see CrossRef/rest-api-doc#271. Until then, users should check out the Crossref data currently hosted by this repository, whose query date is 2017-03-21, and its corresponding figshare deposit.

For users who need more recent data, Bryan Newbold used this codebase to create a MongoDB dump dated January 2018 (query date of approximately 2018-01-10), which he uploaded to the Internet Archive. His output file crossref-works.2018-01-21.json.xz contains 93,585,242 DOIs and consumes 28.9 GB compared to 87,542,370 DOIs and 7.0 GB for the crossref-works.json.xz dated 2017-03-21. This increased size is presumably due to the addition of I4OC references to Crossref work records.
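Record counts like those above can be checked by streaming a dump. The following is a minimal sketch, assuming the file holds one JSON document per line (the default `mongoexport` output when `--jsonArray` is not used) and that each work carries a `DOI` field as in Crossref work records. The function names `iter_works` and `count_dois` are hypothetical.

```python
import json
import lzma


def iter_works(path):
    """Yield one parsed work record per non-empty line of the export.

    Assumes one JSON document per line, i.e. an export made without
    mongoexport's --jsonArray option.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def count_dois(path):
    """Count records that carry a DOI field."""
    return sum(1 for work in iter_works(path) if "DOI" in work)
```

Streaming line by line avoids loading the dump into memory, although a full pass over a file of this size will still take a long time.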

Bryan Newbold has also created a September 2018 release, which is uploaded to the Internet Archive. This repository is currently seeking contributions to update the convenient TSV outputs based on the more recent database dumps.

Daniel Ecer also downloaded the Crossref work metadata in January 2018, using the codebase at elifesciences/datacapsule-crossref. His database dump is available on figshare. While the multi-part format of this dump is likely less convenient than the dumps produced by this repository, Daniel Ecer's analysis also exports a DOI-to-DOI table of citations/references. This citation catalog contains 314,785,303 citations and is thus more comprehensive than the catalog available from greenelab/opencitations.


This work is funded in part by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4552 to @cgreene.

