Processing OpenCitations Data
This repository processes the OpenCitations data to make it more user-friendly and concise.
The primary output is data/citations-doi.tsv.xz, which is a catalog of DOI-to-DOI citations.
The file is formatted like:
All DOIs are lowercase. Quality control steps were performed on the DOIs to remove clearly incorrect DOIs. However, for best results, we recommend users intersect these DOIs with a catalog of valid DOIs to remove any remaining errant DOIs.
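The intersection step recommended above could be sketched as follows. This is a minimal illustration, not part of the repository's notebooks. It assumes the file has a header row followed by two tab-separated columns (citing DOI, then cited DOI), and that the catalog of valid DOIs is available as any iterable of strings. The function names read_citations and keep_valid are hypothetical.

```python
import csv
import lzma


def read_citations(path):
    """Yield (citing, cited) DOI pairs from an xz-compressed TSV file."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # assumption: the first row is a header
        for row in reader:
            # assumption: column 0 is the citing DOI, column 1 the cited DOI
            if len(row) >= 2:
                yield row[0], row[1]


def keep_valid(pairs, valid_dois):
    """Keep only the pairs whose two DOIs both appear in valid_dois."""
    # The citation DOIs are already lowercase, so lowercase the catalog too.
    valid = {doi.lower() for doi in valid_dois}
    return [
        (citing, cited)
        for citing, cited in pairs
        if citing in valid and cited in valid
    ]
```

For example, keep_valid(read_citations("data/citations-doi.tsv.xz"), valid_dois) would return the citations whose citing and cited DOIs are both in the catalog. Building a list keeps the sketch simple; for the full dataset, writing the matching rows out as they are read may be preferable.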
The downloading and processing of the OpenCitations data is accomplished by sequentially running the notebooks in this repository.
To update the pipeline to use newer versions of OpenCitations data, one should update the figshare article IDs in
conda env create --file=environment.yml
Use source activate opencitations and source deactivate to activate or deactivate the environment.
On Windows, use activate opencitations and deactivate instead.
In addition to the conda environment, users will need to install the Disk ARchive (dar) utility on their system.
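One way to confirm that dar is reachable before running the notebooks is to look it up on the PATH. The snippet below is only a convenience sketch and is not part of the repository. The helper name find_dar is hypothetical.

```python
import shutil


def find_dar():
    """Return the full path to the dar executable, or None if it is not on PATH."""
    return shutil.which("dar")


if __name__ == "__main__":
    location = find_dar()
    if location is None:
        print("dar was not found on PATH; install it before running the notebooks.")
    else:
        print(f"dar found at {location}")
```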