Future updates of bulk Crossref metadata corpus #5

bnewbold · 2017-12-19T20:33:10Z

In April 2017 @dhimmel uploaded a bulk snapshot of Crossref metadata to figshare (where it was assigned DOI 10.6084/m9.figshare.4816720.v1).

While this metadata can be scraped from the Crossref API by anybody (eg, using the scripts in this repository), I found it really helpful to grab in bulk form.

I'm curious whether this dump could be updated on an annual or quarterly basis. I don't have a particular need for the data to be versioned (eg, assigned sequential .v2, .v3 DOIs at figshare), but that would probably help with discovery for other folks and generally be a best practice.

If nobody has time to do such an update I will probably run the scripts from this repository and push to archive.org at: https://archive.org/details/ia_biblio_metadata.

dhimmel · 2017-12-19T21:08:30Z

@bnewbold thanks for your interest!

I found it really helpful to grab in bulk form.

Indeed! I'm hoping Crossref starts releasing their database dumps, so we don't have to keep going through the laborious steps of recreating them by millions of API calls 😸 . However, until then, it'd be nice to update this repo and data release.

Shortly after we did our extraction in April 2017, Crossref's API began returning citation information. This is really useful but makes the API responses much larger. It also opens up the possibility that we could produce a DOI-to-DOI citation table, which I'm sure would appeal to many users.

Anyways, I wasn't planning on updating these records until I needed newer data for my research (which could be never). Rerunning the API queries will probably take several weeks. You may run into issues. You'll need an internet connection with good uptime! @bnewbold if this is something you're interested in, we'd love if you could open a PR with the updates. We'd also love to add your revised DB dump to the figshare. You'd become an author on the figshare dataset and potentially other work we do in the future that makes use of this data. What do you think?

I'd also likely be interested in extracting the citation graph from the enlarged dump.
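
A minimal sketch of what extracting a DOI-to-DOI table could look like. It assumes each work is a dict shaped like a Crossref API record, with an optional `reference` list whose entries sometimes carry a `DOI` key. The function name `citation_pairs` and the lowercasing of DOIs are illustrative choices, not something taken from this repository.

```python
from typing import Dict, Iterable, Iterator, Tuple


def citation_pairs(works: Iterable[Dict]) -> Iterator[Tuple[str, str]]:
    """Yield (citing DOI, cited DOI) pairs from Crossref-style work records."""
    for work in works:
        source = work.get("DOI")
        if not source:
            continue
        # Many works have no 'reference' list, and many references have no DOI.
        for reference in work.get("reference", []):
            target = reference.get("DOI")
            if target:
                # DOIs are case-insensitive, so normalize before joining tables.
                yield source.lower(), target.lower()


# Made-up records, only to show the shape of the input and output.
example_works = [
    {"DOI": "10.1000/A", "reference": [{"DOI": "10.1000/b"}, {"unstructured": "no DOI here"}]},
    {"DOI": "10.1000/c"},
]
pairs = list(citation_pairs(example_works))
```

References without a DOI are simply skipped here, so the resulting table would undercount citations to works that Crossref could not match.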

bnewbold · 2017-12-20T07:32:25Z

I have download.py chugging away on a reliable host (up to a couple million so far, estimate says 240+ hours to go, but I won't be surprised if it takes longer), and the works (at least some of them) do indeed contain citation information.

dhimmel · 2017-12-20T14:31:31Z

I have download.py chugging away on a reliable host

Nice! IIRC the order of the works is somewhat chronological. The newer works tend to have more metadata, so things slow down and become more error-prone closer to the end. Happy to help if anything breaks.

the works (at least some of them) do indeed contain citation information.

Nice! I believe these will be the I4OC citations (more accurately "references"), which should be much more prevalent than the OpenCitations corpus we processed in greenelab/opencitations.

bnewbold · 2018-01-05T17:27:20Z

The script is about 80% complete. It halted last week when the local hard disk ran out of space (from a bug in an unrelated script), but has been restarted with the most recent cursor position just now.

bnewbold · 2018-01-09T19:13:43Z

I ran into two problems: the other script misbehaved again (disk filled up), and it seemed like the dump hadn't continued where it had left off when I restarted back on Jan 5th (even though I specified a cursor). I'm not sure if the cursor is local (mongodb) or remote (crossref API), but the mongo container had been restarted and it had been more than a few days between failure and restart, either of which could have caused problems.
Anyways, I've restarted from scratch with this process on its own (SSD) disk. It looks like there are now 94,035,712 remote records, which I think is also a bump from the last attempt a couple weeks back.
Will update here again when this completes (expecting late January).

dhimmel · 2018-01-10T19:19:57Z

it seemed like the dump hadn't continued where it had left off when I restarted back on Jan 5th (even though I specified a cursor).

@bnewbold thanks for the update. My understanding of the cursor is that it's remote. I'm not sure how long cursors are retained... the cursor could have been retired after some days of inactivity. Ideally, specifying an invalid cursor would trigger an error and not proceed silently.
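
For context, a rough sketch of cursor-based deep paging that fails loudly instead of proceeding silently. As far as I understand the Crossref REST API, a query starts with `cursor=*` and each response's `message` carries a `next-cursor` to pass back. Here `fetch_page` is a hypothetical callable standing in for the HTTP request, and `fake_fetch` is made-up data so the example runs offline; neither reflects how download.py is written.

```python
from typing import Callable, Dict, Iterator


def iterate_works(fetch_page: Callable[[str], Dict], cursor: str = "*") -> Iterator[Dict]:
    """Yield works page by page, following the cursor returned with each page.

    fetch_page takes a cursor string and returns the 'message' object of a
    /works response (assumed to hold 'items' and 'next-cursor').
    """
    while True:
        message = fetch_page(cursor)
        items = message.get("items", [])
        if not items:
            break  # an empty page is treated as the end of the result set
        yield from items
        next_cursor = message.get("next-cursor")
        if not next_cursor:
            # Raise rather than silently stopping or restarting from the top.
            raise RuntimeError(f"no next-cursor returned after cursor {cursor!r}")
        cursor = next_cursor


def fake_fetch(cursor: str) -> Dict:
    """Offline stand-in for the API: three tiny pages keyed by cursor."""
    pages = {
        "*": {"items": [{"DOI": "10.1/a"}, {"DOI": "10.1/b"}], "next-cursor": "c1"},
        "c1": {"items": [{"DOI": "10.1/c"}], "next-cursor": "c2"},
        "c2": {"items": [], "next-cursor": "c2"},
    }
    return pages[cursor]


dois = [work["DOI"] for work in iterate_works(fake_fetch)]
```

To resume a halted run, the last cursor seen would be passed as `cursor`; whether the server still honors it after a long pause is exactly the open question above.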

It looks like there are now 94,035,712 remote records

Nice! Nearing 100 million.

Anyways, I've restarted from scratch with this process

It'll be nice to have metadata through all of 2017. There may be a few articles published in 2017 that still haven't been deposited in Crossref, but I hope not too many.

bnewbold · 2018-01-22T21:25:29Z

The script completed successfully yesterday (2018-01-21) after about 11 days:

94035712/94035712 [270:33:11<00:00, 32.70it/s]
Finished queries with 93,585,242 works in db

I'm not sure what the discrepancy is between 93,585,242 and 94,035,712; some works get skipped intentionally?

I'm dumping to .json.xz now, which looks like it will take a couple hours, after which i'll upload to archive.org and you can review before pushing to figshare. I've named the file data/mongo-export/crossref-works.2018-01-21.json.xz to reduce possible confusion. I'll try to run the .ipynb files to update other formats, though i'm worried this machine won't have enough RAM. Either way i'll do a PR with, eg, the sha256 checksums updates.

dhimmel · 2018-01-22T21:38:47Z

I'm not sure what the discrepancy is between 93,585,242 and 94,035,712; some works get skipped intentionally?

What immediately comes to mind is if Crossref had multiple records for the same DOI. That could make the query number larger than the MongoDB number:

crossref/download.py

Line 37 in 768a49b

collection.replace_one(filter_, work, upsert=True)

Do you still have the log? I wonder if we should preserve this as well? It could probably help us diagnose the discrepancy.
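
To illustrate the hypothesis: the gap is 94,035,712 - 93,585,242 = 450,470 records, roughly 0.5% of the query total. The sketch below mimics the upsert-by-DOI behavior with a plain dict standing in for the MongoDB collection, so it is an analogy for `replace_one(..., upsert=True)` rather than the actual database call. The sample records are invented.

```python
def upsert_by_doi(works):
    """Mimic replace_one(filter, work, upsert=True): the last record per DOI wins."""
    collection = {}
    returned = 0
    for work in works:
        returned += 1
        # Insert if the DOI is new, replace the stored record if it already exists.
        collection[work["DOI"]] = work
    return returned, collection


# Three records come back from the "API", but two share a DOI.
stream = [
    {"DOI": "10.1/a", "title": "old"},
    {"DOI": "10.1/b", "title": "only"},
    {"DOI": "10.1/a", "title": "new"},
]
returned, collection = upsert_by_doi(stream)
```

Here `returned` is 3 while the collection holds 2 records, which is the same kind of mismatch as the progress-bar total versus the "works in db" count.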

bnewbold · 2018-01-23T19:22:45Z

I do have the complete log (including my first failed attempt). Skimming through it, doesn't look like it will answer this question, but i'll include it when I upload.

mongo-export is still running, about 3/4 complete now 22 hours in. The new corpus is significantly larger, presumably because of citation and maybe other new metadata being included. I estimate it will be 250 GB of uncompressed JSON, or about 25 GB compressed (xz, default settings).

bnewbold · 2018-01-24T22:02:35Z

Uploaded here: https://archive.org/download/crossref_doi_dump_201801/crossref-works.2018-01-21.json.xz

File is 30980612708 bytes (~29 GB), sha256 is 28075b3abf7724a284467000d3b2eba720f97967bb7b81bad62b7e9c0b24c761.
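
For anyone verifying a download of this size, a minimal sketch of hashing the file in chunks so it never has to fit in memory. The helper name `sha256_of_file` and the 1 MiB chunk size are arbitrary choices; the temporary file exists only so the example runs without the real dump.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-check on a throwaway file; for the real dump, pass its path instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
checksum = sha256_of_file(tmp.name)
os.remove(tmp.name)
```

For the real file, the returned string would be compared against the sha256 quoted above.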

Logs are uploaded to the same item, but might take a few minutes to appear (while main file is still being hashed and replicated).

Running the .ipynb files as-is didn't work from the command line. From within the conda environment:

(crossref) bnewbold@ia601101$ jupyter-run 1.works-to-dataframe.ipynb
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-7f436dceca0c> in <module>()
      4    "cell_type": "markdown",
      5    "metadata": {
----> 6     "deletable": true,
      7     "editable": true
      8    },

NameError: name 'true' is not defined
Traceback (most recent call last):
  File "/schnell/crossref-dump/miniconda3/envs/crossref/bin/jupyter-run", line 11, in <module>
    sys.exit(main())
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_client/runapp.py", line 115, in start
    raise Exception("jupyter-run error running '%s'" % filename)
Exception: jupyter-run error running '1.works-to-dataframe.ipynb'

(crossref) bnewbold@ia601101$ jupyter nbconvert --to python 1.works-to-dataframe.ipynb
[NbConvertApp] Converting notebook 1.works-to-dataframe.ipynb to python
[NbConvertApp] Writing 1589 bytes to 1.works-to-dataframe.py

(crossref) bnewbold@ia601101$ python 1.works-to-dataframe.py

Traceback (most recent call last):
  File "1.works-to-dataframe.py", line 57, in <module>
    doi_writer.writerow((doi, work['type'], issued))
KeyError: 'type'

My interest was in getting the .json.xz file, not the derived files, but it looks like the notebook files talk to (local) mongodb directly instead of parsing the bulk file. @dhimmel, can you bulk-load the dump into a local mongo instance and generate the derived files there?
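
The KeyError suggests that at least one work in the newer data has no `type` field. One possible workaround, sketched under assumptions: `work_to_row` is a hypothetical helper, the tab delimiter is a guess, and since the thread does not show how the notebook computes `issued`, that value is passed through unchanged.

```python
import csv
import io


def work_to_row(work, issued):
    """Build a (doi, type, issued) row, tolerating works without a 'type' key."""
    doi = work["DOI"]
    # .get() returns an empty string instead of raising KeyError when 'type' is absent.
    work_type = work.get("type", "")
    return doi, work_type, issued


# Write two made-up works to an in-memory buffer instead of a real output file.
buffer = io.StringIO()
doi_writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
doi_writer.writerow(work_to_row({"DOI": "10.1/a", "type": "journal-article"}, "2017-04"))
doi_writer.writerow(work_to_row({"DOI": "10.1/b"}, "2018"))
output = buffer.getvalue()
```

Whether an empty type is acceptable downstream, or whether such works should be skipped or logged, is a decision for the notebook's maintainers.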

bnewbold · 2018-01-24T22:08:59Z

I'll hold off on tearing down the mongo database for a few days in case it ends up being useful.

Two other infrastructure notes I had from setting up this run:

I initially got docker NAT/iptables errors. These were resolved by restarting the docker daemon. As always with docker, I have no idea if this was due to the kernel/vm/os setup on this machine, the particular docker/moby daemon installed, our firewall config, etc.
I hadn't used the "miniconda" package management system before. I installed it in a local project-specific directory, and had to update $PATH with export PATH=/PROJECT/crossref-dump/miniconda3/bin:$PATH before running source activate crossref

dhimmel · 2018-01-24T22:55:54Z

Running the .ipynb files as-is didn't work from the command line. From within the conda environment:

I don't think jupyter-run is the right command (despite its name). In the past, I've used nbconvert like

jupyter nbconvert --inplace --execute --ExecutePreprocessor.timeout=-1 1.works-to-dataframe.ipynb

Do you think you could open a PR with at least the update to:

crossref/data/mongo-export/checksums-sha256.txt

Line 1 in 768a49b

    
           a884f1d52bd753ee6c0777f52f26f65f236a50701e7a24b0f592e2578382370e  crossref-works.json.xz

I'd like for you to be in the commit history. If you get the notebooks to run, great. Otherwise I can try to do it by importing the dump.

bnewbold · 2018-01-30T18:37:48Z

Here's what I get trying to use the above jupyter line:

(crossref) bnewbold@ia601101$ head environment.yml -n1
name: crossref
(crossref) bnewbold@ia601101$ conda-env list
# conda environments:
#
crossref              *  /schnell/crossref-dump/miniconda3/envs/crossref
root                     /schnell/crossref-dump/miniconda3

(crossref) bnewbold@ia601101$ jupyter nbconvert --inplace --execute --ExecutePreprocessor.timeout=-1 1.works-to-dataframe.ipynb
[NbConvertApp] Converting notebook 1.works-to-dataframe.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: conda-env-crossref-py
Traceback (most recent call last):
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_client/kernelspec.py", line 201, in get_kernel_spec
    resource_dir = d[kernel_name.lower()]
KeyError: 'conda-env-crossref-py'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/schnell/crossref-dump/miniconda3/envs/crossref/bin/jupyter-nbconvert", line 11, in <module>
    load_entry_point('nbconvert==5.1.1', 'console_scripts', 'jupyter-nbconvert')()
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_core/application.py", line 266, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/nbconvertapp.py", line 305, in start
    self.convert_notebooks()
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/nbconvertapp.py", line 473, in convert_notebooks
    self.convert_single_notebook(notebook_filename)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/nbconvertapp.py", line 444, in convert_single_notebook
    output, resources = self.export_single_notebook(notebook_filename, resources, input_buffer=input_buffer)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/nbconvertapp.py", line 373, in export_single_notebook
    output, resources = self.exporter.from_filename(notebook_filename, resources=resources)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/exporters/exporter.py", line 171, in from_filename
    return self.from_file(f, resources=resources, **kw)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/exporters/exporter.py", line 189, in from_file
    return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/exporters/notebook.py", line 31, in from_notebook_node
    nb_copy, resources = super(NotebookExporter, self).from_notebook_node(nb, resources, **kw)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/exporters/exporter.py", line 131, in from_notebook_node
    nb_copy, resources = self._preprocess(nb_copy, resources)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/exporters/exporter.py", line 308, in _preprocess
    nbc, resc = preprocessor(nbc, resc)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__
    return self.preprocess(nb,resources)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 207, in preprocess
    cwd=path)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/nbconvert/preprocessors/execute.py", line 188, in start_new_kernel
    km.start_kernel(**kwargs)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_client/manager.py", line 244, in start_kernel
    kernel_cmd = self.format_kernel_cmd(extra_arguments=extra_arguments)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_client/manager.py", line 175, in format_kernel_cmd
    cmd = self.kernel_spec.argv + extra_arguments
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_client/manager.py", line 87, in kernel_spec
    self._kernel_spec = self.kernel_spec_manager.get_kernel_spec(self.kernel_name)
  File "/schnell/crossref-dump/miniconda3/envs/crossref/lib/python3.6/site-packages/jupyter_client/kernelspec.py", line 203, in get_kernel_spec
    raise NoSuchKernel(kernel_name)
jupyter_client.kernelspec.NoSuchKernel: No such kernel named conda-env-crossref-py

dhimmel · 2018-01-30T22:03:44Z

jupyter_client.kernelspec.NoSuchKernel: No such kernel named conda-env-crossref-py

Ah I've hit that annoying bug as well in anaconda/nb_conda_kernels#34 (comment).

If you add --ExecutePreprocessor.kernel_name=python, it might be fixed.
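
Some background on why the override can help: a notebook file is JSON, and it records the kernel it was last saved with under `metadata.kernelspec.name`. When executing, nbconvert asks for that kernel unless told otherwise. A minimal sketch for inspecting that value follows; the example notebook content is made up to match the error message above, and `notebook_kernel_name` is a hypothetical helper.

```python
import json


def notebook_kernel_name(notebook_json):
    """Return the kernel name stored in a notebook's metadata, or None if absent."""
    notebook = json.loads(notebook_json)
    return notebook.get("metadata", {}).get("kernelspec", {}).get("name")


# A stripped-down notebook document; real notebooks also contain cells and outputs.
example_notebook = json.dumps({
    "cells": [],
    "metadata": {
        "kernelspec": {
            "name": "conda-env-crossref-py",
            "display_name": "Python [conda env:crossref]",
            "language": "python",
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2,
})
kernel = notebook_kernel_name(example_notebook)
```

If the recorded name is not among the kernels installed on the machine running the notebook, execution fails with NoSuchKernel, as in the traceback.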

Also I just came across https://github.com/elifesciences/datacapsule-crossref by @de-code which seems to have also downloaded the works data from Crossref.

de-code · 2018-02-02T10:29:51Z

Yes, I've just updated the download recently. I will try to share the dump. But the whole works dump is about 32 GB. Looking for an easy way to get that into Figshare (from a headless server). (I also have just citation links which is a more manageable <3 GB)

de-code · 2018-02-05T17:46:29Z

Data downloaded January 2018 now available in Figshare:
https://doi.org/10.6084/m9.figshare.5845554

And just citation links:
https://doi.org/10.6084/m9.figshare.5849916

de-code · 2018-02-05T17:51:23Z

There is also an open issue / request for Crossref to provide something similar: CrossRef/rest-api-doc#271

bnewbold · 2018-09-06T00:58:02Z

In case anybody is interested, i've started another dump using the exact same code path today. Not sure if i'll continue updating dumps in the future, but I wanted a fresher one and might as well share. I cross-posted at elifesciences/datacapsule-crossref#1 as well.

Notes since last time:

Crossref now offers weekly snapshot dumps, but as a paid service; i'm not sure what the license is on the dump files themselves
Others have noted duplicated DOIs in bulk iteration dumps; the cursors don't seem to hold a consistent transaction open, so updates to DOIs that occur during the dump can result in duplicate entries

dhimmel · 2018-09-19T17:09:40Z

i've started another dump using the exact same code path today

Nice!

updates to DOIs that occur during the dump can result in duplicate entries

Hmm. I thought this repository should be replacing duplicate DOI entries in the Mongo database. In other words, the Mongo DB should only contain the most recently added metadata for a DOI:

crossref/download.py

Lines 33 to 37 in 1dc4171

    # Add works
    if component == 'works':
        for work in generator:
            filter_ = {'DOI': work['DOI']}
            collection.replace_one(filter_, work, upsert=True)

It is unfortunate that the iterative queries return duplicate DOIs (I thought the cursor was meant to prevent that, and I hope other DOIs aren't missing). However, replace_one should prevent our exports from being contaminated with duplicates. @bnewbold have you seen otherwise?
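
One way to check an export for duplicates directly, sketched under the assumption that the dump is xz-compressed with one JSON document per line (the default mongoexport layout, as far as I know). The helper name `duplicate_dois` is illustrative, and the tiny temporary file only exists so the example runs without the real dump.

```python
import json
import lzma
import os
import tempfile
from collections import Counter


def duplicate_dois(path):
    """Return {DOI: count} for DOIs that appear on more than one line of the dump."""
    counts = Counter()
    with lzma.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                # Lowercase because DOIs are case-insensitive.
                counts[json.loads(line)["DOI"].lower()] += 1
    return {doi: n for doi, n in counts.items() if n > 1}


# Build a three-line stand-in dump in which one DOI appears twice (in different case).
with tempfile.NamedTemporaryFile(suffix=".json.xz", delete=False) as tmp:
    path = tmp.name
with lzma.open(path, "wt", encoding="utf-8") as out:
    for doi in ["10.1/a", "10.1/b", "10.1/A"]:
        out.write(json.dumps({"DOI": doi}) + "\n")
duplicates = duplicate_dois(path)
os.remove(path)
```

Holding a counter for roughly 100 million DOIs takes a lot of memory, so on the full dump a sort-based or on-disk approach may be more practical.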

bnewbold · 2018-09-24T21:32:20Z

@dhimmel I hadn't noticed this behavior (DOI upsert) of these scripts (which I haven't read, just run blindly).

My most recent dump completed and is available here: https://archive.org/download/crossref_doi_dump_201809

SHA256 available here (feel free to PR/merge): bnewbold@9f99032

Merges #11 Created by Bryan Newbold (@bnewbold). Queries initiated on 2018-09-05. Refs #5 (comment) Full dataset and logs online at https://archive.org/download/crossref_doi_dump_201809

dhimmel · 2018-09-25T13:40:31Z

Repo updated with Newbold's September 2018 dump

My most recent dump completed and is available here.

Awesome, added checksum in #11 / 48a8589, updated README in cc79bd0, and tweeted:

Looking for @CrossRef DOI metadata (including citation links) for ~100 million scholarly articles? See Bryan Newbold's September 2018 bulk export hosted by the @internetarchive. https://archive.org/details/crossref_doi_dump_201809

Looks like the queries took 16 days (2018-09-05 to 2018-09-20). File size is 33.2 GB, up from 28.9 GB for the January 2018 release.

@bnewbold I have a slight preference that if you make another update in the future, for you to open a new issue.

bjrne · 2020-10-24T21:23:22Z

Thanks @bnewbold for the effort to upload newer versions of the dataset to the internet archive pages ☺️
Since it took me a while to also find the newer ones, I'll link them here below for people who land on this issue:

"Official" crossref dump 2020-04: https://archive.org/details/crossref-doi-metadata-20200408
Self-crawled with this repo 2019-09: https://archive.org/details/crossref_doi_dump_201909

omeletteinc · 2020-12-13T17:14:15Z

possible to get single file?

bnewbold mentioned this issue Jan 30, 2018

add Jan 2018 dump SHA256 checksum #6

Merged

dhimmel pushed a commit that referenced this issue Feb 6, 2018

Add checksum for crossref-works.2018-01-21.json.xz (#6)

15ec9f6

This file, along with logs from it's creation, is available at https://archive.org/download/crossref_doi_dump_201801 Refs #5

dhimmel added a commit that referenced this issue Feb 6, 2018

README: other resources section

c9fd6fe

Reference recent Crossref dumps discussed in #5 Refs CrossRef/rest-api-doc#271

dhimmel added a commit that referenced this issue Feb 6, 2018

README: other resources section

9551d99

Reference recent Crossref dumps discussed in #5. Closes #5 Refs CrossRef/rest-api-doc#271

dhimmel mentioned this issue Feb 6, 2018

README: other resources section #7

Merged

dhimmel closed this as completed in #7 Feb 7, 2018

dhimmel added a commit that referenced this issue Feb 7, 2018

README: other resources section (#7)

1dc4171

Reference recent Crossref dumps discussed in #5. Closes #5 Refs CrossRef/rest-api-doc#271

dhimmel mentioned this issue Sep 25, 2018

Add crossref-works.2018-09-05.json.xz checksum #11

Closed
