scholkit

   ,   ,
  /////|
 ///// |
|~~~|  |
|===|  |
|j  |  |
| g |  |
|  s| /
|===|/
'---'

Sketch project: assorted utilities around scholarly metadata.

status: unstable, wip

Try

$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make

This builds several executables:

  • catshape (data format conversions)
  • urlstream (stream data from many urls)
  • cdxlookup (ad-hoc cdx api lookup)
  • strnorm (quick string normalization)

Example dataset to work with, e.g. to convert arxiv records to fatcat release entities:

$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
    zstd -dc | \
    catshape -f arxiv

Tools

Conversions

We want conversions from various formats to one single format (e.g. release entities). Source formats include:

  • crossref
  • datacite
  • pubmed
  • arxiv
  • oaiscrape
  • openalex
  • dblp

Target:

  • fatcat entities (release, work, container, file, contrib, abstract)

For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.
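
As a minimal sketch of this layering (not scholkit's actual code), the example below converts one record at a time and then wraps that function for a stream of newline-delimited JSON. The input field names (`id`, `title`) and the output shape (`title`, `ext_ids`) are hypothetical simplifications, not the real arxiv or fatcat schemas.

```python
import json
from typing import Iterable, Iterator, TextIO


def convert_record(record: dict) -> dict:
    """Smallest conversion unit: one source record in, one release-like dict out.

    The field names are illustrative assumptions, not a real schema.
    """
    return {
        # Collapse runs of whitespace (including newlines) in the title.
        "title": " ".join(record.get("title", "").split()),
        "ext_ids": {"arxiv": record.get("id")},
    }


def convert_stream(lines: Iterable[str]) -> Iterator[dict]:
    """Convenience layer: apply convert_record to newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield convert_record(json.loads(line))


def convert_file(src: TextIO, dst: TextIO) -> int:
    """Read JSON lines from src, write converted JSON lines to dst, return the count."""
    count = 0
    for release in convert_stream(src):
        dst.write(json.dumps(release) + "\n")
        count += 1
    return count
```

Keeping `convert_record` free of I/O makes it easy to test on a single record, while the stream and file wrappers stay thin.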

No bulk conversion should take longer than roughly one hour (the slowest currently is openalex, with 250M records, which takes about 45 min).

Clustering

Create a "works" view from releases.
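
One naive way to build such a view, shown here only as a sketch and not as scholkit's algorithm, is to group releases whose normalized titles match. The `title` and `ident` field names and the normalization rule are assumptions made for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase, strip accents, and keep only ASCII letters and digits."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", "", no_marks.lower())


def cluster_releases(releases):
    """Group release dicts into candidate works keyed by normalized title."""
    works = defaultdict(list)
    for release in releases:
        key = normalize_title(release.get("title", ""))
        if key:  # releases without a usable title stay unclustered
            works[key].append(release["ident"])
    return dict(works)
```

Matching on title alone will merge unrelated releases that share a generic title, so a usable version would likely need further signals such as contributors or year.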

Misc

The urlstream utility streams content from multiple URLs to stdout. It can help to create single-file versions of larger datasets like pubmed, openalex, etc.

$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
    grep -Eo "/ebooks/[0-9]+" | \
    awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt

$ urlstream < top100.txt > top100books.txt
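
The idea behind this kind of tool can be sketched in a few lines. The function below is an assumption-laden illustration, not urlstream's implementation: it fetches URLs one after another and copies each response body to an output stream, with no retries, concurrency, or error handling.

```python
import shutil
import urllib.request
from typing import BinaryIO, Iterable


def stream_urls(urls: Iterable[str], out: BinaryIO) -> int:
    """Copy the body of each URL to out, in order; return the number of URLs fetched."""
    fetched = 0
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore blank lines in the URL list
        with urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)  # chunked copy, so memory use stays bounded
        fetched += 1
    return fetched
```

Calling `stream_urls(sys.stdin, sys.stdout.buffer)` (after `import sys`) would read one URL per line from stdin and write the concatenated bodies to stdout, similar in spirit to the command above.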

Notes

TODO

  • map basic fields to fatcat release entities
  • map all fields to fatcat release entities
  • basic clustering algorithm
