
Incremental update; blocking + pairwise matching double step; support all known crossref dumps #66

Merged
kermitt2 merged 64 commits into master from incremental-update on Apr 11, 2022

Conversation

kermitt2 (Owner) commented on Sep 13, 2021

  • rename the node.js module "matching" to "indexing", because it indexes and does not match anything :)
  • migrate indexer and lookup to ElasticSearch 7
  • support CrossRef dumps as a single .xz file (GreeneLab), gzip or uncompressed, as a directory of gzip or uncompressed files (academic torrent), or as json files inside a tar.gz file (Metadata Plus); files can be jsonl or json arrays (indexer and lookup) (Add the ability to compile lmdb from crossref premium #39); see the reader sketch after this list
  • fix potential ES index refresh bug
  • update to gradle 7
  • add a nicer ES indexing progress "bar" with colors :)
  • as the blocking step, we retrieve via ElasticSearch the n best records (a list of MatchingDocument objects; currently n=4, defined in the configuration file) (Better selection step #13)
  • as the pairwise matching step, we compute a pairwise distance between the expected and candidate fields and re-rank the n best records accordingly (Better matching step #21)
  • the best-ranked candidate is returned only if it is above a matching score threshold (see the matching sketch after this list)
  • heavy factorization/simplification of the search cases, improving overall runtime
  • make all components (lookup, indexer and pubmed-glutton) use a single yaml config file (under config/glutton.yml)
  • update pubmed-glutton (ES7, gradle 7)
  • remove the withValidation option (we now always validate against the matching score)
  • remove some useless endpoints
  • add a Crossref client
  • "gap" incremental update using CrossRef REST API (using cursors) (Add the ability to update lmdb data from crossref #38, Feature request/Question: Keep data up to date, incremental appending #49)
  • daily CrossRef update launched by the server at an hour indicated in the config
  • year can be passed now as additional metadata and it is used in the matching distance
  • mix matching mode (Add REST service for mixed matching approach #12) updated but not used any more, because the full raw ref matching mode is now almost as fast and still much more accurate (96.26 f1-score against 94.75 mixed matching, 0.067s per request versus 0.048s mixed matching)
  • update readme
  • parsing & conversion of the medline/pudmed records to the Crossref format (a bit extended to accommodate the extra info, like MeSH), generate a dump similar to Crossref snapshot to be loaded by biblio-glutton

To be done:

  • document the incremental update (move the documentation to readthedocs)
  • unpaywall incremental file update (via unpaywall subscription)
  • review existing matching (which is super basic at the moment)

Loading the CrossRef Torrent Academic dump (January 7, 2021, 120M records):

  • 115,972,356 indexed records
  • around 4 hours
  • 232GB LMDB index volume

Indexing the CrossRef Torrent Academic dump (January 7, 2021, 120M records):

  • 115,972,356 indexed records
  • around 6 hours 30 minutes to index (while the machine was also used for other work), 4797 records/s
  • 25.94GB index volume

Current evaluation against 17,015 raw references in the PMC 1943 sample set (CRF model used to parse the raw references prior to matching):

17015 bibliographical references processed in 1145.593 seconds, 0.06732841610343815 seconds per bibliographical reference.
Found 16699 DOI

======= GLUTTON API ======= 

precision:      0.9732918138810708
recall: 0.9552159858947987
f-score:        0.9641691878744736

With BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:

Found 16752 DOI

======= GLUTTON API ======= 

precision:      0.9733763132760267
recall: 0.9583308845136644
f-score:        0.9657950069594575

The previous evaluation, with a smaller index (2019, so in principle easier), was:

======= GLUTTON API ======= 

17015 bibliographical references processed in 2363.978 seconds, 0.13893493975903615 seconds per bibliographical reference.
Found 16462 DOI

precision:      0.9699307496051512
recall: 0.9384072876873347
f-score:        0.953908653702542

lfoppiano mentioned this pull request on Sep 21, 2021
lfoppiano (Collaborator) commented:

I tested the loading and indexing from scratch and it looks very good.

karatekaneen (Contributor) commented:

Anything I can do here to help move this along? Seems like a really nice update.

karatekaneen (Contributor) commented:

I tried out this branch in a Docker container and ran into some issues.

Path

For the incremental update, it tried to run ../indexing but the path was not correct, as the folder structure in the container now seems to be one level deeper.

Indexing error

When running the incremental update I got a BUNCH (an 88 MB log file) of errors similar to the one below. I'm not sure whether any documents were indexed, because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.

ERROR [2022-02-15 09:33:12,403] com.scienceminer.lookup.storage.lookup.MetadataLookup: Cannot store the entry 10.1553/ita-ms-20-02, {"institution":[{"name":"Institut fuer Technologiefolgenabschaetzung der OEAW","acronym":["ITA"],"place":["Vienna, Austria"]}],"publisher-location":"Vienna","reference-count":0,"publisher":"self","content-domain":{"domain":[],"crossmark-restriction":false},"DOI":"10.1553/ita-ms-20-02","type":"report","created":{"date-parts":[[2021,11,15]],"date-time":"2021-11-15T21:52:18Z","timestamp":1637013138000},"source":"Crossref","is-referenced-by-count":0,"title":["COVID-19 - Voices from Academia (ITA-manu:script 21-02)"],"prefix":"10.1553","member":"418","published-online":{"date-parts":[[2021]]},"deposited":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T04:57:55Z","timestamp":1644901075000},"score":0.0,"editor":[{"given":"Alexander","family":"Reich","sequence":"first","affiliation":[]}],"issued":{"date-parts":[[2021]]},"references-count":0,"URL":"http://dx.doi.org/10.1553/ita-ms-20-02","published":{"date-parts":[[2021]]}}
! org.lmdbjava.Txn$BadException: Transaction must abort, has a child, or is invalid (-30782)
! at org.lmdbjava.ResultCodeMapper.checkRc(ResultCodeMapper.java:70)
! at org.lmdbjava.Dbi.put(Dbi.java:411)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.store(MetadataLookup.java:110)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.lambda$loadFromFile$0(MetadataLookup.java:95)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:39)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.load(CrossrefJsonlReader.java:33)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.loadFromFile(MetadataLookup.java:86)
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask$LoadCrossrefFile.run(IncrementalLoaderTask.java:215)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
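For reference, error -30782 is LMDB's MDB_BAD_TXN ("Transaction must abort, has a child, or is invalid"). Below is a minimal sketch of plain lmdbjava write-transaction scoping, illustrative only and not the MetadataLookup code; the constraint it shows is that a write transaction belongs to the thread that opened it and must be committed or aborted before another write begins:

```java
import java.io.File;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.lmdbjava.Dbi;
import org.lmdbjava.DbiFlags;
import org.lmdbjava.Env;
import org.lmdbjava.Txn;

public class LmdbWriteExample {
    public static void main(String[] args) {
        File dir = new File("/tmp/lmdb-demo");
        dir.mkdirs();
        Env<ByteBuffer> env = Env.create()
                .setMapSize(1_073_741_824L)     // 1 GiB map size
                .setMaxDbs(1)
                .open(dir);
        Dbi<ByteBuffer> db = env.openDbi("metadata", DbiFlags.MDB_CREATE);

        ByteBuffer key = ByteBuffer.allocateDirect(env.getMaxKeySize());
        key.put("10.1553/ita-ms-20-02".getBytes(StandardCharsets.UTF_8));
        key.flip();
        ByteBuffer val = ByteBuffer.allocateDirect(32);
        val.put("{\"DOI\":\"...\"}".getBytes(StandardCharsets.UTF_8));
        val.flip();

        // The write txn is bound to the thread that opened it: handing it to
        // pooled worker threads, or nesting writes, triggers MDB_BAD_TXN.
        try (Txn<ByteBuffer> txn = env.txnWrite()) {
            db.put(txn, key, val);
            txn.commit();
        }
        env.close();
    }
}
```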

Crossref timeout

This issue might be because I didn't enter any API key and got rate-limited, but I'm not sure. The last entry in the log file was the following:

ERROR [2022-02-15 09:33:44,208] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: Crossref update call failed
! java.lang.Exception: The request to Crossref REST API failed: java.net.SocketTimeoutException thrown during request execution :  (,cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAFfMc7Fmxkam1HbkpnUWxxbWlCMkxwREpMSFEAAAAABNql-xZxd3ptbnlDVlFrS2ltN0l2dW1uWlJBAAAAAALwHO8WTzVoX2ZEVS1SWnE4ZHBtX2VLZ2NNZwAAAAACv5qxFi14RFJYanphVGUyczg3YnAzem5lTXcAAAAAAsqlvxY1N1JUNFlBQVR3eVZLRWYwZnFvMkRRAAAAAAV8xzwWbGRqbUduSmdRbHFtaUIyTHBESkxIUQ==,filter=from-update-date:2022-02-14,rows=1000)
! Read timed out
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask.run(IncrementalLoaderTask.java:128)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
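For context, the failing request above is the standard Crossref /works deep-paging query (filter=from-update-date:YYYY-MM-DD, rows=1000, plus a cursor token). Here is a minimal sketch of one such page fetch with explicit timeouts and a naive bounded retry; the retry policy is an assumption, not what IncrementalLoaderTask actually does:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CrossrefCursorFetch {
    // Fetch one page of works updated since fromUpdateDate (YYYY-MM-DD),
    // retrying a few times on timeouts instead of failing the whole task.
    static String fetchPage(String fromUpdateDate, String cursor) throws IOException {
        String url = "https://api.crossref.org/works"
                + "?filter=from-update-date:" + fromUpdateDate
                + "&rows=1000"
                + "&cursor=" + URLEncoder.encode(cursor, "UTF-8");
        IOException last = null;
        for (int attempt = 0; attempt < 3; attempt++) {       // naive bounded retry
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(60_000);                      // generous read timeout
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) body.append(line).append('\n');
                return body.toString();
            } catch (IOException e) {                         // includes SocketTimeoutException
                last = e;
            } finally {
                conn.disconnect();
            }
        }
        throw last;
    }
}
```

Crossref deep paging starts with cursor=* and follows message.next-cursor from each response until a page comes back with fewer than rows items.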

kermitt2 merged commit a018da0 into master on Apr 11, 2022.
lfoppiano deleted the incremental-update branch on May 31, 2023.