
Incremental update; blocking + pairwise matching double step; support all known crossref dumps #66

Merged
kermitt2 merged 64 commits into master from incremental-update on Apr 11, 2022

Conversation

kermitt2 (Owner) commented on Sep 13, 2021

  • rename the node.js module "matching" to "indexing", because it indexes and does not match anything :)
  • migrate indexer and lookup to ElasticSearch 7
  • support CrossRef dumps as a single .xz file (GreeneLab), gzip or uncompressed, as a directory of gzip or uncompressed files (academic torrent), or as json files inside a tar.gz file (Metadata Plus); files can be jsonl or json arrays (indexer and lookup) (Add the ability to compile lmdb from crossref premium #39); see the reader sketch after this list
  • fix potential ES index refresh bug
  • update to gradle 7
  • add a nicer ES indexing progress "bar" with colors :)
  • as the blocking step, we retrieve via ElasticSearch the n best records (a list of MatchingDocument objects; currently n=4, defined in the configuration file) (Better selection step #13)
  • as the pairwise matching step, we compute a pairwise distance between the expected and candidate fields and re-rank the n best records accordingly (Better matching step #21)
  • the best-ranked candidate is returned only if it is above a matching score threshold (see the matching sketch after this list)
  • heavy factorization/simplification of the search cases, improving overall runtime
  • make all components (lookup, indexer and pubmed-glutton) use a single yaml config file (under config/glutton.yml)
  • update pubmed-glutton (ES7, gradle 7)
  • remove the withValidation option (we now always validate against the matching score)
  • remove some useless endpoints
  • add a Crossref client
  • "gap" incremental update using CrossRef REST API (using cursors) (Add the ability to update lmdb data from crossref #38, Feature request/Question: Keep data up to date, incremental appending #49)
  • daily CrossRef update launched by the server at an hour indicated in the config
  • year can be passed now as additional metadata and it is used in the matching distance
  • mix matching mode (Add REST service for mixed matching approach #12) updated but not used any more, because the full raw ref matching mode is now almost as fast and still much more accurate (96.26 f1-score against 94.75 mixed matching, 0.067s per request versus 0.048s mixed matching)
  • update readme
  • parsing & conversion of the medline/pudmed records to the Crossref format (a bit extended to accommodate the extra info, like MeSH), generate a dump similar to Crossref snapshot to be loaded by biblio-glutton

To be done:

  • document the incremental update (move the documentation to readthedocs)
  • unpaywall incremental file update (via unpaywall subscription)
  • review existing matching (which is super basic at the moment)

Loading the CrossRef Torrent Academic dump (January 7, 2021, 120M records):

  • 115,972,356 indexed records
  • around 4 hours
  • 232GB LMDB index volume

Indexing the CrossRef Torrent Academic dump (January 7, 2021, 120M records):

  • 115,972,356 indexed records
  • around 6 hours 30 minutes to index (while the machine was also used for other work), 4797 records/s
  • 25.94GB index volume

Current evaluation against 17,015 raw references in the PMC 1943 sample set (CRF model used to parse the raw references prior to matching):

17015 bibliographical references processed in 1145.593 seconds, 0.06732841610343815 seconds per bibliographical reference.
Found 16699 DOI

======= GLUTTON API ======= 

precision:      0.9732918138810708
recall: 0.9552159858947987
f-score:        0.9641691878744736

With BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:

Found 16752 DOI

======= GLUTTON API ======= 

precision:      0.9733763132760267
recall: 0.9583308845136644
f-score:        0.9657950069594575

The previous evaluation, with a smaller index (2019, so in principle easier), was:

======= GLUTTON API ======= 

17015 bibliographical references processed in 2363.978 seconds, 0.13893493975903615 seconds per bibliographical reference.
Found 16462 DOI

precision:      0.9699307496051512
recall: 0.9384072876873347
f-score:        0.953908653702542

lfoppiano mentioned this pull request on Sep 21, 2021
lfoppiano (Collaborator) commented:

I tested the loading and indexing from scratch and it looks very good.

karatekaneen (Contributor) commented:

Anything I can do here to help move this along? Seems like a really nice update.

karatekaneen (Contributor) commented:

I tried out this branch in a Docker container and ran into some issues.

Path

For the incremental update, it tried to run ../indexing but the path was not correct, as the folder structure in the container now seems to be one level deeper.

Indexing error

When running the incremental update I got a BUNCH (an 88 MB log file) of errors similar to the one below. I'm not sure whether any documents were indexed, because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.

ERROR [2022-02-15 09:33:12,403] com.scienceminer.lookup.storage.lookup.MetadataLookup: Cannot store the entry 10.1553/ita-ms-20-02, {"institution":[{"name":"Institut fuer Technologiefolgenabschaetzung der OEAW","acronym":["ITA"],"place":["Vienna, Austria"]}],"publisher-location":"Vienna","reference-count":0,"publisher":"self","content-domain":{"domain":[],"crossmark-restriction":false},"DOI":"10.1553/ita-ms-20-02","type":"report","created":{"date-parts":[[2021,11,15]],"date-time":"2021-11-15T21:52:18Z","timestamp":1637013138000},"source":"Crossref","is-referenced-by-count":0,"title":["COVID-19 - Voices from Academia (ITA-manu:script 21-02)"],"prefix":"10.1553","member":"418","published-online":{"date-parts":[[2021]]},"deposited":{"date-parts":[[2022,2,15]],"date-time":"2022-02-15T04:57:55Z","timestamp":1644901075000},"score":0.0,"editor":[{"given":"Alexander","family":"Reich","sequence":"first","affiliation":[]}],"issued":{"date-parts":[[2021]]},"references-count":0,"URL":"http://dx.doi.org/10.1553/ita-ms-20-02","published":{"date-parts":[[2021]]}}
! org.lmdbjava.Txn$BadException: Transaction must abort, has a child, or is invalid (-30782)
! at org.lmdbjava.ResultCodeMapper.checkRc(ResultCodeMapper.java:70)
! at org.lmdbjava.Dbi.put(Dbi.java:411)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.store(MetadataLookup.java:110)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.lambda$loadFromFile$0(MetadataLookup.java:95)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.lambda$load$0(CrossrefJsonlReader.java:39)
! at java.util.Iterator.forEachRemaining(Iterator.java:116)
! at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
! at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
! at com.scienceminer.lookup.reader.CrossrefJsonlReader.load(CrossrefJsonlReader.java:33)
! at com.scienceminer.lookup.storage.lookup.MetadataLookup.loadFromFile(MetadataLookup.java:86)
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask$LoadCrossrefFile.run(IncrementalLoaderTask.java:215)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
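For reference, error -30782 is LMDB's MDB_BAD_TXN ("Transaction must abort, has a child, or is invalid"). Below is a minimal sketch of plain lmdbjava write-transaction scoping, illustrative only and not the MetadataLookup code; the constraint it shows is that a write transaction belongs to the thread that opened it and must be committed or aborted before another write begins:

```java
import java.io.File;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.lmdbjava.Dbi;
import org.lmdbjava.DbiFlags;
import org.lmdbjava.Env;
import org.lmdbjava.Txn;

public class LmdbWriteExample {
    public static void main(String[] args) {
        File dir = new File("/tmp/lmdb-demo");
        dir.mkdirs();
        Env<ByteBuffer> env = Env.create()
                .setMapSize(1_073_741_824L)     // 1 GiB map size
                .setMaxDbs(1)
                .open(dir);
        Dbi<ByteBuffer> db = env.openDbi("metadata", DbiFlags.MDB_CREATE);

        ByteBuffer key = ByteBuffer.allocateDirect(env.getMaxKeySize());
        key.put("10.1553/ita-ms-20-02".getBytes(StandardCharsets.UTF_8));
        key.flip();
        ByteBuffer val = ByteBuffer.allocateDirect(32);
        val.put("{\"DOI\":\"...\"}".getBytes(StandardCharsets.UTF_8));
        val.flip();

        // The write txn is bound to the thread that opened it: handing it to
        // pooled worker threads, or nesting writes, triggers MDB_BAD_TXN.
        try (Txn<ByteBuffer> txn = env.txnWrite()) {
            db.put(txn, key, val);
            txn.commit();
        }
        env.close();
    }
}
```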

Crossref timeout

This issue might be because I didn't enter any API key and got rate-limited, but I'm not sure. The last entry in the log file was the following:

ERROR [2022-02-15 09:33:44,208] com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask: Crossref update call failed
! java.lang.Exception: The request to Crossref REST API failed: java.net.SocketTimeoutException thrown during request execution :  (,cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAFfMc7Fmxkam1HbkpnUWxxbWlCMkxwREpMSFEAAAAABNql-xZxd3ptbnlDVlFrS2ltN0l2dW1uWlJBAAAAAALwHO8WTzVoX2ZEVS1SWnE4ZHBtX2VLZ2NNZwAAAAACv5qxFi14RFJYanphVGUyczg3YnAzem5lTXcAAAAAAsqlvxY1N1JUNFlBQVR3eVZLRWYwZnFvMkRRAAAAAAV8xzwWbGRqbUduSmdRbHFtaUIyTHBESkxIUQ==,filter=from-update-date:2022-02-14,rows=1000)
! Read timed out
! at com.scienceminer.lookup.utils.crossrefclient.IncrementalLoaderTask.run(IncrementalLoaderTask.java:128)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
! at java.lang.Thread.run(Thread.java:748)
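For context, the failing request above is the standard Crossref /works deep-paging query (filter=from-update-date:YYYY-MM-DD, rows=1000, plus a cursor token). Here is a minimal sketch of one such page fetch with explicit timeouts and a naive bounded retry; the retry policy is an assumption, not what IncrementalLoaderTask actually does:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CrossrefCursorFetch {
    // Fetch one page of works updated since fromUpdateDate (YYYY-MM-DD),
    // retrying a few times on timeouts instead of failing the whole task.
    static String fetchPage(String fromUpdateDate, String cursor) throws IOException {
        String url = "https://api.crossref.org/works"
                + "?filter=from-update-date:" + fromUpdateDate
                + "&rows=1000"
                + "&cursor=" + URLEncoder.encode(cursor, "UTF-8");
        IOException last = null;
        for (int attempt = 0; attempt < 3; attempt++) {       // naive bounded retry
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(60_000);                      // generous read timeout
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) body.append(line).append('\n');
                return body.toString();
            } catch (IOException e) {                         // includes SocketTimeoutException
                last = e;
            } finally {
                conn.disconnect();
            }
        }
        throw last;
    }
}
```

Crossref deep paging starts with cursor=* and follows message.next-cursor from each response until a page comes back with fewer than rows items.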

kermitt2 merged commit a018da0 into master on Apr 11, 2022.
lfoppiano deleted the incremental-update branch on May 31, 2023.