Incremental update; blocking + pairwise matching double step; support all known crossref dumps #66
… uncompressed json files
…ch index and storage
…ng; various fixes
…pping and PMC OA information
I tested the loading and indexing from scratch and it looks very good.
Anything I can do here to help move this along? Seems like a really nice update.
Tried out this branch in a Docker container and ran into some issues.

Path

For the incremental update, it tried to run …

Indexing error

When running the incremental update I got a BUNCH (88m log file) of errors similar to the one below. I'm not sure if any documents were indexed because I forgot to check the size before the update ran, but at least the file says it was modified at ~09:33.

Crossref timeout

This issue might be because I didn't enter any API key and got rate-limited, but I'm not sure. As the last entry in the log file I got the following:
- … (`n=4` currently and it is defined in the configuration file `config/glutton.yml`) (Better selection step #13)
- … `withValidation` option (we always validate against matching score)
- `year` can be passed now as additional metadata and it is used in the matching distance

To be done:
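For readers following the description above, the double step (a blocking step that keeps the top `n` candidates, then pairwise matching that is always validated against a matching score, with `year` contributing to the distance) could be sketched roughly as below. This is a hypothetical Python illustration, not the code of this PR: the `Record` type, the token-overlap blocking, the `difflib` similarity, the 0.8 year penalty and the 0.7 threshold are all assumptions made for the example; only `n=4` and the use of `year` come from the description.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for an indexed bibliographic record."""
    doi: str
    title: str
    year: Optional[int] = None


def block(query_title: str, index: List[Record], n: int = 4) -> List[Record]:
    """Blocking step: cheaply keep at most n records sharing title tokens."""
    query_tokens = set(query_title.lower().split())

    def overlap(rec: Record) -> int:
        return len(query_tokens & set(rec.title.lower().split()))

    ranked = sorted(index, key=overlap, reverse=True)
    return [rec for rec in ranked[:n] if overlap(rec) > 0]


def pairwise_score(query_title: str, query_year: Optional[int], cand: Record) -> float:
    """Pairwise step: string similarity, penalised when both years are known and differ."""
    score = SequenceMatcher(None, query_title.lower(), cand.title.lower()).ratio()
    if query_year is not None and cand.year is not None and query_year != cand.year:
        score *= 0.8  # assumed penalty, chosen only for illustration
    return score


def match(query_title: str, index: List[Record], query_year: Optional[int] = None,
          n: int = 4, threshold: float = 0.7) -> Optional[Record]:
    """Run blocking, then return the best candidate only if it passes the score threshold."""
    candidates = block(query_title, index, n)
    best = max(candidates,
               key=lambda c: pairwise_score(query_title, query_year, c),
               default=None)
    if best is None or pairwise_score(query_title, query_year, best) < threshold:
        return None
    return best
```

In this sketch, passing `query_year` is what separates two records with identical titles; without it, both would receive the same score.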
Loading with CrossRef Torrent Academic dump (January 7, 2021, 120M records):
Indexing with CrossRef Torrent Academic dump (January 7, 2021, 120M records):
Current evaluation against 17015 raw references in PMC 1943 sample set (CRF model to parse the raw references prior to matching):
With BiLSTM-CRF_FEATURES model instead of CRF for parsing the raw references prior to matching:
Previous one with smaller index (2019, so in principle easier) was: