Update to Elastic7, all crossref dump format supported #61

Closed
wants to merge 7 commits into from

kermitt2
Owner

@kermitt2 kermitt2 commented Sep 5, 2021

Updates:

  • rename the Node.js module "matching" to "indexing", because it indexes and does not match anything :)
  • migrate the indexer and the lookup to Elasticsearch 7
  • support the CrossRef dump as a single file (.xz as distributed by GreeneLab, gzip, or uncompressed) or as a directory of gzip or uncompressed JSON files (Academic Torrents or Metadata Plus); files can be JSON Lines or a JSON array (indexer and lookup)
  • fix a potential index refresh bug
  • update to Gradle 7
  • add a nicer indexing progress "bar" with colors :)
  • update readme
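
To illustrate the dump formats listed above, here is a minimal sketch of a reader that accepts a single .xz, gzip, or uncompressed file, or a directory of such files, and handles both JSON Lines and JSON arrays. It is written in Python for brevity and is not the code of this PR (the indexer is a Node.js module and the lookup is Java). The function names `open_text`, `iter_records`, and `iter_dump` are hypothetical, and the format is guessed from the file extension and the first non-blank character, which is an assumption.

```python
import gzip
import json
import lzma
from pathlib import Path
from typing import Iterator


def open_text(path: Path):
    """Open a dump file as UTF-8 text, choosing the codec from the extension."""
    suffix = path.suffix.lower()
    if suffix == ".xz":
        return lzma.open(path, "rt", encoding="utf-8")
    if suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def iter_records(path: Path) -> Iterator[dict]:
    """Yield records from one file holding JSON Lines or a JSON array."""
    with open_text(path) as handle:
        # Skip leading blank lines to find the first meaningful line.
        first = handle.readline()
        while first and not first.strip():
            first = handle.readline()
        if first.lstrip().startswith("["):
            # JSON array: loaded in one go, which is simple but memory-hungry.
            # A real indexer would use a streaming parser for large files.
            yield from json.loads(first + handle.read())
            return
        # JSON Lines: one record per non-empty line.
        if first.strip():
            yield json.loads(first)
        for line in handle:
            if line.strip():
                yield json.loads(line)


def iter_dump(path) -> Iterator[dict]:
    """Yield records from a single dump file or from every file in a directory."""
    root = Path(path)
    if root.is_dir():
        files = sorted(p for p in root.iterdir() if p.is_file())
    else:
        files = [root]
    for file in files:
        yield from iter_records(file)
```

Calling `iter_dump("/path/to/dump")` then yields one record at a time, whichever packaging the dump uses; the path here is a placeholder.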

Indexing with the CrossRef Academic Torrents dump (January 7, 2021, 120M records):

  • 115,972,356 indexed records
  • around 6 h 30 min to index (while working on the same computer at the same time), 4797 records/s
  • 25.94GB index size
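
As a quick sanity check, the figures above can be related to each other with a few lines of arithmetic. This sketch assumes that the quoted rate is an average over the whole run and that "GB" means 10^9 bytes; neither is stated in the PR.

```python
# Figures quoted in the list above.
indexed_records = 115_972_356
records_per_second = 4797
index_size_gb = 25.94  # assumed to be decimal gigabytes (10**9 bytes)

# Duration implied by the record count and the average rate.
elapsed_seconds = indexed_records / records_per_second
hours, remainder = divmod(int(elapsed_seconds), 3600)
minutes = remainder // 60

# Average on-disk index footprint per record.
bytes_per_record = index_size_gb * 1e9 / indexed_records

print(f"{hours} h {minutes} min")  # → 6 h 42 min
print(f"{bytes_per_record:.0f} bytes per record")
```

The implied duration is a little above the "around 6 h 30 min" reported, which is consistent with a rounded estimate.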

I haven't found the courage to integrate the other changes from PRs #50 and #58 yet, but I will try.

@kermitt2
Owner Author

kermitt2 commented Sep 5, 2021

Note: I realized that with ES 7, to get the size of the ES index (the number of documents), we need to use the count API. By default the search API only reports a lower bound on the total, which is useless here (like "more than 10K" for the 160M documents).
-> this is fixed in PR #62.
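
The difference between the two APIs can be sketched as follows. This is an illustration in Python using only the standard library, not the fix made in PR #62. The helper names and the index name `crossref` used in the usage note are assumptions. The sketch relies on two documented Elasticsearch 7 behaviours: `GET /<index>/_count` returns a JSON body with an exact `count` field, and `hits.total` in a search response is an object with a `value` and a `relation`, where `"gte"` means the value is only a lower bound.

```python
import json
from urllib.request import urlopen


def count_url(base_url: str, index: str) -> str:
    """Build the URL of the Elasticsearch count API for one index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def parse_count(body: str) -> int:
    """Extract the exact document count from a _count response body."""
    return int(json.loads(body)["count"])


def describe_search_total(body: str) -> str:
    """Describe hits.total from an ES 7 _search response body.

    With default settings the total is tracked only up to 10,000 hits,
    after which the relation is "gte" and the value is a lower bound.
    """
    total = json.loads(body)["hits"]["total"]
    if total["relation"] == "gte":
        return f"at least {total['value']}"
    return f"exactly {total['value']}"


def count_documents(base_url: str, index: str) -> int:
    """Query a running Elasticsearch node for the exact size of an index."""
    with urlopen(count_url(base_url, index)) as response:
        return parse_count(response.read().decode("utf-8"))
```

Against a local node, `count_documents("http://localhost:9200", "crossref")` would return the exact number of indexed documents. Setting `track_total_hits` to `true` on a search request is another way to get an exact total, at some extra cost per query.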

@kermitt2
Owner Author

Follow-up in #66.

@kermitt2 kermitt2 closed this Sep 13, 2021