Fixed ID issue, updated node #50

karatekaneen · 2020-10-25T21:55:29Z

Main fix

When using the latest Crossref data dump instead of the one mentioned in the docs, the node utility crashes because the _id attribute is missing. I guess that attribute stems from the MongoDB used by the Crossref Extractor. This fix removes the preset ID and relies on Elasticsearch to create one. (This could possibly lead to other issues downstream, but I haven't found any when testing after removing the ID.)
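As a rough illustration of the idea (not the code in this PR), the sketch below builds a body for the Elasticsearch bulk API without a preset ID. The names DumpRecord and toBulkBody are hypothetical. The only Elasticsearch behaviour it relies on is that an index action without _id gets an auto-generated one.

```typescript
// Hypothetical record shape: any JSON object that may carry a MongoDB-style _id.
type DumpRecord = { _id?: unknown; [key: string]: unknown };

// Build the NDJSON body expected by the Elasticsearch bulk API.
// The action line carries no _id, so Elasticsearch assigns one itself.
function toBulkBody(records: DumpRecord[], indexName: string): string {
  const lines: string[] = [];
  for (const record of records) {
    // Work on a shallow copy so the caller's object is left untouched.
    const source: DumpRecord = { ...record };
    delete source._id;
    lines.push(JSON.stringify({ index: { _index: indexName } }));
    lines.push(JSON.stringify(source));
  }
  // Every line, including the last one, must end with a newline.
  return lines.map((line) => line + "\n").join("");
}
```

Passing the index name as a parameter also avoids hardcoding "crossref" in the helper.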

Other fixes

Updated from Node 10 to Node 14, because Node 10 is close to end-of-life
Replaced the deprecated elasticsearch client with the new, supported client
Removed the hardcoded index name "crossref" in a couple of places
Converted to TypeScript (poor man's testing :D). Let me know if you want pure JavaScript instead and I'll resubmit with the compiled JS code.
Refactored from callbacks to async/await for better readability


Aazhar · 2020-11-24T10:10:36Z

matching/src/main.ts

+  if (data.author?.length > 0) {
+    const { authors, firstAuthorName } = data.author.reduce(
+      (acc, author) => {
+        if (author.sequence === "first") {


I asked Crossref support about this, because the 'sequence' attribute is not available in the latest dump. Maybe consider the first author in the list as the first author too. The language attribute is also not available (maybe it is not relevant for you).

To be honest, I wasn't aware that the sequence attribute was missing, and maybe the order should be considered as well, although I'm not sure that would always yield the correct result. Do you have any more insight on this?

To build my current index, I considered either the sequence attribute or the order in the list. I think in most cases the list order represents the real order.
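The fallback discussed in this thread could be sketched roughly as follows. The Author interface and pickFirstAuthor are hypothetical names, and the sketch assumes that list order is an acceptable proxy when no entry is flagged with sequence "first".

```typescript
// Hypothetical subset of a Crossref author entry; only the fields used here.
interface Author {
  given?: string;
  family?: string;
  sequence?: string; // "first" or "additional" when the dump provides it
}

// Prefer the author explicitly flagged as first; otherwise fall back to list order.
function pickFirstAuthor(authors: Author[] | undefined): Author | undefined {
  if (!authors || authors.length === 0) {
    return undefined;
  }
  const flagged = authors.find((author) => author.sequence === "first");
  return flagged ?? authors[0];
}
```

With this shape, dumps that still carry the sequence attribute behave as before, and dumps without it use the first list entry.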

Aazhar · 2020-11-24T10:13:01Z

matching/src/main.ts

+      return;
+    }
+
+    delete doc._id;


To be able to synchronize incrementally with Crossref, I think the DOI could be used as the identifier. @kermitt2 @karatekaneen WDYT?

It sounds like a very simple solution that may help towards the incremental synchronization that I really long for. But I have a very limited view on how all this works in the rest of glutton.

So far biblio-glutton has no tool for synchronization, but I'm trying to work on something and I need to confirm the usage of the DOI as unique identifier (I have tried rebuilding the index using the DOI, and there seems to be no data loss).

Interesting, I've been meaning to look into that myself.
Do you mean that the internal LMDBs are not cleared out when you try to add a batch of fresh data? If so, using the DOI here would make a lot of sense.
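A minimal sketch of the DOI-as-identifier idea, assuming records expose the DOI under a DOI key. The names CrossrefRecord, IndexAction, normalizeDoi and indexActionFor are hypothetical. Lower-casing is a choice made here because DOI names are case-insensitive; it is not something this PR implements.

```typescript
// Hypothetical record shape: only the DOI field matters for this sketch.
type CrossrefRecord = { DOI?: string; [key: string]: unknown };
type IndexAction = { index: { _index: string; _id?: string } };

// DOI names are case-insensitive, so fold case to get one stable key per work.
function normalizeDoi(doi: string): string {
  return doi.trim().toLowerCase();
}

// Build a bulk "index" action keyed by DOI, so re-indexing the same work
// overwrites the existing document instead of creating a duplicate.
function indexActionFor(record: CrossrefRecord, indexName: string): IndexAction {
  const action: IndexAction = { index: { _index: indexName } };
  if (typeof record.DOI === "string" && record.DOI.trim() !== "") {
    action.index._id = normalizeDoi(record.DOI);
  }
  // Without a usable DOI the _id is left out and Elasticsearch generates one.
  return action;
}
```

Because an index action with an existing _id replaces the stored document, a later batch containing an updated record would update it in place.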

karatekaneen · 2021-02-09T14:32:48Z

Have you had a chance to look at this?

lfoppiano · 2021-02-09T22:58:40Z

@karatekaneen sorry, there's a lot going on and it's hard to follow all the projects. I will have a look within the next two weeks.

karatekaneen · 2021-02-10T10:12:17Z

No worries, I suspected that was the case so I just wanted to bump it to give a reminder. Thanks for the great stuff you guys release, it's super useful!

lfoppiano · 2021-03-29T00:31:17Z

btw, I haven't forgotten about this issue. I'm still downloading the dump, and it's taking me weeks 😭

lfoppiano · 2021-04-20T03:14:16Z

@karatekaneen I've downloaded the dump. Perhaps we could read the dump directly from the thousands of gzip files instead of expecting a single file. Did you uncompress them and re-compress them as a single JSON file?

Update: I'm going to implement the mechanism to read the dump directly from the gzipped files in a directory.

Update 2: OK, it's already implemented in #52.
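For illustration only (this is not the implementation from #52), reading a directory of gzipped files could look roughly like the sketch below. It assumes each *.json.gz file holds one JSON object with an "items" array; forEachDumpRecord is a hypothetical name. Only Node's built-in fs, path and zlib modules are used.

```typescript
import * as fs from "fs";
import * as path from "path";
import * as zlib from "zlib";

// Assumption: every *.json.gz file contains one JSON object shaped like { "items": [...] }.
// Files are handled one at a time, so only a single file is held in memory.
function forEachDumpRecord(
  dir: string,
  onRecord: (record: Record<string, unknown>) => void
): number {
  // Plain lexicographic sort; processing order is not significant here.
  const files = fs
    .readdirSync(dir)
    .filter((name) => name.endsWith(".json.gz"))
    .sort();
  let count = 0;
  for (const name of files) {
    const compressed = fs.readFileSync(path.join(dir, name));
    const parsed = JSON.parse(zlib.gunzipSync(compressed).toString("utf8"));
    const items: Record<string, unknown>[] = Array.isArray(parsed.items) ? parsed.items : [];
    for (const item of items) {
      onRecord(item);
      count += 1;
    }
  }
  return count;
}
```

The callback receives one record at a time, so the caller can batch records for indexing without decompressing the whole dump first.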

lfoppiano · 2021-04-22T08:11:05Z

@karatekaneen I'm having issues with starting up the indexing, the command:

node main -dump *PATH_TO_THE_CROSSREF_JSON_DUMP* index

doesn't seem to be valid anymore.
Will this also work with a directory containing the recent dump downloaded from the torrent?

karatekaneen · 2021-04-22T08:54:53Z

Unfortunately the details of how I wrote this have slipped my memory, so I will have to take a closer look this weekend or so. I don't think I added anything (such as support for multiple files) because I wanted to keep it as close to the original as possible. But I'm happy to modify it if requested.

However, I'm pretty sure I figured out that the files could be appended to each other (this was before I knew #52 existed), and I ran a util script that just appended all the files into one huge file.

For running the node util I think you can either:

npx ts-node -dump *PATH_TO_THE_CROSSREF_JSON_DUMP* index <- Run typescript without pre-compiling to JS
tsc && node main -dump *PATH_TO_THE_CROSSREF_JSON_DUMP* index <- Compile to JS-files and then run the main.js

lfoppiano · 2021-04-23T05:43:37Z

@karatekaneen thanks!
If you have time to update the documentation and support either a file or a directory as input, that would be great. I cannot really add modifications without creating a new PR.

lfoppiano · 2021-05-06T00:39:19Z

I tried to run the script without compiling it. I was expecting an error about the provided dump being a directory, but instead I get this error:

(base) Lucas-MacBook-Pro:matching lfoppiano$ npx ts-node index  -dump /Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01 
Error: Cannot find module '/Users/lfoppiano/development/projects/biblio-glutton/matching/index'
    at Function.Module._resolveFilename (node:internal/modules/cjs/loader:924:15)
    at Function.Module._load (node:internal/modules/cjs/loader:769:27)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
    at main (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/src/bin.ts:198:14)
    at Object.<anonymous> (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/src/bin.ts:288:3)
    at Module._compile (node:internal/modules/cjs/loader:1092:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1121:10)
    at Module.load (node:internal/modules/cjs/loader:972:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
(base) Lucas-MacBook-Pro:matching lfoppiano$ npx ts-node -dump /Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01  index
/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/arg/index.js:90
						throw err;
						^

Error: Unknown or unexpected option: -d
    at arg (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/arg/index.js:88:19)
    at main (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/dist/bin.js:15:67)
    at Object.<anonymous> (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/dist/bin.js:238:5)
    at Module._compile (node:internal/modules/cjs/loader:1092:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1121:10)
    at Module.load (node:internal/modules/cjs/loader:972:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
    at node:internal/main/run_main_module:17:47 {
  code: 'ARG_UNKNOWN_OPTION'
}

I have the following version of npx:

(base) Lucas-MacBook-Pro:matching lfoppiano$ npx --version
7.7.6

lfoppiano · 2021-05-18T07:25:03Z

I've made some progress on this task. Please follow up at #58.

karatekaneen added 3 commits October 25, 2020 22:38

Update to node 14, added typescript compiling

5e3c636

Fixded _id issue, converted to TS, updated elastic, refactored to async/await

ba3a325

Updated dependencies

16d5fd6

karatekaneen mentioned this pull request Nov 3, 2020

[WIP] Comply with the public crossref dump and some cleaning #51

Closed

Aazhar reviewed Nov 24, 2020


lfoppiano self-assigned this Feb 9, 2021

lfoppiano mentioned this pull request Mar 9, 2021

compliance with crossref public dump #52

Closed

lfoppiano added the enhancement label Mar 9, 2021

lfoppiano added a commit that referenced this pull request Apr 20, 2021

integrate PR #50 changes from @Aahzar

21645a7

lfoppiano mentioned this pull request Apr 20, 2021

Merge changes from #52 to #50 #53

Closed

lfoppiano mentioned this pull request May 18, 2021

Fixed ID Updated node #58

Closed

kermitt2 mentioned this pull request Sep 5, 2021

Update to Elastic7, all crossref dump format supported #61

Closed

Merge branch 'kermitt2:master' into master

a16ab7e

lfoppiano closed this Apr 26, 2022
