Fixed ID issue, updated node #50

Closed
wants to merge 4 commits into from

Conversation

karatekaneen
Contributor

Main fix

When using the latest Crossref data dump instead of the one mentioned in the docs, the node utility crashes because of the missing attribute _id, which I guess stems from the MongoDB used by the Crossref Extractor. This fix removes the preset ID and relies on one being created by Elasticsearch. (This could possibly lead to other issues downstream, but I haven't found any when testing after removing the ID.)
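
A minimal sketch of the idea behind the main fix, not the code in this PR: drop any preset `_id` and emit bulk actions without an ID so that Elasticsearch generates one. The names `CrossrefRecord`, `stripPresetId` and `toBulkBody` are hypothetical and introduced only for illustration; the action/document pairing follows the Elasticsearch bulk API.

```typescript
// Hypothetical record shape: only `_id` matters for this sketch.
type CrossrefRecord = { _id?: unknown; [key: string]: unknown };

// Return a copy of the record without the MongoDB-style `_id`.
function stripPresetId(record: CrossrefRecord): Record<string, unknown> {
  const { _id, ...rest } = record;
  return rest;
}

// Build the alternating action/document entries used by the bulk API.
// The action carries no `_id`, so Elasticsearch assigns one itself.
function toBulkBody(records: CrossrefRecord[], indexName: string): object[] {
  const body: object[] = [];
  for (const record of records) {
    body.push({ index: { _index: indexName } });
    body.push(stripPresetId(record));
  }
  return body;
}
```

The resulting array could be passed as the `body` of a bulk request. Passing the index name as a parameter also avoids hardcoding "crossref".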

Other fixes

  • Updated from Node 10 to Node 14 because Node 10 is close to end-of-life
  • Changed from the deprecated Elasticsearch client to the new, supported client
  • Fixed the index name "crossref" being hardcoded in a couple of places
  • Converted to TypeScript (poor man's testing :D). Let me know if you want pure JavaScript instead and I'll resubmit with the compiled JS code.
  • Refactored to async/await instead of callbacks for better readability

if (data.author?.length > 0) {
  const { authors, firstAuthorName } = data.author.reduce(
    (acc, author) => {
      if (author.sequence === "first") {
Contributor

I asked Crossref support about this, because the 'sequence' attribute is not available in the latest dump. Maybe consider the first author in the list as the first author too. The language attribute is not available either (maybe it is not relevant for you).

Contributor Author

I wasn't aware that the sequence attribute was missing to be honest and maybe the order should be considered as well. Although I'm not sure that that would always yield the correct result. Do you have any more insight on this?

Contributor

To build my current index, I have considered either the sequence attribute or the order in the list. I think in most cases the list order represents the real order.
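
The fallback discussed in this thread could look roughly like the sketch below: prefer an explicit `sequence === "first"` marker and otherwise take the first entry in the list. The `Author` interface and `pickFirstAuthor` are hypothetical names, and list order is only assumed to reflect the real author order.

```typescript
// Hypothetical, trimmed-down author shape; real records carry more fields.
interface Author {
  given?: string;
  family?: string;
  sequence?: string;
}

// Pick the first author: use the explicit marker when the dump has it,
// otherwise fall back to the first entry in the list.
function pickFirstAuthor(authors: Author[] | undefined): Author | undefined {
  if (!authors || authors.length === 0) {
    return undefined;
  }
  for (const author of authors) {
    if (author.sequence === "first") {
      return author;
    }
  }
  // Assumption: without `sequence`, list order reflects author order.
  return authors[0];
}
```

With this shape the same code path handles both older dumps that carry `sequence` and newer ones that do not.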

return;
}

delete doc._id;
Contributor

To be able to synchronize incrementally with Crossref, I think the DOI could be used as the identifier. @kermitt2 @karatekaneen WDYT?

Contributor Author

It sounds like a very simple solution that may help towards the incremental synchronization that I really long for. But I have a very limited view on how all this works in the rest of glutton.

Contributor

So far biblio-glutton has no tool to do synchronization, but I'm trying to work on something and I need to confirm the usage of the DOI as a unique identifier. (I have tried to rebuild the index using the DOI, and it seems there is no data loss.)

Contributor Author

Interesting, I've been meaning to look into that myself.
Do you mean that the internal LMDBs do not clear out when you try to add a batch of fresh data? If so, using the DOI here would make a lot of sense.
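
To illustrate the DOI-as-identifier idea from this thread, here is a hedged sketch that uses the DOI as the Elasticsearch `_id`, so that re-indexing the same record overwrites the earlier copy instead of creating a duplicate. `DoiRecord` and `toDoiKeyedBulkBody` are hypothetical names; lowercasing the DOI and skipping records without one are assumptions of this sketch, not decisions taken in the PR.

```typescript
// Hypothetical record shape for this sketch.
type DoiRecord = { DOI?: string; _id?: unknown; [key: string]: unknown };

// Build bulk entries keyed by DOI. An `index` action with an existing
// `_id` replaces the stored document, which makes reruns idempotent.
function toDoiKeyedBulkBody(records: DoiRecord[], indexName: string): object[] {
  const body: object[] = [];
  for (const record of records) {
    // Drop any preset `_id` from the document source itself.
    const { _id, ...source } = record;
    if (!source.DOI) {
      continue; // Assumption: records without a DOI are skipped.
    }
    // DOIs are case-insensitive, so normalise before using one as a key.
    body.push({ index: { _index: indexName, _id: source.DOI.toLowerCase() } });
    body.push(source);
  }
  return body;
}
```

Two versions of the same record then map to the same document ID, which is the property an incremental synchronization would rely on.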

@karatekaneen
Contributor Author

Had any chance to look at this?

@lfoppiano
Collaborator

@karatekaneen sorry, there is a lot going on and it's hard to follow all the projects. I will have a look within the next two weeks.

@lfoppiano lfoppiano self-assigned this Feb 9, 2021
@karatekaneen
Contributor Author

No worries, I suspected that was the case so I just wanted to bump it to give a reminder. Thanks for the great stuff you guys release, it's super useful!

@lfoppiano
Collaborator

btw, I haven't forgotten about this issue. I'm still downloading the dump, and it is taking me weeks 😭

@lfoppiano
Collaborator

lfoppiano commented Apr 20, 2021

@karatekaneen I've downloaded the dump. Perhaps we could read the dump directly from the thousands of gzip files instead of expecting a single file. Did you uncompress them and re-compress them as a single JSON file?

Update: I'm going to implement the mechanism to read the dump directly from the compressed gzipped files in a directory

Update 2: OK, it's already implemented in #52
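
For illustration only, and not the implementation in #52: a sketch of reading records directly from a directory of gzipped files using Node's built-in `fs`, `path` and `zlib` modules. It assumes each file is small enough to decompress in memory and holds either a JSON array or an object with an `items` array; `forEachRecord` and that file layout are assumptions of this sketch.

```typescript
import * as fs from "fs";
import * as path from "path";
import * as zlib from "zlib";

// Call `handle` for every record found in the .gz files of `dir`
// and return the number of records seen.
function forEachRecord(
  dir: string,
  handle: (record: Record<string, unknown>) => void
): number {
  // Sort the file names so that runs are reproducible.
  const names = fs.readdirSync(dir).filter((name) => name.endsWith(".gz")).sort();
  let count = 0;
  for (const name of names) {
    // Assumption: one file fits comfortably in memory once decompressed.
    const raw = zlib.gunzipSync(fs.readFileSync(path.join(dir, name))).toString("utf8");
    const parsed = JSON.parse(raw);
    // Assumption: records sit in a top-level array or under `items`.
    const items: unknown[] = Array.isArray(parsed) ? parsed : parsed.items ?? [];
    for (const item of items) {
      handle(item as Record<string, unknown>);
      count += 1;
    }
  }
  return count;
}
```

Reading file by file avoids concatenating the whole dump into one large file first. A streaming parser would be the safer choice if individual files turn out to be large.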

lfoppiano added a commit that referenced this pull request Apr 20, 2021
@lfoppiano
Collaborator

@karatekaneen I'm having issues with starting up the indexing, the command:

node main -dump *PATH_TO_THE_CROSSREF_JSON_DUMP* index

doesn't seem to be valid anymore.
Will this also work with a directory containing the recent dump downloaded from the torrent?

@karatekaneen
Contributor Author

karatekaneen commented Apr 22, 2021

Unfortunately my memory has released all the details of how I wrote this, so I will have to take a closer look this weekend or so. I think that I didn't add anything (such as multiple files etc.) because I wanted to keep it as close to the original as possible. But I'm happy to modify it if requested.

But I'm pretty sure that I figured out that the files could be appended to each other (this was before I knew #52 existed), and I ran a util script that just appended all the files into one huge file.

For running the node util I think you can either:

  • npx ts-node -dump *PATH_TO_THE_CROSSREF_JSON_DUMP* index <- Run typescript without pre-compiling to JS
  • tsc && node main -dump *PATH_TO_THE_CROSSREF_JSON_DUMP* index <- Compile to JS-files and then run the main.js

@lfoppiano
Collaborator

@karatekaneen thanks!
If you have time to update the documentation and support either a file or a directory as input, that would be great. I cannot really add modifications without creating a new PR.

@lfoppiano
Collaborator

I've tried to run the script without the compilation. I was expecting an error about the provided dump being a directory, but instead I get this error:

(base) Lucas-MacBook-Pro:matching lfoppiano$ npx ts-node index  -dump /Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01 
Error: Cannot find module '/Users/lfoppiano/development/projects/biblio-glutton/matching/index'
    at Function.Module._resolveFilename (node:internal/modules/cjs/loader:924:15)
    at Function.Module._load (node:internal/modules/cjs/loader:769:27)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
    at main (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/src/bin.ts:198:14)
    at Object.<anonymous> (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/src/bin.ts:288:3)
    at Module._compile (node:internal/modules/cjs/loader:1092:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1121:10)
    at Module.load (node:internal/modules/cjs/loader:972:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
(base) Lucas-MacBook-Pro:matching lfoppiano$ npx ts-node -dump /Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01  index
/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/arg/index.js:90
						throw err;
						^

Error: Unknown or unexpected option: -d
    at arg (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/arg/index.js:88:19)
    at main (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/dist/bin.js:15:67)
    at Object.<anonymous> (/Users/lfoppiano/development/projects/biblio-glutton/matching/node_modules/ts-node/dist/bin.js:238:5)
    at Module._compile (node:internal/modules/cjs/loader:1092:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1121:10)
    at Module.load (node:internal/modules/cjs/loader:972:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
    at node:internal/main/run_main_module:17:47 {
  code: 'ARG_UNKNOWN_OPTION'
}

I have the following version of npx:

(base) Lucas-MacBook-Pro:matching lfoppiano$ npx --version
7.7.6

@lfoppiano lfoppiano mentioned this pull request May 18, 2021
@lfoppiano
Collaborator

I've managed to get some progress on this task. Please follow up at #58

@lfoppiano lfoppiano closed this Apr 26, 2022
3 participants