Fixed ID issue, updated node #50
Conversation
if (data.author?.length > 0) {
  const { authors, firstAuthorName } = data.author.reduce(
    (acc, author) => {
      if (author.sequence === "first") {
I asked Crossref support about this, because the 'sequence' attribute is not available in the latest dump; maybe consider the first author in the list as "first" as well. The language attribute is also not available (maybe that is not relevant for you..)
To be honest, I wasn't aware that the sequence attribute was missing, and maybe the list order should be considered as well, although I'm not sure that would always yield the correct result. Do you have any more insight on this?
To build my current index, I have considered either the sequence attribute or the order in the list; I think in most cases the list order represents the real author order..
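For illustration, a minimal sketch of that fallback, assuming the reduce callback shown above also receives the element index (the accumulator shape and the author fields are assumptions, not the utility's actual code):

```javascript
// Sketch only: fall back to list order when the Crossref "sequence"
// attribute is missing from the dump.
if (data.author?.length > 0) {
  const { authors, firstAuthorName } = data.author.reduce(
    (acc, author, index) => {
      // An author counts as "first" via the sequence attribute or, when that
      // attribute is absent, via position 0 in the list.
      const isFirst =
        author.sequence === "first" || (author.sequence == null && index === 0);
      if (isFirst && author.family) {
        acc.firstAuthorName = author.family; // assumed Crossref record field
      }
      acc.authors.push(author);
      return acc;
    },
    { authors: [], firstAuthorName: null }
  );
  // authors / firstAuthorName would then be used as in the original snippet.
}
```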
      return;
}

delete doc._id;
To be able to synchronize incrementally with Crossref, I think the DOI could be used as the identifier. @kermitt2 @karatekaneen WDYT?
It sounds like a very simple solution that may help towards the incremental synchronization I'm really longing for, but I have a very limited view of how all this works in the rest of glutton.
So far biblio-glutton has no tool to do synchronization, but I'm trying to work on something and I need to confirm the usage of the DOI as a unique identifier (I have tried to rebuild the index using the DOI, and it seems there is no data loss..)
Interesting, I've been meaning to look into that myself.
Do you mean that the internal LMDBs don't get cleared out when you try to add a batch of fresh data? If so, using the DOI here would make a lot of sense.
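For illustration, a minimal sketch of the DOI-as-identifier idea, assuming the official @elastic/elasticsearch client (the client setup, index name, and helper function are assumptions, not the utility's actual code):

```javascript
// Sketch only: use the DOI as the Elasticsearch document id instead of the
// MongoDB-style _id, so re-indexing the same record overwrites it in place.
const { Client } = require("@elastic/elasticsearch"); // assumed client library
const client = new Client({ node: "http://localhost:9200" });

async function indexRecord(doc) {
  delete doc._id; // drop the MongoDB identifier if it is present at all
  await client.index({
    index: "crossref", // hypothetical index name
    id: doc.DOI,       // the DOI becomes the stable, de-duplicating identifier
    body: doc,
  });
}
```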
Have you had any chance to look at this?
@karatekaneen sorry, there's a lot of stuff going on and it's hard to follow all the projects. I will have a look within the next two weeks.
No worries, I suspected that was the case so I just wanted to bump it to give a reminder. Thanks for the great stuff you guys release, it's super useful!
btw, I haven't forgotten about this issue. I'm still downloading the dump, and it is taking me weeks 😭
@karatekaneen I've downloaded the dump. Perhaps we could read the dump directly from the thousands of gzip files instead of expecting a single file. Did you uncompress them and re-compress them as a single JSON file?

Update: I'm going to implement the mechanism to read the dump directly from the compressed gzip files in a directory.

Update 2: OK, it's already implemented in #52.
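For reference, a minimal sketch of reading the dump directly from a directory of gzipped files with Node's built-in modules, assuming one JSON record per line (the directory path, callback, and record format are assumptions; the actual mechanism is the one implemented in #52):

```javascript
// Sketch only: stream every *.gz file in the dump directory and parse it
// line by line, without requiring a single concatenated dump file.
const fs = require("fs");
const path = require("path");
const zlib = require("zlib");
const readline = require("readline");

async function readDump(dumpDir, onRecord) {
  const files = fs.readdirSync(dumpDir).filter((f) => f.endsWith(".gz"));
  for (const file of files) {
    const gunzip = fs
      .createReadStream(path.join(dumpDir, file))
      .pipe(zlib.createGunzip());
    const rl = readline.createInterface({ input: gunzip, crlfDelay: Infinity });
    for await (const line of rl) {
      // Assumes newline-delimited JSON; if each file wraps records in an
      // "items" array instead, the parsing step would differ.
      if (line.trim().length > 0) onRecord(JSON.parse(line));
    }
  }
}
```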
@karatekaneen I'm having issues with starting up the indexing; the command:

doesn't seem to be valid anymore.
Unfortunately my memory has released all the details of how I wrote this, so I will have to take a closer look this weekend or so. I think I didn't add anything (such as multiple-file support) because I wanted to keep it as close to the original as possible, but I'm happy to modify it if requested. I'm pretty sure I figured out that the files could be appended to each other (this was before #52 existed in my mind), and I ran a util script that just appended all the files into one huge file (a rough sketch of that approach is included below). For running the node util I think you can either:
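As mentioned above, a rough sketch of the file-appending approach, assuming each dump part is a gzipped JSON file; this is an illustration, not the actual util script:

```javascript
// Sketch only: decompress every *.gz dump part and append the contents to a
// single output file, roughly what the util script described above did.
const fs = require("fs");
const path = require("path");
const zlib = require("zlib");

function appendDumpParts(dumpDir, outFile) {
  const out = fs.createWriteStream(outFile);
  const files = fs.readdirSync(dumpDir).filter((f) => f.endsWith(".gz")).sort();
  for (const file of files) {
    // gunzipSync keeps the sketch short; a streaming pipeline would be
    // preferable for a dump of this size.
    out.write(zlib.gunzipSync(fs.readFileSync(path.join(dumpDir, file))));
  }
  out.end();
}
```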
@karatekaneen thanks!
I've tried to run the script without the compilation. I was expecting to get an error about the fact that the provided dump is a directory, but instead I get this error:
I have the following version of npx:
I've managed to make some progress on this task. Please follow up at #58.
Main fix
When using the latest Crossref data dump instead of the one mentioned in the docs, the node utility crashes because of the missing attribute
_id
which I guess stems from the MongoDB instance used in the Crossref Extractor. This fix removes the preset ID and relies on one being created by Elasticsearch. (This can possibly lead to other issues downstream, but I haven't found any issues in testing after removing the ID.)

Other fixes