Reserving Datacite DOIs asynchronously #6980
Additional context from now-closed-as-duplicate #7019:

In #5093/#6901 we added support for reserving DOIs with Datacite. There may be some steps to get DOIs into a reserved state, and we want to make sure that all drafts remain publishable.

Edit: There are definitely some steps needed to reserve the unreserved Datacite DOIs (that's the API described above). None of the existing unpublished draft datasets have reserved DOIs yet (none of them exist in the db on the Datacite side). They all need to be reserved. The only ambiguity is how urgent it is to get them reserved after upgrading to v5; specifically, whether the unpublished drafts really are required to "remain publishable" (more below).

We also don't want Dataverse to have old drafts marked as reserved when they are not actually reserved.

Edit: It sounds like there is no additional "reserved" marking. What serves as the indicator that the DOI has been reserved is a non-null globalidcreatetime (per @pdurbin). So the question is whether unpublished drafts have this timestamp populated. If they do, they will appear "reserved" (and publishable) even though they are not reserved. If they don't, they will become unpublishable after the v5 upgrade. A quick check of the prod. db shows 4K local (non-harvested) datasets with null publicationDate (presumably, most are unpublished drafts) AND null globalidcreatetime. These include a ton of recently created drafts; so yes, it does look like a) drafts are not currently getting the timestamp populated on create and b) these drafts will indeed become unpublishable as soon as we upgrade.

There should be a step about running an API to mark drafts as reserved (whether a single time or on a regular cron).

Edit: The previous paragraphs were about handling the existing unpublished drafts after the upgrade. That's a one-time task.
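For anyone who wants to repeat the "quick check" on their own installation, here is a rough sketch of the condition being counted: local (non-harvested) datasets with both a null publication date and a null globalidcreatetime. It runs against a throwaway in-memory SQLite table; the table name `dataset_demo`, the `harvested` flag and the sample identifiers are made up for illustration and are not the real Dataverse schema, so the query would need adapting before being run against an actual database.

```python
import sqlite3

# Toy table standing in for the real schema (ASSUMPTION: names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dataset_demo (
           identifier TEXT,
           harvested INTEGER,
           publicationdate TEXT,
           globalidcreatetime TEXT
       )"""
)
conn.executemany(
    "INSERT INTO dataset_demo VALUES (?, ?, ?, ?)",
    [
        ("FK2/AAA", 0, None, None),                  # local draft, never reserved
        ("FK2/BBB", 0, None, "2020-06-01"),          # local draft, timestamp set
        ("FK2/CCC", 0, "2020-05-01", "2020-05-01"),  # published dataset
        ("FK2/DDD", 1, None, None),                  # harvested, out of scope
    ],
)

# Drafts that would stop being publishable: unpublished, local, no timestamp.
unreserved = [
    row[0]
    for row in conn.execute(
        """SELECT identifier FROM dataset_demo
           WHERE harvested = 0
             AND publicationdate IS NULL
             AND globalidcreatetime IS NULL
           ORDER BY identifier"""
    )
]
print(unreserved)  # ['FK2/AAA']
```

Only the first row matches, which mirrors the reasoning above: a draft with the timestamp populated would look reserved whether or not it really is.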
@scolapasta Thanks for discussing and estimating - is the plan to handle this through a release note or through code?
Release note.
Another thing we can recommend - and should definitely do in our prod. - would be to add a few extra words to the warning banner about the upcoming upgrade; something along the lines of "This Dataverse will be down between ... and ... for a planned upgrade. Please note that users may experience temporary problems publishing previously unpublished drafts immediately after the upgrade".
I've tested the case where we have (in v4.20) an unpublished draft with a null GlobalIdCreateTime that DOES exist on the Datacite side - what happens when we run the new reserve API on it? A real example of an unpublished draft with a registered DOI: 10.7910/DVN/CAL9UW

I was afraid/expecting that the same exact exchange was going to happen when we try to reserve the DOI for a case like that, resulting in an unnecessary DOI change again. I was preparing to address this by recommending that admins review their unpublished draft DOIs, find such cases and, if any exist, mark them as "reserved" before running the reserve API on the remaining "real unreserved" ones.

However, testing (with both test and prod. authorities) has confirmed that the reserve API simply proceeds to update the existing record; nothing fails, the fact that it exists is not a problem - and the dataset ends up with the same, existing DOI, which is now "reserved" in the Dataverse db as well. So this is good - fewer steps to document and perform after the upgrade; it also sounds like starting with v5.0 there will not be a realistically possible case where the DOI will actually have to change before the dataset is published. (yay!)

But this brings up a question - is this the behavior we want? Do note that if this were a genuine collision (extremely unlikely - but possible! somebody using the same Datacite account went and registered this very identifier for something on another site) - it would not be detected. Are we ok with this? (Even if we are not entirely ok with this, the condition is probably too unlikely to bother changing it before v5.0, but we could discuss changing it further down the road. Or not?)
I'm fine with the current behavior. If I'm a sysadmin running a few systems that share a DOI authority, I can always reduce the chance of a conflict by giving a unique shoulder to each system rather than hoping that the various systems play nicely together. And like you say above, we can always revisit this post 5.0 if there's trouble.
The reserve code could check to see if the DOI exists before trying to create it (with a call that creates or just updates the metadata if the DOI exists), but I wouldn't hold up 5 for that - as Phil says it's probably not a good practice to share authority/shoulder between systems, and I don't think many installations are sharing, and the collision chance should be minimal (unless Dataverse is coming back with the same DOI after a failure as has been the case with the publication collisions we've seen).
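To make the "check before create" idea concrete, here is a minimal sketch against an in-memory stand-in for the registry. Everything in it is hypothetical: `FakeRegistry`, `reserve_with_check` and `DoiCollisionError` are illustrative names, not Dataverse or Datacite code, and treating "same DOI, different target URL" as the sign of a collision is an assumption about how one might tell a foreign record from our own.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class DoiCollisionError(Exception):
    """Raised when a DOI already exists and points somewhere else."""


@dataclass
class FakeRegistry:
    """In-memory stand-in for a DOI registry (hypothetical, not a real client)."""

    records: Dict[str, str] = field(default_factory=dict)  # doi -> target URL

    def lookup(self, doi: str) -> Optional[str]:
        return self.records.get(doi)

    def upsert(self, doi: str, target_url: str) -> None:
        self.records[doi] = target_url


def reserve_with_check(registry: FakeRegistry, doi: str, target_url: str) -> str:
    """Reserve a DOI, refusing to overwrite a record that targets another site."""
    existing = registry.lookup(doi)
    if existing is not None and existing != target_url:
        # ASSUMPTION: a differing target URL means someone else owns the record.
        raise DoiCollisionError(f"{doi} already points to {existing}")
    registry.upsert(doi, target_url)
    return "updated" if existing is not None else "created"
```

With this shape, the case tested above (the draft's own DOI already exists on the registry side) would still go through as an update, while a record created by another system sharing the same authority/shoulder would be reported instead of silently overwritten.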
@qqmyers @pdurbin
@djbrooke - this task mentions "asynchronously", but I don't know that it was intended to address making the reservation of DOIs, and publicizing them at publication time, asynchronous, i.e. completed outside the save/publish call itself. Is that something to discuss? Or to create a separate issue for? I'm mostly thinking about the time required to register/publicize large numbers of file DOIs, which the user has to wait through.
This may have slipped through the cracks in the original PR #6901: We need one of the following: an automated way to reserve Datacite DOIs, after the fact (i.e. asynchronously), for draft datasets whose identifiers have not been reserved yet; or a release note explaining that this needs to be done.
This is for the DOIs that for one reason or another have failed to get reserved on create (starting with v5) and, most importantly, for any draft datasets in Datacite-using Dataverses that were created before v5, since starting with v5 they will not be publishable until reserved.
We have discussed having an automatic timer job for this, and it would not be hard to implement. But it is not necessary, strictly speaking, since we have the API to list the unreserved DOIs and the API to reserve them. So it is something that an admin could run once in a while, or script together as a cron job. However, if we are not providing a timer, we at least need to add a release note explaining that any Datacite-using installation needs to do this for their existing drafts once v5 is deployed.
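As a rough illustration of what "script together a cron job" could look like, here is a sketch that lists the unreserved PIDs and then reserves them one by one, collecting failures rather than stopping at the first one. The endpoint paths, the `X-Dataverse-key` header and the JSON shape of the listing are assumptions on my part (they are not spelled out in this thread) and should be checked against the API guide for the installed version; the function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List, Tuple

# ASSUMPTION: paths, auth header and response shape are not confirmed by this
# thread; verify them against the API guide before relying on this.
UNRESERVED_PATH = "/api/pids/unreserved"
RESERVE_PATH = "/api/pids/:persistentId/reserve"


def reserve_url(server_url: str, pid: str) -> str:
    """Build the reserve URL for one PID, percent-encoding the identifier."""
    query = urllib.parse.urlencode({"persistentId": pid})
    return f"{server_url.rstrip('/')}{RESERVE_PATH}?{query}"


def pids_from_listing(payload: dict) -> List[str]:
    """Extract PIDs from the listing response (assumed shape: data.items[].pid)."""
    return [item["pid"] for item in payload.get("data", {}).get("items", [])]


def call_api(url: str, api_token: str, method: str = "GET") -> dict:
    """Send one authenticated request and decode the JSON body."""
    request = urllib.request.Request(
        url, method=method, headers={"X-Dataverse-key": api_token}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def reserve_all(
    pids: Iterable[str], reserve_one: Callable[[str], object]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Try to reserve every PID; return (reserved, failed-with-reason)."""
    reserved: List[str] = []
    failed: List[Tuple[str, str]] = []
    for pid in pids:
        try:
            reserve_one(pid)
            reserved.append(pid)
        except Exception as exc:  # keep going so one bad DOI does not block the rest
            failed.append((pid, str(exc)))
    return reserved, failed


def run_once(server_url: str, api_token: str):
    """One pass: list unreserved PIDs, then reserve each of them."""
    listing = call_api(f"{server_url.rstrip('/')}{UNRESERVED_PATH}", api_token)
    return reserve_all(
        pids_from_listing(listing),
        lambda pid: call_api(reserve_url(server_url, pid), api_token, "POST"),
    )
```

An admin could call `run_once(server_url, api_token)` from a small wrapper invoked by cron and log the `failed` list, so that DOIs which did not get reserved on one pass are simply retried on the next.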
Apologies if I'm missing something and this is in fact explained somewhere.