
Reserving Datacite DOIs asynchronously #6980

Closed
landreev opened this issue Jun 11, 2020 · 10 comments · Fixed by #7142
@landreev
Contributor

landreev commented Jun 11, 2020

This may have slipped through the cracks in the original PR #6901: we need one of the following. Either an automated way to reserve Datacite DOIs, after the fact (i.e. asynchronously), for draft datasets whose identifiers have not been reserved yet; or a release note explaining that this needs to be done manually.
This applies to the DOIs that for one reason or another failed to get reserved on create (starting with v5). And, most importantly, to any draft datasets in Datacite-using Dataverses that were created before v5. Starting with v5, these will not be publishable until their DOIs are reserved.

We have discussed having an automatic timer job for this, and it would not be hard to implement. Strictly speaking, though, it is not necessary, since we have an API to list the unreserved DOIs and an API to reserve them. So it is something an admin could run once in a while, or wire into a cron job. However, if we are not providing a timer, we at least need to add a release note explaining that any Datacite-using installation needs to do this for its existing drafts once v5 is deployed.
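For reference, a sketch of what such an admin script or cron job could look like, assuming the v5 admin endpoints for listing unreserved PIDs (`/api/pids/unreserved`) and reserving one (`/api/pids/:persistentId/reserve`), a superuser API token, and `jq`; the exact endpoint paths and response shape should be verified against the Native API guide for the deployed version:

```shell
#!/bin/bash
# Sketch: reserve Datacite DOIs for all draft datasets that don't have one yet.
# SERVER_URL and API_TOKEN (a superuser token) are assumptions set by the admin.
SERVER_URL="${SERVER_URL:-https://dataverse.example.edu}"

# List the PIDs that have not been reserved; adjust the jq filter to the
# actual JSON shape your version returns.
curl -s -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/pids/unreserved" \
  | jq -r '.data.pids[]?' \
  | while read -r pid; do
      echo "Reserving $pid ..."
      curl -s -X POST -H "X-Dataverse-key:$API_TOKEN" \
        "$SERVER_URL/api/pids/:persistentId/reserve?persistentId=$pid"
      echo
    done
```

Run once after the upgrade for the existing drafts, and (optionally) on a schedule afterwards.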

Apologies if I'm missing something and this is in fact explained somewhere.

@djbrooke
Contributor

djbrooke commented Jun 25, 2020

Additional context from now-closed-as-duplicate #7019:

In #5093/#6901 we added support for reserving DOIs with Datacite.

There may be some steps to get DOIs into a reserved state, and we want to make sure that all drafts remain publishable.

Edit: there are definitely steps needed to reserve the unreserved Datacite DOIs (that's the API described above). None of the existing unpublished draft datasets have reserved DOIs yet (none of them exist in the db on the Datacite side); they all need to be reserved. The only ambiguity is how urgent it is to get them reserved after upgrading to v5. Specifically, whether it is indeed required for the unpublished drafts to "remain publishable" (more below).

We also don't want Dataverse to have old drafts marked as reserved when they are not actually reserved.

Edit: It sounds like there is no additional "reserved" marking. What serves as the indicator of the DOI having been reserved is a non-null globalidcreatetime (per @pdurbin). So the question is: do unpublished drafts have this timestamp populated or not? If the former, they will appear "reserved" (and publishable) even though they are not actually reserved; if the latter, they will become unpublishable after the v5 upgrade. On a quick check of the prod. db, there are 4K local (non-harvested) datasets with null publicationDate (presumably, most are unpublished drafts) AND null globalidcreatetime. These include a ton of recently created drafts; so yes, it does look like a) drafts are not currently getting the timestamp populated on create, and b) these drafts will indeed become unpublishable as soon as we upgrade.
There are also 3K datasets with null publicationDate AND non-null globalidcreatetime... not sure what's up with these. (None of these are super recent; there's one from 2019, then a whole bunch from 2018 and earlier. The answer may be that some of these were drafts created back under EZID, and some are even more ancient datasets created when we were using handles?)
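For anyone wanting to reproduce that check, a query along these lines should work (a sketch only; the database name and the exact table/column layout - e.g., whether publicationdate and globalidcreatetime live on dvobject or dataset - vary by schema version, so verify against your own database):

```shell
# Sketch: count local (non-harvested), unpublished datasets whose global id
# has never been reserved/registered. Table and column names are assumptions.
psql dvndb -c "
  SELECT count(*)
    FROM dataset ds
    JOIN dvobject dv ON dv.id = ds.id
   WHERE ds.harvestingclient_id IS NULL
     AND dv.publicationdate IS NULL
     AND dv.globalidcreatetime IS NULL;"
```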

There should be a step about running an API to mark drafts as reserved (whether a single time or on a regular cron)

Edit: The previous paragraphs were about handling the existing unpublished drafts after the upgrade. That's a one-time task.
There is also the question of handling this going forward, because some DOIs may fail to get reserved on create (because Datacite is down, for example), so these will need to be addressed regularly. It likely won't be a frequent condition, but we need to tell admins that they'll need to either set up a cron job, or do it by hand once in a while, or keep an eye out for users complaining about the publish button being blocked and then attend to it.
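For the "going forward" case, the cron option could be as simple as a nightly entry. A crontab sketch, where the script path and log location are hypothetical and the script itself just wraps the list-unreserved and reserve API calls discussed above:

```shell
# Crontab sketch: nightly sweep to reserve any DOIs that failed to reserve
# on create (e.g., because Datacite was down at the time).
# m h dom mon dow  command
0 3 * * *  /usr/local/bin/reserve-unreserved-dois.sh >> /var/log/dataverse/reserve-dois.log 2>&1
```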

@sekmiller sekmiller moved this from Up Next 🛎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 1, 2020
@sekmiller sekmiller assigned sekmiller and unassigned sekmiller Jul 1, 2020
@sekmiller sekmiller moved this from IQSS Team - In Progress 💻 to Up Next 🛎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 6, 2020
@djbrooke
Contributor

@scolapasta Thanks for discussing and estimating - is the plan to handle this through a release note or through code?

@scolapasta
Contributor

Release note.

@landreev
Contributor Author

Another thing we can recommend - and should definitely do in our prod. - would be to add a few extra words to the warning banner about the upcoming upgrade; something along the lines of "This Dataverse will be down between ... and ... for a planned upgrade. Please note that users may experience temporary problems publishing previously unpublished drafts immediately after the upgrade".
We can keep a similar warning up after the upgrade as well, until all the unpublished DOIs are reserved.

@landreev
Contributor Author

I've tested the case where (in v4.20) we have an unpublished draft with a null GlobalIdCreateTime that DOES exist on the Datacite side - what happens when we run the new reserve API on it?
This case would usually be the result of a previous attempt to publish the dataset that failed (failed on our end, but not before the dataset got registered with Datacite). All the real-life cases we've had in production where the DOI would suddenly change during publishing were due to this scenario: on publish, Dataverse would think it was registering a new identifier, but Datacite would respond "already exists", so Dataverse would pick and register a new DOI.

A real example of an unpublished draft with a registered DOI: 10.7910/DVN/CAL9UW

I was afraid/expecting that the same exact exchange was going to happen when we try to reserve the DOI in a case like that, resulting in an unnecessary DOI change again. I was preparing to address this by recommending that admins review their unpublished draft DOIs, find such cases and, if any exist, mark them as "reserved" before running the reserve API on the remaining, genuinely unreserved ones.

However, testing (with both a test and a production authority) has confirmed that the reserve API simply proceeds to update the existing record; nothing fails, the fact that it exists is not a problem, and the dataset ends up with the same, existing DOI, now marked "reserved" in the Dataverse db as well.

So this is good - fewer steps to document and perform after the upgrade. It also sounds like, starting with v5.0, there will not be a realistically possible case where the DOI actually has to change before the dataset is published. (yay!)

But this brings up a question: is this the behavior we want? Do note that if this were a genuine collision (extremely unlikely, but possible! - somebody using the same Datacite account went and registered this very identifier for something on another site), it would not be detected. Are we ok with this?

(Even if we are not entirely ok with this, the condition is probably too unlikely to bother changing before v5.0. But we could discuss changing it further down the road. Or not?)

@pdurbin
Member

pdurbin commented Jul 29, 2020

I'm fine with the current behavior. If I'm a sysadmin running a few systems that share a DOI authority, I can always reduce the chance of a conflict by giving a unique shoulder to each system rather than hoping that the various systems play nicely together. And like you say above, we can always revisit this post-5.0 if there's trouble.

@qqmyers
Member

qqmyers commented Jul 29, 2020

The reserve code could check whether the DOI exists before trying to create it (with a call that creates it, or just updates the metadata if the DOI already exists), but I wouldn't hold up 5.0 for that. As Phil says, it's probably not good practice to share an authority/shoulder between systems, and I don't think many installations are sharing, and the collision chance should be minimal (unless Dataverse is coming back with the same DOI after a failure, as has been the case with the publication collisions we've seen).

@landreev
Contributor Author

@qqmyers @pdurbin
Agreed, I don't think we should worry about this as a 5.0 issue.
I'm mostly wondering if it's worth opening a new issue. A reasonable check could be something like this: if the DOI exists on the Datacite end AND its URL points somewhere other than this Dataverse, then it's likely an actual conflict (and then we could generate a new id, or just give up and let the admin handle it). Otherwise, just proceed to update it.
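That check could be prototyped against the public Datacite REST API (GET https://api.datacite.org/dois/{doi} returns the registered attributes, including the url). A sketch, assuming `jq` is available and that the caller supplies the expected target URL; the DOI shown is the example from the earlier comment:

```shell
#!/bin/bash
# Sketch of the proposed conflict check for a single DOI.
# DOI and EXPECTED_URL are assumptions supplied by the caller.
DOI="10.7910/DVN/CAL9UW"
EXPECTED_URL="https://dataverse.example.edu/dataset.xhtml?persistentId=doi:$DOI"

status=$(curl -s -o /tmp/doi.json -w '%{http_code}' "https://api.datacite.org/dois/$DOI")
if [ "$status" = "404" ]; then
  echo "DOI not found at Datacite: safe to reserve."
else
  url=$(jq -r '.data.attributes.url // empty' /tmp/doi.json)
  if [ -n "$url" ] && [ "$url" != "$EXPECTED_URL" ]; then
    echo "Likely genuine collision: DOI already points at $url"
  else
    echo "DOI exists but points at this installation (or has no URL yet): proceed to update."
  fi
fi
```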

landreev added a commit that referenced this issue Jul 29, 2020
@qqmyers
Member

qqmyers commented Nov 4, 2020

@djbrooke - this task mentions "asynchronously", but I don't know that it was intended to address making the reservation of DOIs and the publicizing of them at publication time asynchronous, i.e. completed outside the save/publish call itself. Is that something to discuss, or to create a separate issue for? I'm mostly thinking about the time required to register/publicize large numbers of file DOIs, which the user currently has to wait through.

@djbrooke
Contributor

djbrooke commented Nov 4, 2020

@qqmyers thanks (and thanks @kcondon for discussing separately) - yes, if we can create a different issue that'd be great. If you have some ideas it'd be great for you to create it, or I can - let me know.
