
Reserving Datacite DOIs asynchronously #6980

Closed
landreev opened this issue Jun 11, 2020 · 10 comments · Fixed by #7142
@landreev
Contributor

landreev commented Jun 11, 2020

This may have slipped through the cracks in the original PR #6901: we need one of the following. Either an automated way to reserve Datacite DOIs, after the fact (i.e. asynchronously), for draft datasets whose identifiers have not been reserved yet; or a release note explaining that this needs to be done manually.
This applies to the DOIs that for one reason or another failed to get reserved on create (starting with v5). And, most importantly, to any draft datasets in Datacite-using Dataverses that were created before v5. Starting with v5, these will not be publishable until their DOIs are reserved.

We have discussed having an automatic timer job for this, and it would not be hard to implement. Strictly speaking, though, it is not necessary, since we have an API to list the unreserved DOIs and an API to reserve them. So it is something an admin could run once in a while, or wire into a cron job. However, if we are not providing a timer, we at least need to add a release note explaining that any Datacite-using installation needs to do this for its existing drafts once v5 is deployed.
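For reference, a sketch of what such an admin script or cron job could look like, assuming the v5 admin endpoints for listing unreserved PIDs (`/api/pids/unreserved`) and reserving one (`/api/pids/:persistentId/reserve`), a superuser API token, and `jq`; the exact endpoint paths and response shape should be verified against the Native API guide for the deployed version:

```shell
#!/bin/bash
# Sketch: reserve Datacite DOIs for all draft datasets that don't have one yet.
# SERVER_URL and API_TOKEN (a superuser token) are assumptions set by the admin.
SERVER_URL="${SERVER_URL:-https://dataverse.example.edu}"

# List the PIDs that have not been reserved; adjust the jq filter to the
# actual JSON shape your version returns.
curl -s -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/pids/unreserved" \
  | jq -r '.data.pids[]?' \
  | while read -r pid; do
      echo "Reserving $pid ..."
      curl -s -X POST -H "X-Dataverse-key:$API_TOKEN" \
        "$SERVER_URL/api/pids/:persistentId/reserve?persistentId=$pid"
      echo
    done
```

Run once after the upgrade for the existing drafts, and (optionally) on a schedule afterwards.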

Apologies if I'm missing something and this is in fact explained somewhere.

@djbrooke
Contributor

djbrooke commented Jun 25, 2020

Additional context from now-closed-as-duplicate #7019:

In #5093/#6901 we added support for reserving DOIs with Datacite.

There may be some steps to get DOIs into a reserved state, and we want to make sure that all drafts remain publishable.

Edit: there are definitely steps needed to reserve the unreserved Datacite DOIs (that's the API described above). None of the existing unpublished draft datasets have reserved DOIs yet (none of them exist in the db on the Datacite side); they all need to be reserved. The only ambiguity is how urgent it is to get them reserved after upgrading to v5. Specifically, whether it is indeed required for the unpublished drafts to "remain publishable" (more below).

We also don't want Dataverse to have old drafts marked as reserved when they are not actually reserved.

Edit: It sounds like there is no additional "reserved" marking. What serves as the indicator of the DOI having been reserved is a non-null globalidcreatetime (per @pdurbin). So the question is: do unpublished drafts have this timestamp populated or not? If the former, they will appear "reserved" (and publishable) even though they are not actually reserved; if the latter, they will become unpublishable after the v5 upgrade. On a quick check of the prod. db, there are 4K local (non-harvested) datasets with null publicationDate (presumably, most are unpublished drafts) AND null globalidcreatetime. These include a ton of recently created drafts; so yes, it does look like a) drafts are not currently getting the timestamp populated on create, and b) these drafts will indeed become unpublishable as soon as we upgrade.
There are also 3K datasets with null publicationDate AND non-null globalidcreatetime... not sure what's up with these. (None of these are super recent; there's one from 2019, then a whole bunch from 2018 and earlier. The answer may be that some of these were drafts created back under EZID, and some are even more ancient datasets created when we were using handles?)
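For anyone wanting to reproduce that check, a query along these lines should work (a sketch only; the database name and the exact table/column layout - e.g., whether publicationdate and globalidcreatetime live on dvobject or dataset - vary by schema version, so verify against your own database):

```shell
# Sketch: count local (non-harvested), unpublished datasets whose global id
# has never been reserved/registered. Table and column names are assumptions.
psql dvndb -c "
  SELECT count(*)
    FROM dataset ds
    JOIN dvobject dv ON dv.id = ds.id
   WHERE ds.harvestingclient_id IS NULL
     AND dv.publicationdate IS NULL
     AND dv.globalidcreatetime IS NULL;"
```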

There should be a step about running an API to mark drafts as reserved (whether a single time or on a regular cron)

Edit: The previous paragraphs were about handling the existing unpublished drafts after the upgrade. That's a one-time task.
There is also the question of handling this going forward, because some DOIs may fail to get reserved on create (because Datacite is down, for example), so these will need to be addressed regularly. It likely won't be a frequent condition, but we need to tell admins that they'll need to either set up a cron job, or do it by hand once in a while, or keep an eye out for users complaining about the publish button being blocked and then attend to it.
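For the "going forward" case, the cron option could be as simple as a nightly entry. A crontab sketch, where the script path and log location are hypothetical and the script itself just wraps the list-unreserved and reserve API calls discussed above:

```shell
# Crontab sketch: nightly sweep to reserve any DOIs that failed to reserve
# on create (e.g., because Datacite was down at the time).
# m h dom mon dow  command
0 3 * * *  /usr/local/bin/reserve-unreserved-dois.sh >> /var/log/dataverse/reserve-dois.log 2>&1
```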

@sekmiller sekmiller moved this from Up Next 🛎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 1, 2020
@sekmiller sekmiller assigned sekmiller and unassigned sekmiller Jul 1, 2020
@sekmiller sekmiller moved this from IQSS Team - In Progress 💻 to Up Next 🛎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 6, 2020
@djbrooke
Contributor

@scolapasta Thanks for discussing and estimating - is the plan to handle this through a release note or through code?

@scolapasta
Contributor

Release note.

@landreev
Contributor Author

Another thing we can recommend - and should definitely do in our prod. - would be to add a few extra words to the warning banner about the upcoming upgrade; something along the lines of "This Dataverse will be down between ... and ... for a planned upgrade. Please note that users may experience temporary problems publishing previously unpublished drafts immediately after the upgrade".
We can keep a similar warning up after the upgrade as well, until all the unpublished DOIs are reserved.

@landreev
Contributor Author

I've tested the case where (in v4.20) we have an unpublished draft with a null GlobalIdCreateTime that DOES exist on the Datacite side - what happens when we run the new reserve API on it?
This case would usually be the result of a previous attempt to publish the dataset that failed (failed on our end, but not before the dataset got registered with Datacite). All the real-life cases we've had in production where the DOI would suddenly change during publishing were due to this scenario: on publish, Dataverse would think it was registering a new identifier, but Datacite would respond "already exists", so Dataverse would pick and register a new DOI.

A real example of an unpublished draft with a registered DOI: 10.7910/DVN/CAL9UW

I was afraid/expecting that the same exact exchange was going to happen when we try to reserve the DOI in a case like that, resulting in an unnecessary DOI change again. I was preparing to address this by recommending that admins review their unpublished draft DOIs, find such cases and, if any exist, mark them as "reserved" before running the reserve API on the remaining, genuinely unreserved ones.

However, testing (with both a test and a production authority) has confirmed that the reserve API simply proceeds to update the existing record; nothing fails, the fact that it exists is not a problem, and the dataset ends up with the same, existing DOI, now marked "reserved" in the Dataverse db as well.

So this is good - fewer steps to document and perform after the upgrade. It also sounds like, starting with v5.0, there will not be a realistically possible case where the DOI actually has to change before the dataset is published. (yay!)

But this brings up a question: is this the behavior we want? Do note that if this were a genuine collision (extremely unlikely, but possible! - somebody using the same Datacite account went and registered this very identifier for something on another site), it would not be detected. Are we ok with this?

(Even if we are not entirely ok with this, the condition is probably too unlikely to bother changing before v5.0. But we could discuss changing it further down the road. Or not?)

@pdurbin
Member

pdurbin commented Jul 29, 2020

I'm fine with the current behavior. If I'm a sysadmin running a few systems that share a DOI authority, I can always reduce the chance of a conflict by giving a unique shoulder to each system rather than hoping that the various systems play nicely together. And like you say above, we can always revisit this post-5.0 if there's trouble.

@qqmyers
Member

qqmyers commented Jul 29, 2020

The reserve code could check whether the DOI exists before trying to create it (with a call that creates it, or just updates the metadata if the DOI already exists), but I wouldn't hold up 5.0 for that. As Phil says, it's probably not good practice to share an authority/shoulder between systems, and I don't think many installations are sharing, and the collision chance should be minimal (unless Dataverse is coming back with the same DOI after a failure, as has been the case with the publication collisions we've seen).

@landreev
Contributor Author

@qqmyers @pdurbin
Agreed, I don't think we should worry about this as a 5.0 issue.
I'm mostly wondering if it's worth opening a new issue. A reasonable check could be something like this: if the DOI exists on the Datacite end AND its URL points somewhere other than this Dataverse, then it's likely an actual conflict (and then we could generate a new id, or just give up and let the admin handle it). Otherwise, just proceed to update it.
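That check could be prototyped against the public Datacite REST API (GET https://api.datacite.org/dois/{doi} returns the registered attributes, including the url). A sketch, assuming `jq` is available and that the caller supplies the expected target URL; the DOI shown is the example from the earlier comment:

```shell
#!/bin/bash
# Sketch of the proposed conflict check for a single DOI.
# DOI and EXPECTED_URL are assumptions supplied by the caller.
DOI="10.7910/DVN/CAL9UW"
EXPECTED_URL="https://dataverse.example.edu/dataset.xhtml?persistentId=doi:$DOI"

status=$(curl -s -o /tmp/doi.json -w '%{http_code}' "https://api.datacite.org/dois/$DOI")
if [ "$status" = "404" ]; then
  echo "DOI not found at Datacite: safe to reserve."
else
  url=$(jq -r '.data.attributes.url // empty' /tmp/doi.json)
  if [ -n "$url" ] && [ "$url" != "$EXPECTED_URL" ]; then
    echo "Likely genuine collision: DOI already points at $url"
  else
    echo "DOI exists but points at this installation (or has no URL yet): proceed to update."
  fi
fi
```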

landreev added a commit that referenced this issue Jul 29, 2020
@qqmyers
Member

qqmyers commented Nov 4, 2020

@djbrooke - this task mentions "asynchronously", but I don't know that it was intended to address making the reservation of DOIs and the publicizing of them at publication time asynchronous, i.e. completed outside the save/publish call itself. Is that something to discuss, or to create a separate issue for? I'm mostly thinking about the time required to register/publicize large numbers of file DOIs, which the user currently has to wait through.

@djbrooke
Contributor

djbrooke commented Nov 4, 2020

@qqmyers thanks (and thanks @kcondon for discussing separately) - yes, if we can create a different issue that'd be great. If you have some ideas it'd be great for you to create it, or I can - let me know.
