
6918 locking improvements #7118

Merged
merged 12 commits into from
Aug 3, 2020

Conversation

Contributor

@landreev landreev commented Jul 23, 2020

What this PR does / why we need it:

This is a patch for a few unrelated issues that, combined, resulted in a series of nasty production crashes. That investigation is described in detail in IQSS/dataverse.harvard.edu#73

The issues were as follows:

AuthenticationServiceBean had (unintentionally) been left a @Singleton with all of its methods locking: with no explicit lock attributes, every method defaulted to @Lock(LockType.WRITE). This was solved by making AuthenticationServiceBean a "normal" @Stateless bean, and moving the code that had any business being a singleton into its own compact service.
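To illustrate the difference (a minimal sketch; the class and method names below are simplified stand-ins, not the actual Dataverse code): under container-managed concurrency, a @Singleton method with no explicit lock attribute behaves as if annotated @Lock(LockType.WRITE), so every caller queues behind a single exclusive lock, while a @Stateless bean has no instance-wide lock because each call is dispatched to a pooled instance.

```java
import javax.ejb.Lock;
import javax.ejb.LockType;
import javax.ejb.Singleton;
import javax.ejb.Stateless;

// BEFORE (the problem): a @Singleton with container-managed concurrency.
// With no explicit @Lock annotations, every business method defaults to
// @Lock(LockType.WRITE), so all callers are serialized behind one exclusive
// lock -- a single slow call blocks every other caller of the bean.
@Singleton
class AuthServiceAsSingleton {
    public Object findUser(String id) { /* ... */ return null; } // implicitly WRITE-locked
}

// AFTER (the fix described in this PR): a plain @Stateless bean has no
// instance-level lock; the container serves concurrent calls from a pool.
@Stateless
class AuthServiceAsStateless {
    public Object findUser(String id) { /* ... */ return null; }
}

// Code that genuinely must run single-threaded (provider registration)
// moves into its own small singleton, keeping the locked surface tiny.
@Singleton
class ProviderRegistrationService {
    @Lock(LockType.WRITE)
    public void registerProvider(Object provider) { /* ... */ }
}
```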

We had a complicated system for dataset publishing behavior. In some cases (depending on the number of files) we would lock the dataset and perform all the underlying tasks asynchronously; on smaller datasets we would attempt to publish without a lock, making the page (or the API) wait for it to finish. It turned out that even with a few files, publishing can still take a while (possible reasons: the DataCite service hanging or timing out; reindexing of the parent dataverse, when required, now being a potentially expensive operation; there are likely more). This created situations where a user was all but encouraged to click the publish button multiple times while the previous attempts were still in progress. The resulting jobs would end up deadlocking each other and, because of the previous issue, the entire application, if one of the commands was holding the lock on the AuthenticationService singleton.
Because of the previous fix, this would not be as catastrophic if allowed to happen again. But it would still lock up the dataset in question, and would still result in a lot of frustration for the user.

This was solved by eliminating the two types of publishing behavior: all publishing now happens with the dataset securely locked.

The third issue was that, because of a bug in the underlying EJB implementation of addAll(), the parent dataverse(s) of the dataset being published were often reindexed unnecessarily (when no new subjects were added). As of v4.20 this reindexing is a potentially expensive process (because of linked datasets?), so many datasets were taking unnecessarily long to publish just because of that.

This was solved by not using addAll().

Issue #6947 is unrelated to the troubles described above, but the changes in this PR happen to fix it too: if anything goes wrong during an attempt to publish, the user will now stay on the draft page, and not, confusingly, end up on the last previously published version (if one exists).

Which issue(s) this PR closes:

Closes #6918
As a side effect, also Closes #6947

Special notes for your reviewer:

The code that deals with registering AuthenticationProviders has been moved into a new AuthenticationProvidersRegistrationServiceBean. It's a long name, but we want it to be as self-explanatory as possible. The trouble with the original service happened because it was originally created just for registering AuthenticationProviders; but then everybody assumed it was for everything generally Authentication-related, and filled it with all sorts of service methods, some potentially time-consuming. That was a recipe for disaster, since every method was locking, and most parts of the system needed to call these methods as part of normal operation.

The part about the Set.addAll() method being buggy: @scolapasta investigated this and found eclipse-ee4j/eclipselink#622.
The bug has to do with the boolean return value of the method. It works fine otherwise: it adds all the values from set A to set B just fine. But it is supposed to return true only if any new values have actually been added to the target set, and the EclipseLink implementation ends up always returning true. So in the context of FinalizeDatasetPublicationCommand we were indexing the parent dataverse(s) unnecessarily:

if (dv.getDataverseSubjects().addAll(dsf.getControlledVocabularyValues())) {
    Dataverse dvWithSubjectJustAdded = ctxt.em().merge(dv);
    ctxt.em().flush();
    ctxt.index().indexDataverse(dvWithSubjectJustAdded); // need to reindex to capture the new subjects
}

I worked around it by relying on contains() instead. It doesn't look like we've been using the return value of addAll() anywhere else in the application, so let's keep it this way.
And of course this is only the case with the EclipseLink implementation; if you create your own HashSet or something similar, you can expect the return value to be reliable. It's only when you call the method on a managed entity's collection, as in the example above, that it is problematic.
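A self-contained sketch of both points above, using plain JDK collections (the subject values here are made up for illustration): the standard addAll() contract, and a contains()-based check of the kind the PR switches to, which does not depend on addAll()'s return value at all.

```java
import java.util.HashSet;
import java.util.Set;

public class AddAllDemo {
    public static void main(String[] args) {
        // With a plain java.util.HashSet the addAll() contract holds:
        // it returns true only when the target set actually changed.
        Set<String> subjects = new HashSet<>();
        subjects.add("Medicine");

        Set<String> incoming = new HashSet<>();
        incoming.add("Medicine"); // already present -- no change expected

        System.out.println(subjects.addAll(incoming)); // false: nothing new added

        incoming.add("Physics");
        System.out.println(subjects.addAll(incoming)); // true: "Physics" is new

        // The workaround style used in this PR: decide with contains()
        // instead of trusting addAll()'s return value (which the buggy
        // EclipseLink collection wrapper reports as always true).
        Set<String> more = new HashSet<>();
        more.add("Physics"); // nothing new this time
        boolean newSubjectsAdded = false;
        for (String s : more) {
            if (!subjects.contains(s)) {
                subjects.add(s);
                newSubjectsAdded = true;
            }
        }
        System.out.println(newSubjectsAdded); // false: safe to skip the reindex
    }
}
```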

Suggestions on how to test this:
It would be very hard to reproduce the full lockup condition that started this effort (on a full-size production db copy, maybe, but I don't think it's necessary to go to such lengths). It may be sufficient to confirm that this PR makes it next to impossible to start multiple parallel attempts to publish the same dataset (from different browsers, etc.), because the first attempt will now always lock the dataset.

I'm not aware of an easy way to prove that calls to methods in AuthenticationServiceBean used to lock up access to the service, and that it's no longer happening. Except by making a test build with some "sleep" statements in strategic places.

My guess is that the main emphasis of the test should be that publishing still works. (And that publishing always locks the dataset, even if the old file number limit option is left in place, set to more than the number of files in the dataset).

There are smaller things, like confirming that after the dataset becomes locked, the user stays on the draft page (and doesn't go to the last published version, which is issue #6947, that this PR also closes).

And then there is the unnecessary reindexing of the parent dataverse when no new subjects were added. It should be possible to test (maybe by just checking the "indexed" timestamp on the dataverse?), but I'm honestly not sure it is worth it. So confirming that users can still publish their datasets without anything bad happening should probably be the priority; but that's just my interpretation.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

The UI impact is minimal, but there is still some.
From the description above: "We had a complicated system for dataset publishing behavior. In some cases (depending on the number of files) we would lock the dataset and perform all the underlying tasks asynchronously; but on smaller datasets we would attempt to publish without a lock, making the page (or the API) wait for it to finish."
This PR eliminates the latter type of behavior: from now on, every time a user tries to publish, it will behave in the same, consistent manner. The dataset will be locked and the page will show a clear banner ("Publish in Progress ...") for the duration of the process. This behavior is not new; it is already familiar to users. So the changes are not introducing anything novel, just simplifying things.

Please let me know if you have any questions about the process.

One other change in the user experience: in the current (v4.20) setup, if DataCite registration fails during an attempt to publish, for the dataset or one of the files (because DataCite is down, because there's something in the metadata that DataCite doesn't like, etc.), it fails silently. In this PR we have agreed to add an explicit notification (emailed to the user, and shown in the UI) explaining what happened.
The text of the notifications is in the bundle and can be changed as needed.

Is there a release notes update needed for this change?:

Additional documentation:

landreev and others added 3 commits July 19, 2020 20:49
… conditions.

MUCH SIMPLIFIED, in terms of the number of files that had to be modified; but should be
even more effective in ensuring that the service never becomes locked under
normal operating conditions. (#6918)
@coveralls

coveralls commented Jul 23, 2020

Coverage Status

Coverage decreased (-0.02%) to 19.59% when pulling e9d3e74 on 6918-locking-improvements into 941d17d on develop.

@landreev landreev added this to the Dataverse 5 milestone Jul 23, 2020

Configures the number of files in the dataset to warrant performing the registration of persistent identifiers (section above) and/or file validation asynchronously during publishing. The setting is optional, and the default value is 10.
Before v5.0 this setting used to specify the number of files in the dataset to warrant performing the registration of the persistent identifiers (section above) and/or file validation asynchronously (in the background) during publishing. As of v5.0 publishing *always* happens asynchronously, with the dataset locked for the duration of the process. The setting will be ignored if present.
Member

policy question - should old settings be described when they no longer have an effect? I ask partly because in #7116 I'm just removing documentation of old jvm options. (A quick search for deprecated only showed a couple cases where something was deprecated but still supported.)

Member

I think it's more common to remove both the setting and the docs for it; eb861c7 is an example. In a release note we could include something about why the setting was removed.

Contributor Author

I felt like I should leave it in there marked as "deprecated", in case somebody sees it in their settings table and wonders what it is (going through the older versions of the guide is an option, but would be an extra effort).
But I'm open to removing things that are no longer relevant too.

Contributor Author

So what's the final word/vote - should I remove it? Or leave it in place?

Contributor

@landreev Can you remove and add a release note? We have a "Notes for Installation Administrators" section it can go under. I'll make sure it gets integrated.

Contributor Author

Would it really provide any value in the release notes?
There's a lot in this release that the admins will actually have to do, in order to upgrade... Do we want to give them more stuff to read, that they don't necessarily have any use for?
Wouldn't take any effort to move it of course, if you still want me to.

Member

If we're voting, I vote for removing it from the docs (old versions of the docs are available indefinitely for a reason). A single line in the release notes saying "[whatever] has been deprecated and can be removed from your setting table" seems fine.

Contributor Author

@landreev landreev Jul 27, 2020

OK, fine, I'm convinced - or outvoted, which is the same thing. Will remove.

Contributor Author

OK, done.

}

private void notifyUsersDatasetPublish(CommandContext ctxt, DvObject subject) {
private void notifyUsersDatasetPublishStatus(CommandContext ctxt, DvObject subject, UserNotification.Type type) {
Member

I thought I understood why the in transaction version of sendNotification was needed (to send the email even if the command transaction was rolling back), but this looks like the opposite - is it correct as is? (If so, what am I missing?)

Contributor Author

No, you were right. Since the notification needs to be saved in the database, rolling back the transaction un-saves it. And now that, in addition to the success notification, I am also sending notifications when something goes bad and the transaction is going to be rolled back, I need to save the notification in a transaction of its own.
Of course I should probably only use the "InTransaction" method for the on-failure notification, and stick with the normal, old way for the "success" notification at the end of the command, when nothing is expected to be rolled back.

The extra parameter - type - identifies the type of the notification that needs to be sent; "dataset's been published" vs "failed on account of registration error" ...
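A sketch of the transaction semantics being discussed (the bean name and method bodies are illustrative, not the actual Dataverse code; only the sendNotification / sendNotificationInTransaction distinction comes from the thread): the "InTransaction" variant survives a rollback of the calling command because REQUIRES_NEW suspends the caller's transaction and commits in a fresh one.

```java
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Stateless
class NotificationServiceSketch {
    @PersistenceContext
    private EntityManager em;

    // Joins the caller's transaction (the default, REQUIRED). If the
    // publish command rolls back, this INSERT rolls back with it -- which
    // is fine for the "success" notification that only matters on commit.
    public void sendNotification(Object notification) {
        em.persist(notification);
    }

    // Suspends the caller's transaction and commits in a fresh one, so the
    // saved row survives even when the surrounding command rolls back --
    // what the on-failure ("publish failed") notification needs.
    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void sendNotificationInTransaction(Object notification) {
        em.persist(notification);
    }
}
```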

Member

Right - but notifyUserDatasetPublishStatus is the one being used for the failure message, and it's using sendNotification inside and not sendNotificationInTransaction - that means the notification won't get saved in the db for the failure? Or?

Contributor Author

Did I check in the wrong version of it? - hold on...

Contributor Author

Of course I did. (I was messing with renaming the methods at the last moment - and did mess it up). Thank you.

Member

been there, done that!

Contributor Author

I was only looking at the line your comment was immediately referencing and got confused.
OK, fixed - thanks again.
Otherwise Kevin would run into this during QA, with no notification showing in the UI.

@scolapasta scolapasta self-assigned this Jul 29, 2020
Contributor

@scolapasta scolapasta left a comment

@landreev I added some minor comments here; nothing critical. Please review, make changes if you think worthwhile or otherwise, move to QA (or let me know and I can officially press "Approve").

}
}
if (newSubjectsAdded) {
logger.fine("new dataverse subjects added - saving and reindexing");
Dataverse dvWithSubjectJustAdded = ctxt.em().merge(dv);
ctxt.em().flush();
ctxt.index().indexDataverse(dvWithSubjectJustAdded); // need to reindex to capture the new subjects
Contributor

I already opened a separate issue for this, but index should be moved to onSuccess.

Contributor Author

Makes sense.

@scolapasta scolapasta assigned landreev and unassigned scolapasta Jul 29, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Code Review 🦁 to QA 🔎✅ Jul 30, 2020
@kcondon kcondon self-assigned this Aug 3, 2020
@kcondon kcondon merged commit 8ca934b into develop Aug 3, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Aug 3, 2020
@kcondon kcondon deleted the 6918-locking-improvements branch August 3, 2020 19:15