
6918 locking improvements #7118

Merged
merged 12 commits into from
Aug 3, 2020

Conversation

Contributor

@landreev landreev commented Jul 23, 2020

What this PR does / why we need it:

This is a patch for a few unrelated issues that, combined, resulted in a series of nasty production crashes. That investigation is described in detail in IQSS/dataverse.harvard.edu#73

The issues were as follows:

AuthenticationServiceBean had (unintentionally) been left a @Singleton with all of its methods locking: with no explicit lock attributes, every method defaulted to @Lock(LockType.WRITE). This was solved by making AuthenticationServiceBean a "normal" @Stateless bean, and moving the code that had any business being a singleton into its own compact service.
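To illustrate the difference (a minimal sketch; the class and method names below are simplified stand-ins, not the actual Dataverse code): under container-managed concurrency, a @Singleton method with no explicit lock attribute behaves as if annotated @Lock(LockType.WRITE), so every caller queues behind a single exclusive lock, while a @Stateless bean has no instance-wide lock because each call is dispatched to a pooled instance.

```java
import javax.ejb.Lock;
import javax.ejb.LockType;
import javax.ejb.Singleton;
import javax.ejb.Stateless;

// BEFORE (the problem): a @Singleton with container-managed concurrency.
// With no explicit @Lock annotations, every business method defaults to
// @Lock(LockType.WRITE), so all callers are serialized behind one exclusive
// lock -- a single slow call blocks every other caller of the bean.
@Singleton
class AuthServiceAsSingleton {
    public Object findUser(String id) { /* ... */ return null; } // implicitly WRITE-locked
}

// AFTER (the fix described in this PR): a plain @Stateless bean has no
// instance-level lock; the container serves concurrent calls from a pool.
@Stateless
class AuthServiceAsStateless {
    public Object findUser(String id) { /* ... */ return null; }
}

// Code that genuinely must run single-threaded (provider registration)
// moves into its own small singleton, keeping the locked surface tiny.
@Singleton
class ProviderRegistrationService {
    @Lock(LockType.WRITE)
    public void registerProvider(Object provider) { /* ... */ }
}
```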

We had a complicated system for dataset publishing behavior. In some cases (depending on the number of files) we would lock the dataset and perform all the underlying tasks asynchronously; on smaller datasets we would attempt to publish without a lock, making the page (or the API) wait for it to finish. It turned out that even with a few files, publishing can still take a while (possible reasons: the DataCite service hanging or timing out; reindexing of the parent dataverse, when required, now being a potentially expensive operation; there are likely more). This created situations where a user was all but encouraged to click the publish button multiple times while the previous attempts were still in progress. The resulting jobs would end up deadlocking each other and, because of the previous issue, the entire application, if one of the commands was holding the lock on the AuthenticationService singleton.
Because of the previous fix, this would not be as catastrophic if allowed to happen again. But it would still lock up the dataset in question, and would still result in a lot of frustration for the user.

This was solved by eliminating the two types of publishing behavior: all publishing now happens with the dataset securely locked.

The third issue was that, because of a bug in the underlying EJB implementation of addAll(), the parent dataverse(s) of the dataset being published were often reindexed unnecessarily (when no new subjects were added). As of v4.20 this reindexing is a potentially expensive process (because of linked datasets?), so many datasets were taking unnecessarily long to publish just because of that.

This was solved by not using addAll().

Issue #6947 is unrelated to the troubles described above, but the changes in this PR happen to fix it too: if anything goes wrong during an attempt to publish, the user will now stay on the draft page, and not, confusingly, end up on the last previously published version (if one exists).

Which issue(s) this PR closes:

Closes #6918
As a side effect, also Closes #6947

Special notes for your reviewer:

The code that deals with registering AuthenticationProviders has been moved into a new AuthenticationProvidersRegistrationServiceBean. It's a long name, but we want it to be as self-explanatory as possible. The trouble with the original service happened because it was originally created just for registering AuthenticationProviders; but then everybody assumed it was for everything generally Authentication-related, and filled it with all sorts of service methods, some potentially time-consuming. That was a recipe for disaster, since every method was locking, and most parts of the system needed to call these methods as part of normal operation.

The part about the Set.addAll() method being buggy: @scolapasta investigated this and found eclipse-ee4j/eclipselink#622.
The bug has to do with the boolean return value of the method. It works fine otherwise: it adds all the values from set A to set B just fine. But it is supposed to return true only if any new values have actually been added to the target set, and the EclipseLink implementation ends up always returning true. So in the context of FinalizeDatasetPublicationCommand we were indexing the parent dataverse(s) unnecessarily:

if (dv.getDataverseSubjects().addAll(dsf.getControlledVocabularyValues())) {
    Dataverse dvWithSubjectJustAdded = ctxt.em().merge(dv);
    ctxt.em().flush();
    ctxt.index().indexDataverse(dvWithSubjectJustAdded); // need to reindex to capture the new subjects
}

I worked around it by relying on contains() instead. It doesn't look like we've been using the return value of addAll() anywhere else in the application, so let's keep it this way.
And of course this is only the case with the EclipseLink implementation; if you create your own HashSet or something similar, you can expect the return value to be reliable. It's only when you call the method on a managed entity's collection, as in the example above, that it is problematic.
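A self-contained sketch of both points above, using plain JDK collections (the subject values here are made up for illustration): the standard addAll() contract, and a contains()-based check of the kind the PR switches to, which does not depend on addAll()'s return value at all.

```java
import java.util.HashSet;
import java.util.Set;

public class AddAllDemo {
    public static void main(String[] args) {
        // With a plain java.util.HashSet the addAll() contract holds:
        // it returns true only when the target set actually changed.
        Set<String> subjects = new HashSet<>();
        subjects.add("Medicine");

        Set<String> incoming = new HashSet<>();
        incoming.add("Medicine"); // already present -- no change expected

        System.out.println(subjects.addAll(incoming)); // false: nothing new added

        incoming.add("Physics");
        System.out.println(subjects.addAll(incoming)); // true: "Physics" is new

        // The workaround style used in this PR: decide with contains()
        // instead of trusting addAll()'s return value (which the buggy
        // EclipseLink collection wrapper reports as always true).
        Set<String> more = new HashSet<>();
        more.add("Physics"); // nothing new this time
        boolean newSubjectsAdded = false;
        for (String s : more) {
            if (!subjects.contains(s)) {
                subjects.add(s);
                newSubjectsAdded = true;
            }
        }
        System.out.println(newSubjectsAdded); // false: safe to skip the reindex
    }
}
```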

Suggestions on how to test this:
It would be very hard to reproduce the full lockup condition that started this effort (on a full-size production db copy, maybe, but I don't think it's necessary to go to such lengths). It may be sufficient to confirm that this PR makes it next to impossible to start multiple parallel attempts to publish the same dataset (from different browsers, etc.), because the first attempt will now always lock the dataset.

I'm not aware of an easy way to prove that calls to methods in AuthenticationServiceBean used to lock up access to the service, and that it's no longer happening. Except by making a test build with some "sleep" statements in strategic places.

My guess is that the main emphasis of the test should be that publishing still works. (And that publishing always locks the dataset, even if the old file number limit option is left in place, set to more than the number of files in the dataset).

There are smaller things, like confirming that after the dataset becomes locked, the user stays on the draft page (and doesn't go to the last published version, which is issue #6947, that this PR also closes).

And then there is the unnecessary reindexing of the parent dataverse when no new subjects were added. It should be possible to test (maybe by just checking the "indexed" timestamp on the dataverse?), but I'm honestly not sure it is worth it. So confirming that users can still publish their datasets without anything bad happening should probably be the priority; but that's just my interpretation.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

The UI impact is minimal, but there is still some.
From the description above: "We had a complicated system for dataset publishing behavior. In some cases (depending on the number of files) we would lock the dataset and perform all the underlying tasks asynchronously; but on smaller datasets we would attempt to publish without a lock, making the page (or the API) wait for it to finish."
This PR eliminates the latter type of behavior: from now on, every time a user tries to publish, it will behave in the same, consistent manner. The dataset will be locked and the page will show a clear banner ("Publish in Progress ...") for the duration of the process. This behavior is not new; it is already familiar to users. So the changes are not introducing anything novel, just simplifying things.

Please let me know if you have any questions about the process.

One other change in the user experience: in the current (v4.20) setup, if DataCite registration fails during an attempt to publish, for the dataset or one of the files (because DataCite is down, because there's something in the metadata that DataCite doesn't like, etc.), it fails silently. In this PR we have agreed to add an explicit notification (emailed to the user, and shown in the UI) explaining what happened.
The text of the notifications is in the bundle and can be changed as needed.

Is there a release notes update needed for this change?:

Additional documentation:

landreev and others added 3 commits July 19, 2020 20:49
… conditions.

MUCH SIMPLIFIED, in terms of the number of files that had to be modified; but should be
even more effective in ensuring that the service never becomes locked under
normal operating conditions. (#6918)
@coveralls

coveralls commented Jul 23, 2020

Coverage Status

Coverage decreased (-0.02%) to 19.59% when pulling e9d3e74 on 6918-locking-improvements into 941d17d on develop.

@landreev landreev added this to the Dataverse 5 milestone Jul 23, 2020

Configures the number of files in the dataset to warrant performing the registration of persistent identifiers (section above) and/or file validation asynchronously during publishing. The setting is optional, and the default value is 10.
Before v5.0 this setting used to specify the number of files in the dataset to warrant performing the registration of the persistent identifiers (section above) and/or file validation asynchronously (in the background) during publishing. As of v5.0 publishing *always* happens asynchronously, with the dataset locked for the duration of the process. The setting will be ignored if present.
Member

policy question - should old settings be described when they no longer have an effect? I ask partly because in #7116 I'm just removing documentation of old jvm options. (A quick search for deprecated only showed a couple cases where something was deprecated but still supported.)

Member

I think it's more common to remove both the setting and the docs for it; eb861c7 is an example. In a release note we could include something about why the setting was removed.

Contributor Author

I felt like I should leave it in there marked as "deprecated", in case somebody sees it in their settings table and wonders what it is (going through the older versions of the guide is an option, but would be an extra effort).
But I'm open to removing things that are no longer relevant too.

Contributor Author

So what's the final word/vote - should I remove it? Or leave it in place?

Contributor

@landreev Can you remove and add a release note? We have a "Notes for Installation Administrators" section it can go under. I'll make sure it gets integrated.

Contributor Author

Would it really provide any value in the release notes?
There's a lot in this release that the admins will actually have to do, in order to upgrade... Do we want to give them more stuff to read, that they don't necessarily have any use for?
Wouldn't take any effort to move it of course, if you still want me to.

Member

If we're voting, I vote for removing it from the docs (old versions of the docs are available indefinitely for a reason). A single line in the release notes saying "[whatever] has been deprecated and can be removed from your setting table" seems fine.

Contributor Author

@landreev landreev Jul 27, 2020

OK, fine, I'm convinced - or outvoted, which is the same thing. Will remove.

Contributor Author

OK, done.

}

private void notifyUsersDatasetPublish(CommandContext ctxt, DvObject subject) {
private void notifyUsersDatasetPublishStatus(CommandContext ctxt, DvObject subject, UserNotification.Type type) {
Member

I thought I understood why the in transaction version of sendNotification was needed (to send the email even if the command transaction was rolling back), but this looks like the opposite - is it correct as is? (If so, what am I missing?)

Contributor Author

No, you were right. Since the notification needs to be saved in the database, rolling back the transaction un-saves it. And now that, in addition to the success notification, I am also sending notifications when something goes bad and the transaction is going to be rolled back, I need to save the notification in a transaction of its own.
Of course I should probably only use the "InTransaction" method for the on-failure notification, and stick with the normal, old way for the "success" notification at the end of the command, when nothing is expected to be rolled back.

The extra parameter - type - identifies the type of the notification that needs to be sent; "dataset's been published" vs "failed on account of registration error" ...
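A sketch of the transaction semantics being discussed (the bean name and method bodies are illustrative, not the actual Dataverse code; only the sendNotification / sendNotificationInTransaction distinction comes from the thread): the "InTransaction" variant survives a rollback of the calling command because REQUIRES_NEW suspends the caller's transaction and commits in a fresh one.

```java
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Stateless
class NotificationServiceSketch {
    @PersistenceContext
    private EntityManager em;

    // Joins the caller's transaction (the default, REQUIRED). If the
    // publish command rolls back, this INSERT rolls back with it -- which
    // is fine for the "success" notification that only matters on commit.
    public void sendNotification(Object notification) {
        em.persist(notification);
    }

    // Suspends the caller's transaction and commits in a fresh one, so the
    // saved row survives even when the surrounding command rolls back --
    // what the on-failure ("publish failed") notification needs.
    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void sendNotificationInTransaction(Object notification) {
        em.persist(notification);
    }
}
```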

Member

Right - but notifyUserDatasetPublishStatus is the one being used for the failure message, and it's using sendNotification inside and not sendNotificationInTransaction - that means the notification won't get saved in the db for the failure? Or?

Contributor Author

Did I check in the wrong version of it? - hold on...

Contributor Author

Of course I did. (I was messing with renaming the methods at the last moment - and did mess it up). Thank you.

Member

been there, done that!

Contributor Author

I was only looking at the line your comment was immediately referencing and got confused.
OK, fixed - thanks again.
Otherwise Kevin would run into this during QA, with no notification showing in the UI.

@scolapasta scolapasta self-assigned this Jul 29, 2020
Contributor

@scolapasta scolapasta left a comment

@landreev I added some minor comments here; nothing critical. Please review, make changes if you think worthwhile or otherwise, move to QA (or let me know and I can officially press "Approve").

}
}
if (newSubjectsAdded) {
logger.fine("new dataverse subjects added - saving and reindexing");
Dataverse dvWithSubjectJustAdded = ctxt.em().merge(dv);
ctxt.em().flush();
ctxt.index().indexDataverse(dvWithSubjectJustAdded); // need to reindex to capture the new subjects
Contributor

I already opened a separate issue for this, but index should be moved to onSuccess.

Contributor Author

Makes sense.

@scolapasta scolapasta assigned landreev and unassigned scolapasta Jul 29, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from Code Review 🦁 to QA 🔎✅ Jul 30, 2020
@kcondon kcondon self-assigned this Aug 3, 2020
@kcondon kcondon merged commit 8ca934b into develop Aug 3, 2020
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA 🔎✅ to Done 🚀 Aug 3, 2020
@kcondon kcondon deleted the 6918-locking-improvements branch August 3, 2020 19:15