Skip to content

8259242: Remove ProtectionDomainSet_lock #3362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

coleenp
Copy link
Contributor

@coleenp coleenp commented Apr 6, 2021

This change makes the DictionaryEntry pd_set reads lock free, because when I set a low SafepointTimeout on the findDeadlock test, these threads were waiting on the ProtectionDomainSet_lock. This lock is a singleton that's taken for every dictionary entry.

I don't know if this change will make the test never timeout again, because removing this lock caused us to hit a low SafepointTimeout on another _no_safepoint_check lock that is not so easy to eradictate.

Tested with tiers 1-3. There's a test that exercises this code in runtime/Dictionary/ProtectionDomainCacheTest.java.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3362/head:pull/3362
$ git checkout pull/3362

Update a local copy of the PR:
$ git checkout pull/3362
$ git pull https://git.openjdk.java.net/jdk pull/3362/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3362

View PR using the GUI difftool:
$ git pr show -t 3362

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3362.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Apr 6, 2021

👋 Welcome back coleenp! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Apr 6, 2021
@openjdk
Copy link

openjdk bot commented Apr 6, 2021

@coleenp The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot hotspot-dev@openjdk.org label Apr 6, 2021
@mlbridge
Copy link

mlbridge bot commented Apr 6, 2021

Webrevs

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Coleen,

Can you explain the synchronization protocol you are using here please - I don't understand how the handshake fits into it ??

Further comments below.

Thanks,
David

Comment on lines 421 to 422
assert(Thread::current()->is_Java_thread(), "only called by JavaThread");
assert_locked_or_safepoint(SystemDictionary_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the current thread must be a JavaThread then we cannot be at a safepoint and so the second assert can just check the lock is held.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.

Comment on lines 144 to 145
ProtectionDomainEntry* pd_set() const { return Atomic::load(&_pd_set); }
void set_pd_set(ProtectionDomainEntry* entry) { Atomic::store(&_pd_set, entry); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These probably need to be load_acquire and release_store, so that a lock-free traversal can proceed correctly with a locked addition/removal. With suitable name changes for the methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot that the method names have to match the implementation details. Thanks I didn't know whether they needed acquire etc.

Comment on lines +87 to +89
// If there are any deleted entries, Handshake-all then they'll be
// safe to remove since traversing the pd_set list does not stop for
// safepoints and only JavaThreads will read the pd_set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand what you are trying to do here. This is basically a no-op handshake with every thread? Does all the threads remain in the handshake until the last one is processed or do they proceed through one at a time? The former seems better suited for a safepoint operation. The latter means this doesn't work as they may have moved on to a new removal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the ServiceThread calls this code, so they won't be removed again. This is modeled after the ObjectMonitor code that does the same thing. All the threads have to hit a handshake at a safepoint polling place, which they cannot do while reading the list. After all the threads have hit the handshake, no thread can see the old entries on the list.
I thought that's what I said in the comment. What should I say to make this clearer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See src/hotspot/share/runtime/synchronizer.cpp:

size_t ObjectSynchronizer::deflate_idle_monitors() {
<snip>
    if (current->is_Java_thread()) {
<snip>
      // A JavaThread needs to handshake in order to safely free the
      // ObjectMonitors that were deflated in this cycle.
      HandshakeForDeflation hfd_hc;
      Handshake::execute(&hfd_hc);

Comment on lines 114 to 115
ProtectionDomainEntry* next() { return Atomic::load(&_next); }
void set_next(ProtectionDomainEntry* entry) { Atomic::store(&_next, entry); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again these may need to be load_acquire/release_store for correct lock-free traversal. With suitable name changes for the methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll add acquire/release and change the names to match the implementation details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we collectively agree on the naming convention that set_next has to be release_set_next because the implementation uses release_store, or is this just your preference? I find this noisy in the context where these functions are being called.

@coleenp
Copy link
Contributor Author

coleenp commented Apr 7, 2021

The lock-free algorithm is based on the one for object monitors, except that adding and deleting entries is done with the SystemDictionary_lock held, reading entries is done with a (now) Atomic load acquires etc with no safepoint polls, and deleting the entries is done by saving the entries off to the side, until a handshake is reached, meaning none of the readers are reading the old version of the list.

@mlbridge
Copy link

mlbridge bot commented Apr 8, 2021

Mailing list message from David Holmes on hotspot-dev:

On 7/04/2021 10:30 pm, Coleen Phillimore wrote:

On Wed, 7 Apr 2021 11:55:56 GMT, Coleen Phillimore <coleenp at openjdk.org> wrote:

src/hotspot/share/classfile/protectionDomainCache.hpp line 115:

113:
114: ProtectionDomainEntry* next() { return Atomic::load(&_next); }
115: void set_next(ProtectionDomainEntry* entry) { Atomic::store(&_next, entry); }

Again these may need to be load_acquire/release_store for correct lock-free traversal. With suitable name changes for the methods.

Thanks, I'll add acquire/release and change the names to match the implementation details.

Did we collectively agree on the naming convention that set_next has to be release_set_next because the implementation uses release_store, or is this just your preference? I find this noisy in the context where these functions are being called.

Yes that is the convention that was established and in place for a
number of fields.

Thanks,
David

@mlbridge
Copy link

mlbridge bot commented Apr 8, 2021

Mailing list message from David Holmes on hotspot-dev:

On 7/04/2021 10:05 pm, Coleen Phillimore wrote:

On Wed, 7 Apr 2021 04:55:35 GMT, David Holmes <dholmes at openjdk.org> wrote:

This change makes the DictionaryEntry pd_set lock free, because when I set a low SafepointTimeout on the findDeadlock test, these threads were waiting on the ProtectionDomainSet_lock. This lock is a singleton that's taken for every dictionary entry.

I don't know if this change will make the test never timeout again, because removing this lock caused us to hit a low SafepointTimeout on another _no_safepoint_check lock that is not so easy to eradictate.

Tested with tiers 1-3. There's a test that exercises this code in runtime/Dictionary/ProtectionDomainCacheTest.java.

Hi Coleen,

Can you explain the synchronization protocol you are using here please - I don't understand how the handshake fits into it ??

Further comments below.

Thanks,
David

The lock-free algorithm is based on the one for object monitors, except that adding and deleting entries is done with the SystemDictionary_lock held, reading entries is done with a (now) Atomic load acquires etc with no safepoint polls, and deleting the entries is done by saving the entries off to the side, until a handshake is reached, meaning none of the readers are reading the old version of the list.

I really don't see why we need to use handshakes here. The writers use a
lock, and use release_stores. The readers use load_acquire. Our normal
lock-free data structures don't need handshakes to be involved, so sorry
but I'm not getting it.

Thanks,
David

@coleenp
Copy link
Contributor Author

coleenp commented Apr 8, 2021

The reader thread could be using the linked list and have just loaded -> next, while the deleting thread also has loaded -> next, deciding it's unused and calling delete on it. The reader is then swapped back in the process and looks at the _pd_cache field which was deleted or the _next field which is will point to the delete pattern.
When we call 'delete' on a node in this list, we have to make sure that nothing is reading it.

The concurrent hashtable uses globalCounter for this situation. Many of our lock free data structures delete entries in safepoints (like dictionaries). This pd_set list does the same as the ObjectMonitor code.

The handshake is rare as I said in the comments. The protection domain is generally associated with the klass in the initiating loader, which stays alive while this class is alive.

Copy link
Contributor

@pchilano pchilano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Coleen,

Changes LGTM. Some comments below:

Copy link
Contributor Author

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing, Patricio.

@coleenp coleenp changed the title 8259242: vmTestbase/vm/mlvm/mixed/stress/java/findDeadlock/TestDescription.java timed out 8259242: Remove ProtectionDomainSet_lock Apr 8, 2021
@coleenp
Copy link
Contributor Author

coleenp commented Apr 8, 2021

I renamed this bug and pull request.

… deletions before handshaking. Fix if to when.
Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Coleen,

Thanks for the additional explanations about the synchronization protocol that is being used. The use of the global handshake with the OM deletion had not really registered with me as a general technique that we would employ for safe-memory-reclamation. (I'm not sure if I find it cool or scary! :) )

So to summarise the protocol:

  • the pd_set can only be set and appended to when the lock is held
  • traversal of a pd_set is lock-free
  • removal from a pd_set is a two phase process:
    1. Under the lock move the entry to the global delete list
    1. When appropriate the serviceThread will process the delete list, by first handshaking all threads, and then iterating through and deleting the entries

We know an entry cannot be in-use when deleted because that implies a traversal is in progress, but if a traversal is in progress then the handshake cannot have happened.

Is that correct?

I would like to see this high-level description as a block comment somewhere for clarity, as the individual pieces of code seem spread around and it is not easy (for me) to see the connections when just reading the code.

Thanks,
David

Comment on lines +113 to +116
if (_delete_list == NULL) {
_delete_list = new (ResourceObj::C_HEAP, mtClass)
GrowableArray<ProtectionDomainEntry*>(20, mtClass);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is protecting this code to ensure the allocation can only happen once?

Copy link
Contributor Author

@coleenp coleenp Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functions is only called from the service thread. It's not reentrant.

ProtectionDomainEntry* to_delete = current;
current = current->next();
delete to_delete;
delete_list->push(current);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested comment before this:
// Mark current for deletion, but in the meantime it can still be traversed

It is important that current is not unlinked so that in-progress traversals do not break. Though to be honest I'm unclear exactly what triggers the clearing of an entry this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the comment. The service thread could be trying to remove this entry and there's a small chance that some class loading will be looking up this entry with SystemDictionary::find_instance_klass.

// First clean cached pd lists in loaded CLDs
// It's unlikely, but some loaded classes in a dictionary might
// point to a protection_domain that has been unloaded.
// The dictionary pd_set points at entries in the ProtectionDomainCacheTable.

There's a test runtime/Dictionary/ProtectionDomainCache.java that does this. It creates a protection domain with a jar file loading with a URL class loader that's then removed. The defining class loader of the class loaded has that protection domain in its pd_set.

@openjdk
Copy link

openjdk bot commented Apr 9, 2021

@coleenp This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8259242: Remove ProtectionDomainSet_lock

Reviewed-by: dholmes, pchilanomate

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 47 new commits pushed to the master branch:

  • 9bb1863: 8260923: Add more tests for SSLSocket input/output shutdown
  • 33fa855: 8264732: Clean up LinkResolver::vtable_index_of_interface_method()
  • b3782ea: 8264918: [JVMCI] getVtableIndexForInterfaceMethod doesn't check that type and method are related
  • f7a6c63: 8259822: [PPC64] Support the prefixed instruction format added in POWER10
  • a45733f: 8264779: Fix doclint warnings in java/nio
  • 3e57924: 8264885: Fix the code style of macro in aarch64_neon_ad.m4
  • 051c117: 8264923: PNGImageWriter.write_zTXt throws Exception with a typo
  • 1c6b113: 8264513: Cleanup CardTableBarrierSetC2::post_barrier
  • 666fd62: 8264881: Remove the old development option MemProfiling
  • 951f277: 8264874: Build interim-langtools for HotSpot only if Graal is enabled
  • ... and 37 more: https://git.openjdk.java.net/jdk/compare/2f5169965dc6e76116c2fa023b905efea7a6f6a9...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Apr 9, 2021
@coleenp
Copy link
Contributor Author

coleenp commented Apr 9, 2021

Hi David, Yes you have the design correct. It's used for ObjectMonitors, nmethod unloading, implicitly class unloading with ZGC, and now this. The concurrent hashtable uses the globalCounter mechanism, which I briefly considered but it would have required more surgery to the code and would slightly affect read times, and that mechanism is more complicated imo. I could have done something with a lock bit/refcount for readers but that's more error prone and harder to get right. There's a new lockFreeQueue.hpp implementation but this wasn't a queue.
Maybe this should be generalized to be lockFreeList sometime.

I put a block comment before Dictionary::contains_protection_domain, the lock free reader with the NSV which describes the protocol. Hopefully this is clear. Thanks!

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates look good - thanks Coleen!

David

Copy link
Contributor

@pchilano pchilano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks Coleen!

@coleenp
Copy link
Contributor Author

coleenp commented Apr 9, 2021

Thank you Patricio and David!
/integrate

@openjdk openjdk bot closed this Apr 9, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 9, 2021
@coleenp coleenp deleted the findDeadlocks branch April 9, 2021 15:00
@openjdk
Copy link

openjdk bot commented Apr 9, 2021

@coleenp Since your change was applied there have been 47 commits pushed to the master branch:

  • 9bb1863: 8260923: Add more tests for SSLSocket input/output shutdown
  • 33fa855: 8264732: Clean up LinkResolver::vtable_index_of_interface_method()
  • b3782ea: 8264918: [JVMCI] getVtableIndexForInterfaceMethod doesn't check that type and method are related
  • f7a6c63: 8259822: [PPC64] Support the prefixed instruction format added in POWER10
  • a45733f: 8264779: Fix doclint warnings in java/nio
  • 3e57924: 8264885: Fix the code style of macro in aarch64_neon_ad.m4
  • 051c117: 8264923: PNGImageWriter.write_zTXt throws Exception with a typo
  • 1c6b113: 8264513: Cleanup CardTableBarrierSetC2::post_barrier
  • 666fd62: 8264881: Remove the old development option MemProfiling
  • 951f277: 8264874: Build interim-langtools for HotSpot only if Graal is enabled
  • ... and 37 more: https://git.openjdk.java.net/jdk/compare/2f5169965dc6e76116c2fa023b905efea7a6f6a9...master

Your commit was automatically rebased without conflicts.

Pushed as commit 06e6b1f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

4 participants