Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8264393: JDK-8258284 introduced dangling TLH race #3272

Closed
wants to merge 4 commits into from

Conversation

@dcubed-ojdk
Copy link
Member

@dcubed-ojdk dcubed-ojdk commented Mar 30, 2021

I ported some 20 year old tests using JDK-8262881 in order to help
test [~rehn]'s fix for JDK-8257831. These tests in combination with
one piece of the fix from JDK-8257831 revealed a bug in my fix for
JDK-8258284 from back in Dec 2020.

The race revealed by the ported tests from JDK-8262881 happens
only with nested ThreadsListHandles. When TLH2 is destroyed, the
thread updates its threads_hazard_ptr from the TLH2-list to the
TLH1-list; I made this change back in 2020.12 using JDK-8258284.
The threads_hazard_ptr can be observed by a thread calling
ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
time as the TLH1 destructor is decrementing the nested_handle_cnt
that permits the ThreadsList to be freed. So the thread calling
ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
(TLH1-list), but that hazard ptr can be freed and causes lots of
confusion.

Update: This fix along with the fix from JDK-8264123 were stress
tested with the new tests from JDK-8262881.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8264393: JDK-8258284 introduced dangling TLH race

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3272/head:pull/3272
$ git checkout pull/3272

Update a local copy of the PR:
$ git checkout pull/3272
$ git pull https://git.openjdk.java.net/jdk pull/3272/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3272

View PR using the GUI difftool:
$ git pr show -t 3272

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3272.diff

dcubed-ojdk added 2 commits Mar 26, 2021
@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Mar 30, 2021

👋 Welcome back dcubed! A progress list of the required criteria for merging this PR into pr/3255 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Mar 30, 2021

@dcubed-ojdk The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Mar 30, 2021

@fisk, @robehn and @dholmes-ora - This one is a Threads-SMR race fix that I used
"JDK-8264123 add ThreadsList.is_valid() support" to find. It would be good to have
you guys brush off your Thread-SMR memories and review this fix.

@dcubed-ojdk dcubed-ojdk marked this pull request as ready for review Mar 30, 2021
@openjdk openjdk bot added the rfr label Mar 30, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Mar 30, 2021

Webrevs

@dcubed-ojdk dcubed-ojdk mentioned this pull request Mar 30, 2021
3 of 3 tasks complete
Copy link
Member

@dholmes-ora dholmes-ora left a comment

Hi Dan,

So in a nutshell, when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.

This seems reasonable to me.

Thanks,
David

// thread's hazard ptr is handled by acquire_stable_list_fast_path().
// And that protocol cannot be properly done with a ThreadsList that
// might not be the current system ThreadsList.
assert(_previous->_list->_nested_handle_cnt > 0, "must be > than zero");

This comment has been minimized.

@dholmes-ora

dholmes-ora Mar 31, 2021
Member

Nit: "> than" reads "greater than than".

This comment has been minimized.

@dcubed-ojdk

dcubed-ojdk Mar 31, 2021
Author Member

Fixed.

@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Mar 31, 2021

@dholmes-ora - your nutshell summary is spot on. Thanks for the review.

@robehn
Copy link
Contributor

@robehn robehn commented Mar 31, 2021

Hi Dan,

As you describe it and how it look to me, isn't the issue just that we decrement before before reinstating the old list?
So if we do first publish the previous list as current list and then decrement the nested handle count it should be okay?
E.g.:
_thread->set_threads_hazard_ptr(_previous->_list);
_list->dec_nested_handle_cnt();

So you just need to move the "if (_has_ref_count) {" piece of code after the potential reinstating of the previous list.

Or am I missing something?

Thanks for finding it!

@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Mar 31, 2021

@robehn - Thanks for reviewing the fix. Yes, I think you have missed something. :-)

I modeled the analysis of this race after one of your favorite race techniques
in my analysis.ThreadsList.for_JBS attachment to the bug report: since there
is nothing to force the two threads to interact with each in a particular order,
I posited delays at various points in the execution of each thread. This quote
from the analysis.ThreadsList.for_JBS attachment describes scenario:

Race note:
    THR-1 is the thread calling the TLH destructors.
    THR-2 is the exiting thread calling ThreadsSMRSupport::free_list.

    If THR-2's ThreadsSMRSupport::free_list() call finishes its scan of of
    the active Threads _threads_hazard_ptr values BEFORE the TLH-2
    destructor code sequence updates THR-1->_threads_hazard_ptr from TL-2
    to TL-1, then TL-2 and not TL-1 will be on the list of in-use
    ThreadsLists:

      // Gather a hash table of the current hazard ptrs:
      ThreadScanHashtable *scan_table = new ThreadScanHashtable(hash_table_size);
      ScanHazardPtrGatherThreadsListClosure scan_cl(scan_table);
      threads_do(&scan_cl);

    At this point, THR-2's ThreadsSMRSupport::free_list() call stalls and
    THR-1 not only finishes the TLH-2 destructor, it also finishes its use
    of TLH-1 as described in the next section and starts to run the TLH-1
    destructor.

After the first ThreadsListHandle is released for THR-1:

+----------------------------------+
| THR-1                            |
+----------------------------------+
| _threads_hazard_ptr=0            |
| _threads_list_ptr=0              |
| _nested_threads_hazard_ptr_cnt=0 |
+----------------------------------+

                                         +----------------------+
                                         | TL-1                 |
                                         +----------------------+
                                         | _length=XXXXXXXXXXXX |
                                         | _next_list=XXXXXXXXX |
                                         | _threads[5]=XXXXXXXX |
                                         | _nested_handle_cnt=X |
                                         +----------------------+

Race note:
    THR-1 is running the TLH-1 destructor and has decremented the TL-1
    _nested_handle_cnt, but stalls before it clears _threads_hazard_ptr.

    The THR-2's ThreadsSMRSupport::free_list() call continues executing and
    checks the _to_delete_list ThreadsLists and if they are not in the
    scan_table and have a _nested_handle_cnt == 0 then, they are freed.

    This is how TL-1 is freed, but still remains in THR-1's
    _threads_hazard_ptr field and can be observed later by THR-2 as a valid
    hazard ptr in its call to smr_delete() on itself or by another thread
    perusing the system ThreadsList. This is especially true after
    ThreadsSMRSupport::free_list() has finished its work and released the
    Threads_lock which will allow another thread to walk the set of hazard
    ptrs.

    THR-1 resumes running again and clears _threads_hazard_ptr. However,
    the other thread walking the set of hazard ptrs has the stale TL-1
    value and tries to use it. Boom!

Switching the decrement:
_list->dec_nested_handle_cnt()
to happen after the:
_thread->set_threads_hazard_ptr(_previous->_list)
doesn't help because THR-2 observed TL-2 before we
reached that code and then THR-2 stalled until after all
the updates were made. THR-2 recorded TL-2 in the
collection of current hazard ptrs and THR-2 knows nothing
about TL-1 being a valid hazard ptr so THR-2 can free it.

@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Mar 31, 2021

I tested JDK-8264123 together with this fix (JDK-8264393) in Mach5 Tier[1-7]
and there are no regressions.

@robehn
Copy link
Contributor

@robehn robehn commented Apr 1, 2021

Hi Dan, yes thanks.

So I would say, you may not install a ThreadsList into your hazard pointer if it's on the _to_delete_list.
Got it, thanks.

@robehn
robehn approved these changes Apr 1, 2021
@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Apr 1, 2021

@robehn - Thanks for closing the loop on your review thread.

@dholmes-ora nutshell summary covers it:

when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.

Another way to put it is that the _threads_hazard_ptr field must only be set to
a non-NULL value by the acquire_stable_list() protocol.

@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Apr 1, 2021

@fisk - since I'm tweaking your code (again), I really need you to chime in on
this review thread.

@fisk
fisk approved these changes Apr 3, 2021
Copy link
Contributor

@fisk fisk left a comment

Not sure how I missed this race earlier. I never originally intended for hazard pointers to be set when exiting nested ThreadsListHandles. Anyway - the problem is understood and the fix looks good.

@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Apr 3, 2021

@fisk - Thanks for the review. The fault is mine from when I
worked on JDK-8258284. I even added the new test that
verified the improper behavior. In any case, it's fixed now and
comments are added that should prevent any of us that play
with this code from making the same mistake again.

@openjdk-notifier openjdk-notifier bot changed the base branch from pr/3255 to master Apr 3, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Apr 3, 2021

@dcubed-ojdk This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8264393: JDK-8258284 introduced dangling TLH race

Reviewed-by: dholmes, rehn, eosterlund

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 75 new commits pushed to the master branch:

  • 9b2232b: 8264123: add ThreadsList.is_valid() support
  • e8eda65: 8264664: use text blocks in javac module tests
  • cec66cf: 8264572: ForkJoinPool.getCommonPoolParallelism() reports always 1
  • 9c283da: 8264662: ProblemList vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java on win-x64 with ZGC
  • eb0ac86: 8264655: Minor internal doc comment cleanup
  • 3991b32: 8264539: Improve failure message of java/nio/file/WatchService/SensitivityModifier.java
  • 4133ded: 8264658: ProblemList javax/swing/JInternalFrame/8146321/JInternalFrameIconTest.java on linux-x64
  • 220ddbd: 8264657: ProblemList java/awt/Focus/FrameMinimizeTest/FrameMinimizeTest.java on linux-x64
  • d0f3cc9: 8264656: ProblemList sun/security/ssl/SSLSocketImpl/SSLSocketImplThrowsWrongExceptions.java on linux-x64
  • 6c145c4: 8264544: Case-insensitive comparison issue with supplementary characters.
  • ... and 65 more: https://git.openjdk.java.net/jdk/compare/8cf1c62c345ce91208549294192944a1efe717cf...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Apr 3, 2021
@dcubed-ojdk
Copy link
Member Author

@dcubed-ojdk dcubed-ojdk commented Apr 3, 2021

/integrate

@openjdk openjdk bot closed this Apr 3, 2021
@openjdk openjdk bot added integrated and removed ready labels Apr 3, 2021
@openjdk openjdk bot removed the rfr label Apr 3, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Apr 3, 2021

@dcubed-ojdk Since your change was applied there have been 75 commits pushed to the master branch:

  • 9b2232b: 8264123: add ThreadsList.is_valid() support
  • e8eda65: 8264664: use text blocks in javac module tests
  • cec66cf: 8264572: ForkJoinPool.getCommonPoolParallelism() reports always 1
  • 9c283da: 8264662: ProblemList vmTestbase/jit/escape/AdaptiveBlocking/AdaptiveBlocking001/AdaptiveBlocking001.java on win-x64 with ZGC
  • eb0ac86: 8264655: Minor internal doc comment cleanup
  • 3991b32: 8264539: Improve failure message of java/nio/file/WatchService/SensitivityModifier.java
  • 4133ded: 8264658: ProblemList javax/swing/JInternalFrame/8146321/JInternalFrameIconTest.java on linux-x64
  • 220ddbd: 8264657: ProblemList java/awt/Focus/FrameMinimizeTest/FrameMinimizeTest.java on linux-x64
  • d0f3cc9: 8264656: ProblemList sun/security/ssl/SSLSocketImpl/SSLSocketImplThrowsWrongExceptions.java on linux-x64
  • 6c145c4: 8264544: Case-insensitive comparison issue with supplementary characters.
  • ... and 65 more: https://git.openjdk.java.net/jdk/compare/8cf1c62c345ce91208549294192944a1efe717cf...master

Your commit was automatically rebased without conflicts.

Pushed as commit f259eea.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dcubed-ojdk dcubed-ojdk deleted the dcubed-ojdk:JDK-8264393 branch Apr 3, 2021
@robehn
Copy link
Contributor

@robehn robehn commented Apr 6, 2021

Not sure how I missed this race earlier. I never originally intended for hazard pointers to be set when exiting nested ThreadsListHandles. Anyway - the problem is understood and the fix looks good.

I don't think you did, because originally we never intended them to be nested.
I still feel that the simple solution to nesting: pausing deletion on nesting, thus never changing the hazard pointer would be better (as in much simpler).
The only drawback is if there is a pathological case were we always have one thread with nested ThreadsLists (but hard to imagine).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants