8267579: Thread::cooked_allocated_bytes() hits assert(left >= right) failed: avoid underflow by mgronlun · Pull Request #4363 · openjdk/jdk

mgronlun · 2021-06-04T15:30:53Z

This is a workaround to avoid hitting the assert that was recently added to pointer_delta().

The implementation of cooked_allocated_bytes() is perhaps questionable, in that it reads tlab pointers optimistically. However, the functionality has been in place for a long time, and the impact of changing its behavior more substantially is unknown at this time, hence this defensive workaround to reduce noise and problems seen in general testing.

Testing: tier1, tier2, tier6, tier8

Thanks
Markus

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8267579: Thread::cooked_allocated_bytes() hits assert(left >= right) failed: avoid underflow

Reviewers

Daniel D. Daugherty (@dcubed-ojdk - Reviewer)
Stefan Karlsson (@stefank - Reviewer)
Kim Barrett (@kimbarrett - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4363/head:pull/4363
$ git checkout pull/4363

Update a local copy of the PR:
$ git checkout pull/4363
$ git pull https://git.openjdk.java.net/jdk pull/4363/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4363

View PR using the GUI difftool:
$ git pr show -t 4363

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4363.diff

bridgekeeper · 2021-06-04T15:31:07Z

👋 Welcome back mgronlun! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2021-06-04T15:33:19Z

@mgronlun The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mgronlun · 2021-06-04T15:34:47Z

/label add hotspot-jfr

openjdk · 2021-06-04T15:35:26Z

@mgronlun
The hotspot-jfr label was successfully added.

mlbridge · 2021-06-04T15:39:08Z

Webrevs

01: Full - Incremental (46eeb2d8)
00: Full (8e2bc9a0)

dcubed-ojdk

Thumbs up.

openjdk · 2021-06-05T15:39:17Z

@mgronlun This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8267579: Thread::cooked_allocated_bytes() hits assert(left >= right) failed: avoid underflow

Reviewed-by: dcubed, stefank, kbarrett

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 140 new commits pushed to the master branch:

94d0b0f: 8268565: runtime/records/RedefineRecord.java should be run in driver mode
df65237: 8267930: Refine code for loading hsdis library
2e900da: 8268574: ProblemList tests failing due to UseBiasedLocking going away
4fd2a14: 8267556: Enhance class paths check during runtime
8c8422e: 8267893: Improve jtreg test failure handler do get native/mixed stack traces for cores and live processes
1e1039a: 8268223: Problemlist vmTestbase/nsk/jdi/HiddenClass/events/events001.java
78cb677: 8268539: several serviceability/sa tests should be run in driver mode
7267227: 8268361: Fix the infinite loop in next_line
b018c45: 8267630: Start of release updates for JDK 18
7400789: 8268542: serviceability/logging/TestFullNames.java tests only 1st test case
... and 130 more: https://git.openjdk.java.net/jdk/compare/59a539fef12dec6ba8af8a41000829402e7e9b72...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

dholmes-ora

Hi Markus,

So this fix simply unpacks used_bytes() to avoid calling pointer_delta with values that trigger the assertion. But given we see other problems with that assertion ie JDK-8268265, then could it not be that the assertion is in fact the problem? Is the assertion not making a usage assumption that is simply not valid?

Thanks,
David

stefank

This is a race-condition and will potentially fail just the same way the bug in #4224 did. There needs to be at least Atomic::loads of the the variables, or the compiler could convert the loads into multiple loads.

dholmes-ora · 2021-06-07T07:44:25Z

@stefank has explained in the bug report why the assertion is not the issue - thanks. So I have to question the validity of just side-stepping the assertion without trying to fix the broken code. What are the implications of finding these broken invariants in the product code? Do we just report/print an incorrect value?

mgronlun · 2021-06-07T10:13:29Z

@dholmes-ora The existing mechanism is problematic in that it is racy, it reads the pointers being updated by another thread. It attempts to parry for reading problematic values by comparing the used_bytes against the max_tlab_size, to avoid reporting really offside values. A safer means would be to pick up and report only the bytes the owner of the tlab reported already (I..e _allocated_bytes), but this means the value reported will then trail by one tlab, and the impact of doing this is right now is unknown. It could also involve introducing some protocol in how the threads publish their values from their TLABs. All of this can be done of course, but I am not sure this can be addressed straight away. Since testing suffers because the existing behaviour that has been in place for years is now starting to hit the newly introduced assert, this is a workaround, not a fix. It's not clear what the real fix is, so maybe we can create another issue to attempt to figure that out.

mlbridge · 2021-06-07T10:33:34Z

Mailing list message from David Holmes on hotspot-runtime-dev:

On 7/06/2021 8:16 pm, Markus Gr?nlund wrote:

On Mon, 7 Jun 2021 07:44:25 GMT, David Holmes <dholmes at openjdk.org> wrote:

This is a race-condition and will potentially fail just the same way the bug in #4224 did. There needs to be at least Atomic::loads of the the variables, or the compiler could convert the loads into multiple loads.

@stefank has explained in the bug report why the assertion is not the issue - thanks. So I have to question the validity of just side-stepping the assertion without trying to fix the broken code. What are the implications of finding these broken invariants in the product code? Do we just report/print an incorrect value?

@dholmes-ora The existing mechanism is broken in that it is racy, it reads the pointers being updated by another thread. It attempts to parry for reading problematic values by comparing the used_bytes against the max_tlab_size, to avoid reporting really offside values. A safer means would be to pick up and report only the bytes the owner of the tlab reported already (I..e _allocated_bytes), but this means the value reported will then trail by one tlab, and the impact of doing this is right now is unknown. It could also involve introducing some protocol in how the threads publish their values from their TLABs. All of this can be done of course, but I am not sure this can be addressed straight away. Since testing suffers because the existing behaviour that has been in place for years is now starting to hit the newly introduced assert, this is a workaround, not a fix. It's not clear what the real fix is, so maybe we can create another issue to attempt to figure that out.

Okay, please file an issue for the real issue.

Thanks,
David

dcubed-ojdk

Thumbs up.

Should the _start and _top fields be marked volatile to
prevent the S390 compiler from "optimizing" like it did with
another recent bug (JDK-8267842)?

mgronlun · 2021-06-08T09:30:51Z

@dcubed-ojdk Thanks Dan for looking at this. Hmm...having to introduce an explicit "volatile" keyword implies the "read-once" promise of Atomic::load() is not sufficient? I was told the side-effects of via Atomic::load(), which makes expressions volatile-qualified, are sufficient for read-once. But maybe not for all platforms?

If "volatile" decorations are required, I would prefer, granted it's sufficient, to only decorate the local variables. Having to decorate the tlab pointers could result in introducing unnecessary ordering constraints, i.e. for the TLAB owner, which could perhaps hurt performance?

dcubed-ojdk · 2021-06-08T15:34:03Z

No problem. I believe that @dholmes-ora thinks the bug that we
fixed with JDK-8267842 is a compiler bug for S390.

mgronlun · 2021-06-09T08:46:14Z

@stefank @dholmes-ora Are you ok with the suggested fix? The decision to have the impl in the .cpp is to avoid having to include runtime/atomic.hpp in the .inline.hpp file unnecessarily (not deemed performance sensitive). Let me know if you prefer the impl in the .inline.hpp instead.

dholmes-ora · 2021-06-09T13:36:31Z

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp

  return MAX2(reserve_size, (size_t)_reserve_for_allocation_prefetch);
 }
+
+const HeapWord* ThreadLocalAllocBuffer::start_relaxed() const {


Why do we need the "relaxed" terminology? By default accessors have no memory ordering properties.

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

mlbridge · 2021-06-10T04:35:48Z

Mailing list message from David Holmes on hotspot-jfr-dev:

On 9/06/2021 11:54 pm, Stefan Karlsson wrote:

On Wed, 9 Jun 2021 13:36:31 GMT, David Holmes <dholmes at openjdk.org> wrote:

Markus Gr?nlund has updated the pull request incrementally with one additional commit since the last revision:

read once

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp line 478:

476: }
477:
478: const HeapWord* ThreadLocalAllocBuffer::start_relaxed() const {

Why do we need the "relaxed" terminology? By default accessors have no memory ordering properties.

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

I see this terminology is used in some of the GC code but I was not
aware of its existence. I also see nothing in atomic.hpp that states
that Atomic::load/store implement memory_order_relaxed ?? Is the
difference between unordered and relaxed what you get from applying
volatile to the variable being accessed? So unordered from a h/w
perspective but with some constraints on the compiler to maintain
program order?

Thanks,
David
-----

mlbridge · 2021-06-10T04:35:50Z

Mailing list message from David Holmes on hotspot-jfr-dev:

On 9/06/2021 11:54 pm, Stefan Karlsson wrote:

On Wed, 9 Jun 2021 13:36:31 GMT, David Holmes <dholmes at openjdk.org> wrote:

Markus Gr?nlund has updated the pull request incrementally with one additional commit since the last revision:

read once

src/hotspot/share/gc/shared/threadLocalAllocBuffer.cpp line 478:

476: }
477:
478: const HeapWord* ThreadLocalAllocBuffer::start_relaxed() const {

Why do we need the "relaxed" terminology? By default accessors have no memory ordering properties.

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

I see this terminology is used in some of the GC code but I was not
aware of its existence. I also see nothing in atomic.hpp that states
that Atomic::load/store implement memory_order_relaxed ?? Is the
difference between unordered and relaxed what you get from applying
volatile to the variable being accessed? So unordered from a h/w
perspective but with some constraints on the compiler to maintain
program order?

Thanks,
David
-----

stefank · 2021-06-10T06:52:13Z

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

I see this terminology is used in some of the GC code but I was not
aware of its existence.

This is not a GC-specific terminology, but a C++11 terminology:
https://en.cppreference.com/w/cpp/atomic/memory_order

I also see nothing in atomic.hpp that states
that Atomic::load/store implement memory_order_relaxed ??

In atomic.hpp there are references to relaxed at the top of the file:

enum atomic_memory_order {
  // The modes that align with C++11 are intended to
  // follow the same semantics.
  memory_order_relaxed = 0,
  memory_order_acquire = 2,
  memory_order_release = 3,
  memory_order_acq_rel = 4,
  // Strong two-way memory barrier.
  memory_order_conservative = 8
};

But you are right, nothing here helps the reader understand that Atomic::load implies relaxed. This needs to be better documented.

Is the
difference between unordered and relaxed what you get from applying
volatile to the variable being accessed? So unordered from a h/w
perspective but with some constraints on the compiler to maintain
program order?

I talked to @fisk about this yesterday, and IIUC relaxed access are not supposed to be reordered by the HW, but volatile access could be.

Thanks,
David

mlbridge · 2021-06-10T08:00:49Z

Mailing list message from David Holmes on hotspot-jfr-dev:

On 10/06/2021 4:55 pm, Stefan Karlsson wrote:

On Thu, 10 Jun 2021 04:35:50 GMT, David Holmes <david.holmes at oracle.com> wrote:

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

I see this terminology is used in some of the GC code but I was not
aware of its existence.

This is not a GC-specific terminology, but a C++11 terminology:
https://en.cppreference.com/w/cpp/atomic/memory_order

I meant in the context of the hotspot sources and its use in method names.

I also see nothing in atomic.hpp that states
that Atomic::load/store implement memory_order_relaxed ??

In atomic.hpp there are references to relaxed at the top of the file:

enum atomic_memory_order {
// The modes that align with C++11 are intended to
// follow the same semantics.
memory_order_relaxed = 0,
memory_order_acquire = 2,
memory_order_release = 3,
memory_order_acq_rel = 4,
// Strong two-way memory barrier.
memory_order_conservative = 8
};

But you are right, nothing here helps the reader understand that Atomic::load implies relaxed. This needs to be better documented.

What am I missing - I can't see any memory barrier related instructions
associated with basic Atomic::load and store. Granted store needs
nothing on TSO and on non-TSO the barrier may be implicit in whatever
platform intrinsic is used; but loads would also need explicit barriers
to prevent reordering ??

David
-----

mlbridge · 2021-06-10T08:00:50Z

Mailing list message from David Holmes on hotspot-jfr-dev:

On 10/06/2021 4:55 pm, Stefan Karlsson wrote:

On Thu, 10 Jun 2021 04:35:50 GMT, David Holmes <david.holmes at oracle.com> wrote:

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

I see this terminology is used in some of the GC code but I was not
aware of its existence.

This is not a GC-specific terminology, but a C++11 terminology:
https://en.cppreference.com/w/cpp/atomic/memory_order

I meant in the context of the hotspot sources and its use in method names.

I also see nothing in atomic.hpp that states
that Atomic::load/store implement memory_order_relaxed ??

In atomic.hpp there are references to relaxed at the top of the file:

enum atomic_memory_order {
// The modes that align with C++11 are intended to
// follow the same semantics.
memory_order_relaxed = 0,
memory_order_acquire = 2,
memory_order_release = 3,
memory_order_acq_rel = 4,
// Strong two-way memory barrier.
memory_order_conservative = 8
};

But you are right, nothing here helps the reader understand that Atomic::load implies relaxed. This needs to be better documented.

What am I missing - I can't see any memory barrier related instructions
associated with basic Atomic::load and store. Granted store needs
nothing on TSO and on non-TSO the barrier may be implicit in whatever
platform intrinsic is used; but loads would also need explicit barriers
to prevent reordering ??

David
-----

stefank · 2021-06-10T09:16:25Z

Mailing list message from David Holmes on hotspot-jfr-dev:

On 10/06/2021 4:55 pm, Stefan Karlsson wrote:

On Thu, 10 Jun 2021 04:35:50 GMT, David Holmes <david.holmes at oracle.com> wrote:

I'm not sure what you are asking, but Atomic::load implements the "relaxed" semantics equivalent to using MO_RELAXED in the Access API.

I see this terminology is used in some of the GC code but I was not
aware of its existence.

This is not a GC-specific terminology, but a C++11 terminology:
https://en.cppreference.com/w/cpp/atomic/memory_order

I meant in the context of the hotspot sources and its use in method names.

I guess the other parts of HotSpot needs to get on board with this terminology now that we have gone to C++11.

I also see nothing in atomic.hpp that states
that Atomic::load/store implement memory_order_relaxed ??

In atomic.hpp there are references to relaxed at the top of the file:
enum atomic_memory_order {
// The modes that align with C++11 are intended to
// follow the same semantics.
memory_order_relaxed = 0,
memory_order_acquire = 2,
memory_order_release = 3,
memory_order_acq_rel = 4,
// Strong two-way memory barrier.
memory_order_conservative = 8
};
But you are right, nothing here helps the reader understand that Atomic::load implies relaxed. This needs to be better documented.

What am I missing - I can't see any memory barrier related instructions
associated with basic Atomic::load and store. Granted store needs
nothing on TSO and on non-TSO the barrier may be implicit in whatever
platform intrinsic is used; but loads would also need explicit barriers
to prevent reordering ??

AFAIK, the code was written prior to C++11 and nothing else is needed on our currently supported platforms. Though I've heard stories about some platforms that probably would have to fix this. I've also heard @fisk argue that we should replace the Atomic::load implementation with the compiler's version of relaxed atomic loads.

David

stefank · 2021-06-10T13:03:54Z

Kim has been looking through the standards w.r.t. atomic and read-read coherence and has some insights. Could be worth reading:
https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-June/052745.html

kimbarrett

Looks good.

kimbarrett · 2021-06-11T06:57:25Z

Kim has been looking through the standards w.r.t. atomic and read-read coherence and has some insights. Could be worth reading:
https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-June/052745.html

read-read coherence only applies to reads of the same atomic object. That doesn't apply to the case at hand.

stefank · 2021-06-11T07:07:50Z

Kim has been looking through the standards w.r.t. atomic and read-read coherence and has some insights. Could be worth reading:
https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-June/052745.html

read-read coherence only applies to reads of the same atomic object. That doesn't apply to the case at hand.

Thanks for pointing this out. FWIW, the discussion above strayed away from the actual contents PR, and started to talk about the name and semantics of Atomic::load / relaxed atomic loads, in a more generic sense than the suggested usage of it in the patch above.

mgronlun · 2021-06-14T10:39:14Z

/integrate

openjdk · 2021-06-14T10:41:12Z

Going to push as commit c420735.
Since your change was applied there have been 147 commits pushed to the master branch:

51f3427: 8266791: Annotation property which is compiled as an array property but changed to a single element throws NullPointerException
ba601b8: 8268520: VirtualSpace::print_on() should be const
1ba4e0c: 8257038: Remove expired flags in JDK 18
90c1034: 8268644: ProblemList serviceability/sa/ClhsdbJstackXcompStress.java in -Xcomp mode
5cee23a: 8265518: C1: Intrinsic support for Preconditions.checkIndex
a466b49: 8267634: Update --release 17 symbol information for JDK 17 build 26
49112fa: 8265909: build.tools.dtdbuilder.DTDBuilder.java failed detecting missing path of dtd_home
94d0b0f: 8268565: runtime/records/RedefineRecord.java should be run in driver mode
df65237: 8267930: Refine code for loading hsdis library
2e900da: 8268574: ProblemList tests failing due to UseBiasedLocking going away
... and 137 more: https://git.openjdk.java.net/jdk/compare/59a539fef12dec6ba8af8a41000829402e7e9b72...master

Your commit was automatically rebased without conflicts.

openjdk · 2021-06-14T10:41:30Z

@mgronlun Pushed as commit c420735.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk · 2021-06-14T10:54:51Z

@mgronlun Unknown command backport - for a list of valid commands use /help.

openjdk · 2021-06-14T10:55:51Z

@mgronlun Available commands:

covered - used when employer has signed the OCA
help - shows this text
signed - used after signing the OCA
test - used to run tests

mgronlun added 2 commits June 4, 2021 17:06

Workaround JDK-8260046

5c3b2aa

comment

8e2bc9a

openjdk bot added the rfr Pull request is ready for review label Jun 4, 2021

openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Jun 4, 2021

openjdk bot added the hotspot-jfr hotspot-jfr-dev@openjdk.org label Jun 4, 2021

dcubed-ojdk approved these changes Jun 5, 2021

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Jun 5, 2021

dholmes-ora reviewed Jun 7, 2021

View reviewed changes

stefank suggested changes Jun 7, 2021

View reviewed changes

read once

46eeb2d

dcubed-ojdk approved these changes Jun 7, 2021

View reviewed changes

dholmes-ora reviewed Jun 9, 2021

View reviewed changes

kimbarrett approved these changes Jun 11, 2021

View reviewed changes

stefank approved these changes Jun 11, 2021

View reviewed changes

openjdk bot closed this Jun 14, 2021

openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 14, 2021

Conversation

mgronlun commented Jun 4, 2021 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewers

Reviewing

Uh oh!

bridgekeeper bot commented Jun 4, 2021

Uh oh!

openjdk bot commented Jun 4, 2021

Uh oh!

mgronlun commented Jun 4, 2021

Uh oh!

openjdk bot commented Jun 4, 2021

Uh oh!

mlbridge bot commented Jun 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

dcubed-ojdk left a comment

Choose a reason for hiding this comment

Uh oh!

openjdk bot commented Jun 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dholmes-ora left a comment

Choose a reason for hiding this comment

Uh oh!

stefank left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dholmes-ora commented Jun 7, 2021

Uh oh!

mgronlun commented Jun 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mlbridge bot commented Jun 7, 2021

Uh oh!

dcubed-ojdk left a comment

Choose a reason for hiding this comment

Uh oh!

mgronlun commented Jun 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dcubed-ojdk commented Jun 8, 2021

Uh oh!

mgronlun commented Jun 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dholmes-ora Jun 9, 2021

Choose a reason for hiding this comment

Uh oh!

stefank Jun 9, 2021

Choose a reason for hiding this comment

Uh oh!

mlbridge bot commented Jun 10, 2021

Uh oh!

mlbridge bot commented Jun 10, 2021

Uh oh!

stefank commented Jun 10, 2021

Thanks, David

Uh oh!

mlbridge bot commented Jun 10, 2021

Uh oh!

mlbridge bot commented Jun 10, 2021

Uh oh!

stefank commented Jun 10, 2021

David

Uh oh!

stefank commented Jun 10, 2021

Uh oh!

kimbarrett left a comment

Choose a reason for hiding this comment

Uh oh!

kimbarrett commented Jun 11, 2021

Uh oh!

stefank commented Jun 11, 2021

Uh oh!

mgronlun commented Jun 4, 2021 •

edited by openjdk bot

Loading

mlbridge bot commented Jun 4, 2021 •

edited

Loading

openjdk bot commented Jun 5, 2021 •

edited

Loading

stefank left a comment •

edited

Loading

mgronlun commented Jun 7, 2021 •

edited

Loading

mgronlun commented Jun 8, 2021 •

edited

Loading

mgronlun commented Jun 9, 2021 •

edited

Loading

Thanks,
David