8308766: TLAB initialization may cause div by zero #14121

tschatzl · 2023-05-24T11:50:02Z

Hi all,

Can I have reviews for this change, which fixes a floating-point division by zero?

In ThreadLocalAllocBuffer::initialize() we initialize the TLAB using the TLAB capacity currently available to the thread. In G1 this capacity can be zero in some situations, which leads to the division by zero (see the CR for the crash that occurs when an assert is added).
The suggested fix is to simply skip sampling at this point; regular TLAB resizing will correct the TLAB size later.
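
To make the idea concrete, here is a minimal standalone sketch of "do not sample when the capacity is zero". It is not the actual patch: `FractionAverage` and `sample_allocation_fraction` are hypothetical names, and the plain arithmetic mean only stands in for whatever averaging `_allocation_fraction` really uses.

```cpp
#include <cstddef>

// Hypothetical stand-in for the averaging helper behind _allocation_fraction.
// A plain arithmetic mean is an assumption made for this sketch only.
struct FractionAverage {
  float sum = 0.0f;
  unsigned count = 0;

  void sample(float value) {
    sum += value;
    ++count;
  }

  float average() const {
    return count == 0 ? 0.0f : sum / count;
  }
};

// Records a new allocation fraction only if the capacity is non-zero.
// Returns true if a sample was recorded, false if sampling was skipped.
bool sample_allocation_fraction(FractionAverage& avg,
                                size_t desired_size_words,
                                size_t target_refills,
                                size_t capacity_words) {
  if (capacity_words == 0) {
    // Nothing sensible to sample; a later resize can correct the TLAB size.
    return false;
  }
  float alloc_frac = desired_size_words * target_refills / (float)capacity_words;
  avg.sample(alloc_frac);
  return true;
}
```

With a capacity of zero the function returns without touching the average, so no floating-point division by zero takes place.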

Only G1 seems to be affected, as it appears to be the only GC that uses a dynamic value for the capacity available for TLAB allocation. The other GCs seem to use either total heap capacity (Z, Shenandoah) or eden capacity (Serial, Parallel).
I am not sure that is actually better, and I think it won't result in the expected behavior either (every thread should reload TLABs target_refills() times per mutator phase). Since even with G1 eden is maximal at TLAB resizing time, this hiccup at initialization does not seem too bad.

This may also be the cause for the behavior observed in https://bugs.openjdk.org/browse/JDK-8264798.

Testing: gha

Thanks,
Thomas

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8308766: TLAB initialization may cause div by zero

Reviewers

Aleksey Shipilev (@shipilev - Reviewer)
Albert Mingkun Yang (@albertnetymk - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14121/head:pull/14121
$ git checkout pull/14121

Update a local copy of the PR:
$ git checkout pull/14121
$ git pull https://git.openjdk.org/jdk.git pull/14121/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14121

View PR using the GUI difftool:
$ git pr show -t 14121

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14121.diff


bridgekeeper · 2023-05-24T11:51:08Z

👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2023-05-24T11:53:52Z

@tschatzl The following label will be automatically applied to this pull request:

hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2023-05-24T11:58:00Z

Webrevs

00: Full (2d3ece65)

albertnetymk · 2023-05-25T13:56:46Z

Looking at where _allocation_fraction is accessed, wouldn't a variable capacity cause the alloc-amount to be miscalculated? I'd expect capacity to be const to more accurately track/predict #alloc-bytes.

  // ** sampling place ** //
  size_t capacity = Universe::heap()->tlab_capacity(thread()) / HeapWordSize;
  float alloc_frac = desired_size() * target_refills() / (float)capacity;
  _allocation_fraction.sample(alloc_frac);

  // ** where it's used ** //
  // Compute the next tlab size using expected allocation amount
  size_t alloc = (size_t)(_allocation_fraction.average() *
                          (Universe::heap()->tlab_capacity(thread()) / HeapWordSize));
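
The concern can be illustrated with a small standalone sketch of the two places above. The function names are hypothetical, and the averaging step is left out so that only the effect of the capacity is visible; the caller is assumed to pass a non-zero capacity when sampling.

```cpp
#include <cstddef>

// Fraction of the TLAB capacity that one thread is expected to allocate,
// computed the same way as at the sampling place above.
// Assumes capacity_words > 0.
float sampled_fraction(size_t desired_size_words,
                       size_t target_refills,
                       size_t capacity_words) {
  return desired_size_words * target_refills / (float)capacity_words;
}

// Expected allocation in words, computed the same way as at the place of use:
// the fraction is scaled by the capacity that is current at that time.
size_t expected_allocation(float fraction, size_t capacity_words) {
  return (size_t)(fraction * capacity_words);
}
```

If the capacity at the place of use equals the capacity at sampling time, the result is simply `desired_size * target_refills`. If the capacity has halved in between, the expected allocation halves as well, which is the effect the question is about.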

tschatzl · 2023-05-25T14:11:06Z


Where the capacity is used, i.e. during the GC pause, Universe::heap()->tlab_capacity() is effectively a constant in G1: it reflects the eden size for the next mutator phase.

The open question is what to do during TLAB initialization when a thread is attached at an arbitrary time. This can happen at any point while the mutator is running, so eden may already be partially exhausted. Do you want target_refills() reloads until eden is exhausted, or sizing as if the thread had been running since the start of the mutator phase, or something completely different?

Serial and Parallel calculate it as if eden were empty, Shenandoah and Z seem to use total heap capacity (they are single-generational), and G1 uses the remaining eden capacity, each with different effects.

(FWIW, if there is an issue with that logic, it is pre-existing.)
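
As a rough illustration of the three choices described above, the following standalone sketch computes the capacity a newly attached thread would see under each of them. `HeapSnapshot`, `CapacityPolicy` and `tlab_capacity_words` are hypothetical names invented for this example; they are not the collectors' actual tlab_capacity() implementations.

```cpp
#include <cstddef>

// Hypothetical snapshot of the heap at the moment a thread is attached.
struct HeapSnapshot {
  size_t heap_words;       // total heap capacity
  size_t eden_words;       // eden capacity for the current mutator phase
  size_t eden_used_words;  // eden already consumed by running threads
};

// The three ways of choosing the capacity mentioned in the discussion.
enum class CapacityPolicy {
  TotalHeap,      // described for Shenandoah and Z
  WholeEden,      // described for Serial and Parallel (as if eden were empty)
  RemainingEden   // described for G1
};

size_t tlab_capacity_words(CapacityPolicy policy, const HeapSnapshot& s) {
  switch (policy) {
    case CapacityPolicy::TotalHeap:
      return s.heap_words;
    case CapacityPolicy::WholeEden:
      return s.eden_words;
    case CapacityPolicy::RemainingEden:
      // Can legitimately be zero when eden is fully used.
      return s.eden_used_words >= s.eden_words
                 ? 0
                 : s.eden_words - s.eden_used_words;
  }
  return 0;  // not reached for valid enum values
}
```

Only the remaining-eden choice can yield zero while the heap itself is non-empty, which matches the observation that only G1 seems to be affected.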

shipilev

OK, so this does happen when a new thread comes at an unfortunate time in the VM lifecycle, like on shutdown? Anyway, the fix looks okay. I think many other versions are also affected; can you please add the relevant Affected-Versions to the bug?

openjdk · 2023-05-25T15:03:20Z

@tschatzl This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308766: TLAB initialization may cause div by zero

Reviewed-by: shade, ayang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 123 new commits pushed to the master branch:

eae1f59: 8309159: Some minor comment and code cleanup in jdk/com/sun/jdi/PopFramesTest.java
45473ef: 8309230: ProblemList jdk/incubator/vector/Float64VectorTests.java on aarch64
78aa5f3: 8299505: findVirtual on array classes incorrectly restricts the receiver type
42ca6e6: 8308022: update for deprecated sprintf for java.base
1264902: 8308316: Default decomposition mode in Collator
70670b4: 8308872: enhance logging and some exception in krb5/Config.java
024d9b1: 8308910: Allow executeAndLog to accept running process
25b9803: 8308917: C2 SuperWord::output: assert before bailout with CountedLoopReserveKit
d66b6d8: 8308765: RISC-V: Expand size of stub routines for zgc only
4aea7da: 8309120: java/net/httpclient/AsyncShutdownNow.java fails intermittently
... and 113 more: https://git.openjdk.org/jdk/compare/9e196b3631af0156ce9958a2f631894968211a4c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

tschatzl · 2023-05-25T16:05:49Z

> OK, so this does happen when a new thread comes at an unfortunate time in the VM lifecycle, like on shutdown? Anyway, the fix looks okay. I think many other versions are also affected; can you please add the relevant Affected-Versions to the bug?

In this case, yes: a thread is attached during shutdown, and you can get weird failures in other FP code (only if you also enable FP exceptions, although apparently it can leave something in a weird state anyway). I think (well, I hope) it is also the cause of another similar bug (*) that caused extremely intermittent crashes in G1; that bug was closed as CNR after it stopped appearing.

The assert that tripped is something I added while trying to reproduce JDK-8264798. I initially failed to reproduce it, and then accidentally left the assert in when testing another change...

I think it is worth cleaning up just in case.

(*) That may just be wishful thinking...

tschatzl · 2023-05-25T16:14:16Z

Added affects-version entries back to JDK 8, since that code and the tlab_capacity() implementation are the same there as they are now. Maybe other circumstances prevent this from happening.

albertnetymk

Thanks to Thomas' explanation, I now understand why it tracks the ratio instead of the actual alloc-amount. (Eden) capacity affects the distance between two GC pauses (in a STW GC), and alloc-amount is semi-proportional to that distance. Therefore, the ratio more or less reflects the alloc-rate, which can be used to predict the alloc-amount until the next GC pause.

However, maintaining a constant number of refills between GC pauses seems an odd objective; that is a pre-existing issue.
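
The reasoning can be written out using the two expressions quoted earlier in the thread. The symbols are shorthand introduced here, not names from the code: $d$ for desired_size(), $r$ for target_refills(), $C$ for the capacity at sampling time, $C'$ for the capacity at the time of use, and $\bar{f}$ for the averaged fraction.

$$
f = \frac{d \cdot r}{C}, \qquad \text{alloc} = \bar{f} \cdot C'
$$

Assuming a single sample, so that $\bar{f} = f$:

$$
\text{alloc} = d \cdot r \cdot \frac{C'}{C}
$$

With $C' = C$ this gives $\text{alloc} = d \cdot r$, i.e. $r$ refills of size $d$. If the thread's allocation between two pauses grows roughly in proportion to the capacity, then $f$ stays roughly constant as the capacity changes, and scaling it by $C'$ predicts the allocation for the next mutator phase. Sampling with $C = 0$ leaves $f$ undefined, which is the case the fix avoids.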

tschatzl · 2023-06-01T06:56:22Z

Thanks @albertnetymk @shipilev for your reviews.

/integrate

openjdk · 2023-06-01T06:57:53Z

Going to push as commit 96ed139.
Since your change was applied there have been 134 commits pushed to the master branch:

6c7225f: 8303417: RISC-V: Merge vector instructs with similar match rules
a46b5ac: 8308503: AArch64: SIGILL when running with -XX:UseBranchProtection=pac-ret on hardware without PAC feature
f9ad7df: 8300865: C2: product reduction in ProdRed_Double is not vectorized
8eda97d: 8305320: DbgStrings and AsmRemarks are leaking
0951474: 8309150: Need to escape " inside attribute values
0119969: 8309171: Test vmTestbase/nsk/jvmti/scenarios/jni_interception/JI05/ji05t001/TestDescription.java fails after JDK-8308341
f8a924a: 8308975: Fix signed integer overflow in compiler code, part 2
5531f6b: 8308819: add JDWP and JDI virtual thread support for ThreadReference.ForceEarlyReturn
e42a4b6: 8309236: ProblemList java/util/concurrent/locks/Lock/OOMEInAQS.java with ZGC and Generational ZGC again
8dbd384: 8308678: (fs) UnixPath::toRealPath needs additional permissions when running with SM (macOS)
... and 124 more: https://git.openjdk.org/jdk/compare/9e196b3631af0156ce9958a2f631894968211a4c...master

Your commit was automatically rebased without conflicts.

openjdk · 2023-06-01T06:58:03Z

@tschatzl Pushed as commit 96ed139.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Initial version

2d3ece6

openjdk bot changed the title ~~8308766~~ 8308766: TLAB initialization may cause div by zero May 24, 2023

openjdk bot added the rfr Pull request is ready for review label May 24, 2023

openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label May 24, 2023

shipilev approved these changes May 25, 2023


openjdk bot added the ready Pull request is ready to be integrated label May 25, 2023

tschatzl closed this May 25, 2023

tschatzl reopened this May 25, 2023

albertnetymk approved these changes May 31, 2023


openjdk bot added the integrated Pull request has been integrated label Jun 1, 2023

openjdk bot closed this Jun 1, 2023

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 1, 2023

tschatzl deleted the submit/8308766-tlab-init-div-by-zero branch June 6, 2023 08:10
