Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8308766: TLAB initialization may cause div by zero #14121

Closed

Conversation

tschatzl
Copy link
Contributor

@tschatzl tschatzl commented May 24, 2023

Hi all,

can I have reviews for this change that fixes an FP div by zero?

In ThreadLocalAllocBuffer::initialize() we initialize the TLAB using current available TLAB capacity for the thread. In G1, this can be zero in some situations, leading to that div by zero (see the CR for the crash when adding an assert).
The suggested fix is to just not sample at this point. TLAB resizing will fix TLAB sizing up.

Only G1 seems to be affected as it seems to be the only gc that uses a dynamic value for the capacity available for TLAB allocation. Other GCs seem to just use total heap capacity (Z, Shenandoah) or eden capacity (Serial, Parallel).
Not sure if that is actually better and I think won't result in the expected behavior (every thread should reload TLABs target_refills() times per mutator time); since even with G1 at TLAB resizing time eden is maximal, this hiccup at initialization does not seem too bad.

This may also be the cause for the behavior observed in https://bugs.openjdk.org/browse/JDK-8264798.

Testing: gha

Thanks,
Thomas


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8308766: TLAB initialization may cause div by zero

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14121/head:pull/14121
$ git checkout pull/14121

Update a local copy of the PR:
$ git checkout pull/14121
$ git pull https://git.openjdk.org/jdk.git pull/14121/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14121

View PR using the GUI difftool:
$ git pr show -t 14121

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14121.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented May 24, 2023

👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title 8308766 8308766: TLAB initialization may cause div by zero May 24, 2023
@openjdk openjdk bot added the rfr Pull request is ready for review label May 24, 2023
@openjdk
Copy link

openjdk bot commented May 24, 2023

@tschatzl The following label will be automatically applied to this pull request:

  • hotspot-gc

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-gc hotspot-gc-dev@openjdk.org label May 24, 2023
@mlbridge
Copy link

mlbridge bot commented May 24, 2023

Webrevs

@albertnetymk
Copy link
Member

Looking at where _allocation_fraction is accessed, wouldn't a variable capacity cause the alloc-amount to be miscalculated? I'd expect capacity to be const to more accurately track/predict #alloc-bytes.

  // ** sampling place ** //
  size_t capacity = Universe::heap()->tlab_capacity(thread()) / HeapWordSize;
  float alloc_frac = desired_size() * target_refills() / (float)capacity;
  _allocation_fraction.sample(alloc_frac);

  // ** where it's used ** //
  // Compute the next tlab size using expected allocation amount
  size_t alloc = (size_t)(_allocation_fraction.average() *
                          (Universe::heap()->tlab_capacity(thread()) / HeapWordSize));

@tschatzl
Copy link
Contributor Author

Looking at where _allocation_fraction is accessed, wouldn't a variable capacity cause the alloc-amount to be miscalculated? I'd expect capacity to be const to more accurately track/predict #alloc-bytes.

  // ** sampling place ** //
  size_t capacity = Universe::heap()->tlab_capacity(thread()) / HeapWordSize;
  float alloc_frac = desired_size() * target_refills() / (float)capacity;
  _allocation_fraction.sample(alloc_frac);

  // ** where it's used ** //
  // Compute the next tlab size using expected allocation amount
  size_t alloc = (size_t)(_allocation_fraction.average() *
                          (Universe::heap()->tlab_capacity(thread()) / HeapWordSize));

Where the capacity is used, during the GC pause, in G1 Universe::heap()->tlab_capacity is effectively a constant, reflecting the eden size for the next mutator phase.

There is a problem what to do during TLAB initialization when attaching a random thread: eden can be partially exhausted as it can happen at any time when the mutator is running: do you want to have target_refills() reloads until eden is exhausted, or as if the thread ran since the start of the mutator phase or something completely different.

Serial and parallel calculate it as if eden were empty, Shenandoah and Z seem to use total heap capacity (they're single-generational), and G1 uses the remaining eden capacity, with different effects.

(Fwiw, if there is an issue with that logic, it is pre-existing).

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so this does happen when a new thread comes at unfortunate time in VM lifecycle, like on shutdown? Anyway, the fix looks okay. I think many other versions are also affected, can you please add relevant Affected-Versions to the bug?

@openjdk
Copy link

openjdk bot commented May 25, 2023

@tschatzl This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308766: TLAB initialization may cause div by zero

Reviewed-by: shade, ayang

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 123 new commits pushed to the master branch:

  • eae1f59: 8309159: Some minor comment and code cleanup in jdk/com/sun/jdi/PopFramesTest.java
  • 45473ef: 8309230: ProblemList jdk/incubator/vector/Float64VectorTests.java on aarch64
  • 78aa5f3: 8299505: findVirtual on array classes incorrectly restricts the receiver type
  • 42ca6e6: 8308022: update for deprecated sprintf for java.base
  • 1264902: 8308316: Default decomposition mode in Collator
  • 70670b4: 8308872: enhance logging and some exception in krb5/Config.java
  • 024d9b1: 8308910: Allow executeAndLog to accept running process
  • 25b9803: 8308917: C2 SuperWord::output: assert before bailout with CountedLoopReserveKit
  • d66b6d8: 8308765: RISC-V: Expand size of stub routines for zgc only
  • 4aea7da: 8309120: java/net/httpclient/AsyncShutdownNow.java fails intermittently
  • ... and 113 more: https://git.openjdk.org/jdk/compare/9e196b3631af0156ce9958a2f631894968211a4c...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 25, 2023
@tschatzl tschatzl closed this May 25, 2023
@tschatzl tschatzl reopened this May 25, 2023
@tschatzl
Copy link
Contributor Author

tschatzl commented May 25, 2023

OK, so this does happen when a new thread comes at unfortunate time in VM lifecycle, like on shutdown? Anyway, the fix looks okay. I think many other versions are also affected, can you please add relevant Affected-Versions to the bug?

In this case, yes, a thread is attached on shutdown and you can get weird failures in other FP code (if you also enable FP exceptions, but it can leave something in a weird state apparently). I think (well I hope) it is also the cause for another similar bug (*) that caused crashes in G1 extremely intermittently (that has been closed as CNR at that point after it stopped appearing).

That assert that tripped is something that I added for trying to reproduce JDK-8264798, initially failed to do so, and then accidentally left in when testing another change....

I think it is worth cleaning up just in case.

(*) That may just be wishful thinking...

@tschatzl
Copy link
Contributor Author

Added affects version back to JDK 8 since that code and the tlab_capacity() implementation are the same as they are now. Maybe other circumstance prevent this from happening.

Copy link
Member

@albertnetymk albertnetymk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to Thomas' explanation, now I understand why it tracks the ratio instead of the actual alloc-amount. It's because (eden) capacity affects the distance btw two gc-pause (in STW GC), and alloc-amount is semi-proportional to gc-distance. Therefore, the ratio more or less reflects alloc-rate, which can be used to predict alloc-amount until the next gc-pause.

However, maintaining a constant number of refills btw gc-pauses seems an odd objective; preexisting issue.

@tschatzl
Copy link
Contributor Author

tschatzl commented Jun 1, 2023

Thanks @albertnetymk @shipilev for your reviews.

/integrate

@openjdk
Copy link

openjdk bot commented Jun 1, 2023

Going to push as commit 96ed139.
Since your change was applied there have been 134 commits pushed to the master branch:

  • 6c7225f: 8303417: RISC-V: Merge vector instructs with similar match rules
  • a46b5ac: 8308503: AArch64: SIGILL when running with -XX:UseBranchProtection=pac-ret on hardware without PAC feature
  • f9ad7df: 8300865: C2: product reduction in ProdRed_Double is not vectorized
  • 8eda97d: 8305320: DbgStrings and AsmRemarks are leaking
  • 0951474: 8309150: Need to escape " inside attribute values
  • 0119969: 8309171: Test vmTestbase/nsk/jvmti/scenarios/jni_interception/JI05/ji05t001/TestDescription.java fails after JDK-8308341
  • f8a924a: 8308975: Fix signed integer overflow in compiler code, part 2
  • 5531f6b: 8308819: add JDWP and JDI virtual thread support for ThreadReference.ForceEarlyReturn
  • e42a4b6: 8309236: ProblemList java/util/concurrent/locks/Lock/OOMEInAQS.java with ZGC and Generational ZGC again
  • 8dbd384: 8308678: (fs) UnixPath::toRealPath needs additional permissions when running with SM (macOS)
  • ... and 124 more: https://git.openjdk.org/jdk/compare/9e196b3631af0156ce9958a2f631894968211a4c...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jun 1, 2023
@openjdk openjdk bot closed this Jun 1, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jun 1, 2023
@openjdk
Copy link

openjdk bot commented Jun 1, 2023

@tschatzl Pushed as commit 96ed139.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@tschatzl tschatzl deleted the submit/8308766-tlab-init-div-by-zero branch June 6, 2023 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-gc hotspot-gc-dev@openjdk.org integrated Pull request has been integrated
3 participants