8255984: Shenandoah: "adaptive" heuristic is prone to missing load spikes #1099

Conversation

@earthling-amzn (Contributor) commented Nov 6, 2020

This change adds a "reactive" heuristic for triggering concurrent GC cycles.

The reactive heuristic maintains a margin of error and an allocation spike detection mechanism to trigger cycles somewhat more aggressively than the 'adaptive' heuristic. This heuristic 'reacts' to the outcome of GC cycles by adjusting the sensitivity of the triggers.

JBS ticket is here: https://bugs.openjdk.java.net/browse/JDK-8255984

The "adaptive" heuristic remains the default.

Steps to reproduce and test will follow shortly (there are no new jtreg test failures for Shenandoah with this change).
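Mechanically, the trigger described above can be pictured as a check that combines the average allocation rate, inflated by a margin of error, with a spike test on the most recent sample. The following is a simplified, self-contained sketch under those assumptions; the type, field, and parameter names are invented for illustration and are not the actual HotSpot code:

```cpp
// Hypothetical sketch of the reactive trigger; names and structure are
// illustrative only, not the actual Shenandoah implementation.
struct ReactiveTrigger {
  double margin_of_error;   // multiplier on the allocation-rate std deviation
  double spike_threshold;   // std deviations for a single-sample spike

  // avg/sd: mean and std deviation of the sampled allocation rate (bytes/s)
  // last_sample: most recent sampled instantaneous rate (bytes/s)
  // available: free heap (bytes); cycle_time: expected GC duration (s)
  bool should_start_gc(double avg, double sd, double last_sample,
                       double available, double cycle_time) const {
    // Trigger if the average rate, inflated by the margin of error,
    // would exhaust the free heap before a cycle could finish.
    if ((avg + margin_of_error * sd) * cycle_time > available) {
      return true;
    }
    // Trigger if the latest sample is an outlier (a "spike") that
    // would exhaust the free heap on its own.
    bool is_spike = last_sample > avg + spike_threshold * sd;
    return is_spike && last_sample * cycle_time > available;
  }
};
```

The key difference from a plain average-rate trigger is the second branch: a single sampled spike can start a cycle even when the long-run average still looks safe.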


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Testing

Build: Linux aarch64 ✔️ (1/1 passed), Linux arm ✔️ (1/1 passed), Linux ppc64le ✔️ (1/1 passed), Linux s390x ✔️ (1/1 passed), Linux x64 ✔️ (6/6 passed), Linux x86 ✔️ (2/2 passed), Windows x64 ✔️ (2/2 passed), macOS x64 ✔️ (2/2 passed)
Test (tier1): ✔️ (9/9 passed), (1/9 running), (6/9 running), ✔️ (9/9 passed)

Issue

  • JDK-8255984: Shenandoah: "adaptive" heuristic is prone to missing load spikes

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/1099/head:pull/1099
$ git checkout pull/1099

The reactive heuristic maintains a margin of error and a spike
detection mechanism to trigger cycles somewhat more aggressively
than the 'adaptive' heuristic. This heuristic 'reacts' to the
outcome of GC cycles by adjusting the sensitivity of the triggers.
@bridgekeeper (bot) added the oca label Nov 6, 2020

@bridgekeeper (bot) commented Nov 6, 2020

Hi @earthling-amzn, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as a Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user earthling-amzn" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

@openjdk (bot) commented Nov 6, 2020

@earthling-amzn The following labels will be automatically applied to this pull request:

  • hotspot-gc
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk (bot) added the hotspot-gc and shenandoah labels Nov 6, 2020

@earthling-amzn (Contributor, Author) commented Nov 6, 2020

/covered

@bridgekeeper (bot) added the oca-verify label Nov 6, 2020

@bridgekeeper (bot) commented Nov 6, 2020

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

@robilad commented Nov 6, 2020

Hi,

Please contact me at Dalibor.topic@oracle.com so that I can verify your GitHub account.

Thanks!

@shipilev (Contributor) commented Nov 8, 2020

This looks interesting. I have to look closer at this, but maybe we should instead fold this into adaptive. That would also resolve the testing question: Shenandoah tests usually run with most (all?) heuristics, so a new heuristic needs to be followed by lots of test changes. Folding this into adaptive implicitly fixes that.

@earthling-amzn (Contributor, Author) commented Nov 9, 2020

I think folding it into adaptive makes sense. I took the conservative approach originally just to keep the changes isolated. I'll prepare a commit with these changes built directly into the adaptive heuristic.

@bridgekeeper (bot) removed the oca and oca-verify labels Nov 9, 2020

@shipilev (Contributor) left a comment

While waiting for merge to "adaptive", here is a round of minor things from the first read.

 - Remove unnecessary log level checks
 - More idiomatic style
@shipilev (Contributor) left a comment

Thanks for the updates. I would like to dive in with some testing. Second-read nits below, meanwhile.

@shipilev (Contributor) commented Nov 11, 2020

Oh, and change PR title to "8255984: Shenandoah: "adaptive" heuristic is prone to missing load spikes", so that bots know what issue it refers to.

@earthling-amzn changed the title from "Shenandoah: "adaptive" heuristic is prone to missing load spikes" to "8255984: Shenandoah: "adaptive" heuristic is prone to missing load spikes" Nov 12, 2020

@openjdk (bot) added the rfr label Nov 12, 2020

@mlbridge (bot) commented Nov 12, 2020

This lets the heuristics update state without const_casts.
@earthling-amzn (Contributor, Author) commented Nov 13, 2020

Though we tested this with several benchmarks and internal services, these changes were primarily developed using the HyperAlloc benchmark. The attached chart shows stacked histograms for the heuristics that triggered a collection cycle during a run of HyperAlloc.

Each test run was given a 3g heap. HyperAlloc targeted a 1g/s allocation rate (-a 1000) with a max object size of 1000 bytes (-x 1000). It was configured to maintain 1gb of live objects on the heap (-s 1024) and ran for two minutes (-d 120). Each row in the chart is for a different allocation smoothness factor: the top row has an allocation smoothness of zero (i.e., most spiky), while the bottom row has an allocation smoothness of 0.4, still spiky but manageable. Each column in the chart is a different heuristic. On the left is the original adaptive heuristic (with default parameters). On the far right is the compact heuristic. In the middle are runs of the original reactive heuristic with different sensitivities to allocation spikes. The purple bars are allocation failures. The light blue bars are instances where an allocation spike was detected and triggered a cycle.

[chart: shen-summary]

@shipilev (Contributor) left a comment

The more I study this patch, the more I like it, good job. Another round of changes below. I fixed a few mentioned issues in my private working copy while doing initial performance testing. Would be good to have an updated PR for testing too.

@shipilev (Contributor) commented Nov 13, 2020

[chart: shen-summary]

Yes, it mostly matches what I would expect from this heuristic improvement. As it stands right now, "adaptive" balances way too close to the cliff, so it is expected that the improvement would make GC cycles more frequent to move the GC triggers away from it. On the chart, "Headroom" probably means the "Average allocation rate" trigger? It is still a bit sad that we cannot handle smoothness=0.0, but that probably requires future-predicting capabilities not yet available at the current technology level.

@shipilev (Contributor) commented Nov 19, 2020

I don't understand why GH actions do not test your patch :/ Can you go to https://github.com/earthling-amzn/jdk/settings/ and check that both "Actions" and "Secrets" do not have anything special defined? "Actions" should say "Allow all actions", "Secrets" should not have JDK_SUBMIT_PLATFORMS defined.

@earthling-amzn (Contributor, Author) commented Nov 19, 2020

"Allow all actions" is checked, nothing defined for secrets. Is it because the title didn't have the JBS issue number in it when I opened the PR?

@shipilev (Contributor) commented Nov 19, 2020

Unlikely. If you go to https://github.com/earthling-amzn/jdk/actions, can you trigger the workflow by hand?

@earthling-amzn (Contributor, Author) commented Nov 19, 2020

Ah, it says this at the actions link:

Workflows aren’t being run on this forked repository
Because this repository contained workflow files when it was forked, we have disabled them from running on this fork. Make sure you understand the configured workflows and their expected usage before enabling Actions on this repository.

I've re-enabled them and the pre-submit action is running now.

@earthling-amzn (Contributor, Author) commented Nov 19, 2020

I'm also running my own test suite (heapothesys/HyperAlloc) with different permutations of the margin of error, spike threshold, and sample frequency. I'll run the whole thing again without any headroom buffer. Will have the results by end of day.

@earthling-amzn (Contributor, Author) commented Nov 20, 2020

Finished running more tests. My preference is still to keep the headroom/penalties in place. The heuristic can run without them, but there is a modest increase in degenerated cycles. If we did remove these safeguards, I'd recommend increasing the default sample frequency to 100 Hz.
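The "penalties" being discussed act as a feedback loop on the trigger's sensitivity: the heuristic reacts to each cycle's outcome, becoming more aggressive after degenerated cycles and slowly relaxing after comfortable ones. A hypothetical sketch of such a loop follows; the constants, names, and update rules are invented for illustration and are not the actual Shenandoah code:

```cpp
#include <algorithm>

// Hypothetical feedback loop for trigger sensitivity; constants and
// update rules are illustrative only, not the actual Shenandoah code.
struct Sensitivity {
  double margin_of_error = 1.0;      // current margin, in std deviations
  static constexpr double LO = 1.0;  // most relaxed setting
  static constexpr double HI = 10.0; // most aggressive setting

  // A cycle degenerated (or failed): we triggered too late, so widen
  // the margin to trigger earlier next time.
  void record_degenerated() {
    margin_of_error = std::min(HI, margin_of_error * 2.0);
  }

  // A concurrent cycle finished with room to spare: decay the margin
  // slowly back toward the baseline.
  void record_success() {
    margin_of_error = std::max(LO, margin_of_error * 0.9);
  }
};
```

The asymmetry (double on failure, decay by 10% on success) reflects the usual design choice for such safeguards: reacting quickly to a near-failure is cheap, while relaxing too quickly risks another degenerated cycle.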

@shipilev (Contributor) left a comment

Okay then! Time to integrate and do anything else in follow-ups.

@openjdk (bot) commented Nov 20, 2020

@earthling-amzn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8255984: Shenandoah: "adaptive" heuristic is prone to missing load spikes

Reviewed-by: shade

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 114 new commits pushed to the master branch:

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@shipilev) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk (bot) added the ready label Nov 20, 2020
Without this, the average allocation rate will be much higher than it should be.
This causes too many false positives and triggers unnecessary cycles.
@earthling-amzn (Contributor, Author) commented Nov 20, 2020

Sorry, added one bug fix and another small change. The bug caused the average allocation rate to be much higher than it should be. The other change is to only consider values that are actually sampled as a "spike" (as opposed to previously, when it would recompute the instantaneous rate on every call to should_start_gc). For what it's worth, I made these changes while the headroom/penalty adjustments were disabled. I'm more comfortable with taking those adjustments out now.
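The second change can be pictured as follows: the instantaneous rate is computed only when a sample is taken, and spike detection then operates on those sampled values rather than on ad-hoc recomputations. This is a hypothetical sketch under that assumption; the class and method names are invented and are not the actual patch:

```cpp
#include <cmath>
#include <vector>

// Illustrative allocation-rate sampler: the instantaneous rate is
// computed only at fixed sample points, and spike detection looks only
// at those samples. Names are hypothetical, not the actual Shenandoah code.
class AllocationRateSampler {
 public:
  explicit AllocationRateSampler(double interval_s)
      : _interval(interval_s), _last_bytes(0.0) {}

  // Called once per sample interval with cumulative allocated bytes.
  // Returns the instantaneous rate (bytes/s) for this interval.
  double sample(double allocated_bytes) {
    double rate = (allocated_bytes - _last_bytes) / _interval;
    _last_bytes = allocated_bytes;
    _samples.push_back(rate);
    return rate;
  }

  double average() const {
    double sum = 0.0;
    for (double r : _samples) sum += r;
    return _samples.empty() ? 0.0 : sum / _samples.size();
  }

  double stddev() const {
    double avg = average(), sum = 0.0;
    for (double r : _samples) sum += (r - avg) * (r - avg);
    return _samples.empty() ? 0.0 : std::sqrt(sum / _samples.size());
  }

  // A sampled value is a spike if it exceeds the mean by `threshold`
  // standard deviations.
  bool is_spike(double rate, double threshold) const {
    return rate > average() + threshold * stddev();
  }

 private:
  double _interval;                // seconds between samples
  double _last_bytes;              // cumulative bytes at the last sample
  std::vector<double> _samples;    // sampled instantaneous rates
};
```

Computing the rate only at sample points keeps each sample's window a fixed width; recomputing on every should_start_gc call effectively shrinks the window and, as the comment above notes, inflates the rate.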

@shipilev (Contributor) left a comment

Okay, I have a minor suggestion that seems to still preserve the semantics. It generally looks good, the testing passes, and the benchmarks look okay (some regressions, but expected). Feel free to integrate.

@earthling-amzn (Contributor, Author) commented Nov 23, 2020

/integrate

@openjdk (bot) added the sponsor label Nov 23, 2020

@openjdk (bot) commented Nov 23, 2020

@earthling-amzn
Your change (at version 72f15a6) is now ready to be sponsored by a Committer.

@shipilev (Contributor) commented Nov 23, 2020

/sponsor

@shipilev (Contributor) commented Nov 23, 2020

Congratulations! If you have time, please work on removing the old alloc spike threshold and penalties handling.

@openjdk (bot) closed this Nov 23, 2020

@openjdk (bot) added the integrated label and removed the sponsor, ready, and rfr labels Nov 23, 2020

@openjdk (bot) commented Nov 23, 2020

@shipilev @earthling-amzn Since your change was applied there have been 114 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

Pushed as commit aac5c2a.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
