8312182: THPs cause huge RSS due to thread start timing issue #2086

Closed
wants to merge 2 commits into from

Conversation

tstuefe

@tstuefe tstuefe commented Aug 17, 2023

Unclean composite backport to fix JDK-8312182 - "THPs cause huge RSS due to thread start timing issue" (https://bugs.openjdk.org/browse/JDK-8312182)

Problem:

On a machine with transparent huge pages (THP) unconditionally enabled (/sys/kernel/mm/transparent_hugepage/enabled = "always"), the JVM may show a huge memory footprint (RSS) and degraded thread start performance.

The following factors make the problem more severe and more likely:

  • thread stack size of 2M (on arm64 or x64) or larger
  • many threads, or high thread creation churn
  • a slow or overloaded machine (since part of the problem is timing-dependent)
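To check whether a machine is in the affected configuration, the sysfs file named above can be read directly. The following is a minimal sketch, not part of the patch; the class `ThpModeCheck` and its method are hypothetical names. It relies on the Linux convention that the file lists all modes and marks the active one with square brackets, for example `[always] madvise never`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; it is not code from this PR.
class ThpModeCheck {

    // Returns the mode enclosed in square brackets, or "unknown" if none is marked.
    static String parseActiveMode(String content) {
        int open = content.indexOf('[');
        if (open < 0) {
            return "unknown";
        }
        int close = content.indexOf(']', open + 1);
        if (close < 0) {
            return "unknown";
        }
        return content.substring(open + 1, close);
    }

    public static void main(String[] args) throws IOException {
        Path p = Paths.get("/sys/kernel/mm/transparent_hugepage/enabled");
        if (!Files.isReadable(p)) {
            // Non-Linux system, or a kernel built without THP support.
            System.out.println("THP setting not readable");
            return;
        }
        String content = new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim();
        System.out.println("THP mode: " + parseActiveMode(content));
    }
}
```

A result of `always` corresponds to the configuration in which the problem described here can occur.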

For a detailed discussion of the underlying problem, please see openjdk/jdk#14919.


In JDK head, the issue was fixed by a sequence of patches:

  • JDK-8303215 "Make thread stacks not use huge pages"
  • JDK-8312182 "THPs cause huge RSS due to thread start timing"

However, JDK-8312182 itself needed one preparatory fix:

  • JDK-8310233 "Fix THP detection on Linux"

and then we had several corner-case test problems, which are fixed with:

  • JDK-8312394 "[linux] SIGSEGV if kernel was built without hugepage support"
  • JDK-8312620 "WSL Linux build crashes after JDK-8310233"
  • JDK-8314139 "TEST_BUG: runtime/os/THPsInThreadStackPreventionTest.java could fail on machine with large number of cores"

and finally, a last patch renamed the switch that allows the THP mitigation to be turned off:

  • JDK-8312585 "Rename DisableTHPStackMitigation flag to THPStackMitigation"

Instead of downporting these 7 patches verbatim, I prepared a composite patch containing only the necessary mitigation and mitigation tests.

This patch does:

  • make sure that all thread stacks have at least one glibc guard page, which prevents adjacent thread stacks from being clustered into one VMA
  • change the default stack size so that it is not aligned to 2 MB, which prevents intra-stack THPs from forming
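The second point can be illustrated with a small calculation. This is a sketch of the idea only, written in Java for readability; it is not the HotSpot code from the patch, and the class name, method name, and the choice of adding exactly one base page are assumptions made for illustration.

```java
// Illustrative sketch only; names and the exact adjustment are assumptions.
class StackSizeDeAlign {

    // If the requested stack size is a multiple of the THP page size, grow it by
    // one base page so that the stack no longer spans an aligned huge-page range.
    static long adjust(long stackSize, long thpPageSize, long basePageSize) {
        if (stackSize > 0 && thpPageSize > 0 && stackSize % thpPageSize == 0) {
            return stackSize + basePageSize;
        }
        return stackSize;
    }

    public static void main(String[] args) {
        long thp = 2L * 1024 * 1024; // 2 MB THP page size, as on x64
        long page = 4096;            // assumed base page size
        System.out.println("2 MB stack -> " + adjust(thp, thp, page) + " bytes");
        System.out.println("1 MB stack -> " + adjust(thp / 2, thp, page) + " bytes");
    }
}
```

A stack size that is already unaligned, or a system that reports no THP page size (0 here), is left unchanged.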

The patch needs some infrastructure, but I downported only the necessary parts: the helper class "HugePages", which is used in head to scan the operating system for information about THP settings. I only included the parts to do with THPs and left the rest out.

The patch also includes a regression test.


Testing:

I manually tested the JVM on Linux x64 with THP=always:

Without the patch (-Xmx1g -Xms1g -XX:+AlwaysPreTouch -Xss2m, 10000 threads started), I see slow thread startup and 11 GB - 14 GB of RSS.

The patched version comes up a lot faster and only shows 1.3 GB of RSS.
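A manual test of this kind could look roughly like the following sketch. It is not the program used for the numbers above and not the regression test included in the patch; the class `ThreadChurnRepro` and its methods are hypothetical. It starts a given number of threads that block until released, then reports the elapsed time and the `VmRSS` value from `/proc/self/status` (Linux only).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical reproducer sketch; not the test shipped with this PR.
class ThreadChurnRepro {

    // Starts 'count' threads that block on 'release'; returns once all are running.
    static List<Thread> startParkedThreads(int count, CountDownLatch release) {
        CountDownLatch started = new CountDownLatch(count);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        try {
            started.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return threads;
    }

    // Waits for all given threads to terminate.
    static void joinAll(List<Thread> threads) {
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    // Extracts the VmRSS value in kB from /proc/self/status lines, or -1 if absent.
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                return Long.parseLong(line.trim().split("\\s+")[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10000;
        CountDownLatch release = new CountDownLatch(1);
        long t0 = System.nanoTime();
        List<Thread> threads = startParkedThreads(count, release);
        long millis = (System.nanoTime() - t0) / 1_000_000;
        long rssKb = parseVmRssKb(Files.readAllLines(Paths.get("/proc/self/status")));
        System.out.println(count + " threads in " + millis + " ms, VmRSS " + rssKb + " kB");
        release.countDown();
        joinAll(threads);
    }
}
```

Run with the JVM flags listed above (in particular `-Xss2m`) on a machine with THP set to `always`, the reported values can be compared between an unpatched and a patched build.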

GHAs: unfortunately broken due to infrastructure issues.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8312182: THPs cause huge RSS due to thread start timing issue (Bug - P3)
  • JDK-8310687: JDK-8303215 is incomplete (Bug - P4)
  • JDK-8303215: Make thread stacks not use huge pages (Enhancement - P3)
  • JDK-8310233: Fix THP detection on Linux (Bug - P4)
  • JDK-8312394: [linux] SIGSEGV if kernel was built without hugepage support (Bug - P3)
  • JDK-8312620: WSL Linux build crashes after JDK-8310233 (Bug - P3)
  • JDK-8314139: TEST_BUG: runtime/os/THPsInThreadStackPreventionTest.java could fail on machine with large number of cores (Bug - P4)
  • JDK-8312585: Rename DisableTHPStackMitigation flag to THPStackMitigation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk11u-dev.git pull/2086/head:pull/2086
$ git checkout pull/2086

Update a local copy of the PR:
$ git checkout pull/2086
$ git pull https://git.openjdk.org/jdk11u-dev.git pull/2086/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 2086

View PR using the GUI difftool:
$ git pr show -t 2086

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk11u-dev/pull/2086.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Aug 17, 2023

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot changed the title from "Backport 84b325b844c08809448a9c073a11443d9e3c3f8e" to "8312182: THPs cause huge RSS due to thread start timing issue" Aug 17, 2023
@openjdk

openjdk bot commented Aug 17, 2023

This backport pull request has now been updated with issues from the original commit.

@openjdk openjdk bot added the backport label Aug 17, 2023
@tstuefe tstuefe force-pushed the tstuefe-backport-84b325b8 branch from cea61c5 to c13269c Compare August 17, 2023 14:53
@tstuefe tstuefe marked this pull request as ready for review August 19, 2023 06:31
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 19, 2023
@mlbridge

mlbridge bot commented Aug 19, 2023

Webrevs

@tstuefe

tstuefe commented Aug 21, 2023

All tests are green now.

@tstuefe

tstuefe commented Aug 23, 2023

/issue add 8303215 8310233 8312394 8312620 8314139 8312585

@openjdk

openjdk bot commented Aug 23, 2023

@tstuefe
Adding additional issue to issue list: 8303215: Make thread stacks not use huge pages.

Adding additional issue to issue list: 8310233: Fix THP detection on Linux.

Adding additional issue to issue list: 8312394: [linux] SIGSEGV if kernel was built without hugepage support.

Adding additional issue to issue list: 8312620: WSL Linux build crashes after JDK-8310233.

Adding additional issue to issue list: 8314139: TEST_BUG: runtime/os/THPsInThreadStackPreventionTest.java could fail on machine with large number of cores.

Adding additional issue to issue list: 8312585: Rename DisableTHPStackMitigation flag to THPStackMitigation.

@tstuefe

tstuefe commented Aug 24, 2023

Friendly ping. It would be good to get this fixed in time for the next CPU.

@tstuefe tstuefe closed this Aug 25, 2023
@tstuefe tstuefe deleted the tstuefe-backport-84b325b8 branch October 23, 2023 09:26
Labels
backport rfr Pull request is ready for review

1 participant