Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8312182: THPs cause huge RSS due to thread start timing issue #14919

Conversation

tstuefe
Copy link
Member

@tstuefe tstuefe commented Jul 18, 2023

If Transparent Huge Pages are unconditionally enabled (/sys/kernel/mm/transparent_hugepage/enabled contains [always]), Java applications that use many threads may see a huge Resident Set Size. That RSS is caused by thread stacks being mostly paged in. This page-in is caused by thread stack memory being transformed into huge pages by khugepaged; later, those huge pages usually shatter into small pages when Java guard pages are established at thread start, but the remaining splinter small pages remain paged in.

JDK-8303215 attempted to fix this problem by making it unlikely that thread stack boundaries are aligned to THP page size. Unfortunately, that was not sufficient. We still see JVMs with huge footprints, especially if they did create many Java threads in rapid succession.

Note that this effect is independent of any JVM switches; in particular, it happens regardless of -XX:+UseTransparentHugePages or -XX:+UseLargePages.

Update: tests show that the interference of khugepaged also costs performance when starting threads, and this patch addresses both footprint and performance problems.

Demonstration:

Linux 5.15 on x64, glibc 2.31: 10000 idle threads with 100 MB pre-touched java heap, -Xss2M, on x64, will consume:

A) Baseline (THP disabled on system): 369 MB
B) THP="always", JDK-8303215 present: 1.5 GB .. >2 GB (very wobbly)
C) THP="always", JDK-8303215 present, artificial delay after thread start: 20,6 GB (!).

Cause:

The problem is caused by timing. When we create multiple Java threads, the following sequence of actions happens:

In the parent thread:

  • the parent thread calls pthread_create(3)
  • pthread_create(3) creates the thread stack by calling mmap(2)
  • pthread_create(3) calls clone(2) to start the child thread
  • repeat to start more threads

Each child thread:

  • queries its stack dimensions
  • handshakes with the parent to signal lifeness
  • establishes guard pages at the low end of the stack

The thread stack mapping is established in the parent thread; the guard pages are placed by the child threads. There is a time window in which the thread stack is already mapped into address space, but guard pages still need to be placed.

If the parent is faster than the children, it will have created mappings faster than the children can place guard pages on them.

For the kernel, these thread stacks are just anonymous mappings. It places them adjacent to each other to reduce address space fragmentation. As long as no guard pages are placed yet, all these thread stack mappings (VMAs) have the same attributes - same permission bits, all anonymous. Hence, the kernel will fold them into a single large VMA.

That VMA may be large enough to be eligible for huge pages. Now the JVM races with the khugepaged: If khugepaged is faster than the JVM, it will have converted that larger VMA partly or fully into hugepages before the child threads start creating guard pages:

glibc does              Kernel folds them        khugepaged creates
1 mmap/stack:           into one VMA (still      huge pages (paged in):
                        not paged in):
+---------+             +---------+              +---------+ 
|         |             |         |              |         | 
| stack A |             |         |              | hugepage|
|         |             |         |              | ------- | 
+---------+             |         |              |         | 
|         |             |         |              | hugepage| 
| stack B |    ====>    |  VMA    |    ====>     | ------- | 
|         |             |         |              |         | 
+---------+             |         |              | hugepage| 
|         |             |         |              | ------- | 
| stack C |             |         |              |         | 
|         |             |         |              | hugepage| 
+---------+             +---------+              +---------+

The child threads will catch up and create guard pages. That will splinter the large VMA into several smaller VMAs (two for each thread, one for the usable thread section, and one protected for the guards). Each of these VMAs will typically be smaller than a huge page, and typically not huge-page-aligned. The huge pages created by khugepaged will mostly shatter into small pages, but these small pages remain paged-in. Effect: we pay memory for the whole thread stacks even though the threads did not start yet.

This is a similar effect as described in JDK-8303215; but we assumed it only affects individual threads when it affects whole regions of adjacent thread stacks.

Example:

Let's create three threads. Each thread stack, including guard pages, is 2M + 4K sized (+4K because of JDK-8303215).

Their thread stacks will be located at: ( [base .. end .. guard]:

T1: [7feea53ff000 .. 7feea5202000 .. 7feea51fe000] 
T2: [7feea5600000 .. 7feea5403000 .. 7feea53ff000] 
T3: [7feea5801000 .. 7feea5604000 .. 7feea5600000]

After pthread_create(3), their thread stacks exist without JVM guard pages. Kernel merges the VMAs of their thread stacks into a single mapping > 6MB. khugepaged then coalesces their small pages into 3 huge pages:

7feea51fe000-7feea5801000 rw-p 00000000 00:00 0    <<<------- all three stacks as one VMA
Size:               6156 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                6148 kB
Pss:                6148 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      6148 kB
Referenced:         6148 kB
Anonymous:          6148 kB
LazyFree:              0 kB
AnonHugePages:      6144 kB      <<<---------- 3x2MB huge pages
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1    
VmFlags: rd wr mr mw me ac sd 

Threads start and create their respective guard pages. The single VMA splinters into 6 smaller VMAs. The huge pages shatter into small pages that remain paged-in:

7feea51fe000-7feea5202000 ---p 00000000 00:00 0   <<----- guard pages for T1
Size:                 16 kB
...
7feea5202000-7feea53ff000 rw-p 00000000 00:00 0   <<------ thread stack for T1
Size:               2036 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2036 kB
Pss:                2036 kB
Private_Dirty:      2036 kB   <<<--------  all pages resident
...
7feea53ff000-7feea5403000 ---p 00000000 00:00 0   <<----- guard pages for T2
Size:                 16 kB
...
7feea5403000-7feea5600000 rw-p 00000000 00:00 0   <<------ thread stack for T2
Size:               2036 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2036 kB
Pss:                2036 kB
Private_Dirty:      2036 kB   <<<--------  all pages resident
...
7feea5600000-7feea5604000 ---p 00000000 00:00 0    <<----- guard pages for T3 
Size:                 16 kB
...
7feea5604000-7feea5801000 rw-p 00000000 00:00 0    <<------ thread stack for T3
Size:               2036 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2036 kB
Pss:                2036 kB
Private_Dirty:      2036 kB   <<<--------  all pages resident
...

Fix

The mitigation for the huge RSS buildup involves letting the glibc allocate guards for Java thread stacks. We usually avoid that since Java thread stacks are already guarded by JVM guard pages. However, here we use the fact that the glibc guard page is established by pthread_create(3) before starting the child thread. Therefore adjacent thread stack VMAs will be separated by a guard page - we close the time window within which the kernel sees a large VMA.

The fix works: the aforementioned 20.6 GB RSS example, with the fix applied, drops back to ~400 MB RSS.

As a byproduct, this patch fixes JDK-8310687: the mitigation will only be activated if THPs are enabled unconditionally on the system (THP mode = "always"). And we use the correct page size for THPs.

Finally, note that this solution has a slight disadvantage: threads now have two guards, the JVM guard pages and the glibc guard page. Currently, the kernel cannot merge these VMAs because their attributes differ. This problem may be solvable; I opened JDK-8312211 to track this problem.

/issue JDK-8310687


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8312182: THPs cause huge RSS due to thread start timing issue (Bug - P3)
  • JDK-8310687: JDK-8303215 is incomplete (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14919/head:pull/14919
$ git checkout pull/14919

Update a local copy of the PR:
$ git checkout pull/14919
$ git pull https://git.openjdk.org/jdk.git pull/14919/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14919

View PR using the GUI difftool:
$ git pr show -t 14919

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14919.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 18, 2023

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 18, 2023

@tstuefe The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Jul 18, 2023
@tstuefe tstuefe changed the title 8312182-Huge-RSS-with-THPs-even-after-JDK-8303215 JDK-8312182: Huge RSS with THPs even after JDK-8303215 Jul 18, 2023
@tstuefe tstuefe force-pushed the JDK-8312182-Huge-RSS-with-THPs-even-after-JDK-8303215 branch from 6f21033 to 805cbeb Compare July 18, 2023 14:15
@openjdk
Copy link

openjdk bot commented Jul 18, 2023

@tstuefe
Adding additional issue to issue list: 8310687: JDK-8303215 is incomplete.

@tstuefe tstuefe changed the title JDK-8312182: Huge RSS with THPs even after JDK-8303215 JDK-8312182: THPs may cause huge RSS increases due to thread start timing issue Jul 18, 2023
@tstuefe tstuefe changed the title JDK-8312182: THPs may cause huge RSS increases due to thread start timing issue JDK-8312182: THPs cause huge RSS due to thread start timing issue Jul 18, 2023
@tstuefe tstuefe force-pushed the JDK-8312182-Huge-RSS-with-THPs-even-after-JDK-8303215 branch from 805cbeb to 81c0d38 Compare July 18, 2023 15:38
@tstuefe tstuefe marked this pull request as ready for review July 18, 2023 15:40
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 18, 2023
@mlbridge
Copy link

mlbridge bot commented Jul 18, 2023

Webrevs

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow what a mess! The OS is really working against us in this area. We should not have to be playing these silly games.

There is a caution in os::Linux::default_guard_size that glibc guard pages are expensive to create so what are we paying for this?

Thanks.

Comment on lines 931 to 932
if (PreventTHPsForThreadStacks) {
guard_size = MAX2(guard_size, os::vm_page_size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be better hidden in os::Linux::default_guard_size and only applied to JavaThreads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be better hidden in os::Linux::default_guard_size

I don't like this, since it would keep those two mitigations we apply - adding the glibc guard page and altering the stack size to be THP unaligned - separate in code. I find it clearer to have them in one place.

and only applied to JavaThreads.

No, it makes sense to apply this to any thread that does not by-default run with a glibc guard page. But in practice, it is as you wish it to be since only Java Threads run without glibc guard pages.

Comment on lines 933 to 934
// Add an additional page to the stack size to reduce its chances of getting huge page aligned
// so that the stack does not get backed by a transparent huge page.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these two adjustments should be combined like this. The existing +1 may be sufficient for some folk who do not want the cost of creating a glibc guard page as well.

Copy link
Member Author

@tstuefe tstuefe Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, see earlier remarks. "May be sufficient" is vague - either you want to prevent THPs forming in thread stacks, or you are okay with it. There is no point in doing one without the other.

If you want to prevent them, you have to do both since the effects laid out in this PR are not predictable. Even if your app does not create threads - and why would you care for that one guard page then - the JVM creates a bunch of threads in quick succession too and thus suffers from the same effects.

And if you want to have THPs, then you should disable both mitigations together.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 19, 2023

Wow what a mess! The OS is really working against us in this area. We should not have to be playing these silly games.

There is a caution in os::Linux::default_guard_size that glibc guard pages are expensive to create so what are we paying for this?

Less, actually :-) The patch is a performance win too.

Code-wise, glibc seems to do one more mmap(2) call if we use guard pages, at least for the standard code paths.

I wanted to see if that matters, so I tested with THP mode set to "always" on my 48-core Linux 5.15 glibc 2.31 box. I tested with two stack sizes (1 MB, 2 MB), with and without patch. The test measures the time it takes to spawn 10000 threads.

Stack size (kb) with patch (ms) without patch (ms) Delta
1024 723 +- 11 807 +- 18 +12%
2048 710 +- 13 1621 +- 91 +128%

The numbers are consistently better with my patch. When the stack size approaches THP page size, the difference gets huge!

This makes sense since khugepaged transforms thread pages under our noses while child threads, freshly started, try to use the same pages. So we get a lot of interference. The closer the stack size is to THP size, the more likely this is. And we use more memory too, which costs bandwidth.

I repeated the test with THPs disabled, and sure enough, the difference vanished.

Bottomline, the cost of the additional mmap is dwarfed by the win of not having to fight khugepaged for the thread stacks.

I don't think these two adjustments should be combined like this. The existing +1 may be sufficient for some folk who do not want the cost of creating a glibc guard page as well.

I disagree; I'd like to enable this patch by default as a whole if we run with THP=always; we save performance and memory. There are not many downsides.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 19, 2023

@poonamparhar @dholmes-ora Thanks for your respective feedback. New version slightly moves the guard page calculation around and improves comment, functionally the same.

src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
src/hotspot/os/linux/os_linux.cpp Outdated Show resolved Hide resolved
src/hotspot/os/linux/globals_linux.hpp Outdated Show resolved Hide resolved
@tstuefe
Copy link
Member Author

tstuefe commented Jul 20, 2023

New version:

  • changed flag name from PreventTHPsForThreadStacks -> DisableTHPStackMitigation (inverse logic) and made it diagnostic
  • comment corrections
  • changed the negative tests to be a jtreg / manual test (@dholmes-ora is that the right way to go here? I want a test that does not run automatically, but that can be run from command line)

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me now - not that I understand THP any better. I will run the test through our CI for good measure.

Thanks.

// is usually unwanted for thread stacks. Therefore we attempt to prevent THP formation in
// thread stacks unless the user explicitly allowed THP formation by manually disabling
// -XX:+DisableTHPStackMitigation.
if (HugePages::thp_mode() == THPMode::always) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What controls this? The user launching the VM? The OS configuration? Or both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OS, so ultimately the Admin that set up the OS.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test passes in our CI.

@openjdk
Copy link

openjdk bot commented Jul 21, 2023

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8312182: THPs cause huge RSS due to thread start timing issue
8310687: JDK-8303215 is incomplete

Reviewed-by: dholmes, poonam

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 70 new commits pushed to the master branch:

  • 842d632: 8227229: Deprecate the launcher -Xdebug/-debug flags that have not done anything since Java 6
  • 9e4fc56: 8293114: JVM should trim the native heap
  • 59f66a3: 8312293: SIGSEGV in jfr.internal.event.EventWriter.putUncheckedByte after JDK-8312086
  • 8cd43bf: 8312474: JFR: Improve logging to diagnose event stream timeout
  • 3e8f1eb: 8311976: Inconsistency in usage of CITimeVerbose to generate compilation logs
  • d4aacdb: 8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet()
  • 783de32: 8300051: assert(JvmtiEnvBase::environments_might_exist()) failed: to enter event controller, JVM TI environments must exist
  • 4e8f331: 8312443: sun.security should use toLowerCase(Locale.ROOT)
  • d7b9416: 8199149: Improve the exception message thrown by VarHandle of unsupported operation
  • 354c660: 8307185: pkcs11 native libraries make JNI calls into java code while holding GC lock
  • ... and 60 more: https://git.openjdk.org/jdk/compare/69a46c25cc87d9d5495d0bb975c44f38cbb1fe13...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Jul 21, 2023
@tstuefe
Copy link
Member Author

tstuefe commented Jul 21, 2023

Test passes in our CI.

Many thanks, David.

Copy link
Member

@poonamparhar poonamparhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 21, 2023

Thanks @poonamparhar, for the review and the analysis of JDK-8303215!

/integrate

@openjdk
Copy link

openjdk bot commented Jul 21, 2023

Going to push as commit 84b325b.
Since your change was applied there have been 70 commits pushed to the master branch:

  • 842d632: 8227229: Deprecate the launcher -Xdebug/-debug flags that have not done anything since Java 6
  • 9e4fc56: 8293114: JVM should trim the native heap
  • 59f66a3: 8312293: SIGSEGV in jfr.internal.event.EventWriter.putUncheckedByte after JDK-8312086
  • 8cd43bf: 8312474: JFR: Improve logging to diagnose event stream timeout
  • 3e8f1eb: 8311976: Inconsistency in usage of CITimeVerbose to generate compilation logs
  • d4aacdb: 8306136: [vectorapi] Intrinsics of VectorMask.laneIsSet()
  • 783de32: 8300051: assert(JvmtiEnvBase::environments_might_exist()) failed: to enter event controller, JVM TI environments must exist
  • 4e8f331: 8312443: sun.security should use toLowerCase(Locale.ROOT)
  • d7b9416: 8199149: Improve the exception message thrown by VarHandle of unsupported operation
  • 354c660: 8307185: pkcs11 native libraries make JNI calls into java code while holding GC lock
  • ... and 60 more: https://git.openjdk.org/jdk/compare/69a46c25cc87d9d5495d0bb975c44f38cbb1fe13...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 21, 2023
@openjdk openjdk bot closed this Jul 21, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 21, 2023
@openjdk
Copy link

openjdk bot commented Jul 21, 2023

@tstuefe Pushed as commit 84b325b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
3 participants