Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8293114: JVM should trim the native heap #14781

Conversation

tstuefe
Copy link
Member

@tstuefe tstuefe commented Jul 6, 2023

This is a continuation of #10085. I closed #10085 because it had accumulated too much comment history and got confusing. For a history of this issue, see previous discussions [1] and the comment section of 10085.


This RFE adds the option to trim the Glibc heap periodically. This can recover a significant memory footprint if the VM process suffers from high-but-rare malloc spikes. It does not matter who causes the spikes: the JDK or customer code running in the JVM process.

Background:

The Glibc is reluctant to return memory to the OS. Temporary malloc spikes often carry over as permanent RSS increase. Note that C-heap retention is difficult to observe. Since it is freed memory, it won't appear in NMT; it is just a part of RSS.

This is, effectively, caching - a performance tradeoff by the glibc. It makes a lot of sense with applications that cause high traffic on the C-heap. The JVM, however, clusters allocations and often rolls its own memory management based on virtual memory for many of its use cases.

To manually trim the C-heap, Glibc exposes malloc_trim(3). With JDK 18 [2], we added a new jcmd command to manually trim the C-heap on Linux (jcmd System.trim_native_heap). We then observed customers running this command periodically to slim down process sizes of container-bound jvms. That is cumbersome, and the JVM can do this a lot better - among other things because it knows best when not to trim.

GLIBC internals

The following information I took from the glibc source code and experimenting.

Why do we need to trim manually? Does the Glibc not trim on free?

Upon free(), glibc may return memory to the OS if:

  • the returned block was mmap'ed
  • the returned block was not added to tcache or to fastbins
  • the returned block, possibly merged with its two immediate neighbors, had they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in that case:
    a) for the main arena, glibc attempts to lower the brk()
    b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the heap.
    In both cases, (a) and (b), only the top portion of the heap is reclaimed. "Holes" in the middle of other in-use chunks are not reclaimed.

So: glibc may automatically reclaim memory. In normal configurations, with typical C-heap allocation granularity, it is unlikely.

To increase the chance of auto-reclamation happening, one can do one or more things:

  • a) increase allocation sizes
  • b) limit mallocs to very few threads, ideally just one
  • c) set MALLOC_ARENA_MAX=1
  • d) set the glibc.malloc.trim_threshold to a very low value, e.g., 1

But:

  • (a) is not possible for foreign code; not that even within hotspot, where we cluster allocations using hotspot arenas, the typical arena chunk size is too small to be auto-reclaimed
  • (b) may just not be feasible
  • (c) is terrible for performance since many C-Heap operations will compete over the lock of that single arena
  • (d) works if either (b) or (c) are in place and if the released block happens to border the top of the arena. And it also costs performance.

The JVM only has limited influence on (a), none on (b), (c) is a really bad idea, and hence (d) often does little. That mirrors my practical experiences.

How does malloc_trim() differ from trimming on free() ?

malloc_trim(), will look for holes that are larger than a page; so it limits itself not to just reclaiming memory at the top of the arena. It will then madvise(MADV_DONTNEED) those holes. It does that for every arena.

What are the cons of calling malloc_trim()?

malloc_trim() cannot be interrupted. Once it runs, it runs. The runtime of malloc_trim() is not predictable. If there is nothing to reclaim, it is very fast (sub-ms). If there is a lot to reclaim (e.g. >32GB), I saw times of up to 800ms.

Moreover, malloc_trim, while trimming each arena, locks the arena. That may lock out concurrent C-heap operations in the thread that uses this arena. Note, however, that this is rare since many operations will be satisfied from the tcache and therefore don't lock.

What about the pad parameter for malloc_trim()

I found it has very little effect. It only affects how many bytes are preserved at the top of the main arena. It does not affect other arenas, nor does it affect how much space malloc_trim reclaims by releasing "holes", which is the main part of memory release.

The Patch

Patch adds new options (experimental):

-XX:+GCTrimNativeHeap
-XX:GCTrimNativeHeapInterval=<seconds> (defaults to 60)

GCTrimNativeHeap is off by default. If enabled, it will cause the VM to trim the native heap periodically. The period is defined by GCTrimNativeHeapInterval.

Periodic trimming is done in its own thread. We cannot burden the ServiceThread, since the runtime of trims is unpredictable.

The patch also adds a way to suspend trimming temporarily; if suspended, no trims will start, but ongoing trims will still finish.

The patch uses this mechanism to suspend trimming during GC STW phases and whenever we are about to do bulk C-heap operations (e.g. deleting deflated monitors).

Examples:

This is an artificial test that causes two high malloc spikes with long idle periods.

(yellow) NMT shows two spikes for malloc'ed memory;

(red) RSS of the baseline JVM shows that we reach a maximum and then never recover. This is the glibc retaining the free'd memory.

(blue) RSS of the patched JVM shows that we recover RSS in steps by doing periodic C-heap trimming.

alloc-test

(See here for parameters: run script )

Tests

Tested older Glibc (2.31), and newer Glibc (2.35) (mallinfo() vs mallinfo2()), on Linux x64.

Older versions of this patch were routinely tested at SAP for almost half a year.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8293114: JVM should trim the native heap (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14781/head:pull/14781
$ git checkout pull/14781

Update a local copy of the PR:
$ git checkout pull/14781
$ git pull https://git.openjdk.org/jdk.git pull/14781/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14781

View PR using the GUI difftool:
$ git pr show -t 14781

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14781.diff

Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 6, 2023

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 6, 2023

@tstuefe The following labels will be automatically applied to this pull request:

  • hotspot
  • serviceability
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added serviceability serviceability-dev@openjdk.org hotspot hotspot-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Jul 6, 2023
@tstuefe tstuefe force-pushed the JDK-8293114-JVM-should-trim-the-native-heap branch from 15a33f4 to c7093a5 Compare July 6, 2023 07:06
@jdksjolen
Copy link
Contributor

And app's malloc load can fluctuate wildly, with temporary spikes and long idle periods.

Are you talking about allocations into native memory that a Java application does on its own accord and not as a consequence of the JVM doing its own allocs? For compiling, for example.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 6, 2023

And app's malloc load can fluctuate wildly, with temporary spikes and long idle periods.

Are you talking about allocations into native memory that a Java application does on its own accord and not as a consequence of the JVM doing its own allocs? For compiling, for example.

This does not matter. The resulting malloc load is the sum of whatever we do and whatever the native code does.

@tstuefe tstuefe marked this pull request as ready for review July 6, 2023 10:01
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 6, 2023
@tstuefe
Copy link
Member Author

tstuefe commented Jul 6, 2023

Ping @shipilev and @robehn

@mlbridge
Copy link

mlbridge bot commented Jul 6, 2023

@shipilev
Copy link
Member

shipilev commented Jul 6, 2023

I don't think the comments from my yesterday's review were addressed :) There are some old comments (marked with (Outdated), helpfully), but some are new. Please take a look at them?

@tstuefe
Copy link
Member Author

tstuefe commented Jul 6, 2023

I don't think the comments from my yesterday's review were addressed :) There are some old comments (marked with (Outdated), helpfully), but some are new. Please take a look at them?

Of course, sorry. Seems I missed most of them.

About Pause in all VM ops, an alternative would be to just check in the trimmer if we are at safepoint, and if yes treat it as pause. I'll see if that's easier (I'm worried about pulling a mutex or atomic increasing the pauser variable in every VM op we run).

@shipilev
Copy link
Member

shipilev commented Jul 6, 2023

About Pause in all VM ops, an alternative would be to just check in the trimmer if we are at safepoint, and if yes treat it as pause. I'll see if that's easier (I'm worried about pulling a mutex or atomic increasing the pauser variable in every VM op we run).

Oh yes, I like it. We can just check SafepointSynchronize::is_at_safepoint() and SafepointSynchronize::is_synchronizing(). Would be even better, because we could stop trimming when there is safepoint sync pending. Makes the whole thing much less intrusive.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 6, 2023

@jdksjolen @shipilev New version:

  • I removed all manual pause calls from all GC code and replaced those with just not trimming when at or near a safe point. This is less invasive since I expect the typical trim interval to be much larger than the interval we do our VM operations in.
  • The pauses in runtime code remain
  • I restored the original arena coding, I just added the pause. Though this coding could greatly benefit from cleanups, I want to keep this patch focused and easy to downport.
  • Since we no longer have close ties to GC coding, the tests don't need to run per gc, so I dumbed down the tests.

@shipilev : I kept the SuspendMark inside TrimNative, because I like it that way. Otherwise, I would name it something like TrimNativeSuspendMark, so nothing gained but another symbol at global scope.

@lkorinth
Copy link
Contributor

lkorinth commented Jul 6, 2023

The description says -XX:GCTrimNativeHeapInterval=<seconds> (defaults to 60), but the code says milliseconds.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another read. I think we would need to tighten up style a bit too: the newline spacing style is different across the change.

src/hotspot/share/runtime/globals.hpp Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
test/hotspot/jtreg/runtime/os/TestTrimNative.java Outdated Show resolved Hide resolved
test/hotspot/jtreg/runtime/os/TestTrimNative.java Outdated Show resolved Hide resolved
test/hotspot/jtreg/runtime/os/TestTrimNative.java Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@robehn robehn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good to me.

src/hotspot/share/logging/logTag.hpp Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
test/hotspot/jtreg/runtime/os/TestTrimNative.java Outdated Show resolved Hide resolved
src/hotspot/share/runtime/trimNative.cpp Outdated Show resolved Hide resolved
Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test needs a fix for non-Linux.

test/hotspot/jtreg/runtime/os/TestTrimNative.java Outdated Show resolved Hide resolved
@shipilev
Copy link
Member

Idk. Its all solvable, its just code I guess. Maybe for a later RFE? Integrating the dcmd with periodic trimming may have one pro, that is preventing customers from shooting themselves in the foot who issue the dcmd via script and then forgot they did that :-P

Yes, I am not saying we should do it in this PR. Just thinking ahead on interaction with manual trim.

@shipilev
Copy link
Member

Realized that in production, we would like to see why trimmer might be late. I think this would look even better: trimnative-shipilev-2.patch

I thought about this too, but you don't really want to know if it was suspended for every wait interval, but for every trim interval. In other words, you want to know how many trims had been moved up because a safepoint had been happening, and how many trims had been skipped due to pause.

Well, yes. Isn't that what my patch did?

@tstuefe
Copy link
Member Author

tstuefe commented Jul 14, 2023

Realized that in production, we would like to see why trimmer might be late. I think this would look even better: trimnative-shipilev-2.patch

I thought about this too, but you don't really want to know if it was suspended for every wait interval, but for every trim interval. In other words, you want to know how many trims had been moved up because a safepoint had been happening, and how many trims had been skipped due to pause.

Well, yes. Isn't that what my patch did?

I thought you wanted to collect overarching stats about how many trims got delayed. I see now you wanted to do something much simpler. Okay.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 14, 2023

Hi @dholmes-ora, @shipilev,

New version:

  • added the log Aleksey suggested. Tested manually with +SafepointALot, works fine. Briefly considered making a patch out of it but did not since such a test is not reliable enough unless one runs a longer time, which I wanted to avoid.
  • Discussion with David made me realize a status print function would be nice. Added said function. It is now called for hs_err file generation as well as in dcmd VM.info.
  • I then used that function to beef up the gtest. I also now call the gtest (only the relevant os.trim... subsection) as a separate jtreg-controlled test.
  • fixed the unsupported-platforms-case test

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with this, with only minor nits left. Good job!

ml.notify_all(); // pause end
}
}
log_debug(trimnative)("Trim resumed after %s (%u suspend requests)", reason, n);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do it like I did in https://github.com/openjdk/jdk/files/12043977/trimnative-shipilev-2.patch ?

We don't say "resumed" if it really want not resumed due to non-zero suspend count.

}
}

log_trace(trimnative)("Times %u suspended, %u timed, %u safepoint",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Times: , I think.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 14, 2023

Thanks a lot @shipilev!

Fixed the last nits, and fixed an issue where the "this platform does not support..." text was not displayed with warning level and hence not visible without Xlog. Plus, test for that.

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still okay with it.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 17, 2023

@dholmes-ora Are you okay with this final version? Did your CI report any problems?

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing further from me.

Thanks

@tstuefe
Copy link
Member Author

tstuefe commented Jul 18, 2023

I'll wait for any last remarks until tomorrow morning (CET). Barring any objections, I'll push.

@tstuefe
Copy link
Member Author

tstuefe commented Jul 21, 2023

Many thanks @robehn, @shipilev and @dholmes-ora !

Feels good to have this finally in.

/integrate

@openjdk
Copy link

openjdk bot commented Jul 21, 2023

Going to push as commit 9e4fc56.
Since your change was applied there have been 2 commits pushed to the master branch:

  • 59f66a3: 8312293: SIGSEGV in jfr.internal.event.EventWriter.putUncheckedByte after JDK-8312086
  • 8cd43bf: 8312474: JFR: Improve logging to diagnose event stream timeout

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Jul 21, 2023
@openjdk openjdk bot closed this Jul 21, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Jul 21, 2023
@openjdk
Copy link

openjdk bot commented Jul 21, 2023

@tstuefe Pushed as commit 9e4fc56.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@dholmes-ora
Copy link
Member

@tstuefe unfortunately the test is failing intermittently in our CI. I will file a bug and assign it to you. Likely we will need a quick ProblemListing though. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot hotspot-dev@openjdk.org integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org
7 participants