JDK-8293114: JVM should trim the native heap #14781

tstuefe · 2023-07-06T06:54:22Z

This is a continuation of #10085. I closed #10085 because it had accumulated too much comment history and got confusing. For a history of this issue, see previous discussions [1] and the comment section of 10085.

This RFE adds the option to trim the Glibc heap periodically. This can recover a significant memory footprint if the VM process suffers from high-but-rare malloc spikes. It does not matter who causes the spikes: the JDK or customer code running in the JVM process.

Background:

The Glibc is reluctant to return memory to the OS. Temporary malloc spikes often carry over as permanent RSS increase. Note that C-heap retention is difficult to observe. Since it is freed memory, it won't appear in NMT; it is just a part of RSS.

This is, effectively, caching - a performance tradeoff by the glibc. It makes a lot of sense with applications that cause high traffic on the C-heap. The JVM, however, clusters allocations and often rolls its own memory management based on virtual memory for many of its use cases.

To manually trim the C-heap, Glibc exposes malloc_trim(3). With JDK 18 [2], we added a new jcmd command to manually trim the C-heap on Linux (jcmd System.trim_native_heap). We then observed customers running this command periodically to slim down process sizes of container-bound jvms. That is cumbersome, and the JVM can do this a lot better - among other things because it knows best when not to trim.

GLIBC internals

The following information I took from the glibc source code and experimenting.

Why do we need to trim manually? Does the Glibc not trim on free?

Upon free(), glibc may return memory to the OS if:

the returned block was mmap'ed
the returned block was not added to tcache or to fastbins
the returned block, possibly merged with its two immediate neighbors, had they been free, is larger than FASTBIN_CONSOLIDATION_THRESHOLD (64K) - in that case:
a) for the main arena, glibc attempts to lower the brk()
b) for mmap-ed heaps, glibc attempts to completely unmap or shrink the heap.
In both cases, (a) and (b), only the top portion of the heap is reclaimed. "Holes" in the middle of other in-use chunks are not reclaimed.

So: glibc may automatically reclaim memory. In normal configurations, with typical C-heap allocation granularity, it is unlikely.

To increase the chance of auto-reclamation happening, one can do one or more things:

a) increase allocation sizes
b) limit mallocs to very few threads, ideally just one
c) set MALLOC_ARENA_MAX=1
d) set the glibc.malloc.trim_threshold to a very low value, e.g., 1

But:

(a) is not possible for foreign code; not that even within hotspot, where we cluster allocations using hotspot arenas, the typical arena chunk size is too small to be auto-reclaimed
(b) may just not be feasible
(c) is terrible for performance since many C-Heap operations will compete over the lock of that single arena
(d) works if either (b) or (c) are in place and if the released block happens to border the top of the arena. And it also costs performance.

The JVM only has limited influence on (a), none on (b), (c) is a really bad idea, and hence (d) often does little. That mirrors my practical experiences.

How does `malloc_trim()` differ from trimming on free() ?

malloc_trim(), will look for holes that are larger than a page; so it limits itself not to just reclaiming memory at the top of the arena. It will then madvise(MADV_DONTNEED) those holes. It does that for every arena.

What are the cons of calling `malloc_trim()`?

malloc_trim() cannot be interrupted. Once it runs, it runs. The runtime of malloc_trim() is not predictable. If there is nothing to reclaim, it is very fast (sub-ms). If there is a lot to reclaim (e.g. >32GB), I saw times of up to 800ms.

Moreover, malloc_trim, while trimming each arena, locks the arena. That may lock out concurrent C-heap operations in the thread that uses this arena. Note, however, that this is rare since many operations will be satisfied from the tcache and therefore don't lock.

What about the `pad` parameter for `malloc_trim()`

I found it has very little effect. It only affects how many bytes are preserved at the top of the main arena. It does not affect other arenas, nor does it affect how much space malloc_trim reclaims by releasing "holes", which is the main part of memory release.

The Patch

Patch adds new options (experimental):

-XX:+GCTrimNativeHeap
-XX:GCTrimNativeHeapInterval=<seconds> (defaults to 60)

GCTrimNativeHeap is off by default. If enabled, it will cause the VM to trim the native heap periodically. The period is defined by GCTrimNativeHeapInterval.

Periodic trimming is done in its own thread. We cannot burden the ServiceThread, since the runtime of trims is unpredictable.

The patch also adds a way to suspend trimming temporarily; if suspended, no trims will start, but ongoing trims will still finish.

The patch uses this mechanism to suspend trimming during GC STW phases and whenever we are about to do bulk C-heap operations (e.g. deleting deflated monitors).

Examples:

This is an artificial test that causes two high malloc spikes with long idle periods.

(yellow) NMT shows two spikes for malloc'ed memory;

(red) RSS of the baseline JVM shows that we reach a maximum and then never recover. This is the glibc retaining the free'd memory.

(blue) RSS of the patched JVM shows that we recover RSS in steps by doing periodic C-heap trimming.

(See here for parameters: run script )

Tests

Tested older Glibc (2.31), and newer Glibc (2.35) (mallinfo() vs mallinfo2()), on Linux x64.

Older versions of this patch were routinely tested at SAP for almost half a year.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8293114: JVM should trim the native heap (Enhancement - P4)

Reviewers

Aleksey Shipilev (@shipilev - Reviewer)
Robbin Ehn (@robehn - Reviewer)
David Holmes (@dholmes-ora - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14781/head:pull/14781
$ git checkout pull/14781

Update a local copy of the PR:
$ git checkout pull/14781
$ git pull https://git.openjdk.org/jdk.git pull/14781/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14781

View PR using the GUI difftool:
$ git pr show -t 14781

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14781.diff

Webrev

Link to Webrev Comment

bridgekeeper · 2023-07-06T06:55:29Z

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2023-07-06T06:56:48Z

@tstuefe The following labels will be automatically applied to this pull request:

hotspot
serviceability
shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

jdksjolen · 2023-07-06T09:20:25Z

And app's malloc load can fluctuate wildly, with temporary spikes and long idle periods.

Are you talking about allocations into native memory that a Java application does on its own accord and not as a consequence of the JVM doing its own allocs? For compiling, for example.

tstuefe · 2023-07-06T09:57:44Z

And app's malloc load can fluctuate wildly, with temporary spikes and long idle periods.

Are you talking about allocations into native memory that a Java application does on its own accord and not as a consequence of the JVM doing its own allocs? For compiling, for example.

This does not matter. The resulting malloc load is the sum of whatever we do and whatever the native code does.

tstuefe · 2023-07-06T10:05:54Z

Ping @shipilev and @robehn

mlbridge · 2023-07-06T10:07:23Z

Webrevs

shipilev · 2023-07-06T10:08:03Z

I don't think the comments from my yesterday's review were addressed :) There are some old comments (marked with (Outdated), helpfully), but some are new. Please take a look at them?

tstuefe · 2023-07-06T10:47:25Z

I don't think the comments from my yesterday's review were addressed :) There are some old comments (marked with (Outdated), helpfully), but some are new. Please take a look at them?

Of course, sorry. Seems I missed most of them.

About Pause in all VM ops, an alternative would be to just check in the trimmer if we are at safepoint, and if yes treat it as pause. I'll see if that's easier (I'm worried about pulling a mutex or atomic increasing the pauser variable in every VM op we run).

shipilev · 2023-07-06T10:53:52Z

About Pause in all VM ops, an alternative would be to just check in the trimmer if we are at safepoint, and if yes treat it as pause. I'll see if that's easier (I'm worried about pulling a mutex or atomic increasing the pauser variable in every VM op we run).

Oh yes, I like it. We can just check SafepointSynchronize::is_at_safepoint() and SafepointSynchronize::is_synchronizing(). Would be even better, because we could stop trimming when there is safepoint sync pending. Makes the whole thing much less intrusive.

src/hotspot/share/memory/arena.cpp

tstuefe · 2023-07-06T15:26:38Z

@jdksjolen @shipilev New version:

I removed all manual pause calls from all GC code and replaced those with just not trimming when at or near a safe point. This is less invasive since I expect the typical trim interval to be much larger than the interval we do our VM operations in.
The pauses in runtime code remain
I restored the original arena coding, I just added the pause. Though this coding could greatly benefit from cleanups, I want to keep this patch focused and easy to downport.
Since we no longer have close ties to GC coding, the tests don't need to run per gc, so I dumbed down the tests.

@shipilev : I kept the SuspendMark inside TrimNative, because I like it that way. Otherwise, I would name it something like TrimNativeSuspendMark, so nothing gained but another symbol at global scope.