runtime: MADV_COLLAPSE causes production performance issues on Linux #63334

Closed
mknyszek opened this issue Oct 2, 2023 · 6 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone
Go1.22

Comments

@mknyszek
Contributor

mknyszek commented Oct 2, 2023

A Google production service experienced a performance regression with the Go runtime's use of MADV_COLLAPSE. We've narrowed down the issue to exactly that call. We suspect the issue is that MADV_COLLAPSE can go into direct reclaim while we're holding the heap lock. We suspect this issue is more widely applicable.

For now, let's roll back uses of MADV_COLLAPSE. We can revisit this in the future, but our current policy is almost certainly too aggressive given the costs.
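
For anyone who wants to see the failure mode outside the runtime, here is a minimal standalone C sketch (an illustration under stated assumptions: Linux 6.1+ with 2 MiB PMD hugepages; this is not the runtime's code, which issues the equivalent madvise while holding the heap lock). It faults a 2 MiB region in as base pages and times one madvise(MADV_COLLAPSE) over it; under memory pressure that single call can stall in direct reclaim.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25 /* <linux/mman.h>, Linux >= 6.1 */
    #endif

    int main(void) {
        const size_t huge = 2 * 1024 * 1024; /* one PMD-sized hugepage */

        /* Over-allocate so a hugepage-aligned 2 MiB region certainly exists. */
        void *p = mmap(NULL, 2 * huge, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        void *v = (void *)(((uintptr_t)p + huge - 1) & ~(uintptr_t)(huge - 1));

        memset(v, 0xab, huge); /* fault the region in as base pages first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        int rc = madvise(v, huge, MADV_COLLAPSE); /* may enter direct reclaim */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("madvise(MADV_COLLAPSE) = %d, took %.3f ms\n", rc,
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
    }

In the runtime, every allocating goroutine that needed the heap lock queued behind a call like that one.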

@mknyszek mknyszek added NeedsFix The path to resolution is known, but the work has not been done. release-blocker compiler/runtime Issues related to the Go compiler and/or runtime. labels Oct 2, 2023
@mknyszek mknyszek added this to the Go1.22 milestone Oct 2, 2023
@mknyszek
Contributor Author

mknyszek commented Oct 2, 2023

@gopherbot Please open a backport issue for Go 1.21.

This issue can cause a significant performance regression and has no workaround. The fix is small and safe (this code path does not affect correctness).

@gopherbot
Contributor

Backport issue(s) opened: #63335 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Contributor

Change https://go.dev/cl/531816 mentions this issue: runtime: don't eagerly collapse hugepages

@gopherbot
Contributor

Change https://go.dev/cl/532117 mentions this issue: runtime: delete hugepage tracking dead code

gopherbot pushed a commit that referenced this issue Oct 2, 2023
After the previous CL, this is now all dead code. This change is
separated out to make the previous one easy to backport.

For #63334.
Related to #61718 and #59960.

Change-Id: I109673ed97c62c472bbe2717dfeeb5aa4fc883ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/532117
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@mknyszek
Contributor Author

mknyszek commented Oct 2, 2023

I accidentally auto-submitted CL 531816 with a fairly vague commit message. Here's the full intended commit message.

    MADV_COLLAPSE can go into direct reclaim, but we call it with the heap
    lock held. This means that the process could end up stalled fairly
    quickly if just one allocating goroutine ends up in the madvise call, at
    least until the madvise(MADV_COLLAPSE) call returns. A similar issue
    occurred with madvise(MADV_HUGEPAGE), because that could go into direct
    reclaim on any page fault for MADV_HUGEPAGE-marked memory.

    My understanding was that the calls to madvise(MADV_COLLAPSE) were
    fairly rare, and its "best-effort" nature prevented it from going into
    direct reclaim often, but this was wrong. It tends to be fairly
    heavyweight even when it doesn't end up in direct reclaim, and it's
    almost certainly not worth it.

    Disable it until further notice and let the kernel fully dictate
    hugepage policy. The updated scavenger policy is still more hugepage
    friendly by delaying scavenging until hugepages are no longer densely
    packed, so we don't lose all that much.

    The Sweet benchmarks show a minimal difference. A couple less realistic
    benchmarks seem to slow down a bit; they might just be getting unlucky
    with what the kernel decides to back with a huge page. Some benchmarks
    on the other hand improve. Overall, it's a wash.

    name                  old time/op            new time/op            delta
    BiogoIgor                        13.1s ± 1%             13.2s ± 2%    ~     (p=0.182 n=9+10)
    BiogoKrishna                     12.0s ± 1%             12.1s ± 1%  +1.23%  (p=0.002 n=9+10)
    BleveIndexBatch100               4.51s ± 4%             4.56s ± 3%    ~     (p=0.393 n=10+10)
    EtcdPut                         20.2ms ± 4%            19.8ms ± 2%    ~     (p=0.079 n=10+9)
    EtcdSTM                          109ms ± 3%             111ms ± 3%  +1.63%  (p=0.035 n=10+10)
    GoBuildKubelet                   31.2s ± 1%             31.3s ± 1%    ~     (p=0.780 n=9+10)
    GoBuildKubeletLink               7.77s ± 0%             7.81s ± 2%    ~     (p=0.237 n=8+10)
    GoBuildIstioctl                  31.8s ± 1%             31.7s ± 0%    ~     (p=0.136 n=9+9)
    GoBuildIstioctlLink              7.88s ± 1%             7.89s ± 1%    ~     (p=0.720 n=9+10)
    GoBuildFrontend                  11.7s ± 1%             11.8s ± 1%    ~     (p=0.278 n=10+9)
    GoBuildFrontendLink              1.15s ± 4%             1.15s ± 5%    ~     (p=0.387 n=9+9)
    GopherLuaKNucleotide             19.7s ± 1%             20.6s ± 0%  +4.48%  (p=0.000 n=10+10)
    MarkdownRenderXHTML              194ms ± 3%             196ms ± 3%    ~     (p=0.356 n=9+10)
    Tile38QueryLoad                  633µs ± 2%             629µs ± 2%    ~     (p=0.075 n=10+10)

    name                  old average-RSS-bytes  new average-RSS-bytes  delta
    BiogoIgor                       69.2MB ± 3%            68.4MB ± 1%    ~     (p=0.190 n=10+10)
    BiogoKrishna                    4.40GB ± 0%            4.40GB ± 0%    ~     (p=0.605 n=9+9)
    BleveIndexBatch100               195MB ± 3%             195MB ± 2%    ~     (p=0.853 n=10+10)
    EtcdPut                          107MB ± 4%             108MB ± 3%    ~     (p=0.190 n=10+10)
    EtcdSTM                         91.6MB ± 5%            92.6MB ± 4%    ~     (p=0.481 n=10+10)
    GoBuildKubelet                  2.26GB ± 1%            2.28GB ± 1%  +1.22%  (p=0.000 n=10+10)
    GoBuildIstioctl                 1.53GB ± 0%            1.53GB ± 0%  +0.21%  (p=0.017 n=9+10)
    GoBuildFrontend                  556MB ± 1%             554MB ± 2%    ~     (p=0.497 n=9+10)
    GopherLuaKNucleotide            39.0MB ± 3%            39.0MB ± 1%    ~     (p=1.000 n=10+8)
    MarkdownRenderXHTML             21.2MB ± 2%            21.4MB ± 3%    ~     (p=0.190 n=10+10)
    Tile38QueryLoad                 5.99GB ± 2%            6.02GB ± 0%    ~     (p=0.243 n=10+9)

    name                  old peak-RSS-bytes     new peak-RSS-bytes     delta
    BiogoIgor                       90.2MB ± 4%            89.2MB ± 2%    ~     (p=0.143 n=10+10)
    BiogoKrishna                    4.49GB ± 0%            4.49GB ± 0%    ~     (p=0.190 n=10+10)
    BleveIndexBatch100               283MB ± 8%             274MB ± 6%    ~     (p=0.075 n=10+10)
    EtcdPut                          147MB ± 4%             149MB ± 2%  +1.55%  (p=0.034 n=10+8)
    EtcdSTM                          117MB ± 5%             117MB ± 4%    ~     (p=0.905 n=9+10)
    GopherLuaKNucleotide            44.9MB ± 1%            44.6MB ± 1%    ~     (p=0.083 n=8+8)
    MarkdownRenderXHTML             22.0MB ± 8%            22.1MB ± 9%    ~     (p=0.436 n=10+10)
    Tile38QueryLoad                 6.24GB ± 2%            6.29GB ± 2%    ~     (p=0.218 n=10+10)

    name                  old peak-VM-bytes      new peak-VM-bytes      delta
    BiogoIgor                       1.33GB ± 0%            1.33GB ± 0%    ~     (p=0.504 n=10+9)
    BiogoKrishna                    5.77GB ± 0%            5.77GB ± 0%    ~     (p=1.000 n=10+9)
    BleveIndexBatch100              3.53GB ± 0%            3.53GB ± 0%    ~     (p=0.642 n=10+10)
    EtcdPut                         12.1GB ± 0%            12.1GB ± 0%    ~     (p=0.564 n=10+10)
    EtcdSTM                         12.1GB ± 0%            12.1GB ± 0%    ~     (p=0.633 n=10+10)
    GopherLuaKNucleotide            1.26GB ± 0%            1.26GB ± 0%    ~     (p=0.297 n=9+10)
    MarkdownRenderXHTML             1.26GB ± 0%            1.26GB ± 0%    ~     (p=0.069 n=10+10)
    Tile38QueryLoad                 7.47GB ± 2%            7.53GB ± 2%    ~     (p=0.280 n=10+10)

    name                  old p50-latency-ns     new p50-latency-ns     delta
    EtcdPut                          19.8M ± 5%             19.3M ± 3%  -2.74%  (p=0.043 n=10+9)
    EtcdSTM                          81.4M ± 4%             83.4M ± 4%  +2.46%  (p=0.029 n=10+10)
    Tile38QueryLoad                   241k ± 1%              240k ± 1%    ~     (p=0.393 n=10+10)

    name                  old p90-latency-ns     new p90-latency-ns     delta
    EtcdPut                          30.4M ± 5%             30.6M ± 5%    ~     (p=0.971 n=10+10)
    EtcdSTM                           222M ± 3%              226M ± 4%    ~     (p=0.063 n=10+10)
    Tile38QueryLoad                   687k ± 2%              691k ± 1%    ~     (p=0.173 n=10+8)

    name                  old p99-latency-ns     new p99-latency-ns     delta
    EtcdPut                          42.3M ±10%             41.4M ± 7%    ~     (p=0.353 n=10+10)
    EtcdSTM                           486M ± 7%              487M ± 4%    ~     (p=0.579 n=10+10)
    Tile38QueryLoad                  6.43M ± 2%             6.37M ± 3%    ~     (p=0.280 n=10+10)

    name                  old ops/s              new ops/s              delta
    EtcdPut                          48.6k ± 3%             49.5k ± 2%    ~     (p=0.065 n=10+9)
    EtcdSTM                          9.09k ± 2%             8.95k ± 3%  -1.56%  (p=0.045 n=10+10)
    Tile38QueryLoad                  28.4k ± 1%             28.6k ± 1%  +0.87%  (p=0.016 n=9+10)

@gopherbot
Contributor

Change https://go.dev/cl/532255 mentions this issue: [release-branch.go1.21] runtime: don't eagerly collapse hugepages

gopherbot pushed a commit that referenced this issue Oct 12, 2023
This has caused performance issues in production environments.

Fixes #63335.
For #63334.
Related to #61718 and #59960.

Change-Id: If84c5a8685825d43c912a71418f2597e44e867e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/531816
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit 595deec)
Reviewed-on: https://go-review.googlesource.com/c/go/+/532255
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 16, 2024
…epage collapse

This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
make a least-effort attempt at a synchronous collapse of memory at
their own expense.

The only difference from MADV_COLLAPSE is that the new hugepage allocation
avoids direct reclaim/compaction, quickly failing on allocation errors.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the THP
* Avoid unpredictable timing of khugepaged collapse
* Prevent unpredictable stalls caused by direct reclaim and/or compaction

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
of the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage-aligned/sized region to be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, failing quickly when no hugepage is immediately available.
When the system has multiple NUMA nodes, the hugepage will be allocated
from the node providing the most native pages. This operation acts on
the current state of the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or
faulted in the future.

Return Value

If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful.  On success, madvise(2) returns 0.
Else, -1 is returned and errno is set to indicate the error for the
most-recently attempted hugepage collapse.  Note that many failures might
have occurred, since the operation may continue to collapse in the event a
single hugepage-sized/aligned region fails.

        ENOMEM  Memory allocation failed or VMA not found
        EBUSY   Memcg charging failed
        EAGAIN  Required resource temporarily unavailable.  Trying
                again might succeed.
        EINVAL  Other error: No PMD found, subpage doesn't have Present
                bit set, "Special" page not backed by struct page, VMA
                incorrectly sized, address not page-aligned, ...

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334
Signed-off-by: Lance Yang <ioworker0@gmail.com>
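
To make the proposed interface concrete, a hedged user-space sketch follows. MADV_TRY_COLLAPSE was never merged, so the advice value below is a placeholder rather than a real ABI constant; the error handling mirrors the errno list above.

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef MADV_TRY_COLLAPSE
    #define MADV_TRY_COLLAPSE 26 /* placeholder: proposed advice, never merged */
    #endif

    /* Best-effort collapse of [addr, addr+len): per the semantics above,
     * the call never enters direct reclaim or compaction, so failure is
     * cheap and the caller simply keeps running on base pages. */
    static int try_collapse(void *addr, size_t len) {
        if (madvise(addr, len, MADV_TRY_COLLAPSE) == 0)
            return 0;      /* collapsed, or already a PMD-mapped THP */
        if (errno == EAGAIN)
            return 1;      /* temporarily unavailable; retry later */
        return -1;         /* ENOMEM/EBUSY/EINVAL: stay on base pages */
    }
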
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 17, 2024
…epage collapse

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Jan 17, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but it
avoids direct reclaim and/or compaction, quickly failing on allocation errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
of the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage-aligned/sized region to be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, failing quickly when no hugepage is immediately available.
When the system has multiple NUMA nodes, the hugepage will be allocated
from the node providing the most native pages. This operation acts on
the current state of the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or
faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
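
A hedged sketch of how this variant might be invoked, assuming the flag travels in process_madvise(2)'s final flags argument: the syscall number and MADV_COLLAPSE are real, while MADV_F_COLLAPSE_LIGHT is a placeholder from the unmerged patch.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef __NR_process_madvise
    #define __NR_process_madvise 440
    #endif
    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25 /* Linux >= 6.1 */
    #endif
    #ifndef MADV_F_COLLAPSE_LIGHT
    #define MADV_F_COLLAPSE_LIGHT (1u << 0) /* placeholder: proposed, never merged */
    #endif

    /* Request a reclaim-free collapse of [addr, addr+len) in the process
     * behind pidfd (e.g. from pidfd_open(2)). Per the patch, this needs
     * CAP_SYS_ADMIN unless the caller targets its own memory. */
    static long collapse_light(int pidfd, void *addr, size_t len) {
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        return syscall(__NR_process_madvise, pidfd, &iov, 1UL,
                       MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT);
    }
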
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>