runtime: MADV_COLLAPSE causes production performance issues on Linux #63334

Closed
mknyszek opened this issue Oct 2, 2023 · 6 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone
Go1.22

Comments

@mknyszek
Contributor

mknyszek commented Oct 2, 2023

A Google production service experienced a performance regression with the Go runtime's use of MADV_COLLAPSE. We've narrowed down the issue to exactly that call. We suspect the issue is that MADV_COLLAPSE can go into direct reclaim while we're holding the heap lock. We suspect this issue is more widely applicable.

For now, let's roll back uses of MADV_COLLAPSE. We can revisit this in the future, but our current policy is almost certainly too aggressive given the costs.
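
For anyone who wants to see the failure mode outside the runtime, here is a minimal standalone C sketch (an illustration under stated assumptions: Linux 6.1+ with 2 MiB PMD hugepages; this is not the runtime's code, which issues the equivalent madvise while holding the heap lock). It faults a 2 MiB region in as base pages and times one madvise(MADV_COLLAPSE) over it; under memory pressure that single call can stall in direct reclaim.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25 /* <linux/mman.h>, Linux >= 6.1 */
    #endif

    int main(void) {
        const size_t huge = 2 * 1024 * 1024; /* one PMD-sized hugepage */

        /* Over-allocate so a hugepage-aligned 2 MiB region certainly exists. */
        void *p = mmap(NULL, 2 * huge, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        void *v = (void *)(((uintptr_t)p + huge - 1) & ~(uintptr_t)(huge - 1));

        memset(v, 0xab, huge); /* fault the region in as base pages first */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        int rc = madvise(v, huge, MADV_COLLAPSE); /* may enter direct reclaim */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("madvise(MADV_COLLAPSE) = %d, took %.3f ms\n", rc,
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
    }

In the runtime, every allocating goroutine that needed the heap lock queued behind a call like that one.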

@mknyszek mknyszek added NeedsFix The path to resolution is known, but the work has not been done. release-blocker compiler/runtime Issues related to the Go compiler and/or runtime. labels Oct 2, 2023
@mknyszek mknyszek added this to the Go1.22 milestone Oct 2, 2023
@mknyszek
Contributor Author

mknyszek commented Oct 2, 2023

@gopherbot Please open a backport issue for Go 1.21.

This issue can cause a significant performance regression and has no workaround. The fix is small and safe (this code path does not affect correctness).

@gopherbot
Contributor

Backport issue(s) opened: #63335 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Contributor

Change https://go.dev/cl/531816 mentions this issue: runtime: don't eagerly collapse hugepages

@gopherbot
Contributor

Change https://go.dev/cl/532117 mentions this issue: runtime: delete hugepage tracking dead code

gopherbot pushed a commit that referenced this issue Oct 2, 2023
After the previous CL, this is now all dead code. This change is
separated out to make the previous one easy to backport.

For #63334.
Related to #61718 and #59960.

Change-Id: I109673ed97c62c472bbe2717dfeeb5aa4fc883ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/532117
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@mknyszek
Contributor Author

mknyszek commented Oct 2, 2023

I accidentally auto-submitted CL 531816 with a fairly vague commit message. Here's the full intended commit message.

    MADV_COLLAPSE can go into direct reclaim, but we call it with the heap
    lock held. This means that the process could end up stalled fairly
    quickly if just one allocating goroutine ends up in the madvise call, at
    least until the madvise(MADV_COLLAPSE) call returns. A similar issue
    occurred with madvise(MADV_HUGEPAGE), because that could go into direct
    reclaim on any page fault for MADV_HUGEPAGE-marked memory.

    My understanding was that the calls to madvise(MADV_COLLAPSE) were
    fairly rare, and its "best-effort" nature prevented it from going into
    direct reclaim often, but this was wrong. It tends to be fairly
    heavyweight even when it doesn't end up in direct reclaim, and it's
    almost certainly not worth it.

    Disable it until further notice and let the kernel fully dictate
    hugepage policy. The updated scavenger policy is still more hugepage
    friendly by delaying scavenging until hugepages are no longer densely
    packed, so we don't lose all that much.

    The Sweet benchmarks show a minimal difference. A couple less realistic
    benchmarks seem to slow down a bit; they might just be getting unlucky
    with what the kernel decides to back with a huge page. Some benchmarks
    on the other hand improve. Overall, it's a wash.

    name                  old time/op            new time/op            delta
    BiogoIgor                        13.1s ± 1%             13.2s ± 2%    ~     (p=0.182 n=9+10)
    BiogoKrishna                     12.0s ± 1%             12.1s ± 1%  +1.23%  (p=0.002 n=9+10)
    BleveIndexBatch100               4.51s ± 4%             4.56s ± 3%    ~     (p=0.393 n=10+10)
    EtcdPut                         20.2ms ± 4%            19.8ms ± 2%    ~     (p=0.079 n=10+9)
    EtcdSTM                          109ms ± 3%             111ms ± 3%  +1.63%  (p=0.035 n=10+10)
    GoBuildKubelet                   31.2s ± 1%             31.3s ± 1%    ~     (p=0.780 n=9+10)
    GoBuildKubeletLink               7.77s ± 0%             7.81s ± 2%    ~     (p=0.237 n=8+10)
    GoBuildIstioctl                  31.8s ± 1%             31.7s ± 0%    ~     (p=0.136 n=9+9)
    GoBuildIstioctlLink              7.88s ± 1%             7.89s ± 1%    ~     (p=0.720 n=9+10)
    GoBuildFrontend                  11.7s ± 1%             11.8s ± 1%    ~     (p=0.278 n=10+9)
    GoBuildFrontendLink              1.15s ± 4%             1.15s ± 5%    ~     (p=0.387 n=9+9)
    GopherLuaKNucleotide             19.7s ± 1%             20.6s ± 0%  +4.48%  (p=0.000 n=10+10)
    MarkdownRenderXHTML              194ms ± 3%             196ms ± 3%    ~     (p=0.356 n=9+10)
    Tile38QueryLoad                  633µs ± 2%             629µs ± 2%    ~     (p=0.075 n=10+10)

    name                  old average-RSS-bytes  new average-RSS-bytes  delta
    BiogoIgor                       69.2MB ± 3%            68.4MB ± 1%    ~     (p=0.190 n=10+10)
    BiogoKrishna                    4.40GB ± 0%            4.40GB ± 0%    ~     (p=0.605 n=9+9)
    BleveIndexBatch100               195MB ± 3%             195MB ± 2%    ~     (p=0.853 n=10+10)
    EtcdPut                          107MB ± 4%             108MB ± 3%    ~     (p=0.190 n=10+10)
    EtcdSTM                         91.6MB ± 5%            92.6MB ± 4%    ~     (p=0.481 n=10+10)
    GoBuildKubelet                  2.26GB ± 1%            2.28GB ± 1%  +1.22%  (p=0.000 n=10+10)
    GoBuildIstioctl                 1.53GB ± 0%            1.53GB ± 0%  +0.21%  (p=0.017 n=9+10)
    GoBuildFrontend                  556MB ± 1%             554MB ± 2%    ~     (p=0.497 n=9+10)
    GopherLuaKNucleotide            39.0MB ± 3%            39.0MB ± 1%    ~     (p=1.000 n=10+8)
    MarkdownRenderXHTML             21.2MB ± 2%            21.4MB ± 3%    ~     (p=0.190 n=10+10)
    Tile38QueryLoad                 5.99GB ± 2%            6.02GB ± 0%    ~     (p=0.243 n=10+9)

    name                  old peak-RSS-bytes     new peak-RSS-bytes     delta
    BiogoIgor                       90.2MB ± 4%            89.2MB ± 2%    ~     (p=0.143 n=10+10)
    BiogoKrishna                    4.49GB ± 0%            4.49GB ± 0%    ~     (p=0.190 n=10+10)
    BleveIndexBatch100               283MB ± 8%             274MB ± 6%    ~     (p=0.075 n=10+10)
    EtcdPut                          147MB ± 4%             149MB ± 2%  +1.55%  (p=0.034 n=10+8)
    EtcdSTM                          117MB ± 5%             117MB ± 4%    ~     (p=0.905 n=9+10)
    GopherLuaKNucleotide            44.9MB ± 1%            44.6MB ± 1%    ~     (p=0.083 n=8+8)
    MarkdownRenderXHTML             22.0MB ± 8%            22.1MB ± 9%    ~     (p=0.436 n=10+10)
    Tile38QueryLoad                 6.24GB ± 2%            6.29GB ± 2%    ~     (p=0.218 n=10+10)

    name                  old peak-VM-bytes      new peak-VM-bytes      delta
    BiogoIgor                       1.33GB ± 0%            1.33GB ± 0%    ~     (p=0.504 n=10+9)
    BiogoKrishna                    5.77GB ± 0%            5.77GB ± 0%    ~     (p=1.000 n=10+9)
    BleveIndexBatch100              3.53GB ± 0%            3.53GB ± 0%    ~     (p=0.642 n=10+10)
    EtcdPut                         12.1GB ± 0%            12.1GB ± 0%    ~     (p=0.564 n=10+10)
    EtcdSTM                         12.1GB ± 0%            12.1GB ± 0%    ~     (p=0.633 n=10+10)
    GopherLuaKNucleotide            1.26GB ± 0%            1.26GB ± 0%    ~     (p=0.297 n=9+10)
    MarkdownRenderXHTML             1.26GB ± 0%            1.26GB ± 0%    ~     (p=0.069 n=10+10)
    Tile38QueryLoad                 7.47GB ± 2%            7.53GB ± 2%    ~     (p=0.280 n=10+10)

    name                  old p50-latency-ns     new p50-latency-ns     delta
    EtcdPut                          19.8M ± 5%             19.3M ± 3%  -2.74%  (p=0.043 n=10+9)
    EtcdSTM                          81.4M ± 4%             83.4M ± 4%  +2.46%  (p=0.029 n=10+10)
    Tile38QueryLoad                   241k ± 1%              240k ± 1%    ~     (p=0.393 n=10+10)

    name                  old p90-latency-ns     new p90-latency-ns     delta
    EtcdPut                          30.4M ± 5%             30.6M ± 5%    ~     (p=0.971 n=10+10)
    EtcdSTM                           222M ± 3%              226M ± 4%    ~     (p=0.063 n=10+10)
    Tile38QueryLoad                   687k ± 2%              691k ± 1%    ~     (p=0.173 n=10+8)

    name                  old p99-latency-ns     new p99-latency-ns     delta
    EtcdPut                          42.3M ±10%             41.4M ± 7%    ~     (p=0.353 n=10+10)
    EtcdSTM                           486M ± 7%              487M ± 4%    ~     (p=0.579 n=10+10)
    Tile38QueryLoad                  6.43M ± 2%             6.37M ± 3%    ~     (p=0.280 n=10+10)

    name                  old ops/s              new ops/s              delta
    EtcdPut                          48.6k ± 3%             49.5k ± 2%    ~     (p=0.065 n=10+9)
    EtcdSTM                          9.09k ± 2%             8.95k ± 3%  -1.56%  (p=0.045 n=10+10)
    Tile38QueryLoad                  28.4k ± 1%             28.6k ± 1%  +0.87%  (p=0.016 n=9+10)

@gopherbot
Contributor

Change https://go.dev/cl/532255 mentions this issue: [release-branch.go1.21] runtime: don't eagerly collapse hugepages

gopherbot pushed a commit that referenced this issue Oct 12, 2023
This has caused performance issues in production environments.

Fixes #63335.
For #63334.
Related to #61718 and #59960.

Change-Id: If84c5a8685825d43c912a71418f2597e44e867e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/531816
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit 595deec)
Reviewed-on: https://go-review.googlesource.com/c/go/+/532255
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 16, 2024
…epage collapse

This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
make a least-effort attempt at a synchronous collapse of memory at
their own expense.

The only difference from MADV_COLLAPSE is that the new hugepage allocation
avoids direct reclaim/compaction, quickly failing on allocation errors.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the THP
* Avoid unpredictable timing of khugepaged collapse
* Prevent unpredictable stalls caused by direct reclaim and/or compaction

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
of the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage-aligned/sized region to be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, failing quickly when no hugepage is immediately available.
When the system has multiple NUMA nodes, the hugepage will be allocated
from the node providing the most native pages. This operation acts on
the current state of the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or
faulted in the future.

Return Value

If all hugepage-sized/aligned regions covered by the provided range were
either successfully collapsed, or were already PMD-mapped THPs, this
operation will be deemed successful.  On success, madvise(2) returns 0.
Else, -1 is returned and errno is set to indicate the error for the
most-recently attempted hugepage collapse.  Note that many failures might
have occurred, since the operation may continue to collapse in the event a
single hugepage-sized/aligned region fails.

        ENOMEM  Memory allocation failed or VMA not found
        EBUSY   Memcg charging failed
        EAGAIN  Required resource temporarily unavailable.  Trying
                again might succeed.
        EINVAL  Other error: No PMD found, subpage doesn't have Present
                bit set, "Special" page not backed by struct page, VMA
                incorrectly sized, address not page-aligned, ...

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334
Signed-off-by: Lance Yang <ioworker0@gmail.com>
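
To make the proposed interface concrete, a hedged user-space sketch follows. MADV_TRY_COLLAPSE was never merged, so the advice value below is a placeholder rather than a real ABI constant; the error handling mirrors the errno list above.

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    #ifndef MADV_TRY_COLLAPSE
    #define MADV_TRY_COLLAPSE 26 /* placeholder: proposed advice, never merged */
    #endif

    /* Best-effort collapse of [addr, addr+len): per the semantics above,
     * the call never enters direct reclaim or compaction, so failure is
     * cheap and the caller simply keeps running on base pages. */
    static int try_collapse(void *addr, size_t len) {
        if (madvise(addr, len, MADV_TRY_COLLAPSE) == 0)
            return 0;      /* collapsed, or already a PMD-mapped THP */
        if (errno == EAGAIN)
            return 1;      /* temporarily unavailable; retry later */
        return -1;         /* ENOMEM/EBUSY/EINVAL: stay on base pages */
    }
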
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 17, 2024
…epage collapse

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Jan 17, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
…epage collapse

ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller has
CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but it
avoids direct reclaim and/or compaction, quickly failing on allocation errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA are independent
of the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage-aligned/sized region to be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, failing quickly when no hugepage is immediately available.
When the system has multiple NUMA nodes, the hugepage will be allocated
from the node providing the most native pages. This operation acts on
the current state of the specified process and makes no persistent
changes or guarantees on how pages will be mapped, constructed, or
faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
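
A hedged sketch of how this variant might be invoked, assuming the flag travels in process_madvise(2)'s final flags argument: the syscall number and MADV_COLLAPSE are real, while MADV_F_COLLAPSE_LIGHT is a placeholder from the unmerged patch.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef __NR_process_madvise
    #define __NR_process_madvise 440
    #endif
    #ifndef MADV_COLLAPSE
    #define MADV_COLLAPSE 25 /* Linux >= 6.1 */
    #endif
    #ifndef MADV_F_COLLAPSE_LIGHT
    #define MADV_F_COLLAPSE_LIGHT (1u << 0) /* placeholder: proposed, never merged */
    #endif

    /* Request a reclaim-free collapse of [addr, addr+len) in the process
     * behind pidfd (e.g. from pidfd_open(2)). Per the patch, this needs
     * CAP_SYS_ADMIN unless the caller targets its own memory. */
    static long collapse_light(int pidfd, void *addr, size_t len) {
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        return syscall(__NR_process_madvise, pidfd, &iov, 1UL,
                       MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT);
    }
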
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 18, 2024
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>
ioworker0 added a commit to ioworker0/linux that referenced this issue Jan 19, 2024
This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].

Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
has CAP_SYS_ADMIN or is requesting the collapse of its own memory.

The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
it  avoids direct reclaim and/or compaction, quickly failing on allocation
errors.

This change enables a more flexible and efficient usage of memory collapse
operations, providing additional control to userspace applications for
system-wide THP optimization.

Semantics

This call is independent of the system-wide THP sysfs settings, but will
fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
multiple VMAs, the semantics of the collapse over each VMA is independent
from the others.  This implies a hugepage cannot cross a VMA boundary.  If
collapse of a given hugepage-aligned/sized region fails, the operation may
continue to attempt collapsing the remainder of memory specified.

The memory ranges provided must be page-aligned, but are not required to
be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
start/end of the range will be clamped to the first/last hugepage-aligned
address covered by said range.  The memory ranges must span at least one
hugepage-sized region.

All non-resident pages covered by the range will first be
swapped/faulted-in, before being internally copied onto a freshly
allocated hugepage.  Unmapped pages will have their data directly
initialized to 0 in the new hugepage.  However, for every eligible
hugepage aligned/sized region to-be collapsed, at least one page must
currently be backed by memory (a PMD covering the address range must
already exist).

Allocation for the new hugepage will not enter direct reclaim and/or
compaction, quickly failing if allocation fails. When the system has
multiple NUMA nodes, the hugepage will be allocated from the node providing
the most native pages. This operation operates on the current state of the
specified process and makes no persistent changes or guarantees on how pages
will be mapped, constructed, or faulted in the future.

Use Cases

An immediate user of this new functionality is the Go runtime heap allocator
that manages memory in hugepage-sized chunks. In the past, whether it was a
newly allocated chunk through mmap() or a reused chunk released by
madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
respectively. However, both approaches resulted in performance issues; for
both scenarios, there could be entries into direct reclaim and/or compaction,
leading to unpredictable stalls[4]. Now, the allocator can confidently use
process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.

[1] torvalds@7d8faaf
[2] golang/go@8fa9e3b
[3] golang/go@9f9bb26
[4] golang/go#63334

[v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
Suggested-by: Zach O'Keefe <zokeefe@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>