Add explicit huge page and memory recycling support to pgalloc.MemoryFile. #9072

Merged
merged 1 commit into master from test/cl481741148 on Jun 28, 2024

Conversation

@copybara-service copybara-service bot commented Jun 8, 2023

Add explicit huge page and memory recycling support to pgalloc.MemoryFile.

This CL addresses the following major issues:

  • When an application releases memory to the sentry, the sentry unconditionally
    releases that memory to the host, rather than allowing it to be reused for
    future allocations, in order to ensure that new allocations are uniformly
    decommitted (use no memory): cl/145016083. In most cases, this should have
    relatively little performance impact; since releasing memory from the
    application to the OS is expensive even outside of gVisor, application memory
    allocators optimizing for performance already limit the rate at which they
    release memory to the OS. However, in applications that involve frequent
    process creation and exit (e.g. build systems), this practice prevents reuse
    of memory deallocated by exiting processes for memory allocated by new
    processes, resulting in both performance degradation and a spike in memory
    usage (since the sentry may not have released all deallocated memory to the
    host by the time new allocations occur).

  • gVisor's historical approach to application THP relies on enabling THP on a
    per-memfd basis via the MFD_HUGEPAGE flag, which was never merged into the
    upstream Linux kernel
    (https://patchwork.kernel.org/project/linux-mm/patch/c140f56a-1aa3-f7ae-b7d1-93da7d5a3572@google.com/).
    Thus, on vanilla Linux kernels, gVisor cannot use THP for application memory
    without requiring the system to enable THP for all tmpfs files and memfds (by
    setting /sys/kernel/mm/transparent_hugepage/shmem_enabled to "always" or
    "force").

  • Both MM and the application page allocator (pgalloc) are agnostic as to
    whether the underlying memory file will be THP-backed. Instead, both attempt
    to align hugepage-sized and larger allocations to hugepage boundaries, such
    that if the memory file happens to support THP then such allocations will be
    appropriately aligned to use THP. This is suboptimal since many allocations
    do not benefit from THP, resulting in memory underutilization.

These issues are especially relevant to platforms based on hardware
virtualization, where acquiring memory from the host is significantly more
expensive due to EPT/NPT fault overhead; when effective, THP reduces the
frequency with which said cost is incurred by a factor of 512 (one 2MB huge
page spans 512 4KB base pages), and page reuse avoids incurring it at all.

Thus:

  • Instead of inferring whether THP use is desired from allocation size,
    indicate this explicitly as AllocOpts.Huge, and only set it to true for
    allocations for non-stack private anonymous mappings (see the sketch
    following this list).

  • Add AllocateCallerIndirectCommit, a new possible value for AllocOpts.Mode
    that indicates that the caller will commit all pages in the allocation. In
    such cases, pgalloc can reuse deallocated pages without risking increased
    memory usage, internally referred to as "recycling".
    AllocateCallerIndirectCommit is used primarily for page faults on a
    THP-backed region. (It is also used for single-page allocations on non-THP
    backed regions, but due to expansion of faults to mm.privateAllocUnit-aligned
    ranges, this is relatively uncommon.)

  • Allow different chunks in pgalloc.MemoryFile's backing file to have varying
    THP-ness, indicated to the host using MADV_HUGEPAGE/MADV_NOHUGEPAGE.

  • Split pgalloc.MemoryFile's existing page metadata set into two sets tracking
    deallocated pages for small/huge-page-backed regions respectively; two sets
    tracking in-use pages for small/huge-page-backed regions respectively; and a
    fifth set tracking memory accounting state.

  • Add MemoryFileOpts.DisableMemoryAccounting; this is primarily intended for
    pgalloc tests, but may also be applicable to disk-backed MemoryFiles.
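
To make the new allocation options concrete, the following is a minimal,
hypothetical Go sketch of how a fault handler might choose them. Only the
names AllocOpts.Huge, AllocOpts.Mode, and AllocateCallerIndirectCommit come
from this description; the surrounding types, the AllocateUncommitted
placeholder, and the optsForFault helper are illustrative stand-ins, not the
actual pgalloc API.

```go
// Illustrative sketch only: pgalloc's real AllocOpts has more fields and a
// different surface; this toy models just the two options discussed above.
package main

import "fmt"

// AllocationMode stands in for the AllocOpts.Mode enum described in this CL.
type AllocationMode int

const (
	// AllocateUncommitted (invented name): the allocation must start
	// decommitted, so deallocated pages cannot be recycled into it.
	AllocateUncommitted AllocationMode = iota
	// AllocateCallerIndirectCommit: the caller promises to commit every page
	// in the allocation, so pgalloc may recycle deallocated pages without
	// increasing memory usage.
	AllocateCallerIndirectCommit
)

// AllocOpts models the per-allocation options named in this CL.
type AllocOpts struct {
	Huge bool           // back the allocation with THP (MADV_HUGEPAGE) chunks
	Mode AllocationMode // commitment contract between caller and pgalloc
}

// optsForFault applies the policy described above: request huge pages only for
// non-stack private anonymous mappings, and use caller-indirect-commit
// (enabling recycling) when the faulting caller will commit every page, e.g. a
// fault on a THP-backed region.
func optsForFault(privateAnon, stack, callerCommitsAllPages bool) AllocOpts {
	opts := AllocOpts{Huge: privateAnon && !stack, Mode: AllocateUncommitted}
	if callerCommitsAllPages {
		opts.Mode = AllocateCallerIndirectCommit
	}
	return opts
}

func main() {
	// Fault on a THP-backed, non-stack private anonymous mapping:
	// huge pages plus recycling.
	fmt.Printf("%+v\n", optsForFault(true, false, true))
	// Stack mapping: no huge pages; no recycling unless every page is committed.
	fmt.Printf("%+v\n", optsForFault(true, true, false))
}
```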

Cleanup:

  • Remove MemoryFile.usageSwapped; the UpdateUsage() optimization it enabled,
    described in updateUsageLocked(), was based on the condition that
    MemoryFile.mu would be locked throughout the call to updateUsageLocked(),
    which was invalidated by cl/337865250.

  • Remove MemoryFileOpts.ManualZeroing, which is unused.

  • Rename "reclaiming" to "releasing"; the former is confusing since "reclaim"
    in Linux has a significantly different meaning (essentially "eviction" in
    pgalloc), and the latter seems to be conventional in user-mode memory
    allocators.

Using THP for application memory requires setting
/sys/kernel/mm/transparent_hugepage/shmem_enabled to "advise", in order to
allow runsc to request THP from the kernel.
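
As a concrete illustration of that requirement, here is a small, hypothetical
pre-flight check; the sysfs path and the "advise" requirement come from this
description, while the check itself is not part of runsc and simply reads the
bracketed active value that the kernel reports in that file.

```go
// Hypothetical host check (not part of runsc): verifies the shmem THP setting
// that per-memfd MADV_HUGEPAGE use depends on.
package main

import (
	"fmt"
	"os"
	"strings"
)

const shmemEnabledPath = "/sys/kernel/mm/transparent_hugepage/shmem_enabled"

func main() {
	data, err := os.ReadFile(shmemEnabledPath)
	if err != nil {
		fmt.Fprintf(os.Stderr, "cannot read %s: %v\n", shmemEnabledPath, err)
		os.Exit(1)
	}
	// The kernel reports something like "always within_size advise [never]";
	// the bracketed token is the active setting.
	active := "unknown"
	for _, tok := range strings.Fields(string(data)) {
		if strings.HasPrefix(tok, "[") && strings.HasSuffix(tok, "]") {
			active = strings.Trim(tok, "[]")
		}
	}
	switch active {
	case "advise", "always", "force":
		// "advise" is sufficient: it lets runsc opt individual memfd-backed
		// chunks into THP via MADV_HUGEPAGE without forcing THP on all shmem.
		fmt.Printf("shmem_enabled=%q: application THP is available\n", active)
	default:
		fmt.Printf("shmem_enabled=%q: set it to \"advise\" to enable application THP\n", active)
	}
}
```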

After this CL, pgalloc.MemoryFile still releases memory to the host as fast as
possible, limiting the effectiveness of page recycling. A following CL adds
optional memory release throttling to improve this.

Performance outcomes vary by workload and platform. (In all of the below,
"baseline" is without this CL, "expt" is with this CL, and "expt2" is with this
CL + reclaim throttling (cl/575046398).)

For systrap in GKE: As noted, this change is required to enable application THP
without forcing it on all host shmem users. In conjunction with recycling
(which has a relatively small effect on systrap since it does not use hardware
virtualization), THP use slightly improves performance, although whether this
is measurable is case-dependent. On an idle VM, with shmem_enabled = "advise":

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU @ 2.80GHz
                                                │  baseline  │               expt                │               expt2               │
                                                │   sec/op   │   sec/op    vs base               │   sec/op    vs base               │
BuildABSL/page_cache.clean/filesystem.bindfs-16   39.09 ± 4%   38.84 ± 5%       ~ (p=0.947 n=30)   38.84 ± 3%       ~ (p=0.854 n=30)
BuildABSL/page_cache.dirty/filesystem.bindfs-16   37.83 ± 3%   36.58 ± 4%       ~ (p=0.057 n=30)   36.83 ± 5%       ~ (p=0.314 n=30)
BuildABSL/page_cache.clean/filesystem.tmpfs-16    39.34 ± 3%   38.59 ± 4%       ~ (p=0.350 n=30)   38.58 ± 4%       ~ (p=0.300 n=30)
BuildABSL/page_cache.dirty/filesystem.tmpfs-16    37.83 ± 3%   36.08 ± 4%  -4.64% (p=0.026 n=30)   36.58 ± 4%       ~ (p=0.123 n=30)
BuildABSL/page_cache.clean/filesystem.rootfs-16   39.59 ± 4%   38.83 ± 3%       ~ (p=0.485 n=30)   40.09 ± 5%       ~ (p=0.971 n=30)
BuildABSL/page_cache.dirty/filesystem.rootfs-16   36.83 ± 3%   38.08 ± 5%       ~ (p=0.307 n=30)   38.08 ± 1%       ~ (p=0.242 n=30)
BuildABSL/page_cache.clean/filesystem.fusefs-16   38.34 ± 3%   37.59 ± 5%       ~ (p=0.752 n=30)   38.59 ± 3%       ~ (p=0.982 n=30)
BuildABSL/page_cache.dirty/filesystem.fusefs-16   37.58 ± 4%   38.08 ± 5%       ~ (p=0.708 n=30)   36.08 ± 6%       ~ (p=0.127 n=30)
BuildGRPC/page_cache.clean/filesystem.bindfs-16   212.7 ± 2%   211.0 ± 1%       ~ (p=0.138 n=30)   211.2 ± 1%       ~ (p=0.458 n=30)
BuildGRPC/page_cache.dirty/filesystem.bindfs-16   210.0 ± 1%   210.0 ± 1%       ~ (p=0.542 n=30)   209.7 ± 1%       ~ (p=0.665 n=30)
BuildGRPC/page_cache.clean/filesystem.rootfs-16   210.5 ± 1%   210.0 ± 1%       ~ (p=0.423 n=30)   210.0 ± 1%       ~ (p=0.142 n=30)
BuildGRPC/page_cache.dirty/filesystem.rootfs-16   210.2 ± 1%   209.0 ± 1%       ~ (p=0.219 n=30)   209.5 ± 1%       ~ (p=0.230 n=30)
geomean                                           67.62        66.97       -0.96%                  67.12       -0.74%
```

The KVM platform benefits significantly from reduced nested page faults due to
huge pages, and to a lesser extent due to recycling:

```
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
                                                │  baseline  │                 expt                  │                 expt2                 │
                                                │   sec/op   │   sec/op    vs base                   │   sec/op    vs base                   │
BuildABSL/page_cache.clean/filesystem.bindfs-12   43.11 ± 2%   39.35 ± 3%   -8.71% (p=0.000 n=20)      38.10 ± 4%  -11.63% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.bindfs-12   42.35 ± 3%   39.09 ± 4%   -7.69% (p=0.000 n=20+19)   39.09 ± 5%   -7.69% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.tmpfs-12    42.35 ± 3%   38.34 ± 5%   -9.46% (p=0.000 n=20)      38.59 ± 3%   -8.87% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.tmpfs-12    42.09 ± 1%   37.59 ± 4%  -10.70% (p=0.000 n=20)      38.09 ± 4%   -9.51% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.rootfs-12   42.85 ± 3%   38.84 ± 3%   -9.35% (p=0.000 n=20)      39.09 ± 3%   -8.77% (p=0.000 n=20+17)
BuildABSL/page_cache.dirty/filesystem.rootfs-12   41.85 ± 2%   39.59 ± 6%   -5.40% (p=0.000 n=20+19)   38.09 ± 3%   -9.00% (p=0.000 n=20+19)
BuildABSL/page_cache.clean/filesystem.fusefs-12   42.60 ± 2%   38.34 ± 2%  -10.00% (p=0.000 n=20)      39.59 ± 3%   -7.06% (p=0.000 n=20+19)
BuildABSL/page_cache.dirty/filesystem.fusefs-12   42.09 ± 4%   39.09 ± 3%   -7.13% (p=0.000 n=20)      38.09 ± 3%   -9.52% (p=0.000 n=20+19)
BuildGRPC/page_cache.clean/filesystem.bindfs-12   207.7 ± 1%   206.4 ± 0%   -0.60% (p=0.018 n=20)      205.9 ± 1%   -0.85% (p=0.001 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.bindfs-12   206.9 ± 1%   206.9 ± 1%        ~ (p=0.121 n=20)      204.4 ± 1%   -1.22% (p=0.004 n=20+19)
BuildGRPC/page_cache.clean/filesystem.rootfs-12   207.7 ± 1%   204.9 ± 1%   -1.33% (p=0.004 n=20)      203.9 ± 0%   -1.81% (p=0.000 n=20+19)
BuildGRPC/page_cache.dirty/filesystem.rootfs-12   206.9 ± 1%   204.9 ± 0%   -0.97% (p=0.004 n=20+19)   203.9 ± 0%   -1.45% (p=0.000 n=20+19)
geomean                                           71.97        67.63        -6.03%                     67.28        -6.52%
```

@copybara-service copybara-service bot added the exported Issue was exported automatically label Jun 8, 2023
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 2 times, most recently from 51eff5f to 8650169 Compare June 14, 2023 18:23
@github-actions

A friendly reminder that this PR had no activity for 120 days.

@github-actions github-actions bot added the stale-pr This PR has not been updated in 120 days. label Oct 13, 2023
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 2 times, most recently from 72dd2f0 to 16ffda3 Compare October 20, 2023 23:15
@github-actions github-actions bot removed the stale-pr This PR has not been updated in 120 days. label Oct 21, 2023
@copybara-service copybara-service bot changed the title from "Improve pgalloc huge page awareness." to "Improve pgalloc huge page handling." Nov 17, 2023
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 4 times, most recently from 520b549 to 57d077e Compare November 20, 2023 19:49
@copybara-service copybara-service bot changed the title from "Improve pgalloc huge page handling." to "Add explicit huge page and memory recycling support to pgalloc.MemoryFile." Nov 20, 2023

A friendly reminder that this PR had no activity for 120 days.

@github-actions github-actions bot added the stale-pr This PR has not been updated in 120 days. label Mar 20, 2024
@github-actions github-actions bot removed the stale-pr This PR has not been updated in 120 days. label May 8, 2024
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 3 times, most recently from f7c3a2c to 570aee4 Compare May 9, 2024 04:54
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 2 times, most recently from c33ac33 to bb7de55 Compare June 18, 2024 18:35
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 5 times, most recently from 823e038 to 6d008f5 Compare June 24, 2024 16:55
@copybara-service copybara-service bot force-pushed the test/cl481741148 branch 4 times, most recently from 1021344 to 482cf48 Compare June 28, 2024 19:44
Add explicit huge page and memory recycling support to pgalloc.MemoryFile.

PiperOrigin-RevId: 647771821
@copybara-service copybara-service bot closed this Jun 28, 2024
@copybara-service copybara-service bot merged commit a557331 into master Jun 28, 2024
1 of 3 checks passed
@copybara-service copybara-service bot deleted the test/cl481741148 branch June 28, 2024 19:56