Use scatter-gather lists for ARC buffers #75

Closed
behlendorf opened this issue Nov 5, 2010 · 3 comments
Labels: Component: Memory Management (kernel memory management)

Comments

@behlendorf
Contributor

This is a big change, but we really need to consider updating the ZFS code to use scatter-gather lists for the ARC buffers instead of vmalloc'ed memory (see the sketch after the list below). Using a vmalloc'ed buffer is the way it's done on OpenSolaris, but it's less problematic there because that kernel has a more full-featured virtual memory management system. By design, the Linux kernel's VM is kept primitive for performance reasons. The only reason things are working reasonably well today is that I've implemented a fairly decent virtual slab in the SPL. This works, but it goes against the grain of how things should be done, and it causes some problems, such as:

  1. Deadlocks. Because of the way the zio pipeline is designed in ZFS, we must be careful to avoid triggering the synchronous memory reclaim path. If one of the zio threads does enter reclaim, it may deadlock on itself by trying to flush dirty pages from, say, a zvol. This is avoided in most instances by clearing GFP_FS, but we can't clear this flag for vmalloc() calls. Unfortunately, we may be forced to vmalloc() a new slab in the zio pipeline for certain workloads such as compression, and thus we risk deadlocking. Moving to scatter-gather lists would allow us to eliminate this __vmalloc() and the potential deadlock.

  2. Avoid serializing on the single Linux VM lock. Because the Linux VM is designed to be lightly used, all changes to the virtual address space are serialized through a single lock. The SPL slab goes to some effort to minimize this impact by allocating slabs of objects, but clearly there are scaling concerns here.

  3. VM overhead. In addition to the lock contention, there is overhead involved in locating suitable virtual addresses and setting up the mappings from virtual to physical pages. For a CPU-hungry filesystem, any overhead we can eliminate is worthwhile.

  4. 32-bit arch support. The biggest issue with supporting 32-bit arches is that they have a very small kernel virtual address range, usually only hundreds of MB. By moving all ARC data buffers to scatter-gather lists, we avoid having to use this limited address range. Instead, all data pages can simply reside in the standard address range, just like with all other Linux filesystems.

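To make the idea concrete, here is a minimal sketch in kernel-style C of how an ARC data buffer could be built from discrete pages chained on a scatter-gather table. This is illustrative only, not the eventual implementation; the helper name arc_buf_alloc_sg and the use of a plain sg_table are assumptions for the example. The point is that every allocation honors GFP_NOFS and no kernel virtual address space is consumed:

```c
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * Build a 'size' byte data buffer from individual pages on a
 * scatter-gather table.  Unlike __vmalloc(), every allocation here
 * can honor GFP_NOFS, so a zio thread cannot recurse back into the
 * filesystem during reclaim, and no virtual mapping is created.
 */
static int
arc_buf_alloc_sg(struct sg_table *sgt, size_t size)
{
	unsigned int i, nr_pages = DIV_ROUND_UP(size, PAGE_SIZE);
	struct scatterlist *sg;
	int error;

	error = sg_alloc_table(sgt, nr_pages, GFP_NOFS);
	if (error)
		return (error);

	for_each_sg(sgt->sgl, sg, nr_pages, i) {
		struct page *page = alloc_page(GFP_NOFS | __GFP_ZERO);

		if (page == NULL)
			goto out_free;

		sg_set_page(sg, page, PAGE_SIZE, 0);
	}

	return (0);

out_free:
	/* Unset entries were zeroed by sg_alloc_table(), so skip them. */
	for_each_sg(sgt->sgl, sg, nr_pages, i) {
		if (sg_page(sg))
			__free_page(sg_page(sg));
	}
	sg_free_table(sgt);

	return (-ENOMEM);
}
```

Code paths that need a contiguous view of the data (compression, for example) could still map the pages temporarily, but the common case would operate on the page list directly.
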
behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 5, 2011
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized, these defaults
have proven to be too large, which can lead to stability issues.

To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB, whichever is larger.
The rationale for this is as follows:

* Desktop Systems (less than 8GB of memory)

  Limiting the ARC to 1/2 of memory is desirable for desktop
  systems which have highly dynamic memory requirements.  For
  example, launching your web browser can suddenly result in a
  demand for several gigabytes of memory.  This memory must be
  reclaimed from the ARC cache which can take some time.  The
  user will experience this reclaim time as a sluggish system
  with poor interactive performance.  Thus in this case it is
  preferable to leave the memory as free and available for
  immediate use.

* Server Systems (more than 8GB of memory)

  Using all but 4GB of memory for the ARC is preferable for
  server systems.  These systems often run with minimal user
  interaction and have long running daemons with relatively
  stable memory demands.  These systems will benefit most by
  having as much data cached in memory as possible.

These values should work well for most configurations.  However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC.  This can still be accomplished
by setting the 'zfs_arc_max' module option.
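
As a rough sketch of the arithmetic (plain C, not the actual ZFS code;
the names allmem and default_arc_max are only illustrative), the new
default amounts to taking the larger of half of memory and all but 4GB:

```c
#include <stdint.h>

/* Default ARC limit in bytes: max(allmem / 2, allmem - 4GB). */
static uint64_t
default_arc_max(uint64_t allmem)
{
	const uint64_t four_gb = (uint64_t)4 << 30;

	if (allmem >= 2 * four_gb)	/* 8GB or more: all but 4GB */
		return (allmem - four_gb);

	return (allmem / 2);		/* under 8GB: half of memory */
}
```

For example, capping the ARC at 8GB can be done by placing
'options zfs zfs_arc_max=8589934592' in /etc/modprobe.d/zfs.conf
(the value is given in bytes).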

Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation.  Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory.  How much more memory gets
consumed will be determined by how badly fragmented the slabs are.

In the long term this can be mitigated by slab defragmentation code,
which was the OpenSolaris solution.  Or, preferably, the ARC could be
backed by the page cache under Linux, which would be even better.  See
issue openzfs#75 for the benefits of more tightly integrating with the
page cache.

This change also fixes an issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory.  The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken.  This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.
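
A standalone demonstration of the kind of overflow being described
(illustrative only, not the literal ZFS code): without the cast the
4GB constant wraps in 32 bits, so the comparison picks the wrong
branch on a small-memory VM.

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t allmem = (uint64_t)1536 << 20;	/* e.g. a VM with 1.5GB */

	/*
	 * Shifted in 32 bits the constant wraps to 0, the ">= 4GB"
	 * test is trivially true, and the "all but 4GB" branch is
	 * taken even on this tiny machine.
	 */
	printf("32-bit constant: %s\n",
	    allmem >= (4u << 30) ? "wrong branch" : "right branch");

	/* With an explicit cast the constant really is 4GB. */
	printf("64-bit constant: %s\n",
	    allmem >= ((uint64_t)4 << 30) ? "wrong branch" : "right branch");

	return (0);
}
```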

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#75
Rudd-O pushed a commit to Rudd-O/zfs that referenced this issue Feb 1, 2012
@behlendorf behlendorf modified the milestones: 0.7.0, 0.9.0 Oct 3, 2014
@behlendorf behlendorf added the Bug - Major label and removed the Type: Feature (feature request or new feature) label Oct 3, 2014
@kernelOfTruth
Contributor

related to #2129

@behlendorf behlendorf added the Component: Memory Management (kernel memory management) label Mar 25, 2016
@kernelOfTruth
Contributor

and #3441

@behlendorf
Contributor Author

Merged as:

7657def Introduce ARC Buffer Data (ABD)

ahrens pushed a commit to ahrens/zfs that referenced this issue Sep 17, 2019
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 14, 2022
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
rkojedzinszky pushed a commit to rkojedzinszky/zfs that referenced this issue Mar 7, 2023
Avoid duplicated Actions in TrueNAS ZFS CI

Signed-off-by: Umer Saleem <usaleem@ixsystems.com>