Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARC shrinking blocks reads/writes #10496

Merged
merged 1 commit into from Jun 26, 2020
Merged

Conversation

ahrens
Copy link
Member

@ahrens ahrens commented Jun 24, 2020

Motivation and Context

ZFS registers a memory hook, __arc_shrinker_func, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes arc_c via a call to arc_reduce_target_size().
Before commit 3ec34e5, the ARC shrinker would also evict data
from the ARC to bring arc_size down to the new arc_c. However, that
commit (seemingly inadvertently) made it so that the ARC shrinker no
longer evicts any data or waits for eviction to complete.

Repeated calls to the ARC shrinker can reduce arc_c drastically, often
all the way to arc_c_min. Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where arc_size
is more than arc_c for the several seconds/minutes it takes for
arc_adjust_zthr to evict data from the ARC. During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

Note: commit 3ec34e5 is OpenZFS 9284 - arc_reclaim_thread has 2 jobs
and was integrated in December 2018, and is part of ZoL 0.8.x but not 0.7.x.

Description

To ensure that arc_c doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5.

Additionally, when the ARC size is reduced drastically, the
arc_adjust_zthr can be on-CPU for many seconds without blocking. Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
arc_evict_state_impl() make a voluntary preemption call,
cond_resched().

Note, there are many other problems in this space, including the fact that there's no need for the ARC to shrink down to arc_c_min in response to small amounts of memory pressure, problems with how the ARC decides how large it should be, including other interactions with the kernel shrinker code, and the ARC's monitoring of memory pressure in arc_reap_zthr. I'm also working on those problems and will open a separate PR for those more complicated changes.

How Has This Been Tested?

The problem can be reproduced by creating memory pressure while the ARC is full. E.g. one thread reading a file that's larger than RAM in a loop, to fill up the ARC. Then create memory pressure, e.g. by writing to tmpfs. arcstat shows that arc_c immediately drops to the minimum, but arcsz takes ~3 minutes to reach the minimum, during which time there are no reads from disk (miss):

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
05:18:25     1     0      0     0    0     0    0     0    0   248G  179G
05:18:35    71     3      5     0    0     3  100     0    0   238G   81G
05:18:45     2     0      0     0    0     0    0     0    0   230G   35G
05:18:55    11     0      0     0    0     0    0     0    0   211G   14G
05:19:05     0     0      0     0    0     0    0     0    0   191G   14G
05:19:15     8     0      0     0    0     0    0     0    0   172G   14G
...
05:21:55     1     0      0     0    0     0    0     0    0    28G   14G
05:22:05     0     0      0     0    0     0    0     0    0    26G   14G
05:22:15     0     0      0     0    0     0    0     0    0    23G   14G
05:22:25     4     0      0     0    0     0    0     0    0    16G   14G
05:22:35  163K   81K     49    50    0   81K   99   150   11    14G   14G
05:22:45  233K  116K     50     0    0  116K  100   172   45    15G   15G
05:22:55  239K  119K     49     0    0  119K   99   102   30    18G   18G
05:23:05  244K  122K     49     0    0  122K   99   102   28    21G   21G

With the proposed changes we see that arc_c drops more slowly, in lockstep with arcsz. While they are decreasing, we stilll process ~25% of the reads (miss) that we do normally. (This system is configured differently than the one in the example above, so it "only" takes 30 seconds to evict the whole ARC, but this change doesn't significantly impact how long the eviction takes.)

    time  miss  arcsz     c
19:02:59   38K   114G  114G
19:03:00   38K   114G  114G
19:03:01   38K   115G  114G
19:03:02   38K   114G  114G
19:03:03   37K   115G  114G
19:03:04   35K   115G  114G
19:03:05   21K   113G  112G
19:03:06  7.5K   110G  110G
19:03:07 10.0K   107G  107G
19:03:08   11K   104G  104G
19:03:09   10K   101G  101G
19:03:10   11K    98G   98G
19:03:11   10K    95G   95G
19:03:12  9.2K    92G   91G
19:03:13  9.4K    89G   88G
19:03:14  8.9K    86G   85G
19:03:15  8.2K    82G   82G
19:03:16  8.2K    79G   79G
...
19:03:32  6.3K    27G   25G
19:03:33  5.8K    24G   22G
19:03:34  5.8K    20G   19G
19:03:35  5.8K    18G   15G
19:03:36  6.3K    14G   12G
19:03:37  6.5K    10G  8.8G
19:03:38  7.5K   8.0G  5.4G
19:03:39   10K   4.7G  3.9G
19:03:40   36K   3.9G  3.9G
19:03:41   45K   3.9G  3.9G

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@ahrens ahrens added Status: Code Review Needed Ready for review and testing Type: Performance Performance improvement or performance problem labels Jun 24, 2020
Copy link
Member

@prakashsurya prakashsurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

module/zfs/arc.c Show resolved Hide resolved
@algragon
Copy link

There is a typo (missing 'r') in the PR-title and the commit message: shinking -> shrinking.

@codecov
Copy link

codecov bot commented Jun 25, 2020

Codecov Report

Merging #10496 into master will decrease coverage by 14.13%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #10496       +/-   ##
===========================================
- Coverage   79.41%   65.28%   -14.14%     
===========================================
  Files         391      309       -82     
  Lines      123830   106545    -17285     
===========================================
- Hits        98339    69555    -28784     
- Misses      25491    36990    +11499     
Flag Coverage Δ
#kernel ?
#user 65.28% <100.00%> (-0.12%) ⬇️
Impacted Files Coverage Δ
module/os/linux/zfs/arc_os.c 40.00% <ø> (-19.68%) ⬇️
module/zfs/arc.c 73.80% <100.00%> (-15.48%) ⬇️
module/zfs/objlist.c 0.00% <0.00%> (-100.00%) ⬇️
module/zfs/pathname.c 0.00% <0.00%> (-100.00%) ⬇️
include/sys/dmu_redact.h 0.00% <0.00%> (-100.00%) ⬇️
include/sys/dmu_traverse.h 0.00% <0.00%> (-100.00%) ⬇️
module/zfs/zfs_rlock.c 0.00% <0.00%> (-96.36%) ⬇️
module/lua/ltablib.c 2.34% <0.00%> (-95.32%) ⬇️
module/zfs/bqueue.c 0.00% <0.00%> (-94.45%) ⬇️
module/zcommon/zfs_deleg.c 0.00% <0.00%> (-92.46%) ⬇️
... and 235 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f21de68...a86edaa. Read the comment docs.

@ahrens ahrens changed the title ARC shinking blocks reads/writes ARC shrinking blocks reads/writes Jun 25, 2020
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5.

Note: commit 3ec34e5 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
@ahrens
Copy link
Member Author

ahrens commented Jun 25, 2020

@algragon Whoops, thanks for pointing that out, fixed now!

@ahrens ahrens added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 26, 2020
@behlendorf behlendorf merged commit 67c0f0d into openzfs:master Jun 26, 2020
@ahrens ahrens deleted the arc-minimal branch June 27, 2020 03:09
DeHackEd pushed a commit to DeHackEd/zfs that referenced this pull request Aug 19, 2020
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5.

Note: commit 3ec34e5 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes openzfs#10496
DeHackEd pushed a commit to DeHackEd/zfs that referenced this pull request Aug 19, 2020
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5.

Note: commit 3ec34e5 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes openzfs#10496
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5.

Note: commit 3ec34e5 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes openzfs#10496
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
ZFS registers a memory hook, `__arc_shrinker_func`, which is supposed to
allow the ARC to shrink when the kernel experiences memory pressure.
The ARC shrinker changes `arc_c` via a call to
`arc_reduce_target_size()`.  Before commit 3ec34e5, the ARC
shrinker would also evict data from the ARC to bring `arc_size` down to
the new `arc_c`.  However, that commit (seemingly inadvertently) made it
so that the ARC shrinker no longer evicts any data or waits for eviction
to complete.

Repeated calls to the ARC shrinker can reduce `arc_c` drastically, often
all the way to `arc_c_min`.  Since it doesn't wait for the actual
eviction of data from the ARC, this creates a situation where `arc_size`
is more than `arc_c` for the several seconds/minutes it takes for
`arc_adjust_zthr` to evict data from the ARC.  During this time,
arc_get_data_impl() will block, so ZFS can't process read/write requests
(e.g. from iSCSI, NFS, or read/write syscalls).

To ensure that `arc_c` doesn't shrink faster than the adjust thread can
keep up, this commit makes the ARC shrinker wait for the eviction to
complete, resulting in similar behavior to what we had before commit
3ec34e5.

Note: commit 3ec34e5 is `OpenZFS 9284 - arc_reclaim_thread
has 2 jobs` and was integrated in December 2018, and is part of ZoL
0.8.x but not 0.7.x.

Additionally, when the ARC size is reduced drastically, the
`arc_adjust_zthr` can be on-CPU for many seconds without blocking.  Any
threads that are bound to the same CPU that arc_adjust_zthr is running
on will not able to run for a long time.

To ensure that CPU-bound threads can make progress, this commit changes
`arc_evict_state_impl()` make a voluntary preemption call,
`cond_resched()`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-70703
Closes openzfs#10496
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested) Type: Performance Performance improvement or performance problem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants