
include scatter_chunk_waste in arc_size #10701

Merged

behlendorf merged 1 commit into openzfs:master from ahrens:arc_size on Aug 18, 2020

Conversation

@ahrens (Member) commented Aug 11, 2020

Motivation and Context

The ARC caches data in scatter ABDs, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up; it only accounts for the
size that it requested from the ABD subsystem.

Therefore, the ARC is effectively using more memory than it is aware of,
due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.

If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.

However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
very large. On a production system, with `arc_size` at a constant 50%
of memory, `scatter_chunk_waste` has been observed to be 10-30% of
memory.
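To make the arithmetic concrete, the following stand-alone C demo (not ZFS code; the 2560-byte compressed size is an illustrative assumption) computes the waste for a single cached block:

```c
#include <stdio.h>

int
main(void)
{
	/*
	 * Hypothetical example: an 8K volblock that compresses to
	 * 2560 bytes (five 512B sectors with ashift=9), cached by
	 * the ARC in 4K scatter ABD pages.
	 */
	const size_t page = 4096;
	const size_t psize = 2560;

	/* Scatter ABDs hold whole pages, so round up to a page. */
	size_t allocated = (psize + page - 1) / page * page;
	size_t waste = allocated - psize;

	printf("requested %zu, allocated %zu, waste %zu (%.1f%%)\n",
	    psize, allocated, waste, 100.0 * waste / allocated);
	return (0);
}
```

This prints `requested 2560, allocated 4096, waste 1536 (37.5%)`; with `ashift=12` the same block would have been written as a full 4K block, so there would be no round-up waste.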


Description

This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.
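As a rough sketch of the accounting this implies (the `ARC_SPACE_ABD_CHUNK_WASTE` type appears in the review diff below; the helper name and the reduction to a single call site are simplifications for illustration, not the actual `abd_os.c` code):

```c
/*
 * Sketch only: charge the scatter round-up waste to the ARC so that
 * arc_size reflects the memory actually held.  arc_space_consume()
 * and arc_space_return() are the existing ARC accounting entry
 * points; this helper and its placement are hypothetical.
 */
static void
abd_account_chunk_waste(size_t requested, boolean_t alloc)
{
	size_t waste = P2ROUNDUP(requested, PAGE_SIZE) - requested;

	if (alloc)
		arc_space_consume(waste, ARC_SPACE_ABD_CHUNK_WASTE);
	else
		arc_space_return(waste, ARC_SPACE_ABD_CHUNK_WASTE);
}
```

The diff excerpt later in the conversation shows the complementary decision: the waste is treated like data rather than metadata, so it is excluded from `arc_meta_used`.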

How Has This Been Tested?

Tested by setting `recordsize=2k` and observing `arcstat`. Without this commit, changing the workload from reading `recordsize>=4k` files to reading `recordsize=2k` files results in `arc_c`/`arc_size` shrinking, while free memory and the memory used by ABDs remain constant. With this commit, `arc_c`/`arc_size` remains constant.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@ahrens added the Status: Code Review Needed label Aug 11, 2020
@richardelling (Contributor) commented

imagine the hilarity when pagesize=2MiB

@codecov bot commented Aug 12, 2020

Codecov Report

Merging #10701 into master will increase coverage by 0.09%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master   #10701      +/-   ##
==========================================
+ Coverage   79.70%   79.80%   +0.09%     
==========================================
  Files         394      394              
  Lines      124649   124660      +11     
==========================================
+ Hits        99357    99481     +124     
+ Misses      25292    25179     -113     
Flag     Coverage Δ
#kernel  80.32% <100.00%> (-0.04%) ⬇️
#user    65.25% <86.66%> (-0.39%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
module/os/linux/zfs/abd_os.c 97.63% <100.00%> (+0.02%) ⬆️
module/zfs/arc.c 89.50% <100.00%> (+0.02%) ⬆️
module/zfs/vdev_indirect.c 73.33% <0.00%> (-12.00%) ⬇️
module/zfs/vdev_raidz.c 89.76% <0.00%> (-2.73%) ⬇️
module/zfs/vdev_removal.c 94.05% <0.00%> (-2.52%) ⬇️
module/zfs/vdev_indirect_mapping.c 97.10% <0.00%> (-1.94%) ⬇️
module/zfs/zil.c 91.01% <0.00%> (-1.06%) ⬇️
module/zfs/dmu_traverse.c 96.01% <0.00%> (-1.00%) ⬇️
lib/libzfs/libzfs_changelist.c 84.26% <0.00%> (-0.75%) ⬇️
module/zfs/dsl_destroy.c 92.99% <0.00%> (-0.72%) ⬇️
... and 49 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 53c9d1d...2a1128d.

@ahrens (Member Author) commented Aug 12, 2020

Added code for FreeBSD to also update `arc_size` based on its `abd_chunk_waste`.

@gdevenyi (Contributor) commented

How long has this been around? Does this explain why I have to set my arc_max on a 0.6.5.11 ZFS NAS system to < 50% of total RAM, otherwise ZFS will OOM the machine?

@ahrens (Member Author) commented Aug 12, 2020

@gdevenyi It's "always been this way" (at least since the ABD code was integrated, and I imagine there was a similar problem before that). But it's hard to say if your particular system was having troubles due to this issue or something else.

@gdevenyi (Contributor) commented

> But it's hard to say if your particular system was having troubles due to this issue or something else.

Sure, I understand.

> However, if using 512B sectors (`ashift=9`), and the (compressed) block
> size is small (e.g. `compression=on` with the default `volblocksize=8k`
> or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be
> very large.

This describes my system state, along with a large amount of unaccounted-for memory usage while arc_c is small. Looking forward to seeing this merged.

module/os/linux/zfs/abd_os.c (resolved)
```diff
 	}

-	if (type != ARC_SPACE_DATA)
+	if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)
 		aggsum_add(&arc_meta_used, space);
```
Member commented

Given that the ARC makes eviction decisions based on the metadata used, I think we should increment this if the waste is the result of a metadata ABD.

Member Author commented

I see, you'd like `arc_meta_min`/`arc_meta_limit` to apply to the chunk waste of metadata ABDs. I think we could do that based on `ABD_FLAG_META`.

I'm slightly concerned that `arc_evict_meta()` (& friends) would evict more than we intended: we'd be telling it to evict an amount that includes the chunk waste, but it doesn't know about the waste, so it will evict that amount of `abd_size` (i.e. not including the chunk waste).

Member commented

Yeah, that was what I was thinking, but as I've given it more thought I think it makes sense to just avoid adding more metadata-specific code paths (especially since, long-term, this could go away). I'm good with this as-is.

@behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Aug 18, 2020
@behlendorf behlendorf merged commit 85ec5cb into openzfs:master Aug 18, 2020
@ahrens ahrens deleted the arc_size branch August 18, 2020 03:42
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
include scatter_chunk_waste in arc_size

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10701
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
include scatter_chunk_waste in arc_size