include scatter_chunk_waste in arc_size #10701
Conversation
imagine the hilarity when pagesize=2MiB
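(Illustrative arithmetic, not a measured figure: with 2 MiB pages, even a 4 KiB compressed block would be rounded up to a full 2 MiB chunk, so roughly 2044 KiB, over 99% of the allocation, would show up as chunk waste.)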
Codecov Report
| | master | #10701 | +/- |
| --- | --- | --- | --- |
| Coverage | 79.70% | 79.80% | +0.09% |
| Files | 394 | 394 | |
| Lines | 124649 | 124660 | +11 |
| Hits | 99357 | 99481 | +124 |
| Misses | 25292 | 25179 | -113 |
Added code for FreeBSD to also update arc_size based on its abd_chunk_waste.
How long has this been around? Does this explain why I have to set arc_max on my 0.6.5.11 ZFS NAS system to less than 50% of total RAM, or else ZFS will OOM the machine?
@gdevenyi It's "always been this way" (at least since the ABD code was integrated, and I imagine there was a similar problem before that). But it's hard to say if your particular system was having trouble due to this issue or something else.
Sure, I understand.
This describes my system state, along with a large amount of unaccounted-for memory usage while arc_c is small. Looking forward to seeing this merged.
	}
-	if (type != ARC_SPACE_DATA)
+	if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)
		aggsum_add(&arc_meta_used, space);
Given that the ARC makes eviction decisions based on the metadata used, I think we should increment this if the waste is the result of a metadata ABD.
I see, you'd like arc_meta_min/arc_meta_limit to apply to the chunk waste of metadata ABDs. I think we could do that based on ABD_FLAG_META.
I'm slightly concerned that arc_evict_meta() (& friends) would evict more than we intended, since we'd be telling it to evict an amount that includes the chunk waste, but it doesn't know about that, so it will evict that amount of abd_size (i.e. not including the chunk waste).
Yeah, that was what I was thinking, but as I've given it more thought I think it makes sense to just avoid adding more metadata-specific code paths (especially since, long-term, this could go away). I'm good with this as-is.
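For reference, here is a minimal sketch of the alternative discussed above (also charging metadata chunk waste to `arc_meta_used`). It assumes the in-tree `arc_space_consume()` and `aggsum_add()` interfaces and a caller that already knows whether the ABD carries `ABD_FLAG_META`; the helper name is hypothetical, and this is not what was merged:

```c
/*
 * Hypothetical sketch, not the merged change: chunk waste always counts
 * toward the overall ARC size, and waste that belongs to a metadata ABD
 * (e.g. one flagged ABD_FLAG_META) would additionally be charged to
 * arc_meta_used so that arc_meta_limit/arc_meta_min account for it.
 */
static void
arc_charge_chunk_waste(uint64_t waste, boolean_t is_metadata)
{
	/* Reflected in arc_size via the new ARC_SPACE_ABD_CHUNK_WASTE type. */
	arc_space_consume(waste, ARC_SPACE_ABD_CHUNK_WASTE);

	/* The metadata-specific charge that was ultimately left out. */
	if (is_metadata)
		aggsum_add(&arc_meta_used, waste);
}
```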
The ARC caches data in scatter ABDs, which are collections of pages,
which are typically 4K. Therefore, the space used to cache each block
is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted
memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is
not aware of the memory used by this round-up; it only accounts for the
size that it requested from the ABD subsystem.

Therefore, the ARC is effectively using more memory than it is aware
of, due to the `scatter_chunk_waste`. This impacts observability, e.g.
`arcstat` will show that the ARC is using less memory than it
effectively is. It also impacts how the ARC responds to memory
pressure. As the amount of `scatter_chunk_waste` changes, it appears to
the ARC as memory pressure, so it needs to resize `arc_c`.

If the sector size (`1<<ashift`) is the same as the page size (or
larger), there won't be any waste. If the (compressed) block size is
relatively large compared to the page size, the amount of
`scatter_chunk_waste` will be small, so the problematic effects are
minimal.

However, if using 512B sectors (`ashift=9`), and the (compressed) block
size is small (e.g. `compression=on` with the default `volblocksize=8k`
or a decreased `recordsize`), the amount of `scatter_chunk_waste` can
be very large. On a production system, with `arc_size` at a constant
50% of memory, `scatter_chunk_waste` has been observed to be 10-30% of
memory.

This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new
`waste` field to `arcstat`. As a result, the ARC's memory usage is more
observable, and `arc_c` does not need to be adjusted as frequently.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#10701
Motivation and Context
The ARC caches data in scatter ABDs, which are collections of pages, which are typically 4K. Therefore, the space used to cache each block is rounded up to a multiple of 4K. The ABD subsystem tracks this wasted memory in the `scatter_chunk_waste` kstat. However, the ARC's `size` is not aware of the memory used by this round-up; it only accounts for the size that it requested from the ABD subsystem.

Therefore, the ARC is effectively using more memory than it is aware of, due to the `scatter_chunk_waste`. This impacts observability, e.g. `arcstat` will show that the ARC is using less memory than it effectively is. It also impacts how the ARC responds to memory pressure. As the amount of `scatter_chunk_waste` changes, it appears to the ARC as memory pressure, so it needs to resize `arc_c`.

If the sector size (`1<<ashift`) is the same as the page size (or larger), there won't be any waste. If the (compressed) block size is relatively large compared to the page size, the amount of `scatter_chunk_waste` will be small, so the problematic effects are minimal.

However, if using 512B sectors (`ashift=9`), and the (compressed) block size is small (e.g. `compression=on` with the default `volblocksize=8k` or a decreased `recordsize`), the amount of `scatter_chunk_waste` can be very large. On a production system, with `arc_size` at a constant 50% of memory, `scatter_chunk_waste` has been observed to be 10-30% of memory.
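To make the round-up arithmetic concrete, here is a small standalone example (illustrative only, not OpenZFS code; the block sizes are arbitrary) that computes the per-block chunk waste on 4K pages:

```c
/* Illustrative arithmetic only: per-block waste when a scatter
 * allocation is rounded up to whole 4K pages. */
#include <stdio.h>

/* Round size up to a multiple of align (align must be a power of two). */
static size_t
roundup_pow2(size_t size, size_t align)
{
	return ((size + align - 1) & ~(align - 1));
}

int
main(void)
{
	const size_t pagesize = 4096;
	/* Example (compressed) block sizes in bytes. */
	const size_t psizes[] = { 512, 2048, 2560, 6144 };

	for (size_t i = 0; i < sizeof (psizes) / sizeof (psizes[0]); i++) {
		size_t alloc = roundup_pow2(psizes[i], pagesize);
		size_t waste = alloc - psizes[i];
		printf("block %5zu B -> allocated %5zu B, waste %5zu B (%.0f%%)\n",
		    psizes[i], alloc, waste, 100.0 * waste / alloc);
	}
	return (0);
}
```

With 4K (or larger) sectors the requested sizes are already page multiples, so the waste column is zero, matching the `ashift` discussion above.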
Description
This commit adds `scatter_chunk_waste` to `arc_size`, and adds a new `waste` field to `arcstat`. As a result, the ARC's memory usage is more observable, and `arc_c` does not need to be adjusted as frequently.
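As a rough sketch of the accounting idea (the helper names are hypothetical and this is not the PR's exact diff; it assumes the in-tree `arc_space_consume()`/`arc_space_return()` interfaces), the round-up delta would be reported when a scatter ABD is allocated and returned when it is freed:

```c
/*
 * Rough sketch, not the PR's exact code: report the round-up delta to
 * the ARC under the new ARC_SPACE_ABD_CHUNK_WASTE type when a scatter
 * ABD is allocated, and give it back when the ABD is freed.
 */
static void
abd_waste_consume(size_t requested, size_t allocated)
{
	ASSERT3U(allocated, >=, requested);
	arc_space_consume(allocated - requested, ARC_SPACE_ABD_CHUNK_WASTE);
}

static void
abd_waste_return(size_t requested, size_t allocated)
{
	ASSERT3U(allocated, >=, requested);
	arc_space_return(allocated - requested, ARC_SPACE_ABD_CHUNK_WASTE);
}
```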
How Has This Been Tested?

Tested by setting `recordsize=2k` and observing `arcstat`. Without this commit, changing the workload from reading `recordsize>=4k` files to reading `recordsize=2k` files results in `arc_c`/`arc_size` shrinking, while free memory and the memory used by ABDs remain constant. With this commit, `arc_c`/`arc_size` remains constant.