Improve ARC behavior with metadata heavy workloads #1967
Conversation
A GPF may occur if an l2arc buffer is evicted before the write completes for that buffer. The l2arc_write_done() function unconditionally starts walking the list at the write head, so we must verify that l2arc_evict() has not removed it from the list. This long-standing issue was exposed by allowing arc_adjust() to evict entries from the l2arc due to memory pressure. Prior to this change buffers would only be evicted immediately in front of the write hand or during l2arc cleanup. In neither case could we wrap around and remove the write head as part of l2arc_evict().

PID: 3182 TASK: ffff88020df28080 CPU: 0 COMMAND: "z_null_int/0"
[exception RIP: l2arc_write_done+128]
 6 [ffff880210e97c58] zio_done at ffffffffa0597e0c [zfs]
 7 [ffff880210e97cd8] zio_done at ffffffffa05982a7 [zfs]
 8 [ffff880210e97d58] zio_done at ffffffffa05982a7 [zfs]
 9 [ffff880210e97dd8] zio_execute at ffffffffa05948a3 [zfs]
10 [ffff880210e97e18] taskq_thread at ffffffffa041d8d8 [spl]
11 [ffff880210e97ee8] kthread at ffffffff81090d26
12 [ffff880210e97f48] kernel_thread at ffffffff8100c14a

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
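As a rough illustration of the shape of this fix (a simplified, self-contained model, not the actual ZFS list API or l2arc code):

```c
#include <stdbool.h>
#include <stddef.h>

struct l2buf {
	struct l2buf	*prev;
	struct l2buf	*next;
};

/* A node unlinked by the eviction path has its link pointers cleared. */
static bool
link_active(const struct l2buf *b)
{
	return (b->prev != NULL || b->next != NULL);
}

/* Analogue of l2arc_write_done() walking the buffers behind the write head. */
static void
write_done(struct l2buf *write_head)
{
	if (!link_active(write_head)) {
		/* l2arc_evict() already removed the write head; bail out. */
		return;
	}
	for (struct l2buf *b = write_head->prev; b != NULL; b = b->prev) {
		/* ... mark this buffer's L2ARC write as complete ... */
	}
}
```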
Decrease the minimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB.

1) Large systems with over 1TB of memory are being deployed, and reserving 1/32 of this memory (32GB) as the minimum requirement is overkill.

2) Tiny systems like the Raspberry Pi may only have 256MB of memory, in which case 64MB is far too large.

The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure you may still set a larger value with zfs_arc_min.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
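A sketch of the sizing policy this describes (names, rounding, and the shape of arc_init() are simplified; zfs_arc_min is the existing module parameter, everything else is illustrative):

```c
#include <stdint.h>

#define ARC_C_MIN_FLOOR		(4ULL << 20)	/* new 4MB floor */

static uint64_t zfs_arc_min;			/* 0 means "not set" */

static uint64_t
pick_arc_c_min(void)
{
	/*
	 * Old policy: MAX(total_memory / 32, 64MB), which reserves 32GB
	 * on a 1TB machine and 64MB even on a 256MB Raspberry Pi.
	 * New policy: a small fixed floor, unless the administrator asks
	 * for more via zfs_arc_min.
	 */
	if (zfs_arc_min > ARC_C_MIN_FLOOR)
		return (zfs_arc_min);
	return (ARC_C_MIN_FLOOR);
}
```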
In an attempt to prevent arc_c from collapsing "too fast", the
arc_shrink() function was updated to take a "bytes" parameter by this
change:
commit 302f753
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Tue Mar 13 14:29:16 2012 -0700
Integrate ARC more tightly with Linux
Unfortunately, that change failed to make a similar change to the way
that arc_p was updated. So, there still exists the possibility for arc_p
to collapse to near 0 when the kernel starts calling the arc's shrinkers.
This change attempts to fix this, by decrementing arc_p by the "bytes"
parameter in the same way that arc_c is updated.
In addition, we now attempt to maintain a minimum value of arc_p,
similar to the way a minimum arc_p value is maintained in arc_adapt().
Signed-off-by: Prakash Surya <surya1@llnl.gov>
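A minimal sketch of the behavior described above (the variable names mirror the commit text, but the code is illustrative rather than the actual arc.c implementation):

```c
#include <stdint.h>

static uint64_t arc_c, arc_c_min, arc_p;

static void
arc_shrink_sketch(uint64_t bytes)
{
	if (arc_c <= arc_c_min)
		return;

	/* arc_c already shrinks by "bytes", clamped to its minimum. */
	arc_c = (arc_c > arc_c_min + bytes) ? arc_c - bytes : arc_c_min;

	/*
	 * The fix: decrement arc_p by the same "bytes" value, while
	 * keeping it above a small minimum (here an illustrative 1/4 of
	 * arc_c), so it cannot collapse to near zero under shrinker
	 * pressure.
	 */
	uint64_t arc_p_min = arc_c / 4;
	arc_p = (arc_p > arc_p_min + bytes) ? arc_p - bytes : arc_p_min;

	/* arc_p must always stay within the new arc_c. */
	if (arc_p > arc_c)
		arc_p = arc_c / 2;
}
```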
Currently there is no mechanism in place to reclaim or limit the number of L2ARC headers contained in RAM. This means that given a large enough L2ARC device, the headers can accumulate until they push out all other data, and can potentially cause an out-of-memory event on the system. While we'd rather not discard these headers, because doing so invalidates any blocks written to the L2ARC device, we also don't want these headers to push out all other (potentially more useful) data on the system and/or cause the system to hit an out-of-memory event.

As a first attempt to fix this issue, a new "l2arc_hdr_limit" module parameter has been introduced. This option allows an administrator to set a limit on the cumulative size that L2ARC headers can grow to on the system. By default this limit is 1/8th of the total ARC size. To do this an l2arc_evict_headers() function has been added. It will attempt to reclaim N bytes of l2arc headers by repeatedly advancing the write hand and evicting blocks from the l2arc.

In addition, since we're not pruning from the L2ARC devices based on "arc_meta_limit" or even "arc_c_max", these headers have been completely taken out of the calculation of "arc_meta_used" and "arc_size".

Signed-off-by: Prakash Surya <surya1@llnl.gov>
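A rough sketch of the limiting loop this describes. The real l2arc_evict_headers() works against per-device buffer lists and the write hand; l2arc_evict_some() below is an assumed helper standing in for that machinery:

```c
#include <stdint.h>

static uint64_t l2_hdr_size;		/* bytes of L2ARC headers in RAM */
static uint64_t l2arc_hdr_limit;	/* new module parameter (default: 1/8 of the ARC size) */

extern uint64_t l2arc_evict_some(uint64_t want);	/* returns bytes freed */

static void
l2arc_evict_headers_sketch(void)
{
	while (l2_hdr_size > l2arc_hdr_limit) {
		uint64_t freed = l2arc_evict_some(l2_hdr_size - l2arc_hdr_limit);
		if (freed == 0)
			break;		/* nothing evictable; don't spin */
		l2_hdr_size -= freed;
	}
}
```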
The meaning of the arc_meta_used and arc_size values is ambiguous at
best. Prior to this change, the best definitions I could come up with are
the following:
* "arc_meta_used" is the sum of bonus buffers, dmu_buf_impl_t's,
dnode_t's, arc headers, and arc buffers of type ARC_BUFC_METADATA
* "arc_size" is the sum of arc_meta_used and arc buffers of type
ARC_BUFC_DATA
With these two definitions in mind, their limits and associated logic for
eviction and arc tuning don't make a whole lot of sense; and here's why:
1. First off, in the arc_adapt thread, we call arc_adjust_meta when
arc_meta_used is greater than arc_meta_limit. The problem is that
we evict buffers of type ARC_BUFC_METADATA from the arc's mru
and mfu before we ever try to evict from the upper layers (via
the prune callbacks). This can potentially lead us to evicting
"hot" buffers from the mru and/or mfu while leaving "cold"
dnode_t's, dmu_buf_impl_t's, etc. in the cache (consuming
precious arc_meta_used space).
2. Even though arc headers consume space within arc_meta_used, the
ghost lists are never pruned based on arc_meta_limit. This can
lead to a *very* unfortunate situation where the arc_meta_used
space is consumed entirely by "useless" arc headers (any new arc
buffers of type ARC_BUFC_METADATA would almost immediately be
evicted and placed on its respective ghost list due to the
arc_adapt thread calling arc_adjust_meta)
3. The arc code appears to be littered with the following assumptions:
a) the size of the mfu is capped by arc_c - arc_p
b) the sum of the mru and mfu sizes is capped by arc_c
c) and finally, the sum of the mru and mfu sizes is arc_size
The problem is, (c) is blatantly untrue (see previous definition
above), which then invalidates the other assumptions as well.
Using the proper definition of arc_size, we can arrive at the
following truths:
d) the arc_size is capped by arc_c
e) the mru size is capped by arc_p
f) the sum of arcstat_[data,other,hdr]_size is arc_size
g) the sum of the mru and mfu sizes is arcstat_data_size
h) which leads to the sum of the mru and mfu sizes being
capped by arc_c - arcstat_other_size - arcstat_hdr_size
The unveiling of (h) means that, although unlikely, for certain
values of arc_c, arcstat_other_size, and arcstat_hdr_size, one
can completely starve the arc lists.
The more likely scenario is not that *both* lists are starved
simultaneously, but that *one* list is starved for a certain
time interval. For example, I've seen cases where the mru list
grows to arc_p, and the mfu is squashed to a negligible size
because its space (the incorrectly assumed arc_c - arc_p space)
is actually being consumed by arc header buffers, dnodes, etc.
As an attempt to remedy the situation, the meaning of the arc_meta_used
and arc_size values have been redefined as the following:
* "arc_meta_used" is the sum of bonus buffers, and dmu_buf_impl_t's
(i.e. arc_meta_used == arcstat_other_size)
* "arc_size" is the sum of all arc buffers and arc headers
(i.e. arc_size == arcstat_data_size + arcstat_hdr_size)
In addition, to directly address issues (1) and (2), we no longer prune
buffers of type ARC_BUFC_METADATA from the arc lists when over
arc_meta_limit. With the new definition of arc_meta_used, that functionality
doesn't make sense (buffers residing in the arc lists no longer factor
into arc_meta_used).
Now, when over arc_meta_limit, we only call into the upper layers via
the prune callbacks. Ideally we would also reap from the kmem caches
for the objects which account for arc_meta_used (e.g. dnode_t,
dmu_buf_impl_t, etc.), but none of them have associated shrinkers.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
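The redefinition above boils down to two accounting identities. The following is an illustrative self-check rather than code from the patch; the arcstat_* variables stand in for the corresponding kstat values:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t arcstat_data_size;	/* all arc buffers, data and metadata */
static uint64_t arcstat_hdr_size;	/* arc_buf_hdr_t allocations */
static uint64_t arcstat_other_size;	/* bonus buffers + dmu_buf_impl_t's */
static uint64_t arc_meta_used, arc_size;

static void
check_arc_accounting(void)
{
	/* "arc_meta_used" now only tracks the upper-layer metadata. */
	assert(arc_meta_used == arcstat_other_size);

	/* "arc_size" is every arc buffer plus every arc header. */
	assert(arc_size == arcstat_data_size + arcstat_hdr_size);
}
```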
For specific workloads consisting mainly of mfu data and new anon data
buffers, the aggressive growth of arc_p found in the arc_get_data_buf()
function can have detrimental effects on the mfu list size and ghost
list hit rate.
Running a workload consisting of two processes:
* Process 1 is creating many small files
* Process 2 is tar'ing a directory consisting of many small files
I've seen arc_p and the mru grow to their maximum size, while the mru
ghost list receives 100K times fewer hits than the mfu ghost list.
Ideally, as the mfu ghost list receives hits, arc_p should be driven
down and the size of the mfu should increase. Given the specific
workload I was testing with, the mfu list size should grow to a point
where almost no mfu ghost list hits would occur. Unfortunately, this
does not happen because the newly dirtied anon buffers constantly drive
arc_p to its maximum value and keep it there (effectively prioritizing
the mru list and starving the mfu list down to a negligible size).
The logic to increment arc_p from within the arc_get_data_buf() function
was introduced many years ago in this upstream commit:
commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc
Author: maybee <none@none>
Date: Wed Dec 20 15:46:12 2006 -0800
6505658 target MRU size (arc.p) needs to be adjusted more aggressively
and since I don't fully understand the motivation for the change, I am
reluctant to completely remove it.
As a way to test out how its removal might affect performance, I've
disabled that code by default, but left it tunable via a module option.
Thus, if its removal is found to be grossly detrimental for certain
workloads, it can be re-enabled on the fly, without a code change.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
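A sketch of how the arc_p bump in arc_get_data_buf() can be kept but disabled by default behind a module option, as the commit describes. The tunable name and surrounding code here are illustrative only:

```c
#include <stdint.h>

static int zfs_arc_p_aggressive_grow;	/* illustrative tunable, 0 = disabled */
static uint64_t arc_p, arc_c;

static void
maybe_grow_arc_p(uint64_t size)
{
	if (!zfs_arc_p_aggressive_grow)
		return;		/* default: let arc_adapt() drive arc_p */

	/*
	 * Legacy behavior from the 2006 change: newly dirtied anon/mru
	 * data aggressively pushes the MRU target up toward arc_c.
	 */
	arc_p = (arc_p + size < arc_c) ? arc_p + size : arc_c;
}
```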
I can't rule out that I've done something wrong, but running with this patch I find the ARC is completely failing to regulate its size.

# cat /proc/spl/kstat/zfs/arcstats
49 1 0x01 86 4128 448634175465 1434096363469
name type data
hits 4 2509299
misses 4 493325
demand_data_hits 4 0
demand_data_misses 4 0
demand_metadata_hits 4 2359610
demand_metadata_misses 4 281858
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 149689
prefetch_metadata_misses 4 211467
mru_hits 4 272077
mru_ghost_hits 4 12180
mfu_hits 4 2087533
mfu_ghost_hits 4 86614
deleted 4 448
recycle_miss 4 139179
mutex_miss 4 106
evict_skip 4 34701926
evict_l2_cached 4 1446111232
evict_l2_eligible 4 419158016
evict_l2_ineligible 4 521666560
hash_elements 4 394086
hash_elements_max 4 394086
hash_collisions 4 190221
hash_chains 4 116252
hash_chain_max 4 10
p 4 313441056
c 4 5000000000
c_min 4 4194304
c_max 4 5000000000
size 4 5772108096
hdr_size 4 138256704
data_size 4 5633851392
other_size 4 1761772296
anon_size 4 16384
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 2909063168
mru_evict_data 4 0
mru_evict_metadata 4 0
mru_ghost_size 4 510478848
mru_ghost_evict_data 4 0
mru_ghost_evict_metadata 4 510478848
mfu_size 4 2724771840
mfu_evict_data 4 0
mfu_evict_metadata 4 0
mfu_ghost_size 4 171085312
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 171085312
l2_hits 4 68774
l2_misses 4 420817
l2_feeds 4 782
l2_rw_clash 4 5
l2_read_bytes 4 364065792
l2_write_bytes 4 207549440
l2_writes_sent 4 555
l2_writes_done 4 555
l2_writes_error 4 0
l2_writes_hdr_miss 4 1
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 10
l2_abort_lowmem 4 0
l2_reclaim_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 545421312
l2_asize 4 207549440
l2_hdr_size 4 1542360
l2_hdr_limit 4 625000000
l2_compress_successes 4 34514
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 1761772296
arc_meta_limit 4 4500000000
arc_meta_max 4 1761773136

Workload is only multiple instances of a perl script which is effectively
OK, great. I have a hunch this is being caused by the bugged
Just to update, I've been letting it run for a while and it's started behaving better. Maybe it needed to slam into some reclaim first?

5 1 0x01 86 4128 6280349096 57277913791133
name type data
hits 4 63012692
misses 4 15231126
demand_data_hits 4 0
demand_data_misses 4 41890
demand_metadata_hits 4 60361704
demand_metadata_misses 4 10874230
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 2650988
prefetch_metadata_misses 4 4315006
mru_hits 4 5813091
mru_ghost_hits 4 1518388
mfu_hits 4 54792986
mfu_ghost_hits 4 5792847
deleted 4 8637043
recycle_miss 4 8210179
mutex_miss 4 14390
evict_skip 4 1295462694
evict_l2_cached 4 100718430208
evict_l2_eligible 4 114773699584
evict_l2_ineligible 4 23804513280
hash_elements 4 2462832
hash_elements_max 4 2568779
hash_collisions 4 7703893
hash_chains 4 261931
hash_chain_max 4 25
p 4 917587264
c 4 5000000000
c_min 4 4194304
c_max 4 5000000000
size 4 5000007408
hdr_size 4 197165808
data_size 4 4802841600
other_size 4 1727879024
anon_size 4 1147392
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 877194240
mru_evict_data 4 0
mru_evict_metadata 4 37143552
mru_ghost_size 4 4124495872
mru_ghost_evict_data 4 923648
mru_ghost_evict_metadata 4 4123572224
mfu_size 4 3924499968
mfu_evict_data 4 0
mfu_evict_metadata 4 1916374528
mfu_ghost_size 4 875671040
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 875671040
l2_hits 4 2742474
l2_misses 4 6246871
l2_feeds 4 11296
l2_rw_clash 4 180
l2_read_bytes 4 11603466752
l2_write_bytes 4 5159491072
l2_writes_sent 4 9611
l2_writes_done 4 9611
l2_writes_error 4 0
l2_writes_hdr_miss 4 40
l2_evict_lock_retry 4 1
l2_evict_reading 4 9
l2_free_on_write 4 15883
l2_abort_lowmem 4 122
l2_reclaim_lowmem 4 2004
l2_cksum_bad 4 3
l2_io_error 4 3
l2_size 4 36180289536
l2_asize 4 1794785191863296
l2_hdr_size 4 623479608
l2_hdr_limit 4 625000000
l2_compress_successes 4 2627936
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 257
memory_indirect_count 4 45682
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 35992
arc_meta_used 4 1727879024
arc_meta_limit 4 2000000000
arc_meta_max 4 3408673312

I also shrunk
arc_meta_used 4 2118991096
arc_meta_limit 4 2000000000
@prakashsurya Notice *_evict_data, *_evict_metadata in the first arcstats. Despite the mru and mfu being substantial, nothing is eligible for eviction; once some of those buffers become eligible in the second arcstats output, the size is correctly brought under control.
With these changes, I would not be surprised if
@DeHackEd If you're testing these changes out, any chance you can keep a log of the arcstat file as your workload is running? For example:
I have scripts to generate plots from logs in that format, which is useful for analyzing how the ARC subsystem is working over time. The ARC is very dynamic in nature, so a single snapshot in time is only helpful up to a point; it's much more beneficial to be able to visualize how the values change over time.
              capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------- -----  -----  -----  -----  -----  -----
whoopass1   5.32T  4.68T    486      0  1.09M      0
  sdc       5.32T  4.68T    486      0  1.09M      0
cache           -      -      -      -      -      -
  SSD-part1 16.0E  36.4G     47      8   163K   740K
----------- -----  -----  -----  -----  -----  -----

Surprise cache growth. The partition I gave it is only 16 GB. I've been playing with the parameters via /sys/module/zfs/parameters/* if that matters. And sure, I'll keep an arcstats log. Leave it running for an hour or two.
So far so good... (based on the original 6 commits only)
@DeHackEd Good to hear. The original 6 is all that I plan to try and get merged for now.
I ran some tests over the weekend, and this set of patches definitely helped one workload that I was expecting it to help. The workload was a single thread creating a lot of small files. I used fdtree to do this:
With that as the only thing stressing the filesystem, I ran "before" and "after" tests on two nodes of essentially the same hardware configuration (the pool config was different, but that doesn't make much of a difference in these tests). First, here's a graph from the "before" test (zfs-0.6.2-128-gdda12da) when running
As you can see in the top right graph, the hit rate never settles throughout the entirety of the test. Also, MRU and MFU sizes fall to a negligible size (
Now, running with this set of patches (zfs-0.6.1-255-ga226ae8), I see the following behavior:
The hit rate is much more stable, and is also much better than previously. In addition, the sizes for the MRU and MFU are much more in line with what I would expect.
I also ran a "before" and "after" test with a single thread continually untar-ing the kernel source tree, compiling it, and then removing the directory. Just looking at the ARC values, I don't see much of a difference between the two runs (each receive about a 100% hit rate). But in the interest of full disclosure, I'll post the graphs for these tests as well. Here's the workload: Graph of ARC values from "before" test (zfs-0.6.2-128-gdda12da): Graph of ARC values from "after" test (zfs-0.6.1-255-ga226ae8): Really, the only difference is |
Another workload that I tested and gathered some "before" and "after" data for is a
Using 4 machines, I was able to have a send-receive pair running master (zfs-0.6.2-128-gdda12da) and a send-receive pair running this branch (zfs-0.6.1-255-ga226ae8). Looking at the ARC graphs on each of the receivers, it looks like this set of patches helps substantially for this workload. After running for over an hour, it looks like the receiver running master stabilized at about a 40% hit rate:
Whereas the receiver running this branch (zfs-0.6.1-255-ga226ae8) stabilized at a significantly higher hit rate (looks to be around 95% or more):
This branch doesn't appear to help the senders much at all, as they appear to be more IOPS limited, but it doesn't appear to hurt either. Here is the graph of the sender running master (zfs-0.6.2-128-gdda12da):
And here is the graph of the sender running this branch (zfs-0.6.1-255-ga226ae8):
Although the hit rate of the senders doesn't appear to be affected much by this branch's changes, the graphs do look much "better" with these changes, IMO.
@kpande That's still an open question, really. In all of the testing I've done through the ZPL, these changes either help or don't really do anything. But I've only done some very targeted testing, so I welcome others to try them out on any workload they care about. I don't foresee these changes having any negative drawbacks, but I would not be surprised if I'm overlooking something (which is why I would really like some more people testing this out). Changes to the ARC are very subtle, and even seemingly "small" changes can have drastic performance implications. There are still a couple of patches that I intend to push into this branch, but they need a little more internal testing time before I'm fully confident in them. I was hoping I would be able to get them out to the community today, but unfortunately I don't think that'll happen (issues unrelated to the code).
This reverts commit c11a12b. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya <surya1@llnl.gov>
Work in progress. I will add a more detailed commit message soon. Signed-off-by: Prakash Surya <surya1@llnl.gov>
This reverts commit 8d6fe74.
This reverts commit 993f0aa.
Status report. This system has been running for the last ~10 days, so these last 2 commits don't apply currently. Still, I applied 5 of the 6 patches (due to the l2arc fields breaking, I left that patch out and just gave it a 20 GB partition to use with compression). I definitely see an improvement in system reliability. (I do have some of my own patches as well, but they are not of a nature that would affect this outcome)
When the arc is at its size limit and a new buffer is added, data will
be evicted (or recycled) from the arc to make room for this new buffer.
As far as I can tell, this is to try and keep the arc from overstepping
its bounds (i.e. keep it below the size limitation placed on it).
This makes sense conceptually, but there appears to be a subtle flaw in
its current implementation, resulting in metadata buffers being
throttled. When it evicts from the arc's lists, it also passes in a
"type" so as to remove a buffer of the same type that it is adding. The
problem with this is that once the size limit is hit, the ratio of
"metadata" to "data" contained in the arc essentially becomes fixed.
For example, consider the following scenario:
* the size of the arc is capped at 10G
* the meta_limit is capped at 4G
* 9G of the arc contains "data"
* 1G of the arc contains "metadata"
Now, every time a new "metadata" buffer is created and added to the arc,
an older "metadata" buffer(s) will be removed from the arc; preserving
the 9G "data" to 1G "metadata" ratio that was in-place when the size
limit was reached. This occurs even though the amount of "metadata" is
far below the "metadata" limit. This can result in the arc behaving
pathologically for certain workloads.
To fix this, the arc_get_data_buf function was modified to evict "data"
from the arc even when adding a "metadata" buffer (unless it's at the
"metadata" limit). In addition, arc_evict now more closely resembles
arc_evict_ghost; such that when evicting "data" from the arc, it may
make a second pass over the arc lists and evict "metadata" if it cannot
meet the eviction size the first time around.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
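A sketch of the eviction-type choice described above, as mentioned in the commit; this is a simplification with illustrative names, not the actual arc_get_data_buf()/arc_evict() code:

```c
#include <stdint.h>

typedef enum { BUFC_DATA, BUFC_METADATA } bufc_type_t;

static uint64_t arc_meta_used, arc_meta_limit;

static bufc_type_t
eviction_type_for(bufc_type_t new_buf_type)
{
	/*
	 * Old behavior: always evict the same type as the buffer being
	 * added, which freezes the data/metadata ratio at whatever it was
	 * when the ARC first hit its size limit.
	 *
	 * New behavior: when adding metadata and still under the metadata
	 * limit, evict data instead. (arc_evict() may additionally make a
	 * second pass over the other type if the first pass falls short.)
	 */
	if (new_buf_type == BUFC_METADATA && arc_meta_used < arc_meta_limit)
		return (BUFC_DATA);
	return (new_buf_type);
}
```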
Great to hear that!! I think the l2arc limiting patch still needs a closer inspection. I've done limited testing and it seems to be "better" than before (i.e. l2arc headers not causing OOMs), but the limit isn't being strictly enforced by any means. So for the time being, the best course of action is still to use a properly sized l2arc device, as you did. Those last two patches that were added were only really needed to address some issues when running using Lustre instead of the ZPL, so I would not be surprised if your workload doesn't strictly need them. With that said, they shouldn't hurt (and might improve) the system if you do decide to pull them in. I'm planning on adding a few more patches to this series, slightly tweaking the current implementation. If you're interested in a "pre-release" view of these changes, I'm using this branch as my temporary working branch as I test the new patches out:
Although, I make no promises as to whether this branch "works" or not; there's a good chance it'll be completely untested if/when you decide to peek at it. Once I feel confident in the changes I'll bring them over into this pull request.
Currently the "arc_meta_used" value contains all of the arc_buf_t's,
arc_buf_hdr_t's, dnode_t's, dmu_buf_impl_t's, arc "metadata" buffers,
etc. allocated on the system. The value is then used in conjunction with
"arc_meta_limit" to pro-actively throttle arc "metadata" buffers
contained in the arc's regular lists (i.e. non-ghost lists). There are a
couple of problems that can arise as a result:
1. As the "arc_meta_limit" is reached, new arc "metadata" buffers
that are added will force an eviction from either the arc's mru
list or the mfu list. Since *all* arc_buf_hdr_t's are contained
in "arc_meta_used", even the ghost list entries will take up
space in "arc_meta_used" and _may_ never be pruned. Given a
"metadata" intensive workload (e.g. rsync, mkdir, etc.), it is
possible for the ghost list headers to completely consume
"arc_meta_used" and leave no room for "useful" metadata;
causing atrocious arc hit rates (i.e. terrible performance).
2. Since *all* arc_buf_hdr_t's are accounted for in "arc_meta_used",
this means that even "data" buffers will take up a small amount
of space in "arc_meta_used". Given a significant amount of "data"
buffers (or a small enough "arc_meta_limit"), this can have a
noticeable impact on the amount of arc "metadata" that can be
cached. What makes it worse, even if we evict "metadata" from the
arc (as we do now when over "arc_meta_limit"), these "data" buffer
headers will never be touched.
This patch is intended to address problem (2) described above. Instead
of using "arc_meta_used" to contain *all* arc_buf_hdr_t's, arc_buf_t's,
etc., now it only contains those headers that accompany arc "metadata".
This way, if *all* of the arc's "metadata" buffers were evicted,
"arc_meta_used" will shrink to zero (even if "data" buffers are still
contained in the arc).
Signed-off-by: Prakash Surya <surya1@llnl.gov>
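An illustrative sketch of this accounting change: only headers and buffers that carry arc "metadata" are charged to arc_meta_used. The names below are stand-ins, not the actual arc.c symbols:

```c
#include <stdint.h>

typedef enum { BUFC_DATA, BUFC_METADATA } bufc_type_t;

static uint64_t arc_meta_used;

static void
account_alloc(bufc_type_t type, uint64_t bytes)
{
	/*
	 * Before: every arc_buf_hdr_t/arc_buf_t was charged here, so even
	 * "data" buffers consumed part of arc_meta_limit. After: only
	 * metadata allocations count toward the limit.
	 */
	if (type == BUFC_METADATA)
		arc_meta_used += bytes;
}

static void
account_free(bufc_type_t type, uint64_t bytes)
{
	if (type == BUFC_METADATA)
		arc_meta_used -= bytes;
}
```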
Using "arc_meta_used" to determine if the arc's mru list is over it's target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust. Signed-off-by: Prakash Surya <surya1@llnl.gov>
Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya <surya1@llnl.gov>
It's unclear why adjustments to arc_p need to be dampened as they are in arc_adjust. With that said, its removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm removing it until its usefulness is better understood. Signed-off-by: Prakash Surya <surya1@llnl.gov>
Work in progress, needs better commit message. To try and prevent arc headers from completely consuming "arc_meta_used", this commit changes arc_adjust_meta to more closely align with arc_adjust; "arc_meta_limit" is used similarly to "arc_c".
Here's a graph of the arc parameters, hit rate, and number of reads while running the above mentioned
Workload:
For comparison purposes, I've included identical graphs for the previous runs that I performed using the same
ZFS Version:
ZFS Version:
Since this issue has gotten some attention lately (indirectly), I just wanted to mention that I've been running with the first 6 patches (see below) on all my (mostly not-so-high load) systems without noticeable issue. I do think things are better for stability on the higher-load system as well. I hadn't been keeping up with the churn of the additional patches in the pull request. One catch: I explicitly left out the L2ARC patch because I do need the L2ARC capacity and prefer to control it by limiting the size of the device provided. +1 towards getting it merged.
Thanks for the update. Leaving out the L2ARC patch and properly sizing the device is the proper way to go about it at the moment, IMO. I should be back in the office this week, so I really want to finish cleaning up this pull request and get it into a mergeable state. The first 6 patches you mentioned have a few deficiencies (specific to certain workloads) that I want to address before it all lands; hence the follow-up patches.
This reverts commit c99fd2c.
Incidentally, I personally don't get anything out of any metrics I graph - it's far too noisy to work out what's going on. But "watch -d -n 0.1 cat /proc/spl/kstat/zfs/arcstats" does raise a question: why are some of those metrics changing so rapidly? mru_size/mfu_size, mru_evict_metadata, arc_meta_used, etc. fluctuate by 1% per second even with a write-only load, with or without primary and secondary cache set to metadata only, when every cache should be well and truly hot after months of uptime, and when arc_meta_used is well short of arc_meta_max, arc_meta_limit, and arc_max.
@spacelama I think the problem is that the meaning of the values exposed by the ARC isn't at all intuitive. Those values don't mean what you think they mean, and what they actually mean isn't documented anywhere. The result is that the graphs aren't too useful unless you're already very familiar with the area. It's certainly an area I think we could improve.
Please see individual commit messages for details on each change.