FreeBSD will also need the following: |
usr/src/uts/common/fs/zfs/arc.c
Outdated
| * Do the job for each type of headers (see b_lsize note below)
| */
| for (int l = 0; l < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); l++) {
| if(l < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
cstyle: insert space between if and (.
usr/src/uts/common/fs/zfs/arc.c
Outdated
| for (int l = 0; l < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); l++) {
| if(l < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
| continue;
| if(l == 2 && arc_evict_l2_only && total_evicted > 0 && bytes != ARC_EVICT_ALL)
cstyle: insert space between if and (.
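For reference, the cstyle-conforming form of the two conditions in the quoted hunk would look like this; the body of the second check is cut off in the diff, so it is left as a placeholder:

```c
/* cstyle wants a space between "if" and "(" */
if (l < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
        continue;
if (l == 2 && arc_evict_l2_only && total_evicted > 0 &&
    bytes != ARC_EVICT_ALL) {
        /* body not visible in the quoted hunk */
}
```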
| @@ -3161,6 +3180,8 @@ arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
| }
| kmem_free(markers, sizeof (*markers) * num_sublists);
It looks like you didn't indent this (and the 100 lines above) correctly, it should have one more tab. Or consider moving the code in the new loop to a subroutine.
usr/src/uts/common/fs/zfs/arc.c
Outdated
| @@ -3069,6 +3073,15 @@ arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
| num_sublists = multilist_get_num_sublists(ml);
|
| /*
| * Do the job for each type of headers (see b_lsize note below)
| */
| for (int l = 0; l < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); l++) {
l is easily misread as 1, a possible value of l. How about a different variable name?
usr/src/uts/common/fs/zfs/arc.c
Outdated
| @@ -1036,6 +1036,8 @@ uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
| boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
| boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
| boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
| boolean_t arc_evict_l2_first = B_FALSE; /* first evict buffers from ARC which are in L2ARC */
| boolean_t arc_evict_l2_only = B_FALSE; /* only evict buffers from ARC which are in L2ARC */
Does this mean that we don't evict the requested amount from the ARC if it would evict non-L2ARC bufs? That seems like a bad idea, as it could lead to running out of memory.
arc_evict_l2_first will first choose L2ARC bufs. If it does not find enough of them to reach the requested amount (total_evicted < bytes), it will then catch non-L2ARC bufs.
arc_evict_l2_only will only choose L2ARC bufs. As we could run out of memory in such a scenario, it will catch non-L2ARC bufs if it did not find any L2ARC bufs to evict (see line 3086, the total_evicted > 0 condition).
Of course arc_evict_l2_only feeds the L2ARC better than arc_evict_l2_first does.
The drawback of arc_evict_l2_only is that it could lead to the ZFS throttle waiting for some ARC free space, so anyone using it may have to tune l2arc_headroom and l2arc_feed_min_ms.
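A sketch of the pass and fallback logic described here, based on the quoted hunks; the body of the second check is not visible in the diff, so treating it as the point where eviction stops is an assumption:

```c
/*
 * Sketch only: three passes over the sublists.  Passes 0 and 1 are the
 * L2-preference passes; pass 2 is the fallback over remaining headers.
 */
for (int pass = 0; pass < 3 &&
    (total_evicted < bytes || bytes == ARC_EVICT_ALL); pass++) {
        /* Without either tunable, skip straight to the normal pass. */
        if (pass < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
                continue;
        /*
         * Safety valve for arc_evict_l2_only: only fall back to
         * non-L2-backed headers if the L2 passes evicted nothing.
         */
        if (pass == 2 && arc_evict_l2_only && total_evicted > 0 &&
            bytes != ARC_EVICT_ALL)
                break;  /* assumption: stop evicting here */
        /* ... evict from each sublist for this pass ... */
}
```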
usr/src/uts/common/fs/zfs/arc.c
Outdated
| @@ -1036,6 +1036,8 @@ uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
| boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
| boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
| boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
| boolean_t arc_evict_l2_first = B_FALSE; /* first evict buffers from ARC which are in L2ARC */
Why should we add this as off-by-default? Seems like it will be dead code with untested behavior. I would rather find an algorithm which doesn't degrade performance on any important use cases, which we could then enable by default.
I initially did not want to change ZFS behaviour, so I stayed as unintrusive as possible.
Based on my explanation below, we can safely enable arc_evict_l2_first, and I think arc_evict_l2_only should be left to those who know what they are doing.
usr/src/uts/common/fs/zfs/arc.c
Outdated
| */
| markers[i]->b_spa = 0;
| markers[i]->b_lsize = l;
This is a non-obvious way to change the behavior of arc_evict_state_impl. How about passing an additional argument to the func?
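A sketch of the suggested alternative, with a hypothetical extra parameter; the surrounding parameter list is abbreviated and is not necessarily the function's exact signature:

```c
/*
 * Hypothetical: pass the header class explicitly rather than encoding
 * it in marker->b_lsize.  The header_type parameter is illustrative.
 */
static uint64_t
arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,
    uint64_t spa, int64_t bytes, int header_type)
{
        /* ... select headers according to header_type ... */
        return (0);
}
```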
|
@benrubson if I remember correctly (from what I read in the past) only owners such as @ahrens can poke the zettabot |
|
Is it worth taking into account that some of the very hottest items in ARC can also be on L2ARC, and are thus liable to be regularly pushed into a high-latency L2 read by this? Maybe some of the L2-backed items in MFU should survive an L2-first eviction, since they're almost certainly of more value than not-yet-L2-backed (or even L2-ineligible) older items in MRU. |
|
@rottegift Since you won't need to read from L2ARC until the ARC is warm, the L2ARC feeds from the soon-to-be evicted list (multilist_sublist_tail) rather than the hot list (multilist_sublist_head) for all practical purposes. In other words, on busy systems, L2ARC is unlikely to contain the hot items in the ARC. This is ok. |
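For illustration, that feed-from-the-cold-end behaviour looks roughly like the following sketch (the real l2arc_write_buffers() carries much more state; variable names here are illustrative):

```c
/*
 * Sketch: walk an ARC eviction sublist from its tail (the coldest,
 * soon-to-be-evicted end) toward the head, which is effectively what
 * the L2ARC feed thread does when picking buffers to write out.
 */
multilist_sublist_t *mls = multilist_sublist_lock(ml, idx);
for (arc_buf_hdr_t *hdr = multilist_sublist_tail(mls); hdr != NULL;
    hdr = multilist_sublist_prev(mls, hdr)) {
        /* ... consider hdr for writing to the L2ARC device ... */
}
multilist_sublist_unlock(mls);
```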
|
@richardelling Sure, that was my thought too, but a block can become hot -- or hot again -- after it is on the L2, and thus faces L2-first eviction until it is no longer on the L2. With a large L2ARC, this can be a long-term periodic penalty for the unfortunate block that became hot belatedly; and consider that long-term could be almost permanent in the presence of persistent L2ARC. |
usr/src/uts/common/fs/zfs/arc.c
Outdated
| markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
| for (int i = 0; i < num_sublists; i++) {
| markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
| for (int h = 0; h < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); h++) {
I think this would be more clear if we used an enum for h, instead of magic numbers.
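Something along these lines, for instance; only L1_AND_L2 is known from a later revision of the patch, so the other names are placeholders:

```c
/*
 * Hypothetical names for the three eviction passes; only L1_AND_L2
 * appears in a later revision of this patch.
 */
typedef enum arc_evict_pass {
        L1_AND_L2,      /* headers that also have an L2ARC copy */
        L1_ONLY,        /* headers with no L2ARC copy */
        L1_ANY          /* fallback pass over any remaining header */
} arc_evict_pass_t;
```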
|
@benrubson I still have some concerns about this:
|
|
@ahrens here are answers to your questions:
Thank you very much 👍 |
|
@benrubson Have you at any point with your (test) workload ever experimented with restricting arc_evict_l2_first to L2-backed buffers in ARC_mru state? For workloads different from yours (frankly, yours is not typical), it doesn't seem so bad when an L2-backed ARC_mru buffer gets evicted before less recently used not-L2-backed ARC_mru buffers, at least in comparison to evicting any ARC_mfu buffer before less recently used ARC_mru buffers. The unluckiest victims under this eviction approach will be objects that are read periodically, just once each time, and with a period long enough that under the current mechanism the buffer would just barely survive downward arc adjustments between a pair of accesses, thus ascending to ARC_mfu state. Such a periodically-read object is likely to reach L2, and under eviction biased against L2-backed buffers, is likely to fail to reach ARC_mfu. I struggle to think of cases where that might matter. However, returning to the present PR, I think we risk turning an L2 into a performance degrader. Consider, for example, the case where you split the workload you describe across two pools on the same system, and then fault one pool's L2. How low can the occupancy of ARC by the non-faulted pool dwindle as it faces competition from the degraded pool? As I argued last time, even in many plausible single-pool cases a block may linger in MRU for a while before becoming hot. Consider a big make install, followed by putting at least parts of several of the freshly installed files into heavy use. There are several opportunities in that sequence for the data to have been written to L2 before the heavy use begins. I also worry about processes that walk the ZPL directory structure (e.g. locate.updatedb, or Mac OS X's spotlight, or various similar indexing systems), especially those that start early -- when l2 is still in turbo fill mode but before a user logs into the graphical environment. |
|
@rottegift, no, I did not try to restrict arc_evict_l2_first to ARC_mru buffers. Coming back to my workload: many processes update files concurrently. Regarding the risk of turning the L2 into a performance degrader, if L2 is faulted:
If this mru block was L2-backed and has been evicted from L1, it was certainly one of the least recently used blocks, as the L2 writing process and L1 eviction both start from the end of the queues.
When L2 is in turbo mode, L1 is not fully used yet, so eviction should / will not occur at this time. Of course this PR is not intended to suit all workloads. |
usr/src/uts/common/fs/zfs/arc.c
Outdated
| && !arc_evict_l2_first && !arc_evict_l2_only)
| continue;
| if (header_type == L1_AND_L2
| && arc_evict_l2_only && total_evicted > 0 && bytes != ARC_EVICT_ALL)
Here is the arc_evict_l2_only safety valve.
Just to be clear:
arc_evict_l2_first switches to the default (current) behaviour if not enough buffers (total_evicted < bytes) have been evicted, so no issue here.
That switch happens AFTER all the L2-backed buffers for the non-faulted pool have already been evicted, right? All of the evictable ones?
If this mru block was L2-backed and has been evicted from L1, this was certainly one of the least recently used blocks, as L2 writing process and L1 eviction start from the end of the queues.
To sharpen the question: what about the case where a block had been written to L2, evicted, then brought back in from L2? Does "certainly one of the least recently used blocks" apply to that buffer?
No, but at this time it will be at the beginning of the list, so it will not be re-evicted first. |
Yes, please confirm it. In this PR you've been coding in arc_evict_state_impl(). How do you get there? Isn't one possible path?
And if it has been displaced halfway down the list, but is now among the few buffers with an l2arc_buf_hdr? |
I was mistaken in my last comment, so yes, after the L2-backed buffers for the non-faulted pool have been evicted. Not necessarily all of them; it depends on
It then has a chance to be re-evicted? |
|
OK, so I've just modified the patch so that buffers from datasets without (or with a dead) L2 cache will also be evicted during eviction of preferred L2-backed buffers (see the sketch below). In detail, a buffer will be evicted if:
Otherwise, it will be evicted:
Many thanks 👍 |
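A rough sketch of the stated rule for the L2-preferred pass, using the l2arc_alive() helper added by the patch and the HDR_HAS_L2HDR() macro; the details elided from the comment above are not reproduced, and the local variable is hypothetical:

```c
/*
 * Sketch: during the L2-preferred pass a header qualifies if it already
 * has an L2ARC copy, or if its pool has no live L2ARC device (so it can
 * never gain one).  evictable_this_pass is a hypothetical local.
 */
boolean_t evictable_this_pass =
    HDR_HAS_L2HDR(hdr) || !l2arc_alive(hdr->b_spa);
```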
|
@zettabot go |
Overall, I am not comfortable making this semantic change without more extensive real-world testing. I would suggest that you get someone who uses L2ARC in production on general-purpose workloads (or a wide variety of workloads) to at least review this code, and better yet to test it.
Unfortunately I (and Delphix) do not use L2ARC so we aren't in a good position to test this. Maybe someone at Nexenta, Joyent, or iXsystems could help?
usr/src/uts/common/fs/zfs/arc.c
Outdated
| * false otherwise.
| */
| static boolean_t
| l2arc_alive(uint64_t spa)
Could you do this by walking the existing l2arc_dev_list? This mechanism seems quite complicated.
That said, I'd be concerned about the performance of this either way, it's in a pretty hot path.
I could, but for each buffer to evict, we would have to:
- browse the list of vdevs;
- and for each one of these vdevs, test if its spa matches the expected one, and test if it is non-faulted (vdev_is_dead()).
I think this could be pretty time consuming, which is why I preferred to have l2arc_dev_get_next() maintain a list of spas with alive L2 vdevs, as this function already performs all the tests we need, on a regular basis. We then just have to walk the list of spas with an alive L2 vdev (l2arc_spa_list) for each buffer to evict.
If you're concerned about the potential performance impact (which, as we already said, could occur, depending on the workload), we could make the
I would really be glad to help if needed, of course 👍 |
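For illustration, the list-based check might look something like this sketch; the node type, the lock, and the exact field names are assumptions rather than the patch's actual code:

```c
/* Hypothetical node kept on l2arc_spa_list by l2arc_dev_get_next(). */
typedef struct l2arc_spa_node {
        uint64_t        lsn_spa;        /* pool identifier (spa guid) */
        list_node_t     lsn_node;
} l2arc_spa_node_t;

/*
 * Return B_TRUE if the given pool currently has a healthy L2ARC device,
 * by walking the small list of such pools maintained by the feed thread.
 */
static boolean_t
l2arc_alive(uint64_t spa)
{
        l2arc_spa_node_t *node;
        boolean_t alive = B_FALSE;

        mutex_enter(&l2arc_spa_list_mtx);       /* assumed lock */
        for (node = list_head(&l2arc_spa_list); node != NULL;
            node = list_next(&l2arc_spa_list, node)) {
                if (node->lsn_spa == spa) {
                        alive = B_TRUE;
                        break;
                }
        }
        mutex_exit(&l2arc_spa_list_mtx);
        return (alive);
}
```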
|
We at Joyent don't use the L2ARC. It's never been worth the DRAM cost to us. I don't think we'll be able to help in this case. |
|
PR just rebased with master 👍 |
|
This looks more like a |
allowing buffers which are already in L2ARC to be evicted first, leading to some performance improvement (see https://www.illumos.org/issues/7361). This can be disabled / enabled by the new arc_evict_l2_first tunable. The new arc_evict_l2_only tunable goes a step further, only evicting buffers which are in L2ARC, with of course a safety valve in case nothing can be evicted.
|
Thank you @prakashsurya for having launched the test. |
|
@benrubson I don't think that failure is due to this change; rather, I think that failure is due to a prior commit: d28671a |
|
While I agree that the L2ARC logic needs some work, I'm not convinced that this is the right direction, and the change is too risky to take as-is. With sufficient testing, I would consider a simpler semantic change like "evict blocks in L2ARC first". That would probably require separating the lists into one for blocks in L2ARC (HDR_HAS_L2HDR) and one for blocks not in L2ARC. For now, I'm going to close this PR. |
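A minimal sketch of what that separation might look like, purely to illustrate the suggestion; the struct and field names are hypothetical and the real arc_state_t layout differs:

```c
/*
 * Hypothetical: per state, keep separate eviction multilists for headers
 * that already have an L2ARC copy (HDR_HAS_L2HDR) and for those that do
 * not, so "evict L2ARC-backed blocks first" means draining the first set
 * before touching the second.
 */
typedef struct arc_state_split {
        multilist_t     *ass_list_l2[ARC_BUFC_NUMTYPES];        /* in L2ARC */
        multilist_t     *ass_list_nol2[ARC_BUFC_NUMTYPES];      /* not in L2ARC */
} arc_state_split_t;
```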
This is exactly what this change does with arc_evict_l2_first.
(...) So there is no risk at all compared to the current behaviour, as it is the current behaviour plus a preference in which buffers to choose. The step above
In production on 3 servers for a year now with: This change gives a very nice and constant performance improvement, as shown in the bug report. In addition it is quite small in size. In any case, thank you very much to all reviewers here 👍 |
|
I have a large ZFS backup server hosting millions of files which are transferred to it via rsync. Most of the runtime is taken by rsync comparing files by timestamp. Metadata is being re-read again and again on every backup (there is only a little incremental transfer). That puts unnecessary stress on the disks and is so much slower, so please consider adding an optimization param like this which helps keep metadata completely in L2ARC, as it's too big to fit in RAM. Anyhow, I'm curious why the metadata in RAM takes that much space at all. How much metadata is kept in RAM for each file? It seems it's much more than I would guess. I would guess it's <<1kb/file, but from what I observe it's more... |
The fileserver I was talking about has millions of files (numbers still growing) which are being crawled every night. I'm not even able to throw RAM at it so that all metadata is held in RAM; the metadata size in ARC is already so large that I could not even put the appropriate amount of RAM into the machine. I don't understand why there is a 2-tier caching algorithm with even a "metadata" tunable in ZFS when we cannot make use of it. It's a little bit absurd. I don't think that "caching metadata in L2ARC when you have lots of it" is such an exotic demand... And, to make things even more interesting, mind the significant SSD price drop within the last year: https://www.guru3d.com/news-story/ssd-prices-are-dropping-significantly.html I still give my strong vote for Ben Rubson's patch and would be happy to help with testing. |
Hello,
This patch modifies ZFS so that when it has to evict buffers from ARC, it first chooses buffers which are already in L2ARC.
See https://www.illumos.org/issues/7361 for details.
Thank you 👍
Ben