This repository was archived by the owner on Nov 7, 2019. It is now read-only.

7361 Improve L2ARC filling #189

Closed
wants to merge 1 commit into from

Conversation

benrubson

@benrubson benrubson commented Sep 7, 2016

Hello,

This patch modifies ZFS so that when it has to evict buffers from ARC, it first chooses buffers which are already in L2ARC.

See https://www.illumos.org/issues/7361 for details.

Thank you 👍

Ben

@benrubson
Author

FreeBSD will also need the following:

SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_evict_l2_first, CTLFLAG_RW,
    &arc_evict_l2_first, 0, "first evict buffers from ARC which are in L2ARC");
SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_evict_l2_only, CTLFLAG_RW,
    &arc_evict_l2_only, 0, "only evict buffers from ARC which are in L2ARC");

* Do the job for each type of headers (see b_lsize note below)
*/
for (int l = 0; l < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); l++) {
if(l < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
Member

@ahrens ahrens Sep 24, 2016


cstyle: insert space between if and (.

for (int l = 0; l < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); l++) {
if(l < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
continue;
if(l == 2 && arc_evict_l2_only && total_evicted > 0 && bytes != ARC_EVICT_ALL)
Member


cstyle: insert space between if and (.

@@ -3161,6 +3180,8 @@ arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
}
kmem_free(markers, sizeof (*markers) * num_sublists);
Member

@ahrens ahrens Sep 24, 2016


It looks like you didn't indent this (and the 100 lines above) correctly, it should have one more tab. Or consider moving the code in the new loop to a subroutine.

@@ -3069,6 +3073,15 @@ arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes,
num_sublists = multilist_get_num_sublists(ml);

/*
* Do the job for each type of headers (see b_lsize note below)
*/
for (int l = 0; l < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); l++) {
Member


l is easily misread for 1, a possible value of l. How about a different variable name?

@@ -1036,6 +1036,8 @@ uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
boolean_t arc_evict_l2_first = B_FALSE; /* first evict buffers from ARC which are in L2ARC */
boolean_t arc_evict_l2_only = B_FALSE; /* only evict buffers from ARC which are in L2ARC */
Member


Does this mean that we don't evict the requested amount from the ARC if it would evict non-L2ARC bufs? That seems like a bad idea, as it could lead to running out of memory.

Author

@benrubson benrubson Sep 25, 2016


arc_evict_l2_first will first choose L2ARC bufs. If it does not find enough of them to reach the requested amount (total_evicted < bytes), it will fall back to non-L2ARC bufs.

arc_evict_l2_only will only choose L2ARC bufs. As we could run out of memory in such a scenario, it will fall back to non-L2ARC bufs if it did not find any L2ARC bufs to evict (see line 3086, the total_evicted > 0 condition).

Of course arc_evict_l2_only feeds L2ARC better than arc_evict_l2_first.
The drawback of arc_evict_l2_only is that it could lead to ZFS throttling while waiting for some free ARC space, so users enabling it may have to tune l2arc_headroom and l2arc_feed_min_ms.
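To make the two tunables easier to follow, here is a simplified sketch of the pass-selection logic, reconstructed from the hunks quoted in this review. arc_evict_run_pass() is a hypothetical helper (the patch inlines these checks in arc_evict_state()); total_evicted, bytes and ARC_EVICT_ALL come from arc_evict_state() itself.

/*
 * Simplified sketch, not the exact patch.  Passes 0 and 1 only consider
 * headers already in L2ARC; pass 2 considers everything else.
 */
static boolean_t
arc_evict_run_pass(int pass, uint64_t total_evicted, int64_t bytes)
{
	/* Skip the L2-preferring passes entirely when neither tunable is set. */
	if (pass < 2 && !arc_evict_l2_first && !arc_evict_l2_only)
		return (B_FALSE);
	/*
	 * Safety valve for arc_evict_l2_only: skip the final "evict anything"
	 * pass as long as the earlier passes already evicted something.
	 */
	if (pass == 2 && arc_evict_l2_only && total_evicted > 0 &&
	    bytes != ARC_EVICT_ALL)
		return (B_FALSE);
	return (B_TRUE);
}

The caller would then wrap the existing per-sublist eviction in for (int pass = 0; pass < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); pass++), as in the quoted hunk.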

@@ -1036,6 +1036,8 @@ uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */
boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
boolean_t arc_evict_l2_first = B_FALSE; /* first evict buffers from ARC which are in L2ARC */
Member


Why should we add this as off-by-default? Seems like it will be dead code with untested behavior. I would rather find an algorithm which doesn't degrade performance on any important use cases, which we could then enable by default.

Author


I initially did not want to change the ZFS behaviour, staying as unintrusive as possible.
According to my explanation below, we can safely enable arc_evict_l2_first, and leave arc_evict_l2_only to those who know what they are doing.

*/
markers[i]->b_spa = 0;
markers[i]->b_lsize = l;
Member


This is a non-obvious way to change the behavior of arc_evict_state_impl. How about passing an additional argument to the func?

@kernelOfTruth

@benrubson if I remember correctly (from what I read in the past) only owners such as @ahrens can poke the zettabot

@rottegift

rottegift commented Sep 28, 2016

Is it worth taking into account that some of the very hottest items in ARC can also be on L2ARC, and are thus liable to be regularly pushed into a high-latency L2 read by this? Maybe some of the L2-backed items in MFU should survive an L2-first eviction, since they're almost certainly of more value than not-yet-L2-backed (or even L2-ineligible) older items in MRU.

@richardelling

@rottegift Since you won't need to read from L2ARC until the ARC is warm, the L2ARC feeds from the soon-to-be evicted list (multilist_sublist_tail) rather than the hot list (multilist_sublist_head) for all practical purposes. In other words, on busy systems, L2ARC is unlikely to contain the hot items in the ARC. This is ok.
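For readers less familiar with the feed path, here is a rough sketch of the behaviour described above: the feed thread walks a sublist from its tail, i.e. from the buffers closest to eviction. This is an assumption-laden simplification, not the actual l2arc_write_buffers(), which also handles locking, write-size limits and head-first scans while the ARC is still warming up; the exact size field also differs across code versions.

/*
 * Rough sketch only: walk one ARC sublist from its tail (coolest buffers,
 * closest to eviction) toward the head, collecting L2ARC-eligible headers
 * until `headroom' bytes have been scanned.
 */
static void
l2arc_scan_tail(multilist_sublist_t *mls, uint64_t guid, uint64_t headroom)
{
	uint64_t scanned = 0;

	for (arc_buf_hdr_t *hdr = multilist_sublist_tail(mls);
	    hdr != NULL && scanned < headroom;
	    hdr = multilist_sublist_prev(mls, hdr)) {
		scanned += hdr->b_lsize;	/* size field name is version-dependent */
		if (!l2arc_write_eligible(guid, hdr))
			continue;
		/* ... queue hdr for the next L2ARC write ... */
	}
}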

@rottegift

@richardelling Sure, that was my thought too, but a block can become hot -- or hot again -- after it is on the L2, and thus faces L2-first eviction until it is no longer on the L2. With a large L2ARC, this can be a long-term periodic penalty for the unfortunate block that became hot belatedly; and consider that long-term could be almost permanent in the presence of persistent L2ARC.

markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP);
for (int i = 0; i < num_sublists; i++) {
markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);
for (int h = 0; h < 3 && (total_evicted < bytes || bytes == ARC_EVICT_ALL); h++) {
Member


I think this would be more clear if we used an enum for h, instead of magic numbers.

@ahrens
Member

ahrens commented Oct 17, 2016

@benrubson I still have some concerns about this:

  1. The policy change is to preferentially evict data (from the ARC) that's also in the L2ARC. I understand that this policy improves performance in some use cases. I'm concerned that there may be other use cases where it hurts performance. How much experience do you have using this in production? What kind of workloads are you running?
  2. I don't understand the option to only evict data that's also in L2ARC. This seems like a flag which only serves to shoot oneself in the foot. I think it should be removed.

@benrubson
Author

benrubson commented Oct 18, 2016

@ahrens here are answers to your questions:

  1. I've been running this in production since I wrote the patch, for about 2 months. My ZFS pools receive updates to most of their files. In this type of workload, caching metadata is indeed important because it speeds up file browsing. The goal is then to have as much metadata as possible in L2 so that subsequent runs are faster. This patch helps achieve this, as per the results I give here for a metadata workload. Of course we could revert the default policy change with arc_evict_l2_first = B_FALSE;, letting users try to improve performance with this new feature by themselves, depending on their workload. Just let me know if you prefer to revert it.
  2. Without arc_evict_l2_only, ZFS can easily evict (meta)data which is not in L2, because if arc_evict_l2_first does not evict enough data (total_evicted < bytes), it reverts to the default (current) behaviour. arc_evict_l2_only helps prevent this, giving a better chance of having as much (meta)data as possible in L2. It may however throttle ZFS in some circumstances where ZFS would have to wait for some free ARC space. Depending on the use case, this may be acceptable. Additionally, l2arc_headroom and l2arc_feed_min_ms could be tuned to mitigate this. Fortunately, arc_evict_l2_only has a sort of safety valve which reverts to the default behaviour if no (meta)data has been evicted before. I therefore think we could keep this option.

Thank you very much 👍

@rottegift

rottegift commented Oct 19, 2016

@benrubson Have you at any point with your (test) workload ever experimented in restricting arc_evict_l2_first to L2-backed buffers in ARC_mru state ?

For workloads different from yours (frankly, yours is not typical), it doesn't seem so bad when an L2-backed ARC_mru buffer gets evicted before less recently used not-L2-backed ARC_mru buffers, at least in comparison to evicting any ARC_mfu buffer before less recently used ARC_mru buffers. The unluckiest victims under this eviction approach will be objects that are read periodically, just once each time, and with a period long enough that under the current mechanism the buffer would just barely survive downward arc adjustments between a pair of accesses, thus ascending to ARC_mfu state. Such a periodically-read object is likely to reach L2, and under eviction biased against L2-backed buffers, is likely to fail to reach ARC_mfu. I struggle to think of cases where that might matter.

However, returning to the present PR, I think we risk turning an L2 into a performance degrader.

Consider, for example, the case where you split the workload you describe across two pools on the same system, and then fault one pool's L2. How low can the occupancy of ARC by the non-faulted pool dwindle as it faces competition from the degraded pool?

As I argued last time, even in many plausible single-pool cases a block may linger in MRU for a while before becoming hot. Consider a big make install, followed by putting at least parts of several of the freshly installed files into heavy use. There are several opportunities in that sequence for the data to have been written to L2 before the heavy use begins.

I also worry about processes that walk the ZPL directory structure (e.g. locate.updatedb, or Mac OS X's spotlight, or various similar indexing systems), especially those that start early -- when l2 is still in turbo fill mode but before a user logs into the graphical environment.

@benrubson
Author

benrubson commented Oct 19, 2016

@rottegift, no I did not try to restrict arc_evict_l2_first to L2-backed buffers in ARC_mru state.
As my very first goal was to be sure every buffer would be L2-backed before being evicted, I did not think about this.
However depending on the mru/mfu ratio, this behaviour could have no effect at all (if we have almost no buffers in mru state).
arc_evict_l2_only could still enforce eviction from both states (mru and mfu), but this would make a huge gap between the 2 options.

Coming back to my workload: many processes update files concurrently.
With hundreds of processes, it's unlikely L2 would be useful with so much "uncontrolled" / random data, and why would we prefer one process's data over the others' (even if it is accessed more often)? I want every process to have the same QoS.
This is why I use L2 for metadata only, and make it big enough.
This patch makes sure that this metadata will be L2-backed, and not dropped by other processes' activity.

Regarding the risk of turning the L2 into a performance degrader, if L2 is faulted:

  • arc_evict_l2_first switches to the default (current) behaviour if not enough buffers (total_evicted < bytes) have been evicted, so no issue here.
  • arc_evict_l2_only will evict no buffers at all (at least after some time), so here its safety valve (total_evicted > 0) comes in: it will switch back to the default (current) behaviour and evict buffers which are only in L1, so no issue here.

As I argued last time, even in many plausible single-pool cases a block may linger in MRU for a while before becoming hot. Consider a big make install, followed by putting at least parts of several of the freshly installed files into heavy use. There are several opportunities in that sequence for the data to have been written to L2 before the heavy use begins.

If this mru block was L2-backed and has been evicted from L1, it was certainly one of the least recently used blocks, as the L2 writing process and L1 eviction both start from the end of the queues.

I also worry about processes that walk the ZPL directory structure (e.g. locate.updatedb, or Mac OS X's spotlight, or various similar indexing systems), especially those that start early -- when l2 is still in turbo fill mode but before a user logs into the graphical environment.

When L2 is in turbo mode, L1 is not fully used yet, so eviction should / will not occur at this time.

Of course this PR is not intended to suit all workloads.
This is why I made it tunable through 2 options, so that users can control the behaviour depending on their workload.

&& !arc_evict_l2_first && !arc_evict_l2_only)
continue;
if (header_type == L1_AND_L2
&& arc_evict_l2_only && total_evicted > 0 && bytes != ARC_EVICT_ALL)
Author

@benrubson benrubson Oct 19, 2016


Here is the arc_evict_l2_only safety valve.


@rottegift rottegift Oct 19, 2016


Just to be clear:

arc_evict_l2_first switches to the default (current) behaviour if not enough buffers (total_evicted < bytes) have been evicted, so no issue here.

is AFTER all the L2-backed buffers for the non-faulted pool have already been evicted, right? All of the evictable ones?


If this mru block was L2-backed and has been evicted from L1, this was certainly one of the least recently used blocks, as L2 writing process and L1 eviction start from the end of the queues.

To sharpen the question: what about the case where a block had been written to L2, evicted, then brought back in from L2? Does "certainly one of the least recently used blocks" apply to that buffer?

@benrubson
Author

benrubson commented Oct 19, 2016

is AFTER all the L2-backed buffers for the non-faulted pool have already been evicted, right? All of the evictable ones?

In the case of 2 pools on the same system, I think (one may confirm) that each pool will have its buffers evicted independently from the others (2 different processes).
In this case the pool with healthy L2 will have its L2-backed buffers evicted, pool with faulted L2 will have its L1-backed-only and (remaining) L2-backed buffers evicted.

To sharpen the question: what about the case where a block had been written to L2, evicted, then brought back in from L2? Does "certainly one of the least recently used blocks" apply to that buffer?

No, but at this time it will be at the beginning of the list, so it will not be re-evicted first.
If this block had not been evicted, it would have joined the mfu list. Moving closer to the end of the mfu list, it would have been L2-backed, being eligible for an eviction.

@rottegift

In the case of 2 pools on the same system, I think (one may confirm) that each pool will have its buffers evicted independently from the others (2 different processes).

Yes, please confirm it. In this PR you've been coding in arc_evict_state_impl(). How do you get there? Isn't one possible path

 arc_reclaim_thread()->[arc_shrink(size)->]arc_adjust()->arc_adjust_impl(arc_mfu, 0, target, ARC_BUFC_DATA)->arc_evict_state(..., 0, ..., ...)->arc_evict_state_impl(...,...,..., 0,...)

?

No, but at this time it will be at the beginning of the list, so it will not be re-evicted first.

And if it has been displaced half way down the list, but is now among the few buffers with an l2arc_buf_hdr?

@benrubson
Author

benrubson commented Oct 20, 2016

is AFTER all the L2-backed buffers for the non-faulted pool have already been evicted, right? All of the evictable ones?

I was mistaken in my last comment, so yes, after the L2-backed buffers for the non-faulted pool have been evicted. Not necessarily all of them, it depends on l2arc_headroom.
However, I'm adding a check so that if a buffer comes from a spa which has no L2 device, it will not face the arc_evict_l2_first|only rules.
@rottegift it should suit you, thanks for this 👍

And if it has been displaced half way down the list, but is now among the few buffers with an l2arc_buf_hdr?

It then has a chance to be re-evicted?
If it stays hot enough, it will not be evicted as it will stay at the beginning of the list. If it gets cold, it will get a chance to be evicted as it will move to the end of the list.

@benrubson
Author

benrubson commented Oct 24, 2016

OK so I've just modified the patch so that buffers from datasets without an L2 cache (or with a dead one) will be evicted during the eviction of preferred L2-backed buffers.
Doing this, no dataset / pool will be advantaged or disadvantaged in the case of a failed L2 cache, a pool without an L2 cache, or a dataset configured with secondarycache=none.

In detail, a buffer will be evicted if:

  • all buffers have to be evicted or
  • we did not ask to prefer buffers with L2 header or
  • it has an L2 header

Otherwise, it will be evicted:

  • if it is not L2ARC eligible or does not have an alive L2, during the 1st phase of eviction (ONLY_WITH_L2HDR)
  • in the other cases, during the 2nd phase of eviction (ALL)

l2arc_alive() looks for the spa GUID in the list of spas with an alive L2 cache.
This list is maintained by l2arc_dev_get_next(), which periodically looks for non-faulted l2arc vdevs.
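A condensed sketch of the rules just listed: should_evict() is an illustrative helper, not the patch's actual code; HDR_HAS_L2HDR() and HDR_L2CACHE() are existing arc.c header-flag macros, and l2arc_alive() is the helper added by this patch.

typedef enum { ONLY_WITH_L2HDR, ALL } evict_phase_t;

static boolean_t
should_evict(arc_buf_hdr_t *hdr, evict_phase_t phase, int64_t bytes)
{
	/* Evict regardless of phase: evict-all, no L2 preference, or already L2-backed. */
	if (bytes == ARC_EVICT_ALL ||
	    (!arc_evict_l2_first && !arc_evict_l2_only) ||
	    HDR_HAS_L2HDR(hdr))
		return (B_TRUE);

	/*
	 * Buffers that can never reach L2 (not L2ARC eligible, or their pool
	 * has no alive L2 device) are handled in the first phase, so pools
	 * without a working cache device are not penalized.
	 */
	if (!HDR_L2CACHE(hdr) || !l2arc_alive(hdr->b_spa))
		return (phase == ONLY_WITH_L2HDR);

	/* Everything else waits for the final phase. */
	return (phase == ALL);
}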

Many thanks 👍

@prakashsurya
Member

@zettabot go

Member

@ahrens ahrens left a comment


Overall, I am not comfortable making this semantic change without more extensive real-world testing. I would suggest that you get someone who uses L2ARC in production on general-purpose workloads (or a wide variety of workloads) to at least review this code, and better yet to test it.

Unfortunately I (and Delphix) do not use L2ARC so we aren't in a good position to test this. Maybe someone at Nexenta, Joyent, or iXsystems could help?

* false otherwise.
*/
static boolean_t
l2arc_alive(uint64_t spa)
Member


Could you do this by walking the existing l2arc_dev_list? This mechanism seems quite complicated.

That said, I'd be concerned about the performance of this either way, it's in a pretty hot path.

Author

@benrubson benrubson Nov 30, 2016


I could, but for each buffer to evict, we would have to:

  • browse the list of vdevs;
  • and for each one of these vdevs, test if its spa matches the expected one, and test if it is non-faulted (vdev_is_dead()).

I think this could be pretty time consuming, which is why I preferred to have l2arc_dev_get_next() maintain a list of spas with alive L2 vdevs, as this function already performs all the tests we need, on a regular basis. We then just have to walk the list of spas with an alive L2 vdev (l2arc_spa_list) for each buffer to evict.
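A rough sketch of the approach described above, under stated assumptions: l2arc_spa_list is the list named in the patch, but the entry type and field names (l2arc_spa_t, l2spa_guid, l2spa_node) are hypothetical, and locking is omitted for brevity.

typedef struct l2arc_spa {
	uint64_t	l2spa_guid;	/* guid of a spa with an alive L2 device */
	list_node_t	l2spa_node;
} l2arc_spa_t;

static list_t l2arc_spa_list;		/* refreshed by l2arc_dev_get_next() */

/*
 * Return B_TRUE if the given spa guid currently has an alive (non-faulted)
 * L2ARC device, according to the list maintained by the feed thread.
 */
static boolean_t
l2arc_alive(uint64_t spa)
{
	for (l2arc_spa_t *ls = list_head(&l2arc_spa_list); ls != NULL;
	    ls = list_next(&l2arc_spa_list, ls)) {
		if (ls->l2spa_guid == spa)
			return (B_TRUE);
	}
	return (B_FALSE);
}

Since this runs for every buffer considered for eviction, keeping the list short (one entry per pool with an alive L2 vdev) and updating it only from l2arc_dev_get_next() keeps the hot-path cost low, which is the performance concern raised above.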

@benrubson
Author

benrubson commented Nov 30, 2016

Overall, I am not comfortable making this semantic change without more extensive real-world testing.

If you're concerned about the potential performance impact (which, as we already said, could occur depending on the workload), we could make the arc_evict_l2_first option disabled by default, letting users try to improve performance by themselves, depending on their own workload.
I think it makes sense.

I would suggest that you get someone who uses L2ARC in production on general-purpose workloads (or a wide variety of workloads) to at least review this code, and better yet to test it.
Unfortunately I (and Delphix) do not use L2ARC so we aren't in a good position to test this. Maybe someone at Nexenta, Joyent, or iXsystems could help?

I would really be glad to help if needed, of course 👍

@rmustacc

We at Joyent don't use the L2ARC. It's never been worth the DRAM cost to us. I don't think we'll be able to help in this case.

@benrubson
Author

benrubson commented Aug 13, 2017

PR just rebased with master 👍

@kernelOfTruth

kernelOfTruth commented Aug 13, 2017

This looks more like a git merge than a git rebase of the commit on top of a newer master branch ❓

allowing to first evict buffers which are already in L2ARC, leading to some performance improvement (see https://www.illumos.org/issues/7361).

This can be disabled / enabled by the new arc_evict_l2_first tunable.

The new arc_evict_l2_only tunable goes a step further, only evicting buffers which are in L2ARC, with of course a safety valve in case nothing can be evicted.
@benrubson
Author

Thank you @prakashsurya for having launched the test.
One of the 67 ztest commands failed. It seems to be related to ZIL, whereas my PR concentrates on ARC. Is it then an "expected" failure? Perhaps give the test another chance?

@prakashsurya
Member

@benrubson I don't think that failure is due to this change; rather, I think that failure is due to a prior commit: d28671a

@ahrens
Member

ahrens commented Aug 23, 2017

While I agree that the L2ARC logic needs some work, I'm not convinced that this is the right direction, and the change is too risky to take as-is.

With sufficient testing, I would consider a simpler semantic change like "evict blocks in L2ARC first". That would probably require separating the lists into one for blocks in L2ARC (HDR_HAS_L2HDR) and one for blocks not in L2ARC.

For now, I'm going to close this PR.
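For illustration only, one shape the suggested split could take; this is hypothetical, as the real arc_state_t keeps a single list per buffer type (arcs_list[ARC_BUFC_NUMTYPES]).

/*
 * Hypothetical variant of arc_state_t splitting each per-type list by L2ARC
 * residency, so eviction could drain the L2-backed half first.
 */
typedef struct arc_state_split {
	/* [buffer type][0 = not in L2ARC, 1 = in L2ARC (HDR_HAS_L2HDR)] */
	multilist_t	arcs_list[ARC_BUFC_NUMTYPES][2];
	refcount_t	arcs_esize[ARC_BUFC_NUMTYPES][2];
	refcount_t	arcs_size;
} arc_state_split_t;

Headers would then have to migrate between the two halves whenever they gain or lose an l2arc_buf_hdr, which is part of what would make such a change non-trivial and in need of the real-world testing mentioned above.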

@ahrens ahrens closed this Aug 23, 2017
@benrubson
Author

benrubson commented Aug 24, 2017

I would consider a simpler semantic change like "evict blocks in L2ARC first".

This is exactly what this change does with arc_evict_l2_first = B_TRUE :)
It will prefer buffers which are already in L2ARC, but if it does not find enough such buffers to reach the target, it will then continue with other buffers (...).

the change is too risky to take as-is

(...) So there is no risk at all compared to the current behaviour, as it is the current behaviour plus a preference in which buffers to choose.

The step above, arc_evict_l2_only = B_TRUE, disabled by default (so set to B_FALSE), is for advanced users who want more than simply preferring buffers which are already in L2ARC.
It only evicts buffers which are already in L2ARC. Of course, if it can't find such buffers at all, it will then evict other buffers up to the given target, to avoid memory pressure.
It's then a more aggressive setting, but fitted with a safety valve.

With sufficient testing

In production on 3 servers for a year now, with:
arc_evict_l2_only = B_TRUE: the more aggressive setting;
primarycache=metadata & primarycache=all: I tested both and now use the latter;
secondarycache=metadata: it gives even more work to the algorithm to find buffers!

This change gives a very nice and consistent performance improvement, as shown in the bug report.

In addition, it is quite small in size.

In any case, thank you very much to all reviewers here 👍

@devZer0

devZer0 commented Dec 9, 2017

i have a large zfs backup server hosting millions of files being transferred to via rsync. most of the runtime is taken by rsync comparing files by timestamp. metadata is being re-read again and again on every backup (there is only little incremental transfer). that puts unnecessary stress on the disks and is so much slower, so please consider adding an optimization param like this which helps keep metadata completely in l2arc, as it's too big to fit in ram.

anyhow, i'm curious why the metadata in ram takes that much space at all. how much metadata is kept in ram for each file? it seems it's much more than i would guess. i would guess it's <<1kb/file, but from what i observe it's more...

@devZer0

devZer0 commented Nov 14, 2018

rmustacc commented on 30 Nov 2016
We at Joyent don't use the L2ARC. It's never been worth the DRAM cost to us.
I don't think we'll be able to help in this case.

the fileserver i was talking about has millions of files (numbers still growing) which are being crawled every night. i'm not even able to throw ram at it so that all metadata is held in ram. the metadata size in ARC is already so large that i could not even put the appropriate amount of ram into the machine.

i don't understand why there is a 2-tier caching algorithm with even a "metadata" tunable in ZFS when we cannot make use of it. it's a little bit absurd. i don't think that "caching metadata in l2arc when you have lots of it" is such an exotic demand...

and - to make things even more interesting - mind the significant ssd price drop within the last year...

https://www.guru3d.com/news-story/ssd-prices-are-dropping-significantly.html

i still give my strong vote for ben rubson's patch and would be happy to help testing.
