Introduce ZFS module parameter l2arc_mfuonly #10710
Conversation
Force-pushed from b67dc59 to 13a9b4d.
module/zfs/arc.c (outdated):
* l2arc_mfuonly : A ZFS module parameter that controls whether only MFU
* metadata and data are cached from ARC into L2ARC. This reduces
* the wear of L2ARC devices and may be beneficial in certain
* workloads.
Modern SSDs won't have an endurance issue as cache devices. You need a better reason here.
Also, vaguely referring to "certain workloads" is not helpful for anyone deciding whether or not to enable the feature.
How about adding a method in the man page (missing from this PR!) describing how to observe your current workload's MFU size, L2ARC feed rate, and eviction rate for eligible blocks, so a user can infer whether MFU-only makes sense for their workload. In other words, if the cache is 2 TiB and your MFU is 100 MiB, then maybe you don't want to set this parameter.
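As an illustration of the kind of observation being asked for here, below is a minimal Python sketch that reads the Linux arcstats kstat file. The field names (mfu_size, l2_size, l2_write_bytes, evict_l2_eligible) are existing OpenZFS arcstats, but the closing comparison is only an example heuristic, not guidance from this PR:

```python
# Rough sketch: compare the current MFU size against the L2ARC size to judge
# whether l2arc_mfuonly is likely to help. The kstat field names below exist
# in the Linux OpenZFS arcstats; the decision heuristic is only an example.

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:          # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

s = read_arcstats()
mfu = s["mfu_size"]                  # bytes of MFU data currently in ARC
l2_size = s["l2_size"]               # bytes currently cached on L2ARC devices
l2_writes = s["l2_write_bytes"]      # cumulative bytes written to L2ARC
eligible = s["evict_l2_eligible"]    # cumulative L2-eligible bytes evicted from ARC

print(f"MFU size: {mfu >> 20} MiB, L2ARC size: {l2_size >> 20} MiB")
print(f"L2ARC writes: {l2_writes >> 30} GiB, L2-eligible evictions: {eligible >> 30} GiB")

# Example heuristic (not authoritative): if MFU is a tiny fraction of the
# cache device, restricting the feed to MFU would leave most of L2ARC idle.
if l2_size and mfu < 0.05 * l2_size:
    print("MFU is small relative to L2ARC; l2arc_mfuonly may not be worthwhile.")
```

On FreeBSD the same counters are available via sysctl under kstat.zfs.misc.arcstats rather than procfs.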
This is a great suggestion. The more obvious use case would be in the case of a zfs send, so that L2ARC space is not wasted by filling it with MRU data/metadata.
@richardelling a very simple workload which is impacted by the current L2ARC behavior is a backup (i.e. sequential read) of big files (think of a VM disk image): when an L2ARC device is used, the high "churn rate" of ARC MFU means that reading will cause many L2ARC writes, wearing out the device. While it is true that mixed-use enterprise disks have a significant endurance rating, one should not be limited to (relatively) expensive disks for an L2ARC device. For a real-world example, I have a server with 2x striped Intel S4601 480 GB as L2ARC and, when copying virtual machine files, I often literally "burn" one entire SSD drive write per day. While the L2ARC parameters were set to a more aggressive level (i.e. increased l2arc_headroom and l2arc_write_max), it seems very wrong to wear the SSDs when reading a big file.
Document your use case and put some numbers on it. For example, I can't easily google the specs on that drive, but an Intel DC S3520 480 GB SSD has an endurance rating of 945 TBW. This equates to a full drive write every day for 5.4 years. NB, since we're measuring actual drive writes, you can just use the measured amount of writes. Also note, the warranty is 5 years, so expect to replace in that timeframe.
So now you have your use case workload defined. But that workload may or may not be cause for concern. For your described workload with the above specs, the change is not justified. But for other workloads and specs, it might be justified... help out the user here.
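For anyone redoing that endurance calculation for their own drive, the arithmetic works out as below; the 945 TBW rating and 480 GB capacity are the figures quoted above, and one full drive write per day is the assumed workload:

```python
# Endurance arithmetic from the comment above: a 480 GB drive rated for
# 945 TBW, written at one full drive write per day.
tbw = 945e12            # rated endurance in bytes (945 TB written)
capacity = 480e9        # drive capacity in bytes (480 GB)
writes_per_day = 1.0    # assumed workload: one full drive write per day

days = tbw / (capacity * writes_per_day)
print(f"{days:.0f} days ≈ {days / 365:.1f} years")   # ~1969 days ≈ 5.4 years
```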
@richardelling well, if the change only adds a potentially useful option and it is not invasive/difficult to maintain, I think it would be useful. As said above, enterprise SSDs are quite durable. However, for L2ARC it should be reasonable (for some cases at least) to use cheaper, even consumer-grade SSDs which, while having lower write performance/endurance, are often quite fast for reads. Finally, consider that having an L2ARC with potentially very useful pseudo-MFU data being trashed by a sequential copy with MRU data is going to lower system performance. Giving the user the possibility of making an informed choice does not seem bad to me, unless it is a maintenance burden.
But hey - I'm all for a better comment/documentation. @gamanakis the use case regarding send/recv seems a very good example.
@richardelling Thank you for your comments! This is still a draft, man pages are on the to-do list :) In its current state it mainly serves as a test-bed for #10687.
Should I be testing this with […]?
No. As far as I can tell from the code […]. On the other hand […].
Codecov Report
@@            Coverage Diff             @@
##           master   #10710      +/-   ##
==========================================
+ Coverage   79.73%   79.85%   +0.11%
==========================================
  Files         395      394       -1
  Lines      125225   124660     -565
==========================================
- Hits        99854    99548     -306
+ Misses      25371    25112     -259
To be more consistent with the other secondary cache properties, this needs to be a dataset property. However, that is more work, so you might get feedback or test results to see if it makes any significant difference.
The painfully inconclusive #10464 suggests the opposite (among many other things at various times 😬 ), i.e. #10464 (comment). If, in your code exploration, you unravel the truth, then it'd be terrific to get the real intended behavior documented once and for all. 😄
I do not think that particular comment suggests the opposite. The code and the comment say the same thing: […]
Edit: This is wrong, see next comment.
It seems I am wrong: let us continue this conversation in #10464.
@richardelling @gamanakis a dataset property would be great but, as it is much more work, I would suggest going the module parameter route first. Then, if the new […]. Just my 2 cents.
(unrelated to my previous comments) - ARC MFU has been broken for a long time and only fixed in the last day or two; has this perhaps been aggravating the original issue which prompted this PR? Possibly with a re-test it will be revealed that l2arc is now adequately biased towards MFU behavior (because MFU is stickier in ARC too) and a further change is not required. (I'm not betting on it, but it seems plausible, right?)
@adamdmoss yes, I saw the relevant commit. Your suggestion is certainly possible. Regarding the […]. All in all, I am not sure if this PR has any real benefit. I provided it because it was simple enough and wanted to test it myself. In terms of the […]
@adamdmoss are you sure? This would be quite surprising. Can you share more information? Does the issue only happen with the master branch? EDIT: are you referring to #10548?
Maybe: prefetched reads are always written in MRU, which will only slowly grow when MFU is big. However, especially when using aggressive L2ARC feed settings ([…])
@gamanakis I think the main benefit is to avoid trashing the L2ARC when copying large files. After all, L2ARC data are used to increase system performance; if a single, albeit big, copy can entirely trash it, performance will decrease.
The issue happens with all releases since mid-2016 and is fixed only since e111c80
I expect you're right.
Oh, I didn't want to imply that - I actually rather like the thinking behind the PR, but the chance has grown to non-zero that the behavior addressed here is now addressed in another way. But still not a huge probability if I were to guess.
Again I like - and actually want - this PR, so I'm not casting shade on it; I wanted to note that the L2ARC caches writes as well as reads, so it's doubling the rate of unnecessary L2ARC churn when copying big file(s) between two l2arc-backed pools when you don't expect to read those files again soon. (It's semi-trivial to fix this behavior or make it optional - I have toy patches.)
I feel like the policy control for L2ARC might make more sense as a per-dataset property. Maybe extending 'secondarycache' from the values of 'all, metadata, none', to be 'all, mfuonly, metadata, none' or something, would make more sense to the user.
@allanjude sure, a dataset property would be great. However, it requires more work and it becomes difficult to drop if, for some reason, we want to get rid of that behavior. @gamanakis any thoughts on that?
I agree, though IMHO a module property is okay for a coarse first pass. (Though a valid counterargument is that things which are better as dataset properties but start as module properties don't often seem to graduate to dataset properties in practice. 😁 )
There is a counter-argument that, since this isn't an on-disk change, adding it as a property presents more backwards compatibility issues than using a module property. This is something that we may be able to find time to talk about on the leadership call later today.
I appreciate all your feedback. I just emailed @ahrens to include this as a discussion topic in the leadership meeting. If time does not permit today, perhaps in the next one.
There is a case here where the actual size of the MFU will decline because we're not caching MRU in L2ARC. Basically, you can't be in MFU until you are in MRU and get hit again (later). Current behaviour changes a block to MFU if it was in MRU, ghost MRU, or L2. NB, contrary to the name, MFU really means "we re-hit the block some time after it was first hit", where "some time" is intended to be 62ms.
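To make that promotion rule easier to follow, here is a toy Python model of the behaviour described in this comment. The 62 ms window and the state names come from the text above; promotion on ghost-MRU and L2ARC hits is omitted, and the code is purely illustrative rather than a transcription of arc.c:

```python
# Toy model of the promotion rule described above: a buffer enters MRU on
# first access and only moves to MFU when it is re-hit at least
# MIN_MFU_INTERVAL_MS after the previous access. (The behaviour described in
# the comment also promotes on ghost-MRU and L2 hits; this sketch models only
# the timing rule.)
MIN_MFU_INTERVAL_MS = 62

class Buf:
    def __init__(self, now_ms):
        self.state = "MRU"
        self.last_access_ms = now_ms

    def access(self, now_ms):
        # Re-hits inside the window are treated as part of the same "use";
        # only a later re-hit counts as frequent use and promotes to MFU.
        if self.state == "MRU" and now_ms - self.last_access_ms >= MIN_MFU_INTERVAL_MS:
            self.state = "MFU"
        self.last_access_ms = now_ms

b = Buf(now_ms=0)      # first read: enters MRU
b.access(now_ms=10)    # re-hit within 62 ms: still MRU
b.access(now_ms=100)   # re-hit later: promoted to MFU
print(b.state)         # -> MFU
```

With l2arc_mfuonly set, evicted MRU blocks are no longer kept in L2ARC, so the L2-hit path back into MFU disappears for them; that is the mechanism behind the MFU shrinkage described above.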
(IMHO the ARC state change straight to MFU simply because of an L2ARC hit - which may be from an ancient add to L2ARC, especially now that L2ARC is persistent - isn't a terrifically valuable heuristic anyway.)
@gamanakis @ahrens thanks for talking about this! I am all for better observability of what is cached on L2ARC, but I find that orthogonal to this PR. For example, we already have the […]. For the record, I measured this exact scenario on one of my customers' (small) hyperconverged setups: the nightly backup basically trashes any useful data which were cached on L2ARC. Am I missing something? Thanks.
@shodanshok I believe the point is to have direct kstats instead of relying purely on live monitoring with […]
@shodanshok I'm fine with keeping the current behavior by default but adding a new tunable to not feed hit-once data to the L2ARC. With additional data/testing, I'd be OK with changing the default as well, if that makes sense.
Per-dataset stats don't actually work, due to dedup. Similarly, arcstats are system-wide, not per-pool. But an arcstat is the logical place to put this, and it is likely OK for now because it is still possible to discern what is happening in a test environment (one pool with L2).
I created draft PR #10743 which introduces L2ARC arcstats according to buffer content type and MFU/MRU status upon caching in L2ARC.
I can speak to another use-case: Storage vMotion. I'm using ZFS exclusively for block storage in a vSphere environment and I can attest that Storage vMotion will "ruin" L2ARC contents. At this very moment I'm sitting here in the middle of a large Storage vMotion watching an l2arc_write_max-sized chunk of data get uselessly stuffed into my L2ARC every second, displacing hard-won random reads that were beneficial to performance. I firmly believe that limiting L2ARC to MFU will improve my hit rate at the end of the day and I look forward to trying this out.
@richardm1 yeah, basically any big file copy is going to trash the L2ARC
iff the file being copied is not currently in ARC/L2 and is bigger than L2. Storage vMotion has other pathologies, too. We always set […]
@richardelling Sure, but this is the norm for big file (i.e. TB-sized) copies. Moreover, even files smaller than the L2ARC can cause significant L2ARC trashing if the copied file is bigger than ARC. Quoting @richardm1, the key issue is: […]
This basically negates a big chunk of L2ARC performance when the workload includes copying big files around. A tunable to control this behavior can be very useful (it may even be beneficial to change the default value, letting L2ARC only cache MFU blocks, but this clearly is a harder decision to take).
Also know that if an MFU block is still in ARC and it gets evicted from L2, then it will be re-loaded into L2 by the feed thread. This is why we need the information to know what gets evicted from ARC that is MRU or MFU and eligible for L2.
@gamanakis @ahrens Any chance to see this patch (with no changes to default behavior) merged for the OpenZFS 2.0 release?
I'd be fine with adding this module option as long as the default behavior is unchanged and we add it to the module option man page. It's small enough that we could also backport it to the OpenZFS 2.0 release branch.
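For completeness, a hedged sketch of how a reader could flip such a module option at runtime on Linux. The path follows the usual /sys/module/zfs/parameters/<name> convention for OpenZFS tunables, and the sketch assumes the default of 0 preserves today's behaviour:

```python
# Sketch: toggle the l2arc_mfuonly module parameter at runtime (Linux, root).
# 0 (the assumed default) keeps the current feed behaviour; 1 restricts the
# L2ARC feed to MFU data and metadata.
from pathlib import Path

param = Path("/sys/module/zfs/parameters/l2arc_mfuonly")

print("current value:", param.read_text().strip())
param.write_text("1\n")     # feed only MFU data/metadata into L2ARC
print("new value:", param.read_text().strip())
```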
@shodanshok I have put all my effort into #10743, which will give better observability of what resides in L2ARC.
In certain workloads it may be beneficial to reduce wear of L2ARC devices by not caching MRU metadata and data into L2ARC. This commit introduces a new tunable l2arc_mfuonly for this purpose.
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Force-pushed from 13a9b4d to e6f5298.
I could add a test after #10743 gets merged.
@gamanakis following up with a test case would be great. I don't think it'll take us long to get #10743 merged.
In certain workloads it may be beneficial to reduce wear of L2ARC devices by not caching MRU metadata and data into L2ARC. This commit introduces a new tunable l2arc_mfuonly for this purpose.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #10710
To my understanding this improvement (if enabled) defines MRU buffers as non-eligible for L2 caching. Is this already reflected in the zfs statistics, e.g.
- arcstat_evict_l2_eligible
- arcstat_evict_l2_ineligible
- arcstat_evict_l2_eligible_mfu
- arcstat_evict_l2_eligible_mru
If not, does it require updating "#define DBUF_IS_L2CACHEABLE(_db)" in dbuf.h?
No, it is not reflected in those arcstats currently. I think that this is the desired behaviour. If it were reflected, then evict_l2_eligible_mru would be 0 when l2arc_mfuonly is enabled, meaning we would lose this kind of information, which could be used to decide if l2arc_mfuonly is appropriate or not for the current workload.
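Building on that, a small sketch of how the MRU/MFU split of L2-eligible evictions could feed this decision. It assumes the counters proposed in #10743 appear in the Linux arcstats file without the arcstat_ prefix, and the 80% cut-off is only an illustrative threshold:

```python
# Sketch: use the MFU/MRU split of L2-eligible evictions to judge how much of
# the current L2ARC feed l2arc_mfuonly would skip. Counter names are assumed
# to match the stats discussed above; the threshold is an arbitrary example.

def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:          # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

s = read_arcstats()
mfu = s.get("evict_l2_eligible_mfu", 0)   # L2-eligible MFU bytes evicted from ARC
mru = s.get("evict_l2_eligible_mru", 0)   # L2-eligible MRU bytes evicted from ARC
total = mfu + mru

if total:
    print(f"L2-eligible evictions: {mfu / total:.0%} MFU, {mru / total:.0%} MRU")
    if mru / total > 0.8:
        print("Evictions are mostly MRU; l2arc_mfuonly=1 would skip most of the feed.")
else:
    print("No L2-eligible evictions recorded yet.")
```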
I created #10945 to clarify the current behaviour in the man page of l2arc_mfuonly.
@gamanakis: Thank you for the explanation and for documenting it.
Motivation and Context
Closes #10687
Description
In certain workloads it may be beneficial to avoid caching MRU metadata and data into L2ARC. This commit introduces a new tunable l2arc_mfuonly for this purpose.
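A hedged usage sketch for the tunable described above: making it persistent across reboots on Linux by writing a modprobe options file. The "options zfs <parameter>=<value>" syntax is the standard way to set OpenZFS module parameters; the file name is an arbitrary example:

```python
# Sketch: persist l2arc_mfuonly=1 across module reloads via modprobe.d (root).
# The running value can still be inspected under /sys/module/zfs/parameters/.
from pathlib import Path

conf = Path("/etc/modprobe.d/zfs-l2arc.conf")   # example file name
conf.write_text("options zfs l2arc_mfuonly=1\n")
print(conf.read_text().strip())
```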
How Has This Been Tested?
Types of changes
Checklist:
All commit messages are properly formatted and contain Signed-off-by.