L2ARC Capacity and Usage FUBAR - significant performance penalty apparently associated #3400
Cache device was definitely the problem: send went from ~2MB/s to >70, and the massive (50-70%) IOWait on reads is gone as well. I've checked systems running prior builds, mostly using #2129 and iSCSI stacks, and I'm not seeing this. May be coming from #3115, or from somewhere else in #3189 (and/or #3216, since it's a very similar stack). @kernelOfTruth: are you seeing this behavior elsewhere? It doesn't seem apparent until the cache device has been in use for a while, and the SCST load + ZFS send/receives we've been doing on these hosts fit the bill quite nicely to create this condition. The perf degradation is catastrophic - overarching subscribers time out on IO, databases fail, life sucks in general. Without a ZFS unload, removing the cache device did increase performance for a couple of minutes, but the system stalled out very quickly back into a sad iowait-related crawl (dmesg clean). rmmod zfs failed, claiming the module was still in use. Soft reboot failed; had to IPMI the sucker down hard to unload.
@sempervictus oh dear, that sounds grave! Unfortunately no - this system hardly has an uptime greater than 2.5-3 days, and the L2ARC isn't completely filled most of the time.
Jumping in: I opened #3358 a few days ago and I think it's somehow related. #1420 points in that direction too. I have migrated SAN nodes from SCST/iSCSI over DRBD to SCST/iSCSI over ZVOLs, starting with 0.6.3 and now 0.6.4.1 (nodes have 32G of RAM). L2ARC is just unusable in that context for me. I can confirm 'l2_size' going way beyond my partition size. For instance:
Partition 1 (sdb1, 60G) has this WWN:
I'm using this WWN as cache vdev:
However, 'arcstat.sh' reports a 274G L2ARC:
/proc/spl/kstat/zfs/arcstats seems to agree
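For reference, a minimal sketch of how such a check can be scripted. The kstat file format is `name type data` per line; the sample text below is illustrative (it stands in for reading the file on a live system), not output from the affected host:

```python
# Compare the L2ARC size reported in /proc/spl/kstat/zfs/arcstats with
# the real cache device capacity. The sample string stands in for the
# kstat file; on a live system you would read the file instead.
sample = """\
name                            type data
l2_size                         4    294205259776
l2_asize                        4    288374585344
"""

def kstat_value(text, key):
    """Extract one numeric kstat value from arcstats-style text."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == key:
            return int(fields[2])
    raise KeyError(key)

l2_size = kstat_value(sample, "l2_size")
device_bytes = 60 * 1024**3           # the 60G partition (sdb1)

print(l2_size / 1024**3)              # 274.0 -- a "274G L2ARC" reading
print(l2_size > device_bytes)         # True: reported size exceeds the device
```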
Over time, I also see performance going down and the 'arc_adapt' process using a lot of CPU. More importantly, if I try to remove this cache vdev, IOs to all the pool's ZVOLs are blocked for several minutes. According to strace, 'zpool remove' spends this time in ioctl 0x5a0c:
I can see a lot of people claiming that L2ARC can have a negative impact depending on the workload, but I still have the feeling that something is going wrong here. More than happy to provide more logs/traces/info.
Okay, so it seems the changes from #2110 (merged) and #3115 (to be merged, included in e.g. #3189, #3190, #3216) are not enough to plug this issue. The slowdown might be partly due to #361. Things to read into: #361 (comment), #361 (comment). Explanation from @behlendorf: #1420 (comment)
Not sure how much this still applies after all the changes that have been merged.
@kernelOfTruth: the L2ARC in that example is a 64G device, and there's 12G of ARC. The only data being served is a pair of ZVOLs, so not metadata heavy (or shouldn't be). The host has 24G of physical memory, and there's 20% unused even when SCST buffers start eating away at it. If the RAM is accounting for all the insanely misplaced pointers to L2ARC space which has been wrapped around and doesn't really exist, it's still a problem with the wraparound. #361 is likely unrelated here since it deals with writes - this is specific to reads: once that ARC is wrapped, reads go to hell in a handbasket. The boundary issue is a real PITA for us, since Xen 6.2 wants 512b iSCSI sectors, but we're moving to 6.5 in the near term, so at least we'll be able to map 4K iSCSI blocks to 16K recordsizes (to prevent the MD swell of small volblocksizing).
@sempervictus Wrong - in the end it either panicked really soon or got caught in a kswapd0 spin within a few days. I am using 0.6.4.1 now. Just an idea for you to test while trying to survive this issue.
Might be related to the observation that @odoucet made in #3259 (comment) and #3190 (comment) - one of his test cases was scanning files (thus mostly reading) with clamav, which also led to some strange behavior.
The behaviour I observed was due to an ARC size set too high. I suggest you try to lower the ARC size and see if you observe the same.
Here's another one using #3216 and experiencing serious lag. Receive of a 4T ZVOL went from 100MB/s to several bytes (if I'm lucky) once the SSDs filled up:
The fun part is that the CPU is completely idle, there's no churn on any kernel tasks, and there's no IOWait to speak of (it's just screwing itself to the wall doing a receive from another host). This isn't an ARC sizing issue; it's a problem with the L2ARC devices showing unreal capacities and ARC being flooded with references into the dead space, by the looks of it. EDIT: removing the cache devices has pegged a CPU and I'm slowly watching l2_size drop - down to 800G, which is nuts, given that there's only 372G of L2ARC total. EDIT2: after removal of the cache devices the pool was still unresponsive to receiving a send from another host. After exporting and importing the pool, same deal. The zfs module could not be unloaded: claimed in use, showed no consumer. After reboot the pool imported and immediately went to 30% CPU use for at least 20 minutes running txg_sync. This pool is empty aside from an empty DS storing an empty ZVOL, which is the recipient target of the send that caused the hangup in the first place. Something is seriously rotten in the state of Denmark.
This is starting to look a bit like Illumos issue 5701 - https://reviews.csiden.org/r/175/.
I pushed a test port of this to #3216 upstream: illumos/illumos-gate@a52fc31 - let's see what the buildbots say. From the wording of the Illumos issue entry it at first sounds as if only zpool list is broken - thus cosmetic - but: https://www.illumos.org/issues/5701
So it's a regression, a bug.
@sempervictus Looks like all works well - the buildbots give the green light in #3216 for those two additional commits - feel free to give it a good testing. I've also referenced this issue for @dweeezil. Thanks
I have made a few tests here with #3216. Things have improved on the performance side, but I still end up with a reported L2 size of 300+ GB on a 60G partition! My setup:
I export a ZVOL with lz4 compression over SCST/iSCSI (blockio) to ESXi (2x10G Ethernet, MPIO). I have a Win 2008R2 VM with iometer: 1 OS disk plus 2 disks for measurements sitting on the exported ZVOL. I configured 4 threads (2 per disk) with 256K transfers, 80% random, 50% read, 16 outstanding IOs. The 4 threads hit the same ZVOL. The following graphs show iops (r+w) and bandwidth (r+w) for a single thread over 8 hours:

- 0.6.4.1, no cache, iops
- 0.6.4.1, no cache, bw
- 0.6.4.1, cache, iops
- 0.6.4.1, cache, bw
- 0.6.4-50_g19b6408, cache, iops
- 0.6.4-50_g19b6408, cache, bw

I should also run a test with the patch but without cache. Moreover, I have Storage IO Control enabled on the ESXi side, which I should probably disable for a better view. Still, the performance decrease over time with an L2 cache on 0.6.4.1 is very real, but it seems less dramatic with #3216.
@pzwahlen Thanks for your tests! Just a poke in the dark: the SSD is running without TRIM, right? What is the setting of zfs_arc_max? @sempervictus Did you do any new stress testing with abd-next? Does this also happen there? @dweeezil Since you now have access to some bigger "muscle" - did you observe anything similar in your tests?
@kernelOfTruth: looks like abd_next is the culprit:
This host is running abd_next on master with no other ARC-related patches. Time to ping @tuxoko on #2129, I suppose.
@sempervictus
@kernelOfTruth I've not been doing anything with l2arc lately (I've been working on #3115).
@tuxoko: as @pzwahlen reported, the addition of 5701 to @kernelOfTruth's stack did not mitigate the problem. 5701 fixed an issue introduced by the Illumos changes before it; this problem seems to have started with ABD once metadata moved into the buffers, and it occurs even without the relevant Illumos ports which required 5701 in the first place (if I'm understanding this correctly).
@sempervictus
The problem behavior doesn't occur with master as far as I can tell, and only applies to the newer revisions of #2129.
@tuxoko Recent changes in Illumos (#3115) from the end of last year introduced a regression, which got fixed by "Illumos 5701 zpool list reports incorrect "alloc" value for cache devices". Current ZFS on Linux master doesn't show this behavior, but master + ABD seems to show a similar behavior - if I understood the report correctly. Since the upstream (Illumos) changes that led to this broken, regressed state (according to the Illumos issue description) are not in ZFS on Linux master, the fix won't help in ABD's case. @sempervictus I hope that's the correct summarization.
@sempervictus @kernelOfTruth
@tuxoko So that's probably another building site - perhaps ABD stresses areas of ARC & L2ARC accounting that wouldn't otherwise be hit? (If I remember correctly, @sempervictus pointed out that high I/O wait and/or other issues occurred without ABD, so he couldn't possibly hit this without the ABD improvements.) But referring to @pzwahlen's report, it must have been there prior to the ABD changes - perhaps a regression introduced between 0.6.3 and 0.6.4.1. Thanks for pointing that out.
I can confirm that in my case, the L2ARC reported size has gone way beyond the actual partition/disk size since 0.6.3 (I started using ZoL with this version). 0.6.4.1 with #3216 seems to improve performance a bit, but nothing changed on the reported size issue. Also keep in mind I'm doing ZVOLs only; I don't have a single "file" in my ZFS datasets. Thanks for looking into this - being able to use L2ARC would be really cool...
Referencing #3114 (which includes the potential fix in the top comment). Adding some more:

- #1622 (merged) - Illumos #3137 L2ARC compression
- #1522 (superseded), #1542 (merged) - Fix inaccurate arcstat_l2_hdr_size calculations
- #936 (closed, too many unknowns)
The issue referenced at the bottom of the FreeNAS Redmine seems to indicate that this is now resolved in their kernel. Will pull their GH repo later and see what that actually was (though I'm sure by then @kernelOfTruth will have ported, tested, and polished it).
WTF?! I took the time and effort to create an account and take a look at the code - apparently it's not open source. @sempervictus Could you, by chance, point me to the actual commit? I'm sure I'm missing something blatantly obvious ❓
Right there with ya. Looks like the GH repo is not much help either; it's a build system, requiring that we build atop an existing FreeNAS... digging further, but yeah, "WTF?!" is about right.
Finally had time to make some more tests. I applied #3451 using the following commands from a buildbot log. Things are definitely much better: with a small (4G) l2arc partition that I could previously "overfill" in less than 10 minutes, I now couldn't reach more than 3G of used cache with my iometer test. On a larger 60G partition, cache usage went up to 29G (after about 2 hours) and then stayed there. I don't know if reaching half the cache size is normal with my workload or if it's a sign of another size calculation issue, though. Performance doesn't seem to suffer in any way. Are there other tests I could perform now that I have this running? Thanks for the hard work!
@pzwahlen Thanks for sharing your stats =) If possible, I'd say leave it running (with automated tests or some scripts) over a period of days to stress it some more. Indeed, it could be that this (seemingly correctly behaving) L2ARC shows some bug in the calculation, or it's simply saturated and might fill up more over time. Cheers
Quick update: after a few Storage vMotions I have been able to bring the l2arc size to 81G out of my 128G partition, so it definitely can grow past half the total size. Another point: during low-activity times, the l2arc size actually decreases, which is something I never saw before! @sempervictus Do you have any additional feedback?
@pzwahlen Did you observe any stalls, performance issues, or other regression-like behavior?
@sempervictus Not so far. However, I just applied #3451 over master (as far as I understand Git), and nothing related to #2129. Now that I read back over this conversation, I even wonder whether it's #3433 that is supposed to fix this l2arc behavior rather than #3451 (even though the latter definitely changed things in the right direction). Any input much appreciated. Cheers!
@pzwahlen Sorry, I should have updated the title of #3433 - done now. According to @avg-I it's not entirely correct and thus should be superseded by #3451 (and subsequent changes), so you applied the correct patch. The only change that applies to current zfs master is #3451. "5701 zpool list reports incorrect "alloc" value for cache devices" does not apply, and will become relevant once #3115 has been merged. Hope that clears things up. Thanks
Thanks guys for running this to ground. The fix in #3451 looks good to me, although I noticed that despite positive reviews on illumos, and originally being written for FreeBSD, neither platform has merged it yet. Anybody happen to know what's holding things up? It would be nice to have an illumos issue number for this to make tracking it easier.
@behlendorf Glad the l2arc works correctly now (even for everyday tasks it makes a remarkable difference for me, so it would be a shame to have to disable it, especially for bigger workloads: #3259 (comment)). From looking around the web, I got the impression that the issue might not have gotten wide enough recognition for users to seek solutions yet, and/or it hasn't been triggered often enough to be "relevant" (perhaps they're not running such stressful workloads?). Why there's no activity on FreeBSD or Illumos, that I can't say.
Just to chime in: I've been using #3451 successfully since it hit, on multiple hosts, including contentious SCST boxes feeding pairs of 40Gbit links on which a CloudStack resides, all of which beat the hell out of ARC.
@behlendorf @kernelOfTruth Regarding the illumos/FreeBSD (in-)activity: I made these changes for HybridCluster / ClusterHQ, and we ran with them for many months while our product was based on FreeBSD. After submitting the changes to illumos and FreeBSD I switched to working on other things and didn't have time to follow through. Apparently no one else was as interested in the changes at that time. Fortunately, now there are interested parties :) and I've got a little more time. One thing I noticed is that there were recent significant changes to the ARC code in illumos, so I am looking forward to the fun times of merging those changes with mine.
The problem is that since OpenSolaris commit illumos/illumos-gate@e14bb32, l2ad_hand is kept aligned based on ashift (which is derived from the cache device's logical and physical block sizes). So the hand could be advanced by more than the sum of the b_asize-s of the written L2ARC buffers. This is because b_asize is a misnomer at the moment, as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded, but if the compression fails or is not applied, then the original size is kept, and it could be smaller than what ashift requires.

For the same reasons, arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz; otherwise we would overwrite active (unevicted) L2ARC buffers. That problem manifests as growing l2_cksum_bad and l2_io_error counters.

This change also changes the 'p' prefix to an 'a' prefix in a few places where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space).

PR: 198242
PR: 195746 (possibly related)

Porting notes:

Rather difficult to track changes related to "Illumos 5369 - arc flags should be an enum" and "Illumos 5408 - managing ZFS cache devices requires lots of RAM":

hdr->b_l2hdr = l2hdr; changed to hdr->b_flags |= ARC_FLAG_HAS_L2HDR;
list_insert_head(dev->l2ad_buflist, hdr); changed to list_insert_head(&dev->l2ad_buflist, hdr);

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
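The accounting drift described here can be sketched numerically. This is a simplified model with hypothetical buffer sizes, not the actual arc.c logic: the hand advances by ashift-rounded sizes while the stats accumulate only the raw b_asize, so on an ashift=12 device the two diverge:

```python
# Simplified model of the pre-fix L2ARC accounting (hypothetical sizes).
ASHIFT = 12
SECTOR = 1 << ASHIFT                  # 4096-byte allocation unit

def roundup(x, align):
    return (x + align - 1) // align * align

# b_asize of written buffers: compressed sizes, or raw sizes when
# compression failed / was not applied (these can be < SECTOR).
b_asizes = [512, 1536, 4096, 2048, 9000]

hand = 0        # bytes actually consumed on the cache device
accounted = 0   # what the b_asize-based stats would report
for sz in b_asizes:
    hand += roundup(sz, SECTOR)       # l2ad_hand is kept ashift-aligned
    accounted += sz                   # stats only see the unrounded size

print(hand, accounted)                # 28672 17192: usage outruns accounting
```

Once eviction (which works in terms of disk addresses) and the stats disagree like this, wraparound can leave unevicted buffers counted against space that has already been reclaimed, which is one way the reported size can climb past the device capacity.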
@avg-I Thanks for following up with us! It would be great if you could refresh and finalize the patches to fix this issue, which clearly impacts everybody. We've just finished pulling in those significant ARC changes, so it should be fairly easy for us to apply the fix.
@behlendorf #3491 is equivalent to what I have
If we don't account for that, then we might end up overwriting the disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit 3a17a7a. The change that introduced l2ad_hand alignment was almost correct, as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements.

The code still assumes that either the underlying storage subsystems or the hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB-sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size.

Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to the l2_asize kstat. That is wrong, but this commit does not fix it. The accounting problem was introduced partially in commit 3a17a7a and partially in 3038a2b (accounting became consistent, but in favour of the wrong size).

Porting Notes:
Reworked to be C90 compatible; the 'write_psize' variable was removed because it is now unused.

References:
https://reviews.csiden.org/r/229/
https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3400
Issue openzfs#3433
Issue openzfs#3451
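In the same simplified-model terms (hypothetical sizes, not the actual arc.c code), the fix amounts to accumulating exactly the rounded quantity that l2ad_hand will later be advanced by, and never letting that total pass target_sz:

```python
# Simplified model of the fixed write path (hypothetical sizes).
ASHIFT = 12
SECTOR = 1 << ASHIFT

def roundup(x, align):
    return (x + align - 1) // align * align

def write_buffers(b_asizes, target_sz):
    """Accumulate the same ashift-rounded size that the hand will be
    advanced by, stopping before target_sz is exceeded so the hand
    cannot run into unevicted buffers."""
    write_asize = 0
    written = []
    for sz in b_asizes:
        a_sz = roundup(sz, SECTOR)    # allocated size (asize), not psize
        if write_asize + a_sz > target_sz:
            break                     # would overrun the evicted region
        write_asize += a_sz
        written.append(sz)
    return write_asize, written

write_asize, written = write_buffers([512, 1536, 4096, 9000], target_sz=16384)
print(write_asize, written)           # 12288 [512, 1536, 4096]
```

Because the on-disk increment and the accumulated write size are now the same quantity, the hand and the accounting can no longer diverge in this model.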
@kernelOfTruth thank you too!
While running #3189 I came across something very strange: the L2ARC has developed magical powers and turned a 64G SSD into a bottomless pit of data. This in turn has resulted in crushing performance degradation and apparent demonic possession of the SCST host running this export.
My assumption is that the following output is based on the number of times L2ARC has wrapped around:
Reads from the pool are suffering badly; IOWait on a send operation is >50%. About to drop the cache device and hope it gets fixed, but figured I should put this up here for now.
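For what it's worth, if the reported l2_size really does grow by one device-worth per pass, the implied wrap count is trivial to compute (illustrative numbers for a 64G device; the reported figure here is hypothetical):

```python
# Illustrative arithmetic: apparent wrap count of the cache device,
# assuming the reported size grows by one device-worth per pass.
device_gib = 64            # the SSD behind the cache vdev
reported_gib = 800         # hypothetical runaway l2_size reading

wraps = reported_gib / device_gib
print(wraps)               # 12.5 apparent passes over the device
```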