
L2ARC Capacity and Usage FUBAR - significant performance penalty apparently associated #3400

Closed
sempervictus opened this issue May 12, 2015 · 58 comments


@sempervictus
Contributor

While running #3189 I came across something very strange - the L2ARC has developed magical powers and turned a 64G SSD into a bottomless pit of data. This in turn has resulted in crushing performance degradation and apparent demonic possession of the SCST host running this export.

My assumption is that the following output reflects the number of times the L2ARC has wrapped around:

zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
(omitted since this is a client system and the underlying VDEVs are named with serial numbers)
...
...
cache      -      -      -      -      -      -
  dm-name-crypt_2718620040  59.6G   667G  16.0E         -     0%  1118%

Reads from the pool are suffering badly; IOWait on a send operation is >50%. I'm about to drop the cache device and hope that fixes it, but figured I should put this up here for now.
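
For anyone comparing notes: the pool-side numbers above can be cross-checked against the ARC-side kstats. A minimal sketch (the pool name below is a placeholder; the kstat path is the same one quoted elsewhere in this thread for ZoL 0.6.x):

grep -E '^l2_(size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats   # ARC's view of L2ARC usage and header overhead
zpool list -v poolname                                             # pool's view of the cache vdev ALLOC/CAP

If l2_size/l2_asize keep climbing past the physical size of the cache device while zpool shows CAP over 100%, it's likely the same accounting runaway described here.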

@sempervictus
Contributor Author

The cache device was definitely the problem: the send went from ~2MB/s to >70MB/s, and the massive (50-70%) IOWait on reads is gone as well.

I've checked systems running prior builds, mostly using #2129 and iSCSI stacks, and I'm not seeing this. It may be coming from #3115, or from somewhere else in #3189 (and/or #3216, since it's a very similar stack).

@kernelOfTruth: are you seeing this behavior elsewhere? It doesn't become apparent until the cache device has been in use for a while, and the SCST load + ZFS send/receives we've been doing on these hosts fit the bill quite nicely to create this condition. The performance degradation is catastrophic - overarching subscribers time out on IO, databases fail, life sucks in general. Without a ZFS module unload, removing the cache device did increase performance for a couple of minutes, but the system stalled out very quickly back into a sad iowait-related crawl (dmesg clean). rmmod zfs failed, claiming the module was still in use. A soft reboot failed too; I had to IPMI the sucker down hard to unload.

@kernelOfTruth
Contributor

@sempervictus oh dear, that sounds grave !

unfortunately no - this system rarely has an uptime greater than 2.5-3 days, and the L2ARC usually isn't completely filled

@pzwahlen

Jumping in,

I opened #3358 a few days ago and I think it's somehow related; #1420 points in that direction too. I have migrated SAN nodes from SCST/iSCSI over DRBD to SCST/iSCSI over ZVOLs, starting with 0.6.3 and now on 0.6.4.1 (the nodes have 32G of RAM).

L2ARC is just unusable in that context for me. I can confirm 'l2_size' going way beyond my partition size. For instance:

[root@sanlab2 ~]# lsblk /dev/sdb
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb      8:16   0  477G  0 disk 
├─sdb1   8:17   0 59.6G  0 part 
├─sdb2   8:18   0 59.6G  0 part 
├─sdb3   8:19   0 59.6G  0 part 
└─sdb4   8:20   0 59.6G  0 part 

Partition 1 (sdb1, 60G) has this WWN:

lrwxrwxrwx 1 root root  9 Apr 29 15:27 wwn-0x50025385a014444d -> ../../sdb
lrwxrwxrwx 1 root root 10 Apr 30 09:56 wwn-0x50025385a014444d-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part4 -> ../../sdb4

I'm using this WWN as cache vdev:

        NAME                            STATE     READ WRITE CKSUM
        p02                             ONLINE       0     0     0
          raidz1-0                      ONLINE       0     0     0
            wwn-0x5000c5006c5f04ab      ONLINE       0     0     0
            wwn-0x5000c5006c5f3b6f      ONLINE       0     0     0
            wwn-0x5000c5006c5f24bf      ONLINE       0     0     0
            wwn-0x5000c5006c5f36ab      ONLINE       0     0     0
        cache
          wwn-0x50025385a014444d-part1  ONLINE       0     0     0

However, 'arcstat.sh' reports a 274G L2ARC:

[root@sanlab2 ~]# arcstat.sh 
|---------------------------------------------------------------------------------------------------------------------|
|l1reads    l1miss     l1hits     l1hit%     size  |  l2reads    l2misses   l2hits     l2hit%     size   disk_access% |
|---------------------------------------------------------------------------------------------------------------------|
|463645439  132874797  330770642  71.341%    11 GB  |  131706937  49127270   82579667   62.699%    274GB   10.595%    |

/proc/spl/kstat/zfs/arcstats seems to agree
[root@sanlab2 ~]# cat /proc/spl/kstat/zfs/arcstats

...
l2_size                         4    294606549504
...
l2_hdr_size                     4    12651900104
...

Over time, I also see performance going down and the 'arc_adapt' process using a lot of CPU. More importantly, if I try to remove this cache vdev, IOs to all the pool's ZVOLs are blocked for several minutes. According to strace, 'zpool remove' spends this time in ioctl 0x5a0c:

22:03:56 ioctl(3, 0x5a0c, 0x7fffa57f1c40) = 0

I've read a lot of people claiming that L2ARC can have a negative impact depending on the workload, but I still have the feeling that something is going wrong here.

More than happy to provide more logs/traces/info.
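
One hedged way to capture where that time goes, in case it helps debugging (run as root; the pool and cache device names are the ones from the zpool status above, so substitute your own; /proc/<pid>/stack needs a kernel built with stack tracing):

cat /proc/$(pgrep -x arc_adapt)/stack                                        # kernel stack of the busy arc_adapt thread
strace -f -T -e trace=ioctl zpool remove p02 wwn-0x50025385a014444d-part1    # -T shows how long the 0x5a0c ioctl blocks

This only observes the stall, it doesn't work around it.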

@kernelOfTruth
Contributor

Okay, so it seems the changes from

#2110 (merged)

and #3115 (to be merged; included in e.g. #3189, #3190, #3216)

are not enough to plug this issue.

The slowdown might be partly due to #361.

Things to read into:

#361 (comment)
https://www.illumos.org/issues/3794

#361 (comment)
http://lists.open-zfs.org/pipermail/developer/2015-January/001222.html

Explanation from @behlendorf ( #1420 (comment) )

What's happening is that virtually all of your 4GB of ARC space is being consumed managing the 125GB of data in the L2ARC. This means there's basically no memory available for anything else which is why your system is struggling.

To explain a little more, when a data buffer gets removed from the primary ARC cache and migrated to the L2ARC, a reference to the L2ARC buffer must be left in memory. Depending on how large your L2ARC device is and what your default block size is, it can take a significant amount of memory to manage these headers. This can get particularly bad for ZVOLs because they have a small 8k block size vs 128k for a file system. This means the ARC's memory requirements for L2ARC headers increase by 16x.

You can check for this in the l2_hdr_size field of the arcstats if you know what you're looking for. There are really only two ways to handle this at the moment.

  1. Add additional memory to your system so the ARC is large enough to manage your entire L2ARC device.
  2. Manually partition your L2ARC device so it's smaller.

Arguably ZFS should internally limit its L2ARC usage to prevent this pathological behavior and that's something we'll want to look in to. The upstream code also suffers from this issue but it's somewhat hidden because the vendors will carefully size the ARC and L2ARC to avoid this case.

Not sure how much of this still applies after all the changes merged.
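
To put rough numbers on the quoted explanation - a back-of-the-envelope sketch only; the ~180 bytes per L2ARC header is an assumed ballpark for ZoL of this era, not a figure taken from this issue:

# bash arithmetic; 64G cache device filled with 8k zvol blocks
L2ARC_BYTES=$((64 * 1024**3))
BLOCK=8192        # zvol volblocksize
HDR=180           # assumed ARC bytes kept in RAM per L2ARC buffer header
echo "$(( L2ARC_BYTES / BLOCK * HDR / 1024**2 )) MiB of ARC spent on L2ARC headers"

That works out to roughly 1.4 GiB of ARC just for headers at 8k blocks, versus about 1/16th of that for 128k filesystem records - the 16x factor mentioned above.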

@sempervictus
Contributor Author

@kernelOfTruth: the L2ARC in that example is a 64G device, and there's 12G of ARC. The only data being served is a pair of ZVOLs, so it's not metadata heavy (or shouldn't be). The host has 24G of physical memory, and 20% of it is unused even when SCST buffers start eating away at it. Even if the RAM is being consumed by all the insanely misplaced pointers to L2ARC space that has been wrapped around and doesn't really exist, it's still a problem with the wraparound.

#361 is likely unrelated here as it deals with writes - this is specific to reads: once the L2ARC has wrapped, reads go to hell in a handbasket. The boundary issue is a real PITA for us, since Xen 6.2 wants 512b iSCSI sectors, but we're moving to 6.5 in the near term, so at least we'll be able to map 4K iSCSI blocks to 16K recordsizes (to prevent the metadata swell of small volblocksizes).

@AndCycle

@sempervictus
Here's a simple idea you might want to try.
My system has 32G installed; after firing up the VMs and all services, about 16G is in use, leaving 16G free - so it seemed reasonable to give the ZFS ARC 8GB, right?

Wrong: it either panicked fairly soon or got caught in a kswapd0 spin within a few days, so I lowered it to 4GB and it has now been up and running for 2 weeks.
I have a 64G device for L2ARC too; my l2_hdr_size is 162,072,864, and it mostly serves 128k recordsize data, so it won't hit the metadata limit.

I am using 0.6.4.1 now.
On 0.6.3 I didn't have the kswapd0 spin issue, but if I set a lower arc_max it always blew past the limit due to the heavy metadata workload; it wouldn't crash if I used the default 16G arc_max.
0.6.4 honors the limit much better, but has a weird kswapd0 spin issue.
I hit the same bottomless L2ARC issue on 0.6.3 a few times but never caught the cause.

Just an idea for you to test while trying to survive this issue - it's really hard to catch where things are going wrong.

@kernelOfTruth
Contributor

might be related to the observation that @odoucet made in #3259 (comment)

#3190 (comment)
#3216 (comment)

One of his test cases was scanning files (thus mostly reading) with clamav, which also led to some strange behavior.

@odoucet

odoucet commented May 13, 2015

The behaviour I observed was due to an ARC size that was set too high. I suggest you try lowering the ARC size and see if you observe the same.

@sempervictus
Contributor Author

Here's another host using #3216 and experiencing serious lag. A receive of a 4T ZVOL went from 100MB/s to a few bytes per second (if I'm lucky) once the SSDs filled up:

root@storage-host:~# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
fn00-pool00  14.5T  1.70T  12.8T         -     4%    11%  1.00x  ONLINE  -
  raidz2  14.5T  1.70T  12.8T         -     4%    11%
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
  mirror  46.5G      0  46.5G         -     0%     0%
    ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part1      -      -      -         -      -      -
    ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part1      -      -      -         -      -      -
cache      -      -      -      -      -      -
  ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part2   186G   518G  16.0E         -     0%   277%
  ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part2   186G   519G  16.0E         -     0%   278%
root@storage-host:~# arcstat 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
15:25:15     0     0      0     0    0     0    0     0    0    24G   24G  
root@storage-host:~#  cat /proc/spl/kstat/zfs/arcstats  | grep size
size                            4    26245258968
hdr_size                        4    729092784
data_size                       4    0
meta_size                       4    102400
other_size                      4    136616
anon_size                       4    18432
mru_size                        4    83968
mru_ghost_size                  4    25769576960
mfu_size                        4    0
mfu_ghost_size                  4    213504
l2_size                         4    1051616213504
l2_asize                        4    1046476891648
l2_hdr_size                     4    25515927168
duplicate_buffers_size          4    0
root@storage-host:~# 

The fun part is that the CPU is completely idle, there's no churn on any kernel tasks, and there's no IOWait to speak of (it's just screwing itself to the wall doing a receive from another host).

This isn't an ARC sizing issue; it's a problem with the L2ARC devices showing unreal capacities and, by the looks of it, the ARC being flooded with references into the dead space.

EDIT: removing the cache devices has pegged a CPU and I'm slowly watching the l2_size drop - down to 800G, which is nuts, given that there's only 372G of L2ARC total.

EDIT2: after removal of the cache devices the pool was still unresponsive to receiving a send from another host. After exporting and importing the pool, same deal. The zfs module could not be unloaded - it claimed to be in use but showed no consumer. After a reboot the pool imported and immediately went to 30% CPU use for at least 20 minutes running txg_sync. This pool is empty aside from an empty dataset holding an empty ZVOL, which is the recipient target of the send that caused the hangup in the first place. Something is seriously rotten in the state of Denmark.
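
For anyone else stuck at this point, a hedged way to watch the teardown rather than guessing (this only observes; pool and device names are omitted on purpose):

watch -n 5 "grep -E '^l2_(size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats"   # watch l2_size drain after 'zpool remove'
cat /sys/module/zfs/refcnt                                                      # non-zero refcount is why rmmod refuses
lsmod | grep -E '^(zfs|spl)'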

@sempervictus
Contributor Author

This is starting to look a bit like Illumos issue 5701 - https://reviews.csiden.org/r/175/.
From reading the relevant ZoL PRs, I gather we need #3115 and #3038 to port onto.

@kernelOfTruth
Contributor

I pushed a test port of this to #3216

upstream: illumos/illumos-gate@a52fc31

kernelOfTruth@c99dfad

let's see what the buildbots say

From the wording of the Illumos issue entry it at first sounds as if only zpool list is broken - thus cosmetic - but:

https://www.illumos.org/issues/5701

The l2arc vdev space accounting was broken as of the landing of the
l2arc RAM reduction patch (89c86e3), as that patch completely removed a
couple calls to vdev_space_update(). The issue was compounded with the
lock change in l2arc_write_buffers() that went in with the arc_state
contention patch (244781f), as buffers could now be release with its
ARC_FLAG_L2_WRITING flag set without having been issued to the l2arc
device (this results in decrements without accompanying increments).

so: a regression, a bug

@kernelOfTruth
Contributor

@sempervictus looks like it all works well - the buildbots give the green light in #3216 for those two additional commits - feel free to give it a good testing. I've also referenced this issue for @dweeezil

thanks

@pzwahlen

I have run a few tests here with #3216. Things have improved on the performance side, but I still end up with a reported L2 size of 300+ GB on a 60G partition!

My setup:

  • CentOS 7
  • 2x Xeon E5-2620
  • 32GB ECC memory
  • 4x 600G 15k 3.5" in a JBOD via LSI (SAS 6G)
  • raidz1-0 pool (2 mirrors)
  • 240G local SSD with a 60G partition for L2ARC

I export a ZVOL with lz4 compression over SCST/iSCSI (blockio) to ESXi (2x10G ethernet, MPIO). I have a Win 2008R2 VM with iometer. 1 OS disk plus 2 disks for measurements sitting on the exported ZVOL.

I configured 4 threads (2 per disk) with 256K transfers, 80% random, 50% read, 16 outstanding IOs. The 4 threads are hitting the same ZVOL. Following are the iops (r+w) and bandwidth (r+w) graphs for a single thread over 8 hours:

0.6.4.1, no cache, iops
http://i.imgur.com/K301KlG.png

0.6.4.1, no cache, bw
http://i.imgur.com/wjA7eKh.png

0.6.4.1, cache, iops
http://i.imgur.com/hWsbFI2.png

0.6.4.1, cache, bw
http://i.imgur.com/YlphS3i.png

0.6.4-50_g19b6408, cache, iops
http://i.imgur.com/IexinvM.png

0.6.4-50_g19b6408, cache, bw
http://i.imgur.com/zC1LNYT.png

I should also make a test with the patch but without cache. Moreover, I have Storage IO control enabled on the ESXi side, which I should probably disable for a better view.

Still, the performance decrease over time with an L2 cache on 0.6.4.1 is very real, but it seems to be less dramatic with #3216.
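
For anyone without an ESXi/iometer rig who wants to approximate the profile above directly on the Linux host, a hedged fio sketch (the zvol path is a placeholder for a scratch volume - this writes to it - and the knobs only roughly match iometer's 80% random / 50% read / 256K / QD16 mix):

fio --name=zvol-mix --filename=/dev/zvol/p02/testvol --ioengine=libaio --direct=1 \
    --bs=256k --rw=randrw --rwmixread=50 --percentage_random=80 \
    --iodepth=16 --numjobs=4 --time_based --runtime=28800 --group_reporting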

@kernelOfTruth
Contributor

@pzwahlen thanks for your tests !

Just a poke in the dark: the SSD is running without TRIM, right?
Have you tried disabling NCQ, e.g. libata.force=noncq, to see if that makes a difference?

What is the setting of zfs_arc_max?
Could you - as an experiment - set the L2ARC size to roughly less than twice the ARC max?
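
For reference, a minimal sketch of how to check and pin zfs_arc_max on ZoL - the 8G figure is just an example, and on some versions a runtime change only fully takes effect under memory pressure or after a module reload:

cat /sys/module/zfs/parameters/zfs_arc_max                       # 0 means the built-in default (about half of RAM)
echo $((8 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max   # runtime change to 8G, as root
# persistent across reloads: options zfs zfs_arc_max=8589934592  in /etc/modprobe.d/zfs.conf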

@sempervictus did you do any new stress testing with abd-next? Does this also happen there?
If not, it could be that the regression wasn't fully fixed by "5701 zpool list reports incorrect 'alloc' value for cache" (not sure whether that "fix" was even fully and correctly ported - but according to the buildbots it appears so);
or, if it DOES happen with ABD only and without the changes from #3216, it's possibly related to the ABD changes.

@dweeezil since you now have access to some bigger "muscle" - did you observe anything similar in your tests ?

@sempervictus
Contributor Author

@kernelOfTruth: looks like abd_next is the culprit:

dm-uuid-CRYPT-LUKS1-...-sddc_crypt_unformatted  52.2G  69.4G  16.0E         -     0%   132%

This host is running abd_next on master with no other ARC-related patches. Time to ping @tuxoko on #2129, I suppose.

@tuxoko
Contributor

tuxoko commented May 17, 2015

@sempervictus
Uhm, if my reading serves me right, the 16.0E thing should be fixed by illumos 5701?
Have you tried it on top of abd_next?

@dweeezil
Contributor

@kernelOfTruth I've not been doing anything with l2arc lately (working on #3115).

@sempervictus
Contributor Author

@tuxoko: as @pzwahlen reported, the addition of 5701 to @kernelOfTruth's stack did not mitigate the problem. 5701 fixed an issue introduced by the Illumos changes before it; this problem seems to have started with ABD, since metadata moved into the buffers, and it occurs even without the relevant Illumos ports which required 5701 in the first place (if I'm understanding this correctly).

@tuxoko
Contributor

tuxoko commented May 18, 2015

@sempervictus
But the problem also happens on illumos, which doesn't have ABD in the first place.

@sempervictus
Contributor Author

The problem behavior doesn't occur with master as far as I can tell, and only applies to the newer revisions of #2129.

@kernelOfTruth
Contributor

@tuxoko recent changes in Illumos ( #3115 ) from the end of last year introduced a regression which got fixed by "Illumos 5701 zpool list reports incorrect "alloc" value for cache devices"

Current ZFSonLinux master doesn't show this behavior, but master + ABD seems to show a similar behavior - if I understood the report correctly.

Since the upstream (Illumos) changes that led to this broken state - a regression, according to the Illumos issue description - are not in ZFSonLinux master, the fix won't help in ABD's case.

@sempervictus I hope that's a correct summary

@tuxoko
Contributor

tuxoko commented May 18, 2015

@sempervictus @kernelOfTruth
ABD doesn't touch anything related to the vdev size stats, so I don't see how it could cause such a bug. Also, in the first post by @pzwahlen #3400 (comment), he indicated that he saw l2_size grow over the disk size on 0.6.4.1. I assume that was on master?

@kernelOfTruth
Contributor

@tuxoko so that's probably another construction site, then - perhaps ABD stresses areas of ARC & L2ARC accounting that wouldn't otherwise be stressed? (If I remember correctly, @sempervictus pointed out that high I/O wait and/or other issues occurred without ABD, so he couldn't possibly hit this without the ABD improvements.)

But referring to @pzwahlen's report, it must have been there prior to the ABD changes - perhaps a regression introduced between 0.6.3 and 0.6.4.1.

thanks for pointing that out

@pzwahlen

I can confirm that in my case the reported L2ARC size has gone way beyond the actual partition/disk size since 0.6.3 (the version I started using ZoL with).

0.6.4.1 with #3216 seems to improve performance a bit, but nothing changed on the reported size issue.

Also keep in mind I'm doing ZVOLs only; I don't have a single "file" in my ZFS datasets.

Thanks for looking into this, being able to use L2ARC would be really cool...

@kernelOfTruth
Contributor

referencing #3114 (which includes the potential fix in the top comment)

adding some more:
#1612 (superseded) , #1936 (superseded), #1967 (superseded), #2110 (merged) - Improve ARC hit rate with metadata heavy workloads #2110

#1622 (merged) - Illumos #3137 L2ARC compression #1622

#1522 (superseded), #1542 (merged) - Fix inaccurate arcstat_l2_hdr_size calculations #1542

#936 (closed, too many unknowns)

@sempervictus
Contributor Author

The issue referenced at the bottom of the FreeNAS Redmine seems to indicate that this is now resolved in their kernel. I'll pull their GH repo later and see what that actually was (though I'm sure by then @kernelOfTruth will have ported, tested, and polished it).

@kernelOfTruth
Contributor

403

You are not authorized to access this page.

WTF ?!

I took the time and effort to create an account and take a look at the code - apparently it's not open source.

@sempervictus could you, by chance, point me to the actual commit?

I'm sure I'm missing something blatantly obvious ❓

@sempervictus
Contributor Author

Right there with ya. Looks like the GH repo is not much help either: it's a build system, requiring that we build atop an existing FreeNAS... digging further, but yeah, "WTF?!" is about right.

@pzwahlen

pzwahlen commented Jun 6, 2015

Finally had time to make some more tests.

I have applied #3451 using the following commands from a buildbot log:
http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/2391/steps/git_1/logs/stdio

Things are definitely much better. With a small (4G) L2ARC partition that I could previously "overfill" in less than 10 minutes, I now couldn't get past 3G of used cache with my iometer test.

On a larger 60G partition, cache usage went up to 29G (after about 2 hours) and then stayed there.

I don't know if reaching half the cache size is normal with my workload or if it's the sign of another size calculation issue, though.

Performance doesn't seem to suffer in any way.

Are there other tests I could perform now that I have this running ?

Thx for the hard work!

@kernelOfTruth
Contributor

@pzwahlen Thanks for sharing your stats =)

If it's possible, I'd say leave it running (if there are automated tests or some scripts) over a longer period of days to stress it some more.

Indeed, it could be that this (seemingly correctly behaving) L2ARC reveals some bug in the calculation, or it's simply saturated and might fill up more over time.
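
One concrete thing to keep an eye on while it runs: the upstream patch text quoted further down in this thread names growing l2_cksum_bad and l2_io_error counters as the symptom of the hand overwriting live buffers, so a periodic check like this (a sketch; same kstat file as used earlier) should stay at or near zero if the fix holds:

grep -E '^l2_(cksum_bad|io_error|size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats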

Cheers

@pzwahlen

pzwahlen commented Jun 9, 2015

Quick update: after a few Storage vMotions I have been able to bring the L2ARC size to 81G out of my 128G partition, so it definitely can grow beyond half the total size.

Another point is that during low activity times, l2arc size actually decreases, which is something I never saw before!

@sempervictus Do you have some additional feedback ?

@kernelOfTruth
Contributor

@pzwahlen did you observe any stalls, performance issues or other regression-like behavior ?

@pzwahlen

pzwahlen commented Jun 9, 2015

@sempervictus not so far. However, I just applied #3451 over master (as far as I understand Git), and nothing related to #2129.

Now that I've read back through this conversation, I even wonder whether it isn't #3433 that is supposed to fix this L2ARC behavior rather than #3451 (even though #3451 definitely changed things in the right direction).

Any input much appreciated. Cheers!

@kernelOfTruth
Contributor

@pzwahlen sorry, I should have updated the title of #3433 - done now

according to @avg-I it's not entirely correct and thus should be superseded by #3451 (and subsequent changes) , so you applied the correct patch

So the only change that applies to current zfs master is #3451 .

"5701 zpool list reports incorrect "alloc" value for cache devices" does not apply and will be relevant once #3115 has been merged.

Hope that clears things up

Thanks

@behlendorf
Contributor

Thanks guys for running this to ground. The fix in #3451 looks good to me, although I noticed that despite positive reviews for illumos and having originally been written for FreeBSD, neither platform has merged it yet. Does anybody happen to know what's holding things up? It would be nice to have an illumos issue number for this to make tracking it easier.

@kernelOfTruth
Contributor

@behlendorf Glad the L2ARC works correctly now (even for everyday tasks it makes a remarkable difference for me - so it would be a shame to have to disable it, especially for bigger workloads #3259 (comment) ).

From looking around the web, I got the impression that the issue might not have gotten wide enough recognition from the user side to prompt a search for solutions, and/or it hasn't been triggered often enough to be "relevant" (perhaps they're not running such stressful workloads?).

Why there's no activity on FreeBSD or Illumos - that I can't say.

@sempervictus
Contributor Author

Just to chime in, I've been using #3451 since it landed, successfully, on multiple hosts, including contentious SCST boxes feeding pairs of 40Gbit links on which a CloudStack deployment resides, all of which beat the hell out of the ARC.
@kernelOfTruth: you've been an immense help with the recent patch stacks for testing and hunting this down. Thank you very much.

@avg-I
Contributor

avg-I commented Jun 12, 2015

@behlendorf @kernelOfTruth regarding the illumos / FreeBSD (in-)activity: I made these changes for HybridCluster / ClusterHQ and we ran with them for many months while our product was based on FreeBSD. After submitting the changes to illumos and FreeBSD I switched to working on other things and didn't have time to follow through. Apparently no one else was as interested in the changes at that time.

Fortunately, now there are interested parties :) and I've got a little bit more time:
https://reviews.freebsd.org/D2764
https://reviews.freebsd.org/D2789

One thing I noticed is that there were recent significant changes to ARC code in illumos:
illumos/illumos-gate@89c86e3
illumos/illumos-gate@244781f <- this one is big
illumos/illumos-gate@2fd872a
illumos/illumos-gate@a52fc31

So I am looking forward to fun times of merging those changes with mine.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jun 12, 2015
The problem is that since OpenSolaris commit
illumos/illumos-gate@e14bb32
l2ad_hand is kept aligned based on ashift (which is derived from
the cache device's logical and physical block sizes).
So, the hand could be advanced by more than the sum of b_asize-s
of the written L2ARC buffers. This is because b_asize is a misnomer
at the moment as it does not always represent the allocated size:
if a buffer is compressed, then the compressed size is properly rounded,
but if the compression fails or it is not applied, then the original
size is kept and it could be smaller than what ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9. That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz. Otherwise we would overwrite active (unevicted)
L2ARC buffers. That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem may also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

PR:	198242
PR:	195746 (possibly related)

Porting notes:

Rather difficult to track changes related to:

Illumos 5369 - arc flags should be an enum
and
Illumos 5408 - managing ZFS cache devices requires lots of RAM

hdr->b_l2hdr = l2hdr;
changed to
hdr->b_flags |= ARC_FLAG_HAS_L2HDR;

list_insert_head(dev->l2ad_buflist, hdr);
changed to
list_insert_head(&dev->l2ad_buflist, hdr);

References:
https://reviews.freebsd.org/D2764
openzfs#3400
openzfs#3433
openzfs#3451

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
@behlendorf
Contributor

@avg-I thanks for following up with us! It would be great if you could refresh and finalize the patches to fix this issue, which clearly impacts everybody. We've just finished pulling in those significant ARC changes, so it should be fairly easy for us to apply the fix.

@avg-I
Contributor

avg-I commented Jun 16, 2015

@behlendorf #3491 is equivalent to what I have

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jun 24, 2015
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3400
Issue openzfs#3433
Issue openzfs#3451
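
To make the drift described above concrete, a small sketch with made-up sizes (illustrative only, not numbers from this issue): a buffer whose compression failed keeps an unaligned b_asize, but the device hand still moves in ashift-sized steps.

# ashift=12 cache device, one 2.5k buffer
ASHIFT_SZ=4096
B_ASIZE=2560                                                         # accounted (unaligned) buffer size
ALIGNED=$(( (B_ASIZE + ASHIFT_SZ - 1) / ASHIFT_SZ * ASHIFT_SZ ))     # what l2ad_hand actually advances by
echo "unaccounted advance per buffer: $(( ALIGNED - B_ASIZE )) bytes"

Summed over millions of buffers, that gap is how the hand could run past target_sz and over unevicted buffers, and why the cache device's reported allocation could wander away from its real capacity.
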
@kernelOfTruth
Contributor

@avg-I one of your patches is about to be merged ( #3521 )

Thanks again for your hard work =)

@avg-I
Contributor

avg-I commented Jun 24, 2015

@kernelOfTruth thank you too!

dweeezil pushed a commit to dweeezil/zfs that referenced this issue Aug 20, 2015
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3400
Closes openzfs#3433
Closes openzfs#3451

Cherry-pick ported by: Tim Chase <tim@chase2k.com>
Cherry-picked from ef56b07