zfs 0.6.5 and write performance problems #4512

Closed
crollorc opened this issue Apr 12, 2016 · 143 comments
Labels: Status: Inactive · Type: Performance

@crollorc

Hi,

First post here so apologies in advance for my mistakes!

We recently upgraded ZFS on our Ubuntu 12.04.5 SAN from 0.6.2-1precise to 0.6.5.4-1precise. This was done mistakenly as part of a crash recovery.

This system has been in-place and stable for several years.

Following the upgrade we have been experiencing higher disk utilisation, resulting in intermittent IOWAIT CPU issues for the server and its NFS clients.

We noticed several performance changes post upgrade. Note that the upgrade occurred at Week 13 on the graphs below.

  1. All clients of this SAN use NFS and there has been no change in the number of inbound NFS writes (dark brown line)

     [graph]

  2. zpool iostat write ops have doubled

     [graph]

  3. iostat writes have doubled for all pool disks (graph is for a sample disk)

     [graph]

  4. iostat write sizes (avgrq-sz) have halved for all pool disks

     [graph]

I would guess that these issues are related to this entry in the 0.6.5 changelog -

  • Improved zvol performance on all kernels (>50% higher throughput, >20% lower latency)

I've also seen write coalescing mentioned on GitHub and, looking at zpool iostat, I don't see much of it happening -
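
(For reference, the per-vdev breakdown below was presumably captured with something like the following; the 30-second interval is just an assumption.)

    zpool iostat -v san1stdpool 30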

san1stdpool                 48.6T  16.6T     63  1.36K   404K  9.00M
  raidz2                    8.04T  2.83T     14    306  88.6K  2.12M
    scsi-2001b4d2031085703      -      -      3    268  29.6K  1.55M
    scsi-2001b4d2061836505      -      -      4    265  38.4K  1.53M
    scsi-2001b4d2061930609      -      -      3    260  23.2K  1.52M
    scsi-2001b4d2061838504      -      -      3    266  15.2K  1.55M
  raidz2                    8.10T  2.78T      7    205   101K  1.40M
    scsi-2001b4d2004191406      -      -      2    154  40.8K  1.04M
    scsi-2001b4d2061656700      -      -      1    156  34.4K  1.05M
    scsi-2001b4d2061540302      -      -      2    156  17.6K  1.07M
    scsi-2001b4d2061648702      -      -      1    159  19.2K  1.08M
  raidz2                    8.11T  2.77T     11    169   118K  1.11M
    scsi-2001b4d2061055300      -      -      3    138  44.8K   865K
    scsi-2001b4d2061651708      -      -      3    136  42.4K   861K
    scsi-2001b4d2061645707      -      -      2    133  19.2K   861K
    scsi-2001b4d2061558207      -      -      2    137  23.2K   875K
  raidz2                    8.11T  2.76T     11    188  34.9K  1.07M
    scsi-2001b4d2061040403      -      -      1    154  7.20K   891K
    scsi-2001b4d2061556002      -      -      2    156  9.60K   892K
    scsi-2001b4d2061652709      -      -      4    151  17.6K   866K
    scsi-2001b4d2061634008      -      -      2    152  11.2K   873K
  raidz2                    8.12T  2.76T      8    236  28.6K  1.39M
    scsi-2001b4d2061541201      -      -      1    170  7.20K  1.12M
    scsi-2001b4d2061642603      -      -      2    169  12.0K  1.10M
    scsi-2001b4d2061650808      -      -      2    168  9.60K  1.09M
    scsi-2001b4d2061543300      -      -      2    167  9.60K  1.10M
  raidz2                    8.13T  2.75T     10    281  32.3K  1.89M
    scsi-2001b4d2061653704      -      -      2    241  12.8K  1.40M
    scsi-2001b4d2061643704      -      -      3    241  16.0K  1.40M
    scsi-2001b4d2061651706      -      -      2    243  8.80K  1.40M
    scsi-2001b4d2061555002      -      -      2    242  11.2K  1.41M

What would you guys suggest? Downgrade to 0.6.4?

Thanks for your help

@dweeezil (Contributor)

@crollorc There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats such as /proc/spl/kstat/zfs/dmu_tx and also the per-pool histograms in /proc/spl/kstat/zfs/<pool>/dmu_tx_assign. The new system may be absorbing writes much more quickly than previously. The zvol rework you mention lowers latency a lot and could cause excessive write throttling. It would depend on the amount of memory in the system and the speed at which your drives can handle the writes.
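
A quick way to see how fast those counters move is to snapshot the kstats and diff them after a minute; a minimal sketch (pool name taken from the zpool iostat output above):

    # snapshot the global and per-pool transaction kstats, wait, then diff
    # (the kstat header line always differs; look at the counter lines)
    cat /proc/spl/kstat/zfs/dmu_tx > /tmp/dmu_tx.before
    cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign > /tmp/dmu_tx_assign.before
    sleep 60
    diff /tmp/dmu_tx.before /proc/spl/kstat/zfs/dmu_tx
    diff /tmp/dmu_tx_assign.before /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign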

@crollorc (Author)

@dweeezil Thank you for the speedy response!

The new system may be absorbing writes much more quickly than previously.

Note that sync=disabled was set up on all pools a long time ago to get reasonable NFS performance:

    root@san1:~# zfs get sync san1stdpool
    NAME         PROPERTY  VALUE     SOURCE
    san1stdpool  sync      disabled  local

It would depend on the amount of memory in the system and the speed at which your drives can handle the writes.

We use an Areca ARC-1880 controller and a bunch of WD30EZRX disks. Here are some details on system memory, ZFS tunings and iostat for a sample disk

    root@san1:~# grep Mem /proc/meminfo
    MemTotal:       148500592 kB
    MemFree:        52012052 kB

    root@san1:~# cat /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=80530636800
    options zfs zfs_arc_meta_limit=21474836480
    options zfs zfs_prefetch_disable=1

    root@san1:~# iostat -dx 30 sdx
    Linux 3.2.0-57-generic (san1)   12/04/16        _x86_64_        (8 CPU)
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    3.79  172.44    82.42  1225.43    14.84     0.38    2.14   63.47    0.79   1.27  22.42

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    5.83  147.07    45.60  1149.20    15.63     0.22    1.46   32.50    0.23   1.28  19.52

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    5.40  168.37    53.87  1211.07    14.56     0.40    2.30   56.27    0.57   1.64  28.52

There have been a whole lot more fundamental changes since 0.6.2 than just the zvol rework you pointed out. Among other things that could impact a write-heavy load is the write throttle overhaul of e8b96c6. You should check out some of the new-ish kstats such as /proc/spl/kstat/zfs/dmu_tx and also the per-pool histograms in /proc/spl/kstat/zfs/<pool>/dmu_tx_assign.

Below are some details for the kstats mentioned.

dmu_tx_assigned is increasing fast and dmu_tx_group is increasing slowly:

    dmu_tx (60s apart):
        root@san1:~# cat /proc/spl/kstat/zfs/dmu_tx
        5 1 0x01 11 528 5451072961 1032291543673689
        name                            type data
        dmu_tx_assigned                 4    594152327
        dmu_tx_delay                    4    0
        dmu_tx_error                    4    0
        dmu_tx_suspended                4    0
        dmu_tx_group                    4    264018
        dmu_tx_memory_reserve           4    0
        dmu_tx_memory_reclaim           4    0
        dmu_tx_dirty_throttle           4    3169
        dmu_tx_dirty_delay              4    0
        dmu_tx_dirty_over_max           4    0
        dmu_tx_quota                    4    0

        root@san1:~# cat /proc/spl/kstat/zfs/dmu_tx
        5 1 0x01 11 528 5451072961 1032351549214250
        name                            type data
        dmu_tx_assigned                 4    594190604
        dmu_tx_delay                    4    0
        dmu_tx_error                    4    0
        dmu_tx_suspended                4    0
        dmu_tx_group                    4    264044
        dmu_tx_memory_reserve           4    0
        dmu_tx_memory_reclaim           4    0
        dmu_tx_dirty_throttle           4    3169
        dmu_tx_dirty_delay              4    0
        dmu_tx_dirty_over_max           4    0
        dmu_tx_quota                    4    0

    dmu_tx_assign (60s apart):
        root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
        19 1 0x01 32 1536 15812990669 1032205962575895
        name                            type data
        1 ns                            4    0
        2 ns                            4    0
        4 ns                            4    0
        8 ns                            4    0
        16 ns                           4    0
        32 ns                           4    0
        64 ns                           4    0
        128 ns                          4    0
        256 ns                          4    39
        512 ns                          4    181
        1024 ns                         4    123
        2048 ns                         4    229
        4096 ns                         4    309
        8192 ns                         4    611
        16384 ns                        4    4702
        32768 ns                        4    7820
        65536 ns                        4    9830
        131072 ns                       4    8734
        262144 ns                       4    8626
        524288 ns                       4    7764
        1048576 ns                      4    6640
        2097152 ns                      4    6854
        4194304 ns                      4    8979
        8388608 ns                      4    14816
        16777216 ns                     4    23768
        33554432 ns                     4    26059
        67108864 ns                     4    21855
        134217728 ns                    4    19184
        268435456 ns                    4    21004
        536870912 ns                    4    23511
        1073741824 ns                   4    21004
        2147483648 ns                   4    24505

        root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/dmu_tx_assign
        19 1 0x01 32 1536 15812990669 1032266427393963
        name                            type data
        1 ns                            4    0
        2 ns                            4    0
        4 ns                            4    0
        8 ns                            4    0
        16 ns                           4    0
        32 ns                           4    0
        64 ns                           4    0
        128 ns                          4    0
        256 ns                          4    39
        512 ns                          4    181
        1024 ns                         4    123
        2048 ns                         4    229
        4096 ns                         4    309
        8192 ns                         4    611
        16384 ns                        4    4702
        32768 ns                        4    7820
        65536 ns                        4    9831
        131072 ns                       4    8735
        262144 ns                       4    8626
        524288 ns                       4    7764
        1048576 ns                      4    6640
        2097152 ns                      4    6854
        4194304 ns                      4    8979
        8388608 ns                      4    14816
        16777216 ns                     4    23768
        33554432 ns                     4    26059
        67108864 ns                     4    21856
        134217728 ns                    4    19184
        268435456 ns                    4    21007
        536870912 ns                    4    23512
        1073741824 ns                   4    21004
        2147483648 ns                   4    24510

Thanks again for any pointers.

@dweeezil (Contributor)

@crollorc These aren't a good sign:

    536870912 ns                    4    23512
    1073741824 ns                   4    21004
    2147483648 ns                   4    24510

Please try echo 100 > /sys/module/zfs/parameters/zfs_txg_history and then monitor the tail of /proc/spl/kstat/zfs/san1stdpool/txgs. My guess is that there's going to be a lot of dirty data built up and you'll see txgs taking a long time to sync. With 148GiB of RAM, by default, it'll allow 14.8GiB of dirty data and will start delaying transactions once 60% of that's filled. If the dirty data is building up that much, it's quite a bit to be flushing to a set of 6 vdevs comprised of 5400 RPM drives. Your sync=disabled setting removes most of the pressure from the writes and is likely letting lots of dirty data build up.
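
For reference, a minimal sketch of enabling the txg history and checking the dirty-data thresholds described above (paths are the standard module-parameter/kstat locations; pool name taken from earlier output):

    # keep the last 100 txgs in the per-pool history
    echo 100 > /sys/module/zfs/parameters/zfs_txg_history

    # dirty-data ceiling (10% of RAM by default) and the point where delays start (60% of it)
    max=$(cat /sys/module/zfs/parameters/zfs_dirty_data_max)
    pct=$(cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent)
    echo "delays begin at $((max * pct / 100)) bytes of dirty data"

    # watch txgs accumulate
    watch -n 5 'tail -n 20 /proc/spl/kstat/zfs/san1stdpool/txgs'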

One other note: You had suspected changes to the zvol code but it's not clear that your system is using zvols at all, especially if it's just an NFS server.

@crollorc (Author)

@dweeezil Thanks again for your help!

One other note: You had suspected changes to the zvol code but it's not clear that your system is using zvols at all, especially if it's just an NFS server.

Yes, sorry for this red herring!

Please try echo 100 > /sys/module/zfs/parameters/zfs_txg_history and then monitor the tail of /proc/spl/kstat/zfs/san1stdpool/txgs. My guess is that there's going to be a lot of dirty data built up and you'll see txgs taking a long time to sync. With 148GiB of RAM, by default, it'll allow 14.8GiB of dirty data and will start delaying transactions once 60% of that's filled. If the dirty data is building up that much, it's quite a bit to be flushing to a set of 6 vdevs comprised of 5400 RPM drives. Your sync=disabled setting removes most of the pressure from the writes and is likely letting lots of dirty data build up.

Some txgs below. As far as I can tell, for this sample the avg sync time is 0.785s and max dirty data is 389MB. Is that right?

Does it look like we have excessive dirty data? If so, should we change some options to reduce this?

Also, the thing I find most confusing post upgrade is the halving of average write size and doubling of write IO. Could this indicate some sort of issue with the coalescing or merging of txg async writes?

root@san1:~# tail -100 /proc/spl/kstat/zfs/san1stdpool/txgs
18 0 0x01 100 11200 15812979317 1051533841667804
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
15119926 1051317597136163 C     67319808     1138176      30715392     22       4244     2288632025   41059        90431        539574908
15119927 1051319885768188 C     67148288     324096       29394944     11       4660     3023421149   42716        56583        421626699
15119928 1051322909189337 C     622592       286208       9233920      7        2013     133828       158336       421440564    418241927
15119929 1051322909323165 C     67190272     0            31543808     0        4339     3105828843   40835        34503        207592661
15119930 1051326015152008 C     67112960     246272       30817280     18       4012     3288812978   32739        23840        1613413641
15119931 1051329303964986 C     67280896     797696       32596992     8        4321     2928346818   27324        51089        1421791320
15119932 1051332232311804 C     67201024     723456       28093952     7        4179     2135289574   30965        67649        272378251
15119933 1051334367601378 C     185868800    553984       34114560     9        3945     272499109    23381        4422         2392586832
15119934 1051334640100487 C     388984320    1226752      68935680     16       6189     2392672328   25820        17348        3349258721
15119935 1051337032772815 C     69060608     654336       33100288     18       3841     3349315820   22396        5053         2735950416
15119936 1051340382088635 C     73426944     2148864      34294272     46       4346     2735995882   22438        58645        3479784551
15119937 1051343118084517 C     72868864     719360       34494464     16       4195     3479911953   35964        67916        4541587497
15119938 1051346597996470 C     72126464     273408       29708288     5        4051     4541745505   40595        71655        470704815
15119939 1051351139741975 C     67288064     14848        34340352     1        4389     2762774648   35648        59742        290774122
15119940 1051353902516623 C     67375616     0            29429248     0        3693     1752902477   32134        57074        213941754
15119941 1051355655419100 C     67144192     121344       30396416     2        4474     2401879065   33550        56413        227860347
15119942 1051358057298165 C     311296       181760       6515200      15       1388     117155       24713        227814025    3711561552
15119943 1051358057415320 C     77940736     211456       35246592     11       5057     3939418192   39508        25284        420930211
15119944 1051361996833512 C     67150848     499712       34809856     8        4503     4519004489   35635        59046        250941851
15119945 1051366515838001 C     245760       0            6135296      0        978      149579       25826        250866431    90757888
15119946 1051366515987580 C     161980928    3584         32300544     1        3054     341680448    23303        160649       328860642
15119947 1051366857668028 C     67266560     75264        28854272     5        3882     3171883004   33868        53873        1542083954
15119948 1051370029551032 C     67233792     119296       31291392     2        3844     4015941251   33066        61815        222378832
15119949 1051374045492283 C     67112448     73216        30317056     1        4108     2828123890   32982        51705        232453458
15119950 1051376873616173 C     294912       3072         4966912      1        959      94905        23727        232425852    219545549
15119951 1051376873711078 C     67220480     0            28463104     0        3923     3582735732   32428        56405        173446117
15119952 1051380456446810 C     67352064     0            32849408     0        4644     4557654099   33427        62170        260095880
15119954 1051386876962366 C     311296       7168         4756992      3        968      64058        104865       217125334    391759430
15119955 1051386877026424 C     67350016     512          29517312     1        3781     2162472992   37046        58895        243407587
15119956 1051389039499416 C     67367936     4096         26507264     2        3857     2859501850   32602        16814        213039912
15119957 1051391899001266 C     294912       3584         6612480      1        1160     56747        23807        213015136    282602466
15119958 1051391899058013 C     84591616     5120         23961088     2        2823     495659675    19990        9958         193463258
15119959 1051392394717688 C     67204608     0            32209408     0        4709     3241553670   29093        58300        196789366
15119960 1051395636271358 C     67344384     3584         21316608     1        3268     1424572876   29297        51815        218067837
15119961 1051397060844234 C     133722112    10240        27375104     3        3207     218169319    21447        14854        276689871
15119962 1051397279013553 C     425984       75264        4948992      2        885      43184        34551        276654583    839144752
15119963 1051397279056737 C     86631424     418304       28002816     10       3119     1115928940   34058        44815        3867624148
15119964 1051398394985677 C     67249152     116224       34108928     5        4120     3976622524   31413        56118        1811913426
15119965 1051402371608201 C     442368       271360       6130176      3        1145     116962       30316        1811861424   316975495
15119966 1051402371725163 C     67272704     7168         30453248     2        3696     4079832998   5366991      56363        177157625
15119967 1051406451558161 C     606208       6656         5591040      2        1041     5474039      28348        177085766    463744406
15119968 1051406457032200 C     67382784     11264        28375552     3        3694     4750365650   31639        58194        251079004
15119969 1051411207397850 C     67136000     174080       30353920     4        4368     2648867338   33360        19377        333135826
15119970 1051413856265188 C     67301376     0            32245248     0        4648     3323272140   33377        49368        260578413
15119971 1051417179537328 C     67398656     0            30830592     0        4187     3326830255   34880        82275        193184454
15119972 1051420506367583 C     704512       13824        5534208      2        1028     168081       1436206      191703067    287482413
15119973 1051420506535664 C     67311104     4096         28228608     2        3945     1893336652   33214        20482        254137463
15119974 1051422399872316 C     67278336     0            21765632     0        2773     348983041    27356        47972        145083331
15119975 1051422748855357 C     67352064     0            26700288     0        4364     4445354130   28759        90471        275635672
15119976 1051427194209487 C     67309568     4096         17486848     2        2648     287537177    19723        36940        285099234
15119977 1051427481746664 C     294912       3584         6289920      1        1418     78881        15976        285067202    372916241
15119978 1051427481825545 C     99746816     138240       32547328     3        3813     658033143    37851        112851       281154981
15119979 1051428139858688 C     67299840     0            35225088     0        4658     4791695098   29047        34142        238786288
15119980 1051432931553786 C     67450368     442880       32909312     9        4657     3382521809   29070        53666        455676317
15119981 1051436314075595 C     62214656     0            28602368     0        4430     4998970069   33190        48473        212310971
15119982 1051441313045664 C     67166208     7680         29568512     1        4322     4977762671   33126        56693        220063626
15119984 1051448657016982 C     67188224     39424        28069888     1        3891     3938588146   33204        63339        255697567
15119985 1051452595605128 C     206512128    262144       36888576     2        3808     255814316    20576        5260         433042218
15119986 1051452851419444 C     294912       0            5493760      0        941      30191        67419        432978226    245940283
15119987 1051452851449635 C     118355456    740864       30279680     25       3021     679079720    36735        46412        5243127770
15119988 1051453530529355 C     107720192    454144       44655616     10       5230     5243222808   22147        4742         2608530245
15119989 1051458773752163 C     193300480    567808       42410496     14       4978     2608579548   23583        4823         2329348098
15119990 1051461382331711 C     67341824     647168       27520000     11       3886     2861814130   84761564     51831        2984429468
15119991 1051464244145841 C     10947072     136704       10307584     6        1463     84872604     25224        2984351067   376682757
15119992 1051464329018445 C     67301888     131072       33234944     1        3847     4520813041   31963        54605        222658607
15119993 1051468849831486 C     67185152     0            30845952     0        4001     3531488284   41844        54481        199061853
15119994 1051472381319770 C     67261952     0            33922048     0        4376     3678401469   33453        55968        232224320
15119995 1051476059721239 C     67164672     131072       37501440     1        5200     4175202659   34296        60155        225753980
15119996 1051480234923898 C     67061760     3584         33344512     1        5753     3162438635   33377        60652        293532139
15119997 1051483397362533 C     67233280     0            27124224     0        3267     892441673    40501        17950        188961671
15119998 1051484289804206 C     72597504     7680         19038720     2        2022     189064281    25234        38076        499031632
15119999 1051484478868487 C     67157504     0            29156352     0        3885     4877431971   33126        58347        294803691
15120000 1051489356300458 C     67275776     0            17293312     0        2613     488698066    25123        15277        186339567
15120001 1051489844998524 C     67120640     189952       29140480     2        4070     2988750048   31981        54116        217461923
15120002 1051492833748572 C     67270144     30208        30515200     1        4008     2598587018   33083        24376        207672250
15120003 1051495432335590 C     67363328     0            23166976     0        3080     1092220744   59143        27075        197765246
15120004 1051496524556334 C     311296       77824        5860864      3        944      107004       115234       197634991    284243810
15120005 1051496524663338 C     67376128     34816        21529088     1        3238     1492765489   33958        61471        177605761
15120006 1051498017428827 C     67281920     88064        23701504     3        3359     1563872137   34359        56987        2221304322
15120007 1051499581300964 C     442368       446464       6493696      7        972      142416       121732       2221137710   579516387
15120008 1051499581443380 C     80051200     0            27474432     0        3577     2800846052   31181        36791        237096877
15120009 1051502382289432 C     67284480     0            24519168     0        3317     843436262    32395        21238        187852604
15120010 1051503225725694 C     606208       288256       5197312      9        973      70125        137521       187704935    632766615
15120011 1051503225795819 C     89024512     65536        26772992     7        3249     820634999    35718        45433        2604648562
15120012 1051504046430818 C     69491712     264192       26582016     6        3784     2604750202   20538        38766        322755582
15120014 1051509437880034 C     72758272     207872       21197824     7        2681     169953556    24450        4840         2003911666
15120015 1051509607833590 C     67322880     31744        33702912     7        4161     4465702675   32512        58383        1077354867
15120016 1051514073536265 C     81204224     0            25047040     0        3331     1077512643   31088        13559        159332753
15120017 1051515151048908 C     67190784     187904       14333440     2        2120     330590061    23868        65989        925606183
15120018 1051515481638969 C     294912       5632         5998080      1        899      110777       16520        925574862    298434801
15120019 1051515481749746 C     115534848    302080       32688640     6        3619     1224080377   27726        44641        1976941306
15120020 1051516705830123 C     67234816     262144       31841280     2        4065     4021937649   29437        65558        910526372
15120021 1051520727767772 C     67363328     0            33303552     0        3697     3745727914   29722        61160        191961445
15120022 1051524473495686 C     67283968     242176       31216640     4        4147     2288904899   40747        58234        653195841
15120023 1051526762400585 C     67494400     0            31962112     0        4467     3981323257   30209        34169        226151116
15120024 1051530743723842 C     67171840     5632         30683648     5        4057     2780591447   28499        32628        233295354
15120025 1051533524315289 O     0            0            0            0        0        0            0            0            0
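
(For anyone wanting to reproduce those figures, a rough awk sketch over the columns above; stime is in nanoseconds and ndirty in bytes, and it skips the header lines and the still-open txg.)

    awk 'NR > 2 && $3 == "C" { n++; st += $12; if ($4 > nd) nd = $4 }
         END { printf "avg stime = %.3f s, max ndirty = %.0f MB\n", st/n/1e9, nd/1e6 }' \
        /proc/spl/kstat/zfs/san1stdpool/txgs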

Thanks again.

@dweeezil (Contributor)

@crollorc I may have gotten ahead of myself here. So much has changed since 0.6.2 and I latched on to the new write throttle. There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats.

Are these changes:

options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480

left over from 0.6.2 or did you add them after the upgrade? If these are from 0.6.2, I'd suggest removing these and watching the arcstats.
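
A possible way to revert them, sketched below. Note that 0 means "use the built-in default" for these parameters, and whether a runtime write fully restores default behaviour varies by version, so removing the modprobe options and reloading the module (or rebooting) is the safer route:

    # current values
    cat /sys/module/zfs/parameters/zfs_arc_max
    cat /sys/module/zfs/parameters/zfs_arc_meta_limit

    # drop the overrides from /etc/modprobe.d/zfs.conf so they don't come back
    # on the next module load, then (optionally) try resetting at runtime:
    echo 0 > /sys/module/zfs/parameters/zfs_arc_max
    echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_limit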

@crollorc (Author)

@dweeezil

There have also been big ARC changes. In order to get a better picture of what's going on, we also need to see your arcstats.

I've added some arcstats and arc_summary below (hopefully with the right params).

Are these changes:
options zfs zfs_arc_max=80530636800
options zfs zfs_arc_meta_limit=21474836480
left over from 0.6.2 or did you add them after the upgrade? If these are from 0.6.2, I'd suggest removing these and watching the arcstats.

They are left over and were tuned to deal with some problems with a large ARC in 0.6.2. We were also dropping caches (echo 3 > /proc/sys/vm/drop_caches) every hour on 0.6.2 but stopped post upgrade.

We are very dependent on the high ARC hit rates we get (ARC: 93%, L2ARC: 68%). How would I revert these options to their defaults on the live system? What size will the ARC default to, and will this clear the ARC or will it resize dynamically?

Thanks again for your help.

root@san1:~# arcstat.py  2
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
22:19:34     1     0     33     0   33     0    0     0   33    75G   75G
22:19:36  4.5K   227      5   227    5     0    0   167   43    75G   75G
22:19:38  2.9K   199      6   199    6     0    0   163    7    74G   75G
22:19:40   749   132     17   132   17     0    0    95   29    74G   75G
22:19:42  4.4K   220      4   220    4     0    0   159   10    74G   75G
22:19:44  1.4K    60      4    60    4     0    0    31    3    74G   75G
22:19:46  1.6K   119      7   119    7     0    0    84    8    74G   75G
22:19:48  1.5K   128      8   128    8     0    0    71    7    75G   75G
22:19:50  2.3K   333     14   333   14     0    0   311   17    74G   75G
22:19:52  1.0K    81      8    81    8     0    0    44   32    75G   75G
22:19:54  4.5K   106      2   106    2     0    0    72    7    75G   75G
22:19:56  2.1K    31      1    31    1     0    0     7   35    75G   75G
22:19:58  3.0K    96      3    96    3     0    0    54    5    75G   75G
22:20:00  6.9K   353      5   353    5     0    0   304    5    75G   75G
22:20:02  1.5K   146      9   146    9     0    0    79   42    75G   75G
22:20:05   23K   190      0   190    0     0    0   138    0    75G   75G
22:20:07  5.3K   150      2   150    2     0    0   101    7    75G   75G
22:20:09  4.1K   325      8   325    8     0    0   254   17    75G   75G
22:20:11  3.8K   129      3   129    3     0    0    89    4    75G   75G
22:20:13  3.5K   127      3   127    3     0    0    78    6    75G   75G
22:20:15   968   118     12   118   12     0    0    78   40    75G   75G
22:20:17  1.4K   115      8   115    8     0    0    83    9    75G   75G
22:20:19   537    87     16    87   16     0    0    66   72    75G   75G

root@san1:~# arc_summary.py

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Apr 12 22:20:47 2016
ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                73.11m
        Mutex Misses:                           8.20k
        Evict Skips:                            8.20k

ARC Size:                               100.00% 75.00   GiB
        Target Size: (Adaptive)         100.00% 75.00   GiB
        Min Size (Hard Limit):          0.04%   32.00   MiB
        Max Size (High Water):          2400:1  75.00   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       93.75%  70.31   GiB
        Frequently Used Cache Size:     6.25%   4.69    GiB

ARC Hash Breakdown:
        Elements Max:                           9.19m
        Elements Current:               92.81%  8.53m
        Collisions:                             160.23m
        Chain Max:                              6
        Chains:                                 918.16k

ARC Total accesses:                                     3.54b
        Cache Hit Ratio:                93.96%  3.33b
        Cache Miss Ratio:               6.04%   213.86m
        Actual Hit Ratio:               93.80%  3.32b

        Data Demand Efficiency:         94.10%  1.08b
        Data Prefetch Efficiency:       30.63%  13.31m

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           8.86%   294.70m
          Most Frequently Used:         90.97%  3.03b
          Most Recently Used Ghost:     1.40%   46.42m
          Most Frequently Used Ghost:   3.08%   102.49m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  30.47%  1.01b
          Prefetch Data:                0.12%   4.07m
          Demand Metadata:              69.34%  2.31b
          Prefetch Metadata:            0.07%   2.25m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  29.69%  63.48m
          Prefetch Data:                4.32%   9.23m
          Demand Metadata:              64.88%  138.75m
          Prefetch Metadata:            1.12%   2.40m

L2 ARC Summary: (HEALTHY)
        Low Memory Aborts:                      288
        Free on Write:                          475.27k
        R/W Clashes:                            7.58k
        Bad Checksums:                          0
        IO Errors:                              0

L2 ARC Size: (Adaptive)                         664.73  GiB
        Compressed:                     42.35%  281.54  GiB
        Header Size:                    0.09%   626.96  MiB

L2 ARC Evicts:
        Lock Retries:                           60
        Upon Reading:                           1

L2 ARC Breakdown:                               213.86m
        Hit Ratio:                      68.97%  147.50m
        Miss Ratio:                     31.03%  66.36m
        Feeds:                                  1.06m

L2 ARC Writes:
        Writes Sent:                    100.00% 1.02m

File-Level Prefetch: (HEALTHY)
DMU Efficiency:                                 9.40m
        Hit Ratio:                      31.19%  2.93m
        Miss Ratio:                     68.81%  6.47m

        Colinear:                               6.47m
          Hit Ratio:                    0.01%   536
          Miss Ratio:                   99.99%  6.46m

        Stride:                                 2.72m
          Hit Ratio:                    99.99%  2.72m
          Miss Ratio:                   0.01%   349

DMU Misc:
        Reclaim:                                6.46m
          Successes:                    0.35%   22.78k
          Failures:                     99.65%  6.44m

        Streams:                                214.18k
          +Resets:                      0.09%   200
          -Resets:                      99.91%  213.98k
          Bogus:                                0


ZFS Tunable:
        zfs_arc_sys_free                                  0
        zfs_arc_lotsfree_percent                          10
        l2arc_norw                                        0
        l2arc_feed_again                                  1
        l2arc_nocompress                                  0
        l2arc_noprefetch                                  1
        l2arc_feed_min_ms                                 200
        l2arc_feed_secs                                   1
        l2arc_headroom_boost                              200
        l2arc_headroom                                    2
        l2arc_write_boost                                 8388608
        l2arc_write_max                                   8388608
        zfs_arc_num_sublists_per_state                    8
        zfs_arc_min_prefetch_lifespan                     0
        zfs_arc_average_blocksize                         8192
        zfs_disable_dup_eviction                          0
        zfs_arc_p_min_shift                               0
        zfs_arc_shrink_shift                              0
        zfs_arc_p_dampener_disable                        1
        zfs_arc_p_aggressive_disable                      1
        zfs_arc_grow_retry                                0
        zfs_arc_meta_strategy                             1
        zfs_arc_meta_adjust_restarts                      4096
        zfs_arc_meta_prune                                10000
        zfs_arc_meta_min                                  0
        zfs_arc_meta_limit                                21474836480
        zfs_arc_max                                       80530636800
        zfs_arc_min                                       0
        zfs_dbuf_state_index                              0
        zfs_dedup_prefetch                                0
        zfs_nopwrite_enabled                              1
        zfs_mdcomp_disable                                0
        zfs_send_corrupt_data                             0
        zfs_pd_bytes_max                                  52428800
        zfetch_array_rd_sz                                1048576
        zfetch_block_cap                                  256
        zfetch_min_sec_reap                               2
        zfetch_max_streams                                8
        zfs_prefetch_disable                              1
        zfs_max_recordsize                                1048576
        zfs_delay_scale                                   500000
        zfs_dirty_data_sync                               67108864
        zfs_dirty_data_max_max                            38016151552
        zfs_dirty_data_max                                15206460620
        zfs_delay_min_dirty_percent                       60
        zfs_dirty_data_max_max_percent                    25
        zfs_dirty_data_max_percent                        10
        zfs_free_max_blocks                               100000
        zfs_no_scrub_prefetch                             0
        zfs_no_scrub_io                                   0
        zfs_resilver_min_time_ms                          3000
        zfs_free_min_time_ms                              1000
        zfs_scan_min_time_ms                              1000
        zfs_scan_idle                                     50
        zfs_scrub_delay                                   4
        zfs_resilver_delay                                2
        zfs_top_maxinflight                               32
        zfs_zevent_console                                0
        zfs_zevent_cols                                   80
        zfs_zevent_len_max                                2048
        metaslab_bias_enabled                             1
        metaslab_lba_weighting_enabled                    1
        metaslab_fragmentation_factor_enabled             1
        zfs_metaslab_fragmentation_threshold              70
        zfs_mg_fragmentation_threshold                    85
        zfs_mg_noalloc_threshold                          0
        metaslab_preload_enabled                          1
        metaslab_debug_unload                             0
        metaslab_debug_load                               0
        metaslab_aliquot                                  524288
        zio_taskq_batch_pct                               75
        spa_load_verify_data                              1
        spa_load_verify_metadata                          1
        spa_load_verify_maxinflight                       10000
        zfs_autoimport_disable                            1
        spa_config_path                                   /etc/zfs/zpool.cache
        spa_slop_shift                                    5
        spa_asize_inflation                               24
        zfs_deadman_enabled                               1
        zfs_deadman_synctime_ms                           1000000
        zfs_free_leak_on_eio                              0
        zfs_recover                                       0
        zfs_flags                                         0
        zfs_txg_history                                   100
        zfs_read_history_hits                             0
        zfs_read_history                                  0
        zfs_txg_timeout                                   5
        metaslabs_per_vdev                                200
        zfs_vdev_cache_bshift                             16
        zfs_vdev_cache_size                               0
        zfs_vdev_cache_max                                16384
        zfs_vdev_scheduler                                noop
        zfs_vdev_mirror_switch_us                         10000
        zfs_vdev_sync_write_min_active                    10
        zfs_vdev_sync_write_max_active                    10
        zfs_vdev_sync_read_min_active                     10
        zfs_vdev_sync_read_max_active                     10
        zfs_vdev_scrub_min_active                         1
        zfs_vdev_scrub_max_active                         2
        zfs_vdev_async_write_min_active                   1
        zfs_vdev_async_write_max_active                   10
        zfs_vdev_async_read_min_active                    1
        zfs_vdev_async_read_max_active                    3
        zfs_vdev_async_write_active_min_dirty_percent     30
        zfs_vdev_async_write_active_max_dirty_percent     60
        zfs_vdev_max_active                               1000
        zfs_vdev_write_gap_limit                          4096
        zfs_vdev_read_gap_limit                           32768
        zfs_vdev_aggregation_limit                        131072
        zfs_expire_snapshot                               300
        zfs_admin_snapshot                                0
        zfs_dbgmsg_maxsize                                4194304
        zfs_dbgmsg_enable                                 0
        zfs_immediate_write_sz                            32768
        zfs_read_chunk_size                               1048576
        zil_slog_limit                                    1048576
        zfs_nocacheflush                                  0
        zil_replay_disable                                0
        zfs_sync_pass_rewrite                             2
        zfs_sync_pass_dont_compress                       5
        zfs_sync_pass_deferred_free                       2
        zio_requeue_io_start_cut_in_line                  1
        zio_delay_max                                     30000
        zvol_prefetch_bytes                               131072
        zvol_max_discard_blocks                           16384
        zvol_major                                        230
        zvol_inhibit_dev                                  0

@dweeezil (Contributor)

@crollorc Please cat /proc/spl/kstat/zfs/arcstats periodically when the problem is happening and post the output (ideally to a gist or paste or something).
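
A simple way to capture those snapshots for a gist, as a sketch:

    # append a timestamped arcstats snapshot every 60 seconds
    while sleep 60; do
        date
        cat /proc/spl/kstat/zfs/arcstats
    done >> /tmp/arcstats.log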

@crollorc (Author)

@dweeezil

Thanks again for all your pointers.

Please cat /proc/spl/kstat/zfs/arcstats periodically when the problem is happening and post the output (ideally to a gist or paste or something).

We haven't noticed any problems with the ARC or reads in general since the upgrade, so I'm unclear on when I should be looking at arcstats.

Because the ARC and L2ARC are so efficient, only a small proportion of disk IO is reads -

    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sdx               0.00     0.00    3.79  172.44    82.42  1225.43    14.84     0.38    2.14   63.47    0.79   1.27  22.42

However, writes have doubled post upgrade, which leaves us a lot less disk IOPS headroom. When IOPS hit 220 or so, IO queue depth and service time jump, followed by load on the SAN and its NFS clients linked to IOWAIT.

@dweeezil (Contributor)

@crollorc Both your iostat and txgs history do show an average write size of about 8K. I would have expected ec8501e, which is in 0.6.5.4, to help aggregate them better. The small writes, which were clearly shown in your original iostat graph, do appear to be the problem.

If there is a bigger underlying issue here, I'd expect there to be a lot more problem reports given how widespread NFS server use is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled (and maybe that's not as unusual as it ought to be).

@crollorc (Author)

@dweeezil

Both your iostat and txgs history do show an average write size of about 8K. I would have expected ec8501e, which is in 0.6.5.4, to help aggregate them better. The small writes, which were clearly shown in your original iostat graph, do appear to be the problem.
If there is a bigger underlying issue here, I'd expect there to be a lot more problem reports given how widespread NFS server use is. The only unusual thing about your setup is that you're not using an SSD log device but instead have set sync=disabled (and maybe that's not as unusual as it ought to be).

One thing I didn't mention is that we have not done a 'zpool upgrade' so that we can downgrade if necessary.

We do have compression=on, and I disabled it to see what effect that would have.

iostat avgrq-sz has doubled (back to where it was pre-upgrade) but w/s has not dropped. Also, there are no wrqm/s, which seems odd to me.

compression=on:
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1)   12/04/16        _x86_64_        (8 CPU)
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    5.83  147.07    45.60  1149.20    15.63     0.22    1.46   32.50    0.23   1.28  19.52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    5.40  168.37    53.87  1211.07    14.56     0.40    2.30   56.27    0.57   1.64  28.52

compression=off:    
root@san1:~# iostat -dx 30 sdx
Linux 3.2.0-57-generic (san1)   13/04/16        _x86_64_        (8 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    4.47  167.83    29.60  3029.73    35.51     0.98    5.95  132.15    2.59   3.10  53.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdx               0.00     0.00    3.47  104.90    24.53  1603.60    30.05     0.47    4.33   55.69    2.63   3.06  33.12

Any ideas on data I could provide, tests I could do, or steps I could take to help identify the root cause or mitigate the problem?

Thanks again.

@crollorc (Author)

@dweeezil

We tested sync=standard on 0.6.2 with an SSD based ZIL but the performance was poor and so we moved to sync=disabled.

We are planning to try sync=standard again to see if this eliminates the small writes.

Does this sound reasonable?
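
(For completeness, the switch itself is just the following, pool name from earlier:)

    zfs set sync=standard san1stdpool
    zfs get sync san1stdpool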

@crollorc (Author)

@dweeezil

We are planning to try sync=standard again to see if this eliminates the small writes.

We are now running with sync=standard, and request sizes to the slog SSDs and the vdevs are both now around 15 sectors, so it looks like we are hitting the same issue with both sync=disabled and sync=standard.

Any ideas!?

@dweeezil (Contributor)

@crollorc Unrelated to this issue, your earlier issue with sync=standard and an SSD log device may very well have been fixed by b39c22b. As to this current issue, I can't tell why you're not getting any write block aggregation but were in the past. The only thing I can think of at the moment is that the new write throttle is starting to sync the txgs before there's enough dirty data to actually aggregate the writes. The new write throttle is controlled much less by time and more by the amount of dirty data. Maybe try increasing zfs_dirty_data_sync from its default of 64MiB to something several times larger. The commit comment in e8b96c6 outlines all the new tunables (there are a lot of them) and the rationale behind the new dirty-data-based logic.
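
For example, a 4x bump could look like this (a sketch; the 67108864-byte default is visible in the tunables dump above):

    # raise zfs_dirty_data_sync from 64 MiB to 256 MiB
    echo $((256 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_sync
    cat /sys/module/zfs/parameters/zfs_dirty_data_sync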

@crollorc (Author) commented Apr 14, 2016

@dweeezil Thanks for the advice.

I hiked up zfs_dirty_data_sync by a factor of 5

echo "335544320">>/sys/module/zfs/parameters/zfs_dirty_data_sync

This has led to larger, slower txg syncs but unfortunately has had no effect on write sizes.

root@san1:~# cat /proc/spl/kstat/zfs/san1stdpool/txgs
18 0 0x01 100 11200 15812979317 1220138956414077
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
15165732 1219547306928970 C     103823872    8437248      108766208    687      6628     4998911226   21627        86364        5969940592
15165733 1219552305840196 C     133959680    16027136     133283840    890      8009     5970068750   19449        9661         9281926622
15165734 1219558275908946 C     419964416    10758144     418938880    948      10639    9281975753   38762642     12841        11173108648
15165735 1219567557884699 C     2506752      593408       7084544      46       1216     38783744     67981        11173038382  481888713
15165736 1219567596668443 C     335713280    14246912     329070592    911      11272    14922927079  33046        59473        10498697281
15165737 1219582519595522 C     208999424    19658240     209769984    1177     9328     10498829161  23350        11915        11889600512
15165738 1219593018424683 C     360969728    12538368     360671744    812      11064    11889653459  240645955    26036        12310457932
15165739 1219604908078142 C     3392512      1349632      11851264     142      2093     240679883    16497        12310439528  2259345943
15165740 1219605148758025 C     244354048    18569216     249533952    1169     11906    17309564928  29802        68381        12566762372
15165741 1219622458322953 C     614432768    8825344      612707328    687      13992    12566918310  18515761     24347        12789978166
15165742 1219635025241263 C     2703360      268288       9133056      46       1363     18546521     15738        12789962159  1117922924
15165743 1219635043787784 C     254418432    4844544      247903232    615      10461    17789000422  19663        28888        8831526786
15165744 1219652832788206 C     356394496    5272064      357357568    532      9933     8831597067   20160        20300        8794093168
15165745 1219661664385273 C     175232512    3508736      66828288     622      8274     8794205731   30647        27209        5537575188
15165746 1219670458591004 C     109496320    8479232      48517120     487      6868     5537646814   19917        4868         4994873362
15165747 1219675996237818 C     78661632     2893824      38667264     337      5141     4999150577   1095103      24052        2955321620
15165748 1219680995388395 C     335664128    10828800     71731200     695      7917     3664853246   31489        17776        5392371236
15165749 1219684660241641 C     425984       1851904      6299648      140      1209     66072        29508        5392331163   1340747972
15165750 1219684660307713 C     335826944    11036160     90896384     729      9605     10279364198  25490        16578        8948091217
15165751 1219694939671911 C     151248896    2181632      63707648     550      7995     8948185595   27002        35457        7073099465
15165752 1219703887857506 C     136689664    2355200      58725376     596      7312     7073210909   26658848     11622        5380774336
15165753 1219710961068415 C     174258688    2200576      58366464     410      6945     5407481756   28565        13310        4190204779
15165754 1219716368550171 C     106930688    5412352      50428928     493      6421     5000909906   13078253     22659        4051350662
15165755 1219721369460077 C     72669696     2042368      38909952     475      6270     5011741115   102642288    68493        3704350875
15165756 1219726381201192 C     77281792     1138688      35868672     267      5944     5099776317   30340        55163        2992220012
15165757 1219731480977509 C     86577664     1547776      42548736     441      6350     4999683313   34274        66523        2982665268
15165758 1219736480660822 C     125505024    2061312      51666944     434      7412     4999731645   4336226      66781        3676232961
15165760 1219745346040784 C     146661888    7630336      52675584     512      7413     6593591562   26484        43135        4919952556
15165761 1219751939632346 C     85427712     8373760      40056832     618      5884     4996010229   28315        61489        6691483521
15165762 1219756935642575 C     85024768     3826176      42681856     483      5982     6691619637   28521        43215        3149760693
15165763 1219763627262212 C     111463424    7518208      48117248     398      6533     4999845277   29544        46361        3226251398
15165764 1219768627107489 C     84978688     2159104      40262144     336      6124     4999694156   22488        56248        2707472379
15165765 1219773626801645 C     228312064    2410496      73106432     568      8755     4999814413   28472        67061        4537423339
15165766 1219778626616058 C     152351232    1446400      48518144     478      7144     4999704757   34587511     152924       3419871570
15165767 1219783626320815 C     85893632     1113088      42266112     346      6851     5031810548   29674        49372        2609414857
15165768 1219788658131363 C     80108544     3673088      40606208     531      7238     4999698306   29588        63670        5271838972
15165769 1219793657829669 C     107587072    3384320      49200640     778      7184     5271952524   18752        14482        6422220740
15165770 1219798929782193 C     283225088    1183232      75063808     337      8025     6422272179   25795865     26462        3306554213
15165771 1219805352054372 C     517490176    6026240      105960960    294      9675     3332407159   24361670     94654        7777553278
15165772 1219808684461531 C     328192       120832       8216576      38       1331     24462006     15774        7777537645   646071454
15165773 1219808708923537 C     335872512    2001408      96568832     196      10248    9625785224   3967403      58006        4411652001
15165774 1219818334708761 C     8011776      1228800      10922496     61       2146     4075025      25498        4411583229   1014418827
15165775 1219818338783786 C     122765312    285696       56177664     33       8769     9409308807   29039        17008        1427350402
15165776 1219827748092593 C     112653824    47104        52990464     19       7448     5003795561   29668        65926        823252720
15165777 1219832751888154 C     213636096    615936       68867072     22       7773     4999762114   29651        50609        582089623
15165778 1219837751650268 C     72818176     669184       40657920     90       7355     4999748160   29023        56489        2090515439
15165779 1219842751398428 C     69604864     148480       37343232     11       5940     4999820073   29721        75034        406247655
15165780 1219847751218501 C     92731392     64512        43635712     23       6909     4999625288   27797        56131        807718543
15165781 1219852750843789 C     136089600    16896        57552896     9        8189     4999839967   28295        66847        333888819
15165782 1219857750683756 C     119449088    397312       45880320     8        6320     5003233374   29865        36752        399538499
15165783 1219862753917130 C     290128896    865792       69095936     39       7460     4996205772   28943        49386        897644253
15165784 1219867750122902 C     272039424    941056       76449792     14       8472     4999853280   36926        58131        472168373
15165785 1219872749976182 C     57738752     230912       30526976     28       5454     4999638757   38953        15394        926580755
15165786 1219877749614939 C     112080384    2570752      49501696     62       7126     4999814566   30159        55156        1658388365
15165787 1219882749429505 C     118583808    927744       46692352     98       6621     5003692554   30183        26637        1538912251
15165788 1219887753122059 C     123766272    1391104      48965632     344      7353     4999812097   18289533     49957        3153770173
15165790 1219897768677826 C     115807232    5521408      48044544     599      7817     4999747181   27994        26217        5278535095
15165791 1219902768425007 C     107308032    3016192      43960320     421      6410     5278607879   30748        13269        3327519422
15165792 1219908047032886 C     88134144     2544640      44220928     428      6451     4996884894   24708        42158        3524859611
15165793 1219913043917780 C     101572096    1798656      47307264     402      6889     4999796341   29642        62216        3315015885
15165794 1219918043714121 C     201362432    9166848      64028160     856      8845     4999652480   21997593     51567        5958306468
15165795 1219923043366601 C     475703296    13393408     104780288    1027     11005    5980368461   53319        5879         10768368601
15165796 1219929023735062 C     411564032    5027328      102991360    670      10336    10768440854  18369        5008         9290714054
15165797 1219939792175916 C     216221184    6014464      81198592     520      9805     9290800250   26214        26780        9057522982
15165798 1219949082976166 C     269355008    2806784      96839680     480      9100     9057594477   19550        8993         7197828302
15165799 1219958140570643 C     120171520    1551872      52727808     352      6954     7197876001   20248892     185085       3336507211
15165800 1219965338446644 C     75444224     2197504      34736128     353      5539     5018575574   13689230     14698        3150206648
15165801 1219970357022218 C     72051712     1256960      35189248     346      5776     5011799878   29297        54952        2945432727
15165802 1219975368822096 C     113371648    1550336      51550720     328      6780     4999786866   29772        48439        2812015577
15165803 1219980368608962 C     148939264    1693184      50263040     441      6723     4999695466   22113        66972        3445798711
15165804 1219985368304428 C     369689600    1588736      80213504     408      7596     3445919602   22786        39240        4828515597
15165805 1219988814224030 C     146930176    2231296      60336128     503      7374     4997634368   25424        60583        5170268217
15165806 1219993811858398 C     97103360     1856000      44809216     446      6840     5170369617   20135        5213         5155910764
15165807 1219998982228015 C     66098688     3091456      36184064     371      5531     5155950845   19672        5059         3950018276
15165808 1220004138178860 C     101739520    1630208      47604736     361      6301     4996990395   17013433     51559        3971532466
15165809 1220009135169255 C     123042304    1344000      50500096     312      6664     5015688785   39308655     35128        4235817268
15165810 1220014150858040 C     257754112    1262080      64723456     335      7160     5035757232   37256        56746        3359800759
15165811 1220019186615272 C     78577664     1421824      40296448     356      6142     4999793735   18817233     42229        3218582558
15165812 1220024186409007 C     97896448     1326080      44792320     446      7251     5015675033   30864        70346        4524430587
15165813 1220029202084040 C     77125120     1417216      39683584     347      6158     4999808552   33373        48808        2920449939
15165814 1220034201892592 C     102322176    2422272      44349440     441      6314     4999800786   33444        62751        3078155648
15165815 1220039201693378 C     206567936    2083840      58301440     300      7239     4999646245   30567        53322        2940542504
15165816 1220044201339623 C     335696896    3860992      73400832     456      8006     4018577811   4150463      61669        5106145008
15165817 1220048219917434 C     1097728      546816       6807552      112      1469     4234043      29725        5106100328   797276855
15165818 1220048224151477 C     143298560    2547712      60027392     492      7608     10104533934  28355        52513        4927874857
15165820 1220063328383773 C     94451712     4605440      45061120     484      6269     4999770108   29912        47488        3966882626
15165821 1220068328153881 C     164284416    3402752      56995840     352      6653     4999796208   21953        86098        3702911755
15165822 1220073327950089 C     114455552    2962432      45576192     269      5641     4999694622   42065        16165        2549991152
15165823 1220078327644711 C     108544000    2788864      50128384     393      6284     5003727717   24171        59502        4126122936
15165824 1220083331372428 C     73833472     1990144      38971392     333      5788     4999829832   6680903      75739        2725572440
15165825 1220088331202260 C     90086400     2100736      43217920     319      6279     5003778113   28462        70563        2831876066
15165826 1220093334980373 C     121577984    4802560      56598016     857      8714     4999677310   28872        55741        7081100546
15165827 1220098334657683 C     339081728    3565568      86428160     499      9814     7081202852   47905        45680        8622300546
15165828 1220105415860535 C     330521088    7106048      98919936     903      10811    8622406877   22133        4582         12519017260
15165829 1220114038267412 C     204268032    58990592     79695360     1248     9552     12519064508  23334        4602         10409391997
15165830 1220126557331920 S     0            0            0            0        0        10409433593  19369        6400         0
15165831 1220136966765513 O     0            0            0            0        0        0            0            0            0

The only way we've found of increasing the average write size is to set compression=off.

@dweeezil
Copy link
Contributor

@crollorc I could not duplicate this problem with a quick test. I had no problem getting about 128KiB average write size as reported by iostat when copying large files to an NFS exported zfs filesystem. When rsyncing lots of small files, the average write size was all over the place, exactly as I'd expect. I'll try a few more tests with varying pool geometries but at the moment, I'm not able to reproduce the problem.
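
(For comparison, the kind of test described above can be reproduced with something along these lines; the export mount point and source data are assumptions, not the exact commands used:)

cp /tmp/large.iso /mnt/zfs-export/               # large sequential copy over NFS
rsync -a /usr/share/doc/ /mnt/zfs-export/docs/   # lots of small files
iostat -x 5                                      # watch avgrq-sz on the server while the copies run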

@crollorc
Copy link
Author

@dweeezil Thanks for trying.

Looking at the NFSv3 traffic, I found that most writes (70%) were 4K, despite having wsize=1048576 in all client mount options:

root@san1:~# tcpdump -r /tmp/dump1|grep -i 'write fh'|wc -l
5555
root@san1:~# tcpdump -r /tmp/dump1|grep -i 'write fh'|awk '{print $10}'|sort|uniq -c|sort -nr|head
   3856 4096
    476 8192
    318 12288
     92 16384
     90 18
     50 16
     44 20480
     30 91374
     24 512
     22 24576

I had a look at the clients and I think the root cause for the 4K writes may be that many of them are running Linux and Windows KVM VMs, which I would guess use a 4K block size.

However, I don't understand why iostat avgrq-sz halved and iostat w/s doubled post upgrade.

@dweeezil
Copy link
Contributor

@crollorc I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely. Could you try a simple dd bs=8k if=/dev/zero of=zeroes test both directly and also over NFS to an otherwise quiescent filesystem with compression disabled (even if only temporarily) to verify that aggregation is working at all.
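
For concreteness, a sketch of that test (mount points, pool name and counts are assumptions; adjust to your layout):

zfs set compression=off san1stdpool
dd if=/dev/zero of=/san1stdpool/zeroes bs=8k count=250000    # direct write on the server
dd if=/dev/zero of=/mnt/san1/zeroes bs=8k count=250000       # same test from an NFS client
iostat -x 5                                                  # watch avgrq-sz and wrqm/s while each dd runs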

@crollorc
Copy link
Author

crollorc commented Apr 16, 2016

@dweeezil

I think your problem boils down to no aggregation. During my simple tests, 8K streaming writes were aggregated nicely.

One thing I am unclear on: should I see merged writes in iostat for all aggregations, or do they happen inside ZFS? And where exactly should I look for evidence of aggregation?

We ran without compression on 0.6.2 and I noted that a large proportion of writes were merged in iostat. Last year we turned on compression and merged writes were significantly reduced but still present. So far on 0.6.5 we see no merged writes at all (prior to these tests), with compression either on or off. You can see this in the graph below for a sample pool member (unfortunately I don't have graphs illustrating merged writes with compression off on 0.6.2).

image
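
(The merged writes shown in these graphs correspond to the wrqm/s column of extended iostat; a quick way to watch them live, with sdb as an example device:)

iostat -x sdb 5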

Could you try a simple dd bs=8k if=/dev/zero of=zeroes test both directly and also over NFS to an otherwise quiescent filesystem with compression disabled (even if only temporarily) to verify that aggregation is working at all.

Compression=off is set at the moment, but I cannot do this on an entirely quiescent fs right now.

For the direct dd, I found that the request size jumped to 100-300 for the duration of the test, with intermittent drops (I guess between flushes). There were also merged writes during the test.

image

image

For the NFS dd, I found that the request size jumped to 100-150 for the duration of the test. There were also a few merged writes, but far fewer than for the direct dd.

image

image

@dweeezil
Copy link
Contributor

@crollorc

One thing I am unclear on: should I see merged writes in iostat for all aggregations, or do they happen inside ZFS? And where exactly should I look for evidence of aggregation?

I was mainly concerned about the aggregation performed within ZFS and was concentrating on the new write throttle as possibly being the cause of your regression, because of your write-heavy load. The new write throttle work also refactored some of ZFS's aggregation code.

As to the block layer's scheduler merging, since you're using whole disks, ZoL should be disabling the elevator but it looks like you may have got cfq enabled. I don't remember, offhand, whether this logic has changed over time in ZoL or whether there have been any bugs with it. I'd suggest you check with something like head /sys/block/*/queue/scheduler and if they're not noop, then change them all to noop and see what happens.
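
A minimal sketch of that check and change (the sdX names are examples; the echo is not persistent across reboots):

head /sys/block/sd*/queue/scheduler
echo noop > /sys/block/sdb/queue/scheduler    # repeat for any pool member not already on noop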

@crollorc
Copy link
Author

crollorc commented Apr 19, 2016

@dweeezil

I was mainly concerned about the aggregation performed within ZFS and was concentrating on the new write throttle as possibly being the cause of your regression, because of your write-heavy load. The new write throttle work also refactored some of ZFS's aggregation code.

OK, I guess this was demonstrated by the jump in iostat avgrq-sz (versus the dd bs) in the dd results.

As to the block layer's scheduler merging, since you're using whole disks, ZoL should be disabling the elevator but it looks like you may have got cfq enabled. I don't remember, offhand, whether this logic has changed over time in ZoL or whether there have been any bugs with it. I'd suggest you check with something like head /sys/block/*/queue/scheduler and if they're not noop, then change them all to noop and see what happens.

We are already using the noop scheduler.

I wonder if others using compression=off and the noop elevator also see no write merges, as this was definitely not the behavior we saw on 0.6.2?

In any case, thanks for all your help

@dweeezil
Copy link
Contributor

@crollorc Can you please clarify whether this problem occurs with 0.6.4? One interesting post-0.6.4 change was the task priority fix. Prior to 0.6.5, the zio write threads ran at a much lower priority (higher raw priority value under Linux) than they were supposed to. If you had a lot of nfsd threads pushing writes to them, it's possible that the ZFS aggregation code had fewer IOs to work with. If you don't see this issue with 0.6.4 but do see it with 0.6.5, it might be due to the task priority adjustment.
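
(A rough way to see the relative priorities in question, assuming the usual z_wr_iss/z_wr_int taskq thread names:)

ps -eLo pid,ni,pri,comm | egrep 'z_wr_iss|z_wr_int|nfsd'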

@erikjanhofstede
Copy link

erikjanhofstede commented Jun 6, 2016

Hi,

I'm seeing the same behaviour after updating from 0.6.4 to 0.6.5. Version 0.6.5 is much heavier on the writes being flushed to disks. My observations:

  • Dirty data settings seem to be okay: no throttle is being hit (see the quick check sketched after this list).
  • Before the update there were quite a few merged IOs; after the update there are none.
  • Disabling/enabling compression does not change the situation.
  • There are ~4 times more writes being submitted to disk.
  • There is ~2 times more disk utilization (caused by writes).
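
A quick way to check the first observation (an assumption about method, not necessarily how it was measured here; the dirty/delay throttle counters should stay near zero):

cat /proc/spl/kstat/zfs/dmu_tx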

My pool:

  pool: userdata
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub canceled on Thu May 12 09:19:13 2016
config:

    NAME                                             STATE     READ WRITE CKSUM
    userdata                                         ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        scsi-35000c50058ab00e7                       ONLINE       0     0     0
        scsi-35000c50058ab3d43                       ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        scsi-35000c50058ab1e6f                       ONLINE       0     0     0
        scsi-35000c50058ab42ef                       ONLINE       0     0     0
      mirror-2                                       ONLINE       0     0     0
        scsi-35000c50058ab7a8b                       ONLINE       0     0     0
        scsi-35000c50058ab2da3                       ONLINE       0     0     0
    logs
      mirror-3                                       ONLINE       0     0     0
        scsi-SATA_INTEL_SSDSC2BA1BTTV437401F9100FGN  ONLINE       0     0     0
        scsi-SATA_INTEL_SSDSC2BA1BTTV43730834100FGN  ONLINE       0     0     0
    cache
      scsi-35e83a9710005e87d-part1                   ONLINE       0     0     0
    spares
      scsi-35000c500625ff913                         AVAIL

Txgs:

txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime
10793211 312084106776788  C     68587008     8612352      20948480     756      4330     10000167442  4250         5997         2692029204
10793212 312094106944230  C     62390272     7127552      23287808     1401     4039     10004438963  7385         9050         3514455600
10793213 312104111383193  C     80222720     10573824     31437312     1148     5861     10000334677  4690         6464         2527310364
10793214 312114111717870  C     61217792     6523392      25288704     1347     5121     10000254737  9817         11140        3950554512
10793215 312124111972607  C     64391168     9314304      40939008     1033     4203     10000733954  6719         12856        4010043119
10793216 312134112706561  C     59781632     24415744     24387072     1285     3270     9999964766   6732         8780         3851472777
10793217 312144112671327  C     44187648     4140544      14866432     786      2452     9999383400   4533         6290         3181786317
10793218 312154112054727  C     60757504     8823296      22792704     1392     3056     10000743534  4368         9225         3982234157
10793219 312164112798261  C     42994176     3391488      15173120     609      2472     10000054193  4357         8645         2739292926
10793220 312174112852454  C     37067776     5884928      14753280     927      2202     9999231541   4567         9956         3228039487
10793221 312184112083995  C     56749568     5754368      20854272     1166     3333     10000896011  4931         7163         3946754118
10793222 312194112980006  C     56281088     8103936      21481984     857      4062     10000885793  4550         8623         3045669876
10793223 312204113865799  C     71583232     4467712      28741632     694      4329     10000099090  7305         9762         3024174973
10793224 312214113964889  C     70732800     6185984      30164992     861      3869     9999279064   4789         7715         3627366045
10793225 312224113243953  C     61769728     17265664     22879232     1494     4572     10000041677  6715         12030        3915833094
10793226 312234113285630  C     70991360     44334080     39017472     1627     5993     10000022878  4590         8579         4292663828
10793227 312244113308508  C     55891456     34114560     20900864     1136     3888     10000849053  6443         7677         3264109345
10793229 312264113423647  C     47859200     30145536     19790848     1883     2812     10000797960  4874         6283         4633259277
10793230 312274114221607  C     182593024    19094528     72361984     1641     10974    9999955309   7216         11589        4878448168
10793231 312284114176916  C     105703424    4952576      33337344     676      5762     9999255379   4655         10300        2943566267
10793232 312294113432295  C     56116224     8543744      20745216     1125     3288     10000125459  4935         7960         3491906603
10793233 312304113557754  C     66479616     6044160      26195456     1108     4850     10001592357  6613         9548         3806978876
10793234 312314115150111  C     54763008     4553216      26543104     582      3274     10000157898  5583         11509        3701671774
10793235 312324115308009  C     42426880     14307840     15518208     1194     2520     9999282329   4727         11780        3694537855
10793236 312334114590338  C     57604608     14051328     21906944     1464     3866     10000930645  5122         9420         4263669008
10793237 312344115520983  C     58293248     5001728      22450688     924      3346     10000983924  11828        8393         3497986043
10793238 312354116504907  C     46852096     7068672      17994240     1316     2770     10001032639  8727         8535         3058301827
10793239 312364117537546  C     42475008     9290240      16683008     901      3047     10000060973  5130         6193         3822290645
10793240 312374117598519  C     51773440     8830464      19445760     1164     3034     10000071321  4729         8658         3780294235

screen shot 2016-06-06 at 11 15 23

screen shot 2016-06-06 at 11 14 43

screen shot 2016-06-06 at 11 06 15

@crollorc
Copy link
Author

crollorc commented Jun 7, 2016

@erikjanhofstede I never got a chance to test 0.6.4, unfortunately.

Did you track iostat avgrq-sz? I found this halved for me with compression=on between 0.6.3 and 0.6.5.

@erikjanhofstede
Copy link

@crollorc I only tested between 0.6.4 and 0.6.5. I don't have any stats from 0.6.3. I tried with and without compression on 0.6.5, but didn't notice any difference by the way.

I'm trying to reproduce the issue on a testing server, but that pool is almost empty and there's no workload either.

@dweeezil If this issue is caused by the new priorities, is it possible that disabling spl_taskq_thread_priority could give us any insight into this?
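
(For anyone wanting to try that, one way to set it, with the caveat that the modules may need to be reloaded for it to take effect:)

echo "options spl spl_taskq_thread_priority=0" >> /etc/modprobe.d/spl.conf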

@erikjanhofstede
Copy link

@crollorc I don't have any graphs from when I updated from 0.6.3 to 0.6.4 many months ago, but there was no performance regression for me as far as I can remember. Only improvements.

@behlendorf
Copy link
Contributor

@erikjanhofstede in the latest master code zpool iostat has been extended to include latency (-w) and request size (-r) histograms. They provide an excellent view into how well (or poorly) the ZFS I/O pipeline is merging requests before they are submitted to the block device.

If you have a test system available and we suspect the thread priority changes, then I'd suggest running your test case using the master source and collecting the request size histogram. You can then revert the thread priority change, 1229323, and run the same test to see if there's a difference. I'd be very interested in your results.
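
A sketch of the suggested comparison, assuming a build from source with SPL already installed (configure options omitted):

git clone https://github.com/zfsonlinux/zfs.git && cd zfs
./autogen.sh && ./configure && make -s -j$(nproc) && make install
zpool iostat -r userdata 60 2    # collect the request size histogram during the test
git revert 1229323               # then revert the thread priority change, rebuild and repeat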

@erikjanhofstede
Copy link

@behlendorf I've reverted 1229323 (had to fix some conflicts though). My test consists of running the commands below in parallel; otherwise the server is idle, with an almost empty zpool, unlike the server I posted graphs about before.

while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_1 of=CentOS-7-x86_64-Everything-1511.iso_2 bs=4k; done
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_3 of=CentOS-7-x86_64-Everything-1511.iso_4 bs=4k; done
while true; do dd if=CentOS-7-x86_64-Everything-1511.iso_5 of=CentOS-7-x86_64-Everything-1511.iso_6 bs=4k; done

The results running on master, polled over a period of 60 seconds (zpool iostat -r 60 2):

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             9      0      0      0      0      0     25      0      0      0
1K              1      0      6      0      0      0     15      1      0      0
2K              7      0      0      0      0      0     22      2      0      0
4K              1      0      0      0      4      0     24      4      0      0
8K              0      0      0      0      0      0     10      7      0      0
16K             0      0      0      0      0      0      5      4      0      0
32K             0      0      0      0      3      0     11      3      0      0
64K             0      0      0      0     16      0    106     18      0      0
128K            0      0      0      0    321      0  2.13K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            27      0      0      0      0      0     21      0      0      0
1K              1      0      5      0      0      0     15      0      0      0
2K              9      0      0      0      0      0     19      3      0      0
4K              5      0      0      0      4      0     23      3      0      0
8K              0      0      0      0      0      0     10      7      0      0
16K             0      0      0      0      0      0      3      3      0      0
32K             0      0      0      0      3      0     13      3      0      0
64K             0      0      0      0     21      0    122     17      0      0
128K            0      0      0      0    309      0  1.97K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

The results running on master, without 1229323, polled over a period of 60 seconds (zpool iostat -r 60 2):

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            27      0      0      0      0      0     35      0      0      0
1K              5      0      8      0      0      0     27      2      0      0
2K              8      0      0      0      0      0     25      3      0      0
4K              1      0      0      0      6      0     25      4      0      0
8K              1      0      0      0      0      0     13      9      0      0
16K             0      0      0      0      0      0      7      4      0      0
32K             0      0      0      0      2      0      8      4      0      0
64K             0      0      0      0     12      1     98     20      0      0
128K            2      0      0      0    235      0  2.32K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

userdata      sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512            77      0      0      0      0      0     41      0      0      0
1K              3      0      5      0      1      0     30      1      0      0
2K             11      0      0      0      1      0     22      4      0      0
4K              7      0      0      0      4      0     30      6      0      0
8K              1      0      0      0      0      0     11      8      0      0
16K             0      0      0      0      0      0     11      3      0      0
32K             0      0      0      0      3      0     17      4      0      0
64K             0      0      0      0     14      0    160     21      0      0
128K            2      0      0      0    192      0  2.13K      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------

@behlendorf
Copy link
Contributor

@erikjanhofstede thanks for running the test. Based on these results it doesn't look like 1229323 has any real effect.

@erikjanhofstede
Copy link

@behlendorf I'm unable to reproduce the issue I'm having on my testing server, but it's very clearly visible on production servers. Can you think of a safe way to try some things out on a production server? In the meantime I'll continue to try to reproduce the issue.

@erikjanhofstede
Copy link

Over here it's the same as @markr123: CentOS 6, no zvols, and any upgrade from 0.6.4.2 to 0.6.5 results in a dramatic write performance degradation. You can see my findings in this thread, but if you need any more info/testing I'll be happy to assist you with this.

@markr123
Copy link

Hi. This problem/ticket has now been open for over a year, and the performance issue is stopping most, if not all, people from upgrading past 0.6.4.2.

Is this problem being looked at or worked on? It seems to have stalled. Is there anything I can send you?

A new version of ZOL may come out, but if this problem is not fixed there is no point, as no one affected can upgrade to it.

Thanks for all your hard work on ZOL. It's the best filesystem out there for Linux.

@sempervictus
Copy link
Contributor

sempervictus commented Apr 28, 2017 via email

@kernelOfTruth
Copy link
Contributor

kernelOfTruth commented Apr 28, 2017

@sempervictus agreed, ZFS needs a kind of "Phoronix Test Suite" for filesystems that regularly tracks performance (and latency?) regressions and improvements.

The buildbots & tests check integrity and functionality, but I guess asking for regular performance- or latency-related tests (for each commit) would be too much and not entirely suitable?

Dedicated runs on small systems (Raspberry Pi, Pi2) or server farms (Lustre, etc.) would be much better suited to catching corner cases, scaling issues and other such cases.

@behlendorf
Copy link
Contributor

@sempervictus agreed. We need additional automated testing which is run regularly in order to prevent performance regressions.

ZFS needs a kind of "Phoronix Test Suite" for Filesystems

I think that would be an entirely reasonable place to start. If we can put together a meaningful battery of ZFS performance tests for the PTS which takes less than 6 hours to run, they could be added to the buildbot and run regularly. Having a predefined suite of tests anyone can easily run to benchmark their hardware would be convenient. @kernelOfTruth, if you're already familiar with the PTS, would you be interested in putting something together?
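
Not the proposed PTS battery itself, just a sketch of the kind of write-heavy case from this issue that such a suite could include (the path and sizes are arbitrary):

fio --name=small-writes --directory=/tank/fio --rw=randwrite --bs=8k \
    --size=4G --numjobs=4 --ioengine=libaio --runtime=300 --time_based \
    --group_reporting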

@markr123
Copy link

Hi, is there any progress with this? Sorry to chase, but I would really like to upgrade to the newer versions of ZFS and am stuck on 0.6.4.2 until the performance issues are fixed.
Thanks.

@bunder2015
Copy link
Contributor

You might be interested in giving 0.7.0-rc4 a try; it has various performance enhancements. There are also various ZVOL/ZIL improvements currently in the works c/o @ryao. I'm not sure if any of them are relevant to your use case. Feel free to look through the list of current pull requests and give some of those a try as well.

@markr123
Copy link

Thanks for coming back so quickly on a Sunday.

My main problem has been lots of rsync writes coming in to ZFS at once. 0.6.4.2 handled these really well, but from 0.6.5 onwards CPU usage went high and the rsyncs and all other filesystem writes blocked. Samba writes also stalled.

@behlendorf behlendorf removed this from the 0.7.0 milestone Jul 25, 2017
@erikjanhofstede
Copy link

Has anybody who's experiencing this issue already tried the 0.7.x release?

@snajpa
Copy link
Contributor

snajpa commented Aug 18, 2017

I do, with HEAD at 3769948 and a RHEL6 kernel.

12 1 0x01 92 25024 8325332541 4337538365561
name                            type data
hits                            4    11106114
misses                          4    5791566
demand_data_hits                4    5166935
demand_data_misses              4    1052871
demand_metadata_hits            4    1997434
demand_metadata_misses          4    4092990
prefetch_data_hits              4    2495568
prefetch_data_misses            4    348742
prefetch_metadata_hits          4    1446177
prefetch_metadata_misses        4    296963
mru_hits                        4    1703164
mru_ghost_hits                  4    0
mfu_hits                        4    7881109
mfu_ghost_hits                  4    0
deleted                         4    169
mutex_miss                      4    0
evict_skip                      4    47
evict_not_enough                4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    2530816
evict_l2_ineligible             4    0
evict_l2_skip                   4    0
hash_elements                   4    2100829
hash_elements_max               4    2101303
hash_collisions                 4    156175
hash_chains                     4    63727
hash_chain_max                  4    4
p                               4    67573442560
c                               4    135146885120
c_min                           4    33554432
c_max                           4    135146885120
size                            4    87287478000
compressed_size                 4    77323989504
uncompressed_size               4    123392078848
overhead_size                   4    5172511232
hdr_size                        4    713580208
data_size                       4    75959115264
metadata_size                   4    6537385472
dbuf_size                       4    902585728
dnode_size                      4    2319490368
bonus_size                      4    855320960
anon_size                       4    473503232
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    41532464128
mru_evictable_data              4    36605826560
mru_evictable_metadata          4    419079168
mru_ghost_size                  4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_size                        4    40490533376
mfu_evictable_data              4    38751274496
mfu_evictable_metadata          4    407086080
mfu_ghost_size                  4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_hits                         4    0
l2_misses                       4    5791532
l2_feeds                        4    8026
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    80685456896
l2_writes_sent                  4    7791
l2_writes_done                  4    7791
l2_writes_error                 4    0
l2_writes_lock_retry            4    42
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    6895
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    103058641920
l2_asize                        4    66307313152
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    264192
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    11328362736
arc_meta_limit                  4    101360163840
arc_dnode_limit                 4    10136016384
arc_meta_max                    4    11353754696
arc_meta_min                    4    16777216
sync_wait_for_async             4    22570
demand_hit_predictive_prefetch  4    234743
arc_need_free                   4    0
arc_sys_free                    4    4223340160
23 0 0x01 86 9632 12311035077 4362579253819
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
13884744 4164236604749    C     36271104     123392000    61915136     1123     1778     2110737401   6910         1941900771   2215539012  
13884745 4166347342150    C     35784192     118595584    60448768     1094     1778     1941919728   6631         2215550371   1876709958  
13884746 4168289261878    C     63946752     109617152    75390976     1010     2069     2215566746   5264         1876714240   2504001285  
13884747 4170504828624    C     45242880     133758976    43941888     1206     1775     1876725141   4365         2504018775   2415840671  
13884748 4172381553765    C     26675712     131870720    32587776     1212     1640     2504031394   6207         2415853093   2484338338  
13884749 4174885585159    C     23851520     107413504    28540928     1009     1945     2415870320   5403         2484350763   2328118255  
13884750 4177301455479    C     49197568     120823808    47751168     1160     1851     2484366136   6345         2328119881   2555925449  
13884751 4179785821615    C     34044416     120647680    35971072     1030     1630     2328134325   6501         2555932784   1866568996  
13884752 4182113955940    C     36459008     139554816    37232640     1200     1915     2556038602   5064         1866484048   2234865186  
13884753 4184669994542    C     29173760     125607936    32129024     1070     1632     1866505692   6882         2234860797   1936573602  
13884754 4186536500234    C     29129728     116928512    39550976     1013     1655     2234879641   5388         1936584095   2055491295  
13884755 4188771379875    C     72613888     157794304    56950784     1347     2109     1936598521   6485         2055498967   2703213206  
13884756 4190707978396    C     57694720     203923456    59187200     1702     1929     2055557736   6087         2703183238   2910482353  
13884757 4192763536132    C     41032704     152723456    52813824     1295     1676     2703199501   6481         2910483823   2012891445  
13884758 4195466735633    C     106031616    132505600    85508096     1130     2197     2910517873   5054         2012876880   2258632437  
13884759 4198377253506    C     29989376     142020608    31498240     1217     1490     2012887710   5152         2258648842   2024240489  
13884760 4200390141216    C     39697920     111620096    40534016     1055     1967     2258683609   5520         2024231689   2771066342  
13884761 4202648824825    C     100624384    105779200    77398016     1039     2486     2024245121   6332         2771080493   2276465887  
13884762 4204673069946    C     52517376     118792192    56631296     1027     2295     2771093324   4882         2276485176   1815269833  
13884763 4207444163270    C     89269760     179376128    80846848     1474     2303     2276499597   6223         1815274214   2528415135  
13884764 4209720662867    C     86455808     153219072    71598080     1296     2304     1815287497   5182         2528422794   2096165541  
13884765 4211535950364    C     50241536     112902144    47398912     1055     2062     2528434602   5115         2096181106   2575307696  
13884766 4214064384966    C     48568832     127492096    53805056     1142     1758     2096197167   5995         2575320771   2110255466  
13884767 4216160582133    C     41707520     150327296    43040768     1442     1704     2575335623   4913         2110260731   2743316802  
13884768 4218735917756    C     54598144     115286016    51937280     1186     1652     2110275185   6364         2743328586   2166999266  
13884769 4220846192941    C     38547968     82055168     41779200     894      1915     2743356757   5090         2166992049   2345322866  
13884770 4223589549698    C     29343744     134176768    35012608     1233     1700     2167005958   5656         2345328367   1993619416  
13884772 4228101897894    C     36727808     140693504    37928960     1218     1483     1993648727   8629         2162407073   1855508114  
13884773 4230095546621    C     23354368     136331264    31248384     1533     1467     2162423513   5743         1855514720   2420886278  
13884774 4232257970134    C     72283136     127410176    54194176     1500     1873     1855528761   4571         2420896618   2809874073  
13884775 4234113498895    C     50669056     122269696    69353472     1092     2079     2420909258   7177053      2802708436   2118281111  
13884776 4236534408153    C     35153920     142364672    61358080     1222     1777     2809893566   6081         2118295890   2007734999  
13884777 4239344301719    C     47450112     140697600    63119360     1246     1830     2118320919   14139        2007720866   2196866389  
13884778 4241462622638    C     56240640     143757312    80420864     1344     1900     2007748896   7129         2196865716   2541512318  
13884779 4243470371534    C     28441600     132440064    42409984     1261     1666     2196880689   6153         2541526140   2072776457  
13884780 4245667252223    C     53864448     125136896    111034368    1128     2333     2541543644   6552         2072800582   2149140917  
13884781 4248208795867    C     83119616     141053952    107102208    1294     2219     2072813715   4112         2149148544   2564207783  
13884782 4250281609582    C     84401664     108220416    88825856     1065     2360     2149158675   5242         2564215611   2516223608  
13884783 4252430768257    C     53362176     136302592    51167232     1225     1963     2564242028   4741         2516217376   2108411013  
13884784 4254995010285    C     23246848     77152256     39084032     899      1731     2516233278   6925         2108422419   2390715576  
13884785 4257511243563    C     100516864    98656256     77406208     1013     2314     2108443540   7946         2390712547   2697941751  
13884786 4259619687103    C     59678208     127086592    74522624     1479     2087     2390728524   5162         2697948082   2518428769  
13884787 4262010415627    C     51313152     133058560    56836096     1160     1850     2697961085   5785         2518438066   1874581249  
13884788 4264708376712    C     38499840     130056192    40427520     1175     2138     2518456463   4040859      1870546505   2491954978  
13884789 4267226833175    C     58814976     90529792     117776384    864      3098     1874607776   4026         2491956053   2475562255  
13884790 4269101440951    C     86475776     139202560    174727168    1246     4186     2491969522   4868         2475578556   2327543107  
13884791 4271593410473    C     108779520    129941504    191479808    1104     2934     2475594576   5917         2327543433   2065292104  
13884792 4274069005049    C     136452608    145252352    256712704    1456     4042     2327557865   6033         2065305318   2788907047  
13884793 4276396562914    C     77019136     140771328    155049984    1189     2780     2065320358   5701         2788918012   2412242775  
13884794 4278461883272    C     143178752    76984320     251576320    852      4020     2788933634   5398         2412245428   2815951378  
13884795 4281250816906    C     69138944     145485824    91815936     1220     2216     2412258432   4946         2815966106   2335489136  
13884796 4283663075338    C     114761216    141082624    158588928    1228     3151     2815978557   4281         2335508011   2782529898  
13884797 4286479053895    C     35223040     119554048    37945344     1068     1612     2335521727   4598         2782534251   1982206309  
13884798 4288814575622    C     64258560     143339520    108339200    1204     2616     2782545586   3910         1982223964   2600873631  
13884799 4291597121208    C     24773632     207495168    36782080     1733     1582     1982237772   5099         2600891868   1916979266  
13884800 4293579358980    C     48922112     137027584    75210752     1200     1574     2600906689   5127         1916990316   2014285410  
13884802 4298097293340    C     33272320     150700032    102039552    1253     2404     2014286990   5110         2156809695   2035838820  
13884803 4300111580330    C     81604608     138477568    126861312    1192     2630     2156824003   5153         2035852326   2064413620  
13884804 4302268404333    C     110749696    125366272    170745856    1088     2985     2035864271   4277         2064431564   2524985032  
13884805 4304304268604    C     60029952     151941120    90644480     1284     2082     2064447577   6658         2524985414   2341167947  
13884806 4306368716181    C     65350656     158601216    110608384    1384     2593     2524997687   4230         2341175864   2778715774  
13884807 4308893713868    C     41444864     140292096    89128960     1219     2097     2341186801   4016         2778725366   2245347679  
13884808 4311234900669    C     54941696     148480000    108142592    1286     2384     2778736522   5175         2245354933   2465445800  
13884809 4314013637191    C     71305216     141561856    114016256    1190     2394     2245371433   35507        2465417436   2043746134  
13884810 4316259008624    C     76064256     147419136    115892224    1243     2522     2465460281   5492         2043752965   2362551344  
13884811 4318724468905    C     95911936     131731456    146939904    1220     2857     2043767537   4649         2362556913   2703281041  
13884812 4320768236442    C     46643712     154566656    106479616    1346     2253     2362580955   5659         2703282138   2281451691  
13884813 4323130817397    C     74088960     150323200    212606976    1552     3717     2703298387   6454         2281453627   3030101641  
13884814 4325834115784    C     60939264     130662400    140001280    1279     2537     2281466788   4724         3030117687   2297613763  
13884815 4328115582572    C     165896704    142925824    366075904    1382     5678     3030143591   4028         2297608111   3558481627  
13884816 4331145726163    C     142273024    85852160     291889152    903      4652     2297622990   5684         3558483940   3239762671  
13884817 4333443349153    C     264770560    165347328    435494912    1628     7283     3558495720   4809         3239816411   4048062420  
13884818 4337001844873    C     162929664    86028288     291962880    902      4540     3239834932   7746         4048068030   2584101426  
13884819 4340241679805    C     137316352    94298112     281124864    1217     3654     4048085836   5507         2584105436   2336033214  
13884820 4344289765641    C     95045120     89255936     175349760    910      2625     2584123986   6581         2336032299   1924461857  
13884821 4346873889627    C     97711616     87232512     184975360    877      3075     2336051996   7233         1924461104   1861019067  
13884822 4349209941623    C     84547584     97206272     157278208    1083     2972     1924476383   4748         1861035267   2099725077  
13884823 4351134418006    C     88662016     90320896     175153152    905      2417     1861048221   5092         2099731154   1837980140  
13884824 4352995466227    C     112260608    90034176     182132736    931      3010     2099747002   5675         1837983064   1818140317  
13884825 4355095213229    C     54755840     77709312     99287040     692      1760     1837997759   5168         1818144607   1574600950  
13884826 4356933210988    C     60847616     78692352     102064128    716      1724     1818156051   14286179     1560326897   1501072823  
13884827 4358751367039    S     0            0            0            0        0        1574634670   4651         1501064607   0           
13884828 4360326001709    W     0            0            0            0        0        1501074833   4289         0            0           
13884829 4361827076542    O     0            0            0            0        0        0            0            0            0  
dbuf_cache_hiwater_pct = 10
dbuf_cache_lowater_pct = 10
dbuf_cache_max_bytes = 104857600
dbuf_cache_max_shift = 5
dmu_object_alloc_chunk_shift = 7
ignore_hole_birth = 1
l2arc_feed_again = 1
l2arc_feed_min_ms = 200
l2arc_feed_secs = 1
l2arc_headroom = 2
l2arc_headroom_boost = 200
l2arc_noprefetch = 1
l2arc_norw = 0
l2arc_write_boost = 8388608
l2arc_write_max = 8388608
metaslab_aliquot = 524288
metaslab_bias_enabled = 1
metaslab_debug_load = 0
metaslab_debug_unload = 0
metaslab_fragmentation_factor_enabled = 1
metaslab_lba_weighting_enabled = 1
metaslab_preload_enabled = 1
metaslabs_per_vdev = 200
send_holes_without_birth_time = 1
spa_asize_inflation = 24
spa_config_path = /etc/zfs/zpool.cache
spa_load_verify_data = 1
spa_load_verify_maxinflight = 10000
spa_load_verify_metadata = 1
spa_slop_shift = 5
zfetch_array_rd_sz = 1048576
zfetch_max_distance = 8388608
zfetch_max_streams = 8
zfetch_min_sec_reap = 2
zfs_abd_scatter_enabled = 1
zfs_abd_scatter_max_order = 10
zfs_admin_snapshot = 1
zfs_arc_average_blocksize = 8192
zfs_arc_dnode_limit = 0
zfs_arc_dnode_limit_percent = 10
zfs_arc_dnode_reduce_percent = 10
zfs_arc_grow_retry = 0
zfs_arc_lotsfree_percent = 10
zfs_arc_max = 0
zfs_arc_meta_adjust_restarts = 4096
zfs_arc_meta_limit = 0
zfs_arc_meta_limit_percent = 75
zfs_arc_meta_min = 0
zfs_arc_meta_prune = 10000
zfs_arc_meta_strategy = 1
zfs_arc_min = 0
zfs_arc_min_prefetch_lifespan = 0
zfs_arc_p_aggressive_disable = 1
zfs_arc_pc_percent = 0
zfs_arc_p_dampener_disable = 1
zfs_arc_p_min_shift = 0
zfs_arc_shrink_shift = 0
zfs_arc_sys_free = 0
zfs_autoimport_disable = 1
zfs_compressed_arc_enabled = 1
zfs_dbgmsg_enable = 0
zfs_dbgmsg_maxsize = 4194304
zfs_dbuf_state_index = 0
zfs_deadman_checktime_ms = 5000
zfs_deadman_enabled = 1
zfs_deadman_synctime_ms = 1000000
zfs_dedup_prefetch = 0
zfs_delay_min_dirty_percent = 60
zfs_delay_scale = 500000
zfs_delete_blocks = 20480
zfs_dirty_data_max = 4294967296
zfs_dirty_data_max_max = 4294967296
zfs_dirty_data_max_max_percent = 25
zfs_dirty_data_max_percent = 10
zfs_dirty_data_sync = 824633720832
zfs_dmu_offset_next_sync = 0
zfs_expire_snapshot = 300
zfs_flags = 0
zfs_free_bpobj_enabled = 1
zfs_free_leak_on_eio = 0
zfs_free_max_blocks = 100000
zfs_free_min_time_ms = 1000
zfs_immediate_write_sz = 32768
zfs_max_recordsize = 1048576
zfs_mdcomp_disable = 0
zfs_metaslab_fragmentation_threshold = 70
zfs_metaslab_segment_weight_enabled = 1
zfs_metaslab_switch_threshold = 2
zfs_mg_fragmentation_threshold = 85
zfs_mg_noalloc_threshold = 0
zfs_multihost_fail_intervals = 5
zfs_multihost_history = 0
zfs_multihost_import_intervals = 10
zfs_multihost_interval = 1000
zfs_multilist_num_sublists = 0
zfs_nocacheflush = 0
zfs_nopwrite_enabled = 1
zfs_no_scrub_io = 0
zfs_no_scrub_prefetch = 0
zfs_object_mutex_size = 64
zfs_pd_bytes_max = 52428800
zfs_per_txg_dirty_frees_percent = 30
zfs_prefetch_disable = 0
zfs_read_chunk_size = 1048576
zfs_read_history = 0
zfs_read_history_hits = 0
zfs_recover = 0
zfs_resilver_delay = 2
zfs_resilver_min_time_ms = 3000
zfs_scan_idle = 50
zfs_scan_min_time_ms = 1000
zfs_scrub_delay = 4
zfs_send_corrupt_data = 0
zfs_sync_pass_deferred_free = 2
zfs_sync_pass_dont_compress = 5
zfs_sync_pass_rewrite = 2
zfs_sync_taskq_batch_pct = 75
zfs_top_maxinflight = 32
zfs_txg_history = 200
zfs_txg_timeout = 20
zfs_vdev_aggregation_limit = 131072
zfs_vdev_async_read_max_active = 3
zfs_vdev_async_read_min_active = 1
zfs_vdev_async_write_active_max_dirty_percent = 60
zfs_vdev_async_write_active_min_dirty_percent = 30
zfs_vdev_async_write_max_active = 10
zfs_vdev_async_write_min_active = 2
zfs_vdev_cache_bshift = 16
zfs_vdev_cache_max = 16384
zfs_vdev_cache_size = 0
zfs_vdev_max_active = 1000
zfs_vdev_mirror_non_rotating_inc = 0
zfs_vdev_mirror_non_rotating_seek_inc = 1
zfs_vdev_mirror_rotating_inc = 0
zfs_vdev_mirror_rotating_seek_inc = 5
zfs_vdev_mirror_rotating_seek_offset = 1048576
zfs_vdev_queue_depth_pct = 1000
zfs_vdev_raidz_impl = [fastest] original scalar sse2 ssse3 
zfs_vdev_read_gap_limit = 32768
zfs_vdev_scheduler = noop
zfs_vdev_scrub_max_active = 2
zfs_vdev_scrub_min_active = 1
zfs_vdev_sync_read_max_active = 10
zfs_vdev_sync_read_min_active = 10
zfs_vdev_sync_write_max_active = 10
zfs_vdev_sync_write_min_active = 10
zfs_vdev_write_gap_limit = 4096
zfs_zevent_cols = 80
zfs_zevent_console = 0
zfs_zevent_len_max = 384
zil_replay_disable = 0
zil_slog_bulk = 786432
zio_delay_max = 30000
zio_dva_throttle_enabled = 1
zio_requeue_io_start_cut_in_line = 1
zio_taskq_batch_pct = 75
zvol_inhibit_dev = 0
zvol_major = 230
zvol_max_discard_blocks = 16384
zvol_prefetch_bytes = 131072
zvol_request_sync = 0
zvol_threads = 32
zvol_volmode = 1

@erikjanhofstede
Copy link

@snajpa And are you still having these write performance issues?

@erikjanhofstede
Copy link

I've been running 0.7.1 for a couple of days now and the write performance issue I was having with 0.6.5.x seems to be gone!

@snajpa
Copy link
Contributor

snajpa commented Sep 13, 2017

Unfortunately I can't upgrade that often; the next round of updates won't be for another few weeks. I'll get back to this issue within a month, max, with concrete data.

@nawang87
Copy link

Hi @erikjanhofstede, could you please share the ZFS parameters you applied after upgrading to 0.7.1? Thanks in advance.
@snajpa Do you have any updates regarding the new packages for Ubuntu 16.04? Thanks.

@erikjanhofstede
Copy link

@nawang87 I have some custom parameters in place for this specific configuration and workload; for 0.7.1 I didn't change any of them except renaming zil_slog_limit to zil_slog_bulk.
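
(For reference, the rename amounts to something like this in the module options, assuming they live in /etc/modprobe.d/zfs.conf; the value shown is just the 0.7.x default from the dump above:)

# 0.6.5.x:  options zfs zil_slog_limit=<bytes>
# 0.7.x:
options zfs zil_slog_bulk=786432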

@arturpzol
Copy link

I performed some tests and, like @ab-oe, noticed that without commit 37f9dac performance was pretty good on 0.6.5.x (kernel 4.4.45).

ZOL 0.6.5.x with commit 37f9dac

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00


dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=10000
480903168 bytes (481 MB) copied, 74.4589 s, 6.5 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=10000 
508559360 bytes (509 MB) copied, 80.3149 s, 6.3 MB/s

ZOL 0.6.5.x with reverted commit 37f9dac

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=10000
1310720000 bytes (1.3 GB) copied, 3.33955 s, 392 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=10000 
10485760000 bytes (10 GB) copied, 14.1663 s, 740 MB/s

ZOL 0.7 HEAD and 0.7.1 give results similar to the ones below:

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=10000
455213056 bytes (455 MB) copied, 71.3401 s, 6.4 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=10000
484442112 bytes (484 MB) copied, 74.2631 s, 6.5 MB/s


I noticed that ZOL HEAD by default also runs only one zvol kernel process:

ps ax|grep zvol
 61303 ?        S<     0:00 [zvol]

htop during I/O shows that the zvol process does not occupy resources:

61303 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol

The spl module with spl_taskq_thread_dynamic=0 runs 32 zvol processes:

ps ax|grep zvol
 58474 ?        S<     0:00 [zvol]
 58475 ?        S<     0:00 [zvol]
 ....
 58504 ?        S<     0:00 [zvol]
 58505 ?        S<     0:00 [zvol]

but again htop during I/O shows that the zvol processes do not occupy resources:

58474 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
58475 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
...

58504 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
58505 root        0 -20     0     0     0 S  0.0  0.0  0:00.00 zvol
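
(The 32-thread, non-dynamic taskq case above can be set up with something like this, followed by reloading the spl/zfs modules:)

echo "options spl spl_taskq_thread_dynamic=0" >> /etc/modprobe.d/spl.conf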

However, ZOL 0.6.5.x with commit 37f9dac reverted shows that the zvol processes nicely occupy resources (CPU%):

 20416 root       0 -20       0      0      0 S   2.0  0.0   0:01.83 zvol                                                                                                   
 20403 root       0 -20       0      0      0 S   1.7  0.0   0:01.83 zvol                                                                                                   
 20407 root       0 -20       0      0      0 S   1.7  0.0   0:01.82 zvol                                                                                                   
 20414 root       0 -20       0      0      0 S   1.7  0.0   0:02.18 zvol                                                                                                                                                                                              
 20404 root       0 -20       0      0      0 S   1.3  0.0   0:01.93 zvol                                                                                                   
 20406 root       0 -20       0      0      0 S   1.3  0.0   0:01.79 zvol                                                                                                   
 20413 root       0 -20       0      0      0 S   1.3  0.0   0:02.03 zvol                                                                                                   
 20417 root       0 -20       0      0      0 S   1.3  0.0   0:02.07 zvol                                                                                                   
 20425 root       0 -20       0      0      0 S   1.3  0.0   0:02.06 zvol                                                                                                   
 20429 root       0 -20       0      0      0 S   1.3  0.0   0:01.96 zvol  

@sempervictus
Contributor

@behlendorf: maybe it's time to revert the bio work. It's obviously causing a major slowdown and there's no solution in sight. I've not seen anything from @ryao on this in a while (or anything at all, hope he's OK), and I'm not sure anyone else is working on, or knows how to solve, the problem. If LLNL had to back the workload with a block device, would you allow engineering to ship a pathologically slow system which literally dropped 100x in performance on a single commit and was never fixed? This has blocked many production implementations for us; client engineering teams shoot us down and spend 30X on ... name a vendor ... Then they're stuck in proprietary hell, locked in to the people who now own their data, with other performance and consistency problems.
With crypto in place, this is probably now the most serious outstanding feature gap. Can this get prioritized please, or reverted if there is no developer bandwidth to fix it?

@kernelOfTruth
Contributor

Listing the following commits (for reference) that touch on the bio work or related code, in case it helps track down the issue and avoid having to revert all of it ☕️:

d454121 Linux 3.14 compat: Immutable biovec changes in vdev_disk.c [via git search]

37f9dac zvol processing should use struct bio [mentioned here]

2727b9d Use uio for zvol_{read,write}

a765a34 Clean up zvol request processing to pass uio and fix porting regressions

#4316 Reduce overhead in zvol_write/zvol_read via existing hold
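
(If it helps, each of these can be inspected in a local ZoL checkout, e.g.:)

git show --stat 37f9dac   # zvol processing should use struct bio
git show --stat 2727b9d   # Use uio for zvol_{read,write}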

@arturpzol

I noticed one more performance regression, introduced in ZOL 0.7.2:

ZOL 0.7.1-1:

cat /sys/module/zfs/version 
0.7.1-1

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 16.8642 s, 38.9 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 23.6626 s, 44.3 MB/s

ZOL 0.7.2-1:

cat /sys/module/zfs/version 
0.7.2-1

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 94.3687 s, 6.9 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=500 
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 75.3563 s, 7.0 MB/s

After reverting each commit individually, it seems that commit a5b91f3 introduced the slowdown:

ZOL 0.7.2-1 without a5b91f3 :

zpool create Pool-0 raidz2 wwn-0x50000c0f01125248 wwn-0x50000c0f0109c684 wwn-0x50000c0f01125280 wwn-0x50000c0f0109c654 wwn-0x50000c0f01124ea0 wwn-0x50000c0f010afff0 wwn-0x50000c0f0109c318 wwn-0x50000c0f010afed8 wwn-0x50000c0f0112529c wwn-0x50000c0f01095b68 wwn-0x50000c0f011252e0 wwn-0x50000c0f010b0030 wwn-0x50000c0f01125334 wwn-0x50000c0f011251c4 wwn-0x50000c0f011252f0 wwn-0x50000c0f010affd4 -O dedup=off -O compression=off -O acltype=posixacl -o ashift=12 -o failmode=wait -m /Pools/Pool-0 -f
zfs create -s -b 131072 -o primarycache=all -o logbias=latency -o sync=always -o dedup=off -o compression=lz4 -o copies=1 -o secondarycache=all -V 11327976243200 Pool-0/vol00

dd if=/dev/zero of=/dev/zd0 bs=128k oflag=dsync  count=5000
5000+0 records in
5000+0 records out
655360000 bytes (655 MB) copied, 16.7465 s, 39.1 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=dsync  count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.0499 s, 43.6 MB/s

By the way: with the direct flag performance is generally good, but not every application can use direct I/O, so it would be nice to refactor the bio work:

ZOL 0.7.2-1:


dd if=/dev/zero of=/dev/zd0 bs=128k oflag=direct  count=50000
50000+0 records in
50000+0 records out
6553600000 bytes (6.6 GB) copied, 13.6889 s, 479 MB/s

dd if=/dev/zero of=/dev/zd0 bs=1M oflag=direct  count=50000
50000+0 records in
50000+0 records out
52428800000 bytes (52 GB) copied, 47.4183 s, 1.1 GB/s

@odoucet

odoucet commented Sep 27, 2017

I can provide additional tests on a server experiencing the same issue (on 0.6.5.x).
How easily can commit 37f9dac be rolled back in the source code to reproduce the issue?

@behlendorf
Contributor

@sempervictus @arturpzol @odoucet I've read through this issue again, but I'm not sure I completely understand. Can you help me understand your target workload? From the recently posted test results it looks like your focus is entirely on synchronous single-threaded IO, which is unfortunately exactly where I'd expect performance to be worst, particularly for a raidz pool without a dedicated log device.

Let me try to summarize the state of things in 0.7.2 and master, and do my best to answer the questions posted. Then I can properly investigate your performance concerns and we can determine what can be done.

maybe it's time to revert the bio work

The flaw in the bio work referenced above has already been reverted. This was done in commits 692e55b and 8fa5250, which reinstated the task queues to resolve the asynchronous IO performance issue.

commit a5b91f3 introduced that slowdown:

That's a bigger hit than I would have expected; we'll have to look into that.

Using the current master code I tested this worst-case scenario as posted above. The volume was created with sync=always and I ran a single-threaded dd for the workload. That means every single 128K block needs to be written to the log, which is part of the primary pool, and we can't return to user space to handle another write until we're absolutely sure it's on disk. Then we need to write it again as part of the normal IO path. This is absolutely going to be slow.

That all said, you're right. In my testing it does look even worse than what I would have expected: I see a large write multiplier which I can't immediately explain. Let me investigate what's going on and get back to you.

On the upside, if you stick with the default values I do see reasonable performance numbers.
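
(For comparison, something like the following fio run — a sketch only, with purely illustrative parameters and assuming fio with libaio support is installed — exercises queued, multi-threaded synchronous writes against the same zvol, which is closer to an iSCSI/NFS backing workload than single-threaded dd:)

fio --name=zvol-sync-write --filename=/dev/zd0 --rw=write --bs=128k \
    --ioengine=libaio --iodepth=32 --numjobs=4 --direct=1 --sync=1 \
    --size=2G --group_reporting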

@arturpzol

@behlendorf the dd scenario was only for testing and to make the issue easy to reproduce.

In my main environment I use SCST, which exports zvols to iSCSI initiators. Initially I noticed that with ZOL 0.7 master a disk format on the Windows side took very long:

ZOL 0.6.5.x with reverted bio work:

zvol size: 10TB, sync: always - about 18 seconds

ZOL 0.7 HEAD:

zvol size: 10TB, sync: always - about 4 minutes 36 seconds

so I eliminated the SCST, network and iSCSI initiator layers and switched to local I/O using dd, to be sure that the problem is in the ZOL layer.

I understand that sync=always without a dedicated log device has to be slow, but why are 0.6.5.x with the bio work reverted, or 0.6.4.2, so fast with the same test scenario? Does it mean that 0.6.5.x with the bio work reverted, or 0.6.4.2, is not fully synchronous? On the other hand that cannot be, because I performed many tests with forced shutdowns of the node and have never experienced data corruption.

Of course I can use the default value of the sync property for the zvol, but that does not protect me from data corruption in case of power failure.
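
(For completeness, a sketch of checking and switching the property on the test zvol from above:)

zfs get sync Pool-0/vol00
zfs set sync=standard Pool-0/vol00   # default: honour O_SYNC/fsync requests from the application
zfs set sync=always Pool-0/vol00     # force every write through the ZIL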

@behlendorf are you able to run the dd test with ZOL 0.6.4.2 and 0.6.5.x/0.7.x? You will notice very big performance differences.

@arturpzol

@odoucet you can simply use 0.6.4.2, or revert the bio work on the first released 0.6.5.
To do so, download the patch from https://uploadfiles.io/pba86 .

It is a diff between the released 0.6.5 and 37f9dac.
Then patch your zfs sources, e.g.:

patch -p1 < ./revert_bio.patch

or use git to do it.
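
A rough sketch of the git route (assuming the tree is checked out at the zfs-0.6.5 release tag; the revert may need manual conflict resolution):

git checkout zfs-0.6.5
git revert 37f9dac                    # "zvol processing should use struct bio"
./autogen.sh && ./configure && make && sudo make install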

For later releases in the 0.6.5 series (e.g. v0.6.5.1) it will be harder to revert because of additional source code dependencies.

@kernelOfTruth
Contributor

Make sure to run the ZFS test suite after those changes;

ZFS is about data integrity, after all; if the modified filesystem code puts your data through the grinder (data corruption, no integrity, other issues), it's pretty much useless.
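
(For example, from a source tree that ships the test suite — script names and flags per the in-tree tooling, so treat this as a sketch:)

./scripts/zfs-tests.sh -v       # run the ZFS Test Suite (assumes a master/0.7.x tree)
./scripts/zloop.sh -t 600       # ztest stress run for 10 minutes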

@arturpzol

@kernelOfTruth I agree 100% that ZFS data integrity and the rest need to be tested with the bio work reverted.
I only wanted to show the performance differences between 0.6.4.2 and 0.6.5.

@ab-oe
Contributor

ab-oe commented Oct 6, 2017

@behlendorf I tested 37f9dac with PR #5824 on top; it has the same poor performance as described in most posts here. Restoring the 32 zvol threads has no impact on performance. The performance on 0.7.0-rc4 was better because of the incorrect ranges locked for indirect writes (which, by the way, led to broken data on zvols), but after that fix the performance is as bad as it was on 37f9dac.

I found another performance regression, described in #6728, which affects the discard operation on zvols.
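
(A quick, destructive way to time the discard path on a test zvol — a sketch only; blkdiscard wipes the given range:)

time blkdiscard -o 0 -l 1G /dev/zd0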
