
Feature Request: Sequential Resilvering (was: Very slow drive-replace) #4825

Closed
RubenKelevra opened this issue Jul 4, 2016 · 50 comments
Labels: Type: Performance (Performance improvement or performance problem)

Comments

@RubenKelevra

RubenKelevra commented Jul 4, 2016

Currently a drive replace is dead slow. I'm currently in the middle of a data migration to a larger storage...

My setup is a RAIDz2 of nine 3 TB SATA HDDs, which is currently degraded to a single disk of redundancy.

I'm currently receiving two filesystems and need to replace two disks, because I started out with two disks borrowed from a friend and now want to use the disks from the old storage system.

The two zfs receives are limited by the internet connection, which is around 95 Mbit/s.

So the rest of the I/O bandwidth, which should be around 60 MB/s x 7 => 420 MB/s, is available for a device replace.

So I've started replacing both drives that need to be swapped, at the same time.

Currently ZFS does not seem to simply copy the data from one drive to the other while writing new data/metadata to both drives; instead it looks like ZFS runs a complete analysis of all data on all disks to work out what has to be stored on this particular new disk.

This is currently dead slow:

# zpool status
  pool: tanka
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jul  3 19:53:09 2016
    110G scanned out of 2.64T at 1.71M/s, 431h38m to go
    23.6G resilvered, 4.05% done
config:

    NAME                      STATE     READ WRITE CKSUM
    tanka                     DEGRADED     0     0     0
      raidz2-0                DEGRADED     0     0     0
        replacing-0           ONLINE       0     0     0
          sdd                 ONLINE       0     0     0
          sdc                 ONLINE       0     0     0  (resilvering)
        sda                   ONLINE       0     0     0
        sde                   ONLINE       0     0     0
        sdb                   ONLINE       0     0     0
        sdf                   ONLINE       0     0     0
        replacing-5           ONLINE       0     0     0
          sdg                 ONLINE       0     0     0
          sdl                 ONLINE       0     0     0  (resilvering)
        sdh                   ONLINE       0     0     0
        sdi                   ONLINE       0     0     0
        15843216295752979018  UNAVAIL      0     0     0  was /tmp/fakedrive.img

errors: No known data errors

zfs list shows that 2.97T is used, which works out to roughly 432 GB per disk. Since each disk is capable of reading/writing around 90 MB/s raw, a 1:1 copy should run at 60 MB/s on average at worst.

So my expectation was that sdd is copied to sdc while sdg is copied to sdl, both at 60 MB/s, so this should be done in about 2 hours. Instead I'm now facing 450 hours of runtime, roughly 225 times longer than expected...

@RubenKelevra
Author

I've just changed my procedure:
I've stopped both replaces on the running disks - I suspected there was too much load on the disks that are currently active members.

Now I'm trying to add the missing disk first, which should put the least load on the whole array - because all disks provide the data for the missing disk at once. Now I would expect a recovery rate of nearly 90 MB/s.

But the result is much worse than that: as before, the whole array is scanned. It's twice as fast, but still far away from the expected speed:

90 MB/s expected on 432 GB of data per disk => about 1:22 h recovery time.

Got 3 MB/s on 3.76T ... => 15 days 12:39 h expected recovery time.

# zpool status
  pool: tanka
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul  4 15:16:49 2016
    1,28G scanned out of 3,76T at 3,07M/s, 355h54m to go
    143M resilvered, 0,03% done
config:

    NAME                        STATE     READ WRITE CKSUM
    tanka                       DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        sdd                     ONLINE       0     0     0
        sda                     ONLINE       0     0     0
        sde                     ONLINE       0     0     0
        sdb                     ONLINE       0     0     0
        sdf                     ONLINE       0     0     0
        sdg                     ONLINE       0     0     0
        sdh                     ONLINE       0     0     0
        sdi                     ONLINE       0     0     0
        spare-8                 OFFLINE      0     0     0
          15843216295752979018  OFFLINE      0     0     0  was /tmp/fakedrive.img
          sdc                   ONLINE       0     0     0  (resilvering)
    spares
      sdl                       AVAIL   
      sdc                       INUSE     currently in use

errors: No known data errors

@RubenKelevra
Author

The system is running Arch Linux with ZFS 0.6.5.7 on kernel 4.6.3.

The system has an i5 6xxx quad-core processor without HT and 32 GB of DDR3 RAM.

The 3 TB disks are given to ZFS as whole disks; the system boots from a dedicated USB device.

@dasjoe
Contributor

dasjoe commented Jul 4, 2016

Resilver speed depends on your disks' fragmentation, and thus usually varies over the whole resilver. 3 MB/s for a single vdev is well above the expected minimum speed.

You're looking at a single disk's IOPS, so even slight fragmentation leads to a lower resilver speed.
Let's assume 90 IOPS, so your pool can read at least 90 blocks per second. A minimum block size of 4 KB leads to 360 KB/s, which shows that your pool's average blocks are larger or the pool is not fully fragmented.

You may have success following http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/ and tuning zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active.
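For reference, these tunables can be inspected and changed at runtime through /sys/module/zfs/parameters; the values below are only illustrative, not recommendations:

# current values
cat /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

# allow more concurrent scrub/resilver I/Os per vdev (example values)
echo 3 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 10 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active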

Also, this is not a discussion forum but a bug tracker, so this would be better on zfs-discuss.

@RubenKelevra
Author

RubenKelevra commented Jul 5, 2016

This is the performance of a completely idle storage:

# zpool status
  pool: tanka
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul  4 15:16:49 2016
    229G scanned out of 3,76T at 3,72M/s, 276h10m to go
    25,2G resilvered, 5,96% done
config:

    NAME                        STATE     READ WRITE CKSUM
    tanka                       DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        sdd                     ONLINE       0     0     0
        sda                     ONLINE       0     0     0
        sde                     ONLINE       0     0     0
        sdb                     ONLINE       0     0     0
        sdf                     ONLINE       0     0     0
        sdg                     ONLINE       0     0     0
        sdh                     ONLINE       0     0     0
        sdi                     ONLINE       0     0     0
        spare-8                 OFFLINE      0     0     0
          15843216295752979018  OFFLINE      0     0     0  was /tmp/fakedrive.img
          sdc                   ONLINE       0     0     0  (resilvering)
    spares
      sdl                       AVAIL   
      sdc                       INUSE     currently in use

errors: No known data errors

Actually this is a completely new pool. We only started importing data 4 days ago, so I don't think there is any fragmentation at all. And that was my point: ZFS does not need to read the data along its fragmentation. ZFS just needs a bitmap of data/no-data blocks per device; once that is built, it can do a 1:1 copy of the device that needs to be exchanged. Newly written data can simply be written to both disks as well.

This would increase the speed here by around 112,000%.

If this cannot be done, a 1:1 copy of the whole data disk is much faster than the ZFS approach, and that is what I'm actually doing right now, because the ZFS way of replacing a disk is completely broken from my point of view.

Also, this is not a discussion forum but a bug tracker, so this would be better on zfs-discuss.

This is a bug report, not a discussion. This is broken; whether it's broken is not up for discussion.

Sidequestion:

Resilver speed depends on your disks' fragmentation, and thus usually varies over the whole resilver. 3 MB/s for a single vdev is well above the expected minimum speed.

What can I actually do about fragmentation of such a pool? Defragmentation is not supported by ZFS at all. See #4785

@RubenKelevra
Author

I've tried your zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active suggestion; the resilver speed increased to an average of 6 MB/s on the recovering device ... which still seems fairly slow. All disks are running at about 9-12% busy and the recovering device is a bit higher, at around 60% on average.

Also, the written data seems very scattered, since the disk is 60% busy but ZFS only writes 6 MB/s.

@mailinglists35

mailinglists35 commented Jul 5, 2016

This issue was previously discussed in #1110.

"@behlendorf commented on Nov 29, 2012
@mattlqx Unfortunately, if you have a lot of small files in your pool it's the norm. It's also not really acceptable for the enterprise so there is a design for a fast resilver feature floating around which just needs to be implemented."

Does anyone know what the status of the fast resilver feature is?

@mailinglists35

mailinglists35 commented Jul 5, 2016

from https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/

"RAIDz resilvering is very slow in OpenZFS-based zpools. It's a lot better in Solaris, though still not as good as mirroring. Basically, it starts with every transaction that's ever happened in the pool and plays them back one-by-one to the new drive. This is very IO-intensive. If you're using hard drives larger than 1TB and you are using OpenZFS, use mirror, not RAIDz*. From a certain point of view, one might think that RAIDz's only legitimate use case in a post-2015 world is for all-SSD pools.
[...]
Disclaimer: I'm an Oracle employee. [...]"

further on

"Solaris tweaked this a lot: Sequential Resilvering. The previous resilvering algorithm repairs blocks from oldest to newest, which can degrade into a lot of small random I/O. The new resilvering algorithm uses a two-step process to sort and resilver blocks in LBA order. The amount of improvement depends on how pool data is laid out. For example, sequentially written data on a mirrored pool shows no improvement, but randomly written data or sequentially written data on RAID-Z improves significantly - typically reducing time by 25 to 50 percent."

@RubenKelevra
Author

RubenKelevra commented Jul 6, 2016

I'm now fixing this issue with dd, a 1:1 copy. I learned that the 3 TB disks do 200 MB/s, so replacing two disks now takes 4 h, not 15 1/2 days like ZFS would need.

Since I'm writing the full 3 TB and not just 450 GB, Sequential Resilvering has a much higher potential here than just 25-50%.
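For completeness, a rough sketch of the dd-based 1:1 copy described above (device names are only placeholders; this is only reasonable with the pool exported, and the old disk has to stay detached afterwards, since both disks then carry identical ZFS labels):

zpool export tanka
dd if=/dev/OLD_DISK of=/dev/NEW_DISK bs=1M status=progress
# physically remove or detach the old disk, then:
zpool import tanka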

@RubenKelevra RubenKelevra changed the title Very slow drive-replace Feature Request: Sequential Resilvering (was: Very slow drive-replace) Jul 6, 2016
@ronnyegner

ronnyegner commented Jul 7, 2016

Hi,

I am also running a RAIDz2 (8x 3 TB) and a RAIDz3 (12x 4 TB), and recently one of the disks in each pool failed. On both pools the resilver speed is between 800 and 1000 MB/s.

The pools contain mainly ZVOLs with a 1 MB volblocksize and a few bigger file systems with relatively large files (averaging between a few MB and a few GB).

The settings I use to favor resilvering over other I/O are:

echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 8000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

Ronny

@Rayn0r

Rayn0r commented Mar 25, 2017

I replaced an HDD in a two-disk mirror on a test system today, during evaluation tests.
The machine is running Ubuntu 16.04 LTS with kernel 4.4.0-64. The pool was filled with 1.4 TB of data just yesterday. Resilvering then ran at 10.5-11.6 MiB/s on a 4 TB Seagate Ironwolf with no more than 93 IOPS according to "zpool iostat datengrab 10".

After setting /sys/module/zfs/parameters/zfs_vdev_async_write_min_active from 1 to 8, it ran at 140-160 MiB/s and IOPS jumped to over 1000.
The values mentioned above by @ronnyegner had no effect on re-sync speed here.
It seems as if the default value of zfs_vdev_async_write_min_active "artificially" slows down writes and even causes iostat to assume that the drive is working at its limit, probably because w_await and svctm are almost identical. See below:

zpool status -v
  pool: datengrab
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 25 21:21:33 2017
    400G scanned out of 1.41T at 43.6M/s, 6h50m to go
    400G resilvered, 27.59% done
config:

        NAME        STATE     READ WRITE CKSUM
        datengrab   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc1    ONLINE       0     0     0
            sde1    ONLINE       0     0     0  (resilvering)

iostat -xm /dev/sdc1 /dev/sde1 10
delivers the following output with zfs_vdev_async_write_min_active set to 1:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.30    0.35    0.91    0.91    0.00   97.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde1              0.00     0.00    0.00   90.30     0.00    10.85   246.07     1.02   11.28    0.00   11.28  11.03  99.64
sdc1             24.60     0.00   72.50    2.40    12.05     0.03   330.28     1.33   17.74   17.67   19.67   1.87  14.04

After setting it to 8 it looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.23    0.10    5.21    1.28    0.00   92.18

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde1              0.00     0.00    0.00 1340.50     0.00   165.34   252.60     7.75    5.78    0.00    5.78   0.74  98.88
sdc1              0.00     0.00 1348.70    8.30   167.31     0.21   252.81     2.12    1.57    1.34   37.64   0.71  96.44

Even svctm dropped from 11.03ms to 0.74ms... Amazing!

@DeHackEd
Contributor

DeHackEd commented Mar 26, 2017

That zfs_vdev_async_write_min_active=1 thing is known. I think the minimum needs to be set to 2.

What I believe is happening is that ZFS only gives drives 1 write operation at a time. When it finishes a write, ZFS sends a new one for the next sector, but the platter has already rotated past the start of that sector and must wait for (effectively) a full rotation to do the write. With 2 outstanding writes the drive can be kept fed with sequential write operations.
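A rough back-of-the-envelope illustration of that effect (numbers assumed for a 7200 RPM drive with ~128 KB writes, purely illustrative):

7200 RPM                => ~8.3 ms per rotation
1 outstanding write     => worst case ~1 write per rotation => ~120 writes/s
120 writes/s x 128 KB   => ~15 MB/s
2+ outstanding writes   => back-to-back sequential writes, much closer to the drive's raw streaming rate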

@Rayn0r

Rayn0r commented Mar 26, 2017

@DeHackEd
The property's name is zfs_vdev_async_write_min_active. In my understanding "async" means that it should not wait for the data to be written to disk, but just feed it to the write cache and let it handle the rest.

From what I read here, the write cache is not enabled if ZFS does not handle the entire disk. But I also found an (almost 5-year-old) post mentioning that the ioctls needed for this are not supported on Linux.
I then checked hdparm.conf and found that we had purposely turned off the write cache for all drives to mitigate FS corruption in case of a power/PSU failure.

With 2 or more write operations in parallel one is probably working around the resulting "write lag" caused by the disabled cache.
Setting zfs_vdev_async_write_min_active above 3 does not result in any further performance gain.

Setting zfs_vdev_async_write_min_active=1 and enabling the write cache with hdparm -W1 /dev/sde results in the same write performance as the disabled write cache with zfs_vdev_async_write_min_active=3 (145-165 MiB/s).
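For anyone who wants to check the same thing on their own drives, hdparm can query and toggle the volatile write cache (device name is just an example):

hdparm -W  /dev/sdX   # show whether the write cache is enabled
hdparm -W1 /dev/sdX   # enable it
hdparm -W0 /dev/sdX   # disable it again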

@DeHackEd
Contributor

But ZFS has its own IO scheduler which will only keep the drive fed with 1 write request at a time if the request is internally categorized as ASYNC. While resilvering doesn't need to wait for a write to happen, the IO scheduler is spoon-feeding the drive when really it wants a continuous stream of data.

@Rayn0r

Rayn0r commented Mar 26, 2017

This is what I am saying... ASYNC means that writes are sent to the disk without waiting for the drive to confirm the write.
If there is a write cache available, this spoon-feeding seems to work pretty well, since the cache buffers all write requests and the drive's firmware handles the rest.
As soon as you turn off the cache, you need to wait until the head has reached a position over the next sector that needs to be written. This would also explain the 11 ms service time (this is probably the head's seek time) in the iostat -xm... output.

@RubenKelevra
Author

@Rayn0r that sounds reasonable.

But in my case the whole disks were added to the pool, and I did not play with any hdparm commands on this machine, so I would expect that the write caches are on or are turned on by ZFS.

I think my case was just many, many transactions imported from two hypervisor filesystems. But since the pool was completely empty before, this should not lead to heavy fragmentation.

Maybe we also have an issue where filesystems imported from foreign systems end up heavily fragmented as they are written, or reading back all those transactions hits a bottleneck while resilvering...

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Mar 27, 2017
@behlendorf behlendorf added this to the 0.7.0 milestone Mar 27, 2017
@behlendorf
Contributor

@Rayn0r do you have data for setting zfs_vdev_async_write_min_active=2? I saw above that you posted results for 1 and 3. Ideally, we want to set this value as low as possible while still maintaining good performance.

@Rayn0r

Rayn0r commented Mar 28, 2017

I did a complete re-sync with zfs_vdev_async_write_min_active=2 this morning. Note that the pool usage has increased by 220GiB over the last 2 days.


   eid: 39
 class: resilver.finish
  host: ilpss8
  time: 2017-03-28 11:23:47+0200
  pool: datengrab
 state: ONLINE
  scan: resilvered 1.64T in 4h39m with 0 errors on Tue Mar 28 11:23:47 2017
config:

	NAME        STATE     READ WRITE CKSUM
	datengrab   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc1    ONLINE       0     0     0
	    sde1    ONLINE       0     0     0

errors: No known data errors

This is the test from the weekend with zfs_vdev_async_write_min_active=3:


   eid: 33
 class: resilver.finish
  host: ilpss8
  time: 2017-03-26 16:50:05+0200
  pool: datengrab
 state: ONLINE
  scan: resilvered 1.42T in 2h59m with 0 errors on Sun Mar 26 16:50:05 2017
config:

	NAME        STATE     READ WRITE CKSUM
	datengrab   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc1    ONLINE       0     0     0
	    sde1    ONLINE       0     0     0

errors: No known data errors

@RubenKelevra
Author

RubenKelevra commented Mar 29, 2017 via email

@Rayn0r

Rayn0r commented Mar 29, 2017

Is there any disadvantage if the number is higher?

I can only answer the question above...
I tested with values of up to 8 and could not see any performance improvement over 3.
The data rate did not drop while increasing the number.

I'd suggest that you also run some tests to identify the impact of increasing the number.

@RubenKelevra
Author

RubenKelevra commented Mar 30, 2017 via email

@RubenKelevra
Author

RubenKelevra commented Mar 30, 2017

I may have some example data for you guys to analyze. I changed the setting on three hypervisors (after the gap in the graphs the setting is 8 instead of 1). Nothing else has changed in the ZFS settings or on the systems. All servers have been up for weeks.

So it seems you get a slight reduction in utilization of 5-10%, while it might improve the responsiveness of the disk.

The big difference between loki and ra is that loki hosts two database servers, one RRD-based and one MySQL-based, so you get a lot of small random writes to the disk, while ra has a lot more normal load like web servers and application servers.

I added some hardware specs at the end.

juno-diskstats_iops-week
juno-diskstats_latency-week
juno-diskstats_throughput-week
juno-diskstats_utilization-week
loki-diskstats_iops-week
loki-diskstats_latency-week
loki-diskstats_throughput-week
loki-diskstats_utilization-week
ra-diskstats_iops-week
ra-diskstats_latency-week
ra-diskstats_throughput-week
ra-diskstats_utilization-week

loki:
8 AMD cores (not very powerful and constantly under load)
two hdds via sata (1 TB)
32 gigs of ram (6-8 gig available for zfs/cache-use)
zfs on partitions and hdd write-cache ON

ra:
8 cores intel (a bit more powerful than loki)
16 gigs of ram (5-7 gigs available for zfs/cache-use)
two hdds via sata (2 TB)
zfs on partitions and hdd write-cache ON

juno
2x6 cores intel (very powerful xeons)
128 gigs of ram (not even half full)
two hdds (4 TB) via raid-controller with a large cache
raid-controller write-cache is off, hdd write cache - I have no idea
zfs on partitions

@Rayn0r

Rayn0r commented Mar 30, 2017

Did you have an ear on the HDDs? I was wondering if they seek very much.

At a data rate of over 140 MiB/s, it is highly unlikely that the drives are seeking a lot. The rack is too loud to hear the drives over the fans of the computer, the UPS, and the switch.

PS: I feel like I have somehow hijacked this issue now. The subject is still: "Feature Request: Sequential Resilvering"...

@RubenKelevra
Author

RubenKelevra commented Mar 30, 2017 via email

@RubenKelevra
Author

Well, I worked again on one of the VMs on the loki hypervisor and I must admit that the machine is overall A LOT snappier.

I looked at different stats and the machine now has more network throughput. This explains why the IRQs are now around 12% higher. The CPU usage has also increased, which is a good thing.

I still have no idea what the machine's bottleneck is, but I guess it's just a CPU-RAM bottleneck. In any case, the "8" works fine on this machine.

If there are concerns about 8, I can try out 3 for you.

@behlendorf ?

@behlendorf
Contributor

@RubenKelevra it would be great if you could try out values of 2 and 3. The specific concern here is latency, which can be seen in the nice graphs you generated.

latency

The ZFS I/O scheduler is trying to strike a good balance between latency and throughput in order to achieve consistent performance. The theory of operation is fully described in the ZFS I/O SCHEDULER section of the zfs-module-parameters man page which I've included below.

To summarize, we should consider increasing zfs_vdev_async_write_min_active, but we want to increase it as little as possible to minimize the negative impact on latency. If we can get the majority of the benefit from increasing it to 2 or 3, that would be best. We may also want to consider decreasing zfs_vdev_async_write_active_min_dirty_percent so the zfs_vdev_async_write_min_active cap is raised more quickly.
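For example, to experiment along those lines at runtime (values are only illustrative, not proposed defaults):

echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 10 > /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent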

ZFS I/O SCHEDULER
       ZFS issues I/O operations to leaf vdevs to satisfy and  complete  I/Os.
       The  I/O  scheduler  determines when and in what order those operations
       are issued.  The I/O scheduler divides operations into five I/O classes
       prioritized  in the following order: sync read, sync write, async read,
       async write, and scrub/resilver.  Each queue defines  the  minimum  and
       maximum  number  of  concurrent  operations  that  may be issued to the
       device.   In  addition,  the   device   has   an   aggregate   maximum,
       zfs_vdev_max_active.  Note  that the sum of the per-queue minimums must
       not exceed the aggregate maximum.  If the sum of the per-queue maximums
       exceeds the aggregate maximum, then the number of active I/Os may reach
       zfs_vdev_max_active, in which case  no  further  I/Os  will  be  issued
       regardless of whether all per-queue minimums have been met.

       For many physical devices, throughput increases with the number of con‐
       current operations, but latency typically  suffers.  Further,  physical
       devices typically have a limit at which more concurrent operations have
       no effect on throughput or can actually cause it to decrease.

       The scheduler selects the next operation to issue by first looking  for
       an  I/O class whose minimum has not been satisfied. Once all are satis‐
       fied and the aggregate maximum has not been hit,  the  scheduler  looks
       for classes whose maximum has not been satisfied. Iteration through the
       I/O classes is done in the order specified above. No further operations
       are issued if the aggregate maximum number of concurrent operations has
       been hit or if there are no operations queued for an I/O class that has
       not  hit its maximum.  Every time an I/O is queued or an operation com‐
       pletes, the I/O scheduler looks for new operations to issue.

       In general, smaller max_active's will lead to lower latency of synchro‐
       nous  operations.   Larger  max_active's  may  lead  to  higher overall
       throughput, depending on underlying storage.

       The ratio of the queues' max_actives determines the balance of  perfor‐
       mance   between   reads,   writes,   and   scrubs.    E.g.,  increasing
       zfs_vdev_scrub_max_active will cause the scrub or resilver to  complete
       more  quickly,  but  reads  and writes to have higher latency and lower
       throughput.

       All I/O classes have a fixed maximum number of  outstanding  operations
       except  for  the  async  write class. Asynchronous writes represent the
       data that is committed to stable storage during the syncing  stage  for
       transaction groups. Transaction groups enter the syncing state periodi‐
       cally so the number of queued async writes will quickly  burst  up  and
       then  bleed down to zero. Rather than servicing them as quickly as pos‐
       sible, the I/O scheduler changes the maximum  number  of  active  async
       write  I/Os  according  to the amount of dirty data in the pool.  Since
       both throughput and latency typically increase with the number of  con‐
       current  operations issued to physical devices, reducing the burstiness
       in the number of concurrent operations  also  stabilizes  the  response
       time  of  operations  from  other  --  and in particular synchronous --
       queues. In broad strokes, the I/O scheduler will issue more  concurrent
       operations from the async write queue as there's more dirty data in the
       pool.

       Async Writes

       The number of concurrent operations issued  for  the  async  write  I/O
       class  follows a piece-wise linear function defined by a few adjustable
       points.

              |              o---------| <-- zfs_vdev_async_write_max_active
         ^    |             /^         |
         |    |            / |         |
       active |           /  |         |
        I/O   |          /   |         |
       count  |         /    |         |
              |        /     |         |
              |-------o      |         | <-- zfs_vdev_async_write_min_active
             0|_______^______|_________|
              0%      |      |       100% of zfs_dirty_data_max
                      |      |
                      |      `-- zfs_vdev_async_write_active_max_dirty_percent
                      `--------- zfs_vdev_async_write_active_min_dirty_percent

       Until the amount of dirty data exceeds  a  minimum  percentage  of  the
       dirty data allowed in the pool, the I/O scheduler will limit the number
       of concurrent operations to the minimum. As that threshold is  crossed,
       the  number  of  concurrent operations issued increases linearly to the
       maximum at the specified maximum percentage of the dirty  data  allowed
       in the pool.

       Ideally,  the  amount  of  dirty  data  on a busy pool will stay in the
       sloped        part        of        the        function         between
       zfs_vdev_async_write_active_min_dirty_percent                       and
       zfs_vdev_async_write_active_max_dirty_percent. If it exceeds the  maxi‐
       mum  percentage,  this  indicates  that  the  rate  of incoming data is
       greater than the rate that the backend  storage  can  handle.  In  this
       case,  we  must  further  throttle incoming writes, as described in the
       next section.

@RubenKelevra
Author

RubenKelevra commented Apr 4, 2017 via email

@RubenKelevra
Author

RubenKelevra commented Apr 4, 2017 via email

@RubenKelevra
Author

Oh yes, GitHub's JavaScript has failed now; I'm sorry, but I think it's still readable.

@behlendorf
Contributor

image

To be clear, the relevant section of the graph is the rightmost portion, where latency increases linearly with larger zfs_vdev_async_write_min_active values. And from the previous rebuild testing, a value of 2 significantly helped rebuild times, 3 yielded an additional meaningful improvement, and there wasn't much gain after that.

Given this data I think we should adopt the performance tweak proposed by @DeHackEd in PR #5926 and increase the default zfs_vdev_async_write_min_active to 2 to keep the drive fed.

@RubenKelevra
Author

Well, 3 looked a bit more promising, since it could speed up the rebuild even more.

132.22 MB/s vs 97.97 MB/s rebuild speed - a 34.96% speedup.

Since this should also affect random writes, a performance increase in this range should be discussed, I think.

It should also be mentioned that my graphs show the latency for devices with the write cache ON, while he has devices with the write cache OFF. So my performance gain was negligible, because the write cache of the devices is doing most of the job here.

If it would be helpful, I can rerun my setup with different values with the device cache turned off.

The performance impact was so heavy when I set up my rigs that I was unable to run them without the write cache, even though this may cause data loss on a crash, because I use ZFS on partitions.

@behlendorf
Contributor

behlendorf commented Apr 5, 2017

Sure, I can see a case for that. It depends on whether you value bandwidth or latency more. Here's approximately the data I see from the comments above.

zfs_vdev_async_write_min_active    b/w (MB/s)    latency (ms)
1                                    11           ~4
2                                    97           ~6
3                                   132           ~8
8                                  ~132          ~16

@RubenKelevra as long as the controllers and drives honor cache flushes it's safe, and preferable, to run with the write cache enabled.

@RubenKelevra
Author

RubenKelevra commented Apr 5, 2017 via email

@DeHackEd
Contributor

DeHackEd commented Apr 5, 2017

The IO scheduler was originally tuned with rotational media in mind. Resilvering performance on an SSD will also be improved by higher queue depths, but not for the same reason as on hard drives, where overshooting the next sector requires waiting for the next rotation.

I have seen a few other instances of this happening in the IRC channel. I won't quote any here. While it's anecdotal compared to the tests done here, it sounds like a major win for minor performance losses.

SSDs should be reconfigured from defaults anyway.

@RubenKelevra
Author

RubenKelevra commented Apr 5, 2017 via email

@ryao
Contributor

ryao commented Apr 7, 2017

@ahrens What do you think about this?

@ahrens
Member

ahrens commented Apr 7, 2017

@ryao This thread is pretty massive. Can you summarize what's being proposed, or point me to the specific comment that is the proposal?

@DeHackEd
Contributor

DeHackEd commented Apr 7, 2017

Super-short version: #5926 - Raise zfs_vdev_async_write_min_active from 1 to 2

Reasoning:
Resilver performance has been pretty abysmal for a lot of users with rotational drives. What appears to be the cause is that the IO scheduler only dispatches 1 async write to the drive at a time. For a linear write, by the time the drive completes the operation the read/write head has already passed the start of the next sector, so it has to wait for another rotation before it can service the next write. The effect is essentially 1 write per disk rotation.

A simple solution is to raise zfs_vdev_async_write_min_active from 1 to 2. This tends to keep the drive fed much better and the increase in latency is small-ish (4ms to 6ms based on measurements).
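For anyone tuning this by hand in the meantime, the setting can also be made persistent across reboots with a module options file (path and value shown only as an example):

# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_async_write_min_active=2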

@jwittlincohen
Contributor

jwittlincohen commented Apr 7, 2017

@DeHackEd Don't you mean zfs_vdev_async_write_min_active and not zfs_vdev_async_write_max_active? Your commit at #5926 refers to min_active, as does the testing performed in this thread.

@DeHackEd
Contributor

DeHackEd commented Apr 7, 2017

Whoops. My bad. Previous comment edited.

@ahrens
Member

ahrens commented Apr 7, 2017

@DeHackEd Thanks, I left my comments on the PR.

@RubenKelevra
Author

I just dug a bit deeper into my graphs. Well, increasing the async writes does indeed hurt the latency, but only the write latency.

The read latency is very slightly better (reduced). I just have so many writes and so few reads per second that it looks like the overall performance is decreasing.

@richardelling
Contributor

FWIW, in the bad old days of the old write throttle, we found that for HDDs, zfs_vdev_max_pending=2 was far superior to 1, and larger values don't help much until you get too large and HDDs fall over (somewhere between 4 and 35, YMMV). So it is quite reasonable to get a meaningful boost from setting to 2, considering that the resilvering drive satisfies few reads early in the resilver time.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Apr 14, 2017
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue openzfs#4825
Closes openzfs#5926
@RubenKelevra
Author

RubenKelevra commented Apr 23, 2017 via email

@jwittlincohen
Contributor

jwittlincohen commented Apr 23, 2017 via email

@RubenKelevra
Author

@jwittlincohen since I did some of this testing, I've asked @richardelling to confirm these findings.

@ptx0 this sounds interesting, but I have never observed such behavior.

On which hardware did you see these issues?


@richardelling
Contributor

I concur with @RubenKelevra: 2 is a better default than 1.

@behlendorf
Contributor

@RubenKelevra if you don't mind I'd like to close this issue out. We have several other issues open for improving resilver speeds. The following modest improvements have already been made:

  • 06226b5 was merged to master to increase the default value to 2
  • 3d6da72 skips resilver IOs which don't span all child vdevs and need not be resilvered.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jun 29, 2017
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue openzfs#4825
Closes openzfs#5926
@RubenKelevra
Author

Sure!

I think my issue might be fixed as well. Time will tell ;)

tonyhutter pushed a commit that referenced this issue Jul 20, 2017
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #4825
Closes #5926