
Feature Request: Sequential Resilvering (was: Very slow drive-replace) #4825

Closed
RubenKelevra opened this issue Jul 4, 2016 · 50 comments
Labels: Type: Performance (Performance improvement or performance problem)

Comments

@RubenKelevra

RubenKelevra commented Jul 4, 2016

Currently a drive replace is dead slow. I'm currently in the middle of a data migration to a larger storage...

My setup is a RAIDz2 of nine 3 TB SATA HDDs, which is currently degraded to a single disk of redundancy.

I'm currently receiving two filesystems and need to replace two disks, because I started out with two disks borrowed from a friend and now want to use the disks from the old storage system.

The two zfs receives are limited by the internet connection, which is around 95 Mbit/s.

So the rest of the I/O bandwidth, which should be around 60 MB/s x 7 => 420 MB/s, is available for a device replace.

So I've started replacing both drives that need to be swapped, at the same time.

Currently ZFS does not seem to simply copy the data from one drive to the other while writing new data/metadata to both drives; instead it looks like ZFS runs a complete analysis of all data on all disks to work out what has to be stored on this particular new disk.

This is currently dead slow:

# zpool status
  pool: tanka
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jul  3 19:53:09 2016
    110G scanned out of 2.64T at 1.71M/s, 431h38m to go
    23.6G resilvered, 4.05% done
config:

    NAME                      STATE     READ WRITE CKSUM
    tanka                     DEGRADED     0     0     0
      raidz2-0                DEGRADED     0     0     0
        replacing-0           ONLINE       0     0     0
          sdd                 ONLINE       0     0     0
          sdc                 ONLINE       0     0     0  (resilvering)
        sda                   ONLINE       0     0     0
        sde                   ONLINE       0     0     0
        sdb                   ONLINE       0     0     0
        sdf                   ONLINE       0     0     0
        replacing-5           ONLINE       0     0     0
          sdg                 ONLINE       0     0     0
          sdl                 ONLINE       0     0     0  (resilvering)
        sdh                   ONLINE       0     0     0
        sdi                   ONLINE       0     0     0
        15843216295752979018  UNAVAIL      0     0     0  was /tmp/fakedrive.img

errors: No known data errors

zfs list shows that 2.97T is used, which works out to roughly 432 GB per disk. Since each disk is capable of reading/writing around 90 MB/s raw, a 1:1 copy should run at 60 MB/s on average at worst.

So my expectation was that sdd is copied to sdc while sdg is copied to sdl, both at 60 MB/s, so this should be done in about 2 hours. Instead I'm now facing 450 hours of runtime, roughly 225 times longer than expected...

@RubenKelevra
Author

I've just changed my procedure:
I've stopped both replaces on the running disks - I suspected there was too much load on the disks that are currently active members.

Now I'm trying to add the missing disk first, which should put the least load on the whole array - because all disks provide the data for the missing disk at once. Now I would expect a recovery rate of nearly 90 MB/s.

But the result is much worse than that: as before, the whole array is scanned. It's twice as fast, but still far away from the expected speed:

90 MB/s expected on 432 GB of data per disk => about 1:22 h recovery time.

Got 3 MB/s on 3.76T ... => 15 days 12:39 h expected recovery time.

# zpool status
  pool: tanka
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul  4 15:16:49 2016
    1,28G scanned out of 3,76T at 3,07M/s, 355h54m to go
    143M resilvered, 0,03% done
config:

    NAME                        STATE     READ WRITE CKSUM
    tanka                       DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        sdd                     ONLINE       0     0     0
        sda                     ONLINE       0     0     0
        sde                     ONLINE       0     0     0
        sdb                     ONLINE       0     0     0
        sdf                     ONLINE       0     0     0
        sdg                     ONLINE       0     0     0
        sdh                     ONLINE       0     0     0
        sdi                     ONLINE       0     0     0
        spare-8                 OFFLINE      0     0     0
          15843216295752979018  OFFLINE      0     0     0  was /tmp/fakedrive.img
          sdc                   ONLINE       0     0     0  (resilvering)
    spares
      sdl                       AVAIL   
      sdc                       INUSE     currently in use

errors: No known data errors

@RubenKelevra
Author

The system is running Arch Linux with ZFS 0.6.5.7 on kernel 4.6.3.

The system has an i5 6xxx quad-core processor without HT and 32 GB of DDR3 RAM.

The 3 TB disks are given to ZFS as whole disks; the system boots from a dedicated USB device.

@dasjoe
Contributor

dasjoe commented Jul 4, 2016

Resilver speed depends on your disks' fragmentation, and thus usually varies over the whole resilver. 3 MB/s for a single vdev is well above the expected minimum speed.

You're looking at a single disk's IOPS, so even slight fragmentation leads to a lower resilver speed.
Let's assume 90 IOPS, so your pool can read at least 90 blocks per second. A minimum block size of 4 KB leads to 360 KB/s, which shows that your pool's average blocks are larger or the pool is not fully fragmented.

You may have success following http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/ and tuning zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active.
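For reference, these tunables can be inspected and changed at runtime through /sys/module/zfs/parameters; the values below are only illustrative, not recommendations:

# current values
cat /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

# allow more concurrent scrub/resilver I/Os per vdev (example values)
echo 3 > /sys/module/zfs/parameters/zfs_vdev_scrub_min_active
echo 10 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active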

Also, this is not a discussion forum but a bug tracker, so this would be better on zfs-discuss.

@RubenKelevra
Author

RubenKelevra commented Jul 5, 2016

This is the performance of a completely idle storage:

# zpool status
  pool: tanka
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jul  4 15:16:49 2016
    229G scanned out of 3,76T at 3,72M/s, 276h10m to go
    25,2G resilvered, 5,96% done
config:

    NAME                        STATE     READ WRITE CKSUM
    tanka                       DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        sdd                     ONLINE       0     0     0
        sda                     ONLINE       0     0     0
        sde                     ONLINE       0     0     0
        sdb                     ONLINE       0     0     0
        sdf                     ONLINE       0     0     0
        sdg                     ONLINE       0     0     0
        sdh                     ONLINE       0     0     0
        sdi                     ONLINE       0     0     0
        spare-8                 OFFLINE      0     0     0
          15843216295752979018  OFFLINE      0     0     0  was /tmp/fakedrive.img
          sdc                   ONLINE       0     0     0  (resilvering)
    spares
      sdl                       AVAIL   
      sdc                       INUSE     currently in use

errors: No known data errors

Actually this is a completely new pool. We only started importing data 4 days ago, so I don't think there is any fragmentation at all. And that was my point: ZFS does not need to read the data along its fragmentation. ZFS just needs a bitmap of data/no-data blocks per device; once that is built, it can do a 1:1 copy of the device that needs to be exchanged. Newly written data can simply be written to both disks as well.

This would increase the speed here by around 112,000%.

If this cannot be done, a 1:1 copy of the whole data disk is much faster than the ZFS approach, and that is what I'm actually doing right now, because the ZFS way of replacing a disk is completely broken from my point of view.

Also, this is not a discussion forum but a bug tracker, so this would be better on zfs-discuss.

This is a bug report, not a discussion. This is broken; whether it's broken is not up for discussion.

Sidequestion:

Resilver speed depends on your disks' fragmentation, and thus usually varies over the whole resilver. 3 MB/s for a single vdev is well above the expected minimum speed.

What can I actually do about fragmentation of such a pool? Defragmentation is not supported by ZFS at all. See #4785

@RubenKelevra
Author

I've tried your zfs_vdev_scrub_min_active and zfs_vdev_scrub_max_active suggestion; the resilver speed increased to an average of 6 MB/s on the recovering device ... which still seems fairly slow. All disks are running at about 9-12% busy and the recovering device is a bit higher, at around 60% on average.

Also, the written data seems very scattered, since the disk is 60% busy but ZFS only writes 6 MB/s.

@mailinglists35

mailinglists35 commented Jul 5, 2016

This issue was previously discussed in #1110.

"@behlendorf commented on Nov 29, 2012
@mattlqx Unfortunately, if you have a lot of small files in your pool it's the norm. It's also not really acceptable for the enterprise so there is a design for a fast resilver feature floating around which just needs to be implemented."

Does anyone know what the status of the fast resilver feature is?

@mailinglists35

mailinglists35 commented Jul 5, 2016

from https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/

"RAIDz resilvering is very slow in OpenZFS-based zpools. It's a lot better in Solaris, though still not as good as mirroring. Basically, it starts with every transaction that's ever happened in the pool and plays them back one-by-one to the new drive. This is very IO-intensive. If you're using hard drives larger than 1TB and you are using OpenZFS, use mirror, not RAIDz*. From a certain point of view, one might think that RAIDz's only legitimate use case in a post-2015 world is for all-SSD pools.
[...]
Disclaimer: I'm an Oracle employee. [...]"

further on

"Solaris tweaked this a lot: Sequential Resilvering. The previous resilvering algorithm repairs blocks from oldest to newest, which can degrade into a lot of small random I/O. The new resilvering algorithm uses a two-step process to sort and resilver blocks in LBA order. The amount of improvement depends on how pool data is laid out. For example, sequentially written data on a mirrored pool shows no improvement, but randomly written data or sequentially written data on RAID-Z improves significantly - typically reducing time by 25 to 50 percent."

@RubenKelevra
Author

RubenKelevra commented Jul 6, 2016

I'm now fixing this issue with dd, a 1:1 copy. I learned that the 3 TB disks do 200 MB/s, so replacing two disks now takes 4 h, not 15 1/2 days like ZFS would need.

Since I'm writing the full 3 TB and not just 450 GB, Sequential Resilvering has a much higher potential here than just 25-50%.
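For completeness, a rough sketch of the dd-based 1:1 copy described above (device names are only placeholders; this is only reasonable with the pool exported, and the old disk has to stay detached afterwards, since both disks then carry identical ZFS labels):

zpool export tanka
dd if=/dev/OLD_DISK of=/dev/NEW_DISK bs=1M status=progress
# physically remove or detach the old disk, then:
zpool import tanka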

@RubenKelevra RubenKelevra changed the title Very slow drive-replace Feature Request: Sequential Resilvering (was: Very slow drive-replace) Jul 6, 2016
@ronnyegner

ronnyegner commented Jul 7, 2016

Hi,

I am also running a RAIDz2 (8x 3 TB) and a RAIDz3 (12x 4 TB), and recently one of the disks in each pool failed. On both pools the resilver speed is between 800 and 1000 MB/s.

The pools contain mainly ZVOLs with a 1 MB volblocksize and a few bigger file systems with relatively large files (averaging between a few MB and a few GB).

The settings I use to favor resilvering over other I/O are:

echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 8000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms

Ronny

@Rayn0r

Rayn0r commented Mar 25, 2017

I replaced an HDD in a two-disk mirror on a test system today, during evaluation tests.
The machine is running Ubuntu 16.04 LTS with kernel 4.4.0-64. The pool was filled with 1.4 TB of data just yesterday. Resilvering then ran at 10.5-11.6 MiB/s on a 4 TB Seagate Ironwolf with no more than 93 IOPS according to "zpool iostat datengrab 10".

After setting /sys/module/zfs/parameters/zfs_vdev_async_write_min_active from 1 to 8, it ran at 140-160 MiB/s and IOPS jumped to over 1000.
The values mentioned above by @ronnyegner had no effect on re-sync speed here.
It seems as if the default value of zfs_vdev_async_write_min_active "artificially" slows down writes and even causes iostat to assume that the drive is working at its limit, probably because w_await and svctm are almost identical. See below:

zpool status -v
  pool: datengrab
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Mar 25 21:21:33 2017
    400G scanned out of 1.41T at 43.6M/s, 6h50m to go
    400G resilvered, 27.59% done
config:

        NAME        STATE     READ WRITE CKSUM
        datengrab   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc1    ONLINE       0     0     0
            sde1    ONLINE       0     0     0  (resilvering)

iostat -xm /dev/sdc1 /dev/sde1 10
delivers the following output with zfs_vdev_async_write_min_active set to 1:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.30    0.35    0.91    0.91    0.00   97.53

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde1              0.00     0.00    0.00   90.30     0.00    10.85   246.07     1.02   11.28    0.00   11.28  11.03  99.64
sdc1             24.60     0.00   72.50    2.40    12.05     0.03   330.28     1.33   17.74   17.67   19.67   1.87  14.04

After setting it to 8 it looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.23    0.10    5.21    1.28    0.00   92.18

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sde1              0.00     0.00    0.00 1340.50     0.00   165.34   252.60     7.75    5.78    0.00    5.78   0.74  98.88
sdc1              0.00     0.00 1348.70    8.30   167.31     0.21   252.81     2.12    1.57    1.34   37.64   0.71  96.44

Even svctm dropped from 11.03ms to 0.74ms... Amazing!

@DeHackEd
Contributor

DeHackEd commented Mar 26, 2017

That zfs_vdev_async_write_min_active=1 thing is known. I think the minimum needs to be set to 2.

What I believe is happening is that ZFS only gives drives 1 write operation at a time. When it finishes a write, ZFS sends a new one for the next sector, but the platter has already rotated past the start of that sector and must wait for (effectively) a full rotation to do the write. With 2 outstanding writes the drive can be kept fed with sequential write operations.
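A rough back-of-the-envelope illustration of that effect (numbers assumed for a 7200 RPM drive with ~128 KB writes, purely illustrative):

7200 RPM                => ~8.3 ms per rotation
1 outstanding write     => worst case ~1 write per rotation => ~120 writes/s
120 writes/s x 128 KB   => ~15 MB/s
2+ outstanding writes   => back-to-back sequential writes, much closer to the drive's raw streaming rate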

@Rayn0r

Rayn0r commented Mar 26, 2017

@DeHackEd
The property's name is zfs_vdev_async_write_min_active. In my understanding "async" means that it should not wait for the data to be written to disk, but just feed it to the write cache and let it handle the rest.

From what I read here, the write cache is not enabled if ZFS does not handle the entire disk. But I also found an (almost 5-year-old) post mentioning that the ioctls needed for this are not supported on Linux.
I then checked hdparm.conf and found that we had purposely turned off the write cache for all drives to mitigate FS corruption in case of a power/PSU failure.

With 2 or more write operations in parallel one is probably working around the resulting "write lag" caused by the disabled cache.
Setting zfs_vdev_async_write_min_active above 3 does not result in any further performance gain.

Setting zfs_vdev_async_write_min_active=1 and enabling the write cache with hdparm -W1 /dev/sde results in the same write performance as the disabled write cache with zfs_vdev_async_write_min_active=3 (145-165 MiB/s).
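For anyone who wants to check the same thing on their own drives, hdparm can query and toggle the volatile write cache (device name is just an example):

hdparm -W  /dev/sdX   # show whether the write cache is enabled
hdparm -W1 /dev/sdX   # enable it
hdparm -W0 /dev/sdX   # disable it again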

@DeHackEd
Contributor

But ZFS has its own IO scheduler which will only keep the drive fed with 1 write request at a time if the request is internally categorized as ASYNC. While resilvering doesn't need to wait for a write to happen, the IO scheduler is spoon-feeding the drive when really it wants a continuous stream of data.

@Rayn0r

Rayn0r commented Mar 26, 2017

This is what I am saying... ASYNC means that writes are sent to the disk without waiting for the drive to confirm the write.
If there is a write cache available, this spoon-feeding seems to work pretty well, since the cache buffers all write requests and the drive's firmware handles the rest.
As soon as you turn off the cache, you need to wait until the head has reached a position over the next sector that needs to be written. This would also explain the 11 ms service time (this is probably the head's seek time) in the iostat -xm... output.

@RubenKelevra
Author

@Rayn0r that sounds reasonable.

But in my case the whole disks were added to the pool, and I did not play with any hdparm commands on this machine, so I would expect that the write caches are on or are turned on by ZFS.

I think my case was just many, many transactions imported from two hypervisor filesystems. But since the pool was completely empty before, this should not lead to heavy fragmentation.

Maybe we also have an issue where filesystems imported from foreign systems end up heavily fragmented as they are written, or reading back all those transactions hits a bottleneck while resilvering...

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Mar 27, 2017
@behlendorf behlendorf added this to the 0.7.0 milestone Mar 27, 2017
@behlendorf
Contributor

@Rayn0r do you have data for setting zfs_vdev_async_write_min_active=2? I saw above that you posted results for 1 and 3. Ideally, we want to set this value as low as possible while still maintaining good performance.

@Rayn0r

Rayn0r commented Mar 28, 2017

I did a complete re-sync with zfs_vdev_async_write_min_active=2 this morning. Note that the pool usage has increased by 220GiB over the last 2 days.


   eid: 39
 class: resilver.finish
  host: ilpss8
  time: 2017-03-28 11:23:47+0200
  pool: datengrab
 state: ONLINE
  scan: resilvered 1.64T in 4h39m with 0 errors on Tue Mar 28 11:23:47 2017
config:

	NAME        STATE     READ WRITE CKSUM
	datengrab   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc1    ONLINE       0     0     0
	    sde1    ONLINE       0     0     0

errors: No known data errors

This is the test from the weekend with zfs_vdev_async_write_min_active=3:


   eid: 33
 class: resilver.finish
  host: ilpss8
  time: 2017-03-26 16:50:05+0200
  pool: datengrab
 state: ONLINE
  scan: resilvered 1.42T in 2h59m with 0 errors on Sun Mar 26 16:50:05 2017
config:

	NAME        STATE     READ WRITE CKSUM
	datengrab   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sdc1    ONLINE       0     0     0
	    sde1    ONLINE       0     0     0

errors: No known data errors

@RubenKelevra
Author

RubenKelevra commented Mar 29, 2017 via email

@Rayn0r

Rayn0r commented Mar 29, 2017

Is there any disadvantage if the number is higher?

I can only answer the question above...
I tested with values of up to 8 and could not see any performance improvement over 3.
The data rate did not drop while increasing the number.

I'd suggest that you also run some tests to identify the impact of increasing the number.

@RubenKelevra
Author

RubenKelevra commented Mar 30, 2017 via email

@RubenKelevra
Author

RubenKelevra commented Mar 30, 2017

I may have some example data for you guys to analyze. I changed the setting on three hypervisors (after the gap in the graphs the setting is 8 instead of 1). Nothing else has changed in the ZFS settings or on the systems. All servers have been up for weeks.

So it seems you get a slight reduction in utilization of 5-10%, while it might improve the responsiveness of the disk.

The big difference between loki and ra is that loki hosts two database servers, one RRD-based and one MySQL-based, so you get a lot of small random writes to the disk, while ra has a lot more normal load like web servers and application servers.

I added some hardware specs at the end.

juno-diskstats_iops-week
juno-diskstats_latency-week
juno-diskstats_throughput-week
juno-diskstats_utilization-week
loki-diskstats_iops-week
loki-diskstats_latency-week
loki-diskstats_throughput-week
loki-diskstats_utilization-week
ra-diskstats_iops-week
ra-diskstats_latency-week
ra-diskstats_throughput-week
ra-diskstats_utilization-week

loki:
8 AMD cores (not very powerful and constantly under load)
two hdds via sata (1 TB)
32 gigs of ram (6-8 gig available for zfs/cache-use)
zfs on partitions and hdd write-cache ON

ra:
8 cores intel (a bit more powerful than loki)
16 gigs of ram (5-7 gigs available for zfs/cache-use)
two hdds via sata (2 TB)
zfs on partitions and hdd write-cache ON

juno
2x6 cores intel (very powerful xeons)
128 gigs of ram (not even half full)
two hdds (4 TB) via raid-controller with a large cache
raid-controller write-cache is off, hdd write cache - I have no idea
zfs on partitions

@Rayn0r

Rayn0r commented Mar 30, 2017

Did you have an ear on the HDDs? I was wondering if they seek very much.

At a data rate of over 140 MiB/s, it is highly unlikely that the drives are seeking a lot. The rack is too loud to hear the drives over the fans of the computer, the UPS, and the switch.

PS: I feel like I have somehow hijacked this issue now. The subject is still: "Feature Request: Sequential Resilvering"...

@RubenKelevra
Author

RubenKelevra commented Mar 30, 2017 via email

@RubenKelevra
Author

Well, I worked again on one of the VMs on the loki hypervisor and I must admit that the machine is overall A LOT snappier.

I looked at different stats and the machine now has more network throughput. This explains why the IRQs are now around 12% higher. The CPU usage has also increased, which is a good thing.

I still have no idea what the machine's bottleneck is, but I guess it's just a CPU-RAM bottleneck. In any case, the "8" works fine on this machine.

If there are concerns about 8, I can try out 3 for you.

@behlendorf ?

@behlendorf
Contributor

@RubenKelevra it would be great if you could try out values of 2 and 3. The specific concern here is latency, which can be seen in the nice graphs you generated.

latency

The ZFS I/O scheduler is trying to strike a good balance between latency and throughput in order to achieve consistent performance. The theory of operation is fully described in the ZFS I/O SCHEDULER section of the zfs-module-parameters man page which I've included below.

To summarize, we should consider increasing zfs_vdev_async_write_min_active, but we want to increase it as little as possible to minimize the negative impact on latency. If we can get the majority of the benefit from increasing it to 2 or 3, that would be best. We may also want to consider decreasing zfs_vdev_async_write_active_min_dirty_percent so the zfs_vdev_async_write_min_active cap is raised more quickly.
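For example, to experiment along those lines at runtime (values are only illustrative, not proposed defaults):

echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 10 > /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent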

ZFS I/O SCHEDULER
       ZFS issues I/O operations to leaf vdevs to satisfy and  complete  I/Os.
       The  I/O  scheduler  determines when and in what order those operations
       are issued.  The I/O scheduler divides operations into five I/O classes
       prioritized  in the following order: sync read, sync write, async read,
       async write, and scrub/resilver.  Each queue defines  the  minimum  and
       maximum  number  of  concurrent  operations  that  may be issued to the
       device.   In  addition,  the   device   has   an   aggregate   maximum,
       zfs_vdev_max_active.  Note  that the sum of the per-queue minimums must
       not exceed the aggregate maximum.  If the sum of the per-queue maximums
       exceeds the aggregate maximum, then the number of active I/Os may reach
       zfs_vdev_max_active, in which case  no  further  I/Os  will  be  issued
       regardless of whether all per-queue minimums have been met.

       For many physical devices, throughput increases with the number of con‐
       current operations, but latency typically  suffers.  Further,  physical
       devices typically have a limit at which more concurrent operations have
       no effect on throughput or can actually cause it to decrease.

       The scheduler selects the next operation to issue by first looking  for
       an  I/O class whose minimum has not been satisfied. Once all are satis‐
       fied and the aggregate maximum has not been hit,  the  scheduler  looks
       for classes whose maximum has not been satisfied. Iteration through the
       I/O classes is done in the order specified above. No further operations
       are issued if the aggregate maximum number of concurrent operations has
       been hit or if there are no operations queued for an I/O class that has
       not  hit its maximum.  Every time an I/O is queued or an operation com‐
       pletes, the I/O scheduler looks for new operations to issue.

       In general, smaller max_active's will lead to lower latency of synchro‐
       nous  operations.   Larger  max_active's  may  lead  to  higher overall
       throughput, depending on underlying storage.

       The ratio of the queues' max_actives determines the balance of  perfor‐
       mance   between   reads,   writes,   and   scrubs.    E.g.,  increasing
       zfs_vdev_scrub_max_active will cause the scrub or resilver to  complete
       more  quickly,  but  reads  and writes to have higher latency and lower
       throughput.

       All I/O classes have a fixed maximum number of  outstanding  operations
       except  for  the  async  write class. Asynchronous writes represent the
       data that is committed to stable storage during the syncing  stage  for
       transaction groups. Transaction groups enter the syncing state periodi‐
       cally so the number of queued async writes will quickly  burst  up  and
       then  bleed down to zero. Rather than servicing them as quickly as pos‐
       sible, the I/O scheduler changes the maximum  number  of  active  async
       write  I/Os  according  to the amount of dirty data in the pool.  Since
       both throughput and latency typically increase with the number of  con‐
       current  operations issued to physical devices, reducing the burstiness
       in the number of concurrent operations  also  stabilizes  the  response
       time  of  operations  from  other  --  and in particular synchronous --
       queues. In broad strokes, the I/O scheduler will issue more  concurrent
       operations from the async write queue as there's more dirty data in the
       pool.

       Async Writes

       The number of concurrent operations issued  for  the  async  write  I/O
       class  follows a piece-wise linear function defined by a few adjustable
       points.

              |              o---------| <-- zfs_vdev_async_write_max_active
         ^    |             /^         |
         |    |            / |         |
       active |           /  |         |
        I/O   |          /   |         |
       count  |         /    |         |
              |        /     |         |
              |-------o      |         | <-- zfs_vdev_async_write_min_active
             0|_______^______|_________|
              0%      |      |       100% of zfs_dirty_data_max
                      |      |
                      |      `-- zfs_vdev_async_write_active_max_dirty_percent
                      `--------- zfs_vdev_async_write_active_min_dirty_percent

       Until the amount of dirty data exceeds  a  minimum  percentage  of  the
       dirty data allowed in the pool, the I/O scheduler will limit the number
       of concurrent operations to the minimum. As that threshold is  crossed,
       the  number  of  concurrent operations issued increases linearly to the
       maximum at the specified maximum percentage of the dirty  data  allowed
       in the pool.

       Ideally,  the  amount  of  dirty  data  on a busy pool will stay in the
       sloped        part        of        the        function         between
       zfs_vdev_async_write_active_min_dirty_percent                       and
       zfs_vdev_async_write_active_max_dirty_percent. If it exceeds the  maxi‐
       mum  percentage,  this  indicates  that  the  rate  of incoming data is
       greater than the rate that the backend  storage  can  handle.  In  this
       case,  we  must  further  throttle incoming writes, as described in the
       next section.

@RubenKelevra
Author

RubenKelevra commented Apr 4, 2017 via email

@RubenKelevra
Author

RubenKelevra commented Apr 4, 2017 via email

@RubenKelevra
Author

Oh yes, GitHub's JavaScript has failed now; I'm sorry, but I think it's still readable.

@behlendorf
Contributor

image

To be clear, the relevant section of the graph is the rightmost portion, where latency increases linearly with larger zfs_vdev_async_write_min_active values. And from the previous rebuild testing, a value of 2 significantly helped rebuild times, 3 yielded an additional meaningful improvement, and there wasn't much gain after that.

Given this data I think we should adopt the performance tweak proposed by @DeHackEd in PR #5926 and increase the default zfs_vdev_async_write_min_active to 2 to keep the drive fed.

@RubenKelevra
Author

Well, 3 looked a bit more promising, since it could speed up the rebuild even more.

132.22 MB/s vs 97.97 MB/s rebuild speed - a 34.96% speedup.

Since this should also affect random writes, a performance increase in this range should be discussed, I think.

It should also be mentioned that my graphs show the latency for devices with the write cache ON, while he has devices with the write cache OFF. So my performance gain was negligible, because the write cache of the devices is doing most of the job here.

If it would be helpful, I can rerun my setup with different values with the device cache turned off.

The performance impact was so heavy when I set up my rigs that I was unable to run them without the write cache, even though this may cause data loss on a crash, because I use ZFS on partitions.

@behlendorf
Contributor

behlendorf commented Apr 5, 2017

Sure, I can see a case for that. It depends on whether you value bandwidth or latency more. Here's approximately the data I see from the comments above.

zfs_vdev_async_write_min_active    b/w (MB/s)    latency (ms)
1                                    11           ~4
2                                    97           ~6
3                                   132           ~8
8                                  ~132          ~16

@RubenKelevra as long as the controllers and drives honor cache flushes it's safe, and preferable, to run with the write cache enabled.

@RubenKelevra
Author

RubenKelevra commented Apr 5, 2017 via email

@DeHackEd
Contributor

DeHackEd commented Apr 5, 2017

The IO scheduler was originally tuned with rotational media in mind. Resilvering performance on an SSD will also be improved by higher queue depths, but not for the same reason as on hard drives, where overshooting the next sector requires waiting for the next rotation.

I have seen a few other instances of this happening in the IRC channel. I won't quote any here. While it's anecdotal compared to the tests done here, it sounds like a major win for minor performance losses.

SSDs should be reconfigured from defaults anyway.

@RubenKelevra
Author

RubenKelevra commented Apr 5, 2017 via email

@ryao
Contributor

ryao commented Apr 7, 2017

@ahrens What do you think about this?

@ahrens
Member

ahrens commented Apr 7, 2017

@ryao This thread is pretty massive. Can you summarize what's being proposed, or point me to the specific comment that is the proposal?

@DeHackEd
Contributor

DeHackEd commented Apr 7, 2017

Super-short version: #5926 - Raise zfs_vdev_async_write_min_active from 1 to 2

Reasoning:
Resilver performance has been pretty abysmal for a lot of users with rotational drives. What appears to be the cause is that the IO scheduler only dispatches 1 async write to the drive at a time. For a linear write, by the time the drive completes the operation the read/write head has already passed the start of the next sector, so it has to wait for another rotation before it can service the next write. The effect is essentially 1 write per disk rotation.

A simple solution is to raise zfs_vdev_async_write_min_active from 1 to 2. This tends to keep the drive fed much better and the increase in latency is small-ish (4ms to 6ms based on measurements).
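For anyone tuning this by hand in the meantime, the setting can also be made persistent across reboots with a module options file (path and value shown only as an example):

# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_async_write_min_active=2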

@jwittlincohen
Contributor

jwittlincohen commented Apr 7, 2017

@DeHackEd Don't you mean zfs_vdev_async_write_min_active and not zfs_vdev_async_write_max_active? Your commit at #5926 refers to min_active, as does the testing performed in this thread.

@DeHackEd
Contributor

DeHackEd commented Apr 7, 2017

Whoops. My bad. Previous comment edited.

@ahrens
Member

ahrens commented Apr 7, 2017

@DeHackEd Thanks, I left my comments on the PR.

@RubenKelevra
Author

I just dug a bit deeper into my graphs. Well, increasing the async writes does indeed hurt the latency, but only the write latency.

The read latency is very slightly better (reduced). I just have so many writes and so few reads per second that it looks like the overall performance is decreasing.

@richardelling
Contributor

FWIW, in the bad old days of the old write throttle, we found that for HDDs, zfs_vdev_max_pending=2 was far superior to 1, and larger values don't help much until you get too large and HDDs fall over (somewhere between 4 and 35, YMMV). So it is quite reasonable to get a meaningful boost from setting to 2, considering that the resilvering drive satisfies few reads early in the resilver time.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Apr 14, 2017
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue openzfs#4825
Closes openzfs#5926
@RubenKelevra
Author

RubenKelevra commented Apr 23, 2017 via email

@jwittlincohen
Contributor

jwittlincohen commented Apr 23, 2017 via email

@RubenKelevra
Author

@jwittlincohen since I did some of this testing, I've asked @richardelling to confirm these findings.

@ptx0 this sounds interesting, but I have never observed such behavior.

On which hardware did you see these issues?


@richardelling
Contributor

I concur with @RubenKelevra: 2 is a better default than 1.

@behlendorf
Contributor

@RubenKelevra if you don't mind I'd like to close this issue out. We have several other issues open for improving resilver speeds. The following modest improvements have already been made:

  • 06226b5 was merged to master to increase the default value to 2
  • 3d6da72 skips resilver IOs which don't span all child vdevs and need not be resilvered.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Jun 29, 2017
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue openzfs#4825
Closes openzfs#5926
@RubenKelevra
Author

Sure!

I think my issue might be fixed as well. Time will tell ;)

tonyhutter pushed a commit that referenced this issue Jul 20, 2017
Resilver operations frequently cause only a small amount of dirty data
to be written to disk at a time, resulting in the IO scheduler to only
issue 1 write at a time to the resilvering disk. When it is rotational
media the drive will often travel past the next sector to be written
before receiving a write command from ZFS, significantly delaying the
write of the next sector.

Raise zfs_vdev_async_write_min_active so that drives are kept fed
during resilvering.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #4825
Closes #5926