Feature Request: Sequential Resilvering (was: Very slow drive-replace) #4825
Comments
I've just changed my procedure: I now add the missing disk in advance, which should put the least load on the whole array, because all disks provide the data for the missing disk at once. I would expect a recovery rate of nearly 90 MB/s. The result is much worse than that: as before, the whole array is scanned. It is twice as fast as before, but still far from the expected speed. Expected: 90 MB/s on 432 GB of data per disk => 1:22 h recovery time. Got: 3 MB/s on 3,76T ... => 15 days 12:39 h expected recovery time.
|
The system runs Arch Linux with ZFS version 0.6.5.7 and kernel version 4.6.3. It has an i5 6xxx quad-core processor without HT and 32 GB DDR3 RAM. The 3 TB disks are added to ZFS as whole disks; the system boots from a dedicated USB device. |
Resilver speed depends on your disks' fragmentation, and thus usually varies over the whole resilver. 3 MB/s for a single vdev is well above the expected minimum speed. You're looking at a single disk's IOPS, so even slight fragmentation leads to a lower resilver speed. You may have success following http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/. Also, this is a bug tracker, not a discussion forum, so this would be better on zfs-discuss. |
This is the performance of a completely idle storage system:
Actually this is a completely new pool. We just started to import data 4 days ago, so I don't think there is any fragmentation at all. Besides, this was my point: ZFS does not need to read the data in fragmented order. ZFS just needs a bitmap of data and no-data blocks per device; once that is complete it can do a 1:1 copy of the device that needs to be exchanged. Newly written data can simply be written to both disks. This would increase the speed here by around 112.000%. If this cannot be done, a 1:1 copy of the whole data disk is much faster than the ZFS approach. That is what I'm actually doing right now, because the ZFS way of replacing a disk is completely broken from my point of view.
This is a bug report, not a discussion. It is broken; whether it is broken is not up for discussion. Side question:
What can I actually do against fragmentation of such a pool? Defragmentation is not supported by ZFS at all. See #4785 |
I've tried your Else the written Data seems very cluttered, since the disk is 60% busy but zfs only writes 6 MB/s. |
This issue was previously discussed in #1110: "@behlendorf commented on Nov 29, 2012". Does anyone know the status of the fast resilver feature? |
From https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/: "RAIDz resilvering is very slow in OpenZFS-based zpools. It's a lot better in Solaris, though still not as good as mirroring. Basically, it starts with every transaction that's ever happened in the pool and plays them back one-by-one to the new drive. This is very IO-intensive. If you're using hard drives larger than 1TB and you are using OpenZFS, use mirror, not RAIDz*. From a certain point of view, one might think that RAIDz's only legitimate use case in a post-2015 world is for all-SSD pools." Further on: "Solaris tweaked this a lot: Sequential Resilvering. The previous resilvering algorithm repairs blocks from oldest to newest, which can degrade into a lot of small random I/O. The new resilvering algorithm uses a two-step process to sort and resilver blocks in LBA order. The amount of improvement depends on how pool data is laid out. For example, sequentially written data on a mirrored pool shows no improvement, but randomly written data or sequentially written data on RAID-Z improves significantly - typically reducing time by 25 to 50 percent." |
I'm now fixing this issue with dd, a 1:1 copy. I learned that the 3 TB disks do 200 MB/s, so replacing two disks now takes 4 h instead of the 15 1/2 days ZFS would need. Since I'm writing 3 TB, not just 450 GB, sequential resilvering has much more potential than just 25-50%. |
Hi, I am also running a RAIDz2 (8x 3 TB) and a RAIDz3 (12x 4 TB), and recently one of the disks in each pool failed. On both pools the resilver speed is between 800 and 1000 MB/s. The pools contain mainly ZVOLs with 1 MB volblocksize and a few bigger file systems with relatively large files (on average between a few MB and a few GB). The settings I use to favor resilvering over other I/O are:
Ronny |
I replaced an HDD in a mirror consisting of two disks today on a test system, during evaluation tests. After setting /sys/module/zfs/parameters/zfs_vdev_async_write_min_active from 1 to 8, it ran at 140-160MiB/s and IOPs jumped to over 1000.
After setting it to 8 it looks like this:
Even svctm dropped from 11.03ms to 0.74ms... Amazing! |
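For reference, the change described above amounts to writing a number into a sysfs file. A minimal Python sketch, assuming root privileges and a loaded zfs module; the helper names `read_param` and `write_param` are made up for illustration, and the `params_dir` argument exists only so the helpers can be tried against a scratch directory:

```python
from pathlib import Path

# Directory taken from the comment above; it only exists while the zfs module is loaded.
ZFS_PARAMS = Path("/sys/module/zfs/parameters")


def read_param(name, params_dir=ZFS_PARAMS):
    """Return the current integer value of a ZFS module parameter."""
    return int((Path(params_dir) / name).read_text().strip())


def write_param(name, value, params_dir=ZFS_PARAMS):
    """Set a ZFS module parameter (needs root) and return the previous value."""
    previous = read_param(name, params_dir)
    (Path(params_dir) / name).write_text(f"{int(value)}\n")
    return previous


# Example (as root, hypothetical session):
#   old = write_param("zfs_vdev_async_write_min_active", 8)
```

A value written this way only lasts until the module is reloaded or the machine reboots, so it is convenient for experiments like the one above.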
That What I believe is happening is that ZFS only gives drives 1 write operation at a time. When it finishes a write, ZFS sends a new one for the next sector, but the platter has already rotated past the start of the sector and must wait for (effectively) a full rotation to do the write. With 2 active outstanding writes the drive can be kept fed sequential write operations. |
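A back-of-the-envelope model of this explanation, as a sketch only. It assumes a 7200 rpm drive and 128 KiB per write (neither figure comes from this thread), and that with a single outstanding write every write costs one full rotation:

```python
def rotation_ms(rpm):
    """Time for one full platter rotation in milliseconds."""
    return 60_000.0 / rpm


def missed_rotation_throughput(write_bytes, rpm):
    """MB/s when every write has to wait one full rotation (queue depth 1)."""
    writes_per_second = 1000.0 / rotation_ms(rpm)
    return write_bytes * writes_per_second / 1e6


ASSUMED_RPM = 7200           # assumption, not from the thread
ASSUMED_WRITE = 128 * 1024   # assumption: 128 KiB per write

print(f"one rotation: {rotation_ms(ASSUMED_RPM):.2f} ms")
print(f"queue depth 1: {missed_rotation_throughput(ASSUMED_WRITE, ASSUMED_RPM):.1f} MB/s")
```

Under these assumptions a drive that can stream well over 100 MB/s would be held to a small fraction of that. The svctm of 11.03 ms reported before the change is in the same range as one rotation at 7200 rpm, which is at least consistent with the explanation.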
@DeHackEd From what I read here, the write cache is not enabled if ZFS does not handle the entire disk. But I also found an (almost 5-year-old) post mentioning that the ioctls needed for this are not supported by Linux. With 2 or more write operations in parallel one is probably circumventing the resulting "write lag" caused by the disabled cache. Setting |
But ZFS has its own IO scheduler which will only keep the drive fed with 1 write request at a time if the request is internally categorized as ASYNC. While resilvering doesn't need to wait for a write to happen, the IO scheduler is spoon-feeding the drive when really it wants a continuous stream of data. |
This is what I am saying... ASYNC means that writes are sent to the disk without the need to wait for the drive to confirm the write. |
@Rayn0r that sounds reasonable. But in my case whole disks were added to the pool, and I did not play with any hdparm commands on this machine, so I would expect the write caches to be on, or to be turned on by zfs. I think my case was just a great many transactions imported from two hypervisor filesystems. But since the pool was completely empty before, this should not lead to heavy fragmentation. Maybe we also have an issue with imported foreign filesystems getting heavily fragmented while being written, or reading all the transactions is a bottleneck while resilvering... |
@Rayn0r do you have data for setting the |
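I did a complete re-sync with `zfs_vdev_async_write_min_active=2` this morning. Note that the pool usage has increased by 220GiB over the last 2 days. Result: `scan: resilvered 1.64T in 4h39m`
This is the test from the weekend with `zfs_vdev_async_write_min_active=3`: `scan: resilvered 1.42T in 2h59m`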
|
> I did a complete re-sync with `zfs_vdev_async_write_min_active=2`
> this morning. Note that the pool usage has increased by 220GiB over
> the last 2 days.
> scan: resilvered 1.64T in 4h39m
> `zfs_vdev_async_write_min_active=3`
> scan: resilvered 1.42T in 2h59m
Looks like we want to ask the devices whether the write cache is enabled
and, if not, select 3 instead of 1 or 2. Or something like this.
Is there any disadvantage if the number is higher? I thought NCQ could
only work if the disk has enough in the queue.
Would a higher number hurt atomic updates/replaces or the linear
timeline in edge cases, or would it just increase latency at
demand peaks?
|
I can only answer the above question... I did test with values of up to 8 and could not see any performance improvement over 3. The data rate did not drop while increasing the number. I'd suggest that you also set up some tests to identify the impact of increasing the number. |
> > Is there any disadvantage if the number is higher?
> I can only answer the above question...
> I did test with values of up to 8 and could not see any performance
> improvement over 3. The data rate did not drop while increasing the
> number.
> I'd suggest that you also set up some tests to identify the impact
> of increasing the number.
Thanks! :)
Did you listen to the HDDs? I was wondering whether they seek very much.
I bet they have NCQ, and I was wondering if that is just the improvement
we achieve here: filling the NCQ buffer far enough to get a better
data-per-seek rate.
If that's the case, 3 should be preferable all the time. :)
|
I may have some example data for you guys to analyze. I changed the setting on three hypervisors (after the gap in the graphs the setting is 8, not 1). Nothing else has changed in the ZFS settings or on the systems. All servers have been up for weeks. It seems you get a slight reduction in utilization of 5-10%, while it might increase the response time of the disk. The big difference between loki and ra is that loki hosts two database servers, one RRD-based and one MySQL-based, so it gets a lot of small random writes to disk, while ra has a more ordinary load such as web servers and application servers. I added some hardware specs at the end. loki: ra: juno |
At a data rate of over 140MiB/s, it is highly unlikely that the drives are seeking a lot. The noise inside the rack is too loud to hear the drives over the fans from the computer, UPS, or the switch. PS: I now feel like I have somehow hijacked this issue. The subject still is: "Feature Request: Sequential Resilvering"... |
> > Did you listen to the HDDs? I was wondering whether they seek very
> > much.
> At a data rate of over 140MiB/s, it is highly unlikely that the
> drives are seeking a lot. The noise inside the rack is too loud to
> hear the drives over the fans from the computer, UPS, or the switch.
Yeah, alright, I didn't think about that. My machines are pretty
silent. :D
Alright, so in my case setting it to 8 (maybe 3 as well) helps a bit with
heavy random I/O. Not sure if it's the same effect that helps with
rebuilding. But maybe we should just get the next RC out with 3 and see
what the guys out there say about it? :)
> PS: I now feel like I have somehow hijacked this issue. The subject
> still is: "Feature Request: Sequential Resilvering"...
Don't worry, the main topic was that it's faster to copy an entire
3 TB disk than to use zfs to copy just 400 GB from the old one to the
other. And not just "well, this is twice as fast" but a hell of a lot faster.
So if this little tweak helps others that would be nice, but in my case
the data seems to consist of so many separate random transactions that
it probably won't help much.
Anyway, don't worry :)
|
Well, I worked again on one of the VMs on the loki hypervisor and I must admit that the machine is overall A LOT snappier. I looked at different stats and the machine now has more network throughput. This explains why the IRQs are now around 12% higher. The CPU usage has also increased, which is a good thing. I still have no idea what the machine's bottleneck is, but I guess it's just a CPU-RAM connection bottleneck. But the "8" works fine on this machine. If there are concerns about 8, I might try out 3 for you. |
@RubenKelevra it would be great if you could try out values of 2 and 3. The specific concern here is latency which can be seen in the nice graphs you generated. The ZFS I/O scheduler is trying to strike a good balance between latency and throughput in order to achieve consistent performance. The theory of operation is fully described in the ZFS I/O SCHEDULER section of the To summarize we should consider increasing
|
> @RubenKelevra it would be great if you could try out values of 2 and 3. The specific concern here is latency which can be seen in the nice graphs you generated.
Alright, I scripted the server where you observed the latency issues (which is indeed the most I/O-bound server) to spend one hour on each value from 1 to 8.
Expect a new graph showing the difference in roughly 8 hours. :)
Best regards
Ruben
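The script itself was not posted. A hypothetical reconstruction in Python, assuming the parameter is changed through the sysfs path used earlier in this thread, that each step lasts one hour, and that the script runs as root. `sweep`, `hold_seconds` and the injectable `sleep` are illustrative names, and restoring the original value at the end is a choice of this sketch:

```python
import time
from pathlib import Path

# Path as used earlier in the thread.
PARAM = Path("/sys/module/zfs/parameters/zfs_vdev_async_write_min_active")


def sweep(values, hold_seconds=3600, param=PARAM, sleep=time.sleep):
    """Apply each value in turn, hold it for hold_seconds, then restore the original."""
    param = Path(param)
    original = param.read_text()
    applied = []
    try:
        for value in values:
            param.write_text(f"{value}\n")
            applied.append(value)
            sleep(hold_seconds)
    finally:
        # Put the original setting back even if the run is interrupted.
        param.write_text(original)
    return applied


# Example (as root): sweep(range(1, 9))
```

The sketch does not align the steps to the top of the hour; starting it from a scheduler at a full hour would be one way to get that.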
|
![img_0709](https://cloud.githubusercontent.com/assets/614929/24675042/fb4e8894-197d-11e7-895a-267a7a6183e8.JPG)
So, I think the outcome is pretty clear.
One grey line is one hour; I changed the setting exactly on the hour.
Edit: Sorry, Github seems to drop attachments while answering via mail instead of just adding them as attachments.
|
Oh yes, now JavaScript of Github has failed, I'm sorry but I think it's readable. |
To be clear the relevant section of the graph is that right most portion where latency increases linearly with larger Given this data I think we should adopt the performance tweak proposed by @DeHackEd in PR #5926 and increase the default |
Well, 3 looked a bit more promising since it could speed up the rebuild even more: 132.22 MB/s vs 97.97 MB/s rebuild speed, a 34.96% speedup. Since this should also affect random writes, a performance increase in this range should be discussed, I think. It should also be mentioned that my graphs show the latency for devices with write cache ON, while he has devices with write cache OFF. So my performance gain was negligible because the write cache of the devices is doing most of the job here. If it would be helpful I can rerun my setup with different values and the device cache turned off. When I set up my rigs the performance impact was so heavy that I was unable to run without write cache, even though this may cause data loss on a crash, because I use zfs on partitions. |
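The rebuild speeds quoted here can be reproduced from the two `scan:` lines posted earlier. A small sketch, assuming the `T` suffix is read as a decimal terabyte (that is the reading that matches the figures in this comment; `zpool status` itself may mean binary units). The function name is made up for illustration:

```python
import re

UNIT_MB = {"M": 1.0, "G": 1e3, "T": 1e6}  # decimal units, matching the figures quoted above


def resilver_rate_mb_s(scan_line):
    """Average MB/s from a line like 'scan: resilvered 1.64T in 4h39m'."""
    m = re.search(r"resilvered\s+([\d.]+)([MGT])\s+in\s+(\d+)h(\d+)m", scan_line)
    if m is None:
        raise ValueError(f"unrecognised scan line: {scan_line!r}")
    size, unit, hours, minutes = m.groups()
    seconds = int(hours) * 3600 + int(minutes) * 60
    return float(size) * UNIT_MB[unit] / seconds


rate_2 = resilver_rate_mb_s("scan: resilvered 1.64T in 4h39m")  # min_active=2
rate_3 = resilver_rate_mb_s("scan: resilvered 1.42T in 2h59m")  # min_active=3
print(f"{rate_2:.2f} MB/s, {rate_3:.2f} MB/s, speedup {100 * (rate_3 / rate_2 - 1):.2f}%")
```

Note that the two runs resilvered different amounts of data (the pool grew by 220GiB in between), so the comparison is indicative rather than exact.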
Sure, I can see a case for that. It depends if you value the bandwidth or latency more. Here's approximately the data I see from the comments above.
@RubenKelevra as long as the controllers and drives honor cache flushes it's safe, and preferable, to run with the write cache enabled. |
Thanks for your neat summary.
I thought about it again and I still think the 2 ms average latency increase is presumably negligible, since the disk latency is still far better than anything I observed with LVM over md-raid on the same hardware before.
The amount of performance gain on the other hand seems to be worth it in any case.
The only thing which worries me is the very small sample size we're using to determine the best default setting for all users and use cases.
So, what happens if you have a database server with a massive organic load that runs right at the limit of, for example, a 10-SSD array?
I think before we even select 2 we should try to find more users who can offer performance data for different use cases.
Best regards
Ruben
|
The IO scheduler was originally tuned with rotational media in mind. Resilvering performance of an SSD will be improved with higher queue depths, but not for the same reasons as hard drives resulting from an overshoot of the next sector requiring a wait for the next rotation. I have seen a few other instances of this happening in the IRC channel. I won't quote any here. While it's anecdotal compared to the tests done here, it sounds like a major win for minor performance losses. SSDs should be reconfigured from defaults anyway. |
Alright then, so we're talking about HDD scenarios only. In this case I would argue for a change to 3, since this seems to be the best-performing option in an organic load scenario as well as on rebuilds.
Also, it does not increase the average response time of the drives enough to hurt realtime access.
I think it's a sane default to assume that users want the maximum-throughput option.
In case someone is running a large-scale database application server on HDDs, we should add advice to the documentation and/or the changelog that in these cases 1 or 2 might be the better option.
Best regards
Ruben
|
@ahrens What do you think about this? |
@ryao This thread is pretty massive, can you summarize what's being proposed, or point me to a specific comment that is the proposal? |
Super-short version: #5926 - Raise Reasoning: A simple solution is to raise |
Whoops. My bad. Previous comment edited. |
@DeHackEd Thanks, I left my comments on the PR. |
Just dug a bit deeper into my graphs. Well, increasing the async writes does indeed hurt the latency, but only the write latency. The read latency is very slightly better (reduced). I just have so many writes and so few reads per second that it looks like the overall performance is decreasing. |
FWIW, in the bad old days of the old write throttle, we found that for HDDs, zfs_vdev_max_pending=2 was far superior to 1, and larger values don't help much until you get too large and HDDs fall over (somewhere between 4 and 35, YMMV). So it is quite reasonable to get a meaningful boost from setting to 2, considering that the resilvering drive satisfies few reads early in the resilver time. |
Resilver operations frequently cause only a small amount of dirty data to be written to disk at a time, resulting in the IO scheduler to only issue 1 write at a time to the resilvering disk. When it is rotational media the drive will often travel past the next sector to be written before receiving a write command from ZFS, significantly delaying the write of the next sector. Raise zfs_vdev_async_write_min_active so that drives are kept fed during resilvering. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue openzfs#4825 Closes openzfs#5926
What is the result of a much higher value like 8?
|
Testing showed the most significant benefit resulted from going from 1 to
2, and there was an additional but smaller benefit changing to 3. Higher
values increased latency but not throughput. A default value of 2 was
chosen as a compromise between speed and latency as it gives you the
majority of the improvement with a minor increase in latency.
|
@jwittlincohen since I did some of these tests, I've asked @richardelling for a confirmation of these findings. @ptx0 this sounds interesting, but I never observed such behavior. On which hardware did you get these issues? |
I concur with @RubenKelevra 2 is a better default than 1. |
@RubenKelevra if you don't mind I'd like to close this issue out. We have several other issues open to work on improving the resilver speeds. The following modest improvements already made include: |
Sure! I think my issue might be fixed as well. Time will tell ;) |
Currently a drive replace is dead slow. I'm in the middle of a data migration to a larger storage system...
My setup is a RAIDz2 of nine 3 TB SATA HDDs, which is currently degraded to single redundancy.
I'm currently receiving two filesystems and need to replace two disks, because I started with two disks borrowed from a friend and now want to use the disks from the old storage system.
The two zfs receives are limited by the internet connection, which is around 95 Mbit/s.
So the rest of the I/O bandwidth, which should be around 60 MB/s x 7 => 420 MB/s, is available for a device replace.
So I've started replacing both drives at the same time.
ZFS does not seem to simply copy the data from one drive to the other while writing new data/metadata to both drives. Instead, it looks like ZFS runs a complete analysis of all data on all disks to work out what has to be stored on this particular new disk.
This is currently dead slow:
zfs list shows that 2,97T is used, which works out to 432 GB per disk. Since each disk is capable of reading/writing around 90 MB/s raw, a 1:1 copy should average 60 MB/s at worst.
So my expectation was that sdd is copied to sdc while sdg is copied to sdl, both at 60 MB/s, which should be done in 2 hours. Now I'm facing a 450-hour runtime, which is 225 times (22,500 %) the expected duration...
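The expectation above is plain arithmetic, sketched here in Python. It assumes decimal units (1 GB = 1000 MB) and a steady transfer rate; the helper name is made up for illustration:

```python
def hours_to_copy(gigabytes, mb_per_s):
    """Hours needed to move `gigabytes` (decimal GB) at a steady `mb_per_s`."""
    return gigabytes * 1000 / mb_per_s / 3600


expected_h = hours_to_copy(432, 60)   # per-disk data at the pessimistic 60 MB/s
observed_h = 450                      # runtime reported in the post
print(f"expected {expected_h:.1f} h, observed {observed_h} h, "
      f"ratio {observed_h / expected_h:.0f}x")
```

The same helper gives roughly 4.2 hours for a full 3 TB disk at 200 MB/s, in line with the dd copy time reported in the comments.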