Lack of fairness of sync writes #10110
Lowering zfs_dirty_data_max significantly (to 100-200M from the default 3G) mitigates the problem for me, but with a 50% performance drop.
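For context, a minimal sketch of how that tunable is typically changed at runtime on Linux, assuming the stock module parameter path (the value mirrors the comment above, not a recommendation):

```sh
# Lower the dirty-data cap (the poster's default was ~3G) to 200M:
echo $((200 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# Make the change persistent across reboots:
echo "options zfs zfs_dirty_data_max=209715200" >> /etc/modprobe.d/zfs.conf
```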
After some code investigation the problem appears to be too deeply ingrained in the write path.
I am afraid my workaround is currently the only viable option for acceptable latency under overwhelming fsync load. ZFS is neither designed nor built to be bandwidth-fair to consumer entities.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
It's a design issue, so I guess the probability of a fix is effectively zero. Still, it's a desirable feature in both desktop and multi-tenant headless environments. Let's hear the developers out on the subject of complexity and then close/wontfix it.
Probably related to #11140
apparently, we had this in the past and maybe i was wrong that it was resolved? anyhow, what about #11929 (comment) and #11912?
I can still reproduce it on 0.8.4:
The dd example from #4603 seems incorrect; I did not see any non-zero numbers in the syncq_write column of zpool iostat while running it. The solutions/comments you linked revolve around queue depth limitation, reducing the latency at the cost of bandwidth. That will not replace a theoretical writer-aware fair scheduler, which could fix the latency without making the write queue universally shallow.
#4603 was open for 4 years with no activity and then closed by the stale bot. not the best way to handle issues, and this is a real one. our whole proxmox (with local zfs storage) migration from xenserver has been stalled because of this for weeks now, and all our effort up to now may go poof if we don't get this resolved.

but besides the fsync stall there may be other stalling issues in kvm. i put this bug report here for reference, as it is at least related: https://bugzilla.kernel.org/show_bug.cgi?id=199727

i also think this is a significant one, rendering kvm on top of zfs really unusable when you have fsync-centric workloads. thanks for reporting the details about it and for your analysis!
Yeah, if you're multi-tenant (many VMs) you'll have better luck with boring qcow2s on ext/xfs+raid. |
no option. i want zfs snapshots and replication and i care about my data.
@behlendorf wondering what your thoughts on this issue are. I'm on Proxmox as well and have occasionally noticed the same thing as @Boris-Barboris and @devZer0 have.
this is a real bummer. the following delays happen, for example, by simply copying a file inside a virtual machine. you can clearly see that sync io from ioping is getting completely starved inside the VM. i guess i have never seen a single IO need 5.35 min for completion.

[root@gitlab backups]# ioping -WWWYy ioping.dat

this long starvation causes the following kernel message:

[87720.075195] INFO: task xfsaild/dm-3:871 blocked for more than 120 seconds.

i'm not completely sure if this is a zfs problem alone, as with "zpool iostat -w hddpool 1" i would expect to see the outstanding IO from ioping (which hangs for minutes) in the syncq_wait queue, but in the 137s row not a single IO is shown. is there a way to make this visible on the ZFS layer?
i know this can be mitigated to some extent by adding a SLOG, but we have an ssd and a hdd mirror or raidz on each hypervisor server, and adding another enterprise ssd just to make the hdds run without a problem feels a little bit ugly, as you could just as well switch to an "ssd only" system then.
Any updates? Are there plans to resolve this in future versions?
Why hasn't this been escalated as a serious issue? Performance before features imo
So it's kinda a design limitation: normally a filesystem offers a mount and accesses a disk, and fairness is provided by the IO scheduler, which attributes the individual requests to the processes issuing them. ZFS however isn't working that way. The actual IO to the disks is issued by ZFS processes, so the scheduler cannot "see" which application is behind an individual IO. In addition, ZFS has its own scheduler built in, and thus an IO scheduler below it isn't considered helpful: the IO gets optimized for low latency by making sure it is issued in an order which completes individual requests as fast as possible. The scheduler also sorts the requests to complete synchronous reads first and synchronous writes with second priority, followed by asynchronous reads and then asynchronous writes. These priorities are not super strict, however: the number of concurrent IOs for each of the described IO classes is tuned up and down based on the outstanding requests.

The tuneable you modified adjusts the limit of how much write data can be cached. Lowering this value ramps up the number of writing threads earlier, as the thresholds are percentages. In addition, ZFS starts to throttle the incoming IO from applications by introducing a sleep time as this cache gets fuller.

So there are a couple of things you could try to lower the impact of the issues you're seeing:

Reduce parallel IO jobs per vdev

First check if your disks can keep up with the amount of concurrency, for example by watching the per-vdev latencies:
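One way to watch those latencies is a sketch like the following (the pool name `tank` is a placeholder):

```sh
# -v breaks the stats down per vdev, -l adds average latency columns
# (total_wait, disk_wait, syncq_wait, asyncq_wait); refresh every second.
zpool iostat -vl tank 1
```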
If it's often above, say, 15 ms (on SSDs) / 50 ms (on HDDs), the disk has trouble keeping up with the amount of concurrent IO, and the per-vdev concurrency limit should be lowered.

Earlier throttling and adjusting the delay introduced for throttling

Instead of lowering the maximum amount of "dirty" data for async writes, it's better IMHO to adjust at what percentage of that maximum ZFS starts throttling the writes accepted from applications, and how steep the introduced delay is. It's probably best to tune that against a mix of random and sequential IO tested on one disk; depending on your pool layout you need to multiply that accordingly (e.g. across the number of data disks per vdev).
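A hedged sketch of the OpenZFS module parameters that match this description, assuming they are what the comment refers to (the values shown are illustrative, not recommendations):

```sh
# Begin delaying application writes once dirty data exceeds this
# percentage of zfs_dirty_data_max (default 60):
echo 30 > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent

# Controls how steeply the injected delay grows as dirty data
# approaches the cap (default 500000):
echo 500000 > /sys/module/zfs/parameters/zfs_delay_scale

# Upper bound on concurrent IOs issued to a single vdev (default 1000),
# i.e. the "parallel IO jobs per vdev" knob mentioned above:
echo 32 > /sys/module/zfs/parameters/zfs_vdev_max_active
```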
@ShadowJonathan wrote:

> Why hasn't this been escalated as a serious issue?

It has, there's a feature request open to implement the missing feature: balancing IO between processes, and creating ionice levels as well, so background IO can be marked as such. See #14151
hello, does this new feature of sync parallelism help in addressing this problem?
System information
Describe the problem you're observing
I am observing unfair sync write scheduling and severe userspace process IO starvation in certain situations. It appears that an fsync call on a file with a lot of unwritten dirty data will stall the system and cause a FIFO-like sync write order, where no other process gets its share until the dirty data is flushed.
On my home system this causes severe stalls when a guest VM with a cache=writeback virtio-scsi disk decides to sync the SCSI barrier while having a lot of dirty data in the hypervisor's RAM. All other hypervisor writers block completely and userspace starts chimping out with various timeouts and locks. It effectively acts as a DoS.
Describe how to reproduce the problem
1). Prepare a reasonably-default dataset.
2). Prepare 2 terminal tabs and cd to this dataset's mount point. In them, prepare the following fio commands (a sketch of suitable command lines follows the two labels below):
"big-write"
and "small-write"
3). Let them run once to prepare the necessary benchmark files. In the meantime, observe the iostat on the pool (one way to do this is sketched below):
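Assuming a pool named `tank`, the queue activity can be watched with:

```sh
# -q shows per-class queue depths, including the syncq_write column
# referenced elsewhere in this thread; refresh every second.
zpool iostat -q tank 1
```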
Note that when fio issues 2G of async writes, it calls fsync at the very end, which moves them from the async to the sync class.
4). When the fios are finished, do the following: start "big-write" and then, after 2-3 seconds (when "Jobs: 1" appears), start "small-write". Note that the small 128K write will never finish before the 2G one; the second fio remains blocked until the first one finishes.
Include any warning/errors/backtraces from the system logs