ZFS Slow Write Performance / z_wr_iss stuck in native_queued_spin_lock_slowpath #4834
In my searching I also found this discussion of a person with what appears to be the same problem, but no solution: https://forum.proxmox.com/threads/server-stalls-with-zfs.26721/
@bdaroz Any chance you're burning all your CPU doing the encryption? Maybe run …
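For anyone else wanting to check the same thing, a minimal sketch of how to tell whether the LUKS encryption is eating the CPU (assumes perf and the sysstat package are installed):

```sh
# Sample system-wide hotspots; if encryption is the bottleneck,
# crypto/dm-crypt symbols will dominate the top of the list.
sudo perf top

# Per-core utilization every 5 seconds; sustained ~100% across all cores
# during a copy points at CPU saturation rather than disk latency.
mpstat -P ALL 5
```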
@dweeezil Thanks for looking at this.... There's a … Long answer: when the problem is not happening and I/O maxes out around 150MB/s, the CPU does load up above 90% across all cores. When the problem does happen, I/O throughput drops to around 5MB/s and the CPU has quite a bit of idle time (see …).
@bdaroz Oops, missed your perf output. It looks like a memory management issue based on the functions shown there. Without knowing anything else about your test setup, such as where the source of the files you're copying is, it's really hard to guess what's going on. Your next step ought to be to run …

Best guess at the moment is that this is caused by too much dirty data for the capability of the vdevs (and that's just a total out-of-left-field guess).
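A rough way to sanity-check the dirty-data guess on a ZoL 0.6.x system is sketched below (these are standard ZFS module parameters and kstats; exact counters may differ slightly by version):

```sh
# Ceiling on outstanding dirty data, and the fill level at which ZFS
# starts delaying writes.
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent

# Counters that increment when transactions are throttled or delayed
# because of dirty-data pressure.
grep dirty /proc/spl/kstat/zfs/dmu_tx
```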
@dweeezil Thanks. I did note in the opening paragraph that the files (ranging from 3-10GB each) are being copied from a volume on the … I'll look at …
Here's a somewhat expanded view of …
Also, I should be able to upgrade this server from 16GB to 32GB of ECC RAM sometime in the next 24-48 hours, if we are looking at an ARC cache allocation issue.
I also bumped up (set, actually) the zfs_arc_max value from 0 to about 12GB - with 16GB of RAM it would have defaulted to about 8GB. (Don't mind the diagonal line from 0 to 12GB; that value isn't polled that frequently.) The value was changed at about 12:15am on the graph. If ZFS were starved for ARC cache, one would expect the in-use amount to jump up considerably more after the setting change. The server had at least 1.5GB of free RAM plus some buffers available the entire time. I also shut down some memory-intensive processes (java) to push free RAM over the 3GB mark. ZFS never went on a RAM grab, and performance was intermittently abysmal throughout this period.
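For reference, a sketch of how zfs_arc_max can be changed at runtime and persisted; the 12GB figure just mirrors the value used above, expressed in bytes:

```sh
# Raise the ARC ceiling to 12 GiB immediately (value is in bytes).
echo 12884901888 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Persist the setting across reboots / module reloads.
echo "options zfs zfs_arc_max=12884901888" | sudo tee -a /etc/modprobe.d/zfs.conf
```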
One other interesting graph... The pool was mostly quiet until 12:01PM, when a 15GB file was moved between two volumes on the …
@bdaroz Since this does seem to be a memory management issue, please post the arcstats (…).
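For anyone else following along, the ARC statistics in question can be pulled straight from procfs, roughly like this:

```sh
# Full ARC statistics; 'size', 'c_max' and the eviction/throttle counters
# are the interesting fields for a memory-pressure problem.
cat /proc/spl/kstat/zfs/arcstats

# Or watch the headline numbers refresh every second (helper script
# shipped with ZoL; the name may vary by version and packaging).
arcstat.py 1
```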
Good or bad, the system was upgraded to 32GB of RAM and the problem has largely subsided. There are periods of slower throughput, but we're looking at a loss of at most 25%-30%, not 95%+. Given how long this machine was "fine" on …

Perhaps some of the work being done in #4880 and #4512 will help.
No sooner do I post that than, 30 minutes later, performance goes to complete crap again.... Here's …
Running into the same issue here. Any more updates?
I'm also having exactly the same issue.
This is driving me crazy... It gets so bad for me that any large file transfer over SMB stalls and runs into a timeout. Anything bigger than 1-2GB can't really be transferred because of that.
@jakubfijolek Sorry, since I upgraded the RAM significantly and later upgraded ZoL to newer versions, I haven't had this recur.
Using atop I have noticed that one hard disk in my pool seems to have a higher average I/O time than the others and thus a much higher busy percentage than the other disks. Is it possible that one slower drive is pulling the whole pool down to such low performance?
@ItsEcholot Yes; for example, in RAIDZ you'll get the IOPS of the slowest device.
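A quick way to spot the laggard (assumes the sysstat package; per-device latency histograms like `zpool iostat -w`/`-l` only exist in newer ZoL releases, so plain iostat is the safe bet here):

```sh
# Extended per-device statistics every 5 seconds; a drive whose await
# and %util sit far above its siblings is dragging the whole raidz down.
iostat -x 5

# Per-vdev throughput from ZFS's point of view, for comparison.
zpool iostat -v 5
```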
@gmelikov Alright, thanks. I will look into buying a replacement drive and hope that will improve my situation.
@ItsEcholot You can try to …
@gmelikov Just did; it really seems to help. Write speeds are still very slow (10MB/s), but it's worlds better than before, especially because transfers don't seem to freeze anymore. I'm going to order a new HDD to replace it. Also, maybe my RAID card is hurting my performance. Would getting a real HBA / JBOD card improve performance noticeably, or is it not worth it? And would I have to transfer the files off the pool and back again when replacing the RAID card, or could I do it without losing data in the pool?
@ItsEcholot The RAID card might be the cause too.
It depends on your RAID card: if it does passthrough, then it's OK; if not, use send/recv. If you have more questions, feel free to use our mailing lists: https://github.com/zfsonlinux/zfs/wiki/Mailing-Lists
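If the controller swap does end up requiring a rebuild, the send/recv route looks roughly like the sketch below (pool and dataset names are placeholders, and it assumes a second pool with enough space to hold a full copy):

```sh
# Snapshot everything and replicate it to a scratch/backup pool.
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs recv -F backup/tank

# After destroying and recreating 'tank' behind the new HBA,
# stream the data back.
zfs send -R backup/tank@migrate | zfs recv -F tank
```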
@gmelikov Alright, thanks a lot for your answers; this has helped me a lot.
Running Ubuntu 15.10 with ZoL 0.6.5.7-1~wily from the `zfs-native` PPA. This seemed to start in one of the last few releases. It's not 100% reproducible, but when it does occur it's most noticeable when copying a large file from one volume to another location on the same volume. Normal transfer rates would be in the ~100MB/s range, but when this does occur transfer rates drop to <5MB/s.

The system is a 16GB (ECC) Xeon E3-1241. The problem occurs on any pool (including some test mirrors, raidz1 and raidz2); info from the main pool `storage` is included here. All info was taken during the slow transfer speeds. All drives are 4TB 5400RPM hard drives, with no log or cache drives. All drives are LUKS encrypted (which is why the normal high-water mark for transfers is in the ~100MB/s - ~130MB/s range).

`perf top` output during the slow transfer period:

Output from `top` (some process names scrubbed):

ARC Summary output:

`zpool iostat 5` output, as well as a sample verbose output; the slow copy in progress is from and to the `storage` pool.

And a verbose sample:

Sample `iostat` from a drive in the array, typical for all the drives:

Properties of the volume:

And finally, `zpool list` output:

I started a copy of a small number of files (about 4-5GB each) totaling about 55GB; as I'm finishing writing this issue about 49 minutes later, only about 6.5GB has been copied so far.
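Since the title refers to z_wr_iss threads stuck in native_queued_spin_lock_slowpath, here is a minimal sketch of how to capture where those taskq threads are blocked (assumes root and a kernel exposing /proc/<pid>/stack):

```sh
# Dump the kernel stack of every ZFS write-issue taskq thread; repeated
# samples landing in native_queued_spin_lock_slowpath confirm the spin.
for pid in $(pgrep z_wr_iss); do
    echo "== thread $pid =="
    cat "/proc/$pid/stack"
done
```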
I've been watching #4512 but the conversation there seems to be trending toward issues in older kernels, not the 4.2.0 kernel running here.
Any help or guidance would be appreciated.