rsync and txg_sync blocked for more than 120 seconds #2611
I hit something similar but maybe different last night. I believe it occurred while KVM was copying disk blocks from another server to this one. This ended up causing actual corruption on at least one of the zvols (as seen by the VM).
This was repeated for
I tried migrating a VM again today and all hell broke loose, but I did not get these errors. The system load was in the 100s on an 8-core system, with major I/O wait time. I killed the migration but ended up having at least 3 corrupt zvols anyway.
I do not have any corruption or problems - the system is stable and running, and the rsync tasks complete 100% correctly - but write performance is very slow at about 1 MB/s, and while they run the whole system is not responding quickly. No errors and no corruption, though. These "blocked for more than 120 seconds" dmesg messages come in pairs, for rsync and txg_sync, once a day.
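For context, the "blocked for more than 120 seconds" message comes from the kernel's hung-task watchdog (khungtaskd); 120 s is the default value of `kernel.hung_task_timeout_secs`. A small tuning sketch (raising the threshold only quiets the warning, it does not fix the underlying stall):

```shell
# Show the current hung-task threshold (120 by default)
sysctl kernel.hung_task_timeout_secs

# Raise it for the current boot (requires root)
sysctl -w kernel.hung_task_timeout_secs=300

# Persist across reboots by adding to /etc/sysctl.conf:
#   kernel.hung_task_timeout_secs = 300
```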
Not similar, but a related message: `Aug 29 05:37:06 morpheus kernel: [46185.239554] ata6.00: configured for UDMA/133`. This seems to occur from time to time with a rather slow USB 3.0-powered 4 TB HDD (Touro Desk 3.0, HGST 5K4000) in an external case, during rsync and transfers of large files (several GiB).
Clarification for the above: I had this happen with the above-mentioned drive, and I still don't know what causes it (I suspect it is related either to the chipset in the external hard-drive enclosure the drive sits in, or to power-saving features of the XHCI driver and hardware, which I have had issues with in the past).

Another drive showed this behavior as well: a Seagate ST3000DM001, which likely underwent a head crash and reallocated several sectors (<10). It had been placed in an external Fantec enclosure (DB-F8U3e, with a chipset incompatible with smartctl) that had shown in the past to have a life and mind of its own: it would occasionally turn off during transfers and cause trouble with other filesystems; on ZFS, however, the files so far seemed fine.

The day before yesterday I placed the drive in another external enclosure and it worked well during backups (only transferring several hundred MiB of data per backup job, incrementally via rsync), until I decided to run a scrub and check everything. After several hours the drive again screamed and made audible noises of a head crash and/or sector reallocation (I have had those in the past), and access to the drive was no longer possible. That is where the above-posted message occurred again.

So when encountering this message, make sure to double- or triple-check that it is not a hardware issue rather than a "software" (ZFS-related) problem.
@wankdanker I think that your issue is separate. It might have been caused by the zvol processing occurring inside an interrupt context. Pull request #2484 might resolve it.
@freakout42 Would you tell us more about your pool configuration? Also, do you have data deduplication enabled on this pool?
Had similar failures, which occurred during heavy rsync pulls from a remote machine. I was able to make it happen very quickly by starting up the remote pull; I did this three times in a row and caused the fault every time. The symptom was that any userland zfs/zpool commands hang, but the machine was still responsive to other commands. I set the parameter spl_kmem_cache_slab_limit=0 (it had been spl_kmem_cache_slab_limit=16384), and the problem seems to be gone, or at least not easily triggered. Part of the process which triggers this includes snapshot renaming, but no zvols are involved in this process, although the pool has some. The pool is a raidz1 pool, and there are no hardware issues on the server.
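For anyone wanting to try the same workaround: a module parameter like `spl_kmem_cache_slab_limit` can be changed at runtime or persisted via modprobe options. A sketch, assuming the SPL module is loaded and exposes the parameter under `/sys/module/spl/parameters` (check the path on your system first):

```shell
# Inspect the current value
cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit

# Change it at runtime (requires root)
echo 0 > /sys/module/spl/parameters/spl_kmem_cache_slab_limit

# Persist across reboots via a modprobe config fragment,
# e.g. in /etc/modprobe.d/spl.conf:
#   options spl spl_kmem_cache_slab_limit=0
```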
@ColdCanuck Your comments regarding … Back to the point at hand: I'm posting this follow-up because there have been a disturbing number of seemingly otherwise unrelated problems sporadically caused by using the Linux slab. Although I've not been able to spend the time on it I've wanted to, I've been rather knee-deep investigating the series of issues related to POSIX ACLs and SA xattrs, and have seen at least one report (#2701) and, more interestingly, #2725, which makes me think there may be a tie-in to our use of the Linux slab for <= 16KiB objects. I don't have any other brilliant observations to offer at the moment, other than to raise the concern that there may be problems related to using the Linux slab, and to ask @behlendorf, @ryao et al. what your thoughts are on this (particularly given the last few comments in #2725).
Just posting what comes to mind: could scheduling a regular cron job which compacts memory via … change things (provided the slab issues and timeouts are related to memory fragmentation)?
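A minimal sketch of such a job, assuming a kernel built with CONFIG_COMPACTION (which exposes `/proc/sys/vm/compact_memory`); the file name is hypothetical:

```shell
# /etc/cron.d/compact-memory -- hourly memory compaction (sketch).
# Writing 1 to /proc/sys/vm/compact_memory asks the kernel to compact
# all memory zones; whether this helps the slab issue is speculation.
0 * * * *  root  echo 1 > /proc/sys/vm/compact_memory
```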
I'm systematically hitting this issue when trying to rsync, using the latest ZFS from Arch: zfs-git 0.6.3_r170_gd958324f_3.18.2_2-1
Still trying to run rsync:
This might be caused by #2523. Can you verify you have the fix to the SPL applied: openzfs/spl@a3c1eb7
More of the same
Running SPL versions from Arch:
Free memory:
I'm getting this every couple of days now since upgrading to the latest Debian build (which, according to the linked bug, has the fix in it). The oops messages are more or less the same as those already posted. The zvols lock up and the load average climbs into the hundreds. I've never had an issue prior to this.
I'm seeing similar symptoms after running a … From dmesg, on Ubuntu Server 14.04.2 LTS, ZoL 0.6.3-5 from the Ubuntu PPA:
I can trigger the same error by using rsync between two different pools. The rsync process hangs and cannot be killed. If I don't stop the rsync from the CLI, the pool will end up in a faulted status. The fault status is gone after a reboot.
Fingers crossed I've provided good information. I'm running Arch Linux with the demz repo.
I got the same issue with a heavily loaded mongodb on ZFS.
Same for me on Debian Jessie with Linux 3.16.0-4-amd64 and zfs 0.6.5.2-2: Jan 22 07:04:42 db04 kernel: [5056080.684110] INFO: task txg_sync:378 blocked for more than 120 seconds.
@andreas-p please update to 0.6.5.4* if available. Alternatively, you can build the latest zfsonlinux packages yourself:
Unfortunately, Debian 8 DKMS is still on 0.6.5.2, with no update in the last three months. Any clue when this gets resolved?
@andreas-p sorry, no idea, but there's always the option of building the packages on your own. It is some effort, but you'll know that you can trust those packages instead of having to rely on third-party repositories, etc.
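A rough sketch of such a from-source build on Debian 8. The package list, repository URLs, and the `deb` make target reflect my understanding of the ZoL build system of that era and should be treated as assumptions; consult the project's build documentation for your release:

```shell
# Build SPL and ZFS debs from source on Debian 8 (sketch; run as root).
apt-get install build-essential autoconf libtool gawk uuid-dev \
    libblkid-dev zlib1g-dev linux-headers-"$(uname -r)"

# SPL first (at the time, a separate repository from ZFS)
git clone https://github.com/zfsonlinux/spl && cd spl
./autogen.sh && ./configure && make -j"$(nproc)" deb && dpkg -i ./*.deb && cd ..

# Then ZFS itself
git clone https://github.com/zfsonlinux/zfs && cd zfs
./autogen.sh && ./configure && make -j"$(nproc)" deb && dpkg -i ./*.deb
```

Building against the exact running kernel headers is the important part; a checkout of a release tag rather than git HEAD is generally safer for production pools.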
Got the very same problem with 0.6.5.4 on a different machine with Debian 8, ZFS built from source. The stack trace shows exactly the same positions as the trace from Jan 18, 12:54. Starting an rsync from a 2.5 TB XFS to a 4 TB ZFS partition, memory rose from 2 GB to 5 GB within 5 minutes.
Same problem here with Debian 8, ZoL 0.6.5.2 from the official package repository.
Exactly the same block with 0.6.5.5: I got two consecutive "txg_sync blocked for more than 120 seconds" messages, then it went back to normal.
I'm suffering from a similar problem. Please advise.
@narunask could you please post the output of … or is access to the box meanwhile denied ("bricked")? Also, please post some hardware and configuration specifics (RAM, processor, mainboard, hard drive type, LVM/cryptsetup, etc.). Thanks
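For anyone else asked for diagnostics on this issue, a rough checklist of commands whose output is usually worth attaching (paths assume a standard ZoL install; `tank` is a placeholder pool name):

```shell
zpool status -v                     # pool layout and any logged errors
zpool get dedup tank                # whether deduplication is enabled
cat /proc/spl/kstat/zfs/arcstats    # ARC size and hit/miss counters
free -m                             # overall memory pressure
dmesg | grep -B 2 -A 25 "blocked for more than 120"   # the hung-task stacks
```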
Server is on the … The HDD that I'm copying from is LVM-based, consisting of 3 PVs. The HDDs are not encrypted. Also FYI, currently …
Thanks
I hadn't seen the problem for quite a while (it seemed to have gone since 0.6.5.6), but this morning I had the very same hung task with 0.6.5.7 (a single occurrence) come up again. Sigh...
I can confirm that I have not seen this issue happen for ages on the same hardware; it has not reoccurred within probably the last 3-6 months.
Still happening with heavy rsync backups.
Most of the time performance is fine.
So, I replaced the drive which might have been causing issues with a new Samsung 840 PRO SSD, partitioned as 50 GB OS, 4 GB swap, and the rest available as L2ARC. I've set the ARC max size to 4 GB on a system with only 8 GB of memory. Tonight the same issue occurred: rsync ran, and then everything ground to a halt. So it seems unlikely to be a disk issue. Nominal read speeds are 100 MB/s+, so things are otherwise humming along quite nicely.
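For reference, capping the ARC is done via the `zfs_arc_max` module parameter, which takes a byte count. A sketch of the arithmetic and the config line (file name is the conventional one, but any file under `/etc/modprobe.d/` works):

```shell
# Compute 4 GiB in bytes for the zfs_arc_max setting
ARC_BYTES=$((4 * 1024 * 1024 * 1024))
echo "$ARC_BYTES"   # prints 4294967296

# Persist in /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_max=4294967296
# Or apply at runtime (requires root):
#   echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
```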
I can reproduce this issue fairly easily, so just let me know what information you'd like me to collect and I'll try to do it.
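When the hang is live, the kernel stacks of the blocked tasks are usually the most useful thing to capture. A sketch using standard kernel interfaces (requires root, and SysRq must be enabled):

```shell
# Ask the kernel to dump the stacks of all uninterruptible (D-state)
# tasks into the kernel log, then read them back:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100

# Or grab the stack of one specific hung thread, e.g. the stuck txg_sync:
cat /proc/"$(pgrep -x txg_sync)"/stack
```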
Happened last night during backups. |
Okay, so I've replaced the drive which had the high await time and also increased the memory to 16 GB, but I'm still having issues:
Running
Memory available is okay:
iostat seems okay (sde is the Samsung 840 Pro):
Just wanted to give a short update on my post in this issue from last year: in the meantime I have upgraded to ZFS 0.6.5.8 from Debian's backports, still using Debian 8. Unfortunately, I still get the exact same timeout in the kernel logs.
@kpande if you're going to assert that, can you please describe what's big enough? I have resources 4x the average described in this thread, and I also see the same issues, in this case with the combination of zfs send and nfsd load.
You may be right. I have 4x 4 TB drives and one 256 GB SSD as OS/cache, 16 GB memory, and a 2-core CPU (1.5 GHz Atom). This system is almost exclusively used for rsync backups, and I feel that is pretty reasonable for my needs. The ARC cache is 12 GB, leaving 4 GB for the OS. The L2 cache, if enabled, is about 170 GB. I'll have to check how many files it is, but I'm not sure whether it's multi-millions or not.
@kpande you are right, I forgot to give any relevant info about my underlying hardware. Sorry about that; here is hopefully everything relevant. My server is a virtualization server running Xen 4.4 with currently 6 virtual machines, which all have their logical volumes (LVM) stored on a RAIDZ1 volume with 3x 2 TB Seagate SATA enterprise disks (ST32000645NS). The Debian 8 OS is independent and located on two internal 16 GB SATA SSDs in RAID1, using Linux MD for mirroring. The CPU is an Intel E5-2620 v3 @ 2.40GHz with 6 cores/12 threads. Of these 6 cores, 4 vCPUs have been pinned to the host/hypervisor/dom0 using the … Below is the output of an actual ARC summary (the server was rebooted 5 days ago):
Do you need any more information? And what do you think about this setup? Is my hardware undersized?
I checked, and the majority of my backups are < 200k files and < 20 GB.
I hope it helps:
It was an mv process from one ZFS dataset to another in the same pool (it actually behaves as cp, not faster). The process hangs forever (at least a few hours) and cannot be terminated with a signal. zpool reports the pool error-free. Now I've started … UPDATE:
Let me know if you need any other info.
Doing nothing but some basic rsyncs with moderate sizes (4-10 GB) always results in approx. 1 MB/s throughput (very slow) on an up-to-date HP server with 16 GB RAM - CentOS 6 with OpenVZ and self-compiled ZFS modules - dmesg: