kernel: BUG: soft lockup - CPU#7 stuck for 22s! [z_null_iss/0:1796] #2708
Comments
As a follow-up, I took the exact same machine, upgraded the kernel from 3.10 to 3.16, and ran the same workload that pretty consistently triggered the above error on 3.10; so far it seems to be passing the test on 3.16. It might still be too early to be 100% certain of this, so I'll continue testing the workload, but I would have expected to hit the issue by now. Would that make any sense? Would the Linux kernel version have an effect on an issue like this? Using 3.16 is not a problem for me if that is a sensible workaround.
Yeah, I spoke too soon. The issue manifests itself slightly differently on 3.16 (no console messages, slightly different error in /var/log/messages): Sep 17 22:19:32 localhost kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Another update: I was able to reproduce the same issue (with console messages) on ZoL 0.6.3 as well as git head.
Sure, I'll try to give the relevant information. Please feel free to ask me any other questions.

CentOS 6.5. I've tried kernels ranging from 3.10 to 3.16, and I've seen the issue on both 0.6.3 and git head.

I have two pools, both created with ashift=12, compression=gzip-9, primarycache=metadata, secondarycache=metadata. I explicitly set ashift=12 because at one point I had trouble with some drives lying about being 512-byte-sector drives, which ZoL didn't catch and auto-correct to 4K. I'm not sure that's the case with these specific drives; it's just habit now. Compression is set to the highest possible setting, but performance is generally acceptable for my scenario (backups), with the obvious exception of this hang. :) I've set primarycache and secondarycache to metadata because, again, it's backups, and I have an inordinate number of files, including countless millions of very, very small files. There are enough of them that I actually notice a large space difference between ashift=9 and ashift=12, but performance with ashift=9 is so absolutely terrible that I have to go with ashift=12 (it wasn't 2x or 3x slower, it was at least 10x or 20x slower). Anyway, I'm doing large rsyncs to make the backups, and I don't much need caching for anything but metadata. I'm willing to listen to alternate recommendations here, but generally this has worked well.

The two pools are a 5x4TB raidz1 and an 8x3TB raidz2. There's no good reason for the difference in pool structure other than that I'm in the process of migrating to larger pools and haven't bought the final hardware yet. I do a zfs send/recv from one pool to the other (the 2nd pool is simply supposed to be a carbon copy of the 1st, for additional redundancy). I can trigger this error with a simple zfs send/recv: a "zfs send -v [first snap name on 1st pool] | zfs recv -dv [2nd pool name]" followed by a "zfs send -v -I [first snap name on 1st pool] [last snap name on 1st pool] | zfs recv -dvF [2nd pool name]". I'm actually trying to sync the 1st pool to the 2nd pool for the first time, so the system is expected to run for a couple of days copying the data. It runs for anywhere between an hour and maybe 12 hours before this issue arises. There are about 50 filesystems with roughly 200 snapshots each.

I can also trigger the issue while running rsync alone (my standard backup script, copying from various sources to the 1st pool), or the rsync backup AND the zfs send/recv to the 2nd pool simultaneously, or the zfs send/recv alone as mentioned above. The issue seems to be strongly correlated with heavy load. I can expect roughly 50MB/s, and my 8-thread processor seems to be the bottleneck, as the gzip-9 compression generally eats 100% of CPU on all cores. The system DOES have ECC RAM.

Please let me know what other information would be useful. I'd be happy to run tests or gather more data for you. Thanks!!
Oh, and if it's relevant: I can't snap my fingers and make this happen, but I can trigger it fairly consistently. I'm pretty certain I can run a workload that will kill it in less than 24 hours, at least. Since this has now happened to me quite a few times and I'm comfortable that it's not causing me any data loss, I'm willing to trigger it with any debugging steps you'd like me to take.
@bjquinn Thanks for the additional information. I'm almost certainly going to put together a set of debug patches this week to try to get to the root cause here. If you could give them a try when they're ready, it would be appreciated. I'll let you know. Thanks!
I would absolutely be willing to do that. Let me know when the debug patches are ready, and I'll try them out.
Just wanted to report that we saw the same issue here (0.6.3 with a 2.6.32.x kernel). It's a production machine, unfortunately, so debugging wouldn't be possible, and it hasn't been reproducible so far. It seemed to be triggered by heavy I/O load. ZFS is used for a Lustre OST on the machine.
I just got it again a few minutes ago. I'm comfortable that I should be able to reproduce it once the debugging patches are available.
Just for reference, here is the slightly different trace we get. The issue occurs regularly under load.
@rfehren Thanks for the additional debugging. Could you describe the load that triggers this? If I were able to reproduce it locally, it would help in running down the root cause.
It's a customer server running an MDS and OSS (Lustre 2.5.2 on ZFS) plus a KVM machine that mounts the Lustre FS. I can't tell exactly when the bug is triggered, but I saw that users do a lot of Lustre I/O inside the VM (e.g. rsyncing large dirs). What I can tell is that it's most likely not a memory-pressure issue, since the Lustre server that mounts ZFS had 16GB of memory free when it last happened. Hard to reproduce this scenario locally...
Hmm, it just happened again after Lustre recovery finished and the clients reconnected. Since nobody was doing anything inside the KVM machine this time, I'm now pretty sure it's the load from the compute nodes hammering on Lustre that triggers this. It's not a big cluster though, so I'm a bit puzzled, especially since the same software combination has been running flawlessly for months on a number of other clusters. The only difference is that Lustre runs over Ethernet here rather than over IB as on the others.
@rfehren I'm sure this isn't related to memory pressure. There appears to be a subtle flaw in the zio locking, but I haven't been able to put my finger on it. Thanks for the information; I'm treating this one as a blocker for 0.6.4.
Does anyone have another suggested workaround for this that I can try?
It seems we have the same problem:
BUG: soft lockup - CPU#2 stuck for 67s! [z_null_iss/0:1175]
Pid: 1175, comm: z_null_iss/0 Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1
Dell Inc. PowerEdge R300/0TY179
For what it's worth, I've been able to trigger this issue far more easily when the storage in question is connected via USB. SATA/SAS storage has seemed much more reliable and less likely to trigger the issue.
We changed the CentOS 6.5 kernel from version 2.6.32-504.1.3.el6.x86_64 back to 2.6.32-431.29.2.el6.x86_64, and it seems that at least the soft lockups have disappeared for the moment.
Just for information: since we downgraded to the kernel mentioned before (2.6.32-431.29.2.el6.x86_64), no more soft lockups.
Please try my patch in #2523
I had the same issue last week on Debian Wheezy (kernel 3.2.0-4-amd64 #1 Debian 3.2.60-1+deb7u3) and again today on the same system (3.2.0-4-amd64 #1 Debian 3.2.63-2) with ZoL 0.6.3 from the ZoL Debian packages (zfs 0.6.3-1~wheezy, dkms 2.2.0.3-1.2+zfs6). The exact same action triggered it both times: deleting the contents of a complex directory (176GB of data with a compression ratio of ~4x and a recordsize of 8K). The dataset where the directory resides had one previous snapshot at the time of both crashes. The syslog gives a lot of output between the start of the issue and the crash (from 23:13 until the crash at 00:14 the first time, and from 23:04 to 00:58 the second time), so I have truncated it; the first minutes of syslog for crash 1 are available here: http://apaste.info/nN8 Unfortunately it's a production environment, so I can't do much diagnosis, but tell me if you need any more information on the system or workload at the time of the crash.
@FransUrbo: As discussed on IRC, I will update to the dailies tonight. Since the process happens once a week, I'll update this ticket next week on whether it still crashes or not.
This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.
openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races
Commit: openzfs/spl@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

It is known that mutexes in Linux are not safe when used to synchronize the freeing of an object in which the mutex is embedded: http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race condition are zio->io_lock and dbuf->db_mtx:
* zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on mutex_exit() with a spin lock, making the mutex safe by sacrificing a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio pipeline caused by a process spinning on the damaged mutex. Similar deadlocks have been reported for the dbuf->db_mtx mutex. It can also cause a NULL dereference or bad paging request under the right circumstances.

This issue and many like it are linked off the openzfs/zfs#2523 issue. Specifically, this fix resolves at least the following outstanding issues:
openzfs/zfs#401 openzfs/zfs#2523 openzfs/zfs#2679 openzfs/zfs#2684 openzfs/zfs#2704 openzfs/zfs#2708 openzfs/zfs#2517 openzfs/zfs#2827 openzfs/zfs#2850 openzfs/zfs#2891 openzfs/zfs#2897 openzfs/zfs#2247 openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Backported-by: Darik Horn <dajhorn@vanadac.com>
Closes #421

Conflicts:
include/sys/mutex.h
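For context, the failure mode the commit describes is easier to see in code than in prose. Below is a minimal, self-contained userspace sketch of the pattern: a mutex embedded in an object that the waiting thread frees as soon as it is woken. This is an illustration only, not SPL or ZFS code; the names fake_zio, done_thread, and wait_thread are hypothetical stand-ins for zio_t, zio_done(), and zio_wait(). Note that POSIX requires userspace mutexes to tolerate this pattern, so the sketch shows the shape of the problem rather than reliably reproducing the kernel-side bug.

```c
/*
 * Sketch of the race addressed by "mutex: force serialization on mutex_exit()".
 * A mutex is embedded in an object; the completing thread signals a waiter,
 * which then frees the object while the completer may still be inside its
 * unlock path touching the mutex it just released.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct fake_zio {                 /* hypothetical stand-in for zio_t */
	pthread_mutex_t io_lock;  /* embedded lock, like zio->io_lock */
	pthread_cond_t  io_cv;    /* completion cv, like zio->io_cv */
	bool            io_done;
};

/* zio_done()-like path: mark the zio complete and wake the waiter. */
static void *done_thread(void *arg)
{
	struct fake_zio *zio = arg;

	pthread_mutex_lock(&zio->io_lock);
	zio->io_done = true;
	pthread_cond_broadcast(&zio->io_cv);
	/*
	 * Danger window: once the waiter sees io_done it will free *zio.
	 * If this unlock still touches the mutex internals after ownership
	 * has passed to the waiter, it is operating on freed memory; that
	 * is the race described in http://lwn.net/Articles/575477/.
	 */
	pthread_mutex_unlock(&zio->io_lock);
	return NULL;
}

/* zio_wait()-like path: wait for completion, then destroy the object. */
static void *wait_thread(void *arg)
{
	struct fake_zio *zio = arg;

	pthread_mutex_lock(&zio->io_lock);
	while (!zio->io_done)
		pthread_cond_wait(&zio->io_cv, &zio->io_lock);
	pthread_mutex_unlock(&zio->io_lock);

	/* Analogue of zio_destroy(): tear down and free the object. */
	pthread_mutex_destroy(&zio->io_lock);
	pthread_cond_destroy(&zio->io_cv);
	free(zio);
	return NULL;
}

int main(void)
{
	struct fake_zio *zio = calloc(1, sizeof (*zio));
	pthread_t waiter, completer;

	pthread_mutex_init(&zio->io_lock, NULL);
	pthread_cond_init(&zio->io_cv, NULL);

	pthread_create(&waiter, NULL, wait_thread, zio);
	pthread_create(&completer, NULL, done_thread, zio);

	pthread_join(waiter, NULL);
	pthread_join(completer, NULL);
	return 0;
}
```

Per the commit message, the fix closes this window by pairing each mutex with a spin lock that is held across mutex_exit(), so a newly woken owner cannot run ahead and free the object while the previous owner is still inside its unlock path; this is the "forcing serialization on mutex_exit()" the message refers to, bought at the cost of a little extra memory and performance.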
@behlendorf This is not a duplicate of #2523. Instead, it is a duplicate of #3091, provided that this deadlock occurred on a multi-socket system.
I'm getting lots of the following under heavy load while doing a send/recv from one pool to another on the same machine. Both pools end up unresponsive.
Sep 16 17:46:38 localhost kernel: BUG: soft lockup - CPU#7 stuck for 22s! [z_null_iss/0:1796]
Sep 16 17:46:38 localhost kernel: Modules linked in: des_generic ecb md4 sha256_ssse3 sha256_generic nls_utf8 cifs fscache dns_resolver fuse ip6table_filter ip6_tables ebtable_nat ebtables autofs4 bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp bridge stp llc ipv6 ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables btrfs raid6_pq xor libcrc32c zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support dcdbas sg coretemp hwmon freq_table mperf intel_powerclamp kvm crc32_pclmul crc32c_intel ghash_clmulni_intel microcode pcspkr bnx2 i2c_i801 lpc_ich ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom usb_storage aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 ahci libahci wmi xhci_hcd mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
Sep 16 17:46:38 localhost kernel: CPU: 7 PID: 1796 Comm: z_null_iss/0 Tainted: P O 3.10.53-1.el6.elrepo.x86_64 #1
Sep 16 17:46:38 localhost kernel: Hardware name: Dell Inc. PowerEdge R210 II/03X6X0, BIOS 2.8.0 06/24/2014
Sep 16 17:46:38 localhost kernel: task: ffff880426c2c040 ti: ffff88041e496000 task.ti: ffff88041e496000
Sep 16 17:46:38 localhost kernel: RIP: 0010:[] [] _raw_spin_lock+0x25/0x30
Sep 16 17:46:38 localhost kernel: RSP: 0018:ffff88041e497cf8 EFLAGS: 00000293
Sep 16 17:46:38 localhost kernel: RAX: 0000000000000002 RBX: ffffffff8108ae69 RCX: ffff8802fdb22890
Sep 16 17:46:38 localhost kernel: RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff8802fdb2285c
Sep 16 17:46:38 localhost kernel: RBP: ffff88041e497cf8 R08: ffff88043fdd43c0 R09: 0000000000000000
Sep 16 17:46:38 localhost kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802fdb22878
Sep 16 17:46:38 localhost kernel: R13: ffff88043fdd4430 R14: ffffffff81082866 R15: ffff88041e497c68
Sep 16 17:46:38 localhost kernel: FS: 0000000000000000(0000) GS:ffff88043fdc0000(0000) knlGS:0000000000000000
Sep 16 17:46:38 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 17:46:38 localhost kernel: CR2: 000000309f155d63 CR3: 00000002df7a0000 CR4: 00000000001407e0
Sep 16 17:46:38 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 17:46:38 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 17:46:38 localhost kernel: Stack:
Sep 16 17:46:38 localhost kernel: ffff88041e497d18 ffffffff815e9aaa ffff88041e497d28 ffff8802fdb22548
Sep 16 17:46:38 localhost kernel: ffff88041e497d28 ffffffff815e9a8b ffff88041e497da8 ffffffffa0469045
Sep 16 17:46:38 localhost kernel: ffff8802fdb22548 ffff8802fdb22548 ffff88041e497da8 ffffffffa0468757
Sep 16 17:46:38 localhost kernel: Call Trace:
Sep 16 17:46:38 localhost kernel: [] __mutex_unlock_slowpath+0x1a/0x50
Sep 16 17:46:38 localhost kernel: [] mutex_unlock+0x1b/0x20
Sep 16 17:46:38 localhost kernel: [] zio_done+0x5e5/0xe50 [zfs]
Sep 16 17:46:38 localhost kernel: [] ? zio_ready+0x177/0x480 [zfs]
Sep 16 17:46:38 localhost kernel: [] zio_execute+0xb3/0x140 [zfs]
Sep 16 17:46:38 localhost kernel: [] taskq_thread+0x22d/0x4f0 [spl]
Sep 16 17:46:38 localhost kernel: [] ? try_to_wake_up+0x2b0/0x2b0
Sep 16 17:46:38 localhost kernel: [] ? task_expire+0x120/0x120 [spl]
Sep 16 17:46:38 localhost kernel: [] kthread+0xce/0xe0
Sep 16 17:46:38 localhost kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Sep 16 17:46:38 localhost kernel: [] ret_from_fork+0x7c/0xb0
Sep 16 17:46:38 localhost kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Sep 16 17:46:38 localhost kernel: Code: 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0e 0f 1f 40 00 f3 90 0f b7 07 <66> 39 c2 75 f6 c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 9c
Sep 16 17:46:39 localhost abrtd: Directory 'oops-2014-09-16-17:46:39-3775-1' creation detected
Sep 16 17:46:39 localhost abrt-dump-oops: Reported 1 kernel oopses to Abrt
Sep 16 17:46:39 localhost abrtd: Can't open file '/var/spool/abrt/oops-2014-09-16-17:46:39-3775-1/uid': No such file or directory
Sep 16 17:46:54 localhost kernel: nr_pdflush_threads exported in /proc is scheduled for removal
Sep 16 17:46:54 localhost kernel: sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case. If you have one, please send an email to linux-mm@kvack.org.