kernel: BUG: soft lockup - CPU#7 stuck for 22s! [z_null_iss/0:1796] #2708

Closed

bjquinn opened this issue Sep 16, 2014 · 26 comments
bjquinn commented Sep 16, 2014

I'm getting lots of the following under heavy load doing a send/recv from one pool to another on the same machine. Both pools end up unresponsive.

Sep 16 17:46:38 localhost kernel: BUG: soft lockup - CPU#7 stuck for 22s! [z_null_iss/0:1796]
Sep 16 17:46:38 localhost kernel: Modules linked in: des_generic ecb md4 sha256_ssse3 sha256_generic nls_utf8 cifs fscache dns_resolver fuse ip6table_filter ip6_tables ebtable_nat ebtables autofs4 bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp bridge stp llc ipv6 ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables btrfs raid6_pq xor libcrc32c zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support dcdbas sg coretemp hwmon freq_table mperf intel_powerclamp kvm crc32_pclmul crc32c_intel ghash_clmulni_intel microcode pcspkr bnx2 i2c_i801 lpc_ich ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom usb_storage aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 ahci libahci wmi xhci_hcd mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
Sep 16 17:46:38 localhost kernel: CPU: 7 PID: 1796 Comm: z_null_iss/0 Tainted: P O 3.10.53-1.el6.elrepo.x86_64 #1
Sep 16 17:46:38 localhost kernel: Hardware name: Dell Inc. PowerEdge R210 II/03X6X0, BIOS 2.8.0 06/24/2014
Sep 16 17:46:38 localhost kernel: task: ffff880426c2c040 ti: ffff88041e496000 task.ti: ffff88041e496000
Sep 16 17:46:38 localhost kernel: RIP: 0010:[] [] _raw_spin_lock+0x25/0x30
Sep 16 17:46:38 localhost kernel: RSP: 0018:ffff88041e497cf8 EFLAGS: 00000293
Sep 16 17:46:38 localhost kernel: RAX: 0000000000000002 RBX: ffffffff8108ae69 RCX: ffff8802fdb22890
Sep 16 17:46:38 localhost kernel: RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffff8802fdb2285c
Sep 16 17:46:38 localhost kernel: RBP: ffff88041e497cf8 R08: ffff88043fdd43c0 R09: 0000000000000000
Sep 16 17:46:38 localhost kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802fdb22878
Sep 16 17:46:38 localhost kernel: R13: ffff88043fdd4430 R14: ffffffff81082866 R15: ffff88041e497c68
Sep 16 17:46:38 localhost kernel: FS: 0000000000000000(0000) GS:ffff88043fdc0000(0000) knlGS:0000000000000000
Sep 16 17:46:38 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 17:46:38 localhost kernel: CR2: 000000309f155d63 CR3: 00000002df7a0000 CR4: 00000000001407e0
Sep 16 17:46:38 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 17:46:38 localhost kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 17:46:38 localhost kernel: Stack:
Sep 16 17:46:38 localhost kernel: ffff88041e497d18 ffffffff815e9aaa ffff88041e497d28 ffff8802fdb22548
Sep 16 17:46:38 localhost kernel: ffff88041e497d28 ffffffff815e9a8b ffff88041e497da8 ffffffffa0469045
Sep 16 17:46:38 localhost kernel: ffff8802fdb22548 ffff8802fdb22548 ffff88041e497da8 ffffffffa0468757
Sep 16 17:46:38 localhost kernel: Call Trace:
Sep 16 17:46:38 localhost kernel: [] __mutex_unlock_slowpath+0x1a/0x50
Sep 16 17:46:38 localhost kernel: [] mutex_unlock+0x1b/0x20
Sep 16 17:46:38 localhost kernel: [] zio_done+0x5e5/0xe50 [zfs]
Sep 16 17:46:38 localhost kernel: [] ? zio_ready+0x177/0x480 [zfs]
Sep 16 17:46:38 localhost kernel: [] zio_execute+0xb3/0x140 [zfs]
Sep 16 17:46:38 localhost kernel: [] taskq_thread+0x22d/0x4f0 [spl]
Sep 16 17:46:38 localhost kernel: [] ? try_to_wake_up+0x2b0/0x2b0
Sep 16 17:46:38 localhost kernel: [] ? task_expire+0x120/0x120 [spl]
Sep 16 17:46:38 localhost kernel: [] kthread+0xce/0xe0
Sep 16 17:46:38 localhost kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Sep 16 17:46:38 localhost kernel: [] ret_from_fork+0x7c/0xb0
Sep 16 17:46:38 localhost kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Sep 16 17:46:38 localhost kernel: Code: 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 0e 0f 1f 40 00 f3 90 0f b7 07 <66> 39 c2 75 f6 c9 c3 0f 1f 40 00 55 48 89 e5 0f 1f 44 00 00 9c
Sep 16 17:46:39 localhost abrtd: Directory 'oops-2014-09-16-17:46:39-3775-1' creation detected
Sep 16 17:46:39 localhost abrt-dump-oops: Reported 1 kernel oopses to Abrt
Sep 16 17:46:39 localhost abrtd: Can't open file '/var/spool/abrt/oops-2014-09-16-17:46:39-3775-1/uid': No such file or directory
Sep 16 17:46:54 localhost kernel: nr_pdflush_threads exported in /proc is scheduled for removal
Sep 16 17:46:54 localhost kernel: sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case. If you have one, please send an email to linux-mm@kvack.org.

bjquinn (Author) commented Sep 17, 2014

As a follow-up, I took the exact same machine, upgraded the kernel from 3.10 to 3.16, and ran the same workload that had pretty consistently produced the above error on kernel 3.10; so far it seems to be passing the test on 3.16. It might still be too early to be 100% certain of this, so I'll continue testing the workload, but I would have expected an issue by now. Would that make any sense? Would the Linux kernel version have an effect on an issue like this? Using 3.16 is not a problem for me if that is a sensible workaround.

bjquinn (Author) commented Sep 18, 2014

Yeah, I spoke too soon. The issue manifests itself slightly differently on 3.16 (no console messages, slightly different error in /var/log/messages) --

Sep 17 22:19:32 localhost kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Sep 17 22:19:32 localhost kernel: IP: [] __mutex_unlock_slowpath+0x2d/0x50
Sep 17 22:19:32 localhost kernel: PGD 0
Sep 17 22:19:32 localhost kernel: Oops: 0000 [#1] SMP
Sep 17 22:19:32 localhost kernel: Modules linked in: fuse des_generic ecb md4 nls_utf8 cifs fscache dns_resolver ip6table_filter ip6_tables ebtable_nat ebtables autofs4 bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp bridge stp llc ipv6 ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables btrfs raid6_pq xor vhost_net macvtap macvlan vhost tun kvm uinput zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) iTCO_wdt iTCO_vendor_support dcdbas microcode ipmi_si ipmi_msghandler pcspkr sg i2c_i801 lpc_ich bnx2 ext4 jbd2 mbcache uas usb_storage sr_mod cdrom sd_mod crc_t10dif crct10dif_common wmi ahci libahci xhci_hcd mgag200 ttm drm_kms_helper sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log dm_mod
Sep 17 22:19:32 localhost kernel: CPU: 3 PID: 2019 Comm: z_null_iss/0 Tainted: P O 3.16.2-1.el6.elrepo.x86_64 #1
Sep 17 22:19:32 localhost kernel: Hardware name: Dell Inc. PowerEdge R210 II/03X6X0, BIOS 2.8.0 06/24/2014
Sep 17 22:19:32 localhost kernel: task: ffff8800b9862ef0 ti: ffff88041c4c4000 task.ti: ffff88041c4c4000
Sep 17 22:19:32 localhost kernel: RIP: 0010:[] [] __mutex_unlock_slowpath+0x2d/0x50
Sep 17 22:19:32 localhost kernel: RSP: 0018:ffff88041c4c7d08 EFLAGS: 00010207
Sep 17 22:19:32 localhost kernel: RAX: 0000000000000000 RBX: ffff8802ed262c08 RCX: ffff8802ed262c40
Sep 17 22:19:32 localhost kernel: RDX: ffff8802ed262c10 RSI: 0000000000000282 RDI: ffff8802ed262c0c
Sep 17 22:19:32 localhost kernel: RBP: ffff88041c4c7d18 R08: ffff88043fcd4400 R09: 0000000000000000
Sep 17 22:19:32 localhost kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Sep 17 22:19:32 localhost kernel: R13: 0000000000000000 R14: ffff8802ed262c08 R15: ffff8800b9862ef0
Sep 17 22:19:32 localhost kernel: FS: 0000000000000000(0000) GS:ffff88043fcc0000(0000) knlGS:0000000000000000
Sep 17 22:19:32 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 17 22:19:32 localhost kernel: CR2: 0000000000000010 CR3: 0000000001c11000 CR4: 00000000001407e0
Sep 17 22:19:32 localhost kernel: Stack:
Sep 17 22:19:32 localhost kernel: ffff88041c4c7d28 ffff8802ed2628f8 ffff88041c4c7d28 ffffffff81648dfb
Sep 17 22:19:32 localhost kernel: ffff88041c4c7da8 ffffffffa03db115 ffff88042737e800 ffff8802ed2628f8
Sep 17 22:19:32 localhost kernel: ffff88041c4c7da8 ffffffffa03da827 ffff8804278a1928 ffff88042ae062d0
Sep 17 22:19:32 localhost kernel: Call Trace:
Sep 17 22:19:32 localhost kernel: [] mutex_unlock+0x1b/0x20
Sep 17 22:19:32 localhost kernel: [] zio_done+0x5e5/0xe50 [zfs]
Sep 17 22:19:32 localhost kernel: [] ? zio_ready+0x177/0x480 [zfs]
Sep 17 22:19:32 localhost kernel: [] ? __schedule+0x11e/0x6e0
Sep 17 22:19:32 localhost kernel: [] zio_execute+0xb3/0x140 [zfs]
Sep 17 22:19:32 localhost kernel: [] taskq_thread+0x22d/0x4f0 [spl]
Sep 17 22:19:32 localhost kernel: [] ? try_to_wake_up+0x220/0x220
Sep 17 22:19:32 localhost kernel: [] ? task_expire+0x120/0x120 [spl]
Sep 17 22:19:32 localhost kernel: [] kthread+0xce/0xf0
Sep 17 22:19:32 localhost kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Sep 17 22:19:32 localhost kernel: [] ret_from_fork+0x7c/0xb0
Sep 17 22:19:32 localhost kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Sep 17 22:19:32 localhost kernel: Code: 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb c7 07 01 00 00 00 48 8d 7f 04 e8 d0 22 00 00 48 8b 43 08 48 8d 53 08 48 39 d0 74 09 <48> 8b 78 10 e8 ba cf a5 ff 66 83 43 04 01 48 83 c4 08 5b c9 c3
Sep 17 22:19:32 localhost kernel: RIP [] __mutex_unlock_slowpath+0x2d/0x50
Sep 17 22:19:32 localhost kernel: RSP
Sep 17 22:19:32 localhost kernel: CR2: 0000000000000010
Sep 17 22:19:32 localhost kernel: ---[ end trace b928f740ede82415 ]---

bjquinn (Author) commented Sep 22, 2014

Another update -- I was able to get the same issue (with console messages) on ZoL 0.6.3 as well as git head.

behlendorf (Contributor) commented:

@bjquinn This is a duplicate of #2523, which is what I'm currently attempting to run down. One of the issues holding up a resolution has been my inability to reproduce it. But it sounds like you have a test case; can you describe how you're reproducing the issue?

behlendorf added the Bug label Sep 22, 2014
behlendorf added this to the 0.6.4 milestone Sep 22, 2014
bjquinn (Author) commented Sep 22, 2014

Sure, I'll try to give the relevant information. Please feel free to ask me any other questions.

CentOS 6.5. I've tried kernels ranging from 3.10 to 3.16. I've seen the issue on both 0.6.3 and git head.

I have two pools, both created with ashift=12, compression=gzip-9, primarycache=metadata, secondarycache=metadata.

I explicitly specified ashift=12 because at one point I had trouble with drives that lied about having 512-byte sectors, which ZoL didn't catch and correct to 4K. I'm not sure that's the case with these specific drives; it's just habit now.

Compression is set to the highest possible setting, but performance is generally acceptable for my scenario (backups), with the obvious exception of this hang. :)

I've set primarycache and secondarycache to metadata because, again, this is for backups and I have an inordinate number of files, including countless millions of very, very small files. With that many tiny files I actually notice a large space differential between ashift=9 and ashift=12, but ashift=9 performance is so absolutely terrible that I have to go with ashift=12 (it wasn't 2x or 3x slower, it was at least 10x or 20x slower). Anyway, I'm doing large rsyncs to actually make the backups, and I don't much need caching for anything but metadata. I'm willing to listen to alternate recommendations here, but generally this has worked well.

The two pools are a 5x4TB raidz1 and an 8x3TB raidz2. There's no good reason for the difference in pool structures other than that I'm in the process of migrating to larger pools and haven't bought the final hardware yet. I do a zfs send/recv from one pool to the other (the 2nd pool is simply supposed to be a carbon copy of the 1st one, for additional redundancy). I can trigger this error with a simple zfs send/recv: a "zfs send -v [first snap name on 1st pool] | zfs recv -dv [2nd pool name]" followed by a "zfs send -v -I [first snap name on 1st pool] [last snap name on 1st pool] | zfs recv -dvF [2nd pool name]".

I'm actually trying to sync the 1st pool to the 2nd pool for the first time, so the system would be expected to run for a couple of days copying the data. It runs for anywhere from an hour up to maybe 12 hours before this issue arises.

There are about 50 filesystems with roughly 200 snapshots each.

I can also trigger the issue while running rsync alone (my standard backup script copying from various sources to the 1st pool), while running both the rsync backup and the zfs send/recv to the 2nd pool simultaneously, or with the zfs send/recv alone as mentioned above. The issue seems to be strongly correlated with heavy load. I can expect roughly 50MB/s, and my 8-thread processor seems to be the bottleneck, as the gzip-9 compression generally eats up 100% of CPU on all cores.

The system DOES have ECC RAM.

Please let me know what other information would be useful. I'd be happy to run tests or gather more data for you.

Thanks!!

bjquinn (Author) commented Sep 22, 2014

Oh, and if it's relevant, I can't snap my fingers and make this happen, but I can trigger it fairly consistently. I'm pretty certain I can run a workload that will kill it in less than 24 hours, at least. Since this has now happened to me quite a few times and I'm comfortable that it's not causing me any data loss, I'm willing to trigger it with any debugging steps you'd like me to take, etc.

behlendorf (Contributor) commented:

@bjquinn Thanks for the additional information. I'm almost certainly going to put together a set of debug patches this week to try to get to the root cause. If you could give them a try when they're ready, it would be appreciated. I'll let you know. Thanks!

bjquinn (Author) commented Sep 22, 2014

I would absolutely be willing to do that. Let me know when the debug patches are ready, and I'll try them out.

rfehren commented Sep 30, 2014

Just wanted to report that we saw the same issue here (0.6.3 with a 2.6.32.x kernel). It's a production machine, unfortunately, so debugging isn't possible, and it's not reproducible so far. It seemed to be triggered by heavy I/O load. ZFS is used for a Lustre OST on the machine.

bjquinn (Author) commented Sep 30, 2014

I just got it again a few minutes ago. I'm comfortable that I should be able to reproduce it once the debugging patches are available.

rfehren commented Oct 3, 2014

Just for reference, here's the slightly different trace we get. The issue occurs regularly under load.

Oct  1 18:54:55 mull kernel: [535793.951996] BUG: soft lockup - CPU#0 stuck for 61s! [z_null_iss/0:5307]
Oct  1 18:54:55 mull kernel: [535793.956200] Modules linked in: fuse ntfs vfat msdos fat xfs ofd ost osp mdd lfsck lod mdt mgs mgc osd_zfs lquota lustre lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs sha1_generic crc32c libcfs ipmi_devintf ipmi_si ipmi_msghandler kvm_amd kvm nfsd exportfs ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables zfs(P) zcommon(P) bonding znvpair(P) ipv6 bridge zavl(P) stp zunicode(P) llc i2c_piix4 spl button i2c_core processor dcdbas zlib_deflate hid_cherry usbhid ehci_hcd mpt2sas ohci_hcd usbcore thermal thermal_sys bnx2
Oct  1 18:54:55 mull kernel: [535793.956200] CPU 0:
Oct  1 18:54:55 mull kernel: [535793.956200] Modules linked in: fuse ntfs vfat msdos fat xfs ofd ost osp mdd lfsck lod mdt mgs mgc osd_zfs lquota lustre lov osc mdc fid fld ksocklnd ptlrpc obdclass lnet lvfs sha1_generic crc32c libcfs ipmi_devintf ipmi_si ipmi_msghandler kvm_amd kvm nfsd exportfs ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables zfs(P) zcommon(P) bonding znvpair(P) ipv6 bridge zavl(P) stp zunicode(P) llc i2c_piix4 spl button i2c_core processor dcdbas zlib_deflate hid_cherry usbhid ehci_hcd mpt2sas ohci_hcd usbcore thermal thermal_sys bnx2
Oct  1 18:54:55 mull kernel: [535793.956200] Pid: 5307, comm: z_null_iss/0 Tainted: P           2.6.32.62-ql-server-93 #1 PowerEdge R515
Oct  1 18:54:55 mull kernel: [535793.956200] RIP: 0010:[<ffffffff81028621>]  [<ffffffff81028621>] __ticket_spin_lock+0x19/0x20
Oct  1 18:54:55 mull kernel: [535793.956200] RSP: 0018:ffff88084f1add30  EFLAGS: 00000202
Oct  1 18:54:55 mull kernel: [535793.956200] RAX: 0000000000000000 RBX: ffff88084f1add30 RCX: ffff88056f12da20
Oct  1 18:54:55 mull kernel: [535793.956200] RDX: 0000000000000002 RSI: 0000000000000282 RDI: ffff88056f12d9f4
Oct  1 18:54:55 mull kernel: [535793.956200] RBP: ffffffff810103ee R08: 0000000000000000 R09: ffffffff8137a120
Oct  1 18:54:55 mull kernel: [535793.956200] R10: 00006972e35e5201 R11: 0000000000000000 R12: ffff88056f12da20
Oct  1 18:54:55 mull kernel: [535793.956200] R13: ffff88056f12da08 R14: 0000000000000000 R15: ffffffff8137a120
Oct  1 18:54:55 mull kernel: [535793.956200] FS:  00007f26a916d700(0000) GS:ffff880044e00000(0000) knlGS:0000000000000000
Oct  1 18:54:55 mull kernel: [535793.956200] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Oct  1 18:54:55 mull kernel: [535793.956200] CR2: 0000000000763808 CR3: 0000000001001000 CR4: 00000000000006f0
Oct  1 18:54:55 mull kernel: [535793.956200] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  1 18:54:55 mull kernel: [535793.956200] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  1 18:54:55 mull kernel: [535793.956200] Call Trace:
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff81368d30>] ? _spin_lock+0x9/0xd
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff81367ed3>] ? __mutex_unlock_slowpath+0x16/0x44
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff81367b2b>] ? mutex_unlock+0x16/0x18
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffffa0269a85>] ? zio_done+0xac9/0xb1a [zfs]
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffffa0268da1>] ? zio_ready+0xb7/0x2d2 [zfs]
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff8101024e>] ? common_interrupt+0xe/0x13
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffffa026598e>] ? zio_execute+0xdc/0x104 [zfs]
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffffa00d2dd5>] ? taskq_thread+0x299/0x40d [spl]
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff8103a071>] ? default_wake_function+0x0/0xf
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffffa00d2b3c>] ? taskq_thread+0x0/0x40d [spl]
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff810534c4>] ? kthread+0x7b/0x83
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff810369d3>] ? finish_task_switch+0x48/0xb9
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff8101092a>] ? child_rip+0xa/0x20
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff81053449>] ? kthread+0x0/0x83
Oct  1 18:54:55 mull kernel: [535793.956200]  [<ffffffff81010920>] ? child_rip+0x0/0x20

behlendorf (Contributor) commented:

@rfehren Thanks for the additional debugging. Could you describe the load that triggers this? If I were able to reproduce it locally, it would help in running down the root cause.

rfehren commented Oct 3, 2014

It's a customer server running an MDS and OSS (Lustre 2.5.2 on ZFS) plus a KVM machine that mounts the Lustre FS. I can't tell exactly when the bug is triggered, but I saw that users do a lot of Lustre I/O inside the VM (e.g. rsyncing large dirs). What I can tell is that it's most likely not a memory-pressure issue, since the Lustre server that mounts ZFS had 16GB of memory free when it last happened. Hard to reproduce this scenario locally...

rfehren commented Oct 3, 2014

Hmm, it just happened again after Lustre recovery finished and the clients had reconnected. Since nobody was doing anything inside the KVM machine this time, I'm now pretty sure it's the load from the compute nodes hammering on Lustre that triggers this. It's not a big cluster, though, so I'm a bit puzzled, especially since the same software combination has been running flawlessly for months on a number of other clusters. The only difference is that Lustre runs over Ethernet here rather than IB like on the other ones.

behlendorf (Contributor) commented:

@rfehren I'm sure this isn't related to memory pressure. There appears to be a subtle flaw in the zio locking, but I haven't been able to put my finger on it. Thanks for the information; I'm treating this one as a blocker for 0.6.4.

bjquinn (Author) commented Nov 10, 2014

Does anyone have another suggested workaround I can try?

chriseqipe commented:

It seems we have the same problem:

BUG: soft lockup - CPU#2 stuck for 67s! [z_null_iss/0:1175]
Modules linked in: des_generic ecb md4 nls_utf8 cifs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 microcode iTCO_wdt iTCO_vendor_support dcdbas serio_raw lpc_ich mfd_core tg3 ptp pps_core i5100_edac edac_core ioatdma dca zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_si ipmi_msghandler ses enclosure sg ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas raid_class mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 2
Modules linked in: des_generic ecb md4 nls_utf8 cifs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 microcode iTCO_wdt iTCO_vendor_support dcdbas serio_raw lpc_ich mfd_core tg3 ptp pps_core i5100_edac edac_core ioatdma dca zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ipmi_si ipmi_msghandler ses enclosure sg ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mpt2sas raid_class mptsas mptscsih mptbase scsi_transport_sas radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 1175, comm: z_null_iss/0 Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1 Dell Inc. PowerEdge R300/0TY179
RIP: 0010:[] [] _spin_lock+0x1e/0x30
RSP: 0018:ffff88066395fd50 EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff88066395fd50 RCX: 000000000000001b
RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8806170248d4
RBP: ffffffff8100bb8e R08: 0000000000000100 R09: 000000009b978822
R10: 00000000f1d78ddc R11: 0000000000000000 R12: 42bf963dba41b1cb
R13: f6119d6552cd2a0a R14: ead2c59eebc18d01 R15: 0000000000001000
FS: 0000000000000000(0000) GS:ffff880028300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f12412fa000 CR3: 0000000001a85000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process z_null_iss/0 (pid: 1175, threadinfo ffff88066395e000, task ffff880664600aa0)
Stack:
ffff88066395fd70 ffffffff8152adeb ffff8806170245c0 ffff8804deb05a30
ffff88066395fd80 ffffffff8152ad7b ffff88066395fde0 ffffffffa0514d85
ffff880617024850 ffff8806170248d0 ffff880664600aa0 0000000000000000
Call Trace:
[] ? __mutex_unlock_slowpath+0x2b/0x60
[] ? mutex_unlock+0x1b/0x20
[] ? zio_ready+0x1c5/0x480 [zfs]
[] ? zio_execute+0xb5/0x140 [zfs]
[] ? taskq_thread+0x1e7/0x3f0 [spl]
[] ? default_wake_function+0x0/0x20
[] ? taskq_thread+0x0/0x3f0 [spl]
[] ? kthread+0x9e/0xc0
[] ? child_rip+0xa/0x20
[] ? kthread+0x0/0xc0
[] ? child_rip+0x0/0x20
Code: 00 00 00 01 74 05 e8 22 bb d6 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Call Trace:
[] ? __mutex_unlock_slowpath+0x2b/0x60
[] ? mutex_unlock+0x1b/0x20
[] ? zio_ready+0x1c5/0x480 [zfs]
[] ? zio_execute+0xb5/0x140 [zfs]
[] ? taskq_thread+0x1e7/0x3f0 [spl]
[] ? default_wake_function+0x0/0x20
[] ? taskq_thread+0x0/0x3f0 [spl]
[] ? kthread+0x9e/0xc0
[] ? child_rip+0xa/0x20
[] ? kthread+0x0/0xc0
[] ? child_rip+0x0/0x20
INFO: task txg_sync:1282 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync D 0000000000000000 0 1282 2 0x00000000
ffff88065ddedb30 0000000000000046 0000000000000000 ffff880670f59800
0000000000000000 0000000000000286 00001a54b37d00ae ffffffffa0511413
ffff880375c8d068 0000000101b52e42 ffff880672ef05f8 ffff88065ddedfd8
Call Trace:
[] ? zio_taskq_dispatch+0x73/0xa0 [zfs]
[] io_schedule+0x73/0xc0
[] cv_wait_common+0x8c/0x100 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] __cv_wait_io+0x18/0x20 [spl]
[] zio_wait+0xfb/0x1b0 [zfs]
[] vdev_uberblock_sync_list+0x89/0x100 [zfs]
[] vdev_config_sync+0x132/0x160 [zfs]
[] spa_sync+0x7c6/0xb70 [zfs]
[] ? __wake_up_common+0x59/0x90
[] txg_sync_thread+0x355/0x5b0 [zfs]
[] ? enqueue_entity+0x56/0x450
[] ? txg_sync_thread+0x0/0x5b0 [zfs]
[] ? txg_sync_thread+0x0/0x5b0 [zfs]
[] thread_generic_wrapper+0x68/0x80 [spl]
[] ? thread_generic_wrapper+0x0/0x80 [spl]
[] kthread+0x9e/0xc0
[] child_rip+0xa/0x20
[] ? kthread+0x0/0xc0
[] ? child_rip+0x0/0x20
INFO: task smbd:3183 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
smbd D 0000000000000003 0 3183 2537 0x00000080
ffff880605adbab8 0000000000000086 ffff880605adba80 ffff880605adba7c
ffff8804996b0298 ffff88067fc23680 00001a668fc916ab ffff8800283168c0
0000000000000607 0000000101b659f9 ffff880669de9af8 ffff880605adbfd8
Call Trace:
[] cv_wait_common+0xed/0x100 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] __cv_wait+0x15/0x20 [spl]
[] dmu_tx_wait+0xad/0x340 [zfs]
[] dmu_tx_assign+0x91/0x490 [zfs]
[] ? dsl_dataset_block_freeable+0x4b/0x80 [zfs]
[] ? dmu_tx_count_dnode+0x66/0xd0 [zfs]
[] zfs_write+0x3fe/0xc60 [zfs]
[] ? inet_recvmsg+0x5a/0x90
[] ? do_sync_read+0xfa/0x140
[] zpl_write+0x9b/0xe0 [zfs]
[] vfs_write+0xb8/0x1a0
[] sys_pwrite64+0x82/0xa0
[] system_call_fastpath+0x16/0x1b
INFO: task rsync:9657 blocked for more than 120 seconds.
Tainted: P --------------- 2.6.32-504.1.3.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync D 0000000000000000 0 9657 9656 0x00000080
ffff8803f6f95a88 0000000000000082 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000202 ffff880670fe9ab8 ffff8803f6f95fd8
Call Trace:
[] ? prepare_to_wait_exclusive+0x4e/0x80
[] cv_wait_common+0xed/0x100 [spl]
[] ? autoremove_wake_function+0x0/0x40
[] ? mutex_lock+0x1e/0x50
[] __cv_wait+0x15/0x20 [spl]
[] dmu_tx_wait+0xad/0x340 [zfs]
[] dmu_tx_assign+0x91/0x490 [zfs]
[] ? dmu_tx_hold_object_impl+0x86/0xf0 [zfs]
[] ? dmu_tx_hold_spill+0x23/0xc0 [zfs]
[] zfs_setattr+0x5cd/0x1740 [zfs]
[] ? tsd_hash_dtor+0x70/0x90 [spl]
[] ? zfs_open+0x9e/0x140 [zfs]
[] ? __dentry_open+0x23f/0x360
[] ? security_inode_permission+0x1f/0x30
[] ? __kmalloc+0x20c/0x220
[] zpl_setattr+0xee/0x120 [zfs]
[] notify_change+0x168/0x340
[] sys_fchmod+0xfa/0x130
[] ? final_putname+0x26/0x50
[] ? __audit_syscall_exit+0x25e/0x290
[] system_call_fastpath+0x16/0x1b

bjquinn (Author) commented Dec 6, 2014

For what it's worth, I've been able to trigger this issue far more easily when the storage in question is connected via USB. SATA/SAS storage has seemed much more reliable and much less likely to trigger it.

chriseqipe commented:

We changed the CentOS 6.5 kernel from version 2.6.32-504.1.3.el6.x86_64 back to 2.6.32-431.29.2.el6.x86_64, and it seems that at least the soft lockup has disappeared for the moment.
The ZFS version in use is zfs-0.6.3-159_gc944be5.el6.x86_64.

chriseqipe commented:

Just for information: since we downgraded to the kernel mentioned above (2.6.32-431.29.2.el6.x86_64), there have been no more soft lockups.

tuxoko (Contributor) commented Dec 17, 2014

Please try my patch in #2523

AceSlash commented:

I had the same issue last week on Debian Wheezy (kernel 3.2.0-4-amd64 #1 Debian 3.2.60-1+deb7u3) and again today on the same system (3.2.0-4-amd64 #1 Debian 3.2.63-2), with ZoL 0.6.3 from the ZoL Debian package (zfs 0.6.3-1~wheezy, dkms 2.2.0.3-1.2+zfs6).

The exact same action triggered it both times: deleting the contents of a complex directory (176GB of data with a compression ratio of ~4x and a recordsize of 8K). The dataset where the directory resides had one previous snapshot at the time of both crashes.

The syslog produces a lot of output between the start of the issue and the crash (from 23:13 until the crash at 00:14 the first time, and from 23:04 to 00:58 the second time), so I have truncated it; the first minutes of syslog output are available here:

crash 1: http://apaste.info/nN8
crash 2: http://apaste.info/tyr

Unfortunately it's a production environment, so I can't do much diagnosis, but tell me if you need any more information on the system or workload at the time of the crash.

FransUrbo (Contributor) commented:

> from zol debian package (zfs 0.6.3-1~wheezy, dkms 2.2.0.3-1.2+zfs6).

Have you tried the dailies?

AceSlash commented:

@FransUrbo: As seen on IRC, I will update to the dailies tonight.

Since the process happens once a week, I'll update this ticket next week on whether it still crashes or not.

behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
behlendorf (Contributor) commented:

This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.

openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races
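To make the failure mode concrete, below is a minimal userspace sketch of the pattern described in the commit message, written with pthreads rather than the kernel primitives. It is illustrative only: the zio_t fields and the zio_mutex_exit()/wait_thread()/done_thread() names merely mirror the zio_wait()/zio_done() scenario from the commit text, and this is not the actual SPL or ZFS code. The idea is that every exit from the embedded mutex is bracketed by the same spin lock, so the thread that frees the object cannot complete its own exit (and therefore cannot free) while another thread is still inside the unlock path.

/*
 * Userspace illustration (pthreads) of a mutex embedded in an object
 * that is freed by the thread it wakes, and of the "serialize the
 * unlock with a spin lock" idea from the commit above.  Names are
 * illustrative only; this is not the SPL/ZFS code.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct zio {                     /* stand-in for the real zio_t     */
	pthread_mutex_t    io_lock;      /* mutex embedded in the object    */
	pthread_cond_t     io_cv;
	pthread_spinlock_t io_exit_lock; /* conceptual analogue of the fix  */
	int                io_done;
} zio_t;

/*
 * Every unlock of the embedded mutex goes through the same spin lock.
 * The thread that will free the object must pass through here itself,
 * so it cannot finish (and free) while an earlier exit is still inside
 * the unlock path and touching the mutex.
 */
static void zio_mutex_exit(zio_t *zio)
{
	pthread_spin_lock(&zio->io_exit_lock);
	pthread_mutex_unlock(&zio->io_lock);
	pthread_spin_unlock(&zio->io_exit_lock);
}

static void *done_thread(void *arg)      /* plays the role of zio_done()    */
{
	zio_t *zio = arg;

	pthread_mutex_lock(&zio->io_lock);
	zio->io_done = 1;
	pthread_cond_broadcast(&zio->io_cv);
	/*
	 * With a bare unlock, the kernel-mutex slow path described in the
	 * commit can still be touching io_lock after the waiter has been
	 * granted the lock, so the waiter may free the zio underneath it.
	 * The serialized exit closes that window.
	 */
	zio_mutex_exit(zio);
	return NULL;
}

static void *wait_thread(void *arg)      /* plays the role of zio_wait()    */
{
	zio_t *zio = arg;

	pthread_mutex_lock(&zio->io_lock);
	while (!zio->io_done)
		pthread_cond_wait(&zio->io_cv, &zio->io_lock);
	zio_mutex_exit(zio);       /* serializes behind done_thread's exit  */
	pthread_mutex_destroy(&zio->io_lock);  /* safe: no one uses it now   */
	pthread_cond_destroy(&zio->io_cv);
	pthread_spin_destroy(&zio->io_exit_lock);
	free(zio);
	return NULL;
}

int main(void)
{
	pthread_t waiter, completer;
	zio_t *zio = calloc(1, sizeof(*zio));

	pthread_mutex_init(&zio->io_lock, NULL);
	pthread_cond_init(&zio->io_cv, NULL);
	pthread_spin_init(&zio->io_exit_lock, PTHREAD_PROCESS_PRIVATE);

	pthread_create(&waiter, NULL, wait_thread, zio);
	pthread_create(&completer, NULL, done_thread, zio);
	pthread_join(waiter, NULL);
	pthread_join(completer, NULL);
	printf("done\n");
	return 0;
}

The actual fix lives in the SPL kmutex implementation (openzfs/spl@a3c1eb7); this sketch only shows why serializing mutex_exit() closes the use-after-free window behind the soft lockups and NULL dereferences reported above.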

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/zfs@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races
Backported-by: Darik Horn <dajhorn@vanadac.com>
Closes #421

Conflicts:
        include/sys/mutex.h
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
mutex: force serialization on mutex_exit() to fix races
ryao (Contributor) commented Feb 10, 2015

@behlendorf This is not a duplicate of #2523. Instead, it is a duplicate of #3091, provided that this deadlock occurred on a multi-socket system.

ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015
mutex: force serialization on mutex_exit() to fix races

Closes openzfs#421

Conflicts:
	include/sys/mutex.h