ZFS lockup 0.6.3.1 #3091 (Closed)

eolson78 opened this issue Feb 9, 2015 · 20 comments

eolson78 commented Feb 9, 2015

Hi, we are seeing a lockup involving a mutex and a spinlock. Please see the stack traces below.
CentOS 6.5
Kernel 3.14.4
32 vCPUs
60 GB of RAM

Feb 8 19:09:31 nas-prd-app02 kernel: BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:19060]
Feb 8 19:09:31 nas-prd-app02 kernel: Modules linked in: bridge nfsd nfs_acl auth_rpcgss iscsi_target_mod target_core_iblock target_core_file target_core_pscsi target_core_mod crct10dif_generic crc_t10dif crct10dif_common configfs bnx2fc cnic uio fcoe libfcoe sg libfc scsi_transport_fc scsi_tgt 8021q garp stp llc lockd sunrpc ipv6 dm_mirror dm_region_hash dm_log tcp_htcp zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) xen_netfront dm_mod coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache xen_blkfront
Feb 8 19:09:31 nas-prd-app02 kernel: CPU: 1 PID: 19060 Comm: nfsd Tainted: P O 3.14.4 #1
Feb 8 19:09:31 nas-prd-app02 kernel: task: ffff880404f6d170 ti: ffff880404f44000 task.ti: ffff880404f44000
Feb 8 19:09:31 nas-prd-app02 kernel: RIP: e030:[] [] _raw_spin_lock+0x23/0x30
Feb 8 19:09:31 nas-prd-app02 kernel: RSP: e02b:ffff880404f45638 EFLAGS: 00000293
Feb 8 19:09:31 nas-prd-app02 kernel: RAX: 0000000000000002 RBX: ffff88083482e5b8 RCX: 0000000000000000
Feb 8 19:09:31 nas-prd-app02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88083482e5bc
Feb 8 19:09:31 nas-prd-app02 kernel: RBP: ffff880404f45638 R08: 0000000000000040 R09: ffffc90aada99000
Feb 8 19:09:31 nas-prd-app02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88083482e560
Feb 8 19:09:31 nas-prd-app02 kernel: R13: ffffc90011196d60 R14: ffff8805a47c4420 R15: 0000000000000000
Feb 8 19:09:31 nas-prd-app02 kernel: FS: 00007f6363472700(0000) GS:ffff880ebb820000(0000) knlGS:0000000000000000
Feb 8 19:09:31 nas-prd-app02 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 8 19:09:31 nas-prd-app02 kernel: CR2: 00007f636347e000 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:09:31 nas-prd-app02 kernel: Stack:
Feb 8 19:09:31 nas-prd-app02 kernel: ffff880404f45658 ffffffff81621bba ffff880404f45668 ffff880404f45714
Feb 8 19:09:31 nas-prd-app02 kernel: ffff880404f45668 ffffffff81621c0b ffff880404f456e8 ffffffffa0284d03
Feb 8 19:09:31 nas-prd-app02 kernel: ffff880404f456d8 ffffffffa03205f1 0000000000000000 0000000000000000
Feb 8 19:09:31 nas-prd-app02 kernel: Call Trace:
Feb 8 19:09:31 nas-prd-app02 kernel: [] __mutex_unlock_slowpath+0x1a/0x50
Feb 8 19:09:31 nas-prd-app02 kernel: [] mutex_unlock+0x1b/0x20
Feb 8 19:09:31 nas-prd-app02 kernel: [] dbuf_read_impl+0x1e3/0x390 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? zio_null+0x61/0x70 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] dbuf_read+0xad/0x3f0 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] dmu_buf_will_dirty+0xa7/0x100 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] dmu_write_uio_dnode+0x70/0x140 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] dmu_write_uio_dbuf+0x49/0x60 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] zfs_write+0xa62/0xbc0 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? RW_WRITE_HELD+0x61/0xb0 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? sa_lookup_impl+0x13/0x20 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? dmu_object_size_from_db+0x69/0xa0 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? mutex_lock+0x16/0x40
Feb 8 19:09:31 nas-prd-app02 kernel: [] zpl_aio_write+0xb5/0x100 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] do_sync_readv_writev+0x4d/0x80
Feb 8 19:09:31 nas-prd-app02 kernel: [] do_readv_writev+0xc8/0x2c0
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? zpl_compat_ioctl+0x10/0x10 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? zpl_read+0xf0/0xf0 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? zfs_open+0x9b/0x130 [zfs]
Feb 8 19:09:31 nas-prd-app02 kernel: [] vfs_writev+0x37/0x50
Feb 8 19:09:31 nas-prd-app02 kernel: [] nfsd_vfs_write+0xa1/0x2a0 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? ima_file_check+0x36/0x50
Feb 8 19:09:31 nas-prd-app02 kernel: [] nfsd_write+0xe7/0x100 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] nfsd3_proc_write+0x9f/0xd0 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] nfsd_dispatch+0xa8/0x1b0 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] svc_process_common+0x30c/0x530 [sunrpc]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] svc_process+0x112/0x160 [sunrpc]
Feb 8 19:09:31 nas-prd-app02 kernel: [] nfsd+0xbf/0x130 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:09:31 nas-prd-app02 kernel: [] kthread+0xce/0xf0
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? xen_end_context_switch+0x1e/0x30
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:09:31 nas-prd-app02 kernel: [] ret_from_fork+0x7c/0xb0
Feb 8 19:09:31 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:09:31 nas-prd-app02 kernel: Code: 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 04 c9 c3 f3 90 0f b7 07 <66> 39 c2 75 f6 c9 c3 90 90 90 90 90 90 55 48 89 e5 48 83 ec 20
Feb 8 19:09:32 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:09:32 nas-prd-app02 abrtd: Directory 'oops-2015-02-08-19:09:32-4397-1' creation detected
Feb 8 19:09:32 nas-prd-app02 abrt-dump-oops: Reported 1 kernel oopses to Abrt
Feb 8 19:09:45 nas-prd-app02 abrtd: Sending an email...
Feb 8 19:09:45 nas-prd-app02 abrtd: Email was sent to: root@localhost
Feb 8 19:09:45 nas-prd-app02 abrtd: Duplicate: UUID
Feb 8 19:09:45 nas-prd-app02 abrtd: DUP_OF_DIR: /var/spool/abrt/oops-2015-02-08-19:01:08-4397-1
Feb 8 19:09:45 nas-prd-app02 abrtd: Deleting problem directory oops-2015-02-08-19:02:36-4397-1 (dup of oops-2015-02-08-19:01:08-4397-1)
Feb 8 19:09:45 nas-prd-app02 abrtd: Can't open file '/var/spool/abrt/oops-2015-02-08-19:01:08-4397-1/uid': No such file or directory
Feb 8 19:09:45 nas-prd-app02 abrtd: No actions are found for event 'notify-dup'
Feb 8 19:09:45 nas-prd-app02 abrtd: Can't open file '/var/spool/abrt/oops-2015-02-08-19:03:04-4397-1/uid': No such file or directory
Feb 8 19:09:48 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:09:58 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:09:59 nas-prd-app02 kernel: BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:19060]
Feb 8 19:09:59 nas-prd-app02 kernel: Modules linked in: bridge nfsd nfs_acl auth_rpcgss iscsi_target_mod target_core_iblock target_core_file target_core_pscsi target_core_mod crct10dif_generic crc_t10dif crct10dif_common configfs bnx2fc cnic uio fcoe libfcoe sg libfc scsi_transport_fc scsi_tgt 8021q garp stp llc lockd sunrpc ipv6 dm_mirror dm_region_hash dm_log tcp_htcp zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) xen_netfront dm_mod coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache xen_blkfront
Feb 8 19:09:59 nas-prd-app02 kernel: CPU: 1 PID: 19060 Comm: nfsd Tainted: P O 3.14.4 #1
Feb 8 19:09:59 nas-prd-app02 kernel: task: ffff880404f6d170 ti: ffff880404f44000 task.ti: ffff880404f44000
Feb 8 19:09:59 nas-prd-app02 kernel: RIP: e030:[] [] _raw_spin_lock+0x23/0x30
Feb 8 19:09:59 nas-prd-app02 kernel: RSP: e02b:ffff880404f45638 EFLAGS: 00000293
Feb 8 19:09:59 nas-prd-app02 kernel: RAX: 0000000000000002 RBX: ffff88083482e5b8 RCX: 0000000000000000
Feb 8 19:09:59 nas-prd-app02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88083482e5bc
Feb 8 19:09:59 nas-prd-app02 kernel: RBP: ffff880404f45638 R08: 0000000000000040 R09: ffffc90aada99000
Feb 8 19:09:59 nas-prd-app02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88083482e560
Feb 8 19:09:59 nas-prd-app02 kernel: R13: ffffc90011196d60 R14: ffff8805a47c4420 R15: 0000000000000000
Feb 8 19:09:59 nas-prd-app02 kernel: FS: 00007f6363472700(0000) GS:ffff880ebb820000(0000) knlGS:0000000000000000
Feb 8 19:09:59 nas-prd-app02 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 8 19:09:59 nas-prd-app02 kernel: CR2: 00007f636347e000 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:09:59 nas-prd-app02 kernel: Stack:
Feb 8 19:09:59 nas-prd-app02 kernel: ffff880404f45658 ffffffff81621bba ffff880404f45668 ffff880404f45714
Feb 8 19:09:59 nas-prd-app02 kernel: ffff880404f45668 ffffffff81621c0b ffff880404f456e8 ffffffffa0284d03
Feb 8 19:09:59 nas-prd-app02 kernel: ffff880404f456d8 ffffffffa03205f1 0000000000000000 0000000000000000
Feb 8 19:09:59 nas-prd-app02 kernel: Call Trace:
Feb 8 19:09:59 nas-prd-app02 kernel: [] __mutex_unlock_slowpath+0x1a/0x50
Feb 8 19:09:59 nas-prd-app02 kernel: [] mutex_unlock+0x1b/0x20
Feb 8 19:09:59 nas-prd-app02 kernel: [] dbuf_read_impl+0x1e3/0x390 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? zio_null+0x61/0x70 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] dbuf_read+0xad/0x3f0 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] dmu_buf_will_dirty+0xa7/0x100 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] dmu_write_uio_dnode+0x70/0x140 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] dmu_write_uio_dbuf+0x49/0x60 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] zfs_write+0xa62/0xbc0 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? RW_WRITE_HELD+0x61/0xb0 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? sa_lookup_impl+0x13/0x20 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? dmu_object_size_from_db+0x69/0xa0 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? mutex_lock+0x16/0x40
Feb 8 19:09:59 nas-prd-app02 kernel: [] zpl_aio_write+0xb5/0x100 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] do_sync_readv_writev+0x4d/0x80
Feb 8 19:09:59 nas-prd-app02 kernel: [] do_readv_writev+0xc8/0x2c0
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? zpl_compat_ioctl+0x10/0x10 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? zpl_read+0xf0/0xf0 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? zfs_open+0x9b/0x130 [zfs]
Feb 8 19:09:59 nas-prd-app02 kernel: [] vfs_writev+0x37/0x50
Feb 8 19:09:59 nas-prd-app02 kernel: [] nfsd_vfs_write+0xa1/0x2a0 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? ima_file_check+0x36/0x50
Feb 8 19:09:59 nas-prd-app02 kernel: [] nfsd_write+0xe7/0x100 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] nfsd3_proc_write+0x9f/0xd0 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] nfsd_dispatch+0xa8/0x1b0 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] svc_process_common+0x30c/0x530 [sunrpc]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] svc_process+0x112/0x160 [sunrpc]
Feb 8 19:09:59 nas-prd-app02 kernel: [] nfsd+0xbf/0x130 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:09:59 nas-prd-app02 kernel: [] kthread+0xce/0xf0
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? xen_end_context_switch+0x1e/0x30
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:09:59 nas-prd-app02 kernel: [] ret_from_fork+0x7c/0xb0
Feb 8 19:09:59 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:09:59 nas-prd-app02 kernel: Code: 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 04 c9 c3 f3 90 0f b7 07 <66> 39 c2 75 f6 c9 c3 90 90 90 90 90 90 55 48 89 e5 48 83 ec 20
Feb 8 19:10:01 nas-prd-app02 abrtd: Directory 'oops-2015-02-08-19:10:01-4397-1' creation detected
Feb 8 19:10:01 nas-prd-app02 abrt-dump-oops: Reported 1 kernel oopses to Abrt
Feb 8 19:10:08 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:10:27 nas-prd-app02 kernel: BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:19060]
Feb 8 19:10:27 nas-prd-app02 kernel: Modules linked in: bridge nfsd nfs_acl auth_rpcgss iscsi_target_mod target_core_iblock target_core_file target_core_pscsi target_core_mod crct10dif_generic crc_t10dif crct10dif_common configfs bnx2fc cnic uio fcoe libfcoe sg libfc scsi_transport_fc scsi_tgt 8021q garp stp llc lockd sunrpc ipv6 dm_mirror dm_region_hash dm_log tcp_htcp zfs(PO) zcommon(PO) znvpair(PO) zavl(PO) zunicode(PO) spl(O) xen_netfront dm_mod coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache xen_blkfront
Feb 8 19:10:27 nas-prd-app02 kernel: CPU: 1 PID: 19060 Comm: nfsd Tainted: P O 3.14.4 #1
Feb 8 19:10:27 nas-prd-app02 kernel: task: ffff880404f6d170 ti: ffff880404f44000 task.ti: ffff880404f44000
Feb 8 19:10:27 nas-prd-app02 kernel: RIP: e030:[] [] _raw_spin_lock+0x23/0x30
Feb 8 19:10:27 nas-prd-app02 kernel: RSP: e02b:ffff880404f45638 EFLAGS: 00000293
Feb 8 19:10:27 nas-prd-app02 kernel: RAX: 0000000000000002 RBX: ffff88083482e5b8 RCX: 0000000000000000
Feb 8 19:10:27 nas-prd-app02 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88083482e5bc
Feb 8 19:10:27 nas-prd-app02 kernel: RBP: ffff880404f45638 R08: 0000000000000040 R09: ffffc90aada99000
Feb 8 19:10:27 nas-prd-app02 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88083482e560
Feb 8 19:10:27 nas-prd-app02 kernel: R13: ffffc90011196d60 R14: ffff8805a47c4420 R15: 0000000000000000
Feb 8 19:10:27 nas-prd-app02 kernel: FS: 00007f6363472700(0000) GS:ffff880ebb820000(0000) knlGS:0000000000000000
Feb 8 19:10:27 nas-prd-app02 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 8 19:10:27 nas-prd-app02 kernel: CR2: 00007f636347e000 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:10:27 nas-prd-app02 kernel: Stack:
Feb 8 19:10:27 nas-prd-app02 kernel: ffff880404f45658 ffffffff81621bba ffff880404f45668 ffff880404f45714
Feb 8 19:10:27 nas-prd-app02 kernel: ffff880404f45668 ffffffff81621c0b ffff880404f456e8 ffffffffa0284d03
Feb 8 19:10:27 nas-prd-app02 kernel: ffff880404f456d8 ffffffffa03205f1 0000000000000000 0000000000000000
Feb 8 19:10:27 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:27 nas-prd-app02 kernel: [] __mutex_unlock_slowpath+0x1a/0x50
Feb 8 19:10:27 nas-prd-app02 kernel: [] mutex_unlock+0x1b/0x20
Feb 8 19:10:27 nas-prd-app02 kernel: [] dbuf_read_impl+0x1e3/0x390 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? zio_null+0x61/0x70 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] dbuf_read+0xad/0x3f0 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] dmu_buf_will_dirty+0xa7/0x100 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] dmu_write_uio_dnode+0x70/0x140 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] dmu_write_uio_dbuf+0x49/0x60 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] zfs_write+0xa62/0xbc0 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? RW_WRITE_HELD+0x61/0xb0 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? sa_lookup_impl+0x13/0x20 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? dmu_object_size_from_db+0x69/0xa0 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? mutex_lock+0x16/0x40
Feb 8 19:10:27 nas-prd-app02 kernel: [] zpl_aio_write+0xb5/0x100 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] do_sync_readv_writev+0x4d/0x80
Feb 8 19:10:27 nas-prd-app02 kernel: [] do_readv_writev+0xc8/0x2c0
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? zpl_compat_ioctl+0x10/0x10 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? zpl_read+0xf0/0xf0 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? zfs_open+0x9b/0x130 [zfs]
Feb 8 19:10:27 nas-prd-app02 kernel: [] vfs_writev+0x37/0x50
Feb 8 19:10:27 nas-prd-app02 kernel: [] nfsd_vfs_write+0xa1/0x2a0 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? ima_file_check+0x36/0x50
Feb 8 19:10:27 nas-prd-app02 kernel: [] nfsd_write+0xe7/0x100 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] nfsd3_proc_write+0x9f/0xd0 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] nfsd_dispatch+0xa8/0x1b0 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] svc_process_common+0x30c/0x530 [sunrpc]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] svc_process+0x112/0x160 [sunrpc]
Feb 8 19:10:27 nas-prd-app02 kernel: [] nfsd+0xbf/0x130 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:10:27 nas-prd-app02 kernel: [] kthread+0xce/0xf0
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? xen_end_context_switch+0x1e/0x30
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:10:27 nas-prd-app02 kernel: [] ret_from_fork+0x7c/0xb0
Feb 8 19:10:27 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:10:27 nas-prd-app02 kernel: Code: 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 04 c9 c3 f3 90 0f b7 07 <66> 39 c2 75 f6 c9 c3 90 90 90 90 90 90 55 48 89 e5 48 83 ec 20
Feb 8 19:10:28 nas-prd-app02 abrtd: Directory 'oops-2015-02-08-19:10:28-4397-1' creation detected
Feb 8 19:10:28 nas-prd-app02 abrt-dump-oops: Reported 1 kernel oopses to Abrt
Feb 8 19:10:29 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:10:36 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:10:43 nas-prd-app02 dhclient[3573]: DHCPREQUEST on eth0 to 10.90.8.1 port 67 (xid=0x2483666e)
Feb 8 19:10:43 nas-prd-app02 kernel: INFO: rcu_sched self-detected stall on CPU { 1} (t=600016 jiffies g=14719917 c=14719916 q=9887444)
Feb 8 19:10:43 nas-prd-app02 kernel: sending NMI to all CPUs:
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 0
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffffffff81c13480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] run_timer_softirq+0x53/0x2a0
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ebb803e88 EFLAGS: 00000046
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: ffffffff81fb4fa8 RBX: ffffffff81fb3b80 RCX: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: 000000000000530c RSI: ffffffffff4f9000 RDI: ffffffff81fb3b80
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ebb803ef8 R08: 00000000b74b9688 R09: 0000ed250ad05bc0
Feb 8 19:10:43 nas-prd-app02 kernel: R10: ffffffff8181cf65 R11: 0000000000000010 R12: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffff880ebb803eb8 R14: ffffffff81fb4ba8 R15: 0000000000000002
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f6363472700(0000) GS:ffff880ebb800000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 00007f636347e000 CR3: 0000000ea3375000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff81c00010 ffffffff81c00000 ffff880ebb803eb8 ffffffff81fb4fa8
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 ffff880ebb80b0c0 ffff880ebb803f58 ffffffff81384227
Feb 8 19:10:43 nas-prd-app02 kernel: ffff880ebb803ed8 ffffffff81c05088 0000000000000001 0000000000000100
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? evtchn_2l_handle_events+0x107/0x2a0
Feb 8 19:10:43 nas-prd-app02 kernel: [] __do_softirq+0xfc/0x2b0
Feb 8 19:10:43 nas-prd-app02 kernel: [] irq_exit+0xbd/0xd0
Feb 8 19:10:43 nas-prd-app02 kernel: [] xen_evtchn_do_upcall+0x35/0x50
Feb 8 19:10:43 nas-prd-app02 kernel: [] xen_do_hypervisor_callback+0x1e/0x30
Feb 8 19:10:43 nas-prd-app02 kernel:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? rest_init+0x77/0x80
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? start_kernel+0x39e/0x3a5
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? repair_env_string+0x5b/0x5b
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? x86_64_start_reservations+0x2a/0x2c
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_start_kernel+0x5bd/0x5bf
Feb 8 19:10:43 nas-prd-app02 kernel: Code: 07 cd 00 48 3b 43 10 0f 88 6e 01 00 00 48 89 df 4c 8d b3 28 10 00 00 4c 8d 6d c0 e8 68 a7 5a 00 48 8d 83 28 14 00 00 48 89 45 a8 <48> 8d 83 28 18 00 00 48 89 45 a0 48 8d 83 28 1c 00 00 48 89 45
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 1
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 1 PID: 19060 Comm: nfsd Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880404f6d170 ti: ffff880404f44000 task.ti: ffff880404f44000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_vcpu_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ebb823b70 EFLAGS: 00000046
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffffffff81d3c980 RCX: ffffffff8100130a
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000b
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ebb823b88 R08: ffffffff81d3c980 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000001235 R11: 0000000000000246 R12: ffffffff81d3c980
Feb 8 19:10:43 nas-prd-app02 kernel: R13: 0000000000000005 R14: 0000000000000001 R15: ffff880ebb82e460
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f6363472700(0000) GS:ffff880ebb820000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 00007f636347e000 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000001235 0000000000000001 ffffffff81383714 ffff880ebb823bb8
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff81011bc3 0000000000000000 000000000096ded4 ffffffff81d3c580
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 ffff880ebb823bc8 ffffffff810127f9 ffff880ebb823be8
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_send_IPI_one+0x44/0x60
Feb 8 19:10:43 nas-prd-app02 kernel: [] __xen_send_IPI_mask+0x33/0x60
Feb 8 19:10:43 nas-prd-app02 kernel: [] xen_send_IPI_all+0x79/0xa0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_trigger_all_cpu_backtrace+0x61/0xa0
Feb 8 19:10:43 nas-prd-app02 kernel: [] print_cpu_stall+0xdc/0x140
Feb 8 19:10:43 nas-prd-app02 kernel: [] __rcu_pending+0x23e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] rcu_check_callbacks+0xed/0x1a0
Feb 8 19:10:43 nas-prd-app02 kernel: [] update_process_times+0x48/0x80
Feb 8 19:10:43 nas-prd-app02 kernel: [] tick_sched_handle+0x39/0x80
Feb 8 19:10:43 nas-prd-app02 kernel: [] tick_sched_timer+0x54/0x90
Feb 8 19:10:43 nas-prd-app02 kernel: [] __run_hrtimer+0x83/0x1d0
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? tick_nohz_handler+0xc0/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] hrtimer_interrupt+0x10e/0x260
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? __mix_pool_bytes+0x54/0xa0
Feb 8 19:10:43 nas-prd-app02 kernel: [] xen_timer_interrupt+0x2f/0x130
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? add_interrupt_randomness+0x4c/0x1d0
Feb 8 19:10:43 nas-prd-app02 kernel: [] handle_irq_event_percpu+0x6d/0x200
Feb 8 19:10:43 nas-prd-app02 kernel: [] handle_percpu_irq+0x48/0x70
Feb 8 19:10:43 nas-prd-app02 kernel: [] evtchn_2l_handle_events+0x28c/0x2a0
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? __local_bh_enable+0x17/0x50
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? __do_softirq+0x19c/0x2b0
Feb 8 19:10:43 nas-prd-app02 kernel: [] __xen_evtchn_do_upcall+0x50/0x90
Feb 8 19:10:43 nas-prd-app02 kernel: [] xen_evtchn_do_upcall+0x2f/0x50
Feb 8 19:10:43 nas-prd-app02 kernel: [] xen_do_hypervisor_callback+0x1e/0x30
Feb 8 19:10:43 nas-prd-app02 kernel:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? _raw_spin_lock+0x20/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? __mutex_unlock_slowpath+0x1a/0x50
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? mutex_unlock+0x1b/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? dbuf_read_impl+0x1e3/0x390 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? zio_null+0x61/0x70 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? dbuf_read+0xad/0x3f0 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? dmu_buf_will_dirty+0xa7/0x100 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? dmu_write_uio_dnode+0x70/0x140 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? dmu_write_uio_dbuf+0x49/0x60 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? zfs_write+0xa62/0xbc0 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? RW_WRITE_HELD+0x61/0xb0 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? sa_lookup_impl+0x13/0x20 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? dmu_object_size_from_db+0x69/0xa0 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? mutex_lock+0x16/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? zpl_aio_write+0xb5/0x100 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? do_sync_readv_writev+0x4d/0x80
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? do_readv_writev+0xc8/0x2c0
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? zpl_compat_ioctl+0x10/0x10 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? zpl_read+0xf0/0xf0 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? zfs_open+0x9b/0x130 [zfs]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? vfs_writev+0x37/0x50
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd_vfs_write+0xa1/0x2a0 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? ima_file_check+0x36/0x50
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd_write+0xe7/0x100 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd3_proc_write+0x9f/0xd0 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd_dispatch+0xa8/0x1b0 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? svc_process_common+0x30c/0x530 [sunrpc]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? svc_process+0x112/0x160 [sunrpc]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd+0xbf/0x130 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? nfsd_pool_stats_release+0x60/0x60 [nfsd]
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? kthread+0xce/0xf0
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_end_context_switch+0x1e/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? ret_from_fork+0x7c/0xb0
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? kthread_freezable_should_stop+0x70/0x70
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 50 b8 17 00 00 00 0f 05 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 18 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 2
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5c82150 ti: ffff880ea5c88000 task.ti: ffff880ea5c88000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5c89e78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5c88010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5c89e90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000002
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5c88010 R15: ffff880ea5c88000
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f50c03a0720(0000) GS:ffff880ebb840000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: ffffffffff600400 CR3: 0000000e81b14000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5c89eb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5c88010 ffff880ea5c88010 ffff880ea5c89ec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5c89f30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? check_events+0x12/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_bringup_and_idle+0x2a/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 3
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 3 PID: 0 Comm: swapper/3 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5c8af30 ti: ffff880ea5c8c000 task.ti: ffff880ea5c8c000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5c8de78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5c8c010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5c8de90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000003
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5c8c010 R15: ffff880ea5c8c000
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f0b3d7f8700(0000) GS:ffff880ebb860000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 00000035c02acd50 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5c8deb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5c8c010 ffff880ea5c8c010 ffff880ea5c8dec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5c8df30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? check_events+0x12/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_bringup_and_idle+0x2a/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 4
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 4 PID: 0 Comm: swapper/4 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5c8a190 ti: ffff880ea5c8e000 task.ti: ffff880ea5c8e000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5c8fe78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5c8e010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5c8fe90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000004
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5c8e010 R15: ffff880ea5c8e000
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007fdead6fc700(0000) GS:ffff880ebb880000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 000000000044fc20 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5c8feb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5c8e010 ffff880ea5c8e010 ffff880ea5c8fec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5c8ff30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? check_events+0x12/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_bringup_and_idle+0x2a/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 5
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 5 PID: 0 Comm: swapper/5 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5c92f70 ti: ffff880ea5c94000 task.ti: ffff880ea5c94000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5c95e78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5c94010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5c95e90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000005
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5c94010 R15: ffff880ea5c94000
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f376d89f7c0(0000) GS:ffff880ebb8a0000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: ffffffffff600400 CR3: 0000000e819d7000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5c95eb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5c94010 ffff880ea5c94010 ffff880ea5c95ec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5c95f30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? check_events+0x12/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_bringup_and_idle+0x2a/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 6
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 6 PID: 0 Comm: swapper/6 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5c921d0 ti: ffff880ea5c98000 task.ti: ffff880ea5c98000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5c99e78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5c98010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5c99e90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 1
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000006
Feb 8 19:10:43 nas-prd-app02 kernel: }
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5c98010 R15: ffff880ea5c98000
Feb 8 19:10:43 nas-prd-app02 kernel: (detected by 0, t=600017 jiffies, g=14719917, c=14719916, q=9887455)
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f8ce47b9700(0000) GS:ffff880ebb8c0000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 00000000028daec0 CR3: 0000000e819fa000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5c99eb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5c98010 ffff880ea5c98010 ffff880ea5c99ec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5c99f30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? check_events+0x12/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_bringup_and_idle+0x2a/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 7
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 7 PID: 0 Comm: swapper/7 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5c9afb0 ti: ffff880ea5c9c000 task.ti: ffff880ea5c9c000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5c9de78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5c9c010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5c9de90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000007
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5c9c010 R15: ffff880ea5c9c000
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f7a41103700(0000) GS:ffff880ebb8e0000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 00000000025d6168 CR3: 0000000001c0c000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5c9deb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5c9c010 ffff880ea5c9c010 ffff880ea5c9dec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5c9df30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [] arch_cpu_idle+0x26/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_idle_loop+0x9e/0x250
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? check_events+0x12/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_startup_entry+0x23/0x30
Feb 8 19:10:43 nas-prd-app02 kernel: [] cpu_bringup_and_idle+0x2a/0x40
Feb 8 19:10:43 nas-prd-app02 kernel: Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
Feb 8 19:10:43 nas-prd-app02 kernel: NMI backtrace for cpu 9
Feb 8 19:10:43 nas-prd-app02 kernel: CPU: 9 PID: 0 Comm: swapper/9 Tainted: P O 3.14.4 #1
Feb 8 19:10:43 nas-prd-app02 kernel: task: ffff880ea5ca2ff0 ti: ffff880ea5ca4000 task.ti: ffff880ea5ca4000
Feb 8 19:10:43 nas-prd-app02 kernel: RIP: e030:[] [] xen_hypercall_sched_op+0xa/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: RSP: e02b:ffff880ea5ca5e78 EFLAGS: 00000246
Feb 8 19:10:43 nas-prd-app02 kernel: RAX: 0000000000000000 RBX: ffff880ea5ca4010 RCX: ffffffff810013aa
Feb 8 19:10:43 nas-prd-app02 kernel: RDX: ffffffff81c7b580 RSI: 0000000000000000 RDI: 0000000000000001
Feb 8 19:10:43 nas-prd-app02 kernel: RBP: ffff880ea5ca5e90 R08: 0000000000000000 R09: 0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000009
Feb 8 19:10:43 nas-prd-app02 kernel: R13: ffffffff81d3c980 R14: ffff880ea5ca4010 R15: ffff880ea5ca4000
Feb 8 19:10:43 nas-prd-app02 kernel: FS: 00007f7c7374a700(0000) GS:ffff880ebb920000(0000) knlGS:0000000000000000
Feb 8 19:10:43 nas-prd-app02 kernel: CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
Feb 8 19:10:43 nas-prd-app02 kernel: CR2: 00007fa4597cf600 CR3: 0000000e84971000 CR4: 0000000000002660
Feb 8 19:10:43 nas-prd-app02 kernel: Stack:
Feb 8 19:10:43 nas-prd-app02 kernel: 0000000000000000 0000000000000000 ffffffff8100a6f0 ffff880ea5ca5eb0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e844 ffff880ea5ca4010 ffff880ea5ca4010 ffff880ea5ca5ec0
Feb 8 19:10:43 nas-prd-app02 kernel: ffffffff8101e1e6 ffff880ea5ca5f30 ffffffff810c6a1e ffffffff8100af22
Feb 8 19:10:43 nas-prd-app02 kernel: Call Trace:
Feb 8 19:10:43 nas-prd-app02 kernel: [] ? xen_safe_halt+0x10/0x20
Feb 8 19:10:43 nas-prd-app02 kernel: [] default_idle+0x24/0xc0
Feb 8 19:10:43 nas-prd-app02 kernel: [

eolson78 (Author) commented Feb 9, 2015

We attempted to apply the patch from openzfs/spl#421.
It had no impact; the stack traces are nearly identical pre- and post-patch.

ryao (Contributor) commented Feb 9, 2015

This looks extremely interesting. I had a chat with eolson78 about it. It turns out that he built his kernel without CONFIG_PARAVIRT_SPINLOCKS. If you build the 3.14.4 kernel without CONFIG_PARAVIRT_SPINLOCKS, you end up with disassembly that looks like this:

ffffffff81623d10 <_raw_spin_unlock_irqrestore>:
ffffffff81623d10:       55                      push   %rbp
ffffffff81623d11:       48 89 e5                mov    %rsp,%rbp
ffffffff81623d14:       e8 a7 91 00 00          callq  ffffffff8162cec0 <mcount>
ffffffff81623d19:       66 83 07 01             addw   $0x1,(%rdi)
ffffffff81623d1d:       48 89 f7                mov    %rsi,%rdi
ffffffff81623d20:       ff 14 25 28 57 c2 81    callq  *0xffffffff81c25728
ffffffff81623d27:       c9                      leaveq 
ffffffff81623d28:       c3                      retq   
ffffffff81623d29:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

ffffffff81624080 <_raw_spin_lock>:
ffffffff81624080:       55                      push   %rbp
ffffffff81624081:       48 89 e5                mov    %rsp,%rbp
ffffffff81624084:       e8 37 8e 00 00          callq  ffffffff8162cec0 <mcount>
ffffffff81624089:       b8 00 00 01 00          mov    $0x10000,%eax
ffffffff8162408e:       f0 0f c1 07             lock xadd %eax,(%rdi)
ffffffff81624092:       89 c2                   mov    %eax,%edx
ffffffff81624094:       c1 ea 10                shr    $0x10,%edx
ffffffff81624097:       66 39 c2                cmp    %ax,%dx
ffffffff8162409a:       75 04                   jne    ffffffff816240a0 <_raw_spin_lock+0x20>
ffffffff8162409c:       c9                      leaveq 
ffffffff8162409d:       c3                      retq   
ffffffff8162409e:       f3 90                   pause  
ffffffff816240a0:       0f b7 07                movzwl (%rdi),%eax
ffffffff816240a3:       66 39 c2                cmp    %ax,%dx
ffffffff816240a6:       75 f6                   jne    ffffffff8162409e <_raw_spin_lock+0x1e>
ffffffff816240a8:       c9                      leaveq 
ffffffff816240a9:       c3                      retq   
ffffffff816240aa:       90                      nop
ffffffff816240ab:       90                      nop
ffffffff816240ac:       90                      nop
ffffffff816240ad:       90                      nop
ffffffff816240ae:       90                      nop
ffffffff816240af:       90                      nop

Since kernels are usually built with CONFIG_UNINLINE_SPIN_UNLOCK=n, I needed to use _raw_spin_unlock_irqrestore() to get at the disassembly of arch_spin_unlock() indirectly.

The code looks like this:

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
    if (TICKET_SLOWPATH_FLAG &&
        static_key_false(&paravirt_ticketlocks_enabled)) {
        arch_spinlock_t prev;

        prev = *lock;
        add_smp(&lock->tickets.head, TICKET_LOCK_INC);

        /* add_smp() is a full mb() */

        if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
            __ticket_unlock_slowpath(lock, prev);
    } else
        __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}

There is simply not enough assembly in _raw_spin_unlock_irqrestore to correspond to all of this code, especially once you take into account that this function was supposed to be inlined. The reason is that building without CONFIG_PARAVIRT_SPINLOCKS has the side effect of making TICKET_SLOWPATH_FLAG equal to 0, which allows the compiler to do dead code elimination. Consequently, the function reduces to just the __add() call. Glancing at the C, it looks like this should be atomic, but there are no atomic instructions in the assembly. That is because of the following, which makes UNLOCK_LOCK_PREFIX expand to nothing:

#if defined(CONFIG_X86_32) && (defined(CONFIG_X86_PPRO_FENCE))
/*
 * On PPro SMP, we use a locked operation to unlock
 * (PPro errata 66, 92)
 */
# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
#else
# define UNLOCK_LOCK_PREFIX
#endif
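
In other words, the unlock path effectively reduces to the following (a sketch derived from the code above, not verbatim kernel source):

/* Sketch: effective arch_spin_unlock() when CONFIG_PARAVIRT_SPINLOCKS is off,
 * so TICKET_SLOWPATH_FLAG == 0, the slowpath branch is dead code, and
 * UNLOCK_LOCK_PREFIX expands to nothing on x86-64. */
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
    /* compiles to a plain "addw $0x1,(%rdi)" with no LOCK prefix and no barrier */
    __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}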

To make matters worse, the arch_spin_unlock uses ACCESS_ONCE without any memory barriers, in violation of the kernel's SMP programming documentation:

/*
 * Ticket locks are conceptually two parts, one indicating the current head of
 * the queue, and the other indicating the current tail. The lock is acquired
 * by atomically noting the tail and incrementing it by one (thus adding
 * ourself to the queue and noting our position), then waiting until the head
 * becomes equal to the the initial value of the tail.
 *
 * We use an xadd covering *both* parts of the lock, to increment the tail and
 * also load the position of the head, which takes care of memory ordering
 * issues and should be optimal for the uncontended case. Note the tail must be
 * in the high part, because a wide xadd increment of the low part would carry
 * up and contaminate the high part.
 */
static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
    register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };

    inc = xadd(&lock->tickets, inc);
    if (likely(inc.head == inc.tail))
        goto out;

    inc.tail &= ~TICKET_SLOWPATH_FLAG;
    for (;;) {
        unsigned count = SPIN_THRESHOLD;

        do {
            if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
                goto out;
            cpu_relax();
        } while (--count);
        __ticket_lock_spinning(lock, inc.tail);
    }
out:    barrier();  /* make sure nothing creeps before the lock is taken */
}
SMP BARRIER PAIRING
-------------------

When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired.  A lack of appropriate pairing is almost certainly an error.

General barriers pair with each other, though they also pair with
most other types of barriers, albeit without transitivity.  An acquire
barrier pairs with a release barrier, but both may also pair with other
barriers, including of course general barriers.  A write barrier pairs
with a data dependency barrier, an acquire barrier, a release barrier,
a read barrier, or a general barrier.  Similarly a read barrier or a
data dependency barrier pairs with a write barrier, an acquire barrier,
a release barrier, or a general barrier:

    CPU 1             CPU 2
    ===============       ===============
    ACCESS_ONCE(a) = 1;
    <write barrier>
    ACCESS_ONCE(b) = 2;   x = ACCESS_ONCE(b);
                  <read barrier>
                  y = ACCESS_ONCE(a);

Or:

    CPU 1             CPU 2
    ===============       ===============================
    a = 1;
    <write barrier>
    ACCESS_ONCE(b) = &a;  x = ACCESS_ONCE(b);
                  <data dependency barrier>
                  y = *x;

Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.

[!] Note that the stores before the write barrier would normally be expected to
match the loads after the read barrier or the data dependency barrier, and vice
versa:

    CPU 1                               CPU 2
    ===================                 ===================
    ACCESS_ONCE(a) = 1;  }----   --->{  v = ACCESS_ONCE(c);
    ACCESS_ONCE(b) = 2;  }    \ /    {  w = ACCESS_ONCE(d);
    <write barrier>            \        <read barrier>
    ACCESS_ONCE(c) = 3;  }    / \    {  x = ACCESS_ONCE(a);
    ACCESS_ONCE(d) = 4;  }----   --->{  y = ACCESS_ONCE(b);

https://www.kernel.org/doc/Documentation/memory-barriers.txt

The consequence is that we can have a race where thread A wants to unlock the mutex while thread B wants to lock it: thread B wins, thread A's CPU caches the value in memory, thread B updates it without notifying thread A, and thread A polls its own cached copy indefinitely.
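
Spelled out as a timeline (a sketch of the scenario just described, not confirmed kernel behavior):

/*
 * Hypothesized interleaving (illustration only; lock-side loop simplified):
 *
 *   CPU A (unlock side)                     CPU B (lock side, spinning)
 *   -------------------                     ---------------------------
 *   __add(&lock->tickets.head,              while (ACCESS_ONCE(lock->tickets.head)
 *         TICKET_LOCK_INC,                         != inc.tail)
 *         UNLOCK_LOCK_PREFIX);                  cpu_relax();
 *   plain add, no LOCK prefix,
 *   no barrier on either side
 *
 *   The claim: CPU B keeps re-reading a stale cached copy of tickets.head,
 *   never observes CPU A's store, and spins forever.
 */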

The code in question was rewritten from inline assembly into C some time between 2.6.32 and 3.14, so it is possible this is a regression. CentOS 7 enabled CONFIG_PARAVIRT_SPINLOCKS, and other large enterprise distributions likely followed, so they would be unaffected by this.

One user on the Gentoo forums seems to have encountered this, so I will want to follow up with him when I have a chance:

http://forums.gentoo.org/viewtopic-p-7665838.html#7665838

@behlendorf Does this explanation look sane to you?

ryao (Contributor) commented Feb 10, 2015

@behlendorf Just to be clear, @eolson78 informs me that this is reproducible with the Linux lock torture tester on Amazon EC2 with an instance size of 32 cores. Threads will enter an infinite loop on the spinlock and the test never finishes. This seems NUMA-related. He told me that forcing atomics did not fix it, so I asked him to try peppering the Linux spinlock code with barriers.

dweeezil (Contributor):

Poking my nose into this issue for a moment: could the culprit be the changes made by torvalds/linux@96f853eaa? That change appears to have landed first in 3.12.

behlendorf (Contributor):

@ryao Just to be clear, you're saying @eolson78 is able to reproduce a very similar deadlock using the in-kernel lock torture tests (torvalds/linux@0af3fe1e)? If so, is there an existing issue open against the kernel for this?

behlendorf (Contributor):

@ryao I also disagree with marking all of those issues as duplicates of this problem. Certainly some, if not most, of the issues you've linked to this one were caused by #2523. This issue reveals another locking problem, but from what I've seen reported it's far less common in general.

ryao (Contributor) commented Feb 10, 2015

@dweeezil I am not certain. @eolson78 will need to check.

@behlendorf One of @eolson78's clients ran into this issue. He pinged me about it last week. I suggested he apply the patch that resolved #2523, and I provided him with explicit instructions from my notes on how to do it. 24 hours after he applied the patch per my instructions and reloaded the modules, the issue occurred again. I asked him to verify the timestamps on the kernel modules after the procedure to ensure that they had changed, and had him send me his kernel binary.

I disassembled his kernel binary, noticed the assembly didn't quite match what I expected, and suggested that he try having several kernel threads loop trying to lock the same spinlock, with the prediction that whatever caused the deadlock would occur in that test. He had one of his employees do this and the deadlock occurred. I am told that this has been reproduced on Amazon EC2's 32-core VMs with a CentOS 6 userland on both Linux 3.14.4 and Linux 3.18.9. He also verified this on Amazon Linux, which used a newer version of 3.14.y.

As for the issues that I linked, #2523 causes either a segfault or corrupt magic. An infinite loop is possible, but if a loop occurred, then this issue seems more likely. I discussed the test results with others in #gentoo-dev on freenode. It sparked a lengthy discussion on how to program NUMA systems correctly, but none of us came up with a solid explanation for @eolson78's results.

It feels like some sort of CPU erratum, but none of the published errata for the Intel Xeon E5 v2 series appear to correspond to it. However, I do know for certain that Amazon EC2 has outdated Intel microcode.

Also, some of the issues that I linked appear to involve multi-socket systems. I know that you use multi-socket systems at LLNL, so I assume #401 was on one. #2517 explicitly says so. #2891 mentions 20 threads, which is typically beyond what you see with one socket. The others are indeterminate.

ryao (Contributor) commented Feb 11, 2015

@behlendorf @eolson78 gave me an update. It seems that there are bugs in the locktorture test, which made its detection of a problem a false positive. Consequently, there is no way to reproduce this outside of that one workload.

eolson78 (Author):

@behlendorf @dweeezil We are going to follow @ryao's suggestion of changing the kernel config to enable CONFIG_PARAVIRT_SPINLOCKS. We will also apply the patch from #2523. We are going to apply this at the client's site, since that is the only place anyone has been able to reproduce this.

tuxoko (Contributor) commented Feb 11, 2015

I still think this is the same as #2523
@eolson78 Can you verify that the code from openzfs/spl#421 is built into your binary and that you still hit this issue?

@ryao The whole PARAVIRT slow-path thing shouldn't have any significant impact on this. It just allows KVM and Xen to yield a spinning CPU early so the lock owner can do useful work. It may cause some performance issues if you don't turn it on, but it certainly shouldn't cause a 20-second or indefinite spin.

Also, the assembly of the spinlock code shown by @ryao doesn't look any different from the earlier hand-written version, so the C rewrite should be fine as long as the original assembly was correct.

To make matters worse, the arch_spin_unlock uses ACCESS_ONCE without any memory barriers, in violation of the kernel's SMP programming documentation:

You don't need a memory barrier for a single repeated instruction; memory barriers are for guaranteeing the ordering of two instructions. So it should be fine.

Consequently, that function is just the __add call. It looks like this should be atomic when glancing at the C, but there are no atomic instructions in the assembly.

x64 is by itself atomic on simple 64-bit memory reads/writes. So it should be fine.

eolson78 (Author):

@tuxoko I can confirm this was applied, and we experienced this exact issue on 2 separate systems at the same client.


tuxoko (Contributor) commented Feb 11, 2015

@eolson78
If that's the case, then it means that there are some other things causing the dbuf corruption.
I'll try to investigate this.

eolson78 (Author):

@tuxoko please let me know what other information I can provide to help your investigation


tuxoko (Contributor) commented Feb 11, 2015

I did a quick review of the dbuf free path. It's rather convoluted, but it doesn't seem wrong. That said, there could be something too subtle to notice.

@eolson78 Did you rebuild ZFS after updating to openzfs/spl#421? If you only rebuilt SPL but didn't rebuild ZFS, the code won't be built into the binary. Also, are you loading the modules from an initrd? If so, were the new modules installed properly into the initrd?

@behlendorf This issue looks too similar to the zio->io_lock one. Was there any reproducible test case to confirm that openzfs/spl#421 solved the io_lock issue?
#2704, #2891, #2897, and the lower half of #2517 all seem to be related to dbuf->db_mtx. Did any of them confirm the fix?

behlendorf (Contributor):

@tuxoko convoluted is a good way to put it. And I agree this looks very similar to the zio->io_lock issue. That isn't too surprising to me because the dbuf code does suffer from the same kind of locking assumptions which were at the root of the zio->io_lock problem. One of the reasons I really liked your fix was that it resolved a whole class of inexplicable lockups including this kind of dbuf issue.

To my knowledge there was never a reliable reproducer for this issue. However, after getting the fix merged and rolled out the number of reports I saw of 'hangs' and other similar deadlocks dropped dramatically. I can verify that it definitely resolved #2850 and it's been commented in #2891 that the issue hasn't been observed since.

That said, while very similar to openzfs/spl#421, this might be an entirely different bug. If we've confirmed that the fix was applied and we're still able to reproduce a problem, we should absolutely pursue this.

One thing that did catch my eye in the stacks above was that Xen is being used. For some reason, which has never been fully explained, Xen and ZFS have historically run into issues like this when running on the same system.

eolson78 (Author):

@tuxoko Yes, we rebuilt ZFS after the update to SPL, and yes, the modules are loaded correctly. I followed the exact procedure from @ryao. @behlendorf This is running on an AWS EC2 instance, hence the Xen.

ryao (Contributor) commented Feb 11, 2015

@behlendorf I recall issues with ZoL in Xen Dom0 that I thought were from the ARC calculations being thrown off by decreases in system memory whenever a VM started, but I am not aware of any in DomU. Do you recall any that were specific to DomU operation?

behlendorf (Contributor):

@ryao The vast majority of issues were definitely related to Dom0 and Xen/ZFS fighting over memory. As for DomU issues, you're right: the only ones I'm aware of are related to performance.

tuxoko (Contributor) commented Feb 12, 2015

@eolson78 Could you build a kernel with CONFIG_DEBUG_PAGEALLOC=y and try it with the following patch?
https://gist.github.com/tuxoko/132647ede8572651909b

eolson78 (Author):

This issue was actually solved with openzfs/spl#421. Closing out.
