
[master] panic during zfs send #197

Closed
rottegift opened this issue Jun 17, 2014 · 28 comments

@rottegift
Contributor

[master today]

It was doing a "zfs send -I dataset@somesnap dataset@someothersnap" at the time.

I seem to be able to reproduce this but haven't nailed down whether it's one precise dataset or snapshot. It happens during a job that does incremental sends of quite a few datasets on one pool.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 10:22:57 2014
panic(cpu 0 caller 0xffffff802f6dbf5e): Kernel trap at 0xffffff7fafd3e634, type 13=general protection, registers:
CR0: 0x0000000080010033, CR2: 0x00007fa6cd36f000, CR3: 0x0000000189097140, CR4: 0x00000000001606e0
RAX: 0x74697265686e6924, RBX: 0xffffff8235e43d58, RCX: 0xffffff7fafdc38d0, RDX: 0x0000000000000001
RSP: 0xffffff8235e436b0, RBP: 0xffffff8235e436e0, RSI: 0xffffff8235e43878, RDI: 0xffffff8251647898
R8:  0x0000000000000074, R9:  0x0000000000000220, R10: 0xffffff8235e43878, R11: 0xffffff8235e43500
R12: 0x00000000120003e8, R13: 0xffffff8092b94b40, R14: 0x00000000120003e8, R15: 0xffffff805986cc88
RFL: 0x0000000000010206, RIP: 0xffffff7fafd3e634, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007fa6cd36f000, Error code: 0x0000000000000000, Fault CPU: 0x0

Backtrace (CPU 0), Frame : Return Address
0xffffff81efbb5c50 : 0xffffff802f622fa9 mach_kernel : _panic + 0xc9
0xffffff81efbb5cd0 : 0xffffff802f6dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff81efbb5ea0 : 0xffffff802f6f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff81efbb5ec0 : 0xffffff7fafd3e634 net.lundman.zfs : _nvlist_exists + 0x54
0xffffff8235e436e0 : 0xffffff7fafc7d6db net.lundman.zfs : _dsl_prop_get_all_impl + 0x36b
0xffffff8235e439e0 : 0xffffff7fafc7cf9c net.lundman.zfs : _dsl_prop_get_all_ds + 0x19c
0xffffff8235e43b40 : 0xffffff7fafc7cdf5 net.lundman.zfs : _dsl_prop_get_all + 0x25
0xffffff8235e43b60 : 0xffffff7fafcf430d net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x4d
0xffffff8235e43ba0 : 0xffffff7fafcf113b net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ab
0xffffff8235e43c20 : 0xffffff7fafcecc64 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff8235e43cf0 : 0xffffff802f80d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff8235e43d40 : 0xffffff802f7fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff8235e43dc0 : 0xffffff802f7f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff8235e43e10 : 0xffffff802f9c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff8235e43e40 : 0xffffff802f9f2c66 mach_kernel : _ioctl + 0x466
0xffffff8235e43f50 : 0xffffff802fa40653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff8235e43fb0 : 0xffffff802f6f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[10E9E953-EA4B-3364-9581-F5E75E2DF966]@0xffffff7fafc2b000->0xffffff7fafe37fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fafbec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7fafc1a000
@rottegift
Contributor Author

It appears to happen when a "zfs send -I" and this command overlap:

zfs list -H -r -t snap -o name,userrefs anypool | grep -v '0$'  | awk '{ print $1 }' | xargs -n 10 zfs holds -H
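For reference, the grep/awk stage of that pipeline just keeps the snapshots whose userrefs column is non-zero. A minimal sketch with synthetic input (the printf stands in for real `zfs list -H -r -t snap -o name,userrefs` output; the pool/snapshot names are made up):

```shell
# Synthetic two-column input: snapshot name, userrefs count.
# grep -v '0$' drops lines ending in 0; awk keeps only the name.
printf 'pool/a@s1\t0\npool/a@s2\t1\npool/b@s1\t2\n' \
  | grep -v '0$' \
  | awk '{ print $1 }'
# prints:
# pool/a@s2
# pool/b@s1
```

Note that `grep -v '0$'` also discards any count that merely ends in 0 (e.g. 10 or 20); `awk '$2 != 0 { print $1 }'` would filter on the exact value.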
Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 10:49:11 2014
panic(cpu 4 caller 0xffffff80076dbf5e): Kernel trap at 0xffffff7f87d3b800, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x0000000000000000, CR3: 0x0000000087d81029, CR4: 0x00000000001606e0
RAX: 0x0000000000000010, RBX: 0xffffff8218d6bd58, RCX: 0x0000000000000020, RDX: 0xffffff822658adf0
RSP: 0xffffff8218d6b7f0, RBP: 0xffffff8218d6b820, RSI: 0x0000000000000000, RDI: 0xffffff8223040bb0
R8:  0xffffff822658adf0, R9:  0x0000000000000003, R10: 0xffffff8218d6b8a0, R11: 0xffffff8218d6b900
R12: 0x00000000120004a5, R13: 0xffffff803f9ac780, R14: 0x00000000120004a5, R15: 0xffffff802bc6f098
RFL: 0x0000000000010286, RIP: 0xffffff7f87d3b800, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000000, Error code: 0x0000000000000000, Fault CPU: 0x4

Backtrace (CPU 4), Frame : Return Address
0xffffff8218d6b480 : 0xffffff8007622fa9 mach_kernel : _panic + 0xc9
0xffffff8218d6b500 : 0xffffff80076dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff8218d6b6d0 : 0xffffff80076f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff8218d6b6f0 : 0xffffff7f87d3b800 net.lundman.zfs : _nv_mem_zalloc + 0x20
0xffffff8218d6b820 : 0xffffff7f87d3fc94 net.lundman.zfs : _nvp_buf_alloc + 0x44
0xffffff8218d6b870 : 0xffffff7f87d3c2b2 net.lundman.zfs : _nvlist_add_common + 0x202
0xffffff8218d6b950 : 0xffffff7f87d3bcfd net.lundman.zfs : _nvlist_copy_pairs + 0xad
0xffffff8218d6b9a0 : 0xffffff7f87d4235e net.lundman.zfs : _nvlist_copy_embedded + 0x5e
0xffffff8218d6b9e0 : 0xffffff7f87d3c4aa net.lundman.zfs : _nvlist_add_common + 0x3fa
0xffffff8218d6bac0 : 0xffffff7f87d3cc79 net.lundman.zfs : _nvlist_add_nvlist + 0x39
0xffffff8218d6baf0 : 0xffffff7f87c7d20c net.lundman.zfs : _dsl_prop_nvlist_add_uint64 + 0x16c
0xffffff8218d6bb30 : 0xffffff7f87c4f8ee net.lundman.zfs : _dmu_objset_stats + 0x4e
0xffffff8218d6bb60 : 0xffffff7f87cf4328 net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x68
0xffffff8218d6bba0 : 0xffffff7f87cf113b net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ab
0xffffff8218d6bc20 : 0xffffff7f87cecc64 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff8218d6bcf0 : 0xffffff800780d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff8218d6bd40 : 0xffffff80077fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff8218d6bdc0 : 0xffffff80077f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff8218d6be10 : 0xffffff80079c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff8218d6be40 : 0xffffff80079f2c66 mach_kernel : _ioctl + 0x466
0xffffff8218d6bf50 : 0xffffff8007a40653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff8218d6bfb0 : 0xffffff80076f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[10E9E953-EA4B-3364-9581-F5E75E2DF966]@0xffffff7f87c2b000->0xffffff7f87e37fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f87bec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7f87c1a000

@rottegift
Contributor Author

My guess is that, because the zfs list takes a long time (there are many datasets and snapshots), it is still running when the automatic holds are removed at the end of the zfs send -I.

@rottegift
Contributor Author

During a "zfs send -I anypool/data/set@foo anypool/data/set@bar", I ran

while true; do zfs list -H -r -t snap -o name,userrefs anypool/data/set | grep -v '0$'  | awk '{ print $1 }' | xargs -n 10 zfs holds -H; done

and got this (different) panic

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 11:04:12 2014
panic(cpu 2 caller 0xffffff80100dbf5e): Kernel trap at 0xffffff7f90728314, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0xfffffffffffffff0, CR3: 0x00000000125d0000, CR4: 0x00000000001606e0
RAX: 0xfffffffffffffff0, RBX: 0xffffff81ffac67a8, RCX: 0xffffff8218506d50, RDX: 0xfffffffffffffff0
RSP: 0xffffff81fb63bae0, RBP: 0xffffff81fb63bb10, RSI: 0xffffff82fa1ce808, RDI: 0xffffff8218506e50
R8:  0x0000000000000000, R9:  0xffffff8010601910, R10: 0x00000000000003ff, R11: 0xffffffffffffffff
R12: 0xffffff81ffac67e8, R13: 0x00000000001e3355, R14: 0xffffff81ffac67c8, R15: 0xffffff821845bb08
RFL: 0x0000000000010286, RIP: 0xffffff7f90728314, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0xfffffffffffffff0, Error code: 0x0000000000000000, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff81fb63b770 : 0xffffff8010022fa9 mach_kernel : _panic + 0xc9
0xffffff81fb63b7f0 : 0xffffff80100dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff81fb63b9c0 : 0xffffff80100f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff81fb63b9e0 : 0xffffff7f90728314 net.lundman.zfs : _zio_walk_parents + 0x94
0xffffff81fb63bb10 : 0xffffff7f9072ef93 net.lundman.zfs : _zio_done + 0x1073
0xffffff81fb63bc30 : 0xffffff7f9072a4aa net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81fb63bc70 : 0xffffff7f9072f0f2 net.lundman.zfs : _zio_done + 0x11d2
0xffffff81fb63bd90 : 0xffffff7f9072a4aa net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81fb63bdd0 : 0xffffff7f9072f0f2 net.lundman.zfs : _zio_done + 0x11d2
0xffffff81fb63bef0 : 0xffffff7f9072a4aa net.lundman.zfs : ___zio_execute + 0x12a
0xffffff81fb63bf30 : 0xffffff7f9072a375 net.lundman.zfs : _zio_execute + 0x15
0xffffff81fb63bf50 : 0xffffff7f9061e217 net.lundman.spl : _taskq_thread + 0xc7
0xffffff81fb63bfb0 : 0xffffff80100d7127 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7f9061a000->0xffffff7f9062afff
         net.lundman.zfs(1.0)[10E9E953-EA4B-3364-9581-F5E75E2DF966]@0xffffff7f9062b000->0xffffff7f90837fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f905ec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7f9061a000

@rottegift
Contributor Author

It looks like this with the latest vdev-iokit.

script holds-ssd

#!/bin/bash

zfs list -H -r -t snap -o name,userrefs ssdpool | grep -v '0$'  | awk '{ print $1 }' | xargs -n 10 zfs holds -H

In a terminal window, run

while true; do holds-ssd; done

in another terminal window

zfs send -v -I ssdpool/dataset@older ssdpool/dataset@latest > /dev/null

BOOM.

An excellent trick on any system that does automatically scheduled backups using zfs send -I.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 22:48:09 2014
panic(cpu 2 caller 0xffffff80292dbf5e): Kernel trap at 0xffffff7fa993dfc0, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x0000000000000000, CR3: 0x000000011098c057, CR4: 0x00000000001606e0
RAX: 0x0000000000000010, RBX: 0xffffff823f683d58, RCX: 0x0000000000000030, RDX: 0xffffff823f007d40
RSP: 0xffffff823f6833e0, RBP: 0xffffff823f683410, RSI: 0x0000000000000000, RDI: 0xffffff8218670190
R8:  0xffffff823f007d40, R9:  0x0000000000000003, R10: 0x0000000000000000, R11: 0xffffff823f683500
R12: 0x0000000012000e82, R13: 0xffffff805e7180f0, R14: 0x0000000012000e82, R15: 0xffffff8050b33bf0
RFL: 0x0000000000010282, RIP: 0xffffff7fa993dfc0, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000000, Error code: 0x0000000000000000, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff823f683070 : 0xffffff8029222fa9 mach_kernel : _panic + 0xc9
0xffffff823f6830f0 : 0xffffff80292dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff823f6832c0 : 0xffffff80292f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff823f6832e0 : 0xffffff7fa993dfc0 net.lundman.zfs : _nv_mem_zalloc + 0x20
0xffffff823f683410 : 0xffffff7fa9942454 net.lundman.zfs : _nvp_buf_alloc + 0x44
0xffffff823f683460 : 0xffffff7fa993ea72 net.lundman.zfs : _nvlist_add_common + 0x202
0xffffff823f683540 : 0xffffff7fa993e4bd net.lundman.zfs : _nvlist_copy_pairs + 0xad
0xffffff823f683590 : 0xffffff7fa9944b1e net.lundman.zfs : _nvlist_copy_embedded + 0x5e
0xffffff823f6835d0 : 0xffffff7fa993ec6a net.lundman.zfs : _nvlist_add_common + 0x3fa
0xffffff823f6836b0 : 0xffffff7fa993f439 net.lundman.zfs : _nvlist_add_nvlist + 0x39
0xffffff823f6836e0 : 0xffffff7fa987d2bc net.lundman.zfs : _dsl_prop_get_all_impl + 0x51c
0xffffff823f6839e0 : 0xffffff7fa987c9cc net.lundman.zfs : _dsl_prop_get_all_ds + 0x19c
0xffffff823f683b40 : 0xffffff7fa987c825 net.lundman.zfs : _dsl_prop_get_all + 0x25
0xffffff823f683b60 : 0xffffff7fa98f676d net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x4d
0xffffff823f683ba0 : 0xffffff7fa98f357b net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ab
0xffffff823f683c20 : 0xffffff7fa98ef0a4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff823f683cf0 : 0xffffff802940d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff823f683d40 : 0xffffff80293fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff823f683dc0 : 0xffffff80293f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff823f683e10 : 0xffffff80295c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff823f683e40 : 0xffffff80295f2c66 mach_kernel : _ioctl + 0x466
0xffffff823f683f50 : 0xffffff8029640653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff823f683fb0 : 0xffffff80292f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[0C7B3320-A055-3936-B86A-8C6E3EC2CB6D]@0xffffff7fa982b000->0xffffff7fa9a3cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fa97ec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7fa981a000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

@rottegift
Contributor Author

(note to self for later: can this be reproduced in light of the master/spl changes in issue #201?)

(will try in some hours...)

@rottegift
Contributor Author

Nope. This panic happened during "zfs list -r -t snap ssdpool/DATA/opt" (macports /opt) while the looping holds-ssd script ( #197 (comment) ) was running.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Fri Jul  4 15:05:27 2014
panic(cpu 0 caller 0xffffff8023adc24e): Kernel trap at 0xffffff7fa40b5b0c, type 13=general protection, registers:
CR0: 0x0000000080010033, CR2: 0x00007f8e01f09000, CR3: 0x000000005845602e, CR4: 0x00000000001606e0
RAX: 0xffffff7fa40b5ae0, RBX: 0xffffff823029bd58, RCX: 0x006c6f6f70647373, RDX: 0x0000000000000000
RSP: 0xffffff823029ba48, RBP: 0xffffff823029ba70, RSI: 0x006cdc84eaa8c0f9, RDI: 0x006c6f6f70647373
R8:  0xffffff7fa4066410, R9:  0x0000000000000000, R10: 0xffffff823029b990, R11: 0xffffff823029ba00
R12: 0x0000000012000669, R13: 0xffffff805b7b7000, R14: 0x0000000012000669, R15: 0xffffff804cd94e90
RFL: 0x0000000000010296, RIP: 0xffffff7fa40b5b0c, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007f8e01f09000, Error code: 0x0000000000000000, Fault CPU: 0x0

Backtrace (CPU 0), Frame : Return Address
0xffffff81e3e67c50 : 0xffffff8023a22f79 mach_kernel : _panic + 0xc9
0xffffff81e3e67cd0 : 0xffffff8023adc24e mach_kernel : _kernel_trap + 0x7fe
0xffffff81e3e67ea0 : 0xffffff8023af3746 mach_kernel : _return_from_trap + 0xe6
0xffffff81e3e67ec0 : 0xffffff7fa40b5b0c net.lundman.zfs : _unique_compare + 0x2c
0xffffff823029ba70 : 0xffffff7fa41385e9 net.lundman.zfs : _avl_find + 0x69
0xffffff823029bac0 : 0xffffff7fa40b5c44 net.lundman.zfs : _unique_insert + 0x94
0xffffff823029baf0 : 0xffffff7fa40663a8 net.lundman.zfs : _dsl_dataset_hold_obj + 0x5a8
0xffffff823029bba0 : 0xffffff7fa40f0c41 net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x171
0xffffff823029bc20 : 0xffffff7fa40ec7a4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff823029bcf0 : 0xffffff8023c0d92f mach_kernel : _spec_ioctl + 0x11f
0xffffff823029bd40 : 0xffffff8023bfe2f0 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff823029bdc0 : 0xffffff8023bf4141 mach_kernel : _utf8_normalizestr + 0x971
0xffffff823029be10 : 0xffffff8023dc16a3 mach_kernel : _fo_ioctl + 0x43
0xffffff823029be40 : 0xffffff8023df3026 mach_kernel : _ioctl + 0x466
0xffffff823029bf50 : 0xffffff8023e40a33 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff823029bfb0 : 0xffffff8023af3f46 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[0CE90CF6-C355-3255-A550-3E2FA41D5811]@0xffffff7fa402b000->0xffffff7fa4238fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fa3fec000
            dependency: net.lundman.spl(1.0.0)[1185F90C-FE6B-34EB-B385-C360EB991B37]@0xffffff7fa401a000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

@lundman
Contributor

lundman commented Jul 7, 2014

Since it appears to be random memory corruption, it's unlikely you can trigger the same panic again, but in case you can, I added some guard words around the unique struct and changed SPL to dirty memory allocations (those that aren't zalloc). But really, we probably need lldb connected and a backtrace to get a feel for how it dies.

@lundman
Contributor

lundman commented Jul 7, 2014

@rottegift
Contributor Author

Can you not reproduce this? It isn't pool, pool geometry, or system dependent as far as I can tell, and it's easy for me to reproduce. (I'm about to try again with the issue197 zfs and issue201 spl.)

@rottegift
Contributor Author

Got a hopefully more informative panic with the issue197 zfs and issue201 spl branches, following the method in issuecomment-46371332 (#197 (comment)).

Terminal 1:

# while true; do ~/bin/holds-ssd; done

Terminal 2:

# zfs send -v -I ssdpool/DATA/opt@2014-07-01-072915 ssdpool/DATA/opt@2014-07-07-070053 > /dev/null

Result, panic:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Mon Jul  7 14:30:42 2014
panic(cpu 0 caller 0xffffff7f8605a8a9): "VERIFY3(" "nvlist_pack(nvl, &packed, sizep, 0, 0x0000)" " " "==" " " "0" ") " "failed (" "14" " " "==" " " "0" ")\n"@../nvpair/fnvpair.c:81
Backtrace (CPU 0), Frame : Return Address
0xffffff81ef9cba60 : 0xffffff8004222f79 mach_kernel : _panic + 0xc9
0xffffff81ef9cbae0 : 0xffffff7f8605a8a9 net.lundman.zfs : _fnvlist_pack + 0x79
0xffffff81ef9cbb10 : 0xffffff7f8600dbda net.lundman.zfs : _put_nvlist + 0x5a
0xffffff81ef9cbb50 : 0xffffff7f86014e21 net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x111
0xffffff81ef9cbb90 : 0xffffff7f8601165d net.lundman.zfs : _zfs_ioc_objset_stats + 0x3d
0xffffff81ef9cbbc0 : 0xffffff7f8601199e net.lundman.zfs : _zfs_ioc_dataset_list_next + 0x19e
0xffffff81ef9cbc20 : 0xffffff7f8600d6b4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff81ef9cbcf0 : 0xffffff800440d92f mach_kernel : _spec_ioctl + 0x11f
0xffffff81ef9cbd40 : 0xffffff80043fe2f0 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff81ef9cbdc0 : 0xffffff80043f4141 mach_kernel : _utf8_normalizestr + 0x971
0xffffff81ef9cbe10 : 0xffffff80045c16a3 mach_kernel : _fo_ioctl + 0x43
0xffffff81ef9cbe40 : 0xffffff80045f3026 mach_kernel : _ioctl + 0x466
0xffffff81ef9cbf50 : 0xffffff8004640a33 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff81ef9cbfb0 : 0xffffff80042f3f46 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[8A62A238-4390-3D97-999B-62F7D2F0FF77]@0xffffff7f85f4c000->0xffffff7f86158fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f847f3000
            dependency: net.lundman.spl(1.0.0)[872B88E3-C60B-3AD3-920D-9CD93E7AD5E3]@0xffffff7f85f3b000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

@lundman
Contributor

lundman commented Jul 8, 2014

I have run ssdholds in a while loop, as well as zfs send, all morning; alas, it does not trigger any panics.

My test pool may be too small though.

send from @send to BOOM@two estimated size is 1.47G
send from @two to BOOM@middle estimated size is 2.76G
send from @middle to BOOM@end estimated size is 897M
total estimated size is 5.10G
TIME        SENT   SNAPSHOT
10:13:25   84.3M   BOOM@two

BOOM@send       .send-97988-1   Tue Jul  8 10:13 2014
BOOM@two        .send-97988-1   Tue Jul  8 10:13 2014
BOOM@middle     .send-97988-1   Tue Jul  8 10:13 2014
BOOM@end        .send-97988-1   Tue Jul  8 10:13 2014

@rottegift
Contributor Author

The snapshots that are involved are much smaller (and more numerous) so the race (assuming there is one) is probably when adding or removing the temporary holds.

Below is an example of one dataset that will trigger the panic reliably.

This issue originated when I started noticing problems with a job that does incremental sends across 52 datasets; running that incremental job while the while loop is going also causes a panic. Most of the datasets change only irregularly, so the deltas from one snapshot to the next are about zero.

ssdpool/DATA/opt                    16.2G  58.8G  14.9G  /opt
ssdpool/DATA/opt@2014-07-01-182159   223M      -  14.8G  /opt/.zfs/snapshot/2014-07-01-182159
ssdpool/DATA/opt@2014-07-01-213050  5.09M      -  14.8G  /opt/.zfs/snapshot/2014-07-01-213050
ssdpool/DATA/opt@2014-07-02-024152   496K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-024152
ssdpool/DATA/opt@2014-07-02-102357   420K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-102357
ssdpool/DATA/opt@2014-07-02-105453   228K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-105453
ssdpool/DATA/opt@2014-07-02-145453   228K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-145453
ssdpool/DATA/opt@2014-07-02-161055   328K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-161055
ssdpool/DATA/opt@2014-07-02-171755   428K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-171755
ssdpool/DATA/opt@2014-07-02-181152   484K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-181152
ssdpool/DATA/opt@2014-07-02-212005   496K      -  14.8G  /opt/.zfs/snapshot/2014-07-02-212005
ssdpool/DATA/opt@2014-07-03-002156   472K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-002156
ssdpool/DATA/opt@2014-07-03-011251   492K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-011251
ssdpool/DATA/opt@2014-07-03-020246   524K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-020246
ssdpool/DATA/opt@2014-07-03-060216   196K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-060216
ssdpool/DATA/opt@2014-07-03-100215   200K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-100215
ssdpool/DATA/opt@2014-07-03-114951   452K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-114951
ssdpool/DATA/opt@2014-07-03-133453   488K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-133453
ssdpool/DATA/opt@2014-07-03-143350   432K      -  14.8G  /opt/.zfs/snapshot/2014-07-03-143350
ssdpool/DATA/opt@2014-07-03-183350   224M      -  14.9G  /opt/.zfs/snapshot/2014-07-03-183350
ssdpool/DATA/opt@2014-07-04-013105   236K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-013105
ssdpool/DATA/opt@2014-07-04-023349   236K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-023349
ssdpool/DATA/opt@2014-07-04-063348   252K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-063348
ssdpool/DATA/opt@2014-07-04-103348   224K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-103348
ssdpool/DATA/opt@2014-07-04-153519   516K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-153519
ssdpool/DATA/opt@2014-07-04-171153   296K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-171153
ssdpool/DATA/opt@2014-07-04-211152   284K      -  14.9G  /opt/.zfs/snapshot/2014-07-04-211152
ssdpool/DATA/opt@2014-07-05-033908   484K      -  14.9G  /opt/.zfs/snapshot/2014-07-05-033908
ssdpool/DATA/opt@2014-07-05-150006   588K      -  14.9G  /opt/.zfs/snapshot/2014-07-05-150006
ssdpool/DATA/opt@2014-07-05-162151   312K      -  14.9G  /opt/.zfs/snapshot/2014-07-05-162151
ssdpool/DATA/opt@2014-07-05-202150   268K      -  14.9G  /opt/.zfs/snapshot/2014-07-05-202150
ssdpool/DATA/opt@2014-07-06-002150   268K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-002150
ssdpool/DATA/opt@2014-07-06-042150   232K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-042150
ssdpool/DATA/opt@2014-07-06-082149   252K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-082149
ssdpool/DATA/opt@2014-07-06-122149   276K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-122149
ssdpool/DATA/opt@2014-07-06-150352   604K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-150352
ssdpool/DATA/opt@2014-07-06-155450  8.05M      -  14.9G  /opt/.zfs/snapshot/2014-07-06-155450
ssdpool/DATA/opt@2014-07-06-190351   220K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-190351
ssdpool/DATA/opt@2014-07-06-230351   252K      -  14.9G  /opt/.zfs/snapshot/2014-07-06-230351
ssdpool/DATA/opt@2014-07-07-030054   268K      -  14.9G  /opt/.zfs/snapshot/2014-07-07-030054
ssdpool/DATA/opt@2014-07-07-070053   284K      -  14.9G  /opt/.zfs/snapshot/2014-07-07-070053
ssdpool/DATA/opt@2014-07-07-151451   612K      -  14.9G  /opt/.zfs/snapshot/2014-07-07-151451
ssdpool/DATA/opt@2014-07-07-171204   416K      -  14.9G  /opt/.zfs/snapshot/2014-07-07-171204
ssdpool/DATA/opt@2014-07-07-183953   260K      -  14.9G  /opt/.zfs/snapshot/2014-07-07-183953
ssdpool/DATA/opt@2014-07-07-223952   264K      -  14.9G  /opt/.zfs/snapshot/2014-07-07-223952

@evansus
Contributor

evansus commented Jul 8, 2014

@rottegift I wonder if this would happen with a slightly different set of scripts:

holds-ssd in a while loop to repeatedly get the zfs hold list, and rather than zfs send, a while loop that does zfs hold and zfs release repeatedly.

It may be easier to reproduce, and may narrow down whether the issue relates only to hold/release or to the send itself.

@evansus
Contributor

evansus commented Jul 8, 2014

I remember seeing the panic log some time ago and wondering if you were running zpool iostat or zfs list in a second terminal - and actually it's this while loop with zfs holds -H.

On another note, any reason not to use sleep 1 between passes? May work around the panic issue and also use substantially less CPU.
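The throttled variant of the polling loop might look like the sketch below. It is bounded to three passes so it terminates, with `echo` standing in for the holds-ssd script from earlier in the thread; the real loop would be `while true; do holds-ssd; sleep 1; done`:

```shell
# Bounded sketch of the suggested throttled loop: three passes,
# one second apart, with echo standing in for holds-ssd.
for pass in 1 2 3; do
  echo "holds pass $pass"
  sleep 1
done
```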

@rottegift
Contributor Author

Yep. That panicked. Details in a moment.


@evansus
Contributor

evansus commented Jul 8, 2014

Plus you could try

#!/bin/bash

zfs list -H -r -t snap -o name,userrefs ssdpool | grep -v '0$'  | awk '{ print $1 }'

(just removing the holds step)

to see if zfs list -Hrt snap -o name,userrefs or zfs holds -H is the culprit.

I suspect the issue is in zfs holds rather than zfs list, since zfs holds is not used nearly as often.

@rottegift
Contributor Author

A long moment: the panic left the boot volume's JHFS+ filesystem corrupted.

The scripts aren't "real" - they were an attempt to get a reduced test case.

Roughly: zfs snapshot -r ssdpool@a, then a loop doing zfs hold -r testhold ssdpool@a ; zfs release -r testhold ssdpool@a made three or four iterations, then kaboom.


@lundman
Contributor

lundman commented Jul 8, 2014

OK, with send traversing over 50 snapshots, it hung. But no lldb attach; it probably trashed the stack. At least I have something to work on :)

@rottegift
Contributor Author

Yeah, my panic left no stack trace at all. Glad you can reproduce it. :-)

@rottegift
Contributor Author

@evansus With the zfs holds removed from the script:

#!/bin/bash

while true
do
      zfs list -H -r -t snap -o name,userrefs ssdpool | grep -v '0$'  | awk '{ print $1 }' 
done

then the repeated hold/release loop

#!/bin/sh

zfs snapshot ssdpool@holdstest
while true
do 
    zfs hold -r testhold ssdpool@holdstest
    zfs release -r testhold ssdpool@holdstest
done

then the zfs list threw an ASSERT which was lost in the immediately subsequent panic:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jul  8 03:12:57 2014
panic(cpu 4 caller 0xffffff7f8756d84f): "VERIFY3(" "0" " " "==" " " "dsl_prop_unregister(ds, zfs_prop_to_name(ZFS_PROP_CHECKSUM), checksum_changed_cb, os)" ") " "failed (" "0" " " "==" " " "91" ")\n"@dmu_objset.c:610
Backtrace (CPU 4), Frame : Return Address
0xffffff81fd763930 : 0xffffff8005822f79 mach_kernel : _panic + 0xc9
0xffffff81fd7639b0 : 0xffffff7f8756d84f net.lundman.zfs : _dmu_objset_evict + 0xef
0xffffff81fd763a60 : 0xffffff7f8758718a net.lundman.zfs : _dsl_dataset_evict + 0x4a
0xffffff81fd763a90 : 0xffffff7f8755df70 net.lundman.zfs : _dbuf_evict_user + 0x80
0xffffff81fd763ab0 : 0xffffff7f8755f606 net.lundman.zfs : _dbuf_rele_and_unlock + 0xf6
0xffffff81fd763af0 : 0xffffff7f8755c780 net.lundman.zfs : _dbuf_rele + 0x30
0xffffff81fd763b10 : 0xffffff7f8756ab68 net.lundman.zfs : _dmu_buf_rele + 0x28
0xffffff81fd763b40 : 0xffffff7f875872a1 net.lundman.zfs : _dsl_dataset_rele + 0x21
0xffffff81fd763b60 : 0xffffff7f8756d38d net.lundman.zfs : _dmu_objset_rele + 0x2d
0xffffff81fd763b90 : 0xffffff7f87611670 net.lundman.zfs : _zfs_ioc_objset_stats + 0x50
0xffffff81fd763bc0 : 0xffffff7f8761199e net.lundman.zfs : _zfs_ioc_dataset_list_next + 0x19e
0xffffff81fd763c20 : 0xffffff7f8760d6b4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff81fd763cf0 : 0xffffff8005a0d92f mach_kernel : _spec_ioctl + 0x11f
0xffffff81fd763d40 : 0xffffff80059fe2f0 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff81fd763dc0 : 0xffffff80059f4141 mach_kernel : _utf8_normalizestr + 0x971
0xffffff81fd763e10 : 0xffffff8005bc16a3 mach_kernel : _fo_ioctl + 0x43
0xffffff81fd763e40 : 0xffffff8005bf3026 mach_kernel : _ioctl + 0x466
0xffffff81fd763f50 : 0xffffff8005c40a33 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff81fd763fb0 : 0xffffff80058f3f46 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[8A62A238-4390-3D97-999B-62F7D2F0FF77]@0xffffff7f8754c000->0xffffff7f87758fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f85df3000
            dependency: net.lundman.spl(1.0.0)[872B88E3-C60B-3AD3-920D-9CD93E7AD5E3]@0xffffff7f8753b000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

@lundman
Contributor

lundman commented Jul 9, 2014

OK, could you try out openzfsonosx/spl@220a3ec78cc5 and see if it fixes the zfs send problem? Since it's a bug with the SPL allocator, it may fix a whole bunch of other things too. Please let me know.

@rottegift
Contributor Author

Should I try out the mem-fix branch, or just the openzfsonosx/spl@220a3ec changes on the issue201 branch?

@ilovezfs
Contributor

ilovezfs commented Jul 9, 2014

We determined it didn't fix anything. So no need.

@lundman
Contributor

lundman commented Jul 9, 2014

Just check out the mem-fix branch on spl, run whatever zfs branch you want, and see. I added 2 commits after my last (faulty) message.

@rottegift
Contributor Author

Ok. crosses fingers :)

@rottegift
Contributor Author

It's surviving serious I/O load so far.

@ilovezfs
Contributor

@rottegift So is it still all OK?

@rottegift
Contributor Author

Yes, it's been up and fine for 26 hours under substantial workload and I can no longer repeat the panic with the holds looping, so closing.
