
ZFS 0.6.5.3, NULL pointer dereference in dmu_objset_sync #4116

Closed
arturpzol opened this issue Dec 17, 2015 · 8 comments

@arturpzol

Hello,

I am using ZFS 0.6.5.3 with kernel 3.10. The pool structure was:

zpool status -D

  pool: Pool-0
 state: ONLINE
  scan: none requested
config:

        NAME                                                          STATE     READ WRITE CKSUM
        Pool-0                                                        ONLINE       0     0     0
          scsi-SVMware_Virtual_disk_6000c29c5f3a3b522d4379a08376d18a  ONLINE       0     0     0
errors: No known data errors

 dedup: no DDT entries

zfs list

NAME              USED  AVAIL  REFER  MOUNTPOINT
Pool-0            499G   462G   616K  /Pools/Pool-0
Pool-0/sys       4.13G   466G   132M  -
Pool-0/zvoldest   495G   492G  29.0G  -

znapzend was also configured in that environment, making a backup to zvoldest every minute.

When I tried to resize zvoldest using zfs set refreservation (which hung), I got a call trace with a NULL pointer dereference:

Dec  8 09:50:15 [kern.alert] [ 2855.197660] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
Dec  8 09:50:15 [kern.alert] [ 2855.197668] IP: [<ffffffffa024a9aa>] dmu_objset_sync+0x1a/0x310 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.197715] PGD 4277a5067 PUD 429065067 PMD 0 
Dec  8 09:50:15 [kern.warning] [ 2855.197720] Oops: 0002 [#1] SMP 
Dec  8 09:50:15 [kern.warning] [ 2855.197723] Modules linked in: sch_htb iscsi_scst(O) scst_vdisk(O) scst(O) mperf zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) crc32c_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vmxnet3(O) processor ac button nls_iso8859_1 nls_cp437 sg pvscsi(O) pata_acpi ata_generic virtio_scsi virtio_pci virtio virtio_ring vfat fat aufs
Dec  8 09:50:15 [kern.warning] [ 2855.197755] CPU: 3 PID: 18472 Comm: txg_sync Tainted: P           O 3.10.92-oe64-ge331686 #15
Dec  8 09:50:15 [kern.warning] [ 2855.197759] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014
Dec  8 09:50:15 [kern.warning] [ 2855.197762] task: ffff8804018da760 ti: ffff880401556000 task.ti: ffff880401556000
Dec  8 09:50:15 [kern.warning] [ 2855.197765] RIP: 0010:[<ffffffffa024a9aa>]  [<ffffffffa024a9aa>] dmu_objset_sync+0x1a/0x310 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.197797] RSP: 0018:ffff880401557be0  EFLAGS: 00010282
Dec  8 09:50:15 [kern.warning] [ 2855.197800] RAX: ffff880418663800 RBX: ffff880401314180 RCX: ffff8801eca1a988
Dec  8 09:50:15 [kern.warning] [ 2855.197802] RDX: ffff880401314180 RSI: ffff8803bbd15e30 RDI: 0000000000000000
Dec  8 09:50:15 [kern.warning] [ 2855.197804] RBP: ffff880401314180 R08: 0000000000014ae0 R09: 0000000000000000
Dec  8 09:50:15 [kern.warning] [ 2855.197807] R10: ffff8801eca1a900 R11: ffff88042a024000 R12: 0000000000000000
Dec  8 09:50:15 [kern.warning] [ 2855.197809] R13: ffff8803bbd15e30 R14: ffff880401557ce0 R15: ffff8803bbd15e30
Dec  8 09:50:15 [kern.warning] [ 2855.197812] FS:  0000000000000000(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000
Dec  8 09:50:15 [kern.warning] [ 2855.197814] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  8 09:50:15 [kern.warning] [ 2855.197817] CR2: 0000000000000230 CR3: 0000000425e68000 CR4: 00000000000007e0
Dec  8 09:50:15 [kern.warning] [ 2855.197889] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  8 09:50:15 [kern.warning] [ 2855.197923] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  8 09:50:15 [kern.warning] [ 2855.197925] Stack:
Dec  8 09:50:15 [kern.warning] [ 2855.197927]  ffff880401314180 ffff880409fd6b68 0000000000000a46 ffff8803fe286ed8
Dec  8 09:50:15 [kern.warning] [ 2855.197931]  ffffffffa025790b ffff8801eca1af00 ffff880401767318 ffff880401314180
Dec  8 09:50:15 [kern.warning] [ 2855.197934]  ffff8803fe286de0 ffff880401767418 ffffffffa023fcd7 ffff8803fe286e38
Dec  8 09:50:15 [kern.warning] [ 2855.197938] Call Trace:
Dec  8 09:50:15 [kern.warning] [ 2855.197977]  [<ffffffffa025790b>] ? dnode_setdirty+0x13b/0x1c0 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198003]  [<ffffffffa023fcd7>] ? dbuf_dirty+0x477/0x980 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198040]  [<ffffffffa0262344>] ? dsl_dataset_sync+0x44/0x80 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198080]  [<ffffffffa026ac42>] ? dsl_pool_sync+0x92/0x460 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198127]  [<ffffffffa0284f2e>] ? spa_sync+0x35e/0xb30 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198176]  [<ffffffffa0295f56>] ? txg_sync_thread+0x3d6/0x640 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198224]  [<ffffffffa0295b80>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198254]  [<ffffffffa0192cf0>] ? thread_generic_wrapper+0x70/0x80 [spl]
Dec  8 09:50:15 [kern.warning] [ 2855.198262]  [<ffffffffa0192c80>] ? __thread_exit+0x10/0x10 [spl]
Dec  8 09:50:15 [kern.warning] [ 2855.198270]  [<ffffffff81062e92>] ? kthread+0xc2/0xd0
Dec  8 09:50:15 [kern.warning] [ 2855.198276]  [<ffffffff81070000>] ? sched_clock_cpu+0x30/0x100
Dec  8 09:50:15 [kern.warning] [ 2855.198281]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
Dec  8 09:50:15 [kern.warning] [ 2855.198287]  [<ffffffff8169db98>] ? ret_from_fork+0x58/0x90
Dec  8 09:50:15 [kern.warning] [ 2855.198292]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
Dec  8 09:50:15 [kern.warning] [ 2855.198294] Code: 41 5e 41 5f c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 41 54 49 89 fc 55 53 49 89 f5 48 89 d3 48 81 ec 88 00 00 00 <49> 89 94 24 30 02 00 00 48 8b 07 48 85 c0 0f 84 bd 02 00 00 48 
Dec  8 09:50:15 [kern.alert] [ 2855.198328] RIP  [<ffffffffa024a9aa>] dmu_objset_sync+0x1a/0x310 [zfs]
Dec  8 09:50:15 [kern.warning] [ 2855.198361]  RSP <ffff880401557be0>
Dec  8 09:50:15 [kern.warning] [ 2855.198363] CR2: 0000000000000230
Dec  8 09:50:15 [kern.warning] [ 2855.198367] ---[ end trace 8712512ed53c2339 ]---

Is this a known issue? If you need any more information, please let me know.

@behlendorf
Contributor

Thanks for posting this; it looks like a new issue. Are you able to reproduce it consistently by running zfs set refreservation?

@arturpzol
Author

Unfortunately, I am not able to reproduce it every time by running zfs set refreservation; the issue reproduces only rarely. I am trying to find a scenario that triggers it 100% of the time. Do you have any suspicion where the bug might be?

@behlendorf
Contributor

@aerusso I suspect that NULL was somehow passed as the os argument to dmu_objset_sync(). Exactly how that's possible isn't clear though.

@arturpzol
Author

I have reproduced the issue by constantly changing the volsize and refreservation properties on the destination node in the znapzend environment:

while true; do
   for ((i=1; i<9; i++)); do
      zfs set volsize="54${i}G" znapzend-dest/backup1
      zfs set refreservation="54${i}G" znapzend-dest/backup1

      zfs set volsize="64${i}G" znapzend-dest/backup2
      zfs set refreservation="64${i}G" znapzend-dest/backup2

      zfs set volsize="74${i}G" znapzend-dest/backup3
      zfs set refreservation="74${i}G" znapzend-dest/backup3
   done
done

Repeated call trace:


[ 9093.005071] zd0: detected capacity change from 689342251008 to 690415992832
[ 9093.237012] zd48: detected capacity change from 796716433408 to 797790175232
[ 9093.373890] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
[ 9093.373895] IP: [<ffffffffa07799aa>] dmu_objset_sync+0x1a/0x310 [zfs]
[ 9093.373937] PGD 0
[ 9093.373940] Oops: 0002 [#1] SMP
[ 9093.373943] Modules linked in: iscsi_scst(O) scst_vdisk(O) qla2x00tgt(O) scst(O) xts gf128mul zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) crc32c_intel sg qla2xxx(O) scsi_transport_fc mpt2sas(O) raid_class scsi_transport_sas ixgbe(O) igb(O) ptp pps_core button acpi_cpufreq mperf processor aufs [last unloaded: fat]
[ 9093.373967] CPU: 0 PID: 21034 Comm: txg_sync Tainted: P O 3.10.92-oe64-ge331686 #15
[ 9093.373970] Hardware name: Supermicro X8DTS/X8DTS, BIOS 2.1 06/25/2012
[ 9093.373972] task: ffff880309886900 ti: ffff8803098ba000 task.ti: ffff8803098ba000
[ 9093.373974] RIP: 0010:[<ffffffffa07799aa>] [<ffffffffa07799aa>] dmu_objset_sync+0x1a/0x310 [zfs]
[ 9093.374001] RSP: 0018:ffff8803098bbbe0 EFLAGS: 00010282
[ 9093.374003] RAX: ffff88032a3a4000 RBX: ffff88031fb76300 RCX: ffff88032c847e88
[ 9093.374005] RDX: ffff88031fb76300 RSI: ffff8802f7b337a8 RDI: 0000000000000000
[ 9093.374007] RBP: ffff88031fb76300 R08: 0000000000014ae0 R09: ffff88032c847e00
[ 9093.374009] R10: 0000000000000002 R11: ffff88032ba54000 R12: 0000000000000000
[ 9093.374011] R13: ffff8802f7b337a8 R14: ffff8803098bbce0 R15: ffff8802f7b337a8
[ 9093.374013] FS: 0000000000000000(0000) GS:ffff880333200000(0000) knlGS:0000000000000000
[ 9093.374015] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 9093.374017] CR2: 0000000000000230 CR3: 000000000195e000 CR4: 00000000000007f0
[ 9093.374019] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9093.374021] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 9093.374022] Stack:
[ 9093.374024] ffff88031fb76300 ffff88032e434368 0000000000035805 ffff88030d6f2ed8
[ 9093.374027] ffffffffa078690b ffff88032c846700 ffff88030d6fb680 ffff88031fb76300
[ 9093.374030] ffff88030d6f2de0 ffff88030d6fb780 ffffffffa076ecd7 ffff88030d6f2e38
[ 9093.374033] Call Trace:
[ 9093.374061] [<ffffffffa078690b>] ? dnode_setdirty+0x13b/0x1c0 [zfs]
[ 9093.374082] [<ffffffffa076ecd7>] ? dbuf_dirty+0x477/0x980 [zfs]
[ 9093.374105] [<ffffffffa0791344>] ? dsl_dataset_sync+0x44/0x80 [zfs]
[ 9093.374129] [<ffffffffa0799c42>] ? dsl_pool_sync+0x92/0x460 [zfs]
[ 9093.374157] [<ffffffffa07b3f2e>] ? spa_sync+0x35e/0xb30 [zfs]
[ 9093.374187] [<ffffffffa07c4f56>] ? txg_sync_thread+0x3d6/0x640 [zfs]
[ 9093.374216] [<ffffffffa07c4b80>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
[ 9093.374222] [<ffffffffa061dcf0>] ? thread_generic_wrapper+0x70/0x80 [spl]
[ 9093.374227] [<ffffffffa061dc80>] ? __thread_exit+0x10/0x10 [spl]
[ 9093.374232] [<ffffffff81062e92>] ? kthread+0xc2/0xd0
[ 9093.374235] [<ffffffff81070000>] ? sched_clock_cpu+0x30/0x100
[ 9093.374238] [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[ 9093.374242] [<ffffffff8169db98>] ? ret_from_fork+0x58/0x90
[ 9093.374245] [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[ 9093.374246] Code: 41 5e 41 5f c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 41 54 49 89 fc 55 53 49 89 f5 48 89 d3 48 81 ec 88 00 00 00 <49> 89 94 24 30 02 00 00 48 8b 07 48 85 c0 0f 84 bd 02 00 00 48
[ 9093.374269] RIP [<ffffffffa07799aa>] dmu_objset_sync+0x1a/0x310 [zfs]
[ 9093.374288] RSP <ffff8803098bbbe0>
[ 9093.374290] CR2: 0000000000000230
[ 9093.374292] ---[ end trace aeff801d0a446e8d ]---

SysRq output listing the blocked (D-state) tasks:


[13185.275430] SysRq : Show Blocked State
[13185.275436] task PC stack pid father
[13185.275509] zfs D ffff88032e437290 0 24219 23964 0x00000004
[13185.275513] ffff880329c1a760 0000000000000002 ffff8802f9e5ffd8 ffff8802f9e5ffd8
[13185.275516] ffff8802f9e5ffd8 ffff880329c1a760 ffffffffff0a0004 0000000000000015
[13185.275518] ffff8802f9e5fbd8 ffff880306b7e520 ffffffffa083bd3d ffff8802fa66d000
[13185.275521] Call Trace:
[13185.275531] [<ffffffff81309f87>] ? kvasprintf+0x67/0x80
[13185.275537] [<ffffffff81695678>] ? __mutex_lock_slowpath+0x1b8/0x1e0
[13185.275563] [<ffffffffa076bf28>] ? dbuf_rele_and_unlock+0x2b8/0x3d0 [zfs]
[13185.275571] [<ffffffffa06224dd>] ? cv_wait_common+0xed/0x100 [spl]
[13185.275575] [<ffffffff81063aa0>] ? wake_up_bit+0x20/0x20
[13185.275605] [<ffffffffa07c440f>] ? txg_wait_synced+0xdf/0x120 [zfs]
[13185.275627] [<ffffffffa078e290>] ? dsl_dataset_recalc_head_uniq+0xa0/0xa0 [zfs]
[13185.275652] [<ffffffffa07a173d>] ? dsl_sync_task+0x13d/0x240 [zfs]
[13185.275675] [<ffffffffa078fd40>] ? dsl_dataset_set_refreservation_sync_impl+0x1d0/0x1d0 [zfs]
[13185.275697] [<ffffffffa078e290>] ? dsl_dataset_recalc_head_uniq+0xa0/0xa0 [zfs]
[13185.275719] [<ffffffffa078fd40>] ? dsl_dataset_set_refreservation_sync_impl+0x1d0/0x1d0 [zfs]
[13185.275742] [<ffffffffa078fe20>] ? dsl_dataset_set_refreservation+0x30/0x40 [zfs]
[13185.275772] [<ffffffffa07e98b0>] ? zfs_prop_set_special+0x1e0/0x400 [zfs]
[13185.275802] [<ffffffffa07ea9f0>] ? zfs_check_settable.isra.13+0xf0/0x490 [zfs]
[13185.275832] [<ffffffffa07eb24e>] ? zfs_set_prop_nvlist+0x10e/0x360 [zfs]
[13185.275838] [<ffffffffa0639849>] ? nvlist_xalloc.part.15+0x59/0xc0 [znvpair]
[13185.275869] [<ffffffffa07ec19e>] ? zfs_ioc_set_prop+0xde/0x110 [zfs]
[13185.275899] [<ffffffffa07ea73d>] ? zfsdev_ioctl+0x45d/0x500 [zfs]
[13185.275902] [<ffffffff811122f5>] ? do_vfs_ioctl+0x2d5/0x4b0
[13185.275906] [<ffffffff81070518>] ? vtime_account_user.part.3+0x38/0x50
[13185.275908] [<ffffffff81112551>] ? SyS_ioctl+0x81/0xa0
[13185.275911] [<ffffffff8169de57>] ? tracesys+0xdd/0xe2
[13185.275914] zfs D ffff88032e437290 0 24263 24226 0x00000000
[13185.275917] ffff880329eb5be0 0000000000000002 ffff8802ff0cdfd8 ffff8802ff0cdfd8
[13185.275920] ffff8802ff0cdfd8 ffff880329eb5be0 ffff88030d88f4d0 ffff8802ff0cdc28
[13185.275922] ffff88030d745f60 ffff88030d745f68 0000000000000202 ffffffffa0777363
[13185.275924] Call Trace:
[13185.275944] [<ffffffffa0777363>] ? dmu_object_info_from_dnode+0x93/0xd0 [zfs]
[13185.275962] [<ffffffffa0777414>] ? dmu_object_info_from_db+0x24/0x30 [zfs]
[13185.275977] [<ffffffffa076bf28>] ? dbuf_rele_and_unlock+0x2b8/0x3d0 [zfs]
[13185.275999] [<ffffffffa078b7ea>] ? dsl_dataset_hold_obj+0x8a/0x800 [zfs]
[13185.276014] [<ffffffffa076bf28>] ? dbuf_rele_and_unlock+0x2b8/0x3d0 [zfs]
[13185.276021] [<ffffffffa06224dd>] ? cv_wait_common+0xed/0x100 [spl]
[13185.276023] [<ffffffff81063aa0>] ? wake_up_bit+0x20/0x20
[13185.276052] [<ffffffffa07c440f>] ? txg_wait_synced+0xdf/0x120 [zfs]
[13185.276076] [<ffffffffa079bcc0>] ? dsl_prop_inherit+0x50/0x50 [zfs]
[13185.276101] [<ffffffffa07a173d>] ? dsl_sync_task+0x13d/0x240 [zfs]
[13185.276126] [<ffffffffa079d1c0>] ? dsl_props_set_sync_impl+0x170/0x170 [zfs]
[13185.276150] [<ffffffffa079bcc0>] ? dsl_prop_inherit+0x50/0x50 [zfs]
[13185.276175] [<ffffffffa079d1c0>] ? dsl_props_set_sync_impl+0x170/0x170 [zfs]
[13185.276199] [<ffffffffa079bbc8>] ? dsl_props_set+0x48/0x50 [zfs]
[13185.276230] [<ffffffffa07eb3e0>] ? zfs_set_prop_nvlist+0x2a0/0x360 [zfs]
[13185.276234] [<ffffffffa0639849>] ? nvlist_xalloc.part.15+0x59/0xc0 [znvpair]
[13185.276264] [<ffffffffa07ec19e>] ? zfs_ioc_set_prop+0xde/0x110 [zfs]
[13185.276295] [<ffffffffa07ea73d>] ? zfsdev_ioctl+0x45d/0x500 [zfs]
[13185.276297] [<ffffffff811122f5>] ? do_vfs_ioctl+0x2d5/0x4b0
[13185.276300] [<ffffffff81070518>] ? vtime_account_user.part.3+0x38/0x50
[13185.276303] [<ffffffff81112551>] ? SyS_ioctl+0x81/0xa0
[13185.276305] [<ffffffff8169de57>] ? tracesys+0xdd/0xe2
[13185.276307] zfs D ffff88032e437290 0 24264 24262 0x00000000
[13185.276309] ffff880309880000 0000000000000002 ffff8802f617ffd8 ffff8802f617ffd8
[13185.276312] ffff8802f617ffd8 ffff880309880000 ffff8802fbc7e000 0000000000000000
[13185.276314] ffff8802f617fc28 ffffffffa077d5e6 ffff8802fbc7e000 ffffffffa08c4740
[13185.276316] Call Trace:
[13185.276336] [<ffffffffa077d5e6>] ? dmu_recv_begin_check+0x1c6/0x4b0 [zfs]
[13185.276350] [<ffffffffa0761e31>] ? add_reference.isra.27+0x61/0xa0 [zfs]
[13185.276365] [<ffffffffa076b14a>] ? dbuf_find+0x1aa/0x1c0 [zfs]
[13185.276369] [<ffffffff810f8e60>] ? __kmalloc_node+0x20/0x100
[13185.276375] [<ffffffffa06224dd>] ? cv_wait_common+0xed/0x100 [spl]
[13185.276377] [<ffffffff81063aa0>] ? wake_up_bit+0x20/0x20
[13185.276406] [<ffffffffa07c440f>] ? txg_wait_synced+0xdf/0x120 [zfs]
[13185.276426] [<ffffffffa077d420>] ? dmu_recv_begin_sync+0x3d0/0x3d0 [zfs]
[13185.276451] [<ffffffffa07a173d>] ? dsl_sync_task+0x13d/0x240 [zfs]
[13185.276470] [<ffffffffa077d050>] ? restore_read+0x130/0x130 [zfs]
[13185.276490] [<ffffffffa077d420>] ? dmu_recv_begin_sync+0x3d0/0x3d0 [zfs]
[13185.276509] [<ffffffffa077d050>] ? restore_read+0x130/0x130 [zfs]
[13185.276529] [<ffffffffa077f5d7>] ? dmu_recv_begin+0x1b7/0x1e0 [zfs]
[13185.276559] [<ffffffffa07eb706>] ? zfs_ioc_recv+0x146/0xb00 [zfs]
[13185.276562] [<ffffffff81074646>] ? enqueue_task_fair+0x246/0xbd0
[13185.276566] [<ffffffff8101252f>] ? native_sched_clock+0xf/0x80
[13185.276568] [<ffffffff810125a5>] ? sched_clock+0x5/0x10
[13185.276571] [<ffffffff8106c16d>] ? ttwu_do_wakeup+0xd/0x90
[13185.276573] [<ffffffff8106df7a>] ? try_to_wake_up+0x1ea/0x270
[13185.276593] [<ffffffffa0780030>] ? dmu_recv_stream+0xa30/0xb60 [zfs]
[13185.276599] [<ffffffffa061e6ad>] ? taskq_dispatch_ent+0x10d/0x140 [spl]
[13185.276620] [<ffffffffa0786754>] ? dnode_rele_and_unlock+0x44/0x80 [zfs]
[13185.276626] [<ffffffffa06230e7>] ? tsd_hash_dtor+0x67/0x80 [spl]
[13185.276632] [<ffffffffa06233ca>] ? tsd_set+0x15a/0x4d0 [spl]
[13185.276656] [<ffffffffa079b23c>] ? dsl_prop_get_ds+0x17c/0x230 [zfs]
[13185.276683] [<ffffffffa07aaaa3>] ? rrw_exit+0x53/0x160 [zfs]
[13185.276685] [<ffffffff810f8d6b>] ? __kmalloc+0x1b/0xf0
[13185.276715] [<ffffffffa07ea73d>] ? zfsdev_ioctl+0x45d/0x500 [zfs]
[13185.276718] [<ffffffff811122f5>] ? do_vfs_ioctl+0x2d5/0x4b0
[13185.276721] [<ffffffff81070518>] ? vtime_account_user.part.3+0x38/0x50
[13185.276723] [<ffffffff81112551>] ? SyS_ioctl+0x81/0xa0
[13185.276726] [<ffffffff8169de57>] ? tracesys+0xdd/0xe2
[13185.276728] zfs D ffff88032e437290 0 24265 24261 0x00000000
[13185.276730] ffff88030c738000 0000000000000002 ffff8802f617dfd8 ffff8802f617dfd8
[13185.276732] ffff8802f617dfd8 ffff88030c738000 ffff88030e38b000 0000000000000000
[13185.276734] ffff8802f617dc28 ffffffffa077d5e6 ffff88030e38b000 ffffffffa08c4740
[13185.276737] Call Trace:
[13185.276757] [<ffffffffa077d5e6>] ? dmu_recv_begin_check+0x1c6/0x4b0 [zfs]
[13185.276770] [<ffffffffa0761e31>] ? add_reference.isra.27+0x61/0xa0 [zfs]
[13185.276785] [<ffffffffa076b14a>] ? dbuf_find+0x1aa/0x1c0 [zfs]
[13185.276787] [<ffffffff810f8e60>] ? __kmalloc_node+0x20/0x100
[13185.276794] [<ffffffffa06224dd>] ? cv_wait_common+0xed/0x100 [spl]
[13185.276796] [<ffffffff81063aa0>] ? wake_up_bit+0x20/0x20
[13185.276825] [<ffffffffa07c440f>] ? txg_wait_synced+0xdf/0x120 [zfs]
[13185.276845] [<ffffffffa077d420>] ? dmu_recv_begin_sync+0x3d0/0x3d0 [zfs]
[13185.276869] [<ffffffffa07a173d>] ? dsl_sync_task+0x13d/0x240 [zfs]
[13185.276889] [<ffffffffa077d050>] ? restore_read+0x130/0x130 [zfs]
[13185.276908] [<ffffffffa077d420>] ? dmu_recv_begin_sync+0x3d0/0x3d0 [zfs]
[13185.276928] [<ffffffffa077d050>] ? restore_read+0x130/0x130 [zfs]
[13185.276947] [<ffffffffa077f5d7>] ? dmu_recv_begin+0x1b7/0x1e0 [zfs]
[13185.276977] [<ffffffffa07eb706>] ? zfs_ioc_recv+0x146/0xb00 [zfs]
[13185.276980] [<ffffffff810f8e60>] ? __kmalloc_node+0x20/0x100
[13185.276984] [<ffffffffa0655a30>] ? zprop_string_to_index+0xb0/0xb0 [zcommon]
[13185.276987] [<ffffffff8106df7a>] ? try_to_wake_up+0x1ea/0x270
[13185.277007] [<ffffffffa0780030>] ? dmu_recv_stream+0xa30/0xb60 [zfs]
[13185.277012] [<ffffffffa061e6ad>] ? taskq_dispatch_ent+0x10d/0x140 [spl]
[13185.277033] [<ffffffffa0786754>] ? dnode_rele_and_unlock+0x44/0x80 [zfs]
[13185.277039] [<ffffffffa06230e7>] ? tsd_hash_dtor+0x67/0x80 [spl]
[13185.277045] [<ffffffffa06233ca>] ? tsd_set+0x15a/0x4d0 [spl]
[13185.277069] [<ffffffffa079b23c>] ? dsl_prop_get_ds+0x17c/0x230 [zfs]
[13185.277095] [<ffffffffa07aaaa3>] ? rrw_exit+0x53/0x160 [zfs]
[13185.277098] [<ffffffff810f8d6b>] ? __kmalloc+0x1b/0xf0
[13185.277128] [<ffffffffa07ea73d>] ? zfsdev_ioctl+0x45d/0x500 [zfs]
[13185.277131] [<ffffffff811122f5>] ? do_vfs_ioctl+0x2d5/0x4b0
[13185.277134] [<ffffffff81070518>] ? vtime_account_user.part.3+0x38/0x50
[13185.277136] [<ffffffff81112551>] ? SyS_ioctl+0x81/0xa0
[13185.277139] [<ffffffff8169de57>] ? tracesys+0xdd/0xe2

so I think I have a repeatable scenario.
If you have any solution, I can test it.
I will also try to debug the issue.

@arturpzol
Author

Call trace with debug enabled:

[  426.058714] zd16: detected capacity change from 583041810432 to 584115552256
[  426.160475] zd0: detected capacity change from 690415992832 to 691489734656
[  426.299123] zd48: detected capacity change from 797790175232 to 798863917056
[  426.319271] VERIFY(!dmu_objset_is_dirty(os, t)) failed
[  426.319275] PANIC at dmu_objset.c:689:dmu_objset_evict()
[  426.319278] Showing stack for process 19383
[  426.319282] CPU: 1 PID: 19383 Comm: txg_sync Tainted: P           O 3.10.92-oe64-ge331686 #15
[  426.319284] Hardware name: Supermicro X8DTS/X8DTS, BIOS 2.1     06/25/2012
[  426.319286]  ffffffff81692f75 ffffffffa057c267 ffffffffa0b9fe80 ffff880200000028
[  426.319290]  ffff88030ba55b00 ffff88030ba55aa0 2128594649524556 736a626f5f756d64
[  426.319293]  69645f73695f7465 202c736f28797472 6c69616620292974 ffff8803000a6465
[  426.319296] Call Trace:
[  426.319303]  [<ffffffff81692f75>] ? dump_stack+0xc/0x15
[  426.319317]  [<ffffffffa057c267>] ? spl_panic+0xa7/0xe0 [spl]
[  426.319322]  [<ffffffff810f9401>] ? kmem_cache_free+0xf1/0x120
[  426.319356]  [<ffffffffa09cbd16>] ? dbuf_destroy+0x2a6/0x4c0 [zfs]
[  426.319373]  [<ffffffffa09ce607>] ? dbuf_do_evict+0xc7/0x170 [zfs]
[  426.319390]  [<ffffffffa09c5cdd>] ? arc_clear_callback+0x20d/0x480 [zfs]
[  426.319408]  [<ffffffffa09ce23f>] ? dbuf_clear+0xcf/0x2d0 [zfs]
[  426.319432]  [<ffffffffa09fa4c5>] ? dnode_evict_dbufs+0x225/0x2b0 [zfs]
[  426.319454]  [<ffffffffa09e5b5c>] ? dmu_objset_evict+0x4c/0x640 [zfs]
[  426.319478]  [<ffffffffa09fd672>] ? dsl_dataset_clone_swap_sync_impl+0xc2/0x800 [zfs]
[  426.319496]  [<ffffffffa09cdb58>] ? dbuf_rele_and_unlock+0x1c8/0x460 [zfs]
[  426.319526]  [<ffffffffa0a25e99>] ? refcount_remove_many+0x1c9/0x2b0 [zfs]
[  426.319551]  [<ffffffffa09ffc7e>] ? dsl_dataset_hold+0x7e/0x2b0 [zfs]
[  426.319573]  [<ffffffffa09e9018>] ? dmu_recv_end_sync+0xb8/0x420 [zfs]
[  426.319600]  [<ffffffffa0a18f02>] ? dsl_sync_task_sync+0x122/0x130 [zfs]
[  426.319627]  [<ffffffffa0a10363>] ? dsl_pool_sync+0x543/0x670 [zfs]
[  426.319658]  [<ffffffffa0a3284e>] ? spa_sync+0x35e/0xe00 [zfs]
[  426.319692]  [<ffffffffa0a6e589>] ? zfs_dbgmsg_purge+0xb9/0x110 [zfs]
[  426.319724]  [<ffffffffa0a47e01>] ? txg_sync_thread+0x3f1/0x710 [zfs]
[  426.319728]  [<ffffffff8106c16d>] ? ttwu_do_wakeup+0xd/0x90
[  426.319760]  [<ffffffffa0a47a10>] ? txg_thread_wait+0xd0/0xd0 [zfs]
[  426.319765]  [<ffffffffa05780a9>] ? thread_generic_wrapper+0x79/0xc0 [spl]
[  426.319770]  [<ffffffffa0578030>] ? __thread_exit+0x10/0x10 [spl]
[  426.319774]  [<ffffffff81062e92>] ? kthread+0xc2/0xd0
[  426.319777]  [<ffffffff81070000>] ? sched_clock_cpu+0x30/0x100
[  426.319780]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[  426.319783]  [<ffffffff8169db98>] ? ret_from_fork+0x58/0x90
[  426.319786]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110

@arturpzol
Author

The issue also reproduces on the released 0.6.5.4 version.

grep -E 'SPL:|ZFS:' /var/log/kernel.log
Feb  9 10:00:04 [kern.notice] [   35.987176] SPL: Loaded module v0.6.5.4-1 (DEBUG mode)
Feb  9 10:00:04 [kern.notice] [   36.051103] ZFS: Loaded module v0.6.5.4-1 (DEBUG mode), ZFS pool version 5000, ZFS filesystem version 5

Call trace with debug enabled:

[  904.267725] zd0: detected capacity change from 692563476480 to 693637218304
[  904.423461] zd48: detected capacity change from 799937658880 to 801011400704
[  904.438689] VERIFY(!dmu_objset_is_dirty(os, t)) failed
[  904.438695] PANIC at dmu_objset.c:689:dmu_objset_evict()
[  904.438697] Showing stack for process 21037
[  904.438700] CPU: 1 PID: 21037 Comm: txg_sync Tainted: P           O 3.10.92-oe64-ge331686 #15
[  904.438702] Hardware name: Supermicro X8DTS/X8DTS, BIOS 2.1     06/25/2012
[  904.438704]  ffffffff81692f75 ffffffffa0591217 ffffffffa09869c0 ffff880200000028
[  904.438707]  ffff880309729b00 ffff880309729aa0 2128594649524556 736a626f5f756d64
[  904.438709]  69645f73695f7465 202c736f28797472 6c69616620292974 ffff8803000a6465
[  904.438711] Call Trace:
[  904.438719]  [<ffffffff81692f75>] ? dump_stack+0xc/0x15
[  904.438730]  [<ffffffffa0591217>] ? spl_panic+0xa7/0xe0 [spl]
[  904.438735]  [<ffffffff810f9401>] ? kmem_cache_free+0xf1/0x120
[  904.438769]  [<ffffffffa07b1d16>] ? dbuf_destroy+0x2a6/0x4c0 [zfs]
[  904.438787]  [<ffffffffa07b4607>] ? dbuf_do_evict+0xc7/0x170 [zfs]
[  904.438803]  [<ffffffffa07abcdd>] ? arc_clear_callback+0x20d/0x480 [zfs]
[  904.438821]  [<ffffffffa07b423f>] ? dbuf_clear+0xcf/0x2d0 [zfs]
[  904.438845]  [<ffffffffa07e04a5>] ? dnode_evict_dbufs+0x225/0x2b0 [zfs]
[  904.438868]  [<ffffffffa07cbb7c>] ? dmu_objset_evict+0x4c/0x640 [zfs]
[  904.438891]  [<ffffffffa07e3652>] ? dsl_dataset_clone_swap_sync_impl+0xc2/0x800 [zfs]
[  904.438909]  [<ffffffffa07b3b58>] ? dbuf_rele_and_unlock+0x1c8/0x460 [zfs]
[  904.438940]  [<ffffffffa080be79>] ? refcount_remove_many+0x1c9/0x2b0 [zfs]
[  904.438965]  [<ffffffffa07e5c5e>] ? dsl_dataset_hold+0x7e/0x2b0 [zfs]
[  904.438987]  [<ffffffffa07cf038>] ? dmu_recv_end_sync+0xb8/0x420 [zfs]
[  904.439014]  [<ffffffffa07feee2>] ? dsl_sync_task_sync+0x122/0x130 [zfs]
[  904.439041]  [<ffffffffa07f6343>] ? dsl_pool_sync+0x543/0x670 [zfs]
[  904.439072]  [<ffffffffa081883e>] ? spa_sync+0x35e/0xe00 [zfs]
[  904.439104]  [<ffffffffa082ddf1>] ? txg_sync_thread+0x3f1/0x710 [zfs]
[  904.439108]  [<ffffffff8106c16d>] ? ttwu_do_wakeup+0xd/0x90
[  904.439140]  [<ffffffffa082da00>] ? txg_thread_wait+0xd0/0xd0 [zfs]
[  904.439145]  [<ffffffffa058d0d9>] ? thread_generic_wrapper+0x79/0xc0 [spl]
[  904.439150]  [<ffffffffa058d060>] ? __thread_exit+0x10/0x10 [spl]
[  904.439154]  [<ffffffff81062e92>] ? kthread+0xc2/0xd0
[  904.439157]  [<ffffffff81070000>] ? sched_clock_cpu+0x30/0x100
[  904.439160]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[  904.439163]  [<ffffffff8169db98>] ? ret_from_fork+0x58/0x90
[  904.439166]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110

The same behavior occurs with the master SPL and ZFS source code.

@ab-oe
Contributor

ab-oe commented Feb 25, 2016

Hello,
I reproduced this issue with the script provided by @arturpzol and added some prints to the code to track it down. The problem occurs when zap_update in zvol_update_volsize (which adds the transaction group to dp->dp_dirty_datasets) is executed. Before that dirty data is synchronized, the receive process causes dmu_recv_end_sync to run. When the dirty data is finally synchronized, the sync ends with the call trace @arturpzol posted. Waiting for txg synchronization in zvol_update_volsize resolves this issue.

Index: zvol.c
===================================================================
--- zvol.c.orig
+++ zvol.c
@@ -300,10 +302,11 @@ zvol_update_volsize(uint64_t volsize, ob
 {
    dmu_tx_t *tx;
    int error;
+   uint64_t txg;

    ASSERT(MUTEX_HELD(&zvol_state_lock));

    tx = dmu_tx_create(os);
    dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
    error = dmu_tx_assign(tx, TXG_WAIT);
    if (error) {
@@ -315,6 +318,10 @@ zvol_update_volsize(uint64_t volsize, ob
        &volsize, tx);
+   /* dmu_tx_get_txg() is only meaningful once the tx has been assigned. */
+   txg = dmu_tx_get_txg(tx);
    dmu_tx_commit(tx);

+   txg_wait_synced(dmu_objset_pool(os), txg);
+
    if (error == 0)
        error = dmu_free_long_range(os,
            ZVOL_OBJ, volsize, DMU_OBJECT_END);

@behlendorf
Contributor

@ab-oe nice analysis and fix. Your patch looks good to me; could you please open a new pull request with it?

An alternate approach would be to restructure zvol_update_volsize() as a sync task. This is the same mechanism used for most of the other properties in zfs_prop_set_special(), and it results in the same txg_wait_synced() behavior. That said, restructuring all this code would be a more invasive change for very little gain, so I'm OK with the fix as you've proposed it.

ab-oe added a commit to ab-oe/zfs that referenced this issue Feb 26, 2016
Make zvol update volsize operation synchronous.

There is a race condition when a new transaction group is added to the
dp->dp_dirty_datasets list by zap_update in zvol_update_volsize.
Before this dirty data is synchronized, the receive process can cause
dmu_recv_end_sync to be executed. When the dirty data is finally
synchronized, the sync ends with a NULL pointer dereference.
ab-oe added a commit to ab-oe/zfs that referenced this issue Feb 28, 2016
ab-oe added a commit to ab-oe/zfs that referenced this issue Sep 8, 2017