PANIC at zfeature.c:294:feature_get_enabled_txg() during import #6543

Closed
utopiabound opened this issue Aug 22, 2017 · 12 comments

utopiabound commented Aug 22, 2017

System information

Type                  Version/Name
Distribution Name     Red Hat Enterprise Linux
Distribution Version  7.4
Linux Kernel          3.10.0-514.26.2.el7_lustre.x86_64
Architecture          x86_64
ZFS Version           0.7.1-1
SPL Version           0.7.1-1

Describe the problem you're observing

Kernel panic on zfs import of pool created during automated testing.

This bug is also being tracked at https://jira.hpdd.intel.com/browse/LU-9901

Describe how to reproduce the problem

IML tests: https://github.com/intel-hpdd/intel-manager-for-lustre

I will try to narrow down an exact reproduction case.

Include any warning/errors/backtraces from the system logs

[19124.879146] WARNING: can't open objset 51, error 5
[19124.895291] VERIFY3(0 == zap_lookup(spa->spa_meta_objset, spa->spa_feat_enabled_txg_obj, feature->fi_guid, sizeof (uint64_t), 1, res)) failed (0 == 52)
[19124.905980] PANIC at zfeature.c:294:feature_get_enabled_txg()
[19124.910754] Showing stack for process 18651
[19124.915295] CPU: 1 PID: 18651 Comm: zpool Tainted: P           OE  ------------   3.10.0-514.26.2.el7_lustre.x86_64 #1
[19124.920749] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[19124.925537]  ffffffffa0e1f9b9 0000000051f16af8 ffff8800121ab8f8 ffffffff8168729f
[19124.930690]  ffff8800121ab908 ffffffffa0bf0254 ffff8800121aba90 ffffffffa0bf0329
[19124.935541]  0000000051f16af8 ffff880000000030 ffff8800121abaa0 ffff8800121aba40
[19124.939945] Call Trace:
[19124.943537]  [<ffffffff8168729f>] dump_stack+0x19/0x1b
[19124.947382]  [<ffffffffa0bf0254>] spl_dumpstack+0x44/0x50 [spl]
[19124.951156]  [<ffffffffa0bf0329>] spl_panic+0xc9/0x110 [spl]
[19124.954764]  [<ffffffffa0d44fe0>] ? zap_lookup_norm+0x60/0xd0 [zfs]
[19124.958185]  [<ffffffff8168c969>] ? schedule+0x29/0x70
[19124.961438]  [<ffffffff8168a3a9>] ? schedule_timeout+0x239/0x2c0
[19124.964782]  [<ffffffffa0d47345>] spa_feature_enabled_txg+0x105/0x110 [zfs]
[19124.967995]  [<ffffffffa0cdb31b>] traverse_impl+0x3db/0x460 [zfs]
[19124.971047]  [<ffffffff811de835>] ? kmem_cache_alloc+0x35/0x1e0
[19124.974060]  [<ffffffffa0bec319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[19124.977064]  [<ffffffffa0d10d10>] ? spa_async_suspend+0xa0/0xa0 [zfs]
[19124.980079]  [<ffffffffa0cdb834>] traverse_pool+0x84/0x1d0 [zfs]
[19124.982868]  [<ffffffffa0d10d10>] ? spa_async_suspend+0xa0/0xa0 [zfs]
[19124.985617]  [<ffffffffa0d10d10>] ? spa_async_suspend+0xa0/0xa0 [zfs]
[19124.988234]  [<ffffffffa0d7f02a>] ? zio_null+0x6a/0x70 [zfs]
[19124.990798]  [<ffffffffa0d18128>] spa_load+0x1a58/0x2030 [zfs]
[19124.993366]  [<ffffffffa0d1875e>] spa_load_best+0x5e/0x290 [zfs]
[19124.995935]  [<ffffffffa0d1a4a2>] spa_import+0x212/0x730 [zfs]
[19124.998374]  [<ffffffffa0d55a27>] zfs_ioc_pool_import+0x147/0x160 [zfs]
[19125.000830]  [<ffffffffa0d5a626>] zfsdev_ioctl+0x606/0x650 [zfs]
[19125.003126]  [<ffffffff812127a5>] do_vfs_ioctl+0x2d5/0x4b0
[19125.005463]  [<ffffffff81693226>] ? trace_do_page_fault+0x56/0x150
[19125.007630]  [<ffffffff81212a21>] SyS_ioctl+0xa1/0xc0
[19125.009651]  [<ffffffff816928cb>] ? do_async_page_fault+0x1b/0xd0
[19125.011740]  [<ffffffff81697989>] system_call_fastpath+0x16/0x1b

utopiabound commented:

The command that PANICed was:
zpool import -f zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13

Attached is the output of:
zdb -vvvve zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13
zfs_pool_scsi0QEMU_QEMU_HARDDISK_disk13.zdb-dump-vvvv.txt

loli10K commented Aug 22, 2017

This panic can be "reproduced" in a couple of minutes; here is the output from one of my Debian test boxes:

root@debian-9:~# zpool import -a -d /var/tmp
[  796.299189] VERIFY3(0 == zap_lookup(spa->spa_meta_objset, spa->spa_feat_enabled_txg_obj, feature->fi_guid, sizeof (uint64_t), 1, res)) failed (0 == 52)
[  796.301375] PANIC at zfeature.c:294:feature_get_enabled_txg()
[  796.302266] Showing stack for process 25499
[  796.302982] CPU: 0 PID: 25499 Comm: zpool Tainted: P           OE   4.9.0-2-amd64 #1 Debian 4.9.18-1
[  796.304374] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  796.305260]  0000000000000000 ffffffffa5528714 ffffffffc0899548 ffffac87c37c3a90
[  796.306471]  ffffffffc0597805 ffffffffc06e946d 0000001200000030 ffffac87c37c3aa0
[  796.307632]  ffffac87c37c3a40 2833594649524556 70617a203d3d2030 2870756b6f6f6c5f
[  796.308745] Call Trace:
[  796.309102]  [<ffffffffa5528714>] ? dump_stack+0x5c/0x78
[  796.309860]  [<ffffffffc0597805>] ? spl_panic+0xc5/0x100 [spl]
[  796.310737]  [<ffffffffc06e946d>] ? dnode_rele_and_unlock+0x4d/0xc0 [zfs]
[  796.311739]  [<ffffffffc076e368>] ? zap_lookup_norm+0x68/0xe0 [zfs]
[  796.312635]  [<ffffffffc0770bba>] ? spa_feature_enabled_txg+0x10a/0x1c0 [zfs]
[  796.343688]  [<ffffffffc06e2405>] ? traverse_impl+0x455/0x570 [zfs]
[  796.345012]  [<ffffffffa57f72fe>] ? wait_for_completion+0x10e/0x130
[  796.346341]  [<ffffffffc072ee10>] ? spa_async_suspend+0xd0/0xd0 [zfs]
[  796.347732]  [<ffffffffc06e2a20>] ? traverse_pool+0x70/0x1a0 [zfs]
[  796.349063]  [<ffffffffc072ee10>] ? spa_async_suspend+0xd0/0xd0 [zfs]
[  796.350428]  [<ffffffffc072ee10>] ? spa_async_suspend+0xd0/0xd0 [zfs]
[  796.351802]  [<ffffffffc07b2b56>] ? zio_null+0x66/0x70 [zfs]
[  796.353047]  [<ffffffffc073747b>] ? spa_load+0x214b/0x24a0 [zfs]
[  796.354185]  [<ffffffffc073782a>] ? spa_load_best+0x5a/0x2b0 [zfs]
[  796.355242]  [<ffffffffc0739971>] ? spa_import+0x251/0x780 [zfs]
[  796.356280]  [<ffffffffc077eca3>] ? get_nvlist+0x103/0x130 [zfs]
[  796.357339]  [<ffffffffc077fd7b>] ? zfs_ioc_pool_import+0x12b/0x140 [zfs]
[  796.358484]  [<ffffffffc0788c14>] ? zfsdev_ioctl+0x694/0x760 [zfs]
[  796.359492]  [<ffffffffa54160df>] ? do_vfs_ioctl+0x9f/0x600
[  796.360387]  [<ffffffffa54166b4>] ? SyS_ioctl+0x74/0x80
[  796.361242]  [<ffffffffa57fb07b>] ? system_call_fast_compare_end+0xc/0x9b

Message from syslogd@debian-9 at Aug 22 17:55:58 ...
 kernel:[  796.299189] VERIFY3(0 == zap_lookup(spa->spa_meta_objset, spa->spa_feat_enabled_txg_obj, feature->fi_guid, sizeof (uint64_t), 1, res)) failed (0 == 52)

Message from syslogd@debian-9 at Aug 22 17:55:58 ...
 kernel:[  796.301375] PANIC at zfeature.c:294:feature_get_enabled_txg()

The real question here is: "How did the MOS get corrupted?"

EDIT: spelling

behlendorf commented:

@loli10K how were you able to reproduce this?

loli10K commented Aug 23, 2017

@behlendorf I just wrote garbage over all three mos->obj_dir->feature_enabled_txg DVAs via dd to simulate corruption (ECKSUM). I don't think this PANIC is a bug; the real issue is whatever caused the corruption on disk.
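
As a rough illustration of that kind of fault injection, here is a sketch against a throwaway file-backed pool. The pool name, backing file, and OFFSET_BYTES are hypothetical; the real offsets have to be read out of zdb (and each of the redundant DVA copies overwritten), so treat this as an outline rather than an exact recipe:

# create a disposable file-backed test pool
truncate -s 256M /var/tmp/zfeature-test.img
zpool create zfeature-test /var/tmp/zfeature-test.img

# locate the feature_enabled_txg object in the MOS and note its DVAs
# (increase the -d verbosity until the block pointers are printed)
zdb -ddddd zfeature-test

# export the pool, then overwrite the located region of the backing file with garbage;
# OFFSET_BYTES must be derived from the zdb output (DVA offsets are relative to the
# start of the vdev's allocatable space, after the front labels), and the dd should be
# repeated for every DVA copy of the block
zpool export zfeature-test
dd if=/dev/urandom of=/var/tmp/zfeature-test.img bs=1 count=4096 seek=$OFFSET_BYTES conv=notrunc

# re-importing the damaged pool should now hit the ECKSUM path and the VERIFY3/PANIC above
zpool import -d /var/tmp zfeature-test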

brianjmurrell commented:

Is it possible to at least handle this kind of situation in some non-blocking fashion, with an error returned to userspace, rather than having the userspace command hang indefinitely?

The latter is much more difficult to detect/debug, particularly when the commands are being driven by a non-human.

behlendorf commented:

@loli10K I see.

@brianjmurrell you can set the module option spl_panic_halt=1 to cause the system to panic rather than hang, which makes detecting the issue easier. But I'd still like to understand how you managed to damage the pool in this way.
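
For reference, a sketch of how that option can be set; the sysfs path below assumes the parameter is exported in the usual location for spl module parameters, so verify it exists on your system first:

# enable at runtime
echo 1 > /sys/module/spl/parameters/spl_panic_halt

# or persist it across module reloads
echo "options spl spl_panic_halt=1" > /etc/modprobe.d/spl.conf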

tanabarr commented Sep 22, 2017

@behlendorf as per the comments in whamcloud/integrated-manager-for-lustre#86, this is probably caused by dd'ing the underlying disk after failing to properly remove the zpool. If we are recreating zpools between automated tests and 'zpool destroy -f ...' fails to remove the pool (pool still reported; see https://github.com/intel-hpdd/intel-manager-for-lustre/pull/282), what would be the recommended approach to clearing state?

utopiabound commented:

@tanabarr You can use wipefs on the underlying disks to remove any trace of the zpool and get a clean slate to work from. That's probably the easiest way to clear any old data.
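
For example (device names are hypothetical; wipefs -a is destructive, so double-check the devices first):

# erase all filesystem/RAID/ZFS signatures from the former pool members
wipefs --all /dev/sdb /dev/sdc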

behlendorf commented:

You could also use zpool labelclear to remove any remaining traces of the label. This has the advantage that it won't let you destroy the label of a pool which is currently imported, which makes it nice to use interactively. For your use case wipefs(8) should also work great.
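
For example (device name hypothetical; -f is only needed to force clearing when the label still appears active):

# clear any remaining ZFS label from a former pool member
zpool labelclear -f /dev/sdb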

tanabarr commented:

Thanks, wipefs seems to be working nicely. It also refuses to remove signatures from imported pools.

behlendorf commented:

@utopiabound @tanabarr Then, as I understand it, the issue here was that remnants of a previous pool were being read during the import, which resulted in the panic. We can work on adding additional sanity checking for block pointers and known object types, but I'd like to tackle that in a different issue. Given that, if there's nothing else to do here, can you please close this issue out?

tanabarr commented:

Happy to close. @behlendorf, labelclear worked nicely to resolve my issue, and we have root-caused why we needed it in the first place. Much appreciated.
