(zap.c:579:zap_deref_leaf()) VERIFY3(zap->zap_u.zap_fat.zap_phys->zap_magic == 0x2F52AB2ABULL) #1445

Closed · chrisrd opened this issue May 7, 2013 · 8 comments

@chrisrd (Contributor) commented May 7, 2013:

linux-3.8.10, openzfs/spl@352bd19, 57f5a20

The system was largely unresponsive (very high load, following a large number of "remove xattr, unlink file" sequences on a xattr=dir file system, ref: #457), eventually leading to a power cycle. Following the reboot, I have one file system producing this when trying to mount:

May  7 08:00:43 b5 kernel: [  817.900053] SPLError: 15166:0:(zap.c:579:zap_deref_leaf()) VERIFY3(zap->zap_u.zap_fat.zap_phys->zap_magic == 0x2F52AB2ABULL) failed (0 == 12703150763)
May  7 08:00:43 b5 kernel: [  817.900148] SPLError: 15166:0:(zap.c:579:zap_deref_leaf()) SPL PANIC
May  7 08:00:43 b5 kernel: [  817.900206] SPL: Showing stack for process 15166
May  7 08:00:43 b5 kernel: [  817.900259] Pid: 15166, comm: mount.zfs Tainted: G           O 3.8.8-otn-00007-g5914d52 #1
May  7 08:00:43 b5 kernel: [  817.900323] Call Trace:
May  7 08:00:43 b5 kernel: [  817.900387]  [<ffffffffa0e45557>] spl_debug_dumpstack+0x27/0x40 [spl]
May  7 08:00:43 b5 kernel: [  817.900449]  [<ffffffffa0e466cf>] spl_debug_bug+0x7f/0xe0 [spl]
May  7 08:00:43 b5 kernel: [  817.900548]  [<ffffffffa07b8251>] zap_deref_leaf+0xa1/0x1e0 [zfs]
May  7 08:00:43 b5 kernel: [  817.900632]  [<ffffffffa07bf6cd>] ? zap_lockdir+0x12d/0xd80 [zfs]
May  7 08:00:43 b5 kernel: [  817.900713]  [<ffffffffa0784086>] ? sa_lookup+0x66/0x120 [zfs]
May  7 08:00:43 b5 kernel: [  817.900796]  [<ffffffffa07baac1>] fzap_cursor_retrieve+0xe1/0x3f0 [zfs]
May  7 08:00:43 b5 kernel: [  817.900881]  [<ffffffffa07c079b>] zap_cursor_retrieve+0x6b/0x450 [zfs]
May  7 08:00:43 b5 kernel: [  817.900968]  [<ffffffffa07f1204>] ? zfs_znode_alloc+0x4f4/0x7f0 [zfs]
May  7 08:00:43 b5 kernel: [  817.901052]  [<ffffffffa07cc90e>] zfs_purgedir+0xae/0x270 [zfs]
May  7 08:00:43 b5 kernel: [  817.901136]  [<ffffffffa07f461d>] ? zfs_zget+0x39d/0x3d0 [zfs]
May  7 08:00:43 b5 kernel: [  817.901218]  [<ffffffffa07ccbce>] zfs_unlinked_drain+0xfe/0x150 [zfs]
May  7 08:00:43 b5 kernel: [  817.901282]  [<ffffffffa0e50009>] ? taskq_create+0x309/0x4c0 [spl]
May  7 08:00:43 b5 kernel: [  817.901362]  [<ffffffffa076e4c4>] ? dsl_prop_register+0x454/0x560 [zfs]
May  7 08:00:43 b5 kernel: [  817.901424]  [<ffffffff81063d70>] ? finish_wait+0x80/0x80
May  7 08:00:43 b5 kernel: [  817.901529]  [<ffffffffa07eebb0>] ? zfs_rmdir+0x9f0/0x9f0 [zfs]
May  7 08:00:43 b5 kernel: [  817.901636]  [<ffffffffa07fb70f>] ? zil_open+0x6f/0x160 [zfs]
May  7 08:00:43 b5 kernel: [  817.901742]  [<ffffffffa07e2ae0>] zfs_sb_setup+0x120/0x1c0 [zfs]
May  7 08:00:43 b5 kernel: [  817.901848]  [<ffffffffa07e46f0>] zfs_domount+0x240/0x2e0 [zfs]
May  7 08:00:43 b5 kernel: [  817.901954]  [<ffffffffa080deb0>] ? zpl_mount+0x30/0x30 [zfs]
May  7 08:00:43 b5 kernel: [  817.902059]  [<ffffffffa080dec3>] zpl_fill_super+0x13/0x80 [zfs]
May  7 08:00:43 b5 kernel: [  817.902165]  [<ffffffffa080deb0>] ? zpl_mount+0x30/0x30 [zfs]
May  7 08:00:43 b5 kernel: [  817.902248]  [<ffffffff8113e686>] mount_nodev+0x56/0xb0
May  7 08:00:43 b5 kernel: [  817.902352]  [<ffffffffa080dea5>] zpl_mount+0x25/0x30 [zfs]
May  7 08:00:43 b5 kernel: [  817.902494]  [<ffffffff8113f5c0>] mount_fs+0x20/0xd0
May  7 08:00:43 b5 kernel: [  817.902576]  [<ffffffff81159174>] vfs_kern_mount+0x74/0x110
May  7 08:00:43 b5 kernel: [  817.902657]  [<ffffffff8115b697>] do_mount+0x357/0xa60
May  7 08:00:43 b5 kernel: [  817.902737]  [<ffffffff810e620e>] ? __get_free_pages+0xe/0x50
May  7 08:00:43 b5 kernel: [  817.902817]  [<ffffffff8115b1aa>] ? copy_mount_options+0x3a/0x180
May  7 08:00:43 b5 kernel: [  817.902899]  [<ffffffff8115be2e>] sys_mount+0x8e/0xe0
May  7 08:00:43 b5 kernel: [  817.902980]  [<ffffffff813a2e59>] system_call_fastpath+0x16/0x1b

A similar problem was seen in #1242 (see the second trace there).
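
For reference, the constant in the failed VERIFY3 is the fat-ZAP on-disk magic (ZAP_MAGIC, defined in zap_impl.h); the 0 on the left-hand side is the zap_magic field as actually read from disk:

/* From zap_impl.h (quoted for reference): */
#define ZAP_MAGIC 0x2F52AB2ABULL   /* == 12703150763 decimal, the right-hand value in the log above */
/* The observed left-hand value was 0, i.e. the block no longer carried a valid fat-ZAP header. */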

@chrisrd (Contributor, Author) commented May 7, 2013:

That zfs_unlinked_drain indicates the problem is occurring whilst trying to clean up the znodes left over after the power cycle.

Any ideas on what I can do to work out what the problem is, and/or get this file system cleanly mounted?

@chrisrd (Contributor, Author) commented May 7, 2013:

Well, I seem to have fixed it, at least to the point that the fs mounts. However, I'm not sure if the fs is now really clean, or if it has some more subtle corruption which hasn't hit me yet. My kingdom for an fsck.zfs!

This got it mounting...

I applied this patch (available at chrisrd/zfs@920d03a):

--- a/module/zfs/zap.c
+++ b/module/zfs/zap.c
@@ -576,7 +576,14 @@ zap_deref_leaf(zap_t *zap, uint64_t h, dmu_tx_t *tx, krw_t lt, zap_leaf_t **lp)

    ASSERT(zap->zap_dbuf == NULL ||
        zap->zap_f.zap_phys == zap->zap_dbuf->db_data);
-   ASSERT3U(zap->zap_f.zap_phys->zap_magic, ==, ZAP_MAGIC);
+   if (zap->zap_f.zap_phys->zap_magic != ZAP_MAGIC) {
+       zfs_panic_recover("zfs: "
+           "zap->zap_u.zap_fat.zap_phys->zap_magic == 0x2F52AB2ABULL "
+           "failed (%lu == %lu) for object %lu",
+           zap->zap_u.zap_fat.zap_phys->zap_magic, 0x2F52AB2ABULL,
+           zap->zap_object);
+       return (EFAULT);
+   }
    idx = ZAP_HASH_IDX(h, zap->zap_f.zap_phys->zap_ptrtbl.zt_shift);
    err = zap_idx_to_blk(zap, idx, &blk);
    if (err != 0)

...then, after loading the new module, I enabled the zfs_recover variable like this:

stap -g -p4 -m stap_toggle_zfs_recover -r ~/linux-build/3.8.10 -e ' 
  probe module("zfs").function("zfs_domount").call {
    @var("zfs_recover@module/zfs/zfs_debug.c") = 1;
    exit();
  }
'

staprun stap_toggle_zfs_recover.ko -c 'zfs mount tank/fs'

Which resulted in:

[ 2080.011111] SPLError: 15697:0:(spl-err.c:67:vcmn_err()) WARNING: zfs: zap->zap_u.zap_fat.zap_phys->zap_magic == 0x2F52AB2ABULL failed (0 == 12703150763) for object 1161831793

...but the fs mounted after some delay (still processing further left-over znodes?) without any further (obvious) issues. The fs now unmounts and mounts without any complaints.

I used a slightly modified systemtap script to set zfs_recover back to zero.
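
A sketch of what that reset script looks like (the stap module name here is illustrative; everything else mirrors the script above), triggered by a subsequent mount:

stap -g -p4 -m stap_toggle_zfs_recover_off -r ~/linux-build/3.8.10 -e '
  probe module("zfs").function("zfs_domount").call {
    @var("zfs_recover@module/zfs/zfs_debug.c") = 0;
    exit();
  }
'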

@behlendorf (Contributor) commented:

@chrisrd I'm glad you got it sorted out. What I don't understand is how that ZAP leaf block got damaged like this, even with the node being power cycled. However, clearly there's some sort of bug here.

I wouldn't worry too much about there being any additional undetected damage to the filesystem. It sounds like any damage was contained to that one object. It would be interesting to use zdb to dump object 1161831793, if it still exists, and take a closer look at it.

@chrisrd (Contributor, Author) commented May 16, 2013:

I've not played with zdb much, so I'm not really sure what I'm looking for, but something like this?

# zdb -dddddddd tank/fs 1161831793
Dataset tank/fs [ZPL], ID 220, cr_txg 184147, 90.2T, 377999694 objects, rootbp DVA[0]=<1:1a26b9deac00:600> DVA[1]=<2:1183cb17e00:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=6696666L/6696666P fill=377999694 cksum=1d3ef91dd6:81afc38c893:15617552c6322:2a640d09131f4b

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
zdb: dmu_bonus_hold(1161831793) failed, errno 2

@chrisrd (Contributor, Author) commented May 16, 2013:

...and I see errno 2 is ENOENT, so I'm guessing this object simply doesn't exist anymore.

@behlendorf (Contributor) commented:

Yup, that's exactly what it means. Thanks for checking. Even though you fixed your pool, let's leave the bug open to track this issue until it gets resolved.

@behlendorf removed this from the 0.6.7 milestone Oct 16, 2014

@ryao (Contributor) commented Feb 26, 2015:

@behlendorf This just occurred on my workstation yesterday evening during system updates. I am archiving a disk image of the root pool before attempting to repair it. I can test any patches that are aimed at fixing this on a copy of the image.

behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 27, 2015
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1445
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 3, 2015
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1445
DeHackEd pushed a commit to DeHackEd/zfs that referenced this issue Apr 4, 2015
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1445
DeHackEd pushed a commit to DeHackEd/zfs that referenced this issue Apr 5, 2015
There are a handful of ASSERT(!"...")'s throughout the code base for
cases which should be impossible.  This patch converts them to use
cmn_err(CE_PANIC, ...) to ensure they are always enabled and so that
additional debugging is logged if they were to occur.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1445
@behlendorf (Contributor) commented:

This issue was resolved by commit 29572cc.
