NULL pointer dereference in dva_get_dsize_sync #1891
Comments
|
At first glance, this would certainly seem to indicate a corrupted pool. It looks like one of the DVAs refers to a bogus vdev or is just totally corrupted in some other way. Have you tried scrubbing the pool? Does |
|
Scrubbing the pool panicked the kernel in an (at least superficially) identical way. zpool status never complained, but it may not have gotten a chance to. I have had to nuke the pool to get our machines back into production, but I can try to reproduce it on a different set of hardware if that'd be desirable. |
|
Does this machine have ECC memory? This is likely a case in which vdev_lookup_top() is returning NULL because the DVA has a bogus vdev id. I presume you ran into this problem across reboots which means the corruption is on-disk as opposed to in-memory. Unfortunately, tracking down this type of problem would likely require adding some debugging code and a bit of time. |
|
The machine does have ECC, yes. And yes, this survived reboot, so it's an on-disk thing. I don't have a minimal test case yet but the trigger seems to involve switching from xattr=on to xattr=sa with an already-riddled-with-xattrs filesystem and doing a lot more xattr work (we were using these nodes to back Ceph, which makes (indefensible, IMHO, but) extensive use of xattrs). |
|
I am seeing the same issue on one of my systems. I built ZFS from the git tree to support the 3.12.1 kernel running on Ubuntu 13.10. It reports during boot: ZFS: Loaded module v0.6.2-111_g119a394. My system is an HP DL360 G5 with ECC memory. As with nfw, the problem survives a reboot, and I am also using it for Ceph with the -o xattr=sa -o atime=off options. The ZFS file system is on a single SAS drive. I tried the zpool scrub command on this pool, which hung the terminal session, and I could not get out of it with Ctrl-C or Ctrl-Z. I can scrub other pools on this system. zpool status shows: pool: tca18_wwn-0x600508b1001034313320202020200008 |
|
I've seen this today on a machine running 3.13.5-200.fc20.x86_64 with v0.6.2-195_g0ad85ed, one RAIDZ2 pool with 6 disks. The only significant recent change was that I enabled acltype=posixacl on a dataset and started using them (setfacl). The parent dataset uses xattr=sa. The machine does not have ECC. The pool was imported successfully after a reset, scrub went through without any errors. |
|
I have just run across this issue testing current master with lustre master (http://review.whamcloud.com/#/c/8979/6). Crash occurred on MDS https://maloo.whamcloud.com/test_sets/c6c513c0-cacc-11e3-8f53-52540035b04c |
The dva_get_dsize_sync() function incorrectly assumes that the call
to vdev_lookup_top() cannot fail. However, the NULL dereference below
clearly shows that under certain circumstances it can. Note
that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio.
BUG: unable to handle kernel NULL pointer dereference at 00000570
crash> struct -o vdev
struct vdev {
[0] uint64_t vdev_id;
... ...
[1376] uint64_t vdev_deflate_ratio;
Given that this can happen, this patch adds the required error handling.
In the case where vdev_lookup_top() fails assume that no deflation
will occur for the DVA and use the asize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1707
Issue openzfs#1987
Issue openzfs#1891
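For readers following along, the fallback described in the commit message can be sketched as below. This is a simplified, self-contained model and not the actual ZFS source: every `toy_`-prefixed type, function and macro is a hypothetical stand-in for the real spa/vdev/DVA structures, and the deflation arithmetic is only assumed to resemble what the real function does. The point it illustrates is the NULL check: when the lookup of the top-level vdev fails, the allocated size is returned unchanged instead of dereferencing a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 512-byte minimum block, mirroring SPA_MINBLOCKSHIFT. */
#define	TOY_MINBLOCKSHIFT	9

/* Hypothetical, heavily simplified stand-ins for vdev_t, spa_t and dva_t. */
typedef struct toy_vdev {
	uint64_t vdev_id;
	uint64_t vdev_deflate_ratio;
} toy_vdev_t;

typedef struct toy_spa {
	toy_vdev_t *children;	/* array of top-level vdevs */
	uint64_t nchildren;
	int deflate;		/* non-zero when deflation is enabled */
} toy_spa_t;

typedef struct toy_dva {
	uint64_t vdev;		/* top-level vdev id recorded in the DVA */
	uint64_t asize;		/* allocated size in bytes */
} toy_dva_t;

/* Returns NULL when the DVA names a vdev id that does not exist. */
static toy_vdev_t *
toy_vdev_lookup_top(toy_spa_t *spa, uint64_t vdev)
{
	if (vdev < spa->nchildren)
		return (&spa->children[vdev]);
	return (NULL);
}

/* Deflated size of a DVA; falls back to asize if the vdev lookup fails. */
static uint64_t
toy_dva_get_dsize(toy_spa_t *spa, const toy_dva_t *dva)
{
	uint64_t asize, dsize;

	assert(spa != NULL && dva != NULL);

	asize = dva->asize;
	dsize = asize;		/* default: assume no deflation */

	if (asize != 0 && spa->deflate) {
		toy_vdev_t *vd = toy_vdev_lookup_top(spa, dva->vdev);

		/* Without this check a bogus vdev id dereferences NULL. */
		if (vd != NULL)
			dsize = (asize >> TOY_MINBLOCKSHIFT) *
			    vd->vdev_deflate_ratio;
	}

	return (dsize);
}
```

With a one-vdev pool whose ratio is 256, a 4096-byte DVA on vdev 0 yields 8 * 256 in this model, while the same DVA pointing at a nonexistent vdev 7 simply yields 4096. Note that this only avoids the panic; it does not explain how the bogus vdev id ended up on disk.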
Running Ubuntu's nightly tree for saucy (i.e. spl 0.6.2-2
saucy2.gbpc60af6 and zfs 0.6.2-2
saucy2.gbp46f6df), I tripped over a NULL pointer dereference in dva_get_dsize_sync.
The machine is a Sun X4500 with 16G of RAM and the pool in question is a
RAIDZ1 on 11 spinning-rust 1TB disks with 2 hot spares.
Possibly relevant is that the pool underlies a ceph osd and that the dataset
had been filled with xattr=on and had just recently been flipped to
xattr=sa. The other three pools in the machine have had xattr similarly
changed and have yet to hit this, it seems, so it's not a certain thing.
If this is the wrong place for such a bug report (I did not immediately find
an Ubuntu-specific one) please flame gently. :)