FreeBSD ZFS panic due to corrupt AVL tree. #15271

Open
VoxSciurorum opened this issue Sep 13, 2023 · 0 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

System information

Type                  Version/Name
Distribution Name     FreeBSD
Distribution Version  13.2-CURRENT
Kernel Version
Architecture          amd64
OpenZFS Version       zfs-2.1.12-FreeBSD_g86783d7d9

Also seen in January 2023 on FreeBSD 13.1-CURRENT.

Describe the problem you're observing

Server has crashed several times during a zpool scrub due to a corrupt in-memory AVL tree.

See also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268909

Describe how to reproduce the problem

Run zpool scrub on my system. Not very useful to ZFS developers, I understand.

Include any warning/errors/backtraces from the system logs

The symptom is a general protection fault (GPF) caused by a bad pointer: null in one crash, an invalid address in the other.

In the most recent crash, avl_walk was called with a node that looks like this:

{avl_child = {0x0, 0xfffff80200004d20}, avl_pcb = 0xfffff801f1c461fa}

The [1] child points to this node:

{avl_child = {0x395753c375b177a6, 0xfa91e69b009252c}, avl_pcb = 0xfffff801476764a6}

The parent link in avl_pcb is correct, but both child pointers are invalid. The loop crashed trying to examine the [0] child.
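
For anyone reading the dumps above, here is a minimal sketch of how avl_pcb can be decoded. It assumes the LP64 packing used by the AVL_XPARENT / AVL_XCHILD / AVL_XBALANCE macros in avl_impl.h (parent pointer in the upper bits, child index in bit 2, balance + 1 in the low two bits). The helper names are hypothetical, and the canonical-address test assumes 48-bit virtual addresses on amd64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space helpers for reading the dump; not part of OpenZFS. */

/* Parent pointer: avl_pcb with the low three bits cleared. */
static uint64_t pcb_parent(uint64_t pcb) { return pcb & ~(uint64_t)7; }

/* Which child of its parent this node is (0 or 1): stored in bit 2. */
static int pcb_child_index(uint64_t pcb) { return (int)((pcb >> 2) & 1); }

/* Balance factor (-1, 0, +1): the low two bits hold balance + 1. */
static int pcb_balance(uint64_t pcb) { return (int)(pcb & 3) - 1; }

/* Assumes 48-bit virtual addresses: bits 63..47 must be all 0 or all 1. */
static int is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1ffff;
}

/* Values copied from the two nodes quoted above. */
static void check_reported_nodes(void)
{
    /* Second node: parent link and child slot. */
    assert(pcb_parent(0xfffff801476764a6ULL) == 0xfffff801476764a0ULL);
    assert(pcb_child_index(0xfffff801476764a6ULL) == 1);
    /* Its two child pointers are not canonical addresses. */
    assert(!is_canonical48(0x395753c375b177a6ULL));
    assert(!is_canonical48(0x0fa91e69b009252cULL));
    /* The first node's [1] child is at least a well-formed kernel address. */
    assert(is_canonical48(0xfffff80200004d20ULL));
}
```

Under those assumptions the second node's parent decodes to 0xfffff801476764a0 with child index 1, which is 0x60 past the oldnode value in the avl_walk frame of the stack trace further down. Both of its child pointers are non-canonical, which on amd64 raises a general protection fault rather than a page fault when dereferenced.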

In a previous crash, also during a scrub, avl_rotation crashed because gchild was null in this block of code:

        gchild = child->avl_child[right];
        gleft = gchild->avl_child[left];
        gright = gchild->avl_child[right];
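
A null gchild at this point would suggest the tree was already inconsistent before the rotation started: in a well-formed AVL tree, a node whose balance factor says one side is taller must have a child on that side. The following is a minimal user-space sketch of that kind of consistency check, not code from avl.c. The struct, the sk_ names, and the return codes are hypothetical, and the packed parent/index/balance word assumes an LP64 layout like the one in avl_impl.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's AVL node (assumption: LP64). */
struct sk_node {
    struct sk_node *child[2];
    uintptr_t pcb;  /* parent pointer | child index << 2 | balance + 1 */
};

enum { SK_OK = 0, SK_BAD_BALANCE, SK_MISSING_TALL_CHILD, SK_BAD_BACKLINK };

/* Build the packed word; the parent must be 8-byte aligned. */
static uintptr_t sk_pack(const struct sk_node *parent, int idx, int balance)
{
    assert(((uintptr_t)parent & 7) == 0);
    return (uintptr_t)parent | ((uintptr_t)idx << 2) | (uintptr_t)(balance + 1);
}

/* Check one node against the invariants a rotation relies on. */
static int sk_check(const struct sk_node *n)
{
    int bal = (int)(n->pcb & 3) - 1;

    /* Only -1, 0 and +1 are valid stored balance factors. */
    if (bal < -1 || bal > 1)
        return SK_BAD_BALANCE;

    /* A node leaning to one side must have a child on that side. */
    if (bal != 0 && n->child[bal > 0] == NULL)
        return SK_MISSING_TALL_CHILD;

    /* Each child must point back at this node with the right slot index. */
    for (int i = 0; i < 2; i++) {
        const struct sk_node *c = n->child[i];
        if (c == NULL)
            continue;
        if ((c->pcb & ~(uintptr_t)7) != (uintptr_t)n ||
            (int)((c->pcb >> 2) & 1) != i)
            return SK_BAD_BACKLINK;
    }
    return SK_OK;
}
```

A check like this only reports that a node is inconsistent; it does not say whether the cause is a logic bug, a race, or memory being overwritten from elsewhere.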

This is the bad pool:

NAME         SIZE  ALLOC   FREE  FRAG    CAP  DEDUP    HEALTH  ALTROOT
data        36.4T  20.4T  15.9T   34%    56%  1.17x    ONLINE  -
  raidz2-0  36.4T  20.4T  15.9T   34%  56.2%      -    ONLINE
    ada0    9.10T      -      -     -      -      -    ONLINE
    ada1    9.10T      -      -     -      -      -    ONLINE
    ada2    9.10T      -      -     -      -      -    ONLINE
    ada3    9.10T      -      -     -      -      -    ONLINE
cache           -      -      -     -      -      -  -
  ada4p5     150G   143G  6.69G    0%  95.5%      -    ONLINE

The cache partition is on an SSD. The other disks are spinning hard drives.

The largest filesystem has encryption=aes-256-gcm and dedup=on.
The other filesystems have dedup=verify and no encryption.

The CPU is an AMD Opteron X3421 ("Excavator") and the system is compiled with -march=bdver4.

The interesting part of the stack trace is:

#7  avl_walk (tree=tree@entry=0xfffff80009178260, 
    oldnode=oldnode@entry=0xfffff80147676440, left=left@entry=1)
    at /usr/src/sys/contrib/openzfs/module/avl/avl.c:147
#8  0xffffffff81c1bea5 in scan_io_queue_gather (queue=0xfffff80009178200, 
    list=0xfffffe010f60eda8, rs=<optimized out>)
    at /usr/src/sys/contrib/openzfs/module/zfs/dsl_scan.c:2942
#9  scan_io_queues_run_one (arg=0xfffff80009178200)
    at /usr/src/sys/contrib/openzfs/module/zfs/dsl_scan.c:3093
#10 0xffffffff81b41bbf in taskq_run (arg=0xfffff80041735d80, 
    pending=<optimized out>)
    at /usr/src/sys/contrib/openzfs/module/os/freebsd/spl/spl_taskq.c:315