dmu_zfetch_run() panic #11936

Closed
ikozhukhov opened this issue Apr 24, 2021 · 5 comments · Fixed by #11998
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@ikozhukhov
Contributor

ikozhukhov commented Apr 24, 2021

System information

Type Version/Name
Distribution Name DilOS
Distribution Version
Linux Kernel
Architecture
ZFS Version
SPL Version

Describe the problem you're observing

We have 4 VMs running as VMware guests under VMware ESXi and 2 VMs running as bhyve guests, and all of them hit sporadic/periodic panics with the stack trace below in different scenarios. The panic is not triggered at the same point each time, but the instability is clearly there, and simply reverting commit 891568c (issue #11652) fixed it.

Describe how to reproduce the problem

Run the ZTS tests in a loop on several different VMs.

Include any warning/errors/backtraces from the system logs

panic[cpu0]/thread=fffffe0c1836b7e0:
mutex_enter: bad mutex, lp=fffffe0d4bb6a340 owner=deadbeefdeadbee8 thread=fffffe0c1836b7e0


fffffe001116b920 unix:mutex_panic+4a ()
fffffe001116b990 unix:mutex_vector_enter+3a7 ()
fffffe001116ba50 zfs:dmu_zfetch_run+ad ()
fffffe001116bb10 zfs:dmu_buf_hold_array_by_dnode+482 ()
fffffe001116bbd0 zfs:dmu_read_uio_dnode+86 ()
fffffe001116bc40 zfs:dmu_read_uio_dbuf+6f ()
fffffe001116bd20 zfs:zfs_read+5a0 ()
fffffe001116bdc0 genunix:fop_read+fd ()
fffffe001116bef0 genunix:pread+1de ()
fffffe001116bf00 unix:brand_sys_syscall+32f ()

dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel + curproc
@ikozhukhov ikozhukhov added Type: Defect Incorrect behavior (e.g. crash, hang) Status: Triage Needed New issue which needs to be triaged labels Apr 24, 2021
@ikozhukhov
Contributor Author

cc @behlendorf @ahrens

@behlendorf
Contributor

We haven't observed this on either Linux or FreeBSD to my knowledge. However, from the stack you've posted it looks like the zstream_t pointer being passed to dmu_zfetch_run() is referencing a freed zfetch_t structure. I'm not quite sure how that's possible, but if I had to make a guess I'd wager the issue is in dnode_move()->dnode_move_impl()->dmu_zfetch_fini(). This move functionality is disabled on both Linux and FreeBSD so this isn't a call path which would have been tested.

You could test this theory easily enough by commenting out the dnode_move() registration in dnode_init() for the purposes of a test.
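A side note on the `owner=deadbeefdeadbee8` value in the panic message: it is consistent with the lock living in freed memory. This assumes two things that are not stated in this thread: that illumos-style kmem debugging fills freed buffers with the pattern 0xdeadbeefdeadbeef, and that the mutex panic prints the owner word with its low three flag bits masked off. A minimal sketch of that arithmetic, using hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: fill pattern written over freed kmem buffers. */
#define	FREE_PATTERN	0xdeadbeefdeadbeefULL
/* Assumption: mask that strips the low flag bits from the owner word. */
#define	OWNER_MASK	(~(uint64_t)0x7)

/* Return the owner value as it would be reported for a raw lock word. */
static uint64_t
reported_owner(uint64_t lock_word)
{
	return (lock_word & OWNER_MASK);
}

/* Check the masked free pattern against the owner in the panic message. */
static void
check_reported_owner(void)
{
	assert(reported_owner(FREE_PATTERN) == 0xdeadbeefdeadbee8ULL);
}
```

If those assumptions hold, the reported owner is just the free pattern with its low bits cleared, which points at a use-after-free rather than a real thread owning the lock.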

cc: @amotin

@behlendorf behlendorf removed the Status: Triage Needed New issue which needs to be triaged label Apr 26, 2021
@behlendorf behlendorf changed the title from "degradation as periodic panic with commit https://github.com/openzfs/zfs/pull/11652" to "dmu_zfetch_run() panic" Apr 26, 2021
@amotin
Member

amotin commented May 3, 2021

The move functionality looks like one big bug in the context of zfetch. I looked at it briefly while trying to debug a previously reported issue, but then gave up, since that issue appeared to be different and this code is indeed not used on either Linux or FreeBSD. It would be good to check whether the dnode_move() code is indeed related and then try to sort it out somehow.

@amotin
Member

amotin commented May 4, 2021

@ikozhukhov Could you please test the linked PR?

@ikozhukhov
Contributor Author

@ikozhukhov Could you please test the linked PR?

Will do, thanks.
I have to finish some build fixes for zstd on SPARC first, and then I will return to this and my other PRs.

amotin added a commit to amotin/zfs that referenced this issue May 4, 2021
Previous code tried to keep prefetch streams while moving dnode.  But
it was at least not updating per-stream zs_fetchback pointers, causing
use-after-free on next access.  Instead of that I see much easier and
cleaner to just drop old prefetch state and start new from scratch.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#11936
behlendorf pushed a commit that referenced this issue May 7, 2021
Previous code tried to keep prefetch streams while moving dnode.  But
it was at least not updating per-stream zs_fetchback pointers, causing
use-after-free on next access.  Instead of that I see much easier and
cleaner to just drop old prefetch state and start new from scratch.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11936
Closes #11998
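
The failure mode described in the commit message can be illustrated with a toy model. This is only a sketch, not the actual OpenZFS code: the `struct node`, `struct fetch`, and `struct stream` types and all function and field names below are hypothetical stand-ins for the dnode, its prefetch state, and its prefetch streams. The first move function copies the stream list to the new node but leaves every stream's back pointer aimed at the old one; the second follows the approach of the fix by dropping the old prefetch state and starting fresh.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define	MAX_STREAMS	4

struct fetch;

/* Toy stand-in for a prefetch stream; field names are hypothetical. */
struct stream {
	struct fetch *back;	/* back pointer to the owning fetch state */
	int blkid;
};

/* Toy stand-in for per-dnode prefetch state. */
struct fetch {
	int nstreams;
	struct stream *streams[MAX_STREAMS];
};

/* Toy stand-in for a dnode that embeds its prefetch state. */
struct node {
	struct fetch zf;
};

static void
fetch_init(struct fetch *zf)
{
	memset(zf, 0, sizeof (*zf));
}

static void
fetch_add_stream(struct fetch *zf, int blkid)
{
	struct stream *zs;

	assert(zf->nstreams < MAX_STREAMS);
	zs = malloc(sizeof (*zs));
	assert(zs != NULL);
	zs->back = zf;
	zs->blkid = blkid;
	zf->streams[zf->nstreams++] = zs;
}

static void
fetch_fini(struct fetch *zf)
{
	for (int i = 0; i < zf->nstreams; i++)
		free(zf->streams[i]);
	fetch_init(zf);
}

/* Buggy move: keeps the streams but never updates their back pointers. */
static void
node_move_keep_streams(struct node *odn, struct node *ndn)
{
	ndn->zf = odn->zf;
	fetch_init(&odn->zf);	/* the old node no longer owns the streams */
}

/* Move in the spirit of the fix: drop the old state and start fresh. */
static void
node_move_drop_streams(struct node *odn, struct node *ndn)
{
	fetch_fini(&odn->zf);
	fetch_init(&ndn->zf);
}

/* Count streams whose back pointer does not refer to this node's state. */
static int
stale_streams(const struct node *dn)
{
	int stale = 0;

	for (int i = 0; i < dn->zf.nstreams; i++) {
		if (dn->zf.streams[i]->back != &dn->zf)
			stale++;
	}
	return (stale);
}
```

In this model, adding two streams to a node and then calling `node_move_keep_streams()` leaves `stale_streams()` reporting both streams as stale on the new node. Once the old node is freed, following either back pointer would be a use-after-free. With `node_move_drop_streams()` the new node simply starts with no streams, so there is nothing left to go stale, at the cost of losing the accumulated prefetch history.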
behlendorf pushed a commit to behlendorf/zfs that referenced this issue May 10, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021