assertion failure: bad sa_magic in zpl_get_file_info() #11433
I would assume that the problem is the assertion failure, and the "blocked" threads are a result of trying to continue on after the assertion failure.
On Tue, Jan 05, 2021 at 10:40:02AM -0800, Matthew Ahrens wrote:
> I would assume that the problem is the assertion failure, and the "blocked" threads are a result of trying to continue on after the assertion failure.
Maybe. I didn't look much through old bug reports (there are many).
Should I try a scrub? Anything else I should look for?
I could maybe try to reproduce the problem, but please send a list of what to collect. I can get a stack trace for the postgres backend, the kernel's wchan, and a list of /proc/pid FDs and inodes. Anything else?
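For reference, a rough collection checklist along those lines (a sketch only: the pid and pool name below are examples taken from later in this thread, and the exact set of items worth gathering is a guess rather than an upstream-prescribed list):

zpool status -v zfs2               # pool health and last scrub result
dmesg | tail -n 200                # recent PANIC / hung-task messages
sudo cat /proc/22365/wchan; echo   # kernel wait channel of the stuck process
sudo cat /proc/22365/stack         # kernel-side stack, readable even when gdb cannot attach
sudo ls -l /proc/22365/fd          # open file descriptors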
Since this happened twice within ~2 hours (while migrating data to zstd after upgrading to v2.0), I think that rules out a RAM issue. However, I checked and the machine appears to use two RAM models, both of which are ECC:
EBJ81RF4ECFA-DJ-F
36JSF1G72PZ-1G6M1
--
Justin
Is this likely to be relevant?
Last night I ran:
scan: scrub repaired 0B in 00:58:02 with 0 errors on Wed Jan 6 03:45:22 2021
Today this happened:
[94445.128283] VERIFY3(sa.sa_magic == SA_MAGIC) failed (1609959521 == 3100762)
[94445.133896] PANIC at zfs_quota.c:89:zpl_get_file_info()
[94445.135180] Showing stack for process 672
[94445.136079] CPU: 3 PID: 672 Comm: dp_sync_taskq Kdump: loaded Tainted: P OE ------------ 3.10.0-1127.10.1.el7.x86_64 #1
[94445.138971] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[94445.140733] Call Trace:
[94445.141328] [<ffffffff8377ffa5>] dump_stack+0x19/0x1b
[94445.142542] [<ffffffffc04e6c5b>] spl_dumpstack+0x2b/0x30 [spl]
[94445.144693] [<ffffffffc04e6d29>] spl_panic+0xc9/0x110 [spl]
[94445.146537] [<ffffffffc09e571a>] ? dsl_dir_diduse_space+0x17a/0x190 [zfs]
[94445.148055] [<ffffffff83225e9d>] ? __slab_free+0x9d/0x290
[94445.149296] [<ffffffffc09995b6>] ? dbuf_rele_and_unlock+0x286/0x600 [zfs]
[94445.150819] [<ffffffff83784022>] ? mutex_lock+0x12/0x2f
[94445.152073] [<ffffffffc0a7e8ef>] zpl_get_file_info+0x9f/0x250 [zfs]
[94445.153510] [<ffffffffc09ace4d>] dmu_objset_userquota_get_ids+0x11d/0x4e0 [zfs]
[94445.155249] [<ffffffffc09c544b>] dnode_sync+0x11b/0xac0 [zfs]
[94445.157228] [<ffffffff830c77ab>] ? autoremove_wake_function+0x2b/0x40
[94445.158699] [<ffffffffc09aa6e1>] sync_dnodes_task+0x61/0x150 [zfs]
[94445.162223] [<ffffffffc04ec516>] taskq_thread+0x2c6/0x520 [spl]
[94445.163907] [<ffffffff830db990>] ? wake_up_state+0x20/0x20
[94445.165701] [<ffffffffc04ec250>] ? taskq_thread_spawn+0x60/0x60 [spl]
[94445.167272] [<ffffffff830c6691>] kthread+0xd1/0xe0
[94445.168351] [<ffffffff830c65c0>] ? insert_kthread_work+0x40/0x40
[94445.170341] [<ffffffff83792d37>] ret_from_fork_nospec_begin+0x21/0x21
[94445.171876] [<ffffffff830c65c0>] ? insert_kthread_work+0x40/0x40
[94682.228593] INFO: task dp_sync_taskq:672 blocked for more than 120 seconds.
This process would have been rewriting tables (about one per second, the same as during the previous two freezes), but it has now been stuck for 10 minutes:
postgres 22365 3825 27 02:25 ? Ds 213:10 postgres: pryzbyj ts [local] ALTER TABLE
22365 do_last D ? 03:33:10 postgres: pryzbyj ts [local] ALTER TABLE
Its open files:
[pryzbyj@database7 ~]$ sudo ls -l /proc/22365/fd
total 0
lr-x------. 1 postgres postgres 64 Jan 6 02:26 0 -> /dev/null
l-wx------. 1 postgres postgres 64 Jan 6 02:26 1 -> pipe:[18047]
lrwx------. 1 postgres postgres 64 Jan 6 02:26 10 -> socket:[21852]
lrwx------. 1 postgres postgres 64 Jan 6 02:26 11 -> socket:[1604876]
lr-x------. 1 postgres postgres 64 Jan 6 02:26 12 -> pipe:[1603123]
l-wx------. 1 postgres postgres 64 Jan 6 02:26 13 -> pipe:[1603123]
lrwx------. 1 postgres postgres 64 Jan 6 10:10 14 -> /var/lib/pgsql/13/data/base/16630/946923787
lrwx------. 1 postgres postgres 64 Jan 6 10:10 15 -> /var/lib/pgsql/13/data/base/16630/946923745
lrwx------. 1 postgres postgres 64 Jan 6 10:10 16 -> /var/lib/pgsql/13/data/base/16630/421266385
lrwx------. 1 postgres postgres 64 Jan 6 10:10 17 -> /var/lib/pgsql/13/data/base/16630/946925850
lrwx------. 1 postgres postgres 64 Jan 6 10:10 18 -> /var/lib/pgsql/13/data/base/16630/946925842
lrwx------. 1 postgres postgres 64 Jan 6 10:10 19 -> /var/lib/pgsql/13/data/global/946927403
l-wx------. 1 postgres postgres 64 Jan 6 02:26 2 -> pipe:[18047]
lrwx------. 1 postgres postgres 64 Jan 6 10:10 20 -> /var/lib/pgsql/13/data/base/16630/946925890
lrwx------. 1 postgres postgres 64 Jan 6 10:10 21 -> /var/lib/pgsql/13/data/base/16630/946925894
lrwx------. 1 postgres postgres 64 Jan 6 10:10 22 -> /var/lib/pgsql/13/data/base/16630/946925893
lrwx------. 1 postgres postgres 64 Jan 6 10:10 23 -> /var/lib/pgsql/13/data/base/16630/946925831
lrwx------. 1 postgres postgres 64 Jan 6 10:10 24 -> /var/lib/pgsql/13/data/base/16630/946925831_fsm
lrwx------. 1 postgres postgres 64 Jan 6 10:10 25 -> /zfs2/metric_table/PG_13_202007201/16630/981098328
lrwx------. 1 postgres postgres 64 Jan 6 10:10 26 -> /zfs2/oldtables/PG_13_202007201/16630/443019002
lr-x------. 1 postgres postgres 64 Jan 6 02:26 3 -> pipe:[18046]
lrwx------. 1 postgres postgres 64 Jan 6 02:26 4 -> anon_inode:[eventpoll]
lrwx------. 1 postgres postgres 64 Jan 6 10:10 5 -> /var/lib/pgsql/13/data/base/16630/946925827
lrwx------. 1 postgres postgres 64 Jan 6 10:10 6 -> /var/lib/pgsql/13/data/base/16630/946925777
lrwx------. 1 postgres postgres 64 Jan 6 10:10 7 -> /var/lib/pgsql/13/data/pg_wal/000000010000510A00000081
lrwx------. 1 postgres postgres 64 Jan 6 10:10 8 -> /var/lib/pgsql/13/data/base/16630/946925827_vm
lrwx------. 1 postgres postgres 64 Jan 6 10:10 9 -> /var/lib/pgsql/13/data/base/16630/946925827_fsm
Unfortunately, I cannot get its stacktrace (gdb freezes).
[pryzbyj@database7 ~]$ sudo ls -ldi /zfs2/metric_table/PG_13_202007201/16630/981098328 /zfs2/oldtables/PG_13_202007201/16630/443019002
65904 -rw-------. 1 postgres postgres 34111488 Jan 6 14:00 /zfs2/metric_table/PG_13_202007201/16630/981098328
209018 -rw-r-----. 1 postgres postgres 34111488 Dec 2 00:38 /zfs2/oldtables/PG_13_202007201/16630/443019002
zdb fails like this:
[pryzbyj@database7 ~]$ sudo zdb zfs2 65904 >zdb1.out
zdb: dmu_object_info() failed, errno 2
[pryzbyj@database7 ~]$ sudo zdb zfs2 209018 >zdb2.out
zdb: dmu_object_info() failed, errno 2
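For what it's worth, errno 2 is ENOENT, and with zdb that can simply mean the object was looked up in an objset other than the one holding the file; the inode numbers from ls -i are per-filesystem object numbers, and "zdb zfs2 <object>" does not point zdb at the child filesystems these files live in. Assuming the datasets are named after their mountpoints (my assumption), naming them explicitly might get further:

sudo zdb -dddd zfs2/metric_table 65904
sudo zdb -dddd zfs2/oldtables 209018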
I've saved the full output but now have to reboot.
--
Justin
Something very similar to this has been triggered by
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I've twice today seen an issue where ZFS freezes (the process cannot be killed and the VM had to be rebooted). This happened while moving files from one postgres tablespace on a ZFS filesystem to another filesystem on the same zpool (ALTER TABLE .. SET TABLESPACE ..). This involves first copying the file and then unlinking it; I think we would've been doing about one file per second. The original files were written with compression=gzip-1, and the new files are being written with compression=zstd.
This is a CentOS 7 QEMU/KVM VM with ZFS. The VM runs on an Ubuntu hypervisor, which uses LVM; the LVs are exposed as block devices to QEMU with cache=none, and those block devices are the zpool's vdevs. I realize this storage configuration is discouraged.
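As a sanity check on the setup described above, the compression settings of the source and destination filesystems could be confirmed with something like the following (the dataset names are guesses inferred from the mountpoints that appear elsewhere in this thread):

zfs get compression zfs2/oldtables zfs2/metric_table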
System information
Type | Linux
Distribution Name | CentOS
Distribution Version | 7.8
Linux Kernel | kernel-3.10.0-1127.10.1.el7.x86_64
Architecture | x86_64
ZFS Version | kmod-zfs-0.8.4-1.el7.x86_64
SPL Version | 2.0.0-1