SPLError: 2657:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC #352

Closed
ltaulell opened this issue Apr 16, 2014 · 5 comments

@ltaulell
SPLError: 2657:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC

From production servers (HPC center, /home NFS servers) that had been running for about a year and a half without problems, we recently started getting these messages (here is one I was able to save):

Mar 31 19:30:59 r720data3 kernel: [ 7563.266511] VERIFY3(sa.sa_magic == 0x2F505A) failed (1383495966 == 3100762)
Mar 31 19:30:59 r720data3 kernel: [ 7563.266599] SPLError: 2593:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC
Mar 31 19:30:59 r720data3 kernel: [ 7563.266630] SPL: Showing stack for process 2593
Mar 31 19:30:59 r720data3 kernel: [ 7563.266639] Pid: 2593, comm: txg_sync Tainted: P W O 3.2.0-4-amd64 #1 Debian 3.2.54-2
Mar 31 19:30:59 r720data3 kernel: [ 7563.266644] Call Trace:
Mar 31 19:30:59 r720data3 kernel: [ 7563.266730] [] ? spl_debug_dumpstack+0x24/0x2a [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266737] [] ? spl_debug_bug+0x7f/0xc8 [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266767] [] ? zfs_space_delta_cb+0xcf/0x150 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266782] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266798] [] ? dmu_objset_userquota_get_ids+0x1b4/0x2ae [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266805] [] ? mutex_lock+0xd/0x2d
Mar 31 19:30:59 r720data3 kernel: [ 7563.266808] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266823] [] ? dnode_sync+0x8d/0x78a [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266833] [] ? buf_hash_remove+0x65/0x91 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266837] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266841] [] ? _cond_resched+0x7/0x1c
Mar 31 19:30:59 r720data3 kernel: [ 7563.266844] [] ? mutex_lock+0xd/0x2d
Mar 31 19:30:59 r720data3 kernel: [ 7563.266857] [] ? dmu_objset_sync_dnodes+0x6f/0x88 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266869] [] ? dmu_objset_sync+0x1f3/0x263 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266878] [] ? arc_cksum_compute+0x83/0x83 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266887] [] ? arc_hdr_destroy+0x1b6/0x1b6 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266903] [] ? dsl_pool_sync+0xbf/0x475 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266923] [] ? spa_sync+0x4f4/0x90b [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266927] [] ? ktime_get_ts+0x5c/0x82
Mar 31 19:30:59 r720data3 kernel: [ 7563.266949] [] ? txg_sync_thread+0x2bd/0x49a [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266969] [] ? txg_thread_wait.isra.2+0x23/0x23 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266975] [] ? thread_generic_wrapper+0x6a/0x75 [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266981] [] ? __thread_create+0x2be/0x2be [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266986] [] ? kthread+0x76/0x7e
Mar 31 19:30:59 r720data3 kernel: [ 7563.266991] [] ? kernel_thread_helper+0x4/0x10
Mar 31 19:30:59 r720data3 kernel: [ 7563.266994] [] ? kthread_worker_fn+0x139/0x139
Mar 31 19:30:59 r720data3 kernel: [ 7563.266997] [] ? gs_change+0x13/0x13

After that, all txg_sync threads hang, all knfsd threads are hung and uninterruptible, and the load average goes through the roof => hard reboot.

I can't reproduce the bug on demand, but it appears randomly under NFS load (on 3 different servers, all with the same hardware and software configuration): once a week on one server, twice a day on another.
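
(In case it helps, next time one of them hangs I could try to capture the blocked-task stacks before the hard reboot, along these lines, assuming the magic SysRq key is enabled on these boxes:)

echo w > /proc/sysrq-trigger    # dump stacks of all tasks stuck in uninterruptible (D) state to the kernel log
dmesg | tail -n 300             # or check /var/log/kern.log for the resulting traces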

I scrubbed all pools after the hangs/reboots; every scrub came back with "No known data errors".

Data were imported from older pools (Solaris x86 -> Debian x86_64) via zfs send/recv, then upgraded (via zpool/zfs upgrade).
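
Roughly, the migration looked like this (a sketch, not the exact commands used; the snapshot name is made up):

# on the old Solaris host: send a full replication stream of the datasets
zfs snapshot -r baie1/users@migrate
zfs send -R baie1/users@migrate | ssh r720data3 zfs recv -d baie1
# on the new Debian/ZoL host: bring pool and filesystems up to the current versions
zpool upgrade baie1
zfs upgrade -r baie1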

Maybe related to openzfs/zfs#1303 and openzfs/zfs#2025?
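
(Possibly relevant: the bogus magic value 1383495966 decodes as a Unix timestamp from early November 2013, i.e. it looks like a file atime/ctime rather than random corruption, which would fit the znode-vs-SA bonus layout mix-up discussed in openzfs/zfs#2025:)

date -u -d @1383495966
# prints: Sun Nov  3 16:26:06 UTC 2013   (the expected SA magic is 0x2F505A == 3100762)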

  • bare metal:
    Dell PE r720xd
    2x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (32 logical cores)
    192 GB RAM
  • 2x HBA SAS
    03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon](rev 03)
    04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon](rev 03)
  • 3x MD1200 (12x 4 TB each; all pools are built the same way)
  • OS:
    Debian 7.4 + multipathd (0.4.9+git0.4dfdaf2b-7deb7u2) + debian-zfs (0.6.2-4wheezy) from the ZoL repository

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)

  • pools

NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
baie1 43,5T 21,6T 21,9T 49% 1.00x ONLINE -
baie2 43,5T 8,43T 35,1T 19% 1.00x ONLINE -
baie3 21,8T 7,57T 14,2T 34% 1.00x ONLINE -
front1 21,8T 561G 21,2T 2% 1.00x ONLINE -

zpool status baie1
pool: baie1
state: ONLINE
scan: scrub repaired 0 in 17h17m with 0 errors on Sat Mar 29 05:30:59 2014
config:

    NAME        STATE     READ WRITE CKSUM
    baie1       ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        B1D0    ONLINE       0     0     0
        B1D1    ONLINE       0     0     0
        B1D2    ONLINE       0     0     0
        B1D3    ONLINE       0     0     0
        B1D4    ONLINE       0     0     0
        B1D5    ONLINE       0     0     0
      raidz2-1  ONLINE       0     0     0
        B1D6    ONLINE       0     0     0
        B1D7    ONLINE       0     0     0
        B1D8    ONLINE       0     0     0
        B1D9    ONLINE       0     0     0
        B1D10   ONLINE       0     0     0
        B1D11   ONLINE       0     0     0

errors: No known data errors

zpool get all baie1
NAME PROPERTY VALUE SOURCE
baie1 size 43,5T -
baie1 capacity 49% -
baie1 altroot - default
baie1 health ONLINE -
baie1 guid 14312441928248404290 default
baie1 version - default
baie1 bootfs - default
baie1 delegation on default
baie1 autoreplace off default
baie1 cachefile - default
baie1 failmode wait default
baie1 listsnapshots off default
baie1 autoexpand off default
baie1 dedupditto 0 default
baie1 dedupratio 1.00x -
baie1 free 21,9T -
baie1 allocated 21,6T -
baie1 readonly off -
baie1 ashift 0 default
baie1 comment - default
baie1 expandsize 0 -
baie1 freeing 0 default
baie1 feature@async_destroy enabled local
baie1 feature@empty_bpobj active local
baie1 feature@lz4_compress enabled local

zfs list
NAME USED AVAIL REFER MOUNTPOINT
baie1 14,4T 14,1T 53,9K none
baie1/users 14,4T 14,1T 53,9K none
baie1/users/phys 14,4T 5,58T 11,8T /users/phys
baie2 5,62T 22,9T 53,9K none
baie2/users 5,62T 22,9T 53,9K none
baie2/users/geol 5,62T 10,4T 5,60T /users/geol
baie3 5,04T 9,22T 53,9K none
baie3/users 5,04T 9,22T 53,9K none
baie3/users/ilm 2,87T 1,13T 2,83T /users/ilm
baie3/users/insa 42,0K 1024G 42,0K /users/insa
baie3/users/ipag 113K 1024G 113K /users/ipag
baie3/users/lasim 1,49T 522G 1,49T /users/lasim
baie3/users/lmfa 42,0K 1024G 42,0K /users/lmfa
baie3/users/lmfaecl 694G 330G 604G /users/lmfaecl
front1 466G 17,3T 44,8K none
front1/tmp 466G 17,3T 466G none

zfs get all baie1/users/phys
NAME PROPERTY VALUE SOURCE
baie1/users/phys type filesystem -
baie1/users/phys creation dim. oct. 27 10:47 2013 -
baie1/users/phys used 14,4T -
baie1/users/phys available 5,58T -
baie1/users/phys referenced 11,8T -
baie1/users/phys compressratio 1.00x -
baie1/users/phys mounted yes -
baie1/users/phys quota 20T local
baie1/users/phys reservation none default
baie1/users/phys recordsize 128K default
baie1/users/phys mountpoint /users/phys local
baie1/users/phys sharenfs off default
baie1/users/phys checksum on default
baie1/users/phys compression off default
baie1/users/phys atime off inherited from baie1
baie1/users/phys devices on default
baie1/users/phys exec on default
baie1/users/phys setuid on default
baie1/users/phys readonly off default
baie1/users/phys zoned off default
baie1/users/phys snapdir hidden default
baie1/users/phys aclinherit restricted default
baie1/users/phys canmount on default
baie1/users/phys xattr on default
baie1/users/phys copies 1 default
baie1/users/phys version 5 -
baie1/users/phys utf8only off -
baie1/users/phys normalization none -
baie1/users/phys casesensitivity sensitive -
baie1/users/phys vscan off default
baie1/users/phys nbmand off default
baie1/users/phys sharesmb off default
baie1/users/phys refquota none default
baie1/users/phys refreservation none default
baie1/users/phys primarycache all default
baie1/users/phys secondarycache all default
baie1/users/phys usedbysnapshots 2,62T -
baie1/users/phys usedbydataset 11,8T -
baie1/users/phys usedbychildren 0 -
baie1/users/phys usedbyrefreservation 0 -
baie1/users/phys logbias latency default
baie1/users/phys dedup off default
baie1/users/phys mlslabel none default
baie1/users/phys sync standard default
baie1/users/phys refcompressratio 1.00x -
baie1/users/phys written 36,0G -
baie1/users/phys snapdev hidden default

These are production servers, so I cannot play with debugging on them, but if you need any additional data, please ask and I'll do what I can.

Regards, Loïs

@dweeezil
Contributor

@ltaulell Were any of the filesystems created as ZPL versions < 5 and then upgraded to version 5? If so, I'd say there's a good chance this is openzfs/zfs#2025. The SA magic is totally bogus, so this is not a simple single-bit memory error (and it sure looks like this system must have ECC in any case).
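
(Something like the following would at least show the current ZPL version of every dataset; it can't tell you which version a filesystem was originally created at, but anything still reporting < 5 has obviously not been upgraded:)

zfs get -r -t filesystem -o name,value version baie1 baie2 baie3 front1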

@ltaulell
Author

@dweeezil: Yes, essentially all filesystems were created on Solaris (zfs v3 or v4), then upgraded after send/recv onto the new systems (Debian/ZoL).

And yes, all r720xd servers have ECC memory.

Despite the small difference in line numbers (zfs_vfsops.c:351 vs zfs_vfsops.c:390), I agree with you: this looks like a good match for openzfs/zfs#2025.

Again, if you need anything additional, please ask.

@dweeezil
Contributor

@ltaulell I've not yet had a chance to look into Andriy's analysis in openzfs/zfs#2025 but at first glance it sounds reasonable (I've not studied the code path required to upgrade from DMU_OT_ZNODE to DMU_OT_SA).

In the meantime, given that this is most certainly not an SPL bug and is also very likely to be the same as openzfs/zfs#2025, my suggestion would be to close this issue (#352) and add your information to openzfs/zfs#2025.

@ltaulell
Author

Who does that? 0;-)

@ltaulell
Author

Copied to openzfs/zfs#2025 => closing.
