
SPLError: 324:0:(zfs_vfsops.c:390:zfs_space_delta_cb()) SPL PANIC #2025

Closed
clefru opened this issue Jan 4, 2014 · 27 comments

@clefru
Contributor

clefru commented Jan 4, 2014

After the following SPL panic

[402052.331476] VERIFY3(sa.sa_magic == 0x2F505A) failed (1388795800 == 3100762)
[402052.331514] SPLError: 324:0:(zfs_vfsops.c:390:zfs_space_delta_cb()) SPL PANIC
[402052.331545] SPL: Showing stack for process 324
[402052.331548] CPU: 2 PID: 324 Comm: txg_sync Tainted: P O 3.10.25-1-lts #1
[402052.331550] Hardware name: /DQ45CB, BIOS CBQ4510H.86A.0079.2009.0414.1340 04/14/2009
[402052.331552] ffff880069446fa8 ffff880221d8b9d8 ffffffff814b1e2d ffff880221d8b9e8
[402052.331555] ffffffffa0187897 ffff880221d8ba10 ffffffffa018890f ffffffffa019c531
[402052.331558] ffff8801e6697600 ffff8801e6697600 ffff880221d8ba50 ffffffffa02ce816
[402052.331561] Call Trace:
[402052.331568] [] dump_stack+0x19/0x1b
[402052.331589] [] spl_debug_dumpstack+0x27/0x40 [spl]
[402052.331595] [] spl_debug_bug+0x7f/0xe0 [spl]
[402052.331613] [] zfs_space_delta_cb+0x186/0x190 [zfs]
[402052.331628] [] dmu_objset_userquota_get_ids+0xd4/0x470 [zfs]
[402052.331643] [] dnode_sync+0xde/0xb80 [zfs]
[402052.331647] [] ? __mutex_lock_slowpath+0x24c/0x320
[402052.331661] [] dmu_objset_sync_dnodes+0xd2/0xf0 [zfs]
[402052.331676] [] dmu_objset_sync+0x1c7/0x2f0 [zfs]
[402052.331689] [] ? secondary_cache_changed_cb+0x20/0x20 [zfs]
[402052.331703] [] ? dmu_objset_sync+0x2f0/0x2f0 [zfs]
[402052.331720] [] dsl_dataset_sync+0x41/0x50 [zfs]
[402052.331737] [] dsl_pool_sync+0x98/0x470 [zfs]
[402052.331754] [] spa_sync+0x427/0xb30 [zfs]
[402052.331772] [] txg_sync_thread+0x374/0x5e0 [zfs]
[402052.331790] [] ? txg_delay+0x120/0x120 [zfs]
[402052.331796] [] thread_generic_wrapper+0x7a/0x90 [spl]
[402052.331802] [] ? __thread_exit+0xa0/0xa0 [spl]
[402052.331806] [] kthread+0xc0/0xd0
[402052.331809] [] ? kthread_create_on_node+0x120/0x120
[402052.331812] [] ret_from_fork+0x7c/0xb0
[402052.331815] [] ? kthread_create_on_node+0x120/0x120

my NFS server stayed partially unresponsive.

I am running 5d862cb zfs and 921a35a spl on linux-lts 3.10.25

I never enabled xattr=sa on this pool. I have xattr=on.

Another SA corruption issue?

@clefru clefru closed this as completed Jan 4, 2014
@clefru
Contributor Author

clefru commented Jan 4, 2014

Sorry, stupid me hitting the wrong buttons. Reopening.

I just wanted to add my kernel config for reference: https://projects.archlinux.org/svntogit/packages.git/tree/trunk/config.x86_64?h=packages/linux-lts

@clefru clefru reopened this Jan 4, 2014
@dweeezil
Contributor

dweeezil commented Jan 4, 2014

The user/group used facility does manage the bonus area on its own, so this could certainly be a newly discovered case of SA corruption. Unfortunately, it's not currently possible to dump those dnodes directly with zdb. They are objects -1 and -2. Could you please run zdb -ddddd <pool>/<fs> | less on the suspect file system and then, in "less", search forward for the word "used". You'll see something like this:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        -1    1    16K    512     1K    512  100.00  ZFS user/group used
        dnode flags: USED_BYTES 
        dnode maxblkid: 0
        microzap: 512 bytes, 2 entries

                1f4 = 11369852928 
                0 = 117723136 

Once you get that output, please post it to this issue (or to a gist or paste somewhere). We want to see the entire output for Object -1 and -2. It wouldn't be surprising if zdb trips an assertion while doing this.

My initial suspicion is that there's a problem extending into a spill block when a file is created with a brand new user/group ID. For my part, I'll try a bit of testing on my own to see whether I can reproduce it.

@clefru
Contributor Author

clefru commented Jan 4, 2014

Do I understand you correctly that there are only two such objects per filesystem and I can stop the zdb grepping once these objects are found?

I have a couple of file systems; here are three that are most likely involved:

https://gist.github.com/clefru/8256771
https://gist.github.com/clefru/8256789
https://gist.github.com/clefru/8256832

I can't really interpret them. The varying sizes (1K, 1.5K) while all show 100% full don't make much sense to me.

(Thanks for your speedy reply)

@dweeezil
Contributor

dweeezil commented Jan 4, 2014

Yes, you're correct. Actually, it turns out I was wrong: you can dump those objects by specifying them as large unsigned 64-bit integers:

zdb -ddddd <pool/fs> 18446744073709551614 18446744073709551615

For some reason, it didn't dawn on me at the time that you could do that.

I just took a peek at your group/user-used output and there doesn't seem to be anything too crazy going on. You mentioned you were using NFS so I'm wondering if some of the "special" NFS "nobody/nogroup" uid/gids might be involved with this. I don't suppose you've got any idea what type of file operation caused this? I've got a few ideas and will see if I can reproduce this on a test system.

As to interpreting the output, I was interested as to how many different uid/gids were being kept track of as well as their values. I was also interested as to whether a spill block was involved (there is none; it would show up after the "dnode flags:").

@dweeezil
Contributor

dweeezil commented Jan 4, 2014

@clefru It looks like I was on the wrong track. Can you reproduce this problem?

@clefru
Contributor Author

clefru commented Jan 4, 2014

Sorry, no idea on the specific NFS operations. Also, the NFS nobody/nogroup UID doesn't seem to have an issue; I can happily create nobody/nogroup files.

I will just wait a couple of days to see if I can reproduce it with more information. I will also run a zpool scrub overnight.

I found the following other panic reports:

http://lists.freebsd.org/pipermail/freebsd-fs/2012-December/thread.html#15945
https://groups.google.com/forum/#!topic/muc.lists.freebsd.current/VS_udV5jJhY
https://community.oracle.com/message/10959942

Interestingly, the last relates the problem to NFS sharing.

@maxximino
Contributor

I would like to point out that the value found in place of the SA_MAGIC is a plausible timestamp of 4 Jan 2014.
And on the FreeBSD mailing list the value is a plausible timestamp of January 2012, while the post is dated Dec 2012.
On groups.google.com, the values found are plausible Unix timestamps of 2010/2011.
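
The arithmetic is easy to check. A minimal C snippet (assuming the bogus value is seconds since the Unix epoch) decodes the 1388795800 from the original report to 4 Jan 2014 UTC:

    /* Decode the value found where sa_magic was expected, treating it as a
     * Unix timestamp (assumption: the value is seconds since the epoch). */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t t = 1388795800;              /* value from the original VERIFY3 */
        printf("%s", asctime(gmtime(&t)));  /* prints: Sat Jan  4 00:36:40 2014 */
        return 0;
    }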

@dweeezil
Contributor

dweeezil commented Jan 4, 2014

It'd be nice if @clefru could run with debugging enabled (configure with --enable-debug) for a while. There are a bunch of ASSERTs related to SAs which might get hit a bit earlier.

@dweeezil
Contributor

dweeezil commented Jan 4, 2014

@maxximino I think you're on the right track with the timestamp observation. I did examine the bogus value but didn't consider it might be a timestamp.

@dweeezil
Contributor

dweeezil commented Jan 4, 2014

@clefru Does the system on which your zfs/spl was compiled have gcc 4.8 or newer? If so, I think I've just discovered a whole raft of other problems that its aggressive loop optimization can cause. See #2010 for details. In particular, it looks like the sa_lengths[] array in sa_hdr_phys is also going to be a problem.

@dweeezil
Contributor

dweeezil commented Jan 4, 2014

@clefru Never mind the question regarding gcc 4.8. I had forgotten that the problem doesn't occur when the [1] array is the last member of the structure (which it is in this case).

@dweeezil
Contributor

dweeezil commented Jan 5, 2014

Thinking more about a timestamp winding up where the SA's magic should be makes me think this may, in fact, be another potential manifestation of #2010. I've not yet thought through the ramifications of every loop involved, but anything that causes dn_nblkptr to be mis-managed could easily cause SAs to be clobbered. I will be looking over each of the likely over-optimized loops to try to determine the impact.

With that in mind, I guess I would like to know whether your system, @clefru, is using gcc 4.8 or newer.

I'm also curious about the history of gcc 4.8 adoption in various distros.
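
For reference, here is a minimal sketch of the pattern #2010 is about (not ZFS source; the struct and field names below are invented): a loop over an array declared with one element that is not the last member of its structure, which ZFS uses intentionally but which gcc 4.8's aggressive loop optimization is allowed to truncate.

    /* Illustrative only -- not ZFS code.  ZFS deliberately indexes past
     * blkptr[0] into the space that follows, which is undefined behaviour in
     * ISO C; gcc >= 4.8 with -faggressive-loop-optimizations may therefore
     * assume the loop below runs at most once and drop later iterations. */
    #include <stdio.h>

    struct fake_dnode {
        int  nblkptr;     /* how many entries are actually in use */
        long blkptr[1];   /* declared [1], but NOT the last member */
        char bonus[64];   /* overlaid by blkptr[1..] in the real on-disk layout */
    };

    int main(void)
    {
        struct fake_dnode dn = { .nblkptr = 3 };

        for (int i = 0; i < dn.nblkptr; i++)  /* compiler may assume i can only be 0 */
            dn.blkptr[i] = i + 1;

        printf("%ld\n", dn.blkptr[0]);
        return 0;
    }

The usual workaround is to build such code with -fno-aggressive-loop-optimizations, which is what the #2010 discussion revolves around.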

@maxximino
Contributor

@clefru what is the version of the zfs filesystem? v5 or older? Was it upgraded, or was this fs "born" as v5? (If it was received, consider also its life on the other pool.)
@dweeezil: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/sa.c#L172 tells me that pre-v5 filesystems, without SAs, had the atime as the first field, and sa_magic is the first field of an attribute (v5). And 5 days ago @clefru commented on issue #1264 talking about a v3 filesystem. Given that the on-disk differences affect exactly the code which is creating problems, I think it's worth asking.

@clefru
Contributor Author

clefru commented Jan 5, 2014

@maxximino Great catch on the timestamp! One potentially involved file system was upgraded from v3 to v5 before this happened, without a reboot in between.

@dweeezil positive, I am on gcc (GCC) 4.8.2 20131219 (prerelease). My CFLAGS -- if config.status is authoritative -- are just "-g -O2", but I assume the kernel build process adds a ton.

I recompiled spl/zfs with --enable-debug. Is there any spl_debug modparam I should pass in?

@dweeezil
Contributor

dweeezil commented Jan 5, 2014

@clefru Thanks for checking the GCC; that's likely to be causing some problems in its own right (unless your distro's kernel build process adds -fno-aggressive-loop-optimizations; you can run "make clean; make V=1" in the modules directory to check)... but... yes! The v3->v5 upgrade may explain everything. That upgrade enables a pair of features that are at the core of this particular problem and your original stack trace. I never considered the possibility that the filesystem in question may have been created with a ZPL version < 5.

Under ZPL version 3, rather than an SA bundle, a znode_phys_t exists in the bonus buffer and is of type DMU_OT_ZNODE. The very first members of that structure are the timestamps. In ZPL 5, however, where the bonus would be a DMU_OT_SA, the beginning of the bonus buffer is the magic number (0x2F505A, or ASCII "ZP/\0"), which is the 3100762 in your assert.

As to running with --enable-debug, you don't need to do anything else. I had mainly suggested it to get the ASSERTs activated.

I'm going to try some experiments with ZPL 3 pools that have been upgraded to version 5. I have a feeling that may explain everything.

EDIT: before I dive into my 3->5 upgrade testing, I must mention that I've not done a pool upgrade in a long time and that during my initial tests while writing this original message, a 3->5 upgrade actually hit an ASSERT and failed.
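
To make the failed VERIFY concrete, here is a simplified sketch of the two layouts (only the leading fields are shown; the real definitions are znode_phys_t in zfs_znode.h and sa_hdr_phys_t in sa_impl.h). Reading an old-format bonus buffer as if it were an SA header picks up the atime seconds exactly where sa_magic should be:

    /* Simplified layouts, little-endian assumed; not the full on-disk structs. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct old_znode_bonus {      /* ZPL 3: znode_phys_t begins with timestamps */
        uint64_t zp_atime[2];
        uint64_t zp_mtime[2];
        /* ... more fields ... */
    };

    struct sa_bonus_hdr {         /* ZPL 5: SA header begins with a magic number */
        uint32_t sa_magic;        /* expected to be 0x2F505A ("ZP/") */
        uint16_t sa_layout_info;
        /* ... */
    };

    int main(void)
    {
        struct old_znode_bonus zp = { .zp_atime = { 1388795800, 0 } };
        uint32_t magic;

        /* Misinterpret the old bonus as an SA header, as the failing path does. */
        memcpy(&magic, &zp, sizeof (magic));
        printf("sa_magic seen: %u, expected: 3100762 (0x2F505A)\n", magic);
        return 0;
    }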

@clefru
Contributor Author

clefru commented Jan 5, 2014

Thanks Tim, Thanks Massimo for hunting that down so quickly!

Can I somehow dump the bonus area of the file system in question, so that I can figure out whether the bonus area was properly upgraded to the new layout?

@dweeezil
Contributor

dweeezil commented Jan 5, 2014

@clefru The bonus buffer is per-file (in the tail of the dnode) and it is not changed by performing the 3->5 upgrade. Only certain filesystem operations will cause a file's bonus buffer to be converted to a DMU_OT_SA. You can check a particular file with zdb -dddd <pool/fs> <inode>:

# ls -i /junk/fs1/blah
4 /junk/fs1/blah
# zdb -dddd junk/fs1 4
Dataset junk/fs1 [ZPL], ID 30, cr_txg 7, 18.5K, 4 objects, rootbp DVA[0]=<0:18400:200> DVA[1]=<0:1218200:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=400L/200P birth=9L/9P fill=4 cksum=9d8e63315:431bd53573f:e7c00e227504:21ca80919fbafc

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         4    1    16K    512    512    512  100.00  ZFS plain file
                                        264   bonus  ZFS znode

The "bonus ZFS znode" indicates version < 5. Files created on version 5 will have a bonus type listed as "bonus System attributes".

I have already been able to reproduce some corruption on 3->5 upgraded filesystems when running stock GCC 4.8-compiled code (without the #2010 hack). I'm working on tracking down the nature of the corruption and its cause.

@clefru
Contributor Author

clefru commented Jan 12, 2014

Just FYI, I have been running a debug build for a week now and I haven't seen this issue again.

@avg-I
Contributor

avg-I commented Apr 15, 2014

Below is my guess about what could be wrong.

A thread changing file attributes can end up calling zfs_sa_upgrade() to convert the file's bonus from DMU_OT_ZNODE to DMU_OT_SA. The conversion is achieved in two steps:

  • dmu_set_bonustype() to change the bonus type in the dnode
  • sa_replace_all_by_template_locked() to re-populate the bonus data

dmu_set_bonustype() calls dnode_setbonus_type(), which does the following:

    dn->dn_bonustype = newtype;
    dn->dn_next_bonustype[tx->tx_txg & TXG_MASK] = dn->dn_bonustype;

Concurrently, the sync thread can run into the dnode if it was dirtied in an earlier txg. The sync thread calls dmu_objset_userquota_get_ids() via dnode_sync(). dmu_objset_userquota_get_ids() uses dn_bonustype that has the new value, but the data corresponding to the txg being sync-ed is still in the old format.

As I understand it, dmu_objset_userquota_get_ids() already uses dmu_objset_userquota_find_data() with before == B_FALSE to find the proper copy of the data corresponding to the txg being synced.
So I think that, in that case, dmu_objset_userquota_get_ids() should also use the values of dn_bonustype and dn_bonuslen that correspond to that txg.
If I am not mistaken, those values could be deduced from dn_next_bonustype[tx->tx_txg & TXG_MASK] plus dn_phys->dn_bonustype, and dn_next_bonuslen[tx->tx_txg & TXG_MASK] plus dn_phys->dn_bonuslen.
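
A toy model of that race and of the per-txg lookup being suggested (structure and helper names invented for illustration; this is not the actual ZFS code) might look like this:

    /* Toy model of the race described above.  An open-context thread upgrades
     * the bonus to the SA type for the currently open txg while the sync
     * thread is still writing an older txg whose bonus bytes are in the old
     * znode format. */
    #include <stdint.h>
    #include <stdio.h>

    #define TXG_SIZE 4
    #define TXG_MASK (TXG_SIZE - 1)

    enum { OT_ZNODE = 1, OT_SA = 2 };        /* stand-ins for the DMU_OT_* values */

    struct model_dnode {
        uint8_t dn_bonustype;                /* current in-core bonus type */
        uint8_t dn_next_bonustype[TXG_SIZE]; /* pending change, per open txg */
        uint8_t dn_phys_bonustype;           /* bonus type already on disk */
    };

    /* The suggestion above: derive the bonus type for the txg being synced from
     * dn_next_bonustype[txg & TXG_MASK], falling back to the on-disk value. */
    static uint8_t bonustype_for_txg(const struct model_dnode *dn, uint64_t txg)
    {
        uint8_t pending = dn->dn_next_bonustype[txg & TXG_MASK];
        return pending != 0 ? pending : dn->dn_phys_bonustype;
    }

    int main(void)
    {
        struct model_dnode dn = {
            .dn_bonustype = OT_ZNODE,
            .dn_phys_bonustype = OT_ZNODE,
        };
        uint64_t syncing_txg = 10;           /* dirtied earlier, now being synced */
        uint64_t open_txg = 12;              /* where zfs_sa_upgrade() happens */

        /* Open-context thread: the dnode_setbonus_type() equivalent. */
        dn.dn_bonustype = OT_SA;
        dn.dn_next_bonustype[open_txg & TXG_MASK] = OT_SA;

        printf("buggy sync path sees:   %u (SA, but txg 10 data is znode format)\n",
               dn.dn_bonustype);
        printf("per-txg lookup returns: %u (znode, matching txg 10 data)\n",
               bonustype_for_txg(&dn, syncing_txg));
        return 0;
    }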

@ltaulell

As requested by @dweeezil, closing openzfs/spl#352 and copying the info here:

SPLError: 2657:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC

From production servers (HPC center, /home NFS servers) that had been running for about a year and a half without problems; recently I got these messages (here is the one I was able to save):

Mar 31 19:30:59 r720data3 kernel: [ 7563.266511] VERIFY3(sa.sa_magic == 0x2F505A) failed (1383495966 == 3100762)
Mar 31 19:30:59 r720data3 kernel: [ 7563.266599] SPLError: 2593:0:(zfs_vfsops.c:351:zfs_space_delta_cb()) SPL PANIC
Mar 31 19:30:59 r720data3 kernel: [ 7563.266630] SPL: Showing stack for process 2593
Mar 31 19:30:59 r720data3 kernel: [ 7563.266639] Pid: 2593, comm: txg_sync Tainted: P W O 3.2.0-4-amd64 #1 Debian 3.2.54-2
Mar 31 19:30:59 r720data3 kernel: [ 7563.266644] Call Trace:
Mar 31 19:30:59 r720data3 kernel: [ 7563.266730] [] ? spl_debug_dumpstack+0x24/0x2a [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266737] [] ? spl_debug_bug+0x7f/0xc8 [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266767] [] ? zfs_space_delta_cb+0xcf/0x150 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266782] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266798] [] ? dmu_objset_userquota_get_ids+0x1b4/0x2ae [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266805] [] ? mutex_lock+0xd/0x2d
Mar 31 19:30:59 r720data3 kernel: [ 7563.266808] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266823] [] ? dnode_sync+0x8d/0x78a [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266833] [] ? buf_hash_remove+0x65/0x91 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266837] [] ? should_resched+0x5/0x23
Mar 31 19:30:59 r720data3 kernel: [ 7563.266841] [] ? _cond_resched+0x7/0x1c
Mar 31 19:30:59 r720data3 kernel: [ 7563.266844] [] ? mutex_lock+0xd/0x2d
Mar 31 19:30:59 r720data3 kernel: [ 7563.266857] [] ? dmu_objset_sync_dnodes+0x6f/0x88 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266869] [] ? dmu_objset_sync+0x1f3/0x263 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266878] [] ? arc_cksum_compute+0x83/0x83 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266887] [] ? arc_hdr_destroy+0x1b6/0x1b6 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266903] [] ? dsl_pool_sync+0xbf/0x475 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266923] [] ? spa_sync+0x4f4/0x90b [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266927] [] ? ktime_get_ts+0x5c/0x82
Mar 31 19:30:59 r720data3 kernel: [ 7563.266949] [] ? txg_sync_thread+0x2bd/0x49a [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266969] [] ? txg_thread_wait.isra.2+0x23/0x23 [zfs]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266975] [] ? thread_generic_wrapper+0x6a/0x75 [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266981] [] ? __thread_create+0x2be/0x2be [spl]
Mar 31 19:30:59 r720data3 kernel: [ 7563.266986] [] ? kthread+0x76/0x7e
Mar 31 19:30:59 r720data3 kernel: [ 7563.266991] [] ? kernel_thread_helper+0x4/0x10
Mar 31 19:30:59 r720data3 kernel: [ 7563.266994] [] ? kthread_worker_fn+0x139/0x139
Mar 31 19:30:59 r720data3 kernel: [ 7563.266997] [] ? gs_change+0x13/0x13

Then all txg_sync threads are hung, all knfsd threads are hung and uninterruptible, the load goes through the roof => hard reboot.

I can't force the bug to reproduce, but it appears randomly (on 3 different servers, all with the same hardware and software config) as NFS usage goes (once a week for one server, twice a day for another).

I scrubbed all pools after hangs/reboots => "No known data errors".

Data were imported from older pools (v3, v4 from Solaris x86 to Debian x86_64/ZoL) via zfs send/recv, then upgraded (via zpool/zfs upgrade) to v5.

Maybe related to #1303 and #2025?

bare metal:
Dell PE r720xd
2x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (32 logical cores)
192 GB RAM (ECC)

2x HBA SAS
03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 Falcon
04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 Falcon

3x MD1200 12x4T (all pools built the same way)

OS:
Debian 7.4 + multipathd (0.4.9+git0.4dfdaf2b-7~deb7u2) + debian-zfs (0.6.2-4~wheezy) from zol repository

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.7.2-5' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.2 (Debian 4.7.2-5)

pools

NAME     SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
baie1   43,5T  21,6T  21,9T  49%  1.00x  ONLINE  -
baie2   43,5T  8,43T  35,1T  19%  1.00x  ONLINE  -
baie3   21,8T  7,57T  14,2T  34%  1.00x  ONLINE  -
front1  21,8T   561G  21,2T   2%  1.00x  ONLINE  -

zpool status baie1
pool: baie1
state: ONLINE
scan: scrub repaired 0 in 17h17m with 0 errors on Sat Mar 29 05:30:59 2014
config:

NAME        STATE     READ WRITE CKSUM
baie1       ONLINE       0     0     0
  raidz2-0  ONLINE       0     0     0
    B1D0    ONLINE       0     0     0
    B1D1    ONLINE       0     0     0
    B1D2    ONLINE       0     0     0
    B1D3    ONLINE       0     0     0
    B1D4    ONLINE       0     0     0
    B1D5    ONLINE       0     0     0
  raidz2-1  ONLINE       0     0     0
    B1D6    ONLINE       0     0     0
    B1D7    ONLINE       0     0     0
    B1D8    ONLINE       0     0     0
    B1D9    ONLINE       0     0     0
    B1D10   ONLINE       0     0     0
    B1D11   ONLINE       0     0     0

errors: No known data errors

zpool get all baie1
NAME PROPERTY VALUE SOURCE
baie1 size 43,5T -
baie1 capacity 49% -
baie1 altroot - default
baie1 health ONLINE -
baie1 guid 14312441928248404290 default
baie1 version - default
baie1 bootfs - default
baie1 delegation on default
baie1 autoreplace off default
baie1 cachefile - default
baie1 failmode wait default
baie1 listsnapshots off default
baie1 autoexpand off default
baie1 dedupditto 0 default
baie1 dedupratio 1.00x -
baie1 free 21,9T -
baie1 allocated 21,6T -
baie1 readonly off -
baie1 ashift 0 default
baie1 comment - default
baie1 expandsize 0 -
baie1 freeing 0 default
baie1 feature@async_destroy enabled local
baie1 feature@empty_bpobj active local
baie1 feature@lz4_compress enabled local

zfs list
NAME USED AVAIL REFER MOUNTPOINT
baie1 14,4T 14,1T 53,9K none
baie1/users 14,4T 14,1T 53,9K none
baie1/users/phys 14,4T 5,58T 11,8T /users/phys
baie2 5,62T 22,9T 53,9K none
baie2/users 5,62T 22,9T 53,9K none
baie2/users/geol 5,62T 10,4T 5,60T /users/geol
baie3 5,04T 9,22T 53,9K none
baie3/users 5,04T 9,22T 53,9K none
baie3/users/ilm 2,87T 1,13T 2,83T /users/ilm
baie3/users/insa 42,0K 1024G 42,0K /users/insa
baie3/users/ipag 113K 1024G 113K /users/ipag
baie3/users/lasim 1,49T 522G 1,49T /users/lasim
baie3/users/lmfa 42,0K 1024G 42,0K /users/lmfa
baie3/users/lmfaecl 694G 330G 604G /users/lmfaecl
front1 466G 17,3T 44,8K none
front1/tmp 466G 17,3T 466G none

zfs get all baie1/users/phys
NAME PROPERTY VALUE SOURCE
baie1/users/phys type filesystem -
baie1/users/phys creation dim. oct. 27 10:47 2013 -
baie1/users/phys used 14,4T -
baie1/users/phys available 5,58T -
baie1/users/phys referenced 11,8T -
baie1/users/phys compressratio 1.00x -
baie1/users/phys mounted yes -
baie1/users/phys quota 20T local
baie1/users/phys reservation none default
baie1/users/phys recordsize 128K default
baie1/users/phys mountpoint /users/phys local
baie1/users/phys sharenfs off default
baie1/users/phys checksum on default
baie1/users/phys compression off default
baie1/users/phys atime off inherited from baie1
baie1/users/phys devices on default
baie1/users/phys exec on default
baie1/users/phys setuid on default
baie1/users/phys readonly off default
baie1/users/phys zoned off default
baie1/users/phys snapdir hidden default
baie1/users/phys aclinherit restricted default
baie1/users/phys canmount on default
baie1/users/phys xattr on default
baie1/users/phys copies 1 default
baie1/users/phys version 5 -
baie1/users/phys utf8only off -
baie1/users/phys normalization none -
baie1/users/phys casesensitivity sensitive -
baie1/users/phys vscan off default
baie1/users/phys nbmand off default
baie1/users/phys sharesmb off default
baie1/users/phys refquota none default
baie1/users/phys refreservation none default
baie1/users/phys primarycache all default
baie1/users/phys secondarycache all default
baie1/users/phys usedbysnapshots 2,62T -
baie1/users/phys usedbydataset 11,8T -
baie1/users/phys usedbychildren 0 -
baie1/users/phys usedbyrefreservation 0 -
baie1/users/phys logbias latency default
baie1/users/phys dedup off default
baie1/users/phys mlslabel none default
baie1/users/phys sync standard default
baie1/users/phys refcompressratio 1.00x -
baie1/users/phys written 36,0G -
baie1/users/phys snapdev hidden default

These are production servers, so I cannot play with debug builds, but if you need any additional data, please ask; I'll do what I can.

@behlendorf
Contributor

This is believed to be resolved. If anyone observes an instance of a 'zfs_space_delta_cb()) SPL PANIC' in 0.6.3 or newer, please open a new issue.

@behlendorf behlendorf removed this from the 0.6.4 milestone Oct 7, 2014
@clefru
Contributor Author

clefru commented Oct 26, 2014

Unfortunately, I saw this again today on heavy IO work over NFS:

Oct 26 11:32:00 alia kernel: VERIFY3(sa.sa_magic == 0x2F505A) failed (1414319519 == 3100762)
Oct 26 11:32:00 alia kernel: SPLError: 693:0:(zfs_vfsops.c:400:zfs_space_delta_cb()) SPL PANIC
Oct 26 11:32:00 alia kernel: SPL: Showing stack for process 693
Oct 26 11:32:00 alia kernel: CPU: 1 PID: 693 Comm: txg_sync Tainted: P        W  O 3.14.19-1-lts #1
Oct 26 11:32:00 alia kernel: Hardware name:                  /DQ87PG, BIOS PGQ8710H.86A.0145.2014.0218.1509 02/18/2014
Oct 26 11:32:00 alia kernel:  0000000000000000 00000000a41281e5 ffff88081ebdb9a0 ffffffff814fb9b0
Oct 26 11:32:00 alia kernel:  0000000000000000 ffff88081ebdb9b0 ffffffffa03b2867 ffff88081ebdb9d8
Oct 26 11:32:00 alia kernel:  ffffffffa03b3a90 ffffffffa03c83a1 ffffea00007d8300 ffff88016448b200
Oct 26 11:32:00 alia kernel: Call Trace:
Oct 26 11:32:00 alia kernel:  [<ffffffff814fb9b0>] dump_stack+0x45/0x56
Oct 26 11:32:00 alia kernel:  [<ffffffffa03b2867>] spl_debug_dumpstack+0x27/0x40 [spl]
Oct 26 11:32:00 alia kernel:  [<ffffffffa03b3a90>] spl_debug_bug+0x80/0xe0 [spl]
Oct 26 11:32:00 alia kernel:  [<ffffffffa04fb713>] zfs_space_delta_cb+0x1a3/0x1b0 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa0482a0e>] dmu_objset_userquota_get_ids+0x11e/0x3d0 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa0490894>] dnode_sync+0xc4/0xb20 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa047783a>] ? dbuf_sync_list+0x7a/0xa0 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa0481b42>] dmu_objset_sync_dnodes+0xd2/0xf0 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa0481d3a>] dmu_objset_sync+0x1da/0x330 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa047fce0>] ? secondary_cache_changed_cb+0x20/0x20 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa0481e90>] ? dmu_objset_sync+0x330/0x330 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa0491b31>] dsl_dataset_sync+0x41/0x50 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa049ebe8>] dsl_pool_sync+0xa8/0x480 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa04b74dd>] spa_sync+0x42d/0xb10 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffff810d15b8>] ? ktime_get_ts+0x48/0xf0
Oct 26 11:32:00 alia kernel:  [<ffffffffa04c81d2>] txg_sync_thread+0x382/0x5f0 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa04c7e50>] ? txg_delay+0xf0/0xf0 [zfs]
Oct 26 11:32:00 alia kernel:  [<ffffffffa03bb24a>] thread_generic_wrapper+0x7a/0x90 [spl]
Oct 26 11:32:00 alia kernel:  [<ffffffffa03bb1d0>] ? __thread_exit+0xa0/0xa0 [spl]
Oct 26 11:32:00 alia kernel:  [<ffffffff8108c35a>] kthread+0xea/0x100
Oct 26 11:32:00 alia kernel:  [<ffffffff8108c270>] ? kthread_create_on_node+0x1a0/0x1a0
Oct 26 11:32:00 alia kernel:  [<ffffffff8150a03c>] ret_from_fork+0x7c/0xb0
Oct 26 11:32:00 alia kernel:  [<ffffffff8108c270>] ? kthread_create_on_node+0x1a0/0x1a0

Oct 26 12:14:11 alia kernel: SPL: Loaded module v0.6.3-1
Oct 26 12:14:11 alia kernel: znvpair: module license 'CDDL' taints kernel.
Oct 26 12:14:11 alia kernel: Disabling lock debugging due to kernel taint
Oct 26 12:14:11 alia kernel: ZFS: Loaded module v0.6.3-1, ZFS pool version 5000, ZFS filesystem version 5

@behlendorf
Contributor

OK, I'm reopening this issue.

@TheUbuntuGuy

I saw this for the first time today with 0.6.3-2.
I was running a script which read-modify-writes a bunch of files over NFS.
The load under this script isn't much more than normal. I upgraded from v4->v5 only 2 weeks ago.
After hard-resetting the machine and trying again, it failed exactly the same way - even the timestamp of 1427679357 was the same (does that imply that it might be failing on the same file/node?). There don't seem to be any files with that timestamp in the directory I'm working in, and the one with the closest timestamp was created with v5 system attributes.

This is a production machine and I can't really do a huge amount of debugging. Is there anything I could do that would allow me to modify the affected files without it panicking?

Apr  2 15:32:04 tesla kernel: [2149795.941737] VERIFY3(sa.sa_magic == 0x2F505A) failed (1427679357 == 3100762)
Apr  2 15:32:04 tesla kernel: [2149795.942366] SPLError: 686:0:(zfs_vfsops.c:400:zfs_space_delta_cb()) SPL PANIC
Apr  2 15:32:04 tesla kernel: [2149795.942982] SPL: Showing stack for process 686
Apr  2 15:32:04 tesla kernel: [2149795.942985] CPU: 0 PID: 686 Comm: txg_sync Tainted: P           OE 3.16.0-031600-generic #201408031935
Apr  2 15:32:04 tesla kernel: [2149795.942986] Hardware name: System manufacturer System Product Name/P8H77-V LE, BIOS 0803 10/12/2012
Apr  2 15:32:04 tesla kernel: [2149795.942987]  ffff8805bf921740 ffff8805ea34f8f8 ffffffff81786525 0000000000000007
Apr  2 15:32:04 tesla kernel: [2149795.942990]  0000000000000000 ffff8805ea34f908 ffffffffc03614d7 ffff8805ea34f948
Apr  2 15:32:04 tesla kernel: [2149795.942991]  ffffffffc0362770 ffffffffc0376ced 000000005518a87d 00000000002f505a
Apr  2 15:32:04 tesla kernel: [2149795.942993] Call Trace:
Apr  2 15:32:04 tesla kernel: [2149795.942999]  [<ffffffff81786525>] dump_stack+0x46/0x58
Apr  2 15:32:04 tesla kernel: [2149795.943014]  [<ffffffffc03614d7>] spl_debug_dumpstack+0x27/0x40 [spl]
Apr  2 15:32:04 tesla kernel: [2149795.943018]  [<ffffffffc0362770>] spl_debug_bug+0x80/0xe0 [spl]
Apr  2 15:32:04 tesla kernel: [2149795.943048]  [<ffffffffc04a1ece>] zfs_space_delta_cb+0x18e/0x1a0 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943063]  [<ffffffffc042cf4f>] dmu_objset_userquota_get_ids+0xcf/0x420 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943066]  [<ffffffff81792366>] ? mutex_lock+0x16/0x37
Apr  2 15:32:04 tesla kernel: [2149795.943081]  [<ffffffffc043b5d2>] dnode_sync+0xa2/0xb30 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943085]  [<ffffffff81792366>] ? mutex_lock+0x16/0x37
Apr  2 15:32:04 tesla kernel: [2149795.943098]  [<ffffffffc042b45b>] dmu_objset_sync_dnodes+0xbb/0xe0 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943111]  [<ffffffffc042b614>] dmu_objset_sync+0x194/0x2f0 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943124]  [<ffffffffc042ad30>] ? secondary_cache_changed_cb+0x20/0x20 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943137]  [<ffffffffc042b770>] ? dmu_objset_sync+0x2f0/0x2f0 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943152]  [<ffffffffc043cd6c>] dsl_dataset_sync+0x4c/0x60 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943169]  [<ffffffffc04494f8>] dsl_pool_sync+0x98/0x410 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943187]  [<ffffffffc045f904>] spa_sync+0x414/0xb20 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943191]  [<ffffffff8101e579>] ? read_tsc+0x9/0x20
Apr  2 15:32:04 tesla kernel: [2149795.943194]  [<ffffffff810df41c>] ? ktime_get_ts+0x4c/0xe0
Apr  2 15:32:04 tesla kernel: [2149795.943213]  [<ffffffffc0471748>] txg_sync_thread+0x388/0x5d0 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943260]  [<ffffffffc04713c0>] ? txg_init+0x260/0x260 [zfs]
Apr  2 15:32:04 tesla kernel: [2149795.943289]  [<ffffffffc0369e88>] thread_generic_wrapper+0x78/0x90 [spl]
Apr  2 15:32:04 tesla kernel: [2149795.943294]  [<ffffffffc0369e10>] ? __thread_create+0x300/0x300 [spl]
Apr  2 15:32:04 tesla kernel: [2149795.943297]  [<ffffffff81096479>] kthread+0xc9/0xe0
Apr  2 15:32:04 tesla kernel: [2149795.943299]  [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
Apr  2 15:32:04 tesla kernel: [2149795.943302]  [<ffffffff81793efc>] ret_from_fork+0x7c/0xb0
Apr  2 15:32:04 tesla kernel: [2149795.943324]  [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
Apr  2 15:35:38 tesla kernel: [2150009.686377] INFO: task txg_sync:686 blocked for more than 120 seconds.
Apr  2 15:35:38 tesla kernel: [2150009.686945]       Tainted: P           OE 3.16.0-031600-generic #201408031935

@woffs

woffs commented Apr 21, 2015

same here:

Apr 21 12:18:25 griselda kernel: [  762.087675] VERIFY3(sa.sa_magic == 0x2F505A) failed (1429611505 == 3100762)
Apr 21 12:18:25 griselda kernel: [  762.087680] PANIC at zfs_vfsops.c:400:zfs_space_delta_cb()
Apr 21 12:18:25 griselda kernel: [  762.087681] Showing stack for process 664
Apr 21 12:18:25 griselda kernel: [  762.087684] CPU: 0 PID: 664 Comm: txg_sync Tainted: P           O  3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt7-1~bpo70+1
Apr 21 12:18:25 griselda kernel: [  762.087685] Hardware name: LENOVO 2756N1G/MAHOBAY, BIOS 9SKT70AUS 06/07/2013
Apr 21 12:18:25 griselda kernel: [  762.087687]  0000000000000000 ffffffffa035c220 ffffffff81543f17 ffffffffa036fab0
Apr 21 12:18:25 griselda kernel: [  762.087690]  ffffffffa020ef62 ffff8800c2686840 ffffffffa0370008 0000000000000000
Apr 21 12:18:25 griselda kernel: [  762.087692]  ffff880300000030 ffff8803db5fba38 ffff8803db5fb9d8 ffff88037aae1b08
Apr 21 12:18:25 griselda kernel: [  762.087694] Call Trace:
Apr 21 12:18:25 griselda kernel: [  762.087721]  [<ffffffff81543f17>] ? dump_stack+0x41/0x51
Apr 21 12:18:25 griselda kernel: [  762.087729]  [<ffffffffa020ef62>] ? spl_panic+0xc2/0x100 [spl]
Apr 21 12:18:25 griselda kernel: [  762.087746]  [<ffffffffa02a05bf>] ? dbuf_write.isra.9+0x1cf/0x410 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087767]  [<ffffffffa0294790>] ? arc_cksum_compute.isra.18+0xd0/0xd0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087778]  [<ffffffffa0293160>] ? arc_evictable_memory+0x80/0x80 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087788]  [<ffffffffa0296bd0>] ? arc_adapt_thread+0x4b0/0x4b0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087793]  [<ffffffff810b0db8>] ? __wake_up+0x48/0x70
Apr 21 12:18:25 griselda kernel: [  762.087811]  [<ffffffffa032a9b8>] ? zfs_space_delta_cb+0xc8/0x1a0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087827]  [<ffffffffa02af35f>] ? dmu_objset_userquota_get_ids+0xdf/0x4e0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087844]  [<ffffffffa02bed38>] ? dnode_sync+0xc8/0x950 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087860]  [<ffffffffa02ad6e6>] ? dmu_objset_sync_dnodes+0xb6/0xe0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087874]  [<ffffffffa02ad8a7>] ? dmu_objset_sync+0x197/0x2f0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087885]  [<ffffffffa0294790>] ? arc_cksum_compute.isra.18+0xd0/0xd0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087896]  [<ffffffffa0293160>] ? arc_evictable_memory+0x80/0x80 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087906]  [<ffffffffa0296bd0>] ? arc_adapt_thread+0x4b0/0x4b0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087924]  [<ffffffffa02ce5c4>] ? dsl_pool_sync+0x94/0x440 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087945]  [<ffffffffa02e85fc>] ? spa_sync+0x45c/0xbb0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087948]  [<ffffffff810b119e>] ? autoremove_wake_function+0xe/0x30
Apr 21 12:18:25 griselda kernel: [  762.087951]  [<ffffffff810b0db8>] ? __wake_up+0x48/0x70
Apr 21 12:18:25 griselda kernel: [  762.087970]  [<ffffffffa02f940d>] ? txg_sync_thread+0x36d/0x5f0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087988]  [<ffffffffa02f90a0>] ? txg_fini+0x2c0/0x2c0 [zfs]
Apr 21 12:18:25 griselda kernel: [  762.087995]  [<ffffffffa020c5ea>] ? thread_generic_wrapper+0x7a/0x90 [spl]
Apr 21 12:18:25 griselda kernel: [  762.088002]  [<ffffffffa020c570>] ? __thread_create+0x160/0x160 [spl]
Apr 21 12:18:25 griselda kernel: [  762.088006]  [<ffffffff81090551>] ? kthread+0xc1/0xe0
Apr 21 12:18:25 griselda kernel: [  762.088010]  [<ffffffff81090490>] ? flush_kthread_worker+0xb0/0xb0
Apr 21 12:18:25 griselda kernel: [  762.088013]  [<ffffffff8154a33c>] ? ret_from_fork+0x7c/0xb0
Apr 21 12:18:25 griselda kernel: [  762.088017]  [<ffffffff81090490>] ? flush_kthread_worker+0xb0/0xb0

Upgraded from zfs v4 to v5 some days ago. How do I get rid of this particular bad entry?

@behlendorf
Contributor

Closing as stale. If anyone is still hitting this, let us know and we'll reopen it.

@jspiros

jspiros commented Jul 22, 2017

Just in case my reference to this issue in #6332 didn't trigger a notification for anyone subscribed to this issue, I seem to be hitting this with 0.6.5.9. See #6332 for details.
