createmany test failure #401

Closed
behlendorf opened this issue Sep 16, 2011 · 6 comments

@behlendorf
Contributor

Andreas Dilger: to cause this I ran lustre/tests/createmany -m /ztest/test/foo 2000000 (approximately) to consume the originally reported number of free inodes. It then still reported about 450k free inodes (I assume because metadnode compression uses less than 512 bytes/dnode), so I ran createmany -m /ztest/test/f 450000, and it is now hung at around 390000 files, with df -i still reporting 189000 inodes / 94500 kB free. The pool is only 1GB in size, so it is currently 91% full. I had it 100% full yesterday, though there were a lot more files with data; this run was purely zero-length inodes.
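
For readers without a Lustre tree handy, the workload amounts to a tight mknod(2) loop; a minimal stand-in sketch (illustrative only, not the actual lustre/tests/createmany source) would be:

/*
 * Rough equivalent of `createmany -m <base> <count>`: create <count>
 * zero-length inodes via mknod(2), the same syscall path visible in
 * the createmany stack trace further down
 * (sys_mknod -> vfs_create -> zfs_create).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
        char path[4096];
        long i, count;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <basename> <count>\n", argv[0]);
                return 1;
        }
        count = strtol(argv[2], NULL, 10);

        for (i = 0; i < count; i++) {
                snprintf(path, sizeof(path), "%s%ld", argv[1], i);
                /* S_IFREG|0644: a plain zero-length file, no data blocks */
                if (mknod(path, S_IFREG | 0644, 0) != 0) {
                        perror(path);
                        return 1;
                }
        }
        return 0;
}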

Pid: 21433, comm: z_ioctl_int/0 Tainted: P 2.6.32.26-175.fc12.x86_64 #1 Dell DXP051                  
RIP: 0010:[] [] _spin_lock+0x1a/0x25
Call Trace:
[] ?  __mutex_unlock_slowpath+0x1a/0x3e
[] ? mutex_unlock+0x1b/0x1d
[] ? zio_done+0xa0a/0xa29 [zfs]
[] ? zio_done+0x9a8/0xa29 [zfs]
[] ? zio_execute+0xb1/0xd5 [zfs]
[] ? taskq_thread+0x271/0x2b3 [spl]
[] ?  default_wake_function+0x0/0x14
[] ? taskq_thread+0x0/0x2b3 [spl]
[] ? kthread+0x7f/0x87
[] ? child_rip+0xa/0x20
[] ? kthread+0x0/0x87
[] ? child_rip+0x0/0x20
@adilger
Contributor

adilger commented Sep 16, 2011

The above stack trace was from the watchdog timer. Running sysrq-t, the stack was slightly different:

z_ioctl_int/0 R running task 0 21433 2 0x00000088
Call Trace:
[] zio_done+0x9a8/0xa29 [zfs]
[] zio_execute+0xb1/0xd5 [zfs]
[] taskq_thread+0x271/0x2b3 [spl]
[] ? default_wake_function+0x0/0x14
[] ? taskq_thread+0x0/0x2b3 [spl]
[] kthread+0x7f/0x87
[] child_rip+0xa/0x20
[] ? kthread+0x0/0x87
[] ? child_rip+0x0/0x20

so at this instant it didn't appear to be in any locking primitives, but it was still busy-spinning at 100% CPU. Other threads with interesting stacks include:

z_rd_int/0 R running task 0 212992 0x00000080
Call Trace:
[] ? zio_destroy+0x91/0x95 [zfs]
[] taskq_thread+0x101/0x2b3 [spl]
[] ? default_wake_function+0x0/0x14
[] ? taskq_thread+0x0/0x2b3 [spl]
[] kthread+0x7f/0x87
[] child_rip+0xa/0x20
[] ? kthread+0x0/0x87
[] ? child_rip+0x0/0x20

txg_sync D ffffc90012775880 0 214372 0x00000080
Call Trace:
[] __mutex_lock_common+0x12d/0x194
[] __mutex_lock_slowpath+0x19/0x1b
[] mutex_lock+0x36/0x50
[] cv_wait_common+0x7a/0xb9 [spl]
[] ? autoremove_wake_function+0x0/0x39
[] __cv_wait+0x13/0x15 [spl]
[] zio_wait+0xf6/0x123 [zfs]
[] vdev_uberblock_sync_list+0xb4/0xcd [zfs]
[] vdev_config_sync+0x10c/0x136 [zfs]
[] spa_sync+0x68f/0x85d [zfs]
[] ? autoremove_wake_function+0x16/0x39
[] ? __wake_up+0x44/0x4d
[] txg_sync_thread+0x1c5/0x33e [zfs]
[] ? txg_sync_thread+0x0/0x33e [zfs]
[] thread_generic_wrapper+0x61/0x6b [spl]
[] ? thread_generic_wrapper+0x0/0x6b [spl]
[] kthread+0x7f/0x87
[] child_rip+0xa/0x20
[] ? kthread+0x0/0x87
[] ? child_rip+0x0/0x20

createmany D ffff88000c4681c0 0 2218425584 0x00000084
Call Trace:
[] ? try_to_wake_up+0x33b/0x34d
[] cv_wait_common+0x72/0xb9 [spl]
[] ? autoremove_wake_function+0x0/0x39
[] __cv_wait+0x13/0x15 [spl]
[] txg_wait_open+0x79/0x99 [zfs]
[] dmu_tx_wait+0xa4/0xad [zfs]
[] zfs_create+0x391/0x5f3 [zfs]
[] zpl_create+0x9d/0xc9 [zfs]
[] vfs_create+0x78/0x9a
[] sys_mknodat+0x143/0x1fd
[] ? apic_timer_interrupt+0xe/0x20
[] ? path_put+0x22/0x26
[] ? audit_syscall_entry+0x11e/0x14a
[] sys_mknod+0x1a/0x1c
[] system_call_fastpath+0x16/0x1b

@behlendorf
Contributor Author

It appears the 100% CPU time can be attributed to the global virtual address space spin lock used by vmalloc(), although which exact call path is to blame isn't clear.

@gunnarbeutner
Contributor

zio_wait() hangs because the ZIO it's waiting for simply never completes. Most likely it was flagged ZIO_REEXECUTE_SUSPEND by zio_done().

This can have several causes:

  • pool faulted
    • top-level vdevs inaccessible
    • uncorrectable read/write error
  • pool full
    • metaslab allocator returned ENOSPC (which is most likely what happened here)
  • possibly some bug(s) we haven't identified yet

Using 'zpool events' you should see a number of I/O failures like the following (this output was generated by running the 'createmany' test until the zpool was completely full; printk()s confirm that metaslab_alloc_dva() returned ENOSPC in my tests):

root@zt:~# zpool events
TIME                           CLASS
Oct 20 2011 19:10:39.037314000 resource.fs.zfs.statechange
Oct 20 2011 19:10:39.368225000 resource.fs.zfs.statechange
Oct 20 2011 19:11:31.923079000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.923155000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.923177000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.923226000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.923266000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.923434000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.923485000 ereport.fs.zfs.data
Oct 20 2011 19:11:31.980107000 ereport.fs.zfs.data
Oct 20 2011 19:11:32.042742000 ereport.fs.zfs.io_failure
root@zt:~#

The txg_wait_*() hang is just a secondary effect: it's waiting for a TXG to complete, which won't happen because one of the I/Os the TXG depends on has been suspended.
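
To make the failure mode concrete, here is a toy user-space model of the hang (plain pthreads standing in for zio->io_lock/io_cv; all names are illustrative, none of this is ZFS code). The completion routine takes the suspend branch and returns without signalling, so the waiter never wakes; a timed wait is used only so the demo exits, whereas the real zio_wait() waits forever:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_cv   = PTHREAD_COND_INITIALIZER;
static bool io_done;
static bool io_suspended = true;  /* e.g. metaslab_alloc_dva() hit ENOSPC */

/* Toy stand-in for zio_done(): suspends instead of completing. */
static void *done_thread(void *arg)
{
        pthread_mutex_lock(&io_lock);
        if (io_suspended) {
                /* Suspended I/O: park it, do NOT mark it done or signal. */
                pthread_mutex_unlock(&io_lock);
                return NULL;
        }
        io_done = true;
        pthread_cond_broadcast(&io_cv);
        pthread_mutex_unlock(&io_lock);
        return NULL;
}

int main(void)
{
        pthread_t t;
        struct timespec deadline;

        pthread_create(&t, NULL, done_thread, NULL);

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 2;

        /* Toy stand-in for zio_wait(): sleeps until io_done is set. */
        pthread_mutex_lock(&io_lock);
        while (!io_done) {
                if (pthread_cond_timedwait(&io_cv, &io_lock, &deadline)) {
                        printf("zio suspended: the waiter would sleep forever, "
                               "and every txg_wait_*() behind it too\n");
                        break;
                }
        }
        pthread_mutex_unlock(&io_lock);
        pthread_join(t, NULL);
        return 0;
}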

@behlendorf
Contributor Author

This is very plausibly the same as #2523 and is particularly interesting because it includes a simple reproducer.

behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when used to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically, this fix resolves at least the following
outstanding issues:
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
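
For context, the race the commit message describes can be sketched in user space (pthreads standing in for the kernel primitives; names are illustrative, this is not the SPL code):

#include <pthread.h>
#include <stdlib.h>

struct obj {
        pthread_mutex_t lock;   /* mutex embedded in the object it guards */
        pthread_cond_t  cv;
        int             done;
};

/* Plays the role of zio_done(). */
static void *completer(void *arg)
{
        struct obj *o = arg;

        pthread_mutex_lock(&o->lock);
        o->done = 1;
        pthread_cond_signal(&o->cv);
        pthread_mutex_unlock(&o->lock);  /* may still touch o->lock here */
        return NULL;
}

/* Plays the role of zio_wait(). */
int main(void)
{
        struct obj *o = calloc(1, sizeof(*o));
        pthread_t t;

        pthread_mutex_init(&o->lock, NULL);
        pthread_cond_init(&o->cv, NULL);
        pthread_create(&t, NULL, completer, o);

        pthread_mutex_lock(&o->lock);
        while (!o->done)
                pthread_cond_wait(&o->cv, &o->lock);
        pthread_mutex_unlock(&o->lock);

        /*
         * RACE WINDOW: nothing guarantees the completer has finished
         * executing pthread_mutex_unlock() on o->lock.  Freeing o here,
         * which is the zio_wait()/zio_done() pattern, can leave the
         * completer spinning on freed memory -- matching the z_ioctl_int
         * thread stuck in _spin_lock at the top of this issue.
         */
        free(o);

        pthread_join(t, NULL);
        return 0;
}

Per the commit message above, the fix serializes mutex_exit() with an internal spin lock, trading a little performance and memory for an unlock that can no longer overlap a concurrent acquire-and-free.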
@behlendorf
Contributor Author

This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.

openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/spl@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

(Backport of the commit quoted above; message identical, plus Backported-by: Darik Horn <dajhorn@vanadac.com> and conflicts in include/sys/mutex.h.)
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
(Same commit as above: mutex: force serialization on mutex_exit() to fix races; message unchanged.)
@ryao
Contributor

ryao commented Feb 10, 2015

@behlendorf This is not a duplicate of #2523. Instead, it is a duplicate of #3091.

ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015
(Same commit message as above; Closes openzfs#421, with conflicts in include/sys/mutex.h.)
ahrens added a commit to ahrens/zfs that referenced this issue Aug 17, 2021
Because we can have concurrent "read block" requests for the same block, we can also have concurrent accesses to the same block of the zettacache. In particular, two "read block" requests for the same block may each lookup() the block in the zettacache, find the entry missing, and then insert() it. This can lead to several problems, especially in subtle cases such as when the first insert() has been flushed to the index before the second insert() is called.

This PR solves the issue by adding per-block locking to the zettacache, as sketched below. A failed lookup() now returns with the entry locked, blocking concurrent lookup()s of the same block until the first thread either insert()s the value or drop()s the LockedKey.
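
The real zettacache is Rust; a rough C sketch of that locking discipline (all names hypothetical) might look like:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NBUCKETS 1024

struct key_state {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        bool            busy;     /* a failed lookup() holds this key */
        bool            present;  /* the block has been insert()ed */
};

/* GCC range-designated initializer; collisions conflate distinct
 * blocks in this toy table, unlike the real per-block locks. */
static struct key_state table[NBUCKETS] = {
        [0 ... NBUCKETS - 1] = {
                PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                false, false
        }
};

static struct key_state *bucket(uint64_t block)
{
        return &table[block % NBUCKETS];
}

/* Returns true on a hit.  On a miss, returns false with the key held
 * busy, so the caller must later call cache_insert() or cache_drop(). */
bool cache_lookup(uint64_t block)
{
        struct key_state *ks = bucket(block);

        pthread_mutex_lock(&ks->mtx);
        while (ks->busy)                   /* wait out a concurrent miss */
                pthread_cond_wait(&ks->cv, &ks->mtx);
        if (ks->present) {
                pthread_mutex_unlock(&ks->mtx);
                return true;
        }
        ks->busy = true;                   /* caller now owns this key */
        pthread_mutex_unlock(&ks->mtx);
        return false;
}

/* Publish the block and wake any waiters (the insert() path). */
void cache_insert(uint64_t block)
{
        struct key_state *ks = bucket(block);

        pthread_mutex_lock(&ks->mtx);
        ks->present = true;
        ks->busy = false;
        pthread_cond_broadcast(&ks->cv);
        pthread_mutex_unlock(&ks->mtx);
}

/* Give up without inserting (the drop()-the-LockedKey path). */
void cache_drop(uint64_t block)
{
        struct key_state *ks = bucket(block);

        pthread_mutex_lock(&ks->mtx);
        ks->busy = false;
        pthread_cond_broadcast(&ks->cv);
        pthread_mutex_unlock(&ks->mtx);
}

Blocking on ks->busy is what prevents the double insert described above: the second reader cannot even observe the miss until the first has either published the block or dropped the lock.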
sdimitro pushed a commit to sdimitro/zfs that referenced this issue May 23, 2022