
zio->io_lock() spinlock damaged #2523

Closed
behlendorf opened this issue Jul 22, 2014 · 51 comments

@behlendorf
Contributor

The following hang was observed running zfs-0.6.3 under RHEL 6.5. Very similar hangs in zio_wait() have occasionally been reported for prior releases. However, this time a crash dump was available for detailed inspection, which helps shed some light on the issue.

The hang in zio_wait() shown below is the result of the txg_sync thread spinning in __mutex_unlock_slowpath on the mutex's internal spin lock. This is somewhat surprising because that spin lock is managed entirely by the Linux kernel's mutex_lock and mutex_unlock functions. Because we have access to the crash dump, we're able to inspect the exact state of the spin lock.

On x86_64 Linux, ticket spin locks are used. For those not familiar with ticket spin locks, there's a good LWN article explaining them: http://lwn.net/Articles/267968/. Very briefly, each lock is split into two parts: a next value and an owner value. When the next value equals the owner value, the lock is not held and can be acquired.

Here's where it gets interesting. In the case shown below the spin lock has a value of 0x00020002. This means that since the lock was initialized it has been unlocked twice but locked only once. In this state a waiter will spin forever and never acquire the lock.
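
To make the arithmetic concrete, here is a small userspace sketch (not kernel code) that decodes the slock word the way this analysis does. The field layout is an assumption based on the 2.6.32-era x86_64 implementation: the low 16 bits hold the "owner" ticket currently being served, the high 16 bits the "next" ticket to hand out.

#include <stdio.h>
#include <stdint.h>

/* Decode a ticket spin lock word into its owner and next tickets. */
static void decode(uint32_t slock)
{
    uint16_t owner = (uint16_t)(slock & 0xffff);
    uint16_t next  = (uint16_t)(slock >> 16);

    printf("slock=0x%08x  owner=%u  next=%u\n", slock, owner, next);
}

int main(void)
{
    decode(0x00000000);   /* freshly initialized: no tickets issued     */
    decode(0x00010001);   /* locked once, unlocked once: balanced       */
    decode(0x00010002);   /* one lock but two unlocks: owner ran ahead  */

    /*
     * The damaged value from the dump.  It looks balanced, but only
     * because the stuck waiter already took ticket 1 (bumping next to 2)
     * after owner had skipped past 1, so that waiter spins forever
     * waiting for owner == 1.
     */
    decode(0x00020002);
    return 0;
}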

There are only two plausible ways this can happen.

  • The spin lock is unlocked multiple times.
  • The spin lock was reinitialized while it was locked.

I've inspected the code and thus far haven't found a way that a multiple unlock could occur. This also seems particularly unlikely because, as I mentioned above, the management of this lock is handled entirely by the Linux mutex primitive, and it seems very unlikely that it suffers from a bug of this kind.

That leaves the spin lock being reinitialized while it is locked. I like this theory much better because it cleanly explains how the lock can end up in this state and also why this problem is so rare: the spin lock would need to be damaged at exactly the right instant. The problem is that the surrounding zio structure doesn't appear to be damaged as one might expect, and I haven't found any code which might cause this.

So unfortunately, while we now have solid evidence for what happened, it's still unclear to me how it can happen.

crash> bt 5272
PID: 5272   TASK: ffff88102db36040  CPU: 1   COMMAND: "z_null_iss/0"
 #0 [ffff88102dbf1be0] schedule at ffffffff8152a940
 #1 [ffff88102dbf1ca8] __mutex_lock_slowpath at ffffffff8152c13e
 #2 [ffff88102dbf1d18] mutex_lock at ffffffff8152bfde
 #3 [ffff88102dbf1d38] zio_wait_for_children at ffffffffa0492eb5 [zfs]
 #4 [ffff88102dbf1d68] zio_done at ffffffffa04979a8 [zfs]
 #5 [ffff88102dbf1de8] zio_execute at ffffffffa04944c3 [zfs]
 #6 [ffff88102dbf1e28] taskq_thread at ffffffffa0348d37 [spl]
 #7 [ffff88102dbf1ee8] kthread at ffffffff8109b486
 #8 [ffff88102dbf1f48] kernel_thread at ffffffff8100c24a

crash> bt 5374
PID: 5374   TASK: ffff88102ccbb500  CPU: 9   COMMAND: "txg_sync"
 #0 [ffff88085c427e90] crash_nmi_callback at ffffffff81030156
 #1 [ffff88085c427ea0] notifier_call_chain at ffffffff81530635
 #2 [ffff88085c427ee0] atomic_notifier_call_chain at ffffffff8153069a
 #3 [ffff88085c427ef0] notify_die at ffffffff810a1afe
 #4 [ffff88085c427f20] do_nmi at ffffffff8152e2fb
 #5 [ffff88085c427f50] nmi at ffffffff8152dbc0
    [exception RIP: _spin_lock+33]
    RIP: ffffffff8152d431  RSP: ffff88102cce9820  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: ffff880b2ef5f954  RCX: 0000000000000000
    RDX: 0000000000000002  RSI: 0000000000000202  RDI: ffff880b2ef5f954
    RBP: ffff88102cce9820   R8: ffff880b2ef5f980   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880b2ef5f950
    R13: ffff880b2ef5f950  R14: ffff880b2ef5f978  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88102cce9820] _spin_lock at ffffffff8152d431
 #7 [ffff88102cce9828] __mutex_unlock_slowpath at ffffffff8152bd05
 #8 [ffff88102cce9848] mutex_unlock at ffffffff8152bcdb
 #9 [ffff88102cce9858] cv_wait_common at ffffffffa034cae2 [spl]
#10 [ffff88102cce98c8] __cv_wait_io at ffffffffa034cb78 [spl]
#11 [ffff88102cce98d8] zio_wait at ffffffffa049464b [zfs]
#12 [ffff88102cce9918] dmu_buf_hold_array_by_dnode at ffffffffa040b0c7 [zfs]
#13 [ffff88102cce9998] dmu_buf_hold_array at ffffffffa040b5a5 [zfs]
#14 [ffff88102cce99f8] dmu_write at ffffffffa040ba00 [zfs]
#15 [ffff88102cce9a78] space_map_sync at ffffffffa044d9b9 [zfs]
#16 [ffff88102cce9b28] metaslab_condense at ffffffffa043705b [zfs]
#17 [ffff88102cce9c38] metaslab_sync at ffffffffa043733c [zfs]
#18 [ffff88102cce9cb8] vdev_sync at ffffffffa0451b5f [zfs]
#19 [ffff88102cce9d08] spa_sync at ffffffffa043cd9b [zfs]
#20 [ffff88102cce9dd8] txg_sync_thread at ffffffffa044f540 [zfs]
#21 [ffff88102cce9eb8] thread_generic_wrapper at ffffffffa03485c8 [spl]
#22 [ffff88102cce9ee8] kthread at ffffffff8109b486
#23 [ffff88102cce9f48] kernel_thread at ffffffff8100c24a

crash> p *(zio_t *)0xffff880b2ef5f640
$34 = {
  io_bookmark = {
    zb_objset = 0, 
    zb_object = 0, 
    zb_level = 0, 
    zb_blkid = 0
  }, 
  io_prop = {
    zp_checksum = ZIO_CHECKSUM_INHERIT, 
    zp_compress = ZIO_COMPRESS_INHERIT, 
    zp_type = DMU_OT_NONE, 
    zp_level = 0 '\000', 
    zp_copies = 0 '\000', 
    zp_dedup = B_FALSE, 
    zp_dedup_verify = B_FALSE, 
    zp_nopwrite = B_FALSE
  }, 
  io_type = ZIO_TYPE_NULL, 
  io_child_type = ZIO_CHILD_LOGICAL, 
  io_cmd = 0, 
  io_priority = ZIO_PRIORITY_NOW, 
  io_reexecute = 0 '\000', 
  io_state = "\001", 
  io_txg = 0, 
  io_spa = 0xffff8807ff5c8000, 
  io_bp = 0x0, 
  io_bp_override = 0x0, 
  io_bp_copy = {
    blk_dva = {{
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }}, 
    blk_prop = 0, 
    blk_pad = {0, 0}, 
    blk_phys_birth = 0, 
    blk_birth = 0, 
    blk_fill = 0, 
    blk_cksum = {
      zc_word = {0, 0, 0, 0}
    }
  }, 
  io_parent_list = {
    list_size = 48, 
    list_offset = 16, 
    list_head = {
      next = 0xffff880b2ef5f740, 
      prev = 0xffff880b2ef5f740
    }
  }, 
  io_child_list = {
    list_size = 48, 
    list_offset = 32, 
    list_head = {
      next = 0xffff880b2ef5f760, 
      prev = 0xffff880b2ef5f760
    }
  }, 
  io_walk_link = 0x0, 
  io_logical = 0x0, 
  io_transform_stack = 0x0, 
  io_ready = 0, 
  io_physdone = 0, 
  io_done = 0, 
  io_private = 0x0, 
  io_prev_space_delta = 0, 
  io_bp_orig = {
    blk_dva = {{
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }}, 
    blk_prop = 0, 
    blk_pad = {0, 0}, 
    blk_phys_birth = 0, 
    blk_birth = 0, 
    blk_fill = 0, 
    blk_cksum = {
      zc_word = {0, 0, 0, 0}
    }
  }, 
  io_data = 0x0, 
  io_orig_data = 0x0, 
  io_size = 0, 
  io_orig_size = 0, 
  io_vd = 0x0, 
  io_vsd = 0x0, 
  io_vsd_ops = 0x0, 
  io_offset = 0, 
  io_timestamp = 0, 
  io_delta = 0, 
  io_delay = 0, 
  io_queue_node = {
    avl_child = {0x0, 0x0}, 
    avl_pcb = 0
  }, 
  io_flags = ZIO_FLAG_CANFAIL, 
  io_stage = ZIO_STAGE_DONE, 
  io_pipeline = 2162688, 
  io_orig_flags = ZIO_FLAG_CANFAIL, 
  io_orig_stage = ZIO_STAGE_OPEN, 
  io_orig_pipeline = 2162688, 
  io_error = 0, 
  io_child_error = {0, 0, 0, 0}, 
  io_children = {{0, 0}, {0, 0}, {0, 0}, {0, 0}}, 
  io_child_count = 0, 
  io_phys_children = 0, 
  io_parent_count = 0, 
  io_stall = 0x0, 
  io_gang_leader = 0x0, 
  io_gang_tree = 0x0, 
  io_executor = 0xffff88102db36040, 
  io_waiter = 0xffff88102ccbb500, 
  io_lock = {
    m = {
      count = {
        counter = -1
      }, 
      wait_lock = {
        raw_lock = {
          slock = 131074
        }
      }, 
      wait_list = {
        next = 0xffff88102dbf1cc0, 
        prev = 0xffff88102dbf1cc0
      }, 
      owner = 0x0
    }
  }, 
  io_cv = {
    cv_magic = 879052276, 
    cv_event = {
      lock = {
        raw_lock = {
          slock = 65537
        }
      }, 
      task_list = {
        next = 0xffff88102cce9878, 
        prev = 0xffff88102cce9878
      }
    }, 
    cv_destroy = {
      lock = {
        raw_lock = {
          slock = 0
        }
      }, 
      task_list = {
        next = 0xffff880b2ef5f998, 
        prev = 0xffff880b2ef5f998
      }
    }, 
    cv_refs = {
      counter = 2
    }, 
    cv_waiters = {
      counter = 1
    }, 
    cv_mutex = 0xffff880b2ef5f950
  }, 
  io_cksum_report = 0x0, 
  io_ena = 0, 
  io_tqent = {
    tqent_lock = {
      raw_lock = {
        slock = 65537
      }
    }, 
    tqent_waitq = {
      lock = {
        raw_lock = {
          slock = 0
        }
      }, 
      task_list = {
        next = 0xffff880b2ef5f9d8, 
        prev = 0xffff880b2ef5f9d8
      }
    }, 
    tqent_timer = {
      entry = {
        next = 0x0, 
        prev = 0x0
      }, 
      expires = 0, 
      function = 0, 
      data = 0, 
      base = 0xffff8810343b4000, 
      start_site = 0x0, 
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
      start_pid = -1
    }, 
    tqent_list = {
      next = 0xffff880b2ef5fa38, 
      prev = 0xffff880b2ef5fa38
    }, 
    tqent_id = 193943569, 
    tqent_func = 0xffffffffa0494410, 
    tqent_arg = 0xffff880b2ef5f640, 
    tqent_taskq = 0xffff8808076d5780, 
    tqent_flags = 1
  }
}
@behlendorf behlendorf added this to the 0.6.4 milestone Jul 22, 2014
@behlendorf behlendorf added the Bug label Jul 22, 2014
@behlendorf
Contributor Author

The same issue was observed again. The z_fr_iss_0/1 thread is stuck spinning on the internal mutex spin lock which protects the wait list. The damage looks like a double free, but it's unclear how that could happen. Critical details from this second instance follow:

# z_fr_iss_0/1
 [<ffffffff8152ccc5>] ? __mutex_unlock_slowpath+0x25/0x60
 [<ffffffff8152cc9b>] ? mutex_unlock+0x1b/0x20
 [<ffffffffa04b1f25>] ? zio_done+0x5b5/0xe00 [zfs]
 [<ffffffffa04b1677>] ? zio_ready+0x177/0x470 [zfs]
 [<ffffffffa04ae503>] ? zio_execute+0xb3/0x140 [zfs]
 [<ffffffffa0371d37>] ? taskq_thread+0x1e7/0x3f0 [spl]
 [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
 [<ffffffffa0371b50>] ? taskq_thread+0x0/0x3f0 [spl]
 [<ffffffff8109b4d6>] ? kthread+0x96/0xa0
 [<ffffffff8100c24a>] ? child_rip+0xa/0x20
 [<ffffffff8109b440>] ? kthread+0x0/0xa0
 [<ffffffff8100c240>] ? child_rip+0x0/0x20

crash> gdb list *(zio_done+0x5b5)
0xffffffffa04b1f25 is in zio_done (/usr/src/debug/zfs-kmod-0.6.3/zfs-0.6.3/module/zfs/zio.c:3246).
3241    
3242            if (zio->io_waiter != NULL) {
3243                    mutex_enter(&zio->io_lock);
3244                    zio->io_executor = NULL;
3245                    cv_broadcast(&zio->io_cv);
3246                    mutex_exit(&zio->io_lock);
3247            } else {
3248                    zio_destroy(zio);
3249            }
3250  
# txg_sync
[<ffffffffa0375af4>] schedule at ffffffff8152b900
[<ffffffffa0375af4>] __mutex_lock_slowpath at ffffffff8152d0fe
[<ffffffffa0375af4>] mutex_lock at ffffffff8152cf9e
[<ffffffffa0375af4>] cv_wait_common+0x94/0x100 [spl]
[<ffffffffa0375b78>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa04ae68b>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa04b016b>] zio_free+0xab/0xe0 [zfs]
[<ffffffffa0449611>] dsl_free+0x11/0x20 [zfs]
[<ffffffffa043e521>] dsl_dataset_block_kill+0x1a1/0x350 [zfs]
[<ffffffffa043710e>] free_blocks+0x6e/0xb0 [zfs]
[<ffffffffa04377a4>] free_children+0x294/0x340 [zfs]
[<ffffffffa04376db>] free_children+0x1cb/0x340 [zfs]
[<ffffffffa04376db>] free_children+0x1cb/0x340 [zfs]
[<ffffffffa0437bc9>] dnode_sync+0x379/0x9e0 [zfs]
[<ffffffffa0427fe9>] dmu_objset_sync_dnodes+0x89/0xb0 [zfs]
[<ffffffffa04281c3>] dmu_objset_sync+0x1b3/0x2d0 [zfs]
[<ffffffffa04394ac>] dsl_dataset_sync+0x4c/0x60 [zfs]
[<ffffffffa044479a>] dsl_pool_sync+0x9a/0x3f0 [zfs]
[<ffffffffa0456d4b>] spa_sync+0x40b/0xa60 [zfs]
[<ffffffffa0469570>] txg_sync_thread+0x300/0x520 [zfs]
[<ffffffffa03715c8>] thread_generic_wrapper+0x68/0x80 [spl]
[<ffffffff8109b4d6>] kthread+0x96/0xa0
[<ffffffff8100c24a>] child_rip+0xa/0x20
# There are 16 zio's allocated from the slab, including the one below, indicating it has not been freed.
crash> kmem -s
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff880818d61e00 zio_cache               1072         16       112     16     8k

# The offending zio has damage to the spin lock suggesting a double free.
crash> struct -x zio ffff8808e6b5f700
struct zio {
  io_bookmark = {
    zb_objset = 0x0, 
    zb_object = 0x0, 
    zb_level = 0x0, 
    zb_blkid = 0x0
  }, 
  io_prop = {
    zp_checksum = ZIO_CHECKSUM_INHERIT, 
    zp_compress = ZIO_COMPRESS_INHERIT, 
    zp_type = DMU_OT_NONE, 
    zp_level = 0x0, 
    zp_copies = 0x0, 
    zp_dedup = B_FALSE, 
    zp_dedup_verify = B_FALSE, 
    zp_nopwrite = B_FALSE
  }, 
  io_type = ZIO_TYPE_FREE, 
  io_child_type = ZIO_CHILD_LOGICAL, 
  io_cmd = 0x0, 
  io_priority = ZIO_PRIORITY_NOW, 
  io_reexecute = 0x0, 
  io_state = "\001\001", 
  io_txg = 0x23a803, 
  io_spa = 0xffff88102e04e000, 
  io_bp = 0xffff8808e6b5f770, 
  io_bp_override = 0x0, 
  io_bp_copy = {
    blk_dva = {{
        dva_word = {0x35, 0x5cfd28404}
      }, {
        dva_word = {0x0, 0x0}
      }, {
        dva_word = {0x0, 0x0}
      }}, 
    blk_prop = 0x80130703003400ff, 
    blk_pad = {0x0, 0x0}, 
    blk_phys_birth = 0x0, 
    blk_birth = 0x23a75c, 
    blk_fill = 0x1, 
    blk_cksum = {
      zc_word = {0x77fd3b1b03f, 0x64114459516315, 0x6ae8943b3134ab81, 0x668ee8ddfd38e19d}
    }
  }, 
  io_parent_list = {
    list_size = 0x30, 
    list_offset = 0x10, 
    list_head = {
      next = 0xffff8808e6b5f800, 
      prev = 0xffff8808e6b5f800
    }
  }, 
  io_child_list = {
    list_size = 0x30, 
    list_offset = 0x20, 
    list_head = {
      next = 0xffff8808e6b5f820, 
      prev = 0xffff8808e6b5f820
    }
  }, 
  io_walk_link = 0x0, 
  io_logical = 0xffff8808e6b5f700, 
  io_transform_stack = 0x0, 
  io_ready = 0, 
  io_physdone = 0, 
  io_done = 0, 
  io_private = 0x0, 
  io_prev_space_delta = 0x0, 
  io_bp_orig = {
    blk_dva = {{
        dva_word = {0x35, 0x5cfd28404}
      }, {
        dva_word = {0x0, 0x0}
      }, {
        dva_word = {0x0, 0x0}
      }}, 
    blk_prop = 0x80130703003400ff, 
    blk_pad = {0x0, 0x0}, 
    blk_phys_birth = 0x0, 
    blk_birth = 0x23a75c, 
    blk_fill = 0x1, 
    blk_cksum = {
      zc_word = {0x77fd3b1b03f, 0x64114459516315, 0x6ae8943b3134ab81, 0x668ee8ddfd38e19d}
    }
  }, 
  io_data = 0x0, 
  io_orig_data = 0x0, 
  io_size = 0x6a00, 
  io_orig_size = 0x6a00, 
  io_vd = 0x0, 
  io_vsd = 0x0, 
  io_vsd_ops = 0x0, 
  io_offset = 0x0, 
  io_timestamp = 0x0, 
  io_delta = 0x0, 
  io_delay = 0x0, 
  io_queue_node = {
    avl_child = {0x0, 0x0}, 
    avl_pcb = 0x0
  }, 
  io_flags = 0, 
  io_stage = ZIO_STAGE_DONE, 
  io_pipeline = 2179076, 
  io_orig_flags = 0, 
  io_orig_stage = ZIO_STAGE_OPEN, 
  io_orig_pipeline = 2179076, 
  io_error = 0x0, 
  io_child_error = {0x0, 0x0, 0x0, 0x0}, 
  io_children = {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}}, 
  io_child_count = 0x0, 
  io_phys_children = 0x0, 
  io_parent_count = 0x0, 
  io_stall = 0x0, 
  io_gang_leader = 0x0, 
  io_gang_tree = 0x0, 
  io_executor = 0x0, 
  io_waiter = 0xffff88080b247540, 
  io_lock = {
    m = {
      count = {
        counter = 0xffffffff
      }, 
      wait_lock = {
        raw_lock = {
          slock = 0x20002
        }
      }, 
      wait_list = {
        next = 0xffff880806c67640, 
        prev = 0xffff880806c67640
      }, 
      owner = 0x0
    }
  }, 
  io_cv = {
    cv_magic = 0x346545f4, 
    cv_event = {
      lock = {
        raw_lock = {
          slock = 0x20002
        }
      }, 
      task_list = {
        next = 0xffff8808e6b5fa40, 
        prev = 0xffff8808e6b5fa40
      }
    }, 
    cv_destroy = {
      lock = {
        raw_lock = {
          slock = 0x0
        }
      }, 
      task_list = {
        next = 0xffff8808e6b5fa58, 
        prev = 0xffff8808e6b5fa58
      }
    }, 
    cv_refs = {
      counter = 0x2
    }, 
    cv_waiters = {
      counter = 0x1
    }, 
    cv_mutex = 0xffff8808e6b5fa10
  }, 
  io_cksum_report = 0x0, 
  io_ena = 0x0, 
  io_tqent = {
    tqent_lock = {
      raw_lock = {
        slock = 0x10001
      }
    }, 
    tqent_waitq = {
      lock = {
        raw_lock = {
          slock = 0x0
        }
      }, 
      task_list = {
        next = 0xffff8808e6b5fa98, 
        prev = 0xffff8808e6b5fa98
      }
    }, 
    tqent_timer = {
      entry = {
        next = 0x0, 
        prev = 0x0
      }, 
      expires = 0x0, 
      function = 0, 
      data = 0x0, 
      base = 0xffff881033c5c000, 
      start_site = 0x0, 
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
      start_pid = 0xffffffff
    }, 
    tqent_list = {
      next = 0xffff8808e6b5faf8, 
      prev = 0xffff8808e6b5faf8
    }, 
    tqent_id = 0x25e1f69, 
    tqent_func = 0xffffffffa04ae450 <zio_execute>, 
    tqent_arg = 0xffff8808e6b5f700, 
    tqent_taskq = 0xffff881030d4b8c0, 
    tqent_flags = 0x1
  }
}

@lundman
Contributor

lundman commented Aug 26, 2014

I hesitate to comment, since it is very unlikely to be related. We had many panics on zio->io_lock which seemed connected to memory corruption, but it turned out to be related to how our SPL memory manager was designed.

In the end it was caused by:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_ioctl.c#L5833
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_ioctl.c#L5907

.. because saved_poolname had been truncated, the alloc and free calls were passing different "size" parameters. So, for example, alloc would take the memory from the size=16 bucket, but free would place it back in the size=8 bucket. This meant we handed the same memory out twice, and so on. Most of the panics showed up in zio_done() with zio->io_lock, but also in a whole bunch of random places.

But I don't think ZoL's allocator cares as much about the size parameter as ours does, so it is likely not relevant.

openzfsonosx/zfs@2e1aed2
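
For illustration only, here is a minimal userspace model of that failure mode: an allocator whose free() takes an explicit size, as the illumos-style strfree()/kmem_free() interface does. The names and the pool-name handling below are hypothetical, not the actual O3X code; the point is just that truncating the string between allocation and free changes the size argument, and a size-bucketed allocator then files the block under the wrong class.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for kmem_alloc()/kmem_free(); a real SPL keys off `size`. */
static void *sized_alloc(size_t size)
{
    printf("alloc size=%zu\n", size);
    return malloc(size);
}

static void sized_free(void *ptr, size_t size)
{
    /* A bucketed allocator would return ptr to the freelist for `size`. */
    printf("free  size=%zu\n", size);
    free(ptr);
}

int main(void)
{
    /* Mimic saved_poolname handling: allocate room for the full name... */
    const char *name = "tank/some/long/dataset";
    size_t alloc_len = strlen(name) + 1;
    char *saved = sized_alloc(alloc_len);
    strcpy(saved, name);

    /* ...later the saved copy gets truncated at a separator... */
    *strchr(saved, '/') = '\0';

    /* ...and freed using the current length instead of alloc_len. */
    sized_free(saved, strlen(saved) + 1);   /* 5 != 23: wrong bucket */
    return 0;
}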

@dweeezil
Contributor

As your links above point out, in fb8e608 the handling of saved_poolname was converted to use strdup()/strfree(). In the ZoL SPL, these calls map to kmalloc()/kfree(), which (at least in SLOB) prepend the space with a 4-byte header to remember the size of the original allocation.

@behlendorf
Contributor Author

Any insights here are welcome. But as @dweeezil pointed out, in this case we're directly mapping strdup()/strfree() to kmalloc()/kfree(), so we can't really have an accounting issue here, at least not in the SPL.

Your comment, however, reminded me that ZoL's SPL has a --enable-debug-kmem-tracking option which enables detailed tracking of every allocation. It will verify that every address which is allocated with a given size is freed with that same size. It's horrible for performance, but if something odd like that is going on it should catch it. I'll give it a try on a test setup.
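
For reference, the core of that check reduces to something like the following userspace sketch. The real SPL tracking code is considerably more involved (locking, per-allocation records, leak reporting), so treat this only as the shape of the idea.

#include <assert.h>
#include <stdlib.h>

#define MAX_TRACKED 1024

/* One slot per live allocation: the address and the size it was given. */
static struct { void *ptr; size_t size; } tracked[MAX_TRACKED];

static void *tracked_alloc(size_t size)
{
    void *ptr = malloc(size);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr == NULL) {
            tracked[i].ptr = ptr;
            tracked[i].size = size;
            return ptr;
        }
    }
    abort();    /* table full; not handled in a sketch */
}

static void tracked_free(void *ptr, size_t size)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr == ptr) {
            /* The check the debug option provides: sizes must match. */
            assert(tracked[i].size == size);
            tracked[i].ptr = NULL;
            free(ptr);
            return;
        }
    }
    assert(!"free of untracked address (double free or bad pointer)");
}

int main(void)
{
    void *p = tracked_alloc(16);
    tracked_free(p, 16);    /* fine */

    void *q = tracked_alloc(16);
    tracked_free(q, 8);     /* size mismatch: trips the assertion */
    return 0;
}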

@lundman
Contributor

lundman commented Aug 27, 2014

Ah, I see: you directly map it to Linux-specific allocation calls which do not take a size argument, and yet still have spa_strdup().

Anyway, sorry I couldn't help...

@ryao
Contributor

ryao commented Sep 4, 2014

@behlendorf If a double free occurs on an object allocated by spl_kmem_cache_alloc(), then it will be allocated twice. That would result in something like this happening. I managed to trigger something like this when building ztest with Clang not that long ago, but I did not identify the cause of the double free.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 4, 2014
One theory for openzfs#2523 is that somehow the spinlock at the start of
the mutex is being accidentally overwritten.  To test this theory
a magic value is added before the mutex.  By checking this value
periodically against a known magic value we can detect if this
is somehow happening.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 4, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
@behlendorf
Contributor Author

@ryao Yes, a double free is a possible explanation for this but thus far I haven't been able to determine how that's possible. Have you been able to reproduce anything like this? It's been difficult to resolve because it occurs often enough to be a problem but I've been completely unable to reproduce it otherwise.

@dswartz
Contributor

dswartz commented Sep 4, 2014

@ryao Yes, a double free is a possible explanation for this but thus far I
haven't been able to determine how that's possible. Have you been able to
reproduce anything like this? It's been difficult to resolve because it
occurs often enough to be a problem but I've been completely unable to
reproduce it otherwise.

Is it possible to make some kind of debug wrapper to catch the 2nd free in
the act?

@dswartz
Contributor

dswartz commented Sep 4, 2014

Oh never mind, only just now saw your patch to try to track this down :)

@behlendorf
Contributor Author

@dswartz It should be, but thus far I haven't been able to hit the issue with any of the debug patches in place. Generally I dislike adding this kind of debugging to the main tree, but we may have to just to get some additional information.

@ryao
Contributor

ryao commented Sep 5, 2014

@behlendorf I suspect that I could reproduce the double free that I encountered when building with Clang, but I do not have instructions to do that on hand at the moment.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 8, 2014
One theory for openzfs#2523 is that somehow the spinlock at the start of
the mutex is being accidentally overwritten.  To test this theory
a magic value is added before the mutex.  By checking this value
periodically against a known magic value we can detect if this
is somehow happening.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 8, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
From the information gathered about issue openzfs#2523 it's clear that
somehow the spin lock is being damaged.  This could occur if the
spin lock is reinitialized, the memory is accidentally overwritten,
or freed back to the cache too soon.  To determine exactly what is
happening this patch adds a couple new sanity tests.

* A zio->io_magic field is added before the io_lock.  This field
  is designed to act as a red zone, allowing us to detect if the zio
  has been overwritten.

* The zio->io_magic field is also used to detect if somehow the
  constructor or destructor is running multiple times for the object.
  This would effectively cause the spin lock to be reinitialized.

* The destructor has been updated to poison the entire structure.
  This should cause us to quickly detect any use-after-free bugs.

Once the root cause of this issue has been determined this patch
should be reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
From the information gathered about issue openzfs#2523 it's clear that
somehow the spin lock is being damaged.  This could occur if the
spin lock is reinitialized, the memory is accidentally overwritten,
or freed back to the cache too soon.  To determine exactly what is
happening this patch adds a couple new sanity tests.

* A zio->io_magic field is added before the io_lock.  This field
  is designed to act as a red zone, allowing us to detect if the zio
  has been overwritten.

* The zio->io_magic field is also used to detect if somehow the
  constructor or destructor is running multiple times for the object.
  This would effectively cause the spin lock to be reinitialized.

* The destructor has been updated to poison the entire structure.
  This should cause us to quickly detect any use-after-free bugs.

Once the root cause of this issue has been determined this patch
should be reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
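
The sanity checks described in the debug commits above boil down to something like the following sketch. The magic values and struct/function names here are hypothetical, and the real patch instruments the zio constructor and destructor, but it shows how a double construction (the same object handed out twice without being destroyed in between) trips an assertion, which is exactly the failure reported next.

#include <assert.h>
#include <string.h>

#define OBJ_MAGIC_LIVE   0x5a5a5a5a5a5a5a5aULL  /* hypothetical patterns */
#define OBJ_MAGIC_FREED  0xdeadbeefdeadbeefULL

struct obj {
    unsigned long long magic;   /* red zone guarding the fields below */
    /* e.g. the embedded lock and the rest of the structure */
    int payload;
};

static void obj_cons(struct obj *o)
{
    /* A live pattern here means the constructor ran twice in a row,
     * i.e. the object was handed out twice without being destroyed. */
    assert(o->magic != OBJ_MAGIC_LIVE);
    memset(o, 0, sizeof (*o));
    o->magic = OBJ_MAGIC_LIVE;
}

static void obj_dest(struct obj *o)
{
    /* Anything but the live pattern means an overwrite or double destroy. */
    assert(o->magic == OBJ_MAGIC_LIVE);
    memset(o, 0x21, sizeof (*o));   /* poison to catch use-after-free */
    o->magic = OBJ_MAGIC_FREED;
}

int main(void)
{
    static struct obj o;    /* zero-filled, like a fresh slab object */

    obj_cons(&o);
    obj_dest(&o);
    obj_cons(&o);           /* legitimate reuse is fine */
    obj_cons(&o);           /* double construction: assertion fires */
    return 0;
}
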
@mgmartin

I ran with the assert magic patch last night and hit an assert failure under load. Hope this helps narrow this nasty issue down!

Sep 11 02:00:16 server kernel: VERIFY3(zio->io_magic != 0x210210210210210ULL) failed (148655080447476240 != 148655080447476240)
Sep 11 02:00:16 server kernel: SPLError: 697:0:(zio.c:102:zio_cons()) SPL PANIC
Sep 11 02:00:16 server kernel: SPL: Showing stack for process 697
Sep 11 02:00:16 server kernel: CPU: 5 PID: 697 Comm: z_wr_iss/7 Tainted: P          IO  3.16.2-mgm #1
Sep 11 02:00:16 server kernel: Hardware name: HP ProLiant DL380 G6, BIOS P62 07/02/2013
Sep 11 02:00:16 server kernel:  0000000000000000 00000000470a6f12 ffff880306e079b8 ffffffff8150bb2f
Sep 11 02:00:16 server kernel:  ffff880302e5da90 ffff880306e079d0 ffffffffa0006b15 0000000000000000
Sep 11 02:00:16 server kernel:  ffff880306e079f8 ffffffffa0007ab0 ffffffffa001c791 000000000000ffed
Sep 11 02:00:16 server kernel: Call Trace:
Sep 11 02:00:16 server kernel:  [<ffffffff8150bb2f>] dump_stack+0x45/0x56
Sep 11 02:00:16 server kernel:  [<ffffffffa0006b15>] spl_debug_dumpstack+0x45/0x60 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffffa0007ab0>] spl_debug_bug+0x80/0xe0 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffffa0014955>] spl_PANIC+0xa5/0xd0 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffffa012cca3>] ? vdev_queue_io_to_issue+0x153/0x8f0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffff813878e0>] ? extract_buf+0xd0/0x190
Sep 11 02:00:16 server kernel:  [<ffffffff81385a92>] ? __mix_pool_bytes+0x42/0x90
Sep 11 02:00:16 server kernel:  [<ffffffff81387901>] ? extract_buf+0xf1/0x190
Sep 11 02:00:16 server kernel:  [<ffffffffa0168774>] zio_cons+0xf4/0x100 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa000b67e>] spl_kmem_cache_alloc+0xae/0xdc0 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffff81388364>] ? extract_entropy+0x94/0x1d0
Sep 11 02:00:16 server kernel:  [<ffffffffa016b210>] zio_create+0x40/0x5a0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa016c0fb>] zio_vdev_child_io+0xbb/0x110 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa012c746>] vdev_mirror_io_start+0xa6/0x1c0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa011b8ef>] ? spa_config_enter+0xaf/0xf0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa0169327>] zio_vdev_io_start+0x217/0x2f0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa016a470>] zio_execute+0x100/0x1b0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa0011437>] taskq_thread+0x267/0x510 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffff8109eb80>] ? wake_up_process+0x50/0x50
Sep 11 02:00:16 server kernel:  [<ffffffffa00111d0>] ? taskq_cancel_id+0x200/0x200 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 11 02:00:16 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:00:16 server kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 11 02:00:16 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0

@behlendorf
Contributor Author

@mgmartin Thanks, I was also able to catch a failure in overnight stress testing. This should help us get to the root cause.

@mgmartin

This might help too. Here are the hung task stacks after the panic:

Sep 11 02:03:14  kernel: z_wr_iss/7      D ffff88061fa54400     0   697      2 0x00000000
Sep 11 02:03:14  kernel:  ffff880306e079c0 0000000000000046 ffff880302e5da90 0000000000014400
Sep 11 02:03:14  kernel:  ffff880306e07fd8 0000000000014400 ffff880302e5da90 ffff880306e07920
Sep 11 02:03:14  kernel:  ffffffff8101770d ffff880306e079e8 0000000000000000 ffff880306e079a0
Sep 11 02:03:14  kernel: Call Trace:
Sep 11 02:03:14  kernel:  [<ffffffff8101770d>] ? show_trace_log_lvl+0x4d/0x60
Sep 11 02:03:14  kernel:  [<ffffffff810163fd>] ? show_stack_log_lvl+0x9d/0x1a0
Sep 11 02:03:14  kernel:  [<ffffffff81017774>] ? show_stack+0x34/0xa0
Sep 11 02:03:14  kernel:  [<ffffffff8150d929>] schedule+0x29/0x70
Sep 11 02:03:14  kernel:  [<ffffffffa0007ae5>] spl_debug_bug+0xb5/0xe0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa0014955>] spl_PANIC+0xa5/0xd0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa012cca3>] ? vdev_queue_io_to_issue+0x153/0x8f0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffff813878e0>] ? extract_buf+0xd0/0x190
Sep 11 02:03:14  kernel:  [<ffffffff81385a92>] ? __mix_pool_bytes+0x42/0x90
Sep 11 02:03:14  kernel:  [<ffffffff81387901>] ? extract_buf+0xf1/0x190
Sep 11 02:03:14  kernel:  [<ffffffffa0168774>] zio_cons+0xf4/0x100 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa000b67e>] spl_kmem_cache_alloc+0xae/0xdc0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff81388364>] ? extract_entropy+0x94/0x1d0
Sep 11 02:03:14  kernel:  [<ffffffffa016b210>] zio_create+0x40/0x5a0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa016c0fb>] zio_vdev_child_io+0xbb/0x110 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa012c746>] vdev_mirror_io_start+0xa6/0x1c0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa011b8ef>] ? spa_config_enter+0xaf/0xf0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0169327>] zio_vdev_io_start+0x217/0x2f0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa016a470>] zio_execute+0x100/0x1b0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0011437>] taskq_thread+0x267/0x510 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8109eb80>] ? wake_up_process+0x50/0x50
Sep 11 02:03:14  kernel:  [<ffffffffa00111d0>] ? taskq_cancel_id+0x200/0x200 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:03:14  kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:03:14  kernel: INFO: task txg_sync:790 blocked for more than 120 seconds.
Sep 11 02:03:14  kernel:       Tainted: P          IO  3.16.2-mgm #1
Sep 11 02:03:14  kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 11 02:03:14  kernel: txg_sync        D 0000000000000001     0   790      2 0x00000000
Sep 11 02:03:14  kernel:  ffff880304fafba8 0000000000000046 ffff880303309e30 0000000000014400
Sep 11 02:03:14  kernel:  ffff880304faffd8 0000000000014400 ffff880303309e30 0000000000000046
Sep 11 02:03:14  kernel:  ffff880304fafb30 ffffffff8109e9fb 0000000000000246 ffff880304fafb30
Sep 11 02:03:14  kernel: Call Trace:
Sep 11 02:03:14  kernel:  [<ffffffff8109e9fb>] ? try_to_wake_up+0x1fb/0x330
Sep 11 02:03:14  kernel:  [<ffffffffa000fc1a>] ? taskq_dispatch_ent+0x6a/0x1c0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8109eb92>] ? default_wake_function+0x12/0x20
Sep 11 02:03:14  kernel:  [<ffffffff810b1d35>] ? __wake_up_common+0x55/0x90
Sep 11 02:03:14  kernel:  [<ffffffff8150d929>] schedule+0x29/0x70
Sep 11 02:03:14  kernel:  [<ffffffff8150dbb4>] io_schedule+0x94/0xf0
Sep 11 02:03:14  kernel:  [<ffffffffa001743c>] cv_wait_common+0xac/0x1b0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa016a370>] ? zio_taskq_member.isra.4+0x80/0x80 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffff810b2440>] ? __wake_up_sync+0x20/0x20
Sep 11 02:03:14  kernel:  [<ffffffffa0017598>] __cv_wait_io+0x18/0x20 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa016c3da>] zio_wait+0x19a/0x260 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa00f6688>] dsl_pool_sync+0x2e8/0x470 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0110cd5>] spa_sync+0x455/0xaf0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffff810d4a78>] ? ktime_get_ts+0x48/0xf0
Sep 11 02:03:14  kernel:  [<ffffffffa0121cc2>] txg_sync_thread+0x3b2/0x640 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0121910>] ? txg_delay+0xf0/0xf0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa000f23a>] thread_generic_wrapper+0x7a/0x90 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa000f1c0>] ? __thread_exit+0xa0/0xa0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:03:14  kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0

@bjquinn

bjquinn commented Dec 18, 2014

@behlendorf I'll try it out and report back

@behlendorf
Contributor Author

@olw2005 using just the patch in openzfs/spl#421 is enough.

@mgmartin

@tuxoko awesome! I've been running pretty well with a current 3.14 kernel and the temporary fix of spl_kmem_cache_slab_limit=0. I'll remove that module option and apply the patch across a few servers that were locking up pretty consistently.

@lundman
Contributor

lundman commented Dec 18, 2014

This issue only impacts ZoL because Illumos and FreeBSD mutexes are implemented very differently which makes this usage safe.

May one ask how this can be checked? I am of course curious if it is something we have to look into over in XNU. The locks (mutex) code appears to come from Mach:

http://fxr.watson.org/fxr/source//osfmk/i386/i386_lock.s?v=xnu-2050.18.24;im=excerpts

@ryao
Contributor

ryao commented Dec 19, 2014

@lundman This is specific to Linux's mutex design, which was made faster at the expense of safety without anyone realizing it. Linus' explanation at the following URL covers it fairly well:

http://lwn.net/Articles/575477/

You will need to analyze XNU's locks to see if the same scenario is possible with them.
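
For anyone analyzing their own mutex implementation, the pattern in question reduces to the following userspace illustration of the zio_wait()/zio_done() shape. This is not a reproducer, and whether a given mutex implementation tolerates it is exactly the question to answer; the names are simplified stand-ins for the real ZFS code.

#include <pthread.h>
#include <stdlib.h>

struct fake_zio {
    pthread_mutex_t io_lock;     /* embedded in the object being freed */
    pthread_cond_t  io_cv;
    int             io_done;
};

static void *completer(void *arg)         /* plays the zio_done() role */
{
    struct fake_zio *zio = arg;

    pthread_mutex_lock(&zio->io_lock);
    zio->io_done = 1;
    pthread_cond_broadcast(&zio->io_cv);
    pthread_mutex_unlock(&zio->io_lock);  /* (A) may still be touching the
                                           * lock word internally ...    */
    return NULL;
}

static void waiter(struct fake_zio *zio)  /* plays the zio_wait() role */
{
    pthread_mutex_lock(&zio->io_lock);
    while (!zio->io_done)
        pthread_cond_wait(&zio->io_cv, &zio->io_lock);
    pthread_mutex_unlock(&zio->io_lock);

    free(zio);                            /* (B) ... when the memory that
                                           * holds the lock is freed here */
}

int main(void)
{
    struct fake_zio *zio = calloc(1, sizeof (*zio));
    pthread_t tid;

    pthread_mutex_init(&zio->io_lock, NULL);
    pthread_cond_init(&zio->io_cv, NULL);

    pthread_create(&tid, NULL, completer, zio);
    waiter(zio);                          /* may free zio while (A) runs */
    pthread_join(tid, NULL);
    return 0;
}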

tuxoko added a commit to tuxoko/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when used to synchronize
the freeing of the object in which the mutex is embedded:
http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race condition
are zio->io_lock and dbuf->db_mtx:
  zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait()
  and zio_done().
  dbuf uses dbuf->db_mtx to protect reference counting.

And it's been reported multiple times that zio->io_lock and dbuf->db_mtx suffer
from corruption, which is exactly the symptom of the race.
openzfs/zfs#2523
openzfs/zfs#2897

This patch fixes this kind of race by forcing serialization on mutex_exit() with
a spinlock, making the mutex safer by sacrificing a bit of performance and
memory overhead.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
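
Modeled in userspace, the shape of that fix is roughly the following. It mirrors the approach of pairing the mutex with an extra spin lock taken around the unlock path; it is not the actual SPL code, and the type and function names are illustrative.

#include <pthread.h>

typedef struct kmutex {
    pthread_mutex_t    m_mutex;   /* stands in for the Linux mutex     */
    pthread_spinlock_t m_lock;    /* added lock serializing the exits  */
} kmutex_t;

static void kmutex_init(kmutex_t *mp)
{
    pthread_mutex_init(&mp->m_mutex, NULL);
    pthread_spin_init(&mp->m_lock, PTHREAD_PROCESS_PRIVATE);
}

static void kmutex_enter(kmutex_t *mp)
{
    pthread_mutex_lock(&mp->m_mutex);
}

static void kmutex_exit(kmutex_t *mp)
{
    /*
     * The whole unlock, including any slow path that touches the mutex
     * internals, happens inside the spin lock.  A second thread cannot
     * finish its own kmutex_exit() -- and then free the object -- until
     * this one is completely done with the memory.
     */
    pthread_spin_lock(&mp->m_lock);
    pthread_mutex_unlock(&mp->m_mutex);
    pthread_spin_unlock(&mp->m_lock);
}

int main(void)
{
    kmutex_t m;

    kmutex_init(&m);
    kmutex_enter(&m);
    kmutex_exit(&m);
    return 0;
}
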
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
@behlendorf
Contributor Author

@tuxoko's fix for this issue has been merged to the spl master branch. Because of how well the root cause explains all the symptoms we've seen, I've closed out all the duplicates of this issue I'm aware of and linked them to the fix. And while I'm highly confident that this issue is now fixed, I'd like to leave this issue open a while longer to get absolute confirmation of that. Please let us know if you observe any similar symptoms with the fix applied. If things look good for the next few weeks we can close out this bug too.

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/zfs@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Backported-by: Darik Horn <dajhorn@vanadac.com>
Closes #421

Conflicts:
        include/sys/mutex.h
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
@mgmartin

Fix confirmation: I've been running with this patch in place on a few servers for over a week now and have had zero lock-ups. Prior to the patch, I seemed to hit the issue within a day or two. I ran for a few days on a 3.14 kernel, then moved everything back to a 3.18 kernel. I used the default spl module options with spl_kmem_cache_kmem_limit=1024.

Looks good to me. Thanks everyone for all the work in tracking this down and finding the issue!

@lintonv

lintonv commented Jan 6, 2015

Is there a hot-fix patch for this? We are running the stable release (0.6.3) from July. How would we apply the patch cleanly without pulling in other dependencies?

@behlendorf
Contributor Author

@lidongyang This patch was pulled in to the stable ZFS packages for Fedora, CentOS, Ubuntu, Debian, and Gentoo. If you're using any of these repositories, simply update your packages and reboot to apply the fix. If you're building from source, use the 0.6.3-1.2 tag.

@behlendorf
Contributor Author

Closing this issue. After over two weeks of run time on numerous systems we've confirmed that the proposed fix has resolved the issue as expected. Thanks everybody.

@lintonv

lintonv commented Jan 7, 2015

@behlendorf Great information. Thank you. I did not know there was a 0.6.3-1.2 tag.

@prakashsurya
Member

@behlendorf I'm not sure where to put this, but it'd be really great to start tracking some of these differences between Linux and illumos w.r.t. this project. Subtle incompatibilities like this will make it extremely difficult to ever move to a single OpenZFS repository unless we start to collect and document the known differences and kernel API requirements.

@tuxoko great job finding the root cause!

behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 30, 2015
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue openzfs#2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 6, 2015
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue openzfs#2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 11, 2015
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue openzfs#2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue openzfs#3063
ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes openzfs#421

Conflicts:
	include/sys/mutex.h
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 9, 2015
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523