
zio->io_lock() spinlock damaged #2523

Closed
behlendorf opened this issue Jul 22, 2014 · 51 comments

@behlendorf
Contributor

The following hang was observed running zfs-0.6.3 under RHEL 6.5. Very similar hangs in zio_wait() have occasionally been reported for prior releases. However, this time a crash dump was available for detailed inspection, which helps shed some light on the issue.

The hang in zio_wait() shown below is the result of the txg_sync thread spinning in __mutex_unlock_slowpath on the mutex's internal spin lock. This is somewhat surprising because that spin lock is managed entirely by the Linux kernel's mutex_lock and mutex_unlock functions. Because we have access to the crash dump, we're able to inspect the exact state of the spin lock.

On x86_64 Linux, ticket spin locks are used. For those not familiar with ticket spin locks, there's a good LWN article explaining them: http://lwn.net/Articles/267968/. Very briefly, each lock is split into two parts: a next value and an owner value. When the next value equals the owner value, the lock is not held and can be acquired.

Here's where it gets interesting. In the case shown below the spin lock has a value of 0x00020002. This means that since the lock was initialized it has been unlocked twice but locked only once. In this state a waiter will spin forever and never acquire the lock.
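
To make the arithmetic concrete, here is a small userspace sketch (not kernel code) that decodes the slock word the way this analysis does. The field layout is an assumption based on the 2.6.32-era x86_64 implementation: the low 16 bits hold the "owner" ticket currently being served, the high 16 bits the "next" ticket to hand out.

#include <stdio.h>
#include <stdint.h>

/* Decode a ticket spin lock word into its owner and next tickets. */
static void decode(uint32_t slock)
{
    uint16_t owner = (uint16_t)(slock & 0xffff);
    uint16_t next  = (uint16_t)(slock >> 16);

    printf("slock=0x%08x  owner=%u  next=%u\n", slock, owner, next);
}

int main(void)
{
    decode(0x00000000);   /* freshly initialized: no tickets issued     */
    decode(0x00010001);   /* locked once, unlocked once: balanced       */
    decode(0x00010002);   /* one lock but two unlocks: owner ran ahead  */

    /*
     * The damaged value from the dump.  It looks balanced, but only
     * because the stuck waiter already took ticket 1 (bumping next to 2)
     * after owner had skipped past 1, so that waiter spins forever
     * waiting for owner == 1.
     */
    decode(0x00020002);
    return 0;
}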

There are only two plausible ways this can happen.

  • The spin lock is unlocked multiple times.
  • The spin lock was reinitialized while it was locked.

I've inspected the code and thus far haven't found a way that a multiple unlock could occur. This also seems particularly unlikely because, as I mentioned above, the management of this lock is handled entirely by the Linux mutex primitive, and it seems very unlikely that it suffers from a bug of this kind.

That leaves the spin lock being reinitialized while it is locked. I like this theory much better because it cleanly explains how the lock can end up in this state and also why this problem is so rare: the spin lock would need to be damaged at exactly the right instant. The problem is that the surrounding zio structure doesn't appear to be damaged as one might expect, and I haven't found any code which might cause this.

So unfortunately, while we now have solid evidence for what happened, it's still unclear to me how it can happen.

crash> bt 5272
PID: 5272   TASK: ffff88102db36040  CPU: 1   COMMAND: "z_null_iss/0"
 #0 [ffff88102dbf1be0] schedule at ffffffff8152a940
 #1 [ffff88102dbf1ca8] __mutex_lock_slowpath at ffffffff8152c13e
 #2 [ffff88102dbf1d18] mutex_lock at ffffffff8152bfde
 #3 [ffff88102dbf1d38] zio_wait_for_children at ffffffffa0492eb5 [zfs]
 #4 [ffff88102dbf1d68] zio_done at ffffffffa04979a8 [zfs]
 #5 [ffff88102dbf1de8] zio_execute at ffffffffa04944c3 [zfs]
 #6 [ffff88102dbf1e28] taskq_thread at ffffffffa0348d37 [spl]
 #7 [ffff88102dbf1ee8] kthread at ffffffff8109b486
 #8 [ffff88102dbf1f48] kernel_thread at ffffffff8100c24a

crash> bt 5374
PID: 5374   TASK: ffff88102ccbb500  CPU: 9   COMMAND: "txg_sync"
 #0 [ffff88085c427e90] crash_nmi_callback at ffffffff81030156
 #1 [ffff88085c427ea0] notifier_call_chain at ffffffff81530635
 #2 [ffff88085c427ee0] atomic_notifier_call_chain at ffffffff8153069a
 #3 [ffff88085c427ef0] notify_die at ffffffff810a1afe
 #4 [ffff88085c427f20] do_nmi at ffffffff8152e2fb
 #5 [ffff88085c427f50] nmi at ffffffff8152dbc0
    [exception RIP: _spin_lock+33]
    RIP: ffffffff8152d431  RSP: ffff88102cce9820  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: ffff880b2ef5f954  RCX: 0000000000000000
    RDX: 0000000000000002  RSI: 0000000000000202  RDI: ffff880b2ef5f954
    RBP: ffff88102cce9820   R8: ffff880b2ef5f980   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880b2ef5f950
    R13: ffff880b2ef5f950  R14: ffff880b2ef5f978  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88102cce9820] _spin_lock at ffffffff8152d431
 #7 [ffff88102cce9828] __mutex_unlock_slowpath at ffffffff8152bd05
 #8 [ffff88102cce9848] mutex_unlock at ffffffff8152bcdb
 #9 [ffff88102cce9858] cv_wait_common at ffffffffa034cae2 [spl]
#10 [ffff88102cce98c8] __cv_wait_io at ffffffffa034cb78 [spl]
#11 [ffff88102cce98d8] zio_wait at ffffffffa049464b [zfs]
#12 [ffff88102cce9918] dmu_buf_hold_array_by_dnode at ffffffffa040b0c7 [zfs]
#13 [ffff88102cce9998] dmu_buf_hold_array at ffffffffa040b5a5 [zfs]
#14 [ffff88102cce99f8] dmu_write at ffffffffa040ba00 [zfs]
#15 [ffff88102cce9a78] space_map_sync at ffffffffa044d9b9 [zfs]
#16 [ffff88102cce9b28] metaslab_condense at ffffffffa043705b [zfs]
#17 [ffff88102cce9c38] metaslab_sync at ffffffffa043733c [zfs]
#18 [ffff88102cce9cb8] vdev_sync at ffffffffa0451b5f [zfs]
#19 [ffff88102cce9d08] spa_sync at ffffffffa043cd9b [zfs]
#20 [ffff88102cce9dd8] txg_sync_thread at ffffffffa044f540 [zfs]
#21 [ffff88102cce9eb8] thread_generic_wrapper at ffffffffa03485c8 [spl]
#22 [ffff88102cce9ee8] kthread at ffffffff8109b486
#23 [ffff88102cce9f48] kernel_thread at ffffffff8100c24a

crash> p *(zio_t *)0xffff880b2ef5f640
$34 = {
  io_bookmark = {
    zb_objset = 0, 
    zb_object = 0, 
    zb_level = 0, 
    zb_blkid = 0
  }, 
  io_prop = {
    zp_checksum = ZIO_CHECKSUM_INHERIT, 
    zp_compress = ZIO_COMPRESS_INHERIT, 
    zp_type = DMU_OT_NONE, 
    zp_level = 0 '\000', 
    zp_copies = 0 '\000', 
    zp_dedup = B_FALSE, 
    zp_dedup_verify = B_FALSE, 
    zp_nopwrite = B_FALSE
  }, 
  io_type = ZIO_TYPE_NULL, 
  io_child_type = ZIO_CHILD_LOGICAL, 
  io_cmd = 0, 
  io_priority = ZIO_PRIORITY_NOW, 
  io_reexecute = 0 '\000', 
  io_state = "\001", 
  io_txg = 0, 
  io_spa = 0xffff8807ff5c8000, 
  io_bp = 0x0, 
  io_bp_override = 0x0, 
  io_bp_copy = {
    blk_dva = {{
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }}, 
    blk_prop = 0, 
    blk_pad = {0, 0}, 
    blk_phys_birth = 0, 
    blk_birth = 0, 
    blk_fill = 0, 
    blk_cksum = {
      zc_word = {0, 0, 0, 0}
    }
  }, 
  io_parent_list = {
    list_size = 48, 
    list_offset = 16, 
    list_head = {
      next = 0xffff880b2ef5f740, 
      prev = 0xffff880b2ef5f740
    }
  }, 
  io_child_list = {
    list_size = 48, 
    list_offset = 32, 
    list_head = {
      next = 0xffff880b2ef5f760, 
      prev = 0xffff880b2ef5f760
    }
  }, 
  io_walk_link = 0x0, 
  io_logical = 0x0, 
  io_transform_stack = 0x0, 
  io_ready = 0, 
  io_physdone = 0, 
  io_done = 0, 
  io_private = 0x0, 
  io_prev_space_delta = 0, 
  io_bp_orig = {
    blk_dva = {{
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }, {
        dva_word = {0, 0}
      }}, 
    blk_prop = 0, 
    blk_pad = {0, 0}, 
    blk_phys_birth = 0, 
    blk_birth = 0, 
    blk_fill = 0, 
    blk_cksum = {
      zc_word = {0, 0, 0, 0}
    }
  }, 
  io_data = 0x0, 
  io_orig_data = 0x0, 
  io_size = 0, 
  io_orig_size = 0, 
  io_vd = 0x0, 
  io_vsd = 0x0, 
  io_vsd_ops = 0x0, 
  io_offset = 0, 
  io_timestamp = 0, 
  io_delta = 0, 
  io_delay = 0, 
  io_queue_node = {
    avl_child = {0x0, 0x0}, 
    avl_pcb = 0
  }, 
  io_flags = ZIO_FLAG_CANFAIL, 
  io_stage = ZIO_STAGE_DONE, 
  io_pipeline = 2162688, 
  io_orig_flags = ZIO_FLAG_CANFAIL, 
  io_orig_stage = ZIO_STAGE_OPEN, 
  io_orig_pipeline = 2162688, 
  io_error = 0, 
  io_child_error = {0, 0, 0, 0}, 
  io_children = {{0, 0}, {0, 0}, {0, 0}, {0, 0}}, 
  io_child_count = 0, 
  io_phys_children = 0, 
  io_parent_count = 0, 
  io_stall = 0x0, 
  io_gang_leader = 0x0, 
  io_gang_tree = 0x0, 
  io_executor = 0xffff88102db36040, 
  io_waiter = 0xffff88102ccbb500, 
  io_lock = {
    m = {
      count = {
        counter = -1
      }, 
      wait_lock = {
        raw_lock = {
          slock = 131074
        }
      }, 
      wait_list = {
        next = 0xffff88102dbf1cc0, 
        prev = 0xffff88102dbf1cc0
      }, 
      owner = 0x0
    }
  }, 
  io_cv = {
    cv_magic = 879052276, 
    cv_event = {
      lock = {
        raw_lock = {
          slock = 65537
        }
      }, 
      task_list = {
        next = 0xffff88102cce9878, 
        prev = 0xffff88102cce9878
      }
    }, 
    cv_destroy = {
      lock = {
        raw_lock = {
          slock = 0
        }
      }, 
      task_list = {
        next = 0xffff880b2ef5f998, 
        prev = 0xffff880b2ef5f998
      }
    }, 
    cv_refs = {
      counter = 2
    }, 
    cv_waiters = {
      counter = 1
    }, 
    cv_mutex = 0xffff880b2ef5f950
  }, 
  io_cksum_report = 0x0, 
  io_ena = 0, 
  io_tqent = {
    tqent_lock = {
      raw_lock = {
        slock = 65537
      }
    }, 
    tqent_waitq = {
      lock = {
        raw_lock = {
          slock = 0
        }
      }, 
      task_list = {
        next = 0xffff880b2ef5f9d8, 
        prev = 0xffff880b2ef5f9d8
      }
    }, 
    tqent_timer = {
      entry = {
        next = 0x0, 
        prev = 0x0
      }, 
      expires = 0, 
      function = 0, 
      data = 0, 
      base = 0xffff8810343b4000, 
      start_site = 0x0, 
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
      start_pid = -1
    }, 
    tqent_list = {
      next = 0xffff880b2ef5fa38, 
      prev = 0xffff880b2ef5fa38
    }, 
    tqent_id = 193943569, 
    tqent_func = 0xffffffffa0494410, 
    tqent_arg = 0xffff880b2ef5f640, 
    tqent_taskq = 0xffff8808076d5780, 
    tqent_flags = 1
  }
}
@behlendorf behlendorf added this to the 0.6.4 milestone Jul 22, 2014
@behlendorf behlendorf added the Bug label Jul 22, 2014
@behlendorf
Contributor Author

The same issue was observed again. The z_fr_iss_0/1 thread is stuck spinning on the internal mutex spin lock which protects the wait list. The damage looks like a double free, but it's unclear how that could happen. Critical details from this second instance follow:

# z_fr_iss_0/1
 [<ffffffff8152ccc5>] ? __mutex_unlock_slowpath+0x25/0x60
 [<ffffffff8152cc9b>] ? mutex_unlock+0x1b/0x20
 [<ffffffffa04b1f25>] ? zio_done+0x5b5/0xe00 [zfs]
 [<ffffffffa04b1677>] ? zio_ready+0x177/0x470 [zfs]
 [<ffffffffa04ae503>] ? zio_execute+0xb3/0x140 [zfs]
 [<ffffffffa0371d37>] ? taskq_thread+0x1e7/0x3f0 [spl]
 [<ffffffff81061f20>] ? default_wake_function+0x0/0x20
 [<ffffffffa0371b50>] ? taskq_thread+0x0/0x3f0 [spl]
 [<ffffffff8109b4d6>] ? kthread+0x96/0xa0
 [<ffffffff8100c24a>] ? child_rip+0xa/0x20
 [<ffffffff8109b440>] ? kthread+0x0/0xa0
 [<ffffffff8100c240>] ? child_rip+0x0/0x20

crash> gdb list *(zio_done+0x5b5)
0xffffffffa04b1f25 is in zio_done (/usr/src/debug/zfs-kmod-0.6.3/zfs-0.6.3/module/zfs/zio.c:3246).
3241    
3242            if (zio->io_waiter != NULL) {
3243                    mutex_enter(&zio->io_lock);
3244                    zio->io_executor = NULL;
3245                    cv_broadcast(&zio->io_cv);
3246                    mutex_exit(&zio->io_lock);
3247            } else {
3248                    zio_destroy(zio);
3249            }
3250  
# txg_sync
[<ffffffffa0375af4>] schedule at ffffffff8152b900
[<ffffffffa0375af4>] __mutex_lock_slowpath at ffffffff8152d0fe
[<ffffffffa0375af4>] mutex_lock at ffffffff8152cf9e
[<ffffffffa0375af4>] cv_wait_common+0x94/0x100 [spl]
[<ffffffffa0375b78>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa04ae68b>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa04b016b>] zio_free+0xab/0xe0 [zfs]
[<ffffffffa0449611>] dsl_free+0x11/0x20 [zfs]
[<ffffffffa043e521>] dsl_dataset_block_kill+0x1a1/0x350 [zfs]
[<ffffffffa043710e>] free_blocks+0x6e/0xb0 [zfs]
[<ffffffffa04377a4>] free_children+0x294/0x340 [zfs]
[<ffffffffa04376db>] free_children+0x1cb/0x340 [zfs]
[<ffffffffa04376db>] free_children+0x1cb/0x340 [zfs]
[<ffffffffa0437bc9>] dnode_sync+0x379/0x9e0 [zfs]
[<ffffffffa0427fe9>] dmu_objset_sync_dnodes+0x89/0xb0 [zfs]
[<ffffffffa04281c3>] dmu_objset_sync+0x1b3/0x2d0 [zfs]
[<ffffffffa04394ac>] dsl_dataset_sync+0x4c/0x60 [zfs]
[<ffffffffa044479a>] dsl_pool_sync+0x9a/0x3f0 [zfs]
[<ffffffffa0456d4b>] spa_sync+0x40b/0xa60 [zfs]
[<ffffffffa0469570>] txg_sync_thread+0x300/0x520 [zfs]
[<ffffffffa03715c8>] thread_generic_wrapper+0x68/0x80 [spl]
[<ffffffff8109b4d6>] kthread+0x96/0xa0
[<ffffffff8100c24a>] child_rip+0xa/0x20
# There are 16 zio's allocated from the slab, including the one below, indicating it has not been freed.
crash> kmem -s
CACHE            NAME                 OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE
ffff880818d61e00 zio_cache               1072         16       112     16     8k

# The offending zio has damage to the spin lock suggesting a double free.
crash> struct -x zio ffff8808e6b5f700
struct zio {
  io_bookmark = {
    zb_objset = 0x0, 
    zb_object = 0x0, 
    zb_level = 0x0, 
    zb_blkid = 0x0
  }, 
  io_prop = {
    zp_checksum = ZIO_CHECKSUM_INHERIT, 
    zp_compress = ZIO_COMPRESS_INHERIT, 
    zp_type = DMU_OT_NONE, 
    zp_level = 0x0, 
    zp_copies = 0x0, 
    zp_dedup = B_FALSE, 
    zp_dedup_verify = B_FALSE, 
    zp_nopwrite = B_FALSE
  }, 
  io_type = ZIO_TYPE_FREE, 
  io_child_type = ZIO_CHILD_LOGICAL, 
  io_cmd = 0x0, 
  io_priority = ZIO_PRIORITY_NOW, 
  io_reexecute = 0x0, 
  io_state = "\001\001", 
  io_txg = 0x23a803, 
  io_spa = 0xffff88102e04e000, 
  io_bp = 0xffff8808e6b5f770, 
  io_bp_override = 0x0, 
  io_bp_copy = {
    blk_dva = {{
        dva_word = {0x35, 0x5cfd28404}
      }, {
        dva_word = {0x0, 0x0}
      }, {
        dva_word = {0x0, 0x0}
      }}, 
    blk_prop = 0x80130703003400ff, 
    blk_pad = {0x0, 0x0}, 
    blk_phys_birth = 0x0, 
    blk_birth = 0x23a75c, 
    blk_fill = 0x1, 
    blk_cksum = {
      zc_word = {0x77fd3b1b03f, 0x64114459516315, 0x6ae8943b3134ab81, 0x668ee8ddfd38e19d}
    }
  }, 
  io_parent_list = {
    list_size = 0x30, 
    list_offset = 0x10, 
    list_head = {
      next = 0xffff8808e6b5f800, 
      prev = 0xffff8808e6b5f800
    }
  }, 
  io_child_list = {
    list_size = 0x30, 
    list_offset = 0x20, 
    list_head = {
      next = 0xffff8808e6b5f820, 
      prev = 0xffff8808e6b5f820
    }
  }, 
  io_walk_link = 0x0, 
  io_logical = 0xffff8808e6b5f700, 
  io_transform_stack = 0x0, 
  io_ready = 0, 
  io_physdone = 0, 
  io_done = 0, 
  io_private = 0x0, 
  io_prev_space_delta = 0x0, 
  io_bp_orig = {
    blk_dva = {{
        dva_word = {0x35, 0x5cfd28404}
      }, {
        dva_word = {0x0, 0x0}
      }, {
        dva_word = {0x0, 0x0}
      }}, 
    blk_prop = 0x80130703003400ff, 
    blk_pad = {0x0, 0x0}, 
    blk_phys_birth = 0x0, 
    blk_birth = 0x23a75c, 
    blk_fill = 0x1, 
    blk_cksum = {
      zc_word = {0x77fd3b1b03f, 0x64114459516315, 0x6ae8943b3134ab81, 0x668ee8ddfd38e19d}
    }
  }, 
  io_data = 0x0, 
  io_orig_data = 0x0, 
  io_size = 0x6a00, 
  io_orig_size = 0x6a00, 
  io_vd = 0x0, 
  io_vsd = 0x0, 
  io_vsd_ops = 0x0, 
  io_offset = 0x0, 
  io_timestamp = 0x0, 
  io_delta = 0x0, 
  io_delay = 0x0, 
  io_queue_node = {
    avl_child = {0x0, 0x0}, 
    avl_pcb = 0x0
  }, 
  io_flags = 0, 
  io_stage = ZIO_STAGE_DONE, 
  io_pipeline = 2179076, 
  io_orig_flags = 0, 
  io_orig_stage = ZIO_STAGE_OPEN, 
  io_orig_pipeline = 2179076, 
  io_error = 0x0, 
  io_child_error = {0x0, 0x0, 0x0, 0x0}, 
  io_children = {{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}}, 
  io_child_count = 0x0, 
  io_phys_children = 0x0, 
  io_parent_count = 0x0, 
  io_stall = 0x0, 
  io_gang_leader = 0x0, 
  io_gang_tree = 0x0, 
  io_executor = 0x0, 
  io_waiter = 0xffff88080b247540, 
  io_lock = {
    m = {
      count = {
        counter = 0xffffffff
      }, 
      wait_lock = {
        raw_lock = {
          slock = 0x20002
        }
      }, 
      wait_list = {
        next = 0xffff880806c67640, 
        prev = 0xffff880806c67640
      }, 
      owner = 0x0
    }
  }, 
  io_cv = {
    cv_magic = 0x346545f4, 
    cv_event = {
      lock = {
        raw_lock = {
          slock = 0x20002
        }
      }, 
      task_list = {
        next = 0xffff8808e6b5fa40, 
        prev = 0xffff8808e6b5fa40
      }
    }, 
    cv_destroy = {
      lock = {
        raw_lock = {
          slock = 0x0
        }
      }, 
      task_list = {
        next = 0xffff8808e6b5fa58, 
        prev = 0xffff8808e6b5fa58
      }
    }, 
    cv_refs = {
      counter = 0x2
    }, 
    cv_waiters = {
      counter = 0x1
    }, 
    cv_mutex = 0xffff8808e6b5fa10
  }, 
  io_cksum_report = 0x0, 
  io_ena = 0x0, 
  io_tqent = {
    tqent_lock = {
      raw_lock = {
        slock = 0x10001
      }
    }, 
    tqent_waitq = {
      lock = {
        raw_lock = {
          slock = 0x0
        }
      }, 
      task_list = {
        next = 0xffff8808e6b5fa98, 
        prev = 0xffff8808e6b5fa98
      }
    }, 
    tqent_timer = {
      entry = {
        next = 0x0, 
        prev = 0x0
      }, 
      expires = 0x0, 
      function = 0, 
      data = 0x0, 
      base = 0xffff881033c5c000, 
      start_site = 0x0, 
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000", 
      start_pid = 0xffffffff
    }, 
    tqent_list = {
      next = 0xffff8808e6b5faf8, 
      prev = 0xffff8808e6b5faf8
    }, 
    tqent_id = 0x25e1f69, 
    tqent_func = 0xffffffffa04ae450 <zio_execute>, 
    tqent_arg = 0xffff8808e6b5f700, 
    tqent_taskq = 0xffff881030d4b8c0, 
    tqent_flags = 0x1
  }
}

@lundman
Contributor

lundman commented Aug 26, 2014

I hesitate to comment, since it is very unlikely to be related. We had many panics on zio->io_lock which seemed connected to memory corruption, but it turned out to be related to how our SPL memory manager was designed.

In the end it was caused by:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_ioctl.c#L5833
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zfs_ioctl.c#L5907

.. because saved_poolname had been truncated, the alloc and free calls were passing different "size" parameters. So, for example, alloc would take the memory from the size=16 bucket, but free would place it back in the size=8 bucket. This meant we handed the same memory out twice, and so on. Most of the panics showed up in zio_done() with zio->io_lock, but also in a whole bunch of random places.

But I don't think ZoL's allocator cares as much about the size parameter as ours does, so it is likely not relevant.

openzfsonosx/zfs@2e1aed2
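
For illustration only, here is a minimal userspace model of that failure mode: an allocator whose free() takes an explicit size, as the illumos-style strfree()/kmem_free() interface does. The names and the pool-name handling below are hypothetical, not the actual O3X code; the point is just that truncating the string between allocation and free changes the size argument, and a size-bucketed allocator then files the block under the wrong class.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for kmem_alloc()/kmem_free(); a real SPL keys off `size`. */
static void *sized_alloc(size_t size)
{
    printf("alloc size=%zu\n", size);
    return malloc(size);
}

static void sized_free(void *ptr, size_t size)
{
    /* A bucketed allocator would return ptr to the freelist for `size`. */
    printf("free  size=%zu\n", size);
    free(ptr);
}

int main(void)
{
    /* Mimic saved_poolname handling: allocate room for the full name... */
    const char *name = "tank/some/long/dataset";
    size_t alloc_len = strlen(name) + 1;
    char *saved = sized_alloc(alloc_len);
    strcpy(saved, name);

    /* ...later the saved copy gets truncated at a separator... */
    *strchr(saved, '/') = '\0';

    /* ...and freed using the current length instead of alloc_len. */
    sized_free(saved, strlen(saved) + 1);   /* 5 != 23: wrong bucket */
    return 0;
}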

@dweeezil
Contributor

As your links above point out, in fb8e608 the handling of saved_poolname was converted to use strdup()/strfree(). In the ZoL SPL, these calls map to kmalloc()/kfree(), which (at least in SLOB) prepend the space with a 4-byte header to remember the size of the original allocation.

@behlendorf
Contributor Author

Any insights here are welcome. But as @dweeezil pointed out, in this case we're directly mapping strdup()/strfree() to kmalloc()/kfree(), so we can't really have an accounting issue here, at least not in the SPL.

Your comment, however, reminded me that ZoL's SPL has a --enable-debug-kmem-tracking option which enables detailed tracking of every allocation. It will verify that every address which is allocated with a given size is freed with that same size. It's horrible for performance, but if something odd like that is going on it should catch it. I'll give it a try on a test setup.
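
For reference, the core of that check reduces to something like the following userspace sketch. The real SPL tracking code is considerably more involved (locking, per-allocation records, leak reporting), so treat this only as the shape of the idea.

#include <assert.h>
#include <stdlib.h>

#define MAX_TRACKED 1024

/* One slot per live allocation: the address and the size it was given. */
static struct { void *ptr; size_t size; } tracked[MAX_TRACKED];

static void *tracked_alloc(size_t size)
{
    void *ptr = malloc(size);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr == NULL) {
            tracked[i].ptr = ptr;
            tracked[i].size = size;
            return ptr;
        }
    }
    abort();    /* table full; not handled in a sketch */
}

static void tracked_free(void *ptr, size_t size)
{
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].ptr == ptr) {
            /* The check the debug option provides: sizes must match. */
            assert(tracked[i].size == size);
            tracked[i].ptr = NULL;
            free(ptr);
            return;
        }
    }
    assert(!"free of untracked address (double free or bad pointer)");
}

int main(void)
{
    void *p = tracked_alloc(16);
    tracked_free(p, 16);    /* fine */

    void *q = tracked_alloc(16);
    tracked_free(q, 8);     /* size mismatch: trips the assertion */
    return 0;
}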

@lundman
Contributor

lundman commented Aug 27, 2014

Ah, I see: you directly map it to Linux-specific allocation calls which do not take a size argument, and yet still have spa_strdup().

Anyway, sorry I couldn't help...

@ryao
Contributor

ryao commented Sep 4, 2014

@behlendorf If a double free occurs on an object allocated by spl_kmem_cache_alloc(), then it will be allocated twice. That would result in something like this happening. I managed to trigger something like this when building ztest with Clang not that long ago, but I did not identify the cause of the double free.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 4, 2014
One theory for openzfs#2523 is that somehow the spinlock at the start of
the mutex is being accidentally overwritten.  To test this theory
a magic value is added before the mutex.  By checking this value
periodically against a known magic value we can detect if this
is somehow happening.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 4, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
@behlendorf
Contributor Author

@ryao Yes, a double free is a possible explanation for this but thus far I haven't been able to determine how that's possible. Have you been able to reproduce anything like this? It's been difficult to resolve because it occurs often enough to be a problem but I've been completely unable to reproduce it otherwise.

@dswartz
Contributor

dswartz commented Sep 4, 2014

@ryao Yes, a double free is a possible explanation for this but thus far I
haven't been able to determine how that's possible. Have you been able to
reproduce anything like this? It's been difficult to resolve because it
occurs often enough to be a problem but I've been completely unable to
reproduce it otherwise.

Is it possible to make some kind of debug wrapper to catch the 2nd free in
the act?

@dswartz
Contributor

dswartz commented Sep 4, 2014

Oh never mind, only just now saw your patch to try to track this down :)

@behlendorf
Contributor Author

@dswartz It should be, but thus far I haven't been able to hit the issue with any of the debug patches in place. Generally I dislike adding this kind of debugging to the main tree, but we may have to just to get some additional information.

@ryao
Contributor

ryao commented Sep 5, 2014

@behlendorf I suspect that I could reproduce the double free that I encountered when building with Clang, but I do not have instructions to do that on hand at the moment.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 8, 2014
One theory for openzfs#2523 is that somehow the spinlock at the start of
the mutex is being accidentally overwritten.  To test this theory
a magic value is added before the mutex.  By checking this value
periodically against a known magic value we can detect if this
is somehow happening.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 8, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
From the information gathered about issue openzfs#2523 it's clear that
somehow the spin lock is being damaged.  This could occur if the
spin lock is reinitialized, the memory is accidentally overwritten,
or freed back to the cache too soon.  To determine exactly what is
happening this patch adds a couple new sanity tests.

* A zio->io_magic field is added before the io_lock.  This field
  is designed to act as a red zone, allowing us to detect if the zio
  has been overwritten.

* The zio->io_magic field is also used to detect if somehow the
  constructor or destructor is running multiple times for the object.
  This would effectively cause the spin lock to be reinitialized.

* The destructor has been updated to poison the entire structure.
  This should cause us to quickly detect any use-after-free bugs.

Once the root cause of this issue has been determined this patch
should be reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
From the information gathered about issue openzfs#2523 it's clear that
somehow the spin lock is being damaged.  This could occur if the
spin lock is reinitialized, the memory is accidentally overwritten,
or freed back to the cache too soon.  To determine exactly what is
happening this patch adds a couple new sanity tests.

* A zio->io_magic field is added before the io_lock.  This field
  is designed to act as a red zone, allowing us to detect if the zio
  has been overwritten.

* The zio->io_magic field is also used to detect if somehow the
  constructor or destructor is running multiple times for the object.
  This would effectively cause the spin lock to be reinitialized.

* The destructor has been updated to poison the entire structure.
  This should cause us to quickly detect any use-after-free bugs.

Once the root cause of this issue has been determined this patch
should be reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 11, 2014
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523
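
The sanity checks described in the debug commits above boil down to something like the following sketch. The magic values and struct/function names here are hypothetical, and the real patch instruments the zio constructor and destructor, but it shows how a double construction (the same object handed out twice without being destroyed in between) trips an assertion, which is exactly the failure reported next.

#include <assert.h>
#include <string.h>

#define OBJ_MAGIC_LIVE   0x5a5a5a5a5a5a5a5aULL  /* hypothetical patterns */
#define OBJ_MAGIC_FREED  0xdeadbeefdeadbeefULL

struct obj {
    unsigned long long magic;   /* red zone guarding the fields below */
    /* e.g. the embedded lock and the rest of the structure */
    int payload;
};

static void obj_cons(struct obj *o)
{
    /* A live pattern here means the constructor ran twice in a row,
     * i.e. the object was handed out twice without being destroyed. */
    assert(o->magic != OBJ_MAGIC_LIVE);
    memset(o, 0, sizeof (*o));
    o->magic = OBJ_MAGIC_LIVE;
}

static void obj_dest(struct obj *o)
{
    /* Anything but the live pattern means an overwrite or double destroy. */
    assert(o->magic == OBJ_MAGIC_LIVE);
    memset(o, 0x21, sizeof (*o));   /* poison to catch use-after-free */
    o->magic = OBJ_MAGIC_FREED;
}

int main(void)
{
    static struct obj o;    /* zero-filled, like a fresh slab object */

    obj_cons(&o);
    obj_dest(&o);
    obj_cons(&o);           /* legitimate reuse is fine */
    obj_cons(&o);           /* double construction: assertion fires */
    return 0;
}
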
@mgmartin

I ran with the assert magic patch last night and hit an assert failure under load. Hope this helps narrow this nasty issue down!

Sep 11 02:00:16 server kernel: VERIFY3(zio->io_magic != 0x210210210210210ULL) failed (148655080447476240 != 148655080447476240)
Sep 11 02:00:16 server kernel: SPLError: 697:0:(zio.c:102:zio_cons()) SPL PANIC
Sep 11 02:00:16 server kernel: SPL: Showing stack for process 697
Sep 11 02:00:16 server kernel: CPU: 5 PID: 697 Comm: z_wr_iss/7 Tainted: P          IO  3.16.2-mgm #1
Sep 11 02:00:16 server kernel: Hardware name: HP ProLiant DL380 G6, BIOS P62 07/02/2013
Sep 11 02:00:16 server kernel:  0000000000000000 00000000470a6f12 ffff880306e079b8 ffffffff8150bb2f
Sep 11 02:00:16 server kernel:  ffff880302e5da90 ffff880306e079d0 ffffffffa0006b15 0000000000000000
Sep 11 02:00:16 server kernel:  ffff880306e079f8 ffffffffa0007ab0 ffffffffa001c791 000000000000ffed
Sep 11 02:00:16 server kernel: Call Trace:
Sep 11 02:00:16 server kernel:  [<ffffffff8150bb2f>] dump_stack+0x45/0x56
Sep 11 02:00:16 server kernel:  [<ffffffffa0006b15>] spl_debug_dumpstack+0x45/0x60 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffffa0007ab0>] spl_debug_bug+0x80/0xe0 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffffa0014955>] spl_PANIC+0xa5/0xd0 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffffa012cca3>] ? vdev_queue_io_to_issue+0x153/0x8f0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffff813878e0>] ? extract_buf+0xd0/0x190
Sep 11 02:00:16 server kernel:  [<ffffffff81385a92>] ? __mix_pool_bytes+0x42/0x90
Sep 11 02:00:16 server kernel:  [<ffffffff81387901>] ? extract_buf+0xf1/0x190
Sep 11 02:00:16 server kernel:  [<ffffffffa0168774>] zio_cons+0xf4/0x100 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa000b67e>] spl_kmem_cache_alloc+0xae/0xdc0 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffff81388364>] ? extract_entropy+0x94/0x1d0
Sep 11 02:00:16 server kernel:  [<ffffffffa016b210>] zio_create+0x40/0x5a0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa016c0fb>] zio_vdev_child_io+0xbb/0x110 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa012c746>] vdev_mirror_io_start+0xa6/0x1c0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa011b8ef>] ? spa_config_enter+0xaf/0xf0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa0169327>] zio_vdev_io_start+0x217/0x2f0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa016a470>] zio_execute+0x100/0x1b0 [zfs]
Sep 11 02:00:16 server kernel:  [<ffffffffa0011437>] taskq_thread+0x267/0x510 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffff8109eb80>] ? wake_up_process+0x50/0x50
Sep 11 02:00:16 server kernel:  [<ffffffffa00111d0>] ? taskq_cancel_id+0x200/0x200 [spl]
Sep 11 02:00:16 server kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 11 02:00:16 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:00:16 server kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 11 02:00:16 server kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0

@behlendorf
Contributor Author

@mgmartin Thanks, I was also able to catch a failure in overnight stress testing. This should help us get to the root cause.

@mgmartin

This might help too. Here are the hung task stacks after the panic:

Sep 11 02:03:14  kernel: z_wr_iss/7      D ffff88061fa54400     0   697      2 0x00000000
Sep 11 02:03:14  kernel:  ffff880306e079c0 0000000000000046 ffff880302e5da90 0000000000014400
Sep 11 02:03:14  kernel:  ffff880306e07fd8 0000000000014400 ffff880302e5da90 ffff880306e07920
Sep 11 02:03:14  kernel:  ffffffff8101770d ffff880306e079e8 0000000000000000 ffff880306e079a0
Sep 11 02:03:14  kernel: Call Trace:
Sep 11 02:03:14  kernel:  [<ffffffff8101770d>] ? show_trace_log_lvl+0x4d/0x60
Sep 11 02:03:14  kernel:  [<ffffffff810163fd>] ? show_stack_log_lvl+0x9d/0x1a0
Sep 11 02:03:14  kernel:  [<ffffffff81017774>] ? show_stack+0x34/0xa0
Sep 11 02:03:14  kernel:  [<ffffffff8150d929>] schedule+0x29/0x70
Sep 11 02:03:14  kernel:  [<ffffffffa0007ae5>] spl_debug_bug+0xb5/0xe0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa0014955>] spl_PANIC+0xa5/0xd0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa012cca3>] ? vdev_queue_io_to_issue+0x153/0x8f0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffff813878e0>] ? extract_buf+0xd0/0x190
Sep 11 02:03:14  kernel:  [<ffffffff81385a92>] ? __mix_pool_bytes+0x42/0x90
Sep 11 02:03:14  kernel:  [<ffffffff81387901>] ? extract_buf+0xf1/0x190
Sep 11 02:03:14  kernel:  [<ffffffffa0168774>] zio_cons+0xf4/0x100 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa000b67e>] spl_kmem_cache_alloc+0xae/0xdc0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff81388364>] ? extract_entropy+0x94/0x1d0
Sep 11 02:03:14  kernel:  [<ffffffffa016b210>] zio_create+0x40/0x5a0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa016c0fb>] zio_vdev_child_io+0xbb/0x110 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa012c746>] vdev_mirror_io_start+0xa6/0x1c0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa012bd30>] ? vdev_config_sync+0x150/0x150 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa011b8ef>] ? spa_config_enter+0xaf/0xf0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0169327>] zio_vdev_io_start+0x217/0x2f0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa016a470>] zio_execute+0x100/0x1b0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0011437>] taskq_thread+0x267/0x510 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8109eb80>] ? wake_up_process+0x50/0x50
Sep 11 02:03:14  kernel:  [<ffffffffa00111d0>] ? taskq_cancel_id+0x200/0x200 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:03:14  kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:03:14  kernel: INFO: task txg_sync:790 blocked for more than 120 seconds.
Sep 11 02:03:14  kernel:       Tainted: P          IO  3.16.2-mgm #1
Sep 11 02:03:14  kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 11 02:03:14  kernel: txg_sync        D 0000000000000001     0   790      2 0x00000000
Sep 11 02:03:14  kernel:  ffff880304fafba8 0000000000000046 ffff880303309e30 0000000000014400
Sep 11 02:03:14  kernel:  ffff880304faffd8 0000000000014400 ffff880303309e30 0000000000000046
Sep 11 02:03:14  kernel:  ffff880304fafb30 ffffffff8109e9fb 0000000000000246 ffff880304fafb30
Sep 11 02:03:14  kernel: Call Trace:
Sep 11 02:03:14  kernel:  [<ffffffff8109e9fb>] ? try_to_wake_up+0x1fb/0x330
Sep 11 02:03:14  kernel:  [<ffffffffa000fc1a>] ? taskq_dispatch_ent+0x6a/0x1c0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8109eb92>] ? default_wake_function+0x12/0x20
Sep 11 02:03:14  kernel:  [<ffffffff810b1d35>] ? __wake_up_common+0x55/0x90
Sep 11 02:03:14  kernel:  [<ffffffff8150d929>] schedule+0x29/0x70
Sep 11 02:03:14  kernel:  [<ffffffff8150dbb4>] io_schedule+0x94/0xf0
Sep 11 02:03:14  kernel:  [<ffffffffa001743c>] cv_wait_common+0xac/0x1b0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa016a370>] ? zio_taskq_member.isra.4+0x80/0x80 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffff810b2440>] ? __wake_up_sync+0x20/0x20
Sep 11 02:03:14  kernel:  [<ffffffffa0017598>] __cv_wait_io+0x18/0x20 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa016c3da>] zio_wait+0x19a/0x260 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa00f6688>] dsl_pool_sync+0x2e8/0x470 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0110cd5>] spa_sync+0x455/0xaf0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffff810d4a78>] ? ktime_get_ts+0x48/0xf0
Sep 11 02:03:14  kernel:  [<ffffffffa0121cc2>] txg_sync_thread+0x3b2/0x640 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa0121910>] ? txg_delay+0xf0/0xf0 [zfs]
Sep 11 02:03:14  kernel:  [<ffffffffa000f23a>] thread_generic_wrapper+0x7a/0x90 [spl]
Sep 11 02:03:14  kernel:  [<ffffffffa000f1c0>] ? __thread_exit+0xa0/0xa0 [spl]
Sep 11 02:03:14  kernel:  [<ffffffff8108e38a>] kthread+0xea/0x100
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0
Sep 11 02:03:14  kernel:  [<ffffffff815112fc>] ret_from_fork+0x7c/0xb0
Sep 11 02:03:14  kernel:  [<ffffffff8108e2a0>] ? kthread_create_on_node+0x1a0/0x1a0

@bjquinn

bjquinn commented Dec 18, 2014

@behlendorf I'll try it out and report back

@behlendorf
Contributor Author

@olw2005 using just the patch in openzfs/spl#421 is enough.

@mgmartin

@tuxoko awesome! I've been running pretty well with a current 3.14 kernel and the temporary fix of spl_kmem_cache_slab_limit=0. I'll remove that module option and apply the patch across a few servers that were locking up pretty consistently.

@lundman
Contributor

lundman commented Dec 18, 2014

This issue only impacts ZoL because Illumos and FreeBSD mutexes are implemented very differently which makes this usage safe.

May one ask how this can be checked? I am of course curious if it is something we have to look into over in XNU. The locks (mutex) code appears to come from Mach:

http://fxr.watson.org/fxr/source//osfmk/i386/i386_lock.s?v=xnu-2050.18.24;im=excerpts

@ryao
Contributor

ryao commented Dec 19, 2014

@lundman This is specific to Linux's mutex design, which was made faster at the expense of safety without anyone realizing it. Linus' explanation at the following URL covers it fairly well:

http://lwn.net/Articles/575477/

You will need to analyze XNU's locks to see if the same scenario is possible with them.
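
For anyone analyzing their own mutex implementation, the pattern in question reduces to the following userspace illustration of the zio_wait()/zio_done() shape. This is not a reproducer, and whether a given mutex implementation tolerates it is exactly the question to answer; the names are simplified stand-ins for the real ZFS code.

#include <pthread.h>
#include <stdlib.h>

struct fake_zio {
    pthread_mutex_t io_lock;     /* embedded in the object being freed */
    pthread_cond_t  io_cv;
    int             io_done;
};

static void *completer(void *arg)         /* plays the zio_done() role */
{
    struct fake_zio *zio = arg;

    pthread_mutex_lock(&zio->io_lock);
    zio->io_done = 1;
    pthread_cond_broadcast(&zio->io_cv);
    pthread_mutex_unlock(&zio->io_lock);  /* (A) may still be touching the
                                           * lock word internally ...    */
    return NULL;
}

static void waiter(struct fake_zio *zio)  /* plays the zio_wait() role */
{
    pthread_mutex_lock(&zio->io_lock);
    while (!zio->io_done)
        pthread_cond_wait(&zio->io_cv, &zio->io_lock);
    pthread_mutex_unlock(&zio->io_lock);

    free(zio);                            /* (B) ... when the memory that
                                           * holds the lock is freed here */
}

int main(void)
{
    struct fake_zio *zio = calloc(1, sizeof (*zio));
    pthread_t tid;

    pthread_mutex_init(&zio->io_lock, NULL);
    pthread_cond_init(&zio->io_cv, NULL);

    pthread_create(&tid, NULL, completer, zio);
    waiter(zio);                          /* may free zio while (A) runs */
    pthread_join(tid, NULL);
    return 0;
}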

tuxoko added a commit to tuxoko/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when used to synchronize
the freeing of the object in which the mutex is embedded:
http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race condition
are zio->io_lock and dbuf->db_mtx:
  zio uses zio->io_lock and zio->io_cv to synchronize freeing between zio_wait()
  and zio_done().
  dbuf uses dbuf->db_mtx to protect reference counting.

And it's been reported multiple times that zio->io_lock and dbuf->db_mtx suffer
from corruption, which is exactly the symptom of the race.
openzfs/zfs#2523
openzfs/zfs#2897

This patch fixes this kind of race by forcing serialization on mutex_exit() with
a spinlock, making the mutex safer by sacrificing a bit of performance and
memory overhead.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
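
Modeled in userspace, the shape of that fix is roughly the following. It mirrors the approach of pairing the mutex with an extra spin lock taken around the unlock path; it is not the actual SPL code, and the type and function names are illustrative.

#include <pthread.h>

typedef struct kmutex {
    pthread_mutex_t    m_mutex;   /* stands in for the Linux mutex     */
    pthread_spinlock_t m_lock;    /* added lock serializing the exits  */
} kmutex_t;

static void kmutex_init(kmutex_t *mp)
{
    pthread_mutex_init(&mp->m_mutex, NULL);
    pthread_spin_init(&mp->m_lock, PTHREAD_PROCESS_PRIVATE);
}

static void kmutex_enter(kmutex_t *mp)
{
    pthread_mutex_lock(&mp->m_mutex);
}

static void kmutex_exit(kmutex_t *mp)
{
    /*
     * The whole unlock, including any slow path that touches the mutex
     * internals, happens inside the spin lock.  A second thread cannot
     * finish its own kmutex_exit() -- and then free the object -- until
     * this one is completely done with the memory.
     */
    pthread_spin_lock(&mp->m_lock);
    pthread_mutex_unlock(&mp->m_mutex);
    pthread_spin_unlock(&mp->m_lock);
}

int main(void)
{
    kmutex_t m;

    kmutex_init(&m);
    kmutex_enter(&m);
    kmutex_exit(&m);
    return 0;
}
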
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
@behlendorf
Contributor Author

@tuxoko's fix for this issue has been merged to the spl master branch. Because of how well the root cause explains all the symptoms we've seen, I've closed out all the duplicates of this issue I'm aware of and linked them to the fix. And while I'm highly confident that this issue is now fixed, I'd like to leave this issue open a while longer to get absolute confirmation of that. Please let us know if you observe any similar symptoms with the fix applied. If things look good for the next few weeks we can close out this bug too.

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/zfs@a3c1eb7
From: Chunwei Chen <tuxoko@gmail.com>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Backported-by: Darik Horn <dajhorn@vanadac.com>
Closes #421

Conflicts:
        include/sys/mutex.h
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
@mgmartin

Fix confirmation: I've been running with this patch in place on a few servers for over a week now and have had zero lock-ups. Prior to the patch, I seemed to hit the issue within a day or two. I ran for a few days on a 3.14 kernel, then moved everything back to a 3.18 kernel. I used the default spl module options with spl_kmem_cache_kmem_limit=1024.

Looks good to me. Thanks everyone for all the work in tracking this down and finding the issue!

@lintonv

lintonv commented Jan 6, 2015

Is there a hot-fix patch for this? We are running the stable release (0.6.3) from July. How would we apply the patch cleanly without pulling in other dependencies?

@behlendorf
Contributor Author

@lidongyang This patch was pulled in to the stable ZFS packages for Fedora, CentOS, Ubuntu, Debian, and Gentoo. If you're using any of these repositories, simply update your packages and reboot to apply the fix. If you're building from source, use the 0.6.3-1.2 tag.

@behlendorf
Contributor Author

Closing this issue. After over two weeks of run time on numerous systems we've confirmed that the proposed fix has resolved the issue as expected. Thanks everybody.

@lintonv

lintonv commented Jan 7, 2015

@behlendorf Great information. Thank you. I did not know there was a 0.6.3-1.2 tag.

@prakashsurya
Member

@behlendorf I'm not sure where to put this, but it'd be really great to start tracking some of these differences between Linux and illumos w.r.t. this project. Subtle incompatibilities like this will make it extremely difficult to ever move to a single OpenZFS repository unless we start to collect and document the known differences and kernel API requirements.

@tuxoko great job finding the root cause!

behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 30, 2015
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue openzfs#2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 6, 2015
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue openzfs#2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
behlendorf added a commit to behlendorf/zfs that referenced this issue Feb 11, 2015
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue openzfs#2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue openzfs#3063
ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes openzfs#421

Conflicts:
	include/sys/mutex.h
behlendorf added a commit to behlendorf/zfs that referenced this issue Jun 9, 2015
There exists a plausible cache concurrency issue with zio_wait().
This might occur because the zio->io_waiter is not assigned under
a lock in zio_wait(), is not checked under a lock in zio_done(),
and the zio may be dispatched to another thread for handling.
That said, none of the actual crash dumps I've looked at show
that this has ever occurred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2523