Commits on Jun 23, 2011
  1. Multipath device manageability improvements

    nedbass authored and behlendorf committed Jun 21, 2011
    Update udev helper scripts to deal with device-mapper devices created
    by multipathd.  These enhancements are targeted at a particular
    storage network topology under evaluation at LLNL consisting of two
    SAS switches providing redundant connectivity between multiple server
    nodes and disk enclosures.
    The key to making these systems manageable is to create shortnames for
    each disk that convey its physical location in a drawer.  In a
    direct-attached topology we infer a disk's enclosure from the PCI bus
    number and HBA port number in the by-path name provided by udev.  In a
    switched topology, however, multiple drawers are accessed via a single
    HBA port.  We therefore resort to assigning drawer identifiers based
    on which switch port a drive's enclosure is connected to.  This
    information is available from sysfs.
    Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
    symbolic links in /dev/disk/by-id of the form
    <label>-<UUID>-switch-port:<X>-slot:<Y>.  <label> is a string that
    depends on the subsystem that created the link and defaults to
    "dm-uuid-mpath" (this prefix is used by multipathd).  <UUID> is a
    unique identifier for the disk typically obtained from the scsi_id
    program, and <X> and <Y> denote the switch port and disk slot numbers,
    respectively.
    Add a callout script sas_switch_id for use by multipathd to help
    create symlinks of the form described above.  Update zpool_id and the
    udev zpool rules file to handle both multipath devices and
    conventional drives.
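
    The link-name layout described above can be sketched as a small
    build/parse pair.  Only the <label>-<UUID>-switch-port:<X>-slot:<Y>
    layout and the "dm-uuid-mpath" default come from this commit message;
    the helper names, the sample UUID, and the assumption that the UUID
    itself contains no hyphens are illustrative.

```python
import re

# Only the link-name layout comes from the commit message; the helper
# names and the no-hyphen-in-UUID rule are assumptions for illustration.
LINK_RE = re.compile(
    r"^(?P<label>.+?)-(?P<uuid>[^-]+)"
    r"-switch-port:(?P<port>\d+)-slot:(?P<slot>\d+)$"
)


def make_link_name(uuid, switch_port, slot, label="dm-uuid-mpath"):
    """Build <label>-<UUID>-switch-port:<X>-slot:<Y>."""
    return f"{label}-{uuid}-switch-port:{switch_port}-slot:{slot}"


def parse_link_name(name):
    """Split a link name into its fields, or return None if it does not match."""
    m = LINK_RE.match(name)
    if m is None:
        return None
    return {
        "label": m.group("label"),
        "uuid": m.group("uuid"),
        "switch_port": int(m.group("port")),
        "slot": int(m.group("slot")),
    }
```

    A zdev.conf generator could use the parsed switch port and slot to
    derive a shortname for each disk.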
Commits on Jun 21, 2011
  1. Linux 3.0 compat, shrinker compatibility

    behlendorf committed Jun 21, 2011
    To accommodate the updated Linux 3.0 shrinker API the spl
    shrinker compatibility code was updated.  Unfortunately, this
    couldn't be done cleanly without slightly adjusting the compat
    API.  See spl commit a55bcaa.
    This commit updates the ZFS code to use the slightly modified
    API.  You must use the latest SPL if you're building ZFS.
Commits on Jun 20, 2011
  1. Fix unlink/xattr deadlock

    gunnarbeutner authored and behlendorf committed Jun 20, 2011
    The problem here is that prune_icache() tries to evict/delete
    both the xattr directory inode as well as at least one xattr
    inode contained in that directory. Here's what happens:
    1. File is created.
    2. xattr is created for that file (behind the scenes an xattr
       directory and a file in that xattr directory are created)
    3. File is deleted.
    4. Both the xattr directory inode and at least one xattr
       inode from that directory are evicted by prune_icache();
       prune_icache() acquires a lock on both inodes before it
       calls ->evict() on the inodes
    When the xattr directory inode is evicted zfs_zinactive attempts
    to delete the xattr files contained in that directory. While
    enumerating these files zfs_zget() is called to obtain a reference
    to the xattr file znode - which tries to lock the xattr inode.
    However that very same xattr inode was already locked by
    prune_icache() further up the call stack, thus leading to a
    deadlock.
    This can be reliably reproduced like this:
    $ touch test
    $ attr -s a -V b test
    $ rm test
    $ echo 3 > /proc/sys/vm/drop_caches
    This patch fixes the deadlock by moving the zfs_purgedir() call to
    zfs_unlinked_drain().  Instead zfs_rmnode() now checks whether the
    xattr dir is empty and leaves the xattr dir in the unlinked set if
    it finds any xattrs.
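
    As a rough user-space analogy (not the kernel code), the self-deadlock
    and the deferral fix can be modelled with a non-reentrant lock and a
    queue.  All names below are hypothetical stand-ins; the non-blocking
    acquire is used only so the sketch reports the problem instead of
    hanging.

```python
import threading
from collections import deque

inode_lock = threading.Lock()   # stands in for the xattr inode lock
unlinked_set = deque()          # stands in for the unlinked set


def purge_xattr():
    """Try to lock the xattr inode; False means a blocking caller would hang."""
    got = inode_lock.acquire(blocking=False)
    if got:
        inode_lock.release()
    return got


def evict_inline():
    """Old behavior: purge while the evicting caller still holds the lock."""
    with inode_lock:
        return purge_xattr()


def evict_deferred():
    """New behavior: leave the work in the unlinked set for a later drain."""
    with inode_lock:
        unlinked_set.append(purge_xattr)


def drain():
    """Run the deferred work once no eviction lock is held."""
    results = []
    while unlinked_set:
        results.append(unlinked_set.popleft()())
    return results
```

    Calling evict_inline() reports that the lock could not be taken, while
    evict_deferred() followed by drain() completes the purge.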
    To ensure zfs_unlinked_drain() never accesses a stale super block
    zfsvfs_teardown() has been updated to block until the iput taskq
    has been drained.  This avoids a potential race where a file with
    an xattr directory is removed and the file system is immediately
    unmounted.
    Signed-off-by: Brian Behlendorf <>
    Closes #266
  2. Removed erroneous zfs_inode_destroy() calls from zfs_rmnode().

    gunnarbeutner authored and behlendorf committed Jun 16, 2011
    iput_final() already calls zpl_inode_destroy() -> zfs_inode_destroy()
    for us after zfs_zinactive(), thus making sure that the inode is
    properly cleaned up.
    The zfs_inode_destroy() calls in zfs_rmnode() would lead to a
    Fixes #282
Commits on Jun 17, 2011
  1. Add "ashift" property to zpool create

    kohlschuetter authored and behlendorf committed Jun 16, 2011
    Some disks with internal sectors larger than 512 bytes (e.g., 4k) can
    suffer from bad write performance when ashift is not configured
    correctly.  This is caused by the disk not reporting its actual sector
    size, but a sector size of 512 bytes.  The drive may behave this way
    for compatibility reasons.  For example, the WDC WD20EARS disks are
    known to exhibit this behavior.
    When creating a zpool, ZFS takes that wrong sector size and sets the
    "ashift" property accordingly (to 9: 1<<9=512), whereas it should be
    set to 12 for 4k sectors (1<<12=4096).
    This patch allows an administrator to manually specify the known correct
    ashift size at 'zpool create' time.  This can significantly improve
    performance in certain cases.  However, it will have an impact on your
    total pool capacity.  See the updated ashift property description
    in the zpool.8 man page for additional details.
    Valid values for the ashift property range from 9 to 17 (512B-128KB).
    Additionally, you may set the ashift to 0 if you wish to auto-detect
    the sector size based on what the disk reports; this is the default
    behavior.  The most common ashift values are 9 and 12.
      zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd
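
    The relationship between ashift and sector size is a plain power of
    two.  A minimal sketch of the arithmetic, using the 9 to 17 range
    given above (the function names are illustrative, and the special
    value 0 for auto-detection is deliberately not modelled):

```python
MIN_ASHIFT = 9    # 512 bytes
MAX_ASHIFT = 17   # 128 KB


def sector_size(ashift):
    """Return the sector size in bytes for an explicit ashift value."""
    if not MIN_ASHIFT <= ashift <= MAX_ASHIFT:
        raise ValueError(f"ashift must be in {MIN_ASHIFT}..{MAX_ASHIFT}")
    return 1 << ashift


def ashift_for(sector_bytes):
    """Return the ashift matching a power-of-two sector size in bytes."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    ashift = sector_bytes.bit_length() - 1
    if not MIN_ASHIFT <= ashift <= MAX_ASHIFT:
        raise ValueError("sector size out of range")
    return ashift
```

    For a drive with 4k internal sectors, ashift_for(4096) gives the 12
    used in the command above.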
    Closes #280
    Original-patch-by: Richard Laager <>
    Signed-off-by: Brian Behlendorf <>
  2. Linux 2.6.37 compat, WRITE_FLUSH_FUA

    behlendorf committed Jun 16, 2011
    The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
    introduced as a replacement for WRITE_BARRIER.  This was done
    to allow richer semantics to be expressed to the block layer.
    It is the block layer's responsibility to choose the correct way
    to implement these semantics.
    This change simply updates the bios to use the new kernel API
    which should be absolutely safe.  However, since ZFS depends
    entirely on this working as designed for correctness we do
    want to be careful.
    Closes #281
  3. Update rpm/deb packages to be FHS compliant

    behlendorf committed Jun 17, 2011
    This change is the first step towards updating the default
    rpm/deb packages to be FHS compliant.  It accomplishes this
    by passing the following options to ./configure to ensure the
    zfs build products are installed in FHS compliant locations.
      ./configure --prefix=/ --bindir=/lib/udev \
        --libexecdir=/usr/libexec --datadir=/usr/share
    The core zfs utilities (zfs, zpool, zdb) are now installed
    in /sbin, the core libraries in /lib, and the udev helpers
    (zpool_id, zvol_id) are in /lib/udev with the other udev
    helpers.
    The remaining files in the zfs package remain in their
    previous locations under /usr.
  4. Autogen refresh.

    dajhorn authored and behlendorf committed Jun 17, 2011
    Run using the same autotools versions as upstream:
     * autoconf-2.63
     * automake-1.11.1
     * libtool-2.2.6b
  5. Use datadir not datarootdir for dracut

    behlendorf committed Jun 17, 2011
    The zfs dracut modules should be installed under the --datadir
    not --datarootdir path.  This was just an oversight in the
    After this change %{_datadir} can now be set safely in the
    zfs.spec file.  The 'make install' location is now consistent
    with the location expected by the spec file.
  6. Fix autoconf variable substitution in udev rules.

    dajhorn authored and behlendorf committed Jun 17, 2011
    Change the variable substitution in the udev rule templates
    according to the method described in the Autoconf manual;
    Chapter 4.7.2: Installation Directory Variables.
    The udev rules are improperly generated if the bindir parameter
    overrides the prefix parameter during configure. For example:
      # ./configure --prefix=/usr/local --bindir=/opt/zfs/bin
    The udev helper is installed as /opt/zfs/bin/zpool_id, but the
    corresponding udev rule has a different path:
      # /usr/local/etc/udev/rules.d/60-zpool.rules
      ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/local/bin/zpool_id -d %p"
    The @bindir@ variable expands to "${exec_prefix}/bin", so it cannot
    be used instead of @prefix@ directly.
    This also applies to the zvol_id helper.
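
    The nested expansion is the crux here.  A minimal sketch of how the
    directory variables resolve, assuming the usual Autoconf defaults of
    exec_prefix='${prefix}' and bindir='${exec_prefix}/bin' (the expand()
    helper is illustrative, not part of the build system):

```python
import re

VAR_RE = re.compile(r"\$\{(\w+)\}")


def expand(value, variables, max_depth=10):
    """Expand ${name} references repeatedly until nothing changes."""
    for _ in range(max_depth):
        expanded = VAR_RE.sub(lambda m: variables[m.group(1)], value)
        if expanded == value:
            return expanded
        value = expanded
    raise ValueError("variable expansion did not terminate")


defaults = {
    "prefix": "/usr/local",
    "exec_prefix": "${prefix}",
    "bindir": "${exec_prefix}/bin",
}
# Same configuration with --bindir=/opt/zfs/bin given to ./configure.
overridden = dict(defaults, bindir="/opt/zfs/bin")
```

    With the defaults the helper path resolves under the prefix; with the
    override it follows bindir, which is the path the udev rule must use.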
    Closes #283.
Commits on Jun 14, 2011
  1. Handle /etc/mtab -> /proc/mounts symlink

    behlendorf committed Jun 14, 2011
    Under Fedora 15 /etc/mtab is now a symlink to /proc/mounts by
    default.  When /etc/mtab is a symlink the mount.zfs helper
    should not update it.   There was code in place to handle this
    case but it used stat() which traverses the link and then issues
    the stat on /proc/mounts.  We need to use lstat() to prevent the
    link traversal and instead stat /etc/mtab.
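
    The difference between the two calls is easy to demonstrate in user
    space.  The sketch below builds a throwaway symlink in a temporary
    directory (the file names only mimic the mtab case) and checks it
    both ways:

```python
import os
import stat
import tempfile


def is_symlink_lstat(path):
    """lstat() examines the link itself, so a symlink is detected."""
    return stat.S_ISLNK(os.lstat(path).st_mode)


def is_symlink_stat(path):
    """stat() follows the link, so the result describes the target."""
    return stat.S_ISLNK(os.stat(path).st_mode)


def demo():
    """Return (lstat result, stat result) for a freshly created symlink."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "mounts")
        link = os.path.join(d, "mtab")
        open(target, "w").close()
        os.symlink(target, link)
        return is_symlink_lstat(link), is_symlink_stat(link)
```

    Only the lstat() variant reports the link, which is why the stat()
    based check never noticed that /etc/mtab was a symlink.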
    Closes #270
  2. Always check -Wno-unused-but-set-variable gcc support

    behlendorf committed Jun 14, 2011
    The previous commit 8a7e1ce wasn't
    quite right.  This check applies to both the user and kernel space
    build and as such we must make sure it runs regardless of what
    the --with-config option is set to.
    For example, if --with-config=kernel then the autoconf test does
    not run and we generate build warnings when compiling the kernel
  3. Check for -Wno-unused-but-set-variable gcc support

    behlendorf committed Jun 14, 2011
    Gcc versions 4.3.2 and earlier do not support the compiler flag
    -Wno-unused-but-set-variable.  This can lead to build failures
    on older Linux platforms such as Debian Lenny.  Since this is
    an optional build argument this change adds a new autoconf check
    for the option.  If it is supported by the installed version of
    gcc then it is used, otherwise it is omitted.
    See commits 12c1acd and
    7971303 for the reason the
    -Wno-unused-but-set-variable option was originally added.
Commits on Jun 13, 2011
  1. Add default stack checking

    behlendorf committed Jun 12, 2011
    When your kernel is built with kernel stack tracing enabled and you
    have the debugfs filesystem mounted, the script will clear
    the worst observed kernel stack depth on module load and check the worst
    case usage on module removal.  If the stack depth ever exceeds 7000
    bytes the full stack will be printed for debugging.  This is dangerously
    close to overrunning the default 8k stack.
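
    The threshold check itself is a simple comparison.  A sketch,
    assuming the worst-case depth is available as decimal text (for
    example from the stack tracer's stack_max_size file under debugfs,
    which is an assumption and not stated in this commit message):

```python
STACK_WARN_BYTES = 7000   # threshold quoted in the commit message


def stack_depth_exceeded(max_size_text, limit=STACK_WARN_BYTES):
    """Return True if the reported worst-case stack depth exceeds the limit."""
    return int(max_size_text.strip()) > limit
```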
    This additional advisory debugging is particularly valuable when running
    the regression tests on a kernel built with 16k stacks.  In this case,
    almost no matter how bad the stack overrun is you will be able to
    get a clean stack trace for debugging.  Since the worst case stack usage
    can be highly variable it's helpful to always check the worst case usage.
Commits on Jun 10, 2011
  1. Pass -f option for import

    behlendorf committed Jun 10, 2011
    If a pool was not cleanly exported passing the -f flag may be required
    at 'zpool import' time.  Since this test is simply validating that the
    pool can be successfully imported in the absence of the cache file
    always pass -f to ensure it succeeds.  This failure was observed
    under RHEL6.1.
Commits on Jun 9, 2011
  1. Fix 'zfs send -D' segfault

    behlendorf committed Jun 9, 2011
    Sending pools with dedup results in a segfault due to a Solaris
    portability issue.  Under Solaris the pipe(2) library call
    creates a bidirectional data channel.  Unfortunately, on Linux the
    pipe(2) call creates a unidirectional data channel.  The fix is to
    use the socketpair(2) function to create the expected
    bidirectional channel.
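
    The behavioral difference can be shown from user space on Linux.
    This is only an illustration of the two primitives, not the zfs send
    code path; the function names are made up for the example:

```python
import os
import socket


def socketpair_round_trip(msg):
    """Send msg one way and its reverse the other way over one socketpair."""
    a, b = socket.socketpair()
    try:
        a.sendall(msg)
        seen_by_b = b.recv(len(msg))
        b.sendall(msg[::-1])
        seen_by_a = a.recv(len(msg))
        return seen_by_b, seen_by_a
    finally:
        a.close()
        b.close()


def pipe_reverse_write_fails():
    """On Linux the read end of a pipe cannot be written to."""
    r, w = os.pipe()
    try:
        try:
            os.write(r, b"x")
        except OSError:
            return True
        return False
    finally:
        os.close(r)
        os.close(w)
```

    Both ends of the socketpair can send and receive, whereas writing to
    the read end of the pipe raises an error.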
    Seth Heeren did the original leg work on this issue for zfs-fuse.
    We finally just rediscovered the same portability issue and
    dfurphy was able to point me at the original issue for the fix.
    Closes #268
Commits on Jun 3, 2011
  1. Sanitize environment

    behlendorf committed Jun 3, 2011
    Just like the tests should run in a
    sanitized environment.  This ensures they never conflict with an
    installed /etc/zfs/zpool.cache file.
    This commit additionally improves the -c cleanup option.  It now
    removes the modules stack if loaded and destroys relevant md devices.
    This behavior is now identical to
  2. Delay before destroying loopback devices

    behlendorf committed Jun 3, 2011
    Generally I don't approve of just adding an arbitrary delay to
    avoid a problem but in this case I'm going to let it slide.  We
    may need to delay briefly after 'zpool destroy' returns to ensure
    the loopback devices are closed.  If they aren't closed then
    losetup -d will not be able to destroy them.  Unfortunately,
    there's no easy state to check so we'll have to make do with
    a simple delay.
Commits on Jun 2, 2011
  1. Always unload zpios.ko on exit

    behlendorf committed Jun 2, 2011
    We should always unload zpios.ko on exit.  This ensures
    that subsequent calls to ' -u' from other utilities
    will be able to unload the module stack and properly
    cleanup.  This is important for the --cleanup option
    which can be passed to and
  2. Fix return code

    behlendorf committed Jun 2, 2011
    The script should return failure when any
    of the individual tests fail.  The previous code
    would always return success suppressing real failures.
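
    The intended behavior amounts to accumulating failures instead of
    returning the status of the last step.  A generic sketch, not the
    actual test script (the runner and its arguments are hypothetical):

```python
def run_tests(tests):
    """Run every (name, callable) pair; return 0 only if all of them pass."""
    failures = 0
    for name, test in tests:
        passed = test()
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
        if not passed:
            failures += 1
    # A non-zero exit status signals that at least one test failed.
    return 1 if failures else 0
```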
Commits on May 31, 2011
  1. Fix stack ddt_class_contains()

    behlendorf committed May 25, 2011
    Stack usage for ddt_class_contains() reduced from 524 bytes to 68
    bytes.  This large stack allocation significantly contributed to
    the likelihood of a stack overflow when scrubbing/resilvering
    dedup pools.
  2. Fix stack ddt_zap_lookup()

    behlendorf committed May 25, 2011
    Stack usage for ddt_zap_lookup() reduced from 368 bytes to 120
    bytes.  This large stack allocation significantly contributed to
    the likelihood of a stack overflow when scrubbing/resilvering
    dedup pools.
  3. Revert "Fix stack traverse_visitbp()"

    behlendorf committed May 25, 2011
    This abomination is no longer required because the zio's issued
    during this recursive call path will now be handled asynchronously
    by the taskq thread pool.
    This reverts commit 6656bf5.
  4. Make txg_sync_thread zio's async

    behlendorf committed May 25, 2011
    The majority of the recursive operations performed by the dsl
    are done either in the context of the txg_sync_thread or during
    pool import.  It is these recursive operations which contribute
    greatly to the stack depth.  When this recursion is coupled with
    a synchronous I/O in the same context overflow becomes possible.
    Previously, to handle this case, I focused on keeping the
    individual stack frames as light as possible.  This is a good
    idea as long as it can be done in a way which doesn't overly
    complicate the code.  However, there is a better solution.
    If we treat all zio's issued by the txg_sync_thread as async then
    we can use the txg_sync_thread stack for the recursive parts, and
    the zio_* threads for the I/O parts.  This effectively doubles our
    available stack space with the only drawback being a small delay
    to schedule the I/O.  However, in practice the scheduling time
    is so much smaller than the actual I/O time this isn't an issue.
    Another benefit of making the zio async is that the zio pipeline
    is now parallel.  That should mean performance may improve for
    CPU intensive pipelines such as compression or dedup.
    With this change in place the worst case stack usage observed so
    far is 6902 bytes.  This is still higher than I'd like but
    significantly improved.  Additional changes to specific functions
    should improve this further.  This change allows us to revert
    commit 6656bf5 which did some horrible things to the recursive
    traverse_visitbp() callpath in the name of saving stack.
Commits on May 27, 2011
  1. Fix 4K sector support

    behlendorf committed May 26, 2011
    Yesterday I ran across a 3TB drive which exposed 4K sectors to
    Linux.  While I thought I had gotten this support correct it
    turns out there were 2 subtle bugs which prevented it from working.
      sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
      cannot create 'large-sector': one or more devices is currently unavailable
    1) The first issue was that it was possible that bdev_capacity()
    would return the number of 512 byte sectors rather than the number
    of 4096 byte sectors.  Internally, certain Linux functions only operate
    with 512 byte sectors so you need to be careful.  To avoid any
    confusion in the future I've updated bdev_capacity() to simply
    return the device (or partition) capacity in bytes.  The higher
    levels of ZFS want the value in bytes anyway so this is cleaner.
    2) When creating a bio the ->bi_sector count must always be
    expressed in 512 byte sectors.  The existing code would scale
    the byte offset by the logical sector size.   Until now this was
    always 512 so it never caused problems.  Trying a 4K sector drive
    clearly exposed the issue.  The problem has been fixed by
    hard coding the 512 byte sector which is exactly what the bio
    code does internally.
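
    The unit mismatch in both issues comes down to a fixed shift of 9.
    A sketch of the arithmetic only (function names are illustrative,
    not the kernel helpers):

```python
SECTOR_SHIFT = 9   # bio offsets are always expressed in 512-byte units


def bi_sector_buggy(byte_offset, logical_sector_size):
    """Old behavior: scale by the device's logical sector size."""
    return byte_offset // logical_sector_size


def bi_sector_fixed(byte_offset):
    """New behavior: always scale by 512 bytes."""
    return byte_offset >> SECTOR_SHIFT


def capacity_bytes(nr_512_sectors):
    """Convert a count of 512-byte sectors into a capacity in bytes."""
    return nr_512_sectors << SECTOR_SHIFT
```

    For a 1 MiB offset both variants agree on a 512 byte device, but on
    a 4K device the old scaling yields a sector number eight times too
    small.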
    With these changes I'm now able to create ZFS pools using 4K
    sector drives.  No issues were observed during fairly extensive
    testing.  This is also a low risk change if you're using 512b
    sector devices because none of the logic changes.
    Closes #256
Commits on May 20, 2011
  1. Use vmem_alloc() for zfs_ioc_userspace_many()

    behlendorf committed May 20, 2011
    The default buffer size when requesting multiple quota entries
    is 100 times the zfs_useracct_t size.  In practice this works out
    to exactly 27200 bytes.  Since this will be a short lived buffer
    in a non-performance critical path it is preferable to vmem_alloc()
    the needed memory.
  2. Default to internal 'zfs userspace' implementation

    behlendorf committed May 20, 2011
    We will never bring over the helper script from Solaris
    to Linux.  Instead the missing functionality will be directly
    integrated in to the zfs commands and libraries.  To avoid
    confusion remove the warning about the missing utility
    and simply use the default internal support.
    The Illumos developers are of the same mind and have proposed an
    initial patch to do this which has been integrated in to the 'allow'
    development branch.  After some additional testing this code
    can be merged in to master as the right long term solution.
  3. Pass caller's credential in zfsdev_ioctl()

    behlendorf committed May 20, 2011
    Initially when zfsdev_ioctl() was ported to Linux we didn't have
    any credential support implemented.  So at the time we simply
    passed NULL which wasn't much of a problem since most of the
    secpolicy code was disabled.
    However, one exception is quota handling which does require the
    credential.  Now that proper credentials are supported we can
    safely start passing the caller's credential.  This is also an
    initial step towards fully implementing the zfs secpolicy.
Commits on May 18, 2011
  1. Fix 'negative objects to delete' warning

    behlendorf committed May 9, 2011
    Normally when the arc_shrinker_func() function is called the return
    value should be:
       >=0 - To indicate the number of freeable objects in the cache, or
       -1  - To indicate this cache should be skipped
    However, when the shrinker callback is called with 'nr_to_scan' equal
    to zero, the caller simply wants the number of freeable objects in
    the cache and we must never return -1.  This patch reorders the
    first two conditionals in arc_shrinker_func() to ensure this behavior.
    This patch also now explicitly casts arc_size and arc_c_min to signed
    int64_t types so MAX(x, 0) works as expected.  As unsigned types
    we would never see a negative value which defeated the purpose of
    the MAX() lower bound and broke the shrinker logic.
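
    The unsigned pitfall can be emulated outside C.  The sketch below
    assumes the clamped expression has the form MAX(arc_size - arc_c_min,
    0), which this commit message implies but does not show; 64-bit
    unsigned wraparound is modelled with a mask.

```python
U64_MASK = (1 << 64) - 1


def freeable_unsigned(arc_size, arc_c_min):
    """Emulate uint64_t subtraction: a negative difference wraps around."""
    diff = (arc_size - arc_c_min) & U64_MASK
    return max(diff, 0)   # the lower bound can never take effect


def freeable_signed(arc_size, arc_c_min):
    """Signed arithmetic: a negative difference is clamped to zero."""
    return max(arc_size - arc_c_min, 0)
```

    When arc_size is below arc_c_min the unsigned variant reports an
    enormous count, while the signed variant correctly reports zero.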
    Finally, when nr_to_scan is non-zero we explicitly prevent all reclaim
    below arc_c_min.  This is done to prevent the Linux page cache from
    completely crowding out the ARC.  This limit is tunable and some
    experimentation is likely going to be required to set it exactly right.
    For now we're sticking with the OpenSolaris defaults.
    Closes #218
    Closes #243
Commits on May 14, 2011
  1. Fix distribution detection for gentoo

    alexxy authored and behlendorf committed May 14, 2011
    This may also fix detection on other distros, since Ubuntu is not
    the only one that provides /etc/lsb-release.
    Closes #244
Commits on May 13, 2011
  1. Update synchronous open zfs_close() comment

    behlendorf committed May 13, 2011
    The comment in zfs_close() pertaining to decrementing the synchronous
    open count needs to be updated for Linux.  The code was already
    updated to be correct, but the comment was missed and is now misleading.
    Under Linux the zfs_close() hook is only called once when the final
    reference is dropped.  This differs from Solaris where zfs_close()
    is called for each close.
    Closes #237
Commits on May 12, 2011
  1. Remove root 'ls' after mount workaround

    alexxy authored and behlendorf committed May 12, 2011
    This workaround was introduced to work around issue #164.  This
    issue was fixed by commit 5f35b19 so the workaround can be safely
    dropped from both the zfs.fedora and zfs.gentoo init scripts.
  2. Fix zfs.gentoo init script logic

    alexxy authored and behlendorf committed May 10, 2011
    * Fix zfs.ko module check
    * Check 'zfs umount -a' return value
  3. Make zfs.gentoo init script more gentoo style.

    alexxy authored and behlendorf committed May 10, 2011
    * Improved compatibility with openrc
    * Removed LOCKFILE
    * Improved checksystem() function
    * Remove /etc/mtab check for /
    * General cleanup
Commits on May 9, 2011
  1. Merge pull request #235 from nedbass/rdev

    behlendorf committed May 9, 2011
    Don't store rdev in SA for FIFOs and sockets