ashift issue on linux - cannot add mirror disk or replace single-disk pool created with "default" (ashift=9) with 4K disk ashift=12 -- new device has a different optimal sector size #4740

Closed
kingneutron opened this issue Jun 7, 2016 · 7 comments · Fixed by #5763

Comments

@kingneutron

kingneutron commented Jun 7, 2016

Ref: #1328

--System info:
Ubuntu 14.04 LTS (64-bit)
Kernel: 4.2.0-30-generic #36~14.04.1-Ubuntu SMP
RAM: 12GB
Swap: 2GB, mostly unused due to system optimizations

--ZFS software versions:

$ dpkg -l | grep zfs
ii dkms 2.2.0.3-1.1ubuntu5.14.04.1+zfs10~trusty all Dynamic Kernel Module Support Framework
ii libzfs2 0.6.5.7-1~trusty amd64 Native OpenZFS filesystem library for Linux
ii mountall 2.53-zfs1 amd64 filesystem mounting tool
ii ubuntu-zfs 8~trusty amd64 Native ZFS filesystem metapackage for Ubuntu.
ii zfs-dkms 0.6.5.7-1~trusty amd64 Native OpenZFS filesystem kernel modules for Linux
ii zfs-doc 0.6.5.7-1~trusty amd64 Native OpenZFS filesystem documentation and examples.
ii zfsutils 0.6.5.7-1~trusty amd64 Native OpenZFS management utilities for Linux

INTENT: I have a single-disk pool called "bigvaiterazfs" with mountpoints for /home and a few others, and I want to add a mirror to it with the least amount of fuss:

$ df
Filesystem 1K-blocks Used Available Use% Mounted on
bigvaiterazfs 628019968 0 628019968 0% /bigvaiterazfs
bigvaiterazfs/bluraytemp 24641536 23980800 660736 98% /mnt/bluraytemp25
bigvaiterazfs/dv 643575424 15555456 628019968 3% /bigvaiterazfs/dv
bigvaiterazfs/dv/bigvai500 857146496 229126528 628019968 27% /mnt/bigvai500
bigvaiterazfs/dv/compr 659756544 31736576 628019968 5% /bigvaiterazfs/dv/compr
bigvaiterazfs/home 628036608 16640 628019968 1% /home
bigvaiterazfs/home/user 640859264 12839296 628019968 3% /home/user
bigvaiterazfs/home/squid 629390336 1370368 628019968 1% /home/squid
bigvaiterazfs/home/vmtmpdir 628019968 0 628019968 0% /home/vmtmpdir

TODO: attach the new WD 1TB Black as a mirror to the existing single-disk ZFS pool "bigvaiterazfs" to get redundancy and better I/O

smartctl -a /dev/sdd # Original disk

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.2.0-30-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Black
Device Model: WDC WD1002FAEX-00Z3A0
Firmware Version: 05.01D05
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical

zpool status

pool: bigvaiterazfs
state: ONLINE
scan: scrub repaired 0 in 2h48m with 0 errors on Wed May 11 13:54:52 2016
config:

    NAME                                         STATE     READ WRITE CKSUM
    bigvaiterazfs                                ONLINE       0     0     0
      ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585  ONLINE       0     0     0

errors: No known data errors

--Intended mirror disk: new WD 1TB Black

smartctl -a /dev/sdd # NEW disk

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.2.0-30-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: WDC WD1003FZEX-00MK2A0
Serial Number: WD-WCC3F7RZZCL7
LU WWN Device Id: 5 0014ee 20d54fbb4
Firmware Version: 01.01A01
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
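
(For reference, the logical/physical sector sizes can also be checked without smartctl; /dev/sdX below is a placeholder for the drive in question:)

blockdev --getss --getpbsz /dev/sdX            # logical sector size, then physical block size
cat /sys/block/sdX/queue/physical_block_size   # same info from sysfs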

--I put a GPT label on the new disk with gparted, and tried:

pool1=bigvaiterazfs; time zpool attach -o ashift=12 $pool1 \
  ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585 \
  ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7

...and got this error:

cannot attach ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7 to ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585: new device has a different optimal sector size; use the option '-o ashift=N' to override the optimal size

zpool get all bigvaiterazfs|grep ashift

bigvaiterazfs ashift 0 default

FML :( bigvaiterazfs pool was not created with ashift=12!

--Ok fair enough, the single-disk pool was created in ~2014 and I have learned a lot about ZFS since then. Trying to think around the problem, maybe I can replace the existing disk with the new disk on the fly, labelclear the old disk, re-GPT it, and attach the old one as a mirror with ashift=12...

--I came across this on a google search, but it doesn't work:
http://www.sotechdesign.com.au/zfs-zpool-replace-returns-error-cannot-replace-devices-have-different-sector-alignment/

time zpool replace bigvaiterazfs \
  ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585 \
  ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7 -o ashift=12 -f

...and got:
(error) cannot replace ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585 with ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7: new device has a different optimal sector size; use the option '-o ashift=N' to override the optimal size

BUG: No matter what I try, I can't accomplish what the error message is recommending, even when I move the -o to before the "replace" part of the command.

...So, it looks like I have to create a new single-disk pool with the new drive, snapshot the existing pool, copy the existing data over with "zfs send | zfs receive", and recreate the mountpoints. On top of planning all that out, I have to do this from TTY1 with no GUI running, in a "screen" session, because /home is involved and I will need to destroy the original pool to re-use the disk.

--This is turning out to be a rather large PITA when it was supposed to be simple(r) with ZFS.

--So, I am filing a bug report / feature request and documenting what I am doing so maybe it will help others. Ideally, ZFS should be able to copy existing data over on the fly to the new ashift=12 disk.

REF for zfs send/receive: https://forums.freebsd.org/threads/37819/

NEW INTENT: create a 1-disk pool with new drive, copy data over, reuse old drive as new mirror disk

p1=bigvaiterazfs

p2=bigvaiterazfsNB

d1=ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585

d2=ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7

zpool create -o ashift=12 -o autoexpand=on -O atime=off $p2 $d2

Filesystem 1K-blocks Used Available Use% Mounted on
bigvaiterazfsNB 942669440 0 942669440 0% /bigvaiterazfsNB

zpool get ashift $p2

NAME PROPERTY VALUE SOURCE
bigvaiterazfsNB ashift 12 local

Prepare for data migration to new pool

zfs snapshot -r $p1@now

DONE - ONLY ONCE!

zpool set listsnaps=on $p1

zfs list -r $p1

NAME USED AVAIL REFER MOUNTPOINT
bigvaiterazfs 300G 599G 31K /bigvaiterazfs
bigvaiterazfs@now 0 - 31K -
bigvaiterazfs/bluraytemp 22.9G 645M 22.9G /mnt/bluraytemp25
bigvaiterazfs/bluraytemp@now 0 - 22.9G -
bigvaiterazfs/dv 264G 599G 14.8G /bigvaiterazfs/dv
bigvaiterazfs/dv@now 0 - 14.8G -
bigvaiterazfs/dv/bigvai500 219G 599G 219G /mnt/bigvai500
bigvaiterazfs/dv/bigvai500@now 0 - 219G -
bigvaiterazfs/dv/compr 30.3G 599G 30.3G /bigvaiterazfs/dv/compr
bigvaiterazfs/dv/compr@now 0 - 30.3G -
bigvaiterazfs/home 13.6G 599G 16.3M /home
bigvaiterazfs/home@now 0 - 16.3M -
bigvaiterazfs/home/user 12.2G 599G 12.2G /home/user
bigvaiterazfs/home/user@now 563K - 12.2G -
bigvaiterazfs/home/squid 1.31G 599G 1.31G /home/squid
bigvaiterazfs/home/squid@now 0 - 1.31G -
bigvaiterazfs/home/vmtmpdir 37K 599G 37K /home/vmtmpdir
bigvaiterazfs/home/vmtmpdir@now 0 - 37K -

DONE - migrate existing pool to new pool:

time zfs send -R $p1@now | zfs recv -dF $p2

cannot share 'bigvaiterazfsNB/dv': smb add share failed
cannot share 'bigvaiterazfsNB/dv/compr': smb add share failed
real 75m43.531s

Filesystem 1K-blocks Used Available Use% Mounted on
bigvaiterazfsNB 627546624 128 627546496 1% /bigvaiterazfsNB
bigvaiterazfsNB/dv 643107712 15561216 627546496 3% /bigvaiterazfsNB/dv
bigvaiterazfsNB/dv/compr 659511296 31964800 627546496 5% /bigvaiterazfsNB/dv/compr

--Now I need to take myself out of X windows, recreate the mountpoints (which can be gotten from zpool history, but fortunately I also document all my changes in a text file) and get the old disk out of the way. Will update this issue when I have everything in place, and hopefully nothing goes wrong.
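
--For the record, if the mountpoints do need recreating, digging the original settings out of pool history and re-applying them would presumably go something like this (fallback only; dataset names from the listings above):

zpool history bigvaiterazfs | grep -E 'mountpoint|quota|compression'   # recover the original create/set commands
# then re-apply on the new pool, e.g.:
zfs set mountpoint=/home bigvaiterazfsNB/home
zfs set mountpoint=/mnt/bigvai500 bigvaiterazfsNB/dv/bigvai500
zfs set mountpoint=/mnt/bluraytemp25 bigvaiterazfsNB/bluraytemp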

@kingneutron
Author

kingneutron commented Jun 7, 2016

--OK, so after unplugging the original disk and rebooting, things went a bit better than expected. Did not have to recreate the mountpoints. Again, this is Ubuntu 14.04-64

( I also forgot to mention that I backed up the whole original pool with tar to another compressed zfs pool before doing ANYTHING: )

time tar cpf - /bigvaiterazfs /mnt/bluraytemp25 /mnt/bigvai500 /home \
  | pv > /zredpool2/dvcompr/bkp-bigvaiterazfs--home--bigvai500--bluray--b4-add-mirror--20160607.tar1

325GB 2:21:32 [39.2MB/s]
real 141m32.419s

DbigvaiterazfsA=/dev/disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585 #= sdd
DbigvaiterazfsB=/dev/disk/by-id/ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7 #= sdh

stop lightdm

umount /mnt/bluraytemp25; umount /mnt/bigvai500; umount /home; umount /home/*

umount: /home: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))

zpool export bigvaiterazfs ## also unplugged SATA power

( NOTE: this is what I tried, but you can SKIP this step )

zfs create -o mountpoint=/home -o atime=off bigvaiterazfsNB/home

(got error) cannot create 'bigvaiterazfsNB/home': dataset already exists

reboot

stop lightdm

Things went better than expected; I did not have to recreate the mountpoints with the original disk unplugged:

Filesystem 1K-blocks Used Available Use% Mounted on
bigvaiterazfsNB 627546624 128 627546496 1% /bigvaiterazfsNB
bigvaiterazfsNB/bluraytemp 24641536 23988224 653312 98% /mnt/bluraytemp25
bigvaiterazfsNB/dv 643107712 15561216 627546496 3% /bigvaiterazfsNB/dv
bigvaiterazfsNB/dv/bigvai500 856759040 229212544 627546496 27% /mnt/bigvai500
bigvaiterazfsNB/dv/compr 659511296 31964800 627546496 5% /bigvaiterazfsNB/dv/compr
bigvaiterazfsNB/home 627563392 16896 627546496 1% /home
bigvaiterazfsNB/home/user 640518912 12972416 627546496 3% /home/user
bigvaiterazfsNB/home/squid 628951040 1404544 627546496 1% /home/squid
bigvaiterazfsNB/home/vmtmpdir 627546624 128 627546496 1% /home/vmtmpdir

SKIP # zfs create -o mountpoint=/mnt/bigvai500 -o atime=off bigvaiterazfsNB/dv/bigvai500
SKIP # zfs create -o compression=off -o atime=off -o mountpoint=/mnt/bluraytemp25 -o quota=23.5G bigvaiterazfsNB/bluraytemp

start lightdm

--X came up OK, and overall response is improved because the original disk was attached to a 4-port SATA II PCI card.

--Now I can re-use the original disk as a mirror, more details to follow.
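
--Roughly what that will look like (untested until I actually do it; old-disk ID as above, ashift=12 to match the new pool):

d1=/dev/disk/by-id/ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585   # original disk, plugged back in
zpool labelclear -f $d1                                        # wipe the old ZFS labels so it can be reused
# re-GPT it (gparted again), then attach it as the mirror half:
zpool attach -o ashift=12 bigvaiterazfsNB \
  ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7 $d1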

@DeHackEd
Contributor

DeHackEd commented Jun 8, 2016

You're actually supposed to use zpool replace -o ashift=9 ... to override detection on the new drive. Also note that while ashift looks like a pool property, it really isn't and the -o ashift=X notation is not really related to the equivalent operation of zpool set. That could be documented better.
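
For example, the replace that failed earlier would be (note the -o goes right after the subcommand, before the pool name):

zpool replace -o ashift=9 bigvaiterazfs \
  ata-WDC_WD1002FAEX-00Z3A0_WD-WCATRC635585 \
  ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7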

But the warning is for your own good. The resilver will easily take 2x, maybe 10x as long as a scrub normally would and the pool will perform a bit worse during normal operation.

Use code tags to help make your output more readable.

@kingneutron
Author

--I understand what you're saying, but I don't believe this behavior is what the user expects. The original drive has 512-byte sectors; the replacement drive reports 512-byte logical / 4096-byte physical, so using ashift=9 gets worse performance out of the new drive.

--Desired behavior is for ZFS to replace the existing drive on the fly and use the more desirable ashift=12, since practically nobody is making 512 sector drives anymore and you want to future-proof the pool PLUS get better performance.

--Actually my resilvers don't take that long, I just do a few basic blockdev --setra tweaks and don't use the pool during resilver. This is the new replacement WD 1TB Black with a WD 1TB Blue mirror, standard COTS hardware; they are not even SAS drives or 10K RPM:
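
(For reference, the readahead tweaks are along these lines; the values are just what I happen to use, not gospel, and sdX/sdY stand in for the mirror members:)

blockdev --setra 8192 /dev/sdX   # bump readahead on each mirror member before the resilver
blockdev --setra 8192 /dev/sdY
blockdev --getra /dev/sdX        # verify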

zpool status

pool: bigvaiterazfsNB
state: ONLINE
scan: resilvered 301G in 0h36m with 0 errors on Wed Jun 8 11:52:10 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    bigvaiterazfsNB                                 ONLINE       0     0     0
      mirror-0                                      ONLINE       0     0     0
        ata-WDC_WD1003FZEX-00MK2A0_WD-WCC3F7RZZCL7  ONLINE       0     0     0
        ata-WDC_WD10EZEX-00RKKA0_WD-WCC1S0347255    ONLINE       0     0     0

@DeHackEd
Contributor

DeHackEd commented Jun 9, 2016

Sorry, it doesn't matter what the user expects because that's not how it works. You can't mirror two disks but have them use different pool layout geometries. Furthermore ZFS is incapable of converting the geometry of an existing vdev due to the requirements of Block Pointer Rewrites.

@kingneutron
Author

Furthermore ZFS is incapable of converting the geometry of an existing vdev due to the requirements of Block Pointer Rewrites.

--Which has been "pending" for YEARS. Which is why I'm filing this bug report... I was trying to add a mirror, and failing that, it should at least be possible to replace the device with a higher ashift, since ZFS is copying the data over anyway.

--Getting bitten by "ashift" in 2016 is a big PITA when you've been strongly recommending ZFS on Linux to all your friends for the last 3 years. For anyone else who may run into this issue, I hope I've documented the process well enough to get past it, but the filesystem should be capable of doing what the user expects, given how it works in other areas.

--Ashift behavior should be uniform across commands. If the pool was created with ashift=12, all vdevs that are added to the pool should INHERIT this property unless overridden. Furthermore, ALL new pools should by now be created with ashift=12 as the default (with the exception of SSD-based disks) even if they are using 512 sector disks, to avoid this issue in the future.
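
--In other words, something like this is what I would expect to just work (hypothetical pool/disk names; this is the desired behavior, not what 0.6.5 actually does):

zpool create -o ashift=12 tank mirror diskA diskB
zpool add tank mirror diskC diskD   # new vdev should inherit ashift=12 from the pool
zpool attach tank diskA diskE       # ditto, unless overridden with -o ashift=N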

--We expect the filesystem to automagically do what it takes to make our lives easier; that is one of the main draws of ZFS. Waiting for this bug report to get an Assignee, thanks.

behlendorf pushed a commit that referenced this issue May 3, 2017
This commit allows higher ashift values (up to 16) in 'zpool create'.

The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability to safely roll back a pool to recover it.

Since b02fe35 the largest uberblock size we support is 8K: this
allows us to store a minimum of 16 uberblocks in the vdev
label, even with higher ashift values.

Additionally, change the 'ashift' pool property behaviour: if set, it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.

Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2024 
Closes #4205 
Closes #4740 
Closes #5763
@DurvalMenezes

I'm getting exactly this error message:
Trying "zpool replace":

cannot replace REDACTED_OLD_DEV with REDACTED_NEW_DEV: new device has a
different optimal sector size; use the option '-o ashift=N' to override the
optimal size

Ditto, "zpool attach" (after detaching the device that was going to be replaced):

cannot attach REDACTED_NEW_DEV to REDACTED_BASE_DEV: new device has a
different optimal sector size; use the option '-o ashift=N' to override the
optimal size

In both cases, re-running the commands with the suggested "-o ashift=12" accomplishes nothing but getting the same messages all over again.

To add insult to injury, the aforementioned pool has ashift=12:

zpool get ashift REDACTED_POOL_NAME
NAME                                   PROPERTY  VALUE   SOURCE
REDACTED_POOL_NAME  ashift    12      local

REDACTED_NEW_DEV is listed by fdisk as having 512-byte physical sectors (as it should, being a LUKS device):

fdisk -l /dev/mapper/ZFS_ARCHIVE_003B2_TRY3 
[...]
Sector size (logical/physical): 512 bytes / 512 bytes

If the error is due to the device being 512 bytes, shouldn't "-o ashift=12" override it?

Please see more details here: http://list.zfsonlinux.org/pipermail/zfs-discuss/2017-November/029831.html

This is with ZFS/SPL 0.7.2 running on top of kernel 4.9.30 on amd64.

Wasn't this supposed to be resolved already? Or didn't the fix make it into 0.7.2?

I really need this working, I have a critical pool here without redundancy because of it... thanks in advance for any help in fixing and/or working around this.

@rlaager
Member

rlaager commented Nov 9, 2017

I can't help much, but... What does sudo zdb POOL show for ashift values?
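
Something along these lines (POOL = your pool name) should pick the per-vdev ashift values out of the config dump:

sudo zdb POOL | grep ashift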
