
"zpool remove" on a log device being silently ignored #6677

Closed
DurvalMenezes opened this issue Sep 23, 2017 · 20 comments
Labels
Status: Stale No recent activity for issue


@DurvalMenezes

Hello,

As my original issue (#4067) was closed by a repo collaborator, I can't reopen it; so I'm opening this new one.

Just to be clear, this issue is pretty much alive; the commands and output below are from Springdale EL6 (RHEL 6 clone) running kernel 4.1.12 (based off Oracle's UEK package) with ZoL 0.7.1-1:

[root@REDACTED ~]# zpool status
  pool: pool01 
 state: ONLINE 
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: scrub repaired 0 in 27h7m with 0 errors on Sat Sep 23 14:06:22 2017
config:

        NAME                                               STATE     READ WRITE CKSUM
        pool01                                             ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-HGST_HTS541010A9E680_REDACTED-part2  ONLINE       0     0     0
            ata-HGST_HTS721010A9E630_REDACTED-part2  ONLINE       0     0     0
        logs
          ata-M4-CT256M4SSD3_REDACTED-part6    ONLINE       0     0     0

errors: No known data errors

[root@REDACTED ~]# zpool remove pool01 ata-M4-CT256M4SSD3_REDACTED-part6

[root@REDACTED ~]# zpool status
  pool: pool01
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: scrub repaired 0 in 27h7m with 0 errors on Sat Sep 23 14:06:22 2017
config:

        NAME                                               STATE     READ WRITE CKSUM
        pool01                                             ONLINE       0     0     0
          mirror-0                                         ONLINE       0     0     0
            ata-HGST_HTS541010A9E680_REDACTED-part2  ONLINE       0     0     0
            ata-HGST_HTS721010A9E630_REDACTED-part2  ONLINE       0     0     0
        logs   
          ata-M4-CT256M4SSD3_REDACTED-part6    ONLINE       0     0     0

errors: No known data errors

Please let me know if you need anything else, and I would appreciate being contacted before closing this again as "stale" with no further request to (or input from) me.

Thanks in advance,
-- Durval.

@kernelOfTruth
Contributor

referencing: #1422 zpool remove on mirrored logs fails silently

#1530 Removing cache device fails

For reference (FreeNAS):
https://mikebeach.org/2013/02/28/adding-and-removing-zfs-zpool-zil-disk-live-by-gptid/

@loli10K
Contributor

loli10K commented Sep 24, 2017

As my original issue (#4067) was closed by a repo collaborator, I can't reopen it; so I'm opening this new one.
Please let me know if you need anything else, and I would appreciate being contacted before closing this again as "stale" with no further request to (or input from) me.

To be fair to the "repo collaborator" (me) who closed your issue: the first comment stating the issue is "stale" is dated Nov 8, 2016. The issue was closed on Jul 9, 2017, almost one year later, which is a lot of time to provide more information or even simply to add a comment stating you're still experiencing the issue. The fact that your other account shows no activity in 2 years doesn't help either.

As a side note, you can also comment on closed issues and ask for them to be re-opened.

@DurvalMenezes
Author

Thanks for your clarifications @loli10K. I didn't receive any warning that the issue was marked "stale", nor was I aware that such a marking meant I was required to provide more information, lest it be closed. As for commenting on the previous issue and asking for it to be reopened, I would rather open a new one, as the reason for closing the previous one wasn't clear to me.

Anyway, as you can see the problem is still happening, and not only with me: see for example here: http://list.zfsonlinux.org/pipermail/zfs-discuss/2017-September/029413.html

In case you require any more information, please say so and I will oblige.

@rgbtxus

rgbtxus commented Sep 24, 2017

I too have the problem of a ZFS log vdev that I cannot remove for the life of me. I initially tried to remove it over a year ago, at which time it resided on a mirror of 2 SSD partitions. Removal of the mirror returned without error, but the log was still there. While recovering from a system disk failure a few days ago, I was afraid I had lost the pool. I moved it to another system (minus the log, which was on the system disk and which I thought had been lost) and was startled to have zpool import find all the data vdevs but report that the pool was unrecoverable because other disks were known to be needed but could not be identified. After some googling and hand-holding from folks on the internet, including Durval, who started this thread about a similar problem he faces, I tried import -m, and thankfully the pool came right up.

So, I once again tried to remove the now-single log device (last year I had removed one of the pair in hopes that I could then remove the other). I tried the following, all of which appeared to work but failed to remove the device:

  • remove by path with ZIL online
  • remove by GUID with ZIL online
  • remove by path with ZIL offline
  • remove by GUID with ZIL offline

I also exported and reimported the pool a few times while trying these. Nothing worked.

I looked at the ZIL device info with zdb:
children[2]:
type: 'disk'
id: 2
guid: 17637578473400123453
path: '/dev/disk/by-id/ata-OCZ-VERTEX2_OCZ-3W9932VN8R9818N7-part5'
whole_disk: 0
metaslab_array: 4209
metaslab_shift: 25
ashift: 9
asize: 4292345856
is_log: 1
removing: 1
create_txg: 270959

So, the system thinks it is being removed.

Based on various old posts I've seen scattered around the net, it appears that the removal is not occurring because somewhere space is still allocated for the ZIL, even though there are no active ZIL entries. I ran across someone who recompiled ZFS to remove the checks that caused this to block the deletion and was able to remove his "stuck" log. It has been too many years since I did such things, so I'm not sure I want to give that a try (and the post was about 2 years old).
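For anyone else triaging this, the stuck state can be checked mechanically. Below is a minimal sketch (a hypothetical helper, not part of any ZFS tooling; the field names match the zdb dump pasted above) that scans zdb-style output for log vdevs still flagged as removing:

```python
# Hypothetical helper: scan zdb-style vdev output for "stuck" log vdevs,
# i.e. entries with both is_log: 1 and removing: 1 set.
def stuck_log_vdevs(zdb_text):
    """Return the guid of every vdev flagged is_log: 1 and removing: 1."""
    stuck, current = [], {}

    def flush():
        if current.get("is_log") == "1" and current.get("removing") == "1":
            stuck.append(current.get("guid"))

    for line in zdb_text.splitlines():
        line = line.strip()
        if line.startswith("children["):   # a new vdev entry begins
            flush()
            current = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip().strip("'")
    flush()                                # flush the last entry seen
    return stuck

# Sample taken from the zdb output pasted above.
sample = """\
children[2]:
    type: 'disk'
    id: 2
    guid: 17637578473400123453
    path: '/dev/disk/by-id/ata-OCZ-VERTEX2_OCZ-3W9932VN8R9818N7-part5'
    whole_disk: 0
    ashift: 9
    is_log: 1
    removing: 1
    create_txg: 270959
"""
print(stuck_log_vdevs(sample))   # -> ['17637578473400123453']
```

If this reports a guid, the pool agrees with zdb that the removal started but never completed.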

I assure the ZoL developers that this problem is in no way closed, and I hope something like a -f flag can be implemented to allow this vdev to finally be removed without requiring users to cook up their own special-purpose ZoL version. My most recent attempt to remove the log was on a Kubuntu 16.04 system, the latest "Long Term Support" version of Kubuntu. It is about 18 months old, so it is possible it is not running the most recent ZoL, but Durval, who has the same issue, is running the latest and greatest. So it seems this problem, which dates back years, is still with us.

Lest I sound like a complainer: please accept my most hearty thanks for making ZoL available. I use it every day and truly appreciate this wonderful software. Sure, I am dying to have encryption supported and would love for this bug to be fixed, but I am thankful that I have ZoL to protect my data. In the years I have been using ZoL, I have never lost any data, or sleep over the prospect of doing so. Thank you.

Thank you,
Richard

@TGM

TGM commented Oct 27, 2017

Confirmed!

@DurvalMenezes
Author

Just following up: with ZFS 0.7.3-1 the problem continues exactly the same (i.e., "zpool remove" on the log device returns with status code zero and prints no messages, but the log device remains in the pool). I remain available to help fix this, as I haven't yet recreated the pool.

@DurvalMenezes
Author

Closed by mistake (slippery finger on the "Close and comment" while zooming/panning on my cell screen) and then reopened. Sorry about that.

@mrpippy

mrpippy commented Apr 16, 2018

Also having this problem on FreeBSD 11.1

@DurvalMenezes
Author

DurvalMenezes commented Aug 9, 2018

Also having this problem on FreeBSD 11.1

Are you sure about that, @mrpippy? In fact, I fixed this pool a few days ago by rebooting the machine from a pendrive with FreeNAS 11.1U4 (which is based on FreeBSD 11.1-STABLE), importing the pool, and removing the device using the exact same zpool remove command that was being silently ignored by ZoL. I then exported the pool, rebooted the machine into Linux+ZoL, imported it, and everything is fine and dandy...

This IMHO seems to indicate that this issue does not happen in FreeBSD 11.1 and is rather restricted to ZoL.

@mrpippy

mrpippy commented Aug 9, 2018

My pool started on Solaris and then moved to FreeBSD 11.1 and 11.2; I've never used it with ZoL. I tried removing the log device months ago; it still shows up normally in zpool status, and in zdb as removing: 1.

I haven't exported/imported the pool since, that would be worth a try.

zdb from FreeBSD 11.2:

tank:
    version: 28
    name: 'tank'
    state: 0
    txg: 5976912
    pool_guid: 5221636966091643879
    hostid: 2200616315
    hostname: ''
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 5221636966091643879
        children[0]:
            type: 'raidz'
            id: 0
            guid: 6908235789649764826
            nparity: 2
            metaslab_array: 30
            metaslab_shift: 37
            ashift: 12
            asize: 23914377904128
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 4116
            children[0]:
                type: 'disk'
                id: 0
                guid: 1913305025832965192
                path: '/dev/da6p1'
                phys_path: '/scsi_vhci/disk@g50014ee2b7453288:a'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 4117
            children[1]:
                type: 'disk'
                id: 1
                guid: 7203592197606495526
                path: '/dev/da8p1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 4118
            children[2]:
                type: 'disk'
                id: 2
                guid: 9387228659083144016
                path: '/dev/da5p1'
                phys_path: '/scsi_vhci/disk@g5000cca250d67a21:a'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 4119
            children[3]:
                type: 'disk'
                id: 3
                guid: 48004287643406211
                path: '/dev/da4p1'
                phys_path: '/scsi_vhci/disk@g5000cca24ce9f912:a'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 4120
            children[4]:
                type: 'disk'
                id: 4
                guid: 10735790578038302473
                path: '/dev/da3p1'
                phys_path: '/scsi_vhci/disk@g50014ee2b744b076:a'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 4121
            children[5]:
                type: 'disk'
                id: 5
                guid: 14764836432167308654
                path: '/dev/da2p1'
                phys_path: '/scsi_vhci/disk@g5000c50087f48185:a'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 4172
        children[1]:
            type: 'disk'
            id: 1
            guid: 17298343182700501993
            path: '/dev/gpt/zfs'
            phys_path: '/scsi_vhci/disk@g5001517bb284c110:a'
            whole_disk: 1
            metaslab_array: 34
            metaslab_shift: 27
            ashift: 9
            asize: 19998441472
            is_log: 1
            removing: 1
            create_txg: 26
            com.delphix:vdev_zap_top: 4174
    features_for_read:

@Metatron22

Metatron22 commented Nov 11, 2018

Sadly, the error also still exists on Linux. For some reason both SSDs in the log mirror died. Removing the mirror does not work even after offlining the disks; it silently fails. Any idea?

  pool: zroot
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub canceled on Sun Nov 11 12:05:25 2018
config:

    NAME                                                     STATE     READ WRITE CKSUM
    zroot                                                    DEGRADED     0     0     0
      raidz2-0                                               ONLINE       0     0     0
        ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0388392-part3       ONLINE       0     0     0
        ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0433247-part3       ONLINE       0     0     0
        ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0382075-part3       ONLINE       0     0     0
        sda3                                                 ONLINE       0     0     0
        ata-WDC_WD30EZRX-00D8PB0_WD-WMC4N0D58APF-part3       ONLINE       0     0     0
        ata-WDC_WD30EZRX-00D8PB0_WD-WCC4NJ1EU345-part3       ONLINE       0     0     0
        ata-WDC_WD30EZRX-00SPEB0_WD-WCC4E4D3RJJ3-part3       ONLINE       0     0     0
        ata-WDC_WD30EZRX-00D8PB0_WD-WMC4N1750470-part3       ONLINE       0     0     0
    logs
      mirror-1                                               UNAVAIL      0     0     0  insufficient replicas
        ata-Samsung_SSD_850_PRO_128GB_S1SMNSAF808913N-part1  OFFLINE      0     0     0
        ata-Samsung_SSD_850_PRO_128GB_S1SMNSAF808912Y-part1  OFFLINE      0     0     0
    spares
      ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N2JJ746L-part3         AVAIL
zroot:
    version: 5000
    name: 'zroot'
    state: 0
    txg: 133760287
    pool_guid: 227147110757182278
    errata: 0
    hostname: 'mediaserver'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 227147110757182278
        children[0]:
            type: 'raidz'
            id: 0
            guid: 3738002590112127967
            nparity: 2
            metaslab_array: 34
            metaslab_shift: 37
            ashift: 12
            asize: 23985922244608
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 24578
            children[0]:
                type: 'disk'
                id: 0
                guid: 13881899134307772252
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0388392-part3'
                whole_disk: 0
                DTL: 16388
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24582
            children[1]:
                type: 'disk'
                id: 1
                guid: 5702167993598056862
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0433247-part3'
                whole_disk: 0
                DTL: 16387
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24583
            children[2]:
                type: 'disk'
                id: 2
                guid: 17703299667594327735
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0382075-part3'
                whole_disk: 0
                DTL: 16386
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24585
            children[3]:
                type: 'disk'
                id: 3
                guid: 12389755360687232836
                path: '/dev/sda3'
                whole_disk: 0
                DTL: 152660
                create_txg: 4
                com.delphix:vdev_zap_leaf: 152659
            children[4]:
                type: 'disk'
                id: 4
                guid: 4432863555462586293
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00D8PB0_WD-WMC4N0D58APF-part3'
                whole_disk: 0
                DTL: 16384
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24589
            children[5]:
                type: 'disk'
                id: 5
                guid: 2712563888628207324
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00D8PB0_WD-WCC4NJ1EU345-part3'
                whole_disk: 0
                DTL: 16383
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24591
            children[6]:
                type: 'disk'
                id: 6
                guid: 5888635034105858335
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00SPEB0_WD-WCC4E4D3RJJ3-part3'
                whole_disk: 0
                DTL: 16382
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24594
            children[7]:
                type: 'disk'
                id: 7
                guid: 13772617811860603815
                path: '/dev/disk/by-id/ata-WDC_WD30EZRX-00D8PB0_WD-WMC4N1750470-part3'
                whole_disk: 0
                DTL: 16381
                create_txg: 4
                com.delphix:vdev_zap_leaf: 24595
        children[1]:
            type: 'mirror'
            id: 1
            guid: 5877809509795270616
            whole_disk: 0
            metaslab_array: 395
            metaslab_shift: 28
            ashift: 9
            asize: 34355019776
            is_log: 1
            removing: 1
            create_txg: 1412120
            com.delphix:vdev_zap_top: 24608
            children[0]:
                type: 'disk'
                id: 0
                guid: 15777048906001023810
                path: '/dev/disk/by-id/ata-Samsung_SSD_850_PRO_128GB_S1SMNSAF808913N-part1'
                whole_disk: 0
                create_txg: 1412120
                offline: 1
            children[1]:
                type: 'disk'
                id: 1
                guid: 10407876152326858717
                path: '/dev/disk/by-id/ata-Samsung_SSD_850_PRO_128GB_S1SMNSAF808912Y-part1'
                whole_disk: 0
                create_txg: 1412120
                offline: 1
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

@Metatron22

Forgot to mention: I already tried the patch in the old thread. It does not seem to work.

@richardelling
Contributor

This is a bug tracker, not a support forum. Please use the mailing list.

@Metatron22

Metatron22 commented Nov 12, 2018

Sorry for the confusion, but zpool remove exiting with status 0 despite not working seems like a bug to me, especially since this has been happening for quite some time. Please let me know what more input is needed to solve the bug.

If there is no intention of fixing it, the bug issue should probably be closed.

@richardelling
Contributor

You haven't shared the command that failed, so it is not clear that there is a bug in your case. Also, the return code for zpool and zfs commands is not an indicator of success or failure.
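Since the exit code cannot be trusted, the practical check is to parse zpool status again after the remove and see whether anything is still listed under logs. A small sketch (a hypothetical helper with made-up sample text shaped like the output in this thread, not real pool output):

```python
# Hypothetical helper: list the device names indented beneath a given
# heading (e.g. "logs" or "spares") in `zpool status` output.
def devices_under(status_text, heading):
    devices, heading_indent = [], None
    for line in status_text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip(" "))
        if stripped == heading:
            heading_indent = indent
            continue
        if heading_indent is not None:
            # The section ends when indentation falls back to heading level.
            if not stripped or indent <= heading_indent:
                break
            devices.append(stripped.split()[0])  # first column is the name
    return devices

# Made-up sample shaped like the status output in this thread.
sample = """\
        NAME        STATE
        zroot       DEGRADED
          raidz2-0  ONLINE
        logs
          mirror-1  UNAVAIL
            sdb1    OFFLINE
            sdc1    OFFLINE
        spares
          sdd1      AVAIL
"""
# If the log vdev still appears after `zpool remove`, the removal failed.
print(devices_under(sample, "logs"))   # -> ['mirror-1', 'sdb1', 'sdc1']
```

An empty list under "logs" after the remove would be the actual sign of success.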

@Metatron22

Oh, I'm very sorry. Please find the command and the status afterwards below.

As the remove fails silently, I believe it is the same very old bug.

root:~/ # zpool remove zroot mirror-1; echo $?
0
root:~/ # zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub canceled on Sun Nov 11 12:05:25 2018
config:

        NAME                                                     STATE     READ WRITE CKSUM
        zroot                                                    DEGRADED     0     0     0
          raidz2-0                                               ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0388392-part3       ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0433247-part3       ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-WMAWZ0382075-part3       ONLINE       0     0     0
            sda3                                                 ONLINE       0     0     0
            ata-WDC_WD30EZRX-00D8PB0_WD-WMC4N0D58APF-part3       ONLINE       0     0     0
            ata-WDC_WD30EZRX-00D8PB0_WD-WCC4NJ1EU345-part3       ONLINE       0     0     0
            ata-WDC_WD30EZRX-00SPEB0_WD-WCC4E4D3RJJ3-part3       ONLINE       0     0     0
            ata-WDC_WD30EZRX-00D8PB0_WD-WMC4N1750470-part3       ONLINE       0     0     0
        logs
          mirror-1                                               UNAVAIL      0     0     0  insufficient replicas
            ata-Samsung_SSD_850_PRO_128GB_S1SMNSAF808913N-part1  OFFLINE      0     0     0
            ata-Samsung_SSD_850_PRO_128GB_S1SMNSAF808912Y-part1  OFFLINE      0     0     0
        spares
          ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N2JJ746L-part3         AVAIL

errors: No known data errors

@Metatron22

OK, I reworked the patch from the old issue to ignore the device not being offline. It was removed this way.

@slima

slima commented Nov 22, 2018

OK, I reworked the patch from the old issue to ignore the device not being offline. It was removed this way.

Can you share?

@DurvalMenezes
Author

DurvalMenezes commented May 12, 2019

@Metatron22:

OK, I reworked the patch from the old issue to ignore the device not being offline. It was removed this way.

@slima

Can you share?

Or better yet, make a PR?

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2020
@stale stale bot closed this as completed Nov 25, 2020