Kernel logged pool as suspended, but zpool status says otherwise #2424

Closed
ZNikke opened this issue Jun 25, 2014 · 3 comments

ZNikke commented Jun 25, 2014

We have a test machine running Ubuntu 14.04 LTS, using the zfs-native-stable PPA (0.6.3-2~trusty). It's used to test ZFS features, error handling, stability, etc., and the flaky old 750 GB Seagate drives I found in a corner are perfectly suited for this ;)

This issue is related to the other disk-goes-bad-lockup issues, but I think this is a different bug.

I have a raidz1 test pool, so naturally things go bad when I hit double failures (these drives can be relied upon to be unreliable).

However, this time the pool was resilvering when I hit what I suspect was a double failure. The kernel log says:

[138671.416334] SPLError: 1850:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'test750G' has encountered an uncorrectable I/O failure and has been suspended.

Which is perfectly okay; I expect that to happen in a single-parity raidz when I lose too much redundancy.

However, zpool status says:

   pool: test750G
  state: DEGRADED
 status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
 action: Wait for the resilver to complete.
   scan: resilver in progress since Wed Jun 25 09:04:26 2014
     1.57T scanned out of 5.35T at 65.5M/s, 16h49m to go
     199G resilvered, 29.32% done
 config:

        NAME             STATE     READ WRITE CKSUM
        test750G         DEGRADED    57     0     0
          raidz1-0       DEGRADED   233     0     0
            slot9        ONLINE       0     0     0  (resilvering)
            slot10       ONLINE      24     3     0
            slot11       ONLINE       0     0     0  (resilvering)
            slot12       ONLINE       0     0     0  (resilvering)
            slot13       ONLINE       0     0     0  (resilvering)
            slot14       ONLINE       0     0     0
            slot15       ONLINE       0     0     0  (resilvering)
            replacing-7  FAULTED      0     0     0
              old        FAULTED     48   398     0  too many errors
              slot16     ONLINE       0     0     0  (resilvering)

errors: 31 data errors, use '-v' for a list

If I remember correctly, DEGRADED is not a suspended state.
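
For reference, the checks I'd use to tell the two states apart (a minimal sketch; these are standard zpool commands, but I haven't verified exactly how 0.6.3 reports the suspended state through them):

    # Summarize only unhealthy pools; a suspended pool should show up here
    zpool status -x test750G

    # failmode=wait (the default) means I/O blocks while the pool is suspended
    zpool get failmode test750G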

I don't really know if this is a side effect of something deadlocking somewhere. There are a lot of "task blocked for more than 120 seconds" messages logged after the pool-suspended message, but they might just as well be a side effect of the pool being suspended:

[138840.640077] INFO: task txg_sync:1930 blocked for more than 120 seconds.
[138840.647350]       Tainted: PF          O 3.13.0-29-generic #53-Ubuntu
[138840.654577] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[138840.661879] txg_sync        D ffff88022fd14440     0  1930      2 0x00000000
[138840.661887]  ffff8800c5505908 0000000000000002 ffff8800c6aa8000 ffff8800c5505fd8
[138840.661892]  0000000000014440 0000000000014440 ffff8800c6aa8000 ffff88022fd14cd8
[138840.661895]  ffffc9001e7057f0 ffffc9001e705820 0000000000000001 0000000000000002
[138840.661899] Call Trace:
[138840.661917]  [<ffffffff8171e95d>] io_schedule+0x9d/0x140
[138840.661962]  [<ffffffffa013641d>] cv_wait_common+0x9d/0x1a0 [spl]
[138840.661971]  [<ffffffff810aaea0>] ? prepare_to_wait_event+0x100/0x100
[138840.661979]  [<ffffffffa0136578>] __cv_wait_io+0x18/0x20 [spl]
[138840.662079]  [<ffffffffa025d9d3>] zio_wait+0x103/0x1c0 [zfs]
[138840.662116]  [<ffffffffa01c6ee5>] dbuf_read+0x315/0x890 [zfs]
[138840.662146]  [<ffffffffa01dea1f>] dnode_hold_impl+0x18f/0x5c0 [zfs]
[138840.662171]  [<ffffffffa01dee69>] dnode_hold+0x19/0x20 [zfs]
[138840.662193]  [<ffffffffa01cd66b>] dmu_free_range+0x2b/0x70 [zfs]
[138840.662224]  [<ffffffffa02154b0>] ? space_map_add+0x390/0x390 [zfs]
[138840.662254]  [<ffffffffa0216214>] space_map_truncate+0x24/0x70 [zfs]
[138840.662281]  [<ffffffffa01fb4a1>] metaslab_condense+0x171/0x300 [zfs]
[138840.662312]  [<ffffffffa025d9fd>] ? zio_wait+0x12d/0x1c0 [zfs]
[138840.662318]  [<ffffffff811a1639>] ? __kmalloc+0x1e9/0x230
[138840.662326]  [<ffffffffa012d2a6>] ? kmem_alloc_debug+0x96/0x3c0 [spl]
[138840.662357]  [<ffffffffa0214d20>] ? spa_stats_destroy+0x380/0x380 [zfs]
[138840.662386]  [<ffffffffa01fc2d4>] metaslab_sync+0x274/0x350 [zfs]
[138840.662390]  [<ffffffff81720a52>] ? mutex_lock+0x12/0x2f
[138840.662422]  [<ffffffffa021b213>] vdev_sync+0x73/0x140 [zfs]
[138840.662454]  [<ffffffffa0206fd3>] spa_sync+0x493/0xb00 [zfs]
[138840.662459]  [<ffffffff810cd5f8>] ? ktime_get_ts+0x48/0xe0
[138840.662489]  [<ffffffffa021783e>] txg_sync_thread+0x37e/0x5c0 [zfs]
[138840.662519]  [<ffffffffa02174c0>] ? txg_quiesce_thread+0x340/0x340 [zfs]
[138840.662527]  [<ffffffffa012ea8a>] thread_generic_wrapper+0x7a/0x90 [spl]
[138840.662535]  [<ffffffffa012ea10>] ? __thread_exit+0xa0/0xa0 [spl]
[138840.662539]  [<ffffffff8108b322>] kthread+0xd2/0xf0
[138840.662552]  [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0
[138840.662567]  [<ffffffff8172ab3c>] ret_from_fork+0x7c/0xb0
[138840.662586]  [<ffffffff8108b250>] ? kthread_create_on_node+0x1d0/0x1d0
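
(For completeness: the watchdog that produces these messages can be inspected or silenced exactly as the log suggests; nothing ZFS-specific here.)

    # The kernel hung-task watchdog that generates the traces above
    cat /proc/sys/kernel/hung_task_timeout_secs       # 120 on this kernel
    # Writing 0 disables the warnings, as the log message itself says
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs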

If this is a duplicate of a known issue, I'm sorry for the noise.

However, if this is a new issue, let me know what additional information I can provide to help you pinpoint it.

behlendorf (Contributor) commented

@ZNikke We have a similar issue open, but I don't mind keeping this one open as a duplicate for now. It would be helpful if you could post all the console stacks somewhere for analysis.
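
If it helps, one way to grab them all at once (assuming SysRq is enabled on that box) is:

    # Dump stack traces for every task into the kernel ring buffer, then save it
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/all-task-stacks.txt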

@behlendorf behlendorf added this to the 0.6.4 milestone Jun 27, 2014
@behlendorf behlendorf added the Bug label Jun 27, 2014

ZNikke commented Jun 27, 2014

All stack traces are logged after the SPLError and are identical to the one I posted here; the only difference is the timestamp. I think it's the kernel's hung-task detector triggering on I/O that has been blocked for more than 120 seconds and then dumping a stack trace as part of the error message.

Everything logged before the SPLError is SCSI noise about the HDDs behaving badly.

It looks like the pool got suspended, but something (the resilver, perhaps?) still wants to sync state.
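
If it's useful I can also pull the ZFS event log from around the time of the suspension (a sketch; I'm assuming zpool events is available on 0.6.3):

    # Show the ZFS event history, including I/O failure / suspend events
    zpool events -v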

behlendorf (Contributor) commented

This appears to be a mix of #2523, which has been fixed, and legitimate hardware problems.
