
nexus: Assertion failure when unpublishing a faulted nexus #584

Closed

Conversation

jonathan-teh
Contributor

The assertion occurs because the nvmf subsystem transitions from a
paused to an inactive state, which is forbidden. Resume the nexus
first, so that its subsystem is in an active state, before unpublishing
it; this also covers destroying a nexus. A nexus in a faulted state
already has no usable replicas, so it is itself already unusable.

Fixes #549, CAS-549.
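
To make the forbidden transition concrete, here is a minimal, self-contained model of the ordering the fix enforces; the enum and function names are illustrative only and are not the mayastor or SPDK API:

```rust
// Illustrative model only -- not the mayastor or SPDK API. It shows why
// stopping (unpublishing) a paused subsystem asserts, and why resuming
// first makes the same unpublish legal.

#[derive(Debug, Clone, Copy, PartialEq)]
enum SubsystemState {
    Active,
    Paused,   // left behind when a faulted nexus pauses I/O to retire a child
    Inactive, // the target state when the nexus is unpublished
}

// Stopping is only legal from Active; Paused -> Inactive is the forbidden
// transition that trips the assertion.
fn stop(state: SubsystemState) -> Result<SubsystemState, String> {
    match state {
        SubsystemState::Active => Ok(SubsystemState::Inactive),
        other => Err(format!("cannot stop subsystem from {:?}", other)),
    }
}

// The ordering this PR enforces, in miniature: resume (Paused -> Active)
// before stopping.
fn resume_then_stop(state: SubsystemState) -> Result<SubsystemState, String> {
    let state = match state {
        SubsystemState::Paused => SubsystemState::Active,
        s => s,
    };
    stop(state)
}

fn main() {
    // A faulted nexus leaves the subsystem paused; a direct stop would assert.
    assert!(stop(SubsystemState::Paused).is_err());
    // Resuming first makes the unpublish legal.
    assert_eq!(
        resume_then_stop(SubsystemState::Paused),
        Ok(SubsystemState::Inactive)
    );
}
```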

@jonathan-teh self-assigned this Dec 16, 2020
@jonathan-teh
Contributor Author

bors try

bors bot pushed a commit that referenced this pull request Dec 16, 2020
@bors
Contributor

bors bot commented Dec 16, 2020

try

Build failed:

@jonathan-teh
Contributor Author

bors retry

bors bot pushed a commit that referenced this pull request Dec 16, 2020
@bors
Contributor

bors bot commented Dec 16, 2020

try

Build succeeded:

@paulyoong left a comment


lgtm

mayastor/tests/replica_timeout.rs (review thread, outdated, resolved)
The assertion occurs because the nvmf subsystem transitions from a
paused to an inactive state, which is forbidden. Resume the nexus
first, so that its subsystem is in an active state, before unpublishing
it; this also covers destroying a nexus. A nexus in a faulted state
already has no usable replicas, so it is itself already unusable.

Repurpose the existing cargo test for a replica that is stopped, then
continued, to test the above case with a single remote replica.
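
A rough sketch of that test flow, assuming the replica runs in a container that can be frozen with docker pause; the container name, the delay, and the unpublish_nexus helper are placeholders, not the actual code in mayastor/tests/replica_timeout.rs:

```rust
// Sketch only: freeze the container serving the single remote replica, wait
// long enough for the nexus child to fault on I/O timeout, unpublish the
// still-faulted nexus, then thaw the container. The container name, delay
// and unpublish_nexus() are placeholders, not the real test code.

use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn docker(args: &[&str]) {
    let status = Command::new("docker")
        .args(args)
        .status()
        .expect("failed to run docker");
    assert!(status.success(), "docker {:?} failed", args);
}

// Placeholder for the gRPC/unshare call the real test makes against mayastor.
fn unpublish_nexus(_name: &str) {}

fn main() {
    let replica_container = "ms1"; // hypothetical container running the remote replica

    // Stop the replica so outstanding I/O times out and the only child is
    // retired, faulting the nexus.
    docker(&["pause", replica_container]);
    sleep(Duration::from_secs(20));

    // Unpublishing here used to hit the paused -> inactive assertion;
    // with this change it should succeed.
    unpublish_nexus("nexus");

    docker(&["unpause", replica_container]);
}
```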
@gila (Contributor) left a comment


Although code-wise it seems to be correct, I wonder if we should fix it this way. The paused state is not something that is allowed AFAIK, so what if we simply always resume, with the note that when all replicas are gone, we simply have a controller without a namespace.
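
A toy model of what "always resume" would look like in the retire path, under the assumption above that an empty child list just leaves a controller without a namespace; the types and names are illustrative, not mayastor code:

```rust
// Toy model of the "always resume" alternative -- illustrative types only,
// not mayastor code. Retiring a child pauses the subsystem, drops the child,
// then always resumes, even if no children (namespaces) remain.

#[derive(Debug, PartialEq)]
enum SubsysState {
    Active,
    Paused,
}

struct Nexus {
    children: Vec<String>,
    subsys: SubsysState,
}

impl Nexus {
    fn retire_child(&mut self, child: &str) {
        self.subsys = SubsysState::Paused;
        self.children.retain(|c| c.as_str() != child);
        // Always resume: with zero children this is a controller without a
        // namespace, but the subsystem never remains in the paused state.
        self.subsys = SubsysState::Active;
    }
}

fn main() {
    let mut nexus = Nexus {
        children: vec!["nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0".to_string()],
        subsys: SubsysState::Active,
    };
    nexus.retire_child("nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0");
    assert!(nexus.children.is_empty());
    assert_eq!(nexus.subsys, SubsysState::Active);
}
```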

@jonathan-teh
Contributor Author

I tried changing Bio::child_retire to always resume the nexus instead (and dropping the change to unshare_nexus here). That passes this test as long as the delay before thawing the container is sufficiently long (I've stripped the repeated 'Failed to construct the tqpair' errors):

unfreeze delay... 1/20
unfreeze delay... 2/20
unfreeze delay... 3/20
unfreeze delay... 4/20
unfreeze delay... 5/20
[2020-12-17T13:07:52.193276966+00:00  WARN mayastor::spdk:bdev_nvme.c:1232] Warning: Detected a timeout. ctrlr=0x7f29d06f0a60 qpair=(nil) cid=2   
[2020-12-17T13:07:52.193562812+00:00 ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 10.1.0.2:8420/nqn.2019-05.io.openebs:disk0n1, driver: nvme, product: NVMe disk, num_blocks: 204800, block_len: 512
[2020-12-17T13:07:52.193640369+00:00  WARN mayastor::bdev::nexus::nexus_io:nexus_io.rs:329] core 0 thread Some(Mthread(0x7f29d0008a30)), faulting child nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0: Faulted(IoError), blk_cnt: 204800, blk_size: 512

[2020-12-17T13:07:52.193740992+00:00 DEBUG mayastor::spdk:nvme_ctrlr.c:1347] resetting controller   
unfreeze delay... 6/20
unfreeze delay... 7/20
[2020-12-17T13:07:54.192870939+00:00 ERROR mayastor::spdk:nvme_tcp.c:1674] Failed to construct the tqpair=0x7f29d06f23b0 via correct icresp   
[2020-12-17T13:07:54.192999741+00:00 ERROR mayastor::spdk:nvme_tcp.c:1764] Unable to connect the tqpair   
[2020-12-17T13:07:54.193159278+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:1363] Controller reinitialization failed.   
[2020-12-17T13:07:54.193250799+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:874] ctrlr 10.1.0.2 in failed state.   
[2020-12-17T13:07:54.193331868+00:00 ERROR mayastor::spdk:bdev_nvme.c:394] Resetting controller failed.   
[2020-12-17T13:07:54.193420305+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:3406] Submitting Keep Alive failed   
[2020-12-17T13:07:54.193604209+00:00 DEBUG mayastor::spdk:nvme_ctrlr.c:1347] resetting controller   
unfreeze delay... 8/20
unfreeze delay... 9/20
...
[2020-12-17T13:07:56.193371976+00:00  INFO mayastor::bdev::nexus::nexus_bdev:nexus_bdev.rs:454] nexus: Dynamic reconfiguration event: ChildFault started
[2020-12-17T13:07:56.193467288+00:00 DEBUG mayastor::spdk:nvme_ctrlr.c:1347] resetting controller   
unfreeze delay... 10/20
unfreeze delay... 11/20
...
[2020-12-17T13:07:58.193331800+00:00  INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:102] nexus(thread:"mayastor_nvmf_tcp_pg_core_0"), refreshing IO channels
[2020-12-17T13:07:58.193436469+00:00  INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:248] nexus: Reconfigure completed
[2020-12-17T13:07:58.193519186+00:00 DEBUG mayastor::spdk:nvme_ctrlr.c:1347] resetting controller   
unfreeze delay... 12/20
unfreeze delay... 13/20
...
[2020-12-17T13:08:00.193181617+00:00  INFO mayastor::bdev::nexus::nexus_bdev:nexus_bdev.rs:468] nexus: Dynamic reconfiguration event: ChildFault completed 0
[2020-12-17T13:08:00.193304498+00:00  INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}:nexus_uri.rs:79] new
[2020-12-17T13:08:00.193385791+00:00  INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}:nexus_uri.rs:79] enter
[2020-12-17T13:08:00.193587104+00:00  INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}:nexus_uri.rs:79] exit
[2020-12-17T13:08:00.193657001+00:00  INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}:nexus_uri.rs:79] close time.busy=201µs time.idle=154µs
[2020-12-17T13:08:00.193774702+00:00 DEBUG mayastor::spdk:nvme_ctrlr.c:1347] resetting controller   
unfreeze delay... 14/20
unfreeze delay... 15/20
...
[2020-12-17T13:08:02.193519150+00:00  INFO mayastor::core::bdev:bdev.rs:168] Received remove event for bdev 10.1.0.2:8420/nqn.2019-05.io.openebs:disk0n1
[2020-12-17T13:08:02.193610744+00:00  INFO mayastor::bdev::nexus::nexus_child:nexus_child.rs:370] Removing child nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0
[2020-12-17T13:08:02.193724046+00:00  INFO mayastor::bdev::nexus::nexus_child:nexus_child.rs:406] Child nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0 removed
[2020-12-17T13:08:02.193923381+00:00 ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:343] :nexus: state: Mutex { data: Open } blk_cnt: 102400, blk_size: 512
        nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0: state Faulted(IoError)
 has no children left... 
[2020-12-17T13:08:02.194025479+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:902] ctrlr 10.1.0.2 get_cc() failed   
unfreeze delay... 16/20
unfreeze delay... 17/20
unfreeze delay... 18/20
unfreeze delay... 19/20
unfreeze delay... 20/20
container thawed
[2020-12-17T13:08:06.282136209+00:00 DEBUG mayastor::subsys::nvmf::subsystem:subsystem.rs:355] stopping NvmfSubsystem { id: 1, subtype: "NVMe", subnqn: "nqn.2019-05.io.openebs:nexus", sn: "33\' ~\'~._`o##o>", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 1, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "127.0.0.1", trsvcid: "8420" }]) }
[2020-12-17T13:08:06.282355451+00:00  INFO mayastor::subsys::nvmf::subsystem:subsystem.rs:371] stopped nqn.2019-05.io.openebs:nexus
[2020-12-17T13:08:06.282483071+00:00 DEBUG mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:204] nexus Destroying IO channels
...

If the delay is 10-12s, when faulting the child has started but not yet completed, you get the original assertion back, as the subsystem is in the paused state while child reconfiguration is taking place.
If the delay is 15s, you end up with this panic (and a misleading error message, because SPDK's nvmf_subsystem_state_change returned EBUSY while the subsystem was in the middle of changing state):

thread 'mayastor_master' panicked at 'should unpublish nexus: UnshareNexus { source: UnshareNvmf { source: Subsystem { source: EBUSY, nqn: "nqn.2019-05.io.openebs:nexus", msg: "out of memory" } }, name: "nexus" }', mayastor/tests/replica_timeout.rs:120:18

I'm inclined to close this and go back to the drawing board for a fix with wider applicability.

@jonathan-teh
Contributor Author

jonathan-teh commented Dec 17, 2020

I tried changing Bio::child_retire to always resume the nexus instead (and dropping the change to unshare_nexus here).

This makes things worse: with the kernel initiator connected, once the child and nexus are in a faulted state, even nvme disconnect locks up, whereas it used to work before (which also stops the kernel from retrying any outstanding IO). That leaves the client with no option but to reboot.
With the changes here, the test still passes with the delay at 12s or 15s.

@gila
Contributor

gila commented Dec 21, 2020

Then I guess the best approach here is to simply not retire the last child, but have the initiator timeout/disconnect logic handle it?

@jonathan-teh
Contributor Author

Then I guess the best approach here is to simply not retire the last child, but have the initiator timeout/disconnect logic handle it?

I guess so, I'll rework the PR.

@jonathan-teh
Contributor Author

FWIW, returning early in Nexus::nexus_io::child_retire if there is only 1 child (i.e. leaving the state of the child and nexus unchanged) leads to the nexus retrying the keep-alive every 2s until the child comes back:

[2020-12-29T12:22:39.821734070+00:00 ERROR mayastor::spdk:nvme_tcp.c:1674] Failed to construct the tqpair=0x561822958aa0 via correct icresp   
[2020-12-29T12:22:39.821844436+00:00 ERROR mayastor::spdk:nvme_tcp.c:1764] Unable to connect the tqpair   
[2020-12-29T12:22:39.821929562+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:1363] Controller reinitialization failed.   
[2020-12-29T12:22:39.821985435+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:874] ctrlr 127.0.0.1 in failed state.   
[2020-12-29T12:22:39.822035939+00:00 ERROR mayastor::spdk:bdev_nvme.c:394] Resetting controller failed.   
[2020-12-29T12:22:39.822093379+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:3406] Submitting Keep Alive failed   
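
For illustration, a toy sketch of that early-return variant; the types here are hypothetical and stand in for the real Bio::child_retire:

```rust
// Toy sketch of the early-return variant -- hypothetical types, not the real
// Bio::child_retire. If the faulted child is the last one, leave the child and
// the nexus untouched; the controller then keeps resetting and resubmitting
// keep-alives (the loop in the log above) until the replica is reachable again.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ChildState {
    Open,
    Faulted,
}

struct Child {
    uri: String,
    state: ChildState,
}

struct Nexus {
    children: Vec<Child>,
}

impl Nexus {
    fn child_retire(&mut self, uri: &str) {
        if self.children.len() <= 1 {
            // Last child: do not fault or remove it, so the nexus state is
            // left unchanged and retries continue every keep-alive interval.
            return;
        }
        if let Some(child) = self.children.iter_mut().find(|c| c.uri == uri) {
            child.state = ChildState::Faulted;
        }
    }
}

fn main() {
    let mut nexus = Nexus {
        children: vec![Child {
            uri: "nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:disk0".to_string(),
            state: ChildState::Open,
        }],
    };
    nexus.child_retire("nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:disk0");
    assert_eq!(nexus.children[0].state, ChildState::Open);
}
```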

@jonathan-teh deleted the cas-549 branch January 6, 2021 14:16
Successfully merging this pull request may close these issues.

Assertion failure when unpublishing or destroying a nexus in faulted state