Assertion failure when unpublishing or destroying a nexus in faulted state #549

Closed
jonathan-teh opened this issue Dec 1, 2020 · 2 comments · Fixed by #604
Labels: BUG (Something isn't working), DONE (Resolved)

Comments

@jonathan-teh
Contributor

Describe the bug
Mayastor fails an assertion when attempting to destroy a nexus that is in the faulted state:

mayastor: subsystem.c:465: nvmf_subsystem_set_state: Assertion `actual_old_state == expected_old_state' failed.

To Reproduce
Create a nexus with a single child that is a remote replica served by an nvmf target, for example:

$ mayastor-client nexus list -c
NAME                                 PATH                                                                                        SIZE STATE  REBUILDS CHILDREN                                                                         
5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391 nvmf://127.0.0.1:8430/nqn.2019-05.io.openebs:nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391 60000000 online        0 nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391

Publish the nexus over nvmf, connect to it with the kernel NVMe initiator and send I/O, e.g. with fio.
Make the nvmf target serving the replica inaccessible, in this case by sending SIGSTOP to that mayastor instance.

The mayastor instance serving the nexus detects a timeout:

[2020-12-01T17:54:24.942056536+00:00  WARN mayastor::spdk:bdev_nvme.c:1149] Warning: Detected a timeout. ctrlr=0x557ec85b6d70 qpair=(nil) cid=2 

and eventually notices that a reset is required:

[2020-12-01T17:56:29.874237303+00:00 ERROR mayastor::spdk:bdev_nvme.c:1153] Controller Fatal Status, reset required   
[2020-12-01T17:56:34.872024397+00:00  WARN mayastor::spdk:bdev_nvme.c:1149] Warning: Detected a timeout. ctrlr=0x557ec85b6d70 qpair=(nil) cid=27   

At this point, send SIGCONT to the mayastor instance serving the replica.
Back at the other mayastor instance, the (only) child is faulted:

[2020-12-01T17:57:50.724579130+00:00 ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391n1, driver: nvme, product: NVMe disk, num_blocks: 122880, block_len: 512
[2020-12-01T17:57:50.724733805+00:00  WARN mayastor::bdev::nexus::nexus_io:nexus_io.rs:329] core 0 thread Some(Mthread(0x557ec7c83090)), faulting child nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391: Faulted(IoError), blk_cnt: 122880, blk_size: 512

Network errors are logged:

[2020-12-01T17:57:50.825560183+00:00 ERROR mayastor::spdk:posix.c:171] getpeername() failed (errno=107)   
[2020-12-01T17:57:50.825668020+00:00 ERROR mayastor::spdk:tcp.c:958] spdk_sock_getaddr() failed of tqpair=0x557ec85d5fa0   

and the nexus is reconfigured to fault the child and remove it from the nexus:

[2020-12-01T17:57:50.844321020+00:00  INFO mayastor::bdev::nexus::nexus_bdev:nexus_bdev.rs:454] nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391: Dynamic reconfiguration event: ChildFault started
[2020-12-01T17:57:50.844403241+00:00  INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:102] nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391(thread:"mayastor_nvmf_tcp_pg_core_0"), refreshing IO channels
[2020-12-01T17:57:50.845618145+00:00  INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:244] nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391: Reconfigure completed
[2020-12-01T17:57:50.846884423+00:00  INFO mayastor::bdev::nexus::nexus_bdev:nexus_bdev.rs:468] nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391: Dynamic reconfiguration event: ChildFault completed 0

The child removal is complete:

[2020-12-01T17:57:50.847500441+00:00 ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:344] :nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391: state: Mutex { data: Open } blk_cnt: 112607, blk_size: 512
        nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391: Faulted(IoError), blk_cnt: 122880, blk_size: 512
 has no children left... 
[2020-12-01T17:57:50.847578102+00:00  INFO mayastor::core::bdev:bdev.rs:168] Received remove event for bdev 127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391n1
[2020-12-01T17:57:50.847634549+00:00  INFO mayastor::bdev::nexus::nexus_child:nexus_child.rs:367] Removing child nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391
[2020-12-01T17:57:50.847766549+00:00  INFO mayastor::bdev::nexus::nexus_child:nexus_child.rs:405] Child nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391 removed

However, the replica is still listed as a child in the nexus:

$ mayastor-client nexus list -c
NAME                                 PATH                                                                                        SIZE STATE   REBUILDS CHILDREN                                                                         
5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391 nvmf://127.0.0.1:8430/nqn.2019-05.io.openebs:nexus-5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391 60000000 faulted        0 nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:5b5b04ea-c1e3-11ea-bd82-a7d5cb04b391

and when you then try to destroy the nexus, mayastor fails an assertion:

mayastor: subsystem.c:465: nvmf_subsystem_set_state: Assertion `actual_old_state == expected_old_state' failed.

where the assertion is here.
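
The SIGSTOP and SIGCONT steps in the reproduction above can be scripted. The following is a minimal sketch, assuming a Linux host; the helper names (`pause`, `resume`, `process_state`, `wait_for_state`) are made up for illustration and are not part of mayastor or its tooling.

```python
import os
import signal
import time


def pause(pid: int) -> None:
    """Freeze a process, like the SIGSTOP step that makes the target unresponsive."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let the process run again, like the SIGCONT step."""
    os.kill(pid, signal.SIGCONT)


def process_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat:
        # The state field is the first token after the parenthesised command name.
        return stat.read().rsplit(")", 1)[1].split()[0]


def wait_for_state(pid: int, wanted: str, timeout: float = 2.0) -> bool:
    """Poll until the state is one of the letters in `wanted`, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) in wanted:
            return True
        time.sleep(0.01)
    return False
```

Given the PID of the mayastor instance serving the replica, `pause(pid)` corresponds to the step that makes the target inaccessible and `resume(pid)` to the step that brings it back. `wait_for_state(pid, "T")` can confirm the process is actually stopped before waiting for the initiator timeouts.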

Expected behavior
The nexus should be destroyed.

OS info

  • Distro: Ubuntu 20.04
  • Kernel: 5.4.0-56-generic
  • Mayastor: 0.6

Additional context
Developer debug build.

@jonathan-teh jonathan-teh added BUG Something isn't working NEW New issue labels Dec 1, 2020
@jonathan-teh jonathan-teh assigned jkryl and unassigned jkryl Dec 1, 2020
@gila
Contributor

gila commented Dec 1, 2020

This is because the target is left in the paused state.
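
One way to picture this: the assertion message says a state change checks that the current state matches the state it expects to be leaving. The sketch below is a toy model of that idea, not SPDK's code. The state names, the transition table and the `Subsystem` class are assumptions chosen for illustration.

```python
from enum import Enum, auto


class State(Enum):
    INACTIVE = auto()
    ACTIVATING = auto()
    ACTIVE = auto()
    PAUSING = auto()
    PAUSED = auto()
    RESUMING = auto()
    DEACTIVATING = auto()


# For each requested state, the states the subsystem may currently be in.
# Assumed table for illustration; the real rules live in SPDK's subsystem.c.
EXPECTED_OLD = {
    State.ACTIVATING: {State.INACTIVE},
    State.ACTIVE: {State.ACTIVATING, State.RESUMING},
    State.PAUSING: {State.ACTIVE},
    State.PAUSED: {State.PAUSING},
    State.RESUMING: {State.PAUSED},
    State.DEACTIVATING: {State.ACTIVE},
    State.INACTIVE: {State.DEACTIVATING},
}


class Subsystem:
    """Hypothetical stand-in for an nvmf subsystem with checked transitions."""

    def __init__(self) -> None:
        self.state = State.INACTIVE

    def set_state(self, new: State) -> None:
        # Mirrors the shape of the failed check: actual old state vs expected.
        if self.state not in EXPECTED_OLD[new]:
            raise AssertionError(
                f"cannot enter {new.name} from {self.state.name}"
            )
        self.state = new

    def start(self) -> None:
        self.set_state(State.ACTIVATING)
        self.set_state(State.ACTIVE)

    def pause(self) -> None:
        self.set_state(State.PAUSING)
        self.set_state(State.PAUSED)

    def resume(self) -> None:
        self.set_state(State.RESUMING)
        self.set_state(State.ACTIVE)

    def stop(self) -> None:
        self.set_state(State.DEACTIVATING)
        self.set_state(State.INACTIVE)
```

In this model, calling `stop()` on a subsystem that was paused and never resumed raises `AssertionError`, while `resume()` followed by `stop()` succeeds. That matches the symptom described here, but whether the eventual fix resumes the target or handles the paused state some other way is not stated in this thread.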

@gila gila removed the NEW New issue label Dec 1, 2020
@jonathan-teh
Contributor Author

I managed to hit a different assertion when the NVMf target comes back online before the initiator detects a "Controller Fatal Status" and decides to reset it:

[2020-12-02T16:05:43.522298120+00:00  WARN mayastor::spdk:bdev_nvme.c:1149] Warning: Detected a timeout. ctrlr=0x562a58580d70 qpair=(nil) cid=2   
[2020-12-02T16:05:48.519103633+00:00  WARN mayastor::spdk:bdev_nvme.c:1149] Warning: Detected a timeout. ctrlr=0x562a58580d70 qpair=(nil) cid=0   
[2020-12-02T16:05:53.516757969+00:00  WARN mayastor::spdk:bdev_nvme.c:1149] Warning: Detected a timeout. ctrlr=0x562a58580d70 qpair=(nil) cid=3   
mayastor: nvme_tcp.c:1552: nvme_tcp_qpair_check_timeout: Assertion `tcp_req->req != NULL' failed.

which is here.

@jonathan-teh jonathan-teh self-assigned this Dec 2, 2020
@jonathan-teh jonathan-teh changed the title Assertion failure when destroying a nexus in faulted state Assertion failure when unpublishing or destroying a nexus in faulted state Dec 16, 2020
@exalate-issue-sync exalate-issue-sync bot added backlog Selected for development Accepted into roadmap for product integration in progress and removed reviewing backlog Selected for development Accepted into roadmap for product integration labels Jan 5, 2021
@bors bors bot closed this as completed in ec43e53 Jan 6, 2021
@exalate-issue-sync exalate-issue-sync bot reopened this Jan 6, 2021