
[BUG] Sometimes v2 volume stuck at attaching because engine error #6176

Open
chriscchien opened this issue Jun 21, 2023 · 3 comments
Assignees
Labels
area/v2-data-engine v2 data engine (SPDK) investigation-needed Need to identify the case before estimating and starting the development kind/bug priority/2 Nice to fix in this release (managed by PO) reproduce/rare < 50% reproducible
Milestone

Comments

@chriscchien
Contributor

chriscchien commented Jun 21, 2023

Describe the bug (🐛 if you encounter this issue)

Sometimes a v2 volume gets stuck in the attaching state because of an engine error. The engine's describe output shows:

Events:
  Type     Reason          Age    From                        Message
  ----     ------          ----   ----                        -------
  Warning  FailedStarting  4m22s  longhorn-engine-controller  Error starting vol1-e-f5fba3a5: failed to create instance: rpc error: code = Unknown desc = failed to start SPDK engine: rpc error: code = Unknown desc = failed to stop the mismatching NVMe initiator vol1 before starting: failed to logout target: failed to execute: nsenter [--mount=/host/proc/5734/ns/mnt --net=/host/proc/5734/ns/net nvme disconnect --nqn nqn.2023-01.io.longhorn.spdk:vol1-e-f5fba3a5], output , stderr Failed to scan topoplogy: No such file or directory
: exit status 1                                 
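For anyone debugging a similar hang: the failing step in the event above is the `nvme disconnect` that Longhorn runs through `nsenter` in the host namespaces. A minimal diagnostic sketch for a node shell follows. It is an assumption-laden illustration, not Longhorn's own tooling: the engine name is the one from this report (substitute your own), the NQN prefix is copied from the error message, and the `nvme` calls only run where `nvme-cli` is installed.

```shell
# Hypothetical engine name taken from this report; replace with yours.
ENGINE="vol1-e-f5fba3a5"
# The target NQN is the fixed prefix plus the engine name, as seen in the
# error message above.
NQN="nqn.2023-01.io.longhorn.spdk:${ENGINE}"
echo "target NQN: ${NQN}"

# Only attempt the nvme-cli steps where the tool is present (e.g. on the node).
if command -v nvme >/dev/null 2>&1; then
  # Look for a stale subsystem still holding this NQN.
  nvme list-subsys
  # Manually log out the leftover initiator -- the same operation the engine
  # start was attempting when it failed.
  nvme disconnect --nqn "${NQN}"
fi
```

If the manual `nvme disconnect` reports the same "Failed to scan topoplogy" error, the problem is reproducible outside Longhorn and points at the node's nvme-cli/kernel combination rather than the engine controller.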

To Reproduce

Steps to reproduce the behavior:

  1. Create a volume and attach it in maintenance mode
  2. Delete a replica to trigger offline rebuilding
  3. Delete that volume
  4. Perform case2
  5. Repeat the above steps several times

Expected behavior

A v2 volume should not get stuck in the attaching state.

Log or Support bundle

In longhorn-manager

time="2023-06-21T09:23:54Z" level=info msg="Creating instance vol1-e-f5fba3a5"
time="2023-06-21T09:23:54Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/vol1-e-f5fba3a5 error="failed to sync engine for longhorn-system/vol1-e-f5fba3a5: failed to create instance: rpc error: code = Unknown desc = failed to start SPDK engine: rpc error: code = Unknown desc = failed to stop the mismatching NVMe initiator vol1 before starting: failed to logout target: failed to execute: nsenter [--mount=/host/proc/5734/ns/mnt --net=/host/proc/5734/ns/net nvme disconnect --nqn nqn.2023-01.io.longhorn.spdk:vol1-e-f5fba3a5], output , stderr Failed to scan topoplogy: No such file or directory\n: exit status 1" node=ip-172-31-94-193
time="2023-06-21T09:23:54Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-f5fba3a5\", UID:\"5ea1e409-3b5a-4258-aa37-d011455ebe34\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"17058\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedStarting' Error starting vol1-e-f5fba3a5: failed to create instance: rpc error: code = Unknown desc = failed to start SPDK engine: rpc error: code = Unknown desc = failed to stop the mismatching NVMe initiator vol1 before starting: failed to logout target: failed to execute: nsenter [--mount=/host/proc/5734/ns/mnt --net=/host/proc/5734/ns/net nvme disconnect --nqn nqn.2023-01.io.longhorn.spdk:vol1-e-f5fba3a5], output , stderr Failed to scan topoplogy: No such file or directory\n: exit status 1"

supportbundle_4ed7928e-2dcc-4dc4-8444-681998a94511_2023-06-21T09-29-23Z.zip

Environment

  • Longhorn version: master, v1.5.x
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s

Additional context

N/A

@chriscchien chriscchien added kind/bug reproduce/rare < 50% reproducible area/v2-data-engine v2 data engine (SPDK) labels Jun 21, 2023
@chriscchien chriscchien added this to the v1.5.0 milestone Jun 21, 2023
@derekbit
Member

cc @shuo-wu

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) priority/2 Nice to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Jun 21, 2023
@shuo-wu
Contributor

shuo-wu commented Jun 26, 2023

@chriscchien What's the OS/kernel/nvmecli version of the test env?
I just saw a similar issue reported in the nvme-cli community. Not sure if the root cause is the same.

Besides, I am confused about the reproducing steps. What's the relationship between steps 1-3 and step 4? Do the reproducing steps mean that we need to keep triggering the offline rebuilding?

@chriscchien
Contributor Author

Hi @shuo-wu ,

I tested on Ubuntu 22.04, kernel 5.19.0-1025-aws, with nvme-cli version 1.16. About the test steps: yes, it happens while I am testing offline rebuilding.

@innobead innobead added the investigation-needed Need to identify the case before estimating and starting the development label Jun 27, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Jun 27, 2023
@innobead innobead added priority/0 Must be fixed in this release (managed by PO) and removed priority/2 Nice to fix in this release (managed by PO) labels Jul 17, 2023
@innobead innobead added priority/2 Nice to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Dec 22, 2023
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Jan 2, 2024