
[BUG] Sometimes v2 volume stuck at attaching because engine error #6176

Open
chriscchien opened this issue Jun 21, 2023 · 3 comments
Assignees
Labels
area/v2-data-engine v2 data engine (SPDK) investigation-needed Need to identify the case before estimating and starting the development kind/bug priority/2 Nice to fix in this release (managed by PO) reproduce/rare < 50% reproducible
Milestone

Comments

@chriscchien
Contributor

chriscchien commented Jun 21, 2023

Describe the bug (🐛 if you encounter this issue)

Sometimes a v2 volume gets stuck in the attaching state because of an engine error. The engine's describe output shows:

Events:
  Type     Reason          Age    From                        Message
  ----     ------          ----   ----                        -------
  Warning  FailedStarting  4m22s  longhorn-engine-controller  Error starting vol1-e-f5fba3a5: failed to create instance: rpc error: code = Unknown desc = failed to start SPDK engine: rpc error: code = Unknown desc = failed to stop the mismatching NVMe initiator vol1 before starting: failed to logout target: failed to execute: nsenter [--mount=/host/proc/5734/ns/mnt --net=/host/proc/5734/ns/net nvme disconnect --nqn nqn.2023-01.io.longhorn.spdk:vol1-e-f5fba3a5], output , stderr Failed to scan topoplogy: No such file or directory
: exit status 1                                 
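For anyone debugging a similar hang: the failing step in the event above is the `nvme disconnect` that Longhorn runs through `nsenter` in the host namespaces. A minimal diagnostic sketch for a node shell follows. It is an assumption-laden illustration, not Longhorn's own tooling: the engine name is the one from this report (substitute your own), the NQN prefix is copied from the error message, and the `nvme` calls only run where `nvme-cli` is installed.

```shell
# Hypothetical engine name taken from this report; replace with yours.
ENGINE="vol1-e-f5fba3a5"
# The target NQN is the fixed prefix plus the engine name, as seen in the
# error message above.
NQN="nqn.2023-01.io.longhorn.spdk:${ENGINE}"
echo "target NQN: ${NQN}"

# Only attempt the nvme-cli steps where the tool is present (e.g. on the node).
if command -v nvme >/dev/null 2>&1; then
  # Look for a stale subsystem still holding this NQN.
  nvme list-subsys
  # Manually log out the leftover initiator -- the same operation the engine
  # start was attempting when it failed.
  nvme disconnect --nqn "${NQN}"
fi
```

If the manual `nvme disconnect` reports the same "Failed to scan topoplogy" error, the problem is reproducible outside Longhorn and points at the node's nvme-cli/kernel combination rather than the engine controller.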

To Reproduce

Steps to reproduce the behavior:

  1. Create a volume and attach it in maintenance mode
  2. Delete a replica to trigger offline rebuilding
  3. Delete that volume
  4. Perform case2
  5. Repeat the above steps several times

Expected behavior

A v2 volume should not get stuck in the attaching state.

Log or Support bundle

In longhorn-manager

time="2023-06-21T09:23:54Z" level=info msg="Creating instance vol1-e-f5fba3a5"
time="2023-06-21T09:23:54Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/vol1-e-f5fba3a5 error="failed to sync engine for longhorn-system/vol1-e-f5fba3a5: failed to create instance: rpc error: code = Unknown desc = failed to start SPDK engine: rpc error: code = Unknown desc = failed to stop the mismatching NVMe initiator vol1 before starting: failed to logout target: failed to execute: nsenter [--mount=/host/proc/5734/ns/mnt --net=/host/proc/5734/ns/net nvme disconnect --nqn nqn.2023-01.io.longhorn.spdk:vol1-e-f5fba3a5], output , stderr Failed to scan topoplogy: No such file or directory\n: exit status 1" node=ip-172-31-94-193
time="2023-06-21T09:23:54Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"vol1-e-f5fba3a5\", UID:\"5ea1e409-3b5a-4258-aa37-d011455ebe34\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"17058\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedStarting' Error starting vol1-e-f5fba3a5: failed to create instance: rpc error: code = Unknown desc = failed to start SPDK engine: rpc error: code = Unknown desc = failed to stop the mismatching NVMe initiator vol1 before starting: failed to logout target: failed to execute: nsenter [--mount=/host/proc/5734/ns/mnt --net=/host/proc/5734/ns/net nvme disconnect --nqn nqn.2023-01.io.longhorn.spdk:vol1-e-f5fba3a5], output , stderr Failed to scan topoplogy: No such file or directory\n: exit status 1"

supportbundle_4ed7928e-2dcc-4dc4-8444-681998a94511_2023-06-21T09-29-23Z.zip

Environment

  • Longhorn version: master, v1.5.x
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s

Additional context

N/A

@chriscchien chriscchien added kind/bug reproduce/rare < 50% reproducible area/v2-data-engine v2 data engine (SPDK) labels Jun 21, 2023
@chriscchien chriscchien added this to the v1.5.0 milestone Jun 21, 2023
@derekbit
Member

cc @shuo-wu

@innobead innobead added priority/0 Must be fixed in this release (managed by PO) priority/2 Nice to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Jun 21, 2023
@shuo-wu
Contributor

shuo-wu commented Jun 26, 2023

@chriscchien What's the OS/kernel/nvmecli version of the test env?
I just saw a similar issue reported in the nvme-cli community. Not sure if the root cause is the same.

Besides, I am confused about the reproducing steps. What's the relationship between steps 1-3 and step 4? Do the reproducing steps mean that we need to keep triggering the offline rebuilding?

@chriscchien
Contributor Author

Hi @shuo-wu ,

I tested on Ubuntu 22.04, kernel 5.19.0-1025-aws, with nvme-cli version 1.16. About the test steps: yes, it happens while I am testing offline rebuilding.

@innobead innobead added the investigation-needed Need to identify the case before estimating and starting the development label Jun 27, 2023
@innobead innobead modified the milestones: v1.5.0, v1.6.0 Jun 27, 2023
@innobead innobead added priority/0 Must be fixed in this release (managed by PO) and removed priority/2 Nice to fix in this release (managed by PO) labels Jul 17, 2023
@innobead innobead added priority/2 Nice to fix in this release (managed by PO) and removed priority/0 Must be fixed in this release (managed by PO) labels Dec 22, 2023
@innobead innobead modified the milestones: v1.6.0, v1.7.0 Jan 2, 2024