[BUG] [v1.6.1-rc1] v2 volume replica offline rebuilding fail #8187

Closed
chriscchien opened this issue Mar 14, 2024 · 3 comments
Labels
area/v2-data-engine v2 data engine (SPDK) kind/bug kind/regression Regression which has worked before priority/0 Must be fixed in this release (managed by PO) reproduce/always 100% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)

Comments

@chriscchien
Contributor

Describe the bug

In a fresh cluster, v2 volume replica offline rebuilding fails. The volume stays in a loop of attaching -> detaching -> faulted.

instance-manager log

2024-03-14T08:17:42.792381276Z [2024-03-14 08:17:42.792275] nvme_qpair.c: 798:spdk_nvme_qpair_process_completions: *ERROR*: CQ transport error -6 (No such device or address) on qpair id 0
2024-03-14T08:17:42.792385987Z [2024-03-14 08:17:42.792293] nvme_ctrlr.c:1022:nvme_ctrlr_fail: *ERROR*: [nqn.2023-01.io.longhorn.spdk:pvc-3ceb62c6-8ec0-4f20-828d-5f79f83daf13-r-81291e18] in failed state.
2024-03-14T08:17:42.792390876Z [2024-03-14 08:17:42.792318] nvme_ctrlr.c:1624:nvme_ctrlr_disconnect: *NOTICE*: [nqn.2023-01.io.longhorn.spdk:pvc-3ceb62c6-8ec0-4f20-828d-5f79f83daf13-r-81291e18] resetting controller
2024-03-14T08:17:42.793125465Z [2024-03-14 08:17:42.792982] posix.c: 937:posix_sock_create: *ERROR*: connect() failed, errno = 111
2024-03-14T08:17:42.793805152Z [2024-03-14 08:17:42.793700] posix.c: 937:posix_sock_create: *ERROR*: connect() failed, errno = 111
2024-03-14T08:17:42.793832460Z [2024-03-14 08:17:42.793752] nvme_tcp.c:1958:nvme_tcp_qpair_connect_sock: *ERROR*: sock connection error of tqpair=0x1f2aaa0 with addr=10.42.2.14, port=20001
2024-03-14T08:17:42.793948705Z [2024-03-14 08:17:42.793839] nvme_ctrlr.c:4016:nvme_ctrlr_process_init: *ERROR*: [nqn.2023-01.io.longhorn.spdk:pvc-3ceb62c6-8ec0-4f20-828d-5f79f83daf13-r-81291e18] Ctrlr operation failed with error: -1, ctrlr state: 51 (error)
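
The error codes in the log above can be tallied mechanically when triaging a longer instance-manager log. The following is a minimal sketch, assuming Linux errno numbering (6 is ENXIO, 111 is ECONNREFUSED); the function name `summarize_spdk_errors` and both regular expressions are illustrative and are not part of Longhorn or SPDK.

```python
import re
from collections import Counter

# Linux errno values that appear in the log above (assumption: Linux numbering).
LINUX_ERRNO = {6: "ENXIO", 111: "ECONNREFUSED"}

# Matches "errno = 111" and "transport error -6" style fragments.
ERRNO_RE = re.compile(r"(?:errno = |transport error -)(\d+)")
# Matches the target address of a failed NVMe/TCP socket connection.
ADDR_RE = re.compile(r"addr=([\d.]+), port=(\d+)")


def summarize_spdk_errors(lines):
    """Count errno names and collect unreachable targets from *ERROR* lines."""
    codes = Counter()
    targets = set()
    for line in lines:
        if "*ERROR*" not in line:
            continue  # skip *NOTICE* and other non-error lines
        for num in ERRNO_RE.findall(line):
            codes[LINUX_ERRNO.get(int(num), f"errno {num}")] += 1
        match = ADDR_RE.search(line)
        if match:
            targets.add((match.group(1), int(match.group(2))))
    return codes, targets


# Message parts copied from the log above (timestamps omitted).
sample = [
    "nvme_qpair.c: 798:spdk_nvme_qpair_process_completions: *ERROR*: "
    "CQ transport error -6 (No such device or address) on qpair id 0",
    "posix.c: 937:posix_sock_create: *ERROR*: connect() failed, errno = 111",
    "posix.c: 937:posix_sock_create: *ERROR*: connect() failed, errno = 111",
    "nvme_tcp.c:1958:nvme_tcp_qpair_connect_sock: *ERROR*: sock connection "
    "error of tqpair=0x1f2aaa0 with addr=10.42.2.14, port=20001",
]
codes, targets = summarize_spdk_errors(sample)
print(dict(codes))
print(targets)
```

Read this way, the log shows the controller losing its queue pair and then repeated refused connections to 10.42.2.14:20001 while it tries to reset.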

To Reproduce

  1. Deploy v1.6.1-rc1 and enable the v2 data engine.
  2. Create a workload with a v2 data engine StorageClass, wait until the v2 volume is healthy, and write data.
  3. Delete 1 replica from the v2 volume.
  4. Scale down the workload to detach the volume, which triggers replica offline rebuilding.
  5. Replica offline rebuilding does not succeed.
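
Steps 3 and 4 can be scripted from the command line. The sketch below only builds and prints the `kubectl` invocations; it assumes Longhorn is installed in the default `longhorn-system` namespace and that the workload is a Deployment. The report does not say whether the UI or `kubectl` was used for these steps, and the replica and workload names here are hypothetical placeholders.

```python
import shlex

LONGHORN_NS = "longhorn-system"  # assumption: default Longhorn namespace


def repro_commands(replica_name, workload, workload_ns="default"):
    """Return the kubectl argv lists for steps 3 and 4."""
    return [
        # Step 3: delete one replica custom resource of the v2 volume.
        ["kubectl", "-n", LONGHORN_NS, "delete",
         "replicas.longhorn.io", replica_name],
        # Step 4: scale the workload to zero so the volume detaches.
        ["kubectl", "-n", workload_ns, "scale", workload, "--replicas=0"],
    ]


# Hypothetical names; substitute the real replica and workload.
for argv in repro_commands("example-replica", "deployment/example-workload"):
    print(shlex.join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try outside a cluster; passing each list to `subprocess.run` would execute them.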

Expected behavior

Offline replica rebuilding succeeds.

Support bundle for troubleshooting

supportbundle_8db04e03-3d36-4f0a-9467-598ef3a29624_2024-03-14T08-17-18Z.zip

Environment

  • Longhorn version: v1.6.1-rc1
  • Impacted volume (PV): pvc-3ceb62c6-8ec0-4f20-828d-5f79f83daf13
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.5+k3s1

Additional context

Offline rebuilding succeeds in v1.6.0.

@chriscchien chriscchien added kind/bug severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) reproduce/always 100% reproducible kind/regression Regression which has worked before area/v2-data-engine v2 data engine (SPDK) require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Mar 14, 2024
@chriscchien chriscchien added this to the v1.6.1 milestone Mar 14, 2024
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Mar 14, 2024
@innobead
Member

cc @DamiaSan

@longhorn-io-github-bot

longhorn-io-github-bot commented Mar 19, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?

  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at
    Dockerfile: Update SPDK commit ID longhorn-instance-manager#444

  • Which areas/issues this PR might have potential impacts on?
    Area: SPDK snapshot and rebuilding

@chriscchien
Contributor Author

Verified as passing on Longhorn v1.6.2-rc2 (longhorn-instance-manager 62c25c).

V2 volume replica offline rebuilding worked well.
