
[BUG] Node disconnection test failed #5476

Closed

yangchiu opened this issue Mar 3, 2023 · 7 comments
Assignees
Labels
area/resilience System or volume resilience area/v1-data-engine v1 data engine (iSCSI tgt) backport/1.4.1 component/longhorn-manager Longhorn manager (control plane) kind/bug priority/0 Must be fixed in this release (managed by PO) release/behavior-change-note Note for behavior change reproduce/always 100% reproducible require/manual-test-plan Require adding/updating manual test cases if they can't be automated severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade))
Milestone

Comments

yangchiu (Member) commented Mar 3, 2023

Describe the bug (🐛 if you encounter this issue)

In Node disconnection test case 1, the desired behavior is:

If there is data writing during the disconnection, 
due to the engine process not able to talk with other replicas, 
the engine process will mark all other replicas as ERROR.

The volume will remain detached, and all replicas remain in the error state after the node's network connection is back.

But in v1.4.1-rc1, the network disconnection does not cause the replica on the attached node to enter the error state; instead it stays healthy. After the node's network connection is back, the replicas on the other nodes can be rebuilt from this healthy replica, and eventually all replicas are healthy, which is not the expected behavior.

Need to confirm whether this behavior change is expected.

To Reproduce

Manually execute Node disconnection test case 1 (a sketch of the disconnection step is shown below).
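For reference, a minimal sketch of the disconnection step, under some assumptions: SSH access to the node where the volume is attached, iptables available on that node, and an illustrative node name. The authoritative steps are in the Longhorn manual test doc, and the real test may differ in detail.

```python
# Minimal sketch of the disconnection step in "Node disconnection test case 1".
# Assumptions: SSH access to the attached node, iptables available on it, and
# a workload writing to the volume while the network is down. The node name is
# hypothetical; the authoritative steps are in the Longhorn manual test docs.
import subprocess

ATTACHED_NODE = "worker-1"   # hypothetical: node where the volume is attached
DISCONNECT_SECONDS = 100     # case 1 uses a 100-second disconnection

# Drop all traffic, wait, then remove the rules again. Everything runs in one
# detached shell on the node so that losing the SSH session does not interrupt
# the sleep/restore part.
disconnect_script = (
    "nohup sh -c '"
    "sleep 5; "  # give the SSH session a moment to return cleanly
    "iptables -I INPUT -j DROP; iptables -I OUTPUT -j DROP; "
    f"sleep {DISCONNECT_SECONDS}; "
    "iptables -D INPUT -j DROP; iptables -D OUTPUT -j DROP"
    "' >/dev/null 2>&1 &"
)

subprocess.run(["ssh", ATTACHED_NODE, disconnect_script], check=True)
print(f"Dropping traffic on {ATTACHED_NODE} for {DISCONNECT_SECONDS}s; "
      "keep writing to the volume during this window.")
```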


@yangchiu yangchiu added kind/bug reproduce/always 100% reproducible labels Mar 3, 2023
@yangchiu yangchiu added this to the v1.4.1 milestone Mar 3, 2023
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Mar 3, 2023
@innobead innobead added area/v1-data-engine v1 data engine (iSCSI tgt) component/longhorn-manager Longhorn manager (control plane) labels Mar 3, 2023
innobead (Member) commented Mar 3, 2023

@yangchiu how long was the disconnection?

Case 1 uses a 100-second disconnection, so we would expect the replicas to be marked as error already.

cc @derekbit @shuo-wu @c3y1huang

@innobead innobead added the severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) label Mar 3, 2023
yangchiu (Member, Author) commented Mar 3, 2023

@yangchiu how long was the disconnection?

Case 1 uses a 100-second disconnection, so we would expect the replicas to be marked as error already.

cc @derekbit @shuo-wu @c3y1huang

100 seconds.

Following the same test steps, v1.4.0 doesn't have this problem: the behavior is the same as described in the test case.

derekbit (Member) commented Mar 3, 2023

The behavior change is due to commit longhorn/longhorn-manager#1691.

In master-head/v1.4.1-rc1, auto-salvage is disabled in the following scenarios.

Scenario 1: one of the replicas is on the attached node.

  • The attached node is disconnected.
  • The replica on the attached node becomes unknown. The connection between the engine and this replica still works, so the unknown state is expected.
  • The replica becomes running after the network connection is back. Then, the other replicas are rebuilt from the replica on the attached node.
  • In the end, all replicas are running.

Scenario 2: none of the replicas is on the attached node.

  • The attached node is disconnected.
  • All replicas become stopped after the network connection is back.

@yangchiu Can you help check the behaviors again? Thank you.
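To make the re-check concrete, here is a minimal sketch, under some assumptions: kubectl access to the cluster, Longhorn installed in the default longhorn-system namespace, Replica CRs labeled longhornvolume=&lt;volume&gt; and exposing status.currentState (verify both against your Longhorn version), and a hypothetical volume name.

```python
# Minimal sketch for re-checking replica states after the network is back.
# Assumptions: kubectl access, Longhorn installed in "longhorn-system", and
# Replica CRs labeled longhornvolume=<volume> exposing status.currentState;
# verify both against your Longhorn version. The volume name is hypothetical.
import json
import subprocess

VOLUME = "test-vol"  # hypothetical volume used in the manual test

out = subprocess.run(
    ["kubectl", "-n", "longhorn-system", "get", "replicas.longhorn.io",
     "-l", f"longhornvolume={VOLUME}", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

for replica in json.loads(out)["items"]:
    name = replica["metadata"]["name"]
    node = replica.get("spec", {}).get("nodeID", "")
    state = replica.get("status", {}).get("currentState", "")
    print(f"{name}\tnode={node}\tstate={state}")

# Scenario 1: the replica on the attached node should come back as running and
# the others should be rebuilt to running as well.
# Scenario 2: all replicas should be stopped once the network connection is back.
```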

innobead (Member) commented Mar 3, 2023

If the above behaviors are confirmed, then we need to update the test doc. cc @longhorn/qa

yangchiu (Member, Author) commented Mar 3, 2023

Yes, the current behavior follows #5476 (comment), and the test case needs to be updated.

@innobead innobead added wontfix require/manual-test-plan Require adding/updating manual test cases if they can't be automated release/behavior-change-note Note for behavior change and removed wontfix labels Mar 3, 2023
@innobead innobead modified the milestones: v1.4.1, v1.5.0 Mar 3, 2023
@innobead innobead added the area/resilience System or volume resilience label Mar 3, 2023
innobead (Member) commented Mar 6, 2023

@yangchiu It seems to be a behavior change, so just close this if all is good, and also update the test cases accordingly. Thanks.

cc @longhorn/qa

yangchiu (Member, Author) commented Mar 7, 2023

The manual test was updated in longhorn/longhorn-tests#1278, so we can close this one.

@yangchiu yangchiu closed this as completed Mar 7, 2023
@innobead innobead changed the title [BUG][v1.4.1-rc1] Node disconnection test failed [BUG] Node disconnection test failed Mar 9, 2023