
[BUG] Node disconnection test failed #5476

Closed

yangchiu opened this issue Mar 3, 2023 · 7 comments
Assignees
Labels
area/resilience System or volume resilience area/v1-data-engine v1 data engine (iSCSI tgt) backport/1.4.1 component/longhorn-manager Longhorn manager (control plane) kind/bug priority/0 Must be fixed in this release (managed by PO) release/behavior-change-note Note for behavior change reproduce/always 100% reproducible require/manual-test-plan Require adding/updating manual test cases if they can't be automated severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade))
Milestone

Comments

yangchiu (Member) commented Mar 3, 2023

Describe the bug (🐛 if you encounter this issue)

In Node disconnection test case 1, the desired behavior is:

If there is data writing during the disconnection, 
due to the engine process not able to talk with other replicas, 
the engine process will mark all other replicas as ERROR.

The volume will remain detached, and all replicas remain in the error state after the node's network connection is back.

But in v1.4.1-rc1, the network disconnection does not cause the replica on the attached node to enter the error state; instead it stays healthy. After the node's network connection is back, the replicas on the other nodes can be rebuilt from this healthy replica, and eventually all replicas are healthy, which is not the expected behavior.

Need to confirm whether this behavior change is expected.

To Reproduce

Manually execute Node disconnection test case 1 (a sketch of the disconnection step is shown below).
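For reference, a minimal sketch of the disconnection step, under some assumptions: SSH access to the node where the volume is attached, iptables available on that node, and an illustrative node name. The authoritative steps are in the Longhorn manual test doc, and the real test may differ in detail.

```python
# Minimal sketch of the disconnection step in "Node disconnection test case 1".
# Assumptions: SSH access to the attached node, iptables available on it, and
# a workload writing to the volume while the network is down. The node name is
# hypothetical; the authoritative steps are in the Longhorn manual test docs.
import subprocess

ATTACHED_NODE = "worker-1"   # hypothetical: node where the volume is attached
DISCONNECT_SECONDS = 100     # case 1 uses a 100-second disconnection

# Drop all traffic, wait, then remove the rules again. Everything runs in one
# detached shell on the node so that losing the SSH session does not interrupt
# the sleep/restore part.
disconnect_script = (
    "nohup sh -c '"
    "sleep 5; "  # give the SSH session a moment to return cleanly
    "iptables -I INPUT -j DROP; iptables -I OUTPUT -j DROP; "
    f"sleep {DISCONNECT_SECONDS}; "
    "iptables -D INPUT -j DROP; iptables -D OUTPUT -j DROP"
    "' >/dev/null 2>&1 &"
)

subprocess.run(["ssh", ATTACHED_NODE, disconnect_script], check=True)
print(f"Dropping traffic on {ATTACHED_NODE} for {DISCONNECT_SECONDS}s; "
      "keep writing to the volume during this window.")
```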


@yangchiu yangchiu added kind/bug reproduce/always 100% reproducible labels Mar 3, 2023
@yangchiu yangchiu added this to the v1.4.1 milestone Mar 3, 2023
@innobead innobead added the priority/0 Must be fixed in this release (managed by PO) label Mar 3, 2023
@innobead innobead added area/v1-data-engine v1 data engine (iSCSI tgt) component/longhorn-manager Longhorn manager (control plane) labels Mar 3, 2023
innobead (Member) commented Mar 3, 2023

@yangchiu how long was the disconnection?

Case 1 uses a 100-second disconnection, so we would expect the replicas to be marked as error already.

cc @derekbit @shuo-wu @c3y1huang

@innobead innobead added the severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)) label Mar 3, 2023
yangchiu (Member, Author) commented Mar 3, 2023

@yangchiu how long was the disconnection?

Case 1 uses a 100-second disconnection, so we would expect the replicas to be marked as error already.

cc @derekbit @shuo-wu @c3y1huang

100 seconds.

Following the same test steps, v1.4.0 doesn't have this problem: the behavior is the same as described in the test case.

derekbit (Member) commented Mar 3, 2023

The behavior change is due to commit longhorn/longhorn-manager#1691.

In master-head/v1.4.1-rc1, auto-salvage is disabled in the following scenarios.

Scenario 1: one of the replicas is on the attached node.

  • The attached node is disconnected.
  • The replica on the attached node becomes unknown. The connection between the engine and this replica still works, so the unknown state is expected.
  • The replica becomes running after the network connection is back. Then, the other replicas are rebuilt from the replica on the attached node.
  • In the end, all replicas are running.

Scenario 2: none of the replicas is on the attached node.

  • The attached node is disconnected.
  • All replicas become stopped after the network connection is back.

@yangchiu Can you help check the behaviors again? Thank you.
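To make the re-check concrete, here is a minimal sketch, under some assumptions: kubectl access to the cluster, Longhorn installed in the default longhorn-system namespace, Replica CRs labeled longhornvolume=&lt;volume&gt; and exposing status.currentState (verify both against your Longhorn version), and a hypothetical volume name.

```python
# Minimal sketch for re-checking replica states after the network is back.
# Assumptions: kubectl access, Longhorn installed in "longhorn-system", and
# Replica CRs labeled longhornvolume=<volume> exposing status.currentState;
# verify both against your Longhorn version. The volume name is hypothetical.
import json
import subprocess

VOLUME = "test-vol"  # hypothetical volume used in the manual test

out = subprocess.run(
    ["kubectl", "-n", "longhorn-system", "get", "replicas.longhorn.io",
     "-l", f"longhornvolume={VOLUME}", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

for replica in json.loads(out)["items"]:
    name = replica["metadata"]["name"]
    node = replica.get("spec", {}).get("nodeID", "")
    state = replica.get("status", {}).get("currentState", "")
    print(f"{name}\tnode={node}\tstate={state}")

# Scenario 1: the replica on the attached node should come back as running and
# the others should be rebuilt to running as well.
# Scenario 2: all replicas should be stopped once the network connection is back.
```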

innobead (Member) commented Mar 3, 2023

If the above behaviors are confirmed, then we need to update the test doc. cc @longhorn/qa

yangchiu (Member, Author) commented Mar 3, 2023

Yes, the current behavior follows #5476 (comment), and the test case needs to be updated.

@innobead innobead added wontfix require/manual-test-plan Require adding/updating manual test cases if they can't be automated release/behavior-change-note Note for behavior change and removed wontfix labels Mar 3, 2023
@innobead innobead modified the milestones: v1.4.1, v1.5.0 Mar 3, 2023
@innobead innobead added the area/resilience System or volume resilience label Mar 3, 2023
innobead (Member) commented Mar 6, 2023

@yangchiu It seems to be a behavior change, so just close this if all is good, and also update the test cases accordingly. Thanks.

cc @longhorn/qa

yangchiu (Member, Author) commented Mar 7, 2023

The manual test was updated in longhorn/longhorn-tests#1278, so we can close this one.

@yangchiu yangchiu closed this as completed Mar 7, 2023
@innobead innobead changed the title [BUG][v1.4.1-rc1] Node disconnection test failed [BUG] Node disconnection test failed Mar 9, 2023