
[BUG] Continuously rebuild when auto-balance==least-effort and existing node becomes unschedulable #4502

Closed · docbobo opened this issue Aug 30, 2022 · 8 comments

Labels: area/volume-replica-scheduling, backport/1.2.6, backport/1.3.2, component/longhorn-manager, kind/bug, priority/0, require/auto-e2e-test
Assignees: c3y1huang
Milestone: v1.4.0

docbobo commented Aug 30, 2022

Describe the bug

I was just noticing some weird interactions between auto-balance "least-effort" and unschedulable replicas when I had to cordon a few of my nodes. Here's a quick description:

I have a volume with 3 replicas, with auto-balance configured to "ignored" so that it falls back to my system default of least-effort. Each of the replicas is assigned to a different zone. When I had to cordon a few of the nodes, all of the nodes in one of the zones became unschedulable, leaving only two schedulable zones. However, the replica on the unschedulable node was still running. So far, so good.

Longhorn then started to build a fourth replica in one of the two zones already in use. When it was done, it deleted one of the previously existing replicas. And then it did that again. And again. And again. It never stopped building new replicas and deleting the old ones.

I've seen that behavior a few times already. What helped in that situation was setting auto-balance to "disabled"; in that case, it finishes the cycle it is currently in and then stops.

To Reproduce

See above.

Expected behavior

Even though one of the replicas is on an unschedulable node, I'd expect Longhorn to realize that it has already achieved the best possible balance with respect to fault tolerance. I would definitely not expect it to keep recreating and deleting replicas forever.
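
For illustration only: a minimal sketch, in Go, of the check being described here — "is there any schedulable zone that has no replica at all?" — where a replica that is still running on a cordoned node keeps counting for its zone. The type and function names are invented for this example and are not Longhorn's implementation.

```go
package main

import "fmt"

// replica is a simplified stand-in for a Longhorn replica: the zone it lives
// in and whether its node is still schedulable (false once cordoned).
type replica struct {
	zone        string
	schedulable bool
}

// rebalanceNeeded reports whether spreading across zones could still be
// improved: true only if some schedulable zone holds no replica at all.
// A replica running on a cordoned node still covers its own zone.
func rebalanceNeeded(replicas []replica, schedulableZones []string) bool {
	covered := map[string]bool{}
	for _, r := range replicas {
		covered[r.zone] = true
	}
	for _, zone := range schedulableZones {
		if !covered[zone] {
			return true // an empty schedulable zone exists; adding a replica there helps
		}
	}
	return false
}

func main() {
	// Reported scenario: 3 replicas in zones a, b, c; zone c is fully cordoned.
	replicas := []replica{{"a", true}, {"b", true}, {"c", false}}
	fmt.Println(rebalanceNeeded(replicas, []string{"a", "b"})) // false: nothing left to improve
}
```

Under that model the reported topology yields "no rebuild needed", which is the behavior expected above.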

Log or Support bundle

n/a

Environment

  • Longhorn version: 1.3.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s v1.24.4
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 11
  • Node config
    • OS type and version: openSUSE MicroOS
    • CPU per node: 4 cores
    • Memory per node: 4-12 GB
    • Disk type (e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 8x10GbE, 3x1GbE
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): KVM & Baremetal
  • Number of Longhorn volumes in the cluster: 37

Additional context

n/a

innobead (Member) commented:

cc @c3y1huang

c3y1huang (Contributor) commented:

Thanks for reporting. We will look into this.

@c3y1huang added the component/longhorn-manager and require/auto-e2e-test labels on Aug 30, 2022
@c3y1huang self-assigned this on Aug 30, 2022
@innobead added this to the v1.4.0 milestone on Aug 30, 2022
@innobead added the backport/1.3.2, priority/0, and area/volume-replica-scheduling labels on Aug 30, 2022
withinboredom commented Aug 31, 2022

Looks like the same thing (or maybe the opposite?) happens when auto-balance is set to best-effort, the engine image is being auto-upgraded, and one of the nodes goes away during the upgrade.

Replicas are created and destroyed in an infinite loop.

c3y1huang (Contributor) commented:

Auto-balance best-effort first goes through the logic to achieve balance for least-effort, so the infinite loop applies to both cases. We need to fix this bug so that the setting recognizes replicas that are already on an unschedulable node.
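
A rough sketch, in Go, of the ordering described above: best-effort runs the least-effort pass first, and replicas on unschedulable nodes still count toward their zone. The function names and the simplified per-zone count model are assumptions for illustration only, not longhorn-manager code.

```go
package main

import "fmt"

// leastEffortMissing returns how many replicas the least-effort pass would
// still ask for: one per schedulable zone that currently has no replica.
// Replicas that keep running on cordoned nodes are counted for their zone,
// which is the recognition the fix is about (illustrative model only).
func leastEffortMissing(replicasPerZone map[string]int, schedulableZones []string) int {
	missing := 0
	for _, zone := range schedulableZones {
		if replicasPerZone[zone] == 0 {
			missing++
		}
	}
	return missing
}

// bestEffortMissing runs the least-effort pass first; only when that pass is
// satisfied would a stricter even-spread pass follow (omitted here). A bug in
// the least-effort pass therefore shows up under both settings.
func bestEffortMissing(replicasPerZone map[string]int, schedulableZones []string) int {
	if n := leastEffortMissing(replicasPerZone, schedulableZones); n > 0 {
		return n
	}
	// ... best-effort's even-spread adjustments would go here ...
	return 0
}

func main() {
	// Zones a and b are schedulable; the zone-c replica sits on a cordoned node.
	counts := map[string]int{"a": 1, "b": 1, "c": 1}
	fmt.Println(bestEffortMissing(counts, []string{"a", "b"})) // 0: no rebuild requested
}
```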

However, I was not expecting it to auto-upgrade the image; can you give some more info about that behavior?

withinboredom commented:

I had the "auto-upgrade engine" setting turned on, and during the upgrade from 1.3.0 to 1.3.1, one of the nodes turned off while some engines were upgrading. The volume then went into an infinite loop of creating and deleting replicas until the node came back online some time later.

I came here to report the issue and saw this here.

There were only three nodes, and each volume had three replicas (then two nodes and three replicas). It is hard to tell; it may not be the same issue, but it is the same behavior.

docbobo commented Sep 2, 2022

I am seeing something similar when just using a nodeSelector. When the nodeSelector only matches nodes in two different zones, but the strategy is set to least-effort (and maybe also best-effort), Longhorn will continuously rebuild.

longhorn-io-github-bot commented Sep 28, 2022

Pre Ready-For-Testing Checklist

yangchiu (Member) commented Sep 30, 2022

Verified passed on master-head (longhorn-manager aa79220) by executing the test_replica_auto_balance_when_replica_on_unschedulable_node automated test case and by manually running the test steps following #4502 (comment); the unexpected loop of replica deletion and recreation is no longer observed.
