
Add multiple concurrent node reboot feature #660

Merged: 8 commits merged into kubereboot:main on Aug 14, 2023

Conversation

@trstringer (Contributor)

Currently in kured, a single node can get a lock with Acquire. There could be situations where multiple nodes might want a lock, in the event that a cluster can handle multiple nodes being rebooted at once. This adds a side-by-side implementation for the multiple-node lock situation.
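To make the idea concrete, here is a minimal sketch of a multi-owner lock whose JSON payload is shaped like the annotation shown under "Testing done" below. The type and function names are illustrative only, not the actual code added by this PR:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// lockEntry records one node that currently holds the lock.
type lockEntry struct {
	NodeID   string          `json:"nodeID"`
	Metadata map[string]bool `json:"metadata"`
	Created  time.Time       `json:"created"`
	TTL      time.Duration   `json:"TTL"`
}

// multiLock is the annotation payload when more than one owner is allowed.
type multiLock struct {
	MaxOwners int         `json:"maxOwners"`
	Locks     []lockEntry `json:"locks"`
}

// acquire adds nodeID as an owner if capacity remains; it returns true when
// the node holds (or already held) the lock afterwards.
func (m *multiLock) acquire(nodeID string) bool {
	for _, l := range m.Locks {
		if l.NodeID == nodeID {
			return true // already an owner
		}
	}
	if len(m.Locks) >= m.MaxOwners {
		return false // lock is full; the caller should retry later
	}
	m.Locks = append(m.Locks, lockEntry{
		NodeID:   nodeID,
		Metadata: map[string]bool{"unschedulable": false},
		Created:  time.Now().UTC(),
	})
	return true
}

func main() {
	m := multiLock{MaxOwners: 2}
	fmt.Println(m.acquire("node-a"), m.acquire("node-b"), m.acquire("node-c")) // true true false
	payload, _ := json.Marshal(m)
	fmt.Println(string(payload)) // a string like this would be stored in the lock annotation
}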

Testing done

Added unit tests. Also ran a manual test with --concurrency=2. I observed that two nodes rebooted at the same time:

$ kubectl get no
NAME                                STATUS                        ROLES   AGE   VERSION
aks-nodepool1-14327021-vmss000000   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000001   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000002   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000003   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000004   NotReady,SchedulingDisabled   agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000005   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000006   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000007   Ready,SchedulingDisabled      agent   24m   v1.23.8

And the lock annotation reflected the format for a multiple owner lock:

"weave.works/kured-node-lock": "{\"maxOwners\":2,\"locks\":[{\"nodeID\":\"aks-nodepool1-14327021-vmss000007\",\"metadata\":{\"unschedulable\":false},\"created\":\"2022-09-23T21:06:30.814409507Z\",\"TTL\":0},{\"nodeID\":\"aks-nodepool1-14327021-vmss000004\",\"metadata\":{\"unschedulable\":false},\"created\":\"2022-09-23T21:06:57.626718467Z\",\"TTL\":0}]}"

@trstringer changed the title from "Add ability to have multiple nodes get a lock" to "Add multiple concurrent node reboot feature" on Sep 24, 2022
@ckotzbauer linked an issue on Oct 2, 2022 that may be closed by this pull request
@ckotzbauer (Member) left a comment:

Thanks for your PR.

I haven't done a deep review so far (especially of the lock code).
Currently the logic is able to handle multiple locks at the same time, but all nodes are treated equally. Bigger clusters typically make use of topologies like zones and regions.

Example: you have a six-node cluster split into three zones (two nodes each) and run kured with concurrency 2. It would be possible for two nodes of the same zone to hold the lock concurrently. This would shut down one zone completely (and typically shouldn't happen).

I think kured needs an understanding of topology when it comes to concurrent reboots. I also described this topic in #620.

(Review comment on cmd/kured/main.go, resolved)
@trstringer (Contributor, Author)

Thanks for the review, @ckotzbauer!

While I agree that having a smarter way to reboot multiple nodes is a great end-state goal, I think that would be a good phase two of this work. I could imagine that for many users, including the author of #620, this PR would likely fit their requirements without needing additional knowledge of the topology (or that level of control).

Another thing to note is that this is not a breaking change. If a user has a requirement for topology intelligence for multi-node reboots, then kured works the same way as it did prior to this PR, with single-node reboots. So this would be an opt-in feature.
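For illustration, opting in might look like adding the flag to the kured DaemonSet container args. This is only a sketch; everything besides the --concurrency flag itself (binary path, surrounding manifest or Helm layout) is an assumption, not taken from this PR:

# Sketch of a kured DaemonSet spec excerpt (fields other than the flag are assumptions).
containers:
  - name: kured
    command:
      - /usr/bin/kured
      - --concurrency=2   # opt in: allow two nodes to hold the lock; the default stays at a single node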

I'd love to investigate how we could make this multi-node reboot logic better in a subsequent PR! Curious about your thoughts on splitting this work into separate initiatives.

@ckotzbauer (Member) left a comment:

Thanks for your adjustments. I did a full review now, and I think the code looks good from a logical point of view.

I agree that this implementation fits use cases where you want to reboot a cluster without topologies. It's good that it is designed as an opt-in feature, but we have to add a clear note on this in our README: it is not safe to reboot bigger production clusters with topologies and concurrency enabled. Kured can disrupt critical production environments if used incorrectly. So Kured is not "safe to use by default" anymore.

@trstringer (Contributor, Author)

Thank you for the comments, @ckotzbauer!

So Kured is not "safe to use by default" anymore.

I'm not sure I understand that statement. By default, kured will still only reboot a single node.

As a side note, it looks like the end-to-end tests are failing. I'm unable to repro this on my machine; is it possible to rerun the failures? It appears as though the nodes themselves just never recover, which shouldn't be a kured issue (from my understanding).

@ckotzbauer (Member)

My statement was a bit inaccurate: I meant that increasing concurrency is unsafe if the cluster uses topologies.

Yes, I can restart the tests. This is a flaky failure, but typically it affects only one of the jobs.

@trstringer (Contributor, Author)

Thanks for re-running that, @ckotzbauer! Looks like the failures remain. I'm still not hitting any issues locally. I combed through the failure logs and nothing jumps out at me. The kind nodes go into an unschedulable state, which makes me believe that kured is doing the right thing, but maybe it's a kind/docker issue? If you have a free moment, could you take a look through the failure logs? Thanks in advance for any assistance!

@ckotzbauer (Member)

Yes, I can do that. I think I'll find some time over the weekend.

@trstringer (Contributor, Author)

@ckotzbauer, just wanted to follow up on this. Let me know if you've had a chance to look at the failures! Thank you!

@ckotzbauer (Member)

@trstringer It seems that all nodes are rebooting at once:

There are 5 nodes in the cluster
0 nodes were removed from pool once:
0 nodes removed from the pool are now back:
Result of command kubectl get nodes ... showing unschedulable nodes:
chart-testing-control-plane    <none>
chart-testing-control-plane2   <none>
chart-testing-control-plane3   <none>
chart-testing-worker           <none>
chart-testing-worker2          <none>
Attempt 1 failed! Trying again in 60 seconds...
0 nodes were removed from pool once:
0 nodes removed from the pool are now back:
Result of command kubectl get nodes ... showing unschedulable nodes:
chart-testing-control-plane    true
chart-testing-control-plane2   true
chart-testing-control-plane3   true
chart-testing-worker           true
chart-testing-worker2          true
chart-testing-control-plane is now unschedulable!
chart-testing-control-plane2 is now unschedulable!
chart-testing-control-plane3 is now unschedulable!
chart-testing-worker is now unschedulable!
chart-testing-worker2 is now unschedulable!
Attempt 2 failed! Trying again in 60 seconds...

@ckotzbauer (Member)

@trstringer Any news on this?

@jpiper commented Mar 7, 2023

Just writing down something that came to mind for this PR - it could be useful to constrain simultaneous restarts per topology zone (https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone)

e.g. if I have a ~100 node cluster that spans 3 AWS AZs, allow 2 node restarts per AZ at a time?
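To sketch that idea (this is not code from kured): before granting a lock, the current holders could be counted per topology.kubernetes.io/zone label value and a new lock refused once the zone's cap is reached. The function and zone names below are purely illustrative:

package main

import "fmt"

// canAcquire reports whether a node in the given zone may take a lock when at
// most maxPerZone concurrent reboots are allowed per zone. holders maps each
// zone label value to the number of nodes currently holding a lock there.
func canAcquire(holders map[string]int, zone string, maxPerZone int) bool {
	return holders[zone] < maxPerZone
}

func main() {
	// Example: a cluster spread across three AWS availability zones, with at
	// most two concurrent reboots allowed per zone, as suggested above.
	holders := map[string]int{"eu-west-1a": 2, "eu-west-1b": 1, "eu-west-1c": 0}
	fmt.Println(canAcquire(holders, "eu-west-1a", 2)) // false: zone already at its cap
	fmt.Println(canAcquire(holders, "eu-west-1b", 2)) // true: capacity remains in this zone
}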

@github-actions bot commented May 7, 2023

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@ckotzbauer added the keep label ("This won't be closed by the stale bot.") and removed the no-pr-activity label May 7, 2023
@github-actions bot commented Jul 7, 2023

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@george-angel

Remove stale

trstringer and others added 3 commits August 12, 2023 17:01
Currently in kured a single node can get a lock with Acquire. There
could be situations where multiple nodes might want a lock in the event
that a cluster can handle multiple nodes being rebooted. This adds the
side-by-side implementation for a multiple node lock situation.

Signed-off-by: Thomas Stringer <thomas@trstringer.com>
Signed-off-by: Thomas Stringer <thomas@trstringer.com>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
@ckotzbauer added this to the 1.14.0 milestone Aug 14, 2023
@ckotzbauer merged commit 3b9b190 into kubereboot:main Aug 14, 2023
14 checks passed
@ckotzbauer mentioned this pull request Aug 19, 2023
Labels: enhancement, keep (This won't be closed by the stale bot.)
Projects: None yet
Development: Successfully merging this pull request may close these issues: How To Restart Multiple Nodes
5 participants