
Add multiple concurrent node reboot feature #660

Merged: 8 commits merged into kubereboot:main on Aug 14, 2023

Conversation

@trstringer (Contributor)

Currently in kured, a single node can get a lock with Acquire. There could be situations where multiple nodes might want a lock, in the event that a cluster can handle multiple nodes being rebooted at once. This adds a side-by-side implementation for the multiple-node lock situation.
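To make the idea concrete, here is a minimal sketch of a multi-owner lock whose JSON payload is shaped like the annotation shown under "Testing done" below. The type and function names are illustrative only, not the actual code added by this PR:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// lockEntry records one node that currently holds the lock.
type lockEntry struct {
	NodeID   string          `json:"nodeID"`
	Metadata map[string]bool `json:"metadata"`
	Created  time.Time       `json:"created"`
	TTL      time.Duration   `json:"TTL"`
}

// multiLock is the annotation payload when more than one owner is allowed.
type multiLock struct {
	MaxOwners int         `json:"maxOwners"`
	Locks     []lockEntry `json:"locks"`
}

// acquire adds nodeID as an owner if capacity remains; it returns true when
// the node holds (or already held) the lock afterwards.
func (m *multiLock) acquire(nodeID string) bool {
	for _, l := range m.Locks {
		if l.NodeID == nodeID {
			return true // already an owner
		}
	}
	if len(m.Locks) >= m.MaxOwners {
		return false // lock is full; the caller should retry later
	}
	m.Locks = append(m.Locks, lockEntry{
		NodeID:   nodeID,
		Metadata: map[string]bool{"unschedulable": false},
		Created:  time.Now().UTC(),
	})
	return true
}

func main() {
	m := multiLock{MaxOwners: 2}
	fmt.Println(m.acquire("node-a"), m.acquire("node-b"), m.acquire("node-c")) // true true false
	payload, _ := json.Marshal(m)
	fmt.Println(string(payload)) // a string like this would be stored in the lock annotation
}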

Testing done

Added unit tests. Also ran a manual test with --concurrency=2. I observed that two nodes rebooted at the same time:

$ kubectl get no
NAME                                STATUS                        ROLES   AGE   VERSION
aks-nodepool1-14327021-vmss000000   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000001   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000002   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000003   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000004   NotReady,SchedulingDisabled   agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000005   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000006   Ready                         agent   24m   v1.23.8
aks-nodepool1-14327021-vmss000007   Ready,SchedulingDisabled      agent   24m   v1.23.8

And the lock annotation reflected the format for a multiple owner lock:

"weave.works/kured-node-lock": "{\"maxOwners\":2,\"locks\":[{\"nodeID\":\"aks-nodepool1-14327021-vmss000007\",\"metadata\":{\"unschedulable\":false},\"created\":\"2022-09-23T21:06:30.814409507Z\",\"TTL\":0},{\"nodeID\":\"aks-nodepool1-14327021-vmss000004\",\"metadata\":{\"unschedulable\":false},\"created\":\"2022-09-23T21:06:57.626718467Z\",\"TTL\":0}]}"

@trstringer changed the title from "Add ability to have multiple nodes get a lock" to "Add multiple concurrent node reboot feature" on Sep 24, 2022
@ckotzbauer linked an issue on Oct 2, 2022 that may be closed by this pull request
@ckotzbauer (Member) left a comment:

Thanks for your PR.

I haven't done a deep review so far (especially of the lock code).
Currently the logic is able to handle multiple locks at the same time, but all nodes are treated equally. Bigger clusters typically make use of topologies like zones and regions.

Example: you have a six-node cluster split into three zones (two nodes each) and run kured with concurrency 2. It would be possible for two nodes of the same zone to hold the lock concurrently. This would shut down one zone completely (and typically shouldn't happen).

I think kured needs an understanding of topology when it comes to concurrent reboots. I also described this topic in #620.

(Review comment on cmd/kured/main.go, resolved)
@trstringer (Contributor, Author)

Thanks for the review, @ckotzbauer!

While I agree that having a smarter way to reboot multiple nodes is a great end-state goal, I think that would be a good phase two of this work. I could imagine that for many users, including the author of #620, this PR would likely fit their requirements without needing additional knowledge of the topology (or that level of control).

Another thing to note is that this is not a breaking change. If a user has a requirement for topology intelligence for multi-node reboots, then kured works the same way as it did prior to this PR, with single-node reboots. So this would be an opt-in feature.
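For illustration, opting in might look like adding the flag to the kured DaemonSet container args. This is only a sketch; everything besides the --concurrency flag itself (binary path, surrounding manifest or Helm layout) is an assumption, not taken from this PR:

# Sketch of a kured DaemonSet spec excerpt (fields other than the flag are assumptions).
containers:
  - name: kured
    command:
      - /usr/bin/kured
      - --concurrency=2   # opt in: allow two nodes to hold the lock; the default stays at a single node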

I'd love to investigate how we could make this multi-node reboot logic better in a subsequent PR! Curious about your thoughts on splitting this work into separate initiatives.

@ckotzbauer (Member) left a comment:

Thanks for your adjustments. I did a full review now, and I think the code looks good from a logical point of view.

I agree that this implementation fits use cases where you want to reboot a cluster without topologies. It's good that it is designed as an opt-in feature, but we have to add a clear note on this in our README: it is not safe to reboot bigger production clusters with topologies and concurrency enabled. Kured can disrupt critical production environments if used incorrectly. So Kured is not "safe to use by default" anymore.

@trstringer (Contributor, Author)

Thank you for the comments, @ckotzbauer!

So Kured is not "safe to use by default" anymore.

I'm not sure I understand that statement. By default, kured will still only reboot a single node.

As a side note, it looks like the end-to-end tests are failing. I'm unable to repro this on my machine; is it possible to rerun the failures? It appears as though the nodes themselves just never recover, which shouldn't be a kured issue (from my understanding).

@ckotzbauer (Member)

My statement was a bit inaccurate: I meant that increasing concurrency is unsafe if the cluster uses topologies.

Yes, I can restart the tests. This is a flaky failure, but typically it affects only one of the jobs.

@trstringer (Contributor, Author)

Thanks for re-running that, @ckotzbauer! Looks like the failures remain. I'm still not hitting any issues locally. I combed through the failure logs and nothing jumps out at me. The kind nodes go into an unschedulable state, which makes me believe that kured is doing the right thing, but maybe it's a kind/docker issue? If you have a free moment, could you take a look through the failure logs? Thanks in advance for any assistance!

@ckotzbauer (Member)

Yes, I can do that. I think I'll find some time over the weekend.

@trstringer (Contributor, Author)

@ckotzbauer, just wanted to follow up on this. Let me know if you've had a chance to look at the failures! Thank you!

@ckotzbauer (Member)

@trstringer It seems that all nodes are rebooting at once:

There are 5 nodes in the cluster
0 nodes were removed from pool once:
0 nodes removed from the pool are now back:
Result of command kubectl get nodes ... showing unschedulable nodes:
chart-testing-control-plane    <none>
chart-testing-control-plane2   <none>
chart-testing-control-plane3   <none>
chart-testing-worker           <none>
chart-testing-worker2          <none>
Attempt 1 failed! Trying again in 60 seconds...
0 nodes were removed from pool once:
0 nodes removed from the pool are now back:
Result of command kubectl get nodes ... showing unschedulable nodes:
chart-testing-control-plane    true
chart-testing-control-plane2   true
chart-testing-control-plane3   true
chart-testing-worker           true
chart-testing-worker2          true
chart-testing-control-plane is now unschedulable!
chart-testing-control-plane2 is now unschedulable!
chart-testing-control-plane3 is now unschedulable!
chart-testing-worker is now unschedulable!
chart-testing-worker2 is now unschedulable!
Attempt 2 failed! Trying again in 60 seconds...

@ckotzbauer (Member)

@trstringer Any news on this?

@jpiper commented Mar 7, 2023

Just writing down something that came to mind for this PR - it could be useful to constrain simultaneous restarts per topology zone (https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone)

e.g. if I have a ~100 node cluster that spans 3 AWS AZs, allow 2 node restarts per AZ at a time?
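To sketch that idea (this is not code from kured): before granting a lock, the current holders could be counted per topology.kubernetes.io/zone label value and a new lock refused once the zone's cap is reached. The function and zone names below are purely illustrative:

package main

import "fmt"

// canAcquire reports whether a node in the given zone may take a lock when at
// most maxPerZone concurrent reboots are allowed per zone. holders maps each
// zone label value to the number of nodes currently holding a lock there.
func canAcquire(holders map[string]int, zone string, maxPerZone int) bool {
	return holders[zone] < maxPerZone
}

func main() {
	// Example: a cluster spread across three AWS availability zones, with at
	// most two concurrent reboots allowed per zone, as suggested above.
	holders := map[string]int{"eu-west-1a": 2, "eu-west-1b": 1, "eu-west-1c": 0}
	fmt.Println(canAcquire(holders, "eu-west-1a", 2)) // false: zone already at its cap
	fmt.Println(canAcquire(holders, "eu-west-1b", 2)) // true: capacity remains in this zone
}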

@github-actions bot commented May 7, 2023

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@ckotzbauer added the keep label ("This won't be closed by the stale bot.") and removed the no-pr-activity label May 7, 2023
@github-actions bot commented Jul 7, 2023

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

@george-angel

Remove stale

trstringer and others added 3 commits August 12, 2023 17:01
Currently in kured a single node can get a lock with Acquire. There
could be situations where multiple nodes might want a lock in the event
that a cluster can handle multiple nodes being rebooted. This adds the
side-by-side implementation for a multiple node lock situation.

Signed-off-by: Thomas Stringer <thomas@trstringer.com>
Signed-off-by: Thomas Stringer <thomas@trstringer.com>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
Signed-off-by: Christian Kotzbauer <git@ckotzbauer.de>
@ckotzbauer added this to the 1.14.0 milestone Aug 14, 2023
@ckotzbauer merged commit 3b9b190 into kubereboot:main Aug 14, 2023
14 checks passed
@ckotzbauer mentioned this pull request Aug 19, 2023
Labels: enhancement, keep (This won't be closed by the stale bot.)
Projects: None yet
Development: Successfully merging this pull request may close these issues: How To Restart Multiple Nodes
5 participants