
Add support for Kubernetes node draining #51

Merged
merged 3 commits into master from k8s-drain on Aug 14, 2021
Conversation

dghubble
Member

@dghubble dghubble commented Aug 9, 2021

  • Zincati requests include a node (i.e. agent) UUID identifier, assigned to match systemd-id128's app-specific UUID behavior. Replicate the same logic used in systemd and Zincati (Rust) so that systemd machine IDs (available via a Kubernetes Node's System UUID info) can be mapped to the ID Zincati would assign. This allows a specific Zincati request to be mapped to the matching Kubernetes node, improving logging.
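
For reference, a minimal sketch of that derivation (systemd's sd_id128_get_machine_app_specific, which Zincati reuses): HMAC-SHA256 keyed with the raw 16-byte machine ID over a fixed application ID, truncated to 16 bytes with UUID v4 version/variant bits set. The machineID and appID constants below are placeholders, not Zincati's real values:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// appSpecificID derives an app-specific UUID from a machine ID:
// HMAC-SHA256 keyed with the machine ID over the app ID, truncated
// to 16 bytes, with UUID version 4 and variant bits set.
func appSpecificID(machineID, appID []byte) []byte {
	mac := hmac.New(sha256.New, machineID)
	mac.Write(appID)
	id := mac.Sum(nil)[:16]
	id[6] = (id[6] & 0x0f) | 0x40 // version 4
	id[8] = (id[8] & 0x3f) | 0x80 // variant: RFC 4122
	return id
}

func main() {
	// Placeholder values; Zincati defines its own fixed app ID.
	machineID, _ := hex.DecodeString("8e21dec4d8114e48adaff747a4e55ce6")
	appID, _ := hex.DecodeString("ffffffffffffffffffffffffffffffff")
	fmt.Println(hex.EncodeToString(appSpecificID(machineID, appID)))
}
```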

Behavior

  • When a node prepares to reboot, Zincati sends a lock request; if a lock is obtained (or still held, since the protocol allows reentrant locks), fleetlock cordons the Kubernetes node and drains it by evicting Pods (except mirror or DaemonSet Pods). See the example request after this list.
  • When a node reaches steady state, Zincati sends an unlock request; if any lock was previously held, it is released and the node is uncordoned (reversing the earlier cordon, or possibly a cordon applied by another system).
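
For context, FleetLock-protocol lock and unlock calls are plain HTTP POSTs (to /v1/pre-reboot and /v1/steady-state respectively) with a fleet-lock-protocol header and a JSON body identifying the agent; a representative lock request (the UUID here is hypothetical):

```
POST /v1/pre-reboot HTTP/1.1
fleet-lock-protocol: true

{
  "client_params": {
    "id": "c988d2509fdf5cdcbed39037c56406fb",
    "group": "default"
  }
}
```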

Caveats

  • Drain immediately creates Eviction objects, which are responsible for evicting Pods while respecting constraints (e.g. PodDisruptionBudgets). A drain sketch follows this list.
  • At the moment, fleetlock does not wait for Pods to be terminated. Similarly, locks are still granted even if draining fails. For now, I'd rather not gate lock grants on successful drains, since unknown termination edge cases could wedge a cluster's auto-upgrades. This is still an improvement over no draining (the previous behavior).
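
A minimal sketch of the cordon-and-drain flow with client-go, assuming a pre-built clientset (function names here are illustrative, not fleetlock's actual ones). The API server enforces PodDisruptionBudgets on evictions, and termination is not awaited, matching the caveats above:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func cordonAndDrain(ctx context.Context, client kubernetes.Interface, node string) error {
	// Cordon: mark the node unschedulable.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// List Pods bound to the node.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		return err
	}

	for _, pod := range pods.Items {
		if skipPod(pod) {
			continue
		}
		// Create an Eviction; the API server respects PodDisruptionBudgets.
		eviction := &policyv1beta1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.CoreV1().Pods(pod.Namespace).Evict(ctx, eviction); err != nil {
			// Log and continue; drain failures do not block the lock grant.
			fmt.Printf("evict %s/%s: %v\n", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}

// skipPod excludes mirror Pods and DaemonSet-managed Pods from eviction.
func skipPod(pod corev1.Pod) bool {
	if _, ok := pod.Annotations[corev1.MirrorPodAnnotationKey]; ok {
		return true
	}
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```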

@dghubble dghubble merged commit 33e54ca into master Aug 14, 2021
@dghubble dghubble deleted the k8s-drain branch August 14, 2021 01:12
@lucab

lucab commented Aug 25, 2021

Sorry that I didn't notice this earlier. If you happen to need more flexibility, you can also freely customize the agent ID through the identity.node_uuid config parameter (documented here).

@dghubble
Member Author

Thanks @lucab. This drain feature would only work for auto-assigned node_uuids at the moment, since it's predictable and folks don't have to configure anything. But good to know about; maybe that becomes useful in the future 👍

@perosb

perosb commented Sep 11, 2021

Thanks for adding this @dghubble

I'm not sure how to report issues in this repo?
There seems to be a mismatch in the ID matching.
It seems SystemUUID is passed to ZincatiID, but it should be MachineID? I tried MachineID in a local build and it matches the node and drains.

Also, nodes_test.go tests with a machine ID as input, which works.
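
For reference, both fields are exposed on the Node's status; a sketch of the distinction (nodeMachineID is a hypothetical helper, and the ID values below are made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeMachineID returns the systemd machine ID the kubelet reports in
// Node status. This, not SystemUUID, is the input Zincati hashes into
// its agent UUID.
func nodeMachineID(node *corev1.Node) string {
	return node.Status.NodeInfo.MachineID
}

func main() {
	node := &corev1.Node{}
	node.Status.NodeInfo.MachineID = "8e21dec4d8114e48adaff747a4e55ce6"  // hypothetical
	node.Status.NodeInfo.SystemUUID = "ec2a8561-d307-4236-b1f9-53f56b2a" // hypothetical
	fmt.Println(nodeMachineID(node)) // use MachineID, not SystemUUID
}
```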
