
Add support for Kubernetes node draining #51

Merged
merged 3 commits into master from k8s-drain on Aug 14, 2021
Conversation

dghubble
Member

@dghubble dghubble commented Aug 9, 2021

  • Zincati requests include a node (i.e. agent) UUID identifier, assigned to match systemd-id128's app-specific UUID behavior. Replicate the same logic used in systemd and Zincati (Rust) so that systemd machine IDs (available via a Kubernetes Node's System UUID info) can be mapped to the ID Zincati would assign. This allows a specific Zincati request to be mapped to the matching Kubernetes node, improving logging.
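
For reference, a minimal sketch of that derivation (systemd's sd_id128_get_machine_app_specific, which Zincati reuses): HMAC-SHA256 keyed with the raw 16-byte machine ID over a fixed application ID, truncated to 16 bytes with UUID v4 version/variant bits set. The machineID and appID constants below are placeholders, not Zincati's real values:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// appSpecificID derives an app-specific UUID from a machine ID:
// HMAC-SHA256 keyed with the machine ID over the app ID, truncated
// to 16 bytes, with UUID version 4 and variant bits set.
func appSpecificID(machineID, appID []byte) []byte {
	mac := hmac.New(sha256.New, machineID)
	mac.Write(appID)
	id := mac.Sum(nil)[:16]
	id[6] = (id[6] & 0x0f) | 0x40 // version 4
	id[8] = (id[8] & 0x3f) | 0x80 // variant: RFC 4122
	return id
}

func main() {
	// Placeholder values; Zincati defines its own fixed app ID.
	machineID, _ := hex.DecodeString("8e21dec4d8114e48adaff747a4e55ce6")
	appID, _ := hex.DecodeString("ffffffffffffffffffffffffffffffff")
	fmt.Println(hex.EncodeToString(appSpecificID(machineID, appID)))
}
```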

Behavior

  • When a node prepares to reboot, Zincati sends a lock request; if a lock is obtained (or still held, since the protocol allows reentrant locks), fleetlock cordons the Kubernetes node and drains it by evicting Pods (except mirror or DaemonSet Pods). See the example request after this list.
  • When a node reaches steady state, Zincati sends an unlock request; if any lock was previously held, it is released and the node is uncordoned (reversing the earlier cordon, or possibly a cordon applied by another system).
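
For context, FleetLock-protocol lock and unlock calls are plain HTTP POSTs (to /v1/pre-reboot and /v1/steady-state respectively) with a fleet-lock-protocol header and a JSON body identifying the agent; a representative lock request (the UUID here is hypothetical):

```
POST /v1/pre-reboot HTTP/1.1
fleet-lock-protocol: true

{
  "client_params": {
    "id": "c988d2509fdf5cdcbed39037c56406fb",
    "group": "default"
  }
}
```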

Caveats

  • Drain immediately creates Eviction objects, which are responsible for evicting Pods while respecting constraints (e.g. PodDisruptionBudgets). A drain sketch follows this list.
  • At the moment, fleetlock does not wait for Pods to be terminated. Similarly, locks are still granted even if draining fails. For now, I'd rather not gate lock grants on successful drains, since unknown termination edge cases could wedge a cluster's auto-upgrades. This is still an improvement over no draining (the previous behavior).
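
A minimal sketch of the cordon-and-drain flow with client-go, assuming a pre-built clientset (function names here are illustrative, not fleetlock's actual ones). The API server enforces PodDisruptionBudgets on evictions, and termination is not awaited, matching the caveats above:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func cordonAndDrain(ctx context.Context, client kubernetes.Interface, node string) error {
	// Cordon: mark the node unschedulable.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// List Pods bound to the node.
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node,
	})
	if err != nil {
		return err
	}

	for _, pod := range pods.Items {
		if skipPod(pod) {
			continue
		}
		// Create an Eviction; the API server respects PodDisruptionBudgets.
		eviction := &policyv1beta1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.CoreV1().Pods(pod.Namespace).Evict(ctx, eviction); err != nil {
			// Log and continue; drain failures do not block the lock grant.
			fmt.Printf("evict %s/%s: %v\n", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}

// skipPod excludes mirror Pods and DaemonSet-managed Pods from eviction.
func skipPod(pod corev1.Pod) bool {
	if _, ok := pod.Annotations[corev1.MirrorPodAnnotationKey]; ok {
		return true
	}
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```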

@dghubble dghubble merged commit 33e54ca into master Aug 14, 2021
@dghubble dghubble deleted the k8s-drain branch August 14, 2021 01:12
@lucab

lucab commented Aug 25, 2021

Sorry that I didn't notice this earlier. If you happen to need more flexibility, you can also freely customize the agent ID through the identity.node_uuid config parameter (documented here).

@dghubble
Member Author

Thanks @lucab. This drain feature would only work for auto-assigned node_uuids at the moment, since it's predictable and folks don't have to configure anything. But good to know about; maybe that becomes useful in the future 👍

@perosb

perosb commented Sep 11, 2021

Thanks for adding this @dghubble

I'm not sure how to report issues in this repo?
There seems to be a mismatch in the ID matching.
It seems SystemUUID is passed to ZincatiID, but it should be MachineID? I tried MachineID in a local build and it matches the node and drains.

Also, nodes_test.go tests with a machine ID as input, which works.
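
For reference, both fields are exposed on the Node's status; a sketch of the distinction (nodeMachineID is a hypothetical helper, and the ID values below are made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// nodeMachineID returns the systemd machine ID the kubelet reports in
// Node status. This, not SystemUUID, is the input Zincati hashes into
// its agent UUID.
func nodeMachineID(node *corev1.Node) string {
	return node.Status.NodeInfo.MachineID
}

func main() {
	node := &corev1.Node{}
	node.Status.NodeInfo.MachineID = "8e21dec4d8114e48adaff747a4e55ce6"  // hypothetical
	node.Status.NodeInfo.SystemUUID = "ec2a8561-d307-4236-b1f9-53f56b2a" // hypothetical
	fmt.Println(nodeMachineID(node)) // use MachineID, not SystemUUID
}
```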
