Draino only applies cordon/drain conditions on restart of pods #118

Open
JosephGJ opened this issue Jun 17, 2021 · 0 comments
JosephGJ commented Jun 17, 2021

We have a custom node condition set up using node-problem-detector, which curls the goss check endpoints on our worker nodes every 60 seconds. We've then set up draino to watch for this node condition; however, when the custom condition GossCheckFailure is True, the Cordon and Drain events don't appear and the conditions aren't applied to the given worker node. If the draino pod is killed and restarted, those conditions are then applied immediately.

If I stop the fluentbit service on a worker, the goss checks will fail and the GossCheckFailure condition is true for that worker.
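
As a sanity check, the condition can be read straight off the node object (a minimal sketch; the node name is taken from the events below, and the JSONPath filter assumes standard kubectl syntax):

$ kubectl get node ip-10-252-18-10.eu-west-1.compute.internal \
    -o jsonpath='{.status.conditions[?(@.type=="GossCheckFailure")]}'

While the goss checks are failing, the returned condition shows status "True", so node-problem-detector is setting the condition as expected.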

Worker node events:

Events:
  Type     Reason           Age                     From                                                        Message
  ----     ------           ----                    ----                                                        -------
  Normal   GossCheckFailed  <invalid> (x2 over 2d)  health-checker, ip-10-252-18-10.eu-west-1.compute.internal  Node condition GossCheckFailure is now: True, reason: GossCheckFailed
  Warning  GossCheckFailed  <invalid> (x2 over 2d)  health-checker, ip-10-252-18-10.eu-west-1.compute.internal

But the cordon and DrainScheduled events are not recorded for the worker.

The draino pod logs don't show anything either, even in debug mode.
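
For completeness, the node-label filter can be checked against the worker by running the same selector draino is configured with (a quick sketch using the label from the helm values further down):

$ kubectl get nodes -l node.kubernetes.io/role=worker

The worker does show up under that selector, which matches the fact that the restarted draino pod cordons it straight away.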

Draino pod logs:

$ kubectl logs draino-cc967c887-c6596 -f -n kube-addons
2021-06-17T12:39:58.118Z        INFO    draino/draino.go:134    web server is running   {"listen": ":10002"}
2021-06-17T12:39:58.240Z        DEBUG   draino/draino.go:187    node labels     {"labels": {"node.kubernetes.io/role":"worker"}}
2021-06-17T12:39:58.241Z        DEBUG   draino/draino.go:196    label expression        {"expr": "metadata.labels['node.kubernetes.io/role'] == 'worker'"}
I0617 12:39:58.241899       1 leaderelection.go:235] attempting to acquire leader lease  kube-addons/draino...
I0617 12:40:15.657534       1 leaderelection.go:245] successfully acquired lease kube-addons/draino
2021-06-17T12:40:15.658Z        INFO    draino/draino.go:235    node watcher is running

When I then kill the draino pod, the cordon and drain events appear immediately.
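
Killing the pod here just means deleting it so that the Deployment recreates it, e.g.:

$ kubectl delete pod draino-cc967c887-c6596 -n kube-addons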

Worker node events:

Events:
  Type     Reason           Age                     From                                                        Message
  ----     ------           ----                    ----                                                        -------
  Normal   GossCheckFailed  2m54s (x2 over 2d)      health-checker, ip-10-252-18-10.eu-west-1.compute.internal  Node condition GossCheckFailure is now: True, reason: GossCheckFailed
  Warning  GossCheckFailed  <invalid> (x5 over 2d)  health-checker, ip-10-252-18-10.eu-west-1.compute.internal
  Warning  CordonStarting   <invalid>               draino                                                      Cordoning node
  Warning  CordonSucceeded  <invalid>               draino                                                      Cordoned node
  Warning  DrainScheduled   <invalid>               draino                                                      Will drain node after 2021-06-17T12:52:18.192872059Z

New draino pod logs:

$ kubectl logs draino-cc967c887-vk9gj -f -n kube-addons
2021-06-17T12:51:49.286Z        INFO    draino/draino.go:134    web server is running   {"listen": ":10002"}
2021-06-17T12:51:49.288Z        DEBUG   draino/draino.go:187    node labels     {"labels": {"node.kubernetes.io/role":"worker"}}
2021-06-17T12:51:49.288Z        DEBUG   draino/draino.go:196    label expression        {"expr": "metadata.labels['node.kubernetes.io/role'] == 'worker'"}
I0617 12:51:49.382835       1 leaderelection.go:235] attempting to acquire leader lease  kube-addons/draino...
I0617 12:52:06.918002       1 leaderelection.go:245] successfully acquired lease kube-addons/draino
2021-06-17T12:52:06.918Z        INFO    draino/draino.go:235    node watcher is running
2021-06-17T12:52:07.192Z        DEBUG   kubernetes/eventhandler.go:263  Cordoning       {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z        INFO    kubernetes/eventhandler.go:272  Cordoned        {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z        DEBUG   kubernetes/eventhandler.go:296  Scheduling drain        {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z        INFO    kubernetes/eventhandler.go:308  Drain scheduled         {"node": "ip-10-252-18-10.eu-west-1.compute.internal", "after": "2021-06-17T12:52:18.192Z"}
2021-06-17T12:52:18.193Z        INFO    kubernetes/drainSchedule.go:154 Drained {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}

These are the configured helm chart values:

$ helm get values draino  -n kube-addons
USER-SUPPLIED VALUES:
conditions:
- GossCheckFailure
extraArgs:
  debug: true
  drain-buffer: 10s
  dry-run: true
  max-grace-period: 10s
  namespace: kube-addons
  node-label: node.kubernetes.io/role=worker
image:
  repository: <our-registry>planetlabs/draino
  tag: 450a853
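
To double-check how the chart turns these values into draino's command line, the rendered container arguments can be inspected; a minimal sketch, assuming the chart passes them via args (they may sit under command instead, depending on the chart version):

$ kubectl get deployment draino -n kube-addons \
    -o jsonpath='{.spec.template.spec.containers[0].args}'

GossCheckFailure should appear there as a positional argument alongside the --dry-run, --debug, --node-label and other flags from extraArgs.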

Any ideas as to why this is happening?
