Draino only applies cordon/drain conditions on restart of pods #118

Open
JosephGJ opened this issue Jun 17, 2021 · 0 comments
JosephGJ commented Jun 17, 2021

We have a custom node condition set up using node-problem-detector, which curls the goss check endpoints on our worker nodes every 60 seconds. We've then set up draino to watch for this node condition; however, when the custom condition GossCheckFailure is True, the Cordon and Drain events don't appear and the conditions aren't applied to the given worker node. If the draino pod is killed and restarted, those conditions are then applied immediately.

If I stop the fluentbit service on a worker, the goss checks will fail and the GossCheckFailure condition is true for that worker.
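
As a sanity check, the condition can be read straight off the node object (a minimal sketch; the node name is taken from the events below, and the JSONPath filter assumes standard kubectl syntax):

$ kubectl get node ip-10-252-18-10.eu-west-1.compute.internal \
    -o jsonpath='{.status.conditions[?(@.type=="GossCheckFailure")]}'

While the goss checks are failing, the returned condition shows status "True", so node-problem-detector is setting the condition as expected.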

Worker node events:

Events:
  Type     Reason           Age                     From                                                        Message
  ----     ------           ----                    ----                                                        -------
  Normal   GossCheckFailed  <invalid> (x2 over 2d)  health-checker, ip-10-252-18-10.eu-west-1.compute.internal  Node condition GossCheckFailure is now: True, reason: GossCheckFailed
  Warning  GossCheckFailed  <invalid> (x2 over 2d)  health-checker, ip-10-252-18-10.eu-west-1.compute.internal

But the cordon and DrainScheduled events are not recorded for the worker.

The draino pod logs don't show anything either, even in debug mode.
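
For completeness, the node-label filter can be checked against the worker by running the same selector draino is configured with (a quick sketch using the label from the helm values further down):

$ kubectl get nodes -l node.kubernetes.io/role=worker

The worker does show up under that selector, which matches the fact that the restarted draino pod cordons it straight away.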

Draino pod logs:

$ kubectl logs draino-cc967c887-c6596 -f -n kube-addons
2021-06-17T12:39:58.118Z        INFO    draino/draino.go:134    web server is running   {"listen": ":10002"}
2021-06-17T12:39:58.240Z        DEBUG   draino/draino.go:187    node labels     {"labels": {"node.kubernetes.io/role":"worker"}}
2021-06-17T12:39:58.241Z        DEBUG   draino/draino.go:196    label expression        {"expr": "metadata.labels['node.kubernetes.io/role'] == 'worker'"}
I0617 12:39:58.241899       1 leaderelection.go:235] attempting to acquire leader lease  kube-addons/draino...
I0617 12:40:15.657534       1 leaderelection.go:245] successfully acquired lease kube-addons/draino
2021-06-17T12:40:15.658Z        INFO    draino/draino.go:235    node watcher is running

When I then kill the draino pod, the cordon and drain events appear immediately.
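
Killing the pod here just means deleting it so that the Deployment recreates it, e.g.:

$ kubectl delete pod draino-cc967c887-c6596 -n kube-addons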

Worker node events:

Events:
  Type     Reason           Age                     From                                                        Message
  ----     ------           ----                    ----                                                        -------
  Normal   GossCheckFailed  2m54s (x2 over 2d)      health-checker, ip-10-252-18-10.eu-west-1.compute.internal  Node condition GossCheckFailure is now: True, reason: GossCheckFailed
  Warning  GossCheckFailed  <invalid> (x5 over 2d)  health-checker, ip-10-252-18-10.eu-west-1.compute.internal
  Warning  CordonStarting   <invalid>               draino                                                      Cordoning node
  Warning  CordonSucceeded  <invalid>               draino                                                      Cordoned node
  Warning  DrainScheduled   <invalid>               draino                                                      Will drain node after 2021-06-17T12:52:18.192872059Z

New draino pod logs:

$ kubectl logs draino-cc967c887-vk9gj -f -n kube-addons
2021-06-17T12:51:49.286Z        INFO    draino/draino.go:134    web server is running   {"listen": ":10002"}
2021-06-17T12:51:49.288Z        DEBUG   draino/draino.go:187    node labels     {"labels": {"node.kubernetes.io/role":"worker"}}
2021-06-17T12:51:49.288Z        DEBUG   draino/draino.go:196    label expression        {"expr": "metadata.labels['node.kubernetes.io/role'] == 'worker'"}
I0617 12:51:49.382835       1 leaderelection.go:235] attempting to acquire leader lease  kube-addons/draino...
I0617 12:52:06.918002       1 leaderelection.go:245] successfully acquired lease kube-addons/draino
2021-06-17T12:52:06.918Z        INFO    draino/draino.go:235    node watcher is running
2021-06-17T12:52:07.192Z        DEBUG   kubernetes/eventhandler.go:263  Cordoning       {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z        INFO    kubernetes/eventhandler.go:272  Cordoned        {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z        DEBUG   kubernetes/eventhandler.go:296  Scheduling drain        {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}
2021-06-17T12:52:07.192Z        INFO    kubernetes/eventhandler.go:308  Drain scheduled         {"node": "ip-10-252-18-10.eu-west-1.compute.internal", "after": "2021-06-17T12:52:18.192Z"}
2021-06-17T12:52:18.193Z        INFO    kubernetes/drainSchedule.go:154 Drained {"node": "ip-10-252-18-10.eu-west-1.compute.internal"}

These are the configured helm chart values:

$ helm get values draino  -n kube-addons
USER-SUPPLIED VALUES:
conditions:
- GossCheckFailure
extraArgs:
  debug: true
  drain-buffer: 10s
  dry-run: true
  max-grace-period: 10s
  namespace: kube-addons
  node-label: node.kubernetes.io/role=worker
image:
  repository: <our-registry>planetlabs/draino
  tag: 450a853
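
To double-check how the chart turns these values into draino's command line, the rendered container arguments can be inspected; a minimal sketch, assuming the chart passes them via args (they may sit under command instead, depending on the chart version):

$ kubectl get deployment draino -n kube-addons \
    -o jsonpath='{.spec.template.spec.containers[0].args}'

GossCheckFailure should appear there as a positional argument alongside the --dry-run, --debug, --node-label and other flags from extraArgs.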

Any ideas as to why this is happening?
