
Asking the experts: how does draino work? #78

Open
ghost opened this issue Jun 28, 2020 · 4 comments
Comments

@ghost

ghost commented Jun 28, 2020

Dear experts, I'd like to ask a question about NPD, and I hope you can correct me if my understanding is wrong. Draino is a remediation system that can be used together with NPD. For example, when a node hits a kernel deadlock, or its CPU or disk fails, NPD picks up that information. To prevent the node from being reused, draino marks the node as unschedulable, so no other containers are assigned to it, and evicts the pods on the node. Is that understanding correct? And to realize this, is draino's job to judge whether the information NPD reports is a kernel deadlock or a CPU/disk problem, and, when such a rule triggers, to cordon and drain the node? After the eviction, can the remaining nodes be guaranteed to schedule the pods from the drained node, or is the autoscaler needed for that? Is this understanding correct?

@jacobstr
Contributor

Yes, your understanding is correct:

  • NPD sets node conditions. By themselves, conditions do not cause a node to be drained and cordoned. Many are informational.
  • Draino allows one to configure conditions that should result in the node being drained (a minimal deployment sketch follows this list) - what you described as 👇:

To prevent the node from being reused, draino marks the node as unschedulable, so no other containers are assigned to it, and evicts the pods on the node.

  • Regular Kubernetes mechanisms will attempt to schedule your pod to another node (e.g. because only 2/3 pod replicas are now running).
  • As a result of the node being drained, the autoscaler will typically identify an underutilized node and delete it. This threshold is configurable in the autoscaler. Draino does not itself destroy nodes.
  • As a result of there being fewer nodes, the autoscaler will typically create a new one for you if there is not enough CPU/Memory in the remaining nodes to schedule the pods that were evicted when the node was drained.
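
To make the second bullet concrete, here is a minimal deployment sketch. The command line mirrors the example quoted later in this thread; the image tag, namespace, and service account name are placeholders rather than anything this thread prescribes:

    # Minimal sketch of running draino, assuming NPD is already setting the
    # KernelDeadlock condition. Image tag, namespace and RBAC are placeholders.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: draino
      namespace: kube-system
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: draino
      template:
        metadata:
          labels:
            app: draino
        spec:
          serviceAccountName: draino   # needs permission to patch nodes and evict pods
          containers:
          - name: draino
            image: planetlabs/draino:latest   # placeholder tag
            # The flags control which pods may be evicted; the positional
            # argument names the node condition type that triggers cordon + drain.
            command: [/draino, --debug, --evict-daemonset-pods, --evict-emptydir-pods, --evict-unreplicated-pods, KernelDeadlock]

With something like that running, a node whose KernelDeadlock condition turns True gets cordoned and drained by draino, and the scheduler and cluster-autoscaler behaviour described in the last two bullets takes over from there.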

@ghost
Author

ghost commented Jul 1, 2020

Yes, your understanding is correct:

* NPD sets node conditions. By themselves, conditions do not cause a node to be `drained` and `cordoned`. Many are informational.

* Draino allows one to configure conditions that should result in the node being `drained` - what you described as 👇:

To prevent the node from being reused, draino marks the node as unschedulable, so no other containers are assigned to it, and evicts the pods on the node.

* Regular Kubernetes mechanisms will attempt to schedule your `pod` to another `node` (e.g. because only 2/3 pod replicas are now running).

* As a result of the node being drained, the autoscaler will typically identify an underutilized node and delete it. This threshold is configurable in the [autoscaler](https://github.com/kubernetes/autoscaler/blob/1434d14ec768cf099a1f3d8f615854bf1361e484/cluster-autoscaler/main.go#L101). Draino does not itself destroy nodes.

* As a result of there being fewer nodes, the autoscaler will typically create a new one for you if there is not enough CPU/Memory in the remaining nodes to schedule the pods that were `evicted` when the node was `drained`.

I'm sorry, I'd like to know a bit more.
As mentioned, NPD itself cannot achieve self-healing; it can only surface the relevant information, and the self-healing part may need draino to implement. I have deployed draino as a single pod running on one node of the cluster. How does draino get hold of what NPD reports? There are many event types, and some of them do not actually make a node unavailable, so how does draino judge which reports belong to, say, a kernel deadlock, a failed CPU, or a broken disk? And once draino judges the problem is serious enough, does it mark the node as unschedulable and carry out the drain automatically?
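
For what it's worth, a minimal sketch of the hand-off as described above: NPD's `permanent` rules write a condition onto the Node object itself (its `temporary` rules only emit Events), and draino watches Node objects through the API server rather than talking to NPD directly, reacting only to the condition types named on its command line. The values below are illustrative, not taken from a real node:

    # Illustrative Node status fragment: what an NPD "permanent" rule writes
    # and what draino reads. The reason/message values here are made up.
    status:
      conditions:
      - type: KernelDeadlock        # matched against draino's positional arguments
        status: "True"
        reason: DockerHung
        message: "task docker:1234 blocked for more than 120 seconds."

Everything reported only as an Event (the `temporary` rules) does not, as far as I can tell, trigger draino at all, so the judgement of "is this serious enough to drain" lives in which conditions you configure NPD to set and which of those you pass to draino.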

@ghost ghost closed this as completed Jul 1, 2020
@ghost ghost reopened this Jul 1, 2020
@tarunptala

tarunptala commented Feb 10, 2021

Hey,

Can someone help me test or make NPD and draino work together for my custom condition? I have the following in the ConfigMap for NPD:

    docker-monitor.json: |
        {
            "plugin": "journald", 
            "pluginConfig": {
                    "source": "docker"
            },
            "logPath": "/var/log/journal", 
            "lookback": "5m",
            "bufferSize": 10,
            "source": "docker-monitor",
            "conditions": [],
            "rules": [              
                    {
                            "type": "temporary", 
                            "reason": "CorruptDockerImage", 
                            "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" 
                    }
            ]
        }
    kernel-monitor.json: |
      {
          "plugin": "journald", 
          "pluginConfig": {
                  "source": "kernel"
          },
          "logPath": "/var/log/journal", 
          "lookback": "5m",
          "bufferSize": 10,
          "source": "kernel-monitor",
          "conditions": [                 
                  {
                          "type": "KernelDeadlock", 
                          "reason": "KernelHasNoDeadlock", 
                          "message": "kernel has no deadlock"  
                  },
                  {
                          "type": "Ready",
                          "reason": "NodeStatusUnknown",
                          "message": "Kubelet stopped posting node status"
                  }
          ],
          "rules": [
                  {
                          "type": "temporary",
                          "reason": "OOMKilling",
                          "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
                  },
                  {
                          "type": "temporary",
                          "reason": "TaskHung",
                          "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                  },
                  {
                          "type": "temporary",
                          "reason": "UnregisterNetDevice",
                          "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                  },
                  {
                          "type": "temporary",
                          "reason": "KernelOops",
                          "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                  },
                  {
                          "type": "temporary/permanent",
                          "condition": "NodeStatusUnknown",
                          "reason": "NodeStatusUnknown",
                          "pattern": "Kubelet stopped posting node status"
                  },
                  {
                          "type": "temporary",
                          "reason": "KernelOops",
                          "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                  },
                  {
                          "type": "permanent",
                          "condition": "KernelDeadlock",
                          "reason": "AUFSUmountHung",
                          "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
                  },
                  {
                          "type": "permanent",
                          "condition": "KernelDeadlock",
                          "reason": "DockerHung",
                          "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                  }
          ]
      }

I'm mainly concerned about the following, as it happens quite frequently for us.

                 {
                          "type": "temporary/permanent",
                          "condition": "NodeStatusUnknown",
                          "reason": "NodeStatusUnknown",
                          "pattern": "Kubelet stopped posting node status"
                  }

I have draino configured like this:

- command: [/draino, --debug, --evict-daemonset-pods, --evict-emptydir-pods, --evict-unreplicated-pods, KernelDeadlock, NodeStatusUnknown]

Is that how it works?
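
For reference, a hedged reading of that command, spelled out as comments; the flag descriptions are my understanding of draino's usage text rather than a quote of it:

    - command:
      - /draino
      - --debug
      - --evict-daemonset-pods      # also evict pods created by a DaemonSet
      - --evict-emptydir-pods       # also evict pods that use emptyDir (local) storage
      - --evict-unreplicated-pods   # also evict pods not managed by a controller
      # The positional arguments are node condition types: draino cordons and
      # drains a node when any of them is True on the Node object, so they have
      # to match conditions that the NPD rules above actually set.
      - KernelDeadlock
      - NodeStatusUnknown

So in outline, yes, that is how it works; the part to double-check is whether the `temporary/permanent` rule really results in a `NodeStatusUnknown` condition being set, since conditions normally come from `permanent` rules.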
