
NodeProblem API #23028

Closed
dchen1107 opened this issue Mar 16, 2016 · 28 comments
@dchen1107
Member

We should introduce NodeProblem to the Kubernetes Node API, so that different daemons on the node can report machine health issues to the control plane through the Kubelet. The machine health issues include bad CPU, bad memory, kernel deadlocks, corrupted filesystems, etc. Once the control plane has visibility into those issues, we can discuss the remedy system.

@dchen1107 dchen1107 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. team/api labels Mar 16, 2016
@dchen1107 dchen1107 added this to the next-candidate milestone Mar 16, 2016
@gmarek
Contributor

gmarek commented Mar 21, 2016

+1 - I'd like to be kept up to date with this effort.

@liggitt
Member

liggitt commented Mar 21, 2016

As we build integrations to the node API from other components, don't forget that things using the kubelet API may need to be able to provide auth (client cert, bearer token, etc.).
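
For illustration only, here is a minimal Go sketch of an out-of-tree component calling a kubelet HTTPS endpoint with a bearer token; the node address, port, token path, and the decision to skip TLS verification are all assumptions, not anything prescribed in this thread.

```go
// Sketch: calling a kubelet HTTPS endpoint with a bearer token.
// The node address, port (10250), and token path are assumptions for illustration.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

func main() {
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{
		// A real integration should verify the kubelet's serving certificate instead.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, _ := http.NewRequest("GET", "https://10.0.0.1:10250/healthz", nil)
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```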

@alex-mohr
Contributor

FWIW, it looks like on GCE nodes can report as healthy even if the route controller has not yet fully programmed their route into GCE's network. Unsure if that warrants a separate issue, or whether it should be solved on the control-plane side by unioning node-reported readiness with environment-of-node readiness?

@gmarek
Contributor

gmarek commented Mar 22, 2016

@alex-mohr - IIUC this is an umbrella issue for a v1.3 feature. The GCE route thingy is a bug that we need to fix sooner. I filed #23327

@mikedanese
Member

@klizhentas

@Random-Liu
Member

Random-Liu commented Apr 18, 2016

Ref #23028, #24295

@dchen1107 dchen1107 modified the milestones: v1.3, next-candidate Apr 28, 2016
@dchen1107
Member Author

We already have Event and NodeCondition today to report some issues on the node. A Taint API object has also been proposed and is under review. I don't want to introduce another API object blindly here. Instead, we should consolidate those objects or have a clear definition for each of them.

@Random-Liu and I had an API-related discussion with @bgrant0607 and @davidopp. We settled on:

  • No new API object is introduced for now.
  • NodeProblemDetector (the newly introduced daemon here) will use either Event or NodeCondition to carry a node problem. Most likely a transient problem will be reported as an event, and a permanent problem as a condition, like what we have today for out-of-resource (memory and disk); see the sketch after this list.
  • In the long term, the scheduler will check only taints, not conditions at all. A separate controller will be introduced to the ecosystem to convert problem-related events and conditions into taints. But this is out of scope for the 1.3 release.
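
To make the Event-vs-NodeCondition split above concrete, here is a minimal Go sketch using today's k8s.io/api/core/v1 types (which post-date this discussion); the KernelDeadlock condition type, the reasons, and the messages are illustrative assumptions rather than the agreed design.

```go
// Sketch only: how a detector might shape a transient problem as an Event
// and a permanent problem as a NodeCondition. Types are from k8s.io/api/core/v1;
// the "KernelDeadlock" condition type and the reasons are illustrative assumptions.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Transient problem: report as an Event attached to the Node object.
	evt := v1.Event{
		ObjectMeta:     metav1.ObjectMeta{GenerateName: "node-problem-", Namespace: "default"},
		InvolvedObject: v1.ObjectReference{Kind: "Node", Name: "node-1"},
		Type:           v1.EventTypeWarning,
		Reason:         "OOMKilling", // illustrative
		Message:        "Kill process 1234 (stress) score 999 or sacrifice child",
		Source:         v1.EventSource{Component: "node-problem-detector", Host: "node-1"},
		FirstTimestamp: metav1.Now(),
		LastTimestamp:  metav1.Now(),
		Count:          1,
	}

	// Permanent problem: report as a NodeCondition in NodeStatus.
	cond := v1.NodeCondition{
		Type:               v1.NodeConditionType("KernelDeadlock"), // illustrative
		Status:             v1.ConditionTrue,
		Reason:             "AUFSUmountHung", // illustrative
		Message:            "kernel: task umount.aufs:21568 blocked for more than 120 seconds.",
		LastHeartbeatTime:  metav1.Now(),
		LastTransitionTime: metav1.Now(),
	}

	fmt.Printf("event: %s: %s\ncondition: %s=%s\n", evt.Reason, evt.Message, cond.Type, cond.Status)
}
```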

@Random-Liu and I had several design discussions today. Here is the current plan for the 1.3 release:

  • We are going to provide a very simple and dumb NodeProblemDetector as a reference implementation.
  • The detector will run as a separate daemon on the node.
  • The daemon will only detect issues from the kernel log (a rough sketch of that kind of matching follows at the end of this comment). :-)
  • The daemon will use the watch API to propagate problem events or conditions to upstream layers. We initially had some concern about the API server's scalability, since this design introduces another watch channel per node. After talking to @lavalamp, we think it is acceptable for the 1.3 release.
  • In the long term, more problem daemons might be introduced to the node to detect different failures (for example, hardware failures), and the API server's scalability might then become a real problem. We will reconsider the architecture of the problem-reporting pipeline at that point, but that is an implementation detail from the users' perspective. We already have several alternative proposals to address the issue.

@Random-Liu is working on the detailed design for this. Thanks!
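
As a rough illustration of the kernel-log-only detection planned above (and not the actual node-problem-detector code), a detector could match kernel log lines against regex rules and classify each hit as a transient event or a permanent condition; the rules and sample log lines below are made up.

```go
// Rough sketch of regex-based kernel log matching; not the real
// node-problem-detector implementation. Rules and log lines are examples only.
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

type problemKind int

const (
	transientEvent     problemKind = iota // would be reported as an Event
	permanentCondition                    // would be reported as a NodeCondition
)

type rule struct {
	kind    problemKind
	reason  string
	pattern *regexp.Regexp
}

var rules = []rule{
	{transientEvent, "OOMKilling", regexp.MustCompile(`Kill process \d+ \(.+\) score \d+ or sacrifice child`)},
	{permanentCondition, "KernelDeadlock", regexp.MustCompile(`task \S+ blocked for more than \d+ seconds`)},
}

func main() {
	kernelLog := `[  100.1] Kill process 1234 (stress) score 999 or sacrifice child
[  200.2] INFO: task umount.aufs:21568 blocked for more than 120 seconds.`

	scanner := bufio.NewScanner(strings.NewReader(kernelLog))
	for scanner.Scan() {
		line := scanner.Text()
		for _, r := range rules {
			if r.pattern.MatchString(line) {
				fmt.Printf("matched %q -> reason=%s kind=%d\n", line, r.reason, r.kind)
			}
		}
	}
}
```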

@dchen1107
Member Author

cc/ @jlowdermilk for our later discussion on GKE node health tracking.

@davidopp
Member

The daemon will use the watch API to propagate problem events or conditions to upstream layers.

What is the daemon watching? It sounds like the daemon is write-only.

BTW, an advantage of NodeCondition over Event is that NodeCondition can be piggybacked on existing NodeStatus updates (though of course it makes the NodeStatus object larger). But anyway, I agree events are reasonable if you are essentially just proxying kernel log messages.

@Random-Liu
Member

Random-Liu commented Apr 29, 2016

What is the daemon watching? It sounds like the daemon is write-only.

@davidopp The daemon is watching the current NodeConditions (part of NodeStatus), because to update a NodeCondition it needs to know the newest conditions.
However, a simple Get before Update, or a Patch, may also work; see the sketch below. :)
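
For reference, here is a minimal sketch of both approaches mentioned above (Get before UpdateStatus, and a Patch of the status subresource) using a modern client-go, whose API differs from the 2016-era client; the node name and the KernelDeadlock condition are illustrative assumptions.

```go
// Sketch of "Get before update" and "Patch" for a node condition, using a
// modern client-go. Node name and the KernelDeadlock condition are illustrative.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Option 1: Get the latest Node, modify its conditions, then UpdateStatus.
	node, err := client.CoreV1().Nodes().Get(ctx, "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	node.Status.Conditions = append(node.Status.Conditions, v1.NodeCondition{
		Type:               v1.NodeConditionType("KernelDeadlock"),
		Status:             v1.ConditionTrue,
		Reason:             "AUFSUmountHung",
		LastHeartbeatTime:  metav1.Now(),
		LastTransitionTime: metav1.Now(),
	})
	if _, err := client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Option 2: Patch the status subresource directly (strategic merge patch).
	patch := []byte(`{"status":{"conditions":[{"type":"KernelDeadlock","status":"True","reason":"AUFSUmountHung"}]}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, "node-1", types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status"); err != nil {
		panic(err)
	}
	fmt.Println("node condition updated")
}
```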

@davidopp
Member

I am not suggesting using NodeStatus here; I just wanted to mention it.

@Random-Liu
Member

@davidopp Updated the comment above to make it less ambiguous.

@gmarek
Contributor

gmarek commented Apr 29, 2016

Events as in api.Event? I thought those are for human consumption only. cc @bgrant0607

@zmerlynn
Member

Is this landing for v1.3? I was assuming it would be the solution for #24295, but if that's not the case, that bug will need a different mitigation.

@Random-Liu
Member

Random-Liu commented May 11, 2016

@zmerlynn The node-problem-detector will only monitor the kernel log for 1.3. We may support simple Docker problem detection in the future, and the node-problem-detector is easy to extend to do that.
Is there any symptom in the kernel log for #24295?
[ 6882.619004] aufs au_opts_verify:1570:docker[6990]: dirperm1 breaks the protection by the permission bits on the lower branch
seems quite noisy to me; it fills up my kernel log. :)

@zmerlynn
Member

Nope. And yeah, that log is all over the place for basically any running container.

@dchen1107
Member Author

Re #24295: I updated it at #24295 (comment). That issue shouldn't be handled by the node problem detector.

@Random-Liu
Member

FYI, the first version of the node problem detector is ready for review: kubernetes/node-problem-detector#1

@davidopp
Member

Since kubernetes/node-problem-detector#1 has been merged, can we remove the v1.3 milestone?

@davidopp
Member

davidopp commented Jun 3, 2016

ping

@Random-Liu
Member

Random-Liu commented Jun 3, 2016

@davidopp This is almost done; only a few things are left:

  1. Documentation for node problem detector: Write Readme.md for NodeProblemDetector node-problem-detector#4, Write Readme.md for KernelMonitor node-problem-detector#5

The PRs for 1) and 2) are under review, and 3) and 4) will be sent out soon. 4) should not be a blocker for this issue. :)

@dchen1107
Member Author

I removed the control-plane team label, since this issue only covers node-level changes to make problems visible to the upstream layers. There is no remedy controller built for the 1.3 release.

@Random-Liu
Member

@dchen1107 All things are done. Can we close this one?

@matchstick
Contributor

Closing as we scrub bugs for 1.3. If it needs to be re-opened please do so for next-candidate.

@pwittrock
Member

@Random-Liu
@dchen1107

Would you provide an update on the status of the documentation for this feature, and add any PRs as they are created?

Not Started / In Progress / In Review / Done

Expected Merge Time

Thanks
@pwittrock

@Random-Liu
Member

Random-Liu commented Jun 17, 2016

@pwittrock I have documentation for this feature, but I'm not sure whether or where we should put it in kubernetes.io.

@Random-Liu
Member

@pwittrock @dchen1107 The document is In Review now: kubernetes/website#696.

@Random-Liu
Member

@pwittrock Doc Status Upgrade: The doc is Done now. :)
