NodeProblem API #23028
Comments
+1 - I'd like to be kept up to date with this effort.
As we build integrations to the node API from other components, don't forget the potential need for things using the kubelet API to be able to provide auth (client cert or bearer token, etc).
FWIW, looks like on GCE that nodes can report as healthy even if the route controller has not yet fully programmed their route into GCE's network. Unsure if that warrants a separate issue or should be solved control-plane side by unioning node-reported readiness with environment-of-node-readiness?
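A rough sketch of what "unioning" the two readiness signals could look like on the control-plane side. Everything here is an assumption for illustration: the function name, the `RouteProgrammed` check name, and the idea that environment checks arrive as a simple name-to-boolean map are hypothetical, not an existing Kubernetes API.

```python
from typing import Dict, List, Tuple


def effective_readiness(
    node_reports_ready: bool, environment_checks: Dict[str, bool]
) -> Tuple[bool, List[str]]:
    """Combine node-reported readiness with environment-side checks.

    Returns (ready, failing), where `failing` names every signal that is
    currently false. The node counts as ready only when nothing is failing.
    """
    # Hypothetical environment checks, e.g. {"RouteProgrammed": False}.
    failing = sorted(name for name, ok in environment_checks.items() if not ok)
    if not node_reports_ready:
        failing.insert(0, "NodeReady")
    return (not failing, failing)


# A node that reports healthy but whose route is not yet programmed.
print(effective_readiness(True, {"RouteProgrammed": False}))
```

With this shape, the node in the GCE case above would be treated as not ready until the route check flips to true, without changing what the node itself reports.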
@alex-mohr - IIUC this is an umbrella issue for a v1.3 feature. The GCE route thingy is a bug that we need to fix sooner. I filed #23327
We already have Event and NodeCondition today to report some issues on the node. A Taint API object has also been proposed and is under review. I don't want to introduce another API object blindly here. Instead, we should consolidate those objects or have a clear definition for each of them. @Random-Liu and I had an API-related discussion with @bgrant0607 and @davidopp. We settled on:
@Random-Liu and I had several design discussions today. Here is the current plan for the 1.3 release
@Random-Liu is working on the detailed design for this. Thanks!
cc/ @jlowdermilk for our later discussion on GKE node health tracking.
What is the daemon watching? It sounds like the daemon is write-only. BTW an advantage of NodeCondition over Event is that NodeCondition can be piggybacked on existing NodeStatus updates (though of course it makes the NodeStatus object larger). But anyway, I agree Events are reasonable if you are essentially just proxying kernel log messages.
@davidopp The daemon is watching the current NodeCondition (part of NodeStatus), because it needs to know the newest NodeCondition in order to update it.
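To illustrate why the daemon has to read before it writes, here is a minimal read-modify-write sketch over a list of conditions. The field names (`type`, `status`, `reason`, `message`, `lastTransitionTime`) follow the NodeCondition JSON shape; the `upsert_condition` helper, the `KernelDeadlock` condition type, and the placeholder timestamps are assumptions made for the example, not the detector's actual code.

```python
from copy import deepcopy
from typing import Dict, List

Condition = Dict[str, str]


def upsert_condition(
    current: List[Condition], new: Condition, now: str
) -> List[Condition]:
    """Return a copy of `current` with `new` merged in by condition type.

    Conditions owned by other components (such as the kubelet's Ready
    condition) are carried over untouched, which is why the caller needs
    the latest list before writing.
    """
    merged = deepcopy(current)
    for cond in merged:
        if cond["type"] == new["type"]:
            # Move the transition time only when the status actually flips.
            if cond["status"] != new["status"]:
                cond["lastTransitionTime"] = now
            cond["status"] = new["status"]
            cond["reason"] = new.get("reason", "")
            cond["message"] = new.get("message", "")
            return merged
    # First report of this condition type: append it.
    merged.append({**new, "lastTransitionTime": now})
    return merged


# Hypothetical latest state as read from NodeStatus.
latest = [{"type": "Ready", "status": "True", "lastTransitionTime": "t0"}]
problem = {"type": "KernelDeadlock", "status": "True", "reason": "TaskHung"}
updated = upsert_condition(latest, problem, now="t1")
print([c["type"] for c in updated])
```

A write-only daemon that sent just its own condition as the full list would drop `Ready` here; merging against the watched state avoids that.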
I am not suggesting to use NodeStatus here, I just wanted to mention it.
@davidopp Updated the comment above to make it less ambiguous.
Events as |
Is this landing for v1.3? I was assuming it as a solution for #24295, but if that's not the case, that bug will need a different mitigation.
@zmerlynn The node-problem-detector will only monitor the kernel log for 1.3. We may support simple Docker problem detection in the future, and the node-problem-detector is easy to extend to do that.
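As a sketch of what "monitoring the kernel log" can mean in practice, the snippet below matches log lines against a small rule table and emits one event-like record per match. The rule names, regular expressions, and sample log lines are illustrative assumptions, not the node-problem-detector's real configuration or output format.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical rules: each maps a regex over one kernel log line to a reason.
RULES = [
    {
        "reason": "TaskHung",
        "pattern": re.compile(r"task \S+ blocked for more than \d+ seconds"),
    },
    {
        "reason": "OOMKilling",
        "pattern": re.compile(r"Killed process \d+"),
    },
]


def detect_problems(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one {"reason", "message"} record per line that matches a rule."""
    problems = []
    for line in lines:
        for rule in RULES:
            if rule["pattern"].search(line):
                problems.append({"reason": rule["reason"], "message": line.strip()})
                break  # first matching rule wins for a given line
    return problems


# Made-up sample lines in the style of kernel log output.
sample = [
    "[ 100.0] task umount.aufs:21568 blocked for more than 120 seconds.",
    "[ 200.0] Killed process 1012 (memhog) total-vm:41820kB",
    "[ 300.0] eth0: link up",
]
for problem in detect_problems(sample):
    print(problem["reason"], "-", problem["message"])
```

Extending this to another log source would mostly be a matter of adding rules and feeding in a different stream of lines, which is the kind of extension mentioned above.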
Nope. And yeah, that log is all over the place for basically any running container.
Re #24295, I updated it at #24295 (comment). That issue shouldn't be handled by the node problem detector.
FYI, the first version of the node problem detector is ready for review: kubernetes/node-problem-detector#1
Since kubernetes/node-problem-detector#1 has been merged, can we remove the v1.3 milestone?
ping
@davidopp This is almost done, only a few things are left:
The PRs for 1) and 2) are under review, and 3) and 4) will be sent out soon. And 4) should not be a blocker for this issue. :)
I removed the control-plane team label since this one only covers node-level changes to make the problem visible to the upstream layers. There is no remedy controller built for the 1.3 release.
@dchen1107 All things are done. Can we close this one?
Closing as we scrub bugs for 1.3. If it needs to be re-opened please do so for next-candidate.
Would you provide an update on the status of the documentation for this feature, and add any PRs as they are created? Please include the status (Not Started / In Progress / In Review / Done) and the expected merge time. Thanks
@pwittrock I have documentation for this feature, but I'm not sure whether or where we should put it in kubernetes.io.
@pwittrock @dchen1107 The document is In Review now: kubernetes/website#696.
@pwittrock Doc status update: the doc is Done now. :)
We should introduce NodeProblem to the Kubernetes Node API so that different daemons on the node can report machine health issues to the control plane through the Kubelet. Such issues include bad CPU, bad memory, kernel deadlocks, corrupted filesystems, etc. Once the control plane has visibility into those issues, we can discuss the remedy system.