Work around KubeProxy panic that makes the Node's network not work #25543
Comments
Does kubelet monitor kube-proxy health? |
Kube-proxy today runs as a static pod (daemonset); kubelet treats it the same as other static pods. Yes, kubelet reports the kube-proxy pod status back to the master, and that is all. The detailed information on kube-proxy being in a crashloop is at #24295 (comment) |
cc/ @philips Brandon, this issue explains my long-standing position against running critical daemons like kubelet and kube-proxy as docker containers. I am ok with packaging them into docker images and running them as linux containers, but they shouldn't depend on the docker daemon. |
@dchen1107 I think we don't want to run them as docker containers either; we want to run them as rkt containers. |
@yifan-gu The initial proposal I reviewed was a docker container, and I did raise my concerns to @vishh, @mikedanese and the coreos developers then. Running it as a rkt container should be ok since rkt is daemon-less today, and shouldn't have this chicken-and-egg issue in the first place. But we still have the problem on nodes without rkt as the runtime.

@gmarek At the meeting, I suggested that kube-proxy update NodeStatus with a newly introduced NodeCondition, like what we plan to do with the kernel issue at #23028 (comment). The reason for not using NodeReady is that kubelet might override it. Another very hacky way is: before crashing, kube-proxy logs the read-only filesystem problem somewhere, like /var/lib/kube-proxy/...; kubelet picks up the information and updates the NodeReady condition with a detailed error message prefixed with "kube-proxy:". It is kube-proxy's responsibility to keep that information up-to-date. |
Can I have a more detailed description of the bug? Is there a related docker issue? Is there a repro? Does this only affect kube-proxy, or will it affect other pods on the node? We need to support running critical applications (including system daemons) in pods in order to allow easy deployment and integration with vendors (e.g. storage and network vendors). cc @aronchick @thockin since we were just talking about integrating through addons. What features are we missing such that we can't support this right now? Is it just that our default container runtime is buggy, flaky, and unreliable? Do we need the NodeProblem API? Do we need taints/tolerations? |
@mikedanese - I don't know of any way to repro it other than running huge clusters. Pretty much every run of a 1000 Node cluster has this problem on at least one Node. |
@mikedanese If you want to know the details, see #24295 (comment). It is a docker issue, and pretty rare. |
Can we regroup on this issue this week? Like tomorrow?
|
I will happily make kube-proxy write to or offer up pretty much any API we settle on. To me it still feels like a docker health-check suite is needed, and one of the checks would be this. In the meantime, we could make a simple "problems" format, e.g.:

```go
type Problem struct {
	Description string `json:"description"`
	Suspect     string `json:"suspect"`
	Reference   string `json:"reference"`
	Timestamp   string `json:"timestamp"`
}

type ProblemList []Problem
```

For this case, kube-proxy would write a file:

```json
[
  {
    "description": "sysfs is read-only",
    "suspect": "docker",
    "reference": "http://github.com/kubernetes/kubernetes/issues/25543",
    "timestamp": "2016-05-17 12:10PDT"
  }
]
```

Kubelet can turn that into Node status reports.
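For concreteness, a minimal sketch of the kubelet-side consumer of such a "problems" directory, assuming a hypothetical location /var/lib/node-problems and reusing the Problem shape above (this is not an existing kubelet API):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Problem mirrors the shape proposed above.
type Problem struct {
	Description string `json:"description"`
	Suspect     string `json:"suspect"`
	Reference   string `json:"reference"`
	Timestamp   string `json:"timestamp"`
}

type ProblemList []Problem

// readProblems scans a well-known directory for *.json files dropped by
// components such as kube-proxy and returns every problem found.
// The directory path is hypothetical, not an agreed-upon location.
func readProblems(dir string) (ProblemList, error) {
	var all ProblemList
	files, err := filepath.Glob(filepath.Join(dir, "*.json"))
	if err != nil {
		return nil, err
	}
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return nil, err
		}
		var list ProblemList
		if err := json.Unmarshal(data, &list); err != nil {
			return nil, err
		}
		all = append(all, list...)
	}
	return all, nil
}

func main() {
	problems, err := readProblems("/var/lib/node-problems")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A real kubelet would translate each problem into a Node condition;
	// here we just print them.
	for _, p := range problems {
		fmt.Printf("%s: %s (suspect: %s, ref: %s)\n", p.Timestamp, p.Description, p.Suspect, p.Reference)
	}
}
```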
|
The PR introducing NodeProblemDetector was merged last week, and the PR to enable it by default for Kubernetes clusters was merged over the weekend. So far everything works fine. Even though it is an alpha feature, GKE folks agreed to enable it for GKE. To work around this particular issue before we have a proper docker fix, we can easily extend today's KernelMonitor module of NodeProblemDetector to monitor the kube-proxy log and report a new NodeCondition called SysfsReadonly to make the issue visible. But we don't have a remedy system yet, and the PR converting node problems (events, conditions, etc.) into taints is still under debate / review:
|
Do we still observe the problem with docker 1.11.X? If not, I want to close this; otherwise, we are going to reconfigure NodeProblemDetector to make the issue visible. |
cc/ @davidopp |
Agree, I'd love to not "fix" this.. |
I think we've seen it the day before yesterday during our tests - @gmarek can you please confirm? |
I can't recall - we certainly did see some failed nodes during the test, but I'm not sure we confirmed it was a kube-proxy issue. |
But what I can tell for sure is that it's definitely less frequent - we started a 2000-node cluster 10+ times during this week and we've seen it at most once (which makes it at least 99.995% reliable per node start). |
Not enough data. GKE is still deploying the old ContainerVM and isn't running Docker 1.11.1 yet, unless I'm mistaken. |
aah ok - I didn't know that... |
I'm going to tentatively close this until we show that it's a problem on Docker 1.11.2, now that Docker 1.11.2 is in |
@thockin @matchstick Node problem detector is not running on GKE cluster now. :) If we still want a fix in the node problem detector, letting kubeproxy drop a file could be enough for now. |
If you can give me a spec of what dir to mount and what sort of file to write, I will make kube-proxy do it.
|
@thockin For a temporary fix, something like
|
I thought that Dawn said we DO NOT want to parse logs. Parsing logs is, in general, fragile.
|
@Random-Liu I agree with @thockin. I also understood @dchen1107 to think we should not parse logs. Is that common practice for the node problem detector? It feels fragile to me and should be avoided. |
@thockin @matchstick Yeah, I agree parsing logs is fragile; that's why I call it a temporary or quick fix. :) A better fix will need some more design and work. Let me think about it a little more~ Anyway, I still think #25543 (comment) and #25543 (comment) are needed. As a key component running on each node, it's weird that we don't have a component to monitor whether it is ready or not. |
Re making kube-proxy "not panic": it currently fails in resizing the hash table for conntrack entries, for which it needs to write to /sys. I'm not sure if there are more locations (we moved /sys/module/br_netfilter loading into kubelet IIUC), or if conntrack itself needs write access to /sys. I think Dawn/Liu are more worried about the NPD parsing kube-proxy logs, because its memory usage would balloon. If kube-proxy can detect the situation internally, either by parsing its own logs (which is brittle) or by poking something in /sys (like the hash table resize it's currently failing on), it can drop a token into a hostPath (hoping the docker bug doesn't manifest as a ro hostPath). Kube-proxy will still remain dysfunctional, though we might be able to:
We just need to be careful to clean up the hostPath (see the sketch below) so it doesn't end up in a restart loop. |
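A rough sketch of the detect-and-drop-a-token idea, assuming a hypothetical hostPath directory /var/lib/kube-proxy and using an attempted open of the conntrack hashsize file as the probe; the paths and token name are illustrative, not an agreed interface:

```go
package main

import (
	"fmt"
	"os"
)

const (
	// The conntrack hashsize file kube-proxy tries to write; writing to it
	// is the operation that currently panics when /sys is read-only.
	hashsizePath = "/sys/module/nf_conntrack/parameters/hashsize"
	// Hypothetical hostPath-mounted location for the token; this assumes the
	// docker bug does not also make the hostPath mount read-only.
	tokenPath = "/var/lib/kube-proxy/sysfs-readonly"
)

// sysfsWritable probes /sys by opening the hashsize file for writing
// instead of trusting mount flags.
func sysfsWritable() bool {
	f, err := os.OpenFile(hashsizePath, os.O_WRONLY, 0)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

func main() {
	if sysfsWritable() {
		// Clean up a stale token so the node does not stay marked broken
		// after docker is restarted and sysfs becomes writable again.
		os.Remove(tokenPath)
		return
	}
	if err := os.WriteFile(tokenPath, []byte("sysfs is read-only\n"), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "failed to record sysfs problem:", err)
	}
}
```

The cleanup branch is what keeps the token from leaving the node permanently marked as broken once the underlying docker issue is gone.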
My understanding of 1.3's state, based on a brief chat with @Random-Liu: something like ~1 in 1000 nodes will be affected by this issue, and if nodes are affected, pods scheduled on those nodes have {broken,little,no} networking. We hope Docker will eventually fix it. But because that's not the case for 1.3, and we don't want to make excessive changes to 1.3, perhaps that means we should fix it expediently. Summarizing my understanding of the previous discussion: the expedient thing would be some mechanism to detect (@bprashanth's proposal, like kube-proxy detecting and writing to /tmp/kube-proxy-claims-docker-issue-present.txt) and $something (kubelet, node problem detector, a bash script in a loop) that checks for the existence of that file, followed by some form of remediation {kill docker, sudo reboot, .....} to bash the node into working again. Maybe an ugly hack is okay for now, and we hope that docker fixes it for the next release so in 1.4 we can rm the code? Or for 1.4 we can do something that is More Elegant And Less Offensive To Engineering Sensibilities? |
Bounded-lifetime hacks are fine, but they need to come with giant disclaimers.
|
Agree re: bounded-lifetime hacks needing care; we shouldn't do them indiscriminately. We and our users have a problem today -- and we've had this problem for the past N months without traction. I'd like the problem solved for 1.3 because I'd like us to have a great product, and I don't think the current state of broken nodes is that. Aside: we've spent N hours discussing the issue and could likely have worked around it in less aggregate people-time than we've spent discussing it. Apologies for the leaked frustration: this issue seems to be stuck in some form of analysis paralysis, cracks between teams, or a perfect-is-the-enemy-of-the-good state. @Random-Liu Again, I don't care about implementation details. We also need more than just tech or a piece of code. The exit criterion is that users don't get broken nodes shipped to them. |
We're not going to solve this problem for as long as we depend on a lagging docker release cycle and they don't backport or do intermediate releases to help us. This is just a one-off docker bug that's left us in a bad state. Such issues come up every release, get documented in the release notes, and we ship anyway. The easiest fix (disclaimers and all) is to redirect kube-proxy stderr and parse it out from NPD. Even then, the node will remain broken unless we restart docker; the NPD will only surface the error. |
@alex-mohr I was working on a PR last night; I will send it out soon. |
FYI, a fix is here #28697. |
FWIW I agree 100% with @alex-mohr. @Random-Liu using node condition SGTM, but your PR should probably also modify the scheduler here |
@davidopp Yeah, we can do that. But before that, we should make sure that the node is really unusable without setting conntrack.
I'll run e2e tests without conntrack set to see whether there is any problem.
Yeah, that's why I think we should not restart docker ourselves to try to remedy the problem. There may still be workloads running on the node. We'd better leave this to the user to decide. :) |
Automatic merge from submit-queue

Prevent kube-proxy from panicking when sysfs is mounted as read-only. Fixes #25543.

This PR:
* Checks the permission of sysfs before setting the conntrack hashsize, and returns a "readOnlySysFSError" if sysfs is read-only. As far as I know, this is the only place we need write permission to sysfs, CMIIW.
* Updates a new node condition 'RuntimeUnhealthy' with a specific reason, message and hint to the administrator about the remediation.

I think this should be an acceptable fix for now. Node problem detector is designed to integrate with different problem daemons, but **the main logic is in the problem detection phase**. After the problem is detected, all node problem detector does is update a node condition. If we let kube-proxy pass the problem to node problem detector and let node problem detector update the node condition, it looks like an unnecessary hop. The logic in kube-proxy wouldn't be different from this PR, but node problem detector would have to open an unsafe door to other pods because of the lack of an authentication mechanism.

It is a bit hard to test this PR, because we don't really have a bad docker in hand. I can only test it manually:
* If I manually change the code to let it return `readOnlySysFSError`, the node condition is updated:
```
NetworkUnavailable   False   Mon, 01 Jan 0001 00:00:00 +0000   Fri, 08 Jul 2016 01:36:41 -0700   RouteCreated                 RouteController created a route
OutOfDisk            False   Fri, 08 Jul 2016 01:37:36 -0700   Fri, 08 Jul 2016 01:34:49 -0700   KubeletHasSufficientDisk     kubelet has sufficient disk space available
MemoryPressure       False   Fri, 08 Jul 2016 01:37:36 -0700   Fri, 08 Jul 2016 01:34:49 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
Ready                True    Fri, 08 Jul 2016 01:37:36 -0700   Fri, 08 Jul 2016 01:35:26 -0700   KubeletReady                 kubelet is posting ready status. WARNING: CPU hardcapping unsupported
RuntimeUnhealthy     True    Fri, 08 Jul 2016 01:35:31 -0700   Fri, 08 Jul 2016 01:35:31 -0700   ReadOnlySysFS                Docker unexpectedly mounts sysfs as read-only for privileged container (docker issue #24000). This causes the critical system components of Kubernetes not properly working. To remedy this please restart the docker daemon.
KernelDeadlock       False   Fri, 08 Jul 2016 01:37:39 -0700   Fri, 08 Jul 2016 01:35:34 -0700   KernelHasNoDeadlock          kernel has no deadlock
Addresses: 10.240.0.3,104.155.176.101
```
* If not, the node condition `RuntimeUnhealthy` doesn't appear.
* If I run the permission-checking code in an unprivileged container, it does return `readOnlySysFSError`.

I'm not sure whether we want to mark the node as `Unschedulable` when this happens, which only needs a few lines of change. I can do that if we think we should. I'll add some unit tests if we think this fix is acceptable.

/cc @bprashanth @dchen1107 @matchstick @thockin @alex-mohr

Mark P1 to match the original issue.
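The merged change is in #28697. As a standalone illustration of the detection half only (not the code from that PR), one way to tell whether sysfs is mounted read-only inside the container is to scan /proc/mounts:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// sysfsReadOnly reports whether any sysfs filesystem visible to this process
// is mounted with the "ro" option, by reading /proc/mounts.
func sysfsReadOnly() (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Format: <device> <mountpoint> <fstype> <options> <dump> <pass>
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || fields[2] != "sysfs" {
			continue
		}
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				return true, nil
			}
		}
	}
	return false, scanner.Err()
}

func main() {
	ro, err := sysfsReadOnly()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if ro {
		// In the merged fix, kube-proxy reports this through a node condition
		// (RuntimeUnhealthy/ReadOnlySysFS) instead of panicking.
		fmt.Println("sysfs is mounted read-only; skipping conntrack hashsize tuning")
	}
}
```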
@Random-Liu thanks for the help! |
@Random-Liu I just ran into this, and I am able to repro consistently. It might be due to the craziness of my experiment, but I still wanted to contribute the data point. I am experimenting with docker inside docker. Everything I am about to describe runs inside a privileged CentOS 7 container that has systemd and Docker installed. The CentOS 7 container is running on my dev machine (Docker for Mac). I have a kubelet running fine, and all control plane components running as static pods. When I attempt to run kube-proxy as a static pod (privileged), it crashes with the following logs:
sysfs is mounted as rw:
sysfs is mounted as rw on containers:
Using busybox to further inspect:
|
@Random-Liu, @alexbrand I'm seeing the same issue when running kubeadm-dind-cluster, where kube-proxy is failing to come up because it is trying to write to /sys/module/nf_conntrack/parameters/hashsize. This file is on the sysfs filesystem, which "mount" shows as rw (so it passes the R/W check in the kube-proxy code), but the filesystem is not writeable. Within kube-proxy:
I'm also running docker containers inside of docker containers. FYI, others have not seen this issue when just running docker on bare metal. |
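Since the mount options can claim rw while writes still fail in these nested-docker setups, a more reliable probe is to attempt the write itself and treat failure as a soft error. A minimal sketch under that assumption (the target hashsize value is arbitrary):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const hashsizeFile = "/sys/module/nf_conntrack/parameters/hashsize"

// setHashsizeIfPossible tries to raise the conntrack hashsize and treats a
// failed write as a soft error, so a deceptive "rw" sysfs does not crash us.
func setHashsizeIfPossible(target int) error {
	raw, err := os.ReadFile(hashsizeFile)
	if err != nil {
		return err
	}
	current, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return err
	}
	if current >= target {
		return nil // nothing to do; avoids the write entirely
	}
	// The write is the real writability test: EROFS/EACCES here means sysfs
	// is not actually writable, regardless of what mount(8) reports.
	if err := os.WriteFile(hashsizeFile, []byte(strconv.Itoa(target)), 0644); err != nil {
		return fmt.Errorf("sysfs not writable, leaving hashsize at %d: %w", current, err)
	}
	return nil
}

func main() {
	if err := setHashsizeIfPossible(131072); err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
	}
}
```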
doesn't try to update hashsize (if max connections > 4 * hashsize). There appears to be an inconsistency between the help info in the CLI (which says max and max-per-core can be zero to leave the hashsize alone) and the default used (which sets max-per-core to 32K if both max and max-per-core are zero). This wouldn't be needed, but it appears that when running kube-proxy in a docker-in-docker environment, the sysfs file system says it allows read-write access, yet we are unable to update hashsize when writing to /sys/module/nf_conntrack/parameters/hashsize. There is an issue for the latter: kubernetes#25543 (comment). I'll create an issue for the former. Until then, we hack it out.
Currently a Docker bug may cause KubeProxy to crashloop, making the Node unable to use the cluster network, which in turn makes Pods scheduled on that Node unreachable.
To mitigate this issue we need to surface information about problems with KubeProxy, which will allow the scheduler to ignore the given node when scheduling new Pods. The ProblemAPI is going to address this kind of problem in the future, but we need a fix for 1.3.
Either Kubelet or KubeProxy needs to update the NodeStatus with the information that Node networking is down. I suggest that we reuse the NodeReady Condition for this, as this would make the rest of the ControlPlane work out of the box. We will replace this fix with the ProblemAPI in the future. @thockin @dchen1107
Another option, which requires more work, is creating a completely new NodeCondition type to handle this case (a rough sketch below). This would require making the Scheduler aware of this new Condition. @davidopp @bgrant0607
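For illustration, a sketch of what such a new condition might look like, using today's k8s.io/api types; the condition type, reason, and message below are hypothetical, not an agreed-upon API:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical new condition type for this proposal; the name is not final.
const NodeNetworkBroken v1.NodeConditionType = "NetworkBroken"

func main() {
	// What kube-proxy (or kubelet) would append to node.Status.Conditions
	// when it detects the read-only-sysfs crashloop.
	cond := v1.NodeCondition{
		Type:               NodeNetworkBroken,
		Status:             v1.ConditionTrue,
		Reason:             "KubeProxyCrashLoop",
		Message:            "kube-proxy: sysfs is mounted read-only, cluster networking is down",
		LastHeartbeatTime:  metav1.NewTime(time.Now()),
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
	fmt.Printf("%+v\n", cond)
	// The scheduler would additionally need to treat this condition the same
	// way it treats NodeReady=false when filtering candidate nodes.
}
```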
@alex-mohr @wojtek-t @fgrzadkowski @zmerlynn