Work around KubeProxy panic that makes the Node's network not work #25543
Comments
Does kubelet monitor kube-proxy health? |
Kube-proxy today runs as a static pod (daemonset); kubelet treats it the same as other static pods. Yes, kubelet reports the kube-proxy pod status back to the master, and that is all. The detailed information on kube-proxy being in a crashloop is at #24295 (comment) |
cc/ @philips Brandon, this issue explains my long-standing position against running critical daemons like kubelet and kube-proxy as docker containers. I am ok with packaging them into docker images and running them as linux containers, but they shouldn't depend on the docker daemon. |
@dchen1107 I think we don't want to run them as docker containers either; we want to run them as rkt containers. |
@yifan-gu The initial proposal I reviewed was a docker container, and I did raise my concerns to @vishh, @mikedanese and the coreos developers then. Running it as a rkt container should be ok since rkt is daemon-less today, and shouldn't have this chicken-and-egg issue in the first place. But we still have the problem on nodes without rkt as the runtime.

@gmarek At the meeting, I suggested that kube-proxy update NodeStatus with a newly introduced NodeCondition, like what we plan to do with the kernel issue at #23028 (comment). The reason for not using NodeReady is that kubelet might override it. Another very hacky way is: before crashing, kube-proxy logs the read-only filesystem problem somewhere, like /var/lib/kube-proxy/...; kubelet picks up the information and updates the NodeReady condition with a detailed error message prefixed with "kube-proxy:". It is kube-proxy's responsibility to keep that information up-to-date. |
Can I have a more detailed description of the bug? Is there a related docker issue? Is there a repro? Does this only affect kube-proxy, or will it affect other pods on the node? We need to support running critical applications (including system daemons) in pods in order to allow easy deployment and integration with vendors (e.g. storage and network vendors). cc @aronchick @thockin since we were just talking about integrating through addons. What features are we missing such that we can't support this right now? Is it just that our default container runtime is buggy, flaky, and unreliable? Do we need the NodeProblem API? Do we need taints/tolerations? |
@mikedanese - I don't know of any way to repro it other than running huge clusters. Pretty much every run of a 1000 Node cluster has this problem on at least one Node. |
@mikedanese If you want to know the details, see #24295 (comment). It is a docker issue, and pretty rare. |
Can we regroup on this issue this week? Like tomorrow?
|
I will happily make kube-proxy write to or offer up pretty much any API we settle on. To me it still feels like a docker health-check suite is needed, and one of the checks would be this. In the meantime, we could make a simple "problems" format, e.g.:

```go
type Problem struct {
	Description string `json:"description"`
	Suspect     string `json:"suspect"`
	Reference   string `json:"reference"`
	Timestamp   string `json:"timestamp"`
}

type ProblemList []Problem
```

For this case, kube-proxy would write a file:

```json
[
  {
    "description": "sysfs is read-only",
    "suspect": "docker",
    "reference": "http://github.com/kubernetes/kubernetes/issues/25543",
    "timestamp": "2016-05-17 12:10PDT"
  }
]
```

Kubelet can turn that into Node status reports.
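For concreteness, a minimal sketch of the kubelet-side consumer of such a "problems" directory, assuming a hypothetical location /var/lib/node-problems and reusing the Problem shape above (this is not an existing kubelet API):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Problem mirrors the shape proposed above.
type Problem struct {
	Description string `json:"description"`
	Suspect     string `json:"suspect"`
	Reference   string `json:"reference"`
	Timestamp   string `json:"timestamp"`
}

type ProblemList []Problem

// readProblems scans a well-known directory for *.json files dropped by
// components such as kube-proxy and returns every problem found.
// The directory path is hypothetical, not an agreed-upon location.
func readProblems(dir string) (ProblemList, error) {
	var all ProblemList
	files, err := filepath.Glob(filepath.Join(dir, "*.json"))
	if err != nil {
		return nil, err
	}
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return nil, err
		}
		var list ProblemList
		if err := json.Unmarshal(data, &list); err != nil {
			return nil, err
		}
		all = append(all, list...)
	}
	return all, nil
}

func main() {
	problems, err := readProblems("/var/lib/node-problems")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A real kubelet would translate each problem into a Node condition;
	// here we just print them.
	for _, p := range problems {
		fmt.Printf("%s: %s (suspect: %s, ref: %s)\n", p.Timestamp, p.Description, p.Suspect, p.Reference)
	}
}
```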
|
The PR introducing NodeProblemDetector was merged last week, and the PR to enable it by default for Kubernetes clusters was merged over the weekend. So far everything works fine. Even though it is an alpha feature, GKE folks agreed to enable it for GKE. To work around this particular issue before we have a proper docker fix, we can easily extend today's KernelMonitor module of NodeProblemDetector to monitor the kube-proxy log and report a new NodeCondition called SysfsReadonly to make the issue visible. But we don't have a remedy system yet, and the PR converting node problems (events, conditions, etc.) into taints is still under debate / review:
|
Do we still observe the problem with docker 1.11.X? If not, I want to close this; otherwise, we are going to reconfigure NodeProblemDetector to make the issue visible. |
cc/ @davidopp |
Agree, I'd love to not "fix" this.. |
I think we've seen it the day before yesterday during our tests - @gmarek can you please confirm? |
I can't recall - we certainly did see some failed nodes during the test, but I'm not sure we confirmed it was a kube-proxy issue. |
But what I can tell for sure is that it's definitely less frequent - we started a 2000-node cluster 10+ times during this week and we've seen it at most once (which makes it at least 99.995% reliable per node start). |
Not enough data. GKE is still deploying the old ContainerVM and isn't running Docker 1.11.1 yet, unless I'm mistaken. |
aah ok - I didn't know that... |
I'm going to tentatively close this until we show that it's a problem on Docker 1.11.2, now that Docker 1.11.2 is in |
@thockin @matchstick Node problem detector is not running on GKE cluster now. :) If we still want a fix in the node problem detector, letting kubeproxy drop a file could be enough for now. |
If you can give me a spec of what dir to mount and what sort of file to write, I will make kube-proxy do it.
|
@thockin For a temporary fix, something like
|
I thought that Dawn said we DO NOT want to parse logs. Parsing logs is, in general, fragile.
|
@Random-Liu I agree with @thockin. I also understood @dchen1107 to think we should not parse logs. Is that common practice for the node problem detector? It feels fragile to me and should be avoided. |
@thockin @matchstick Yeah, I agree parsing logs is fragile; that's why I call it a temporary or quick fix. :) A better fix will need some more design and work. Let me think about it a little more~ Anyway, I still think #25543 (comment) and #25543 (comment) are needed. As a key component running on each node, it's weird that we don't have a component to monitor whether it is ready or not. |
Re making kube-proxy "not panic": it currently fails in resizing the hash table for conntrack entries, for which it needs to write to /sys. I'm not sure if there are more locations (we moved /sys/module/br_netfilter loading into kubelet IIUC), or if conntrack itself needs write access to /sys. I think Dawn/Liu are more worried about the NPD parsing kube-proxy logs, because its memory usage would balloon. If kube-proxy can detect the situation internally, either by parsing its own logs (which is brittle) or by poking something in /sys (like the hash table resize it's currently failing on), it can drop a token into a hostPath (hoping the docker bug doesn't manifest as a ro hostPath). Kube-proxy will still remain dysfunctional, though we might be able to:
We just need to be careful to clean up the hostPath (see the sketch below) so it doesn't end up in a restart loop. |
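A rough sketch of the detect-and-drop-a-token idea, assuming a hypothetical hostPath directory /var/lib/kube-proxy and using an attempted open of the conntrack hashsize file as the probe; the paths and token name are illustrative, not an agreed interface:

```go
package main

import (
	"fmt"
	"os"
)

const (
	// The conntrack hashsize file kube-proxy tries to write; writing to it
	// is the operation that currently panics when /sys is read-only.
	hashsizePath = "/sys/module/nf_conntrack/parameters/hashsize"
	// Hypothetical hostPath-mounted location for the token; this assumes the
	// docker bug does not also make the hostPath mount read-only.
	tokenPath = "/var/lib/kube-proxy/sysfs-readonly"
)

// sysfsWritable probes /sys by opening the hashsize file for writing
// instead of trusting mount flags.
func sysfsWritable() bool {
	f, err := os.OpenFile(hashsizePath, os.O_WRONLY, 0)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

func main() {
	if sysfsWritable() {
		// Clean up a stale token so the node does not stay marked broken
		// after docker is restarted and sysfs becomes writable again.
		os.Remove(tokenPath)
		return
	}
	if err := os.WriteFile(tokenPath, []byte("sysfs is read-only\n"), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "failed to record sysfs problem:", err)
	}
}
```

The cleanup branch is what keeps the token from leaving the node permanently marked as broken once the underlying docker issue is gone.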
My understanding of 1.3's state, based on a brief chat with @Random-Liu: something like ~1 in 1000 nodes will be affected by this issue, and if nodes are affected, pods scheduled on those nodes have {broken,little,no} networking. We hope Docker will eventually fix it. But because that's not the case for 1.3, and we don't want to make excessive changes to 1.3, perhaps that means we should fix it expediently. Summarizing my understanding of the previous discussion: the expedient thing would be some mechanism to detect (@bprashanth's proposal, like kube-proxy detecting and writing to /tmp/kube-proxy-claims-docker-issue-present.txt) and $something (kubelet, node problem detector, a bash script in a loop) that checks for the existence of that file, followed by some form of remediation {kill docker, sudo reboot, .....} to bash the node into working again. Maybe an ugly hack is okay for now, and we hope that docker fixes it for the next release so in 1.4 we can rm the code? Or for 1.4 we can do something that is More Elegant And Less Offensive To Engineering Sensibilities? |
Bounded-lifetime hacks are fine, but they need to come with giant disclaimers.
|
Agree re: bounded-lifetime hacks needing care; we shouldn't do them indiscriminately. We and our users have a problem today -- and we've had this problem for the past N months without traction. I'd like the problem solved for 1.3 because I'd like us to have a great product, and I don't think the current state of broken nodes is that. Aside: we've spent N hours discussing the issue and could likely have worked around it in less aggregate people-time than we've spent discussing it. Apologies for the leaked frustration: this issue seems to be stuck in some form of analysis paralysis, cracks between teams, or a perfect-is-the-enemy-of-the-good state. @Random-Liu Again, I don't care about implementation details. We also need more than just tech or a piece of code. The exit criterion is that users don't get broken nodes shipped to them. |
We're not going to solve this problem for as long as we depend on a lagging docker release cycle and they don't backport or do intermediate releases to help us. This is just a one-off docker bug that's left us in a bad state. Such issues come up every release, get documented in the release notes, and we ship anyway. The easiest fix (disclaimers and all) is to redirect kube-proxy stderr and parse it out from NPD. Even then, the node will remain broken unless we restart docker; the NPD will only surface the error. |
@alex-mohr I was working on a PR last night; I will send it out soon. |
FYI, a fix is here #28697. |
FWIW I agree 100% with @alex-mohr. @Random-Liu using node condition SGTM, but your PR should probably also modify the scheduler here |
@davidopp Yeah, we can do that. But before that, we should make sure that the node is really unusable without setting conntrack.
I'll run e2e tests without conntrack set to see whether there is any problem.
Yeah, that's why I think we should not restart docker ourselves to try to remedy the problem. There may still be workloads running on the node. We'd better leave this to the user to decide. :) |
Automatic merge from submit-queue

Prevent kube-proxy from panicking when sysfs is mounted as read-only. Fixes #25543.

This PR:
* Checks the permission of sysfs before setting the conntrack hashsize, and returns a "readOnlySysFSError" if sysfs is read-only. As far as I know, this is the only place we need write permission to sysfs, CMIIW.
* Updates a new node condition 'RuntimeUnhealthy' with a specific reason, message and hint to the administrator about the remediation.

I think this should be an acceptable fix for now. Node problem detector is designed to integrate with different problem daemons, but **the main logic is in the problem detection phase**. After the problem is detected, all node problem detector does is update a node condition. If we let kube-proxy pass the problem to node problem detector and let node problem detector update the node condition, it looks like an unnecessary hop. The logic in kube-proxy wouldn't be different from this PR, but node problem detector would have to open an unsafe door to other pods because of the lack of an authentication mechanism.

It is a bit hard to test this PR, because we don't really have a bad docker in hand. I can only test it manually:
* If I manually change the code to let it return `readOnlySysFSError`, the node condition is updated:
```
NetworkUnavailable   False   Mon, 01 Jan 0001 00:00:00 +0000   Fri, 08 Jul 2016 01:36:41 -0700   RouteCreated                 RouteController created a route
OutOfDisk            False   Fri, 08 Jul 2016 01:37:36 -0700   Fri, 08 Jul 2016 01:34:49 -0700   KubeletHasSufficientDisk     kubelet has sufficient disk space available
MemoryPressure       False   Fri, 08 Jul 2016 01:37:36 -0700   Fri, 08 Jul 2016 01:34:49 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
Ready                True    Fri, 08 Jul 2016 01:37:36 -0700   Fri, 08 Jul 2016 01:35:26 -0700   KubeletReady                 kubelet is posting ready status. WARNING: CPU hardcapping unsupported
RuntimeUnhealthy     True    Fri, 08 Jul 2016 01:35:31 -0700   Fri, 08 Jul 2016 01:35:31 -0700   ReadOnlySysFS                Docker unexpectedly mounts sysfs as read-only for privileged container (docker issue #24000). This causes the critical system components of Kubernetes not properly working. To remedy this please restart the docker daemon.
KernelDeadlock       False   Fri, 08 Jul 2016 01:37:39 -0700   Fri, 08 Jul 2016 01:35:34 -0700   KernelHasNoDeadlock          kernel has no deadlock
Addresses: 10.240.0.3,104.155.176.101
```
* If not, the node condition `RuntimeUnhealthy` doesn't appear.
* If I run the permission-checking code in an unprivileged container, it does return `readOnlySysFSError`.

I'm not sure whether we want to mark the node as `Unschedulable` when this happens, which only needs a few lines of change. I can do that if we think we should. I'll add some unit tests if we think this fix is acceptable.

/cc @bprashanth @dchen1107 @matchstick @thockin @alex-mohr

Mark P1 to match the original issue.
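The merged change is in #28697. As a standalone illustration of the detection half only (not the code from that PR), one way to tell whether sysfs is mounted read-only inside the container is to scan /proc/mounts:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// sysfsReadOnly reports whether any sysfs filesystem visible to this process
// is mounted with the "ro" option, by reading /proc/mounts.
func sysfsReadOnly() (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Format: <device> <mountpoint> <fstype> <options> <dump> <pass>
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || fields[2] != "sysfs" {
			continue
		}
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				return true, nil
			}
		}
	}
	return false, scanner.Err()
}

func main() {
	ro, err := sysfsReadOnly()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if ro {
		// In the merged fix, kube-proxy reports this through a node condition
		// (RuntimeUnhealthy/ReadOnlySysFS) instead of panicking.
		fmt.Println("sysfs is mounted read-only; skipping conntrack hashsize tuning")
	}
}
```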
@Random-Liu thanks for the help! |
@Random-Liu I just ran into this, and I am able to repro consistently. It might be due to the craziness of my experiment, but I still wanted to contribute the data point. I am experimenting with docker inside docker. Everything I am about to describe runs inside a privileged CentOS 7 container that has systemd and Docker installed. The CentOS 7 container is running on my dev machine (Docker for Mac). I have a kubelet running fine, and all control plane components running as static pods. When I attempt to run kube-proxy as a static pod (privileged), it crashes with the following logs:
sysfs is mounted as rw:
sysfs is mounted as rw on containers:
Using busybox to further inspect:
|
@Random-Liu, @alexbrand I'm seeing the same issue when running kubeadm-dind-cluster, where kube-proxy is failing to come up because it is trying to write to /sys/module/nf_conntrack/parameters/hashsize. This file is on the sysfs filesystem, which "mount" shows as rw (so it passes the R/W check in the kube-proxy code), but the filesystem is not writeable. Within kube-proxy:
I'm also running docker containers inside of docker containers. FYI, others have not seen this issue when just running docker on bare metal. |
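Since the mount options can claim rw while writes still fail in these nested-docker setups, a more reliable probe is to attempt the write itself and treat failure as a soft error. A minimal sketch under that assumption (the target hashsize value is arbitrary):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const hashsizeFile = "/sys/module/nf_conntrack/parameters/hashsize"

// setHashsizeIfPossible tries to raise the conntrack hashsize and treats a
// failed write as a soft error, so a deceptive "rw" sysfs does not crash us.
func setHashsizeIfPossible(target int) error {
	raw, err := os.ReadFile(hashsizeFile)
	if err != nil {
		return err
	}
	current, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return err
	}
	if current >= target {
		return nil // nothing to do; avoids the write entirely
	}
	// The write is the real writability test: EROFS/EACCES here means sysfs
	// is not actually writable, regardless of what mount(8) reports.
	if err := os.WriteFile(hashsizeFile, []byte(strconv.Itoa(target)), 0644); err != nil {
		return fmt.Errorf("sysfs not writable, leaving hashsize at %d: %w", current, err)
	}
	return nil
}

func main() {
	if err := setHashsizeIfPossible(131072); err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
	}
}
```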
doesn't try to update hashsize (if max connections > 4 * hashsize). There appears to be an inconsistency between the help info in the CLI (which says max and max-per-core can be zero to leave the hashsize alone) and the default used (which sets max-per-core to 32K if both max and max-per-core are zero). This wouldn't be needed, but it appears that when running kube-proxy in a docker-in-docker environment, the sysfs file system says it allows read-write access, yet we are unable to update hashsize when writing to /sys/module/nf_conntrack/parameters/hashsize. There is an issue for the latter: kubernetes#25543 (comment). I'll create an issue for the former. Until then, we hack it out.
Currently a Docker bug may cause KubeProxy to crashloop, making the Node unable to use the cluster network, which in turn makes Pods scheduled on that Node unreachable.
To mitigate this issue we need to surface information about problems with KubeProxy, which will allow the scheduler to ignore the given node when scheduling new Pods. The ProblemAPI is going to address this kind of problem in the future, but we need a fix for 1.3.
Either Kubelet or KubeProxy needs to update the NodeStatus with the information that Node networking is down. I suggest that we reuse the NodeReady Condition for this, as this would make the rest of the ControlPlane work out of the box. We will replace this fix with the ProblemAPI in the future. @thockin @dchen1107
Another option, which requires more work, is creating a completely new NodeCondition type to handle this case (a rough sketch below). This would require making the Scheduler aware of this new Condition. @davidopp @bgrant0607
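For illustration, a sketch of what such a new condition might look like, using today's k8s.io/api types; the condition type, reason, and message below are hypothetical, not an agreed-upon API:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical new condition type for this proposal; the name is not final.
const NodeNetworkBroken v1.NodeConditionType = "NetworkBroken"

func main() {
	// What kube-proxy (or kubelet) would append to node.Status.Conditions
	// when it detects the read-only-sysfs crashloop.
	cond := v1.NodeCondition{
		Type:               NodeNetworkBroken,
		Status:             v1.ConditionTrue,
		Reason:             "KubeProxyCrashLoop",
		Message:            "kube-proxy: sysfs is mounted read-only, cluster networking is down",
		LastHeartbeatTime:  metav1.NewTime(time.Now()),
		LastTransitionTime: metav1.NewTime(time.Now()),
	}
	fmt.Printf("%+v\n", cond)
	// The scheduler would additionally need to treat this condition the same
	// way it treats NodeReady=false when filtering candidate nodes.
}
```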
@alex-mohr @wojtek-t @fgrzadkowski @zmerlynn