
Work around KubeProxy panic that makes Node's network not working #25543

Closed
gmarek opened this issue May 12, 2016 · 72 comments · Fixed by #28697
Assignees
Labels
area/provider/gcp Issues or PRs related to gcp provider priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@gmarek
Contributor

gmarek commented May 12, 2016

Currently a Docker bug may cause KubeProxy to crashloop, leaving the Node unable to use the cluster network, which in turn makes Pods scheduled on this Node unreachable.

To mitigate this issue we need to surface information about problems with KubeProxy, which will allow the scheduler to ignore the given node when scheduling new Pods. The ProblemAPI is going to address this kind of problem in the future, but we need a fix for 1.3.

Either Kubelet or KubeProxy needs to update the NodeStatus with the information that Node networking is down. I suggest that we reuse the NodeReady Condition for this, as this would make the rest of the ControlPlane work out of the box. We will replace this fix with the ProblemAPI in the future. @thockin @dchen1107

Another option, which requires more work, is creating a completely new NodeCondition type to handle this case. This would require making Scheduler aware of this new Condition. @davidopp @bgrant0607
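
For concreteness, here is a minimal sketch (in Go, using the current client-go package layout rather than the 2016 one) of what the second option's dedicated condition could look like; the condition type name and reason below are illustrative assumptions, not anything the thread has agreed on:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Illustrative only: neither the condition type nor the reason has been
	// settled in this thread.
	cond := v1.NodeCondition{
		Type:               v1.NodeConditionType("RuntimeUnhealthy"),
		Status:             v1.ConditionTrue,
		Reason:             "KubeProxyCrashLooping",
		Message:            "kube-proxy is crash-looping; pod networking on this node is broken",
		LastHeartbeatTime:  metav1.Now(),
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%+v\n", cond)
}
```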

@alex-mohr @wojtek-t @fgrzadkowski @zmerlynn

@gmarek gmarek added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. team/cluster labels May 12, 2016
@gmarek gmarek added this to the v1.3 milestone May 12, 2016
@bgrant0607
Member

Does kubelet monitor kube-proxy health?

@dchen1107
Member

Kube-proxy today runs as a static pod (daemonset), and kubelet treats it the same as other static pods. Yes, Kubelet reports the kube-proxy PodStatus back to the master, but that is all. The detailed information on kube-proxy being in a crashloop is at #24295 (comment).

@dchen1107
Member

cc/ @philips Brandon, this issue explains my long-standing position against running critical daemons like kubelet and kube-proxy as docker containers. I am OK with packaging them into docker images and running them as linux containers, but they shouldn't depend on the docker daemon.

@yifan-gu
Contributor

@dchen1107 I don't think we want to run them as docker containers either; we want to run them as rkt containers.

@dchen1107
Member

@yifan-gu The initial proposal I reviewed was for docker containers, and I did raise my concerns to @vishh, @mikedanese and the coreos developers then. Running it as a rkt container should be OK since rkt is daemonless today and shouldn't have this chicken-and-egg issue in the first place. But we still have the problem on nodes without rkt as the runtime.

@gmarek At the meeting, I suggested that kube-proxy update NodeStatus with a newly introduced NodeCondition, like what we plan to do for kernel issues at #23028 (comment). The reason for not using NodeReady is that kubelet might override it.

Another very hacky way: before crashing, kube-proxy logs the read-only filesystem problem somewhere, like /var/lib/kube-proxy/... Kubelet picks up that information and updates the NodeReady Condition with a detailed error message prefixed with "kube-proxy:". It would be kube-proxy's responsibility to keep that information up-to-date.

@mikedanese
Member

Can I have more description of the bug? Is there a related docker issue? Is there a repro? Does this only affect kube-proxy or will it affect other pods on the node?

We need to support running critical applications (including system daemons) in pods in order to allow easy deployment and integration with vendors (e.g. storage and network vendors). cc @aronchick @thockin since we were just talking about integrating through addons. What features are we missing that we can't support this right now? Is it just that our default container runtime is buggy and flaky and unreliable? Do we need the NodeProblem API? Do we need taints/tolerations?

@gmarek
Contributor Author

gmarek commented May 13, 2016

@mikedanese - I don't know about any way to repro it other than running huge clusters. Pretty much every run of a 1000 Node cluster has this problem on at least one Node.

@dchen1107
Member

@mikedanese If you want to know the details, see #24295 (comment). It is a docker issue, and pretty rare.

@thockin
Member

thockin commented May 17, 2016

Can we regroup on this issue this week? Like tomorrow?


@thockin
Member

thockin commented May 17, 2016

I will happily make kube-proxy write to or offer up pretty much any API we deem necessary to detect this issue.

To me it still feels like a docker health-check suite is needed, and one of the tests would be that sysfs gets mounted rw. But in the meantime, how about this:

Make /var/run/kubernetes/problems/ the API. Any component which believes it has a node problem can write a JSON file into that directory.

type Problem struct {
  Description string `json:"description"`
  Suspect     string `json:"suspect"`
  Reference   string `json:"reference"`
  Timestamp   string `json:"timestamp"`
}
type ProblemList []Problem

For this case, kube-proxy would write a file:

[
    {
        "description": "sysfs is read-only",
        "suspect": "docker",
        "reference": "http://github.com/kubernetes/kubernetes/issues/25543",
        "timestamp": "2016-05-17 12:10PDT"
    }
]

Kubelet can turn that into Node status reports.
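
A minimal sketch of how a component such as kube-proxy might write such a file, assuming the directory proposed above; the file name and timestamp format are assumptions, not part of the proposal:

```go
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
	"time"
)

// Problem mirrors the struct proposed above.
type Problem struct {
	Description string `json:"description"`
	Suspect     string `json:"suspect"`
	Reference   string `json:"reference"`
	Timestamp   string `json:"timestamp"`
}

type ProblemList []Problem

func main() {
	problems := ProblemList{{
		Description: "sysfs is read-only",
		Suspect:     "docker",
		Reference:   "http://github.com/kubernetes/kubernetes/issues/25543",
		Timestamp:   time.Now().Format(time.RFC3339), // timestamp format is an assumption
	}}

	// Directory name as proposed above; writing to it requires root.
	dir := "/var/run/kubernetes/problems"
	if err := os.MkdirAll(dir, 0o755); err != nil {
		panic(err)
	}

	data, err := json.MarshalIndent(problems, "", "    ")
	if err != nil {
		panic(err)
	}

	// File name is hypothetical; the proposal does not specify one.
	if err := os.WriteFile(filepath.Join(dir, "kube-proxy.json"), data, 0o644); err != nil {
		panic(err)
	}
}
```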


@gmarek
Contributor Author

gmarek commented May 17, 2016

@dchen1107
Member

The PR introducing NodeProblemDetector was merged last week, and the PR enabling it by default for Kubernetes clusters was merged over the weekend. So far everything works fine.

Even though it is an alpha feature, the GKE folks agreed to enable it for GKE. To work around this particular issue before we have a proper docker fix, we can easily extend today's KernelMonitor module of NodeProblemDetector to monitor the kube-proxy log and report a new NodeCondition called SysfsReadonly to make the issue visible.

But we don't have a remedy system yet, and the PR converting node problems (events, conditions, etc.) to taints is still under debate/review:

  1. Should the NodeController respect this new Condition and simply mark the node NotReady?
  2. Should the repair system pick up this issue and restart docker?

@dchen1107
Member

Do we still observe the problem with docker 1.11.X? If not, I want to close this; otherwise, we are going to reconfigure NodeProblemDetector to make the issue visible.

@dchen1107
Member

cc/ @davidopp

@thockin
Member

thockin commented Jun 3, 2016

Agreed, I'd love to not "fix" this.

@wojtek-t
Member

wojtek-t commented Jun 3, 2016

I think we've seen it the day before yesterday during our tests - @gmarek can you please confirm?

@gmarek
Contributor Author

gmarek commented Jun 4, 2016

I can't recall - we certainly did see some failed nodes during the test, but I'm not sure we verified that it was a kube-proxy issue.

@wojtek-t
Member

wojtek-t commented Jun 4, 2016

But what I can tell for sure is that it's definitely less frequent - we started a 2000-node cluster 10+ times this week and we've seen it at most once (which makes it at least 99.995% reliable).

@zmerlynn
Member

zmerlynn commented Jun 4, 2016

Not enough data. GKE is still deploying the old ContainerVM and isn't running Docker 1.11.1 yet, unless I'm mistaken.

@wojtek-t
Member

wojtek-t commented Jun 6, 2016

aah ok - I didn't know that...

@zmerlynn
Member

zmerlynn commented Jun 6, 2016

I'm going to tentatively close this until we show that it's a problem on Docker 1.11.2, now that Docker 1.11.2 is in master on both GCE and GKE.

@zmerlynn zmerlynn closed this as completed Jun 6, 2016
@Random-Liu
Member

@thockin @matchstick Node problem detector is not running on GKE cluster now. :)
#25543 (comment) and #25543 (comment) should be a better fix for this.

If we still want a fix in the node problem detector, letting kube-proxy drop a file could be enough for now.

@thockin
Member

thockin commented Jul 7, 2016

If you can give me a spec of what dir to mount and what sort of file to write, I'll do it.


@Random-Liu
Member

Random-Liu commented Jul 7, 2016

@thockin For a temporary fix, something like 2>kube-proxy.log.stderr should be enough.
The kernel monitor of the node problem detector could be ported to parse other logs; the only reason we can't do that for kube-proxy.log is that the log is too spammy.

2>kube-proxy.log.stderr should be enough for now. I checked the nodes in the kubernetes-e2e-gke-large-cluster and kubernetes-e2e-gce clusters; normally only a few lines are at Error level. :)

@thockin
Member

thockin commented Jul 7, 2016

I thought that Dawn said we DO NOT want to parse logs. Parsing logs is, in general, a total hack and very, very brittle. We should have a real push API.


@matchstick
Contributor

@Random-Liu I agree with @thockin. I also understood that @dchen1107 thought we should not parse logs. Is that common practice for the node problem detector? It feels fragile to me and should be avoided.

@Random-Liu
Member

Random-Liu commented Jul 7, 2016

@thockin @matchstick Yeah, I agree parsing logs is fragile; that's why I call it a temporary or quick fix. :)
The only good thing is that it doesn't need significant changes on either side, although I don't think it's a good solution either. :(

A better fix will need some more design and work. Let me think about it a little more~

Anyway, I still think #25543 (comment) and #25543 (comment) are needed. For a key component running on each node, it's weird that we don't have anything monitoring whether it is ready or not.

@bprashanth
Contributor

Re making kube-proxy "not panic": it currently fails when resizing the hash table for conntrack entries, for which it needs to write to /sys. I'm not sure if there are more locations (we moved /sys/module/br_netfilter loading into kubelet IIUC), or if conntrack itself needs write access to /sys.

I think Dawn/Liu are more worried about having the NPD parse kube-proxy logs, because its memory usage would balloon. If kube-proxy can detect the situation internally, either by parsing its own logs (which is brittle) or by poking something in /sys, for example (like the hash table resizing it's currently failing on), it can drop a token into a hostPath (hoping the docker bug doesn't manifest as a read-only hostPath).

Kube-proxy will still remain dysfunctional, though we might be able to:

  1. Surface the issue/mark the node unusable
  2. Set up a feedback loop that restarts docker

We just need to be careful to clean up the hostPath so it doesn't end up in a restart loop.
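
A minimal sketch of the token-drop idea, under stated assumptions: the probe simply tries to open the conntrack hashsize file for writing (which fails with EROFS when sysfs is mounted read-only), and the hostPath location and token name are made up for illustration:

```go
package main

import (
	"os"
	"path/filepath"
)

const (
	// Real sysfs file kube-proxy fails to write; the probe below only catches
	// the read-only-mount flavour of the bug (open fails with EROFS).
	sysfsProbe = "/sys/module/nf_conntrack/parameters/hashsize"
	// hostPath location and token name are hypothetical.
	tokenDir  = "/var/run/kube-proxy"
	tokenFile = "readonly-sysfs"
)

// sysfsWritable reports whether kube-proxy could open the conntrack hashsize
// file for writing, which is the operation it currently panics on.
func sysfsWritable() bool {
	f, err := os.OpenFile(sysfsProbe, os.O_WRONLY, 0)
	if err != nil {
		return false
	}
	f.Close()
	return true
}

func main() {
	token := filepath.Join(tokenDir, tokenFile)
	if sysfsWritable() {
		// Clean up the token so a recovered node is not flagged forever and a
		// remediation loop does not keep restarting docker.
		os.Remove(token)
		return
	}
	os.MkdirAll(tokenDir, 0o755)
	os.WriteFile(token, []byte("sysfs is mounted read-only\n"), 0o644)
}
```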

@alex-mohr
Contributor

My understanding of 1.3's state, based on a brief chat with @Random-Liu: something like ~1 in 1000 nodes will be affected by this issue, and if nodes are affected, pods scheduled on those nodes have {broken,little,no} networking.

We hope Docker will eventually fix it. But because that won't happen for 1.3, and we don't want to make excessive changes to 1.3, perhaps that means we should fix it expediently.

Summarizing my understanding of the previous discussion: the expedient thing would be some mechanism to detect the problem (@bprashanth's proposal, like kube-proxy detecting it and writing to /tmp/kube-proxy-claims-docker-issue-present.txt) and $something (kubelet, node problem detector, a bash script in a loop) that checks for the existence of that file, followed by some form of remediation {kill docker, sudo reboot, ...} to bash the node into working again.

Maybe an ugly hack is okay for now and we hope that docker fixes it for the next release so in 1.4 we can rm the code? Or for 1.4 we can do something that is More Elegant And Less Offensive To Engineering Sensibilities?

@thockin
Member

thockin commented Jul 8, 2016

Bounded-lifetime hacks are fine, but they need to come with giant disclaimers and named assignees to clean up the mess after a specific event.


@alex-mohr
Contributor

Agreed re: bounded-lifetime hacks needing care; we shouldn't do them indiscriminately.

We and our users have a problem today -- and we've had a problem for the past N months without traction. I'd like the problem solved for 1.3 because I'd like us to have a great product, and I don't think shipping broken nodes gets us there.

Aside: we've spent N hours discussing the issue and could likely have worked around it in less aggregate people-time than we spent discussing it. Apologies for the leaked frustration: this issue seems to be stuck in some form of analysis paralysis, or in the cracks between teams, or in a perfect-is-the-enemy-of-the-good state.

@Random-Liu Again, I don't care about implementation details. We also need more than just tech or a piece of code. Exit criteria is that users don't get broken nodes shipped to them.

@bprashanth
Contributor

Exit criteria is that users don't get broken nodes shipped to them.

We're not going to solve this problem for as long as we depend on a lagging docker release cycle and they don't backport or do intermediate releases to help us. This is just a one-off docker bug that's left us in a bad state. Such issues come up every release, get documented in the release notes, and we ship anyway.

The easiest fix (disclaimers and all) is to redirect kube-proxy's stderr and parse it from the NPD. Even then, the node will remain broken unless we restart docker; the NPD will only surface the error.

@Random-Liu
Member

@alex-mohr I was working on a PR last night; I will send it out soon.

@Random-Liu
Member

FYI, a fix is here: #28697.
In #28697, I let kube-proxy update a node condition RuntimeUnhealthy with a specific reason, a message, and a hint to the administrator about the remediation.
The specific reason why we want to do it this way is discussed in the PR description #28697 (comment).

@davidopp
Member

davidopp commented Jul 8, 2016

FWIW I agree 100% with @alex-mohr.

@Random-Liu using node condition SGTM, but your PR should probably also modify the scheduler here
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/factory/factory.go#L434
to prevent the system from sending new pods to a node that has RuntimeUnhealthy condition. I realize there will still be a time window when we might send some pods to that node (before the problem is detected) but at least the time window will be bounded this way.
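
A minimal sketch of the kind of filter being suggested here; the function and its placement are illustrative and are not the actual factory.go code, and the RuntimeUnhealthy condition type is the one proposed in #28697:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nodeSchedulable skips nodes that report the proposed RuntimeUnhealthy
// condition as True. This is a sketch of the kind of check the scheduler's
// node filter could gain, not the real scheduler code.
func nodeSchedulable(node *v1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == v1.NodeConditionType("RuntimeUnhealthy") && c.Status == v1.ConditionTrue {
			return false
		}
	}
	return true
}

func main() {
	node := &v1.Node{}
	fmt.Println(nodeSchedulable(node)) // true: no conditions set on this empty node
}
```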

@Random-Liu
Member

Random-Liu commented Jul 8, 2016

using node condition SGTM, but your PR should probably also modify the scheduler here
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/factory/factory.go#L434
to prevent the system from sending new pods to a node that has RuntimeUnhealthy condition.

@davidopp Yeah, we can do that. But before that, we should make sure that the node is really unusable without the conntrack settings.
In fact, we only set conntrack parameters to increase the max connection limit on the node (default 64k->256k) #19182.
Without this, is the node really considered unusable? @thockin

  • If so, we should prevent scheduling pods to the node.
  • If not, maybe we should just surface the problem to the administrator and let the node keep working.

I'll run the e2e tests without conntrack set to see whether there is any problem.

I realize there will still be a time window when we might send some pods to that node (before the problem is detected) but at least the time window will be bounded this way.

Yeah, that's why I think we should not restart docker ourselves to try to remedy the problem. There may still be workloads running on the node. We'd better leave this to the user to decide. :)

k8s-github-robot pushed a commit that referenced this issue Jul 11, 2016
Automatic merge from submit-queue

Prevent kube-proxy from panicking when sysfs is mounted as read-only.

Fixes #25543.

This PR:
* Checks the permission of sysfs before setting the conntrack hashsize, and returns a "readOnlySysFSError" error if sysfs is read-only. As far as I know, this is the only place we need write permission to sysfs, CMIIW.
* Updates a new node condition 'RuntimeUnhealthy' with a specific reason, a message, and a hint to the administrator about the remediation.

I think this should be an acceptable fix for now.
Node problem detector is designed to integrate with different problem daemons, but **the main logic is in the problem detection phase**. After the problem is detected, what node problem detector does is also simply updating a node condition.

If we let kube-proxy pass the problem to the node problem detector and have the node problem detector update the node condition, it looks like an unnecessary hop. The logic in kube-proxy wouldn't be different from this PR, but the node problem detector would have to open an unsafe door to other pods because of the lack of an authentication mechanism.

It is a bit hard to test this PR, because we don't really have a bad docker in hand. I can only manually test it:
* If I manually change the code to let it return `readOnlySysFSError`, the node condition will be updated:
```
  NetworkUnavailable 	False 	Mon, 01 Jan 0001 00:00:00 +0000 	Fri, 08 Jul 2016 01:36:41 -0700 	RouteCreated 			RouteController created a route
  OutOfDisk 		False 	Fri, 08 Jul 2016 01:37:36 -0700 	Fri, 08 Jul 2016 01:34:49 -0700 	KubeletHasSufficientDisk 	kubelet has sufficient disk space available
  MemoryPressure 	False 	Fri, 08 Jul 2016 01:37:36 -0700 	Fri, 08 Jul 2016 01:34:49 -0700 	KubeletHasSufficientMemory 	kubelet has sufficient memory available
  Ready 		True 	Fri, 08 Jul 2016 01:37:36 -0700 	Fri, 08 Jul 2016 01:35:26 -0700 	KubeletReady 			kubelet is posting ready status. WARNING: CPU hardcapping unsupported
  RuntimeUnhealthy 	True 	Fri, 08 Jul 2016 01:35:31 -0700 	Fri, 08 Jul 2016 01:35:31 -0700 	ReadOnlySysFS 			Docker unexpectedly mounts sysfs as read-only for privileged container (docker issue #24000). This causes the critical system components of Kubernetes not properly working. To remedy this please restart the docker daemon.
  KernelDeadlock 	False 	Fri, 08 Jul 2016 01:37:39 -0700 	Fri, 08 Jul 2016 01:35:34 -0700 	KernelHasNoDeadlock 		kernel has no deadlock
Addresses:		10.240.0.3,104.155.176.101
```
* If not, the node condition `RuntimeUnhealthy` won't appear.
* If I run the permission-checking code in an unprivileged container, it did return `readOnlySysFSError`.

I'm not sure whether we want to mark the node as `Unschedulable` when this happens, which only needs a few lines of change. I can do that if we think we should.

I'll add some unit tests if we think this fix is acceptable.

/cc @bprashanth @dchen1107 @matchstick @thockin @alex-mohr 

Mark P1 to match the original issue.
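For readers trying to follow the first bullet of the PR description above, here is a minimal, hedged sketch of a sysfs writability check; it uses a Linux statfs mount-flag test and is an assumption-laden illustration, not the code that actually merged in #28697:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// errReadOnlySysFS mirrors the readOnlySysFSError described above.
var errReadOnlySysFS = errors.New("readOnlySysFSError: sysfs is mounted read-only")

// checkSysFSWritable inspects the mount flags of /sys via statfs (Linux only).
func checkSysFSWritable() error {
	const stRdonly = 0x1 // ST_RDONLY flag in statfs f_flags
	var st syscall.Statfs_t
	if err := syscall.Statfs("/sys", &st); err != nil {
		return err
	}
	if st.Flags&stRdonly != 0 {
		return errReadOnlySysFS
	}
	return nil
}

func main() {
	if err := checkSysFSWritable(); err != nil {
		// The merged PR updates the RuntimeUnhealthy node condition instead of
		// exiting; exiting here keeps the sketch self-contained.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("sysfs is writable; safe to set the conntrack hashsize")
}
```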
@Random-Liu
Member

Random-Liu commented Jul 11, 2016

#28697 is merged, and hopefully it solves this issue.
I've also sent a PR to revert the workaround in the test framework (#28015) to verify whether the issue is really fully solved.
Feel free to reopen this if this issue or a related issue happens again.

k8s-github-robot pushed a commit that referenced this issue Jul 12, 2016
…eproxy

Automatic merge from submit-queue

Revert "Workardound KubeProxy failures in test framework"

Reverts #28015

For #25543.
Reverts the workaround in the test framework to verify whether #28697 solved the problem.

@wojtek-t
@bprashanth
Contributor

@Random-Liu thanks for the help!

@alexbrand
Contributor

@Random-Liu I just ran into this, and I am able to repro consistently. It might be due to the craziness of my experiment, but I still wanted to contribute the data point.

I am experimenting with docker inside docker. All I am about to describe is itself running inside a privileged CentOS 7 container that has systemd and Docker installed. The CentOS 7 container is running on my dev machine (Docker for Mac).

I have a kubelet running fine, and all control plane components running as static pods. When I attempt to run kube-proxy as a static pod (privileged) it crashes with the following logs:

```
I0225 18:13:46.886161       1 iptables.go:176] Could not connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
I0225 18:13:46.886259       1 server.go:168] setting OOM scores is unsupported in this build
I0225 18:13:46.888648       1 server.go:215] Using iptables Proxier.
I0225 18:13:46.941545       1 server.go:227] Tearing down userspace rules.
I0225 18:13:46.941707       1 healthcheck.go:119] Initializing kube-proxy health checker
I0225 18:13:46.956490       1 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0225 18:13:46.957841       1 conntrack.go:66] Setting conntrack hashsize to 32768
write /sys/module/nf_conntrack/parameters/hashsize: operation not supported
```

sysfs is mounted as rw:

```
# mount | grep sysfs
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
```

sysfs is mounted as rw on containers:

```
docker run -ti --privileged --net=host busybox mount | grep sysfs
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
```

Using busybox to further inspect:

```
# docker run -ti --privileged --net=host busybox sh
/ # echo 16384 > /sys/module/nf_conntrack/parameters/hashsize
sh: write error: Operation not supported
/ # mount | grep sysfs
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
/ # id
uid=0(root) gid=0(root) groups=10(wheel)
/ # ls -l /sys/module/nf_conntrack/parameters/
total 0
-rw-r--r--    1 root     root          4096 Feb 25 17:54 acct
-r--------    1 root     root          4096 Feb 25 17:54 expect_hashsize
-rw-------    1 root     root          4096 Feb 25 18:20 hashsize
-rw-r--r--    1 root     root          4096 Feb 25 17:54 nf_conntrack_helper
-rw-r--r--    1 root     root          4096 Feb 25 18:09 tstamp
```
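
To make the failure above easier to follow, here is a minimal sketch reproducing the two writes kube-proxy performs: nf_conntrack_max via procfs (which succeeds here) and hashsize = max/4 via sysfs (which is the write that fails). The paths are the real kernel interfaces and the values mirror the log above; the standalone program itself is just an illustration:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// Values mirror the kube-proxy log above: 131072 max, hashsize = max/4.
	maxConns := 131072
	hashsize := maxConns / 4

	// nf_conntrack_max goes through /proc/sys and normally succeeds in a
	// privileged container.
	if err := os.WriteFile("/proc/sys/net/netfilter/nf_conntrack_max",
		[]byte(strconv.Itoa(maxConns)), 0o644); err != nil {
		fmt.Println("sysctl write failed:", err)
	}

	// hashsize goes through sysfs and is the write that fails with
	// "operation not supported" in the docker-in-docker setup above.
	if err := os.WriteFile("/sys/module/nf_conntrack/parameters/hashsize",
		[]byte(strconv.Itoa(hashsize)), 0o644); err != nil {
		fmt.Println("sysfs write failed:", err)
	}
}
```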

@pmichali
Contributor

pmichali commented Sep 28, 2017

@Random-Liu, @alexbrand I'm seeing the same issue when running kubeadm-dind-cluster, where kube-proxy is failing to come up because it is trying to write to /sys/module/nf_conntrack/parameters/hashsize. This is on the sysfs filesystem, which "mount" shows as rw (so it passes the R/W check in the kube-proxy code), but the filesystem is not writable.

Within kube-proxy:

```
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)

/sys/module/nf_conntrack/parameters
touch x
touch: cannot touch 'x': Permission denied
```

I'm also running docker containers inside of docker containers. FYI, others have not seen this issue when just running docker on bare metal.

@ivan4th @danehans

pmichali added a commit to pmichali/kubernetes that referenced this issue Sep 29, 2017
doesn't try to update hashsize (if max connections > 4 * hashsize).

There appears to be an inconsistency between the help info in the CLI (which says max
and max-per-core can be zero to leave the setting alone) and the default used (which
sets max-per-core to 32K if both max and max-per-core are zero).

This wouldn't be needed, but it appears that when running kube-proxy
in a docker-in-docker environment, the sysfs file system says it allows
read-write access, yet we are unable to update hashsize when writing
to /sys/module/nf_conntrack/parameters/hashsize.

There is an issue for the latter:
kubernetes#25543 (comment)

I'll create an issue for the former. Until then, we hack it out.
leblancd pushed a commit to leblancd/kubernetes that referenced this issue Oct 20, 2017
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Oct 14, 2020
Bug 1882033: UPSTREAM: 94112: Remove canonicalization of endpoints by endpoints controller for better comparison

Origin-commit: 8c5f4c3884cd8140ab4e142ef7fcf93666d40c1d