
kube-proxy process silently hangs on one of nodes #38372

Closed
shakhat opened this issue Dec 8, 2016 · 9 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments


shakhat commented Dec 8, 2016

Kubernetes version (use kubectl version):
Kubernetes v1.4.3+coreos.0

Environment:

  • Cloud provider or hardware configuration:
    6 bare-metal nodes
  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04.1 LTS (Xenial Xerus)
  • Kernel (e.g. uname -a):
    Linux node1 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Kargo
  • Others:
    Calico 0.21

What happened:
There is a pod (OpenStack Keystone) running on node1 with a service associated with it. Initially the service is reachable from all nodes, e.g. a request to http://keystone.ccp:5000/ returns a valid HTTP response.

The Keystone pod is restarted; as a result it gets a different IP address, and the DNS record is updated to point to the new address. However, the service is no longer reachable from node2.
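
To confirm the symptom, the same request can be repeated from a working and a broken node (a minimal check, assuming the service name resolves from the nodes as described above):

curl -m 5 -v http://keystone.ccp:5000/    # succeeds on node1, times out on node2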

The reason is that the iptables rules are not being updated:

good node

-A KUBE-SEP-MMRSUY2HT6SO3JN6 -s 10.247.61.227/32 -m comment --comment "ccp/keystone:5000" -j KUBE-MARK-MASQ
-A KUBE-SEP-MMRSUY2HT6SO3JN6 -p tcp -m comment --comment "ccp/keystone:5000" -m tcp -j DNAT --to-destination 10.247.61.227:5000

bad node

-A KUBE-SEP-2FVN6OV734ACY4G4 -s 10.247.61.245/32 -m comment --comment "ccp/keystone:5000" -j KUBE-MARK-MASQ
-A KUBE-SEP-2FVN6OV734ACY4G4 -p tcp -m comment --comment "ccp/keystone:5000" -m tcp -j DNAT --to-destination 10.247.61.245:5000

and the service endpoints:

vagrant@node1:~$ kubectl get endpoints 
NAME              ENDPOINTS                   AGE
keystone          10.247.61.227:5000          7d
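
As a quick cross-check, the DNAT targets programmed by kube-proxy can be compared against the current endpoint list on each node (a diagnostic sketch; the ccp namespace is inferred from the rule comments above):

# endpoints currently advertised by the API server
kubectl --namespace ccp get endpoints keystone
# DNAT targets programmed on this node; on the hung node the
# --to-destination address lags behind the endpoint list
iptables-save -t nat | grep 'ccp/keystone:5000'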

The kube-proxy pod is running on all nodes and is in the Ready state:

root@node1:~# kubectl --namespace kube-system get pods
NAME                                    READY     STATUS    RESTARTS   AGE
kube-proxy-node1                        1/1       Running   0          12d
kube-proxy-node2                        1/1       Running   0          12d
kube-proxy-node3                        1/1       Running   0          12d

lsof output shows that the kube-proxy process has established connections to http://127.0.0.1:8080 (the master address) and a number of open listening sockets.
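
The same check can be reproduced with something like the following (the PID lookup is an assumption and may need adjusting depending on how kube-proxy is containerized):

# sockets held by the kube-proxy process, without name/port resolution
lsof -nP -p "$(pidof kube-proxy)" | grep -E 'LISTEN|ESTABLISHED'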

However, strace shows that the process and all of its threads are stuck in futex; the typical picture is:

[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
[pid 12370] <... select resumed> )      = 0 (Timeout)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000} <unfinished ...>
[pid 12370] select(0, NULL, NULL, NULL, {0, 10000} <unfinished ...>
[pid 12405] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
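
A trace like the one above can be captured along these lines (a sketch; assumes the kube-proxy process is visible from the host):

# follow all threads and restrict the trace to the syscalls shown above
strace -f -tt -p "$(pidof kube-proxy)" -e trace=futex,select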

The last messages in the log file are:

{"log":"E1206 09:41:58.866601       1 proxier.go:1247] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 4 failed\n","stream":"stderr","time":"2016-12-06T09:41:58.86667315Z"}
{"log":")\n","stream":"stderr","time":"2016-12-06T09:41:58.866690737Z"}
{"log":"E1206 09:42:49.047532       1 proxier.go:1247] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 4 failed\n","stream":"stderr","time":"2016-12-06T09:42:49.047652912Z"}
{"log":")\n","stream":"stderr","time":"2016-12-06T09:42:49.047670622Z"}
{"log":"E1206 10:51:11.788331       1 proxier.go:1247] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 4 failed\n","stream":"stderr","time":"2016-12-06T10:51:11.788482453Z"}
{"log":")\n","stream":"stderr","time":"2016-12-06T10:51:11.788506419Z"}

Note that logs from other (live) kube-proxy instances contain the same complaints about iptables-restore, so that may not be the actual issue.
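
To check whether the iptables-restore failure itself is reproducible outside kube-proxy, the current ruleset can be round-tripped by hand (a sketch; --test should parse and validate the rules without committing them):

# a "line N failed" error here would point at a rule iptables itself rejects
iptables-save -t nat | iptables-restore --test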

What you expected to happen:
Kubernetes should somehow detect that kube-proxy has hung and stopped processing events. I'd expect automatic detection of the problem and a restart of the kube-proxy pod.

How to reproduce it (as minimally and precisely as possible):
The issue happened under very low load; the cluster has about 100 pods and about 20 services in total.


ayasakov commented Jan 17, 2017

Reproduced in my environment.

Kubernetes version:
Kubernetes v1.4.6+coreos.0

Environment:

  • Cloud provider or hardware configuration:
    200 bare-metal nodes
  • Install tools:
    Kargo

Info: http://paste.openstack.org/show/595230/


pskrzyns commented Feb 2, 2017

cc @kubernetes/sig-network-bugs


listomin commented Feb 4, 2017

@ivan4th has investigated the issue and opened a discussion in the Go project: golang/go#18925


pigmej commented Feb 8, 2017

https://github.com/ivan4th/hoseproxy is a small project by @ivan4th to reproduce that bug.


dcbw commented Mar 20, 2017

If kube-proxy is running containerized, it could be a candidate for health checks. Otherwise, I think handling this is probably outside the scope of kube-proxy itself, and is instead a general issue for various Kubernetes components, not the proxy specifically.
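
A sketch of what such a check could look like for a containerized kube-proxy, purely as an illustration (the port must match kube-proxy's --healthz-port setting, and whether /healthz would actually report unhealthy during this kind of hang is unverified):

# hypothetical liveness command for the kube-proxy container
curl -sf --max-time 5 http://127.0.0.1:10249/healthz || exit 1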

Could we close this issue in favor of the Go one, since the hang seems to be a Go thing and not actually kube-proxy?

@equinox0815

We have a very similar problem with kubelet. An underlying bug in Go would explain this. Is anyone else experiencing this issue with kubelet as well, or is it just us?

@thockin added the sig/network label on May 19, 2017

thockin commented May 19, 2017

This looks like it's not a bug in kube?

@thockin closed this as completed on May 19, 2017

r7vme commented Jun 13, 2017

We had the same issue at least twice.


r0bj commented Jun 13, 2017

@equinox0815 I have experienced this bug also with kubelet.
