
kube-proxy process silently hangs on one of nodes #38372

Closed
shakhat opened this issue Dec 8, 2016 · 9 comments
Labels
sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments


shakhat commented Dec 8, 2016

Kubernetes version (use kubectl version):
Kubernetes v1.4.3+coreos.0

Environment:

  • Cloud provider or hardware configuration:
    6 bare-metal nodes
  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04.1 LTS (Xenial Xerus)
  • Kernel (e.g. uname -a):
    Linux node1 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Kargo
  • Others:
    Calico 0.21

What happened:
There is a pod (OpenStack Keystone) running on node1 with a service associated with it. Initially the service is reachable from all nodes, e.g. a request to http://keystone.ccp:5000/ returns a valid HTTP response.

The Keystone pod is restarted; as a result it gets a different IP address, and the DNS record is updated to point to the new address. However, the service is no longer reachable from node2.
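
To confirm the symptom, the same request can be repeated from a working and a broken node (a minimal check, assuming the service name resolves from the nodes as described above):

curl -m 5 -v http://keystone.ccp:5000/    # succeeds on node1, times out on node2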

The reason is that the iptables rules are not being updated:

good node

-A KUBE-SEP-MMRSUY2HT6SO3JN6 -s 10.247.61.227/32 -m comment --comment "ccp/keystone:5000" -j KUBE-MARK-MASQ
-A KUBE-SEP-MMRSUY2HT6SO3JN6 -p tcp -m comment --comment "ccp/keystone:5000" -m tcp -j DNAT --to-destination 10.247.61.227:5000

bad node

-A KUBE-SEP-2FVN6OV734ACY4G4 -s 10.247.61.245/32 -m comment --comment "ccp/keystone:5000" -j KUBE-MARK-MASQ
-A KUBE-SEP-2FVN6OV734ACY4G4 -p tcp -m comment --comment "ccp/keystone:5000" -m tcp -j DNAT --to-destination 10.247.61.245:5000

and the service endpoints:

vagrant@node1:~$ kubectl get endpoints 
NAME              ENDPOINTS                   AGE
keystone          10.247.61.227:5000          7d
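
As a quick cross-check, the DNAT targets programmed by kube-proxy can be compared against the current endpoint list on each node (a diagnostic sketch; the ccp namespace is inferred from the rule comments above):

# endpoints currently advertised by the API server
kubectl --namespace ccp get endpoints keystone
# DNAT targets programmed on this node; on the hung node the
# --to-destination address lags behind the endpoint list
iptables-save -t nat | grep 'ccp/keystone:5000'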

The kube-proxy pod is running on all nodes and is in the Ready state:

root@node1:~# kubectl --namespace kube-system get pods
NAME                                    READY     STATUS    RESTARTS   AGE
kube-proxy-node1                        1/1       Running   0          12d
kube-proxy-node2                        1/1       Running   0          12d
kube-proxy-node3                        1/1       Running   0          12d

lsof output shows that the kube-proxy process has established connections to http://127.0.0.1:8080 (the master address) and a number of open listening sockets.
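
The same check can be reproduced with something like the following (the PID lookup is an assumption and may need adjusting depending on how kube-proxy is containerized):

# sockets held by the kube-proxy process, without name/port resolution
lsof -nP -p "$(pidof kube-proxy)" | grep -E 'LISTEN|ESTABLISHED'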

However, strace shows that the process and all of its threads are stuck in futex; the typical picture is:

[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
[pid 12370] <... select resumed> )      = 0 (Timeout)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000} <unfinished ...>
[pid 12370] select(0, NULL, NULL, NULL, {0, 10000} <unfinished ...>
[pid 12405] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 12405] futex(0x7594868, FUTEX_WAIT, 0, {0, 100000}) = -1 ETIMEDOUT (Connection timed out)
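
A trace like the one above can be captured along these lines (a sketch; assumes the kube-proxy process is visible from the host):

# follow all threads and restrict the trace to the syscalls shown above
strace -f -tt -p "$(pidof kube-proxy)" -e trace=futex,select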

The last messages in the log file are:

{"log":"E1206 09:41:58.866601       1 proxier.go:1247] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 4 failed\n","stream":"stderr","time":"2016-12-06T09:41:58.86667315Z"}
{"log":")\n","stream":"stderr","time":"2016-12-06T09:41:58.866690737Z"}
{"log":"E1206 09:42:49.047532       1 proxier.go:1247] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 4 failed\n","stream":"stderr","time":"2016-12-06T09:42:49.047652912Z"}
{"log":")\n","stream":"stderr","time":"2016-12-06T09:42:49.047670622Z"}
{"log":"E1206 10:51:11.788331       1 proxier.go:1247] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 4 failed\n","stream":"stderr","time":"2016-12-06T10:51:11.788482453Z"}
{"log":")\n","stream":"stderr","time":"2016-12-06T10:51:11.788506419Z"}

Note that logs from other (live) kube-proxy instances contain the same complaints about iptables-restore, so that may not be the actual issue.
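
To check whether the iptables-restore failure itself is reproducible outside kube-proxy, the current ruleset can be round-tripped by hand (a sketch; --test should parse and validate the rules without committing them):

# a "line N failed" error here would point at a rule iptables itself rejects
iptables-save -t nat | iptables-restore --test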

What you expected to happen:
Kubernetes should somehow detect that kube-proxy has hung and stopped processing events. I'd expect automatic detection of the problem and a restart of the kube-proxy pod.

How to reproduce it (as minimally and precisely as possible):
The issue happened under very low load; the cluster has about 100 pods and about 20 services in total.


ayasakov commented Jan 17, 2017

Reproduced in my environment.

Kubernetes version:
Kubernetes v1.4.6+coreos.0

Environment:

  • Cloud provider or hardware configuration:
    200 bare-metal nodes
  • Install tools:
    Kargo

Info: http://paste.openstack.org/show/595230/


pskrzyns commented Feb 2, 2017

cc @kubernetes/sig-network-bugs


listomin commented Feb 4, 2017

@ivan4th has investigated the issue and opened a discussion in the Go project: golang/go#18925


pigmej commented Feb 8, 2017

https://github.com/ivan4th/hoseproxy is a small project by @ivan4th to reproduce that bug.


dcbw commented Mar 20, 2017

If kube-proxy is running containerized, it could be a candidate for health checks. Otherwise, I think handling this is probably outside the scope of kube-proxy itself, and is instead a general issue for various Kubernetes components, not the proxy specifically.
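
A sketch of what such a check could look like for a containerized kube-proxy, purely as an illustration (the port must match kube-proxy's --healthz-port setting, and whether /healthz would actually report unhealthy during this kind of hang is unverified):

# hypothetical liveness command for the kube-proxy container
curl -sf --max-time 5 http://127.0.0.1:10249/healthz || exit 1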

Could we close this issue in favor of the Go one, since the hang seems to be a Go thing and not actually kube-proxy?

@equinox0815

We have a very similar problem with kubelet. An underlying bug in Go would explain this. Is anyone else experiencing this issue with kubelet as well, or is it just us?

@thockin added the sig/network label on May 19, 2017

thockin commented May 19, 2017

This looks like it's not a bug in kube?

@thockin closed this as completed on May 19, 2017

r7vme commented Jun 13, 2017

We had the same issue at least twice.


r0bj commented Jun 13, 2017

@equinox0815 I have experienced this bug also with kubelet.
