
Pods communicating via websockets and API hanging #67457

Closed
bonneaud opened this issue Aug 15, 2018 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bonneaud

bonneaud commented Aug 15, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
We have Python tasks running in Kubernetes pods that need to communicate with each other. We run commands in one pod from another pod using connect_get_namespaced_pod_exec [1] via the Kubernetes Python API, which, if we understand correctly, is a wrapper for 'kubectl exec'.
Once in a while a task hangs after the stream call that sends the command to the other task/pod.
The call never seems to return, so the task hangs forever while the underlying pod appears to be running without any error.
We don’t have a proper fix; we have to delete the pod and re-create it completely, after which the task usually terminates successfully.

Is it a problematic practice to regularly run commands inside a pod with kubectl exec as part of an application's logic? Can we expect this interface to be stable?

[1] https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#connect_get_namespaced_pod_exec

What you expected to happen:
Most of the time the websocket calls return, which is what we expect. In the failing case, even if the underlying socket is being terminated or the connection is timing out (I don’t know whether that is actually happening), the call itself never returns.

How to reproduce it (as minimally and precisely as possible):
We don’t have a systematic way to reproduce the issue. We have Python tasks that do the following:

from kubernetes.client.apis import core_v1_api
from kubernetes.stream import stream

def some_method_in_some_class(self, cmd: list):
    self.api = core_v1_api.CoreV1Api()
    […]
    # The call below hangs sometimes
    return_value = stream(
        self.api.connect_get_namespaced_pod_exec,
        self.web_pod_name,
        self.namespace,
        command=['airflow'] + cmd,
        container='web',
        stdin=True,
        stdout=True)
    […]
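
For reference, stream() can also hand back the raw WebSocket client instead of pre-loading the output, which lets the caller enforce its own deadline rather than blocking until the remote command finishes. The following is only a sketch, reusing the stream import above and assuming the Python client's _preload_content=False mode and the WSClient methods is_open(), update(), peek_stdout(), read_stdout(), and close(); the function name and deadline value are made up:

import time

def exec_with_deadline(api, pod_name, namespace, command, deadline=60):
    # Ask stream() for the raw WSClient instead of the collected output
    # (_preload_content=False), so the loop below controls how long we wait.
    client = stream(
        api.connect_get_namespaced_pod_exec,
        pod_name,
        namespace,
        command=command,
        container='web',
        stderr=True, stdin=False, stdout=True, tty=False,
        _preload_content=False)
    output = []
    start = time.time()
    while client.is_open() and time.time() - start < deadline:
        client.update(timeout=1)            # poll the websocket for new frames
        if client.peek_stdout():
            output.append(client.read_stdout())
    client.close()                          # stop waiting once the deadline passes
    return ''.join(output)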

Environment:

  • Kubernetes version (use kubectl version): 1.9.7-gke.5
  • Cloud provider or hardware configuration: Google Cloud / GKE cluster
  • OS (e.g. from /etc/os-release): Linux

/sig bugs

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Aug 15, 2018
@neolit123
Member

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 16, 2018
@ningning54321

ningning54321 commented Oct 17, 2018

Hit the same issue in my lab.
Is there a solution or workaround for this issue?
"Calling exec interactively" seems to work fine.
The websocket log stops after getting the response:

--- request header ---
GET /api/v1/namespaces/default/pods/lcm-crmq-0/exec?command=%2Fusr%2Fbin%2Fsh&command=-c&command=ls&stderr=True&stdin=False&stdout=True&tty=False HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Host: 10.254.0.1
Origin: http://10.254.0.1
Sec-WebSocket-Key: BqnCVfWfDGxJvswGazkIiw==
Sec-WebSocket-Version: 13
authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJuY21zIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImNidXItbWFzdGVyLWNidXItdG9rZW4tbTh6a3giLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiY2J1ci1tYXN0ZXItY2J1ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6Ijc4NGQ0MzMyLWQxMjMtMTFlOC04NTBmLWZhMTYzZThiMTI3MCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpuY21zOmNidXItbWFzdGVyLWNidXIifQ.hRGVbUQmj3mcqsYOENZw2VvA9b9u84pjQFLCvDdsLP72eDEXq3_AN-zbyBBBesN8ZQ9V_I2166D9RexZF1_dSwELeObHWQDrQfdCBXCbvVjqvNxLOXePE7SiRQB7mg7MFO9KXScM_GMFSmtgR8W5mwbEB2KYUmFZEEVWZTr3jvlzzWVXBSNEj-3VEy6sOtihXdZn0iAk2nktAd7Lq5R7LSJ9b2rj2WbmOXqq9D2zHDxptg4Cl9Zia5cfYtugcXykFFQeKw0iiTzQbLnqZOyCoDerNp3s0ZD8WorRCCPUlJHDDLKe70Tr4STb6rIwCfLe8AkDVbSWaFt7uNsX4Z_AVQ
sec-websocket-protocol: v4.channel.k8s.io


-----------------------
--- response header ---
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: gLY8pN1Ii7bb3Mj01d5sBlE7lec=
Sec-WebSocket-Protocol: v4.channel.k8s.io
-----------------------

@ningning54321

Found a workaround: add a request timeout. Not sure whether it is a proper solution.

resp = stream(api.connect_get_namespaced_pod_exec, name, 'default',
              command=exec_command,
              _request_timeout=5,
              stderr=True, stdin=False,
              stdout=True, tty=False)
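
With a timeout set, a hung call raises instead of blocking forever, so the caller needs to handle that. The exact exception type depends on the client and websocket library versions, so the retry wrapper below catches broadly; the function name and retry count are made up, not from this thread:

def exec_with_retries(api, name, namespace, exec_command, retries=3):
    for attempt in range(retries):
        try:
            return stream(api.connect_get_namespaced_pod_exec, name, namespace,
                          command=exec_command,
                          _request_timeout=5,       # seconds; bounds each attempt
                          stderr=True, stdin=False,
                          stdout=True, tty=False)
        except Exception:                           # timeout exception type varies by version
            if attempt == retries - 1:
                raise                                # give up after the last retry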

@yboaron

yboaron commented Nov 7, 2018

We are hitting the same issue with Kubernetes v1.11.0 and Python client 7.0.0 (same behavior with Python client 8.0.0).

Is there any update on that case?

@sercanacar

Same issue here

@delwaterman

Same issue here as well.

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@danwinship
Contributor

Are you connecting via direct pod IPs, or via service names/IPs?

Is it possible that the destination pod is dying and being restarted at any point? (e.g., run "kubectl get pods" and see if RESTARTS is non-zero).
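
A quick way to check that from the same Python client (a sketch; api, pod_name, and namespace are placeholders for the caller's own objects) is to read the pod and sum its container restart counts:

pod = api.read_namespaced_pod(name=pod_name, namespace=namespace)
restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
print(pod.metadata.name, 'RESTARTS:', restarts)   # non-zero means the pod restarted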

@athenabot

@danwinship
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@danwinship
Contributor

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label May 10, 2019
@ningning54321

In my case, the connection is created directly to the pod IP. I am pretty sure the target pod is not being restarted during the connection time.

@danwinship
Contributor

If you're connecting directly to the pod IP, then this is most likely a problem with your network plugin, not with Kubernetes itself.
/close

@k8s-ci-robot
Contributor

@danwinship: Closing this issue.

In response to this:

If you're connecting directly to the pod IP, then this is most likely a problem with your network plugin, not with Kubernetes itself.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
