
Pods communicating via websockets and API hanging #67457

Closed
bonneaud opened this issue Aug 15, 2018 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bonneaud

bonneaud commented Aug 15, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
We have Python tasks running in Kubernetes pods that need to communicate with each other. We run commands in one pod from another pod using connect_get_namespaced_pod_exec [1] via the Kubernetes Python API, which, if we understand correctly, is a wrapper for 'kubectl exec'.
Once in a while a task hangs after the stream call that sends the command to the other task/pod.
The call never seems to return, so the task hangs forever while the underlying pod appears to be running without any error.
We don’t have a proper fix; we have to delete the pod and re-create it completely, after which the task usually terminates successfully.

Is it a problematic practice to regularly run commands inside a pod with kubectl exec as part of an application's logic? Can we expect this interface to be stable?

[1] https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#connect_get_namespaced_pod_exec

What you expected to happen:
Most of the time the websocket calls return, which is what we expect. In the failing case, even if the underlying socket is being terminated or the connection is timing out (I don’t know whether that is actually happening), the call itself never returns.

How to reproduce it (as minimally and precisely as possible):
We don’t have a systematic way to reproduce the issue. We have Python tasks that do the following:

from kubernetes.client.apis import core_v1_api
from kubernetes.stream import stream

def some_method_in_some_class(self, cmd: list):
    self.api = core_v1_api.CoreV1Api()
    […]
    # The call below hangs sometimes
    return_value = stream(
        self.api.connect_get_namespaced_pod_exec,
        self.web_pod_name,
        self.namespace,
        command=['airflow'] + cmd,
        container='web',
        stdin=True,
        stdout=True)
    […]
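
For reference, stream() can also hand back the raw WebSocket client instead of pre-loading the output, which lets the caller enforce its own deadline rather than blocking until the remote command finishes. The following is only a sketch, reusing the stream import above and assuming the Python client's _preload_content=False mode and the WSClient methods is_open(), update(), peek_stdout(), read_stdout(), and close(); the function name and deadline value are made up:

import time

def exec_with_deadline(api, pod_name, namespace, command, deadline=60):
    # Ask stream() for the raw WSClient instead of the collected output
    # (_preload_content=False), so the loop below controls how long we wait.
    client = stream(
        api.connect_get_namespaced_pod_exec,
        pod_name,
        namespace,
        command=command,
        container='web',
        stderr=True, stdin=False, stdout=True, tty=False,
        _preload_content=False)
    output = []
    start = time.time()
    while client.is_open() and time.time() - start < deadline:
        client.update(timeout=1)            # poll the websocket for new frames
        if client.peek_stdout():
            output.append(client.read_stdout())
    client.close()                          # stop waiting once the deadline passes
    return ''.join(output)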

Environment:

  • Kubernetes version (use kubectl version): 1.9.7-gke.5
  • Cloud provider or hardware configuration: Google Cloud / GKE cluster
  • OS (e.g. from /etc/os-release): Linux

/sig bugs

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Aug 15, 2018
@neolit123
Member

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 16, 2018
@ningning54321

ningning54321 commented Oct 17, 2018

Hit the same issue in my lab.
Is there a solution or workaround for this issue?
"Calling exec interactively" seems to work fine.
The websocket log stops after getting the response:

--- request header ---
GET /api/v1/namespaces/default/pods/lcm-crmq-0/exec?command=%2Fusr%2Fbin%2Fsh&command=-c&command=ls&stderr=True&stdin=False&stdout=True&tty=False HTTP/1.1
Upgrade: websocket
Connection: Upgrade
Host: 10.254.0.1
Origin: http://10.254.0.1
Sec-WebSocket-Key: BqnCVfWfDGxJvswGazkIiw==
Sec-WebSocket-Version: 13
authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJuY21zIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImNidXItbWFzdGVyLWNidXItdG9rZW4tbTh6a3giLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiY2J1ci1tYXN0ZXItY2J1ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6Ijc4NGQ0MzMyLWQxMjMtMTFlOC04NTBmLWZhMTYzZThiMTI3MCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpuY21zOmNidXItbWFzdGVyLWNidXIifQ.hRGVbUQmj3mcqsYOENZw2VvA9b9u84pjQFLCvDdsLP72eDEXq3_AN-zbyBBBesN8ZQ9V_I2166D9RexZF1_dSwELeObHWQDrQfdCBXCbvVjqvNxLOXePE7SiRQB7mg7MFO9KXScM_GMFSmtgR8W5mwbEB2KYUmFZEEVWZTr3jvlzzWVXBSNEj-3VEy6sOtihXdZn0iAk2nktAd7Lq5R7LSJ9b2rj2WbmOXqq9D2zHDxptg4Cl9Zia5cfYtugcXykFFQeKw0iiTzQbLnqZOyCoDerNp3s0ZD8WorRCCPUlJHDDLKe70Tr4STb6rIwCfLe8AkDVbSWaFt7uNsX4Z_AVQ
sec-websocket-protocol: v4.channel.k8s.io


-----------------------
--- response header ---
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: gLY8pN1Ii7bb3Mj01d5sBlE7lec=
Sec-WebSocket-Protocol: v4.channel.k8s.io
-----------------------

@ningning54321

Found a workaround: add a request timeout. Not sure whether it is a proper solution.

resp = stream(api.connect_get_namespaced_pod_exec, name, 'default',
              command=exec_command,
              _request_timeout=5,
              stderr=True, stdin=False,
              stdout=True, tty=False)
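
With a timeout set, a hung call raises instead of blocking forever, so the caller needs to handle that. The exact exception type depends on the client and websocket library versions, so the retry wrapper below catches broadly; the function name and retry count are made up, not from this thread:

def exec_with_retries(api, name, namespace, exec_command, retries=3):
    for attempt in range(retries):
        try:
            return stream(api.connect_get_namespaced_pod_exec, name, namespace,
                          command=exec_command,
                          _request_timeout=5,       # seconds; bounds each attempt
                          stderr=True, stdin=False,
                          stdout=True, tty=False)
        except Exception:                           # timeout exception type varies by version
            if attempt == retries - 1:
                raise                                # give up after the last retry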

@yboaron

yboaron commented Nov 7, 2018

We are hitting the same issue with Kubernetes v1.11.0 and Python client 7.0.0 (same behavior with Python client 8.0.0).

Is there any update on that case?

@sercanacar

Same issue here

@delwaterman

Same issue here as well.

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@danwinship
Contributor

Are you connecting via direct pod IPs, or via service names/IPs?

Is it possible that the destination pod is dying and being restarted at any point? (e.g., run "kubectl get pods" and see if RESTARTS is non-zero).
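
A quick way to check that from the same Python client (a sketch; api, pod_name, and namespace are placeholders for the caller's own objects) is to read the pod and sum its container restart counts:

pod = api.read_namespaced_pod(name=pod_name, namespace=namespace)
restarts = sum(cs.restart_count for cs in (pod.status.container_statuses or []))
print(pod.metadata.name, 'RESTARTS:', restarts)   # non-zero means the pod restarted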

@athenabot

@danwinship
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@danwinship
Contributor

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label May 10, 2019
@ningning54321

In my case, the connection is created directly to the pod IP. I am pretty sure the target pod is not being restarted during the connection time.

@danwinship
Contributor

If you're connecting directly to the pod IP, then this is most likely a problem with your network plugin, not with Kubernetes itself.
/close

@k8s-ci-robot
Contributor

@danwinship: Closing this issue.

In response to this:

If you're connecting directly to the pod IP, then this is most likely a problem with your network plugin, not with Kubernetes itself.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
