
Kubes ops communicating via websockets and API hanging #602

Closed
bonneaud opened this issue Aug 15, 2018 · 17 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@bonneaud

Hi,

This is a Bug Report.

- [ ] Feature Request
- [x] Bug Report

Problem:
We're using Kubernetes clusters with pods running Airflow. Other pods then run tasks that send commands to Airflow. To do that, these ops use websockets (kubernetes.stream) and Kubernetes' core_v1_api to get information about the pod running Airflow. We run commands in one pod, from another pod, using connect_get_namespaced_pod_exec [1] via the Kubernetes Python API, which, if we understand correctly, is a wrapper for 'kubectl exec'. Our code looks like:

    stream(self.api.connect_get_namespaced_pod_exec,
           self.web_pod_name,
           self.namespace,
           command=['airflow'] + cmd,
           container='web',
           stdin=True,
           stdout=True)

Most of the time, our tasks are able to send commands and get their output back from Airflow's pods. Once in a while, though, a task seems to hang forever on the stream call. When a task hangs, it is still accessible via Kubernetes and shown as running; it is simply silent. We don't see any error.

[1] https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#connect_get_namespaced_pod_exec

Proposed Solution:
We don't have a solution right now, other than deleting the pod via Kubernetes and re-creating it.

Environment:

  • Kubernetes version (use kubectl version): 1.9.7-gke.5
  • Cloud provider or hardware configuration: Google Cloud / GKE cluster
  • OS (e.g. from /etc/os-release): Linux
@abeltre1

abeltre1 commented Sep 16, 2018

@bonneaud I have the same issue; have you found a solution to it?
Alternatively, we can use os.system() as a hack.

```
Traceback (most recent call last):
  File "main.py", line 59, in <module>
    main()
  File "main.py", line 49, in main
    v1.connect_get_namespaced_pod_exec('mpi0', 'default', command='ls', container='ompi3', stderr=True, stdin=True, stdout=True, tty=True)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 835, in connect_get_namespaced_pod_exec
    (data) = self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, **kwargs)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 935, in connect_get_namespaced_pod_exec_with_http_info
    collection_formats=collection_formats)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request

HTTP response headers: HTTPHeaderDict({'Date': 'Sun, 16 Sep 2018 18:40:37 GMT', 'Content-Length': '139', 'Content-Type': 'application/json'})

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Upgrade request required","reason":"BadRequest","code":400}
```
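Worth noting: that 400 ("Upgrade request required") is a different failure from the hang described above. It is the API server saying the exec endpoint was hit with a plain HTTP GET instead of a websocket upgrade, which is what happens when connect_get_namespaced_pod_exec is called directly rather than through kubernetes.stream.stream. A minimal sketch of the wrapped call (same placeholder pod/container names as in the traceback):

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

# stream() performs the websocket upgrade the exec endpoint requires,
# instead of the plain GET that triggers the 400 above.
resp = stream(v1.connect_get_namespaced_pod_exec,
              'mpi0', 'default',
              command=['ls'],
              container='ompi3',
              stderr=True, stdin=False,
              stdout=True, tty=False)
print(resp)
```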

@PaulFurtado

@bonneaud I may have tracked down the issue you're hitting; I filed it here: kubernetes-client/python-base#106

@arzarif

arzarif commented Jan 28, 2019

@PaulFurtado - are you hitting this specifically when using CRI-O? The OP's description of this bug matches my own exactly - intermittent, indefinite hangs when invoking connect_get_namespaced_pod_exec. In my case, however, I'm using just Docker. It doesn't look like the OP specified which runtime they were using.

@PaulFurtado

@arzarif The issue I filed is relevant to every runtime; however, for the commands I tested, something about the framing of the data makes it very reproducible when CRI-O is the runtime.

@arzarif

arzarif commented Jan 28, 2019

@PaulFurtado Interesting. Thanks for the heads up. This seems to be a pretty serious issue; it appears that Rundeck's remote pod execution feature is broken because of this. I tried to bypass Rundeck altogether and hit the same problem in my own client before stumbling on this issue.

@PaulFurtado

Yeah, I agree that this is a serious issue for anyone using the exec API from Python. Unfortunately, I'm not a maintainer and none of them have commented on that issue, but maybe I should just open a PR to get the ball rolling. On my end, I work around this by monkey-patching the Python client; you could do the same if you need an immediate solution.
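The monkey-patch itself isn't posted in this thread, so the following is not that patch but a stopgap sketch that avoids patching the client: ask stream() for the raw WSClient with _preload_content=False and drive it under an explicit deadline, so the caller can never block forever. Pod, namespace, container names and the 30-second deadline below are placeholders:

```python
import time

from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

# _preload_content=False returns a WSClient instead of the collected output,
# so the caller controls how long it is willing to wait.
ws = stream(v1.connect_get_namespaced_pod_exec,
            'airflow-web-0', 'default',          # placeholder pod and namespace
            command=['airflow', 'version'],
            container='web',
            stderr=True, stdin=False,
            stdout=True, tty=False,
            _preload_content=False)

deadline = time.time() + 30                      # hard client-side deadline
output = []
while ws.is_open() and time.time() < deadline:
    ws.update(timeout=1)                         # pump the websocket for up to 1s
    if ws.peek_stdout():
        output.append(ws.read_stdout())
    if ws.peek_stderr():
        output.append(ws.read_stderr())
ws.close()
print(''.join(output))
```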

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2019
@PaulFurtado

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2019
@TomasTomecek
Contributor

Having the same issue: packit/sandcastle#23

@TomasTomecek
Contributor

After being stuck on this for far too long, I was able to solve this by passing _request_timeout=30.0 to stream(); can anyone verify?
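For anyone trying this, _request_timeout is simply forwarded through stream() to the underlying websocket call (whose default, as noted further down, is 60 seconds). A sketch of where it goes, with placeholder names; in principle it makes the call fail fast rather than hang silently, though later comments report it does not help in every case:

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder pod/namespace/container; _request_timeout bounds the websocket
# connect/read instead of letting the call wait forever.
resp = stream(v1.connect_get_namespaced_pod_exec,
              'airflow-web-0', 'default',
              command=['airflow', 'version'],
              container='web',
              stderr=True, stdin=False,
              stdout=True, tty=False,
              _request_timeout=30.0)
print(resp)
```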

@SudheerBondada

SudheerBondada commented Sep 16, 2019

@PaulFurtado @arzarif @danwinship This issue seems to be occurring with https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#list_namespaced_pod as well (k8s Python client version 8.0.1). Adding a timeout does not help either; the API call neither returns anything nor throws an error. It looks like no one from the dev team is assigned to this issue yet.
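For completeness, the generated API methods accept _request_timeout either as a single number or as a (connect, read) tuple, so a sketch like the one below at least guarantees the HTTP layer raises instead of blocking indefinitely; whether that addresses this particular hang is, as noted above, an open question:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Single number: total timeout for the request, in seconds.
pods = v1.list_namespaced_pod('default', _request_timeout=30)

# Tuple: separate (connect, read) timeouts.
pods = v1.list_namespaced_pod('default', _request_timeout=(5, 30))

print(len(pods.items))
```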

@EricCat

EricCat commented Sep 18, 2019

@TomasTomecek _request_timeout does not work for me. The kubernetes package's default is already `_request_timeout = kwargs.get("_request_timeout", 60)`.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 16, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@michalmisiewicz

I'm facing the same issue with create_namespaced_pod
