
Kubes ops communicating via websockets and API hanging #602

Closed
bonneaud opened this issue Aug 15, 2018 · 17 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@bonneaud

Hi,

This is a Bug Report.

- [ ] Feature Request
- [x] Bug Report

Problem:
We're using Kubernetes clusters with pods running Airflow. Other pods then run tasks that send commands to Airflow. To do that, these ops use websockets (kubernetes.stream) and Kubernetes' core_v1_api to get information about the pod running Airflow. We run commands in one pod, from another pod, using connect_get_namespaced_pod_exec [1] via the Kubernetes Python API, which, if we understand correctly, is a wrapper for 'kubectl exec'. Our code looks like:

    stream(self.api.connect_get_namespaced_pod_exec,
           self.web_pod_name,
           self.namespace,
           command=['airflow'] + cmd,
           container='web',
           stdin=True,
           stdout=True)

Most of the time, our tasks are able to send commands and get their output back from Airflow's pods. Once in a while, though, a task seems to hang forever on the stream call. When a task hangs, it is still accessible via Kubernetes and shown as running; it is simply silent. We don't see any error.

[1] https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#connect_get_namespaced_pod_exec

Proposed Solution:
We don't have a solution right now, other than deleting the pod via Kubernetes and re-creating it.

Environment:

  • Kubernetes version (use kubectl version): 1.9.7-gke.5
  • Cloud provider or hardware configuration: Google Cloud / GKE cluster
  • OS (e.g. from /etc/os-release): Linux
@abeltre1

abeltre1 commented Sep 16, 2018

@bonneaud I have the same issue; have you found a solution to it?
Alternatively, we can use os.system() as a hack.

```
Traceback (most recent call last):
  File "main.py", line 59, in <module>
    main()
  File "main.py", line 49, in main
    v1.connect_get_namespaced_pod_exec('mpi0', 'default', command='ls', container='ompi3', stderr=True, stdin=True, stdout=True, tty=True)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 835, in connect_get_namespaced_pod_exec
    (data) = self.connect_get_namespaced_pod_exec_with_http_info(name, namespace, **kwargs)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 935, in connect_get_namespaced_pod_exec_with_http_info
    collection_formats=collection_formats)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/home/abeltre1/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request

HTTP response headers: HTTPHeaderDict({'Date': 'Sun, 16 Sep 2018 18:40:37 GMT', 'Content-Length': '139', 'Content-Type': 'application/json'})

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Upgrade request required","reason":"BadRequest","code":400}
```
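Worth noting: that 400 ("Upgrade request required") is a different failure from the hang described above. It is the API server saying the exec endpoint was hit with a plain HTTP GET instead of a websocket upgrade, which is what happens when connect_get_namespaced_pod_exec is called directly rather than through kubernetes.stream.stream. A minimal sketch of the wrapped call (same placeholder pod/container names as in the traceback):

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

# stream() performs the websocket upgrade the exec endpoint requires,
# instead of the plain GET that triggers the 400 above.
resp = stream(v1.connect_get_namespaced_pod_exec,
              'mpi0', 'default',
              command=['ls'],
              container='ompi3',
              stderr=True, stdin=False,
              stdout=True, tty=False)
print(resp)
```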

@PaulFurtado

@bonneaud I may have tracked down the issue you're hitting; I filed it here: kubernetes-client/python-base#106

@arzarif

arzarif commented Jan 28, 2019

@PaulFurtado - are you hitting this specifically when using CRI-O? The OP's description of this bug matches my own exactly - intermittent, indefinite hangs when invoking connect_get_namespaced_pod_exec. In my case, however, I'm using just Docker. It doesn't look like the OP specified which runtime they were using.

@PaulFurtado

@arzarif The issue I filed is relevant to every runtime; however, for the commands I tested, something about the framing of the data makes it very reproducible when CRI-O is the runtime.

@arzarif

arzarif commented Jan 28, 2019

@PaulFurtado Interesting. Thanks for the heads up. This seems to be a pretty serious issue; it appears that Rundeck's remote pod execution feature is broken because of this. I tried to bypass Rundeck altogether and hit the same problem in my own client before stumbling on this issue.

@PaulFurtado

Yeah, I agree that this is a serious issue for anyone using the exec API from Python. Unfortunately, I'm not a maintainer and none of them have commented on that issue, but maybe I should just open a PR to get the ball rolling. On my end, I work around this by monkey-patching the Python client; you could do the same if you need an immediate solution.
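The monkey-patch itself isn't posted in this thread, so the following is not that patch but a stopgap sketch that avoids patching the client: ask stream() for the raw WSClient with _preload_content=False and drive it under an explicit deadline, so the caller can never block forever. Pod, namespace, container names and the 30-second deadline below are placeholders:

```python
import time

from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

# _preload_content=False returns a WSClient instead of the collected output,
# so the caller controls how long it is willing to wait.
ws = stream(v1.connect_get_namespaced_pod_exec,
            'airflow-web-0', 'default',          # placeholder pod and namespace
            command=['airflow', 'version'],
            container='web',
            stderr=True, stdin=False,
            stdout=True, tty=False,
            _preload_content=False)

deadline = time.time() + 30                      # hard client-side deadline
output = []
while ws.is_open() and time.time() < deadline:
    ws.update(timeout=1)                         # pump the websocket for up to 1s
    if ws.peek_stdout():
        output.append(ws.read_stdout())
    if ws.peek_stderr():
        output.append(ws.read_stderr())
ws.close()
print(''.join(output))
```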

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 29, 2019
@PaulFurtado

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2019
@TomasTomecek
Contributor

Having the same issue: packit/sandcastle#23

@TomasTomecek
Contributor

After being stuck on this for far too long, I was able to solve this by passing _request_timeout=30.0 to stream(); can anyone verify?
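For anyone trying this, _request_timeout is simply forwarded through stream() to the underlying websocket call (whose default, as noted further down, is 60 seconds). A sketch of where it goes, with placeholder names; in principle it makes the call fail fast rather than hang silently, though later comments report it does not help in every case:

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder pod/namespace/container; _request_timeout bounds the websocket
# connect/read instead of letting the call wait forever.
resp = stream(v1.connect_get_namespaced_pod_exec,
              'airflow-web-0', 'default',
              command=['airflow', 'version'],
              container='web',
              stderr=True, stdin=False,
              stdout=True, tty=False,
              _request_timeout=30.0)
print(resp)
```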

@SudheerBondada

SudheerBondada commented Sep 16, 2019

@PaulFurtado @arzarif @danwinship This issue seems to be occurring with https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#list_namespaced_pod as well (k8s Python client version 8.0.1). Adding a timeout does not help either; the API call neither returns anything nor throws an error. It looks like no one from the dev team is assigned to this issue yet.
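For completeness, the generated API methods accept _request_timeout either as a single number or as a (connect, read) tuple, so a sketch like the one below at least guarantees the HTTP layer raises instead of blocking indefinitely; whether that addresses this particular hang is, as noted above, an open question:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Single number: total timeout for the request, in seconds.
pods = v1.list_namespaced_pod('default', _request_timeout=30)

# Tuple: separate (connect, read) timeouts.
pods = v1.list_namespaced_pod('default', _request_timeout=(5, 30))

print(len(pods.items))
```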

@EricCat

EricCat commented Sep 18, 2019

@TomasTomecek _request_timeout does not work for me. The kubernetes package's default is already `_request_timeout = kwargs.get("_request_timeout", 60)`.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 16, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@michalmisiewicz

I'm facing the same issue with create_namespaced_pod
