Watcher stuck forever in a case of network disconnect #1148

Closed
NikPaushkin opened this issue Apr 16, 2020 · 16 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@NikPaushkin

NikPaushkin commented Apr 16, 2020

What happened (please include outputs or screenshots):
If the network connection to the Kubernetes API drops, the watcher gets stuck forever without raising an exception.

What you expected to happen:
The watcher should raise an exception in case of any network issue.

How to reproduce it (as minimally and precisely as possible):

  1. Run a watcher on any object list with some timeout (60 seconds, for example)
  2. Turn off the network adapter on your laptop
  3. Wait for 60 seconds (the timeout): the watcher still does not return control, and the thread is simply stuck

Anything else we need to know?:

Environment:

  • Kubernetes version (kubectl version): 1.17.0
  • OS (e.g., MacOS 10.13.6): Windows 10
  • Python version (python --version): 3.7
  • Python client version (pip list | grep kubernetes): 11.0.0
@NikPaushkin NikPaushkin added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2020
@roycaihw
Member

This seems to be related to #868

@NikPaushkin
Author

This seems to be related to #868

@roycaihw Well, it can be fixed like that, if we don't want to fix the stream itself.

@roycaihw
Member

cc @alanjcastonguay

@ellieayla

ellieayla commented Apr 28, 2020

Yeah, this is one of the failure modes that prompted #868. But I still don't have a method to fix it, other than simply setting a client-side timeout. Did you have one set in this test? Can you paste a minimal example that reproduces the problem in your test?

@NikPaushkin
Author

NikPaushkin commented Apr 28, 2020

@alanjcastonguay It always behaves like that for me. The minimal code is trivial:

import logging
from datetime import datetime

from kubernetes import client, watch

watcher = watch.Watch()
func = client.CoreV1Api(client.ApiClient(client.Configuration())).list_namespaced_pod
kwargs = dict()
# Assumption: list_namespaced_pod needs a namespace; the snippet as posted omitted it
kwargs['namespace'] = 'default'
kwargs['timeout_seconds'] = 60  # server-side timeout for the watch
try:
    while True:
        # Start a new watch every time the previous stream ends
        for event in watcher.stream(func, **kwargs):
            watcher_last_time_alive = datetime.now()
            print(watcher_last_time_alive)
except Exception as e:
    logging.error("API request failed", extra={
        "API": "Kubernetes",
        "method": func,
        "kwargsGiven": kwargs,
        "error": str(e)
    })

@Cweiping

Cweiping commented May 9, 2020

This problem can be reproduced in an unstable network environment. The more unstable the network, the more likely it is to occur.

@ellieayla

kwargs['timeout_seconds'] = 60 is a polite request to the server, asking it to cleanly close the connection after 60 seconds. If you have a network outage, this does nothing. You can set this number much higher, maybe to 3600 seconds (1h).

kwargs['_request_timeout'] = 60 is a client-side timeout, configuring your local socket. If you have a network outage dropping all packets with no RST/FIN, this is how long your client waits before realizing & dropping the connection. You can keep this number low, maybe 60 seconds.
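
To make the distinction concrete, here is a minimal sketch of passing both timeouts to a watch. The helper names build_watch_kwargs and watch_pods, the default namespace, and loading credentials with config.load_kube_config() are assumptions for illustration; they are not taken from the report above.

```python
def build_watch_kwargs(server_timeout=3600, client_timeout=60):
    """Return kwargs for Watch.stream() carrying both kinds of timeout."""
    return {
        # Server-side: asks the API server to close the watch cleanly after this long.
        "timeout_seconds": server_timeout,
        # Client-side: local socket timeout, the one that can fire during an outage.
        "_request_timeout": client_timeout,
    }


def watch_pods(namespace="default"):
    """Stream pod events; needs the kubernetes package and a reachable cluster."""
    # Imported lazily so build_watch_kwargs() works without the package installed.
    from kubernetes import client, config, watch

    config.load_kube_config()  # assumption: credentials come from a local kubeconfig
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(
        v1.list_namespaced_pod, namespace=namespace, **build_watch_kwargs()
    ):
        print(event["type"], event["object"].metadata.name)
```

Calling watch_pods() with these defaults asks the server for a one-hour watch while the local socket gives up after 60 seconds without receiving data.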

@NikPaushkin
Author

@alanjcastonguay Thank you for the clarification! When I add _request_timeout, it really does raise an exception. It does not do so after _request_timeout seconds, though: I get an exception every 5 seconds (seems like some default timeout of urllib3). But that's still much better behaviour.

@ellieayla

The value is provided to a urllib3.Timeout at

if _request_timeout:
    if isinstance(_request_timeout, (int, ) if six.PY3 else (int, long)):  # noqa: E501,F821
        timeout = urllib3.Timeout(total=_request_timeout)
    elif (isinstance(_request_timeout, tuple) and
          len(_request_timeout) == 2):
        timeout = urllib3.Timeout(
            connect=_request_timeout[0], read=_request_timeout[1])

Dunno where 5 might be coming from.
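
As a rough illustration of the branching quoted above, the sketch below mirrors it in plain Python and returns the arguments that would be handed to urllib3.Timeout as a dict. The function name describe_request_timeout is hypothetical, only the Python 3 branch is modelled, and it is based solely on the quoted snippet, so other client versions may behave differently.

```python
def describe_request_timeout(request_timeout):
    """Return the urllib3.Timeout keyword arguments implied by the quoted snippet.

    Returns None when neither branch would configure a timeout.
    """
    if not request_timeout:
        return None
    if isinstance(request_timeout, int):
        # A single integer becomes the total budget for the request.
        return {"total": request_timeout}
    if isinstance(request_timeout, tuple) and len(request_timeout) == 2:
        # A 2-tuple sets the connect and read timeouts separately.
        return {"connect": request_timeout[0], "read": request_timeout[1]}
    # Anything else (a float, for example) matches neither branch.
    return None


print(describe_request_timeout(60))
print(describe_request_timeout((5, 90)))
```

One thing this makes visible: going only by the quoted check, a value such as 1.5 is neither an int nor a 2-tuple, so it would not set any timeout.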

@NikPaushkin
Author

@alanjcastonguay Yep, my bad, I forgot to remove my 5-second reconnect sleep.

@ellieayla

/remove-kind bug
/triage support

Ok to close this?

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 10, 2020
@NikPaushkin
Author

NikPaushkin commented May 10, 2020

@alanjcastonguay After adding _request_timeout I got a timeout error every _request_timeout seconds, regardless of network connection state

HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out.

It gets all the events and then dies with that error after _request_timeout seconds.

My kwargs are

{"_request_timeout": 90, "timeout_seconds": 3600}

@ellieayla

The urllib3 client read timeout is the maximum number of seconds allowed since the last received byte, and the Kubernetes API server doesn't transmit any bytes while there are no changes to the watched collection. So an idle watch hits the read timeout even when the network is healthy. If you keep changing the set of pods at least every 90 seconds, the connection should remain up.
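
The same effect can be shown with nothing but the standard library. This is only an analogy using raw sockets, not urllib3 or the Kubernetes client, and the function name idle_read_times_out is made up for the example: a peer that stays connected but silent makes the read time out, even though nothing is wrong with the network.

```python
import socket


def idle_read_times_out(read_timeout=0.2):
    """Return True if reading from a connected but silent peer hits the timeout."""
    # A local listener that accepts the connection and then sends nothing.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    peer, _ = server.accept()
    client.settimeout(read_timeout)
    try:
        client.recv(1)  # blocks until a byte arrives or the timeout elapses
        return False
    except socket.timeout:
        return True
    finally:
        client.close()
        peer.close()
        server.close()


print(idle_read_times_out())
```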

@NikPaushkin
Author

NikPaushkin commented May 10, 2020

If you keep changing the set of pods at least every 90 seconds, the connection should remain up.

@alanjcastonguay Yes, it works like that. I already have my own implementation of such behaviour, though.
Anyway, it would be much better and more useful to make the stream watcher smarter so that it encapsulates this reconnect logic. And that is what issue #868 is about, if I understand it right.
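
For readers looking for a starting point, reconnect logic of this kind could look roughly like the sketch below. It is not the implementation mentioned in this thread and not part of the client library. The name resilient_events and its parameters are hypothetical, and which exception class to pass in retry_on for the real client (the error above surfaces from urllib3 as a read timeout) is an assumption to verify against your installed versions. The sketch also does nothing about resource versions, so events may be replayed or missed across reconnects.

```python
import time


def resilient_events(open_stream, retry_on=(TimeoutError,), max_restarts=None, delay=0.0):
    """Yield events from open_stream(), reopening it when it ends or times out.

    open_stream: zero-argument callable returning a fresh iterable of events.
    retry_on: exception classes treated as "reconnect and carry on".
    max_restarts: re-raise once this many error restarts have happened; None retries forever.
    delay: seconds to sleep before reconnecting after an error.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                yield event
            # A clean end (for example the server-side timeout) just reopens the stream.
        except retry_on:
            if max_restarts is not None and restarts >= max_restarts:
                raise
            restarts += 1
            time.sleep(delay)
```

With the real client, open_stream could be something like lambda: watcher.stream(func, **kwargs), with both timeout_seconds and _request_timeout present in kwargs.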

@xuejiezhang

@alanjcastonguay After adding _request_timeout I got a timeout error every _request_timeout seconds, regardless of network connection state

HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out.

It gets all the events and then dies with that error after _request_timeout seconds.

My kwargs are

{"_request_timeout": 90, "timeout_seconds": 3600}

Hi Nik, is this issue resolved? I got the same error. Thanks, Jeff

@NikPaushkin
Author

@xuejiezhang Hi, this is the expected behavior. I've written my own reconnect implementation; there is nothing else to do here until #868 is closed.
