Watcher stuck forever in a case of network disconnect #1148

NikPaushkin · 2020-04-16T13:54:42Z

What happened (please include outputs or screenshots):
If the network connection to the Kubernetes API drops, the watcher gets stuck forever without raising an exception.

What you expected to happen:
The watcher should raise an exception on any network failure.

How to reproduce it (as minimally and precisely as possible):

1. Run a watcher on any object list with a timeout (60 seconds, for example).
2. Turn off the network adapter on your laptop.
3. Wait 60 seconds (the timeout). The watcher still does not return control: the thread is simply stuck.

Anything else we need to know?:

Environment:

Kubernetes version (kubectl version): 1.17.0
OS (e.g., MacOS 10.13.6): Windows 10
Python version (python --version): 3.7
Python client version (pip list | grep kubernetes): 11.0.0


roycaihw · 2020-04-27T16:23:20Z

This seems to be related to #868

NikPaushkin · 2020-04-27T16:26:12Z

> This seems to be related to #868

@roycaihw Well, it could be fixed that way, if we don't want to fix the stream itself.

roycaihw · 2020-04-27T16:34:37Z

cc @alanjcastonguay

ellieayla · 2020-04-28T00:59:37Z

Yeah, this is one of the failure modes that prompted #868. But I still don't have a method to fix it, other than simply setting a client-side timeout. Did you have one set in this test? Can you paste a minimal example that reproduces the problem in your test?

NikPaushkin · 2020-04-28T08:50:54Z

@alanjcastonguay It always behaves like that for me. The minimal code is trivial:

import logging
from datetime import datetime

from kubernetes import client, watch

watcher = watch.Watch()
func = client.CoreV1Api(client.ApiClient(client.Configuration())).list_namespaced_pod
kwargs = dict()
# list_namespaced_pod requires a namespace; 'default' is an assumed placeholder
kwargs['namespace'] = 'default'
# Server-side timeout: asks the API server to close the watch after 60 seconds
kwargs['timeout_seconds'] = 60
try:
    # Reopen the stream every time the server closes it
    while True:
        for event in watcher.stream(func, **kwargs):
            watcher_last_time_alive = datetime.now()
            print(watcher_last_time_alive)
except Exception as e:
    logging.error("API request failed", extra={
        "API": "Kubernetes",
        "method": func,
        "kwargsGiven": kwargs,
        "error": str(e)
    })

Cweiping · 2020-05-09T06:37:30Z

This problem can be reproduced in an unstable network environment. The more unstable the network, the more likely it is to occur.

ellieayla · 2020-05-09T14:29:31Z

kwargs['timeout_seconds'] = 60 is a polite request to the server, asking it to cleanly close the connection after 60 seconds. If you have a network outage, this does nothing. You can set this number much higher, maybe to 3600 seconds (1h).

kwargs['_request_timeout'] = 60 is a client-side timeout, configuring your local socket. If you have a network outage dropping all packets with no RST/FIN, this is how long your client waits before realizing & dropping the connection. You can keep this number low, maybe 60 seconds.
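To make the distinction concrete, here is a minimal sketch that passes both timeouts to the watch. The helper names `build_watch_kwargs` and `watch_pods` are hypothetical, the default values simply follow the numbers suggested above, and loading credentials with `config.load_kube_config()` is an assumption about the environment.

```python
def build_watch_kwargs(namespace, server_timeout=3600, client_timeout=60):
    """Return keyword arguments for Watch.stream() with both timeouts set."""
    return {
        "namespace": namespace,
        # Server-side: asks the API server to close the watch cleanly.
        "timeout_seconds": server_timeout,
        # Client-side: socket timeout, the only one that helps during an outage.
        "_request_timeout": client_timeout,
    }


def watch_pods(namespace):
    """Print pod events; needs the kubernetes package and a reachable cluster."""
    # Imported lazily so build_watch_kwargs() works without the package.
    from kubernetes import client, config, watch

    config.load_kube_config()  # assumption: a local kubeconfig is available
    v1 = client.CoreV1Api()
    stream = watch.Watch().stream(
        v1.list_namespaced_pod, **build_watch_kwargs(namespace)
    )
    for event in stream:
        print(event["type"], event["object"].metadata.name)
```

Calling `watch_pods("default")` would then fail with a read timeout instead of hanging if the network drops silently.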

NikPaushkin · 2020-05-10T13:33:15Z

@alanjcastonguay Thank you for the clarification! When I add _request_timeout, it really does raise an exception. It doesn't happen after _request_timeout seconds, though: I get an exception every 5 seconds (it seems like some default urllib3 timeout). Still, that's much better behaviour.

ellieayla · 2020-05-10T13:41:28Z

The value is provided to a urllib3.Timeout in python/kubernetes/client/rest.py, lines 141 to 147 in 02ef5be:

if _request_timeout:
    if isinstance(_request_timeout, (int, ) if six.PY3 else (int, long)):  # noqa: E501,F821
        timeout = urllib3.Timeout(total=_request_timeout)
    elif (isinstance(_request_timeout, tuple) and
          len(_request_timeout) == 2):
        timeout = urllib3.Timeout(
            connect=_request_timeout[0], read=_request_timeout[1])

Dunno where 5 might be coming from.
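The branch quoted above accepts either an integer or a two-element tuple. The following sketch mirrors that logic in a standalone function so the two forms can be compared; `to_urllib3_timeout` is a hypothetical name, not part of the client, and the sketch only reflects the snippet as quoted at that commit.

```python
import urllib3


def to_urllib3_timeout(request_timeout):
    """Mirror the quoted branch: int -> total, (connect, read) tuple -> split."""
    if isinstance(request_timeout, int):
        return urllib3.Timeout(total=request_timeout)
    if isinstance(request_timeout, tuple) and len(request_timeout) == 2:
        return urllib3.Timeout(
            connect=request_timeout[0], read=request_timeout[1])
    # Anything else (for example a float) matches neither branch as quoted.
    return None


single = to_urllib3_timeout(60)      # one budget for the whole request
split = to_urllib3_timeout((5, 90))  # fail fast on connect, tolerate idle reads
```

If this reading is right, passing `_request_timeout=(5, 90)` lets a watch fail quickly when the server is unreachable while still tolerating 90 seconds without data.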

NikPaushkin · 2020-05-10T13:47:13Z

@alanjcastonguay Yep, my bad, I forgot to remove my 5-second reconnect sleep.

ellieayla · 2020-05-10T15:17:34Z

/remove-kind bug
/triage support

Ok to close this?

NikPaushkin · 2020-05-10T15:47:16Z

@alanjcastonguay After adding _request_timeout I got a timeout error every _request_timeout seconds, regardless of network connection state

HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out.

It gets all the events and then dies with that error after _request_timeout seconds.

My kwargs are:

{"_request_timeout": 90, "timeout_seconds": 3600}

ellieayla · 2020-05-10T16:00:46Z

The urllib3 client timeout describes the maximum number of seconds since the last received byte. The Kubernetes API server doesn't transmit any bytes if there are no changes to the watched collection. If you keep changing the set of pods at least every 90 seconds, the connection should remain up.
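The "seconds since last received byte" behaviour can be seen with plain sockets. This is only an illustration under stated assumptions: it uses the standard library instead of urllib3 or the Kubernetes client, and the function names and the 0.2 s / 0.5 s timings are made up for the demo. A local server sends three bytes 0.2 seconds apart and then goes silent, and the client uses a 0.5 second timeout.

```python
import socket
import threading
import time


def read_until_idle(port, idle_timeout):
    """Read from localhost:port until no byte arrives for idle_timeout seconds."""
    received = b""
    start = time.monotonic()
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(idle_timeout)  # applies to each recv(), not the session
        try:
            while True:
                data = sock.recv(1024)
                if not data:  # peer closed the connection cleanly
                    return received, False, time.monotonic() - start
                received += data
        except socket.timeout:
            return received, True, time.monotonic() - start


def demo():
    """Run a chatty-then-silent server and return what the client observed."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    done = threading.Event()

    def server():
        conn, _ = listener.accept()
        with conn:
            for chunk in (b"a", b"b", b"c"):
                conn.sendall(chunk)
                time.sleep(0.2)
            done.wait(5)  # stay connected but silent, like an idle watch

    thread = threading.Thread(target=server, daemon=True)
    thread.start()
    try:
        return read_until_idle(port, idle_timeout=0.5)
    finally:
        done.set()
        thread.join()
        listener.close()
```

The client stays connected for longer than 0.5 seconds in total, because every received byte restarts the clock; it only times out once the server has been silent for 0.5 seconds. An idle watch with `_request_timeout=90` hits the same condition after 90 quiet seconds.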

NikPaushkin · 2020-05-10T16:05:14Z

> If you keep changing the set of pods at least every 90 seconds, the connection should remain up.

@alanjcastonguay Yes, it works like that. I already have my own implementation of that behaviour, though.
Anyway, it would be much better and more useful to make the stream watcher smarter, so that it encapsulates this reconnect logic. That is what issue #868 is about, if I understand correctly.
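The thread does not include the reconnect implementation mentioned here, so the following is only a sketch of what such a wrapper might look like. `watch_with_reconnect` and all of its parameters are hypothetical names; the stream is injected as a callable so the loop does not depend on the client library, and the exception types to retry on are left to the caller.

```python
import time


def watch_with_reconnect(open_stream, handle_event, retry_on,
                         max_reconnects=5, backoff_seconds=1.0):
    """Consume events forever, reopening the stream when it ends or times out.

    open_stream: callable returning a fresh iterable of events.
    retry_on: tuple of exception types treated as "reconnect and continue".
    Re-raises after more than max_reconnects consecutive failures.
    """
    failures = 0
    while True:
        try:
            for event in open_stream():
                failures = 0  # any event proves the connection is healthy
                handle_event(event)
            # Falling out of the for loop means the server closed the stream
            # cleanly (for example on timeout_seconds); just reopen it.
        except retry_on:
            failures += 1
            if failures > max_reconnects:
                raise
            time.sleep(backoff_seconds)


# Possible wiring with the client (assumes a configured CoreV1Api named v1):
#
#   import urllib3
#   from kubernetes import watch
#   watch_with_reconnect(
#       lambda: watch.Watch().stream(
#           v1.list_namespaced_pod, namespace="default",
#           timeout_seconds=3600, _request_timeout=90),
#       print,
#       (urllib3.exceptions.ReadTimeoutError,),
#   )
```

Note that this sketch restarts the watch from scratch, so existing objects are reported again after each reconnect. A fuller version would likely track the last seen resource version and pass it to the next stream.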

xuejiezhang · 2022-03-23T18:15:10Z

> @alanjcastonguay After adding _request_timeout I got a timeout error every _request_timeout seconds, regardless of network connection state
> HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out.
> It got all the events and then dies with that error after _request_timeout.
>
> My kwargs is
> {"_request_timeout": 90, "timeout_seconds": 3600}

Hi Nik, is this issue resolved? I got the same error. Thanks, Jeff

NikPaushkin · 2022-03-24T09:18:51Z

@xuejiezhang Hi, this is the expected behavior. I've written my own implementation of the reconnect; there is nothing else to do here until #868 is closed.

NikPaushkin added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2020

k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 10, 2020

NikPaushkin closed this as completed May 10, 2020

atrbgithub mentioned this issue Dec 8, 2020

Network instabilities are able to freeze KubernetesJobWatcher apache/airflow#12644

Closed

vsliouniaev mentioned this issue Feb 11, 2021

Watching stops working after 10 minutes kiwigrid/k8s-sidecar#85

Closed

WouldYouKindly mentioned this issue Mar 23, 2021

Document client-side timeouts in watch #1402

Closed

Priyankasaggu11929 mentioned this issue Jun 2, 2021

add documentation for the server & client side timeout #1467

Merged
