Python: Handling Watch Stream Exception #843

Closed · neliel123 opened this issue May 30, 2019 · 10 comments
Labels: lifecycle/rotten

@neliel123
I have this code, implemented as a thread, that acts as a watch component:

import logging
import threading

from kubernetes import client, watch
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)


class MyWatcher(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        self.init_kube_config()  # loads the kube config (helper not shown)
        python_api = client.CoreV1Api()
        try:
            w = watch.Watch()
            for event in w.stream(python_api.list_namespaced_config_map, namespace="my-namespace"):
                self.process_event(event)  # event handler (not shown)
        except ApiException as e:
            logger.error("Exception encountered while watching for event stream :: list_namespaced_config_map :: {}".format(e))

But sometimes, for some reason, I get the exception below:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/site-packages/kube/my_watcher.py", line 21, in run
    for event in w.stream(python_api.list_namespaced_config_map, namespace="my-namespace"):
  File "/usr/lib/python3.6/site-packages/kubernetes/watch/watch.py", line 128, in stream
    resp = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 11854, in list_namespaced_config_map
    (data) = self.list_namespaced_config_map_with_http_info(namespace, **kwargs)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 11957, in list_namespaced_config_map_with_http_info
    collection_formats=collection_formats)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 321, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 342, in request
    headers=headers)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/lib/python3.6/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Mon, 27 May 2019 07:48:35 GMT', 'Content-Length': '186'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"resourceVersion: Invalid value: \\"None\\": strconv.ParseUint: parsing \\"None\\": invalid syntax","code":500}\n'

Currently I am catching the ApiException and letting it pass. I am just wondering: what is the proper way to handle this?
Should I stop the watch object and then create a new watch object?

I don't know exactly how to reproduce the error, as it seems random in nature, so I am not sure what to do in my except block.

@tomplus (Member) commented Jun 6, 2019
Hi @neliel123
As you said, the reason is unknown, so I suggest logging the exception with a stack trace and then starting the watch again. If you don't recreate the Watch object, the stream will resume from the last processed event.
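
A minimal sketch of that approach, reusing the ConfigMap watch from the question. The process_event placeholder, the watch_config_maps wrapper, and the one-second back-off are illustrative choices, not part of the original code:

import logging
import time

from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

logger = logging.getLogger(__name__)


def process_event(event):
    # Placeholder for the author's own event handling.
    logger.info("%s %s", event["type"], event["object"].metadata.name)


def watch_config_maps(namespace="my-namespace"):
    v1 = client.CoreV1Api()
    while True:
        w = watch.Watch()  # recreate the Watch so each retry starts a clean stream
        try:
            for event in w.stream(v1.list_namespaced_config_map, namespace=namespace):
                process_event(event)
        except ApiException:
            # logger.exception records the full stack trace, then we loop and retry.
            logger.exception("watch on ConfigMaps failed; restarting")
            time.sleep(1)  # small back-off before retrying (arbitrary choice)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    watch_config_maps()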

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 4, 2019
@salilgupta1

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 20, 2019
@salilgupta1 commented Sep 20, 2019

I'm currently trying to figure out what we should be doing with the resource_version when we have to restart the watch stream. The python client will handle certain exceptions, such as the connection timing out, and will use the last resource_version it has to continue the stream. Should we be emulating that paradigm when the client doesn't handle an exception? E.g. if we receive a 410 error code from K8s, should we restart the stream using the last resource_version we had? I would think that K8s would just continue to throw a 410 error because the resource_version is old.

Using K8s: 1.10.3
Client: v10.0.1

Here is a rough outline of our code right now:

# Note: "client" here is a CoreV1Api instance, and K8sApiException is assumed to be
# an alias for kubernetes.client.rest.ApiException.
while True:
    try:
        stream = watch.Watch().stream(client.list_pod_for_all_namespaces)
        for pod in stream:
            # do some parsing
            pass
    except K8sApiException as e:
        # swallow error
        pass
    except ValueError as e:
        # We occasionally run into this problem: https://github.com/kubernetes-client/python/issues/895
        # swallow the error if the above issue comes up
        pass

When we swallow errors, the stream starts fresh and the resource_version is None. Should we really be passing the last resource_version we saw back into the stream() method?
We aren't worried about out-of-order events as much as about missing events altogether.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 19, 2019

@cben commented Jan 6, 2020

@salilgupta1 Watch already tracks the last seen resource_version:
https://github.com/kubernetes-client/python-base/blob/a2d1024524/watch/watch.py#L99-L102
and restarts from that version after a "regular" disconnection.

  • This might not work if you didn't provide a resource_version in the first call, due to some nasty random ordering (see "Old events from the past yielded due to remembered resource_version", #819). My impression (not personally tested) is that the List+Watch pattern avoids that.
  • You're right that on 410 Gone you should not retry the same resource_version, but rather repeat List+Watch, starting at the version returned by List.

So I think you don't need to pass resource_version back from stream() to stream(), but you do need an outer loop doing List+Watch.
(See #1016 and #868 for proposed abstractions to help with that, but for now you need to code it yourself.)
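
For what it's worth, a rough sketch of such an outer List+Watch loop might look like the following. This is not a built-in client helper; the pod example, the print statements, and the 410 handling are illustrative:

from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException


def list_then_watch():
    v1 = client.CoreV1Api()
    while True:
        # List: gives the current state plus a resourceVersion to start watching from.
        pod_list = v1.list_pod_for_all_namespaces()
        for pod in pod_list.items:
            print("existing pod:", pod.metadata.namespace, pod.metadata.name)
        resource_version = pod_list.metadata.resource_version

        w = watch.Watch()
        try:
            # Watch from the listed version; Watch tracks newer versions internally.
            for event in w.stream(v1.list_pod_for_all_namespaces,
                                  resource_version=resource_version):
                print(event["type"], event["object"].metadata.name)
        except ApiException as e:
            if e.status == 410:
                # 410 Gone: our resourceVersion is too old, so re-list and watch again.
                continue
            raise


if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config()
    list_then_watch()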

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 5, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
