
watcher error after 20m #12

Closed

mheese opened this issue Oct 12, 2016 · 5 comments

@mheese

mheese commented Oct 12, 2016

When I create a watch on pods e.g. like this:

import (
...
  "k8s.io/client-go/1.4/kubernetes"
  "k8s.io/client-go/1.4/pkg/api"
  apiv1 "k8s.io/client-go/1.4/pkg/api/v1"
  "k8s.io/client-go/1.4/pkg/labels"
  "k8s.io/client-go/1.4/pkg/selection"
  "k8s.io/client-go/1.4/pkg/util/sets"
  "k8s.io/client-go/1.4/pkg/watch"
  ...
)
...
podsClient := clientset.Core().Pods("default")
watcher, err := podsClient.Watch(api.ListOptions{...})
...
select {
...
  case ev := <-watcher.ResultChan():
  ...
}

The trouble is that after exactly 20 minutes, ResultChan() receives an endless stream of messages, but they are all empty. Printing the ev returned from the channel shows that it is an empty/uninitialized value: map[Type: Object:<nil>].

Is this expected behaviour? Am I supposed to renew the watch after some time?

@caesarxuchao
Member

It is expected. The server will time out the watch connection, see the code. Though I'm not sure why it's 20 mins: it should time out after 30-60 mins if you haven't changed the default configs of your server, because the default MinRequestTimeout is 30 mins.
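(For reference, and as an assumption about a stock apiserver rather than something verified in this thread: that default corresponds to the kube-apiserver flag below, and watch requests are given a random timeout between one and two times that value, which is where the 30-60 min range comes from.)

kube-apiserver ... --min-request-timeout=1800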

Printing the ev returned from the channel shows that it is an empty/uninitialized value: map[Type: Object:<nil>]

I guess you have a loop around the select? I think you need to check whether watcher.ResultChan() has been closed:

 case ev, ok := <-watcher.ResultChan():
    if !ok {
       // break the loop
    }
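
Note that a plain break inside a select only exits the select, so a labeled break (or a return) is needed to leave the surrounding for loop. A minimal sketch using the watcher from the snippet above (the label name and the event handling are illustrative only):

watchLoop:
	for {
		select {
		case ev, ok := <-watcher.ResultChan():
			if !ok {
				// Channel closed: the server ended the watch.
				break watchLoop
			}
			_ = ev // handle the event here, e.g. switch on ev.Type
		}
	}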

@caesarxuchao
Member

You can take a look at this for loop in reflector.go as a reference on how to renew the watch.
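
Roughly, the renewal pattern looks like this (a sketch only, loosely modelled on that loop; handleEvent is a hypothetical placeholder and backoff/error handling is omitted):

for {
	watcher, err := podsClient.Watch(api.ListOptions{})
	if err != nil {
		// Could not (re)create the watch: log and back off, or give up.
		return err // assumes this runs inside a func that returns an error
	}
	for ev := range watcher.ResultChan() {
		handleEvent(ev) // hypothetical handler, not a client-go function
	}
	// ResultChan() was closed (e.g. the server timed out the watch);
	// loop around and create a new watch.
}

Ranging over ResultChan() exits automatically when the channel is closed, which keeps the outer renewal loop simple.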

@mheese
Author

mheese commented Oct 13, 2016

Thanks @caesarxuchao, I'm giving that a try. I'm 99.999% certain that this solves my problem. Somehow it simply did not occur to me that the channel might be closed at that point; I actually thought it would panic in that case.

With regard to the 20m: I'm testing/developing with minikube, which uses localkube I believe? I definitely have not changed any default values; I don't even know how to do that.

@mheese
Author

mheese commented Oct 13, 2016

I'm going to close this issue as it is not a bug, but expected behaviour.

Additionally, just to point it out again for anyone running into the same error/misbehaviour: you have to test whether the channel has been closed.

So the solution is very simple, as @caesarxuchao pointed out:

case ev, ok := <-watcher.ResultChan():
    if !ok {
       // break the loop
    }

Note that after you recreate the watch, you will receive a lot of ADDED events for all the pods/objects that match your watch.
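
If replaying those ADDED events is a problem, one option (a sketch only, assuming you track the last ResourceVersion you have seen; reflector.go does this more robustly) is to resume from that version when re-creating the watch:

lastRV := ""
for {
	watcher, err := podsClient.Watch(api.ListOptions{ResourceVersion: lastRV})
	if err != nil {
		return err // or log and back off before retrying
	}
	for ev := range watcher.ResultChan() {
		if pod, ok := ev.Object.(*apiv1.Pod); ok {
			lastRV = pod.ResourceVersion
		}
		// handle ev ...
	}
}

If the saved version has expired on the server side, the new watch will report an error and a fresh list+watch is needed, which is essentially what reflector.go handles for you.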

Thanks again for the help @caesarxuchao!

mheese closed this as completed Oct 13, 2016
d11wtq added a commit to d11wtq/skaffold that referenced this issue Jun 19, 2018
Resolves: GoogleContainerTools#531
See also: kubernetes/client-go#12

There is an issue in which the Pods watcher gets into an infinite tight
loop and begins consuming 100% of the CPU. This happens after skaffold
dev has been running for a while (~30 mins), and once it starts, it
doesn't stop.

The issue was narrowed down by @ajbouh to the event polling loop in
`logs.go`, which was not checking whether the `ResultChan()` had been closed.
Kubernetes actually closes the connection after a timeout (the default is in
the range of 30-60 mins according to the related issue linked above).
In this case, the intended solution is to start the watcher again.

This refactors the polling into two nested loops. One to start (and
restart) the Pods watcher itself and another to receive and process the
events from the watcher. If the `ResultChan()` is closed, the entire
watcher loop is restarted and log tailing continues.

There is a subtle difference in error handling as a result of this
change. Previously, any error returned from `client.Pods("").Watch()`
would be immediately returned from the `Watch()` func in `logs.go`. This
is no longer possible since the watcher is initialized in the goroutine
started by that func. As such, in the case where the watcher cannot be
initialized, we simply log the error and stop tailing logs. Open to
suggestions as to a better way to handle this error; retrying in a
tight loop seems potentially problematic in the error scenario.
@scaszoo

scaszoo commented Oct 9, 2019

You could override the timeout in ListOptions:

timeout := int64(60 * 60 * 24) // 24 hours
watcher, err := client.CoreV1().Pods("default").Watch(metav1.ListOptions{
	TimeoutSeconds: &timeout,
})
