Watch channel does not get closed ever #755

Closed
hardikdr opened this issue Feb 26, 2020 · 12 comments

@hardikdr
Member

Problem Description

The watch appears to stop receiving events for certain custom resources. This leads to stale cache reads, yet the watch is not re-established.

What is expected?

We expect the following client log message to appear after every watch timeout (roughly every 10 minutes):
I0226 11:09:10.181906 1 reflector.go:405] sample-controller/pkg/client/informers/externalversions/factory.go:117: Watch close - *v1alpha1.Flunder total 0 items received

From basic investigation, it seems watchHandler could technically loop forever if ResultChan hangs or is never closed from the server side. This could happen if the watch is closed by the API server but the client is not aware of it; we are not sure how or why. There does not seem to be a timeout on the client side, so if the server closes the connection and the client misses it, the watch could hang forever. (See also: https://stackoverflow.com/questions/51399407/watch-in-k8s-golang-api-watches-and-get-events-but-after-sometime-doesnt-get-an)
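For illustration only, here is a minimal Go sketch of that concern (it is not the actual reflector/watchHandler code): a consumer that only ranges over ResultChan() has no way to make progress if the channel neither delivers events nor gets closed. The fake watcher below is just a stand-in for a real server-side watch.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// consume mirrors the shape of the reflector's inner loop: it only makes progress
// when events arrive or when ResultChan() is closed. If neither ever happens (the
// suspected hang), the range blocks forever and the watch is never re-established.
func consume(w watch.Interface) {
	for ev := range w.ResultChan() {
		fmt.Println("received event:", ev.Type)
	}
	fmt.Println("watch channel closed; the caller would now re-list and re-watch")
}

func main() {
	fw := watch.NewFake() // stand-in for a real server-side watch
	go func() {
		fw.Add(nil) // deliver one event...
		time.Sleep(100 * time.Millisecond)
		fw.Stop() // ...then close; comment this out to simulate the hang described above
	}()
	consume(fw)
}
```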

  • We see this happening often with our GKE clusters, where custom controllers stop getting events and end up reading from a stale cache after some time. This only started recently, without us upgrading GKE or the custom controllers.

Any help will be really appreciated, thanks in advance.

@liggitt
Member

liggitt commented Feb 26, 2020

What version of the API server are you running against? There were known issues related to this that were fixed in 1.15.

@hardikdr
Member Author

We are running 1.14.8-gke.33.
Could you please point us to any issue/PR for the problem you are referring to?

@liggitt
Member

liggitt commented Feb 26, 2020

The issue was fixed by kubernetes/kubernetes#78029 in 1.15, and the associated bugs are linked from that PR.

@liggitt liggitt closed this as completed Feb 26, 2020
@rfranzke

Thanks for pointing this out @liggitt, 💯
One question: I saw you cherry-picked this fix to the release-1.14 branch with kubernetes/kubernetes#78034 at the end of May 2019. However, looking at the respective commit kubernetes/kubernetes@21b9a31, it looks like it was never released in a 1.14 patch release. How is that possible? The last 1.14 version was 1.14.10, released in December 2019. Shouldn't the fix be included there?

@rfranzke

I checked the release notes, and it seems the fix was released with v1.14.3 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md#changelog-since-v1142); I'm not sure why the GitHub commit does not show it.
Can you explain why we are still hit by this even though we are running v1.14.8?

@liggitt
Member

liggitt commented Feb 26, 2020

I'm not sure. Starting a 1.14.8 cluster and starting a watch with timeoutSeconds set closes the connection and exits the watch as I would expect.
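For anyone wanting to repeat that check, a rough sketch follows. It assumes a recent client-go where Watch takes a context (older releases omit the context argument), a kubeconfig at the default location, and the default namespace; those details are illustrative, not taken from the issue.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig location (illustrative assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	timeout := int64(30) // server-side watch timeout in seconds
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		TimeoutSeconds: &timeout,
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type)
	}
	// With a healthy API server, the channel closes roughly when the timeout expires.
	fmt.Println("watch channel closed")
}
```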

@rfranzke

Hm, should we then rather reopen this issue? Any other hints as to what it might be?
What @hardikdr describes is only observed after some time; it doesn't happen immediately. We also don't yet know how to reproduce it exactly.

@liggitt
Member

liggitt commented Feb 26, 2020

The client-side code closes the channel as soon as the server closes the connection, so this issue likely doesn't belong in this repo. If you want to open an issue against https://github.com/kubernetes/kubernetes/issues/, you can, though it would need to be reproduced against a maintained OSS branch (currently 1.15+). More information about what you are doing/seeing would be helpful as well:

  • client code used to set up the informer (including configured timeouts, etc)
  • what other activity is happening to the relevant CRD and custom resource instances before/while the watch is open
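As a simplified illustration of the channel-closing behavior mentioned above (a conceptual sketch, not the actual client-go stream-watcher source): the goroutine decoding the response stream closes the result channel as soon as the stream ends or fails to decode, which is what lets the consumer's range loop exit.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type event struct {
	Type   string          `json:"type"`
	Object json.RawMessage `json:"object"`
}

// streamWatch reads events from r and closes the returned channel as soon as the
// stream ends (io.EOF) or a decode error occurs, mirroring the behavior described above.
func streamWatch(r io.Reader) <-chan event {
	ch := make(chan event)
	dec := json.NewDecoder(r)
	go func() {
		defer close(ch) // closed as soon as the "server" closes the connection
		for {
			var e event
			if err := dec.Decode(&e); err != nil {
				return // io.EOF means the stream ended; either way, stop
			}
			ch <- e
		}
	}()
	return ch
}

func main() {
	body := `{"type":"ADDED","object":{}}` + "\n" + `{"type":"MODIFIED","object":{}}`
	for e := range streamWatch(strings.NewReader(body)) {
		fmt.Println("got event:", e.Type)
	}
	fmt.Println("channel closed")
}
```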

@hardikdr
Member Author

client code used to set up the informer (including configured timeouts, etc)

The resync period is set to 12 hours, and the timeout is not set explicitly, so it uses the default.

what other activity is happening to the relevant CRD and custom resource instances before/while the watch is open

The CRDs are re-applied by the controller but not necessarily changed, and not on a regular schedule. Custom resource instances are updated regularly while the watch is open.

We see the log line below regularly for a custom resource, but it stops after some time, which I believe is when the watch is hung:
I0226 06:36:51.828470 1 reflector.go:405] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Flunder total 30 items received

Ref to the actual implementation.
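For reference, a minimal sketch of an informer configured as described (12h resync, no explicit timeout) is shown below. It uses the core Pods informer and a kubeconfig at the default location purely as stand-ins; the actual controller uses a generated informer factory for its custom resource.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// 12h resync; no client-side watch timeout is configured, so the watch relies on
	// the server (or the transport) to eventually close the connection.
	factory := informers.NewSharedInformerFactory(clientset, 12*time.Hour)
	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("add") },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("update") },
		DeleteFunc: func(obj interface{}) { fmt.Println("delete") },
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	select {} // run forever, as a controller would
}
```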

@amshuman-kr

Ideally, the client side should implement the timeout itself, without depending on the server side to close the channel on timeout. This has already been done recently in client-go but is not released yet. We should adopt that change as soon as it is released to insulate ourselves from such issues in the future.
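As a conceptual sketch of such a client-side timeout (this is not the unreleased client-go change referred to above, just an illustration of the idea): wrap the watch.Interface and force it shut after a deadline, so a hung server-side connection cannot stall the consumer forever.

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// withTimeout returns a channel that mirrors w.ResultChan() but is closed (and the
// underlying watch stopped) once maxDuration has elapsed, giving the caller a chance
// to re-establish the watch even if the server never closes it.
func withTimeout(w watch.Interface, maxDuration time.Duration) <-chan watch.Event {
	out := make(chan watch.Event)
	go func() {
		defer close(out)
		defer w.Stop()
		timer := time.NewTimer(maxDuration)
		defer timer.Stop()
		for {
			select {
			case ev, ok := <-w.ResultChan():
				if !ok {
					return // server closed the watch normally
				}
				out <- ev
			case <-timer.C:
				return // client-side timeout fired
			}
		}
	}()
	return out
}

func main() {
	fw := watch.NewFake() // stands in for a real, possibly hung, server-side watch
	for range withTimeout(fw, 2*time.Second) {
	}
	// Even though the fake watch never delivers events or closes, we return after ~2s.
}
```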

@liggitt
Member

liggitt commented Mar 5, 2020

The CR-specific issues were resolved in 1.15 (and picked back to 1.12.9, 1.13.7, and 1.14.3), but there are more systemic issues that can occur at the client transport level if the underlying TCP connection is disrupted but not disconnected (like #374).
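One possible mitigation sketch for that transport-level case, hedged heavily: configure the client's dialer with aggressive TCP keepalives so a disrupted-but-not-disconnected connection is detected and torn down sooner, letting the watch be re-established. This assumes a client-go version whose rest.Config exposes a Dial hook; the keepalive values are arbitrary examples, not recommendations from this thread.

```go
package main

import (
	"context"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// Shorter TCP keepalive interval so dead peers are noticed sooner (example values).
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	cfg.Dial = func(ctx context.Context, network, address string) (net.Conn, error) {
		return dialer.DialContext(ctx, network, address)
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = clientset // use the clientset / informers as usual
}
```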

@amshuman-kr

The CR-specific issues were resolved in 1.15 (and picked back to 1.12.9, 1.13.7, and 1.14.3), but there are more systemic issues that can occur at the client transport level if the underlying TCP connection is disrupted but not disconnected (like #374).

Won't using a client-side timeout (now that it is possible) be the right way to reliably refresh such stuck watch connections?
