Watch channel does not get closed ever #755

Closed
hardikdr opened this issue Feb 26, 2020 · 12 comments

@hardikdr
Member

Problem Description

The watch appears to stop receiving events for certain custom resources. This leads to stale cache reads, yet the watch is not re-established.

What is expected?

We expect the following client log message to appear after every watch timeout (roughly every 10 minutes):
I0226 11:09:10.181906 1 reflector.go:405] sample-controller/pkg/client/informers/externalversions/factory.go:117: Watch close - *v1alpha1.Flunder total 0 items received

From basic investigation, it seems watchHandler could technically loop forever if ResultChan hangs or is never closed from the server side. This could happen if the watch is closed by the API server but the client is not aware of it; we are not sure how or why. There does not seem to be a timeout on the client side, so if the server closes the connection and the client misses it, the watch could hang forever. (See also: https://stackoverflow.com/questions/51399407/watch-in-k8s-golang-api-watches-and-get-events-but-after-sometime-doesnt-get-an)
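For illustration only, here is a minimal Go sketch of that concern (it is not the actual reflector/watchHandler code): a consumer that only ranges over ResultChan() has no way to make progress if the channel neither delivers events nor gets closed. The fake watcher below is just a stand-in for a real server-side watch.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// consume mirrors the shape of the reflector's inner loop: it only makes progress
// when events arrive or when ResultChan() is closed. If neither ever happens (the
// suspected hang), the range blocks forever and the watch is never re-established.
func consume(w watch.Interface) {
	for ev := range w.ResultChan() {
		fmt.Println("received event:", ev.Type)
	}
	fmt.Println("watch channel closed; the caller would now re-list and re-watch")
}

func main() {
	fw := watch.NewFake() // stand-in for a real server-side watch
	go func() {
		fw.Add(nil) // deliver one event...
		time.Sleep(100 * time.Millisecond)
		fw.Stop() // ...then close; comment this out to simulate the hang described above
	}()
	consume(fw)
}
```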

  • We see this happening often with our GKE clusters, where custom controllers stop getting events and end up reading from a stale cache after some time. This only started recently, without us upgrading GKE or the custom controllers.

Any help will be really appreciated, thanks in advance.

@liggitt
Member

liggitt commented Feb 26, 2020

What version of the API server are you running against? There were known issues related to this that were fixed in 1.15.

@hardikdr
Member Author

We are running 1.14.8-gke.33.
Could you please point us to any issue/PR for the problem you are referring to?

@liggitt
Member

liggitt commented Feb 26, 2020

The issue was fixed by kubernetes/kubernetes#78029 in 1.15, and the associated bugs are linked from that PR.

@liggitt liggitt closed this as completed Feb 26, 2020
@rfranzke

Thanks for pointing this out @liggitt, 💯
One question: I saw you cherry-picked this fix to the release-1.14 branch with kubernetes/kubernetes#78034 at the end of May 2019. However, looking at the respective commit kubernetes/kubernetes@21b9a31, it looks like it was never released in a 1.14 patch release. How is that possible? The last 1.14 version was 1.14.10, released in December 2019. Shouldn't the fix be included there?

@rfranzke

I checked the release notes, and it seems the fix was released with v1.14.3 (https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md#changelog-since-v1142); I'm not sure why the GitHub commit does not show it.
Can you explain why we are still hit by this even though we are running v1.14.8?

@liggitt
Member

liggitt commented Feb 26, 2020

I'm not sure. Starting a 1.14.8 cluster and starting a watch with timeoutSeconds set closes the connection and exits the watch as I would expect.
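For anyone wanting to repeat that check, a rough sketch follows. It assumes a recent client-go where Watch takes a context (older releases omit the context argument), a kubeconfig at the default location, and the default namespace; those details are illustrative, not taken from the issue.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the default kubeconfig location (illustrative assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	timeout := int64(30) // server-side watch timeout in seconds
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		TimeoutSeconds: &timeout,
	})
	if err != nil {
		panic(err)
	}
	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type)
	}
	// With a healthy API server, the channel closes roughly when the timeout expires.
	fmt.Println("watch channel closed")
}
```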

@rfranzke

Hm, should we then rather reopen this issue? Any other hints as to what it might be?
What @hardikdr describes is only observed after some time; it doesn't happen immediately. We also don't yet know how to reproduce it exactly.

@liggitt
Member

liggitt commented Feb 26, 2020

The client-side code closes the channel as soon as the server closes the connection, so this issue likely doesn't belong in this repo. If you want to open an issue against https://github.com/kubernetes/kubernetes/issues/, you can, though it would need to be reproduced against a maintained OSS branch (currently 1.15+). More information about what you are doing/seeing would be helpful as well:

  • client code used to set up the informer (including configured timeouts, etc)
  • what other activity is happening to the relevant CRD and custom resource instances before/while the watch is open
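As a simplified illustration of the channel-closing behavior mentioned above (a conceptual sketch, not the actual client-go stream-watcher source): the goroutine decoding the response stream closes the result channel as soon as the stream ends or fails to decode, which is what lets the consumer's range loop exit.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type event struct {
	Type   string          `json:"type"`
	Object json.RawMessage `json:"object"`
}

// streamWatch reads events from r and closes the returned channel as soon as the
// stream ends (io.EOF) or a decode error occurs, mirroring the behavior described above.
func streamWatch(r io.Reader) <-chan event {
	ch := make(chan event)
	dec := json.NewDecoder(r)
	go func() {
		defer close(ch) // closed as soon as the "server" closes the connection
		for {
			var e event
			if err := dec.Decode(&e); err != nil {
				return // io.EOF means the stream ended; either way, stop
			}
			ch <- e
		}
	}()
	return ch
}

func main() {
	body := `{"type":"ADDED","object":{}}` + "\n" + `{"type":"MODIFIED","object":{}}`
	for e := range streamWatch(strings.NewReader(body)) {
		fmt.Println("got event:", e.Type)
	}
	fmt.Println("channel closed")
}
```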

@hardikdr
Member Author

client code used to set up the informer (including configured timeouts, etc)

The resync period is set to 12 hours, and the timeout is not set explicitly, so it uses the default.

what other activity is happening to the relevant CRD and custom resource instances before/while the watch is open

The CRDs are re-applied by the controller but not necessarily changed, and not on a regular schedule. Custom resource instances are updated regularly while the watch is open.

We see the log line below regularly for a custom resource, but it stops after some time, which I believe is when the watch is hung:
I0226 06:36:51.828470 1 reflector.go:405] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Flunder total 30 items received

Ref to the actual implementation.
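For reference, a minimal sketch of an informer configured as described (12h resync, no explicit timeout) is shown below. It uses the core Pods informer and a kubeconfig at the default location purely as stand-ins; the actual controller uses a generated informer factory for its custom resource.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// 12h resync; no client-side watch timeout is configured, so the watch relies on
	// the server (or the transport) to eventually close the connection.
	factory := informers.NewSharedInformerFactory(clientset, 12*time.Hour)
	informer := factory.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("add") },
		UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("update") },
		DeleteFunc: func(obj interface{}) { fmt.Println("delete") },
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	select {} // run forever, as a controller would
}
```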

@amshuman-kr

Ideally, the client side should implement the timeout itself, without depending on the server side to close the channel on timeout. This has already been done recently in client-go but is not released yet. We should adopt that change as soon as it is released to insulate ourselves from such issues in the future.
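As a conceptual sketch of such a client-side timeout (this is not the unreleased client-go change referred to above, just an illustration of the idea): wrap the watch.Interface and force it shut after a deadline, so a hung server-side connection cannot stall the consumer forever.

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// withTimeout returns a channel that mirrors w.ResultChan() but is closed (and the
// underlying watch stopped) once maxDuration has elapsed, giving the caller a chance
// to re-establish the watch even if the server never closes it.
func withTimeout(w watch.Interface, maxDuration time.Duration) <-chan watch.Event {
	out := make(chan watch.Event)
	go func() {
		defer close(out)
		defer w.Stop()
		timer := time.NewTimer(maxDuration)
		defer timer.Stop()
		for {
			select {
			case ev, ok := <-w.ResultChan():
				if !ok {
					return // server closed the watch normally
				}
				out <- ev
			case <-timer.C:
				return // client-side timeout fired
			}
		}
	}()
	return out
}

func main() {
	fw := watch.NewFake() // stands in for a real, possibly hung, server-side watch
	for range withTimeout(fw, 2*time.Second) {
	}
	// Even though the fake watch never delivers events or closes, we return after ~2s.
}
```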

@liggitt
Member

liggitt commented Mar 5, 2020

The CR-specific issues were resolved in 1.15 (and picked back to 1.12.9, 1.13.7, and 1.14.3), but there are more systemic issues that can occur at the client transport level if the underlying TCP connection is disrupted but not disconnected (like #374).
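One possible mitigation sketch for that transport-level case, hedged heavily: configure the client's dialer with aggressive TCP keepalives so a disrupted-but-not-disconnected connection is detected and torn down sooner, letting the watch be re-established. This assumes a client-go version whose rest.Config exposes a Dial hook; the keepalive values are arbitrary examples, not recommendations from this thread.

```go
package main

import (
	"context"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// Shorter TCP keepalive interval so dead peers are noticed sooner (example values).
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	cfg.Dial = func(ctx context.Context, network, address string) (net.Conn, error) {
		return dialer.DialContext(ctx, network, address)
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	_ = clientset // use the clientset / informers as usual
}
```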

@amshuman-kr

The CR-specific issues were resolved in 1.15 (and picked back to 1.12.9, 1.13.7, and 1.14.3), but there are more systemic issues that can occur at the client transport level if the underlying TCP connection is disrupted but not disconnected (like #374).

Won't using a client-side timeout (now that it is possible) be the right way to reliably refresh such stuck watch connections?
