Watch hang occasionally #884

Closed
zhiweiv opened this issue Jun 6, 2022 · 6 comments
@zhiweiv
Contributor

zhiweiv commented Jun 6, 2022

After a few fixes, the watch API basically works fine; however, watch hangs still occur occasionally.

The underlying connection was dropped silently even though TCP keep-alive is enabled, and the watch then hangs at WaitForSocketEvents because the timeout is set to Infinite.

In my observation, the API server always closes the watch connection after 30 to 60 minutes. I am thinking of setting the watch timeout to a reasonable value instead of Infinite to fix the hang, for example 2 hours, since under normal conditions the API server has already closed the connection well before that.

    if (watch == true)
    {
        cts.CancelAfter(Timeout.InfiniteTimeSpan);
    }

kubernetes/kubernetes#67491 (comment)
#533 (comment)
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
--min-request-timeout int     Default: 1800

An optional field indicating the minimum number of seconds a handler must keep a request open before timing it out. Currently only honored by the watch request handler, which picks a randomized value above this number as the connection timeout, to spread out load.
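
To illustrate the proposal above, here is a minimal sketch of how a finite CancelAfter bounds a read that would otherwise stall forever. It uses only the .NET base class library. StalledWatchAsync and WatchWithTimeoutAsync are hypothetical names, not part of the client library; the silently dropped connection is simulated with Task.Delay, and the 200 ms timeout is for demonstration only (the suggestion in this issue is something like 2 hours).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a watch read on a silently dropped connection:
// it never completes on its own and only ends when the token is cancelled.
static async Task<bool> StalledWatchAsync(CancellationToken token)
{
    try
    {
        await Task.Delay(Timeout.InfiniteTimeSpan, token);
        return false; // not reached: an infinite delay never completes normally
    }
    catch (OperationCanceledException)
    {
        return true; // the timeout fired and unblocked the caller
    }
}

// Hypothetical wrapper: bounds the simulated watch with a finite timeout.
// Returns true if the watch was ended by the timeout.
static async Task<bool> WatchWithTimeoutAsync(TimeSpan watchTimeout)
{
    using var cts = new CancellationTokenSource();
    // With Timeout.InfiniteTimeSpan this would never fire; a finite value does.
    cts.CancelAfter(watchTimeout);
    return await StalledWatchAsync(cts.Token);
}

bool cancelledByTimeout = await WatchWithTimeoutAsync(TimeSpan.FromMilliseconds(200));
Console.WriteLine($"cancelled by timeout: {cancelledByTimeout}");
```

With a finite timeout the caller gets control back and can re-establish the watch, at the cost of one extra reconnect when the connection was actually healthy.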

@tg123
Member

tg123 commented Jun 6, 2022

what is the tcp conn state?

@zhiweiv
Contributor Author

zhiweiv commented Jun 6, 2022

I didn't check it this time. The dump follows; it is the same as the hang before TCP keep-alive was added, so I guess the state is also ESTAB, but the connection had already been dropped by the load balancer.

00007F5DA1B1B470 00007F5EE9B1D12F Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
00007F5DA1B1B520 00007F5EE9B427B9 System.Net.Sockets.SocketAsyncEngine.EventLoop() [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs @ 176]
00007F5DA1B1B570 00007F5EE9B42D65 System.Net.Sockets.SocketAsyncEngine+<>c.<.ctor>b__14_0(System.Object) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs @ 154]
00007F5DA1B1B580 00007F5EE72688C2 System.Threading.Thread.StartCallback() [/_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs @ 105]
00007F5DA1B1B6F0 00007f5f60f0d0f3 [DebuggerU2MCatchHandlerFrame: 00007f5da1b1b6f0]

@tg123
Member

tg123 commented Jun 8, 2022

sounds like the connection is still alive?
is it ok to tcpdump/wireshark?

@zhiweiv
Contributor Author

zhiweiv commented Jun 9, 2022

It has reproduced three times in total since 7.2.19 (released Apr 23, 2022). Another noticeable thing is that when it happens, all pods running the watch logic hang at the same time (or at least around the same time), which makes me guess the connections were dropped somewhere underneath, like #533 (comment) and #773 (comment).

I can try tcpdump next time. But in case it is an underlying connection problem, I am wondering whether it would be good to make the watch timeout configurable instead of Infinite?

@tg123
Member

tg123 commented Jun 9, 2022

does a cancellationToken work for you? it has the same effect as a customized timeout

@zhiweiv
Contributor Author

zhiweiv commented Jun 9, 2022

Thanks for the idea, it works for me. It is a bit trickier than a global configurable watch timeout, since I need to add CancelAfter to every piece of watch logic.
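
One way to avoid repeating CancelAfter in every piece of watch logic is to put it in a small shared helper. The following is only a sketch under stated assumptions: RunWatchLoopAsync is a hypothetical name, not part of the client library, and the watchOnce delegate stands in for whatever call starts a single watch session with the given token. Here it is simulated with Task.Delay so the example is self-contained.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs watch sessions back to back, each bounded by
// watchTimeout, until stopToken is cancelled. Returns the number of sessions started.
static async Task<int> RunWatchLoopAsync(
    Func<CancellationToken, Task> watchOnce,
    TimeSpan watchTimeout,
    CancellationToken stopToken)
{
    int sessionsStarted = 0;
    while (!stopToken.IsCancellationRequested)
    {
        // Link to the caller's token so an external stop also ends the session.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(stopToken);
        cts.CancelAfter(watchTimeout);
        sessionsStarted++;
        try
        {
            await watchOnce(cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Either the per-session timeout fired or the caller asked to stop;
            // the loop condition tells the two apart.
        }
    }
    return sessionsStarted;
}

// Usage with a simulated watch that stalls forever (assumption: a real
// watchOnce would start the watch and process events until it ends).
var stop = new CancellationTokenSource();
int calls = 0;
int sessions = await RunWatchLoopAsync(
    async token =>
    {
        calls++;
        if (calls == 3) stop.Cancel(); // stop the loop during the third session
        await Task.Delay(Timeout.InfiniteTimeSpan, token);
    },
    TimeSpan.FromMilliseconds(100),
    stop.Token);
Console.WriteLine($"sessions started: {sessions}");
```

The helper also restarts the watch when watchOnce returns normally, which covers the case where the API server closes the connection on its own schedule.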
