Unable to detect if a watch is active #65012
Comments
There are many cases that depend on watch, and they all need this. |
/sig api-machinery |
But it is strange that the kubelet could reconnect, as observed via netstat. edit: related to this https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L612 |
/kind bug |
They should, once the dead/idle TCP connection times out. |
@liggitt The TCP keepalive is working in this case, but it cannot sense this kind of failure. |
I think introducing an app-level keepalive between server and client for long-lived watch connections can solve this.
I think the cost of a keepalive is bearable in order to make Kubernetes more reliable. |
This has come up several times, and I still haven't seen evidence that we have a problem when the TCP keepalive is configured and working as expected. |
And if HTTP/2's connection caching is an issue, timing out the watch at the app level will just lead to reopening a watch over the same dead but cached connection. |
Yes, we can force a reconnect only by closing the previous TCP connection explicitly.
It seems like this doesn't really solve the problem if a second step is required? Am I missing something?
|
No, we can close connections when the keepalive times out. Then a newer request, no matter |
Opened PR #65087, and have tested it briefly. |
Hoisting the conversation from the kubelet-specific issue #48638 (comment) to here.
|
See my comment here: #65087 (comment) Is there a reason why we can't start using the http2 request reliability mechanisms? (https://http2.github.io/http2-spec/#Reliability) |
Hi @lavalamp @liggitt, what's the status of this? When doing maintenance on our master nodes and/or removing a master node, we ran into numerous issues tracing back to stuck TCP connections. The first was with the kubelet -> api-server connection, which seems to have been fixed by #63492; however, we also encountered issues with many other components that leverage client-go to watch resources, such as Prometheus and CoreDNS. Is there a timeline for the broader client-go fix? Many of our cluster services rely on watches to the api-server, so this issue really cascades down, affecting the whole cluster, and requires rolling everything to re-establish a fresh TCP connection. I'm also willing to lend help with the fix if you need! |
@jekohk From an extremely cursory search, it appears that we'll need to switch to using the golang.org/x/net/http2 package and, for every connection, periodically call https://godoc.org/golang.org/x/net/http2#ClientConn.Ping. There doesn't appear to be a way to get the connection pool type there to do this for us automatically; I wonder if they'd consider adding that, or accepting a contribution to add it? |
It would be really appreciated if Go's http2 package allowed us to send a Ping frame in our case. @lavalamp Do you know if there is any plan for this? |
My previous update links directly to the Go method for sending a Ping frame. Some surgery is probably needed in the client library to call it appropriately. I'm not aware of anyone working on this yet, but it seems like the right way to solve this problem, and I'm happy to look at PRs that address this with the Ping mechanism.
|
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Heyo, I figured I'd bump this because I'm currently running into it with our services. It's rough because I essentially cannot detect when the stream fails, so our applications that use watch events have to restart the stream constantly in order to ensure it's live. |
We're waiting for go1.15. @caesarxuchao made a fix for this in the upstream go std library. Unfortunately, it's basically impossible to fix this without that change. This is already tracked here: kubernetes/client-go#374 (comment) Apparently we've gone 2 years without de-duping these two issues, oops! |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. |
/remove-lifecycle stale |
This has been resolved with http2 keepalive. |
Can you point to where that fix is? I would assume an http2 keep-alive timeout would cause the connection to restart after X seconds of inactivity, but then it would just time out and restart constantly, whereas I'd really love something that sends a ping frame in order to continually keep the connection alive, or times out if an ack is not received in return. |
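For reference, the http2-level keepalive discussed in this thread is ping-frame based, not an inactivity restart: later versions of golang.org/x/net/http2 expose `ReadIdleTimeout` and `PingTimeout` on `http2.Transport`, which send a PING frame on a connection that has seen no frames for a while and close it only if no ack comes back. A configuration sketch, with illustrative durations rather than recommendations:

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingingClient builds a client whose HTTP/2 transport probes
// quiet connections with PING frames and drops them when the ack
// does not arrive, instead of restarting on mere inactivity.
func newPingingClient() *http.Client {
	t := &http2.Transport{
		TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		ReadIdleTimeout: 30 * time.Second, // send a PING after 30s without frames
		PingTimeout:     15 * time.Second, // declare the connection dead without an ack
	}
	return &http.Client{Transport: t}
}

func main() { _ = newPingingClient() }
```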
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
We are unable to detect whether a watch is active or not, especially in the HTTP/2 case.
Suppose we have 3 kube-apiservers behind a load balancer, and kube-proxy and kube-controller-manager each multiplex one HTTP/2 connection to a single apiserver. When that apiserver gets stuck, there will be logs indicating errors such as get timeouts, but kube-proxy and kube-controller-manager will not reconnect to another apiserver.
What you expected to happen:
If one server is stuck, the client should be able to detect this and reconnect to another server. At worst, the client should log an obvious error and exit, waiting for other guards to restart it.
How to reproduce it (as minimally and precisely as possible):
Send a SIGSTOP signal to one apiserver and watch the related kubelet logs.
Anything else we need to know?:
Environment:
- Kubernetes version (kubectl version):
- Kernel (uname -a):