
Unable to detect if a watch is active #65012

Closed
hzxuzhonghu opened this issue Jun 12, 2018 · 51 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@hzxuzhonghu
Member

hzxuzhonghu commented Jun 12, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

We are unable to detect whether a watch is still active, especially in the HTTP/2 case.

If we have 3 kube-apiservers behind a load balancer, kube-proxy and kube-controller-manager currently multiplex all requests over a single HTTP/2 connection to one apiserver. When that apiserver gets stuck, there are logs indicating errors such as GET timeouts, but kube-proxy and kube-controller-manager will not reconnect to another apiserver.

What you expected to happen:

If one server gets stuck, the client should be able to detect this and reconnect to another server. At worst, the client should log an obvious error and exit, and wait for an external supervisor to restart it.

How to reproduce it (as minimally and precisely as possible):

Send a SIGSTOP signal to one apiserver and watch the related kubelet logs.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 12, 2018
@hzxuzhonghu
Member Author

There are many components that depend on watch; they all need this.

@hzxuzhonghu
Member Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 12, 2018
@hzxuzhonghu
Member Author

@hzxuzhonghu
Member Author

hzxuzhonghu commented Jun 12, 2018

But it is strange that the kubelet can reconnect, as seen via netstat.

edit: related to this https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L612

@hzxuzhonghu
Member Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 12, 2018
@liggitt
Member

liggitt commented Jun 12, 2018

But the kube-proxy and kube-controller-manager will not reconnect to another apiserver.

they should once the dead/idle TCP connection times out

@hzxuzhonghu
Member Author

@liggitt TCP keepalive still succeeds in this case (the kernel keeps answering probes even though the apiserver process is stopped), so it cannot sense the failure.

@hzxuzhonghu
Member Author

I think introducing an application-level keepalive between server and client for long-lived watch connections can solve this.

  1. The apiserver sends a heartbeat periodically, and the client treats the connection as dead if it does not receive a heartbeat within a timeout.

  2. How to deal with a keepalive timeout? Maybe the client can close the underlying connection directly, but I am not sure whether there is any risk with HTTP/2 multiplexing.

I think the cost of a keepalive is bearable in order to make Kubernetes more reliable.
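
For illustration, a minimal client-side sketch of the detection half of this proposal. The `events` channel and `onStale` callback are hypothetical names used for the sketch, not client-go APIs:

```go
package watchkeepalive

import (
	"context"
	"time"
)

// monitorWatch treats the watch as dead if no event (or heartbeat) arrives
// within heartbeatTimeout, then invokes onStale (e.g. to close the
// underlying connection, as discussed below).
func monitorWatch(ctx context.Context, events <-chan struct{}, heartbeatTimeout time.Duration, onStale func()) {
	timer := time.NewTimer(heartbeatTimeout)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case _, ok := <-events:
			if !ok {
				return // watch ended normally
			}
			// Got an event/heartbeat: reset the deadline.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(heartbeatTimeout)
		case <-timer.C:
			onStale()
			return
		}
	}
}
```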

@lavalamp
Member

This has come up several times, and I still haven't seen evidence that we have a problem when TCP keepalive is configured and working as expected.
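
For reference, TCP keepalive is normally configured on the client's dialer; a minimal sketch (timeout values are illustrative, not the actual client-go defaults), noting its limitation in the SIGSTOP scenario above:

```go
package watchkeepalive

import (
	"net"
	"net/http"
	"time"
)

// newTransport enables TCP keepalive on every dialed connection. Kernel-level
// keepalive detects a peer host that is gone, but a SIGSTOP'ed apiserver's
// kernel still ACKs the probes, so this alone does not detect a stuck process.
func newTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second, // interval between TCP keepalive probes
	}
	return &http.Transport{
		DialContext: dialer.DialContext,
	}
}
```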

@lavalamp
Member

And if HTTP/2's caching is an issue, timing out the watch at an app level will just lead to reopening a watch over the same dead but cached connection.

@hzxuzhonghu
Member Author

And if HTTP/2's caching is an issue, timing out the watch at an app level will just lead to reopening a watch over the same dead but cached connection.

Yes, we can force a reconnect only by closing the previous TCP connection explicitly.

@lavalamp
Member

lavalamp commented Jun 14, 2018 via email

@hzxuzhonghu
Member Author

No, we can close the connection when the keepalive times out. Then any new request, whether watch, get, or list, will cause the HTTP client to look for a connection; since no usable connection remains, it has to dial again to establish a new one. The new connection will possibly land on a healthy apiserver.
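
A hedged sketch of this idea: a hypothetical dialer that remembers the connections it created so that a keepalive watchdog can tear them down and force the next request to dial again. This is illustrative only, not the approach taken in #65087:

```go
package watchkeepalive

import (
	"context"
	"net"
	"sync"
)

// trackingDialer wraps net.Dialer and records every connection it creates.
type trackingDialer struct {
	mu    sync.Mutex
	conns []net.Conn
	d     net.Dialer
}

// DialContext can be plugged into http.Transport.DialContext.
func (t *trackingDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	conn, err := t.d.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	t.mu.Lock()
	t.conns = append(t.conns, conn)
	t.mu.Unlock()
	return conn, nil
}

// CloseAll is the "keepalive timed out" action: closing the TCP connections
// fails any in-flight HTTP/2 streams and forces the client to dial again,
// possibly reaching a healthy apiserver behind the load balancer.
func (t *trackingDialer) CloseAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range t.conns {
		c.Close()
	}
	t.conns = nil
}
```

Wiring DialContext into http.Transport.DialContext and calling CloseAll from a watchdog like the monitorWatch sketch above would complete the loop.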

@hzxuzhonghu
Member Author

Opened PR #65087 and have done some basic testing of it.

@liggitt
Member

liggitt commented Jun 19, 2018

hoisting conversation from the kubelet-specific issue #48638 (comment) to here

@obeattie:

Since this issue has been re-opened, would there be any value in me re-opening my PR for this commit? Monzo has been running this patch in production since last July and it has eliminated this problem entirely, for all uses of client-go.

@liggitt:

I still have several reservations about that fix:

  • it relies on behavior that appears to be undefined (it changes properties on the result of net.TCPConn#File(), documented as "The returned os.File's file descriptor is different from the connection's. Attempting to change properties of the original using this duplicate may or may not have the desired effect.")
  • calling net.TCPConn#File() has implications I'm unsure of: "On Unix systems this will cause the SetDeadline methods to stop working."
  • it appears to only trigger closing in response to unacknowledged outgoing data... that doesn't seem like it would help clients (like kube-proxy) with long-running receive-only watch stream connections

that said, if @kubernetes/sig-api-machinery-pr-reviews and/or @kubernetes/sig-network-pr-reviews feel strongly that is the correct direction to pursue, that would be helpful to hear.

@redbaron:

Few notes on these very valid concerns:

  • https://golang.org/pkg/net/#TCPConn.File returns a dup'ed file descriptor, which AFAIK shares all the underlying kernel structures except the entry in the file descriptor table, so either can be used with the same results. The program should be careful not to use them simultaneously, though, for exactly the same reasons.
  • Today the returned file descriptor is set to blocking mode. This can probably be mitigated by setting it back to nonblocking mode. In Go 1.11 the returned fd will be left in the same blocking/nonblocking mode it was in before the .File() call: net: File method of {TCP,UDP,IP,Unix}Conn and {TCP,Unix}Listener should leave the socket in nonblocking mode golang/go#24942
  • Maybe it will not help simple watchers. I am not familiar with Informer internals, but I was under the impression that they are not only watching but also periodically resyncing state; these resyncs would trigger outgoing data transfers, which would then be detected.

@obeattie:

Indeed: as far as I understand, the behaviour is not undefined, it's just defined in Linux rather than in Go. I think the Go docs could be clearer on this. Here's the relevant section from dup(2):

After a successful return from one of these system calls, the old and new file descriptors may be used interchangeably. They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the descriptors, the offset is also changed for the other.

The two descriptors do not share file descriptor flags (the close-on-exec flag).

My code doesn't modify flags after obtaining the fd; its only use is in a call to setsockopt(2). The docs for that call are fairly clear that it modifies properties of the socket referred to by the descriptor, not the descriptor itself:

getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd.

I agree that the original descriptor being set to blocking mode is annoying. Go's code is clear that this will not prevent anything from working, just that more OS threads may be required for I/O:

https://github.com/golang/go/blob/516f5ccf57560ed402cdae28a36e1dc9e81444c3/src/net/fd_unix.go#L313-L315

Given that a single kubelet (or other user of client-go) establishes a small number of long-lived connections to the apiservers, and that this will be fixed in Go 1.11, I don't think this is a significant issue.

I am happy for this to be fixed in another way, but given we know that this works and does not require invasive changes to the apiserver to achieve, I think it is a reasonable solution. I have heard from several production users of Kubernetes that this has bitten them in the same way it bit us.
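
For context, a sketch of the same class of socket-option tuning using net.TCPConn.SyscallConn, which avoids the File()/blocking-mode concern discussed above. The helper name is hypothetical, TCP_USER_TIMEOUT is Linux-specific, and liggitt's caveat still applies: it only fires when there is unacknowledged outgoing data.

```go
//go:build linux

package watchkeepalive

import (
	"context"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// dialWithUserTimeout dials a TCP connection and sets TCP_USER_TIMEOUT so the
// kernel aborts the connection when transmitted data remains unacknowledged
// for longer than timeout. It does not by itself cover a receive-only watch
// stream that never sends data.
func dialWithUserTimeout(ctx context.Context, network, addr string, timeout time.Duration) (net.Conn, error) {
	var d net.Dialer
	conn, err := d.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return conn, nil
	}
	// SyscallConn (Go >= 1.9) gives access to the fd without dup'ing it or
	// switching the socket to blocking mode, unlike TCPConn.File().
	raw, err := tcpConn.SyscallConn()
	if err != nil {
		conn.Close()
		return nil, err
	}
	var optErr error
	if err := raw.Control(func(fd uintptr) {
		optErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, int(timeout/time.Millisecond))
	}); err != nil {
		conn.Close()
		return nil, err
	}
	if optErr != nil {
		conn.Close()
		return nil, optErr
	}
	return conn, nil
}
```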

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jun 19, 2018
@lavalamp
Member

See my comment here: #65087 (comment)

Is there a reason why we can't start using the http2 request reliability mechanisms? (https://http2.github.io/http2-spec/#Reliability)

@nulltrope

Hi @lavalamp @liggitt what's the status of this?

When doing maintenance on our master nodes and/or removing a master node we ran into numerous issues tracing back to stuck TCP connections.

The first was with the kubelet -> apiserver connection, which seems to have been fixed by #63492; however, we also encountered issues with many other components that leverage client-go to watch resources, such as Prometheus and CoreDNS.

Is there a timeline for the broader client-go fix? Many of our cluster services rely on watches against the apiserver, so this issue really cascades down, affecting the whole cluster, and requires rolling everything to re-establish fresh TCP connections.

I'm also willing to help with the fix if needed!

@lavalamp
Member

lavalamp commented Aug 3, 2018

@jekohk from an extremely cursory search, it appears that we'll need to switch to using the golang.org/x/net/http2 package, and for every connection, periodically call https://godoc.org/golang.org/x/net/http2#ClientConn.Ping. There doesn't look to be a way to automatically get the connection pool type there to do this for us; I wonder if they'd consider adding that or accepting something to add it?
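
A rough sketch of such a ping loop, assuming the caller already holds a *http2.ClientConn (for example from http2.Transport.NewClientConn); the onDead callback is a placeholder for whatever teardown the caller wants:

```go
package watchkeepalive

import (
	"context"
	"time"

	"golang.org/x/net/http2"
)

// pingLoop sends an HTTP/2 PING frame every interval and treats a missing
// ack within timeout as a dead connection.
func pingLoop(ctx context.Context, cc *http2.ClientConn, interval, timeout time.Duration, onDead func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, timeout)
			err := cc.Ping(pingCtx) // blocks until the ack arrives or the context expires
			cancel()
			if err != nil {
				onDead() // e.g. close the underlying net.Conn so later requests re-dial
				return
			}
		}
	}
}
```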

@hzxuzhonghu
Member Author

It would be really appreciated if Go's http2 package allowed us to send PING frames in our case.

@lavalamp Do you know if there is any plan for this?

@lavalamp
Member

lavalamp commented Aug 6, 2018 via email

@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 1, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2020
@redbaron
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2020
@maxweisel

Heyo, I figured I'd bump this because I'm currently running into this issue with our services. It's rough because I essentially cannot detect when the stream fails, so I have to make our applications that use watch events restart the stream constantly in order to ensure it's live.

@lavalamp
Member

We're waiting for Go 1.15. @caesarxuchao made a fix for this in the upstream Go standard library. Unfortunately, it's basically impossible to fix this without that change.

This is already tracked here: kubernetes/client-go#374 (comment)

Apparently we've gone 2 years without de-duping these two issues, oops!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 11, 2020
@warmchang
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2020
@george-angel
Contributor

/remove-lifecycle stale

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2021
@maxweisel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2021
@maxweisel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2021
@hzxuzhonghu
Member Author

This has been resolved with http2 keepalive.

@maxweisel

Can you point to where that fix is? I would assume an HTTP/2 keep-alive timeout would cause the connection to restart after X seconds of inactivity, but then it would just time out and restart constantly, whereas I'd really love something that sends a PING frame to continually keep the connection alive, or times out if the ack is not received in return.

@hzxuzhonghu
Member Author

#95981
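
For readers landing here later: that fix builds on the ping-based health check in golang.org/x/net/http2. A minimal sketch of enabling the same mechanism on a plain net/http client (timeout values are illustrative; requires an x/net version that includes ConfigureTransports, ReadIdleTimeout, and PingTimeout):

```go
package watchkeepalive

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHealthCheckedClient upgrades a standard transport to HTTP/2 and enables
// the ping-based health check: if no frame is received for ReadIdleTimeout,
// a PING is sent, and if the ack does not arrive within PingTimeout the
// connection is closed so subsequent requests dial a fresh one.
func newHealthCheckedClient() (*http.Client, error) {
	t1 := &http.Transport{}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second
	return &http.Client{Transport: t1}, nil
}
```

This matches the semantics asked about above: the PING is only sent after ReadIdleTimeout of silence, so an actively receiving watch stream is left alone rather than being restarted constantly.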
