
Unable to detect if a watch is active #65012

Closed
hzxuzhonghu opened this issue Jun 12, 2018 · 51 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@hzxuzhonghu
Member

hzxuzhonghu commented Jun 12, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

We are unable to detect whether a watch is still active, especially in the HTTP/2 case.

If we have 3 kube-apiservers behind a load balancer, kube-proxy and kube-controller-manager currently multiplex all requests over a single HTTP/2 connection to one apiserver. When that apiserver gets stuck, there are logs indicating errors such as GET timeouts, but kube-proxy and kube-controller-manager will not reconnect to another apiserver.

What you expected to happen:

If one server gets stuck, the client should be able to detect this and reconnect to another server. At worst, the client should log an obvious error and exit, and wait for an external supervisor to restart it.

How to reproduce it (as minimally and precisely as possible):

Send a SIGSTOP signal to one apiserver and watch the related kubelet logs.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 12, 2018
@hzxuzhonghu
Member Author

There are many components that depend on watch; they all need this.

@hzxuzhonghu
Member Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 12, 2018
@hzxuzhonghu
Member Author

@hzxuzhonghu
Member Author

hzxuzhonghu commented Jun 12, 2018

But it is strange that the kubelet can reconnect, as seen via netstat.

edit: related to this https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L612

@hzxuzhonghu
Member Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 12, 2018
@liggitt
Member

liggitt commented Jun 12, 2018

But the kube-proxy and kube-controller-manager will not reconnect to another apiserver.

they should once the dead/idle TCP connection times out

@hzxuzhonghu
Member Author

@liggitt TCP keepalive still succeeds in this case (the kernel keeps answering probes even though the apiserver process is stopped), so it cannot sense the failure.

@hzxuzhonghu
Member Author

I think introducing an application-level keepalive between server and client for long-lived watch connections can solve this.

  1. The apiserver sends a heartbeat periodically, and the client treats the connection as dead if it does not receive a heartbeat within a timeout.

  2. How to deal with a keepalive timeout? Maybe the client can close the underlying connection directly, but I am not sure whether there is any risk with HTTP/2 multiplexing.

I think the cost of a keepalive is bearable in order to make Kubernetes more reliable.
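
For illustration, a minimal client-side sketch of the detection half of this proposal. The `events` channel and `onStale` callback are hypothetical names used for the sketch, not client-go APIs:

```go
package watchkeepalive

import (
	"context"
	"time"
)

// monitorWatch treats the watch as dead if no event (or heartbeat) arrives
// within heartbeatTimeout, then invokes onStale (e.g. to close the
// underlying connection, as discussed below).
func monitorWatch(ctx context.Context, events <-chan struct{}, heartbeatTimeout time.Duration, onStale func()) {
	timer := time.NewTimer(heartbeatTimeout)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case _, ok := <-events:
			if !ok {
				return // watch ended normally
			}
			// Got an event/heartbeat: reset the deadline.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(heartbeatTimeout)
		case <-timer.C:
			onStale()
			return
		}
	}
}
```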

@lavalamp
Member

This has come up several times, and I still haven't seen evidence that we have a problem when TCP keepalive is configured and working as expected.
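
For reference, TCP keepalive is normally configured on the client's dialer; a minimal sketch (timeout values are illustrative, not the actual client-go defaults), noting its limitation in the SIGSTOP scenario above:

```go
package watchkeepalive

import (
	"net"
	"net/http"
	"time"
)

// newTransport enables TCP keepalive on every dialed connection. Kernel-level
// keepalive detects a peer host that is gone, but a SIGSTOP'ed apiserver's
// kernel still ACKs the probes, so this alone does not detect a stuck process.
func newTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second, // interval between TCP keepalive probes
	}
	return &http.Transport{
		DialContext: dialer.DialContext,
	}
}
```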

@lavalamp
Member

And if HTTP/2's caching is an issue, timing out the watch at an app level will just lead to reopening a watch over the same dead but cached connection.

@hzxuzhonghu
Member Author

And if HTTP/2's caching is an issue, timing out the watch at an app level will just lead to reopening a watch over the same dead but cached connection.

Yes, we can force a reconnect only by closing the previous TCP connection explicitly.

@lavalamp
Member

lavalamp commented Jun 14, 2018 via email

@hzxuzhonghu
Member Author

No, we can close the connection when the keepalive times out. Then any new request, whether watch, get, or list, will cause the HTTP client to look for a connection; since no usable connection remains, it has to dial again to establish a new one. The new connection will possibly land on a healthy apiserver.
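
A hedged sketch of this idea: a hypothetical dialer that remembers the connections it created so that a keepalive watchdog can tear them down and force the next request to dial again. This is illustrative only, not the approach taken in #65087:

```go
package watchkeepalive

import (
	"context"
	"net"
	"sync"
)

// trackingDialer wraps net.Dialer and records every connection it creates.
type trackingDialer struct {
	mu    sync.Mutex
	conns []net.Conn
	d     net.Dialer
}

// DialContext can be plugged into http.Transport.DialContext.
func (t *trackingDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	conn, err := t.d.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	t.mu.Lock()
	t.conns = append(t.conns, conn)
	t.mu.Unlock()
	return conn, nil
}

// CloseAll is the "keepalive timed out" action: closing the TCP connections
// fails any in-flight HTTP/2 streams and forces the client to dial again,
// possibly reaching a healthy apiserver behind the load balancer.
func (t *trackingDialer) CloseAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, c := range t.conns {
		c.Close()
	}
	t.conns = nil
}
```

Wiring DialContext into http.Transport.DialContext and calling CloseAll from a watchdog like the monitorWatch sketch above would complete the loop.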

@hzxuzhonghu
Member Author

Opened PR #65087 and have done some basic testing of it.

@liggitt
Member

liggitt commented Jun 19, 2018

hoisting conversation from the kubelet-specific issue #48638 (comment) to here

@obeattie:

Since this issue has been re-opened, would there be any value in me re-opening my PR for this commit? Monzo has been running this patch in production since last July and it has eliminated this problem entirely, for all uses of client-go.

@liggitt:

I still have several reservations about that fix:

  • it relies on behavior that appears to be undefined (it changes properties on the result of net.TCPConn#File(), documented as "The returned os.File's file descriptor is different from the connection's. Attempting to change properties of the original using this duplicate may or may not have the desired effect.")
  • calling net.TCPConn#File() has implications I'm unsure of: "On Unix systems this will cause the SetDeadline methods to stop working."
  • it appears to only trigger closing in response to unacknowledged outgoing data... that doesn't seem like it would help clients (like kube-proxy) with long-running receive-only watch stream connections

that said, if @kubernetes/sig-api-machinery-pr-reviews and/or @kubernetes/sig-network-pr-reviews feel strongly that is the correct direction to pursue, that would be helpful to hear.

@redbaron:

Few notes on these very valid concerns:

  • https://golang.org/pkg/net/#TCPConn.File returns a dup'ed file descriptor, which AFAIK shares all the underlying kernel structures except the entry in the file descriptor table, so either can be used with the same results. The program should be careful not to use them simultaneously, though, for exactly the same reasons.
  • Today the returned file descriptor is set to blocking mode. This can probably be mitigated by setting it back to nonblocking mode. In Go 1.11 the returned fd will be left in the same blocking/nonblocking mode it was in before the .File() call: net: File method of {TCP,UDP,IP,Unix}Conn and {TCP,Unix}Listener should leave the socket in nonblocking mode golang/go#24942
  • Maybe it will not help simple watchers. I am not familiar with Informer internals, but I was under the impression that they are not only watching but also periodically resyncing state; these resyncs would trigger outgoing data transfers, which would then be detected.

@obeattie:

Indeed: as far as I understand, the behaviour is not undefined, it's just defined in Linux rather than in Go. I think the Go docs could be clearer on this. Here's the relevant section from dup(2):

After a successful return from one of these system calls, the old and new file descriptors may be used interchangeably. They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the descriptors, the offset is also changed for the other.

The two descriptors do not share file descriptor flags (the close-on-exec flag).

My code doesn't modify flags after obtaining the fd; its only use is in a call to setsockopt(2). The docs for that call are fairly clear that it modifies properties of the socket referred to by the descriptor, not the descriptor itself:

getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd.

I agree that the original descriptor being set to blocking mode is annoying. Go's code is clear that this will not prevent anything from working, just that more OS threads may be required for I/O:

https://github.com/golang/go/blob/516f5ccf57560ed402cdae28a36e1dc9e81444c3/src/net/fd_unix.go#L313-L315

Given that a single kubelet (or other user of client-go) establishes a small number of long-lived connections to the apiservers, and that this will be fixed in Go 1.11, I don't think this is a significant issue.

I am happy for this to be fixed in another way, but given we know that this works and does not require invasive changes to the apiserver to achieve, I think it is a reasonable solution. I have heard from several production users of Kubernetes that this has bitten them in the same way it bit us.
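
For context, a sketch of the same class of socket-option tuning using net.TCPConn.SyscallConn, which avoids the File()/blocking-mode concern discussed above. The helper name is hypothetical, TCP_USER_TIMEOUT is Linux-specific, and liggitt's caveat still applies: it only fires when there is unacknowledged outgoing data.

```go
//go:build linux

package watchkeepalive

import (
	"context"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// dialWithUserTimeout dials a TCP connection and sets TCP_USER_TIMEOUT so the
// kernel aborts the connection when transmitted data remains unacknowledged
// for longer than timeout. It does not by itself cover a receive-only watch
// stream that never sends data.
func dialWithUserTimeout(ctx context.Context, network, addr string, timeout time.Duration) (net.Conn, error) {
	var d net.Dialer
	conn, err := d.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}
	tcpConn, ok := conn.(*net.TCPConn)
	if !ok {
		return conn, nil
	}
	// SyscallConn (Go >= 1.9) gives access to the fd without dup'ing it or
	// switching the socket to blocking mode, unlike TCPConn.File().
	raw, err := tcpConn.SyscallConn()
	if err != nil {
		conn.Close()
		return nil, err
	}
	var optErr error
	if err := raw.Control(func(fd uintptr) {
		optErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, int(timeout/time.Millisecond))
	}); err != nil {
		conn.Close()
		return nil, err
	}
	if optErr != nil {
		conn.Close()
		return nil, optErr
	}
	return conn, nil
}
```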

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jun 19, 2018
@lavalamp
Member

See my comment here: #65087 (comment)

Is there a reason why we can't start using the http2 request reliability mechanisms? (https://http2.github.io/http2-spec/#Reliability)

@nulltrope

Hi @lavalamp @liggitt what's the status of this?

When doing maintenance on our master nodes and/or removing a master node we ran into numerous issues tracing back to stuck TCP connections.

The first was with the kubelet -> apiserver connection, which seems to have been fixed by #63492; however, we also encountered issues with many other components that leverage client-go to watch resources, such as Prometheus and CoreDNS.

Is there a timeline for the broader client-go fix? Many of our cluster services rely on watches against the apiserver, so this issue really cascades down, affecting the whole cluster, and requires rolling everything to re-establish fresh TCP connections.

I'm also willing to help with the fix if needed!

@lavalamp
Member

lavalamp commented Aug 3, 2018

@jekohk from an extremely cursory search, it appears that we'll need to switch to using the golang.org/x/net/http2 package, and for every connection, periodically call https://godoc.org/golang.org/x/net/http2#ClientConn.Ping. There doesn't look to be a way to automatically get the connection pool type there to do this for us; I wonder if they'd consider adding that or accepting something to add it?
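
A rough sketch of such a ping loop, assuming the caller already holds a *http2.ClientConn (for example from http2.Transport.NewClientConn); the onDead callback is a placeholder for whatever teardown the caller wants:

```go
package watchkeepalive

import (
	"context"
	"time"

	"golang.org/x/net/http2"
)

// pingLoop sends an HTTP/2 PING frame every interval and treats a missing
// ack within timeout as a dead connection.
func pingLoop(ctx context.Context, cc *http2.ClientConn, interval, timeout time.Duration, onDead func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, timeout)
			err := cc.Ping(pingCtx) // blocks until the ack arrives or the context expires
			cancel()
			if err != nil {
				onDead() // e.g. close the underlying net.Conn so later requests re-dial
				return
			}
		}
	}
}
```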

@hzxuzhonghu
Member Author

It would be really appreciated if Go's http2 package allowed us to send PING frames in our case.

@lavalamp Do you know if there is any plan for this?

@lavalamp
Member

lavalamp commented Aug 6, 2018 via email

@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 1, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2020
@redbaron
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2020
@maxweisel

Heyo, I figured I'd bump this because I'm currently running into this issue with our services. It's rough because I essentially cannot detect when the stream fails, so I have to make our applications that use watch events restart the stream constantly in order to ensure it's live.

@lavalamp
Member

We're waiting for Go 1.15. @caesarxuchao made a fix for this in the upstream Go standard library. Unfortunately, it's basically impossible to fix this without that change.

This is already tracked here: kubernetes/client-go#374 (comment)

Apparently we've gone 2 years without de-duping these two issues, oops!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 11, 2020
@warmchang
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2020
@george-angel
Contributor

/remove-lifecycle stale

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2021
@maxweisel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2021
@maxweisel

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2021
@hzxuzhonghu
Member Author

This has been resolved with http2 keepalive.

@maxweisel

Can you point to where that fix is? I would assume an HTTP/2 keep-alive timeout would cause the connection to restart after X seconds of inactivity, but then it would just time out and restart constantly, whereas I'd really love something that sends a PING frame to continually keep the connection alive, or times out if the ack is not received in return.

@hzxuzhonghu
Member Author

#95981
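
For readers landing here later: that fix builds on the ping-based health check in golang.org/x/net/http2. A minimal sketch of enabling the same mechanism on a plain net/http client (timeout values are illustrative; requires an x/net version that includes ConfigureTransports, ReadIdleTimeout, and PingTimeout):

```go
package watchkeepalive

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newHealthCheckedClient upgrades a standard transport to HTTP/2 and enables
// the ping-based health check: if no frame is received for ReadIdleTimeout,
// a PING is sent, and if the ack does not arrive within PingTimeout the
// connection is closed so subsequent requests dial a fresh one.
func newHealthCheckedClient() (*http.Client, error) {
	t1 := &http.Transport{}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		return nil, err
	}
	t2.ReadIdleTimeout = 30 * time.Second
	t2.PingTimeout = 15 * time.Second
	return &http.Client{Transport: t1}, nil
}
```

This matches the semantics asked about above: the PING is only sent after ReadIdleTimeout of silence, so an actively receiving watch stream is left alone rather than being restarted constantly.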
