[WIP] Keep connection alive during streaming operations #81179
Conversation
Certain operations create a stream that can stay open for a long period of time with no activity at all. This happens especially with certain commands:

- `kubectl exec -it`
- `kubectl logs -f`
- `kubectl attach`
- `kubectl get -w`

While these commands are running, if some L4 component sits between the client and the apiserver, it can decide to close the connection after a configured timeout because no activity happens at the TCP level.

This patch introduces changes in the following components:

- `client-go`: add a `util/keepalive` package that can be used for this purpose. It extends the `ResponseWrapper` interface with a `StreamWithPing` method, so users of the `Stream` function are not affected and its behavior stays the same.
- `kubectl`: `exec`, `logs`, `get` and `attach` get a `--ping-interval` argument that can be used to force an interval at which to ping the other end of the connection, so L4 components in between see traffic and don't close the connection. Since this timeout depends a lot on each use case and organization, leaving it as a `kubectl` argument seems the best option, so the user can pick the interval they want for a specific run.
- `apimachinery`: extend the `httpstream` connection with a `Ping()` function, implemented in the `spdy` backend as a native ping frame already provided by Docker's `spdystream` project.
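The PR body doesn't include the `util/keepalive` helper itself. As a rough illustration of the mechanism described above, here is a minimal sketch of what a `KeepAliveWithPinger`-style loop could look like; the `Pinger` interface and the function name are taken from the diff discussed below, the body is an assumption, not the PR's actual implementation:

```go
package keepalive

import (
	"context"
	"time"
)

// Pinger is anything that can generate a round trip of traffic to the remote
// end of a connection (an HTTP GET, an SPDY ping frame, ...).
type Pinger interface {
	Ping() error
}

// KeepAliveWithPinger calls pinger.Ping() every interval until ctx is
// cancelled, so that intermediate L4 components keep seeing TCP activity.
// Sketch only: the real helper in the PR may differ.
func KeepAliveWithPinger(ctx context.Context, pinger Pinger, interval time.Duration) {
	if interval <= 0 {
		return // keepalive disabled
	}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// Errors are deliberately ignored: the goal is only to
				// produce traffic, not to validate the endpoint.
				_ = pinger.Ping()
			}
		}
	}()
}
```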
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ereslibre. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@@ -687,7 +691,15 @@ func (o *GetOptions) watch(f cmdutil.Factory, cmd *cobra.Command, args []string)
	defer cancel()
	intr := interrupt.New(nil, cancel)
	intr.Run(func() error {
		_, err := watchtools.UntilWithoutRetry(ctx, w, func(e watch.Event) (bool, error) {
switching this to Until or UntilWithSync would handle connection interruptions, and is preferable to depending on a kept-alive connection. See #74717
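For reference, a hedged sketch of the suggested approach: `watchtools.UntilWithSync` drives the watch through an informer, so a dropped watch connection is re-established instead of failing the command. The resource, namespace and condition below are illustrative only, not taken from this PR:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	watchtools "k8s.io/client-go/tools/watch"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// ListerWatcher for pods in the default namespace.
	lw := cache.NewListWatchFromClient(
		clientset.CoreV1().RESTClient(), "pods", "default", fields.Everything())

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// UntilWithSync is informer-backed, so interrupted watch connections are
	// transparently re-listed and re-watched.
	ev, err := watchtools.UntilWithSync(ctx, lw, &corev1.Pod{}, nil,
		func(e watch.Event) (bool, error) {
			pod, ok := e.Object.(*corev1.Pod)
			return ok && pod.Status.Phase == corev1.PodRunning, nil
		})
	if err != nil {
		panic(err)
	}
	fmt.Printf("last event: %v\n", ev.Type)
}
```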
I will sync with that PR for the `get` case, thank you. Thoughts on the general approach of the PR?
long-lived connections (exec/attach/logs) have inherent timeouts built into them. Is the expectation that these connections could be kept open indefinitely using this keep-alive?
if client == nil {
	client = http.DefaultClient
}
keepalive.KeepAliveWithPinger(ctx, keepalivehttp.NewHttpPinger(client.(*http.Client), r.URL().Scheme, r.URL().Host), pingInterval)
a cast to `*http.Client` can panic
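A possible comma-ok variant of the assertion quoted above, avoiding the panic the reviewer points out (a fragment only, using the same identifiers as the diff):

```go
// Fall back to http.DefaultClient instead of panicking when the configured
// client is not a plain *http.Client (e.g. a wrapped/instrumented client).
httpClient, ok := client.(*http.Client)
if !ok {
	httpClient = http.DefaultClient
}
keepalive.KeepAliveWithPinger(ctx, keepalivehttp.NewHttpPinger(httpClient, r.URL().Scheme, r.URL().Host), pingInterval)
```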
I wouldn't expect making another request using the same client to necessarily keep alive an existing streaming connection
> I wouldn't expect making another request using the same client to necessarily keep alive an existing streaming connection

I will investigate more on this. In my environment for reproducing the L4 issue I can see that the fix works, but I might be relying on specific behavior.
it might work for an http/2 connection, which multiplexes requests over a single connection, but that doesn't seem good to rely on
Yes, on the spdy part of the patch I'm confident, because we are sending a native ping frame over the multiplexed connection.
The other two methods are what I am not so sure about; there I might be extrapolating from specific behavior.
After some investigation I have seen that the TCP socket is always reused, within the same process of course. Go implements HTTP/1.1 pipelining, and this is what allows the TCP socket to be reused even when one of the requests is still streaming (however, I don't fully understand how this can work: keeping the stream open on one request while performing other requests).

I have used different methods to confirm this: on one side, the haproxy stats socket, which always showed a single connection per process; netstat on my laptop, which again showed only one port per process; and also a slight variation of the PR using Go's client tracing to check which connections were retrieved from the pool (they were marked as reused all the time, with the same connection object address).

All this being said, I will investigate more and fully confirm whether this is the case, but expert feedback for confirmation would be awesome.
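The "client tracing" check mentioned above can be reproduced with the standard `net/http/httptrace` package. A self-contained sketch (the URL is a placeholder, not an endpoint from this PR):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	// GotConn fires once per request with information about the connection
	// taken from the pool; Reused tells us whether it was an existing one.
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("local=%v reused=%v wasIdle=%v\n",
				info.Conn.LocalAddr(), info.Reused, info.WasIdle)
		},
	}

	for i := 0; i < 3; i++ {
		req, err := http.NewRequest("GET", "https://example.com/healthz", nil) // placeholder URL
		if err != nil {
			panic(err)
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
}
```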
I see now, so the connection is established using HTTP/2; I was assuming HTTP/1.1.
}

func (p HttpPinger) Ping() error {
	_, err := p.httpClient.Get(fmt.Sprintf("%s://%s/healthz", p.scheme, p.hostPort))
- use net.URL to build URLs, not Sprintf
- what happens if /healthz does not exist or is forbidden?
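A sketch of the first suggestion, building the ping URL with the `net/url` package instead of `fmt.Sprintf`. The field names follow the `HttpPinger` shown in the diff above; the struct definition and error handling are assumptions:

```go
package keepalivehttp

import (
	"net/http"
	"net/url"
)

// HttpPinger mirrors the type used in the diff above (assumed field names).
type HttpPinger struct {
	httpClient *http.Client
	scheme     string
	hostPort   string
}

// Ping builds the URL with net/url instead of fmt.Sprintf, so scheme, host
// and path are combined (and escaped) correctly, and closes the response body.
func (p HttpPinger) Ping() error {
	u := url.URL{Scheme: p.scheme, Host: p.hostPort, Path: "/healthz"}
	resp, err := p.httpClient.Get(u.String())
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```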
Can we extend this for systems which use paths other than /healthz (e.g. etcd uses /health)?
I will take this into account, thank you @cheftako.
}

func (p RESTClientPinger) Ping() error {
	return p.restClient.Get().RequestURI("/healthz").Do().Error()
what happens if /healthz does not exist or is forbidden?
I know it sounds bad, but at that point it doesn't matter: we already generated traffic. However, if there is a better endpoint we can use, I'm happy to change it.

This keep-alive strategy only ensures that some TCP traffic happens, so external actors won't close the connection when they don't see activity. It is not a goal of this PR to impact the current timeout logic of the existing commands.
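If the "errors don't matter" behavior were made explicit in code, one hedged option would be the following, reusing the `RESTClientPinger` from the diff above (the struct definition and the specific error checks are assumptions, not part of the PR; `Do()` without a context matches the client-go API of the era):

```go
package keepalive

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	restclient "k8s.io/client-go/rest"
)

// RESTClientPinger mirrors the type in the diff above (assumed field name).
type RESTClientPinger struct {
	restClient restclient.Interface
}

// Ping ignores "expected" API errors: a Forbidden, Unauthorized or NotFound
// response still means a full request/response round trip happened, which is
// all the keepalive needs. Only other failures are reported.
func (p RESTClientPinger) Ping() error {
	err := p.restClient.Get().RequestURI("/healthz").Do().Error()
	if err == nil || apierrors.IsForbidden(err) || apierrors.IsUnauthorized(err) || apierrors.IsNotFound(err) {
		return nil
	}
	return err
}
```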
By the way, the reproducer that I have for checking manually is an haproxy setup between the client and the apiserver with a 1 second inactivity timeout: any of the mentioned commands will exit if there is no activity for 1 second. To check the PR manually, one can run those commands through that proxy with and without the keepalive.
/assign @cheftako
@ereslibre: The following tests failed, say `/retest` to rerun all failed tests.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cc
Have been experimenting with […]. However, for some reason it does not allow the […].

As a user, I would like to know that a command I triggered won't be closed by some intermediate component; however, it's also true that not everybody in every company will know all the involved timeouts everywhere, and this solution could end up in an "I will just do […]" situation.

Do you think it makes sense to continue working on this PR? I would like to know if you think it provides some value.
I don't see anything wrong with this; it's just the same solution that OpenSSH implemented to solve this problem.
I think it does provide value. However, there are certain limitations in the underlying protocols, like HTTP/1, that make it too complex to implement. Is it possible to release this feature without HTTP/1 support, just adding a warning or documenting this limitation?
I will continue working on this under this assumption: only HTTP/2 will be supported; if HTTP/1.1 is the transport, then this keepalive feature will not run. However, I see that […]. I'm going to continue working on this and will report back if I get something doable out of it. Thank you for all the feedback.
cc
Update: while starting to work on this again I noticed this issue: golang/go#31643 (and https://go-review.googlesource.com/c/net/+/173952/). Go will probably include automatically pinging the server end for the http/2 transport. We still have the spdy part of this PR, and I have to check […]. I will push an updated version of this PR soon, hopefully something really reviewable; sorry for the delay.
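For context, that Go work later surfaced in `golang.org/x/net/http2` as the `Transport.ReadIdleTimeout` and `PingTimeout` fields (to the best of my knowledge); a sketch assuming that API, with example values:

```go
package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// If no frame is read for ReadIdleTimeout, the client sends an HTTP/2
	// PING and closes the connection if no reply arrives within PingTimeout.
	// This achieves natively what the PR's --ping-interval aimed for.
	transport := &http2.Transport{
		ReadIdleTimeout: 30 * time.Second, // example values only
		PingTimeout:     15 * time.Second,
	}
	client := &http.Client{Transport: transport}
	_ = client // use this client for long-lived streaming requests over HTTPS
}
```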
/sub
/assign
Thanks to everyone involved in reviewing this. Closing, because golang/net#55 (or some other similar follow-up PR) will handle this automatically. If anything, and if the decided Go API allows it, we could expose the ping interval to the user with `--ping-interval`. Thank you again, closing.
thanks again for your work on this!
What type of PR is this?
/kind feature
What this PR does / why we need it:
Certain operations create a stream that can stay open for a long period of time with no activity at all. This happens especially with certain commands:

- `kubectl exec -it`
- `kubectl logs -f`
- `kubectl attach`
- `kubectl get -w`

While these commands are running, if some L4 component sits between the client and the apiserver, it can decide to close the connection after a configured timeout because no activity happens at the TCP level.

This patch introduces changes in the following components:

- `client-go`: add a `util/keepalive` package that can be used for this purpose. It extends the `ResponseWrapper` interface with a `StreamWithPing` method, so users of the `Stream` function are not affected and its behavior stays the same.
- `kubectl`: `exec`, `logs`, `get` and `attach` get a `--ping-interval` argument that can be used to force an interval at which to ping the other end of the connection, so L4 components in between see traffic and don't close the connection. Since this timeout depends a lot on each use case and organization, leaving it as a `kubectl` argument seems the best option, so the user can pick the interval they want for a specific run.
- `apimachinery`: extend the `httpstream` connection with a `Ping()` function, implemented in the `spdy` backend as a native ping frame already provided by Docker's `spdystream` project.

Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/cc @kubernetes/sig-cli-pr-reviews
/cc @kubernetes/sig-api-machinery-pr-reviews
/cc @kubernetes/client-go-maintainers
/cc @aojea
/cc @BenTheElder