
[WIP] Keep connection alive during streaming operations #81179

Closed

Conversation

@ereslibre (Member) commented Aug 8, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
Certain operations create a stream that can stay open for a long
period of time with no activity at all. This happens especially with
certain commands:

  • kubectl exec -it
  • kubectl logs -f
  • kubectl attach
  • kubectl get -w

While these commands are running, an L4 component sitting between
the client and the apiserver can decide to close the connection
after a configured timeout, because no activity happens at the TCP level.

This patch introduces changes in the following components:

  • client-go: Add a util/keepalive package that can be used for this
    purpose. It extends the ResponseWrapper interface with a
    StreamWithPing method, so users of the Stream function are not
    affected and its behavior stays the same (see the sketch after
    this list).

  • kubectl: exec, logs, get and attach gain a
    --ping-interval argument that sets an interval at which to
    ping the other end of the connection, so L4 components in between
    will see traffic and won't close the connection. Since this timeout
    depends a lot on each use case and organization, leaving it as an
    argument to kubectl seems the best option, so the user can pick
    the interval they want for a specific run.

  • apimachinery: extend the httpstream connection with a Ping()
    function, implemented in the spdy backend as the native ping
    frame already provided by Docker's spdystream project.
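
For readers unfamiliar with the approach, here is a minimal, hypothetical sketch of the keepalive loop that a package like `util/keepalive` could be built around. The names mirror the `KeepAliveWithPinger` and `Pinger` usage quoted in the review snippets further down; the actual PR code may differ.

```go
// Hypothetical sketch, not the PR's code: a Pinger abstracts "generate some
// traffic", and KeepAliveWithPinger pings on a ticker until ctx is cancelled.
package keepalive

import (
	"context"
	"time"
)

// Pinger generates some traffic towards the other end of a connection.
type Pinger interface {
	Ping() error
}

// KeepAliveWithPinger calls pinger.Ping() every interval until ctx is done.
func KeepAliveWithPinger(ctx context.Context, pinger Pinger, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				// Errors are ignored: the only goal is to produce TCP-level
				// activity so intermediate L4 components keep the connection.
				_ = pinger.Ping()
			}
		}
	}()
}
```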

Does this PR introduce a user-facing change?:

kubectl: `get`, `exec`, `logs` and `attach` gain a new argument `--ping-interval` that uses the underlying connection to generate pings, so network components between the client and the apiserver see activity at the TCP level.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

To be described

/cc @kubernetes/sig-cli-pr-reviews
/cc @kubernetes/sig-api-machinery-pr-reviews
/cc @kubernetes/client-go-maintainers

/cc @aojea
/cc @BenTheElder

@k8s-ci-robot (Contributor) commented Aug 8, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ereslibre
To complete the pull request process, please assign deads2k
You can assign the PR to them by writing /assign @deads2k in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ereslibre ereslibre changed the title Keep connection alive during streaming operations [WIP] Keep connection alive during streaming operations Aug 8, 2019
@ereslibre (Member, Author) commented Aug 8, 2019

Adding WIP because tests need some tweaking and there are some TODO comments here and there, but I'm interested in general feedback on the approach. Thank you in advance.

@@ -687,7 +691,15 @@ func (o *GetOptions) watch(f cmdutil.Factory, cmd *cobra.Command, args []string)
defer cancel()
intr := interrupt.New(nil, cancel)
intr.Run(func() error {
_, err := watchtools.UntilWithoutRetry(ctx, w, func(e watch.Event) (bool, error) {

@liggitt (Member) commented Aug 8, 2019

switching this to Until or UntilWithSync would handle connection interruptions, and is preferable to depending on a kept-alive connection. See #74717

@ereslibre (Member, Author) commented Aug 8, 2019

I will sync with that PR for the get case, thank you. Thoughts on the general approach of the PR?

@liggitt (Member) left a comment

long-lived connections (exec/attach/logs) have inherent timeouts built into them. Is the expectation that these connections could be kept open indefinitely using this keep-alive?

if client == nil {
	client = http.DefaultClient
}
keepalive.KeepAliveWithPinger(ctx, keepalivehttp.NewHttpPinger(client.(*http.Client), r.URL().Scheme, r.URL().Host), pingInterval)

@liggitt (Member) commented Aug 8, 2019

a cast to *http.Client can panic
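
For illustration only, a comma-ok type assertion avoids the panic; this is a hedged fragment in the context of the snippet above, not the PR's actual fix:

```go
// Sketch: the two-value type assertion reports failure instead of panicking.
httpClient, ok := client.(*http.Client)
if !ok {
	return fmt.Errorf("expected *http.Client, got %T", client)
}
keepalive.KeepAliveWithPinger(ctx, keepalivehttp.NewHttpPinger(httpClient, r.URL().Scheme, r.URL().Host), pingInterval)
```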

@liggitt (Member) commented Aug 8, 2019

I wouldn't expect making another request using the same client to necessarily keep alive an existing streaming connection

@ereslibre (Member, Author) commented Aug 8, 2019

> I wouldn't expect making another request using the same client to necessarily keep alive an existing streaming connection

I will investigate this further. In my environment for reproducing the L4 issue I can see that the fix works, but I might be relying on behavior specific to it.

@liggitt (Member) commented Aug 8, 2019

it might work for an http/2 connection, which multiplexes requests over a single connection, but that doesn't seem good to rely on

@ereslibre (Member, Author) commented Aug 8, 2019

Yes, on the spdy part of the patch I'm confident, because we are sending a native ping frame over the multiplexed connection.

The other two methods are what I am not so sure about; there I might be extrapolating from behavior specific to my setup.
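
For context, a minimal sketch of what pinging an SPDY connection on an interval can look like, assuming docker/spdystream's `Connection.Ping` (which sends a native ping frame and returns the round-trip time); this is illustrative, not the PR's implementation, which routes the call through apimachinery's `httpstream` interface:

```go
// Illustrative sketch: send native SPDY ping frames on a ticker so the
// multiplexed connection shows TCP activity. Assumes docker/spdystream's
// Connection.Ping; the PR exposes this via httpstream's Ping() instead.
package spdykeepalive

import (
	"time"

	"github.com/docker/spdystream"
)

func keepAlive(conn *spdystream.Connection, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if _, err := conn.Ping(); err != nil {
				// The connection is probably gone; stop pinging.
				return
			}
		}
	}
}
```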

@ereslibre (Member, Author) commented Aug 9, 2019

After some investigation I have seen that the TCP socket is always reused, if within the same process, of course. Go implements HTTP/1.1 pipelining, and this is what allows the TCP socket to be reused even when one of the requests is still streaming (however, I don't fully understand how this can work: keeping the stream open on one request while performing other requests).

I have used different methods to confirm this: the haproxy stats socket, which always showed a single connection per process; netstat on my laptop, which again showed only one port per process; and a slight variation of the PR with Go's client tracing to check which connections were retrieved from the pool (they were marked as reused every time, with the same connection object address throughout).

All this being said, I will investigate more and fully confirm whether this is the case, but expert feedback confirming it would be awesome.
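
For reference, the client tracing mentioned above can be done with net/http/httptrace; a small self-contained sketch (the target URL is just a placeholder) that reports whether a pooled connection was reused:

```go
// Sketch: log whether each request reuses a pooled TCP connection, using the
// standard net/http/httptrace hooks. The URL below is a placeholder.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
)

func main() {
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("conn=%v reused=%v wasIdle=%v\n",
				info.Conn.RemoteAddr(), info.Reused, info.WasIdle)
		},
	}
	req, err := http.NewRequest("GET", "https://example.com/healthz", nil)
	if err != nil {
		panic(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	resp.Body.Close()
}
```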

@ereslibre (Member, Author) commented Aug 9, 2019

I see now: the connection is established using HTTP/2; I was assuming HTTP/1.1.

}

func (p HttpPinger) Ping() error {
	_, err := p.httpClient.Get(fmt.Sprintf("%s://%s/healthz", p.scheme, p.hostPort))

@liggitt (Member) commented Aug 8, 2019

  • use net/url to build URLs, not Sprintf (see the sketch below this list)
  • what happens if /healthz does not exist or is forbidden?
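
To illustrate the first point, a hedged fragment building the ping URL with net/url instead of Sprintf, reusing the field names from the HttpPinger snippet above:

```go
// Sketch: build the URL with net/url rather than string formatting.
pingURL := url.URL{
	Scheme: p.scheme,
	Host:   p.hostPort,
	Path:   "/healthz",
}
_, err := p.httpClient.Get(pingURL.String())
```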

@cheftako (Member) commented Sep 25, 2019

Can we extend this for systems which use paths other than /healthz (e.g. etcd uses /health)?

@ereslibre (Member, Author) commented Sep 25, 2019

I will take this into account, thank you @cheftako.

}

func (p RESTClientPinger) Ping() error {
	return p.restClient.Get().RequestURI("/healthz").Do().Error()

@liggitt (Member) commented Aug 8, 2019

what happens if /healthz does not exist or is forbidden?

@ereslibre (Member, Author) commented Aug 8, 2019

I know it sounds bad, but at that point it doesn't matter: we already generated traffic. However, if there is a better endpoint we can use, I'm happy to change it.

@ereslibre (Member, Author) commented Aug 8, 2019

> long-lived connections (exec/attach/logs) have inherent timeouts built into them. Is the expectation that these connections could be kept open indefinitely using this keep-alive?

This keep-alive strategy only ensures that some TCP traffic happens, so external actors won't close the connection when they see no activity. It is not a goal of this PR to affect the current timeout logic of the existing commands.

@ereslibre (Member, Author) commented Aug 8, 2019

By the way, the reproducer that I have for checking manually is kind with this patch on top: kubernetes-sigs/kind@master...ereslibre:low-timeout, creating an HA cluster with:

```yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
```

With that patch, any of the mentioned commands will exit if there is no activity for 1 second. To check the PR manually, one can run `kubectl <command> --ping-interval 500ms` and see that the connection is kept open.

@fedebongio (Contributor) commented Aug 8, 2019

/assign @cheftako

@k8s-ci-robot (Contributor) commented Aug 8, 2019

@ereslibre: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-dependencies | 9d3894c | link | /test pull-kubernetes-dependencies |
| pull-kubernetes-bazel-test | 9d3894c | link | /test pull-kubernetes-bazel-test |
| pull-kubernetes-typecheck | 9d3894c | link | /test pull-kubernetes-typecheck |
| pull-kubernetes-verify | 9d3894c | link | /test pull-kubernetes-verify |
| pull-kubernetes-kubemark-e2e-gce-big | 9d3894c | link | /test pull-kubernetes-kubemark-e2e-gce-big |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@aojea (Member) commented Aug 8, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from aojea Aug 8, 2019
@ereslibre (Member, Author) commented Aug 10, 2019

I have been experimenting with haproxy, and it's possible to get something very similar to what this PR intends. Configuring the tunnel, client-fin and server-fin timeouts to reasonable values allows inactive long-running connections to be kept alive.

However, for some reason it does not allow the tunnel timeout to be set to 0 (to virtually never kill the connection); I don't know why.

As a user I would like to know that a command I triggered won't be closed by some intermediate component; however, it's also true that not everybody at every company will know all the timeouts involved everywhere, and this solution could end up in an "I will just do --ping-interval 1m everywhere" kind of situation.

Do you think it makes sense to continue working on this PR? I would like to know if you think it provides some value.

@aojea (Member) commented Aug 11, 2019

> As a user I would like to know that a command I triggered won't be closed by some intermediate component; however, it's also true that not everybody at every company will know all the timeouts involved everywhere, and this solution could end up in an "I will just do --ping-interval 1m everywhere" kind of situation.

I don't see anything wrong with this; it's the same solution that OpenSSH implemented to solve this problem (https://patrickmn.com/aside/how-to-keep-alive-ssh-sessions/). It's up to the user whether to use it, and it seems harmless.

> Do you think it makes sense to continue working on this PR? I would like to know if you think it provides some value.

I think it does provide value.

However, there are certain limitations in underlying protocols like http/1 that make it too complex to implement there. Is it possible to release this feature without http/1 support, just adding a warning or documenting this limitation?

@ereslibre (Member, Author) commented Aug 14, 2019

> Is it possible to release this feature without http/1 support, just adding a warning or documenting this limitation?

I will continue working on this under that assumption: only HTTP/2 will be supported; if HTTP/1.1 is the transport, this keepalive feature will not run. However, I see that http.Client and http.Transport hide very well which real transport is being used (if it were http2.Transport I could get the ClientConn from the pool and Ping it right away, but http.Transport hides the details inside h2_bundle.go).

I'm going to continue working on this, will report back if I get something doable out of it. Thank you for all the feedback.

@BenTheElder (Member) commented Aug 14, 2019

cc

@ereslibre (Member, Author) commented Sep 23, 2019

Update: while starting to work on this again I noticed this issue: golang/go#31643 (and https://go-review.googlesource.com/c/net/+/173952/). Go will probably include automatic pinging of the server end for the http/2 transport.

We still have the spdy part of this PR, and I have to check if port-forward is affected as well by this issue (most likely).

I will push soon an updated version of this PR, hopefully something really reviewable, sorry for the delay.
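
For reference, the Go change linked above adds an http/2 transport health check; a hedged sketch of what enabling it looks like with golang.org/x/net/http2 once the ReadIdleTimeout/PingTimeout fields are available (versions of the package predating that change will not have them):

```go
// Hedged sketch: enable the http/2 transport health check proposed in the Go
// change referenced above. ReadIdleTimeout/PingTimeout are the field names
// that landed in golang.org/x/net/http2; older versions lack them.
package pingclient

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newPingingClient() *http.Client {
	t := &http2.Transport{
		// Send an http/2 PING frame if the connection has been idle this long.
		ReadIdleTimeout: 30 * time.Second,
		// Close the connection if the PING is not answered within this window.
		PingTimeout: 15 * time.Second,
	}
	return &http.Client{Transport: t}
}
```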

@caesarxuchao (Member) commented Sep 23, 2019

/sub

@caesarxuchao (Member) commented Sep 23, 2019

/assign

@ereslibre (Member, Author) commented Oct 3, 2019

Thanks to everyone involved in reviewing this. Closing because golang/net#55 (or some other similar follow-up PR) will handle this automatically.

If anything, and if the decided Go API allows it, we could expose the ping interval to the user via client-go and/or kubectl, but that doesn't justify keeping this PR open.

Thank you again, closing.

@ereslibre ereslibre closed this Oct 3, 2019
@ereslibre ereslibre deleted the ereslibre:ping-on-long-running-commands branch Oct 3, 2019
@BenTheElder (Member) commented Oct 3, 2019

thanks again for your work on this!
