
exec over web sockets protocol is flawed #89899

Open
admilazz opened this issue Apr 6, 2020 · 38 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@admilazz

admilazz commented Apr 6, 2020

What happened:
I tried to use exec over web sockets (as implemented in the C# Kubernetes client), but the protocol does not seem to have a way to communicate when STDIN is closed. This causes commands to run forever if they wait for STDIN to be closed, and is especially bad when the command won't send its output until then.

What you expected to happen:
I expect the exec protocol to have a mechanism to communicate when STDIN is closed, since this is important for the execution of many commands. The SPDY implementation has such a mechanism. (IIRC, a serious bug in SPDY exec v1 - documented in v1.go - was a similar lack of detection for closing STDIN.)

How to reproduce it (as minimally and precisely as possible):
Exec /bin/cat (without arguments) over web sockets. Send some data and try to close STDIN while receiving all of the output.
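
For reference, a minimal sketch of a reproduction client in Go (the report above used the C# client; gorilla/websocket, the URL, and the token handling here are illustrative assumptions). Under the channel.k8s.io subprotocols each binary message is prefixed with a channel byte (0 = stdin, 1 = stdout, 2 = stderr), and there is no frame that means "stdin is closed":

package main

import (
      "fmt"
      "net/http"

      "github.com/gorilla/websocket"
)

func main() {
      // Hypothetical exec URL; stdin and stdout must be requested in the query.
      url := "wss://<apiserver>/api/v1/namespaces/default/pods/mypod/exec" +
            "?command=%2Fbin%2Fcat&stdin=true&stdout=true"
      header := http.Header{"Authorization": {"Bearer <token>"}}

      dialer := websocket.Dialer{Subprotocols: []string{"v4.channel.k8s.io"}}
      conn, _, err := dialer.Dial(url, header)
      if err != nil {
            panic(err)
      }
      defer conn.Close()

      // Send some stdin: channel byte 0 followed by the payload.
      frame := append([]byte{0}, []byte("hello\n")...)
      if err := conn.WriteMessage(websocket.BinaryMessage, frame); err != nil {
            panic(err)
      }

      // There is nothing the client can send here to mean "stdin is done", so
      // /bin/cat never sees EOF and this read blocks until the socket is torn down.
      _, msg, err := conn.ReadMessage()
      fmt.Printf("%q %v\n", msg, err)
}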

Anything else we need to know?:
You can force the command to terminate by half-closing the web socket connection, but this is not a solution because Kubernetes interprets it as "close the connection ASAP" rather than "finish running the command, send all the output, and close the connection". This may cause the output to be truncated.

The core web socket handler code seems to be this:

// handle implements a websocket handler.
func (conn *Conn) handle(ws *websocket.Conn) {
      defer conn.Close()
      conn.initialize(ws)

      for {
            conn.resetTimeout()
            var data []byte
            if err := websocket.Message.Receive(ws, &data); err != nil {
                  if err != io.EOF {
                        klog.Errorf("Error on socket receive: %v", err)
                  }
                  break
            }
            if len(data) == 0 {
                  continue
            }
            channel := data[0]
            if conn.codec == base64Codec {
                  channel = channel - '0'
            }
            data = data[1:]
            if int(channel) >= len(conn.channels) {
                  klog.V(6).Infof("Frame is targeted for a reader %d that is not valid, possible protocol error", channel)
                  continue
            }
            if _, err := conn.channels[channel].DataFromSocket(data); err != nil {
                  klog.Errorf("Unable to write frame to %d: %v\n%s", channel, err, string(data))
                  continue
            }
      }
}

As you can see, it skips zero-byte messages, takes the channel from the first byte, and sends the remaining data to the channel. There is no provision for closing STDIN, nor does there seem to be anything in the implementation of DataFromSocket that would do it, as far as I can see.

My suggestions on how to fix it are:

  1. Use the web sockets "finish" flag to mark the end of STDIN. The client would set "finish" to true on the last STDIN frame. This is more or less what the finish flag was designed for.
  2. Make half-closing the web socket equivalent to closing STDIN. Kubernetes should finish executing the command normally, send all the output, and then close its side of the connection. Kubernetes can reserve the "abort the command" behavior for the client performing a full close of the connection.
  3. Or, interpret an empty web sockets message as an end-of-stream marker for STDIN. Since empty messages are not allowed - all messages are supposed to be prefixed with the channel ID - this shouldn't be a breaking change in protocol.

The first seems quite proper, but unfortunately, the websocket library Kubernetes uses doesn't seem to expose the flag. The second also seems fairly proper, but may also have difficulties in implementation. The third seems easy to implement.
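
For illustration, here is a self-contained sketch of suggestion 3, i.e. treating an empty websocket message as an explicit end-of-stdin marker. This is not the apiserver code; stdinWriter stands in for whatever pipe feeds the exec'd process, and the frame parsing just mirrors the handler above:

package main

import (
      "fmt"
      "io"
)

// handleFrames consumes raw websocket messages and feeds channel 0 into stdin.
func handleFrames(frames [][]byte, stdinWriter *io.PipeWriter) {
      for _, data := range frames {
            if len(data) == 0 {
                  // An empty message carries no channel byte today, so it could be
                  // repurposed as "stdin is closed" without breaking existing clients.
                  stdinWriter.Close()
                  continue
            }
            channel, payload := data[0], data[1:]
            if channel == 0 { // 0 = stdin in the channel.k8s.io subprotocols
                  stdinWriter.Write(payload)
            }
      }
}

func main() {
      r, w := io.Pipe()
      go handleFrames([][]byte{
            append([]byte{0}, []byte("hello\n")...), // stdin data
            {},                                      // empty message = stdin EOF
      }, w)
      out, _ := io.ReadAll(r) // returns only once the writer is closed (EOF)
      fmt.Printf("%q\n", out)
}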

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.2", GitCommit:"66049e3b21efe110454d67df4fa62b08ea79a19b", GitTreeState:"clean", BuildDate:"2019-05-16T16:14:56Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: Azure VM

  • OS (e.g: cat /etc/os-release):
    NAME="Ubuntu"
    VERSION="16.04.6 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.6 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial

  • Kernel (e.g. uname -a): Linux AdamArisDev 4.15.0-1063-azure #68-Ubuntu SMP Fri Nov 8 09:30:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Others:
    My cluster is running in minikube v1.1.0.

/sig api-machinery

@admilazz admilazz added the kind/bug Categorizes issue or PR as related to a bug. label Apr 6, 2020
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Apr 6, 2020
@liggitt
Member

liggitt commented Apr 6, 2020

hmm... I know of a websocket-based terminal that uses exec... it seems to be initializing stdin (see https://github.com/kubernetes-ui/container-terminal/blob/master/container-terminal.js#L172)

cc @smarterclayton @spadgett are you aware of this issue?

@admilazz
Author

admilazz commented Apr 6, 2020

I think a terminal would work okay for two reasons. First, when you type "exit" in a terminal, it terminates its own process. Second, the "disconnect" command in that terminal code simply closes the web socket; presumably the user has received all desired output when he chooses to disconnect. It's not like running /bin/cat or another program which won't terminate until STDIN is closed and for which half-closing the websocket may cause truncation of important data.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 6, 2020
@admilazz
Author

admilazz commented Jul 6, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 6, 2020
orenbm added a commit to cyberark/conjur that referenced this issue Sep 30, 2020
There is [an issue](kubernetes/kubernetes#89899)
in k8s where messages that are sent via WebSockets to Kubectl Exec are
left hanging without returning a response when using STDIN.

However, if we inject the cert by running a bash script that exits explicitly
then we don't have the hanging issue, and the K8s API will close the socket
upon cert injection success/failure.
orenbm added a commit to cyberark/conjur that referenced this issue Sep 30, 2020
* Inject cert using bash script

There is [an issue](kubernetes/kubernetes#89899)
in k8s where messages that are sent via WebSockets to Kubectl Exec are
left hanging without returning a response when using STDIN.

However, if we inject the cert by running a bash script that exits explicitly
then we don't have the hanging issue, and the K8s API will close the socket
upon cert injection success/failure.

* Make exec command timeout configurable

We now enable configuration of the exec command timeout. This
can be done by setting the `KUBECTL_EXEC_COMMAND_TIMEOUT` env var.

* Return 202 instead of 200 for inject_client_cert requests

The "inject_client_cert" request now returns 202 Accepted instead of 200 OK to
indicate that the cert injection has started but not necessarily completed.

* Print cert injection logs upon error

The "inject_client_cert" request now writes the injection logs
to the client container. We can print them when we have an error to
help troubleshooting

* Verify "inject_client_cert" responds with 202

We should verify that the response code of the request
is 202 Accepted in successful cases

* Extract file copy logic to new command class

This logic was inside kubectl_exec.rb which had a lot going
on already. This commit will enhance this logic's readability

* Refactor kubectl_client to kube_client

The latter better tells its purpose, as it is a k8s client
and not a Kubectl client

* Remove redundant -r flag

* Add description and fix indentation

* Refactor SetFileContentInContainer to CopyTextToFileInContainer

* Add UTs for CopyTextToFileInContainer class

* Implement some PR requests
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@Ferenc-

Ferenc- commented Dec 14, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@uzuku
Contributor

uzuku commented Feb 6, 2023

I just ran into an issue that may be related to this while testing kubectl cp within a virtual-kubelet implementation, where the virtual-kubelet's RunInContainer (exec) method is implemented over the websocket protocol, communicating with the backend kube-apiserver.

After sending all the data to the stdin stream and closing it from the client side, the tar process in the container doesn't receive the EOF of the tar file and hangs for a while. EOF handling for stdin appears to be missing over websockets.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2023
@danopia

danopia commented May 7, 2023

This continues to be relevant - currently no way to send EOF over WebSocket exec/attach.

I've ended up modernizing a SPDY client library just to work around this Kubernetes limitation.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2023
@aojea
Member

aojea commented May 7, 2023

This continues to be relevant - currently no way to send EOF over WebSocket exec/attach.

yeah, the goal is to fix this and try to move to websockets by default in kubectl #116778

@aojea
Member

aojea commented Jul 10, 2023

it seems the gorilla implementation has a CloseHandler https://github.com/gorilla/websocket/blob/9111bb834a68b893cebbbaed5060bdbc1d9ab7d2/conn.go#L1135 and we can use the GoingAway or NormalClosure status codes to signal this https://github.com/gorilla/websocket/blob/9111bb834a68b893cebbbaed5060bdbc1d9ab7d2/conn.go#L47
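
A rough sketch of that idea with gorilla/websocket (not what the apiserver does today; closeStdin is a hypothetical hook into the exec machinery): register a CloseHandler and treat a NormalClosure/GoingAway close frame as "no more stdin" instead of tearing the streams down immediately.

package sketch

import (
      "time"

      "github.com/gorilla/websocket"
)

// installStdinCloseHandler wires an incoming close frame to a stdin-close callback.
func installStdinCloseHandler(conn *websocket.Conn, closeStdin func()) {
      conn.SetCloseHandler(func(code int, text string) error {
            if code == websocket.CloseNormalClosure || code == websocket.CloseGoingAway {
                  closeStdin() // let the command finish, flush output, then close
            }
            // Echo the close frame back, as RFC 6455 expects.
            msg := websocket.FormatCloseMessage(code, "")
            return conn.WriteControl(websocket.CloseMessage, msg, time.Now().Add(time.Second))
      })
}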

@AlexMAS

AlexMAS commented Oct 18, 2023

Are there any updates on this issue? :)

I'm trying to copy a file and sometimes it fails due to this issue. At the moment I have to make several retries, sometimes more than 3.

Does anybody have a workaround for this? Could you share it, please? :)

@aojea
Member

aojea commented Oct 18, 2023

this is going to be solved as part of this KEP kubernetes/enhancements#4016

@hauntsaninja

Exciting times:

// The fifth version of the Websocket RemoteCommand subprotocol adds a CLOSE signal,
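
For anyone following along, a hedged sketch of what that CLOSE signal looks like on the wire, going by the KEP: under the v5 subprotocol the client sends a binary message on a dedicated close channel whose single-byte payload names the stream being half-closed (stdin = 0). The exact channel value used here (255) is my reading of the design, not something quoted in this thread:

package sketch

import "github.com/gorilla/websocket"

const (
      stdinChannel = 0
      closeChannel = 255 // assumed value of the v5 close-signal channel
)

// sendStdinEOF tells the server the client is done writing stdin while keeping
// the connection open so stdout/stderr can still drain.
func sendStdinEOF(conn *websocket.Conn) error {
      return conn.WriteMessage(websocket.BinaryMessage, []byte{closeChannel, stdinChannel})
}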

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2024
@Ferenc-

Ferenc- commented Mar 22, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2024
@knight42
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2024