
x/crypto/ssh: semantics around running session.Wait() after calling session.Run(), when EOF messages are sent #32453

Open
ianvernon opened this issue Jun 5, 2019 · 3 comments

@ianvernon

commented Jun 5, 2019

What version of Go are you using (go version)?

$ go version
1.12.5

Does this issue reproduce with the latest release?

I am using a vendored version of the library with commit e84da0312774c21d64ee2317962ef669b27ffb41

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/vagrant/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/vagrant/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build896755704=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I will work on providing an exact reproducer of this via a small Go program, but I do not have one on hand immediately; I can provide it if you like. My question is more about how to properly use the SSH library, as I feel I am misunderstanding how it works.

I am wondering about the semantics behind running session.Wait() after calling session.Run().

It appears that session.Run() calls session.Start() followed by session.Wait().

I've found that if I call session.Wait() again after calling session.Run(), my code hangs indefinitely. Is this because there is nothing left to wait on after session.Run() returns: the remote command has already exited, so session.Wait() blocks waiting for a message that will never be sent?
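For intuition, the hang can be modeled with a toy program (this is a simplification for illustration, not the library's actual internals): the exit status of the remote command is delivered exactly once, Run (which is Start followed by Wait) consumes it, and a second Wait would block on a receive that can never fire.

```go
package main

import "fmt"

func main() {
	// Toy model: the remote command's exit status arrives exactly once.
	exitStatus := make(chan int, 1)
	exitStatus <- 0 // the remote command finished

	<-exitStatus // Run (= Start + Wait) consumes the status
	fmt.Println("Run returned")

	// A second Wait would receive on the now-empty channel forever;
	// the select's default branch shows it would never proceed.
	select {
	case s := <-exitStatus:
		fmt.Println("second Wait got", s)
	default:
		fmt.Println("second Wait would block forever")
	}
}
```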

I am also confused about how sessions are closed. Running with debugMux set to true locally, I observe that my session is sent a channelEOFMsg after I have called session.Run(), but before I call session.Close(). When I try to close the session, I get the error failed to close session: EOF. I would expect that after calling Run, I'd be able to close the session gracefully. Who is sending these EOFs? Is this the expected behavior for running a command on a session, i.e. the session gets an EOF when the command is done running?

Output from running with debugMux=true:

--------------> session.Run() <----------------
2019/06/05 08:17:03 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2019/06/05 08:17:03 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
2019/06/05 08:17:03 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x56, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x67, 0x65, 0x74, 0x20, 0x70, 0x6f, 0x64, 0x73, 0x20, 0x2d, 0x2d, 0x61, 0x6c, 0x6c, 0x2d, 0x6e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x69, 0x74, 0x65, 0x6d, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x6d, 0x65, 0x74, 0x61, 0x64, 0x61, 0x74, 0x61, 0x2e, 0x64, 0x65, 0x6c, 0x65, 0x74, 0x69, 0x6f, 0x6e, 0x54, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61, 0x6d, 0x70, 0x7d, 0x27}}
2019/06/05 08:17:03 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes
2019/06/05 08:17:03 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes

-------------> where is this EOF coming from? <---------------
2019/06/05 08:17:03 send(2): ssh.channelEOFMsg{PeersID:0x0}

-----------> close session here <-----------------
2019/06/05 08:17:03 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/06/05 08:17:03 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:17:03 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:17:03 send(2): ssh.channelCloseMsg{PeersID:0x0}
2019/06/05 08:17:03 send(2): ssh.channelCloseMsg{PeersID:0x0}

-------------> log of error from closing session <----------------
failed to close session: EOF

What did you expect to see?

I would expect not to get an EOF after calling Run() on a session, and for Wait() to be idempotent, i.e. to return immediately if there is nothing to wait on. I would also expect that EOF would not be sent by Run(), and that I would not get an error closing a session when nothing else has tried to close it.

What did you see instead?

See above.

I will work on a small reproducer locally, but I hope the above is enough context and information for me to get some answers, as I suspect this is more a matter of me not understanding how the library is supposed to be used.

@gopherbot gopherbot added this to the Unreleased milestone Jun 5, 2019

@ianvernon ianvernon changed the title x/crypto/ssh: semantics around running session.Wait() after calling session.Run() x/crypto/ssh: semantics around running session.Wait() after calling session.Run(), when EOF messages are sent Jun 5, 2019

ianvernon added a commit to cilium/cilium that referenced this issue Jun 5, 2019

test: do not spawn goroutines to wait for canceled context in `RunCommandContext`

While running `gops stack` against the Ginkgo test suite locally, we observed that we were leaking goroutines stuck waiting for SSH sessions to finish. We accrued over 1000 of these per K8s CI run:

```
16 unique stack traces
960 occurences. Sample stack trace:
github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh.(*Session).Wait(0xc00098e000, 0x230f79c, 0x1a)
        /Users/ianvernon/go/src/github.com/cilium/cilium/vendor/golang.org/x/crypto/ssh/session.go:403 +0x57
github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext.func1(0x2502600, 0xc000542280, 0xc00098e000, 0xc0001aa780, 0xc000268220)
        /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:262 +0x1cc
created by github.com/cilium/cilium/test/helpers.(*SSHClient).RunCommandContext
        /Users/ianvernon/go/src/github.com/cilium/cilium/test/helpers/ssh_command.go:253 +0x13c
```

This example shows that there were over 960 goroutines stuck on `session.Wait()`.

Whenever we run a command via SSH, we call `runCommand`, which calls `session.Run`; `session.Run` in turn calls `session.Start()` followed by `session.Wait()`. I observed that calling `Wait()` on a session on which `Run` has already been invoked will never return, even if we call `session.Close()` before invoking `session.Wait()`. This means our logic for killing the session when the context provided to `RunCommandContext` is canceled is flawed: waiting for the session to finish before closing it blocks forever.

I enabled debug mode for the SSH library we use (`golang.org/x/crypto/ssh`) and saw that the session receives an EOF message *before* we even try to close the session:

```
>------- session started, and session.Run() invoked

2019/06/05 08:16:59 send global(2): ssh.channelOpenMsg{ChanType:"session", PeersID:0x2, PeersWindow:0x200000, MaxPacketSize:0x8000, TypeSpecificData:[]uint8(nil)}
2019/06/05 08:16:59 decoding(2): 91 &ssh.channelOpenConfirmMsg{PeersID:0x2, MyID:0x0, MyWindow:0x0, MaxPacketSize:0x8000, TypeSpecificData:[]uint8{}} - 17 bytes
2019/06/05 08:16:59 send(2): ssh.channelRequestMsg{PeersID:0x0, Request:"exec", WantReply:true, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x6d, 0x6b, 0x75, 0x62, 0x65, 0x63, 0x74, 0x6c, 0x20, 0x65, 0x78, 0x65, 0x63, 0x20, 0x2d, 0x6e, 0x20, 0x6b, 0x75, 0x62, 0x65, 0x2d, 0x73, 0x79, 0x73, 0x74, 0x65, 0x6d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x2d, 0x62, 0x63, 0x63, 0x6d, 0x34, 0x20, 0x2d, 0x2d, 0x20, 0x63, 0x69, 0x6c, 0x69, 0x75, 0x6d, 0x20, 0x73, 0x74, 0x61, 0x74, 0x75, 0x73, 0x20, 0x2d, 0x6f, 0x20, 0x6a, 0x73, 0x6f, 0x6e, 0x70, 0x61, 0x74, 0x68, 0x3d, 0x27, 0x7b, 0x2e, 0x63, 0x6c, 0x75, 0x73, 0x74, 0x65, 0x72, 0x2e, 0x6e, 0x6f, 0x64, 0x65, 0x73, 0x5b, 0x2a, 0x5d, 0x2e, 0x70, 0x72, 0x69, 0x6d, 0x61, 0x72, 0x79, 0x2d, 0x61, 0x64, 0x64, 0x72, 0x65, 0x73, 0x73, 0x2e, 0x2a, 0x7d, 0x27}}
2019/06/05 08:16:59 decoding(2): 93 &ssh.windowAdjustMsg{PeersID:0x2, AdditionalBytes:0x200000} - 9 bytes
2019/06/05 08:16:59 decoding(2): 99 &ssh.channelRequestSuccessMsg{PeersID:0x2} - 5 bytes

>------- EOF sent on channel (not by us; we have not closed the session yet!)

2019/06/05 08:16:59 send(2): ssh.channelEOFMsg{PeersID:0x0}
2019/06/05 08:16:59 decoding(2): data packet - 181 bytes
2019/06/05 08:16:59 send(2): ssh.windowAdjustMsg{PeersID:0x0, AdditionalBytes:0xac}
2019/06/05 08:16:59 decoding(2): 96 &ssh.channelEOFMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 decoding(2): 98 &ssh.channelRequestMsg{PeersID:0x2, Request:"exit-status", WantReply:false, RequestSpecificData:[]uint8{0x0, 0x0, 0x0, 0x0}} - 25 bytes
2019/06/05 08:16:59 decoding(2): 97 &ssh.channelCloseMsg{PeersID:0x2} - 5 bytes
2019/06/05 08:16:59 send(2): ssh.channelCloseMsg{PeersID:0x0}

>------- we try to close the session, and receive the following error

failed to close session: EOF
```

It appears that we cannot close the session, since an EOF has already been sent
for it. I am not exactly sure where this comes from; I've posted an issue /
question in the GitHub repository for golang:
golang/go#32453 . Our attempts to send signals (SIGHUP
and SIGINT) are met by this same EOF error as well. There is no point in
waiting for the session to finish in this case, so just try to close it
and move on, and do not leak goroutines that will be stuck forever. When running
`gops` now against the Ginkgo test suite, we no longer accrue a ton of these
goroutines blocked on `session.Wait()`; the biggest number of occurrences for
"similar" goroutines is at most 2, as in the stack trace captured below:

```
$ ../contrib/scripts/consolidate_go_stacktrace.py stack9.out  | head -n 15
14 unique stack traces
2 occurences. Sample stack trace:
internal/poll.runtime_pollWait(0x3f14d50, 0x72, 0xc000a6cad0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.(*pollDesc).wait(0xc00043c218, 0x72, 0xffffffffffffff00, 0x24e7ac0, 0x3225738)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
...
```

Signed-off-by: Ian Vernon <ian@cilium.io>

ianvernon added a commit to cilium/cilium that referenced this issue Jun 5, 2019

test: do not spawn goroutines to wait for canceled context in `RunCommandContext`

ianvernon added a commit to cilium/cilium that referenced this issue Jun 5, 2019

test: do not spawn goroutines to wait for canceled context in `RunCommandContext`

borkmann added a commit to cilium/cilium that referenced this issue Jun 5, 2019

test: do not spawn goroutines to wait for canceled context in `RunCommandContext`

[ upstream commit cf14b6a ]

Signed-off-by: Ian Vernon <ian@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

tgraf added a commit to cilium/cilium that referenced this issue Jun 6, 2019

test: do not spawn goroutines to wait for canceled context in `RunCommandContext`

[ upstream commit cf14b6a ]

Signed-off-by: Ian Vernon <ian@cilium.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
@dmitshur (Member) commented Jun 10, 2019 (comment minimized)

@hanwen (Contributor) commented Jun 17, 2019

session.Run is syntactic sugar for Start + Wait. Don't call Wait if you use Run.

Re: EOF. EOF signifies the end of writing to a channel (stdout/stderr), so you can send it only once.

I am open to review changes to be more intelligent here, but the behavior per se is not a bug.

@ianvernon (Author) commented Jun 17, 2019

@hanwen thanks for the reply!

Regarding your comments about EOF: it looks like calling Run sends EOF, so once Run has returned we do not need to close the session, because Run implicitly finishes the channel as well?
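If Run has indeed already ended the channel, one option is to treat an EOF error from Close as a clean shutdown. A hedged sketch: `closeIgnoringEOF` is a hypothetical helper, and the assumption that Close surfaces `io.EOF` in this situation matches the `failed to close session: EOF` log above.

```go
package main

import (
	"fmt"
	"io"
)

// closeIgnoringEOF closes c and treats io.EOF as success, on the
// assumption that the channel was already ended by a completed Run.
func closeIgnoringEOF(c io.Closer) error {
	if err := c.Close(); err != nil && err != io.EOF {
		return err
	}
	return nil
}

// finishedSession mimics a session whose channel is already closed.
type finishedSession struct{}

func (finishedSession) Close() error { return io.EOF }

func main() {
	fmt.Println(closeIgnoringEOF(finishedSession{})) // <nil>
}
```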

I don't think there is a bug here either; it was just a bit hard to understand the idempotency of session operations. Perhaps I can open a PR with some godoc changes describing this?
