Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: (*Cmd).Wait hangs on macOS when setctty is used after commit b15c399a3 #61779

Open
FiloSottile opened this issue Aug 5, 2023 · 10 comments
Open
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Milestone

Comments

@FiloSottile
Copy link
Contributor

A test that involves setting the command's TTY with the code below, and then calling (*Cmd).Wait, started timing out on macOS in Go 1.20.

func SetCtty(cmd *exec.Cmd, tty *os.File) {
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Setctty: true,
		Setsid:  true,
		Ctty:    3,
	}
	cmd.ExtraFiles = []*os.File{tty}
}

I bisected it down to b15c399.

commit b15c399a36a38509ae56dd69670974566f7b0d52
Author: Yuval Pavel Zholkover <paulzhol@gmail.com>
Date:   Sat Jul 30 20:41:58 2022 +0300

    os: only add file descriptors which are set to non-blocking mode to the netpoller

    Either ones where kind == kindNonBlock or those we've successfully called syscall.SetNonblock() on.
    Restore blocking behavior if we detect an error registering with the netpoller and our flow was
    successful in setting the inital syscall.SetNonblock().

    Update #54100

    Change-Id: I08934e4107c7fb36c15a7ca23ac880490b4df235
    Reviewed-on: https://go-review.googlesource.com/c/go/+/420334
    TryBot-Result: Gopher Robot <gobot@golang.org>
    Reviewed-by: Dmitri Goutnik <dgoutnik@gmail.com>
    Reviewed-by: Ian Lance Taylor <iant@google.com>
    Run-TryBot: Yuval Pavel Zholkover <paulzhol@gmail.com>
    Reviewed-by: Than McIntosh <thanm@google.com>
    Auto-Submit: Ian Lance Taylor <iant@golang.org>

 src/os/file_unix.go | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

You can see the failing tests in the CI of rogpeppe/go-internal#172 (https://github.com/rogpeppe/go-internal/actions/runs/5005273594?pr=172) or in age's tests (https://github.com/FiloSottile/age/actions/runs/5633140339/job/15261730363).

/cc @paulzhol @ianlancetaylor @dmgk #54100

@ianlancetaylor
Copy link
Member

I'm sure I'm doing something dumb but I can't find the sources for the TestScripts/pty test. I cloned https://github.com/FiloSottile/go-internal but I don't see that test.

If the cause is https://go.dev/cl/420334 then my first guess would be that Darwin kqueue does not permit adding a pty. But I don't understand what the pty is being used for. It looks like the subcommand stdin, stdout, and stderr will all be pipes.

@mvdan
Copy link
Member

mvdan commented Aug 6, 2023

@ianlancetaylor that testdata file is added in the PR https://github.com/rogpeppe/go-internal/pull/172/files, so you can find it at the branch https://github.com/FiloSottile/go-internal/tree/filippo/pty.

@FiloSottile
Copy link
Contributor Author

FiloSottile commented Aug 6, 2023 via email

mvdan pushed a commit to rogpeppe/go-internal that referenced this issue Aug 7, 2023
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 7, 2023
@mknyszek mknyszek added this to the Backlog milestone Aug 7, 2023
@ianlancetaylor
Copy link
Member

ianlancetaylor commented Aug 9, 2023

I don't fully understand what is happening.

It's clear that macOS does not permit opening /dev/tty and passing the resulting descriptor to kevent.

Before CL 420334 we would first try to add the descriptor to kqueue, and if that succeeded put the descriptor into non-blocking mode. After that CL we would first put the descriptor into non-blocking mode, and then try to add it to kqueue. We made that change because of #54100, in which we could sometimes add a descriptor to kqueue and then fail to make it non-blocking, which is bad. So now we first set the descriptor to non-blocking, and then try to add it to kqueue. If we fail to add it to kqueue, we put the descriptor back into blocking mode.

That is the sequence that happens for macOS with /dev/tty. It seems fine. But for some reason that I don't understand, opening /dev/tty, setting it to non-blocking mode, then setting it back to blocking mode, causes an inexplicable problem: the program hangs when calling _exit. Unfortunately I can't run dtruss on the system I am using, so I don't know precisely which system call is hanging. I don't know whether this is a libsystem bug or a kernel bug.

It is also possible that the problem arises because we put /dev/ptmx or the /dev/ttysNNN device into non-blocking mode.

I have not been able to write my own test case for this, although I can see that the test described above fails. I'm not sure what is different, other than that I am using the internal/testpty package from the standard library.

I have a simple fix that fixes this issue. Maybe we should just go with it. I'm not sure.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/517555 mentions this issue: os: don't add character devices to kqueue on darwin

@paulzhol
Copy link
Member

paulzhol commented Aug 9, 2023

I don't have access to the hardware, but darwin has a test suite which sets the returned openpty() fd's to O_NONBLOCK, adds them to various kqueues and does read/write calls on them.

I'd love to see some system call traces showing whether the pair of opened ptys in the parent fail to be added to kqueue or it is the opening of /dev/tty in the child (which could be not having a controlling terminal anyway) which fails.

FreeBSD man page on tty devices has this section regarding TIOCNOTTY/TIOCSCTTY and /dev/tty:

The current system does not allocate a controlling terminal
to a process on an open() call: there is a specific ioctl
called TIOCSCTTY to make a terminal the controlling terminal.
In addition, a program can fork() and call the setsid() sys-
tem call which will place the process into its own session -
which has the effect of disassociating it from the control-
ling terminal. This is the new and preferred method for pro-
grams to lose their controlling terminal.

However, environmental restrictions may prohibit the process
from being able to fork() and call the setsid() system call
to disassociate it from the controlling terminal. In this
case, it must use TIOCNOTTY.

Maybe the child OSX process needs to set TIOCSCTTY on /dev/tty as well?

This does not explain the blocking of _exit of course.

@bcmills
Copy link
Contributor

bcmills commented Nov 3, 2023

The next time someone sees a failure mode matching this issue, could you paste a goroutine dump from the timed-out process directly into the GitHub issue? The CI links from the original post seem to have expired.

@FiloSottile
Copy link
Contributor Author

This is still an issue in Go 1.22. https://github.com/FiloSottile/age/tree/tmp/61779 has tests that fail on macOS.

$ go test ./cmd/age -run TestScript/scrypt -timeout 5s
panic: test timed out after 5s
running tests:
	TestScript/scrypt (5s)

goroutine 51 [running]:
testing.(*M).startAlarm.func1()
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2366 +0x30c
created by time.goFunc
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/time/sleep.go:177 +0x38

goroutine 1 [chan receive]:
testing.(*T).Run(0x1400012e820, {0x1043ed148?, 0x140001697d8?}, 0x1044ba1c0)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1750 +0x32c
testing.runTests.func1(0x1400012e820)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2161 +0x40
testing.tRunner(0x1400012e820, 0x140001698f8)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1689 +0xec
testing.runTests(0x1400011c438, {0x1046009b0, 0x1, 0x1}, {0x140001699b8?, 0x10429fbbc?, 0x10460aee0?})
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2159 +0x3b0
testing.(*M).Run(0x14000120e60)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:2027 +0x5a4
github.com/rogpeppe/go-internal/testscript.RunMain({0x1044bbe58, 0x14000120e60}, 0x14000169e48)
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/exe.go:100 +0x5ec
filippo.io/age/cmd/age.TestMain(0x14000120e60)
	/Users/filippo/src/filippo.io/age/cmd/age/age_test.go:17 +0xf4
main.main()
	_testmain.go:49 +0x170

goroutine 35 [chan receive]:
testing.tRunner.func1()
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1650 +0x43c
testing.tRunner(0x1400012e9c0, 0x1044ba1c0)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1695 +0x128
created by testing.(*T).Run in goroutine 1
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1742 +0x318

goroutine 36 [syscall]:
syscall.syscall6(0x14000165708?, 0x104245cac?, 0x90000001010160?, 0x104976588?, 0x90?, 0x1046a0a68?, 0x90?)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/runtime/sys_darwin.go:45 +0x68
syscall.wait4(0x14000165758?, 0x1042d1c38?, 0x90?, 0x1044b0c60?)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/syscall/zsyscall_darwin_arm64.go:44 +0x4c
syscall.Wait4(0x140001657f8?, 0x14000165794, 0x0?, 0x10423d6c8?)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/syscall/syscall_bsd.go:144 +0x28
os.(*Process).wait(0x14000136750)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/os/exec_unix.go:43 +0x80
os.(*Process).Wait(...)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/os/exec.go:134
os/exec.(*Cmd).Wait(0x140001742c0)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/os/exec/exec.go:897 +0x38
github.com/rogpeppe/go-internal/testscript.waitOrStop({0x1044bd5b0, 0x14000178540}, 0x140001742c0, 0xee6608a)
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:1103 +0xc4
github.com/rogpeppe/go-internal/testscript.(*TestScript).exec(0x1400017e488, {0x1043ebb86?, 0x0?}, {0x14000159090?, 0x14000060a68?, 0x104283cc0?})
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:997 +0x4cc
github.com/rogpeppe/go-internal/testscript.(*TestScript).cmdExec(0x1400017e488, 0x0, {0x14000159080, 0x4, 0x4})
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/cmd.go:258 +0x218
github.com/rogpeppe/go-internal/testscript.RunMain.func2(0x1400017e488, 0x0, {0x14000159050, 0x3, 0xb200000000000005?})
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/exe.go:97 +0x168
github.com/rogpeppe/go-internal/testscript.(*TestScript).runLine.func2()
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:710 +0x58
github.com/rogpeppe/go-internal/testscript.(*TestScript).callBuiltinCmd(0x104482160?, {0x14000112a50?, 0x140001bc06a?}, 0x3?)
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:729 +0x58
github.com/rogpeppe/go-internal/testscript.(*TestScript).runLine(0x1400017e488, {0x140001bc06a, 0x12})
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:709 +0x5f4
github.com/rogpeppe/go-internal/testscript.(*TestScript).run(0x1400017e488)
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:606 +0x374
github.com/rogpeppe/go-internal/testscript.RunT.func1({0x1044be310, 0x1400012eb60})
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:343 +0x24c
github.com/rogpeppe/go-internal/testscript.tshim.Run.func1(0x1400012eb60?)
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:234 +0x30
testing.tRunner(0x1400012eb60, 0x140001187f0)
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1689 +0xec
created by testing.(*T).Run in goroutine 35
	/opt/homebrew/Cellar/go/1.22.0/libexec/src/testing/testing.go:1742 +0x318

goroutine 42 [chan send]:
github.com/rogpeppe/go-internal/testscript.waitOrStop.func1()
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:1100 +0x1f8
created by github.com/rogpeppe/go-internal/testscript.waitOrStop in goroutine 36
	/Users/filippo/pkg/mod/github.com/rogpeppe/go-internal@v1.12.0/testscript/testscript.go:1057 +0xbc
FAIL	filippo.io/age/cmd/age	5.132s
FAIL

@ianlancetaylor
Copy link
Member

Does anyone want to pick up https://go.dev/cl/517555 and complete it?

@danielskowronski
Copy link

danielskowronski commented Jul 5, 2024

Has it been tested on various iterations of Apple Silicon? I've observed some rare instances of different behaviour on M1 Max vs M2 Pro with the same macOS version (and stock SSH client), but the code is using several layers of abstraction over this low-level aspect and I don't have direct access to those computers.


I just checked that test, which @FiloSottile posted (go1.22.5 darwin/arm64) on machines I have access to:

  • M1 Max 32GB physical machine - 14.5 - fail
  • M1 Max 64GB VM (8GB) - 14.5 - fail
  • M2 16GB VM (8GB) - 14.5 - fail
  • M2 Pro 32GB physical machine - 14.5 - fail
  • M1 Max 64GB VM (8GB) - 15.0 beta 1 - pass
  • M1 Max 64GB VM (8GB) - 15.0 beta 2 - pass
  • M1 Max 64GB physical machine - 15.0 beta 2 - pass
  • M2 16GB physical machine - 15.0 beta 2 - pass

This disproves my theory with differences between M1 and M2 as I checked it on same CPUs that were giving me original issues with SSH. However, something interesting came up - Apple changed something XNU which aligned PTY behaviour with Linux. Nothing posted in release notes, but those kind of internals would only be visible once they push code to XNU OSS repo (I believe they do so after RC).

If you have any specific code and dumps - I can help :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Projects
None yet
Development

No branches or pull requests

8 participants