Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: unix domain socket crash on darwin High Sierra 10.13.6, go 1.21.0 #62337

Closed
glycerine opened this issue Aug 29, 2023 · 11 comments
Closed
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@glycerine
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.21.0 darwin/amd64

Does this issue reproduce with the latest release?

This is the latest

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE='auto'
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/jaten/Library/Caches/go-build'
GOENV='/Users/jaten/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/jaten/go/pkg/mod'
GOOS='darwin'
GOPATH='/Users/jaten/go'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go1.21.0'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go1.21.0/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.21.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD=''
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/6s/zdc0hvvx7kqcglg5yqm3kl4r0000gn/T/go-build406580306=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I started using unix domain sockets on go1.21.0 on High Sierra. I observed once a sporadic crash. This was not easy to reproduce, it happens only sporadically; probably some race.

I assume if High Sierra is no longer supported, that may be why... perhaps there was some runtime evolution of the code, and I'll just need to go back to an older Go release. But it would be nice to know if that is the case, or perhaps this really does indicate a bug in the darwin unix domain socket handling Go runtime code.

What did you expect to see?

Not a crash.

What did you see instead?

A crash a moment after starting a unix domain listener.

SIGSEGV: segmentation violation
PC=0x100051e37 m=0 sigcode=1

goroutine 0 [idle]:
runtime.(*sigctxt).sigcode(...)
/usr/local/go1.21.0/src/runtime/signal_darwin_amd64.go:43
runtime.(*sigctxt).sigFromUser(...)
/usr/local/go1.21.0/src/runtime/os_unix_nonlinux.go:14
runtime.sighandler(0x1c, 0x100a783a0?, 0x7ffeefbfd360?, 0x100a777a0)
/usr/local/go1.21.0/src/runtime/signal_unix.go:676 +0x177 fp=0x7ffeefbfd338 sp=0x7ffeefbfd2d0 pc=0x100051e37
runtime.sigtrampgo(0x1c, 0x8300, 0x820000008303)
/usr/local/go1.21.0/src/runtime/signal_unix.go:490 +0x13c fp=0x7ffeefbfd3b0 sp=0x7ffeefbfd338 pc=0x10005181c
runtime.sigtramp()
/usr/local/go1.21.0/src/runtime/sys_darwin_amd64.s:189 +0x46 fp=0x7ffeefbfd400 sp=0x7ffeefbfd3b0 pc=0x100071486

the place where this crash occurred:

file runtime/os_unix_nonlinux.go,  Go version 1.21.0 on darwin/amd64
package runtime

// sigFromUser reports whether the signal was sent because of a call
// to kill.
//
//go:nosplit
func (c *sigctxt) sigFromUser() bool {
    return c.sigcode() == _SI_USER  // line 14 (segfault) is here
}

caller was

file: runtime/signal_unix.go, Go version 1.21.0 on darwin/amd64

    flags := int32(_SigThrow)
    if sig < uint32(len(sigtable)) {
        flags = sigtable[sig].flags
    }
    if !c.sigFromUser() && flags&_SigPanic != 0 && (gp.throwsplit || gp != mp.curg) {  // segfault here, line 676
        // We can't safely sigpanic because it may grow the
        // stack. Abort in the signal handler instead.
        //
        // Also don't inject a sigpanic if we are not on a
        // user G stack. Either we're in the runtime, or we're
        // running C code. Either way we cannot recover.
        flags = _SigThrow
    }

here is the unix domain code; it creates a simple cooperative lock between processes, to avoid
both working on the same file

package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// UDLock uses a unix-domain socket to lock
// a file to prevent two cooperating processing from
// using the same file.
type UDLock struct {
	Path     string
	lsn      net.Listener
	Finished chan struct{}
}

// NewUDLock obtains a lock based on path. If the
// lock was obtained, a nil err will be returned.
// path will have ".lock" appended if it does not
// already have it. Call lock.Close() to release
// the lock.
func NewUDLock(path string) (lock *UDLock, err error) {

	if strings.HasSuffix(path, ".lock") {
		// okay as is.
	} else {
		path = path + ".lock"
	}

	path2, err0 := filepath.Abs(path)
	panicOn(err0)
	path = path2

	staleLock := false
	if FileExists(path) {
		// see if process at this unix domain socket is
		// still alive and responding to lock queries, or
		// if the .lock file is just stale and we can
		// grab it.
		conn, err1 := net.Dial("unix", path)
		//vv("dial err1 = '%v' on path '%v'", err1, path)
		if err1 == nil {
			var buf [4096]byte
			err2 := conn.SetReadDeadline(time.Now().Add(3 * time.Second))
			panicOn(err2)
			n, err3 := conn.Read(buf[:])
			if err3 != nil {
				if netErr, ok := err3.(net.Error); ok && netErr.Timeout() {
					staleLock = true
				}
			}
			if staleLock {
				os.Remove(path)
			} else {
				err = fmt.Errorf("path '%v' has a live lock already: '%v'", path, string(buf[:n]))
				return
			}
		} else {
			//vv("removing stale socket file b/c could not contact the process holding it")
			os.Remove(path)
		}
	}
	lsn, err2 := net.Listen("unix", path)
	if err2 != nil {
		return nil, err2
	}

	lock = &UDLock{
		Path:     path,
		lsn:      lsn,
		Finished: make(chan struct{}),
	}
	lock.start()
	return lock, nil
}

// Close releases the lock
func (lock *UDLock) Close() {
	conn, err1 := net.DialTimeout("unix", lock.Path, 5*time.Second)

	if err1 == nil {
		conn.SetReadDeadline(time.Now().Add(time.Second))
		var buf [4096]byte
		conn.Read(buf[:])

		conn.Write([]byte("shutdown"))
		<-lock.Finished
	}
	os.Remove(lock.Path)
}

func (lock *UDLock) start() {
	go func() {
		defer close(lock.Finished)
		for {
			//vv("top of UDLock.start() loop")

			// Accept new connections, tell them the lock is held,
			// and check if we are shutting down
			conn, err := lock.lsn.Accept()
			if err != nil {
				//vv("lsn.Accept error (probably closed due to shutdown): '%v'", err)
				return
			}

			go func(conn net.Conn) {
				//vv("Client connected [%s]", conn.RemoteAddr().Network())

				err = conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
				panicOn(err)
				fmt.Fprintf(conn, "'%v' locked! by pid:%v", lock.Path, os.Getpid())

				// check if is request to shutdown
				err = conn.SetReadDeadline(time.Now().Add(5 * time.Second))
				panicOn(err)
				var buf [4096]byte
				n, err := conn.Read(buf[:])
				if err == nil && n >= len("shutdown") {
					if string(buf[:len("shutdown")]) == "shutdown" {
						conn.Close()
						//vv("exit lock listening loop by closing lsn.")
						lock.lsn.Close()
						return
					}
				}
				conn.Close()
			}(conn)
		}
	}()
}
@ianlancetaylor ianlancetaylor changed the title runtime/net unix domain socket crash on darwin High Sierra 10.13.6, go 1.21.0 runtime: unix domain socket crash on darwin High Sierra 10.13.6, go 1.21.0 Aug 29, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 29, 2023
@ianlancetaylor
Copy link
Contributor

If the stack trace is correct this is a SIGWINCH signal, which is sent when the terminal window size changes. Did you happen to resize the terminal window? Then while trying to handle that signal, the process gets a SIGSEGV, which kills it. But I don't know why it would get a SIGSEGV there.

@panjf2000 panjf2000 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 29, 2023
@panjf2000 panjf2000 added this to the Backlog milestone Aug 29, 2023
@mknyszek
Copy link
Contributor

Hm... Possibly related to #60449? Though, if it's signal specific it might unrelated; that's more general memory corruption we're seeing. How easy is this to reproduce?

@mknyszek mknyszek added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 30, 2023
@glycerine
Copy link
Author

glycerine commented Sep 2, 2023

If the stack trace is correct this is a SIGWINCH signal, which is sent when the terminal window size changes. Did you happen to resize the terminal window? Then while trying to handle that signal, the process gets a SIGSEGV, which kills it. But I don't know why it would get a SIGSEGV there.

@ianlancetaylor

I didn't manually try to resize the window, but I do remember the terminal "blinking", so perhaps there was a terminal resize invoked by some code... the Go process loads R 4.0.5 as a library, which, being a huge old crusty C codebase, might very well be doing something with the terminal... or gnu readline, or libedit, might be doing stuff. The process tells R to open a graphics window, so XQuartz on darwin gets launched and a graphics window attempts to be opened... and OSX tries to bring that graphics window to the top of the desktop, which perhaps sends a signal to the terminal (which is Iterm2 build 3.1.5) Any number components might have done stuff to the terminal, I guess.

How easy is this to reproduce?

@mknyszek
I was only able to see this once. So, very difficult to reproduce.

@glycerine
Copy link
Author

glycerine commented Sep 2, 2023

Ah hah. Thanks Ian for the idea about SIGWINCH. I was able to reproduce a crash again by resizing the terminal window. So perhaps this is unrelated to the unix domain socket stuff, and more related to terminal stuff.

Also I realized that because the R library (loaded by the main Go process and called by cgo) loads a shared object (.dylib) that also contains Go code, there are two copies of the Go runtime involved (one from the main process, one embedded in the .dylib), and I cannot tell up front which one is doing the crashing... but, the goroutine stack trace does seem to be from the main, initial process, Go runtime, because it has calls to the web-server portion that I use to show graphics and the history of the R command line interaction. However, the other runtime might be crashing??? I don't think I was ever able to completely isolate it from the occassional signal, unfortunately, and sometimes they get through to the "inside" runtime, I suspect (does that make sense? i.e. perhaps the fix for #13034 still left some ways through).

@glycerine
Copy link
Author

glycerine commented Sep 2, 2023

Looking at the source for libedit, which is what I'm linking the R library against,

https://opensource.apple.com/source/libedit/libedit-31/src/sig.c.auto.html

I notice that it does not use the SA_ONSTACK flag when taking over signals while it reads a command line.

So I think I can see a race where, between the time the libedit reading a command line returns, and before my C code can restore the SA_ONSTACK flag to the signal handlers, here

https://github.com/glycerine/embedr/blob/master/cpp/embedr.cpp#L349
and here
https://github.com/glycerine/embedr/blob/master/cpp/embedr.cpp#L378

a signal could get through to the top/main Go runtime that does not have SA_ONSTACK set.

Moreover this is fragile anyway since the runtime is running in the background on a separate thread right?

@glycerine
Copy link
Author

@ianlancetaylor Is there a way to have the Go runtime not crash if SA_ONSTACK is not set? making sure it is always set is super tricky when using these big old C systems like R and libedit...

(referring to

If the non-Go code installs any signal handlers, it must use the SA_ONSTACK flag with sigaction. Failing to do so is likely to cause the program to crash if the signal is received. Go programs routinely run with a limited stack, and therefore set up an alternate signal stack.

from https://pkg.go.dev/os/signal
)

@glycerine
Copy link
Author

possibly related: #22805

@glycerine
Copy link
Author

glycerine commented Sep 2, 2023

I'm trying the following to keep signals away from the Host (main) Go runtime. I'd appreciate any thoughts about whether this would be expected to work... or is there still a subtle race here?

right after crossing the CGO border into running C code, I save the set of signal handlers (which would include any host Go runtime signal handlers for growing stacks), and then null out all the signal handlers:

https://github.com/glycerine/embedr/blob/master/cpp/embedr.cpp#L176
https://github.com/glycerine/embedr/blob/master/cpp/embedr.cpp#L158

So if the libedit/libreadline code that sets SIGWINCH does stuff, the host Go runtime should never see it, even in that small race window of the readline code finishing and the Go code getting to run again, because we are still on the C side of the CGO go->into C call.

So I restore the host Go runtime signal handlers before returning from the CGO call.

https://github.com/glycerine/embedr/blob/master/cpp/embedr.cpp#L210

I think this should prevent the kernel from ever giving the host Go runtime a signal callback, right?

I'm still puzzling about how to protect the guest Go runtime (inside the .so/DLL/.dylib) from getting signals. Thinking on it. Any suggestions welcome. I think it has to get signals in order to grow its stacks, no? (or maybe not? that would be greatly simplifying)...

I can save the guest Go runtime signal handlers after it has initialized, but there are alot of calls back and forth between R and Go guest code in the DLL, where I don't have a way to intercept in C and make sure the guest runtime signal handlers are installled... without modifying the C code for R, which would be painful.

@glycerine
Copy link
Author

So reading about -buildmode=c-shared in https://pkg.go.dev/os/signal, which would apply to the guest Go runtime in the DLL:

Non-Go programs that call Go code:
...
For the synchronous signals and SIGPIPE, the Go runtime will install a signal handler. It will save any existing signal handler. If a synchronous signal arrives while executing non-Go code, the Go runtime will invoke the existing signal handler instead of the Go signal handler.

This seems to say that I would need get the synchronous signals and SIGPIPE back in place for the guest Go runtime to be 100% happy.

If this is an accurate conclusion, it seems like some kind of callback mechanism to let my code restore to the guest Go runtime code its proper/expected signal handlers without having to modify the calling C code would be highly useful (to put it mildly, really we want to avoid random sporadic crashes to our working programs in most all cases).

@glycerine
Copy link
Author

Anywho, its pretty clear to me this crash was not a Go bug but rather a coding mistake on my part. Feel free to close this issue. But if you all do have advice on how to handle the signals for the guest c-shared Go runtime, I would be grateful for guidance. Thank you.

@ianlancetaylor
Copy link
Contributor

Thanks for the thorough investigation.

When any Go code is used in a process, all signals must have the SA_ONSTACK flag set. The Go runtime will attempt to make this happen when it is initialized. However, if signal handlers are installed after the Go runtime is initialized, then there is nothing to force the SA_ONSTACK flag to be set.

If SA_ONSTACK is not set, and a signal is delivered to a thread while it is running Go code, it will generally crash. That is because Go runs with a small extensible stack. Deliver a signal to a Go thread will often require more stack space than the Go thread has available. This will lead to a stack overrun and a crash.

Is there a way to have the Go runtime not crash if SA_ONSTACK is not set?

There is not. That said, since you mention it, we could consider having some knob for this somewhere. Though I don't know where. Turning the knob on would tend to waste a lot more memory on goroutine stacks, but it might be an acceptable tradeoff for a C program calling Go code that doesn't want to run very many goroutines. That should be a different issue, though.

Assuming you can't change the C code that you are calling, one possible fix for your program might be to have a hook after the signal handlers are installed that just sets the SA_ONSTACK flag for them. You can this by calling sigaction to fetch the current signal status, then, if SA_ONSTACK is not set, call sigaction again with the same values only with SA_ONSTACK set.

In any case it doesn't sound like there is anything to change in the Go for this issue, so closing. Thanks again.

@ianlancetaylor ianlancetaylor closed this as not planned Won't fix, can't repro, duplicate, stale Sep 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

5 participants