Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: UnixListener blocks forever in Close() if File() is used to get the file descriptor #29277

Open
prashantv opened this issue Dec 14, 2018 · 6 comments

Comments

@prashantv
Copy link
Contributor

@prashantv prashantv commented Dec 14, 2018

What version of Go are you using (go version)?

$ go version
go version go1.11.3 linux/amd64

Does this issue reproduce with the latest release?

Yes, it also reproduces with the latest tip.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/prashant/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/prashant/go"
GOPROXY=""
GORACE=""
GOROOT="/home/prashant/.gimme/versions/go1.11.3.linux.amd64"
GOTMPDIR=""
GOTOOLDIR="/home/prashant/.gimme/versions/go1.11.3.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build250167096=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Created a unix listener, called File() to get the underlying file descriptor, then ran Accept in a goroutine. After some while, tried to call Close on the listener, but the Close blocks indefinitely.

Repro:
https://github.com/prashantv/unix-close-race/blob/master/main.go

What did you expect to see?

Close to return, and cause the Accept (in a separate goroutine) to unblock, and return an error.

What did you see instead?

In Go 1.11.3, the Close blocks indefinitely:

goroutine 1 [semacquire]:
internal/poll.runtime_Semacquire(0xc0000a4028)
        /home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/runtime/sema.go:61 +0x39
internal/poll.(*FD).Close(0xc0000a4000, 0xc0000a4000, 0x0)
        /home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/internal/poll/fd_unix.go:109 +0xdb
net.(*netFD).Close(0xc0000a4000, 0xc0000b5e58, 0xc00008a030)
        /home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/net/fd_unix.go:184 +0x4f
net.(*UnixListener).close(0xc000080540, 0xc0000d0020, 0x5)
        /home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/net/unixsock_posix.go:186 +0x65
net.(*UnixListener).Close(0xc000080540, 0x1, 0x1)
        /home/prashant/.gimme/versions/go1.11.3.linux.amd64/src/net/unixsock.go:270 +0x47
main.testCloseAfterAccept(0x511ec0, 0xc000080540)
        /home/prashant/go/src/github.com/prashantv/unix-close-race/main.go:66 +0x111
main.main()
        /home/prashant/go/src/github.com/prashantv/unix-close-race/main.go:43 +0x1ce

In Go 1.10.6, the Close returns, but the Accept goroutine is not unblocked,

2018/12/14 14:58:02 Got fd from unix ln5
2018/12/14 14:58:02 wait for accept
2018/12/14 14:58:02 about to accept
2018/12/14 14:58:02 close
2018/12/14 14:58:02 wait for accept to end
<blocks>

Other Notes

It looks like the behaviour change between 1.10 and 1.11 may be caused by https://go-review.googlesource.com/c/go/+/119955/ (fix for #24942)

I added a flag to the repro, if you pass --use-new-fd, instead of calling Accept and Close on the original unix listener (which we called File() on), it uses a new listener from the copied file descriptor. This mitigates the issue (both in Go 1.10 and Go 1.11). It seems like calling File() duplicates the file descriptor, but somehow affects the original file descriptor causing issues with Accept + Close.

cc @witriew who originally ran into this issue.

@dylanplecki

This comment has been minimized.

Copy link

@dylanplecki dylanplecki commented Dec 15, 2018

This was observed in osrg/gobgp as well (at least in previous versions). This is my working theory so far after debugging that (in Go 1.11.1).

The problem is actually with the fd := f.Fd() line in the reproduction. This call makes the FD non-blocking:

func (f *File) Fd() uintptr {
	if f == nil {
		return ^(uintptr(0))
	}

	// If we put the file descriptor into nonblocking mode,
	// then set it to blocking mode before we return it,
	// because historically we have always returned a descriptor
	// opened in blocking mode. The File will continue to work,
	// but any blocking operation will tie up a thread.
	if f.nonblock {
		f.pfd.SetBlocking()
	}

	return uintptr(f.pfd.Sysfd)
}

The pfd being modifed in SetBlocking() is the duplicated FD, but note that it eventually calls fcntl(fd, F_SETFL, flag &^= O_NONBLOCK) (thus clearing the nonblocking flag on the duplicate FD).

But, the gotcha is that fcntl operates on the underlying file description, not the file descriptor. The duplicate and the original file descriptors (different integers) both share the same file description (and the same file status flags), so the nonblocking flag is cleared on the original unixLn socket by this call as well.

Thus, the unixLn socket reads are actually blocking calls, but the original pFD on the socket (poll.FD) has its isBlocking == 0. This triggers a deadlock on Close() during Accept() when it waits for all readers (the accept call) to close on on the pfd.csema semaphore.

func (fd *FD) Close() error {

	...

	// Wait until the descriptor is closed. If this was the only
	// reference, it is already closed. Only wait if the file has
	// not been set to blocking mode, as otherwise any current I/O
	// may be blocking, and that would block the Close.
	// No need for an atomic read of isBlocking, increfAndClose means
	// we have exclusive access to fd.
	if fd.isBlocking == 0 {
		runtime_Semacquire(&fd.csema)
	}

	return err
}

See "File status flags" in fcntl(2) for reference.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 15, 2018

@dylanplecki Thanks, I think your analysis is exactly correct.

The File method is documented as saying "The returned os.File's file descriptor is different from the connection's. Attempting to change properties of the original using this duplicate may or may not have the desired effect." The Fd method is documented as saying "On Unix systems this will cause the SetDeadline methods to stop working." The latter is a shorthand for a complex statement that amounts to "the returned file descriptor is put into blocking mode." This changes the property of the duplicate which changes the property of the original.

I don't know what we can do about this without losing features that are more important. I'm open to suggestions.

Before that, though: what are you really trying to do? Is this something that could be done more appropriately with net.ListenConfig?

@prashantv

This comment has been minimized.

Copy link
Contributor Author

@prashantv prashantv commented Dec 17, 2018

@dylanplecki That makes perfect sense, thanks for the analysis! We noticed that the Fd was the culprit, didn't realize that the SetBlocking the duplicated FD affected the original FD as well, that was the missing piece that leads to a simple workaround for our use-case -- put the file back into non-blocking mode using syscall.SetNonblock.

@ianlancetaylor We're implementing a graceful restart feature (similar to HAProxy), so we need to pass file descriptors from the old process to the new using a unix domain socket. I don't think we can use net.ListenConfig here.

I think the issue comes down to:

  • File() returning a new file descriptor, but still referencing the same underlying file description -- the documentation mentions a new file descriptor, and that changes may not affect the original. Maybe it should be more explicit in mentioning that they may share underlying state.
  • Calling Fd() on the duplicated *os.File sets the file description into blocking mode, but the original internal/poll.FD of the listener still has isBlocking == 1, since it doesn't "realize" the underlying file descriptor is no longer non-blocking. On Close(), we enter this block:

https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L102

  // Wait until the descriptor is closed. If this was the only
  // reference, it is already closed. Only wait if the file has
  // not been set to blocking mode, as otherwise any current I/O
  // may be blocking, and that would block the Close.
  // No need for an atomic read of isBlocking, increfAndClose means
  // we have exclusive access to fd.
  if fd.isBlocking == 0 {
    runtime_Semacquire(&fd.csema)
  }

As the comment says, this causes Close() to block, leading to the behaviour we see in the bug.

Even the original internal/poll.FD was updated to no longer be considered non-blocking as part of the File() + Fd() operations, the user may trigger changes that cause it to get out-of-sync (e.g., using syscall.SetNonblock).

@crvv

This comment has been minimized.

Copy link
Contributor

@crvv crvv commented Dec 18, 2018

I think you can use SyscallConn to get the fd without side effects.
https://golang.org/pkg/net/#UnixListener.SyscallConn

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Dec 18, 2018

I think I just ran into this issue in CL 154664. Closeing the file returned by File() before Closeing the connection worked for me.
(See line 86 of sendfile_test.go.)

@crvv

This comment has been minimized.

Copy link
Contributor

@crvv crvv commented Dec 18, 2018

There is an issue for sendfile.
#28330

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.