Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: stuck at syscall.Write #44083

Closed
jarifibrahim opened this issue Feb 3, 2021 · 7 comments
Closed

syscall: stuck at syscall.Write #44083

jarifibrahim opened this issue Feb 3, 2021 · 7 comments

Comments

@jarifibrahim
Copy link

@jarifibrahim jarifibrahim commented Feb 3, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.6 linux/amd64

Does this issue reproduce with the latest release?

Probably not.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/dg/.cache/go-build"
GOENV="/home/dg/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/dg/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/dg/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/6745"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/6745/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build185691020=/tmp/go-build -gno-record-gcc-switches"

What did you do?

This issue might not be reproducible.

I was running dgraph bulk loader https://github.com/dgraph-io/dgraph $ dgraph bulk .... The process has been stuck for more than 30+ hours now. I'm looking at the goroutine dump and I see the following

goroutine 25531 [syscall, 3153 minutes]:
syscall.Syscall(0x1, 0x2, 0xc005c4bc00, 0x5c, 0x8, 0x7fffffffffffffff, 0x6017abfd)
	/snap/go/6745/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.write(0x2, 0xc005c4bc00, 0x5c, 0x80, 0x4d58, 0x3, 0xc000044c29)
	/snap/go/6745/src/syscall/zsyscall_linux_amd64.go:914 +0x5a
syscall.Write(...)
	/snap/go/6745/src/syscall/syscall_unix.go:212
internal/poll.(*FD).Write.func1(0x7ffff80000000000, 0x4, 0xc0000a612c)
	/snap/go/6745/src/internal/poll/fd_unix.go:267 +0x77
internal/poll.ignoringEINTR(0xc00008bc10, 0x77, 0xed7a9a201, 0x0)
	/snap/go/6745/src/internal/poll/fd_unix.go:567 +0x27
internal/poll.(*FD).Write(0xc0000a6120, 0xc005c4bc00, 0x5c, 0x80, 0x0, 0x0, 0x0)
	/snap/go/6745/src/internal/poll/fd_unix.go:267 +0x19c
os.(*File).write(...)
	/snap/go/6745/src/os/file_posix.go:48
os.(*File).Write(0xc000010020, 0xc005c4bc00, 0x5c, 0x80, 0x29a60a0, 0x0, 0x0)
	/snap/go/6745/src/os/file.go:173 +0x77
log.(*Logger).Output(0xc000241450, 0x2, 0xc1ef10a140, 0x41, 0x0, 0x0)
	/snap/go/6745/src/log/log.go:181 +0x284
log.(*Logger).Printf(0xc000241450, 0xc030b9c7c0, 0x3e, 0xc065dd6740, 0x4, 0x4)
	/snap/go/6745/src/log/log.go:188 +0x7e
github.com/dgraph-io/badger/v3.(*defaultLog).Infof(0xc0004641f0, 0x1dbb913, 0x38, 0xc065dd6740, 0x4, 0x4)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/logger.go:97 +0xc5
github.com/dgraph-io/badger/v3.(*Options).Infof(...)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/logger.go:46
github.com/dgraph-io/badger/v3.(*sortedWriter).createTable(0xc000ace000, 0xc00004c680, 0x0, 0x0)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/stream_writer.go:454 +0x542
github.com/dgraph-io/badger/v3.(*sortedWriter).send.func1(0xc000ace000, 0xc00004c680)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/stream_writer.go:384 +0x35
created by github.com/dgraph-io/badger/v3.(*sortedWriter).send
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/stream_writer.go:383 +0x86

The file that we're writing to is a usual file and I can read/write to that file.
It looks like the syscall has been stuck for 3153 minutes. Since this syscall is stuck, dgraph bulk is also stuck and I see a bunch of goroutines trying to acquire the lock

goroutine 25172 [semacquire, 3153 minutes]:
sync.runtime_SemacquireMutex(0xc000241454, 0x6017ac00, 0x1)
	/snap/go/6745/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000241450)
	/snap/go/6745/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
	/snap/go/6745/src/sync/mutex.go:81
log.(*Logger).Output(0xc000241450, 0x2, 0xc1ef10a320, 0x41, 0x0, 0x0)
	/snap/go/6745/src/log/log.go:162 +0x40e
log.(*Logger).Printf(0xc000241450, 0xc020aed000, 0x3e, 0xc0039e3540, 0x4, 0x4)
	/snap/go/6745/src/log/log.go:188 +0x7e
github.com/dgraph-io/badger/v3.(*defaultLog).Infof(0xc0004641f0, 0x1dbb913, 0x38, 0xc0039e3540, 0x4, 0x4)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/logger.go:97 +0xc5
github.com/dgraph-io/badger/v3.(*Options).Infof(...)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/logger.go:46
github.com/dgraph-io/badger/v3.(*sortedWriter).createTable(0xc000ace0b0, 0xc0003ba580, 0x0, 0x0)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/stream_writer.go:454 +0x542
github.com/dgraph-io/badger/v3.(*sortedWriter).send.func1(0xc000ace0b0, 0xc0003ba580)
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/stream_writer.go:384 +0x35
created by github.com/dgraph-io/badger/v3.(*sortedWriter).send
	/home/dg/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2011.1/stream_writer.go:383 +0x86

I've also attached the complete goroutine dump
goroutine.txt

What did you expect to see?

I expected the logger to not get stuck 🤷‍♂️

What did you see instead?

The log/log.go was stuck because a syscall wasn't making progress.

@seankhliao seankhliao changed the title log/log.go: Logger stuck at syscall.Write syscall: stuck at syscall.Write Feb 3, 2021
@networkimprov
Copy link

@networkimprov networkimprov commented Feb 4, 2021

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 5, 2021

Thanks for the bug report, but it's hard to see how this is a problem in Go. The stack trace shows that Go is calling the system call write. That system call is not returning. I don't know why that is happening, but it seems like a kernel problem, not a problem in Go.

It would certainly be worth investigating if this is reproducible.

@networkimprov
Copy link

@networkimprov networkimprov commented Feb 5, 2021

Is the file in question on a network filesystem, e.g. NFS?

Also the poller retries EINTR from syscalls, which can result in hangs.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 5, 2021

I don't see any way that retrying on EINTR could introduce a hang on the write system call. The write has to hang for some other reason.

In any case that is not what happened here. The goroutine stack trace shows that the goroutine has been waiting for the write system call to return for 3153 minutes. If there was some sort of EINTR loop, that number would be lower.

@jarifibrahim
Copy link
Author

@jarifibrahim jarifibrahim commented Feb 5, 2021

Is the file in question on a network filesystem, e.g. NFS?

@networkimprov No. This was a normal file on a local SSD.

@networkimprov
Copy link

@networkimprov networkimprov commented Feb 5, 2021

Seems likely to be a transient hardware problem or kernel bug. A watchdog that can abort the app may be the only solution.

@mvdan
Copy link
Member

@mvdan mvdan commented Jun 15, 2021

Closing old issues that still have the WaitingForInfo label where enough details to investigate weren't provided. Feel free to leave a comment with more details and we can reopen.

@mvdan mvdan closed this Jun 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants