Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/poll, runtime: got not pollable err when reading fifo or socket #59545

Closed
zhuangqh opened this issue Apr 11, 2023 · 23 comments · May be fixed by #59632
Closed

internal/poll, runtime: got not pollable err when reading fifo or socket #59545

zhuangqh opened this issue Apr 11, 2023 · 23 comments · May be fixed by #59632
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@zhuangqh
Copy link

zhuangqh commented Apr 11, 2023

What version of Go are you using (go version)?

$ go version
go version go1.18.10 linux/amd64
go version go1.19.7 linux/amd64
go version go1.20.2 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.7"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/root/pollable-test/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1792391874=/tmp/go-build -gno-record-gcc-switches"

What did you do?

one sentence description: i've got not pollable error when reading from a fifo or socket.

reproduce code: https://github.com/zhuangqh/not-pollable-issue
step:

  1. open and close /dev/net/tun frequently
  2. normally write and read data from fifo or socket
  3. (about a few seconds later) fifo read reports error not pollable
package main

import (
	"encoding/hex"
	"fmt"
	"math/rand"
	"os"
	"syscall"
	"time"
)

func main() {
	fmt.Printf("%v start test\n", time.Now().Format(time.TimeOnly))
	go manyTun()
	for {
		f()
	}
}

func manyTun() {
	for {
		ff, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
		if err != nil {
			fmt.Fprintf(os.Stderr, "open dev %v", err)
			os.Exit(1)
		}

		if err := ff.Close(); err != nil {
			fmt.Fprintf(os.Stderr, "close dev")
			os.Exit(1)
		}
	}
}

// GenerateID generates a random unique id.
func GenerateID() string {
	b := make([]byte, 32)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func f() {
	fp := fmt.Sprintf("/tmp/myfifo/%s", GenerateID())
	defer func() {
		os.Remove(fp)
	}()

	if err := syscall.Mkfifo(fp, uint32(0600&os.ModePerm)); err != nil && !os.IsExist(err) {
		fmt.Fprintf(os.Stderr, "create fifo read: %v", err)
		os.Exit(1)
	}

	mode := os.O_RDONLY | syscall.O_NONBLOCK
	fReq, err := os.OpenFile(fp, mode, 0600)
	if err != nil {
		fmt.Fprintf(os.Stderr, "open fifo read: %v", err)
		os.Exit(1)
	}

	wReq, err := os.OpenFile(fp, os.O_WRONLY|syscall.O_NONBLOCK, 0700)
	if err != nil {
		fmt.Fprintf(os.Stderr, "open fifo write: %v", err)
		os.Exit(1)
	}

	_, err = wReq.Write([]byte("some message"))
	if err != nil {
		fmt.Fprintf(os.Stderr, "write: %v", err)
		os.Exit(1)
	}

	buf := make([]byte, 4)
	_, err = fReq.Read(buf)
	if err != nil {
		fmt.Fprintf(os.Stderr, "%v fifo read: %v\n", time.Now().Format(time.TimeOnly), err)
		os.Exit(1)
	}

	//fmt.Printf("got %v\n", string(buf))

	fReq.Close()
	wReq.Close()
}

outputs:

19:24:58 start test
19:25:02 fifo read: read /proc/self/fd/3: not pollable

I've learnt from this issue #30426 that open the device /dev/net/tun using os.OpenFile and read it will got the not pollable error. But i got this error from other fd when open and close it frequently.
It seems that epoll events pollute other files in the case of high concurrency?

tips for maintainer:

  • only fifo or net socket will got this error, normal file read/writes won't

What did you expect to see?

no error from other fd

What did you see instead?

instead the read failed with error fifo read: read /proc/self/fd/3: not pollable

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Apr 11, 2023
@bcmills
Copy link
Contributor

bcmills commented Apr 11, 2023

It's not clear to me that the bug is in the Go standard library. Can you reproduce it without github.com/containerd/fifo?

(I would suggest starting by raising an issue with the maintainers of that package.)

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 11, 2023
@zhuangqh
Copy link
Author

It's not clear to me that the bug is in the Go standard library. Can you reproduce it without github.com/containerd/fifo?

(I would suggest starting by raising an issue with the maintainers of that package.)

thanks for your reply. This library is a tiny repo to wrap the fifo operation and perform reading from a fifo file. I think it is unrelated to this issue. I also reprodeuce this issue using github.com/containerd/ttrpc (a tiny rpc lib) which perform simple read write to socket.

also i will try to reproduce without these lib later :)

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Apr 11, 2023
@ianlancetaylor
Copy link
Contributor

Likely a dup of #38618.

@zhuangqh
Copy link
Author

@bcmills @ianlancetaylor update: reproduce with only go std library

@zhuangqh
Copy link
Author

Likely a dup of #38618.

This issue explains why you got the not pollable error from /dev/net/tun fd.
But I got the error when reading other file. Are they the same?

It seems to read the epollerr event from tun fd when reading fifo fd

@ls-ggg
Copy link

ls-ggg commented Apr 12, 2023

I think this is a go runtime bug.
Implementation of runtime Use the address of the pollDesc object to assign to
ev.data and register to epoll.
*(**pollDesc)(unsafe.Pointer(&ev.data)) = pd)
epoll_ctl and epoll_pwait run independently in two threads, and the pollDesc
object will be reused, which will cause the event of the previous fd to be associated
with the next fd.

I can reproduce this problem with @zhuangqh 's demo.
The fd of a network device causes epoll_pwait to return EPOLLERR,
but this EPOLLERR is incorrectly associated with a fifo fd,
causing read fifo to encounter a not pollable error. The following is the strace log:

2059444 openat(AT_FDCWD, "/tmp/myfifo/884b1e92ec0c5b306f11", O_RDONLY|O_CLOEXEC|O_PATH <unfinished ...> = 11

1704614 openat(AT_FDCWD, "/dev/net/tun", O_RDWR|O_CLOEXEC <unfinished ...> = 34

1704625 epoll_ctl(4, EPOLL_CTL_DEL, 10, 0xc000d41c74 <unfinished ...>

1704548 openat(AT_FDCWD, "/proc/self/fd/11", O_RDONLY|O_CLOEXEC <unfinished ...> = 10

1704614 epoll_ctl(4, EPOLL_CTL_ADD, 34, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=916538008, u64=139750562415256}}

1704614 epoll_ctl(4, EPOLL_CTL_DEL, 34, 0xc000c02e74 <unfinished ...>

1704548 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE <unfinished ...>

1704524 <... epoll_pwait resumed>[{EPOLLERR, {u32=916538008, u64=139750562415256}}}], 128, 0, NULL, 0) = 1

1704548 epoll_ctl(4, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=916538008, u64=139750562415256}}

1704614 close(34)

1704582 write(2, "fifo read: read /proc/self/fd/11: not pollable", 46 <unfinished ...>

fd34(a fd for a tun device) makes epoll_pwait return EPOLLERR, but this error is reported to fd10,
which is a fifo. The pollDesc object (u64=139750562415256) is associated
with fd10 after disassociated from fd34.

@zhuangqh
Copy link
Author

this line of code forcibly set event err which emitted from tun fd to the Data address.
However, this pollDesc address maybe reused by other fd happens before this line.
we read this eventErr when reading fifo fd.

https://github.com/golang/go/blob/master/src/runtime/netpoll_epoll.go#L162

		if mode != 0 {
			pd := *(**pollDesc)(unsafe.Pointer(&ev.Data))
			pd.setEventErr(ev.Events == syscall.EPOLLERR)
			netpollready(&toRun, pd, mode)
		}

@ls-ggg
Copy link

ls-ggg commented Apr 12, 2023

I think this is a go runtime bug. Implementation of runtime Use the address of the pollDesc object to assign to ev.data and register to epoll. *(**pollDesc)(unsafe.Pointer(&ev.data)) = pd) epoll_ctl and epoll_pwait run independently in two threads, and the pollDesc object will be reused, which will cause the event of the previous fd to be associated with the next fd.

I can reproduce this problem with @zhuangqh 's demo. The fd of a network device causes epoll_pwait to return EPOLLERR, but this EPOLLERR is incorrectly associated with a fifo fd, causing read fifo to encounter a not pollable error. The following is the strace log:

2059444 openat(AT_FDCWD, "/tmp/myfifo/884b1e92ec0c5b306f11", O_RDONLY|O_CLOEXEC|O_PATH <unfinished ...> = 11

1704614 openat(AT_FDCWD, "/dev/net/tun", O_RDWR|O_CLOEXEC <unfinished ...> = 34

1704625 epoll_ctl(4, EPOLL_CTL_DEL, 10, 0xc000d41c74 <unfinished ...>

1704548 openat(AT_FDCWD, "/proc/self/fd/11", O_RDONLY|O_CLOEXEC <unfinished ...> = 10

1704614 epoll_ctl(4, EPOLL_CTL_ADD, 34, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=916538008, u64=139750562415256}}

1704614 epoll_ctl(4, EPOLL_CTL_DEL, 34, 0xc000c02e74 <unfinished ...>

1704548 fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE <unfinished ...>

1704524 <... epoll_pwait resumed>[{EPOLLERR, {u32=916538008, u64=139750562415256}}}], 128, 0, NULL, 0) = 1

1704548 epoll_ctl(4, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=916538008, u64=139750562415256}}

1704614 close(34)

1704582 write(2, "fifo read: read /proc/self/fd/11: not pollable", 46 <unfinished ...>

fd34(a fd for a tun device) makes epoll_pwait return EPOLLERR, but this error is reported to fd10, which is a fifo. The pollDesc object (u64=139750562415256) is associated with fd10 after disassociated from fd34.

In most cases, the residual EPOLLIN/EPOLLOUT will only trigger an invalid wakeup and have no effect on the upper code.
when calling epoll_ctl and epoll_wait in a multi-threaded scenario, a globally unique id or key should be used as the value of epoll_event.data, or pass the add/del operation to the same thread to execute after epoll_wait returns

@ianlancetaylor
Copy link
Contributor

Thanks for the analysis. I think I have a fix.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/484275 mentions this issue: runtime: don't reuse pollDesc until there are no pending netpolls

@chenhengqi
Copy link

@ianlancetaylor Hi, I've pulled the change and tested it locally, it seems that CL 484275 does NOT fix the issue, only mitigate it.

@ls-ggg
Copy link

ls-ggg commented Apr 13, 2023

@ianlancetaylor Hi, I've pulled the change and tested it locally, it seems that CL 484275 does NOT fix the issue, only mitigate it.

Because there may be multiple threads calling netpoll at the same time. Therefore, atomic counting does not guarantee that no other thread calls netpoll during pollcache.promote()

@ianlancetaylor
Copy link
Contributor

@chenhengqi Thanks. I've updated https://go.dev/cl/484275. Can you see if it fixes the problem for you? (Even the original version appeared to fix the problem for me.)

If anybody has suggestions for a better fix, I would be happy to hear them. I'm not very happy with this fix--I'm worried that it will lead to some contention on pollcache.lock.

ls-ggg added a commit to ls-ggg/go that referenced this issue Apr 14, 2023
…c reuse

fix the problem that the netpoll event is associated with the wrong fd because pollDesc object is reused

Fixes: golang#59545

Signed-off-by: ls-ggg <335814617@qq.com>
ls-ggg pushed a commit to ls-ggg/go that referenced this issue Apr 14, 2023
Fix the problem that the netpoll event is associated with the wrong fd because pollDesc object is reused

Fixes: golang#59545

Signed-off-by: ls-ggg <coocnngooo@gmail.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/484695 mentions this issue: netpoll:Add a fid field to prevent problems caused by pollDesc reuse

@ls-ggg
Copy link

ls-ggg commented Apr 14, 2023

@chenhengqi Thanks. I've updated https://go.dev/cl/484275. Can you see if it fixes the problem for you? (Even the original version appeared to fix the problem for me.)

If anybody has suggestions for a better fix, I would be happy to hear them. I'm not very happy with this fix--I'm worried that it will lead to some contention on pollcache.lock.

@ianlancetaylor Hi, I think it is not timely enough to release the pollDesc object by waiting for the atomic count to be 0, especially for some programs with high io load。

I've submitted a pr that might fix this problem。Running the above demo in my environment is ok。

@bcmills
Copy link
Contributor

bcmills commented Apr 14, 2023

If anybody has suggestions for a better fix, I would be happy to hear them.

I suggest that we ensure that only one netpollCheck call is in flight at a time, which I think we can do without too much contention. (Details in a comment on CL 484275.)

@ianlancetaylor
Copy link
Contributor

Thanks. I've written a different approach in the pair of CLs https://go.dev/cl/484836 and https://go.dev/cl/484837.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/484837 mentions this issue: runtime: add and use pollDesc fd sequence field

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/484836 mentions this issue: runtime: change lfstack support to taggedPointer

@chenhengqi
Copy link

@ianlancetaylor Here is the code sample you ask for. I used this to repro the issue on CL 484837.

func busyLoop() {
	for i := 0; i < 32; i++ {
		go func() {
			for {
				http.Get("https://baidu.com")
			}
		}()
	}
}

func main() {
	go busyLoop()
	fmt.Printf("%v start test\n", time.Now().Format(time.TimeOnly))
	go manyTun()
	for {
		f()
	}
}

@ianlancetaylor
Copy link
Contributor

Thanks. Updated the CL.

gopherbot pushed a commit that referenced this issue Apr 18, 2023
This is a refactoring with no change in behavior, in preparation
for future netpoll work.

For #59545

Change-Id: I493c5fd0f49f31b75787f7b5b89c544bed73f64f
Reviewed-on: https://go-review.googlesource.com/c/go/+/484836
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Orlando Labao <orlando.labao43@gmail.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/486255 mentions this issue: os: check for ErrNotExist instead of ErrExist in TestNonPollable

gopherbot pushed a commit that referenced this issue Apr 19, 2023
An apparent typo in CL 484837 caused the test to check for ErrExist
instead of ErrNotExist when opening /dev/net/tun for read. That causes
the test to fail on platforms where /dev/net/ton does not exist,
such as on the darwin-amd64-longtest builder.

Updates #59545.

Change-Id: I9402ce0dba11ab459674e8358ae9a8b97eabc8d2
Reviewed-on: https://go-review.googlesource.com/c/go/+/486255
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Commit-Queue: Bryan Mills <bcmills@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/558795 mentions this issue: runtime: use tagged pointers to reference poll descriptors on Windows

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
7 participants