-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Description
Go version
go version go1.23.3 linux/amd64
Output of go env in your module/workspace:
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/george/.cache/go-build'
GOENV='/home/george/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/george/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/george/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/go/go-1.23.3'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/go/go-1.23.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.3'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/george/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build215866673=/tmp/go-build -gno-record-gcc-switches'What did you do?
I'm working with os.File for raw non-blocking socket communication.
Not often I experience connect hangs from my client sockets.
I made a small example with a client using os.File to wrap a non-blocking TCP socket
The example can be fetched here: https://github.com/georgeyanev/go-raw-osfile-connect
This client connects to the remote side (the server) in a loop and writes a message upon successful connection. Then it closes the connection.
For connecting I use modified code from netFD.connect in go's net package.
The original connect code calls fd.pd.waitWrite diectly, and I can not do that because I have no access to the poll descriptor. In the provided example, in order to achieve calling of fd.pd.waitWrite, I use rawConn.Write passing it a dummy function.
The difference with the original code is that here, before calling fd.pd.waitWrite, rawConn.Write calls fd.writeLock() and fd.pd.prepareWrite(). I wonder if calling these two functions could cause the problem. And if so then there is no reliable way to call fd.pd.waitWrite upon connect.
Actually I can run a few hundred even a few thousand successful connects before hanging. Thats why there is a 100_000 times loop.
When using standard tcp client code (net.Dial, net.Conn etc.) there is no such an issue.
Is this behaviour expected or it is an issue that should be fixed?
This issue is tested on:
- Linux 6.10.11-linuxkit #1 SMP aarch64 GNU/Linux
- Linux 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
using go versions: 1.22.6 and 1.23.3
What did you see happen?
I saw connect hanging after a few hundred or a few thousand requests.
In the following sctrace -fTtt of the client output I see the epoll event
for writing (EPOLLOUT) is received from a PID different than the PID called connect and
then a new epoll_pwait function is called from the PID called connect this time waiting forever:
[pid 17322] 12:29:04.771233 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3 <0.000042>
[pid 17322] 12:29:04.771447 connect(3, {sa_family=AF_INET, sin_port=htons(33334), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000048>
[pid 17322] 12:29:04.771665 fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000067>
[pid 17322] 12:29:04.771799 epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=2144862493, u64=18446583856394404125}} <unfinished ...>
[pid 17323] 12:29:04.771844 <... nanosleep resumed>NULL) = 0 <0.010640>
[pid 17322] 12:29:04.771864 <... epoll_ctl resumed>) = 0 <0.000050>
[pid 17323] 12:29:04.771880 epoll_pwait(4, <unfinished ...>
[pid 17323] 12:29:04.771939 <... epoll_pwait resumed>[{events=EPOLLOUT, data={u32=2144862493, u64=18446583856394404125}}], 128, 0, NULL, 0) = 1 <0.000051>
[pid 17323] 12:29:04.772010 nanosleep({tv_sec=0, tv_nsec=10000000}, <unfinished ...>
[pid 17322] 12:29:04.772159 epoll_pwait(4, [], 128, 0, NULL, 0) = 0 <0.000031>
[pid 17322] 12:29:04.772245 epoll_pwait(4, <unfinished ...>
And the program hangs from now on.
What did you expect to see?
I expect all 100_000 connect and write cycles to pass successfully.
I expect to be able to use use the non-blocking connect with os.File reliably.
Please suggest if there is some other proper way for doing this