Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: File sporadic waits upon non-blocking raw connect #70373

Open
georgeyanev opened this issue Nov 15, 2024 · 3 comments
Open

os: File sporadic waits upon non-blocking raw connect #70373

georgeyanev opened this issue Nov 15, 2024 · 3 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@georgeyanev
Copy link

Go version

go version go1.23.3 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/george/.cache/go-build'
GOENV='/home/george/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/george/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/george/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/go/go-1.23.3'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/go/go-1.23.3/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.3'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/george/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build215866673=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I'm working with os.File for raw non-blocking socket communication.
Not often I experience connect hangs from my client sockets.
I made a small example with a client using os.File to wrap a non-blocking TCP socket

The example can be fetched here: https://github.com/georgeyanev/go-raw-osfile-connect

This client connects to the remote side (the server) in a loop and writes a message upon successful connection. Then it closes the connection.

For connecting I use modified code from netFD.connect in go's net package.

The original connect code calls fd.pd.waitWrite diectly, and I can not do that because I have no access to the poll descriptor. In the provided example, in order to achieve calling of fd.pd.waitWrite, I use rawConn.Write passing it a dummy function.
The difference with the original code is that here, before calling fd.pd.waitWrite, rawConn.Write calls fd.writeLock() and fd.pd.prepareWrite(). I wonder if calling these two functions could cause the problem. And if so then there is no reliable way to call fd.pd.waitWrite upon connect.

Actually I can run a few hundred even a few thousand successful connects before hanging. Thats why there is a 100_000 times loop.
When using standard tcp client code (net.Dial, net.Conn etc.) there is no such an issue.

Is this behaviour expected or it is an issue that should be fixed?

This issue is tested on:

- Linux 6.10.11-linuxkit #1 SMP aarch64 GNU/Linux
- Linux 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC x86_64 GNU/Linux

using go versions: 1.22.6 and 1.23.3

What did you see happen?

I saw connect hanging after a few hundred or a few thousand requests.

In the following sctrace -fTtt of the client output I see the epoll event
for writing (EPOLLOUT) is received from a PID different than the PID called connect and
then a new epoll_pwait function is called from the PID called connect this time waiting forever:

[pid 17322] 12:29:04.771233 socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_TCP) = 3 <0.000042>
[pid 17322] 12:29:04.771447 connect(3, {sa_family=AF_INET, sin_port=htons(33334), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000048>
[pid 17322] 12:29:04.771665 fcntl(3, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000067>
[pid 17322] 12:29:04.771799 epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=2144862493, u64=18446583856394404125}} <unfinished ...>
[pid 17323] 12:29:04.771844 <... nanosleep resumed>NULL) = 0 <0.010640>
[pid 17322] 12:29:04.771864 <... epoll_ctl resumed>) = 0 <0.000050>
[pid 17323] 12:29:04.771880 epoll_pwait(4,  <unfinished ...>
[pid 17323] 12:29:04.771939 <... epoll_pwait resumed>[{events=EPOLLOUT, data={u32=2144862493, u64=18446583856394404125}}], 128, 0, NULL, 0) = 1 <0.000051>
[pid 17323] 12:29:04.772010 nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 17322] 12:29:04.772159 epoll_pwait(4, [], 128, 0, NULL, 0) = 0 <0.000031>
[pid 17322] 12:29:04.772245 epoll_pwait(4,  <unfinished ...>

And the program hangs from now on.

What did you expect to see?

I expect all 100_000 connect and write cycles to pass successfully.

I expect to be able to use use the non-blocking connect with os.File reliably.
Please suggest if there is some other proper way for doing this

@ianlancetaylor ianlancetaylor changed the title os.File sporadic waits upon non-blocking raw connect os: File sporadic waits upon non-blocking raw connect Nov 15, 2024
@georgeyanev
Copy link
Author

Since EPOLLOUT is received but waitWrite does not wake up,
I managed to work around this by introducing a timer waking up the waitWrite at certain intervals and
checking connection status. See https://github.com/georgeyanev/go-raw-osfile-connect

For a 1000 connections there are ~200 timer inflicted wakeups.

I will be using this until (hopefully) this issue is resolved.

@dmitshur
Copy link
Contributor

CC @rsc, @ianlancetaylor, @griesemer.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 20, 2024
@dmitshur dmitshur added this to the Backlog milestone Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants