os: File sporadic waits upon non-blocking raw connect #70373
Labels
NeedsInvestigation
Go version
go version go1.23.3 linux/amd64
Output of go env in your module/workspace:
What did you do?
I'm working with os.File for raw non-blocking socket communication.
Occasionally I experience connect hangs on my client sockets.
I made a small example with a client that uses os.File to wrap a non-blocking TCP socket.
The example can be fetched here: https://github.com/georgeyanev/go-raw-osfile-connect
This client connects to the remote side (the server) in a loop and writes a message upon successful connection. Then it closes the connection.
For connecting I use modified code from netFD.connect in Go's net package.
The original connect code calls fd.pd.waitWrite directly, and I cannot do that because I have no access to the poll descriptor. In the provided example, in order to call fd.pd.waitWrite, I use rawConn.Write, passing it a dummy function.
The difference from the original code is that here, before calling fd.pd.waitWrite, rawConn.Write calls fd.writeLock() and fd.pd.prepareWrite(). I wonder if calling these two functions could cause the problem. If so, then there is no reliable way to call fd.pd.waitWrite upon connect.
Actually, I can run a few hundred or even a few thousand successful connects before a hang occurs. That's why there is a loop of 100_000 iterations.
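The approach described above can be sketched roughly as follows. This is a minimal, Linux-only illustration and not the code from the linked repository. The names connectRaw and listenLoopback are hypothetical. Wrapping the descriptor with os.NewFile only after connect has been issued is an assumption about ordering that mirrors netFD.connect. Instead of a dummy function, the callback passed to rawConn.Write inspects SO_ERROR and getpeername itself, so the wait on writability happens only while the connect is still in progress. Whether this variant avoids the reported hang is not established here.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// listenLoopback (hypothetical helper) starts a throwaway TCP server on
// 127.0.0.1 that accepts connections and discards whatever is written.
func listenLoopback() (port int, stop func(), err error) {
	ln, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return 0, nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				io.Copy(io.Discard, c)
			}()
		}
	}()
	return ln.Addr().(*net.TCPAddr).Port, func() { ln.Close() }, nil
}

// connectRaw (hypothetical name) issues a non-blocking connect on a raw TCP
// socket, wraps the descriptor in an os.File and waits for completion through
// RawConn.Write. Linux-only: it relies on SOCK_NONBLOCK and SOCK_CLOEXEC.
func connectRaw(ip [4]byte, port int) (*os.File, error) {
	fd, err := syscall.Socket(syscall.AF_INET,
		syscall.SOCK_STREAM|syscall.SOCK_NONBLOCK|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, fmt.Errorf("socket: %w", err)
	}
	cerr := syscall.Connect(fd, &syscall.SockaddrInet4{Port: port, Addr: ip})
	// Assumption: hand the descriptor to the runtime poller only after
	// connect has been issued, the same order netFD.connect uses.
	f := os.NewFile(uintptr(fd), "raw-tcp")
	switch cerr {
	case nil, syscall.EISCONN:
		return f, nil
	case syscall.EINPROGRESS, syscall.EALREADY, syscall.EINTR:
		// Connect is in flight; wait for it below.
	default:
		f.Close()
		return nil, fmt.Errorf("connect: %w", cerr)
	}
	rc, err := f.SyscallConn()
	if err != nil {
		f.Close()
		return nil, err
	}
	var soErr error
	werr := rc.Write(func(s uintptr) bool {
		n, err := syscall.GetsockoptInt(int(s), syscall.SOL_SOCKET, syscall.SO_ERROR)
		if err != nil {
			soErr = err
			return true
		}
		switch e := syscall.Errno(n); e {
		case syscall.EINPROGRESS, syscall.EALREADY, syscall.EINTR:
			return false // still connecting: wait for writability
		case syscall.EISCONN:
			return true
		case syscall.Errno(0):
			// Wake-ups can be spurious; confirm with getpeername and
			// wait again if the socket is not connected yet.
			_, perr := syscall.Getpeername(int(s))
			return perr == nil
		default:
			soErr = e
			return true
		}
	})
	if werr == nil {
		werr = soErr
	}
	if werr != nil {
		f.Close()
		return nil, fmt.Errorf("connect: %w", werr)
	}
	return f, nil
}
```

A caller would obtain a port from listenLoopback (or use a real server), call connectRaw([4]byte{127, 0, 0, 1}, port), write the message with f.Write and then close the file, repeating this in a loop as the example client does.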
When using the standard TCP client code (net.Dial, net.Conn, etc.) there is no such issue.
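For comparison, the standard-library path mentioned here might look like the sketch below. The helper names startSink and dialLoop are hypothetical, and the loopback server exists only so the loop has something to connect to. It is meant as a baseline for the same connect, write, close cycle, not as a reproduction of the example repository.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// startSink (hypothetical helper) starts a loopback TCP server that accepts
// connections and discards everything it receives.
func startSink() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				io.Copy(io.Discard, c)
			}()
		}
	}()
	return ln.Addr().String(), func() { ln.Close() }, nil
}

// dialLoop (hypothetical name) connects n times with net.Dial, writes msg on
// each connection and closes it again.
func dialLoop(addr string, msg []byte, n int) error {
	for i := 0; i < n; i++ {
		c, err := net.Dial("tcp", addr)
		if err != nil {
			return fmt.Errorf("dial %d: %w", i, err)
		}
		_, werr := c.Write(msg)
		cerr := c.Close()
		if werr != nil {
			return fmt.Errorf("write %d: %w", i, werr)
		}
		if cerr != nil {
			return fmt.Errorf("close %d: %w", i, cerr)
		}
	}
	return nil
}
```

Here net.Dial performs the non-blocking connect and the wait for writability internally, so the caller never touches the poll descriptor.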
Is this behaviour expected, or is it an issue that should be fixed?
This issue is tested on:
using go versions: 1.22.6 and 1.23.3
What did you see happen?
I saw connect hanging after a few hundred or a few thousand requests.
In the following strace -fTtt output of the client, I see that the epoll event for writing (EPOLLOUT) is received by a PID different from the PID that called connect, and then a new epoll_pwait is called from the PID that called connect, this time waiting forever:
And the program hangs from then on.
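One way to make such a wait observable instead of indefinite is to bound it with a write deadline. The sketch below is an assumption and not something the example client does: waitWritable, isTimeout and demoTimeout are hypothetical names, and a pipe whose callback never reports readiness stands in for a socket whose connect never appears to complete.

```go
package main

import (
	"errors"
	"os"
	"time"
)

// waitWritable (hypothetical name) runs ready through RawConn.Write but gives
// up once timeout has elapsed, so a lost readiness event shows up as an error.
func waitWritable(f *os.File, timeout time.Duration, ready func(fd uintptr) bool) error {
	rc, err := f.SyscallConn()
	if err != nil {
		return err
	}
	if err := f.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	defer f.SetWriteDeadline(time.Time{}) // clear the deadline afterwards
	return rc.Write(ready)
}

// isTimeout reports whether err was caused by the deadline expiring.
func isTimeout(err error) bool {
	return errors.Is(err, os.ErrDeadlineExceeded)
}

// demoTimeout simulates a wait that never completes: the callback always
// returns false, so RawConn.Write keeps waiting until the deadline fires.
func demoTimeout() error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	defer r.Close()
	defer w.Close()
	return waitWritable(w, 50*time.Millisecond, func(uintptr) bool { return false })
}
```

With a deadline in place, a stuck connect would return an error that can be logged together with the iteration number, which may be easier to correlate with the strace output than a silent hang.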
What did you expect to see?
I expect all 100_000 connect and write cycles to pass successfully.
I expect to be able to use the non-blocking connect with os.File reliably.
Please suggest if there is some other proper way of doing this.