net: do not wait on EAGAIN from lossy packet conn #5932

Closed
gopherbot opened this issue Jul 22, 2013 · 29 comments

@gopherbot commented Jul 22, 2013

by swhite@fitstar.com:

When writing to a connection created by net.Dial("unixgram", "/dev/log") I will occasionally get an EAGAIN returned. When that happens, the internal WaitWrite call blocks forever if no deadline is set. I can break out of it by setting a deadline. Writing to the connection after the timeout error works, so it looks like WaitWrite is not working properly for this type of socket. It happens for both the Write and the WriteToUnix calls.

I don't know how to provoke an EAGAIN from a unixgram socket, so I don't have a test case. It was very difficult to reproduce this against rsyslog. I'm hoping a socket expert can figure out an easy way to reproduce it.

This bug started on go1.1 and is reproducible on go1.1.1 (never saw it on 1.0.x).
linux/amd64 (Ubuntu 12.04)
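
To illustrate the pattern (a minimal sketch only, not our production code; the message text and loop count are made up), this is roughly what the writing path looks like, including the write deadline that lets it break out of the hang:

```go
// Minimal sketch of the pattern (illustrative only): write datagrams to
// /dev/log over a unixgram connection. Without a deadline, a Write that
// hits EAGAIN can block forever inside the runtime's WaitWrite; with the
// deadline it returns a timeout error and the next Write succeeds without
// reconnecting.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	msg := []byte("<14>sketch: hello from unixgram")
	for i := 0; i < 100000; i++ {
		// The workaround: a write deadline breaks out of the hang.
		conn.SetWriteDeadline(time.Now().Add(time.Second))
		if _, err := conn.Write(msg); err != nil {
			log.Printf("write %d: %v", i, err) // a timeout error instead of a permanent block
		}
	}
}
```
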
@alberts (Contributor) commented Jul 22, 2013

Comment 1:

Could you post a complete program?
@gopherbot (Author) commented Jul 23, 2013

Comment 2 by swhite@fitstar.com:

I could not reproduce this with any trivial implementation. The full setup has multiple server processes all writing to rsyslog and required a large number of concurrent requests before the issue occurred. I have tried to write several small tests that write to rsyslog or to a self-opened socket, but none have been able to reproduce the lockup.
I guess this is more of a reference in the hope that someone else runs across this with a simpler setup.
@rsc (Contributor) commented Jul 25, 2013

Comment 5:

If we get EAGAIN from a unixgram or udp write, I think we should just return the error.

Labels changed: added priority-later, go1.2, removed priority-triage.

Status changed to Accepted.

@dvyukov (Member) commented Jul 26, 2013

Comment 6:

> I have tried to write several small tests that write to rsyslog or to a self-opened socket but none have been able to reproduce the lockup.

Have you tried writing at the highest possible speed for a substantial amount of time? You should be able to send faster than syslog can process, so the buffers must overflow eventually.
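
Something along these lines might do it (a rough sketch, not attached to the issue; the writer count, message text, and 30-second duration are arbitrary): several goroutines, each with its own unixgram connection to /dev/log, writing as fast as they can so the socket buffer overflows:

```go
// Rough flood test: several goroutines, each with its own unixgram
// connection to /dev/log, writing as fast as possible for a fixed period.
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

func main() {
	const writers = 8
	msg := []byte("<14>flood: filling the syslog socket buffer")
	stop := time.Now().Add(30 * time.Second)

	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			conn, err := net.Dial("unixgram", "/dev/log")
			if err != nil {
				fmt.Println("dial:", err)
				return
			}
			defer conn.Close()
			n := 0
			for time.Now().Before(stop) {
				// No deadline here on purpose: if the bug triggers, this
				// Write blocks forever and the goroutine never prints.
				if _, err := conn.Write(msg); err != nil {
					fmt.Printf("writer %d: error after %d writes: %v\n", id, n, err)
					return
				}
				n++
			}
			fmt.Printf("writer %d: finished %d writes\n", id, n)
		}(i)
	}
	wg.Wait()
}
```
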
@dvyukov (Member) commented Jul 26, 2013

Comment 7:

So we don't get epoll notifications for udp/unixgram sockets after EAGAIN?
Returning the error instantly is probably OK for unreliable protocols. But this is a
subtle change in behavior. E.g. a program that checks return codes will crash...
@gopherbot (Author) commented Jul 26, 2013

Comment 8 by swhite@fitstar.com:

I did a little more research, and most of the time EAGAIN from a unixgram socket works as expected (it does not block forever). In fact, I can't get it to reproduce with anything but our full-stack test setup: an HTTP load balancer/reverse proxy -> server under test with multiple writers to syslog, all on the same box.
It only takes about 30s of load testing to make it happen in our test environment, so it is 100% reproducible under the right conditions.
@dvyukov (Member) commented Jul 26, 2013

Comment 9:

Is it possible that the remote side closes the socket, but we miss the close notification (rather than the write notification)?
@gopherbot (Author) commented Jul 26, 2013

Comment 10 by swhite@fitstar.com:

Here are the things I tried that did NOT work to reproduce the bug. FYI, none of these actually run on the playground since they need file or syslog access.
This code produces an EAGAIN from a unixgram socket: http://play.golang.org/p/lf1gs3KGvO
This code writes to rsyslog: http://play.golang.org/p/lyws2UQj1O
This code uses the library I was using in our code to write to syslog: http://play.golang.org/p/Am0SmrgY4f
(The current version of the timber library has been patched to avoid the issue entirely, but I disabled that patch when I tested.)
@gopherbot (Author) commented Jul 26, 2013

Comment 11 by swhite@fitstar.com:

No, the socket is open and ready. With the timeout set, the code breaks out of the WaitWrite and I can successfully write to the socket again without any reconnect.
One other piece of information: rsyslog is throttling during the test. But I don't think that's the cause, because it also throttles in both of the examples I just posted and the EAGAINs don't block forever in those.
@gopherbot (Author) commented Jul 26, 2013

Comment 12 by swhite@fitstar.com:

Here's a stacktrace of a blocked process:
http://pastebin.com/8exhXBYE
Goroutine 4 is the one that's blocked. The code is part of the github.com/fitstar/timber library. There's basically one goroutine draining a channel and writing to syslog. A bunch of other goroutines are blocked waiting to write to that channel, and the situation never resolves even with no additional load.
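
For context, the logging path is shaped roughly like this (a simplified sketch, not the actual timber code; names and counts are invented): one goroutine drains a channel and writes to the unixgram connection, so a single stuck Write backs up every producer behind it:

```go
// Simplified shape of the logging path (not the actual timber code):
// one goroutine drains a channel and writes to the unixgram connection;
// many producers send into the channel. A single stuck Write means the
// channel fills and every producer blocks behind it.
package main

import (
	"fmt"
	"net"
	"sync"
)

func main() {
	conn, err := net.Dial("unixgram", "/dev/log")
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer conn.Close()

	msgs := make(chan string, 100)

	// The single writer goroutine: if this Write hangs, nothing drains msgs.
	go func() {
		for m := range msgs {
			if _, err := conn.Write([]byte(m)); err != nil {
				fmt.Println("write:", err)
			}
		}
	}()

	// Many producers; once the channel buffer is full they all block on send.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				msgs <- fmt.Sprintf("<14>producer %d message %d", id, j)
			}
		}(i)
	}
	wg.Wait()
	close(msgs)
	// (A real logger would also wait for the drain to finish; omitted here.)
}
```
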
@rsc (Contributor) commented Sep 11, 2013

Comment 13:

Labels changed: added go1.2maybe, removed go1.2.

@rsc (Contributor) commented Sep 16, 2013

Comment 14:

Too late.

Labels changed: added go1.3maybe, removed go1.2maybe.

@rsc (Contributor) commented Dec 4, 2013

Comment 15:

Labels changed: added release-none, removed go1.3maybe.

@rsc (Contributor) commented Dec 4, 2013

Comment 16:

Labels changed: added repo-main.

@gopherbot (Author) commented Mar 31, 2014

Comment 17 by chiparus:

Hello, I'm running into this issue too.
I have an application using multiple connections to syslog (/dev/log), and with just a little volume it stalls after only a few messages. Using only one connection the problem is harder to reproduce.
This program has a 100% reproduction rate for me: http://play.golang.org/p/vu12qIyPil
I tried running it on multiple servers, different rsyslog versions, etc. It doesn't really matter whether rsyslog is configured to drop (rate-limit) messages or not. Even with a tuned rsyslog config this script hangs within 10k messages.
I can send millions of messages repeatedly in one goroutine / one syslog connection without problems. Using 2 goroutines / two syslog connections it hangs within 10k messages.
@ianlancetaylor (Contributor) commented Mar 31, 2014

Comment 18:

Labels changed: added suggested, removed priority-later.

@mikioh (Contributor) commented Apr 1, 2014

Comment 19:

Not sure I understand the issue correctly, but if it's a simple deadlock on resource starvation, perhaps returning an error from Write/WriteTo/WriteMsg might mitigate the unexpected consequences. FWIW, the attached program shows the difference between Linux and OS X Mavericks. On Linux, the unixgram writer blocks on write waiting for the far endpoint to read (I mean, EAGAIN). On Darwin, the writer eventually gets a "no buffer space available" error.

Attachments:

  1. unixgram.go (1161 bytes)
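
(The attachment isn't reproduced here; the experiment is presumably along these lines, in a sketch of my own with an arbitrary socket path and message size: bind a unixgram socket that never reads, dial it, and write until the receive buffer fills. On Linux the writer then blocks in the poller after EAGAIN; on Darwin it eventually gets "no buffer space available".)

```go
// Sketch of the experiment (not the attached unixgram.go; path and message
// size are arbitrary): bind a unixgram socket that never reads, dial it,
// and write until the receive buffer fills. On Linux the writer blocks in
// the poller after EAGAIN; on Darwin it eventually gets ENOBUFS.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	path := "/tmp/unixgram-sketch.sock"
	os.Remove(path)

	laddr, err := net.ResolveUnixAddr("unixgram", path)
	if err != nil {
		fmt.Println(err)
		return
	}
	reader, err := net.ListenUnixgram("unixgram", laddr)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer reader.Close()
	defer os.Remove(path)

	writer, err := net.Dial("unixgram", path)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer writer.Close()

	msg := make([]byte, 512)
	for i := 0; ; i++ {
		if _, err := writer.Write(msg); err != nil {
			// Darwin: "no buffer space available". Linux: this line is never
			// reached without a deadline, the Write just blocks.
			fmt.Printf("write %d failed: %v\n", i, err)
			return
		}
	}
}
```
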
@gopherbot (Author) commented Jun 10, 2014

Comment 20 by lk4d4math:

Hi, I've reproduced this issue on two machines, with rsyslog and with syslog-ng, just by writing to syslog very fast. Also, there can be two Writers to "/dev/log" - one can hang while the other stays alive.
I copied the "log/syslog" code, and adding SetWriteDeadline in writeString worked for me.
@gopherbot (Author) commented Jul 25, 2014

Comment 21 by rsr@skyportsystems.com:

Please see https://groups.google.com/forum/#!topic/Golang-Nuts/PMm8nH0yaoA for more
information; we are also seeing this.  
I don't think it's a simple deadlock, though.
@dvyukov (Member) commented Jul 26, 2014

Comment 22:

A common theme in all reproducers seems to be several fd's referring to the same datagram socket.
Linux has a known issue with epoll and several fd's referring to the same file description -- an fd is not unregistered from epoll when the fd is closed; it is unregistered only when the file description is closed (that is, when the last fd referring to it is closed). That's why we need to do epoll_ctl(EPOLL_CTL_DEL) when an fd is closed, even though the documentation says it's unnecessary.
I suspect this is a reincarnation of the same kernel bug, which leads to delivery of the epoll notification to a random (first?) fd referring to the unixgram socket (even if a different fd is the one that actually got EAGAIN). I don't have any evidence, though. But such kernel behavior would explain all the hangs that we see.
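
The epoll semantics described above can be demonstrated with a small Linux-only program (my own sketch, using a pipe rather than a unixgram socket): register an fd, dup it, close the registered fd without EPOLL_CTL_DEL, and epoll_wait still reports events because the underlying file description is still open.

```go
// Linux-only demonstration: an epoll registration is tied to the file
// description, not the fd, so closing the registered fd without
// EPOLL_CTL_DEL leaves the registration alive as long as a dup'd fd keeps
// the description open.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Non-blocking pipe; the empty write end is immediately writable.
	var p [2]int
	if err := syscall.Pipe2(p[:], syscall.O_NONBLOCK); err != nil {
		panic(err)
	}
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		panic(err)
	}
	ev := syscall.EpollEvent{Events: syscall.EPOLLOUT, Fd: int32(p[1])}
	if err := syscall.EpollCtl(epfd, syscall.EPOLL_CTL_ADD, p[1], &ev); err != nil {
		panic(err)
	}

	// Duplicate the write end: two fds now refer to one file description.
	dup, err := syscall.Dup(p[1])
	if err != nil {
		panic(err)
	}
	// Close the registered fd WITHOUT EPOLL_CTL_DEL.
	syscall.Close(p[1])

	// The registration survives because the file description is still open
	// via dup, so epoll keeps reporting writability for a closed fd number.
	events := make([]syscall.EpollEvent, 8)
	n, err := syscall.EpollWait(epfd, events, 1000)
	fmt.Println("events reported after close:", n, err)

	syscall.Close(dup)
	syscall.Close(epfd)
	syscall.Close(p[0])
}
```
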
@gopherbot (Author) commented Jul 26, 2014

Comment 23 by rsr@skyportsystems.com:

Oy, I think you're right. I did an updated tester that's a little more flexible, if it's of any interest; see http://play.golang.org/p/1q0vrfplVT (-multi, etc. - repro without -multi is easy).
running with -clients 8: failure every time
running with -clients 8 -multi: ran it around 10 times without an observed failure
running with _4 separate processes_, each -clients 1: failure in two of the four every few runs <----------- bad
It would be extremely useful if I could coax the runtime to dump (somewhere) the life of a descriptor, especially in terms of epoll activity. Otherwise it's time to get systemtap working...
I'll run an extended-duration series of tests to see if I can reproduce it with a single instance and -multi; it probably will not reveal anything worthwhile, but it can run by itself.
@dvyukov (Member) commented Jul 27, 2014

Comment 24:

Uh, if it fails with multiple processes (which is reasonable if the hypothesis is correct), then I do not see any sane way to actually fix it. We can return EAGAIN to the caller or add timeouts.
@a13xb commented Sep 16, 2014

Comment 25:

Setting a write deadline seems like a sensible fix to me. Write failure and reconnect are already handled in the existing code.
@dvyukov (Member) commented Sep 17, 2014

Comment 26:

I can't reproduce it in the following configuration:
go version devel +38f88b067042 Tue Sep 16 12:50:05 2014 -0700 linux/amd64
$ uname -a
Linux 3.13.0-32-generic Ubuntu SMP
which is Ubuntu 14.04
I run ./udptest in one console and then ./udptest -cmode -clients 8 in another. It always says SUCCESS.
I have 2 hypotheses:
1. We now attach sockets to epoll after connect, which is what the kernel people told me to do, so that may have fixed the issue.
2. It is a kernel bug that has been fixed in recent kernels.
@driskell commented Jan 19, 2015

Hello. A user of mine is seeing an issue similar to the one in Comment 12. Is there any useful information we can provide to help with this? Thanks.

@chendo commented Jan 19, 2015

I've also seen this happen with our setup, where we mount /dev/log into our Docker containers and write to it.

@a13xb commented Sep 30, 2015

This may have been fixed somewhere between 1.3 and 1.3.1. The sample from comment 17 fails 100% of the time on 1.3 and succeeds 100% of the time on 1.3.1 for me.

@ianlancetaylor (Contributor) commented Sep 30, 2015

@driskell @chendo which version of Go are you using when you see the problem? Which kernel version? Thanks.

@bradfitz (Member) commented Oct 9, 2015

Closing due to timeout.

@bradfitz closed this Oct 9, 2015

@golang locked and limited conversation to collaborators Oct 9, 2016
