FreeBSD sporadic EADDRINUSE #1694

bradfitz · 2011-04-13T23:23:20Z

The FreeBSD builder sporadically fails with (among other things) EADDRINUSE errors.

e.g.
Change 4e18f60442c2 broke the freebsd-386 build:
http://godashboard.appspot.com/log/c55befefa54d803c1b6327356ac0db98109203d2f462e9efb547462e7cc3f948

... shows a net.Dial failing.

This test program (attached) will reproduce it after a few tries on a 64-bit FreeBSD
8.0-RELEASE-p4 box:

[bradfitz@ch ~]$ ./stress 
2011/04/14 09:22:34 at 0
2011/04/14 09:22:34 at 100
2011/04/14 09:22:34 at 200
2011/04/14 09:22:34 Accept: dial tcp 127.0.0.1:57021: address already in use

Output also includes sometimes:

...
2011/04/14 09:16:49 at 9400
2011/04/14 09:16:49 at 9500
2011/04/14 09:16:49 at 9600
2011/04/14 09:16:49 Accept: dial tcp 127.0.0.1:38028: connection reset by peer

Attachments:

stress.go (590 bytes)

bradfitz · 2011-04-13T23:24:04Z

Comment 1:

Owner changed to ---.

Status changed to New.

davecheney · 2011-04-13T23:39:09Z

Comment 2:

http://code.google.com/p/go/issues/attachmentText?id=1694&aid=7352691858693800505&name=stress.go&token=61fdba68ed6cb1d98fc58743e3f81265#19
o_O How is that possible ? When you bind a socket with a port of 0 the kernel is
supposed to choose a unique port on your behalf.

bradfitz · 2011-04-13T23:43:21Z

Comment 3:

Btw, I messed up line 30's error message.
The error is a dial error (ignore "Accept") in:
2011/04/14 09:22:34 Accept: dial tcp 127.0.0.1:57021: address already in use
But you're right about "How is that possible?".  That is indeed the mystery.

davecheney · 2011-04-14T00:52:24Z

Comment 4:

After hammering this for a while on a 8.2-PRERELEASE host I got this
2011/04/13 23:49:53 at 13500
2011/04/13 23:49:53 at 13600
2011/04/13 23:49:53 at 13700
2011/04/13 23:49:53 Accept: accept tcp4 127.0.0.1:15132: too many open files in system
Could it be that the sockets are not being closed, or this is some kind of SO_LINGER
behaviour ?

bradfitz · 2011-04-14T00:57:37Z

Comment 5:

Yeah, that may not be relevant to the bug being sought.
Instead of listening on 127.0.0.1 we could pick 3 random lower octets per run. 
127.0.0.0/8 all work.

davecheney · 2011-04-16T09:13:33Z

Comment 6:

I've been playing with this on a FreeBSD-8.2-PRE host and am only able to trigger it
very sporadically. Specifically, once it's triggered the issue goes into hiatus for a
few minutes, not that it helps with CI builds. I have a suspicion that this bug may not
be triggered if the socket is bound to 0.0.0.0:0, but it's proven hard to demonstrate as
I can't reliably reproduce the error.
Binding to 0.0.0.0 will drive the the Darwin users with strict firewall settings
homicidal, so I'd like to propose submitting Cl that switches on syscall.OS and
substitutes net.IPv4zero for the 127.0.0.1 when a BSD system is present.

rsc · 2011-04-16T13:30:28Z

Comment 7:

No thanks.  Even without Darwin I don't think tests should
make a habit of announcing on non-loopback addresses.

davecheney · 2011-04-16T14:27:14Z

Comment 8:

Thanks Russ, that is good advice.

davecheney · 2011-04-18T00:32:55Z

Comment 9:

Hi Brad,
I think there is a small mistake in the test code in stress.go; c is not being closed
after every run of the loop, so it's a race between the GC to finalise connections
faster than the loop can create them.
Cheers
Dave

davecheney · 2011-04-18T03:46:29Z

Comment 10:

Digging into this a bit more, I don't think that net.netFD's are finalized (except under
windows) so my best guess is this is a race between the OS clearing out sockets in
TIME_WAIT and the test code creating more of them.
After changing stress.go to close each connection and running ./6.out in a loop for an
hour, I haven't been able to reproduce the issue.

bradfitz · 2011-04-22T00:47:52Z

Comment 11:

With
package main
import (
    "log"
    "net"
)
func server(ls net.Listener) {
    for {
        c, err := ls.Accept()
        if err != nil {
            log.Fatalf("Accept: %v", err)
        }
        c.Close()
    }
}
func main() {
    ls, err := net.ListenTCP("tcp4", &net.TCPAddr{net.ParseIP("127.0.0.1"), 0})
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    go server(ls)
    for i := 0; ; i++ {
        if i % 100 == 0 {
            log.Printf("at %d", i)
        }
        c, err := net.Dial("tcp", ls.Addr().String())
        if err != nil {
                        log.Fatalf("Accept: %v", err)
                }
        var buf [1]byte
        c.Read(buf[:])
        c.Close()
    }
}
.... it now runs until 50,000-ish and hangs, usually 55500.  But if I control-C and
re-run, it gets right back to the same spot quickly and hangs again.
You have this running for long periods of time without hanging?

davecheney · 2011-04-22T00:54:05Z

Comment 12:

I put the original test code into a while ./6.out loop so it was only testing the
original 20,000 test code. 
If it now hangs at 50,000 conns is this a listen backlog or TIME_WAIT issue?
Dave
Sent from my C64

bradfitz · 2011-04-22T01:54:19Z

Comment 13:

If it's a TIME_WAIT issue (which is in the kernel, right?) then it's weird that
Control-C fixes it. Unless maybe on FreeBSD all those network tuples get freed when the
process dies, which I wouldn't expect but I have no clue about FreeBSD.

davecheney · 2011-04-22T05:09:07Z

Comment 14:

Hi Brad,
Your right, when the program freezes there are only around 3500 sockets in TIME_WAIT on
my system. Curiously in testing this I saw hangs at 60,600 and 13,500 so I don't think
there is anything special about the number of connections.
In my testing, while ./6.out stopped processing connections, I could telnet to the
accepting socket and it picked up my connection, read a byte, then closed the socket as
expected, but that did not unblock the test program.
I ran this code on a darwin/6g system and got the familar
2011/04/22 15:03:24 at 16200
2011/04/22 15:03:24 at 16300
2011/04/22 15:03:24 Accept: dial tcp 127.0.0.1:51630: address already in use
Have you run this test code on and linux hosts ? I'm trying now on an 8g and a 6g system
but the results are inconclusive so far. If you can't reproduce the hang on a linux
system then I would like to close this issue and open another one against the kqueue
pollsters in fd_darwin.go and fd_freebsd.go. What do you think ?

bradfitz · 2011-04-22T16:01:42Z

Comment 15:

Again:
Change 4bfb4ffacef6 broke the freebsd-386 build:
http://godashboard.appspot.com/log/cb0fb10ddfc730f7c0c637c3675279474a4569093f4bb0bc034114f855dcb2ed

bradfitz · 2011-04-22T18:13:24Z

Comment 16:

Running the code from comment 11 on linux/6g, it's still running strong (no slow-downs)
past 750,000 iterations.
And this is holding steady at around 28,230 +/- a couple:
$ sudo netstat -n | grep TIME_WAIT | grep 127.0.0.1: | wc -l
28230
So Linux must re-cycle things in TIME_WAIT quicker.
This doesn't really help us, though.

mikioh · 2011-04-27T08:30:25Z

Comment 17:

Seems like an old friend.
I've just created a wild hotfix as <http://golang.org/cl/4445067>; but I'm
not sure whether this is a correct way as for Go standard library.

mikioh · 2011-04-27T08:52:20Z

Comment 18:

Might be better to modify test codes a bit than 4445067?

bradfitz · 2011-04-28T18:27:53Z

Comment 19:

Mikioh, your change in 4445067 seems sane.  You happy enough with it to propose it for
inclusion?
What tests did you want to add?  (I didn't understand your comment #18)

mikioh · 2011-04-29T06:40:13Z

Comment 20:

Hi Brad,
Okay, I will push 445067 to review process.
What I want to say in comment #18 is, I'm a bit afraid of 445067's side effect which
describes in old textbook like below.
<q>
From UNIX Network Programming Volume 1, Third Edition: The Sockets Networking API:
4.4BSD introduced the SO_REUSEPORT socket option when support for
multicasting was added. Instead of overloading SO_REUSEADDR with the
desired multicast semantics that allow completely duplicate bindings,
this new socket option was introduced with the following semantics:
1) This option allows completely duplicate bindings, but only if each
socket that wants to bind the same IP address and port specify this
socket option.
2) SO_REUSEADDR is considered equivalent to SO_REUSEPORT if the IP
address being bound is a multicast address (p. 731 of TCPv2).
</q>

rsc · 2011-05-02T14:50:21Z

Comment 21:

This issue was closed by revision bc92671.

Status changed to Fixed.

bradfitz added fixed labels May 2, 2011

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FreeBSD sporadic EADDRINUSE #1694

FreeBSD sporadic EADDRINUSE #1694

bradfitz commented Apr 13, 2011

bradfitz commented Apr 13, 2011

davecheney commented Apr 13, 2011

bradfitz commented Apr 13, 2011

davecheney commented Apr 14, 2011

bradfitz commented Apr 14, 2011

davecheney commented Apr 16, 2011

rsc commented Apr 16, 2011

davecheney commented Apr 16, 2011

davecheney commented Apr 18, 2011

davecheney commented Apr 18, 2011

bradfitz commented Apr 22, 2011

davecheney commented Apr 22, 2011

bradfitz commented Apr 22, 2011

davecheney commented Apr 22, 2011

bradfitz commented Apr 22, 2011

bradfitz commented Apr 22, 2011

mikioh commented Apr 27, 2011

mikioh commented Apr 27, 2011

bradfitz commented Apr 28, 2011

mikioh commented Apr 29, 2011

rsc commented May 2, 2011

FreeBSD sporadic EADDRINUSE #1694

FreeBSD sporadic EADDRINUSE #1694

Comments

bradfitz commented Apr 13, 2011

bradfitz commented Apr 13, 2011

davecheney commented Apr 13, 2011

bradfitz commented Apr 13, 2011

davecheney commented Apr 14, 2011

bradfitz commented Apr 14, 2011

davecheney commented Apr 16, 2011

rsc commented Apr 16, 2011

davecheney commented Apr 16, 2011

davecheney commented Apr 18, 2011

davecheney commented Apr 18, 2011

bradfitz commented Apr 22, 2011

davecheney commented Apr 22, 2011

bradfitz commented Apr 22, 2011

davecheney commented Apr 22, 2011

bradfitz commented Apr 22, 2011

bradfitz commented Apr 22, 2011

mikioh commented Apr 27, 2011

mikioh commented Apr 27, 2011

bradfitz commented Apr 28, 2011

mikioh commented Apr 29, 2011

rsc commented May 2, 2011