Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FreeBSD sporadic EADDRINUSE #1694

Closed
bradfitz opened this issue Apr 13, 2011 · 21 comments
Closed

FreeBSD sporadic EADDRINUSE #1694

bradfitz opened this issue Apr 13, 2011 · 21 comments

Comments

@bradfitz
Copy link
Contributor

The FreeBSD builder sporadically fails with (among other things) EADDRINUSE errors.

e.g.
Change 4e18f60442c2 broke the freebsd-386 build:
http://godashboard.appspot.com/log/c55befefa54d803c1b6327356ac0db98109203d2f462e9efb547462e7cc3f948

... shows a net.Dial failing.

This test program (attached) will reproduce it after a few tries on a 64-bit FreeBSD
8.0-RELEASE-p4 box:

[bradfitz@ch ~]$ ./stress 
2011/04/14 09:22:34 at 0
2011/04/14 09:22:34 at 100
2011/04/14 09:22:34 at 200
2011/04/14 09:22:34 Accept: dial tcp 127.0.0.1:57021: address already in use

Output also includes sometimes:

...
2011/04/14 09:16:49 at 9400
2011/04/14 09:16:49 at 9500
2011/04/14 09:16:49 at 9600
2011/04/14 09:16:49 Accept: dial tcp 127.0.0.1:38028: connection reset by peer

Attachments:

  1. stress.go (590 bytes)
@bradfitz
Copy link
Contributor Author

Comment 1:

Owner changed to ---.

Status changed to New.

@davecheney
Copy link
Contributor

Comment 2:

http://code.google.com/p/go/issues/attachmentText?id=1694&aid=7352691858693800505&name=stress.go&token=61fdba68ed6cb1d98fc58743e3f81265#19
o_O How is that possible ? When you bind a socket with a port of 0 the kernel is
supposed to choose a unique port on your behalf.

@bradfitz
Copy link
Contributor Author

Comment 3:

Btw, I messed up line 30's error message.
The error is a dial error (ignore "Accept") in:
2011/04/14 09:22:34 Accept: dial tcp 127.0.0.1:57021: address already in use
But you're right about "How is that possible?".  That is indeed the mystery.

@davecheney
Copy link
Contributor

Comment 4:

After hammering this for a while on a 8.2-PRERELEASE host I got this
2011/04/13 23:49:53 at 13500
2011/04/13 23:49:53 at 13600
2011/04/13 23:49:53 at 13700
2011/04/13 23:49:53 Accept: accept tcp4 127.0.0.1:15132: too many open files in system
Could it be that the sockets are not being closed, or this is some kind of SO_LINGER
behaviour ?

@bradfitz
Copy link
Contributor Author

Comment 5:

Yeah, that may not be relevant to the bug being sought.
Instead of listening on 127.0.0.1 we could pick 3 random lower octets per run. 
127.0.0.0/8 all work.

@davecheney
Copy link
Contributor

Comment 6:

I've been playing with this on a FreeBSD-8.2-PRE host and am only able to trigger it
very sporadically. Specifically, once it's triggered the issue goes into hiatus for a
few minutes, not that it helps with CI builds. I have a suspicion that this bug may not
be triggered if the socket is bound to 0.0.0.0:0, but it's proven hard to demonstrate as
I can't reliably reproduce the error.
Binding to 0.0.0.0 will drive the the Darwin users with strict firewall settings
homicidal, so I'd like to propose submitting Cl that switches on syscall.OS and
substitutes net.IPv4zero for the 127.0.0.1 when a BSD system is present.

@rsc
Copy link
Contributor

rsc commented Apr 16, 2011

Comment 7:

No thanks.  Even without Darwin I don't think tests should
make a habit of announcing on non-loopback addresses.

@davecheney
Copy link
Contributor

Comment 8:

Thanks Russ, that is good advice.

@davecheney
Copy link
Contributor

Comment 9:

Hi Brad,
I think there is a small mistake in the test code in stress.go; c is not being closed
after every run of the loop, so it's a race between the GC to finalise connections
faster than the loop can create them.
Cheers
Dave

@davecheney
Copy link
Contributor

Comment 10:

Digging into this a bit more, I don't think that net.netFD's are finalized (except under
windows) so my best guess is this is a race between the OS clearing out sockets in
TIME_WAIT and the test code creating more of them.
After changing stress.go to close each connection and running ./6.out in a loop for an
hour, I haven't been able to reproduce the issue.

@bradfitz
Copy link
Contributor Author

Comment 11:

With
package main
import (
    "log"
    "net"
)
func server(ls net.Listener) {
    for {
        c, err := ls.Accept()
        if err != nil {
            log.Fatalf("Accept: %v", err)
        }
        c.Close()
    }
}
func main() {
    ls, err := net.ListenTCP("tcp4", &net.TCPAddr{net.ParseIP("127.0.0.1"), 0})
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    go server(ls)
    for i := 0; ; i++ {
        if i % 100 == 0 {
            log.Printf("at %d", i)
        }
        c, err := net.Dial("tcp", ls.Addr().String())
        if err != nil {
                        log.Fatalf("Accept: %v", err)
                }
        var buf [1]byte
        c.Read(buf[:])
        c.Close()
    }
}
.... it now runs until 50,000-ish and hangs, usually 55500.  But if I control-C and
re-run, it gets right back to the same spot quickly and hangs again.
You have this running for long periods of time without hanging?

@davecheney
Copy link
Contributor

Comment 12:

I put the original test code into a while ./6.out loop so it was only testing the
original 20,000 test code. 
If it now hangs at 50,000 conns is this a listen backlog or TIME_WAIT issue?
Dave
Sent from my C64

@bradfitz
Copy link
Contributor Author

Comment 13:

If it's a TIME_WAIT issue (which is in the kernel, right?) then it's weird that
Control-C fixes it. Unless maybe on FreeBSD all those network tuples get freed when the
process dies, which I wouldn't expect but I have no clue about FreeBSD.

@davecheney
Copy link
Contributor

Comment 14:

Hi Brad,
Your right, when the program freezes there are only around 3500 sockets in TIME_WAIT on
my system. Curiously in testing this I saw hangs at 60,600 and 13,500 so I don't think
there is anything special about the number of connections.
In my testing, while ./6.out stopped processing connections, I could telnet to the
accepting socket and it picked up my connection, read a byte, then closed the socket as
expected, but that did not unblock the test program.
I ran this code on a darwin/6g system and got the familar
2011/04/22 15:03:24 at 16200
2011/04/22 15:03:24 at 16300
2011/04/22 15:03:24 Accept: dial tcp 127.0.0.1:51630: address already in use
Have you run this test code on and linux hosts ? I'm trying now on an 8g and a 6g system
but the results are inconclusive so far. If you can't reproduce the hang on a linux
system then I would like to close this issue and open another one against the kqueue
pollsters in fd_darwin.go and fd_freebsd.go. What do you think ?

@bradfitz
Copy link
Contributor Author

Comment 15:

Again:
Change 4bfb4ffacef6 broke the freebsd-386 build:
http://godashboard.appspot.com/log/cb0fb10ddfc730f7c0c637c3675279474a4569093f4bb0bc034114f855dcb2ed

@bradfitz
Copy link
Contributor Author

Comment 16:

Running the code from comment 11 on linux/6g, it's still running strong (no slow-downs)
past 750,000 iterations.
And this is holding steady at around 28,230 +/- a couple:
$ sudo netstat -n | grep TIME_WAIT | grep 127.0.0.1: | wc -l
28230
So Linux must re-cycle things in TIME_WAIT quicker.
This doesn't really help us, though.

@mikioh
Copy link
Contributor

mikioh commented Apr 27, 2011

Comment 17:

Seems like an old friend.
I've just created a wild hotfix as <http://golang.org/cl/4445067>; but I'm
not sure whether this is a correct way as for Go standard library.

@mikioh
Copy link
Contributor

mikioh commented Apr 27, 2011

Comment 18:

Might be better to modify test codes a bit than 4445067?

@bradfitz
Copy link
Contributor Author

Comment 19:

Mikioh, your change in 4445067 seems sane.  You happy enough with it to propose it for
inclusion?
What tests did you want to add?  (I didn't understand your comment #18)

@mikioh
Copy link
Contributor

mikioh commented Apr 29, 2011

Comment 20:

Hi Brad,
Okay, I will push 445067 to review process.
What I want to say in comment #18 is, I'm a bit afraid of 445067's side effect which
describes in old textbook like below.
<q>
From UNIX Network Programming Volume 1, Third Edition: The Sockets Networking API:
4.4BSD introduced the SO_REUSEPORT socket option when support for
multicasting was added. Instead of overloading SO_REUSEADDR with the
desired multicast semantics that allow completely duplicate bindings,
this new socket option was introduced with the following semantics:
1) This option allows completely duplicate bindings, but only if each
socket that wants to bind the same IP address and port specify this
socket option.
2) SO_REUSEADDR is considered equivalent to SO_REUSEPORT if the IP
address being bound is a multicast address (p. 731 of TCPv2).
</q>

@rsc
Copy link
Contributor

rsc commented May 2, 2011

Comment 21:

This issue was closed by revision bc92671.

Status changed to Fixed.

@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants