net: concurrent Accept-Close with no timeout on TCPListener defers closing underlying socket #10527

Open
ggriffiniii opened this Issue Apr 21, 2015 · 14 comments

@ggriffiniii

ggriffiniii commented Apr 21, 2015

Go version: go1.4.2
OS: Linux 3.13 x86_64

The following test demonstrates what I consider a bug:

package foo

import (
    "log"
    "net"
    "testing"
)

const addr = "127.0.0.1:12345"

func TestXxx(t *testing.T) {
    for i := 0; i < 100000; i++ {
        log.Printf("Iteration %d", i+1)
        l, err := net.Listen("tcp4", addr)
        if err != nil {
            t.Fatalf("Failed to listen on %s: %v", addr, err)
        }
        go func(l net.Listener) {
            for {
                _, err := l.Accept()
                if err != nil {
                    return
                }
            }
        }(l)
        if err := l.Close(); err != nil {
            t.Fatalf("Failed to close server: %v", err)
        }
    }
}

Briefly: within a loop it Listens on TCP port 12345, spawns a goroutine that calls Accept on the listener, and then Closes the listener. My expectation was that this would always pass, regardless of scheduler behavior or number of loop iterations, as I expected Close() to return only once the underlying socket is no longer in use.

Instead what I see is that it will sometimes pass, but often fails on the net.Listen call of subsequent loop iterations with an error that the address is already in use. I believe I understand the cause of the failure. netFD underlies the TCPListener and is a reference-counted object. A goroutine blocked in Accept increments the reference count. Close() flips a flag on the netFD to mark the socket closed, which makes other netFD operations like Accept return errors indicating the socket is closed. The underlying OS socket is only closed once the reference count on the netFD reaches zero.

The end result is that the OS socket is only closed after TCPListener.Close() is invoked AND all goroutines blocked in TCPListener.Accept() have been scheduled and returned the expected error. More intuitive behavior would be for TCPListener.Close() to immediately close the underlying OS socket, and for goroutines blocked in TCPListener.Accept() to return errors whenever they are next scheduled.
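[Editor's note: a minimal sketch of one workaround under the current semantics — have the acceptor goroutine signal when it has returned from Accept, and wait for that signal after Close. The `demo` function and the ephemeral `127.0.0.1:0` address are illustrative choices, not from the report.]

```go
package main

import (
	"fmt"
	"net"
)

// demo closes a listener and then waits for the acceptor goroutine to
// observe the close, so the OS socket is released before demo returns.
func demo() string {
	l, err := net.Listen("tcp4", "127.0.0.1:0") // ephemeral port for the sketch
	if err != nil {
		panic(err)
	}
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			c, err := l.Accept()
			if err != nil {
				return // Accept fails once the listener is closed
			}
			c.Close()
		}
	}()
	if err := l.Close(); err != nil {
		panic(err)
	}
	// Once the goroutine exits, the netFD refcount has dropped and the
	// underlying socket is actually closed, so the address can be reused.
	<-done
	return "listener fully closed"
}

func main() {
	fmt.Println(demo())
}
```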

@bradfitz

Member

bradfitz commented Apr 21, 2015

Can you reproduce with Go tip as well?

/cc @mikioh @dvyukov

@ggriffiniii

ggriffiniii commented Apr 21, 2015

Just reproduced with tip also

$ go version
go version devel +e589e08 Tue Apr 21 20:50:23 2015 +0000 linux/amd64

@bradfitz bradfitz added this to the Go1.5Maybe milestone Apr 21, 2015

@mikioh

Contributor

mikioh commented Apr 21, 2015

Perhaps, sometimes Accept takes the shared fdMutex in the Listener when you call Close. A workaround would be to use SetReadDeadline with an appropriate value just before Accept, or to wait for the acceptor goroutine to stop before calling Close.
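[Editor's note: the exported deadline method on *net.TCPListener is SetDeadline. A hedged sketch of the deadline workaround — the ephemeral port and the 50 ms duration are arbitrary choices for illustration:]

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// demo shows the deadline workaround: an armed deadline bounds how long
// Accept can block (holding the netFD's read lock) before it returns a
// timeout error.
func demo() string {
	l, err := net.Listen("tcp4", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer l.Close()
	tl := l.(*net.TCPListener)
	if err := tl.SetDeadline(time.Now().Add(50 * time.Millisecond)); err != nil {
		panic(err)
	}
	_, err = tl.Accept()
	if ne, ok := err.(net.Error); ok && ne.Timeout() {
		return "accept timed out"
	}
	return fmt.Sprintf("unexpected: %v", err)
}

func main() {
	fmt.Println(demo())
}
```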

@mikioh mikioh changed the title from TCPListener.Close returns before the socket is closed to net: concurrent Accept-Close with no timeout on TCPListener defers closing underlying socket Apr 21, 2015

@ggriffiniii

ggriffiniii commented Apr 21, 2015

I don't agree that a timeout is a solution here. That only bounds the time between Listener.Close() and when the underlying socket is closed; it doesn't improve the situation in any other way. Regarding waiting for the acceptor goroutine to stop before calling Close: that doesn't seem like an easily implementable (or common) pattern. Assuming the goroutine is blocked on Accept(), how does one signal that goroutine to exit? That's exactly what Close() is typically used for. I've hacked together a net.Listener wrapper that seems to do what I want, but it seems like it would be better if these were the semantics provided by the standard library.

// NewSaneListener returns a new SaneListener around the provided net.Listener.
func NewSaneListener(l net.Listener) net.Listener {
        return &SaneListener{l: l, c: sync.NewCond(&sync.Mutex{})}
}

// SaneListener wraps a net.Listener and ensures that once SaneListener.Close
// returns the underlying socket has been closed.
type SaneListener struct {
        l        net.Listener
        c        *sync.Cond
        refCount int
}

func (s *SaneListener) incRef() {
        s.c.L.Lock()
        s.refCount++
        s.c.L.Unlock()
}

func (s *SaneListener) decRef() {
        s.c.L.Lock()
        s.refCount--
        s.c.Broadcast()
        s.c.L.Unlock()
}

// Accept waits for and returns the next connection to the listener.
func (s *SaneListener) Accept() (net.Conn, error) {
        s.incRef()
        defer s.decRef()
        return s.l.Accept()
}

// Close closes the listener.
// Any blocked Accept operations will be unblocked and return errors.
func (s *SaneListener) Close() error {
        err := s.l.Close()
        if err == nil {
                s.c.L.Lock()
                for s.refCount > 0 {
                        s.c.Wait()
                }
                s.c.L.Unlock()
        }
        return err
}

// Addr returns the listener's network address.
func (s *SaneListener) Addr() net.Addr {
        return s.l.Addr()
}

I'm including the code above mostly to demonstrate the behavior I would like to see from the default Listener. I would need to vet/test it much more thoroughly before trusting it in a production environment.

@ggriffiniii

ggriffiniii commented Apr 22, 2015

Perhaps, sometimes Accept takes the shared fdMutex in Listener when you call Close

I don't think it's a "sometimes" thing. I believe that every time a Close is concurrent with an Accept, the Close will return before the underlying socket is closed. The only thing that makes the test not fully deterministic is that sometimes the acceptor goroutine hasn't run yet when Close is invoked. I've modified the test to demonstrate this better:

package foo

import (
    "net"
    "testing"
)

const addr = "127.0.0.1:12345"

func TestXxx(t *testing.T) {
    for i := 0; i < 100000; i++ {
        t.Logf("Iteration %d", i+1)
        l, err := net.Listen("tcp4", addr)
        if err != nil {
            t.Fatalf("Failed to listen on %s: %v", addr, err)
        }
        ch := make(chan struct{})
        go func(l net.Listener) {
            ch <- struct{}{}
            for {
                _, err := l.Accept()
                if err != nil {
                    return
                }
            }
        }(l)
        <-ch
        if err := l.Close(); err != nil {
            t.Fatalf("Failed to close server: %v", err)
        }
    }
}

This ensures that the acceptor goroutine runs before Close is invoked, and at least on my machine it makes the test reliably fail on the second iteration.

@ggriffiniii

ggriffiniii commented Apr 23, 2015

I just did some more investigation and it is indeed the fdmu in netFD. The OS socket is not closed until netFD.destroy is called. The sequence of events is:

  • goroutine1: creates a new listen socket
  • goroutine1: spawns goroutine2 to call netFD.accept; netFD.accept takes a read lock on netFD.fdmu and blocks
  • goroutine1: calls netFD.Close. Close sets the state to closed and increments then decrements the ref count, but decref doesn't call netFD.destroy because goroutine2 holds the read lock.
  • goroutine2: is unblocked because the state is now closed, and calls netFD.readUnlock, which notices the state is closed and calls netFD.destroy.

This all appears to be working as intended, but it leads to the confusing behavior exhibited by my test case. I think it would be far more intuitive to have Close remove the file descriptor from the network poller and close it immediately, and have the other methods return an "already closed" error whenever they are next scheduled.

@rsc

Contributor

rsc commented Jul 15, 2015

I think we could reasonably do something in fdMutex to block until the pending operations are done, so that when Close returns the fd really is closed. Too late for Go 1.5 though.

@dvyukov might be interested.

@rsc rsc modified the milestones: Go1.6Early, Go1.5Maybe Jul 15, 2015

@dvyukov

Member

dvyukov commented Jul 15, 2015

We want this behavior only for closing of passive sockets, right?

@ggriffiniii

ggriffiniii commented Jul 15, 2015

My particular case is on passive sockets, and I think that's certainly where this behavior is more likely to be seen, but ideally it would apply to both passive and active sockets. If Close() is called on a socket, it should invoke close on the underlying file descriptor before returning. I would also expect that shutdown is called before CloseRead() or CloseWrite() return on a TCPConn. Is there a reason to implement it only for passive sockets?

@dvyukov

Member

dvyukov commented Jul 15, 2015

If you can't observe the effects, then there is no point in adding additional waits, goroutine rescheduling, and goroutine unblocking and rescheduling again. It's just like how, when the write syscall returns, the data is not yet on disk: it would be prohibitively expensive to wait for the data to reach the disk.

@ggriffiniii

ggriffiniii commented Jul 15, 2015

My SaneListener may have confused the situation more than it helped. I'm not proposing waiting for other goroutines to be scheduled before returning from Close(). I'm simply saying that Close() should invoke the underlying close syscall before returning. I don't know the Go internals at all, but at a high level I'm imagining that Close() would set a flag on the netFD to indicate it's closed, then immediately close the underlying file descriptor, removing it from the network poller, and then return. Any goroutines still blocked on netFD operations would become runnable and return a connection-closed error whenever they are next scheduled.

I'm not proposing that there should be a synchronization barrier preventing Close() from returning until all other goroutines have been scheduled.

@dvyukov

Member

dvyukov commented Jul 15, 2015

You can't do that.
Consider that another goroutine is in the process of calling Listen. It has already checked the closed flag and is calling the listen syscall. Now you close the fd. The other goroutine then calls listen on the wrong fd, which may already have been reused for another socket. And now it gets a connection from the wrong passive socket and sends your private data to the wrong connection.
That's what we solve with fdMutex. You must delay the actual close syscall. You can either do it asynchronously, or actually block the goroutine until it is safe to call the close syscall.

@ggriffiniii

ggriffiniii commented Jul 16, 2015

Thanks for explaining. I hadn't considered the race between checking the closed flag, starting a syscall, and the interaction with fd reuse. In that case the behavior in SaneListener does accurately mimic what needs to happen, and I can understand why it would be preferable to have this behavior only on passive sockets.

@veqryn

veqryn commented Apr 5, 2018

I recently ran into this while trying to write a benchmark for an http server.
I was a bit surprised that even after sleeping for a whole minute after closing the listener, the server was still accepting new connections.

Here was my reproducer: https://play.golang.org/p/2TDWbImCgow

I originally ran into the issue when using the fasthttp library, with code that look similar to this:

package wtf

import (
	"fmt"
	"io/ioutil"
	"math/rand"
	"net"
	"net/http"
	"strconv"
	"testing"
	"time"

	"github.com/valyala/fasthttp"
)

func BenchmarkFastHttpHandler(b *testing.B) {
	fmt.Println("\n\nTest Iterations:", b.N)

	// Get a random number
	rand.Seed(time.Now().UTC().UnixNano())
	r := strconv.Itoa(rand.Intn(1000000))
	fmt.Println("Starting Server:", r)

	// Create listener
	ln, err := net.Listen("tcp", ":8080")
	Must(b, err)

	// Start server
	go func() {
		httpServer := &fasthttp.Server{Handler: func(ctx *fasthttp.RequestCtx) {
			fmt.Println("Hit:", r)
			ctx.SetBody([]byte(r))
		}}
		Must(b, httpServer.Serve(ln))
		fmt.Println("Closed Server:", r)
	}()

	time.Sleep(10 * time.Millisecond) // Let the server start

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			req, err := http.NewRequest("GET", "http://127.0.0.1:8080", nil)
			Must(b, err)
			resp, err := (&http.Client{}).Do(req)
			Must(b, err)
			if err == nil {
				body, err := ioutil.ReadAll(resp.Body)
				Must(b, err)
				Must(b, resp.Body.Close())
				// Check that we are hitting the correct server
				if string(body) != r {
					b.Fatal("Expected:", r, "Got: ", string(body))
				}
			}
		}
	})
	Must(b, ln.Close())
}

func Must(tb testing.TB, err error) {
	if err != nil {
		tb.Fatal(err)
	}
}

This gave the results:

Test Iterations: 1
Starting Server: 586437
Hit: 586437
Closed Server: 586437

Test Iterations: 100
Starting Server: 608738
Hit: 586437
Hit: 608738
Hit: 586437
Hit: 608738
Hit: 608738
Hit: 608738
Hit: 608738
Hit: 608738
Hit: 608738
Hit: 586437
Hit: 608738
Closed Server: 608738
	pixel_server_benchmark_test.go:52: Expected: 608738 Got:  586437
	pixel_server_benchmark_test.go:52: Expected: 608738 Got:  586437
	pixel_server_benchmark_test.go:52: Expected: 608738 Got:  586437
	pixel_server_benchmark_test.go:52: Expected: 608738 Got:  586437
FAIL

Reproducing on Golang 1.10.1 with the playground link:

package main

import (
	"fmt"
	"io/ioutil"
	"math/rand"
	"net"
	"net/http"
	"strconv"
	"testing"
	"time"
)

func TestListener(t *testing.T) {
	rand.Seed(time.Now().UTC().UnixNano())

	for i := 0; i < 5; i++ {
		// Create listener
		ln, err := net.Listen("tcp", ":8080")
		Must(t, err)

		// Get a random number
		r := strconv.Itoa(rand.Intn(1000000))
		fmt.Println("Starting Server:", r)

		// Setup server
		mux := http.NewServeMux()
		mux.HandleFunc("/", func(writer http.ResponseWriter, request *http.Request) {
			fmt.Println("Hit:", r)
			_, err := writer.Write([]byte(r))
			Must(t, err)
		})
		srv := &http.Server{Handler: mux}

		// Start server
		go func() {
			srv.Serve(ln)
			fmt.Println("Closed Server")
		}()

		time.Sleep(time.Millisecond) // Let the server start

		httpClient := &http.Client{}
		req, err := http.NewRequest("GET", "http://127.0.0.1:8080", nil)
		Must(t, err)

		resp, err := httpClient.Do(req)
		Must(t, err)
		body, err := ioutil.ReadAll(resp.Body)
		Must(t, err)
		Must(t, resp.Body.Close())
		fmt.Println("Got:", string(body))

		if string(body) != r {
			t.Error("Expected:", r, "Got:", string(body))
		}

		Must(t, ln.Close())
		time.Sleep(time.Second) // Let the listener close
	}
}

func Must(tb testing.TB, err error) {
	if err != nil {
		tb.Fatal(err)
	}
}

Gave:

=== RUN   TestListener
Starting Server: 452000
Hit: 452000
Got: 452000
Closed Server
Starting Server: 999528
Hit: 452000
Got: 452000
Closed Server
Starting Server: 533627
Hit: 452000
Got: 452000
Closed Server
Starting Server: 162162
Hit: 452000
Got: 452000
Closed Server
Starting Server: 885063
Hit: 452000
Got: 452000
Closed Server
--- FAIL: TestListener (5.00s)
	main.go:55: Expected: 999528 Got: 452000
	main.go:55: Expected: 533627 Got: 452000
	main.go:55: Expected: 162162 Got: 452000
	main.go:55: Expected: 885063 Got: 452000
FAIL
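[Editor's note: a hedged sketch, not from the thread, of one way to make a loop like the above deterministic on modern Go: use http.Server.Close (available since Go 1.8), which closes the listener and all active connections, including any keep-alive connection the old server could still answer on, and wait for Serve to return before re-binding. The ephemeral ports and `demo` function are illustrative choices.]

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
)

// demo starts and tears down several HTTP servers in sequence. Closing the
// server (not just the listener) and waiting for Serve to return ensures
// no stale server or stale keep-alive connection can answer a new request.
func demo() string {
	for i := 0; i < 3; i++ {
		ln, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			panic(err)
		}
		msg := fmt.Sprintf("server-%d", i)
		mux := http.NewServeMux()
		mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, msg)
		})
		srv := &http.Server{Handler: mux}
		done := make(chan struct{})
		go func() {
			srv.Serve(ln) // returns once the listener is closed
			close(done)
		}()
		resp, err := http.Get("http://" + ln.Addr().String())
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		if string(body) != msg {
			return "hit the wrong server: " + string(body)
		}
		srv.Close() // closes the listener and all active connections
		<-done      // Serve has fully returned before the next iteration
	}
	return "each request hit its own server"
}

func main() {
	fmt.Println(demo())
}
```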