Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TCP RST handling is unreliable on OSX #50254

Open
marten-seemann opened this issue Dec 18, 2021 · 8 comments
Open

net: TCP RST handling is unreliable on OSX #50254

marten-seemann opened this issue Dec 18, 2021 · 8 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin Unfortunate
Milestone

Comments

@marten-seemann
Copy link
Contributor

marten-seemann commented Dec 18, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes, tested with go.18beta1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/marten/Library/Caches/go-build"
GOENV="/Users/marten/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/marten/src/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/marten/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.5/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.5/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.5"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/marten/src/go/src/github.com/libp2p/go-libp2p/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/q0/b5ynf00142l7bl9sp8y098zr0000gn/T/go-build1068436186=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I'm seeing occasional TCP connection timeouts, even though the other side has sent (and we have received) the TCP RST packet.

I managed to reproduce the failure using the following minimal working example. It reliably fails at least once when run 1000 times (go test -run TestLinger -count 1000 -v -failfast).

import (
	"fmt"
	"io"
	"math/rand"
	"net"
	"testing"
	"time"
)

func TestLinger(t *testing.T) {
	ln, err := net.ListenTCP("tcp4", nil)
	if err != nil {
		t.Fatal(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	accepted := make(chan struct{})
	go func() {
		defer close(done)
		conn, err := ln.Accept()
		if err != nil {
			t.Error(err)
		}
		close(accepted)
		io.ReadAll(conn)
	}()

	conn, err := net.DialTCP("tcp4", nil, ln.Addr().(*net.TCPAddr))
	if err != nil {
		t.Fatal(err)
	}
	// This makes sure a TCP RST is sent when Close is called.
	if err := conn.SetLinger(0); err != nil {
		t.Fatal(err)
	}
	fmt.Printf("%s <-> %s\n", conn.LocalAddr(), conn.RemoteAddr())
	go func() {
		for {
			b := make([]byte, 1+rand.Intn(1000))
			_, err := conn.Write(b)
			if err != nil {
				return
			}
			time.Sleep(time.Duration(rand.Intn(1000)) * time.Microsecond)
		}
	}()

	<-accepted
	time.Sleep(time.Duration(rand.Intn(30)) * time.Millisecond)
	if err := conn.Close(); err != nil {
		t.Fatal(err)
	}

	select {
	case <-done:
	case <-time.After(10 * time.Second):
		t.Fatal("remote conn didn't close")
	}

What did you expect to see?

I expect the tcp.Conn to be reliably closed / reset when we receive the TCP RST packet.
This works reliably on Linux (incl. the code I posted above), but not on OSX.

What did you see instead?

Occasionally (the test case above fails in maybe 1 out of 100 runs), the TCP RST doesn't seem to have any effect on the connection at all. The connection stays open, and eventually (after a long time) runs into a connection timeout.

Here's a pcap of this transport:
tcprst.pcapng.gz, showing that the RST was actually (and received).

marten-seemann added a commit to libp2p/go-libp2p that referenced this issue Dec 18, 2021
TCP RST handling is flaky on OSX, see golang/go#50254.
We can avoid this test from randomly failing by using QUIC instead.
@dmitshur dmitshur added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin labels Dec 21, 2021
marten-seemann added a commit to libp2p/go-libp2p that referenced this issue Apr 18, 2022
marten-seemann added a commit to libp2p/go-libp2p that referenced this issue Apr 18, 2022
* don't run the reconnect test using TCP on OSX

Due to golang/go#50254.

* don't check the error return value when closing stream in reconnect test
@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022
@neild
Copy link
Contributor

neild commented Aug 20, 2022

This looks like the same macOS bug as described here:
#37795 (comment)

@seankhliao seankhliao added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 20, 2022
@marten-seemann
Copy link
Contributor Author

marten-seemann commented Aug 21, 2022

@seankhliao I see the "WaitingForInfo" label was applied here. I'm happy to provide more info, if that helps us getting closer to a fix. What else do you need?

@seankhliao
Copy link
Member

seankhliao commented Aug 21, 2022

Does it reproduce on 1.19 (which should include the workaround for #37795)?

@marten-seemann
Copy link
Contributor Author

marten-seemann commented Aug 21, 2022

Does it reproduce on 1.19 (which should include the workaround for #37795)?

Yes, it does. Running my test case with -count 1000 reliably reproduces the bug:

=== RUN   TestLinger
127.0.0.1:55159 <-> 127.0.0.1:55158
    linger_test.go:60: remote conn didn't close
--- FAIL: TestLinger (10.03s)
FAIL
exit status 1

I'm running Go 1.19 on an M1 Macbook Pro:

❯ go version
go version go1.19 darwin/arm64

Here's a new pcap for the failing test case above: dump.pcapng.zip

@seankhliao seankhliao removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Aug 21, 2022
@neild
Copy link
Contributor

neild commented Aug 22, 2022

Unfortunately, this is a macOS bug, not a Go bug. (See #37795 (comment), which reproduces the problem entirely in C.) Any fix is going to have to come from Apple. Perhaps there's a workaround that can be applied in the Go runtime, but I can't think of one.

@neild
Copy link
Contributor

neild commented Aug 22, 2022

Does it reproduce on 1.19 (which should include the workaround for #37795)?

FYI, the workaround for #37795 is just a change to the flaky test which makes it avoid triggering the macOS bug. It doesn't address the underlying problem. (Which is a kernel bug triggered--I think--by receiving a RST while in the process of accepting a connection.)

@marten-seemann
Copy link
Contributor Author

marten-seemann commented Aug 25, 2022

Any fix is going to have to come from Apple.

Makes sense, thanks for investigating @neild.
Has anybody reported this to Apple? Is there any issue we can push (or at least subscribe to)?

@neild
Copy link
Contributor

neild commented Aug 26, 2022

I reported this through Apple's "Feedback Assistant", which appears to be the only official way to report issues these days. It did not give me any issue number. (Used to be you'd get a radar number, but no way to track activity on the issue.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin Unfortunate
Projects
None yet
Development

No branches or pull requests

5 participants