Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/tls: TLS connections use small buffer size that results in small syscalls and ignore HTTP client transport buffer sizes #47672

Open
richardartoul opened this issue Aug 12, 2021 · 10 comments
Assignees
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@richardartoul
Copy link

richardartoul commented Aug 12, 2021

What version of Go are you using (go version)?

$ go version

`go1.16 darwin/amd6`

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
COMP12013:dd-go richard.artoul$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/richard.artoul/Library/Caches/go-build"
GOENV="/Users/richard.artoul/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/richard.artoul/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/richard.artoul/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.15/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.15/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/t2/02qzh_vs4cn57ctvc7dwcsc80000gn/T/go-build075316343=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I'm using this library: https://github.com/google/go-cloud to read files from S3 in a streaming fashion.

I noticed that a lot of time was being spent in syscalls:
Screen Shot 2021-08-12 at 2 48 00 PM

I was aware of this issue: #22618

So I tuned my http client read/write buffer transport sizes to be 256kib instead of 64kib but this had no impact on time spent in syscalls which made me suspicious that somehow the way the reads were being performed that reads were not actually being buffered as I expected.

I wrote a small program to download a file from S3 in a streaming fashion using large 1mib reads, like this:

stream, err := store.GetStream(ctx, *bucket, *path)
if err != nil {
    log.Fatalf("error getting: %s, err: %v", *path, err)
}
defer stream.Close()

buf := make([]byte, 1<<20)
    for {
        n, err := bufio.NewReaderSize(stream, 1<<20).Read(buf)
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
	fmt.Println("n:", n)
}

I couldn't get dtrace to work properly on OS X, but luckily my application uses a custom dialer for setting write/read deadlines on every socket read so I was able to instrument the actual socket read sizes like this:

func (d *deadlineConn) Read(p []byte) (int, error) {
	d.Conn.SetReadDeadline(time.Now().Add(d.readDeadline))
	fmt.Println("read size", len(p))
	n, err := d.Conn.Read(p)
	err = maybeWrapErr(err)
	return n, err
}

What did you expect to see?

Large syscall reads (in the range of 256KiB)

What did you see instead?

Extremely small sys call reads:

read size 52398
n: 16384
n: 1024
n: 16384
n: 1024
n: 16384
n: 1024
read size 52398
n: 16384
n: 1024
read size 28361
read size 26929
n: 16384
n: 1024
n: 16384
read size 5449
n: 1024
read size 56415
read size 47823
n: 16384
n: 1024
n: 16384
read size 23479

What did you do after?

I made a small change in tls.go to instantiate the TLS client with a much larger rawInput buffer:

// Client returns a new TLS client side connection
// using conn as the underlying transport.
// The config cannot be nil: users must set either ServerName or
// InsecureSkipVerify in the config.
func Client(conn net.Conn, config *Config) *Conn {
	c := &Conn{
		rawInput: *bytes.NewBuffer(make([]byte, 0, 1<<20)),
		conn:     conn,
		config:   config,
		isClient: true,
	}
	c.handshakeFn = c.clientHandshake
	return c
}

As expected I began to observe much larger syscall reads:

read size 1034019
read size 1024035
read size 1022603
read size 1003987
read size 993963
read size 991099
read size 985371
read size 982507
read size 981075
read size 965323
read size 963891
read size 955299
read size 945275
read size 935251

I haven't tried deploying my fork to production, and measuring performance on my laptop is not interesting since I have a terrible connection to S3, but I think its well understood that a 10x increase in syscalls (especially with such a small read size of 64kib) has a dramatic impact on performance.

Proposal

I'm not 100% sure what the best approach is here, but I think we should do something since this issue means that streaming large amounts of data over TLS is much more CPU intensive than it needs to be which is a big deal for applications that process large volumes of data over the network like distributed databases.

The tls package already has a Config struct. It seems like it would be straightforward to add buffer size configuration there like has already been done for the HTTP transport. In addition, it seems reasonable that the HTTP client transport buffer sizes should be automatically propagated as the values for the TLS buffer size if the user doesn't specify a specific override.

@richardartoul
Copy link
Author

richardartoul commented Aug 13, 2021

Ok so to test my theory in production I wrote a custom TLS dialer that does some crazy unsafe stuff:

func (d *deadlineDialer) DialTLSContext(ctx context.Context, network, addr string) (c net.Conn, err error) {
	defer func() {
		if err != nil {
			err = maybeWrapErr(err)
		}
	}()

	var firstTLSHost string
	if firstTLSHost, _, err = net.SplitHostPort(addr); err != nil {
		return nil, err
	}

	cfg := &tls.Config{}
	cfg.ServerName = firstTLSHost

	trace := httptrace.ContextClientTrace(ctx)

	plainConn, err := d.dialer.DialContext(ctx, network, addr)
	if err != nil {
		return nil, err
	}

	tlsConn := tls.Client(plainConn, cfg)
	increaseTLSBufferSizeUnsafely(tlsConn)

	errc := make(chan error, 2)
	var timer *time.Timer // for canceling TLS handshake
	if d := transport.TLSHandshakeTimeout; d != 0 {
		timer = time.AfterFunc(d, func() {
			errc <- errs.NewRetryableError(errors.New("TLS handshake timeout"))
		})
	}
	go func() {
		if trace != nil && trace.TLSHandshakeStart != nil {
			trace.TLSHandshakeStart()
		}
		err := tlsConn.Handshake()
		if timer != nil {
			timer.Stop()
		}
		errc <- err
	}()
	if err := <-errc; err != nil {
		plainConn.Close()
		if trace != nil && trace.TLSHandshakeDone != nil {
			trace.TLSHandshakeDone(tls.ConnectionState{}, err)
		}
		return nil, err
	}
	cs := tlsConn.ConnectionState()
	if trace != nil && trace.TLSHandshakeDone != nil {
		trace.TLSHandshakeDone(cs, nil)
	}

        return tlsConn, nil
}

func increaseTLSBufferSizeUnsafely(tlsConn *tls.Conn) {
	var (
		pointerVal = reflect.ValueOf(tlsConn)
		val        = reflect.Indirect(pointerVal)
		member     = val.FieldByName("rawInput")
		ptrToY     = unsafe.Pointer(member.UnsafeAddr())
		realPtrToY = (*bytes.Buffer)(ptrToY)
	)
	*realPtrToY = *bytes.NewBuffer(make([]byte, 0, 1<<18))
}

After I deployed my service using the new dialer and a 256kib buffer size, all the time spent in syscalls more or less completely disappeared from my profiles. The performance difference ended up being more dramatic than I expected, profiles with the old implementation had 23.36s of CPU time whereas the new version was 19.98 for the same workload.

I would share CPU profile comparisons but its a bit difficult since my application is not open source and I don't want to potentially leak any sensitive information via the method names.

@dmitshur
Copy link
Contributor

@dmitshur dmitshur added this to the Backlog milestone Aug 14, 2021
@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 14, 2021
@seizethedave
Copy link

seizethedave commented Jun 13, 2022

+1 on this! Our database downloads new partitions to serve from S3 while it is serving reads, and we are considering rewriting our downloader component in a different language as these 4GB downloads must be squeezed through the syscall pipe that is at most 64KB wide due to the way Conn.rawInput works.

read syscall return values:

@bytes:
(..., 0)           87559 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[0]                 1629 |                                                    |
[1]                 1287 |                                                    |
[2, 4)               103 |                                                    |
[4, 8)                54 |                                                    |
[8, 16)              464 |                                                    |
[16, 32)             121 |                                                    |
[32, 64)             134 |                                                    |
[64, 128)            213 |                                                    |
[128, 256)           666 |                                                    |
[256, 512)           608 |                                                    |
[512, 1K)            948 |                                                    |
[1K, 2K)            4310 |@@                                                  |
[2K, 4K)            6920 |@@@@                                                |
[4K, 8K)           25991 |@@@@@@@@@@@@@@@                                     |
[8K, 16K)          46532 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[16K, 32K)         42064 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[32K, 64K)         16326 |@@@@@@@@@                                           |
[64K, 128K)          160 |                                                    |
[128K, 256K)           3 |                                                    |
[256K, 512K)           1 |                                                    |
[512K, 1M)             2 |                                                    |

@rolandshoemaker
Copy link
Member

Dupe of #20420.

@FiloSottile
Copy link
Contributor

Dupe of #20420.

I think it's related but not a dupe. #20420 is about negotiating over the protocol smaller individual records to reduce (the peer's) memory requirements. This is about reading from the wire more than one record at a time to reduce syscall overhead.

@xdg
Copy link

xdg commented Jun 14, 2022

@FiloSottile what needs to happen to get this out of the backlog? Is the proposal acceptable and a PR along those lines would be accepted?

@richardartoul
Copy link
Author

Was refactoring some code and found a link to this issue above an extremely ugly hack I introduced to work around it :P Would be awesome to see this proposal accepted or an alternative proposed!

@rolandshoemaker
Copy link
Member

We obviously didn't get to this for 1.20 (sorry), but it is something on my radar to look at in 1.21.

@elijahandrews
Copy link

elijahandrews commented Jan 17, 2024

Thanks for the workaround @richardartoul, I just used it as well.

For the Go maintainers – any chance we can get this investigated for the next Go release?

@rolandshoemaker rolandshoemaker self-assigned this Jan 17, 2024
jamesross3 pushed a commit to jamesross3/distribution that referenced this issue Jun 5, 2024
@HippoBaro
Copy link

HippoBaro commented Jul 29, 2024

Bump on this one; it seems the unsafe workaround hack mentioned in the issue is proliferating in the wild which we probably want to avoid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

8 participants