Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/tls: figure out why/how buffer size affects crypto/tls performance on windows #15899

Closed
adg opened this issue May 31, 2016 · 19 comments
Closed
Assignees
Milestone

Comments

@adg
Copy link
Contributor

@adg adg commented May 31, 2016

It looks like a recent change to crypto/tls increased the duration of the throughput benchmarks beyond 10 minutes on the windows race builder.

https://build.golang.org/log/1217fb6191fad2bf98dc54b61cfbe51e33af21e6

BenchmarkThroughput/MaxPacket/1MB-4                    1    2226420000 ns/op       0.47 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                    1    3995838000 ns/op       0.52 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                    1    6992716500 ns/op       0.60 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    1    12010950000 ns/op      0.70 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   1    18002754000 ns/op      0.93 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   1    34987995000 ns/op      0.96 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1    69017067000 ns/op      0.97 MB/s
*** Test killed: ran too long (10m0s).
FAIL    crypto/tls  600.061s

The benchmarks are run with -benchtime .1s, but many of the benchmarks take longer than 100 milliseconds.

One simple fix would be to increase the timeout, currently 10 minutes for all standard library benchmarks combined. But I'm hesitant to make builds take longer.

I will investigate why -benchtime doesn't appear to have the desired effect.

@adg adg added this to the Go1.7 milestone May 31, 2016
@adg adg self-assigned this May 31, 2016
@adg

This comment has been minimized.

Copy link
Contributor Author

@adg adg commented May 31, 2016

Looks like I misread the log output. The crypto/tls package benchmarks alone are taking 10 minutes. Something is definitely wrong on windows.

@adg

This comment has been minimized.

Copy link
Contributor Author

@adg adg commented May 31, 2016

FWIW this takes 44 seconds on my MacBook:

$ go test -run=^$ -bench=. -v -race crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                  100      22417768 ns/op      46.77 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                   50      37196722 ns/op      56.38 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                   20      63029114 ns/op      66.55 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                   10     111478777 ns/op      75.25 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   5     283195368 ns/op      59.24 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   3     491191446 ns/op      68.31 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1    1032492829 ns/op      65.00 MB/s
BenchmarkThroughput/DynamicPacket/1MB-4               50      24004322 ns/op      43.68 MB/s
BenchmarkThroughput/DynamicPacket/2MB-4               30      41218780 ns/op      50.88 MB/s
BenchmarkThroughput/DynamicPacket/4MB-4               20      79591691 ns/op      52.70 MB/s
BenchmarkThroughput/DynamicPacket/8MB-4               10     120384840 ns/op      69.68 MB/s
BenchmarkThroughput/DynamicPacket/16MB-4               5     246487076 ns/op      68.07 MB/s
BenchmarkThroughput/DynamicPacket/32MB-4               3     442225137 ns/op      75.88 MB/s
BenchmarkThroughput/DynamicPacket/64MB-4               2     970823710 ns/op      69.13 MB/s
BenchmarkLatency/MaxPacket/200kbps-4                   2     703320882 ns/op
BenchmarkLatency/MaxPacket/500kbps-4                   5     285504385 ns/op
BenchmarkLatency/MaxPacket/1000kbps-4                 10     146081561 ns/op
BenchmarkLatency/MaxPacket/2000kbps-4                 20      76909455 ns/op
BenchmarkLatency/MaxPacket/5000kbps-4                 50      34849995 ns/op
BenchmarkLatency/DynamicPacket/200kbps-4              10     141680066 ns/op
BenchmarkLatency/DynamicPacket/500kbps-4              20      61072803 ns/op
BenchmarkLatency/DynamicPacket/1000kbps-4             50      34683166 ns/op
BenchmarkLatency/DynamicPacket/2000kbps-4            100      21136070 ns/op
BenchmarkLatency/DynamicPacket/5000kbps-4            100      13180012 ns/op
PASS
ok      crypto/tls  44.515s
@adg

This comment has been minimized.

Copy link
Contributor Author

@adg adg commented May 31, 2016

The crypto/tls benchmarks run 100x slower on the windows builder than they do on my laptop (with and without the race detector enabled):

$ gomote run user-adg-windows-amd64-race-0 bin/go test -run=^$ -bench=. -v -race -cpu=4 crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                    1    2283290800 ns/op       0.46 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                    1    4006989800 ns/op       0.52 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                    1    6988549600 ns/op       0.60 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    1    10019916000 ns/op      0.84 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   1    17996784800 ns/op      0.93 MB/s
^C
$ gomote run user-adg-windows-amd64-race-0 bin/go test -run=^$ -bench=. -v  -cpu=4 crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                    1    1591858000 ns/op       0.66 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                    1    4998238800 ns/op       0.42 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                    1    7999330600 ns/op       0.52 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    1    11974092600 ns/op      0.70 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   1    17979206000 ns/op      0.93 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   1    35982827000 ns/op      0.93 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1    52988362800 ns/op      1.27 MB/s
^C

And it's not that the builder is slow. I compared the results of the bytes package's benchmarks between the windows builder and my macbook and found them comparable (within the same order of magnitude).

So we need to investigate why this is so slow on the windows builder. Is it just the test, or the crypto/tls package in general?

@adg

This comment has been minimized.

Copy link
Contributor Author

@adg adg commented May 31, 2016

I tried running the benchmarks on a linux-amd64 VM, and they run about as fast as on the real hardware. So it's not a GCE thing in general. I wonder if it's a windows+GCE thing? If someone with a physical windows box wants to help out, please run

go test -run=^$ -bench=. -v -race -cpu=4 crypto/tls

and paste your output here. Thanks.

@adg adg changed the title x/build: windows race builder timing out during tests crypto/tls: benchmarks very slow on windows-amd64 builder May 31, 2016
@ChrisHines

This comment has been minimized.

Copy link
Contributor

@ChrisHines ChrisHines commented May 31, 2016

System info: Windows 7 Home Premium SP1
CPU: Intel Core 2 Quad, Q8400 @ 2.66 GHz

I built tip with CGO_ENABLED=0

C:\GoTip\go\src>\GoTip\go\bin\go version
go version devel +b3f98d7 Tue May 31 00:22:56 2016 +0000 windows/amd64

C:\GoTip\go\src>\GoTip\go\bin\go test -run=none -bench=. -v -cpu=4 -timeout=1m crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                   20          76700000 ns/op          13.67 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                   10         149300000 ns/op          14.05 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                    5         290000000 ns/op          14.46 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    2         566500000 ns/op          14.81 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   1        1125000000 ns/op          14.91 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   1        2295000000 ns/op          14.62 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1        4517000000 ns/op          14.86 MB/s
BenchmarkThroughput/DynamicPacket/1MB-4         *** Test killed: ran too long (2m0s).
FAIL    crypto/tls      120.049s

I noticed something that could help. Once the DynamicPacket test started, CPU usage by tls.test.exe dropped to zero. When I run just the MaxPacket benchmarks they take 18.3s total and consume ~38% CPU (or ~1.5 cores) during that time.

@adg

This comment has been minimized.

Copy link
Contributor Author

@adg adg commented May 31, 2016

@ChrisHines thanks for that. Those numbers seem normal. So it does seem to be a combination of Windows and GCE.

I have another interesting finding. Lowering the read/write buffer size in the throughput test appears to dramatically affect run time. Reducing it from 1<<16 (64kb) to 1<<15 (32kb) yields these results:

$ gomote run user-adg-windows-amd64-race-0 go/bin/go test -run=^$ -bench=. -v -race -cpu=4 crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                   50      33806430 ns/op      31.02 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                   30      56246400 ns/op      37.29 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                   10     106243200 ns/op      39.48 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    5     201159000 ns/op      41.70 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   3     414036000 ns/op      40.52 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   2     772411500 ns/op      43.44 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1    1570212000 ns/op      42.74 MB/s
BenchmarkThroughput/DynamicPacket/1MB-4                1    1854373500 ns/op       0.57 MB/s
BenchmarkThroughput/DynamicPacket/2MB-4                1    2018425500 ns/op       1.04 MB/s
BenchmarkThroughput/DynamicPacket/4MB-4                1    2045767500 ns/op       2.05 MB/s
BenchmarkThroughput/DynamicPacket/8MB-4                1    2107287000 ns/op       3.98 MB/s
BenchmarkThroughput/DynamicPacket/16MB-4               1    2205913500 ns/op       7.61 MB/s
BenchmarkThroughput/DynamicPacket/32MB-4               1    2430508500 ns/op      13.81 MB/s
BenchmarkThroughput/DynamicPacket/64MB-4               1    2810367000 ns/op      23.88 MB/s
BenchmarkLatency/MaxPacket/200kbps-4                   2     713821500 ns/op
BenchmarkLatency/MaxPacket/500kbps-4                   5     297246600 ns/op
BenchmarkLatency/MaxPacket/1000kbps-4                 10     158778900 ns/op
BenchmarkLatency/MaxPacket/2000kbps-4                 20      90082125 ns/op
BenchmarkLatency/MaxPacket/5000kbps-4                 30      49085400 ns/op
BenchmarkLatency/DynamicPacket/200kbps-4              10     153896400 ns/op
BenchmarkLatency/DynamicPacket/500kbps-4              20      75825225 ns/op
BenchmarkLatency/DynamicPacket/1000kbps-4             30      48564600 ns/op
BenchmarkLatency/DynamicPacket/2000kbps-4             50      36755460 ns/op
BenchmarkLatency/DynamicPacket/5000kbps-4             50      28630980 ns/op
PASS
ok      crypto/tls  47.659s

Still some slowness in the DynamicPacket benchmarks, but the others seem to have risen to normal range again.

Lowering it further to 1<<14 16kb yields numbers in the expected range for all benchmarks:

$ gomote run user-adg-windows-amd64-race-0 go/bin/go test -run=^$ -bench=. -v -race -cpu=4 crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                   30      39483150 ns/op      26.56 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                   20      69917400 ns/op      29.99 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                   10     127628550 ns/op      32.86 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    5     248421600 ns/op      33.77 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   3     477508500 ns/op      35.13 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   2     908633250 ns/op      36.93 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1    1771371000 ns/op      37.89 MB/s
BenchmarkThroughput/DynamicPacket/1MB-4               30      41305950 ns/op      25.39 MB/s
BenchmarkThroughput/DynamicPacket/2MB-4               20      70600950 ns/op      29.70 MB/s
BenchmarkThroughput/DynamicPacket/4MB-4               10     134561700 ns/op      31.17 MB/s
BenchmarkThroughput/DynamicPacket/8MB-4                5     251741700 ns/op      33.32 MB/s
BenchmarkThroughput/DynamicPacket/16MB-4               2     550257750 ns/op      30.49 MB/s
BenchmarkThroughput/DynamicPacket/32MB-4               2     980894250 ns/op      34.21 MB/s
BenchmarkThroughput/DynamicPacket/64MB-4               1    1928587500 ns/op      34.80 MB/s
BenchmarkLatency/MaxPacket/200kbps-4                   2     713821500 ns/op
BenchmarkLatency/MaxPacket/500kbps-4                   5     298027800 ns/op
BenchmarkLatency/MaxPacket/1000kbps-4                 10     159169500 ns/op
BenchmarkLatency/MaxPacket/2000kbps-4                 20      90668025 ns/op
BenchmarkLatency/MaxPacket/5000kbps-4                 30      49313250 ns/op
BenchmarkLatency/DynamicPacket/200kbps-4              10     154384650 ns/op
BenchmarkLatency/DynamicPacket/500kbps-4              20      77387625 ns/op
BenchmarkLatency/DynamicPacket/1000kbps-4             30      48564600 ns/op
BenchmarkLatency/DynamicPacket/2000kbps-4             50      36325800 ns/op
BenchmarkLatency/DynamicPacket/5000kbps-4             50      28787220 ns/op
PASS
ok      crypto/tls  46.143s

So the network stack for Windows on GCE prefers smaller buffers?

I'll send a change to lower the buffer size, so that the tests run in a reasonable time, but we should find out why this is happening.

@ChrisHines

This comment has been minimized.

Copy link
Contributor

@ChrisHines ChrisHines commented May 31, 2016

I changed read/write buffer size to 1<<14 and reran the tests on my machine and got:

C:\GoTip\go\src>\GoTip\go\bin\go test -run=none -bench=. -v -cpu=4 crypto/tls
BenchmarkThroughput/MaxPacket/1MB-4                   10         112600000 ns/op           9.31 MB/s
BenchmarkThroughput/MaxPacket/2MB-4                    5         216600000 ns/op           9.68 MB/s
BenchmarkThroughput/MaxPacket/4MB-4                    3         430333333 ns/op           9.75 MB/s
BenchmarkThroughput/MaxPacket/8MB-4                    2         849500000 ns/op           9.87 MB/s
BenchmarkThroughput/MaxPacket/16MB-4                   1        1681000000 ns/op           9.98 MB/s
BenchmarkThroughput/MaxPacket/32MB-4                   1        3342000000 ns/op          10.04 MB/s
BenchmarkThroughput/MaxPacket/64MB-4                   1        6738000000 ns/op           9.96 MB/s
BenchmarkThroughput/DynamicPacket/1MB-4               10         110600000 ns/op           9.48 MB/s
BenchmarkThroughput/DynamicPacket/2MB-4                5         216600000 ns/op           9.68 MB/s
BenchmarkThroughput/DynamicPacket/4MB-4                3         423666666 ns/op           9.90 MB/s
BenchmarkThroughput/DynamicPacket/8MB-4                2         841500000 ns/op           9.97 MB/s
BenchmarkThroughput/DynamicPacket/16MB-4               1        1691000000 ns/op           9.92 MB/s
BenchmarkThroughput/DynamicPacket/32MB-4               1        3381000000 ns/op           9.92 MB/s
BenchmarkThroughput/DynamicPacket/64MB-4               1        6812000000 ns/op           9.85 MB/s
BenchmarkLatency/MaxPacket/200kbps-4                   2         707000000 ns/op
BenchmarkLatency/MaxPacket/500kbps-4                   5         288600000 ns/op
BenchmarkLatency/MaxPacket/1000kbps-4                 10         150200000 ns/op
BenchmarkLatency/MaxPacket/2000kbps-4                 20          80450000 ns/op
BenchmarkLatency/MaxPacket/5000kbps-4                 30          40366666 ns/op
BenchmarkLatency/DynamicPacket/200kbps-4              10         145100000 ns/op
BenchmarkLatency/DynamicPacket/500kbps-4              20          64050000 ns/op
BenchmarkLatency/DynamicPacket/1000kbps-4             30          38200000 ns/op
BenchmarkLatency/DynamicPacket/2000kbps-4             50          23940000 ns/op
BenchmarkLatency/DynamicPacket/5000kbps-4            100          17090000 ns/op
PASS
ok      crypto/tls      55.532s
@kardianos

This comment has been minimized.

Copy link
Contributor

@kardianos kardianos commented May 31, 2016

This isn't just a GCE issue. My Win7 box has terrible performance in the test code with "1<<16" but on par with my linux box (similar gen hardware) with "1<<14".

Terrible performance as in MaxPacket 0.05Mb/s (not just dyn packet).

@adg

This comment has been minimized.

Copy link
Contributor Author

@adg adg commented May 31, 2016

@kardianos thanks for the additional information. That's interesting. I wonder why, with the smaller buffer, the performance goes down on @ChrisHines' machine while it goes up on ours?

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented May 31, 2016

CL https://golang.org/cl/23574 mentions this issue.

@adg adg changed the title crypto/tls: benchmarks very slow on windows-amd64 builder crypto/tls: figure out why/how buffer size affects crypto/tls performance on windows May 31, 2016
gopherbot pushed a commit that referenced this issue May 31, 2016
The Windows builders run the throughput benchmarks really slowly with a
64kb buffer. Lowering it to 16kb brings the performance back into line
with the other builders.

This is a work-around to get the build green until we can figure out why
the Windows builders are slow with the larger buffer size.

Update #15899

Change-Id: I215ebf115e8295295c87f3b3e22a4ef1f9e77f81
Reviewed-on: https://go-review.googlesource.com/23574
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@adg adg modified the milestones: Go1.8, Go1.7 May 31, 2016
@tombergan

This comment has been minimized.

Copy link
Contributor

@tombergan tombergan commented Jun 2, 2016

1<<14 is the maximum TLS record size. The only difference between 1<<14 and 1<<16 is that the former produces a single netFD.Write per Write call in the benchmark, while the latter produces 4 netFD.Writes per Write call. (At least, that should be the only difference.)

Is it possible to get a tcpdump for a single slow iteration? That may help.

@tombergan

This comment has been minimized.

Copy link
Contributor

@tombergan tombergan commented Jun 2, 2016

Is it possible to verify that Windows is respecting TCP_NODELAY? A tcpdump may help here as well.

@mikioh

This comment has been minimized.

Copy link
Contributor

@mikioh mikioh commented Jun 2, 2016

In general, a loopback software interface is designed for protocol stack maintenance purpose and probably not suitable for performance measurement (and for the verification of some fancy feature such as IPv6 multicasting).

FWIW, I just asked google: https://blogs.technet.microsoft.com/wincat/2012/12/05/fast-tcp-loopback-performance-and-low-latency-with-windows-server-2012-tcp-loopback-fast-path/

@mikioh mikioh added the OS-Windows label Jun 2, 2016
@alexbrainman

This comment has been minimized.

Copy link
Member

@alexbrainman alexbrainman commented Jun 14, 2016

I can reproduce this on one of my computers too.

I copied (with small adjustments) BenchmarkThroughput into net package and run it there. And I can see about 20% more throughput with 1<<16 than 1<<14. That seems reasonable to me. I think this rules out some magic slowness in tcp/ip stack.

I am happy to debug this, but I don't even know where to start. Suggestions are welcome.

Alex

@tombergan

This comment has been minimized.

Copy link
Contributor

@tombergan tombergan commented Jun 14, 2016

If you have time to debug:

  • What happens if you use (1<<15)? (1<<14)+1?
  • Can you figure out the default socket send buffer size on your machine, and if it's lower than 64KB, report what happens when you raise that size to 64KB or higher? (see here or here)
  • Can you get tcpdumps of (1<<14) and (1<<16)?
  • Does Windows have an equivalent of strace?
  • Do you have any firewall or antivirus programs running, and if so, what happens if they are disabled?

Note that the benchmark code alternates conn.Write with conn.Read. When bufsize=1<<14, each conn.Write call translates to a single TLS record (the max record size is 1<<14). Each TLS record is written with a single netFD.Write call, which translates to one WSASend syscall. When bufsize=1<<16, there are four TLS records (and four netFD.Writes / WSASends) per conn.Write.

In the 1<<16 case, I'd guess that the 2nd, 3rd, or 4th WSASend call is being delayed for some unknown reason. Perhaps that this is a weird interaction with TCP_NODELAY and/or TCP flow control.

@alexbrainman

This comment has been minimized.

Copy link
Member

@alexbrainman alexbrainman commented Jun 29, 2016

I cannot explain the slowdown. But I can explain why windows race builder was hanging. The benchmark assumes that you can write as much data as you like to the network (without reading data on the other end of network connection), but that is not true. For example:

# cat main.go
package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    l, err := net.Listen("tcp4", "127.0.0.1:0")
    if err != nil {
        log.Fatal(err)
    }
    defer l.Close()

    go func() {
        c, err := l.Accept()
        if err != nil {
            log.Fatal(err)
        }
        defer c.Close()

        select {}
    }()

    c, err := net.Dial(l.Addr().Network(), l.Addr().String())
    if err != nil {
        log.Fatal(err)
    }
    defer c.Close()

    total := 0
    buf := make([]byte, 4096)
    for {
        n, err := c.Write(buf[:])
        if err != nil {
            log.Fatal(err)
        }
        total += n
        fmt.Println(total)
    }
}
# go run main.go
4096
8192
12288
16384
20480
...
417792
421888
^Csignal: interrupt
#

Same happens on windows. But the numbers are obviously diffferent:

C:\>u:\test
4096
8192
12288
16384
20480
24576
28672
32768
36864
40960
45056
49152
53248
57344
61440
65536
69632
73728
77824
81920
86016
90112
94208
98304
102400
106496
110592
114688
118784
122880
126976
131072
135168

C:\>

(I did press ctrl+C to break the program).

The BenchmarkThroughput has the same problem. I can make it hang on Linux just like it did on Windows:

# git diff
diff --git a/src/crypto/tls/tls_test.go b/src/crypto/tls/tls_test.go
index b4b5f4a..6106026 100644
--- a/src/crypto/tls/tls_test.go
+++ b/src/crypto/tls/tls_test.go
@@ -560,7 +560,7 @@ func throughput(b *testing.B, totalBytes int64, dynamicRecordSizingDisabled bool
    clientConfig := testConfig.clone()
    clientConfig.DynamicRecordSizingDisabled = dynamicRecordSizingDisabled

-   buf := make([]byte, 1<<14)
+   buf := make([]byte, 1<<20)
    chunks := int(math.Ceil(float64(totalBytes) / float64(len(buf))))
    for i := 0; i < N; i++ {
        conn, err := Dial("tcp", ln.Addr().String(), clientConfig)
# go test -run=none -bench=Throughput
^C#

Alex

@tombergan

This comment has been minimized.

Copy link
Contributor

@tombergan tombergan commented Jun 29, 2016

You're onto something. The problem definitely seems related to the buffer size relative to TCP flow control parameters. The client and server are using different buffer sizes -- the client uses len(buf) chunks while the server uses a fixed 32KB buffer (in io.Copy). This results in the following sequence each loop iteration:

  1. client sends len(buf)
  2. server receives 32KB
  3. server sends 32KB
  4. repeat steps 2 and 3 until the server has copied len(buf)
  5. client receives len(buf)

The behavior of the above sequence depends on the size of the TCP receive window (rwin). If rwin >= len(buf), each step can act as an atomic unit and everything works fine. If rwin <= len(buf)-32KB, the loop will eventually hang at step 3. In other cases, the benchmark will work, however, steps 1 and 4 cannot act as atomic units because rwin is not large enough. This implies more synchronization between the client and server, which slows down the benchmark. The origin version of this benchmark used len(buf)=1<<16=64KB. The question is now, is rwin between 32KB and 64KB? Here is a link that suggests Windows uses rwin=64512, which is just shy of 64KB.

At least that's my hypothesis. I mailed the following change to close this issue. Can someone verify it works on Windows? Thanks!
https://go-review.googlesource.com/24581

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Jun 29, 2016

CL https://golang.org/cl/24581 mentions this issue.

@gopherbot gopherbot closed this in ad82f2c Jun 29, 2016
@alexbrainman

This comment has been minimized.

Copy link
Member

@alexbrainman alexbrainman commented Jul 1, 2016

Can someone verify it works on Windows?

C:\dev\go\src\crypto\tls>go version
go version devel +29f0984 Fri Jul 1 01:12:24 2016 +0000 windows/amd64

C:\dev\go\src\crypto\tls>go test -run=none -bench=.
BenchmarkThroughput/MaxPacket/1MB                     20          81057800 ns/op          12.94 MB/s
BenchmarkThroughput/MaxPacket/2MB                     10         158013880 ns/op          13.27 MB/s
BenchmarkThroughput/MaxPacket/4MB                      5         311144760 ns/op          13.48 MB/s
BenchmarkThroughput/MaxPacket/8MB                      2         622094200 ns/op          13.48 MB/s
BenchmarkThroughput/MaxPacket/16MB                     1        1230516000 ns/op          13.63 MB/s
BenchmarkThroughput/MaxPacket/32MB                     1        2492283200 ns/op          13.46 MB/s
BenchmarkThroughput/MaxPacket/64MB                     1        4841982800 ns/op          13.86 MB/s
BenchmarkThroughput/DynamicPacket/1MB                 20          81155460 ns/op          12.92 MB/s
BenchmarkThroughput/DynamicPacket/2MB                 10         156744300 ns/op          13.38 MB/s
BenchmarkThroughput/DynamicPacket/4MB                  5         318176280 ns/op          13.18 MB/s
BenchmarkThroughput/DynamicPacket/8MB                  2         618187800 ns/op          13.57 MB/s
BenchmarkThroughput/DynamicPacket/16MB                 1        1231492600 ns/op          13.62 MB/s
BenchmarkThroughput/DynamicPacket/32MB                 1        2457125600 ns/op          13.66 MB/s
BenchmarkThroughput/DynamicPacket/64MB                 1        4855655200 ns/op          13.82 MB/s
BenchmarkLatency/MaxPacket/200kbps                     2         704616900 ns/op
BenchmarkLatency/MaxPacket/500kbps                     5         285167200 ns/op
BenchmarkLatency/MaxPacket/1000kbps                   10         146685320 ns/op
BenchmarkLatency/MaxPacket/2000kbps                   20          77053740 ns/op
BenchmarkLatency/MaxPacket/5000kbps                   50          35704496 ns/op
BenchmarkLatency/DynamicPacket/200kbps                10         143267220 ns/op
BenchmarkLatency/DynamicPacket/500kbps                20          63430170 ns/op
BenchmarkLatency/DynamicPacket/1000kbps               50          36231860 ns/op
BenchmarkLatency/DynamicPacket/2000kbps              100          22608290 ns/op
BenchmarkLatency/DynamicPacket/5000kbps              100          14717362 ns/op
PASS
ok      crypto/tls      52.523s

C:\dev\go\src\crypto\tls>

I don't know if these numbers are good or not.

Alex

@golang golang locked and limited conversation to collaborators Jul 1, 2017
FiloSottile pushed a commit to FiloSottile/go that referenced this issue Oct 12, 2018
…TLS throughput benchmark

I believe it's necessary to use a buffer size smaller than 64KB because
(at least some versions of) Window using a TCP receive window less than
64KB. Currently the client and server use buffer sizes of 16KB and 32KB,
respectively (the server uses io.Copy, which defaults to 32KB internally).
Since the server has been using 32KB, it should be safe for the client to
do so as well.

Fixes golang#15899

Change-Id: I36d44b29f2a5022c03fc086213d3c1adf153e983
Reviewed-on: https://go-review.googlesource.com/24581
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
FiloSottile pushed a commit to FiloSottile/go that referenced this issue Oct 12, 2018
…TLS throughput benchmark

I believe it's necessary to use a buffer size smaller than 64KB because
(at least some versions of) Window using a TCP receive window less than
64KB. Currently the client and server use buffer sizes of 16KB and 32KB,
respectively (the server uses io.Copy, which defaults to 32KB internally).
Since the server has been using 32KB, it should be safe for the client to
do so as well.

Fixes golang#15899

Change-Id: I36d44b29f2a5022c03fc086213d3c1adf153e983
Reviewed-on: https://go-review.googlesource.com/24581
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.