
add benchmark for sql.RawBytes #945

Closed
wants to merge 2 commits

Conversation

methane
Member

@methane methane commented Apr 3, 2019

Description

Added BenchmarkQueryRawBytes.

Checklist

  • Code compiles correctly
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary
  • Added myself / the copyright holder to the AUTHORS file

@methane
Member Author

methane commented Apr 3, 2019

before.txt is master
after.txt is #943 (c81b5c)

$ benchcmp before.txt after.txt
benchmark                              old ns/op     new ns/op     delta
BenchmarkQueryRawBytes/size=100        2082425       2065162       -0.83%
BenchmarkQueryRawBytes/size=1000       2114160       2101272       -0.61%
BenchmarkQueryRawBytes/size=2000       2131505       2135560       +0.19%
BenchmarkQueryRawBytes/size=4000       2150739       2146939       -0.18%
BenchmarkQueryRawBytes/size=8000       2195996       2190682       -0.24%
BenchmarkQueryRawBytes/size=12000      2290469       2256546       -1.48%
BenchmarkQueryRawBytes/size=16000      2371228       2329660       -1.75%
BenchmarkQueryRawBytes/size=32000      2705139       2801268       +3.55%
BenchmarkQueryRawBytes/size=64000      3353764       3495100       +4.21%
BenchmarkQueryRawBytes/size=256000     6925304       7302883       +5.45%

benchmark                              old allocs     new allocs     delta
BenchmarkQueryRawBytes/size=100        223            223            +0.00%
BenchmarkQueryRawBytes/size=1000       223            223            +0.00%
BenchmarkQueryRawBytes/size=2000       223            223            +0.00%
BenchmarkQueryRawBytes/size=4000       223            223            +0.00%
BenchmarkQueryRawBytes/size=8000       223            223            +0.00%
BenchmarkQueryRawBytes/size=12000      223            223            +0.00%
BenchmarkQueryRawBytes/size=16000      223            223            +0.00%
BenchmarkQueryRawBytes/size=32000      223            323            +44.84%
BenchmarkQueryRawBytes/size=64000      223            323            +44.84%
BenchmarkQueryRawBytes/size=256000     223            323            +44.84%

benchmark                              old bytes     new bytes     delta
BenchmarkQueryRawBytes/size=100        7072          7072          +0.00%
BenchmarkQueryRawBytes/size=1000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=2000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=4000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=8000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=12000      7072          7072          +0.00%
BenchmarkQueryRawBytes/size=16000      7072          7072          +0.00%
BenchmarkQueryRawBytes/size=32000      7072          3283877       +46334.91%
BenchmarkQueryRawBytes/size=64000      7072          6560680       +92669.80%
BenchmarkQueryRawBytes/size=256000     7072          26221480      +370678.85%

@methane
Member Author

methane commented Apr 3, 2019

@vmg Please take a look.


@methane
Member Author

methane commented Apr 3, 2019

I'm sorry — #905 had not been rebased onto the master branch. The master branch has since merged connCheck, which increases allocations.
I have now rebased #905 (commit = 45ca928). The delta is smaller than in the comment above.

$ benchcmp pr905all.txt pr943all.txt
benchmark                              old ns/op     new ns/op     delta
BenchmarkQuery                         18388         18327         -0.33%
BenchmarkExec                          11688         11837         +1.27%
BenchmarkRoundtripTxt                  100898        98245         -2.63%
BenchmarkRoundtripBin                  73160         70721         -3.33%
BenchmarkInterpolation                 648           637           -1.70%
BenchmarkQueryContext/1                49050         56076         +14.32%
BenchmarkQueryContext/2                26132         25814         -1.22%
BenchmarkQueryContext/3                19054         19206         +0.80%
BenchmarkQueryContext/4                17305         17886         +3.36%
BenchmarkExecContext/1                 49717         50246         +1.06%
BenchmarkExecContext/2                 25031         28813         +15.11%
BenchmarkExecContext/3                 19448         19151         -1.53%
BenchmarkExecContext/4                 16240         17596         +8.35%
BenchmarkQueryRawBytes/size=100        2161196       2085116       -3.52%
BenchmarkQueryRawBytes/size=1000       2174977       2138447       -1.68%
BenchmarkQueryRawBytes/size=2000       2201240       2175069       -1.19%
BenchmarkQueryRawBytes/size=4000       2224169       2185810       -1.72%
BenchmarkQueryRawBytes/size=8000       2270772       2215323       -2.44%
BenchmarkQueryRawBytes/size=12000      2337594       2294619       -1.84%
BenchmarkQueryRawBytes/size=16000      2408917       2357734       -2.12%
BenchmarkQueryRawBytes/size=32000      2733884       2905703       +6.28%
BenchmarkQueryRawBytes/size=64000      3426760       3521465       +2.76%
BenchmarkQueryRawBytes/size=256000     7023087       7905538       +12.57%
BenchmarkParseDSN                      6827          6981          +2.26%

benchmark                              old allocs     new allocs     delta
BenchmarkQuery                         23             23             +0.00%
BenchmarkExec                          8              8              +0.00%
BenchmarkRoundtripTxt                  17             17             +0.00%
BenchmarkRoundtripBin                  19             18             -5.26%
BenchmarkInterpolation                 1              1              +0.00%
BenchmarkQueryContext/1                22             22             +0.00%
BenchmarkQueryContext/2                22             22             +0.00%
BenchmarkQueryContext/3                22             22             +0.00%
BenchmarkQueryContext/4                22             22             +0.00%
BenchmarkExecContext/1                 22             22             +0.00%
BenchmarkExecContext/2                 22             22             +0.00%
BenchmarkExecContext/3                 22             22             +0.00%
BenchmarkExecContext/4                 22             22             +0.00%
BenchmarkQueryRawBytes/size=100        223            223            +0.00%
BenchmarkQueryRawBytes/size=1000       223            223            +0.00%
BenchmarkQueryRawBytes/size=2000       223            223            +0.00%
BenchmarkQueryRawBytes/size=4000       223            223            +0.00%
BenchmarkQueryRawBytes/size=8000       223            223            +0.00%
BenchmarkQueryRawBytes/size=12000      223            223            +0.00%
BenchmarkQueryRawBytes/size=16000      223            223            +0.00%
BenchmarkQueryRawBytes/size=32000      223            323            +44.84%
BenchmarkQueryRawBytes/size=64000      223            323            +44.84%
BenchmarkQueryRawBytes/size=256000     223            323            +44.84%
BenchmarkParseDSN                      73             73             +0.00%

benchmark                              old bytes     new bytes     delta
BenchmarkQuery                         699           609           -12.88%
BenchmarkExec                          136           136           +0.00%
BenchmarkRoundtripTxt                  36541         25565         -30.04%
BenchmarkRoundtripBin                  38135         12333         -67.66%
BenchmarkInterpolation                 176           176           +0.00%
BenchmarkQueryContext/1                682           592           -13.20%
BenchmarkQueryContext/2                682           592           -13.20%
BenchmarkQueryContext/3                682           592           -13.20%
BenchmarkQueryContext/4                682           592           -13.20%
BenchmarkExecContext/1                 682           592           -13.20%
BenchmarkExecContext/2                 682           592           -13.20%
BenchmarkExecContext/3                 682           592           -13.20%
BenchmarkExecContext/4                 682           592           -13.20%
BenchmarkQueryRawBytes/size=100        7072          7072          +0.00%
BenchmarkQueryRawBytes/size=1000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=2000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=4000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=8000       7072          7072          +0.00%
BenchmarkQueryRawBytes/size=12000      7072          7072          +0.00%
BenchmarkQueryRawBytes/size=16000      7072          7072          +0.00%
BenchmarkQueryRawBytes/size=32000      7072          3283873       +46334.86%
BenchmarkQueryRawBytes/size=64000      7072          6560673       +92669.70%
BenchmarkQueryRawBytes/size=256000     7072          26221480      +370678.85%
BenchmarkParseDSN                      7952          7952          +0.00%

@vmg
Contributor

vmg commented Apr 3, 2019

@methane: Alright, I've had a chance to compare these results. Thanks for writing this benchmark! Very useful!

For the benchmark you propose, the reason why the current version of #943 allocates more memory than master is because we've set a hard-cap on the maximum size of buffers we're willing to allocate, like you suggested (I still think this is a good idea!):

mysql/buffer.go

Lines 64 to 73 in c81b5c7

	if need > len(dest) {
		// Round up to the next multiple of the default size
		dest = make([]byte, ((need/defaultBufSize)+1)*defaultBufSize)

		// if the allocated buffer is not too large, move it to backing storage
		// to prevent extra allocations on applications that perform large reads
		if len(dest) <= maxCachedBufSize {
			b.dbuf[b.flipcnt&1] = dest
		}
	}

If we remove the maxCachedBufSize, the results are much better: there are no extra allocations, because as soon as the buffer is large enough, we keep it and never discard it.

	// grow buffer if necessary to fit the whole packet.
	if need > len(dest) {
		// Round up to the next multiple of the default size
		dest = make([]byte, ((need/defaultBufSize)+1)*defaultBufSize)
		b.dbuf[b.flipcnt&1] = dest
	}

FWIW this is what master does: once it grows the buffer, it always keeps it, and does not shrink it.

However: I don't think the benchmark you're proposing here tells the whole story. I think the benchmark I've updated in this pr is more interesting/accurate.

I've changed two things in this benchmark:

  • I'm creating a new db connection inside b.Run() instead of using the existing one. The benchmark in this PR creates the db at the beginning and fills the database with rows, which means the connection buffers grow outside of the code we're actually measuring: since we're doing very large INSERTs into MySQL, the connection's buffers are already as large as they can get before b.Run(...) is called for each benchmark size, so the benchmark never measures the actual growth of the buffers.

  • Also, critically: the benchmark in this PR only tests the happy path! It always reads 100 rows from the database without cancelling early, so we're not testing the case we're trying to fix: how much memory are we allocating when we cancel a query early? In fact, the benchmark in this PR never calls buf.discard() or buf.flip() (!!!) because the Rows are fully drained, so we just return early before we can call the actual methods that solve the race. 😅

    I've fixed this by adding a case for each benchmark where we only read 10 rows before cancelling the query. I think that shows the actually interesting results!

So with this in mind, here are the results for my updated benchmarks:

master vs #943

benchmark                                          old allocs     new allocs     delta
BenchmarkQueryRawBytes/size=4000,rows=10-16        42             42             +0.00%
BenchmarkQueryRawBytes/size=4000,rows=100-16       222            222            +0.00%
BenchmarkQueryRawBytes/size=8000,rows=10-16        42             42             +0.00%
BenchmarkQueryRawBytes/size=8000,rows=100-16       222            222            +0.00%
BenchmarkQueryRawBytes/size=16000,rows=10-16       42             42             +0.00%
BenchmarkQueryRawBytes/size=16000,rows=100-16      222            222            +0.00%
BenchmarkQueryRawBytes/size=32000,rows=10-16       42             142            +238.10%
BenchmarkQueryRawBytes/size=32000,rows=100-16      222            322            +45.05%
BenchmarkQueryRawBytes/size=64000,rows=10-16       42             142            +238.10%
BenchmarkQueryRawBytes/size=64000,rows=100-16      222            323            +45.50%
BenchmarkQueryRawBytes/size=256000,rows=10-16      42             144            +242.86%
BenchmarkQueryRawBytes/size=256000,rows=100-16     222            325            +46.40%

benchmark                                          old bytes     new bytes     delta
BenchmarkQueryRawBytes/size=4000,rows=10-16        1306          1327          +1.61%
BenchmarkQueryRawBytes/size=4000,rows=100-16       7075          7083          +0.11%
BenchmarkQueryRawBytes/size=8000,rows=10-16        1330          1355          +1.88%
BenchmarkQueryRawBytes/size=8000,rows=100-16       7092          7115          +0.32%
BenchmarkQueryRawBytes/size=16000,rows=10-16       1347          1389          +3.12%
BenchmarkQueryRawBytes/size=16000,rows=100-16      7108          7148          +0.56%
BenchmarkQueryRawBytes/size=32000,rows=10-16       1437          3278165       +228025.61%
BenchmarkQueryRawBytes/size=32000,rows=100-16      7195          3283929       +45541.82%
BenchmarkQueryRawBytes/size=64000,rows=10-16       1553          6555002       +421986.41%
BenchmarkQueryRawBytes/size=64000,rows=100-16      7307          6560809       +89688.00%
BenchmarkQueryRawBytes/size=256000,rows=10-16      4007          26216008      +654155.25%
BenchmarkQueryRawBytes/size=256000,rows=100-16     9769          26221830      +268318.77%

master vs #943 w/ keeping buffers

I.e., we always keep the grown buffer, like so:

	// grow buffer if necessary to fit the whole packet.
	if need > len(dest) {
		// Round up to the next multiple of the default size
		dest = make([]byte, ((need/defaultBufSize)+1)*defaultBufSize)
		b.dbuf[b.flipcnt&1] = dest
	}

benchmark                                          old allocs     new allocs     delta
BenchmarkQueryRawBytes/size=4000,rows=10-16        42             42             +0.00%
BenchmarkQueryRawBytes/size=4000,rows=100-16       222            222            +0.00%
BenchmarkQueryRawBytes/size=8000,rows=10-16        42             42             +0.00%
BenchmarkQueryRawBytes/size=8000,rows=100-16       222            222            +0.00%
BenchmarkQueryRawBytes/size=16000,rows=10-16       42             42             +0.00%
BenchmarkQueryRawBytes/size=16000,rows=100-16      222            222            +0.00%
BenchmarkQueryRawBytes/size=32000,rows=10-16       42             42             +0.00%
BenchmarkQueryRawBytes/size=32000,rows=100-16      222            222            +0.00%
BenchmarkQueryRawBytes/size=64000,rows=10-16       42             42             +0.00%
BenchmarkQueryRawBytes/size=64000,rows=100-16      222            222            +0.00%
BenchmarkQueryRawBytes/size=256000,rows=10-16      42             42             +0.00%
BenchmarkQueryRawBytes/size=256000,rows=100-16     222            222            +0.00%

benchmark                                          old bytes     new bytes     delta
BenchmarkQueryRawBytes/size=4000,rows=10-16        1306          1327          +1.61%
BenchmarkQueryRawBytes/size=4000,rows=100-16       7075          7083          +0.11%
BenchmarkQueryRawBytes/size=8000,rows=10-16        1330          1355          +1.88%
BenchmarkQueryRawBytes/size=8000,rows=100-16       7092          7116          +0.34%
BenchmarkQueryRawBytes/size=16000,rows=10-16       1347          1387          +2.97%
BenchmarkQueryRawBytes/size=16000,rows=100-16      7108          7147          +0.55%
BenchmarkQueryRawBytes/size=32000,rows=10-16       1437          1558          +8.42%
BenchmarkQueryRawBytes/size=32000,rows=100-16      7195          7318          +1.71%
BenchmarkQueryRawBytes/size=64000,rows=10-16       1553          1776          +14.36%
BenchmarkQueryRawBytes/size=64000,rows=100-16      7307          7536          +3.13%
BenchmarkQueryRawBytes/size=256000,rows=10-16      4007          6672          +66.51%
BenchmarkQueryRawBytes/size=256000,rows=100-16     9769          12433         +27.27%

master vs #905

benchmark                                          old allocs     new allocs     delta
BenchmarkQueryRawBytes/size=4000,rows=10-16        42             43             +2.38%
BenchmarkQueryRawBytes/size=4000,rows=100-16       222            223            +0.45%
BenchmarkQueryRawBytes/size=8000,rows=10-16        42             44             +4.76%
BenchmarkQueryRawBytes/size=8000,rows=100-16       222            223            +0.45%
BenchmarkQueryRawBytes/size=16000,rows=10-16       42             44             +4.76%
BenchmarkQueryRawBytes/size=16000,rows=100-16      222            223            +0.45%
BenchmarkQueryRawBytes/size=32000,rows=10-16       42             45             +7.14%
BenchmarkQueryRawBytes/size=32000,rows=100-16      222            223            +0.45%
BenchmarkQueryRawBytes/size=64000,rows=10-16       42             45             +7.14%
BenchmarkQueryRawBytes/size=64000,rows=100-16      222            223            +0.45%
BenchmarkQueryRawBytes/size=256000,rows=10-16      42             45             +7.14%
BenchmarkQueryRawBytes/size=256000,rows=100-16     222            223            +0.45%

benchmark                                          old bytes     new bytes     delta
BenchmarkQueryRawBytes/size=4000,rows=10-16        1306          9554          +631.55%
BenchmarkQueryRawBytes/size=4000,rows=100-16       7075          7114          +0.55%
BenchmarkQueryRawBytes/size=8000,rows=10-16        1330          17744         +1234.14%
BenchmarkQueryRawBytes/size=8000,rows=100-16       7092          7116          +0.34%
BenchmarkQueryRawBytes/size=16000,rows=10-16       1347          17738         +1216.85%
BenchmarkQueryRawBytes/size=16000,rows=100-16      7108          7115          +0.10%
BenchmarkQueryRawBytes/size=32000,rows=10-16       1437          50647         +3424.50%
BenchmarkQueryRawBytes/size=32000,rows=100-16      7195          7182          -0.18%
BenchmarkQueryRawBytes/size=64000,rows=10-16       1553          83524         +5278.24%
BenchmarkQueryRawBytes/size=64000,rows=100-16      7307          7364          +0.78%
BenchmarkQueryRawBytes/size=256000,rows=10-16      4007          282704        +6955.25%
BenchmarkQueryRawBytes/size=256000,rows=100-16     9769          9905          +1.39%

I think these results are much more accurate and interesting than the previous ones.

To note:

  • For master vs buffer: Use a double-buffering scheme to prevent data races #943 w/ keeping buffers, we can see that there's a slight ~1% increase in the total allocation size and an increase of 1 in the allocation count: this is the cost of allocating the background buffer for double-buffering, which we couldn't see before because the allocation happened outside of b.Run()!

  • For master vs fix race on cancelling context passed to QueryContext #905, we can clearly see that whenever we cancel a Query early, we're allocating a lot of memory, because truncating the buffer in (*buffer).discard causes it to permanently shrink. So we're allocating memory (+631.55%) even in the size=4000 case, because early cancellation will always shrink our buffer and eventually force us to allocate.

Overall, I think that #943 is the way to go, and we should find a value for maxCachedBufSize, like you suggested, that reduces the amount of allocations for the average client. We could also always keep the buffer, which is what master is already doing, but I don't like the idea of letting buffers grow unbounded. We should fix this TODO in master by adding a maxCachedBufSize:

https://github.com/go-sql-driver/mysql/blob/master/buffer.go#L50-L51

@methane
Member Author

methane commented Apr 3, 2019

@vmg First of all, I didn't mean "your PR has worse performance!!". I know how to use micro-benchmarks, and I believe you do too.

If we remove the maxCachedBufSize, the results are much better: there are no extra allocations, because as soon as the buffer is large enough, we keep it and never discard it.

I know. I didn't say it's bad. I just wanted to check that the allocations are expected, and to see how large the performance deltas are before merging (or choosing between) the PRs.

I'm creating a new db connection inside b.Run() instead of using the existing one. In the benchmark of this PR, you're creating a new db at the beginning, and filling the database with rows. This means that the connection buffers will grow outside of the actual benchmarks we're measuring!

Thanks. That's exactly what I want from code review.
(BTW, #905 consumes the buffer on close, so #905 pays roughly the same penalty even when the db has been used before.)

Also, critically: the benchmark in this PR only tests the happy path! It always reads 100 rows from the database without cancelling early, so we're not testing the case we're trying to fix:

That's intended. Cancelling is the rare path. I want to know how much of a penalty the common path pays while we fix the rare path.
I want to tune this library for thousands of queries per second, not thousands of cancels per second.


Sorry, it's 0:45 am in my time zone. I'll check tomorrow.

@vmg
Contributor

vmg commented Apr 3, 2019

First of all, I didn't mean "your PR has worse performance!!". I know how to use micro-benchmarks, and I believe you do too.

I know! I appreciate you writing these benchmarks to make sure we can merge a fix that doesn't regress performance. 🙇

@methane
Member Author

methane commented Apr 4, 2019

  • For master vs fix race on cancelling context passed to QueryContext #905, we can clearly see that whenever we cancel a Query early, we're allocating a lot of memory, because truncating the buffer in (*buffer).discard causes it to permanently shrink. So we're allocating memory (+631,55%) even in the case of 4000 rows, because early cancellation will always shrink our buffer and eventually force us to allocate.

It is not a problem, because cancellation is rare and RawBytes usage is rare. But...

Overall, I think that #943 is the way to go,

I prefer #943 now too. While the #905 diff is small, it uses something like a ring buffer or deque,
which is hard to understand and maintain.
At this point, my top priority is fixing #903 in the stable (1.4) branch without any regression and with
acceptable performance and memory overhead. And I think the current result is acceptable.

@methane methane requested a review from shogo82148 April 4, 2019 04:37
@methane methane closed this Apr 4, 2019
@methane methane deleted the bench-rawbytes branch April 4, 2019 10:19