CopyFrom is ~80x slower in v5.2.0 versus v4.17.2 #1481

Closed · jeremy-dupre opened this issue Jan 27, 2023 · 19 comments

@jeremy-dupre
Describe the bug
CopyFrom in v5.2.0 is ~80x slower than CopyFrom in v4.17.2
Likely because of the new logic implemented in: func (pgConn *PgConn) CopyFrom(ctx context.Context, r io.Reader, sql string) (CommandTag, error)

Here is the bench result:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6 10 106087930 ns/op 363762 B/op 66 allocs/op
BenchmarkPgx5-6 1 8003570000 ns/op 2422688 B/op 100559 allocs/op
PASS
ok nph/bench 11.587s

To Reproduce
Run the benchmark provided below.

package bench

import (
	"context"
	"testing"

	pgx4 "github.com/jackc/pgx/v4"
	pgx5 "github.com/jackc/pgx/v5"
)

var connString = "postgres://xxx"

func BenchmarkPgx4(b *testing.B) {

	ctx := context.Background()
	conn, err := pgx4.Connect(ctx, connString)
	if err != nil {
		panic(err)
	}
	_, err = conn.Exec(ctx, "CREATE TABLE IF NOT EXISTS test_bench(int_col BIGINT, text_col TEXT)")
	if err != nil {
		panic(err)
	}
	_, err = conn.Exec(ctx, "TRUNCATE TABLE test_bench")
	if err != nil {
		panic(err)
	}
	rows := make([][]any, 100000)
	for i := range rows {
		rows[i] = []any{123, "456789"}
	}
	b.ResetTimer()
	var count int64
	for n := 0; n < b.N; n++ {
		count, err = conn.CopyFrom(ctx, pgx4.Identifier{"test_bench"}, []string{"int_col", "text_col"}, pgx4.CopyFromRows(rows))
		if err != nil {
			panic(err)
		}
	}
	if int(count) != len(rows) {
		b.Fatal("bad result")
	}
}

func BenchmarkPgx5(b *testing.B) {

	ctx := context.Background()
	conn, err := pgx5.Connect(ctx, connString)
	if err != nil {
		panic(err)
	}
	_, err = conn.Exec(ctx, "CREATE TABLE IF NOT EXISTS test_bench(int_col BIGINT, text_col TEXT)")
	if err != nil {
		panic(err)
	}
	_, err = conn.Exec(ctx, "TRUNCATE TABLE test_bench")
	if err != nil {
		panic(err)
	}
	rows := make([][]any, 100000)
	for i := range rows {
		rows[i] = []any{123, "456789"}
	}
	b.ResetTimer()
	var count int64
	for n := 0; n < b.N; n++ {
		count, err = conn.CopyFrom(ctx, pgx5.Identifier{"test_bench"}, []string{"int_col", "text_col"}, pgx5.CopyFromRows(rows))
		if err != nil {
			panic(err)
		}
	}
	if int(count) != len(rows) {
		b.Fatal("bad result")
	}
}

Expected behavior
I would expect CopyFrom in v5.x to be roughly on par with CopyFrom in v4.x performance-wise.

Actual behavior
CopyFrom v5.2.0 is 80x slower than v4.17.2

Version

  • Go: v1.19.5
  • PostgreSQL: v12.0 x64
  • pgx: v4.17.2 vs v5.2.0

Additional context

@jackc
Owner

jackc commented Jan 27, 2023

There must be something platform or network specific. I'm showing v4 taking ~78% of the time of v5. Significant, but not 80x. I'm on macOS on an M2 Max using a Unix socket. The memory allocations are still excessive though.

jack@glados ~/dev/pgx_issues/pgx-1481 ±master⚡ » got -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jackc/pgx_issues/pgx-1481
BenchmarkPgx4-12    	      50	  23735211 ns/op	  351785 B/op	      53 allocs/op
BenchmarkPgx5-12    	      38	  30128284 ns/op	90526099 B/op	  181994 allocs/op
PASS
ok  	github.com/jackc/pgx_issues/pgx-1481	3.136s

If I had to guess I would say something with the new non-blocking mode is the problem. Still looking though.

jackc added a commit that referenced this issue Jan 28, 2023
This was causing allocations every time there was a non-blocking read
with nothing to read.

#1481
@jackc
Owner

jackc commented Jan 28, 2023

Found a place that was allocating a buffer for a non-blocking read and not releasing it back to the sync.Pool when there were no bytes read. That happens a lot in the CopyFrom path. With that fixed (7941518), pgx v5 is faster than v4 for me.

jack@glados ~/dev/pgx_issues/pgx-1481 ±master⚡ » got -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jackc/pgx_issues/pgx-1481
BenchmarkPgx4-12    	      51	  23224866 ns/op	  351749 B/op	      53 allocs/op
BenchmarkPgx5-12    	      51	  22054759 ns/op	 3926631 B/op	  185413 allocs/op
PASS
ok  	github.com/jackc/pgx_issues/pgx-1481	3.221s

YMMV, though, you were starting from a much slower place. Let me know how this change affects you.

Also, are you on Windows? True non-blocking reads are only implemented on *nix right now. On Windows it is simulated. That could have a significant impact.
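
For illustration, here is a minimal sketch of the pooled-buffer pattern that fix addresses (hypothetical names, not the actual pgx internals): the buffer has to go back to the sync.Pool on the empty-read path as well, otherwise every "would block" read costs an allocation.

package sketch

import (
	"net"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 8192)
		return &b
	},
}

// tryRead illustrates the fix: the pooled buffer must be released on the
// zero-byte path, otherwise every empty non-blocking read allocates.
func tryRead(conn net.Conn) ([]byte, error) {
	bufPtr := bufPool.Get().(*[]byte)
	n, err := conn.Read(*bufPtr)
	if n == 0 {
		bufPool.Put(bufPtr) // this kind of release was missing on the empty-read path
		return nil, err
	}
	// The caller now owns the data and should return the buffer to the
	// pool once the bytes have been consumed.
	return (*bufPtr)[:n], err
}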

jackc added a commit that referenced this issue Jan 28, 2023
This significantly reduces memory allocations in paths that repeatedly
encode the same type of values such as CopyFrom.

#1481
@jackc
Owner

jackc commented Jan 28, 2023

And memoizing the encode plans reduces the allocation count by 1/3. (42a4719)

jack@glados ~/dev/pgx_issues/pgx-1481 ±master⚡ » got -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jackc/pgx_issues/pgx-1481
BenchmarkPgx4-12    	      51	  23316109 ns/op	  351870 B/op	      54 allocs/op
BenchmarkPgx5-12    	      50	  22217718 ns/op	 3024637 B/op	  116902 allocs/op
PASS
ok  	github.com/jackc/pgx_issues/pgx-1481	4.117s
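
For illustration, a minimal sketch of the memoization idea (a hypothetical structure, not the actual 42a4719 change): since CopyFrom encodes the same column types for every row, the plan chosen for a given (OID, format, Go type) combination can be computed once and reused.

package sketch

import (
	"reflect"

	"github.com/jackc/pgx/v5/pgtype"
)

// planKey identifies one kind of value: rows in a COPY all share the same
// OIDs, formats and Go types, so the same plan can serve every row.
type planKey struct {
	oid    uint32
	format int16
	typ    reflect.Type
}

type memoizedPlans struct {
	m     *pgtype.Map
	plans map[planKey]pgtype.EncodePlan
}

// plan returns a cached encode plan, computing it only on first use.
func (p *memoizedPlans) plan(oid uint32, format int16, value any) pgtype.EncodePlan {
	k := planKey{oid: oid, format: format, typ: reflect.TypeOf(value)}
	if cached, ok := p.plans[k]; ok {
		return cached
	}
	newPlan := p.m.PlanEncode(oid, format, value)
	p.plans[k] = newPlan
	return newPlan
}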

jackc added a commit that referenced this issue Jan 28, 2023
This saves an allocation on every call.

#1481
jackc added a commit that referenced this issue Jan 28, 2023
Only create RawConn.Read callback once and have it use NetConn fields.
Avoids the closure and some allocations.

#1481
@jeremy-dupre
Author

jeremy-dupre commented Jan 28, 2023

Thanks Jack for the quick answer and first fixes.
Allocations definitely shrank, but the main perf issue remains after taking your latest commit - bench results below:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6   	      12	  98959325 ns/op	  363179 B/op	      60 allocs/op
BenchmarkPgx5-6   	       1	7976769700 ns/op	  828320 B/op	     563 allocs/op
PASS
ok  	nph/bench	10.757s

I am indeed on a Windows environment - so the underlying issue is probably Windows/network related, as you suggested.

I also played with the buffer size in the CopyFrom method, and when setting buf := iobufpool.Get(6553600), it clearly converged with v4 performance-wise:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6   	      10	 101899850 ns/op	  363856 B/op	      63 allocs/op
BenchmarkPgx5-6   	       6	 171502650 ns/op	14198041 B/op	     296 allocs/op
PASS
ok  	nph/bench	3.825s

That is theoretically 100x fewer round trips, so the fishy part indeed lies within the send/receive logic on Windows.

Came across this Microsoft KB article while looking for potential root causes:
https://learn.microsoft.com/en-us/troubleshoot/windows-server/networking/slow-performance-copy-data-tcp-server-sockets-api

@jeremy-dupre
Author

I tried tweaking Windows with method #4 from that article - no luck.

I timed reads (pgConn.receiveMessage) and writes (pgConn.frontend.SendUnbufferedEncodedCopyData) in PgConn.CopyFrom; reads are clearly the bottleneck:

Write time: 1662200ns
Read time: 7721166800ns

Digging further, the issue looks to be located in fakeNonblockingRead() (used behind pgConn.peekMessage() / pgConn.frontend.Receive()): waiting for a message that never arrives loses 100ms each time (and returns ErrWouldBlock).
I get hundreds of those in a row while data is being sent:

Receive: 106774000 ns - msg: , err: would block
fakeNonblockingRead : 106774000 ns
Receive: 113192700 ns - msg: , err: would block
fakeNonblockingRead : 113192700 ns
Receive: 113702600 ns - msg: , err: would block
fakeNonblockingRead : 113702600 ns
etc ...

func (c *NetConn) fakeNonblockingRead(b []byte) (n int, err error) {
	c.readDeadlineLock.Lock()
	defer c.readDeadlineLock.Unlock()

	// Simulate a non-blocking read by imposing a short read deadline.
	deadline := time.Now().Add(fakeNonblockingWaitDuration) // => 100ms
	if c.readDeadline.IsZero() || deadline.Before(c.readDeadline) {
		err = c.conn.SetReadDeadline(deadline)
		if err != nil {
			return 0, err
		}
		defer func() {
			// Ignoring error resetting deadline as there is nothing that can reasonably be done if it fails.
			c.conn.SetReadDeadline(c.readDeadline)

			// A deadline expiry is reported to callers as ErrWouldBlock,
			// i.e. "no data available yet", not as a failure.
			if err != nil {
				if errors.Is(err, os.ErrDeadlineExceeded) {
					err = ErrWouldBlock
				}
			}
		}()
	}

	return c.conn.Read(b)
}

jackc added a commit that referenced this issue Jan 28, 2023
The reason for a high max wait time was to ensure that reads aren't
cancelled when there is data waiting for it in Go or the OS's receive
buffer. Unfortunately, there is no way to know ahead of time how long
this should take.

This new code uses 2x the fastest successful read time as the max read
time. This allows the code to adapt to whatever host it is running on.

#1481
@jackc
Owner

jackc commented Jan 28, 2023

A few more optimizations and now the memory usage is comparable.

jack@glados ~/dev/pgx_issues/pgx-1481 ±master⚡ » got -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jackc/pgx_issues/pgx-1481
BenchmarkPgx4-12    	      49	  24231045 ns/op	  351642 B/op	      53 allocs/op
BenchmarkPgx5-12    	      46	  23306757 ns/op	  315970 B/op	     267 allocs/op
PASS
ok  	github.com/jackc/pgx_issues/pgx-1481	3.141s

But the problem you are having is definitely caused by the fake non-blocking read.

...


The problem is the lack of true non-blocking IO for Windows. I think it should be possible, but I'm not sure how.

Since we are left with fake non-blocking IO, the problem is figuring out a wait time that allows a read to succeed without waiting too long. Originally, I just chose a large number to ensure that we didn't get into a situation where all reads fail because the read deadline is too short. 898891a should improve this significantly. It keeps track of the time of successful reads and uses 2x the fastest read as the wait time for a fake non-blocking read. This allows it to adapt to the speed of the host and still provides reasonable assurance that reads will work when data is waiting in Go or the OS's receive buffer.
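
A minimal sketch of that adaptive-wait idea (hypothetical names, not the actual 898891a code):

package sketch

import "time"

// adaptiveWait tracks the fastest read that actually returned data and
// allows twice that long before a fake non-blocking read gives up.
type adaptiveWait struct {
	fastestSuccessfulRead time.Duration
}

func (a *adaptiveWait) maxWait() time.Duration {
	if a.fastestSuccessfulRead == 0 {
		return 100 * time.Millisecond // conservative default until a read succeeds
	}
	return 2 * a.fastestSuccessfulRead
}

func (a *adaptiveWait) record(start time.Time, n int) {
	if n == 0 {
		return // only reads that returned data count as "successful"
	}
	if d := time.Since(start); a.fastestSuccessfulRead == 0 || d < a.fastestSuccessfulRead {
		a.fastestSuccessfulRead = d
	}
}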

With those changes in place, I disabled true non-blocking IO and tried the benchmark again:

jack@glados ~/dev/pgx_issues/pgx-1481 ±master⚡ » got -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: github.com/jackc/pgx_issues/pgx-1481
BenchmarkPgx4-12    	      49	  23830909 ns/op	  351839 B/op	      54 allocs/op
BenchmarkPgx5-12    	      46	  24218995 ns/op	  304662 B/op	     118 allocs/op
PASS
ok  	github.com/jackc/pgx_issues/pgx-1481	3.143s

I think this should resolve this issue.

@jeremy-dupre
Author

Thanks again Jack, much better indeed, but unfortunately there is still a 10x gap :/
Memory usage is twice as high on Windows in v5 compared to your bench run (probably more buffer allocations because of the Windows IO waits):

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6   	       8	 131495475 ns/op	  373196 B/op	      81 allocs/op
BenchmarkPgx5-6   	       1	1235863700 ns/op	  770944 B/op	     361 allocs/op
PASS
ok  	nph/bench	2.731s

I am unfortunately no networking expert, but if non-blocking IO is not available on Windows, would it be possible to go back to the same kind of goroutine logic as in CopyFrom v4 (async write / sync receive) - for Windows only?
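
To illustrate what I mean by "async write / sync receive", here is a rough sketch of the general shape (not pgx v4's actual code):

package sketch

// copyConcurrently shows the general shape of "async write / sync
// receive": one goroutine streams the COPY payload while the caller sits
// in a plain blocking read loop - no non-blocking IO or deadlines needed.
func copyConcurrently(sendAll func() error, waitForResult func() error) error {
	sendErr := make(chan error, 1)
	go func() {
		// Async write: stream CopyData and finish with CopyDone, without
		// interleaving any reads on this goroutine.
		sendErr <- sendAll()
	}()

	// Sync receive: block for the final CommandComplete, or for an
	// ErrorResponse that may arrive mid-copy.
	resultErr := waitForResult()

	if err := <-sendErr; err != nil {
		return err
	}
	return resultErr
}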

@jackc
Owner

jackc commented Jan 28, 2023

To be clear, non-blocking IO is definitely possible on Windows, and is almost certainly possible from Go. But I don't have a Windows dev box, nor do I have the Windows expertise to easily implement it. The ideal solution is to get that implemented.

I'd much rather see that happen than have two separate copy from implementations.

I'm going to post a new issue requesting help on this.

@jackc
Owner

jackc commented Jan 28, 2023

@jeremy-dupre One other thing, can you look at how long the fake non-blocking reads are taking again? And confirm that that is still the problem? I would have expected that to be less than 100µs. And if it is that low I would not expect the impact to be so great.

@jeremy-dupre
Author

jeremy-dupre commented Jan 28, 2023

I timed line 454 (=> c.conn.Read(b)) in func (c *NetConn) fakeNonblockingRead:

conn.Read: 505600 ns; read 34 byte(s); err: <nil>; fakeNonblockingReadWaitDuration: 100000000 ns
conn.Read: 3889600 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 16316900 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 14711300 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 16246900 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 16352800 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
conn.Read: 17267500 ns; read 0 byte(s); err: read tcp 192.168.1.25:62214->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1011200 ns
...

Each read (except the first - and only - one that returns something) takes ~16ms and ends up in an i/o timeout, even though the fake read deadline is 1ms.
Hence, this is still the problem.

@jackc
Owner

jackc commented Jan 28, 2023

Good to know, thanks. I suppose the read deadline on Go / Windows isn't very strictly enforced.

@drakkan
Contributor

drakkan commented Jan 29, 2023

[quoting the timing log and conclusion from the previous comment]

see golang/go#44343

@jeremy-dupre
Author

jeremy-dupre commented Jan 29, 2023

  1. As mentioned in the above golang issue by @drakkan (thanks!) - and as a workaround pending its resolution - we can force the (very imprecise) Windows timer resolution from ~16ms down to its 1ms minimum before the 'problematic' SetReadDeadline:

var (
	winmmDLL            = syscall.NewLazyDLL("winmm.dll")
	procTimeBeginPeriod = winmmDLL.NewProc("timeBeginPeriod")
)

	if c.readDeadline.IsZero() || deadline.Before(c.readDeadline) {
		procTimeBeginPeriod.Call(uintptr(1)) // 1ms; probably overkill to call it each time, but at least we can be sure it's not reset by some other process
		err = c.conn.SetReadDeadline(deadline)

  2. Could it be possible to configure the read/write buffer sizes? In some cases (e.g. a high-performance network) we might want to take advantage of bigger buffers - especially for binary data flows like CopyFrom with millions of rows. I increased it in the code by 2x (131 072) for my subsequent tests.

  3. The Postgres server in my tests is not local, therefore the 200% wait can be very costly in my particular case; I tried a slightly more aggressive 50% buffer on top of the fastest successful read (rather than 100%):

proposedWait := endTime.Sub(startTime) * 3 / 2

Then v5 is 20% faster than v4:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6   	      10	 122825480 ns/op	  360039 B/op	      65 allocs/op
BenchmarkPgx5-6   	      10	 103812350 ns/op	  790924 B/op	     222 allocs/op
PASS
ok  	nph/bench	5.414s

And when testing with a local Postgres server, the bottleneck indeed remains the Windows timer's minimal 'precision':

conn.Read: 0 ns; read 34 byte(s); err: <nil>; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1053600 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1047000 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1052400 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns
conn.Read: **1026400 ns**; read 0 byte(s); err: read tcp 192.168.1.25:63651->163.xxx.xxx.xxx:5432: i/o timeout; fakeNonblockingReadWaitDuration: 1000 ns

@jackc
Owner

jackc commented Feb 1, 2023

I added a few more optimizations.

384a581 should use the copy buffer more efficiently.

9963c32 causes the fake non-blocking read to only read 1 byte the first 5 times bytes are read. This ensures that there are 4 reads where there is data already in Go or the OS's network receive buffer. In other words this should give us the time necessary to get past Go's deadline system, but not require waiting for an actual network receive.
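
For illustration, a minimal sketch of that calibration idea (the names and the constant are assumptions, not the actual 9963c32 code):

package sketch

// calibratedReadSize limits the first few non-blocking reads to a single
// byte. Those reads can be satisfied from data already sitting in Go's or
// the OS's receive buffer, so the durations they feed into the adaptive
// deadline reflect buffer access time rather than a network round trip.
func calibratedReadSize(readCount, want int) int {
	const calibrationReads = 5 // assumed value, for illustration only
	if readCount < calibrationReads && want > 1 {
		return 1
	}
	return want
}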


Your points 1 and 2 are working around the fundamental issue -- no real non-blocking I/O. I'd rather not spend the time and add the complexity on a workaround. FWIW, I looked into what it would take to get it implemented on Windows. It doesn't look too hard. Hopefully someone will tackle #1482.

Point 3 should be resolved by 9963c32

@jeremy-dupre
Author

Thanks much for the fixes - greatly appreciated.
A few remarks to update my previous points:

  1. OK, agreed, it makes sense to wait for a proper Windows solution (and I can include my workaround in my own code in the meantime).
  2. 384a581 indeed helps, though it does not fully reach the same perf as a buffer increase (in my test case). The ability to set the socket read/write buffers is offered by the Postgres JDBC driver (through 'send_buffer_size' and 'receive_buffer_size' in the connection string and driver props); I thought it could add value to your golang package as well - for both *nix and Windows use. A rough sketch of what that could look like follows this list.
  3. Seems indeed solved by 9963c32.
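
As a rough sketch of point 2, under the assumption that the OS socket buffers are what's wanted (that is what the JDBC options map to; pgx's internal copy buffer size is a separate, currently non-configurable value), a custom DialFunc can already size the TCP buffers today:

package sketch

import (
	"context"
	"net"

	"github.com/jackc/pgx/v5/pgconn"
)

// withSocketBuffers installs a DialFunc that enlarges the OS socket
// buffers before handing the connection to pgx. size is whatever the
// network calls for (e.g. 131072); the calls are best-effort.
func withSocketBuffers(cfg *pgconn.Config, size int) {
	d := &net.Dialer{}
	cfg.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
		conn, err := d.DialContext(ctx, network, addr)
		if err != nil {
			return nil, err
		}
		if tcp, ok := conn.(*net.TCPConn); ok {
			_ = tcp.SetReadBuffer(size)  // ignore errors: the OS may clamp or reject the size
			_ = tcp.SetWriteBuffer(size)
		}
		return conn, nil
	}
}

A config obtained from pgx.ParseConfig embeds a pgconn.Config, so it could be adjusted this way before connecting with pgx.ConnectConfig.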

FYI, here's my latest bench - pgx v5 perf is still roughly 40% behind pgx v4:

goos: windows
goarch: amd64
pkg: nph/bench
cpu: Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
BenchmarkPgx4-6   	       9	 112712144 ns/op	  370487 B/op	      76 allocs/op
BenchmarkPgx5-6   	       7	 154666243 ns/op	  366804 B/op	     178 allocs/op
PASS
ok  	nph/bench	3.622s

@vcabbage
Contributor

Hello, I found this issue after running some benchmarks to decide whether to switch from INSERT to COPY. I thought it might be useful to share some numbers. I'm on Linux rather than Windows so not everything discussed here applies.

v5.2.0:

BenchmarkInsertPushes/1000000/copy-48     1  7763021465 ns/op  10635100240 B/op  16411173 allocs/op
BenchmarkInsertPushes/1000000/insert-48   1  8297983761 ns/op    810778640 B/op   4047839 allocs/op

fa5fbed4:

BenchmarkInsertPushes/1000000/copy-48     1  6792212353 ns/op    40738064 B/op    2000627 allocs/op
BenchmarkInsertPushes/1000000/insert-48   1  8709871389 ns/op   767498856 B/op    4000103 allocs/op

As can be seen, the memory improvements are significant 🎉

@jackc Do you foresee tagging a new release in the near future? No particular rush on my part, just curious what sort of timeline to expect.

@jackc
Owner

jackc commented Feb 11, 2023

v5.3.0 was released today.

@jackc
Owner

jackc commented Feb 25, 2023

@jeremy-dupre Try branch windows-non-blocking-io.

@codercms
Contributor

@jeremy-dupre could you also try these changes - #1525?
