Increase throughput #24

Closed · wants to merge 1 commit

Conversation

@fd0 (Contributor) commented Oct 8, 2014

No description provided.

@davecheney (Member) commented:

This PR has bits of #23 in it, plus some code cleanups. Could I ask you to repropose it once #23 has landed (very soon), and move the unrelated cleanups to another PR to reduce the size of the change?

@fd0 (Contributor, Author) commented Oct 9, 2014

Sure, I'll do a cleanup this evening. I thought it might be helpful to get your opinion on my approach as soon as possible. The commit from #23 was included by accident.

@fd0 (Contributor, Author) commented Oct 9, 2014

Reworked the commit, split the minor improvements into #25, and removed the commit that added the benchmarks.

@davecheney (Member) commented:

Thanks for submitting this proposal.

I'm not in a position to do a full review today, but here are my initial thoughts.

  1. I'm of the belief that the major difference in performance between go/sftp and openssh is the lack of well-tuned crypto primitives in Go.
  2. I'm not keen on the approach this patch takes, spinning up and managing multiple goroutines per connection to mux/demux messages. This was the approach we took with the ssh package itself, and we ended up with a lot of issues around locking and handling the close semantics properly.

I need to find some time to review the sftp documentation to figure out the patterns in which a message may arrive out of order.

As an aside, it strikes me that a better way to improve sftp throughput would be to multiplex over multiple sftp.Conns rather than placing the logic to handle out-of-order messages on a single conn, as in either case those Conns are multiplexed on the back of a single ssh.Conn.

@fd0 (Contributor, Author) commented Oct 18, 2014

Thanks for your comment. However, I disagree on 1.

  1. I'm of the belief that the major difference in performance between go/sftp and openssh is the lack of well-tuned crypto primitives in Go.

I'm pretty sure that it is crucial to have several sftp packets in flight in order to saturate the available bandwidth.

Suppose I'd like to transfer a 100MiB file to a server via sftp. I have an extremely fast CPU and my crypto primitives are tuned to the max, so crypto is assumed to take no time at all.

Now assume that the latency (round-trip time) to the sftp server is 1ms. In sftp, all servers should support WRITE packets with 32KiB of data, and the Go sftp library uses this size for all data packets, so in order to write all 100MiB I'd have to send 3200 sftp WRITE packets. If I send these packets serially and wait for each STATUS packet to arrive before sending the next one, this will take at least 3.2 seconds. Considering that TCP splits these 32KiB WRITE packets into multiple packets on the wire, 3.2 seconds is just the lower bound and it may take even longer. So for 1ms latency to the server, the upper bound for the throughput is 100MiB/3.2s = 31.25MiB/s.

For a much more realistic latency, consider two servers in two data centers in Germany with a latency of about 9ms. This means that the lower bound for the transfer time of 3200 WRITE packets is 28.8s; in other words, the maximum bandwidth here is ~3.5MiB/s.

At home, I have a latency of about 22ms to one of my servers. In this case the bandwidth is limited to ~1.4MiB/s.

All the calculations above do not consider any time spent on crypto primitives.

Therefore I conclude that it is necessary to have multiple packets in flight.
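
For reference, here are the same back-of-the-envelope numbers as a small stand-alone program (purely illustrative, not part of this patch), reproducing the three figures above:

    package main

    import "fmt"

    func main() {
        const (
            fileSize   = 100 * 1024 * 1024 // 100 MiB to transfer
            packetSize = 32 * 1024         // one WRITE packet carries 32 KiB
        )
        packets := fileSize / packetSize // 3200 WRITE packets

        for _, rttMs := range []float64{1, 9, 22} {
            total := float64(packets) * rttMs / 1000           // one full round trip per packet
            limit := float64(fileSize) / total / (1024 * 1024) // resulting upper bound in MiB/s
            fmt.Printf("RTT %4.0f ms: %6.1f s total, max %5.2f MiB/s\n", rttMs, total, limit)
        }
    }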

  2. I'm not keen on the approach this patch takes, spinning up and managing multiple goroutines per connection to mux/demux messages. This was the approach we took with the ssh package itself, and we ended up with a lot of issues around locking and handling the close semantics properly.

I think it's also possible to have multiple concurrent operations with several packets in flight without the need for a separate goroutine, but this may be much more complex. Did a pattern for starting and managing goroutines in a library emerge from your experience with the ssh package that we could maybe reuse?
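
Just to make sure we're talking about the same pattern, here is a minimal sketch of the mux/demux idea (illustrative only, not the code from this PR): callers register a buffered channel under their request id before sending, and a single reader goroutine delivers each reply to the matching channel.

    package sftpx // illustrative only, not the code in this PR

    import "sync"

    // response is a placeholder for a decoded sftp reply packet.
    type response struct {
        typ     byte
        payload []byte
    }

    type conn struct {
        mu      sync.Mutex
        pending map[uint32]chan response // request id -> waiting caller
    }

    func newConn() *conn {
        return &conn{pending: make(map[uint32]chan response)}
    }

    // register must be called before the request is sent, so the reply cannot
    // arrive while nobody is listening for it.
    func (c *conn) register(id uint32) <-chan response {
        ch := make(chan response, 1)
        c.mu.Lock()
        c.pending[id] = ch
        c.mu.Unlock()
        return ch
    }

    // readLoop runs in a single goroutine; recv reads and decodes one packet.
    func (c *conn) readLoop(recv func() (uint32, response, error)) {
        for {
            id, r, err := recv()
            if err != nil {
                return // a real implementation would also fail all pending requests here
            }
            c.mu.Lock()
            ch, ok := c.pending[id]
            delete(c.pending, id)
            c.mu.Unlock()
            if ok {
                ch <- r // buffered, so the read loop never blocks on a slow caller
            }
        }
    }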

I need to find some time to review the sftp documentation to figure out the patterns in which a message may arrive out of order.

According to the spec's section 6.1, there are no restrictions regarding the order of the response packets:

There are no restrictions on the order in which responses to
outstanding requests are delivered to the client, except that the
server must ensure fairness in the sense that processing of no
request will be indefinitely delayed even if the client is sending
other requests so that there are multiple outstanding requests all
the time.

As an aside, it strikes me that a better way to improve sftp throughput would be to multiplex over multiple sftp.Conns rather than placing the logic to handle out-of-order messages on a single conn, as in either case those Conns are multiplexed on the back of a single ssh.Conn.

But what about my example above, copying a file of 100MiB to a server? Wouldn't that mean opening the file several times over several sftp connections and writing in parallel? That sounds wrong to me...

I'm looking forward to your thoughts about my reasoning.

@fd0 (Contributor, Author) commented Oct 19, 2014

I just did a few more benchmarks to compare the Go ssh client (and Go crypto primitives) against stock OpenSSH. TL;DR: The Go crypto primitives don't limit sftp bandwidth.

For this I modified the gsftp example client from the repository (gsftp.go): I added an option to start an external program instead of using the Go ssh client, and the io.Copy() in put/fetch is done manually with larger buffers. The server noi is my file server in the living room, connected to my workstation via gigabit ethernet; the latency is about 0.4ms.
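
Roughly, the -cmd mode boils down to starting the external ssh in sftp subsystem mode and driving the library over its pipes; a minimal sketch of the wiring (assuming sftp.NewClientPipe, the actual gsftp changes may differ):

    // Sketch of the -cmd mode: spawn the external ssh in sftp subsystem mode
    // and drive pkg/sftp over the subprocess's stdin/stdout.
    package main

    import (
        "log"
        "os/exec"

        "github.com/pkg/sftp"
    )

    func main() {
        cmd := exec.Command("ssh", "fs@noi", "-s", "sftp")
        wr, err := cmd.StdinPipe()
        if err != nil {
            log.Fatal(err)
        }
        rd, err := cmd.StdoutPipe()
        if err != nil {
            log.Fatal(err)
        }
        if err := cmd.Start(); err != nil {
            log.Fatal(err)
        }

        client, err := sftp.NewClientPipe(rd, wr)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()
        // ... put/fetch exactly as with the regular Go ssh transport ...
    }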

Uploading 500MiB of random data to the server with the stock OpenSSH sftp client:

$ sftp fs@noi
sftp> put testdata-500MiB
Uploading testdata-500MiB to /home/fs/testdata-500MiB
testdata-500MiB                                            100%  500MB  38.5MB/s   00:13

This is CPU-bound on the receiving end (as can easily be seen in top), because the file server has a low-power AMD Turion CPU. So the maximum bandwidth over SFTP should be around 38MiB/s.

Master Branch (without optimizations)

Now gsftp.go is built against the master branch without my optimizations from this pull request. Let's run the program with the Go ssh client:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:13 [7.53MiB/s] [==============================>] 100%

Second, use the stock OpenSSH client to connect to the server:

$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
run command "ssh fs@noi -s sftp"
 100MiB 0:00:12 [8.29MiB/s] [==============================>] 100%

As you can already see, this is pretty close: 7.53MiB/s vs. 8.29MiB/s.

Increase-Throughput (sending multiple packets in parallel)

When building gsftp.go against the branch containing my optimizations, the same picture emerges:

Builtin SSH client:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:03 [29.4MiB/s] [==============================>] 100%

OpenSSH stock SSH client:

$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
run command "ssh fs@noi -s sftp"
 100MiB 0:00:03 [30.4MiB/s] [==============================>] 100%

Still, pretty close: 29.4MiB/s vs. 30.4MiB/s. For 500MiB of data it's even closer: 32.6MiB/s (Go) vs. 32.9MiB/s (OpenSSH).

This data supports my conclusion above: The limiting factor here is the latency, not the Go crypto primitives. Those seem to be optimized well already.

Latency

Adding 9ms artificial latency (under Linux: tc qdisc add dev eth0 root netem delay 9ms):

$ ping -c 5 noi
PING noi (172.25.33.1) 56(84) bytes of data.
64 bytes from noi (172.25.33.1): icmp_seq=1 ttl=64 time=9.35 ms
64 bytes from noi (172.25.33.1): icmp_seq=2 ttl=64 time=9.49 ms
64 bytes from noi (172.25.33.1): icmp_seq=3 ttl=64 time=9.53 ms
64 bytes from noi (172.25.33.1): icmp_seq=4 ttl=64 time=9.47 ms
64 bytes from noi (172.25.33.1): icmp_seq=5 ttl=64 time=9.48 ms
--- noi ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 9.351/9.467/9.534/0.061 ms

Checking maximum realistic bandwidth with the OpenSSH sftp client:

$ sftp fs@noi
Connected to noi.
sftp> put testdata-500MiB
Uploading testdata-500MiB to /home/fs/testdata-500MiB
testdata-500MiB                             100%  500MB  38.5MB/s   00:13

Build and run the test program against the master branch:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:42 [2.35MiB/s] [==============================>] 100%

$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
 100MiB 0:00:41 [2.43MiB/s] [==============================>] 100%

Including my optimizations:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:04 [20.1MiB/s] [==============================>] 100%
$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
run command "ssh fs@noi -s sftp"
 100MiB 0:00:04 [20.8MiB/s] [==============================>] 100%

This shows that the latency still limits the maximum bandwidth, but less severely. With 20ms latency the bandwidth is 13.9MiB/s (Go) vs. 13.8MiB/s (OpenSSH).
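
(As a rough rule of thumb, not measured here: with W WRITE packets in flight the same reasoning gives an upper bound of about W × 32KiB / RTT, so the ~20MiB/s observed at 9ms RTT corresponds to roughly six 32KiB packets in flight on average; a larger window pushes the bound up until the link or the receiving CPU becomes the limit.)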

@davecheney (Member) commented:

Thanks, I've just got back from a long trip. I'll try to find some time to review this.

@fd0 (Contributor, Author) commented Oct 22, 2014

I've seen that you presented at dotGo, unfortunately I was unable to attend (and I don't live far away from Paris). I'm looking forward to your thoughts!

@davecheney (Member) commented:

Thanks for your response -- nothing concludes a discussion like hard data!


A review comment from a Member on this code:

    sendMu    sync.Mutex // guarantees atomic operation for sending packets
    pktsOutCh chan idCh
    done      chan (struct{})

drop the brackets around struct{}, they aren't necessary.

@davecheney (Member) commented:

Thank you. Here is my first review pass.

For clarification: you've sold me on the idea, a 4x speedup is worth making the code more complex.

Benchmarks before:

    BenchmarkRead1k           10     142398598 ns/op      73.64 MB/s
    BenchmarkRead16k          50      32758323 ns/op     320.10 MB/s
    BenchmarkRead32k         100      42391983 ns/op     247.36 MB/s
    BenchmarkRead128k         50      42014906 ns/op     249.58 MB/s
    BenchmarkRead512k         50      32767844 ns/op     320.01 MB/s
    BenchmarkRead1MiB         50      38830615 ns/op     270.04 MB/s
    BenchmarkRead4MiB         50      33023037 ns/op     317.53 MB/s
    BenchmarkWrite1k          10     140492382 ns/op      74.64 MB/s
    BenchmarkWrite16k         50      37735998 ns/op     277.87 MB/s
    BenchmarkWrite32k         50      34301633 ns/op     305.70 MB/s
    BenchmarkWrite128k        50      29061948 ns/op     360.81 MB/s
    BenchmarkWrite512k        50      36087549 ns/op     290.57 MB/s
    BenchmarkWrite1MiB        50      39458367 ns/op     265.75 MB/s
    BenchmarkWrite4MiB        50      32230107 ns/op     325.34 MB/s
    BenchmarkMarshalInit     5000000           651 ns/op
    BenchmarkMarshalOpen     5000000           577 ns/op
    BenchmarkMarshalWriteWorstCase     50000         46544 ns/op
    BenchmarkMarshalWrite1k   500000          2823 ns/op
    ok      github.com/pkg/sftp 39.618s

after:

    BenchmarkRead1k        1        1069701026 ns/op           9.80 MB/s
    BenchmarkRead16k              20         111213902 ns/op          94.29 MB/s
    BenchmarkRead32k              50          69827661 ns/op         150.17 MB/s
    BenchmarkRead128k             50          43963052 ns/op         238.52 MB/s
    BenchmarkRead512k            100          26209817 ns/op         400.07 MB/s
    BenchmarkRead1MiB            100          22649939 ns/op         462.95 MB/s
    BenchmarkRead4MiB             50          35849768 ns/op         292.50 MB/s
    BenchmarkWrite1k               1        2450890588 ns/op           4.28 MB/s
    BenchmarkWrite16k             10         222473163 ns/op          47.13 MB/s
    BenchmarkWrite32k             10         135452515 ns/op          77.41 MB/s
    BenchmarkWrite128k            50          52768404 ns/op         198.72 MB/s
    BenchmarkWrite512k            50          31508170 ns/op         332.80 MB/s
    BenchmarkWrite1MiB            50          27730215 ns/op         378.14 MB/s
    BenchmarkWrite4MiB            50          43410403 ns/op         241.55 MB/s
    BenchmarkMarshalInit     5000000               659 ns/op
    BenchmarkMarshalOpen     5000000               560 ns/op
    BenchmarkMarshalWriteWorstCase     50000             43900 ns/op
    BenchmarkMarshalWrite1k  1000000              2471 ns/op
    ok      github.com/pkg/sftp     44.857s

I also tested a real-world scenario, uploading 500MiB of random data to
a (rather slow embedded) server in my living room:

sftp:

    $ sftp server
    Connected to server.
    sftp> put testdata
    testdata                      100% 500MB  41.7MB/s   00:12

Before this commit, with the simple client `put.go` and pv (pipe viewer); the backend is the system ssh client connecting to the same server as above, and the file is written to the exact same location on the server:

    $ pv testdata | go run put.go
    500MiB 0:00:57 [8.68MiB/s] [=======================>] 100%

With the new code:

    $ pv testdata | go run put.go
    500MiB 0:00:14 [34.9MiB/s] [=======================>] 100%

Then I added 20ms of artificial latency to this server via tc:

    # tc qdisc add dev enp0s25 root netem delay 20ms

Afterwards the difference is even more drastic. First test with sftp
results in 41.7MiB/s. Next with `put.go`, old code is 1.35MiB/s, new
code reaches 30.9MiB/s.

@fd0 (Contributor, Author) commented Nov 9, 2014

Thanks for your comments. I've reworked the commit, please have a look.

The example benchmarks for the master branch on my machine are:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go"; done
examples/buffered-read-benchmark
2014/11/09 21:18:19 reading 1e+09 bytes
2014/11/09 21:18:41 read 1000000000 bytes in 21.850351328s
examples/buffered-write-benchmark
2014/11/09 21:18:42 writing 1e+09 bytes
2014/11/09 21:19:12 wrote 1000000000 bytes in 29.122257824s
examples/streaming-read-benchmark
2014/11/09 21:19:13 reading 1000000000 bytes
2014/11/09 21:19:35 read 1000000000 bytes in 21.648978119s
examples/streaming-write-benchmark
2014/11/09 21:19:36 writing 1000000000 bytes
2014/11/09 21:20:04 wrote 1000000000 bytes in 27.471592583s

and for the new code:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go"; done
examples/buffered-read-benchmark
2014/11/09 21:16:18 reading 1e+09 bytes
2014/11/09 21:16:30 read 1000000000 bytes in 11.574949353s
examples/buffered-write-benchmark
2014/11/09 21:16:31 writing 1e+09 bytes
2014/11/09 21:16:44 wrote 1000000000 bytes in 12.603036457s
examples/streaming-read-benchmark
2014/11/09 21:16:46 reading 1000000000 bytes
2014/11/09 21:17:07 read 1000000000 bytes in 21.470725004s
examples/streaming-write-benchmark
2014/11/09 21:17:08 writing 1000000000 bytes
2014/11/09 21:17:31 wrote 1000000000 bytes in 22.354928153s

For my file server, master branch:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go -host noi -user fs"; done
examples/buffered-read-benchmark
2014/11/09 21:49:46 reading 1e+09 bytes
2014/11/09 21:51:41 read 1000000000 bytes in 1m55.27282327s
examples/buffered-write-benchmark
2014/11/09 21:51:43 writing 1e+09 bytes
2014/11/09 21:53:43 wrote 1000000000 bytes in 2m0.487277412s
examples/streaming-read-benchmark
2014/11/09 21:53:44 reading 1000000000 bytes
2014/11/09 21:55:40 read 1000000000 bytes in 1m55.796024742s
examples/streaming-write-benchmark
2014/11/09 21:55:42 writing 1000000000 bytes
2014/11/09 21:57:41 wrote 1000000000 bytes in 1m59.713730318s

New code:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go -host noi -user fs"; done
examples/buffered-read-benchmark
2014/11/09 21:41:28 reading 1e+09 bytes
2014/11/09 21:41:58 read 1000000000 bytes in 29.321412409s
examples/buffered-write-benchmark
2014/11/09 21:41:59 writing 1e+09 bytes
2014/11/09 21:42:24 wrote 1000000000 bytes in 25.09417721s
examples/streaming-read-benchmark
2014/11/09 21:42:26 reading 1000000000 bytes
2014/11/09 21:44:22 read 1000000000 bytes in 1m56.917131627s
examples/streaming-write-benchmark
2014/11/09 21:44:24 writing 1000000000 bytes
2014/11/09 21:46:26 wrote 1000000000 bytes in 2m1.687394197s

The streaming read/write benchmark can be accelerated by implementing WriteTo() and ReadFrom(), but that's step 2.
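
Not part of this commit, but as a rough sketch of how a pipelined ReadFrom could look (WriteChunkAsync, the interface and the package name are hypothetical, purely to illustrate the windowing idea; a real implementation would live inside the package and use the existing packet-sending machinery):

    package sftpx // illustrative package name, not part of pkg/sftp

    import "io"

    // asyncWriter models the primitive a pipelined ReadFrom would need: issue a
    // WRITE packet at the given offset and return a channel that delivers the
    // server's status reply. This is hypothetical; pkg/sftp does not expose it.
    type asyncWriter interface {
        WriteChunkAsync(off int64, p []byte) <-chan error
    }

    // pipelinedReadFrom reads 32KiB chunks from r and keeps up to maxInFlight
    // WRITE requests outstanding instead of waiting for each STATUS reply.
    func pipelinedReadFrom(w asyncWriter, r io.Reader) (int64, error) {
        const (
            chunkSize   = 32 * 1024
            maxInFlight = 16
        )
        var (
            written int64
            pending []<-chan error
        )
        for {
            buf := make([]byte, chunkSize) // fresh buffer, the previous chunk may still be in flight
            n, err := r.Read(buf)
            if n > 0 {
                pending = append(pending, w.WriteChunkAsync(written, buf[:n]))
                written += int64(n)
            }
            // Collect replies once the window is full, or collect all of them at EOF/error.
            for len(pending) > 0 && (len(pending) >= maxInFlight || err != nil) {
                if werr := <-pending[0]; werr != nil {
                    return written, werr
                }
                pending = pending[1:]
            }
            if err == io.EOF {
                return written, nil
            }
            if err != nil {
                return written, err
            }
        }
    }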

@fd0 changed the title from "Increase troughput" to "Increase throughput" on Nov 17, 2014

@fd0 (Contributor, Author) commented Dec 21, 2014

Hey Dave, is there anything else I can do here?

@davecheney (Member) commented:

Hi,

Sorry for the delay.

There are a few problems, some minor, some not.

  1. There is a merge conflict; I have no idea why. Could you re-push this change?
  2. I don't have an objection to this change, but I do think it should be expanded so that all commands issued to the server are made asynchronously.

If you can fix the merge conflict I'll merge this.

This was referenced May 23, 2015
@ggriffiniii mentioned this pull request on Jun 2, 2015
@davecheney (Member) commented:

Superseded by #41

@davecheney closed this on Jun 3, 2015
@fd0 mentioned this pull request on Apr 15, 2017