Increase throughput #24
Conversation
Sure, I'll do a cleanup this evening. I thought it might be helpful to get your opinion on my approach as soon as possible. The commit from #23 was included by accident.
I've reworked the commit, split the minor improvements into #25 and removed the commit that added the benchmarks.
Thanks for submitting this proposal. I'm not in a position to do a full review today, but here are my initial thoughts.
I need to find some time to review the sftp documentation to figure out the patterns in which a message may arrive out of order. As an aside, it strikes me that a better way to improve sftp throughput would be to multiplex over multiple sftp.Conns rather than placing the logic to handle out-of-order messages on a single conn, as in either case those Conns are multiplexed on the back of a single ssh.Conn.
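To make that suggestion concrete, here is a rough sketch of the multiplexing approach being floated (this is illustrative only, not code from this PR): each worker gets its own SFTP channel via a separate `sftp.NewClient` call on the shared `*ssh.Client` and writes one byte range of the file. Whether opening the same remote file over several channels is actually a good idea is exactly what's questioned below; the helper names here are made up.

```go
package example

import (
	"io"
	"os"
	"sync"

	"github.com/pkg/sftp"
	"golang.org/x/crypto/ssh"
)

// parallelPut splits the local file into one byte range per worker and
// writes each range over its own SFTP channel, all riding on one ssh.Client.
func parallelPut(conn *ssh.Client, local, remote string, workers int) error {
	info, err := os.Stat(local)
	if err != nil {
		return err
	}
	chunk := (info.Size() + int64(workers) - 1) / int64(workers)

	var wg sync.WaitGroup
	errs := make(chan error, workers)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(off int64) {
			defer wg.Done()
			errs <- putRange(conn, local, remote, off, chunk)
		}(int64(i) * chunk)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// putRange copies bytes [off, off+n) of the local file to the same offset of
// the remote file, using a dedicated SFTP channel for this range.
func putRange(conn *ssh.Client, local, remote string, off, n int64) error {
	c, err := sftp.NewClient(conn) // separate SFTP channel on the shared SSH connection
	if err != nil {
		return err
	}
	defer c.Close()

	src, err := os.Open(local)
	if err != nil {
		return err
	}
	defer src.Close()
	if _, err := src.Seek(off, io.SeekStart); err != nil {
		return err
	}

	dst, err := c.OpenFile(remote, os.O_WRONLY|os.O_CREATE)
	if err != nil {
		return err
	}
	defer dst.Close()
	if _, err := dst.Seek(off, io.SeekStart); err != nil {
		return err
	}

	_, err = io.CopyN(dst, src, n)
	if err == io.EOF { // the final range is usually shorter than n
		err = nil
	}
	return err
}
```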
Thanks for your comment. However, I disagree on 1.
I'm pretty sure that it is crucial to have several sftp packets in flight in order to saturate the available bandwidth. Suppose I'd like to transfer a 100MiB file to a server via sftp. I have an extremely fast CPU and my crypto primitives are tuned to the max, so crypto is assumed to take no time at all. Now assume that the latency (round-trip time) to the sftp server is 1ms. In sftp, all servers should support WRITE packets with 32KiB of data, and the Go sftp library uses this size for all data packets, so in order to write all of the 100MiB I'd have to send 3200 sftp WRITE packets. If I send all of these packets serially and wait for the STATUS packet to arrive before sending the next one, this will take at least 3.2 seconds. Considering that TCP splits these 32KiB WRITE packets into multiple packets on the wire, 3.2 seconds is just the lower bound and it may take even longer. So for 1ms latency to the server, the upper bound for the throughput is 100MiB/3.2s = 31.25MiB/s.

For a much more realistic latency, consider two servers in two data centers in Germany with a latency of about 9ms. This means that the lower bound for the transfer time of 3200 WRITE packets is 28.8s; in other words, the maximum bandwidth here is ~3.5MiB/s. At home, I have a latency of about 22ms to one of my servers. In this case the bandwidth is limited to ~1.4MiB/s.

All the calculations above do not account for any time spent on crypto primitives. Therefore I conclude that it is necessary to have multiple packets in flight.
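As a sanity check, here is the same arithmetic in a few lines of Go; the file size, the 32KiB-per-WRITE size, and the one-packet-in-flight assumption are the ones stated above:

```go
package main

import "fmt"

func main() {
	const fileSize = 100 << 20 // 100 MiB to transfer
	const packet = 32 << 10    // 32 KiB of data per SFTP WRITE packet
	const packets = fileSize / packet

	for _, rttMs := range []float64{1, 9, 22} {
		seconds := packets * rttMs / 1000 // one full round trip per packet
		mibPerSec := (fileSize / (1 << 20)) / seconds
		fmt.Printf("RTT %2.0f ms: at least %.1f s, so at most %.2f MiB/s\n",
			rttMs, seconds, mibPerSec)
	}
}
```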
I think it's also possible to have multiple concurrent operations with several packets in flight without the need for a separate goroutine, but this may be much more complex. Did a pattern for starting and managing goroutines in a library emerge from your experience with the ssh package that we could perhaps use?
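For reference, the shape that pattern usually takes is a single receive goroutine that routes each response to the waiting caller by request ID. A generic sketch (not the ssh package's internals, and not the code in this PR):

```go
package example

import "sync"

// inflight matches responses back to the request that is waiting for them,
// so any number of requests can be outstanding while one goroutine reads
// from the connection.
type inflight struct {
	mu      sync.Mutex
	pending map[uint32]chan []byte
}

// expect registers interest in the response for request id and returns the
// channel on which it will be delivered.
func (f *inflight) expect(id uint32) chan []byte {
	ch := make(chan []byte, 1)
	f.mu.Lock()
	f.pending[id] = ch
	f.mu.Unlock()
	return ch
}

// deliver is called by the single receive loop for every incoming packet;
// out-of-order arrival is fine because routing is keyed on the request id.
func (f *inflight) deliver(id uint32, pkt []byte) {
	f.mu.Lock()
	ch, ok := f.pending[id]
	delete(f.pending, id)
	f.mu.Unlock()
	if ok {
		ch <- pkt
	}
}
```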
According to the spec's section 6.1, there are no restrictions regarding the order of the response packets:
But what about my example above, copying a file of 100MiB to a server? Wouldn't that mean opening the file several times over several sftp connections and writing in parallel? That sounds wrong to me... I'm looking forward to your thoughts about my reasoning.
I just did a few more benchmarks to compare the Go ssh client (and Go crypto primitives) against stock OpenSSH. TL;DR: The Go crypto primitives don't limit sftp bandwidth. For this I modified the gsftp example client from the repository.

Uploading 500MiB of random data to the server with the stock OpenSSH sftp client:
This is CPU-bound on the receiving end (can easily be seen in …).

Master Branch (without optimizations)

Now …
Second, use the stock OpenSSH client to connect to the server:
As you can already see, this is pretty close: 7.53MiB/s vs. 8.29MiB/s.

Increase-Throughput (sending multiple packets in parallel)

When building …

Builtin SSH client:
OpenSSH stock SSH client:
Still, pretty close: 29.4MiB/s vs. 30.4MiB/s. For 500MiB of data it's even closer: 32.6MiB/s (Go) vs. 32.9MiB/s (OpenSSH). This data supports my conclusion above: the limiting factor here is the latency, not the Go crypto primitives. Those already seem to be well optimized.

Latency

Adding 9ms of artificial latency (under Linux: …)
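The exact invocation is cut off above; the commit message later in this thread uses tc's netem qdisc for the same purpose, so it was presumably something along these lines (interface name taken from that later comment, delay adjusted):

```
# tc qdisc add dev enp0s25 root netem delay 9ms
```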
Checking maximum realistic bandwidth with the OpenSSH sftp client:
Build and run the test program against the master branch:
Including my optimizations:
This shows that the latency still limits the maximum bandwidth, but less severely. With 20ms latency the bandwidth is 13.9MiB/s (Go) vs. 13.8MiB/s (OpenSSH).
Thanks, I've just got back from a long trip. I'll try to find some time to look at this.
I've seen that you presented at dotGo; unfortunately I was unable to attend (even though I don't live far from Paris). I'm looking forward to your thoughts!
Thanks for your response -- nothing concludes a discussion like hard data!
sendMu    sync.Mutex // guarantees atomic operation for sending packets
pktsOutCh chan idCh
done      chan (struct{})
drop the brackets around struct{}, they aren't necessary.
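That is, the field would read simply:

```go
done chan struct{}
```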
Thank you. Here is my first review pass. For clarification: you've sold me on the idea; a 4x speedup is worth making the code more complex.
Benchmarks before:

    BenchmarkRead1k                     10   142398598 ns/op   73.64 MB/s
    BenchmarkRead16k                    50    32758323 ns/op  320.10 MB/s
    BenchmarkRead32k                   100    42391983 ns/op  247.36 MB/s
    BenchmarkRead128k                   50    42014906 ns/op  249.58 MB/s
    BenchmarkRead512k                   50    32767844 ns/op  320.01 MB/s
    BenchmarkRead1MiB                   50    38830615 ns/op  270.04 MB/s
    BenchmarkRead4MiB                   50    33023037 ns/op  317.53 MB/s
    BenchmarkWrite1k                    10   140492382 ns/op   74.64 MB/s
    BenchmarkWrite16k                   50    37735998 ns/op  277.87 MB/s
    BenchmarkWrite32k                   50    34301633 ns/op  305.70 MB/s
    BenchmarkWrite128k                  50    29061948 ns/op  360.81 MB/s
    BenchmarkWrite512k                  50    36087549 ns/op  290.57 MB/s
    BenchmarkWrite1MiB                  50    39458367 ns/op  265.75 MB/s
    BenchmarkWrite4MiB                  50    32230107 ns/op  325.34 MB/s
    BenchmarkMarshalInit           5000000         651 ns/op
    BenchmarkMarshalOpen           5000000         577 ns/op
    BenchmarkMarshalWriteWorstCase   50000       46544 ns/op
    BenchmarkMarshalWrite1k         500000        2823 ns/op
    ok      github.com/pkg/sftp     39.618s

after:

    BenchmarkRead1k                      1  1069701026 ns/op    9.80 MB/s
    BenchmarkRead16k                    20   111213902 ns/op   94.29 MB/s
    BenchmarkRead32k                    50    69827661 ns/op  150.17 MB/s
    BenchmarkRead128k                   50    43963052 ns/op  238.52 MB/s
    BenchmarkRead512k                  100    26209817 ns/op  400.07 MB/s
    BenchmarkRead1MiB                  100    22649939 ns/op  462.95 MB/s
    BenchmarkRead4MiB                   50    35849768 ns/op  292.50 MB/s
    BenchmarkWrite1k                     1  2450890588 ns/op    4.28 MB/s
    BenchmarkWrite16k                   10   222473163 ns/op   47.13 MB/s
    BenchmarkWrite32k                   10   135452515 ns/op   77.41 MB/s
    BenchmarkWrite128k                  50    52768404 ns/op  198.72 MB/s
    BenchmarkWrite512k                  50    31508170 ns/op  332.80 MB/s
    BenchmarkWrite1MiB                  50    27730215 ns/op  378.14 MB/s
    BenchmarkWrite4MiB                  50    43410403 ns/op  241.55 MB/s
    BenchmarkMarshalInit           5000000         659 ns/op
    BenchmarkMarshalOpen           5000000         560 ns/op
    BenchmarkMarshalWriteWorstCase   50000       43900 ns/op
    BenchmarkMarshalWrite1k        1000000        2471 ns/op
    ok      github.com/pkg/sftp     44.857s

I also tested a real-world scenario, uploading 500MiB of random data to a (rather slow embedded) server in my living room.

sftp:

    $ sftp server
    Connected to server.
    sftp> put testdata
    testdata                100%  500MB  41.7MB/s   00:12

Before this commit, with the simple client `put.go` and pv (pipe view); the backend is the system ssh client connecting to the same server as above, and the file is written to the exact same location on the server:

    $ pv testdata | go run put.go
    500MiB 0:00:57 [8.68MiB/s] [=======================>] 100%

With the new code:

    $ pv testdata | go run put.go
    500MiB 0:00:14 [34.9MiB/s] [=======================>] 100%

Then I added 20ms of artificial latency to this server via tc:

    # tc qdisc add dev enp0s25 root netem delay 20ms

Afterwards the difference is even more drastic. The first test with sftp results in 41.7MiB/s. Next, with `put.go`: the old code reaches 1.35MiB/s, the new code reaches 30.9MiB/s.
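`put.go` itself isn't shown in the thread. A minimal client along those lines, speaking SFTP over the system ssh client's stdin/stdout and streaming from stdin so it can be fed by `pv`, might look roughly like this (the host and remote path are placeholders; the real test program was derived from the gsftp example):

```go
package main

import (
	"io"
	"log"
	"os"
	"os/exec"

	"github.com/pkg/sftp"
)

func main() {
	// Use the system ssh client as the transport and speak SFTP over its
	// stdin/stdout pipes, similar to the examples in this repository.
	cmd := exec.Command("ssh", "-s", "server", "sftp") // "server" is a placeholder host
	wr, err := cmd.StdinPipe()
	if err != nil {
		log.Fatal(err)
	}
	rd, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	client, err := sftp.NewClientPipe(rd, wr)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	dst, err := client.Create("/tmp/testdata") // placeholder remote path
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// Stream stdin to the remote file, so `pv testdata | go run put.go`
	// reports the throughput on the sending side.
	if _, err := io.Copy(dst, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```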
Thanks for your comments. I've reworked the commit; please have a look. The example benchmarks for the master branch on my machine are:
and for the new code:
For my file server, master branch:
New code:
The streaming read/write benchmark can be accelerated by implementing WriteTo() and ReadFrom(), but that's step 2.
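For context on why that helps: io.Copy checks for these optional interfaces before falling back to its fixed-size buffer loop, so a File that implements WriteTo/ReadFrom is free to keep many requests in flight per call. A sketch of that dispatch (roughly what the standard library does; not code from this PR):

```go
package example

import "io"

// copyFast mirrors the interface checks io.Copy performs: if either side
// provides WriteTo or ReadFrom, the transfer is delegated to it and the
// implementation can pipeline its own requests instead of being driven one
// buffered Write call at a time.
func copyFast(dst io.Writer, src io.Reader) (int64, error) {
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src)
	}
	return io.Copy(dst, src) // plain buffered copy otherwise
}
```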
Hey Dave, is there anything else I can do here?
Hi, sorry for the delay. There are a few problems, some minor, some not.
If you can fix the merge conflict, I'll merge this.
Superseded by #41 |