Increase throughput #24

Closed · wants to merge 1 commit

Conversation

@fd0 (Contributor) commented Oct 8, 2014

No description provided.

@davecheney (Member) commented:

This PR has bits of #23 in it, plus some code cleanups. Could I ask you to repropose it once #23 has landed (very soon), and move the unrelated cleanups to another PR to reduce the size of the change?

@fd0 (Contributor, Author) commented Oct 9, 2014

Sure, I'll do a cleanup this evening. I thought it might be helpful to get your opinion on my approach as soon as possible. The commit from #23 was included by accident.

@fd0 (Contributor, Author) commented Oct 9, 2014

Reworked the commit, split the minor improvements into #25, and removed the commit that added the benchmarks.

@davecheney (Member) commented:

Thanks for submitting this proposal.

I'm not in a position to do a full review today, but here are my initial thoughts.

  1. I'm of the belief that the major difference in performance between go/sftp and openssh is the lack of well-tuned crypto primitives in Go.
  2. I'm not keen on the approach this patch takes, spinning up and managing multiple goroutines per connection to mux/demux messages. This was the approach we took with the ssh package itself, and we ended up with a lot of issues around locking and handling the close semantics properly.

I need to find some time to review the sftp documentation to figure out the patterns in which a message may arrive out of order.

As an aside, it strikes me that a better way to improve sftp throughput would be to multiplex over multiple sftp.Conns rather than placing the logic to handle out-of-order messages on a single conn, as in either case those Conns are multiplexed on the back of a single ssh.Conn.

@fd0 (Contributor, Author) commented Oct 18, 2014

Thanks for your comment. However, I disagree on 1.

  1. I'm of the belief that the major difference in performance between go/sftp and openssh is the lack of well-tuned crypto primitives in Go.

I'm pretty sure that it is crucial to have several sftp packets in flight in order to saturate the available bandwidth.

Suppose I'd like to transfer a 100MiB file to a server via sftp. I have an extremely fast CPU and my crypto primitives are tuned to the max, so crypto is assumed to take no time at all.

Now assume that the latency (round-trip time) to the sftp server is 1ms. In sftp, all servers should support WRITE packets with 32KiB of data, and the Go sftp library uses this size for all data packets, so in order to write all 100MiB I'd have to send 3200 sftp WRITE packets. If I send these packets serially and wait for each STATUS packet to arrive before sending the next one, this will take at least 3.2 seconds. Considering that TCP splits these 32KiB WRITE packets into multiple packets on the wire, 3.2 seconds is just the lower bound and it may take even longer. So for 1ms latency to the server, the upper bound for the throughput is 100MiB/3.2s = 31.25MiB/s.

For a much more realistic latency, consider two servers in two data centers in Germany with a latency of about 9ms. This means that the lower bound for the transfer time of 3200 WRITE packets is 28.8s; in other words, the maximum bandwidth here is ~3.5MiB/s.

At home, I have a latency of about 22ms to one of my servers. In this case the bandwidth is limited to ~1.4MiB/s.

All the calculations above do not consider any time spent on crypto primitives.

Therefore I conclude that it is necessary to have multiple packets in flight.
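
For reference, here are the same back-of-the-envelope numbers as a small stand-alone program (purely illustrative, not part of this patch), reproducing the three figures above:

    package main

    import "fmt"

    func main() {
        const (
            fileSize   = 100 * 1024 * 1024 // 100 MiB to transfer
            packetSize = 32 * 1024         // one WRITE packet carries 32 KiB
        )
        packets := fileSize / packetSize // 3200 WRITE packets

        for _, rttMs := range []float64{1, 9, 22} {
            total := float64(packets) * rttMs / 1000           // one full round trip per packet
            limit := float64(fileSize) / total / (1024 * 1024) // resulting upper bound in MiB/s
            fmt.Printf("RTT %4.0f ms: %6.1f s total, max %5.2f MiB/s\n", rttMs, total, limit)
        }
    }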

  2. I'm not keen on the approach this patch takes, spinning up and managing multiple goroutines per connection to mux/demux messages. This was the approach we took with the ssh package itself, and we ended up with a lot of issues around locking and handling the close semantics properly.

I think it's also possible to have multiple concurrent operations with several packets in flight without the need for a separate goroutine, but this may be much more complex. Did a pattern for starting and managing goroutines in a library emerge from your experience with the ssh package that we could maybe reuse?
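
Just to make sure we're talking about the same pattern, here is a minimal sketch of the mux/demux idea (illustrative only, not the code from this PR): callers register a buffered channel under their request id before sending, and a single reader goroutine delivers each reply to the matching channel.

    package sftpx // illustrative only, not the code in this PR

    import "sync"

    // response is a placeholder for a decoded sftp reply packet.
    type response struct {
        typ     byte
        payload []byte
    }

    type conn struct {
        mu      sync.Mutex
        pending map[uint32]chan response // request id -> waiting caller
    }

    func newConn() *conn {
        return &conn{pending: make(map[uint32]chan response)}
    }

    // register must be called before the request is sent, so the reply cannot
    // arrive while nobody is listening for it.
    func (c *conn) register(id uint32) <-chan response {
        ch := make(chan response, 1)
        c.mu.Lock()
        c.pending[id] = ch
        c.mu.Unlock()
        return ch
    }

    // readLoop runs in a single goroutine; recv reads and decodes one packet.
    func (c *conn) readLoop(recv func() (uint32, response, error)) {
        for {
            id, r, err := recv()
            if err != nil {
                return // a real implementation would also fail all pending requests here
            }
            c.mu.Lock()
            ch, ok := c.pending[id]
            delete(c.pending, id)
            c.mu.Unlock()
            if ok {
                ch <- r // buffered, so the read loop never blocks on a slow caller
            }
        }
    }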

I need to find some time to review the sftp documentation to figure out the patterns in which a message may arrive out of order.

According to the spec's section 6.1, there are no restrictions regarding the order of the response packets:

There are no restrictions on the order in which responses to
outstanding requests are delivered to the client, except that the
server must ensure fairness in the sense that processing of no
request will be indefinitely delayed even if the client is sending
other requests so that there are multiple outstanding requests all
the time.

As an aside, it strikes me that a better way to improve sftp throughput would be to multiplex over multiple sftp.Conns rather than placing the logic to handle out-of-order messages on a single conn, as in either case those Conns are multiplexed on the back of a single ssh.Conn.

But what about my example above, copying a file of 100MiB to a server? Wouldn't that mean opening the file several times over several sftp connections and writing in parallel? That sounds wrong to me...

I'm looking forward to your thoughts about my reasoning.

@fd0 (Contributor, Author) commented Oct 19, 2014

I just did a few more benchmarks to compare the Go ssh client (and Go crypto primitives) against stock OpenSSH. TL;DR: The Go crypto primitives don't limit sftp bandwidth.

For this I modified the gsftp example client from the repository (gsftp.go): I added an option to start an external program instead of using the Go ssh client, and the io.Copy() in put/fetch is done manually with larger buffers. The server noi is my file server in the living room, connected to my workstation via gigabit ethernet; the latency is about 0.4ms.
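
Roughly, the -cmd mode boils down to starting the external ssh in sftp subsystem mode and driving the library over its pipes; a minimal sketch of the wiring (assuming sftp.NewClientPipe, the actual gsftp changes may differ):

    // Sketch of the -cmd mode: spawn the external ssh in sftp subsystem mode
    // and drive pkg/sftp over the subprocess's stdin/stdout.
    package main

    import (
        "log"
        "os/exec"

        "github.com/pkg/sftp"
    )

    func main() {
        cmd := exec.Command("ssh", "fs@noi", "-s", "sftp")
        wr, err := cmd.StdinPipe()
        if err != nil {
            log.Fatal(err)
        }
        rd, err := cmd.StdoutPipe()
        if err != nil {
            log.Fatal(err)
        }
        if err := cmd.Start(); err != nil {
            log.Fatal(err)
        }

        client, err := sftp.NewClientPipe(rd, wr)
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()
        // ... put/fetch exactly as with the regular Go ssh transport ...
    }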

Uploading 500MiB of random data to the server with the stock OpenSSH sftp client:

$ sftp fs@noi
sftp> put testdata-500MiB
Uploading testdata-500MiB to /home/fs/testdata-500MiB
testdata-500MiB                                            100%  500MB  38.5MB/s   00:13

This is CPU-bound on the receiving end (as can easily be seen in top), because the file server has a low-power AMD Turion CPU. So the maximum bandwidth over SFTP should be around 38MiB/s.

Master Branch (without optimizations)

Now gsftp.go is built against the master branch without my optimizations from this pull request. Let's run the program with the Go ssh client:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:13 [7.53MiB/s] [==============================>] 100%

Second, use the stock OpenSSH client to connect to the server:

$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
run command "ssh fs@noi -s sftp"
 100MiB 0:00:12 [8.29MiB/s] [==============================>] 100%

As you can already see, this is pretty close: 7.53MiB/s vs. 8.29MiB/s.

Increase-Throughput (sending multiple packets in parallel)

When building gsftp.go against the branch containing my optimizations, the same picture emerges:

Builtin SSH client:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:03 [29.4MiB/s] [==============================>] 100%

OpenSSH stock SSH client:

$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
run command "ssh fs@noi -s sftp"
 100MiB 0:00:03 [30.4MiB/s] [==============================>] 100%

Still, pretty close: 29.4MiB/s vs. 30.4MiB/s. For 500MiB of data it's even closer: 32.6MiB/s (Go) vs. 32.9MiB/s (OpenSSH).

This data supports my conclusion above: The limiting factor here is the latency, not the Go crypto primitives. Those seem to be optimized well already.

Latency

Adding 9ms artificial latency (under Linux: tc qdisc add dev eth0 root netem delay 9ms):

$ ping -c 5 noi
PING noi (172.25.33.1) 56(84) bytes of data.
64 bytes from noi (172.25.33.1): icmp_seq=1 ttl=64 time=9.35 ms
64 bytes from noi (172.25.33.1): icmp_seq=2 ttl=64 time=9.49 ms
64 bytes from noi (172.25.33.1): icmp_seq=3 ttl=64 time=9.53 ms
64 bytes from noi (172.25.33.1): icmp_seq=4 ttl=64 time=9.47 ms
64 bytes from noi (172.25.33.1): icmp_seq=5 ttl=64 time=9.48 ms
--- noi ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 9.351/9.467/9.534/0.061 ms

Checking maximum realistic bandwidth with the OpenSSH sftp client:

$ sftp fs@noi
Connected to noi.
sftp> put testdata-500MiB
Uploading testdata-500MiB to /home/fs/testdata-500MiB
testdata-500MiB                             100%  500MB  38.5MB/s   00:13

Build and run the test program against the master branch:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:42 [2.35MiB/s] [==============================>] 100%

$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
 100MiB 0:00:41 [2.43MiB/s] [==============================>] 100%

Including my optimizations:

$ pv testdata-100MiB | ./gsftp -host noi -user fs put /home/fs/testdata-100MiB
 100MiB 0:00:04 [20.1MiB/s] [==============================>] 100%
$ pv testdata-100MiB | ./gsftp -cmd "ssh fs@noi -s sftp" put /home/fs/testdata-100MiB
run command "ssh fs@noi -s sftp"
 100MiB 0:00:04 [20.8MiB/s] [==============================>] 100%

This shows that the latency still limits the maximum bandwidth, but less severely. With 20ms latency the bandwidth is 13.9MiB/s (Go) vs. 13.8MiB/s (OpenSSH).
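
(As a rough rule of thumb, not measured here: with W WRITE packets in flight the same reasoning gives an upper bound of about W × 32KiB / RTT, so the ~20MiB/s observed at 9ms RTT corresponds to roughly six 32KiB packets in flight on average; a larger window pushes the bound up until the link or the receiving CPU becomes the limit.)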

@davecheney (Member) commented:

Thanks, I've just got back from a long trip. I'll try to find some time to review this.

@fd0 (Contributor, Author) commented Oct 22, 2014

I've seen that you presented at dotGo, unfortunately I was unable to attend (and I don't live far away from Paris). I'm looking forward to your thoughts!

@davecheney (Member) commented:

Thanks for your response -- nothing concludes a discussion like hard data!


A review comment from a Member on this code:

    sendMu    sync.Mutex // guarantees atomic operation for sending packets
    pktsOutCh chan idCh
    done      chan (struct{})

drop the brackets around struct{}, they aren't necessary.

@davecheney (Member) commented:

Thank you. Here is my first review pass.

For clarification: you've sold me on the idea, a 4x speedup is worth making the code more complex.

Benchmarks before:

    BenchmarkRead1k           10     142398598 ns/op      73.64 MB/s
    BenchmarkRead16k          50      32758323 ns/op     320.10 MB/s
    BenchmarkRead32k         100      42391983 ns/op     247.36 MB/s
    BenchmarkRead128k         50      42014906 ns/op     249.58 MB/s
    BenchmarkRead512k         50      32767844 ns/op     320.01 MB/s
    BenchmarkRead1MiB         50      38830615 ns/op     270.04 MB/s
    BenchmarkRead4MiB         50      33023037 ns/op     317.53 MB/s
    BenchmarkWrite1k          10     140492382 ns/op      74.64 MB/s
    BenchmarkWrite16k         50      37735998 ns/op     277.87 MB/s
    BenchmarkWrite32k         50      34301633 ns/op     305.70 MB/s
    BenchmarkWrite128k        50      29061948 ns/op     360.81 MB/s
    BenchmarkWrite512k        50      36087549 ns/op     290.57 MB/s
    BenchmarkWrite1MiB        50      39458367 ns/op     265.75 MB/s
    BenchmarkWrite4MiB        50      32230107 ns/op     325.34 MB/s
    BenchmarkMarshalInit     5000000           651 ns/op
    BenchmarkMarshalOpen     5000000           577 ns/op
    BenchmarkMarshalWriteWorstCase     50000         46544 ns/op
    BenchmarkMarshalWrite1k   500000          2823 ns/op
    ok      github.com/pkg/sftp 39.618s

after:

    BenchmarkRead1k        1        1069701026 ns/op           9.80 MB/s
    BenchmarkRead16k              20         111213902 ns/op          94.29 MB/s
    BenchmarkRead32k              50          69827661 ns/op         150.17 MB/s
    BenchmarkRead128k             50          43963052 ns/op         238.52 MB/s
    BenchmarkRead512k            100          26209817 ns/op         400.07 MB/s
    BenchmarkRead1MiB            100          22649939 ns/op         462.95 MB/s
    BenchmarkRead4MiB             50          35849768 ns/op         292.50 MB/s
    BenchmarkWrite1k               1        2450890588 ns/op           4.28 MB/s
    BenchmarkWrite16k             10         222473163 ns/op          47.13 MB/s
    BenchmarkWrite32k             10         135452515 ns/op          77.41 MB/s
    BenchmarkWrite128k            50          52768404 ns/op         198.72 MB/s
    BenchmarkWrite512k            50          31508170 ns/op         332.80 MB/s
    BenchmarkWrite1MiB            50          27730215 ns/op         378.14 MB/s
    BenchmarkWrite4MiB            50          43410403 ns/op         241.55 MB/s
    BenchmarkMarshalInit     5000000               659 ns/op
    BenchmarkMarshalOpen     5000000               560 ns/op
    BenchmarkMarshalWriteWorstCase     50000             43900 ns/op
    BenchmarkMarshalWrite1k  1000000              2471 ns/op
    ok      github.com/pkg/sftp     44.857s

I also tested a real-world scenario, uploading 500MiB of random data to
a (rather slow embedded) server in my living room:

sftp:

    $ sftp server
    Connected to server.
    sftp> put testdata
    testdata                      100% 500MB  41.7MB/s   00:12

Before this commit, with the simple client `put.go` and pv (pipe viewer); the backend is the system ssh client connecting to the same server as above, and the file is written to the exact same location on the server:

    $ pv testdata | go run put.go
    500MiB 0:00:57 [8.68MiB/s] [=======================>] 100%

With the new code:

    $ pv testdata | go run put.go
    500MiB 0:00:14 [34.9MiB/s] [=======================>] 100%

Then I added 20ms of artificial latency to this server via tc:

    # tc qdisc add dev enp0s25 root netem delay 20ms

Afterwards the difference is even more drastic. First test with sftp
results in 41.7MiB/s. Next with `put.go`, old code is 1.35MiB/s, new
code reaches 30.9MiB/s.

@fd0 (Contributor, Author) commented Nov 9, 2014

Thanks for your comments. I've reworked the commit, please have a look.

The example benchmarks for the master branch on my machine are:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go"; done
examples/buffered-read-benchmark
2014/11/09 21:18:19 reading 1e+09 bytes
2014/11/09 21:18:41 read 1000000000 bytes in 21.850351328s
examples/buffered-write-benchmark
2014/11/09 21:18:42 writing 1e+09 bytes
2014/11/09 21:19:12 wrote 1000000000 bytes in 29.122257824s
examples/streaming-read-benchmark
2014/11/09 21:19:13 reading 1000000000 bytes
2014/11/09 21:19:35 read 1000000000 bytes in 21.648978119s
examples/streaming-write-benchmark
2014/11/09 21:19:36 writing 1000000000 bytes
2014/11/09 21:20:04 wrote 1000000000 bytes in 27.471592583s

and for the new code:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go"; done
examples/buffered-read-benchmark
2014/11/09 21:16:18 reading 1e+09 bytes
2014/11/09 21:16:30 read 1000000000 bytes in 11.574949353s
examples/buffered-write-benchmark
2014/11/09 21:16:31 writing 1e+09 bytes
2014/11/09 21:16:44 wrote 1000000000 bytes in 12.603036457s
examples/streaming-read-benchmark
2014/11/09 21:16:46 reading 1000000000 bytes
2014/11/09 21:17:07 read 1000000000 bytes in 21.470725004s
examples/streaming-write-benchmark
2014/11/09 21:17:08 writing 1000000000 bytes
2014/11/09 21:17:31 wrote 1000000000 bytes in 22.354928153s

For my file server, master branch:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go -host noi -user fs"; done
examples/buffered-read-benchmark
2014/11/09 21:49:46 reading 1e+09 bytes
2014/11/09 21:51:41 read 1000000000 bytes in 1m55.27282327s
examples/buffered-write-benchmark
2014/11/09 21:51:43 writing 1e+09 bytes
2014/11/09 21:53:43 wrote 1000000000 bytes in 2m0.487277412s
examples/streaming-read-benchmark
2014/11/09 21:53:44 reading 1000000000 bytes
2014/11/09 21:55:40 read 1000000000 bytes in 1m55.796024742s
examples/streaming-write-benchmark
2014/11/09 21:55:42 writing 1000000000 bytes
2014/11/09 21:57:41 wrote 1000000000 bytes in 1m59.713730318s

New code:

% for dir in examples/*benchmark; do sh -c "echo $dir; cd $dir; go run main.go -host noi -user fs"; done
examples/buffered-read-benchmark
2014/11/09 21:41:28 reading 1e+09 bytes
2014/11/09 21:41:58 read 1000000000 bytes in 29.321412409s
examples/buffered-write-benchmark
2014/11/09 21:41:59 writing 1e+09 bytes
2014/11/09 21:42:24 wrote 1000000000 bytes in 25.09417721s
examples/streaming-read-benchmark
2014/11/09 21:42:26 reading 1000000000 bytes
2014/11/09 21:44:22 read 1000000000 bytes in 1m56.917131627s
examples/streaming-write-benchmark
2014/11/09 21:44:24 writing 1000000000 bytes
2014/11/09 21:46:26 wrote 1000000000 bytes in 2m1.687394197s

The streaming read/write benchmark can be accelerated by implementing WriteTo() and ReadFrom(), but that's step 2.
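
Not part of this commit, but as a rough sketch of how a pipelined ReadFrom could look (WriteChunkAsync, the interface and the package name are hypothetical, purely to illustrate the windowing idea; a real implementation would live inside the package and use the existing packet-sending machinery):

    package sftpx // illustrative package name, not part of pkg/sftp

    import "io"

    // asyncWriter models the primitive a pipelined ReadFrom would need: issue a
    // WRITE packet at the given offset and return a channel that delivers the
    // server's status reply. This is hypothetical; pkg/sftp does not expose it.
    type asyncWriter interface {
        WriteChunkAsync(off int64, p []byte) <-chan error
    }

    // pipelinedReadFrom reads 32KiB chunks from r and keeps up to maxInFlight
    // WRITE requests outstanding instead of waiting for each STATUS reply.
    func pipelinedReadFrom(w asyncWriter, r io.Reader) (int64, error) {
        const (
            chunkSize   = 32 * 1024
            maxInFlight = 16
        )
        var (
            written int64
            pending []<-chan error
        )
        for {
            buf := make([]byte, chunkSize) // fresh buffer, the previous chunk may still be in flight
            n, err := r.Read(buf)
            if n > 0 {
                pending = append(pending, w.WriteChunkAsync(written, buf[:n]))
                written += int64(n)
            }
            // Collect replies once the window is full, or collect all of them at EOF/error.
            for len(pending) > 0 && (len(pending) >= maxInFlight || err != nil) {
                if werr := <-pending[0]; werr != nil {
                    return written, werr
                }
                pending = pending[1:]
            }
            if err == io.EOF {
                return written, nil
            }
            if err != nil {
                return written, err
            }
        }
    }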

@fd0 changed the title from "Increase troughput" to "Increase throughput" on Nov 17, 2014

@fd0 (Contributor, Author) commented Dec 21, 2014

Hey Dave, is there anything else I can do here?

@davecheney (Member) commented:

Hi,

Sorry for the delay.

There are a few problems, some minor, some not.

  1. There is a merge conflict; I have no idea why. Could you re-push this change?
  2. I don't have an objection to this change, but I do think it should be expanded so that all commands issued to the server are made asynchronously.

If you can fix the merge conflict I'll merge this.

This was referenced May 23, 2015
@ggriffiniii mentioned this pull request on Jun 2, 2015
@davecheney (Member) commented:

Superseded by #41

@davecheney closed this on Jun 3, 2015
@fd0 mentioned this pull request on Apr 15, 2017