
TCP throughput #685

Closed
jordens opened this issue Mar 17, 2017 · 54 comments
@jordens
Member

jordens commented Mar 17, 2017

... seems to be around 25 kB/s currently (rust/smoltcp). Should and could be much faster.

https://irclog.whitequark.org/m-labs/2017-03-17#1489751108-1489752071;

@dnadlinger
Collaborator

dnadlinger commented May 12, 2017

(Commenting here, since #732 was closed.)

I've had a quick look at our specific setup with varying levels of background (broadcast) traffic and different window sizes. I don't really have the time to fix this myself right now, but here are some bullet points:

  • ACKs should be sent immediately on zero-window (or after having received two MSS-size packets). Currently, they always seem to be delayed. This is the cause of the bad performance when the window is full (e.g. when streaming data with a window only a single segment large), where the effective RTT becomes something like 40 ms.
  • Packets with non-consecutive sequence numbers shouldn't be dropped silently, but cause a duplicate ACK so that fast retransmission can be triggered for the dropped segments. This avoids the ~1 s pauses in some of the traces floating around here.
  • TCP reassembly would help a lot with recovering from dropped packets. Since those are going to be relatively frequent with the slow NIC, I think implementing this should be a priority after fixing the ACK issues.
  • Related to the previous points, advertising the full receive window size and leaving the limited rx ring buffer size for the remote slow start/congestion control implementation to deal with seems like the much saner choice than trying to limit the window size, as the rx buffers are shared between all traffic (and can just be considered part of the link).
  • ACKs with the wrong sequence number should trigger a challenge ACK instead of being dropped. This is of much lower priority than the other issues, because while it is easily possible to get reordering on a lab network when saturating commodity switches, I can't think of a realistic scenario where this would seriously hurt RPC performance.

And on the MAC side of things:

  • The current rx ring buffer size of 4 is rather small for a Gb link, making it more sensitive to bursty but small broadcast traffic than necessary.

@whitequark
Contributor

ACKs should be sent immediately on zero-window (or after having received two MSS-size packets). Currently, they always seem to be delayed.

smoltcp does not implement delayed ACKs; this is an artifact of how we perform scheduling. I think it can be addressed; I'm currently trying a few approaches.

Packets with non-consecutive sequence numbers shouldn't be dropped silently, but cause a duplicate ACK so that fast retransmission can be triggered for the dropped segments.

Correct. This is the main reason for slowness.
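
For illustration, here is a minimal sketch of that idea (not smoltcp's actual code; RecvState and SegmentAction are invented for this example, and sequence-number wraparound is ignored for brevity):

struct RecvState {
    next_seq: u32, // next in-order sequence number we expect
    window: u16,   // receive window we currently advertise
}

enum SegmentAction {
    Accept,                                 // in order: queue payload, ACK as usual
    DuplicateAck { ack: u32, window: u16 }, // out of order: re-ACK the last in-order byte
}

fn on_data_segment(state: &RecvState, seq_number: u32) -> SegmentAction {
    if seq_number == state.next_seq {
        SegmentAction::Accept
    } else {
        // Instead of silently dropping the segment, immediately acknowledge what
        // we *have* received; three such duplicate ACKs let the sender
        // fast-retransmit the missing segment without waiting for its RTO.
        SegmentAction::DuplicateAck { ack: state.next_seq, window: state.window }
    }
}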

ACKs with the wrong sequence number should trigger a challenge ACK instead of being dropped.

This would just be handled by the same logic as above.

Related to the previous points, advertising the full receive window size and leaving the limited rx ring buffer size for the remote slow start/congestion control implementation to deal with seems like the much saner choice than trying to limit the window size

It seems like a much saner choice, but after experimentation I have determined that the Linux and Windows TCP stacks in their default configuration are unable to deal with it correctly. The current situation is far better, if conceptually flawed.

TCP reassembly would help a lot with recovering from dropped packets. Since those are going to be relatively frequent with the slow NIC, I think implementing this should be a priority after fixing the ACK issues.

Implementing standards-compliant TCP reassembly is very tricky to do without heap allocation, especially in a space-efficient way. I've been thinking about the best way to do it, but I do not have a particularly good solution so far.

@whitequark
Contributor

The current rx ring buffer size of 4 is rather small for a Gb link, making it more sensitive to bursty but small broadcast traffic than necessary.

Indeed, I'm rather unhappy with the current MAC. It doesn't support jumbo packets (which would remove a lot of overhead in the runtime), it doesn't support MAC address filtering (which would eliminate packet loss due to broadcast traffic), and it doesn't do DMA, which limits buffer sizes to the amount of free BRAM in gateware. MAC address filtering is perhaps the easiest to address here...
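
As a rough illustration of the filtering idea only (the real filter would live in gateware, and none of these names exist in the runtime):

const BROADCAST: [u8; 6] = [0xff; 6];

fn accept_frame(dst: &[u8; 6], our_mac: &[u8; 6]) -> bool {
    // Keep frames addressed to us, broadcasts (still needed for ARP), and
    // multicasts (low bit of the first octet set); drop everything else before
    // it can occupy one of the few rx ring buffer slots.
    dst == our_mac || *dst == BROADCAST || (dst[0] & 0x01) != 0
}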

@sbourdeauducq
Member

Theoretically, jumbo frames are easy, and would boil down to changing a few numbers in the MAC followed by the usual yak shaving.
Anyway, lwip gave out 1MB/s with the current MAC and only 2 buffers.

@whitequark
Contributor

@sbourdeauducq Jumbo frames are just a wish, or maybe even a premature optimization; they should not be necessary to get lwip-like speeds.

@sbourdeauducq
Member

@cjbe How was the KC705 connected to the PC when you made the dump in #731? Did it go through some network router that can fragment a packet that is too large for the MTU of the outbound port?

@cjbe
Contributor

cjbe commented May 14, 2017

@sbourdeauducq it was connected through a D-Link switch (DGS108B). I can try it with a direct connection tomorrow if that would be useful?

@dnadlinger
Collaborator

smoltcp does not implement delayed ACKs, this is an artifact of how we perform scheduling.

Makes sense, thanks – I was wondering where the delay came from, since I couldn't see anything obvious in the stack itself.

It seems like a much saner choice, but after experimentation I have determined that the Linux and Windows TCP stacks in their default configuration are unable to deal with it correctly. The current situation is far better, if conceptually flawed.

Fair enough. I wonder whether the outcome would change with "proper" ACK handling and out-of-order support, though.

Implementing standards-compliant TCP reassembly is very tricky to do without heap allocation, especially in a space-efficient way. I've been thinking about the best way to do it, but I do not have a particularly good solution so far.

It should be perfectly workable to keep a free list backed by a static pool around to store per-segment metadata (sequence number ranges, …), while building up the payload directly in the circular buffer. The tricky bit seems to be to make sure to get the various state transitions/edge cases right, not the method used for memory allocation (given that you can always just bail and revert to dropping the packet). What am I missing?

You don't even need to support IP reassembly; a simple queue of incoming segments would be enough for this to be a huge improvement. This might not qualify as a "particularly good solution" according to your criteria, but a few-entry queue should get the job done admirably for our use case.
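
For concreteness, a rough sketch of the bookkeeping I mean, with invented names and assuming payloads are written directly into the circular rx buffer:

const MAX_OOO_SEGMENTS: usize = 4;

#[derive(Clone, Copy)]
struct OooSegment {
    offset: usize, // byte offset of the segment from the next in-order byte
    len: usize,
}

struct Assembler {
    slots: [Option<OooSegment>; MAX_OOO_SEGMENTS], // static pool, no heap
}

impl Assembler {
    // Remember an out-of-order segment whose payload already sits in the rx
    // buffer at `offset`. If the pool is full, just drop the segment and let
    // the peer retransmit it later.
    fn insert(&mut self, offset: usize, len: usize) -> bool {
        for slot in self.slots.iter_mut() {
            if slot.is_none() {
                *slot = Some(OooSegment { offset, len });
                return true;
            }
        }
        false
    }

    // After `len` new in-order bytes arrive, return how many bytes are now
    // contiguous (and may be exposed to the application), absorbing any stored
    // segments that the new data joins up with.
    fn advance(&mut self, len: usize) -> usize {
        let mut contiguous = len;
        let mut merged = true;
        while merged {
            merged = false;
            for slot in self.slots.iter_mut() {
                if let Some(seg) = *slot {
                    if seg.offset <= contiguous {
                        contiguous = contiguous.max(seg.offset + seg.len);
                        *slot = None;
                        merged = true;
                    }
                }
            }
        }
        // Rebase the offsets of any remaining segments so they stay relative
        // to the new next-in-order byte.
        for slot in self.slots.iter_mut() {
            if let Some(seg) = slot {
                seg.offset -= contiguous;
            }
        }
        contiguous
    }
}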

If you have any patches to test on our network environment(s), just ping me and I'll be happy to oblige.

@cjbe
Contributor

cjbe commented May 15, 2017

@sbourdeauducq here is a packet dump with the KC705 directly connected to the PC - it looks qualitatively similar to the dump through the router.

@sbourdeauducq
Member

Tentative patch by @whitequark:

diff --git a/src/socket/tcp.rs b/src/socket/tcp.rs
index f0ab25f..164faae 100644
--- a/src/socket/tcp.rs
+++ b/src/socket/tcp.rs
@@ -717,6 +717,7 @@ impl<'a> TcpSocket<'a> {
             (_, TcpRepr { seq_number, .. }) => {
                 let next_remote_seq = self.remote_seq_no + self.rx_buffer.len();
                 if seq_number > next_remote_seq {
+                    self.retransmit.reset();
                     net_trace!("[{}]{}:{}: unacceptable SEQ ({} not in {}..)",
                                self.debug_id, self.local_endpoint, self.remote_endpoint,
                                seq_number, next_remote_seq);

@sbourdeauducq
Member

@klickverbot @cjbe @dhslichter should be better now.

@whitequark
Contributor

whitequark commented Jun 26, 2017

It should be perfectly workable to keep a free list backed by a static pool around to store per-segment metadata (sequence number ranges, …), while building up the payload directly in the circular buffer.

Yeah, this is how I would also go about implementing it. However, it is a rather nontrivial chunk of work.

Packets with non-consecutive sequence numbers shouldn't be dropped silently, but cause a duplicate ACK so that fast retransmission can be triggered for the dropped segments. This avoids the ~1 s pauses in some of the traces floating around here.

I've implemented duplicate ACK responses in smoltcp. This improves things considerably.

To summarize, the following test:

from artiq.experiment import *


class Transfer(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def run(self):
        print(self.transfer(10**6), "B/s")

    def get(self, n) -> TBytes:
        return b"\x00"*n

    @kernel
    def transfer(self, n):
        t0 = self.core.get_rtio_counter_mu()
        self.get(n)
        t1 = self.core.get_rtio_counter_mu()
        return n/self.core.mu_to_seconds(t1-t0)

running over the LAN now reports speeds of at least 1 MB/s, and up to 1.8 MB/s. I believe this can be further improved, but this should already be acceptable.

@whitequark
Contributor

whitequark commented Jun 26, 2017

Here are the TCP throughput graphs for the test above.

First for a capture with 1.1 MB/s rate (which is consistently reproducible):
screenshot_20170626_101145

Second for a capture with 1.8 MB/s rate (ideal conditions):
screenshot_20170626_094049

Neither has any retransmissions, so I'm not sure exactly where the difference comes from.
Pcap files

@whitequark
Contributor

whitequark commented Jun 26, 2017

@klickverbot

advertising the full receive window size and leaving the limited rx ring buffer size for the remote slow start/congestion control implementation to deal with seems like the much saner choice than trying to limit the window size

I re-checked this after implementing duplicate ACK responses. It doesn't work: the core device ends up sending exactly thirty-eight duplicate ACKs (amusingly, it's very consistent) in response to the host pushing the window, and this drops throughput to less than 100 kB/s.
See for yourself:
screenshot_20170626_100226
Pcap files

I think the reason for this is that smoltcp always announces the entire buffer it has, however large, as its window, and this is not quite what one should do. This should ideally be replaced with a proper window management algorithm, which I should read up on, I guess...
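
One standard ingredient of such an algorithm is receiver-side silly window syndrome avoidance in the spirit of RFC 1122; roughly (invented names, not what smoltcp currently does):

struct RxWindow {
    advertised: usize, // window we last told the peer about
    available: usize,  // space actually free in the rx buffer
    mss: usize,
    capacity: usize,   // total rx buffer size
}

impl RxWindow {
    // Pick the window to put into the next outgoing segment. Newly freed space
    // is only announced once it amounts to at least one MSS (or half the
    // buffer, whichever is smaller), so a large nominal buffer never turns into
    // a stream of tiny window updates and tiny segments from the peer.
    fn advertise(&mut self) -> usize {
        let step = self.mss.min(self.capacity / 2);
        if self.available >= self.advertised + step || self.available == self.capacity {
            self.advertised = self.available;
        } else if self.available < self.advertised {
            // Data arrived and has not been read out yet; never promise space
            // we do not actually have.
            self.advertised = self.available;
        }
        self.advertised
    }
}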

@jordens
Member Author

jordens commented Jun 26, 2017

The two pcaps (1.1M and 1.8M) are exactly the same.

@whitequark
Contributor

whitequark commented Jun 26, 2017

The two pcaps (1.1M and 1.8M) are exactly the same.

Hm, Wireshark has messed up filename management; sadly, I've lost the other pcap then.

@whitequark
Contributor

@jordens Packet captures reacquired and comment above updated.

@whitequark
Contributor

This is finally fixed! 🎊

I very consistently get ~300 kB/s with the latest master, measured on the same LAN as the core device:

$ artiq_run test_thru1.py
291882.2459315135 B/s
$ artiq_run test_thru1.py
291765.49065326096 B/s
$ artiq_run test_thru1.py
291893.81590445706 B/s
$ artiq_run test_thru2.py
283804.8550823502 B/s
$ artiq_run test_thru2.py
283803.68106043333 B/s
$ artiq_run test_thru2.py
283496.9025411925 B/s

@dhslichter
Contributor

It's nice to have the improvement, thanks @whitequark. However, this is still very slow in the grand scheme of things, especially if we are going to larger Sinara setups. It seems that we should be aiming to go at least an order of magnitude faster (if not more) once we are working with Metlino cards driving multiple uTCA crates etc. -- complex experiments, with all the attendant parameterized waveforms, plus the possibility of a lot of data needing to be RPC'ed back to the PC. It would be a big problem if all the fancy hardware were laid low by a slow TCP connection to the core device.

@whitequark
Contributor

@dhslichter I agree. This is actually slower than the results I've presented earlier in this thread.

What I find encouraging is that unlike the previous problem ("the TCP stack completely breaks the expectations of the peer and destroys throughput"), the new problem ("the TCP stack is not tuned quite optimally") is much more tractable, and shouldn't require anywhere near as much work.

I'm not yet sure why it's three times slower; I've just started investigating it.

@dhslichter
Contributor

Sounds good @whitequark, and I am glad that you have managed to improve the reliability here (I do recall your reporting better numbers before, but with consistency/reliability issues). Definitely seems like tuning the TCP stack to optimize throughput will be a more straightforward task. Anyway, my main message is "good work!" :)

@whitequark
Contributor

whitequark commented Aug 31, 2017

@dhslichter You wanted an order of magnitude faster? Well, how about ~2.2 MB/s? The throughput graph looks like this now. I believe it's limited by TCP Slow Start at the very beginning:
screenshot_20170831_004015
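
As a toy illustration of the slow-start ramp only (invented helper, illustrative numbers; the actual limit also depends on the RTT and the receive window):

fn rtts_until_window(target_bytes: u64, initial_segments: u64, mss: u64) -> u32 {
    // The congestion window starts at a few segments and roughly doubles every
    // round trip until loss occurs or the receive window caps it.
    let mut window = initial_segments * mss;
    let mut rtts = 0;
    while window < target_bytes {
        window *= 2;
        rtts += 1;
    }
    rtts
}

// e.g. rtts_until_window(1_000_000, 4, 1460) == 8: about eight round trips
// before the congestion window alone stops limiting a 1 MB transfer.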

Note that this is only host to core device though; core device to host is still 300 kB/s. I don't know why: everything seems perfectly normal both in the packet dumps and the TCP debug log, except that occasionally the core device stops sending packets for 150 ms. (No retransmits.) I think it might be an artifact in the test.

@cjbe
Contributor

cjbe commented Aug 31, 2017

@whitequark I just tried 3.0.dev+1303.gc5fe2799 and I still see very low throughput with continuously increasing retransmit delays.

Here is the packet dump from running transfer.py (which did not complete before I got bored).

@whitequark
Contributor

@cjbe No wonder: commit c5fe279 does not include any of the fixes I added to master...

@cjbe
Contributor

cjbe commented Aug 31, 2017

Oops - that build (which conda install gave me) was from the sinara branch and hence does not have the latest fixes. Even so, using 3.0.dev+1280.g20f43d57 I see the same problem.
dump.zip

@whitequark
Contributor

@cjbe What is the OS used on the host machine?

@cjbe
Contributor

cjbe commented Aug 31, 2017

Linux (3.19.0-84-generic)

@jordens
Member Author

jordens commented Aug 31, 2017

@whitequark Good work on smoltcp!

But (assuming the RTT is in the ms range and you are testing this locally) I would not expect any of those 200 ms quiet periods in there. They look like some retransmission timeout, and from what I have seen of other TCP stacks, those are usually not encountered at this stage.
Those long timeouts might not affect sustained transfer speed, but for the small transfers (kernels << 1 MB) that we are doing they are critical.

@whitequark
Contributor

whitequark commented Aug 31, 2017

@jordens No timeouts (retransmissions on the transmit side and out-of-order packets on the receive side go into the log at WARN level, so they are easily noticeable). The quiet periods are all on the host side: one of them is compiler latency, the other I'm not sure about.

@whitequark
Contributor

whitequark commented Aug 31, 2017

@jordens Actually, I'm not sure why the graph above is so pessimistic. Here I ran three kernels in rapid succession. It takes about 9 ms from the end of one kernel to the start of another. No 200 ms quiet periods:
screenshot_20170831_153641
Here's the log and the packet dump.

@jordens
Member Author

jordens commented Aug 31, 2017

@whitequark Ack. I thought they were raw transfers with a fast source and a discard sink.

@whitequark
Contributor

Note that this is only host to core device though, core device to host is still 300 kB/s. [...] I think it might be an artifact in the test.

I have determined that this is an artifact in the test. If, instead of an RPC, I directly transmit a large chunk of data (e.g. core log or analyzer buffer) then the throughput is also ~2.2 MB/s. Of course, this is still a problem, but it's not a problem with the TCP/IP stack.

@jordens
Member Author

jordens commented Sep 1, 2017

screenshot from 2017-09-01 14-56-04

@dhslichter
Contributor

@whitequark Interesting. We will definitely need to see why the RPC does so much worse, and it is good to see that rates >1 MB/s are appearing now. The throughput and latency issues with host-core device communication are probably the biggest roadblock to switching to ARTIQ 3 for us in the lab right now, so keep us posted on how the debugging progresses.

@whitequark
Contributor

whitequark commented Sep 22, 2017

I've implemented reassembly of out-of-order segments in smoltcp, see c1829f3.

@dnadlinger
Collaborator

dnadlinger commented Sep 22, 2017

@whitequark: For a simple test kernel that just sends 1 MB of zeros back and forth, I am now seeing about 18 MB/s (!) from master to kernel, and about 19 MB/s from kernel to master (same Linux 3.19.0-84-generic system as earlier). This almost seems too good to be true – if those numbers hold up in real-world usage, this is definitely fast enough for us in the short term. Long-term, we would like to get close to GbE line speeds, but this will of course involve more than just a few tweaks to the TCP stack (i.e. it might require changes to the SoC architecture).

Having a quick look at the packet captures, there still appear to be more duplicate ACKs than expected in the steady state (with smoltcp-rs/smoltcp@64369d9), but since core device receive performance is almost identical to the other direction, where Linux just gobbles up packets at full speed, I haven't had a closer look yet.

Anyway, great work!

(Edit: To preempt the obvious question – no, as far as I can see, I didn't mix up bits and bytes.)

@sbourdeauducq
Member

sbourdeauducq commented Sep 23, 2017

Great work @whitequark! I get 2.3 MB/s in both directions.
@klickverbot can you share your source code that displays 18 MB/s?

@whitequark
Contributor

@klickverbot Or packet captures, in both directions; I'd be quite curious to look at the duplicate ACKs.

@dnadlinger
Collaborator

That machine seems to have crashed; it will take me a while to get back to the lab to revive it. By now, though, I suspect there might be some funny business going on with the clocks in the firmware build I was using, as I was timing with the core device RTIO counter.

@cjbe
Contributor

cjbe commented Sep 23, 2017

@whitequark there was indeed a factor-of-8 error in the numbers from @klickverbot: we were seeing about 2.4 MB/s in the above tests, not 18 MB/s. Here is a dump of a 1 MB master -> kernel transfer showing duplicate ACKs, then a 1 MB kernel -> master transfer.

This is all with the core device connected through a 100M switch - I will try it with a 1G switch on Monday.

@dnadlinger
Collaborator

dnadlinger commented Sep 23, 2017

Indeed – while I didn't mix up bits and bytes, the code mistook eight nanoseconds for just one due to a mismatching device db, as Chris pointed out, so every measured duration was 8× too short and every reported rate 8× too high (18 MB/s / 8 ≈ 2.3 MB/s, consistent with Chris's numbers). At least you now know what I would consider acceptable transfer speeds. ;)

@cjbe: Your dump presumably didn't include smoltcp-rs/smoltcp@64369d9? Or was this with my last firmware build?

I've also resurrected the test box and tried again, swapping the 100M switch for our Netgear ProSafe GbE switch. The extreme packet loss issue still persists (less than 1 % of ICMP pings make it through, with no networking-related messages at TRACE log level). I (or Chris) will give it another go on Monday with a stock gateware build and open a separate issue to keep track.

@cjbe
Contributor

cjbe commented Sep 23, 2017

@klickverbot my dump was from the firmware build you put on the board.
