Optimize window update logic #108 (Draft)

jwt27 wants to merge 2 commits into master from recvwin
Conversation

@jwt27 (Author) commented Jan 17, 2024

I'm experimenting with the delayed-ACK / window update code. There are some radical changes here, so this needs proper testing to ensure that it helps in all scenarios. And there may still be bugs.

RFC 1122 says:

A TCP SHOULD implement a delayed ACK, but an ACK should not
be excessively delayed; in particular, the delay MUST be
less than 0.5 seconds, and in a stream of full-sized
segments there SHOULD be an ACK for at least every second
segment.

But in our case, the packet driver's "send" function is a major bottleneck. We should avoid sending small packets at all costs. The solution: remove fast-ACK altogether, delay as long as possible.

In addition, we should never advertise a receive window smaller than MSS. Ideally, keep it wide open. So, the window should be updated only from tcp_read(). It makes no sense to do so from FSM or retransmitter - they can't solve the problem.
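As a rough illustration of that rule (a hypothetical sketch with made-up helper names, not the actual patch or Watt-32's tcp_read() internals):

extern void send_window_update (unsigned new_win);  /* placeholder for the ACK/window transmit path */

/* Never advertise a window smaller than one MSS; advertise zero instead
 * and reopen the window later from the read path.
 */
static unsigned window_to_advertise (unsigned buf_free, unsigned mss)
{
  return (buf_free < mss) ? 0 : buf_free;
}

/* Only called from tcp_read(), after the application has drained data,
 * since that is the only place where the window can actually grow.
 */
static void maybe_update_window (unsigned buf_free, unsigned advertised, unsigned mss)
{
  unsigned new_win = window_to_advertise (buf_free, mss);

  /* Only announce growth worth at least one MSS, so we never invite tiny segments. */
  if (new_win > advertised && new_win - advertised >= mss)
     send_window_update (new_win);
}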

In this patch I also allocate a receive buffer that is 2x the window size. Not sure yet if it's necessary.

Things to check:

  • New field tcp_Socket.recv_win is only updated for in-order data. That is likely not correct.
  • Round-trip timing may be affected on high-latency links.
  • Window size determination may not be optimal for the default 2K buffer.
  • Buffer size should be limited to 64K on 16-bit systems.

Test procedure with the bundled wget:

$ python3 -m http.server 8000 --bind 10.0.0.10
> wget-dj.exe 10.0.0.10:8000/hugefile.dat -T 0 -O /dev/null

And in wattcp.cfg: tcp.recv_win = 65536

Results so far look promising:

Before: ~4.0MB/s
After: ~6.4MB/s

jwt27 marked this pull request as draft on January 17, 2024, 16:22
@jwt27 (Author) commented Jan 17, 2024

What this looks like in Wireshark, before:
[Wireshark capture screenshot]
After:
[Wireshark capture screenshot]

@gvanem (Owner) commented Jan 18, 2024

Results so far look promising:

Before: ~4.0MB/s
After: ~6.4MB/s

AFAICS, the lack of receive speed in Watt-32 programs (vs. Winsock or Linux) is due to the missing
SACK (Selective Acknowledgment) algorithm.

And FACK (Forward Acknowledgement), but not sure how common this is nowadays.

There is an option USE_TCP_SACK in config.h, but it's tricky to implement. There are only some dummy tcp_opt_sack*() functions now. But I've scraped together an old SACK implementation and added it here.
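For reference, the SACK option itself (RFC 2018) is just a list of (left edge, right edge) sequence-number pairs, at most 4 blocks (3 when timestamps are in use) because of the 40-byte option limit. A minimal, purely illustrative builder (not the scraped implementation mentioned above) could look like this:

#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in network byte order. */
static void put32 (uint8_t *p, uint32_t v)
{
  p[0] = (uint8_t)(v >> 24);
  p[1] = (uint8_t)(v >> 16);
  p[2] = (uint8_t)(v >> 8);
  p[3] = (uint8_t)(v);
}

/* Build a TCP SACK option: kind 5, length 2 + 8 per block, each block
 * being the (left, right) edges of a received-but-not-yet-acked range.
 * Returns the option length in bytes.
 */
static size_t build_sack_option (uint8_t *opt, const uint32_t *left,
                                 const uint32_t *right, int blocks)
{
  size_t len = 2 + 8 * (size_t)blocks;
  int    i;

  opt[0] = 5;               /* kind: SACK */
  opt[1] = (uint8_t) len;   /* total option length */

  for (i = 0; i < blocks; i++)
  {
    put32 (opt + 2 + 8*i,     left[i]);
    put32 (opt + 2 + 8*i + 4, right[i]);
  }
  return len;
}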

jwt27 force-pushed the recvwin branch 2 times, most recently from 80530a2 to 9c76a02, on January 18, 2024, 20:32
@jwt27 (Author) commented Jan 18, 2024

I read a bit about SACK before. Doesn't it only help for retransmitted segments (which should be rare)?

On that topic, I noticed something. If you let wget write to a file, the bottleneck then becomes DOS and the hard drive. A 64K receive window is then unreliable since we only have 40 packet buffers (57K), and we end up dropping some. That looks like this:

[Wireshark capture screenshot]

Now we send a dup-ACK for every packet received after the gap. Limiting this to 3 would already save some time spent in TCP_SEND(). Then the peer starts retransmitting one segment at a time (and somehow we have a few more dup-ACKs here). At this point we can do fast-ACK to recover from retransmission more quickly. I added a commit to implement this:

[Wireshark capture screenshot]
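Roughly, the idea is the following (a hypothetical sketch with made-up names and fields, not the actual commit or Watt-32's tcp_Socket internals):

#define MAX_DUP_ACKS 3

extern void send_ack (void);   /* placeholder for the ACK transmit path */

static void on_out_of_order_segment (unsigned *dup_acks_sent)
{
  /* Send at most 3 duplicate ACKs for a gap; after that, stay quiet so
   * no more time is wasted in TCP_SEND() while the backlog piles up.
   */
  if (*dup_acks_sent < MAX_DUP_ACKS)
  {
    send_ack();
    (*dup_acks_sent)++;
  }
}

static void on_gap_filled_by_retransmission (unsigned *dup_acks_sent)
{
  *dup_acks_sent = 0;
  send_ack();   /* fast-ACK so the peer resumes full-speed sending quickly */
}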

@jwt27 (Author) commented Jan 18, 2024

BTW, if we use a giant buffer size in wget:

diff --git a/bin/WGET.182/retr.c b/bin/WGET.182/retr.c
index ebe1fe5..f78aad0 100644
--- a/bin/WGET.182/retr.c
+++ b/bin/WGET.182/retr.c
@@ -141,7 +141,7 @@ get_contents (int fd, FILE *fp, long *len, long restval, long expected,
 #ifdef __LARGE__
   static char c[2048];
 #else
-  static char c[8192];
+  static char c[131072];
 #endif
   void *progress = NULL;
   struct wget_timer *timer = wtimer_allocate ();

We can get a better estimate of the maximum possible throughput. The difference then becomes:

Before: ~5.3 MB/s (current master branch)
After: ~8.8 MB/s (this PR)

Which is pretty close to maximum link speed!

@gvanem (Owner) commented Jan 19, 2024

Which is pretty close to maximum link speed!

Pretty impressive work there! 🥇

But all this is with a djgpp version of Wget on your LAN?
Have you tested with a Large-model or Windows build (with Watcom etc., if you have that) against a remote server?
Like between Japan <--> Belgium.

@jwt27 (Author) commented Jan 19, 2024

Yes, this needs to be tested in different scenarios. I feel like I might be optimizing for one specific use case. On high-latency connections I expect it will fall apart, but haven't really tried yet.

I did try building wget with Watcom, but it complains about the makefile syntax. I'm really not familiar with the Watcom toolset at all, so I would appreciate some help here (if anyone else happens to be listening in on this).

@jwt27 (Author) commented Jan 19, 2024

Found a server on the other side of the planet (ping 290ms).

$ wget http://mirror.linux.org.au/debian/dists/unstable/main/Contents-source.gz -T 0 -O /dev/null

Windows (MSYS2): Constant 850 KB/s
Linux: Starts slow, but increases gradually to 10 MB/s (wow!)
Watt32 on djgpp (current master branch): Constant 65 KB/s
Watt32 on djgpp (this PR): Constant 40 KB/s

Oops :)
But not entirely unexpected. Let's think about this some more...

@gvanem (Owner) commented Jan 19, 2024

Windows (MSYS2): Constant 850 KB/s

What's your max WAN link-speed? I'm getting 5.09MB/s with wget/Windows (MSVC) on my 100 MBit/s WAN.

Linux: Starts slow, but increases gradually to 10 MB/s (wow!)

Nice. Are there some tricks inside Linux that do this?

@gvanem (Owner) commented Jan 19, 2024

On 2nd attempt, Windows is similar to Linux:
[wget speed screenshot]

@jwt27 (Author) commented Jan 19, 2024

I'm on a gigabit link, so no bottleneck there. Also, I tested on Windows 7; maybe 10/11 has a better TCP implementation (I should upgrade at some point).

For Linux I guess they have accurate round-trip timing to know when to send earlier window updates. But that's just a guess.

@jwt27 (Author) commented Jan 19, 2024

Looking at Wireshark on the Linux side:

[Wireshark capture screenshot]

It uses the TCP timestamp option and a window size of 3 MB. The window size always stays constant, and ACKs are sent fairly late.

Also note the effect of TCP offloading: the network card bundles up multiple packets and presents them to the OS as one large block. We don't have that luxury here :)

@gvanem (Owner) commented Jan 19, 2024

Also note the effect of TCP offloading,

If that includes checksum offloading: I've disabled that to see the calculated checksums in Wireshark etc.
So the Windows speed is comparable to Linux AFAICS; it starts slow but increases gradually to 8 MB/s here.

I'm really not familiar with the Watcom toolset at all,

Wmake is really weird and ugly compared to GNU-make. I'll make a USE_WATCOM=1 target for Makefile.Windows.

@jwt27 (Author) commented Jan 19, 2024

Thinking out loud:

The problem is the delay between receiving the last packet and sending a window update. We can receive packets pretty quickly, but then it takes a while before we get around to processing them. In theory, sending ACKs faster with a reduced window size should allow the sender to adapt and send at a more constant rate. But that's not happening, so something on our end must be causing a fairly constant delay.

If I were to redesign TCP, I would add some sort of window pre-announce mechanism. So a sender could say with the first packet "I will be transmitting N full segments now". And the receiver immediately replies "after those N segments, my window size will be at least X". Then the sender can always keep going without ever having to wait for a window update.

@gvanem (Owner) commented Jan 20, 2024

But that's not happening, so something on our end must be causing a fairly constant delay.

Does this have something to do with Nagle's Algorithm?

@jwt27 (Author) commented Jan 20, 2024

Hm, not likely. I do think the Nagle implementation is slightly wrong, but that is another topic.

I've done some more experimenting. If I remove the window update threshold, the high-latency case improves to ~75 KB/s without hurting LAN too much (6.3 MB/s). Doing fast-ACK from the FSM, I can get 100 KB/s (or even 200 with a 64K window). But LAN really suffers.

What I think is happening then:
By the time we get to tcp_tick(), there may already be a backlog of 20+ packets. If you do fast-ACK there, then by the time you're done with the first set of packets, 20 more may have arrived (since TCP_SEND() is so slow, and round-trip is very short). So you get stuck in tcp_tick() for a long time and are never able to clear the buffer.

Commit: "Limit the number of dup-ACKs to 3, and send fast-ACKs after each retransmitted segment."
@jwt27 (Author) commented Jan 20, 2024

To elaborate on the "slightly wrong" Nagle mode: while there is still unacked data, it's supposed to send only whole segments. But if we have a whole segment plus a few bytes, Watt32 will also send the partial one. Small difference, but it is "slightly wrong".
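The intended rule, as a rough sketch (illustrative names, not the actual Watt-32 code):

/* Hypothetical sketch of Nagle's rule as described above. */
static int nagle_can_send (unsigned bytes_queued,
                           unsigned bytes_unacked,
                           unsigned mss)
{
  if (bytes_unacked == 0)   /* nothing in flight: send whatever we have */
     return 1;
  if (bytes_queued >= mss)  /* a full segment is available: send it */
     return 1;
  return 0;                 /* hold the small leftover back until the ACK arrives */
}

With this rule, the few leftover bytes after a full segment stay queued until the outstanding data is acked; the current behaviour sends them immediately as a partial segment.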

There is also a modification to Nagle's algorithm described here. Could take a look at that sometime.

@gvanem (Owner) commented Jan 20, 2024

Did you use the RDTSC or the 8254 timer as you mentioned in #99?

@jwt27 (Author) commented Jan 21, 2024

Oh, now you're on to something. I had been using the default, which I thought should be RDTSC. I saw you can toggle it with the environment variable USE_RDTSC.

So I set USE_RDTSC=0: it doesn't seem to make much of a difference.
Then I enabled it again with USE_RDTSC=1, and now I see 11.3 MB/s on LAN somehow.

I'm very confused now. Need to figure out how this works.

@jwt27 (Author) commented Jan 21, 2024

This would all make sense if the timer code were using gettimeofday(), since that is a very slow function. So that could be the "fairly constant delay" I was talking about earlier.

@jwt27 (Author) commented Jan 21, 2024

I did some benchmarking:

#include <stdio.h>
#include <stdint.h>
#include <tcp.h>

/* Read the CPU time-stamp counter ("=A" = the EDX:EAX pair, 32-bit x86 only). */
static uint64_t rdtsc()
{
  uint64_t count;
  asm volatile ("rdtsc" : "=A" (count));
  return count;
}

int main()
{
  uint64_t begin, end;

  init_misc();
  //hires_timer(0);

  begin = rdtsc();
  for (int i = 0; i < 100000; ++i)
  {
    set_timeout(0);
  }
  end = rdtsc();

  printf("%llu\n", end - begin);

  return 0;
}

hires_timer(0):  3387089559 cycles (avg 45161ns per call)
hires_timer(1):  3387605858 cycles (avg 45168ns per call)
USE_RDTSC=1   :    27900545 cycles (avg   372ns per call)

This shows that the timer code is a serious bottleneck - set_timeout() etc. is used everywhere.

The default option, at least for djgpp, is equivalent to hires_timer(1), i.e. the 8254 timer enabled. It doesn't appear to have much effect here, because the 8254 code still calls libc time(), which is just as slow as gettimeofday().

RDTSC is decently fast on average, but it still calls gettimeofday() every now and then, so it will cause random delays.

Question is, do we need exact UNIX/UTC time anywhere? For basic timeouts you can just start counting from 0 at program startup, and avoid all the slow time.h stuff from libc.
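A minimal sketch of that idea (assuming a constant-rate TSC and a 32-bit x86 target; the names are made up, this is not the existing Watt-32 timer code):

#include <stdint.h>

static uint64_t start_tsc;      /* TSC value at program startup */
static uint64_t tsc_per_msec;   /* calibrated once at startup */

static uint64_t read_tsc (void)
{
  uint64_t count;
  asm volatile ("rdtsc" : "=A" (count));   /* EDX:EAX pair, 32-bit x86 */
  return count;
}

/* Call once at startup; 'tsc_hz' would come from a one-time calibration
 * against some known clock (BIOS tick, 8254, ...).
 */
void monotonic_init (uint64_t tsc_hz)
{
  start_tsc    = read_tsc();
  tsc_per_msec = tsc_hz / 1000;
}

/* Milliseconds since startup - never touches time()/gettimeofday(). */
uint64_t monotonic_msec (void)
{
  return (read_tsc() - start_tsc) / tsc_per_msec;
}

Timeouts then become simple comparisons against monotonic_msec(), and the slow libc time functions are only needed where actual wall-clock time is required.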

@gvanem (Owner) commented Jan 22, 2024

Question is, do we need exact UNIX/UTC time anywhere? For basic timeouts you can just start counting from 0 at program
startup, and avoid all the slow time.h stuff from libc.

Agreed. I'll try to make a modified version of src/tests/ttime.c that compares the CPU-clocks between various timer-functions.
