
Cubic performance drops on Satellite link with random packet losses #1221

Closed
huitema opened this issue Aug 20, 2021 · 4 comments

@huitema (Collaborator) commented Aug 20, 2021

On 8/19/2021 3:38 AM, PETROU Matthieu wrote:

I’m currently running some tests to compare the download time of QUIC (picoquic) and TCP (iperf3) on a satellite connection, and I saw some wide differences in the results that I wanted to share with you.

I use OpenSAND to emulate the satellite link. The bandwidth between the server and the client is 12 Mb/s on the forward link and 3 Mb/s on the return link. The buffer size on the gateway is fixed at slightly more than the BDP (1.2 times the BDP). I emulate packet losses with time traces generated with the Gilbert-Elliott model, with p = 0.01 and q = 0.167. Overall, losses occur 2.56% of the time, with an average burst length of 16.70 ms and a maximum burst length of 43 ms.

For the tests, I download 20 MB of data with iperf3 and with picoquic. Either a single flow downloads 20 MB, or five concurrent flows each download 20 MB. To compare the concurrent flows, I launch 5 separate iperf3 and picoquic instances, one per flow (on different ports). Both QUIC and TCP use CUBIC and have Hystart disabled. I'm using iperf 3.7+ (cJSON 1.5.2) and picoquic v0.34f.

Here are the download-time medians and standard deviations over 30 tests in those conditions:

| Configuration | Download time median (s) | Standard deviation (s) |
| --- | --- | --- |
| 1 TCP flow | 91.42 | 47.48 |
| 1 QUIC flow | 278.39 | 172.45 |
| 5 TCP flows | 163.35 | 50.16 |
| 5 QUIC flows | 400.74 | 127.81 |

If I remove all link losses, here are the download-time medians and standard deviations over 30 tests:

| Configuration | Download time median (s) | Standard deviation (s) |
| --- | --- | --- |
| 1 TCP flow | 17.84 | 0.34 |
| 1 QUIC flow | 18.35 | 0.20 |
| 5 TCP flows | 65.29 | 8.73 |
| 5 QUIC flows | 74.46 | 7.61 |

With those results, it seems that QUIC (picoquic) is far more impacted by losses on a satellite link. However, I don't really understand why, as both use the CUBIC algorithm and Hystart is disabled.

I launched 30 tests in the same conditions with CUBIC and Hystart enabled, on both TCP and QUIC; here are the results:

| Configuration | Download time median (s) | Standard deviation (s) |
| --- | --- | --- |
| 1 TCP flow | 135.73 | 43.45 |
| 1 QUIC flow | 374.17 | 153.67 |
| 5 TCP flows | 186.80 | 37.79 |
| 5 QUIC flows | 375.26 | 116.36 |

The results are worse for TCP CUBIC with Hystart. For QUIC: worse with a single flow, and slightly better with five concurrent flows.

Of course, there are packet losses without Hystart, mainly because slow start needs to overshoot to exit the start phase. Hystart would avoid that by measuring the RTT, which increases as the buffer fills. However, the issue is that the flow exits the start phase when the RTT exceeds the minimum RTT by 16 ms. On a satellite link the jitter is significant, and a 16 ms difference is easily reached. That's why we decided to disable Hystart for our tests with TCP and QUIC: we care more about download time than about packet losses.
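The delay-based exit condition described above can be sketched as follows. This is a minimal illustration with the 16 ms threshold mentioned in the comment, not picoquic's or Linux's actual Hystart code:

```c
#include <assert.h>
#include <stdint.h>

/* Hystart-style delay check (sketch).
 * Exit slow start once an RTT sample exceeds the minimum RTT by a
 * fixed threshold; per the discussion above, that threshold is ~16 ms,
 * which satellite-link jitter alone can exceed. */
#define HYSTART_DELAY_THRESHOLD_US 16000u

static int hystart_should_exit(uint64_t rtt_min_us, uint64_t rtt_sample_us)
{
    return rtt_sample_us > rtt_min_us + HYSTART_DELAY_THRESHOLD_US;
}
```

On a roughly 600 ms GEO path, 16 ms is under 3% of the base RTT, so jitter alone can cross the threshold before the bottleneck buffer actually starts to fill, which is the false-exit problem motivating the decision to disable Hystart here.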

@huitema (Collaborator, Author) commented Aug 20, 2021

From looking at the logs, I think I know what is happening. I see in the tests that the window drops immediately after a packet loss. It will eventually converge to something like 1/loss-rate packets, which is way too slow. I think the TCP implementation of Cubic in Linux has some kind of filter, and it only reduces the window if it sees several packet losses. That's a way to disambiguate between "congestion" losses that should trigger a slowdown and "random" losses caused by transmission issues that should not. My implementation of Cubic does not do that; that's a bug that should be fixed.
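The kind of filter described here might look like the following sketch. The constants and structure are hypothetical illustrations of the idea (count repeated losses before reacting), not the actual code in PR #1222:

```c
#include <assert.h>
#include <stdint.h>

/* Loss-event filter (sketch of the idea, not picoquic's code).
 * Isolated or double losses are treated as random transmission losses;
 * only a run of losses within a short interval is treated as congestion. */
#define LOSS_FILTER_COUNT 3              /* losses needed to signal congestion (assumed) */
#define LOSS_FILTER_INTERVAL_US 1000000  /* counting window in microseconds (assumed) */

typedef struct {
    int count;                   /* losses seen in the current window */
    uint64_t first_loss_time_us; /* start of the current window */
} loss_filter_t;

/* Returns 1 if congestion control should reduce the window, 0 otherwise. */
static int loss_filter_on_loss(loss_filter_t *f, uint64_t now_us)
{
    if (f->count == 0 || now_us - f->first_loss_time_us > LOSS_FILTER_INTERVAL_US) {
        f->count = 1;
        f->first_loss_time_us = now_us;
        return 0;  /* isolated loss: do not reduce the window */
    }
    if (++f->count >= LOSS_FILTER_COUNT) {
        f->count = 0;
        return 1;  /* repeated losses in a short span: treat as congestion */
    }
    return 0;
}
```

The design choice is the same one Linux-style heuristics make: random wireless or satellite losses arrive in isolation relative to the RTT, while congestion drops cluster, so requiring several losses before acting filters out most non-congestion events at the cost of reacting one loss later to real congestion.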

@FlavienRJ commented Aug 20, 2021

Can you look at the ACK frequency on the return link?

A difference I see between QUIC and TCP is the ACK frequency. ACKs are used to detect loss and to measure RTT, so too few ACKs would prevent quick detection of congestion and/or loss. Your return link is tight, but your TCP stack has the capacity to maintain a 1:2 ratio (3 Mb/s ≈ 5,000 ACKs per second = 10,000 data packets per second; with MSS = 1500 B, the maximum "acknowledgeable" bandwidth is 120 Mb/s).

If your QUIC stack tries to reduce the acknowledgement flow to 1:10 or less (as lsquic does when implementing draft-ietf-quic-ack-frequency with a target of 1 ACK per RTT), that could explain bursts of loss, but more importantly it would not "sense" congestion from the RTT increase at the right time.

@huitema (Collaborator, Author) commented Aug 21, 2021

@FlavienRJ Please look at PR #1222. It fixes the issue by inserting a filter on packet-loss events, so that isolated or double losses do not affect congestion control. This also fixes the variability issues found with Hystart, because early losses were causing an early exit from Hystart. The ACK frequency does not affect matters much, as long as ACKs are not too far apart.

@huitema (Collaborator, Author) commented Sep 2, 2021

Fixed in PR #1222

huitema closed this as completed Sep 2, 2021