
Optimizing (*sentPacketHistory).FirstOutstanding #3462

Closed · tobyxdd opened this issue Jun 29, 2022 · 17 comments · Fixed by #3467

tobyxdd (Contributor) commented Jun 29, 2022

Hi @marten-seemann,

quic-go currently spends a lot of CPU time in (*sentPacketHistory).FirstOutstanding when transmitting at high speed over a high-packet-loss connection. (Normally this situation can't arise at all, since the built-in CUBIC congestion controller slows the connection to a crawl when the loss rate is high; I have replaced it with a custom constant-speed CC.)

The following call graph was generated by a transmission test between two $5 AWS Lightsail servers at 200 Mbps, over a simulated 30 ms, 4% packet loss link:

[pprof call graph: profile001]

Showing nodes accounting for 3080ms, 28.33% of 10870ms total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            1640ms 53.25% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHistory).HasOutstandingPackets (inline)
                                            1440ms 46.75% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHandler).PeekPacketNumber (inline)
    2490ms 22.91% 22.91%     3080ms 28.33%                | github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHistory).FirstOutstanding
                                             590ms 19.16% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*PacketElement).Next (inline)
----------------------------------------------------------+-------------
                                             590ms   100% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHistory).FirstOutstanding (inline)
     590ms  5.43% 28.33%      590ms  5.43%                | github.com/lucas-clemente/quic-go/internal/ackhandler.(*PacketElement).Next
----------------------------------------------------------+-------------

FirstOutstanding was taking up nearly 30% of the runtime, making it the biggest CPU bottleneck. The program saturated an entire CPU core and couldn't get past a mere 100 Mbps because of it, and it gets worse as I crank up the loss rate. Switching to servers with better CPU performance does improve the situation: I was able to hit ~700 Mbps between two Ryzen 7 machines over the same link.

Do you have any thoughts on what could be done to optimize this part of the code? Perhaps we could add some kind of cache, or pre-calculate the first outstanding packet every time the linked list is modified, instead of iterating over it on every call?

pprof sample attached:
pprof.hy_linux_amd64.samples.cpu.002.pb.gz
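
To make the cache/pre-calculation idea concrete, here is a minimal sketch in Go, using the standard library's container/list as a stand-in for quic-go's packet list. All names here (the struct fields, SentPacket, onNoLongerOutstanding) are illustrative assumptions, not quic-go's actual API:

```go
package ackhandler

import "container/list"

// Packet is an illustrative stand-in carrying only the fields this sketch needs.
type Packet struct {
	PacketNumber         int64
	declaredLost         bool
	skippedPacket        bool
	isPathMTUProbePacket bool
}

func isOutstanding(p *Packet) bool {
	return !p.declaredLost && !p.skippedPacket && !p.isPathMTUProbePacket
}

type sentPacketHistory struct {
	packets          *list.List    // all sent packets, ordered by packet number
	firstOutstanding *list.Element // cached; nil if nothing is outstanding
}

// FirstOutstanding becomes O(1): just return the cached element.
func (h *sentPacketHistory) FirstOutstanding() *Packet {
	if h.firstOutstanding == nil {
		return nil
	}
	return h.firstOutstanding.Value.(*Packet)
}

// SentPacket appends a packet. The cache only changes if nothing was
// outstanding before.
func (h *sentPacketHistory) SentPacket(p *Packet) {
	el := h.packets.PushBack(p)
	if h.firstOutstanding == nil && isOutstanding(p) {
		h.firstOutstanding = el
	}
}

// onNoLongerOutstanding must be called before an element is removed (ACKed)
// or marked lost/skipped. The cache only moves if it pointed at exactly this
// element; then it advances to the next outstanding packet. The cursor only
// ever moves forward, so the scan cost is amortized across all mutations.
func (h *sentPacketHistory) onNoLongerOutstanding(el *list.Element) {
	if el != h.firstOutstanding {
		return
	}
	for e := el.Next(); e != nil; e = e.Next() {
		if isOutstanding(e.Value.(*Packet)) {
			h.firstOutstanding = e
			return
		}
	}
	h.firstOutstanding = nil
}
```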

marten-seemann (Member) commented

We call FirstOutstanding twice. Once when generating the packet number:
https://github.com/lucas-clemente/quic-go/blob/f29dd273b40d37b6ea8206350b2d96e8779161e5/internal/ackhandler/sent_packet_handler.go#L681-L687
and once when queueing a probe packet:
https://github.com/lucas-clemente/quic-go/blob/f29dd273b40d37b6ea8206350b2d96e8779161e5/internal/ackhandler/sent_packet_handler.go#L760-L762

Generating a packet number happens for every single packet, no matter the loss rate. Am I right to assume that in low loss conditions, we do more iterations of the loop?

Maybe the solution here is to have two separate lists: one that has all packets p for which !p.declaredLost && !p.skippedPacket && !p.IsPathMTUProbePacket, and one for all the others. That would allow you to just take the first packet of the first list to determine the first outstanding packet.

The only problem is that you'd have to move packets between the two lists whenever you declare a packet lost, and that's an O(N) operation. However, N is the number of lost packets (plus skipped and MTU probe packets), so it should be fairly small under normal conditions. You could also split that list up further, with one list containing skipped and MTU probe packets, and the other one lost packets. Packets are declared lost in the order they were sent, which would reduce declaring a packet lost to an O(1) operation. This might be over-optimizing things, though.
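
A rough sketch of the two-list variant (types and method shapes are illustrative, not the actual quic-go implementation; Packet and isOutstanding are as in the sketch above). Note how DeclareLost walks the second list from the tail:

```go
package ackhandler

import "container/list"

// sentPacketHistory split into two lists, both ordered by packet number.
type sentPacketHistory struct {
	outstanding *list.List // !declaredLost && !skippedPacket && !isPathMTUProbePacket
	other       *list.List // lost, skipped, and MTU probe packets
}

// FirstOutstanding is O(1): the front of the outstanding list.
func (h *sentPacketHistory) FirstOutstanding() *Packet {
	if el := h.outstanding.Front(); el != nil {
		return el.Value.(*Packet)
	}
	return nil
}

// DeclareLost moves a packet (an element of the outstanding list) into the
// other list. Walking backwards from the tail finds the insertion point;
// since packets are declared lost in send order, the walk only passes
// skipped and MTU probe packets that were sent later.
func (h *sentPacketHistory) DeclareLost(el *list.Element) {
	p := el.Value.(*Packet)
	p.declaredLost = true
	h.outstanding.Remove(el)
	pos := h.other.Back()
	for pos != nil && pos.Value.(*Packet).PacketNumber > p.PacketNumber {
		pos = pos.Prev()
	}
	if pos == nil {
		h.other.PushFront(p)
	} else {
		h.other.InsertAfter(p, pos)
	}
}
```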

tobyxdd (Contributor, Author) commented Jun 29, 2022

Am I right to assume that in low loss conditions, we do more iterations of the loop?

I think you mean in high loss conditions?

tobyxdd (Contributor, Author) commented Jun 30, 2022

Can you elaborate on why it would be O(N) with two lists, and O(1) with three? Is it because when using two lists, declaring a packet lost requires finding the correct position in the lost || skipped || probe list for insertion, and when using a dedicated lost list, you can just append it to the tail?

Also, I'm not sure how to implement Iterate when there are multiple lists. Does the order matter? e.g. can it just internally do

Iterate(outstandingList)
Iterate(probeSkippedList)
Iterate(lostList)

or must it behave as if it were still one list?

marten-seemann (Member) commented

I think you mean in high loss conditions?

Yes, that's what I meant.

Is it because when using two lists, declaring a packet lost requires finding the correct position in the lost || skipped || probe list for insertion, and when using a dedicated lost list, you can just append it to the tail?

Correct. The N is pretty small when using two lists, though (compared to the current situation), so that's probably fine.

Also, I'm not sure how to implement Iterate when there are multiple lists. Does the order matter?

Order does matter, since it allows the caller to abort early. One way to implement this is to keep nextOutstanding, nextProbeSkipped and nextLost variables pointing to the next packet in each respective list. Then, before each invocation of the callback, you decide which packet to pass: just take the one with the lowest packet number.
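
A sketch of that merge, assuming a hypothetical three-list split (fields outstanding, probeSkipped and lost) and the illustrative Packet type from the sketches above; the callback shape, returning (continue, error), mirrors the existing single-list Iterate:

```go
package ackhandler

import "container/list"

type sentPacketHistory struct {
	outstanding, probeSkipped, lost *list.List // each ordered by packet number
}

// pn returns the packet number stored in a list element (helper for the sketch).
func pn(el *list.Element) int64 { return el.Value.(*Packet).PacketNumber }

// Iterate walks the three packet-number-ordered lists as if they were one,
// by keeping one cursor per list and always yielding the packet with the
// lowest packet number. The caller still sees packets in send order and can
// abort early by returning false.
func (h *sentPacketHistory) Iterate(cb func(*Packet) (bool, error)) error {
	cursors := [3]*list.Element{
		h.outstanding.Front(),
		h.probeSkipped.Front(),
		h.lost.Front(),
	}
	for {
		// Pick the cursor pointing at the lowest packet number.
		lowest := -1
		for i, el := range cursors {
			if el != nil && (lowest == -1 || pn(el) < pn(cursors[lowest])) {
				lowest = i
			}
		}
		if lowest == -1 {
			return nil // all three lists exhausted
		}
		cont, err := cb(cursors[lowest].Value.(*Packet))
		if err != nil {
			return err
		}
		if !cont {
			return nil // caller aborted early
		}
		cursors[lowest] = cursors[lowest].Next()
	}
}
```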

tobyxdd (Contributor, Author) commented Jun 30, 2022

Might be a dumb question but what would happen if we simply remove all declaredLost packets from the list immediately (vs after 3*PTO as it is now) and remove this field altogether?

I tried to look around the code but didn't find anything obvious.

tobyxdd (Contributor, Author) commented Jun 30, 2022

btw, I changed the code to look like this to confirm that it is spending most of the time filtering out declaredLost packets.

[screenshot of the modified FirstOutstanding code]

tobyxdd (Contributor, Author) commented Jun 30, 2022

My understanding is that, since ACKed packets are immediately removed from the list, but lost packets are only cleared after 3*PTO by DeleteOldPackets, a huge backlog of lost packets builds up at the front of the list, and FirstOutstanding has to get through it every time it is called. That's what causes this performance issue.
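
Some rough arithmetic to back this up (packet size and PTO here are assumptions for illustration): at 200 Mbps with ~1250-byte packets the sender emits about 20,000 packets per second, so 4% loss declares roughly 800 packets per second lost. With 3*PTO on the order of a few hundred milliseconds, a few hundred lost packets are sitting at the front of the list at any given moment, and FirstOutstanding has to skip past all of them for nearly every packet sent.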

tobyxdd (Contributor, Author) commented Jun 30, 2022

And if that's the case,

Although the N is pretty small when using two lists (as compared to the current situation)

The N would still be pretty large, right?

marten-seemann (Member) commented

Might be a dumb question but what would happen if we simply remove all declaredLost packets from the list immediately (vs after 3*PTO as it is now) and remove this field altogether?

Good question. Consider this scenario: we declare a packet lost, then receive an ACK for it later. If we haven't yet had the chance to retransmit its frames, we don't need to do so any more. We can also use the ACK to generate an RTT sample (which is very valuable in this situation).

Although the N is pretty small when using two lists (as compared to the current situation)

The N would still be pretty large, right?

Not if you traverse the list from the end (as in the DeclareLost sketch above): you'll only encounter skipped and MTU probe packets.

tobyxdd (Contributor, Author) commented Jul 1, 2022

I see. I'll work on an implementation using 2 lists.

tobyxdd (Contributor, Author) commented Jul 1, 2022

Hi @marten-seemann, here's the pprof result after this optimization.

[pprof call graph after the optimization: profile001]

tobyxdd (Contributor, Author) commented Jul 1, 2022

I have found another significant bottleneck when the loss rate is high, however. This time it's on the receiving end.

[pprof call graph and listing: profile002]

marten-seemann (Member) commented

@tobyxdd I think you've rediscovered #300 :)

And with generics on the horizon we'll soon be able to fix it (by using a tree data structure)!

tobyxdd (Contributor, Author) commented Jul 1, 2022

And with generics on the horizon

Why does it require generics?

marten-seemann (Member) commented

Why does it require generics?

I was hoping that we'd be able to find a well-tested, type-safe generic tree implementation.

tobyxdd (Contributor, Author) commented Jul 4, 2022

Why does it require generics?

I was hoping that we'd be able to find a well-tested, type-safe generic tree implementation.

Could this be useful? https://github.com/tidwall/btree

tobyxdd (Contributor, Author) commented Jul 5, 2022

I don't think there's any library that's readily usable. Even if they support custom comparators, they only support comparisons between values of the same type, which doesn't fit our "find gap by offset" use case.
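
To illustrate the mismatch (all types here are hypothetical, not from tidwall/btree or any other library): an ordered container's comparator has the shape Less(a, b T) for a single type T, but the gap lookup needs to compare an interval against a bare offset, so containment has to be checked outside the comparator:

```go
package example

// byteInterval is a hypothetical gap in the received data: [Start, End).
type byteInterval struct{ Start, End uint64 }

// A same-type comparator is all a generic ordered container accepts:
func lessInterval(a, b byteInterval) bool { return a.Start < b.Start }

// findGap answers "which gap contains offset x?". The binary search below is
// a stand-in for a tree's seek operation: with only lessInterval available,
// we can seek to the last gap with Start <= x, but the containment test
// (x < End) has to happen outside the comparator.
func findGap(sortedGaps []byteInterval, x uint64) (byteInterval, bool) {
	lo, hi := 0, len(sortedGaps)
	for lo < hi { // find the first gap with Start > x
		mid := (lo + hi) / 2
		if sortedGaps[mid].Start <= x {
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	if lo == 0 { // every gap starts after x
		return byteInterval{}, false
	}
	g := sortedGaps[lo-1] // last gap with Start <= x
	return g, x < g.End
}
```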
