
Optimizing (*sentPacketHistory).FirstOutstanding #3462

Closed · tobyxdd opened this issue Jun 29, 2022 · 17 comments · Fixed by #3467

tobyxdd (Contributor) commented Jun 29, 2022

Hi @marten-seemann,

quic-go currently spends a lot of CPU time in (*sentPacketHistory).FirstOutstanding when transmitting at high speed over a high-packet-loss connection. (Normally this situation can't arise at all, since the built-in CUBIC congestion controller slows the connection to a crawl when the loss rate is high; I have replaced it with a custom constant-speed CC.)

The following call graph was generated by a transmission test between two $5 AWS Lightsail servers at 200 Mbps, over a simulated 30 ms, 4% packet loss link:

[pprof call graph: profile001]

Showing nodes accounting for 3080ms, 28.33% of 10870ms total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                            1640ms 53.25% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHistory).HasOutstandingPackets (inline)
                                            1440ms 46.75% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHandler).PeekPacketNumber (inline)
    2490ms 22.91% 22.91%     3080ms 28.33%                | github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHistory).FirstOutstanding
                                             590ms 19.16% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*PacketElement).Next (inline)
----------------------------------------------------------+-------------
                                             590ms   100% |   github.com/lucas-clemente/quic-go/internal/ackhandler.(*sentPacketHistory).FirstOutstanding (inline)
     590ms  5.43% 28.33%      590ms  5.43%                | github.com/lucas-clemente/quic-go/internal/ackhandler.(*PacketElement).Next
----------------------------------------------------------+-------------

FirstOutstanding was taking up nearly 30% of the runtime, making it the biggest CPU bottleneck. The program saturated an entire CPU core and couldn't get past a mere 100 Mbps because of it, and it gets worse as I crank up the loss rate. Switching to servers with better CPU performance does improve the situation: I was able to hit ~700 Mbps between two Ryzen 7 machines over the same link.

Do you have any thoughts on what could be done to optimize this part of the code? Perhaps we could add some kind of cache, or pre-calculate the first outstanding packet every time the linked list is modified, instead of iterating over it on every call?

pprof sample attached:
pprof.hy_linux_amd64.samples.cpu.002.pb.gz
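
To make the cache/pre-calculation idea concrete, here is a minimal sketch in Go, using the standard library's container/list as a stand-in for quic-go's packet list. All names here (the struct fields, SentPacket, onNoLongerOutstanding) are illustrative assumptions, not quic-go's actual API:

```go
package ackhandler

import "container/list"

// Packet is an illustrative stand-in carrying only the fields this sketch needs.
type Packet struct {
	PacketNumber         int64
	declaredLost         bool
	skippedPacket        bool
	isPathMTUProbePacket bool
}

func isOutstanding(p *Packet) bool {
	return !p.declaredLost && !p.skippedPacket && !p.isPathMTUProbePacket
}

type sentPacketHistory struct {
	packets          *list.List    // all sent packets, ordered by packet number
	firstOutstanding *list.Element // cached; nil if nothing is outstanding
}

// FirstOutstanding becomes O(1): just return the cached element.
func (h *sentPacketHistory) FirstOutstanding() *Packet {
	if h.firstOutstanding == nil {
		return nil
	}
	return h.firstOutstanding.Value.(*Packet)
}

// SentPacket appends a packet. The cache only changes if nothing was
// outstanding before.
func (h *sentPacketHistory) SentPacket(p *Packet) {
	el := h.packets.PushBack(p)
	if h.firstOutstanding == nil && isOutstanding(p) {
		h.firstOutstanding = el
	}
}

// onNoLongerOutstanding must be called before an element is removed (ACKed)
// or marked lost/skipped. The cache only moves if it pointed at exactly this
// element; then it advances to the next outstanding packet. The cursor only
// ever moves forward, so the scan cost is amortized across all mutations.
func (h *sentPacketHistory) onNoLongerOutstanding(el *list.Element) {
	if el != h.firstOutstanding {
		return
	}
	for e := el.Next(); e != nil; e = e.Next() {
		if isOutstanding(e.Value.(*Packet)) {
			h.firstOutstanding = e
			return
		}
	}
	h.firstOutstanding = nil
}
```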

marten-seemann (Member) commented

We call FirstOutstanding twice. Once when generating the packet number:
https://github.com/lucas-clemente/quic-go/blob/f29dd273b40d37b6ea8206350b2d96e8779161e5/internal/ackhandler/sent_packet_handler.go#L681-L687
and once when queueing a probe packet:
https://github.com/lucas-clemente/quic-go/blob/f29dd273b40d37b6ea8206350b2d96e8779161e5/internal/ackhandler/sent_packet_handler.go#L760-L762

Generating a packet number happens for every single packet, no matter the loss rate. Am I right to assume that in low loss conditions, we do more iterations of the loop?

Maybe the solution here is to have two separate lists: one that has all packets p for which !p.declaredLost && !p.skippedPacket && !p.IsPathMTUProbePacket, and one for all the others. That would allow you to just take the first packet of the first list to determine the first outstanding packet.

The only problem is that you'd have to move packets between the two lists whenever you declare a packet lost, and that's an O(N) operation. However, N is the number of lost packets (plus skipped and MTU probe packets), so it should be fairly small under normal conditions. You could also split that list up further, with one list containing skipped and MTU probe packets, and the other one lost packets. Packets are declared lost in the order they were sent, which would reduce declaring a packet lost to an O(1) operation. This might be over-optimizing things, though.
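
A rough sketch of the two-list variant (types and method shapes are illustrative, not the actual quic-go implementation; Packet and isOutstanding are as in the sketch above). Note how DeclareLost walks the second list from the tail:

```go
package ackhandler

import "container/list"

// sentPacketHistory split into two lists, both ordered by packet number.
type sentPacketHistory struct {
	outstanding *list.List // !declaredLost && !skippedPacket && !isPathMTUProbePacket
	other       *list.List // lost, skipped, and MTU probe packets
}

// FirstOutstanding is O(1): the front of the outstanding list.
func (h *sentPacketHistory) FirstOutstanding() *Packet {
	if el := h.outstanding.Front(); el != nil {
		return el.Value.(*Packet)
	}
	return nil
}

// DeclareLost moves a packet (an element of the outstanding list) into the
// other list. Walking backwards from the tail finds the insertion point;
// since packets are declared lost in send order, the walk only passes
// skipped and MTU probe packets that were sent later.
func (h *sentPacketHistory) DeclareLost(el *list.Element) {
	p := el.Value.(*Packet)
	p.declaredLost = true
	h.outstanding.Remove(el)
	pos := h.other.Back()
	for pos != nil && pos.Value.(*Packet).PacketNumber > p.PacketNumber {
		pos = pos.Prev()
	}
	if pos == nil {
		h.other.PushFront(p)
	} else {
		h.other.InsertAfter(p, pos)
	}
}
```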

tobyxdd (Contributor, Author) commented Jun 29, 2022

Am I right to assume that in low loss conditions, we do more iterations of the loop?

I think you mean in high loss conditions?

tobyxdd (Contributor, Author) commented Jun 30, 2022

Can you elaborate on why it would be O(N) with two lists, and O(1) with three? Is it because when using two lists, declaring a packet lost requires finding the correct position in the lost || skipped || probe list for insertion, and when using a dedicated lost list, you can just append it to the tail?

Also, I'm not sure how to implement Iterate when there are multiple lists. Does the order matter? e.g. can it just internally do

Iterate(outstandingList)
Iterate(probeSkippedList)
Iterate(lostList)

or must it behave as if it were still one list?

marten-seemann (Member) commented

I think you mean in high loss conditions?

Yes, that's what I meant.

Is it because when using two lists, declaring a packet lost requires finding the correct position in the lost || skipped || probe list for insertion, and when using a dedicated lost list, you can just append it to the tail?

Correct. The N is pretty small when using two lists, though (compared to the current situation), so that's probably fine.

Also, I'm not sure how to implement Iterate when there are multiple lists. Does the order matter?

Order does matter, since it allows the caller to abort early. One way to implement this is to keep nextOutstanding, nextProbeSkipped and nextLost variables pointing to the next packet in each respective list. Then, before each invocation of the callback, you decide which packet to pass: just take the one with the lowest packet number.
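
A sketch of that merge, assuming a hypothetical three-list split (fields outstanding, probeSkipped and lost) and the illustrative Packet type from the sketches above; the callback shape, returning (continue, error), mirrors the existing single-list Iterate:

```go
package ackhandler

import "container/list"

type sentPacketHistory struct {
	outstanding, probeSkipped, lost *list.List // each ordered by packet number
}

// pn returns the packet number stored in a list element (helper for the sketch).
func pn(el *list.Element) int64 { return el.Value.(*Packet).PacketNumber }

// Iterate walks the three packet-number-ordered lists as if they were one,
// by keeping one cursor per list and always yielding the packet with the
// lowest packet number. The caller still sees packets in send order and can
// abort early by returning false.
func (h *sentPacketHistory) Iterate(cb func(*Packet) (bool, error)) error {
	cursors := [3]*list.Element{
		h.outstanding.Front(),
		h.probeSkipped.Front(),
		h.lost.Front(),
	}
	for {
		// Pick the cursor pointing at the lowest packet number.
		lowest := -1
		for i, el := range cursors {
			if el != nil && (lowest == -1 || pn(el) < pn(cursors[lowest])) {
				lowest = i
			}
		}
		if lowest == -1 {
			return nil // all three lists exhausted
		}
		cont, err := cb(cursors[lowest].Value.(*Packet))
		if err != nil {
			return err
		}
		if !cont {
			return nil // caller aborted early
		}
		cursors[lowest] = cursors[lowest].Next()
	}
}
```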

tobyxdd (Contributor, Author) commented Jun 30, 2022

Might be a dumb question but what would happen if we simply remove all declaredLost packets from the list immediately (vs after 3*PTO as it is now) and remove this field altogether?

I tried to look around the code but didn't find anything obvious.

tobyxdd (Contributor, Author) commented Jun 30, 2022

btw, I changed the code to look like this to confirm that it is spending most of the time filtering out declaredLost packets.

[screenshot of the modified FirstOutstanding code]

tobyxdd (Contributor, Author) commented Jun 30, 2022

My understanding is that, since ACKed packets are immediately removed from the list, but lost packets are only cleared after 3*PTO by DeleteOldPackets, a huge backlog of lost packets builds up at the front of the list, and FirstOutstanding has to get through it every time it is called. That's what causes this performance issue.
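
Some rough arithmetic to back this up (packet size and PTO here are assumptions for illustration): at 200 Mbps with ~1250-byte packets the sender emits about 20,000 packets per second, so 4% loss declares roughly 800 packets per second lost. With 3*PTO on the order of a few hundred milliseconds, a few hundred lost packets are sitting at the front of the list at any given moment, and FirstOutstanding has to skip past all of them for nearly every packet sent.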

tobyxdd (Contributor, Author) commented Jun 30, 2022

And if that's the case,

Although the N is pretty small when using two lists (as compared to the current situation)

The N would still be pretty large, right?

marten-seemann (Member) commented

Might be a dumb question but what would happen if we simply remove all declaredLost packets from the list immediately (vs after 3*PTO as it is now) and remove this field altogether?

Good question. Consider this scenario: we declare a packet lost, then receive an ACK for it later. If we haven't yet had the chance to retransmit its frames, we don't need to do so any more. We can also use the ACK to generate an RTT sample (which is very valuable in this situation).

Although the N is pretty small when using two lists (as compared to the current situation)

The N would still be pretty large, right?

Not if you traverse the list from the end (as in the DeclareLost sketch above): you'll only encounter skipped and MTU probe packets.

tobyxdd (Contributor, Author) commented Jul 1, 2022

I see. I'll work on an implementation using 2 lists.

tobyxdd (Contributor, Author) commented Jul 1, 2022

Hi @marten-seemann, here's the pprof result after this optimization.

[pprof call graph after the optimization: profile001]

tobyxdd (Contributor, Author) commented Jul 1, 2022

I have found another significant bottleneck when the loss rate is high, however. This time it's on the receiving end.

[pprof call graph and listing: profile002]

marten-seemann (Member) commented

@tobyxdd I think you've rediscovered #300 :)

And with generics on the horizon we'll soon be able to fix it (by using a tree data structure)!

tobyxdd (Contributor, Author) commented Jul 1, 2022

And with generics on the horizon

Why does it require generics?

marten-seemann (Member) commented

Why does it require generics?

I was hoping that we'd be able to find a well-tested, type-safe generic tree implementation.

tobyxdd (Contributor, Author) commented Jul 4, 2022

Why does it require generics?

I was hoping that we'd be able to find a well-tested, type-safe generic tree implementation.

Could this be useful? https://github.com/tidwall/btree

tobyxdd (Contributor, Author) commented Jul 5, 2022

I don't think there's any library that's readily usable. Even if they support custom comparators, they only support comparisons between values of the same type, which doesn't fit our "find gap by offset" use case.
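
To illustrate the mismatch (all types here are hypothetical, not from tidwall/btree or any other library): an ordered container's comparator has the shape Less(a, b T) for a single type T, but the gap lookup needs to compare an interval against a bare offset, so containment has to be checked outside the comparator:

```go
package example

// byteInterval is a hypothetical gap in the received data: [Start, End).
type byteInterval struct{ Start, End uint64 }

// A same-type comparator is all a generic ordered container accepts:
func lessInterval(a, b byteInterval) bool { return a.Start < b.Start }

// findGap answers "which gap contains offset x?". The binary search below is
// a stand-in for a tree's seek operation: with only lessInterval available,
// we can seek to the last gap with Start <= x, but the containment test
// (x < End) has to happen outside the comparator.
func findGap(sortedGaps []byteInterval, x uint64) (byteInterval, bool) {
	lo, hi := 0, len(sortedGaps)
	for lo < hi { // find the first gap with Start > x
		mid := (lo + hi) / 2
		if sortedGaps[mid].Start <= x {
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	if lo == 0 { // every gap starts after x
		return byteInterval{}, false
	}
	g := sortedGaps[lo-1] // last gap with Start <= x
	return g, x < g.End
}
```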
