Can HW handle packet sorting for TPTO? #50

Open
ami-GS opened this issue Feb 7, 2023 · 11 comments
Labels
TPTO Related to Time based Packet Transmission offloads

Comments

ami-GS (Contributor) commented Feb 7, 2023

Hi @nibanks @stevedoyle @BorisPis,
To transmit time-based packets efficiently, the packets need to be sorted by deadline.
If multiple processes/threads use TPTO, packets with an earlier deadline could be blocked behind packets with a later deadline.
Windows doesn't have a packet queuing and sorting feature like Linux `tc`.
Does the HW/FW/driver have the ability to queue and sort?
Or should we set a limitation: each process/thread binds its own queue on the NIC, and the apps take care of packet ordering themselves?

ami-GS changed the title from "Which component handles sorting packet for Time-based packet transmission" to "Packet sorting with TPTO" on Feb 7, 2023
ami-GS added the "TPTO" (Related to Time based Packet Transmission offloads) label on Feb 7, 2023
maolson-msft (Contributor) commented Feb 7, 2023

I don't see how the upper layer could handle the sorting itself, because you can't restrict the upper layer to posting packets in global order of when they should be sent. (By "global order" I mean all the packets across all connections being posted in order of transmission time. Since different connections will have different pacing rates, that would be an insane restriction, I think.)

At best, you could have the upper layer establish individual "flows" with the NIC, and each time it posts packets they are marked as being associated with one of those "flows", and then maybe it would be reasonable to require the upper layer to post packets in order of transmission time with respect to each flow. But the NIC would then still have to sort the flows by the transmission time of the next-to-be-sent packet of each flow.
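
That per-flow scheme can be pictured with a small sketch: the upper layer only guarantees timestamp order *within* each flow, and a NIC-side scheduler merges flows by the timestamp of each flow's head packet. All names here are illustrative, not any real NIC or driver API:

```python
from collections import deque

class FlowScheduler:
    """Illustrative sketch of per-flow posting with NIC-side merging.

    The upper layer posts packets in timestamp order per flow; the
    scheduler always emits the flow whose head packet is due soonest.
    """

    def __init__(self):
        self.flows = {}  # flow_id -> deque of (send_time, packet)

    def post(self, flow_id, send_time, packet):
        # Only per-flow ordering is required of the upper layer.
        q = self.flows.setdefault(flow_id, deque())
        assert not q or q[-1][0] <= send_time, "per-flow order violated"
        q.append((send_time, packet))

    def next_packet(self):
        # NIC side: pick the flow whose head packet is due soonest.
        heads = [(q[0][0], fid) for fid, q in self.flows.items() if q]
        if not heads:
            return None
        _, fid = min(heads)
        return self.flows[fid].popleft()
```

Even though flows post interleaved (and only internally ordered), draining via `next_packet` yields packets in global timestamp order.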

maolson-msft (Contributor):

By the way, there's some very relevant discussion in section 8 of this paper, if you aren't aware: https://saeed.github.io/files/carousel-sigcomm17.pdf

maolson-msft (Contributor):

I feel like that paper is describing how a packet sorting algorithm like Carousel is implemented on top of a single-queue hardware pacer, but I don't really understand the paper's explanation.
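
For context, the software half of Carousel is a timing wheel: packets are bucketed by deadline into fixed-width time slots and drained slot by slot, making insertion O(1) instead of a full sort. A minimal illustrative sketch of that idea (not the paper's actual implementation):

```python
class TimingWheel:
    """Minimal Carousel-style timing wheel sketch.

    Slot width trades pacing precision for memory; packets beyond the
    horizon are clamped to the last slot. Names are illustrative only.
    """

    def __init__(self, slot_ns, num_slots):
        self.slot_ns = slot_ns
        self.num_slots = num_slots
        self.slots = [[] for _ in range(num_slots)]
        self.cur = 0  # absolute slot number of "now"

    def insert(self, send_time_ns, packet):
        # O(1): bucket by deadline; already-due packets go in the current slot.
        slot = max(self.cur, send_time_ns // self.slot_ns)
        slot = min(slot, self.cur + self.num_slots - 1)  # clamp to horizon
        self.slots[slot % self.num_slots].append(packet)

    def advance(self):
        """Return all packets due in the current slot, then tick forward."""
        idx = self.cur % self.num_slots
        due, self.slots[idx] = self.slots[idx], []
        self.cur += 1
        return due
```

Draining `advance()` once per tick releases packets in (slot-granular) deadline order regardless of insertion order.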

ami-GS (Contributor, Author) commented Feb 7, 2023

What does "flow" mean in this context?
Does it associate N sockets with M Tx queues in the NIC, with the app then specifying a flow ID?
The paper looks like something the kernel (tcpip) layer would implement. The background to my starting this discussion is to keep the implementation as simple as possible, as Nick commented in #44 (comment).

ami-GS (Contributor, Author) commented Feb 7, 2023

Maybe? The TAPRIO qdisc is the same idea as the flow, but it's hard to tell whether it is applicable to our requirements.

maolson-msft (Contributor):

By "flow" I mean "connection". A transport protocol implementation will typically post the packets for a single connection in the same order as it wants them to be sent, as far as I know.

"It associates N sockets to M Tx queue in NIC?" - I don't know what you mean by associating N things with M things. I'm talking about associating N connections with N queues in the NIC.

ami-GS (Contributor, Author) commented Feb 7, 2023

I came up with the N-to-N idea too, but the number of queues in the NIC is limited. How do we handle running out of queues on the NIC?
(I expected M (>> N) to be some kind of magic.)
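
One common fallback when connections outnumber queues is to hash each flow onto one of the M available queues, much as RSS spreads receive traffic: per-flow order is preserved because a flow always lands on the same queue, at the cost of possible head-of-line blocking between unrelated flows that share a queue. A hypothetical sketch:

```python
import zlib

def pick_queue(flow_id: bytes, num_queues: int) -> int:
    """Hypothetical flow-to-queue mapping when flows outnumber NIC
    Tx queues. Deterministic: every packet of a flow lands on the
    same queue, so per-flow order holds; unrelated flows sharing a
    queue may head-of-line block each other."""
    return zlib.crc32(flow_id) % num_queues
```

This preserves the per-flow ordering guarantee with only M queues, but it does not solve cross-flow deadline sorting within a shared queue.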

nibanks (Member) commented Feb 8, 2023

Fundamentally, the important question is (for @stevedoyle, @BorisPis):

Can the hardware deal with the interleaved sends that come from independent connections (flows) sending timestamped packets?

As the discussion above demonstrates, there are multiple ways to do this, but are any of them practical in HW?

nibanks changed the title from "Packet sorting with TPTO" to "Can HW handle packet sorting for TPTO?" on Feb 8, 2023
SymbiosisStarshine commented Oct 2, 2023

> Fundamentally, the important question is (for @stevedoyle, @BorisPis):
>
> Can the hardware deal with the interleaved sends that come from independent connections (flows) sending timestamped packets?
>
> As the discussion above demonstrates there are multiple ways to do this, but are any practical in the HW?

Hey Nick, Kevin Scott from Intel here.

Speaking to the capabilities of Intel LAN devices: once a particular Tx queue in the device has been given "work", in this case a packet or segment to transmit, there is no second operation that can be interleaved on that Tx queue; the first operation must complete before the next can begin. The kernel driver can of course take transmit requests from any arbitrary thread, but it must determine which hardware queue to send each packet on, most frequently by the NetBufferListHashValue provided in the NBL. At that point the driver takes a lock to post the work for that queue on a ring (a "descriptor ring", in the term used by many IHVs).

One other item of note, related to the above: any request to delay a packet or TSO will also delay the transmission of the next work item on the queue. So if the driver sees work request A and then work request B, and work request A is a TSO that requires a 5ms delay before transmission, has a 5ms delay between segments, and consists of 5 segments, then work request B will see a 25ms delay before being transmitted. NDIS and the upper layers of the stack would see the Tx completion for B take a "long" time.
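
Under that model, the 25ms follows from the initial delay plus one inter-segment gap between each consecutive pair of segments. A quick check (illustrative helper, not any driver API):

```python
def tso_queue_occupancy_ms(initial_delay_ms, inter_segment_delay_ms, segments):
    """Time a delayed TSO holds the Tx queue before the next work item
    can start, per the model described above: the initial delay plus one
    gap between each consecutive pair of segments (wire time ignored)."""
    return initial_delay_ms + inter_segment_delay_ms * (segments - 1)
```

For the example above, `tso_queue_occupancy_ms(5, 5, 5)` gives 5 + 5 * 4 = 25ms, matching the delay work request B would observe.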

nibanks (Member) commented Oct 3, 2023

@SymbiosisStarshine thanks for your feedback, and that makes sense, but it still leaves us with the problem of "How do we efficiently segment and pace multiple independent flows?" Based on what you're saying above, it seems like every flow would be required to have its own queue. Is that feasible in hardware for modern server scenarios?

SymbiosisStarshine:

> @SymbiosisStarshine thanks for your feedback, and that makes sense, but it still leaves us with the problem of "How do we efficiently segment and pace multiple independent flows?" Based on what you're saying above, it seems like every flow would be required to have its own queue. Is that feasible at the hardware for modern server scenarios?

From the application's point of view, it is often working at the socket level, where traffic can be readily identified as belonging to that application. The application itself knows nothing about the LAN device; it simply knows that traffic matching some set of criteria should reach it. As such, I tend to lean towards giving applications some subset of the device that extends that world view: traffic matching specific criteria belongs to the application alone, and I find a queue is a very nice fit for that abstraction.

If we set aside cases with large numbers of VMs running on a host, our LAN adapters will typically allocate 8 or 16 queues for use by the host, yet be capable of running hundreds of virtual machines. If the host is only using 16 queues, then in many cases the LAN device has hundreds of queues waiting for something to do, since the HW does not care whether virtualization is in use. We can provide a queue plus filtering capabilities to ensure that traffic matching a set pattern lands only on that queue, i.e. goes only to that application.

If we also consider modern CPUs, there are typically a large number of CPU cores, which lets software scale rather readily! If we could scale the queues in some manner other than simply adding more LAN devices to the host, this would provide a cost benefit (you don't need to buy more NICs), and you would also make better use of resources that are otherwise sitting idle.

This is a rather long-winded way of saying "Yes, I believe it is feasible in modern server scenarios for LAN devices to provide resources to the application directly." The "how" of getting there does get interesting, but I think this is an area rich with opportunities.
