
net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs #45886

Open
bradfitz opened this issue Apr 30, 2021 · 49 comments
@bradfitz
Contributor

(co-written with @neild)

Linux has recvmmsg to read multiple UDP packets from the kernel at once.

There is no Recvmmsg wrapper func in golang.org/x/sys/unix. That's easy enough to add, but it's not ideal: every caller would tie up an OS thread while blocked waiting for a packet.

There is, however, batch support in golang.org/x/net/ipv{4,6}: e.g. https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch (added around golang/net@b8b1343). But it has the same thread-blocking problem. And it has the additional problem of having separate packages for IPv4 vs IPv6.

It'd be nicer to integrate with the runtime's poller.

Adding API to do this in the net package would mean both:

  1. we'd be able to integrate with the runtime poller (epoll, etc) and not waste a thread during a read
  2. there'd be portable API to do this regardless of whether the platform/kernel version supports something like recvmmsg.

For writing, net.Buffers already exists, as does golang.org/x/net/ipv{4,6}'s PacketConn.WriteBatch, so the write side is less important, but it could be done for consistency.

As for a potential API, https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch is close, but the platform-specific flags should probably not be included, at least not as an int. While there's some precedent in https://golang.org/pkg/net/#UDPConn.ReadMsgUDP and its use of flags int, we could probably use a better type if we need flags for some reason.

Alternatively, if callers of x/sys/unix or x/net/ipv{4,6} could do this efficiently with the runtime poller, that'd also work (even if they'd need to use some build tags, which is probably tolerable for anybody who cares about this).

@gopherbot gopherbot added this to the Proposal milestone Apr 30, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Apr 30, 2021
@ianlancetaylor
Contributor

What should the allocation strategy be here? The existing ReadMsgUDP method takes a pair of buffers and fills them in. Should this new method take a slice of buffers and somehow return how much data was read into each buffer?

@bradfitz
Contributor Author

At a high level, the caller should supply all the memory. The whole thing should be zero allocations (see also: #43451 and https://golang.org/cl/291509), otherwise the sort of people who'd want to use this probably wouldn't want to use it.

Probably pass a slice of messages similar to ipv4.PacketConn.ReadBatch but likely with a slightly different Message. I don't think the Message.Addr net.Addr field is amenable to the midstack inlining optimization from https://golang.org/cl/291509.

@neild
Contributor

neild commented Apr 30, 2021

A possible concrete API:

type UDPMessage struct {
  // recvmmsg accepts a per-message scatter/gather array, so this could be [][]byte instead.
  Buffer []byte
  OOB    []byte
  Addr   UDPAddr

  // The existing (*UDPConn).ReadMsgUDP method returns flags as an int,
  // but perhaps this should be a more abstract type.
  Flags  int
}

// ReadUDPBatch reads multiple messages from c.
// It returns the number of messages read.
// It reads the payload into Buffer and associated out-of-band data into OOB,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags set on each message.
func (c *UDPConn) ReadUDPBatch(ms []UDPMessage) (int, error)

This API preserves the existing limitation that there is no way to provide recvmsg flags to *UDPConn methods. (I'm quite confused, by the way, since it looks like we never set the MSG_OOB flag on recvmsg calls even when reading OOB data. Is this flag not actually necessary?)

@ianlancetaylor
Contributor

UDP doesn't have out-of-band data. TCP does. I have no idea why UDPConn.ReadMsgUDP takes an oob argument or an oobn result. Maybe I'm missing something.

For that matter, I'm not sure offhand how to read out-of-band data for a TCP socket using the net package.

@neild
Contributor

neild commented Apr 30, 2021

UDP doesn't have out-of-band data.

Oh, right. I mean "control" data--the msg_control field of a struct msghdr. The fact that ReadMsgUDP calls its parameter oob for some reason led me astray.

@ianlancetaylor
Contributor

Whoops, that led me astray also. But the end result is the same. As you say, the oob parameter to ReadMsgUDP, if not empty, can be filled with the ancillary data that the recvmsg system call returns in the msg_control field. But UDP sockets never have any ancillary data. So it's still pointless.

@neild
Contributor

neild commented Apr 30, 2021

UDP sockets can have ancillary data: Setting the IP_PKTINFO sockopt will give you a control message containing the interface index the packet was received on among other info (useful for DHCP servers), and the IP_TOS sockopt will give you the ToS byte.

@ianlancetaylor
Contributor

Ah, OK, thanks.

@rsc
Contributor

rsc commented May 5, 2021

We may want to wait on doing anything here until we figure out what to do with IP addresses, which still allocate.

@josharian
Contributor

It is important that the caller be able to specify whether they want to block for N packets, or receive up-to-N packets but only block until at least one is available. On linux that's accomplished through flags, but we might want a nicer, higher-level way to express that.

@ianlancetaylor
Contributor

@josharian Is it important to support both? For Read we only support "up-to-N", and make up for the lack via io.ReadFull.

@josharian
Contributor

Hmm. Yeah, I think always up-to-N seems fine, at least for my uses.

@diogin
Contributor

diogin commented Feb 5, 2022

There is also a sendmmsg(2): https://man7.org/linux/man-pages/man2/sendmmsg.2.html
If recvmmsg(2) is added, sendmmsg(2) could be added alongside it. QUIC implementations should benefit from these two APIs.

@cristaloleg

Russ's comment (#45886 (comment)) is already outdated, because we now have the new IP address types: #46518

@tomalaci

Another use case (one that I actively work on) is mass SNMP polling or other packet-based polling of hundreds of thousands of routers, mostly for their interface metrics. For that I can't create a socket per host, or a pool of sockets to host the polling sessions of individual hosts; I have to multiplex packets. Syscalls such as sendmmsg and recvmmsg help by queuing up multiple small packets (500-1000 bytes each). You can further increase performance by using GSO if your kernel allows it.

Performance benefits of the above can be read here as well: https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/

Currently I am manually creating non-blocking UDP sockets with additional epoll descriptors, pretty much circumventing Go's networking stack for that extra efficiency. That said, if your host does not have more than a 10G interface, there isn't much point in these optimizations: a basic UDP connection will max out the interface unless your packets are extremely small (a few hundred bytes or less).

@anacrolix
Contributor

I use this heavily from https://github.com/anacrolix/torrent, where inbound UDP over uTP (a UDP-based protocol) is a bottleneck. On Linux I'm able to use https://github.com/anacrolix/mmsg from https://github.com/anacrolix/go-libutp, which I've adapted from golang.org/x/net to fix some issues there around handling recvmmsg efficiently.

@neild
Contributor

neild commented Jul 29, 2022

Updated proposal using netip:

type UDPMessage struct {
  Buffer  []byte // must be pre-allocated by caller to non-zero len; callee fills, re-slices
  Control []byte // like Buffer (pre-allocated, re-sliced), may be zero-length, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // populated by callee

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  Flags int
}

// ReadUDPBatch reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags for each message.
func (c *UDPConn) ReadUDPBatch(ms []UDPMessage) (msgsRead int, err error)

// WriteUDPBatch sends multiple messages to c.
// It returns the number of messages written.
// The error is non-nil if and only if not all messages could be written.
func (c *UDPConn) WriteUDPBatch(ms []UDPMessage) (msgsSent int, err error)

@rsc
Contributor

rsc commented Aug 3, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Aug 3, 2022
@rsc
Contributor

rsc commented Aug 10, 2022

Briefly discussed at proposal review. Given spelling of ReadMsgUDP, we should probably call the type UDPMsg, so that there is only one spelling of "message" in the package. And then to avoid the new term Batch, we could call the methods ReadUDPMsgs and WriteUDPMsgs.

Otherwise the semantics look fine (or at least as good as ReadMsgUDP's with respect to the flags).

@rsc rsc changed the title proposal: net: add API to receive multiple UDP packets (potentially in one system call) proposal: net: add multiple-UDP-packet read and write Aug 10, 2022
@rsc
Contributor

rsc commented Aug 17, 2022

Note, Aug 31 2022: Changed type of Buffer and Control from []byte to [][]byte.


Updated API (only the names are different):

type UDPMsg struct {
  Buffer  [][]byte // must be pre-allocated by caller to non-zero len; callee fills, re-slices elements
  Control [][]byte // like Buffer (pre-allocated, re-sliced), may be zero-length, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // populated by callee

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  Flags int
}

// ReadUDPMsgs reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags for each message.
func (c *UDPConn) ReadUDPMsgs(ms []UDPMsg) (msgsRead int, err error)

// WriteUDPMsgs sends multiple messages to c.
// It returns the number of messages written.
// The error is non-nil if and only if not all messages could be written.
func (c *UDPConn) WriteUDPMsgs(ms []UDPMsg) (msgsSent int, err error)

Does anyone object to adding this API?

@rsc rsc changed the title proposal: net: add multiple-UDP-packet read and write proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Aug 17, 2022
@martin-sucha
Contributor

It is not clear to me from the proposed documentation what happens if the length of the Buffer (or Control) is smaller than the length of the UDP packet. Is it an error or is the data truncated?

I assume it truncates the data since ReadMsgUDP and recvmsg/recvmmsg all do, but we should explicitly mention the behavior in the documentation:

// ReadUDPMsgs reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// If the packet/control data is longer than the Buffer/Control, the Buffer/Control will
// contain as much data as fits, the rest will be truncated.
// Truncated data is not an error.
// ReadUDPMsgs sets the Addr and Flags fields to the source address
// and flags for each message.
func (c *UDPConn) ReadUDPMsgs(ms []UDPMsg) (msgsRead int, err error)

@martin-sucha
Contributor

Why does UDPMsg have Buffer []byte and not Buffer [][]byte or Buffer Buffers? recvmmsg/sendmmsg support a scatter/gather array per message. I see that possibility mentioned earlier in the thread, but I don't see any arguments why we should (or should not) use a single buffer per message.

Is there any downside of using multiple buffers per message?

@rsc
Contributor

rsc commented Sep 28, 2022

Sounds like the UDP GSO concern doesn't need to block this API after all.

@rsc
Contributor

rsc commented Sep 28, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Sep 28, 2022
@rsc rsc modified the milestones: Proposal, Backlog Sep 28, 2022
@matzf

matzf commented Oct 3, 2022

@bradfitz wrote:

There is, however, batch support in golang.org/x/net/ipv{4,6}: e.g. https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch [...]. But it has the same thread-blocking problem.

Just asking for clarification on this point here; as far as I understand, the recvmmsg/sendmmsg in these Read/WriteBatch functions always operate on a non-blocking socket and use the runtime's poller to poll for completion, via the syscall.RawConn interface.
It's entirely plausible that I'm misunderstanding how this works, or that I'm misunderstanding your statement, but to me it seems that this thread-blocking issue does not apply to Read/WriteBatch.

@ianlancetaylor
Contributor

@matzf I think you are correct.

@database64128
Contributor

It does seem like it should use [][]byte instead of []byte. Updated that.

@rsc It's not entirely clear to me how we are going to reslice Buffer [][]byte upon returning. I imagine the implementation would have to iterate over the byte slices and reslice the last byte slice as well as the outer slice. The caller would then have to basically do the same thing all over again.

It'd be much easier to understand and use if we leave Buffer as-is and set Msglen in a separate field. Or just keep the original Buffer []byte, since scatter-gather is rarely used with UDP messages.

Control is not an iovec. It should be changed back to Control []byte.

@rsc
Contributor

rsc commented Oct 20, 2022

Control is not an iovec. It should be changed back to Control []byte.

Yes, thanks.

@rsc It's not entirely clear to me how we are going to reslice Buffer [][]byte upon returning. I imagine the implementation would have to iterate over the byte slices and reslice the last byte slice as well as the outer slice. The caller would then have to basically do the same thing all over again.

It'd be much easier to understand and use if we leave Buffer as-is and set Msglen in a separate field. Or just keep the original Buffer []byte, since scatter-gather is rarely used with UDP messages.

OK, let's add Len int to the struct and just leave the slices alone. That will be a little more work for the caller but it's not any more than in C using iovecs. So the new struct would be:

type UDPMsg struct {
  Buffer  [][]byte // slices to read into or write from
  Control []byte // control data buffer to read into or write from, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // ReadMsgUDP fills in, WriteMsgUDP sends to this address

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  // ReadMsgUDP fills in, WriteMsgUDP uses
  Flags int

  // Results from ReadUDPMsgs, WriteUDPMsgs
  Len int // number of bytes transferred to/from Buffer
  ControlLen int // number of bytes transferred to/from Control
}

@rsc
Contributor

rsc commented Oct 20, 2022

@database64128 are you implementing this change?

@rsc
Contributor

rsc commented Oct 20, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc changed the title net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Oct 26, 2022
@rsc
Contributor

rsc commented Oct 26, 2022

Any objections to #45886 (comment) ?

@rsc
Contributor

rsc commented Nov 2, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Nov 9, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc changed the title proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Nov 9, 2022
@marten-seemann
Contributor

Author of quic-go here 👋 . For obvious reasons, we're very interested in this new API. Can't wait to play around with it!

We've been working on reducing allocations this year, and managed to significantly reduce GC pressure. We're now at a point where a large fraction of the remaining allocations happens within the standard library, especially during IP address handling, on both the send and the receive path. I see that the updated proposal is using the netip package, which should hopefully fix that problem.

I agree with @bradfitz's point that zero-allocation should be an explicit design goal. Being able to do a QUIC transfer that doesn't allocate at all (amortized) would be huge!

@aimuz
Contributor

aimuz commented Dec 1, 2023

It looks like this is not currently implemented. Can I try taking this one?

@neild
Contributor

neild commented Dec 1, 2023

Nobody is currently working on this, so far as I know. I intend to get to it, but I don't know when that will be and I don't want to block someone else from working on it if they have time.

Since this proposal was written, UDP GSO/GRO (generic segmentation offload / generic receive offload) have gained some attention. (I'm not certain to what extent OSs supported these features when we wrote this proposal; either they didn't exist yet, or we didn't know about them.) GSO/GRO have proven highly effective at improving performance of applications with high UDP throughput. For example:

https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/
https://tailscale.com/blog/quic-udp-throughput/

Any implementation of this proposal should take advantage of GSO/GRO when available. If it isn't possible to support GSO/GRO within the API as proposed, then we should go back and redesign it to make it possible. I don't think we should add an implementation of this API that doesn't support GSO/GRO on at least one platform, to validate that we aren't adding an API that is already obsolete.

/cc @marten-seemann, who may have opinions on what quic-go needs from the net package here.

@tomalaci

tomalaci commented Dec 1, 2023

Any implementation of this proposal should take advantage of GSO/GRO when available. [...]

Is it not possible to just enable GSO segmentation with your own custom segment size, as needed, via ancillary/control buffers? Worst case, it can also be done via setsockopt from the syscall package for the UDP socket.

Maybe an extra method can be added to UDPConn to "set/enable segmentation".

Either way, shouldn't GSO/GRO be a separate feature to be added to UDPConn (which I would highly support) since this feature is about enabling batch-sending and batch-receiving of UDP packets?

Also, GSO segmentation may not make sense if you are dealing with variable-sized UDP packets (you would need to pad them with junk bytes to align to the segmentation size and hope the receiving end knows how to extract the actual data, ignoring said padding).

@neild
Contributor

neild commented Dec 1, 2023

The feature in this issue is about efficient batch sending/receiving of UDP packets. One way to do that is sendmmsg/recvmmsg. Another is GSO/GRO. And the two approaches can of course be combined at the same time. (At the time we wrote this proposal, either GSO/GRO didn't exist, or Brad and I weren't aware of it.)

I think that to the extent possible, the net package should abstract away the details.

A perfect abstraction isn't possible, since GSO/GRO require a contiguous buffer (and GSO in particular requires, I believe, all sent packets to be the same size), but if we're adding new API surface to net to allow for efficient transmission of packets, it should work with GSO/GRO without requiring the user to use syscall or x/net. You can use sendmmsg/recvmmsg/GSO/GRO with x/net today if you want; the point of adding it to the net package is to offer a good abstraction over the low level interfaces.

Either way, shouldn't GSO/GRO be a separate feature to be added to UDPConn

Possibly, but if so, I think we should understand what that feature looks like before we commit to the API in this proposal, to make sure the two work together cleanly. And implement both sendmmsg/recvmmsg and GSO/GRO to verify they work together before committing to either implementation alone.

Projects
Status: Accepted