proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs #45886

Open
bradfitz opened this issue Apr 30, 2021 · 33 comments

@bradfitz
Contributor

bradfitz commented Apr 30, 2021

(co-written with @neild)

Linux has recvmmsg to read multiple UDP packets from the kernel at once.

There is no Recvmmsg wrapper func in golang.org/x/sys/unix. That's easy enough to add, but it's not ideal: it means all callers of it would be using a thread while blocked waiting for a packet.

There is, however, batch support in golang.org/x/net/ipv{4,6}: e.g. https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch (added around golang/net@b8b1343). But it has the same thread-blocking problem. And it has the additional problem of having separate packages for IPv4 vs IPv6.

It'd be nicer to integrate with the runtime's poller.

Adding API to do this in the net package would mean both:

  1. we'd be able to integrate with the runtime poller (epoll, etc) and not waste a thread during a read
  2. there'd be portable API to do this regardless of whether the platform/kernel version supports something like recvmmsg.

For writing, net.Buffers already exists, as does golang.org/x/net/ipv{4,6}'s PacketConn.WriteBatch, so it is less important, but could be done for consistency.

As for a potential API, https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch is close, but the platform-specific flags should probably not be included, at least not as an int. While there's some precedent in the flags int of https://golang.org/pkg/net/#UDPConn.ReadMsgUDP, we could probably use a better type if we need flags for some reason.

Alternatively, if callers of x/sys/unix or x/net/ipv{4,6} could do this efficiently with the runtime poller, that'd also work (even if they'd need to use some build tags, which is probably tolerable for anybody who cares about this).

@gopherbot gopherbot added this to the Proposal milestone Apr 30, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Apr 30, 2021
@ianlancetaylor
Contributor

ianlancetaylor commented Apr 30, 2021

What should the allocation strategy be here? The existing ReadMsgUDP method takes a pair of buffers and fills them in. Should this new method take a slice of buffers and somehow return how much data was read into each buffer?

@bradfitz
Contributor Author

bradfitz commented Apr 30, 2021

At a high level, the caller should supply all the memory. The whole thing should be zero allocations (see also: #43451 and https://golang.org/cl/291509), otherwise the sort of people who'd want to use this probably wouldn't want to use it.

Probably pass a slice of messages similar to ipv4.PacketConn.ReadBatch but likely with a slightly different Message. I don't think the Message.Addr net.Addr field is amenable to the midstack inlining optimization from https://golang.org/cl/291509.

@neild
Contributor

neild commented Apr 30, 2021

A possible concrete API:

type UDPMessage struct {
  // recvmmsg accepts a per-message scatter/gather array, so this could be [][]byte instead.
  Buffer []byte
  OOB    []byte
  Addr   UDPAddr

  // The existing (*UDPConn).ReadMsgUDP method returns flags as an int,
  // but perhaps this should be a more abstract type.
  Flags  int
}

// ReadUDPBatch reads multiple messages from c.
// It returns the number of messages read.
// It reads the payload into Buffer and associated out-of-band data into OOB,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags set on each message.
func (c *UDPConn) ReadUDPBatch(ms []UDPMessage) (int, error)

This API preserves the existing limitation that there is no way to provide recvmsg flags to *UDPConn methods. (I'm quite confused, by the way, since it looks like we never set the MSG_OOB flag on recvmsg calls even when reading OOB data. Is this flag not actually necessary?)

@ianlancetaylor
Contributor

ianlancetaylor commented Apr 30, 2021

UDP doesn't have out-of-band data. TCP does. I have no idea why UDPConn.ReadMsgUDP takes an oob argument or an oobn result. Maybe I'm missing something.

For that matter I'm not sure off hand how to read out-of-band data for a TCP socket using the net package.

@neild
Contributor

neild commented Apr 30, 2021

> UDP doesn't have out-of-band data.

Oh, right. I mean "control" data--the msg_control field of a struct msghdr. The fact that ReadMsgUDP calls its parameter oob for some reason led me astray.

@ianlancetaylor
Contributor

ianlancetaylor commented Apr 30, 2021

Whoops, that led me astray also. But the end result is the same. As you say, the oob parameter to ReadMsgUDP, if not empty, can be filled with the ancillary data that the recvmsg system call returns in the msg_control field. But UDP sockets never have any ancillary data. So it's still pointless.

@neild
Contributor

neild commented Apr 30, 2021

UDP sockets can have ancillary data: Setting the IP_PKTINFO sockopt will give you a control message containing the interface index the packet was received on among other info (useful for DHCP servers), and the IP_TOS sockopt will give you the ToS byte.

@ianlancetaylor
Contributor

ianlancetaylor commented May 1, 2021

Ah, OK, thanks.

@rsc
Contributor

rsc commented May 5, 2021

We may want to wait on doing anything here until we figure out what to do with IP addresses, which still allocate.

@josharian
Contributor

josharian commented Jun 2, 2021

It is important that the caller be able to specify whether they want to block for N packets, or receive up-to-N packets but only block until at least one is available. On linux that's accomplished through flags, but we might want a nicer, higher-level way to express that.

@ianlancetaylor
Contributor

ianlancetaylor commented Jun 2, 2021

@josharian Is it important to support both? For Read we only support "up-to-N", and make up for the lack via io.ReadFull.

@josharian
Contributor

josharian commented Jun 2, 2021

Hmm. Yeah, I think always up-to-N seems fine, at least for my uses.

@diogin
Contributor

diogin commented Feb 5, 2022

There is also a sendmmsg(2): https://man7.org/linux/man-pages/man2/sendmmsg.2.html
If recvmmsg(2) is added, sendmmsg(2) could be added together with it. QUIC implementations should benefit from these two APIs.

@cristaloleg

cristaloleg commented Feb 9, 2022

Russ's comment (#45886 (comment)) is already outdated because we now have the new IP address types: #46518

@tomalaci

tomalaci commented May 17, 2022

Another use case (that I actively work on) is mass SNMP polling or other types of packet-based polling (polling hundreds of thousands of routers), mostly for their interface metrics. For that I can't exactly create a socket per host or create a pool of sockets to host polling sessions of individual hosts; I have to multiplex packets. Syscalls such as sendmmsg and recvmmsg help by queuing up multiple small packets (500-1000 bytes each). You can further increase performance by using GSO if your kernel allows it.

Performance benefits of the above can be read here as well: https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/

Currently I am manually creating non-blocking UDP sockets with additional epoll descriptors to pretty much circumvent Go's networking stack for that extra efficiency. That being said, if your host does not have an interface faster than 10G then there isn't much point in doing these optimizations, as a basic UDP connection will max out the interface unless your packets are extremely small (a few hundred bytes or less).

@anacrolix
Contributor

anacrolix commented Jun 6, 2022

I use this heavily from https://github.com/anacrolix/torrent, where inbound UDP over uTP (a UDP-based protocol) is a bottleneck. On Linux I'm able to use https://github.com/anacrolix/mmsg from https://github.com/anacrolix/go-libutp, which I've adapted from golang.org/x/net to fix up some issues there around handling recvmmsg efficiently.

@neild
Contributor

neild commented Jul 29, 2022

Updated proposal using netip:

type UDPMessage struct {
  Buffer  []byte // must be pre-allocated by caller to non-zero len; callee fills, re-slices
  Control []byte // like Buffer (pre-allocated, re-sliced), may be zero-length, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // populated by callee

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  Flags int
}

// ReadUDPBatch reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags for each message.
func (c *UDPConn) ReadUDPBatch(ms []UDPMessage) (msgsRead int, err error)

// WriteUDPBatch sends multiple messages to c.
// It returns the number of messages written.
// The error is non-nil if and only if not all messages could be written.
func (c *UDPConn) WriteUDPBatch(ms []UDPMessage) (msgsSent int, err error)

@rsc
Contributor

rsc commented Aug 3, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Aug 3, 2022
@rsc
Contributor

rsc commented Aug 10, 2022

Briefly discussed at proposal review. Given spelling of ReadMsgUDP, we should probably call the type UDPMsg, so that there is only one spelling of "message" in the package. And then to avoid the new term Batch, we could call the methods ReadUDPMsgs and WriteUDPMsgs.

Otherwise the semantics look fine (or at least as good as ReadMsgUDP with respect to the flags.)

@rsc rsc changed the title proposal: net: add API to receive multiple UDP packets (potentially in one system call) proposal: net: add multiple-UDP-packet read and write Aug 10, 2022
@rsc
Contributor

rsc commented Aug 17, 2022

Note, Aug 31 2022: Changed type of Buffer and Control from []byte to [][]byte.


Updated API (only the names are different):

type UDPMsg struct {
  Buffer  [][]byte // must be pre-allocated by caller to non-zero len; callee fills, re-slices elements
  Control [][]byte // like Buffer (pre-allocated, re-sliced), may be zero-length, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // populated by callee

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  Flags int
}

// ReadUDPMsgs reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags for each message.
func (c *UDPConn) ReadUDPMsgs(ms []UDPMsg) (msgsRead int, err error)

// WriteUDPMsgs sends multiple messages to c.
// It returns the number of messages written.
// The error is non-nil if and only if not all messages could be written.
func (c *UDPConn) WriteUDPMsgs(ms []UDPMsg) (msgsSent int, err error)

Does anyone object to adding this API?

@rsc rsc changed the title proposal: net: add multiple-UDP-packet read and write proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Aug 17, 2022
@martin-sucha
Contributor

martin-sucha commented Aug 20, 2022

It is not clear to me from the proposed documentation what happens if the length of the Buffer (or Control) is smaller than the length of the UDP packet. Is it an error or is the data truncated?

I assume it truncates the data since ReadMsgUDP and recvmsg/recvmmsg all do, but we should explicitly mention the behavior in the documentation:

// ReadUDPMsgs reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// If the packet/control data is longer than the Buffer/Control, the Buffer/Control will
// contain as much data as fits, the rest will be truncated.
// Truncated data is not an error.
// ReadUDPMsgs sets the Addr and Flags fields to the source address
// and flags for each message.
func (c *UDPConn) ReadUDPMsgs(ms []UDPMsg) (msgsRead int, err error)

@martin-sucha
Contributor

martin-sucha commented Aug 20, 2022

Why does UDPMsg have Buffer []byte and not Buffer [][]byte or Buffer Buffers? recvmmsg/sendmmsg support a scatter/gather array per message. I see that possibility mentioned earlier in the thread, but I don't see any arguments why we should (or should not) use a single buffer per message.

Is there any downside of using multiple buffers per message?

@rsc
Contributor

rsc commented Aug 31, 2022

@martin-sucha Yes it truncates and sets the MSG_TRUNC flag, just as the underlying system call does on most (all?) systems.

It does seem like it should use [][]byte instead of []byte. Updated that.

We should define what we do on systems that don't have the batch UDP system calls (most non-Linux?). I assume WriteUDPMsgs will loop over the messages and write each one, while ReadUDPMsgs will read a single message and then return (reading a second message might block arbitrarily long, delaying the return of the first message).

Are there any other details we are missing? Does anyone object to this API?

@tomalaci

tomalaci commented Aug 31, 2022

> I assume WriteUDPMsgs will loop over the messages and write each one, while ReadUDPMsgs will read a single message and then return (reading a second message might block arbitrarily long, delaying the return of the first message).

WriteUDPMsgs could also just send one message with a normal send call and return msgsSent = 1. This approach would be simpler to maintain, and fewer people would be surprised by significant syscall overhead (waiting for an entire batch of messages to finish using single-message sends) when they try running it on non-Linux machines. Plus, you have to account for not sending the full batch of messages on Linux anyway, since sendmmsg can stop early when the socket becomes unavailable for sending.

@database64128
Contributor

database64128 commented Aug 31, 2022

In my experience with sendmmsg(2), it can achieve higher throughput if you poll for write readiness whenever the returned number of messages sent is less than the number of messages passed to the current sendmmsg(2) call. And in order to do that, the write method would have to loop over and send everything.

I assume the reason for the performance improvement is, in between iterations, you only have to poll for write readiness once. If the write method only does sendmmsg(2) once with best efforts, you end up making at least two sendmmsg(2) calls before it would return -EAGAIN or -EWOULDBLOCK. Each sendmmsg(2) call would then carry fewer messages, and thus the whole thing becomes less efficient.

@database64128
Contributor

database64128 commented Aug 31, 2022

> Why does UDPMsg have Buffer []byte and not Buffer [][]byte or Buffer Buffers? recvmmsg/sendmmsg support a scatter/gather array per message. I see that possibility mentioned earlier in the thread, but I don't see any arguments why we should (or should not) use a single buffer per message.
>
> Is there any downside of using multiple buffers per message?

Unless you are in an environment where the MTU is huge, it makes no sense to use multiple buffers for a single UDP packet. Gathering from an iovec could be a lot of pointer chasing. It makes more sense to copy everything into a single buffer, if your application only deals with typical MTU (1500 bytes).

@rsc
Contributor

rsc commented Sep 7, 2022

In C, write can return an indication that it sent fewer bytes and you have to retry yourself.
In Go, Write conventionally takes care of that for you - a short Write is an error.
I think we should do the same here: WriteUDPMsgs needs to write everything or else return an error.
It sounds like @database64128 is giving other reasons for the same outcome too.

@rsc
Contributor

rsc commented Sep 7, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@tomalaci

tomalaci commented Sep 8, 2022

> Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

Glad to see this moving forward! That said, is there any consideration for adding some form of official support for UDP GSO?

According to the Cloudflare blog post about accelerating UDP (https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/), enabling and using GSO can result in another significant gain. For example, here is the graph from their tests, which shows the rough proportional gain:
[image: graph from the Cloudflare blog post]

There was also this draft paper released earlier than the blog post: http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf

I am not entirely sure what an API design for it could look like, as it is highly specific to the Linux kernel and only available from kernel version 4.18 (so 3.x kernels can't use it). It could also perhaps be an optimization done by Go's runtime behind the scenes.

@rsc
Contributor

rsc commented Sep 21, 2022

@tomalaci, what exactly would the API changes be for GSO? Does it affect the API in this proposal? Thanks.

@database64128
Contributor

database64128 commented Sep 21, 2022

UDP GSO is usually done by passing a single large buffer to sendmsg(2) and the segmentation size as control message. I think the existing WriteMsgUDPAddrPort already covers this use case. If you want to combine GSO with sendmmsg(2), this proposal in its current shape should also work.

@tomalaci

tomalaci commented Sep 21, 2022

> @tomalaci, what exactly would the API changes be for GSO? Does it affect the API in this proposal? Thanks.

Upon further thinking, it should be possible to use this API for UDP GSO without issues. To use and customize UDP GSO offloading you may want the ability to set socket options (should be possible via UDPConn as long as you can get the file descriptor) and/or to set ancillary/control data (possible via this API as it exposes control buffers).

So I agree with @database64128, this should be a solid proposal from the looks of it, even for UDP GSO offloading.

@rsc
Contributor

rsc commented Sep 21, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

Status: Active