Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs #45886

Open
bradfitz opened this issue Apr 30, 2021 · 59 comments
Open

net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs #45886

bradfitz opened this issue Apr 30, 2021 · 59 comments

Comments

@bradfitz
Copy link
Contributor

(co-written with @neild)

Linux has recvmmsg to read multiple UDP packets from the kernel at once.

There is no Recvmmsg wrapper func in golang.org/x/sys/unix. That's easy enough to add, but it's not ideal: it means all callers of it would be using a thread while blocked waiting for a packet.

There is, however, batch support in golang.org/x/net/ipv{4,6}: e.g. https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch (added around golang/net@b8b1343). But it has the same thread-blocking problem. And it has the additional problem of having separate packages for IPv4 vs IPv6.

It'd be nicer to integrate with the runtime's poller.

Adding API to do this in the net package would mean both:

  1. we'd be able to integrate with the runtime poller (epoll, etc) and not waste a thread during a read
  2. there'd be portable API to do this regardless of whether the platform/kernel version supports something like recvmmsg.

For writing, net.Buffers already exists, as does golang.org/x/net/ipv{4,6}'s PacketConn.WriteBatch, so is less important, but could be done for consistency.

As far as a potential API, https://pkg.go.dev/golang.org/x/net/ipv4#PacketConn.ReadBatch is close, but the platform-specific flags should probably not be included, at least as an int. While there's some precedent with https://golang.org/pkg/net/#UDPConn.ReadMsgUDP use of flags int, we could probably use a better type if we need flags for some reason.

Alternatively, if callers of x/sys/unix or x/net/ipv{4,6} could do this efficiently with the runtime poller, that'd also work (even if they'd need to use some build tags, which is probably tolerable for anybody who cares about this).

@gopherbot gopherbot added this to the Proposal milestone Apr 30, 2021
@ianlancetaylor
Copy link
Contributor

What should the allocation strategy be here? The existing ReadMsgUDP method takes a pair of buffers and fills them in. Should this new method take a slice of buffers and somehow return how much data was read into each buffer?

@bradfitz
Copy link
Contributor Author

At a high level, the caller should supply all the memory. The whole thing should be zero allocations (see also: #43451 and https://golang.org/cl/291509), otherwise the sort of people who'd want to use this probably wouldn't want to use it.

Probably pass a slice of messages similar to ipv4.PacketConn.ReadBatch but likely with a slightly different Message. I don't think the Message.Addr net.Addr field is amenable to the midstack inlining optimization from https://golang.org/cl/291509.

@neild
Copy link
Contributor

neild commented Apr 30, 2021

A possible concrete API:

type UDPMessage struct {
  // recvmmsg accepts a per-message scatter/gather array, so this could be [][]byte instead.
  Buffer []byte
  OOB    []byte
  Addr   UDPAddr

  // The existing (*UDPConn).ReadMsgUDP method returns flags as an int,
  // but perhaps this should be a more abstract type.
  Flags  int
}

// ReadUDPBatch reads multiple messages from c
// It returns the number of messages read.
// It reads the payload into Buffer and associated out-of-band data into OOB,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags set on each message.
func (c *UDPConn) ReadUDPBatch(ms []UDPMessage) (int, error)

This API preserves the existing limitation that there is no way to provide recvmsg flags to *UDPConn methods. (I'm quite confused, by the way, since it looks like we never set the MSG_OOB flag on recvmsg calls even when reading OOB data. Is this flag not actually necessary?)

@ianlancetaylor
Copy link
Contributor

UDP doesn't have out-of-band data. TCP does. I have no idea why UDPConn.ReadMsgUDP takes an oob argument or an oobn result. Maybe I'm missing something.

For that matter I'm not sure off hand how to read out-of-band data for a TCP socket using the net package.

@neild
Copy link
Contributor

neild commented Apr 30, 2021

UDP doesn't have out-of-band data.

Oh, right. I mean "control" data--the msg_control field of a struct msghdr. The fact that ReadMsgUDP calls its parameter oob for some reason led me astray.

@ianlancetaylor
Copy link
Contributor

Whoops, that led me astray also. But the end result is the same. As you say, the oob parameter to ReadMsgUDP, if not empty, can be filled with the ancillary data that the readmsg system call returns in the msg_control field. But UDP sockets never have any ancillary data. So it's still pointless.

@neild
Copy link
Contributor

neild commented Apr 30, 2021

UDP sockets can have ancillary data: Setting the IP_PKTINFO sockopt will give you a control message containing the interface index the packet was received on among other info (useful for DHCP servers), and the IP_TOS sockopt will give you the ToS byte.

@ianlancetaylor
Copy link
Contributor

Ah, OK, thanks.

@rsc
Copy link
Contributor

rsc commented May 5, 2021

We may want to wait on doing anything here until we figure out what to do with IP addresses, which still allocate.

@josharian
Copy link
Contributor

It is important that the caller be able to specify whether they want to block for N packets, or receive up-to-N packets but only block until at least one is available. On linux that's accomplished through flags, but we might want a nicer, higher-level way to express that.

@ianlancetaylor
Copy link
Contributor

@josharian Is it important to support both? For Read we only support "up-to-N", and make up for the lack via io.ReadFull.

@josharian
Copy link
Contributor

Hmm. Yeah, I think always up-to-N seems fine, at least for my uses.

@diogin
Copy link
Contributor

diogin commented Feb 5, 2022

There is also a sendmmsg(2): https://man7.org/linux/man-pages/man2/sendmmsg.2.html
If readmmsg(2) is added, sendmmsg(2) could be added together. QUIC implementations should benefit from these two APIs.

@cristaloleg
Copy link

Russ's comment (#45886 (comment)) is already outdated 'cause we've new IP addresses: #46518

@tomalaci
Copy link

Another use-case (that I actively work on) is mass SNMP polling or other type of packet-based polling (polling hundreds of thousands of routers), mostly for their interface metrics. For that I can't exactly create socket per host or create a pool of sockets to host polling sessions of individual hosts, I have to multiplex packets. Syscalls such as sendmmsg and recvmmsg help by queuing up multiple small packets (500-1000 bytes each). You can further increase performance by using GSO if your kernel allows it.

Performance benefits of the above can be read here as well: https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/

Currently I am manually creating non-blocking UDP sockets with additional epoll descriptors to pretty much circumvent Go's networking stack for that extra efficiency. That being said, if your host does not have more than 10G interface then there isn't much point doing these optimizations as a basic UDP connection will max out the interface unless your packets are extremely small (few hundred bytes or less).

@anacrolix
Copy link
Contributor

I use this heavily from https://github.com/anacrolix/torrent, where inbound UDP over uTP (a UDP-based protocol) is a bottleneck). On Linux I'm able to use https://github.com/anacrolix/mmsg from https://github.com/anacrolix/go-libutp, which I've adapted from golang.org/x/net to fix up some issues there around handling recvmmsg efficiently.

@neild
Copy link
Contributor

neild commented Jul 29, 2022

Updated proposal using netip:

type UDPMessage struct {
  Buffer  []byte // must be pre-allocated by caller to non-zero len; callee fills, re-slices
  Control []byte // like Buffer (pre-allocated, re-sliced), may be zero-length, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // populated by callee

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  Flags int
}

// ReadUDPBatch reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags for each message.
func (c *UDPConn) ReadUDPBatch(ms []UDPMessage) (msgsRead int, err error)

// WriteUDPBatch sends multiple messages to c.
// It returns the number of messages written.
// The error is non-nil if and only if not all messages could be written.
func (c *UDPConn) WriteUDPBatch(ms []UDPMessage) (msgsSent int, err error)

@rsc
Copy link
Contributor

rsc commented Aug 3, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this to Active in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc
Copy link
Contributor

rsc commented Aug 10, 2022

Briefly discussed at proposal review. Given spelling of ReadMsgUDP, we should probably call the type UDPMsg, so that there is only one spelling of "message" in the package. And then to avoid the new term Batch, we could call the methods ReadUDPMsgs and WriteUDPMsgs.

Otherwise the semantics look fine (or at least as good as ReadMsgUDP with respect to the flags.)

@rsc rsc changed the title proposal: net: add API to receive multiple UDP packets (potentially in one system call) proposal: net: add multiple-UDP-packet read and write Aug 10, 2022
@rsc
Copy link
Contributor

rsc commented Aug 17, 2022

Note, Aug 31 2022: Changed type of Buffer and Control from []byte to [][]byte.


Updated API (only the names are different):

type UDPMsg struct {
  Buffer  [][]byte // must be pre-allocated by caller to non-zero len; callee fills, re-slices elements
  Control [][]byte // like Buffer (pre-allocated, re-sliced), may be zero-length, what ReadMsgUDP mistakenly called “oob” (TODO: fix its docs)
  Addr    netip.AddrPort // populated by callee

  // Flags is an OS-specific bitmask. Use x/net/etc for meaning.
  // We include it as-is only because the old UDPConn.ReadMsgUDP returned it too.
  Flags int
}

// ReadUDPMsgs reads multiple messages from c.
// It returns the number of messages read.
// The error is non-nil if and only if msgsRead is zero.
// It reads the payload into Buffer and associated control data into Control,
// and sets the length of each slice to the extent of the read data.
// It sets the Addr and Flags fields to the source address and flags for each message.
func (c *UDPConn) ReadUDPMsgs(ms []UDPMsg) (msgsRead int, err error)

// WriteUDPMsgs sends multiple messages to c.
// It returns the number of messages written.
// The error is non-nil if and only if not all messages could be written.
func (c *UDPConn) WriteUDPMsgs(ms []UDPMsg) (msgsSent int, err error)

Does anyone object to adding this API?

@rsc rsc changed the title proposal: net: add multiple-UDP-packet read and write proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Aug 17, 2022
@rsc rsc moved this from Accepted to Active in Proposals Oct 20, 2022
@rsc rsc changed the title net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Oct 26, 2022
@rsc
Copy link
Contributor

rsc commented Oct 26, 2022

Any objections to #45886 (comment) ?

@rsc rsc moved this from Active to Likely Accept in Proposals Nov 2, 2022
@rsc
Copy link
Contributor

rsc commented Nov 2, 2022

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Nov 9, 2022

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc rsc moved this from Likely Accept to Accepted in Proposals Nov 9, 2022
@rsc rsc changed the title proposal: net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs net: add UDPMsg, (*UDPConn).ReadUDPMsgs, (*UDPConn).WriteUDPMsgs Nov 9, 2022
@marten-seemann
Copy link
Contributor

Author of quic-go here 👋 . For obvious reasons, we're very interested in this new API. Can't wait to play around with it!

We've been working on reducing allocations this year, and managed to significantly reduce GC pressure. We're now at a point where a large fraction of the remaining allocations happens within the standard library, especially during IP address handling, on both the send and the receive path. I see that the updated proposal is using the netip package, which should hopefully fix that problem.

I agree with @bradfitz's point that zero-allocation should be an explicit design goal. Being able to do a QUIC transfer that doesn't allocate at all (amortized) would be huge!

@aimuz
Copy link
Contributor

aimuz commented Dec 1, 2023

It looks like this is not currently implemented? Can I try this one?

@neild
Copy link
Contributor

neild commented Dec 1, 2023

Nobody is currently working on this, so far as I know. I intend to get to it, but I don't know when that will be and I don't want to block someone else from working on it if they have time.

Since this proposal was written, UDP GSO/GRO (generic segmentation offload / generic receive offload) have gained some attention. (I'm not certain to what extent OSs supported these features when we wrote this proposal; either they didn't exist yet, or we didn't know about them.) GSO/GRO have proven highly effective at improving performance of applications with high UDP throughput. For example:

https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/
https://tailscale.com/blog/quic-udp-throughput/

Any implementation of this proposal should take advantage of GSO/GRO when available. If it isn't possible to support GSO/GRO within the API as proposed, then we should go back and redesign it to make it possible. I don't think we should add an implementation of this API that doesn't support GSO/GRO on at least one platform, to validate that we aren't adding an API that is already obsolete.

/cc @marten-seemann, who may have opinions on what quic-go needs from the net package here.

@tomalaci
Copy link

tomalaci commented Dec 1, 2023

Nobody is currently working on this, so far as I know. I intend to get to it, but I don't know when that will be and I don't want to block someone else from working on it if they have time.

Since this proposal was written, UDP GSO/GRO (generic segmentation offload / generic receive offload) have gained some attention. (I'm not certain to what extent OSs supported these features when we wrote this proposal; either they didn't exist yet, or we didn't know about them.) GSO/GRO have proven highly effective at improving performance of applications with high UDP throughput. For example:

https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/ https://tailscale.com/blog/quic-udp-throughput/

Any implementation of this proposal should take advantage of GSO/GRO when available. If it isn't possible to support GSO/GRO within the API as proposed, then we should go back and redesign it to make it possible. I don't think we should add an implementation of this API that doesn't support GSO/GRO on at least one platform, to validate that we aren't adding an API that is already obsolete.

/cc @marten-seemann, who may have opinions on what quic-go needs from the net package here.

Is it not possible to just enable GSO segmentation with your own custom size as needed via ancillary/control buffers? Worst case it can also be done via setsockopt from syscall package for the UDP socket.

Maybe an extra method can be added to UDPConn to "set/enable segmentation".

Either way, shouldn't GSO/GRO be a separate feature to be added to UDPConn (which I would highly support) since this feature is about enabling batch-sending and batch-receiving of UDP packets?

Also, GSO segmentation may not make sense if you are dealing with variable-sized UDP packets (you would need to pad them with junk bytes to align to the segmentation size and hope receiving end knows how to extract the actual data ignoring the said padding).

@neild
Copy link
Contributor

neild commented Dec 1, 2023

The feature in this issue is about efficient batch sending/receiving of UDP packets. One way to do that is sendmmsg/recvmmsg. Another is GSO/GRO. And the two approaches can of course be combined at the same time. (At the time we wrote this proposal, either GSO/GRO didn't exist, or Brad and I weren't aware of it.)

I think that to the extent possible, the net package should abstract away the details.

A perfect abstraction isn't possible, since GSO/GRO require a contiguous buffer (and GSO in particular requires, I believe, all sent packets to be the same size), but if we're adding new API surface to net to allow for efficient transmission of packets, it should work with GSO/GRO without requiring the user to use syscall or x/net. You can use sendmmsg/recvmmsg/GSO/GRO with x/net today if you want; the point of adding it to the net package is to offer a good abstraction over the low level interfaces.

Either way, shouldn't GSO/GRO be a separate feature to be added to UDPConn

Possibly, but if so, I think we should understand what that feature looks like before we commit to the API in this proposal, to make sure the two work together cleanly. And implement both sendmmsg/recvmmsg and GSO/GRO to verify they work together before committing to either implementation alone.

@aimuz
Copy link
Contributor

aimuz commented Dec 4, 2023

Thank you very much, I'm trying to work on this, and as far as I can tell from the information I've consulted (which may not be comprehensive)

UDP GSO was added to the Linux kernel at v4.18.
UDP GRO was added to the Linux kernel at v5.0.

If you need to use GSO/GRO, the Linux kernel cannot be lower than these versions.

How should ReadUDPMsgs and WriteUDPMsgs be handled in kernels below these versions?

Also GSO/GRO can improve ReadMsgUDP and WriteMsgUDP as well.


I had some problems doing this when I tried to run GOOS=linux GOARCH=amd64 ./mkall.sh after.

A lot of constants were removed from ztypes_linux_amd64.go, some of which were added to the zerrors_linux_amd64.go file.

And some of the type fields in ztypes_linux_amd64.go have changed. I don't know if this is normal.

I am running it through Docker and here is how I am executing it.

docker run -v $(pwd):/work/go -it golang bash
cd /work/go/src/syscall
GOOS=linux GOARCH=amd64 ./mkall.sh

Kernel versions are as follows

uname -a
Linux bfca687b639d 6.5.10-orbstack-00110-gbcfe04c86d2f #1 SMP Fri Nov  3 10:20:37 UTC 2023 x86_64 GNU/Linux

@neild
Copy link
Contributor

neild commented Dec 4, 2023

I believe that GSO/GRO can't be relied on even in recent Linux kernels; it's possible for them to be disabled. I'm afraid I don't know the details, however.

There needs to be a graceful fallback when GSO/GRO aren't available. One possibility would be for the net package to provide a general-purpose interface for sending/receiving a batch of messages, which uses GRO/GSO when available and falls back to a slower path otherwise. Another possibility would be for the net package to expose whether GRO/GSO are available and offer separate APIs.

A single API for batch send with a graceful fallback would be simplest to use, but perhaps exposing to the user whether a contiguous buffer is required or not would be better. I'm not certain.

I think the real work here is going to be in figuring out what the API looks like and how fallback works.

@tomalaci
Copy link

tomalaci commented Dec 5, 2023

I would like API for GSO/GRO to be opt-in since it being always forced could make certain UDP workloads un-optimal. Most notably handling batches of packets with varying sizes. Correct me if I am wrong but, AFAIK, GSO segmentation requires each packet to be same size in the super-buffer.

@aimuz
Copy link
Contributor

aimuz commented Dec 6, 2023

During my verification, I learnt that GSO is for sockfd, after setting gsosize for sockfd, all subsequent sendmmsg and sendmsg will be affected by gsosize.

syscall.SetsockoptInt(int(fd), 17, 103, 20)
conn.WriteToUDPAddrPort(make([]byte, 100), raddr.AddrPort())

WriteToUDPAddrPort passes no control data

11:39:27.752936 IP (tos 0x0, ttl 62, id 1783, offset 0, flags [none], proto UDP (17), length 48)
    192.168.31.158.35083 > 114.114.114.114.53: 0 [0q] (20)
11:39:27.752965 IP (tos 0x0, ttl 62, id 29786, offset 0, flags [none], proto UDP (17), length 48)
    192.168.31.158.35083 > 114.114.114.114.53: 0 [0q] (20)
11:39:27.753005 IP (tos 0x0, ttl 62, id 32140, offset 0, flags [none], proto UDP (17), length 48)
    192.168.31.158.35083 > 114.114.114.114.53: 0 [0q] (20)
11:39:27.753024 IP (tos 0x0, ttl 62, id 38732, offset 0, flags [none], proto UDP (17), length 48)
    192.168.31.158.35083 > 114.114.114.114.53: 0 [0q] (20)
11:39:27.753039 IP (tos 0x0, ttl 62, id 45095, offset 0, flags [none], proto UDP (17), length 48)
    192.168.31.158.35083 > 114.114.114.114.53: 0 [0q] (20)

If on a platform with GSO, turning on GSO means that all writes need to be adjusted.

@neild
Copy link
Contributor

neild commented Dec 13, 2023

If on a platform with GSO, turning on GSO means that all writes need to be adjusted.

I believe it's also possible to set GSO on a per-sendmsg basis by setting SOL_UDP/UDP_SEGMENT in the cmsg.

@aimuz
Copy link
Contributor

aimuz commented Dec 14, 2023

I believe it's also possible to set GSO on a per-sendmsg basis by setting SOL_UDP/UDP_SEGMENT in the cmsg.

You are right that setting UDP_SEGMENT via cmsg works, so there are a couple of problems that follow.

This patchset implements GSO for UDP. A process can concatenate and
submit multiple datagrams to the same destination in one send call
by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
or passing an analogous cmsg at send time.
https://lwn.net/Articles/752184/

  1. how to get whether GSO/GRO is supported on the current system, judging by the system kernel, this may work (GSO/GRO is enabled by default on the latest systems), but GSO/GRO can be disabled. Perhaps a better way to verify this is needed.

  2. how to combine sendmmsg/recvmmsg and GSO/GRO. One idea I have is to use sendmmsg with GSO to combine multiple identical UDP messages (flags, buffers) into one Msghdr, and then set the control according to the buffer length.

@bt90
Copy link

bt90 commented Dec 14, 2023

GSO can be tricky to implement in practice. For reference: quic-go/quic-go#3911

@aimuz
Copy link
Contributor

aimuz commented Dec 15, 2023

recvmmsg needs to pass a flags, do we need to provide it in this method?

https://man7.org/linux/man-pages/man2/recvmmsg.2.html

       int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen,
                    int flags, struct timespec *timeout);

The flags here are different from the flags in the UDPMessage, which need to be passed in.

The flags argument contains flags ORed together.  The flags are
       the same as documented for [recvmsg(2)](https://man7.org/linux/man-pages/man2/recvmsg.2.html), with the following
       addition:

       MSG_WAITFORONE (since Linux 2.6.34)
              Turns on MSG_DONTWAIT after the first message has been
              received.
func (c *UDPConn) ReadUDPMsgs(ms []UDPMsg, flags int) (msgsRead int, err error)

@neild
Copy link
Contributor

neild commented Dec 15, 2023

We don't pass flags for other net.UDPConn methods. Is there a reason why we would do so here?

I think the MSG_WAITFORONE flag is not going to be relevant here. Unless I'm misremembering something (which is quite possible), the runtime doesn't want syscalls to block (it should only call recvmsg after the poller indicates the socket is readable), so either the socket is going to be O_NONBLOCK or we're going to want to always pass MSG_WAITFORDONE.

@anacrolix
Copy link
Contributor

There's an original issue for this at #3661.

I've been using https://pkg.go.dev/github.com/anacrolix/mmsg@v1.0.1/socket for this for the last 6-7 years, which exposes an API that uses recvmmsg/sendmmsg if they're available otherwise falls back to the scalar operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Accepted
Development

No branches or pull requests