Path MTU handling - suggested solution for IPv6 #3119

vanrein · 2022-05-20T07:58:47Z

Description

Larger SIP frames get dropped when sent over UDP and IPv6. The sending side has MTU 1500 and the receiving side has MTU 1492. This is an IPv6-only setup, so this is problematic. Also, pulling down the MTU of a general server for a tunneled peer would smear like an oil stain. The suggestion to fallback on TCP feels like a hack.

Troubleshooting

Reproduction

Send a SIP message to a network interface with a lower MTU than the submitted frame size.

Debugging Data

None, transmission works fine.

Log Messages

None, transmission works fine.

SIP Traffic

Irrelevant.

Code Investigation

I explored the Kamailio source code for MTU handling.

kamailio/src/core/udp_server.c

Lines 340 to 349 in 81265e4

    
    #if defined (__OS_linux)
        /* if pmtu_discovery=1 then set DF bit and do Path MTU discovery
         * disabled by default */
        optval= (pmtu_discovery) ? IP_PMTUDISC_DO : IP_PMTUDISC_DONT;
        if(setsockopt(sock_info->socket, IPPROTO_IP, IP_MTU_DISCOVER,
                (void*)&optval, sizeof(optval)) ==-1){
            LM_ERR("setsockopt: %s\n", strerror(errno));
            goto error;
        }
    #endif

- This defaults to switching off the Don't Fragment bit on IPv4 frames, to allow them to be broken up downstream.
- This should check for an IPv4 socket, because (at least) Linux has a separate symbolic value IPV6_MTU_DISCOVER.
- Note that IPv6 is never broken up by routers; they always return the ICMPv6 message Packet Too Big.
- Note that IPv6 Path MTU discovery by the kernel is automatic only for connected UDP sockets.
- Since Kamailio uses unconnected UDP sockets, Path MTU issues cause packets to be dropped.
- Note that such packet drops depend on a somewhat dynamic SIP message size, causing seemingly random behaviour.
- I therefore suggest that Kamailio is lacking in some of its IPv6 logic.

kamailio/src/core/udp_server.c

Lines 331 to 339 in 81265e4

    
    #if defined (__OS_linux) && defined(UDP_ERRORS)
        optval=1;
        /* enable error receiving on unconnected sockets */
        if(setsockopt(sock_info->socket, SOL_IP, IP_RECVERR,
                        (void*)&optval, sizeof(optval)) ==-1){
            LM_ERR("setsockopt: %s\n", strerror(errno));
            goto error;
        }
    #endif

- This enables the reporting of ICMP errors, including Path MTU but also other useful things like Host Unreachable.
- This should check for an IPv4 socket, because (at least) Linux has a separate symbolic value IPV6_RECVERR.
- Note that the same behaviour is available on BSD, so it need not be specific to Linux.
- Note that handling errors is done with recvmsg() with the flag MSG_ERRQUEUE, which is not used in Kamailio.
- Note that the ancillary data from recvmsg() holds data to cleverly handle ICMP or ICMPv6 responses.
- I therefore suggest that Kamailio is lacking a fair chunk of its ICMP and ICMPv6 logic.
- I expect that this may bring efficiency gains due to faster closing of transactions.

Possible Solutions

I have been thinking about ways to lower MTU values only for some peers.

Using connected sockets might work, possibly as an alternative when Path MTU problems arise. It might not scale however.
Every socket could have an extra sending socket set to a lower MTU. The use of SO_REUSEADDR seems to allow for that.
Before falling back on an extra socket, the desired MTU could be set. Alternatively, as for IPv6, an MTU of 1280 might be considered in many cases:
- If you can carry 6 out of 8 coffee mugs from the kitchen, you need to walk twice, and 4+4 is easier than 6+2
- Anything over the MTU splits into at least 2 frames
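- The headers added are a 40-byte IPv6 header and an 8-byte Fragment Extension Header, on every fragment
- Each 1280-byte frame therefore carries 1280-40-8 = 1232 bytes of payload, so 2 frames can hold a single original frame of 40+2*1232 = 2504 bytes
- Up to an original size of 2504 bytes, breakup into 1280 byte frames will be fine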
- 3 frames are going to be useful for larger frames, then a similar style can be used
- This might help to decide whether 1280 or a higher MTU is most desirable
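The arithmetic in this list can be checked with a small C sketch. The function names are hypothetical, and the sketch assumes a plain IPv6 header with no other extension headers. It charges the 8-byte Fragment header to every fragment and rounds the payload of all but the last fragment down to a multiple of 8, as IPv6 fragmentation requires; with that accounting, two 1280-byte fragments carry an original packet of up to 2504 bytes.

```c
#include <stddef.h>

#define IPV6_HDR_LEN        40   /* fixed IPv6 header */
#define IPV6_FRAG_HDR_LEN   8    /* Fragment extension header */
#define MIN_IPV6_LINK_MTU   1280 /* minimum MTU every IPv6 link supports */

/* Number of fragments needed to send an IPv6 packet of packet_len bytes
 * (including its 40-byte header) over a path with the given MTU.
 * Returns 0 for invalid input. */
size_t ipv6_fragments_needed(size_t packet_len, size_t mtu)
{
	size_t per_frag, per_frag_aligned, remaining, n = 1;

	if (mtu < MIN_IPV6_LINK_MTU || packet_len < IPV6_HDR_LEN)
		return 0;
	if (packet_len <= mtu)
		return 1; /* fits unfragmented */

	per_frag = mtu - IPV6_HDR_LEN - IPV6_FRAG_HDR_LEN;
	per_frag_aligned = per_frag & ~(size_t)7; /* non-final fragments */
	remaining = packet_len - IPV6_HDR_LEN;    /* fragmentable part */

	while (remaining > per_frag) {
		remaining -= per_frag_aligned;
		n++;
	}
	return n;
}

/* Largest original packet (header included) that fits in the given
 * number of fragments at the given MTU. Returns 0 for invalid input. */
size_t ipv6_max_packet_for(size_t fragments, size_t mtu)
{
	size_t per_frag, per_frag_aligned;

	if (mtu < MIN_IPV6_LINK_MTU || fragments == 0)
		return 0;
	if (fragments == 1)
		return mtu;

	per_frag = mtu - IPV6_HDR_LEN - IPV6_FRAG_HDR_LEN;
	per_frag_aligned = per_frag & ~(size_t)7;
	return IPV6_HDR_LEN + (fragments - 1) * per_frag_aligned + per_frag;
}
```

Comparing `ipv6_fragments_needed(len, 1280)` with the same call for a higher candidate MTU shows for which message sizes the conservative 1280 setting costs an extra frame.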

Additional Information

Kamailio 5.2.1 from Debian stable

Operating System:

Linux Debian stable on kernel 4.19.0

(I don't suppose it matters, this code has been around for ages. I used permalinks for stable reference).

miconda · 2022-05-20T08:08:07Z

> The suggestion to fallback on TCP feels like a hack.

It is not a hack, it is in the specs, see https://datatracker.ietf.org/doc/html/rfc3261#section-18.1.1:

18.1.1 Sending Requests

   The client side of the transport layer is responsible for sending the
   request and receiving responses.  The user of the transport layer
   passes the client transport the request, an IP address, port,
   transport, and possibly TTL for multicast destinations.

   If a request is within 200 bytes of the path MTU, or if it is larger
   than 1300 bytes and the path MTU is unknown, the request MUST be sent
   using an RFC 2914 congestion controlled transport protocol, such
   as TCP.

If someone wants to implement your suggestions, they are more than welcome, but it should be controlled by a parameter.

Should nobody commit to implement this feature request in one month or so, the track item may be closed.
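For reference, the RFC 3261 rule quoted above could be sketched in C roughly as follows. The function and macro names are hypothetical, and reading "within 200 bytes of the path MTU" as "more than path MTU minus 200" is an interpretation, not something the RFC text spells out.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIP_MTU_MARGIN        200  /* "within 200 bytes of the path MTU" */
#define SIP_UNKNOWN_MTU_LIMIT 1300 /* limit when the path MTU is unknown */

/* Hypothetical helper: true when RFC 3261 section 18.1.1 requires a
 * congestion controlled transport such as TCP for this request.
 * A path_mtu of 0 means "path MTU unknown". */
bool sip_needs_congestion_controlled_transport(size_t request_len,
		size_t path_mtu)
{
	if (path_mtu == 0)
		return request_len > SIP_UNKNOWN_MTU_LIMIT;
	/* assumed boundary: a request exactly 200 bytes below the MTU is OK */
	return request_len + SIP_MTU_MARGIN > path_mtu;
}
```

With the 1492-byte MTU from the report, this sketch would push any request above 1292 bytes to TCP.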

vanrein · 2022-05-20T08:36:14Z

Thanks for pointing out that part of RFC 3261. I wonder how this would work with a UDP-registered phone, and an assumption that it will have taken precautions to also be reachable over TCP. Not an issue in IPv4 if you Do Fragment, but possibly breaking for IPv6 reachability. (That's why I reported it as a bug, not a feature request.)

miconda · 2022-05-20T08:55:18Z

IIRC, TCP is mandatory, but I can say that many phones don't do it. For more details on UDP-to-TCP specs, you may ask on the IETF SIP Core group or SIP Implementors mailing lists.

I labeled it feature request because it is not a bug in the existing C code per se, but something missing as an implementation.

After all, there are many IETF RFCs that are not implemented in Kamailio; the lack of them is not a bug, but a missing feature. We work based on community collaboration: usually whoever needs a feature implements it and makes PRs to get it merged, or helps to implement it (e.g., collaborates with other developers).

We also close the feature requests that don't get interest from developers, because the tracker is open for everyone to submit and it will get filled with requests which will make it hard to focus on actual issues after a while.

vanrein · 2022-05-28T07:03:37Z

Fair enough. I will do what I can to help, but may not collect the confidence to do this in Kamailio, at least not without backup from a core developer.

I have started some tooling to measure these problems, available here.

My observation that veth ethernet pairs (on Linux) could work around the problem is now falsified.
My next hope is to use SO_REUSEADDR or SO_REUSEPORT to open a secondary socket with a lower MTU, and fall back on that for resends.

I will continue to do these experiments outside of Kamailio, to avoid its complexity. If this works, it is most likely after integration with immediate resends.

vanrein · 2022-05-28T10:07:59Z

I have been exploring the code to find a place to send with a lower MTU than the MRU for a socket. That is probably more efficient than learning about Path MTU on every sending attempt. Feedback welcomed!

I think an extension to struct socket_info with socket as default socket and a new field socket_mtu as an outbound override could help. It could be set up when udp_mtu is set (or maybe when listen has an extra mtu 1234 parameter). Cleanup would recognise socket != socket_mtu as a special case in which the secondary socket needs cleanup. The extra field could be set up in udp_init() and trivially copied in the other core/xxx_server.c variants, so sending can always use socket_mtu instead of the receiving side, socket itself, with its unbounded MTU.

My work now is to construct dual sockets, set different MTUs and see this idea work. I will do that in the MTU games repo.

Notes:

- It is not currently meaningful to configure udp_mtu but not udp_mtu_try_proto. Indeed, the code confirms ignoring that situation. Precisely this configuration could give rise to transmission over a secondary socket with reduced MTU. This code might therefore set a flag FL_MTU_UDP_FB (that is part of the FL_MTU_FB_MASK).
- This flag is used to determine the sending socket for a message, via di.proto and di.send_sock and delivered into send_info->proto and send_info->send_sock. An extra case for FL_MTU_UDP_FB could be added here.
- The routine get_send_socket2() finds a struct socket_info to send over. Assuming a non-forced socket, for single-homed systems this is the static value in sendipv6; for multi-homed systems it is determined with get_out_socket(), which uses find_si() to locate a configured socket.
- It would be possible to have socket for its current use, but for UDP allow an alternative socket socket_mtu to be set to the same coordinates (in the remainder of the structure) but a lower MTU value, namely the udp_mtu value. Sending would prefer this socket (if set, that is, if >= 0) but receiving would continue to use the default MRU (namely the interface MTU). Being conservative in what we send, liberal in what we accept.
- This additional socket_mtu element would be added in udp_init() when udp_mtu is set.
- Future options mtu 1234 after a listen declaration may set a socket-specific MTU (and leave the MRU unchanged).
- I doubt whether such conservative-low MTU knowledge would benefit TCP and TLS. If it does, then this approach could be replicated. But more likely the interactive nature of these protocols benefits from an explicit in-situ learning process.
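As a rough illustration of the proposal in these notes, the send-side selection could look like the sketch below. The struct is a stripped-down stand-in, not Kamailio's real struct socket_info, and `send_fd_for` is a hypothetical name; only the fields socket and socket_mtu come from the notes.

```c
/* Stand-in for the two relevant fields; the real struct socket_info
 * in Kamailio has many more members. */
struct socket_info_sketch {
	int socket;     /* default socket, used for receiving */
	int socket_mtu; /* optional reduced-MTU sending socket, -1 if unset */
};

/* Hypothetical helper: prefer the reduced-MTU socket for sending when it
 * has been set up (>= 0), otherwise fall back on the default socket. */
int send_fd_for(const struct socket_info_sketch *si)
{
	return (si->socket_mtu >= 0) ? si->socket_mtu : si->socket;
}
```

Receiving would keep using socket, so the MRU stays at the interface MTU while only outbound traffic is constrained.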

miconda · 2022-06-10T11:38:19Z

@vanrein: thanks for tackling this one further! I am rather busy these days and I am not sure if any other developer wants to look deeper into it. My suggestion is to make a pull request with the changes you would consider doing, ideally controlling the new behaviour with a parameter. It might be easier to understand what changes are proposed, review them and merge if all is OK.

vanrein · 2022-06-11T07:48:08Z

I got one thing wrong, and that saves bundles of work. Here's from experimental code,

/*
 * Confusingly, ip(7) states
 *
 * IP_MTU (since Linux 2.2)
 *    Retrieve the current known path MTU of the current socket.
 *    Returns an integer.  IP_MTU is valid only for getsockopt(2) and
 *    can be employed only when the socket has been connected.
 *
 * Similarly, ipv6(7) states
 *
 * IPV6_MTU
 *     getsockopt(): Retrieve the current known path MTU of the current
 *     socket.  Valid only when the socket has been connected.  Returns
 *     an integer.
 *     
 *     setsockopt(): Set the MTU to be used for the socket.  The MTU
 *     is limited by the device MTU or the path MTU when path MTU
 *     discovery is enabled.  Argument is a pointer to integer.
 *
 * This suggests that IP_MTU is a socket property.  However, it makes
 * more sense as a shared global property, which indeed seems to apply:
 *
 * The ipv6(7) entry for IPV6_MTU_DISCOVER references IP_MTU_DISCOVER;
 * the ip(7) entry for IP_MTU_DISCOVER states
 *
 * IP_MTU_DISCOVER (since Linux 2.2)
 *    When PMTU discovery is enabled, the kernel automatically keeps track
 *    of the path MTU per destination host.  When it is connected to a
 *    specific peer with connect(2), the currently known path MTU can be
 *    retrieved conveniently using the IP_MTU socket option (e.g., after
 *    an EMSGSIZE error occurred).  The path MTU may change over time.
 *    For connectionless sockets with many destinations, the new MTU for a
 *    given destination can also be accessed using the error queue (see
 *    IP_RECVERR).  A new error will be queued for every incoming MTU update.
 *
 *    While MTU discovery is in progress, initial packets from datagram
 *    sockets may be dropped.  Applications using UDP should be aware
 *    of this and not take it into account for their packet retransmit strategy.
 *
 * Retransmission is common in UDP applications.  Ideally, IP_RECVERR or
 * IPV6_RECVERR is used to resend immediately, without waiting for timers to
 * expire; and without limiting the number of Path MTU lessons learnt to the
 * number of timer rounds.
 *
 * For IPv6, where fragmentation is required to accommodate the Path MTU, and
 * for unconnected applications, the lessons from Path MTU discovery are of
 * major impact on their behaviour; we should always let the socket fragment
 * frames when so desired, so:
 *
 * IP_MTU_DISCOVER (since Linux 2.2)
 *    IP_PMTUDISC_WANT will fragment a datagram if needed according to the
 *    path MTU, [IPv4-only: or will set the don't-fragment flag otherwise].
 *
 *    Path MTU discovery value   Meaning
 *    IP_PMTUDISC_WANT           Use per-route settings.
 *    IP_PMTUDISC_DONT           Never do Path MTU Discovery.
 *    IP_PMTUDISC_DO             Always do Path MTU Discovery.
 *    IP_PMTUDISC_PROBE          Set DF but ignore Path MTU.
 *
 * IPv6 changes the names to `IPV6_MTU_DISCOVER` and `IPV6_PMTUDISC_WANT`.
 */

I'm documenting it here, so that the knowledge is not lost on the project. This is difficult stuff.

It would seem that Path MTU knowledge is not maintained per socket (which would benefit locality and proper cleanup of the knowledge) but as a global kernel property for the route (which benefits reuse of the knowledge; in other words, a useful form of caching).
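One way to peek at what the kernel currently knows for a destination is to connect a throwaway UDP socket and read IP_MTU or IPV6_MTU, as the man page excerpts describe. The following is only a sketch under that reading; `query_path_mtu` is a hypothetical name, and the value returned is whatever the kernel has cached for the route at that moment.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: return the kernel's currently known path MTU
 * towards peer, or -1 on failure or for an unsupported address family.
 * Uses a temporary connected UDP socket; no datagram is sent. */
int query_path_mtu(const struct sockaddr *peer, socklen_t peer_len)
{
	int level, optname, sock;
	int mtu = -1;
	socklen_t optlen = sizeof(mtu);

	switch (peer->sa_family) {
	case AF_INET:
		level = IPPROTO_IP;
		optname = IP_MTU;
		break;
	case AF_INET6:
		level = IPPROTO_IPV6;
		optname = IPV6_MTU;
		break;
	default:
		return -1;
	}

	sock = socket(peer->sa_family, SOCK_DGRAM, 0);
	if (sock == -1)
		return -1;

	/* IP_MTU / IPV6_MTU are only readable on a connected socket */
	if (connect(sock, peer, peer_len) == -1
			|| getsockopt(sock, level, optname, &mtu, &optlen) == -1)
		mtu = -1;

	close(sock);
	return mtu;
}
```

Because the knowledge appears to be kept per route rather than per socket, a lower value learnt through traffic on one socket should also show up when queried this way from another.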

Conclusions for Kamailio on IPv6

- The idea to set different MTU values for two sockets failed for unconnected sockets. And to have multiple MTUs you need unconnected sockets.
- This means that the idea of a secondary socket is not going to work in Kamailio either.
- It does seem to be true that the kernel keeps track of Path MTU if asked.
- For IPv6, not learning from Path MTU feedback (ICMPv6 Packet Too Big) always leads to the same effect; once a frame is dropped it is always lost, regardless of resends. Kamailio comes across as unstable, especially because SIP message sizes vary, so that some things work while others fail.
- Note that it never causes packet drops if Path MTU discovery is enabled for IPv6; there is just a reason for fragmentation, which at most is an efficiency issue. Note that IPv6 has no "Don't Fragment" option; this behaviour is always active.
- And it means that it can only add value to enable Path MTU discovery for IPv6. Even if sysctl() could make such a setting, Kamailio stability demands this for IPv6, AFAIK.
- Path MTU discovery for IPv4 continues to be an option and a matter of taste, unlike for IPv6.

Perfection for Kamailio over IPv6

- The first packet sent to an IPv6 host may still be dropped, answered by an ICMPv6 Packet Too Big message. This may happen again whenever the kernel has dropped its Path MTU knowledge. Some SIP processing is an hour apart, and may cause this dropping of knowledge.
- Use of IPV6_RECVERR enables immediate resending, with improved Path MTU knowledge. This involves an extra polling mechanism, which is beyond my reach. This also links into the tm logic and goes beyond my reach. For sl replies there will probably be a 2nd round if Path MTU problems arise, because the reply was sent-then-forgotten, and needs to wait for another round.

Summary:

- For IPv4, DF is an option; for IPv6 it is always active
- This makes pmtu_discovery an IPv4-only option
- This means that we should set IPV6_MTU_DISCOVER to IPV6_PMTUDISC_WANT
- Unconnected UDP sockets can now learn from ICMPv6 "Packet too Big"
- As a result, hitting a Path MTU upper bound is a learning process
- This should stop consistent SIP packet drops due to Path MTU
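Put together, the policy argued for here might be sketched as a family-aware variant of the udp_server.c snippet from the report. This is an illustration only, not the code that was eventually merged: `set_pmtu_policy` is a hypothetical name, and passing the address family as a separate argument is an assumption about how a caller would know it.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: apply Path MTU discovery settings per family.
 * IPv4 keeps honouring the pmtu_discovery switch (DF bit on or off);
 * IPv6 always uses IPV6_PMTUDISC_WANT so the kernel fragments to the
 * learnt Path MTU. Returns 0 on success, -1 on error (errno set). */
int set_pmtu_policy(int sock, int family, int pmtu_discovery)
{
	int optval;

	if (family == AF_INET6) {
		optval = IPV6_PMTUDISC_WANT;
		return setsockopt(sock, IPPROTO_IPV6, IPV6_MTU_DISCOVER,
				&optval, sizeof(optval));
	}
	if (family == AF_INET) {
		optval = pmtu_discovery ? IP_PMTUDISC_DO : IP_PMTUDISC_DONT;
		return setsockopt(sock, IPPROTO_IP, IP_MTU_DISCOVER,
				&optval, sizeof(optval));
	}
	errno = EAFNOSUPPORT;
	return -1;
}
```

Keeping the IPv4 branch unchanged means existing configurations behave as before, while IPv6 sockets no longer depend on the IPv4-oriented switch.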

henningw · 2022-08-21T12:57:09Z

@vanrein the PR was merged, so I think there should be a good improvement for IPv6 now available in the upcoming version. Is there anything else that you want to consider in this feature request?

miconda · 2022-08-26T11:20:10Z

Probably this should be closed, if there is anything new to be discussed/added, a new item can be created.

vanrein · 2022-08-26T11:53:28Z

It looks like the _WANT variant was also added. Great job, thanks!

And yes, I will close it now.

miconda added the feature-request label May 20, 2022

miconda changed the title ~~Path MTU handling; implicit IPv4 assumption, suggested solution for IPv6~~ Path MTU handling - suggested solution for IPv6 May 20, 2022

vanrein mentioned this issue May 30, 2022

Add IPv6 support LubosD/twinkle#6

Open

vanrein mentioned this issue Jun 11, 2022

Path MTU discovery for IPv6 #3141

Closed

10 tasks

vanrein closed this as completed Aug 26, 2022
