Commit 81cfd60
krisman authored and gregkh committed
udp: Force compute_score to always inline
[ Upstream commit b80a95c ]

Back in 2024 I reported a 7-12% regression on an iperf3 UDP loopback
throughput test that we traced to the extra overhead of calling
compute_score in two places, introduced by commit f0ea27e ("udp:
re-score reuseport groups when connected sockets are present"). At the
time, I pointed out the overhead was caused by the multiple calls,
associated with cpu-specific mitigations, and merged commit 50aee97
("udp: Avoid call to compute_score on multiple sites") to jump back
explicitly, to force the rescore call in a single place.

Recently though, we got another regression report against a newer
distro version, which a team colleague traced back to the same root
cause. Turns out that once we updated to gcc-13, the compiler got smart
enough to unroll the loop, undoing my previous mitigation. Let's bite
the bullet and __always_inline compute_score on both ipv4 and ipv6 to
prevent gcc from de-optimizing it again in the future. These functions
are only called in two places each, udpX_lib_lookup1 and
udpX_lib_lookup2, so the extra size shouldn't be a problem and it is hot
enough to be very visible in profiles. In fact, with gcc-13, forcing the
inline will prevent gcc from unrolling the fix from commit 50aee97, so
we don't end up increasing udpX_lib_lookup2 at all.

I haven't recollected the results myself, as I don't have access to the
machine at the moment. But the same colleague reported a 4.67%
improvement with this patch in the loopback benchmark, solving the
regression report within noise margins.

Eric Dumazet reported no size change to vmlinux when built with clang.
I report the same also with gcc-13:

  scripts/bloat-o-meter vmlinux vmlinux-inline
  add/remove: 0/2 grow/shrink: 4/0 up/down: 616/-416 (200)
  Function                     old     new   delta
  udp6_lib_lookup2             762     949    +187
  __udp6_lib_lookup            810     975    +165
  udp4_lib_lookup2             757     906    +149
  __udp4_lib_lookup            871     986    +115
  __pfx_compute_score           32       -     -32
  compute_score                384       -    -384
  Total: Before=35011784, After=35011984, chg +0.00%

Fixes: 50aee97 ("udp: Avoid call to compute_score on multiple sites")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://patch.msgid.link/20260410155936.654915-1-krisman@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
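For readers unfamiliar with forced inlining, here is a minimal userspace sketch of the idea in the message above: a small scoring helper marked always-inline so that each of its two callers gets its own copy of the body instead of an out-of-line call. Everything named `demo_*` or `DEMO_*` is hypothetical and heavily simplified. It is not the kernel's compute_score, and the macro only approximates the kernel's __always_inline, assuming GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: GCC or Clang. Approximates the kernel's __always_inline. */
#define DEMO_ALWAYS_INLINE inline __attribute__((__always_inline__))

/* Hypothetical, simplified socket record; not the kernel's struct sock. */
struct demo_sock {
	unsigned short port;
	int bound_dev;	/* 0 means "not bound to a device" */
};

/* Hypothetical scorer: -1 on mismatch, higher means a better match. */
static DEMO_ALWAYS_INLINE int
demo_score(const struct demo_sock *sk, unsigned short port, int dif)
{
	int score;

	if (sk->port != port)
		return -1;
	score = 1;
	if (sk->bound_dev) {
		if (sk->bound_dev != dif)
			return -1;
		score += 2;	/* a device-bound match is more specific */
	}
	return score;
}

/* First call site: return the first entry that matches at all. */
static const struct demo_sock *
demo_lookup_first(const struct demo_sock *tbl, size_t n,
		  unsigned short port, int dif)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (demo_score(&tbl[i], port, dif) >= 0)
			return &tbl[i];
	return NULL;
}

/* Second call site: return the highest-scoring entry. */
static const struct demo_sock *
demo_lookup_best(const struct demo_sock *tbl, size_t n,
		 unsigned short port, int dif)
{
	const struct demo_sock *best = NULL;
	int badness = 0;

	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int score = demo_score(&tbl[i], port, dif);

		if (score > badness) {
			badness = score;
			best = &tbl[i];
		}
	}
	return best;
}
```

Because the helper is folded into both callers, it disappears as a standalone symbol and the callers grow, which is the same shape as the bloat-o-meter output quoted in the message.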
1 parent 1218bfe commit 81cfd60

2 files changed

Lines changed: 13 additions & 12 deletions

net/ipv4/udp.c

Lines changed: 6 additions & 6 deletions
@@ -365,10 +365,10 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, hash2_nulladdr);
 }
 
-static int compute_score(struct sock *sk, const struct net *net,
-			 __be32 saddr, __be16 sport,
-			 __be32 daddr, unsigned short hnum,
-			 int dif, int sdif)
+static __always_inline int
+compute_score(struct sock *sk, const struct net *net,
+	      __be32 saddr, __be16 sport, __be32 daddr,
+	      unsigned short hnum, int dif, int sdif)
 {
 	int score;
 	struct inet_sock *inet;
@@ -508,8 +508,8 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
 				continue;
 
 			/* compute_score is too long of a function to be
-			 * inlined, and calling it again here yields
-			 * measurable overhead for some
+			 * inlined twice here, and calling it uninlined
+			 * here yields measurable overhead for some
 			 * workloads. Work around it by jumping
 			 * backwards to rescore 'result'.
 			 */

net/ipv6/udp.c

Lines changed: 7 additions & 6 deletions
@@ -127,10 +127,11 @@ void udp_v6_rehash(struct sock *sk)
 	udp_lib_rehash(sk, new_hash, new_hash4);
 }
 
-static int compute_score(struct sock *sk, const struct net *net,
-			 const struct in6_addr *saddr, __be16 sport,
-			 const struct in6_addr *daddr, unsigned short hnum,
-			 int dif, int sdif)
+static __always_inline int
+compute_score(struct sock *sk, const struct net *net,
+	      const struct in6_addr *saddr, __be16 sport,
+	      const struct in6_addr *daddr, unsigned short hnum,
+	      int dif, int sdif)
 {
 	int bound_dev_if, score;
 	struct inet_sock *inet;
@@ -260,8 +261,8 @@ static struct sock *udp6_lib_lookup2(const struct net *net,
 				continue;
 
 			/* compute_score is too long of a function to be
-			 * inlined, and calling it again here yields
-			 * measurable overhead for some
+			 * inlined twice here, and calling it uninlined
+			 * here yields measurable overhead for some
 			 * workloads. Work around it by jumping
 			 * backwards to rescore 'result'.
 			 */
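The "jumping backwards to rescore 'result'" comment in both hunks describes a control-flow trick that is easier to see in isolation. The sketch below is a hypothetical userspace illustration, not the kernel's lookup code: the `demo_*` names, the `redirect` field (standing in for a group selector that may pick a different entry), and the call counter are all assumptions made for the example. It shows how a backward `goto` lets the loop score a second object through the same single call site rather than adding a second call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry; 'redirect' stands in for a group selector's pick. */
struct demo_entry {
	int weight;				/* what the scorer returns */
	const struct demo_entry *redirect;	/* NULL means "no other pick" */
};

/* Counts scorer invocations so the behaviour can be observed. */
static int demo_score_calls;

static int demo_compute_score(const struct demo_entry *e)
{
	demo_score_calls++;
	return e->weight;
}

static const struct demo_entry *
demo_lookup(const struct demo_entry *tbl, size_t n)
{
	const struct demo_entry *result = NULL;
	int badness = 0;

	for (size_t i = 0; i < n; i++) {
		const struct demo_entry *cand = &tbl[i];
		bool need_rescore = false;
		int score;
rescore:
		/* The only call site of the scorer inside the loop. */
		score = demo_compute_score(need_rescore ? result : cand);
		if (score > badness) {
			badness = score;

			/* Second pass for this entry: nothing more to do. */
			if (need_rescore)
				continue;

			result = cand->redirect ? cand->redirect : cand;
			if (result != cand) {
				/* A different entry was picked: jump back
				 * and score it through the same call site.
				 */
				need_rescore = true;
				goto rescore;
			}
		}
	}
	return result;
}
```

With a single textual call site, forcing the scorer inline adds only one copy of its body to the loop, which is consistent with the message's point that the workaround and the forced inline complement each other.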
