rtnl.listener dies on message burst #184

f00b4r0 · 2023-12-29T18:05:30Z

I noticed that the rtnl listener callback I set up in a ucode script appears to randomly "die", without any error message, while the rest of the script keeps operating normally.

After a bit of digging, I think I have tracked it down to a resource exhaustion of some sort. The bug can be reproduced using the attached ucode script, which sets up a simple listener on RTNLGRP_NEIGH that prints the received messages.

Everything goes well until the neigh garbage collector kicks in and deletes a large number of neigh entries, resulting in a "large" number (hundreds) of messages being delivered. The script will typically appear to hang after printing anywhere between none and the first few of the delete messages ("cmd": 29, i.e. RTM_DELNEIGH), with no error whatsoever.

On a system where the neigh GC is set like so:

net.ipv4.neigh.default.gc_thresh1=512
net.ipv4.neigh.default.gc_thresh2=2048
net.ipv4.neigh.default.gc_thresh3=4096

(values fairly typical for a busy router), the garbage collector may delete hundreds of entries in one go when it kicks in (once more than 512 entries have been created), triggering the hang. I have not been able to reliably reproduce this bug when thresh1 is set to e.g. 128, which typically results in the GC kicking in more frequently and pruning only a few dozen entries per run, so the problem only seems to occur when a certain threshold number of messages arrive "at once".
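To judge whether a given system is in the regime described above, it can help to look at how many neighbour entries exist per state and whether the table is above gc_thresh1. The following is only a sketch: the helper names `count_by_state` and `over_thresh1` are made up for illustration, and the parsing assumes the iproute2 `ip neigh show` format in which the NUD state is the last field on each line.

```shell
#!/bin/sh
# Hypothetical helpers, not part of ucode or of the attached script.

# Count neighbour entries per NUD state.
# Reads `ip -4 neigh show` output on stdin; assumes the state (FAILED,
# REACHABLE, STALE, ...) is the last whitespace-separated field.
count_by_state() {
	awk 'NF { n[$NF]++ } END { for (s in n) print s, n[s] }' | sort
}

# Succeeds if the entry count ($1) exceeds gc_thresh1 ($2), i.e. the
# garbage collector is allowed to prune entries on its next run.
over_thresh1() {
	[ "$1" -gt "$2" ]
}
```

On a live router this could be used as `ip -4 neigh show | count_by_state`, and `over_thresh1 "$(ip -4 neigh show | wc -l)" "$(sysctl -n net.ipv4.neigh.default.gc_thresh1)" && echo "GC may prune"`.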

I provide a memdump of the script taken after the hang.

rtnlbug.uc.txt
ucode.1703872887.23407.memdump.txt


f00b4r0 · 2024-01-08T15:12:03Z

Provided that the "netcat" package is installed, that the LAN IP is 192.168.1.1 and that almost no client devices are present, the following script will trigger the bug:

#!/bin/sh

sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

for i in $(seq 2 254); do
	echo "" | netcat -c -u 192.168.1.$i 65534   # create a large number of NUD FAILED neighbours
done

sleep 5

sysctl -w net.ipv4.neigh.default.gc_thresh1=128   # drop thresh1 below the entry count so the next GC run prunes them in one burst
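To confirm independently of ucode how large the burst of delete messages is, the events can be counted with `ip monitor neigh` while the script above runs. This is a rough sketch: the helper name `count_deleted` is hypothetical, and it assumes the full iproute2 `ip` is installed and that it prefixes RTM_DELNEIGH events with "Deleted ".

```shell
#!/bin/sh
# Hypothetical helper: count neighbour delete events.
# Reads `ip monitor neigh` output on stdin and prints the number of
# lines starting with "Deleted " (assumed iproute2 output format).
count_deleted() {
	grep -c '^Deleted ' || true   # grep exits non-zero when the count is 0
}
```

For example, running `timeout 30 ip monitor neigh | count_deleted` in a second shell while the reproduction script executes prints the number of delete events seen in that window.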

jow- mentioned this issue Jan 10, 2024

rtnl: improve event reception in order to avoid ENOBUFS #185

Merged

jow- closed this as completed in #185 Jan 11, 2024
