rtnl.listener dies on message burst #184

Closed
f00b4r0 opened this issue Dec 29, 2023 · 1 comment · Fixed by #185


f00b4r0 commented Dec 29, 2023

I noticed that the rtnl listener callback I set up in a ucode script would randomly appear to "die", without any error message, while the rest of the script kept operating normally.

After a bit of digging, I think I have narrowed it down to a resource exhaustion of some sort. The bug can be reproduced using the attached ucode script, which sets up a simple listener on RTNLGRP_NEIGH that prints the received messages.

Everything goes well until the neigh garbage collector kicks in and deletes a large number of neigh entries, resulting in a "large" number (hundreds) of messages being delivered. The script will typically appear to hang after printing anywhere between none and the first few of the delete messages ("cmd": 29), with no error whatsoever.
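One way to check how large the burst actually is, independently of the ucode listener, is to count the delete events reported by `ip monitor neigh` over the same period. The following is only a minimal sketch: the helper name `count_deleted` is hypothetical, and it assumes the full iproute2 `ip` binary (with `monitor` support) is installed and that it prefixes delete events with `Deleted `.

```sh
#!/bin/sh
# count_deleted (hypothetical helper): read `ip monitor neigh` style output
# on stdin and print how many lines report a deleted neighbour entry.
count_deleted() {
	# grep -c exits non-zero when nothing matches; keep the exit status clean
	grep -c '^Deleted ' || true
}

# Example usage (assumption: full iproute2 "ip" and a "timeout" applet):
#   timeout 60 ip monitor neigh | count_deleted
```

Comparing that count with the number of `"cmd": 29` messages the ucode script printed before going quiet gives a rough idea of how many messages were lost.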

On a system where the neigh GC is set like so:

net.ipv4.neigh.default.gc_thresh1=512
net.ipv4.neigh.default.gc_thresh2=2048
net.ipv4.neigh.default.gc_thresh3=4096

(values fairly typical for a busy router), the garbage collector may delete hundreds of entries in one go when it kicks in (once more than 512 entries have been created), triggering the hang. I have not been able to reliably reproduce this bug when thresh1 is set to e.g. 128, which typically results in the GC kicking in more frequently and pruning only a few dozen entries per run. The problem therefore only seems to occur when more than a certain number of messages arrive "at once".
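To see whether a given system is close to the point where the GC starts pruning, the current size of the IPv4 neighbour table can be compared with gc_thresh1. This is a rough sketch only: `over_thresh1` is a hypothetical helper, and the entry count from `ip -4 neigh show` is an approximation of what the kernel counts against the threshold.

```sh
#!/bin/sh
# over_thresh1 COUNT THRESH (hypothetical helper): succeed when the
# neighbour count is strictly greater than the gc_thresh1 value.
over_thresh1() {
	[ "$1" -gt "$2" ]
}

# Example usage on a live system (standard procfs path for the sysctl):
#   thresh=$(cat /proc/sys/net/ipv4/neigh/default/gc_thresh1)
#   count=$(ip -4 neigh show | wc -l)
#   over_thresh1 "$count" "$thresh" && echo "above gc_thresh1: $count > $thresh"
```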

I provide a memdump of the script taken after the hang.

rtnlbug.uc.txt
ucode.1703872887.23407.memdump.txt

f00b4r0 commented Jan 8, 2024

Provided that the "netcat" package is installed, that the LAN IP is 192.168.1.1 and that almost no client devices are present, the following script will trigger the bug:

#!/bin/sh

sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096

for i in $(seq 2 254); do
	echo "" | netcat -c -u 192.168.1.$i 65534   # create a large number of NUD FAILED neighbours
done

sleep 5

sysctl -w net.ipv4.neigh.default.gc_thresh1=128
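If the script does not trigger the hang, it may help to verify that the loop really left a few hundred FAILED entries behind before thresh1 was lowered. A minimal sketch, assuming `ip -4 neigh show` prints the NUD state as the word `FAILED` on each such entry (`count_failed` is a hypothetical helper name):

```sh
#!/bin/sh
# count_failed (hypothetical helper): read `ip neigh show` output on stdin
# and print how many entries are reported in the FAILED state.
count_failed() {
	grep -c 'FAILED' || true
}

# Example usage, e.g. during the "sleep 5" in the script above:
#   ip -4 neigh show | count_failed
```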
