bridger is not reliable (fails to register traffic) #3

Closed
nicefile opened this issue Apr 24, 2023 · 21 comments

@nicefile

Some time after starting, bridger stops registering new connections, and the file
/sys/kernel/debug/ppe0/bind
stays empty for both new and current traffic. After a while it starts working again.
Running /etc/init.d/bridger restart fixes this for current traffic instantly.

Link to the forum thread where others confirm this:
https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/147?u=nicefile

Build from 21-04-2023 on a Cudy WR3000 (MT7981).
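
One way to watch for this from a shell is to sample how many entries the bind file holds. The sketch below assumes each bound flow takes one line in /sys/kernel/debug/ppe0/bind (the thread does not describe the file's format); count_bound_flows and BIND_FILE are names made up for illustration.

```shell
#!/bin/sh
# Path reported in this issue; override BIND_FILE to point elsewhere.
BIND_FILE="${BIND_FILE:-/sys/kernel/debug/ppe0/bind}"

# Print the number of non-empty lines in the bind file.
# Assumption: one line per bound flow.
count_bound_flows() {
    file="${1:-$BIND_FILE}"
    # Missing or unreadable file (e.g. debugfs not mounted): report 0 and fail.
    [ -r "$file" ] || { echo 0; return 1; }
    # grep -c prints the match count; it exits non-zero on zero matches,
    # so the status is ignored here.
    grep -c . "$file" || true
}
```

For example, `while sleep 5; do echo "$(date +%T) $(count_bound_flows)"; done` prints one sample every five seconds. A count that stays at 0 while traffic is flowing would match the symptom described above.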

@nicefile
Author

r22967-f18cb0ba63 on the freshly supported WR3000 still doesn't register some of the connections.
/etc/init.d/bridger restart fixes this for current traffic.
To reproduce the issue, just run an iperf3 test between a wired and a wireless host.
I see none of the CPU hogging that plagued the previous bridger build.
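
A rough sketch of that reproduction, assuming iperf3 is installed on both hosts and `iperf3 -s` is already running on the wired one. run_iperf_test and DRY_RUN are hypothetical names used only for this example.

```shell
#!/bin/sh
# Hypothetical helper: run (or, with DRY_RUN=1, just print) an iperf3 client test.
# $1 = address of the host running "iperf3 -s"
# $2 = test duration in seconds (defaults to 60)
run_iperf_test() {
    server="$1"
    duration="${2:-60}"
    [ -n "$server" ] || { echo "usage: run_iperf_test SERVER [SECONDS]" >&2; return 2; }
    # Build the command once so the printed and executed forms cannot drift apart.
    set -- iperf3 -c "$server" -t "$duration"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Run it on the wireless host as `run_iperf_test 192.168.1.10 120` (the address is a placeholder for the wired host) while checking /sys/kernel/debug/ppe0/bind on the router for entries.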

@Fail-Safe

I can confirm I am seeing the same as well. I posted details here:
https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/177?u=_failsafe

Running on RT3200 build r24615-25e215c14e. However, I am no longer seeing WED crashing as noted here:
openwrt/mt76#754 (comment)

@rany2

rany2 commented Dec 21, 2023

This seems to solve the issue for me, but I'm not sure why it does:

diff --git a/flow.c b/flow.c
index 61564c0..c7a599a 100644
--- a/flow.c
+++ b/flow.c
@@ -160,7 +160,6 @@ bridger_flow_update_cb(struct uloop_timeout *timeout)
 	avl_for_each_element_safe(&flows, flow, node, tmp) {
 		avl_delete(&sorted_flows, &flow->sort_node);
 		bridger_bpf_flow_update(flow);
-		bridger_nl_flow_offload_update(flow);
 		avl_insert(&sorted_flows, &flow->sort_node);
 
 		flow_debug_msg(flow, "Update");

I do not know what's wrong with bridger/nl.c, lines 734 to 739 in 3159bbe:

dev->offload_update = false;
msg = nlmsg_alloc_simple(RTM_GETTFILTER, NLM_F_REQUEST | NLM_F_DUMP);
nlmsg_append(msg, &tcmsg, sizeof(tcmsg), NLMSG_ALIGNTO);
nl_send_auto_complete(cmd_sock, msg);
nlmsg_free(msg);
nl_wait_for_ack(cmd_sock);
but I don't think the issue is in handle_filter(). I made handle_filter() a no-op and nothing changed; the only thing that fixed it was not sending the RTM_GETTFILTER command to cmd_sock, i.e. not calling bridger_nl_flow_offload_update.

Of course, I'm sharing this only in the hope that it helps find the source of the issue, not for you to use the patch, though it does seem to work fine.

@Fail-Safe

Fail-Safe commented Dec 21, 2023

Very interesting find! I rebuilt the firmware for my three RT3200s including the change you made in flow.c and sure enough, I'm still seeing flows in /sys/kernel/debug/ppe0/bind even after about 20 minutes of uptime. Longest I've ever seen it keep working.


Update:
This is wild! It is still working nearly 12 hours later!

I know @nbd168 has to be pulled in a million other directions, but hopefully he can give this a look and get some updates into bridger. 😃

@rany2

rany2 commented Dec 21, 2023

It's weird that RTM_GETTFILTER causes this issue because, as far as I know, it shouldn't change anything. I can even trigger the issue again by running while :; do tc -s filter show dev eth0 ingress >/dev/null; sleep 1; done with bridger_nl_flow_offload_update commented out as above.

I think it could be a kernel bug, but I'm not sure.

@imwhocodes

I'm still seeing this issue with "OpenWrt SNAPSHOT r25136-6497cdba09" and "bridger 2023-05-12-d0f79a16". Is there any update, or is it still better to keep WED disabled on a dumb AP?

@Fail-Safe

The WED crash issue seems to be fixed. See details toward the end of: openwrt/mt76#754 (comment)

I'm using @rany2's patch from here and it has kept the WED offloading working for me.

@imwhocodes

> The WED crash issue seems to be fixed. See details toward the end of: openwrt/mt76#754 (comment)
>
> I'm using @rany2's patch from here and it has kept the WED offloading working for me.

Thanks. So there is no pre-packaged build of it, and I need to build it myself?

@Fail-Safe

Correct, at this point you'd have to build and patch yourself.

Fail-Safe referenced this issue in openwrt/openwrt Mar 10, 2024
Should improve performance/reliability with lots of mcast packets

Signed-off-by: Felix Fietkau <nbd@nbd.name>

@skramstad

I'm testing my BPI-R3 as a dumb AP with the latest snapshot. I have also tested kernel 6.6 and bridger, with the same result: bridger stops getting new flows after a minute or so...

But now I've been testing bridger with this line removed (see #3 (comment)), and I can see new flows again.

-		bridger_nl_flow_offload_update(flow);

Thanks @rany2 👍

nicefile pushed a commit to nicefile/bridger-fix that referenced this issue Mar 15, 2024
bridger is not reliable (fails to register traffic)

Signed-off-by: Robert Senderek <robert.senderek@10g.pl>

@nicefile
Author

@rany2 I've taken the liberty of creating a PR with your proposed workaround. Maybe this will catch @nbd168's attention.

@nicefile
Author

bridger with rany2's patch for OpenWrt 23.05.3 is on my gdrive

@gssjshark

gssjshark commented Mar 27, 2024

How do we apply this patch? Sorry, I am relatively new to OpenWrt. Thanks for your help!

@nicefile
Author

nicefile commented Mar 27, 2024

@gssjshark Let's assume you're in the OpenWrt folder:

mkdir -p package/network/services/bridger/patches
wget -O package/network/services/bridger/patches/10-fix-issue-3.patch "https://github.com/nbd168/bridger/pull/5/commits/c73bf1f80999db1fe5dbf5c082a9e77862b35d58.patch"

Then build the package or the whole firmware.

Or you can install the package for OpenWrt 23.05.3 from #3 (comment).
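
Before building, it may be worth checking that the file wget saved is really a patch and not an error page. This is only a sketch: looks_like_patch is a hypothetical helper, and it merely checks for unified-diff markers rather than proving the patch applies cleanly.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE looks like a unified diff.
looks_like_patch() {
    [ -s "$1" ] || return 1                 # missing or empty file
    grep -q '^--- a/' "$1" || return 1      # old-file header
    grep -q '^+++ b/' "$1" || return 1      # new-file header
    grep -q '^@@ ' "$1"                     # at least one hunk header
}
```

Usage from the OpenWrt folder could look like `looks_like_patch package/network/services/bridger/patches/10-fix-issue-3.patch && echo "patch looks OK"`. If the check passes, building only the package should then be possible with something like `make package/network/services/bridger/compile V=s`, assuming a standard OpenWrt buildroot.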

@Fail-Safe

@nbd168 Hey Felix, do you have any feedback on the findings from @rany2 in post #3 (comment)?

@nbd168
Owner

nbd168 commented Apr 14, 2024

Please try the latest version

@rany2

rany2 commented Apr 14, 2024

I'll test it out tomorrow, thanks as always for your efforts. Hopefully you could find a tester that can respond earlier.

@Fail-Safe

Fail-Safe commented Apr 14, 2024

@nbd168 Updated my build to run with c77a7a1. So far, so good.

After 50 minutes of uptime, I am still seeing flows when watching /sys/kernel/debug/ppe0/bind. I typically would have seen the flows "disappear" within a handful of minutes (often less than 5 mins). I'll give another update after I let this cook overnight and see how things look.

Thank you, @nbd168!

@Fail-Safe

@nbd168 Still seeing flows 12+ hours later. Commit c77a7a1 seems golden, IMHO. Many thanks!

@rany2

rany2 commented Apr 15, 2024

I think this issue could be closed, seems solved for me.

@nbd168
Owner

nbd168 commented Apr 15, 2024

thanks for testing!

@nbd168 closed this as completed Apr 15, 2024