PPPOE reconnect loops due to buggy LCP packet bypass #2267
Comments
|
I am also interested in the evolution of PPPoE with OPNsense, so signing up for this one. Thank you for reporting the issue. |
|
Did you try 64057c136f4d ? |
|
I haven't tried this patch. I think the patch will not solve the problem because the observed issues reside on the PPP and LCP layer. |
|
Since it was mentioned that it is "stuck" and only a reboot helps, the following could help, since it is related to the pppoe interface in the netgraph kernel module: |
|
Oh sorry, I have overlooked the details of your patch 😮 and its relation to netgraph. I will test it tomorrow. |
|
Much appreciated, thanks! :) |
|
I tested the patch, but my system shows the same behaviour as before. I think the next step should be debugging the mpd5 daemon to see whether the corresponding socket of netgraph hook mpd32168-wan:bypass receives and processes the LCP packets. Maybe I can find some time in the next few weeks to check this. |
|
Sometimes this issue also occurs after rebooting the machine, not only after a PPPoE reconnect. |
|
I'm experiencing the same issue. I have to reboot my OPNsense multiple times (2 - 3x) per day. This is reliably reproducible by forcing a PPPoE reconnect from either side. Can I do anything to help you fix this issue? Edit: @fichtner 64057c1 did not fix this issue for me either. |
|
Your help is highly appreciated. The next steps I would like to try when I have some free time:
|
|
Problem identified :-) As already mentioned, my ISP sends an LCP echo request every 10 seconds and terminates the PPP/PPPoE session after three unanswered packets. Since the ISP does not receive LCP echo replies, my OPNsense enters an infinite reconnection loop until I reboot the system.

The cause is a timing issue in OPNsense (the mpd daemon in conjunction with its scripts) which affects all users who use PPPoE dial-in. After negotiating and establishing the PPP session, the mpd daemon spawns a new child process to execute the ppp-linkup script by calling the blocking system() function. This blocks the single-threaded mpd instance until the ppp-linkup script finishes. During that time all incoming packets for managing the PPP session are queued and not processed. On my system the ppp-linkup script takes too long, so the mpd daemon cannot reply to the LCP echo request packets in time. This is also why the system runs fine after a reboot but not after re-establishing the PPPoE session (after a reboot the script takes around 25 seconds to execute, otherwise much longer).

In summary, the problem results from a design flaw. It is not a good idea to handle time-critical actions within the mpd5 daemon while it runs a blocking system call of unknown duration. In my opinion, in the long term the architecture of the mpd daemon needs to change so that the child process (or all time-critical actions) runs in a separate thread. As a short-term solution, all actions which are not related to setting up the interface (everything but IP addresses, DNS servers and routes) should be moved out to synchronized parallel processes. |
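The timing argument in this comment can be put into a small model. This is only a sketch under simplifying assumptions that are not stated in the thread: the first echo request arrives right when mpd starts blocking, the ISP counts a request as unanswered once the next interval has elapsed, and it drops the session as soon as the count reaches the limit. The function names are hypothetical.

```python
def drop_deadline(echo_interval_s: float, max_unanswered: int) -> float:
    """Seconds of silence after which the peer gives up (simplified model)."""
    return echo_interval_s * max_unanswered


def session_survives(blocked_s: float,
                     echo_interval_s: float = 10.0,
                     max_unanswered: int = 3) -> bool:
    """True if the daemon resumes answering before the peer's deadline.

    blocked_s is how long the single-threaded daemon is stuck in the
    blocking call and therefore cannot send LCP echo replies.
    """
    return blocked_s < drop_deadline(echo_interval_s, max_unanswered)


if __name__ == "__main__":
    # 25 s is the script runtime reported after a reboot; 45 s is an
    # arbitrary stand-in for the "much longer" runtime after a reconnect.
    for blocked in (25, 45):
        print(f"blocked {blocked}s -> survives: {session_survives(blocked)}")
```

With a 10 second interval and three allowed misses, the model gives the script roughly 30 seconds, which is consistent with a 25 second run surviving and a longer run triggering the reconnect loop.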
|
@somova thanks for this analysis, we can try a thing or two with this info :) |
|
@fichtner: thanks for the fast reply. Currently I am testing a dirty workaround as a short-term solution (see here) |
|
@fichtner: I applied the patch, and ppp-linkup works fine after a reboot. But after initiating a reconnect I don't have internet access anymore. It seems that not all necessary services are restarted and the default routes are not set up properly. Log after reboot (excerpt): Log after manually initiated reconnect (excerpt): |
|
looks like ppp-linkup is started but pppoe0 is not configured? oO |
|
@somova could you also post the interfaces: ppp: log file output ? |
|
@somova and one more question... when the manual reconnect is done, will this fix it? It looks like there is something wrong with the config.xml ... reading this pppoe0 is on top of ix0_vlan7 is on top of ix0 ? where would a static IP of 192.168.100.2 on ix0 come from if not from the config.xml ?! |
|
Due to problems with the patch I switched back to my workaround. But this morning there was again no internet connection 😢. The process list contained some hanging Python processes which looked like leftovers from yesterday's tests:
root 47440 100.0 0.1 43644 10004 - Rs 15:32 860:46.39 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0' |
Ok, I will test the patch again and post the results.
The static ip on ix0 is for the IP related communication with my fritzbox 7412 (192.168.100.1). This box is not configured as a real bridge modem but forwards incoming pppoe packets to the DSL interface and vice versa. In this configuration the box is still able to offer all its standard functionality to me (e.g. WLAN [as my guest WLAN] etc.) |
|
After applying the patch again and performing a reconnect, the two spawned daemon processes (one for IPv6 and one for IPv4) hang with 100 percent CPU load:
root 94073 100.0 0.1 43644 10012 - Rs 21:21 3:00.70 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'
This cmd line has no effect. Edit: |
|
Will this issue be solved in milestone 18.7, or do you need some more beta testers? |
|
@rcmcronny: This issue is about LCP echo response messages which cannot be sent out via the interface in time because a script blocks the mpd daemon. So the ISP thinks the client is not responding anymore and the connection goes on-hook. It seems that your issue is a different one. Your log file shows that the reconnection timed out. So I recommend tracking this down with a Wireshark packet analysis and creating a new bug report if the issue is not the same. |
|
@fichtner: I am now on version 19.1.4. How can I support the development team with tests to solve the issue? |
|
I've added a few commits in the last weeks and am thinking about how to debug this further, but I'm not entirely sure which part hangs other than "everything". The best I can come up with is to log the process ID, date and operation to be performed, and keep a mini log file for ppp linkup/linkdown? |
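A mini log of the kind proposed here could look roughly like the following. This is a hedged Python sketch only, not the OPNsense implementation; the function names, the entry format and the log path in the usage note are all hypothetical.

```python
import os
from datetime import datetime, timezone


def format_hook_entry(operation: str, pid: int, when: datetime) -> str:
    """Build one log line: timestamp, process ID and the operation."""
    return f"{when.isoformat()} pid={pid} {operation}"


def log_hook_step(log_path: str, operation: str) -> str:
    """Append an entry for the current process and return the line written."""
    entry = format_hook_entry(operation, os.getpid(),
                              datetime.now(timezone.utc))
    # Open in append mode so that concurrent hook runs add lines
    # instead of truncating each other's output.
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(entry + "\n")
    return entry
```

Calling something like `log_hook_step("/tmp/ppp-hooks.log", "linkup: dns reload start")` before and after each step of a hook would show which operation a hanging process was last working on.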
|
It's a pity that mpd is single-threaded and blocks until the script is completed. What about moving the scripts to the background (like in my dirty workaround)? In that case we should consider what happens if the ppp-linkdown script is executed while the ppp-linkup script is still executing. |
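One way to reason about the overlap question is a single background worker that runs hooks strictly one after another, so the caller never blocks and linkup and linkdown never run at the same time. The sketch below is an illustration only: it models the hooks as Python callables instead of the real ppp-linkup and ppp-linkdown scripts, and `HookRunner` is a hypothetical name.

```python
import queue
import threading
import time


class HookRunner:
    """Run submitted hooks one at a time on a background thread."""

    def __init__(self) -> None:
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, job) -> None:
        """Queue a hook and return immediately; the caller is not blocked."""
        self._jobs.put(job)

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: stop the worker
                break
            job()  # hooks run in submission order and never overlap

    def close(self) -> None:
        """Finish all queued hooks, then stop the worker."""
        self._jobs.put(None)
        self._worker.join()


if __name__ == "__main__":
    runner = HookRunner()
    runner.submit(lambda: (time.sleep(0.1), print("linkup done")))
    runner.submit(lambda: print("linkdown done"))
    runner.close()
```

The trade-off is that a slow linkup delays the following linkdown, but the daemon's main loop stays free to answer LCP echo requests.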
This might kill a bit of delay in function use by doing an atomic move to update resolv.conf. Even if several instances are running at the same time the contents of the file will be the same now. I don't expect issues with the DNS route updates either: even if they are removed or added twice, they will always end up being there.
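The "atomic move" idea from this commit message can be illustrated as follows. This is a generic Python sketch of the write-then-rename technique and does not reproduce the actual OPNsense code; `write_atomically` is a hypothetical helper name.

```python
import os
import tempfile


def write_atomically(path: str, content: str) -> None:
    """Replace the file at path so readers see the old or the new content,
    never a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            tmp_file.write(content)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX systems
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If several instances write the same content concurrently, each rename simply replaces the file with an identical copy, which matches the reasoning in the commit message.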
|
Anybody? I have patches to try for the willing on top of 19.1.7... |
|
@fichtner Yes please! |
|
You can try a development lock removal in DNS code:
You can try the old experimental patch for doing all reloading in the background:
On top of this we could then try removing DNS reload on ppp-linkdown or having a log file writing actions and timestamps to see where the linkup blocks... |
|
@Alphakilo sorry, missed you on IRC yesterday... should be around all week to discuss if you want :) |
|
@fichtner |
|
patches are always cached nowadays, but good thinking nevertheless 👍 |
|
So, now I have switched to 19.1.7 and can test the patches. Before applying them I need to roll back my workaround. In case something goes wrong and OPNsense disconnects from the internet, it's better to have physical access to the machine. Thus, I am postponing testing to next Monday. |
|
Okay, let's see. I installed the patch on May 13th. Since then I have had 8 reconnects. I'd have expected 4 reconnects, one every 24 hours. The problem is still there, just not as prevalent as it used to be. |
|
So, I have performed the tests with OPNsense 19.1.7 and the following steps:
Now "PPP LCP Echo Requests" are immediately answered and my ISP does not drop the PPP connection anymore :-). So, the patches look good, but I need some more time for testing the stability. 👍 |
This might kill a bit of delay in function use by doing an atomic move to update resolv.conf. Even if several instances are running at the same time the contents of the file will be the same now. I don't expect issues with the DNS route updates either: even if they are removed or added twice, they will always end up being there. (cherry picked from commit 5f4315c)
|
@Alphakilo @somova thanks for testing... more or less on the right track. there's still a little contention around linkup and linkdown we could try to sidestep by refusing to reconfigure dns on network linkdown. thoughts? :) |
DynDNS? If OPNsense doesn't reconfigure that, and there is a failover, the DynDNS records won't be updated, right? That'd be.. not good. |
|
No, this does not pertain to DynDNS:
|
|
More context: configd could block there to reconfigure DNS while doing something else, but reconfiguring DNS (resolv.conf really) serves no purpose if a link goes down as connectivity is broken anyway. 2806a0c tip-toes around this by doing fire and forget, but it doesn't seem reasonable to call it in the first place. |
|
@fichtner: Is it safe to update to OPNsense 19.1.8 after applying the patches on 19.1.7? @Alphakilo: I am not sure your issue is the same as mine. Can you please check (with Wireshark) whether unanswered LCP echo request packets are the reason your ISP terminates the connection? |
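Once the LCP packets have been exported from a capture, the check requested here can be automated. The sketch below assumes the capture was already reduced to `(timestamp, code, identifier)` tuples containing only echo requests from the ISP and echo replies from the firewall; that export step and the function name are assumptions. The code values 9 (Echo-Request) and 10 (Echo-Reply) come from RFC 1661, which also requires a reply to copy the identifier of its request.

```python
from typing import List, Sequence, Tuple

LCP_ECHO_REQUEST = 9   # LCP code for Echo-Request (RFC 1661)
LCP_ECHO_REPLY = 10    # LCP code for Echo-Reply (RFC 1661)

Packet = Tuple[float, int, int]  # (timestamp in seconds, LCP code, identifier)


def unanswered_echo_requests(packets: Sequence[Packet],
                             timeout_s: float) -> List[Packet]:
    """Return echo requests with no matching reply within timeout_s."""
    # Collect reply timestamps per identifier.
    reply_times = {}
    for timestamp, code, identifier in packets:
        if code == LCP_ECHO_REPLY:
            reply_times.setdefault(identifier, []).append(timestamp)

    missing = []
    for packet in packets:
        timestamp, code, identifier = packet
        if code != LCP_ECHO_REQUEST:
            continue
        answered = any(timestamp <= reply <= timestamp + timeout_s
                       for reply in reply_times.get(identifier, []))
        if not answered:
            missing.append(packet)
    return missing
```

Three consecutive entries in the returned list shortly before a disconnect would point to the same cause as described in this issue.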
|
After updating my OPNsense to version 19.1.9 and applying patch |
I have trouble keeping a stable DSL connection with OPNsense 18.1.x. After a reboot everything works fine. In case of a reconnection (e.g. forced connection drop by my ISP after 24h, errors etc.) my system runs into a reconnection loop. Depending on the OPNsense version (18.1.2 is the first release I ever used) the system behaviour differs:
The problem is that after re-establishing the PPPoE connection PPP LCP packets sent out on netgraph hook mpd32168-lso:b0 (received on mpd32168-wan:bypass) are not (or only partly) forwarded via hook mpd32168-wan:link0 (see attached netgraph diagram).
In case of version 18.1.2, after re-establishing the PPPoE connection, PPP LCP configuration packets are successfully exchanged, but responses to periodically incoming PPP LCP Echo requests (sent every 30 seconds by my ISP) are not forwarded. After 3 unanswered echo requests (3 * 30 = 90 seconds) the ISP drops the connection. In case of version 18.1.4 even the exchange of PPP LCP configuration packets fails.
The bug report is related to the discussion in the opnsense forum: https://forum.opnsense.org/index.php?topic=7270.0
Neither a manual reconnect nor a restart of the mpd5 daemon works; the only solution is a reboot of the OPNsense system.