Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PPPOE reconnect loops due to buggy LCP packet bypass #2267

Closed
somova opened this issue Mar 18, 2018 · 55 comments
Closed

PPPOE reconnect loops due to buggy LCP packet bypass #2267

somova opened this issue Mar 18, 2018 · 55 comments
Assignees
Labels
bug Production bug
Milestone

Comments

@somova
Copy link

somova commented Mar 18, 2018

I have an issue with a stable DSL connection using opnsense 18.1.x. After a reboot everything works fine. In case of reconnection (e.g. forced connection drop by my ISP after 24h, errors etc.) my system runs into a reconnection loop. Depending on the opnsense version (18.1.2 is the first release I ever used) the system behaviour looks differently:

  • v.18.1.2: the connection comes up again and runs fine for 90 seconds and drops again
  • v.18.1.4. the connection does not come up again

The problem is that after re-establishing the PPPoE connection PPP LCP packets sent out on netgraph hook mpd32168-lso:b0 (received on mpd32168-wan:bypass) are not (or only partly) forwarded via hook mpd32168-wan:link0 (see attached netgraph diagram).

In case of version 18.1.2 after re-establishing the PPPoE connection PPP LCP configuration packets are successfully exchanged, but responses to periodically incoming PPP LCP Echo requests (sent every 30 seconds by my ISP) are not forwarded. After 3 unanswered echo requests (3*30 = 90seconds) the ISP drops the connection. In case of version 18.1.4 even the exchange of PPP LCP configuration packets fails.

The bug report is related to the discussion in the opnsense forum: https://forum.opnsense.org/index.php?topic=7270.0

Neither manual reconnect nor restart of the mpd5 daemon works, the only solution is a reboot of the opnsense system.

netgraph_ix0

@petererdos
Copy link

I am also interested in the evolution of PPPoE with OPNsense, so signing up for this one.

Thank you for reporting the issue.

@fichtner
Copy link
Member

Did you try 64057c136f4d ?

@somova
Copy link
Author

somova commented Mar 28, 2018

I haven't tried this patch. I think, the patch will not solve the provlem because the observed issues reside on PPP and LCP layer.

@fichtner
Copy link
Member

Since it was mentioned that it is "stuck" and only reboot helps the following could help since it is related to the pppoe interface in the netgraph kernel module:

64057c136f4d#diff-c5f4a3581fb76adc9e03ceaee62ceb86R12

@somova
Copy link
Author

somova commented Mar 28, 2018

Oh sorry, I have overlooked the details of your patch 😮 and its relation to netgraph. I will test it tomorrow.

@fichtner
Copy link
Member

Much appreciated, thanks! :)

@somova
Copy link
Author

somova commented Mar 29, 2018

I tested the patch, but my system shows the same behaviour like before.

I think the next step should be debugging the mpd5 daemon to see whether the corresponding socket of netgraph hook mpd32168-wan:bypass receive and process the LCP packets. Maybe, I can find some time the next weeks to check this.

@somova
Copy link
Author

somova commented Apr 5, 2018

Sometimes this issue also occurs after rebooting the machine, not only after pppoe reconnect

@Alphakilo
Copy link

Alphakilo commented Apr 19, 2018

I'm experiencing the same issue. I have to reboot my OPNsense multiple times (2 - 3x) per day.
mpd5 stops answering LCP Echo Requests and the connection is subsequently terminated:
https://forum.opnsense.org/index.php?topic=7270.msg35642#msg35642

This is reliable reproduceable by forcing a PPPoE reconnect, by either side.

Can I do anything to help you fix this issue?

Edit: @fichtner 64057c1 did not fix this issue for me either.
My initial installation (18.1.3) was running fine, I think the issue was introduced in either .4 or .5 and was not fixed with .6

@somova
Copy link
Author

somova commented Apr 24, 2018

Your help is highly appreciated. The next steps I like to try when I have some free time:

  1. get more reference logs from working installations (for comparison)
  2. check whether LCP echo response packets sent out by mpd32168-lso:b0 are successfully received by hook mpd32168-wan:bypass (see netgraph above)
  3. check the obscure memory allocator within mpd5
  4. debug the mpd daemon (could be hard without test env and real debugger)

@somova
Copy link
Author

somova commented May 6, 2018

Problem identified :-)

As already mentioned my ISP sends LCP echo request every 10 seconds and terminates the ppp/pppoe session after three unanswered packets. Since the ISP does not receive LCP echo replies my opnsense enters an infinite reconnection loop until I reboot the system. The cause is a timing issue in OpnSense (mpd daemon in conjunction with scripts) which affects all users applying to pppoe dialin in OpnSense.

After negotiating and establishing the ppp session the mpd daemon spawns a new child process for executing the ppp-linkup script by calling the blocking system() function call. This results in blocking the single-threaded mpd instance until the ppp-linkup script finishes. That time all incoming packets for managing the ppp session are queued and not processed.

On my system the ppp-linkup script consumes too much time, so that the mpd daemon is not able to reply the LCP echo request packets in time. This is also the cause why the system runs fine after reboot but not after reestablishing the pppoe session (after reboot the script consumes around 25 seconds to execute in other ways much more time).

In summary, it can be the stated that the problem results in a design flaw. It is not a good idea to process time critical actions within the mpd5 daemon while running a blocking system call without knowing how much time it consumes. To my mind in the long term the architecture of the mpd daemon needs a change by spawning the child process in a separate thread (or all time critical actions).

As a short term solution all actions which are not related to setting up the interface (all but ip addresses, dns servers and routes) should be sourced out to synchronized parallel processes.

@fichtner fichtner self-assigned this May 6, 2018
@fichtner fichtner added the bug Production bug label May 6, 2018
@fichtner fichtner added this to the 18.7 milestone May 6, 2018
@fichtner
Copy link
Member

fichtner commented May 6, 2018

@somova thanks for this analysis, we can try a thing or two with this info :)

@somova
Copy link
Author

somova commented May 6, 2018

@fichtner: thanks for the fast reply. Currently I am testing a dirty work around as short term solution (see here)

@fichtner
Copy link
Member

fichtner commented May 6, 2018

@somova I was thinking something similar, but needed to check the code first to see if this was a very early regression when we removed pfSctl, but it does not seem so. How about a54cf79 (untested, extra branch for now).

@somova
Copy link
Author

somova commented May 6, 2018

@fichtner: I applied the patch, the ppp-linkup works fine after reboot. But after initiating a reconnect I don't have internet access anymore. It seems that not all necessary services are restarted and the default routes are not setup properly.

Log after reboot (excerpt):
May 6 15:28:46 opnsense: /usr/local/etc/rc.bootup: Dynamic DNS: (Success) IP Address Updated Successfully!
May 6 15:28:46 opnsense: /usr/local/etc/rc.bootup: Dynamic DNS: updating cache file /var/cache/dyndns[...]
May 6 15:28:45 opnsense: /usr/local/etc/rc.bootup: Dynamic DNS: (Success) IP Address Updated Successfully!
May 6 15:28:45 opnsense: /usr/local/etc/rc.bootup: Dynamic DNS: updating cache file /var/cache/dyndns[...]
[...]
May 6 15:27:46 opnsense: /usr/local/etc/rc.bootup: ROUTING: setting IPv6 default route to fe80::9203:25ff:fe74:4dfe
May 6 15:27:46 opnsense: /usr/local/etc/rc.bootup: ROUTING: setting IPv4 default route to 62.214.63.89
May 6 15:27:46 opnsense: /usr/local/etc/rc.bootup: ROUTING: no IPv6 default gateway set, trying wan on 'pppoe0' (fe80::9203:25ff:fe74:4dfe)
May 6 15:27:46 opnsense: /usr/local/etc/rc.bootup: ROUTING: no IPv4 default gateway set, trying wan on 'pppoe0' (62.214.63.89)

Log after manually initiated reconnect (excerpt):
May 6 15:33:24 kernel: ix0: promiscuous mode disabled
May 6 15:32:37 kernel: ng_pppoe[10]: no matching session
May 6 15:32:34 kernel: ng_pppoe[10]: no matching session
May 6 15:32:34 kernel: ng_pppoe[10]: no matching session
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route to wan
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route to wan
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: no IPv6 default gateway set, trying wan on 'pppoe0' ()
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: no IPv4 default gateway set, trying wan on 'pppoe0' ()
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: removing /tmp/pppoe0_defaultgwv6
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: removing /tmp/pppoe0_defaultgw
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt3'
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: On (IP address: 192.168.100.2) (interface: WAN_GUESTNET[opt3]) (real interface: ix0).
May 6 15:32:33 opnsense: /usr/local/etc/rc.newwanip: IP renewal is starting on 'ix0'
May 6 15:32:33 opnsense: /usr/local/etc/rc.linkup: Hotplug event detected for WAN_GUESTNET(opt3) but ignoring since interface is configured with static IP (192.168.100.2 ::)
May 6 15:32:33 kernel: ix0_vlan7: link state changed to UP
May 6 15:32:33 kernel: ix0: link state changed to UP
May 6 15:32:24 opnsense: /usr/local/etc/rc.linkup: Hotplug event detected for WAN_GUESTNET(opt3) but ignoring since interface is configured with static IP (192.168.100.2 ::)
May 6 15:32:24 kernel: ng0: changing name to 'pppoe0'
May 6 15:32:23 kernel: ix0_vlan7: link state changed to DOWN
May 6 15:32:23 kernel: ix0: link state changed to DOWN
May 6 15:32:23 opnsense: /status_interfaces.php: The command '/sbin/ifconfig 'pppoe0' inet6 -accept_rtadv' returned exit code '1', the output was 'ifconfig: interface pppoe0 does not exist'
May 6 15:32:23 opnsense: /status_interfaces.php: The command `/sbin/ifconfig -m 'pppoe0'' failed to execute ifconfig: interface pppoe0 does not exist
May 6 15:32:19 kernel: ix0: promiscuous mode enabled
May 6 15:32:16 kernel: ix0: promiscuous mode disabled
May 6 15:32:09 sshlockout[83567]: sshlockout/webConfigurator v3.0 starting up
May 6 15:32:09 ppp-linkdown: Removing stale PPPoE gateway 62.214.63.89 on inet

@fichtner
Copy link
Member

fichtner commented May 6, 2018

looks like ppp-linkup is started but pppoe0 is not configured? oO

@fichtner
Copy link
Member

fichtner commented May 6, 2018

@somova could you also post the interfaces: ppp: log file output ?

@fichtner
Copy link
Member

fichtner commented May 6, 2018

@somova and one more question... when the manual reconnect is done, will this fix it?

# configctl interface newip pppoe0

It looks like there is something wrong with the config.xml ... reading this pppoe0 is on top of ix0_vlan7 is on top of ix0 ? where would a static IP of 192.168.100.2 on ix0 come from if not from the config.xml ?!

@somova
Copy link
Author

somova commented May 7, 2018

Due to problems with the patch I switched back to my work around. But this morning again no internet connection 😢. The process list contained some hanging python processes which looked like remains from yesterday's tests.

root 47440 100.0 0.1 43644 10004 - Rs 15:32 860:46.39 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'
root 40075 83.2 0.1 43644 9876 - Rs 15:27 829:08.92 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'
root 36610 80.0 0.1 43644 9868 - Rs 15:27 827:26.13 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'
root 44841 76.2 0.1 43644 10016 - Rs 15:32 801:07.49 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'
root 20349 0.0 0.1 1110628 6468 - Ss 05:53 0:00.02 /usr/local/sbin/mpd5 -b -k -d /var/etc -f mpd_wan.conf -p /var/run/pppoe_wan.pid -s ppp pppoeclient

@somova
Copy link
Author

somova commented May 7, 2018

@somova could you also post the interfaces: ppp: log file output ?
@somova and one more question... when the manual reconnect is done, will this fix it?

Ok, I will test the patch again and post the results.

It looks like there is something wrong with the config.xml ... reading this pppoe0 is on top of ix0_vlan7 is on top of ix0 ? where would a static IP of 192.168.100.2 on ix0 come from if not from the config.xml ?!

  1. The pppoe0 is on top of ix0
  2. The ix0_vlan7 is on top of ix0 (This one was only for testing purposes with fritzbox but not in use anymore).

The static ip on ix0 is for the IP related communication with my fritzbox 7412 (192.168.100.1). This box is not configured as a real bridge modem but forwards incoming pppoe packets to the DSL interface and vice versa. In this configuration the box is still able to offer all its standard functionality to me (e.g. WLAN [as my guest WLAN] etc.)

@somova
Copy link
Author

somova commented May 7, 2018

After applying the patch again and performing a reconnect the two spawned daemon processes (one for IPv6 and IPv4) hang with 100 percent cpu load.

root 94073 100.0 0.1 43644 10012 - Rs 21:21 3:00.70 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'
root 97704 100.0 0.1 43644 10008 - Rs 21:21 3:00.65 /usr/local/bin/python2.7 /usr/local/opnsense/service/configd_ctl.py -m 'dns reload' 'interface newip pppoe0'

# configctl interface newip pppoe0

This cmd line has no effect.

Edit:
I forgot to add the ppp log 😮
@fichtner: If you still need the ppp log? I can reapply the patch for testing. Please let me know.

@somova
Copy link
Author

somova commented Jun 26, 2018

Will this issue be solved in milestone 18.7 or do need some more beta testers?

@fichtner fichtner modified the milestones: 18.7, 19.1 Jul 24, 2018
@somova
Copy link
Author

somova commented Apr 1, 2019

@rcmcronny: This issue is about LCP echo response messages which cannot sent out via the interface in time due to a script which blocks the mpd daemon. So, the ISP thinks the client is not responding anymore and the connection goes on-hook.

It seems that your issue is another one. Your logfile shows that reconnection timed out. So, I recommend you to track this down using wireshark packet analysis. And create a new bug report in case the issue is not the same.

@somova
Copy link
Author

somova commented Apr 1, 2019

@fichtner: I am now on version 19.1.4. How can I support the development team with tests to solve the issue?

@fichtner
Copy link
Member

fichtner commented Apr 1, 2019

I've added a few commits in the last weeks and thinking about how to debug this further, but I'm not entirely sure which part hangs other than "everything". The best I can come up with is log process ID date and operation to be performed and do a mini log file for ppp linkup/linkdown?

@somova
Copy link
Author

somova commented Apr 3, 2019

It's a pitty that mpd is single-threaded and blocks until the script is completed. What is about moving the scripts to the background (like in my dirty workaround)? In the latter case we should consider what happens if the ppp-linkdown script is executed while the ppp-linkup script is still executing.

fichtner added a commit that referenced this issue May 6, 2019
This might kill a bit of delay in function use by doing an atomic
move to update resolv.conf.  Even if several instances are running
at the same time the contents of the file will be the same now.

I don't expect issues with the DNS route updates either: even if
they are removed or added twice, they will always end up being there.
@fichtner
Copy link
Member

Anybody? I have patches to try for the willing on top of 19.1.7...

@Alphakilo
Copy link

@fichtner Yes please!

@fichtner
Copy link
Member

You can try a development lock removal in DNS code:

# opnsense-patch 5f4315c

You can try the old experimental patch for doing all reloading in the background:

# opnsense-patch 2806a0c

On top of this we could then try removing DNS reload on ppp-linkdown or having a log file writing actions and timestamps to see where the linkup blocks...

@fichtner
Copy link
Member

@Alphakilo sorry, missed you on IRC yesterday... should be around all week to discuss if you want :)

@Alphakilo
Copy link

@fichtner # opnsense-patch 5f4315c is applied. 2806a0c is prefetched and cached, in case things go south.

@fichtner
Copy link
Member

patches are always cached nowadays, but good thinking nevertheless 👍

@somova
Copy link
Author

somova commented May 16, 2019

So, now I have switched to 19.1.7 and can test the patches. Before I start, I need to rollback my workaround before applying the patches. In case something goes wrong and opnsense disconnects from the internet, it's better to have physical access to the machine. Thus, I postpone testing to next monday.

@Alphakilo
Copy link

Alphakilo commented May 17, 2019

Okay, let's see. I've installed the patch May 13th. Since then I had 8 reconnects. I'd have expected 4 reconnects, one every 24 hours.

# grep "LCP: rec'd Terminate Request #\|IPCP: state change Ack-Sent --> Opened" /var/log/ppps.log | grep -v "May 13"
May 14 19:12:27 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 14 19:12:42 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 14 19:13:21 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 14 19:13:34 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 15 19:13:32 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 15 19:13:45 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 15 19:14:55 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 15 19:15:11 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 15 19:15:50 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 15 19:16:05 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 16 19:16:03 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 16 19:16:18 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 17 19:16:17 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 17 19:16:32 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened
May 17 19:17:42 gateway ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Opened)
May 17 19:18:00 gateway ppp: [wan] IPCP: state change Ack-Sent --> Opened

The problem is still there, just not as prevalent as it used to be.

@somova
Copy link
Author

somova commented May 20, 2019

So, I have performed the tests with Opnsense 19.1.7 and the following steps:

  1. Dropped the running PPPoE session
  2. Did a rollback of my workaround
  3. Dumped the network traffic while establishing a new PPPoE session and saw delayed answers to "PPP LCP Echo Requests" (as seen in the past)
  4. Applied both patches (5f4315c and 2806a0c)
  5. Dropped the running PPPoE session
  6. Dumped the network traffic while establishing a new PPPoE session

Now "PPP LCP Echo Requests" are immediately answered and my ISP does not drop the PPP connection anymore :-). So, the patches look good, but I need some more time for testing the stability.

👍

fichtner added a commit that referenced this issue May 20, 2019
This might kill a bit of delay in function use by doing an atomic
move to update resolv.conf.  Even if several instances are running
at the same time the contents of the file will be the same now.

I don't expect issues with the DNS route updates either: even if
they are removed or added twice, they will always end up being there.

(cherry picked from commit 5f4315c)
@fichtner
Copy link
Member

@Alphakilo @somova thanks for testing... more or less on the right track. there's still a little contention around linkup and linkdown we could try to sidestep by refusing to reconfigure dns on network linkdown. thoughts? :)

@Alphakilo
Copy link

Alphakilo commented May 21, 2019

reconfigure dns on network linkdown

DynDNS? If OPNsense doesn't reconfigure that, and there is a failover, the DynDNS records won't be updated, right? That'd be.. not good.

@fichtner
Copy link
Member

No, this does not pertain to DynDNS:

/usr/local/opnsense/service/configd_ctl.py dns reload

@fichtner
Copy link
Member

fichtner commented May 21, 2019

More context: configd could block there to reconfigure DNS while doing something else, but reconfiguring DNS (resolv.conf really) serves no purpose if a link goes down as connectivity is broken anyway.

2806a0c tip-toes around this by doing fire and forget, but it doesn't seem reasonable to call it in the first place.

@somova
Copy link
Author

somova commented May 26, 2019

@fichtner : Is it safe to update to opnsense v.19.1.8 after applied the patches in v.19.1.7?

@Alphakilo : I am not sure your issue is the same like mine. Can you please check (with wireshark) whether unanswered LCP echo request packets are the cause that your ISP terminates the connection.

@somova
Copy link
Author

somova commented Jun 6, 2019

After updating my opnsense to version19.1.9 and applying patch 2806a0c PPP LCP echo requests are still immediately answered 😄

@fichtner
Copy link
Member

fichtner commented Jun 7, 2019

@somova thanks for bearing with us ❤️ 2806a0c is now on master as e9f2862

EugenMayer pushed a commit to KontextWork/opnsense_core that referenced this issue Jul 22, 2019
EugenMayer pushed a commit to KontextWork/opnsense_core that referenced this issue Jul 22, 2019
This might kill a bit of delay in function use by doing an atomic
move to update resolv.conf.  Even if several instances are running
at the same time the contents of the file will be the same now.

I don't expect issues with the DNS route updates either: even if
they are removed or added twice, they will always end up being there.
EugenMayer pushed a commit to KontextWork/opnsense_core that referenced this issue Jul 22, 2019
EugenMayer pushed a commit to KontextWork/opnsense_core that referenced this issue Jul 22, 2019
EugenMayer pushed a commit to KontextWork/opnsense_core that referenced this issue Jul 22, 2019
This might kill a bit of delay in function use by doing an atomic
move to update resolv.conf.  Even if several instances are running
at the same time the contents of the file will be the same now.

I don't expect issues with the DNS route updates either: even if
they are removed or added twice, they will always end up being there.
EugenMayer pushed a commit to KontextWork/opnsense_core that referenced this issue Jul 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Production bug
Development

No branches or pull requests

7 participants