Adding second WAN with type DHCP makes OPN unreachable #1811

Closed
mimugmail opened this issue Sep 8, 2017 · 28 comments

@mimugmail mimugmail commented Sep 8, 2017

I have an OPNsense installation with a static WAN IP address.
When I add a second WAN of type DHCP and plug in the cable, the system gets an IP address and a second default gateway from DHCP. After this, the firewall is no longer reachable via WAN or WAN2.

I can see the packets coming in via the WAN interfaces, but no reply is sent back on any interface.
After a reboot it works again via WAN1.

Is this normal behavior? For me it's OK to know that the firewall has to be restarted, but I think this should work flawlessly in production environments.

@mimugmail mimugmail commented Sep 10, 2017

Disabling and enabling DHCP WAN reproduces this problem.
Both WANs unusable. When I log into LAN I can do some DNS resolution (DNS bound to IF?) and the ping results in no route to host:

root@dialin:~ # ping apple.com
PING apple.com (17.178.96.59): 56 data bytes
ping: sendto: No route to host
ping: sendto: No route to host

system.log and dmesg don't show anything interesting.
It seems that having a static default gateway via WAN and then receiving a second default gateway via WAN_DHCP results in no default gateway at all.

@AdSchellevis AdSchellevis commented Sep 10, 2017

what does netstat -nr say?

@mimugmail mimugmail commented Sep 10, 2017

root@dialin:~ # netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
5.10.0.115         a0:36:9f:8a:fa:ee  UHS        igb2
5.10.0.122         a0:36:9f:8a:fa:ee  UHS        igb2
8.8.8.8            217.5.204.209      UGHS       igb0
91.137.72.X/24     link#3             U          igb2
91.137.72.X        link#3             UHS         lo0
127.0.0.1          link#7             UH          lo0
192.168.169.0/24   link#5             U           em0
192.168.169.230    link#5             UHS         lo0
217.5.204.X/28     link#1             U          igb0
217.5.204.X        link#1             UHS         lo0

Internet6:
Destination                       Gateway                       Flags     Netif Expire
::1                               link#7                        UH          lo0
fe80::%igb0/64                    link#1                        U          igb0
fe80::a236:9fff:fe8a:faec%igb0    link#1                        UHS         lo0
fe80::%igb1/64                    link#2                        U          igb1
fe80::a236:9fff:fe8a:faed%igb1    link#2                        UHS         lo0
fe80::%igb2/64                    link#3                        U          igb2
fe80::a236:9fff:fe8a:faee%igb2    link#3                        UHS         lo0
fe80::%em0/64                     link#5                        U           em0
fe80::7285:c2ff:fe25:bbd4%em0     link#5                        UHS         lo0
fe80::%lo0/64                     link#7                        U           lo0
fe80::1%lo0                       link#7                        UHS         lo0

There is no default route anymore. The Google IP has a fixed route via WAN (IP 217.X), and the two DNS servers from DHCP are routed via v6(?).
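
To check for this condition from a script rather than by eye, here is a minimal sketch. It assumes the five-column `netstat -rn` layout shown above (Destination, Gateway, Flags, Netif, Expire). `default_route` is a hypothetical helper name, not an OPNsense or FreeBSD command.

```shell
#!/bin/sh
# Hypothetical helper: reads routing-table text in the `netstat -rn`
# layout on stdin, prints "<gateway> <netif>" for the IPv4 default route,
# and returns non-zero when no IPv4 default route is listed.
default_route() {
    awk '
        /^Internet6:/   { exit }                       # stop before the IPv6 table
        $1 == "default" { print $2, $4; found = 1; exit }
        END             { exit (found ? 0 : 1) }
    '
}

# Usage on the firewall (assumption: run from a root shell):
#   netstat -rn | default_route || echo "no IPv4 default route"
```

Reading from stdin keeps the check independent of the host, so a saved copy of the table can be tested the same way.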

@mimugmail mimugmail commented Sep 10, 2017

Re-enabling WAN_DHCP doesn't add any default gateway.
Only adding/deleting a gateway group starts a process where default gateway is added again.

@AdSchellevis AdSchellevis commented Sep 10, 2017

can you try the following in a php script?

<?php
require_once("system.inc");
require_once("util.inc");
require_once("interfaces.inc");

system_routing_configure();

Usually it should try to set up routing on linkup, but maybe the DHCP client is causing a race condition here.

@mimugmail mimugmail commented Sep 10, 2017

You mean executing this script after enabling WAN_DHCP?
Nothing happens. I created a test.php and put it in /usr/local/www/.
The script returns nothing. I also included some prints before and after the call, and they displayed fine.

@AdSchellevis AdSchellevis commented Sep 10, 2017

yes, just to be sure that the issue is persistent.
I don't think I can reproduce this easily, can you paste a screenshot of the list of gateways in the gui?

@StrikerTwo StrikerTwo commented Sep 11, 2017

This seems to be the same problem as opnsense/plugins#239. I have posted screenshots there.

@mimugmail mimugmail commented Sep 11, 2017

Gateway with only WAN (static)
image

Routes with only WAN (static)
image

Gateway with both WANs (DHCP WAN enabled)
image

Routes with both WANs (DHCP WAN enabled)
image

So the gateway is there in the Gateways view, but not in the system itself.

@AdSchellevis AdSchellevis commented Sep 11, 2017

@mimugmail I was particularly interested in the first screenshot with both interfaces enabled, can you supply that too?

@mimugmail mimugmail commented Sep 11, 2017

Sure:

image

@AdSchellevis AdSchellevis commented Sep 11, 2017

ok, this doesn't make sense, it should keep your static route as default all the time. I will see if I can reproduce this somewhere...

@AdSchellevis AdSchellevis self-assigned this Sep 11, 2017
@AdSchellevis AdSchellevis added the bug label Sep 11, 2017

@mimugmail mimugmail commented Sep 11, 2017

This would be great!

Here's the dhcp log when enabling IF:

Sep 11 10:24:08 dialin dhclient[99637]: dhclient already running, pid: 93846.
Sep 11 10:24:08 dialin dhclient[99637]: exiting.
Sep 11 10:24:09 dialin dhclient: PREINIT
Sep 11 10:24:09 dialin dhclient: Starting delete_old_states()
Sep 11 10:24:09 dialin dhclient[93846]: DHCPREQUEST on igb2 to 255.255.255.255 port 67
Sep 11 10:24:10 dialin dhclient[93846]: DHCPACK from 91.137.72.1
Sep 11 10:24:10 dialin dhclient: REBOOT
Sep 11 10:24:10 dialin dhclient: Starting delete_old_states()
Sep 11 10:24:10 dialin dhclient: Starting add_new_address()
Sep 11 10:24:10 dialin dhclient: ifconfig igb2 inet 91.137.72.243 netmask 255.255.255.0 broadcast 91.137.72.255
Sep 11 10:24:10 dialin dhclient: New IP Address (igb2): 91.137.72.243
Sep 11 10:24:10 dialin dhclient: New Subnet Mask (igb2): 255.255.255.0
Sep 11 10:24:10 dialin dhclient: New Broadcast Address (igb2): 91.137.72.255
Sep 11 10:24:10 dialin dhclient: New Routers (igb2): 91.137.72.1
Sep 11 10:24:10 dialin dhclient: Adding new routes to interface: igb2
Sep 11 10:24:10 dialin dhclient: Creating resolv.conf
Sep 11 10:24:10 dialin dhclient[93846]: bound to 91.137.72.243 -- renewal in 53091 seconds.

I switched from DHCP to static with the values I receive via DHCP and then it works. So it seems to be related to dhcp only.
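
To compare what dhclient was offered against what ends up in the routing table, here is a minimal sketch that pulls the router addresses out of log lines like the ones above. `dhcp_routers` is a hypothetical helper name, and it assumes the `New Routers (<interface>): <address>` wording shown in this log.

```shell
#!/bin/sh
# Hypothetical helper: reads dhclient log text on stdin and prints
# "<interface> <router>" for every "New Routers (<if>): <ip>" line.
dhcp_routers() {
    sed -n 's/.*New Routers (\([^)]*\)): \(.*\)$/\1 \2/p'
}

# Usage: feed it whichever log file holds the dhclient messages, e.g.
#   dhcp_routers < saved-dhclient.log
```

For the log above this should report igb2 with router 91.137.72.1, which is the gateway that never shows up as a default route.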

@AdSchellevis AdSchellevis commented Sep 11, 2017

@mimugmail do you have gateway switching enabled on this device? I can't reproduce this on my end.

Just to be sure, what does this output when both interfaces are connected and the gateway isn't there?

ls /tmp/*_defaultgw
cat /tmp/*_defaultgw
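
The two commands can be combined so each file name is printed next to its content, which is easier to paste into a reply. This is a minimal sketch; `show_defaultgw` is a hypothetical helper name, and it makes no assumption about what the files contain.

```shell
#!/bin/sh
# Hypothetical helper: prints "<file>: <content>" for every *_defaultgw
# file in the given directory (default: /tmp). Returns non-zero when no
# such file exists, which is itself useful information here.
show_defaultgw() {
    dir=${1:-/tmp}
    status=1
    for f in "$dir"/*_defaultgw; do
        [ -e "$f" ] || continue        # the glob matched nothing
        printf '%s: %s\n' "$f" "$(cat "$f")"
        status=0
    done
    return "$status"
}

# Usage: show_defaultgw || echo "no *_defaultgw files found"
```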

@mimugmail mimugmail commented Sep 11, 2017

I just found this feature here #1315
Why is this not enabled by default, and why do you want to phase it out?

@mimugmail mimugmail commented Sep 11, 2017

Ok I just tested with WAN2 as DHCP and now it works fine with gateway switching enabled.
Will there be some kind of replacement since also the help states it's a deprecated feature ( @fichtner )?

P.S.: I really appreciate your help and that you set up a test env, @AdSchellevis

@fichtner fichtner commented Sep 11, 2017

The problem with that feature is that it had multiple bugs and relies on gateway monitoring. Most bugs may be gone since 17.7, but the monitoring it relies on is still problematic.

Other problems are that there is no fallback priority / logic, we can't mark gateways to be ignored in switching.

It's really just a life saver in the short term, but a long term network failure incident.

@mimugmail mimugmail commented Sep 11, 2017

But how do you want to manage failover for local services? My tests with pf and failover group didn't work.
If you have to manage multiple firewalls remotely you need some failover mechanism to reach the units also when the line fails.

@mimugmail mimugmail commented Sep 12, 2017

I'm closing this one now since it works now with gateway switching enabled and without gwgroups.
Will try to figure out which config scenarios break the setups like combining gw switching with gw group tiering.

Thank you guys! 👍

@mimugmail mimugmail closed this Sep 12, 2017

@fichtner fichtner commented Sep 12, 2017

@mimugmail thank you! maybe we're just missing a few tweaks so thanks for pursuing this :)

@mimugmail mimugmail commented Sep 12, 2017

@fichtner Perhaps some easy tweaks, hopefully. I'm running some testbeds right now and seeing really strange behavior. But with you guys I'm quite sure we'll come a few steps nearer to perfection :)

@comotion comotion commented Oct 19, 2018

Hello, OPNsense 18.7.4 (amd64/OpenSSL) here.
Just got hit by this bug. Added a 2nd WAN interface with DHCP, lost DNS first, then could no longer ping out on the gw.

Tried to ifconfig igb2 down (the WAN2 interface) over ssh, this did not help. Tried resetting dns in /etc/resolv.conf, as this had been polluted by the WAN2 DHCP, this did not help.

I then lost access to the web GUI too, and tried hitting the "11) Reload all services" option over ssh.
This led to loss of SSH connection and loss of LAN DHCP. Eventually had to power cycle the gateway to resume production since bringing up serial or keyboard and screen wasn't an option.

Big problems for a seemingly innocent change: bringing up a 2nd WAN interface.

I did not configure a 2nd gateway before adding WAN2.
Firewall->Advanced->Default gateway switching is disabled.
System->General->Allow DNS server list to be overridden by DHCP/PPP on WAN was enabled, which probably caused the initial DNS outage, but doesn't explain the ensuing clustersnafu.

Working on a link with a hundred people on it, I'd love to know when a change has the potential to disrupt the link in an unrecoverable way.
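
For the resolv.conf pollution described above, here is a minimal sketch for listing which nameservers are active before and after bringing up the second WAN. `list_nameservers` is a hypothetical helper name. It only reads the file and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: prints the address of every "nameserver" entry in
# a resolv.conf-style file (default: /etc/resolv.conf), one per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage: save the output before the change and diff it afterwards, e.g.
#   list_nameservers > /tmp/ns.before
```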

@fichtner

This comment has been minimized.

@comotion comotion commented Oct 19, 2018

the bit pointing out the lack of guarantees: not useful.
the bit linking to another similar issue: useful. good to know that there is a fix coming.

@fichtner fichtner commented Oct 19, 2018

Sadly, it's not enough. Nobody really bothers testing, not even the OP, likely citing "a link with a hundred people on it" for the reason not to test it. It's a deadlock. Everyone gets what they offer.

@mimugmail mimugmail commented Oct 19, 2018

Ehh .. sorry, did I miss something? Shall I test something? :)

@fichtner fichtner commented Oct 19, 2018

The discussed dhclient changes are in master via #2542, no further action needed from this ticket's perspective.

@comotion comotion commented Oct 22, 2018

It would be tricky, but reporting the issue does indicate a willingness to test (at least from me) because I do rely on OPNsense and have a fair understanding of maintainership of free software - although maybe not an understanding of the particular challenges of OPNsense maintainers.
