Adding second WAN with type DHCP makes OPN unreachable #1811

Closed
mimugmail opened this issue Sep 8, 2017 · 28 comments
Assignees
Labels
bug Production bug

Comments

@mimugmail
Member

I have an OPN installation with a static WAN IP address.
When I add a second WAN of type DHCP and plug in the cable, the system gets an IP address and a second default gateway from DHCP. After this the firewall isn't reachable anymore via WAN or WAN2.

I can see the packets coming in via the WAN interfaces, but no reply is sent back on any interface.
After a reboot it works again via WAN1.

Is this normal behavior? For me it's OK to know that the FW has to be restarted, but I think this should work flawlessly in production environments.

@mimugmail
Member Author

Disabling and enabling DHCP WAN reproduces this problem.
Both WANs become unusable. When I log in via LAN I can still do some DNS resolution (DNS bound to IF?), but ping fails with "No route to host":

root@dialin:~ # ping apple.com
PING apple.com (17.178.96.59): 56 data bytes
ping: sendto: No route to host
ping: sendto: No route to host

system.log and dmesg don't show anything interesting.
It seems that having a static default gw via WAN and then receiving a second default gw via WAN_DHCP results in no default gw at all.

@AdSchellevis
Member

what does netstat -nr say?

@mimugmail
Member Author

root@dialin:~ # netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
5.10.0.115         a0:36:9f:8a:fa:ee  UHS        igb2
5.10.0.122         a0:36:9f:8a:fa:ee  UHS        igb2
8.8.8.8            217.5.204.209      UGHS       igb0
91.137.72.X/24     link#3             U          igb2
91.137.72.X        link#3             UHS         lo0
127.0.0.1          link#7             UH          lo0
192.168.169.0/24   link#5             U           em0
192.168.169.230    link#5             UHS         lo0
217.5.204.X/28     link#1             U          igb0
217.5.204.X        link#1             UHS         lo0

Internet6:
Destination                       Gateway                       Flags     Netif Expire
::1                               link#7                        UH          lo0
fe80::%igb0/64                    link#1                        U          igb0
fe80::a236:9fff:fe8a:faec%igb0    link#1                        UHS         lo0
fe80::%igb1/64                    link#2                        U          igb1
fe80::a236:9fff:fe8a:faed%igb1    link#2                        UHS         lo0
fe80::%igb2/64                    link#3                        U          igb2
fe80::a236:9fff:fe8a:faee%igb2    link#3                        UHS         lo0
fe80::%em0/64                     link#5                        U           em0
fe80::7285:c2ff:fe25:bbd4%em0     link#5                        UHS         lo0
fe80::%lo0/64                     link#7                        U           lo0
fe80::1%lo0                       link#7                        UHS         lo0

There is no default route anymore. The Google IP has a fixed route via WAN (IP 217.X), and the two DNS servers from DHCP are routed via v6(?)
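For anyone comparing their own output against this one, the missing `default` entry can also be checked programmatically. The following is a minimal Python sketch, assuming the FreeBSD-style `netstat -nr` layout shown above; the function names `ipv4_routes` and `has_default_route` are made up for illustration and are not part of OPNsense. The sample is a shortened copy of the table from this comment.

```python
def ipv4_routes(netstat_output):
    """Return (destination, gateway, flags, netif) tuples from the 'Internet:' section."""
    routes = []
    in_ipv4 = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        if stripped == "Internet:":
            in_ipv4 = True
            continue
        if stripped == "Internet6:":
            in_ipv4 = False
            continue
        fields = stripped.split()
        # Skip everything outside the IPv4 section, blank lines and the header row.
        if not in_ipv4 or len(fields) < 4 or fields[0] == "Destination":
            continue
        routes.append(tuple(fields[:4]))
    return routes


def has_default_route(netstat_output):
    """True if the IPv4 table contains a 'default' destination."""
    return any(dest == "default" for dest, _gw, _flags, _netif in ipv4_routes(netstat_output))


# Shortened copy of the routing table posted above.
SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
8.8.8.8            217.5.204.209      UGHS       igb0
91.137.72.X/24     link#3             U          igb2
127.0.0.1          link#7             UH          lo0

Internet6:
Destination                       Gateway                       Flags     Netif Expire
::1                               link#7                        UH          lo0
"""

print(has_default_route(SAMPLE))  # False
```

On a live system the text could come from running `netstat -nr` and passing its output to `has_default_route`.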

@mimugmail
Member Author

Re-enabling WAN_DHCP doesn't add any default gateway.
Only adding/deleting a gateway group starts a process in which the default gateway is added again.

@AdSchellevis
Member

can you try the following in a php script?

<?php
require_once("system.inc");
require_once("util.inc");
require_once("interfaces.inc");

system_routing_configure();

Usually it should try to set up routing on link-up, but maybe the DHCP client is causing a race condition here.

@mimugmail
Member Author

You mean executing this script after enabling WAN_DHCP?
Nothing happens. I created a test.php and put it in /usr/local/www/.
The script returns nothing; I also included some prints before and after, and those displayed fine.

@AdSchellevis
Member

yes, just to be sure that the issue is persistent.
I don't think I can reproduce this easily, can you paste a screenshot of the list of gateways in the gui?

@StrikerTwo

This seems to be the same problem as opnsense/plugins#239. I have posted screenshots there.

@mimugmail
Member Author

Gateway with only WAN (static)
image

Routes with only WAN (static)
image

Gateway with both WANs (DHCP WAN enabled)
image

Routes with both WANs (DHCP WAN enabled)
image

So in the Gateways view it's there, but not in the system itself.

@AdSchellevis
Member

@mimugmail I was particularly interested in the first screenshot with both interfaces enabled, can you supply that too?

@mimugmail
Member Author

Sure;

image

@AdSchellevis
Member

ok, this doesn't make sense, it should keep your static route as default all the time. I will see if I can reproduce this somewhere...

@AdSchellevis AdSchellevis self-assigned this Sep 11, 2017
@AdSchellevis AdSchellevis added the bug Production bug label Sep 11, 2017
@mimugmail
Member Author

This would be great!

Here's the dhcp log when enabling IF:

Sep 11 10:24:08 dialin dhclient[99637]: dhclient already running, pid: 93846.
Sep 11 10:24:08 dialin dhclient[99637]: exiting.
Sep 11 10:24:09 dialin dhclient: PREINIT
Sep 11 10:24:09 dialin dhclient: Starting delete_old_states()
Sep 11 10:24:09 dialin dhclient[93846]: DHCPREQUEST on igb2 to 255.255.255.255 port 67
Sep 11 10:24:10 dialin dhclient[93846]: DHCPACK from 91.137.72.1
Sep 11 10:24:10 dialin dhclient: REBOOT
Sep 11 10:24:10 dialin dhclient: Starting delete_old_states()
Sep 11 10:24:10 dialin dhclient: Starting add_new_address()
Sep 11 10:24:10 dialin dhclient: ifconfig igb2 inet 91.137.72.243 netmask 255.255.255.0 broadcast 91.137.72.255
Sep 11 10:24:10 dialin dhclient: New IP Address (igb2): 91.137.72.243
Sep 11 10:24:10 dialin dhclient: New Subnet Mask (igb2): 255.255.255.0
Sep 11 10:24:10 dialin dhclient: New Broadcast Address (igb2): 91.137.72.255
Sep 11 10:24:10 dialin dhclient: New Routers (igb2): 91.137.72.1
Sep 11 10:24:10 dialin dhclient: Adding new routes to interface: igb2
Sep 11 10:24:10 dialin dhclient: Creating resolv.conf
Sep 11 10:24:10 dialin dhclient[93846]: bound to 91.137.72.243 -- renewal in 53091 seconds.

I switched from DHCP to static with the values I receive via DHCP and then it works. So it seems to be related to dhcp only.
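The log shows dhclient announcing a router for igb2 and adding routes, which can be compared against what actually ends up in the routing table. As a rough aid, here is a minimal Python sketch that pulls the announced router per interface out of such a log. It assumes the `New Routers (<interface>): <address>` wording seen in the log above; the name `offered_routers` is hypothetical.

```python
import re

# Matches lines such as: "dhclient: New Routers (igb2): 91.137.72.1"
ROUTER_LINE = re.compile(r"New Routers \((?P<netif>[^)]+)\): (?P<router>\S+)")


def offered_routers(log_text):
    """Return {interface: router} for every 'New Routers' line; later lines win."""
    routers = {}
    for match in ROUTER_LINE.finditer(log_text):
        routers[match.group("netif")] = match.group("router")
    return routers


# Three lines copied from the log above.
LOG = """\
Sep 11 10:24:10 dialin dhclient: New IP Address (igb2): 91.137.72.243
Sep 11 10:24:10 dialin dhclient: New Routers (igb2): 91.137.72.1
Sep 11 10:24:10 dialin dhclient: Adding new routes to interface: igb2
"""

print(offered_routers(LOG))  # {'igb2': '91.137.72.1'}
```

If the router reported here never shows up as a gateway in `netstat -nr`, the lease was obtained but the route was not installed, which matches the symptom described in this issue.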

@AdSchellevis
Member

@mimugmail do you have gateway switching enabled on this device? I can't reproduce this on my end.

Just to be sure, what does this output when both interfaces are connected and the gateway isn't there?

ls /tmp/*_defaultgw
cat /tmp/*_defaultgw
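The same check can be scripted when it has to be repeated while toggling the interface. This is a minimal Python sketch, assuming only that files matching `/tmp/*_defaultgw` exist and contain short text; what exactly each file holds is not spelled out in this thread, and the name `read_defaultgw_files` is hypothetical.

```python
import glob
import os


def read_defaultgw_files(tmp_dir="/tmp"):
    """Map each *_defaultgw file name in tmp_dir to its stripped contents."""
    result = {}
    for path in sorted(glob.glob(os.path.join(tmp_dir, "*_defaultgw"))):
        if not os.path.isfile(path):
            continue  # ignore directories or other non-regular entries
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            result[os.path.basename(path)] = handle.read().strip()
    return result


if __name__ == "__main__":
    found = read_defaultgw_files()
    if not found:
        print("no *_defaultgw files found")
    for name, content in found.items():
        print(f"{name}: {content or '(empty)'}")
```

Running it once with only WAN connected and once with both interfaces connected gives two outputs that can be pasted side by side.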

@mimugmail
Member Author

I just found this feature here #1315
Why is this not enabled by default, and why do you want to phase it out?

@mimugmail
Member Author

mimugmail commented Sep 11, 2017

Ok I just tested with WAN2 as DHCP and now it works fine with gateway switching enabled.
Will there be some kind of replacement, since the help text also states it's a deprecated feature (@fichtner)?

P.S.: I really appreciate your help and you setting up a test env, @AdSchellevis.

@fichtner
Member

The problem with that feature is that it had multiple bugs and relies on gateway monitoring. Most bugs may be gone since 17.7, but the monitoring it relies on is still problematic.

Other problems are that there is no fallback priority / logic, and we can't mark gateways to be ignored in switching.

It's really just a lifesaver in the short term, but a network failure incident in the long term.

@mimugmail
Member Author

But how do you want to manage failover for local services? My tests with pf and a failover group didn't work.
If you have to manage multiple firewalls remotely, you need some failover mechanism to reach the units even when a line fails.

@mimugmail
Member Author

I'm closing this one since it now works with gateway switching enabled and without gwgroups.
I'll try to figure out which config scenarios break such setups, like combining gw switching with gw group tiering.

Thank you guys! 👍

@fichtner
Member

@mimugmail thank you! maybe we're just missing a few tweaks so thanks for pursuing this :)

@mimugmail
Member Author

@fichtner Perhaps some easy tweaks, hopefully. I'm running some testbeds right now and seeing really strange behavior. But with you guys I'm quite sure we'll come a few steps nearer to perfection :)

@comotion
Contributor

Hello, OPNsense 18.7.4 (amd64/OpenSSL) here.
Just got hit by this bug. Added a 2nd WAN interface with DHCP, lost DNS first, then could no longer ping out on the gw.

Tried to ifconfig igb2 down (the WAN2 interface) over ssh, this did not help. Tried resetting dns in /etc/resolv.conf, as this had been polluted by the WAN2 DHCP, this did not help.

I then lost access to the web GUI too, and tried hitting the "11) Reload all services" option over ssh.
This led to loss of SSH connection and loss of LAN DHCP. Eventually had to power cycle the gateway to resume production since bringing up serial or keyboard and screen wasn't an option.

Big problems for a seemingly innocent change: bringing up a 2nd WAN interface.

I did not configure a 2nd gateway before adding WAN2.
Firewall->Advanced->Default gateway switching is disabled.
System->General->Allow DNS server list to be overridden by DHCP/PPP on WAN was enabled, which probably caused the initial DNS outage, but doesn't explain the ensuing clustersnafu.

Working on a link with a hundred people on it, I'd love to know when a change has the potential to disrupt the link in an unrecoverable way.

@fichtner
Member

@comotion
Contributor

the bit pointing out the lack of guarantees: not useful.
the bit linking to another similar issue: useful. good to know that there is a fix coming.

@fichtner
Member

Sadly, it's not enough. Nobody really bothers testing, not even the OP, likely citing "a link with a hundred people on it" for the reason not to test it. It's a deadlock. Everyone gets what they offer.

@mimugmail
Member Author

Ehh .. sorry, did I miss something? Shall I test something? :)

@fichtner
Member

The discussed dhclient changes are in master via #2542, no further action needed from this ticket's perspective.

@comotion
Contributor

It would be tricky, but reporting the issue does indicate a willingness to test (at least from me) because I do rely on OPNsense and have a fair understanding of maintainership of free software - although maybe not an understanding of the particular challenges of OPNsense maintainers.
