No error message for non-supported multi WAN w/ single gateway IP setup #6576

Closed
2 tasks done
sjjh opened this issue May 25, 2023 · 25 comments
Labels
help wanted Contributor missing / timeout support Community support

Comments

@sjjh

sjjh commented May 25, 2023

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

I can create a multi WAN setup with two (PPPoE) gateways which get their IP addresses via DHCP from the ISP and both have the same upstream gateway IP. This setup is apparently not supported by FreeBSD (as multipath is disabled due to other issues), see: https://forum.opnsense.org/index.php?topic=34189.0 Although this setup is not supported, no error message or warning is shown in the Web GUI.

To Reproduce

Just create two gateways having the same upstream gateway address.

Expected behavior

It works or an error message is shown.

Describe alternatives you considered

Not using multi WAN.

Additional context

The resulting feature request would be to compare the upstream gateway IP addresses of the configured gateways. If two gateways have the same gateway IP address, show a prominent error message in the Web GUI and write an error message to the log file.
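
A minimal sketch of such a check, assuming the gateway addresses are already available as a name-to-address mapping. The function name `find_shared_gateway_addresses`, the logger name and the sample addresses are hypothetical and not part of OPNsense; the addresses come from the documentation ranges.

```python
import logging
from collections import defaultdict

log = logging.getLogger("gateway_check")  # hypothetical logger name


def find_shared_gateway_addresses(gateways):
    """Return {address: [gateway names]} for addresses used by two or more gateways."""
    by_address = defaultdict(list)
    for name, address in gateways.items():
        if address:  # skip gateways whose address is not known yet
            by_address[address].append(name)
    shared = {
        address: sorted(names)
        for address, names in by_address.items()
        if len(names) > 1
    }
    for address, names in shared.items():
        log.error("gateways %s share upstream address %s", ", ".join(names), address)
    return shared


# Illustrative data only, not taken from a real system.
example = {
    "GW_Internet_WAN": "192.0.2.1",
    "GW_VoIP_WAN": "192.0.2.1",
    "GW_Other": "198.51.100.1",
}
print(find_shared_gateway_addresses(example))
```

For dynamic gateways the mapping would have to be rebuilt whenever the ISP hands out a new address, since the conflict can appear at runtime rather than at configuration time.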

Environment

OPNsense 23.1.7_3-amd64
FreeBSD 13.1-RELEASE-p7
OpenSSL 1.1.1t 7 Feb 2023

@fichtner
Member

Quick test adding a second gateway...

The following input errors were detected:

The gateway IP address "10.3.0.2" already exists.

@fichtner fichtner added the support Community support label May 25, 2023
@fichtner
Member

If you are talking about unspecified dynamic gateways delivered by the ISP at runtime... I'm not even sure how and where to present that.

@sjjh
Author

sjjh commented May 25, 2023

Sorry, did not test it with a manual config. It's about DHCP (as stated in the initial post).
Probably in most cases when a second gateway is created using DHCP, the connection will be established very quickly and thus the issue will be noticeable right away. So I could imagine presenting the error message on the gateway > single gateway screen. Even if it occurs later, I would imagine someone will sooner or later look at the gateway screen and see an error message there.
Additionally I believe a log entry might be helpful.

@fichtner
Member

PPPoE and DHCP are distinct, but routers are provided in both cases. These routers are written to files on the disk:

# ls /tmp/*_router

To my knowledge the problem first and foremost is that you cannot simultaneously push traffic through both WANs if they have the same gateway address. The second one is dead. Default gateway switching still works, but since the single point of failure is your ISP gateway the failover point is rather moot.

So typically you only use such a setup if you want to bundle two connections in order to use doubled bandwidth. All these constraints may or may not apply to the case at hand. My reluctance here is adding an error to a functional setup as well. Some people don't mind or haven't noticed. Not sure where the sweet spot for this request shall be.

Cheers,
Franco

@sjjh
Author

sjjh commented May 25, 2023

In our setup we are using a 1Gbit/s link for "most" traffic, and a 30Mbit/s link dedicated only for VoIP traffic (so no bundling to double the bandwidth). The gateway to use is selected by firewall / NAT rules (the internal VoIP traffic is coming from a separate VLAN).
This seems to work, and it does not look as if one interface were completely dead.
We are experiencing irregular issues if the web-gateway goes down (e.g. after connection breakages or reboots): the web traffic then uses the VoIP-gateway and does not come back to the web-WAN it should use. (The web-gateway is marked as upstream and default, and has a higher priority, i.e. a lower number, than the VoIP-gateway.)

I understood (but might be wrong due to my limited understanding) that the setup with two gateways and only one upstream gateway address is not supported at all. Thus I thought that an error message would be helpful (it would at least have saved me quite some hours of research on the net). If one gateway were dead (even if people did not notice), it still sounds sensible to me to show an error message to make them aware of that fact.
Right now, I at least wasn't aware of the root cause of the problem, and it cost me quite some time to research.

@sjjh
Author

sjjh commented May 25, 2023

Probably obvious, but in case it helps, yes, both files contain the same IP address:

$ ls /tmp/pppoe*_router
/tmp/pppoe1_router	/tmp/pppoe2_router
$ diff /tmp/pppoe1_router /tmp/pppoe2_router
$ 
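
The same comparison can be scripted for any number of interfaces. A rough sketch, assuming each `*_router` file holds a single address as text (which the empty `diff` output above is consistent with); `read_router_files` and `interfaces_by_router` are hypothetical helpers, not OPNsense tools.

```python
from pathlib import Path


def read_router_files(directory="/tmp"):
    """Map interface name -> router address from <interface>_router files."""
    routers = {}
    for path in sorted(Path(directory).glob("*_router")):
        interface = path.name[: -len("_router")]
        routers[interface] = path.read_text().strip()
    return routers


def interfaces_by_router(routers):
    """Invert the mapping so interfaces sharing a router end up together."""
    grouped = {}
    for interface, address in routers.items():
        grouped.setdefault(address, []).append(interface)
    return grouped


if __name__ == "__main__":
    for address, interfaces in interfaces_by_router(read_router_files()).items():
        marker = "SHARED" if len(interfaces) > 1 else "unique"
        print(f"{address}\t{marker}\t{' '.join(interfaces)}")
```

Any line marked `SHARED` would correspond to the situation described in this issue.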

@fichtner
Member

Do you have a gateway group set? Loss and delay triggers are broken currently, see #6231

Cheers,
Franco

@sjjh
Author

sjjh commented May 25, 2023

No, no gateway group is used. We also disabled gateway monitoring: as we have no fallback anyway, it adds no value (and could potentially only lead to false positives).

@fichtner
Member

But you are using default gateway switching? I’m not sure how that works without proper monitoring.

@sjjh
Author

sjjh commented May 25, 2023

no, no switching at all. Just two gateways, for specific traffic:

  • GW_Internet_WAN (1Gbit/s) -> all the web surfing traffic, email, ...
  • GW_VoIP_WAN (30Mbit/s) -> only VoIP traffic (we have a PBX on premise, using SIP trunking)
    Reason for that setup: phoning should still work, even if surfing takes all the bandwidth. QoS, Shaping, ... did not work very well, thus we decided to use two separate gateways.

@fichtner
Member

Are both set to upstream gateway? Can you explain "web-gateway goes down" a little more?

Thanks,
Franco

@sjjh
Author

sjjh commented May 26, 2023

Only the "web" gateway is set to upstream.

With "web-gateway goes down" I mean occasions such as power loss of the firewall, a disconnected cable, a reboot of the firewall, taking the gateway down in software, the gateway being forced down by a (false positive) gateway monitoring result, ... i.e. all situations in which the interface is not up. In some (not all) of these cases we then have the issue described above: all traffic uses only the VoIP gateway and sticks there, even when both gateways are available again. My expectation was that as soon as the web-gateway comes up again, it would be used again (due to priority, marking as upstream, ...), but it is not. Often the only thing that helps is taking the VoIP-gateway down; after a while the traffic then switches back to the web-gateway.

@fichtner
Member

Ok, when the traffic is stuck on the VOIP WAN will this resolve it?

# /usr/local/etc/rc.filter_configure

If this doesn't work you could also try

# /usr/local/etc/rc.routing_configure

But I suspect the first one will work.

Cheers,
Franco

@sjjh
Author

sjjh commented May 27, 2023

Will try, when I experience the problem next time, and report back.

@sjjh
Author

sjjh commented Jun 22, 2023

So, after maintenance work by our ISP last night that cut off the uplink, we had the same issue this morning: the web traffic was using the wrong gateway.
I tried both /usr/local/etc/rc.filter_configure and /usr/local/etc/rc.routing_configure, and neither worked. I also tried reconnecting both gateways in the web UI under Interfaces > Overview > reload, which did not work either. Only editing the gateways under System > Gateways > Single (enabling the monitoring and reapplying the changes) brought the traffic back to the correct gateway.

@fichtner
Member

That seems to indicate gateway monitoring (dpinger) plays a bigger role in the decision here. It would appear that dpinger is perhaps "stuck" on the second link. Have you tried to disable host routes for the gateways?

Can you share the gateway log during the event and fix?

The development version has improved gateway monitor handling and recovery, but perhaps due to the same gateway IP this might still be an OS problem of sorts.

Cheers,
Franco

@sjjh
Author

sjjh commented Jun 22, 2023

Have you tried to disable host routes for the gateways?

sry, not sure. Can you point me to the setting you are talking about?

Can you share the gateway log during the event and fix?

root@fw:/var/log/gateways # ls -l
total 184
-rw-------  1 root  wheel  10557 Mar 30 13:55 gateways_20230330.log
-rw-------  1 root  wheel  57752 Mar 31 23:58 gateways_20230331.log
-rw-------  1 root  wheel  99903 Apr  1 20:56 gateways_20230401.log
-rw-------  1 root  wheel   3875 Apr 24 12:35 gateways_20230424.log
-rw-------  1 root  wheel    115 May 23 20:08 gateways_20230523.log
-rw-------  1 root  wheel    932 Jun 22 07:53 gateways_20230622.log
lrwxr-x---  1 root  wheel     39 Jun 22 08:01 latest.log -> /var/log/gateways/gateways_20230622.log
root@fw:/var/log/gateways # cat latest.log 
<12>1 2023-06-22T07:51:55+02:00 fw.example.com dpinger 29060 - [meta sequenceId="1"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 8.8.8.8  bind_addr n.n.n.1  identifier "GW_INTERNET_WAN_PPPOE "
<12>1 2023-06-22T07:51:55+02:00 fw.example.com dpinger 30434 - [meta sequenceId="2"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 8.8.4.4  bind_addr n.n.n.2  identifier "GW_VOIP_WAN_PPPOE "
<12>1 2023-06-22T07:52:48+02:00 fw.example.com dpinger 29060 - [meta sequenceId="3"] exiting on signal 15
<12>1 2023-06-22T07:52:48+02:00 fw.example.com dpinger 30434 - [meta sequenceId="4"] exiting on signal 15
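
For comparing monitor settings across gateways, lines like the ones above can be split into fields. This is only a sketch: the pattern is inferred from this excerpt alone, not from dpinger or syslog documentation, and `parse_gateway_log_line` is a hypothetical helper.

```python
import re

# Prefix as it appears in the excerpt above:
# <PRI>VERSION TIMESTAMP HOST APP PID MSGID [structured data] message
LINE_RE = re.compile(
    r"^<\d+>\d+ (?P<timestamp>\S+) (?P<host>\S+) (?P<app>\S+) (?P<pid>\d+) \S+ "
    r"\[[^\]]*\] (?P<message>.*)$"
)
# Key/value pairs of the dpinger startup line; a value may be quoted.
PAIR_RE = re.compile(r'(\w+) ("[^"]*"|\S+)')


def parse_gateway_log_line(line):
    """Return a dict of header fields plus startup parameters, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["params"] = {}
    if entry["message"].startswith("send_interval"):
        for key, value in PAIR_RE.findall(entry["message"]):
            entry["params"][key] = value.strip('"').strip()
    return entry


# First line of the excerpt above, copied verbatim.
sample = (
    '<12>1 2023-06-22T07:51:55+02:00 fw.example.com dpinger 29060 - [meta sequenceId="1"] '
    'send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  '
    'data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  '
    'dest_addr 8.8.8.8  bind_addr n.n.n.1  identifier "GW_INTERNET_WAN_PPPOE "'
)
entry = parse_gateway_log_line(sample)
print(entry["pid"], entry["params"]["dest_addr"], entry["params"]["identifier"])
```

Lines without startup parameters, such as the "exiting on signal 15" entries, still yield the header fields with an empty `params` dict.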

@fichtner
Member

It's a setting for each individual gateway: "Disable Host Route"

So after 07:52:48 the first line was up being used again? It's a bit strange since rc.routing_configure will also restart all monitors.

Cheers,
Franco

@sjjh
Author

sjjh commented Jun 22, 2023

It's a setting for each individual gateway: "Disable Host Route"

Sorry, overlooked that one. It's not disabled. Shall I disable it for both and check if it makes a difference next time? (If there is a next time: due to the ongoing issues we are currently considering abandoning the second gateway and just using one, as long as bandwidth permits.)

So after 07:52:48 the first line was up being used again?

Yes, after the mentioned steps in my above post the gw internet WAN worked again as expected.

@fichtner
Member

The fix via "apply" is rather strange: in a nutshell, the GUI is calling /usr/local/etc/rc.routing_configure. Just to make sure gateway monitor is now disabled (option checked).

Yes you can check disable host route setting, but it only makes sense if monitor itself is enabled (option unchecked).

Cheers,
Franco

@sjjh
Author

sjjh commented Jun 22, 2023

Just to make sure gateway monitor is now disabled (option checked).

It initially was disabled (option checked), I enabled it (to have a change I could apply), and then disabled it again (and applied again), for both gateways respectively.

Yes you can check disable host route setting, but it only makes sense if monitor itself is enabled (option unchecked).

Which it is not (the monitor is disabled). I'll nevertheless enable the setting, since it cannot hurt, and we will (maybe) see next time whether it makes any difference.

@sjjh
Author

sjjh commented Jul 30, 2023

FYI: We removed the second gateway (as it is not supported, as stated in the initial posting) to rule this out as a root cause for other connection problems. Thus I will not be able to test/debug this any further. The initial bug/feature request is IMHO nevertheless valid, thus leaving this bug open. :)

@syserr0r
Contributor

syserr0r commented Jul 31, 2023

We have something similar with 3 WAN links to the same ISP and currently with the same gateway address (this was not always the case):

  • WAN [priority 10]
  • WAN2 [priority 20]
  • WAN_VOIP [priority 100, 'upstream gateway' unticked in gateways]

We use gateway rules to enforce traffic from our VOIP server over the WAN_VOIP interface. We use similar rules for assigning certain traffic to certain interfaces. Remaining traffic goes over a gateway group balancing WAN and WAN2.

I was honestly not aware this was unsupported.

Things that I have noticed that might not be working:

  • Gateway monitoring for WAN2 and WAN_VOIP shows 100% loss even though these links appear to be working. WAN seems OK
    • This likely also "breaks" the gateway group, causing it to use only WAN and not really balance
  • Static routes added with a gateway of WAN2 or WAN_VOIP show up as WAN in the routes status once applied (presumably the route is added by IP, not by interface, which is likely also why gateway monitoring is broken)

We are currently in contact with the ISP to see if we can get different gateway IPs assigned.

I am happy to provide some testing, although it is a production system, so I am wary of anything that might affect client connectivity.

I have now disabled gateway monitoring on WAN2/WAN_VOIP and will see what the impact (if any) is on the gateway groups.

@AdSchellevis
Member

Overlapping networks break normal (destination) routing constraints; this is an issue on most platforms. It's like telling the mailman that the same address exists at different locations, in which case a letter might be delivered to either one at random.

In theory it should be possible to define virtual overlapping networks using fibs (https://man.freebsd.org/cgi/man.cgi?query=setfib), but it comes with quite some constraints (the running application has to choose which virtual network it lives on). Unfortunately that's not an easy scenario to support from our end. If I'm not mistaken, the problem is similar in Linux, but solvable using VRF (https://docs.kernel.org/networking/vrf.html), which probably has similar challenges.
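
A toy model of the constraint, purely for illustration: `ToyRoutingTable`, the interface names and the address are made up, and this is not how the FreeBSD kernel is implemented. In this model a second route to the same destination replaces the first; a real kernel may instead reject it, but either way only one entry per destination can exist in a single table.

```python
class ToyRoutingTable:
    """Destination-keyed table: at most one entry per destination address."""

    def __init__(self):
        self.routes = {}

    def add_host_route(self, destination, interface):
        # The destination address is the only key, so in this toy model
        # a second route to the same destination replaces the first.
        self.routes[destination] = interface

    def lookup(self, destination):
        return self.routes.get(destination)


# Single table: both WANs are handed the same gateway address.
table = ToyRoutingTable()
table.add_host_route("192.0.2.1", "pppoe1")
table.add_host_route("192.0.2.1", "pppoe2")
print(table.lookup("192.0.2.1"))  # only one interface is left for this address

# Separate tables (the idea behind FIBs and VRFs): each table keeps its own
# entry, but every lookup has to state which table it uses.
fibs = {0: ToyRoutingTable(), 1: ToyRoutingTable()}
fibs[0].add_host_route("192.0.2.1", "pppoe1")
fibs[1].add_host_route("192.0.2.1", "pppoe2")
print(fibs[0].lookup("192.0.2.1"), fibs[1].lookup("192.0.2.1"))
```

The second half shows why separate tables help only if something selects the table for each lookup, which corresponds to the constraint that the application has to choose its virtual network.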

@OPNsense-bot

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@OPNsense-bot OPNsense-bot closed this as not planned Nov 21, 2023
@OPNsense-bot OPNsense-bot added the help wanted Contributor missing / timeout label Nov 21, 2023