WAN_DHCP6 gateway fails in certain multi WAN setups #3604

Closed
RehaagJ opened this issue Jul 27, 2019 · 47 comments
Labels
support Community support

Comments

@RehaagJ

RehaagJ commented Jul 27, 2019

[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[X] I have searched the existing issues and I'm convinced that mine is new.

WAN Gateway IPv6 issues started with 19.7. While this is similar to #3601 , I don't think it's the same (but maybe related; one big difference is that both my IPv6 gateways have valid IPv6 addresses, not just link local).

Setup: OPNsense 19.7.1 with WAN, LAN, DMZ and some more internal networks. WAN gets both IPv4 and IPv6 via DHCP, IPv6 sending prefix hint (size 56), directly send SOLICIT checked, prevent release checked.
Then there is an OpenVPN client that builds a tunnel to a service that provides fixed IPv4 and IPv6 addresses that I use for the DMZ (policy based routing); all other internal networks follow WAN.

That worked up to and including 19.1.10. With 19.7, I lost routing through the WAN_DHCP6 interface as soon as the VPN tunnel came up. In the seconds between establishing the WAN connection and bringing up the tunnel, WAN_DHCP6 was fine. A new PTYOPENVPN_VPNV6 interface was added automatically and declared default (duplicating my existing OpenVPN IPv6 gateway, which I then removed). Once that happened, the WAN_DHCP6 interface was not only no longer the default, but also no longer usable in policy-based routing.
As a workaround, I introduced a route-up script for OpenVPN that removes the new, wrong default IPv6 route and adds the WAN interface back as the default route. With that, I had the old functionality back (even though PTYOPENVPN_VPNV6 is still displayed as the active gateway). Dpinger correctly showed WAN_DHCP6 as online and measured times and loss.
With 19.7.1, WAN_DHCP6 vanished from the list of gateways, but the workaround still works. Dpinger now fails for WAN_DHCP6 because it cannot see the gateway.

While this workaround makes it possible to stay on 19.7.x, it is lacking features (like gateway monitoring), and is not robust (there can be situations where the default ipv6 route goes back to the OpenVPN tunnel).
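
For readers who want to try the same workaround, here is a minimal sketch of what such a route-up hook could look like. It is not the script used in this report, which was not posted. It assumes the FreeBSD `route` utility and that the script is called through OpenVPN's `route-up` option (which requires `script-security 2`). `WAN_GW6`, `DRY_RUN`, `run` and `restore_wan_default6` are made-up names, and the gateway address is a placeholder. By default the sketch only prints the commands it would run.

```sh
#!/bin/sh
# Sketch of an OpenVPN route-up hook that puts the IPv6 default route
# back on the WAN gateway after the tunnel has taken it over.
# WAN_GW6 is a placeholder address; DRY_RUN=1 (the default) only prints
# the commands, DRY_RUN=0 executes them.
WAN_GW6=${WAN_GW6:-"fe80::1%em0"}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: restore_wan_default6 GATEWAY_ADDRESS
restore_wan_default6() {
    # Remove whatever IPv6 default route is currently installed.
    run route -n delete -inet6 default
    # Install the WAN gateway as the IPv6 default route again.
    run route -n add -inet6 default "$1"
}

restore_wan_default6 "$WAN_GW6"
```

As noted above, a hook like this is not robust: anything that reinstalls the tunnel's default route later on undoes it, and it does nothing for gateway monitoring.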

Expected behavior
Multiple IPv6 interfaces should coexist, not compete with each other. It should be possible to define the one that should be default freely. I understand that the change in gateway handling removed the default selection on purpose, but the options for priority and/or upstream gateway should be able to make the preferred gateway the active one.

Environment
OPNsense 19.7.1 (amd64, OpenSSL).

@AdSchellevis
Member

Default is called upstream now, and it is evaluated in combination with priority.

You should be able to choose freely between gateways, and the overview page should now be ordered according to real priority. (If no upstream gateway is found, it considers non-upstream gateways as well, but it always ranks upstream gateways higher.)
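
To illustrate the ordering described here, a small sketch follows. It is not the OPNsense implementation; it only models the stated rule: upstream gateways come first, then the lower priority number wins. The input format (`NAME PRIORITY UPSTREAM`, one gateway per line) and the function names `order_gateways` and `pick_default` are assumptions made for this example, and gateway status (up or down) is not modelled.

```sh
#!/bin/sh
# Reads lines of "NAME PRIORITY UPSTREAM" on stdin (UPSTREAM is 1 or 0)
# and prints them in preference order: upstream gateways first, then by
# ascending priority number.
order_gateways() {
    sort -k3,3nr -k2,2n
}

# Prints the name of the most preferred gateway from the same input.
pick_default() {
    order_gateways | awk 'NR == 1 { print $1 }'
}
```

For example, `printf '%s\n' 'VPN_V6 255 0' 'WAN_DHCP6 254 1' | pick_default` selects `WAN_DHCP6`. With no upstream gateway in the list, the lowest priority number among the remaining gateways is selected instead.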

It would be good to have some additional documentation, but unfortunately we haven't found the time yet to write an extra doc for this.

AdSchellevis added the support (Community support) label Jul 27, 2019
@RehaagJ
Author

RehaagJ commented Jul 27, 2019 via email

@RehaagJ
Author

RehaagJ commented Jul 27, 2019

Hi again,

I believe the label "support" is not correct. This is clearly a bug. Please let me know how I can help to solve it (log excerpts, other testing etc...). Thanks!

@mr44er

mr44er commented Jul 28, 2019

I experimented a bit more. Setting a priority only works for IPv4.

Setting upstream and/or a lower priority while wan1ipv6gw is marked as 'down' does nothing. Keeping that setting and re-enabling the interface has no effect either. Trying it the other way round, with the settings on wan2ipv6gw, again does nothing.

By the way, multi-WAN load balancing breaks with 3 upstreams on tier 1: it then randomly uses only 2 of the 3 upstreams, regardless of the weight setting. Even weight 3x5 does nothing.
I think I can rule out a config error, because it works regardless of which combination of 2 gateways I put in tier 1, but, as said, only with at most 2.

@cmb991

cmb991 commented Jul 28, 2019

The issue is also in IPv4 when trying to assign gateways to dns servers.

@AdSchellevis
Member

If there is an issue, the first questions are what exactly the system_gateways.php page shows and what "Gateway switching" is set to in system_general.php.

A lot of issues are related to misconfigured gateway monitoring: if the status overview shows a gateway as down, it's not a valid target for gateway switching.

@RehaagJ
Author

RehaagJ commented Jul 28, 2019

Gateway switching is off, but I tried many iterations of different settings combinations with it being on also - didn't make any difference.
Here's the system_gateways.php page. Note that it does not show the WAN_DHCP6 gateway any more, which it still did in 19.7 (without being able to use it, though).
The missing gateway still exists in config.xml, and when it first went missing in 19.7.1, I tried to add it again - still not showing in the gateway page, but I ended up with three copies of the same gateway definition in config.xml (now cleaned up again so that only the original WAN_DHCP6 is there).

Gateways

@AdSchellevis
Member

@RehaagJ if it doesn't show the interface it doesn't consider it a valid alternative, which is slightly different than 19.1.x, which could show unusable gateways as well.

In this case it's likely an issue with the interface configuration, can you check the following on a console (and paste the output):

ls -als /tmp/*gw*

Then check the contents of every file returned to see whether the interface gateway is there.
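
The two steps above (list the files, then look inside each one) can be combined. This is a minimal sketch for a POSIX shell; `show_gateway_files` is a made-up helper name, and the directory and name fragment are parameters so the same helper can be pointed at other file names later.

```sh
#!/bin/sh
# Print every regular file in DIR whose name contains SUBSTRING,
# followed by its contents.
# Usage: show_gateway_files DIR SUBSTRING
show_gateway_files() {
    dir=$1
    substr=$2
    found=0
    for f in "$dir"/*"$substr"*; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        found=1
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
    if [ "$found" -eq 0 ]; then
        printf 'no files matching *%s* in %s\n' "$substr" "$dir"
    fi
}
```

Calling `show_gateway_files /tmp gw` covers the same files as the `ls` command above and prints each address next to its file name.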

@RehaagJ
Author

RehaagJ commented Jul 28, 2019

Here's the output:
4 -rw-r--r-- 1 root wheel 21 Jul 28 11:56 /tmp/ovpnc1_defaultgwv6
That's the only line, and the content of that file is the gateway address of the OpenVPN IPv6 gateway.

What you say about the WAN_DHCP6 gateway being considered invalid makes sense. So the question is: Why would it be considered invalid?

@AdSchellevis
Member

It seems that your interface didn't receive a gateway (and thus lacks an XXX_defaultgwv6 file). Can you post your interface configuration page? Maybe there are some clues in there. The gateway page itself seems to be doing the right thing here.

@RehaagJ
Author

RehaagJ commented Jul 28, 2019

image

image

@AdSchellevis
Member

You probably have to check the logging when connecting to DHCPv6. If I'm not mistaken, the rtsold script should write the gateway after a successful lease (using /var/etc/rtsold_*.sh).

If the interface itself has received a correct address, it might also be a provider specific thing, it doesn't look related to gateway switching.

@fichtner
Member

This is the case where SOLICIT is sent directly because the ISP doesn't offer a router for radvd, hence no GW file.

@RehaagJ
Author

RehaagJ commented Jul 28, 2019

OK. Since this worked fine until 19.1.10 with the same configuration: There probably never was a GW file then, but still it was possible to configure and use the gateway. Apparently, the new gateway handling relies on the GW files now. Understood correctly?

@fichtner
Member

fichtner commented Jul 28, 2019

We should generate the gateway from dhcp6c as well if radvd doesn't fly. I've already promised to look into it elsewhere.

@mr44er

mr44er commented Jul 28, 2019

it doesn't look related to gateway switching.

Right, the switching or policy base routing works for me. The problem starts when GWs are shown as offline for no reason.

The 'working' state:

wan1
###################
image
###################
image
###################
root@fw1:/ # ls -als /tmp/*gw*
4 -rw-r----- 1 root wheel 12 Jul 27 04:46 /tmp/pppoe0_defaultgw
4 -rw-r----- 1 root wheel 26 Jul 28 04:00 /tmp/pppoe0_defaultgwv6

######################

Now only activating ipv6 on wan2:
wan2
#######################
image
Note that it immediately takes WAN2 as active, but I neither set it to upstream nor changed any setting at all for this GW.

root@fw1:/ # ls -als /tmp/*gw*
4 -rw-r----- 1 root wheel 26 Jul 28 15:33 /tmp/em1_defaultgwv6
4 -rw-r----- 1 root wheel 12 Jul 27 04:46 /tmp/pppoe0_defaultgw
4 -rw-r----- 1 root wheel 26 Jul 28 04:00 /tmp/pppoe0_defaultgwv6
root@fw1:/ #

At this state my routing is broken, because wan1gwipv6 is shown as dead.

@AdSchellevis
Member

@RehaagJ I said /tmp/*default*, but I meant /tmp/*router* (which in this case is written at the same time). The dynamic entries follow approximately the same logic as the old version, with the exception that they will only show if valid now. For reference, the dynamic items are considered here:

// add dynamic gateways
foreach ($definedIntf as $ifname => $ifcfg) {
    if (empty($ifcfg['enable'])) {
        // only consider active interfaces
        continue;
    }
    foreach (["inet", "inet6"] as $ipproto) {
        // filename suffix and interface type as defined in the interface
        $descr = !empty($ifcfg['descr']) ? $ifcfg['descr'] : $ifname;
        $fsuffix = $ipproto == "inet6" ? "v6" : "";
        $ctype = self::convertType($ipproto, $ifcfg);
        $ctype = $ctype != null ? $ctype : "GW";
        // default configuration, when not set in gateway_item
        $thisconf = [
            "interface" => $ifname,
            "weight" => 1,
            "ipprotocol" => $ipproto,
            "name" => strtoupper("{$descr}_{$ctype}"),
            "descr" => "Interface " . strtoupper("{$descr}_{$ctype}") . " Gateway",
            "monitor_disable" => true, // disable monitoring by default
            "if" => $ifcfg['if'],
            "dynamic" => true,
            "virtual" => true
        ];
        // set default priority
        if (strstr($ifcfg['if'], 'gre') || strstr($ifcfg['if'], 'gif') || strstr($ifcfg['if'], 'ovpn')) {
            // consider tunnel type interfaces least attractive by default
            $thisconf['priority'] = 255;
        } else {
            $thisconf['priority'] = 254;
        }
        // locate interface gateway settings
        if (!empty($dynamic_gw[$ifname])) {
            foreach ($dynamic_gw[$ifname] as $gw_arr) {
                if ($gw_arr['ipprotocol'] == $ipproto) {
                    // dynamic gateway for this ip protocol found, use config
                    $thisconf = $gw_arr;
                    break;
                }
            }
        }
        // dynamic gateways dump their address in /tmp/[IF]_router[FSUFFIX]
        if (!empty($thisconf['virtual']) && in_array($thisconf['name'], $reservednames)) {
            // if name is already taken, don't try to add a new (virtual) entry
            null;
        } elseif (file_exists("/tmp/{$ifcfg['if']}_router".$fsuffix)) {
            $thisconf['gateway'] = trim(@file_get_contents("/tmp/{$ifcfg['if']}_router".$fsuffix));
            if (empty($thisconf['monitor_disable']) && empty($thisconf['monitor'])) {
                $thisconf['monitor'] = $thisconf['gateway'];
            }
            $gwkey = $this->newKey($thisconf['priority'], !empty($thisconf['defaultgw']));
            $this->cached_gateways[$gwkey] = $thisconf;
        } elseif (substr($ifcfg['if'], 0, 5) == "ovpnc") {
            // other predefined types, only bound by interface (e.g. openvpn)
            $gwkey = $this->newKey($thisconf['priority'], !empty($thisconf['defaultgw']));
            // gateway should only contain a valid address, make sure its empty
            unset($thisconf['gateway']);
            $this->cached_gateways[$gwkey] = $thisconf;
        } elseif (empty($thisconf['dynamic'])) {
            $gwkey = $this->newKey($thisconf['priority'], !empty($thisconf['defaultgw']));
            // gateway should only contain a valid address, make sure its empty
            unset($thisconf['gateway']);
            $this->cached_gateways[$gwkey] = $thisconf;
        }
    }
}
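
The file check in the code above can be reproduced from a console to see why a dynamic gateway is missing from the overview. The following is only a sketch of that one rule (`/tmp/[IF]_router[FSUFFIX]` must exist), written for a POSIX shell. `router_file` and `dynamic_gateway` are made-up helper names, and the optional third argument (a directory other than `/tmp`) exists only to make the helpers easy to try out.

```sh
#!/bin/sh
# Print the router file path for an interface and protocol (inet or inet6),
# following the /tmp/[IF]_router[FSUFFIX] convention.
# Usage: router_file IF PROTO [DIR]
router_file() {
    ifname=$1
    proto=$2
    dir=${3:-/tmp}
    suffix=""
    if [ "$proto" = "inet6" ]; then
        suffix="v6"
    fi
    printf '%s/%s_router%s\n' "$dir" "$ifname" "$suffix"
}

# Print the gateway address if the router file exists. Otherwise report
# the missing file on stderr and return 1.
# Usage: dynamic_gateway IF PROTO [DIR]
dynamic_gateway() {
    file=$(router_file "$@")
    if [ -f "$file" ]; then
        # Strip whitespace from the stored address (the PHP uses trim()).
        tr -d ' \t\r\n' < "$file"
        printf '\n'
        return 0
    fi
    printf 'missing %s\n' "$file" >&2
    return 1
}
```

For example, `dynamic_gateway em1 inet6` prints the address stored in `/tmp/em1_routerv6`, or reports that the file is missing, which is the situation in which the dynamic gateway does not get an address from this branch of the code.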

@mr44er

mr44er commented Jul 28, 2019

image
##################################
image
##################################
image
##################################

I'm out of ideas here.

@AdSchellevis
Member

@mr44er can you try 732b5ff on opnsense 19.7.1?

opnsense-patch 732b5ff

@mr44er

mr44er commented Aug 1, 2019

I used the patch, rebooted and activated DHCPv6 on wan2 again.

Nothing changed: the GW of wan2 is still prioritized, regardless of the lower priority number, and wan1gwv6 is shown as dead.

But after some minutes, I couldn't access the webif anymore. Rebooting didn't help, and dmesg didn't show anything.
Only physically cutting the connection to wan2 plus a reboot helped.
webif+ssh now listens only on the internal LAN, and I deactivated dynamic DNS detection for now. I don't see what the problem with this is, but it shouldn't be connected to the GW thing. Anyway, webif is now working.

Funny thing is, not even marking wan2 as offline by force activates wan1ipv6gw:

image

Again, only deactivating ipv6 completely on wan2 brings back 01_WAN_ENTEGA_DHCP6 fe80::2e6b:f5ff:fead:894a to life.

Between tests I did multiple reboots, no effect either.

@AdSchellevis
Member

@mr44er you might have other networking issues, but maybe you can install patches 704dc96 and eb4975e as well and dump a screenshot again.

opnsense-patch 704dc96cf eb4975e

This will add priority, ipprotocol and upstream setting to the overview, which helps to evaluate if the settings are correct.

AdSchellevis added a commit that referenced this issue Aug 2, 2019
… should also permit those as default gateway. could be #3604
@AdSchellevis
Member

@mr44er and install this a6264e5 one too please.

opnsense-patch a6264e5

Although there is no relation to gateway status (which actually isn't changed), it considers the gateways from 732b5ff as valid defaults (kind of forgot that one yesterday).

@mr44er

mr44er commented Aug 2, 2019

As for the other problems, maybe I had a routing loop, or Tor interfered because I'm an exposed host. I don't know.

Ok, let's roll:
All 3 patches added, reboot.

Overview for now (I like it that way):
image

Reaching IP and ping from outside to one from delegated net works fine.
#############################
Trying now:
image

gives this immediately:
image

Now setting inactivity off from setting yesterday:
image

Pinging from outside is dead for now.

Setting wan1 as upstream, still no luck:
image

Complete reboot:
image

@AdSchellevis
Member

@mr44er did you push 2001:xxx:8888 and 2001:xxx:8844 through the correct gateway? This looks like a static routing issue (only one can be reachable at the same time, both can compete).

@mr44er

mr44er commented Aug 2, 2019

I don't understand what you mean with 'push through the correct gateway'.

But to rule out 'only one can be reachable at the same time, both can compete' I used now the ipv6 from quad9.com and cloudflare.com:

image

image

image

image

@AdSchellevis
Member

You should have a static route using the correct gateway for each monitor IP; otherwise both will use the default, which is very likely your issue. If both are DNS entries, you can set a gateway there, and this should be visible in your static routes.

@mr44er

mr44er commented Aug 2, 2019

netstat

Ah, I now understand. Nope, I use these IPs only for GW-Monitoring. DNS is done with other servers and routing table looks correct.

@AdSchellevis
Member

just add static routes and you're probably fine.

@mr44er

mr44er commented Aug 2, 2019

Mhm? I think the routing table looks correct as it is with correct routes as it should.

More routes would double the entry and bring chaos into routing, or am I wrong here?

Could you give me an example how you mean the two static routes should look, maybe I can understand better what you mean?

@AdSchellevis
Member

There should be static routes for the addresses you're trying to monitor (unless these are within their own subnet range; your screenshots seem to have different monitors set). IPv4 shouldn't be different from IPv6 in that regard.

@mr44er

mr44er commented Aug 2, 2019

image

image

image

But I still think it shouldn't be necessary to add these routes explicitly...

@AdSchellevis
Member

If they are already there, you should be good. You probably have to do more debugging to find out why you can't reach the endpoint. My community support time is limited.

@mr44er

mr44er commented Aug 2, 2019

My community support time is limited

I understand that for sure, but I must admit I see it as a bug and not a wrong configuration.

more debugging

If I only knew, where to look.
Anyway please don't close this thread too fast, because I would like to hear what the others with this problem can reach with the patches etc.

@AdSchellevis
Member

I'm not closing it (it's a support ticket), feel free to discuss, if there is an issue that one can point to I'll gladly take a look.

There are some debugging tips in https://docs.opnsense.org/manual/gateways.html, maybe that helps.

@bimmerdriver

We should generate the gateway from dhcp6c as well if radvd doesn't fly. I've already promised to look into it elsewhere.

Please let me know when this is ready. I will test it. (I have one opnsense system running the latest release version and one running the latest development version.)

@AdSchellevis
Member

@bimmerdriver the patches mentioned earlier in this ticket will be included in 19.7.3 (9a772fd), which should be Monday

@bimmerdriver

bimmerdriver commented Aug 3, 2019

the patches mentioned earlier in this ticket will be included in 19.7.3 (9a772fd), which should be Monday

@AdSchellevis My development system is already at 20.1, so will the patch be in the release version?

@fichtner
Member

fichtner commented Aug 3, 2019 via email

@bimmerdriver

bimmerdriver commented Aug 3, 2019

“20.1” is not a version number so will have to be more specific.

The development version is running OPNsense 20.1.a_44-amd64, as of the most recent update. I updated it by setting the release type to development and then checking for updates. (Not using the command line.) Does that help?

@fichtner
Member

fichtner commented Aug 3, 2019

Yes, the changes you are looking for are included with 20.1.a_72, which will be included in Monday's 19.7.2 release on the development track, translating to roughly 20.1.a_75, or higher if further changes are pushed to the master branch.

@bimmerdriver

bimmerdriver commented Aug 4, 2019

I updated my test system yesterday. It's running 20.1.a_75. I can confirm that the problem is fixed.

@RehaagJ
Author

RehaagJ commented Aug 4, 2019

For my system, these patches already helped a lot: once I set the default route manually on the WAN_DHCP6 gateway, it now sticks, because I can make it the default. Monitoring is possible again, too.
Once the change @fichtner mentioned (writing the router files for dhcp6c as well when radvd doesn't generate the gateway) is implemented, everything should be back to the functionality we had until 19.1.10.
Thanks!

@fichtner
Member

fichtner commented Aug 4, 2019

@RehaagJ no, the change you mention will make the system more robust than it ever was. I do believe 19.7.2 will be back to where it was before. Just for perspective. ;)

@RehaagJ
Author

RehaagJ commented Aug 4, 2019

Well, currently, 19.7.1 with the patches isn't where it was before for me. I didn't have to define a default route manually earlier, now I need to do that. But I can't exclude the possibility that something went wrong during my many tests and/or patching, so I'll try again on 19.7.2. I'll update this topic then.

@bimmerdriver

With regards to the gateway monitor, is there a reason that the dashboard displays ~ as the address instead of the actual WAN gateway address (which in this case is link-local)? The interface status on the dashboard doesn't have a problem displaying the link-local WAN interface address.

@RehaagJ
Author

RehaagJ commented Aug 16, 2019

Sorry for the delay; I promised to report back after updating to 19.7.2. No difference: the gateway still does not get populated (consistent with what @bimmerdriver said), and a manual route entry is needed here. This was different in 19.1.
So I'm still convinced that gateway functionality is not yet restored to the pre-19.7 level, and that this should be labeled bug and not support. But since I have a workaround, if this is still seen as a support ticket, I think it can be closed.

@AdSchellevis
Member

@RehaagJ ok, thanks for your feedback.


6 participants