
WireGuard not respecting CARP MASTER status in HA configuration #7773

Closed
pmhausen opened this issue Aug 14, 2024 · 9 comments
Labels: cleanup (Low impact changes)

Comments

@pmhausen (Contributor)


Describe the bug

When enabling the "Depend on (CARP) feature" the WireGuard instance switches to "DOWN", even when the current node is the CARP master.

In previous versions of OPNsense I faked an HA setup (I don't remember when the "Depend on (CARP)" feature was introduced): the "spoke" office locations dial in to the "hub" HA cluster, and the central cluster is configured without the peers' IP addresses so it never dials out.

So with the CARP address configured as the peer at the office locations, only the active node was contacted and the tunnel came up properly. I have similar configurations with OpenVPN bound to 127.0.0.1 and a NAT port forward from the CARP address to 127.0.0.1 - you get the idea.

Now, with the major update and a lot of cleaning up to do, I decided to migrate the IPsec configuration etc. and set up WireGuard "properly". Unfortunately it doesn't work: the moment I enable the "Depend" feature, the instance goes down.

To Reproduce

See the screenshots for the details of the configuration. If more is needed, I will of course provide it.

Expected behavior

WireGuard should be up on the master and down on the backup node.

Describe alternatives you considered

I'm back to the "fake" setup I ran for the past years, but I'd like to move forward and configure it properly.

Screenshots

(Screenshots: Bildschirmfoto 2024-08-14 um 19 25 50, Bildschirmfoto 2024-08-14 um 19 28 01, Bildschirmfoto 2024-08-14 um 19 28 24)

Relevant log files

(Screenshot: Bildschirmfoto 2024-08-14 um 19 29 13)

Environment

OPNsense 24.7.1 on 2x DEC3860 in an HA configuration

Kind regards,
Patrick

@AdSchellevis (Member)

The only difference I see compared to my test setup is the vhids being reused across interfaces, which might cause issues.

To test this assumption, you could add a print_r($vhids) before the return and run the configure script manually: /usr/local/opnsense/scripts/Wireguard/wg-service-control.php -a restart

@pmhausen (Contributor, Author) commented Aug 14, 2024

Hi Ad,

thanks for the quick response. Well, CARP only demands that vhids be unique per broadcast domain. But it is easy enough to test whether OPNsense relies on them being globally unique - it's after office hours, so I'll just quickly change them and try again.

@pmhausen (Contributor, Author)

But first here's the output of that print statement:

Array
(
    [4cfb9dcb-dd8a-498f-acb1-8dd198f75164] => Array
        (
            [status] => DISABLED
            [vhid] => 1
        )

    [d279ef9d-536c-41e1-96dd-6842044a60f6] => Array
        (
            [status] => DISABLED
            [vhid] => 101
        )

    [d6f12840-ff3a-4388-994e-537797cd5b5a] => Array
        (
            [status] => DISABLED
            [vhid] => 101
        )

    [6ab1532f-efd0-4708-ac89-54fce445e3ed] => Array
        (
            [status] => DISABLED
            [vhid] => 1
        )

    [87f2d164-9855-47b2-9ad4-0c9d69ea85e3] => Array
        (
            [status] => MASTER
            [vhid] => 101
        )

    [943ed3c0-38b6-4dca-a3e9-bc288fb21ba6] => Array
        (
            [status] => MASTER
            [vhid] => 1
        )

)
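
The problem is visible in the dump: vhid 1 and vhid 101 each occur twice, once DISABLED and once MASTER. If the configure script resolves an instance's CARP state by vhid alone and takes the first match, the DISABLED entry shadows the MASTER one. A minimal Python sketch of that ambiguity (illustrative only; the actual logic lives in wg-service-control.php):

```python
# Entries reduced from the print_r dump above (UUIDs shortened).
# vhids 1 and 101 each appear with both DISABLED and MASTER status.
vhids = [
    ("4cfb9dcb", {"status": "DISABLED", "vhid": 1}),
    ("d279ef9d", {"status": "DISABLED", "vhid": 101}),
    ("87f2d164", {"status": "MASTER", "vhid": 101}),
    ("943ed3c0", {"status": "MASTER", "vhid": 1}),
]

def carp_status(vhid, entries):
    """First-match lookup by vhid: with duplicates, entry order decides."""
    for _uuid, info in entries:
        if info["vhid"] == vhid:
            return info["status"]
    return None

# The DISABLED entry shadows MASTER, so the instance is taken down
# even though this node is the CARP master on vhid 1.
print(carp_status(1, vhids))   # DISABLED
```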

@pmhausen (Contributor, Author)

And with globally unique vhids:

Array
(
    [4cfb9dcb-dd8a-498f-acb1-8dd198f75164] => Array
        (
            [status] => MASTER
            [vhid] => 1
        )

    [d279ef9d-536c-41e1-96dd-6842044a60f6] => Array
        (
            [status] => MASTER
            [vhid] => 101
        )

    [d6f12840-ff3a-4388-994e-537797cd5b5a] => Array
        (
            [status] => MASTER
            [vhid] => 102
        )

    [6ab1532f-efd0-4708-ac89-54fce445e3ed] => Array
        (
            [status] => MASTER
            [vhid] => 3
        )

    [87f2d164-9855-47b2-9ad4-0c9d69ea85e3] => Array
        (
            [status] => MASTER
            [vhid] => 103
        )

    [943ed3c0-38b6-4dca-a3e9-bc288fb21ba6] => Array
        (
            [status] => MASTER
            [vhid] => 2
        )

)

@pmhausen (Contributor, Author)

And now the primary node's logs:

wireguard instance KAGate (wg0) started

and nothing else - VPN is working.

And the backup:

Wireguard configure event instance KAGate (wg0) vhid: 1 carp: BACKUP interface: down

So, thanks a lot. Possibly change the issue from bug to support, and document that vhids need to be globally unique for all failover mechanisms to work as expected - maybe that is documented already.

@AdSchellevis AdSchellevis self-assigned this Aug 14, 2024
@AdSchellevis AdSchellevis added the cleanup Low impact changes label Aug 14, 2024
@AdSchellevis (Member)

I also don't mind supporting this (f477fa1), as long as preempt is enabled.

@pmhausen (Contributor, Author)

Sorry, for once I do not have an opinion - whatever works ;)

@AdSchellevis (Member)

ok, now I'm confused, I'm not used to you not having an opinion 🫣

@pmhausen (Contributor, Author)

> ok, now I'm confused, I'm not used to you not having an opinion 🫣

I simply do not quite understand this sentence:

> eventually all of them will switch between master/backup at the same time anyway, so we can assume all virtual ips switch simultaneously.

CARP is implemented in the kernel and works on an interface/broadcast domain level. That's why vhids need not be unique per system, only per broadcast domain.

Picture two firewalls with a CARP address in e.g. LAN and DMZ each and discrete physical ports. If you pull LAN from FW 1 and DMZ from FW 2, then FW 1 will be master in DMZ and FW 2 will be master in LAN.
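
The per-broadcast-domain scoping described above can be sketched as a state table keyed on (interface, vhid) rather than vhid alone - a hypothetical illustration from FW 1's point of view, not OPNsense code:

```python
# Hypothetical CARP state on FW 1: the same vhid exists in two
# different broadcast domains, with different masters after the
# link failures described above.
carp_state = {
    ("lan", 1): "BACKUP",   # LAN pulled from FW 1 -> FW 2 is master there
    ("dmz", 1): "MASTER",   # DMZ pulled from FW 2 -> FW 1 is master there
}

# A lookup keyed by vhid alone cannot distinguish the two states:
print(carp_state[("lan", 1)])  # BACKUP
print(carp_state[("dmz", 1)])  # MASTER
```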

I have no idea how OPNsense might have to deal with that. It happens in the kernel, and that's just how it is supposed to work. It's probably not the best network design if you care about failure of individual links and not only failure of complete nodes.

So my own conclusion was: you certainly know best. I can live with the constraint of globally unique vhids, and I can live with the other way just as well. But now that I have changed my vhids I am not going to revert them. After all, it is completely random and unimportant what these concrete numbers are, as long as both nodes agree.

Kind regards,
Patrick

fichtner pushed a commit that referenced this issue Aug 26, 2024
 #7773

Although all our examples always use vhid as a unique key per firewall, it is possible to add the same vhid to different interfaces.
When "disable preempt" is not selected, eventually all of them will switch between master/backup at the same time anyway, so we can assume all virtual ips switch simultaneously.

If preempt is disabled, our vhid matching might not be perfect, but likely better than before.

(cherry picked from commit f477fa1)
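
The reasoning in the commit message can be sketched as: with preempt enabled, all entries sharing a vhid eventually converge to the same state, so any MASTER match can be treated as authoritative. A Python sketch of the idea - illustrative only, not the actual code from commit f477fa1:

```python
def effective_status(vhid, vips, preempt=True):
    """Collapse duplicate-vhid entries into one state.

    With preempt enabled, all virtual IPs on a node eventually switch
    between master/backup together, so any MASTER entry can be treated
    as authoritative. Without preempt, the match may be imperfect and
    we fall back to the first entry found.
    """
    statuses = [v["status"] for v in vips if v["vhid"] == vhid]
    if not statuses:
        return None
    if preempt and "MASTER" in statuses:
        return "MASTER"
    return statuses[0]

vips = [
    {"vhid": 1, "status": "DISABLED"},
    {"vhid": 1, "status": "MASTER"},
]
print(effective_status(1, vips))                  # MASTER
print(effective_status(1, vips, preempt=False))   # DISABLED
```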