Gateway-group bound state flushing on failback #3516

Closed
namezero111111 opened this issue Jun 5, 2019 · 9 comments
Labels
help wanted Contributor missing / timeout

Comments

@namezero111111
Contributor

namezero111111 commented Jun 5, 2019

When using a failover gateway group (Tier1/2) for VoIP,
there is a situation where SIP and RTP can be split across tiers, resulting in
VoIP becoming unavailable.

Scenario:

Tier1 UP; Tier2 UP

SIP and RTP are over Tier1, everything works well

Tier1 DOWN (fails); Tier2 UP

SIP and RTP are over Tier2, everything works well

Tier1 UP; Tier2 UP

SIP remains over Tier2, (new) RTP goes out via Tier1.
VoIP remains broken until a manual state flush or a forced re-registration

Tier1 UP; Tier2 DOWN (fails)

SIP and RTP are over Tier1, everything works well
Registration now is forced to reoccur via Tier1

Basically, when the registration occurs via tier 2 and tier 1 comes back online, the registration stays on tier 2.
New RTP streams then go out over tier 1, so signalling and media are split across gateways and calls break.

The ideal solution would be if the gateway group contained a feature
"Flush states for this GW group on failback"

This would allow for:

  • GW groups that keep the generally favorable soft failback, where old connections remain
    on the backup
  • Failover groups that require a hard failback (VoIP, metered connections)

I'd be willing to poke around for implementation given some pointers.

@mimugmail
Member

The problem is that OPN/pf keeps track of the SIP state, and since T2 is still up when moving back to T1, SIP packets are still sent over T2 while connectionless RTP runs over T1.
We may need something similar to Kill states and/or Dynamic state reset for ANY gateway failover, with a clear warning in the description and disabled by default.

@namezero111111
Contributor Author

namezero111111 commented Jun 5, 2019 via email

@AdSchellevis
Member

With the new rule logic planned for 19.7 it should be possible to kill states on a per-rule basis, since we use the label field as a unique rule hash (previously the description was put there).
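
For reference, FreeBSD's pfctl can already kill states by rule label, which is the mechanism a per-rule kill would presumably build on. A minimal sketch, assuming the rule's label (the unique hash mentioned above) is known; the function name, the DRY_RUN switch and the example label are hypothetical and not part of OPNsense:

```shell
#!/bin/sh
# kill_states_by_label LABEL
# Kills all pf states created by rules carrying LABEL.
# DRY_RUN defaults to 1 here (an assumption of this sketch), so the
# command is only printed unless DRY_RUN=0 is set explicitly.
kill_states_by_label() {
    label=$1
    if [ -z "$label" ]; then
        echo "usage: kill_states_by_label LABEL" >&2
        return 2
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '/sbin/pfctl -k label -k %s\n' "$label"
    else
        /sbin/pfctl -k label -k "$label"
    fi
}

# Example (placeholder label, not a real rule hash):
#   kill_states_by_label 0123456789abcdef
```

In pfctl's label mode the first -k names the match type and the second carries the value, so only states from the VoIP rule would be dropped while all other states stay untouched.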

@fichtner fichtner added the help wanted Contributor missing / timeout label Nov 11, 2019
@AdSchellevis
Member

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@iMiMx

iMiMx commented Apr 28, 2020

Just adding this here for anyone else who stumbles across it, as the OPNsense forum post appears to have been archived: I hacked together a script, based on something I found on the pfSense forum, that does the same.

My WAN interface is lagg0_vlan18. The script checks the default route interface and, if it is set to WAN, resets the mobile data states. My OPNsense connects to a 4G/mobile router over 192.168.54.0/24.

It uses 'pfctl -k' to kill only the states for 192.168.54.102 (the NAT IP on OPNsense that goes over 4G). It runs from cron every minute and causes my L2TP (UDP) tunnels to fail back/reconnect on the active connection rather than remaining on 4G.

As I have gateway monitoring enabled, which holds 1 ICMP session, the number of mobile states has to be greater than 1 (to allow for this) before it does any state flushing:

"$MOBILE_NSTATES" -gt 1

#!/bin/sh
# *** Kills firewall states on the failover Mobile Data link when WAN is up ***

WAN_IF="lagg0_vlan18"

CURRENT_TIME="$(date +"%c")"
# interface that currently carries the default route
WAN_STATUS=$(route -n show default | awk '/interface/ {print $2}')

if [ "$WAN_STATUS" = "$WAN_IF" ]; then
	# the following line may need to be tweaked depending on your needs
	MOBILE_NSTATES=$(pfctl -s state | grep -c "192\.168\.54")
	# more than 1, because gateway monitoring always holds one ICMP state
	if [ "$MOBILE_NSTATES" -gt 1 ]; then
		echo "$CURRENT_TIME: WAN1 is online, but connections remain on Mobile Data. Killing states."
		pfctl -k 192.168.54.102
	fi
fi

EDIT: This also requires gateway switching to be enabled, so that the default route/interface changes.
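
The decision the script makes (default route back on WAN, and more mobile states left than the monitoring session accounts for) can be separated from the pfctl calls so it can be checked without touching the firewall. A minimal sketch; the function names and the optional threshold argument are illustrative additions, not part of the script above:

```shell
#!/bin/sh
# count_states PATTERN
# Counts the lines on stdin that contain the literal string PATTERN,
# e.g. the output of `pfctl -s state` piped in.
count_states() {
    grep -c -F -- "$1" || true
}

# should_flush DEFAULT_IF WAN_IF NSTATES [THRESHOLD]
# Succeeds only when the default route is on WAN_IF and more than
# THRESHOLD states remain (THRESHOLD defaults to 1, leaving room for
# the single ICMP state held by gateway monitoring).
should_flush() {
    default_if=$1
    wan_if=$2
    nstates=$3
    threshold=${4:-1}
    [ "$default_if" = "$wan_if" ] && [ "$nstates" -gt "$threshold" ]
}

# Usage, with the addresses from the script above:
#   n=$(pfctl -s state | count_states "192.168.54.")
#   cur=$(route -n show default | awk '/interface/ {print $2}')
#   should_flush "$cur" "lagg0_vlan18" "$n" && pfctl -k 192.168.54.102
```

Passing the threshold as an argument makes it easy to adapt when more than one monitoring or keepalive session is expected on the failover link.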

@cyrus104

Is it possible to get this added as a default cron option with some variable?

@haarp

haarp commented Oct 16, 2022

> Is it possible to get this added as a default cron option with some variable?

Yeah, but cron is not really suitable for this sort of thing. Better hook into something so what you're doing only gets executed when necessary.

I just wrote this because Dynamic state reset resets too much and is slated to be removed anyway.

Create /usr/local/etc/inc/plugins.inc.d/local.inc with this content (adjust the values of $primary_wan and $failover_wan):

<?php

// hook reset_failover_states() into the newwanip event
function local_configure() {
        return array(
                'newwanip' => array('reset_failover_states:2')
        );
}

function reset_failover_states( $verbose = false, $interface = '' ) {
        $primary_wan = 'pppoe0';
        $failover_wan = 'ppp1';

        // only act on events for the primary WAN
        if( get_real_interface($interface) !== $primary_wan ) return;

        $ip = get_interface_ip( $failover_wan );
        // nothing to kill if the failover WAN currently has no address
        if( empty($ip) ) return;

        log_error("Primary WAN $primary_wan change detected, killing states of failover WAN $failover_wan");
        // same as in rc.newwanip: kill states to and from the failover address
        mwexecf('/sbin/pfctl -k 0.0.0.0/0 -k %s', $ip);
        mwexecf('/sbin/pfctl -k %s', $ip);
}

This hooks into newwanip. If there is a new IP on the primary WAN interface, it will kill all states on the failover WAN interface.

I would very much like to see a proper version of this implemented in OPNsense at some point :)

Cheers!

@iMiMx

iMiMx commented Nov 28, 2023

@haarp What happens if there is not a new WAN IP though? Does the IP have to actually change for this to be executed?

For example, the primary link suffers from packet loss, fails over to the secondary, but the IP does not change on the primary - sessions/states could still end up on and remain on the secondary?

EDIT: So I disabled my Primary gateway in the OPNsense UI as a test and failed over to the secondary; all tunnels etc. failed over nicely. After re-enabling the Primary gateway, traffic fails back - most of it.

I can see that /usr/local/etc/rc.newwanip is executed from the logs, but I don't see the 'Primary WAN' log_error so I assume it has not run?

... and I end up with sessions/states still on the secondary.

EDIT 2: Ah, I think the 'rc.newwanip' I see in the logs, post failover/failback, relates to my L2TP tunnel, not the Primary/Secondary WAN.

EDIT 3: Tried shutting down the switch port to the cable modem for the Primary WAN; everything fails over and back. But the hook does not seem to run when the primary WAN is re-established if the IP has not changed.

I think the trigger needs to be the default route change, as opposed to newwanip - is that possible somehow?
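
One way to approximate a "default route changed" trigger without relying on a particular OPNsense hook is to record the default-route interface between runs and act only when it differs. This is a polling sketch, so it shares the cron drawback mentioned earlier; the function name and the state-file path are hypothetical:

```shell
#!/bin/sh
# route_changed CURRENT_IF STATE_FILE
# Compares CURRENT_IF with the interface recorded in STATE_FILE by the
# previous run, then records CURRENT_IF for the next run.
# Returns 0 only if a previous value existed and differs from CURRENT_IF.
route_changed() {
    current=$1
    state_file=$2
    previous=""
    if [ -r "$state_file" ]; then
        previous=$(cat "$state_file")
    fi
    printf '%s\n' "$current" > "$state_file"
    [ -n "$previous" ] && [ "$previous" != "$current" ]
}

# Usage (from a periodic job), with WAN_IF as in the cron script above:
#   cur=$(route -n show default | awk '/interface/ {print $2}')
#   if route_changed "$cur" /var/run/default_if.last && [ "$cur" = "$WAN_IF" ]; then
#       pfctl -k 192.168.54.102
#   fi
```

Because the first run only records the interface, states are not flushed at boot; a flush happens once per actual switch back to the primary interface rather than on every run.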

@haarp

haarp commented Nov 28, 2023

@iMiMx
There are a couple of hooks that could potentially work. You could try monitor instead of newwanip. I found it triggers on gateway up events.
