HA Cluster: Failover does not force a reconnect of mobile clients to new master #3708

Closed
stumbaumr opened this issue Sep 16, 2019 · 13 comments
Labels
help wanted Contributor missing / timeout support Community support

Comments

@stumbaumr
Contributor

stumbaumr commented Sep 16, 2019

Important notices
Before you add a new report, we ask you kindly to acknowledge the following:

[/] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[/] I have searched the existing issues and I'm convinced that mine is new.

Describe the bug
In a master-backup HA setup:
If a failover is triggered using "Enter Persistent CARP Maintenance Mode" or "Temporarily Disable CARP", the CARP IP moves to the backup HA system, but the IPsec mobile clients remain connected to the previous master. A reboot of the master HA system, by contrast, does force a reconnect.

See as well https://forum.opnsense.org/index.php?topic=14226

To Reproduce
Steps to reproduce the behavior:

  1. Set up an HA system with two firewalls
  2. Create a VPN IPsec IKEv2 setup with MobIke enabled
  3. Do not reboot the master; instead, trigger a failover using the buttons mentioned above
  4. Check whether the VPN client disconnects from the deactivated master

Background:
https://wiki.strongswan.org/projects/strongswan/wiki/MobIke
https://tools.ietf.org/html/rfc4555

  1. The mobile client initiates the connection using the CARP IP
  2. The OPNsense has both the interface IP and the CARP IP and lets the client know about both
  3. The client decides which one to use
  4. When the failover occurs, the client switches to the interface address, which is still available on the old master
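
The four steps above can be illustrated with a toy model. This is only a sketch of the behaviour described here, not strongSwan's actual path-selection code: the function name `pick_responder_addr` is hypothetical, and the addresses are example values (1.1.1.3 standing in for the CARP IP, 1.1.1.1 for the interface IP of the old master).

```sh
#!/bin/sh
# Toy model only: this is NOT how strongSwan implements MobIke path selection.
# pick_responder_addr ADVERTISED ANSWERING
#   ADVERTISED: space-separated responder addresses the client learned
#   ANSWERING:  space-separated addresses on which the old master still
#               answers for this IKE SA
# Prints the first advertised address that still answers.
pick_responder_addr() {
    for addr in $1; do
        for up in $2; do
            if [ "$addr" = "$up" ]; then
                echo "$addr"
                return 0
            fi
        done
    done
    return 1
}

# Before failover: the old master answers on both addresses.
pick_responder_addr "1.1.1.3 1.1.1.1" "1.1.1.3 1.1.1.1"   # prints 1.1.1.3

# After failover: the CARP IP has moved away, the interface IP still answers.
pick_responder_addr "1.1.1.3 1.1.1.1" "1.1.1.1"           # prints 1.1.1.1
```

In this model the client never has a reason to start a new session with the new master as long as any address of the old master keeps answering.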

Expected behavior
Mobile IPsec VPN users should be disconnected from the old master and then reconnected to the new master HA system.

Additional context
Maybe create a script named 10-ipsec in /usr/local/etc/rc.syshook.d/carp that restarts strongswan as soon as OPNsense receives an "I became backup" event.

Environment
Software version used and hardware type if relevant.

OPNsense 19.7.4_1 (amd64, OpenSSL).
2x Lenovo ThinkServer SR630
ThinkSystem 10Gb 4-port SFP+ LOM (Intel X722 Integrated 10 GbE Controller)
ZFS

@AdSchellevis AdSchellevis added the support Community support label Sep 16, 2019
@stumbaumr
Contributor Author

stumbaumr commented Sep 17, 2019

So I got a little bit further:

  • Adjusted the firewall rule for the incoming IPsec connection to allow only the CARP IP and no longer "This Firewall" - which includes the interface IP as well
  • Triggered a failover
  • A Linux client reconnected automatically
  • The Windows clients using the built-in VPN IKEv2 client did not even realize they were disconnected (the OPNsense does...)

@stumbaumr
Contributor Author

So the Linux client did not reconnect to the CARP IP on the new master, but to the firewall that is now the backup, on another internal-facing interface (the interface facing the clients' LAN). On that interface, IPsec to the OPNsense is not even allowed by the rules...

I now firmly believe that a restart of strongswan is necessary to make clients reconnect to the new master OPNsense.

I do not want to disable MobIke since mobile clients tend to change the IP on their side...

@AdSchellevis
Member

Reconnecting the backup won’t make any difference, since it can’t communicate using its address while in backup mode...

@stumbaumr
Contributor Author

stumbaumr commented Sep 23, 2019

Maybe I did not make myself clear.
There are two firewalls, fw1 (interface IP 1.1.1.1) and fw2 (1.1.1.2), with an IPsec setup that uses a CARP IP (1.1.1.3).
strongswan is configured to use MobIke, so it knows about all of the firewall's available IP addresses (the CARP IP and the interface IP).
The client connects to 1.1.1.3 on fw1.
After a failover from fw1 to fw2, the client is still connected to fw1 using IP 1.1.1.1(!!!).
Even with firewall rules that allow only traffic to 1.1.1.3, the client stays connected using 1.1.1.1.

Only after stopping the strongswan process was I able to let the client know that the connection is down.
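
One way to confirm this on the old master is to look at the local endpoint of each established IKE SA. The following is a minimal sketch. It assumes the usual `ipsec statusall` line layout of the form `name[N]: ESTABLISHED ..., LOCAL[id]...REMOTE[id]`; the function name `sas_off_vip` and the sample lines are made up for illustration and are not real output from this setup.

```sh
#!/bin/sh
# sas_off_vip VIP
# Reads "ipsec statusall"-style text on stdin and prints every established
# IKE SA whose local address differs from VIP (the expected CARP IP).
sas_off_vip() {
    awk -v vip="$1" '
        / ESTABLISHED / {
            # The text after the first ", " looks like LOCAL[id]...REMOTE[id]
            n = split($0, parts, ", ")
            if (n < 2) next
            laddr = parts[2]
            sub(/\[.*/, "", laddr)   # keep only the address before "[id]"
            if (laddr != vip) print $1, laddr
        }'
}

# Illustrative input only (hypothetical connection name and peers).
sample='     con1[3]: ESTABLISHED 2 hours ago, 1.1.1.1[fw.example.org]...203.0.113.7[client-a]
     con1[4]: ESTABLISHED 5 minutes ago, 1.1.1.3[fw.example.org]...203.0.113.9[client-b]'

printf '%s\n' "$sample" | sas_off_vip 1.1.1.3   # prints: con1[3]: 1.1.1.1
```

On a live system the same check would be `ipsec statusall | sas_off_vip 1.1.1.3`, with the real CARP IP in place of 1.1.1.3.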

@stumbaumr
Contributor Author

I created a devd rc.syshook.d script (copied from 50-frr).

root@opnsense02:~ # cat /usr/local/etc/rc.syshook.d/carp/20-ipsec 
#!/usr/local/bin/php
<?php
/*
 * Copyright (C) 2018 Franco Fichtner <franco@opnsense.org>
 * Copyright (C) 2004 Scott Ullrich <sullrich@gmail.com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 *    this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,
 * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
 * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
 * AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
 * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
require_once('config.inc');
require_once('util.inc');

$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';
if ($type != 'MASTER' && $type != 'BACKUP') {
    log_error("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}
if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}
switch ($type) {
    case 'MASTER':
        log_error(sprintf('IPsec started since subsystem %s changed to type %s', $subsystem, $type));
        shell_exec('pluginctl -s strongswan start');
        break;
    case 'BACKUP':
        log_error(sprintf('IPsec stopped since subsystem %s changed to type %s', $subsystem, $type));
        shell_exec('pluginctl -s strongswan stop');
        break;
}
root@opnsense02:~ #
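
To try out the decision logic of the hook above without touching strongswan, it can be restated as a small shell function. This is only a sketch: `carp_ipsec_action` is a hypothetical name, it prints the action instead of running `pluginctl`, and `1@em0` is an assumed example of a `vhid@interface` subsystem string.

```sh
#!/bin/sh
# carp_ipsec_action SUBSYSTEM TYPE
# Mirrors the checks of the PHP hook: the type must be MASTER or BACKUP and
# the subsystem must contain "@". Prints "start" or "stop" (dry run only).
carp_ipsec_action() {
    subsystem=$1
    type=$2
    case "$type" in
        MASTER|BACKUP) ;;
        *) echo "Carp '$type' event unknown from source '$subsystem'" >&2
           return 1 ;;
    esac
    case "$subsystem" in
        *@*) ;;
        *) echo "Carp '$type' event triggered from wrong source '$subsystem'" >&2
           return 1 ;;
    esac
    if [ "$type" = "MASTER" ]; then
        echo start
    else
        echo stop
    fi
}

carp_ipsec_action 1@em0 BACKUP   # prints: stop
carp_ipsec_action 1@em0 MASTER   # prints: start
```

Note that each CARP VHID raises its own event, so on a firewall with several VIPs a hook like this is likely to run once per VHID during a failover.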

And I found a bug in the devd carp.conf file:
#3721

@AdSchellevis
Member

Ah, your traffic is leaving from the wrong address. Did you add an outbound NAT rule to enforce usage of the correct outbound address? It might help to capture some packets and see whether it really is accepting traffic on .3 and sending out on .1 (in which case that's the issue that should be solved here).
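
A capture along these lines could look as follows. This is a sketch, not a command taken from the thread: the helper name `ike_capture_cmd`, the interface `em0` and the client address `203.0.113.7` are placeholders. It only prints the `tcpdump` command, which limits the capture to IKE traffic (UDP 500 and UDP 4500) to or from one client so the source address of the replies can be compared with the CARP IP.

```sh
#!/bin/sh
# ike_capture_cmd INTERFACE CLIENT_IP
# Prints a tcpdump command that shows IKE packets (UDP 500, and UDP 4500 for
# NAT traversal) exchanged with one client. Nothing is captured here; run the
# printed command on the firewall.
ike_capture_cmd() {
    printf "tcpdump -n -i %s 'host %s and (udp port 500 or udp port 4500)'\n" "$1" "$2"
}

ike_capture_cmd em0 203.0.113.7
# prints: tcpdump -n -i em0 'host 203.0.113.7 and (udp port 500 or udp port 4500)'
```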

@stumbaumr
Contributor Author

The problem is MobIke...

This is the current list of IPs available to strongswan, which it communicates to the MobIke clients:

root@opnsense01:~ # ipsec statusall
Status of IKE charon daemon (strongSwan 5.8.0, FreeBSD 11.2-RELEASE-p14-HBSD, amd64):
  uptime: 13 hours, since Sep 23 23:33:54 2019
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 381
  loaded plugins: charon aes des blowfish rc2 sha2 sha1 md4 md5 random nonce x509 revocation constraints pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl fips-prf curve25519 xcbc cmac hmac gcm attr kernel-pfkey kernel-pfroute resolve socket-default stroke vici updown eap-identity eap-md5 eap-mschapv2 eap-radius eap-tls eap-ttls eap-peap xauth-generic xauth-eap xauth-pam whitelist addrblock counters
Virtual IP pools (size/online/offline):
  10.20.35.32/28: 14/2/2
  10.20.35.0/27: 30/3/1
  10.20.35.128/25: 126/0/1
  10.20.35.48/28: 14/1/0
Listening IP addresses:
  10.11.10.11
  10.50.20.11
  10.20.30.11
  10.20.30.1
  <removed WAN interface IP>
  <removed WAN CARP IPv6>
  <removed WAN CARP IP>
  <removed WAN IP Alias>
  <removed WAN IP Alias>
  10.20.27.11
  10.20.27.1
  192.168.0.11
  192.168.1.1
  10.20.26.11
  10.20.26.1
  10.20.40.11
  10.20.40.1
  10.20.28.11
  10.20.28.1
  10.20.38.41
  10.20.38.1
  10.20.36.1
  10.10.0.1
Connections:

So the .11 and .41 addresses are the interface IPs and the .1 addresses are the CARP IPs.

As soon as the CARP IP switches over to the HA partner, the strongswan session uses the interface IP.

And since strongswan switches the connection from "behind" the firewall to the client, I think there is no way to stop that except for taking down strongswan.
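
To see at a glance which non-CARP addresses strongswan is listening on, the "Listening IP addresses:" section of `ipsec statusall` can be filtered. A minimal sketch, assuming the section layout shown in the output above; the helper names `listening_addrs` and `non_vip_addrs` are hypothetical, and the sample input is a shortened excerpt of that output.

```sh
#!/bin/sh
# listening_addrs: reads "ipsec statusall" text on stdin and prints the
# addresses listed under "Listening IP addresses:", one per line.
listening_addrs() {
    awk '
        /^Listening IP addresses:/ { in_list = 1; next }
        in_list && /^[^ \t]/       { in_list = 0 }   # next unindented header
        in_list && NF              { print $1 }
    '
}

# non_vip_addrs VIP...: reads addresses on stdin and prints those that are
# not among the given CARP VIPs.
non_vip_addrs() {
    vips=" $* "
    while read -r a; do
        case "$vips" in
            *" $a "*) ;;
            *) echo "$a" ;;
        esac
    done
}

# Shortened excerpt of the output above.
sample='Listening IP addresses:
  10.20.30.11
  10.20.30.1
  10.20.38.41
  10.20.38.1
Connections:'

printf '%s\n' "$sample" | listening_addrs | non_vip_addrs 10.20.30.1 10.20.38.1
# prints 10.20.30.11 and 10.20.38.41, one per line
```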

@AdSchellevis
Member

I would try to match an outbound NAT rule, although I would usually expect that selecting the VIP on phase 1 already does the trick in these cases.

@stumbaumr
Contributor Author

The CARP VIP is already the one used on phase 1.
I am using this setup: https://forum.opnsense.org/index.php?topic=12147.0
The IPsec responder FQDN points to the CARP VIP.

In any case, I still have another problem with the HA failover/fallback. It may be better to just have another support session.

@AdSchellevis
Member

Sure, there must be something we're overlooking, although I'm not sure what it is yet. By the way, I'm not at the office this week.

@mimugmail
Member

Rainer, can you check what happens when the daemon is restarted on the old master (to force MobIke clients to switch) while the S2S VPNs are configured in "start immediate" mode?

@stumbaumr
Contributor Author

Hi Michael,
Site2Site VPN is not affected at all, since MobIke is not active there. So even with a failover/fallback, the S2S VPN just reconnects and works immediately.

@AdSchellevis
Member

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@AdSchellevis AdSchellevis added the help wanted Contributor missing / timeout label Apr 21, 2020