carp: disable preemption leads to split master/backup when link flaps #2780

Closed
mimugmail opened this issue Oct 2, 2018 · 19 comments

@mimugmail
Member

I'm seeing a reproducible problem with CARP here.
OPN1 is set up to sync its config to OPN2, with a dedicated link and the better skew.
OPN2 is not set up to sync, has a dedicated link, and has "disable preemption" ticked to force it to always be backup while OPN1 is present.

When I do an update cycle, I update OPN2, reboot it, and once it's fine, set OPN1 to enter persistent maintenance mode. OPN2 takes over the IPs and then I can update OPN1. While OPN1 is rebooting, if OPN2 loses a link on e.g. LAN, it releases its state ... when OPN1 comes back, it starts CARP and becomes master for this IP. I think the problem here is that OPN2 had disable preemption enabled. But normally I'd guess OPN1 shouldn't get any IP, since it should be in persistent maintenance mode.

I discovered this because I wanted to test some HA stuff and connected LAN via a crossover cable instead of a switch, so the LAN link always flaps during a reboot.

I can reproduce this reliably, multiple times.

Perhaps this is also the answer to the strange failover scenarios you can read about from time to time in the forums.

@AdSchellevis AdSchellevis added the support Community support label Oct 2, 2018
@AdSchellevis
Member

@mimugmail it sounds like it works as designed: persistent maintenance mode forces the VIP to use its highest advertised skew (advskew 254)

$carp_maintenancemode = isset($config["virtualip_carp_maintenancemode"]);
if ($carp_maintenancemode) {
$advskew = "advskew 254";

And preempt when selected (net.inet.carp.preempt ==> 0) allows interfaces to act independently:

Allow virtual hosts to preempt each other. When enabled, a vhid in a backup state would preempt a master that is announcing itself with a lower advskew. Disabled by default.
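
To make the role of advskew concrete, here is a minimal sketch of how the advertisement interval follows from advbase and advskew, assuming the rule from carp(4) that advskew is measured in 1/256 of a second and added to advbase. The class and function names are hypothetical, the advskew of 100 for OPN2 is an illustrative value that does not come from this thread, and the model deliberately ignores the preempt setting and the demotion counter.

```python
from dataclasses import dataclass


@dataclass
class CarpNode:
    """Hypothetical container for the CARP parameters of one host."""
    name: str
    advbase: int = 1   # base advertisement interval in seconds
    advskew: int = 0   # skew in 1/256 s, 0..254


def advertisement_interval(node: CarpNode) -> float:
    # carp(4): advskew is added to advbase in units of 1/256 second.
    return node.advbase + node.advskew / 256


def expected_master(nodes):
    # While all hosts are announcing, the one advertising most
    # frequently (shortest interval) is the one that stays master.
    return min(nodes, key=advertisement_interval)


# OPN1 in persistent maintenance mode (advskew 254, as in the snippet above);
# OPN2 with an assumed, illustrative advskew of 100.
opn1 = CarpNode("OPN1", advbase=1, advskew=254)
opn2 = CarpNode("OPN2", advbase=1, advskew=100)

print(advertisement_interval(opn1))          # 1.9921875
print(expected_master([opn1, opn2]).name)    # OPN2
```

With advskew 254 a host advertises almost a full second slower than a host with advskew 0, which is why maintenance mode normally pushes every VIP to the peer.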

@mimugmail
Member Author

But does it really make sense that LAN is master on OPN1 while WAN is master on OPN2?

@AdSchellevis
Member

yes, that's the effect of "net.inet.carp.preempt=0", normally you don't want them to act independently.

The docs from openbsd (https://www.openbsd.org/faq/pf/carp.html) are a bit more clear on the subject:

net.inet.carp.preempt
Allow hosts within a redundancy group that have a better advbase and advskew to preempt the master. In addition, this option also enables failing over a group of interfaces together in the event that one interface goes down. If one physical CARP-enabled interface goes down, CARP will increase the demotion counter, carpdemote, by 1 on interface groups that the carp(4) interface is a member of, in effect causing all group members to fail-over together. net.inet.carp.preempt is 0 (disabled) by default.

Our help text could probably be improved, but it does what it's supposed to do if you ask me.

@mimugmail
Member Author

Hm, the help text differs completely from the FAQ. Let me do some testing with a switch in between ... I was always recommending enabling this feature only on standby devices, but after reading the FAQ this is nonsense. It's quite hard to stay focused when enabling a feature disables something and the default is disabled.

I'll send a PR with a better help text when finished. Thanks for your time!

@mimugmail
Member Author

@andrewhotlab

Thank you Michael, I just updated the thread with new info:
https://forum.opnsense.org/index.php?topic=9450.msg44997#msg44997

@mimugmail
Member Author

When the link on an interface goes down, the linkdown script removes the IP configuration.
Then you have a demotion of -240, which causes a split-brain where one interface is master on unit1 and the other is master on unit2.

@fichtner This can be solved by ticking "Prevent Interface Removal". You introduced this because of ZeroTier, if I remember correctly.

ATM I'm unsure whether we just need to update the documentation or whether this is a wider impact introduced with the interface lock.

@AdSchellevis @fichtner opinions?

@AdSchellevis
Member

Prevent interface removal only triggers

$mismatch = false;
at boot, as far as I can see. Shouldn't be related.

@mimugmail
Member Author

Indeed. I removed the interface lock and tested again. Now the config is still not removed. As @andrewhotlab wrote in the forum, after changing some stuff in the interface config it was working.
I'll reinstall the machines next week and try to dig further.

For the archives: When users have problems with carp and mixed master/backup states, check via ifconfig if carp config is still there. :)
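
The check above can be scripted. What follows is only a sketch: it assumes `ifconfig` output in the FreeBSD format seen on these machines (an interface header line starting with `name: flags=` and an indented `carp: STATE vhid N` line per VIP). The function names are hypothetical and the sample text is abbreviated.

```python
import re

# Header line of an interface block, e.g. "em0: flags=8843<...>"
IFACE_RE = re.compile(r"^(\S+): flags=")
# Indented CARP status line, e.g. "        carp: MASTER vhid 1 advbase 1 advskew 0"
CARP_RE = re.compile(r"^\s+carp: (\w+) vhid (\d+)")


def carp_states(ifconfig_output: str) -> dict:
    """Map each interface name to a list of (state, vhid) tuples."""
    states = {}
    current = None
    for line in ifconfig_output.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            states[current] = []
            continue
        carp = CARP_RE.match(line)
        if carp and current is not None:
            states[current].append((carp.group(1), int(carp.group(2))))
    return states


def interfaces_without_carp(ifconfig_output: str) -> list:
    """Interfaces that currently carry no CARP vhid at all."""
    return [name for name, vhids in carp_states(ifconfig_output).items()
            if not vhids]


# Abbreviated sample: em0 lost its CARP config, em1 still has it.
SAMPLE = """\
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        media: Ethernet autoselect
        status: no carrier
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        inet 81.24.66.141 netmask 0xffffffc0 broadcast 81.24.66.191 vhid 1
        status: active
        carp: MASTER vhid 1 advbase 1 advskew 0
"""

print(carp_states(SAMPLE))
print(interfaces_without_carp(SAMPLE))
```

Interfaces that never had a VIP (loopback, sync link) are reported as well, so the result should be compared against the list of interfaces expected to carry one.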

@mimugmail
Member Author

Ok, I did a factory reset via console (option 4) and set up the machines from scratch (interfaces, rules, HA).

Initial ifconfig on machine 1:
root@OPN1:~ # ifconfig

em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:1a:8c:12:7d:9f
        hwaddr 00:1a:8c:12:7d:9f
        inet 192.168.1.11 netmask 0xffffff00 broadcast 192.168.1.255
        inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255 vhid 2
        inet6 fe80::21a:8cff:fe12:7d9f%em0 prefixlen 64 scopeid 0x1
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        carp: MASTER vhid 2 advbase 1 advskew 0
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:1a:8c:12:7d:9e
        hwaddr 00:1a:8c:12:7d:9e
        inet 81.24.66.142 netmask 0xffffffc0 broadcast 81.24.66.191
        inet 81.24.66.141 netmask 0xffffffc0 broadcast 81.24.66.191 vhid 1
        inet6 fe80::21a:8cff:fe12:7d9e%em1 prefixlen 64 scopeid 0x2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        carp: MASTER vhid 1 advbase 1 advskew 0

It's master and everything is fine. I shut down the switch port of em0:

root@OPN1:~ # ifconfig
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:1a:8c:12:7d:9f
        hwaddr 00:1a:8c:12:7d:9f
        inet6 fe80::21a:8cff:fe12:7d9f%em0 prefixlen 64 scopeid 0x1
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:1a:8c:12:7d:9e
        hwaddr 00:1a:8c:12:7d:9e
        inet 81.24.66.142 netmask 0xffffffc0 broadcast 81.24.66.191
        inet 81.24.66.141 netmask 0xffffffc0 broadcast 81.24.66.191 vhid 1
        inet6 fe80::21a:8cff:fe12:7d9e%em1 prefixlen 64 scopeid 0x2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        carp: MASTER vhid 1 advbase 1 advskew 0

Now it's master for em1 (WAN) but machine 2 is master for em0 (LAN) since the interface config was completely removed.

I went back and opened the LAN config in the UI, changed nothing, and only clicked Save:

Oct  8 10:22:05 OPN1 opnsense: /interfaces.php: The command `/sbin/ifconfig 'em0' -alias '192.168.1.1'' failed to execute
Oct  8 10:22:06 OPN1 kernel: em0: promiscuous mode enabled
Oct  8 10:22:06 OPN1 kernel: carp: demoted by 240 to 240 (interface down)
Oct  8 10:22:06 OPN1 kernel: carp: 1@em1: MASTER -> BACKUP (more frequent advertisement received)
Oct  8 10:22:06 OPN1 kernel: ifa_maintain_loopback_route: deletion failed for interface em1: 3
Oct  8 10:22:06 OPN1 opnsense: /interfaces.php: ROUTING: entering configure using 'lan'
Oct  8 10:22:06 OPN1 kernel: carp: demoted by 240 to 480 (pfsync bulk start)
Oct  8 10:22:06 OPN1 opnsense: /interfaces.php: ROUTING: IPv4 default gateway set to wan
Oct  8 10:22:06 OPN1 opnsense: /interfaces.php: ROUTING: no IPv6 default gateway set, assuming wan
Oct  8 10:22:06 OPN1 opnsense: /interfaces.php: ROUTING: skipping IPv4 default route
Oct  8 10:22:06 OPN1 opnsense: /interfaces.php: ROUTING: skipping IPv6 default route
Oct  8 10:22:06 OPN1 kernel: carp: demoted by -240 to 240 (pfsync bulk done)
Oct  8 10:22:06 OPN1 opnsense: /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member "81.24.66.141 - WAN (1@em1)" has resumed the state "BACKUP" for vhid 1
Oct  8 10:22:08 OPN1 sshd[71108]: Received signal 15; terminating.
Oct  8 10:22:08 OPN1 sshd[35793]: Server listening on :: port 22.
Oct  8 10:22:08 OPN1 sshd[35793]: Server listening on 0.0.0.0 port 22.
Oct  8 10:22:09 OPN1 opnsense: /interfaces.php: The command '/bin/pkill -'TERM' -F '/var/run/lighty-webConfigurator.pid'' returned exit code '15', the output was ''
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: The command '/bin/pkill -'HUP' 'php-cgi'' returned exit code '1', the output was ''
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: ROUTING: entering configure using defaults
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: ROUTING: IPv4 default gateway set to wan
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: ROUTING: no IPv6 default gateway set, assuming wan
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: ROUTING: setting IPv4 default route to 81.24.66.129
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: ROUTING: keeping current default gateway '81.24.66.129'
Oct  8 10:22:10 OPN1 opnsense: /interfaces.php: ROUTING: skipping IPv6 default route
Oct  8 10:22:11 OPN1 opnsense: /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '81.24.66.129'
Oct  8 10:22:11 OPN1 opnsense: /usr/local/etc/rc.filter_configure: Cannot switch while 0 inet6 gateways are up
Oct  8 10:22:17 OPN1 opnsense: /usr/local/etc/rc.filter_synchronize: Filter sync successfully completed with https://10.10.10.2/xmlrpc.php.

It seems some command from above is not executed at install/wizard time, so the carp config is not really persistent, because after this, the interface config for em0 (LAN) on machine 1 looks fine:

root@OPN1:~ # ifconfig
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:1a:8c:12:7d:9f
        hwaddr 00:1a:8c:12:7d:9f
        inet6 fe80::21a:8cff:fe12:7d9f%em0 prefixlen 64 scopeid 0x1
        inet 192.168.1.11 netmask 0xffffff00 broadcast 192.168.1.255
        inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255 vhid 2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        carp: INIT vhid 2 advbase 1 advskew 0
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=2098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
        ether 00:1a:8c:12:7d:9e
        hwaddr 00:1a:8c:12:7d:9e
        inet 81.24.66.142 netmask 0xffffffc0 broadcast 81.24.66.191
        inet 81.24.66.141 netmask 0xffffffc0 broadcast 81.24.66.191 vhid 1
        inet6 fe80::21a:8cff:fe12:7d9e%em1 prefixlen 64 scopeid 0x2
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        carp: BACKUP vhid 1 advbase 1 advskew 0

Any ideas, @AdSchellevis @fichtner?

@AdSchellevis
Member

I think it always did remove the configuration from the interface, although I'm not 100% sure.
The question is: does it matter if there's one interface less?

@mimugmail
Member Author

Yes, losing the carp configuration on an interface makes a huge difference, because it demotes back.
When the machine is master on both interfaces, its demotion value is 0. When the link on em0 is gone, it demotes to 240 ... it would give MASTER to the second machine, but since the config is deleted, it demotes back by -240 to 0 because the interface is no longer participating in CARP. The second carp interface still has link and the demotion is 0, so it's master.

The backup machine, by contrast, is backup for the running interface, but it doesn't receive skews for LAN because machine 1 removed its config, and goes master.

This was the log when the interface config wasn't made "persistent":

Oct  8 10:06:45 OPN1 kernel: carp: 2@em0: MASTER -> INIT (hardware interface down)
Oct  8 10:06:45 OPN1 kernel: carp: demoted by 240 to 240 (interface down)
Oct  8 10:06:45 OPN1 kernel: em0: link state changed to DOWN
Oct  8 10:06:45 OPN1 kernel: carp: 1@em1: MASTER -> BACKUP (more frequent advertisement received)
Oct  8 10:06:45 OPN1 kernel: ifa_maintain_loopback_route: deletion failed for interface em1: 3
Oct  8 10:06:45 OPN1 opnsense: /usr/local/etc/rc.linkup: DEVD Ethernet detached event for lan
Oct  8 10:06:46 OPN1 kernel: ifa_maintain_loopback_route: deletion failed for interface em0: 3
Oct  8 10:06:46 OPN1 kernel: ifa_maintain_loopback_route: deletion failed for interface em0: 3
Oct  8 10:06:46 OPN1 kernel: carp: demoted by -240 to 0 (vhid removed)

Look at the reason for the -240 demotion: vhid removed.
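
The demotion arithmetic in that log can be followed mechanically. This is a minimal sketch that assumes the kernel message format `carp: demoted by <delta> to <total> (<reason>)` as it appears in the excerpt. The function names are hypothetical, and the two sample lines are copied from the log above.

```python
import re

# Matches e.g. "carp: demoted by -240 to 0 (vhid removed)"
DEMOTE_RE = re.compile(r"carp: demoted by (-?\d+) to (-?\d+) \((.+)\)")


def demotion_events(log_text: str) -> list:
    """Return (delta, reported_total, reason) for every demotion message."""
    events = []
    for line in log_text.splitlines():
        match = DEMOTE_RE.search(line)
        if match:
            events.append((int(match.group(1)),
                           int(match.group(2)),
                           match.group(3)))
    return events


def final_demotion(events, start: int = 0) -> int:
    """Sum the deltas, starting from an assumed initial counter value."""
    return start + sum(delta for delta, _total, _reason in events)


LOG = """\
Oct  8 10:06:45 OPN1 kernel: carp: demoted by 240 to 240 (interface down)
Oct  8 10:06:46 OPN1 kernel: carp: demoted by -240 to 0 (vhid removed)
"""

events = demotion_events(LOG)
for delta, total, reason in events:
    print(f"{delta:+5d} -> {total:3d}  ({reason})")
print("final demotion:", final_demotion(events))
```

A final value of 0 after a link loss is the symptom described here: the penalty for the downed interface is cancelled as soon as its vhid is removed, so the remaining interface can become master again.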

@mimugmail
Member Author

EDIT:

The backup machine, by contrast, is backup for the running interface, but it doesn't receive skews for LAN because machine 1 removed its config, and goes master.

Not true, its demotion value is just 0 and it sends a higher skew ... but it makes no difference: it's reproducible, and it's fixed when you touch the interface config again after installation.

@mimugmail
Member Author

@fichtner as requested by you yesterday, here's the diff of config.xml before and after just opening lan interface config and hitting apply:

root@OPNsense1:~ # diff -Naur /conf/config.xml config1-notworking.xml
--- /conf/config.xml    2018-10-11 13:06:14.500102000 +0200
+++ config1-notworking.xml      2018-10-11 13:04:47.976749000 +0200
@@ -290,12 +290,16 @@
       <gateway>GW_WAN</gateway>
     </wan>
     <lan>
-      <if>em0</if>
-      <descr/>
       <enable>1</enable>
-      <spoofmac/>
+      <if>em0</if>
       <ipaddr>192.168.1.11</ipaddr>
       <subnet>24</subnet>
+      <ipaddrv6/>
+      <subnetv6/>
+      <media/>
+      <mediaopt/>
+      <gateway/>
+      <gatewayv6/>
     </lan>
     <opt1>
       <if>em2</if>
@@ -477,8 +481,8 @@
   </widgets>
   <revision>
     <username>root@81.24.66.132</username>
-    <time>1539255974.4898</time>
-    <description>/interfaces.php made changes</description>
+    <time>1539255758.9669</time>
+    <description>/system_advanced_admin.php made changes</description>
   </revision>
   <OPNsense>
     <captiveportal version="1.0.0">

@AdSchellevis
Member

@mimugmail can you try e720c57 ? I think you're right, we shouldn't remove carp addresses as they might confuse the preemption settings.

@mimugmail
Member Author

Thanks, I'll test tomorrow, need to find the power supplies for my test machines :)

@AdSchellevis
Member

no rush :)

@mimugmail
Member Author

ATM I'm fighting a second carp problem: 2 machines connected with a crossover cable, 4 VLANs over 1 interface. 3 work fine; on the 4th, both are master. Via tcpdump on the VLAN interface I can see both units sending carp advertisements, so unit 2 doesn't receive them ... this carp thing drives me crazy :)

@mimugmail
Member Author

@AdSchellevis e720c57 works perfectly! Thanks 👍

@fichtner fichtner added bug Production bug and removed support Community support labels Oct 18, 2018
@fichtner fichtner added this to the 19.1 milestone Oct 18, 2018
fichtner pushed a commit that referenced this issue Oct 22, 2018