
Multi-WAN load balancing setup with Sticky Connections sometimes "drops" new connections #5869

Closed
denschub opened this issue Jul 13, 2022 · 19 comments

@denschub


Describe the bug

As per the previous discussions on this topic, having a Multi-WAN setup with Sticky Connections enabled leads to some connections being unable to establish themselves.

I'm following this official Multi-WAN guide closely, with the only exception being step 5, which I skipped because I have a NAT rule that forces all DNS traffic to be handled by OPNsense's Unbound. To debug, I have dropped all other "custom" configs like outbound static-port NAT, and I believe this to be a fairly standard setup. I'm happy to share a slightly sanitized version of the config file if anyone thinks that could be useful!

The issue happens with no clear frequency or cause. Sometimes connections just get "stuck" and won't recover until they are closed and opened again. Here are a couple of things I know:

  • This does not appear to be an issue with any of the WAN connections. I've verified that as well as I can by keeping a constant dedicated connection open through OPNsense (with two firewall rules that pin two specific destination IPs to specific WAN interfaces).
  • This issue appears to only affect new connections. Existing connections are not interrupted. New connections can "get stuck" (i.e. have 100% packet loss), but usually, killing that connection and retrying after a few seconds makes it work.
  • It indeed appears to be related to Sticky Connections. I'm able to reproduce with Sticky Connections enabled at least once per hour or so, but never without it.
  • Outbound NAT seems irrelevant; I've removed all rules and it still reproduces.
  • When connections are "broken", I see two states in Firewall > Diagnostics > States: one for the "in" policy-based WAN rule, and one for the "out" autogenerated "let out anything from firewall host itself (force gw)" rule. A broken state doesn't look any different from a working state.
  • Dropping the aforementioned two states when they're broken brings the connection back to life immediately. For example, a broken continuous ping will never recover on its own, but dropping the two state entries makes it work again (a rough shell equivalent is sketched after this list).
  • Running a Packet Capture on all interfaces while a "broken ping" is running, I can see the ping requests arriving on the firewall via the LAN interface, but I never see them leave on any WAN interface, and I also do not see responses. It looks like the packets just don't get forwarded to any WAN interface.
  • Also while a "broken ping" is running, I do not see any connection attempt in the Firewall logs. Not even with logging for the policy-based WAN rule enabled. Just nothing.
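
For reference, this is roughly what dropping those states looks like from the shell - just a sketch, since I normally do it via the GUI, and the client/destination IPs are only examples from my setup:

# list the states involving the stuck destination
pfctl -ss | grep 9.9.9.9

# kill the states for that flow (first -k is the source host, second -k the destination)
pfctl -k 10.42.4.2 -k 9.9.9.9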

To Reproduce

Unfortunately, I'm unable to provide you with Steps To Reproduce. My current test is "using the internet". As soon as a connection gets stuck in the browser (i.e. the tab is loading forever), I figure out which IP it is trying to connect to and check the logs based on that.

In parallel, I have a constant while true; do ping -c3 9.9.9.9; sleep 3; done running. Quad9 is a provider I'm not using, so the only connection being made here is from that ping. When I notice it throwing timeouts, I immediately Ctrl-C that while loop and start a new continuous ping to the same IP. If that also times out, it will time out forever, not recovering until I restart it or do something else (see above).
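
Spelled out as a small script (plain /bin/sh run on a LAN client; nothing OPNsense-specific, and the target is just my test IP), that check looks roughly like this:

#!/bin/sh
# Repeatedly ping a destination that is only reached through the gateway group,
# and timestamp every round in which all three pings fail.
TARGET=9.9.9.9
while true; do
    if ! ping -c 3 "$TARGET" > /dev/null 2>&1; then
        echo "$(date '+%F %T') all 3 pings to $TARGET failed"
    fi
    sleep 3
done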

Note that I'm looking for instructions on how to gather more information about this. I'm not asking you to spend a ton of time trying to reproduce it - I'm aware of its hard-to-reproduce nature, and I'm happy to spend time gathering useful data. My devices are the only devices generating traffic on this network, so I'm fine with testing stuff. I'm filing this ticket because I've run out of ideas for what to try or look at next. :)

Screenshots

  • Here is a screenshot showing the "stuck" behavior. On the left, you can see two mtr instances, tracing to 8.8.8.8 and 8.8.4.4, which have been hard-wired to the two Tier 1 WANs. On the right, you can see a ping to 9.9.9.9. The ping times out, but both WAN connections are up and running with no issues.

Relevant log files

  • This ZIP file (1.2KiB) contains a packet capture I took on OPNsense with these settings. ax0 is connected to my LAN switch, and you can see the ping requests coming into the firewall. The other four captures cover the WAN upstreams I mention below, in that order, plus the PPPoE connection; those files are empty.
  • The firewall log doesn't show anything at all for 9.9.9.9 during the broken state.

Additional context

My setup has three WAN connections, and their gateways are in a gateway group:

  • WAN_PRIMARY, connected via igb0, in Tier 1 - a VDSL2 connection with a static IP. The connection is established via PPPoE on OPNsense; it's connected to a DrayTek Vigor167 acting as a modem only.
  • WAN_SECONDARY, connected via igb1, in Tier 1 - a DOCSIS connection with a static IP. The OPNsense is connected to a FRITZ!Box 6591 Cable. Unlike typical consumer contracts, this line is set up with a /30 subnet on the FRITZ!Box, a public IP assigned to the port the OPNsense is connected to, and the OPNsense added as the exposed host. There is no NATing going on here; the public IP is configured as the interface IP in OPNsense, and the gateway in OPNsense points to the FRITZ!Box, just like any other Ethernet upstream.
  • WAN_TERTIARY, connected via igb2, in Tier 3 - an LTE backup connection that should be irrelevant in this case.

Environment

opnsense-business, 22.4.2, OpenSSL
Running on a DEC750 purchased in January.

@AdSchellevis self-assigned this on Jul 13, 2022
@AdSchellevis
Member

There are a couple of things we could try. I'll ask @fichtner tomorrow to offer a FreeBSD 13.1 kernel on the business edition mirror for testing, just to rule out that the issue is already solved and we missed it for some reason.

While trying to debug some of this locally I got a hunch I would like you to try out: would you be able to go to the inbound rule (with the gateway), set the "Max source states" field to a high value (10000), and test again?

@denschub
Author

offer a FreeBSD 13.1 kernel on the business edition mirror for testing, just to rule out that the issue is already solved and we missed it for some reason.

My primary use-case for using the Business Edition is "send money to Deciso", nothing else. I'm happy to switch to the Community Edition if you think that's worth a try.

would you be able to go to the inbound rule (with the gateway), set the "Max source states" field to a high value (10000), and test again?

I assume you're talking about the rule on the LAN_DEFAULT interface with the gateway group set as its gateway - the one the Multi-WAN guide describes as "Policy-based WAN"? If so, I set the value to 10000 and rebooted the firewall for good measure. I'll let you know if a connection gets stuck again, but for now I'll have to wait a few hours and observe before I can try something else. If you meant another rule, can you clarify? I haven't set any inbound rules on any of the WAN interfaces, and the only floating inbound rules are the auto-generated ones.
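
For context, and purely as a sketch of my mental model: I believe the pf rule OPNsense generates for that "Policy-based WAN" LAN rule looks roughly like the one below once Sticky Connections and the raised "Max source states" are applied. Interface names, networks, and gateway IPs here are placeholders, not copied from my generated ruleset.

# hypothetical sketch of the policy-based LAN rule, not the actual generated rule
pass in quick on ax0 \
    route-to { (pppoe0 192.0.2.1), (igb1 198.51.100.1) } round-robin sticky-address \
    inet from 10.42.0.0/16 to any \
    keep state (max-src-states 10000)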

I've also seen your suggestion to disable shared forwarding in the other thread. I'll try that next if I get drops with the increased source state limit.

@denschub
Author

Okay, this is still happening with the 10000 "Max source states" limit set on the rule I mentioned above. Same situation - starting a new ping in a second terminal from the same host worked, and dropping the state in the Firewall Diagnostics section made the still-running ping receive responses again. I looked into the "Firewall: Diagnostics: Statistics" section and saw 231 entries in the state table and 32 entries in the source tracking table; see this screenshot.

I'll reset the value to default now, and give disabling shared forwarding a try.

@AdSchellevis
Member

With max source states set and a "broken" ping, what does the following report?

pfctl -vvsS

@denschub
Author

I had been running with "shared forwarding" disabled since I wrote that comment (~1h) and had no issues. To answer your question, I enabled it again - and immediately got multiple "stuck" connections. This isn't conclusive and I have to test more, but it's interesting.

The full output of the command is below for reference. The client with the stuck connection is 10.42.4.2, and the destination in this case is 149.112.112.112. The state in the firewall shows:

Screenshot 2022-07-13 at 22 28 12

The traffic went through the WAN interface with the IP 109.90.221.74 (the gateway for that is .73), so I think the relevant line is:

10.42.4.2 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 01:05:45, 5522 pkts, 452590 bytes, filter rule 89

The firewall rule with the truncated name is this autogenerated rule (assuming the hash-like rid query parameters match)

Screenshot 2022-07-13 at 22 30 21

Which is the right interface for this WAN connection.

I'll go back to the config with disabled "shared forwarding" now, to see if that breaks.


root@packetbox:~ # pfctl -vvsS
No ALTQ support in kernel
ALTQ related functions disabled
10.42.1.2 -> 109.90.221.73 ( states 2, connections 0, rate 0.0/0s )
   age 00:00:34, 3 pkts, 881 bytes, filter rule 89
10.42.1.2 -> 109.90.221.73 ( states 3, connections 1, rate 0.0/0s )
   age 01:11:29, 1343 pkts, 79700 bytes, filter rule 79
10.42.4.2 -> 62.156.244.45 ( states 12, connections 2, rate 0.0/0s )
   age 00:01:56, 189 pkts, 27232 bytes, filter rule 89
10.42.4.2 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 01:05:45, 5522 pkts, 452590 bytes, filter rule 89
10.42.4.1 -> 62.156.244.45 ( states 1, connections 1, rate 0.0/0s )
   age 00:01:34, 2853 pkts, 740128 bytes, filter rule 89
10.42.4.1 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 00:58:54, 859 pkts, 63500 bytes, filter rule 89
10.42.6.2 -> 109.90.221.73 ( states 1, connections 0, rate 0.0/0s )
   age 00:00:19, 0 pkts, 0 bytes, filter rule 89
10.42.112.2 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 00:45:27, 240 pkts, 44220 bytes, filter rule 89
10.42.128.8 -> 62.156.244.45 ( states 2, connections 0, rate 0.0/0s )
   age 00:01:54, 3 pkts, 1200 bytes, filter rule 89
10.42.64.2 -> 62.156.244.45 ( states 40, connections 24, rate 0.0/0s )
   age 00:01:57, 2058 pkts, 520724 bytes, filter rule 89
10.42.64.2 -> 109.90.221.73 ( states 3, connections 2, rate 0.0/0s )
   age 00:02:30, 1951 pkts, 451534 bytes, filter rule 89
10.42.64.2 -> 62.156.244.45 ( states 32, connections 32, rate 0.0/0s )
   age 01:11:16, 545766 pkts, 516863032 bytes, filter rule 89
10.42.64.2 -> 109.90.221.73 ( states 2, connections 1, rate 0.0/0s )
   age 01:11:29, 16327 pkts, 15163871 bytes, filter rule 79
10.42.80.1 -> 109.90.221.73 ( states 7, connections 0, rate 0.0/0s )
   age 00:01:59, 8 pkts, 520 bytes, filter rule 89
10.42.80.1 -> 109.90.221.73 ( states 11, connections 8, rate 0.0/0s )
   age 00:09:46, 104741 pkts, 109083742 bytes, filter rule 89
2003:a:18:fd00:85a8:b349:7cde:82d6 -> fe80::200:ff:fe00:0 ( states 1, connections 1, rate 0.0/0s )
   age 00:01:55, 97 pkts, 37727 bytes, filter rule 90
10.42.127.1 -> 62.156.244.45 ( states 1, connections 0, rate 0.0/0s )
   age 00:01:47, 2 pkts, 1657 bytes, filter rule 89
10.42.127.1 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 01:11:06, 632 pkts, 290624 bytes, filter rule 89
2003:a:18:fd00:6c58:6eff:fe0f:f85d -> fe80::200:ff:fe00:0 ( states 1, connections 1, rate 0.0/0s )
   age 00:01:40, 1762 pkts, 768769 bytes, filter rule 90
10.42.6.1 -> 62.156.244.45 ( states 1, connections 0, rate 0.0/0s )
   age 00:00:27, 0 pkts, 0 bytes, filter rule 89
10.42.128.1 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 01:10:39, 297 pkts, 38150 bytes, filter rule 89
10.42.128.2 -> 62.156.244.45 ( states 3, connections 0, rate 0.0/0s )
   age 00:01:03, 18 pkts, 4332 bytes, filter rule 89
10.42.128.2 -> 62.156.244.45 ( states 1, connections 1, rate 0.0/0s )
   age 01:10:07, 1072 pkts, 265853 bytes, filter rule 89
10.42.128.2 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 01:11:29, 360 pkts, 39708 bytes, filter rule 79
10.42.96.1 -> 109.90.221.73 ( states 4, connections 2, rate 0.0/0s )
   age 00:01:55, 224 pkts, 112274 bytes, filter rule 89
10.42.96.1 -> 62.156.244.45 ( states 2, connections 1, rate 0.0/0s )
   age 01:11:07, 16297 pkts, 5212204 bytes, filter rule 89
10.42.5.1 -> 62.156.244.45 ( states 1, connections 0, rate 0.0/0s )
   age 00:00:18, 1 pkts, 116 bytes, filter rule 89
10.42.5.1 -> 109.90.221.73 ( states 1, connections 0, rate 0.0/0s )
   age 00:02:31, 2 pkts, 248 bytes, filter rule 89
10.42.2.3 -> 62.156.244.45 ( states 1, connections 0, rate 0.0/0s )
   age 00:01:01, 0 pkts, 0 bytes, filter rule 89
10.42.2.3 -> 62.156.244.45 ( states 1, connections 0, rate 0.0/0s )
   age 00:07:35, 0 pkts, 0 bytes, filter rule 89
2003:a:18:fd00:e094:83:f035:85d0 -> fe80::200:ff:fe00:0 ( states 6, connections 6, rate 0.0/0s )
   age 00:09:16, 209331 pkts, 206070245 bytes, filter rule 90
10.42.128.9 -> 109.90.221.73 ( states 2, connections 0, rate 0.0/0s )
   age 00:01:54, 3 pkts, 1200 bytes, filter rule 89
10.42.3.1 -> 62.156.244.45 ( states 1, connections 1, rate 0.0/0s )
   age 00:58:45, 229 pkts, 15036 bytes, filter rule 89
10.42.1.1 -> 62.156.244.45 ( states 1, connections 0, rate 0.0/0s )
   age 00:00:36, 0 pkts, 0 bytes, filter rule 89
2003:a:18:fd00::1202 -> fe80::200:ff:fe00:0 ( states 1, connections 1, rate 0.0/0s )
   age 01:09:11, 533 pkts, 59464 bytes, filter rule 90
10.42.96.4 -> 62.156.244.45 ( states 1, connections 1, rate 0.0/0s )
   age 01:04:14, 693 pkts, 231264 bytes, filter rule 89
10.42.128.6 -> 109.90.221.73 ( states 1, connections 1, rate 0.0/0s )
   age 01:11:13, 1323 pkts, 195731 bytes, filter rule 89
10.42.128.7 -> 109.90.221.73 ( states 10, connections 9, rate 0.0/0s )
   age 00:01:57, 90 pkts, 4936 bytes, filter rule 89
10.42.128.7 -> 109.90.221.73 ( states 1, connections 0, rate 0.0/0s )
   age 01:11:29, 321 pkts, 25750 bytes, filter rule 79

@AdSchellevis
Member

OK, @fichtner produced a test kernel for the business edition; handle with care, as it's only intended for testing purposes.

opnsense-update -zbkr 22.7.r1

If the issue on your end isn't reproducible without shared forwarding on the standard kernel, that would be valuable info as well, so running in that configuration a bit longer with the current kernel is also a good idea.

@fichtner
Member

FWIW, the issue might be present even without shared forwarding enabled, but that would likely only happen when firewall rule direction is "out" as opposed to "in" which is the default. It looks like pf is mismatching states across interfaces and creating separate states for different gateways...

@denschub
Author

I spent my entire workday using the internet (and keeping pings to two different endpoints running on two different devices), and I could not notice any issue with Shared Forwarding disabled. This isn't a final conclusion, obviously, but maybe it's a sign.

I wanted to install the test kernel, but the installer complains about an invalid signature:

# opnsense-update -zbkr 22.7.r1
Fetching base-22.7.r1-amd64.txz: ....... failed, signature invalid

@fichtner can you confirm that the test kernel you built is indeed unsigned? I'm happy to just pass -i, but I want to check with you first.

@fichtner
Member

fichtner commented Jul 14, 2022

@denschub oh, yes, -i is the way to go (keys change between majors and 22.4 doesn’t have 22.7 keys yet)
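
So for the test kernel that should be something along the lines of the following (the -i simply skips the signature check):

opnsense-update -i -zbkr 22.7.r1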

@denschub
Author

I upgraded to 22.7.r1, and unfortunately, within an hour, I had a stuck connection again. Same symptoms as before, same states as before, and dropping the states made it unstuck again. :(

I'll go back to the standard kernel with disabled shared forwarding again. I haven't seen that break yet, so I want to spend a bit more time in that configuration (not sure if I hope that it breaks, or if I hope that it doesn't break, ..)

@denschub
Author

Since my last comment, I've been running with Sticky Connections enabled but Shared Forwarding disabled, and I've had zero issues. I tried the 22.7.r1 kernel again and can confirm that it indeed still breaks.

So I guess the issue is somehow related to Shared Forwarding. I'm still happy to help debug this, even though I found a "working" configuration for me. Just let me know if there are other things you want to try, or certain kinds of debug output that would help you. :) But until I hear back, I'll be quiet.

@jinhong-

This really helped me with my issue. I faced similar network issues with a round-robin multi-WAN setup with sticky sessions. Disabling shared forwarding resolved it.

@anaisbetts

Fwiw, I also have this issue - the short version of the repro is:

  1. Install OpnSense from scratch
  2. Follow the Multi-WAN guide verbatim, set both gateways to Tier 1
  3. Your Internet connection will exhibit random stalls and generally be very unpleasant to use

@bluejay-np

This really helped me with my issue. I faced similar network issues with a round-robin multi-WAN setup with sticky sessions. Disabling shared forwarding resolved it.

This solution also fixed my OPNsense WAN load balancing and internet connection.
Note: I set up firewall-based routing; my issue was that pings to google.com would succeed and then fail again after adding two WANs in the same tier.

@pbean

pbean commented Mar 24, 2023

Just wanted to chime in. I'm having the same problem myself, except it occurs much more frequently than stated. This is a basic bare-bones OPNsense with multi-WAN: two 1 Gbit/s symmetric fiber PPPoE connections hooked directly from the ONTs to OPNsense (I have used both Intel and Mellanox cards hardware-wise; this is a bare-metal server). Browsing websites or connecting to services randomly drops for all users on the network within minutes, over and over again (on average every 120 seconds). If I disable shared forwarding the drops still happen, but the interval between them gets longer (it can go 5 minutes). I then also tried disabling sticky connections again, but there was no change.
Like the others in this scenario:

  • If I have a continuous ping going, it is never interrupted during these connection drops; only apps, services, and newly opened pages are affected.
  • If I reset states, everyone gets access again.
  • I can't use policy-based routing to route certain traffic to another gateway group when using multi-WAN load balancing. Traffic hits the router from my internal network and then just disappears. The activity log shows it being allowed out through the default "let out anything through the firewall" rule, but nothing ever connects to a remote destination. My config is pretty much stock; I have ZeroTier and speedtest installed, and I turned Suricata on, but this happened even before any IDS/IPS was used.
  • Something else I've noticed, to my immense frustration: if I have the load-balanced multi-WAN gateway group in use and I reboot the OPNsense server, I lose all connectivity to the internet when it comes back up (except from the OPNsense box itself). I have to reset both the state table and the source tracking table to get it working again (roughly the shell commands sketched after this list).
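
If I read the pfctl flush modifiers right, that reset is roughly the two commands below - I normally do it through the GUI, and I'm assuming the second table is the source tracking table:

# flush the state table and the source tracking table
pfctl -F states
pfctl -F Sources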

Edit: I also want to add that both connections use the same gateway on the ISP side but have different IP addresses. Since I'm using PPPoE, this is supported per documentation I've found online, so there is no NAT appliance in place ahead of them.

My current workaround is just to use a failover gateway group or the LB gateway group with one gateway disabled.

@AdSchellevis
Member

At the moment this doesn't appear to be a very common issue, and it is difficult to track down. Let's close it for now; if new relevant information comes in pointing to a direction we can investigate further, we can always reopen. Testing again on 23.7 might make sense for people having the issue.

@mimugmail
Member

Are there any shared forwarding changes in 23.7?
In fact, this #5869 (comment) is quite standard: if you use gateway groups with multiple Tier 1 gateways, you have to disable shared forwarding.

@AdSchellevis
Member

@mimugmail there aren't. If it's a known limitation, I don't mind moving this to the documentation; as it stands, I don't expect any movement on this ticket.

@mimugmail
Member

I'll PR to the docs repo
