Multi-WAN load balancing setup with Sticky Connections sometimes "drops" new connections #5869
Comments
There are a couple of things we could try. I'll ask @fichtner tomorrow to offer a FreeBSD 13.1 kernel on the business edition mirror for testing, just to rule out that the issue has already been solved and we missed it for some reason. While trying to debug some of this locally I got a hunch which I would like you to try out: would you be able to go to the inbound rule (with the gateway), set the "Max source states" field to a high value (10000), and test again?
My primary use-case for using the Business Edition is "send money to Deciso", nothing else. I'm happy to switch to the Community Edition if you think that's worth a try.
I assume you're talking about the rule on the LAN_DEFAULT interface with the Gateway group as destination that the Multi-WAN setup describes as "Policy-based WAN"? If so, I just set the value to 10000 and rebooted the firewall, just for good measure. I'll let you know if a connection gets stuck again, but unfortunately, for now, I'll just have to wait a few hours and observe before I can try something else. If you meant another rule, can you clarify? I haven't set any inbound rules for any of the WAN interfaces, and the only floating inbound rules are the auto-generated ones. I've also seen your suggestion to disable shared forwarding in the other thread. I'll try that next if I get drops with the increased source state limit.
Okay, this is still happening with the 10000 "Max source states" limit set on the rule I mentioned above. Same situation - starting a new ping in a second terminal from the same host worked, and dropping the state in the Firewall Diagnostic section made the still-running ping receive responses. I looked into the "Firewall: Diagnostics: Statistics" section and saw 231 entries in the I'll reset the value to default now, and give disabling shared forwarding a try. |
With max source states set and a "broken" ping, what does the following report?
I was running with disabled "shared forwarding" since I wrote the comment (~1h) and had no issues. To answer your question, I enabled it again - and immediately got multiple "stuck" connections. This isn't conclusive and I have to test more, but it's interesting. The full output of your command is below for reference. The client with the stuck connection is So the traffic went through the WAN interface with the IP
The firewall rule with the truncated name is this autogenerated rule (assuming the hash-like Which is the right interface for this WAN connection. I'll go back to the config with disabled "shared forwarding" now, to see if that breaks.
ok, @fichtner produced a test kernel for the business edition, handle with care as it's only intended for testing purposes.
If the issue on your end isn't reproducible without shared forwarding and the standard kernel, that would be valuable info as well, so trying that a bit longer with the current kernel is also a good idea. |
FWIW, the issue might be present even without shared forwarding enabled, but that would likely only happen when firewall rule direction is "out" as opposed to "in" which is the default. It looks like pf is mismatching states across interfaces and creating separate states for different gateways... |
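For anyone who wants to look for the duplicate states described above, one option is to filter pf's state table for the destination of the stuck connection. This is only a minimal sketch, assuming root shell access on the firewall; `states_for_host` is a hypothetical helper name, and the only real tool it relies on besides `grep` is `pfctl -s states`, which prints the state table.

```shell
#!/bin/sh
# Hypothetical helper: read a pf state listing on stdin and keep only the
# entries that mention the given address.
# -F matches the address literally (the dots are not wildcards) and
# -w requires a whole-word match, so 9.9.9.9 does not also match 19.9.9.9.
states_for_host() {
    grep -Fw -- "$1"
}

# Usage on the firewall, as root (assumption: run while a connection is stuck):
# pfctl -s states | states_for_host 9.9.9.9
```

If the same client and destination pair shows up more than once while a connection is stuck, that would at least be consistent with the theory above, though it would not prove it.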
I spent my entire workday using the internet (with a ping running to two different endpoints on two different devices), and could not notice any issue with Shared Forwarding disabled. This isn't a final conclusion obviously, but maybe it's a sign. I wanted to install the test kernel, but the installer complains about an invalid signature:
@fichtner can you confirm that the test kernel you built is indeed unsigned? I'm happy to just pass |
@denschub oh, yes, -i is the way to go (keys change between majors and 22.4 doesn’t have 22.7 keys yet) |
I upgraded to I'll go back to the standard kernel with disabled shared forwarding again. I haven't seen that break yet, so I want to spend a bit more time in that configuration (not sure if I hope that it breaks, or if I hope that it doesn't break, ..) |
Since my last comment, I ran with Sticky Connections enabled but Shared Forwarding disabled, and had zero issues since then. I tried the 22.7.r1 kernel again and can confirm that this indeed still breaks. So I guess the issue is somehow related to Shared Forwarding. I'm still happy to help debug this, even though I found a "working" configuration for me. Just let me know if there are other things you want to try, or certain kinds of debug output that would help you. :) But until I hear back, I'll be quiet. |
This really helped me with my issue. I faced similar network issues as well with a round robin multiwan setup with sticky sessions. Disabling shared forwarding resolved it |
Fwiw, I also have this issue - the short version of the repro is:
This solution solved my connection with OPNsense WAN load balancing and internet. |
Just wanted to chime in. I'm having the same problems myself, except my problem occurs much more frequently than stated. This is on a basic bare-bones OPNsense with multi-WAN: two 1 Gbit/1 Gbit fiber PPPoE connections hooked directly from the ONTs to OPNsense (I have used Intel cards and Mellanox cards hardware-wise as well; this is a bare metal server). Trying to browse websites or connect to services randomly drops for all users on the network within minutes, over and over again (on average every 120 seconds). If I disable shared forwarding, the drops still happen, but now it takes longer between drops (it can go 5 minutes). I then also tried disabling sticky connections again, but there was no change.
Edit: I also want to add that they use the same gateway on the ISP side but have different IP addresses. Since I'm using PPPoE, this is supported per documentation I've found online, so there is no NAT appliance in place ahead of them. My current workaround is just to use a failover gateway group, or the LB gateway group with one gateway disabled.
At the moment this doesn't appear to be a very common issue, and it is difficult to track. Let's close it for now; if new relevant information comes in pointing in a direction to investigate further, we can always reopen. Testing again on 23.7 might make sense for people having this issue.
Are there any shared forwarding additions in 23.7? |
@mimugmail there aren't. If it's a known limitation, I also don't mind moving this to the documentation; as it stands I don't expect any movement on this ticket.
I'll PR to the docs repo |
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
I'm filing this issue as a follow-up to "MultiWAN / Gateway group connectivity issues since OPNsense upgrade" (#5094) on explicit request from @AdSchellevis
Describe the bug
As per the previous discussions on this topic, having a Multi-WAN setup with Sticky Connections enabled leads to some connections being unable to establish themselves.
I'm following this official Multi-WAN guide closely, with the only exception being step 5, which I skipped, because I have a NAT rule that forces all DNS-traffic to be handled by OPNsense's Unbound. To debug, I have dropped all other "custom" configs like outbound static port NAT etc, and I believe this to be a fairly standard setup. I'm happy to share a slightly sanitized version of the config file if anyone thinks this could be useful!
The issue is happening without clear frequency or cause. Sometimes, connections just get "stuck" and won't unstuck themselves until they are closed and opened again. Here are a couple of things I know:
To Reproduce
Unfortunately, I'm unable to provide you with Steps To Reproduce. My current test is "using the internet". As soon as a connection gets stuck in the browser (i.e. the tab is loading forever), I figure out which IP it is trying to connect to and check the logs based on that.
In parallel, I have a constant
while true; do ping -c3 9.9.9.9; sleep 3; done
running. quad-9 is a provider I'm not using, so the only connection being made here is from that ping. When I notice it throwing timeouts, I immediately ctrl-c that while loop and start a new continuous ping to the same IP. If that also times out, it will time out forever, not recovering until I restart it or do something else (see above).
Note that I'm looking for instructions on how to gather more information about this. I'm not asking you to spend a ton of time trying to reproduce this - I'm aware of the hard-to-reproduce nature, and I'm happy to spend time on gathering useful data. My devices are the only devices doing traffic on this network, so I'm fine with testing stuff. I'm filing this ticket because I've run out of ideas for what I can do or look at. :)
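A variant of the ping loop above that timestamps every round can make it easier to line up a stuck connection with firewall logs afterwards. This is only a sketch: `classify_probe` and `watch_target` are hypothetical helper names, and it assumes that `ping -c3` exits with status 0 when at least one reply arrived and non-zero otherwise.

```shell
#!/bin/sh
# Hypothetical helpers; only ping, date, and sleep are real commands here.

# Map a ping exit status to a label: 0 means at least one reply arrived.
classify_probe() {
    if [ "$1" -eq 0 ]; then
        echo "ok"
    else
        echo "stuck"
    fi
}

# Probe the target until interrupted, printing one timestamped line per round.
watch_target() {
    target="$1"
    while true; do
        ping -c3 "$target" >/dev/null 2>&1
        status=$?
        echo "$(date '+%Y-%m-%dT%H:%M:%S') $target $(classify_probe "$status")"
        sleep 3
    done
}

# Usage (runs until ctrl-c):
# watch_target 9.9.9.9
```

Redirecting the output to a file gives a simple record of when each "stuck" period started and ended.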
Screenshots
mtr instances, tracing to 8.8.8.8 and 8.8.4.4, which have been hard-wired to the two Tier 1 WANs. On the right, you can see a ping to 9.9.9.9. The ping times out, but both WAN connections are up and running with no issues.
Relevant log files
ax0 is connected to my LAN-switch, and you can see the ping requests coming into the firewall. The other four captures are the WAN-upstreams I mention below, in that order, and the PPPoE connection as well. Those files are empty.
9.9.9.9 during the broken state.
Additional context
My setup has three WAN connections, and their gateways are in a gateway group:
WAN_PRIMARY, connected via igb0, in Tier 1 - a VDSL2 connection with a static IP. The connection is established via PPPoE on OPNsense; it's connected to a DrayTek Vigor167 acting as modem only.
WAN_SECONDARY, connected via igb1, in Tier 1 - a DOCSIS connection with a static IP. The OPNsense is connected to a FRITZ!Box 6591 Cable. Contrary to consumer contracts, this line is set up with a /30 subnet on the FRITZ!Box, a public IP assigned to the port the OPNsense is connected to, and the OPNsense added as exposed host. There is no NAT'ing going on here: the public IP is configured as the interface IP in OPNsense, and the Gateway in OPNsense points to the FRITZ!Box, just like any other ethernet-upstream.
WAN_TERTIARY, connected via igb2, in Tier 3 - an LTE backup connection that should be irrelevant in this case.
Environment
opnsense-business, 22.4.2, OpenSSL
Running on a DEC750 purchased in January.