
MultiWAN / Gateway group connectivity issues since OPNsense upgrade #5094

Closed
Rapterron opened this issue Jul 14, 2021 · 73 comments
Labels
help wanted Contributor missing / timeout support Community support

Comments

@Rapterron

Dear,

As suggested by your bot in ticket #5089, I am also affected by this issue and want to raise its priority by following your templates and providing a detailed bug report.
Please also take into account the feedback in ticket #5089, initially opened by "Malunke".

Describe the bug

Load-balanced multi-WAN routing using 2 Internet gateways stops working after anywhere from a few minutes up to several hours.
After rebooting, the firewall works again for some time.

Detailed report

My setup:

The OPNsense is installed on a hardware platform with a Celeron CPU and Intel NICs.
For troubleshooting I also installed OPNsense on a second spare device, but it shows the same issue (read more below).

I have 2 Internet gateways from 2 different ISPs, each terminating on an AVM FritzBox (a widely used xDSL modem and router).
The OPNsense sits behind both AVM VDSL routers via 2 separate VLANs.
Between the OPNsense and each gateway router is a small transfer network with statically assigned IPs.
The traffic from the OPNsense to the gateway routers is plain IP routing; there is no NAT on this leg.
NAT happens later on the gateway VDSL routers towards the Internet.

In the configuration I have both gateways in the same tier of a gateway group and route the traffic via a floating firewall rule.
To ensure session consistency I enabled sticky connections in the advanced firewall settings under Multi-WAN, with a custom source tracking timeout of 300 seconds.
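
Roughly speaking, this should translate into pf policy routing of the following shape (a sketch with placeholder macro names, not the exact ruleset OPNsense generates):

set timeout src.track 300
pass in quick on $lan_if route-to { ($wan1_if $wan1_gw), ($wan2_if $wan2_gw) } round-robin sticky-address from $lan_net to any keep state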

This setup was working fine over the last few years (set up mid-2019), but after the update to OPNsense 21 I experience sporadic Internet outages from all internal networks.
I can ICMP ping endpoints behind the gateways while the bug is triggered, but everything else on every client stops working. It seems like OPNsense somehow completely messes up the sessions and gateway allocation.
After a restart of the whole firewall it works again for a few minutes to hours (no pattern recognized yet).

Steps taken for troubleshooting:

Reviewed and cleaned up the configuration and firewall rules.
Checked both gateway routers and made sure the Internet connection is up on each.
Used debugging tools on the OPNsense while the bug was triggered (pinging the gateways, running the health audit).
Reinstalled the firewall from scratch on the same hardware and imported my configuration.
Reinstalled the firewall on a spare hardware device and imported the configuration.

Since it's a production firewall I need to be careful and document every step, but as I could not find a solution with Multi-WAN enabled, my next step is a full downgrade on the spare firewall. Once I can confirm the last working version I will update this ticket.

Current version: OPNsense 21.1.8_1-amd64
Last known working version: OPNsense 20.xx (exact minor version unclear, as I applied several updates at once)

To Reproduce
Configure 2 Internet gateways in the same tier of a gateway group and use that group in a floating rule.

Expected behavior
Load balancing works in the current version as it did before the upgrade.

Describe alternatives you considered
Disable load balancing by changing the rule to use only one gateway. -> works without the feature
Downgrade OPNsense (the exact last working version still needs to be determined, which is difficult as the issue is sporadic and I don't want to take my whole network offline several times)

Environment

Main firewall: a server-grade Intel Celeron CPU with Intel NICs on hardware made for firewalls (I do not have the exact model in mind, but it does not matter as this also happens on the spare firewall)

Spare firewall (currently in use) with the same issue:
Intel(R) Atom(TM) CPU N450 @ 1.66GHz (2 cores)
OPNsense 21.1.8_1-amd64
FreeBSD 12.1-RELEASE-p19-HBSD
OpenSSL 1.1.1k 25 Mar 2021

I hope this information is helpful for you; please let me know if you need any further details.

@AdSchellevis AdSchellevis added the support Community support label Jul 14, 2021
@Rapterron
Author

Hello,
Please find some updates below.

I went back through all the versions, and the last working version turned out to be:

OPNsense 19.7.10_1-amd64
FreeBSD 11.2-RELEASE-p16-HBSD
OpenSSL 1.0.2u 20 Dec 2019

As soon as I update to the next major version the bug is triggered.

I did not notice this issue earlier because this is a production firewall and I needed a maintenance window to update, so I applied multiple updates at once.

Configuration in detail

In this working configuration under advanced Multi-WAN I have:

  • Sticky connections enabled
  • Shared forwarding enabled
  • A floating firewall rule, almost at the end of my rule set, which matches any traffic and routes it to a gateway group containing both of my gateways at the same tier level.

If you need more details about my configuration, please let me know.

This configuration has worked for the several years I have used OPNsense (it was the main reason to switch to OPNsense).
For now I will leave my production firewall on this version, but for testing I can quickly set up a new instance, since I used this opportunity to virtualize the firewall (on Hyper-V Server Core).

Thank you and have a great Weekend.

@AdSchellevis
Member

As a starting point, you'd probably best compare the contents of /tmp/rules.debug between both versions; checking the counters (inspect button) might also help identify issues with the ruleset (another rule matching first, for example).

If your issue is related to the ruleset, comparing differences between those files might be the fastest way to narrow it down.

From a kernel/driver perspective 20.7 and 21.1 are roughly the same, so if your problem doesn't exist on 20.7.x, it's not likely a kernel or driver issue. That might be worth the effort to test as well.
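
A rough way to do that comparison (the file names here are only an example) would be to keep a copy of the old ruleset before upgrading and diff it afterwards:

cp /tmp/rules.debug /root/rules.debug.19.7
diff -u /root/rules.debug.19.7 /tmp/rules.debug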

@sschueller

sschueller commented Jul 19, 2021

I have seen this issue (I think it's the same) on the pfSense 2.5.1 release as well. I have 2 periodic speedtests running, one on the active WAN and one on the backup/failover WAN. After the upgrade I can only run the speedtest on the active gateway. It may be related to this https://forum.netgate.com/topic/163070/pfsense-2-5-1-multi-wan-routing-trouble/4 and https://reviews.freebsd.org/R10:41063b40168b69b38e92d8da3af3b45e58fd98ca ?

I do the following:
Install speedtest app

pkg add "https://install.speedtest.net/app/cli/ookla-speedtest-1.0.0-freebsd.pkg"

igb0 = Primary (active)
igb2 = Failover/Backup WAN

Run on interface igb0 (works)

/usr/local/bin/speedtest -I igb0

Run on interface igb2 (fails)

/usr/local/bin/speedtest -I igb2

Now if I set igb2 as active by marking igb0 as down the speed test will work on igb2 but no longer on igb0.

@AdSchellevis
Member

@sschueller usually this behaviour is caused by missing reply-to rules on outgoing traffic, assuming a ping from any of the source addresses in your case doesn't work as well (ping -S <ip of igb0|igb2> 8.8.8.8) which is what I just tested on one of our machines (and works without issues).

@sschueller

@sschueller usually this behaviour is caused by missing reply-to rules on outgoing traffic, assuming a ping from any of the source addresses in your case doesn't work as well (ping -S <ip of igb0|igb2> 8.8.8.8) which is what I just tested on one of our machines (and works without issues).

Both of those work for me without issue.

root@XX:~ # ping -S xx.xx.xx.xx 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from xx.xx.xx.xx: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=118 time=13.290 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=16.490 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 13.290/14.890/16.490/1.600 ms


root@XX:~ # ping -S yy.yy.yy.yy 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from yy.yy.yy.yy: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=116 time=36.244 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=33.177 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 33.177/34.711/36.244/1.534 ms

@Rapterron
Author

Hello,

@AdSchellevis Good indicator; this may explain why the last working version is 19.7.10 while 20.x and 21.x have this problem.

Right now I am back on "OPNsense 19.7.10_1-amd64" with the same configuration, which works fine, and I see an almost perfect balance between both Internet gateways on my dashboard.

A firewall rule going wild was also my first guess, but as I do not have that many rules it's quite easy to review them, and there are no obviously contradicting entries.

For testing purposes I could also disable all other rules and/or place the affected gateway rule first to rule out any rule-related issue.
I also checked the counters under inspect and found nothing unexpected.

I also considered completely reconfiguring the firewall with a minimal rule set to test connectivity, but then I read about other users with similar issues.

I will update again to the latest version and try the mentioned steps and also check the diff between both /tmp/rules.debug files.

Will update the results here.

@fichtner
Member

Did you check "Disable State Killing on Gateway Failure" option under Firewall: Settings: Advanced yet?

Cheers,
Franco

@sschueller

Did you check "Disable State Killing on Gateway Failure" option under Firewall: Settings: Advanced yet?

Cheers,
Franco

This has no effect on my setup. I still can only do a speedtest on the active gateway.

@AdSchellevis
Member

@sschueller maybe speedtest doesn't bind to the address of the interface. Just keep in mind that this scenario (keep traffic on originating interface) has no relation to the use of gateway groups (the topic of this issue).

@Rapterron
Author

Good evening.

I have a few more updates and have narrowed down the issue.
Maybe one thing to mention: I do not have any plugins (besides the dark theme) installed.

I upgraded the running firewall (exactly the same config) from 19 to 21 again and immediately triggered the issue.

For debugging purposes I deactivated all my rules and placed the gateway routing rule first (it matches all traffic and sends it to the gateway group).
This also immediately triggered the issue, and all sessions went crazy on multiple clients in the network.

Firefox, for example, returned the error message (translated from German): unable to process the request, protocol violation.
After a few ping tests, even my workstation OS (Windows 10 Pro) ran into a blue screen.
My Android phone also did not work properly until I disabled WiFi.

This absolutely points me to asymmetric routing, where the packets of one session get spread round-robin across 2 independent gateways (2 different public IPs), which violates the protocol and leads to unexpectedly broken sockets.

I suspect that the Multi-WAN "Sticky connections" feature (Firewall -> Settings -> Advanced), which is essential here, is no longer working.

I also compared the "rules.debug" files of both versions (a rough diff), but they look almost the same.

To be very sure the issue is on the OPNsense side, I also downgraded and replaced both gateway routers (because I had also upgraded them recently).

As a next step I could capture some traffic at the gateways to see what comes out of the OPNsense, but I am fairly sure that non-working sticky connections and the resulting asymmetric routing are the root cause.

I could also install a fresh OPNsense with the absolute minimum Multi-WAN configuration in a test network segment, but this would take a few days to set up.

@AdSchellevis is there anything else I can do? Anything for you to narrow down and fix the issue?
Should I switch to the development update channel?

Thank you very much for your support and please do not hesitate to let me know if you need any further information.
Cheers Christian

@AdSchellevis
Member

Hi Christian,

I'm not expecting general issues to be honest; upstream (https://bugs.freebsd.org) doesn't seem to indicate an issue with the sticky-address keyword (and I'm quite sure a lot of people use this feature). Best try to upgrade to 20.7 first and test what that does, so the kernels are roughly aligned, and then work your way up through the ruleset.

Problems with unstable connections often also relate to MTU, by the way; as a wild guess you could also try lowering the MTU values to something like 1300.

Best regards,

Ad

@Rapterron
Author

@AdSchellevis Yes, that is strange. I also have not heard of a general issue in FreeBSD, but I read about this problem now and then (and also in the forum thread linked in #5089).
As soon as I upgrade to anything greater than 19.7.10_1 the issue is triggered.

I could imagine that this issue does not affect that many users, because they might use failover or have such a large upstream gateway that they are not interested in load balancing. In my setup, however, I have 2x 100 Mbit (with 40 Mbit upstream) VDSL connections for a few servers, my test network and around 15 users, so I want to utilize the bandwidth of both connections.

For sure I could stay on version 19 forever, which works, but I guess this is not the best solution for other users or for the developer team.
I understand that this might not be a high-priority issue for the team, but it should at least be on the radar.
I use OPNsense, I am happy with OPNsense, so I am also happy to contribute to making this firewall better.

Back to the Issue:
Yes, I played around with the Multi-WAN and gateway-related settings such as Disable State Killing on Gateway Failure, sticky connections and the time limit for the stickiness, but it does not matter what I set.

I will now do the following:

  • as recommended, try a different MTU (see the command sketch after this list)

  • trace the packets on the gateway-facing interfaces

  • do a new and minimal parallel installation of the firewall in a separate test VLAN, but with a similar setup pointing to the same gateways

  • for better visualization, I will also provide a diagram of my setup
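
For the first two points I plan to run something along these lines (interface names are placeholders and the exact commands may differ slightly):

ifconfig <wan_if> mtu 1300          # temporarily lower the MTU on a gateway-facing interface
ping -D -s 1272 8.8.8.8             # don't-fragment ping to verify the usable packet size
tcpdump -ni <wan_if> host 8.8.8.8   # watch what actually leaves towards one of the gateways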

One last word regarding #5089 and @Malunke: I created this ticket as recommended by the OPNsense bot, using the templates, to get some movement; as I hadn't seen any comments in that ticket for a few days I was not sure whether there would be any movement at all.
I sincerely apologize for any misunderstanding.

I will let you know my results.
Cheers Christian

@AdSchellevis
Member

Hi Christian,

If 20.1 is the first version which doesn't work correctly for you, it might be easier to track the changes (and focus on 20.1 first).
The kernel between 19.7.7 and 20.1 is basically the same one if I'm not mistaken.

Best regards,

Ad

@Rapterron
Author

Hello all,

I had some time and set up a fresh OPNsense in my test network, parallel to my production network but using the same gateway routers.

Quick summary: the issue also appears with a minimal configuration.

Let me try to report in as much detail as possible, providing all the important configuration.

Starting with a network diagram:

[network diagram]

Fresh installation of OPNsense 21.1.8_1-amd64 with the minimum configuration/rules.
Currently running on a Hyper-V server (parallel to the production instance):
[screenshot: system information]

Here you see the interfaces.
Please don't mind the 3rd WAN interface (called WAN3); it is not used in the rule or group (see below).

[screenshot: interfaces]

Here are the gateways, with gateway monitors configured to determine their status:

[screenshot: gateways]

As an example, the configuration of one gateway:

[screenshot: gateway configuration]

The gateway group with both gateways in a tier 1 group for load balancing:

[screenshot: gateway group]

[screenshot: gateway group configuration]

Used in a floating firewall rule:

Notice: the deactivated rule would route the traffic via one WAN interface without load balancing.
This is one of the workarounds.

[screenshot: floating rules]

The content of the floating rule:

[screenshot: floating rule configuration]

And to finish the set, here are the additional firewall settings such as sticky connections:

[screenshot: advanced firewall settings]

This is basically a very small setup using mostly default settings, and it is also described in some online documentation.

As this is a very generic configuration without any sensitive data please find below the configuration:

[attachment: config-OPNsense.localdomain-20210729172111.zip]

I hope this information is helpful for you; please let me know if you need any further details.

I will leave my working production OPNsense on 19.7.10_1, as this also points to a more generic issue.
For any further testing I can easily start my test router.

Cheers Christian

@AdSchellevis
Member

Hi Christian,

I really would advise checking between 19.7.x and 20.1 as suggested in my previous comment, so we can compare differences; moving to (almost) the latest version is quite a gap in time.

The old installer for 20.1 is still available on our webserver (https://pkg.opnsense.org/releases/), if you're looking for a 19.7 one, I can look around if I can find one for you.

Best regards,

Ad

@Rapterron
Author

Hello Ad,

Thank you very much for your quick response.

Yes, I had this in mind, but then I thought it would be better to go tabula rasa and build a config from scratch using a current version.
I have had the firewall for several years now and wanted to exclude an issue caused by my configuration and rules, which grew over time.

Thank you; I still have all the ISO images here on my file server and have already prepared 3 VMs:

  • 19.7 production
  • 20.1 test
  • 21.1 test
    and can very quickly install and switch between them.

I am now on 20.1.9_1 and testing.
Same config as in my previous post above.

[screenshot]

From your point of view, is there anything in the example config which might be incorrect or unwanted?

Thank you.
Cheers Christian

@AdSchellevis
Member

Hi Christian,

At a first glance this looks rather normal, except that if you're also running DNS on the box you're likely forwarding local traffic to the next hop now (which would also disrupt your internet traffic).

The reason I'm asking you to test both older versions is so that we can match code differences; as I mentioned before, on my end multiple gateways using either sticky or non-sticky connections over IPv4 work like a charm.... on the latest version.

Over the years I've seen quite a few configuration issues from customers at our support desk, but that's really outside the scope of community support.

Best regards,

Ad

@fichtner
Member

I am now on 20.1.9_1 and testing.

It would be beneficial to test specifically the initial 20.1 release, unless the error is already absent there. 20.1 -> 20.1.9_1 still covers a lot of ground.

@Rapterron
Author

Hello Ad,

thanks again for the fast response.

Yes, I am absolutely aware that this is "community" support and I am more than happy that you are taking a look at this case. You should not have to debug my production configuration, especially since other users have this issue too, so I reduced it to the bare minimum to narrow it down at a low level.

Yes, this is a separate test network and all traffic should be routed straight to one of the two gateways. For DNS the clients query the Google DNS servers directly (in production I have my own server).

Okay, while I was testing, it also happened on version 20.1.9_1.
I have 2 Windows 10 clients and 2 Android mobiles.
All systems show a similar reaction.

On the Windows clients you can see that nothing loads anymore; nslookup, traceroute and ping work, but nothing else.

[screenshot]

On the Android mobiles nothing loads and WiFi reports: connected without internet.

On the router's dashboard everything is "green", and after a reboot it works again.
The next time this happens I will check whether other actions, like modifying the group or turning a gateway off and on, help temporarily.

@fichtner Hm, okay, that's a point. I will reinstall the initial 20.1 release and test again.

Cheers Christian

@Malunke

Malunke commented Aug 13, 2021

Is there any news here yet? Unfortunately, I can't actively contribute to the testing, but I would be happy if the developer community would actually acknowledge this bug as a bug and not blame it on some exotic configuration. I have already shown in some posts that it occurs on different hardware, and Rapterron has shown in an exemplary way that merely updating a working OPNsense machine triggers the problem. By now, at the latest, the pride of the developer community should be stirred; with other projects this would have been enough long ago.

I would be happy if there would finally be news.

@AdSchellevis
Member

@Malunke not sure why you seem to have the urge to hijack this thread, but let me reference my last comment, maybe you forgot to read it #5089 (comment)

@Malunke

Malunke commented Aug 13, 2021

(Before Reading - normally the following has nothing to do with a bug report. I still want to push this topic and want a solution.)

Hello,
I am not hijacking the thread and I have not forgotten the last comment.
My problem is that simply nothing happens.
In my opinion, everything has already been said on this topic in various community forum threads, but nothing happened.
After that I opened a bug report; nothing happened.
After that Rapterron opened this bug report and has done excellent work up to today, but still nothing has happened.

I'm sorry, but I just can't understand this. I also can't understand why the developers don't already react in the community forum, although it is almost certainly a bug. The steps are quite simple:

  1. Install the firewall plain vanilla
  2. Set up 2 WAN gateways
  3. Put both gateways into one gateway group (same tier)
  4. Set up a firewall rule with the gateway set to this group
  5. Experience the error for yourself

There is nothing extraordinary in this configuration. To top it all off, Rapterron did a great job and found out at which version jump the error occurs; all this should prove that it is a bug in the firewall and not an RTFM error! But nothing has happened to this day, and if the developers have no infrastructure to test with, I am speechless. Bug analysis and code comparison can't be demanded from the community either, although with almost every problem it is pointed out that you didn't buy a paid support package from OPNsense!

I am still politely waiting for a bug fix. I am also in a position to advise users and customers for or against a paid firewall product, and OPNsense is not doing well here at all at the moment.

@AdSchellevis
Member

@Malunke Apparently you really don't seem to want to listen. As reported earlier, I did set up a gateway group (#5089 (comment)), which didn't have issues; quite a few people use this feature without issues, so there's likely something different in your setup. I don't mind digging into something, as long as someone can explain what to look into (hence the question to pinpoint the exact version). I'm not spending any more time on this; good luck complaining about other people not helping you out, a bit of self-reflection would probably help.

@Rapterron
Author

Good evening.

I am still testing and thought it would be best to wait with my report until the tests are finished.
I got notified by GitHub about new posts, so I think it's a good time to jump in again.

As mentioned, I am not testing in the production environment at this time, but I use the same upstream gateways.
My live firewall is still on OPNsense 19.7.10_1-amd64.

The test firewall is now on OPNsense 20.1-amd64, freshly installed from the ISO and not patched, as recommended by @fichtner.

[screenshot]

Not that much traffic but both gateways are used for load balancing.
[screenshot]

[screenshot]

Minimum configuration based on the flowchart and configuration file I mentioned above #5094 (comment).

3 test devices:
2x Mobiles
1x Windows 10
The test environment has now been running for about 2 days and 8 hours and NO issue has been triggered!
Apparently the idea from @fichtner was a good hint, so the last working version is OPNsense 20.1-amd64.

@AdSchellevis you mentioned you have already set up a test? Can you provide the configuration file so I can compare it with my minimal setup?
Do you have NAT enabled? That might be the only difference. As far as I remember, by default OPNsense creates an automatic outbound NAT rule towards the gateways, but as my gateway routers (AVM FritzBox) have a static route for the returning traffic pointing to the OPNsense, it is not necessary to NAT the LAN side (10.133.7.0/24) on the OPNsense towards the gateway routers.
This saves me some CPU time and avoids double NAT, as the gateway routers will NAT the traffic towards the public Internet anyway.

[screenshot]
Here you see the return route:
192.168.12.10 is the leg of my production OPNsense,
192.168.12.11 is the leg of the test OPNsense.

I have the production and test firewalls on the same server hardware (Hyper-V) and can load, test and switch versions and configs very easily.

Hope that helps and please let me know what to test next.

Side note on load balancing:
As I work for an ISP and mostly use Fortinet hardware, I see that the majority of customers use failover instead of real load balancing.
Mostly they have a big main line with gigabit or even 10 gigabit and a smaller backup line via LTE / 5G / xDSL or even Starlink satellite.

In a smaller setup like our apartment block, we want to combine multiple xDSL lines (with a maximum bandwidth of 100 Mbit each) to better distribute the bandwidth to everyone.

Cheers Christian

@mimugmail
Member

Please do double NAT; those times are gone and it is much cleaner. Also try the floating rule with the source set to your LAN and not any. There might be cases where returning packets get sent out again, since they also match with source any.

@fichtner
Member

https://github.com/opnsense/changelog/blob/613960454b7da72e18f282211f7cbd5f1bf844b7/community/20.1/20.1.7#L23 This one looks like a candidate. I think all the kernels of 20.1 are still on the mirror to try if you want.

@AdSchellevis
Member

Hi Christian,

My test was fairly simple, including outbound NAT rules. It would be good to rule out the patch in pf first; it shouldn't have an effect when the interface is there, but if I'm not mistaken that specific change originates from opnsense/src#52 (opnsense/src@923c95c).

The old kernels are still there, so you could try to update only base and kernel with opnsense-update -bkr 20.1.7, if I'm not mistaken.

Best regards,

Ad

@Malunke

Malunke commented Aug 14, 2021

Perhaps I'm able to help a little bit.
I'm using hybrid NAT (see my attachment), so I think NAT won't be the right direction.

However, something that might differentiate Rapterron and me from other setups: we both use VLANs. Normally this should not make a difference, but it could distinguish us from users with purely physical ports. I only have 2 network cables connected to my ESXi server; the rest is divided via tagging and actually runs super stably.

Trunk ports are exclusively the uplink between the 2 switches and the two connections to the ESXi server. So 4 trunk ports; the rest are edge ports without a VLAN tag.

[diagram: network setup]

@Rapterron
Author

Hello @AdSchellevis,

almost 3 days working now without issues.

[screenshot]

@AdSchellevis
Member

@Rapterron Ok, that's good news; at first glance it doesn't look kernel-related in that case. Did you try the latest 20.1 as well earlier, by the way? If that didn't work, we'd probably best take small steps from here.

@Rapterron
Author

Hello,

@AdSchellevis Hm, yes, I already tried 20.1.
The smallest steps I did were: installing from the image -> testing -> upgrading via the UI and repository -> testing.
The test setup still works fine and has no issues. If you can tell me every single step to update, we should easily be able to locate the module which does not work as intended.

A bit off topic, but similar:
I came across another wrong behavior on my "old" production firewall (19.7.10_1).
At the beginning of this week I added a 3rd gateway to the system (I got invited to the SpaceX Starlink beta) and added this gateway to the group (same tier).
As the Starlink router does not support custom routing I needed to add an outbound NAT rule, and here my firewall started to load balance packets wrongly, so I disabled Starlink again.

Routing a few clients explicitly via this gateway, or having it alone on tier 1 or tier 2 (failover), works fine, but once it is in a gateway group load balanced with the other gateways in the same tier it does not work.

Today I replaced the Starlink router with a router that supports custom routing, set up all 3 gateways basically the same, and now it works fine. All 3 gateways in the same tier 1 group load balance the traffic fine.

It feels like the whole group and load balancing setup is really fragile.

But anyway, let's focus on this issue with the most recent version, as on my old firewall I might have run into a bug which has already been fixed.

Thank you and have a great weekend!
Cheers Christian

@AdSchellevis
Member

@Rapterron There are likely a lot of wrong assumptions about policy-based routing; in our experience it's pretty stable, but it often requires more knowledge about your network and the expected traffic flow.

So if I understand the current situation correctly, your setup works well with the first releases of 20.1 but no longer works in the latest version. We excluded the kernel part, but we can also step through the versions in the 20.1 branch; if I'm not mistaken you can upgrade to a specific version using the "firmware flavour" selection.

All available versions can be found on our mirror (https://pkg.opnsense.org/FreeBSD:12:amd64/21.1/MINT/), setting the flavour to custom + "20.1/MINT/20.1.2/OpenSSL" should offer the option to upgrade to 20.1.2. Can you upgrade + test until the non working 20.1.x is identified?

@Rapterron
Author

Hello @AdSchellevis,

Please excuse me for not responding for the last 2 weeks.

Yes, I know that policy-based routing can be very tricky, and to rule out all other possible network setup errors I set up a clean environment with only one rule.
This is a very simple setup: LAN and 2x WAN in a separate VLAN, with no connection to the production network, and with 3 test devices (2x Android phones, 1x Windows client).

After the kernel update it was still working, and my next plan is to upgrade all other modules step by step, as you mentioned, until the issue is triggered, but I need some more time for this as updating plus testing takes a few hours per step.

As a quick side note:
I also have a virtual installation of the latest OPNsense which I upgraded recently, and the issue is still present there.

Maybe also helpful: I also have a pfSense test installation based on the very latest version, and in a similar setup the multi-WAN load balancing is working. I don't know how many modules OPNsense and pfSense have in common, but maybe this information helps you narrow down the root cause.

Anyway I will let you know once I updated step by step.
Cheers Christian.

@Rapterron
Author

Hello all.

During the last week I actively patched my test instance version by version through the 20.x releases.
After each step I tested for at least 1-2 days.

By accident I made a discovery which might help resolve this issue.

While patching from 20.1.9_1.... towards 20.7.5 (a version which had the issue before) it was still working, and I wondered what had happened and why it was working now. Later I remembered my other test and that its configuration was still in place (I guess I forgot to save).

Luckily this mistake may lead to the root cause and the solution.

I disabled the function "Use sticky connections" from the Advanced Firewall UI
grafik

and it seems this resolved the connectivity issues.

Right now I am on 20.7.5 and it still works.
Also my other test container with the latest 21.7.3_3 works now.

To double check this: when I turn sticky connections back on, the bug is triggered, and within a few minutes to a couple of hours connectivity breaks and I need to reboot the firewall.

I will continue my test with this setting and try to identify possible downsides but for now this seems to be the best point to start checking the code.

Cheers Christian

@AdSchellevis
Member

@Rapterron if you're experiencing issues with sticky connections, I would inspect the size of the source tracking table at the time you have issues. From the command line pfctl -vvsinfo.

@Rapterron
Author

Hello @AdSchellevis and thank you for your ongoing support!

The issue was triggered this evening, and I left it in this state for some time.
2 of 3 devices reported no internet and were unable to do anything online.

Below the output:
[attachment: pfctl -vvsinfo.txt]

Note: this was in my 3rd test container running a OPNsense 21.7.3_3
If needed I can switch to another version.

Cheers Christian

@AdSchellevis
Member

@Rapterron The number of current source entries is way too limited to be problematic.

Source Tracking Table
  current entries                        5
  searches                            3581            0.1/s
  inserts                               57            0.0/s
  removals                              52            0.0/s

When a couple of devices report loss of connectivity, the next thing I would do is check whether there are still states assigned to these machines (Firewall -> Diagnostics -> States); if so, kill them and try again.
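
If the command line is easier, something along these lines should also work (the client IP is just an example):

pfctl -k 10.133.7.50      # kill all states originating from this client
pfctl -K 10.133.7.50      # also clear its source tracking entries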

@FlavioDF

I've also the same problem.

OPNsense 21.7.3_3-amd64
FreeBSD 12.1-RELEASE-p20-HBSD
OpenSSL 1.1.1l 24 Aug 2021

Is there any fix planned?

Thanks in advance

@Rapterron
Author

Hello all, Hello @AdSchellevis

sorry for the delayed answer.

Apparently the issue kicks in almost instantly when I activate sticky connections.
Under Firewall -> Diagnostics -> States I can see the states, and indeed when I clear them connectivity works again.
I will keep an eye on the states and check whether a reset resolves the issue and for how long.

As a further note, and for everyone else reading:
I have found a workaround for my setup by disabling sticky connections. I have not experienced any side effects yet, and if I do I will either tune the gateway priority or route via explicit firewall rules.

Cheers Christian

@tsouza85

tsouza85 commented Nov 22, 2021

OPNsense 21.7.5 (amd64/OpenSSL)
FreeBSD 12.1-RELEASE-p21-HBSD

I have the same problem... I fix it by turning off sticky... it's a bug and no developer admits it...

@AdSchellevis
Member

sorry for the delayed answer.

Apparently the issue kicks in almost instantly when I activate sticky connections.
Under Firewall -> Diagnostics -> States I can see the states, and indeed when I clear them connectivity works again.
I will keep an eye on the states and check whether a reset resolves the issue and for how long.

@Rapterron no problem, it has been busy here too. The problem so-far is that we still don't know in which version there was a change in behaviour, which makes it nearly impossible to make a lot of sense out of the reports to be honest. Currently it doesn't look like there's anything different between versions tested.

Under Firewall -> Diagnostics -> States I can see the states, and indeed when I clear them connectivity works again.
I will keep an eye on the states and check whether a reset resolves the issue and for how long.

If you kill sessions selectively for the machine that doesn't seem to have internet anymore, does that also solve the issue? If so, what is the "state" of the related states (the output of pfctl -s Source -vv might be useful too), and did anything change in connectivity on the upstream connections? When sessions are pinned to an upstream gateway which isn't responding anymore, we would expect similar behaviour.

@tsouza85

I have the same problem... I fix it by turning off sticky... it's a bug and no developer admits it...

Very helpful, what did you do to help track your issue and deliver a reproducible test case?

@Rapterron
Author

Hello all,

First of all and for everyone who run into this problem:

Workaround is to disable sticky connections
[screenshot]

This has the effect/benefit that multi-session TCP connections will use all gateways in the group (which can sometimes result in more bandwidth), but it can also lead to unexpected side effects.
In case of issues I recommend creating several gateway groups with multiple tiers (failover setup) and routing your clients/networks through the groups via static rules.
Like:
Servers via GW A as default and failover to GW B
Clients via GW B as default and failover to GW A
...

I upgraded my production network to the latest version "OPNsense 21.7.5-amd64" and can confirm that this setting works for me. I think I will keep this setting, as it solves the issue and I also get the benefit of combining all my gateways for more bandwidth.

For example a test via speedtest.net uses all my 3 Gateways (2x VDSL + SpaceX STARLINK)
[screenshot]

So thank you all for working on this solution until this point.


Back to the issue

However, even if I am "fine now", I am quite certain that the developers appreciate every constructive input, even if they do not answer promptly.

I still have and will keep my test network up and running for further troubleshooting.

Unfortunately the tracking and testing takes some time, but it seems like after clearing the sessions it keeps working for much longer now (10 days and counting!) and I have to reboot the firewall to trigger the issue again (this might also depend on the low traffic in the test environment).

From what I have in my notes, the last working version was the first version of the 20.x release (OPNsense 20.1-amd64); then we upgraded only the kernel to 20.1.7 (to sort out any kernel-related issues) and it still works, but as soon as I update the remaining packages the issue sneaks in and stays until the very latest version.

Thanks to snapshots I can quickly jump in my test environment back and forth between the versions.

@AdSchellevis I am now back at the first known version which has the issue and will follow your advice and check the sessions / clear them only for one device.

And yes, I can say that as soon as I pin the network to one single gateway via a rule the connection works, but I haven't tested pinning a particular device to one gateway by rule.
I will check this as well and let you know.

Cheers Christian

@AdSchellevis
Member

Hi Christian,

From what I have in my notes, the last working version was the first version of the 20.x release (OPNsense 20.1-amd64); then we upgraded only the kernel to 20.1.7 (to sort out any kernel-related issues) and it still works, but as soon as I update the remaining packages the issue sneaks in and stays until the very latest version.

Let's go into this a bit deeper; the amount of text in the issue (including the me-too comments) makes it hard to keep focus on the relevant data points. So 20.1 works, 20.1 with the kernel of 20.1.7 works, but it stops working after upgrading the rest of the packages to 20.1.7 as well, right? (core would be the only relevant one here, by the way)

If we can answer that with a yes, then it's getting a bit weird. Earlier you checked the contents of /tmp/rules.debug if I'm not mistaken; if these are the same between both versions, it doesn't make sense that there's a difference at all. When it comes to source (policy-based) routing, pf(4) should be the only relevant component here (kernel + ruleset).

And yes, I can say that as soon as I pin the network to one single gateway via a rule the connection works, but I haven't tested pinning a particular device to one gateway by rule.
I will check this as well and let you know.

Not exactly what I meant; the question here is whether anything changed on any of the upstream gateways, leading to the session getting stale in some way. Gateway events might be relevant, for example. No need to test single-gateway rules. As soon as it's stale, the next question is: what is the "state" of the states for that source and what does source tracking report (hence the pfctl -s Source -vv).

Best regards,

Ad

@hydrosIII

Same issue here.
I am getting connectivity issues. It is also happening in pfSense, so it is something related to FreeBSD. It was working before. I have pings of more than 1500 ms to 8.8.8.8 when enabling multi-WAN, either in failover mode or in load balancing mode.

Disabling sticky connections does not seem to help.

Updating to the latest kernel, 21.7.7, seemed to improve the issue for now, but I will have to test it for a few days to say whether it is solved.

@OPNsense-bot

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@OPNsense-bot OPNsense-bot added the help wanted Contributor missing / timeout label Jan 10, 2022
@Malunke

Malunke commented Jan 10, 2022

reopen because there is no solution yet

@fichtner
Member

fichtner commented Jan 10, 2022

There's no progress for three months. Let's please start to be realistic about this issue and its lack of actionable information. As per contribution guidelines updates can be made and if clarity arises the ticket will be reopened.

Cheers,
Franco

@Malunke

Malunke commented Jan 10, 2022

I don't want to start another fundamental discussion about the way "rare" problems are treated, as that would not be helpful.

However, several people have described in a comprehensible way that the problem occurs with them and can be reproduced. Unfortunately, I can't jump back and forth between versions at will, otherwise I could contribute a bit more. But I can say that the problem still exists.

I would like to almost rule out a misconfiguration, but have always offered a team viewer session for a brief assessment.

Why so little information is coming in may be due to the resignation of those involved. In the end I also used the crutch myself and put the gateways into two different tiers. This still gives me failover protection, but not the desired bandwidth sharing, since only gateway 1 is now used in regular operation.

But since no solution is apparent, the only thing left for me is to work around the problem and possibly search for another firewall distribution.

Thanks a lot.

@fichtner
Member

I don't want to start another fundamental discussion about the way "rare" problems are treated, as that would not be helpful.

I'd really appreciate that. :) I have been working on shared forwarding for FreeBSD 13 for a couple of weeks now, and this constant nagging here for free support isn't really useful.

Cheers,
Franco

@Malunke

Malunke commented Jan 10, 2022

Have you tried your shared forwarding project on FreeBSD 13 with the same config we're using here (some of the screenshots show the configuration), especially with VLANs (and, on my end, with a vmxnet interface)? I can also send my XML file if you are interested. A TeamViewer session is also possible if somebody from the developer team is interested.

I can only say that the problem still exists in the current version. And even if it is a rare condition (or perhaps not rare at all, nobody knows), everybody should be interested in investigating this issue and resolving it, either in the web frontend, by not allowing such a configuration, or in the underlying system, so that the underlying bug is fixed.

(By the way - also my time isn't free of charge so constant nagging or ignoring bug reports from different voluntary bug reporters also isn't really useful.)

@fichtner
Member

I'm not here to offer free support. 3 beta versions have been posted and on Wednesday we will have the first release candidate. If you have energy to discuss how community rules don't apply in this particular case you also have energy to test the FreeBSD 13 code.

Cheers,
Franco

@Malunke

Malunke commented Jan 10, 2022

no comment - from here I'm out

@Rapterron
Author

Hi guys,

Sorry, I should have answered earlier, but you know, life happens and I totally lost track of this.

I started a new job and no longer have access to my old test lab, so unfortunately I would need to rebuild everything from scratch.

Since I now work for a German ISP in the hosting and network area, I have gained a deeper understanding of professional firewall systems and bandwidth. I think I have mentioned this already, but on most routers and firewalls you use failover instead of active load balancing, since these kinds of backbone firewalls mostly have bandwidths of 10 Gbit and more, which makes load balancing small-bandwidth VDSL lines with an enterprise firewall like OPNsense an exotic use case.

Unfortunately, and I think I speak for all German IT guys, the Internet infrastructure here in Germany is a joke and way behind the standard. Most cities don't have fibre lines and use old copper wire connections with mostly 10 Mbit up to a maximum of 100 Mbit, and sometimes 500 Mbit via cable, so as Germans we need to play with uncommon solutions like bonding multiple WAN lines together, which is indeed uncommon in the industrial world. For example, our Starlink satellite internet has more bandwidth than the fastest internet line you could rent...

I don't blame the devs or anyone else; it's just a matter of understanding a rarely used case, and I tried my best to provide as much information as possible, but some problems are hard to track, especially when you are the only one with a test lab (well, who "had" a test lab).

I will continue following the opnsense project and may rebuild my lab but that's something in the far future as I am totally loaded with my new job.

Cheers Christian

@denschub

denschub commented Jul 13, 2022

I've now run into the same issue on my OPNsense 22.4.2-amd64, which is a Business Edition setup on a DEC750. I spent a bit of time debugging this and now hit a road block where I don't know how to progress further. Here is some additional, hopefully useful, information.

  • This issue appears to only affect new connections. Existing connections are not interrupted. New connections can "get stuck" (i.e. have 100% packet loss), but usually, killing that connection and retrying after a few seconds makes it work.
  • It indeed appears to be related to Sticky Connections. I'm able to reproduce with Sticky Connections enabled at least once per hour or so, but never without it.
  • Outbound NAT seems irrelevant, I've removed all rules and it still reproduces.
  • When connections are "broken", I see two states in Firewall > Diagnostics > States: The "in" policy-based WAN rule, and the "out" autogenerated "let out anything from firewall host itself" rule. A broken state doesn't look different from a working state.
  • Dropping the aforementioned two states when they're broken makes the connection become alive immediately. For example, in a broken continuous ping, which will never un-stuck itself, dropping the two state entries will make ping work.
  • Running a packet capture on all interfaces with a "broken ping" running, I can see the ping requests arriving on the firewall via the LAN interface, but I never see them leave on any WAN interface, and I also do not see responses. It looks like the packets just don't get forwarded to any WAN interface.
  • Also while having a "broken ping" running, I do not see any connection attempt in the Firewall logs. Not even with logging for the policy-based WAN rule enabled. Just nothing.

It seems like the traffic is dropped somewhere between the firewall's LAN interface and the firewall rules. Unfortunately, I have no clue how to debug that. And while I'd be happy to pay for a support subscription, I doubt this is something that can be resolved in the 2 hours included, and extra hours might be a bit too expensive for my home network. :) Maybe someone finds this information useful and can throw me a pointer on how to debug this further.
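
One thing I have not tried yet is watching pf's log interface to see whether pf itself drops the packets (this assumes logging is enabled on the relevant rules, e.g. the default deny; the client IP is a placeholder):

tcpdump -n -e -ttt -i pflog0 host <client-ip>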

@AdSchellevis
Member

@denschub it might be better to open a new ticket with the relevant information; we haven't been able to reproduce an issue so far, unfortunately. There are some things you can try, the first likely being disabling shared forwarding and checking how that changes the behaviour; we can also offer a beta kernel for 22.7 (FreeBSD 13.1) for testing.
