Shared Forwarding and Gateway-Group sporadic connection lost #5089

Closed
Malunke opened this issue Jul 11, 2021 · 10 comments
Labels
incomplete Issue template missing info

Comments

@Malunke

Malunke commented Jul 11, 2021

Dear all,

Describe the bug

I think there is a bug in shared forwarding combined with a gateway group. When both are used, I get sporadic connection losses to the WAN side.

To Reproduce

  1. create 2 WAN-gateways
  2. add both to a gateway-group (same Tier)
  3. modify firewall-rule to use gateway group instead of default gateway

After that you will have sporadic connection losses to the internet. They can be avoided either by switching off shared forwarding or by using the default gateway instead of the gateway group.

Neither workaround is acceptable, because I want load balancing as well as the traffic shaper, which needs shared forwarding turned on.

See: https://forum.opnsense.org/index.php?topic=23456.0 and https://forum.opnsense.org/index.php?topic=23460.30

More users see the same problem.

Expected behavior
No internet connection losses.

Environment
ESXi virtualized
OPNsense 21.1.7_1-amd64
FreeBSD 12.1-RELEASE-p18-HBSD
OpenSSL 1.1.1k 25 Mar 2021
Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (2 cores)

@OPNsense-bot

Thank you for creating an issue.
Since the ticket doesn't seem to be using one of our templates, we're marking this issue as low priority until further notice.

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

The easiest option to gain traction is to close this ticket and open a new one using one of our templates.

@OPNsense-bot OPNsense-bot added the incomplete Issue template missing info label Jul 12, 2021
@Rapterron

Dear,

I am affected as well and have been debugging my setup for the last 5 days (since an OPNsense upgrade), so please let me add some more information to raise the priority. I think losing the MultiWAN and load balancing capability of OPNsense is a major issue.

My setup:

OPNsense is installed on a hardware platform with a Celeron CPU and Intel NICs. This hardware is made for firewalls (I don't have all the specs at hand, but I can provide them if necessary).
For troubleshooting I also installed OPNsense on a second device, with the same issue (read more below).

I have 2 Internet gateways from 2 different ISPs, each terminating on an AVM Fritzbox (a widely used xDSL modem and router).
OPNsense sits behind both AVM VDSL routers, connected via 2 separate VLANs.
Between OPNsense and each gateway router there is a small transfer network with statically assigned IPs.
Traffic from OPNsense to the gateway routers is plain IP routing, with no NAT on this leg.
NAT happens later, on the VDSL gateway routers towards the Internet.

In the configuration I have both gateways in a group on the same tier and route the traffic via a floating firewall rule.
To keep sessions consistent, I enabled the option to lock sessions (sticky connections) under MultiWAN in the advanced firewall settings, with a custom timeout of 300 seconds.
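
To make the intended behaviour of this configuration concrete, here is a deliberately simplified toy model of a same-tier gateway group with sticky connections. It is only a sketch: the real selection is done by pf in the kernel and differs in detail, and the class name, gateway labels, and source addresses are made up for illustration.

```python
import itertools


class GatewayGroup:
    """Toy model of a same-tier gateway group with sticky connections.

    This is NOT pf's implementation. It only illustrates the idea that a
    source address stays pinned to one gateway until its entry expires.
    """

    def __init__(self, gateways, sticky_timeout=300):
        self._cycle = itertools.cycle(gateways)  # round-robin over same-tier gateways
        self._timeout = sticky_timeout           # seconds, like the custom 300 s setting
        self._sticky = {}                        # source IP -> (gateway, last_seen)

    def pick(self, src_ip, now):
        """Return the gateway for a new connection from src_ip at time `now`."""
        entry = self._sticky.get(src_ip)
        if entry is not None and now - entry[1] < self._timeout:
            gateway = entry[0]                   # still pinned: reuse the same gateway
        else:
            gateway = next(self._cycle)          # new or expired: next gateway in rotation
        self._sticky[src_ip] = (gateway, now)    # refresh the timestamp on every use
        return gateway
```

With two hypothetical gateways `WAN1_GW` and `WAN2_GW`, a client that keeps opening connections within the timeout always leaves via the same gateway, while new clients alternate between the two. The symptom described in this issue is that real traffic stops following that expectation.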

This setup worked fine over the last few years. After the update to OPNsense 21.1.8_1-amd64 I experience sporadic Internet outages from all internal networks.
While the bug is triggered I can still ICMP ping endpoints behind the gateways, but nothing else works on any client. It seems as if OPNsense has completely mixed up the sessions and the gateway allocation.
After a restart of the whole firewall it works again for anything from a few minutes to a few hours (no pattern recognized yet).

Steps taken for troubleshooting.

I reinstalled the firewall on the same hardware while importing my configuration.
Reinstalled the firewall on a spare hardware device while importing the configuration.

Steps I will do later for troubleshooting.

Since it's a production firewall I need to be careful and document every step, but as I could not find a solution with MultiWAN enabled, my next step will be a full downgrade on the spare firewall. Once I can confirm the last working version I will update you here.

Workaround:

Disable the load balancing by changing the floating firewall rule to use only one gateway.
Downgrade OPNsense

Additional details for troubleshooting:

Current version: OPNsense 21.1.8_1-amd64
Last known Version: OPNsense 20.xx (subversion not clear as I made several updates at once)
Replicating the bug: having 2 Internet gateways in a same tier group and defined in a floating rule.

Thank you very much for your support. I hope this bug report is detailed enough to raise the priority.
Please let me know if you need any further details.

@dannykorpan

Same issue here, it's also mentioned many times in the OPNsense forum, e.g. here
https://forum.opnsense.org/index.php?topic=17116.0
https://forum.opnsense.org/index.php?topic=7860.0
https://forum.opnsense.org/index.php?topic=19977.0
https://forum.opnsense.org/index.php?topic=6993.0
https://forum.opnsense.org/index.php?topic=23456.0

Workaround: either disable shared forwarding (and lose traffic shaping and the captive portal) or disable sticky connections.

@fichtner
Member

FWIW, people say it's broken quickly but fail to add such helpful information as to what IP protocol family is involved or give some packet-based (pcaps) indication on the issue.

@Malunke
Author

Malunke commented Jul 23, 2021

Hello,
I find the reaction both here and in the forum very unfortunate. Several topics have been opened and no real error analysis has been attempted in any of them. I have asked specifically what information I can still contribute to the diagnosis, but there is simply no further response from the developers. I find this more than a pity.
Yes, I didn't pay money for support, but if this is probably a bug, everyone involved should be interested in solving the problem. And yes, my time is not free either. But I'm happy to spend it on bug analysis in the spirit of community; I just need to know what to contribute.
Rapterron is a very good example, he has contributed quite a bit. Unfortunately, in my infrastructure I can't downgrade, upgrade, or set up new machines at will. And since no one told me specifically what information is still needed, I couldn't help until now.
But I have described exactly how the bug can be reproduced, I have 2 firewalls that show similar behavior, and I have found other forum participants with similar problems. To respond now with a sweeping statement that it is probably the users' fault is, I find, a shame. Even if that is sometimes the case, these members also need help. But in this case I think it is a bug, which according to Rapterron's exemplary search was probably introduced after 19.7.

Dear OpnSense team - if without any major change the problem occurs all at once after upgrading from 19.7 to a higher version I ask the question here. Is it probably due to the software or the user???

See #5094

@Malunke
Author

Malunke commented Jul 23, 2021

By the way - simple IPv4 TCP/UDP and ICMP
In one of the forum posts you find a lot of screenshots so please do not claim there is not enough information.

@mimugmail
Member

Maybe the reason is that most of the devs do their development at home, where two lines are quite untypical. That's also why the devs need more info, like packet captures.

I will try to reproduce it next week on a VPS (because I don't have 2 lines to test with either).

@AdSchellevis
Member

A solid bug report is always welcome, but like I also mentioned in the other thread (#5094 (comment)), if there's a difference between versions it's imperative to know what the last working version was (not being a version from 2 years ago).

Next question is what did "it" do, what do you expect "it" to do and what should the feature do according to the documentation, and don't forget, which features are being used in combination (in case the shaper and captive portal aren't used, you can safely disable shared forwarding by the way as these are the reason the option exists).

The conditions under which an issue appears are also quite crucial to know, @fichtner doesn't point for no reason to which kind of traffic is being affected. IPv4 and IPv6 for example are different implementations inside the kernel.

When connectivity drops for whatever reason, also check the logs, did a line drop for example or are there kernel messages from the driver (driver/hardware related for example).
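
As a rough sketch of how such evidence could be collected from a client behind the firewall, the script below probes TCP connectivity over IPv4 and IPv6 separately and collapses failed probes into outage windows that can then be matched against the firewall logs or a packet capture. It is not an OPNsense tool; the function names, the default port, the interval, and the target host are assumptions chosen for illustration.

```python
import socket
import time


def probe(host, port, family, timeout=3.0):
    """Return True if a TCP connection over the given address family succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no address of this family could be resolved
    for fam, socktype, proto, _canon, addr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return True
        except OSError:
            continue  # try the next resolved address, if any
    return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) windows."""
    windows, start, last_fail = [], None, None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts            # first failure of a new window
            last_fail = ts
        elif start is not None:
            windows.append((start, last_fail))
            start = None              # a success closes the window
    if start is not None:
        windows.append((start, last_fail))  # still failing at the end
    return windows


def monitor(host, port=443, interval=10, rounds=6):
    """Probe IPv4 and IPv6 separately and return the samples per family."""
    families = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}
    results = {name: [] for name in families}
    for _ in range(rounds):
        for name, fam in families.items():
            results[name].append((time.time(), probe(host, port, fam)))
        time.sleep(interval)
    return results
```

Running `monitor("example.org")` (a placeholder host) and passing each family's samples to `outage_windows` gives timestamps that show whether only one protocol family is affected and when exactly the drops happen.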

I just tried to setup a gateway group with (and without) sticky connections, on my end it works without issues (with IPv4, but since quite some people use this combination, I didn't really expect issues to be honest).

If we state that #5094 and this issue are the same, we should probably close one of them, by the way.

> Dear OpnSense team - if without any major change the problem occurs all at once after upgrading from 19.7 to a higher version I ask the question here. Is it probably due to the software or the user???

Reading the text above, I hope you understand it's not that simple to point to "the software" as well, there are quite some moving parts there.... for us it's impossible to track everybody's problems considering more than 80% of the issues are related to how people use our software and the equipment surrounding it (not necessarily issues with the software). The quality and contents of issues really matter to gain priority (yes, we also accept pull requests to discuss improvements).

Our community time is limited, we use quite a lot of it to improve OPNsense..... going on goose chases doesn't help. Not saying that you don't have an issue, but it would really help if there's a reproducible set of steps someone else can replicate which always leads to the same result (from scratch with the minimum set of components). Most projects request that by the way, I don't think we're that unique.

@Malunke
Author

Malunke commented Jul 23, 2021

From my point of view you can merge this issue with issue #5094 (I don't know why Rapterron replied here and also opened a new one). But he did a very good analysis (I didn't know that it worked before, because my second line is quite new, so with the former versions I also had only one WAN line).

My biggest problem is this: I tried to ask and help in the community forum (and you should agree that it is probably a problem with the software, and that more than one person out there sees this phenomenon), but suddenly there are no replies and no reactions from the dev team. A good software team should also make an effort to investigate errors reported by the community.

It is also possible to log on to my installation via TeamViewer and have a phone call to get the necessary information (German is my native language), if there is real interest and a good error culture is practiced.

OT:
I recently had a similar problem with a backup software. The standard edition is free; the larger editions are paid.
There was a sporadic problem restoring some virtual machines (this was apparently due to something specific in my backup infrastructure). After 2 emails and a short demonstration in my environment (counterpart was from Australia) and collecting debug-logs from the developers, the bug was found and fixed within 10 days (it was a specific problem with some SATA hard disks). Also the developers were very grateful for solving this bug.
Mind you, at the moment I'm using the free edition there - but there the developers were very eager.

I also have a second software package (a tax software) where I reported some bugs. The developers were also very thankful - since some years I get every update free of charge because of my help in finding bugs.

And in my case I first try things out privately. If they prove themselves, they may then be used both at my job and in our club, where the paid editions are usually used, so it is a win-win for both sides. Products also get ruled out this way, however, and I am certainly not alone in this approach.

@AdSchellevis
Member

> It is also possible to logon via teamviewer in my installation and do a phone-call to get the necessary information (german native language) if there is really an interest and when a good error culture is lived.

I understand your point of view, but in reality networking is complicated and most of the issues really lie out of our scope. Quite some issues look alike, but have different causes. If there's an apparent and reproducible bug, we tend to dig into those as soon as we find the time for it, but in reality there's a lot of assumptions in tickets.

It's always annoying when you open a thread in a forum and nobody reacts, but the community as a whole needs to step up more in that area, I guess. You can't always point to the people building the product.

Some time ago I read a book on the subject by Nadia Eghbal named "Working in Public: The Making and Maintenance of Open Source Software", if you're interested in how (most) open source projects work, it's definitely an advisable read.

If you're really interested in tracking your issue down, my advice would be to try to gather some people on the forum (you can ask people on GitHub to join there too) and discuss an action plan for making your problem reproducible. If that seems to be difficult, try to figure out what binds the issues you expect to be the same (type of hardware, type of traffic, functionality used, etc.). In some cases the way things are configured is just different than intended, in which case they can stop working at some point in time as well.

I'll close this issue so #5094 can stay open.

7 participants