MultiWAN and Reset States #5387

Closed
2 tasks done
ElXk6 opened this issue Dec 2, 2021 · 12 comments
Labels
help wanted Contributor missing / timeout support Community support

Comments

@ElXk6

ElXk6 commented Dec 2, 2021

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Similar issues or feature requests:
#3979
https://forum.opnsense.org/index.php?topic=25818.0

Describe the bug

We are using MultiWAN with 2 Uplinks with:

  • Gateway switching (Allow default gateway switching => enabled)
  • Kill States ( Disable State Killing on Gateway Failure => not ticked)
  • Sticky Connections (Use sticky connections => not ticked)
  • Two Gateways, one with higher prio
  • Also tested Gateway group

On top of that, I run an OpenVPN client connection (TCP).

When I provoke a failure of the active gateway, gateway switching kicks in, the OpenVPN tunnel times out, and the takeover works fine. It also seems to reset the TCP states, since my SSH tunnel/access dies.

HOWEVER: when the primary gateway comes back, the default gateway switches back to it, BUT the TCP states do not get killed.

The SSH session is still active. No state reset seems to happen.
If I kill the ESTABLISHED connection in the "States Dump" GUI, then it will start to connect via the active/correct gateway.

So I wonder if:

  • I set something up wrong?
  • the state reset only happens on the first failover, by design
  • the missing state reset is a bug and it should also be triggered when jumping back to the primary interface

To Reproduce

Steps to reproduce the behavior:

  1. Set up MultiWAN with
    • Gateway switching (Allow default gateway switching => enabled)
    • Kill States ( Disable State Killing on Gateway Failure => not ticked)
    • Sticky Connections (Use sticky connections => not ticked)
  2. Test interruption of default gateway
    => State Reset happens, ssh connection goes down
  3. Wait until everything seems fine
  4. Reconnect default gateway
  5. The gateway with the higher priority becomes the default gateway again
    => State doesn't get reset, ssh connection is up
  6. Wait (60min)
    => The backup gateway is still in use
  7. Kill states manually (Firewall: Diagnostics: States)
    => State Reset happens, ssh connection goes down, default gateway gets used

Expected behavior

If a gateway has a higher priority, the connection should jump back to that default gateway once it is available again, like all other connections do.
Also, the MultiWAN firewall rule seems to have no effect on this behavior; it is ignored as well.

Describe alternatives you considered

An option to force connections back.

Additional context

All other connections like HTTP/HTTPS/ICMP are jumping back and forth between default gateway and backup gateway.

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 21.7.6-amd64
FreeBSD 12.1-RELEASE-p21-HBSD
OpenSSL 1.1.1l 24 Aug 2021
AMD G-SERIES SOC GX-416RA 1.6 GHz Quad-Core
Network Intel® I210-AT

@AdSchellevis
Member

You could try #5367 (comment), but if it's specifically for OpenVPN clients you might have to wait for @mimugmail, as he offered to set up a test on his end.

@fichtner fichtner added the support Community support label Dec 9, 2021
@ElXk6
Author

ElXk6 commented Jan 10, 2022

Are there any updates on this issue?

you could try #5367 (comment)

In my case, it is mostly OpenVPN specific.

@OPNsense-bot

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@OPNsense-bot OPNsense-bot added the help wanted Contributor missing / timeout label May 31, 2022
@ElXk6
Author

ElXk6 commented Jun 5, 2023

Is there anything new here?

I recently had it happen again that a VPN connection ran over the wrong WAN interface for several weeks, until we noticed problems.

Is there an extra option for this?
Dynamic state reset was removed and can no longer be used, so that is no longer an option either.

For some setups it would also be nice to have a hard reset just for the second gateway, because the second gateway should really only be used in an emergency (e.g. LTE uplinks). Maybe this could also be an option that can be set separately for each interface?

Is this currently "works as intended"?
Because the only solutions I currently see are monitoring whether a gateway is still in use, or even writing a script that keeps the second gateway disabled whenever the first one is up :/

@fichtner
Member

fichtner commented Jun 5, 2023

To most reporters in the past the disruptive clearing of states is the actual undesirable outcome. If you need it you could throw a script into /usr/local/etc/rc.syshook.d/monitor directory as per https://docs.opnsense.org/development/backend/autorun.html and do the relevant pfctl magic there.

Cheers,
Franco
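
As a rough illustration of what such "pfctl magic" might look like, here is a minimal Python sketch that only builds (and optionally runs) a `pfctl -k` command. It relies on the behaviour documented in pfctl(8): one `-k` kills states originating from a host or network, and a second `-k` restricts the kill to states towards that destination. The function names and the addresses (192.0.2.10 standing in for the backup WAN address, 198.51.100.7 for a VPN server) are hypothetical placeholders, not anything from this thread. How the monitor hook is invoked, and whether the states you care about actually match these addresses after NAT, would need to be verified on your own box.

```python
import subprocess
from typing import List, Optional


def pfctl_kill_args(source: str, destination: Optional[str] = None) -> List[str]:
    """Build a `pfctl -k` command line (hypothetical helper).

    One -k kills states originating from `source`; a second -k limits
    the kill to states going from `source` to `destination`.
    """
    args = ["pfctl", "-k", source]
    if destination is not None:
        args += ["-k", destination]
    return args


def kill_states(source: str, destination: Optional[str] = None,
                dry_run: bool = True) -> List[str]:
    """Print the command in dry-run mode, otherwise execute it (needs root)."""
    args = pfctl_kill_args(source, destination)
    if dry_run:
        print(" ".join(args))
    else:
        subprocess.run(args, check=True)
    return args


# Placeholder addresses: assumed backup WAN address and VPN server.
kill_states("192.0.2.10")                      # everything from the backup WAN address
kill_states("0.0.0.0/0", "198.51.100.7")       # everything towards one remote host
```

Keeping `dry_run=True` as the default is deliberate: it lets you check which command would be issued before anything disruptive happens to live connections.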

@ElXk6
Author

ElXk6 commented Jun 5, 2023

To most reporters in the past the disruptive clearing of states is the actual undesirable outcome

Okay, I had already suspected that.

If you need it you could throw a script into /usr/local/etc/rc.syshook.d/monitor directory as per https://docs.opnsense.org/development/backend/autorun.html and do the relevant pfctl magic there.

Thanks, then I will think about something here.
An official configurable option would of course still be nice :).

@fichtner
Member

fichtner commented Jun 5, 2023

I don't mind a feature request, but it must be designed correctly: clearing all states everywhere is not an option anymore, as it would lead to the same reports again. Having worked on the gateway monitoring code for the past few weeks, I can say there is quite a bit of complexity involved in the setup at hand (it already carries multiple requirements) and in how the failover is expected to behave. At the moment monitoring is target driven: it searches for the best candidate. Handling the previous candidate can add a lot of complexity that might not be worth it (I don't remember any past request on how to deal selectively with lines being demoted).

@mimugmail
Member

A nightly cron with VPN reset should also do the trick

@ElXk6
Author

ElXk6 commented Jun 5, 2023

Yes, I thought about it a bit.
I think I agree with you: the way it is now is probably best for most users, as there are too many cases to cover.

You could of course add options such as restarting services at x o'clock if a gateway was down, as mimugmail already suggested.
But for one person early morning is better, for another the evening, etc.
I think you can't meet all the requirements here without a lot of time and effort.

Yes, clearing all states is not a good idea; if anything, it should only be done for the IP range of the failover gateway. But even then I have seen connections resume after a clear, probably because the client answered the UDP stream again.
In that case only deactivating the gateway => clearing states => reactivating the gateway helped.
But I did not look at this more closely, so maybe it was my fault.

Maybe a notification that services are still running through the failover gateway would be enough, so that you can react to it manually.
But I think everyone can also monitor for themselves and their requirements.
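
For anyone who wants to monitor this themselves, a minimal Python sketch of the idea could look like the following: it scans a state dump (for example the text printed by `pfctl -s states`) for lines that mention the backup WAN address. The helper name, the sample lines, and the address 192.0.2.10 are made up for illustration, and the exact layout of the state output differs between versions, so the sketch matches the address anywhere in the line rather than parsing columns.

```python
import re
from typing import List


def lines_mentioning(state_dump: str, address: str) -> List[str]:
    """Return state lines that contain `address` as a whole IPv4 address.

    The lookarounds keep 192.0.2.10 from also matching 192.0.2.100.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(address) + r"(?!\d)")
    return [line for line in state_dump.splitlines() if pattern.search(line)]


# Illustrative sample only, not real output from the reporter's system.
SAMPLE = """\
all tcp 192.0.2.10:50022 -> 198.51.100.7:1194       ESTABLISHED:ESTABLISHED
all tcp 203.0.113.5:44810 -> 198.51.100.7:443       ESTABLISHED:ESTABLISHED
all udp 192.0.2.100:5353 -> 198.51.100.9:53       MULTIPLE:SINGLE
"""

leftover = lines_mentioning(SAMPLE, "192.0.2.10")
if leftover:
    print(f"{len(leftover)} state(s) still reference the backup WAN address")
```

In a real check the dump would come from running the state listing on the firewall, and the result could feed whatever notification mechanism you already use.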

Sorry for the reopening, I think the topic has settled again for now :D.

@alex8654

alex8654 commented Apr 3, 2024

I noticed this problem not just on VPNs, but on any traffic. Let's say there is a state open via WAN1, the interface goes down, the state is still on WAN1 and does not fail over to WAN2. I have confirmed this behaviour, I need to manually reset the states, or wait for it to time out due to inactivity. If you leave ping running, or traceroute, it will never time out, and it will never take the other interface that is up.

@gitmachtl

I have the same problem. I am currently migrating from Draytek routers to OPNsense, with dual WAN over a cable modem and an LTE connection. When the WAN (cable modem) comes back up again, the states for WAN2 (LTE) are not killed and clients stick to those connections.

Can someone please do something about that? Can someone tell me how to write a script that kills the WAN2 states once WAN is OK again?

Thanks!

@gitmachtl

gitmachtl commented Apr 10, 2024

I built myself a solution. For those who are interested, it's here:
#6803 (comment)


7 participants