
Gateway groups and NAT : Incorrect NAT IP used on interface when Shared Forwarding is enabled #2376

Closed
namezero111111 opened this issue Apr 25, 2018 · 33 comments
Assignees
Labels
bug Production bug

Comments

@namezero111111
Contributor

Good morning, I am afraid there may be another issue related to shared forwarding and MultiWAN. The original issue was posted here: https://forum.opnsense.org/index.php?topic=7803.0

Essentially, packets are sent out one interface originating from the IP of another.

The issue occurs regardless of:

  1. Source tracking timeout
  2. Firewall tracking normal or conservative
  3. Whether the NAT IP is CARP or interface address

The issue does NOT occur if:

  1. Shared forwarding is disabled.
@namezero111111 namezero111111 changed the title Gateway groups and NAT : Incorrect NAT IP used on interface. Gateway groups and NAT : Incorrect NAT IP used on interface when Shared Forwarding is enabled Apr 25, 2018
@fichtner fichtner self-assigned this Apr 25, 2018
@fichtner fichtner added the bug Production bug label Apr 25, 2018
@fichtner
Member

no work would be boring :)

let me see how we could debug this further....

@namezero111111
Contributor Author

namezero111111 commented Apr 25, 2018

Please don't hate on me for bringing shared forwarding up again :}

This fellow has an issue that could be related, too: https://forum.opnsense.org/index.php?topic=7860

Some notes with less scientific insight that might give some clues:

  1. The initial states captured on the wrong source interface (or IP; let's just call them mismatched) always appear to be in the SYN_SENT:CLOSED state (at least when checking the state table right after the connection appears in the dump).
  2. Point 1 seems to create a new stickiness with the mismatched IP/interface.
  3. When the issue occurs, the sticky association shown by pfctl -s Sources points to the gateway on the interface that owns the IP (but traffic goes out the wrong interface).
    I.e. given a configuration where:

DMZ1 GW1
DMZ2 GW2

Sticky association: GW1
Faulty dump: DMZ1 sending out with DMZ2 IP

or

Sticky association: GW2
Faulty dump: DMZ2 sending out with DMZ1 IP

If I find more relevant, verifiable information I will update again.
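The mismatch can be read as a violated invariant: the NAT source IP on an outgoing packet should always belong to the interface the packet leaves on. A minimal Python sketch of that check (interface names match the example above; the addresses are made up for illustration):

```python
# Toy model of the invariant violated in this bug: the NAT source IP
# on an outgoing packet must belong to its egress interface.
# Gateway names follow the example above; IPs are illustrative only.

INTERFACES = {
    "DMZ1": {"gateway": "GW1", "nat_ip": "192.0.2.1"},
    "DMZ2": {"gateway": "GW2", "nat_ip": "198.51.100.1"},
}

def consistent(egress_iface: str, nat_ip: str) -> bool:
    """True if the packet's NAT source IP belongs to its egress interface."""
    return INTERFACES[egress_iface]["nat_ip"] == nat_ip

# Healthy case: sticky association to GW1, packet leaves DMZ1 with DMZ1's IP.
assert consistent("DMZ1", "192.0.2.1")

# Faulty dump from the report: DMZ1 sending out with DMZ2's IP.
assert not consistent("DMZ1", "198.51.100.1")
```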

@namezero111111
Contributor Author

namezero111111 commented Jun 19, 2018

Well hello :}

This probably won't make it to 18.7 eh?

The reason I'm asking is because I think this here (https://forum.opnsense.org/index.php?topic=8786.0) could be related to the same issue.

This guy (https://forum.opnsense.org/index.php?topic=7860) also reported disabling sticky fixes his issue.

@fichtner
Member

@namezero111111 hello, let's take a stab at this...

Sticky outbound NAT disable "helps" with this --- this is still true? That's a good place to start.

@fichtner
Member

@namezero111111 also, does this happen all the time or do some connections work fine? Like 50% OK and 50% not OK?

@namezero111111
Contributor Author

Yes, disabling sticky will bypass/"resolve" this. Source tracking timeout seems to have no effect on this *.

The initial connection(s) "work", since data is actually being transferred, and as long as there is an active association (presumably before the last association for the client expires) there is no problem.
However, after that, new connections stall in this state.
*Source tracking will play a role here because it can delay this issue by keeping the association longer.

I will run some tests tomorrow to see if there is something more deterministic about it.

Should I try with 18.1 or would you prefer the 18.7 dev version?
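The footnote about source tracking above can be sketched as a minimal expiry model (the timeout value, client IP, and gateway names are illustrative, not from the real setup): while new connections keep refreshing the sticky entry the association holds, and only once it ages out can a fresh, possibly mismatched selection happen.

```python
# Minimal sketch of pf-style sticky source tracking with expiry.
# An association (client -> gateway) stays alive while connections
# recur inside the timeout; the bug can only resurface after it
# expires and a new gateway selection is made. Values are made up.

class SourceTracker:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.entries = {}  # client IP -> (gateway, last_used)

    def lookup(self, client: str, now: float):
        entry = self.entries.get(client)
        if entry and now - entry[1] <= self.timeout:
            # Refresh and reuse the sticky association.
            self.entries[client] = (entry[0], now)
            return entry[0]
        return None

    def record(self, client: str, gateway: str, now: float):
        self.entries[client] = (gateway, now)

tracker = SourceTracker(timeout=10.0)
tracker.record("10.0.0.5", "GW1", now=0.0)

# While connections keep arriving inside the timeout, GW1 stays sticky.
assert tracker.lookup("10.0.0.5", now=5.0) == "GW1"
assert tracker.lookup("10.0.0.5", now=14.0) == "GW1"  # refreshed at t=5

# After the association expires, the next lookup misses and a new
# (possibly mismatched) gateway selection would happen.
assert tracker.lookup("10.0.0.5", now=30.0) is None
```

A longer source tracking timeout only stretches the refresh window, which matches the observation that it delays rather than prevents the issue.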

fichtner added a commit to opnsense/src that referenced this issue Jun 19, 2018
@fichtner
Member

I can't test this very well but I found a candidate. opnsense/src@6aad8285a4

It's unlikely to panic, but I also can't vouch for it not making things worse than they were.

Can provide this as a test kernel.

fichtner added a commit to opnsense/src that referenced this issue Jun 19, 2018
@fichtner
Member

Whoopsie, better patch at opnsense/src@ddb08b18

@namezero111111
Contributor Author

namezero111111 commented Jun 19, 2018 via email

@fichtner
Member

Perfect, kernel update via:

# opnsense-update -kr 18.1.9-sticky
# /usr/local/etc/rc.reboot

Cheers,
Franco

@namezero111111
Contributor Author

namezero111111 commented Jun 20, 2018 via email

@namezero111111
Contributor Author

namezero111111 commented Jun 24, 2018

Franco, I am sorry to report that for the life of me I cannot reproduce this on the original test system I used to report this.
I tried all the original combinations with gateways online and partially offline, etc...

If nothing has changed on the kernel, there must be another variable at play here.
I don't see kernel changes in the changelogs either.

Currently I cannot think of anything else to try.
Can we put this on hold until then?

@namezero111111
Contributor Author

Fwiw, the patched kernel does no harm in that regard either :/

@fichtner
Member

fichtner commented Aug 9, 2018

@namezero111111 do you have any news or should we close this ticket?

@namezero111111
Contributor Author

namezero111111 commented Aug 9, 2018

@fichtner
Hi, unfortunately I have not been able to reproduce it under the previous circumstances. I agree, we should close this.

Edit:
No, Thx for your help :}

@fichtner
Member

fichtner commented Aug 9, 2018

Ok, you know where to find me. Thanks for all your help! :)

@fichtner fichtner closed this as completed Aug 9, 2018
@Malunke

Malunke commented Apr 28, 2019

Dear all,
it seems I have the same issue on OPNsense 19.1.6.

Just to make it short: I configured 2 WAN interfaces (VLAN) according to the wiki.

I have the same problems as mentioned above and they disappear

  • when unplugging Gateway 1 (from the switch),
    or
  • when unplugging Gateway 2,
    or
  • when disabling sticky connections.

How can I help to analyze the problem and find a patch?

@mimugmail
Member

mimugmail commented Apr 28, 2019 via email

@Malunke

Malunke commented Apr 28, 2019

No, it is disabled (the default?)
[screenshot: Force Gateway setting]

@Malunke

Malunke commented Apr 28, 2019

[screenshot: Force Gateway setting]

@mimugmail
Member

mimugmail commented Apr 28, 2019 via email

@namezero111111
Contributor Author

Oh joy if this rears its head again.
Are you using shared forwarding?
Do the packets go out the other gateway using an incorrect NAT IP?
If so, are you able to provide a dump verifying that the NAT IP of WAN1 is used when sending packets out WAN2 while WAN1 is disabled/disconnected AND shared forwarding is enabled?

@fichtner
Member

In particular, I'd say if disabling shared forwarding fixes it please use this workaround.

In general if such a bug is encountered there's a lot that needs to be reported in order to make any sense of the setup at hand: firewall rules with policy, gateway group settings, IPv4 or IPv6 evaluated separately, switch conditions.

@fichtner
Member

Oh and also aux functionality used that is supposed to be addressed by shared forwarding: captive portal, traffic shaping, transparent web proxy use.

@Malunke

Malunke commented Apr 28, 2019

It seems that disabling shared forwarding won't help.

I will try to provide logs as mentioned by namezero.

@Malunke

Malunke commented Apr 28, 2019

[screenshots attached]

@Malunke

Malunke commented Apr 28, 2019

Gateway 1 is offline because I unplugged it :-)

@Malunke

Malunke commented Apr 30, 2019

Yes, I also have these NAT issues. A packet log shows, for example, the following entries on interface V200:

17:54:24.284702 IP 10.28.253.3.25721 > 54.243.161.26.443: tcp 1
17:54:24.285421 IP 10.4.0.143.2816 > 213.239.241.182.443: tcp 0
17:54:24.285840 IP 10.4.0.143.34473 > 213.239.241.182.443: tcp 0
17:54:24.291151 IP 23.52.60.236.443 > 10.28.253.3.8047: tcp 0

The IP 10.4.0.143 belongs to interface V201, so it seems to be exactly the same issue as described by namezero111111.

Please fix this issue.

I will assist with troubleshooting; just tell me what you need.
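As a side note, mismatches like the one in this capture can be flagged mechanically by checking each private source address against the capture interface's subnet. A small Python sketch using the log lines above (the /16 prefix for V200 is an assumption; only the host addresses appear in the log):

```python
import ipaddress
import re

# Flag tcpdump lines whose source IP does not belong to the capture
# interface's subnet. The /16 prefix is an assumption; only the host
# addresses 10.28.253.3 (V200) and 10.4.0.143 (V201) appear in the log.
V200_NET = ipaddress.ip_network("10.28.0.0/16")

log = """\
17:54:24.284702 IP 10.28.253.3.25721 > 54.243.161.26.443: tcp 1
17:54:24.285421 IP 10.4.0.143.2816 > 213.239.241.182.443: tcp 0
17:54:24.285840 IP 10.4.0.143.34473 > 213.239.241.182.443: tcp 0
17:54:24.291151 IP 23.52.60.236.443 > 10.28.253.3.8047: tcp 0
"""

def mismatched_sources(text, net):
    bad = []
    for line in text.splitlines():
        m = re.search(r"IP (\d+\.\d+\.\d+\.\d+)\.\d+ >", line)
        if m:
            src = ipaddress.ip_address(m.group(1))
            # Only outbound lines matter; inbound replies naturally have
            # foreign source addresses, so skip non-private sources.
            if src.is_private and src not in net:
                bad.append(str(src))
    return sorted(set(bad))

assert mismatched_sources(log, V200_NET) == ["10.4.0.143"]
```

On the captured log this flags exactly the V201 address leaking out of V200, matching the report.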

@Malunke

Malunke commented May 1, 2019

It seems disabling shared forwarding reduces the problem (after a short test I could no longer observe it), but then there is also no traffic going out of interface V201.

When disabling sticky connections everything works fine (the right traffic leaves the right interface), but then I get problems with online sessions (after logging in on websites).

Please help; you can also open a new ticket instead of using this old one.
OPNsense 19.1.6

@namezero111111
Contributor Author

namezero111111 commented May 1, 2019

We have **sticky connections** off and shared forwarding on (we need it for shaping).

The session-aware problem was solved by redirecting different clients through different gateway groups configured as failover rather than load balancing (Gateways on different tiers).
Not sure if the problem reappears if sticky connections are enabled again (that would be much better than failover groups), but I'm sure as hell not gonna find out if it makes this nightmare reappear.
If anything it's a workable workaround until we somehow stumble on what's causing this.

I could not recreate this with an imported config on a test system no matter how I tried to provoke it - it seemed sticky was working then, too (see comment from Aug 9th, 2018) - so I assumed some unknown circumstance had changed.

@Malunke

Malunke commented May 2, 2019

Thanks for your answer, but I need sticky connections and want to use load balancing.

Question to fichtner and the other members: do I have to open a separate bug report rather than following this one to get help and the bug reopened again?

Okay - in my case it is the gateway group together with sticky connections that causes the problem.

@namezero111111
Contributor Author

> It seems that disabling shared forwarding won't help.

If this is true (your post from 2019-04-28) then we had been down the wrong route all along or you're experiencing this under different circumstances. Did you verify this after a reboot? Just in case some stale connections or whatever are hanging around. Sure you can do it on the fly but conditionally disabling and enabling kernel firewall code at runtime raises a flag with me nonetheless.
That said:

Disabling shared forwarding should fix this because it disables the glue code between pf and ipfw.
If what you are saying is true and verified, then this problem does not relate to the shared forwarding code as assumed before.

@Malunke

Malunke commented May 5, 2019

I have been checking over the last few days. I have now enabled sticky connections and disabled shared forwarding. It seems to work.

I applied the above settings and rebooted the firewall. I can also confirm the bug exists with both sticky connections and shared forwarding on.

I hope this issue will be fixed in the next release?

Thanks a lot.
