Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use masquerade default gateway when no default gw is found #4122

Merged
merged 1 commit into from Mar 5, 2024

Conversation

trozet
Copy link
Contributor

@trozet trozet commented Feb 1, 2024

When no default gateway is found, OVNK will still come up. This behavior was intentional: 3eeb930

The goal was that with design changes, host->service traffic would work without the need for a default gateway to be detected. This works, in most cases, but there is a case we did not consider. If a user has deployed a cluster with local gateway mode, and is using a different interface/network for kapi traffic (not the gateway bridge interface), then endpoints to the kubernetes service will reside on this other network. In this case when a host tries to talk to kube API service, the traffic would go to the GR, it would be DNAT'ed to an IP address not on any known network, and be dropped by OVN routing when there is no default gateway route.

When the configuration parameter AllowNoUplink is set, we will set the default route to be the masquerade IP, which solves this problem. This commit changes the behavior to not require AllowNoUplink to be set in order to achieve the same behavior, as services should work in this scenario even without a default gw.

@coveralls
Copy link

Coverage Status

Changes unknown
when pulling 1ccb958 on trozet:lgw_default_route_on_gr
into ** on ovn-org:master**.

Copy link
Member

@tssurya tssurya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm just a question around the warning print
also wider question, I wasn't involved in the design of AllowNoUplink option where we deploy LGW without interfacing with host NIC, but currently this is really used only in bridgeForInterface anyways just checking this option is still relevant in OVNK considering we decided to add the default route irrespective of whether this option is set or not

}
if len(defaultGatewayNextHops) == 0 {
klog.Warning("No default route identified in the host. Egress features and pod egress traffic in " +
"shared gateway mode may not function correctly!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to print this warning in local gateway mode but without the allowNoUplink set as well which is what we are doing now? in which case I'd remove the shared gateway (or routingViaHost=false) from the log..? But I think these features are expected to work on LGW mode (though maybe not EIPs..), with allowNoUplink disabled in which case we shouldn't be printing this warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree for allowNoUplink, we can probably skip printing it.

I guess I am also missing a comma. It should be:
Egress features, and pod egress traffic in shared gateway mode may not function correctly

Which I meant to say only egress pod traffic in shared gateway mode might not work correctly. Egress pod traffic in LGW will still work. Egress features may not work in both modes.

Should I reword it to clarify they are 2 separate things?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes let's reword to clarify, thanks!

// For local gw, if not default gateway is available or the provide gateway interface is not the host gateway interface
// use nexthop masquerade IP as GR default gw to steer traffic to the gateway bridge
// use nexthop masquerade IP as GR default gw to steer traffic to the gateway bridge, and then the host for routing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self-note: so what happens is:

  1. host->kapi clusterIP
  2. sent to br-ex and then to OVN via GR using service routes on host
  3. GR DNATs to secondary network backend pod IPs
  4. no default route on GR so we drop
    This change makes the masqueradeIP be the default route so
  5. Traffic instead of dropping comes back to host
  6. then host routes and iptable masquerade takes this to the secondary nic

thanks @trozet for the nice PR description which made it easy to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right!

@trozet
Copy link
Contributor Author

trozet commented Feb 14, 2024

also wider question, I wasn't involved in the design of AllowNoUplink option where we deploy LGW without interfacing with host NIC, but currently this is really used only in bridgeForInterface anyways just checking this option is still relevant in OVNK considering we decided to add the default route irrespective of whether this option is set or not

good question. So I debated removing allowNoUplink as well. I think the one key difference is this option still allows for no uplink at all to be found/attached to the bridge:
https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/node/gateway.go#L494C3-L494C15

So without this option, if we cannot find an interface to attach to the bridge, then we consider it as a failed state unless explicitly told with allowNoUplink. I'm not sure how useful this is or not, but I decided to leave it alone since it isnt broke :)

@tssurya tssurya self-assigned this Feb 20, 2024
@tssurya
Copy link
Member

tssurya commented Feb 26, 2024

@trozet : wanna repush here?

@trozet
Copy link
Contributor Author

trozet commented Mar 1, 2024

addressed the comments in last push, will now push another with just rebase

When no default gateway is found, OVNK will still come up. This behavior
was intentional: 3eeb930

The goal was that with design changes, host->service traffic would work
without the need for a default gateway to be detected. This works, in
most cases, but there is a case we did not consider. If a user has
deployed a cluster with local gateway mode, and is using a different
interface/network for kapi traffic (not the gateway bridge interface),
then endpoints to the kubernetes service will reside on this other
network. In this case when a host tries to talk to kube API service, the
traffic would go to the GR, it would be DNAT'ed to an IP address not on
any known network, and be dropped by OVN routing when there is no
default gateway route.

When the configuration parameter AllowNoUplink is set, we will set the
default route to be the masquerade IP, which solves this problem. This
commit changes the behavior to not require AllowNoUplink to be set in
order to achieve the same behavior, as services should work in this
scenario even without a default gw.

Signed-off-by: Tim Rozet <trozet@redhat.com>
@@ -184,10 +184,19 @@ func getGatewayNextHops() ([]net.IP, string, error) {
}
gatewayIntf = defaultGatewayIntf
} else {
if gatewayIntf != defaultGatewayIntf {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too much if's and else's for my taste :) but we can clean code later, I see value in printing different warnings, will help us debug better in the future.

@tssurya tssurya merged commit c405012 into ovn-org:master Mar 5, 2024
34 checks passed
@@ -1857,7 +1857,6 @@ var _ = Describe("Gateway unit tests", func() {
gwIPs := []net.IP{config.Gateway.MasqueradeIPs.V4DummyNextHopMasqueradeIP}
config.Gateway.Interface = dummyBridgeName
config.Gateway.Mode = config.GatewayModeLocal
config.Gateway.AllowNoUplink = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self-note: there is another UT testing with allowNoUplink to true so we do have coverage.

@openshift-merge-robot
Copy link

Fix included in accepted release 4.16.0-0.nightly-2024-03-21-152650

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants