Bug 2100536: Update API to config EgressIP timeout #1210
Conversation
Hello @msherif1234! Some important instructions when contributing to openshift/api: For merging purposes, this repository follows the no-Feature-Freeze process which means that in addition to the standard
OR
Who should apply these qe/docs/px labels?
@msherif1234: This pull request references Bugzilla bug 2100536, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(force-pushed from 3fb93a5 to 47f349e)
/lgtm
/assign @knobunc
/lgtm
Is there an enhancement that pairs with this feature change?
There was an RFE: https://issues.redhat.com/browse/RFE-2889
Wording-wise and design-wise I'm pretty happy now.
There's just one final thing. I think we should impose an upper and lower limit using +kubebuilder:validation:Minimum and +kubebuilder:validation:Maximum. The minimum is obviously 0, but we should choose a sensible maximum to prevent users from setting arbitrarily high values.
Do you have any thoughts on what an appropriate maximum timeout might be within the context of ovnk running on OpenShift?
operator/v1/types_network.go (Outdated)

// The current default is 1 second.
// A value of 0 disables the EgressIP node's reachability check.
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=10
So we are setting a maximum of 10s. Can you quickly explain why you chose that value before we merge? I just want to make sure it's going to be enough (this is smaller than I thought it would be).
I picked this value given that the osdn platform was shipping with 5s, and given the impact if it's too large, I went with 10s. I am checking with @MichaelWasher to confirm.
Well... I mean it all depends on the retry count within this time. The concern is not that a single packet takes more than 5s/10s to complete; it's more about how many times the packets are sent. By default with TCP, the first SYN will retry after 5.8s, so you're only getting 1 retry.
However, reviewing the new dialer code, it looks like you're retrying every second or so, so that'll be ~9 retries, which should be a tonne. IMO it LGTM.
My concern here is that 10s is quite short, and while it makes sense, will it always make sense? Would an average person expect to set such a short timeout, given that other timeouts in kube are normally much higher? For example, kubelet doesn't time out for 5 minutes of inactivity.
It's not meant to be long like kubelet's, because of the side effects mentioned. Since the default is 1s, the upper bound shouldn't be that far off IMO, hence the 10s.
The side effect for this and kubelet, as I understand it, isn't particularly different. We have something that's being healthchecked, and we take it down/offline at some point if the healthchecks are failing. The same applies to pods with readyz checks, which are rarely as short as 10s in my experience. I think having such a low limit will confuse users.
Would 60s be a good upper limit?
@JoelSpeed, the timeout was originally 1s with no retries. 60s is a massive jump from this; however, as it's the upper bound, it doesn't really matter.
TL;DR: the criteria should be: how did the feature behave before we made this a knob. And that should be the value we pick (or something close to that). Not: how does any unrelated feature in the K8s API normally select timeouts. + this is a networking feature, and downtime is usually detected in ms to seconds, not in minutes.
We use this as a detection mechanism for EgressIP failover. Imagine EgressIP as an HSRP or VRRP VIP, just the other way around. All traffic to the outside world from EgressIP-marked pods is funneled through that node, so with a 60-second timeout by default we could potentially cause those pods' traffic to be blackholed for 60 seconds in case of a node issue. So the shorter the timeout (within a reasonable delay) the better.
IMO, the criteria should be: how did the feature behave before we made this a knob? That should be the value we pick (or something close to it), not how any unrelated feature in the K8s API normally selects timeouts. If your ready probe on a pod takes a minute to time out, then your one pod is affected. If the reachability check for EgressIP takes 60 seconds (or 4 minutes) to time out, then potentially dozens of pods can't reach the outside world for that time. Add to this that customers expected this to happen within a second, and now from one version of the product to another we bump this to a value 60 times higher... seems a lot to me. And add to this again that node reboots are actually really common in OCP due to the machine config operator taking them offline and online.
"Would an average person be expecting to set such a short timeout"
I'd argue they would. And that on the contrary, nobody would expect a VIP (or an inverted VIP like egressIP) to take minutes to timeout. Seconds is already a lot for a networking feature.
Just to clarify: the dialer isn't actually establishing TCP connections, right? Instead, it's calling into tcp/9 with no listener on the other side; it gets a port-unreachable message (or equivalent) back, and thus it knows that the host on the other side is up. (There would probably have been better ways of detecting that the node is up, but this is how it was done in SDN and OVNK way before this change.) Regardless of whether the current detection mechanism is good or not, on the API side I'd push for a shorter delay rather than a longer one, for the aforementioned reasons.
is 1 second. A value of 0 disables the EgressIP node's reachability check.
format: int32
maximum: 10
After reading the discussion here + email and looking at the code, I too am convinced that 60 is a reasonable value as the max.
Once the code is updated to allow up to 60s, this lgtm.
Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
@msherif1234: all tests passed! Full PR test history. Your PR dashboard.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreaskaris, flavio-fernandes, jcaamano, JoelSpeed, knobunc, msherif1234 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@JoelSpeed: in that case...
/hold cancel
@msherif1234: All pull requests linked via external trackers have merged: Bugzilla bug 2100536 has been moved to the MODIFIED state.
Need to extend the APIs to allow configuring the EgressIP node reachability timeout.
Signed-off-by: Mohamed Mahmoud mmahmoud@redhat.com