Skip to content

Conversation

msherif1234
Copy link
Contributor

Need to extend apis to allow configuring EgressIP node reachability timeout config
Signed-off-by: Mohamed Mahmoud mmahmoud@redhat.com

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 23, 2022

Hello @msherif1234! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

For merging purposes, this repository follows the no-Feature-Freeze process which means that in addition to the standard lgtm and approved labels this repository requires either:

bugzilla/valid-bug - applied if your PR references a valid bugzilla bug

OR

qe-approved, docs-approved, and px-approved - these labels can be applied by anyone in the openshift org via the /label command.

Who should apply these qe/docs/px labels?

  • For a no-Feature-Freeze team who is merging a feature before code freeze, they need to get those labels applied to their api repo PR by the appropriate teams (i.e. qe, docs, px)
  • For a Feature Freeze (traditional) team who is merging a feature before FF, they can self-apply the labels (via /label commands), they are basically irrelevant for those teams
  • For a Feature Freeze team who is merging a feature after FF, the PR should be rejected barring an exception

@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jun 23, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 23, 2022

@msherif1234: This pull request references Bugzilla bug 2100536, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

Bug 2100536: Update API to config EgressIP timeout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreaskaris
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2022
@msherif1234
Copy link
Contributor Author

/assign @knobunc

@jcaamano
Copy link
Contributor

/lgtm

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an enhancement that pairs with this feature change?

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 27, 2022
@msherif1234
Copy link
Contributor Author

Is there an enhancement that pairs with this feature change?

there was an RFE https://issues.redhat.com/browse/RFE-2889

@msherif1234 msherif1234 requested a review from andreaskaris June 29, 2022 15:47
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording wise and design wise I'm pretty happy now.

There's just one final thing. I think we should impose an upper and lower limit using +kubebuilder:validation:Minimum and +kubebuilder:validation:Maximum, the minimum is obviously 0, but we should choose a maximum that is sensible to prevent users set arbitrarily high values

Do you have any thoughts on what an appropriate maximum timeout might be within the context of ovnk running on OpenShift?

// The current default is 1 second.
// A value of 0 disables the EgressIP node's reachability check.
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are setting a maximum of 10s, can you quickly explain why you chose that value before we merge, just want to make sure that's going to be enough (this is smaller than I thought it would be)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked this value given osdn platform was shipping with 5s and given its impact if its too large I went with 10s I am checking with @MichaelWasher to confirm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... I mean it all depends on the rety-count within this time. The concern is not that a single packet takes more than 5s/10s to complete. It's more about how many times is the packets are sent. By default with TCP, the first SYN will retry after 5.8s, so you're only getting 1 retry.

Copy link

@MichaelWasher MichaelWasher Jul 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However reviewing the new dialer code, it looks like you're retrying ever second-ish, so that'll be ~9 retries, which should be a tonne. IMO it LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern here is that 10s is quite short, and while it makes sense, will it always make sense? Would an average person be expecting to set such a short timeout given other timeouts in kube are normally much higher? For example Kubelet doesn't timeout for 5 minutes of inactivity

Copy link
Contributor Author

@msherif1234 msherif1234 Jul 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not meant to be long like kubelet for the side effects mentioned, since the default is 1sec the upper bond shouldn't be that far off IMO hence the 10sec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The side effect for this and kubelet, as I understand it, isn't particularly different. We have something that's being healthchecked, we take it down/offline at some point if the healthchecks are failing. This would be similar to pods as well with readyz checks, which are rarely as short as 10s in my experience. I think having such a low limit will confuse users

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

60s will be good upper limit?

Copy link

@MichaelWasher MichaelWasher Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoelSpeed , the timeout was originally 1s with no retries. 60s is a massive jump from this, however as it's the upper bound, it doesn't really matter.

Copy link
Contributor

@andreaskaris andreaskaris Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: the criteria should be: how did the feature behave before we made this a knob. And that should be the value we pick (or something close to that). Not: how does any unrelated feature in the K8s API normally select timeouts. + this is a networking feature, and downtime is usually detected in ms to seconds, not in minutes.

We use this as a detection mechanism for EgressIP failover. Imagine EgressIP as a HSRP or VRRP VIP, just the other way around. All traffic to the outside world is funneled through that node from EgressIP marked pods, so potentially with a 60 seconds timeout by default, we cause those pods' traffic to be blackholed for 60 seconds in case of a node issue. So the shorter the timeout (within a reasonable delay) the better.

IMO, the criteria should be: how did the feature behave before we made this a knob. And that should be the value we pick (or something close to that). Not: how does any unrelated feature in the K8s API normally select timeouts. If your ready probe on a pod takes a minute to time out, then your one pods is affected. If the reachability check for EgressIP takes 60 seconds (or 4 minutes) to timeout, then potentially dozens of pods can't reach the outside world for that time. Add to this that customers expected this to happen within a second, and now from one version to another of the product we bump this to a value 60 times higher ... seems a lot to me. And add to this again that node reboots are actually really common in OCP due to the machine config operator taking them offline and online.

Would an average person be expecting to set such a short timeout

I'd argue they would. And that on the contrary, nobody would expect a VIP (or an inverted VIP like egressIP) to take minutes to timeout. Seconds is already a lot for a networking feature.

Just to clarify: the dialer isn't actually establishing TCP connections, right. Instead, it's calling into tcp/9 with no listener on the other side, it's get a port unreachable message or equivalent back, and thus it will know that the host on the other side is up (there would probably have been better ways of detecting that the node is up, but this is how it was done in SDN and OVNK way before this change). Regardless if the current detection mechanism is good or not, on the API side, I'd push for a shorter delay rather than a longer delay because of the aforementioned reasons.

https://github.com/ovn-org/ovn-kubernetes/blob/ce8b60f2309dad9aa3473c62c949ecf75ff2d185/go-controller/pkg/ovn/egressip.go#L2167

is 1 second. A value of 0 disables the EgressIP node's
reachability check.
format: int32
maximum: 10

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@flavio-fernandes
Copy link

After reading the discussion here + email and looking at the code I too am convinced that 60 is a reasonable value as max.
So...
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2022
@deads2k
Copy link
Contributor

deads2k commented Jul 8, 2022

After reading the discussion here + email and looking at the code I too am convinced that 60 is a reasonable value as max.
So...

Once the code is updated to allow up to 60s, this lgtm.

@openshift-ci openshift-ci bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed lgtm Indicates that a PR is ready to be merged. labels Jul 14, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 14, 2022
Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 14, 2022

@msherif1234: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 15, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreaskaris, flavio-fernandes, jcaamano, JoelSpeed, knobunc, msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@flavio-fernandes
Copy link

/lgtm

@JoelSpeed : in that case... /remove-hold ? :)

@JoelSpeed
Copy link
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2022
@openshift-ci openshift-ci bot merged commit dab5b36 into openshift:master Jul 15, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 15, 2022

@msherif1234: All pull requests linked via external trackers have merged:

Bugzilla bug 2100536 has been moved to the MODIFIED state.

In response to this:

Bug 2100536: Update API to config EgressIP timeout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants