CNTRLPLANE-1576: add event-ttl configuration to kube-apiserver #2520
base: master
@tjungblu: This pull request references CNTRLPLANE-1576 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hello @tjungblu! Some important instructions when contributing to openshift/api:
b24d8da to e816e63
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
operator/v1/types_kubeapiserver.go
Outdated
// The value must be parseable as a time duration value;
// see <https://pkg.go.dev/time#ParseDuration>.
We prefer not to use duration values anymore. Instead, create an int32 type, with units in the name.
For example, this should be eventTTLMinutes.
We do this because not all clients are built in Go, and building Go-compatible duration parsing in other languages is not going to be fun.
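To illustrate the int32-with-units approach, here is a minimal Go sketch. It assumes the operator would eventually translate the minute count into the kube-apiserver `--event-ttl` flag; the helper name `eventTTLFlag` is hypothetical and does not come from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// eventTTLFlag renders a minute count as a kube-apiserver --event-ttl argument.
// The helper itself is hypothetical; only the flag name is taken from upstream
// Kubernetes. Clients in any language only ever handle the integer, and the
// Go-specific duration formatting stays inside the operator.
func eventTTLFlag(minutes int32) string {
	ttl := time.Duration(minutes) * time.Minute
	return fmt.Sprintf("--event-ttl=%s", ttl)
}

func main() {
	fmt.Println(eventTTLFlag(180)) // prints --event-ttl=3h0m0s
}
```

With this shape, a non-Go client never needs to parse strings such as "3h" or "90m"; it only reads and writes a plain integer.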
thanks for reminding me. I've changed this to be configurable in minutes via int32.
operator/v1/types_kubeapiserver.go
Outdated
// If configured, it must be a value of 1m (one minute) or greater, we only allow setting
// minute and hour durations (e.g. 5m or 5h).
We should set a sensible maximum value for this TTL. Can you suggest a sensible maximum value for this?
yep, made it 2x the default of 3h now
operator/v1/types_kubeapiserver.go
Outdated
// +kubebuilder:validation:Format=duration
// +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(m|h))+)$
// +kubebuilder:validation:Type:=string
// +default="3h"
Since this is a configuration API, we want to reserve the right to change this value over time. This will be easier if we don't set the value through openapi defaulting.
It's better here to set the value in the controller by detecting that it is omitted.
We generally add a comment to the godoc to explain this scenario
// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
// The current default value is 3h.
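A minimal sketch of what "set the value in the controller by detecting that it is omitted" could look like. The function and constant names are hypothetical, and the sketch assumes an omitted field deserializes to the zero value of int32, which only works if 0 is not itself a valid setting.

```go
package main

import "fmt"

// defaultEventTTLMinutes mirrors the "current default value is 3h" wording of
// the suggested godoc. Keeping it in controller code (rather than in the CRD
// schema) lets the platform change it later without an API change.
const defaultEventTTLMinutes int32 = 180

// effectiveEventTTLMinutes is a hypothetical helper: it treats the zero value
// as "omitted, no opinion" and falls back to the platform default.
func effectiveEventTTLMinutes(configured int32) int32 {
	if configured == 0 {
		return defaultEventTTLMinutes
	}
	return configured
}

func main() {
	fmt.Println(effectiveEventTTLMinutes(0))  // omitted: falls back to 180
	fmt.Println(effectiveEventTTLMinutes(60)) // explicit value is kept: 60
}
```

Because nothing is written back into the stored object, existing clusters that never set the field automatically pick up a new default if the constant changes.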
added this, great idea
operator/v1/types_kubeapiserver.go
Outdated
type KubeAPIServerSpec struct {
	StaticPodOperatorSpec `json:",inline"`

	// eventTTL specifies the amount of time that the events are stored before being deleted.
Why would a user choose either a larger or smaller value for this? Is there a performance benefit? What is too large? What are the impacts of choosing something very small?
How deep do we want to document this on the godoc vs. the actual openshift documentation?
Or is this just for your personal curiosity?
I think we should give at least some guidance as to what the effect is of choosing a smaller or larger value. It doesn't have to be super deep, but if we can make the API self-service, that's preferable.
My main concern is that someone sees in oc explain that they can change it to 1m and does so, without us giving them any hint as to the potential issues this short feedback loop could introduce.
This adds a minute based configuration to configure the event ttl setting in kube-apiserver. Default will stay 3h, as currently defined in KAS-O. Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
e816e63 to 6054d58
//
// +kubebuilder:default:=180
// +kubebuilder:validation:Minimum=30
// +kubebuilder:validation:Maximum=360
Do we have users that need to increase the event TTL above 3h? Choosing 3h as the maximum would preserve the upper bound when it comes to benchmarking and publishing supported cluster sizes.
For reference, the upstream default TTL is 1h, and it is already considered very high for today's cluster sizes.
The only reason we have 3h downstream as the default value is that origin jobs have a 3h timeout, and this allows the events to be retained for the entire duration of the tests.
	StaticPodOperatorSpec `json:",inline"`

	// eventTTLMinutes specifies the amount of time that the events are stored before being deleted.
	// This setting is allowed between 30 minutes minimum up to 6h (360 minutes).
Because the field value is an integer, it would be better to reverse these
// This setting is allowed between 30 minutes minimum up to 6h (360 minutes).
// The TTL is allowed between 30 minutes minimum up to a maximum of 360 minutes (6 hours).
// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
// The current default value is 3h (180 minutes).
//
// +kubebuilder:default:=180
I left a comment on the old default: we shouldn't have the OpenAPI default like this.
Is there an EP for this new field? It should be added behind a feature gate.
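As a rough sketch of where the review feedback so far points: an optional int32 field with range validation and no schema-level default. The type name `kubeAPIServerSpecSketch` and the helper `render` are hypothetical, and the bounds (5 and 180) are only the values suggested by reviewers in this thread, not a settled decision. A feature-gate marker is left out because no gate name appears in the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kubeAPIServerSpecSketch is a hypothetical, trimmed-down stand-in for
// KubeAPIServerSpec, used only to show the marker layout.
type kubeAPIServerSpecSketch struct {
	// eventTTLMinutes specifies the amount of time that the events are stored before being deleted.
	// When omitted this means no opinion, and the platform is left to choose a reasonable default,
	// which is subject to change over time. The current default value is 3h (180 minutes).
	//
	// Assumed bounds, taken from reviewer suggestions in this thread:
	// +kubebuilder:validation:Minimum=5
	// +kubebuilder:validation:Maximum=180
	// +optional
	EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"`
}

// render serializes the sketch so the effect of omitempty is visible.
func render(spec kubeAPIServerSpecSketch) string {
	out, err := json.Marshal(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(render(kubeAPIServerSpecSketch{}))                    // prints {}
	fmt.Println(render(kubeAPIServerSpecSketch{EventTTLMinutes: 60})) // prints {"eventTTLMinutes":60}
}
```

With no `+kubebuilder:default` marker, an unset field stays absent from the stored object, which is what leaves room for the controller to pick the default.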
// The current default value is 3h (180 minutes).
//
// +kubebuilder:default:=180
// +kubebuilder:validation:Minimum=30
I think we can go as far down as 5 minutes. This would allow customers that persist events externally to reduce the number of events stored in their cluster as much as possible while still being able to know what is happening right now in their cluster.
Also, when there is an event spew, the only way to recover the cluster quickly is to reduce the event-ttl as much as possible. For that some people set the ttl to 0 in vanilla Kubernetes, but I think that for debug purposes, it would still be useful to have some events around, so 5 minutes is better IMO.
@tjungblu: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This adds a minute/hour configuration duration to configure the event ttl setting in kube-apiserver. Default will stay 3h, as currently defined in KAS-O.