Conversation

@tjungblu (Contributor) commented Oct 7, 2025

This adds a duration field, limited to minute and hour units, for configuring the event TTL in kube-apiserver. The default stays at 3h, as currently defined in KAS-O.

openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Oct 7, 2025

openshift-ci-robot commented Oct 7, 2025

@tjungblu: This pull request references CNTRLPLANE-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

openshift-ci bot (Contributor) commented Oct 7, 2025

Hello @tjungblu! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Oct 7, 2025

openshift-ci bot (Contributor) commented Oct 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 41 to 42
// The value must be parseable as a time duration value;
// see <https://pkg.go.dev/time#ParseDuration>.

Contributor

We prefer not to use duration values anymore. Instead, create an int32 field with the unit in its name.

For example, this should be eventTTLMinutes.

We do this because not all clients are built in Go, and implementing Go-compatible duration parsing in other languages is not going to be fun.

Contributor Author

Thanks for reminding me. I've changed this to be configurable in minutes via an int32.

Comment on lines 44 to 45
// If configured, it must be a value of 1m (one minute) or greater, we only allow setting
// minute and hour durations (e.g. 5m or 5h).

Contributor

We should set a sensible maximum value for this TTL. Can you suggest one?

Contributor Author

Yep, I've set the maximum to 2x the default of 3h (6h) now.

// +kubebuilder:validation:Format=duration
// +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(m|h))+)$
// +kubebuilder:validation:Type:=string
// +default="3h"

Contributor

Since this is a configuration API, we want to reserve the right to change this value over time. That is easier if we don't set the value through OpenAPI defaulting.

It's better to set the value in the controller when it detects that the field is omitted.

We generally add a comment to the godoc to explain this scenario:

// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
// The current default value is 3h.

Contributor Author

Added this, great idea.

type KubeAPIServerSpec struct {
StaticPodOperatorSpec `json:",inline"`

// eventTTL specifies the amount of time that the events are stored before being deleted.

Contributor

Why would a user choose either a larger or smaller value for this? Is there a performance benefit? What is too large? What are the impacts of choosing something very small?

Contributor Author

How deeply do we want to document this in the godoc vs. the actual OpenShift documentation?
Or is this just for your personal curiosity?

Contributor

I think we should give at least some guidance on the effect of choosing a smaller or larger value. It doesn't have to be super deep, but if we can make the API self-service, that's preferable.

My main concern is that someone sees in oc explain that they can change it to 1m and does so, without us giving them any hint about the potential issues such a short TTL could introduce.

This adds a minute-based configuration for the event TTL setting in kube-apiserver.
The default will stay 3h, as currently defined in KAS-O.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
//
// +kubebuilder:default:=180
// +kubebuilder:validation:Minimum=30
// +kubebuilder:validation:Maximum=360

Contributor

Do we have users that need to increase the event TTL above 3h? Choosing 3h as the maximum would preserve the upper bound when it comes to benchmarking and publishing supported cluster sizes.

Member

For reference, the upstream default TTL is 1h, and it is already considered very high for today's cluster sizes.

The only reason we have 3h as the downstream default is that origin jobs have a 3h timeout, and this allows the events to be retained for the entire duration of the tests.

StaticPodOperatorSpec `json:",inline"`

// eventTTLMinutes specifies the amount of time that the events are stored before being deleted.
// This setting is allowed between 30 minutes minimum up to 6h (360 minutes).

Contributor

Because the field value is an integer, it would be better to reverse these and put the minutes first:

Suggested change
- // This setting is allowed between 30 minutes minimum up to 6h (360 minutes).
+ // The TTL is allowed between 30 minutes minimum up to a maximum of 360 minutes (6 hours).
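
For illustration, the bounds described in this godoc can be mirrored in Go as below. This is only a sketch: the function and constant names are hypothetical, the real enforcement happens through the kubebuilder Minimum/Maximum markers in the CRD schema, and the bounds themselves were still under discussion in this review.

```go
package main

import "fmt"

const (
	minEventTTLMinutes int32 = 30  // from +kubebuilder:validation:Minimum=30
	maxEventTTLMinutes int32 = 360 // from +kubebuilder:validation:Maximum=360
)

// validateEventTTLMinutes is a hypothetical helper that rejects values
// outside the inclusive range [30, 360] minutes.
func validateEventTTLMinutes(v int32) error {
	if v < minEventTTLMinutes || v > maxEventTTLMinutes {
		return fmt.Errorf("eventTTLMinutes must be between %d and %d, got %d",
			minEventTTLMinutes, maxEventTTLMinutes, v)
	}
	return nil
}

func main() {
	for _, v := range []int32{5, 30, 180, 360, 361} {
		if err := validateEventTTLMinutes(v); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", v)
	}
}
```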

// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
// The current default value is 3h (180 minutes).
//
// +kubebuilder:default:=180

Contributor

I left a comment on the old default: we shouldn't set the default through OpenAPI like this.

@JoelSpeed (Contributor)

Is there an enhancement proposal (EP) for this new field? It should be added behind a feature gate.

// The current default value is 3h (180 minutes).
//
// +kubebuilder:default:=180
// +kubebuilder:validation:Minimum=30

Member

I think we can go as far down as 5 minutes. This would allow customers that persist events externally to reduce the number of events stored in their cluster as much as possible while still being able to know what is happening right now in their cluster.

Member

Also, when there is an event spew, the only way to recover the cluster quickly is to reduce the event-ttl as much as possible. For that, some people set the TTL to 0 in vanilla Kubernetes, but I think that for debugging purposes it would still be useful to keep some events around, so 5 minutes is better IMO.

openshift-ci bot (Contributor) commented Oct 8, 2025

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Details  Required  Rerun command
ci/prow/okd-scos-e2e-aws-ovn  6054d58  link     false     /test okd-scos-e2e-aws-ovn
ci/prow/integration           6054d58  link     true      /test integration
