NO-ISSUE: test/e2e/upgrade: Raise default update-ack timeout to 10m #30917
Conversation
It's failing in [1]'s client-go bump to Kube 1.35, with timelines like [2]:

* 11:17:27, test-suite patches ClusterVersion to request the update.
* 11:17:27, CVO Events `UpgradeStarted` (although it hasn't actually pivoted the target yet, might want to adjust this Event name/timing) and `RetrievePayload: Retrieving and verifying payload…`
* 11:17:28, CVO creates `version--fgn9s`.
* 11:17:28, CVO starts watching `version--fgn9s`.
* 11:17:33, `version--fgn9s` container started. So it's not slow image pulls.
* 11:17:34, `version--fgn9s` container started again? Not clear what happened here.
* 11:17:37, `rename-to-final-location` container started (the last container in `version-...` Pods, it should be an atomic, single-filesystem `mv`).
* 11:17:38, `rename-to-final-location` exits 0, so success, but the CVO does not notice.
* 11:19:27, test-case times out after 2m.
* 11:21:27, CVO refreshes the watch on `version--fgn9s`.

and CVO logs like:

```
I0320 11:17:58.306414 1 reflector.go:1159] "Warning: event bookmark expired" err="k8s.io/client-go/tools/watch/informerwatcher.go:162: hasn't received required bookmark event marking the end of initial events stream, received last event 18.795159461s ago"
I0320 11:18:08.306345 1 reflector.go:1159] "Warning: event bookmark expired" err="k8s.io/client-go/tools/watch/informerwatcher.go:162: hasn't received required bookmark event marking the end of initial events stream, received last event 28.795076029s ago"
...
I0320 11:21:18.307818 1 reflector.go:1159] "Warning: event bookmark expired" err="k8s.io/client-go/tools/watch/informerwatcher.go:162: hasn't received required bookmark event marking the end of initial events stream, received last event 3m38.796537922s ago"
I0320 11:21:27.596115 1 trace.go:236] Trace[23813503]: "Reflector WatchList" name:k8s.io/client-go/tools/watch/informerwatcher.go:162 (20-Mar-2026 11:17:28.301) (total time: 239294ms):
```

21 - 17 = 4m, and I'm giving it 10m to be safe.
The bump allows work like [1] to move forward while we troubleshoot the client issue. We'll definitely want to revert this once the 1.35 clients are fixed.

[1]: openshift/cluster-version-operator#1282
[2]: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-version-operator/1282/pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-ovn-upgrade-out-of-change/2034935978228977664
Pipeline controller notification: For optional jobs, comment. This repository is configured in automatic mode.
@wking: This pull request explicitly references no jira issue.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough

The default acknowledgment timeout for the Cluster Version Operator (CVO) update was increased from 2 minutes to 10 minutes in the upgrade test file. This timeout serves as the fallback value for non-BareMetal and non-OpenStack platforms and impacts flake-threshold checks.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.

🔧 golangci-lint (2.11.3)

Error: can't load config: unsupported version of the configuration: ""
See https://golangci-lint.run/docs/product/migration-guide for migration instructions
🧹 Nitpick comments (1)
test/e2e/upgrade/upgrade.go (1)
85-85: Decouple ack timeout from flake-threshold signal

This constant now controls both the wait timeout and the flake threshold (Line 528), which effectively removes the slow-ack flake signal on current platform branches. Consider introducing a separate flake-threshold constant (or an explicit TODO plus issue expiry) so this temporary timeout bump doesn't silently reduce regression visibility.
As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/upgrade/upgrade.go` at line 85, The constant defaultCVOUpdateAckTimeout currently gates both the wait timeout and the flake-threshold signal (referenced by defaultCVOUpdateAckTimeout and the flake check usage), which hides flake regressions when you temporarily bump the timeout; introduce a separate constant (e.g., defaultCVOFlakeThreshold or cvoUpdateFlakeThreshold) and replace the flake-threshold usage to read that new constant while leaving defaultCVOUpdateAckTimeout as the actual wait timeout, and update any references in the flake-detection logic and comments to use the new name so the two behaviors are decoupled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 46d340d1-0aad-4cf6-8be6-9a4bf5751029
📒 Files selected for processing (1)
test/e2e/upgrade/upgrade.go
/lgtm

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, neisw, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Scheduling required tests: Scheduling tests matching the
@wking: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/verified by E2E passing
@JoelSpeed: This PR has been marked as verified by `E2E passing`.

In response to this: