OCPBUGS-84504: retry transient kubeconfig read failures in GetClientConfig#31080
Conversation
During upgrade prow jobs the kubeconfig Secret volume is re-mounted by the kubelet, which atomically swaps symlinks and creates a sub-second window where ReadFile() returns ENOENT or reads empty/partial content. With 10-15 parallel Ginkgo goroutines all calling GetClientConfig() on every client creation, this window reliably causes mass test failures. Replace the single ReadFile attempt with a wait.PollImmediate retry (200ms interval, 10s timeout) that retries on file-not-found, empty file, or clientcmd parse errors. Non-transient I/O errors still fail immediately. Adds unit tests covering all four cases. Fixes: OCPBUGS-84504 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
WalkthroughModified Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@dgoodwin: This pull request references Jira Issue OCPBUGS-84504, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/extended/util/client_test.go (1)
76-112: Add coverage for the remaining retry branches.These tests still don’t exercise the parse-error and non-transient I/O paths that
GetClientConfignow distinguishes, so the retry contract is only partially locked down. The timeout assertion is also looser than the 10s poll window, which makes regressions easier to miss.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/util/client_test.go` around lines 76 - 112, Add tests covering the parse-error and non-transient I/O branches of GetClientConfig and tighten the timeout assertions to the 10s poll window: add one test that writes an invalid kubeconfig to path (triggering parse error in GetClientConfig) and assert it returns an error immediately (elapsed << 1s) with no retries; add another test that simulates a non-transient I/O error (e.g., create the file or directory with permissions that cause a permanent permission-denied read) and assert GetClientConfig returns an error quickly (elapsed < 10s) without retrying; update the existing TestGetClientConfigFailsOnPermanentMissing and TestGetClientConfigSucceedsImmediately timeout checks to use the 10s poll window (assert elapsed < 10s for failures and <1s for immediate success) so behavior is fully covered and aligned with GetClientConfig’s retry contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/util/client_test.go`:
- Around line 28-74: The retry tests TestGetClientConfigRetryOnMissingFile and
TestGetClientConfigRetryOnEmptyFile are flakey due to fixed sleeps and
fire-and-forget goroutines; change them to synchronize the background file
mutations and surface any file I/O errors so the test deterministically
exercises GetClientConfig's retry window. Specifically, replace the anonymous
goroutines with a goroutine that reports errors over a channel (or use a
sync.WaitGroup) and signal when the mutation is done, wait for that signal
before finishing the test, and check/require the returned error values from
os.Remove/os.WriteFile in those goroutines; ensure the test waits for the
background worker to start the mutation timing (e.g., signal readiness) so the
GetClientConfig call actually encounters the missing/empty file and triggers
retries.
---
Nitpick comments:
In `@test/extended/util/client_test.go`:
- Around line 76-112: Add tests covering the parse-error and non-transient I/O
branches of GetClientConfig and tighten the timeout assertions to the 10s poll
window: add one test that writes an invalid kubeconfig to path (triggering parse
error in GetClientConfig) and assert it returns an error immediately (elapsed <<
1s) with no retries; add another test that simulates a non-transient I/O error
(e.g., create the file or directory with permissions that cause a permanent
permission-denied read) and assert GetClientConfig returns an error quickly
(elapsed < 10s) without retrying; update the existing
TestGetClientConfigFailsOnPermanentMissing and
TestGetClientConfigSucceedsImmediately timeout checks to use the 10s poll window
(assert elapsed < 10s for failures and <1s for immediate success) so behavior is
fully covered and aligned with GetClientConfig’s retry contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d353c3b8-2956-42cf-b1c8-0567cf438d2b
📒 Files selected for processing (2)
test/extended/util/client.gotest/extended/util/client_test.go
| func TestGetClientConfigRetryOnMissingFile(t *testing.T) { | ||
| dir := t.TempDir() | ||
| path := filepath.Join(dir, "kubeconfig") | ||
|
|
||
| if err := os.WriteFile(path, []byte(validKubeconfig), 0644); err != nil { | ||
| t.Fatal(err) | ||
| } | ||
|
|
||
| go func() { | ||
| time.Sleep(200 * time.Millisecond) | ||
| os.Remove(path) | ||
| time.Sleep(800 * time.Millisecond) | ||
| os.WriteFile(path, []byte(validKubeconfig), 0644) | ||
| }() | ||
|
|
||
| time.Sleep(300 * time.Millisecond) | ||
|
|
||
| cfg, err := GetClientConfig(path) | ||
| if err != nil { | ||
| t.Fatalf("expected retry to succeed, got error: %v", err) | ||
| } | ||
| if cfg.Host != "https://localhost:6443" { | ||
| t.Fatalf("unexpected host: %s", cfg.Host) | ||
| } | ||
| } | ||
|
|
||
| func TestGetClientConfigRetryOnEmptyFile(t *testing.T) { | ||
| dir := t.TempDir() | ||
| path := filepath.Join(dir, "kubeconfig") | ||
|
|
||
| if err := os.WriteFile(path, []byte{}, 0644); err != nil { | ||
| t.Fatal(err) | ||
| } | ||
|
|
||
| go func() { | ||
| time.Sleep(500 * time.Millisecond) | ||
| os.WriteFile(path, []byte(validKubeconfig), 0644) | ||
| }() | ||
|
|
||
| cfg, err := GetClientConfig(path) | ||
| if err != nil { | ||
| t.Fatalf("expected retry to succeed, got error: %v", err) | ||
| } | ||
| if cfg.Host != "https://localhost:6443" { | ||
| t.Fatalf("unexpected host: %s", cfg.Host) | ||
| } | ||
| } |
There was a problem hiding this comment.
Make the retry tests deterministic.
The fixed sleeps and fire-and-forget goroutines can let these tests pass without actually hitting the retry window, and they can also race t.TempDir() cleanup on slower CI workers. Please synchronize the file mutation and check the background write/remove errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/util/client_test.go` around lines 28 - 74, The retry tests
TestGetClientConfigRetryOnMissingFile and TestGetClientConfigRetryOnEmptyFile
are flakey due to fixed sleeps and fire-and-forget goroutines; change them to
synchronize the background file mutations and surface any file I/O errors so the
test deterministically exercises GetClientConfig's retry window. Specifically,
replace the anonymous goroutines with a goroutine that reports errors over a
channel (or use a sync.WaitGroup) and signal when the mutation is done, wait for
that signal before finishing the test, and check/require the returned error
values from os.Remove/os.WriteFile in those goroutines; ensure the test waits
for the background worker to start the mutation timing (e.g., signal readiness)
so the GetClientConfig call actually encounters the missing/empty file and
triggers retries.
|
/jira refresh |
|
@dgoodwin: This pull request references Jira Issue OCPBUGS-84504, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
|
None of them seems like a blocker, looks good just the way they are. /lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, jcmoraisjr The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified later |
|
@dgoodwin: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified later @dgoodwin |
|
@dgoodwin: This PR has been marked to be verified later by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.21 |
|
@dgoodwin: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@dgoodwin: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@dgoodwin: Jira Issue OCPBUGS-84504: All pull requests linked via external trackers have merged: This pull request has the DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@dgoodwin: new pull request created: #31086 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-4.22 |
|
@dgoodwin: new pull request created: #31096 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
During upgrade prow jobs the kubeconfig Secret volume is re-mounted by
the kubelet, which atomically swaps symlinks and creates a sub-second
window where ReadFile() returns ENOENT or reads empty/partial content.
With 10-15 parallel Ginkgo goroutines all calling GetClientConfig() on
every client creation, this window can cause rare mass test failures.
Replace the single ReadFile attempt with a wait.PollImmediate retry
(200ms interval, 10s timeout) that retries on file-not-found, empty
file, or clientcmd parse errors. Non-transient I/O errors still fail
immediately. Adds unit tests covering all four cases.
Fixes: OCPBUGS-84504
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Summary by CodeRabbit
Bug Fixes
Tests