fix: skip MarkRevisionCreationFailed for transient network errors#16564
fix: skip MarkRevisionCreationFailed for transient network errors#16564Ankitsinghsisodya wants to merge 3 commits intoknative:mainfrom
Conversation
Transient errors (webhook timeouts, GOAWAY, service unavailable) during revision creation were propagating as permanent status failures, confusing users who saw a failed service that eventually became ready after a retry. For transient errors (InternalError, ServiceUnavailable, ServerTimeout, Timeout, TooManyRequests, or url.Error), the warning event is still emitted but the Configuration status is no longer marked as RevisionFailed. The reconciler retries, and the status stays Unknown until the operation actually succeeds or fails permanently. Fixes knative#10511
|
Hi @Ankitsinghsisodya. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
Adjusts Configuration reconciliation to avoid marking RevisionFailed for transient errors during Revision creation, preventing clients (e.g., kn) from reporting permanent failure when the controller will successfully retry.
Changes:
- Add transient-error classification (
isTransientCreateError) and skipMarkRevisionCreationFailedfor those errors while still emitting warning events. - Add table tests covering transient
apierrs.InternalErrorand*url.Errorcreate failures. - Add
WithConfigObservedGenFailuretesting helper to model “new generation observed but reconcile errored”.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/reconciler/configuration/configuration.go | Adds transient error detection and changes reconcile behavior to avoid marking RevisionFailed on transient create errors. |
| pkg/reconciler/configuration/configuration_test.go | Adds regression tests ensuring transient create errors do not mark Configuration failed. |
| pkg/testing/v1/configuration.go | Adds WithConfigObservedGenFailure helper for expected status in transient-error test cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func isTransientCreateError(err error) bool { | ||
| if apierrs.IsInternalError(err) || | ||
| apierrs.IsServiceUnavailable(err) || | ||
| apierrs.IsServerTimeout(err) || | ||
| apierrs.IsTimeout(err) || | ||
| apierrs.IsTooManyRequests(err) { | ||
| return true | ||
| } | ||
| var urlErr *url.Error | ||
| return stderrors.As(err, &urlErr) | ||
| } |
There was a problem hiding this comment.
isTransientCreateError treats any *url.Error as transient. url.Error is also used for non-transient failures (e.g., TLS/x509 validation errors, DNS misconfiguration, unsupported scheme), so this change can leave a Configuration stuck Unknown indefinitely instead of surfacing a persistent failure. Consider narrowing this to known-retriable network errors (e.g., underlying net.Error timeout/temporary, io.EOF/io.ErrUnexpectedEOF/connection reset) and letting other url.Error values continue to mark RevisionFailed (or adding an allowlist/denylist based on the wrapped error).
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #16564 +/- ##
==========================================
+ Coverage 80.18% 80.29% +0.11%
==========================================
Files 217 217
Lines 13532 17322 +3790
==========================================
+ Hits 10850 13908 +3058
- Misses 2319 3049 +730
- Partials 363 365 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
For transient infrastructure errors during revision creation (webhook timeouts, service unavailable, server timeouts, rate limiting), the Configuration status is now set to Ready=Unknown/RevisionCreationRetrying instead of Ready=False/RevisionFailed. This prevents kn and other clients from treating retriable blips as permanent failures while still surfacing the retry state in .status.conditions. Changes: - Add MarkRevisionCreationRetrying to ConfigurationStatus; sets Ready=Unknown so clients continue waiting rather than reporting failure - Drop *url.Error from isTransientCreateError; url.Error wraps permanent failures (TLS mismatch, DNS for non-existent service) in addition to transient ones and cannot be used as a reliable transient signal - isTransientCreateError now covers HTTP codes whose semantics are definitively transient: InternalError (webhook timeout), ServiceUnavailable, ServerTimeout, Timeout, TooManyRequests. A comment documents the known ambiguity of InternalError for permanently-broken webhook services - Replace url.Error test case with InternalError test verifying the Ready=Unknown/RevisionCreationRetrying condition Fixes knative#10511
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Ankitsinghsisodya The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Problem
Fixes #10511
Transient infrastructure errors during revision creation (webhook timeouts, API server overload) were being set as
Ready=False/RevisionFailed, causingknand other clients to report permanent failures for conditions that the controller would successfully retry.Solution
New lifecycle method
Added
MarkRevisionCreationRetrying(message string)toConfigurationStatus. This setsReady=Unknown(notFalse) with reasonRevisionCreationRetrying. Clients watching forReady=Falsecontinue waiting; the error message and a warning event still surface the retry state for operators.Transient error detection
isTransientCreateErrorin the configuration reconciler returnstruefor HTTP codes whose semantics are definitively transient:IsInternalErrorfailed calling webhook: context deadline exceededIsServiceUnavailableIsServerTimeoutIsTimeoutIsTooManyRequests*url.Errorwas explicitly excluded.url.Errorwraps permanent failures (TLS certificate mismatch, DNS failure for a non-existent webhook service, invalid URL) alongside transient ones, so it cannot reliably classify a failure as retriable. Using it would silently suppressRevisionFailedfor permanently broken webhook configurations.A comment in the code documents the known ambiguity of
IsInternalError: a webhook service that is permanently unreachable (deleted service, wrong DNS) also surfaces asInternalError. A webhook that actively rejects the spec typically returnsForbiddenorInvalid, notInternalError.Observability
The retry state is visible in
.status.conditions(RevisionCreationRetrying) and in Kubernetes events (CreationFailedwarning). Both fire on every retry. A follow-up is needed to add an escalation mechanism (e.g. attempt counter or duration threshold after which the status is demoted toFalse) — this is tracked in the issue.Tests
"failure creating revision"test unchanged — non-API errors (likeInduceFailureplain errors) still produceRevisionFailed/False"transient failure creating revision (webhook timeout)"— verifiesReady=Unknown/RevisionCreationRetryingforapierrs.NewInternalErrorMarkRevisionCreationRetryingtest helper added topkg/testing/v1/configuration.gopkg/...tests pass