fix: skip MarkRevisionCreationFailed for transient network errors #16564

Closed
Ankitsinghsisodya wants to merge 3 commits into knative:main from Ankitsinghsisodya:fix/transient-reconcile-failure-10511

Ankitsinghsisodya commented Apr 21, 2026

Problem

Fixes #10511

Transient infrastructure errors during revision creation (webhook timeouts, API server overload) were being recorded as Ready=False/RevisionFailed. As a result, kn and other clients reported a permanent failure for conditions that the controller would go on to retry successfully.

Solution

New lifecycle method

Added MarkRevisionCreationRetrying(message string) to ConfigurationStatus. It sets Ready=Unknown (not False) with reason RevisionCreationRetrying. Clients that give up only when they see Ready=False keep waiting, while the error message and a warning event still surface the retry state to operators.
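To illustrate why Unknown matters here, the following is a minimal sketch that models the two paths with plain Go types. The `Condition` struct and the helpers `markRevisionCreationFailed`, `markRevisionCreationRetrying`, and `clientShouldKeepWaiting` are hypothetical stand-ins written for this example; they are not the Knative types or the code in this PR.

```go
package main

// ConditionStatus models the three-valued status used by Kubernetes-style conditions.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  ConditionStatus
	Reason  string
	Message string
}

// markRevisionCreationFailed models the terminal path: Ready=False/RevisionFailed.
func markRevisionCreationFailed(message string) Condition {
	return Condition{Type: "Ready", Status: ConditionFalse, Reason: "RevisionFailed", Message: message}
}

// markRevisionCreationRetrying models the retry path: Ready=Unknown/RevisionCreationRetrying.
func markRevisionCreationRetrying(message string) Condition {
	return Condition{Type: "Ready", Status: ConditionUnknown, Reason: "RevisionCreationRetrying", Message: message}
}

// clientShouldKeepWaiting models a client that treats True and False as
// final answers and keeps polling only while the status is Unknown.
func clientShouldKeepWaiting(c Condition) bool {
	return c.Status == ConditionUnknown
}
```

Under these assumptions, a client polling the retry condition keeps waiting, while the same client stops as soon as it sees the failed condition.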

Transient error detection

isTransientCreateError in the configuration reconciler returns true for HTTP codes whose semantics are definitively transient:

| Error | HTTP | Example |
|---|---|---|
| IsInternalError | 500 | failed calling webhook: context deadline exceeded |
| IsServiceUnavailable | 503 | API server overloaded |
| IsServerTimeout | 504 | Server-side timeout |
| IsTimeout | 408 | Request timeout |
| IsTooManyRequests | 429 | Rate limiting |

*url.Error was explicitly excluded. url.Error wraps permanent failures (TLS certificate mismatch, DNS failure for a non-existent webhook service, invalid URL) alongside transient ones, so it cannot reliably classify a failure as retriable. Using it would silently suppress RevisionFailed for permanently broken webhook configurations.

A comment in the code documents the known ambiguity of IsInternalError: a webhook service that is permanently unreachable (deleted service, wrong DNS) also surfaces as InternalError. A webhook that actively rejects the spec typically returns Forbidden or Invalid, not InternalError.

Observability

The retry state is visible in .status.conditions (RevisionCreationRetrying) and in Kubernetes events (CreationFailed warning). Both fire on every retry. A follow-up is needed to add an escalation mechanism (for example, an attempt counter or a duration threshold after which the status is demoted to False); this is tracked in the issue.

Tests

  • Existing "failure creating revision" test unchanged: non-API errors (like InduceFailure plain errors) still produce RevisionFailed/False
  • New test "transient failure creating revision (webhook timeout)": verifies Ready=Unknown/RevisionCreationRetrying for apierrs.NewInternalError
  • MarkRevisionCreationRetrying test helper added to pkg/testing/v1/configuration.go
  • All pkg/... tests pass

Transient errors (webhook timeouts, GOAWAY, service unavailable) during
revision creation were propagating as permanent status failures, confusing
users who saw a failed service that eventually became ready after a retry.

For transient errors (InternalError, ServiceUnavailable, ServerTimeout,
Timeout, TooManyRequests, or url.Error), the warning event is still emitted
but the Configuration status is no longer marked as RevisionFailed. The
reconciler retries, and the status stays Unknown until the operation
actually succeeds or fails permanently.

Fixes knative#10511
Copilot AI review requested due to automatic review settings April 21, 2026 22:12
knative-prow Bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Apr 21, 2026

knative-prow Bot commented Apr 21, 2026

Hi @Ankitsinghsisodya. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow Bot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Apr 21, 2026
@knative-prow knative-prow Bot requested review from dsimansk and skonto April 21, 2026 22:12
Copilot AI left a comment

Pull request overview

Adjusts Configuration reconciliation to avoid marking RevisionFailed for transient errors during Revision creation, preventing clients (e.g., kn) from reporting permanent failure when the controller will successfully retry.

Changes:

  • Add transient-error classification (isTransientCreateError) and skip MarkRevisionCreationFailed for those errors while still emitting warning events.
  • Add table tests covering transient apierrs.InternalError and *url.Error create failures.
  • Add WithConfigObservedGenFailure testing helper to model “new generation observed but reconcile errored”.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| pkg/reconciler/configuration/configuration.go | Adds transient error detection and changes reconcile behavior to avoid marking RevisionFailed on transient create errors. |
| pkg/reconciler/configuration/configuration_test.go | Adds regression tests ensuring transient create errors do not mark Configuration failed. |
| pkg/testing/v1/configuration.go | Adds WithConfigObservedGenFailure helper for expected status in transient-error test cases. |

Comment on lines +320 to +330
func isTransientCreateError(err error) bool {
	if apierrs.IsInternalError(err) ||
		apierrs.IsServiceUnavailable(err) ||
		apierrs.IsServerTimeout(err) ||
		apierrs.IsTimeout(err) ||
		apierrs.IsTooManyRequests(err) {
		return true
	}
	var urlErr *url.Error
	return stderrors.As(err, &urlErr)
}
Copilot AI Apr 21, 2026

isTransientCreateError treats any *url.Error as transient. url.Error is also used for non-transient failures (e.g., TLS/x509 validation errors, DNS misconfiguration, unsupported scheme), so this change can leave a Configuration stuck Unknown indefinitely instead of surfacing a persistent failure. Consider narrowing this to known-retriable network errors (e.g., underlying net.Error timeout/temporary, io.EOF/io.ErrUnexpectedEOF/connection reset) and letting other url.Error values continue to mark RevisionFailed (or adding an allowlist/denylist based on the wrapped error).

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 61.53846% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.29%. Comparing base (6784d14) to head (53552d5).
⚠️ Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/apis/serving/v1/configuration_lifecycle.go | 0.00% | 5 Missing ⚠️ |
| pkg/testing/v1/configuration.go | 0.00% | 4 Missing ⚠️ |
| pkg/reconciler/configuration/configuration.go | 94.11% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16564      +/-   ##
==========================================
+ Coverage   80.18%   80.29%   +0.11%     
==========================================
  Files         217      217              
  Lines       13532    17322    +3790     
==========================================
+ Hits        10850    13908    +3058     
- Misses       2319     3049     +730     
- Partials      363      365       +2     

For transient infrastructure errors during revision creation (webhook
timeouts, service unavailable, server timeouts, rate limiting), the
Configuration status is now set to Ready=Unknown/RevisionCreationRetrying
instead of Ready=False/RevisionFailed. This prevents kn and other clients
from treating retriable blips as permanent failures while still surfacing
the retry state in .status.conditions.

Changes:
- Add MarkRevisionCreationRetrying to ConfigurationStatus; sets
  Ready=Unknown so clients continue waiting rather than reporting failure
- Drop *url.Error from isTransientCreateError; url.Error wraps permanent
  failures (TLS mismatch, DNS for non-existent service) in addition to
  transient ones and cannot be used as a reliable transient signal
- isTransientCreateError now covers HTTP codes whose semantics are
  definitively transient: InternalError (webhook timeout), ServiceUnavailable,
  ServerTimeout, Timeout, TooManyRequests. A comment documents the known
  ambiguity of InternalError for permanently-broken webhook services
- Replace url.Error test case with InternalError test verifying the
  Ready=Unknown/RevisionCreationRetrying condition

Fixes knative#10511
knative-prow Bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Apr 21, 2026

knative-prow Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Ankitsinghsisodya
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Knative reconcile fails in-middle, but be ready eventually for service creation

2 participants