Don't drop traffic when upgrading a deployment fails #14795

Merged
knative-prow bot merged 5 commits into knative:main from the reachability-fix branch on Feb 5, 2024

Member

@dprotaso dprotaso commented Jan 15, 2024

Fixes: #14660

Depends on a 1.13 point release that includes the fix in #14846

@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/API API objects and controllers labels Jan 15, 2024
@knative-prow knative-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 15, 2024
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 17, 2024
@dprotaso dprotaso force-pushed the reachability-fix branch 2 times, most recently from d2f5d97 to 1687537 Compare January 17, 2024 18:26

codecov bot commented Jan 17, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (752314e) 86.02% compared to head (87d7f34) 85.78%.
Report is 49 commits behind head on main.

Files | Patch % | Lines
pkg/apis/serving/v1/revision_lifecycle.go | 70.00% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14795      +/-   ##
==========================================
- Coverage   86.02%   85.78%   -0.24%     
==========================================
  Files         197      198       +1     
  Lines       14950    15126     +176     
==========================================
+ Hits        12860    12976     +116     
- Misses       1778     1827      +49     
- Partials      312      323      +11     


When transforming the Deployment status into the Revision status,
we want to bubble up the more severe condition to Ready.

Since replica failures include a more actionable error
message, that condition is preferred.
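
As a rough illustration of the preference described above, here is a minimal Go sketch. It assumes a simplified `Condition` struct and a hypothetical `readyFrom` helper; neither is the Knative or Kubernetes type, and the mapping of Progressing to Ready is a simplification, not the PR's actual code.

```go
package main

// Condition is a simplified, hypothetical stand-in for a Kubernetes-style
// status condition. It is NOT the Knative or Kubernetes API type.
type Condition struct {
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// readyFrom derives a Ready condition from two Deployment conditions.
// A replica failure takes precedence over a stalled rollout because its
// message (for example a quota or admission error) tends to be more
// actionable. The remaining cases are assumptions made for illustration.
func readyFrom(progressing, replicaFailure Condition) Condition {
	switch {
	case replicaFailure.Status == "True":
		// Preferred: surface the replica failure's reason and message.
		return Condition{Status: "False", Reason: replicaFailure.Reason, Message: replicaFailure.Message}
	case progressing.Status == "False":
		// The rollout stopped making progress.
		return Condition{Status: "False", Reason: progressing.Reason, Message: progressing.Message}
	case progressing.Status == "True":
		return Condition{Status: "True"}
	default:
		// Nothing reported yet: still deploying.
		return Condition{Status: "Unknown", Reason: "Deploying"}
	}
}
```

When both conditions report a problem, the sketch returns the replica failure's reason and message rather than the rollout's.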
This isn't accurate when the Revision has failed to roll out
an update to its Deployment.
1. PA reachability now depends on the status of the Deployment.

If we have available replicas, we don't mark the Revision as
unreachable. This allows ongoing requests to be handled.

2. Always propagate the K8s Deployment status to the Revision.

We don't need to gate this on whether the Revision
required activation, since the only two conditions we propagate
from the Deployment are Progressing and ReplicaSetFailure=False.

3. Mark the Revision as Deploying if the PA's service name isn't set.
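
The first point can be sketched in Go as follows. This is only an illustration under stated assumptions: `RevisionView`, its fields, and `reachabilityFor` are hypothetical names invented here, and treating an unrouted revision as unreachable is also an assumption. The PR's real implementation works on Knative's Revision and the Kubernetes Deployment types.

```go
package main

// Reachability is a hypothetical two-valued signal handed to the autoscaler.
type Reachability string

const (
	Reachable   Reachability = "Reachable"
	Unreachable Reachability = "Unreachable"
)

// RevisionView bundles the illustrative inputs. All fields are assumptions
// made for this sketch, not fields of any Knative type.
type RevisionView struct {
	RoutingActive     bool  // a route currently references the revision
	RolloutFailed     bool  // the latest Deployment update failed
	AvailableReplicas int32 // replicas currently able to serve traffic
}

// reachabilityFor keeps a revision reachable while replicas are available,
// even if the latest rollout failed, so ongoing requests keep being served.
func reachabilityFor(v RevisionView) Reachability {
	// A failed rollout only makes the revision unreachable when
	// nothing is left to serve traffic.
	if v.RolloutFailed && v.AvailableReplicas == 0 {
		return Unreachable
	}
	if v.RoutingActive {
		return Reachable
	}
	return Unreachable
}
```

With these assumptions, a routed revision whose update failed but which still has two available replicas stays `Reachable`; the same revision with zero available replicas becomes `Unreachable`.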
@dprotaso
Member Author

/test-all

@knative-prow knative-prow bot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Jan 28, 2024
@dprotaso dprotaso force-pushed the reachability-fix branch 2 times, most recently from cf53416 to b8b019b Compare January 28, 2024 21:55
@dprotaso
Member Author

dprotaso commented Jan 29, 2024

I'm running the upgrade test against the 1.12 branch to ensure it fails when the fix isn't present: #14840

@knative-prow knative-prow bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 30, 2024
@knative-prow knative-prow bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 30, 2024
@dprotaso dprotaso changed the title from "[wip] reachability fixes" to "Don't drop traffic when upgrading a deployment fails" on Jan 30, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 30, 2024
@dprotaso
Member Author

/hold

The upgrade test is failing because the bug exists in the 1.12 and 1.13 release branches. When we upgrade from those older release branches, the older controller observes the config map changes and then incorrectly marks reachability=false.

Thus, to ensure this upgrade test works, we should cherry-pick to 1.12 and perform a release, then cherry-pick to 1.13 and do a release.

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2024
@dprotaso
Member Author

dprotaso commented Feb 3, 2024

/retest
/hold cancel

v1.13.1 is out, which should unblock the upgrade test.

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 3, 2024
@dprotaso
Member Author

dprotaso commented Feb 5, 2024

@ReToCode @skonto this is ready to merge

Member

@ReToCode ReToCode left a comment

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 5, 2024

knative-prow bot commented Feb 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, ReToCode


@dprotaso
Member Author

dprotaso commented Feb 5, 2024

/retest

@knative-prow knative-prow bot merged commit 1760f08 into knative:main Feb 5, 2024
49 checks passed
@dprotaso dprotaso deleted the reachability-fix branch February 5, 2024 23:26
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/API: API objects and controllers
area/autoscale
area/test-and-release: It flags unit/e2e/conformance/perf test issues for product features
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Since 1.12, healthy Revision is taken down because of temporary glitch during Pod creation
4 participants