Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent a PodAutoscaler's DesiredScale to turn to -1 #14866

Conversation

SaschaSchwarze0
Copy link
Contributor

Fixes #14669

Proposed Changes

  • This changed the Knative PodAutoscaler logic to never change the DesiredScale from a non-negative to a negative value.
  • This happens when leading AutoScaler restarted and it has no metrics. Moments later once it has metrics, it will change it back.
  • While this has no functional impact, it causes such a change for every revision which leads to four Kubernetes API calls (update PodAutoscaler in autoscaler, update Revision in controller, and then again PodAutoscaler in autoscaler, update Revision in controller) which has massive impact as QPS slows down overall progression inside these two controllers.

Release Note

The autoscaler now keeps the desiredScale of a PodAutoscaler at its current value while it initializes and therefore has not yet metrics

@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 5, 2024
Copy link

knative-prow bot commented Feb 5, 2024

Hi @SaschaSchwarze0. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot added the area/API API objects and controllers label Feb 5, 2024
@skonto
Copy link
Contributor

skonto commented Feb 5, 2024

/ok-to-test

cc @dprotaso @psschwei

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 5, 2024
@skonto
Copy link
Contributor

skonto commented Feb 5, 2024

It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).

@SaschaSchwarze0
Copy link
Contributor Author

It seems that the failing test Name: "sks does not exist", expects to set pa.DesiredScale to -1 instead of the original value (11).

Yep, now that I don't have this change guarded by a flag anymore, I have to look at the results of all unit tests. Will check.

@SaschaSchwarze0 SaschaSchwarze0 force-pushed the sascha-14669-ignore-pa-desiredscale-change-to-negative branch from e3f5c11 to e3d76a8 Compare February 6, 2024 07:44
@SaschaSchwarze0
Copy link
Contributor Author

I changed the test case. It was actually running into that code path where there are no metrics and now that means that it keeps the desiredScale rather than setting it to -1.

@SaschaSchwarze0
Copy link
Contributor Author

/easycla

Copy link

codecov bot commented Feb 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.05%. Comparing base (17df219) to head (87a5c84).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14866      +/-   ##
==========================================
+ Coverage   84.02%   84.05%   +0.03%     
==========================================
  Files         213      213              
  Lines       16796    16803       +7     
==========================================
+ Hits        14112    14124      +12     
+ Misses       2329     2323       -6     
- Partials      355      356       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@psschwei psschwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2024
@dprotaso
Copy link
Member

I've been reviewing #14607 & #14573 which address bugs in the same part of the codebase as this PR.

Those PRs fix the problem but they would introduce regressions if merged as is - so it's not solved yet. I'm creating e2e tests to catch these regressions - it would be good to run them against this PR when they land.

It sorta feels like there might a 'holistic' solution for those PRs where we want to distinguish "I have no data for this revision" and "I have no data for this revision and I haven't tried to sample any"

The latter scenario sounds seems to match what this PR is trying to fix.

I'm going to hold this PR for now until I wrap up the other reviews and the e2e tests

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2024
@SaschaSchwarze0
Copy link
Contributor Author

@dprotaso can you give an update where we are standing here ?

Comment on lines 225 to 230
// later back to the correct value. The following condition omits this.
if pc.want > -1 || pa.Status.DesiredScale == nil || *pa.Status.DesiredScale < 0 {
pa.Status.DesiredScale = ptr.Int32(int32(pc.want))
} else {
logger.Debugf("Ignoring change of desiredScale from %d to %d", *pa.Status.DesiredScale, pc.want)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on flipping the condition and the blocks around?

I found it easier to reason about this code when I realized the else condition is

pc.want == -1 && pa.Status.DesiredScale != nil && pa.Status.DesiredScale >= 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it both ways. 8-) Changed it.

@@ -219,7 +219,15 @@ func (c *Reconciler) reconcileDecider(ctx context.Context, pa *autoscalingv1alph
}

func computeStatus(ctx context.Context, pa *autoscalingv1alpha1.PodAutoscaler, pc podCounts, logger *zap.SugaredLogger) {
pa.Status.DesiredScale, pa.Status.ActualScale = ptr.Int32(int32(pc.want)), ptr.Int32(int32(pc.ready))
pa.Status.ActualScale = ptr.Int32(int32(pc.ready))
Copy link
Member

@dprotaso dprotaso Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why ActualScale is different then I saw we are deriving this number from querying the informer cache which is sync'd at startup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, that value is never incorrect.

@SaschaSchwarze0 SaschaSchwarze0 force-pushed the sascha-14669-ignore-pa-desiredscale-change-to-negative branch from e3d76a8 to d08cde0 Compare March 27, 2024 16:17
@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
@dprotaso
Copy link
Member

/hold cancel
/lgtm

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2024
@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
@dprotaso
Copy link
Member

/approve

Copy link

knative-prow bot commented Mar 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, SaschaSchwarze0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 27, 2024
@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
@dprotaso
Copy link
Member

linter failed - I just updated the PR to use the right ptr package

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2024
@dprotaso
Copy link
Member

dprotaso commented Mar 27, 2024

kourier-tls flakes - #15052
/retest

@knative-prow knative-prow bot merged commit 3e2243d into knative:main Mar 27, 2024
55 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Restart of autoscaler causes Knative to perform four status update calls per active revision
4 participants