Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HPA feedback from writing status.replicas to spec.replicas. #79035

Merged
merged 1 commit into from
Jul 2, 2019

Conversation

josephburnett
Copy link
Member

@josephburnett josephburnett commented Jun 14, 2019

/kind bug

What this PR does / why we need it:

There are various reasons that the HPA will decide not the change the current scale. Two important ones are when missing metrics might change the direction of scaling, and when the recommended scale is within tolerance of the current scale.

The way that ReplicaCalculator signals it's desire to not change the current scale is by returning the current scale. However the current scale is from scale.Status.Replicas and can be larger than scale.Spec.Replicas (e.g. during Deployment rollout with configured surge). This causes a positive feedback loop because scale.Status.Replicas is written back into scale.Spec.Replicas, further increasing the current scale.

This PR fixes the feedback loop by plumbing the replica count from spec through horizontal.go and replica_calculator.go so the calculator can punt with the right value.

It also introduces separate types for replica counts derived from scale.Spec and scale.Status (specReplicas and statusReplicas respectively) to guard against this kind of cross-talk. When returning a desired scale to be written into spec, the calculator must either return the given, current scale, or explicitly call newSpecReplicas.

With separate types, other sources of cross-talk became compiler errors. E.g. recording Status.Replicas as an initial recommendation. This would manifest if a deployment was in the process of rolling out when the HPA reboots.

Which issue(s) this PR fixes:

Fixes #78712
Fixes #72775

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 14, 2019
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 14, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @josephburnett. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 14, 2019
@mwielgus
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 14, 2019
@mwielgus mwielgus requested review from mwielgus and removed request for fgrzadkowski June 14, 2019 16:25
@josephburnett
Copy link
Member Author

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.

New comment.

@josephburnett
Copy link
Member Author

If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.

Done.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 14, 2019
@mwielgus mwielgus added this to the v1.15 milestone Jun 14, 2019
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 14, 2019
@josephburnett
Copy link
Member Author

/test pull-kubernetes-integration

@josephburnett
Copy link
Member Author

/test pull-kubernetes-verify

@krzysztof-jastrzebski
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2019
@k8s-ci-robot k8s-ci-robot merged commit cf7662d into kubernetes:master Jul 2, 2019
k8s-ci-robot added a commit that referenced this pull request Jul 3, 2019
Cherry pick #79035 to 1.13 (Fix HPA feedback from writing status.replicas to spec.replicas)
k8s-ci-robot added a commit that referenced this pull request Jul 3, 2019
Cherry pick #79035 to 1.14 (Fix HPA feedback from writing status.replicas to spec.replicas)
k8s-ci-robot added a commit that referenced this pull request Jul 4, 2019
…-#79035-upstream-release-1.15

Automated cherry pick of #79035: There are various reasons that the HPA will decide not the
@juanpmarin
Copy link

Hi, is this currently released ?

josephburnett pushed a commit to josephburnett/kubernetes that referenced this pull request Jul 18, 2019
Handle replica counts derived from spec and status as separate types
so we don't accidentally write observed replicas from status back into
spec, causing a positive feedback loop (kubernetes#79035).
josephburnett pushed a commit to josephburnett/kubernetes that referenced this pull request Jul 24, 2019
josephburnett pushed a commit to josephburnett/kubernetes that referenced this pull request Aug 6, 2019
During a Deployment update there may be more Pods in the scale target
ref status than in the spec. This test verifies that we do not scale
to the status value. Instead we should stay at the spec value.

Fails before kubernetes#79035 and passes after.
max88991 pushed a commit to max88991/kubernetes that referenced this pull request Aug 22, 2019
During a Deployment update there may be more Pods in the scale target
ref status than in the spec. This test verifies that we do not scale
to the status value. Instead we should stay at the spec value.

Fails before kubernetes#79035 and passes after.
@austinpray
Copy link

Hi, is this currently released ?

@juanpmarin it looks like this is gonna be in 1.16+

λ  kubernetes master ✓ git tag --contains 39c4875321991f305d51e30481a66701b6b76f5f
v1.16.0-alpha.1
v1.16.0-alpha.2
v1.16.0-alpha.3
v1.16.0-beta.0
v1.16.0-beta.1
v1.17.0-alpha.0

@kimxogus
Copy link

kimxogus commented Aug 29, 2019

@juanpmarin This is also cherry picked to 1.13(#79707), 1.14(#79708), 1.15(#79709, #79727). Latest version of them may have this fix too.

@paalkr
Copy link

paalkr commented Sep 29, 2019

I would just like to draw some attention to #72775. Many are still facing issues with rolling updates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet