Fix HPA sample sanitization #67252

Merged (2 commits) on Aug 24, 2018

@jbartosik jbartosik commented Aug 10, 2018

What this PR does / why we need it: @mwielgus pointed out a case where HPA fails as a result of my changes to the HPA algorithm:

  • Pods use a lot of CPU during initialization and become ready right after they initialize,
  • A scale-up is triggered,
  • When the new pods become ready, we count their usage (even though it's not related to any work that needs doing),
  • This triggers another scale-up, even though the existing pods can handle the work without a problem.
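
To make the failure mode concrete, here is a rough Go sketch of the arithmetic. It assumes the documented HPA ratio rule (desired = ceil(current replicas * average utilization / target utilization)) and uses made-up utilization numbers. `desiredReplicas` and `average` are hypothetical helpers, not the controller's code, and the real controller applies further adjustments (tolerance, handling of missing pods) that are left out here.

```go
package main

import "math"

// desiredReplicas applies the basic HPA ratio rule:
// ceil(currentReplicas * avgUtilization / targetUtilization).
// Hypothetical helper for illustration only.
func desiredReplicas(currentReplicas int, avgUtilization, targetUtilization float64) int {
	return int(math.Ceil(float64(currentReplicas) * avgUtilization / targetUtilization))
}

// average returns the arithmetic mean of the given utilization samples (in percent).
func average(samples []float64) float64 {
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// Assumed scenario: 4 established pods at 55% CPU, 2 freshly started pods
// at 100% CPU (initialization work only), target utilization 60%.
var (
	establishedSamples = []float64{55, 55, 55, 55}
	allSamples         = []float64{55, 55, 55, 55, 100, 100}
)
```

With these made-up numbers, `desiredReplicas(6, average(allSamples), 60)` gives 7 (another scale-up from 6 pods), while `desiredReplicas(6, average(establishedSamples), 60)` gives 6 (no change).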

The fix is:

  • Use all samples for non-cpu metrics.
  • Only use CPU samples if:
    • Pod is ready and was started more than 2 minutes ago, or
    • Pod is unready and last readiness change happened more than 10s after it was started.

Reasoning behind this in: https://docs.google.com/document/d/1UdtYedhmCxjaJIQi6hwJMY0eHQQKxlVD8lSHZC1BPOA/edit

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

Release note:

Replace the scale-up forbidden window with disregarding CPU samples collected while a pod was initializing.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 10, 2018
@jbartosik
Contributor Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Aug 10, 2018
@jbartosik
Contributor Author

/assign @DirectXMan12

@MaciekPytel
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 10, 2018
@jbartosik
Contributor Author

/retest

@jbartosik jbartosik changed the title Improve HPA sample sanitization Fix HPA sample sanitization Aug 13, 2018
Contributor

@DirectXMan12 DirectXMan12 left a comment

You need to add what the fix is to the PR description. This definitely needs a release note. All algorithm changes need release notes.

Separately, hard coding a startup window is not acceptable. Some pods start up within 10s, others take 5 minutes. We can't just arbitrarily assume 2 minutes is correct.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 16, 2018
@jbartosik
Contributor Author

Done. I updated the description. I also changed the implementation a little (to match the conclusions in the doc).

@jbartosik
Contributor Author

@DirectXMan12 please take a look

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 21, 2018
After my previous changes HPA wasn't behaving correctly in the following
situation:

- Pods use a lot of CPU during initialization and become ready right after they initialize,
- A scale-up triggers,
- When the new pods become ready, HPA counts their usage (even though it's not related to any work that needs doing),
- Another scale-up triggers, even though the existing pods can handle the work without a problem.
@jbartosik
Contributor Author

/assign @wojtek-t

@mwielgus
Contributor

mwielgus commented Aug 24, 2018

@DirectXMan12
I took the liberty of LGTMing the PR to get it through the door, based on @jbartosik's claim of your verbal conditional LGTM. As @jbartosik is going for vacation next week @krzysztof-jastrzebski will be doing all of the follow-up PRs to adjust readiness-related logic to both your and our preferences.

@wojtek-t
Member

Please squash the last two commits.

The duration of the CPU initialization taint and the window for the
initial readiness setting are controlled by flags.

Adding API violation exceptions following example of e50340e
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2018
@jbartosik
Contributor Author

@wojtek-t Done.

@wojtek-t
Member

/lgtm

[Though those validation exceptions need api-reviewer approval].

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 24, 2018
@wojtek-t
Member

/hold

@thockin - for final approval of the exceptions.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2018
@thockin
Member

thockin commented Aug 24, 2018

I am not sure what I am being asked to approve. What is triggering the violation?

@jbartosik
Contributor Author

@thockin I'm adding new flags for HPA. They require corresponding fields in the componentconfig v1alpha1 API. All such fields have an exception (I think because they start with a capital letter).

@thockin
Member

thockin commented Aug 24, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik, mwielgus, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jbartosik
Contributor Author

@thockin - please remove the hold on this PR. @wojtek-t placed it to make sure you approve it.

@jbartosik
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2018
@jbartosik
Contributor Author

@DirectXMan12 @mwielgus please merge this PR

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Contributor

@jbartosik: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-gce | 4fd6a16 | link | /test pull-kubernetes-e2e-gce |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 66916, 67252, 67794, 67619, 67328). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 663551b into kubernetes:master Aug 24, 2018
@DirectXMan12
Contributor

> As @jbartosik is going for vacation next week @krzysztof-jastrzebski will be doing all of the follow-up PRs to adjust readiness-related logic to both your and our preferences.

As long as the PRs are posted before code slush that's acceptable.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

9 participants