Do not count soft-deleted pods for scaling purposes in HPA controller #67067

Merged on Aug 28, 2018 (1 commit)

moonek
Contributor

@moonek moonek commented Aug 7, 2018

What this PR does / why we need it:
Metrics from "soft-deleted" pods (pods already marked for deletion) should probably not matter for scaling purposes, since they'll be gone "soon", whether they are NodeLost or are just being deleted normally.

As long as soft-deleted pods still exist, they prevent normal scale up.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #62845

Special notes for your reviewer:

Release note:

Stop counting soft-deleted pods for scaling purposes in HPA controller to avoid soft-deleted pods incorrectly affecting scale up replica count calculation.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 7, 2018
@moonek
Contributor Author

moonek commented Aug 7, 2018

/assign @DirectXMan12

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Aug 7, 2018
@DirectXMan12
Contributor

@kubernetes/sig-autoscaling-pr-reviews

EDIT: the right SIG, sorry for the mistype

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Aug 7, 2018
@MHBauer
Contributor

MHBauer commented Aug 8, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 8, 2018
@moonek
Contributor Author

moonek commented Aug 8, 2018

How can I pass the test?
My commit is not a breaking change.

@liggitt
Member

liggitt commented Aug 8, 2018

see https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/67067/pull-kubernetes-verify/101321/

diff -Naupr -x _output /go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/BUILD /tmp/verify-bazel.3ZRogM/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/BUILD
--- /go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/BUILD	2018-08-08 01:27:28.692897252 +0000
+++ /tmp/verify-bazel.3ZRogM/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/BUILD	2018-08-08 01:30:54.880071889 +0000
@@ -20,6 +20,7 @@ go_library(
         "//pkg/api/v1/pod:go_default_library",
         "//pkg/controller:go_default_library",
         "//pkg/controller/podautoscaler/metrics:go_default_library",
+        "//pkg/util/node:go_default_library",
         "//staging/src/k8s.io/api/autoscaling/v1:go_default_library",
         "//staging/src/k8s.io/api/autoscaling/v2beta1:go_default_library",
         "//staging/src/k8s.io/api/core/v1:go_default_library",

Run ./hack/update-bazel.sh

you need to run hack/update-bazel.sh and commit the results

@moonek
Contributor Author

moonek commented Aug 8, 2018

/retest

@moonek
Contributor Author

moonek commented Aug 8, 2018

How do I pass pull-kubernetes-e2e-kops-aws? I don't think this is a problem I can solve.

@moonek
Contributor Author

moonek commented Aug 8, 2018

/retest

@moonek
Contributor Author

moonek commented Aug 9, 2018

@MaciekPytel @jszczepkowski @DirectXMan12 PTAL. This was reported as a serious problem during node-failure testing for production-grade k8s use.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 9, 2018
@moonek
Contributor Author

moonek commented Aug 9, 2018

test code added.

@moonek
Contributor Author

moonek commented Aug 9, 2018

/retest

@fedebongio
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Aug 9, 2018
@DirectXMan12
Contributor

The code looks fine here, but I want to clarify what happened. AIUI, when the pods are set as nodelost, they should be deleted shortly thereafter by the node controller, so we shouldn't have an issue with them hanging around forever. Perhaps someone from @kubernetes/sig-node-bugs can weigh in?

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/bug Categorizes issue or PR as related to a bug. labels Aug 13, 2018
@moonek
Contributor Author

moonek commented Aug 14, 2018

@DirectXMan12 If you expect the node controller to delete the nodelost pod, this is clearly an HPA bug.
Since Kubernetes 1.5, the node controller no longer forcibly deletes pods:
#35235
#35145
#51333

Unless an administrator force-deletes the NodeLost pod, the HPA will never scale up.

@moonek
Contributor Author

moonek commented Aug 22, 2018

@DirectXMan12 Could you let me know the current status of the review? I'd like to avoid having to rebase.

@k8s-ci-robot k8s-ci-robot removed sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. area/apiserver labels Aug 25, 2018
@k8s-ci-robot
Contributor

@moonek: Those labels are not set on the issue: area/apiserver

In response to this:

/remove-area apiserver

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@moonek: Those labels are not set on the issue: sig/api-machinery, sig/cli, sig/cloud-provider, sig/cluster-lifecycle, sig/release, sig/scheduling, area/apiserver

In response to this:

/remove-sig api-machinery
/remove-sig cli
/remove-sig cloud-provider
/remove-sig cluster-lifecycle
/remove-sig release
/remove-sig scheduling
/remove-area apiserver
/remove-area kubeadm
/remove-area kubectl
/remove-area kubelet

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed area/kubelet needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 25, 2018
@k8s-ci-robot
Contributor

@moonek: Those labels are not set on the issue: sig/api-machinery, sig/cli, sig/cloud-provider, sig/cluster-lifecycle, sig/release, sig/scheduling, area/apiserver, area/kubeadm, area/kubectl, area/kubelet

In response to this:

/remove-sig api-machinery
/remove-sig cli
/remove-sig cloud-provider
/remove-sig cluster-lifecycle
/remove-sig release
/remove-sig scheduling
/remove-area apiserver
/remove-area kubeadm
/remove-area kubectl
/remove-area kubelet

Oops, that was a rebase accident.

@moonek
Contributor Author

moonek commented Aug 25, 2018

/retest

@moonek
Contributor Author

moonek commented Aug 27, 2018

@DirectXMan12 Would it be difficult to get this fix merged for 1.12?

@DirectXMan12
Contributor

@moonek I'd like to get it merged for 1.12. I'm planning on doing one last review pass tomorrow, then approving it.

Contributor

@DirectXMan12 DirectXMan12 left a comment

minor test-related change, otherwise this is good to go

},
},
},
},
},
}
} else {
Contributor

you shouldn't be creating the entire object in the if-else, just setting the DeletionTimestamp:

pod = v1.Pod{...}
if deletionTimestamp {
  pod.DeletionTimestamp = &metav1.Time{Time: time.Now()}
}
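
A self-contained sketch of the pattern suggested above: build the pod once, then set the deletion timestamp only when the test case asks for it. `ObjectMeta`, `Pod`, and `newTestPod` are hypothetical stand-ins (the real test would use `v1.Pod` and `metav1.Time`), so treat this as an illustration rather than the code that was committed.

```go
package main

import (
	"fmt"
	"time"
)

// ObjectMeta and Pod are hypothetical stand-ins for metav1.ObjectMeta and v1.Pod,
// reduced to the fields this sketch needs.
type ObjectMeta struct {
	Name              string
	DeletionTimestamp *time.Time
}

type Pod struct {
	ObjectMeta
	Ready bool
}

// newTestPod constructs the pod a single time and only toggles the
// deletion timestamp afterwards, instead of duplicating the literal
// in both branches of an if-else.
func newTestPod(name string, ready, softDeleted bool) Pod {
	pod := Pod{ObjectMeta: ObjectMeta{Name: name}, Ready: ready}
	if softDeleted {
		now := time.Now()
		pod.DeletionTimestamp = &now
	}
	return pod
}

func main() {
	live := newTestPod("test-pod-0", true, false)
	lost := newTestPod("test-pod-1", true, true)
	fmt.Println(live.DeletionTimestamp == nil, lost.DeletionTimestamp != nil)
}
```

Keeping a single literal means a later change to the pod fixture only has to be made in one place.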

Type: v1.PodReady,
Status: podReadiness,
pod := v1.Pod{}
if podDeletionTimestamp == false {
Contributor

ditto here

@moonek
Contributor Author

moonek commented Aug 28, 2018

@DirectXMan12 I was also concerned about that part. I modified and re-committed.

@DirectXMan12
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DirectXMan12, moonek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 28, 2018
@DirectXMan12
Contributor

/retest

@k8s-ci-robot
Contributor

k8s-ci-robot commented Aug 28, 2018

@moonek: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-cross | 749f8be3c173c070412e4586809c6dec8b1e2c7d | link | /test pull-kubernetes-cross
pull-kubernetes-local-e2e-containerized | 749f8be3c173c070412e4586809c6dec8b1e2c7d | link | /test pull-kubernetes-local-e2e-containerized
pull-kubernetes-local-e2e | 749f8be3c173c070412e4586809c6dec8b1e2c7d | link | /test pull-kubernetes-local-e2e
pull-kubernetes-e2e-gke | 749f8be3c173c070412e4586809c6dec8b1e2c7d | link | /test pull-kubernetes-e2e-gke

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@moonek
Contributor Author

moonek commented Aug 28, 2018

/retest

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 67067, 67947). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 42c6f1f into kubernetes:master Aug 28, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

HPA not working properly when pod status "Unknown" (node failure)
7 participants