
Add request processing HPA into the queue after processing is finished. #72373

Merged
merged 1 commit into kubernetes:master from krzysztof-jastrzebski:hpa_fix on Jan 4, 2019

Conversation

@krzysztof-jastrzebski (Contributor) commented Dec 27, 2018

This fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.

What type of PR is this?
/kind bug

What this PR does / why we need it:
This PR fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.
Which issue(s) this PR fixes:

Fixes #72372

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixes a bug in the HPA controller so that HPAs are always updated every resyncPeriod (15 seconds).
@k8s-ci-robot (Contributor) commented Dec 27, 2018

Hi @krzysztof-jastrzebski. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krzysztof-jastrzebski (Contributor, Author) commented Dec 27, 2018

/sig autoscaling

@krzysztof-jastrzebski (Contributor, Author) commented Dec 27, 2018

/assign mwielgus

@@ -298,20 +302,20 @@ func (a *HorizontalController) computeReplicasForMetrics(hpa *autoscalingv2.Hori
 	return replicas, metric, statuses, timestamp, nil
 }

-func (a *HorizontalController) reconcileKey(key string) error {
+func (a *HorizontalController) reconcileKey(key string) (err error, deleted bool) {

@mwielgus (Contributor) commented Dec 27, 2018

Error must be the last variable.

@krzysztof-jastrzebski (Author, Contributor) commented Dec 27, 2018

done
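
For reference, a minimal sketch of the reordered signature with the error as the last return value, matching the caller `deleted, err := a.reconcileKey(key.(string))` shown later in the review; the body here is illustrative only, not the merged code:

func (a *HorizontalController) reconcileKey(key string) (deleted bool, err error) {
	// Look up the HPA for this key and reconcile it; report deleted=true when the
	// HPA no longer exists so the caller can decide whether to requeue the key.
	// ...
	return false, nil
}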

@krzysztof-jastrzebski krzysztof-jastrzebski force-pushed the krzysztof-jastrzebski:hpa_fix branch from 7a0b7ee to c56439d Dec 27, 2018
@mwielgus (Contributor) commented Dec 27, 2018

How was this PR tested?

@krzysztof-jastrzebski (Contributor, Author) commented Dec 27, 2018

I created a K8s build with debug logging showing when HPAs are processed, then created a cluster and an HPA. I observed how often the HPA was updated. Next I removed the HPA just after it was processed and verified that it was not processed anymore.

@mwielgus (Contributor) commented Dec 27, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Dec 27, 2018
@mwielgus (Contributor) commented Dec 27, 2018

/approve

@mwielgus (Contributor) commented Dec 27, 2018

/ok-to-test

@k8s-ci-robot (Contributor) commented Dec 27, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: krzysztof-jastrzebski, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fejta-bot commented Dec 27, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@krzysztof-jastrzebski krzysztof-jastrzebski force-pushed the krzysztof-jastrzebski:hpa_fix branch from c56439d to 48eb53e Dec 27, 2018
@liggitt (Member) commented Dec 28, 2018

/hold

Unconditional reinsertion seems very likely to cause problems. Doesn't this cause HPAs to be processed much more frequently than intended?

@krzysztof-jastrzebski (Contributor, Author) commented Dec 28, 2018

The HPA controller uses a rate limiter which always returns resyncPeriod (currently 15 seconds).
Rate Limiter:

func (r *FixedItemIntervalRateLimiter) When(item interface{}) time.Duration {
	return r.interval
}

Creation:
queue: workqueue.NewNamedRateLimitingQueue(NewDefaultHPARateLimiter(resyncPeriod), "horizontalpodautoscaler"),

In that case every request inserted into the queue waits 15 seconds in the queue. If we insert one request, it stays in the queue for 15 seconds, and all requests inserted during those 15 seconds are skipped, as they would be executed after the request that is already in the queue. After 15 seconds the next request can be inserted, and it also stays in the queue for 15 seconds, and so on.
All redundant requests are skipped, so an HPA is updated at most every 15 seconds.

I see this is not the best way of using a RateLimitingQueue, but I would like to make a small fix so that I can backport it to 1.12. If you think we should use the queue in another way, I'm happy to fix it in a follow-up PR. Do you have any suggestions on how to do that?
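
To illustrate the deduplication behavior described above, here is a small, self-contained sketch. It assumes k8s.io/client-go is available on the module path; fixedDelay is an illustrative stand-in for the controller's fixed-interval rate limiter, and a 2-second interval stands in for the real 15-second resyncPeriod:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// fixedDelay returns the same delay for every item, mirroring the idea of the
// HPA controller's fixed-interval rate limiter (illustrative, not the upstream type).
type fixedDelay struct{ d time.Duration }

func (f fixedDelay) When(item interface{}) time.Duration { return f.d }
func (f fixedDelay) Forget(item interface{})             {}
func (f fixedDelay) NumRequeues(item interface{}) int    { return 0 }

func main() {
	resync := 2 * time.Second
	q := workqueue.NewNamedRateLimitingQueue(fixedDelay{resync}, "demo")

	// First add: the key waits `resync` in the delaying queue before it becomes ready.
	q.AddRateLimited("default/my-hpa")
	// Second add while the first is still waiting: it is dropped, because the key
	// would only become ready later than the entry that is already waiting.
	q.AddRateLimited("default/my-hpa")

	start := time.Now()
	key, _ := q.Get() // blocks until the first entry becomes ready (~2s)
	fmt.Printf("got %v after %v\n", key, time.Since(start).Round(time.Second))
	q.Done(key)

	fmt.Println("items left in queue:", q.Len()) // 0: the duplicate add never shows up
	q.ShutDown()
}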

@liggitt (Member) commented Dec 28, 2018

This is a strange use of the rate limiting queue trying to achieve tight guarantees about retry interval.

I'm concerned about doubling the number of queue insertion attempts for systems with large numbers of HPAs. Attempting to enqueue all HPAs in the system every 15 seconds is already very aggressive... this change would attempt to enqueue them all twice every 15 seconds.

Until the structure of this controller can be revisited, can we do something simpler and less likely to impact performance like one of the following:

  • set up the HPA AddEventHandlerWithResyncPeriod with resyncPeriod+time.Second
  • set up NewDefaultHPARateLimiter(resyncPeriod-time.Second)

Either of those approaches would give delayingType#waitingLoop a second to move ready HPA items from waitingForQueue into the queue. If that takes longer than a second (for example, under load), HPA processing could still skip a resync interval, but that actually seems like a better outcome to me than doubling queue attempts in pursuit of exact retry intervals.
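
For context, a rough sketch of the two alternatives suggested above, written as fragments against assumed HPA controller setup code (hpaInformer and the handler funcs are placeholders; AddEventHandlerWithResyncPeriod, NewDefaultHPARateLimiter, and NewNamedRateLimitingQueue are the existing names referenced in this thread):

// Option 1 (sketch): resync one second slower than the queue's fixed delay, giving the
// delaying queue time to move the previous key into the active queue before resync re-adds it.
hpaInformer.Informer().AddEventHandlerWithResyncPeriod(
	cache.ResourceEventHandlerFuncs{ /* existing add/update/delete handlers */ },
	resyncPeriod+time.Second,
)

// Option 2 (sketch): keep the resync period, but make the fixed per-item delay one second shorter.
queue := workqueue.NewNamedRateLimitingQueue(
	NewDefaultHPARateLimiter(resyncPeriod-time.Second),
	"horizontalpodautoscaler",
)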

@krzysztof-jastrzebski (Contributor, Author) commented Dec 28, 2018

I was thinking about a similar solution. The problem with it is deciding why to add or subtract 1 second: resyncPeriod is defined by a flag. The default is 15 seconds, but someone could use a 1-second resyncPeriod. If we added 1 second, HPAs would be refreshed every 2 seconds instead of 1; if we subtracted 1 second, we would also process an HPA on every change of the objects (rate limiter set to 0), which in some cases can be very often.
I don't think that adding an HPA to the queue after processing it is a problem, as the current implementation of the HPA controller (it is single-threaded) can process fewer than 10 HPAs per second. Adding 10 requests every second to a local queue shouldn't be a big deal.

@liggitt (Member) commented Jan 2, 2019

Default is 15 seconds but someone can use 1 second resyncPeriod. If we added 1 second then HPAs would be refreshed every 2 seconds instead of 1, we would also process HPA on every change of the objects (rate limiter set to 0) which can be in some cases very often.

Setting a 1 second resyncPeriod isn't likely to work well in general, but I agree we wouldn't want to drop that to 0. Setting to a percentage or capping at a minimum value would prevent that.

Unconditional requeuing is confusing to anyone who is familiar with all the other controllers. I'd prefer the smallest change we can make that doesn't take this controller even further away from standard usage of the queue components, so I'm not in favor of the current PR, but will defer to @kubernetes/sig-autoscaling-pr-reviews

/hold cancel
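
An illustrative sketch of the safeguard mentioned above (capping the adjusted interval at a minimum value); the one-half floor is an arbitrary example, not something chosen in this PR:

// limiterInterval shortens the fixed delay slightly, per the suggestion above, but never
// lets it drop to zero for very small resync periods.
func limiterInterval(resyncPeriod time.Duration) time.Duration {
	adjusted := resyncPeriod - time.Second
	if floor := resyncPeriod / 2; adjusted < floor {
		adjusted = floor
	}
	return adjusted
}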

@DirectXMan12 (Contributor) left a comment

can you expand a bit on why this occurs in the comment? Use your example from above

-	if err == nil {
-		// don't "forget" here because we want to only process a given HPA once per resync interval
-		return true
+	deleted, err := a.reconcileKey(key.(string))

@DirectXMan12 (Contributor) commented Jan 4, 2019

you should still have a comment as to why we're not forgetting the key, since this works significantly differently from normal controllers.
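
For readers following the thread, a rough sketch of the worker-loop pattern being discussed. This is illustrative and may not match the merged code line for line; utilruntime stands for k8s.io/apimachinery/pkg/util/runtime:

func (a *HorizontalController) processNextWorkItem() bool {
	key, quit := a.queue.Get()
	if quit {
		return false
	}
	defer a.queue.Done(key)

	deleted, err := a.reconcileKey(key.(string))
	if err != nil {
		utilruntime.HandleError(err)
	}

	// Unlike most controllers, the key is not Forget()-ed and is re-added here with the
	// fixed resyncPeriod delay. Relying on resync alone can race with the previous key
	// still sitting in the delaying queue, in which case the resync add is dropped and
	// the HPA would only be processed after 2 x resyncPeriod (the bug this PR fixes).
	if !deleted {
		a.queue.AddRateLimited(key)
	}
	return true
}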

@DirectXMan12 (Contributor) commented Jan 4, 2019

please also fix the PR description to remove the unused flags.

@DirectXMan12 (Contributor) commented Jan 4, 2019

I'm inclined to go with @liggitt's suggestion short-term, and think up a better long-term fix (e.g. don't watch HPA, and instead just reconcile every 15 seconds, staggered, or something).

@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

Using the solution suggested by @liggitt will take the same amount of time (implementing and testing) as implementing the final solution. I prefer submitting this solution (unless you see any bugs in it), backporting it, and implementing the final solution immediately.

@mwielgus (Contributor) commented Jan 4, 2019

All of the proposed solutions are ugly one way or another. Probably the cleanest solution would be to write the right queue instead of trying to bend the existing one to match the needs. Given the amount of time needed to implement it, I'm rather in favour of merging the proposed/existing/tested fix unless someone proves it doesn't work.

@krzysztof-jastrzebski please expand the comments in the code so that the possible confusion caused by non-standard use is minimal.

This fixes a bug where a request inserted by resync is skipped because the previous one hasn't been processed yet.
@krzysztof-jastrzebski krzysztof-jastrzebski force-pushed the krzysztof-jastrzebski:hpa_fix branch from 48eb53e to c6ebd12 Jan 4, 2019
@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

@mwielgus done

@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

/test pull-kubernetes-e2e-kops-aws
/test pull-kubernetes-kubemark-e2e-gce-big

@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

/test pull-kubernetes-e2e-gce-100-performance

@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

/test pull-kubernetes-kubemark-e2e-gce-big

@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

/test pull-kubernetes-e2e-gce-100-performance

@mwielgus (Contributor) commented Jan 4, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 4, 2019
@krzysztof-jastrzebski (Contributor, Author) commented Jan 4, 2019

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot merged commit 86691ca into kubernetes:master Jan 4, 2019
19 checks passed
cla/linuxfoundation: krzysztof-jastrzebski authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-godeps: Skipped
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
tide: In merge pool.
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…-pick-of-#72373-upstream-release-1.13

Automated cherry pick of #72373: Add request processing HPA into the queue after processing is
k8s-ci-robot added a commit that referenced this pull request Jan 8, 2019
…-pick-of-#72373-upstream-release-1.12

Automated cherry pick of #72373: Add request processing HPA into the queue after processing is