Allow controller authors to configure queue rate limits #631

negz · 2019-10-07T21:44:45Z

Controllers created using controller-runtime implicitly use the DefaultControllerRateLimiter for the work queue they use to process reconcile requests. If I follow the logic correctly, this means that under the following conditions...

A watched object has changed.
A reconcile attempt has returned an error.
A reconcile attempt has returned Result{Requeue: true}

...the earliest time the resulting reconcile will be processed will be the maximum of:

The enqueue time plus a duration increasing exponentially from 5ms to ~16 minutes per object until said object is reconciled successfully (i.e. a reconcile attempt for that object returns Result{Requeue: false}, nil or Result{RequeueAfter: N}, nil).
The enqueue time plus a duration calculated to limit the controller to 10 requeues per second on average.

While building Crossplane we've found these requeue semantics are not ideal for reconciling Kubernetes resources with cloud provider APIs. There are two parts to this:

Typically we don't want to retry as soon as 5ms from the time we encountered an error, and also don't want to reach a point where we're waiting over 16 minutes between each reconcile.
There is no way for users who do not have access to our controller manager pod logs (i.e. most users) to learn that the object they're waiting on is suffering exponential backoff. Ideally upon encountering an error we'd emit an event explaining the error and informing the user that we'll try again after a particular time.

I believe we could solve both of the above issues if we were able to override the default rate limiter, perhaps by adding it to controller.Options. This would allow us to tweak the exponential backoff parameters, and to plumb the rate limiter through to our Reconciler so it could inform concerned parties when an object was next going to be reconciled.

Relates to #195 and #417.

The text was updated successfully, but these errors were encountered:

DirectXMan12 · 2019-10-07T22:27:53Z

/kind feature
/priority backlog

As long as we're careful not to tie ourselves to closely to the exact interfaces or semantics of the underlying apimachinery libraries, this seems like a reasonable request

negz · 2019-10-08T03:30:19Z

https://godoc.org/k8s.io/client-go/util/workqueue#RateLimiter is the interface controller.New currently uses behind the scenes. If we were unconcerned with exposing that interface I'd likely do something like:

// Options are the arguments for creating a new Controller
type Options struct {
	// ... existing fields omitted

	// A RateLimiter used to limit how frequently requests may be queued.
	// Defaults to <brief summary of DefaultControllerRateLimiter semantics>.
	RateLimiter workqueue.RateLimiter
}

// New returns a new Controller registered with the Manager.  The Manager will ensure that shared Caches have
// been synced before the Controller is Started.
func New(name string, mgr manager.Manager, options Options) (Controller, error) {
	if options.RateLimiter == nil {
		options.RateLimiter = workqueue.DefaultControllerRateLimiter()
	}

	// Make a great new controller...
}

@DirectXMan12 Would you be concerned about leaking that interface to controller-runtime users? If so, would you want to try to alter the interface, or would it be sufficient to define an identical interface inside the controller package?

DirectXMan12 · 2019-10-10T17:16:45Z

the more we expose from apimachinery, the more likely something explodes dramatically in our faces when someone decides that a method name needs to change, or an interface needs to go away, or whatnot (cough 1.16 apimachinery cough). That's one of my big concerns.

To clarify: what's the exact desire here -- set the backoff? Control the backoff in a more granular manner? That interface is decent if we need the latter -- we just need to make sure we can always support it.

negz · 2019-11-19T21:05:41Z

To clarify: what's the exact desire here -- set the backoff? Control the backoff in a more granular manner?

It's not immediately obvious to me what the difference between those two options is, so "yes". :) I'd like:

Control over backoff semantics. For example in our case where we're orchestrating (fairly slow moving) cloud APIs we might want our first retry after ~1 seconds (not 5ms), backing off up to ~5 minutes (not ~16 minutes).
Insight into the current state of the backoff. I feel it's a much better user experience to report "Encountered error 'foo' while attempting to bar the baz. Will try again no sooner than 15:03:30Z" rather than simply reporting that there was an error and letting the user guess at when the next reconcile will happen.

DirectXMan12 · 2019-12-06T01:21:38Z

Yeah, that makes sense. Exposing an option to configure the rate-limiter is probably fine.

rajathagasthya · 2019-12-12T00:49:02Z

Control over backoff semantics. For example in our case where we're orchestrating (fairly slow moving) cloud APIs we might want our first retry after ~1 seconds (not 5ms), backing off up to ~5 minutes (not ~16 minutes).

We have the exact same use case in controlling backoff. So this would be really useful.

@DirectXMan12 Does it make sense to add a new ratelimiter package and define the RateLimiter interface (identical to client-go's RateLimiter) there? And then use that as the type for the rate limiter option?

Fix function comments based on best practices from Effective Go

vincepri · 2020-02-21T17:25:12Z

/kind design

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 7, 2019

negz mentioned this issue Oct 8, 2019

Requeue with exponential backoff crossplane/crossplane-runtime#40

Closed

rajathagasthya mentioned this issue Dec 17, 2019

✨Make workqueue rate limiter configurable during controller creation #731

Merged

DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020

Merge pull request #631 from CodeLingoBot/rewrite

f2c240a

Fix function comments based on best practices from Effective Go

k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Feb 21, 2020

vincepri added this to the Next milestone Feb 21, 2020

vincepri added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed kind/design Categorizes issue or PR as related to design. labels Feb 21, 2020

vincepri modified the milestones: Next, v0.5.x Feb 21, 2020

k8s-ci-robot closed this as completed in #731 Mar 4, 2020

sm-hawkfish mentioned this issue Sep 6, 2020

Unable to scale to large number of concurrent Katib Experiments / Trials kubeflow/katib#1328

Closed

camilamacedo86 mentioned this issue Dec 1, 2020

Requeuing Reconcile stops working after 15 requeues operator-framework/operator-sdk#4209

Closed

negz mentioned this issue Sep 23, 2021

Proposal: Add reconciliation buffer to reduce load #857

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow controller authors to configure queue rate limits #631

Allow controller authors to configure queue rate limits #631

negz commented Oct 7, 2019

DirectXMan12 commented Oct 7, 2019

negz commented Oct 8, 2019 •

edited

Loading

DirectXMan12 commented Oct 10, 2019

negz commented Nov 19, 2019

DirectXMan12 commented Dec 6, 2019

rajathagasthya commented Dec 12, 2019

vincepri commented Feb 21, 2020

Allow controller authors to configure queue rate limits #631

Allow controller authors to configure queue rate limits #631

Comments

negz commented Oct 7, 2019

DirectXMan12 commented Oct 7, 2019

negz commented Oct 8, 2019 • edited Loading

DirectXMan12 commented Oct 10, 2019

negz commented Nov 19, 2019

DirectXMan12 commented Dec 6, 2019

rajathagasthya commented Dec 12, 2019

vincepri commented Feb 21, 2020

negz commented Oct 8, 2019 •

edited

Loading