Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow controller authors to configure queue rate limits #631

Closed
negz opened this issue Oct 7, 2019 · 7 comments · Fixed by #731
Closed

Allow controller authors to configure queue rate limits #631

negz opened this issue Oct 7, 2019 · 7 comments · Fixed by #731
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/backlog Higher priority than priority/awaiting-more-evidence.
Milestone

Comments

@negz
Copy link
Contributor

negz commented Oct 7, 2019

Controllers created using controller-runtime implicitly use the DefaultControllerRateLimiter for the work queue they use to process reconcile requests. If I follow the logic correctly, this means that under the following conditions...

  • A watched object has changed.
  • A reconcile attempt has returned an error.
  • A reconcile attempt has returned Result{Requeue: true}

...the earliest time the resulting reconcile will be processed will be the maximum of:

  • The enqueue time plus a duration increasing exponentially from 5ms to ~16 minutes per object until said object is reconciled successfully (i.e. a reconcile attempt for that object returns Result{Requeue: false}, nil or Result{RequeueAfter: N}, nil).
  • The enqueue time plus a duration calculated to limit the controller to 10 requeues per second on average.

While building Crossplane we've found these requeue semantics are not ideal for reconciling Kubernetes resources with cloud provider APIs. There are two parts to this:

  • Typically we don't want to retry as soon as 5ms from the time we encountered an error, and also don't want to reach a point where we're waiting over 16 minutes between each reconcile.
  • There is no way for users who do not have access to our controller manager pod logs (i.e. most users) to learn that the object they're waiting on is suffering exponential backoff. Ideally upon encountering an error we'd emit an event explaining the error and informing the user that we'll try again after a particular time.

I believe we could solve both of the above issues if we were able to override the default rate limiter, perhaps by adding it to controller.Options. This would allow us to tweak the exponential backoff parameters, and to plumb the rate limiter through to our Reconciler so it could inform concerned parties when an object was next going to be reconciled.

Relates to #195 and #417.

@DirectXMan12
Copy link
Contributor

/kind feature
/priority backlog

As long as we're careful not to tie ourselves to closely to the exact interfaces or semantics of the underlying apimachinery libraries, this seems like a reasonable request

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Oct 7, 2019
@negz
Copy link
Contributor Author

negz commented Oct 8, 2019

https://godoc.org/k8s.io/client-go/util/workqueue#RateLimiter is the interface controller.New currently uses behind the scenes. If we were unconcerned with exposing that interface I'd likely do something like:

// Options are the arguments for creating a new Controller
type Options struct {
	// ... existing fields omitted

	// A RateLimiter used to limit how frequently requests may be queued.
	// Defaults to <brief summary of DefaultControllerRateLimiter semantics>.
	RateLimiter workqueue.RateLimiter
}

// New returns a new Controller registered with the Manager.  The Manager will ensure that shared Caches have
// been synced before the Controller is Started.
func New(name string, mgr manager.Manager, options Options) (Controller, error) {
	if options.RateLimiter == nil {
		options.RateLimiter = workqueue.DefaultControllerRateLimiter()
	}

	// Make a great new controller...
}

@DirectXMan12 Would you be concerned about leaking that interface to controller-runtime users? If so, would you want to try to alter the interface, or would it be sufficient to define an identical interface inside the controller package?

@DirectXMan12
Copy link
Contributor

the more we expose from apimachinery, the more likely something explodes dramatically in our faces when someone decides that a method name needs to change, or an interface needs to go away, or whatnot (cough 1.16 apimachinery cough). That's one of my big concerns.

To clarify: what's the exact desire here -- set the backoff? Control the backoff in a more granular manner? That interface is decent if we need the latter -- we just need to make sure we can always support it.

@negz
Copy link
Contributor Author

negz commented Nov 19, 2019

To clarify: what's the exact desire here -- set the backoff? Control the backoff in a more granular manner?

It's not immediately obvious to me what the difference between those two options is, so "yes". :) I'd like:

  • Control over backoff semantics. For example in our case where we're orchestrating (fairly slow moving) cloud APIs we might want our first retry after ~1 seconds (not 5ms), backing off up to ~5 minutes (not ~16 minutes).
  • Insight into the current state of the backoff. I feel it's a much better user experience to report "Encountered error 'foo' while attempting to bar the baz. Will try again no sooner than 15:03:30Z" rather than simply reporting that there was an error and letting the user guess at when the next reconcile will happen.

@DirectXMan12
Copy link
Contributor

Yeah, that makes sense. Exposing an option to configure the rate-limiter is probably fine.

@rajathagasthya
Copy link
Contributor

Control over backoff semantics. For example in our case where we're orchestrating (fairly slow moving) cloud APIs we might want our first retry after ~1 seconds (not 5ms), backing off up to ~5 minutes (not ~16 minutes).

We have the exact same use case in controlling backoff. So this would be really useful.

@DirectXMan12 Does it make sense to add a new ratelimiter package and define the RateLimiter interface (identical to client-go's RateLimiter) there? And then use that as the type for the rate limiter option?

DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020
Fix function comments based on best practices from Effective Go
@vincepri
Copy link
Member

/kind design

@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Feb 21, 2020
@vincepri vincepri added this to the Next milestone Feb 21, 2020
@vincepri vincepri added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed kind/design Categorizes issue or PR as related to design. labels Feb 21, 2020
@vincepri vincepri modified the milestones: Next, v0.5.x Feb 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/backlog Higher priority than priority/awaiting-more-evidence.
Projects
None yet
5 participants