
add controller::Config and debounce period to scheduler #1265

Merged
merged 6 commits into kube-rs:main from the debounce branch on Aug 8, 2023

Conversation

@aryan9600 (Contributor) commented on Jul 25, 2023

Motivation

Without debouncing, the scheduler does not get enough time to deduplicate events, leading to unnecessary reconcile runs by the controller. Although reconcile loops are idempotent, we should avoid running them when not required.

Solution

Add controller::Config to allow configuring the behavior of the controller. Introduce a debounce period for the scheduler to allow for deduplication of requests. By default, the debounce period is set to zero seconds, which is equivalent to the existing behaviour (no debounce).

Fixes #1247
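For illustration, here is a minimal sketch of how a controller might opt in to the debounce described above. The with_config/debounce builder method names and the trivial reconciler are assumptions for this sketch, not necessarily the final API surface of this PR:

use std::{sync::Arc, time::Duration};
use futures::StreamExt;
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{
    runtime::{
        controller::{Action, Config},
        watcher, Controller,
    },
    Api, Client,
};

// Placeholder reconciler for the sketch: just acknowledge the object.
async fn reconcile(_cm: Arc<ConfigMap>, _ctx: Arc<()>) -> Result<Action, kube::Error> {
    Ok(Action::await_change())
}

fn error_policy(_cm: Arc<ConfigMap>, _err: &kube::Error, _ctx: Arc<()>) -> Action {
    Action::requeue(Duration::from_secs(30))
}

async fn run(client: Client) {
    let cms: Api<ConfigMap> = Api::all(client);
    Controller::new(cms, watcher::Config::default())
        // a short debounce lets bursts of events coalesce;
        // the default Config applies no debounce, matching the old behaviour
        .with_config(Config::default().debounce(Duration::from_secs(1)))
        .run(reconcile, error_policy, Arc::new(()))
        .for_each(|res| async move {
            if let Err(e) = res {
                eprintln!("reconcile failed: {e:?}");
            }
        })
        .await;
}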

@codecov bot commented on Jul 25, 2023

Codecov Report

Merging #1265 (584a4be) into main (409f6ef) will increase coverage by 0.09%.
Report is 1 commit behind head on main.
The diff coverage is 81.96%.

❗ Current head 584a4be differs from the pull request's most recent head 20f6eaf. Consider uploading reports for commit 20f6eaf to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1265      +/-   ##
==========================================
+ Coverage   72.55%   72.64%   +0.09%     
==========================================
  Files          75       75              
  Lines        6136     6186      +50     
==========================================
+ Hits         4452     4494      +42     
- Misses       1684     1692       +8     
Files Changed Coverage Δ
kube-client/src/client/mod.rs 71.91% <ø> (+0.40%) ⬆️
kube-runtime/src/controller/runner.rs 94.36% <ø> (ø)
kube-runtime/src/controller/mod.rs 34.21% <21.42%> (-0.95%) ⬇️
kube-runtime/src/scheduler.rs 97.84% <100.00%> (+0.46%) ⬆️

kube-runtime/src/controller/mod.rs (outdated, resolved)
Comment on lines 486 to 500
let (mut sched_tx, sched_rx) = mpsc::unbounded::<ScheduleRequest<SingletonMessage>>();
let mut scheduler = scheduler(sched_rx, Some(Duration::from_secs(5)));

sched_tx
    .send(ScheduleRequest {
        message: SingletonMessage(1),
        run_at: now + Duration::from_secs(1),
    })
    .await
    .unwrap();
assert!(poll!(scheduler.next()).is_pending());
advance(Duration::from_secs(2)).await;
assert_eq!(scheduler.next().now_or_never().unwrap().unwrap().0, 1);
// make sure that we insert the popped message into expired
assert_eq!(scheduler.expired.len(), 1);
Member
This is not quite how I envisioned debouncing would work. This basically takes the first and eliminates the last, which is functionally wrong (the user should always see the last request, as it is the most up to date).

Normally debounce works by adding an additional 5s (say) to the initial schedule, and in that period it does not run the first reconcile. If multiple requests happen within that 5s period, the 5s wait time gets reset so that we only actually run the reconciler 5s after the last uninterrupted request.
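A tiny hypothetical sketch of that "reset the wait on every new request" behaviour, illustrative only and not kube-runtime code:

use std::time::{Duration, Instant};

// Trailing-edge debounce: each new request pushes the deadline back out,
// so the work only runs once requests stop arriving for `period`.
struct Debounce {
    period: Duration,
    deadline: Option<Instant>,
}

impl Debounce {
    fn on_request(&mut self) {
        // every request resets the wait to `period` from now
        self.deadline = Some(Instant::now() + self.period);
    }

    fn ready(&self) -> bool {
        self.deadline.map_or(false, |d| Instant::now() >= d)
    }
}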

Contributor Author

So is it more like: if a request came in at time T asking the scheduler to emit it at T+10, and another request came in at time T+1 asking to be emitted at T+11, then we need to consider the latter request, provided that no other requests came in until T+6? So basically it's not run_at that's to be considered here, but the time at which the request arrives at the scheduler?

Member

The requested run time isn't disregarded. It's more that we are giving a passive leeway in both the vacant and occupied case, so that the scheduler does not emit immediately no matter what.

We already deduplicate in the T+10 and T+11 case you wrote, because they are far enough in the future that neither has run (and we pick the last). But we don't get the chance to deduplicate when people schedule for "now".

I think we can, in the Vacant case, add the debounce time to the scheduled run time and rely on the natural deduplication to do its job. But we must account for the shift in the Occupied case to ensure we actually pick the last still (and also add debounce time there when resetting).
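A rough sketch of those two cases (a hypothetical MiniScheduler for illustration; the real scheduler in kube-runtime/src/scheduler.rs drives a DelayQueue rather than a plain map):

use std::collections::{hash_map::Entry, HashMap};
use std::hash::Hash;
use tokio::time::{Duration, Instant};

struct MiniScheduler<T: Eq + Hash> {
    debounce: Duration,
    // message -> debounced deadline at which it should be emitted
    scheduled: HashMap<T, Instant>,
}

impl<T: Eq + Hash> MiniScheduler<T> {
    fn schedule(&mut self, message: T, run_at: Instant) {
        match self.scheduled.entry(message) {
            // Vacant: shift the requested run time by the debounce period
            // and rely on the normal deduplication from here on.
            Entry::Vacant(slot) => {
                slot.insert(run_at + self.debounce);
            }
            // Occupied, and the stored deadline is not earlier than the new
            // request: reset to the new request plus the debounce, so the
            // last request in a burst is the one that counts.
            Entry::Occupied(mut slot) if *slot.get() >= run_at => {
                slot.insert(run_at + self.debounce);
            }
            // Occupied, and the stored deadline is already earlier: drop the
            // new request, since it is a duplicate for the same message.
            Entry::Occupied(_) => {}
        }
    }
}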

Contributor Author

Okay, I think I get it now; made the relevant changes.

Member

Normally debounce works by adding an additional 5s (say) to the initial schedule, and in that period it does not run the first reconcile. If multiple requests happen within that 5s period, the 5s wait time gets reset so that we only actually run the reconciler 5s after the last uninterrupted request.

That's dangerous, it means that someone writing more frequently than the debounce period can lock you out of reconciling completely...

Member

That's dangerous, it means that someone writing more frequently than the debounce period can lock you out of reconciling completely...

Yep. That's why the period should be suggested as a few seconds at most. Generally, in Kubernetes, objects go through phases quickly when they are setting up, but then stabilise - and that's when you get your event (if you've chosen to debounce).

In the periodic object updater case, say something like a Node controller (which updates its status every ~30s by design), it will lock you out if you set the debounce period greater than that. But also, if you set the number that high, you will have a uselessly slow reconciler even when developing.

I think we need to emphasize that debounce is a tool, that it is not meant to prevent all types of reconciliation retriggers (use predicates::generation to prevent retriggers from status changes), and that it is potentially a user error to set too high a number here.

The alternative is maybe to track the amount of delay we have caused and not add the debounce time if we've already added, say, 10x the debounce time? Not sure how viable that is.
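As an aside, a sketch of the predicates::generation approach mentioned above, assuming the predicate_filter stream helper in kube_runtime (which at the time sat behind an unstable feature flag); status-only updates do not bump .metadata.generation, so they get filtered out before ever reaching the reconciler:

use futures::StreamExt;
use k8s_openapi::api::apps::v1::Deployment;
use kube::{
    runtime::{predicates, watcher, WatchStreamExt},
    Api, Client,
};

async fn watch_spec_changes(client: Client) {
    let deploys: Api<Deployment> = Api::all(client);
    let mut changed = watcher(deploys, watcher::Config::default())
        .applied_objects()
        // only pass events where .metadata.generation changed,
        // i.e. spec changes rather than status-only updates
        .predicate_filter(predicates::generation)
        .boxed();
    while let Some(next) = changed.next().await {
        if let Ok(dep) = next {
            println!("spec changed: {:?}", dep.metadata.name);
        }
    }
}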

Member

Honestly, I think this is severe overengineering to avoid running an extra reconcile or two during the setup phase. In my book, running one reconcile too many is a lot better than running one too few.

Member

The goal is to run the right number of reconciles, though. If you know your constraints, you won't under-run.

We can keep it off by default if you prefer. But I would really like to at least have the option to set this, because I have hot controllers running thousands of reconciliations per minute, and improvements towards cutting down the noise are helpful.

Add `controller::Config` to allow configuring the behavior of the
controller. Introduce a debounce period for the scheduler to allow for
deduplication of requests. By default, the debounce period is set to 1
second.

Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
@aryan9600 force-pushed the debounce branch 2 times, most recently from 21249a7 to af1b5ae, on August 4, 2023 18:10
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
@clux (Member) left a comment

Looking much cleaner now. Thank you. Some documentation nits and a test question, but these are getting pretty small now.

kube-runtime/src/controller/mod.rs (outdated, resolved)
kube-runtime/src/controller/mod.rs (outdated, resolved)
kube-runtime/src/scheduler.rs (resolved)
kube-runtime/src/scheduler.rs (outdated, resolved)
kube-runtime/src/controller/mod.rs (outdated, resolved)
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
Signed-off-by: Sanskar Jaiswal <jaiswalsanskar078@gmail.com>
@clux (Member) left a comment

Thanks a lot for this. Did a quick local test on top of a CR controller with more than one managed (.owns() relation) child and did a quick sanity verification of the types of conditions this helps for. These cases worked well:

  • repeat phase transitions (did some quick kubectl edits, kept saving every few seconds, and it did keep the debounce active as long as I stayed below the debounce time)
  • rapid CD-like changes to multiple related objects (did kubectl delete x a && kubectl delete y b, where our reconciler is normally fast enough to sneak in reconciles (and follow-ups) in between each of these, effectively doubling the amount necessary - avoided with a debounce of 1s)

What this is not used for: avoiding repeat self-triggering via reconcilers that have multiple children. If you are patching multiple objects from the reconciler, you can indeed create multiple reconciler requests while your reconciler is running, but these are already naturally deduplicated (because of the one-objectref policy in the reconciler). You will never get rid of the single follow-up request (caused by you changing the world), but that is also not the point. You want that event, because you want to pick up from where you left off and always reconcile the last point in time (also, your object apply might have passed through admission and been changed there).

Anyway, long way to say that I am very happy with this now. No more nits on my end. The only interface change is the addition of a parameter in the lower-level applier interface (which can also be defaulted), and the default is kept at the old default: no debounce.

@clux added the changelog-add label on Aug 5, 2023
@clux added this to the 0.86.0 milestone on Aug 6, 2023
@clux enabled auto-merge (squash) on August 8, 2023 06:42
@clux merged commit 5e98a92 into kube-rs:main on Aug 8, 2023
15 checks passed
Labels
changelog-add
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set a controller debounce time
3 participants