Add jitter to discovery cache TTL #177

Closed
roobre opened this issue Jul 26, 2021 · 7 comments · Fixed by #185
roobre (Contributor) commented Jul 26, 2021

Is your feature request related to a problem? Please describe.

Right now, nri-kubernetes caches the output of the discovery operations for a certain period of time. However, this TTL is fixed for all instances of the DaemonSet. For clusters with a significant number of nodes, this causes caches to deterministically expire at the same time, which creates bursts of load on the control plane.

Feature Description

We'd like to add some randomness to the expiration of these caches, so that it is unlikely that all requests happen at the same time.

As a nice-to-have, it would also be useful to be able to configure (see the sketch after this list):

  • The TTL itself
  • The maximum random deviation from the target TTL
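
For illustration, here is a minimal sketch of what such a jittered expiry could look like, assuming a base TTL and a maximum deviation are exposed as settings. The names baseTTL and maxJitter are hypothetical, not taken from nri-kubernetes:

```go
// Hypothetical sketch, not the integration's actual API.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredTTL shifts baseTTL by a random amount in [-maxJitter, +maxJitter],
// so caches created at the same time do not all expire together.
func jitteredTTL(baseTTL, maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return baseTTL
	}
	// rand.Int63n(2n+1) - n is uniform over [-n, +n].
	offset := time.Duration(rand.Int63n(2*int64(maxJitter)+1) - int64(maxJitter))
	return baseTTL + offset
}

func main() {
	// e.g. anywhere between 54m and 1h6m for a 1h TTL with a ±6m deviation.
	fmt.Println(jitteredTTL(time.Hour, 6*time.Minute))
}
```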
invidian added the feature request label on Jul 26, 2021
invidian (Contributor) commented

I think maxSurge should additionally be set on the DaemonSet upgrade strategy, because the cache is not persistent, so on restart/update all instances will be querying at the same time anyway.

roobre (Contributor, Author) commented Jul 27, 2021

Very good point. I wonder, though, if setting this might have undesired effects in large clusters: a surge of 3 when we have 5 nodes is not a big deal, but a surge of 3 when there are 1k nodes matters, as it will take a long time to get data for the whole cluster.

Perhaps we could expose this in the helm chart, so users can tweak it as needed?

invidian (Contributor) commented Aug 6, 2021

I still struggle to figure out how to make the jitter configurable and what a reasonable approach would be here.

The first decision is: do we want to only reduce (-) the TTL, or also increase (±) it? Only reducing will increase the total number of requests sent, but it guarantees that all nodes re-do the discovery no later than the configured TTL, which might be important for alerting etc. Increasing or reducing the TTL randomly keeps the number of requests the same on average, but the total time to refresh all caches won't be fixed, so it's harder to determine.

The second decision is what % of the TTL the jitter should be. If we randomize from 0 (so a 100% jitter), we should get a flat distribution of requests, which is IMO good from a scalability point of view, as the load remains stable. If we take some smaller value like 10%, then the load is spread over only 10% of the TTL, which still makes the load spiky, though not as much.
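
To make the two options concrete, here is a rough sketch of both variants, assuming the jitter is expressed as a fraction of the TTL; the function and parameter names are illustrative only, not project code:

```go
// Illustrative only; not the actual nri-kubernetes code.
package jitter

import (
	"math/rand"
	"time"
)

// reduceOnly shortens the TTL by up to jitterFraction of its value (e.g. 0.1
// for 10%), so every node refreshes no later than the configured TTL.
func reduceOnly(ttl time.Duration, jitterFraction float64) time.Duration {
	return ttl - time.Duration(rand.Float64()*jitterFraction*float64(ttl))
}

// symmetric shifts the TTL by up to ±jitterFraction of its value, keeping the
// average refresh interval equal to the configured TTL.
func symmetric(ttl time.Duration, jitterFraction float64) time.Duration {
	return ttl + time.Duration((rand.Float64()*2-1)*jitterFraction*float64(ttl))
}
```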

roobre (Contributor, Author) commented Aug 6, 2021

Those are good points, but I do not think they matter a lot to users. In the end, this cache is not really user-facing, so we don't risk breaking expectations and aren't forced to provide strong guarantees (as would be the case if this were an API cache). I would be fine with any approach as long as it is documented next to the parameters the user needs to change.

do we want to only reduce (-) the TTL, or also increase (±) it?

I would prefer ±, because intuitively it keeps the real TTL closer to the target value. Additionally, as t → ∞, the average TTL would be exactly the configured value. If we did it one-sided, that would not be the case.

Second decision is, what % of the TTL the jitter should be

Ideally we should make this decision based on:

  • The time that this request takes
  • The number of nodes that would be making this request
  • The target maximum number of requests we want to be in flight at the same time
  • Some fancy stats shenanigans and a target likelihood of that not happening

Unfortunately we do not have any of this, so I would resort to just guessing. In my extremely subjective view, a default jitter of 20% (+10% and -10%) would be reasonable. It seems small enough not to worry users, or at least I would not mind this level of uncertainty. Users with bigger clusters might want to tune these two values to their heart's content.
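
A quick way to sanity-check the "average converges to the configured TTL" argument above is to sample the symmetric variant many times; the symmetric helper here is the hypothetical one from the earlier sketch, not project code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// symmetric is the hypothetical ±jitter helper sketched earlier in the thread.
func symmetric(ttl time.Duration, jitterFraction float64) time.Duration {
	return ttl + time.Duration((rand.Float64()*2-1)*jitterFraction*float64(ttl))
}

func main() {
	const samples = 100_000
	var total time.Duration
	for i := 0; i < samples; i++ {
		// ±10% jitter on a 1h TTL, i.e. the proposed 20% total spread.
		total += symmetric(time.Hour, 0.10)
	}
	// The mean approaches 1h0m0s as the sample count grows.
	fmt.Println(total / samples)
}
```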

invidian (Contributor) commented Aug 6, 2021

The time that this request takes

Given the default request frequency of one per hour per node, I think the request time can be ignored here.

The target maximum number of requests we want to be in flight at the same time

Exactly. So if, as an example, we have 1000 nodes and all of them do a discovery at the same time, we get 1000 operations, then a 1h wait. With jitter from 0, we could spread that out to roughly one request every ~4 seconds, constantly. With fewer nodes it would be even less.

So, shall we (a possible shape is sketched after this list):

  • call the feature flag TTLJitter
  • make it 0 by default? Or 20?
  • make it unsigned
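
One possible shape for that flag, purely as a sketch: the field name TTLJitter and its interpretation as a percentage of the TTL spread symmetrically around it are assumptions here, not the final implementation:

```go
package jitter

import (
	"math/rand"
	"time"
)

// Config is a hypothetical cache configuration carrying the proposed setting.
type Config struct {
	TTL       time.Duration // base discovery cache TTL, e.g. 1h
	TTLJitter uint          // jitter as a percentage of TTL; 0 disables it
}

// EffectiveTTL interprets TTLJitter=20 as a ±10% spread around TTL, one of the
// readings discussed above.
func EffectiveTTL(c Config) time.Duration {
	if c.TTLJitter == 0 {
		return c.TTL
	}
	half := float64(c.TTLJitter) / 100 / 2
	return c.TTL + time.Duration((rand.Float64()*2-1)*half*float64(c.TTL))
}
```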

invidian (Contributor) commented Aug 6, 2021

Recent jitter-related PR on K8s: kubernetes/kubernetes#101652

roobre (Contributor, Author) commented Aug 16, 2021

Interesting, I didn't know about apimachinery/wait. It seems a bit tailored to time.Duration though; I'm not sure it would be more readable than what we do now, after all the weird time arithmetic.
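
For comparison, a minimal usage sketch of wait.Jitter from k8s.io/apimachinery, which only adds positive jitter (it returns a duration between d and d + maxFactor*d):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ttl := time.Hour
	// With a maxFactor of 0.2, the jittered TTL lands between 1h and 1h12m.
	fmt.Println(wait.Jitter(ttl, 0.2))
}
```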

invidian added a commit that referenced this issue Aug 23, 2021
Part of #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
invidian added a commit that referenced this issue Aug 23, 2021
So API calls can be spread out more for large installations, to avoid spikes
in requests to the API server.

Closes #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
invidian added a commit that referenced this issue Sep 13, 2021
* .github/workflows/push_pr.yaml: ignore inout as typo

As "inout parameter" is a valid expression.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/storage: add MemoryStorage implementation

Which can be used for testing in other packages.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client/cached_test.go: improvements

* Inline what's possible.
* Use MemoryStorage from storage package.
* Move helper functions to the bottom to highlight tested scenarios.
* Remove dead test code.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: add unit tests for cached client

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: improve variable naming a bit

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: move cache expiry logic to common function

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client/cached.go: simplify Discover() code a bit

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: consistently name storage in tests

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: remove unrelated test

Cache recovery logic is already tested in src/client, so there is no
need to test it more.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: use memory storage when possible in tests

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: move common cacher parameters to separate struct

So it is easier to extend.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/kubelet/client: use memory storage for testing when possible

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: move common fields to separate struct

This makes it easier to add new fields to client configuration.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/kubernetes.go: improve variable names

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: make variable naming a bit more uniform

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: accept storage interface rather than cache dir

This way, memory storage can be used in tests and we properly pass the
dependencies, rather than building them ourselves.

That also makes the cache client more uniform with the src/client package.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: make use of client.DiscoveryCacherConfig

So there is a single place where cache-specific parameters are defined.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: unify expiry time calculation with src/client

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: export Expired() function and prepare it for external use

So we can also use it from src/apiserver.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: use client.Expired() to avoid duplicating the logic

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: add jitter support

Part of #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: add TTL jitter support

Part of #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/kubernetes.go: add TTL jitter support

So API calls can be spread out more for large installations, to avoid spikes
in requests to the API server.

Closes #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>