Add jitter to discovery cache TTL #177

Closed
roobre opened this issue Jul 26, 2021 · 7 comments · Fixed by #185
roobre (Contributor) commented Jul 26, 2021

Is your feature request related to a problem? Please describe.

Right now, nri-kubernetes caches the output of the discovery operations for a certain period of time. However, this TTL is fixed for all instances of the DaemonSet. For clusters with a significant number of nodes, this causes caches to deterministically expire at the same time, which creates bursts of load on the control plane.

Feature Description

We'd like to add some randomness to the expiration of these caches, so that it is unlikely that all requests happen at the same time.

As a nice-to-have, it would also be useful to be able to configure (see the sketch after this list):

  • The TTL itself
  • The maximum random deviation from the target TTL
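
For illustration, here is a minimal sketch of what such a jittered expiry could look like, assuming a base TTL and a maximum deviation are exposed as settings. The names baseTTL and maxJitter are hypothetical, not taken from nri-kubernetes:

```go
// Hypothetical sketch, not the integration's actual API.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredTTL shifts baseTTL by a random amount in [-maxJitter, +maxJitter],
// so caches created at the same time do not all expire together.
func jitteredTTL(baseTTL, maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return baseTTL
	}
	// rand.Int63n(2n+1) - n is uniform over [-n, +n].
	offset := time.Duration(rand.Int63n(2*int64(maxJitter)+1) - int64(maxJitter))
	return baseTTL + offset
}

func main() {
	// e.g. anywhere between 54m and 1h6m for a 1h TTL with a ±6m deviation.
	fmt.Println(jitteredTTL(time.Hour, 6*time.Minute))
}
```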
invidian added the feature request label on Jul 26, 2021
invidian (Contributor) commented

I think maxSurge should additionally be set on the DaemonSet upgrade strategy, because the cache is not persistent, so on restart/update all instances will be querying at the same time anyway.

roobre (Contributor, Author) commented Jul 27, 2021

Very good point. I wonder, though, if setting this might have undesired effects in large clusters: a surge of 3 when we have 5 nodes is not a big deal, but a surge of 3 when there are 1k nodes matters, as it will take a long time to get data for the whole cluster.

Perhaps we could expose this in the helm chart, so users can tweak it as needed?

invidian (Contributor) commented Aug 6, 2021

I still struggle to figure out how to make the jitter configurable and what a reasonable approach would be here.

The first decision is: do we want to only reduce (-) the TTL, or also increase (±) it? Only reducing will increase the total number of requests sent, but it guarantees that all nodes re-do the discovery no later than the configured TTL, which might be important for alerting etc. Increasing or reducing the TTL randomly keeps the number of requests the same on average, but the total time to refresh all caches won't be fixed, so it's harder to determine.

The second decision is what % of the TTL the jitter should be. If we randomize from 0 (so a 100% jitter), we should get a flat distribution of requests, which is IMO good from a scalability point of view, as the load remains stable. If we take some smaller value like 10%, then the load is spread over only 10% of the TTL, which still makes the load spiky, though not as much.
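
To make the two options concrete, here is a rough sketch of both variants, assuming the jitter is expressed as a fraction of the TTL; the function and parameter names are illustrative only, not project code:

```go
// Illustrative only; not the actual nri-kubernetes code.
package jitter

import (
	"math/rand"
	"time"
)

// reduceOnly shortens the TTL by up to jitterFraction of its value (e.g. 0.1
// for 10%), so every node refreshes no later than the configured TTL.
func reduceOnly(ttl time.Duration, jitterFraction float64) time.Duration {
	return ttl - time.Duration(rand.Float64()*jitterFraction*float64(ttl))
}

// symmetric shifts the TTL by up to ±jitterFraction of its value, keeping the
// average refresh interval equal to the configured TTL.
func symmetric(ttl time.Duration, jitterFraction float64) time.Duration {
	return ttl + time.Duration((rand.Float64()*2-1)*jitterFraction*float64(ttl))
}
```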

roobre (Contributor, Author) commented Aug 6, 2021

Those are good points, but I do not think they matter a lot to users. In the end, this cache is not really user-facing, so we don't risk breaking expectations and aren't forced to provide strong guarantees (as would be the case if this were an API cache). I would be fine with any approach as long as it is documented next to the parameters the user needs to change.

do we want to only reduce (-) the TTL, or also increase (±) it?

I would prefer ±, because intuitively it keeps the real TTL closer to the target value. Additionally, as t → ∞, the average TTL would be exactly the configured value. If we did it one-sided, that would not be the case.

Second decision is, what % of the TTL the jitter should be

Ideally we should make this decision based on:

  • The time that this request takes
  • The number of nodes that would be making this request
  • The target maximum number of requests we want to be in flight at the same time
  • Some fancy stats shenanigans and a target likelihood of that not happening

Unfortunately we do not have any of this, so I would resort to just guessing. In my extremely subjective view, a default jitter of 20% (+10% and -10%) would be reasonable. It seems small enough not to worry users, or at least I would not mind this level of uncertainty. Users with bigger clusters might want to tune these two values to their heart's content.
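
A quick way to sanity-check the "average converges to the configured TTL" argument above is to sample the symmetric variant many times; the symmetric helper here is the hypothetical one from the earlier sketch, not project code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// symmetric is the hypothetical ±jitter helper sketched earlier in the thread.
func symmetric(ttl time.Duration, jitterFraction float64) time.Duration {
	return ttl + time.Duration((rand.Float64()*2-1)*jitterFraction*float64(ttl))
}

func main() {
	const samples = 100_000
	var total time.Duration
	for i := 0; i < samples; i++ {
		// ±10% jitter on a 1h TTL, i.e. the proposed 20% total spread.
		total += symmetric(time.Hour, 0.10)
	}
	// The mean approaches 1h0m0s as the sample count grows.
	fmt.Println(total / samples)
}
```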

invidian (Contributor) commented Aug 6, 2021

The time that this request takes

Given the default request frequency of one per hour per node, I think the request time can be ignored here.

The target maximum number of requests we want to be in flight at the same time

Exactly. So if, as an example, we have 1000 nodes and all of them do a discovery at the same time, we get 1000 operations, then a 1h wait. With jitter from 0, we could spread that out to roughly one request every ~4 seconds, constantly. With fewer nodes it would be even less.

So, shall we (a possible shape is sketched after this list):

  • call the feature flag TTLJitter
  • make it 0 by default? Or 20?
  • make it unsigned
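
One possible shape for that flag, purely as a sketch: the field name TTLJitter and its interpretation as a percentage of the TTL spread symmetrically around it are assumptions here, not the final implementation:

```go
package jitter

import (
	"math/rand"
	"time"
)

// Config is a hypothetical cache configuration carrying the proposed setting.
type Config struct {
	TTL       time.Duration // base discovery cache TTL, e.g. 1h
	TTLJitter uint          // jitter as a percentage of TTL; 0 disables it
}

// EffectiveTTL interprets TTLJitter=20 as a ±10% spread around TTL, one of the
// readings discussed above.
func EffectiveTTL(c Config) time.Duration {
	if c.TTLJitter == 0 {
		return c.TTL
	}
	half := float64(c.TTLJitter) / 100 / 2
	return c.TTL + time.Duration((rand.Float64()*2-1)*half*float64(c.TTL))
}
```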

invidian (Contributor) commented Aug 6, 2021

Recent jitter-related PR on K8s: kubernetes/kubernetes#101652

roobre (Contributor, Author) commented Aug 16, 2021

Interesting, I didn't know about apimachinery/wait. It seems a bit tailored to time.Duration though; I'm not sure it would be more readable than what we do now, after all the weird time arithmetic.
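
For comparison, a minimal usage sketch of wait.Jitter from k8s.io/apimachinery, which only adds positive jitter (it returns a duration between d and d + maxFactor*d):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ttl := time.Hour
	// With a maxFactor of 0.2, the jittered TTL lands between 1h and 1h12m.
	fmt.Println(wait.Jitter(ttl, 0.2))
}
```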

invidian added a commit that referenced this issue Aug 23, 2021
Part of #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
invidian added a commit that referenced this issue Aug 23, 2021
So API calls can be spread out more for large installations, to avoid spikes
in requests to the API server.

Closes #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
invidian added a commit that referenced this issue Sep 13, 2021
* .github/workflows/push_pr.yaml: ignore inout as typo

As "inout parameter" is a valid expression.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/storage: add MemoryStorage implementation

Which can be used for testing in other packages.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client/cached_test.go: improvements

* Inline what's possible.
* Use MemoryStorage from storage package.
* Move helper functions to the bottom to highlight tested scenarios.
* Remove dead test code.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: add unit tests for cached client

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: improve variable naming a bit

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: move cache expiry logic to common function

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client/cached.go: simplify Discover() code a bit

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: consistently name storage in tests

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: remove unrelated test

Cache recovery logic is already tested in src/client, so there is no
need to test it more.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: use memory storage when possible in tests

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/ksm/client: move common cacher parameters to separate struct

So it is easier to extend.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/kubelet/client: use memory storage for testing when possible

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: move common fields to separate struct

This makes it easier to add new fields to client configuration.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/kubernetes.go: improve variable names

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: make variable naming a bit more uniform

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: accept storage interface rather than cache dir

This way, memory storage can be used in tests and we properly pass the
dependencies, rather than building them ourselves.

That also makes the cache client more uniform with the src/client package.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: make use of client.DiscoveryCacherConfig

So there is a single place where cache-specific parameters are defined.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: unify expiry time calculation with src/client

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: export Expired() function and prepare it for external use

So we can also use it from src/apiserver.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: use client.Expired() to avoid duplicating the logic

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/client: add jitter support

Part of #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/apiserver: add TTL jitter support

Part of #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>

* src/kubernetes.go: add TTL jitter support

So API calls can be spread out more for large installations, to avoid spikes
in requests to the API server.

Closes #177

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>