Add jitter to discovery cache TTL #177
Comments
I think there should additionally be
Very good point. I wonder, though, whether setting this might have undesired effects in large clusters: a surge of 3 when we have 5 nodes is not a big deal, but a surge of 3 when there are 1k nodes is relevant, as it will take a long time to get data for the whole cluster. Perhaps we could expose this in the Helm chart, so users can tweak it as needed?
I still struggle to figure out how to make the jitter configurable and what a reasonable approach would be here. The first decision is: do we want to only reduce (-) the TTL, or also increase it (±)? Only reducing will increase the total number of requests sent, but will guarantee that all nodes re-do the discovery within the TTL, which might be important for alerting etc. Randomly increasing or reducing the TTL will keep the number of requests the same on average, but the total time to refresh all caches won't be bounded, so it's harder to reason about. The second decision is: what % of the TTL should the jitter be? If we randomize from 0 (so 100% jitter), we should get a flat distribution of requests, which is IMO good from a scalability point of view, as the load remains stable. If we take a smaller value like 10%, then the load is spread over 10% of the TTL, which still makes the load spiky, though not as much.
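For reference, here is a minimal Go sketch of the two options being weighed; the function names are invented for illustration and are not the nri-kubernetes API:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitterReduceOnly draws the effective TTL uniformly from
// [ttl*(1-fraction), ttl]: every node is guaranteed to refresh no later
// than the configured TTL, at the cost of more requests on average.
func jitterReduceOnly(ttl time.Duration, fraction float64) time.Duration {
	return ttl - time.Duration(rand.Float64()*fraction*float64(ttl))
}

// jitterSymmetric draws the effective TTL uniformly from
// [ttl*(1-fraction), ttl*(1+fraction)]: the average request rate stays
// the same, but there is no hard upper bound tied to the configured TTL.
func jitterSymmetric(ttl time.Duration, fraction float64) time.Duration {
	return ttl + time.Duration((rand.Float64()*2-1)*fraction*float64(ttl))
}

func main() {
	ttl := time.Hour
	fmt.Println(jitterReduceOnly(ttl, 0.1)) // somewhere in [54m, 1h]
	fmt.Println(jitterSymmetric(ttl, 0.1))  // somewhere in [54m, 1h6m]
}
```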
Those are good points, but I do not think they matter much to users. In the end, this cache is not really user-facing, so we don't risk breaking expectations, nor are we forced to provide strong guarantees (as would be the case if this were an API cache). I would be fine with any approach as long as it is documented next to the parameters the user needs to change.
I would prefer
Ideally we should make this decision based on:
Unfortunately we do not have any of this, so I would resort to just guessing. In my extremely subjective view, a default of 20% jitter (+10% and -10%) would be reasonable. It seems small enough not to make users worried, or at least I would not mind that level of uncertainty. Users with bigger clusters might want to play with these two values to their heart's content.
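A hypothetical configuration shape for exposing these two knobs might look like the following; the type and field names here are guesses for illustration, not taken from the repository:

```go
package client // hypothetical package, mirroring src/client

import "time"

// DiscoveryCacheConfig is a guess at a configuration shape; the actual
// names in nri-kubernetes may differ.
type DiscoveryCacheConfig struct {
	// TTL is the base lifetime of cached discovery results.
	TTL time.Duration
	// JitterFraction is the maximum deviation applied symmetrically to
	// TTL. 0.1 yields an effective TTL in [0.9*TTL, 1.1*TTL], i.e. the
	// proposed 20% total jitter (+10% / -10%).
	JitterFraction float64
}
```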
Exactly. So if, as an example, we have 1000 nodes and all of them do a discovery at the same time, we get 1000 operations, then a 1h wait. With jitter from 0, we could spread that to approximately 4 req/s, constantly. With fewer nodes, it would be less. So, shall we:
Recent jitter-related PR on K8s: kubernetes/kubernetes#101652
Interesting, I didn't know about
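For context, the jitter helper that ships with k8s.io/apimachinery, `wait.Jitter`, only ever lengthens the duration, which corresponds to the increase-only variant discussed above. A minimal usage sketch:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	ttl := time.Hour
	// wait.Jitter(d, maxFactor) returns a duration uniformly picked from
	// [d, d+maxFactor*d], i.e. it only ever lengthens the interval.
	fmt.Println(wait.Jitter(ttl, 0.1)) // somewhere in [1h, 1h6m]
}
```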
Part of #177 Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
So API calls can be spread more for large installations, to avoid spikes in requests to the API server. Closes #177 Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
* .github/workflows/push_pr.yaml: ignore "inout" as a typo, as "inout parameter" is a valid expression.
* src/storage: add MemoryStorage implementation, which can be used for testing in other packages.
* src/client/cached_test.go improvements: inline what's possible; use MemoryStorage from the storage package; move helper functions to the bottom to highlight tested scenarios; remove dead test code.
* src/client: add unit tests for the cached client.
* src/client: improve variable naming a bit.
* src/client: move cache expiry logic to a common function.
* src/client/cached.go: simplify Discover() code a bit.
* src/ksm/client: consistently name storage in tests.
* src/ksm/client: remove unrelated test. Cache recovery logic is already tested in src/client, so there is no need to test it again.
* src/ksm/client: use memory storage when possible in tests.
* src/ksm/client: move common cacher parameters to a separate struct, so it is easier to extend.
* src/kubelet/client: use memory storage for testing when possible.
* src/client: move common fields to a separate struct. This makes it easier to add new fields to the client configuration.
* src/kubernetes.go: improve variable names.
* src/apiserver: unify variable naming a bit.
* src/apiserver: accept a storage interface rather than a cache dir. This way, memory storage can be used in tests and dependencies are properly passed in rather than built internally. It also aligns the cache client more with the src/client package.
* src/apiserver: make use of client.DiscoveryCacherConfig, so there is a single place where cache-specific parameters are defined.
* src/apiserver: unify expiry time calculation with src/client.
* src/client: export Expired() function and prepare it for external use, so we can use it also from src/apiserver.
* src/apiserver: use client.Expired() to avoid duplicating the logic.
* src/client: add jitter support. Part of #177.
* src/apiserver: add TTL jitter support. Part of #177.
* src/kubernetes.go: add TTL jitter support, so API calls can be spread more for large installations, to avoid spikes in requests to the API server. Closes #177.

Signed-off-by: Mateusz Gozdek <mgozdek@microsoft.com>
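Judging from the commit list above ("export Expired() function", "add jitter support"), the expiry check presumably ends up shaped roughly like the sketch below; the signature and behavior here are guesses, not the merged code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Expired reports whether a cache entry created at the given time has
// outlived its TTL after applying a symmetric random jitter. This is an
// illustrative reconstruction of the idea, not the code merged for #177.
func Expired(creation time.Time, ttl time.Duration, jitterFraction float64) bool {
	jitter := time.Duration((rand.Float64()*2 - 1) * jitterFraction * float64(ttl))
	return time.Since(creation) > ttl+jitter
}

func main() {
	created := time.Now().Add(-55 * time.Minute)
	// With a 1h TTL and 10% jitter the effective TTL lands in [54m, 66m],
	// so a 55-minute-old entry may or may not be considered expired.
	fmt.Println(Expired(created, time.Hour, 0.1))
}
```

In a real implementation the jittered deadline would more likely be computed once when the entry is written, so repeated checks near the boundary don't flip-flop between expired and not expired.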
Is your feature request related to a problem? Please describe.
Right now, nri-kubernetes caches the output of the discovery operations for a certain period of time. However, this TTL is fixed for all instances of the DaemonSet. For clusters with a significant number of nodes, this causes caches to deterministically expire at the same time, which creates bursts of load in the control plane.
Feature Description
We'd like to add some randomness to the expiration of these caches, so it is unlikely that all requests happen at the same time.
As a nice-to-have, it would be useful to be able to configure: