
Proposal: Only attempt to send metrics after joining a sufficiently large cluster #201

Open
alecrajeev opened this issue Mar 28, 2024 · 7 comments
Labels: proposal (A proposal for new functionality.)


@alecrajeev

Background

In clusters with several agent pods and many potential targets, it can take a few minutes on startup for all of the pods to join and become healthy. I noticed that when a pod that is intended to be part of a larger cluster starts up, it takes a few minutes to join; during that window it briefly attempts to scrape a large number of targets and runs into OOM errors.


This is more apparent in a StatefulSet with the OrderedReady pod management policy, where the first pod initially sees all potential targets; as additional pods are spun up, the scrape targets get distributed evenly among them. We can know ahead of time that a cluster will have at least, say, 5 pods.

The workaround I have found is to launch the pods with a config that has several discovery components to find all the potential targets, but no scrape component. Once all 5 pods are running and part of the cluster, I update the config to add the scrape component.
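For illustration, a minimal sketch of the two phases (assuming Kubernetes pod discovery and a thanos-receive remote_write endpoint; the component labels and URL here are illustrative, not from the actual setup):

```river
// Phase 1: the config shipped at startup. It only discovers targets;
// nothing is scraped yet, so early pods stay lightweight while the
// rest of the cluster comes up.
discovery.kubernetes "pods" {
  role = "pod"
}

// Phase 2: once all 5 pods are running and clustered, the config is
// updated to add the scrape component (and its remote_write).
prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://thanos-receive.example.svc:19291/api/v1/receive"
  }
}
```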

Also, when the StatefulSet is changed from OrderedReady to Parallel, all the pods launch at the same time. In that case there is an initial burst of 409 conflict errors from thanos-receive because the pods have not yet joined the cluster; once they have joined, the errors go away.

Proposal

I propose a new flag for grafana-agent run that specifies the minimum number of agents that must be in the cluster before any metrics are sent:

--cluster.min-cluster-size # minimum number of agents in the cluster before metrics will be sent

This will delay sending metrics on startup, because it takes a few minutes for a large cluster to become healthy and have all of its agents register themselves. However, this opt-in feature can help with stability: it can prevent the first grafana-agent pod in a StatefulSet from running into an OOM error and dramatically cut down the number of 409 conflict errors from thanos-receive.

alecrajeev added the proposal label Mar 28, 2024
@rfratto (Member) commented Mar 28, 2024

Thanks for the proposal! What you propose here broadly makes sense.

IIRC, Loki/Mimir have (had?) a flag to specify a wait time before participating in workload distribution.

I'm worried about a scenario where the min cluster size will never be reached; in that case it might make sense to have a second flag to control wait time:

--cluster.min-cluster-size        # Minimum cluster size before participating in workload distribution 
--cluster.participation-wait-time # Minimum time to wait before participating in workload distribution  

Here, the idea is that you could set neither, one, or both flags to get different effects. If both are set, whichever condition is met first activates distribution for that node, so --cluster.min-cluster-size=5 --cluster.participation-wait-time=5m will wait either for 5 minutes to pass or for the cluster to reach 5 members, whichever happens first.

rfratto transferred this issue from grafana/agent Apr 11, 2024
@alecrajeev (Author)

I really like your idea. I hadn't thought about the case when there are not enough members in the cluster, so I think adding those two flags would be a good improvement. Also thanks for transferring this to the new alloy repo.

@jammiemil

It would be nice to be able to do the same thing for prometheus.receive_http. We have a situation now where edge Alloy instances use a central cluster of Alloy instances as a metrics pipeline: prometheus.remote_write on the edge sends to prometheus.receive_http in the cluster. Whenever the cluster has to restart, the first pod to come back up gets annihilated by remote_write requests the second the API comes up. Having some sort of mechanism to only bring that service up once the cluster has established quorum would be handy.
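For context, a minimal sketch of that topology (hostnames, ports, and component labels are illustrative):

```river
// Edge instance: forward all collected metrics to the central cluster.
prometheus.remote_write "central" {
  endpoint {
    url = "http://central-alloy.example.com:9999/api/v1/metrics/write"
  }
}

// Central cluster instance: accept remote_write traffic over HTTP and
// forward it on. This listener is what gets flooded when the first pod
// of a restarted cluster comes back before its peers.
prometheus.receive_http "api" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 9999
  }
  forward_to = [prometheus.remote_write.backend.receiver]
}

prometheus.remote_write "backend" {
  endpoint {
    url = "http://metrics-backend.example.com/api/v1/push"
  }
}
```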

rfratto moved this to Incoming in Alloy proposals Jun 14, 2024
@GroovyCarrot commented Jun 21, 2024

This causes us quite a lot of grief: agents crash-loop and never actually manage to form a cluster, because on startup they try to scrape thousands of targets rather than joining the cluster, waiting for the shard allocation, and only then starting to scrape. Even just something like --delay-initial-scrape=60s would suffice.

edit:
It also seems reasonable to try to hard-cap the amount of memory consumed at any time to prevent OOMs. That would also give you a metric signal saying that scrape intervals cannot be honoured and more replicas need to be added.

@thampiotr (Contributor)

@GroovyCarrot regarding your second point, there's GOMEMLIMIT support from v1.1.0.

As a general question about this proposal: what should happen while workload distribution is not yet allowed due to these newly introduced conditions? Should all components be disabled and not running, or should only the components that support clustering be waiting / not doing any work?

@rfratto (Member) commented Jul 2, 2024

It seems that the most important thing is the observable behaviour here: metrics shouldn't be collected until the cluster is "ready" (minimum cluster size or timeout waiting for a minimum cluster size). That makes me feel like it's OK if these components are technically running as long as they're not doing any work.

The easiest way for us to implement that would be to make Lookup fail or return no nodes if the cluster isn't "ready," since Lookup is used for work distribution. If we did this, we'd have to update component logic that uses Lookup, since most calls to Lookup treat errors or zero owners as meaning the work is self-owned.

Earlier today in a call I suggested that we may be able to use ckit's concept of "viewer" nodes (read-only nodes that participate in gossip but not in workload distribution) to implement this behaviour, where a node only transitions to "participant" once the cluster is ready. However, since gossip is eventually consistent, there's still a chance one node can briefly see itself as the only participant and over-assign work to itself. We might be able to address this edge case with ckit-level support for cluster readiness.

It seems like changing the behaviour of Lookup would be the easiest and least error-prone way of implementing this. Disabling components while the cluster isn't ready should work too, but is probably going to be harder to implement.

@GroovyCarrot commented Jul 16, 2024

@thampiotr we had set GOMEMLIMIT, but because Alloy uses informers to load the cluster state, there's a certain amount of base memory that all of the agents need, and it increases as the cluster grows. We found that setting GOMEMLIMIT does help a lot, but ultimately we just had to give the agents more base memory.

We do still sometimes have the issue of clusters that are overcommitted on the number of targets they have to scrape. However, if we scale the deployment down to 0 and then back up to more replicas than necessary, they tend to all start up simultaneously and form a cluster before they get overwhelmed, and the autoscaling can then bring the replica count back down gradually.

As I mentioned, it would be ideal if, when an instance is unable to meet the cluster's demands with its resource allocation, there were some sort of signal that could be scaled on, rather than having to deal with the OOMs and crash-looping.
