
Proposal: Only attempt to send metrics after joining a sufficiently large cluster #201

Open
alecrajeev opened this issue Mar 28, 2024 · 7 comments
Labels: proposal (A proposal for new functionality.)


@alecrajeev

Background

In clusters with several agent pods and many potential targets, it can take a few minutes on startup for all of the pods to join and become healthy. I noticed that when a pod that is intended to be part of a larger cluster starts up, it takes a few minutes to join; during that window it briefly attempts to scrape a large number of targets and runs into OOM errors.


This is more apparent in a StatefulSet with the OrderedReady pod management policy, where the first pod initially sees all potential targets; as additional pods are spun up, the scrape targets get distributed evenly among them. We can know ahead of time that a cluster will have at least, say, 5 pods.

The workaround I have found is to launch the pods with a config that has several discovery components to find all the potential targets, but no scrape component. Once all 5 pods are running and part of the cluster, I update the config to add the scrape component.
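For illustration, a minimal sketch of the two phases (assuming Kubernetes pod discovery and a thanos-receive remote_write endpoint; the component labels and URL here are illustrative, not from the actual setup):

```river
// Phase 1: the config shipped at startup. It only discovers targets;
// nothing is scraped yet, so early pods stay lightweight while the
// rest of the cluster comes up.
discovery.kubernetes "pods" {
  role = "pod"
}

// Phase 2: once all 5 pods are running and clustered, the config is
// updated to add the scrape component (and its remote_write).
prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://thanos-receive.example.svc:19291/api/v1/receive"
  }
}
```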

Also, when the StatefulSet is changed from OrderedReady to Parallel, all the pods launch at the same time. In that case there is an initial burst of 409 conflict errors from thanos-receive because the pods have not yet joined the cluster; once they have joined, the errors go away.

Proposal

I propose a new flag for grafana-agent run that specifies the minimum number of agents that must be in the cluster before any metrics are sent:

--cluster.min-cluster-size # minimum number of agents in the cluster before metrics will be sent

This will delay sending metrics on startup, because it takes a few minutes for a large cluster to become healthy and have all of its agents register themselves. However, this opt-in feature can help with stability: it can prevent the first grafana-agent pod in a StatefulSet from running into an OOM error and dramatically cut down the number of 409 conflict errors from thanos-receive.

alecrajeev added the proposal label Mar 28, 2024
@rfratto (Member) commented Mar 28, 2024

Thanks for the proposal! What you propose here broadly makes sense.

IIRC, Loki/Mimir have (had?) a flag to specify a wait time before participating in workload distribution.

I'm worried about a scenario where the min cluster size will never be reached; in that case it might make sense to have a second flag to control wait time:

--cluster.min-cluster-size        # Minimum cluster size before participating in workload distribution 
--cluster.participation-wait-time # Minimum time to wait before participating in workload distribution  

Here, the idea is that you could set neither, one, or both flags to get different effects. If both are set, whichever condition is met first activates distribution for that node, so --cluster.min-cluster-size=5 --cluster.participation-wait-time=5m will wait either for 5 minutes to pass or for the cluster to reach 5 members, whichever happens first.

rfratto transferred this issue from grafana/agent Apr 11, 2024
@alecrajeev (Author)

I really like your idea. I hadn't thought about the case when there are not enough members in the cluster, so I think adding those two flags would be a good improvement. Also thanks for transferring this to the new alloy repo.

@jammiemil

It would be nice to be able to do the same thing for prometheus.receive_http. We have a situation now where edge Alloy instances use a central cluster of Alloy instances as a metrics pipeline: prometheus.remote_write on the edge sends to prometheus.receive_http in the cluster. Whenever the cluster has to restart, the first pod to come back up gets annihilated by remote_write requests the second the API comes up. Having some sort of mechanism to only bring that service up once the cluster has established quorum would be handy.
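For context, a minimal sketch of that topology (hostnames, ports, and component labels are illustrative):

```river
// Edge instance: forward all collected metrics to the central cluster.
prometheus.remote_write "central" {
  endpoint {
    url = "http://central-alloy.example.com:9999/api/v1/metrics/write"
  }
}

// Central cluster instance: accept remote_write traffic over HTTP and
// forward it on. This listener is what gets flooded when the first pod
// of a restarted cluster comes back before its peers.
prometheus.receive_http "api" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 9999
  }
  forward_to = [prometheus.remote_write.backend.receiver]
}

prometheus.remote_write "backend" {
  endpoint {
    url = "http://metrics-backend.example.com/api/v1/push"
  }
}
```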

rfratto moved this to Incoming in Alloy proposals Jun 14, 2024
@GroovyCarrot commented Jun 21, 2024

This causes us quite a lot of grief: agents crash-loop and never actually manage to form a cluster, because on startup they try to scrape thousands of targets rather than joining the cluster, waiting for the shard allocation, and only then starting to scrape. Even just something like --delay-initial-scrape=60s would suffice.

edit:
It also seems reasonable to try to hard-cap the amount of memory consumed at any time to prevent OOMs. That would also give you a metric signal saying that scrape intervals cannot be honoured and more replicas need to be added.

@thampiotr (Contributor)

@GroovyCarrot regarding your second point, there's GOMEMLIMIT support from v1.1.0.

As a general question about this proposal: what should happen while workload distribution is not yet allowed due to these newly introduced conditions? Should all components be disabled and not running, or should only the components that support clustering be waiting / not doing any work?

@rfratto (Member) commented Jul 2, 2024

It seems that the most important thing is the observable behaviour here: metrics shouldn't be collected until the cluster is "ready" (minimum cluster size or timeout waiting for a minimum cluster size). That makes me feel like it's OK if these components are technically running as long as they're not doing any work.

The easiest way for us to implement that would be to make Lookup fail or return no nodes if the cluster isn't "ready," since Lookup is used for work distribution. If we did this, we'd have to update component logic that uses Lookup, since most calls to Lookup treat errors or zero owners as meaning the work is self-owned.

Earlier today in a call I suggested that we may be able to use ckit's concept of "viewer" nodes (read-only nodes that participate in gossip but not in workload distribution) to implement this behaviour, where a node only transitions to "participant" once the cluster is ready. However, since gossip is eventually consistent, there's still a chance one node can briefly see itself as the only participant and over-assign work to itself. We might be able to address this edge case with ckit-level support for cluster readiness.

It seems like changing the behaviour of Lookup would be the easiest and least error-prone way of implementing this. Disabling components while the cluster isn't ready should work too, but is probably going to be harder to implement.

@GroovyCarrot commented Jul 16, 2024

@thampiotr we had set GOMEMLIMIT, but because Alloy uses informers to load the cluster state, there's a certain amount of base memory that all of the agents need, and it increases as the cluster grows. We found that setting GOMEMLIMIT does help a lot, but ultimately we just had to give the agents more base memory.

We do still sometimes have the issue of clusters that are overcommitted on the number of targets they have to scrape. However, if we scale the deployment down to 0 and then back up to more replicas than necessary, they tend to all start up simultaneously and form a cluster before they get overwhelmed, and the autoscaling can then bring the replica count back down gradually.

As I mentioned, it would be ideal if, when an instance is unable to meet the cluster's demands with its resource allocation, there were some sort of signal that could be scaled on, rather than having to deal with the OOMs and crash-looping.
