
[Clustering] Filter the label "instance" from the hash computation #6792

Closed

Conversation

wildum
Contributor

@wildum wildum commented Mar 28, 2024

PR Description

In clustering mode, the targets exposed by exporters are distributed amongst the agents.
For an agent to know whether it should scrape the target or not, it will compute a hash based on the labels of the target.

We noticed that the agent adds to the targets an "instance" label which, by default, corresponds to the host where the agent runs. This is a problem because agents that are not running on the same host will have different values for the "instance" label, so they will compute different hashes and the targets won't be distributed correctly.
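For illustration only, here is a minimal Go sketch (not the agent's actual clustering code) of how hashing a target's labels makes the "instance" label matter: two agents that see the same logical target but inject their own hostname as "instance" end up with different hashes and disagree on ownership.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashTarget sketches how a target's labels could be folded into a single
// hash: sort the label names, then feed every name/value pair into an FNV
// hash. This is an illustration, not the agent's actual implementation.
func hashTarget(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator to avoid ambiguous concatenations
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	// Same logical target, but the "instance" label carries each agent's own
	// hostname, so the two agents compute different hashes for it.
	onAgentA := map[string]string{"job": "integrations/mysql", "instance": "agent-a"}
	onAgentB := map[string]string{"job": "integrations/mysql", "instance": "agent-b"}
	fmt.Println(hashTarget(onAgentA) == hashTarget(onAgentB)) // false
}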

Notes to the Reviewer

Is there a scenario where a target can only be uniquely identified by the instance label?
If that's the case, then the agent would compute the same hash for different targets, which would result in an unbalanced cluster.

PR Checklist

  • CHANGELOG.md updated
  • Tests updated

@wildum wildum marked this pull request as draft March 28, 2024 17:06
@wildum
Contributor Author

wildum commented Mar 28, 2024

Converting to draft for now because we think it might be a risky change and we don't want it in 1.0

@tpaschalis
Member

tpaschalis commented Apr 1, 2024

I'm not convinced this is the right solution; it feels too arbitrary. One of clustering's main assumptions is that all nodes run with the same configuration files and that the upstream APIs (e.g. service discovery) work in the same way, so that the fully-local consistent hashing works properly. The prometheus.exporter.* components may set a custom instance key related to the system they're running on, but this value can be overridden.

IMHO this could be part of the documentation for running clustering together with exporter components. Nice find nonetheless!

@wildum
Contributor Author

wildum commented Apr 2, 2024

In this example the agents have the same config files; they just run on different hosts. Do you think that we should constrain the agents to run on the same host if they use clustering + prometheus exporters, or should we fix the exporters so that we don't set the instance label to the host where the agent is running?
If possible, I think that this "instance" label should be removed. Its behavior is not consistent amongst exporters, and the agent's host should not be part of the targets' labels. But I guess that it would be a breaking change if the label is used in some pipeline or dashboard for filtering.

@ptodev
Contributor

ptodev commented Apr 4, 2024

We noticed that the agent adds to the targets an "instance" label which, by default, corresponds to the host where the agent runs.

Is this correct? According to this doc page and this blog post, the instance label should normally be set to the address of the scrape target.

@wildum
Contributor Author

wildum commented Apr 4, 2024

In the agent we set it by default to the agent's host, if I understand it correctly, here: https://github.com/grafana/agent/blob/main/internal/component/prometheus/exporter/exporter.go#L179

@ptodev
Contributor

ptodev commented Apr 4, 2024

I think the bug in the clustering can be fixed if the SNMP exporter's InstanceKey function is changed to return the addresses of the targets which are actually being scraped, similarly to the MSSQL exporter's InstanceKey function. I could raise a PR for this, since I'm on call. And then we could close this one?
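Roughly, the suggested direction would look something like the hypothetical Go sketch below (the type, names, and signature are made up for illustration and are not the actual grafana/agent code): derive the instance key from the configured target's address instead of from the agent's hostname.

package main

import "fmt"

// snmpTarget is a hypothetical stand-in for the SNMP exporter's per-target
// configuration; it is not the real component type.
type snmpTarget struct {
	Name    string
	Address string
}

// instanceKey returns a value for the "instance" label. Using the target's
// address keeps the value identical on every agent in the cluster, so all
// agents hash the target the same way.
func instanceKey(t snmpTarget) string {
	if t.Address != "" {
		return t.Address
	}
	// Fall back to the target name if no address is configured.
	return t.Name
}

func main() {
	fmt.Println(instanceKey(snmpTarget{Name: "router1", Address: "192.168.1.1"})) // "192.168.1.1"
}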

@wildum
Contributor Author

wildum commented Apr 4, 2024

That would be a start, but if we want the same behavior as this PR, we need to make sure that all exporters (except self, process, UNIX, and maybe others) set the instance label to the target's host and not to the agent's.

@rfratto rfratto added variant/flow Related to Grafana Agent Flow. enhancement New feature or request bug Something isn't working and removed enhancement New feature or request labels Apr 9, 2024
Contributor

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@github-actions github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label May 10, 2024
@thampiotr
Contributor

After taking a closer look, I'm also not in favour of this solution, but I 100% agree we should do something to address this problem.

Thinking about this, it seems to me that instance is a problem because we run embedded exporters in a cluster of N collectors. This means we run multiple instances of these exporters, which may also be undesirable due to the increased load on the systems from which metrics are exported. So I think we should provide a means to run (or scrape) a given exporter on only one instance at a time; then including the instance label should be harmless, as it merely denotes which Alloy instance did the scraping. This would be accomplished by more fine-grained component scheduling.

In case we don't want the instance label to denote which Alloy instance in a cluster did the scraping, we can opt not to set the instance label at all and let Prometheus set it automatically. I wouldn't recommend this option, as the value would then be, AFAICT, a localhost address with the random port on which the given exporter component exposes its metrics.

To make this less error-prone, we could also check in exporter components whether we are running in a cluster and warn the user?

@ptodev
Contributor

ptodev commented Jun 7, 2024

@thampiotr If a prometheus.exporter component only does work when it's scraped, then it won't be a performance issue to run it in clustered mode. I don't know how most exporters are designed in practice, but I suppose most only do work when they are scraped. If there is only one collector scraping them, all will be good and there won't be a performance hit.

I think we need to go through all prometheus.exporter components and make sure that the instance label is useful, and that it also works with clustering. Maybe we can create a tracker issue for that which lists all exporter components, and then gradually create component-specific "sub-issues" which look into each component's instance label.

@thampiotr
Contributor

Thanks @ptodev for the discussion (offline). I'm going to summarise it here:

  • The performance is not the biggest concern, but a nice-to-have.
  • We both prefer to have a solution that is general to all exporters and doesn't require going through all the exporters one-by-one.
  • With a standalone exporter, and with Agent Flow mode or Alloy without clustering, the instance label will typically be set to the host where the exporter runs, which may be different from where the thing being exported (for example, MySQL) runs.
  • We want to focus on fixing the clustering issue here - making the instance label more useful (for example, setting it to the hostname of the thing being exported) can be done separately.
  • Since fine-grained component scheduling is something we will need for other use cases, such as Kubernetes events, we agree that it's a good idea to use that planned feature to address this problem.
  • In the meantime, users can work around the problem quite easily when using clustering by relabelling the instance label to the same value on all instances in the cluster, for example:
discovery.relabel "replace_instance" {
  targets = discovery.file.targets.targets
  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}
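Because the rule uses the default regex (which matches any value), the instance label is overwritten unconditionally, so every node in the cluster ends up hashing identical label sets for the same target. The original per-agent value is lost, but in this setup it only reflected which agent produced the target.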

With this, I will close this PR for now and add a note to the fine-grained scheduling issue.

@thampiotr thampiotr closed this Jun 10, 2024
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Jul 11, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 11, 2024