
[Clustering] Filter the label "instance" from the hash computation #6792

Closed

Conversation

wildum
Contributor

@wildum wildum commented Mar 28, 2024

PR Description

In clustering mode, the targets exposed by exporters are distributed amongst the agents.
For an agent to know whether it should scrape the target or not, it will compute a hash based on the labels of the target.

We noticed that the agent adds to the targets an "instance" label which, by default, corresponds to the host where the agent runs. This is a problem because agents that are not running on the same host will have different values for the "instance" label, so they will compute different hashes and the targets won't be distributed correctly.
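For illustration only, here is a minimal Go sketch (not the agent's actual clustering code) of how hashing a target's labels makes the "instance" label matter: two agents that see the same logical target but inject their own hostname as "instance" end up with different hashes and disagree on ownership.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashTarget sketches how a target's labels could be folded into a single
// hash: sort the label names, then feed every name/value pair into an FNV
// hash. This is an illustration, not the agent's actual implementation.
func hashTarget(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator to avoid ambiguous concatenations
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	// Same logical target, but the "instance" label carries each agent's own
	// hostname, so the two agents compute different hashes for it.
	onAgentA := map[string]string{"job": "integrations/mysql", "instance": "agent-a"}
	onAgentB := map[string]string{"job": "integrations/mysql", "instance": "agent-b"}
	fmt.Println(hashTarget(onAgentA) == hashTarget(onAgentB)) // false
}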

Notes to the Reviewer

Is there a scenario where a target can only be uniquely identified by the instance label?
If that's the case, then the agent would compute the same hash for different targets, which would result in an unbalanced cluster.

PR Checklist

  • CHANGELOG.md updated
  • Tests updated

@wildum wildum marked this pull request as draft March 28, 2024 17:06
@wildum
Contributor Author

wildum commented Mar 28, 2024

Converting to draft for now because we think it might be a risky change and we don't want it in 1.0

@tpaschalis
Member

tpaschalis commented Apr 1, 2024

I'm not convinced this is the right solution; it feels too arbitrary. One of clustering's main assumptions is that all nodes run with the same configuration files and that the upstream APIs (e.g. service discovery) work in the same way, so that the fully-local consistent hashing works properly. The prometheus.exporter.* components may set a custom instance key related to the system they're running on, but this value can be overridden.

IMHO this could be part of the documentation for running clustering together with exporter components. Nice find nonetheless!

@wildum
Contributor Author

wildum commented Apr 2, 2024

In this example the agents have the same config files; they just run on different hosts. Do you think that we should constrain the agents to run on the same host if they use clustering + prometheus exporters, or should we fix the exporters so that we don't set the instance label to the host where the agent is running?
If possible, I think that this "instance" label should be removed. Its behavior is not consistent amongst exporters, and the agent's host should not be part of the targets' labels. But I guess that it would be a breaking change if the label is used in some pipeline or dashboard for filtering.

@ptodev
Contributor

ptodev commented Apr 4, 2024

We noticed that the agent adds to the targets an "instance" label which, by default, corresponds to the host where the agent runs.

Is this correct? According to this doc page and this blog post, the instance label should normally be set to the address of the scrape target.

@wildum
Contributor Author

wildum commented Apr 4, 2024

In the agent we set it by default to the agent's host, if I understand it correctly, here: https://github.com/grafana/agent/blob/main/internal/component/prometheus/exporter/exporter.go#L179

@ptodev
Contributor

ptodev commented Apr 4, 2024

I think the bug in the clustering can be fixed if the SNMP exporter's InstanceKey function is changed to return the addresses of the targets which are actually being scraped, similarly to the MSSQL exporter's InstanceKey function. I could raise a PR for this, since I'm on call. And then we could close this one?
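Roughly, the suggested direction would look something like the hypothetical Go sketch below (the type, names, and signature are made up for illustration and are not the actual grafana/agent code): derive the instance key from the configured target's address instead of from the agent's hostname.

package main

import "fmt"

// snmpTarget is a hypothetical stand-in for the SNMP exporter's per-target
// configuration; it is not the real component type.
type snmpTarget struct {
	Name    string
	Address string
}

// instanceKey returns a value for the "instance" label. Using the target's
// address keeps the value identical on every agent in the cluster, so all
// agents hash the target the same way.
func instanceKey(t snmpTarget) string {
	if t.Address != "" {
		return t.Address
	}
	// Fall back to the target name if no address is configured.
	return t.Name
}

func main() {
	fmt.Println(instanceKey(snmpTarget{Name: "router1", Address: "192.168.1.1"})) // "192.168.1.1"
}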

@wildum
Contributor Author

wildum commented Apr 4, 2024

That would be a start, but if we want the same behavior as this PR, we need to make sure that all exporters (except self, process, UNIX, and maybe others) set the instance label to the target's host and not to the agent's.

@rfratto rfratto added variant/flow Related to Grafana Agent Flow. enhancement New feature or request bug Something isn't working and removed enhancement New feature or request labels Apr 9, 2024
Contributor

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@github-actions github-actions bot added the needs-attention An issue or PR has been sitting around and needs attention. label May 10, 2024
@thampiotr
Contributor

After taking a closer look, I'm also not in favour of this solution, but I 100% agree we should do something to address this problem.

Thinking about this, it seems to me that instance is a problem because we run embedded exporters in a cluster of N collectors. This means we run multiple instances of these exporters, which may also be undesirable due to the increased load on the systems from which metrics are exported. So I think we should provide a means to run (or scrape) a given exporter on only one instance at a time; then including the instance label should be harmless, as it merely denotes which Alloy instance did the scraping. This would be accomplished by more fine-grained component scheduling.

In case we don't want the instance label to denote which Alloy instance in a cluster did the scraping, we can opt not to set the instance label at all and let Prometheus set it automatically. I wouldn't recommend this option, as the value would then be, AFAICT, a localhost address with the random port on which the given exporter component exposes its metrics.

To make this less error-prone, we could also check in exporter components whether we are running in a cluster and warn the user?

@ptodev
Contributor

ptodev commented Jun 7, 2024

@thampiotr If a prometheus.exporter component only does work when it's scraped, then it won't be a performance issue to run it in clustered mode. I don't know how most exporters are designed in practice, but I suppose most only do work when they are scraped. If there is only one collector scraping them, all will be good and there won't be a performance hit.

I think we need to go through all prometheus.exporter components and make sure that the instance label is useful, and that it also works with clustering. Maybe we can create a tracker issue for that which lists all exporter components, and then gradually create component-specific "sub-issues" which look into each component's instance label.

@thampiotr
Contributor

Thanks @ptodev for the discussion (offline). I'm going to summarise it here:

  • The performance is not the biggest concern, but a nice-to-have.
  • We both prefer to have a solution that is general to all exporters and doesn't require going through all the exporters one-by-one.
  • With a standalone exporter, and with Agent Flow mode or Alloy without clustering, the instance label will typically be set to the host where the exporter runs, which may be different from where the thing being exported (for example, MySQL) runs.
  • We want to focus on fixing the clustering issue here - making the instance label more useful (for example, setting it to the hostname of the thing being exported) can be done separately.
  • Since fine-grained component scheduling is something we will need for other use cases, such as Kubernetes events, we agree that it's a good idea to use that planned feature to address this problem.
  • In the meantime, users can work around the problem quite easily when using clustering by relabelling the instance label to the same value on all instances in the cluster, for example:
discovery.relabel "replace_instance" {
  targets = discovery.file.targets.targets
  rule {
    action        = "replace"
    source_labels = ["instance"]
    target_label  = "instance"
    replacement   = "alloy-cluster"
  }
}
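Because the rule uses the default regex (which matches any value), the instance label is overwritten unconditionally, so every node in the cluster ends up hashing identical label sets for the same target. The original per-agent value is lost, but in this setup it only reflected which agent produced the target.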

With this, I will close this PR for now and add a note to the fine-grained scheduling issue.

@thampiotr thampiotr closed this Jun 10, 2024
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Jul 11, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 11, 2024