
Target Allocator is not capable of scaling out with least-weighted strategy #2477

Closed
utr1903 opened this issue Dec 22, 2023 · 3 comments · Fixed by #2485
Labels
area:target-allocator (Issues for target-allocator), bug (Something isn't working), question (Further information is requested)

Comments

@utr1903
Contributor

utr1903 commented Dec 22, 2023

Component(s)

target allocator

What happened?

Description

I'm not sure whether this is a bug or an intentional implementation, but the least-weighted strategy of the Target Allocator is not capable of scaling out: it does not re-distribute existing targets when a new collector is added.

Steps to Reproduce

  1. Configure 5 scrape targets and deploy 2 collectors.
  2. Let the collectors get their targets assigned.
     • collector1 -> target1, target3, target5
     • collector2 -> target2, target4
  3. Deploy a 3rd collector.
  4. Check the target distribution.

Expected Result

The expected outcome would be a re-distribution of the targets across all 3 instances, for example:

  • collector1 -> target1, target4
  • collector2 -> target2, target5
  • collector3 -> target3

Actual Result

No re-distribution is triggered and the 3rd collector does not get assigned any target.

  • collector1 -> target1, target3, target5
  • collector2 -> target2, target4
  • collector3 ->

Kubernetes Version

1.27.1

Operator version

0.90.0

Collector version

0.91.0

Environment information

Environment

Log output

No response

Additional context

Implementation analysis

I've checked the code and, if I'm not mistaken, scaling out is not possible with the current implementation (please correct me if I'm wrong).

When newly added collector instances are found, the change is detected here and the allocator's SetCollectors() function is executed. It compares the current and new collector states, finds the new one (an addition) and therefore runs handleCollectors.

The critical part is this if case. Since the number of collectors is not zero, the flag allocateTargets remains false, and therefore we never re-distribute the targets here. However, even if we were to enter this loop, the rest of the current implementation does not handle re-assigning targets from the old collector state to the new one.

So, as far as I understand, if one wants to scale out, one needs to delete and re-create the allocator instance after introducing a new collector instance.
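
To make the gate concrete, here is a minimal Go sketch of the behavior described above. It is not the operator's actual code; the allocator type, handleCollectors, and leastLoadedCollector below are illustrative stand-ins. They only model how allocateTargets stays false whenever the collector pool is already non-empty, so a newly added collector stays idle until brand-new targets are discovered.

```go
// Illustrative sketch only, not the operator's implementation.
package main

import "fmt"

type target struct{ name string }

type allocator struct {
	collectors map[string]bool     // collector name -> present
	targets    map[string][]target // collector name -> assigned targets
	unassigned []target            // targets discovered but not yet assigned
}

// leastLoadedCollector stands in for the least-weighted choice: pick the
// collector that currently owns the fewest targets.
func (a *allocator) leastLoadedCollector() string {
	best, bestLen := "", int(^uint(0)>>1)
	for c := range a.collectors {
		if len(a.targets[c]) < bestLen {
			best, bestLen = c, len(a.targets[c])
		}
	}
	return best
}

// handleCollectors mirrors the gate discussed above: pending targets are only
// handed out when the collector set was previously empty.
func (a *allocator) handleCollectors(added, removed []string) {
	allocateTargets := len(a.collectors) == 0 // the critical "if" case

	for _, c := range removed {
		delete(a.collectors, c)
		delete(a.targets, c)
	}
	for _, c := range added {
		a.collectors[c] = true
	}

	if !allocateTargets {
		// A new collector joined a non-empty pool: nothing is moved.
		return
	}
	// Only reached when the pool started out empty.
	for _, t := range a.unassigned {
		c := a.leastLoadedCollector()
		a.targets[c] = append(a.targets[c], t)
	}
	a.unassigned = nil
}

func main() {
	a := &allocator{
		collectors: map[string]bool{"collector1": true, "collector2": true},
		targets: map[string][]target{
			"collector1": {{"target1"}, {"target3"}, {"target5"}},
			"collector2": {{"target2"}, {"target4"}},
		},
	}
	a.handleCollectors([]string{"collector3"}, nil)
	fmt.Println(len(a.targets["collector3"])) // prints 0: no re-distribution happens
}
```

Running it prints 0 for collector3: the new instance only ever receives targets discovered after it joined, which matches the actual result above.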

Actual intention

The actual reason I wanted to bring up this issue is that I am facing OOMKilled problems with the collectors due to uneven distribution of scrape targets, and scaling them out does not seem to help. Since the default allocation strategy is least-weighted rather than consistent-hashing, I assumed the community encourages us to use least-weighted, even though the consistent-hashing strategy re-distributes the scrape targets when a new collector instance is added.

I would be more than happy to get your experiences & recommendations.

@jaronoff97
Contributor

jaronoff97 commented Dec 22, 2023

Thank you so much for putting the time and effort into this issue. Given how detailed your question is, I'm going to do my best to give you a worthy response... Honestly, I think we should switch to the consistent hashing strategy as the default. It lets you autoscale your target allocators, and it's a much better strategy in terms of efficiency as well. To answer the question, though...

This is actually the intended behavior. The worry is that if you redistribute targets too often, your metrics are more prone to data loss. Internally we call this "stable allocation". There are a few scenarios we have in mind for this; I'll describe a few, but let me know if you have any more questions.

You could imagine a scenario where one collector's allocated target is very busy (imagine something like kube-state-metrics, which handles many high-cardinality metrics). This would mean the assigned collector is more prone to OOMs and restarts. If we were to detect a restart and reallocate this target, we may actually cause another collector to OOM and drop more data. Ultimately this would cause patchy metrics as your collectors come up, scrape some data, and die.

Another scenario is more dangerous: imagine a large enough cluster with millions of potential targets being discovered. The collector pool doing the sharded scraping probably has relatively small per-pod resources so that it is more resilient to failures, and let's say you have 10,000 targets per collector in this state. If your cluster is constantly scaling up and down, every time you add a collector you need to lock the target allocator to delete and re-allocate the entire pool of targets. If you hold on to that lock for too long, you will begin causing the collectors to time out.
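
As a rough illustration of that lock contention (this is not the target allocator's code; the server type, reallocateAll, and jobsHandler below are hypothetical), a full re-allocation held under the same mutex as the endpoint collectors poll would stall those polls for the entire re-shuffle:

```go
// Hypothetical sketch: one mutex guards both a full target re-allocation and
// the HTTP endpoint that collectors poll for their assignments, so a long
// re-shuffle blocks every poll until it finishes.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

type server struct {
	mu      sync.Mutex
	targets map[string][]string // collector -> assigned targets
}

// reallocateAll simulates dropping and re-assigning the whole target pool
// while holding the lock for the entire operation.
func (s *server) reallocateAll(allTargets, collectors []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.targets = make(map[string][]string)
	for i, t := range allTargets {
		c := collectors[i%len(collectors)]
		s.targets[c] = append(s.targets[c], t)
	}
}

// jobsHandler is what each collector would poll; it waits on the same mutex.
func (s *server) jobsHandler(w http.ResponseWriter, r *http.Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	fmt.Fprintf(w, "%v\n", s.targets[r.URL.Query().Get("collector")])
}

func main() {
	s := &server{targets: map[string][]string{"c1": {"t1"}}}
	ts := httptest.NewServer(http.HandlerFunc(s.jobsHandler))
	defer ts.Close()

	// With millions of targets, this re-allocation can hold the lock long
	// enough for the poll below to exceed a collector's HTTP client timeout.
	go s.reallocateAll([]string{"t1", "t2", "t3", "t4", "t5"}, []string{"c1", "c2", "c3"})

	resp, err := http.Get(ts.URL + "?collector=c1")
	if err != nil {
		fmt.Println("poll failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("poll finished with status", resp.Status)
}
```

Avoiding the full re-shuffle keeps the critical section proportional to the targets that actually move rather than to the whole pool.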

The last reason we do this is a bit more challenging to explain, but I'll try... There's a known issue in the compatibility between OTel and Prometheus relating to start timestamps. It's explained in detail here, but to summarize: when we change an assigned collector's targets, the collector's Prometheus receiver starts a new timeseries each time. If we were to re-distribute the targets from collector A to collector B and back to collector A, I'm unsure we would properly count a reset for a cumulative metric to keep the series accurate. This is mostly speculation on my end, I haven't tested it explicitly, but from the conversations I've had with my colleagues, we believe that keeping stable allocations is ultimately going to be best for users.

We discussed the implementation of consistent hashing in our SIG meetings and a bit on the introductory PR here. The gist of our goal there is that when a new collector comes online, we minimize the possible redistributions by placing replicas of our collectors on a large enough hashring. cc @swiatekm-sumo who may have more info he'd like to add here.
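
To make the hashring idea concrete, here is a minimal, self-contained consistent-hashing sketch in Go. It is not the target allocator's implementation; the virtualNodes constant and FNV hash are illustrative assumptions. It shows why adding collector3 only moves the targets whose ring positions its virtual nodes now claim.

```go
// Minimal consistent-hashing sketch, not the target allocator's implementation.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 100 // replication factor for each collector on the ring

type ring struct {
	points []uint32          // sorted hashes of virtual nodes
	owner  map[uint32]string // virtual node hash -> collector name
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(collectors []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, c := range collectors {
		for i := 0; i < virtualNodes; i++ {
			p := hashOf(fmt.Sprintf("%s-%d", c, i))
			r.points = append(r.points, p)
			r.owner[p] = c
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// collectorFor walks clockwise to the first virtual node at or after the
// target's hash; only targets whose arc is claimed by a new collector move.
func (r *ring) collectorFor(target string) string {
	h := hashOf(target)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	targets := []string{"target1", "target2", "target3", "target4", "target5"}
	before := newRing([]string{"collector1", "collector2"})
	after := newRing([]string{"collector1", "collector2", "collector3"})
	for _, t := range targets {
		fmt.Printf("%s: %s -> %s\n", t, before.collectorFor(t), after.collectorFor(t))
	}
}
```

Running it prints each target's owner before and after collector3 joins; with a large enough virtualNodes, only a fraction of the targets change owners while the rest stay put, which preserves most of the "stable allocation" benefit while still letting the pool scale out.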

Circling back to where I started, I do think it's time we change the default strategy, and I would support a PR that does exactly that. It wouldn't be a breaking change for target allocator users; in fact, our average user may experience a performance boost. With that change we should also recommend the relabel filter strategy, which also contains performance improvements.

@utr1903
Contributor Author

utr1903 commented Dec 23, 2023

@jaronoff97 Thanks a lot for the great response! It definitely helps me understand the intention behind the current behavior.
I would be happy to open up that PR to switch the default allocation strategy to consistent-hashing (and to set the default filtering strategy to relabel-config, if agreed).

@jaronoff97
Contributor

yeah! that would be great, thank you!
