
Target Allocator is not capable of scaling out with least-weighted strategy #2477

Closed
utr1903 opened this issue Dec 22, 2023 · 3 comments · Fixed by #2485
Labels
area:target-allocator (Issues for target-allocator), bug (Something isn't working), question (Further information is requested)

Comments

@utr1903
Contributor

utr1903 commented Dec 22, 2023

Component(s)

target allocator

What happened?

Description

I'm not sure whether this is a bug or an intentional implementation, but the least-weighted strategy of the Target Allocator is not capable of scaling out: it does not re-distribute existing targets when a new collector is added.

Steps to Reproduce

  1. Configure 5 scrape targets and deploy 2 collectors.
  2. Let the collectors get their targets assigned.
     • collector1 -> target1, target3, target5
     • collector2 -> target2, target4
  3. Deploy a 3rd collector.
  4. Check the target distribution.

Expected Result

The expected outcome would be a re-distribution of the targets across all 3 instances, for example:

  • collector1 -> target1, target4
  • collector2 -> target2, target5
  • collector3 -> target3

Actual Result

No re-distribution is triggered and the 3rd collector does not get assigned any target.

  • collector1 -> target1, target3, target5
  • collector2 -> target2, target4
  • collector3 ->

Kubernetes Version

1.27.1

Operator version

0.90.0

Collector version

0.91.0

Environment information

Environment

Log output

No response

Additional context

Implementation analysis

I've checked the code and, if I'm not mistaken, scaling out is not possible with the current implementation (please correct me if I'm wrong).

When newly added collector instances are found, the change is detected here and the allocator's SetCollectors() function is executed. It compares the current and new collector states, finds the new one (an addition) and therefore runs handleCollectors.

The critical part is this if case. Since the number of collectors is not zero, the flag allocateTargets remains false, and therefore we never re-distribute the targets here. However, even if we were to enter this loop, the rest of the current implementation does not handle re-assigning targets from the old collector state to the new one.

So, as far as I understand, if one wants to scale out, one needs to delete and re-create the allocator instance after introducing a new collector instance.
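
To make the gate concrete, here is a minimal Go sketch of the behavior described above. It is not the operator's actual code; the allocator type, handleCollectors, and leastLoadedCollector below are illustrative stand-ins. They only model how allocateTargets stays false whenever the collector pool is already non-empty, so a newly added collector stays idle until brand-new targets are discovered.

```go
// Illustrative sketch only, not the operator's implementation.
package main

import "fmt"

type target struct{ name string }

type allocator struct {
	collectors map[string]bool     // collector name -> present
	targets    map[string][]target // collector name -> assigned targets
	unassigned []target            // targets discovered but not yet assigned
}

// leastLoadedCollector stands in for the least-weighted choice: pick the
// collector that currently owns the fewest targets.
func (a *allocator) leastLoadedCollector() string {
	best, bestLen := "", int(^uint(0)>>1)
	for c := range a.collectors {
		if len(a.targets[c]) < bestLen {
			best, bestLen = c, len(a.targets[c])
		}
	}
	return best
}

// handleCollectors mirrors the gate discussed above: pending targets are only
// handed out when the collector set was previously empty.
func (a *allocator) handleCollectors(added, removed []string) {
	allocateTargets := len(a.collectors) == 0 // the critical "if" case

	for _, c := range removed {
		delete(a.collectors, c)
		delete(a.targets, c)
	}
	for _, c := range added {
		a.collectors[c] = true
	}

	if !allocateTargets {
		// A new collector joined a non-empty pool: nothing is moved.
		return
	}
	// Only reached when the pool started out empty.
	for _, t := range a.unassigned {
		c := a.leastLoadedCollector()
		a.targets[c] = append(a.targets[c], t)
	}
	a.unassigned = nil
}

func main() {
	a := &allocator{
		collectors: map[string]bool{"collector1": true, "collector2": true},
		targets: map[string][]target{
			"collector1": {{"target1"}, {"target3"}, {"target5"}},
			"collector2": {{"target2"}, {"target4"}},
		},
	}
	a.handleCollectors([]string{"collector3"}, nil)
	fmt.Println(len(a.targets["collector3"])) // prints 0: no re-distribution happens
}
```

Running it prints 0 for collector3: the new instance only ever receives targets discovered after it joined, which matches the actual result above.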

Actual intention

The actual reason I wanted to bring up this issue is that I am facing OOMKilled problems with the collectors due to uneven distribution of scrape targets, and scaling them out does not seem to help. Since the default allocation strategy is least-weighted rather than consistent-hashing, I assumed the community encourages us to use least-weighted, even though the consistent-hashing strategy re-distributes the scrape targets when a new collector instance is added.

I would be more than happy to get your experiences & recommendations.

@jaronoff97
Contributor

jaronoff97 commented Dec 22, 2023

Thank you so much for putting the time and effort into this issue. Given how detailed your question is, I'm going to do my best to give you a worthy response... Honestly, I think we should switch to the consistent hashing strategy as the default. It lets you autoscale your target allocators, and it's a much better strategy in terms of efficiency as well. To answer the question, though...

This is actually the intended behavior. The worry is that if you redistribute targets too often, your metrics are more prone to data loss. Internally we call this "stable allocation". There are a few scenarios we have in mind for this; I'll describe a few, but let me know if you have any more questions.

You could imagine a scenario where one collector's allocated target is very busy (imagine something like kube-state-metrics, which handles many high-cardinality metrics). This would mean the assigned collector is more prone to OOMs and restarts. If we were to detect a restart and reallocate this target, we may actually cause another collector to OOM and drop more data. Ultimately this would cause patchy metrics as your collectors come up, scrape some data, and die.

Another scenario is more dangerous: imagine a large enough cluster with millions of potential targets being discovered. The collector pool doing the sharded scraping probably has relatively small per-pod resources so that it is more resilient to failures, and let's say you have 10,000 targets per collector in this state. If your cluster is constantly scaling up and down, every time you add a collector you need to lock the target allocator to delete and re-allocate the entire pool of targets. If you hold on to that lock for too long, you will begin causing the collectors to time out.
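
As a rough illustration of that lock contention (this is not the target allocator's code; the server type, reallocateAll, and jobsHandler below are hypothetical), a full re-allocation held under the same mutex as the endpoint collectors poll would stall those polls for the entire re-shuffle:

```go
// Hypothetical sketch: one mutex guards both a full target re-allocation and
// the HTTP endpoint that collectors poll for their assignments, so a long
// re-shuffle blocks every poll until it finishes.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

type server struct {
	mu      sync.Mutex
	targets map[string][]string // collector -> assigned targets
}

// reallocateAll simulates dropping and re-assigning the whole target pool
// while holding the lock for the entire operation.
func (s *server) reallocateAll(allTargets, collectors []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.targets = make(map[string][]string)
	for i, t := range allTargets {
		c := collectors[i%len(collectors)]
		s.targets[c] = append(s.targets[c], t)
	}
}

// jobsHandler is what each collector would poll; it waits on the same mutex.
func (s *server) jobsHandler(w http.ResponseWriter, r *http.Request) {
	s.mu.Lock()
	defer s.mu.Unlock()
	fmt.Fprintf(w, "%v\n", s.targets[r.URL.Query().Get("collector")])
}

func main() {
	s := &server{targets: map[string][]string{"c1": {"t1"}}}
	ts := httptest.NewServer(http.HandlerFunc(s.jobsHandler))
	defer ts.Close()

	// With millions of targets, this re-allocation can hold the lock long
	// enough for the poll below to exceed a collector's HTTP client timeout.
	go s.reallocateAll([]string{"t1", "t2", "t3", "t4", "t5"}, []string{"c1", "c2", "c3"})

	resp, err := http.Get(ts.URL + "?collector=c1")
	if err != nil {
		fmt.Println("poll failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("poll finished with status", resp.Status)
}
```

Avoiding the full re-shuffle keeps the critical section proportional to the targets that actually move rather than to the whole pool.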

The last reason we do this is a bit more challenging to explain, but I'll try... There's a known issue in the compatibility between OTel and Prometheus relating to start timestamps. It's explained in detail here, but to summarize: when we change an assigned collector's targets, the collector's Prometheus receiver starts a new timeseries each time. If we were to re-distribute the targets from collector A to collector B and back to collector A, I'm unsure we would properly count a reset for a cumulative metric to keep the series accurate. This is mostly speculation on my end, I haven't tested it explicitly, but from the conversations I've had with my colleagues, we believe that keeping stable allocations is ultimately going to be best for users.

We discussed the implementation of consistent hashing in our SIG meetings and a bit on the introductory PR here. The gist of our goal there is that when a new collector comes online, we minimize the possible redistributions by placing replicas of our collectors on a large enough hashring. cc @swiatekm-sumo who may have more info he'd like to add here.
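
To make the hashring idea concrete, here is a minimal, self-contained consistent-hashing sketch in Go. It is not the target allocator's implementation; the virtualNodes constant and FNV hash are illustrative assumptions. It shows why adding collector3 only moves the targets whose ring positions its virtual nodes now claim.

```go
// Minimal consistent-hashing sketch, not the target allocator's implementation.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const virtualNodes = 100 // replication factor for each collector on the ring

type ring struct {
	points []uint32          // sorted hashes of virtual nodes
	owner  map[uint32]string // virtual node hash -> collector name
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(collectors []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, c := range collectors {
		for i := 0; i < virtualNodes; i++ {
			p := hashOf(fmt.Sprintf("%s-%d", c, i))
			r.points = append(r.points, p)
			r.owner[p] = c
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// collectorFor walks clockwise to the first virtual node at or after the
// target's hash; only targets whose arc is claimed by a new collector move.
func (r *ring) collectorFor(target string) string {
	h := hashOf(target)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	targets := []string{"target1", "target2", "target3", "target4", "target5"}
	before := newRing([]string{"collector1", "collector2"})
	after := newRing([]string{"collector1", "collector2", "collector3"})
	for _, t := range targets {
		fmt.Printf("%s: %s -> %s\n", t, before.collectorFor(t), after.collectorFor(t))
	}
}
```

Running it prints each target's owner before and after collector3 joins; with a large enough virtualNodes, only a fraction of the targets change owners while the rest stay put, which preserves most of the "stable allocation" benefit while still letting the pool scale out.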

Circling back to where I started, I do think it's time we change the default strategy, and I would support a PR that does exactly that. It wouldn't be a breaking change for target allocator users; in fact, our average user may experience a performance boost. With that change we should also recommend the relabel filter strategy, which also contains performance improvements.

@utr1903
Contributor Author

utr1903 commented Dec 23, 2023

@jaronoff97 Thanks a lot for the great response! It definitely helps me understand the intention behind the current behavior.
I would be happy to open up that PR to switch the default allocation strategy to consistent-hashing (and to set the default filtering strategy to relabel-config, if agreed).

@jaronoff97
Contributor

yeah! that would be great, thank you!
