Huge cardinality and complexity jump with telemetry v2 #19090

Open
Stono opened this issue Nov 20, 2019 · 21 comments

Comments

@Stono (Contributor) commented Nov 20, 2019

Hey all,
I just wanted to share this concern as an issue so it can be discussed further. My belief is that telemetry v2 will increase Prometheus metric cardinality quite significantly, especially for those with existing deployments on large clusters like ours.

This is because of the roll-up effect that the current mixer provides. By roll-up I mean that pod_name is not a label on some of the biggest Istio metrics, e.g.:

  • istio_request_duration_seconds_bucket
  • istio_response_bytes_bucket
  • istio_request_bytes_bucket

So take this simplistic example:

[image: mixer roll-up diagram]

Currently you would end up with something like:

istio_request_duration_seconds_bucket{app="service1", instance="ip-of-mixer", le="whatever"} 40

With telemetry v2 you will end up with:

istio_request_duration_seconds_bucket{app="service1", pod_name="pod1", le="whatever"} 10
istio_request_duration_seconds_bucket{app="service1", pod_name="pod2", le="whatever"} 10
istio_request_duration_seconds_bucket{app="service1", pod_name="pod3", le="whatever"} 10
istio_request_duration_seconds_bucket{app="service1", pod_name="pod4", le="whatever"} 10

Each of these metrics already has high cardinality because it is a histogram (the le label), so this will result in a huge increase in Prometheus series.

Many of our applications run 10-20 pods, so I believe the explosion from switching to metrics labelled with the pod name is going to be pretty huge and potentially unmanageable.
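
If you want to gauge this on your own cluster, a rough way to see per-metric series counts is a PromQL query along these lines (the metric regex here is purely illustrative):

count by (__name__) ({__name__=~"istio_request.*_bucket|istio_response.*_bucket"})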

@bianpengyuan (Contributor) commented Nov 20, 2019

Thanks for bringing this up! Could you share a bit more about what could become unmanageable? Is it Prometheus query latency, or Prometheus CPU usage?

I don't think we've measured telemetry v2's impact on the Prometheus server. We should also measure it in our load tests.

@Stono (Contributor, Author) commented Nov 20, 2019

Hello,
Several things really. The majority of our dashboards and alerts query these metrics, so significantly increasing their cardinality impacts both query performance and Prometheus CPU. Cardinality also significantly impacts Prometheus RAM usage.

By unmanageable I mean a very large increase in resource requirements for Prometheus, in exchange for the resources saved by dropping mixer, for really (in my opinion) little gain.

As an example, we've recently been looking at moving to Grafana Labs' hosted Prometheus, which charges consumption-based pricing on the number of unique series. At the moment, the cardinality of these metrics is tied to the number of mixer instances (via instance=ip).

You can see here how I halved the number of unique series Istio produces (without losing data), purely by going from 4 instances of istio-telemetry to 2 (just to highlight the impact label cardinality has at scale):

[image: graph showing the number of unique series roughly halving]

Here you can see the impact of that same change on prometheus RAM:

[screenshot: the impact of the same change on Prometheus RAM]

So any meaningful increase in label cardinality on these already extremely large metrics will have a significant impact on a large-scale cluster.

Without mixer doing this roll-up, the only other solution is to create a tiered Prometheus architecture with recording rules for the Istio metrics that do the roll-up for us. But that's significant architectural complexity as a result of this change, which I'd rather avoid.

@howardjohn (Member) commented Nov 20, 2019

the only other solution is to create a tiered Prometheus architecture

I'm not a Prometheus expert, but I was under the impression we could just drop the pod name label and everything would work. Is that not true?

@Stono (Contributor, Author) commented Nov 20, 2019

@howardjohn No, it isn't. The only way to "drop" the pod name and roll up the metrics is a recording rule which records a new metric as an aggregate of the others.

If you just drop that label in your scrape config, you'll end up with duplicate series:

istio_request_duration_seconds_bucket{app="service1", pod_name="pod1", le="whatever"} 10
istio_request_duration_seconds_bucket{app="service1", pod_name="pod2", le="whatever"} 10

becomes:

istio_request_duration_seconds_bucket{app="service1", le="whatever"} 10
istio_request_duration_seconds_bucket{app="service1", le="whatever"} 10

not:

istio_request_duration_seconds_bucket{app="service1", le="whatever"} 20

See: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#metric_relabel_configs

Care must be taken with labeldrop and labelkeep to ensure that metrics are still uniquely labeled once the labels are removed.
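
To roll pod_name away properly you need something like the following recording rule (a minimal sketch only; the group name, rule name and aggregated metric name are illustrative, not anything shipped by Istio):

groups:
  - name: istio-rollup
    rules:
      # aggregate away the per-pod labels so dashboards query one series per workload
      - record: workload:istio_request_duration_seconds_bucket:sum
        expr: sum without (pod_name, instance) (istio_request_duration_seconds_bucket)

Dashboards and alerts then need to query the recorded workload: series instead of the raw metric.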

@mandarjog (Contributor) commented Nov 20, 2019

@Stono thanks for bringing it up.
We use the operator in large-cluster testing, so we did not experience this issue. Direct pod and endpoint scraping is also the standard way to export application metrics, so this is not unusual. Perhaps we need a last-mile aggregate-and-record step here. We will get back to you.

@mandarjog self-assigned this Nov 20, 2019
@Stono (Contributor, Author) commented Nov 20, 2019

To be clear, I don't have an issue with pod scraping, and I'm not saying the direction you're going is wrong. We already selectively scrape pod application metrics, and Envoy stats on certain important applications.

But you asked me to think about the implications of telemetry v2, so I'm purely calling out why adding pod_name as a label on already high-cardinality Istio metrics is a considerable change for existing users at scale, and one that will impact me personally quite a bit.

As an example, on your test cluster deploy:

  • 20x workloads with 10 pods each, with 2x mixer's
  • go into prometheus and do count({__name__="istio_request_bytes_bucket"})

Ignoring all the other labels, your cardinality there is effectively:

  • 20 (workloads) x 2 (mixer instances) x 10 (le) = 400

Now deploy telemetry v2 and check the same count in Prometheus; you'll likely see it much higher, as your cardinality is now:

  • 20 (workloads) x 10 (pod_name) x 10 (le) = 2000

So effectively a five-fold increase in this example, on a single metric, and there are quite a few high-cardinality Istio metrics.

This impacts query performance, resource requirements and, in my case, managed Prometheus cost. To really see it, your test clusters will need to be sending Prometheus a constant query load that aggregates across these series.

From an "out of the box" Istio perspective, you're likely going to need recording rules to do the roll-up that mixer was previously doing, in order to keep metric query performance (the dashboard and Prometheus experience) "nice". People like me who don't use the out-of-the-box Prometheus setup, because our requirements are more bespoke, will need to work around this change architecturally, which takes time, so plenty of advance notice of mixer deprecation is needed.

@Stono (Contributor, Author) commented Nov 20, 2019

There is also the added issue that, for us, pod_name changes a lot (we deploy hundreds of times daily), so it's a very high-churn label. Any sort of aggregation over time will likely be impacted too, as the query will span many more series.

@mandarjog (Contributor) commented Nov 20, 2019

Pod name is added by the scraper; it should not be a dimension on the base metric. Will check.

@douglas-reid (Contributor) commented Nov 20, 2019

pod_name is added (in the default istio config) by the scrape job:

https://github.com/istio/istio/blob/master/install/kubernetes/helm/istio/charts/prometheus/templates/configmap.yaml#L237-L239

We definitely want to aggregate over pods. Providing a transition guide is part of the planned deprecation story @Stono. I expect advice on scaling, etc., to be part of that transition guide.
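
For anyone wondering, that scrape job adds the label via a standard kubernetes_sd relabelling rule roughly of this shape (paraphrased here as a sketch, not copied verbatim from the linked file):

# copy the Kubernetes pod name onto every scraped series
- source_labels: [__meta_kubernetes_pod_name]
  action: replace
  target_label: pod_name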

@Stono (Contributor, Author) commented Nov 20, 2019

Sounds good.
My gut is that you'll need a roll-up recording rule, but then the problem is you'll be storing both the unrolled and the rolled-up series, albeit only querying one of them, which will keep the current chunks in memory.

As an example, here's me adding recording rules:

[screenshot: recording rules being added]

Impact on memory on the test cluster:
[screenshot: memory impact on the test cluster]

Looking forward to your suggestions.

@mandarjog (Contributor) commented Nov 20, 2019

@douglas-reid the instance label has the same cardinality as the pod name so getting rid of it may not move the needle.

@Stono (Contributor, Author) commented Nov 20, 2019

@douglas-reid the instance label has the same cardinality as the pod name so getting rid of it may not move the needle.

That is not true: in telemetry v1, instance is the IP of a mixer instance; in v2, pod_name is the application pod.

The number of mixers is typically far lower than the number of application pods per workload.

@douglas-reid (Contributor) commented Nov 20, 2019

In v2 we are going to be collecting from a large number of sources, and we need to account for that appropriately. Any problematic labels will need to be dealt with, whether instance, pod_name or otherwise.

@Stono (Contributor, Author) commented Jan 10, 2020

To add to this, please see #20060, where locality-aware routing multiplies some of these high-cardinality metrics by the number of zones (squared).

@howardjohn (Member) commented Feb 19, 2020

@Stono (Contributor, Author) commented Feb 21, 2020

The result of a day's worth of deployment cardinality on Prometheus memory with telemetry v2 :-(

This wouldn't have grown at all on v1 due to the roll-up effect of mixer.

[screenshot: Prometheus memory growth over the day]

We're likely going to have to adopt a tiered Prometheus approach in order to deal with this.

@Stono (Contributor, Author) commented Feb 21, 2020

It might be because it's 10pm and I'm dealing with issues caused by the migration to telemetry v2, but I need to express just how much worse this is for us at our scale, and to warn anyone else against migrating to v2 before truly understanding the consequences.

For context: this cluster has 1,300 pods, 350 namespaces, 700 services.

istio-telemetry previously rolled up Istio's very high-cardinality metrics, a behaviour that is required: without it, the Istio metrics sliced by every pod_name on your cluster create a huge overhead on Prometheus. The added cardinality is particularly painful if, like us, you deploy hundreds of times per day, creating very high churn on the new pod_name label that is now part of your Istio metrics. Memory footprint goes up significantly and queries are slower. It's a nightmare. Our Prometheus has gone pop twice today and we've ended up wiping its data just to get it back online.

As a result, you have no choice but to move towards a federated Prometheus architecture: a Prometheus in the middle scrapes all of your sidecars and has recording rules to "roll them up", and your primary Prometheus then scrapes that intermediary Prometheus for the rolled-up metrics. That intermediary server becomes a behemoth, as it holds both the original data and the recording-rule version of it. In our case, its footprint was actually greater than mixer's was in v1.
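
For the curious, the federation scrape on the primary Prometheus ends up looking roughly like this (a sketch; the job name, rule prefix and target address are illustrative, not Istio defaults):

- job_name: 'istio-rollup-federation'
  honor_labels: true
  metrics_path: '/federate'
  params:
    # only pull the rolled-up series produced by the intermediary's recording rules
    'match[]':
      - '{__name__=~"workload:.*"}'
  static_configs:
    - targets: ['istio-prometheus.istio-system.svc:9090']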

Also, if you think about it - what have you just built? Your own version of istio-telemetry! A service in the middle whose only job is to roll up extremely high-cardinality metrics! Only this one doesn't scale horizontally, and isn't maintained by the Istio team.

Federated Prometheus setups are also a pain: for example, getting your scrape intervals right so you don't miss data, and making sure you don't lose labels during roll-ups. The Istio upgrade process also becomes more complex, because on each release we're going to have to make sure our recording rules are still compatible. All of which is a lot more overhead than v1 added to our deployment.

There are purported performance improvements, including less overhead on your sidecars; however, we noticed no change in CPU or memory usage across our fleet. What we did notice is that Prometheus scraping the sidecars results in 50% more network traffic than the sidecars pushing to istio-telemetry did. That's because Prometheus "scrapes it all" and then drops what it doesn't want. So you've actually got some increased overhead there - and that's despite all our workloads running with sidecar.istio.io/statsInclusionSuffixes: outlier_detection.ejections_active,outlier_detection.ejections_enforced_total,upstream_cx_active,upstream_cx_total to minimise the number of metrics they generate.
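
For reference, that annotation sits on each workload's pod template, roughly like so (a sketch using the values quoted above):

template:
  metadata:
    annotations:
      sidecar.istio.io/statsInclusionSuffixes: "outlier_detection.ejections_active,outlier_detection.ejections_enforced_total,upstream_cx_active,upstream_cx_total"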

In summary, this change has increased the complexity of running Istio at scale for operators. It is, in my honest opinion, a really bad move without some serious support being offered to the people invested in v1.

@Stono changed the title from "Potential cardinality jump with telemetry v2" to "Huge cardinality and complexity jump with telemetry v2" on Feb 21, 2020
@mandarjog (Contributor) commented Feb 21, 2020

@Stono I hear you. We will do several things to address your concerns.

  1. Provide a way to consolidate/roll up metrics with prometheus and support it.
  2. Reduce the amount of data that is scraped. This is eminently doable: we already proxy from :15090/stats/prometheus to :15020/stats/prometheus, so we can add usedonly and filter params.
    This should reduce what is scraped to just what we really want.
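
To illustrate what that might look like from the scrape side once such params are supported (a sketch only; the job name and filter value are illustrative, and whether :15020 forwards these parameters is exactly the proposal above):

- job_name: 'istio-proxies'
  metrics_path: '/stats/prometheus'
  params:
    # only return stats Envoy has actually used, filtered to what we care about
    usedonly: ['']
    filter: ['istio_request']
  kubernetes_sd_configs:
    - role: pod
  # ...plus the usual relabelling to point at port 15020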

I have spoken with the Prometheus maintainers, and they say that scraping every endpoint is the standard and supported way of doing Prometheus. Mixer v1 provided an alternative, but it is not the norm.

@douglas-reid

@Stono (Contributor, Author) commented Feb 21, 2020

I agree that it is the standard way of doing Prometheus; my grief here is that, as an end user, the transition to v2 is absolute agony.

There are significant trade-offs with this approach which need to be clearly defined, addressed and mitigated to help users who have large production systems on v1, including being absolutely clear about how important it is to recreate the roll-up capability they got from mixer, using a tiered Prometheus instead.

@Stono (Contributor, Author) commented Feb 22, 2020

Also see #21377

@Stono (Contributor, Author) commented Feb 28, 2020

FYI for anyone coming to this thread, I wrote a blog post on how we've tackled this with federation and rollup rules: https://karlstoney.com/2020/02/25/federated-prometheus-to-reduce-metric-cardinality/
