Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize Overflow attribute section under Cardinality Limits #3904

Open
utpilla opened this issue Feb 26, 2024 · 35 comments
Open

Stabilize Overflow attribute section under Cardinality Limits #3904

utpilla opened this issue Feb 26, 2024 · 35 comments
Assignees
Labels
spec:metrics Related to the specification/metrics directory triage:accepted:ready-with-sponsor

Comments

@utpilla
Copy link
Contributor

utpilla commented Feb 26, 2024

What are you trying to achieve?

Mark the Overflow attribute section under Cardinality limits as Stable.

The spec was updated with the overflow attribute and cardinality limits section in #2960.

It looks like we already have more than three languages that have added support for this:

@utpilla utpilla added the spec:metrics Related to the specification/metrics directory label Feb 26, 2024
@cijothomas
Copy link
Member

#3798 This seems like a pre-req.

@cijothomas
Copy link
Member

Rust implementation is pretty basic - just enforced a hardcoded 2000 limit, not configured in anyway - neither via Views nor at Provider/Reader level. I hope other implementations are more complete?

@utpilla
Copy link
Contributor Author

utpilla commented Feb 26, 2024

Rust implementation is pretty basic - just enforced a hardcoded 2000 limit, not configured in anyway - neither via Views nor at Provider/Reader level. I hope other implementations are more complete?

@cijothomas I believe you're referring to the Configuration section under Cardinality limits. This issue only targets the Overflow attribute section.

I think as long as the SDKs have imposed some cardinality limit (be it hardcoded or configurable), the stability of metric overflow attribute can be tracked independently.

@cijothomas
Copy link
Member

Rust implementation is pretty basic - just enforced a hardcoded 2000 limit, not configured in anyway - neither via Views nor at Provider/Reader level. I hope other implementations are more complete?

@cijothomas I believe you're referring to the Configuration section under Cardinality limits. This issue only targets the Overflow attribute section.

I think as long as the SDKs have imposed some cardinality limit (be it hardcoded or configurable), the stability of metric overflow attribute can be tracked independently.

Oh this was just for that subsection, got it.

What is the benefit of stabilizing that part alone?

@utpilla
Copy link
Contributor Author

utpilla commented Feb 27, 2024

What is the benefit of stabilizing that part alone?

Even as a stand-alone feature, this is very useful. Here are two major benefits of using this feature:

  • It retains the correct gross summaries (such as Sum, Count, etc.) for every metric which would have otherwise been lost when the application exceeds the cardinality limit set by the SDK. This ensures that for any metric, data aggregated across all dimension combinations is always reliable.
  • It helps the user calibrate the cardinality limit to their needs. The presence of this attribute in the exported data is an indication that the application has exceeded the cardinality limit. Users can use this information to fine tune the cardinality limit when configuring the SDK.

@jmacd
Copy link
Contributor

jmacd commented Feb 27, 2024

@utpilla I agree with your reasoning -- there can be disagreement over how the behavior is implemented, but the attribute meaning is well defined either way, I think, and I agree that we can stabilize it.

@reyang reyang added the triaged-needmoreinfo The issue is triaged - the OTel community needs more information to decide label Feb 28, 2024
@reyang
Copy link
Member

reyang commented Feb 28, 2024

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR.
    Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result.
    Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)

  2. Provide an end-to-end story (how to use Prom/Grafana) @utpilla I will reach out to you offline.

@reyang
Copy link
Member

reyang commented Feb 28, 2024

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR.
    Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result.
    Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)

FYI @cijothomas @jack-berg @jmacd @jsuereth - I sent a PR #3912 trying to address 1).

@trask
Copy link
Member

trask commented Feb 28, 2024

the SDK can choose to support more (e.g. one more than the limit)

what does it mean to support more than one limit? thanks

@reyang
Copy link
Member

reyang commented Feb 28, 2024

the SDK can choose to support more (e.g. one more than the limit)

what does it mean to support more than one limit? thanks

It is not about more than one limit, it is about "the SDK can offer more room if there is a good reason", for example, the user can set the limit to 10 and the SDK can decide to offer 12 (or whatever number as long as it is greater than or equal to 10).

@trask
Copy link
Member

trask commented Feb 29, 2024

if there is a good reason

do we have a reason in mind already? it may be worth mentioning it in the PR, since there's a downside to not respecting the users requested limit (it could be a confusing user experience if they set the cardinality limit to one value but they see higher cardinality being reported to their backend)

@reyang
Copy link
Member

reyang commented Feb 29, 2024

if there is a good reason

do we have a reason in mind already? it may be worth mentioning it in the PR, since there's a downside to not respecting the users requested limit (it could be a confusing user experience if they set the cardinality limit to one value but they see higher cardinality being reported to their backend)

I can share some of the thinking here (the PR description already links to this issue):

  1. SDK cardinality limit is more focusing on preventing indefinite amount of memory and DDOS attack.
  2. In most cases, the user should not expect to use the SDK cardinality limit as the upper limit for their backend. The SDK cardinality limit is a limit which governs the metrics points during a collection cycle, across multiple collection cycles there could be higher cardinality (e.g. in DELTA Aggregation Temporality mode, one collection cycle could give totally different attribute sets than previous collection cycles).
  3. For high perf scenario, certain implementations (e.g. Rust) might want to reserve some metric points for special cases (e.g. when a measurement is being reported without any attribute key-value pair), so the output might have +1 or +2 situation.
  4. For high perf scenario, certain implementations (e.g. C/C++) might want to allocate 2^N metric points for faster lookup / memory manipulation.

And more importantly, removing the "is one less than the limit, as a result" wording makes it easier for the user. Imagine the users just want to report a value without any attribute, they have to understand and set the limit to 2 in order to get the correct result (the aggregated value, without any attribute). If they set the limit to 1, they will always end up getting the overflow point.

@reyang
Copy link
Member

reyang commented Mar 1, 2024

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR.
    Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result.
    Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)
  2. Provide an end-to-end story (how to use Prom/Grafana) @utpilla I will reach out to you offline.
  1. is done via Formalize the overflow and cardinality limit interaction #3912.
  2. is in progress per my offline discussion with @utpilla.

@utpilla
Copy link
Contributor Author

utpilla commented Mar 2, 2024

The TC discussed this issue during the specification issue triage meeting. Here are the two things that need to be done before we mark the issue as triage-accepted:

  1. The wording needs to be updated - @reyang will send a PR.
    Current wording: The maximum number of distinct, non-overflow attributes is one less than the limit, as a result.
    Proposal: make it formal (SHOULD/MUST), clarify that this is a lower limit, and the SDK can choose to support more (e.g. one more than the limit)
  2. Provide an end-to-end story (how to use Prom/Grafana) @utpilla I will reach out to you offline.

@jack-berg @jmacd @jsuereth @reyang @trask

For the second point, please check this README file. I have put up some text explaining how the overflow attribute plays out with the Prometheus/Grafana workflow and how it adds value for the users. Let me know if you have any questions or suggestions.

@dashpole
Copy link
Contributor

dashpole commented Mar 4, 2024

Since this is a stabilization tracking issue, there are a few open issues related to cardinality limits:

#3578 in particular seems important to resolve before marking this stable.

@dashpole
Copy link
Contributor

dashpole commented Mar 4, 2024

I saw the TC notes. Is there any particular feedback related to Prometheus we want?

The current spec has drawbacks for prometheus users, but these likely apply to non-prometheus users as well:

  • Overflow doesn't appear in queries when summing by labels, which is very common
  • It isn't possible to make a generic alert for overflow for any metric. You would need to make one alert per-metric.
  • Overflow metrics will satisfy negative filter statements (e.g. status_code != 200), so it may result in strange error rates when overflow occurs.

@fstab and @gouthamve have provided a helpful feedback in the past, and may know scenarios I haven't thought of, or have suggestions for how to handle cardinality overflow.

@jack-berg
Copy link
Member

It isn't possible to make a generic alert for overflow for any metric. You would need to make one alert per-metric.

Can you elaborate on this? It seems possible to write a prometheus query that searches across all metric names with a particular label. Is it prohibitively expensive to run this query for alerting purposes? Or something else?

@dashpole
Copy link
Contributor

Thanks @jack-berg. You are correct, and I didn't realize that was possible.

@jsuereth
Copy link
Contributor

I was assuming this was prohibitively expensive. If not, I think we have a way forward here for stabilzation.

@dashpole
Copy link
Contributor

I'm not sure if it would be prohibitively expensive or not.

@bboreham
Copy link

Sorry I have not studied the complete back-story, but an idea came to me:

In this text:

An overflow attribute set is defined, containing a single attribute otel.metric.overflow having (boolean) value true, which is used to report a synthetic aggregation of the Measurements that could not be independently aggregated because of the limit.

if the word "single" is deleted, the other attributes could be replaced with an overflow value and Prometheus queries/joins/etc would now work.

In other words, instead of:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true"} 100

Prometheus would see:

mycounter{a="hello" b="world"} 2
mycounter{a="bar" b="foo"} 3
mycounter{a="" b="foo"} 3
mycounter{otel.metric.overflow="true" a="overflow", b="overflow"} 100

@bwplotka
Copy link

Yup, querying across all metric names can be tricky/impossible/expensive for some backends, but Prometheus should be generally fine if queried correctly (e.g. just {otel.metric.overflow="true"}).

However, what @bboreham figured out would be small but amazing modification. It should mitigate the big challenges with the overflow logic @dashpole mentioned:

  • Overflow doesn't appear in queries when summing by labels, which is very common
  • Overflow metrics will satisfy negative filter statements (e.g. status_code != 200), so it may result in strange error rates when overflow occurs.

The slight disadvantage is question what to do if the target has multiple set of different label names on single metric, but perhaps including all labels (dimensions) that overflows on single overflow series might do the trick.

Also "..." could be a great overflow label value here as well 🤔 mycounter{otel.metric.overflow="true" a="...", b="..."} 100

@SuperQ
Copy link

SuperQ commented Mar 14, 2024

Please do not invent new conventions that do not follow Prometheus best practices.

In the event of a data error in a collector, best practice is to fail the whole collection with a 5xx error.

This follows the "Fail Fast" best practice.

@jsuereth
Copy link
Contributor

@SuperQ I believe what happens today in Prometheus (with client lirbaries) is the process itself crashes (eventually) on high cardanility. OTEL deemed this unacceptable, which is why we suppress/limit cardinality client side.

If you have links to other ways prometheus deals with things client-side happy to follow the best practice. But I would not say there is one today.

@jsuereth
Copy link
Contributor

@bboreham @bwplotka I do like what you're suggesting with setting the value of labels to "oveflow". It's not entirely feasible in OTEL because we don't have string labels, we have typed attributes (so a label could be an int, e.g.). I think this is why we just add the new label here. I do really like your suggestion though.

In this case, I think our assumption is that in prometheus the label will have an empty value (e..g overflow="true", a="", b=""). We assume (via best practice) the a/b will have real values in most cases, so an empty label will already stand out as much as a "..." would.

@jack-berg / @reyang do you remember all the rationale we had when we decided on just one new label? Is it just the typed-attribute issue?

@dashpole
Copy link
Contributor

One option would be for prometheus exporters to fail the scrape if the overflow attribute is set on any metrics. But unlike with the prometheus server's scrape limits, you wouldn't be able to mitigate it by changing a server-side config. You would need to restart your client with a higher overflow limit.

An alternative would be to recommend to users that they set their scrape limits on the prometheus server lower than the client-side limits so they still get the scrape failures, but also have the ability to mitigate via server config.

@SuperQ
Copy link

SuperQ commented Mar 14, 2024

@dashpole, yup, exactly my thoughts. If you handle sample limits in the library you want to fail the scrape. But as you point out, it's now the client's issue to deal with it. But that sounds like it's working as intended.

In Prometheus, we leave it up to users to decide what cardinality they want from the client libraries. Like you say, Prometheus can handle cardinality limits on the server side with admin configured scrape limits.

In practice we've found users tend to notice on the server side long before it becomes a client-side issue. Since the monitoring system has to deal with the total cardinality of all instances, it tends to show up there first.

The Prometheus client libraries use very few bytes of memory per metric exposed. I can get some benchmark tests of this if it helps.

@jsuereth
Copy link
Contributor

@dashpole This doesn't fix the fact that they will consume memory storing metrics client side.

@dashpole
Copy link
Contributor

@jsuereth sorry for the confusion. I'm not suggesting we remove client-side limits, and I think those are useful. I meant to say that we could keep the current client-side limits and attribute, and tell prometheus users to set scrape limits below our default value as a way to get the prometheus-recommended behavior. That way, we would not need to introduce scrape failures in our exporter, which could cause the problem described above for users.

@SuperQ I trust your assertiib regarding low memory usage of prometheus client libraries. I would read this proposal as adding an additional layer of protection for users to prevent DoS, not as a replacement for server-specified limits. I agree that Prometheus users will generally notice cardinality problems in the server before the client, assuming they are scraping the endpoint.

@SuperQ
Copy link

SuperQ commented Mar 14, 2024

Nerd-sniped myself into writing a quick test. I generated 100k and 1M tests with metrics like test_cardinality_total{test_label="zzy9RevCcI6lnBy4gzX02Antnt9ur9Ux"}.

  • 100k = 15MiB = ~157 bytes/metric pprof.me
  • 1M = 198MiB = ~207 bytes/metric pprof.me

Code:

package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/thanhpk/randstr"
)

func main() {
	fmt.Println("Generating metrics")
	metricTest := promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "test_cardinality_total",
			Help: "Test for cardinality",
		},
		[]string{"test_label"},
	)

	for i := 0; i < 100_000; i++ {
		label := randstr.String(32)
		fmt.Printf("Generating metric %d %s\n", i, label)
		metricTest.WithLabelValues(label).Inc()
	}

	http.Handle("/metrics", promhttp.Handler())
	fmt.Println("Listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

@reyang
Copy link
Member

reyang commented Mar 14, 2024

@bboreham @bwplotka I do like what you're suggesting with setting the value of labels to "oveflow". It's not entirely feasible in OTEL because we don't have string labels, we have typed attributes (so a label could be an int, e.g.). I think this is why we just add the new label here. I do really like your suggestion though.

In this case, I think our assumption is that in prometheus the label will have an empty value (e..g overflow="true", a="", b=""). We assume (via best practice) the a/b will have real values in most cases, so an empty label will already stand out as much as a "..." would.

@jack-berg / @reyang do you remember all the rationale we had when we decided on just one new label? Is it just the typed-attribute issue?

@jsuereth @jack-berg @bboreham It is not. Imagine this:

mycounter.add(2, a="hello", b="world");
mycounter.add(3, a="bar", b="foo");
mycounter.add(3, a="", b="foo");

// now we start getting DDOS attack or the code owner made a mistake while calling the API

mycounter.add(1, x1="1");
mycounter.add(1, x2="1");
mycounter.add(1, x3="1");

// an export happened, with overflow stream = mycounter{otel.metric.overflow="true" x1="overflow", x2="overflow", x3="overflow"} 3

mycounter.add(1, x4="1");
mycounter.add(1, x5="1");
mycounter.add(1, x6="1");

// an export happened, with overflow stream = mycounter{otel.metric.overflow="true" x1="overflow", x2="overflow", x3="overflow", x4="overflow", x5="overflow", x6="overflow"} 6
...
...

// until something explode in the SDK or the agent/backend

@SuperQ
Copy link

SuperQ commented Mar 14, 2024

Yup, that's a good DoS. And I'm also not suggesting that you avoid implementing client-side protections.

It's just that there is a clearly defined best practice for exposing the result of those protections. Which is to fail closed / fail fast. This provides the best "principal of least surprise" when it comes to data.

Having a partial result / soft fail provides a more difficult to cognitively and programatically handle observability result.

@reyang
Copy link
Member

reyang commented Mar 15, 2024

The TC discussed metrics cardinality limits and overflow attribute during the Mar. 13th, 2024 meeting.

The TC agreed to unblock #3905, the short summary will be added to the cardinality limits section in the spec to clarify the roadmap. Once the PR is merged, @reyang will publish a blog post clarify how OpenTelemetry plans to handle cardinality limits, and what should the users expect from experience perspective.

@reyang reyang added triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal and removed triaged-needmoreinfo The issue is triaged - the OTel community needs more information to decide labels Mar 15, 2024
@svrnm
Copy link
Member

svrnm commented Apr 22, 2024

@open-telemetry/technical-committee please take another look at this one, the PR #3905 was closed since it stalled, is this still something we want to accept or should we move it back to "deciding" or even reject it? (cc @jpkrohling @danielgblanco)

@reyang
Copy link
Member

reyang commented May 22, 2024

@open-telemetry/technical-committee please take another look at this one, the PR #3905 was closed since it stalled, is this still something we want to accept or should we move it back to "deciding" or even reject it? (cc @jpkrohling @danielgblanco)

Yes, I'll be working on it. We were focusing on getting Exemplars stable #3870 that's why this got delayed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:metrics Related to the specification/metrics directory triage:accepted:ready-with-sponsor
Projects
Status: Spec - Accepted
Development

Successfully merging a pull request may close this issue.