
count connector is not working with prometheus exporter: delta not accumulated #30203

Closed

jiekun opened this issue Dec 25, 2023 · 16 comments

@jiekun
Member

jiekun commented Dec 25, 2023

Component(s)

connector/count, exporter/prometheus

What happened?

Description

When using the count connector with the prometheus exporter, I always observe a delta value in the Prometheus output.

Steps to Reproduce

  1. Set up a script that reports 2 spans (a minimal sketch is shown after this list).
  2. Set up an OTel Collector with any receiver + the count connector + the prometheus exporter.
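
A minimal sketch of such a script (assuming the OpenTelemetry Go SDK and the collector's OTLP gRPC receiver on localhost:4317; any SDK, or even a simple Zipkin POST, would work just as well):

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    ctx := context.Background()

    // Export spans to the collector's OTLP gRPC receiver (assumed to be at localhost:4317).
    exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    defer func() { _ = tp.Shutdown(ctx) }()
    otel.SetTracerProvider(tp)

    // Report 2 spans per run; the count connector should count them.
    tracer := otel.Tracer("count-connector-repro")
    for i := 0; i < 2; i++ {
        _, span := tracer.Start(ctx, "test-span")
        span.End()
    }
}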

Now run the script. 2 spans are reported, and the metric should look like:

trace_span_count_total{job="xxx",service_name="xxx"} 2

Run the script again. 2 more spans are reported, and I expect to see count = 4, however:

trace_span_count_total{job="xxx",service_name="xxx"} 2

Then I add a for-loop to the script so that it reports 20 spans:

trace_span_count_total{job="xxx",service_name="xxx"} 20

Finally, I reduce the loop count so that only 2 spans are reported, and the metric value drops back down to 2:

trace_span_count_total{job="xxx",service_name="xxx"} 2

Based on the info above, I believe a delta value is being exposed, which is not the expected behavior.

Debug

I ran the collector locally and found these clues:

  1. The count connector sends out delta values, which are expected to be accumulated by downstream components (processor / exporter).
  2. The prometheus exporter does the following in its accumulateSum function:
    • If a metric has never been seen, it is stored in a sync.Map: code.
    • If a metric is already present in the sync.Map, the exporter tries to accumulate it (for delta temporality only): code.

The issue happens at L204:

  • The StartTimestamp of a count connector output metric is 0 (code), which is != the end timestamp of the previously stored metric (in the sync.Map).
// Delta-to-Cumulative
if doubleSum.AggregationTemporality() == pmetric.AggregationTemporalityDelta &&
    ip.StartTimestamp() == mv.value.Sum().DataPoints().At(0).Timestamp() {
    // accumulate the delta onto the previously stored value
}

// otherwise: store the new delta as-is
So the exporter stores the delta value without accumulating it.
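
To make the failing check concrete, here is a simplified illustration (not the actual prometheusexporter code) of the condition on that branch, assuming timestamps are uint64 nanoseconds as in pdata:

package sketch

// point is a simplified stand-in for a Sum data point.
type point struct {
    startTS uint64  // StartTimestamp of the delta interval
    ts      uint64  // Timestamp (end) of the delta interval
    value   float64 // delta value
}

// shouldAccumulate mirrors the exporter's delta-to-cumulative condition:
// accumulate only when the incoming point's StartTimestamp equals the
// previously stored point's Timestamp.
func shouldAccumulate(incoming, stored point) bool {
    return incoming.startTS == stored.ts
}

// The count connector emits points with startTS == 0, so for any stored
// point with a non-zero Timestamp this returns false and the exporter
// overwrites the stored value with the new delta instead of adding to it.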

I'm not sure whether this debug info is right, since it's my first time reading these components.

Collector version

v0.91.0

Environment information

Environment

OS: Ubuntu 22.04

OpenTelemetry Collector configuration

receivers:
  zipkin:
    endpoint: 0.0.0.0:9411
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    send_batch_size: 300
    timeout: 5s

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 5m
    resource_to_telemetry_conversion:
      enabled: true
  
  debug:
    verbosity: detailed
    sampling_initial: 10
    sampling_thereafter: 100
    
  otlphttp/apm:
    endpoint: :4317
    tls:
      insecure: true

connectors:
  count:

service:
  pipelines:
    traces:
      receivers: [zipkin, otlp]
      exporters: [otlphttp/apm, count, debug]
      processors: [batch]
    metrics:
      receivers: [count]
      exporters: [prometheus]

Log output

No response

Additional context

No response

@jiekun jiekun added bug Something isn't working needs triage New item requiring triage labels Dec 25, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@JaredTan95
Member

This looks like a bug. Can you follow your ideas, try to fix it, and open a PR?

@JaredTan95 JaredTan95 removed the needs triage New item requiring triage label Dec 25, 2023
@jiekun
Member Author

jiekun commented Dec 25, 2023

This looks like a bug. Can you follow your ideas, try to fix it, and open a PR?

Sure. I would like to confirm with @djaglowski: should we cache ts data in the connector?

I think we can hold this for now until most maintainers are back from the Christmas holiday. The code changes should be easy, but we need to align on the direction first.

@JaredTan95
Member

JaredTan95 commented Dec 25, 2023

should we cache ts data in the connector?

You mean something like the cache in servicegraphprocessor? If we introduce that, we have to consider how to route spans belonging to one trace to the same OTel Collector instance. (Just a friendly reminder.)

@djaglowski
Member

This is not a bug. The connector emits a delta count intentionally so that it does not have to maintain state.

See #29300

@jiekun
Member Author

jiekun commented Dec 25, 2023

This is not a bug. The connector emits a delta count intentionally so that it does not have to maintain state.

See #29300

I understand that from the code and comments. But is there any way to expose those counters to Prometheus? The exporter requires a start timestamp and does not accumulate the delta correctly, so it behaves like a gauge (or rather, just a delta).

So this issue aims to figure out which component needs to change, the connector or the exporter; otherwise those delta values cannot be used correctly.

@jiekun
Member Author

jiekun commented Dec 25, 2023

Oh, before we dive deeper, may I ask how people use this connector?

It would be really appreciated if you could share a little more information on how to turn it into a counter; then we may not need to "fix" it :)

@djaglowski
Member

The solution we need is a processor which aggregates delta metrics into cumulative. It's a general problem which shouldn't be handled only by one connector or exporter.
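
For illustration, a simplified sketch of what such a processor might do (not the actual deltatocumulative implementation): keep a running sum per stream identity and rewrite each incoming delta point as a cumulative total.

package deltatocumulativesketch

import "time"

// streamID is a hypothetical key; a real implementation would derive it from
// the metric name, attributes, and resource.
type streamID string

type state struct {
    total float64
    start time.Time // start of the cumulative series, taken from the first delta seen
}

type aggregator struct {
    streams map[streamID]*state
}

func newAggregator() *aggregator {
    return &aggregator{streams: make(map[streamID]*state)}
}

// accumulate adds a delta data point to the running total for its stream and
// returns the cumulative value (and series start time) to emit downstream.
func (a *aggregator) accumulate(id streamID, delta float64, ts time.Time) (float64, time.Time) {
    s, ok := a.streams[id]
    if !ok {
        s = &state{start: ts}
        a.streams[id] = s
    }
    s.total += delta
    return s.total, s.start
}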

@jiekun
Member Author

jiekun commented Dec 26, 2023

The solution we need is a processor which aggregates delta metrics into cumulative. It's a general problem which shouldn't be handled only by one connector or exporter.

Thank you, I have no further questions regarding this. Perhaps we can keep this issue open for further follow-up on the new processor?

Also, thank you for responding during the holiday season. Enjoy your vacation!

@JaredTan95 JaredTan95 removed the bug Something isn't working label Dec 26, 2023
@djaglowski
Member

Thanks for bringing this up @jiekun. I hope we'll make progress on the processor in the near term and agree we should keep this open until a solution is in place.

@pepov

pepov commented Jan 11, 2024

@djaglowski do you happen to know if someone is already working on a processor like that?

@djaglowski
Member

djaglowski commented Jan 11, 2024

A delta-to-cumulative processor was proposed and accepted in #29300. As part of the discussion we agreed to split #29461 out into a separate processor. These would often be used together, but we identified at least one case for using only one or the other. As recently as yesterday, @0x006EA1E5 expressed intent to work on these, but potentially someone else could jump in and begin work on one or the other sooner. Probably best if @0x006EA1E5 could weigh in here.

@jiekun
Member Author

jiekun commented Jan 12, 2024

I will keep watching this issue as well and provide help if we need more hands.

@0x006EA1E5

I do intend to work on this, but I will be somewhat time-constrained for the foreseeable future. I don't want to block anyone else who wants to work on it.

There are two proposed processors here, so there is at least some natural parallelism.

I guess the first steps will be to define the spec (what to do in edge cases, etc.), and it looks to me like there is some common code already in other similar processors that should be factored out.

@OverOrion
Contributor

Hey there!

Just an FYI: @sh0rez's PR #30707 works for accumulating the deltas; with it, the countconnector -> deltatocumulativeprocessor -> prometheusexporter pipeline works as intended.
Kudos to them!

Config snippet:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:9999"
    resource_to_telemetry_conversion:
      enabled: true
  debug:
    verbosity: detailed
    sampling_initial: 1
    sampling_thereafter: 1

connectors:
  count/1:
    metrics:
    logs:
      my.log.label_value.count:
        description: Keys are the label values
        attributes:
          - key: label
processors:
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["label"], resource.attributes["label"])
  deltatocumulative:

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    metrics:
      receivers: [count/1]
      processors: [deltatocumulative]
      exporters: [prometheus, debug]
    logs:
      receivers: [otlp]
      processors: [transform]
      exporters: [count/1]

@jpkrohling
Member

I'm closing this, but if you think there's still something to be done in the count connector, let me know and I'll reopen it.
