[prometheus exporter] Metric value overflow on metrics_expiration #6935

Closed
albertteoh opened this issue Dec 22, 2021 · 5 comments · Fixed by #7043
Labels
bug (Something isn't working), comp:prometheus (Prometheus related issues)

Comments

@albertteoh
Contributor

albertteoh commented Dec 22, 2021

Describe the bug
When a metric expires, the value for the metric appears to overflow, jumping to 9223372036854775808 (2^63, i.e. one more than the maximum int64 value).

Steps to reproduce

  1. Start otel collector with provided config below.
  2. Start HotROD example: docker run --rm --network="host" --env JAEGER_AGENT_HOST=localhost --env JAEGER_AGENT_PORT=6835 -p8080-8083:8080-8083 jaegertracing/example-hotrod:latest all
    1. This will emit spans to the otel collector on port 6835.
  3. Click on a button on HotROD example to send a trace.
  4. Check that the latency_bucket metrics are correct for the first minute of data, and make a note of which bucket has count > 0. For example, where the metric has label le = "250".
    1. Example query: latency_bucket{service_name = "driver", le="250"}

What did you expect to see?

After a minute (on metric expiry), the metric with le = "250" (example query latency_bucket{service_name = "driver", le="250"}) should no longer be query-able (a null value).

What did you see instead?

After a minute (on metric expiry), the metric with le = "250" (example query latency_bucket{service_name = "driver", le="250"}) will jump to 9223372036854775808.

What version did you use?
Version: main branch

What config did you use?
Config:

receivers:
  jaeger:
    protocols:
      thrift_binary:
        endpoint: "localhost:6836"
      thrift_compact:
        endpoint: "localhost:6835"
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: :12345
  otlp:
    protocols:
      grpc:
        endpoint: :4317
  prometheus:
    config:
      global:
        external_labels:
          p8s_logzio_name: spm-demo-otel
      scrape_configs: 
      - job_name: 'atm'
        scrape_interval: 15s
        static_configs:
        - targets: [ "0.0.0.0:8889" ]
exporters:
  jaeger:
    endpoint: "localhost:14250"
    tls:
      insecure: true
  prometheus:
    endpoint: "localhost:8889"
    metric_expiration: 1m
  logging:
    loglevel: debug
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    latency_histogram_buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 1000ms, 10000ms, 100000ms, 1000000ms]
    # Additional list of dimensions on top of:
    # - service.name
    # - operation
    # - span.kind
    # - status.code
    dimensions:
      # If the span is missing http.method, the processor will insert
      # the http.method dimension with value 'GET'.
      # For example, in the following scenario, http.method is not present in a span and so will be added as a dimension to the metric with value "GET":
      # - promexample_calls{http_method="GET",operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      - name: http.method
        default: GET
      # If a default is not provided, the http.status_code dimension will be omitted
      # if the span does not contain http.status_code.
      # For example, consider a scenario with two spans, one span having http.status_code=200 and another missing http.status_code. Two metrics would result with this configuration, one with the http_status_code omitted and the other included:
      # - promexample_calls{http_status_code="200",operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      # - promexample_calls{operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      - name: http.status_code    
extensions:
  pprof:
    endpoint: :1777
  zpages:
    endpoint: :55679
  health_check:
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics,batch]
      exporters: [jaeger, logging]
    metrics/spanmetrics:
      # This receiver is just a dummy and never used.
      # Added to pass validation requiring at least one receiver in a pipeline.
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
    metrics:
      receivers: [otlp,prometheus]
      exporters: [logging]      
  telemetry:
    logs:
      level: "debug"

Environment
OS: "Ubuntu 20.04"
Compiler (if manually compiled): "go 1.17.5"

Additional context

  • Is there something the spanmetrics processor should be doing to avoid this problem?
  • Is it perhaps a config or a prometheus query issue?
  • Is this expected behaviour and is the recommendation to simply increase the metric_expiration duration?

The following screenshots illustrate the problem for the default 5m metric_expiration configuration:

Before Expiry
before-expiry

After Expiry
after-expiry

The logs also reflect the above screenshots, showing the correct metrics initially, then suddenly jumping to a very large count:

  • Initially, the le = "250" bucket has a count of 3.
  • 5 mins later without sending any traces, a number of logs indicate the expiry of the metrics.
  • At this time, the le = "250" bucket jumps to a very large number, which appears to be MaxInt64 + 1 (see the quick check after this list).
  • Other buckets remain at 0.
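
For reference, a quick check (a standalone Go snippet, not collector code) confirms that the observed count is exactly 2^63, i.e. MaxInt64 + 1:

package main

import (
	"fmt"
	"math"
)

func main() {
	// The count observed after expiry, taken verbatim from the logs in this issue.
	const observed uint64 = 9223372036854775808

	// It is exactly 2^63, one more than the largest value an int64 can hold.
	fmt.Println(observed == 1<<63)                   // true
	fmt.Println(observed == uint64(math.MaxInt64)+1) // true
}
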
albertteoh added the bug (Something isn't working) label on Dec 22, 2021
jpkrohling added the comp:prometheus (Prometheus related issues) label on Dec 27, 2021
@jpkrohling
Member

@gouthamve, is this something you can help with?

@albertteoh
Contributor Author

albertteoh commented Dec 28, 2021

Thanks @jpkrohling. 🙏🏼

I have more evidence to suggest this is a bug in the prometheusremotewrite exporter (I've updated the issue description with this exporter name), because I ran a local Prometheus server off the same pipeline, scraping the prometheus exporter's data, and it looks fine.

The following are screenshots of exactly the same data, with identical queries and timestamp ranges.

The first screenshot is from a locally running Grafana querying a local Prometheus server that scrapes data from my local OTEL collector's prometheus exporter.

The second screenshot is from an observability provider (with M3 as the backing metrics store) that the prometheusremotewrite exporter is configured to send metrics to. The visualisation is consistent with the numbers I see in my local otel collector logs, which confirms the problem resides within the prometheusremotewrite exporter, whether it's a bug or a misconfiguration.

prometheus exporter

Looks fine; the data is null after 21:04:30.

Screen Shot 2021-12-28 at 9 13 19 pm

prometheus remote write exporter

Logs show the count jumping to the MaxInt64 + 1 value reported earlier:

Data point attributes:
     -> http_method: STRING(GET)
     -> http_status_code: STRING(200)
     -> operation: STRING(HTTP GET /route)
     -> p8s_logzio_name: STRING(spm-demo-otel)
     -> service_name: STRING(route)
     -> span_kind: STRING(SPAN_KIND_SERVER)
     -> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:03:16.465 +0000 UTC
Timestamp: 2021-12-28 10:04:16.465 +0000 UTC
Count: 50
Sum: 2652.758000
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 50
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
...
2021-12-28T10:04:31.166Z	debug	prometheusexporter@v0.41.0/accumulator.go:246	metric expired: latency, deleting key: Histogram*spanmetricsprocessor*latency*http.method*GET*http.status_code*200*operation*HTTP GET /route*service.name*route*span.kind*SPAN_KIND_SERVER*status.code*STATUS_CODE_UNSET	{"kind": "exporter", "name": "prometheus"}
...
Data point attributes:
     -> http_method: STRING(GET)
     -> http_status_code: STRING(200)
     -> operation: STRING(HTTP GET /route)
     -> p8s_logzio_name: STRING(spm-demo-otel)
     -> service_name: STRING(route)
     -> span_kind: STRING(SPAN_KIND_SERVER)
     -> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:04:31.465 +0000 UTC
Timestamp: 2021-12-28 10:04:31.465 +0000 UTC
Count: 9223372036854775808
Sum: NaN
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 9223372036854775808
Buckets #1, Count: 9223372036854775808
Buckets #2, Count: 9223372036854775808
Buckets #3, Count: 9223372036854775808
Buckets #4, Count: 9223372036854775808
Buckets #5, Count: 9223372036854775808
Buckets #6, Count: 9223372036854775808
Buckets #7, Count: 9223372036854775808
Buckets #8, Count: 9223372036854775808
Buckets #9, Count: 9223372036854775808
Buckets #10, Count: 9223372036854775808
Buckets #11, Count: 9223372036854775808
...

Screen Shot 2021-12-28 at 9 13 52 pm

albertteoh changed the title from "[prometheusexporter] Metric value overflow on metrics_expiration" to "[prometheusremotewrite exporter] Metric value overflow on metrics_expiration" on Dec 28, 2021
@Aneurysm9
Member

I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the prometheus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.

It looks like what is happening here is that the data flow is something like spanmetrics -> prometheus exporter -> prometheus receiver -> logging. The prometheus exporter is correctly expiring the metrics and stops emitting them. At that point the Prometheus scrape manager used by the prometheus receiver emits a signalling NaN value to indicate that the metric is stale. Because of a data model mismatch between Prometheus and OTLP, that value cannot be properly represented for bucket and distribution counts, since OTLP uses integers for those values.

There is a flag in the OTLP data model that is intended to convey the same information, but it is not currently set by the prometheus receiver. The prometheusremotewrite exporter has been updated to recognize this flag and emit the Prometheus signalling NaN. There is a PR to do the same for the prometheus exporter (or, more correctly, to delete metrics with that flag, which will cause the scraper retrieving those metrics to emit the signalling NaN as appropriate). We had intended to update the prometheus receiver to begin setting those flags once they were correctly handled in both exporters.
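
To make that concrete, here is a minimal sketch (not the actual collector code) of the receiver-side handling described above: detect the staleness marker with the helper Prometheus itself provides (value.IsStaleNaN from github.com/prometheus/prometheus/pkg/value; newer Prometheus versions move it to model/value) and flag the data point as having no recorded value instead of letting the NaN be coerced into integer bucket counts. The setNoRecordedValue callback stands in for the pdata flag setter, whose exact name varies across collector versions.

package main

import (
	"fmt"
	"math"

	"github.com/prometheus/prometheus/pkg/value"
)

// markStaleIfNeeded sketches the handling described above: a scraped sample
// whose value is the Prometheus staleness NaN should not be written into
// integer histogram counts; the data point should be flagged as having no
// recorded value instead (FLAG_NO_RECORDED_VALUE in the OTLP protobuf).
func markStaleIfNeeded(sample float64, setNoRecordedValue func()) bool {
	if value.IsStaleNaN(sample) {
		setNoRecordedValue()
		return true
	}
	return false
}

func main() {
	staleSample := math.Float64frombits(value.StaleNaN)
	stale := markStaleIfNeeded(staleSample, func() {
		fmt.Println("flagging data point: no recorded value")
	})
	fmt.Println("stale:", stale) // stale: true

	// A consumer that instead coerces the NaN into a uint64 count ends up
	// with a garbage value such as the 9223372036854775808 seen in the logs.
}

On the export side, the inverse applies: the prometheusremotewrite exporter writes the staleness NaN back out when it sees the flag, while the prometheus exporter simply deletes the series so the scraper produces the staleness marker itself.
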

@Aneurysm9
Member

@albertteoh can you try building a collector from this branch and let me know if the issue persists?

albertteoh changed the title from "[prometheusremotewrite exporter] Metric value overflow on metrics_expiration" to "[prometheus receiver] Metric value overflow on metrics_expiration" on Dec 29, 2021
albertteoh changed the title from "[prometheus receiver] Metric value overflow on metrics_expiration" to "[prometheus exporter] Metric value overflow on metrics_expiration" on Dec 29, 2021
@albertteoh
Contributor Author

albertteoh commented Dec 29, 2021

I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the promethus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.

Ah, sorry, silly mistake on my part. I had originally titled it with the prometheus exporter, but I jumped to an incorrect conclusion in my comment, especially given that the logs are in the same export group as the prometheusremotewrite exporter. 🤦🏼‍♂️ I've reverted the title change for posterity.

can you try building a collector from this branch and let me know if the issue persists?

I tried your branch and it looks good 👍🏼

Thanks very much for the quick turnaround, @Aneurysm9!
