[prometheus exporter] Metric value overflow on metrics_expiration #6935

Closed
albertteoh opened this issue Dec 22, 2021 · 5 comments · Fixed by #7043
Labels
bug (Something isn't working), comp:prometheus (Prometheus related issues)

Comments

@albertteoh
Contributor

albertteoh commented Dec 22, 2021

Describe the bug
When a metric expires, the value for the metric appears to overflow, jumping to 9223372036854775808 (2^63, i.e. one more than the maximum int64 value).

Steps to reproduce

  1. Start otel collector with provided config below.
  2. Start HotROD example: docker run --rm --network="host" --env JAEGER_AGENT_HOST=localhost --env JAEGER_AGENT_PORT=6835 -p8080-8083:8080-8083 jaegertracing/example-hotrod:latest all
    1. This will emit spans to the otel collector on port 6835.
  3. Click on a button on HotROD example to send a trace.
  4. Check that the latency_bucket metrics are correct for the first minute of data, and make a note of which bucket has count > 0. For example, where the metric has label le = "250".
    1. Example query: latency_bucket{service_name = "driver", le="250"}

What did you expect to see?

After a minute (on metric expiry), the metric with le = "250" (example query latency_bucket{service_name = "driver", le="250"}) should no longer be query-able (a null value).

What did you see instead?

After a minute (on metric expiry), the metric with le = "250" (example query latency_bucket{service_name = "driver", le="250"}) will jump to 9223372036854775808.

What version did you use?
Version: main branch

What config did you use?
Config:

receivers:
  jaeger:
    protocols:
      thrift_binary:
        endpoint: "localhost:6836"
      thrift_compact:
        endpoint: "localhost:6835"
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: :12345
  otlp:
    protocols:
      grpc:
        endpoint: :4317
  prometheus:
    config:
      global:
        external_labels:
          p8s_logzio_name: spm-demo-otel
      scrape_configs: 
      - job_name: 'atm'
        scrape_interval: 15s
        static_configs:
        - targets: [ "0.0.0.0:8889" ]
exporters:
  jaeger:
    endpoint: "localhost:14250"
    tls:
      insecure: true
  prometheus:
    endpoint: "localhost:8889"
    metric_expiration: 1m
  logging:
    loglevel: debug
processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    latency_histogram_buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 1000ms, 10000ms, 100000ms, 1000000ms]
    # Additional list of dimensions on top of:
    # - service.name
    # - operation
    # - span.kind
    # - status.code
    dimensions:
      # If the span is missing http.method, the processor will insert
      # the http.method dimension with value 'GET'.
      # For example, in the following scenario, http.method is not present in a span and so will be added as a dimension to the metric with value "GET":
      # - promexample_calls{http_method="GET",operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      - name: http.method
        default: GET
      # If a default is not provided, the http.status_code dimension will be omitted
      # if the span does not contain http.status_code.
      # For example, consider a scenario with two spans, one span having http.status_code=200 and another missing http.status_code. Two metrics would result with this configuration, one with the http_status_code omitted and the other included:
      # - promexample_calls{http_status_code="200",operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      # - promexample_calls{operation="/Address",service_name="shippingservice",span_kind="SPAN_KIND_SERVER",status_code="STATUS_CODE_UNSET"} 1
      - name: http.status_code    
extensions:
  pprof:
    endpoint: :1777
  zpages:
    endpoint: :55679
  health_check:
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [spanmetrics,batch]
      exporters: [jaeger, logging]
    metrics/spanmetrics:
      # This receiver is just a dummy and never used.
      # Added to pass validation requiring at least one receiver in a pipeline.
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]
    metrics:
      receivers: [otlp,prometheus]
      exporters: [logging]      
  telemetry:
    logs:
      level: "debug"

Environment
OS: "Ubuntu 20.04"
Compiler (if manually compiled): "go 1.17.5"

Additional context

  • Is there something the spanmetrics processor should be doing to avoid this problem?
  • Is it perhaps a config or a prometheus query issue?
  • Is this expected behaviour and is the recommendation to simply increase the metric_expiration duration?

The following screenshots illustrate the problem for the default 5m metric_expiration configuration:

Before Expiry
before-expiry

After Expiry
after-expiry

The logs also reflect the above screenshots, showing the correct metrics initially, then suddenly jumping to a very large count:

  • Initially, the le = "250" bucket has a count of 3.
  • 5 mins later without sending any traces, a number of logs indicate the expiry of the metrics.
  • At this time, the le = "250" bucket jumps to a very large number, which appears to be MaxInt64 + 1 (see the quick check after this list).
  • Other buckets remain at 0.
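
For reference, a quick check (a standalone Go snippet, not collector code) confirms that the observed count is exactly 2^63, i.e. MaxInt64 + 1:

package main

import (
	"fmt"
	"math"
)

func main() {
	// The count observed after expiry, taken verbatim from the logs in this issue.
	const observed uint64 = 9223372036854775808

	// It is exactly 2^63, one more than the largest value an int64 can hold.
	fmt.Println(observed == 1<<63)                   // true
	fmt.Println(observed == uint64(math.MaxInt64)+1) // true
}
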
albertteoh added the bug (Something isn't working) label on Dec 22, 2021
jpkrohling added the comp:prometheus (Prometheus related issues) label on Dec 27, 2021
@jpkrohling
Member

@gouthamve, is this something you can help with?

@albertteoh
Contributor Author

albertteoh commented Dec 28, 2021

Thanks @jpkrohling. 🙏🏼

I have more evidence to suggest this is a bug in the prometheusremotewrite exporter (I've updated the issue description with this exporter name), because I ran a local Prometheus server off the same pipeline, scraping the prometheus exporter's data, and it looks fine.

The following are screenshots of exactly the same data, with identical queries and timestamp ranges.

The first screenshot is from a locally running Grafana querying a local Prometheus server that scrapes data from my local OTEL collector's prometheus exporter.

The second screenshot is from an observability provider (with M3 as the backing metrics store) that the prometheusremotewrite exporter is configured to send metrics to. The visualisation is consistent with the numbers I see in my local otel collector logs, which confirms the problem resides within the prometheusremotewrite exporter, whether it's a bug or a misconfiguration.

prometheus exporter

Looks fine; the data is null after 21:04:30.

Screen Shot 2021-12-28 at 9 13 19 pm

prometheus remote write exporter

Logs show the count jumping to the MaxInt64 + 1 value reported earlier:

Data point attributes:
     -> http_method: STRING(GET)
     -> http_status_code: STRING(200)
     -> operation: STRING(HTTP GET /route)
     -> p8s_logzio_name: STRING(spm-demo-otel)
     -> service_name: STRING(route)
     -> span_kind: STRING(SPAN_KIND_SERVER)
     -> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:03:16.465 +0000 UTC
Timestamp: 2021-12-28 10:04:16.465 +0000 UTC
Count: 50
Sum: 2652.758000
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 50
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
...
2021-12-28T10:04:31.166Z	debug	prometheusexporter@v0.41.0/accumulator.go:246	metric expired: latency, deleting key: Histogram*spanmetricsprocessor*latency*http.method*GET*http.status_code*200*operation*HTTP GET /route*service.name*route*span.kind*SPAN_KIND_SERVER*status.code*STATUS_CODE_UNSET	{"kind": "exporter", "name": "prometheus"}
...
Data point attributes:
     -> http_method: STRING(GET)
     -> http_status_code: STRING(200)
     -> operation: STRING(HTTP GET /route)
     -> p8s_logzio_name: STRING(spm-demo-otel)
     -> service_name: STRING(route)
     -> span_kind: STRING(SPAN_KIND_SERVER)
     -> status_code: STRING(STATUS_CODE_UNSET)
StartTimestamp: 2021-12-28 10:04:31.465 +0000 UTC
Timestamp: 2021-12-28 10:04:31.465 +0000 UTC
Count: 9223372036854775808
Sum: NaN
ExplicitBounds #0: 2.000000
ExplicitBounds #1: 6.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 100.000000
ExplicitBounds #4: 250.000000
ExplicitBounds #5: 500.000000
ExplicitBounds #6: 1000.000000
ExplicitBounds #7: 10000.000000
ExplicitBounds #8: 100000.000000
ExplicitBounds #9: 1000000.000000
ExplicitBounds #10: 9223372036854.775391
Buckets #0, Count: 9223372036854775808
Buckets #1, Count: 9223372036854775808
Buckets #2, Count: 9223372036854775808
Buckets #3, Count: 9223372036854775808
Buckets #4, Count: 9223372036854775808
Buckets #5, Count: 9223372036854775808
Buckets #6, Count: 9223372036854775808
Buckets #7, Count: 9223372036854775808
Buckets #8, Count: 9223372036854775808
Buckets #9, Count: 9223372036854775808
Buckets #10, Count: 9223372036854775808
Buckets #11, Count: 9223372036854775808
...

Screen Shot 2021-12-28 at 9 13 52 pm

albertteoh changed the title from "[prometheusexporter] Metric value overflow on metrics_expiration" to "[prometheusremotewrite exporter] Metric value overflow on metrics_expiration" on Dec 28, 2021
@Aneurysm9
Member

I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the prometheus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.

It looks like what is happening here is that the data flow is something like spanmetrics -> prometheus exporter -> prometheus receiver -> logging. The prometheus exporter is correctly expiring the metrics and stops emitting them. At that point the Prometheus scrape manager used by the prometheus receiver emits a signalling NaN value to indicate that the metric is stale. Because of a data model mismatch between Prometheus and OTLP, that value cannot be properly represented for bucket and distribution counts, since OTLP uses integers for those values.

There is a flag in the OTLP data model that is intended to convey the same information, but it is not currently set by the prometheus receiver. The prometheusremotewrite exporter has been updated to recognize this flag and emit the Prometheus signalling NaN. There is a PR to do the same for the prometheus exporter (or, more correctly, to delete metrics with that flag, which will cause the scraper retrieving those metrics to emit the signalling NaN as appropriate). We had intended to update the prometheus receiver to begin setting those flags once they were correctly handled in both exporters.
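
To make that concrete, here is a minimal sketch (not the actual collector code) of the receiver-side handling described above: detect the staleness marker with the helper Prometheus itself provides (value.IsStaleNaN from github.com/prometheus/prometheus/pkg/value; newer Prometheus versions move it to model/value) and flag the data point as having no recorded value instead of letting the NaN be coerced into integer bucket counts. The setNoRecordedValue callback stands in for the pdata flag setter, whose exact name varies across collector versions.

package main

import (
	"fmt"
	"math"

	"github.com/prometheus/prometheus/pkg/value"
)

// markStaleIfNeeded sketches the handling described above: a scraped sample
// whose value is the Prometheus staleness NaN should not be written into
// integer histogram counts; the data point should be flagged as having no
// recorded value instead (FLAG_NO_RECORDED_VALUE in the OTLP protobuf).
func markStaleIfNeeded(sample float64, setNoRecordedValue func()) bool {
	if value.IsStaleNaN(sample) {
		setNoRecordedValue()
		return true
	}
	return false
}

func main() {
	staleSample := math.Float64frombits(value.StaleNaN)
	stale := markStaleIfNeeded(staleSample, func() {
		fmt.Println("flagging data point: no recorded value")
	})
	fmt.Println("stale:", stale) // stale: true

	// A consumer that instead coerces the NaN into a uint64 count ends up
	// with a garbage value such as the 9223372036854775808 seen in the logs.
}

On the export side, the inverse applies: the prometheusremotewrite exporter writes the staleness NaN back out when it sees the flag, while the prometheus exporter simply deletes the series so the scraper produces the staleness marker itself.
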

@Aneurysm9
Member

@albertteoh can you try building a collector from this branch and let me know if the issue persists?

albertteoh changed the title from "[prometheusremotewrite exporter] Metric value overflow on metrics_expiration" to "[prometheus receiver] Metric value overflow on metrics_expiration" on Dec 29, 2021
albertteoh changed the title from "[prometheus receiver] Metric value overflow on metrics_expiration" to "[prometheus exporter] Metric value overflow on metrics_expiration" on Dec 29, 2021
@albertteoh
Contributor Author

albertteoh commented Dec 29, 2021

I'm confused. The issue title says "prometheusremotewrite", but the sample config uses the promethus exporter. The prometheusremotewrite exporter does not have the referenced metric_expiration configuration option.

Ah, sorry, silly mistake on my part. I had originally titled it with the prometheus exporter, but I jumped to an incorrect conclusion in my comment, especially given that the logs are in the same export group as the prometheusremotewrite exporter. 🤦🏼‍♂️ I've reverted the title change for posterity.

can you try building a collector from this branch and let me know if the issue persists?

I tried your branch and it looks good 👍🏼

Thanks very much for the quick turnaround, @Aneurysm9!
