
[receiver/prometheus] Receiver incorrectly sets StartTimestamp #22810

Closed
sirianni opened this issue May 26, 2023 · 10 comments
Labels: bug, closed as inactive, priority:p1, receiver/prometheus, Stale

Comments

@sirianni
Contributor

sirianni commented May 26, 2023

Component(s)

receiver/prometheus

Description

The prometheus receiver sets StartTimestamp on cumulative metrics to the observed Timestamp instead of nil (or a "zero value") when the process_start_time_seconds metric is not matched.

This is undesired behavior in general, but it causes severe issues when combined with the cumulativetodelta processor configured with initial_value: auto (the default).
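
For reference, the processor setting in question spelled out explicitly; a minimal sketch showing only the option under discussion:

processors:
  cumulativetodelta:
    initial_value: auto  # the default, as noted above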

A similar bug report was made in #17190, though that seems to point to the cumulativetodelta processor as the culprit. I believe this bug is in prometheusreceiver.

Steps to Reproduce

Start an OTel Collector with a prometheus receiver

Expected Result

StartTimestamp on cumulative metrics should either:

  • Be set to the value of the process_start_time_seconds metric, or
  • Be null or a "zero value"

Actual Result

StartTimestamp on a cumulative metric is equal to the observed Timestamp on that metric

Collector version

0.78.0

Environment information

No response

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'istio'
        scrape_interval: 30s
        static_configs:
        - targets: ['0.0.0.0:15020']

exporters:
  logging:
    verbosity: detailed

service:
  telemetry:
    logs:
      level: DEBUG
      development: true
      encoding: console
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors: []
      exporters:
      - logging

Log output

Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(istio)
     -> service.instance.id: Str(0.0.0.0:15020)
     -> net.host.port: Str(15020)
     -> http.scheme: Str(http)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/prometheusreceiver development
Metric #14
Descriptor:
     -> Name: istio_requests
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> connection_security_policy: Str(none)
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(0)
     -> response_flags: Str(-)
     -> source_canonical_revision: Str(latest)
StartTimestamp: 2023-05-26 14:11:13.822 +0000 UTC
Timestamp: 2023-05-26 14:11:13.822 +0000 UTC
Value: 171.000000

Additional context

The scraped endpoint does return a process_start_time_seconds metric but it has a prefix.

$ http :15020/stats/prometheus | grep 'process_start_time'
# HELP istio_agent_process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE istio_agent_process_start_time_seconds gauge
istio_agent_process_start_time_seconds 1.6838030013e+09
$ date -d@1683803001
Thu 11 May 2023 07:03:21 AM EDT

I'm aware of the start_time_metric_regex configuration option in the receiver, but that's not usable for us since we are using receivercreator here to do autodiscovery and the regex is application-specific.
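
For reference, this is roughly how that option is wired into a standalone prometheus receiver config. The regex value below is only illustrative (chosen to match the prefixed istio_agent_process_start_time_seconds metric shown above), and per the receiver's documentation it is typically paired with use_start_time_metric:

receivers:
  prometheus:
    use_start_time_metric: true
    start_time_metric_regex: '.*process_start_time_seconds$'
    config:
      scrape_configs:
      - job_name: 'istio'
        scrape_interval: 30s
        static_configs:
        - targets: ['0.0.0.0:15020']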

@sirianni sirianni added the bug and needs triage labels May 26, 2023
@github-actions github-actions bot added the receiver/prometheus label May 26, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@sirianni sirianni changed the title from "Receiver incorrectly sets StartTimestamp" to "[receiver/prometheus] Receiver incorrectly sets StartTimestamp" May 26, 2023
@TylerHelmuth TylerHelmuth added the priority:p1 label and removed the needs triage label May 26, 2023
@sirianni
Contributor Author

sirianni commented May 26, 2023

The issue may be with the second scrape. Here, the StartTimestamp is still set to the first (artificial) StartTimestamp:

Metric #159
Descriptor:
     -> Name: istio_requests
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
...
StartTimestamp: 2023-05-26 14:11:13.822 +0000 UTC
Timestamp: 2023-05-26 14:11:43.822 +0000 UTC
Value: 171.000000

Maybe this is correct according to the spec (?), but the interaction of this with the cumulativetodelta processor seems wrong.

@Aneurysm9
Member

I'm aware of the start_time_metric_regex configuration option in the receiver, but that's not usable for us since we are using receivercreator here to do autodiscovery and the regex is application-specific.

Can you provide more detail regarding why this is not an option? This is the mechanism that the prometheus receiver exposes to do what you are trying to do. Does the receiver creator need to provide more flexibility?

@sirianni
Contributor Author

I'm aware of the start_time_metric_regex configuration option in the receiver, but that's not usable for us since we are using receivercreator here to do autodiscovery and the regex is application-specific.

Can you provide more detail regarding why this is not an option? This is the mechanism that the prometheus receiver exposes to do what you are trying to do. Does the receiver creator need to provide more flexibility?

We are using receivercreator with annotation-based discovery:

receivers:
  receiver_creator/prometheus:
    watch_observers:
    - k8s_observer
    receivers:
      prometheus:
        rule: type == "pod" && annotations["prometheus.io/scrape"] == "true"
        config:
          config:
            scrape_configs:
            - job_name: 'prometheus:`name`'
              scrape_interval: 60s
              metrics_path: '`"prometheus.io/path" in annotations ? annotations["prometheus.io/path"] : "/metrics"`'
              static_configs:
              - targets: [ '`endpoint`:`"prometheus.io/port" in annotations ? annotations["prometheus.io/port"] : 9090`' ]

I suppose we could devise some custom Kubernetes annotation so that apps can configure their start_time regex, and then propagate that down to the config. However, that would be a non-standard annotation, and it's a bit impractical to expect all the apps we support to modify their k8s manifests to include it.
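
Purely to illustrate that idea, a sketch of how the annotation could be propagated, reusing the backtick-expression style already used for metrics_path above. The annotation name is hypothetical and the fallback regex is illustrative:

receivers:
  receiver_creator/prometheus:
    watch_observers:
    - k8s_observer
    receivers:
      prometheus:
        rule: type == "pod" && annotations["prometheus.io/scrape"] == "true"
        config:
          use_start_time_metric: true
          # Hypothetical, non-standard annotation; falls back to the
          # conventional metric name when the annotation is absent.
          start_time_metric_regex: '`"example.com/start-time-regex" in annotations ? annotations["example.com/start-time-regex"] : "process_start_time_seconds"`'
          config:
            scrape_configs:
            - job_name: 'prometheus:`name`'
              scrape_interval: 60s
              metrics_path: '`"prometheus.io/path" in annotations ? annotations["prometheus.io/path"] : "/metrics"`'
              static_configs:
              - targets: [ '`endpoint`:`"prometheus.io/port" in annotations ? annotations["prometheus.io/port"] : 9090`' ]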

It's a worthy goal, but I'd also expect reasonable fallback behavior out of the box by prometheusreceiver when it can't reliably determine the process start time.

@sirianni
Contributor Author

I'm actually not able to reproduce this locally with the following configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'istio'
        scrape_interval: 30s
        static_configs:
        - targets: ['0.0.0.0:15020']

processors:
  cumulativetodelta:
  filter:
    metrics:
      metric:
      - 'name != "istio_requests"'
      datapoint:
      - 'attributes["response_code"] != "200"'

exporters:
  logging:
    verbosity: detailed

connectors:
  forward:

service:
  telemetry:
    logs:
      level: DEBUG
      development: true
      encoding: console
  pipelines:
    metrics/1:
      receivers:
      - prometheus
      processors:
      - filter
      exporters:
      - forward
      - logging
    metrics/2:
      receivers:
      - forward
      processors:
      - cumulativetodelta
      exporters:
      - logging

I see the following data points in the output, which look correct:

InstrumentationScope otelcol/prometheusreceiver development
Metric #0
Descriptor:
     -> Name: istio_requests
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(200)
StartTimestamp: 2023-05-26 21:52:13.822 +0000 UTC
Timestamp: 2023-05-26 21:52:13.822 +0000 UTC
Value: 27900767.000000
Metric #0
Descriptor:
     -> Name: istio_requests
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(200)
StartTimestamp: 2023-05-26 21:52:13.822 +0000 UTC
Timestamp: 2023-05-26 21:52:43.822 +0000 UTC
Value: 27900794.000000
Metric #0
Descriptor:
     -> Name: istio_requests
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(200)
StartTimestamp: 2023-05-26 21:52:13.822 +0000 UTC
Timestamp: 2023-05-26 21:52:43.822 +0000 UTC
Value: 27.000000

But we are definitely seeing this in our production environment after upgrading to 0.78.0. We see huge data points in our charts for the istio_requests metric that correlate exactly with collector restarts.

@sirianni
Contributor Author

Helpful gist from @seankhliao

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 26, 2023
@dashpole
Contributor

I think this is expected behavior based on the specification.

See https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/exporter/collector/internal/normalization/standard_normalizer.go#L264 for what my (GCP) exporter does to handle these points. We drop points where the start time isn't before the end time, and use that dropped point to "normalize" subsequent points by subtracting the initial value from them.

I'm personally not a fan of the behavior, as the point implies that a non-zero number of events occurred in zero time. I would expect most backends to fail to handle these points if they aren't filtered out. A nil start time seems like a better representation of what we actually know, and I would rather have a processor that can deal with the lack of start time than have the points. We might need to change the spec first, though.
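
To make the approach concrete, here is a rough sketch of that normalization idea against the collector's pdata API. This is not the linked GCP code; the series-key bookkeeping is elided and the names are made up:

package normalize

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// reference holds the first usable observation of a cumulative series.
type reference struct {
	start pcommon.Timestamp
	value float64
}

// normalize returns false when the point should be dropped. refs is keyed by
// some notion of series identity (metric name plus attributes), elided here.
func normalize(dp pmetric.NumberDataPoint, refs map[string]reference, key string) bool {
	ref, ok := refs[key]
	if !ok {
		// First point seen for this series. If its start time is not before
		// its timestamp, it claims a non-zero count over zero elapsed time,
		// so remember it as the reference point and drop it.
		if dp.StartTimestamp() >= dp.Timestamp() {
			refs[key] = reference{start: dp.Timestamp(), value: dp.DoubleValue()}
			return false
		}
		return true
	}
	// Subsequent points: subtract the reference value and anchor the start
	// time at the reference point's timestamp.
	dp.SetDoubleValue(dp.DoubleValue() - ref.value)
	dp.SetStartTimestamp(ref.start)
	return true
}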

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 26, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 25, 2023