
[receiver/prometheus] Receiver incorrectly sets StartTimestamp #22810

Closed
sirianni opened this issue May 26, 2023 · 10 comments
Labels: bug, closed as inactive, priority:p1, receiver/prometheus, Stale

Comments

@sirianni
Contributor

sirianni commented May 26, 2023

Component(s)

receiver/prometheus

Description

The prometheus receiver sets StartTimestamp on cumulative metrics to the observed Timestamp instead of nil (or a "zero value") when the process_start_time_seconds metric is not matched.

This is undesired behavior in general, but it causes severe issues when combined with the cumulativetodelta processor configured with initial_value: auto (the default).
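
For reference, the processor setting in question spelled out explicitly; a minimal sketch showing only the option under discussion:

processors:
  cumulativetodelta:
    initial_value: auto  # the default, as noted above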

A similar bug report was made in #17190, though that seems to point to the cumulativetodelta processor as the culprit. I believe this bug is in prometheusreceiver.

Steps to Reproduce

Start an OTel Collector with a prometheus receiver

Expected Result

StartTimestamp on cumulative metrics should either:

  • Be set to the value of the process_start_time_seconds metric, or
  • Be null or a "zero value"

Actual Result

StartTimestamp on a cumulative metric is equal to the observed Timestamp on that metric

Collector version

0.78.0

Environment information

No response

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'istio'
        scrape_interval: 30s
        static_configs:
        - targets: ['0.0.0.0:15020']

exporters:
  logging:
    verbosity: detailed

service:
  telemetry:
    logs:
      level: DEBUG
      development: true
      encoding: console
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors: []
      exporters:
      - logging

Log output

Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(istio)
     -> service.instance.id: Str(0.0.0.0:15020)
     -> net.host.port: Str(15020)
     -> http.scheme: Str(http)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/prometheusreceiver development
Metric #14
Descriptor:
     -> Name: istio_requests
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> connection_security_policy: Str(none)
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(0)
     -> response_flags: Str(-)
     -> source_canonical_revision: Str(latest)
StartTimestamp: 2023-05-26 14:11:13.822 +0000 UTC
Timestamp: 2023-05-26 14:11:13.822 +0000 UTC
Value: 171.000000

Additional context

The scraped endpoint does return a process_start_time_seconds metric but it has a prefix.

$ http :15020/stats/prometheus | grep 'process_start_time'
# HELP istio_agent_process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE istio_agent_process_start_time_seconds gauge
istio_agent_process_start_time_seconds 1.6838030013e+09
$ date -d@1683803001
Thu 11 May 2023 07:03:21 AM EDT

I'm aware of the start_time_metric_regex configuration option in the receiver, but that's not usable for us since we are using receivercreator here to do autodiscovery and the regex is application-specific.
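
For reference, this is roughly how that option is wired into a standalone prometheus receiver config. The regex value below is only illustrative (chosen to match the prefixed istio_agent_process_start_time_seconds metric shown above), and per the receiver's documentation it is typically paired with use_start_time_metric:

receivers:
  prometheus:
    use_start_time_metric: true
    start_time_metric_regex: '.*process_start_time_seconds$'
    config:
      scrape_configs:
      - job_name: 'istio'
        scrape_interval: 30s
        static_configs:
        - targets: ['0.0.0.0:15020']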

@sirianni sirianni added the bug and needs triage labels May 26, 2023
@github-actions github-actions bot added the receiver/prometheus label May 26, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@sirianni sirianni changed the title from "Receiver incorrectly sets StartTimestamp" to "[receiver/prometheus] Receiver incorrectly sets StartTimestamp" May 26, 2023
@TylerHelmuth TylerHelmuth added the priority:p1 label and removed the needs triage label May 26, 2023
@sirianni
Contributor Author

sirianni commented May 26, 2023

The issue may be with the second scrape. Here, the StartTimestamp is still set to the first (artificial) StartTimestamp:

Metric #159
Descriptor:
     -> Name: istio_requests
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
...
StartTimestamp: 2023-05-26 14:11:13.822 +0000 UTC
Timestamp: 2023-05-26 14:11:43.822 +0000 UTC
Value: 171.000000

Maybe this is correct according to the spec (?), but the interaction of this with the cumulativetodelta processor seems wrong.

@Aneurysm9
Member

I'm aware of the start_time_metric_regex configuration option in the receiver, but that's not usable for us since we are using receivercreator here to do autodiscovery and the regex is application-specific.

Can you provide more detail regarding why this is not an option? This is the mechanism that the prometheus receiver exposes to do what you are trying to do. Does the receiver creator need to provide more flexibility?

@sirianni
Contributor Author

I'm aware of the start_time_metric_regex configuration option in the receiver, but that's not usable for us since we are using receivercreator here to do autodiscovery and the regex is application-specific.

Can you provide more detail regarding why this is not an option? This is the mechanism that the prometheus receiver exposes to do what you are trying to do. Does the receiver creator need to provide more flexibility?

We are using receivercreator with annotation-based discovery:

receivers:
  receiver_creator/prometheus:
    watch_observers:
    - k8s_observer
    receivers:
      prometheus:
        rule: type == "pod" && annotations["prometheus.io/scrape"] == "true"
        config:
          config:
            scrape_configs:
            - job_name: 'prometheus:`name`'
              scrape_interval: 60s
              metrics_path: '`"prometheus.io/path" in annotations ? annotations["prometheus.io/path"] : "/metrics"`'
              static_configs:
              - targets: [ '`endpoint`:`"prometheus.io/port" in annotations ? annotations["prometheus.io/port"] : 9090`' ]

I suppose we could devise some custom Kubernetes annotation so that apps can configure their start_time regex, and then propagate that down to the config. However, that would be a non-standard annotation, and it's a bit impractical to expect all the apps we support to modify their k8s manifests to include it.
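
Purely to illustrate that idea, a sketch of how the annotation could be propagated, reusing the backtick-expression style already used for metrics_path above. The annotation name is hypothetical and the fallback regex is illustrative:

receivers:
  receiver_creator/prometheus:
    watch_observers:
    - k8s_observer
    receivers:
      prometheus:
        rule: type == "pod" && annotations["prometheus.io/scrape"] == "true"
        config:
          use_start_time_metric: true
          # Hypothetical, non-standard annotation; falls back to the
          # conventional metric name when the annotation is absent.
          start_time_metric_regex: '`"example.com/start-time-regex" in annotations ? annotations["example.com/start-time-regex"] : "process_start_time_seconds"`'
          config:
            scrape_configs:
            - job_name: 'prometheus:`name`'
              scrape_interval: 60s
              metrics_path: '`"prometheus.io/path" in annotations ? annotations["prometheus.io/path"] : "/metrics"`'
              static_configs:
              - targets: [ '`endpoint`:`"prometheus.io/port" in annotations ? annotations["prometheus.io/port"] : 9090`' ]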

It's a worthy goal, but I'd also expect reasonable fallback behavior out of the box by prometheusreceiver when it can't reliably determine the process start time.

@sirianni
Contributor Author

I'm actually not able to reproduce this locally with the following configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'istio'
        scrape_interval: 30s
        static_configs:
        - targets: ['0.0.0.0:15020']

processors:
  cumulativetodelta:
  filter:
    metrics:
      metric:
      - 'name != "istio_requests"'
      datapoint:
      - 'attributes["response_code"] != "200"'

exporters:
  logging:
    verbosity: detailed

connectors:
  forward:

service:
  telemetry:
    logs:
      level: DEBUG
      development: true
      encoding: console
  pipelines:
    metrics/1:
      receivers:
      - prometheus
      processors:
      - filter
      exporters:
      - forward
      - logging
    metrics/2:
      receivers:
      - forward
      processors:
      - cumulativetodelta
      exporters:
      - logging

I see the following data points in the output, which look correct:

InstrumentationScope otelcol/prometheusreceiver development
Metric #0
Descriptor:
     -> Name: istio_requests
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(200)
StartTimestamp: 2023-05-26 21:52:13.822 +0000 UTC
Timestamp: 2023-05-26 21:52:13.822 +0000 UTC
Value: 27900767.000000
Metric #0
Descriptor:
     -> Name: istio_requests
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(200)
StartTimestamp: 2023-05-26 21:52:13.822 +0000 UTC
Timestamp: 2023-05-26 21:52:43.822 +0000 UTC
Value: 27900794.000000
Metric #0
Descriptor:
     -> Name: istio_requests
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Delta
NumberDataPoints #0
Data point attributes:
     -> reporter: Str(destination)
     -> request_protocol: Str(grpc)
     -> response_code: Str(200)
StartTimestamp: 2023-05-26 21:52:13.822 +0000 UTC
Timestamp: 2023-05-26 21:52:43.822 +0000 UTC
Value: 27.000000

But we are definitely seeing this in our production environment after upgrading to 0.78.0. We see huge data points in our charts for the istio_requests metric that correlate exactly with collector restarts.

@sirianni
Contributor Author

Helpful gist from @seankhliao

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jul 26, 2023
@dashpole
Contributor

I think this is expected behavior based on the specification.

See https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/blob/main/exporter/collector/internal/normalization/standard_normalizer.go#L264 for what my (GCP) exporter does to handle these points. We drop points where the start time isn't before the end time, and use that dropped point to "normalize" subsequent points by subtracting the initial value from them.

I'm personally not a fan of the behavior, as the point implies that a non-zero number of events occurred in zero time. I would expect most backends to fail to handle these points if they aren't filtered out. A nil start time seems like a better representation of what we actually know, and I would rather have a processor that can deal with the lack of start time than have the points. We might need to change the spec first, though.
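
To make the approach concrete, here is a rough sketch of that normalization idea against the collector's pdata API. This is not the linked GCP code; the series-key bookkeeping is elided and the names are made up:

package normalize

import (
	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// reference holds the first usable observation of a cumulative series.
type reference struct {
	start pcommon.Timestamp
	value float64
}

// normalize returns false when the point should be dropped. refs is keyed by
// some notion of series identity (metric name plus attributes), elided here.
func normalize(dp pmetric.NumberDataPoint, refs map[string]reference, key string) bool {
	ref, ok := refs[key]
	if !ok {
		// First point seen for this series. If its start time is not before
		// its timestamp, it claims a non-zero count over zero elapsed time,
		// so remember it as the reference point and drop it.
		if dp.StartTimestamp() >= dp.Timestamp() {
			refs[key] = reference{start: dp.Timestamp(), value: dp.DoubleValue()}
			return false
		}
		return true
	}
	// Subsequent points: subtract the reference value and anchor the start
	// time at the reference point's timestamp.
	dp.SetDoubleValue(dp.DoubleValue() - ref.value)
	dp.SetStartTimestamp(ref.start)
	return true
}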

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 26, 2023
@github-actions
Contributor

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 25, 2023