[spanmetrics connector] Collector pod suddenly evicted and 'calls_total' unexpectedly increased #31025

Open
pingping95 opened this issue Feb 3, 2024 · 2 comments
Labels
bug, connector/spanmetrics, needs triage, Stale


pingping95 commented Feb 3, 2024

Component(s)

connector/spanmetrics

What happened?

Description

The calls_total metric value rose sharply:

sum by (http_status_code) (rate(calls_total{service_name="my-app"}[1m]))

There were no client calls during this period; the service only received Kubernetes health-check probe HTTP requests.


Steps to Reproduce

Expected Result

Actual Result

Collector version

0.92.0

Environment information

Environment

  • Amazon Linux 2
  • Kubernetes (EKS) 1.28

OpenTelemetry Collector configuration

config:

  # RECEIVER
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
        http:
          endpoint: ${env:MY_POD_IP}:4318
    prometheus:
      config:
        scrape_configs:
          - job_name: opentelemetry-collector
            scrape_interval: 10s
            static_configs:
              - targets:
                  - ${env:MY_POD_IP}:8888

  # PROCESSOR
  processors:
    batch: {}
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 30

    filter:
      metrics:
        datapoint:
          - 'attributes["span.kind"] != "SPAN_KIND_SERVER"'


  # EXPORTER
  exporters:
    debug:
      verbosity: detailed

    prometheusremotewrite:
      endpoint: https://mimir.xxxxx.xxx/api/v1/push
      target_info:
        enabled: true
      external_labels:
        test: true

  # SPAN_KIND   : CLIENT, SERVER, INTERNAL, PUBLISHER, CONSUMER
  # STATUS_CODE : UNSET, OK, ERROR

  # CONNECTOR
  connectors:
    # 1. SERVER Metrics
    spanmetrics:
      histogram:
        explicit:
          buckets: [50ms, 100ms, 200ms, 500ms, 1s, 5s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      dimensions:
        - name: http.method
          default: GET
        - name: http.route
        - name: http.status_code
        - name: cluster

      exemplars:
        enabled: true
      events:
        enabled: true
        dimensions:
          - name: exception.type
          - name: exception.message
      resource_metrics_key_attributes:
        - service.name

  # EXTENSIONS
  extensions:
    health_check:
      endpoint: ${env:MY_POD_IP}:13133
    memory_ballast:
      size_in_percentage: 40

  # SERVICE
  service:
    extensions:
      - health_check
      - memory_ballast

    pipelines:
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
        exporters:
          - spanmetrics

      metrics:
        receivers:
          - spanmetrics
        processors:
          - memory_limiter
          - batch
          - filter
        exporters:
          - prometheusremotewrite
          - debug

Log output

No output

Additional context

I suspect the 'exemplars' setting may be causing this strange behavior.

Could anyone advise on this?
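One way to test the exemplars hypothesis is to rerun the same connector with exemplars turned off and compare pod memory and calls_total over a similar window. Exemplars are sample trace references attached to data points and are not expected to change the counted values. If the spike persists with them disabled, the cause is therefore probably elsewhere. This is a minimal sketch that assumes every other spanmetrics setting stays exactly as in the configuration above (it goes under the same `config:` key). Only `exemplars.enabled` changes:

```yaml
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [50ms, 100ms, 200ms, 500ms, 1s, 5s]
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 15s
    dimensions:
      - name: http.method
        default: GET
      - name: http.route
      - name: http.status_code
      - name: cluster
    # Experiment only: exemplars disabled to see whether they drive the
    # memory growth and eviction. All other settings are unchanged.
    exemplars:
      enabled: false
    events:
      enabled: true
      dimensions:
        - name: exception.type
        - name: exception.message
    resource_metrics_key_attributes:
      - service.name
```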

I plan to migrate from the Tempo metrics-generator to the OpenTelemetry Collector's spanmetrics connector, using this topology:

load-balancing exporter -> spanmetrics connector otel collector
                        -> tail_sampling otel collector
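This sketch assumes the load-balancing tier fans out to both collector tiers. In that topology, one detail may matter for calls_total: the spanmetrics connector aggregates per Collector instance. Suppose the spanmetrics tier runs more than one replica and the load-balancing exporter routes by trace ID, which is its default. Then spans of the same service can be counted on several replicas, and each replica exports its own cumulative calls_total series with identical labels. Samples from those series may interleave or be rejected at the remote-write backend, and rate() can read the result as counter resets.

The sketch shows one way to configure the first tier. It assumes headless Services named `otel-spanmetrics-headless` and `otel-tailsampling-headless` in an `observability` namespace; these names are hypothetical. It also assumes plain-text OTLP inside the cluster.

```yaml
exporters:
  # Spans for the spanmetrics tier: route by service so that all spans of
  # one service are aggregated by the same Collector replica.
  loadbalancing/spanmetrics:
    routing_key: service
    protocol:
      otlp:
        tls:
          insecure: true  # assumption: no TLS between collector tiers
    resolver:
      dns:
        # hypothetical headless Service for the spanmetrics tier
        hostname: otel-spanmetrics-headless.observability.svc.cluster.local
  # Spans for the tail_sampling tier: route by trace ID so every span of a
  # trace reaches the same replica.
  loadbalancing/tailsampling:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true  # assumption: no TLS between collector tiers
    resolver:
      dns:
        # hypothetical headless Service for the tail_sampling tier
        hostname: otel-tailsampling-headless.observability.svc.cluster.local
```

Both exporters would be listed in the first tier's traces pipeline. With per-service routing, each service's calls_total series should come from a single replica at a time. A replica scale-up or scale-down can still move a service between replicas.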
pingping95 added the bug and needs triage labels on Feb 3, 2024.

github-actions bot commented Feb 3, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


github-actions bot commented Apr 4, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
