Span export fails with spanmetrics connector in a pipeline #23151

Closed
equinsuocha opened this issue Jun 6, 2023 · 4 comments
Labels
bug, closed as inactive, connector/spanmetrics, needs triage, Stale

Comments

equinsuocha commented Jun 6, 2023

Component(s)

connector/forward

What happened?

Description

Span export frequently fails if traces are forwarded by a connector to multiple (2 or more) pipelines that contain span transformations.

Steps to Reproduce

See the collector pipeline configuration below.

Main highlights (a condensed sketch of these pipelines follows the list):

  1. a single otlp receiver for incoming external data;
  2. a traces pipeline with the otlp receiver and 2 exporters: otlp/spanlogs (forwarding traces to an external service) and forward/sanitize-metrics;
  3. a traces pipeline with the forward/sanitize-metrics receiver, a transform processor and the spanmetrics exporter;
  4. a metrics pipeline with the spanmetrics receiver, a routing processor and 2 prometheusremotewrite exporters.
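
For quick reference, here is the service.pipelines section condensed from the full configuration below (processors in the main traces pipeline are omitted for brevity):

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/spanlogs, forward/sanitize-metrics]
    traces/sanitize:
      receivers: [forward/sanitize-metrics]
      processors: [transform/sanitize-spans-for-metrics]
      exporters: [spanmetrics]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors: [routing/remotewriteloadbalancer]
      exporters: [prometheusremotewrite/default, prometheusremotewrite/high-throughput]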

Expected Result

The forward/sanitize-metrics pipeline is meant to sanitize the span resource attributes before they reach the spanmetrics connector; otherwise the connector internally creates a label for every resource attribute, disregarding all label config options, and those labels are then dropped along the way, resulting in metric collisions, counter resets and so on. Meanwhile, otlp/spanlogs should forward the spans as is, without any additional processing.
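
The sanitization in question is the transform/sanitize-spans-for-metrics processor from the configuration below, reproduced here for context:

transform/sanitize-spans-for-metrics:
  ## Allow list of labels to improve performance of spanmetrics connector
  error_mode: ignore
  trace_statements:
    - context: resource
      statements:
        - keep_keys(attributes, ["service.name", "span.name", "otel.library.name", "business_domain", "k8s.pod.name", "span.kind", "status.code", "parent_scope"])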

Actual Result

This configuration results in otlp/spanlogs exporter failures, probably because the transform processor in the pipeline fed by forward/sanitize-metrics is executed while the export is still in progress, so it looks a lot like a race condition (see the Log output section below).


Removing the batch processor from the traces pipeline reduces the amount of dropped spans by orders of magnitude, but there is still constant data loss; removing the transform processor from the traces/sanitize pipeline resolves the issue.
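
For reference, a minimal sketch of that last workaround, assuming everything else stays as in the full configuration below: the transform processor is simply dropped from the traces/sanitize pipeline, at the cost of spanmetrics seeing the unsanitized resource attributes.

traces/sanitize:
  receivers: [forward/sanitize-metrics]
  ## workaround: transform/sanitize-spans-for-metrics removed, so spanmetrics
  ## receives spans with the full, unsanitized set of resource attributes
  processors: []
  exporters: [spanmetrics]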

Collector version

0.77.0

Environment information

Environment

OpenTelemetry Collector configuration

receivers:
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: 0.0.0.0:12345
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlp/tempo:
    endpoint: <%= @traces_endpoint %>
    tls:
      insecure: true
  otlp/spanlogs:
    endpoint: 172.17.0.1:32310
    tls:
      insecure: true
  prometheusremotewrite/default:
    ## General remotewrite endpoint for all services
    endpoint: <%= @metrics_endpoint %>
    target_info:
      enabled: false
    timeout: 15s
    remote_write_queue:
      enabled: true
      queue_size: 100000
      num_consumers: 30
    namespace: traces_spanmetrics
    tls:
      insecure: true
    external_labels:
      spanmetrics_instance: ${K8S_NODE_NAME}
    ## This will create metrics based on ALL labels
    resource_to_telemetry_conversion:
      enabled: false
  prometheusremotewrite/high-throughput:
    ## Remotewrite endpoint for high-throughput service metrics
    endpoint: <%= @metrics_endpoint %>
    target_info:
      enabled: false
    timeout: 15s
    remote_write_queue:
      enabled: true
      queue_size: 100000
      num_consumers: 30
    namespace: traces_spanmetrics
    tls:
      insecure: true
    external_labels:
      spanmetrics_instance: ${K8S_NODE_NAME}
    ## This will create metrics based on ALL labels
    resource_to_telemetry_conversion:
      enabled: false
connectors:
  forward/sanitize-metrics:
  spanmetrics:
    histogram:
      explicit:
        buckets: [2ms, 4ms, 8ms, 16ms, 32ms, 64ms, 128ms, 256ms, 512ms, 1024ms, 2048ms, 4096ms, 8192ms, 16384ms]
    dimensions:
      - name: otel_library_name
      - name: region
      - name: region_domain
      - name: service.owner
      - name: service.repository
      - name: k8s.pod.name
      - name: parent_scope
      - name: business_domain
    dimensions_cache_size: 100000
processors:
  attributes:
    ## Add region and env labels
    actions:
      - key: env
        value: <%= @environment %>
        action: upsert
      - key: region
        value: <%= @region %>
        action: upsert
      - key: region_domain
        value: <%= @region_domain %>
        action: upsert
  batch:
    send_batch_size: 1024
    send_batch_max_size: 4096
    timeout: 200ms
  filter/http-span-name:
    ## Exclude healthcheck and metrics endpoints
    spans:
      exclude:
        match_type: regexp
        span_names:
          - "GET.+health.+"
          - "GET .+/metrics"
  k8sattributes:
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      labels:
        - tag_name: business_domain
          key: business_domain
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.start_time
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      - sources:
          - from: connection
  transform/sanitize-spans-for-metrics:
    ## Allow list of labels to improve performance of spanmetrics connector
    error_mode: ignore
    trace_statements:
      - context: resource
        statements:
          - keep_keys(attributes, ["service.name", "span.name", "otel.library.name", "business_domain", "k8s.pod.name", "span.kind", "status.code", "parent_scope"])
  routing/remotewriteloadbalancer:
    ## Route metrics to different prometheusremotewrite exporters based on service.name
    default_exporters:
    - prometheusremotewrite/default
    table:
    - statement: route() where resource.attributes["service.name"] == "someservice"
      exporters: [prometheusremotewrite/high-throughput]
  transform/http-user-agent:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["http.user_agent"], "^.*[A-Z_@\\s\\.\\(\\)]+.*$", "someservice") where resource.attributes["service.name"] != "someservice" and instrumentation_scope.name == "@opentelemetry/instrumentation-http"
  transform/otel-library-name:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - set(attributes["otel_library_name"], instrumentation_scope.name)
          - set(attributes["span.kind"], "unspecified") where kind == 0
          - set(attributes["span.kind"], "internal") where kind == 1
          - set(attributes["span.kind"], "server") where kind == 2
          - set(attributes["span.kind"], "client") where kind == 3
          - set(attributes["span.kind"], "producer") where kind == 4
          - set(attributes["span.kind"], "consumer") where kind == 5
  transform/parent-scope:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - set(resource.attributes["parent_scope"], instrumentation_scope.name) where resource.attributes["parent_scope"] == nil and (kind == 2 or kind == 5)
  transform/token:
    ## Scrub sensitive tokens from spans
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
        # replace tokens in url query params in corresponding span attributes:
          - replace_pattern(attributes["http.target"], "token=[0-9a-zA-Z]*", "token=SCRUBBED")
          - replace_pattern(attributes["http.url"], "token=[0-9a-zA-Z]*", "token=SCRUBBED")
          - replace_pattern(attributes["http.route"], "token=[0-9a-zA-Z]*", "token=SCRUBBED")
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - filter/http-span-name
        - k8sattributes
        - attributes
        - transform/token
        - transform/otel-library-name
        - transform/http-user-agent
        - batch
      exporters: [otlp/spanlogs, forward/sanitize-metrics]
    traces/sanitize:
      receivers: [forward/sanitize-metrics]
      processors: [transform/sanitize-spans-for-metrics]
      exporters: [spanmetrics]
    metrics/spanmetrics:
      receivers: [spanmetrics]
      processors: [routing/remotewriteloadbalancer]
      exporters: [prometheusremotewrite/default, prometheusremotewrite/high-throughput]

Log output

2023-05-04T11:04:44.014Z	error	exporterhelper/queued_retry.go:401	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "traces", "name": "otlp/spanlogs", "error": "Permanent error: rpc error: code = Internal desc = grpc: error unmarshalling request: proto: ExportTraceServiceRequest: illegal tag 0 (wire type 0)", "dropped_items": 564}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/traces.go:137
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/collector/exporter@v0.76.1/exporterhelper/internal/bounded_memory_queue.go:58

Additional context

No response

@equinsuocha added the bug and needs triage labels on Jun 6, 2023

github-actions bot commented Jun 6, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot commented

Pinging code owners for connector/spanmetrics: @albertteoh @kovrus. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot commented

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • needs: Github issue template generation code needs this to generate the corresponding labels.
  • connector/spanmetrics: @albertteoh

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot commented

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Oct 29, 2023