
Metrics explosion since v0.35 for tracing collector component #5155

Closed
ese opened this issue Sep 11, 2023 · 8 comments
Labels
bug, frozen-due-to-age

Comments

ese commented Sep 11, 2023

What's wrong?

Since v0.35, the tracing component has been generating a massive amount of metrics due to a cardinality explosion. Before this version, the grafana-agent /metrics endpoint returned responses of around 0.7 MB; since this release it returns responses of over 250 MB. I also tested the latest release, v0.36.1, with the same results.

These are the metrics causing the issue (mainly because of the net_sock_peer_port label); they were not generated before with the same config (an example series is shown after the list):

traces_http_server_duration_bucket
traces_http_server_duration_sum
traces_http_server_duration_count
traces_http_server_response_content_length_total
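
For illustration, the exploded series carry the client's ephemeral source port as a label value, so every new connection to the receiver creates a new time series. A series of this shape (label values here are made up) looks like:

traces_http_server_duration_bucket{http_method="POST",http_status_code="200",net_sock_peer_addr="10.42.0.17",net_sock_peer_port="60328",traces_config="default",le="10"} 1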

Steps to reproduce

Upgrade the agent to v0.35 or later and collect traces.

System information

No response

Software version

v0.36.1

Configuration

server:
  log_level: warn
traces:
  configs:
    - batch:
        send_batch_size: 2000
        timeout: 10s
      name: default
      receivers:
        jaeger:
          protocols:
            grpc: null
            thrift_binary: null
            thrift_compact: null
        otlp:
          protocols:
            grpc: null
            http: null
        zipkin: null
      service_graphs:
        enabled: false
      remote_write:
        - endpoint: tempo-distributor.tempo.svc.cluster.local:4317
          insecure: true
          sending_queue:
            queue_size: 10000
          retry_on_failure:
            max_elapsed_time: 5s
      scrape_configs:
        - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - action: replace
              source_labels:
                - __meta_kubernetes_namespace
              target_label: namespace
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_name
              target_label: pod
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_container_name
              target_label: container
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_label_app_kubernetes_io_instance
              target_label: release
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_label_app_kubernetes_io_component
              target_label: component
            - action: replace
              source_labels:
                - __meta_kubernetes_pod_label_worker
              target_label: worker
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: false


Logs

No response
ese added the bug label Sep 11, 2023
ese changed the title from "Metrics explosion since v0.35 for tracing component" to "Metrics explosion since v0.35 for tracing collector component" Sep 11, 2023
ese (Author) commented Sep 12, 2023

Could be related to #4764.

Is there a way to disable these metrics in static config?
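
One stop-gap I can think of (untested, and assuming Prometheus is what scrapes the agent's /metrics endpoint; the job name and target below are placeholders) would be to drop the series at scrape time with metric_relabel_configs:

scrape_configs:
  - job_name: grafana-agent
    static_configs:
      - targets: ["grafana-agent:12345"]  # placeholder address for the agent's /metrics endpoint
    metric_relabel_configs:
      # Drop the exploded tracing-collector series before ingestion.
      - source_labels: [__name__]
        regex: "traces_http_server_duration_.*|traces_http_server_response_content_length_total"
        action: drop

That would only keep the series out of Prometheus, though; the agent would still generate them, so the large /metrics responses remain.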

rfratto (Member) commented Sep 12, 2023

Definitely looks related to #4764 after we upgraded our dependency on OpenTelemetry Collector. #4769 fixed the issue in Flow mode only, so static mode still has the issue.

It's not immediately clear to me if there's a direct translation for #4769 to work in static mode. @ptodev WDYT?

ptodev (Contributor) commented Sep 14, 2023

Hi @ese 👋 Thank you for your report. In static mode we don't expect such a cardinality explosion, because the telemetry.disableHighCardinalityMetrics feature gate is always enabled. This means that high-cardinality labels such as net_sock_peer_addr should not be present.

I just tried running the main branch of the Agent with a config like this:

Agent config
server:
  log_level: debug

logs:
  positions_directory: "/Users/paulintodev/Desktop/otel_test/test_log_pos_dir"
  configs:
    - name: "grafanacloud-paulintodev-logs"
      clients:
        - url: ""
          basic_auth:
            username: ""
            password: ""

traces:
  configs:
  - name: default
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4320"
    remote_write:
      - endpoint: localhost:4317
        insecure: true
    batch:
      timeout: 5s
      send_batch_size: 100
    automatic_logging:
      backend: "logs_instance"
      logs_instance_name: "grafanacloud-paulintodev-logs"
      roots: true
    spanmetrics:
      handler_endpoint: "localhost:8899"
      namespace: "paulin_test_"
    tail_sampling:
      policies:
        [
          {
            name: test-policy-4,
            type: probabilistic,
            probabilistic: {sampling_percentage: 100}
          },
        ]
    service_graphs:
      enabled: true

Indeed, I do not see high cardinality metrics. My metrics look like this - with no net_sock_peer_addr labels:

traces_rpc_server_duration_bucket{rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",traces_config="default",le="10"} 396
traces_rpc_server_duration_bucket{rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",traces_config="default",le="25"} 396

Would you mind telling us the precise version you are using, please? Also, it'd help if you could share a few example metrics.

It may also help to try disabling some receivers by removing them from the config. I have only tested this with "otlp" - it is possible that other receivers don't honour the feature gate, but I feel like this is unlikely.
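
For example, a trimmed-down sketch of the config from this issue (untested) that keeps only the otlp receiver while narrowing this down:

traces:
  configs:
    - name: default
      receivers:
        otlp:
          protocols:
            grpc: null
      remote_write:
        - endpoint: tempo-distributor.tempo.svc.cluster.local:4317
          insecure: true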

ese (Author) commented Sep 29, 2023

I tested v0.36.2 and it's fixed.

ese closed this as completed Sep 29, 2023
jcreixell (Contributor) commented:

FYI, I ran into this myself with an agent running v0.32.1 and using the OTLP endpoint. I will upgrade and confirm it's fixed.

jcreixell (Contributor) commented:

I can still reproduce this with v0.37.2. The tracing bit of my config is just:

        traces:
          configs:
            - name: default
              receivers: # enable the receivers that you need
                otlp:
                  protocols:
                    grpc:
                    http:
              remote_write:
                - endpoint: tempo-eu-west-0.grafana.net:443
                  basic_auth:
                    username: 274733
                    password_file: /var/lib/grafana-agent/rw.key

and I am instrumenting a Ruby/Rails application with the OTel SDK, sending the traces directly to the agent's OTLP endpoint.

ptodev (Contributor) commented Nov 21, 2023

I just tested Agent v0.37.2 and v0.37.3, and while I do see the issue in v0.37.2, I do not see it in v0.37.3. This must be due to the OTel upgrade done in v0.37.3.

On v0.37.2 I do see high cardinality labels such as net_sock_peer_port:

traces_http_server_duration_bucket{http_flavor="1.1",http_method="POST",http_scheme="http",http_status_code="200",http_user_agent="Grafana Agent/main-4681f0c (darwin/arm64)",net_sock_peer_addr="127.0.0.1",net_sock_peer_port="60328",traces_config="default",le="0"} 0

But on v0.37.3, those high cardinality labels are gone:

traces_http_server_duration_bucket{http_flavor="1.1",http_method="POST",http_scheme="http",http_status_code="200",traces_config="default",le="0"} 0

I'm not sure exactly what changed to fix this issue, but if anyone is affected I'd suggest upgrading to a version of the Agent which is v0.37.3 or above.

jcreixell (Contributor) commented:

I can confirm it is fixed in v0.38.0 🙌

github-actions bot added the frozen-due-to-age label Feb 21, 2024
github-actions bot locked as resolved and limited conversation to collaborators Feb 21, 2024