
[metrics-generator] overrides per tenant do not work for metrics-generator #3462

Closed
ToonTijtgat2 opened this issue Mar 5, 2024 · 6 comments

Comments

@ToonTijtgat2

Describe the bug
I'm trying to use the runtime config to override per-tenant configuration, but it does not work.
When I leave the config (tempo.yaml) like this, it works fine:

    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s
      defaults:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

The logs of the metrics generator show lines like this: level=info ts=2024-03-05T09:46:15.602380169Z caller=registry.go:232 tenant=tracing-apps-app-dev msg="collecting metrics" active_series=1227

From the moment I change the config to:
tempo.yaml:

    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s

overrides.yaml:

    overrides:
      tracing-apps-app-dev:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

the metrics generator doesn't seem to do anything anymore.
For some reason, it keeps working if I make this change and only restart the metrics-generator deployment. Once I remove the namespace, it's broken.
On another cluster I tried the runtime config from the start, and there it never worked.

To Reproduce
Steps to reproduce the behavior:

  1. Have a clean installation of Tempo with the per-tenant overrides runtime config as above.
  2. Check whether the metrics generator produces any metrics.

Expected behavior
I expect the metrics generator to generate metrics for the tenant tracing-apps-app-dev and send them to the Mimir instance for prometheus-apps-app-dev.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm

Non-working config:

  tempo-query.yaml: |
    backend: 127.0.0.1:3100
  tempo.yaml: |

    compactor:
      compaction:
        block_retention: 168h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 107374182400
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 10
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 1000
      ring:
        kvstore:
          store: memberlist
    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      flush_all_on_shutdown: true
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - remote_timeout: 30s
          url: mimirurl/api/v1/push
        remote_write_flush_deadline: 1m
    multitenancy_enabled: true
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery:9095
      max_concurrent_queries: 20
      query_relevant_ingesters: true
      search:
        external_backend: null
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
        prefer_self: 10
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_retries: 2
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
      trace_by_id:
        hedge_requests_at: 2s
        hedge_requests_up_to: 2
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 6194304
      grpc_server_max_send_msg_size: 6194304
      http_listen_port: 3100
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        azure:
          container_name: traces-blocks
          storage_account_key: xxx
          storage_account_name: xxx
        backend: azure
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    usage_report:
      reporting_enabled: false
  overrides.yaml: |

    overrides:
      tracing-apps-app-dev:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

Working config:

  tempo-query.yaml: |
    backend: 127.0.0.1:3100
  tempo.yaml: |

    compactor:
      compaction:
        block_retention: 168h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 107374182400
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 10
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 1000
      ring:
        kvstore:
          store: memberlist
    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      flush_all_on_shutdown: true
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - remote_timeout: 30s
          url: https://mimirurl/api/v1/push
        remote_write_flush_deadline: 1m
    multitenancy_enabled: true
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s
      defaults:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery:9095
      max_concurrent_queries: 20
      query_relevant_ingesters: true
      search:
        external_backend: null
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
        prefer_self: 10
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_retries: 2
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
      trace_by_id:
        hedge_requests_at: 2s
        hedge_requests_up_to: 2
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 6194304
      grpc_server_max_send_msg_size: 6194304
      http_listen_port: 3100
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        azure:
          container_name: traces-blocks
          storage_account_key: xxx
          storage_account_name: xxx
        backend: azure
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    usage_report:
      reporting_enabled: false
  overrides.yaml: |

    overrides: {}

Additional Context
I need the per-tenant overrides config since in prod we have multiple tenants.
I found an issue with the same problem, but it got auto-closed:
#3032

@joe-elliott
Member

Are there any relevant logs that might help? At first glance I'm not seeing anything wrong with your config. We use the overrides all the time internally, so it generally works.

Can you review metrics to help narrow down the issue? Are the distributors still sending spans to the generators? Are spans being dropped for any reason?

One sharp edge that might be causing this: if a per-tenant override block is matched, then the entire block is used for that tenant (including all the zero values). Tempo does not override at the field level.
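
A minimal sketch of that sharp edge, assuming the block-level replacement described above (the tenant name is the one from this issue; this is not a recommended config):

    # Once the "tracing-apps-app-dev" entry matches, this whole limits block is used for that tenant.
    overrides:
      tracing-apps-app-dev:
        metrics_generator:
          processors: [service-graphs, span-metrics]
        # Anything not listed here (e.g. the ingestion limits) is effectively left at its zero value
        # for this tenant, rather than falling back to the values configured under `defaults`.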

@ToonTijtgat2
Author

Dear @joe-elliott, thanks for checking my configuration.

I tried both the legacy and the new config format for the per-tenant overrides, but from the moment an override is applied to metrics_generator, it seems to stop working.

I'll try setting the log level to debug to see if I can find any more relevant logs.

Since the only change is the small overrides part, and traces still seem to come through, I think no spans are dropped. Is there a way to check this?

@ToonTijtgat2
Author

After the change to the broken config, I indeed see lines like: rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (0 bytes) exceeded while adding 923 bytes for user tracing-apps-app-dev

Why does it start doing this when I only wanted to override metrics_generator?

@ToonTijtgat2
Author

  overrides.yaml: |

    overrides:
      tracing-apps-app-dev:
        ingestion:
          rate_strategy: local
          rate_limit_bytes: 15000000
          burst_size_bytes: 20000000
          max_traces_per_user: 10000
        read:
          max_bytes_per_tag_values_query: 5000000
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

With this config it works, so when you override, make sure you override everything...
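
Since production has multiple tenants (see the additional context above), a hedged sketch of how this pattern could extend: each tenant entry restates the full set of limits, because a matched per-tenant block is applied as a whole. The second tenant name and its header value are hypothetical.

    overrides:
      tracing-apps-app-dev:
        ingestion:
          rate_strategy: local
          rate_limit_bytes: 15000000
          burst_size_bytes: 20000000
          max_traces_per_user: 10000
        read:
          max_bytes_per_tag_values_query: 5000000
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev
      tracing-apps-app-prd:   # hypothetical second tenant
        ingestion:
          rate_strategy: local
          rate_limit_bytes: 15000000
          burst_size_bytes: 20000000
          max_traces_per_user: 10000
        read:
          max_bytes_per_tag_values_query: 5000000
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-prd   # hypothetical header value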

Thanks @joe-elliott for the solution.

@joe-elliott
Member

Thanks for following up!

@knylander-grafana can we make sure this is documented somewhere? This has caught multiple people before.

@knylander-grafana
Contributor

Will do! I'll create a doc issue: #3462
