
[metrics-generator] overrides per tenant do not work for metrics-generator #3462

Closed
ToonTijtgat2 opened this issue Mar 5, 2024 · 6 comments

Comments

@ToonTijtgat2

Describe the bug
I'm trying to use the runtime config to override per-tenant configuration, but it does not work.
When I leave the config (tempo.yaml) like this, it works fine:

    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s
      defaults:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

The logs of the metrics generator show lines like this: level=info ts=2024-03-05T09:46:15.602380169Z caller=registry.go:232 tenant=tracing-apps-app-dev msg="collecting metrics" active_series=1227

From the moment I change the config to:
tempo.yaml:

    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s

overrides.yaml:

    overrides:
      tracing-apps-app-dev:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

the metrics generator doesn't seem to do anything anymore.
For some reason, it keeps working if I make this change and only restart the metrics-generator deployment. Once I remove the namespace, it's broken.
On another cluster I tried the runtime config from the start, and there it never worked.

To Reproduce
Steps to reproduce the behavior:

  1. Have a clean installation of Tempo with the per-tenant overrides runtime config as above.
  2. Check whether the metrics generator produces any metrics.

Expected behavior
I expect the metrics generator to generate metrics for the tenant tracing-apps-app-dev and send them to the Mimir instance for prometheus-apps-app-dev.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm

Non-working config:

  tempo-query.yaml: |
    backend: 127.0.0.1:3100
  tempo.yaml: |

    compactor:
      compaction:
        block_retention: 168h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 107374182400
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 10
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 1000
      ring:
        kvstore:
          store: memberlist
    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      flush_all_on_shutdown: true
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - remote_timeout: 30s
          url: mimirurl/api/v1/push
        remote_write_flush_deadline: 1m
    multitenancy_enabled: true
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery:9095
      max_concurrent_queries: 20
      query_relevant_ingesters: true
      search:
        external_backend: null
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
        prefer_self: 10
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_retries: 2
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
      trace_by_id:
        hedge_requests_at: 2s
        hedge_requests_up_to: 2
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 6194304
      grpc_server_max_send_msg_size: 6194304
      http_listen_port: 3100
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        azure:
          container_name: traces-blocks
          storage_account_key: xxx
          storage_account_name: xxx
        backend: azure
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    usage_report:
      reporting_enabled: false
  overrides.yaml: |

    overrides:
      tracing-apps-app-dev:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

Working config:

  tempo-query.yaml: |
    backend: 127.0.0.1:3100
  tempo.yaml: |

    compactor:
      compaction:
        block_retention: 168h
        compacted_block_retention: 1h
        compaction_cycle: 30s
        compaction_window: 1h
        max_block_bytes: 107374182400
        max_compaction_objects: 6000000
        max_time_per_tenant: 5m
        retention_concurrency: 10
        v2_in_buffer_bytes: 5242880
        v2_out_buffer_bytes: 20971520
        v2_prefetch_traces_count: 1000
      ring:
        kvstore:
          store: memberlist
    distributor:
      receivers:
        jaeger:
          protocols:
            grpc:
              endpoint: 0.0.0.0:14250
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      ring:
        kvstore:
          store: memberlist
    ingester:
      flush_all_on_shutdown: true
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
        tokens_file_path: /var/tempo/tokens.json
    memberlist:
      abort_if_cluster_join_fails: false
      bind_addr: []
      bind_port: 7946
      gossip_interval: 1s
      gossip_nodes: 2
      gossip_to_dead_nodes_time: 30s
      join_members:
      - dns+tempo-gossip-ring:7946
      leave_timeout: 5s
      left_ingesters_timeout: 5m
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
      node_name: ""
      packet_dial_timeout: 5s
      packet_write_timeout: 5s
      pull_push_interval: 30s
      randomize_node_name: true
      rejoin_interval: 0s
      retransmit_factor: 2
      stream_timeout: 10s
    metrics_generator:
      metrics_ingestion_time_range_slack: 30s
      processor:
        service_graphs:
          dimensions: []
          histogram_buckets:
          - 0.1
          - 0.2
          - 0.4
          - 0.8
          - 1.6
          - 3.2
          - 6.4
          - 12.8
          max_items: 10000
          wait: 10s
          workers: 10
        span_metrics:
          dimensions: []
          histogram_buckets:
          - 0.002
          - 0.004
          - 0.008
          - 0.016
          - 0.032
          - 0.064
          - 0.128
          - 0.256
          - 0.512
          - 1.02
          - 2.05
          - 4.1
      registry:
        collection_interval: 15s
        external_labels: {}
        stale_duration: 15m
      ring:
        kvstore:
          store: memberlist
      storage:
        path: /var/tempo/wal
        remote_write:
        - remote_timeout: 30s
          url: https://mimirurl/api/v1/push
        remote_write_flush_deadline: 1m
    multitenancy_enabled: true
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
      per_tenant_override_period: 5s
      defaults:
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery:9095
      max_concurrent_queries: 20
      query_relevant_ingesters: true
      search:
        external_backend: null
        external_endpoints: []
        external_hedge_requests_at: 8s
        external_hedge_requests_up_to: 2
        prefer_self: 10
        query_timeout: 30s
      trace_by_id:
        query_timeout: 10s
    query_frontend:
      max_retries: 2
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
      trace_by_id:
        hedge_requests_at: 2s
        hedge_requests_up_to: 2
        query_shards: 50
    server:
      grpc_server_max_recv_msg_size: 6194304
      grpc_server_max_send_msg_size: 6194304
      http_listen_port: 3100
      http_server_read_timeout: 30s
      http_server_write_timeout: 30s
      log_format: logfmt
      log_level: info
    storage:
      trace:
        azure:
          container_name: traces-blocks
          storage_account_key: xxx
          storage_account_name: xxx
        backend: azure
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
    usage_report:
      reporting_enabled: false
  overrides.yaml: |

    overrides: {}

Additional Context
I need the per-tenant overrides config since in prod we have multiple tenants.
I found an issue with the same problem, but it got auto-closed:
#3032

@joe-elliott
Member

Are there any relevant logs that might help? At first glance I'm not seeing anything wrong with your config. We use the overrides all the time internally, so it generally works.

Can you review metrics to help narrow down the issue? Are the distributors still sending spans to the generators? Are spans being dropped for any reason?

One sharp edge that might be causing this: if a per-tenant override block is matched, then the entire block is used for that tenant (including all the zero values). Tempo does not override at the field level.
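
A minimal sketch of that sharp edge, assuming the block-level replacement described above (the tenant name is the one from this issue; this is not a recommended config):

    # Once the "tracing-apps-app-dev" entry matches, this whole limits block is used for that tenant.
    overrides:
      tracing-apps-app-dev:
        metrics_generator:
          processors: [service-graphs, span-metrics]
        # Anything not listed here (e.g. the ingestion limits) is effectively left at its zero value
        # for this tenant, rather than falling back to the values configured under `defaults`.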

@ToonTijtgat2
Author

Dear @joe-elliott, thanks for checking my configuration.

I tried both the legacy and the new config format for the per-tenant overrides, but from the moment an override is applied to metrics_generator, it seems to stop working.

I'll try setting the log level to debug to see if I can find any more relevant logs.

Since the only change is the small overrides part, and traces still seem to come through, I think no spans are dropped. Is there a way to check this?

@ToonTijtgat2
Author

After the change to the broken config, I indeed see lines like: rpc error: code = ResourceExhausted desc = RATE_LIMITED: ingestion rate limit (0 bytes) exceeded while adding 923 bytes for user tracing-apps-app-dev

Why does it start doing this when I only wanted to override metrics_generator?

@ToonTijtgat2
Author

  overrides.yaml: |

    overrides:
      tracing-apps-app-dev:
        ingestion:
          rate_strategy: local
          rate_limit_bytes: 15000000
          burst_size_bytes: 20000000
          max_traces_per_user: 10000
        read:
          max_bytes_per_tag_values_query: 5000000
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev

With this config it works, so when you override, make sure you override everything...
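
Since production has multiple tenants (see the additional context above), a hedged sketch of how this pattern could extend: each tenant entry restates the full set of limits, because a matched per-tenant block is applied as a whole. The second tenant name and its header value are hypothetical.

    overrides:
      tracing-apps-app-dev:
        ingestion:
          rate_strategy: local
          rate_limit_bytes: 15000000
          burst_size_bytes: 20000000
          max_traces_per_user: 10000
        read:
          max_bytes_per_tag_values_query: 5000000
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-dev
      tracing-apps-app-prd:   # hypothetical second tenant
        ingestion:
          rate_strategy: local
          rate_limit_bytes: 15000000
          burst_size_bytes: 20000000
          max_traces_per_user: 10000
        read:
          max_bytes_per_tag_values_query: 5000000
        global:
          max_bytes_per_trace: 1500000
        metrics_generator:
          processors: [service-graphs, span-metrics]
          remote_write_headers:
            X-Scope-OrgID: prometheus-apps-app-prd   # hypothetical header value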

Thanks @joe-elliott for the solution.

@joe-elliott
Member

Thanks for following up!

@knylander-grafana can we make sure this is documented somewhere? This has caught multiple people before.

@knylander-grafana
Contributor

Will do! I'll create a doc issue: #3462
