
Timeout & cache on metrics generator local-blocks processor #3768

Open
icemanDD opened this issue Jun 10, 2024 · 9 comments

@icemanDD

Describe the bug
When onboarding the metrics-generator local-blocks processor (Aggregate By), the first query always times out, but the second or third query loads the results, and we did not set up any cache layer.
Is there any local cache for this feature?
How should we optimize the query-frontend, querier, and metrics generator to get local-blocks working on millions of spans?

To Reproduce
tempo config:

query_frontend:
  max_outstanding_per_tenant: 300000
  max_retries: 2
  max_batch_size: 5
  metrics:
    concurrent_jobs: 300000 # also tried: 2000, 200
    target_bytes_per_job: 2.25e+07 # also tried: 2.25e+08, 1.25e+09
querier:
  max_concurrent_queries: 1500
  query_relevant_ingesters: true
...
  metrics:
    concurrent_blocks: 1500
    time_overlap_cutoff: 0.0
metrics_generator:
  ...
  processor:
    local_blocks:
      search:
        prefetch_trace_count: 10000
      block:
        search_page_size_bytes: 5000000
        parquet_row_group_size_bytes: 500000000
        bloom_filter_shard_size_bytes: 1000000
      filter_server_spans: false
      complete_block_timeout: 1h
      concurrent_blocks: 200

Using the async iterator did not help:
VPARQUET_ASYNC_ITERATOR="1"

Environment
Tempo 2.4.2

Expected behavior
With the proper configuration, local-blocks should return results in under 10-15 seconds, or at least return only the queried data within 10-15 seconds instead of triggering the timeout.

@mdisibio
Contributor

Hi, thanks for posting your configuration. Can you also give an estimate of how many spans/s (tempo_distributor_spans_received_total) this cluster is receiving? Also, a file listing or information about the blocks in the generator (by default under /var/tempo/generator/traces)? These will be very helpful for judging the volume/size of this cluster and which settings are best.

To start: yes, there is local caching in the local-blocks processor, which is why it is faster on the next call. Thanks for testing async; we agree sync is generally faster, which is why it is the default.

Aggregate By is not heavy on the frontend and queriers; the work is mainly in the metrics generator.

Here are next steps I would recommend:

  1. Change parquet_row_group_size_bytes to 100MB. We have found that ~100MB is ideal in all our clusters, both for searching and metrics. 500MB is large: it increases the size of the dictionary per row group and the overhead to scan the block.
  2. What is block:max_block_bytes? The default is 500MB, which is fine for the entire block size; on average that is about 5 row groups.
  3. What is block:max_block_duration? I would set it to between 1 and 5 minutes. This is a good balance between how long data is stored in the WAL (less efficient to scan) and flushing overhead. At 5 minutes, each call for the last hour will scan 12 blocks. (A combined config sketch follows this list.)
  4. Finally, generators respond well to horizontal scaling. On the same input volume, 2x the number of pods means each pod contains 1/2 the data. Without digging deeper to identify the specific bottleneck, it's hard to say how fast each pod should be for your workload, but scaling will help.
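
Putting steps 1-3 together, a minimal sketch of the suggested settings looks roughly like this (values come from the steps above; exact key names and placement, in particular whether max_block_duration and max_block_bytes sit directly under local_blocks or under block, should be checked against the Tempo docs for your version):

metrics_generator:
  processor:
    local_blocks:
      max_block_duration: 5m          # step 3: between 1m and 5m; at 5m a last-hour query scans 12 blocks
      max_block_bytes: 500000000      # step 2: default 500MB is fine, roughly 5 row groups per block
      block:
        parquet_row_group_size_bytes: 100000000  # step 1: ~100MB per row group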

@icemanDD
Author

how many span/s (tempo_distributor_spans_received_total) this cluster is receiving?

around 500k spans/s

a file listing or information about the blocks in the generator (maybe default /var/tempo/generator/traces)?

For example, one file in the WAL:

  File: 00000191
  Size: 131235840       Blocks: 256328     IO Block: 4096 
  Inode: 1180675     Links: 1

@icemanDD
Author

After updating parquet_row_group_size_bytes to 100MB:

  Size: 30059883        Blocks: 58712      IO Block: 4096   
  Inode: 1308723     Links: 1

@mdisibio
Contributor

 File: 00000191
  Size: 131235840

WAL "blocks" are composed of internal flushes, which are mini parquet files. This looks like flush 191, and it is 130MB. That number of flushes for a WAL block is quite high. Are flush_check_period and max_block_duration at their default values? With the defaults (flush_check_period=10s, max_block_duration=1m) there are 6 flushes per WAL block and 60 blocks total for the last hour. The final blocks are in /var/tempo/generator/traces/blocks/<tenant>.
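
For reference, a sketch of those two settings at their defaults, with the arithmetic from the paragraph above written out as comments:

metrics_generator:
  processor:
    local_blocks:
      flush_check_period: 10s  # default; one internal flush roughly every 10s
      max_block_duration: 1m   # default; 60s / 10s ≈ 6 flushes per WAL block,
                               # and 1h / 1m = 60 blocks for a last-hour query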

For 500K spans/s you may need 50+ generators to get the "cold" latency to your target.

@icemanDD
Author

One of the block files:

File: 5d8f3be4-c96b-45f7-bd7e-618d5cd60172/
  Size: 4096            Blocks: 8          IO Block: 4096   directory
  Inode: 3932202     Links: 2

Does this make sense? I am using the default MaxBlockBytes (500MB) and max_block_duration set to 3m.
Also, will updating search_page_size_bytes, bloom_filter_shard_size_bytes, or parquet_dedicated_columns help?

@mdisibio
Contributor

Size: 4096

This is listing the directory; can you list the files inside the folder (i.e. data.parquet)?

parquet_dedicated_columns

Yes, definitely. This blog post has a walkthrough and shows how to use tempo-cli analyse blocks.
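
For illustration only, dedicated columns are configured on the block config roughly like this; the attribute names below are hypothetical placeholders, and the real candidates should come from the tempo-cli analyse blocks report:

metrics_generator:
  processor:
    local_blocks:
      block:
        parquet_dedicated_columns:
          - scope: span          # hypothetical attribute; replace with the top
            name: http.url       # attributes reported by analyse blocks
            type: string
          - scope: resource
            name: k8s.pod.name   # hypothetical
            type: string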

@icemanDD
Author

Got it. An example data.parquet:

  File: data.parquet
  Size: 184291173       Blocks: 359952     IO Block: 4096   regular file
  Inode: 1308506     Links: 1

@mdisibio
Contributor

mdisibio commented Jul 8, 2024

@icemanDD Hi, did the new settings and dedicated columns help?

@icemanDD
Author

icemanDD commented Jul 8, 2024

Hi @mdisibio, that helps a bit, but we realized TraceQL metrics works better. Does it follow the same optimization pattern, or does it need querier scaling for better performance, especially when querying older data?
