This repository has been archived by the owner on Jul 19, 2023. It is now read-only.

Stack trace symbols resolution is slow #690

Closed
kolesnikovae opened this issue May 12, 2023 · 2 comments

Comments

@kolesnikovae
Contributor

kolesnikovae commented May 12, 2023

SelectMergeStacktraces queries typically spend a significant share of their time (~30%) resolving stacktraces. This is especially evident when an on-disk block is queried.

Example: in the screenshot below, we spent 2 seconds reading ~630k stacktraces from the parquet table (approx. 500 pages), which is three times longer than fetching the sample data itself:
image

One hypothesis is that the data locality of the stacktraces parquet table is poor in certain cases. This causes high read amplification: the requested rows are scattered across the file, so the column iterator has to scan the whole file, seeking page by page. We need to investigate the problem further and optimize the stacktraces parquet table, if necessary:

  • Try to order rows by Series ID first. This should improve data locality: stack traces are typically fetched for a single application, or a subset of them.
  • Find the optimal page size. Experiment with read-ahead access.
  • Encoding: figure out whether Delta encoding is appropriate for the LocationIDs array, and fix it if not. Potentially, this optimization only decreases the CPU time spent in varint decoding.
  • Model: use uint32 for LocationIDs and StacktraceID: 4 billion is enough to reference locations within a single block. This optimization probably only improves the in-memory representation; the on-disk size is unlikely to change.
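
To illustrate the data locality point, here is a minimal sketch that counts how many pages must be read to fetch the rows of one series, before and after ordering rows by Series ID. The `row` type, the helper names, and the fixed rows-per-page model are assumptions made for illustration; they are not the actual table schema or the parquet reader's behaviour.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for one row of the stacktraces table.
type row struct {
	SeriesID     uint32
	StacktraceID uint32
}

// interleavedRows builds rows where consecutive rows belong to different
// series, which models poor data locality.
func interleavedRows(numSeries, perSeries int) []row {
	rows := make([]row, 0, numSeries*perSeries)
	for i := 0; i < numSeries*perSeries; i++ {
		rows = append(rows, row{SeriesID: uint32(i % numSeries), StacktraceID: uint32(i)})
	}
	return rows
}

// sortBySeries orders rows by SeriesID, keeping the relative order of rows
// within the same series.
func sortBySeries(rows []row) {
	sort.SliceStable(rows, func(i, j int) bool { return rows[i].SeriesID < rows[j].SeriesID })
}

// pagesTouched counts the pages that hold at least one row of the given
// series, assuming each page stores pageSize consecutive rows.
func pagesTouched(rows []row, series uint32, pageSize int) int {
	touched := make(map[int]struct{})
	for i, r := range rows {
		if r.SeriesID == series {
			touched[i/pageSize] = struct{}{}
		}
	}
	return len(touched)
}

func main() {
	rows := interleavedRows(4, 1000)
	fmt.Println("interleaved:", pagesTouched(rows, 1, 100))
	sortBySeries(rows)
	fmt.Println("sorted by series:", pagesTouched(rows, 1, 100))
}
```

In this toy model, every page contains rows of the queried series when rows are interleaved, while after sorting the same rows fit into a contiguous quarter of the pages.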

We may also want to experiment with stack truncation, e.g., keep only the stacks that account for 95-99% of values, and create a stub frame for the truncated ones. This should be done before resolving locations (reading the stacktraces parquet table). It is meaningful only if the relevant stacktraces are stored close to each other on disk; otherwise the number of rows/pages we need to fetch will barely change, and the latency of the operation will not decrease.
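
A minimal sketch of the truncation idea, assuming samples have already been aggregated per stacktrace ID: keep the heaviest stacks until they cover the requested share of the total value, and return the remainder so it can be attributed to a stub frame. The `sample` type and the `truncate` function are hypothetical names, not part of the existing code base.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical aggregated value for one stacktrace.
type sample struct {
	StacktraceID uint32
	Value        int64
}

// truncate keeps the heaviest stacktraces until they account for at least
// keepPercent of the total value. It returns the kept samples, ordered by
// descending value, and the summed value of the dropped ones, which the
// caller can attribute to a stub frame.
func truncate(samples []sample, keepPercent int64) (kept []sample, other int64) {
	sorted := append([]sample(nil), samples...) // do not reorder the input
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Value > sorted[j].Value })

	var total int64
	for _, s := range sorted {
		total += s.Value
	}

	var acc int64
	n := 0
	// Integer arithmetic avoids floating-point rounding at the threshold.
	for n < len(sorted) && acc*100 < total*keepPercent {
		acc += sorted[n].Value
		n++
	}
	return sorted[:n], total - acc
}

func main() {
	samples := []sample{{1, 3}, {2, 60}, {3, 2}, {4, 25}, {5, 10}}
	kept, other := truncate(samples, 95)
	fmt.Println("kept stacktraces:", len(kept), "value in stub frame:", other)
}
```

Only the kept stacktrace IDs would then need to be resolved against the stacktraces table, which helps only under the locality condition described above.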

One approach is to return the result of aggregation by addressing stacktraces globally, and let the querier decide what to keep. Another way is to truncate stacks locally in ingesters, but the resulting flamegraph may be less accurate.

@cyriltovena
Collaborator

cyriltovena commented May 15, 2023

I'm also wondering if we should have a single nested parquet file for stacktraces (instead of 4 different files). It sounds better, especially if we order them by series.

@kolesnikovae kolesnikovae changed the title Stacktraces resolve is slow Stacktraces resolution is slow Jun 6, 2023
@kolesnikovae kolesnikovae changed the title Stacktraces resolution is slow Stack trace symbols resolution is slow Jun 8, 2023
@kolesnikovae
Contributor Author

Resolved via #767
