It’s been noticed that SelectMergeStacktraces queries typically spend a significant amount of time (~30%) resolving stacktraces. This is especially evident when an on-disk block is queried.
Example: in the screenshot below you can see that we spent 2 seconds reading ~630k stacktraces from the parquet table (approximately 500 pages) – three times longer than fetching the sample data itself:
One of the hypotheses is that data locality of the stacktraces parquet table may be poor in certain cases, resulting in high read amplification: the column iterator has to scan the whole file, seeking page by page. We need to investigate the problem further and optimize the stacktraces parquet table, if necessary:
Try to order rows by Series ID first – this should improve data locality: stack traces are typically fetched for a single application or a subset of applications.
Find the optimal page size. Experiment with read-ahead access.
Encoding: figure out whether Delta encoding is appropriate for the LocationIDs array, and fix it if not. This optimization may only decrease the CPU time spent in varint decoding.
Model: use uint32 for LocationIDs and StacktraceID – 4 billion values are enough to reference locations within a single block. This optimization probably only improves the in-memory representation; the on-disk size is unlikely to change. See the sketch after this list.
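A minimal sketch of what the narrower row model and series-ordered layout could look like, assuming segmentio/parquet-go; the struct, field names, and sorting key here are illustrative, not the actual block schema:

```go
package example

import (
	"os"
	"sort"

	"github.com/segmentio/parquet-go"
)

// stacktraceRow is an illustrative row model: uint32 is wide enough to
// reference locations and stacktraces within a single block.
type stacktraceRow struct {
	SeriesID     uint32   `parquet:"series_id"`
	StacktraceID uint32   `parquet:"stacktrace_id"`
	LocationIDs  []uint32 `parquet:"location_ids,list"`
}

// writeStacktraces orders rows by SeriesID first, so the stacktraces of a
// single application land in adjacent pages and a query touching one
// series reads far fewer pages.
func writeStacktraces(path string, rows []stacktraceRow) error {
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].SeriesID != rows[j].SeriesID {
			return rows[i].SeriesID < rows[j].SeriesID
		}
		return rows[i].StacktraceID < rows[j].StacktraceID
	})

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := parquet.NewGenericWriter[stacktraceRow](f)
	if _, err := w.Write(rows); err != nil {
		return err
	}
	return w.Close()
}
```

Whether Delta encoding helps for `location_ids` would still need to be checked against the actual value distribution; the layout above only addresses ordering and the narrower integer width.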
We also may want to experiment with stack truncation, e.g. keep only the stack traces that account for 95–99% of values and create a stub frame for the truncated ones. This should be done before resolving locations (i.e. before reading the stacktraces parquet table). It is only meaningful if the relevant stacktraces are stored close to each other on disk; otherwise the number of rows/pages we need to fetch barely changes, and the latency of the operation will not decrease.
One approach is to return the result of the aggregation with stacktraces addressed globally and let the querier decide what to keep. Another is to truncate stacks locally in the ingesters, but the resulting flamegraph may be less accurate.
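A rough sketch of the local truncation variant, assuming samples arrive as stacktrace ID → value pairs; the function and the stub sentinel are hypothetical, not existing code:

```go
package example

import "sort"

// stubStacktraceID is a hypothetical sentinel that the flamegraph would
// render as a single "truncated" frame.
const stubStacktraceID = uint32(0)

// truncateStacks keeps the heaviest stacktraces that together account for
// the given fraction (e.g. 0.95–0.99) of the total value and folds the
// rest into the stub entry, so only the kept IDs have to be resolved from
// the stacktraces parquet table.
func truncateStacks(samples map[uint32]int64, fraction float64) map[uint32]int64 {
	type kv struct {
		id    uint32
		value int64
	}
	var total int64
	sorted := make([]kv, 0, len(samples))
	for id, v := range samples {
		sorted = append(sorted, kv{id, v})
		total += v
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].value > sorted[j].value })

	threshold := int64(float64(total) * fraction)
	out := make(map[uint32]int64, len(sorted))
	var kept int64
	for _, s := range sorted {
		if kept < threshold {
			out[s.id] = s.value
			kept += s.value
			continue
		}
		out[stubStacktraceID] += s.value // fold the tail into the stub frame
	}
	return out
}
```

Note that truncation only pays off if the kept IDs map to a small set of contiguous pages, which again comes back to the ordering question above.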
I'm also wondering if we should have a single nested parquet file for stacktraces (instead of 4 different files); that sounds better, especially if we order them by series.
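A hedged sketch of what a single nested row could look like (the field names are made up for illustration; ordering by series would still apply):

```go
// location is stored inline instead of in a separate parquet file.
type location struct {
	Address  uint64 `parquet:"address"`
	Function string `parquet:"function,dict"`
	Filename string `parquet:"filename,dict"`
	Line     int32  `parquet:"line"`
}

// stacktrace is one row of the hypothetical combined table; a single
// nested file trades the 4 separate lookups for one sequential read.
type stacktrace struct {
	SeriesID     uint32     `parquet:"series_id"`
	StacktraceID uint32     `parquet:"stacktrace_id"`
	Locations    []location `parquet:"locations,list"`
}
```

The schema could then be derived from the struct with parquet.SchemaOf as usual; the trade-off is duplicating location data across stacktraces versus dropping the extra joins.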