
feat: get series labels from store gateway #2431

Merged: 60 commits merged into main from sg-labels on Sep 29, 2023

Conversation

Contributor

@bryanhuhta bryanhuhta commented Sep 20, 2023

Related: #2230
Related: https://github.com/grafana/pyroscope-app-plugin/issues/70

Previously series were only queried from the head block in the ingester. This adds a time window to the Series RPC and queries the store gateway if necessary.

This PR also adds a rudimentary UI update to capture the selected window start/end and pass it along with the RPC. Edit: I reverted this in c72f5c6 to reduce the complexity of this PR. It will be added in a follow-up PR where more time and energy can be spent on a "non-rudimentary" implementation.

@bryanhuhta bryanhuhta self-assigned this Sep 20, 2023
@bryanhuhta bryanhuhta marked this pull request as ready for review September 20, 2023 22:04
@bryanhuhta bryanhuhta requested a review from a team as a code owner September 20, 2023 22:04
@bryanhuhta bryanhuhta changed the title Get series labels from store gateway feat: get series labels from store gateway Sep 20, 2023
Collaborator

@kolesnikovae kolesnikovae left a comment

Looks great!

Next we need to update PhlareDB.Series to use the new implementation

Comment on lines 208 to 209
if req.Msg.Start > req.Msg.End {
return nil, connect.NewError(connect.CodeInvalidArgument, errors.New("start must be before end"))
Collaborator

We should also validate the time range in the query-frontend handler, like this: https://github.com/grafana/pyroscope/blob/main/pkg/frontend/frontend_select_series.go#L37-L43

Contributor Author

In d8df865, I moved the validation from the querier to the query-frontend.

Should we do validation in both places?

Collaborator

I think doing it in the query-frontend should be enough. We only need to make sure the req.Msg.Start > req.Msg.End condition is actually validated there.
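
For context, here is a minimal sketch of the kind of check being discussed, written as a standalone helper rather than the actual query-frontend handler code (the helper name and package are assumptions; the error construction mirrors the snippet quoted above, and the connect import path may differ depending on the connect-go version in the repo):

```go
package validation

import (
	"errors"

	"connectrpc.com/connect" // import path may differ by connect-go version
)

// validateTimeRange is a hypothetical helper showing the check discussed
// above: reject requests whose start timestamp is after the end timestamp.
func validateTimeRange(startMs, endMs int64) error {
	if startMs > endMs {
		return connect.NewError(
			connect.CodeInvalidArgument,
			errors.New("start must be before end"),
		)
	}
	return nil
}
```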

pkg/phlaredb/head_queriers.go (resolved)
pkg/phlaredb/block_querier.go (resolved)
Comment on lines 945 to 949
// TODO(bryan) do we need to close this?
err := b.Open(ctx)
if err != nil {
return nil, err
}
Collaborator

Open is a very expensive operation, therefore we do want to keep blocks open. Unfortunately, at the moment we don't have a straightforward caching / eviction policy, which may easily become a problem. We will change this in the future.

Contributor

Potentially we also want it to have a state where only the TSDB index is there and the parquet files are not even looked at.

Contributor Author

It makes sense to try to keep things open for as long as possible if opens are expensive. I like @simonswine's idea of doing a partial open vs. loading the entire block.
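
To make the partial-open idea a bit more concrete, here is a purely hypothetical sketch (none of these types exist in the codebase in this form): the TSDB index is opened eagerly since it is small and serves Series/label queries, while the parquet readers are opened lazily only when profile data is actually requested.

```go
package blocksketch

import (
	"context"
	"sync"
)

// Placeholder types for illustration only; the real block querier types differ.
type tsdbIndex struct{}    // small: answers series/label queries
type parquetFiles struct{} // large: needed only for profile data

// partialBlock opens the TSDB index up front and defers the expensive parquet
// open until profile data is requested.
type partialBlock struct {
	idx *tsdbIndex

	profOnce sync.Once
	prof     *parquetFiles
	profErr  error
}

// openIndexOnly stands in for downloading/mmapping just the index section of a block.
func openIndexOnly(_ context.Context) (*partialBlock, error) {
	return &partialBlock{idx: &tsdbIndex{}}, nil
}

// profiles lazily opens the parquet readers the first time they are needed,
// so a Series() call never pays for them.
func (b *partialBlock) profiles(_ context.Context) (*parquetFiles, error) {
	b.profOnce.Do(func() {
		b.prof, b.profErr = &parquetFiles{}, nil // the real, expensive open would happen here
	})
	return b.prof, b.profErr
}
```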

@bryanhuhta
Contributor Author

bryanhuhta commented Sep 26, 2023

I tested these changes in dev (using 02ec939) and the behavior of querying from the store gateway works as intended. We also maintain legacy query behavior if no interval is sent.

I also ran a quick performance check and the results were, well, substandard at best. I made 20 requests with a variety of intervals to the query frontend and collected the server-side request duration. I tested the existing query pattern, the new one with only synchronous requests (basically a best-case scenario), and the new query pattern with asynchronous requests (the "typical" scenario).

| Interval | Old avg. (ms) | New avg. sync (ms) | New avg. async (ms) |
| --- | --- | --- | --- |
| 1h | 215.35 | 177.85 | 390.84 |
| 6h | 218.00 | 253.25 | 427.29 |
| 24h | 200.45 | 2837.50 | 6582.29 |
| 2d | 183.35 | 5199.00 | 12021.38 |
| 7d | 195.60 | 15423.00 | 13842.05 |
| 14d | 196.15 | 27824.16 | 23781.07 |

[chart: request duration vs. interval for the old and new query patterns]

As the interval increases, the response time increases exponentially in the synchronous case (with an R² of 0.998 according to Google Sheets). The asynchronous case increases much like the synchronous one, but tapers off towards the end. This, however, is because the pods OOM and crash, cancelling requests and producing artificially low request times.

With the new pattern, any request interval above 24h is almost guaranteed to make pods crash and requests fail. When such a request does complete, the application is more or less unusable due to the high latency. Below a 24h interval, the request latency is largely unnoticeable.

I suspect the OOMs are a result of opening an unbounded number of blocks and scanning them for series labels (though that's just a hunch). We could mitigate this by reducing parallelism on a single pod, but that would in turn increase the total request duration. Another obvious solution would be to cache the series labels so we can avoid opening a block altogether, but caching is always easier to propose than it is to implement.

At this point, the question is: do we need to come up with a strategy to mitigate the cost of hitting the store gateway before or after we merge this PR?

I suspect these numbers are just too high and the experience would be too poor to accept the change in its current form.

@kolesnikovae
Collaborator

Cross-posting from Slack:

I'm surprised the performance is so low. The TSDB index is downloaded into memory entirely on the first access to the block and kept there forever (sort of an in-memory cache). I'm wondering if the performance of subsequent queries is any better. Judging by the flamegraph, it is not, since we spend most of the time decoding data (in memory):

[flamegraph screenshot: most of the time is spent decoding series data]

We should:

  1. As we have all the data in memory, the query should take no longer than tens of milliseconds. Most likely, there is a better way to iterate the series (rather than decoding series by series). If that does not help, I think we may want to keep the decoded data in memory. Our indices are tiny, a few thousand series each, so this should not be much worse than what we have now.
  2. Implement block compaction and block deduplication. We process lots of duplicate data (3-6x). @cyriltovena and @simonswine are working on it.
  3. Make the cache more robust. Maybe pre-load some blocks into memory on start (so it won't happen at the first query) and evict old TSDB indices.

I guess we could try to fix this before merging.

  1. The very first thing I'd do is late materialisation: when we iterate series via index.(*Decoder).Series, all label keys and values are resolved with reader.lookupSymbol, which contributes most of the overall CPU consumption. I believe we should only resolve symbol references after we have collected and deduplicated them, so that the lookup is performed just once for each unique symbol (a rough sketch of the idea follows this list).
  2. Only fetch the labels listed in the request, rather than doing it afterwards with model.Labels.WithLabels (which is responsible for 5%).
  3. Optimize the allocations. index.(*Decoder).Series makes lots of allocations, but I fail to explain them (line numbers can be found in the pprof profile you can get with profilecli). I'm 99% sure we can eliminate 99.99% of them.
  4. Skip the chunk-related part. We don't use it.
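
To illustrate suggestion 1 (late materialisation), here is a minimal standalone sketch, not the actual index reader code: collect raw symbol references first, then resolve each unique reference exactly once. The symbolRef type and lookupSymbol function are stand-ins for the real reader internals.

```go
package main

import "fmt"

type symbolRef uint32

// lookupSymbol stands in for reader.lookupSymbol: in the real code this
// decodes a symbol table entry, which is the expensive part being avoided.
func lookupSymbol(ref symbolRef) string {
	return fmt.Sprintf("symbol-%d", ref)
}

// resolveUnique resolves each distinct reference once, instead of once per
// series/label occurrence.
func resolveUnique(refs []symbolRef) map[symbolRef]string {
	resolved := make(map[symbolRef]string, len(refs))
	for _, r := range refs {
		if _, ok := resolved[r]; ok {
			continue // duplicate reference: no extra lookup
		}
		resolved[r] = lookupSymbol(r)
	}
	return resolved
}

func main() {
	// Many series share the same label name/value references, so lookups drop
	// from O(series x labels) to O(unique symbols).
	refs := []symbolRef{1, 2, 1, 3, 2, 1}
	fmt.Println(resolveUnique(refs))
}
```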

pkg/phlaredb/block_querier.go (outdated, resolved)
pkg/phlaredb/block_querier.go (outdated, resolved)
pkg/phlaredb/block_querier_test.go (outdated, resolved)
pkg/phlaredb/block_querier_test.go (resolved)
pkg/querier/querier.go (resolved)
pkg/phlaredb/block_querier.go (outdated, resolved)
pkg/phlaredb/block_querier.go (outdated, resolved)
pkg/phlaredb/block_querier.go (outdated, resolved)

var labelsSet []*typesv1.Labels
var lock sync.Mutex
group, ctx := errgroup.WithContext(ctx)
Contributor

We probably should bound the amount of parallelism with some constant.

Contributor Author

I set the parallelism constant to 50 in 28d98bb; 50 is only a SWAG (a rough first guess).
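
For context, here is a minimal sketch of bounding the fan-out with errgroup's SetLimit. Whether the PR uses SetLimit or another mechanism isn't shown here, and the querier signature below is simplified for illustration.

```go
package querysketch

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

const maxSeriesQueryParallelism = 50 // mirrors the constant discussed above

// collectSeries fans out to the queriers with at most 50 concurrent calls per pod.
func collectSeries(ctx context.Context, queriers []func(context.Context) ([]string, error)) ([]string, error) {
	var (
		out  []string
		lock sync.Mutex
	)
	group, ctx := errgroup.WithContext(ctx)
	group.SetLimit(maxSeriesQueryParallelism) // cap in-flight block queries

	for _, q := range queriers {
		q := q // capture the loop variable (needed before Go 1.22)
		group.Go(func() error {
			labels, err := q(ctx)
			if err != nil {
				return err
			}
			lock.Lock()
			out = append(out, labels...)
			lock.Unlock()
			return nil
		})
	}
	if err := group.Wait(); err != nil {
		return nil, err
	}
	return out, nil
}
```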

pkg/phlaredb/block_querier.go (outdated, resolved)
@bryanhuhta
Contributor Author

After implementing some of the suggestions raised by @kolesnikovae and @simonswine, here are the new performance metrics:

| Interval | Old avg. (ms) | New (v1) avg. (ms) | New (current) avg. (ms) |
| --- | --- | --- | --- |
| 1h | 215.35 | 177.85 | 118.00 |
| 6h | 218.00 | 253.25 | 141.97 |
| 24h | 200.45 | 2837.50 | 506.45 |
| 2d | 183.35 | 5199.00 | 938.38 |
| 7d | 195.60 | 15423.00 | 2400.54 |
| 14d | 196.15 | 27824.16 | 4908.47 |

[chart: request duration vs. interval for the old, v1, and current implementations]

As you can see, there are significant decreases in response time compared to the first iteration. Even at quite long intervals, the response time hovers around 5s (while not optimal, still manageable). More typical intervals have an adequate request duration, broadly staying under 1s up to 2d.

@kolesnikovae had a number of other suggestions we could pursue to improve performance, but at this point, I think these numbers are healthy enough to ship and iterate on.

@kolesnikovae
Collaborator

Great work @bryanhuhta! I think you're right, we should merge it now and only optimise it if we see an actual need.


for _, q := range queriers {
group.Go(util.RecoverPanic(func() error {
labels, err := q.Series(ctx, req)
Contributor

For later: if we change that API to send back the fingerprint ID, we should gain an even bigger perf boost.

Contributor

@cyriltovena cyriltovena left a comment

LGTM

Something to investigate that you could use later:

func (r *Reader) Series(id storage.SeriesRef, lbls *phlaremodel.Labels, chks *[]ChunkMeta) (uint64, error) {

Contributor

@simonswine simonswine left a comment

LGTM. Thanks @bryanhuhta for your patience going through all of our feedback.

Before you merge (sorry), I would like you to change the start/stop time unit from seconds to milliseconds, which is what the other query endpoints use.
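
For illustration, a tiny sketch of the unit convention being asked for, with hypothetical variable names: start and end travel as Unix timestamps in milliseconds rather than seconds.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-6 * time.Hour)

	// Milliseconds, matching the other query endpoints, rather than start.Unix() seconds.
	startMs := start.UnixMilli()
	endMs := end.UnixMilli()
	fmt.Println(startMs, endMs)
}
```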

pkg/frontend/frontend_series.go (outdated, resolved)
pkg/frontend/frontend_series.go (outdated, resolved)
@bryanhuhta bryanhuhta merged commit 945eb83 into main Sep 29, 2023
16 checks passed
@bryanhuhta bryanhuhta deleted the sg-labels branch September 29, 2023 15:59