feat: get series labels from store gateway #2431
Conversation
Looks great!

Next we need to update `PhlareDB.Series` to use the new implementation.
pkg/querier/querier.go
Outdated
if req.Msg.Start > req.Msg.End {
	return nil, connect.NewError(connect.CodeInvalidArgument, errors.New("start must be before end"))
}
We should also validate the time range in the query-frontend handler, like this: https://github.com/grafana/pyroscope/blob/main/pkg/frontend/frontend_select_series.go#L37-L43
In d8df865, I moved the validation from the querier to the query-frontend.
Should we do validation in both places?
I think doing it in the query-frontend should be enough. We only need to make sure the `req.Msg.Start > req.Msg.End` condition is actually validated there.
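A minimal sketch of what that frontend-side check could look like, following the pattern linked above from frontend_select_series.go; the function name and placement are illustrative, not the actual code from d8df865:

```go
package frontend

import (
	"errors"

	"github.com/bufbuild/connect-go"
)

// validateTimeRange mirrors the querier-side check in the query-frontend
// handler, so invalid ranges are rejected before any work is fanned out to
// queriers. This helper is a hypothetical name, not the PR's actual code.
func validateTimeRange(start, end int64) error {
	if start > end {
		return connect.NewError(connect.CodeInvalidArgument, errors.New("start must be before end"))
	}
	return nil
}
```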
pkg/phlaredb/block_querier.go
Outdated
// TODO(bryan) do we need to close this?
err := b.Open(ctx)
if err != nil {
	return nil, err
}
`Open` is a very expensive operation, therefore we do want to keep blocks open. Unfortunately, at the moment we don't have a straightforward caching/eviction policy, which may easily become a problem. We will change this in the future.
Potentially we also want it to have a state where only the TSDB index is there and the parquet files are not even looked at.
It makes sense to try to keep things open for as long as possible if opens are expensive. I like @simonswine's idea of doing a partial open vs. loading the entire block.
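A purely illustrative sketch of that partial-open idea, assuming a state per block; PhlareDB's real types and API differ:

```go
package phlaredb

import (
	"context"
	"sync"
)

// blockState models the states discussed above: a block can sit in an
// "index only" state where the TSDB index is in memory (enough to answer
// Series) and the parquet files are never touched.
type blockState int

const (
	blockClosed    blockState = iota
	blockIndexOnly // TSDB index loaded, parquet untouched
	blockFullyOpen // index plus parquet readers
)

type lazyBlock struct {
	mu    sync.Mutex
	state blockState
}

// openIndex upgrades the block to at least blockIndexOnly. A label query
// would call this instead of a full Open, deferring the expensive parquet
// work until profile data is actually requested.
func (b *lazyBlock) openIndex(ctx context.Context) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state >= blockIndexOnly {
		return nil // already open far enough for label queries
	}
	// ... download/mmap the TSDB index here ...
	b.state = blockIndexOnly
	return nil
}
```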
I tested these changes in dev (using 02ec939) and querying from the store gateway behaves as intended. We also maintain the legacy query behavior if no interval is sent. I also ran a quick performance check and the results were, well, substandard at best. I made 20 requests with a variety of intervals to the query frontend and collected the server-side request duration. I tested the existing query pattern, the new pattern with only synchronous requests (basically a best-case scenario), and the new pattern with asynchronous requests (the "typical" scenario).

As the interval increases, the response time increases exponentially in the synchronous case (with an R² of 0.998 according to Google Sheets). The asynchronous cases grow almost like the synchronous ones, but taper off towards the end. This, however, is because the pods OOM and crash, cancelling requests and producing artificially low request times. With the new asynchronous pattern, any request interval above 24h is almost guaranteed to make pods crash and requests fail. Synchronous requests above 24h do complete, but the application is more or less unusable due to the high latency. Below a 24h interval, the request latency is largely unnoticeable. I suspect the OOMs are a result of opening an unbounded number of blocks and scanning them for series labels (though that's just a hunch). We could mitigate this by reducing parallelism on a single pod, but that would in turn increase the total request duration. Another obvious solution would be to cache the series labels so we can avoid opening a block altogether, but caching is always easier to propose than it is to implement. At this point, the question is: do we need to come up with a strategy to mitigate the cost of hitting the store gateway before or after we merge this PR? I suspect these numbers are just too high and the experience would be too poor to accept the change in its current form.
Cross-posting from Slack: I'm surprised the performance is so poor. The TSDB index is downloaded into memory in its entirety on the first access to the block and kept there forever (a sort of in-memory cache), so I'm wondering if the performance of subsequent queries is any better. Judging by the flamegraph, it is not, since we spend most of the time decoding in-memory data. We should:
I guess we could try to fix this before merging.
var labelsSet []*typesv1.Labels
var lock sync.Mutex
group, ctx := errgroup.WithContext(ctx)
We probably should bound the amount of parallelism with some constant.
I set a parallelism constant to 50 in 28d98bb. 50 is only a SWAG.
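A minimal sketch of that bounded fan-out using errgroup's `SetLimit`, built around the diff quoted above; the `querier` interface, `SeriesRequest` type, and import paths are assumptions, not the PR's actual definitions:

```go
package querier

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"

	typesv1 "github.com/grafana/pyroscope/api/gen/proto/go/types/v1"
)

// SeriesRequest stands in for the real request type.
type SeriesRequest struct{ Start, End int64 }

type querier interface {
	Series(ctx context.Context, req *SeriesRequest) ([]*typesv1.Labels, error)
}

// maxSeriesParallelism bounds the fan-out; 50 mirrors the SWAG set in 28d98bb.
const maxSeriesParallelism = 50

func series(ctx context.Context, queriers []querier, req *SeriesRequest) ([]*typesv1.Labels, error) {
	var (
		labelsSet []*typesv1.Labels
		lock      sync.Mutex
	)
	group, ctx := errgroup.WithContext(ctx)
	// SetLimit keeps at most maxSeriesParallelism Series calls in flight, so a
	// long interval cannot open an unbounded number of blocks at once.
	group.SetLimit(maxSeriesParallelism)
	for _, q := range queriers {
		q := q // capture loop variable (pre-Go 1.22)
		group.Go(func() error {
			labels, err := q.Series(ctx, req)
			if err != nil {
				return err
			}
			lock.Lock()
			labelsSet = append(labelsSet, labels...)
			lock.Unlock()
			return nil
		})
	}
	if err := group.Wait(); err != nil {
		return nil, err
	}
	return labelsSet, nil
}
```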
If start/end are not provided, fall back to the legacy query strategy of only querying the head block.
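A sketch of that dispatch, assuming zero start/end means "no interval provided"; the type and helper names are illustrative, not the PR's actual functions:

```go
package querier

import "context"

// Querier and the helpers below are stand-ins for the PR's real types.
type Querier struct{}

func (q *Querier) seriesFromIngesters(ctx context.Context) ([]string, error) { return nil, nil }
func (q *Querier) seriesFromStoreGatewayAndIngesters(ctx context.Context, start, end int64) ([]string, error) {
	return nil, nil
}

// series keeps the legacy head-only behavior when no window is provided and
// otherwise fans out to the store gateway as well.
func (q *Querier) series(ctx context.Context, start, end int64) ([]string, error) {
	if start == 0 && end == 0 {
		return q.seriesFromIngesters(ctx) // legacy: head block only
	}
	return q.seriesFromStoreGatewayAndIngesters(ctx, start, end)
}
```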
After implementing some of the suggestions raised by @kolesnikovae and @simonswine, here are the new performance metrics:
As you can see, there are significant decreases in response time compared to the first iteration. Even at quite long intervals, the response time hovers around 5s (not optimal, but still manageable). More typical intervals have an adequate request duration, broadly staying under 1s (until the interval exceeds 2d). @kolesnikovae had a number of other suggestions we could pursue to improve performance, but at this point I think these numbers are healthy enough to ship and iterate on.
Great work @bryanhuhta! I think you're right, we should merge it now and only optimise it if we see an actual need.
for _, q := range queriers {
	group.Go(util.RecoverPanic(func() error {
		labels, err := q.Series(ctx, req)
For later: if we change that API to send back fingerprint IDs, we should gain an even bigger perf boost.
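A sketch of what that could enable: if the per-block Series API returned a stable fingerprint alongside (or instead of) the labels, the querier could deduplicate across blocks on a uint64 instead of comparing whole label sets. All names here are illustrative:

```go
package querier

// fingerprintedSeries pairs a series fingerprint with its labels; the label
// type is a stand-in for the real one.
type fingerprintedSeries struct {
	Fingerprint uint64
	Labels      []string
}

func dedupeByFingerprint(in []fingerprintedSeries) []fingerprintedSeries {
	seen := make(map[uint64]struct{}, len(in))
	out := make([]fingerprintedSeries, 0, len(in))
	for _, s := range in {
		if _, ok := seen[s.Fingerprint]; ok {
			continue // same series already reported by another block
		}
		seen[s.Fingerprint] = struct{}{}
		out = append(out, s)
	}
	return out
}
```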
LGTM
Something to investigate that you could use later:
pyroscope/pkg/phlaredb/tsdb/index/index.go
Line 1738 in ac9da90
func (r *Reader) Series(id storage.SeriesRef, lbls *phlaremodel.Labels, chks *[]ChunkMeta) (uint64, error) {
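A minimal usage sketch of the quoted method: `lbls` and `chks` are filled in place (so the buffers can be reused across calls), and the returned uint64 is presumably the series fingerprint. The import paths are assumptions:

```go
package example

import (
	"github.com/prometheus/prometheus/storage"

	phlaremodel "github.com/grafana/pyroscope/pkg/model"
	"github.com/grafana/pyroscope/pkg/phlaredb/tsdb/index"
)

// readSeries resolves a series reference to its fingerprint and labels using
// the Reader.Series method quoted above.
func readSeries(r *index.Reader, ref storage.SeriesRef) (uint64, phlaremodel.Labels, error) {
	var (
		lbls phlaremodel.Labels
		chks []index.ChunkMeta
	)
	fp, err := r.Series(ref, &lbls, &chks)
	if err != nil {
		return 0, nil, err
	}
	return fp, lbls, nil
}
```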
LGTM. Thanks @bryanhuhta for your patience going through all of our feedback.
Before you merge (sorry), I would like you to change the start/stop time unit from seconds to milliseconds, to match the other query endpoints.
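Concretely, the server side would then interpret the request fields as Unix milliseconds; a small sketch (the function name is illustrative):

```go
package querier

import "time"

// requestWindow converts the request's start/stop fields, interpreted as
// Unix milliseconds rather than seconds, into time.Time values.
func requestWindow(startMillis, endMillis int64) (time.Time, time.Time) {
	return time.UnixMilli(startMillis), time.UnixMilli(endMillis)
}
```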
Related: #2230
Related: https://github.com/grafana/pyroscope-app-plugin/issues/70
Previously, series were only queried from the head block in the ingester. This adds a time window to the `Series` RPC and queries the store gateway if necessary.

This PR also adds a rudimentary UI update to capture the selected window start/end and pass it along with the RPC.

Edit: I reverted the UI change in c72f5c6 to reduce the complexity of this PR. It will be added in a follow-up PR where more time and energy can be spent on a "non-rudimentary" implementation.
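For reference, a hedged sketch of what a client call against the extended RPC could look like; the generated type and import names are assumptions based on this conversation, with Start/End in Unix milliseconds per the review above:

```go
package main

import (
	"context"
	"time"

	"github.com/bufbuild/connect-go"

	querierv1 "github.com/grafana/pyroscope/api/gen/proto/go/querier/v1"
	"github.com/grafana/pyroscope/api/gen/proto/go/querier/v1/querierv1connect"
)

// querySeries requests series labels for the last 24 hours via the extended
// Series RPC.
func querySeries(ctx context.Context, client querierv1connect.QuerierServiceClient) error {
	_, err := client.Series(ctx, connect.NewRequest(&querierv1.SeriesRequest{
		Start: time.Now().Add(-24 * time.Hour).UnixMilli(), // window start, ms
		End:   time.Now().UnixMilli(),                      // window end, ms
	}))
	return err
}
```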