Find a way to further optimize GROUP BY * queries #8304
@rbetts @timhallinflux hopefully this helps explain some of the potential problems with parallel query processing we have. There are at least two issues: one related to subqueries and one related to a fundamental difficulty within the system.
Continuing with potential ideas for optimizing aggregates further, we would need to determine which part of the iterator is slow. How quickly can we process and discard an entire series of data, and how much of that time do we spend performing the aggregation?

If processing the actual series is quick, we can work to quickly fill a buffer with the points for an interval and send those points to the proper aggregate. Then we let the aggregate do its more time-consuming work in a separate goroutine and keep the series processor moving.

If the series processing is not as fast as it should be, we should explore the idea of having a separate goroutine find the starting node for each time series so that we can seek between nodes more efficiently and hopefully create a zipper formation, so while one goroutine processes … We should also consider using …
Just ran a cpuprofile on the …
Looked a bit at a flame graph related to this today (so two months since I last looked at it) and found that about 1/3 of the time spent within the call iterators goes to finding the subset of tags when there's a dimension specified. The other benchmark mentioned in the previous comment doesn't use a dimension, so it doesn't represent this problem very well. Also, that function wouldn't have shown up on the flame graph because the …

I'm also unclear how much parallelization may help, although I'm willing to mess around with it a bit. It will be necessary to perform additional memory allocations (which we currently try to reduce) if we add parallelization, but the parallelization may offset any losses due to using a bit more memory. I'm going to try and experiment a bit. Just writing down my current thoughts at the moment.
Another note: it seems like a lot of time is spent in …

Part of this problem seems to stem from subqueries (which I knew about) and that problem of too many …

This is still just initial guessing. There may be something I forgot about for why things are this way.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.
This is still relevant. It seems the heap doesn't have to do a point-to-point comparison if the inputs are from different series. I'd send a patch as an RFC.
At the moment, we optimize queries like this well because we process the various iterators in parallel and perform what amounts to an iterative map/reduce:
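The example query was lost when this issue was copied; a hypothetical query of the well-optimized shape (an aggregate over time buckets, no high-cardinality grouping; measurement and field names are placeholders) would look like:

```sql
SELECT mean(value) FROM cpu WHERE time >= now() - 1h GROUP BY time(10m)
```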
Unfortunately, we don't handle this type of query with the same grace. This type of query is bottlenecked at using one core.
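This second example query was also lost; a hypothetical query of the poorly-handled shape (again with placeholder names) would be:

```sql
SELECT mean(value) FROM cpu WHERE time >= now() - 1h GROUP BY time(10m), *
```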
The issue is with how internal processing of the query is done and how the final query is read. For the original query, we can process each series separately and perform an initial aggregate on a single series in parallel. Then we can combine the results of each series together. Since all of these points are processed in parallel with each other and the final result is read from only one location, it's much easier to merge them.
But when we have `GROUP BY *`, or a grouping by something with high cardinality, the emitter has to read all of the points from the first series before it can continue to reading the points from the second series. This means that only the first point in the second series is processed in parallel, and then it blocks until it's time for that series to be read. Since series are processed in parallel, this makes `GROUP BY *` especially bad, since it will inevitably only process series serially.

The emitter emits the series serially because that's what's required of the emitter. It returns each series in order, so it reads all of the series in order. While reading series 1, we cannot continue to process series 2 because the entirety of series 1 has to be emitted before we continue. The only way to continue processing series 2 would be to buffer its results until they could eventually be read. With queries that emit a lot of points, this could result in running out of memory. If we limit the number of buffered points to prevent a query from running out of memory, we still don't solve the underlying problem, which is that such queries will always inevitably end up using only a single core at some point within the query lifecycle. We just delay the issue.
Another solution might be to try to process a single series in parallel rather than serially, as we currently do. This may or may not work. I'll expand on ideas, but we can't use a naive solution here. The reason we read a series in order is that seeking within the TSM tree generally doesn't improve performance very much. The slowest part of the query is finding the initial cursor; iteration is generally pretty fast after that. (Note: double-check this assumption, as it changes the math behind this if it's not true.)
Either way, we will likely need a hybrid approach. Sometimes, processing the series in parallel is faster such as when we're merging series together. Sometimes, processing the series itself in parallel might be faster if we can think of a way to do that efficiently.
Another potential factor to consider is that this query will perform much slower than it likely should:
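The query text was lost here as well; a hypothetical subquery of the slow shape (placeholder names) might be:

```sql
SELECT mean(value) FROM (SELECT value FROM cpu GROUP BY *) WHERE time >= now() - 1h GROUP BY time(10m)
```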
This is because the parallel optimization only happens within the TSM1 engine and is not applied when combining the results of a subquery. That's likely an easier optimization we could introduce too.