Find a way to further optimize GROUP BY * queries #8304

Closed

jsternberg opened this issue Apr 19, 2017 · 8 comments

@jsternberg
Contributor

At the moment, we optimize queries like the following well because we process the various iterators in parallel and perform what amounts to an iterative map/reduce:

SELECT count(value) FROM cpu

Unfortunately, we don't handle the following type of query with the same grace; it is bottlenecked on a single core:

SELECT count(value) FROM cpu GROUP BY *

The issue is with how the internal processing of the query is done and how the final result is read. For the original query, we can process each series separately and perform an initial aggregation on each series in parallel, then combine the results of the individual series. Since all of these points are processed in parallel with each other and the final result is read from only one location, it's much easier to merge them.
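
To illustrate the shape of that map/reduce, here's a rough Go sketch; the seriesCursor type is a hypothetical stand-in for the real per-series iterators, not the engine's actual interface:

```go
package main

import (
	"fmt"
	"sync"
)

// seriesCursor is a stand-in for a per-series iterator in the storage engine.
type seriesCursor struct {
	values []float64
}

// countSeries is the per-series "map" step: a partial count computed
// independently for one series.
func countSeries(c seriesCursor) int64 {
	return int64(len(c.values))
}

func main() {
	series := []seriesCursor{
		{values: []float64{1, 2, 3}},
		{values: []float64{4, 5}},
	}

	partials := make(chan int64, len(series))
	var wg sync.WaitGroup
	for _, s := range series {
		wg.Add(1)
		go func(s seriesCursor) {
			defer wg.Done()
			partials <- countSeries(s) // map step runs in parallel per series
		}(s)
	}
	wg.Wait()
	close(partials)

	// Reduce step: combine the partial counts into the single result that
	// SELECT count(value) FROM cpu returns.
	var total int64
	for p := range partials {
		total += p
	}
	fmt.Println("count(value) =", total)
}
```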

But when we have GROUP BY * or a grouping with high cardinality, the emitter has to read all of the points from the first series before it can move on to reading the points from the second series. This means only the first point of the second series is processed in parallel; after that, processing blocks until it's that series' turn to be read. Since the parallelism comes from processing series concurrently, this makes GROUP BY * especially bad: the series inevitably end up being processed serially.

The emitter emits the series serially because that's what's required of it: it returns each series in order, so it reads all of the series in order. While reading series 1, we cannot continue to process series 2 because the entirety of series 1 has to be emitted before we continue. The only way to continue processing series 2 would be to buffer its results until they could eventually be read. With queries that emit a lot of points, this could result in running out of memory. If we limit the number of buffered points to prevent a query from running out of memory, we still don't solve the underlying problem, which is that queries will inevitably fall back to using only a single core at some point within the query lifecycle; we just delay the issue.

Another solution might be to try to process a single series in parallel rather than serially like we currently do. This may or may not work; I'll expand on ideas, but a naive approach won't do. The reason we read a series in order is that seeking within the TSM tree generally doesn't improve performance very much. The slowest part of the query is finding the initial cursor, and iteration is generally pretty fast after that. Note: double check this assumption, as the math behind this changes if it's not true.

Either way, we will likely need a hybrid approach. Sometimes processing series in parallel is faster, such as when we're merging series together. Sometimes processing a single series in parallel might be faster, if we can think of a way to do that efficiently.

Another potential factor to consider is that this query will perform much more slowly than it likely should:

SELECT max(min) FROM (SELECT min(value) FROM cpu GROUP BY *)

This is because the parallel optimization only happens within the TSM1 engine and is not applied when combining the results of a subquery. That's likely an easier optimization we can introduce too.

@jsternberg
Contributor Author

@rbetts @timhallinflux hopefully this helps explain some of the potential problems we have with parallel query processing. There are at least two issues: one related to subqueries and one related to a fundamental difficulty within the system.

@jsternberg
Contributor Author

Continuing with potential ideas for optimizing aggregates further, we would need to determine which part of the iterator is slow. How quickly can we process and discard an entire series of data, and how much of that time do we spend performing the aggregation? If processing the actual series is quick, we can work to quickly fill a buffer with the points for an interval and send those points to the proper aggregate. Then we let the aggregate do its more time-consuming work in a separate goroutine and keep the series processor moving.
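
As a rough sketch of that split (hypothetical types, not the real iterator interfaces): the scanner only slices points into windows and hands them off over a channel, while the aggregation runs in its own goroutine:

```go
package main

import "fmt"

// window holds one interval's worth of points for a single series.
type window struct {
	start  int
	values []float64
}

// scanSeries is the cheap path: it only slices points into windows and hands
// them off; the expensive aggregation happens on the other side of the channel.
func scanSeries(points []float64, size int, out chan<- window) {
	for i := 0; i < len(points); i += size {
		end := i + size
		if end > len(points) {
			end = len(points)
		}
		out <- window{start: i, values: points[i:end]}
	}
	close(out)
}

// aggregate runs in a separate goroutine so the scanner never blocks on the
// reduce work (here a mean per window).
func aggregate(in <-chan window, out chan<- float64) {
	for w := range in {
		var sum float64
		for _, v := range w.values {
			sum += v
		}
		out <- sum / float64(len(w.values))
	}
	close(out)
}

func main() {
	windows := make(chan window, 4) // small buffer keeps the scanner moving
	results := make(chan float64, 4)
	go scanSeries([]float64{1, 2, 3, 4, 5, 6}, 2, windows)
	go aggregate(windows, results)
	for r := range results {
		fmt.Println(r)
	}
}
```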

If the series processing is not as fast as it should be, we should explore the idea of having a separate goroutine find the starting node for each time range so that we can seek between nodes more efficiently and hopefully create a zipper formation: while one goroutine processes [0s, 10s), another would process [10s, 20s). Currently, we don't do this because seeking is far more expensive than just reading through.
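
As a sketch of the zipper idea, assuming for the sake of illustration that seeking to an arbitrary timestamp were cheap enough (which, as noted, it currently is not), alternating windows would be dealt out to the workers:

```go
package main

import "fmt"

// timeRange is a half-open [start, end) interval in seconds.
type timeRange struct{ start, end int }

// zipper splits [start, end) into fixed windows and deals them out to n
// workers round-robin, so worker 0 scans [0s,10s), worker 1 scans [10s,20s),
// worker 0 scans [20s,30s), and so on.
func zipper(start, end, window, n int) [][]timeRange {
	out := make([][]timeRange, n)
	for i, t := 0, start; t < end; i, t = i+1, t+window {
		stop := t + window
		if stop > end {
			stop = end
		}
		out[i%n] = append(out[i%n], timeRange{t, stop})
	}
	return out
}

func main() {
	for w, ranges := range zipper(0, 60, 10, 2) {
		fmt.Printf("worker %d: %v\n", w, ranges)
	}
}
```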

We should also consider using sync.Pool for buffer management. I believe we spend a bunch of time growing buffers and maps, and we can see whether sync.Pool helps with that. It seems to help memory-sensitive libraries like zap.
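
For example, a minimal sync.Pool sketch around a hypothetical per-window buffer (not the engine's actual buffer type) could look like this:

```go
package main

import (
	"fmt"
	"sync"
)

// pointBuffer is a reusable scratch buffer for one window's values.
type pointBuffer struct {
	values []float64
}

// bufPool hands out pointBuffers so we stop re-growing a fresh slice for
// every window and series.
var bufPool = sync.Pool{
	New: func() interface{} {
		return &pointBuffer{values: make([]float64, 0, 1024)}
	},
}

func processWindow(points []float64) float64 {
	buf := bufPool.Get().(*pointBuffer)
	defer func() {
		buf.values = buf.values[:0] // reset length, keep capacity
		bufPool.Put(buf)
	}()

	buf.values = append(buf.values, points...)
	var sum float64
	for _, v := range buf.values {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(processWindow([]float64{1, 2, 3}))
}
```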

@jsternberg
Contributor Author

I just ran a CPU profile on the BenchmarkEngine_CreateIterator_Count_1M function (which should be one of the fastest to finish). The profile showed that 35.65% of the time was spent in the reduce() function (where the primary work for aggregates is done) and 23.61% was spent finding the next point in the window. There are areas further down that might be optimizable too, and the difference in percentage between those might indicate something is there. No certainty though: if these are fast functions, the percentages may be meaningless.

@jsternberg
Contributor Author

jsternberg commented Jun 12, 2017

I looked a bit at a flame graph related to this today (so 2 months since I last looked at it) and found that about 1/3 of the time spent within the call iterators goes to finding the subset of tags when a dimension is specified. The benchmark mentioned in the previous comment doesn't use a dimension, so it doesn't represent this problem very well; that function wouldn't have shown up on its flame graph because influxql.(*Tags).Subset returns immediately with an empty Tags struct when the number of dimensions is zero, which is not the case for something using GROUP BY *. Optimizing that hotspot may be a good use of time by itself.

I'm also unclear on how much parallelization may help, although I'm willing to mess around with it a bit. Adding parallelization will require additional memory allocations (which we currently try to reduce), but it may offset any losses from using a bit more memory. I'm going to try to experiment a bit; just writing down my current thoughts at the moment.

@jsternberg
Contributor Author

Another note: it seems like a lot of time is spent in influxql.keysMatch. It checks whether the keys of two maps match so it can avoid a memory allocation, but this seems a bit silly since we should know before the reduction whether the keys match or not. If you have SELECT mean(value) FROM cpu GROUP BY host, the values are sent with the correct subset to begin with.

Part of this problem seems to stem from subqueries (which I knew about, and that problem of too many Subset calls should be fixed), but some of it seems to have been present before subqueries existed. It may be time to try making this a bit smarter so it only retrieves the subset of tags when it needs to.
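
A rough sketch of that idea, using a simplified stand-in for influxql.Tags rather than the real type: compute the subset once per series key and reuse it, instead of re-checking keys for every point:

```go
package main

import "fmt"

// tags is a simplified stand-in for influxql.Tags.
type tags map[string]string

// subset returns only the requested dimensions, allocating a new map.
func (t tags) subset(dims []string) tags {
	out := make(tags, len(dims))
	for _, d := range dims {
		if v, ok := t[d]; ok {
			out[d] = v
		}
	}
	return out
}

// subsetCache memoizes the subset per series key so the per-point hot path
// never has to recompute it or compare map keys again.
type subsetCache struct {
	dims  []string
	cache map[string]tags // series key -> cached subset
}

func (c *subsetCache) get(seriesKey string, t tags) tags {
	if s, ok := c.cache[seriesKey]; ok {
		return s
	}
	s := t.subset(c.dims)
	c.cache[seriesKey] = s
	return s
}

func main() {
	c := &subsetCache{dims: []string{"host"}, cache: map[string]tags{}}
	t := tags{"host": "server01", "region": "us-west"}
	fmt.Println(c.get("cpu,host=server01,region=us-west", t))
	fmt.Println(c.get("cpu,host=server01,region=us-west", t)) // cache hit, no subset work
}
```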

This is still just initial guessing. There may be something I forgot about for why things are this way.

@stale

stale bot commented Jul 24, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 24, 2019
@stale

stale bot commented Jul 31, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 31, 2019
@foobar
Contributor

foobar commented Apr 3, 2020

This is still relevant.
It seems to me that the sorted merge iterator uses a heap to sort all points, but the Less function is REALLY slow. As @jsternberg stated, keysMatch is not efficient.
The performance gap caused by the sorted merge iterator is significant. Just compare the times of two aggregations: SELECT mean(*) FROM m GROUP BY *, which is not sorted, and SELECT median(*) FROM m GROUP BY *, which is. The difference is ~30x on my sample data.

It seems the heap doesn't have to do a point-to-point comparison when the inputs come from different series. I'd like to send a patch as an RFC, roughly along the lines of the sketch below.
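
Roughly the shape I have in mind, with a simplified stand-in for the heap items rather than the actual iterator types:

```go
package main

import "fmt"

// mergeItem is a simplified stand-in for one entry in the sorted-merge heap.
type mergeItem struct {
	seriesKey string // e.g. "cpu,host=server01"
	time      int64
}

// less orders items by series key first; the expensive point-level comparison
// (which involves matching tag sets) only runs when two items belong to the
// same series.
func less(a, b mergeItem) bool {
	if a.seriesKey != b.seriesKey {
		return a.seriesKey < b.seriesKey // cheap string compare, no tag matching
	}
	return a.time < b.time
}

func main() {
	a := mergeItem{seriesKey: "cpu,host=server01", time: 20}
	b := mergeItem{seriesKey: "cpu,host=server02", time: 10}
	fmt.Println(less(a, b)) // true: decided by the series key alone
}
```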
