row_number() is slower than equivalent rank() #5298
Using a query that computes rank() over a partition and filters on rank = 1 (like the first query in the reproduction below) is much faster than the equivalent query using row_number().
To the extent that substituting rank() is a preferred workaround, it carries a small margin of possible error if there are ties within a partition: rank() = 1 matches every tied row, whereas row_number() = 1 matches exactly one.
The difference in the plans is that the row_number() version gets rewritten by the WindowFilterPushdown optimizer into a specialized TopNRowNumber node, while the rank() version runs through the ordinary window operator.
Here's an example that reproduces the issue:
```sql
WITH t AS (
    SELECT rank() OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk
    FROM tpch.sf10.lineitem
)
SELECT checksum(rnk)
FROM t
WHERE rnk = 1
```

CPU time: 273.1s
```sql
WITH t AS (
    SELECT row_number() OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk
    FROM tpch.sf10.lineitem
)
SELECT checksum(rnk)
FROM t
WHERE rnk = 1
```

CPU time: 4614.6s
According to YourKit: [profiler screenshot]
I did some digging. The source of the performance issue (and, though nobody has brought it up yet, memory consumption) is that TopNRowNumberOperator creates a MinMaxPriorityQueue per peer group, and then holds all of those queues until it reaches the end of the input.
This seems like it would be correct behavior when the input to the operator isn't already partitioned and ordered, but that doesn't appear to be the situation when WindowFilterPushdown applies the rewrite.
The question at this point is how to resolve the issue.
The goal of the rewrite was to be able to run per-group top-n queries without having to read all data into memory before doing a sort and filtering. With the specialized operator, the amount of memory required will be a constant factor of the number of partition-by groups, which should, in general, be better (except when the filter selects almost every row from each partition-by group). The question is why MinMaxPriorityQueue is so expensive.
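To make the cost concrete, here's a minimal sketch of the per-group heap approach described above, assuming Guava's MinMaxPriorityQueue; this is not the actual Presto code, and the class, the group key, and the comparator are hypothetical stand-ins:

```java
import com.google.common.collect.MinMaxPriorityQueue;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one bounded min-max heap per group, all of them
// held live until the end of the input.
class PerGroupTopN<K, R> {
    private final Map<K, MinMaxPriorityQueue<R>> heaps = new HashMap<>();
    private final Comparator<R> order;
    private final int n;

    PerGroupTopN(Comparator<R> order, int n) {
        this.order = order;
        this.n = n;
    }

    void add(K group, R row) {
        // maximumSize(n) makes the heap evict its greatest element when
        // adding beyond capacity, so each heap keeps only the top n rows.
        heaps.computeIfAbsent(group, key ->
                        MinMaxPriorityQueue.orderedBy(order).maximumSize(n).create())
                .add(row);
    }
}
```

Every row pays a hash lookup plus a heap offer (compare and sift), and every group retains a heap, a HashMap entry, and copies of up to n rows until the input is exhausted, which is where both the CPU cost and the per-group memory overhead come from.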
Certainly. Since the planner has the ability to describe certain properties of the data (partitioning, grouping, sorting, etc.), we should be able to add this fairly easily. It's already able to select between a streaming and non-streaming version of the window operator depending on whether the data is pre-grouped or sorted on the partition-by keys. Take a look at how the planner makes that choice.
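For comparison, a minimal sketch of what a streaming variant could look like when the input is already grouped and sorted on the partition-by keys (hypothetical names, not the actual operator): it only needs the current group key and a counter, so memory stays constant regardless of the number of groups.

```java
import java.util.Iterator;
import java.util.Objects;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch: valid only if rows arrive already grouped and
// ordered, so no buffering or per-group heaps are needed.
final class StreamingTopN {
    static <K, R> void process(Iterator<R> sortedRows, Function<R, K> groupKey,
            int n, Consumer<R> output) {
        K currentGroup = null;
        long rowNumber = 0;
        boolean firstRow = true;
        while (sortedRows.hasNext()) {
            R row = sortedRows.next();
            K group = groupKey.apply(row);
            if (firstRow || !Objects.equals(group, currentGroup)) {
                currentGroup = group;   // group boundary: reset the counter
                rowNumber = 0;
                firstRow = false;
            }
            if (++rowNumber <= n) {
                output.accept(row);     // emit immediately; nothing is retained
            }
        }
    }
}
```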
I did a little looking into the unit cost. The bulk of it is tied up in the per-group data structures.
Unaccounted for in the memory reservation are the HashMap$Node instances, the MinMaxPriorityQueues, and their associated objects. Based on inspecting the process with jvisualvm when the ExceededMemoryLimitException gets thrown, that's around 200 bytes a pop.
jvisualvm suggests that there might be another 150-200 bytes unaccounted for in either of those numbers. I haven't hunted down where it's coming from yet.
What matters as far as the ExceededMemoryLimitException is concerned is the 420 bytes; the data structures in TopNRowNumberOperator aren't counted against the memory reservation anyway.
Practically speaking, the unit cost is not the major issue with this query:
Like you said, in this query nearly every row is in its own group. I'll take a look at the option of making this work like the streaming version of the window operator where possible. I don't think we're going to get the unit cost down enough to make 60 million units (at ~420 bytes each, roughly 25 GB) go on my laptop :-)
Late reply, but you should be able to work around this by defeating the WindowFilterPushdown optimizer:

```sql
WHERE (rnk = 1 OR rand() < 0)
```

This works because rand() < 0 is never true, so the results are unchanged, but the OR keeps the optimizer from extracting a simple row-number bound from the filter, and rand() being non-deterministic keeps the predicate from being constant-folded back to rnk = 1.
Summary of the causes of the slowness and high memory usage of TopNRowNumberOperator:
Fundamentally we need a way to trade off memory and CPU. Using copies and heaps (like the current approach does) can be beneficial for memory usage, especially for cases where each partition has thousands of rows but only needs the top few. But this can be a problem for CPU usage, given that copying is expensive and updating the heap (including comparing and sifting up/down) is not that cheap. On the other hand, keeping all the input pages in memory, sorting partitions, and marking row numbers (like what the ordinary window operator does) is cheap on CPU but holds the entire input in memory.
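For contrast, a minimal sketch of the other end of that trade-off, sort-and-mark over fully buffered input (hypothetical names, not the actual operator):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: buffer everything, sort each partition once, keep
// the first n rows. CPU-friendly (no per-row heap maintenance), but memory
// scales with total input size rather than with the number of groups.
final class SortAndMark {
    static <K, R> List<R> topN(Iterable<R> rows, Function<R, K> partitionKey,
            Comparator<R> order, int n) {
        Map<K, List<R>> partitions = new HashMap<>();
        for (R row : rows) {
            partitions.computeIfAbsent(partitionKey.apply(row), k -> new ArrayList<>()).add(row);
        }
        List<R> result = new ArrayList<>();
        for (List<R> partition : partitions.values()) {
            partition.sort(order);      // one sort per partition, no sifting per row
            result.addAll(partition.subList(0, Math.min(n, partition.size())));
        }
        return result;
    }
}
```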
Attempt 1 (Failed)
I rewrote the operator by modifying the existing implementation.
The benchmark uses the reproduction query shown above.