row_number() is slower than equivalent rank() #5298

Closed
DerrickRice opened this issue May 17, 2016 · 11 comments

@DerrickRice commented May 17, 2016

Using a query similar to this:

select * from (select x, y, rank() over (partition by x order by y) as rnk) as iq where iq.rnk = 1

is much faster than

select * from (select x, y, row_number() over (partition by x order by y) as rnum) as iq where iq.rnum = 1

So much faster, in fact, that the following is a reasonable workaround, with only a small margin of possible error when random() returns the same value for rows with equal x and y values:

select * from (select x, y, rank() over (partition by x order by y, random()) as rnk) as iq where iq.rnk = 1

The difference in the plans is that rank uses Window while row_number uses TopNRowNumber[limit 1]. It seems the predicate rnum = 1 got pushed down into a TopNRowNumber, presumably because that's more direct than actually computing the row_number and filtering.
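
For reference, the plan difference can be confirmed with EXPLAIN; a sketch, assuming a hypothetical table t with columns x and y standing in for the real one:

-- Hypothetical table t(x, y). With row_number() the plan contains a
-- TopNRowNumber[limit 1] node; with rank() it keeps a plain Window node
-- and applies the rnk = 1 predicate as an ordinary filter.
EXPLAIN
SELECT *
FROM (
  SELECT x, y, row_number() OVER (PARTITION BY x ORDER BY y) AS rnum
  FROM t
) AS iq
WHERE iq.rnum = 1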

@martint (Contributor) commented May 18, 2016

Here's an example that reproduces the issue:

WITH t AS (
  SELECT rank() OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk 
  FROM tpch.sf10.lineitem
) 
SELECT checksum(rnk) 
FROM t 
WHERE rnk = 1

CPU time: 273.1s

WITH t AS (
  SELECT row_number() OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk 
  FROM tpch.sf10.lineitem
) 
SELECT checksum(rnk) 
FROM t 
WHERE rnk = 1

CPU time: 4614.6s

@martint (Contributor) commented May 18, 2016

According to yourkit:

+---TopNRowNumberOperator.java:242 com.facebook.presto.operator.TopNRowNumberOperator.getPage()                                                                                        |  436,688   92 %  |          1,682  |
  |                                                                                                                                                                                    |                  |                 |
  +---TopNRowNumberOperator.java:326 com.facebook.presto.operator.TopNRowNumberOperator.getFlushingPartition()                                                                         |  434,941   92 %  |             84  |
  | |                                                                                                                                                                                  |                  |                 |
  | +---TopNRowNumberOperator.java:356 com.facebook.presto.operator.TopNRowNumberOperator$PartitionBuilder.access$600(TopNRowNumberOperator$PartitionBuilder)                          |  434,419   92 %  |             69  |
  | | |                                                                                                                                                                                |                  |                 |
  | | +---TopNRowNumberOperator.java:402 com.facebook.presto.operator.TopNRowNumberOperator$PartitionBuilder.build()                                                                   |  434,350   92 %  |             98  |
  | | | |                                                                                                                                                                              |                  |                 |
  | | | +---TopNRowNumberOperator.java:434 com.google.common.collect.MinMaxPriorityQueue.poll()                                                                                        |  434,000   92 %  |              0  |
  | | | | |                                                                                                                                                                            |                  |                 |
  | | | | +---MinMaxPriorityQueue.java:285 com.google.common.collect.MinMaxPriorityQueue.removeAndGet(int)                                                                             |  434,000   92 %  |        433,439  |
  | | | |   |                                                                                                                                                                          |                  |                 |
  | | | |   +---MinMaxPriorityQueue.java:450                                                                                                                                           |  433,439   91 %  |        433,439  |

@mattsfuller (Contributor) commented May 18, 2016

@ebd2 @maciejgrzybek are looking into window function performance.

@ebd2 self-assigned this Aug 25, 2016

@ebd2 (Member) commented Sep 6, 2016

I did some digging. The source of the performance issue (and, though nobody has brought it up yet, memory consumption) is that TopNRowNumberOperator is creating a MinMaxPriorityQueue per peer group, and then holding all of those until it gets to the finishing state.

This seems like it would be correct behavior when the input to the operator isn't already partitioned and ordered, but that doesn't appear to be the case where WindowFilterPushdown applies the rewrite.

The question at this point is how to resolve the issue.

  • The simplest would be to not do the rewrite.
  • More complicated would be to make the TopNRowNumberOperator (or some new operator applicable to this situation) aware of when data is coming in pre-sorted into peer groups.

Some further background on the cases where this rewrite is expected to be beneficial would be good. @mattsfuller, @martint

@ebd2 (Member) commented Sep 6, 2016

(Sorry for any @mention spam. Browser issues :-/ )

@martint (Contributor) commented Sep 6, 2016

The goal of the rewrite was to be able to run per-group top-n queries without having to read all data into memory before doing a sort and filtering. With the specialized operator, the amount of memory required will be a constant factor of the number of partition-by groups, which should, in general, be better (except when the filter selects almost every row from each partition-by group). The question is why MinMaxPriorityQueue is so expensive.

> More complicated would be to make the TopNRowNumberOperator (or some new operator applicable to this situation) aware of when data is coming in pre-sorted into peer groups.

Certainly. Since the planner has the ability to describe certain properties of the data (partitioning, grouping, sorting, etc), we should be able to add this fairly easily. It's already able to select between a streaming and non-streaming version of the window operator depending on whether the data is pre-grouped or sorted on the partition-by keys. Take a look at AddLocalExchanges.visitWindow

@ebd2 (Member) commented Sep 7, 2016

I did a little looking into the unit cost. The bulk of it is tied up in the Block[] row, which is accounted for in the memory reservation with an estimated cost of 420 bytes each.

Unaccounted for in the memory reservation are the HashMap$Node instances, MMPQs, and associated objects. Based on inspecting the process with jvisualvm when the ExceededMemoryLimitException gets thrown, that's around 200 bytes a pop.

jvisualvm suggests that there might be another 150-200 bytes unaccounted for in either of those numbers. I haven't hunted down where it's coming from yet.

What matters as far as the ExceededMemoryLimitException is concerned is the 420 bytes; the data structures in TNRNO aren't counted against the memory reservation anyway.

Practically speaking, the unit cost is not the major issue with this query:

presto> WITH t AS ( SELECT row_number() OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk FROM tpch.sf10.lineitem ) SELECT count(rnk) FROM t where rnk = 1;
  _col0   
----------
 59986002 
(1 row)

Like you said, in this query nearly every row is in its own group. I'll take a look at the option of making this work like the streaming version of the window operator where possible. I don't think we're going to get the unit cost down enough to make 60 million units workable on my laptop :-)

@harbby commented Jul 5, 2017

I have encountered this problem as well; as a temporary workaround I replaced row_number with rank to circumvent it.

@electrum (Contributor) commented Jul 5, 2017

Late reply, but you should be able to work around it by defeating the TopN optimizer like this:

WHERE (rnk = 1 OR rand() < 0)

This works because rand() always returns a value >= 0, so rand() < 0 evaluates to false, but the optimizer doesn't know that and thus can't eliminate it.
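
Applied to the lineitem query from earlier in this thread, the workaround would look roughly like this (sketch):

WITH t AS (
  SELECT row_number() OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk
  FROM tpch.sf10.lineitem
)
SELECT checksum(rnk)
FROM t
-- rand() < 0 is always false, but the optimizer cannot prove that,
-- so the rnk = 1 predicate is no longer rewritten into a TopNRowNumber
WHERE rnk = 1 OR rand() < 0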

@highker (Contributor) commented Dec 21, 2017

Reproduced a full GC with row_number() but didn't see any problem with rank(). I'm going to dig into this.

@highker self-assigned this Dec 21, 2017

@highker (Contributor) commented Jan 1, 2018

Summary

A summary of the causes of the slowness and high memory usage of TopNRowNumberOperator, from the above thread and some of my own digging:

  • MinMaxPriorityQueue/HashMap$Node has an overhead of about 420 bytes per entry according to #5298 (comment) (from some heap dumps I collected, it's about 390 bytes), while the original code uses 100 bytes as the estimate.
  • getSingleValueBlock accounts for about 50% of memory allocation according to some profiling in production. This can cause huge GC pressure. Also, getSingleValueBlock is not a cheap method.
  • The flushing logic (getFlushingPartition) in the original code looks inefficient (check #5298 (comment)). It has a complexity of O(n^2), where n is the number of partitions: it always scans all the partitions and picks the one with the most rows.
  • The operator copies a row twice in the worst case: once during heap build and once during output page build. This is inefficient compared with the rank function.

Observation

Fundamentally we need a way to trade off memory and CPU. Using copies and heaps (as the current approach does) can be beneficial for memory usage, especially when each partition has thousands of rows but only needs the top few. But it can be a problem for CPU usage, since copying is expensive and updating the heap (comparing and sifting up/down) is not that cheap either. On the other hand, keeping all the input pages in memory, sorting partitions, and marking row numbers (as rank does) can be CPU-efficient, since there is no copying or heap building, but it can waste a lot of memory in the worst case.

Attempt 1 (Failed)

I rewrote the operator by modifying TypedHeap to support multiple channels and then replacing MinMaxPriorityQueue with the new TypedHeap. It turns out this approach performs no differently from using MinMaxPriorityQueue; they are essentially the same algorithm, so changing the underlying data structure alone does not seem to benefit performance.

Attempt 2

Still use PriorityQueue but avoid copying. I use a set of heaps (PriorityQueue) to maintain the top N rows of each partition. Partitions are still created by GroupByHash. Each element in a heap references a row from some input page, and all the input pages are stored in a hash map. When more than 50% of an input page's rows are unreferenced, the page is compacted, and all the elements (possibly from different heaps) that reference the compacted page are updated accordingly. A prototype (https://github.com/highker/presto/tree/heap <- outputChannels and memory tracking are not implemented for now) is benchmarked as follows.

Benchmark

The benchmark uses tpch.tiny as the source. I also benchmarked tpch.sf1 and tpch.sf10 (both look promising with the above attempt). The reason I didn't paste the results for tpch.sfX is that the original TopNRowNumberOperator is too slow to produce a benchmark result.
Three queries are used to cover different (extreme) cases, where <FUNCTION> is either rank() or row_number():

dense case:
WITH t AS ( SELECT <FUNCTION> OVER (PARTITION BY random(10) ORDER BY shipdate DESC) AS rnk FROM lineitem ) SELECT checksum(rnk) FROM t WHERE rnk < 10000

medium case:
WITH t AS ( SELECT <FUNCTION> OVER (PARTITION BY random(100) ORDER BY shipdate DESC) AS rnk FROM lineitem ) SELECT checksum(rnk) FROM t WHERE rnk < 50

sparse case:
WITH t AS ( SELECT <FUNCTION> OVER (PARTITION BY orderkey, partkey ORDER BY shipdate DESC) AS rnk FROM lineitem ) SELECT checksum(rnk) FROM t WHERE rnk = 1

Results (CPU time in ms):

| case   | rank    | row_number (before) | row_number (after) |
|--------|---------|---------------------|--------------------|
| dense  | 162.247 | 189.106             | 154.980            |
| medium | 151.223 | 188.419             | 154.193            |
| sparse | 113.861 | 592.715             | 90.515             |
