
Recursion + weighted sort column (skip scan) to solve the performance problem of multi-column grouped deduplication in window queries

Author

digoal

Date

2020-06-15

Tags

PostgreSQL , window query , window , recursion , index scan , skip scan


Background

Construct the test data:

-- gen_res() picks one random value per call and returns it 100 times,
-- so every uid generated below gets 100 rows that all share one score
create or replace function gen_res() returns setof numeric as $$
declare
  s numeric := random();
begin
  return query select s from generate_series(1,100);
end;
$$ language plpgsql strict;

create table a (uid int, score numeric, class text);

-- 100,000 uids x 100 rows each; the md5(...) class value is re-evaluated per output row
insert into a select id, gen_res(), md5(ceil(random()*100)::text) from generate_series(1,100000) t(id);

Construct 100 uids that all share the same score (the table's current maximum score), 100 rows each.

with a1 as (select max(score) score from a)  
insert into a select id, score , md5(ceil(random()*100)::text) from generate_series(100001,100100) t(id) , (select score from a1,generate_series(1,100)) t1;  
postgres=# select * from a limit 10;  
 uid |       score       |              class                 
-----+-------------------+----------------------------------  
   2 | 0.979678114877171 | 44f683a84163b3523afe57c2e008bc8c  
   2 | 0.979678114877171 | d645920e395fedad7bbbed0eca3fe2e0  
   2 | 0.979678114877171 | fe9fc289c3ff0af142b6d3bead98a923  
   2 | 0.979678114877171 | d09bf41544a3365a46c9077ebb5e35c3  
   2 | 0.979678114877171 | 1679091c5a880faf6fb5e6087eb1b2dc  
   2 | 0.979678114877171 | 34173cb38f07f89ddbebc2ac9128303f  
   2 | 0.979678114877171 | 70efdf2ec9b086079795c442636b55fb  
   2 | 0.979678114877171 | fc490ca45c00b1249bbe3554a4fdf6fb  
   2 | 0.979678114877171 | fe9fc289c3ff0af142b6d3bead98a923  
   2 | 0.979678114877171 | a5771bce93e200c36f7cd9dfd0e5deaa  
(10 rows)  

Each uid has multiple rows, and every row of a given uid has the same score.

Scores of different uids are usually different, but a small number of uids do share the same score.

The task: query uids ordered by score descending, returning one row per uid.

The simplest, most direct approach is a window query, partitioning by score and uid (partitioning by score alone is not enough, because some uids may share the same score) and ordering by score descending.

As follows:

create index idx_a_1 on a (score desc,uid);  
explain (analyze,verbose,timing,costs,buffers)  
select uid,score from (  
select row_number() over (partition by score,uid order by score desc) as rn,    
uid, score from a) t  
where t.rn=1 limit 50;  
  
                                                                       QUERY PLAN                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.56..607.79 rows=50 width=36) (actual time=0.081..4.117 rows=50 loops=1)  
   Output: t.uid, t.score  
   Buffers: shared hit=4613  
   ->  Subquery Scan on t  (cost=0.56..607839.84 rows=50050 width=36) (actual time=0.081..4.111 rows=50 loops=1)  
         Output: t.uid, t.score  
         Filter: (t.rn = 1)  
         Rows Removed by Filter: 4851  
         Buffers: shared hit=4613  
         ->  WindowAgg  (cost=0.56..482714.24 rows=10010048 width=44) (actual time=0.080..3.853 rows=4901 loops=1)  
               Output: row_number() OVER (?), a.uid, a.score  
               Buffers: shared hit=4613  
               ->  Index Only Scan using idx_a_1 on public.a  (cost=0.56..282513.28 rows=10010048 width=36) (actual time=0.018..1.532 rows=5001 loops=1)  
                     Output: a.uid, a.score  
                     Heap Fetches: 5001  
                     Buffers: shared hit=4613  
 Planning Time: 0.074 ms  
 Execution Time: 4.135 ms  
(17 rows)  

The performance looks decent, but notice that although this SQL only needs to return 50 rows, it scanned 5001 rows (the window query walks the index in order instead of skipping ahead). Can we scan only 50 rows?

The result is as follows:

  uid   |       score         
--------+-------------------  
  31536 | 0.999998891470238  
 100001 | 0.999998891470238  
 100002 | 0.999998891470238  
 100003 | 0.999998891470238  
 100004 | 0.999998891470238  
 100005 | 0.999998891470238  
 100006 | 0.999998891470238  
 100007 | 0.999998891470238  
 100008 | 0.999998891470238  
 100009 | 0.999998891470238  
 100010 | 0.999998891470238  
 100011 | 0.999998891470238  
 100012 | 0.999998891470238  
 100013 | 0.999998891470238  
 100014 | 0.999998891470238  
 100015 | 0.999998891470238  
 100016 | 0.999998891470238  
 100017 | 0.999998891470238  
 100018 | 0.999998891470238  
 100019 | 0.999998891470238  
 100020 | 0.999998891470238  
 100021 | 0.999998891470238  
 100022 | 0.999998891470238  
 100023 | 0.999998891470238  
 100024 | 0.999998891470238  
 100025 | 0.999998891470238  
 100026 | 0.999998891470238  
 100027 | 0.999998891470238  
 100028 | 0.999998891470238  
 100029 | 0.999998891470238  
 100030 | 0.999998891470238  
 100031 | 0.999998891470238  
 100032 | 0.999998891470238  
 100033 | 0.999998891470238  
 100034 | 0.999998891470238  
 100035 | 0.999998891470238  
 100036 | 0.999998891470238  
 100037 | 0.999998891470238  
 100038 | 0.999998891470238  
 100039 | 0.999998891470238  
 100040 | 0.999998891470238  
 100041 | 0.999998891470238  
 100042 | 0.999998891470238  
 100043 | 0.999998891470238  
 100044 | 0.999998891470238  
 100045 | 0.999998891470238  
 100046 | 0.999998891470238  
 100047 | 0.999998891470238  
 100048 | 0.999998891470238  
 100049 | 0.999998891470238  
(50 rows)  

To make this query scan only about 50 rows, a skip scan is needed instead of walking a whole range of the index. This can be implemented with a recursive CTE.

However, recursing directly on score (order by score desc limit 1 at each step) loses uids that share the same score:

create index idx_a_3 on a (score desc) include (uid) where score is not null;  
with recursive tmp as (  
(  
select array[uid::numeric, score] as r from  a  
        WHERE score  is not null  
        order by score desc limit 1  
)  
union all  
(  
  select  
  (select array[uid::numeric, score] as r from a t1  
        where t1.score < (tmp.r)[2]  
        and t1.score is not null  
        order by t1.score desc limit 1  
  )  
  from tmp where (tmp.r)[2] is not null  
)  
)  
select (tmp.r)[1],(tmp.r)[2] from tmp   
where tmp.* is not null  
limit 50;  
                                                                            QUERY PLAN                                                                              
------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=55.78..56.79 rows=50 width=64) (actual time=0.013..0.539 rows=50 loops=1)  
   Output: (tmp.r[1]), (tmp.r[2])  
   Buffers: shared hit=241  
   CTE tmp  
     ->  Recursive Union  (cost=0.50..55.78 rows=101 width=32) (actual time=0.009..0.507 rows=50 loops=1)  
           Buffers: shared hit=241  
           ->  Subquery Scan on "*SELECT* 1"  (cost=0.50..0.52 rows=1 width=32) (actual time=0.009..0.009 rows=1 loops=1)  
                 Output: "*SELECT* 1".r  
                 Buffers: shared hit=5  
                 ->  Limit  (cost=0.50..0.51 rows=1 width=43) (actual time=0.008..0.008 rows=1 loops=1)  
                       Output: (ARRAY[(a.uid)::numeric, a.score]), a.score  
                       Buffers: shared hit=5  
                       ->  Index Only Scan using idx_a_3 on public.a  (cost=0.50..125126.42 rows=10009993 width=43) (actual time=0.008..0.008 rows=1 loops=1)  
                             Output: ARRAY[(a.uid)::numeric, a.score], a.score  
                             Heap Fetches: 0  
                             Buffers: shared hit=5  
           ->  WorkTable Scan on tmp tmp_1  (cost=0.00..5.33 rows=10 width=32) (actual time=0.010..0.010 rows=1 loops=49)  
                 Output: (SubPlan 1)  
                 Filter: (tmp_1.r[2] IS NOT NULL)  
                 Buffers: shared hit=236  
                 SubPlan 1  
                   ->  Limit  (cost=0.50..0.51 rows=1 width=43) (actual time=0.009..0.009 rows=1 loops=49)  
                         Output: (ARRAY[(t1.uid)::numeric, t1.score]), t1.score  
                         Buffers: shared hit=236  
                         ->  Index Only Scan using idx_a_3 on public.a t1  (cost=0.50..41709.81 rows=3336664 width=43) (actual time=0.009..0.009 rows=1 loops=49)  
                               Output: ARRAY[(t1.uid)::numeric, t1.score], t1.score  
                               Index Cond: (t1.score < (tmp_1.r)[2])  
                               Heap Fetches: 0  
                               Buffers: shared hit=236  
   ->  CTE Scan on tmp  (cost=0.00..2.02 rows=100 width=64) (actual time=0.012..0.533 rows=50 loops=1)  
         Output: tmp.r[1], tmp.r[2]  
         Filter: (tmp.* IS NOT NULL)  
         Buffers: shared hit=241  
 Planning Time: 0.090 ms  
 Execution Time: 0.559 ms  
(35 rows)  

The result differs from the window query: uids that share a score are lost and each score returns only one uid, which does not match the business requirement. The reason is that the recursive step filters on t1.score < (tmp.r)[2], a strict inequality, so it jumps past every remaining row whose score equals the one just returned.

   r   |         r           
-------+-------------------  
 31536 | 0.999998891470238  
 83720 | 0.999991911982715  
 55722 | 0.999975578407255  
 23814 | 0.999972620235976  
 67381 | 0.999961911696577  
 29892 | 0.999943494870404  
 90079 | 0.999925486089911  
 21214 | 0.999922738312673  
 59001 |  0.99991736006243  
 87204 | 0.999917295777077  
  7030 | 0.999915557360705  
 47014 | 0.999902190202047  
 32264 | 0.999901911139158  
 38647 | 0.999860249737811  
 32674 | 0.999837253269899  
 87613 | 0.999832584158813  
 71232 |   0.9998249229434  
 41469 | 0.999812869635608  
 95274 | 0.999804819593745  
 70312 | 0.999779352434143  
 70923 | 0.999774257144811  
 26243 | 0.999741038483439  
 79093 | 0.999711161650342  
 51332 | 0.999708612428986  
 70293 | 0.999701119867272  
   749 | 0.999676152526778  
 82356 | 0.999673161844026  
 24750 |  0.99965793726258  
 24520 | 0.999657836562118  
 94013 | 0.999634740712228  
 32113 | 0.999620645922992  
 32524 | 0.999576429034658  
 64496 | 0.999547735428802  
 99351 | 0.999533980413151  
 74897 | 0.999506182824039  
 57650 | 0.999505328313486  
 12643 | 0.999502105912729  
 62484 | 0.999499621461251  
 98690 |  0.99949818874131  
 78253 | 0.999480342172379  
 12805 | 0.999474363236803  
 24470 | 0.999473037063549  
 42317 | 0.999456625347026  
 17058 | 0.999453040472755  
 67604 | 0.999448579359846  
 14985 |  0.99943607450459  
 72078 | 0.999425638849996  
 43290 | 0.999415581298248  
 10890 | 0.999408682154407  
 33988 | 0.999400768473588  
(50 rows)  

The fix is to weight the score: add a term that is different for every uid and has no effect on the final order by score desc result.

The leading digits of the score stay unchanged and the extra term is appended after them. For example, if score has 15 significant decimal digits and uid has at most 10 digits, shifting the uid at least 25 places past the decimal point (uid/1e25) is enough, and the final sort order is unaffected.

If that shift feels too long, a simpler method (which can still produce duplicates, although with a very, very low probability) is to take uid modulo 999 and shift it at least 18 places: order by (score::numeric + (mod(uid,999)/1e25)) desc
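
To see why this weighting is safe, here is a minimal sketch (not from the original article; the VALUES rows are made-up uids paired with scores taken from the outputs above). The added term is far smaller than the score's last significant digit, so it never reorders two distinct scores; it only breaks ties between identical scores:

select uid,
       score,
       score + uid::numeric / 1e25            as weighted_exact,  -- unique per uid, needs a shift of at least 25 places
       score + mod(uid, 999)::numeric / 1e25  as weighted_mod     -- shorter term, tiny collision risk
from (values (31536,  0.999998891470238),
             (100001, 0.999998891470238),
             (83720,  0.999991911982715)) as t(uid, score)
order by weighted_exact desc;

The row with the distinct score 0.999991911982715 still sorts last, while the two rows sharing the top score now have strictly different weighted values.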

The full example is as follows:

create index idx_a_2 on a ((score + (mod(uid,999)/1e25))) include (uid) where (score + (mod(uid,999)/1e25)) is not null ;  
with recursive tmp as (  
(  
select array[uid::numeric, score, (score::numeric + (mod(uid,999)/1e25))] as r from  a  
        WHERE (score + (mod(uid,999)/1e25)) is not null  
        order by (score + (mod(uid,999)/1e25)) desc limit 1  
)  
union all  
(  
  select  
  (select array[uid::numeric, score, (score + (mod(uid,999)/1e25))] as r from a t1  
        where (t1.score + (mod(t1.uid,999)/1e25)) < (tmp.r)[3]  
        and (t1.score + (mod(t1.uid,999)/1e25)) is not null  
        order by (t1.score + (mod(t1.uid,999)/1e25)) desc limit 1  
  )  
  from tmp where (tmp.r)[3] is not null  
)  
)  
select (tmp.r)[1],(tmp.r)[2] from tmp   
where tmp.* is not null  
limit 50;  

This time the result matches the window query: uids with duplicated scores are no longer lost.

   r    |         r           
--------+-------------------  
  31536 | 0.999998891470238  
 100100 | 0.999998891470238  
 100099 | 0.999998891470238  
 100098 | 0.999998891470238  
 100097 | 0.999998891470238  
 100096 | 0.999998891470238  
 100095 | 0.999998891470238  
 100094 | 0.999998891470238  
 100093 | 0.999998891470238  
 100092 | 0.999998891470238  
 100091 | 0.999998891470238  
 100090 | 0.999998891470238  
 100089 | 0.999998891470238  
 100088 | 0.999998891470238  
 100087 | 0.999998891470238  
 100086 | 0.999998891470238  
 100085 | 0.999998891470238  
 100084 | 0.999998891470238  
 100083 | 0.999998891470238  
 100082 | 0.999998891470238  
 100081 | 0.999998891470238  
 100080 | 0.999998891470238  
 100079 | 0.999998891470238  
 100078 | 0.999998891470238  
 100077 | 0.999998891470238  
 100076 | 0.999998891470238  
 100075 | 0.999998891470238  
 100074 | 0.999998891470238  
 100073 | 0.999998891470238  
 100072 | 0.999998891470238  
 100071 | 0.999998891470238  
 100070 | 0.999998891470238  
 100069 | 0.999998891470238  
 100068 | 0.999998891470238  
 100067 | 0.999998891470238  
 100066 | 0.999998891470238  
 100065 | 0.999998891470238  
 100064 | 0.999998891470238  
 100063 | 0.999998891470238  
 100062 | 0.999998891470238  
 100061 | 0.999998891470238  
 100060 | 0.999998891470238  
 100059 | 0.999998891470238  
 100058 | 0.999998891470238  
 100057 | 0.999998891470238  
 100056 | 0.999998891470238  
 100055 | 0.999998891470238  
 100054 | 0.999998891470238  
 100053 | 0.999998891470238  
 100052 | 0.999998891470238  
(50 rows)  
  
                                                                  QUERY PLAN       
-------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=60.64..61.65 rows=50 width=64) (actual time=0.016..0.478 rows=50 loops=1)  
   Output: (tmp.r[1]), (tmp.r[2])  
   Buffers: shared hit=250  
   CTE tmp  
     ->  Recursive Union  (cost=0.50..60.64 rows=101 width=32) (actual time=0.012..0.445 rows=50 loops=1)  
           Buffers: shared hit=250  
           ->  Subquery Scan on "*SELECT* 1"  (cost=0.50..0.55 rows=1 width=32) (actual time=0.011..0.012 rows=1 loops=1)  
                 Output: "*SELECT* 1".r  
                 Buffers: shared hit=5  
                 ->  Limit  (cost=0.50..0.54 rows=1 width=64) (actual time=0.011..0.011 rows=1 loops=1)  
                       Output: (ARRAY[(a.uid)::numeric, a.score, (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))]), ((a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric)))  
                       Buffers: shared hit=5  
                       ->  Index Scan Backward using idx_a_2 on public.a  (cost=0.50..417341.60 rows=9959943 width=64) (actual time=0.011..0.011 rows=1 loops=1)  
                             Output: ARRAY[(a.uid)::numeric, a.score, (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))], (a.score + ((mod(a.uid, 999))::numeric / '10000000000000000000000000'::numeric))  
                             Buffers: shared hit=5  
           ->  WorkTable Scan on tmp tmp_1  (cost=0.00..5.81 rows=10 width=32) (actual time=0.008..0.008 rows=1 loops=49)  
                 Output: (SubPlan 1)  
                 Filter: (tmp_1.r[3] IS NOT NULL)  
                 Buffers: shared hit=245  
                 SubPlan 1  
                   ->  Limit  (cost=0.50..0.56 rows=1 width=64) (actual time=0.008..0.008 rows=1 loops=49)  
                         Output: (ARRAY[(t1.uid)::numeric, t1.score, (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))]), ((t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric)))  
                         Buffers: shared hit=245  
                         ->  Index Scan Backward using idx_a_2 on public.a t1  (cost=0.50..201702.36 rows=3319981 width=64) (actual time=0.008..0.008 rows=1 loops=49)  
                               Output: ARRAY[(t1.uid)::numeric, t1.score, (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))], (t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric))  
                               Index Cond: ((t1.score + ((mod(t1.uid, 999))::numeric / '10000000000000000000000000'::numeric)) < (tmp_1.r)[3])  
                               Buffers: shared hit=245  
   ->  CTE Scan on tmp  (cost=0.00..2.02 rows=100 width=64) (actual time=0.015..0.473 rows=50 loops=1)  
         Output: tmp.r[1], tmp.r[2]  
         Filter: (tmp.* IS NOT NULL)  
         Buffers: shared hit=250  
 Planning Time: 0.127 ms  
 Execution Time: 0.502 ms  
(33 rows)  

Perfect: with the recursive approach the query drops from 4.1 ms to 0.5 ms, and limit 50 scans only 50 rows (one index descent per returned row).


