digoal
2020-03-13
PostgreSQL , partitioned table , many partitions , aggregate query , hint bit , CLogControlLock
A partitioned table has at least one file per partition, so a table with thousands of partitions has thousands of files. In versions before 11, a full-table aggregate (e.g. count) on such a table was slow, because every partition had to be scanned one by one.
With parallel query this has improved greatly. On PG 12, a partitioned table with 2048 partitions holding about 200 million rows can be aggregated in about 1 second.
The parallelism of a partitioned table depends on the parallelism of its partitions, so when creating the tables we force parallel_workers=52 on every partition.
do language plpgsql $$
declare
begin
drop table if exists p;
create table p (id int , info text, crt_time timestamp) partition by range (crt_time);
create table p2020 partition of p FOR VALUES FROM ('2020-01-01') TO ('2021-01-01') partition by hash (id);
create table p2021 partition of p FOR VALUES FROM ('2021-01-01') TO ('2022-01-01') partition by hash (id);
for i in 0..1023 loop
execute format ('create table p2020_%s partition of p2020 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
execute format ('create table p2021_%s partition of p2021 FOR VALUES WITH (MODULUS 1024, REMAINDER %s)', i, i);
execute format('alter table p2020_%s set (parallel_workers=52)',i);
execute format('alter table p2021_%s set (parallel_workers=52)',i);
end loop;
end;
$$;
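After the DO block finishes, it is worth verifying that all 2048 leaf partitions exist and that the per-partition storage option took effect. A minimal sketch using the system catalogs (assuming the p/p2020/p2021 hierarchy created above):

```sql
-- Count the leaf partitions under p (should be 2048: 1024 per year).
select count(*)
from pg_inherits i
where i.inhparent in ('p2020'::regclass, 'p2021'::regclass);

-- Confirm the forced parallel_workers setting on one partition.
select relname, reloptions
from pg_class
where relname = 'p2020_0';
```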
Write load test
vi test.sql
\set id random(1,2000000000)
insert into p values (:id, random()::text, now());
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 52 -j 52 -T 600
progress: 14.0 s, 459890.7 tps, lat 0.113 ms stddev 0.132
progress: 15.0 s, 439633.9 tps, lat 0.118 ms stddev 0.209
progress: 16.0 s, 440804.9 tps, lat 0.118 ms stddev 0.233
progress: 17.0 s, 444148.2 tps, lat 0.117 ms stddev 0.148
progress: 18.0 s, 463751.1 tps, lat 0.112 ms stddev 0.094
progress: 19.0 s, 469972.4 tps, lat 0.110 ms stddev 0.078
......
Force parallelism to get the best aggregate performance across all partitions:
db1=# show max_worker_processes ;
max_worker_processes
----------------------
128
(1 row)
set max_parallel_workers =32;
set max_parallel_workers_per_gather =32;
set parallel_setup_cost =0;
set parallel_tuple_cost =0;
set min_parallel_table_scan_size =0;
set min_parallel_index_scan_size =0;
set parallel_leader_participation=off;
One more parameter remains; choose its value based on the actual workload.
Case 1:
set enable_partitionwise_aggregate=off;
db1=# explain select count(*) from p;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=847577.04..847577.05 rows=1 width=8)
-> Gather (cost=847576.95..847576.96 rows=32 width=8)
Workers Planned: 32
-> Partial Aggregate (cost=847576.95..847576.96 rows=1 width=8)
-> Parallel Append (cost=0.00..840170.98 rows=2962388 width=0)
-> Parallel Seq Scan on p2020_25 (cost=0.00..920.11 rows=3011 width=0)
-> Parallel Seq Scan on p2020_4 (cost=0.00..905.94 rows=2994 width=0)
-> Parallel Seq Scan on p2020_42 (cost=0.00..898.65 rows=2965 width=0)
-> Parallel Seq Scan on p2020_49 (cost=0.00..891.59 rows=2859 width=0)
-> Parallel Seq Scan on p2020_3 (cost=0.00..890.98 rows=2998 width=0)
-> Parallel Seq Scan on p2020_13 (cost=0.00..890.25 rows=3025 width=0)
Multiple workers process many partitions at once: each partition is aggregated without intra-partition parallelism, and the results are combined with Parallel Append.
Suitable when there are many small partitions (e.g. under 1 GB each).
Case 2:
set enable_partitionwise_aggregate=on;
db1=# explain select count(*) from p;
QUERY PLAN
------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=832959.57..832959.58 rows=1 width=8)
-> Gather (cost=874.54..832795.73 rows=65536 width=8)
Workers Planned: 32
-> Parallel Append (cost=874.54..832795.73 rows=2048 width=8)
-> Partial Aggregate (cost=927.63..927.64 rows=1 width=8)
-> Parallel Seq Scan on p2020_25 (cost=0.00..920.11 rows=3011 width=0)
-> Partial Aggregate (cost=913.43..913.44 rows=1 width=8)
-> Parallel Seq Scan on p2020_4 (cost=0.00..905.94 rows=2994 width=0)
-> Partial Aggregate (cost=906.07..906.08 rows=1 width=8)
-> Parallel Seq Scan on p2020_42 (cost=0.00..898.65 rows=2965 width=0)
-> Partial Aggregate (cost=898.73..898.74 rows=1 width=8)
-> Parallel Seq Scan on p2020_49 (cost=0.00..891.59 rows=2859 width=0)
-> Partial Aggregate (cost=898.47..898.48 rows=1 width=8)
-> Parallel Seq Scan on p2020_3 (cost=0.00..890.98 rows=2998 width=0)
One partition is computed at a time, with multiple workers parallelizing the scan within each partition; the partial results are then combined in a multi-stage parallel aggregate.
Suitable when there are fewer than 16 partitions but each partition is large (e.g. over 10 GB).
Whether to enable or disable enable_partitionwise_aggregate comes down to whichever is faster. As a rule of thumb, the compute time per task should exceed the scheduling overhead.
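A simple way to decide is to time the same aggregate under both settings in psql, a sketch:

```sql
\timing on
set enable_partitionwise_aggregate = off;
select count(*) from p;   -- time case 1
set enable_partitionwise_aggregate = on;
select count(*) from p;   -- time case 2, keep whichever wins
```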
db1=# select count(*) from p;
count
-----------
212758999
(1 row)
Time: 1098.609 ms (00:01.099)
You will notice that a count query run immediately after the load finishes is very, very slow. Why?
db1=# select wait_event, wait_event_type from pg_stat_activity ;
wait_event | wait_event_type
---------------------+-----------------
AutoVacuumMain | Activity
LogicalLauncherMain | Activity
ExecuteGather | IPC
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
|
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
SLRURead | IO
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
CLogControlLock | LWLock
|
BgWriterMain | Activity
CheckpointerMain | Activity
WalWriterMain | Activity
(40 rows)
The session view shows waits on CLogControlLock. Why would a query take this lock?
Each PostgreSQL row header records the status (committed or aborted) of the transaction that wrote it. But commit and abort complete immediately, without updating the headers of the rows that transaction wrote; only the clog file is written (2 bits per transaction). So when does the status in the row header get updated?
The first time the row is touched afterwards (by select, vacuum, analyze, autovacuum, update, etc.), the transaction status is fetched from the corresponding clog bits and the status mask in the row header is updated.
Moreover, when data-file block checksums are enabled or wal_log_hints=on is set, updating this status mask also generates WAL, so even a read-only query can produce WAL.
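One way to observe this effect (a sketch, assuming checksums or wal_log_hints=on are enabled) is to compare the WAL insert position before and after the first read; the LSN values below are placeholders, not real output:

```sql
-- Snapshot the WAL insert position, run the first count
-- (which sets hint bits), then snapshot again.
select pg_current_wal_insert_lsn();   -- save as :lsn1 (placeholder)

select count(*) from p;               -- first touch: sets hint bits

select pg_current_wal_insert_lsn();   -- save as :lsn2; noticeably advanced

-- Compute how many bytes of WAL the "read-only" query produced
-- (substitute the two saved LSNs):
select pg_wal_lsn_diff('0/3A00000'::pg_lsn, '0/3000000'::pg_lsn) as wal_bytes;
```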
So why is the first query slow? Because it is updating the hint bits in the row headers.
Once they are set, queries are fast: the transaction status can be read directly from the row-header mask, with no need to consult the clog.
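A common workaround is to pay this cost up front, before opening the table to queries: VACUUM touches every tuple and sets the hint bits eagerly. A minimal sketch:

```sql
-- Run once after the bulk load. VACUUM visits every tuple, setting
-- hint bits (and emitting the resulting dirty pages / WAL) up front;
-- ANALYZE refreshes planner statistics at the same time.
vacuum (analyze) p;

-- Subsequent aggregates read the status mask from the row header
-- and no longer contend on CLogControlLock.
select count(*) from p;
```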
Kernel improvement idea 1: a switch, settable at the table, session, or cluster level, to control whether hint bits are updated. For log tables or transaction-flow tables that are written and soon deleted, there is no point updating hint bits; better to skip the update and read the clog at query time.
Kernel improvement idea 2: improve CLogControlLock concurrency so it does not block, so that parallel query can show its full power.