cs_hash_dup_count is too slow #13

amutu · 2014-03-08T05:05:41Z

for 5 million rows,the PG takes 1min,while takes more than 20min.this is part of the whole data:https://github.com/amutu/data/blob/master/samp2.
I think the slowness is data set relative,the group key is too selective,so may be calc the hash value or solve the hash conflict take the most time？

I see PG sql plan,it use sort.So should we make imcs make ref the group by column statistics and then select sort or hash as the plan？

knizhnik · 2014-03-08T14:54:43Z

Sorry, which SQL query you are executing?
Can I somehow get the whole data set?
20 minutes for 5 million rows is something terribly slow...
I should be less than second.
May be there is swapping on your system... Can you check CPU usage, disk IO, and perform profiling (using for example "sudo perf top")?
At my system cs_hash_dup_count on 3500 rows query takes only one millisecond...

amutu · 2014-03-09T04:02:19Z

the box has 128GB memory and 24 core,and I check again,there is no swapping.
this is some dstat output:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 100 0 0 0| 59k 426k| 0 0 | 1B 55B| 182 223
91 1 8 0 0 0| 0 4096B| 132B 972B| 0 0 | 26k 747
91 1 8 0 0 0| 0 8192B| 66B 358B| 0 0 | 26k 692
91 1 8 0 0 0| 0 36k| 186B 358B| 0 0 | 26k 708
91 1 8 0 0 0| 0 56k| 66B 358B| 0 0 | 26k 771
91 1 8 0 0 0| 0 4096B| 186B 358B| 0 0 | 26k 759
91 1 8 0 0 0| 0 4096B| 66B 358B| 0 0 | 26k 779
91 1 8 0 0 0| 0 8192B| 66B 358B| 0 0 | 26k 992
92 1 7 0 0 0| 0 32k| 220B 424B| 0 0 | 26k 826
90 1 10 0 0 0| 0 4096B| 66B 358B| 0 0 | 26k 750
91 1 8 0 0 0| 0 4096B| 66B 358B| 0 0 | 26k 855

this is my test result:
imcs:pg = 1048s:119s

postgres=# select samp3_load();

samp3_load

(1 row)

Time: 12031.358 ms
postgres=# select cs_hash_dup_count(uin,stacktopsignature) from samp3_get();
NOTICE: IMCS command: hash_dup_count
zzz

    cs_hash_dup_count

("int8:{1,1,1,1,9,1,2,1,3,3,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,2,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,3,1,1,7,1
,2,3,1,1,1,2,1,1,1,1,1,1,1,1,5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,31,
3,3,1,3,1,8,1,2,1,6,1,1,1,10,6,1,1,2,1,1,1,2,1,1,1,1,9,1,1,1,1,2,24,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,2,1,1,6,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,83,1,1,2,1,1,1,1,1,4
5,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,63,1,1,1,1,1,1,17,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,2,3,5,1,1,4,1,1,1,2,1,1,1,1,1,2,1,1,2,1,1,1,4,1,1,1,1,1,1,1,1,2,1,1,1,1,1,3,1,1,1,1,2,18,1,1,1,1,1,1,1,1
,1,1,1,1,10,1,1,1,1,1,34,1,611,1,1,1,1,1,2,1,1,2,1,1,2,1,1,1,1,1,3,1,2,1,1,1,1,1,2,1,1,1,1,2,1,2,1,13,156,1,1,1,5,1,1,
1,1,1,2,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,18,2,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,4,6,1
,1,1,1,1,1,1,1,1,1,1,3,1,1,1,...}","bpchar64:{3c838f4690639fa1a96bd78b9d313fc0,2cf4b5bb766cf43995364dea8fc699de,0745cc
287b798401fcc5e25e28238f26,819e969fae126c74194bcdf2229eac54,c55a1630e299cbda59ce2313fe4c2ae6,38b7b14d02b002676a7a2a178
5dfcecd,fe11cf8727cc92c1d1deb81b47b11c5b,32a0d4a23471a453983003e3ab2dd238,6893c4c258da87ae2d3f06b6aa30c74d,17d9488d753
ad498472925150213fa34,4c748ba32d214a92893a2ea56f930c9e,5b928de3d8470bf6ae6032a13721f8c1,a4e9455ec714f70c468b89a887dc9d
45,447056efd85b4f3c21489b98111f8732,cfb62a06b3d3d95357a26a21a5607546,95d0a4348a87541075275b69da2034d1,87086cbd84d01914
afd152924097340f,b386c078b670be4e93ef9df6dab1387f,bba6ce9f6937dec1853a78aff57b482f,4fb4a50e98b5d5e8441404748ae5d21d,ba
a7e794180fce758a19c4c8d6107a46,b170444e408a29b4e8d4b3cb698fb084,0faa1cb5f8e826f4a38b7d5881928532,2fa62a23a2e98451db248
20774ec228a,c27eb6ccb279929e0322b2b5a8dcbbfc,23eea0ce8d89901d8fa72da122d74a4b,45a769697bc4309a10c3b3167785d308,b1788ac
ebfdd30e2d55bc036bc7d7431,...}")
(1 row)

Time: 1047903.148 ms
postgres=# select count(1) from (select count(distinct uin) from samp3 group by stacktopsignature) t;

count

297208
(1 row)

Time: 118905.357 ms

samp3 ddl and data is here:
http://amutu.com/blog/public_stage/samp3.dump
which I get from pg_dump -Fc -Z9 -tsamp3 -f samp3.dump

knizhnik · 2014-03-09T18:54:38Z

Once again: thank for your help. It was really a bug in imcs_dup_hash_initialize - one of hash tables was not correctly extended. After fixing this bug execution of this query takes at my system about 1.5 sec (and before it was 26921 seconds!)

knizhnik closed this as completed Mar 9, 2014

amutu mentioned this issue Mar 14, 2014

autoupdate broken：insert cause backend segment fault #20

Closed

amutu mentioned this issue Apr 4, 2014

some command hang on FreeBSD #25

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cs_hash_dup_count is too slow #13

cs_hash_dup_count is too slow #13

amutu commented Mar 8, 2014

knizhnik commented Mar 8, 2014

amutu commented Mar 9, 2014

knizhnik commented Mar 9, 2014

cs_hash_dup_count is too slow #13

cs_hash_dup_count is too slow #13

Comments

amutu commented Mar 8, 2014

knizhnik commented Mar 8, 2014

amutu commented Mar 9, 2014

samp3_load

count

knizhnik commented Mar 9, 2014