Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cs_hash_dup_count is too slow #13

Closed
amutu opened this issue Mar 8, 2014 · 3 comments
Closed

cs_hash_dup_count is too slow #13

amutu opened this issue Mar 8, 2014 · 3 comments

Comments

@amutu
Copy link

amutu commented Mar 8, 2014

for 5 million rows,the PG takes 1min,while takes more than 20min.this is part of the whole data:https://github.com/amutu/data/blob/master/samp2.
I think the slowness is data set relative,the group key is too selective,so may be calc the hash value or solve the hash conflict take the most time?

I see PG sql plan,it use sort.So should we make imcs make ref the group by column statistics and then select sort or hash as the plan?

@knizhnik
Copy link
Owner

knizhnik commented Mar 8, 2014

Sorry, which SQL query you are executing?
Can I somehow get the whole data set?
20 minutes for 5 million rows is something terribly slow...
I should be less than second.
May be there is swapping on your system... Can you check CPU usage, disk IO, and perform profiling (using for example "sudo perf top")?
At my system cs_hash_dup_count on 3500 rows query takes only one millisecond...

@amutu
Copy link
Author

amutu commented Mar 9, 2014

the box has 128GB memory and 24 core,and I check again,there is no swapping.
this is some dstat output:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 100 0 0 0| 59k 426k| 0 0 | 1B 55B| 182 223
91 1 8 0 0 0| 0 4096B| 132B 972B| 0 0 | 26k 747
91 1 8 0 0 0| 0 8192B| 66B 358B| 0 0 | 26k 692
91 1 8 0 0 0| 0 36k| 186B 358B| 0 0 | 26k 708
91 1 8 0 0 0| 0 56k| 66B 358B| 0 0 | 26k 771
91 1 8 0 0 0| 0 4096B| 186B 358B| 0 0 | 26k 759
91 1 8 0 0 0| 0 4096B| 66B 358B| 0 0 | 26k 779
91 1 8 0 0 0| 0 8192B| 66B 358B| 0 0 | 26k 992
92 1 7 0 0 0| 0 32k| 220B 424B| 0 0 | 26k 826
90 1 10 0 0 0| 0 4096B| 66B 358B| 0 0 | 26k 750
91 1 8 0 0 0| 0 4096B| 66B 358B| 0 0 | 26k 855

this is my test result:
imcs:pg = 1048s:119s

postgres=# select samp3_load();

samp3_load

5651116

(1 row)

Time: 12031.358 ms
postgres=# select cs_hash_dup_count(uin,stacktopsignature) from samp3_get();
NOTICE: IMCS command: hash_dup_count
zzz

    cs_hash_dup_count                                                                                             

















("int8:{1,1,1,1,9,1,2,1,3,3,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,2,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,3,1,1,7,1
,2,3,1,1,1,2,1,1,1,1,1,1,1,1,5,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,31,
3,3,1,3,1,8,1,2,1,6,1,1,1,10,6,1,1,2,1,1,1,2,1,1,1,1,9,1,1,1,1,2,24,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,2,1,1,6,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,83,1,1,2,1,1,1,1,1,4
5,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,63,1,1,1,1,1,1,17,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,2,3,5,1,1,4,1,1,1,2,1,1,1,1,1,2,1,1,2,1,1,1,4,1,1,1,1,1,1,1,1,2,1,1,1,1,1,3,1,1,1,1,2,18,1,1,1,1,1,1,1,1
,1,1,1,1,10,1,1,1,1,1,34,1,611,1,1,1,1,1,2,1,1,2,1,1,2,1,1,1,1,1,3,1,2,1,1,1,1,1,2,1,1,1,1,2,1,2,1,13,156,1,1,1,5,1,1,
1,1,1,2,1,1,2,2,1,1,1,1,1,1,1,1,1,1,1,1,18,2,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,4,6,1
,1,1,1,1,1,1,1,1,1,1,3,1,1,1,...}","bpchar64:{3c838f4690639fa1a96bd78b9d313fc0,2cf4b5bb766cf43995364dea8fc699de,0745cc
287b798401fcc5e25e28238f26,819e969fae126c74194bcdf2229eac54,c55a1630e299cbda59ce2313fe4c2ae6,38b7b14d02b002676a7a2a178
5dfcecd,fe11cf8727cc92c1d1deb81b47b11c5b,32a0d4a23471a453983003e3ab2dd238,6893c4c258da87ae2d3f06b6aa30c74d,17d9488d753
ad498472925150213fa34,4c748ba32d214a92893a2ea56f930c9e,5b928de3d8470bf6ae6032a13721f8c1,a4e9455ec714f70c468b89a887dc9d
45,447056efd85b4f3c21489b98111f8732,cfb62a06b3d3d95357a26a21a5607546,95d0a4348a87541075275b69da2034d1,87086cbd84d01914
afd152924097340f,b386c078b670be4e93ef9df6dab1387f,bba6ce9f6937dec1853a78aff57b482f,4fb4a50e98b5d5e8441404748ae5d21d,ba
a7e794180fce758a19c4c8d6107a46,b170444e408a29b4e8d4b3cb698fb084,0faa1cb5f8e826f4a38b7d5881928532,2fa62a23a2e98451db248
20774ec228a,c27eb6ccb279929e0322b2b5a8dcbbfc,23eea0ce8d89901d8fa72da122d74a4b,45a769697bc4309a10c3b3167785d308,b1788ac
ebfdd30e2d55bc036bc7d7431,...}")
(1 row)

Time: 1047903.148 ms
postgres=# select count(1) from (select count(distinct uin) from samp3 group by stacktopsignature) t;

count

297208
(1 row)

Time: 118905.357 ms

samp3 ddl and data is here:
http://amutu.com/blog/public_stage/samp3.dump
which I get from pg_dump -Fc -Z9 -tsamp3 -f samp3.dump

@knizhnik
Copy link
Owner

knizhnik commented Mar 9, 2014

Once again: thank for your help. It was really a bug in imcs_dup_hash_initialize - one of hash tables was not correctly extended. After fixing this bug execution of this query takes at my system about 1.5 sec (and before it was 26921 seconds!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants