Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR: group by sequence doesn't match values sequence #34

Open
amutu opened this issue Apr 14, 2014 · 14 comments
Open

ERROR: group by sequence doesn't match values sequence #34

amutu opened this issue Apr 14, 2014 · 14 comments

Comments

@amutu
Copy link

amutu commented Apr 14, 2014

for same data,if filter is countryid =1 ,the sql report error:
ERROR: group by sequence doesn't match values sequence
while if the filter is countryid =2 or countryid =3 ,then sql run OK

select clientversion,t.agg_value cnt_distinct,encode(trim(E'\000' from t.group_by) ,'escape') revision from (select wx_version.clientversion,x.uin,x.revision,countryid from wx_version,crashlog_get(clientversion,now()::timestamp - '2 hour'::interval,now()::timestamp) as x where x is not null) y,cs_project_agg(cs_hash_dup_count(cs_filter(y.countryid = cs_const(2::int),y.uin),cs_filter(y.countryid = cs_const(2::int),y.revision))) as t(agg_value,group_by) order by 2 desc nulls last limit 10;
ERROR: group by sequence doesn't match values sequence
Time: 139.287 ms

select clientversion,t.agg_value cnt_distinct,encode(trim(E'\000' from t.group_by) ,'escape') revision from (select wx_version.clientversion,x.uin,x.revision,countryid from wx_version,crashlog_get(clientversion,now()::timestamp - '2 hour'::interval,now()::timestamp) as x where x is not null) y,cs_project_agg(cs_hash_dup_count(cs_filter(y.countryid = cs_const(1::int),y.uin),cs_filter(y.countryid = cs_const(1::int),y.revision))) as t(agg_value,group_by) order by 2 desc nulls last limit 10;
clientversion | cnt_distinct | revision
---------------+--------------+----------
(0 rows)

Time: 67.556 ms

@amutu
Copy link
Author

amutu commented Apr 14, 2014

wired, on another server with the same data ,all sql above run ok.
only difference is on another server the data is fresh loaded(I restart the server today) while the server which report the error is running since last week and load data every hour from cron job.

@amutu
Copy link
Author

amutu commented Apr 14, 2014

some more case:

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '10 hour'::interval,now()::timestamp) as x where x is not null;
run OK

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031) as x where x is not null;
run OK

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '5 hour'::interval,now()::timestamp) as x where x is not null;
ERROR: group by sequence doesn't match values sequence

select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031,now()::timestamp - '4 hour'::interval,now()::timestamp) as x where x is not null;
ERROR: group by sequence doesn't match values sequence

@amutu
Copy link
Author

amutu commented Apr 14, 2014

it looks like memory dirty on this machine:

select cs_count(x.class2) c0 ,cs_count(x.uin) as c1 ,cs_count(x.class1) as c2 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.uin)) as c3 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.class1)) as c4 from crashlog_get(620757031,'2014-04-14 12:00:00','2014-04-14 14:00:00') as x where x is not null;
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
c0 | c1 | c2 | c3 | c4
-------+-------+-------+-------+-------
27729 | 28334 | 27712 | 28334 | 27712
(1 row)

on another machine ,the result is :
select cs_count(x.class2) c0 ,cs_count(x.uin) as c1 ,cs_count(x.class1) as c2 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.uin)) as c3 ,cs_count(cs_filter(x.revision = cs_const('493383',32),x.class1)) as c4 from crashlog_get(620757031,'2014-04-14 12:00:00','2014-04-14 14:00:00') as x where x is not null;
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
NOTICE: IMCS command: const
CONTEXT: PL/pgSQL function cs_const(text,integer) line 1 at RETURN
NOTICE: IMCS command: eq
NOTICE: IMCS command: filter
NOTICE: IMCS command: count
c0 | c1 | c2 | c3 | c4
-------+-------+-------+-------+-------
28334 | 28334 | 28334 | 28334 | 28334
(1 row)

Time: 2.476 ms

but in the bad state machine,some query run without problem,such:
select cs_hash_dup_count(cs_filter(x.revision = cs_const('493383',32),x.uin) ,cs_filter(x.revision = cs_const('493383',32),x.class1)) from crashlog_get(620757031) as x where x is not null;
run OK

@knizhnik
Copy link
Owner

Is the problem reproduced if you switch off multithreading query execution: set imcs.n_threads=1 in postgresql.conf

@amutu
Copy link
Author

amutu commented Apr 14, 2014

no,restart the pg and reload the data to cs make all sql run ok.

@amutu
Copy link
Author

amutu commented Apr 15, 2014

How can I help to debug tihis problem? it happened again.

my usage:
I have a partitioned table crashlog and the partition table named crashlog_p_20140101 ,crashlog_p_20140102...

I have a cron job,delete and load data every hour:
step 1:delete oldest data to release memroy
crashlog_delete(data of n days ago)

step 2:clean data of last hour in case of run this cmd multi times
crashlog_delete(last hour till now)

step3:load data of last hour
crashlog_load(filter := 'select data from carshlog where data is last hour')

at the same time,user may run sql on the server.

is it casued by some concurrent delete/select/load ? because if user not use(so thiere is not select or read),then the data is ok.I find this because after hole night load data,there is no such error in the morning,but in the afternoon,the error happened.

@knizhnik
Copy link
Owner

I found one possible synchronization issues in delete function.
Please try to reproduce the problem with new version.

Also, do you have imcs_autoload enabled? If yes, please try to disable it (actually I do not want to say that it is impossible to use autoload in concurrent queries, but I want to understand if the problem is related with it or not).

@amutu
Copy link
Author

amutu commented Apr 15, 2014

I can repreduce the bug by 2 script concurrent running in 3 sessions:
https://github.com/amutu/data/tree/master

while the data is constant insert into crashlog table,

open first session:
while true
do
./delete_load.sh >> ./delete_load.log
done

open second session:
while true
do
./delete_load.sh >> ./delete_load2.log
done

open third session:
while true
do
./check_cs_data.sh >> ./check_cs_data.log
done

open forth session:
tail -f ./check_cs_data.log

for a wile,you will see .....n data error found! where n >= 1

note:you can create wx_version as:
create table wx_version as select clientversion from crashlog;

I see the code:
/* This function is called in context protected by imcs->lock /
void imcs_free_page(imcs_page_t
page)
{
imcs_free_page_t* pg = (imcs_free_page_t*)page;
pg->next = imcs->free_pages;
imcs->free_pages = pg;
imcs->n_used_pages -= 1;
}

but in imcs_delete_page,I can't find imcs->lock around imcs_free_page,is it relative?

@knizhnik
Copy link
Owner

Did you try the new version (61)?

@amutu
Copy link
Author

amutu commented Apr 15, 2014

I just try the new version,unfortunately,it still can be reproduced by the script above.

@knizhnik
Copy link
Owner

Sorry, but what was the actual error message? Did you inspect the logs?
Unfortunately this scenario is not correct (this is why I didn't try to reproduce it).
IMCS requires that data is always inserted in timestamp increasing order.
But you concurrently run two delete_load.sh each of which tries to load is own data range based on system clock values! So first script may try to load [100..200] and second [110...210]. What will happen in second script will be able to execute it's load command first?

@amutu
Copy link
Author

amutu commented Apr 15, 2014

on the pgsql console,I get error some time(may happen,then disappear,then happen...):
ERROR: group by sequence doesn't match values sequence
when run sql:
select cs_hash_dup_count(x.uin ,x.class1) from wx_version,crashlog_get(clientversion) as x where x is not null;

on the console:
tail -f ./check_cs_data.log
I get ".....n data error found!"

and sure, the load session get error aoubt the out of order.but this sould not make the cs length miss match

@knizhnik
Copy link
Owner

Actually I do not understand what makes psql return error code. If your query returns nothing (empty result set_, then psql still returns normal error code. It will return non-zero error code only if some error happen during query execution. And I suspect that such error will be "value out of timeseries order"! But please notice that this error is detected only for timestamp timeseries. And values are now appended in order of columns in the table. In your case: uin | clientversion | revision | crashtime ...
It means that before this error was reported for crashtime (or what else you are using as timestamp), some other values from this columns will be successfully added to columnar store. And it will cause timeseries length mismatch. IMCS is not transactional storage - it is not able to undo partly completed operation. The obvious workarround is to populate first timeseries column. I committed this patch in revision 62.

@amutu
Copy link
Author

amutu commented Apr 16, 2014

Now,I can reproduce this bug in this branch : 64fb76f

this is the step:
crash=> select crashlog_truncate();

crashlog_truncate

(1 row)

crash=> select crashlog_load(filter:= 'logtime >= $$2014-04-14$$ and logtime < $$2014-04-14 01:00:00$$');

crashlog_load

    902405

(1 row)

---------------fresh data,no miss match cs--------------------
crash=> select count(1) from (select x.clientversion a ,cs_count(x.uin) b,cs_count(x.uin) c,cs_count(x.logtime) d, cs_count(x.class1) e, cs_count(x.class2) f, cs_count(x.class3) g ,cs_count(x.revision) h,cs_count(x.phoneid) i from wx_version,crashlog_get(clientversion) as x where x is not null) t where b <> c or b<> d or b<> e or b<> f or b <> g or b<> h or b<> i;

count

 0

(1 row)

--------------------delete has bug-------------
crash=> select crashlog_delete(clientversion,'2014-04-14 00:14:33','2014-04-14 00:18:01') from wx_version;

crashlog_delete

           1
           0
           1

......
0
0
2
0
11462
1

(240 rows)

--------------cs data length miss match------------------
crash=> select count(1) from (select x.clientversion a ,cs_count(x.uin) b,cs_count(x.uin) c,cs_count(x.logtime) d, cs_count(x.class1) e, cs_count(x.class2) f, cs_count(x.class3) g ,cs_count(x.revision) h,cs_count(x.phoneid) i from wx_version,crashlog_get(clientversion) as x where x is not null) t where b <> c or b<> d or b<> e or b<> f or b <> g or b<> h or b<> i;

count

34

(1 row)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants