Server crash when building an ivfflat index with a high number of clusters #42

Closed
ArthurMelin opened this issue Oct 26, 2022 · 2 comments

Comments

@ArthurMelin

Hello again, this issue is somewhat related to #41 but happens under different conditions and at a different code location.

This time the crash happens when creating an index on a table with a large number of rows and with the lists parameter set above roughly 6500.

Reproduction steps:

CREATE TABLE embed (id integer NOT NULL, vec vector(384) NOT NULL);

Insert 1M rows into the table

SET maintenance_work_mem='16GB';
CREATE INDEX ON embed USING ivfflat (vec vector_cosine_ops) WITH (lists = 8000);

Server logs

2022-10-26 09:17:07.110 UTC [54774] STATEMENT:  create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:07.110 UTC [54774] DEBUG:  building index "embed_vec_idx" on table "embed" serially
2022-10-26 09:17:19.927 UTC [54584] DEBUG:  snapshot of 1+0 running transaction ids (lsn 0/6A280188 oldest xid 773 latest complete 772 next xid 774)
2022-10-26 09:17:47.742 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  server process (PID 54774) was terminated by signal 11: Segmentation fault
2022-10-26 09:17:47.743 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:47.743 UTC [53734] LOG:  server process (PID 54774) was terminated by signal 11: Segmentation fault
2022-10-26 09:17:47.743 UTC [53734] DETAIL:  Failed process was running: create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);
2022-10-26 09:17:47.743 UTC [53734] LOG:  terminating any other active server processes
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54588
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54584
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54583
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54585
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54586
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  sending SIGQUIT to process 54587
2022-10-26 09:17:47.743 UTC [54587] DEBUG:  writing stats file "pg_stat/global.stat"
2022-10-26 09:17:47.743 UTC [53734] DEBUG:  forked new backend, pid=54840 socket=9
2022-10-26 09:17:47.744 UTC [54840] LOG:  connection received: host=[local]
2022-10-26 09:17:47.744 UTC [54840] FATAL:  the database system is in recovery mode
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  proc_exit(1): 1 callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  exit(1)
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:47.744 UTC [54840] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  server process (PID 54840) exited with exit code 1
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:47.746 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:48.242 UTC [54587] DEBUG:  writing stats file "pg_stat/db_16385.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  removing temporary stats file "pg_stat_tmp/db_16385.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  writing stats file "pg_stat/db_0.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  removing temporary stats file "pg_stat_tmp/db_0.stat"
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2022-10-26 09:17:48.243 UTC [54587] DEBUG:  proc_exit(-1): 0 callbacks to make
2022-10-26 09:17:48.245 UTC [53734] DEBUG:  reaping dead processes
2022-10-26 09:17:48.245 UTC [53734] LOG:  all server processes terminated; reinitializing

GDB stack trace

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6309e36df9 in InitCenters (index=index@entry=0x7f6309ea5d38, samples=samples@entry=0x7f62dde6b050, centers=centers@entry=0x564469e56d20, lowerBound=lowerBound@entry=0x7f5fe2f62050)
    at src/ivfkmeans.c:63
63                              lowerBound[j * numCenters + i] = distance;
(gdb) bt
#0  0x00007f6309e36df9 in InitCenters (index=index@entry=0x7f6309ea5d38, samples=samples@entry=0x7f62dde6b050, centers=centers@entry=0x564469e56d20, lowerBound=lowerBound@entry=0x7f5fe2f62050)
    at src/ivfkmeans.c:63
#1  0x00007f6309e37164 in ElkanKmeans (index=0x7f6309ea5d38, samples=0x7f62dde6b050, centers=0x564469e56d20) at src/ivfkmeans.c:254
#2  0x00007f6309e37ada in IvfflatKmeans (index=0x7f6309ea5d38, samples=<optimized out>, centers=0x564469e56d20) at src/ivfkmeans.c:513
#3  0x00007f6309e354c0 in ComputeCenters (buildstate=buildstate@entry=0x7fff8198cd30) at src/ivfbuild.c:401
#4  0x00007f6309e35f4c in BuildIndex (heap=<optimized out>, index=0x7f6309ea5d38, indexInfo=<optimized out>, buildstate=buildstate@entry=0x7fff8198cd30, forkNum=forkNum@entry=MAIN_FORKNUM) at src/ivfbuild.c:580
#5  0x00007f6309e35fc4 in ivfflatbuild (heap=<optimized out>, index=<optimized out>, indexInfo=<optimized out>) at src/ivfbuild.c:599
#6  0x0000564467c8022c in index_build (heapRelation=heapRelation@entry=0x7f6309ea5a30, indexRelation=indexRelation@entry=0x7f6309ea5d38, indexInfo=indexInfo@entry=0x564469d00f38,
    isreindex=isreindex@entry=false, parallel=parallel@entry=true) at index.c:3012
#7  0x0000564467c81e06 in index_create (heapRelation=heapRelation@entry=0x7f6309ea5a30, indexRelationName=indexRelationName@entry=0x564469de0920 "embed_vec_idx", indexRelationId=40964, indexRelationId@entry=0,
    parentIndexRelid=parentIndexRelid@entry=0, parentConstraintId=parentConstraintId@entry=0, relFileNode=0, indexInfo=0x564469d00f38, indexColNames=0x564469de08a8, accessMethodObjectId=16435, tableSpaceId=0,
    collationObjectId=0x564469dd3360, classObjectId=0x564469dd3380, coloptions=0x564469dd33a0, reloptions=94851833923912, flags=0, constr_flags=0, allow_system_table_mods=false, is_internal=false,
    constraintId=0x7fff8198d0f4) at index.c:1232
#8  0x0000564467d39c55 in DefineIndex (relationId=relationId@entry=16462, stmt=stmt@entry=0x564469b70780, indexRelationId=indexRelationId@entry=0, parentIndexId=parentIndexId@entry=0,
    parentConstraintId=parentConstraintId@entry=0, is_alter_table=is_alter_table@entry=false, check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) at indexcmds.c:1164
#9  0x0000564467f6d24b in ProcessUtilitySlow (pstate=pstate@entry=0x564469d00e20, pstmt=pstmt@entry=0x564469b71820,
    queryString=queryString@entry=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0,
    queryEnv=queryEnv@entry=0x0, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:1534
#10 0x0000564467f6c6d2 in standard_ProcessUtility (pstmt=0x564469b71820, queryString=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);",
    readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:1066
#11 0x0000564467f6c7bc in ProcessUtility (pstmt=pstmt@entry=0x564469b71820, queryString=<optimized out>, readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=0x564469c87e28, qc=0x7fff8198d780) at utility.c:527
#12 0x0000564467f69af6 in PortalRunUtility (portal=portal@entry=0x564469bb35d0, pstmt=pstmt@entry=0x564469b71820, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x564469c87e28, qc=qc@entry=0x7fff8198d780) at pquery.c:1155
#13 0x0000564467f69dd6 in PortalRunMulti (portal=portal@entry=0x564469bb35d0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x564469c87e28,
    altdest=altdest@entry=0x564469c87e28, qc=qc@entry=0x7fff8198d780) at pquery.c:1312
#14 0x0000564467f6a1a3 in PortalRun (portal=portal@entry=0x564469bb35d0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x564469c87e28,
    altdest=altdest@entry=0x564469c87e28, qc=0x7fff8198d780) at pquery.c:788
#15 0x0000564467f66084 in exec_simple_query (query_string=query_string@entry=0x564469b6fa10 "create index on embed using ivfflat (vec vector_cosine_ops) with (lists = 8000);") at postgres.c:1213
#16 0x0000564467f682a4 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff8198d980, dbname=<optimized out>, username=<optimized out>) at postgres.c:4496
#17 0x0000564467ec5e46 in BackendRun (port=port@entry=0x564469b98a70) at postmaster.c:4530
#18 0x0000564467ec8318 in BackendStartup (port=port@entry=0x564469b98a70) at postmaster.c:4252
#19 0x0000564467ec8565 in ServerLoop () at postmaster.c:1745
#20 0x0000564467ec9b9e in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x564469b691b0) at postmaster.c:1417
#21 0x0000564467e0aa52 in main (argc=5, argv=0x564469b691b0) at main.c:209

Versions:

PostgreSQL 14.5
pgvector v0.3.0 (379a760)

@ArthurMelin
Author

Found the issue: the index computations for array accesses in ivfkmeans.c overflow a 32-bit integer:
https://github.com/pgvector/pgvector/blob/master/src/ivfkmeans.c#L63
and there are similar accesses in ElkanKmeans().

I fixed it by changing the type of j and k to int64. @ankane do you want me to make a PR?
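
For illustration, here is a minimal sketch of the idea behind that fix, not the actual patch. The helper names `flat_index64` and `fits_in_int32` are hypothetical and do not exist in pgvector; the sample count assumes 50 samples per list, as in the maintainer's note further down. Making the row index 64-bit forces the multiplication in `j * numCenters + i` to be carried out in 64 bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of pgvector): offset of element (j, i) in a
 * row-major array with numCenters columns. Because j is int64_t, the product
 * j * numCenters is evaluated in 64 bits and cannot wrap at 2^31.
 */
static int64_t
flat_index64(int64_t j, int numCenters, int i)
{
	assert(j >= 0 && numCenters > 0 && i >= 0 && i < numCenters);
	return j * numCenters + i;
}

/* True if the offset would also have been representable as a 32-bit int. */
static bool
fits_in_int32(int64_t offset)
{
	return offset <= INT32_MAX;
}
```

With lists = 8000 and an assumed 8000 * 50 = 400000 samples, the last element sits at `flat_index64(399999, 8000, 7999)`, which is 3199999999 and does not fit in a 32-bit int. That is consistent with the crash at the `lowerBound[j * numCenters + i]` store in the backtrace.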

@ankane closed this as completed in b3cad93 on Oct 30, 2022
@ankane
Member

ankane commented Oct 30, 2022

Hey @ArthurMelin, thanks for the great reporting and debugging! Just pushed a fix.

A few notes to self:

  • lowerBound indexing overflows at ~6500 lists: the array has centers * (centers * samples per center) elements, and 6500 * (6500 * 50) is around 2^31
  • halfcdist indexing does not overflow: the array has at most IVFFLAT_MAX_LISTS * IVFFLAT_MAX_LISTS elements, and 32768 * 32768 = 2^30 is less than 2^31, but it could overflow if IVFFLAT_MAX_LISTS is increased
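
The arithmetic in the two notes above can be checked with a small sketch. It assumes 50 samples per list, as stated in the first note, and that the largest offset used is the element count minus one. The names `SAMPLES_PER_LIST`, `lower_bound_elems`, `halfcdist_elems`, and `first_overflowing_lists` are hypothetical and not taken from pgvector:

```c
#include <stdint.h>

#define SAMPLES_PER_LIST 50		/* assumption taken from the note above */

/* Element count of lowerBound: (lists * samples per list) rows x lists columns */
static int64_t
lower_bound_elems(int lists)
{
	return (int64_t) lists * SAMPLES_PER_LIST * lists;
}

/* Element count of halfcdist: lists x lists */
static int64_t
halfcdist_elems(int lists)
{
	return (int64_t) lists * lists;
}

/*
 * Smallest lists value for which the largest offset (elems - 1) no longer
 * fits in a signed 32-bit int. The search is linear, which is fine for the
 * small values involved here.
 */
static int
first_overflowing_lists(int64_t (*elems) (int))
{
	int			lists = 1;

	while (elems(lists) - 1 <= INT32_MAX)
		lists++;
	return lists;
}
```

Under these assumptions, `first_overflowing_lists(lower_bound_elems)` returns 6554, which matches the "around 6500" observed in the report, and `first_overflowing_lists(halfcdist_elems)` returns 46341, which is above the 32768 limit mentioned in the second note.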
