After working with @levkk and @montanalow to install PostgresML (as of master: 63ebce3) on my Linux box, I discovered that functions such as pgml.cosine_similarity and pgml.norm_l1 cause Postgres to segfault.
As an example:
[v15.1][5126] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 188.973 ms
[v][] ?!>
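For reference, the result this query should produce can be computed without BLAS at all. The following is a minimal sketch, not pgml's implementation: `norm_l1_reference` is a hypothetical helper name, and it simply sums absolute values in plain Rust.

```rust
/// Reference L1 norm: the sum of absolute values, with no BLAS involved.
/// (Hypothetical helper for cross-checking; not part of pgml.)
fn norm_l1_reference(v: &[f32]) -> f32 {
    v.iter().map(|x| x.abs()).sum()
}

fn main() {
    // Same input as the crashing query above.
    let v = [1.0_f32, 2.0, 3.0];
    println!("{}", norm_l1_reference(&v)); // prints 6
}
```

A working pgml.norm_l1 should agree with this value for the same input, which gives a quick sanity check once the segfault is resolved.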
The Postgres logs leading up to a crash in pgml.cosine_similarity() are:
/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/logging.py:65: RuntimeWarning: Error deriving logger module name, using <None>. Exception: <module '' from '/home/pg/15/data'> is a built-in module
warnings.warn(
No sentence-transformers model found with name /home/zombodb/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
2023-05-04 18:35:48.950 UTC [20973] LOG: server process (PID 21218) was terminated by signal 11: Segmentation fault
2023-05-04 18:35:48.950 UTC [20973] DETAIL: Failed process was running: select *, pgml.cosine_similarity(embed, pgml.embed('intfloat/e5-large', 'meetings with beer or wine and cheese')) from embeddings_e5large_100k limit 10;
2023-05-04 18:35:48.950 UTC [20973] LOG: terminating any other active server processes
2023-05-04 18:35:48.953 UTC [20973] LOG: all server processes terminated; reinitializing
2023-05-04 18:35:48.979 UTC [20973] FATAL: Can't attach, lock is not in an empty state: PgLwLockInner
2023-05-04 18:35:48.980 UTC [20973] LOG: database system is shut down
The backtrace from a --debug build of pgml is:
Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
(gdb) bt
#0 0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
#1 0x00007ff52b8b363a in blas::sdot (n=1024, x=..., incx=1, y=..., incy=1)
at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/blas-0.22.0/src/lib.rs:109
#2 0x00007ff52b7c6aa6 in pgml::vectors::cosine_similarity_s (vector=..., other=...) at src/vectors.rs:304
#3 0x00007ff52b7c6d9a in pgml::vectors::cosine_similarity_s_wrapper::cosine_similarity_s_wrapper_inner (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#4 0x00007ff52b4ae1c1 in pgml::vectors::cosine_similarity_s_wrapper::{closure#0} () at src/vectors.rs:302
#5 0x00007ff52b6edb8c in std::panicking::try::do_call<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (
data=0x7ffe798f2828) at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:483
#6 0x00007ff52b6f0f6b in __rust_try.llvm.11079318101650794703 () from /home/pg/15/lib/postgresql/pgml.so
#7 0x00007ff52b6ea049 in std::panicking::try<pgrx_pg_sys::submodules::datum::Datum, pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}> (f=...)
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:447
#8 0x00007ff52b75a0f6 in std::panic::catch_unwind<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panic.rs:137
#9 0x00007ff52b765983 in pgrx_pg_sys::submodules::panic::run_guarded<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:403
#10 0x00007ff52b77111c in pgrx_pg_sys::submodules::panic::pgrx_extern_c_guard<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:380
#11 0x00007ff52b7c6c9d in pgml::vectors::cosine_similarity_s_wrapper (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#12 0x000055fa54ce4b43 in ExecInterpExpr ()
#13 0x000055fa54cf15a2 in ExecScan ()
#14 0x000055fa54d0c368 in ExecLimit ()
#15 0x000055fa54ce88a2 in standard_ExecutorRun ()
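The backtrace shows the crash inside `sdot_`, called with n=1024 and unit strides. As a way to cross-check results independently of the linked BLAS, here is a pure-Rust sketch of a strided dot product and a cosine similarity built on it. The function names are hypothetical, only non-negative strides are handled (real BLAS also accepts negative ones), and nothing here is claimed to match how pgml's `cosine_similarity_s` is actually written.

```rust
/// Pure-Rust stand-in for what BLAS `sdot` computes with non-negative strides:
/// the sum over i in 0..n of x[i * incx] * y[i * incy].
fn sdot_reference(n: usize, x: &[f32], incx: usize, y: &[f32], incy: usize) -> f32 {
    (0..n).map(|i| x[i * incx] * y[i * incy]).sum()
}

/// Cosine similarity built from three dot products.
/// Returns None for mismatched lengths or when either vector has zero norm.
fn cosine_similarity_reference(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let n = a.len();
    let dot = sdot_reference(n, a, 1, b, 1);
    let norm_a = sdot_reference(n, a, 1, a, 1).sqrt();
    let norm_b = sdot_reference(n, b, 1, b, 1).sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

fn main() {
    // 1024 matches the `n` seen in frame #1 of the backtrace; the data is made up.
    let a: Vec<f32> = (0..1024).map(|i| (i % 7) as f32).collect();
    let b: Vec<f32> = (0..1024).map(|i| (i % 5) as f32).collect();
    println!("{:?}", cosine_similarity_reference(&a, &b));
}
```

Because slice indexing is bounds-checked, a length mismatch in this version panics or returns None instead of reading out of bounds, which is one of the failure modes worth ruling out when a raw BLAS call segfaults.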
My box is a (humblebrag):
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 3970X 32-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU max MHz: 3700.0000
CPU min MHz: 2200.0000
BogoMIPS: 7386.30
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe
popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce
topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep
bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 1 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 128 MiB (8 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Vulnerable
Spec store bypass: Vulnerable
Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
Srbds: Not affected
Tsx async abort: Not affected
This crash seems to be isolated to BLAS: I created 100k embeddings with pgml.embed() in a mere 7m 50s, even using 4 parallel workers, so that part works.
I had a thought that rebooting the computer might help, since I had just stressed the GPU making all those embeddings, but naw, that didn't change anything.
One theory: since pgml links to so many libraries (directly and probably indirectly), maybe there's some kind of symbol resolution problem and the wrong symbols are being called. Just a theory.
@thomcc might be able to offer some help with this if it's some kind of linking problem? Offering up his services as PostgresML's success is pgrx's success!
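One way to start testing the symbol resolution theory is to see which BLAS-like shared objects are actually mapped into the crashing backend. The backtrace reports `sdot_` as coming from pgml.so itself, so any additional BLAS provider loaded into the same process would be a candidate for a clash. The sketch below reads `/proc/<pid>/maps` on Linux and lists matching files. The helper name and the substrings searched for ("blas", "lapack", "mkl", "pgml") are assumptions for illustration, and it assumes mapped paths contain no whitespace.

```rust
use std::collections::BTreeSet;
use std::fs;

/// Return the distinct mapped file paths in a `/proc/<pid>/maps` dump whose
/// file name contains any of the given substrings (case-insensitive).
fn mapped_libraries(maps: &str, needles: &[&str]) -> BTreeSet<String> {
    maps.lines()
        // The pathname, when present, is the sixth whitespace-separated field.
        .filter_map(|line| line.split_whitespace().nth(5))
        // Skip pseudo-entries such as [stack] and [heap].
        .filter(|path| path.starts_with('/'))
        .filter(|path| {
            let file = path.rsplit('/').next().unwrap_or("").to_lowercase();
            needles.iter().any(|n| file.contains(&n.to_lowercase()))
        })
        .map(str::to_string)
        .collect()
}

fn main() {
    // Pass the backend PID (e.g. from `select pg_backend_pid()`) as the first
    // argument; without one, this process inspects itself.
    let pid = std::env::args().nth(1).unwrap_or_else(|| "self".to_string());
    let maps = fs::read_to_string(format!("/proc/{pid}/maps")).unwrap_or_default();
    for lib in mapped_libraries(&maps, &["blas", "lapack", "mkl", "pgml"]) {
        println!("{lib}");
    }
}
```

Reading another process's maps requires running as the same user or as root. The backend has to be inspected after pgml.so is loaded but before the crashing call, for example right after a `pgml.embed()` call that succeeds in the same session.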
$ apt show libblas-dev -a
Package: libblas-dev
Version: 3.10.0-2ubuntu1
Priority: optional
Section: libdevel
Source: lapack
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Science Team <debian-science-maintainers@lists.alioth.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 1084 kB
Provides: libblas.so
Depends: libblas3 (= 3.10.0-2ubuntu1)
Suggests: liblapack-doc
Homepage: https://www.netlib.org/lapack/
Download-Size: 164 kB
APT-Sources: http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
Description: Basic Linear Algebra Subroutines 3, static library
This package is a binary incompatible upgrade to the blas-dev
package. Several minor changes to the C interface have been
incorporated.
.
BLAS (Basic Linear Algebra Subroutines) is a set of efficient
routines for most of the basic vector and matrix operations.
They are widely used as the basis for other high quality linear
algebra software, for example lapack and linpack. This
implementation is the Fortran 77 reference implementation found
at netlib.
.
This package contains a static version of the library.
With an NVIDIA RTX 4080.
Lev helped me discover that by commenting out this line,
postgresml/pgml-extension/build.rs, line 2 in 8b3ac40