Functions using blas cause a segfault (SIGSEGV) #617

Closed
eeeebbbbrrrr opened this issue May 4, 2023 · 2 comments
Labels: bug, ml

eeeebbbbrrrr commented May 4, 2023

After working with @levkk and @montanalow to install PostgresML (as of master: 63ebce3) on my Linux box, I discovered that functions such as pgml.cosine_similarity and pgml.norm_l1 cause Postgres to segfault.

As an example:

[v15.1][5126] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 188.973 ms
[v][] ?!> 

Postgres logs leading up to a crash against pgml.cosine_similarity() are:

/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/logging.py:65: RuntimeWarning: Error deriving logger module name, using <None>. Exception: <module '' from '/home/pg/15/data'> is a built-in module
  warnings.warn(
No sentence-transformers model found with name /home/zombodb/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
2023-05-04 18:35:48.950 UTC [20973] LOG:  server process (PID 21218) was terminated by signal 11: Segmentation fault
2023-05-04 18:35:48.950 UTC [20973] DETAIL:  Failed process was running: select *, pgml.cosine_similarity(embed, pgml.embed('intfloat/e5-large', 'meetings with beer or wine and cheese')) from embeddings_e5large_100k limit 10;
2023-05-04 18:35:48.950 UTC [20973] LOG:  terminating any other active server processes
2023-05-04 18:35:48.953 UTC [20973] LOG:  all server processes terminated; reinitializing
2023-05-04 18:35:48.979 UTC [20973] FATAL:  Can't attach, lock is not in an empty state: PgLwLockInner
2023-05-04 18:35:48.980 UTC [20973] LOG:  database system is shut down

The backtrace from a --debug build of pgml is:

Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
(gdb) bt
#0  0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
#1  0x00007ff52b8b363a in blas::sdot (n=1024, x=..., incx=1, y=..., incy=1)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/blas-0.22.0/src/lib.rs:109
#2  0x00007ff52b7c6aa6 in pgml::vectors::cosine_similarity_s (vector=..., other=...) at src/vectors.rs:304
#3  0x00007ff52b7c6d9a in pgml::vectors::cosine_similarity_s_wrapper::cosine_similarity_s_wrapper_inner (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#4  0x00007ff52b4ae1c1 in pgml::vectors::cosine_similarity_s_wrapper::{closure#0} () at src/vectors.rs:302
#5  0x00007ff52b6edb8c in std::panicking::try::do_call<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (
    data=0x7ffe798f2828) at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:483
#6  0x00007ff52b6f0f6b in __rust_try.llvm.11079318101650794703 () from /home/pg/15/lib/postgresql/pgml.so
#7  0x00007ff52b6ea049 in std::panicking::try<pgrx_pg_sys::submodules::datum::Datum, pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:447
#8  0x00007ff52b75a0f6 in std::panic::catch_unwind<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panic.rs:137
#9  0x00007ff52b765983 in pgrx_pg_sys::submodules::panic::run_guarded<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:403
#10 0x00007ff52b77111c in pgrx_pg_sys::submodules::panic::pgrx_extern_c_guard<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:380
#11 0x00007ff52b7c6c9d in pgml::vectors::cosine_similarity_s_wrapper (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#12 0x000055fa54ce4b43 in ExecInterpExpr ()
#13 0x000055fa54cf15a2 in ExecScan ()
#14 0x000055fa54d0c368 in ExecLimit ()
#15 0x000055fa54ce88a2 in standard_ExecutorRun ()

My box is a (humblebrag):

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 3970X 32-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU max MHz:         3700.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            7386.30
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb
                         rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe
                         popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce
                         topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep
                         bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
                         clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
                         avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

With an NVIDIA RTX 4080:

  nvidia-debugdump -l
Found 1 NVIDIA devices
   Device ID:              0
   Device name:            NVIDIA GeForce RTX 4080   (*PrimaryCard)
   GPU internal ID:        GPU-b772ddf7-d413-e1bb-d1e1-8e7022c59343

Lev helped me discover that with this line commented out,

println!("cargo:rustc-link-lib=static=openblas");

everything works:

[v15.1][8595] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
 norm_l1 
---------
       6
(1 row)

Time: 0.620 ms
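
For anyone who wants to toggle this instead of commenting the line out, here is a minimal sketch of a build-script helper. Both function names and the PGML_STATIC_OPENBLAS variable are made up for illustration and are not part of pgml. The `static` directive is the one quoted above; `dylib` is a standard Cargo link kind. Whether dynamic linking actually avoids the crash is not established in this thread.

```rust
/// Hypothetical helper for a build script (build.rs). Returns the Cargo
/// directive that links OpenBLAS either statically or dynamically.
fn openblas_link_directive(link_static: bool) -> &'static str {
    if link_static {
        // The directive quoted in this issue.
        "cargo:rustc-link-lib=static=openblas"
    } else {
        // Dynamic alternative; `dylib` is a standard Cargo link kind.
        "cargo:rustc-link-lib=dylib=openblas"
    }
}

/// Hypothetical toggle. The caller passes the value of an environment
/// variable (PGML_STATIC_OPENBLAS is an invented name, not a pgml setting).
fn wants_static_openblas(value: Option<&str>) -> bool {
    matches!(value, Some("1") | Some("true"))
}
```

In a build script's `main`, the result would be printed with `println!("{}", openblas_link_directive(wants_static_openblas(std::env::var("PGML_STATIC_OPENBLAS").ok().as_deref())));`.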

The crash seems to be isolated to BLAS: I created 100k embeddings with pgml.embed() in a mere 7m 50s, using 4 parallel workers even, so that part is good.
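
To cross-check results while the BLAS path is suspect, a pure-Rust reference can help. The sketch below is not pgml's implementation, and the function names are made up for illustration. It computes the same quantities with plain iterators: the L1 norm, the unit-stride dot product that `sdot` computes when incx = incy = 1 (as in the backtrace above), and the cosine similarity built on top of it.

```rust
/// Sum of absolute values (the L1 norm), computed without calling BLAS.
pub fn norm_l1_reference(vector: &[f32]) -> f32 {
    vector.iter().map(|v| v.abs()).sum()
}

/// Dot product of two equal-length slices with unit stride, which is
/// what BLAS `sdot` computes for incx = incy = 1.
pub fn dot_reference(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len(), "vectors must have the same length");
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Cosine similarity: dot(x, y) / (|x| * |y|).
/// Returns NaN if either vector has zero length in the Euclidean sense.
pub fn cosine_similarity_reference(x: &[f32], y: &[f32]) -> f32 {
    let dot = dot_reference(x, y);
    let norm_x = dot_reference(x, x).sqrt();
    let norm_y = dot_reference(y, y).sqrt();
    dot / (norm_x * norm_y)
}
```

For the input from the example query, `norm_l1_reference(&[1.0, 2.0, 3.0])` gives 6, matching the output of the working build.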

I had a thought that rebooting the computer might help since I had just stressed the GPU making all those embeddings, but naw, that didn't change anything.

One theory: since pgml links to so many libraries (probably both directly and indirectly), maybe there's some kind of symbol resolution problem and the wrong symbols are being called. Just a theory.

@thomcc might be able to offer some help with this if it's some kind of linking problem? Offering up his services as PostgresML's success is pgrx's success!

levkk added the ml and bug labels May 4, 2023
montanalow (Contributor) commented

@eeeebbbbrrrr which version of blas are you using?

My machine:

 $ apt show libblas-dev -a
Package: libblas-dev
Version: 3.10.0-2ubuntu1
Priority: optional
Section: libdevel
Source: lapack
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Science Team <debian-science-maintainers@lists.alioth.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 1084 kB
Provides: libblas.so
Depends: libblas3 (= 3.10.0-2ubuntu1)
Suggests: liblapack-doc
Homepage: https://www.netlib.org/lapack/
Download-Size: 164 kB
APT-Sources: http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
Description: Basic Linear Algebra Subroutines 3, static library
 This package is a binary incompatible upgrade to the blas-dev
 package. Several minor changes to the C interface have been
 incorporated.
 .
 BLAS (Basic Linear Algebra Subroutines) is a set of efficient
 routines for most of the basic vector and matrix operations.
 They are widely used as the basis for other high quality linear
 algebra software, for example lapack and linpack.  This
 implementation is the Fortran 77 reference implementation found
 at netlib.
 .
 This package contains a static version of the library.

montanalow (Contributor) commented

Fix is in #620
