segfault on 2.2.0 in dgetrf on ubuntu x86_64 #2

Open
dlwh opened this issue May 11, 2021 · 8 comments


dlwh commented May 11, 2021

import dev.ludovic.netlib.LAPACK;
import org.netlib.util.intW;

class Main {
  public static void main(String[] args) {
    double[] arr = new double[400];
    int[] piv = new int[20];
    intW info = new intW(0);
    LAPACK.getInstance().dgetrf(20, 20, arr, 20, piv, info);
  } 
}

reproduces in OpenJDK 64-Bit Server VM, Java 1.8.0_292 and OpenJDK 64-Bit Server VM, Java 16.0.1

There aren't any debug symbols and I'm no expert on assembly, but this is what I'm getting. The first instruction is the one that segfaults.


0x7fffd85a262d  mov    (%rsi),%eax
--
0x7fffd85a262f  lea    0x30(%rbp),%rsi
0x7fffd85a2633  mov    $0x10000,%eax
0x7fffd85a2638  and    0x4(%rsi),%eax
0x7fffd85a263b  cmp    $0x10000,%eax
0x7fffd85a2641  jne    0x7fffd85a26ce
0x7fffd85a2647  mov    $0xe0,%eax
0x7fffd85a264c  and    0x100(%rbp),%eax
0x7fffd85a2652  cmp    $0xe0,%eax
0x7fffd85a2658  jne    0x7fffd85a26ce
0x7fffd85a265e  lea    0x10(%rbp),%rsi
0x7fffd85a2662  mov    (%rsi),%eax
0x7fffd85a2664  cmp    $0x50654,%eax
0x7fffd85a266a  je     0x7fffd85a26ce
0x7fffd85a2670  lea    0x188(%rbp),%rsi
0x7fffd85a2677  vmovdqu32 %zmm0,(%rsi)
0x7fffd85a267d  vmovdqu32 %zmm7,0x40(%rsi)
0x7fffd85a2684  vmovdqu32 %zmm8,0x80(%rsi)
0x7fffd85a268b  vmovdqu32 %zmm31,0xc0(%rsi)
Author

dlwh commented May 11, 2021

also reproduces in version 2.0.0

Author

dlwh commented May 11, 2021

fwiw, i get a segfault for any dimension >= 18, but not before

Author

dlwh commented May 11, 2021

Also get a failure in sgetrf for dim >= 10

Owner

luhenry commented May 12, 2021

Hi @dlwh, let me try to reproduce that locally, it's the first time I've seen it.

Owner

luhenry commented May 12, 2021

I can reproduce with OpenBLAS, but not with Intel MKL. I also can only reproduce if OPENBLAS_NUM_THREADS is greater than 1. I'm now looking at how dgetrf_parallel (the function in which the SIGSEGV is triggered) is invoked, and why it triggers anything.

Owner

luhenry commented May 12, 2021

Here is what I'm observing. When calling dgetrf_ from Java, we have $rsp = 0x7ffff599d0e8. It then calls dgetrf_parallel a first time, which allocates arrays on the stack and moves $rsp to 0x7ffff5918ef8 (541,168 bytes lower). It then calls dgetrf_parallel recursively a second time, which allocates arrays on the stack again and moves $rsp to 0x7ffff5894d80 (another 541,048 bytes lower). It then hits a SIGSEGV when trying to store variables on the stack [1].

When accessing the current thread's stack size and stack base, we can clearly see that this is indeed a stack overflow:

(gdb) p (Thread::_thr_current)->_stack_size
$3 = 1052672
(gdb) p (Thread::_thr_current)->_stack_base
$4 = (address) 0x7ffff59a5000 "\177ELF\002\001\001\003"

(0x7ffff59a5000 - 1052672 = 0x7ffff58a4000 is the lowest valid stack address, and $rsp = 0x7ffff5894d80 on the last call to dgetrf_parallel is below it)
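
The arithmetic can be double-checked with a small sketch. The constants are copied from the gdb output in this comment; the class and method names (StackCheck, stackLimit, overflowed) are made up for illustration and are not part of netlib or OpenBLAS.

```java
public class StackCheck {
    // Values copied from the gdb session in this comment.
    static final long STACK_BASE = 0x7ffff59a5000L; // _stack_base (highest address)
    static final long STACK_SIZE = 1052672L;        // _stack_size in bytes
    static final long RSP_ENTRY  = 0x7ffff599d0e8L; // $rsp when dgetrf_ is entered
    static final long RSP_FIRST  = 0x7ffff5918ef8L; // $rsp after the first dgetrf_parallel
    static final long RSP_SECOND = 0x7ffff5894d80L; // $rsp after the second dgetrf_parallel

    // The stack grows down, so the lowest valid address is base - size.
    static long stackLimit() {
        return STACK_BASE - STACK_SIZE;
    }

    // A stack pointer below the limit means the thread ran out of stack.
    static boolean overflowed(long rsp) {
        return rsp < stackLimit();
    }

    public static void main(String[] args) {
        System.out.printf("first call used  %d bytes%n", RSP_ENTRY - RSP_FIRST);
        System.out.printf("second call used %d bytes%n", RSP_FIRST - RSP_SECOND);
        System.out.printf("stack limit      0x%x%n", stackLimit());
        System.out.printf("overflow after first call:  %b%n", overflowed(RSP_FIRST));
        System.out.printf("overflow after second call: %b%n", overflowed(RSP_SECOND));
    }
}
```

The first call still fits in the roughly 1 MB stack; only the second, recursive call pushes $rsp past the limit.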

Now, onto figuring out why dgetrf_parallel allocates so much on the stack, and whether it's reproducible with calls to liblapack.so straight from C.

Also, when setting -Xss10M (set the stack size to 10 MB), I can't reproduce the issue.
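
If changing -Xss for the whole JVM isn't an option, one possible alternative is to run the native call on a dedicated thread that requests a larger stack. This is only a sketch and I have not verified it against this issue: BigStack and runWithStack are hypothetical names, the stackSize argument of the Thread constructor is documented as a hint that some platforms may ignore, and the task body is a placeholder for the actual dgetrf call.

```java
public class BigStack {
    // Runs the task on a new thread that requests stackBytes of stack,
    // then waits for it to finish. Exceptions thrown by the task are
    // not propagated to the caller in this sketch.
    static void runWithStack(Runnable task, long stackBytes) throws InterruptedException {
        Thread t = new Thread(null, task, "big-stack", stackBytes);
        t.start();
        t.join();
    }

    public static void main(String[] args) throws InterruptedException {
        runWithStack(() -> {
            // Placeholder: the call into LAPACK would go here.
            System.out.println("running on " + Thread.currentThread().getName());
        }, 10L * 1024 * 1024); // request 10 MB, matching -Xss10M
    }
}
```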

[1]

   0x00007fff2a145040 <+0>:     lea    0x8(%rsp),%r10
   0x00007fff2a145045 <+5>:     and    $0xffffffffffffff80,%rsp
   0x00007fff2a145049 <+9>:     mov    %rdi,%rax
   0x00007fff2a14504c <+12>:    mov    %rdx,%rsi
   0x00007fff2a14504f <+15>:    pushq  -0x8(%r10)
   0x00007fff2a145053 <+19>:    push   %rbp
   0x00007fff2a145054 <+20>:    mov    %rsp,%rbp // $rbp = $rsp
   0x00007fff2a145057 <+23>:    push   %r15
   0x00007fff2a145059 <+25>:    push   %r14
   0x00007fff2a14505b <+27>:    push   %r13
   0x00007fff2a14505d <+29>:    push   %r12
   0x00007fff2a14505f <+31>:    push   %r10
   0x00007fff2a145061 <+33>:    push   %rbx
   0x00007fff2a145062 <+34>:    sub    $0x840c0,%rsp // allocate stack frame of 0x840c0 = 540,864 bytes
=> 0x00007fff2a145069 <+41>:    mov    %rdi,-0x83fd0(%rbp) // $rbp[-0x83fd0] = $rdi // stack grows down so access with negative index is normal

Owner

luhenry commented May 13, 2021

@dlwh this issue is a repeat of a previously encountered issue with Breeze and netlib-java (so prior to my change). I opened an issue on OpenBLAS.

In the meantime, the workarounds are the following:

I'm exploring the licensing implications of packaging a custom OpenBLAS in the library to avoid having to install it locally, similarly to numpy. That might also be a longer term solution for this specific issue.

Author

dlwh commented May 13, 2021 via email
