"On entry to DORMQR parameter number 10 had an illegal value" - perhaps lapack/NUMA related? #342

Open
jwaldmann opened this issue Nov 5, 2020 · 3 comments

Comments

@jwaldmann

Hi. I am getting the error

 ** On entry to DORMQR parameter number 10 had an illegal value

when I run

stack test --resolver=nightly

for https://gitlab.imn.htwk-leipzig.de/waldmann/circuit

Is it something to do with NUMA? The executable (for the test case) is dynamically linked with

   libnuma.so.1 => /lib64/libnuma.so.1 

and the processor is AMD Ryzen 7 PRO 2700X.

When I run this on a different machine, this library isn't used, and the test runs fine.

The error message seems to come from LAPACK.
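
For context on the message itself: in the reference LAPACK sources, the tenth argument of DORMQR is LDC, the leading dimension of the matrix C, and the routine reports parameter 10 when LDC < max(1, M). The C sketch below paraphrases that argument check. The name `dormqr_check_args` is hypothetical and used only for illustration; it is not part of LAPACK or hmatrix, and the sketch does not explain why the check fails on this machine.

```c
#include <ctype.h>

static int max_int(int a, int b) { return a > b ? a : b; }

/* Hypothetical helper (not a LAPACK routine): paraphrases the argument
 * checks at the top of the reference DORMQR. Returns 0 if the arguments
 * pass, or -i where i is the position of the first illegal argument. */
int dormqr_check_args(char side, char trans, int m, int n, int k,
                      int lda, int ldc, int lwork)
{
    int left   = toupper((unsigned char)side)  == 'L';
    int right  = toupper((unsigned char)side)  == 'R';
    int notran = toupper((unsigned char)trans) == 'N';
    int tran   = toupper((unsigned char)trans) == 'T';
    int nq = left ? m : n;                          /* order of Q */
    int nw = left ? max_int(1, n) : max_int(1, m);  /* minimum size of WORK */
    int lquery = (lwork == -1);                     /* workspace query */

    if (!left && !right)           return -1;   /* SIDE  */
    if (!notran && !tran)          return -2;   /* TRANS */
    if (m < 0)                     return -3;   /* M     */
    if (n < 0)                     return -4;   /* N     */
    if (k < 0 || k > nq)           return -5;   /* K     */
    if (lda < max_int(1, nq))      return -7;   /* LDA   */
    if (ldc < max_int(1, m))       return -10;  /* LDC: the case reported here */
    if (lwork < nw && !lquery)     return -12;  /* LWORK */
    return 0;
}
```

For example, `dormqr_check_args('L', 'T', 5, 1, 3, 5, 3, -1)` yields -10 because LDC = 3 is smaller than M = 5.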

@idontgetoutmuch
Member

Thanks for the bug report.

@jwaldmann
Author

jwaldmann commented Nov 10, 2020

What can I do to help debug this? Does the following help?

I found that the error message is printed by SUBROUTINE XERBLA( SRNAME, INFO ) in lapack-3.9.0/SRC/xerbla.f,

so I compiled my test case with debugging

stack test --resolver=nightly -v --executable-profiling --library-profiling

then ran it under the debugger, set a breakpoint on xerbla_, and started the program:

gdb stack-work/dist/x86_64-linux/Cabal-3.2.0.0/build/render-bp/render-bp
(gdb) break xerbla_
Function "xerbla_" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (xerbla_) pending.

(gdb) run
...
Breakpoint 1, 0x00007ffff7f98650 in xerbla_ () from /lib64/libblas.so.3
(gdb) where
#0  0x00007ffff7f98650 in xerbla_ () from /lib64/libblas.so.3
#1  0x00007ffff79d3ccd in dormqr_ () from /lib64/liblapack.so.3
#2  0x00007ffff7963912 in dgelss_ () from /lib64/liblapack.so.3
#3  0x00000000008d5fef in linearSolveSVDR_l ()
#4  0x00000000007e53fe in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x0000000000000001 in ?? ()
#7  0x0000000000000009 in ?? ()
#8  0x0000004200107da0 in ?? ()
#9  0x0000000000000000 in ?? ()

@idontgetoutmuch
Member

Sadly that doesn't help me and I can't reproduce the bug (all tests pass).

Here's what I do in these situations (apologies if this is grandmothers and eggs).

  1. Create a self-contained example that triggers the error.
  2. Re-write this entirely in C.
  3. If the error is no longer triggered then we know to focus on the Haskell side of things.
  4. My guess is that the bug will be triggered in C and then we can report it to the library maintainers.
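
Steps 1 and 2 above could start from something like the following C sketch. It is not taken from hmatrix: `run_dgelss`, `stub_dgelss` and the `-100` return code are hypothetical names chosen for illustration. The stub repeats only a subset of DGELSS's documented argument checks and does no numerical work, so the harness can be exercised without LAPACK installed.

```c
#include <stdlib.h>

/* Pointer type matching the Fortran DGELSS interface (arguments by reference). */
typedef void (*dgelss_fn)(const int *m, const int *n, const int *nrhs,
                          double *a, const int *lda,
                          double *b, const int *ldb,
                          double *s, const double *rcond, int *rank,
                          double *work, const int *lwork, int *info);

/* Hypothetical harness. a is m-by-n in column-major order, b holds
 * max(m,n) rows by nrhs columns, s has room for min(m,n) values.
 * Returns the solver's INFO, or -100 (our own code) if malloc fails. */
int run_dgelss(dgelss_fn solver, int m, int n, int nrhs,
               double *a, double *b, double *s, int *rank)
{
    int lda = m > 1 ? m : 1;
    int ldb = m > n ? m : n;
    if (ldb < 1) ldb = 1;
    double rcond = -1.0;   /* negative value: use machine precision */
    double wkopt = 0.0;
    int lwork = -1;        /* -1 asks the solver for its workspace size */
    int info = 0;

    solver(&m, &n, &nrhs, a, &lda, b, &ldb, s, &rcond, rank,
           &wkopt, &lwork, &info);
    if (info != 0) return info;

    lwork = wkopt < 1.0 ? 1 : (int)wkopt;
    double *work = malloc((size_t)lwork * sizeof *work);
    if (work == NULL) return -100;

    solver(&m, &n, &nrhs, a, &lda, b, &ldb, s, &rcond, rank,
           work, &lwork, &info);
    free(work);
    return info;
}

/* Stand-in solver: checks dimensions the way DGELSS documents them,
 * reports a workspace size of 1, and sets rank to 0 on a real call. */
void stub_dgelss(const int *m, const int *n, const int *nrhs,
                 double *a, const int *lda, double *b, const int *ldb,
                 double *s, const double *rcond, int *rank,
                 double *work, const int *lwork, int *info)
{
    int mx = *m > *n ? *m : *n;
    (void)a; (void)b; (void)s; (void)rcond;
    *info = 0;
    if (*m < 0)                         *info = -1;
    else if (*n < 0)                    *info = -2;
    else if (*nrhs < 0)                 *info = -3;
    else if (*lda < (*m > 1 ? *m : 1))  *info = -5;
    else if (*ldb < (mx > 1 ? mx : 1))  *info = -7;
    if (*info == 0) {
        work[0] = 1.0;
        if (*lwork != -1) *rank = 0;
    }
}
```

To run against the system library instead of the stub, declare `dgelss_` (the symbol seen in the backtrace) with the same argument list as `dgelss_fn`, pass it as `solver`, and link against LAPACK and BLAS, typically with `-llapack -lblas`. If the DORMQR message appears with the same matrix dimensions as the failing Haskell test, that points away from the Haskell side.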

One other thing that worries me is that we might be using different versions of the various libraries and you yourself may be using different versions on your two systems. I will put some nix together to pin the various packages. If you don't use nix at least I will be able to specify which versions I am using.

I know nothing about NUMA or why it is used in one case and not the other.
