Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSAUPD() fails with ARPACK 3.9.0, but not with ARPACK 3.8.0 #401

Closed
szhorvat opened this issue Feb 16, 2023 · 40 comments
Closed

DSAUPD() fails with ARPACK 3.9.0, but not with ARPACK 3.8.0 #401

szhorvat opened this issue Feb 16, 2023 · 40 comments

Comments

@szhorvat
Copy link
Contributor

szhorvat commented Feb 16, 2023

This is a summary of the issue described at igraph/igraph#2311, i.e. DSAUPD() failing when running the igraph test suite with ARPACK 3.9.0. With ARPACK 3.8.0, there is no problem.

We are solving the following eigenvalue problem:

  • Find the leading eigenvalue/vector of an $n\times n$ matrix $A$ where $A_{ij} = 0$ for all $i,j$ except that $A_{12} = A_{21} = 1$. The problem occurs specifically with $n=10$ and $n=20$, not other values.
  • Find a single eigenvalue/vector using which='LA'

The error

An error occurs depending on the starting vector that is used. The error is easy to reproduce after trying a few random starting vectors.

Details of the error:

Reproducing

The issue can be reproduced independently of igraph, using Octave. This suggests that the problem may be in ARPACK itself, and not with how igraph uses ARPACK.

Transcript of Octave session:

octave:20> n=10 # problem occurs with n=10 or n=20 only
octave:21> A=zeros(n,n); A(1,2)=A(2,1)=1
A =

   0   1   0   0   0   0   0   0   0   0
   1   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0

octave:22> for i=1:100; opt.v0 = rand(n,1); [v,d]=eigs(A,1,'la',opt); end
error: eigs: error in dsaupd: 
error: called from
    eigs at line 338 column 22

The purpose of the for loop is to find a random starting vector that reproduces the issue.

The actual starting vector used in the igraph code can be replicated like so: opt.v0 = ones(n,1); opt.v0(1:2) = 1e-4 * (1-2*rand(2,1)). However, the simpler random starting vector used in the above example also reproduces the issue.

Details of my environment

I am using ARPACK on macOS 10.14.6 on x86_64 with MacPorts, specifically this setup: macports/macports-ports#17716

The problem occurs regardless of the BLAS implementation used. Tried both Apple's vecLib (default in MacPorts) and OpenBLAS.

The ARPACK 3.9.0 test suite passes in this environment.

Affected ARPACK versions

Only 3.9.0, not 3.8.0.

@fghoussen
Copy link
Collaborator

fghoussen commented Feb 16, 2023

Try to test the eigen solve by yourself out of any software that may do "stuffs" and set parameters (you don't know about) for you: you need to understand what are the parameter you need for your solve to CV (or to understand which parameter make the solve DV) and be sure they are set the way you need for your particular problem.

For instance, use the arpack C++ solver provided by arpack-ng (based on eigen so you need to install it)

>> ./bootstrap
>> ./configure --enable-icb-exmm
>> make all check

Now you can play with all parameters and understand what are the one you need:

>> cd EXAMPLES/MATRIX_MARKET/
>> ./arpackmm --help
>> cat A.mtx
10 10 2

0  1  1.
1  0  1.

Now you see that using a CG solver is OK :

>> ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv CG
OPT: A A.mtx, B N.A., dense no, nbEV 2, nbCV 10, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag SM
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxIt 100, Ritz vectors
OPT: slv CG, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100, slvDrtPivot 1e-06, slvDrtOffset 0, slvDrtScale 1
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0 s

OUT: mode 1, nb EV found 2, nb iterations 1
OUT: init mode solver 0 s, RCI time 0 s
OUT: full time 0.014 s

STAT: total number of user OP*x operation                         19
STAT: total number of user  B*x operation                         0
STAT: total number of reorthogonalization steps taken             9
STAT: total number of it. refinement steps in reorthogonalization 18
STAT: total number of restart steps                               8


>> for i in {1..100}; do ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv CG; done

Using other solvers (BiCG, LU, ...) seems OK too

>> for i in {1..100}; do ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv BiCG; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv LU; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv QR; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv LLT; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag SM --nbEV 2 --nbCV 10 --slv LDLT; done

All works fine at my side, all solves are OK. I do not see any problem. Do you reproduce all this at your side?

Above I asked nbEV = 2 as I am a human and I know that the matrice may have only 2 eigen values. Say now octave can't infer that and set "automatically" for you nbEV = 5 (or n=10 the size of the matrix)... And you'll get a fail as arpack will not be able to compute all EV in this case as you are asking too much and it does no more make sense (rank(A) = 2 - does asking 5 or more EV makes sense? Can you blame for failing if what you ask doesn't make sense?...)

>> ./arpackmm --A A.mtx --mag SM --nbEV 5 --nbCV 15 --slv QR
OPT: A A.mtx, B N.A., dense no, nbEV 5, nbCV 15, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag SM
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxIt 100, Ritz vectors
OPT: slv QR, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100, slvDrtPivot 1e-06, slvDrtOffset 0, slvDrtScale 1
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0 s
Error: [dz][sn]aupd - KO with info -3, nbIt 100
Error: arpack solve KO
Error: solve KO
Error: arpack solve KO

I do not know which parameter octave / igraph use or not, and, I do not know if they use ICB (which prevent a bunch of bugs as, with ICB, compiler can check types #394 (comment))

[v,d]=eigs(A,1,'la',opt);

Not sure what this tells?! I never use octave or scipy. I always used arpack from Fortran or C setting parameters myself.

Which kind of solver does this use in backend? CG? BiCG? MINRES? GMRES? QR? LU? LLt? LDLt? With or without preconditioning? If preconditioning which one Jacobi? ILU? Depending on cases wrong solvers (used in backend during reverse-communication-interface steps) may cause failures. If iterative solving, what is the tolerance you ask for (failures may occur if tol is too small)?

How much eigen values do you ask for? 1? 2? 10? Was CV workspace increased accordingly (2*nbEV+1 at least)?

Sometimes, you can't get result unless you shift/invert: did you try? Specially when you look for small eigen values, invert is often key AFAIR. Shift may help if CV is difficult.

Personal experience: depending on your matrices, you need to play with all a bunch of parameters, some set of parameters will fail, some other get result but are slow, some others are the ones that enable efficient / fast solve.

I do not know how and which parameters in octave/scipy/igraph are set (or not) according to the nature of the matrices of your problem: this is crucial to get good / fast convergence.

For sure, there is no one set of unique parameters that work for all matrices...

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 17, 2023

For reference, this is the eigs() documentation for Octave: https://octave.sourceforge.io/octave/function/eigs.html

Regarding solvers, I am not sure what you are referring to. We use the reverse communication interface, i.e. calling DSAUPD, as described here: https://github.com/opencollab/arpack-ng/blob/master/SRC/dsaupd.f There is no mention of solvers on this documentation page.

I will give the detailed parameters passed to DSAUPD below:

  • We are operating in "Mode 1", i.e. A*x = lambda*x, A symmetric
  • BMAT='I' (since using Mode 1)
  • WHICH='LA', as pointed out in the top post. You used SM in your example. This is the 3rd argument of eigs() when using Octave.
  • NEV=1, looking for a single eigenvalue, as pointed out in the top post. This is the 2nd argument of eigs() when using Octave.
  • TOL=0
  • RESID, we use a random initial residual vector, in the manner I described in the top post. Simply choosing all elements uniformly at random from (0,1) also reproduces the problem. In Octave, this is set through the v0 field of the options.
  • NCV. You are right, I did not originally give this information. For a 10-by-10 matrix we used 5. I can reproduce the issue with NCV values of 4, 5, ..., 9. In Octave, NCV is set with the p field of the options struct. Example of setting it to 5: for i=1:100; opt.p=5; opt.v0 = rand(n,1); [v,d]=eigs(A,1,'la',opt); end
  • LWORKL was set to NCV**2 + 8*NCV, as suggested by the documentation.

EDIT: Adding some missing info:

  • IPARAM(1) = ISHIFT = 1
  • IPARAM(3) = MXITER = 3000
  • IPARAM(7) = MODE = 1

Here's igraph's automatic NCV selection, modelled on Octave's, in case you have any comments on it: https://github.com/igraph/igraph/blob/master/src/linalg/arpack.c#L855

Is there any other information you need about how DSAUPD is called?

@fghoussen
Copy link
Collaborator

fghoussen commented Feb 17, 2023

Did you try to run the exact same command lines detailed here #401 (comment) using LA magnitude? Does it work at your side? If so, you may consider the problem is not arpack but the way it's used (= the way it is set up in igraph/octave/scipy/... as they likely use generic heuristic that may no more suit your particular problem?)

All this works fine at my side:

>> ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv CG
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv CG; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv BiCG; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv LLT; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv LDLT; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv LU; done
>> for i in {1..100}; do ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv QR; done

Now yesterday this >> ./arpackmm --A A.mtx --mag SM --nbEV 5 --nbCV 15 --slv QR failed because nbCV can not exceed N (I should have read the error message).

So now you know how to get this to work: OK with both nbEV = 5 or 9

>> ./arpackmm --A A.mtx --mag LA --nbEV 5 --nbCV 10 --slv CG
OPT: A A.mtx, B N.A., dense no, nbEV 5, nbCV 10, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag LA
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxIt 100, Ritz vectors
OPT: slv CG, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100, slvDrtPivot 1e-06, slvDrtOffset 0, slvDrtScale 1
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0 s

OUT: mode 1, nb EV found 5, nb iterations 1
OUT: init mode solver 0 s, RCI time 0 s
OUT: full time 0.015 s

STAT: total number of user OP*x operation                         19
STAT: total number of user  B*x operation                         0
STAT: total number of reorthogonalization steps taken             9
STAT: total number of it. refinement steps in reorthogonalization 18
STAT: total number of restart steps                               8


>> ./arpackmm --A A.mtx --mag LA --nbEV 9 --nbCV 10 --slv CG
OPT: A A.mtx, B N.A., dense no, nbEV 9, nbCV 10, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag LA
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxIt 100, Ritz vectors
OPT: slv CG, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100, slvDrtPivot 1e-06, slvDrtOffset 0, slvDrtScale 1
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0 s

OUT: mode 1, nb EV found 9, nb iterations 1
OUT: init mode solver 0 s, RCI time 0 s
OUT: full time 0.016 s

STAT: total number of user OP*x operation                         19
STAT: total number of user  B*x operation                         0
STAT: total number of reorthogonalization steps taken             9
STAT: total number of it. refinement steps in reorthogonalization 18
STAT: total number of restart steps                               8


For reference, this is the eigs() documentation for Octave: https://octave.sourceforge.io/octave/function/eigs.html

I encourage you to set as many relevant parameters (according to the nature of your problem) as needed to maximize your chances to get a stable result (that will not change with changes in octave /igraph / ...). The less you let igraph/octave set default parameters for you, the more you are safe.

Regarding solvers, I am not sure what you are referring to. We use the reverse communication interface,

In between arpack iterative calls, arpack ask you to update the vector he works with. To do theses updates you may need to solve some linear system using any solver you decide like LAPACK

call av (nx, workd(ipntr(1)), workd(ipntr(2)))
or eigen
Y = solver.solve(YY); // Y = B^-1 * A * X.
or any other solver of your choice (petsc, superlu, ...): I do not know which solvers igraph/octave decided to use, nor how they are parametrised.

i.e. calling DSAUPD, as described here: https://github.com/opencollab/arpack-ng/blob/master/SRC/dsaupd.f There is no mention of solvers on this documentation page.

I will give the detailed parameters passed to DSAUPD below:

* We are operating in "Mode 1", i.e. `A*x = lambda*x, A symmetric`

Yes, when no B, one always use mode 1

* `BMAT='I'` (since using Mode 1)

* `WHICH='LA'`, as pointed out in the top post. You used SM in your example. This is the 3rd argument of `eigs()` when using Octave.

Exact same result with LA added here #402

* `NEV=1`, looking for a single eigenvalue, as pointed out in the top post. This is the 2nd argument of `eigs()` when using Octave.

* `TOL=0`

This is likely what kills you. If your RCI uses an iterative solver (?) you have no chance to get to a solution: relax your tol. Use 1e.-3 for instance. Try different tol.

* `RESID`, we use a random initial residual vector, in the manner I described in the top post. Simply choosing all elements uniformly at random from (0,1) also reproduces the problem. In Octave, this is set through the `v0` field of the options.

* `NCV`. You are right, I did not originally give this information. For a 10-by-10 matrix we used 5. I can reproduce the issue with NCV values of 4, 5, ..., 9. In Octave, NCV is set with the `p` field of the options struct. Example of setting it to 5: `for i=1:100; opt.p=5; opt.v0 = rand(n,1); [v,d]=eigs(A,1,'la',opt); end`

* `LWORKL` was set to `NCV**2 + 8*NCV`, as suggested by the documentation.

Here's igraph's automatic NCV selection, modelled on Octave's, in case you have any comments on it: https://github.com/igraph/igraph/blob/master/src/linalg/arpack.c#L855

Sorry, can't afford to go on debugging igraph / octave... Moreover if the problem is not in arpack but the way it's used

Is there any other information you need about how DSAUPD is called?

@fghoussen
Copy link
Collaborator

fghoussen commented Feb 17, 2023

Regarding solvers, I am not sure what you are referring to. We use the reverse communication interface,

Also, common, misuse I have seen: are you sure you handle all possible cases in RCI? That is all other cases (ido=-1 or 1 or 2 - easily checked with an assert) than the most frequent one

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 17, 2023

I'm sorry, I am not familiar with arpackmm, and I cannot correlate what arpackmm does with what is written in the DSAUPD documentation. I do not even see how it calls DSAUPD(), or when it calls DSAUPD() vs DNAUPD(). For this reason, as well as because of the issues I will point out below, I cannot give you a reproducer in terms of arpackmm.


Above, I gave complete details about how to call DSAUPD() to reproduce the issue. If anything specific is missing, please point it out and I will make an effort to provide it. I hope you can work with this.


All this works fine at my side:

>> ./arpackmm --A A.mtx --mag LA --nbEV 2 --nbCV 10 --slv CG

The parameters you use here are not consistent with what I gave you (neither --nbEV nor --nbCV). In fact I did try arpackmm --A A.mtx --mag LA --nbEV 1 --nbCV 5, but it gives me "Error: bad --mag - bad argument". Furthermore, it is unclear to me how to use arpackmm to replicate all the details I described, for example how to provide a starting vector.

In between arpack iterative calls, arpack ask you to update the vector he works with. To do theses updates you may need to solve some linear system using any solver you decide like LAPACK

The specific case I described (Mode 1 with DSAUPD()) only requires a matrix multiplication, not solving a linear system.

@szhorvat
Copy link
Contributor Author

Also, common, misuse I have seen: are you sure you handle all possible cases in RCI? That is all other cases (ido=-1 or 1 or 2 - easily checked with an assert) than the most frequent one

Yes, all IDO values are handled. If an unexpected IDO value came up, it would result in printing an informative error message.

@szhorvat
Copy link
Contributor Author

* `TOL=0`

This is likely what kills you. If your RCI uses an iterative solver (?) you have no chance to get to a solution: relax your tol. Use 1e.-3 for instance. Try different tol.

There is no solver needed, only a matrix multiplication is done. tol=0 is supposed to indicate that the default should be used. However, I just tried several tol values between 1e-1 and 1e-6 and it fails in the same way for all of them.

@fghoussen
Copy link
Collaborator

Checking out last master cd49a7f, did you try to run the exact same command lines detailed here #401 (comment) using LA magnitude?

I am not familiar with arpackmm

arpackmm is just a utility that reads matrices formatted in ASCII matrix market format, and, then call arpack (RCI being handled by eigen)

gives me "Error: bad --mag - bad argument".

Checkout out cd49a7f

The specific case I described (Mode 1 with DSAUPD()) only requires a matrix multiplication, not solving a linear system.

OK. Good to know. Easier case.

Yes, all IDO values are handled... would result in printing an informative

Assert if you can to be sure!

The parameters you use here are not consistent with what I gave you (neither --nbEV nor --nbCV)

At my side

>> ./arpackmm --A A.mtx --mag LA --nbEV 1 --nbCV 5 --slv CG
OPT: A A.mtx, B N.A., dense no, nbEV 1, nbCV 5, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag LA
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxIt 100, Ritz vectors
OPT: slv CG, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100, slvDrtPivot 1e-06, slvDrtOffset 0, slvDrtScale 1
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0 s

OUT: mode 1, nb EV found 1, nb iterations 1
OUT: init mode solver 0 s, RCI time 0 s
OUT: full time 0.014 s

STAT: total number of user OP*x operation                         9
STAT: total number of user  B*x operation                         0
STAT: total number of reorthogonalization steps taken             4
STAT: total number of it. refinement steps in reorthogonalization 8
STAT: total number of restart steps                               3

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 18, 2023

The difference appears to be in the starting vector. We provide an explicit starting vector, while arpackmm doesn't. We do this because we can make a reasonable guess about what the eigenvector should look like, and also because it is a requirement of igraph not to rely on external RNGs.

For this specific problem, the failure is reproducible with relatively high probability if providing a random starting vector sampled uniformly from [-1, 1], or from [0, 1].

It's also reproducible if we set the first two elements to 1 + (small random perturbation), and the rest of the elements to zero. Note that the leading eigenvector is (1, 1, 0, 0, ..., 0).

One specific starting vector that reproduces the problem for me has 1.0000450596090784, 1.0000033192770064 as the first two elements, then zero for the rest. However, I would recommend that you experiment with 1 + (small random number) when trying to reproduce the issue.


I cannot reproduce the problem if we leave it up to ARPACK to set a random starting vector, instead of providing it ourselves.

@fghoussen
Copy link
Collaborator

fghoussen commented Feb 18, 2023

I encourage you to use ICB: this prevents hell debugging that may happen as compilers/linkers are heavily developped (optim, LTO, ...). Using ICB make sure type passed from/to C/F77 are the expected ones. Check #405

We do this because we can make a reasonable guess about what the eigenvector should look like

Would have used shift for that instead of providing initial vector: test it with arpackmm.

The difference appears to be in the starting vector.
I cannot reproduce the problem if we leave it up to ARPACK to set a random starting vector

May be in relation with #80 which should not be a problem for me AFAIU (see also JuliaLinearAlgebra/Arpack.jl#138). Not sure.

@szhorvat
Copy link
Contributor Author

Are you able to reproduce the issue with a custom starting vector?

@fghoussen
Copy link
Collaborator

Are you able to reproduce the issue with a custom starting vector?

Not now. Away from home and quite busy this week-end.

You can test that with --restart and dumpToFile/restartFromFile features in arpackmm AFAIR: first dump to file, replace your init vector in the file, re-run arpack with --restart. You'll can answer your question.

Would have used shift for that instead of providing initial vector: test it with arpackmm.

Rephrasing, would have used shift with estimated eigen value (shift is a scalar), and also, setting estimated eigen vector as initial vector. Doing one thing but not the other may screw arpack. If you can estimate eigen value, shifting is likely enough (and simpler)

@FabienPean
Copy link
Contributor

TLDR; the problem comes from from a change in arpack between 3.8.0 and 3.9.0 connected to a change in openblas implementation occurring between 0.3.13 and 0.3.17. You can test it with your example on msys2 by downgrading openblas version available here https://repo.msys2.org/mingw/mingw64

The longer story:

  • Looking at the diff between ARPACK versions: 3.8.0...3.9.0 there are no significant changes to the files related to DSAUPD.
  • Bug couldn't be reproduced within a Github Codespace on https://github.com/igraph/igraph repo with manually installed ARPACK 3.9.0
  • Both octave on msys and macports relies on openblas
  • openblas changed by 8 versions during the lapse of time between arpack 3.8.0 and 3.9.0
  • arpack 3.8.0 works with openblas 0.3.13 and 0.3.21
  • arpack 3.9.0 works with openblas 0.3.13 and fails with openblas 0.3.21
  • arpack 3.9.0 fails on next available msys package openblas 0.3.17 released on Jul 15, 2021 (msys does not have packages for openblas 0.3.14-0.3.16)

@fghoussen
Copy link
Collaborator

arpack 3.9.0 works with openblas 0.3.13 and fails with openblas 0.3.21

@szhorvat: could you try with another blas (netlib or MKL)?

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 21, 2023

I tried with two different BLAS:

  • OpenBLAS 0.3.21
  • Apple's vecLib (macOS 10.14.6 on x86_64)

These are the two that MacPorts supports conveniently. There was no difference in the behaviour.

If you think it's worth it, I can look into trying a third option tomorrow.

@szhorvat
Copy link
Contributor Author

Thanks for looking into it @FabienPean

  • Both octave on msys and macports relies on openblas

To be accurate, most ports in MacPorts, including both Octave and ARPACK, use vecLib by default (the +accelerate variant). But they can be configured to use OpenBLAS.

arpack 3.9.0 works with openblas 0.3.13 and fails with openblas 0.3.21

Do you mean that you managed to reproduce the issue with OpenBLAS 0.3.21 but not 0.3.13? Did you use igraph or Octave to reproduce it?

@fghoussen
Copy link
Collaborator

One specific starting vector that reproduces the problem for me has 1.0000450596090784, 1.0000033192770064 as the first two elements, then zero for the rest.

What do you call / use-as the starting vector?... And do you update associated uncertainty?
For me, starting vector is v: do we talk about the same thing?

@fghoussen
Copy link
Collaborator

fghoussen commented Feb 21, 2023

octave:22> for i=1:100; opt.v0 = rand(n,1); [v,d]=eigs(A,1,'la',opt); end
error: eigs: error in dsaupd:
error: called from
eigs at line 338 column 22

What is this error? Is it -9: Starting vector is zero. or another one?
I may have a clue. Hope to find some time for a few tests...

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 21, 2023

What do you call / use-as the starting vector?...

By "starting vector" I mean the initial residual vector passed in the resid argument during the very first call to dsaupd(). See here: https://github.com/opencollab/arpack-ng/blob/master/SRC/dsaupd.f#L130

And do you update associated uncertainty?

What do you mean by "associated uncertainty"?

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 21, 2023

What is this error? Is it -9: Starting vector is zero. or another one?

I would need more time to figure out for certain what the error is specifically with Octave, but very likely it is the same as with igraph, i.e. not -9 but -9999 Could not build an Arnoldi factorization.. Please see the top post in this thread.

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 21, 2023

Strangely, I was not able to reproduce it in this Codespace either. For reference, this is Ubuntu Jammy. I tried both with OpenBLAS 3.20 as well as with reference BLAS/LAPACK 3.10. I could not reproduce it with either.

I can reproduce it both on macOS 10.14 as well as with msys2 (which causes the igraph CI tests to keep failing at the moment). On macOS, I can reproduce it either when building ARPACK manually, or through MacPorts, and with both GCC 11 and GCC 12 (used as the Fortran compiler).

@FabienPean
Copy link
Contributor

FabienPean commented Feb 21, 2023

Do you mean that you managed to reproduce the issue with OpenBLAS 0.3.21 but not 0.3.13? Did you use igraph or Octave to reproduce it?

More precisely, on msys2/mingw64, the error you described with ARPACK 3.9.0 could not be reproduced with openblas 0.3.13, but it could be reproduced with OpenBLAS 0.3.17. Versions 0.3.14 to 0.3.16 are absent from msys2 repository so couldn't be tested.

Shortly change in ARPACK between 3.8.0→3.9.0 is related negatively to a change in openblas 0.3.14≤version≤0.3.17

@szhorvat
Copy link
Contributor Author

szhorvat commented Feb 22, 2023

arpack 3.9.0 works with openblas 0.3.13 and fails with openblas 0.3.21

@szhorvat: could you try with another blas (netlib or MKL)?

@fghoussen I tried with netlib LAPACK/BLACK version 3.11.0, on macOS 10.14, and I cannot reproduce the problem. I cannot try with MKL, unfortunately.

So here's the current list of findings from my side:

  • macOS 10.14 x86_64, Apple vecLib/Accelerate: reproducible

  • macOS 10.14 x86_64, OpenBLAS 0.3.21: reproducible

  • macOS 10.14 x86_64, Netlib LAPACK/BLAS 3.11: cannot reproduce

  • Linux, Ubuntu Jammy, x86_64, OpenBLAS 0.3.20, cannot reproduce

  • Linux, Ubuntu Jammy, x86_64, Netlib LAPACK/BLAS 3.10, cannot reproduce

  • Linux, openSUSE Leap 15.4, x86_64, OpenBLAS, version indicated as "0.3.8.dev", reproducible

  • Linux, openSUSE Leap 15.4, x86_64, OpenBLAS 0.3.20 with OpenMP threading, cannot reproduce

  • Linux, openSUSE Leap 15.4, x86_64, MKL 2020.4, cannot reproduce

  • ...

@fghoussen
Copy link
Collaborator

@szhorvat: I may have some clue. But, need time to run some tests. Hope to get this done by the week-end.

@szhorvat
Copy link
Contributor Author

Thanks so much for looking into this @fghoussen ! Let me know if you need anything else from me.

@fghoussen
Copy link
Collaborator

@szhorvat: works at my side checkout out #408

@fghoussen
Copy link
Collaborator

One specific starting vector that reproduces the problem for me has 1.0000450596090784, 1.0000033192770064 as the first two elements, then zero for the rest.

Just tried: works OK at least 3 successive times

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 4, 2023

FYI first commit where the issue appears is ce2e69a

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 4, 2023

@fghoussen I have a question about the correct use of the reverse communication interface.

The ARPACK manual (e.g. here), as well as the simple example driver (here) seems to suggest that when calling saupd in a loop, we simply need to check the value of ido and take action accordingly. The loop should continue until ido is 99. But what about the value returned in info? Can info ever be non-zero (indicating an error) when ido != 99? I see that arpackmm checks the value of info directly in the loop. Is this required, or can one count on there being no error when ido != 99?

Does the ARPACK-NG project provide the ARPACK manual somewhere so that we wouldn't have to get it from other, potentially ephemeral sources? The original ARPACK site is no longer accessible, unfortunately.

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 4, 2023

All that said, handling info does not resolve the issue here. In this case I still get ido == 99 (i.e. indicating completion) and info == -9999. Prior to this info was always 0 (i.e. no error).

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 4, 2023

@fghoussen Regarding ce2e69a, as far as I can tell, this commit causes dgetv0 to re-seed the RNG with the same seed on every call. Are you sure that this is fine?

First, note that dgetv0 will only generate random numbers when its initv parameter is set to false (reference). dgetv0 is called through dsaup2 as well as through dsaitr. Through dsaup2 it would only get called with initv=false on the first iteration, specifically when resid was not provided. However, from dsaitr, it can be called with initv=false when a restart occurs (see here). Then it makes no sense to use the same random seed, as the result would be the same on every call (??) I see multiple calls to dlarnv (i.e. the RNG) through dgetv0 in my experiments, presumably these multiple calls should produce different (i.e. actually random) results.

Is there a bug here?

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 5, 2023

What do you call / use-as the starting vector?... And do you update associated uncertainty?
For me, starting vector is v: do we talk about the same thing?

The ARPACK User's Guide refers to the resid parameter of XYaupd as the "starting vector". Please see Section 2.3.7, "Setting the Starting Vector".

The v parameter of the same function is an output parameter, thus it does not need to be set (its value is ignored). I am pointing this out because in #408 you mentioned several times "setting both resid and v". I'm not sure if we are talking about the same v.

But anyway, this is tangential to the main issue in #401 (comment)

@fghoussen
Copy link
Collaborator

The ARPACK manual (e.g. here),
The original ARPACK site is no longer accessible, unfortunately.

Yes, I noticed also that the original site is no longer accessible.
Adding the manual (that may also no more be accessible some day...) to the repo sounds safe: #409

I see that arpackmm

I had to compute eigen values/vectors on difficult matrices where ido was not always 1 or -1 and iparam was not always 1 which is the default-most-often case (Y=AX) in RCI. I had to dig the doc to handle all other cases to get at last arpack to converge (= problems were never due to arpack but always to misuses of it). I needed also to compare performance impacts w.r.t. different solvers (CG, LU, ...) in RCI. In the end, I couldn't rely on scipy / octave / matlab / scilab / ... which seems to handle mainly the default-most-often case and doesn't always allow fine control (solvers): I ended up with something which is now here as "arpackmm" (heavily tested on many different cases https://github.com/opencollab/arpack-ng/blob/master/EXAMPLES/MATRIX_MARKET/arpackmm.sh).

From this experience, I've seen lots of code samples out there (in research labs and enterprises) miusing arpack and / or doing just was is necessary to get a result for some given cases (= some kind of matrices only = the kind they need depending on their application / field). arpack is difficult to use because the doc is cumbersome as is the language (F77)... In this sense, I believe the most part of problems using arpack do not come from arpack, but, do come from misuses (after reading ugly doc that nobody can understand): this is what I tried to say in above comments.

Now, I'll try to answer (what I understand from) your questions. Keep in mind that I can misuse arpack myself and that I am aware of that: this is why I tested arpackmm on the maximum number of cases (improving the way I had to use arpack each time). Put another way, I may be wrong answering your questions...

Regarding ce2e69a

AFAIR, ce2e69a fixes both #332, #371 and MKL related problems.

?larnv needs iseed to respect some constraints https://netlib.org/lapack/explore-html/df/dd1/group___o_t_h_e_rauxiliary_ga77e05a87ced667cbdb502aa87c72d056.html. Before ce2e69a, as the cumbersome F77 never really makes clear how the obscure data F77 keyword works, nobody really knows if inits was initialized where / how or not-at-all: in cases where, indeed, inits was not initialized the iseed was whatever what was there in memory and may randomly not respect ?larnv constraints such that LAPACK fired errors. ce2e69a make sur LAPACK is used correctly in all cases (code review and git grep inits [no more use of BLACS - only MPI dir/src are used] will convince you that inits seems to be useless and not used anywhere else). This change is local (inits is a local variable) and AFAIU has no impact out of *getv0.f.

re-seed the RNG with the same seed on every call. Are you sure that this is fine?

I would say yes. Whatever inits was true or false, you use the same seeds as data is supposed to save both inits and iseed: arpack uses seeds initialized or saved in data. Sounds OK to me: we need different random vector but do not care about the seeds. I guess RGN is supposed to handle some entropy that gives different answers (numbers) even if called repeatedly with same seed (not sure how this works but sounds possible to me). Also here, the point is to get a v0 initial vector, so I would say *getv0.f is meant to be called once only as the name suggests.

dgetv0 will only generate random numbers when its initv

inits is not initv. Not sure, but I guess inits initialize the seeds (always done - ce2e69a), and that initv initialize the vector if needed with the (always initialized) seeds. My understanding is that always initializing the seed (inits) is no problem if you need to initialize or nor the vector later on (initv): the other way is obviously wrong and is what is fixed by ce2e69a. Not sure.

But what about the value returned in info?

My understanding is that (recall, I may be wrong):

  • At the first iteration info is used, if needed, to pass prior information known by the user to arpack. For instance, for restarting looks like one need to set info = 1 and give your initial vector and residual in v and resid respectively. That's what arpackmm does

    if (restartFromFile) {

  • In later iterations info is a returned code returned by arpack to say if a problem occured or not during the iteration

The ARPACK User's Guide refers to the resid parameter of XYaupd as the "starting vector".

Lehoucq et al are mathematicians. All such people I met are known to be rigorous and precise : this is why they are good at what they do. They call a cat a cat, not a dog. A vector is a (non-zero) vector, not a residual. A residual is a residual (a difference of vectors targetting zero = leftover to converge) not a vector.

My understanding is that (I may be wrong):

  • Residual resid is A*v - lambda*v, not v and must be set in resid.
  • Vector is v initialized with v0 (random or given) and must be set in v.
  • You need to make sure both v and resid are "not-zero enough" at start as AFAIU arpack iterates until resid is not zero:
v = v0
resid = A*v - l*v
while (resid != zero) {
  do stuffs with v
  resid = A*v - l*v
}

Which would echo with #215, where A is zero and the lambda to find turns out to be zero too, such that the initial resid will be zero (= stopping arpack from beginning!) unless you shift (A = A-shift*I) to make resid non null (A has shift on it's diag) at the beginning of the while loop (this case sounds like a corner case to me).

For me, once again, the doc here is to blame: the error "starting vector is zero" from the doc should be actually read like-so "starting residual (associated to the vector) is zero".

I may be wrong.

still get ido == 99 (i.e. indicating completion) and info == -9999

Isn't your ncv too low?
At my side, this is OK

>> for i in {1..100}; do ./arpackmm --A issue401.mtx --mag LA --nbEV 1 --nbCV 5; done

Is there a bug here?

If you do not use ICB (linking to "underscored" symbols), compiler/linker optims/LTO may screw data (incorrect types) when passing from C to F77 or the other way. Did you try to compile with -O0 to prevent optims? Did you printf/write (ido, info, ...) what you passed from C/F77 and what you get in F77/C at both sides of the eupd/aupd interfaces? You may not see what you expect as I did before ICB.

Using ICB, insure compiler/liker know data type and run optims/LTO accordingly (arpackmm uses ICB).

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 6, 2023

Let me be more concise regarding ce2e69a, to avoid misunderstandings, and to make sure we keep the discussion on topic.

There is a piece of code in Xgetv0() that generates a random vector. This commit makes it so that this "random" vector will always be the very same, on each call to Xgetv0() (i.e. not really random). Note that Xgetv0() gets called multiple times while solving an eigenvector problem (through XYaitr()), and after ce2e69a it ends up producing the same "random" vector on each of those calls. This does not seem right to me, @fghoussen @sylvestre

In additions to issues this may cause in the algorithm, it also clearly changes ARPACK's behaviour across repeated calls. Previously, if a first solution attempt would fail (e.g. converge to the wrong eigenvector) due to bad luck in the random choice of resid, a second try would have a chance of succeeding. That is no longer the case.

Please address this.


re-seed the RNG with the same seed on every call. Are you sure that this is fine?

I would say yes. Whatever inits was true or false, you use the same seeds as data is supposed to save both inits and iseed: arpack uses seeds initialized or saved in data.

I do not understand this sentence.

Sounds OK to me: we need different random vector but do not care about the seeds.

The commit we are discussing made it so that the "random" vector will always be exactly the same.

I guess RGN is supposed to handle some entropy that gives different answers (numbers) even if called repeatedly with same seed (not sure how this works but sounds possible to me).

This is not true. If you give the same seed, the output will be the same. Xlarnv() effectively uses iseed as the RNG state.

Also here, the point is to get a v0 initial vector, so I would say *getv0.f is meant to be called once only as the name suggests.

This is not correct. Xgetv0 will get called at most once through XYaup2, but then potentially several more times through XYaitr(). I pointed this out in #401 (comment) and linked to the call site.

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 6, 2023

This is precisely where I see things going wrong:

https://github.com/opencollab/arpack-ng/blob/master/SRC/dsaitr.f#L417

c              %------------------------------------------------%
c              | Give up after several restart attempts.        |
c              | Set INFO to the size of the invariant subspace |
c              | which spans OP and exit.                       |
c              %------------------------------------------------%

dsaitr() tries to call dgetv0() up to 3 times, and if dgetv0() fails each of those times, it gives up.

But of course if dgetv0() failed once, it will fail on all subsequent attempts, as commit ce2e69a eliminated any randomness it had. It is now fully deterministic. Retries are futile.

I don't know how to make this any clearer.

Please fix this in 3.9.1. You simply need to allow the RNG's state to be propagated normally instead of resetting it on each call to dgetv0().

@szhorvat
Copy link
Contributor Author

@fghoussen @sylvestre I am happy to submit a PR that reverts the change in ce2e69a. Will you consider merging it?

@fghoussen
Copy link
Collaborator

Reverting is not a good move to me. As explained:

  • inits is notinitv
  • the intent before / after ce2e69a is the same : use same freshly-initialized or previously-saved seed (Why? Don't know. But that is the original code intention)
  • I remember clearly debugging this: write(*,*) iseed gave uninitialized values and fired errors in lapack. Reverting will trigger regressions.
  • the problem is likely not ce2e69a as independent approach get solve OK

@szhorvat
Copy link
Contributor Author

Once again you are not responding to my comments, and not paying attention to or not reading what I wrote. I cannot continue here until you do this.

@fghoussen
Copy link
Collaborator

Reverting will trigger regressions (possibly unseen by CI).
I could understand you PR that kind of patch (not tested):

>> git diff SRC/dgetv0.f
diff --git a/SRC/dgetv0.f b/SRC/dgetv0.f
index 8be4fa2..d7ef52f 100644
--- a/SRC/dgetv0.f
+++ b/SRC/dgetv0.f
@@ -193,10 +193,10 @@ c     | Initialize the seed of the LAPACK |
 c     | random number generator           |
 c     %-----------------------------------%
 c
-      iseed(1) = 1
-      iseed(2) = 3
-      iseed(3) = 5
-      iseed(4) = 7
+      iseed(1) = random_number(0, 4095) c ?larnv need 0<seed<4095.
+      iseed(2) = random_number(0, 4095) c ?larnv need 0<seed<4095.
+      iseed(3) = random_number(0, 4095) c ?larnv need 0<seed<4095.
+      iseed(4) = random_number(0, 4095) c ?larnv need 0<seed<4095.
 c
       if (ido .eq.  0) then
 c

... But once again, ce2e69a can not be responsible for your problem: it's elsewhere.

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Mar 29, 2023
szhorvat added a commit to szhorvat/arpack-ng that referenced this issue Apr 9, 2023
 - fixes opencollab#401, opencollab#410, opencollab#411
 - restores 'inits' variable removed in ce2e69a, ensuring that the RNG state is propagated
 - reverts e0d6705 to ensure that seed is different on each parallel thread
 - updates seed initialization of parallel pdgetv0/psgetv0 so that they match that of pzgetv0/pcgetv0
fghoussen pushed a commit to szhorvat/arpack-ng that referenced this issue Apr 10, 2023
 - fixes opencollab#401, opencollab#410, opencollab#411
 - restores 'inits' variable removed in ce2e69a, ensuring that the RNG state is propagated
 - reverts e0d6705 to ensure that seed is different on each parallel thread
 - updates seed initialization of parallel pdgetv0/psgetv0 so that they match that of pzgetv0/pcgetv0
@fghoussen
Copy link
Collaborator

Fixed by #423

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants