Initial vector generation is broken in 3.9.0, causing rare failures #410
Comments
It would be good, if possible, to have reproducible, deterministic data to reproduce the problem.
Can you address the obvious logic flaw that I described above?
This issue should be obvious from following the logic of the code, as in the top comment. It would be good if you could say which part of the description needs clarification, and what sort of reproducer would provide it.
As I said above, part of ARPACK re-tries a certain operation with several random vectors, until it succeeds. In the vast majority of cases, the very first try will succeed. I am assuming that you want to see a reproducer that demonstrates that the first try may fail. I provide this below. It is based on the igraph test case failure described in #401. Note that this only works as intended when linking against certain BLAS/LAPACK on certain platforms, see here.
https://gist.github.com/szhorvat/f1d06fa8e58901d17cd3e06eac8891f5
After ce2e69a, DSAUPD() fails with
I suggest that you add
For example, with these added
diff --git a/SRC/dgetv0.f b/SRC/dgetv0.f
index 8be4fa2..f3d473d 100644
--- a/SRC/dgetv0.f
+++ b/SRC/dgetv0.f
@@ -225,6 +225,8 @@ c
       if (.not.initv) then
          idist = 2
          call dlarnv (idist, iseed, n, resid)
+         print *, "generating random resid in dgetv0(), resid is now"
+         print *, resid
       end if
 c
 c     %----------------------------------------------------------%
diff --git a/SRC/dsaitr.f b/SRC/dsaitr.f
index 3460d99..3902a14 100644
--- a/SRC/dsaitr.f
+++ b/SRC/dsaitr.f
@@ -406,6 +406,7 @@ c | If in reverse communication mode and |
 c        | RSTART = .true. flow returns here.   |
 c        %--------------------------------------%
 c
+         print *, "calling dgetv0() from dsaitr(), itry = ", itry
          call dgetv0 (ido, bmat, itry, .false., n, j, v, ldv,
      &                resid, rnorm, ipntr, workd, ierr)
          if (ido .ne. 99) go to 9000
diff --git a/SRC/dsaup2.f b/SRC/dsaup2.f
index fd4143f..42f5efb 100644
--- a/SRC/dsaup2.f
+++ b/SRC/dsaup2.f
@@ -324,6 +324,7 @@ c
  10   continue
 c
       if (getv0) then
+         print *, "calling dgetv0() from dsaup2"
          call dgetv0 (ido, bmat, 1, initv, n, 1, v, ldv, resid, rnorm,
      &                ipntr, workd, info)
 c
Output:
Notice how after
@szhorvat: can you open a PR with a patch doing #401 (comment), for instance using https://cyber.dabamos.de/programming/modernfortran/random-numbers.html (or any other way you like to change the seed within the bounds mentioned in the dlarnv doc: this is what ce2e69a fixes)? If the seed changes, v0 will change, right?
The usage pattern is like this: One puts in an There are two documented requirements for
I don't see how these requirements could have been violated. On the first call, they are clearly satisfied: On subsequent calls, Do you have an example of
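For reference, the LAPACK documentation for dlarnv states that each element of iseed must lie between 0 and 4095 and that iseed(4) must be odd. A check along the following lines could help confirm whether a given seed is within bounds. This is only a sketch: seed_is_valid is a hypothetical helper, not part of ARPACK or LAPACK, and both seeds are arbitrary sample values.

```fortran
program check_seed
  implicit none
  integer :: iseed(4)

  iseed = [1, 3, 5, 7]        ! arbitrary sample seed that meets both requirements
  print *, 'seed valid: ', seed_is_valid(iseed)

  iseed = [1, 3, 5, 4096]     ! last element is out of range and even
  print *, 'seed valid: ', seed_is_valid(iseed)

contains

  ! Hypothetical helper: checks the two requirements documented for
  ! the iseed argument of dlarnv.
  logical function seed_is_valid(s)
    integer, intent(in) :: s(4)
    seed_is_valid = all(s >= 0) .and. all(s <= 4095) .and. mod(s(4), 2) == 1
  end function seed_is_valid

end program check_seed
```

The first seed passes the check and the second does not. Calling such a helper just before dlarnv would show directly whether an out-of-bounds seed ever reaches it.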
So the intent of the original code and ce2e69a are both correct.
Don't really remember. Long time ago: try MKL with (or without?) ILP64. Line 190 in f36eb6c
No, because after ce2e69a, you never pass back the state to
Unfortunately I lost access to the system where I could test with the MKL. If we don't have an example of the problem that ce2e69a was supposed to fix, we can't move forward.
Can you expand on this? I do not program in Fortran, so I searched for
      program test
      integer foo
      logical q, r
      data foo /137/
      data q /.true./
      data r /.false./
      write (*,*) 'foo = ', foo
      write (*,*) 'q = ', q, 'r = ', r
      stop
      end
Output:
Once again, there needs to be a clear understanding of the problem that ce2e69a was intending to fix. Otherwise I suggest directly reverting it, as it is clearly causing problems.
Did you try #401 (comment)?
OK
Look at issues attached to the ce2e69a PR. Try arpack-ng/.github/workflows/jobs.yml Line 156 in 396b90a
No: not home + low connection / email access at the time.
inits was not initialized (guess), so iseed wasn't either (100% sure): get random iseed out of dlarnv bounds
I can do it the week after next
Sure! No hurry :)
Bug: opencollab/arpack-ng#401 Bug: opencollab/arpack-ng#410 Bug: opencollab/arpack-ng#411 Bug: igraph/igraph#2311 Signed-off-by: Sam James <sam@gentoo.org>
- fixes opencollab#401, opencollab#410, opencollab#411
- restores 'inits' variable removed in ce2e69a, ensuring that the RNG state is propagated
- reverts e0d6705 to ensure that seed is different on each parallel thread
- updates seed initialization of parallel pdgetv0/psgetv0 so that they match that of pzgetv0/pcgetv0
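The role of the restored 'inits' flag can be illustrated with a small sketch. It assumes nothing about the actual ARPACK source beyond what the commit message says: next_value, state, and the toy linear congruential generator are illustrative stand-ins for Xgetv0(), iseed, and dlarnv. The saved flag makes the seed assignment happen only on the first call, so each later call continues from the state left by the previous one.

```fortran
program saved_state_sketch
  use iso_fortran_env, only: int64
  implicit none
  integer(int64) :: a, b

  a = next_value()
  b = next_value()
  print *, 'first call: ', a
  print *, 'second call:', b
  ! The state carried over, so the second draw differs from the first.
  if (a == b) error stop 'state was not propagated'

contains

  function next_value() result(r)
    integer(int64) :: r
    logical, save :: inits = .true.     ! plays the role of 'inits'
    integer(int64), save :: state       ! plays the role of 'iseed'

    if (inits) then
       state = 12345_int64              ! seed once, on the first call only
       inits = .false.
    end if
    ! Toy LCG step (not dlarnv): advance the saved state and return it.
    state = mod(1103515245_int64*state + 12345_int64, 2147483648_int64)
    r = state
  end function next_value

end program saved_state_sketch
```

Removing the flag and assigning the seed unconditionally would make every call start from the same state, which is the behaviour reported in this issue.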
Fixed by #423
@fghoussen Thanks! I'm looking forward to the release of 3.9.1.
…llab/arpack-ng#410 is fixed" This reverts commit 848cbe9.
In ARPACK, the Xgetv0() function generates a random initial vector for a numerical iteration process. With low probability, it may occur that this random vector won't be suitable for the numerical algorithm. In this case, ARPACK re-generates it, up to three times, until a suitable one is found. This happens here: https://github.com/opencollab/arpack-ng/blob/master/SRC/dsaitr.f#L409
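The retry logic described above can be pictured with a minimal sketch. It is not ARPACK's code: is_suitable, its norm-based criterion, and the use of the intrinsic random_number are assumptions made for illustration. Only the idea of regenerating the vector a limited number of times is taken from the description.

```fortran
program retry_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5, max_tries = 3
  real(dp) :: v(n)
  integer :: itry
  logical :: ok

  ok = .false.
  do itry = 1, max_tries
     call random_number(v)        ! draw a fresh candidate vector in [0, 1)
     v = 2.0_dp*v - 1.0_dp        ! map the entries to [-1, 1)
     ok = is_suitable(v)
     if (ok) exit                 ! stop at the first suitable vector
  end do

  if (ok) then
     print *, 'suitable vector found on try', itry
  else
     print *, 'no suitable vector after', max_tries, 'tries'
  end if

contains

  ! Hypothetical criterion: reject vectors whose norm is negligible.
  logical function is_suitable(x)
    real(dp), intent(in) :: x(:)
    is_suitable = sqrt(sum(x*x)) > 1.0e-8_dp
  end function is_suitable

end program retry_sketch
```

The loop is only useful if each iteration draws a different candidate. If the generator is reset to the same state before every draw, all iterations test the same vector.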
Commit ce2e69a breaks this process by making Xgetv0() deterministic. The only change the commit makes is to re-seed the random number generator with the very same fixed seed at the beginning of Xgetv0(). Therefore, if with this fixed seed Xgetv0() happens to produce an unsuitable "random" vector, all subsequent retries will obviously fail in the same way.
The fact that the above-linked code performs random re-tries until success makes it obvious that ce2e69a (which removes the randomness) is incorrect.
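The failure mode can be reproduced in miniature. The sketch below is not ARPACK code: fill_reseeding and its toy linear congruential generator are hypothetical stand-ins for Xgetv0() and the LAPACK generator. Because the seed is reset at the start of every call, the "retry" yields exactly the same vector as the first try.

```fortran
program reseed_sketch
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 4
  integer(int64) :: v1(n), v2(n)

  call fill_reseeding(v1)     ! first try
  call fill_reseeding(v2)     ! "retry"
  print *, 'try 1:', v1
  print *, 'try 2:', v2
  if (any(v1 /= v2)) error stop 'expected identical vectors'
  print *, 'both tries produced the same vector'

contains

  ! Hypothetical stand-in for Xgetv0() after ce2e69a: the generator
  ! state is reset to the same fixed seed on every call.
  subroutine fill_reseeding(v)
    integer(int64), intent(out) :: v(:)
    integer(int64) :: state
    integer :: i

    state = 12345_int64       ! same fixed seed on every call
    do i = 1, size(v)
       ! Toy LCG step (not dlarnv).
       state = mod(1103515245_int64*state + 12345_int64, 2147483648_int64)
       v(i) = state
    end do
  end subroutine fill_reseeding

end program reseed_sketch
```

If the first vector is unsuitable, every retry built this way is unsuitable too, so the retry loop cannot recover.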
I explained this already in detail in #401. This issue is to make the problem plain to see without having to go through the long discussion there. The description is kept intentionally simple.
While the failure occurs with low probability, and it is dependent on the LAPACK implementation, it does occur in practice. It is currently causing the igraph test suite to fail with ARPACK 3.9.0 on several platforms (see e.g. with msys2). I confirmed that the failure mode is precisely the one described above, by adding print statements to follow the code path, and verifying that inside Xgetv0() the very same "random" vector keeps getting generated again and again. A direct revert of ce2e69a eliminates the problem.
the very same "random" vector keeps getting generated again and again. A direct revert of ce2e69a eliminates the problem.The text was updated successfully, but these errors were encountered: