Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial vector generation is broken in 3.9.0, causing rare failures #410

Closed
szhorvat opened this issue Mar 11, 2023 · 14 comments
Closed

Initial vector generation is broken in 3.9.0, causing rare failures #410

szhorvat opened this issue Mar 11, 2023 · 14 comments

Comments

@szhorvat
Copy link
Contributor

szhorvat commented Mar 11, 2023

In ARPACK, the Xgetv0() function function generates a random initial vector for a numerical iteration process. With low probability, it may occur that this random vector won't be suitable for the numerical algorithm. In this case, ARPACK re-generates it, up to three times, until a suitable one is found. This happens here:

https://github.com/opencollab/arpack-ng/blob/master/SRC/dsaitr.f#L409

Commit ce2e69a breaks this process by making Xgetv0() deterministic. The only change the commit makes is to re-seed the random number generator with the very same fixed seed at the beginning of Xgetv0(). Therefore, if with this fixed seed Xgetv0() happens to produce an unsuitable "random" vector, all subsequent retries will obviously fail in the same way.

The fact that the above-linked code performs random re-tries until success makes it obvious that ce2e69a (which removes the randomness) is incorrect.


I explained this already in detail in #401. This issue is to make the problem plain to see without having to go through the long discussion there. The description is kept intentionally simple.


While the failure occurs with low probability, and it is dependent on the LAPACK implementation, it does occur in practice. It is currently causing the igraph test suite to fail with ARPACK 3.9.0 on several platforms (see e.g. with msys2). I confirmed that the failure mode is precisely the one described above, by adding print statements to follow the code path, and verify that inside Xgetv0() the very same "random" vector keeps getting generated again and again. A direct revert of ce2e69a eliminates the problem.

@fghoussen
Copy link
Collaborator

Would be good, if possible, to have reproducible / deterministic data to reproduce the problem.

@szhorvat
Copy link
Contributor Author

Can you address the obvious logic flaw that I described above?

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 12, 2023

This issue should be obvious from following the logic of the code, as in the top comment. It would be good if you could clarify which part of the description needs clarification, and what sort of reproducer will provide this.

As I said above, part of ARPACK re-tries a certain operation with several random vectors, until it succeeds. In the vast majority of cases, the very first try will succeed. I am assuming that you want to see a reproducer that demonstrates that the first try may fail. I provide this below. It is based on the igraph test case failure described in #401. Note that this only works as intended when linking against certain BLAS/LAPACK on certain platforms, see here.

https://gist.github.com/szhorvat/f1d06fa8e58901d17cd3e06eac8891f5

After ce2e69a, DSAUPD() fails with INFO = -9999 in this program. You will see output similar to the following:

Trial 0:
Finished.
INFO = 0

Trial 1:
Finished.
INFO = 0

Trial 2:
Finished.
INFO = 0

Trial 3:
Finished.
INFO = 0

Trial 4:
Finished.
INFO = 0

Trial 5:
Finished.
INFO = 0

Trial 6:
Finished.
INFO = -9999

I suggest that you add print statements to the Fortran code in ARPACK, to see for yourself that the problem indeed occurs as I described.

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 12, 2023

For example, with these added print statements:

diff --git a/SRC/dgetv0.f b/SRC/dgetv0.f
index 8be4fa2..f3d473d 100644
--- a/SRC/dgetv0.f
+++ b/SRC/dgetv0.f
@@ -225,6 +225,8 @@ c
          if (.not.initv) then
             idist = 2
             call dlarnv (idist, iseed, n, resid)
+            print *, "generating random resid in dgetv0(), resid is now"
+            print *, resid
          end if
 c
 c        %----------------------------------------------------------%
diff --git a/SRC/dsaitr.f b/SRC/dsaitr.f
index 3460d99..3902a14 100644
--- a/SRC/dsaitr.f
+++ b/SRC/dsaitr.f
@@ -406,6 +406,7 @@ c           | If in reverse communication mode and |
 c           | RSTART = .true. flow returns here.   |
 c           %--------------------------------------%
 c
+            print *, "calling dgetv0() from dsaitr(), itry = ", itry
             call dgetv0 (ido, bmat, itry, .false., n, j, v, ldv,
      &                   resid, rnorm, ipntr, workd, ierr)
             if (ido .ne. 99) go to 9000
diff --git a/SRC/dsaup2.f b/SRC/dsaup2.f
index fd4143f..42f5efb 100644
--- a/SRC/dsaup2.f
+++ b/SRC/dsaup2.f
@@ -324,6 +324,7 @@ c
    10 continue
 c
       if (getv0) then
+         print *, "calling dgetv0() from dsaup2"
          call dgetv0 (ido, bmat, 1, initv, n, 1, v, ldv, resid, rnorm,
      &                ipntr, workd, info)
 c

Output:

Trial 6:
 calling dgetv0() from dsaup2
 calling dgetv0() from dsaup2
 calling dgetv0() from dsaitr(), itry =            1
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
 calling dgetv0() from dsaitr(), itry =            1
 calling dgetv0() from dsaitr(), itry =            2
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
 calling dgetv0() from dsaitr(), itry =            1
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
 calling dgetv0() from dsaitr(), itry =            1
 calling dgetv0() from dsaitr(), itry =            2
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
 calling dgetv0() from dsaitr(), itry =            1
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
 calling dgetv0() from dsaitr(), itry =            1
 calling dgetv0() from dsaitr(), itry =            2
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
 calling dgetv0() from dsaitr(), itry =            3
 generating random resid in dgetv0(), resid is now
  0.39574246391875789        8.6496039750016962E-004 -0.92272057899825910      -0.91656714952780050       0.11759638488413060      -0.29962625203712179       0.90382695702586346      -0.25045104802183715       0.33224741301423677      -0.29023922021963955     
Finished.
INFO = -9999

Notice how after itry > 1, it generates a new "random" vector, but that vector is always the same due to ce2e69a. Thus if a failure occurs with the new vector, the failure will continue to occur in subsequent tries as well.

@fghoussen
Copy link
Collaborator

@szhorvat: can you PR some patch doing #401 (comment) with for instance https://cyber.dabamos.de/programming/modernfortran/random-numbers.html (or any other way you like to change seed according to bounds mentioned in dlarnv doc: this is what ce2e69a fixes)? If seed change, v0 will change, right?

@szhorvat
Copy link
Contributor Author

dlarnv() (or rather dlaruv(), which dlarnv() calls internally) appears to use iseed not merely as a seed, but as the complete 48-bit state of its random number generation algorithm.

The usage pattern is like this: One puts in an iseed on the first call, then dlarnv() outputs a vector of random numbers and updates iseed. After this, the resulting iseed should be passed back at the next call of dlarnv().

There are two documented requirements for iseed in dlarnv():

  • All elements must be in [0, 4095]. That's because iseed really represents 48 bits, stored as four 12-bit values.
  • The last element, iseed(4), must be odd.

I don't see how these requirements could have been violated. On the first call, they are clearly satisfied: 1, 3, 5, 7 are fine (and the more complicated parallel case was also fine at the time of ce2e69a, as far as I can tell). In fact, ce2e69a did not change the initial values used in iseed.

On subsequent calls, iseed is no longer the responsibility of ARPACK. One is supposed to use whatever the previous dlarnv() call returned. If that violates the requirements, that's a LAPACK bug, not an ARPACK issue. But such a violation sounds very unlikely. Nevertheless, I printed out the returned iseed values on my machine, and I did not see any problem.

Do you have an example of dlarnv() returning an invalid iseed with some LAPACK implementation? If so, which one, and can you show an example program that reproduces the invalid iseed?

@fghoussen
Copy link
Collaborator

One puts in an iseed on the first call, then dlarnv() outputs a vector of random numbers and updates iseed.

So the intent of the original code and ce2e69a are both correct.

Do you have an example of dlarnv() returning an invalid iseed with some LAPACK implementation?

Don't really remember. Long time ago: try MKL with (or without?) ILP64.
AFAIR iseed is uninitialized as the cumbersome data keyword did not initialize inits

data inits /.true./

@szhorvat
Copy link
Contributor Author

szhorvat commented Mar 14, 2023

One puts in an iseed on the first call, then dlarnv() outputs a vector of random numbers and updates iseed.

So the intent of the original code and ce2e69a are both correct.

No, because after ce2e69a, you never pass back the state to dlarnv(). Instead, you always start from the same fixed state, thus dlarnv() returns the same "random" numbers every time.

Don't really remember. Long time ago: try MKL with (or without?) ILP64.

Unfortunately I lost access to the system where I could test with the MKL.

If we don't have an example of the problem that ce2e69a was supposed to fix, we can't move forward.


AFAIR iseed is uninitialized as the cumbersome data keyword did not initialize inits

Can you expand on this? I do not program in Fortran, so I searched for data, and all documentation I could find stated that data will set the initial value of variables (example). Then I tried the following program using GNU Fortran 12, and it gives the expected output:

      program test

      integer foo
      logical q, r

      data foo /137/
      data q /.true./
      data r /.false./

      write (*,*) 'foo = ', foo
      write (*,*) 'q = ', q, 'r = ', r

      stop
      end

Output:

 foo =          137
 q =  T r =  F

Once again, there needs to be a clear understanding of the problem that ce2e69a was intending to fix. Otherwise I suggest directly reverting it, as it is clearly causing problems.

@fghoussen
Copy link
Collaborator

Did you try #401 (comment)?

Instead, you always start from the same fixed state

OK

If we don't have an example of the problem

Look at issues attached to the ce2e69a PR. Try

./configure --enable-icb --with-blas=mkl_gf_ilp64 --with-lapack=mkl_gf_ilp64
.

Can you expand on this?

No: not home + low connection / email access at the time.

clear understanding of the problem

inits was not initialized (guess), so iseed wasn't neither (100% sure): get random iseed out of dlarnv bounds

@fghoussen
Copy link
Collaborator

@szhorvat: can you PR a revert?

If we don't have an example of the problem

Analyzing history, the problem appeared in this specific case: parpack + ILP64 + MKL. But, it's now out of reach #368.

@szhorvat
Copy link
Contributor Author

I can do it the week after next

@fghoussen
Copy link
Collaborator

I can do it the week after next

Sure! No hurry :)

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Mar 29, 2023
szhorvat added a commit to szhorvat/arpack-ng that referenced this issue Apr 9, 2023
 - fixes opencollab#401, opencollab#410, opencollab#411
 - restores 'inits' variable removed in ce2e69a, ensuring that the RNG state is propagated
 - reverts e0d6705 to ensure that seed is different on each parallel thread
 - updates seed initialization of parallel pdgetv0/psgetv0 so that they match that of pzgetv0/pcgetv0
fghoussen pushed a commit to szhorvat/arpack-ng that referenced this issue Apr 10, 2023
 - fixes opencollab#401, opencollab#410, opencollab#411
 - restores 'inits' variable removed in ce2e69a, ensuring that the RNG state is propagated
 - reverts e0d6705 to ensure that seed is different on each parallel thread
 - updates seed initialization of parallel pdgetv0/psgetv0 so that they match that of pzgetv0/pcgetv0
szhorvat added a commit to igraph/igraph that referenced this issue Jul 15, 2023
@fghoussen
Copy link
Collaborator

Fixed by #423

@szhorvat
Copy link
Contributor Author

@fghoussen Thanks! I'm looking forward to the release of 3.9.1.

szhorvat added a commit to igraph/igraph that referenced this issue Oct 17, 2023
vtraag pushed a commit to vtraag/igraph that referenced this issue Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants