gap_fit MPI Segmentation fault #636
@Sideboard has added some code that eliminates the need for the two-step process!
Which step fails? The sparsification or the fit?
The fit. Sparsification worked fine.
Yes, I know about the change to one step, but haven't figured out how to use it yet.
The mistake must be in the command line - at least it should print it back. Using the
As far as I know it's as simple as submitting a single MPI job.
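For what it's worth, here is a minimal sketch of what I mean by a single MPI job, assuming a ScaLAPACK-enabled build; the module names, resource numbers, and every gap_fit argument below are placeholders rather than recommended settings:

```bash
#!/bin/bash
#SBATCH --job-name=gap_fit
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64

# Load the same modules that were loaded when gap_fit was compiled
# (placeholder names -- use whatever your cluster provides).
module load gcc openmpi openblas scalapack

# With the one-step code, a single srun call does both the sparsification
# and the fit. All gap_fit arguments here are illustrative placeholders.
srun gap_fit at_file=train.xyz \
    gap={soap cutoff=5.0 l_max=6 n_max=12 atom_sigma=0.5 n_sparse=2000 delta=0.1 \
         covariance_type=dot_product zeta=4 sparse_method=cur_points} \
    default_sigma={0.005 0.2 0.0 0.0} \
    gp_file=gap.xml
```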
It does print the command line back. I'll look further.
Now I tried doing MPI all-in-one, and got the same segmentation fault. Here is a sample from the *.err file, with the command line feedback. I am using the exact command line I used last year with only the datafile change, trying to run on 2 nodes. Thanks for any advice.
Here is the *.out file.
The standard output does not report the parsing of the command line - I still suspect a problem there, otherwise it would stop at a later stage.
Ok, now what I did is go back to a sort of square one, using Deringer's old command line in the 2017 paper, and putting it in a
Here is my run command:
Does the crash generate a core file? If not, is it because your shell is limiting it, and you can turn that limit off? We may be able to figure out where it's crashing at least.
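Concretely, something along these lines; the executable path and core file name are placeholders:

```bash
# Allow core dumps in the shell / job script, before the mpirun or srun line.
ulimit -c unlimited

# After the crash, look for core* files in the working directory and
# pull a backtrace out of one of them.
gdb /path/to/gap_fit core.12345
# then at the (gdb) prompt:
#   bt
```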
The error file makes it look as if it is crashing in the Scalapack initialisation. Could it be an incompatibility between the MPI and the Scalapack?
I don't know how to turn the limit off; I guess I can look it up.
I don't see any other output files than the ones I posted above. Thanks!
That
I used the linux_x86_64_gfortran_openmpi+openmp architecture and
Also in the make config instructions, I did add netcdf-c support.
Did you enable scalapack in "make config"? How does it know where to get your scalapack libraries? Did you add them to the math libraries when you ran "make config"?
Can you upload your
Can you post the output of ldd?
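Something like this is what I'm after; the path is a placeholder (the executable should sit under build/<arch>/ in a standard QUIP build):

```bash
# With the same modules loaded as when QUIP/GAP was compiled:
module list

# Check which shared libraries the gap_fit executable resolves, and
# whether any of them are reported as "not found":
ldd /path/to/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit
ldd /path/to/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit | grep "not found"
```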
Did you enable scalapack in "make config"? YES, I'm sure I did that. ldd output: (OH OH, I see some things not found, but I did put lopenblas and netcdf in the make config questions, which is what my admin told me.)
There should be no way those things could be missing when it actually runs. Did you have all the same modules loaded when you ran ldd? The scalapack must be linked statically, which is unfortunate, since it means you can't tell which one it's using from the ldd output. Unfortunately, there's an infinite number of ways to set up mpi, scalapack, and lapack, and they need to be consistent. If you
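As a rough consistency check (the module names below are placeholders for whatever your site provides):

```bash
# In the environment where the job actually runs, load exactly the modules
# that were loaded at compile time, then re-check the dynamic linkage.
module purge
module load gcc openmpi openblas scalapack netcdf-c   # placeholder module names
module list

# Anything "not found" here will also be missing at run time.
ldd /path/to/gap_fit | grep -E -i "mpi|blas|lapack|netcdf|not found"
```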
OK, sorry, here is the
OK, thanks very much. We installed again with OneAPI and MKL instead. It turned out that the scalapack modules we have do not work on AMD nodes. Making a potential works now.
Dear Developers,
Please tell me what the usual problem is with this. I got the same type of error using both mpirun and srun when trying to start the gap_fit training. Last year I used 4 nodes with 64 ntasks per node, and it was working. I used the two-step process as before: sparsification first, then the MPI run. The input file for the MPI run is attached; a sketch of the submission is below.
Thanks.
This is my MPI program:
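Purely as a sketch of the submission described above (the module names are placeholders, and I am assuming the attached input file is passed to gap_fit via its config_file option):

```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64

module load gcc openmpi openblas scalapack netcdf-c   # placeholder names

# Second step of the two-step process: the MPI fit, run after the
# sparsification has already been done in a separate serial gap_fit run.
srun gap_fit config_file=gap_fit_input.txt
```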