gpCoordinates_sparsify: does not exist #550 #4

Open
MES-physics opened this issue Feb 6, 2023 · 18 comments

Comments

@MES-physics

(Continued from QUIP Issue gpCoordinates_sparsify: does not exist #550)

I tried again, removing radial_decay= and setting
alpha_max={{9}} instead, for both serial and MPI runs.
The same error occurs, but only in the MPI run:
SYSTEM ABORT: proc=0 Traceback (most recent call last)
File "gp_fit.f95", line 193 kind unspecified
gpCoordinates_sparsify: does not exist
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
srun: error: amd081: task 1: Exited with exit code 1

@mcaroba
Collaborator

mcaroba commented Feb 6, 2023

What happens if you try to fit a GAP in the "usual" one-pass way (that is, without a first pass to get the sparse set when running the MPI version)? Does it succeed or do you also get an error? If it succeeds (that is, reaches the end of program execution), do you see weird things in the XML files, like NaNs?
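One way to look for NaNs in the output XML files is a plain text scan. The following is a minimal sketch, not part of QUIP or gap_fit: the function name find_bad_numbers is made up for illustration, and it assumes bad values are written as literal NaN or Infinity tokens, which is an assumption about how the Fortran runtime prints them.

```python
import re
import sys

# Assumption: bad values appear as standalone NaN / Inf / Infinity tokens,
# optionally signed, in any letter case.
_BAD_NUMBER = re.compile(
    r"(?<![A-Za-z0-9_])[+-]?(nan|inf|infinity)(?![A-Za-z0-9_])",
    re.IGNORECASE,
)


def find_bad_numbers(path):
    """Return (line_number, line_text) pairs for lines with NaN or Inf tokens."""
    hits = []
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if _BAD_NUMBER.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python check_nan.py potential.xml [more files ...]
    for name in sys.argv[1:]:
        for number, text in find_bad_numbers(name):
            print(f"{name}:{number}: {text[:80]}")
```

The scan is purely textual, so it can report false positives (for example a label that happens to be the word "inf"); any hit still needs to be checked by eye.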

@MES-physics
Author

I will try this in a bit, thanks.

@MES-physics
Author

It didn't work, gave this error:

Abort(1090575) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143): 
MPID_Init(1221)......: 
MPIR_pmi_init(112)...: 
(unknown)(): Other MPI error
Attempting to use an MPI routine before initializing MPICH
Abort(1090575) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143): 
MPID_Init(1221)......: 
MPIR_pmi_init(112)...: 

@mcaroba
Collaborator

mcaroba commented Feb 8, 2023

This seems to be an MPI problem. Have you tried running the serial (or OpenMP, in any case non-MPI) version of gap_fit? We need to figure out if this is a soap_turbo problem or a problem of how soap_turbo works with the MPI implementation of gap_fit.

@MES-physics
Author

When I ran soap_turbo in serial with sparsification only,
that part worked. I haven't tried a serial run after that step. I'll try.

@MES-physics
Author

Now I have the soap_turbo serial version started from the beginning, still running since last night with no errors so far. I think it will take a couple days.
But I wanted to know, what is the speed advantage of soap_turbo over regular soap?

@mcaroba
Collaborator

mcaroba commented Feb 9, 2023

soap_turbo used to be significantly faster, but soap has seen some speed improvements too in the last couple of years. In general, you should always use compression with soap_turbo; the easiest and safest way to do this is to add compress_mode="trivial" to your gap_fit command. This will make things about 5 times faster (training also becomes a lot faster).

@MES-physics
Author

Oh, I did try that compress_mode="trivial" last time, but the program stopped, saying command not recognized or something. I'll have to try again in serial with that.

@MES-physics
Author

Now the soap_turbo worked with the serial run, and produced a decent-looking .xml file. compress_mode="trivial" worked too. I must have misplaced it initially.

@mcaroba
Collaborator

mcaroba commented Feb 11, 2023

Thanks for checking. Can you then confirm that this issue only happens when you are running the MPI version of the code? Or does it go away after you fixed the compression tag?

@MES-physics
Author

MES-physics commented Feb 11, 2023 via email

@MES-physics
Author

Now I tried. compress_mode=trivial was included in this MPI trial.
It seems that I cannot run soap_turbo parameters with MPI.
It gives this unknown key error for sparse_file, even if I put the command line in a config_file as @albapa said.
In addition, the soap_turbo potential from the serial run will not run in LAMMPS, but I got the regular soap potential to work in LAMMPS. These all use the Carbon_GAP_20 training file with the general carbon GAP command parameters. Thanks so much.

libAtoms::Hello World: 2023-02-23 13:23:08
libAtoms::Hello World: git version  https://github.com/libAtoms/QUIP.git,v0.9.12-14-gb55e89d55-dirty
libAtoms::Hello World: QUIP_ARCH    linux_x86_64_gfortran_openmpi+openmp
libAtoms::Hello World: compiled on  Feb 20 2023 at 22:53:50
libAtoms::Hello World: MPI parallelisation with 32 processes
libAtoms::Hello World: OpenMP parallelisation with 16 threads
libAtoms::Hello World: OMP_STACKSIZE=1G
libAtoms::Hello World: MPI run with the same seed on each process
libAtoms::Hello World: Random Seed = 1542028897
libAtoms::Hello World: global verbosity = 0

Calls to system_timer will do nothing by default

MPI hostnames :: amd082.orc.gmu.edu
MPI host refs :: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
MPI my_host  : 0
MPI hostname : amd082.orc.gmu.edu
param_read_line: unknown key sparse_file

@mcaroba
Collaborator

mcaroba commented Feb 24, 2023

Let's think a bit about it. It's weird that you can't use soap_turbo in LAMMPS, even with the serial version of the code. We use soap_turbo GAPs routinely with LAMMPS. I wonder if there is something less obvious going on here. @albapa, is the sparse_file option a soap-specific feature or should it also work with other descriptors?

@MES-physics
Author

Hang on, I think @albapa gave me the answer just now: sparse_file is not within the {}. I’ll try again.

@MES-physics
Author

OK, sorry for the confusion. In my soap_turbo script, sparse_file=3.input comes after the closing } of the soap_turbo block! In the regular soap script, it is correctly inside the braces.
I'll try soap_turbo again and report back.
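A misplaced key like this can be caught before launching a long fit by checking where it sits relative to the braces. The sketch below is not a QUIP tool: split_gap_descriptors and locate_key are hypothetical helper names, and the code assumes the command contains a literal gap={...} block whose descriptors are separated by " : ". The command string in the usage part is abbreviated for illustration and is not a complete gap_fit command.

```python
def split_gap_descriptors(command):
    """Split the outermost gap={...} block into descriptor segments.

    Returns (segments, tail), where tail is the text after the closing brace.
    Assumption: descriptors are separated by ':' at the top brace level.
    """
    start = command.find("gap={")
    if start < 0:
        raise ValueError("no gap={...} block found")
    depth = 0
    segments, current = [], []
    for i in range(start + len("gap="), len(command)):
        ch = command[i]
        if ch == "{":
            depth += 1
            if depth == 1:
                continue  # skip the outer opening brace itself
        elif ch == "}":
            depth -= 1
            if depth == 0:
                segments.append("".join(current).strip())
                return segments, command[i + 1:]
        elif ch == ":" and depth == 1:
            segments.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    raise ValueError("unbalanced braces in gap={...} block")


def locate_key(command, key):
    """Return (segment indices containing key, True if key also appears outside)."""
    segments, tail = split_gap_descriptors(command)
    head = command[: command.find("gap={")]
    prefix = key + "="
    inside = [
        i for i, seg in enumerate(segments)
        if any(tok.startswith(prefix) for tok in seg.split())
    ]
    outside = any(tok.startswith(prefix) for tok in (head + " " + tail).split())
    return inside, outside


if __name__ == "__main__":
    # Illustrative string only: sparse_file ends up after the closing brace.
    cmd = "gap={soap_turbo alpha_max={{9}} compress_mode=trivial} sparse_file=file3.input"
    inside, outside = locate_key(cmd, "sparse_file")
    if outside:
        print("sparse_file appears outside the gap={...} block")
```

The same check works for any per-descriptor key, such as compress_mode. A ':' inside a value at the top brace level would confuse this simple splitter.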

@MES-physics
Author

Uh-oh, I corrected the above problem, but now I get this error. Maybe I'm missing something else?

SYSTEM ABORT: proc=2 Traceback (most recent call last)
File "/opt/sw/other/apps/linux-rhel8-x86_64_v2/gcc-10.3.0-openmpi-4.1.2/quip/0.9.12/src/libAtoms/System.f9", line 1471 kind unspecified
string_to_real: could not convert, iostat=5010 

@gabor1

gabor1 commented Feb 24, 2023

Oh dear, it's possible that the parser is first trying to make a number out of it. Try naming your files file3.input, file4.input, etc.
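To see which arguments such a number-first parser could trip on, one can list the key=value pairs whose value starts like a number but is not a valid real. This is only a sketch of that hypothesis, not of how the QUIP parser actually works; the helper names are made up, and treating a Fortran-style d exponent as valid is an assumption.

```python
def _is_real(text):
    """True if text parses as a real number, accepting Fortran-style 1.0d0 exponents."""
    try:
        float(text.lower().replace("d", "e"))
        return True
    except ValueError:
        return False


def number_like_but_not_real(command):
    """Return (key, value) pairs whose value starts like a number but is not one."""
    suspects = []
    # Braces are treated as whitespace so nested values such as {{9}} are skipped.
    for token in command.replace("{", " ").replace("}", " ").split():
        key, sep, value = token.partition("=")
        if not sep or not value:
            continue
        if (value[0].isdigit() or value[0] in "+-.") and not _is_real(value):
            suspects.append((key, value))
    return suspects


if __name__ == "__main__":
    # Illustrative fragment only, using keys mentioned in this thread.
    fragment = "soap_turbo alpha_max={{9}} compress_mode=trivial sparse_file=3.input"
    for key, value in number_like_but_not_real(fragment):
        print(f"{key}={value} starts like a number but is not a valid real")
```

A file name such as file3.input is not flagged, because it does not begin with a digit, sign, or decimal point.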

@MES-physics
Author

I now changed to those file names and the same exact error happens, darn.
