
Simulated annealing calculation error using pair-allegro #40

Open
walker9564 opened this issue May 4, 2024 · 0 comments


walker9564 commented May 4, 2024

OS: CentOS Linux release 7.9.2009 (Core)
Compiler: GCC 13.2.0
CPU: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
NUMA node(s): 2
PyTorch: 1.12.0
LAMMPS version: 2021.09 release
MPI: Intel Parallel Studio XE 2019

When I executed the simulated annealing algorithm on small clusters, I got the following error.

LAMMPS (29 Sep 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
units metal
atom_style atomic
boundary p p p

newton on

read_data in.data
Reading data file ...
orthogonal box = (0.0000000 0.0000000 0.0000000) to (20.000000 20.000000 20.000000)
1 by 1 by 1 MPI processor grid
reading atoms ...
12 atoms
read_data CPU = 0.003 seconds
#read_restart file.restart.100000

pair_style allegro
pair_coeff * * fe-total.pth Fe

timestep 0.001 # ps

thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
fix s1 all nvt temp 0.01 1000 0.10000000000000000555
run 30000
Neighbor list info ...
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 5 5 5
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair allegro, perpetual
attributes: full, newton on, ghost
pair build: full/bin/ghost
stencil: full/ghost/bin/3d
bin: standard
Per MPI rank memory allocation (min/avg/max) = 4.315 | 4.315 | 4.315 Mbytes
Step Dt Time Temp KinEng PotEng TotEng Press Volume
0 0.001 0 0 0 -77.797695 -77.797695 0 8000
.......
.......
.......
470920 0.001 470.92 676.16539 0.9614136 -83.998843 -83.03743 128.36286 8000
470940 0.001 470.94 668.32156 0.95026076 -83.998562 -83.048301 126.87379 8000
470960 0.001 470.96 676.39779 0.96174404 -83.99844 -83.036696 128.40698 8000

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18750 RUNNING AT node02
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764

The input file content is as follows.
units metal
atom_style atomic
boundary p p p
newton on
read_data in.data
#read_restart file.restart.100000

pair_style allegro
pair_coeff * * fe-total.pth Fe

timestep 0.001 # ps
thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart
fix s1 all nvt temp 0.01 1000 $(100.0*dt)
run 30000
unfix s1
fix s2 all nvt temp 1000 1000 $(100.0*dt)
run 100000
unfix s2
fix s3 all nvt temp 1000 50 $(100.0*dt)
run 6000000
unfix s3
write_data out.data

The run did not complete: the schedule requires 6,130,000 timesteps in total (30,000 + 100,000 + 6,000,000), but the job dies at around step 470,000, and then the error message above appears.
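As a stopgap while debugging, since the script already writes a snapshot every 100,000 steps (restart 100000 file.restart), the last restart file could be used to resume instead of rerunning from step 0. A minimal, untested sketch, assuming the most recent snapshot on disk is named file.restart.400000 and that pair_style allegro must be re-declared after read_restart (machine-learning pair styles generally do not store their model in restart files):

read_restart file.restart.400000 # substitute the last snapshot actually written

# assumption: the ML model is not stored in the restart file,
# so the pair style and coefficients must be re-specified
pair_style allegro
pair_coeff * * fe-total.pth Fe

timestep 0.001 # ps
thermo_style custom step dt time temp ke pe etotal press vol
thermo 20
dump 1 all custom 200 dump.lammpstrj id type x y z
restart 100000 file.restart

# "run N upto" continues to absolute timestep N, so only the remaining part
# of the schedule is executed; note that re-issuing the fix makes the
# 1000 -> 50 K ramp start over from 1000 K rather than from where it stopped
fix s3 all nvt temp 1000 50 $(100.0*dt)
run 6130000 upto
write_data out.data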
So I tried to use GDB to analyze the error, but I am not very familiar with this kind of debugging.
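For reference, this is roughly how the backtrace below was obtained (a sketch on a single MPI rank; the executable name lmp and input file in.anneal are placeholders for my actual file names):

gdb --args ./lmp -in in.anneal
(gdb) run
... wait for the SIGSEGV ...
(gdb) where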

The analysis results are as follows.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00007fffe0ff25ad in torch::jit::InterpreterStateImpl::callstack() const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#2  0x00007fffe0ff3e8e in torch::jit::InterpreterStateImpl::handleError(std::exception const&, bool, c10::NotImplementedError*, c10::optional<std::string>) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007fffe1000fd0 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#4  0x00007fffe0fee44f in torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#5  0x00007fffe0fe167a in torch::jit::GraphExecutorImplBase::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#6  0x00007fffe0c90ade in torch::jit::Method::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >, std::unordered_map<std::string, c10::IValue, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, c10::IValue> > > const&) const () from /opt/software/python3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so
#7  0x00000000006f3496 in torch::jit::Module::forward (this=this@entry=0x2c83a38, inputs=..., kwargs=...) at /opt/software/python3/lib/python3.7/site-packages/torch/include/torch/csrc/jit/api/module.h:114
#8  0x00000000006ef443 in LAMMPS_NS::PairAllegro::compute (this=0x2c836c0, eflag=<optimized out>, vflag=<optimized out>) at /opt/source/lammps-stable_29Sep2021/src/pair_allegro.cpp:426
#9  0x00000000005379fb in LAMMPS_NS::Verlet::run (this=0x2c82c60, n=6000000) at /opt/source/lammps-stable_29Sep2021/src/verlet.cpp:312
#10 0x00000000004f291b in LAMMPS_NS::Run::command (this=<optimized out>, narg=<optimized out>, arg=<optimized out>) at /opt/source/lammps-stable_29Sep2021/src/run.cpp:180
#11 0x0000000000448614 in LAMMPS_NS::Input::execute_command (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:794
#12 0x0000000000448c2c in LAMMPS_NS::Input::file (this=0x2c68cd0) at /opt/source/lammps-stable_29Sep2021/src/input.cpp:273
#13 0x00000000004235a8 in main (argc=<optimized out>, argv=<optimized out>) at /opt/source/lammps-stable_29Sep2021/src/main.cpp:98

I noticed that it mentions a segmentation fault, but I am not sure how to solve this problem. I hope you can provide me with some help. Thanks!
