
Segmentation fault in ParGridFunction::ExchangeFaceNbrData() when using Nedelec finite elements #1440

Closed
hongbo-yao opened this issue Apr 25, 2020 · 17 comments · Fixed by #1451

Comments

@hongbo-yao

hongbo-yao commented Apr 25, 2020

Hi,
I want to report a segmentation fault that occurs when calling ParGridFunction::ExchangeFaceNbrData() in ex3p and ex22p.

sphere2.msh is my mesh file. In ex22p, I first set
const char *mesh_file = "sphere2.msh"; int ser_ref_levels = 0; int par_ref_levels = 0;
and call u.real().ExchangeFaceNbrData(); after step 13: a->RecoverFEMSolution(U, b, u);

When I run ex22p with the command mpirun -np 4 ./ex22p -p 0, it works.

When I run ex22p with the command mpirun -np 4 ./ex22p -p 1, a segmentation fault occurs:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 99990 RUNNING AT MacBook-Pro
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

A similar fault occurs in ex3p, which also uses Nedelec elements.

Here are my mesh files (please remove the .txt extension before using them):
sphere2.geo.txt
sphere2.msh.txt

Thanks,
Hongbo

@hongbo-yao hongbo-yao changed the title Bug in ParGridFunction::ExchangeFaceNbrData() Bug in ParGridFunction::ExchangeFaceNbrData() when using Nedelec finite elements Apr 25, 2020
@hongbo-yao hongbo-yao changed the title Bug in ParGridFunction::ExchangeFaceNbrData() when using Nedelec finite elements segmentation fault in ParGridFunction::ExchangeFaceNbrData() when using Nedelec finite elements Apr 25, 2020
@hongbo-yao hongbo-yao changed the title segmentation fault in ParGridFunction::ExchangeFaceNbrData() when using Nedelec finite elements Segmentation fault in ParGridFunction::ExchangeFaceNbrData() when using Nedelec finite elements Apr 25, 2020
@mlstowell
Member

Hello, @hongbo-yao ,
I have been trying to reproduce this problem but I have not seen it so far. I did get bad results from ex3p but it did not crash. I can look into this further but it would be useful to learn more about your observations. Example 22 seems to run fine although it requires over 20GB of memory and it seems to have trouble sending the results to GLVis. I'm curious to know if your system has sufficient memory because such a crash could have resulted from a failure to allocate memory.

Best wishes,
Mark

@hongbo-yao
Author

hongbo-yao commented Apr 26, 2020

Hi @mlstowell,
I have rebuilt MFEM; it is now the latest master version.

I tested ex3p and ex22p again and both still produce the segmentation fault. I don't think it is caused by a failed memory allocation, since I did not refine the mesh in either case and my laptop has 16GB of memory (number of edge unknowns: 125872). Could it be caused by METIS? My version is metis-5.1.0; I will test an older version later.

I will switch to another test computer when I am back at school, but for now I have to wait a while.

In the meantime I will keep investigating to find out exactly what happened.

Thanks,
Best regards,
Hongbo

@hongbo-yao
Author

It is not METIS's fault: metis-4.0.3 also produces the segmentation fault.

@v-dobrev
Member

Do you get the segfault on 1 MPI rank? If so, you can run the program in a debugger; when the segfault happens, the debugger will stop and let you examine the function call stack, e.g. in gdb and lldb you can use the bt command. This can help us narrow down where the issue is.

@hongbo-yao
Author

Hi, with 1 process the test passes; with more than 1 process it fails.

@v-dobrev
Member

OK, you can still run the debugger with two ranks, but it is a little trickier. Try running this command:

mpirun -np 2 xterm -e lldb -- ./ex1p

replacing ./ex1p with the program and arguments you are having the issue with. Then just type the r (i.e. run) command in both xterm windows where lldb is running. At least one of the ranks should hit the segfault. When that happens, type the command bt and post its output here.

@hongbo-yao
Author

hongbo-yao commented Apr 26, 2020

Hi @v-dobrev,
I tried the suggested xterm/lldb command and here are the messages:
mpirun -np 2 xterm -e lldb -- ./ex22p
[proxy:0:0@MacBook-Pro] HYDU_create_process (utils/launch/launch.c:74): execvp error on file xterm (No such file or directory)
[proxy:0:0@MacBook-Pro] HYDU_create_process (utils/launch/launch.c:74): execvp error on file xterm (No such file or directory)

This may be because the debugger tools are not installed; I will install them, test again, and let you know when I make progress.

But I found a confusing phenomenon: metis-4.0.3 passes with 1, 2, and 3 processes but fails with 4, while metis-5.1.0 passes with 1 and 2 processes but fails with 3 and 4. However, earlier both failed with more than one process, so the behavior seems a little random...

It is also affected by the mesh: if I reduce the computational domain in my sphere2.geo file and regenerate the .msh file, the run passes on 4 ranks.

Best,
Hongbo

@v-dobrev
Member

I think xterm comes with XQuartz and it is installed in /opt/X11/bin/xterm.

The debugger, lldb, should come with Xcode or the Xcode command-line tools.

The random behavior may mean memory is being used after it has been freed, or something similar. If that is the case, valgrind can be very helpful in tracking down where this happens.

@hongbo-yao
Author

hongbo-yao commented Apr 26, 2020

Hi @v-dobrev, before showing the output, here are the changes I made in ex22p:

// 2. Parse command-line options.
const char *mesh_file = "sphere2.msh";
int ser_ref_levels = 0;
int par_ref_levels = 0;
int order = 1;
int prob = 1;

and

// 13. Recover the parallel grid function corresponding to U. This is the
//     local finite element solution on each processor.
a->RecoverFEMSolution(U, b, u);

if (myid == 0)
{
   std::cout << "ExchangeFaceNbrData...\n";
}
u.real().ExchangeFaceNbrData();

Following this advice, with the command mpirun -np 2 xterm -e lldb -- ./ex22p -p 1, here are the output messages:
--------proc 0
(lldb) target create "./ex22p"
Current executable set to './ex22p' (x86_64).
(lldb) settings set -- target.run-args "-p" "1"
(lldb) r
Process 1872 launched: '/Users/yaohb/install/mfem/mfem-master/examples2/ex22p' (x86_64)
[cli_0]: write_line error; fd=7 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=7 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(586):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=7 buf=:cmd=abort exitcode=1093647
:
system msg for write_line failure : Bad file descriptor
Process 1872 exited with status = 15 (0x0000000f)
(lldb) bt
error: invalid thread
(lldb)

-------proc1
(lldb) target create "./ex22p"
Current executable set to './ex22p' (x86_64).
(lldb) settings set -- target.run-args "-p" "1"
(lldb) r
Process 1876 launched: '/Users/yaohb/install/mfem/mfem-master/examples2/ex22p' (x86_64)
[cli_1]: write_line error; fd=9 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_1]: Unable to write to PMI_fd
[cli_1]: write_line error; fd=9 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(586):
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_1]: write_line error; fd=9 buf=:cmd=abort exitcode=1093647
:
system msg for write_line failure : Bad file descriptor
Process 1876 exited with status = 15 (0x0000000f)
(lldb) bt
error: invalid thread
(lldb)


It seems that xterm and lldb cannot work together with mpirun here, since the same errors also occur with ex1p:
mpirun -np 2 xterm -e lldb -- ./ex1p
although I should mention that
mpirun -np 2 ./ex1p
works fine.

Thanks,
Hongbo

@v-dobrev
Member

This is strange; I had no issues running mpirun -np 2 xterm -e lldb -- ./ex1p. Both processes ran without problems.

What MPI are you using? Is it MPICH, OpenMPI, or something else? On my Mac I use OpenMPI v2.1.6.

@hongbo-yao
Author

hongbo-yao commented Apr 26, 2020

Hi @v-dobrev, it finally runs. The following is the output of the command:
mpirun -np 4 xterm -e lldb -- ./ex22p


(lldb) target create "./ex22p"
Current executable set to './ex22p' (x86_64).
(lldb) r
Process 15612 launched: '/Users/yaohb/install/mfem-openmpi/mfem-master/examples2/ex22p' (x86_64)
Process 15612 stopped

  * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x11206b1a0)
    frame #0: 0x00000001003ea031 ex22p`mfem::ParGridFunction::ExchangeFaceNbrData() + 865
ex22p`mfem::ParGridFunction::ExchangeFaceNbrData:
->  0x1003ea031 <+865>: movq   (%rbx,%rdi,8), %rdi
    0x1003ea035 <+869>: movq   %rdi, 0x8(%rax,%rsi,8)
    0x1003ea03a <+874>: movslq 0x8(%r12,%rsi,4), %rdi
    0x1003ea03f <+879>: movq   (%rbx,%rdi,8), %rdi
Target 0: (ex22p) stopped.
(lldb) bt
  * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x11206b1a0)
  * frame #0: 0x00000001003ea031 ex22p`mfem::ParGridFunction::ExchangeFaceNbrData() + 865
    frame #1: 0x00000001000032a9 ex22p`main + 9593
    frame #2: 0x00007fff6d9a0ed9 libdyld.dylib`start + 1
(lldb)

Please take a look,
Thanks

@v-dobrev
Member

It looks like you are running an optimized build without debug information. Can you rebuild mfem in debug mode and re-run? For example, you can use make pdebug -j 4 in the main mfem directory and then re-build the example as usual with make ex22p.

@hongbo-yao
Author

Hi @v-dobrev, it seems we have found it:

(lldb) target create "./ex22p"
Current executable set to './ex22p' (x86_64).
(lldb) r
Process 16742 launched: '/Users/yaohb/install/mfem-openmpi/mfem-master-debug/examples2/ex22p' (x86_64)
Boundary elements with wrong orientation: 1628 / 3252 (fixed)
Process 16742 stopped

  * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x11276b1a0)
    frame #0: 0x0000000100516cc5 ex22p`mfem::ParGridFunction::ExchangeFaceNbrData(this=0x00007ffeefbfc580, i=1)::$_1::operator()(int) const at pgridfunc.cpp:233:4
   230
   231      auto d_data = this->Read();
   232      auto d_send_data = send_data.Write();
-> 233      MFEM_FORALL(i, send_data.Size(),
   234      {
   235         d_send_data[i] = d_data[d_send_ldof[i]];
   236      });
Target 0: (ex22p) stopped.
(lldb) bt
  * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x11276b1a0)
  * frame #0: 0x0000000100516cc5 ex22p`mfem::ParGridFunction::ExchangeFaceNbrData(this=0x00007ffeefbfc580, i=1)::$_1::operator()(int) const at pgridfunc.cpp:233:4
    frame #1: 0x00000001005101ca ex22p`void mfem::ForallWrap<1, mfem::ParGridFunction::ExchangeFaceNbrData()::$_0, mfem::ParGridFunction::ExchangeFaceNbrData()::$_1>(use_dev=true, N=6437, d_body=0x00007ffeefbfc598, h_body=0x00007ffeefbfc580, X=0, Y=0, Z=0)::$_0&&, mfem::ParGridFunction::ExchangeFaceNbrData()::$_1&&, int, int, int) at forall.hpp:378:34
    frame #2: 0x000000010050fdd0 ex22p`mfem::ParGridFunction::ExchangeFaceNbrData(this=0x0000000110700320) at pgridfunc.cpp:233:4
    frame #3: 0x0000000100003dbb ex22p`main(argc=1, argv=0x00007ffeefbfea18) at ex22p.cpp:460:13
    frame #4: 0x00007fff6d9a0ed9 libdyld.dylib`start + 1
    frame #5: 0x00007fff6d9a0ed9 libdyld.dylib`start + 1
(lldb)

@v-dobrev
Member

I see what the issue is: ParGridFunction::ExchangeFaceNbrData() does not support Nedelec or Raviart-Thomas elements at the moment. This is not too hard to fix.
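
For a bit of context on why this fails: for Nedelec and Raviart-Thomas spaces, some entries of the dof lists are negative; a negative entry encodes a dof together with a sign flip, using the convention encoded = -1 - dof, so indexing the local data array directly with such an entry reads out of bounds. As a minimal illustration of the convention (a hypothetical helper, not the actual MFEM code):

      // Hypothetical illustration: decode a possibly negative dof index of
      // the form  encoded = -1 - dof  into a non-negative index and a sign.
      inline int decode_dof(int encoded, double &sign)
      {
         sign = (encoded >= 0) ? 1.0 : -1.0;
         return (encoded >= 0) ? encoded : -1 - encoded;
      }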

However, I'm not sure you really need to call u.real().ExchangeFaceNbrData() in ex22p -- what are you trying to achieve with this call?

@hongbo-yao
Author

hongbo-yao commented Apr 28, 2020

Hi @v-dobrev, what I want to do is compute the face jumps of the electric field in parallel (see #1417). The face jumps are part of the widely used residual-based error estimator (volume residual + face jumps; MFEM currently only supports the ZZ error estimator) for Maxwell's equations, so I think I really do need to call ExchangeFaceNbrData() in ex3p or ex22p.

Is there another way to achieve this goal? Without calling ExchangeFaceNbrData() I can still compute the face jumps everywhere except on the MPI-shared faces, but I think it is better to take the shared faces into account as well.
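
For reference, the face-jump term I have in mind is, schematically (up to constants and the precise mesh-size scaling used in the literature), of the form

      \eta_F^2 = h_F \, \| [\, n \times (\mu^{-1} \nabla \times u_h) \,]_F \|_{L^2(F)}^2

for each interior face F, i.e. the jump of the tangential trace of \mu^{-1} \nabla \times u_h across F (possibly together with a jump of the normal component of the volume residual). Evaluating these jumps on MPI-shared faces needs the neighbor-side element data, which is why I want ExchangeFaceNbrData().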

Thanks!

@v-dobrev
Member

OK, I see. This makes sense.

I think the fix in ParGridFunction::ExchangeFaceNbrData() is the following: replace

d_send_data[i] = d_data[d_send_ldof[i]];

with

      const int ldof = d_send_ldof[i];
      d_send_data[i] = d_data[ldof >= 0 ? ldof : -1-ldof];

Try it out and check to see if you get the same result on different numbers of processors. If this is the right fix, we can create a branch to merge it into master.
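
In context, the patched kernel would then look roughly like this (a sketch based on the snippet in the backtrace above; the declaration of d_send_ldof is omitted and the actual code in pgridfunc.cpp may differ slightly):

      auto d_data = this->Read();
      auto d_send_data = send_data.Write();
      // A negative entry of d_send_ldof encodes a dof with a sign flip,
      // stored as  encoded = -1 - ldof,  so map it back before indexing.
      MFEM_FORALL(i, send_data.Size(),
      {
         const int ldof = d_send_ldof[i];
         d_send_data[i] = d_data[ldof >= 0 ? ldof : -1-ldof];
      });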

@hongbo-yao
Author

hongbo-yao commented Apr 28, 2020

Thanks, @v-dobrev,
This is definitely right in my tests!

Both ex3p (H(curl)) and ex22p (H(curl) and H(div)) passed with 1-4 ranks for both optimized and debug builds.

But it would be better to do more tests before merging.

Finally, sincere thanks for your help on this issue!

Hongbo
