
Maximum matrix sizes #4268

Open
Heinrich-BR opened this issue Apr 25, 2024 · 7 comments

@Heinrich-BR

Hello MFEM developers,

I'm testing MFEM's scaling on large clusters, and to that end I'm pushing some of the examples to see how big they can be made before they break, simply by parallel-refining the mesh further and further. However, I'm noticing that in general they stop working one or two orders of magnitude before I would expect memory exhaustion or integer overflow to cause problems.

For instance, take the example ex1p. If you set the serial refinement level to -1 and the parallel refinement level to 7, the example ends up with about 17 million unknowns and segfaults in the mfem::internal::quadrature_interpolator::TensorDerivatives function. With a parallel refinement level of 6, the example runs normally to the end. I know that it is not a matter of running out of memory since the cluster I am using has more than enough for problems much larger than this. I have tried building MFEM and its dependencies with 64-bit integers and with mixed integers, but nothing seems to allow me to go further than this. Splitting the problem into many MPI ranks does help me go one parallel refinement level further before breaking again, but it would require a very large number of ranks to reach the problem sizes I am interested in, and in principle it should be doable with only a few.

I would like to ask, then: is this a known limit of MFEM, or is there some build configuration that I'm missing?

@jandrej
Member

jandrej commented Apr 25, 2024

Are you using the "performance" examples or is this the regular examples/ex1p?

@Heinrich-BR
Author

Heinrich-BR commented Apr 25, 2024

I have tried with both, and I'm having issues in either case.

Edit: Just to clarify, the particular example I gave with 17 million unknowns was the regular examples/ex1p.

@sebastiangrimberg
Contributor

Can you step through on a debugger to see what is going on? My suspicion, if I understand correctly that the 17 million dof case is run on a single MPI rank, is that there is a quadrature data array being allocated (for example, for the element Jacobians) which might be larger than 2B entries and you are overflowing on the index. This happens for 3D Jacobians when NE * NQ * 9 > 2^31-1 and should show up if you walk through the execution in gdb or lldb. MFEM does not use 64-bit integers for objects local to an MPI rank so would be unaffected by building Hypre, etc. with 64-bit integer support and this seems consistent with the observation that increasing the number of MPI ranks makes the error go away.
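For illustration, here is a minimal standalone sketch of that failure mode (not MFEM's actual code; the sizes are hypothetical placeholders):

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // Hypothetical per-rank sizes; 9 = entries of a 3x3 Jacobian per point.
    const int NE  = 16777216; // mesh elements on this MPI rank
    const int NQ  = 27;       // quadrature points per element
    const int JAC = 9;        // Jacobian entries per quadrature point

    // 32-bit product: exceeds 2^31 - 1, so this overflows (signed overflow
    // is undefined behavior; in practice it typically wraps to a negative
    // value, yielding a bogus allocation size or index).
    const int n32 = NE * NQ * JAC;

    // Promoting one operand to 64 bits gives the intended count.
    const std::int64_t n64 = static_cast<std::int64_t>(NE) * NQ * JAC;

    std::cout << "32-bit: " << n32 << "\n"  // negative garbage
              << "64-bit: " << n64 << "\n"; // 4076863488
}
```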

@pazner
Member

pazner commented Apr 25, 2024

Can you provide some more information, e.g. which initial mesh you are using, and how many MPI ranks?

If I understand your configuration correctly, the serial mesh is refined until it has 10,000 elements, and then it is partitioned, and 7 parallel refinements are performed. For a 2D quad mesh, this will result in $10^4 \times 4^7 \approx 1.6 \times 10^8$ elements, so 17 million unknowns seems very low. Maybe you are printing out the size of the local problem rather than the global problem (which would make sense if you are using ~ 10 MPI ranks). For a 3D hex mesh, 7 additional refinements will result in roughly $2 \times 10^{10}$ elements.

Also, for better parallel load balancing, it is probably better to refine the mesh as much as possible in serial (before running out of memory on a single rank), and only then partition and switch to parallel refinements. The parallel refinements do not do any repartitioning or load balancing.
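For reference, the pattern being described is roughly the one in examples/ex1p.cpp: serial refinements act on the full mesh (replicated on every rank), the partition happens once when the ParMesh is built, and parallel refinements never rebalance. Abbreviated sketch:

```cpp
Mesh mesh(mesh_file, 1, 1);
for (int l = 0; l < ser_ref_levels; l++)
{
   mesh.UniformRefinement();            // serial: whole mesh, every rank
}
ParMesh pmesh(MPI_COMM_WORLD, mesh);    // the one-time partitioning step
mesh.Clear();
for (int l = 0; l < par_ref_levels; l++)
{
   pmesh.UniformRefinement();           // parallel: no repartitioning
}
```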

pazner self-assigned this Apr 25, 2024
@v-dobrev
Member

@Heinrich-BR, can you post the exact command line you use (and any modifications you made to examples/ex1p.cpp) when you get the segfault? I can try to reproduce it. Also which version/git-hash of MFEM are you using?

@Heinrich-BR
Author

Hi everyone! Thank you for your support! Let me try to answer everything.

Can you step through on a debugger to see what is going on? My suspicion, if I understand correctly that the 17 million dof case is run on a single MPI rank, is that there is a quadrature data array being allocated (for example, for the element Jacobians) which might be larger than 2B entries and you are overflowing on the index. This happens for 3D Jacobians when NE * NQ * 9 > 2^31-1 and should show up if you walk through the execution in gdb or lldb. MFEM does not use 64-bit integers for objects local to an MPI rank so would be unaffected by building Hypre, etc. with 64-bit integer support and this seems consistent with the observation that increasing the number of MPI ranks makes the error go away.

That's very interesting, @sebastiangrimberg, and it does sound very likely. I've stepped through with a debugger before, which is how I found where the issue was happening in the first place, but I'll try again and keep an eye out for the quadrature data array.

Can you provide some more information, e.g. which initial mesh you are using, and how many MPI ranks?

I am using the beam-hex.mesh mesh on only one MPI rank, at least to reach this issue. I have run this on other meshes and hit the same problem, although the number of refinements I can get away with changes with the mesh. For instance, using star.mesh lets me get further: the simulation breaks at around 300 million unknowns. This, I believe, supports @sebastiangrimberg's idea that the size of the quadrature data array is what's controlling the limit. Since star.mesh is a 2D mesh embedded in 3D space, I would imagine the number of quadrature points per element is smaller.

If I understand your configuration correctly, the serial mesh is refined until it has 10,000 elements, and then it is partitioned, and 7 parallel refinements are performed.

Not exactly: I set the number of serial refinements to -1 (i.e., no serial refinement at all), so all of the refinement is parallel. I don't think it makes any difference whether the refinement is serial or parallel in the single-rank case anyway; the important part is that, starting from the original mesh, there were 7 refinements in total.

@Heinrich-BR, can you post the exact command line you use (and any modifications you made to examples/ex1p.cpp) when you get the segfault? I can try to reproduce it. Also which version/git-hash of MFEM are you using?

Of course! I'm using the latest version of MFEM (as of this week), so commit 31bd94e. The only modifications I made to ex1p.cpp were:

  • set ref_levels on line 139 to -1 (instead of the expression that causes it to refine until about 10000 elements).
  • set par_ref_levels on line 153 to 7.
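Sketched against the stock example, those two edits amount to roughly:

```cpp
// Serial refinement, disabled entirely
// (was an expression targeting ~10,000 elements):
int ref_levels = -1;

// Parallel refinement, pushed to 7 levels (the stock example uses 2):
int par_ref_levels = 7;
```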

With this, I ran the example with the command

mpirun -np 1 ${MFEM_PATH}/examples/ex1p -d cpu --full-assembly --no-visualization -o 1 -m ${MFEM_PATH}/data/beam-hex.mesh

Hopefully this helps you reproduce the error! Thank you everyone for your support and have a great weekend!

@Heinrich-BR
Author

Can you step through on a debugger to see what is going on? My suspicion, if I understand correctly that the 17 million dof case is run on a single MPI rank, is that there is a quadrature data array being allocated (for example, for the element Jacobians) which might be larger than 2B entries and you are overflowing on the index. This happens for 3D Jacobians when NE * NQ * 9 > 2^31-1 and should show up if you walk through the execution in gdb or lldb. MFEM does not use 64-bit integers for objects local to an MPI rank so would be unaffected by building Hypre, etc. with 64-bit integer support and this seems consistent with the observation that increasing the number of MPI ranks makes the error go away.

Just as an update regarding this, I've looked into it with a debugger again and retrieved some numbers:

$NE = 16{,}777{,}216$
$NQ = 27$
$NE \times NQ \times 9 = 4{,}076{,}863{,}488$

Given that this is almost twice $2^{31}-1$, it does seem likely that this is what's crashing. So if I understand the problem correctly, we can have at most $(2^{31}-1)/(9 N_Q)$ elements per MPI rank for any particular mesh. For the example I'm using with beam-hex.mesh, the limit is then about 8.8 million unknowns per MPI rank, correct? The issue is that this limit is quite low for many HPC applications. If I wanted to run with, say, a billion unknowns, I'd need to split the problem across more than 100 MPI tasks, which becomes unwieldy, especially on GPUs, where you typically want to pin one MPI task to each GPU.
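Spelling out the arithmetic behind that figure (with $N_Q = 27$ as above):

$$N_{E,\max} = \frac{2^{31}-1}{9\,N_Q} = \frac{2147483647}{243} \approx 8.84 \times 10^6,$$

and since a large order-1 hex mesh has roughly one vertex (unknown) per element, that works out to the ~8.8 million unknowns per rank quoted above.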

But of course, I'm just speculating here in case this is indeed the issue. It's quite possible I've missed something. Let me know what you find!
