
Test problem for HYPRE_MIXEDINT #326

Open
jandrej opened this issue Apr 12, 2021 · 11 comments
Comments

@jandrej

jandrej commented Apr 12, 2021

In the process of getting mfem to work with the HYPRE_MIXEDINT option (see mfem/mfem#1583) we are running into issues.

I tried to run the current hypre version (recent git master) using the ij test executable with

$ srun -n1728 -ppbatch -A *** ./test/ij -P 12 12 12 -n 1400 1400 1400

to run a large enough test. This fails with an out-of-memory assertion:

ij: hypre_memory.c:34: hypre_OutOfMemory: Assertion `0' failed.
[***:mpi_rank_341][error_sighandler] Caught error: Aborted (signal 6)

Am I using the option incorrectly?

@ulrikeyang
Contributor

ulrikeyang commented Apr 12, 2021 via email

@jandrej
Author

jandrej commented Apr 12, 2021

Yes, I configured with --enable-mixed-int. I ran the test on Quartz. I did not try the bigint option on its own, but I can do that if it helps.
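For reference, the three build configurations discussed in this thread can be sketched as follows (a config sketch using hypre's autotools flags; the source tree path and install step are assumed):

```shell
cd hypre/src

# Standard build: HYPRE_Int is 32-bit everywhere.
./configure

# bigint: all integers (local and global) are 64-bit.
./configure --enable-bigint

# mixed-int: local indices stay 32-bit, global indices become 64-bit.
./configure --enable-mixed-int

make -j && make install
```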

@ulrikeyang
Contributor

ulrikeyang commented Apr 12, 2021 via email

@jandrej
Author

jandrej commented Apr 12, 2021

Using your suggested options

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 1400 1400 1400 -agg_nl 1

works fine.

Thanks!

@jandrej jandrej closed this as completed Apr 12, 2021
@jandrej jandrej reopened this Apr 12, 2021
@jandrej
Author

jandrej commented Apr 12, 2021

I tried another option since in mfem a simple Laplace problem works fine with mixed-int. Elasticity with the systems option fails.

When I run

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 1200 1200 1200 -sysL 2 -agg_nl 1 -interptype 10

the example segfaults (without any further error information).

Is the combination of -agg_nl 1 -interptype 10 supposed to work with -sysL 2? I expect this to produce a 7-point stencil with 2 equations.

@ulrikeyang
Contributor

ulrikeyang commented Apr 12, 2021 via email

@jandrej
Author

jandrej commented Apr 12, 2021

The test also fails with a much smaller problem size

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10

so I don't suspect we are running out of memory here.
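A quick sanity check supports this: with 2 equations per grid point, the 1200³ run has a global matrix dimension above the 32-bit limit, while the 700³ run is comfortably below it, so the smaller failure cannot be a plain global-index overflow. A sketch of the arithmetic (the `-sysL 2` doubling of unknowns per point is the only assumption):

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer

def global_rows(n, num_eqns=2):
    """Global matrix dimension for an n^3 grid with num_eqns unknowns per point."""
    return n**3 * num_eqns

large = global_rows(1200)  # 3,456,000,000 rows
small = global_rows(700)   #   686,000,000 rows

print(large, large > INT32_MAX)  # exceeds the 32-bit global-index range
print(small, small > INT32_MAX)  # fits easily in 32 bits
```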

@ulrikeyang
Contributor

ulrikeyang commented Apr 12, 2021 via email

@jandrej
Author

jandrej commented Apr 12, 2021

Running

$ srun -n1728 -ppbatch -A ceed ./test/ij -P 12 12 12 -n 700 700 700 -sysL 2 -agg_nl 1 -interptype 10 -nf 2 -nodal 1

segfaults without further information.

@ulrikeyang
Contributor

ulrikeyang commented Apr 12, 2021 via email

@drew-parsons
Contributor

drew-parsons commented Jun 3, 2021

I'm also having problems with the mixedint tests, presumably the same as, or related to, the problem reported here. I am building 2.21.0 (patch set here).

After building the mixedint library and tests, TEST_ams, for instance, generates this output (corresponding to solvers.out.10 from TEST_ams/solvers.jobs; the segfault also occurs for out.8, 9, and 11; I have 8 processors on this system, if that's relevant):

$ cd /build/hypre-64m/src/test/TEST_ams
$ ln -s ../ams_driver
$ LD_LIBRARY_PATH=/build/hypre-64m/src/lib:$LD_LIBRARY_PATH mpirun -np 4 ./ams_driver -solver 5 -tol 1e-4 -h1
Problem size: 5080

=============================================
Setup phase times:
=============================================
AME Setup:
  wall clock time = 0.010000 seconds
  wall MFLOPS     = 0.000000
  cpu clock time  = 0.008005 seconds
  cpu MFLOPS      = 0.000000


Solving generalized eigenvalue problem with preconditioning

block size 5

No constraints

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node sandy exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I'm also building the standard and bigint configurations; bigint does not produce an mpirun segfault.

This is with openmpi 4.1.0 and pmix 4.0.0.

More distressing than the MPI segfault itself is that the error correlates with a complete Linux kernel meltdown. For some reason the hard drive bus seems to get detached after the MPI segfault is triggered, causing all filesystems to be remounted read-only and hence a complete system failure. Since /var also becomes read-only, I can't provide an exact log of this behaviour.

My kernel crash is reproducible in the sense that it currently happens every time I run the mixedint tests, but it does not occur with the bigint tests. However, the precise point at which the filesystem lockup occurs varies: sometimes during TEST_ams, sometimes TEST_ij or TEST_lobpcg, most often TEST_lobpcg.

With respect to the workaround suggested above, there is no -agg_nl option for ams_driver. The MPI segfault reported here seems to affect several mixedint tests: not only ij, but also ams_driver, sstruct, and struct.
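To localize the mixedint segfault, one option is a core-dump-based backtrace. A sketch, using the reproduction command from above (the core-file name pattern is system-dependent, and hypre would need to be built with -g for useful symbols):

```shell
ulimit -c unlimited                              # allow core dumps from the crashing rank
mpirun -np 4 ./ams_driver -solver 5 -tol 1e-4 -h1
gdb ./ams_driver core.*                          # then run `bt` for the failing stack trace
```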
