Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

git bisect fem_system_ex2 #1559

Closed
pbauman opened this issue Jan 12, 2018 · 24 comments
Closed

git bisect fem_system_ex2 #1559

pbauman opened this issue Jan 12, 2018 · 24 comments

Comments

@pbauman
Copy link
Member

pbauman commented Jan 12, 2018

I went to a year ago and b36673c is passing. So will start git bisect there.

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

So it's all the way back to 0ef555e so far, without about 4 bisect steps to go. Any running bets on how close my git bisect good is the to offending commit?

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

For posterity, here's my script that was fed to git bisect run

#!/bin/bash

MYTEST_LIBMESH_BUILD_DIR=/fry1/data/users/pbauman/research/libmesh/build/test-femsystemex2-fail
MYTEST_LIBMESH_INSTALL_DIR=/fry1/data/users/pbauman/research/libmesh/install/test-femsystemex2-fail

# Go to libMesh build directory, configure, make, make install
mkdir -p $MYTEST_LIBMESH_BUILD_DIR
cd $MYTEST_LIBMESH_BUILD_DIR
module load gcc mpich openblas petsc vtk boost hdf5 cppunit
rm -rf ./*
../../libMesh/configure --enable-everything --with-metis=PETSc --with-methods=devel --prefix=$MYTEST_LIBMESH_INSTALL_DIR
make -j 32
cd examples/fem_system/fem_system_ex2/
make -j 16 check

@jwpeterson
Copy link
Member

This thread finally jogged my memory. I believe the error is coming from a TypeVector::inverse() call on a matrix of all zeros. At some point I changed this to throw an exception, whereas before it would just divide by zero...

@jwpeterson
Copy link
Member

Sorry for not commenting before. This test's VTK requirement means that it doesn't get run as much as it should... we should probably consider writing multiple output files so the test can be turned on in more configurations...

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

So.... git bisect said ec5c76f was the offending commit... which is exactly one after the one I randomly chose (and thought I tested...). I'm very suspicious, will report back.

@roystgnr
Copy link
Member

Hmm... in theory that TypeVector::inverse() change could definitely break NewtonSolver-based apps. The solver looks for NaN residuals when considering a proposed step and backs off when it finds them, often to a better step which then leads to convergence, whereas we don't have any try/catch blocks in the solver and we just scream and die if the assembly throws an exception. I should probably fix that.

On the other hand, I don't think that explains this bug, because it's dying in the first assembly of the second time step, not in the middle of a solve.

On the gripping hand... it looks like solid_system.C itself is testing libmesh_isnan()? But I can't tell whether that's actually an attempt to recover from NaN solution values or whether it's just a sneaky way of handling the bc/displacement config value... or whether it's the latter but it accidentally also served as the former.

@jwpeterson
Copy link
Member

You might test ae45822. That was the commit where we first started throwing exceptions from TypeVector::inverse().

@jwpeterson
Copy link
Member

On the other hand, I don't think that explains this bug, because it's dying in the first assembly of the second time step, not in the middle of a solve.

IIRC the error was from this line:

is that not the case? I haven't seen a stack trace, sorry if it was already posted by someone!

@roystgnr
Copy link
Member

That's where it's dying; it's definitely a failure to invert a zero matrix.

But what I'm saying is that, if that was previously a recoverable failure, it would have had to have been in the assembly of a proposed step that would (upon discovery of NaN instead of a thrown exception) have been retracted. If the assembly of the solution at the beginning of a time step throws NaN, there's no way for the solver to converge after that.

@jwpeterson
Copy link
Member

jwpeterson commented Jan 12, 2018

if that was previously a recoverable failure

Maybe it previously computed a NaN, nothing used/depended on it, and the simulation just kept chugging along?

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

Sorry, had meetings, am back at this. I'm currently retesting my original starting commit for the sake of my sanity. I haven't even tried to get a stack trace because I'm lazy and git bisect was easier.

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

I had a bug in my bisect script (was using source tree that wasn't being used in bisect). FML.

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

OK, original good commit is still good, bad is still bad, so trying bisect again with fixed script...

@pbauman
Copy link
Member Author

pbauman commented Jan 12, 2018

OK, git bisect reports 3b4b2fb as the offending commit

@roystgnr
Copy link
Member

roystgnr commented Jan 12, 2018 via email

@jwpeterson
Copy link
Member

Any ETA on when fem_system_ex2 might be fixed? I think we should probably temporarily disable the test if it's going to be a while...

@roystgnr
Copy link
Member

Not soon enough, I fear. Let's have the shell script return XFAIL like @pbauman suggested on the list?

@jwpeterson
Copy link
Member

Is that the same thing as return 77;?

@roystgnr
Copy link
Member

Yeah.

@jwpeterson
Copy link
Member

OK, I can take care of it then.

@roystgnr
Copy link
Member

Sorry, I'm being an idiot. SKIP is "return 77;" and is what I had in mind, but XFAIL is set up in Makefile.am and is I believe what Paul was suggesting. I think SKIP would be preferable but either would be fine temporarily.

@pbauman
Copy link
Member Author

pbauman commented Jan 22, 2018

Not quite. You just need to add XFAIL_TESTS = <test_name> in the Makefile.am where all the TESTS are listed. See here for example.

@jwpeterson
Copy link
Member

Hmm... it looks like we don't use TESTS in any of our Makefile.am files, maybe we just aren't using this feature of automake. For the time being, I think I'll just SKIP it.

jwpeterson added a commit to jwpeterson/libmesh that referenced this issue Jan 22, 2018
There is a known issue with this example triggered by the changes in
3b4b2fb. Until it can be fixed, we are simply skipping it.

Refs libMesh#1559.
jwpeterson added a commit to jwpeterson/libmesh that referenced this issue Jan 22, 2018
There is a known issue with this example triggered by the changes in
3b4b2fb. Until it can be fixed, we are simply skipping it.

I initially thought about using libmesh_example_requires() for this to
avoid exposing the automake-specific return code, but that macro
prints out a bunch of other text that isn't needed in this case.

Refs libMesh#1559.
@roystgnr
Copy link
Member

Resolved by #1592

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants