-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
git bisect fem_system_ex2 #1559
Comments
So it's all the way back to 0ef555e so far, without about 4 bisect steps to go. Any running bets on how close my git bisect good is the to offending commit? |
For posterity, here's my script that was fed to #!/bin/bash MYTEST_LIBMESH_BUILD_DIR=/fry1/data/users/pbauman/research/libmesh/build/test-femsystemex2-fail MYTEST_LIBMESH_INSTALL_DIR=/fry1/data/users/pbauman/research/libmesh/install/test-femsystemex2-fail # Go to libMesh build directory, configure, make, make install mkdir -p $MYTEST_LIBMESH_BUILD_DIR cd $MYTEST_LIBMESH_BUILD_DIR module load gcc mpich openblas petsc vtk boost hdf5 cppunit rm -rf ./* ../../libMesh/configure --enable-everything --with-metis=PETSc --with-methods=devel --prefix=$MYTEST_LIBMESH_INSTALL_DIR make -j 32 cd examples/fem_system/fem_system_ex2/ make -j 16 check |
This thread finally jogged my memory. I believe the error is coming from a TypeVector::inverse() call on a matrix of all zeros. At some point I changed this to throw an exception, whereas before it would just divide by zero... |
Sorry for not commenting before. This test's VTK requirement means that it doesn't get run as much as it should... we should probably consider writing multiple output files so the test can be turned on in more configurations... |
So.... git bisect said ec5c76f was the offending commit... which is exactly one after the one I randomly chose (and thought I tested...). I'm very suspicious, will report back. |
Hmm... in theory that TypeVector::inverse() change could definitely break NewtonSolver-based apps. The solver looks for NaN residuals when considering a proposed step and backs off when it finds them, often to a better step which then leads to convergence, whereas we don't have any try/catch blocks in the solver and we just scream and die if the assembly throws an exception. I should probably fix that. On the other hand, I don't think that explains this bug, because it's dying in the first assembly of the second time step, not in the middle of a solve. On the gripping hand... it looks like solid_system.C itself is testing libmesh_isnan()? But I can't tell whether that's actually an attempt to recover from NaN solution values or whether it's just a sneaky way of handling the bc/displacement config value... or whether it's the latter but it accidentally also served as the former. |
You might test ae45822. That was the commit where we first started throwing exceptions from |
IIRC the error was from this line:
is that not the case? I haven't seen a stack trace, sorry if it was already posted by someone! |
That's where it's dying; it's definitely a failure to invert a zero matrix. But what I'm saying is that, if that was previously a recoverable failure, it would have had to have been in the assembly of a proposed step that would (upon discovery of NaN instead of a thrown exception) have been retracted. If the assembly of the solution at the beginning of a time step throws NaN, there's no way for the solver to converge after that. |
Maybe it previously computed a NaN, nothing used/depended on it, and the simulation just kept chugging along? |
Sorry, had meetings, am back at this. I'm currently retesting my original starting commit for the sake of my sanity. I haven't even tried to get a stack trace because I'm lazy and |
I had a bug in my bisect script (was using source tree that wasn't being used in bisect). FML. |
OK, original good commit is still good, bad is still bad, so trying bisect again with fixed script... |
OK, git bisect reports 3b4b2fb as the offending commit |
Definitely my fault, then, huh? I was trying to get some redundancy out of EquationSystems::reinit(), but didn't notice that the removed code was *not* redundant if codes (like fem_system_ex2) used reinit() on a single System.
This is going to take a bit of thought.
|
Any ETA on when fem_system_ex2 might be fixed? I think we should probably temporarily disable the test if it's going to be a while... |
Not soon enough, I fear. Let's have the shell script return XFAIL like @pbauman suggested on the list? |
Is that the same thing as |
Yeah. |
OK, I can take care of it then. |
Sorry, I'm being an idiot. SKIP is "return 77;" and is what I had in mind, but XFAIL is set up in Makefile.am and is I believe what Paul was suggesting. I think SKIP would be preferable but either would be fine temporarily. |
Not quite. You just need to add |
Hmm... it looks like we don't use |
There is a known issue with this example triggered by the changes in 3b4b2fb. Until it can be fixed, we are simply skipping it. Refs libMesh#1559.
There is a known issue with this example triggered by the changes in 3b4b2fb. Until it can be fixed, we are simply skipping it. I initially thought about using libmesh_example_requires() for this to avoid exposing the automake-specific return code, but that macro prints out a bunch of other text that isn't needed in this case. Refs libMesh#1559.
Resolved by #1592 |
I went to a year ago and b36673c is passing. So will start
git bisect
there.The text was updated successfully, but these errors were encountered: