Run ADIOS2 CI without test reruns #3825
Conversation
@eisenhauer Was this just a test to see if test retries are necessary? Looking through some of the failed jobs, it seems there were a handful of failures and timeouts on each that presumably would have passed on a subsequent try.
Actually, that doesn't quite seem to be what happened. Looking at the raw output of one of the tests, at least one test passed several times and then finally failed due to a timeout (search the log above for that test name to see it). Still, this is a little unnerving, isn't it? That if we just run enough times (4, in the case above), some of the MPI tests will eventually fail?
It is a bit unnerving. I started running this (submitted automatically every week) because of fear that the test retries might be hiding bugs or race conditions that we should be addressing. The InSitu engine seems to be particularly prone to failures, though it's not the only place we see them. Unfortunately, when I've had time to go looking for failure modes I've had a hard time reproducing the failure...
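(For readers coming to this thread cold: "test reruns" here refers to retrying failed tests before reporting them. A minimal sketch of how that typically looks with CTest's built-in repeat support, assuming a generic dashboard script and made-up environment variables rather than ADIOS2's actual CI scripts:)

```cmake
# Sketch of a CTest dashboard script, run as: ctest -S run_ci.cmake
# The REPEAT modes are standard CTest features (CMake >= 3.17); whether
# ADIOS2's CI uses exactly these flags is an assumption, not confirmed here.
cmake_minimum_required(VERSION 3.17)

set(CTEST_SOURCE_DIRECTORY "$ENV{CI_SOURCE_DIR}")   # hypothetical CI variables
set(CTEST_BINARY_DIRECTORY "$ENV{CI_BUILD_DIR}")
set(CTEST_SITE "ci-runner")
set(CTEST_BUILD_NAME "no-rerun-experiment")
set(CTEST_CMAKE_GENERATOR "Ninja")

ctest_start(Experimental)
ctest_configure()
ctest_build()

# Normal CI behavior: retry a flaky test a few times before declaring failure.
ctest_test(REPEAT UNTIL_PASS:3 RETURN_VALUE test_result)

# The weekly "no rerun" job discussed in this issue would simply drop REPEAT:
# ctest_test(RETURN_VALUE test_result)
```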
I'm wondering about this too, from trying to run the adios2 tests locally, built against mpich. Quick note of context: I'm currently trying to speed up mpich builds in CI (see #3616), with one goal of that work being to eventually replace most OpenMPI builds with mpich (see #3617).
Speeding up CI is a good goal. I don't know that I can offer a lot of insight into the failures, as my attempts to kill them haven't gotten anywhere. (I did kill a few testing bugs early on, but mostly it was multiple tests using the same output filename, which caused issues when they were run concurrently. I haven't been able to blame that for any of the regular no-rerun failures.) I will, however, fess up to being responsible for some of the longest-running tests. There are some SST tests where we spawn multiple readers and randomly kill old ones or spawn new ones to make sure that the writer will survive such things. (That was the sort of situation where I was worried we might be hiding occasional failures, but I haven't seen evidence of that.) Those tests take minutes, simply because we want to make sure new readers have time to start up, connect, etc. So there are several tests with 300-second timeouts.
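(To illustrate the two points above with a sketch: unique output filenames keep concurrent tests from colliding, and the long kill/spawn tests get a generous timeout. Test and file names below are invented, not the real staging-common targets.)

```cmake
# Illustrative only: names are made up, not the actual ADIOS2 test targets.

# Unique output filenames keep concurrently running tests from clobbering
# each other's files.
add_test(NAME Staging.1x1.Write COMMAND TestWriter --filename Staging.1x1.bp)
add_test(NAME Staging.2x2.Write COMMAND TestWriter --filename Staging.2x2.bp)

# The long-running SST tests that kill and respawn readers need time for new
# readers to start up and connect, hence timeouts on the order of 300 seconds.
add_test(NAME Staging.KillReaders COMMAND TestKillReaders --filename Staging.Kill.bp)
set_tests_properties(Staging.KillReaders PROPERTIES TIMEOUT 300)
```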
Thanks for sharing that info. Even though there may be some long-running tests, it seems something else is going on with mpich: the test run with mpich in the name regularly takes around an hour in CI, while the test runs with ompi in the name all take roughly half that. Test suite with mpich: https://open.cdash.org/viewTest.php?onlypassed&buildid=9020462 . Same OS and compiler on both, and the ompi suite ran 1276 tests, which is 2 more than the mpich suite. That's odd, but probably explainable and unrelated. But the mpich tests took 58 minutes, while the ompi tests took 28 minutes.
And I'm wondering if the difference in times could be explained by "invisible retries" in the case of mpich.
Just to toss out two things that might also play a role. I think that Vicente's MPI data plane in SST is only enabled on MPICH, but when it is enabled it is used by default. If there are startup costs or something that are different from the sockets-based data plane, that might cause systematic differences between the MPI implementations. Also, the SST tests almost always involve at least one reader and one writer, and those are separate executables (maybe each is an MPI job, maybe a single process). When the MPI implementation is capable of MPMD mode (that is, launching different MPI ranks with different executables), we try to use that because it speeds up the testing. (The alternative is to launch 2 MPI jobs, one for the reader and one for the writer.) The ability to use MPI in MPMD mode might also explain some speed differences for a different MPI implementation.
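(Concretely, MPMD mode means one mpiexec invocation launches both executables. A sketch of the two launch styles registered as CTest tests; the executable names are hypothetical, while MPIEXEC_EXECUTABLE and MPIEXEC_NUMPROC_FLAG come from CMake's FindMPI module:)

```cmake
find_package(MPI REQUIRED)

# MPMD launch: one mpiexec starts the writer and the reader together, e.g.
#   mpiexec -n 2 TestWriter : -n 2 TestReader
# The colon-separated syntax is supported by both MPICH and Open MPI launchers.
add_test(NAME Staging.1x1.MPMD
  COMMAND ${MPIEXEC_EXECUTABLE}
          ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestWriter> :
          ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestReader>)

# Fallback without MPMD: two separate MPI jobs, started and torn down
# independently, which is what costs extra time.
# add_test(NAME Staging.1x1.Writer
#   COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestWriter>)
# add_test(NAME Staging.1x1.Reader
#   COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestReader>)
```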
Those are two great suggestions, giving me a couple of new lines of investigation, which are quite welcome. Thanks 😁 I can see where the variable controlling whether the MPI data plane is included depends on the MPI implementation being mpich, so I'll dig in from that angle a little. Regarding MPMD mode, just a quick look makes it seem both mpich and openmpi support that, so I'll see if it can be enabled for mpich if it's not already. Thanks again @eisenhauer!
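(For illustration, a gate like that often keys off the MPI library version string. A minimal sketch, assuming a made-up option name; the real ADIOS2 variable and detection logic may differ:)

```cmake
find_package(MPI REQUIRED)

# MPI_C_LIBRARY_VERSION_STRING comes from FindMPI (CMake >= 3.10) and usually
# contains "MPICH" or "Open MPI". The option name below is invented for this
# sketch; ADIOS2's actual CMake uses its own variable.
if(MPI_C_LIBRARY_VERSION_STRING MATCHES "MPICH")
  set(ADIOS2_SST_HAVE_MPI_DP TRUE)   # MPI data plane built and used by default
else()
  set(ADIOS2_SST_HAVE_MPI_DP FALSE)  # fall back to the sockets-based data plane
endif()
```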
Well, I'm fully cognizant that SST tends to be a problem child, and I will apologize for the complexity of the CMake in staging-common. There may be a better way to do what it does...