ci: use mpich built with ch3:sock to speed up tests #3883

scottwittenburg
Collaborator

@scottwittenburg scottwittenburg commented Oct 31, 2023

Try an mpich built using a possible future version of spack (spack PR here) that supports building --with-device=ch3:sock:tcp.

@scottwittenburg
Collaborator Author

scottwittenburg commented Nov 14, 2023

@vicentebolea Do you know anything about this failing test Install.CMake.EncryptionOperator? On (gcc8, static, serial) it seems to be throwing the following exceptions:

1: terminate called after throwing an instance of 'std::runtime_error'
1:   what():  [Mon Nov 13 15:47:31 2023] [ADIOS2 EXCEPTION] <Plugins> <PluginManager> <GetOperatorCreateFun> : Couldn't find operator plugin named MyOperator

and

2: IO System base failure exception, STOPPING PROGRAM
2: [Mon Nov 13 15:47:31 2023] [ADIOS2 EXCEPTION] <Toolkit> <transport::file::FilePOSIX> <CheckFile> : couldn't open file testOperator.bp, in call to POSIX open: errno = 2: No such file or directory

@scottwittenburg
Collaborator Author

It's also a little hard to see the benefits here, as the results are muddied by the following two tests timing out after 2 minutes (that's 2 tests x 5 tries each x 2 min per try = 20 min):

  • Engine.BP.*/BPChangingShapeWithinStep.MultiBlock/*.BP3.MPI (recent results from other PRs here).
  • Engine.BP.*/BPChangingShapeWithinStep.MultiBlock/*.BP4.MPI (recent results from other PRs here)
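
The 20 min figure above is a worst-case bound: every try of every affected test runs until its timeout. A minimal sketch of that arithmetic follows; the function and parameter names are hypothetical and are not part of the CI configuration.

```python
def worst_case_timeout_minutes(num_tests: int, tries_per_test: int,
                               timeout_minutes: float) -> float:
    """Upper bound on wall time lost when every try of every test times out."""
    return num_tests * tries_per_test * timeout_minutes


# Numbers taken from the comment above: 2 tests, 5 tries each, 2 min per try.
lost = worst_case_timeout_minutes(num_tests=2, tries_per_test=5, timeout_minutes=2)
print(f"{lost:g} min")  # prints "20 min"
```

This assumes the tries run one after another; if they overlapped, less wall time would be lost.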

Oddly, those timeouts seem to have started only today. But it looks like they also affected ompi builds from before I changed to mpich (on other PRs as well as on master). And more odd, I haven't rebased this PR since Oct 24.

@scottwittenburg
Collaborator Author

And more odd, I haven't rebased this PR since Oct 24.

Oh wait, that part is not weird: GHA tested a merge commit made by GitHub, merging my PR head (1847e39) into master at:

fd111d462 Merge pull request #3913 from anagainaru/perfstub-fix

@vicentebolea
Collaborator

@vicentebolea Do you know anything about this failing test Install.CMake.EncryptionOperator? On (gcc8, static, serial) it seems to be throwing the following exceptions:

As I recall this was resolved by @spyridon97 in the past weeks.

@vicentebolea
Collaborator

Yep, we are having issues with those tests (Engine.BP.*/BPChangingShapeWithinStep.MultiBlock/*.BP3.MPI). I have a PR trying to figure out the reason at #3908.

@scottwittenburg
Collaborator Author

As I recall this was resolved by @spyridon97 in the past weeks.

If you have a link to the PR, I'll take a look.

@scottwittenburg scottwittenburg force-pushed the another-way-to-mpich-ch3-sock-tcp branch 2 times, most recently from f6d01f7 to 84a321e November 15, 2023 18:27
@scottwittenburg
Collaborator Author

Table showing test times and number of tests for each compiler/parallel pair for the most recent CI run:

compiler  |   parallel   |   # tests run   |   elapsed time (tests only)
----------+--------------+-----------------+----------------------------
clang6    |     ompi     |       1269      |           21m
clang6    |    mpich     |       1257      |            7m
clang10   |    mpich     |       2029      |           11m
gcc8      |     ompi     |       1296      |           30m
gcc8      |    mpich     |       1284      |           12m
gcc9      |    mpich     |       2029      |           14m
gcc10     |    mpich     |       2027      |           14m
gcc11     |    mpich     |       2029      |           14m

Anecdotally, I know the gcc10-mpich tests used to take around 1 hr.
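
For the two compilers that appear in the table with both ompi and mpich, the ratio of elapsed test times can be read off directly. The sketch below only restates the table's numbers; the dictionary and function names are hypothetical. The test counts differ slightly between the ompi and mpich rows, so the comparison is approximate rather than strictly like-for-like.

```python
# (compiler, parallel) -> (tests run, elapsed minutes), copied from the table above.
RESULTS = {
    ("clang6", "ompi"): (1269, 21),
    ("clang6", "mpich"): (1257, 7),
    ("gcc8", "ompi"): (1296, 30),
    ("gcc8", "mpich"): (1284, 12),
}


def speedup(compiler: str) -> float:
    """Ratio of ompi elapsed test time to mpich elapsed test time."""
    _, ompi_minutes = RESULTS[(compiler, "ompi")]
    _, mpich_minutes = RESULTS[(compiler, "mpich")]
    return ompi_minutes / mpich_minutes


for compiler in ("clang6", "gcc8"):
    print(f"{compiler}: {speedup(compiler):.1f}x")
# prints:
# clang6: 3.0x
# gcc8: 2.5x
```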

@scottwittenburg
Collaborator Author

scottwittenburg commented Nov 16, 2023

@vicentebolea Thanks for your help with this. If you approve of these changes, I'll rebuild/tag/push the images to the correct location and rewrite the history to a single commit with those changes. Otherwise, we can iterate on anything you disagree with while still using the test images.

@scottwittenburg scottwittenburg changed the title from "WIP: ci: another approach to tolerating oversubscription" to "ci: use mpich built with ch3:sock to speed up tests" Nov 16, 2023
vicentebolea previously approved these changes Nov 16, 2023
Collaborator

@vicentebolea vicentebolea left a comment

This looks fantastic, great work there! how about the image building time? Did it significantly increase?

@scottwittenburg
Collaborator Author

how about the image building time? Did it significantly increase?

I don't think it increased much, if at all, but I'll let you know for sure when I build them all cleanly in just a few minutes here.

@scottwittenburg
Collaborator Author

33 minutes to build the gcc and clang images, 12 minutes to build icc and oneapi. That doesn't include pushing, I'm pushing them now.

@vicentebolea
Collaborator

Good enough. There has been a 50% increase in image build time, but we get much faster test execution. Sounds good! Feel free to merge after pushing the images and updating the image names in this PR. I will re-approve then.

@scottwittenburg scottwittenburg merged commit ced424b into ornladios:master Nov 17, 2023
34 checks passed
@scottwittenburg scottwittenburg deleted the another-way-to-mpich-ch3-sock-tcp branch November 17, 2023 17:43