
Lack of determinism in regression test results when compiled with the Intel compiler on Linux #358

Closed
owainkenwayucl opened this issue Jun 1, 2020 · 11 comments

Comments

@owainkenwayucl
Collaborator

owainkenwayucl commented Jun 1, 2020

(I'm digging into this, so I hope to have more detail later.)

Currently, contrary to what is said in #161 and in report9/README.md, the codebase is not deterministic when compiled with the Intel compiler suite (tested with versions 2020 (release) and 2019 (Update 5) on RHEL 7.8).

"Covidsim is now deterministic across platforms (linux, Mac and Windows) and across compilers (gcc, clang, intel and msvc) for a specified number of threads and fixed random number seeds."

Releases up to and including 0.8.0 could be built and would pass the regression test suite with newer Intel compilers such as those in the Intel 2019 Update 5 and 2020 (release) suites, but newer versions of CovidSim no longer pass. (They are fine with the GNU compiler suite, albeit at a significant performance penalty, so we deploy GNU builds on our HTC/HPC resources.)

Out of mild interest, and based on the above statement, I figured I'd try again with the Intel compilers. Now not only do the checksums not match (presumably due to minor floating-point differences), but the code (both 0.14.0 and master) reliably writes out a different number of files in the UK regression test script when compiled with either the 2019 or 2020 compiler suite. This produces the error:

Lengths don't match.

in the test output. Sometimes the test writes out more files than are in the reference set, and sometimes it writes fewer. This looks like it's related to run length (when the run is shorter it's, for example, the final mt-*.bmp files that are missing), so presumably(?) this is the simulation taking a shorter or longer time to converge due to accumulated floating-point differences.

This seems somewhat dependent on the microarchitecture - for example, 0.14.0 writes more files on one of our compute nodes (Cascade Lake) but fewer on one of the login nodes (Skylake).

Edit - On further testing, the difference between the two platforms seems to be related to $OMP_NUM_THREADS being set explicitly inside jobs on the compute nodes, whereas it is not set on the login nodes. However, CovidSim is told to use 1 thread, so I guess the compiler is doing ... something.

This behaviour occurs regardless of whether I run the regression test script directly (letting it build its own binary) or run the tests via make test in a build directory. It does not seem to happen with the US test, which merely has the expected checksum issues.

I'm not adding any extra compiler options. I've not changed the tests in any way so I assume they run on 1 core.

I have a set of scripts that checks out every release of the code from GitHub (0.7.0 onwards) and runs the UK regression test script. Running these as an array job on our HTC cluster indicates that this species of error first cropped up in release 0.10.0.
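The workflow is roughly the following (a simplified sketch rather than my actual scripts - the regression test script name is illustrative, and in practice each tag is one task of the array job):

    # Check out each release tag in turn and run the UK regression test,
    # which builds its own binary with whatever compiler is loaded.
    git clone https://github.com/mrc-ide/covid-sim.git
    cd covid-sim
    for tag in $(git tag); do            # in practice, tags from 0.7.0 onwards
        git checkout "$tag"
        python3 tests/regressiontest_UK_100th.py    # script name illustrative
    done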

In versions from 0.10.0 to 0.12.0 inclusive, this can be resolved by manually setting -fp-model precise in $CXXFLAGS, but that doesn't seem to help from 0.13.0 onwards.
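For the record, the workaround amounts to something like this (a sketch assuming an out-of-tree CMake build with the Intel compilers on the path; the source path is illustrative):

    # Force Intel's precise floating-point model when configuring the build.
    export CC=icc CXX=icpc
    export CXXFLAGS="-fp-model precise"
    mkdir build && cd build
    cmake ../src        # or wherever the top-level CMakeLists.txt lives
    make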

I've done a test with clang and that seems to be fine, so the problem appears to be limited to the Intel compilers.

I would advise against setting -fp-model strict on the Intel compiler because on some older versions of the compiler this triggered some really hideous bugs.

Is there a recommended set of compiler options for getting a deterministic build with the Intel compiler?

@tomgreen66

I think this was mentioned in #30, where different results were found due to AVX2/AVX512. Cascade Lake has a slightly different set of AVX512 instructions from Skylake, which could cause the difference you are seeing. In this case I think deterministic means that if you rerun the code you will get the same result, rather than getting the same results across different compilers and platforms. We've just got some Cascade Lake and AMD Rome nodes along with Skylake, so I might give it a try here (using the Intel compiler on AMD has its own gotchas). I hit similar issues when looking at the threading issue, so I'd be interested in what the developers say.

@owainkenwayucl
Collaborator Author

owainkenwayucl commented Jun 2, 2020

I did try turning off AVX on its own to little avail, but I suspect the differences come from a collection of optimisations that would all need to be turned off, and disabling just one doesn't solve the whole problem. If I get a chance today I'll do an -O0 build and see what happens...

The documented AVX512 differences from Skylake to Cascade Lake should be limited to the addition of some half-width float instructions targeting machine learning, but obviously we can't rule out undocumented implementation differences leading to different results.

@weshinsley
Collaborator

weshinsley commented Jun 2, 2020

Thanks for the report and for testing this. In what way are you "assuming one core" - do you have any environment variables set that force OpenMP to use only one core? (e.g. see #342) If that's the case, the multi-threaded regression tests will fail - they need 2 threads. But none of our test systems have needed anything special to allow that.

Using the provided CMake instructions is probably the best way to build, but if you want to compile by hand, the executable built with "g++ -O2 -fopenmp *.cpp -o CovidSim" gives me results identical to our Windows Visual Studio builds (and performance not greatly different), so compilation is about as simple as it gets.

We haven't currently got a way of testing the Intel C++ compiler on Linux, but we have binary reproducibility with Intel C++ (2020) on Windows, Clang 10 on Windows, MSVC (VS2019) on Windows, Clang 10 on Linux, and G++ 9.1.x on Linux.

On Windows we tested with -O0, -O2, /fp:strict, /fp:precise, and also forcing SSE2 instructions, and on GCC we varied -O0, -O2, -ffloat-store and -mfpmath=sse, and found that none of those caused any difference in the results either. So... what I'm saying is that reproducibility should be very straightforward to show, and if you find yourself digging into very complicated things, it's probably worth stepping back and looking at your setup from more of a distance.

The Cray with AVX512 may still be an issue; we're not sure and don't have access to that sort of hardware for more testing. But essentially, you should be able to get reproducibility very easily with this code, so I'd look at your OpenMP environment variables first.
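As a quick sanity check, something along these lines (illustrative commands) will show whether the job environment is constraining OpenMP before the tests run:

    # See what the scheduler / job script has set for OpenMP.
    env | grep -i omp
    # If OMP_NUM_THREADS is being forced to 1, clear it (or allow at least 2 threads)
    # before running the regression tests.
    unset OMP_NUM_THREADS
    # or: export OMP_NUM_THREADS=2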

@owainkenwayucl
Collaborator Author

Thanks for that detail - I've now managed to do runs with AVX disabled and with -O0, and neither helped with the newer versions. Certainly on the compute nodes $OMP_NUM_THREADS is forced to the number of cores requested, so I guess setting that to 2 is the next thing to try - thanks for the suggestion.

@owainkenwayucl
Collaborator Author

owainkenwayucl commented Jun 2, 2020

Right. It looks like it was a combination of the threading issue ($OMP_NUM_THREADS being set inside jobs) and needing to force -fp-model precise via $CXXFLAGS; the two together fixed it, so I'll close this issue. It might be worth making the CMakeLists.txt file turn on that option (-fp-model precise) by default with the Intel compiler on Linux, though?

@owainkenwayucl
Collaborator Author

Oh, and of course thanks for your help!

@weshinsley
Collaborator

weshinsley commented Jun 2, 2020

No problem.

Strange that Intel C++ on Linux is not already using the equivalent of /fp:precise by default, which is the standard middle ground on Windows. On Intel/Windows we have three choices: precise and strict (which made no difference), and /fp:fast, which we didn't try as it intuitively didn't seem like a reliable idea... your Intel isn't defaulting to /fp:fast, is it?

@owainkenwayucl
Collaborator Author

owainkenwayucl commented Jun 2, 2020

I presume it is defaulting to that (possibly one of the other options implies it?). There should be no numerical difference between precise and strict - the only difference is that strict turns on floating-point exceptions.

However, previous Intel compilers have had bugs where turning on strict broke integer math in some edge cases!!!

owainkenwayucl added a commit to UCL-RITS/rcps-modulefiles that referenced this issue Jun 2, 2020
@weshinsley
Collaborator

@owainkenwayucl
Collaborator Author

owainkenwayucl commented Jun 2, 2020

Intel gotta get those answers quickly even if they are wrong! 😬

If you are interested, here is the issue where my colleague and the Plumed developer discovered the integer/strict issue: plumed/plumed2#391

@StewMH

StewMH commented Jun 3, 2020

Hi - sorry to jump in, but in case this is helpful:

We hit a number of similar issues in ATLAS, where we compile with GCC and preload the Intel Math Function Library to use AVX etc. on diverse hardware without rebuilding.

Running (e.g.) a GCC-compiled binary with the Intel Math Function Library preloaded:

LD_PRELOAD=$INTEL_PRELOAD/libimf.so:$INTEL_PRELOAD/libintlc.so.5

will then replace the glibc <cmath> functions with Intel's implementations. It can help narrow down purely numerical issues versus differences in optimisation and compilation.
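In practice that looks something like this when launching a binary (the CovidSim invocation here is just a placeholder for whatever the tests normally run):

    # Run a GCC-built binary with Intel's math library substituted for the glibc one.
    LD_PRELOAD=$INTEL_PRELOAD/libimf.so:$INTEL_PRELOAD/libintlc.so.5 ./CovidSim <usual arguments>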
