Checksum fails on Cori. (Haswell/KNL DOE supercomputer) #30

kngott · 2020-04-03T17:47:21Z

During our initial testing, we compiled successfully with both the Intel and GCC compilers. However, the test case failed on Cori while passing on our workstations. We tracked the problem down to the AVX2 and AVX512 instructions enabled by the Cray compiler wrappers:

-march=core-avx2 & -march=core-avx512

While tracking this down, we noticed that the first divergence happens in P.SpatialBoundingBox[3]. It seems to come from roundoff error in SetupModel.cpp, around line 128:
P.nch = 4 * ((int)ceil(P.height / P.cheight / 4));
P.height / P.cheight is very close to 7 and P.nch ends up being 7 with avx2 and 8 without.

Note that if one prints the numbers (or inserts a std::atomic_thread_fence call there), the numbers agree: both are 8. However, the code then diverges again elsewhere, and the checksum still fails.
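To illustrate how sensitive that expression is, here is a minimal sketch, not the project's code. The function names, the tolerance, and the input values are all made up for illustration. It shows that a quotient sitting one ulp above an integer is enough to push `ceil` up a whole step, and that snapping near-integer quotients before calling `ceil` removes that particular sensitivity.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper mirroring the shape of the quoted expression:
// number of cells, rounded up to a multiple of 4.
int cells_rounded_up(double height, double cell_height) {
    assert(cell_height > 0.0);
    return 4 * static_cast<int>(std::ceil(height / cell_height / 4));
}

// Assumed variant: if the quotient is within rel_tol of an integer,
// treat it as that integer before taking the ceiling, so a last-bit
// roundoff difference cannot change the result.
int cells_rounded_up_tolerant(double height, double cell_height,
                              double rel_tol = 1e-9) {
    assert(cell_height > 0.0);
    double q = height / cell_height / 4;
    const double nearest = std::round(q);
    if (std::fabs(q - nearest) <= rel_tol * std::fmax(1.0, std::fabs(nearest))) {
        q = nearest;
    }
    return 4 * static_cast<int>(std::ceil(q));
}
```

With `height = std::nextafter(8.0, 9.0)` and `cell_height = 1.0`, the first function returns 12 while an exact 8.0 returns 8. The tolerant variant returns 8 in both cases. Whether snapping is acceptable for the model itself is a separate question, since it changes the grid size for inputs that are genuinely just above an integer.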

As a general note, the checksum regression test will likely not work across different compilers and hardware. Even a*x+b could give different answers depending on what the compiler chooses to do: a fused multiply-add (FMA), or a multiplication followed by an addition.
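The a*x+b point can be reproduced in isolation. The sketch below uses made-up function names and is not taken from the code base. It forces the two evaluation strategies explicitly: `std::fma` rounds once, while the unfused version rounds the product first. The `volatile` is only there to stop the compiler from contracting the second version into an FMA itself.

```cpp
#include <cmath>

// Multiply, round to double, then add: two roundings.
double axpb_unfused(double a, double x, double b) {
    volatile double product = a * x;  // volatile forces the rounded product to be stored
    return product + b;
}

// Fused multiply-add: a * x + b evaluated with a single rounding.
double axpb_fused(double a, double x, double b) {
    return std::fma(a, x, b);
}
```

For example, with `a = x = 1 + 2^-30` and `b = -(1 + 2^-29)`, the exact product is `1 + 2^-29 + 2^-60`. The unfused version loses the `2^-60` term when the product is rounded and returns 0, while the fused version returns `2^-60`. Any checksum over the raw bits of such results will differ between the two builds.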


matt-gretton-dann · 2020-04-04T17:12:37Z

So I agree the checksum regression test is not a perfect solution. As another example of the issue you highlight, we deliberately run it single-threaded so that differences in thread timing don't produce different results between runs. Fused multiply-accumulate and vectorization just add to the issues.
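One common mechanism behind run-to-run differences in multi-threaded code is that floating-point addition is not associative, so the order in which partial results are combined matters. Whether that is the dominant effect in this model is an assumption on my part. A minimal sketch with a made-up helper name:

```cpp
#include <numeric>
#include <vector>

// Sums the values strictly left to right, starting from 0.0.
// The same values in a different order can give a different total.
double sum_in_order(const std::vector<double>& values) {
    return std::accumulate(values.begin(), values.end(), 0.0);
}
```

Calling `sum_in_order({1e16, 1.0, -1e16})` gives 0.0, because the 1.0 is absorbed when added to 1e16, whereas `sum_in_order({1e16, -1e16, 1.0})` gives 1.0. If threads finish in a different order from run to run, a reduction over their results can behave like the first call in one run and like the second in another.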

This isn't a problem running the model in full as it is stochastic anyway.

However, we haven't had the time to work out a scalable and maintainable way of running the regression test that allows a small amount of variation but doesn't let the figures drift over time.
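For what it's worth, the "small amount of variation" part could look something like the sketch below. This is only an illustration: the function name and the default tolerances are hypothetical, and nothing like it exists in the repository as far as this thread shows. Comparing against a fixed stored reference, rather than against the previous run, is one way to keep the figures from drifting, but choosing tolerances that are meaningful for a stochastic model is the hard part and is not addressed here.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical comparison: true when `actual` agrees with the stored
// `reference` to within a relative or an absolute tolerance.
bool within_tolerance(double reference, double actual,
                      double rel_tol = 1e-9, double abs_tol = 0.0) {
    if (std::isnan(reference) || std::isnan(actual)) return false;
    if (reference == actual) return true;  // also covers equal infinities
    if (std::isinf(reference) || std::isinf(actual)) return false;
    const double scale = std::max(std::fabs(reference), std::fabs(actual));
    return std::fabs(reference - actual) <= std::max(rel_tol * scale, abs_tol);
}
```

The absolute tolerance matters for values near zero, where a purely relative check rejects any nonzero difference.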

kngott · 2020-04-04T23:13:04Z

c=1 in the regression testing did catch my eye. :)

Until a more general solution to the regression testing is worked out, I guess the best thing to do is just be aware that the AVX2 and AVX512 instruction sets are known to cause variations.

matt-gretton-dann added the enhancement label Apr 4, 2020

matt-gretton-dann self-assigned this Apr 4, 2020

tomgreen66 mentioned this issue Jun 1, 2020

Lack of determinism in regression test results when compiled with the Intel compiler on Linux #358

Closed
