Checksum fails on Cori. (Haswell/KNL DOE supercomputer) #30

Open
kngott opened this issue Apr 3, 2020 · 2 comments
Labels
enhancement New feature or request

Comments

@kngott
Collaborator

kngott commented Apr 3, 2020

During our initial testing, we compiled successfully with the Intel and GCC compilers. However, the test case failed on Cori while passing on our workstations. We tracked the problem down to the AVX2 and AVX512 instruction-set flags in the Cray compiler wrappers:

-march=core-avx2 & -march=core-avx512

While tracking this down, we noticed that the first divergence appears in P.SpatialBoundingBox[3]. It seems to come from roundoff error in SetupModel.cpp, around line 128:
P.nch = 4 * ((int)ceil(P.height / P.cheight / 4));
P.height / P.cheight is very close to 7, and P.nch ends up being 7 with AVX2 and 8 without.
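
To illustrate why a `ceil` of a quotient that sits almost exactly on an integer is so fragile, here is a minimal sketch. The function names are hypothetical and are not taken from SetupModel.cpp; only the shape of the expression follows the line quoted above. It shows that moving the quotient by a single unit in the last place is enough to change the rounded result by a whole step of 4.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper with the same shape as the quoted expression:
// round a cell count up to the next multiple of 4.
int cells_rounded_to_4(double height, double cheight) {
    return 4 * static_cast<int>(std::ceil(height / cheight / 4));
}

// Same rounding step, but taking the already-divided value, so that a
// tiny perturbation of the quotient can be injected directly.
int cells_from_quotient(double quotient) {
    return 4 * static_cast<int>(std::ceil(quotient));
}

// Smallest double strictly greater than q (one ulp above q).
double one_ulp_above(double q) {
    return std::nextafter(q, std::numeric_limits<double>::infinity());
}
```

With these definitions, `cells_from_quotient(2.0)` gives 8, while `cells_from_quotient(one_ulp_above(2.0))` gives 12, even though the two inputs differ only in the last bit. Any change in how the divisions are evaluated (vectorized, fused, or reordered) can plausibly produce that kind of last-bit difference.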

Note that if one prints the numbers (or inserts a std::atomic_thread_fence call there), the numbers then agree: both are 8. However, the code then diverges again elsewhere and the checksum still fails.

A general note: the checksum regression test will likely not work across different compilers and hardware. Even a*x+b can give different answers depending on what the compiler chooses to emit: a fused multiply-add (FMA), or a multiplication followed by an addition.
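
The a*x+b point can be reproduced in isolation. The sketch below is not project code; the function names and the input values are chosen purely for illustration. It compares a product that is rounded before the addition against `std::fma`, which rounds only once at the end.

```cpp
#include <cmath>

// a*x + b with the product rounded to double before the addition.
// The volatile store is only there to stop the compiler from fusing
// the multiply and the add on its own.
double mul_then_add(double a, double x, double b) {
    volatile double product = a * x;
    return product + b;
}

// a*x + b computed as a fused multiply-add: one rounding at the end.
double fused(double a, double x, double b) {
    return std::fma(a, x, b);
}

// Illustrative inputs: with a = x = 1 + 2^-27 the exact product is
// 1 + 2^-26 + 2^-54, and the last term is lost when rounding to double.
struct Inputs { double a, x, b; };

Inputs demo_inputs() {
    const double a = 1.0 + std::ldexp(1.0, -27);
    const double b = -(1.0 + std::ldexp(1.0, -26));
    return {a, a, b};
}
```

For these inputs the unfused version returns exactly 0, while the fused version returns 2^-54. Neither answer is a bug; they are two correctly rounded evaluations of different operation sequences, which is why a bit-exact checksum is sensitive to flags that enable FMA.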

@matt-gretton-dann matt-gretton-dann added the enhancement New feature or request label Apr 4, 2020
@matt-gretton-dann
Collaborator

I agree the checksum regression test is not a perfect solution. As another example of the issue you highlight, we deliberately run it single-threaded so that differences in thread timing don't give different results between runs. Fused multiply-accumulate and vectorization will just add to the issues.
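
One common mechanism behind this kind of run-to-run variation, assumed here for illustration rather than confirmed for this model, is that floating-point addition is not associative. If threads or vector lanes combine partial results in a different order, the total can change in the last bit. The helper names below are hypothetical.

```cpp
#include <vector>

// Accumulate from the first element to the last.
double sum_left_to_right(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += v;
    }
    return total;
}

// Accumulate the same values in the opposite order.
double sum_right_to_left(const std::vector<double>& values) {
    double total = 0.0;
    for (auto it = values.rbegin(); it != values.rend(); ++it) {
        total += *it;
    }
    return total;
}
```

For the input `{0.1, 0.2, 0.3}` the two functions return different doubles, even though they add exactly the same three numbers.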

This isn't a problem when running the model in full, as it is stochastic anyway.

However, we haven't had the time to work out a scalable and maintainable way of running the regression test that allows a small amount of variation but doesn't let the figures drift over time.
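
As a rough sketch of what such a test could look like (one possible approach, not something the project has adopted), the outputs could be compared against a fixed, checked-in reference with a tolerance instead of a checksum. Comparing against a fixed reference, rather than against the previous run, is what keeps the figures from drifting. All names and tolerance parameters here are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// True if `actual` is within an absolute or relative tolerance of `reference`.
bool close_enough(double reference, double actual,
                  double rel_tol, double abs_tol) {
    const double diff = std::fabs(reference - actual);
    return diff <= std::max(abs_tol, rel_tol * std::fabs(reference));
}

// Compare a whole output series against the fixed reference series.
// A length mismatch is always a failure.
bool matches_reference(const std::vector<double>& reference,
                       const std::vector<double>& actual,
                       double rel_tol, double abs_tol) {
    if (reference.size() != actual.size()) {
        return false;
    }
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (!close_enough(reference[i], actual[i], rel_tol, abs_tol)) {
            return false;
        }
    }
    return true;
}
```

Choosing the tolerances is the hard part: too tight and the test fails across compilers, too loose and real regressions slip through.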

@matt-gretton-dann matt-gretton-dann self-assigned this Apr 4, 2020
@kngott
Collaborator Author

kngott commented Apr 4, 2020

c=1 in the regression testing did catch my eye. :)

Until a more general solution to the regression testing is worked out, I guess the best thing to do is just to be aware that the AVX2 and AVX512 instruction sets are known to cause variations.


2 participants