
Mitigate catastrophic cancellation in cross products and other code #435

Draft · wants to merge 4 commits into master
Conversation

mosra (Owner) commented Apr 21, 2020

Original article: https://pharr.org/matt/blog/2019/11/03/difference-of-floats.html

While this makes the 32-bit float cross product precision essentially equivalent to a 64-bit calculation cast back to 32-bit, its speed lands roughly halfway between the straightforward 32-bit and 64-bit implementations. Benchmark on Release:

Starting Magnum::Math::Test::VectorBenchmark with 9 test cases...
 BENCH [2]   0.98 ± 0.05   ns cross2Baseline<Float>()@24999x100000 (wall time)
 BENCH [3]   3.44 ± 0.11   ns cross2Baseline<Double>()@24999x100000 (wall time)
 BENCH [4]   1.97 ± 0.08   ns cross2()@24999x100000 (wall time)
 BENCH [5]   2.22 ± 0.11   ns cross3Baseline<Float>()@24999x100000 (wall time)
 BENCH [6]   4.69 ± 0.22   ns cross3Baseline<Double>()@24999x100000 (wall time)
 BENCH [7]   3.32 ± 0.15   ns cross3()@24999x100000 (wall time)
Finished Magnum::Math::Test::VectorBenchmark with 0 errors out of 450000 checks.
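For context, the technique from the article works roughly like this (a minimal sketch for illustration, not the actual patch; the names `differenceOfProducts` and `cross2` are placeholders): compute one product, recover its exact rounding error with an FMA, and fold that error back into the FMA-computed difference.

```cpp
#include <cmath>

/* Kahan's difference-of-products algorithm, as described in the article
   above: a*b - c*d computed so that the rounding error of c*d, recovered
   exactly by an FMA, is added back to the result. */
float differenceOfProducts(float a, float b, float c, float d) {
    float cd = c*d;
    float error = std::fma(-c, d, cd); /* exact rounding error of c*d */
    float dop = std::fma(a, b, -cd);
    return dop + error;
}

/* A 2D cross product is a single difference of products */
float cross2(float ax, float ay, float bx, float by) {
    return differenceOfProducts(ax, by, ay, bx);
}
```

With inputs chosen to trigger catastrophic cancellation, the naive `ax*by - ay*bx` loses most of its significant digits, while the variant above stays within a couple of ulps of the double-precision result, which matches the "basically equivalent to a 64-bit calculation" claim above.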

However, this holds only on platforms that actually have an FMA instruction. On Emscripten, for example, the code is ten times slower than the baseline implementation, which is not an acceptable tradeoff -- there, simply using doubles to calculate the result is faster. And enabling the more precise variant only on some platforms doesn't seem like a good idea for portability. For the record, benchmark output on Chrome (node.js in the terminal gives similar results):

Starting Magnum::Math::Test::VectorBenchmark with 7 test cases...
 BENCH [2]   2.53 ± 0.34   ns cross2Baseline<Float>()@499x100000 (wall time)
 BENCH [3]   5.18 ± 1.30   ns cross2Baseline<Double>()@499x100000 (wall time)
 BENCH [4]   6.22 ± 0.46   ns cross2()@499x100000 (wall time)
 BENCH [5]   2.73 ± 0.35   ns cross3Baseline<Float>()@499x100000 (wall time)
 BENCH [6]   5.94 ± 0.61   ns cross3Baseline<Double>()@499x100000 (wall time)
 BENCH [7]  28.77 ± 2.40   ns cross3()@499x100000 (wall time)
Finished Magnum::Math::Test::VectorBenchmark with 0 errors out of 7000 checks.
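One conceivable compile-time gate -- shown only to make the tradeoff concrete, since the description above argues against exactly this kind of per-platform branching -- is the standard `FP_FAST_FMAF` macro from `<cmath>`, which implementations define when `std::fma` on floats is at least as fast as a separate multiply and add, i.e. when it maps to a hardware instruction. A hypothetical sketch:

```cpp
#include <cmath>

float cross2(float ax, float ay, float bx, float by) {
#ifdef FP_FAST_FMAF
    /* hardware FMA available: Kahan's difference of products */
    float cd = ay*bx;
    float error = std::fma(-ay, bx, cd); /* exact rounding error of ay*bx */
    return std::fma(ax, by, -cd) + error;
#else
    /* a software-emulated FMA would be ~10x slower (see the Emscripten
       numbers above), so widen to double instead */
    return float(double(ax)*double(by) - double(ay)*double(bx));
#endif
}
```

Both branches produce a result as precise as the double-rounded-to-float one; the `#ifdef` only picks whichever is faster on the target.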

Stashing this aside until I'm clearer on what to do with this. Things to keep an eye on:

- Some precision improvements have to be done, so a baseline is needed first. The debug perf is beyond awful, actually.
- The Vector3 version is 5% slower in Release, on GCC at least. The gather() helpers are nice in user code but extremely bad in library code.
@mosra mosra mentioned this pull request May 9, 2020