Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support FMA operation #1354

Merged
merged 9 commits into from Oct 26, 2018

Conversation

Projects
None yet
9 participants
@thvnx
Copy link
Contributor

commented Sep 19, 2017

This commit adds the floating-point fused multiply-add operation (FMA) to the Pervasives module.

The following changes are made:

  • configure: check math.h for the C99 fma() operation.
  • fma declarations in pervasives.ml[i] (stdlib and otherlibs/threads).
  • C fma() call or emulation in byterun/floats.c.
  • Dedicated tests.

I am a regular ocaml follower, as well as an intensive FMA user. I've done this work to enjoy both ocaml language and FMA operation. If useful to the community, could this contribution be integrated to the next ocaml release?

The test suite passes successfully.

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2017

When emulated (only activated if C99 functions aren't provided by math.h), the FMA doesn't resist to overflows or underflows. Do I have to mention it in documentation?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Sep 20, 2017

Can you confirm that this is intended to improve the precision of the calculation, not to make the combined operation faster (which wouldn't probably be the case because of the C function call)?

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2017

You are right. It's for accuracy purposes, and additionally, improves the programming of extended-precision-based numerical algorithms (with perhaps performance improvements in some such cases).

It's very usefull for me but maybe not so relevant for the ocaml developers community?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Sep 20, 2017

I've a biais towards numerical code, so I'll let other core maintainers comment on the relevance of including the function in the stdlib. (The fact that it could at some point be implemented directly in the backend to benefit from CPU instructions would be an argument in favor of including the function.)

Anyway, if it is decided that the function deserves to be in the stdlib, it should probably go in the future Float module discussed elsewhere, not in Pervasives.

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2017

Thanks for your comments. I agree.

Where can I find the discussion about the Float module?

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented Sep 20, 2017

I agree with @alainfrisch that this would be a fine addition to the future Float module. (Perhaps along with other C99 FP functions that we still miss.)

The fallback implementation is taken from authors I trust. However it uses some C99 features (declaration in the middle of a block) that (some version of) MSVC doesn't support.

It should be easy to have ocamlopt turn calls to this function into FMA instructions provided by the processor, if they exist. E.g. for PowerPC and for ARM64, possibly for ARM too. For x86-64 FMA instructions were added to SSE only recently, so we have a portability issue.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Sep 20, 2017

Adding a Float module was discussed in #964, and this is now conditioned on #1010.

@jhjourdan

This comment has been minimized.

Copy link
Contributor

commented Sep 20, 2017

Is the fallback implementation correct if overflow occur? The paper which is cited explicitly exclude this case from its proofs.

I am not sure whether this applies here, but note that overflow could occur in intermediate computations but not in the correct final result. In that case, it is even not enough to add something like "undefined result in case of overflow" in the documentation, because then it is not clear what overflow mean.

We should also take care about infinites and NaN...

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2017

@jhjourdan, the fallback implementation isn't correct if overflow occurs. In that case, the result will be as if computations were done without FMA.

I'll look if a more robust FMA emulation exists (maybe here).

EDIT: emulation linked above seems conform to IEEE standard.

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Sep 21, 2017

Does anyone understand why AppVeyor build fails? I don't have any Windows environment to reproduce the actual issue, i.e. warnings from line 1768 to 1773.

@nojb

This comment has been minimized.

Copy link
Contributor

commented Sep 21, 2017

This is related to some recent PR merges and it is being worked on. It should be resolved in the near future.

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented Sep 22, 2017

Concerning overflows, ISO C99 7.12.13.1 says "A range error may occur", without further details. So, I'm not sure all fma implementations out there handle overflow in x * y correctly.

@damiendoligez damiendoligez added this to the 4.07-or-later milestone Sep 25, 2017

@gasche gasche closed this Sep 27, 2017

@gasche gasche reopened this Sep 27, 2017

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Sep 27, 2017

About the FMA emulation: the one provided by the GLIBC is conform to IEEE 754 requirements (since the version 2.24, see here). Note that it uses the same algorithm which is currently implemented in this PR. I will soon adapt the fallback consequently.

@jhjourdan

This comment has been minimized.

Copy link
Contributor

commented Sep 28, 2017

Great !

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Oct 3, 2017

I've updated the FMA emulation and added new tests. FMA fallback supports now overflow and almost all underflows (some double-rounding issues are very difficult to handle: 3 of 317 tests still fail).

@damiendoligez

This comment has been minimized.

Copy link
Member

commented Jun 5, 2018

Now we have merged #1010 and #1638, so we have a Float module. @thvnx could you update this PR?

@damiendoligez damiendoligez removed this from the consider-for-4.07 milestone Jun 5, 2018

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Jun 6, 2018

Great! I'll do it ASAP.

@thvnx thvnx force-pushed the thvnx:fma-support branch 2 times, most recently from a0c267c to b27815f Jul 9, 2018

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2018

I moved the fma operation in the Float module.

But, I still have a few things to do:

  • 1. Make tests working again. testsuite/HACKING.adoc is incomplete, directives for 'Creating a new test' are missing. I do not understand how to add new tests well for now.
  • 2. In case of fma emulation only (when C99 functions are not available), some double-rounding issues are very difficult to handle: 3 of 317 tests still fail.
  • 3. For the tests, I've taken all the numerical values from the glibc ones, do I need to be more specific in testsuite/tests/fma/fma.ml (with a link to the original file for example)?
@damiendoligez

This comment has been minimized.

Copy link
Member

commented Jul 9, 2018

testsuite/HACKING.adoc is incomplete, directives for 'Creating a new test' are missing. I do not understand how to add new tests well for now.

cc @shindere

@shindere

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2018

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2018

@shindere, yes, I have a .ml file with its .reference counterpart. I've added an ocamltests file which contains the file name of my .ml test. When I run 'make tests', I obtain the following output:

List of directories returning no results:
    tests/fma
@shindere

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2018

@thvnx thvnx force-pushed the thvnx:fma-support branch from 42edd41 to bcf57f3 Oct 24, 2018

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Oct 24, 2018

I did not succeed in solving the three failed tests for the hard-to-tackle double-rounding issue (in the case of fallback emulation). So, I just disabled them. It works well when FMA hardware is provided but not fully IEEE compliant in emulation mode. OK anyway?

external fma : float -> float -> float -> float =
"caml_fma_float" "caml_fma" [@@unboxed] [@@noalloc]
(** A fused multiply-add [fma x y z] computes [x * y + z] with a
single rounding. Deprecated if no hardware FMA support is provided,

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 24, 2018

Contributor

I'm not sure "deprecated" is relevant here. Something like "If no hardware support is provided, ...". But can this also be caused by a lack of libc support? And if the libc provides the emulation itself without hardware support, is it fully accurate?

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 24, 2018

Author Contributor

It depends on the libc. Glibc is IEEE compliant, I don't know about the others. The fallback I wrote is very close to the one of the glibc (but they can switch rounding modes and other things that I can't do here).

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Oct 24, 2018

OK anyway?

Ok for me, but perhaps @xavierleroy or @Chris00 would like to comment on it.

@@ -60,7 +60,7 @@ external div : float -> float -> float = "%divfloat"
external fma : float -> float -> float -> float =
"caml_fma_float" "caml_fma" [@@unboxed] [@@noalloc]
(** A fused multiply-add [fma x y z] computes [x * y + z] with a
single rounding. Deprecated if no hardware FMA support is provided,
single rounding. If no hardware support is provided,

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 24, 2018

Contributor

Ok, so IIUC, the criterion is not so much that the hardware support is provided or not, but whether the libc provides a compliant version (emulated or hardware based) or not. There can be cases where hardware support is available but emulation is still used (your emulation, or from the libc). (Sorry to nitpick, but it's important to explain as precisely possible sources of differences in fp calculations across different machines.)

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 24, 2018

Author Contributor

Yes, you're absolutely right! I'll fix it in the next commit.

double-rounding issue may happen with software emulation fallback.
@since 4.09.0 *)
single rounding. Can be not fully IEEE compliant if a software
emulation is use in place of the hardware instruction. @since

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 24, 2018

Contributor

Proposal:

[fma x y z] returns [x * y + z], with a best effort for computing this expression with a single rounding, using either hardware instructions (providing full IEEE compliance) or a software emulation.

@since 4.08

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 25, 2018

Author Contributor

Fixed, thanks!

@alainfrisch
Copy link
Contributor

left a comment

Minor comments only.

I'm not competent to check the emulation code but @xavierleroy said he had enough trust, so I think this can be merged once comments are addressed unless someone else speaks up.

configure Outdated
@@ -1064,7 +1064,7 @@ fi
# For the Pervasives module

if sh ./hasgot2 -i math.h $mathlib expm1 log1p hypot copysign; then

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 25, 2018

Contributor

You should include fma in the list of functions tested with hasgot2.

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 25, 2018

Author Contributor

Done

(** [fma x y z] returns [x * y + z], with a best effort for computing
this expression with a single rounding, using either hardware
instructions (providing full IEEE compliance) or a software emulation.
@since 4.09.0 *)

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 25, 2018

Contributor
  1. 4.08
  2. I did not think about it before, but what about saying a word about performance (with and without emulation) compared to the naive version? People might think that using fma is a good idea for performance reason even when precision is not a big concern, and it's not obvious whether this is a good idea. Is it true that with hardware support, fma would be fastest but with emulation from libc or yours, itwould be slower?

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 25, 2018

Author Contributor

Add a note about performance.

Changes Outdated
@@ -126,6 +126,9 @@ Working version
min_max_num to module Float.
(Christophe Troestler, review by Alain Frish, Xavier Clerc and Daniel Bünzli)

- GPR#1354: Add fma support to Float module.
(Laurent Thévenoux)

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 25, 2018

Contributor

Please include people involved in the review.

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 25, 2018

Author Contributor

Reviewers only or all participants?

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 25, 2018

Contributor

Looking quickly at the discussion, I'd say : @jhjourdan @xavierleroy @alainfrisch

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 25, 2018

Author Contributor

Thanks, done!


(* tests 254, 257--259, 264--265, 267--268, and 272--273 are disabled
because of unresolved double-rounding issue in FMA emulation
fallback *)

This comment has been minimized.

Copy link
@alainfrisch

alainfrisch Oct 25, 2018

Contributor

I'll let you decide, but perhaps, for those tests, you could specify either a list of possible results (the correct one, and the one obtained by the emulation), or do a check "modulo epsilon"; just to make sure that the function is not too crazy in those cases.

This comment has been minimized.

Copy link
@thvnx

thvnx Oct 25, 2018

Author Contributor

Fixed (with list of possible results), integration in progress, I'm waiting for mscv64 failures if any.

Show resolved Hide resolved Changes Outdated

alainfrisch and others added some commits Oct 25, 2018

Fix typo
Co-Authored-By: thvnx <laurent.thevenoux@gmail.com>

@alainfrisch alainfrisch merged commit db99969 into ocaml:trunk Oct 26, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Oct 26, 2018

Thanks @thvnx for your contribution and patience when dealing with incremental comments!

@shindere

This comment has been minimized.

Copy link
Contributor

commented Oct 26, 2018

This PR causes a failure of the fma test on Inria's CI (Cygwin32 machine).
The problem happens wiht and without flambda in native and bytecode mode.
There is a difference between the expected output and the obtained one:

@@ -252,17 +252,17 @@
 252 OK!
 253 OK!
 254 OK!
-255 OK!
+255 FAIL!	fma (0x1.fffp+0, 0x1.0000000000001p+0, -0x1.fffp+0) returned 0x1p-51 instead of 0x1.fffp-52.
 256 OK!
 257 OK!
-258 OK!
-259 OK!
-260 OK!
+258 FAIL!	fma (0x1.deadbeef2feedp+1023, 0x0.deadbeef2feedp-1022, -0x1.a05f8c01a4bfbp+1) returned 0x1.098p-53 instead of 0x1.0989687bc9da4p-53.
+259 FAIL!	fma (0x1.deadbeef2feedp+900, 0x0.deadbeef2feedp-1022, -0x1.a05f8c01a4bfbp-122) returned 0x1.098p-176 instead of 0x1.0989687bc9da4p-176.
+260 FAIL!	fma (0x1.fffffffffffffp+1023, 0x1.001p+0, -0x1.fffffffffffffp+1023) returned 0x1p+1012 instead of 0x1.fffffffffffffp+1011.
 261 OK!
 262 OK!
 263 OK!
-264 OK!
-265 OK!
+264 FAIL!	fma (0x1.deadbeef2feedp-495, 0x1.deadbeef2feedp-495, -0x1.bf86a5786a574p-989) returned 0x0.00000428p-1022 instead of 0x0.0000042625a1fp-1022.
+265 FAIL!	fma (0x1.deadbeef2feedp-503, 0x1.deadbeef2feedp-503, -0x1.bf86a5786a574p-1005) returned 0x0.000000000428p-1022 instead of 0x0.0000000004262p-1022.
 266 OK!
 267 OK!
 268 OK!
@@ -295,10 +295,10 @@
 295 OK!
 296 OK!
 297 OK!
-298 OK!
-299 OK!
-300 OK!
-301 OK!
+298 FAIL!	fma (0x1.fffffffffffffp-1, 0x1.fffffffffffffp-1, -0x1.ffffffffffffep-1) returned 0x0p+0 instead of 0x1p-106.
+299 FAIL!	fma (0x1.fffffffffffffp-1, -0x1.fffffffffffffp-1, 0x1.ffffffffffffep-1) returned 0x0p+0 instead of -0x1p-106.
+300 FAIL!	fma (-0x1.fffffffffffffp-1, 0x1.fffffffffffffp-1, 0x1.ffffffffffffep-1) returned 0x0p+0 instead of -0x1p-106.
+301 FAIL!	fma (-0x1.fffffffffffffp-1, -0x1.fffffffffffffp-1, -0x1.ffffffffffffep-1) returned 0x0p+0 instead of 0x1p-106.
 302 OK!
 303 OK!
 304 OK!
@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Oct 26, 2018

@shindere , I can easily fix it. Do I have to create a new pull request or I can just pull in this one?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Oct 26, 2018

No unfortunately, you need to create a new PR. Is the problem a bad software emulation from that libc?

@thvnx

This comment has been minimized.

Copy link
Contributor Author

commented Oct 26, 2018

@alainfrisch, OK. Yes, I guess this is due to another software support for the fma on this platform (or something related to subnormals).

@shindere, I'll create a new PR with this patch: thvnx@06d120c. Can you test it before someone merge it to trunk?

See #2120

damiendoligez added a commit to damiendoligez/ocaml that referenced this pull request Nov 5, 2018

Support FMA operation (ocaml#1354)
Adds a fused multiply-add operation to the Float module.

The following changes are made:
- configure: check math.h for the C99 fma() operation.
- fma declarations in float.ml[i] (stdlib/).
- C fma() call or emulation in runtime/floats.c.
- dedicated tests in testsuite/tests/fma.
@shindere

This comment has been minimized.

Copy link
Contributor

commented Mar 1, 2019

For several weeks, Inria's CI was broken in several ways, so that
the testsuite could not be run for the Cygwin 64 port.

Now that this has been fixed, it turns out that the FMA tests fail on
this platform, for both native and bytecode, with and without flambda.

More precisely, in all these cases the test's output differs from the
expected one:

@@ -255,15 +255,15 @@
 255 OK!
 256 OK!
 257 OK!
-258 OK!
-259 OK!
-260 OK!
-261 OK!
-262 OK!
+258 FAIL!	fma (0x1.deadbeef2feedp+1023, 0x0.deadbeef2feedp-1022, -0x1.a05f8c01a4bfbp+1) returned 0x0p+0 instead of 0x1.0989687bc9da4p-53.
+259 FAIL!	fma (0x1.deadbeef2feedp+900, 0x0.deadbeef2feedp-1022, -0x1.a05f8c01a4bfbp-122) returned 0x0p+0 instead of 0x1.0989687bc9da4p-176.
+260 FAIL!	fma (0x1.fffffffffffffp+1023, 0x1.001p+0, -0x1.fffffffffffffp+1023) returned infinity instead of 0x1.fffffffffffffp+1011.
+261 FAIL!	fma (-0x1.fffffffffffffp+1023, 0x1.fffffffffffffp+0, 0x1.fffffffffffffp+1023) returned -infinity instead of -0x1.ffffffffffffdp+1023.
+262 FAIL!	fma (0x1.fffffffffffffp+1023, 0x1p+1, -0x1.fffffffffffffp+1023) returned infinity instead of 0x1.fffffffffffffp+1023.
 263 OK!
-264 OK!
-265 OK!
-266 OK!
+264 FAIL!	fma (0x1.deadbeef2feedp-495, 0x1.deadbeef2feedp-495, -0x1.bf86a5786a574p-989) returned 0x0p+0 instead of 0x0.0000042625a1fp-1022.
+265 FAIL!	fma (0x1.deadbeef2feedp-503, 0x1.deadbeef2feedp-503, -0x1.bf86a5786a574p-1005) returned 0x0p+0 instead of 0x0.0000000004262p-1022.
+266 FAIL!	fma (0x1p-537, 0x1p-538, 0x0.0000000000001p-1022) returned 0x0.0000000000001p-1022 instead of 0x0.0000000000002p-1022.
 267 OK!
 268 OK!
 269 OK!
@@ -275,8 +275,8 @@
 275 OK!
 276 OK!
 277 OK!
-278 OK!
-279 OK!
+278 FAIL!	fma (0x0.0000000000001p-1022, 0x1p-1, 0x0.fffffffffffffp-1022) returned 0x0.fffffffffffffp-1022 instead of 0x1p-1022.
+279 FAIL!	fma (-0x0.0000000000001p-1022, 0x1p-1, -0x0.fffffffffffffp-1022) returned -0x0.fffffffffffffp-1022 instead of -0x1p-1022.
 280 OK!
 281 OK!
 282 OK!

And configure says:

checking for fma... yes

Any idea what may go wrong here?

@dra27

This comment has been minimized.

Copy link
Contributor

commented Mar 1, 2019

@shindere

This comment has been minimized.

Copy link
Contributor

commented Mar 1, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.