Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pack/Unpack external32 with long double still broken #8918

Closed
dalcinl opened this issue May 4, 2021 · 23 comments
Closed

Pack/Unpack external32 with long double still broken #8918

dalcinl opened this issue May 4, 2021 · 23 comments

Comments

@dalcinl
Copy link
Contributor

dalcinl commented May 4, 2021

@gpaulsen This is a followup on #8769. Pack/Unpack external32 with long double is still broken.

I'm using a very recent ompi:master at 17b723b. The test from this gist (thanks @markalle) is failing on my Fedora 34 workstation.

$ mpicc pack_long_double.c 
$ ./a.out 
sizeof(long double) == 16
MPI_Type_size MPI_LONG_DOUBLE == 16
in data : 1.000000000000000000000000000000000000 2.000000000000000000000000000000000000 3.000000000000000000000000000000000000
out data: 0.000000000000000000000000000000000000 0.000000000000000000000000000000000000 0.000000000000000000000000000000000000
in      :
  00 00 00 00  00 00 00 80  ff 3f 00 00  00 00 00 00  00 00 00 00  00 00 00 80
  00 40 00 00  00 00 00 00  00 00 00 00  00 00 00 c0  00 40 00 00  00 00 00 00
packed  :
  00 00 40 00  00 00 00 80  00 00 00 00  00 00 00 00  00 00 40 00  00 00 00 80
  00 00 00 00  00 00 00 00  00 00 60 00  00 00 00 80  00 00 00 00  00 00 00 00
unpacked:
  00 00 40 00  00 00 00 a0  00 00 00 00  00 00 00 00  00 00 40 00  00 00 00 a0
  00 00 00 00  00 00 00 00  00 00 40 00  00 00 00 b0  00 00 00 00  00 00 00 00
@markalle
Copy link
Contributor

markalle commented May 4, 2021

For me the ppc build works from master, but I assume that's because my build isn't using opal_dt_swap_long_double() and yours is probably taking the x86 path? So mine is just swapping the bytes without any extra ieee struct related to conversion

I'm not really familiar with what the opal_dt_swap_long_double() routine is doing. Are you x86?

@dalcinl
Copy link
Contributor Author

dalcinl commented May 4, 2021

Are you x86?

Yes.

@ggouaillardet
Copy link
Contributor

IIRC, a long time ago in a galaxy far far away, I worked on this in the context of heterogenous clusters, when communicating MPI_LONG_DOUBLE between x86 and sparc.

The same routine is used by MPI_Pack_external() and MPI_Unpack_external(), so the implicit expectation is that we have a symmetric (e.g. f(f(a) == a) function, but that is not the case.

I also noted MPICH does a different packing (simple byte swap (and hence symmetrical)?)

So I think we should first refer to the MPI standard in order to find the right way to external32 pack a MPI_LONG_DOUBLE, and then figure out the implementation detail.

@dalcinl
Copy link
Contributor Author

dalcinl commented May 5, 2021

@ggouaillardet According to the MPI standard, the proper way to pack MPI_LONG_DOUBLE is to store values as big-endian quadruple-precision IEEE floating point (binary128, corresponding to GCC's __float128 type). This is defined in MPI-3.1 Sec. 13.5.2 pp. 538, cf. with binary128 IEEE format.

MPI does not define a predefined datatype for binary128. Heterogeneous support may not have all the bits required to implement pack/unpack external32. EDIT: MPI_REAL16 may be binary128.

A trivial implementation of pack/unpack external32 for long double on supported compilers and platforms is simply to cast each value to/from __float128 and byte-swap to big-endian if required.

@markalle
Copy link
Contributor

markalle commented May 5, 2021

I agree with a caveat about I'm not sure if __float128 gets us enough portablity (eg it worked from from my x86 linux machine but my mac didn't seem to have quadmath.h). But as far as functionality if I store the data "123" as a long double then as a __float128, the bits come out as you described, for example:

123.0 stored as a long double:
bin 00000000 00000000 00000000 00000000
    00000000 00000000 01000000 00000101
    11110110 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
that decodes as
* first 48 bits empty
* sign bit at 48
* exponent starts at bit 49 = 1000000 00000101 = 16389, then bias by -16383 = 6
* mantissa = binary fraction without implied leading 1: 1.1110110 = 1.921875
* val = 1.921875 * 2^6 = 123.0

(or in other words the above is the 80 bit x86 extended precision format)

123.0 stored as a __float128:
bin 01000000 00000101 11101100 00000000
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
that decodes as
* sign bit at 0
* exponent starts at bit 1 = 1000000 00000101 = 16389, then bias by -16383 = 6
* mantissa = binary fraction with implied leading 1: 1.1110110 = 1.921875
* val = 1.921875 * 2^6 = 123.0

(or in other words it's the ieee 754 quad precision)

So above is really just long winded say of saying/confirming that long double at least on my x86 is 80 bit extended precision and __float128 makes it ieee 754 quad precision.

@markalle
Copy link
Contributor

markalle commented May 6, 2021

I made a couple routines to test with that convert both directions at least for the basic cases. I don't actually understand the two special exponents that describe subnormal numbers and the various NaNs, although I think the 80 bit extended precision and 128 bit quad precision use the same special exponents so a conversion like this gist that doesn't have special cases for the special exponents might be okay:

https://gist.github.com/markalle/650a6389518d04e54c9cd102485a95fd

My reason for making these conversions would be for architectures where we don't have __float128. If we do have __float128 then of course we should just use that.

I didn't incorporate the above into OMPI, and would need to look closer to understand how to change where the conversion is made. From my past reading I recall what @ggouaillardet is describing. It always seemed like the conversion routines didn't have enough arguments to describe what they were trying to do, because all they handled was byte swapping (and only of same-sized arguments) and they assumed it was all symmetrical.

@dalcinl
Copy link
Contributor Author

dalcinl commented May 6, 2021

@markalle Here you have my own version of these routines, not including any byte swapping as that part is trivial. I don't think your routines are correct, the explicit vs. implicit significand bit requires care, although I may be wrong.

https://gist.github.com/dalcinl/05cccf7b11cdf169a750485f67b499b7

In my code, the float128 -> long double path may not be totally correct for arbitrary values, but at least they roundtrip with long double input.
Please note that my code uses __float128 for testing, but the converter routines do not depend on __float128.
Do you have quick access to a big-endian machine or VM? I'm not sure my auxiliary union definitions are correct for big-endian architectures.

@markalle
Copy link
Contributor

markalle commented May 6, 2021

I do see one difference between your code and mine for decoding a stored float128: if e==0 you're turning off the highest bit of the mantissa, where mine universally turned it on.

For the rest I think we're doing the same operations, except I did all mine with characters because unsigned gets really confusing to me for mixing endianness. In particular I'm concerned that your code might be relying on the architecture endianness matching what you're trying to store. But a likely use case here is we're running little endian and intending to encode into a big endian float128.

So for example if the sign and exponent was 0x400d, the little endian machine would have those two bytes stored as 0d 40. A line like e = ul.bits.exp; would take that value as 0x400d because the arch matches the storage, then uq.bits.exp = e would similarly store 0d 40 as desired because the architecture is storing the 0x400d in little endian for us. But I don't think that would all work if the arch was little endian and we were trying to store to big endian.

I'm equally suspicious of all the frac and f[0123] and f0123 parts when mixing endianness.

I like that it's fewer operations, and I haven't done any timings but am willing to assume yours is faster. But I'd be a lot more comfortable with character based encoding if the storage endianness isn't the same as the architecture endianness.

@dalcinl
Copy link
Contributor Author

dalcinl commented May 7, 2021

@markalle I do not really care what version is ultimately used. However, I warn you about about the handling of explicit vs. hidden bit. Your testing with value 123 is perhaps not enough. Even if the processor can handle pseudo denormals, turning the highest bit of the mantisa on unconditionally is kind of ackward for the special case of converting a (positive) zero value, the bit pattern you generate does not have all bits set to zero.

As I said before, my version DOES NOT include any byte-swapping to big-endian, and should be tested in some actual big-endian architecture (both 32 and 64 bits), as I'm not sure about my struct definitions for that case.

@ggouaillardet
Copy link
Contributor

@bosilca could you please give us some insights?

There are two sub-issues here:

  • how to convert a 80 bits quad precision to a 128 bits quad precision (that should be the easy part)
  • how to "plug" this subroutine in the datatype engine. Currently, all conversions are symmetric functions (e.g. byte swap),
    but in that case, the conversion is no more symmetric and I do not see how the datatype engine can choose between 80->128 and 128->80

@markalle
Copy link
Contributor

markalle commented May 10, 2021

I made a PR using a lightly modified version of your f80 pack/unpack.

By adding a little more context added to pFunction's arguments I sidesteped my concern, about @dalcinl's function by ordering his function so it's always operating on data in the local-arch format, so sometimes it does opal_dt_swap_bytes first followed by opal_dt_swap_long_double and sometimes vice-versa

#8941

@awlauria
Copy link
Contributor

@markalle @dalcinl what environments do we need to hit/reproduce this issue?

@dalcinl
Copy link
Contributor Author

dalcinl commented May 13, 2021

@awlauria X86_64 is enough. Ideally, we should also try it in a big-endian arch. However, I'm not aware of any big-endian arch where long double actually means 80bit extended precision floating point.

@awlauria
Copy link
Contributor

Important but not a blocker for rc, probably want it for release.

@markalle
Copy link
Contributor

Yeah, testing is a problem. The original bug is hit on x86_64, but trying to hit all the paths in the fix is hard.

@markalle
Copy link
Contributor

I did another repush to #8941 as the fix. I think it's done, but testing is a concern.

I ran mac x86_64 which hits

  • little endian f80_to_f128 and back (using bit operations, no __float128)

I'll try PPC64 and see what other paths I can hit next. Other paths it would be nice to include are

  • little endian conversion via __float128
  • big endian conversion with f80_to_f128
  • big endian conversion with __float128
  • something with no conversion other than endianness

@dalcinl
Copy link
Contributor Author

dalcinl commented May 26, 2021

f80 and big endian may be impossible to test, at least with MPI code. f80 is an Intel format, I do not expect any other big-endian CPU to implement it. On CPUs other than Intel, I expect long double to be either double, or IEEE binary128 (__float128), or some other custom format, e.g. POWER ISA

@markalle
Copy link
Contributor

markalle commented May 28, 2021

I made this comment over in the #8941 fix PR:

The systems I've tested on now are

  • (full OMPI build)
    • x86_64 mac (no __float128 path, uses f80_to_f128)
    • ppc9 linux (uses __float128 conversion path)
    • ppc9 linux (with __float128 artificially disabled leaving it with no conversion)
    • x86_64 linux (uses __float128 conversion path)
    • x86_64 linux (with __float128 artificially disabled so it used f80_to_f128)
  • (stand alone program that uses the same conversion routines and mimics as much as it can)
    • qemu mips linux (big endian with 64-bit long doubles) (uses f64_to_128 path)

That's not a bad sampling, and even though it's not using the full OMPI on the old qemu system, my standalone program uses most of the OMPI code.

@markalle
Copy link
Contributor

This one fell into the background again, but I think it's all done and ready to go over in #8941

@gpaulsen
Copy link
Member

Looks like we have a FIX in #8941. We just need a review and then cherry-picks to the release branches.

@awlauria
Copy link
Contributor

Main: #8941
v5.0.x: #10933

@dalcinl can you close when you have a chance to verify?

@dalcinl
Copy link
Contributor Author

dalcinl commented Oct 18, 2022

All mpi4py pack/unpack tests with external32 passed for branches:

  • ompi@main: [logs]
  • ompi@v5.0.x [logs] (ignore failures, looks like a new and unrelated regression)

@dalcinl dalcinl closed this as completed Oct 18, 2022
@awlauria
Copy link
Contributor

awesome thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants