Pack/Unpack external32 with long double still broken #8918

dalcinl · 2021-05-04T07:11:59Z

@gpaulsen This is a followup on #8769. Pack/Unpack external32 with long double is still broken.

I'm using a very recent ompi:master at 17b723b. The test from this gist (thanks @markalle) is failing on my Fedora 34 workstation.

$ mpicc pack_long_double.c 
$ ./a.out 
sizeof(long double) == 16
MPI_Type_size MPI_LONG_DOUBLE == 16
in data : 1.000000000000000000000000000000000000 2.000000000000000000000000000000000000 3.000000000000000000000000000000000000
out data: 0.000000000000000000000000000000000000 0.000000000000000000000000000000000000 0.000000000000000000000000000000000000
in      :
  00 00 00 00  00 00 00 80  ff 3f 00 00  00 00 00 00  00 00 00 00  00 00 00 80
  00 40 00 00  00 00 00 00  00 00 00 00  00 00 00 c0  00 40 00 00  00 00 00 00
packed  :
  00 00 40 00  00 00 00 80  00 00 00 00  00 00 00 00  00 00 40 00  00 00 00 80
  00 00 00 00  00 00 00 00  00 00 60 00  00 00 00 80  00 00 00 00  00 00 00 00
unpacked:
  00 00 40 00  00 00 00 a0  00 00 00 00  00 00 00 00  00 00 40 00  00 00 00 a0
  00 00 00 00  00 00 00 00  00 00 40 00  00 00 00 b0  00 00 00 00  00 00 00 00

The text was updated successfully, but these errors were encountered:

markalle · 2021-05-04T17:10:40Z

For me the ppc build works from master, but I assume that's because my build isn't using opal_dt_swap_long_double() and yours is probably taking the x86 path? So mine is just swapping the bytes without any extra ieee struct related to conversion

I'm not really familiar with what the opal_dt_swap_long_double() routine is doing. Are you x86?

dalcinl · 2021-05-04T18:01:58Z

Are you x86?

Yes.

ggouaillardet · 2021-05-05T02:21:27Z

IIRC, a long time ago in a galaxy far far away, I worked on this in the context of heterogenous clusters, when communicating MPI_LONG_DOUBLE between x86 and sparc.

The same routine is used by MPI_Pack_external() and MPI_Unpack_external(), so the implicit expectation is that we have a symmetric (e.g. f(f(a) == a) function, but that is not the case.

I also noted MPICH does a different packing (simple byte swap (and hence symmetrical)?)

So I think we should first refer to the MPI standard in order to find the right way to external32 pack a MPI_LONG_DOUBLE, and then figure out the implementation detail.

dalcinl · 2021-05-05T07:25:40Z

@ggouaillardet According to the MPI standard, the proper way to pack MPI_LONG_DOUBLE is to store values as big-endian quadruple-precision IEEE floating point (binary128, corresponding to GCC's __float128 type). This is defined in MPI-3.1 Sec. 13.5.2 pp. 538, cf. with binary128 IEEE format.

MPI does not define a predefined datatype for binary128. Heterogeneous support may not have all the bits required to implement pack/unpack external32. EDIT: MPI_REAL16 may be binary128.

A trivial implementation of pack/unpack external32 for long double on supported compilers and platforms is simply to cast each value to/from __float128 and byte-swap to big-endian if required.

markalle · 2021-05-05T17:00:35Z

I agree with a caveat about I'm not sure if __float128 gets us enough portablity (eg it worked from from my x86 linux machine but my mac didn't seem to have quadmath.h). But as far as functionality if I store the data "123" as a long double then as a __float128, the bits come out as you described, for example:

123.0 stored as a long double:
bin 00000000 00000000 00000000 00000000
    00000000 00000000 01000000 00000101
    11110110 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
that decodes as
* first 48 bits empty
* sign bit at 48
* exponent starts at bit 49 = 1000000 00000101 = 16389, then bias by -16383 = 6
* mantissa = binary fraction without implied leading 1: 1.1110110 = 1.921875
* val = 1.921875 * 2^6 = 123.0

(or in other words the above is the 80 bit x86 extended precision format)

123.0 stored as a __float128:
bin 01000000 00000101 11101100 00000000
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
that decodes as
* sign bit at 0
* exponent starts at bit 1 = 1000000 00000101 = 16389, then bias by -16383 = 6
* mantissa = binary fraction with implied leading 1: 1.1110110 = 1.921875
* val = 1.921875 * 2^6 = 123.0

(or in other words it's the ieee 754 quad precision)

So above is really just long winded say of saying/confirming that long double at least on my x86 is 80 bit extended precision and __float128 makes it ieee 754 quad precision.

markalle · 2021-05-06T07:11:02Z

I made a couple routines to test with that convert both directions at least for the basic cases. I don't actually understand the two special exponents that describe subnormal numbers and the various NaNs, although I think the 80 bit extended precision and 128 bit quad precision use the same special exponents so a conversion like this gist that doesn't have special cases for the special exponents might be okay:

https://gist.github.com/markalle/650a6389518d04e54c9cd102485a95fd

My reason for making these conversions would be for architectures where we don't have __float128. If we do have __float128 then of course we should just use that.

I didn't incorporate the above into OMPI, and would need to look closer to understand how to change where the conversion is made. From my past reading I recall what @ggouaillardet is describing. It always seemed like the conversion routines didn't have enough arguments to describe what they were trying to do, because all they handled was byte swapping (and only of same-sized arguments) and they assumed it was all symmetrical.

dalcinl · 2021-05-06T13:14:01Z

@markalle Here you have my own version of these routines, not including any byte swapping as that part is trivial. I don't think your routines are correct, the explicit vs. implicit significand bit requires care, although I may be wrong.

https://gist.github.com/dalcinl/05cccf7b11cdf169a750485f67b499b7

In my code, the float128 -> long double path may not be totally correct for arbitrary values, but at least they roundtrip with long double input.
Please note that my code uses __float128 for testing, but the converter routines do not depend on __float128.
Do you have quick access to a big-endian machine or VM? I'm not sure my auxiliary union definitions are correct for big-endian architectures.

markalle · 2021-05-06T20:14:18Z

I do see one difference between your code and mine for decoding a stored float128: if e==0 you're turning off the highest bit of the mantissa, where mine universally turned it on.

For the rest I think we're doing the same operations, except I did all mine with characters because unsigned gets really confusing to me for mixing endianness. In particular I'm concerned that your code might be relying on the architecture endianness matching what you're trying to store. But a likely use case here is we're running little endian and intending to encode into a big endian float128.

So for example if the sign and exponent was 0x400d, the little endian machine would have those two bytes stored as 0d 40. A line like e = ul.bits.exp; would take that value as 0x400d because the arch matches the storage, then uq.bits.exp = e would similarly store 0d 40 as desired because the architecture is storing the 0x400d in little endian for us. But I don't think that would all work if the arch was little endian and we were trying to store to big endian.

I'm equally suspicious of all the frac and f[0123] and f0123 parts when mixing endianness.

I like that it's fewer operations, and I haven't done any timings but am willing to assume yours is faster. But I'd be a lot more comfortable with character based encoding if the storage endianness isn't the same as the architecture endianness.

dalcinl · 2021-05-07T14:12:37Z

@markalle I do not really care what version is ultimately used. However, I warn you about about the handling of explicit vs. hidden bit. Your testing with value 123 is perhaps not enough. Even if the processor can handle pseudo denormals, turning the highest bit of the mantisa on unconditionally is kind of ackward for the special case of converting a (positive) zero value, the bit pattern you generate does not have all bits set to zero.

As I said before, my version DOES NOT include any byte-swapping to big-endian, and should be tested in some actual big-endian architecture (both 32 and 64 bits), as I'm not sure about my struct definitions for that case.

ggouaillardet · 2021-05-08T09:32:25Z

@bosilca could you please give us some insights?

There are two sub-issues here:

how to convert a 80 bits quad precision to a 128 bits quad precision (that should be the easy part)
how to "plug" this subroutine in the datatype engine. Currently, all conversions are symmetric functions (e.g. byte swap),
but in that case, the conversion is no more symmetric and I do not see how the datatype engine can choose between 80->128 and 128->80

markalle · 2021-05-10T07:34:17Z

I made a PR using a lightly modified version of your f80 pack/unpack.

By adding a little more context added to pFunction's arguments I sidesteped my concern, about @dalcinl's function by ordering his function so it's always operating on data in the local-arch format, so sometimes it does opal_dt_swap_bytes first followed by opal_dt_swap_long_double and sometimes vice-versa

#8941

awlauria · 2021-05-13T19:21:37Z

@markalle @dalcinl what environments do we need to hit/reproduce this issue?

dalcinl · 2021-05-13T19:49:56Z

@awlauria X86_64 is enough. Ideally, we should also try it in a big-endian arch. However, I'm not aware of any big-endian arch where long double actually means 80bit extended precision floating point.

awlauria · 2021-05-20T13:19:52Z

Important but not a blocker for rc, probably want it for release.

markalle · 2021-05-25T20:22:30Z

Yeah, testing is a problem. The original bug is hit on x86_64, but trying to hit all the paths in the fix is hard.

markalle · 2021-05-25T21:08:58Z

I did another repush to #8941 as the fix. I think it's done, but testing is a concern.

I ran mac x86_64 which hits

little endian f80_to_f128 and back (using bit operations, no __float128)

I'll try PPC64 and see what other paths I can hit next. Other paths it would be nice to include are

little endian conversion via __float128
big endian conversion with f80_to_f128
big endian conversion with __float128
something with no conversion other than endianness

dalcinl · 2021-05-26T08:06:50Z

f80 and big endian may be impossible to test, at least with MPI code. f80 is an Intel format, I do not expect any other big-endian CPU to implement it. On CPUs other than Intel, I expect long double to be either double, or IEEE binary128 (__float128), or some other custom format, e.g. POWER ISA

markalle · 2021-05-28T22:27:57Z

I made this comment over in the #8941 fix PR:

The systems I've tested on now are

(full OMPI build)
- x86_64 mac (no __float128 path, uses f80_to_f128)
- ppc9 linux (uses __float128 conversion path)
- ppc9 linux (with __float128 artificially disabled leaving it with no conversion)
- x86_64 linux (uses __float128 conversion path)
- x86_64 linux (with __float128 artificially disabled so it used f80_to_f128)
(stand alone program that uses the same conversion routines and mimics as much as it can)
- qemu mips linux (big endian with 64-bit long doubles) (uses f64_to_128 path)

That's not a bad sampling, and even though it's not using the full OMPI on the old qemu system, my standalone program uses most of the OMPI code.

markalle · 2021-07-27T17:49:25Z

This one fell into the background again, but I think it's all done and ready to go over in #8941

gpaulsen · 2021-09-24T19:43:15Z

Looks like we have a FIX in #8941. We just need a review and then cherry-picks to the release branches.

awlauria · 2022-10-17T16:59:07Z

Main: #8941
v5.0.x: #10933

@dalcinl can you close when you have a chance to verify?

dalcinl · 2022-10-18T06:27:18Z

All mpi4py pack/unpack tests with external32 passed for branches:

ompi@main: [logs]
ompi@v5.0.x [logs] (ignore failures, looks like a new and unrelated regression)

awlauria · 2022-10-21T18:48:46Z

awesome thanks.

dalcinl mentioned this issue May 4, 2021

Fixing the partial pack unpack issue. #8769

Merged

awlauria added Target: main Target: v5.0.x Target: v4.0.x Target: v4.1.x labels May 4, 2021

awlauria added the Severity: blocker label May 5, 2021

markalle mentioned this issue May 10, 2021

pack external32 long double conversion (extended 80 / quad 128) #8941

Merged

awlauria removed the Severity: blocker label May 20, 2021

drew-parsons mentioned this issue May 3, 2022

testPackUnpackExternal alignment error on sparc64 mpi4py/mpi4py#147

Open

dalcinl closed this as completed Oct 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pack/Unpack external32 with long double still broken #8918

Pack/Unpack external32 with long double still broken #8918

dalcinl commented May 4, 2021

markalle commented May 4, 2021

dalcinl commented May 4, 2021

ggouaillardet commented May 5, 2021

dalcinl commented May 5, 2021 •

edited

Loading

markalle commented May 5, 2021

markalle commented May 6, 2021

dalcinl commented May 6, 2021 •

edited

Loading

markalle commented May 6, 2021

dalcinl commented May 7, 2021

ggouaillardet commented May 8, 2021

markalle commented May 10, 2021 •

edited

Loading

awlauria commented May 13, 2021

dalcinl commented May 13, 2021

awlauria commented May 20, 2021

markalle commented May 25, 2021

markalle commented May 25, 2021

dalcinl commented May 26, 2021

markalle commented May 28, 2021 •

edited

Loading

markalle commented Jul 27, 2021

gpaulsen commented Sep 24, 2021

awlauria commented Oct 17, 2022

dalcinl commented Oct 18, 2022

awlauria commented Oct 21, 2022

Pack/Unpack external32 with long double still broken #8918

Pack/Unpack external32 with long double still broken #8918

Comments

dalcinl commented May 4, 2021

markalle commented May 4, 2021

dalcinl commented May 4, 2021

ggouaillardet commented May 5, 2021

dalcinl commented May 5, 2021 • edited Loading

markalle commented May 5, 2021

markalle commented May 6, 2021

dalcinl commented May 6, 2021 • edited Loading

markalle commented May 6, 2021

dalcinl commented May 7, 2021

ggouaillardet commented May 8, 2021

markalle commented May 10, 2021 • edited Loading

awlauria commented May 13, 2021

dalcinl commented May 13, 2021

awlauria commented May 20, 2021

markalle commented May 25, 2021

markalle commented May 25, 2021

dalcinl commented May 26, 2021

markalle commented May 28, 2021 • edited Loading

markalle commented Jul 27, 2021

gpaulsen commented Sep 24, 2021

awlauria commented Oct 17, 2022

dalcinl commented Oct 18, 2022

awlauria commented Oct 21, 2022

dalcinl commented May 5, 2021 •

edited

Loading

dalcinl commented May 6, 2021 •

edited

Loading

markalle commented May 10, 2021 •

edited

Loading

markalle commented May 28, 2021 •

edited

Loading