-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JDK-8289551: Conversions between bit representations of half precision values and floats #9422
Conversation
…n values and floats
👋 Welcome back darcy! A progress list of the required criteria for merging this PR into |
Some notes in the current implementation in the PR: the impetus for the change is to provide a minimal level of platform support for the binary16 floating-point format defined in IEEE 754. This is done by added two methods to Float, one for binary16 -> float conversion and the other for float -> binary16 conversion. In the absence of the ability to define primitive types, short is used as a carrier for the bits of a binary16 value. The conversion code in its current form favors readability over speed; a more performant software implementation may be possible even without intrinsification. A different 16-bit floating-point format, bfloat, is popular in some circles. Any similar support for bfloat will be left for future work. Please also review the companion CSR: https://bugs.openjdk.org/browse/JDK-8290216 |
Webrevs
|
// significand bits in the float and binary16 | ||
// formats | ||
(bin16SignifBits << (FloatConsts.SIGNIFICAND_WIDTH - 11))); | ||
return sign * Float.intBitsToFloat(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int result = (floatExpBits |
// Shift left difference in the number of
// significand bits in the float and binary16
// formats
(bin16SignifBits << (FloatConsts.SIGNIFICAND_WIDTH - 11)));
avoids a useless |
operation
// @IntrinsicCandidate | ||
public static short floatToBinary16AsShortBits(float f) { | ||
if (Float.isNaN(f)) { | ||
// Arbitrary binary16 NaN value; could try to preserve the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Arbitrary binary16 quiet NaN value; could try to preserve the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double-checked the 754-2019 standard; unlike in some earlier versions of the standard, the 2019 edition does specify how a quiet vs signaling NaN is supposed to be encoded. Quoting from section 6.2.1:
"A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand
field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand
field being 0."
So the NaN bit pattern used is a quiet NaN.
In any case, I'll update this code to more faithfully pass along the bits of a Float NaN.
short sign_bit = (short)((doppel >> 16) & 0x8000); | ||
|
||
// The overflow threshold is binary16 MAX_VALUE + 1/2 ulp | ||
if (abs_f > (65504.0f + 16.0f) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (abs_f >= (65504.0f + 16.0f) ) {
Value exactly halfway must round to infinity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. The rest of the code computed the right value, but the condition should be changed as suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I increased test coverage around Binary16.MAX_VALUE+0.5*ulp(Binary16.MAX_VALUE).
// range of float). | ||
|
||
// Significand bits as if using rounding to zero (truncation). | ||
signif_bits = (short)((doppel & 0x0007f_e000) >> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signif_bits = (short)((doppel & 0x007f_e000) >>
or even
signif_bits = (short)((doppel & 0x007f_ffff) >>
32 bit hex are more readable when they have 8 hex digits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops; yes, I intended for the int hex constants to have at most 8 hex digits.
} | ||
|
||
short result = 0; | ||
result = (short)(((exp + 15) << 10) | signif_bits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
result = (short)(((exp + 15) << 10) + signif_bits);
The final exponent needs to be incremented when signif_bits == 0x400
. The |
is not enough for this to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; I'll correct this bug in the next push.
I'll be adding a test to verify correct rounding of all the values in a binade - the set of floating-point values with the same exponent. That test reveals this issue.
|
||
short result = 0; | ||
result = (short)(((exp + 15) << 10) | signif_bits); | ||
return (short)(sign_bit | (0x7fff & result)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (short)(sign_bit | result);
because result <= 0x7fff
.
// subnormals is [-140, -149]. Multiply abs_f down by | ||
// 2^(-125) -- since (-125 = -149 - (-24)) -- so that | ||
// the trailing bits of a subnormal float represent | ||
// the correct trailing bits of a binary16 subnormal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write intervals (ranges) in the form [low, high]
, so [-24, -15]
and [-149, -140]
.
// the correct trailing bits of a binary16 subnormal. | ||
exp = -15; // Subnormal encoding using -E_max. | ||
float f_adjust = abs_f * 0x1.0p-125f; | ||
signif_bits = (short)(Float.floatToRawIntBits(f_adjust) & 0x03ff); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the if
and the exp++
can be avoided if the & 0x03ff
is dropped altogether.
signif_bits = (short)Float.floatToRawIntBits(f_adjust);
The reason is the same as for the normalized case below: a carry will eventually flow into the representation for the exponent.
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last invocation to Integer.compare()
is not correct.
For example, if bin16_1 = -1
and bin16_2 = 1
, the invocation returns 1, which is incorrect.
I added NaN round-trip tests is a separate file since the properties being testing are not required by the specification, unlike the properties explored in the other tests. If the NaN tests need special handling on certain platforms/configurations, that can be don't without complicating the tests of most of the functionality. |
int round = doppel & 0x0000_1000; | ||
int sticky = doppel & 0x0000_0fff; | ||
|
||
if (round != 0 && (lsb != 0 || sticky != 0 )) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to reverse the order of (lsb != 0 || sticky != 0)
to (sticky != 0 || lsb != 0)
to improve the likelihood that the evaluation of ||
does not need to evaluate its right-hand side.
Statistically, sticky != 0
has a much higher chance to evaluate to true
than lsb != 0
.
Thanks for the review comments thus far. A few notes on the recently pushed update, corresponding to webrev 04. The conversion methods have been renamed as suggested in comments received outside of the PR. Thinking more about how a future possible value class for the binary16 format might be structured (java.math.FloatBinary16? java.math.Binary16?), even if such a class cannot extend java.lang.Number, it could still implement the conversion methods found on Number. In particular, a value class for binary16 could have a "floatValue" method as well as factory to convert from float to binary16. Therefore, there need not be any overloading concerns with the methods defined on Float for this issue. (As an aside, it might be helpful to define a superinterface with the methods of java.lang.Number to better capture the ConvertibleToPrimitive API.) The regression tests were augmented in several ways. One is by the introduction of an alternative implementation of the float -> binary16 conversion that uses the hardware support for float rounding to compute the proper low-order bits. This approach benchmarked as significantly slower than the existing bit-inspection approach so it is just used to double-check in testing. Before updating the PR, all 2^32 float values were converted to binary16 using both approaches and the results matched. Testing all 2^32 values all the time seems excessive, so in the PR just one exponent value is checked. For each exponent in normal range, the rounding is analagous at corresponding positions of the number line. Therefore, testing one exponent should be sufficient for testing all the different combination of bit that influence rounding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't claim to be an expert in IEEE formats... the methods generally look good with thorough testing. Looking forward to the impact intrinsification will have!
* @bug 8289551 | ||
* @summary Verify conversion between float and the binary16 format | ||
* @library ../Math | ||
* @build FloatConsts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line necessary?
@jddarcy This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 3 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the ➡️ To integrate this PR with the above commit message to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
@@ -975,6 +976,215 @@ public static int floatToIntBits(float value) { | |||
@IntrinsicCandidate | |||
public static native float intBitsToFloat(int bits); | |||
|
|||
/** | |||
* {@return the {@code float} value closest to the numerical value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{@return
seems to be out of place and missing a closing }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
/** | ||
* {@return the floating-point binary16 value, encoded in a {@code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{@0return
seems to be out of place and missing a closing }
|
||
// Shift left difference in the number of significand bits in | ||
// the float and binary16 formats | ||
final int SIGNIF_SHIFT = (FloatConsts.SIGNIFICAND_WIDTH - 11); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it make more sense to declare this local as private static final
? It could then even be used in the next method.
// Dealing with finite values in exponent range of | ||
// binary16 (when rounding is done, could still round up) | ||
int exp = Math.getExponent(f); | ||
assert -25 <= exp && exp <= 15; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that both the subnormal and the normal case can be unified if we pay closer attention to the positions of the lsb, round and sticky bits in subnormals.
// Clamp exp to the [-15, 15] range while retaining the
// difference between the original value and -15 on clamping.
// This is the excess shift value in addition to 13.
int expdelta = Math.max(0, -15 - exp);
exp += expdelta;
assert -15 <= exp && exp <= 15;
int f_signif_bits = doppel & 0x007f_ffff; // original significand
// Significand bits as if using rounding to zero (truncation).
short signif_bits = (short)(f_signif_bits >> (13 + expdelta));
// For round to nearest even, determining whether or
// not to round up (in magnitude) is a function of the
// least significant bit (LSB), the next bit position
// (the round position), and the sticky bit (whether
// there are any nonzero bits in the exact result to
// the right of the round digit). An increment occurs
// in three cases:
//
// LSB Round Sticky
// 0 1 1
// 1 1 0
// 1 1 1
// See "Computer Arithmetic Algorithms," Koren, Table 4.9
int lsb = f_signif_bits & (1 << 13 + expdelta);
int round = f_signif_bits & (1 << 12 + expdelta);
int sticky = f_signif_bits & ((1 << 12 + expdelta) - 1);
if (round != 0 && ((lsb | sticky) != 0 )) {
signif_bits++;
}
// No bits set in significand beyond the *first* exponent
// bit, not just the sigificand; quantity is added to the
// exponent to implement a carry out from rounding the
// significand.
assert (0xf800 & signif_bits) == 0x0;
return (short)(sign_bit | ( ((exp + 15) << 10) + signif_bits ) );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't test this variant, will do tomorrow when also reviewing the tests themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The correct variant below passes the tests.
// For binary16 subnormals, beside forcing exp to -15,
// retain the difference expdelta = E_min - exp.
// This is the excess shift value, in addition to 13, to be used
// in the computations below.
// Further the (hidden) msb with value 1 in f must be involved as well.
int expdelta = 0;
int msb = 0x0000_0000;
if (exp < -14) {
expdelta = -14 - exp;
exp = -15;
msb = 0x0080_0000;
}
int f_signif_bits = doppel & 0x007f_ffff | msb;
// Significand bits as if using rounding to zero (truncation).
short signif_bits = (short)(f_signif_bits >> (13 + expdelta));
// For round to nearest even, determining whether or
// not to round up (in magnitude) is a function of the
// least significant bit (LSB), the next bit position
// (the round position), and the sticky bit (whether
// there are any nonzero bits in the exact result to
// the right of the round digit). An increment occurs
// in three cases:
//
// LSB Round Sticky
// 0 1 1
// 1 1 0
// 1 1 1
// See "Computer Arithmetic Algorithms," Koren, Table 4.9
int lsb = f_signif_bits & (1 << 13 + expdelta);
int round = f_signif_bits & (1 << 12 + expdelta);
int sticky = f_signif_bits & ((1 << 12 + expdelta) - 1);
if (round != 0 && ((lsb | sticky) != 0 )) {
signif_bits++;
}
// No bits set in significand beyond the *first* exponent
// bit, not just the sigificand; quantity is added to the
// exponent to implement a carry out from rounding the
// significand.
assert (0xf800 & signif_bits) == 0x0;
return (short)(sign_bit | ( ((exp + 15) << 10) + signif_bits ) );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test classes look good.
LGTM |
/integrate |
Going to push as commit 7318b22.
Your commit was automatically rebased without conflicts. |
Initial implementation.
Progress
Issues
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9422/head:pull/9422
$ git checkout pull/9422
Update a local copy of the PR:
$ git checkout pull/9422
$ git pull https://git.openjdk.org/jdk pull/9422/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9422
View PR using the GUI difftool:
$ git pr show -t 9422
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9422.diff