Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter #12869

Closed
wants to merge 11 commits into from

Conversation

vnkozlov
Copy link
Contributor

@vnkozlov vnkozlov commented Mar 3, 2023

Implemented Float.floatToFloat16 and Float.float16ToFloat intrinsics in Interpreter and C1 compiler to produce the same results as C2 intrinsics on x64, Aarch64 and RISC-V - all platforms where C2 intrinsics for these Java methods were implemented originally.

Replaced SharedRuntime::f2hf() and hf2f() C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic. This runtime stub is only used to calculate constant values during C2 compilation and can be skipped.

I added new tests based on Tobias's TestAll.java And copied jdk/lang/Float/Binary16Conversion*.java tests to run them with -Xcomp to make sure code is compiled by C1 or C2. I modified Binary16ConversionNaN.java to compare results from Interpreter, C1 and C2.

Tested tier1-5, Xcomp, stress


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8302976: C2 intrinsification of Float.floatToFloat16 and Float.float16ToFloat yields different result than the interpreter

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/12869/head:pull/12869
$ git checkout pull/12869

Update a local copy of the PR:
$ git checkout pull/12869
$ git pull https://git.openjdk.org/jdk pull/12869/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 12869

View PR using the GUI difftool:
$ git pr show -t 12869

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/12869.diff

…6ToFloat yields different result than the interpreter
@bridgekeeper
Copy link

bridgekeeper bot commented Mar 3, 2023

👋 Welcome back kvn! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Mar 3, 2023

@vnkozlov The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Mar 3, 2023
@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 4, 2023

/label remove hotspot

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 4, 2023

/label add hotspot-compiler

@openjdk openjdk bot removed the hotspot hotspot-dev@openjdk.org label Mar 4, 2023
@openjdk
Copy link

openjdk bot commented Mar 4, 2023

@vnkozlov
The hotspot label was successfully removed.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 4, 2023

/label add hotspot-runtime

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Mar 4, 2023
@openjdk
Copy link

openjdk bot commented Mar 4, 2023

@vnkozlov
The hotspot-compiler label was successfully added.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Mar 4, 2023
@openjdk
Copy link

openjdk bot commented Mar 4, 2023

@vnkozlov
The hotspot-runtime label was successfully added.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 5, 2023

GHA failure on linux-x86 in test compiler/vectorization/runner/LoopRangeStrideTest.java is due to JDK-8303105

@vnkozlov vnkozlov marked this pull request as ready for review March 6, 2023 16:02
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 6, 2023
@mlbridge
Copy link

mlbridge bot commented Mar 6, 2023

Webrevs

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 6, 2023

@fyang, please help to verify that new tests passed on RISC-V with these changes and review these changes. Thanks!

I tested x86 (64- and 32-bit) and AArch64.

@sviswa7
Copy link

sviswa7 commented Mar 7, 2023

@vnkozlov Thanks a lot for taking this up. Is the following in the PR description still true:
"Replaced SharedRuntime::f2hf() and hf2f() C runtime functions with calls to runtime stubs which use the same HW instructions as C2 intrinsics. Only for 64-bit x64 because 32-bit x86 stub does not work: result is passed through FPU register and NaN values become different from C2 intrinsic."
From the PR it looks to me that for x86_64 you have the changes in place for SharedRuntime and the same result is produced across SharedRuntime, interpreter, c1, and c2.
For x86 32-bit also things are consistent across. Only the SharedRuntime optimization doesnt happen for x86 32bit as StubRoutines::hf2f() and StubRoutines::f2hf() are set as null. The fallback is handled correctly in interpreter, c1, and c2.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

For x86 32-bit also things are consistent across. Only the SharedRuntime optimization doesnt happen for x86 32bit as StubRoutines::hf2f() and StubRoutines::f2hf() are set as null. The fallback is handled correctly in interpreter, c1, and c2.

Correct, it is consistent. Only optimization to calculate constant value during compile time is skipped. C2 will generate HW instruction for ConvF2HF node as if its input was not constant. That is it.

It is possible to add similar Stub routines for AArch64 and RISC-V to be called from C2 but I am not expert in those platforms so I skipped them.

It may be not clear but StubRoutines::hf2f() and StubRoutines::f2hf() are used only by C2 (through SharedRuntime) for constant values in ConvF2HFNode::Value() and ConvHF2FNode::Value()

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

Note, I removed ConvF2HFNode::Identity() optimization because tests show that it produces different NaN results due to skipped conversion.

Comment on lines 198 to 199
// Instruction requires different XMM registers
vcvtps2ph(tmp, src, 0x04, Assembler::AVX_128bit);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vcvtps2ph can have source and destination as same. Did you mean to say here in the comment that "Instruction requires XMM register as destination"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flt_to_flt16 is used in x86.ad instruction which requires preserving src register.
I did not want to add an other macroassembler instruction for src->src case.
I will add this to this comment.

Comment on lines 3928 to 3931
if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) {
// For results consistency both intrinsics should be enabled.
if (vmIntrinsics::is_intrinsic_available(vmIntrinsics::_float16ToFloat) &&
vmIntrinsics::is_intrinsic_available(vmIntrinsics::_floatToFloat16)) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also check for InlineIntrinsics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vmIntrinsics::disabled_by_jvm_flags() checks InlineIntrinsics. See vmIntrinsics.cpp changes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right.

return nullptr; // Generate a vanilla entry
}
// For AVX CPUs only. f16c support is disabled if UseAVX == 0.
if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check for VM_Version::supports_float16() here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And I need to remove !InlineIntrinsics check at line 340.

@@ -3874,6 +3925,15 @@ void StubGenerator::generate_initial() {
StubRoutines::_updateBytesAdler32 = generate_updateBytesAdler32();
}

if (VM_Version::supports_f16c() || VM_Version::supports_avx512vl()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check for VM_Version::supports_float16() here instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@sviswa7
Copy link

sviswa7 commented Mar 7, 2023

Note, I removed ConvF2HFNode::Identity() optimization because tests show that it produces different NaN results due to skipped conversion.

Yes, removing the Identity optimization is correct. It doesn't hold for NaN inputs.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 7, 2023
@jatin-bhateja
Copy link
Member

jatin-bhateja commented Mar 7, 2023

Hi @vnkozlov , There is some discrepancy in results b/w interpreter, C1 and C2 for following case.

public class Foo {
  public static short bar(float f) {return Float.floatToFloat16(f);}
  public static void main(String[] args) {
	 System.out.println(Float.floatToRawIntBits(Float.float16ToFloat((short) 31745)));
	 System.out.println(bar(Float.float16ToFloat((short) 31745)));
  }
}

CPROMPT>java  -Xint -cp . Foo
2143297536                                                       // FP32 QNaN + significand preserved
32257                                                            // FP16 QNaN + significand preserved
CPROMPT>java -Xbatch -Xcomp -cp . Foo
2139103232                                                       // FP32 SNaN + significand preserved
31745                                                            // FP16 SNaN + significand preserved
CPROMPT>java -XX:-TieredCompilation -Xbatch -Xcomp -cp . Foo
2139103232                                                       // FP32  SNaN + significand preserved
32257                                                            // FP16  QNaN + significand preserved.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

Hi @vnkozlov , There is some discrepancy in results b/w interpreter, C1 and C2 for following case.

And that is fine. Consistency have to be preserved only during one run. Different runs with different flags (with disabled intrinsics, for example) may produce different results.

EDIT: I should have paid more attention to the example outputs. The third (C2) run produces inconsistent result!

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

On other hand -Xint should not disable intrinsics. I will look.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

It looks like C1 compilation does not invoke intrinsics. Investigating.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

We should not allow JIT compilation of Float.float16ToFloat and Float.floatToFloat16 Java methods as we do for other math methods: abstractInterpreter.hpp#L144

What happens with @jatin-bhateja test is Float class is still not loaded when we trigger compilation for Foo::main method with -Xcomp flag.
C2 generates Uncommon trap at the very start of main() and code is deoptimized and goes to Interpreter which uses intrinsics.
C1 on other hand generates calls to Float.float16ToFloat and Float.floatToFloat16 in compiled Foo::main method and then compiles Float.float16ToFloat and Float.floatToFloat16 which are called.

The fix ix simple:

+++ b/src/hotspot/share/interpreter/abstractInterpreter.hpp
@@ -159,6 +159,8 @@ class AbstractInterpreter: AllStatic {
       case vmIntrinsics::_dexp  : // fall thru
       case vmIntrinsics::_fmaD  : // fall thru
       case vmIntrinsics::_fmaF  : // fall thru
+      case vmIntrinsics::_floatToFloat16       : // fall thru
+      case vmIntrinsics::_float16ToFloat       : // fall thru
       case vmIntrinsics::_Continuation_doYield : // fall thru
         return false;

test now produce the same result:

$ java -Xint Foo
2143297536
32257
$ java -Xcomp -XX:-TieredCompilation Foo
2143297536
32257
$ java -Xcomp -XX:TieredStopAtLevel=1 Foo
2143297536
32257

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

C2 also compiled float16ToFloat. That is why we got FP32 SNaN result with -XX:-TieredCompilation in Jatin's example.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Mar 7, 2023
@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 7, 2023

@jatin-bhateja I applied the fix. Please, verify.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 7, 2023
@RealFYang
Copy link
Member

RealFYang commented Mar 8, 2023

Hi, Thanks for handling linux-riscv64 at the same time.
Bad news is that we witnessed test failures when running following test with QEMU (no riscv hardware available with Zfhmin extension for now): test/hotspot/jtreg/compiler/intrinsics/float16/TestAllFloat16ToFloat.java

Exception in thread "main" java.lang.RuntimeException: Inconsistent result for Float.floatToFloat16(NaN/7fc00000): 7e00 != fc01
        at TestAllFloat16ToFloat.verify(TestAllFloat16ToFloat.java:62)
        at TestAllFloat16ToFloat.run(TestAllFloat16ToFloat.java:72)
        at TestAllFloat16ToFloat.main(TestAllFloat16ToFloat.java:94)

It looks like there is a problem when handling NaNs with fcvt.h.s/fmv.x.h and fmv.h.x/fcvt.s.h instructions at the bottom.
It's also possible to be an issue of QEMU as well. It would take quite a while to diagnose. But I don't want this to block this PR.
So I would prefer removing support of this feature for this port and adding back once this is resolved. And I have prepared a patch for that purpose. See attachement.
12869-revert-riscv.txt

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 8, 2023

Thank you very much @RealFYang for testing changes and preparing patch. I applied your patch.

Copy link
Member

@jatin-bhateja jatin-bhateja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vnkozlov , Thanks for explanations, looks good to me now.

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good. Minor comments/suggestions follow.

return (jshort)(sign_bit | ( ((exp + 15) << 10) + signif_bits ) );
assert(StubRoutines::f2hf() != nullptr, "floatToFloat16 intrinsic is not supported on this platform");
typedef jshort (*f2hf_stub_t)(jfloat x);
return ((f2hf_stub_t)StubRoutines::f2hf())(x);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of keeping the wrappers around? The stubs can be called directly, can't they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted isolate function type cast and assert in one place.
BTW the comment in assert should be "the stub is not implemented on this platform".

if( t == Type::FLOAT ) return TypeInt::SHORT;
if (t == Type::TOP) return Type::TOP;
if (t == Type::FLOAT) return TypeInt::SHORT;
if (StubRoutines::f2hf() == nullptr) return bottom_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this check? My understanding is ConvF2HF/ConvHF2F require intrinsification and on platforms where stubs are absent, intrinsification is disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is optimization: use stub to calculate constant value during compilation instead of generating HW instruction in compiled code. It is not required to have this stub for intensification to work - ConvF2HFNode will be processed normally and will use intrinsics code (HW instruction) defined in .ad file.
These stubs are used only here, not in C1 and not in Interpreter. As consequence these stubs implementation is optional and I implemented them only on x64. That is why I have this check.
I debated to not have them at all to not confuse people but they did improve performance a little.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarifications. Now it makes much more sense.

Still, the mix of StubRoutines::f2hf() and SharedRuntime::f2hf() looks a bit confusing.

What if you move the wrapper to StubRoutines class instead? (JRT_LEAF et al stuff looks redundant here. Also, even though there are other arithmetic operations declared on StubRoutines, they provide default implementations universally available across all platforms. f2hf case is different since it exposes a platform-specific stub and its availability is limited.)

Or encapsulate the constant folding logic (along with the guard) into SharedRuntime and return Type* (instead of int/float scalar).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or encapsulate the constant folding logic (along with the guard) into SharedRuntime and return Type* (instead of int/float scalar).

I take this particular suggestion back. SharedRuntime is compiler-agnostic while Type is C2-specific.


const TypeInt *ti = t->is_int();
if (ti->is_con()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it confusing that ConvHF2FNode::Value() has is_con() check, but ConvF2HFNode::Value()doesn't. I'd prefer to see both implementations unified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It follows the same pattern as other nodes here: ConvF2INode::Value() vs ConvI2FNode::Value().
If you want to change it we need to do that in separate RFE for all methods here.
But I don't think we need to do that because Float/Double does not have range values as Integer types.
Float have only 3 types of value: FloatTop, FloatBot, FloatCon. So we don't need to check for constant if checked for TOP and BOT. For Integer we need to check bool is_con() const { return _lo==_hi; }.

Copy link
Contributor Author

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review @iwanowww

if( t == Type::FLOAT ) return TypeInt::SHORT;
if (t == Type::TOP) return Type::TOP;
if (t == Type::FLOAT) return TypeInt::SHORT;
if (StubRoutines::f2hf() == nullptr) return bottom_type();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is optimization: use stub to calculate constant value during compilation instead of generating HW instruction in compiled code. It is not required to have this stub for intensification to work - ConvF2HFNode will be processed normally and will use intrinsics code (HW instruction) defined in .ad file.
These stubs are used only here, not in C1 and not in Interpreter. As consequence these stubs implementation is optional and I implemented them only on x64. That is why I have this check.
I debated to not have them at all to not confuse people but they did improve performance a little.


const TypeInt *ti = t->is_int();
if (ti->is_con()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It follows the same pattern as other nodes here: ConvF2INode::Value() vs ConvI2FNode::Value().
If you want to change it we need to do that in separate RFE for all methods here.
But I don't think we need to do that because Float/Double does not have range values as Integer types.
Float have only 3 types of value: FloatTop, FloatBot, FloatCon. So we don't need to check for constant if checked for TOP and BOT. For Integer we need to check bool is_con() const { return _lo==_hi; }.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 8, 2023

What if you move the wrapper to StubRoutines class instead?
Okay, I will try it.

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 9, 2023

Thank you, Sandhya, Jatin, Vladimir and Dean for review.

@vnkozlov
Copy link
Contributor Author

vnkozlov commented Mar 9, 2023

/integrate

@openjdk
Copy link

openjdk bot commented Mar 9, 2023

Going to push as commit 8cfd74f.
Since your change was applied there have been 59 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 9, 2023
@openjdk openjdk bot closed this Mar 9, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 9, 2023
@openjdk
Copy link

openjdk bot commented Mar 9, 2023

@vnkozlov Pushed as commit 8cfd74f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@vnkozlov vnkozlov deleted the 8302976 branch March 9, 2023 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

6 participants