Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8289552: Make intrinsic conversions between bit representations of half precision values and floats #9781

Closed
wants to merge 14 commits into from

Conversation

smita-kamath
Copy link

@smita-kamath smita-kamath commented Aug 5, 2022


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8289552: Make intrinsic conversions between bit representations of half precision values and floats

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9781/head:pull/9781
$ git checkout pull/9781

Update a local copy of the PR:
$ git checkout pull/9781
$ git pull https://git.openjdk.org/jdk pull/9781/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 9781

View PR using the GUI difftool:
$ git pr show -t 9781

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9781.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Aug 5, 2022

👋 Welcome back svkamath! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@smita-kamath
Copy link
Author

smita-kamath commented Aug 5, 2022

Hi All,
This change is to provide intrinsic implementation for conversion between floats and half precision floating point formats.
Kindly review and share your feedback.

Thanks,
Smita

@openjdk
Copy link

openjdk bot commented Aug 5, 2022

@smita-kamath The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot hotspot-dev@openjdk.org core-libs core-libs-dev@openjdk.org labels Aug 5, 2022
@smita-kamath smita-kamath marked this pull request as ready for review August 5, 2022 17:11
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 5, 2022
@mlbridge
Copy link

mlbridge bot commented Aug 5, 2022

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add JMH microbenchmark and show results.

@@ -447,6 +447,84 @@ JRT_LEAF(jdouble, SharedRuntime::l2d(jlong x))
return (jdouble)x;
JRT_END

JRT_LEAF(jshort, SharedRuntime::f2hf(jfloat x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary/desirable to have a transliteration of the Java implementation here rather than just using the Java library implementation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jddarcy Thanks for your comment. I am not sure if there is a way of using Java library implementation here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that if a platform didn't have special support for the functionality in question it could not have the intrinsic and just use the Java library implementation?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jddarcy This code is for constant expression evaluation during C2 compilation.

@smita-kamath
Copy link
Author

@vnkozlov Thank you for your comment. I will add a jmh test case.

Comment on lines 1929 to 1931
emit_int8((unsigned char)0x1D);
emit_int8((unsigned char)(0xC0 | encode));
emit_int8(imm8);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do emit_int24() here.

Comment on lines 1924 to 1927
void Assembler::evcvtps2ph(XMMRegister dst, XMMRegister src, int imm8, int vector_len) {
assert(VM_Version::supports_evex(), "");
InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /*uses_vl */ true);
attributes.set_is_evex_instruction();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instruction is supported on AVX (vex encoded) platform as well.
Please rename this as vcvtps2ph and update the check to include supports_evex or supports_f16C. The is_evex_instruction check should then be remvoed.

Comment on lines 1934 to 1937
void Assembler::evcvtph2ps(XMMRegister dst, XMMRegister src, int vector_len) {
assert(VM_Version::supports_evex(), "");
InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */false, /* no_mask_reg */ true, /* uses_vl */ true);
attributes.set_is_evex_instruction();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This instruction is supported on AVX (vex encoded) platform as well.
Please rename this as vcvtph2ps and update the check to include supports_evex or supports_f16C. The is_evex_instruction check should then be remvoed.

Comment on lines 1685 to 1686
if (!VM_Version::supports_evex() || !VM_Version::supports_avx512vl()) {
return false;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include supports_f16c here.

@@ -11306,6 +11306,31 @@ instruct convD2L_reg_reg(rRegL dst, regD src, rFlagsReg cr)
ins_pipe(pipe_slow);
%}

instruct convF2HF_reg_reg(rRegI dst, regF src, regF tmp) %{
predicate(VM_Version::supports_evex());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The predicate could be removed as the check is fully done in match_rule_supported.

predicate(VM_Version::supports_evex());
effect(TEMP tmp);
match(Set dst (ConvF2HF src));
format %{ "evcvtps2ph $dst,$src" %}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to also show the tmp register in the format.

%}

instruct convHF2F_reg_reg(regF dst, rRegI src) %{
predicate(VM_Version::supports_evex());
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The predicate could be removed as it is fully covered in the match_rule_supported.


//------------------------------Identity---------------------------------------
Node* ConvHF2FNode::Identity(PhaseGVN* phase) {
return (in(1)->Opcode() == Op_ConvF2HF) ? in(1)->in(1) : this;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do this optimization:
HF2F(F2HF(x)) != x
This is because precision and range both are lost when we convert F2HF.

virtual int Opcode() const;
virtual const Type *bottom_type() const { return Type::FLOAT; }
virtual const Type* Value(PhaseGVN* phase) const;
virtual Node* Identity(PhaseGVN* phase);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The identity method could be removed from ConvHF2FNode

@PaulSandoz
Copy link
Member

Please add JMH microbenchmark and show results.

Also please update the unit tests to run with and without the intrinsics enabled.

@smita-kamath
Copy link
Author

JMH results
Before patch
Benchmark (TESTSIZE) Mode Cnt Score Error Units
Fp16ConversionBenchmark.float16ToFloat 2048 thrpt 3 359.680 ± 0.292 ops/ms
Fp16ConversionBenchmark.floatToFloat16 2048 thrpt 3 236.153 ± 8.164 ops/ms

After patch
Benchmark (TESTSIZE) Mode Cnt Score Error Units
Fp16ConversionBenchmark.float16ToFloat 2048 thrpt 3 1046.233 ± 0.019 ops/ms
Fp16ConversionBenchmark.floatToFloat16 2048 thrpt 3 1752.046 ± 0.058 ops/ms

@Param({"2048"})
public int TESTSIZE;

public short[] HFargV1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use F16 instead of HF.
e.g. f16in, f16out, fin, fout.

public class Fp16ConversionBenchmark {

@Param({"2048"})
public int TESTSIZE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public int TESTSIZE;
public int size;


@Benchmark
public void floatToFloat16() {
for (int i = 0; i < TESTSIZE; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the array length instead, it's more idiomatic e.g.:

Suggested change
for (int i = 0; i < TESTSIZE; i++) {
for (int i = 0; i < FargV1.length; i++) {

int i = 0;
Random r = new Random(1024);

HFargV1 = new short[TESTSIZE];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test size is less than the number of special values then we will get an exception. I cannot quite decide if special values are that important here. Might be better to test separately?

for (int i = 0; i < TESTSIZE; i++) {
ResHF[i] = Float.floatToFloat16(FargV1[i]);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return the result array, so it's consumed by a black hole.

@@ -25,6 +25,9 @@
* @test
* @bug 8289551
* @summary Verify NaN sign and significand bits are preserved across conversions
* @run main Binary16ConversionNaN
* @run main/othervm/timeout=600 -XX:+UnlockDiagnosticVMOptions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to specify a timeout? The tests have been running fine without intrinsics, no timeouts reports.

Comment on lines 69 to 73
public void floatToFloat16(Blackhole bh) {
for (int i = 0; i < fin.length; i++) {
f16out[i] = Float.floatToFloat16(fin[i]);
}
bh.consume(f16out);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify to this:

Suggested change
public void floatToFloat16(Blackhole bh) {
for (int i = 0; i < fin.length; i++) {
f16out[i] = Float.floatToFloat16(fin[i]);
}
bh.consume(f16out);
public float floatToFloat16() {
for (int i = 0; i < fin.length; i++) {
f16out[i] = Float.floatToFloat16(fin[i]);
}
return f16out;

@@ -0,0 +1,82 @@
//
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the copyright comment to be consistent with other source, and also correct the year to be only 2022? e.g. copy that from Binary16Conversion.java.

Copy link
Member

@PaulSandoz PaulSandoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java code looks good (HotSpot reviewers also need to approve).

@openjdk
Copy link

openjdk bot commented Aug 25, 2022

@smita-kamath This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8289552: Make intrinsic conversions between bit representations of half precision values and floats

Reviewed-by: kvn, sviswanathan, jbhateja

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 119 new commits pushed to the master branch:

  • 2586b1a: 8295155: Incorrect javadoc of java.base module
  • e1a77cf: 8295163: Remove old hsdis Makefile
  • 3c7ae12: 8294821: Class load improvement for AES crypto engine
  • 619cd82: 8294702: BufferedInputStream uses undefined value range for markpos
  • 9d0009e: 6777156: GTK L&F: JFileChooser can jump beyond root directory in combobox and selection textarea.
  • 3ebe5ad: 8294751: Zero: Allow larger default heaps
  • 33d0618: 6616245: NullPointerException when using JFileChooser with a custom FileView
  • fba763f: 8291519: jdk/jfr/api/event/TestShouldCommit.java failed with Unexpected value of shouldCommit()
  • 6053bf0: 8293782: Shenandoah: some tests failed on lock rank check
  • 4435d56: 8282395: URL.openConnection can throw IOOBE
  • ... and 109 more: https://git.openjdk.org/jdk/compare/375f02fb21ae37c381229e2a28b1f26e3cb926d4...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @PaulSandoz, @sviswa7, @jatin-bhateja) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 25, 2022
Copy link
Member

@PaulSandoz PaulSandoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general comment: since there is native runtime code to perform conversions for constants (IIRC) I think we need to some special tests to verify that works correctly (shame there is no way to leverage the Java code in such cases). It might require some advice from other HotSpot engineers, but i think we do need to follow up with another PR on that aspect.

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Aug 25, 2022
Comment on lines 1685 to 1686
if (!VM_Version::supports_f16c() && !VM_Version::supports_evex() &&
!VM_Version::supports_avx512vl()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition should be:
if (!VM_Version::supports_f16c() &&
!(VM_Version::supports_evex() && VM_Version::supports_avx512vl())) {
return false;
}

For AVX-512 both evex and avx512vl are needed.

%}
ins_pipe( pipe_slow );
%}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For F2HF, good to also add optimized rule with StoreC to benefit from vcvtps2ph memory destination form of instruction.
match(Set mem (StoreC mem (ConvF2HF src)));

%}
ins_pipe( pipe_slow );
%}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For HF2F, good to also add optimized rule with LoadS to benefit from vcvtph2ps memory src form of instruction.
match(Set dst (ConvHF2F ( LoadS mem)));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sviswa7 , with this rule, I dont see additional performance gain. So I have not added it in the latest update.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sounds good.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding JMH benchmarks and data. I submitted testing.

@vnkozlov
Copy link
Contributor

vnkozlov commented Sep 29, 2022

@smita-kamath I have builds failures. Please, build and test yourself to verify changes.

src/hotspot/cpu/x86/assembler_x86.cpp: In member function 'void Assembler::evcvtps2ph(Address, KRegister, XMMRegister, int, int)':
src/hotspot/cpu/x86/assembler_x86.cpp:1950:15: error: no matching function for call to 'Assembler::emit_operand(XMMRegister&, Address&)'
 emit_operand(src, dst);

@smita-kamath
Copy link
Author

Hi @vnkozlov, I apologize for the inconvenience. However, I don't see any build failures on Windows or Linux. Can you please let me know the platform for which your build fails? Thanks a lot.

@sviswa7
Copy link

sviswa7 commented Sep 30, 2022

@vnkozlov I also don't see any build issue. I tried creating a dummy draft PR under my fork to get the GHA to run. But looks like now-a-days GHA doesn't run on draft PR.

@sviswa7
Copy link

sviswa7 commented Sep 30, 2022

@vnkozlov I spoke too soon. All the GHA tests passed in the dummy draft PR I created using Smita's patch:
#10500
Please take a look. No build failures reported and all tier1 tests passed.

@Bhavana-Kilambi
Copy link
Contributor

Hi, would you be adding IR tests to verify the generation of the the newly introduced IR nodes?

ins_pipe( pipe_slow );
%}

instruct convF2HF_mem_reg(memory mem, regF src, kReg ktmp, rRegI rtmp) %{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use kmovwl instead which will relax the avx512bw constraint, however, you will need avx512vl for evcvtps2ph. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rethink about it, you can get 0x01 by right shifting k0 to the right - kshiftrw(ktmp, k0, 15)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merykitty Thanks for the suggestion. I will update the instruct to use kmovwl. I will also experiment with kshiftrw and let you know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use kmovwl instead which will relax the avx512bw constraint, however, you will need avx512vl for evcvtps2ph. Thanks.

Yes, in general all AVX512VL targets support AVX512BW, but cloud instances give freedom to enable custom features. Regarding K0, as per section "15.6.1.1" of SDM, expectation is that K0 can appear in source and destination of regular non predication context, k0 should always contain all true mask so it should be unmodifiable for subsequent usages i.e. should not be present as destination of a mask manipulating instruction. Your suggestion is to have that in source but it may not work either. Changing existing sequence to use kmovw and replace AVX512BW with AVX512VL will again mean introducing an additional predication check for this pattern.

Copy link
Member

@merykitty merykitty Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I get it, the encoding of k0 is treated specially in predicated instructions to refer to an all-set mask, but the register itself may not actually contain that value. So usage in kshiftrw may fail. In that case I think we can generate an all-set mask on the fly using kxnorw(ktmp, ktmp, ktmp) to save a GPR in this occasion. Thanks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @merykitty, I am seeing performance regression with kxnorw instruction. So I have updated the PR with kmovwl. Thanks.

@merykitty
Copy link
Member

merykitty commented Sep 30, 2022

@sviswa7 The failure is due to JDK-8293618, @smita-kamath please merge with master. Thanks.

@vnkozlov
Copy link
Contributor

@sviswa7 The failure is due to JDK-8293618, @smita-kamath please merge with master. Thanks.

Yes, I tested with latest JDK sources which includes JDK-8293618.

@smita-kamath
Copy link
Author

@Bhavana-Kilambi, I plan to do this in a separate PR along with the gtest. Here's the bug https://bugs.openjdk.org/browse/JDK-8293323.

@smita-kamath
Copy link
Author

@vnkozlov, I have implemented all of the reviewers comments. Could you kindly test this patch? Thanks a lot for your help.

@vnkozlov
Copy link
Contributor

I started new testing.

Copy link
Contributor

@vnkozlov vnkozlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest version v12 passed my tier1-4 testing. Good.

@smita-kamath
Copy link
Author

@vnkozlov Thank you for reviewing the patch.

@smita-kamath
Copy link
Author

/integrate

@openjdk openjdk bot added the sponsor Pull request is ready to be sponsored label Oct 11, 2022
@openjdk
Copy link

openjdk bot commented Oct 11, 2022

@smita-kamath
Your change (at version a00c3ec) is now ready to be sponsored by a Committer.

@sviswa7
Copy link

sviswa7 commented Oct 11, 2022

/sponsor

@openjdk
Copy link

openjdk bot commented Oct 11, 2022

Going to push as commit 07946aa.
Since your change was applied there have been 119 commits pushed to the master branch:

  • 2586b1a: 8295155: Incorrect javadoc of java.base module
  • e1a77cf: 8295163: Remove old hsdis Makefile
  • 3c7ae12: 8294821: Class load improvement for AES crypto engine
  • 619cd82: 8294702: BufferedInputStream uses undefined value range for markpos
  • 9d0009e: 6777156: GTK L&F: JFileChooser can jump beyond root directory in combobox and selection textarea.
  • 3ebe5ad: 8294751: Zero: Allow larger default heaps
  • 33d0618: 6616245: NullPointerException when using JFileChooser with a custom FileView
  • fba763f: 8291519: jdk/jfr/api/event/TestShouldCommit.java failed with Unexpected value of shouldCommit()
  • 6053bf0: 8293782: Shenandoah: some tests failed on lock rank check
  • 4435d56: 8282395: URL.openConnection can throw IOOBE
  • ... and 109 more: https://git.openjdk.org/jdk/compare/375f02fb21ae37c381229e2a28b1f26e3cb926d4...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 11, 2022
@openjdk openjdk bot closed this Oct 11, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 11, 2022
@openjdk
Copy link

openjdk bot commented Oct 11, 2022

@sviswa7 @smita-kamath Pushed as commit 07946aa.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@TobiHartmann
Copy link
Member

There is an issue with the _floatToFloat16 intrinsic, leading to incorrect results: JDK-8302976. @smita-kamath, could you please have a look? Thanks.

@eme64
Copy link
Contributor

eme64 commented Aug 9, 2024

@smita-kamath I think I just found another regression of this feature: https://bugs.openjdk.org/browse/JDK-8338126
Can you please have a look?

@smita-kamath
Copy link
Author

@eme64, Sure will look into it. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-libs core-libs-dev@openjdk.org hotspot hotspot-dev@openjdk.org hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

10 participants