-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8289552: Make intrinsic conversions between bit representations of half precision values and floats #9781
Conversation
👋 Welcome back svkamath! A progress list of the required criteria for merging this PR into |
Hi All, Thanks, |
@smita-kamath The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add JMH microbenchmark and show results.
@@ -447,6 +447,84 @@ JRT_LEAF(jdouble, SharedRuntime::l2d(jlong x)) | |||
return (jdouble)x; | |||
JRT_END | |||
|
|||
JRT_LEAF(jshort, SharedRuntime::f2hf(jfloat x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary/desirable to have a transliteration of the Java implementation here rather than just using the Java library implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jddarcy Thanks for your comment. I am not sure if there is a way of using Java library implementation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that if a platform didn't have special support for the functionality in question it could not have the intrinsic and just use the Java library implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jddarcy This code is for constant expression evaluation during C2 compilation.
@vnkozlov Thank you for your comment. I will add a jmh test case. |
emit_int8((unsigned char)0x1D); | ||
emit_int8((unsigned char)(0xC0 | encode)); | ||
emit_int8(imm8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do emit_int24() here.
void Assembler::evcvtps2ph(XMMRegister dst, XMMRegister src, int imm8, int vector_len) { | ||
assert(VM_Version::supports_evex(), ""); | ||
InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ false, /* no_mask_reg */ true, /*uses_vl */ true); | ||
attributes.set_is_evex_instruction(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This instruction is supported on AVX (vex encoded) platform as well.
Please rename this as vcvtps2ph and update the check to include supports_evex or supports_f16C. The is_evex_instruction check should then be remvoed.
void Assembler::evcvtph2ps(XMMRegister dst, XMMRegister src, int vector_len) { | ||
assert(VM_Version::supports_evex(), ""); | ||
InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */false, /* no_mask_reg */ true, /* uses_vl */ true); | ||
attributes.set_is_evex_instruction(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This instruction is supported on AVX (vex encoded) platform as well.
Please rename this as vcvtph2ps and update the check to include supports_evex or supports_f16C. The is_evex_instruction check should then be remvoed.
src/hotspot/cpu/x86/x86.ad
Outdated
if (!VM_Version::supports_evex() || !VM_Version::supports_avx512vl()) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include supports_f16c here.
src/hotspot/cpu/x86/x86_64.ad
Outdated
@@ -11306,6 +11306,31 @@ instruct convD2L_reg_reg(rRegL dst, regD src, rFlagsReg cr) | |||
ins_pipe(pipe_slow); | |||
%} | |||
|
|||
instruct convF2HF_reg_reg(rRegI dst, regF src, regF tmp) %{ | |||
predicate(VM_Version::supports_evex()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The predicate could be removed as the check is fully done in match_rule_supported.
src/hotspot/cpu/x86/x86_64.ad
Outdated
predicate(VM_Version::supports_evex()); | ||
effect(TEMP tmp); | ||
match(Set dst (ConvF2HF src)); | ||
format %{ "evcvtps2ph $dst,$src" %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be good to also show the tmp register in the format.
src/hotspot/cpu/x86/x86_64.ad
Outdated
%} | ||
|
||
instruct convHF2F_reg_reg(regF dst, rRegI src) %{ | ||
predicate(VM_Version::supports_evex()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The predicate could be removed as it is fully covered in the match_rule_supported.
|
||
//------------------------------Identity--------------------------------------- | ||
Node* ConvHF2FNode::Identity(PhaseGVN* phase) { | ||
return (in(1)->Opcode() == Op_ConvF2HF) ? in(1)->in(1) : this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot do this optimization:
HF2F(F2HF(x)) != x
This is because precision and range both are lost when we convert F2HF.
virtual int Opcode() const; | ||
virtual const Type *bottom_type() const { return Type::FLOAT; } | ||
virtual const Type* Value(PhaseGVN* phase) const; | ||
virtual Node* Identity(PhaseGVN* phase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The identity method could be removed from ConvHF2FNode
Also please update the unit tests to run with and without the intrinsics enabled. |
JMH results After patch |
@Param({"2048"}) | ||
public int TESTSIZE; | ||
|
||
public short[] HFargV1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use F16
instead of HF
.
e.g. f16in, f16out, fin, fout.
public class Fp16ConversionBenchmark { | ||
|
||
@Param({"2048"}) | ||
public int TESTSIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public int TESTSIZE; | |
public int size; |
|
||
@Benchmark | ||
public void floatToFloat16() { | ||
for (int i = 0; i < TESTSIZE; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the array length instead, it's more idiomatic e.g.:
for (int i = 0; i < TESTSIZE; i++) { | |
for (int i = 0; i < FargV1.length; i++) { |
int i = 0; | ||
Random r = new Random(1024); | ||
|
||
HFargV1 = new short[TESTSIZE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the test size is less than the number of special values then we will get an exception. I cannot quite decide if special values are that important here. Might be better to test separately?
for (int i = 0; i < TESTSIZE; i++) { | ||
ResHF[i] = Float.floatToFloat16(FargV1[i]); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return the result array, so it's consumed by a black hole.
@@ -25,6 +25,9 @@ | |||
* @test | |||
* @bug 8289551 | |||
* @summary Verify NaN sign and significand bits are preserved across conversions | |||
* @run main Binary16ConversionNaN | |||
* @run main/othervm/timeout=600 -XX:+UnlockDiagnosticVMOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to specify a timeout? The tests have been running fine without intrinsics, no timeouts reports.
public void floatToFloat16(Blackhole bh) { | ||
for (int i = 0; i < fin.length; i++) { | ||
f16out[i] = Float.floatToFloat16(fin[i]); | ||
} | ||
bh.consume(f16out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify to this:
public void floatToFloat16(Blackhole bh) { | |
for (int i = 0; i < fin.length; i++) { | |
f16out[i] = Float.floatToFloat16(fin[i]); | |
} | |
bh.consume(f16out); | |
public float floatToFloat16() { | |
for (int i = 0; i < fin.length; i++) { | |
f16out[i] = Float.floatToFloat16(fin[i]); | |
} | |
return f16out; |
@@ -0,0 +1,82 @@ | |||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the copyright comment to be consistent with other source, and also correct the year to be only 2022? e.g. copy that from Binary16Conversion.java
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java code looks good (HotSpot reviewers also need to approve).
@smita-kamath This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 119 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@vnkozlov, @PaulSandoz, @sviswa7, @jatin-bhateja) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general comment: since there is native runtime code to perform conversions for constants (IIRC) I think we need to some special tests to verify that works correctly (shame there is no way to leverage the Java code in such cases). It might require some advice from other HotSpot engineers, but i think we do need to follow up with another PR on that aspect.
src/hotspot/cpu/x86/x86.ad
Outdated
if (!VM_Version::supports_f16c() && !VM_Version::supports_evex() && | ||
!VM_Version::supports_avx512vl()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition should be:
if (!VM_Version::supports_f16c() &&
!(VM_Version::supports_evex() && VM_Version::supports_avx512vl())) {
return false;
}
For AVX-512 both evex and avx512vl are needed.
src/hotspot/cpu/x86/x86_64.ad
Outdated
%} | ||
ins_pipe( pipe_slow ); | ||
%} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For F2HF, good to also add optimized rule with StoreC to benefit from vcvtps2ph memory destination form of instruction.
match(Set mem (StoreC mem (ConvF2HF src)));
src/hotspot/cpu/x86/x86_64.ad
Outdated
%} | ||
ins_pipe( pipe_slow ); | ||
%} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For HF2F, good to also add optimized rule with LoadS to benefit from vcvtph2ps memory src form of instruction.
match(Set dst (ConvHF2F ( LoadS mem)));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sviswa7 , with this rule, I dont see additional performance gain. So I have not added it in the latest update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding JMH benchmarks and data. I submitted testing.
@smita-kamath I have builds failures. Please, build and test yourself to verify changes.
|
Hi @vnkozlov, I apologize for the inconvenience. However, I don't see any build failures on Windows or Linux. Can you please let me know the platform for which your build fails? Thanks a lot. |
@vnkozlov I also don't see any build issue. I tried creating a dummy draft PR under my fork to get the GHA to run. But looks like now-a-days GHA doesn't run on draft PR. |
Hi, would you be adding IR tests to verify the generation of the the newly introduced IR nodes? |
ins_pipe( pipe_slow ); | ||
%} | ||
|
||
instruct convF2HF_mem_reg(memory mem, regF src, kReg ktmp, rRegI rtmp) %{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use kmovwl
instead which will relax the avx512bw constraint, however, you will need avx512vl for evcvtps2ph
. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rethink about it, you can get 0x01 by right shifting k0 to the right - kshiftrw(ktmp, k0, 15)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@merykitty Thanks for the suggestion. I will update the instruct to use kmovwl. I will also experiment with kshiftrw and let you know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use
kmovwl
instead which will relax the avx512bw constraint, however, you will need avx512vl forevcvtps2ph
. Thanks.
Yes, in general all AVX512VL targets support AVX512BW, but cloud instances give freedom to enable custom features. Regarding K0, as per section "15.6.1.1" of SDM, expectation is that K0 can appear in source and destination of regular non predication context, k0 should always contain all true mask so it should be unmodifiable for subsequent usages i.e. should not be present as destination of a mask manipulating instruction. Your suggestion is to have that in source but it may not work either. Changing existing sequence to use kmovw and replace AVX512BW with AVX512VL will again mean introducing an additional predication check for this pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I get it, the encoding of k0 is treated specially in predicated instructions to refer to an all-set mask, but the register itself may not actually contain that value. So usage in kshiftrw
may fail. In that case I think we can generate an all-set mask on the fly using kxnorw(ktmp, ktmp, ktmp)
to save a GPR in this occasion. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @merykitty, I am seeing performance regression with kxnorw instruction. So I have updated the PR with kmovwl. Thanks.
@sviswa7 The failure is due to JDK-8293618, @smita-kamath please merge with master. Thanks. |
Yes, I tested with latest JDK sources which includes JDK-8293618. |
@Bhavana-Kilambi, I plan to do this in a separate PR along with the gtest. Here's the bug https://bugs.openjdk.org/browse/JDK-8293323. |
@vnkozlov, I have implemented all of the reviewers comments. Could you kindly test this patch? Thanks a lot for your help. |
I started new testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest version v12 passed my tier1-4 testing. Good.
@vnkozlov Thank you for reviewing the patch. |
/integrate |
@smita-kamath |
/sponsor |
Going to push as commit 07946aa.
Your commit was automatically rebased without conflicts. |
@sviswa7 @smita-kamath Pushed as commit 07946aa. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
There is an issue with the |
@smita-kamath I think I just found another regression of this feature: https://bugs.openjdk.org/browse/JDK-8338126 |
@eme64, Sure will look into it. Thank you. |
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/9781/head:pull/9781
$ git checkout pull/9781
Update a local copy of the PR:
$ git checkout pull/9781
$ git pull https://git.openjdk.org/jdk pull/9781/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 9781
View PR using the GUI difftool:
$ git pr show -t 9781
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/9781.diff