
Conversation

@jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented May 22, 2023

Starting with the 4th Generation Xeon Scalable processors, Intel has made extensive extensions to the existing ISA to support 16-bit scalar and vector floating-point operations based on the IEEE 754 binary16 format.

We plan to support this in multiple stages, spanning the Java-side definition of the Float16 type, scalar operations, and finally SLP vectorization support.

This patch adds minimal Java- and compiler-side support for one API, Float16.add.

Summary of changes:

  • Minimal implementation of the Float16 primitive class supporting one operation (Float16.add).
  • X86 AVX512-FP16 feature detection at VM startup.
  • C2 IR and inline expander changes for the Float16.add API.
  • FP16 constant folding handling.
  • Backend support: instruction selection patterns and assembler support.
  • New IR framework and functional tests.

Implementation details:

1/ The newly defined Float16 class encapsulates a short value holding an IEEE 754 binary16 encoded value.

2/ Float16 is a primitive class which will in future be aligned with the other enhanced primitive wrapper classes proposed by JEP 402.

3/ Float16 will support all the operations supported by the corresponding Float class.

4/ The Java implementation of each API will internally perform the floating-point operation at FP32 granularity.
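
For illustration, a minimal standalone sketch of these semantics, assuming a C++ compiler with the _Float16 extension (GCC/Clang on x86); this models the behavior described above and is not the JDK implementation:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret the raw 16-bit payloads as binary16 values, widen to binary32,
// add at FP32 granularity, then narrow back to binary16 and return raw bits.
static uint16_t float16_add(uint16_t a_bits, uint16_t b_bits) {
  _Float16 a, b;
  std::memcpy(&a, &a_bits, sizeof a);
  std::memcpy(&b, &b_bits, sizeof b);
  float sum = (float)a + (float)b;       // compute at FP32 granularity
  _Float16 narrowed = (_Float16)sum;     // round back to binary16
  uint16_t out;
  std::memcpy(&out, &narrowed, sizeof out);
  return out;
}

int main() {
  uint16_t one = 0x3C00, two = 0x4000;   // binary16 encodings of 1.0 and 2.0
  std::printf("0x%04X\n", float16_add(one, two)); // prints 0x4200, i.e. 3.0
  return 0;
}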

5/ Any API which can be directly mapped to an Intel AVX512-FP16 instruction will be a candidate for intrinsification by the C2 compiler.

6/ With Valhalla, the C2 compiler always creates an InlineType IR node for a value class instance.
The total number of inputs of an InlineType node matches the number of non-static fields; in this case the node will have one input of short type (TypeInt::SHORT).

7/ Since all the scalar AVX512-FP16 instructions operate on floating-point registers while the Float16 backing storage is held in a general-purpose register, we need to introduce appropriate conversion IR nodes that move a 16-bit value from a GPR to an XMM register and vice versa.

8/ The current plan is to introduce a new IR node for each operation as a subclass of its corresponding single-precision IR node. This allows leveraging the idealization routines (Ideal/Identity/Value) of the parent operation.
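
As a toy model of this inheritance scheme (illustrative class names only, not the actual HotSpot node hierarchy), the half-precision node contributes nothing beyond its own identity and reuses the parent's folding logic:

#include <cstdio>

// Stand-in for the single-precision parent node; in HotSpot terms this is
// where the Ideal/Identity/Value idealization routines live.
struct AddF {
  virtual ~AddF() = default;
  virtual float value(float a, float b) const { return a + b; }
};

// Half-precision child: inherits the parent's folding unchanged, which is
// sound because every FP16 value is exactly representable in FP32.
struct AddHF : AddF {};

int main() {
  AddHF add_hf;
  std::printf("%g\n", add_hf.value(1.5f, 2.5f)); // 4, folded via AddF logic
  return 0;
}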

9/ All single/double-precision IR nodes carry a Type::FLOAT/Type::DOUBLE ideal type. This represents the entire FP32/FP64 value range and differs from the integral types, which explicitly record the lower and upper bounds of their value ranges. Value resolution routines operating over integral types calibrate the lattice associated with an IR node based on a flow function depicting the semantics of the operation.
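
For contrast, a toy interval flow function of the kind used for integral types (purely illustrative; HotSpot's TypeInt lattice is richer than this). The add derives the result's bounds from its operands' bounds, whereas Type::FLOAT makes no such attempt for FP values:

#include <cstdio>

// A minimal interval lattice: the integral analogue of recording lower and
// upper bounds on an IR node's value.
struct Interval { int lo, hi; };

// Flow function for addition: result bounds follow from the operand bounds.
static Interval add(Interval a, Interval b) {
  return { a.lo + b.lo, a.hi + b.hi };
}

int main() {
  Interval x{0, 10}, y{-3, 3};
  Interval r = add(x, y);
  std::printf("[%d, %d]\n", r.lo, r.hi); // prints [-3, 13]
  return 0;
}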

10/ There are two separate ideal types for floating-point constants, TypeF and TypeD; these are singleton types and represent a unique single/double-precision floating-point value.

11/ Ideal types are linked to specific register classes and determine the allocation set for the register allocator.

12/ Since the FP16 value range is a proper subset of the FP32 value range, all newly created IR nodes will still carry the Type::FLOAT type.

13/ To handle constant-folding scenarios, the Value routines associated with the newly introduced conversion IR nodes convert an FP16 constant to an FP32 constant (TypeF) value using existing runtime helpers. Constant folding is then performed by the existing idealization routines of the parent IR nodes.
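
A sketch of that folding path (again assuming the _Float16 extension; the hf2f/f2hf names below are illustrative stand-ins for the runtime helpers mentioned above, not HotSpot's actual functions):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Analogue of a conversion node's Value routine: an FP16 constant becomes
// the equivalent FP32 constant so the parent node's folding can run on it.
static float hf2f(uint16_t bits) {
  _Float16 h;
  std::memcpy(&h, &bits, sizeof h);
  return (float)h;
}

// Analogue of folding the reverse conversion back to a 16-bit constant.
static uint16_t f2hf(float f) {
  _Float16 h = (_Float16)f;
  uint16_t bits;
  std::memcpy(&bits, &h, sizeof bits);
  return bits;
}

int main() {
  // Fold Float16.add(1.0, 2.0): both constants widen to FP32, the parent
  // add folds 1.0f + 2.0f, and the result narrows back to binary16.
  float folded = hf2f(0x3C00) + hf2f(0x4000);
  std::printf("0x%04X\n", f2hf(folded)); // prints 0x4200 == binary16 3.0
  return 0;
}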

14/ The auto-vectorizer scans packs of scalar IR nodes to create vector IR.

15/ Functional and performance tests will be created for each newly supported operation.

The above design is generic and should work for other reduced-precision types such as FP8.

Please let me know your views on this.

Best Regards,
Jatin

PS: Tests instantiating the Float16 class must currently be compiled with -XDenablePrimitiveClasses; we need to relax this constraint for known primitive classes in the language compiler. JEP 402, which proposes converting the existing primitive box classes to value classes, should address this limitation.


Progress

  • Change must not contain extraneous whitespace

Issue

  • JDK-8308363: Initial compiler support for FP16 scalar operations. (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/valhalla.git pull/848/head:pull/848
$ git checkout pull/848

Update a local copy of the PR:
$ git checkout pull/848
$ git pull https://git.openjdk.org/valhalla.git pull/848/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 848

View PR using the GUI difftool:
$ git pr show -t 848

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/valhalla/pull/848.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 22, 2023

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into lworld+fp16 will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented May 22, 2023

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8308363: Initial compiler support for FP16 scalar operations.

Reviewed-by: sviswanathan

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the lworld+fp16 branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the lworld+fp16 branch, type /integrate in a new comment.

@openjdk openjdk bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels May 22, 2023
@mlbridge

mlbridge bot commented May 22, 2023

Comment on lines 5175 to +5176
// For now, compare to class file version 51 so old verifier doesn't see Q signatures.
if ( (_major_version < 51 /* CONSTANT_CLASS_DESCRIPTORS */ ) || (!EnablePrimitiveClasses)) {
if ( (_major_version < 51 /* CONSTANT_CLASS_DESCRIPTORS */ ) || (!EnablePrimitiveClasses && !is_jdk_internal_class_sig(signature))) {


The last condition (!EnablePrimitiveClasses && !is_jdk_internal_class_sig(signature)) seems not right to me. Assume there is a JDK-internal primitive class defined and EnablePrimitiveClasses is disabled; then the result of (!EnablePrimitiveClasses && !is_jdk_internal_class_sig(signature)) is false (i.e. true && false), so the error that follows is not printed when the class file version is >= 51. But it should report the error since EnablePrimitiveClasses is disabled. So it should be:

if ( (_major_version < 51 /* CONSTANT_CLASS_DESCRIPTORS */ ) || !EnablePrimitiveClasses || !is_jdk_internal_class_sig(signature))

right?

Member Author

The idea here is to relax the need to use an explicit JVM flag -XX:+EnablePrimitiveClasses for primitive classes known to the VM.
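
For reference, a minimal standalone check of the two predicates under discussion (variable names shortened; this is a model, not the HotSpot code) prints where they agree and where they differ:

#include <cstdio>

// enable       ~ EnablePrimitiveClasses
// jdk_internal ~ is_jdk_internal_class_sig(signature)
int main() {
  for (int enable = 0; enable <= 1; enable++) {
    for (int jdk_internal = 0; jdk_internal <= 1; jdk_internal++) {
      bool committed = !enable && !jdk_internal;  // skips the error for JDK-internal sigs
      bool proposed  = !enable || !jdk_internal;  // reviewer's suggested variant
      std::printf("enable=%d jdk_internal=%d committed=%d proposed=%d\n",
                  enable, jdk_internal, committed, proposed);
    }
  }
  return 0;
}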

Comment on lines 200 to 201
do_intrinsic(_add_float16, java_lang_Float16, add_name, floa16_float16_signature, F_R) \
do_name(add_name, "add") \


style: please remove extra spaces between add_name and "add".

};

//------------------------------AddHFNode---------------------------------------
// Add 2 floats


Change to: Add 2 half-precision floats ?

Comment on lines 178 to 179
public:
ReinterpretS2HFNode( Node *in1 ) : Node(0,in1) {}


Suggest changes:

ReinterpretS2HFNode(Node* in1) : Node(0, in1) {}

Comment on lines 180 to 181
virtual int Opcode() const;
virtual const Type *bottom_type() const { return Type::FLOAT; }


Suggest to:

virtual const Type* bottom_type() const { return Type::FLOAT; }

Comment on lines 188 to 191
ReinterpretHF2SNode( Node *in1 ) : Node(0,in1) {}
virtual int Opcode() const;
virtual const Type* Value(PhaseGVN* phase) const;
virtual const Type *bottom_type() const { return TypeInt::SHORT; }


Same as above.

}

bool LibraryCallKit::inline_fp16_operations(vmIntrinsics::ID id) {
Node* result = NULL;


Please use nullptr instead of NULL.

@jatin-bhateja
Member Author

jatin-bhateja commented Sep 11, 2023

vecAddFP16.txt
Performance data for the Float16 vector ADD operation on Sapphire Rapids with different configurations.


@sviswa7 sviswa7 left a comment


Very good work in general. I only have a couple of comments in the code; please take a look. Also, there might be some further optimization opportunities for the path that comes from Op_ConvF2HF -> ReinterpretS2HF. ConvF2HF does the conversion from xmm to xmm register and then moves the xmm to a gpr; ReinterpretS2HF then moves from the gpr back to an xmm. This unnecessary movement from xmm->gpr and from gpr->xmm could be optimized out.


void Assembler::evaddph(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
assert(VM_Version::supports_avx512_fp16(), "requires AVX512-FP16");
InstructionAttr attributes(vector_len, false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false);

uses_vl should be true here. Also, I am wondering if we could create a generic method like fp16varithop() with most of the boilerplate common code and call it from individual instructions like add, sub, etc. This generic method we could create either now or when we include additional fp16 instructions.

Comment on lines +200 to +201
do_intrinsic(_sum_float16, java_lang_Float16, sum_name, floa16_float16_signature, F_S) \
do_name(sum_name, "sum") \

could this be called add instead of sum?

Member Author

could this be called add instead of sum?

I made this change after Paul's offline comments, to align it with the Float.sum API.


$1_FLAGS += -g -Xlint:all $$($1_TARGET_RELEASE) $$(PARANOIA_FLAGS) $$(JAVA_WARNINGS_ARE_ERRORS)
$1_FLAGS += $$($1_JAVAC_FLAGS)
$1_FLAGS += $$($1_JAVAC_FLAGS) -XDenablePrimitiveClasses

Do we need this change now that we have special handling in the VM for Float16?

Member Author

Do we need this change now that we have special handling in the VM for Float16?

Yes, Float16 being a primitive class needs this flag during build-time compilation.

format %{ "vmovw $dst, $src" %}
ins_encode %{
__ vmovw($dst$$Register, $src$$XMMRegister);
__ movswl($dst$$Register, $dst$$Register);

Could we do without movswl here?

Member Author

Could we do without movswl here?

This was done keeping in mind JVM semantics, where byte/short values are internally promoted to int, since JVM operands and local variables are 32-bit values.


But this is actually a float16 here (the short type only indicates the 16-bit storage), and it is not expected to have any integral operation performed directly on it. The movswl would unnecessarily add to the path length.

Member Author

But this is actually a float16 here (the short type only indicates the 16-bit storage), and it is not expected to have any integral operation performed directly on it. The movswl would unnecessarily add to the path length.

We do have an API to return the raw value, and a direct comparison between raw values is done at JVM word (32-bit) granularity. I was planning to remove the additional sign-extending move when we add Float16.compare(short, short), but I think it's better to fix the tests for now and remove this extra instruction.
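
To illustrate the concern, a small standalone example (not HotSpot code) shows how zero- vs sign-extending the same raw binary16 payload yields different 32-bit words once the sign bit is set, which is what a word-granularity raw compare would trip over:

#include <cstdint>
#include <cstdio>

int main() {
  uint16_t raw = 0xBC00;             // binary16 -1.0; sign bit set
  int32_t sext = (int16_t)raw;       // movswl-style sign extension
  int32_t zext = raw;                // plain zero extension
  std::printf("sext=0x%08X zext=0x%08X equal=%d\n",
              (uint32_t)sext, (uint32_t)zext, sext == zext ? 1 : 0);
  return 0;                          // prints sext=0xFFFFBC00 zext=0x0000BC00 equal=0
}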

@jatin-bhateja
Member Author

Very good work in general. I only have a couple of comments in the code; please take a look. Also, there might be some further optimization opportunities for the path that comes from Op_ConvF2HF -> ReinterpretS2HF. ConvF2HF does the conversion from xmm to xmm register and then moves the xmm to a gpr; ReinterpretS2HF then moves from the gpr back to an xmm. This unnecessary movement from xmm->gpr and from gpr->xmm could be optimized out.

Yes, but this can only be done if we factor the xmm -> gpr movement out of ConvF2HF; its input is an IEEE 754 binary32 floating-point value and its output is a binary16 value held in a GPR. It would not be proper to remove ConvF2HF + ReinterpretS2HF via a direct ideal transformation currently.

@sviswa7

sviswa7 commented Sep 18, 2023

Very good work in general. I only have a couple of comments in the code; please take a look. Also, there might be some further optimization opportunities for the path that comes from Op_ConvF2HF -> ReinterpretS2HF. ConvF2HF does the conversion from xmm to xmm register and then moves the xmm to a gpr; ReinterpretS2HF then moves from the gpr back to an xmm. This unnecessary movement from xmm->gpr and from gpr->xmm could be optimized out.

Yes, but this can only be done if we factor the xmm -> gpr movement out of ConvF2HF; its input is an IEEE 754 binary32 floating-point value and its output is a binary16 value held in a GPR. It would not be proper to remove ConvF2HF + ReinterpretS2HF via a direct ideal transformation currently.

One path is having an instruct with a match rule for this, something like below I think:
match(Set dst (ReinterpretS2HF (ConvF2HF src)));

@jatin-bhateja
Member Author

Very good work in general. I only have a couple of comments in the code; please take a look. Also, there might be some further optimization opportunities for the path that comes from Op_ConvF2HF -> ReinterpretS2HF. ConvF2HF does the conversion from xmm to xmm register and then moves the xmm to a gpr; ReinterpretS2HF then moves from the gpr back to an xmm. This unnecessary movement from xmm->gpr and from gpr->xmm could be optimized out.

Yes, but this can only be done if we factor the xmm -> gpr movement out of ConvF2HF; its input is an IEEE 754 binary32 floating-point value and its output is a binary16 value held in a GPR. It would not be proper to remove ConvF2HF + ReinterpretS2HF via a direct ideal transformation currently.

One path is having an instruct with a match rule for this, something like below I think: match(Set dst (ReinterpretS2HF (ConvF2HF src)));

Agreed; the pattern match will break if ConvF2HF is shared across multiple nodes, but it's still better to have this than no optimization.

@jatin-bhateja
Member Author

Hi @sviswa7, @XiaohongGong, all your comments have been addressed. I am committing this patch.

Best Regards

@jatin-bhateja
Member Author

/integrate

@openjdk

openjdk bot commented Sep 19, 2023

Going to push as commit f03fb4e.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 19, 2023
@openjdk openjdk bot closed this Sep 19, 2023
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 19, 2023
@openjdk

openjdk bot commented Sep 19, 2023

@jatin-bhateja Pushed as commit f03fb4e.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
