Skip to content

8273949: Intrinsic creation for VectorMask.toLong operation. #126

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Sep 17, 2021

Summary of changes:

  • Intrinsification of VectorMask.toLong() API.
  • Supports inline expansion for both AVX512 and non-AVX512 targets.
  • Used toLong() API to optimize existing Java API implementation of VectorMask.laneIsSet() operation.

Following performance number are generated using JMH benchmark modification included with the patch.

System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (Cascade Lake Server 28C 2S)

Benchmark VECSIZE Baseline Score (ops/ms) With Opt (ops/ms) Gain Ratio
MaskQueryOperationsBenchmark.testToLongByte 128 90451.424 346941.379 3.835665196
MaskQueryOperationsBenchmark.testToLongByte 256 63127.764 338331.425 5.359471072
MaskQueryOperationsBenchmark.testToLongByte 512 40543.264 313836.333 7.740776199
MaskQueryOperationsBenchmark.testToLongLong 128 171989.714 152872.758 0.88884826
MaskQueryOperationsBenchmark.testToLongLong 256 164702.273 324794.578 1.972010295
MaskQueryOperationsBenchmark.testToLongLong 512 122667.916 318060.096 2.59285481
MaskQueryOperationsBenchmark.testToLongShort 128 122656.408 346691.082 2.826522378
MaskQueryOperationsBenchmark.testToLongShort 256 96838.555 360909.28 3.726917239
MaskQueryOperationsBenchmark.testToLongShort 512 63119.009 313075.159 4.960077225
MaskQueryOperationsBenchmark.testToLonglong 128 180855.623 324620.433 1.794914792
MaskQueryOperationsBenchmark.testToLonglong 256 122705.631 324315.916 2.643040204
MaskQueryOperationsBenchmark.testToLonglong 512 90396.687 324318.095 3.58772103

Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8273949: Intrinsic creation for VectorMask.toLong operation.

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/126/head:pull/126
$ git checkout pull/126

Update a local copy of the PR:
$ git checkout pull/126
$ git pull https://git.openjdk.java.net/panama-vector pull/126/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 126

View PR using the GUI difftool:
$ git pr show -t 126

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/126.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 17, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into vectorIntrinsics+mask will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Sep 17, 2021
@mlbridge
Copy link

mlbridge bot commented Sep 17, 2021

Webrevs

for (int i = 0; i < bits.length; i++) {
res = bits[i] ? res | set : res;
set = set << 1;
if (length() <= Long.SIZE) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check that i > 0 && i < length() and if not throw IllegalArgumentException.

Currently the behavior is unspecified, can you please add the following to the documentation of VectorMask.laneIsSet. (I will handle any CSR related tasks.)

     * @throws IllegalArgumentException if the index is is out of range
     * ({@code < 0 || >= length()})

Comment on lines 711 to 712
return (int) VectorSupport.maskReductionCoerced(VECTOR_OP_MASK_TRUECOUNT, Byte128Mask.class, byte.class, VLENGTH, this,
(m) -> (long)trueCountHelper(((Byte128Mask)m).getBits()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (int) VectorSupport.maskReductionCoerced(VECTOR_OP_MASK_TRUECOUNT, Byte128Mask.class, byte.class, VLENGTH, this,
(m) -> (long)trueCountHelper(((Byte128Mask)m).getBits()));
return (int) VectorSupport.maskReductionCoerced(VECTOR_OP_MASK_TRUECOUNT, Byte128Mask.class, byte.class, VLENGTH, this,
(m) -> trueCountHelper(m.getBits()));

We don't need the cast to long for the return from the lambda expression. And I think the cast of the mask is redundant too. Same applies to the other three methods.

Comment on lines 66 to 68
if (i < 0 || i >= length()) {
throw new IllegalArgumentException("Index " + i + " must be zero or positive, and less than " + length());
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies i should have previously said use Objects.checkIndex(i, length()). In fact lets call length once:

    int length = length();
    Objects.checkIndex(i, length);
    if (length <= Long.SIZE) {
    ...

Separately i wonder if the second bounds check should be if i <= Long.SIZE, since the first bounds check should dominate in many cases?

(I know there are other areas where we should Objects.checkIndex e.g. *MaxVector.lane/withLane but we can fix those later.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With length <= Long.SIZE there is a better possibility of optimizing out else part and thus only fast path gets generated. length is final static values and hence in most of the cases due to in-lining this constant may get propagated.

Copy link
Member

@PaulSandoz PaulSandoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java changes are good.

@openjdk
Copy link

openjdk bot commented Sep 20, 2021

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8273949: Intrinsic creation for VectorMask.toLong operation.

Reviewed-by: psandoz, sviswanathan, eliu

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 1 new commit pushed to the vectorIntrinsics+mask branch:

  • b44662a: 8273264: AArch64: [vector] Add missing rules for VectorMaskCast

Please see this link for an up-to-date comparison between the source branch of this pull request and the vectorIntrinsics+mask branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the vectorIntrinsics+mask branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Sep 20, 2021
match(Set dst (VectorMaskTrueCount mask));
effect(TEMP_DEF dst, TEMP tmp, TEMP ktmp, TEMP xtmp);
format %{ "vector_truecount_evex $mask \t! vector mask true count" %}
instruct vmask_tolong_evex(rRegL dst, kReg mask) %{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The instructs with rRegL should be in ifdef _LP64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its already covered under _LP64

match(Set dst (VectorMaskLastTrue mask));
effect(TEMP_DEF dst, TEMP tmp, TEMP ktmp, TEMP xtmp, KILL cr);
format %{ "vector_mask_first_or_last_true_evex $mask \t! vector first/last true location" %}
instruct vmask_truecount_evex(rRegI dst, kReg mask, rRegL tmp) %{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have flags register as popcnt instruction affects flags.

Comment on lines 8708 to 8709
instruct vmask_truecount_avx(rRegI dst, vec mask, rRegL tmp, vec xtmp, vec xtmp1) %{
predicate(!VM_Version::supports_avx512vlbw());
predicate(n->in(1)->bottom_type()->isa_vectmask() == NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also include flags register.

class VectorMaskToLongNode : public VectorMaskOpNode {
public:
VectorMaskToLongNode(Node* mask, const Type* ty):
VectorMaskOpNode(mask, ty, Op_VectorMaskLastTrue) {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be Op_VectorMaskToLong?

case T_LONG: // fall-through
case T_FLOAT: // fall-through
case T_DOUBLE: return Op_VectorMaskToLong;
default: fatal("MASK_TRUECOUNT: %s", type2name(bt));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MASK_TOLONG here? Also break is missing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is consistent with other case statements.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default: fatal("MASK_TRUECOUNT: %s", type2name(bt));
should refer to MASK_TOLONG.

@jatin-bhateja
Copy link
Member Author

@nsjian , kindly check the IR change, it may need some change in AARCH64 backend since now VectorStoreMask is not being inserted while connecting mask generating node to mask operation node. This saves redundant store mask operation if target supports predicate registers.

@jatin-bhateja jatin-bhateja deleted the JDK-8273949 branch September 22, 2021 04:26
@jatin-bhateja jatin-bhateja restored the JDK-8273949 branch September 22, 2021 04:26
@jatin-bhateja jatin-bhateja reopened this Sep 22, 2021
@nsjian
Copy link

nsjian commented Sep 22, 2021

@nsjian , kindly check the IR change, it may need some change in AARCH64 backend since now VectorStoreMask is not being inserted while connecting mask generating node to mask operation node. This saves redundant store mask operation if target supports predicate registers.

Hi @jatin-bhateja , thanks for the heads-up. The VectorMaskOp connection change looks reasonable. We will update SVE backend later.

Copy link
Collaborator

@sviswa7 sviswa7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one comment on cut and paste mistake in error message, Please fix that. Other than that the patch looks good. No need for re-rereview.

Copy link
Member

@e1iu e1iu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Just the comment is mismatched.

// <E, M>
// long maskReductionCoerced(int oper, Class<? extends M> maskClass, Class<?> elemClass,
//                          int length, M m, VectorMaskOp<M> defaultImpl)
bool LibraryCallKit::inline_vector_mask_operation() {

@jatin-bhateja
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Sep 23, 2021

Going to push as commit 0e7348d.
Since your change was applied there has been 1 commit pushed to the vectorIntrinsics+mask branch:

  • b44662a: 8273264: AArch64: [vector] Add missing rules for VectorMaskCast

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Sep 23, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Sep 23, 2021
@openjdk
Copy link

openjdk bot commented Sep 23, 2021

@jatin-bhateja Pushed as commit 0e7348d.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@DamonFool
Copy link
Member

Build failure on MacOS after this PR: #133 .

@DamonFool
Copy link
Member

Incorrect computation after this pr.
Please see: #138 .
Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

6 participants