8256973: Intrinsic creation for VectorMask query (lastTrue,firstTrue,trueCount) APIs #3916

Closed

@jatin-bhateja jatin-bhateja commented May 7, 2021

This patch intrinsifies the following mask query APIs using optimal instruction sequences for the x86 target.

  1. VectorMask.firstTrue.
  2. VectorMask.lastTrue.
  3. VectorMask.trueCount.

The current implementations of the above APIs iterate over the underlying boolean array encapsulated in a mask instance to ascertain the count or position index of true bits.
x86 AVX2 and AVX512 targets offer direct instructions to move the mask held in a byte vector into a GP or opmask register, thereby accelerating further querying.
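
For reference, the scalar behaviour being replaced can be sketched roughly as follows. This is a minimal illustration over a plain boolean[], not the JDK's actual fallback code; the class name ScalarMaskQueries is hypothetical. It follows the documented results of the public API: firstTrue returns the lane count when no lane is set, and lastTrue returns -1.

```java
// Hypothetical stand-in for the scalar fallback; not the JDK implementation.
final class ScalarMaskQueries {
    private ScalarMaskQueries() {}

    // Number of lanes that are set.
    static int trueCount(boolean[] bits) {
        int count = 0;
        for (boolean b : bits) {
            if (b) count++;
        }
        return count;
    }

    // Index of the first set lane, or bits.length if none is set.
    static int firstTrue(boolean[] bits) {
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) return i;
        }
        return bits.length;
    }

    // Index of the last set lane, or -1 if none is set.
    static int lastTrue(boolean[] bits) {
        for (int i = bits.length - 1; i >= 0; i--) {
            if (bits[i]) return i;
        }
        return -1;
    }
}
```

Each query is a loop over all lanes in the worst case, so its cost grows with the lane count.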

Intrinsification is not performed for vector species containing fewer than two vector lanes.

Please find below the performance numbers for the benchmark included in the patch:
Machine: Cascade Lake server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz 28C)

BENCHMARK VECTOR_SIZE ALGO BASELINE_AVX3 WITH_OPT_AVX3 GAIN
MaskQueryOperationsBenchmark.testFirstTrueByte 128 1 338396.436 362711.622 1.071854143
MaskQueryOperationsBenchmark.testFirstTrueByte 128 2 205477.472 362668.035 1.765001445
MaskQueryOperationsBenchmark.testFirstTrueByte 128 3 185613.377 362518.206 1.953082326
MaskQueryOperationsBenchmark.testFirstTrueByte 256 1 338522.114 328751.231 0.971136648
MaskQueryOperationsBenchmark.testFirstTrueByte 256 2 148825.341 328783.35 2.209189294
MaskQueryOperationsBenchmark.testFirstTrueByte 256 3 200854.856 328784.24 1.636924526
MaskQueryOperationsBenchmark.testFirstTrueByte 512 1 338551.089 319908.361 0.944933782
MaskQueryOperationsBenchmark.testFirstTrueByte 512 2 116338.756 320026.839 2.750818816
MaskQueryOperationsBenchmark.testFirstTrueByte 512 3 200871.692 320008.208 1.593097588
MaskQueryOperationsBenchmark.testFirstTrueInt 128 1 338489.157 190221.57 0.561972418
MaskQueryOperationsBenchmark.testFirstTrueInt 128 2 205140.903 362387.766 1.766531007
MaskQueryOperationsBenchmark.testFirstTrueInt 128 3 185508.994 362566.265 1.95444036
MaskQueryOperationsBenchmark.testFirstTrueInt 256 1 338403.999 328829.751 0.971707639
MaskQueryOperationsBenchmark.testFirstTrueInt 256 2 148988.857 328835.479 2.207114583
MaskQueryOperationsBenchmark.testFirstTrueInt 256 3 200815.907 328778.266 1.637212265
MaskQueryOperationsBenchmark.testFirstTrueInt 512 1 338462.403 328796.84 0.971442728
MaskQueryOperationsBenchmark.testFirstTrueInt 512 2 116355.623 328811.386 2.825917455
MaskQueryOperationsBenchmark.testFirstTrueInt 512 3 200856.08 328773.859 1.636862867
MaskQueryOperationsBenchmark.testFirstTrueLong 128 1 338451.783 204432.394 0.60402221
MaskQueryOperationsBenchmark.testFirstTrueLong 128 2 204443.049 155670.633 0.761437641
MaskQueryOperationsBenchmark.testFirstTrueLong 128 3 207254.769 155672.842 0.751118263
MaskQueryOperationsBenchmark.testFirstTrueLong 256 1 338520.255 328789.176 0.971254072
MaskQueryOperationsBenchmark.testFirstTrueLong 256 2 205883.123 328742.103 1.596741385
MaskQueryOperationsBenchmark.testFirstTrueLong 256 3 185519.176 328733.537 1.771965271
MaskQueryOperationsBenchmark.testFirstTrueLong 512 1 338605.11 328694.935 0.970732353
MaskQueryOperationsBenchmark.testFirstTrueLong 512 2 148444.7 328352.346 2.211950619
MaskQueryOperationsBenchmark.testFirstTrueLong 512 3 200884.874 328814.376 1.636829939
MaskQueryOperationsBenchmark.testFirstTrueShort 128 1 338529.326 362293.877 1.070199387
MaskQueryOperationsBenchmark.testFirstTrueShort 128 2 204676.583 362428.992 1.770739899
MaskQueryOperationsBenchmark.testFirstTrueShort 128 3 185495.663 362422.835 1.953807594
MaskQueryOperationsBenchmark.testFirstTrueShort 256 1 338533.82 328635.479 0.970761146
MaskQueryOperationsBenchmark.testFirstTrueShort 256 2 148822.446 328803.55 2.209368001
MaskQueryOperationsBenchmark.testFirstTrueShort 256 3 200752.028 328805.974 1.637871245
MaskQueryOperationsBenchmark.testFirstTrueShort 512 1 338464.548 320054.91 0.945608371
MaskQueryOperationsBenchmark.testFirstTrueShort 512 2 116329.063 328763.508 2.826151088
MaskQueryOperationsBenchmark.testFirstTrueShort 512 3 199971.049 328819.066 1.644333355
MaskQueryOperationsBenchmark.testLastTrueByte 128 1 325618.244 337629.441 1.036887359
MaskQueryOperationsBenchmark.testLastTrueByte 128 2 197655.729 337544.012 1.707737052
MaskQueryOperationsBenchmark.testLastTrueByte 128 3 325600.645 337256.796 1.035798919
MaskQueryOperationsBenchmark.testLastTrueByte 256 1 325677.144 308312.588 0.946681687
MaskQueryOperationsBenchmark.testLastTrueByte 256 2 138177.514 308293.997 2.231144476
MaskQueryOperationsBenchmark.testLastTrueByte 256 3 201281.142 308353.239 1.531952949
MaskQueryOperationsBenchmark.testLastTrueByte 512 1 325499.635 305103.491 0.937338965
MaskQueryOperationsBenchmark.testLastTrueByte 512 2 98267.327 304803.64 3.101780106
MaskQueryOperationsBenchmark.testLastTrueByte 512 3 201072.661 304969.972 1.516715253
MaskQueryOperationsBenchmark.testLastTrueInt 128 1 325286.171 337337.209 1.037047496
MaskQueryOperationsBenchmark.testLastTrueInt 128 2 197351.915 331432.723 1.679399579
MaskQueryOperationsBenchmark.testLastTrueInt 128 3 325173.097 337518.586 1.037965899
MaskQueryOperationsBenchmark.testLastTrueInt 256 1 325199.786 308436.805 0.948453284
MaskQueryOperationsBenchmark.testLastTrueInt 256 2 138200.527 308405.442 2.231579348
MaskQueryOperationsBenchmark.testLastTrueInt 256 3 201240.625 308234.527 1.531671485
MaskQueryOperationsBenchmark.testLastTrueInt 512 1 325590.639 308381.757 0.947145649
MaskQueryOperationsBenchmark.testLastTrueInt 512 2 98334.197 308440.373 3.13665421
MaskQueryOperationsBenchmark.testLastTrueInt 512 3 200832.953 308431.355 1.535760693
MaskQueryOperationsBenchmark.testLastTrueLong 128 1 325564.887 193981.861 0.595831641
MaskQueryOperationsBenchmark.testLastTrueLong 128 2 214005.351 153667.869 0.718056199
MaskQueryOperationsBenchmark.testLastTrueLong 128 3 214061.493 156337.24 0.730337988
MaskQueryOperationsBenchmark.testLastTrueLong 256 1 325601.502 308291.032 0.946835411
MaskQueryOperationsBenchmark.testLastTrueLong 256 2 197911.182 308292.149 1.557729815
MaskQueryOperationsBenchmark.testLastTrueLong 256 3 325608.187 308405.393 0.947167195
MaskQueryOperationsBenchmark.testLastTrueLong 512 1 325734.897 308321.619 0.946541564
MaskQueryOperationsBenchmark.testLastTrueLong 512 2 137974.465 308131.475 2.233250008
MaskQueryOperationsBenchmark.testLastTrueLong 512 3 205479.182 308311.636 1.500451934
MaskQueryOperationsBenchmark.testLastTrueShort 128 1 325681.411 337663.377 1.036790451
MaskQueryOperationsBenchmark.testLastTrueShort 128 2 198127.51 337287.453 1.702375672
MaskQueryOperationsBenchmark.testLastTrueShort 128 3 325519.01 337453.387 1.036662612
MaskQueryOperationsBenchmark.testLastTrueShort 256 1 325647.378 308266.5 0.946626691
MaskQueryOperationsBenchmark.testLastTrueShort 256 2 138287.837 308402.656 2.230150263
MaskQueryOperationsBenchmark.testLastTrueShort 256 3 205375.864 308418.101 1.501725154
MaskQueryOperationsBenchmark.testLastTrueShort 512 1 325548.631 308137.064 0.946516233
MaskQueryOperationsBenchmark.testLastTrueShort 512 2 98424.074 308145.17 3.130790644
MaskQueryOperationsBenchmark.testLastTrueShort 512 3 205381.622 308345.763 1.50133084
MaskQueryOperationsBenchmark.testTrueCountByte 128 1 197488.249 340490.471 1.724104967
MaskQueryOperationsBenchmark.testTrueCountByte 128 2 191307.785 354400.26 1.852513529
MaskQueryOperationsBenchmark.testTrueCountByte 128 3 181206.7 354512.75 1.956399791
MaskQueryOperationsBenchmark.testTrueCountByte 256 1 144485.784 328347.7 2.272525995
MaskQueryOperationsBenchmark.testTrueCountByte 256 2 136709.938 328318.229 2.401568122
MaskQueryOperationsBenchmark.testTrueCountByte 256 3 141501.903 328274.337 2.319928779
MaskQueryOperationsBenchmark.testTrueCountByte 512 1 108395.25 318599.11 2.939234976
MaskQueryOperationsBenchmark.testTrueCountByte 512 2 98731.287 318651.791 3.22746518
MaskQueryOperationsBenchmark.testTrueCountByte 512 3 106344.335 318657.098 2.99646519
MaskQueryOperationsBenchmark.testTrueCountInt 128 1 124691.716 354457.62 2.842671762
MaskQueryOperationsBenchmark.testTrueCountInt 128 2 191325.138 354360.523 1.852137815
MaskQueryOperationsBenchmark.testTrueCountInt 128 3 181480.334 353746.697 1.949228818
MaskQueryOperationsBenchmark.testTrueCountInt 256 1 144513.076 328404.916 2.27249274
MaskQueryOperationsBenchmark.testTrueCountInt 256 2 136710.717 328516.92 2.403007805
MaskQueryOperationsBenchmark.testTrueCountInt 256 3 141631.832 328432.841 2.318919669
MaskQueryOperationsBenchmark.testTrueCountInt 512 1 108479.473 328405.877 3.027355019
MaskQueryOperationsBenchmark.testTrueCountInt 512 2 98747.682 328300.378 3.324638831
MaskQueryOperationsBenchmark.testTrueCountInt 512 3 106378.04 328384.537 3.086957957
MaskQueryOperationsBenchmark.testTrueCountLong 128 1 213646.579 159098.437 0.74468048
MaskQueryOperationsBenchmark.testTrueCountLong 128 2 212671.379 162528.924 0.764225655
MaskQueryOperationsBenchmark.testTrueCountLong 128 3 212649.052 162530.898 0.764315178
MaskQueryOperationsBenchmark.testTrueCountLong 256 1 197350.819 328365.924 1.663869072
MaskQueryOperationsBenchmark.testTrueCountLong 256 2 191473.127 328501.883 1.715655289
MaskQueryOperationsBenchmark.testTrueCountLong 256 3 185529.513 328428.64 1.770223156
MaskQueryOperationsBenchmark.testTrueCountLong 512 1 144516.188 328334.76 2.27195835
MaskQueryOperationsBenchmark.testTrueCountLong 512 2 136752.367 328505.571 2.402192943
MaskQueryOperationsBenchmark.testTrueCountLong 512 3 141445.742 328392.887 2.321688036
MaskQueryOperationsBenchmark.testTrueCountShort 128 1 197863.202 354533.342 1.791810394
MaskQueryOperationsBenchmark.testTrueCountShort 128 2 191802.914 354377.939 1.84761499
MaskQueryOperationsBenchmark.testTrueCountShort 128 3 181773.298 354374.525 1.949541153
MaskQueryOperationsBenchmark.testTrueCountShort 256 1 144414.679 328435.088 2.27425003
MaskQueryOperationsBenchmark.testTrueCountShort 256 2 136923.991 328267.898 2.397446171
MaskQueryOperationsBenchmark.testTrueCountShort 256 3 141545.957 328308.681 2.319449371
MaskQueryOperationsBenchmark.testTrueCountShort 512 1 108420.143 328282.998 3.027878297
MaskQueryOperationsBenchmark.testTrueCountShort 512 2 98736.441 328420.616 3.326235103
MaskQueryOperationsBenchmark.testTrueCountShort 512 3 106432.386 328245.585 3.084076166

ALGO (1 = best case, 2 = worst case, 3 = average case)


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8256973: Intrinsic creation for VectorMask query (lastTrue,firstTrue,trueCount) APIs

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3916/head:pull/3916
$ git checkout pull/3916

Update a local copy of the PR:
$ git checkout pull/3916
$ git pull https://git.openjdk.java.net/jdk pull/3916/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3916

View PR using the GUI difftool:
$ git pr show -t 3916

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3916.diff

bridgekeeper bot commented May 7, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@jatin-bhateja
Member Author

/label hotspot-compiler-dev

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label May 7, 2021

openjdk bot commented May 7, 2021

@jatin-bhateja
The hotspot-compiler label was successfully added.

@openjdk openjdk bot added the rfr Pull request is ready for review label May 7, 2021

mlbridge bot commented May 7, 2021

Webrevs

@PaulSandoz PaulSandoz left a comment

These mask operations can be considered a form of reduction.

Do you think it makes sense to reuse VectorSupport.reductionCoerced instead of adding a new intrinsic? (Note that we reuse VectorSupport.binaryOp for mask logical binary operations).

Perhaps that allows for further reuse later if/when we add operations to integral vectors to count bits like we already have with scalars, such as Integer.bitCount, Integer.numberOfLeadingZeros etc?

@@ -173,6 +143,31 @@ static boolean allTrueHelper(boolean[] bits) {
return true;
}

/*package-private*/
static int trueCountHelper(boolean[] bits) {

Member

Naming-wise, I think you can drop Helper from such methods.

Member Author

This is indeed a Helper routine called from the lambda expression.

Member

Although we don't use that naming pattern in other places for the fallback Java code. It's just the scalar implementation.

jatin-bhateja commented May 7, 2021

> These mask operations can be considered a form of reduction.
>
> Do you think it makes sense to reuse VectorSupport.reductionCoerced instead of adding a new intrinsic? (Note that we reuse VectorSupport.binaryOp for mask logical binary operations).
>
> Perhaps that allows for further reuse later if/when we add operations to integral vectors to count bits like we already have with scalars, such as Integer.bitCount, Integer.numberOfLeadingZeros etc?

Hi @PaulSandoz, that's a nice suggestion. I think that, instead of a reduction, which may emit a bulky sequence, VectorMask.toLong() + Long.bitCount() could have been used for trueCount. But since toLong may not work for ARM SVE, intrinsifying at the level of the API looked reasonable in the meantime.
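
The toLong() + Long.bitCount() idea can be illustrated on a plain long bit mask. This is only a sketch of the approach, assuming lane i maps to bit i and at most 64 lanes; the class LongMaskQueries and its pack helper are hypothetical and say nothing about the instruction sequence C2 actually emits.

```java
// Hypothetical illustration of mask queries on a packed long; not JDK code.
final class LongMaskQueries {
    private LongMaskQueries() {}

    // Pack up to 64 lanes into a long; lane i becomes bit i (assumed layout).
    static long pack(boolean[] bits) {
        if (bits.length > Long.SIZE) {
            throw new IllegalArgumentException("at most 64 lanes supported");
        }
        long packed = 0L;
        for (int i = 0; i < bits.length; i++) {
            if (bits[i]) packed |= 1L << i;
        }
        return packed;
    }

    // Population count of the packed mask.
    static int trueCount(long packed) {
        return Long.bitCount(packed);
    }

    // Lowest set bit, or `length` when no lane is set.
    static int firstTrue(long packed, int length) {
        return Math.min(Long.numberOfTrailingZeros(packed), length);
    }

    // Highest set bit, or -1 when no lane is set (63 - 64 == -1).
    static int lastTrue(long packed) {
        return (Long.SIZE - 1) - Long.numberOfLeadingZeros(packed);
    }
}
```

Once the mask sits in a general-purpose register, each query reduces to a single scalar bit-manipulation call, which is the effect the patch description attributes to the x86 mask-move instructions.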

@PaulSandoz
Member

> Hi @PaulSandoz , that's a nice suggestion, I think instead of reduction which may emit bulky sequence, VectorMask.toLong() + Long.bitCount() could have been used for trueCount. But since toLong may not work for ARM SVE, so in the mean time intrinsifying at the level of API looked reasonable.

Do you mean that reusing VectorSupport.reductionCoerced as the intrinsic entry point may emit a bulky sequence?

Note that I was not suggesting to reuse Long.bitCount() etc. I was just using that as an example that the bit-wise reduction operations on masks can also apply to integral vectors, suggesting there might be some sharing in C2, just as is done for binary operations such as logical AND.

For example:

        @Override
        @ForceInline
        public Int256Mask and(VectorMask<Integer> mask) {
            Objects.requireNonNull(mask);
            Int256Mask m = (Int256Mask)mask;
            return VectorSupport.binaryOp(VECTOR_OP_AND, Int256Mask.class, int.class, VLENGTH,
                                             this, m,
                                             (m1, m2) -> m1.bOp(m2, (i, a, b) -> a & b));
        }

And notice that VECTOR_OP_AND is reused for vector lane-wise binary and reduction operations on IntVector etc. Can we do the same for other bitwise reduction-like operations, first implementing optimal support for masks, then later expanding for integral vectors?

So rather than introducing specific constants, such as VECTOR_OP_MASK_TRUECOUNT etc, we can generalize to VECTOR_OP_BITCOUNT etc that can apply to both masks and integral vectors, where for masks we interpret BIT appropriately to mean boolean true value.

@jatin-bhateja
Member Author

>> Hi @PaulSandoz , that's a nice suggestion, I think instead of reduction which may emit bulky sequence, VectorMask.toLong() + Long.bitCount() could have been used for trueCount. But since toLong may not work for ARM SVE, so in the mean time intrinsifying at the level of API looked reasonable.
>
> Do you mean that reusing VectorSupport.reductionCoerced as the intrinsic entry point may emit bulky sequence?

Hi @PaulSandoz, semantically reductionCoerced could be used as an entry point for trueCount (VECTOR_OP_BITCOUNT), since we iterate over each lane element (boolean type in this case) and return the final count of set bits. The lastTrue and firstTrue operations, however, are more like iterative operations along the lines of Vector.lane and Vector.withLane, for which we have explicit entry points.

Also, VectorSupport.reductionCoerced constrains the type parameter V to be bounded by Vector, and VectorMask is not in the Vector class hierarchy. We can relax that constraint, though. In addition, we may need to bypass some portions of LibraryCallKit::inline_vector_reduction for the mask query APIs. Given all this, does it sound reasonable to add one separate entry point (maskOp) for all the mask query APIs? Looking forward to your feedback.

@PaulSandoz PaulSandoz left a comment

Java code looks good.

Perhaps when we add bit-counting operations to vector we might find a way to consolidate. I don't wanna block progress based on something we might do in the future.

Comment on lines 106 to 109
    @Benchmark
    public void testTrueCountByte(Blackhole bh) {
        bh.consume(bmask.trueCount());
    }

Member

No need to use a black hole. The framework already consumes a returned value in a black hole.

Suggested change
-    @Benchmark
-    public void testTrueCountByte(Blackhole bh) {
-        bh.consume(bmask.trueCount());
-    }
+    @Benchmark
+    public int testTrueCountByte() {
+        return bmask.trueCount();
+    }

openjdk bot commented May 11, 2021

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8256973: Intrinsic creation for VectorMask query (lastTrue,firstTrue,trueCount) APIs

Reviewed-by: psandoz, vlivanov

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 231 new commits pushed to the master branch:

  • ff84577: 8267098: AArch64: C1 StubFrames end confusingly
  • 0daec49: 8267246: -XX:MaxRAMPercentage=0 is unreasonable for jtreg tests on many-core machines
  • 324defe: 8267212: test/jdk/java/util/Collections/FindSubList.java intermittent crash with "no reachable node should have no use"
  • bdbe23b: 8265462: Handle multiple slots in the NSS Internal Module from SunPKCS11's Secmod
  • 10236e7: 8263242: serviceability/sa/ClhsdbFindPC.java cannot find MaxJNILocalCapacity with ASLR
  • e6705c0: 8266949: Check possibility to disable OperationTimedOut on Unix
  • b92c5a4: 8265292: [macos_aarch64] java/foreign/TestDowncall.java crashes with SIGBUS
  • fadf580: 8262952: [macos_aarch64] os::commit_memory failure
  • f8f40ab: 8230486: G1BarrierSetAssembler::g1_write_barrier_post unnecessarily pushes/pops new_val
  • 9d168e2: 8266973: Migrate to ClassHierarchyIterator when enumerating subclasses
  • ... and 221 more: https://git.openjdk.java.net/jdk/compare/c5dc657f0be90bd594663dcc612f40a930c2bbe7...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 11, 2021
@jatin-bhateja
Member Author

Hi @PaulSandoz, thanks; your comments on JMH have been addressed. @neliasso @iwanowww, kindly share your feedback/comments on the compiler-side changes.

@IntrinsicCandidate
public static
<M>
int maskOp(int oper, Class<?> maskClass, Class<?> elemClass, int length, M m,

I second Paul here: maskOp case is already covered by reductionCoerced.

Member Author

As discussed above, mixing it with reductionCoerced would require changes to the original entry point (its type parameter is bounded by Vector), and we might also need to bypass some irrelevant portions of inline_vector_reduction(). To keep things clean for the time being, I added a separate entry point for all mask operations.

Ok, fair enough. We can revisit that later and merge them if needed.
Some suggestions to consider to align it with reductionCoerced:

  • reflect in the name that it's effectively a reduction, but on masks (maskReductionCoerced?);
  • return type can be generalized to long;
  • bound on M: <M extends VectorMask>;
  • no need to introduce a special interface, Function<T,R> just works: VectorMaskOp<M> -> Function<M, Long>;
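
To make the suggestions above concrete, here is a rough sketch of what such an entry point might look like. Everything in it is an assumption for illustration: SimpleMask stands in for VectorMask, the opcode constants are made up, and the body simply runs the Java fallback, which is what happens whenever the JIT does not replace the call with an intrinsic.

```java
import java.util.function.Function;

// Hypothetical sketch of a mask-reduction entry point; not the JDK's VectorSupport.
final class MaskReductionSketch {
    private MaskReductionSketch() {}

    // Made-up opcodes; the real constants live in VectorSupport.
    static final int OP_TRUE_COUNT = 0;
    static final int OP_FIRST_TRUE = 1;
    static final int OP_LAST_TRUE = 2;

    // Stand-in for VectorMask so the sketch compiles on its own.
    static final class SimpleMask {
        final boolean[] bits;
        SimpleMask(boolean... bits) { this.bits = bits.clone(); }
    }

    // Shape suggested in the review: bounded M, long carrier, Function fallback.
    // An intrinsic would use oper/maskClass/elemClass/length; this sketch
    // ignores them and always calls the fallback.
    static <M extends SimpleMask> long maskReductionCoerced(
            int oper, Class<?> maskClass, Class<?> elemClass,
            int length, M m, Function<M, Long> defaultImpl) {
        return defaultImpl.apply(m);
    }

    // API-style method: downcasts the long carrier to the int the API returns.
    static int trueCount(SimpleMask m) {
        return (int) maskReductionCoerced(OP_TRUE_COUNT, SimpleMask.class,
                int.class, m.bits.length, m, mask -> {
                    long n = 0;
                    for (boolean b : mask.bits) {
                        if (b) n++;
                    }
                    return n;
                });
    }
}
```

With a long carrier, each int-returning API method needs one narrowing cast at its call site, as trueCount does here.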

Member Author

> Ok, fair enough. We can revisit that later and merge them if needed.
> Some suggestions to consider to align it with reductionCoerced:
>
>   • reflect in the name that it's effectively a reduction, but on masks (maskReductionCoerced?);
>   • return type can be generalized to long;

Hi @iwanowww, can you kindly elaborate on why the return type should be long here?
We will need to downcast it to int again, since these APIs return an int value.

>   • bound on M: <M extends VectorMask>;
>   • no need to introduce a special interface, Function<T,R> just works: VectorMaskOp<M> -> Function<M, Long>;

@@ -9217,6 +9217,14 @@ void Assembler::shrxq(Register dst, Register src1, Register src2) {
emit_int16((unsigned char)0xF7, (0xC0 | encode));
}

void Assembler::evpmovb2m(KRegister dst, XMMRegister src, int vector_len) {
assert(VM_Version::supports_avx512bw(), "");

Should it be VM_Version::supports_avx512vlbw()?
VPMOVB2M requires AVX512VL for 128-/256-bit cases.

@@ -8054,4 +8061,42 @@ instruct vmasked_store64(memory mem, vec src, kReg mask) %{
%}
ins_pipe( pipe_slow );
%}

instruct vmask_true_count_evex(rRegI dst, vec mask, rRegL tmp, kReg ktmp, vec xtmp) %{
predicate(VM_Version::supports_avx512bw());

Same here: VM_Version::supports_avx512vlbw()?

int vector_len = vector_length_encoding(mask_node);
int opcode = this->ideal_Opcode();
int mask_len = mask_node->bottom_type()->is_vect()->length();
__ vector_mask_oper(opcode, $dst$$Register, $mask$$XMMRegister, $xtmp$$XMMRegister,

oper looks misleading to me here: it usually means operand in Mach-related code.

Either vector_mask_operation() or vector_mask_op() is a better alternative IMO.

%}

instruct vmask_true_count_avx(rRegI dst, vec mask, rRegL tmp, vec xtmp, vec xtmp1) %{
predicate(!VM_Version::supports_avx512bw());

VM_Version::supports_avx512vlbw()

Member Author

Handled in match_rule_supported_vector.

I think you still need to adjust the predicate to be able to correctly split between AVX512BW+VL and AVX512F/AVX/AVX2 configurations.
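
The split being asked for can be spelled out as a small decision function. This is a sketch written in Java purely for illustration; the enum, the CpuFeatures holder, and the exact fallback rules are assumptions, not the predicates used in the patch. It encodes only the ISA constraint mentioned earlier in the review: VPMOVB2M needs AVX512BW, and additionally AVX512VL for 128-/256-bit vectors.

```java
// Hypothetical decision logic; HotSpot expresses this via VM_Version checks
// in matcher predicates, not via a class like this.
final class MaskOpPathSketch {
    private MaskOpPathSketch() {}

    enum Path { EVEX, AVX, JAVA_FALLBACK }

    // Hypothetical feature holder standing in for VM_Version queries.
    static final class CpuFeatures {
        final boolean avx2;
        final boolean avx512bw;
        final boolean avx512vl;
        CpuFeatures(boolean avx2, boolean avx512bw, boolean avx512vl) {
            this.avx2 = avx2;
            this.avx512bw = avx512bw;
            this.avx512vl = avx512vl;
        }
    }

    static Path select(CpuFeatures cpu, int vectorBits) {
        // EVEX path: needs BW, plus VL for vectors narrower than 512 bits.
        boolean evexOk = cpu.avx512bw && (vectorBits == 512 || cpu.avx512vl);
        if (evexOk) return Path.EVEX;
        // AVX/AVX2 path, assumed here to cover only 128-/256-bit vectors.
        if (vectorBits <= 256 && cpu.avx2) return Path.AVX;
        // Everything else (e.g. 512-bit vectors without AVX512BW) is not intrinsified.
        return Path.JAVA_FALLBACK;
    }
}
```

Under these assumptions, a 512-bit mask on a CPU with AVX512F but without AVX512BW falls through to the Java fallback.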

@@ -3721,3 +3721,99 @@ void C2_MacroAssembler::arrays_equals(bool is_array_equ, Register ary1, Register
vpxor(vec2, vec2);
}
}

#ifdef _LP64
void C2_MacroAssembler::vector_mask_oper(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,

What about stressing that it requires AVX512BW & VL extensions? For example, by putting an assert and adding _evex suffix to the name.


#ifdef _LP64
void C2_MacroAssembler::vector_mask_oper(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,
Register tmp, KRegister ktmp, int masklen, int vlen) {

s/vlen/vlen_enc/

}

void C2_MacroAssembler::vector_mask_oper(int opc, Register dst, XMMRegister mask, XMMRegister xtmp,
XMMRegister xtmp1, Register tmp, int masklen, int vlen) {

s/vlen/vlen_enc/

@@ -1294,6 +1294,37 @@ Node* ShiftVNode::Identity(PhaseGVN* phase) {
return this;
}

Node* VectorMaskOpNode::Ideal(PhaseGVN* phase, bool can_reshape) {

It doesn't make much sense to me. Why don't you simply require the input to be in canonical shape from the very beginning by unconditionally wrapping it into VectorStoreMask during construction?

Member Author

Done

@@ -853,6 +853,47 @@ class VectorMaskGenNode : public TypeNode {
const Type* _elemType;
};

class VectorMaskOpNode : public TypeNode {
public:
VectorMaskOpNode(Node* mask, const Type* ty, const Type* ety, int mopc):

ty/ety caught my eye. It doesn't match anything in vectornode.hpp and may confuse readers.
Any reason not to use vt?

Also, any particular reason to cache full-blown type instead of capturing just the BasicType?

Member Author

In this case all the mask operations produce an integer value, so I did not use vt. I have removed ety since it does not have any direct use currently.

@@ -1650,6 +1657,9 @@ const bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType
case Op_RotateRightV:
case Op_RotateLeftV:
case Op_MacroLogicV:
case Op_VectorMaskLastTrue:

But don't you support 128-/256-bit cases w/ AVX/AVX2 instructions?
This check effectively requires AVX512 as a baseline.

@jatin-bhateja
Member Author

Hi @iwanowww, your comments have been addressed.

> I think you still need to adjust the predicate to be able to correctly split between AVX512BW+VL and AVX512F/AVX/AVX2 configurations.

There are now two patterns: one for AVX512VLBW (handling mask lengths 2-64) and one for non-AVX512VLBW (handling mask lengths 2-32). Byte512Vector mandates the presence of AVX512BW, as enforced by Matcher::match_rule_supported_vector(), so I removed the special code sequence for 512-bit vectors in the absence of AVX512BW.

> reflect in the name that it's effectively a reduction, but on masks (maskReductionCoerced?);

DONE

> bound on M: <M extends VectorMask>;

DONE

mlbridge bot commented May 17, 2021

Mailing list message from Vladimir Ivanov on hotspot-compiler-dev:

>> Ok, fair enough. We can revisit that later and merge them if needed.
>> Some suggestions to consider to align it with `reductionCoerced`:
>>
>> * reflect in the name that it's effectively a reduction, but on masks (`maskReductionCoerced`?);
>> * return type can be generalized to `long`;
>
> Hi @iwanowww, Can you kindly elaborate why should the return type be long here ?
> We will need to again downcast it to integer since these APIs return an integer value.

FTR downcasts are fine here.

In the context of JVM intrinsics the main question is what carrier type
to pick.

If you don't envision any future operations on masks to return 64-bit
values, then it's fine to pick int.

Otherwise, it's better to start with long.

Because when such operation is introduced, return type (and all use
sites) will have to be adjusted anyway (instead of introducing yet
another intrinsic method).

Best regards,
Vladimir Ivanov
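
To make the carrier-type argument concrete, here is a small sketch. `CarrierTypeSketch` and `toBits` are hypothetical names, and packing a mask into a `long` is only an example of a possible future operation with a 64-bit result, not something proposed in this PR. The three query results always fit in an `int`, while a packed 64-lane mask does not.

```java
final class CarrierTypeSketch {
    // Hypothetical 64-bit mask operation: lane i is stored in bit i.
    static long toBits(boolean[] lanes) {
        if (lanes.length > 64) {
            throw new IllegalArgumentException("at most 64 lanes fit in a long");
        }
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) bits |= 1L << i;
        }
        return bits;
    }

    // Query results are small, so narrowing a long carrier to int loses nothing.
    static int trueCount(long bits) {
        return Long.bitCount(bits);
    }

    // Returns the mask length when no lane is set.
    static int firstTrue(long bits, int length) {
        return bits == 0 ? length : Long.numberOfTrailingZeros(bits);
    }

    // Returns -1 when no lane is set (numberOfLeadingZeros(0) is 64).
    static int lastTrue(long bits) {
        return 63 - Long.numberOfLeadingZeros(bits);
    }
}
```

If the intrinsic entry point returned `int`, an operation like `toBits` on a 64-lane mask would need either a second entry point or a signature change at every use site, which is the cost the message above describes.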

@iwanowww iwanowww left a comment

Byte512Vector mandates the presence of AVX512BW (as enforced by Matcher::match_rule_supported_vector()), so I removed the special code sequence for 512-bit vectors in the absence of AVX512BW.

Please elaborate on why Byte512Vector matters here.

Intrinsics are fed with the corresponding vector element type, so unconditionally rejecting the AVX512F case (w/ BW & VL absent) means that on Xeon Phis VectorMask.lastTrue/firstTrue/trueCount on 512-bit masks are useless (irrespective of element type) while some 512-bit vector shapes are supported. Is it intended?

ciType* elem_type = elem_klass->const_oop()->as_instance()->java_mirror_type();
BasicType elem_bt = elem_type->basic_type();

if (num_elem <= 2) {

You mentioned that masks of length 2 are supported, but it's rejected here.
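
The boundary in question can be shown with a tiny sketch. Both helper names are hypothetical, and which bound is intended is an assumption based on the PR description, which says intrinsification is skipped for species with fewer than two lanes.

```java
final class MaskLengthCheckSketch {
    // The check as quoted in the diff: lengths 0, 1, and 2 are rejected.
    static boolean rejectedAsQuoted(int numElem) {
        return numElem <= 2;
    }

    // The check matching "fewer than two lanes": only lengths 0 and 1 are rejected.
    static boolean rejectedAsDescribed(int numElem) {
        return numElem < 2;
    }
}
```

The two predicates differ only at a mask length of exactly 2, which is the case raised here.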

@jatin-bhateja
Member Author

Byte512Vector mandates the presence of AVX512BW (as enforced by Matcher::match_rule_supported_vector()), so I removed the special code sequence for 512-bit vectors in the absence of AVX512BW.

Please elaborate on why Byte512Vector matters here.

Intrinsics are fed with the corresponding vector element type, so unconditionally rejecting the AVX512F case (w/ BW & VL absent) means that on Xeon Phis VectorMask.lastTrue/firstTrue/trueCount on 512-bit masks are useless (irrespective of element type) while some 512-bit vector shapes are supported. Is it intended?

This is enforced by Matcher::match_rule_supported_vector(): a 512-bit vector of a sub-word type is supported only if the target supports AVX512BW.
For types other than sub-word types, a 512-bit vector mask is handled by the second instruction selection pattern, which is predicated on !VM_Version::supports_avx512vlbw(), since for them the vector size needed to hold the byte vector containing the mask is always <= 32 bytes.

@iwanowww

This is enforced by Matcher::match_rule_supported_vector(): a 512-bit vector of a sub-word type is supported only if the target supports AVX512BW.
For types other than sub-word types, a 512-bit vector mask is handled by the second instruction selection pattern, which is predicated on !VM_Version::supports_avx512vlbw(), since for them the vector size needed to hold the byte vector containing the mask is always <= 32 bytes.

Ah, now I get it! Thanks for the clarifications.
It's the consequence of canonical mask representation being consumed by the operations. Worth putting a comment stressing that aspect.
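
The size argument can be checked with a little arithmetic. The sketch below assumes the canonical mask representation is one byte per lane, as the phrase "byte vector containing the mask" suggests; `MaskPatternSketch` and its methods are hypothetical names, and `shapeSupported` only models the single rule stated in this thread, not the full matcher logic.

```java
final class MaskPatternSketch {
    // Number of lanes in a vector of the given size and element width.
    static int laneCount(int vectorBits, int elemBits) {
        return vectorBits / elemBits;
    }

    // Assumption: the canonical mask uses one byte per lane.
    static int maskBytes(int vectorBits, int elemBits) {
        return laneCount(vectorBits, elemBits);
    }

    // Sub-word element types are narrower than 32 bits (byte, short).
    static boolean isSubWord(int elemBits) {
        return elemBits < 32;
    }

    // Rule from the thread: 512-bit sub-word vectors require AVX512BW.
    static boolean shapeSupported(int vectorBits, int elemBits, boolean hasAvx512bw) {
        return !(vectorBits == 512 && isSubWord(elemBits)) || hasAvx512bw;
    }
}
```

Under these assumptions a 512-bit int mask occupies 16 bytes and a 512-bit long mask 8 bytes, so only sub-word 512-bit shapes, which already require AVX512BW, produce a mask wider than 32 bytes.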

@iwanowww iwanowww left a comment

Leaving the check on mask length aside (num_elem <= 2 in LibraryCallKit::inline_vector_mask_operation), the patch looks good.

A couple minor suggestions follow.

ins_encode %{
const MachNode* mask_node = static_cast<const MachNode*>(this->in(this->operand_index($mask)));
assert(mask_node->bottom_type()->isa_vect(), "");
int vector_len = vector_length_encoding(mask_node);

I think you can just use int vlen_enc = vector_length_encoding(this, $mask); here.

int opcode = this->ideal_Opcode();
int mask_len = mask_node->bottom_type()->is_vect()->length();
__ vector_mask_operation(opcode, $dst$$Register, $mask$$XMMRegister, $xtmp$$XMMRegister,
$tmp$$Register, $ktmp$$KRegister, mask_len, vector_len);

On naming: vector_len and mask_len are misleadingly similar. While the latter represents the number of elements, the former is x86-specific encoding of vector length. It makes sense to stress the difference w/ a different name. That's why I propose vlen_enc. Unfortunately, it's not uniformly used across x86.ad yet, but at least some code already migrated.
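
A sketch of why the two values deserve different names. The encoding constants (0, 1, 2) and the byte-size mapping are assumptions modeled on how x86 vector-length encodings are commonly described; `VectorLengthNamesSketch`, `vlenEnc`, and `maskLen` are hypothetical helpers, not HotSpot code.

```java
final class VectorLengthNamesSketch {
    // Assumed x86 vector-length encodings.
    static final int AVX_128BIT = 0, AVX_256BIT = 1, AVX_512BIT = 2;

    // vlen_enc: an instruction-encoding value derived from the vector size in bytes.
    static int vlenEnc(int vectorBytes) {
        switch (vectorBytes) {
            case 4: case 8: case 16: return AVX_128BIT;
            case 32: return AVX_256BIT;
            case 64: return AVX_512BIT;
            default:
                throw new IllegalArgumentException("unsupported size: " + vectorBytes);
        }
    }

    // mask_len: the number of lanes, which depends on the element size too.
    static int maskLen(int vectorBytes, int elemBytes) {
        return vectorBytes / elemBytes;
    }
}
```

For a 64-byte vector of 4-byte elements the encoding is 2 while the lane count is 16, so passing one where the other is expected would silently produce wrong code.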

@jatin-bhateja
Member Author

/integrate

@openjdk openjdk bot closed this May 19, 2021
@openjdk openjdk bot added the integrated label and removed the ready and rfr labels May 19, 2021
@openjdk

openjdk bot commented May 19, 2021

@jatin-bhateja Since your change was applied there have been 232 commits pushed to the master branch:

  • 65a8bf5: 8265126: [REDO] unified handling for VectorMask object re-materialization during de-optimization
  • ff84577: 8267098: AArch64: C1 StubFrames end confusingly
  • 0daec49: 8267246: -XX:MaxRAMPercentage=0 is unreasonable for jtreg tests on many-core machines
  • 324defe: 8267212: test/jdk/java/util/Collections/FindSubList.java intermittent crash with "no reachable node should have no use"
  • bdbe23b: 8265462: Handle multiple slots in the NSS Internal Module from SunPKCS11's Secmod
  • 10236e7: 8263242: serviceability/sa/ClhsdbFindPC.java cannot find MaxJNILocalCapacity with ASLR
  • e6705c0: 8266949: Check possibility to disable OperationTimedOut on Unix
  • b92c5a4: 8265292: [macos_aarch64] java/foreign/TestDowncall.java crashes with SIGBUS
  • fadf580: 8262952: [macos_aarch64] os::commit_memory failure
  • f8f40ab: 8230486: G1BarrierSetAssembler::g1_write_barrier_post unnecessarily pushes/pops new_val
  • ... and 222 more: https://git.openjdk.java.net/jdk/compare/c5dc657f0be90bd594663dcc612f40a930c2bbe7...master

Your commit was automatically rebased without conflicts.

Pushed as commit 7aa6568.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jatin-bhateja jatin-bhateja deleted the JDK-8256973 branch May 19, 2021 05:23
@iwanowww

Jatin, the final commit erroneously contains mask.incr file. Please, remove it.

@TobiHartmann
Member

@DamonFool
Member

Jatin, the final commit erroneously contains mask.incr file. Please, remove it.

PR: #4107

@iwanowww

PR: #4107

Thanks, Jie. Reviewed.

Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org integrated Pull request has been integrated
5 participants