8266054: VectorAPI rotate operation optimization #3720

Closed

Member

@jatin-bhateja jatin-bhateja commented Apr 27, 2021

The current Vector API Java-side implementation expresses the rotateLeft and rotateRight operations using the following sequence:

vec1 = lanewise(VectorOperators.LSHL, n)
vec2 = lanewise(VectorOperators.LSHR, ESIZE*8 - n)
res = lanewise(VectorOperators.OR, vec1, vec2)

This patch moves that handling from the Java side into the C2 compiler, which can then dismantle the rotate operation into shifts and an OR when the target ISA does not support a direct rotate instruction.

AVX512 added the vector rotate instructions vpro[rl][v][dq], which operate on long and integer type vectors. For other cases (i.e. sub-word type vectors, or targets that do not support direct rotate operations), an instruction sequence comprising vector SHIFT (LEFT/RIGHT) and vector OR is emitted.
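
As a rough scalar illustration of the fallback sequence described above (not code from the patch), the sketch below builds a rotate from two shifts and an OR and compares it with the library rotate. The class and method names are hypothetical and chosen only for this example.

```java
// Scalar model of the fallback lowering: a rotate built from two shifts and an OR.
// Names are illustrative assumptions; this is not the C2 implementation.
final class RotateFallbackSketch {
    // rotateLeft for a 32-bit lane: (a << s) | (a >>> (32 - s)).
    static int rotateLeftViaShifts(int a, int n) {
        int s = n & (Integer.SIZE - 1);           // wrap the count to the lane width
        // When s == 0, Java masks the count (32 - 0) back to 0, so the result is a | a == a.
        return (a << s) | (a >>> (Integer.SIZE - s));
    }

    // Same construction for a 64-bit lane.
    static long rotateLeftViaShifts(long a, int n) {
        int s = n & (Long.SIZE - 1);
        return (a << s) | (a >>> (Long.SIZE - s));
    }

    public static void main(String[] args) {
        int a = 0x80000001;
        // Both lines print the same value if the shift/OR form matches the library rotate.
        System.out.println(Integer.toHexString(rotateLeftViaShifts(a, 7)));
        System.out.println(Integer.toHexString(Integer.rotateLeft(a, 7)));
    }
}
```

A direct rotate instruction replaces the three operations above with one, which is consistent with the int and long rows of the benchmark improving while the byte and short rows stay flat.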

Please find below the performance data for included JMH benchmark.
Machine: Cascade Lake Server (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz)

Benchmark (bits) (shift) (size) Baseline Score (ops/ms) With Opts (ops/ms) Gain (With Opts / Baseline)
RotateBenchmark.testRotateLeftB 128 7 256 3939.136 3836.133 0.973851372
RotateBenchmark.testRotateLeftB 128 7 512 1984.231 1918.27 0.966757399
RotateBenchmark.testRotateLeftB 128 15 256 3925.165 4043.842 1.030234907
RotateBenchmark.testRotateLeftB 128 15 512 1962.723 1936.551 0.986665464
RotateBenchmark.testRotateLeftB 128 31 256 3945.6 3817.883 0.967630525
RotateBenchmark.testRotateLeftB 128 31 512 1944.458 1914.229 0.984453766
RotateBenchmark.testRotateLeftB 256 7 256 4612.149 4514.874 0.978908964
RotateBenchmark.testRotateLeftB 256 7 512 2296.252 2270.237 0.988670669
RotateBenchmark.testRotateLeftB 256 15 256 4576.628 4515.53 0.986649996
RotateBenchmark.testRotateLeftB 256 15 512 2288.278 2270.923 0.992415694
RotateBenchmark.testRotateLeftB 256 31 256 4624.243 4511.46 0.975610495
RotateBenchmark.testRotateLeftB 256 31 512 2305.459 2273.788 0.986262605
RotateBenchmark.testRotateLeftB 512 7 256 7748.283 7777.105 1.003719792
RotateBenchmark.testRotateLeftB 512 7 512 3906.214 3912.647 1.001646863
RotateBenchmark.testRotateLeftB 512 15 256 7764.653 7763.482 0.999849188
RotateBenchmark.testRotateLeftB 512 15 512 3916.061 3919.363 1.000843194
RotateBenchmark.testRotateLeftB 512 31 256 7779.754 7770.239 0.998776954
RotateBenchmark.testRotateLeftB 512 31 512 3916.471 3912.718 0.999041739
RotateBenchmark.testRotateLeftI 128 7 256 4043.39 13461.814 3.329338501
RotateBenchmark.testRotateLeftI 128 7 512 1996.217 6455.425 3.233829288
RotateBenchmark.testRotateLeftI 128 15 256 4028.614 13077.277 3.246098286
RotateBenchmark.testRotateLeftI 128 15 512 1997.612 6452.918 3.230315997
RotateBenchmark.testRotateLeftI 128 31 256 4123.357 13079.045 3.171940969
RotateBenchmark.testRotateLeftI 128 31 512 2003.356 6452.716 3.22095324
RotateBenchmark.testRotateLeftI 256 7 256 7666.949 25658.625 3.34665393
RotateBenchmark.testRotateLeftI 256 7 512 3855.826 12278.106 3.18429981
RotateBenchmark.testRotateLeftI 256 15 256 7670.901 24625.466 3.210244272
RotateBenchmark.testRotateLeftI 256 15 512 3765.786 12272.771 3.259019764
RotateBenchmark.testRotateLeftI 256 31 256 7660.599 25678.864 3.352069988
RotateBenchmark.testRotateLeftI 256 31 512 3773.401 12006.469 3.181869353
RotateBenchmark.testRotateLeftI 512 7 256 11900.948 31242.989 2.625252123
RotateBenchmark.testRotateLeftI 512 7 512 5830.878 15727.149 2.697217983
RotateBenchmark.testRotateLeftI 512 15 256 12171.847 33180.067 2.72596813
RotateBenchmark.testRotateLeftI 512 15 512 5830.544 16740.182 2.871118372
RotateBenchmark.testRotateLeftI 512 31 256 11909.553 31250.882 2.624018047
RotateBenchmark.testRotateLeftI 512 31 512 5846.747 15738.831 2.691895339
RotateBenchmark.testRotateLeftL 128 7 256 2047.243 6888.484 3.364761291
RotateBenchmark.testRotateLeftL 128 7 512 1005.029 3245.931 3.229688895
RotateBenchmark.testRotateLeftL 128 15 256 1996.921 6985.256 3.498013191
RotateBenchmark.testRotateLeftL 128 15 512 986.906 3217.778 3.260470602
RotateBenchmark.testRotateLeftL 128 31 256 1999.06 6977.672 3.490476524
RotateBenchmark.testRotateLeftL 128 31 512 987.258 3236.63 3.278403416
RotateBenchmark.testRotateLeftL 256 7 256 3752.412 12995.954 3.4633601
RotateBenchmark.testRotateLeftL 256 7 512 1824.093 5809.576 3.184912173
RotateBenchmark.testRotateLeftL 256 15 256 3759.99 13262.631 3.52730486
RotateBenchmark.testRotateLeftL 256 15 512 1823.393 5803.872 3.183006626
RotateBenchmark.testRotateLeftL 256 31 256 3757.134 13284.633 3.535842214
RotateBenchmark.testRotateLeftL 256 31 512 1822.192 5824.178 3.196248255
RotateBenchmark.testRotateLeftL 512 7 256 5794.005 15567.753 2.686872552
RotateBenchmark.testRotateLeftL 512 7 512 2969.393 7694.79 2.591368
RotateBenchmark.testRotateLeftL 512 15 256 5817.292 15726.597 2.703422314
RotateBenchmark.testRotateLeftL 512 15 512 2944.655 7664.954 2.603005785
RotateBenchmark.testRotateLeftL 512 31 256 5822.131 16718.64 2.871567129
RotateBenchmark.testRotateLeftL 512 31 512 2944.763 7657.814 2.600485676
RotateBenchmark.testRotateLeftS 128 7 256 8006.155 7976.701 0.99632108
RotateBenchmark.testRotateLeftS 128 7 512 4031.753 4003.43 0.992975016
RotateBenchmark.testRotateLeftS 128 15 256 8003.879 7952.752 0.993612222
RotateBenchmark.testRotateLeftS 128 15 512 4026.359 4014.757 0.997118488
RotateBenchmark.testRotateLeftS 128 31 256 8000.842 7995.733 0.999361442
RotateBenchmark.testRotateLeftS 128 31 512 4044.421 4007.426 0.990852832
RotateBenchmark.testRotateLeftS 256 7 256 15078.471 15034.395 0.997076892
RotateBenchmark.testRotateLeftS 256 7 512 7236.509 7620.334 1.053040078
RotateBenchmark.testRotateLeftS 256 15 256 15093.661 15024.17 0.995396014
RotateBenchmark.testRotateLeftS 256 15 512 7308.568 7724.381 1.056893909
RotateBenchmark.testRotateLeftS 256 31 256 15332.233 15432.113 1.006514381
RotateBenchmark.testRotateLeftS 256 31 512 7317.18 7626.679 1.042297579
RotateBenchmark.testRotateLeftS 512 7 256 24079.012 23939.263 0.994196232
RotateBenchmark.testRotateLeftS 512 7 512 11441.41 11921.21 1.041935391
RotateBenchmark.testRotateLeftS 512 15 256 23563.675 23590.959 1.001157884
RotateBenchmark.testRotateLeftS 512 15 512 11418.634 11949.391 1.046481654
RotateBenchmark.testRotateLeftS 512 31 256 24035.69 23595.385 0.9816812
RotateBenchmark.testRotateLeftS 512 31 512 11668.091 11899.536 1.019835721
RotateBenchmark.testRotateRightB 128 7 256 3852.421 3816.521 0.990681185
RotateBenchmark.testRotateRightB 128 7 512 1956.766 1923.638 0.983070025
RotateBenchmark.testRotateRightB 128 15 256 3899.136 4038.945 1.035856405
RotateBenchmark.testRotateRightB 128 15 512 1957.733 2030.973 1.037410617
RotateBenchmark.testRotateRightB 128 31 256 3902.5 4043.736 1.03619116
RotateBenchmark.testRotateRightB 128 31 512 1957.728 1920.434 0.980950367
RotateBenchmark.testRotateRightB 256 7 256 4565.887 4515.083 0.988873137
RotateBenchmark.testRotateRightB 256 7 512 2300.057 2278.065 0.990438498
RotateBenchmark.testRotateRightB 256 15 256 4570.754 4527.692 0.990578797
RotateBenchmark.testRotateRightB 256 15 512 2300.524 2268.659 0.986148808
RotateBenchmark.testRotateRightB 256 31 256 4577.569 4513.29 0.98595783
RotateBenchmark.testRotateRightB 256 31 512 2304.335 2273.178 0.986478962
RotateBenchmark.testRotateRightB 512 7 256 7772.483 7842.671 1.009030319
RotateBenchmark.testRotateRightB 512 7 512 3907.265 3917.325 1.002574691
RotateBenchmark.testRotateRightB 512 15 256 7855.653 7865.25 1.001221668
RotateBenchmark.testRotateRightB 512 15 512 3909.845 3976.813 1.017128045
RotateBenchmark.testRotateRightB 512 31 256 7746.765 7870.159 1.015928455
RotateBenchmark.testRotateRightB 512 31 512 3919.596 3981.934 1.01590419
RotateBenchmark.testRotateRightI 128 7 256 4125.151 13056.878 3.165187893
RotateBenchmark.testRotateRightI 128 7 512 2045.201 6501.447 3.17887924
RotateBenchmark.testRotateRightI 128 15 256 4111.736 13318.124 3.23905134
RotateBenchmark.testRotateRightI 128 15 512 2055.355 6497.289 3.161151723
RotateBenchmark.testRotateRightI 128 31 256 4109.353 13073.3 3.181352393
RotateBenchmark.testRotateRightI 128 31 512 2055.431 6463.902 3.14479153
RotateBenchmark.testRotateRightI 256 7 256 7804.976 24585.962 3.150036848
RotateBenchmark.testRotateRightI 256 7 512 3815.818 11985.145 3.140911071
RotateBenchmark.testRotateRightI 256 15 256 7644.977 25863.841 3.383115606
RotateBenchmark.testRotateRightI 256 15 512 3822.508 12280.58 3.212702236
RotateBenchmark.testRotateRightI 256 31 256 7709.635 25655.108 3.327668301
RotateBenchmark.testRotateRightI 256 31 512 3801.5 12271.65 3.228107326
RotateBenchmark.testRotateRightI 512 7 256 12223.711 31239.788 2.555671351
RotateBenchmark.testRotateRightI 512 7 512 5973.571 16740.852 2.802486486
RotateBenchmark.testRotateRightI 512 15 256 12205.47 31248.025 2.560165647
RotateBenchmark.testRotateRightI 512 15 512 5966.513 15728.168 2.6360737
RotateBenchmark.testRotateRightI 512 31 256 12209.405 33181.105 2.71766765
RotateBenchmark.testRotateRightI 512 31 512 5981.527 15727.496 2.629344647
RotateBenchmark.testRotateRightL 128 7 256 2054.509 6980.849 3.397818652
RotateBenchmark.testRotateRightL 128 7 512 997.375 3242.374 3.250907633
RotateBenchmark.testRotateRightL 128 15 256 2051.459 6892.389 3.359749817
RotateBenchmark.testRotateRightL 128 15 512 1002.906 3223.342 3.21400211
RotateBenchmark.testRotateRightL 128 31 256 2044.749 6984.157 3.415654929
RotateBenchmark.testRotateRightL 128 31 512 1004.273 3237.496 3.22372104
RotateBenchmark.testRotateRightL 256 7 256 3811.551 13347.75 3.501920872
RotateBenchmark.testRotateRightL 256 7 512 1892.883 5840.85 3.085689924
RotateBenchmark.testRotateRightL 256 15 256 3821.705 14034.823 3.672398314
RotateBenchmark.testRotateRightL 256 15 512 1799.193 5817.533 3.233412424
RotateBenchmark.testRotateRightL 256 31 256 3816.666 14022.31 3.673968327
RotateBenchmark.testRotateRightL 256 31 512 1796.649 5824.13 3.241662673
RotateBenchmark.testRotateRightL 512 7 256 5943.986 15586.254 2.622188881
RotateBenchmark.testRotateRightL 512 7 512 3022.686 7662.241 2.534911334
RotateBenchmark.testRotateRightL 512 15 256 5958.008 15726.859 2.639616966
RotateBenchmark.testRotateRightL 512 15 512 2998.469 7654.703 2.552870482
RotateBenchmark.testRotateRightL 512 31 256 5937.491 15741.207 2.651154671
RotateBenchmark.testRotateRightL 512 31 512 3014.699 7656.837 2.539834657
RotateBenchmark.testRotateRightS 128 7 256 8172.896 8003.474 0.979270261
RotateBenchmark.testRotateRightS 128 7 512 4111.074 4047.267 0.984479238
RotateBenchmark.testRotateRightS 128 15 256 8225.79 8040.421 0.9774649
RotateBenchmark.testRotateRightS 128 15 512 4129.801 4011.919 0.971455767
RotateBenchmark.testRotateRightS 128 31 256 8176.102 8052.686 0.984905276
RotateBenchmark.testRotateRightS 128 31 512 4117.735 4046.522 0.982705784
RotateBenchmark.testRotateRightS 256 7 256 15213.617 15169.51 0.997100821
RotateBenchmark.testRotateRightS 256 7 512 7530.289 7625.581 1.012654494
RotateBenchmark.testRotateRightS 256 15 256 15238.384 15069.978 0.988948566
RotateBenchmark.testRotateRightS 256 15 512 7275.098 7620.764 1.047513587
RotateBenchmark.testRotateRightS 256 31 256 15299.821 15043.765 0.983264118
RotateBenchmark.testRotateRightS 256 31 512 7273.028 7630.97 1.04921499
RotateBenchmark.testRotateRightS 512 7 256 23998.152 23920.046 0.996745333
RotateBenchmark.testRotateRightS 512 7 512 11582.679 11916.382 1.02881052
RotateBenchmark.testRotateRightS 512 15 256 23982.797 23434.756 0.977148579
RotateBenchmark.testRotateRightS 512 15 512 11629.806 11918.759 1.0248459
RotateBenchmark.testRotateRightS 512 31 256 23988.549 23475.629 0.978618132
RotateBenchmark.testRotateRightS 512 31 512 11650.146 11916.47 1.022860143
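
The Gain column appears to be the optimized score divided by the baseline score. A minimal sketch of that arithmetic, using two rows copied from the table (the class and method names are hypothetical):

```java
// Recomputes the Gain column as optimized throughput / baseline throughput.
final class GainSketch {
    static double gain(double baselineOpsPerMs, double optimizedOpsPerMs) {
        return optimizedOpsPerMs / baselineOpsPerMs;
    }

    public static void main(String[] args) {
        // testRotateLeftI, 128 bits, shift 7, size 256
        System.out.println(gain(4043.39, 13461.814));
        // testRotateLeftB, 128 bits, shift 7, size 256
        System.out.println(gain(3939.136, 3836.133));
    }
}
```

A value above 1 means the patched build is faster; the int row comes out near 3.33 and the byte row near 0.97, matching the table.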

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3720/head:pull/3720
$ git checkout pull/3720

Update a local copy of the PR:
$ git checkout pull/3720
$ git pull https://git.openjdk.java.net/jdk pull/3720/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3720

View PR using the GUI difftool:
$ git pr show -t 3720

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3720.diff

Member Author

@jatin-bhateja jatin-bhateja commented Apr 27, 2021

/label hotspot-compiler-dev


@bridgekeeper bridgekeeper bot commented Apr 27, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


@openjdk openjdk bot commented Apr 27, 2021

@jatin-bhateja The label hotspot-compile is not a valid label. These labels are valid:

  • serviceability
  • hotspot
  • sound
  • hotspot-compiler
  • kulla
  • i18n
  • shenandoah
  • jdk
  • javadoc
  • 2d
  • security
  • swing
  • hotspot-runtime
  • jmx
  • build
  • nio
  • beans
  • core-libs
  • compiler
  • net
  • hotspot-gc
  • hotspot-jfr
  • awt

@openjdk openjdk bot commented Apr 27, 2021

@jatin-bhateja The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@jatin-bhateja jatin-bhateja changed the title 8265126: unified handling for VectorMask object re-materialization during de-optimization (re-submit) 8266054: VectorAPI rotate operation optimization Apr 27, 2021
Member Author

@jatin-bhateja jatin-bhateja commented Apr 27, 2021

/label remove hotspot

@openjdk openjdk bot removed the hotspot label Apr 27, 2021

@openjdk openjdk bot commented Apr 27, 2021

@jatin-bhateja
The hotspot label was successfully removed.

@openjdk openjdk bot added the rfr label Apr 27, 2021

@mlbridge mlbridge bot commented Apr 27, 2021

Member Author

@jatin-bhateja jatin-bhateja commented Apr 27, 2021

/label add hotspot-compiler-dev


@openjdk openjdk bot commented Apr 27, 2021

@jatin-bhateja
The hotspot-compiler label was successfully added.

Member

@PaulSandoz PaulSandoz left a comment

I noticed the tests are only updated for int and long, is that intentional? The HotSpot changes in some cases seem to imply all integral types are supported via the use of is_integral_type, contradicted by the use of is_subword_type.

I would recommend trying to leverage the Integer/Long.rotateLeft/Right implementations. They are not available for byte/short, so let's add specific methods for those cases; that should make the Java op implementation clearer.
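
One possible shape for such byte/short helpers is sketched below. This is an assumption made for illustration, not the code that was eventually integrated; the class name is hypothetical.

```java
// Hypothetical scalar rotate helpers for sub-word types, mirroring
// Integer.rotateLeft/rotateRight but wrapping at 8 or 16 bits.
final class SubwordRotateSketch {
    static byte rotateLeft(byte a, int n) {
        int s = n & (Byte.SIZE - 1);   // wrap the count to 0..7
        int u = a & 0xFF;              // zero-extend so high bits do not leak in
        return (byte) ((u << s) | (u >>> (Byte.SIZE - s)));
    }

    static byte rotateRight(byte a, int n) {
        int s = n & (Byte.SIZE - 1);
        int u = a & 0xFF;
        return (byte) ((u >>> s) | (u << (Byte.SIZE - s)));
    }

    static short rotateLeft(short a, int n) {
        int s = n & (Short.SIZE - 1);  // wrap the count to 0..15
        int u = a & 0xFFFF;
        return (short) ((u << s) | (u >>> (Short.SIZE - s)));
    }

    static short rotateRight(short a, int n) {
        int s = n & (Short.SIZE - 1);
        int u = a & 0xFFFF;
        return (short) ((u >>> s) | (u << (Short.SIZE - s)));
    }

    public static void main(String[] args) {
        // 1000_0001 rotated left by one bit wraps the top bit around to the bottom.
        System.out.println(rotateLeft((byte) 0x81, 1));
        System.out.println(rotateRight((short) 1, 1));
    }
}
```

The zero-extension step matters: without it, sign extension of a negative byte or short would pollute the bits shifted in from the right.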


@Benchmark
public void testRotateLeftI(Blackhole bh) {
for(int i = 0 ; i < 10000; i++) {

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

No need for the outer loop. JMH will do that for you.

public class RotateBenchmark {

@Param({"64","128","256"})
public int TESTSIZE;

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

Suggested change
public int TESTSIZE;
int size;

Lower case for instance field names (same applies to the species).
No need for public.

public final long[] specialValsL = {0L, -0L, Long.MIN_VALUE, Long.MAX_VALUE};
public final int[] specialValsI = {0, -0, Integer.MIN_VALUE, Integer.MAX_VALUE};
Comment on lines 54 to 55

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

Suggested change
public final long[] specialValsL = {0L, -0L, Long.MIN_VALUE, Long.MAX_VALUE};
public final int[] specialValsI = {0, -0, Integer.MIN_VALUE, Integer.MAX_VALUE};
static final long[] specialValsL = {0L, -0L, Long.MIN_VALUE, Long.MAX_VALUE};
static final int[] specialValsI = {0, -0, Integer.MIN_VALUE, Integer.MAX_VALUE};
public void testRotateLeftI(Blackhole bh) {
for(int i = 0 ; i < 10000; i++) {
for (int j = 0 ; j < TESTSIZE; j+= ISPECIES.length()) {
vecI = IntVector.fromArray(ISPECIES, inpI, j);

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

Suggested change
vecI = IntVector.fromArray(ISPECIES, inpI, j);
var vecI = IntVector.fromArray(ISPECIES, inpI, j);

Use a local variable instead of storing to a field

vecI = vecI.lanewise(VectorOperators.ROL, i);
vecI = vecI.lanewise(VectorOperators.ROL, i);
vecI.lanewise(VectorOperators.ROL, i).intoArray(resI, j);
}

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

Suggested change
}
}
return resI;

Return the result to better ensure HotSpot does not detect that the result is unused.

@@ -730,6 +725,24 @@ public abstract class $abstractvectortype$ extends AbstractVector<$Boxtype$> {
v0.bOp(v1, (i, a, n) -> ($type$)(a >> n));
case VECTOR_OP_URSHIFT: return (v0, v1) ->
v0.bOp(v1, (i, a, n) -> ($type$)((a & LSHR_SETUP_MASK) >>> n));
#if[long]

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

I recommend you create new methods in IntVector etc called rotateLeft and rotateRight that do what is in the lambda expressions. Then you can collapse this to non-conditional cases calling those methods.

Do the same for the tests (like I did with the unsigned support), see

and

gen_compare_op "UNSIGNED_LT" "ult" "BITWISE"

That will avoid the embedding of complex expressions.

@@ -1649,6 +1649,9 @@ const bool Matcher::match_rule_supported_vector(int opcode, int vlen, BasicType
break;
case Op_RotateRightV:
case Op_RotateLeftV:
if (is_subword_type(bt)) {

@PaulSandoz

PaulSandoz Apr 30, 2021
Member

Does that have the effect of not intrinsifying for byte or short?

@jatin-bhateja

jatin-bhateja May 3, 2021
Author Member

Yes, it makes sure that intrinsification is based on Shifts and Or operations instead of Rotation operation.
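
The decision described in this exchange can be modeled roughly as follows. This is a simplified Java sketch of the selection logic, under the assumption that only the lane type and a single "target has vector rotate" flag matter; the real check lives in the C++ matcher and takes more inputs. All names are hypothetical.

```java
// Simplified model of how a rotate is lowered, per the discussion above:
// sub-word lanes, or targets without a vector rotate, use shifts plus OR.
final class RotateLoweringSketch {
    enum Lane { BYTE, SHORT, INT, LONG }
    enum Lowering { DIRECT_ROTATE, SHIFTS_AND_OR }

    static boolean isSubword(Lane lane) {
        return lane == Lane.BYTE || lane == Lane.SHORT;
    }

    static Lowering choose(Lane lane, boolean targetHasVectorRotate) {
        if (isSubword(lane) || !targetHasVectorRotate) {
            return Lowering.SHIFTS_AND_OR;   // still intrinsified, just not as a rotate node
        }
        return Lowering.DIRECT_ROTATE;       // e.g. the AVX512 rotate instructions for int/long
    }

    public static void main(String[] args) {
        for (Lane lane : Lane.values()) {
            System.out.println(lane + " -> " + choose(lane, true));
        }
    }
}
```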

Member Author

@jatin-bhateja jatin-bhateja commented May 3, 2021

Hi @PaulSandoz, thanks; your comments have been addressed.

Member

@PaulSandoz PaulSandoz left a comment

Testing-wise, can we reuse the Kernel-Binary-*-op.template files, so there is no need for separate templates?
Further, I think we need to test both the lane-wise method that accepts a vector and the one that accepts a broadcast scalar, since they go through different code paths. The broadcast method can use the primitive type rather than a cast to int, likely making it easier to reuse the binary templates.

It would be good if the scalar methods for rotating left/right were identical for the main code and tests. I prefer the code in the test methods.

@@ -521,9 +521,9 @@ static boolean opKind(Operator op, int bit) {
/** Produce {@code a>>>(n&(ESIZE*8-1))}. Integral only. */
public static final /*bitwise*/ Binary LSHR = binary("LSHR", ">>>", VectorSupport.VECTOR_OP_URSHIFT, VO_SHIFT);
/** Produce {@code rotateLeft(a,n)}. Integral only. */
public static final /*bitwise*/ Binary ROL = binary("ROL", "rotateLeft", -1 /*VectorSupport.VECTOR_OP_LROTATE*/, VO_SHIFT | VO_SPECIAL);
public static final /*bitwise*/ Binary ROL = binary("ROL", "rotateLeft", VectorSupport.VECTOR_OP_LROTATE, VO_SHIFT | VO_SPECIAL);

@PaulSandoz

PaulSandoz May 3, 2021
Member

I think we can remove the VO_SPECIAL flag on ROL and ROR now it is uniformly managed?

Member

@PaulSandoz PaulSandoz left a comment

Java code updates look good

I believe you can now remove the four "*-Rotate_*.template" files now that you leverage existing templates?

Also, I believe the ancillary changes to gen-template.sh are no longer strictly required, now that we defer to method calls for ROL/ROR?

Member Author

@jatin-bhateja jatin-bhateja commented May 8, 2021

> Java code updates look good
>
> I believe you can now remove the four "*-Rotate_*.template" files now that you leverage existing templates?
>
> Also, I believe the ancillary changes to gen-template.sh are no longer strictly required, now that we defer to method calls for ROL/ROR?

Thanks Paul, the redundant files (missed in the last check-in) have been removed, and the benchmark results have been updated with the latest code.

Member

@PaulSandoz PaulSandoz left a comment

Looks good. Someone from the HotSpot side needs to review related changes.

The way I read the perf numbers is that on non-AVX512 systems the numbers are in the noise (no worse, no better), with significant improvement on AVX512.

Member Author

@jatin-bhateja jatin-bhateja commented May 17, 2021

Hi @iwanowww, @neliasso, can you kindly review the compiler-side changes and share your feedback?

Member Author

@jatin-bhateja jatin-bhateja commented Jun 1, 2021

Hi @iwanowww, @neliasso , kindly review compiler side changes and share your feedback.

Member Author

@jatin-bhateja jatin-bhateja commented Jun 10, 2021

Hi @iwanowww, @neliasso , kindly review compiler side changes and share your feedback.

Member Author

@jatin-bhateja jatin-bhateja commented Jun 18, 2021

Hi @iwanowww, @neliasso , kindly review compiler side changes and share your feedback.

Contributor

@theRealELiu theRealELiu left a comment

Just some format issues when I tried to use this benchmark.

intinp[i] = i;
longinp[i] = i;
}
for (int i = 0 ; i < specialvalsbyte.length; i++) {

@theRealELiu

theRealELiu Jul 14, 2021
Contributor

Suggested change
for (int i = 0 ; i < specialvalsbyte.length; i++) {
for (int i = 0; i < specialvalsbyte.length; i++) {

Please remove this kind of space.

@Benchmark
public void testRotateLeftB(Blackhole bh) {
ByteVector bytevec = null;
for (int j = 0 ; j < size; j+= bspecies.length()) {

@theRealELiu

theRealELiu Jul 14, 2021
Contributor

Suggested change
for (int j = 0 ; j < size; j+= bspecies.length()) {
for (int j = 0 ; j < size; j += bspecies.length()) {

Needs a space before +=.

int rshiftopc = VectorNode::opcode(urshiftopc(), elem_bt);
if (!is_supported &&
arch_supports_vector(lshiftopc, num_elem, elem_bt, VecMaskNotUsed) &&
arch_supports_vector(rshiftopc, num_elem, elem_bt, VecMaskNotUsed) &&
arch_supports_vector(Op_OrV, num_elem, elem_bt, VecMaskNotUsed)) {
is_supported = true;
}
return is_supported;
}
Comment on lines 78 to 86

@sviswa7

sviswa7 Jul 16, 2021

Please add comments here explaining why the Left/Right shift and Or opcodes are being checked. Also add comments explaining why, for left shift, we only check the int and long left shift opcodes, whereas for right shift the sub-word opcodes are checked.

@jatin-bhateja

jatin-bhateja Jul 18, 2021
Author Member

Both left and right shift opcodes are selected for all integral types (byte/short/int/long). VectorNode::opcode returns the granular left shift opcode based on the sub-type, i.e. elem_bt, in the case of LeftShiftI. I am re-organizing the code for better readability.

if (!is_supported &&
arch_supports_vector(lshiftopc, num_elem, elem_bt, VecMaskNotUsed) &&
arch_supports_vector(rshiftopc, num_elem, elem_bt, VecMaskNotUsed) &&
arch_supports_vector(Op_OrV, num_elem, elem_bt, VecMaskNotUsed)) {
is_supported = true;
}
Comment on lines 79 to 84

@sviswa7

sviswa7 Jul 16, 2021

When check_bcast is set, is_supported could be false when replicate is not supported. Is replicate not needed for shift+or sequence?

@jatin-bhateja

jatin-bhateja Jul 18, 2021
Author Member

check_bcast is true only when the shift value is a non-constant scalar; in that case we need to check that the broadcast operation for the shift is supported. In all other cases a broadcast is not needed. A constant shift value is an optimized case, since AVX512 offers instructions that directly accept an 8-bit immediate shift value.

} else {
// TODO When mask usage is supported, VecMaskNotUsed needs to be VecMaskUseLoad.
if ((sopc != 0) &&
!arch_supports_vector(sopc, num_elem, elem_bt, is_vector_mask(vbox_klass) ? VecMaskUseAll : VecMaskNotUsed)) {

@sviswa7

sviswa7 Jul 16, 2021

Could we not call arch_supports_vector_rotate from arch_supports_vector?

@jatin-bhateja

jatin-bhateja Jul 18, 2021
Author Member

DONE

bool is_const_rotate = is_rotate && cnt_type && cnt_type->is_con() &&
-0x80 <= cnt_type->get_con() && cnt_type->get_con() < 0x80;
if (is_rotate) {
if (!arch_supports_vector_rotate(sopc, num_elem, elem_bt, !is_const_rotate)) {

@sviswa7

sviswa7 Jul 16, 2021

What is the relationship between check_bcast and !is_const_rotate? Some comments here on this would help.

@jatin-bhateja

jatin-bhateja Jul 18, 2021
Author Member

A constant shift value is an optimized case, since AVX512 offers instructions that directly accept constant shifts that fit in an 8-bit immediate, i.e. in the range [-128, 127] as checked in the code above. Similar handling is done in SLP.
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/superword.cpp#L2493

However, this is a very x86-specific check in generic code, so I am moving the decision to a new target-specific matcher routine.
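
To make the relationship between check_bcast and !is_const_rotate concrete, here is a small Java model of the range test quoted in the snippet above. The bounds are taken from that snippet; the class and method names are hypothetical.

```java
// Models the constant-rotate test: a compile-time constant count that fits in a
// signed 8-bit immediate needs no broadcast; anything else is broadcast to a vector.
final class RotateCountSketch {
    // Mirrors: -0x80 <= con && con < 0x80
    static boolean fitsImmediate(int count) {
        return -0x80 <= count && count < 0x80;
    }

    // Corresponds to the check_bcast argument, i.e. !is_const_rotate.
    static boolean needsBroadcast(boolean isConstant, int count) {
        boolean isConstRotate = isConstant && fitsImmediate(count);
        return !isConstRotate;
    }

    public static void main(String[] args) {
        System.out.println(needsBroadcast(true, 7));    // constant, in range
        System.out.println(needsBroadcast(false, 7));   // value only known at run time
        System.out.println(needsBroadcast(true, 300));  // constant, but too large for the immediate
    }
}
```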

cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) : cnt;
opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, type_bt));
} else {
// constant shift.

@sviswa7

sviswa7 Jul 16, 2021

Did you mean constant rotate here?

@jatin-bhateja

jatin-bhateja Jul 18, 2021
Author Member

Yes.

assert(VectorNode::is_invariant_vector(cnt), "Broadcast expected");
cnt = cnt->in(1);
if (bt == T_LONG) {
// Shift count vector for Rotate vector has long elements too.
assert(cnt->Opcode() == Op_ConvI2L, "ConvI2L expected");
cnt = cnt->in(1);
}
shiftRCnt = phase->transform(new AndINode(cnt, phase->intcon(shift_mask)));
shiftRCnt = cnt;

@sviswa7

sviswa7 Jul 16, 2021

Why do we remove the And with mask here?

@jatin-bhateja

jatin-bhateja Jul 18, 2021
Author Member

And'ing with shift_mask is already done in the Java API side implementation before making a call to the intrinsic routine.

@sviswa7

sviswa7 Jul 21, 2021

This path is also taken from the non-Vector-API path; won't masking be needed there?

@sviswa7

sviswa7 Jul 26, 2021

@jatin-bhateja This question is still pending.

@jatin-bhateja

jatin-bhateja Jul 26, 2021
Author Member

@sviswa7, the SLP flow will have either a constant 8-bit shift value or a variable shift present in a vector; this also includes a broadcasted non-constant shift value.

@theRealELiu

theRealELiu Jul 27, 2021
Contributor

It would be better to add a comment here, since the correctness relies on code elsewhere.

@jatin-bhateja

jatin-bhateja Jul 27, 2021
Author Member

@theRealELiu, @sviswa7, a comment already exists in the code. I described this incorrectly earlier in this thread and have rectified my comments.

Member Author

@jatin-bhateja jatin-bhateja commented Jul 18, 2021

Hi @sviswa7 your comments have been addressed.

Member Author

@jatin-bhateja jatin-bhateja commented Jul 20, 2021

Hi @sviswa7,
I have removed the noise from the earlier benchmark and updated benchmark results with latest changes.

if (!is_const_rotate) {
const Type * type_bt = Type::get_const_basic_type(elem_bt);
cnt = elem_bt == T_LONG ? gvn().transform(new ConvI2LNode(cnt)) : cnt;
opd2 = gvn().transform(VectorNode::scalar2vector(cnt, num_elem, type_bt));
} else {
Comment on lines +1594 to +1598

@sviswa7

sviswa7 Jul 27, 2021

Why conversion for only T_LONG and not for T_BYTE and T_SHORT? Is there an assumption here that only T_INT and T_LONG elem_bt are supported?

@jatin-bhateja

jatin-bhateja Jul 27, 2021
Author Member

Correcting this: I2L may be needed in the auto-vectorization flow, since the Integer/Long.rotate[Right/Left] APIs accept only an int shift count, so for Long.rotate* operations the int shift value must be converted to long using I2L before broadcasting it. For Vector API lanewise operations between a vector and a scalar, the scalar type already matches the vector's basic type. Since the degeneration routine is common between both flows, IR consistency is maintained here.
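
A loose Java analogy for the I2L step, assuming long lanes are modeled as a long[] and the broadcast as filling an array. The names are hypothetical; the actual transformation operates on C2 IR nodes, not on arrays.

```java
import java.util.Arrays;

// Models widening an int rotate count to long before broadcasting it across long lanes.
final class BroadcastCountSketch {
    // Long.rotateLeft takes an int count, so a long-lane count vector needs a widening step first.
    static long[] broadcastCount(int count, int lanes) {
        long widened = count;            // the analogue of ConvI2L
        long[] v = new long[lanes];
        Arrays.fill(v, widened);         // the analogue of scalar2vector (broadcast)
        return v;
    }

    // Applies a per-lane rotate using the broadcast count vector.
    static long[] rotateLeftLanes(long[] a, long[] counts) {
        long[] r = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = Long.rotateLeft(a[i], (int) counts[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        long[] data = {1L, Long.MIN_VALUE};
        System.out.println(Arrays.toString(rotateLeftLanes(data, broadcastCount(1, data.length))));
    }
}
```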

@sviswa7

sviswa7 Jul 27, 2021

For Vector API the shift is always coming in as int type for rotate by scalar (lanewiseShiftTemplate). The down conversion to byte or short needs to be done before scalar2vector.

@sviswa7

sviswa7 Jul 27, 2021

I see that similar thing is done before for shift, so down conversion to sub type is not required.

Node* shift_mask_node = (bt == T_LONG) ? (Node*)(phase->longcon(shift_mask + 1L)) :
(Node*)(phase->intcon(shift_mask + 1));
Node* vector_mask = phase->transform(VectorNode::scalar2vector(shift_mask_node,vlen, elem_ty));
int subVopc = VectorNode::opcode((bt == T_LONG) ? Op_SubL : Op_SubI, bt);
Comment on lines +1196 to +1199

@sviswa7

sviswa7 Jul 27, 2021

There seems to be an assumption here that the vector type is INT or LONG only and not subword type. From Vector API you can get the sub word types as well.
Also if this path is coming from auto-vectorizer, don't we need masking here?

@jatin-bhateja

jatin-bhateja Jul 27, 2021
Author Member

The subtype is passed to VectorNode::opcode for correct opcode selection. Also, shift_mask_node is a constant value node, so there is no assumption about the vector type. Wrap-around (masking) of the shift value may not be needed here, since we are degenerating the rotate into shifts (logical left and right). Also, scalar rotate nodes entering the SLP flow have their constant shift values wrapped appropriately during RotateLeft/RightNode idealization.
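
Reading the snippet above, the opposite-direction shift count seems to be formed by subtracting the count vector from a broadcast of shift_mask + 1. A per-lane Java sketch of that degeneration follows, under the assumption that counts are already within [0, shift_mask]; names are hypothetical and int lanes are used for brevity.

```java
// Per-lane model of degenerating a variable rotate-left into two shifts and an OR,
// where the right-shift count is (shift_mask + 1) - count.
final class RotateDegenerationSketch {
    static int[] rotateLeftLanes(int[] a, int[] counts) {
        final int shiftMask = Integer.SIZE - 1;      // 31 for int lanes
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            int left = counts[i];                    // assumed already in [0, shiftMask]
            int right = (shiftMask + 1) - left;      // the vector subtract in the snippet above
            // For left == 0, right is 32; Java masks that to 0, giving a | a == a.
            r[i] = (a[i] << left) | (a[i] >>> right);
        }
        return r;
    }

    public static void main(String[] args) {
        int[] out = rotateLeftLanes(new int[] {1, 0x80000000, 0xF0}, new int[] {1, 1, 0});
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

On hardware whose variable shifts yield zero for an out-of-range count, the count-0 lane still comes out right, because the left-shift term alone already equals the input.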


@sviswa7 sviswa7 left a comment

Looks good to me.

Member Author

@jatin-bhateja jatin-bhateja commented Jul 28, 2021

/integrate


@openjdk openjdk bot commented Jul 28, 2021

Going to push as commit d994b93.
Since your change was applied there have been 124 commits pushed to the master branch:

  • ed1cb24: 8271118: C2: StressGCM should have higher priority than frequency-based policy
  • 9bc52af: 8271323: [TESTBUG] serviceability/sa/ClhsdbCDSCore.java fails with -XX:TieredStopAtLevel=1
  • 752b6df: 8261236: C2: ClhsdbJstackXcompStress test fails when StressGCM is enabled
  • a50161b: Merge
  • f1e15c8: 8271350: runtime/Safepoint tests use OutputAnalyzer::shouldMatch instead of shouldContaint
  • fbe28e4: 8270866: NPE in DocTreePath.getTreePath()
  • f662127: 8270491: SEGV at read_string_field(oopDesc*, char const*, JavaThread*)+0x54
  • cea7bc2: 8271223: two runtime/ClassFile tests don't check exit code
  • 90cd2fa: 8270859: Post JEP 411 refactoring: client libs with maximum covering > 10K
  • c8af823: 8267485: Remove the dependency on SecurityManager in JceSecurityManager.java
  • ... and 114 more: https://git.openjdk.java.net/jdk/compare/a8f15427156b8095ee815fbe6ed14c25c1d4b374...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Jul 28, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Jul 28, 2021

@openjdk openjdk bot commented Jul 28, 2021

@jatin-bhateja Pushed as commit d994b93.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Contributor

@wangweij wangweij commented Jul 28, 2021

No comma after "2021" in test/micro/org/openjdk/bench/jdk/incubator/vector/RotateBenchmark.java.

Contributor

@vnkozlov vnkozlov commented Jul 28, 2021

@sviswa7 and @jatin-bhateja
The push caused https://bugs.openjdk.java.net/browse/JDK-8271366
I strongly suggest that in the future you ask an Oracle engineer to test Intel's changes before pushing.

Member

@PaulSandoz PaulSandoz commented Jul 28, 2021

> @sviswa7 and @jatin-bhateja
> The push caused https://bugs.openjdk.java.net/browse/JDK-8271366
> I strongly suggest that in the future you ask an Oracle engineer to test Intel's changes before pushing.

Yes, as discussed before, please request that we perform internal tests before integrating, e.g. by CC'ing me. Unfortunately the pre-commit PR tests don't cover all the test cases, and we don't yet have a way to expand that set.


@sviswa7 sviswa7 commented Jul 28, 2021

@vnkozlov @PaulSandoz Sorry for the inconvenience. @jatin-bhateja Please don't be in a hurry to push; reach out to Oracle engineers for testing before pushing.

6 participants