Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8276083: Incremental patch to further optimize new compress/expand APIs over X86 #157

Conversation

jatin-bhateja
Copy link
Member

@jatin-bhateja jatin-bhateja commented Oct 28, 2021

Summary of changes:

  1. Added both scalar and vector variants of JMH performance tests for Vector.compress/expand and VectorMask.compress APIs.

  2. Improved performance of operations where mask length is less than 4. Mask loading is a two stage process where in first the boolean array is loaded into a vector and then either transferred to a predicate register or a vector whose size is equivalent to that of underlined SPECIES. A mask whose length is less than 4 will result into a less than 32 bit vector load operation. Operations dependent on smaller masks are now being handled in java side implementation of these and some other APIs. Since the condition for special handling and fallback logic leading to C2 intrinsic call is based on constant expression hence one of the control path is optimized out. This shall also prevent any performance penalty due to failed lazy inline expansion which most often occurs due to unsupported vector sizes. If lazy inline expansion fails then C2 emits a direct call instruction to a callee method and thus we also loose any opportunity for procedure in-lining at that point, a separate issue has been created to address this problem.

  3. Improved performance of VectorMask.compress over legacy non-AVX512 targets, added the missing checks in C2Compiler::is_intrinsics_supported routine to enable procedure in-lining early during parsing if target does not support direct compress/expand instructions.

  4. Inline expand VectorMask.intoArray operation to trigger boxing-unboxing optimization. This significantly improved the performance of VectorMask.compress in newly added JMH micros.

Following is the performance data for included JMH micros:
System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server 40C 2S)

A) VectorMask.compress:

image

B) Vector.compress:

image

C) Vector.expand:

image

Patch has been regressed using tier3 regressions at various AVX levels 0/1/2/3/KNL.

Kindly review and share your feedback.

Best Regards,
Jatin


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Issue

  • JDK-8276083: Incremental patch to further optimize new compress/expand APIs over X86

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/157/head:pull/157
$ git checkout pull/157

Update a local copy of the PR:
$ git checkout pull/157
$ git pull https://git.openjdk.java.net/panama-vector pull/157/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 157

View PR using the GUI difftool:
$ git pr show -t 157

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/157.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 28, 2021

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into vectorIntrinsics+compress will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Oct 28, 2021
@mlbridge
Copy link

mlbridge bot commented Oct 28, 2021

Webrevs

@@ -933,10 +933,19 @@ final class $vectortype$ extends $abstractvectortype$ {
@Override
@ForceInline
public $masktype$ compress() {
if (VLENGTH < 4) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This impacts the code of every vector and across every architecture. I realize its dead code for many cases, but we need to find a better way to express and manage this. IMHO this is not a maintainable solution.

The fallback would be an obvious place for this logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can push this into fall back path but in that case C2 will not be able to inline the fall back logic during failed lazy intrinsification attempt. I collected perf data with and without this and could clearly see benefit (around 1.5x) of keeping this outside the fall back. Given that conditions are guarded by constant expressions so only one of the path will be jit'ed.

I think we do support 2 byte vector loads for AARCH64 but currently minimum vector size over X86 is 4 bytes.
Agree with you, given this is an optimization on a slow path so its ok to compromise some gain to keep consistency across architectures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i think that is reasonable for now. We might be able to revisit later. Generally improving the fallback path is an area we have yet to focus on.

I think the most important aspects to focus on at the moment are the common cases where I am presuming the vector length is likely to be >= 4.

Copy link
Member

@PaulSandoz PaulSandoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did the perf tests get generated? I don't see any related changes to see how they would be generated.

@jatin-bhateja
Copy link
Member Author

How did the perf tests get generated? I don't see any related changes to see how they would be generated.

Templates added, generated files still needed some manual editing since after scatter/gather test overhaul scatter gather perf tests / templates do no exist. It can be fixed in other subsequent patch.

@jatin-bhateja
Copy link
Member Author

@PaulSandoz , your comments are addressed. Please let me know if there are further comments.

@PaulSandoz
Copy link
Member

PaulSandoz commented Nov 1, 2021

@PaulSandoz , your comments are addressed. Please let me know if there are further comments.

Did you run gen-tests.sh?

I tried running that on a clone of your branch and it generates more code, some of which does not compile:

  1. maskCompress iterates over a non-existent array a.
  2. Benchmarks for ROR and ROL are generated and the scalar benchmarks refer to non-existent methods ROR_scalar and ROL_scalar.
  3. The gather/scatter benchmarks are deleted. (They got removed when i switched the scatter/gather unit tests over to the load/store files. See 8266518: Refactor and expand scatter/gather tests jdk17#48)

So it looks like we got out of sync, likely via merges from mainline. We should at least fix 2 and possibly reconsider how to do 3.

@jatin-bhateja
Copy link
Member Author

jatin-bhateja commented Nov 2, 2021

@PaulSandoz , your comments are addressed. Please let me know if there are further comments.

Did you run gen-tests.sh?

I tried running that on a clone of your branch and it generates more code, some of which does not compile:

  1. maskCompress iterates over a non-existent array a.
  2. Benchmarks for ROR and ROL are generated and the scalar benchmarks refer to non-existent methods ROR_scalar and ROL_scalar.
  3. The gather/scatter benchmarks are deleted. (They got removed when i switched the scatter/gather unit tests over to the load/store files. See 8266518: Refactor and expand scatter/gather tests jdk17#48)

So it looks like we got out of sync, likely via merges from mainline. We should at least fix 2 and possibly reconsider how to do 3.

Hi @PaulSandoz ,

I have fixed 1 it was a typo, as already mentioned I did some manual editing since existing generation is broken.

Should it be OK to post the fixes for items 2 and 3 in a immediate subsequent patch once this gets integrated ?

Changes in this patch are related to compress/expand and its performance improvements.

@PaulSandoz
Copy link
Member

Yes, we can fix 2/3 separately independently of this branch. In fact i think for 3 we should consider a separate template like with the unit tests.

Copy link
Member

@PaulSandoz PaulSandoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java changes are good.

@openjdk
Copy link

openjdk bot commented Nov 2, 2021

@jatin-bhateja This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8276083: Incremental patch to further optimize new compress/expand APIs over X86

Reviewed-by: psandoz

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the vectorIntrinsics+compress branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the vectorIntrinsics+compress branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Nov 2, 2021
@jatin-bhateja
Copy link
Member Author

Thanks @PaulSandoz. Integrating this patch.

@jatin-bhateja
Copy link
Member Author

/integrate

@openjdk
Copy link

openjdk bot commented Nov 4, 2021

Going to push as commit 821635c.

@openjdk openjdk bot closed this Nov 4, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Nov 4, 2021
@openjdk
Copy link

openjdk bot commented Nov 4, 2021

@jatin-bhateja Pushed as commit 821635c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@jatin-bhateja jatin-bhateja mentioned this pull request Nov 4, 2021
2 tasks
case vmIntrinsics::_VectorComExp:
if (!Matcher::match_rule_supported(Op_CompressM)) return false;
if (!Matcher::match_rule_supported(Op_CompressV)) return false;
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of break, this should return EnableVectorSupport.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants