8276083: Incremental patch to further optimize new compress/expand APIs over X86 #157
Conversation
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR is provided below.
Webrevs
@@ -933,10 +933,19 @@ final class $vectortype$ extends $abstractvectortype$ {
    @Override
    @ForceInline
    public $masktype$ compress() {
        if (VLENGTH < 4) {
This impacts the code of every vector and across every architecture. I realize it's dead code for many cases, but we need to find a better way to express and manage this. IMHO this is not a maintainable solution.
The fallback would be an obvious place for this logic.
We can push this into the fallback path, but in that case C2 will not be able to inline the fallback logic after a failed lazy intrinsification attempt. I collected perf data with and without this change and could clearly see a benefit (around 1.5x) from keeping it outside the fallback. Since the conditions are guarded by constant expressions, only one of the paths will be JIT'ed.
I think we do support 2 byte vector loads for AARCH64 but currently minimum vector size over X86 is 4 bytes.
Agree with you; given this is an optimization on a slow path, it's OK to compromise some gain to keep consistency across architectures.
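The constant-guard pattern under discussion can be sketched in scalar form as follows (hypothetical names; `VLENGTH` stands for the species lane count, which is a compile-time constant per species, so C2 folds the branch and only one of the two paths is ever compiled):

```java
// Hypothetical scalar sketch of the guarded compress() discussed above.
// VLENGTH is a per-species constant, so the JIT folds the branch and
// compiles only one path.
public class CompressGuardSketch {
    static final int VLENGTH = 2; // e.g. a 2-lane species

    // Mask compress: set (true) lanes are packed toward index 0.
    static boolean[] compress(boolean[] lanes) {
        if (VLENGTH < 4) {
            // Small-mask path kept on the Java side: avoids a sub-32-bit
            // vector load that the backend cannot intrinsify.
            return compressScalar(lanes);
        }
        // Wide path: in the real code this is intrinsified by C2 into a
        // hardware compress instruction; here it is the same scalar logic.
        return compressScalar(lanes);
    }

    static boolean[] compressScalar(boolean[] lanes) {
        boolean[] out = new boolean[lanes.length];
        int j = 0;
        for (boolean b : lanes) {
            if (b) out[j++] = true; // pack set lanes to low indices
        }
        return out;                 // remaining lanes stay false
    }
}
```

This is only an illustration of the branching structure, not the actual implementation in the PR.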
Yes, I think that is reasonable for now. We might be able to revisit this later; generally, improving the fallback path is an area we have yet to focus on.
I think the most important aspects to focus on at the moment are the common cases, where I presume the vector length is likely to be >= 4.
How did the perf tests get generated? I don't see any related changes to see how they would be generated.
Templates were added. The generated files still needed some manual editing, since after the scatter/gather test overhaul the scatter/gather perf tests and templates no longer exist. This can be fixed in a subsequent patch.
@PaulSandoz, your comments are addressed. Please let me know if there are further comments.
Did you run …? I tried running that on a clone of your branch, and it generates more code, some of which does not compile:
So it looks like we got out of sync, likely via merges from mainline. We should at least fix 2 and possibly reconsider how to do 3.
Hi @PaulSandoz, I have fixed 1; it was a typo. As already mentioned, I did some manual editing since the existing generation is broken. Would it be OK to post the fixes for items 2 and 3 in an immediate subsequent patch once this gets integrated? The changes in this patch are related to compress/expand and its performance improvements.
Yes, we can fix 2 and 3 separately, independently of this branch. In fact, I think for 3 we should consider a separate template, as with the unit tests.
Java changes are good.
@jatin-bhateja This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time this comment was updated, there had been no new commits pushed to the branch. ➡️ To integrate this PR with the above commit message, type /integrate in a new comment.
Thanks @PaulSandoz. Integrating this patch.
/integrate
Going to push as commit 821635c.
@jatin-bhateja Pushed as commit 821635c. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
case vmIntrinsics::_VectorComExp:
  if (!Matcher::match_rule_supported(Op_CompressM)) return false;
  if (!Matcher::match_rule_supported(Op_CompressV)) return false;
  break;
Instead of break, this should return EnableVectorSupport.
Summary of changes:
Added both scalar and vector variants of JMH performance tests for Vector.compress/expand and VectorMask.compress APIs.
Improved performance of operations where the mask length is less than 4. Mask loading is a two-stage process: first the boolean array is loaded into a vector, and then it is transferred either to a predicate register or to a vector whose size matches that of the underlying SPECIES. A mask whose length is less than 4 would result in a vector load operation of fewer than 32 bits. Operations that depend on such small masks are now handled on the Java side in the implementation of these and some other APIs. Since the condition selecting between the special handling and the fallback logic leading to the C2 intrinsic call is a constant expression, one of the control paths is optimized out. This also prevents any performance penalty due to failed lazy inline expansion, which most often occurs for unsupported vector sizes. If lazy inline expansion fails, C2 emits a direct call instruction to the callee method, and we also lose any opportunity for procedure inlining at that point; a separate issue has been created to address this problem.
Improved performance of VectorMask.compress on legacy non-AVX512 targets by adding the missing checks in the C2Compiler::is_intrinsics_supported routine, enabling procedure inlining early during parsing when the target does not support direct compress/expand instructions.
Inline-expanded the VectorMask.intoArray operation to trigger the boxing/unboxing optimization. This significantly improved the performance of VectorMask.compress in the newly added JMH micros.
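The two-stage mask load described above can be illustrated in scalar form (hypothetical helper; a long bitmask stands in for the predicate register or mask vector of the underlying species):

```java
// Scalar illustration of the two-stage mask load: stage 1 reads the
// boolean array, stage 2 materializes it as a bitmask, standing in for
// the predicate register / mask vector. A mask shorter than 4 lanes
// corresponds to a sub-32-bit load, the case now handled in Java.
public class MaskLoadSketch {
    static long maskToBits(boolean[] m) {
        long bits = 0L;
        for (int i = 0; i < m.length; i++) {
            if (m[i]) {
                bits |= 1L << i; // lane i set -> bit i set
            }
        }
        return bits;
    }
}
```

This is a semantic sketch only; the actual implementation loads the boolean array with a vector load and transfers it to mask form in hardware.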
Following is the performance data for included JMH micros:
System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server 40C 2S)
A) VectorMask.compress:
B) Vector.compress:
C) Vector.expand:
The patch has been regression-tested with tier3 tests at various AVX levels (0/1/2/3/KNL).
Kindly review and share your feedback.
Best Regards,
Jatin
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/panama-vector pull/157/head:pull/157
$ git checkout pull/157
Update a local copy of the PR:
$ git checkout pull/157
$ git pull https://git.openjdk.java.net/panama-vector pull/157/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 157
View PR using the GUI difftool:
$ git pr show -t 157
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/panama-vector/pull/157.diff