-
Notifications
You must be signed in to change notification settings - Fork 6.1k
8353500: [s390x] Intrinsify Unsafe::setMemory #24480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Welcome back amitkumar! A progress list of the required criteria for merging this PR into |
|
@offamitkumar This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 939 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
|
@offamitkumar The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
|
with patch: without patch: |
| __ z_risbg(tmp, size, 32, 128/* risbgz */ + 63, 64 - exact_log2(2 * elem_size), 0); // just do the right shift and set cc | ||
| __ z_bre(L_Tail); | ||
|
|
||
| __ align(16); // loop alignment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align(32) would be more helpful:
- instruction engine fetches octoword (32 bytes) bundles.
- Tight loop is < 32 byes -> all in one bundle, does not cross cache line boundary.
| // multiple of 2 | ||
| do_setmemory_atomic_loop(2, dest, size, byteVal, _masm); | ||
|
|
||
| __ align(16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this alignment good for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Branch target alignment. There is no fallthrough path from before this point. Should it be 32?
| __ z_ogrk(rScratch1, dest, size); | ||
|
|
||
| __ z_nill(rScratch1, 7); | ||
| __ z_bre(L_fill8Bytes); // branch if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls use z_braz() to reflect check semantics
|
Since this is taken from #24254: Maybe you can review that one, too? |
This reverts commit a6af6da.
|
This looks good to me. I suggest measuring performance with the latest version. |
|
Result looks almost similar: |
|
This result is from shared-machine, but looks like the regression part is fixed. We got regression because, for Unaligned case, only 1-byte store instruction were getting emitted (i.e. I will do further testing and will see if more optimization can be done. Then will mark this PR ready for review. |
@TheRealMDoerr But in this case MVC will only be used iff store is unaligned. If they are unaligned then we don't care about the atomicity. In other case, we will use If we don't go ahead with mvc, then we are seeing regression, as you have noticed in the previous result. |
|
As I said, mvc usage may be a bug. It was probably not indented that gcc generates it for |
Are these corner cases relevant at all? |
I am not sure about that. But the hit was significant in case of 255 & 256 byte. |
|
The invariant on other platforms is that all Bytes before the non-writable address have been written when hitting a signal. I don' know if that is really required on s390. It may be a risk to use a different behavior. The code can be used to write memory mapped files or other stuff. |
I thought the reason for not using mvc always is atomicity within array elements? That is, if you're writing an array of 4- or 8-byte values, than change to every one of those array elements should be atomic w.r.t. other CPUs. If that is true, you cannot use mvc. (However, that requirement would not be relevant for arrays of 1-byte values.) |
Correct. |
|
There's a lot of confusion about this. There is no requirement that all bytes before the non-writable address have been written when hitting a signal. Behaving nicely when writing beyond allocated memory is "best effort" only: we're trying to be nice, that's all. The atomicity requirement is here , in the specification of |
|
Ah, thanks! I was not aware of that. That means the current implementation is probably wrong in some cases (mvc generated by gcc). Or is mvc only used in the single Byte aligned case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new proposal is probably ok, then.
Yes, that's right, just for the byte-aligned case. |
|
Tier-1 test are clean with fastdebug-vm; These are the performance number on my z16 zVM: |
|
The atomicity spec cited by @theRealAph severely limits the optimisation options. Depending on the data alignment, you have to use 8, 4, or 2-byte stores. Only for the unaligned case there are no hard restrictions, just the soft "let's be nice" conventions. With that said, the vector implementation should be ok. It is just not as nice as a byte store loop. There could be as many as 15 uninitialised bytes if just the last byte of a vector store is not writable. I would take that risk. |
| UnsafeMemoryAccessMark umam(this, true, false); | ||
|
|
||
| __ z_vlvgb(Z_V0, byteVal, 0); | ||
| __ z_vrepb(Z_V0, Z_V0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also use z_vzero(Vreg) to preload the vector register with all zeroes. Saves an instruction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not loading 0 here. This is my intention: with z_vlvgb, putting value of byteVal in the first 0th index of Z_V0 and then with z_vrepb replicating the 0th index value (1 byte) to the whole register.
z_vzero will make sense if we are zeroing out the memory but that's not the case always. We do fill some non-zero 1 byte value in most of the case.
|
The large number of conditional branches may cause a regression in real life scenarios with a large variance of sizes and alignments. |
I can try to run the same benchmark with larger sizes. But again it wouldn't replicate the real life scenario. Could you suggest some other benchmark ? |
|
As of now I am not getting any regression in the benchmark. And vector store + mvc is not performing better then the vector store only solution. So I am moving ahead with the integration. |
|
Thanks to all for the help and reviews/suggestion you provided. /integrate |
|
Going to push as commit 2000551.
Your commit was automatically rebased without conflicts. |
|
@offamitkumar Pushed as commit 2000551. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
|
What are all those |
Sorry that is old code; nops were inserted for the loop alignment; this is the newer stub code: |
Unsafe::setMemory intrinsic implementation for s390x.
Stub Code:
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24480/head:pull/24480$ git checkout pull/24480Update a local copy of the PR:
$ git checkout pull/24480$ git pull https://git.openjdk.org/jdk.git pull/24480/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24480View PR using the GUI difftool:
$ git pr show -t 24480Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24480.diff
Using Webrev
Link to Webrev Comment