8247928: Refactor G1ConcurrentMarkThread for mark abort #44
Conversation
👋 Welcome back tschatzl! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.
@tschatzl The following label will be automatically applied to this pull request: […]. When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the […] command.
Looks good on GitHub as well 👍
@tschatzl This change now passes all automated pre-integration checks. In addition to the automated checks, the change must also fulfill all project-specific requirements. After integration, the commit message will be: […]
Since the source branch of this PR was last updated there have been 46 commits pushed to the target branch. As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge the target branch into your branch first. ➡️ To integrate this PR with the above commit message, type /integrate in a new comment.
/reviewer 2
@kimbarrett: could you have a look, as the latest change is due to your comments? Thanks.
@tschatzl Syntax: […]

/integrate
@tschatzl Since your change was applied there have been 46 commits pushed to the target branch.
Your commit was automatically rebased without conflicts. Pushed as commit 8db3335. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
r18 should not be used as it is reserved as the platform register. Linux is fine with userspace using it, but Windows and recently also macOS ( openjdk/jdk11u-dev#301 (comment) ) actually use it on the kernel side.

The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404

I haven't seen this particular instance causing any issues in practice _yet_, presumably because it is hard to align the stars in order to trigger a problem (between the stp and ldp of r18, a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro, and that causes trouble as explained in the link above.

Output of `-XX:+PrintInterpreter` before this change:

```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes

--------------------------------------------------------------------------------
  0x0000000138809b00: ldr   x2, [x12, #16]
  0x0000000138809b04: ldrh  w2, [x2, #44]
  0x0000000138809b08: add   x24, x20, x2, uxtx #3
  0x0000000138809b0c: sub   x24, x24, #0x8
  [...]
  0x0000000138809fa4: stp   x16, x17, [sp, #128]
  0x0000000138809fa8: stp   x18, x19, [sp, #144]
  0x0000000138809fac: stp   x20, x21, [sp, #160]
  [...]
  0x0000000138809fc0: stp   x30, xzr, [sp, #240]
  0x0000000138809fc4: mov   x0, x28
                      ;;  0x10864ACCC
  0x0000000138809fc8: mov   x9, #0xaccc       // #44236
  0x0000000138809fcc: movk  x9, #0x864, lsl #16
  0x0000000138809fd0: movk  x9, #0x1, lsl #32
  0x0000000138809fd4: blr   x9
  0x0000000138809fd8: ldp   x2, x3, [sp, #16]
  [...]
  0x0000000138809ff4: ldp   x16, x17, [sp, #128]
  0x0000000138809ff8: ldp   x18, x19, [sp, #144]
  0x0000000138809ffc: ldp   x20, x21, [sp, #160]
```

After:

```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes

--------------------------------------------------------------------------------
  0x0000000108e4db00: ldr   x2, [x12, #16]
  0x0000000108e4db04: ldrh  w2, [x2, #44]
  0x0000000108e4db08: add   x24, x20, x2, uxtx #3
  0x0000000108e4db0c: sub   x24, x24, #0x8
  [...]
  0x0000000108e4dfa4: stp   x16, x17, [sp, #128]
  0x0000000108e4dfa8: stp   x19, x20, [sp, #144]
  0x0000000108e4dfac: stp   x21, x22, [sp, #160]
  [...]
  0x0000000108e4dfbc: stp   x29, x30, [sp, #224]
  0x0000000108e4dfc0: mov   x0, x28
                      ;;  0x107E4A06C
  0x0000000108e4dfc4: mov   x9, #0xa06c       // #41068
  0x0000000108e4dfc8: movk  x9, #0x7e4, lsl #16
  0x0000000108e4dfcc: movk  x9, #0x1, lsl #32
  0x0000000108e4dfd0: blr   x9
  0x0000000108e4dfd4: ldp   x2, x3, [sp, #16]
  [...]
  0x0000000108e4dff0: ldp   x16, x17, [sp, #128]
  0x0000000108e4dff4: ldp   x19, x20, [sp, #144]
  0x0000000108e4dff8: ldp   x21, x22, [sp, #160]
  [...]
```
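The effect on the spill mask can be sketched with plain bit arithmetic. The constant `0x7fffffff` is from the text above; the class name and the helper that clears the r18 bit are illustrative assumptions, not code from the patch:

```java
public class SpillMask {
    // 0x7fffffff sets bits 0..30, i.e. registers r0-r30 (from the text above).
    static final int ALL_GP_REGS = 0x7fffffff;

    // Illustrative helper: clear the bit of the reserved platform register r18.
    static int withoutPlatformRegister(int mask) {
        return mask & ~(1 << 18);
    }

    public static void main(String[] args) {
        int fixed = withoutPlatformRegister(ALL_GP_REGS);
        // Bit 18 cleared: 0x7fffffff -> 0x7ffbffff (30 registers instead of 31),
        // which is why x18 disappears from the stp/ldp pairs in the "after" listing.
        System.out.printf("before=0x%08x after=0x%08x%n", ALL_GP_REGS, fixed);
    }
}
```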
After JDK-8283091, the loop below can be vectorized partially. Statement 1 can be vectorized but statement 2 can't.

```
// int[] iArr; long[] lArrFld; int i1, i2;
for (i1 = 6; i1 < 227; i1++) {
    iArr[i1] += lArrFld[i1]++; // statement 1
    iArr[i1 + 1] -= (i2++);    // statement 2
}
```

But we get incorrect results because the vector packs of iArr are scheduled incorrectly, like:

```
  ...
  load_vector XMM1,[R8 + #16 + R11 << #2]
  movl    RDI, [R8 + #20 + R11 << #2]  # int
  load_vector XMM2,[R9 + #8 + R11 << #3]
  subl    RDI, R11  # int
  vpaddq  XMM3,XMM2,XMM0  ! add packedL
  store_vector [R9 + #8 + R11 << #3],XMM3
  vector_cast_l2x  XMM2,XMM2  !
  vpaddd  XMM1,XMM2,XMM1  ! add packedI
  addl    RDI, #228  # int
  movl    [R8 + #20 + R11 << #2], RDI  # int
  movl    RBX, [R8 + #24 + R11 << #2]  # int
  subl    RBX, R11  # int
  addl    RBX, #227  # int
  movl    [R8 + #24 + R11 << #2], RBX  # int
  ...
  movl    RBX, [R8 + #40 + R11 << #2]  # int
  subl    RBX, R11  # int
  addl    RBX, #223  # int
  movl    [R8 + #40 + R11 << #2], RBX  # int
  movl    RDI, [R8 + #44 + R11 << #2]  # int
  subl    RDI, R11  # int
  addl    RDI, #222  # int
  movl    [R8 + #44 + R11 << #2], RDI  # int
  store_vector [R8 + #16 + R11 << #2],XMM1
  ...
```

simplified as:

```
load_vector iArr in statement 1
unvectorized loads/stores in statement 2
store_vector iArr in statement 1
```

We cannot pick the memory state from the first load for the LoadI pack here, as the LoadI vector operation must load the new values in memory after statement 2 has written 'iArr[i1 + 1] - (i2++)' to 'iArr[i1 + 1]'. We must take the memory state of the last load, where the new values have already been assigned to the iArr array. In JDK-8240281, we picked the memory state of the first load.

Different from the scenario in JDK-8240281, the store, which depends on an earlier load here, is itself in a pack to be scheduled, and the LoadI pack depends on the last_mem. As designed[2], to schedule the StoreI pack, all memory operations in another single pack should be moved in the same direction. We know that the store in the pack depends on one of the loads in the LoadI pack, so the LoadI pack must be scheduled before the StoreI pack. And since the LoadI pack depends on the last_mem, the last_mem must be scheduled before the LoadI pack, and hence before the store pack. Therefore, we need to take the memory state of the last load for the LoadI pack here.

To fix it, additional checks are added when picking the memory state for the load pack[1]: when the store is in a pack and the load pack relies on the last_mem, we should choose the memory state of the last load rather than that of the first load.

[1] https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2380
[2] https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2232

Jira: ENTLLT-5482
Change-Id: I341d10b91957b60a1b4aff8116723e54083a5fb8
CustomizedGitHooks: yes
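The miscompile described above can be made observable by wrapping the reported loop in a small harness that compares it against a strictly scalar, program-order reference. The class name, array sizes, and warm-up count below are my own choices, not from the report; on a JVM with the bug, the compiled loop may diverge from the reference after C2 kicks in:

```java
import java.util.Arrays;

public class VecRepro {
    // The loop from the report, verbatim (candidate for SuperWord vectorization).
    static int[] run() {
        int[] iArr = new int[240];
        long[] lArrFld = new long[240];
        int i2 = 0;
        for (int i1 = 6; i1 < 227; i1++) {
            iArr[i1] += lArrFld[i1]++; // statement 1
            iArr[i1 + 1] -= (i2++);    // statement 2
        }
        return iArr;
    }

    // Same computation written element-by-element in strict program order,
    // so there is no pack scheduling to get wrong.
    static int[] reference() {
        int[] iArr = new int[240];
        long[] lArrFld = new long[240];
        int i2 = 0;
        for (int i1 = 6; i1 < 227; i1++) {
            iArr[i1] = iArr[i1] + (int) lArrFld[i1];
            lArrFld[i1] = lArrFld[i1] + 1;
            iArr[i1 + 1] = iArr[i1 + 1] - i2;
            i2 = i2 + 1;
        }
        return iArr;
    }

    public static void main(String[] args) {
        int[] expected = reference();
        // Repeat so C2 compiles run(); with the bug, the vectorized
        // version can start producing different iArr contents.
        for (int i = 0; i < 10_000; i++) {
            if (!Arrays.equals(run(), expected))
                throw new AssertionError("vectorized result diverged at iteration " + i);
        }
        System.out.println("results match");
    }
}
```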
Co-authored-by: Christian Stein <sormuras@gmail.com>
1. bootstrap-version: `java -Xcomp -version` with PEA. Currently it's still broken due to JVM-2007 and JVM-1987.
2. hotspot-tier1: execute tier1 tests with PEA. We still have ~50 errors.

Co-authored-by: Xin Liu <xxinliu@amazon.com>
Hi all,
this is a continuation of the review of the same CR at https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-September/030794.html; the current change is "Version 3" above. The relevant email is https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-September/030812.html, copied verbatim here again:
Hi Kim,
On 03.09.20 19:17, Kim Barrett wrote:
Done.
Since within the "Concurrent Mark" time span there can be any number of
other (gc) STW pauses, I am sceptical that subtracting only the Remark
pause could be anything but misleading.
It could certainly be interesting to provide some concurrent duration
for the whole marking, but I think this is out of scope here.
I do not know what the old "Concurrent Mark" end message actually
measured: if you look, the first message does not contain the Remark pause,
while the ones after a mark restart do contain it (so, all time
excluding the last Remark pause). It is neither just the concurrent time
nor the full marking time. Not sure how useful that is.
This code simplifies the message to something like: from the start of
marking-related activity until all marking activity has fully completed.
Actually there are still a few inconsistencies in the
(gc=info,gc+marking=info) log output. Let me give an example:
[23.160s][info][gc ] GC(225) Concurrent Cycle
[23.160s][info][gc,marking] GC(225) Concurrent Clear Claimed Marks
[23.160s][info][gc,marking] GC(225) Concurrent Clear Claimed Marks 0.009ms
[23.160s][info][gc,marking] GC(225) Concurrent Scan Root Regions
[23.209s][info][gc,marking] GC(225) Concurrent Scan Root Regions 48.811ms
[23.209s][info][gc,marking] GC(225) Concurrent Mark (23.209s)
^-- only the "Concurrent Mark" message has a start time - I think it is
superfluous, as it just repeats the output of UL.
[23.209s][info][gc,marking] GC(225) Concurrent Mark From Roots
[...]
[24.792s][info][gc,marking] GC(225) Concurrent Mark From Roots 1583.035ms
[24.792s][info][gc,marking] GC(225) Concurrent Preclean
[24.793s][info][gc,marking] GC(225) Concurrent Preclean 0.379ms
[24.886s][info][gc ] GC(225) Pause Remark 552M->225M(760M) 93.108ms
[24.886s][info][gc,marking] GC(225) Concurrent Mark (23.209s, 24.886s) 1676.831ms
^-- same here with start/end times.
[24.886s][info][gc,marking] GC(225) Concurrent Rebuild Remembered Sets
[...]
[31.592s][info][gc,marking] GC(225) Concurrent Rebuild Remembered Sets 6706.292ms
[31.594s][info][gc ] GC(225) Pause Cleanup 451M->451M(1024M) 0.774ms
[31.594s][info][gc,marking] GC(225) Concurrent Cleanup for Next Mark
[31.600s][info][gc,marking] GC(225) Concurrent Cleanup for Next Mark 6.298ms
[31.600s][info][gc ] GC(225) Concurrent Cycle 8439.504ms
I fixed those two in the next patch, which I will publish next week using a
GitHub PR...
Another inconsistency is that the Concurrent Mark Abort message does not
contain any duration information (but UL captures the "end" time stamp).
But I would like to defer that (existing) issue.
Agree to both.
+1. Done.
Done.
All fixed.
Testing: tier1 currently running.
Thanks,
Thomas
Download
$ git fetch https://git.openjdk.java.net/jdk pull/44/head:pull/44
$ git checkout pull/44