8272807: Permit use of memory concurrent with pretouch #5215
@kimbarrett The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
@kimbarrett This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 6 new commits pushed to the
Please see this link for an up-to-date comparison between the source branch of this pull request and the
Are there any performance tests that actually check |
end = align_down(end, sizeof(int));
if (start < end) {
So ... if this were called as follows:
os::pretouch_memory(page_start, page_start+3, vm_page_size())
we would not pretouch any pages. That seems wrong - either we should touch one page, or this is an illegal condition and we should preclude it. It seems to me this logic would be much simpler if start and end were more constrained than they appear to be right now. E.g. start should be int-aligned; end-start should be > sizeof(int).
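The concern can be illustrated with a small sketch. It assumes a power-of-two `align_down_sketch` helper (a hypothetical stand-in, not HotSpot's own `align_down`), a 4-byte `int`, and that only `end` is aligned down before the `start < end` guard. The `would_touch` name is likewise made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an align_down helper.
// Assumes alignment is a non-zero power of two.
inline std::uintptr_t align_down_sketch(std::uintptr_t value, std::uintptr_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return value & ~(alignment - 1);
}

// Returns true if the "start < end" guard would still hold after end has been
// aligned down to int_size. Assumption: start is used unchanged.
inline bool would_touch(std::uintptr_t start, std::uintptr_t end, std::uintptr_t int_size) {
  return start < align_down_sketch(end, int_size);
}
```

With a page-aligned `start` of `0x1000` and `end = start + 3`, the aligned-down `end` is `0x1000` again, so `would_touch(0x1000, 0x1003, 4)` is false and no page would be touched under these assumptions.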
I've run various configurations of some of our "usual" performance benchmarks I haven't written a microbenchmark to commit memory, time touching it with |
Yeah, the overhead is measurable. See for example Epsilon with 100G heap (several runs, most typical result is shown):
This correlates with 100G / 4K = 25M pages to touch with atomics, which gives us roughly additional 500ms/25M = 20 ns per atomic/page (most likely cache-missing atomic costing extra). In the test above, this adds up to ~2% overhead. I do believe this overhead is inconsequential (since user already kinda loses startup performance "privileges" with And this is x86_64. Whereas I see that AArch64 seems to do the call to the helper with |
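The back-of-the-envelope estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are invented for illustration, and it assumes "100G" and "4K" mean binary units (GiB and KiB), which gives about 26.2M pages rather than the rounded 25M.

```cpp
#include <cassert>
#include <cstdint>

// Number of pages in a heap of heap_bytes, given the page size in bytes.
inline std::uint64_t page_count(std::uint64_t heap_bytes, std::uint64_t page_bytes) {
  assert(page_bytes != 0);
  return heap_bytes / page_bytes;
}

// Extra cost per touched page in nanoseconds, given the total extra time in
// milliseconds (1 ms = 1e6 ns).
inline double extra_ns_per_page(double extra_ms, std::uint64_t pages) {
  assert(pages != 0);
  return extra_ms * 1e6 / static_cast<double>(pages);
}
```

Using the rounded 25M pages, `extra_ns_per_page(500.0, 25000000)` gives 20 ns; using the binary-unit page count it comes out at roughly 19 ns, so the estimate in the comment holds either way.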
??? Not sure where this |
Oh, the aarch64 atomics are going through common stubs that ignore the memory order. Wow! Why did I think they were using the gcc intrinsics and taking advantage of the memory order? |
Yeah. This patch provides another good reason for @theRealAph to implement |
Um, a disturbing thought came to me about this. Is this intended to allow concurrent heap access by the Java application? Because I think it runs afoul of the guarantees for the strong CAS. Consider two Java threads that perform strong CAS(0 -> 1) on a field, and that field by luck is at the same offset at which this pre-touching code does ADD(0). From the Java perspective, exactly one of the Java CASes should succeed, as long as nothing else writes there from Java. But since the VM does another atomic store to the same location, could both Java CASes fail?
Thinking some more about this. The conflict I described must be fine, as strong CAS implementations are supposed to guard against spurious failures like these. A stray ADD(0) (true sharing) is not very different from a stray update to the same cache line (false sharing) in this scenario. (I did check this empirically with JCStress on x86_64 and AArch64, and both seem to perform as expected.)
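For illustration only, the scenario can be modelled in plain C++ with `std::atomic`. This is not the JCStress test mentioned above and says nothing about Java or HotSpot internals; it only shows the language-level expectation that a strong compare-exchange fails solely when the value differs from the expected one. The function name is made up.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One trial: two threads attempt a strong CAS(0 -> 1) on the same word while
// a third thread repeatedly performs an atomic add of zero on it.
// Returns how many of the two CAS attempts succeeded.
inline int cas_successes_with_concurrent_add_zero() {
  std::atomic<int> field{0};
  std::atomic<int> successes{0};

  auto cas_thread = [&] {
    int expected = 0;
    if (field.compare_exchange_strong(expected, 1)) {
      successes.fetch_add(1);
    }
  };
  auto toucher = [&] {
    for (int i = 0; i < 1000; ++i) {
      // Writes back the value it read, so the stored value never changes.
      field.fetch_add(0, std::memory_order_relaxed);
    }
  };

  std::thread t0(toucher), t1(cas_thread), t2(cas_thread);
  t0.join();
  t1.join();
  t2.join();
  return successes.load();
}
```

Because the add of zero never changes the stored value, the first CAS in the modification order sees 0 and succeeds, and the second sees 1 and fails, so each trial should report exactly one success.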
@shipilev - right, a "strong CAS" has to filter out spurious interference on LL/SC-based implementations. That said, how can we be actively using an oop located in a page that we are also in the process of pre-touching?
I think that is what this patch is supposed to enable: concurrent pretouch. Where "concurrent" might mean "concurrent with application code". For example, init the heap, let the Java code run, and then use a background GC worker to concurrently pre-touch the heap. ("Pre-" becomes a little fuzzy here...). |
[Originally replied on the mailing list, but the mailing list to PR reflector has been unreliable, and didn't pick this up.]
Among other uses, exactly. And yes, “Pre-“ is a little fuzzy. I was originally intending to add a new “concurrent_touch_memory” |
That's a good idea, using the heap pretouch from a simple collector like that. I was not able to reproduce the difference you are seeing on x86_64. So far as I also did the same measurements on an aarch64 machine. I was surprised that Unfortunately, the good news stops there. The atomic-add0 pretouch was 1/3 The time to handle the touches is so good on aarch64 that it kind of makes me |
I'm withdrawing this PR, with the intent of taking a different tack, adding a new function for the concurrent use-case. |
Understandable. I would be happy to assist with an Epsilon-based prototype. I think concurrently ahead-touching the Epsilon heap would provide a good baseline for future work, and it would also improve fast-startup-with-hopefully-no-hiccups Epsilon cases.
Please review this change to os::pretouch_memory to permit use of the memory
concurrently with the pretouch operation. This is accomplished by using an
atomic add of zero as the operation for touching the memory, ensuring the
virtual location is backed by physical memory while not changing any values
being read or written by the application.
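A minimal sketch of the idea, assuming `std::atomic<int>` as a stand-in for HotSpot's own atomic primitives: the names `touch_buffer` and `stride` are hypothetical, and the real change works on raw addresses and page sizes rather than an array of atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Touches one word per stride across words[0 .. count) using an atomic add of
// zero. The read-modify-write forces the location to be backed by memory, and
// adding zero leaves whatever value is there unchanged.
// Returns the number of words touched.
inline std::size_t touch_buffer(std::atomic<int>* words, std::size_t count, std::size_t stride) {
  assert(stride != 0);
  std::size_t touched = 0;
  for (std::size_t i = 0; i < count; i += stride) {
    words[i].fetch_add(0, std::memory_order_relaxed);
    ++touched;
  }
  return touched;
}
```

In this sketch a concurrent writer to the same word would never have its value overwritten, because the add is atomic and contributes nothing to the result.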
While I was there, fixed some other lurking issues in os::pretouch_memory.
There was a potential overflow in the iteration that has been fixed. And if
the range arguments weren't page aligned then the last page might not get
touched. The latter was even mentioned in the function's description. Both
of those have been fixed by careful alignment and some extra checks. The
resulting code is a little more complicated, but more robust and complete.
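One way to picture the two fixes is the sketch below. It is not the code from this change: it only computes which page-start addresses a pretouch of [start, end) would visit, under the assumptions that `page_size` is a power of two and that an unaligned `end` should still cause its containing page to be visited. The name `pages_to_touch` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the page-start addresses covering [start, end).
// Assumes page_size is a non-zero power of two. An empty range yields no pages.
inline std::vector<std::uintptr_t> pages_to_touch(std::uintptr_t start,
                                                  std::uintptr_t end,
                                                  std::uintptr_t page_size) {
  assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
  std::vector<std::uintptr_t> pages;
  if (start >= end) return pages;
  // Align both bounds down, so a partial last page is still included.
  const std::uintptr_t first = start & ~(page_size - 1);
  const std::uintptr_t last = (end - 1) & ~(page_size - 1);
  for (std::uintptr_t cur = first; ; cur += page_size) {
    pages.push_back(cur);
    // Compare before incrementing so cur never wraps past the top of the
    // address space.
    if (cur == last) break;
  }
  return pages;
}
```

Stopping on equality with the last page, rather than looping while `cur < end`, is what keeps the iteration safe when the range ends at the very top of the address space.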
This change doesn't make use of the new capability; I have some other
changes in development to do that.
Testing:
mach5 tier1-3.
I've been using this change while developing uses of the new capability.
Performance testing hasn't found any regressions related to this change.
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5215/head:pull/5215
$ git checkout pull/5215
Update a local copy of the PR:
$ git checkout pull/5215
$ git pull https://git.openjdk.java.net/jdk pull/5215/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 5215
View PR using the GUI difftool:
$ git pr show -t 5215
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5215.diff