
8272807: Permit use of memory concurrent with pretouch #5215

Closed
wants to merge 1 commit into from

Conversation


@kimbarrett kimbarrett commented Aug 23, 2021

Please review this change to os::pretouch_memory to permit use of the memory
concurrently with the pretouch operation. This is accomplished by using an
atomic add of zero as the operation for touching the memory, ensuring the
virtual location is backed by physical memory while not changing any values
being read or written by the application.

While I was there, I also fixed some other lurking issues in os::pretouch_memory.
There was a potential overflow in the iteration that has been fixed. And if
the range arguments weren't page aligned then the last page might not get
touched. The latter was even mentioned in the function's description. Both
of those have been fixed by careful alignment and some extra checks. The
resulting code is a little more complicated, but more robust and complete.
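
For illustration only, a minimal standalone sketch of the touching loop described above, written against standard C++ (std::atomic_ref, C++20) rather than the HotSpot Atomic API; the pretouch_range name and the exact alignment handling are illustrative assumptions, not the patch itself:

#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch only: touch one int per page with a relaxed atomic add of zero, so
// the page gets backed by physical memory without changing any value that a
// concurrently running thread might be reading or writing.
static void pretouch_range(void* start, void* end, size_t page_size) {
  // Align the first touch down to an int boundary; this stays within the page
  // containing 'start' because page_size is a multiple of sizeof(int).
  uintptr_t cur  = reinterpret_cast<uintptr_t>(start) & ~(uintptr_t)(sizeof(int) - 1);
  uintptr_t last = reinterpret_cast<uintptr_t>(end);
  while (cur < last) {
    std::atomic_ref<int> word(*reinterpret_cast<int*>(cur));
    word.fetch_add(0, std::memory_order_relaxed);  // add of zero: value unchanged
    // The actual change additionally guards against address overflow here and
    // makes sure a trailing partial page still gets touched.
    cur += page_size;
  }
}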

This change doesn't make use of the new capability; I have some other
changes in development to do that.

Testing:
mach5 tier1-3.

I've been using this change while developing uses of the new capability.
Performance testing hasn't found any regressions related to this change.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8272807: Permit use of memory concurrent with pretouch

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5215/head:pull/5215
$ git checkout pull/5215

Update a local copy of the PR:
$ git checkout pull/5215
$ git pull https://git.openjdk.java.net/jdk pull/5215/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5215

View PR using the GUI difftool:
$ git pr show -t 5215

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5215.diff


@bridgekeeper bridgekeeper bot commented Aug 23, 2021

👋 Welcome back kbarrett! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Aug 23, 2021

@openjdk openjdk bot commented Aug 23, 2021

@kimbarrett The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime label Aug 23, 2021

@mlbridge mlbridge bot commented Aug 23, 2021

Webrevs

Contributor

@tschatzl tschatzl left a comment

Lgtm.


@openjdk openjdk bot commented Aug 23, 2021

@kimbarrett This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8272807: Permit use of memory concurrent with pretouch

Reviewed-by: tschatzl

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 6 new commits pushed to the master branch:

  • 594e516: 8272778: Consolidate is_instance and is_instance_inlined in java_lang_String
  • d542745: 8267894: Skip work for empty regions in G1 Full GC
  • 741f58c: 8272417: ZGC: fastdebug build crashes when printing ClassLoaderData
  • b7f75c0: 8271142: package help is not displayed for missing X11/extensions/Xrandr.h
  • e8a289e: 8272609: Add string deduplication support to SerialGC
  • b690f29: 8269687: pauth_aarch64.hpp include name is incorrect

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Aug 23, 2021
Contributor

@shipilev shipilev commented Aug 23, 2021

Performance testing hasn't found any regressions related to this change.

Are there any performance tests that actually check -XX:+AlwaysPreTouch performance? It is very likely that the OS trip to commit the page is vastly costlier than the atomic increment per page, but it would be nice to confirm.

end = align_down(end, sizeof(int));
if (start < end) {
Member

@dholmes-ora dholmes-ora Aug 24, 2021


So ... if this were called as follows:
os::pretouch_memory(page_start, page_start+3, vm_page_size())
we would not pretouch any pages. That seems wrong - either we should touch one page, or this is an illegal condition and we should preclude it. It seems to me this logic would be much simpler if start and end were more constrained than they appear to be right now. E.g. start should be int-aligned and end - start should be > sizeof(int).
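
To make that edge case concrete, the arithmetic works out as below (align_down here is a hypothetical stand-in for the helper used in the snippet, and the addresses are made up; page_start is assumed page-aligned and therefore int-aligned):

#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the align_down call in the snippet above.
static inline uintptr_t align_down(uintptr_t p, uintptr_t alignment) {
  return p & ~(alignment - 1);
}

int main() {
  const uintptr_t page_start = 0x10000;   // page-aligned example address
  uintptr_t start = page_start;
  uintptr_t end   = page_start + 3;       // the call from the comment above
  end = align_down(end, sizeof(int));     // rounds end back down to page_start
  assert(!(start < end));                 // guard fails: nothing is touched
  return 0;
}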

Author

@kimbarrett kimbarrett commented Aug 24, 2021

Performance testing hasn't found any regressions related to this change.

Are there any performance tests that actually check -XX:+AlwaysPreTouch performance? It is very likely that the OS trip to commit the page is vastly costlier than the atomic increment per page, but it would be nice to confirm.

I've run various configurations of some of our "usual" performance benchmarks
with and without this change and haven't seen any difference at all. (Changing
the value of AlwaysPreTouch does have an effect on some, with the right other
configuration options.) Even with the work I've been doing to move some of the
pretouching to make use of the new feature, it can be hard to find any difference.

I haven't written a microbenchmark to commit memory, time touching it with
one of the approaches, uncommit, repeat. I could do that, though I don't
expect it to show anything either.
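
For what it's worth, the commit/touch/uncommit microbenchmark being described might look roughly like this POSIX-only sketch; the sizes, names, and the use of std::atomic_ref (C++20) are illustrative choices, not code from the PR:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

// Map ("commit") a range, time one touching strategy, unmap, return seconds.
static double time_touch(size_t bytes, bool atomic_add) {
  void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) { perror("mmap"); return -1.0; }
  const size_t page = (size_t)sysconf(_SC_PAGESIZE);
  char* base = static_cast<char*>(mem);

  auto t0 = std::chrono::steady_clock::now();
  for (size_t off = 0; off < bytes; off += page) {
    if (atomic_add) {
      // Proposed approach: relaxed atomic add of zero.
      std::atomic_ref<int>(*reinterpret_cast<int*>(base + off))
          .fetch_add(0, std::memory_order_relaxed);
    } else {
      // Baseline approach: plain store of zero (volatile so it isn't elided).
      *reinterpret_cast<volatile char*>(base + off) = 0;
    }
  }
  auto t1 = std::chrono::steady_clock::now();

  munmap(mem, bytes);  // "uncommit": return the pages to the OS
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const size_t bytes = size_t{1} << 30;  // 1 GiB per round, chosen arbitrarily
  for (int i = 0; i < 5; ++i) {
    std::printf("store-0: %.3fs  atomic-add0: %.3fs\n",
                time_touch(bytes, false), time_touch(bytes, true));
  }
  return 0;
}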

Contributor

@shipilev shipilev commented Aug 24, 2021

I haven't written a microbenchmark to commit memory, time touching it with
one of the approaches, uncommit, repeat. I could do that, though I don't
expect it to show anything either.

Yeah, the overhead is measurable. See for example Epsilon with 100G heap (several runs, most typical result is shown):

$ time ~/trunks/jdk/build/baseline/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
Hello!

real	0m23.075s
user	0m1.880s
sys	0m21.108s

$ time ~/trunks/jdk/build/patched/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
Hello!

real	0m23.568s  ; + 500ms
user	0m2.306s    ; + 420ms
sys	0m21.189s  ; + 80ms (noise?)

This correlates with 100G / 4K = 25M pages to touch with atomics, which gives us roughly additional 500ms/25M = 20 ns per atomic/page (most likely cache-missing atomic costing extra). In the test above, this adds up to ~2% overhead. I do believe this overhead is inconsequential (since user already kinda loses startup performance "privileges" with -XX:+AlwaysPreTouch anyway), especially if we would be able to leverage this feature to pre-touch heap in background in future RFEs.

And this is x86_64. Whereas I see that AArch64 seems to do the call to the helper with memory_order_conservative always.

Author

@kimbarrett kimbarrett commented Aug 24, 2021

And this is x86_64. Whereas I see that AArch64 seems to do the call to the helper with memory_order_conservative always.

??? Not sure where this memory_order_conservative might be. The new code uses memory_order_relaxed.

Author

@kimbarrett kimbarrett commented Aug 24, 2021

And this is x86_64. Whereas I see that AArch64 seems to do the call to the helper with memory_order_conservative always.

??? Not sure where this memory_order_conservative might be. The new code uses memory_order_relaxed.

Oh, the aarch64 atomics are going through common stubs that ignore the memory order. Wow! Why did I think they were using the gcc intrinsics and taking advantage of the memory order?

Contributor

@shipilev shipilev commented Aug 24, 2021

Oh, the aarch64 atomics are going through common stubs that ignore the memory order. Wow! Why did I think they were using the gcc intrinsics and taking advantage of the memory order?

Yeah. This patch provides another good reason for @theRealAph to implement memory_order_relaxed handling on those paths ;)

Contributor

@shipilev shipilev commented Aug 25, 2021

Um, a disturbing thought came to me about this.

Is this intended to allow concurrent heap access by the Java application? Because I think it runs afoul of the guarantees for the strong CAS. Consider two Java threads that perform a strong CAS(0 -> 1) on a field, and that field by bad luck is at the same offset at which this pre-touching code does ADD(0). From the Java perspective, exactly one of the Java CASes should succeed, as long as nothing else writes there from Java. But since the VM does another atomic store to the same location, could both Java CASes fail?

Contributor

@shipilev shipilev commented Aug 25, 2021

Thinking some more about this. The conflict I described must be fine, as strong CAS implementations are supposed to guard against spurious failures like these. A stray ADD(0) (true sharing) is not very different from a stray update to the same cache line (false sharing) in this scenario. (I did check this empirically with JCStress on x86_64 and AArch64, and both seem to perform as expected.)
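
A standalone way to picture that expectation, with std::atomic standing in for both the Java CAS and the pretouch ADD(0) (a sketch only, not the JCStress test mentioned above):

#include <atomic>
#include <cassert>
#include <thread>

int main() {
  for (int iter = 0; iter < 100000; ++iter) {
    std::atomic<int> field{0};
    std::atomic<int> winners{0};

    auto caser = [&] {
      int expected = 0;
      // Strong CAS must not fail spuriously, so interference from the
      // ADD(0) below cannot cause both CAS attempts to fail.
      if (field.compare_exchange_strong(expected, 1)) {
        winners.fetch_add(1);
      }
    };
    auto toucher = [&] {
      // Pretouch-style relaxed add of zero: never changes the value.
      field.fetch_add(0, std::memory_order_relaxed);
    };

    std::thread t1(caser), t2(caser), t3(toucher);
    t1.join(); t2.join(); t3.join();
    assert(winners.load() == 1);  // exactly one CAS(0 -> 1) wins every time
  }
  return 0;
}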

Member

@dholmes-ora dholmes-ora commented Aug 26, 2021

@shipilev - right, a "strong CAS" has to filter out spurious interference on ll/sc-based implementations. That said, how can we be actively using an oop located in a page that we are also in the process of pre-touching?

Contributor

@shipilev shipilev commented Aug 26, 2021

That said, how can we be actively using an oop located in a page that we are also in the process of pre-touching?

I think that is what this patch is supposed to enable: concurrent pretouch. Where "concurrent" might mean "concurrent with application code". For example, init the heap, let the Java code run, and then use a background GC worker to concurrently pre-touch the heap. ("Pre-" becomes a little fuzzy here...).

Author

@kimbarrett kimbarrett commented Aug 29, 2021

[Originally replied on the mailing list, but the mailing list to PR reflector has been unreliable, and didn't pick this up.]

That said, how can we be actively using an oop located in a page that we are also in the process of pre-touching?

I think that is what this patch is supposed to enable: concurrent pretouch. Where "concurrent" might mean "concurrent with application code". For example, init the heap, let the Java code run, and then use a background GC worker to concurrently pre-touch the heap. ("Pre-" becomes a little fuzzy here...).

Among other uses, exactly.

And yes, "Pre-" is a little fuzzy. I was originally intending to add a new "concurrent_touch_memory"
or something like that, but I've not been able to detect any performance difference between the old
and the new. I'll be getting back to Aleksey on that later; I have more measurements I want to run.

Author

@kimbarrett kimbarrett commented Aug 29, 2021

Yeah, the overhead is measurable. See for example Epsilon with 100G heap (several runs, most typical result is shown):

$ time ~/trunks/jdk/build/baseline/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
Hello!

real	0m23.075s
user	0m1.880s
sys	0m21.108s

$ time ~/trunks/jdk/build/patched/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
Hello!

real	0m23.568s  ; + 500ms
user	0m2.306s    ; + 420ms
sys	0m21.189s  ; + 80ms (noise?)

This correlates with 100G / 4K = 25M pages to touch with atomics, which gives us roughly additional 500ms/25M = 20 ns per atomic/page (most likely cache-missing atomic costing extra). In the test above, this adds up to ~2% overhead. I do believe this overhead is inconsequential (since user already kinda loses startup performance "privileges" with -XX:+AlwaysPreTouch anyway), especially if we would be able to leverage this feature to pre-touch heap in background in future RFEs.

And this is x86_64. Whereas I see that AArch64 seems to do the call to the helper with memory_order_conservative always.

That's a good idea, using the heap pretouch from a simple collector like that.
I went a little further, and patched os::pretouch_memory to report the time
when the range size is large. I also had to lock the application to a single
NUMA node to get anything like reproducible and comparable results.

I was not able to reproduce the difference you are seeing on x86_64. So far as
I can tell, the performance of the original store-0 pretouch and the proposed
atomic-add0 pretouch is the same (no statistically significant difference)
across three different machines.

I also did the same measurements on an aarch64 machine. I was surprised that
the baseline pretouch performance there was more than 10x faster (per touched
memory size) than on the x86_64 machines I had tested. I don't know where that
difference comes from. There is some difference because the default page size
is larger on aarch64, so half as many pages (of twice the size) are involved.
But I don't see how that could account for such a large difference.

Unfortunately, the good news stops there. The atomic-add0 pretouch was 1/3
slower than the store-0 pretouch on that machine. That machine is using LSE,
so I also tried forcing that off (so conservative atomic add), as well as
variants that used relaxed cmpxchg (both LSE and not). All were similar to
one another, so all about 1/3 slower than store-0.

The time to handle the touches is so good on aarch64 that it kind of makes me
wonder whether pretouching is actually worth doing at all there, assuming that
performance isn't model/manufacturer-specific or something like that.

Author

@kimbarrett kimbarrett commented Aug 30, 2021

I'm withdrawing this PR, with the intent of taking a different tack, adding a new function for the concurrent use-case.

@kimbarrett kimbarrett closed this Aug 30, 2021
Contributor

@shipilev shipilev commented Aug 30, 2021

I'm withdrawing this PR, with the intent of taking a different tack, adding a new function for the concurrent use-case.

Understandable. I would be happy to assist with an Epsilon-based prototype. I think concurrently ahead-touching the Epsilon heap would provide both a good baseline for future work and improve the fast-startup-with-hopefully-no-hiccups Epsilon cases.


@mlbridge mlbridge bot commented Sep 1, 2021

Mailing list message from Kim Barrett on hotspot-runtime-dev:

On Aug 26, 2021, at 2:59 AM, Aleksey Shipilev <shade at openjdk.java.net> wrote:

On Thu, 26 Aug 2021 02:20:50 GMT, David Holmes <dholmes at openjdk.org> wrote:

That said, how can we be actively using an oop located in a page that we are also in the process of pre-touching?

I think that is what this patch is supposed to enable: concurrent pretouch. Where "concurrent" might mean "concurrent with application code". For example, init the heap, let the Java code run, and then use a background GC worker to concurrently pre-touch the heap. ("Pre-" becomes a little fuzzy here...)

Among other uses, exactly.

And yes, "Pre-" is a little fuzzy. I was originally intending to add a new "concurrent_touch_memory"
or something like that, but I've not been able to detect any performance difference between the old
and the new. I'll be getting back to Aleksey on that later; I have more measurements I want to run.

Labels
hotspot-runtime ready rfr
4 participants