Skip to content

8339114: DaCapo xalan performance with -XX:+UseObjectMonitorTable #24098

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

rkennke
Copy link
Contributor

@rkennke rkennke commented Mar 18, 2025

  • When successfully looked up an OM in the OMCache, then make it avaiable in the BasicLock cache. Use that cache whenever possible.
  • When successfully looked up an OM in the OMCache, then don't push-back the OM on that cache to avoid shuffling the cache on each monitor enter.
  • In the runtime path of monitorexit, attempt to use the BasicLock cache, then the OMCache, and only look up in the table when the caches failed.
  • Some small code shufflings.

I did 50 runs of xalan, each of which do 10 warmup iterations before doing one measurement. The following results are the averages of the measurement iterations across the 50 runs:
without-OMT: 773.3 ms
with-OMT: 797.28 ms

That is still a regression of ~3%


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8339114: DaCapo xalan performance with -XX:+UseObjectMonitorTable (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24098/head:pull/24098
$ git checkout pull/24098

Update a local copy of the PR:
$ git checkout pull/24098
$ git pull https://git.openjdk.org/jdk.git pull/24098/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24098

View PR using the GUI difftool:
$ git pr show -t 24098

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24098.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Mar 18, 2025

👋 Welcome back rkennke! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Mar 18, 2025

@rkennke This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8339114: DaCapo xalan performance with -XX:+UseObjectMonitorTable

Reviewed-by: coleenp, aboldtch

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 201 new commits pushed to the master branch:

  • a1ab1d8: 8353449: [BACKOUT] One instance of STATIC_LIB_CFLAGS was missed in JDK-8345683
  • 2a31f69: 8353331: Test ForkJoinPool20Test::testFixedDelaySequence is failing
  • 1809138: 8352284: EXTRA_CFLAGS incorrectly applied to BUILD_LIBJVM src/hotspot C++ source files
  • cef5610: 8353272: One instance of STATIC_LIB_CFLAGS was missed in JDK-8345683
  • 6801eb8: 8352709: Remove bad timing annotations from WhileOpTest.java
  • 85a0baf: 8352719: Add an equals sign to the modules statement
  • f25f701: 8353226: JFR: emit old object samples must be transitive closure complete for segment
  • aff5aa7: 8350566: NMT: add size parameter to MemTracker::record_virtual_memory_tag
  • 196334f: 8352046: Test testEcoFriendly() in jdk tools launcher ExecutionEnvironment.java for AIX and Linux/musl is brittle
  • ad48846: 8350386: Test TestCodeCacheFull.java fails with option -XX:-UseCodeCacheFlushing
  • ... and 191 more: https://git.openjdk.org/jdk/compare/b891bfa7e67c21478475642e2bfa2cdc65a3bffe...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk
Copy link

openjdk bot commented Mar 18, 2025

@rkennke The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Mar 18, 2025
Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a couple of comments but it's similar to what I have. I didn't use the BasicLock though for quick_enter and this hit variable is a bit confusing why it's there. I'm going to ask Eric to run our benchmarks on my similar version.

// ObjectMonitor enter successful.
cache_setter.set_monitor(monitor);
if (UseObjectMonitorTable) {
lock->set_object_monitor_cache(monitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hit a bug where this was a deflating monitor so I don't think we can do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime/Monitor/UseObjectMonitorTableTest.java#ExtremeDeflation
This test failed when I updated the monitor via CacheSetter when we don't succeed to get the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I understand the problem. Why can't we do that? We successfully looked up the monitor from the OMCache, why can't we stick it in the BasicLock cache, to avoid having to look it up in the OMCache again? Whatever code would pull it out of the BasicLock and observe a deflating monitor might also pull it out of the OMCache and observe a deflating monitor, or is that not correct? I know that OMCache::get_monitor() filters deflating monitors, but that is not a guarantee that a monitor would not flip to deflating after that check? So ... the user of the monitor would always have to deal with deflation after it has pulled a monitor from any of the caches anyway?

If it is our contract that only successfully locked monitors go into the BasicLock cache, then we can't do that optimization, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is okay. My change that ran afoul of this test found the monitor in the first 32 entries of the ObjectSynchronizer::_in_use_list and filtered for deflating monitors also. Then put the monitor it found in the OMCache. Since it didn't help performance, I backed this out of my change. It may be safer to put the monitor in the BasicLock field because it's cleared. All this to say, I don't really know if this is right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like from GHA, this also fails the same test. I think you should remove this optimization since I don't think it helps any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'it does not help any' is not correct. The point of this PR is to put the OM in the BasicLock cache early, even if it does not succeed to enter, so that slower paths can pick it up from there. Also, simply removing this particular line would not change anything, because we put the OM in the BasicLock cache in the C2 fast-path in the same way.

I think the BasicLock OM cache should be treated the same way as the thread's OMCache. The BasicLock cache is basically the L1 cache, while the OMCache is the L2 cache. And just like the OMCache lookup, we should clear the BasicLock cache when we observe a deflating monitor. This fixes the failing test.
@xmas92 should also look at this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the concept of a L1 and L2 cache is good and clear. And does help performance of look ups in the OMCache. I reconstructed the crash I got with the runtime/Monitor/UseObjectMonitorTableTest.java#ExtremeDeflation test. I was setting the OMCache also with the monitor found, because I was only trying to avoid redoing the lookup in the table itself, but in this test, we find a deflated monitor in the OMCache, and then got the assert that it's not in the table.

I don't know if you want to recheck if the monitor is deflated at this time, even though it might be racy. It's unlikely and the check and clearing in BasicLock is better.

@rkennke rkennke marked this pull request as ready for review March 24, 2025 11:14
@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 24, 2025
@mlbridge
Copy link

mlbridge bot commented Mar 24, 2025

Webrevs

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks really good. I'll ping @xmas92 to review it tomorrow if he doesn't notice this mention. Thanks.

// ObjectMonitor enter successful.
cache_setter.set_monitor(monitor);
if (UseObjectMonitorTable) {
lock->set_object_monitor_cache(monitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the concept of a L1 and L2 cache is good and clear. And does help performance of look ups in the OMCache. I reconstructed the crash I got with the runtime/Monitor/UseObjectMonitorTableTest.java#ExtremeDeflation test. I was setting the OMCache also with the monitor found, because I was only trying to avoid redoing the lookup in the table itself, but in this test, we find a deflated monitor in the OMCache, and then got the assert that it's not in the table.

I don't know if you want to recheck if the monitor is deflated at this time, even though it might be racy. It's unlikely and the check and clearing in BasicLock is better.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 27, 2025
Copy link
Member

@xmas92 xmas92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit 1: Added Linux x64 results
Edit 2: Added Linux aarch64 results

The exit and hashcode changes seem good, I see no issue there.

The enter changes require more thought and testing I think.

First the caches:

There are two fundamental differences between the two caches.

First the thread stack cache (BasicLock) is assumed to be guarded from use after free by only containing a monitor if it is locked by the thread. So deflation cannot occur. There is a small window where it exists because C2, due to register pressure, saves a potential value to the stack. Currently we never read this value in the slow path, and the CacheSetter overwrites it with the monitor we actually lock on or nullptr in the case of fast locking.

Second exit assumes that whatever is in the cache is what is locked on. So if we do not do inflated locking in enter for whatever reason, it must be cleared.

We must also verify that all paths that create a BasicLock also clears the memory, so we do not read garbage. And that no platform relies on the CacheSetter, I do not think that is the case. I see you fixed the C++ constructions with a constructor. But need to look at (interpreter, c1, c2, jvmci) x (all supported platforms). I think the CacheSetter in quick_enter was added specifically for x86_32 which used to share the C2 code with x86_64. However after JDK-8338383 / #21565 x86_32 C2 support was disabled due to 64bit CAS requirements. So this might work, and the x86_32 is deprecated and being actively removed, but I think it would be good to have the clear the thread stack cache (BasicLock) for NOT_LP64.

Second the spinning:

try_enter -> spin_enter

This changes the monitor spinning quite dramatically. For virtual threads the number of try_spin calls increase from 1 -> 2 (+100%). And for non virtual threads the number of try_spin calls increase from 2 -> 3 (+50%). This seems significant enough to warrant looking at a large suite of benchmarks and see how they behave. And remember multiple calls to try_spin affect both the total spin duration and the velocity of the adaptive spinning change.

Performance evaluation

I spent some time yesterday trying to understand what in this patch is important for performance. Xalan seems to be a workload which wants to spend its time in try_spin. When playing around with the patch, the try_enter -> spin_enter change seems to be the most important change, at least in our (Oracle) java performance system. But have not exhaustively tested all changes. And the behavior in our java performance system and what I see on my virtualized sky lake xeon based cloud instance differs quite a bit.

I suspect it boils down to the sort of "effective" scheduling of our successor windows (times when there is a successor set). Which avoids exit slow paths and re-locking.

This made me do some experiments with the successor and evaluate the performance.

Explanation of the different versions

  • Base: b891bfa
    • Mainline commit used as a baseline (the current PRs base)
  • PR: fc5062a
    • The PR branch as of fc5062a: Include objectMonitor.inline.hpp
    • Personal instance:
      • Reduced regression by 27% with table (-XX:+UseObjectMonitorTable)
      • 16% regression without table (-XX:-UseObjectMonitorTable)
    • Oracle Java Performance System:
      • Reduced regression by 73% with table (-XX:+UseObjectMonitorTable) on Linux x64
      • Regression gone with table (-XX:+UseObjectMonitorTable) on Linux aarch64
        • Not significant, needs more runs.
      • Run without table (-XX:-UseObjectMonitorTable) Linux aarch64 is still in the queue.
        • No regression seen on Linux x64
  • V3: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v3
    • This is just my review comments / cleanups on the PR.
    • Has the same performance characteristics as the PR.
  • V4: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v4
    • Adhoc experiment with successor. Only clear the successor after the final spin, before parking.
    • Most likely incorrect / contains holes, but should only lead to stranding. (Looking at GHA there seem to be some stranding occurring). Though it is an interesting idea to evaluate.
    • Personal instance:
      • Similar performance as the PR / V3 branch
    • Oracle Java Performance System:
      • Regression gone with table (-XX:+UseObjectMonitorTable) on both Linux aarch64 and x64
        • Not significant, needs more runs.
  • V5: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v5
    • Instead of only clearing before the parking. It clears the successor more often, every 256 iteration of the try_spin spin loop. The idea is to instead of prologing the successor window, create more small windows where the successor is clear so that other threads that may be earlier in their spinning can take over the successor.
    • This is probably correct, there might be that a fence is needed after clearing the successor inside the loop. The reason that it is not there is because the spinner will always either get the lock or retry setting the successor, which if it succeeds will lead to a fence later, or another thread have become the successor, and that thread will fence.
    • Personal instance:
      • Marginal improvements over V4.
    • Oracle Java Performance System:
      • Linux aarch64 is still in the queue.
      • Linux x64 minor improvement over PR / V3
  • V6: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v6
    • Based on V4, but it sets the successor unconditionally in try_spin which may overwrite another successor. The idea is similar to V5 but instead of sometimes clearing, we simply allow for as long as a thread is in try_spin there is a successor, and reducing the number of clear successors (and fences) required.
    • The branch has stranding issues because it is based on V4. Will probably rerun this experiment but without the V4 changes to the successor.
    • Personal instance:
      • Marginal improvements over V4.
    • Oracle Java Performance System:
      • Linux aarch64 is still in the queue.
      • Linux x64 regresses over V4
  • V7: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v7
    • Based on V6. But now clearing the successor is done with a CAS, and only if it succeeds do the thread fence.
    • The branch has stranding issues because it is based on V4.
    • Combining this with V6 is probably a bad idea, because all threads are writing to this cache line, and at the same time some thread is trying to CAS. It is probably more effective do just do the load check store fence, and sometimes overwrite another threads successor claiming. (The idea was to not allow any time slice where there are no successor if there is a thread inside, but seems not worth the tradeoff).
    • Personal instance:
      • No improvement over V3, (effectively negates the improvements gained by V4+V6)
    • Oracle Java Performance System:
      • Linux aarch64 is still in the queue.
      • Linux x64 regresses over V6

Conclusion

I think parts of this PR could go in. But the enter changes should be held of on until we understand the impact on other benchmarks (even if we can just use try_enter over spin_enter for -XX:-UseObjectMonitorTable to avoid the -XX:-UseObjectMonitorTable regressions).

I also think there is some room for improvements with this CacheSetter and C2. Using the thread stack cache is (currently) only relevant for C2. If keep track in the runtime that we came from a context where this thread stack cache is preloaded with what is in the thread local stack, or nullptr. We can optimize away a lot of the CacheSetter logic. Have only be something we do if we retry due to deflation or if we come from a another context. From other contexts clear the thread stack cache is required and setting the thread stack cache is beneficial due to OSR. But it may not be worth setting the the thread local cache unless we are locking from C2, because on C2 uses this cache for lookups, and C2 is what we want to be fast. So limit the OMCache to only C2 locked monitors is probably beneficial.

I will update the this post with the rest of the benchmark results as they finished, maybe something I do next week. Similarly I might experiment some more with a better version of V6.

Cheers.

Benchmarks results

On Personal Intel cloud instance:

  • Difference vs Base Without Table (-XX:-UseObjectMonitorTable) b891bfa
    • With Table (-XX:+UseObjectMonitorTable)
      • Columns
        • Name Cnt Base Error Test Error Unit Change (* = significant)
      • Base b891bfa
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1474.200 ± 50.614 ms 0.75x (p = 0.000*)
      • PR branch fc5062a
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1407.650 ± 37.815 ms 0.79x (p = 0.000*)
      • v3 branch b18c83a
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1414.650 ± 50.783 ms 0.79x (p = 0.000*)
      • v4 branch ec91413
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1423.000 ± 39.226 ms 0.78x (p = 0.000*)
      • v5 branch 19340a0
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1405.600 ± 78.450 ms 0.79x (p = 0.000*)
      • v6 branch 9a332cb
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1371.750 ± 36.835 ms 0.81x (p = 0.000*)
      • v7 branch 13297b1
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1474.700 ± 72.045 ms 0.75x (p = 0.000*)
    • Without Table (-XX:-UseObjectMonitorTable)
      • Columns
        • Name Cnt Base Error Test Error Unit Change (* = significant)
      • PR branch fc5062a
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1324.650 ± 46.528 ms 0.84x (p = 0.000*)
      • v3 branch b18c83a
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1322.600 ± 46.727 ms 0.84x (p = 0.000*)
      • v4 branch ec91413
        • Not run yet
      • v5 branch 19340a0
        • DaCapo-xalan-large 20 1111.150 ± 49.733 1309.450 ± 31.553 ms 0.85x (p = 0.000*)
      • v6 branch 9a332cb
        • Not run yet
      • v7 branch 13297b1
        • Not run yet

In Oracle Java Performance System:

Linux x64

  • Difference vs Base Without Table (-XX:-UseObjectMonitorTable) b891bfa
    • With Table (-XX:+UseObjectMonitorTable)
      • Columns
        • Name Cnt Base Error Test Error Unit Change (* = significant)
      • Base b891bfa
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1344.15 ± 83.49 ms 0.75x (p = 0.000*)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 10803.20 ± 864.69 ms 0.89x (p = 0.0553)
      • PR branch fc5062a
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1218.40 ± 64.03 ms 0.96x (p = 0.0103)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 10182.40 ± 391.79 ms 0.96x (p = 0.1703)
      • v3 branch b18c83a
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1235.75 ± 60.57 ms 0.95x (p = 0.0004*)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 10111.40 ± 441.37 ms 0.96x (p = 0.2671)
      • v4 branch ec91413
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1161.15 ± 47.45 ms 1.01x (p = 0.4161)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 9812.00 ± 419.23 ms 1.00x (p = 0.8694)
      • v5 branch 19340a0
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1204.95 ± 61.66 ms 0.97x (p = 0.0560)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 10015.20 ± 475.30 ms 0.97x (p = 0.4288)
      • v6 branch 9a332cb
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1281.65 ± 66.35 ms 0.91x (p = 0.000*)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 10281.40 ± 533.05 ms 0.95x (p = 0.1454)
      • v7 branch 13297b1
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1395.45 ± 78.64 ms 0.81x (p = 0.000*)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 11261.60 ± 780.96 ms 0.85x (p = 0.0089*)
    • Without Table (-XX:-UseObjectMonitorTable)
      • Columns
        • Name Cnt Base Error Test Error Unit Change (* = significant)
      • PR branch fc5062a
        • DaCapo-xalan-large 20 1172.50 ± 39.45 1147.10 ± 48.03 ms 1.02x (p = 0.0758)
        • DaCapo23-xalan-large 5 9763.60 ± 479.44 9640.60 ± 399.44 ms 1.01x (p = 0.6714)

Linux aarch64

  • Difference vs Base Without Table (-XX:-UseObjectMonitorTable) b891bfa
    • With Table (-XX:+UseObjectMonitorTable)
      • Columns
        • Name Cnt Base Error Test Error Unit Change (* = significant)
      • Base b891bfa
        • DaCapo-xalan-large 20 1999.55 ± 81.08 2176.85 ± 100.63 ms 0.91x (p = 0.000*)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 15326.60 ± 442.64 ms 1.02x (p = 0.4075)
      • PR branch fc5062a
        • DaCapo-xalan-large 20 1999.55 ± 81.08 2003.25 ± 83.89 ms 1.00x (p = 0.8880)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 15960.80 ± 826.91 ms 0.98x (p = 0.4322)
      • v3 branch b18c83a
        • DaCapo-xalan-large 20 1999.55 ± 81.08 2020.60 ± 70.67 ms 0.99x (p = 0.3870)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 15382.00 ± 385.40 ms 1.01x (p = 0.4864)
      • v4 branch ec91413
        • DaCapo-xalan-large 20 1999.55 ± 81.08 1987.55 ± 73.91 ms 1.01x (p = 0.6276)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 15601.00 ± 996.23 ms 1.00x (p = 0.9912)
      • v5 branch 19340a0
        • DaCapo-xalan-large 20 1999.55 ± 81.08 1910.10 ± 54.06 ms 1.05x (p = 0.0002*)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 15415.80 ± 562.71 ms 1.01x (p = 0.6161)
      • v6 branch 9a332cb
        • DaCapo-xalan-large 20 1999.55 ± 81.08 2082.75 ± 92.16 ms 0.96x (p = 0.0044*)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 16301.40 ± 486.39 ms 0.95x (p = 0.0584)
      • v7 branch 13297b1
        • DaCapo-xalan-large 20 1999.55 ± 81.08 2149.10 ± 104.61 ms 0.93x (p = 0.000*)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 16963.40 ± 328.17 ms 0.91x (p = 0.0019*)
    • Without Table (-XX:-UseObjectMonitorTable)
      • Columns
        • Name Cnt Base Error Test Error Unit Change (* = significant)
      • PR branch fc5062a
        • DaCapo-xalan-large 20 1999.55 ± 81.08 1768.85 ± 59.70 ms 1.12x (p = 0.0000*)
        • DaCapo23-xalan-large 5 15595.20 ± 524.15 14874.40 ± 933.46 ms 1.00x (p = 0.1806)

@coleenp
Copy link
Contributor

coleenp commented Mar 28, 2025

My testing also showed that the try_spin was the most beneficial of these changes to getting UseObjectMonitorTable performance mostly up to without UseObjectMonitorTable.

I've just started a Oracle internal benchmark run comparing mainline (base) with a version with try_spin instead of try_lock to see if that change has any adverse effects on any of the other benchmarks without UseObjectMonitorTable. xalan does benefit from adaptive spinning, so it makes sense moving this extra level of spinning earlier.

The BasicLock cache and OMCache changes are really good, so we want these and the try_spin changes if they don't regress any benchmarks (will update with results). That's this PR.

The cleanup to add read_caches() to encapsulate both BasicLock and OMCache is nice but I don't really like the new read_monitor(). LightweightSynchronizer::read_caches() can be private in lightweightSynchronizer.cpp and added to this change. Not sure about the CacheSetter changes to Axel's changes. And the successor changes should be in another PR.

I still think this PR is a good improvement, as is, pending try_spin performance testing. The additional work should be done as a set of separate PRs. 1. Fix CacheSetter to be only used for C2 (possibly, but do we want more conditional code?) 2. Experiment with setting successor in ObjectMonitor::try_spin. I did a little of this once and didn't have any good numbers to show for it, so I think it should be a new investigation. Your numbers are better for this.

Lastly, thank you Axel for running these performance tests. It looks like the performance for the PR change is quite good on the Oracle benchmarking machines, but less good on your personal cloud instance. The performance tests that we've (Eric Caspole and I have) done show that the PR changes are mostly good with some expected variation that we always see with locking changes.

@coleenp
Copy link
Contributor

coleenp commented Apr 1, 2025

Sorry for the delay, the performance testing took a while. I ran the old and new DaCapo benchmarks on a version of the VM with quick_enter calling try_spin rather than try_lock to see if the try_spin had any negative effects without UseObjectMonitorTable. I ran on our standard linux-x64 and linux-aarch64 machines. It had no significant effect on performance except a 3.7 improvement for old DaCapo xalan on linux-aarch64.

I think this should be checked in with RFEs for further improvements. I had a couple of cleanups in this area that I'm waiting for this.

Copy link
Member

@xmas92 xmas92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have ran the performance evaluation, and adding the extra try_spin does not hurt other workloads / -UseObjectMonitorTable. I am fine with this change.

As @coleenp mentions, a lot of the cache setter can be cleaned up in future RFEs. And eventually make it a property of C2 only.

@rkennke
Copy link
Contributor Author

rkennke commented Apr 1, 2025

If we have ran the performance evaluation, and adding the extra try_spin does not hurt other workloads / -UseObjectMonitorTable. I am fine with this change.

As @coleenp mentions, a lot of the cache setter can be cleaned up in future RFEs. And eventually make it a property of C2 only.

No more concerns about the setting of BasicLock cache in enter?

@coleenp
Copy link
Contributor

coleenp commented Apr 1, 2025

No more concerns about the setting of BasicLock cache in enter?

No, I think the bug that I ran into wasn't this, so no more concerns.

@rkennke
Copy link
Contributor Author

rkennke commented Apr 1, 2025

No more concerns about the setting of BasicLock cache in enter?

No, I think the bug that I ran into wasn't this, so no more concerns.

Ok, good.
@xmas92 also had concerns about doing that BL caching in enter that sounded like he wanted to hold this part back for now.

@xmas92
Copy link
Member

xmas92 commented Apr 1, 2025

No more concerns about the setting of BasicLock cache in enter?

No, I think the bug that I ran into wasn't this, so no more concerns.

Ok, good. @xmas92 also had concerns about doing that BL caching in enter that sounded like he wanted to hold this part back for now.

I think there is a problem with quick_enter on 32bit x86. As it may fast lock successfully without clearing the cache, which iirc can be garbage (or read it as a monitor later). I would either make 32bit x86 return nullptr from BasicLock::object_monitor_cache or just add the CacheSetter / if (UseObjectMonitorTable) { lock->clear_object_monitor_cache() } in the #ifndef _LP64 block. This is because the interpreter and C1 does not use MacroAssembler::lightweight_lock because of limited availability of registers and simply called into the VM. That is why the CacheSetter was introduced in quick enter.

For all other platforms which uses BasicLock::object_monitor_cache have clear the cache in interpreter/ C1 and C2 (have not checked recently, but iirc Graal does the same thing as C2 here)

Not sure what the state is w.r.t. 32bit x86 being deprecated for removal, but may just having ObjectMonitor* BasicLock::object_monitor_cache() return nullptr on 32 bit is the most straightforward solution here.

@dholmes-ora
Copy link
Member

Not sure what the state is w.r.t. 32bit x86 being deprecated for removal,

It has been removed

@xmas92
Copy link
Member

xmas92 commented Apr 2, 2025

Not sure what the state is w.r.t. 32bit x86 being deprecated for removal,

It has been removed

Alright if

// Implicit null check.
movptr(hdr, Address(obj, oopDesc::mark_offset_in_bytes()));
// Lacking registers and thread on x86_32. Always take slow path.
jmp(slow_case);
and
// Lacking registers and thread on x86_32. Always take slow path.
jmp(slow_case);
are dead / unreachable code (and/or if #ifdef X86 means x86_64) then there is no issue.

@coleenp
Copy link
Contributor

coleenp commented Apr 2, 2025

So is the issue that the 32 bit x86 code doesn't clear the BasicLock::monitor_cache before going slow path? I don't see the arm and s390 code doing that either. So the bug is that the quick-enter code will find a stale monitor?

@xmas92
Copy link
Member

xmas92 commented Apr 2, 2025

So is the issue that the 32 bit x86 code doesn't clear the BasicLock::monitor_cache before going slow path? I don't see the arm and s390 code doing that either. So the bug is that the quick-enter code will find a stale monitor?

s390 does (this code is used by interpreter and c1)

if (UseObjectMonitorTable) {
// Clear cache in case fast locking succeeds.
const Address om_cache_addr = Address(basic_lock, BasicObjectLock::lock_offset() + in_ByteSize((BasicLock::object_monitor_cache_offset_in_bytes())));
z_mvghi(om_cache_addr, 0);
}

and in C2 / native wrapper

if (UseObjectMonitorTable) {
// Clear cache in case fast locking succeeds.
z_mvghi(Address(box, BasicLock::object_monitor_cache_offset_in_bytes()), 0);
}

Arm32 is protected inside the runtime with preprocessor macros

#if !defined(ZERO) && (defined(X86) || defined(AARCH64) || defined(RISCV64) || defined(PPC64) || defined(S390))
ObjectMonitor* monitor = reinterpret_cast<ObjectMonitor*>(get_metadata());
if (monitor != nullptr && monitor->is_being_async_deflated()) {
clear_object_monitor_cache();
return nullptr;
}
return monitor;
#else
// Other platforms do not make use of the cache yet,
// and are not as careful with maintaining the invariant
// that the metadata either is nullptr or ObjectMonitor*.
return nullptr;
#endif

@shipilev
Copy link
Member

shipilev commented Apr 2, 2025

// Implicit null check.
movptr(hdr, Address(obj, oopDesc::mark_offset_in_bytes()));
// Lacking registers and thread on x86_32. Always take slow path.
jmp(slow_case);

and

This C1 code is on the chopping block with #24301.

// Lacking registers and thread on x86_32. Always take slow path.
jmp(slow_case);

are dead / unreachable code (and/or if #ifdef X86 means x86_64) then there is no issue.

This code is already absent in mainline, removed during the template interpreter cleanups.

@rkennke
Copy link
Contributor Author

rkennke commented Apr 2, 2025

Ok, sounds like we're good, then. Let's
/integrate

@openjdk
Copy link

openjdk bot commented Apr 2, 2025

Going to push as commit 49cb7aa.
Since your change was applied there have been 223 commits pushed to the master branch:

  • d32ff13: 8353117: Crash: assert(id >= ThreadIdentifier::initial() && id < ThreadIdentifier::current()) failed: must be reasonable)
  • a0677d9: 8353263: Parallel: Remove locking in PSOldGen::resize
  • 8608b16: 8348887: Create IR framework test for JDK-8347997
  • 23eb648: 8353545: Improve debug info for StartOptionTest
  • 4f97c4c: 8349211: Add support for intrusive trees to the utilities red-black tree
  • c9baa8a: 8352418: Add verification code to check that the associated loop nodes of useless Template Assertion Predicates are dead
  • b80b04d: 8353329: Small memory leak when create GrowableArray with initial size 0
  • 4a50778: 8353458: Don't pass -Wno-format-nonliteral to CFLAGS
  • 9076673: 8304674: File java.c compile error with -fsanitize=address -O0
  • 8fb67ac: 8282053: IGV: refine schedule approximation
  • ... and 213 more: https://git.openjdk.org/jdk/compare/b891bfa7e67c21478475642e2bfa2cdc65a3bffe...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Apr 2, 2025
@openjdk openjdk bot closed this Apr 2, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Apr 2, 2025
@openjdk
Copy link

openjdk bot commented Apr 2, 2025

@rkennke Pushed as commit 49cb7aa.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
Development

Successfully merging this pull request may close these issues.

5 participants