Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8293218: serviceability/tmtools/jstat/GcNewTest.java fails with "Error in the percent calculation" #10218

Closed
wants to merge 7 commits into from

Conversation

kevinjwalls
Copy link
Contributor

@kevinjwalls kevinjwalls commented Sep 8, 2022

Test update to cope with heap size changing (shrinking) in the early life of the test app.

A change in GC timing affects this test which reads eden size and heap size. Both eden and heap are likely to shrink initially for this test. Failures were that heap size shrank after reading eden size, such that eden appeared to be >100% of heap.
Recognising a shrinking heap and retrying resolves this.

(Re-ordering to read heap size then eden would be enough to make the check in provokeGc work. But it would allocate sometimes a very small fraction of the heap, which is not the intent.)


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issues

  • JDK-8293218: serviceability/tmtools/jstat/GcNewTest.java fails with "Error in the percent calculation"
  • JDK-8293564: serviceability/tmtools/jstat/GcTest01.java fails with "Error in the percent calculation"

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/10218/head:pull/10218
$ git checkout pull/10218

Update a local copy of the PR:
$ git checkout pull/10218
$ git pull https://git.openjdk.org/jdk pull/10218/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 10218

View PR using the GUI difftool:
$ git pr show -t 10218

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/10218.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Sep 8, 2022

👋 Welcome back kevinw! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Sep 8, 2022

@kevinjwalls The following label will be automatically applied to this pull request:

  • serviceability

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the serviceability serviceability-dev@openjdk.org label Sep 8, 2022
@kevinjwalls kevinjwalls marked this pull request as ready for review September 8, 2022 11:02
@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 8, 2022
@mlbridge
Copy link

mlbridge bot commented Sep 8, 2022

Webrevs

@plummercj
Copy link
Contributor

What is the purpose of the outer loop of this test? Is it to test while the heap is shrinking? If so, it seems there is no reason for the inner loop to be added, and also possibly just re-ordering the read of the heap size is the correct fix.

@sspitsyn
Copy link
Contributor

sspitsyn commented Sep 8, 2022

What is the purpose of the outer loop of this test? Is it to test while the heap is shrinking? If so, it seems there is no reason for the inner loop to be added, and also possibly just re-ordering the read of the heap size is the correct fix.

Agreed. At least, the outer loop is confusing and needs a comment explaining why it is needed.

@kevinjwalls
Copy link
Contributor Author

Right, that outer loop is the original test, and it's not obvious if it is needed at all.
Testing without it is fine, for all the tests in that directory.
Possibly it was to make "really sure" we allocate enough memory to cause a GC. But after allocating whatever fraction of memory eden is, it does that again, and then calls System.gc(). No reason to do that 3 times...

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Thanks,
Serguei

@openjdk
Copy link

openjdk bot commented Sep 9, 2022

@kevinjwalls This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8293218: serviceability/tmtools/jstat/GcNewTest.java fails with "Error in the percent calculation"
8293564: serviceability/tmtools/jstat/GcTest01.java fails with "Error in the percent calculation"

Reviewed-by: sspitsyn, lmesnik, dcubed

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 87 new commits pushed to the master branch:

  • 746f5f5: 8293816: CI: ciBytecodeStream::get_klass() is not consistent
  • 4b297c1: 8293892: Add links to JVMS 19 and 20 from ClassFileFormatVersion enum constants
  • dfb9c06: 8293535: jdk/javadoc/doclet/testJavaFX/TestJavaFxMode.java fail with jfx
  • f42caef: 8293550: Optionally add get-task-allow entitlement to macos binaries
  • 5feca68: 8293840: RISC-V: Remove cbuf parameter from far_call/far_jump/trampoline_call
  • 39cd163: 8293578: Duplicate ldc generated by javac
  • 7765942: 8290367: Update default value and extend the scope of com.sun.jndi.ldap.object.trustSerialData system property
  • 11e7d53: 8293819: sun/util/logging/PlatformLoggerTest.java failed with "RuntimeException: Retrieved backing PlatformLogger level null is not the expected CONFIG"
  • 141d5f5: 8293767: AWT test TestSinhalaChar.java has old SCCS markings
  • 3beca2d: 8291600: [vectorapi] vector cast op check is not always needed for vector mask cast
  • ... and 77 more: https://git.openjdk.org/jdk/compare/8d3399bf5f354931b0c62d2ed8095e554be71680...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Sep 9, 2022
@plummercj
Copy link
Contributor

This code still isn't sitting right with me and I'm still not sure of it's original intent. The method is called provokeGc, but it seems to be quite a bit more complex than that. It (used to) loop 3 times, each time calculating the youngGen size as a % of the heap size (according to the MemoryPoolMXBean values). It then (on each iteration) allocates that much memory by applying the % to Runtime.maxMemory(), and does so twice. It could have just allocated the youngGen size twice, but instead used the round about way of determining how much to allocate. I'm not sure why.

In any case, the end result should be triggering at least one youngGen GC per iteration, but more likely two unless it was empty at the start. Then it follows this up with a full GC (on each iteration). So what is the point of any of this? It it testing consistency of the MemoryPoolMXBean values? Is it testing the relation between the MemoryPoolMXBean heap size and Runtime.maxMemory()?

You have changed it to only do the allocation part once rather than on every iteration, so now the loop is only checking if the heap is still shrinking, but just because heapSize == heapSize0 doesn't mean it's not shrinking. It can just mean no GC happened between the two Pools.getHeapCommittedSize() calls. I think maybe the original intend was to do the allocateHeap() part while the heap is shrinking, not just once after it has stabilized.

BTW, make sure you test with ZGC.

@dcubed-ojdk
Copy link
Member

@kevinjwalls - If you do a comment with "/issue JDK-8293564", then you'll
include the other bug ID that has a test affected by the same underlying issue
with this PR.

@kevinjwalls
Copy link
Contributor Author

/issue JDK-8293564

@openjdk
Copy link

openjdk bot commented Sep 12, 2022

@kevinjwalls
Adding additional issue to issue list: 8293564: serviceability/tmtools/jstat/GcTest01.java fails with "Error in the percent calculation".

@kevinjwalls
Copy link
Contributor Author

Thanks @plummercj

Yes, the whole fraction/percentage calculation is odd. Maybe it could just allocate eden size in bytes, but I was not rushing to throw it out.

I think it's clear that provokeGc wants to allocate and cause a GC. Tripling the pair of allocations appears to be an effort make extra sure it does that, but I think we're agreeing there is no clear point to that.

The current provokeGc is from 2016:

8168396: Unexpected OOME in GcCauseTest02 and GcTest02
http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/1b7fd4c2f65e

...so there was a previous OOM problem, possibly because young gen was large and could not be promoted. That probably inspired the fraction calculation. I'm not sure it's the best way of doing things.

Yes, my update is to retry if we see a shrink between fetching eden and heap size. It doesn't mean the heap has fully stabilized and has finished shrinking, it just means that we don't think a shrink occurred between fetching the two stats. That means our fraction calculation should be OK. No GC beteween the two stats being fetched is the common case of course, but now we are seeing enough GCs in between them to cause a frequent failures.

ZGC:
The tests in serviceability/tmtools/jstat do not run with ZGC or Shenandoah (requires vm.gc != "Z" & vm.gc != "Shenandoah").
They use e.g. "jstat -gcnew PID" which doesn't produce numbers with ZGC, so they would fail.

@plummercj
Copy link
Contributor

plummercj commented Sep 12, 2022

Hmm:

-    public static GcProvoker createGcProvoker() {
-        return new GcProvokerImpl();

So what happened to GcProvokerImpl? It no longer exists, but was not deleted as part of this changeset. Must have been cleaned up later. As of this changeset it was still in place:

http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/1b7fd4c2f65e/test/serviceability/tmtools/jstat/utils/GcProvokerImpl.java

And it's provokeGc() method was doing the same thing this "new" implementation is dong.

@kevinjwalls
Copy link
Contributor Author

OK, yes that technique in GcProvokerImpl comes in with:

date: Thu Jan 14 15:35:21 2016 +0300
summary: 8130063: Refactoring tmtools jstat and jstack tests to jtreg

..but IIRC tmtools is a testsuite being integrated into the JDK source, so the code is much older. Does this change anything? 8-)

@plummercj
Copy link
Contributor

I think the point is that the loop has been there probably since the beginning. So I wouldn't take it out. However, I'm not sure I agree with how you have now moved some parts out of the loop so they are only done once.

Copy link
Contributor

@sspitsyn sspitsyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've posted one comment. Otherwise, the update looks good.
Thanks,
Serguei

@dcubed-ojdk
Copy link
Member

Any chance this PR will integrate sometime today? We're still getting
serviceability/tmtools/jstat/GcTest01.java failures in Tier7 and I don't
want to ProblemList it if the fix is going in soon...

@kevinjwalls
Copy link
Contributor Author

OK, just updated to address the needless else.
It should be clear that this loop is only to give it a couple of tries to fetch the eden as a fraction, retrying if the heap is shrinking.

The previous allocations and System.gc calls being in the loop does not appear necessary.
This currently always causes an allocation as far as I can see in testing, including with the params that have caused failures.

This change does not address that the whole percentage of heap/memory is kind of strange and maybe unnecessary. If there are future problems, maybe we just allocate eden size in bytes instead, but for now I'd like to go ahead with the change here.

Copy link
Member

@dcubed-ojdk dcubed-ojdk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with the changes. Thanks for adding the retry when we
observe the heap shrinking.

It's not clear what kind of testing has already been done:

  • JDK-8293218 failures have been seen in Tier[78].
  • JDK-8293564 failures have been seen in Tier7.
  • Both tests also run in Tier[13].

I recommend doing Tier1 testing and then use that build-ID for
Tier3 and Tier7 testing. The tests also run in Tier[456] testing,
but I think you can skip those.

@kevinjwalls
Copy link
Contributor Author

OK thanks for all the comments, this is looking like a reasonable robustness improvement, and I can't trigger a failure. Will integrate.

@kevinjwalls
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Sep 17, 2022

Going to push as commit cfd44bb.
Since your change was applied there have been 90 commits pushed to the master branch:

  • 01e7b88: 8290917: x86: Memory-operand arithmetic instructions have too low costs
  • 4b8399b: 8293251: Use stringStream::base() instead of as_string() when applicable
  • a8f0f57: 8278165: Clarify that ZipInputStream does not access the CEN fields for a ZipEntry
  • 746f5f5: 8293816: CI: ciBytecodeStream::get_klass() is not consistent
  • 4b297c1: 8293892: Add links to JVMS 19 and 20 from ClassFileFormatVersion enum constants
  • dfb9c06: 8293535: jdk/javadoc/doclet/testJavaFX/TestJavaFxMode.java fail with jfx
  • f42caef: 8293550: Optionally add get-task-allow entitlement to macos binaries
  • 5feca68: 8293840: RISC-V: Remove cbuf parameter from far_call/far_jump/trampoline_call
  • 39cd163: 8293578: Duplicate ldc generated by javac
  • 7765942: 8290367: Update default value and extend the scope of com.sun.jndi.ldap.object.trustSerialData system property
  • ... and 80 more: https://git.openjdk.org/jdk/compare/8d3399bf5f354931b0c62d2ed8095e554be71680...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Sep 17, 2022
@openjdk openjdk bot closed this Sep 17, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 17, 2022
@openjdk
Copy link

openjdk bot commented Sep 17, 2022

@kevinjwalls Pushed as commit cfd44bb.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integrated Pull request has been integrated serviceability serviceability-dev@openjdk.org
5 participants