Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8267404: vmTestbase/vm/mlvm/anonloader/stress/oome/metaspace/Test.java failed with OutOfMemoryError #4140

Closed
wants to merge 4 commits into from

Conversation

DamonFool
Copy link
Member

@DamonFool DamonFool commented May 21, 2021

Hi all,

vmTestbase/vm/mlvm/anonloader/stress/oome/metaspace/Test.java OOMEs on Oracle's aarch64 platforms.
The reason is that both -Xmx and -XX:MetaspaceSize are not enough.

From the original JBS decription of JDK-8267404, the VM OOMEs before the expected OOME in metaspace happened showing that -Xmx256m is not enough.

Then, @dcubed-ojdk helped me test with -Xmx512, which still OOMEs.
However, the expected OOME in metaspace was caught this time.
But a second uncaught OOME in metaspace happened soon, which means -XX:MetaspaceSize=8m is not enough.

So both -Xmx and -XX:MetaspaceSize should be increased.
The fix just:

  • Revert changes about mataspace size setting
  • Increase -Xmx from 256m to 1g

-Xmx512m may be OK on Oracle's aarch64 machines, but to make it safer, -Xmx1g is preferred.

Thanks.
Best regards,
Jie


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8267404: vmTestbase/vm/mlvm/anonloader/stress/oome/metaspace/Test.java failed with OutOfMemoryError

Reviewers

Contributors

  • xiangyuan <xiangyuan@tencent.com>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/4140/head:pull/4140
$ git checkout pull/4140

Update a local copy of the PR:
$ git checkout pull/4140
$ git pull https://git.openjdk.java.net/jdk pull/4140/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 4140

View PR using the GUI difftool:
$ git pr show -t 4140

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/4140.diff

@bridgekeeper
Copy link

bridgekeeper bot commented May 21, 2021

👋 Welcome back jiefu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented May 21, 2021

@DamonFool The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label May 21, 2021
@DamonFool DamonFool marked this pull request as ready for review May 21, 2021
@openjdk openjdk bot added the rfr Pull request is ready for review label May 21, 2021
@mlbridge
Copy link

mlbridge bot commented May 21, 2021

Webrevs

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

/contributor add xiangyuan xiangyuan@tencent.com

@openjdk
Copy link

openjdk bot commented May 21, 2021

@DamonFool Could not parse xiangyuan xiangyuan@tencent.com as a valid contributor.
Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <duke@openjdk.org>

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

/contributor add xiangyuan xiangyuan@tencent.com

@openjdk
Copy link

openjdk bot commented May 21, 2021

@DamonFool
Contributor xiangyuan <xiangyuan@tencent.com> successfully added.

@stefank
Copy link
Member

stefank commented May 21, 2021

Maybe this is already understood, but in case it is not clear: -XX:MetaspaceSize is only used to determine at what point the first metaspace-induced GC should happen. When the Metaspace is filled up to that value, we trigger a GC. It is -XX:MaxMetaspaceSize that is the limit of how much memory we may commit for the metaspace.

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

Maybe this is already understood, but in case it is not clear: -XX:MetaspaceSize is only used to determine at what point the first metaspace-induced GC should happen. When the Metaspace is filled up to that value, we trigger a GC. It is -XX:MaxMetaspaceSize that is the limit of how much memory we may commit for the metaspace.

Got it.
Thanks @stefank for your clarification.

@tstuefe
Copy link
Member

tstuefe commented May 21, 2021

Maybe this is already understood, but in case it is not clear: -XX:MetaspaceSize is only used to determine at what point the first metaspace-induced GC should happen. When the Metaspace is filled up to that value, we trigger a GC. It is -XX:MaxMetaspaceSize that is the limit of how much memory we may commit for the metaspace.

Arguably MetaspaceSize=MaxMetaspaceSize helps a tiny tiny bit in lowering the runtime for this test, since we avoid some GC cycles (which would not collect anything anyway since this test does not unload anything).

@tstuefe
Copy link
Member

tstuefe commented May 21, 2021

I don't think this patch does what you think it does, sorry.

First off, as @stefank said, please forget about MetaspaceSize (terribly confusing name). Either omit it or set it to the same value as MaxMetaspaceSize. The latter may cause the test to be slightly faster, but it does not matter much either way.

The test allocates classes endlessly. The point of this test is to fill up metaspace with data until it is full, then to observe a correctly thrown OOM from Metaspace. For that to work, metaspace has to be small, not large.

In theory you could make it as small as 512k or so. Only, we need the VM itself to come up and start the test. So the VM needs to be able to load all JDK classes needed for booting, then the test classes. Then the test starts, loads all these generated classes, and then metaspace should run out. In my local tests I was able to run the test correctly with MaxMetaspaceSize=4m.

Each generated j.l.Class object needs heap space too, therefore this test is a race between what part is exhausted earlier, metaspace or heap. If you see an OOM from heap, not metaspace, it means heap space ran out first, and the test will fail.

You want metaspace to be as small as possible, heap space to be large enough to exhaust metaspace first. In my local test on Linux x64, I was able to run with MaxMetaspaceSize=4m and 128m heap space.

If you increase MaxMetaspaceSize to 20m, you will clearly need more and more heap space to still be able to exhaust that higher limit. Leave MaxMetaspaceSize at 8, or preferably less.

What I am concerned about is that with there should not be that much variance in how much heap space we use. Sure you can set it to 1g or 2g, but something is clearly off. I get by on Linux x64 with 128m, on 32bit x86 with 256m (since we don't have CompressedClassPointers we need somewhat more heap). But needing 1g on aarch64 seems weird.

Cheers, Thomas

p.s. One source of variance may be the question if CDS is on or off. If CDS is on (default), the JVM will use significantly less Metaspace at boot. With CDS off, we use more metaspace. It may make sense to set Xshare=off to get rid of this variance.

p.p.s. There are several ways to improve this test and make it more robust:

  1. increase the class size (see AnonkTestee01 - make a sister class with more fields and more constants and use that one)
  2. Rewrite the whole test: let it start a jvm with ProcessBuilder with a very low metaspace - too low to load the JDK itself. Something like this: java -Xshare:off -XX:MaxMetaspaceSize=1m -version. The VM should not come up; instead we should see an OutOfMemoryError: Metaspace at stdout.

(1) makes the test run through faster and requires less heap to fill up metaspace
(2) would completely remove the guessing game of "how much metaspace do I need to let the VM come up and start the test".

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

2. Rewrite the whole test: let it start a jvm with ProcessBuilder with a very low metaspace - too low to load the JDK itself. Something like this: java -Xshare:off -XX:MaxMetaspaceSize=1m -version. The VM should not come up; instead we should see an OutOfMemoryError: Metaspace at stdout.

Thanks @tstuefe for your nice analysis.
Much has been learned from you all today.

Due to the race between heap and metaspace, I prefer the second approach (rewrite the whole test with ProcessBuilder).

Thanks.
Best regards,
Jie

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

2. Rewrite the whole test: let it start a jvm with ProcessBuilder with a very low metaspace - too low to load the JDK itself. Something like this: java -Xshare:off -XX:MaxMetaspaceSize=1m -version. The VM should not come up; instead we should see an OutOfMemoryError: Metaspace at stdout.

Hi @tstuefe ,

The test has been rewrite with ProcessBuilder.
Please review it.

Testing:

  • The affected test on Linux/x86_{64, 32}, MacOSX

Thanks.

Copy link
Member

@tstuefe tstuefe left a comment

Seems fine. I can confirm this working with x86 and x64.

The MaxMetaspaceSize feels a bit on the low side - I would prefer a somewhat increased value of maybe 512k or 1m. Just to get the VM rolling a bit before OOMing out. I tested 1m for both 32bit and 64bit, seems to work too.

But I leave it up to you, this is fine too. Just please make sure other platforms work with this setting, esp. arm32 and aarch64.

(I'm on vacation and won't be able to look at this for the next two weeks).

If nobody uses AnonkTestee01 anymore, that class can be removed.

Cheers, Thomas

@openjdk
Copy link

openjdk bot commented May 21, 2021

@DamonFool This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8267404: vmTestbase/vm/mlvm/anonloader/stress/oome/metaspace/Test.java failed with OutOfMemoryError

Co-authored-by: xiangyuan <xiangyuan@tencent.com>
Reviewed-by: stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 6 new commits pushed to the master branch:

  • 94cfeb9: 8256155: Allow multiple large page sizes to be used on Linux
  • ec8a809: 8267119: switch expressions lack support for deferred type-checking
  • 4ba7613: 8267332: xor value should handle bounded values
  • ee2651b: 8203359: Container level resources events
  • b5d32bb: 8260690: JConsole User Guide Link from the Help menu is not accessible by keyboard
  • e48d7d6: 8264218: Public method javax.swing.JMenu.setComponentOrientation() has no spec

Please see this link for an up-to-date comparison between the source branch of this pull request and the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label May 21, 2021
@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

Seems fine. I can confirm this working with x86 and x64.

The MaxMetaspaceSize feels a bit on the low side - I would prefer a somewhat increased value of maybe 512k or 1m. Just to get the VM rolling a bit before OOMing out. I tested 1m for both 32bit and 64bit, seems to work too.

But I leave it up to you, this is fine too. Just please make sure other platforms work with this setting, esp. arm32 and aarch64.

(I'm on vacation and won't be able to look at this for the next two weeks).

If nobody uses AnonkTestee01 anymore, that class can be removed.

Cheers, Thomas

Thanks @tstuefe for your review.

MaxMetaspaceSize has been increased to 512k.
It works on our Linux/{x86, x64}, MacOSX and aarch64.
Thanks.

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

If nobody uses AnonkTestee01 anymore, that class can be removed.

I'll check if we can remove it safely next week.
Thanks.

@DamonFool
Copy link
Member Author

DamonFool commented May 21, 2021

According to @dcubed-ojdk 's comments in the JBS, the latest change passed in Oracle's CI/CD.
Will push it tomorrow if there is no objection.
Thanks.

@DamonFool
Copy link
Member Author

DamonFool commented May 22, 2021

/integrate

@openjdk openjdk bot closed this May 22, 2021
@openjdk openjdk bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated labels May 22, 2021
@openjdk openjdk bot removed the rfr Pull request is ready for review label May 22, 2021
@openjdk
Copy link

openjdk bot commented May 22, 2021

@DamonFool Since your change was applied there have been 12 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

Pushed as commit 6288a99.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@DamonFool DamonFool deleted the JDK-8267404 branch May 22, 2021
@iignatev
Copy link
Member

iignatev commented May 24, 2021

@DamonFool , the goal of this test is to check that in case of (nearly) exhaused metaspace defining classes via MethodHandles.Lookup::defineHiddenClass will lead to OOME, after your rewriting, this test doesn't serve that purpose anymore, now it just checks that runnig java -version w/ small metaspace will lead to OOME.

-- Igor

@DamonFool
Copy link
Member Author

DamonFool commented May 24, 2021

@DamonFool , the goal of this test is to check that in case of (nearly) exhaused metaspace defining classes via MethodHandles.Lookup::defineHiddenClass will lead to OOME, after your rewriting, this test doesn't serve that purpose anymore, now it just checks that runnig java -version w/ small metaspace will lead to OOME.

-- Igor

Sorry, I didn't know that before since the background of the test is so limited (no bug ID, no design purpose comments, etc.).
So what's your suggestion next? @iignatev
Thanks.

@iignatev
Copy link
Member

iignatev commented May 24, 2021

well, although I agree that the test might have done a better job explaining its purpose, it did have a javadoc: This test loads classes using defineHiddenClass and stores them, expecting Metaspace OOME and its location (vm/mlvm/anonloader/stress/oome/metaspace) suggests that it's a test for anonymous classloader (it used to use Unsafe::defineAnonymousClass till it get removed) part of mlvm (jsr292), and it's supposed to stress metaspace in near OOME state, so it's unfair to say it didn't state its purpose at all.

I haven't followed this RFR thread closely, so I can't suggest you the way to fix the test, but I strongly advise you to restore its logic, that's to say backout 6288a9936cc7e69cab0cc5f3e49c803f184bf2ca and work on the proper fix which will preserve the original intent of the test.

-- Igor

@iignatev
Copy link
Member

iignatev commented May 24, 2021

after spending some time looking at the original test, I'm wondering if we need -Xmx, -XX:MaxMetaspaceSize, -XX:MetaspaceSize at all.

@DamonFool
Copy link
Member Author

DamonFool commented May 24, 2021

after spending some time looking at the original test, I'm wondering if we need -Xmx, -XX:MaxMetaspaceSize, -XX:MetaspaceSize at all.

Let's wait for @tstuefe 's opinion about this suggestion.

But please feel free to do what you want if you think it is urgent.
Thanks.

@iignatev
Copy link
Member

iignatev commented May 24, 2021

There is no urgency in restoring the test(given it was problem listed), so I’m fine w/ waiting for Thomas’ opinion.

meanwhile, I’ve verified that if we remove Xmx, the test passes on Oracle build of linux-aarch64 in the configuration it used to fail.

@DamonFool
Copy link
Member Author

DamonFool commented May 24, 2021

There is no urgency in restoring the test(given it was problem listed), so I’m fine w/ waiting for Thomas’ opinion.

Thanks for your understanding.

meanwhile, I’ve verified that if we remove Xmx, the test passes on Oracle build of linux-aarch64 in the configuration it used to fail.

For Oracle build of linux-aarch64, it wouldn't fail without -Xmx before.
The failure was sighting due to extra -Xmx was added in the test, which tried to fix JDK-8267293 .
More background is here: #4062 (comment)
Thanks.

@iignatev
Copy link
Member

iignatev commented May 24, 2021

For Oracle build of linux-aarch64, it wouldn't fail without -Xmx before.
I see, and in what cases did it fail w/o -Xmx?

this test isn't supposed to be run concurrently w/ any other tests (and hopefully when CODETOOLS-7902831 / openjdk/jtreg#4 gets fixed it will be indeed), so I guess we might as well explicitly set -XX:MaxRAMPercentage=25.0 in the test (instead of -Xmx).

Cheers,
-- Igor

@DamonFool
Copy link
Member Author

DamonFool commented May 24, 2021

so I guess we might as well explicitly set -XX:MaxRAMPercentage=25.0 in the test (instead of -Xmx).

This seems reasonable to me.
Thanks.

@iignatev
Copy link
Member

iignatev commented May 24, 2021

this will, however, only solve the problem w/ the test; but I also share Thomas' concern[*], we shouldn't need 1g (or even .5g) of heap space to eat up 20m of metaspace (even w/ small classes like AnonkTestee01), so there seems to be an underlying defect in linux-aarch64 build, which will need to be investigated/addressed separately.

-- Igor

[*]

What I am concerned about is that with there should not be that much variance in how much heap space we use. Sure you can set it to 1g or 2g, but something is clearly off. I get by on Linux x64 with 128m, on 32bit x86 with 256m (since we don't have CompressedClassPointers we need somewhat more heap). But needing 1g on aarch64 seems weird.

@DamonFool
Copy link
Member Author

DamonFool commented May 24, 2021

so there seems to be an underlying defect in linux-aarch64 build, which will need to be investigated/addressed separately.

Yes, I think that makes sense since all of our linux-aarch64 platforms couldn't reproduce the OOME except those of Oracle's .

@DamonFool
Copy link
Member Author

DamonFool commented Jun 11, 2021

@DamonFool , the goal of this test is to check that in case of (nearly) exhaused metaspace defining classes via MethodHandles.Lookup::defineHiddenClass will lead to OOME, after your rewriting, this test doesn't serve that purpose anymore, now it just checks that runnig java -version w/ small metaspace will lead to OOME.

-- Igor

@tstuefe , do you also think we should revert this change and use -XX:MaxRAMPercentage=25 to fix the bug?
Thanks.

@mlbridge
Copy link

mlbridge bot commented Jun 18, 2021

Mailing list message from Igor Ignatev on hotspot-runtime-dev:

adding Thomas' direct email.

-- Igor

1 similar comment
@mlbridge
Copy link

mlbridge bot commented Jun 18, 2021

Mailing list message from Igor Ignatev on hotspot-runtime-dev:

adding Thomas' direct email.

-- Igor

@tstuefe
Copy link
Member

tstuefe commented Jun 20, 2021

@DamonFool @iignatev guys, sorry for the late response. I am fine with restoring the test, I think Igor was right.

I would love to understand the differences with aarch64s high heap usage though. Some thoughts. Do we manage on aarch64 to load the same amount of classes before OOME hits? Less? More? Does the way we create classes on aarch64 need somehow more java heap space?

It would be nice to compare metaspace state between "normal" platforms and aarch64 at the time oome hits. One way to do this would be to run the tests with -XX:+PrintMetaspaceStatisticsAtExit and compare the results. It would also be nice to know how many classes had been loaded at that point.

I'd be happy to take a look if someone does the tests. Unfortunately, I have no hardware to reproduce this problem.

@DamonFool
Copy link
Member Author

DamonFool commented Jun 20, 2021

I am fine with restoring the test, I think Igor was right.

OK.
Filed: https://bugs.openjdk.java.net/browse/JDK-8269065
I'd like to get it fixed in jdk17.
Thanks.

We also can't reproduce the OOME on our aarch64 platforms.

@tstuefe
Copy link
Member

tstuefe commented Jun 20, 2021

I am fine with restoring the test, I think Igor was right.

OK.
Filed: https://bugs.openjdk.java.net/browse/JDK-8269065
I'd like to get it fixed in jdk17.
Thanks.

We also can't reproduce the OOME on our aarch64 platforms.

Wrote offlist to you and Igor on Friday, but your mail bounced. One thing which may be a difference - and this is just a wild guess - is that Oracle's aarch64 machines use 64k pages. We recently had errors which only were reproducable on their machines due to that fact. I am not totally clear how this could cause the error, but that is the only difference I can think off off-hand.

@DamonFool
Copy link
Member Author

DamonFool commented Jun 20, 2021

One thing which may be a difference - and this is just a wild guess - is that Oracle's aarch64 machines use 64k pages. We recently had errors which only were reproducable on their machines due to that fact. I am not totally clear how this could cause the error, but that is the only difference I can think off off-hand.

I checked one aarch64 machine:

# getconf PAGE_SIZE
65536

But I didn't reproduce the OOME on it.

@DamonFool
Copy link
Member Author

DamonFool commented Jun 20, 2021

Filed: https://bugs.openjdk.java.net/browse/JDK-8269065
I'd like to get it fixed in jdk17.

PR: openjdk/jdk17#104
@iignatev @tstuefe
Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
4 participants