8263236: runtime/os/TestTracePageSizes.java fails on old kernels #3415

Closed

Conversation

shipilev
Contributor

@shipilev shipilev commented Apr 9, 2021

See the bug report for details. On some kernels, we have trouble parsing madvise tags from /proc/smaps.

Additional testing:

  • Test with kernel 5.4.0 (still passes)
  • Test with kernel 4.9.0 (used to fail, now passes)
  • Test with kernel 4.15 (still passes)
  • Test with kernel 4.14.17 (used to fail, now passes)

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8263236: runtime/os/TestTracePageSizes.java fails on old kernels

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3415/head:pull/3415
$ git checkout pull/3415

Update a local copy of the PR:
$ git checkout pull/3415
$ git pull https://git.openjdk.java.net/jdk pull/3415/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3415

View PR using the GUI difftool:
$ git pr show -t 3415

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3415.diff

@bridgekeeper

@bridgekeeper bridgekeeper bot commented Apr 9, 2021

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

@openjdk openjdk bot commented Apr 9, 2021

@shipilev The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime label Apr 9, 2021
@openjdk openjdk bot added the rfr label Apr 9, 2021
@mlbridge

@mlbridge mlbridge bot commented Apr 9, 2021

Webrevs

@shipilev
Contributor Author

@shipilev shipilev commented Apr 14, 2021

Anyone? :)

@shipilev
Contributor Author

@shipilev shipilev commented Apr 21, 2021

Anyone? This unfortunately breaks tier1 on some of our testing hosts.

Member

@dholmes-ora dholmes-ora left a comment

Hi Aleksey,

I can't really evaluate the changes as I'm not familiar with the information that is being queried. I get the gist of things and as this is a test the real question is whether the test now passes okay. So on that basis I'll approve it.

Thanks,
David

@openjdk

@openjdk openjdk bot commented Apr 21, 2021

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8263236: runtime/os/TestTracePageSizes.java fails on old kernels

Reviewed-by: dholmes, sjohanss, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 12 new commits pushed to the master branch:

  • 04f7112: 8266293: Key protection using PBEWithMD5AndDES fails with "java.security.InvalidAlgorithmParameterException: Salt must be 8 bytes long"
  • a90b33a: 8266573: Make sure blackholes are tagged for all JVMCI paths
  • 2dcbedf: 8266044: Nested class summary should show kind of class or interface
  • e840597: 8266460: java.io tests fail on null stream with upgraded jtreg/TestNG
  • fcedfc8: 8266579: Update test/jdk/java/lang/ProcessHandle/PermissionTest.java & test/jdk/java/sql/testng/util/TestPolicy.java
  • c665dba: 8266561: Remove Compile::_save_argument_registers
  • 47d4438: 8266426: ZHeapIteratorOopClosure does not handle native access properly
  • 2438498: 8252758: Lanai: Optimize index calculation while copying glyphs
  • eb3b96d: 8266496: WBIsKlassAliveClosure.do_klass() fails for hidden classes
  • 51f5adf: 8265047: Inconsistent warning message in jcmd VM.log
  • ... and 2 more: https://git.openjdk.java.net/jdk/compare/a86ee9b3f370b59caea2ae78169d13498560cd8e...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Apr 21, 2021
@tstuefe
Member

@tstuefe tstuefe commented Apr 21, 2021

I try to understand this. So:

AnonHugePages - number of huge pages mapped into area, THP or explicit
ht - this area is mapped with explicit huge pages
hg - this area is mapped with madvise(huge_pages), which indicates THP

Therefore: !ht && AnonHugePages > 0 -> THP?
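A minimal sketch of the heuristic as posed in this comment, not of confirmed kernel semantics. The class, enum, and method names are hypothetical, and it assumes the `VmFlags` tokens and the `AnonHugePages` value (in kB) have already been extracted from one `/proc/self/smaps` entry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: classifies one smaps entry using the rules listed above.
class MappingClassifier {
    enum Kind { EXPLICIT_HUGE_PAGES, THP_BACKED, THP_ADVISED_ONLY, SMALL_PAGES }

    static Kind classify(String vmFlagsLine, long anonHugePagesKb) {
        Set<String> flags =
            new HashSet<>(Arrays.asList(vmFlagsLine.trim().split("\\s+")));
        if (flags.contains("ht")) {
            return Kind.EXPLICIT_HUGE_PAGES;   // ht: explicit huge pages
        }
        if (anonHugePagesKb > 0) {
            return Kind.THP_BACKED;            // !ht && AnonHugePages > 0
        }
        if (flags.contains("hg")) {
            return Kind.THP_ADVISED_ONLY;      // madvise'd, nothing huge mapped yet
        }
        return Kind.SMALL_PAGES;
    }

    public static void main(String[] args) {
        System.out.println(classify("rd wr mr mw me ac hg", 2048)); // THP_BACKED
        System.out.println(classify("rd wr mr mw me ac hg", 0));    // THP_ADVISED_ONLY
        System.out.println(classify("rd wr mr mw me ht", 0));       // EXPLICIT_HUGE_PAGES
        System.out.println(classify("rd wr mr mw me ac", 0));       // SMALL_PAGES
    }
}
```

The `THP_ADVISED_ONLY` case is kept separate on purpose: a region can carry `hg` while `AnonHugePages` is still 0, so the two signals do not always agree.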

What confuses me is that the kernel patch you refer to: torvalds/linux@50f8b92 sounds like the flag is not passed down to the memory management layer. Would this not effectively switch off THP for the region, and the resulting mapping would use small pages? In which case the failing test would have been correct, since we specified UseTransparentHugePages but it did not work?

Sorry for my confusion.

..Thomas

@shipilev
Contributor Author

@shipilev shipilev commented Apr 22, 2021

(sighs) The recent pull made the test fail again even with this patch. Let me see what is up there...

@pfustc
Member

@pfustc pfustc commented May 6, 2021

This test is in tier1. Shall we ProblemList it if we can't fix it in a short time?

@tstuefe
Member

@tstuefe tstuefe commented May 6, 2021

We cannot problemlist it for older kernels only, unfortunately.
@kstefanj could you have a look at this?

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

Sorry, totally missed this PR. I saw the bug-report a while back and hoped my recent refactoring would help the situation, but I guess it only made things worse? Will take a look.

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

I am looking at it too, since I have a machine where this failure reproduces reliably. Looks like the remaining failures are intermittent.

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

I notice that the failures are like this with debug turned on:

From logfile: [0.004s][info][pagesize] CodeHeap 'non-nmethods':  min=4M max=8M base=0x00007fffd9200000 page_size=2M size=8M
From smaps: [7fffd9200000, 7fffd9600000) anonHugePages = 0 pageSize=4KB isTHP=false isHUGETLB=false
From logfile: [0.011s][info][pagesize] Block Offset Table: req_size=2580K base=0x00007fffd7400000 page_size=2M alignment=2M size=4M
From smaps: [7fffd7400000, 7fffd7600000) anonHugePages = 0 pageSize=4KB isTHP=false isHUGETLB=false

So these probably are not committed yet, because -XX:+AlwaysPreTouch (set by the test) does not affect them. (sighs)

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

@shipilev, do you know which change broke the test again? I did a few different cleanups that are related.

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

@shipilev, that sounds like a valid theory. But not sure why that should have changed recently.

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

> @shipilev, do you know which change broke the test again? I did a few different cleanups that are related.

Don't know yet. But I think this kind of failure highlights that tracking AnonHugePages is not fool-proof either: we implicitly rely on allocations/commits happening in that area to have AHP != 0. That is doable for the Java heap with -XX:+AlwaysPreTouch, but not for VM areas. There are various -XX:+Zap* flags, but they do not cover every VM area.

> @shipilev, that sounds like a valid theory. But not sure why that should have changed recently.

I am now not even sure that the test passed reliably during my first attempt. All 6 subtests intermittently pass/fail with this patch.

Let me mull over this a bit. Maybe the saner way out would be checking the kernel version and bailing on older kernels.

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

> Let me mull over this a bit. Maybe the saner way out would be checking the kernel version and bailing on older kernels.

That would be one approach. When I first started to think about supporting THP for the test, I looked at a few different ways, including AnonHugePages, but the most reliable one was to check the flag. I would be totally OK with excluding the test for older kernels; I'm just not sure if we have a good way to figure this out or if we have to do some parsing.

@openjdk openjdk bot removed ready rfr labels May 6, 2021
@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

Ran out of ideas. New version checks for kernel version and bails on kernels lower than 5.x. I shall try and see if I can find the more precise kernel version where this was fixed. Meanwhile, would the coarse check like this work, if I could not find a more precise version?

@openjdk openjdk bot added ready rfr labels May 6, 2021
Contributor

@kstefanj kstefanj left a comment

Looks good to me. The isLinux() check should not be needed since we require this to only run on Linux. But it won't hurt and makes it more clear so I'm good with it.

tstuefe
tstuefe approved these changes May 6, 2021
Member

@tstuefe tstuefe left a comment

Looks good to me too.

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

Just realized one thing... you might be able to solve it with just adding to the @requires line as well, something like:
@requires os.family == "linux" & os.simpleVersion >= 5
But I'm not sure if >= is allowed for simpleVersion, and I'm not at all sure how simpleVersion is made available. Maybe it is a Mac thing: the only tests using it only matter for Mac. Maybe os.versionMajor is available to use. If you like this approach, maybe you can try and see if it works.

@tstuefe
Member

@tstuefe tstuefe commented May 6, 2021

Should we just handle AlwaysPretouch directly in os::reserve_memory() instead of having each caller do this? Or would this interfere with concurrent pretouching?

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

In theory this would be good, but as you say it would make parallel/concurrent pre-touch harder. Or at least we would have to extend a lot of APIs to pass down the needed work gang.

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

Okay, I did a rough bisect over pre-built Debian kernels, and that points to 4.15 as the first kernel that works. I updated the PR changeset and PR description accordingly. I am still struggling to find a particular commit that solved it, so I can explain to myself why this works... Anyhow, the version check would not get more precise than this, since we only have major and minor OS versions.
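For illustration, a coarse major/minor gate of the kind described here could look roughly like the sketch below. This is not the code from the changeset: the class and method names are hypothetical, and it assumes the kernel release string is read from the `os.version` system property, which on Linux typically carries the `uname -r` value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares only the leading "major.minor" of a release string.
class KernelVersionGate {
    private static final Pattern MAJOR_MINOR = Pattern.compile("^(\\d+)\\.(\\d+)");

    static boolean isAtLeast(String release, int wantMajor, int wantMinor) {
        Matcher m = MAJOR_MINOR.matcher(release);
        if (!m.find()) {
            return false; // unparseable release string: treat as "too old"
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major > wantMajor || (major == wantMajor && minor >= wantMinor);
    }

    public static void main(String[] args) {
        String release = System.getProperty("os.version", "");
        if (!isAtLeast(release, 4, 15)) {
            System.out.println("Skipping: kernel " + release + " is older than 4.15");
            return;
        }
        System.out.println("Kernel " + release + " is new enough, running the test");
    }
}
```

Anything after the minor number (patch level, distribution suffix) is ignored, which is exactly why such a check cannot tell a vanilla kernel from a distribution-patched one.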

Contributor

@kstefanj kstefanj left a comment

👍

@tstuefe
Member

@tstuefe tstuefe commented May 6, 2021

> In theory this would be good, but as you say it would make parallel/concurrent pre-touch harder. Or at least we would have to extend a lot of APIs to pass down the needed work gang.

Or make this an optional part of reservation, to be controlled by yet another argument (sigh - maybe not :)

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

> Okay, I did a rough bisect over pre-built Debian kernels, and that points to 4.15 as the first kernel that works. I updated the PR changeset and PR description accordingly. I am still struggling to find a particular commit that solved it, so I can explain to myself why this works...

All right. Here is the kicker. The vanilla 4.14.17 kernel passes the test, while Debian's 4.14.17 fails it! Looking at Debian kernel patches, I see a few suspicious ones. Applying these patches to vanilla 4.14.17 also makes the test fail. This hunk from that Debian kernel patch is apparently the problem:

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 7c87b6652244..f22c15d5e24c 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -87,7 +87,8 @@ calc_vm_flag_bits(unsigned long flags)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
-	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
+	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
+	       _calc_vm_trans(flags, MAP_FIXED,      VM_FIXED     );
 }

It copies the mmap tags to the VMA flags. Now, VMA flags are printed to /proc/smaps by show_smap_vma_flags. Notice how that method defensively prepares itself for missing flag cases by printing ??. Indeed, the Debian patch misses the relevant addition in show_smap_vma_flags, so we see ?? for some entries in /proc/smaps.
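To make the mechanism concrete, here is a small Java model of that printing logic. It is only a sketch: the bit positions and the mnemonic table are made up for illustration and are not the kernel's values, and the class name is hypothetical.

```java
import java.util.Map;

// Toy model: every set bit prints its mnemonic, or "??" if the table has none.
class SmapsFlagPrinter {
    static String render(long vmFlags, Map<Integer, String> mnemonics) {
        StringBuilder sb = new StringBuilder();
        for (int bit = 0; bit < Long.SIZE; bit++) {
            if ((vmFlags & (1L << bit)) != 0) {
                sb.append(mnemonics.getOrDefault(bit, "??")).append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative bit positions only.
        Map<Integer, String> known = Map.of(0, "rd", 1, "wr", 29, "hg");
        // Bit 20 stands in for a newly added flag with no mnemonic entry.
        long flags = (1L << 0) | (1L << 1) | (1L << 20) | (1L << 29);
        System.out.println(render(flags, known)); // prints "rd wr ?? hg "
    }
}
```

Because the bits are walked in ascending order, an unhandled flag with a lower bit number than `hg` shows up as `??` before `hg` in the output line.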

How's that a problem for this test? Stare at this regexp:

        String smapsPatternString = "(\\w+)-(\\w+).*?" +
                                    "KernelPageSize:\\s*(\\d*) kB.*?" +
                                    "VmFlags: ([\\w ]*)";

? is not \w, so the entire thing matches wrong when VMA flags have ?? in them. In the best case, we stop the match for the VmFlags capture group at the ?? flag. Since the newly added unhandled VMA flag comes before the hg flag in the kernel enum, the test does not see the hg tag when ?? appears.

And this only happens on some Debian kernels, because you have to have that new unhandled flag. This also explains why 4.15 works: a later commit redefined that new flag to an existing, handled value for VM_ARCH.

So this deserves a simple fix in the regexp itself:

-                                    "VmFlags: ([\\w ]*)";
+                                    "VmFlags: ([\\w\\? ]*)";

Testing that now... EDIT: Of course it works. Gaaaa.
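The effect can be reproduced outside the test with a sketch like the one below. Only the two pattern strings are taken from this thread; the class name, the helper methods, and the sample smaps entry are made up, and `Pattern.DOTALL` is an assumption so that `.*?` can span the lines of one entry.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the old and the fixed VmFlags capture.
class VmFlagsRegexDemo {
    static final String HEAD = "(\\w+)-(\\w+).*?" +
                               "KernelPageSize:\\s*(\\d*) kB.*?";
    static final Pattern OLD =
        Pattern.compile(HEAD + "VmFlags: ([\\w ]*)", Pattern.DOTALL);
    static final Pattern NEW =
        Pattern.compile(HEAD + "VmFlags: ([\\w\\? ]*)", Pattern.DOTALL);

    // Made-up entry with an unhandled "??" flag printed before "hg".
    static final String SAMPLE = String.join("\n",
        "7fffd9200000-7fffd9600000 rw-p 00000000 00:00 0",
        "Size:               4096 kB",
        "KernelPageSize:        4 kB",
        "MMUPageSize:           4 kB",
        "AnonHugePages:         0 kB",
        "VmFlags: rd wr mr mw me ac ?? hg ",
        "");

    // Returns the captured VmFlags text, or "" if the pattern does not match.
    static String vmFlags(Pattern p, String entry) {
        Matcher m = p.matcher(entry);
        return m.find() ? m.group(4) : "";
    }

    static boolean hasFlag(String flags, String flag) {
        return Arrays.asList(flags.trim().split("\\s+")).contains(flag);
    }

    public static void main(String[] args) {
        System.out.println("old sees hg: " + hasFlag(vmFlags(OLD, SAMPLE), "hg")); // false
        System.out.println("new sees hg: " + hasFlag(vmFlags(NEW, SAMPLE), "hg")); // true
    }
}
```

With the old character class the capture ends right before `??`, so every flag printed after it, including `hg`, is silently dropped.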

Contributor

@kstefanj kstefanj left a comment

Great dig @shipilev 👍

So the hg vmFlag is present for all kernels we care about?

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

> So the hg vmFlag is present for all kernels we care about?

Apparently. It is the test regexp that mismatches when kernel replies ?? for any other VMA flag.

@kstefanj
Contributor

@kstefanj kstefanj commented May 6, 2021

> > So the hg vmFlag is present for all kernels we care about?

> Apparently. It is the test regexp that mismatches when kernel replies ?? for any other VMA flag.

Got that. Just wanted to make sure no one else had this problem because they were on an even older kernel where ht actually wasn't added. But even if that is the case this fix is good =)

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

> > > So the hg vmFlag is present for all kernels we care about?

> > Apparently. It is the test regexp that mismatches when kernel replies ?? for any other VMA flag.

> Got that. Just wanted to make sure no one else had this problem because they were on an even older kernel where ht actually wasn't added.

The way I see the kernel sources, ht is available since the day VmFlags field was introduced in the smaps. Unless there is a bug that corrupts the VMA flag somewhere, we should be good.

@tstuefe
Member

@tstuefe tstuefe commented May 6, 2021

Hah. Beautiful. And somewhat ironic, what are the odds that the offending patch was actually a workaround for a hotspot issue.

+1.

@shipilev
Contributor Author

@shipilev shipilev commented May 6, 2021

/integrate

@openjdk openjdk bot closed this May 6, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels May 6, 2021
@openjdk

@openjdk openjdk bot commented May 6, 2021

@shipilev Since your change was applied there have been 14 commits pushed to the master branch:

  • 0ca86da: 8266002: vmTestbase/nsk/jvmti/ClassPrepare/classprep001 should skip events for unexpected classes
  • 52f1db6: 8262002: java/lang/instrument/VerifyLocalVariableTableOnRetransformTest.sh failed with "TestCaseScaffoldException: DummyClassWithLVT did not match .class file"
  • 04f7112: 8266293: Key protection using PBEWithMD5AndDES fails with "java.security.InvalidAlgorithmParameterException: Salt must be 8 bytes long"
  • a90b33a: 8266573: Make sure blackholes are tagged for all JVMCI paths
  • 2dcbedf: 8266044: Nested class summary should show kind of class or interface
  • e840597: 8266460: java.io tests fail on null stream with upgraded jtreg/TestNG
  • fcedfc8: 8266579: Update test/jdk/java/lang/ProcessHandle/PermissionTest.java & test/jdk/java/sql/testng/util/TestPolicy.java
  • c665dba: 8266561: Remove Compile::_save_argument_registers
  • 47d4438: 8266426: ZHeapIteratorOopClosure does not handle native access properly
  • 2438498: 8252758: Lanai: Optimize index calculation while copying glyphs
  • ... and 4 more: https://git.openjdk.java.net/jdk/compare/a86ee9b3f370b59caea2ae78169d13498560cd8e...master

Your commit was automatically rebased without conflicts.

Pushed as commit 36e5ad6.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@shipilev shipilev deleted the JDK-8263236-trace-page-sizes branch May 11, 2021
Labels
hotspot-runtime integrated
5 participants