Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDK-8275320: NMT should perform buffer overrun checks #5952

Closed
wants to merge 6 commits into from

Conversation

tstuefe
Copy link
Member

@tstuefe tstuefe commented Oct 14, 2021

This is part of a number of RFE I plan to improve and simplify C-heap overflow checking in hotspot.

This proposal adds NMT buffer overflow checking:

  • it gives us C-heap overflow checking in release builds
  • the costs are neglectable: if NMT is off, we won't pay anything; if NMT is on, the added work is minuscule since we have to do malloc header management anyway.
  • NMT needs intact headers anyway. Faced with buffer overwrites today, it would maybe crash or maybe account wrongly, but it's a bit of a lottery really. Better to go the extra step and do a real check.
  • it could be a preparation for future code removal, if we wanted to do that (see details in umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301). That way, net complexity would come down even with this patch.

For more details, please see the JBS issue.


Patch notes:

  • The malloc header is changed such that it contains a 16-bit canary directly preceding the user payload of the allocation. The new malloc header does not use bitfields anymore but normal types. For more details, see the comment in mallocTracker.hpp.

    • On 64-bit, we don't enlarge the malloc header. It remains 16 bytes in length. So no additional memory cost (apart from the 1-byte-footer, see below). Space for the canary is instead obtained by reducing the size of the bucket index bit field to 16 bits. That bit field is used to store the bucket slot index of the malloc site table in NMT detail mode. With 40 bits it was over-dimensioned, and even 16-bits arguably still are: malloc site table width is 512.
    • On 32-bit, I had to enlarge the header from 8 bytes to 16 bytes to make room for a canary. But strictly speaking 8 bytes were not enough anyway: the header size has to be large enough to satisfy malloc(3) alignment, and that would be 16 bytes. I believe it never led to an error since we don't store 128bit data in malloc'd memory in the hotspot anywhere.
  • I added a footer canary trailing the user allocation to catch tail buffer overruns. To keep matters simple (alignment) I made it a single byte only. That is enough to catch most overrun scenarios.

  • I brushed up error reporting. When NMT detects corruption, it will now print out a hex dump of the corrupted area to tty before asserting.

  • I added a bunch of gtests to test various heap overwrite scenarios. I also had to extend the gtest macros a bit because I wanted these tests of course to run in release builds too, but we did not have a death test macro for release builds yet (there are possibilities for code simplification here too, but that's for another RFE).

  • I renamed nmt_header_size to nmt_overhead since that size includes header and footer now.

  • I made the assert for malloc site table width a compile time STATIC_ASSERT.


Example output a buffer overrun would provide:

Block at 0x00005600f86136b0: footer canary broken at 0x00005600f86136c1 (buffer overflow?)
NMT Block at 0x00005600f86136b0, corruption at: 0x00005600f86136c1: 
0x00005600f86136a8:   21 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
0x00005600f86136b8:   00 00 00 00 0f 00 1f fa 00 61 00 00 00 00 00 00
0x00005600f86136c8:   41 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005600f86136d8:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x00005600f86136e8:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005600f86136f8:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005600f8613708:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005600f8613718:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005600f8613728:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005600f8613738:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
assert failed: fatal error: Block at 0x00005600f86136b0: footer canary broken at 0x00005600f86136c1 (buffer overflow?)#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (mallocTracker.cpp:203), pid=10805, tid=10805
#  fatal error: Block at 0x00005600f86136b0: footer canary broken at 0x00005600f86136c1 (buffer overflow?)
#

Tests:

  • manual tests with Linux x64, x86, minimal build
  • GHAs all clean
  • SAP nightlies ran for 4 weeks now without problems

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issues

  • JDK-8275320: NMT should perform buffer overrun checks
  • JDK-8275320: NMT should perform buffer overrun checks
  • JDK-8275301: Unify C-heap buffer overrun checks into NMT

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/5952/head:pull/5952
$ git checkout pull/5952

Update a local copy of the PR:
$ git checkout pull/5952
$ git pull https://git.openjdk.java.net/jdk pull/5952/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 5952

View PR using the GUI difftool:
$ git pr show -t 5952

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/5952.diff

@bridgekeeper
Copy link

@bridgekeeper bridgekeeper bot commented Oct 14, 2021

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

@openjdk openjdk bot commented Oct 14, 2021

@tstuefe The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot label Oct 14, 2021
@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 15, 2021

/issue JDK-8275320

@openjdk
Copy link

@openjdk openjdk bot commented Oct 15, 2021

@tstuefe
Adding additional issue to issue list: 8275320: NMT should perform buffer overrun checks.

@tstuefe tstuefe changed the title JDK-8275301: Unify C-heap buffer overrun checks into NMT JDK-8275320: NMT should perform buffer overrun checks Oct 15, 2021
@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 15, 2021

/issue JDK-8275301

@openjdk
Copy link

@openjdk openjdk bot commented Oct 15, 2021

@tstuefe
Adding additional issue to issue list: 8275301: Unify C-heap buffer overrun checks into NMT.

@tstuefe tstuefe marked this pull request as ready for review Oct 15, 2021
@openjdk openjdk bot added the rfr label Oct 15, 2021
@mlbridge
Copy link

@mlbridge mlbridge bot commented Oct 15, 2021

Webrevs

@zhengyu123
Copy link
Contributor

@zhengyu123 zhengyu123 commented Oct 15, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 15, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

GuardedMemory has a number of disadvantages, and I'd like to remove it in favor of NMT doing buffer overrun checks. For my full reasoning, please see my reasoning in the umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301:


Disadvantages of the current solution:

  • We have no way to do C-heap checking in release builds. But there, it is sorely missed. We ship release VMs, and those VMs get used in myriad ways with a lot of faulty third-party native code. I would love to be able to flip a product switch at a customer site and have some basic C-heap checks done, without relegating to external tools or debug c-libs.

  • The debug-only guards in os::malloc() are quite fat, really, a whopping 48 bytes per allocation on 64-bit, 40 bytes on 32-bit. That is for guarding alone. They distort the memory allocation picture, since blowing up every allocation this way causes the underlying libc to do different things. Therefore we have different memory layouts and allocation patterns between debug and release. In addition, we have different code paths too, e.g. in debug os::realloc calls os::malloc + os::free whereas in release builds it calls directly into libc ::realloc. All that means that in debug builds we test something different than what we ship in release builds.

  • The canary in the headers of the debug-only guards do not directly precede the user portion of the data, so we won't catch negative buffer overflows of only a few bytes.

  • The guarding added by CheckJNICalls is unnecessarily expensive too, since it copies the memory around, handing a copy of the guarded memory up to the caller.

  • The fact that three different code sections all do malloc headers incurs unnecessary costs, and the code is unnecessarily complex. It makes also statistics difficult to understand since the silent overhead can be large (compare the rise in RSS with the rise in NMT allocations in a debug build).

  • None of the current overflow checkers print out hex dumps of the violated memory. That is what the libc usually does and it is very useful.


Thanks, Thomas

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 15, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

GuardedMemory has a number of disadvantages, and I'd like to remove it in favor of NMT doing buffer overrun checks. For my full reasoning, please see my reasoning in the umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301:

Disadvantages of the current solution:

  • We have no way to do C-heap checking in release builds. But there, it is sorely missed. We ship release VMs, and those VMs get used in myriad ways with a lot of faulty third-party native code. I would love to be able to flip a product switch at a customer site and have some basic C-heap checks done, without relegating to external tools or debug c-libs.
  • The debug-only guards in os::malloc() are quite fat, really, a whopping 48 bytes per allocation on 64-bit, 40 bytes on 32-bit. That is for guarding alone. They distort the memory allocation picture, since blowing up every allocation this way causes the underlying libc to do different things. Therefore we have different memory layouts and allocation patterns between debug and release. In addition, we have different code paths too, e.g. in debug os::realloc calls os::malloc + os::free whereas in release builds it calls directly into libc ::realloc. All that means that in debug builds we test something different than what we ship in release builds.
  • The canary in the headers of the debug-only guards do not directly precede the user portion of the data, so we won't catch negative buffer overflows of only a few bytes.
  • The guarding added by CheckJNICalls is unnecessarily expensive too, since it copies the memory around, handing a copy of the guarded memory up to the caller.
  • The fact that three different code sections all do malloc headers incurs unnecessary costs, and the code is unnecessarily complex. It makes also statistics difficult to understand since the silent overhead can be large (compare the rise in RSS with the rise in NMT allocations in a debug build).
  • None of the current overflow checkers print out hex dumps of the violated memory. That is what the libc usually does and it is very useful.

Thanks, Thomas

p.s. I contemplated to do NMT overflow checks and removal of old guarding code in one RFE but was concerned that it would be too confusing and get stuck in review limbo. Maybe that was wrong. But this RFE here makes more sense when viewed as part of a whole.

@zhengyu123
Copy link
Contributor

@zhengyu123 zhengyu123 commented Oct 16, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

GuardedMemory has a number of disadvantages, and I'd like to remove it in favor of NMT doing buffer overrun checks. For my full reasoning, please see my reasoning in the umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301:
Disadvantages of the current solution:

  • We have no way to do C-heap checking in release builds. But there, it is sorely missed. We ship release VMs, and those VMs get used in myriad ways with a lot of faulty third-party native code. I would love to be able to flip a product switch at a customer site and have some basic C-heap checks done, without relegating to external tools or debug c-libs.
  • The debug-only guards in os::malloc() are quite fat, really, a whopping 48 bytes per allocation on 64-bit, 40 bytes on 32-bit. That is for guarding alone. They distort the memory allocation picture, since blowing up every allocation this way causes the underlying libc to do different things. Therefore we have different memory layouts and allocation patterns between debug and release. In addition, we have different code paths too, e.g. in debug os::realloc calls os::malloc + os::free whereas in release builds it calls directly into libc ::realloc. All that means that in debug builds we test something different than what we ship in release builds.
  • The canary in the headers of the debug-only guards do not directly precede the user portion of the data, so we won't catch negative buffer overflows of only a few bytes.
  • The guarding added by CheckJNICalls is unnecessarily expensive too, since it copies the memory around, handing a copy of the guarded memory up to the caller.
  • The fact that three different code sections all do malloc headers incurs unnecessary costs, and the code is unnecessarily complex. It makes also statistics difficult to understand since the silent overhead can be large (compare the rise in RSS with the rise in NMT allocations in a debug build).
  • None of the current overflow checkers print out hex dumps of the violated memory. That is what the libc usually does and it is very useful.

Thanks, Thomas

p.s. I contemplated to do NMT overflow checks and removal of old guarding code in one RFE but was concerned that it would be too confusing and get stuck in review limbo. Maybe that was wrong. But this RFE here makes more sense when viewed as part of a whole.

Thanks for explanation. So, buffer overrun detection is now only available when NMT is on, vs. always on with GuardedMemory in debug build. Right?

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 17, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

GuardedMemory has a number of disadvantages, and I'd like to remove it in favor of NMT doing buffer overrun checks. For my full reasoning, please see my reasoning in the umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301:
Disadvantages of the current solution:

  • We have no way to do C-heap checking in release builds. But there, it is sorely missed. We ship release VMs, and those VMs get used in myriad ways with a lot of faulty third-party native code. I would love to be able to flip a product switch at a customer site and have some basic C-heap checks done, without relegating to external tools or debug c-libs.
  • The debug-only guards in os::malloc() are quite fat, really, a whopping 48 bytes per allocation on 64-bit, 40 bytes on 32-bit. That is for guarding alone. They distort the memory allocation picture, since blowing up every allocation this way causes the underlying libc to do different things. Therefore we have different memory layouts and allocation patterns between debug and release. In addition, we have different code paths too, e.g. in debug os::realloc calls os::malloc + os::free whereas in release builds it calls directly into libc ::realloc. All that means that in debug builds we test something different than what we ship in release builds.
  • The canary in the headers of the debug-only guards do not directly precede the user portion of the data, so we won't catch negative buffer overflows of only a few bytes.
  • The guarding added by CheckJNICalls is unnecessarily expensive too, since it copies the memory around, handing a copy of the guarded memory up to the caller.
  • The fact that three different code sections all do malloc headers incurs unnecessary costs, and the code is unnecessarily complex. It makes also statistics difficult to understand since the silent overhead can be large (compare the rise in RSS with the rise in NMT allocations in a debug build).
  • None of the current overflow checkers print out hex dumps of the violated memory. That is what the libc usually does and it is very useful.

Thanks, Thomas

p.s. I contemplated to do NMT overflow checks and removal of old guarding code in one RFE but was concerned that it would be too confusing and get stuck in review limbo. Maybe that was wrong. But this RFE here makes more sense when viewed as part of a whole.

Thanks for explanation. So, buffer overrun detection is now only available when NMT is on, vs. always on with GuardedMemory in debug build. Right?

Well, not with this patch obviously. But yes, that would be my proposal. To get "always-on", we could switch NMT on by default in debug builds. "summary" level is not really expensive at all, it uses less memory than GuardedMemory does, and the per-flag accounting does not really add much overhead (GuardedMemory also does some accounting btw).

Though tbh my first priority is to give us overflow checks in release builds. If we only do that and leave GuardedMemory in place I would be happy already. I had two customer cases very recently with heap overwriters, one of which I misused NMT to trigger a crash and analyze the core. A neighboring (non-VM-allocated) block was overwriting the following (VM allocated) heap block.

Cheers, Thomas

@zhengyu123
Copy link
Contributor

@zhengyu123 zhengyu123 commented Oct 17, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

GuardedMemory has a number of disadvantages, and I'd like to remove it in favor of NMT doing buffer overrun checks. For my full reasoning, please see my reasoning in the umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301:
Disadvantages of the current solution:

  • We have no way to do C-heap checking in release builds. But there, it is sorely missed. We ship release VMs, and those VMs get used in myriad ways with a lot of faulty third-party native code. I would love to be able to flip a product switch at a customer site and have some basic C-heap checks done, without relegating to external tools or debug c-libs.
  • The debug-only guards in os::malloc() are quite fat, really, a whopping 48 bytes per allocation on 64-bit, 40 bytes on 32-bit. That is for guarding alone. They distort the memory allocation picture, since blowing up every allocation this way causes the underlying libc to do different things. Therefore we have different memory layouts and allocation patterns between debug and release. In addition, we have different code paths too, e.g. in debug os::realloc calls os::malloc + os::free whereas in release builds it calls directly into libc ::realloc. All that means that in debug builds we test something different than what we ship in release builds.
  • The canary in the headers of the debug-only guards do not directly precede the user portion of the data, so we won't catch negative buffer overflows of only a few bytes.
  • The guarding added by CheckJNICalls is unnecessarily expensive too, since it copies the memory around, handing a copy of the guarded memory up to the caller.
  • The fact that three different code sections all do malloc headers incurs unnecessary costs, and the code is unnecessarily complex. It makes also statistics difficult to understand since the silent overhead can be large (compare the rise in RSS with the rise in NMT allocations in a debug build).
  • None of the current overflow checkers print out hex dumps of the violated memory. That is what the libc usually does and it is very useful.

Thanks, Thomas

p.s. I contemplated to do NMT overflow checks and removal of old guarding code in one RFE but was concerned that it would be too confusing and get stuck in review limbo. Maybe that was wrong. But this RFE here makes more sense when viewed as part of a whole.

Thanks for explanation. So, buffer overrun detection is now only available when NMT is on, vs. always on with GuardedMemory in debug build. Right?

Well, not with this patch obviously. But yes, that would be my proposal. To get "always-on", we could switch NMT on by default in debug builds. "summary" level is not really expensive at all, it uses less memory than GuardedMemory does, and the per-flag accounting does not really add much overhead (GuardedMemory also does some accounting btw).

Though tbh my first priority is to give us overflow checks in release builds. If we only do that and leave GuardedMemory in place I would be happy already. I had two customer cases very recently with heap overwriters, one of which I misused NMT to trigger a crash and analyze the core. A neighboring (non-VM-allocated) block was overwriting the following (VM allocated) heap block.

Cheers, Thomas

I have no problem on technical side. Changing NMT default value, I believe, needs CSR. Probably should start with a CSR to get a consensus.

Thanks.

-Zhengyu

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 18, 2021

Sorry, we already have GuardedMemory for detecting buffer overrun, why introduce a new one?

GuardedMemory has a number of disadvantages, and I'd like to remove it in favor of NMT doing buffer overrun checks. For my full reasoning, please see my reasoning in the umbrella RFE https://bugs.openjdk.java.net/browse/JDK-8275301:
Disadvantages of the current solution:

  • We have no way to do C-heap checking in release builds. But there, it is sorely missed. We ship release VMs, and those VMs get used in myriad ways with a lot of faulty third-party native code. I would love to be able to flip a product switch at a customer site and have some basic C-heap checks done, without relegating to external tools or debug c-libs.
  • The debug-only guards in os::malloc() are quite fat, really, a whopping 48 bytes per allocation on 64-bit, 40 bytes on 32-bit. That is for guarding alone. They distort the memory allocation picture, since blowing up every allocation this way causes the underlying libc to do different things. Therefore we have different memory layouts and allocation patterns between debug and release. In addition, we have different code paths too, e.g. in debug os::realloc calls os::malloc + os::free whereas in release builds it calls directly into libc ::realloc. All that means that in debug builds we test something different than what we ship in release builds.
  • The canary in the headers of the debug-only guards do not directly precede the user portion of the data, so we won't catch negative buffer overflows of only a few bytes.
  • The guarding added by CheckJNICalls is unnecessarily expensive too, since it copies the memory around, handing a copy of the guarded memory up to the caller.
  • The fact that three different code sections all do malloc headers incurs unnecessary costs, and the code is unnecessarily complex. It makes also statistics difficult to understand since the silent overhead can be large (compare the rise in RSS with the rise in NMT allocations in a debug build).
  • None of the current overflow checkers print out hex dumps of the violated memory. That is what the libc usually does and it is very useful.

Thanks, Thomas

p.s. I contemplated to do NMT overflow checks and removal of old guarding code in one RFE but was concerned that it would be too confusing and get stuck in review limbo. Maybe that was wrong. But this RFE here makes more sense when viewed as part of a whole.

Thanks for explanation. So, buffer overrun detection is now only available when NMT is on, vs. always on with GuardedMemory in debug build. Right?

Well, not with this patch obviously. But yes, that would be my proposal. To get "always-on", we could switch NMT on by default in debug builds. "summary" level is not really expensive at all, it uses less memory than GuardedMemory does, and the per-flag accounting does not really add much overhead (GuardedMemory also does some accounting btw).
Though tbh my first priority is to give us overflow checks in release builds. If we only do that and leave GuardedMemory in place I would be happy already. I had two customer cases very recently with heap overwriters, one of which I misused NMT to trigger a crash and analyze the core. A neighboring (non-VM-allocated) block was overwriting the following (VM allocated) heap block.
Cheers, Thomas

I have no problem on technical side. Changing NMT default value, I believe, needs CSR. Probably should start with a CSR to get a consensus.

Thanks.

-Zhengyu

Hi Zhengyu,

I would like to keep the discussion about removing GuardedMemory and potentially enabling NMT per default on debug builds from this issue here. And just concentrate on this first step, lined out in this PR, which is getting NMT to do heap bounds checking. I believe that it has merits on its own, even with no subsequent changes.

And then, I would do some analysis (e.g. verify my assumption that NMT=summary is cheaper than what we do today with GuardedMemory) and use this as a base for a follow-up RFE to remove GuardedMemory. I'm not sure though we need a CSR. Since it would only affect debug code, it will never be shipped to the customer, so there can be no compatibility issues. But there would be a need for discussion of course.

Cheers, Thomas

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Oct 29, 2021

Patch has been active in our systems for 14 days without problems.

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 6, 2021

No takers?

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 13, 2021

Friendly Ping.

Copy link
Member

@simonis simonis left a comment

Hi Thoms,

your change looks good. I only have a few remarks and comments inline.

Best regards,
Volker

src/hotspot/share/services/mallocTracker.cpp Outdated Show resolved Hide resolved
address from2 = align_down(bad_address, sizeof(void*)) - 8;
from2 = MAX2(to, from2);
address to2 = from2 + 96;
if (to2 > to) {
Copy link
Member

@simonis simonis Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't understand this. If from2 = MAX2(to, from2) then from2 >= to. So shouldn't to2 (which is from2 + 96) always be bigger then to?

Copy link
Member Author

@tstuefe tstuefe Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right, and the code is not very clear either, I'll improve it.

src/hotspot/share/services/mallocTracker.cpp Show resolved Hide resolved
// From here on we assume the block pointer to be valid. We could
// use SafeFetch but since this is a hot path we don't. If we are
// wrong, we will crash when accessing the canary, which hopefully
// generates distinct crash report.
Copy link
Member

@simonis simonis Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for two spaces after //

static const uint8_t _footer_canary_life_mark = 0xFA;
static const uint8_t _footer_canary_dead_mark = 0xFB;
NOT_LP64(static const uint32_t _header_alt_canary_life_mark = 0xFAFA1F1F;)
NOT_LP64(static const uint32_t _header_alt_canary_dead_mark = 0xFBFB1F1F;)
Copy link
Member

@simonis simonis Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of interest, how did you choose these canary marks? Is there some evidence that they appear less frequently in real code/data than other values?

Copy link
Member Author

@tstuefe tstuefe Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did an extensive statistical analysis of many core dumps.

...

...

Just kidding, I chose them on a whim to be not zero :) Do you have a better suggestion? I thought about making them ASCII pattern, but those are actually more common in payload data.

Copy link
Member

@simonis simonis Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking of the usual suspects like 0xcafebabe, 0xbaadbabe or 0xdeadbeef because that would simplify the detection of these markers in core dumps, hs_err files or during debugging. But I'm fine with whatever you choose :)


// this should generate two hex dumps, one with the front header, one with the overwritten
// portion.
static void test_overwrite_back_long() {
Copy link
Member

@simonis simonis Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test isn't really checking that we get two hex dumps, right?

Copy link
Member Author

@tstuefe tstuefe Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, bad wording in the comment. This tests that MallocHeader::print_block_on_error() prints a hex dump covering both header and the corruption address, and if both are too far apart, that the dump is split up in two parts.

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 18, 2021

Thanks a lot @simonis for the review!

I massaged the patch a bit, improving comments and rewriting the block print function. I think its now easier to understand. I also extended the dump size somewhat, now it looks like this:

corruption and header close together:

NMT Block at 0x00005571a7030330, corruption at: 0x00005571a7030341: 
0x00005571a70302b0:   30 1c 05 a7 71 55 00 00 60 e5 f3 31 fa 7f 00 00
0x00005571a70302c0:   47 e6 f3 31 fa 7f 00 00 11 e6 f3 31 fa 7f 00 00
0x00005571a70302d0:   84 e6 f3 31 fa 7f 00 00 f1 f1 f1 f1 f1 f1 f1 f1
0x00005571a70302e0:   00 00 00 00 f1 f1 f1 f1 fa ab ab ab ab ab ab ab
0x00005571a70302f0:   ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x00005571a7030300:   ab ba ba ba ba ba ba ba 51 00 00 00 00 00 00 00
0x00005571a7030310:   ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x00005571a7030320:   12 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005571a7030330:   01 00 00 00 00 00 00 00 f1 f1 f1 f1 0f 00 1f fa
0x00005571a7030340:   f1 61 ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x00005571a7030350:   ab ab 00 00 00 00 00 00 61 00 00 00 00 00 00 00
0x00005571a7030360:   ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x00005571a7030370:   21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x00005571a7030380:   10 00 00 00 00 00 00 00 f1 f1 f1 f1 0b 00 1f fa
0x00005571a7030390:   00 00 00 00 00 00 00 00 d8 a5 01 c8 f9 7f 00 00
0x00005571a70303a0:   fa ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x00005571a70303b0:   ab ba ba ba ba ba ba ba 61 00 00 00 00 00 00 00 

Corruption and header apart:

NMT Block at 0x0000564da16bee10, corruption at: 0x0000564da16c0e20: 
0x0000564da16bed90:   00 00 00 00 00 00 00 00 00 ba ba ba ba ba ba ba
0x0000564da16beda0:   ba ba ba ba ba ba ba ba 01 01 ba ba 00 00 00 00
0x0000564da16bedb0:   01 00 00 00 00 ba ba ba 64 00 00 00 ba ba ba ba
0x0000564da16bedc0:   d0 ed 6b a1 4d 56 00 00 00 00 00 00 00 00 00 00
0x0000564da16bedd0:   00 ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba
0x0000564da16bede0:   00 ba ba ba ba ba ba ba 51 20 00 00 00 00 00 00
0x0000564da16bedf0:   ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x0000564da16bee00:   11 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0000564da16bee10:   00 20 00 00 00 00 00 00 f1 f1 f1 f1 0f 00 1f fa
0x0000564da16bee20:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16bee30:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16bee40:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16bee50:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16bee60:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16bee70:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16bee80:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 
...
0x0000564da16c0da0:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0db0:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0dc0:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0dd0:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0de0:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0df0:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0e00:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0e10:   f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1
0x0000564da16c0e20:   61 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
0x0000564da16c0e30:   ab ba ba ba ba ba ba ba d1 61 01 00 00 00 00 00
0x0000564da16c0e40:   ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba
0x0000564da16c0e50:   ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba
0x0000564da16c0e60:   ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba
0x0000564da16c0e70:   ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba
0x0000564da16c0e80:   ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba
0x0000564da16c0e90:   ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba ba 

Thanks!

Thomas

src/hotspot/share/runtime/os.cpp Show resolved Hide resolved
Copy link
Member

@simonis simonis left a comment

Looks good to me know except for Zhengyu question.

assert(bad_address >= (address)this, "sanity");

// This function prints block information, including hex dump, in case of a detected
// corruption. The hex dump should show the both block header and the corruption site
Copy link
Member

@simonis simonis Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

..show both, the block header..

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 18, 2021

I think you also need to subtract malloc_footer_size when calculating memblock_size below. Otherwise, memcpy can overwrite the footer.

Oh man, good catch... this is too complicated. See, that's why I want to remove the GuardedMemory layer. Having that gone will be such a relief.

So this is for a resize to a smaller size. We have this:

[guard header] [nmt header] [ ... payload ... ] [nmt footer] [guard footer]

and both nmt header and footer are now, from the POV of GuardedMemory, part of its payload. The os::malloc above already allocates a new block, and we need to copy the user payload while leaving the NMT footer intact.

I'll first write a repro case - this should have been catched by tests - then I think about a solution.

I wonder should just consolidate malloc_header_size and malloc_footer_size to one malloc_overhead? I don't see them used separately.

I take a look. I also found that we have two version of malloc_header_size() - one takes the NMT level, one takes a pointer. That makes me nervous resolution wise, though it's very probably fine. Maybe we can reduce the complexity a bit. Though I prefer to keep this patch as small as possihble.

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 19, 2021

Added a thorough regression test for the realloc issue but I was unable to provoce the theoretical error @zhengyu123 pointed out in practice. When analyzing I found that out of accident the current coding already works:

  • a realloc to a smaller size will memcpy() the original payload with the new size, since we use MIN2(size, memblock_size) and use the new, smaller, payload size. That will leave the NMT footer intact which had been added by the os::malloc above.
  • a realloc to a larger size will memcpy() with memblock_size, and Zhengyu is right, that is too large. The effect of that is that we copy the original footer too. But that is fine. Since the footer is only one byte, we will, again, leave the new NMT footer added by os::malloc() intact.
    Still, Zhengyu was right, this is a problem. I will experiment with a larger footer since I believe that should fail as predicted (I just want to see my new regression tests actually fire :)

@tstuefe tstuefe force-pushed the nmt-overflow-checking branch from 17a5bc7 to 3d2a5d0 Compare Nov 19, 2021
@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 19, 2021

Hi @simonis, @zhengyu123,

I somehow messed up and force-pushed. This is what happened since the last review:

  • 2247b5e : increases the footer canary to two bytes.
  • 188f0ea : extended the gtests, mainly to test realloc more thoroughly.
    • Added one death test to show that realloc also does heap corruption checks on the old block.
    • Another test - not a death test but a regular test - just to test that realloc works. This was in reaction to the bug Zhengyu found, but I never got it to fire. After analysing I believe the bug was benign. Still, good to have this test.
  • ea6fe31 : This one actually fixes the bug Zhengyu found. I kept the fix simple stupid and refrained from cleaning up too much. I just removed two methods which were not needed anymore.
  • 3d2a5d0 : Last one fixes the last typos Volker found.

Thanks again, guys, for your reviews. I plan to give this another round in our test systems before pushing.

Cheers, Thomas

Copy link
Member

@simonis simonis left a comment

Looks good now.

static inline size_t get_header_size(void* memblock) {
return (memblock == NULL) ? 0 : sizeof(MallocHeader);
}

static inline void record_new_arena(MEMFLAGS flags) {
Copy link
Member

@simonis simonis Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also wondered why we need these versions so it's good that you could remove them!

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 19, 2021

Looks good now.

Many thanks, Volker! Nice to have this issue finally going somewhere. I feared it was stuck in PR limbo till after 18 ships.

@openjdk
Copy link

@openjdk openjdk bot commented Nov 19, 2021

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8275320: NMT should perform buffer overrun checks
8275320: NMT should perform buffer overrun checks
8275301: Unify C-heap buffer overrun checks into NMT

Reviewed-by: simonis, zgu

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 25 new commits pushed to the master branch:

  • d427c79: 8277428: G1: Move and inline G1STWIsAliveClosure::do_object_b
  • 32839ba: 8266593: vmTestbase/nsk/jvmti/PopFrame/popframe011 fails with "assert(java_thread == _state->get_thread()) failed: Must be"
  • 8051041: 8277534: Remove unused ReferenceProcessor::has_discovered_references
  • 3f847fe: 8277385: Zero: Enable CompactStrings support
  • ca31ed5: 8275448: [REDO] AArch64: Implement string_compare intrinsic in SVE
  • 4ff4301: 8224922: Access JavaFileObject from Element(s)
  • 0a9e76c: 8277485: Zero: Fix fast{i,f}access_0 bytecodes handling
  • 1c215f3: 8272773: Configurable card table card size
  • 1d7cef3: 8276662: Scalability bottleneck in SymbolTable::lookup_common()
  • c79a485: 8277494: [BACKOUT] JDK-8276150 Quarantined jpackage apps are labeled as "damaged"
  • ... and 15 more: https://git.openjdk.java.net/jdk/compare/11d819d68a3ce2ae0973b496141c52aa419f90f0...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready label Nov 19, 2021
@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 23, 2021

Thank you @zhengyu123 and @simonis! I'll do one round of stress tests more, then push.

@tstuefe
Copy link
Member Author

@tstuefe tstuefe commented Nov 24, 2021

Nightlies are clean.

/integrate

@openjdk
Copy link

@openjdk openjdk bot commented Nov 24, 2021

Going to push as commit cf7adae.
Since your change was applied there have been 62 commits pushed to the master branch:

  • 96e3607: 8275063: Implementation of Foreign Function & Memory API (Second incubator)
  • 17e68ca: 8277042: add test for 8276036 to compiler/codecache
  • 8a8bc29: 8277562: Remove dead method c1 If::swap_sux
  • d085c2b: 8273328: Compiler implementation for Pattern Matching for switch (Second Preview)
  • 6d73460: 8277399: ZGC: Move worker thread logging out of gc+phase=debug
  • 712b875: 8277397: ZGC: Add JFR event for temporary latency measurements
  • 7b2d823: 8277503: compiler/onSpinWait/TestOnSpinWaitAArch64DefaultFlags.java failed with "OnSpinWaitInst with the expected value 'isb' not found."
  • 7cb56a2: 8265796: vmTestbase/nsk/jdi/ObjectReference/referringObjects/referringObjects002/referringObjects002.java fails when running with JEP 416
  • 24e586a: 8276764: Enable deterministic file content ordering for Jar and Jmod
  • ea85e01: 8271623: Omit enclosing instance fields from inner classes that don't use it
  • ... and 52 more: https://git.openjdk.java.net/jdk/compare/11d819d68a3ce2ae0973b496141c52aa419f90f0...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot closed this Nov 24, 2021
@openjdk openjdk bot added integrated and removed ready rfr labels Nov 24, 2021
@openjdk
Copy link

@openjdk openjdk bot commented Nov 24, 2021

@tstuefe Pushed as commit cf7adae.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot integrated
3 participants