
Smaller class pointers #13

Closed
wants to merge 3 commits

Conversation

@tstuefe (Member) commented Sep 17, 2021

May I please have opinions on this proposal to reduce class pointer size?

This patch prepares the VM to work with narrow Klass pointers smaller than 32 bits and - for now - reduces them to 22 bits.

The patch implements the proposal I made in September on lilliput-dev [1]. In that mail thread, John Rose outlined the state of the discussions at that time [2]. Discussions focus on reducing the number of Klass structures falling into the encoding range and/or reducing the size of Klass structures to make more of them fit into the encoding range (and as a bonus, I assume, making them fixed-sized). The former approach would see us splitting classes into two sets and letting only one of them use narrow Klass pointers. The latter would mean breaking Klass up into several structures and moving lesser-used ones out of the encoding range.

My approach is less invasive and requires fewer fundamental changes to Klass handling. We can use it as a stepping stone to get the ball rolling and later refine our approach. Or, maybe my patch already suffices. Or perhaps you don't like it. Let's see :)

Narrow Klass encoding changes

Like compressed oops, Klass pointer encoding uses an add+shift scheme.
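For readers new to the scheme, here is a minimal sketch of add+shift encoding and decoding. The names are simplified placeholders, not the actual fields in CompressedKlassPointers:

#include <stdint.h>

static uintptr_t encoding_base;  // start of the encoding range
static int       encoding_shift; // historically 0 or 3 (see below)

static uintptr_t decode_klass(uint32_t narrow_klass) {
  // The narrow value is an offset into the encoding range, scaled by 2^shift.
  return encoding_base + ((uintptr_t)narrow_klass << encoding_shift);
}

static uint32_t encode_klass(uintptr_t klass_addr) {
  // Klass addresses must be aligned to 2^shift bytes for this to round-trip.
  return (uint32_t)((klass_addr - encoding_base) >> encoding_shift);
}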

The shift value had always been (since at least 2013) zero or three: Zero for CDS=off, three for CDS=on. Note that CDS has been active by default since JDK 15, and it usually makes no sense to deactivate it since it improves startup significantly. Therefore, in practice, the shift value had always been three. Until recently, that is - in May, we switched to zero to solve an unrelated aarch64 problem [3].

The current scheme uses a non-zero shift to, in theory, enlarge the encoding range to 32GB - see KlassEncodingMetaspaceMax. But that does not do much good since we also deliberately cap the size of the relevant memory areas to 4GB (3GB for class space, the remainder for CDS). So in practice, the enlarged encoding range does not give us more breathing space for class data. Nor would it be needed since 4GB is a lot.

Enlarging the encoding range has an indirect positive effect since it increases the chance of getting zero-based encoding. However, CDS deliberately maps its archive and the class space outside the lower 32G address range. There are reasons for that, but they don't matter for this PR. So, again, in practice, we usually did not benefit from that advantage.

In conclusion: for years, we chiefly used a non-zero shift value. But I believe that shift had never been particularly useful.

New scheme

This patch proposes to revive the shift and give it a new meaning. It also increases its value to - for now - 9 bits. Instead of using it to enlarge the encoding range, we snip off those bits to make the narrow Klass pointer smaller. We save additional bits by reducing the encoding range, for now, to 2 GB. Both changes save us 10 bits, and we thus arrive at 22 bit narrow Klass pointers. These values are not set in stone; they are tweakable, with an increasing number of caveats.
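To spell out the bit arithmetic (a sketch using the constants stated above, not VM code):

// 2 GB encoding range = 2^31 bytes; a 9-bit shift means 512-byte granularity,
// so the narrow Klass pointer needs 31 - 9 = 22 bits.
const int range_log2 = 31;
const int shift      = 9;
static_assert(range_log2 - shift == 22, "22-bit narrow Klass pointers");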

Klass structure sizes

Increasing the encoding shift increases alignment requirements for Klass structures and hence alignment-related waste. So it trades metaspace/CDS memory for heap memory. It does so expecting that heap memory reduction by Lilliput far outweighs the modest increase in metaspace/CDS footprint.

To understand these tradeoffs, here are some facts:

  • Klass structures are variable-sized. Their size ranges from roughly 480 bytes to just shy of a megabyte.
  • Their size distribution clusters heavily around the lower end of that range. Large Klass structures do exist and need to be supported, but they are scarce.
  • The overwhelming majority of all Klass structures are smaller than 1K. Of those, a large part is a wee bit smaller than 512 bytes.
  • Of particular relevance are classes generated for lambdas and reflection invocation. These get generated by the VM and can appear in masses. Their size typically hovers just a bit over 512 bytes.

I created some histograms of Klass size distribution for some common scenarios [4]. For example, this is spring petclinic after startup:

[Histogram: Klass size distribution, spring-petclinic after startup]

We can see in all scenarios that Klass size distribution likes to spike around 512 bytes. Distribution then tapers out toward larger sizes, and we see very few Klass structures larger than 1 KB. Of course, this is not an exhaustive investigation, but enough to understand the reasoning that went into this patch.

From these numbers, we can see that either 9 bits (512 B alignment) or 10 bits (1 KB alignment) would be reasonable shift values.

Memory cost of increased alignment

My patch takes measures to limit alignment waste due to the larger Klass alignment:

  1. Before this patch, metaspace used a global allocation alignment, which happened to be 8 bytes. Naively increasing that one to 512 bytes would have been very expensive since most metadata are tiny. So instead, my patch only changes class space to use the larger alignment. Storage of non-class metadata is unaffected.

  2. The patch does the same in CDS: Klass structures in the CDS archives (and only those) now get stored at 512-byte boundaries. CDS is a bit different from class space, though, in that it intersperses Klass structures with non-class metadata. Therefore, the larger alignment gaps preceding Klass structures mostly get filled with non-class metadata. Little fish among the big fish.

We can roughly calculate the footprint tradeoff I made in my patch:

A 512-byte alignment wastes 256 bytes per Klass structure on average in class space. In reality, this number will be skewed, but it is difficult to predict in which direction (see [4]). Assuming Lilliput manages to reduce object header size by just 4 bytes, we'd break even if we allocate 64 objects per class on average. I think that is realistic, especially for larger heaps.
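Spelled out, with the assumed numbers from above (illustrative arithmetic only):

// Average alignment waste per Klass: 512 / 2 = 256 bytes of class space.
// Assumed Lilliput saving: 4 bytes of header per object.
// Break-even instance count per class: 256 / 4 = 64 objects.
const int avg_waste_per_klass = 512 / 2;
const int header_saving       = 4;
const int break_even_objects  = avg_waste_per_klass / header_saving; // 64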

And note that this is just a first step. I am sure we can reduce that alignment gap price in class space. Use it for something productive. I am playing around with some ideas for future improvements.

Using the alignment gap?

One idea - which may sound absurd at first - would be to eliminate the non-class metaspace and store everything in class space. At first glance, that sounds counterintuitive since it increases the memory range we need to encode. We don't want that, right? But coupled with a large shift for Klass structures - maybe even larger than 9 bits, let's say 12 (4 KB) - this could make sense. The large alignment gaps Klass structures cause would naturally fill up with their non-class brethren. And the large shift in turn would expand the encoding range. Such an approach would effectively use the implicit knowledge that Klass and non-class data belonging to one class are usually allocated around the same time and have the same life span.

It is just an idea, maybe it does not pan out. But it shows that the modest memory footprint increase this patch causes is not set in stone.

Class space arena becomes a table

One could say that by using a larger shift, we transform the class space from an arena with tight contiguous allocation into a table with 512-byte slots. In that table, Klass structures use up 1 to n slots. A narrow Klass pointer could be seen as an index into that table.

So we arrive at something like a table-based scheme, but one that smoothly accommodates larger Klass structures covering multiple slots. And it does so while retaining all the metaspace goodies. We get to keep free-list management, defragmentation after unloading, memory reclamation to the OS, and so on, at no added complexity.

How many classes can we accommodate

From the Klass size distribution, we can see that Klass is overwhelmingly smaller than 1K, and a significant part is just short of 512 bytes. So Klass structures, on average, take about one-and-a-half 512-byte table slots. A narrow Klass pointer size of 22 bits gives us 4 million slots, so we could hold roughly 3 million classes.
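The slot math behind those numbers, as a sketch (the 1.5-slot average is the rough figure from the histograms above, not a measured constant):

#include <stdint.h>

const uint64_t slots = (uint64_t)1 << 22;  // 4,194,304 slots of 512 bytes each
const double avg_slots_per_klass = 1.5;    // most Klass structures take 1-2 slots
const double approx_classes = slots / avg_slots_per_klass; // roughly 2.8 million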

Note that this calculation is a bit fuzzy. Larger Klass structures do exist; they are just rare. And as we have seen, CDS stores data more tightly than class space. Note, however, that CDS is never the culprit in scenarios where we load tons of classes since those tend to be dynamically generated.

16-bit Klass pointers, or how many classes we want to accommodate

The maximum number of classes Lilliput supports is an important question, since the maximal reduction of the narrow Klass pointer depends on the answer.

I believe we have to be backward compatible and allow scenarios with tons of classes even after Lilliput. Those seem to be exclusively cases where classes are generated. One prominent example is JRuby. Anecdotally, according to Charles Nutter, JRuby can reach >500k classes. That is a lot. I am certain other examples exist too.

In the current JDK, we have no limit on the number of classes we can load, at least none imposed by class space. That is because a fallback exists with UseCompressedClassPointers. If one crazy app really maxes out the 3 GB class space, it could still disable UseCompressedClassPointers and keep working.

But if we remove UseCompressedClassPointers for good, we remove that fallback. That means whatever limit Lilliput imposes is final and needs to be good enough for outlandish cases like JRuby. And thus may be oversized for standard applications that need to load just a few thousand classes.

Were we to keep UseCompressedClassPointers, we'd have a fallback and therefore could set the class limit for Lilliput much lower. Lilliput would only need to cover 99%, not 100%, of all cases. A much lower maximum class limit could lead to a much smaller narrow Klass pointer. One may even arrive at 16-bit class pointers.

Anyway, for now my patch does not touch this question. 9-bit shift and 2 GB encoding range are tame enough that even workloads like JRuby should not be bothered.

The patch

  1. I changed the metaspace to work with an arena-specific alignment instead of dictating a global allocation alignment. I then switched the class space to use that new large Klass alignment. I also needed to rewrite guard handling for metaspace arenas. But I did that upstream [5] to simplify this patch.
  2. I changed alignment handling in CDS.
  3. Factored out narrow Klass encoding to its own header/implementation (compressedKlass.hpp/cpp)
  4. Fixed narrow Klass encoding/decoding for both x64 and aarch64.
  5. Added new tests to test Klass pointer encoding setup with different values for encoding base address. In addition I had to disable some older tests that made no sense after Lilliput (some of which were questionable even before).

More details:

  • MacroAssembler:

    • aarch64 figures out the decode mode once during initialization, then uses that info; x64 recalculates the decode mode anew on each decode/encode. I liked the aarch64 approach more, since it seems cleaner and allows easier error handling, so I used that scheme for both platforms.
    • I introduced a new MacroAssembler::klass_decode_mode(base address) alongside the existing version that takes no parameter and instead uses the globally set encoding base. It can be used to negotiate a valid encoding scheme during initialization without having to second-guess the assembler decoding (compare aarch64's old and new versions of CompressedKlassPointers::is_valid_base()).
    • Shift is never zero with my patch. So I removed handling of shift=0.
    • For x64, I added an xor+shift mode to be used instead of add+shift where possible (see the sketch after this list).
    • On x64, we cannot use leaq, since the scale factor is limited to at most 8. I removed that path and replaced it with add+shift.
    • Encodings get tested by a new jtreg test, runtime/CompressedOops/CompressedClassPointerEncodingTest.java, which forces the encoding range to several arbitrary bases and then checks that the VM figures out the correct encoding mode. It should also serve as a primitive functional test since some Java code runs and some classes are loaded. It mostly tests x64; I plan to beef up the aarch64 side later.
  • Metaspace: those changes are mostly straightforward, a lot of the work happened upstream. In this patch:

    • I factored out the new alignment handling into its own header, metaspaceAlignment.hpp
    • Metaspace arenas now get their own alignment, replacing the global metaspace allocation alignment. Arenas in class space then use the larger Klass alignment.
    • The more onerous changes were to the metaspace gtests, which needed to be made aware of arena-specific alignment.
  • CDS:

    • The comment in archiveBuilder.hpp, highlighting the various alignments CDS has to deal with, could help here
    • In my first attempt to fix CDS, I was a bit too aggressive and found that CDS is hellishly easy to break - especially if one wants to get it right on both 32-bit and 64-bit platforms. The reason is that CDS has some implicit alignment assumptions which are not immediately clear. The comment in DumpRegion::allocate explains this. In the end, I threw my first version away and implemented a minimal approach.
  • Narrow Klass encoding:

    • Before this patch, that code was intermixed with compressed oops encoding in compressedOops.hpp. I moved it all into a separate set of files (compressedKlass.cpp/hpp) for clarity.
    • Shift is now a constant. There is no zero shift with Lilliput if we follow my proposal.
    • I am still a bit unhappy with how the encoding base address gets negotiated at VM startup; I was unhappy with it before this patch, too. But this only affects either the CDS=off case or the case where we cannot reserve space at the preferred CDS base address. In the big scheme of things, it probably does not matter. Certainly not for this PR.
    • I am also unhappy that we compile a lot of this code unnecessarily on 32-bit. In the future, maybe we could exclude all of it cleanly from compiling on 32-bit. That would also simplify things in metaspace. But not for this PR.
  • I removed COMPRESSED_CLASS_POINTERS_DEPENDS_ON_COMPRESSED_OOPS, because we want to run with or without coops without tying narrow class pointers to that decision. If that causes problems for downstream projects, that is regrettable but not my focus atm.
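Regarding the x64 xor+shift mode mentioned in the MacroAssembler notes above: the general idea is that xor can stand in for add whenever the encoding base shares no bits with any shifted narrow Klass value. A sketch, with an example base chosen by me; the condition and names illustrate the technique and are not the actual assembler code:

#include <stdint.h>

const int      shift = 9;
const uint64_t base  = 0x800000000ULL; // example base: bit 35 only

// Highest possible shifted narrow Klass value: (2^22 - 1) << 9 < 2^31.
const uint64_t max_shifted = (((uint64_t)1 << 22) - 1) << shift;

// xor+shift is only valid if base and shifted values never overlap:
static_assert((base & max_shifted) == 0,
              "base must be disjoint from shifted narrow Klass bits");

static uint64_t decode_xor(uint32_t narrow_klass) {
  // Under the disjointness condition, xor equals add bit-for-bit.
  return base ^ ((uint64_t)narrow_klass << shift);
}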

I realize that this patch is largish, and I apologize for the review work. I tried to keep it small, but it kept growing on me. If we bring this into Lilliput, I will bring parts of it upstream, piecemeal, to ease merging pains.

Test state:

  • All GHAs are green. That means tier1 successfully ran on x64 and x86 and the VM at least builds on the remaining platforms, including the oddballs.
  • I manually did some tests on aarch64 and made sure the basics worked. I ran runtime/CompressedOops, runtime/Metaspace and the various gtest variants. As well as manually testing Klass encoding. However, I work on an underpowered Raspi here and testing is painful.
  • PPC, s390 don't work for now and remain broken.
  • Zero remains broken - it builds but won't run, as before.

I'd appreciate it if others could test too, especially on aarch64. I did what tests I could, but I do not have much hardware available atm.

Thank you,

Thomas

[1] https://mail.openjdk.java.net/pipermail/lilliput-dev/2021-September/000101.html
[2] https://mail.openjdk.java.net/pipermail/lilliput-dev/2021-September/000102.html
[3] https://bugs.openjdk.java.net/browse/JDK-8265705
[4] https://github.com/tstuefe/metaspace-klass-size-analysis/blob/master/README.md
[5] https://bugs.openjdk.java.net/browse/JDK-8273783


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/lilliput pull/13/head:pull/13
$ git checkout pull/13

Update a local copy of the PR:
$ git checkout pull/13
$ git pull https://git.openjdk.java.net/lilliput pull/13/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 13

View PR using the GUI difftool:
$ git pr show -t 13

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/lilliput/pull/13.diff


@tstuefe (Member, Author) commented Sep 17, 2021

/test

@openjdk bot commented Dec 2, 2021

@tstuefe this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout smaller-classpointers
git fetch https://git.openjdk.java.net/lilliput master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@tstuefe marked this pull request as ready for review December 3, 2021 16:36

@rkennke (Collaborator) left a comment

Hi Thomas,
thanks for this great work! I could not find anything obviously wrong in the changeset.
A 22-bit Klass* allows for another 2 bits for the hashcode, all in the upper 24 bits, leaving the remaining 8 bits for locking, GC age, and even some extra bits for future use. Together, this will allow 32-bit headers, once we figure out how to restrict locking to the lower 32 (ideally the lowest 3, or even 0) bits.

Go for it!

@@ -26,6 +26,7 @@
#ifndef SHARE_MEMORY_METASPACE_CHUNKLEVEL_HPP
#define SHARE_MEMORY_METASPACE_CHUNKLEVEL_HPP

#include "utilities/debug.hpp"
@rkennke (Collaborator):
A stray include?

@tstuefe (Member, Author):
It was actually intentional since we use assert and STATIC_ASSERT. But arguably nothing I should fix downstream in lilliput. I'll remove it.

@openjdk bot commented Dec 6, 2021

@tstuefe This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

Smaller class pointers

Reviewed-by: rkennke

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@tstuefe (Member, Author) commented Dec 7, 2021

> Hi Thomas, thanks for this great work!

Thanks, it has been interesting :)

> I could not find anything obviously wrong in the changeset. 22bits Klass* allows for another 2 bits for the hashcode, all in the upper 24 bits, and the remaining 8 bits for locking, GC age, and we even end up with some extra bits for future use. Together, it will allow 32bit headers, once we figure out how to restrict locking to the lower 32 (ideally lowest 3 or even 0) bits.

> Go for it!

Should I integrate, or should I wait for more opinions?

@iklam already had an off-list look at the changes with his CDS eyes.

About the metaspace changes, I feel sufficiently confident myself.

The only thing I am hesitant about is the compiler changes, even if they are straightforward. Especially the aarch64 ones, since I cannot test them as thoroughly as x64.

@rkennke replied: I'll give it some aarch64 testing. If that goes well, you're good to go.

@@ -218,7 +219,7 @@ bool ArchiveBuilder::gather_klass_and_symbol(MetaspaceClosure::Ref* ref, bool re
_klasses->append(klass);
}
// See RunTimeClassInfo::get_for()
_estimated_metaspaceobj_bytes += align_up(BytesPerWord, SharedSpaceObjectAlignment);
_estimated_metaspaceobj_bytes += align_up(BytesPerWord, KlassAlignmentInBytes);
@iklam (Member):
Small nit here: The original code was intended to reserve space for an extra pointer that's placed immediately below each Klass. I think we should do this so that the intention is clear.

// Reserve a slot to store a RunTimeClassInfo pointer immediately below each Klass.
// See RunTimeClassInfo::get_for()
_estimated_metaspaceobj_bytes += BytesPerWord;
// Extra padding to make sure each Klass is aligned.
_estimated_metaspaceobj_bytes += KlassAlignmentInBytes;

@tstuefe (Member, Author):
Okay, I fixed it and modified the comments in both gather_klass_and_symbol and make_shallow_copy to be clearer to non-CDS people. Could you take a look at them?

@iklam (Member):
New version looks good.

@iklam (Member) commented Dec 7, 2021

> @iklam already had an off-list look at the changes with his CDS eyes.

I looked at the CDS code and it looks fine to me. Just a small nit.

BTW, do you have size comparison of the CDS archive before/after?

Thanks

  • Ioi

@tstuefe (Member, Author) commented Dec 7, 2021

> > @iklam already had an off-list look at the changes with his CDS eyes.
>
> I looked at the CDS code and it looks fine to me. Just a small nit.
>
> BTW, do you have size comparison of the CDS archive before/after?
>
> Thanks
>
> • Ioi

Sure, here you go:

-- Archive file --

before:

classes.jsa 13434880
classes_nocoops.jsa 12279808

after:

classes.jsa 13783040 (+348k, +2.5%)
classes_nocoops.jsa 12623872 (+344k, +2.8%)

-- Runtime costs --

before:

-    Shared class space (reserved=12582912, committed=12247040)
                        (mmap: reserved=12582912, committed=12247040)

after:

-    Shared class space (reserved=16777216, committed=12591104)
                        (mmap: reserved=16777216, committed=12591104)

committed runtime costs: +344k

The reserved cost is higher because, I assume, the size estimate used when preparing to map the archive grows: we always add the full alignment cost when estimating sizes before mapping.

@rkennke (Collaborator) commented Dec 7, 2021

> I'll give it some aarch64 testing. If that goes well, you're good to go.

AArch64 looks good (very few failures, but nothing that looks obviously related to your change).

@tstuefe (Member, Author) commented Dec 7, 2021

> > I'll give it some aarch64 testing. If that goes well, you're good to go.
>
> AArch64 looks good (very few failures, but nothing that looks obviously related to your change).

Great, thank you!

@rose00 (Collaborator) commented Dec 8, 2021

This is wonderful work, and I'm very glad to see it. I have some suggestions on your algorithms only, not on the specific changes.

You are hyper-aligning Klass blocks. You are using masking of low bits instead of scaling a low index to high bits. The discussion of these decisions is good. Either way is a good way to keep a large dynamic range but use fewer bits.

There is a performance pitfall you are at risk of here. By hyper-aligning Klass blocks beyond 64 bytes (6 bits of mask or shift), you are increasing the likelihood of cache line collisions and hence data cache pressure. I think of it this way: If you nullify N>6 bits of a Klass pointer (by shifting or masking) you are going to be hitting in only 1/2^(N-6) of the available cache lines.

(IIRC we once saw such data cache pressure effects with TLABs that were, at first, hyper-aligned naively.)

If any part of the Klass block is mutable, you may get cache line ping-ponging, and it is made worse in high pressure conditions. The secondary subtype cache comes to mind.

The fix is pretty easy, although it takes an extra instruction to decode a Klass pointer, and may add complexity to your choice of shift vs. mask. Take shifting by N (which you are not doing here): If you multiply a Klass index by 2^N + 2^M instead of 2^N alone, where M is small (say M=6, the log of the cache line size), then you "smear" the effective base addresses across the full range of data cache lines.

Or, take masking mod 2^N. In that case, divide the bits of the Klass word (in place) by 2^(N-M) (maybe the same M=6 as before) and then mask off the lowest M bits (to avoid garbage alignments). Add or xor that to the fully masked (mod 2^N) Klass word. Such a scheme would also "smear" the effective addresses across your cache slots, avoiding cache pressure.

#include <stdint.h>

const int L = 22; // lg # of distinct Klass addresses
const int N = 9;  // low bits to nullify by shift or mask
const int M = 6;  // smear low bits except the bottom 6
typedef uintptr_t klass_word;
typedef uintptr_t klass_addr;

klass_addr smearing_shift_decode(klass_word kw) {
  // assume kw is already shifted down to LSB
  kw &= ((klass_word)1 << L) - 1;          // clip upper bits of index (if necessary)
  klass_addr ka = kw << N;                 // nullify low N bits
  ka += kw << M;                           // smear middle N-M bits
  return ka;
}

klass_addr smearing_mask_decode(klass_word kw) {
  // assume kw is already shifted to machine address position
  kw &= ((klass_word)1 << (L + N)) - 1;    // clip upper bits of address (if necessary)
  klass_addr ka = kw & ~(((klass_word)1 << N) - 1); // nullify (zero out) low N bits
  ka += ka >> (N - M);                     // smear middle bits; low M bits stay clear
  return ka;
}

With these adjustments in place, I think the shifting algorithm is maybe a little faster and easier to manage (for Klass allocation) than the masking algorithm.

I think in general it's easier and more flexible (at least in our code base, and in expositions) to work in terms of numeric indexes, rather than clever in-place masking schemes. You might prefer (for some other reason) to put the Klass bits in the upper part of a word (leading to a tilt towards masking). That's fine; you can treat that upper part as an index after right-shifting it.

The parameter L=22 above is lg(2^31 / 512). It reflects the upper limit of distinct Klass addresses that can be encoded: At most 4 million.

This leads to my other comment. You mention the tradeoff between class loading capacity (with JRuby as an extreme case) and the encoding width of Klass indexes. You suggest we might (with painful tradeoffs) get down to L=16. Actually, I think L=12 or L=10 are feasible, once we make the tradeoffs less painful.

So here's my position on very narrow Klass indexes. The JVM always knows how many Klasses it has created to date; that's why it is able to manage the 2 GB of metaspace for them. If the JVM were able to grant the most favorable treatment to the first 2^L loaded classes (give or take), it could also fall back to a different representation for the remaining classes (that JRuby is loading busily). Call this different representation an expanded Klass; it would have to take up extra space somewhere in the object layout. As long as the heap is mostly full of objects of those first 2^L Klasses, we get the density benefits of having only L-bit Klass indexes. The other objects have an extra word stuffed into them (not in the header, but elsewhere in the object). That would bloat the heap if the application (a) first loads 2^L classes, and then (b) fills its heap with instances of later classes. To me that worst-case scenario seems unlikely enough to make the gamble worthwhile. Users can ask for a larger L (up to 22) if they hit the worst case.

So what exactly happens with those other classes (after about the first 2^L)? Well, we have options, and here are a few. Use a range of Klass indexes to encode, not the first 2^L, but the "hard cases". For example, the Klass index of zero (all by itself) might signal that the decoder of a Klass pointer must fetch a Klass pointer (uncompressed or partially compressed) from the word immediately before the object header. So object headers bloat back up to today's status quo, except that the extra L bits in the direct header are wasted for these objects. Or, as a better second example, reserve a Klass index in the range [1..D], where D is probably much smaller than 2^L, to cause Klass decoding to look at an offset of D words from the object header to find the expanded Klass pointer. (By expanded we mean not stuffed into L bits, but uncompressed as 64 bits or partially compressed into 32 bits.) A third example would make the presence of D distinguished by a tag bit, so there are L+1 bits in the encoded Klass field, and the dynamic range of D is like [1..2^L]. I like the second one best, with [1..D] being small compared to [D+1..2^L-1].

In any case, when the JVM loads a class that cannot be given a fully compressed index (below 2^L, and not encoding D), then it makes sure that the object layout includes space for the expanded Klass field, at an offset no greater than D (or at offset -1, in the first example). What about big objects that have already used up all D words, and suddenly the JVM needs to reuse one of those words because it has run out of its 2^L indexes? That's easy to provide for: When loading a class, if its layout is of size D or larger, and it is not final, and it does not already have an expanded Klass field, then allocate an expanded Klass field at location D, even if it doesn't need it. An epicycle on this tactic would be to not burn one of the precious 2^L values for this big guy, and treat it the same as if 2^L classes had already been loaded.
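A sketch of the second variant (the [1..D] escape range). Everything here is hypothetical; the constant D, the helper names, and the layout illustrate the idea rather than proposed HotSpot code:

#include <stdint.h>

struct Klass;

const uint32_t D     = 8;            // small escape range, chosen arbitrarily
const int      shift = 9;
static uintptr_t klass_space_base;   // base of the encoding range

static Klass* decode(uint32_t nk, uintptr_t obj_addr) {
  if (nk > D) {
    // Common case: nk is a fully compressed index into class space.
    return (Klass*)(klass_space_base + ((uintptr_t)nk << shift));
  }
  // Escape case (nk in [1..D]): the expanded Klass pointer is stored
  // nk words past the object header, inside the object itself.
  return *(Klass**)(obj_addr + nk * sizeof(void*));
}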

Anyway, as you can see, there are lots of ways to fall back gracefully to a less compressed (32-bit) Klass pointer, even if L is really small.

Once you get below L=16 you can also start to think about hoisting Klass indexes into non-decoded pointer "color" bits, using explicit masking or hardware support like Intel's LAM feature. Why should ZGC have all the fun with pointer tag bits? With Klass indexes hoisted into the pointer, dynamic type-testing for the most common objects would not need to load the object header at all.

If L is super-small, say L=8, we might try to carefully manage (perhaps with semi-statically configured profile feedback) which 250 or so classes get the really nice header layout, and which get the regular ones. The size L=8 is really easy to think of hoisting into an object pointer. Surely there are many heaps where a large proportion of the objects are in the top 250 classes.

@rose00 (Collaborator) commented Dec 8, 2021

Also, FTR, there are other ways to keep cache density without the extra add/xor step of smearing M-shifted bits. It would require deeper cuts, but it is worth thinking again of the thing you already mentioned, where a Klass block is split into a fixed-sized "near Klass" and an indirectly addressed "far Klass". The near Klass would be sized as a small number of cache blocks, to keep Klass metadata dense in the data cache. There is a sensitive performance tradeoff, though, since you want parts of the Klass which are accessed frequently (dynamic type checks, vtables, GC metadata) to be in the near Klass. Since v-tables are variable in size, this means some v-table entries (probably) get favored over others. Tricky business. I think Coleen looked into this many years ago and may have further insights.

Independently, a possible future benefit of a near-Klass/far-Klass split would be that we could consider building in a one-to-many relation across the linking indirection. That could support a tightly integrated "species" concept, where an object's header can point to a near-Klass which represents an individuated species of its class. In theory that could help represent specialized types like ArrayList as distinct from the erased class ArrayList.

@tstuefe (Member, Author) commented Dec 14, 2021

Hi @rose00,

many thanks for the friendly encouragement, it is nice to see the work appreciated!

First off, would you be okay with the patch going in its current form? I'm aware that it's not the perfect solution, just a first step. But your suggestions all require more long-term work for which I lack the time right now (also, vacations coming up). I would like to get this patch in, to enable @rkennke to start using the just slightly narrower narrow Klass pointers.

With this patch, we don't commit to any strategic direction, not really. The changes are tame, and some of them even make sense outside the Lilliput context, so I will bring them upstream. There is nothing we could not change later if we so fancy.

About your ideas: this is only a partial answer, unfortunately; I lack the time to go deeper and do them full justice. But I enjoyed reading your proposals, many gems in there. It is apparent that you have thought about this for a long time :-)

(It is a pity that just at that moment the MLs failed. We may want to continue the discussion outside of the PR).

  1. Hyper-aligning Klass locations and the danger of diminished cache efficiency

This is a good point I did not think much about before. But before I implement a solution, I'd like to make sure that this is really a problem - also because I would like to see a negative before building a positive, for comparison. Do all Klass accesses concentrate on the first 64 bytes of Klass? Have we already ordered members in Klass by hotness? I would have thought we have not, and that accesses are more evenly distributed - after all, a Klass usually covers at least 8 cache lines. I'm not sure here. Access patterns with statistical distributions for Klass members would certainly help, also in deciding if and what to split off.

In comparison to the TLAB, I would have thought that TLAB access is much more concentrated (more loads happening sequentially, targeting the TLAB), but that Klass access is more spread out and interspersed with other loads/stores.

Solution-wise, I like your wiggle-idea (wiggling the Klass address around) to mitigate the lockstep problem. Just to be sure I understand you correctly, instead of an allocation cadence like this: 0->0x200->0x400->0x600, you propose one like this 0->0x240->0x480->0x6c0 ?

My only fear would be that this would increase code size for all the fragments which decode Klass, and that, again, could have a negative effect on performance. At least I cannot predict the effect in my head.

Reducing Klass size would certainly be A Good Thing, from various standpoints. For one, it would reduce metaspace overhead. Especially if we manage to make it homogeneous, and especially if it's close to but under a pow2 size. Maybe splitting off vtable, itable. I thought maybe Klass could hold a small inline vtable/itable and just branch to separate structures if they are exhausted.

  2. About coloring bits, I may just have misunderstood you. Do you think about hiding the klass id inside the coloring bits of an oop? How would that work with narrow oops? Would you not lose the bits every time you store a narrow oop?

Thanks, Thomas

@tstuefe (Member, Author) commented Dec 15, 2021

Unless there are strong objections, I will push this patch tomorrow morning.

@tstuefe (Member, Author) commented Dec 16, 2021

Since I did not hear back from @rose00, @rkennke asked me to speed this up, and I have vacation looming, I decided to push. There is nothing here we could not revert later, and it gives us 22-bit class pointers for now, which we can work with.

Thanks @rkennke, @iklam and @rose00 for the feedback!

/integrate

@openjdk bot commented Dec 16, 2021

Going to push as commit 68fbdb3.

@openjdk bot closed this Dec 16, 2021
@openjdk bot commented Dec 16, 2021

@tstuefe Pushed as commit 68fbdb3.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@tstuefe deleted the smaller-classpointers branch January 24, 2022 08:48
@rose00 (Collaborator) commented Apr 19, 2023

> Since I did not hear back from @rose00, @rkennke asked me to speed this up, and I have vacation looming, I decided to push. There is nothing here we could not revert later, and it gives us 22-bit class pointers for now, which we can work with.

For the record (16 months later!), my comments were future-oriented and I am happy with the progress this push made. Some of those comments may help reduce risk of class ID overflow later on.

I personally think it would be interesting to experiment with a version of this change which makes MaxNarrowKlassPointerBits a command-line configurable value, and then (as a follow-on) undertakes to deal with overflow by adding a second injected field (with a full 32 bits) to impacted layouts, but not to layouts of the “favored” classes, which will usually be the majority.

The C++ access code would be slightly slower. And the extra injected field requires a usually-not-taken branch to a not-so-fast path that picks up the injected overflow bits.

Such an experiment would also enable stress tests (with very small MNKPB values). It could facilitate experiments with different header layouts, even 32-bit layouts, or maybe hoisting (some or all) klass bits into 64-bit oops.

Replies to other queries:

> Just to be sure I understand you correctly, instead of an allocation cadence like this: 0->0x200->0x400->0x600, you propose one like this: 0->0x240->0x480->0x6c0?

Correct. You can get that effect via arithmetic that looks like base + (id << 9) + (id << 6). Your reservations are correct; this mainly pays off if (or after) we correctly sort Klass fields in order of decreasing frequency of access, and that sorting concentrates accesses more sharply than to the first 8 D$ lines. We have sometimes tried to do such sorting in the past by hand, which is why the fields controlling instanceof are near the top. But a real push to do so would require some sort of testing harness that is hard to set up.
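For the record, the cadence arithmetic checks out:

// (id << 9) + (id << 6) == id * (512 + 64) == id * 0x240,
// giving the 0 -> 0x240 -> 0x480 -> 0x6c0 sequence quoted above.
static_assert((1 << 9) + (1 << 6) == 0x240, "stride of 576 bytes per Klass id");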

> My only fear would be that this would increase code size for all the fragments which decode Klass, and that, again, could have a negative effect on performance. At least I cannot predict the effect in my head.

Personally I think the effect of an extra addend id << 6 would be minimal. It could easily be less than a negative effect from cache line pressure, depending on how D$ hardware works. To recap, the negative effect, if it exists, could come from the "big stride" of id << 9 overusing a subset of available D$ lines and underusing others. But, now that I'm writing this, I suppose it's a bad idea to bet against D$ hardware; there are ways to solve the big-stride effect in hardware that might already be in today's chips. (It wasn't, in the days when I routinely spoke with chip designers, but those days are long past.) Only perf testing could settle the question. I suspect D$ pathologies tend to pop up only under heavy, specialized loads, not in standard benchmarks.

> Reducing Klass size would certainly be A Good Thing, from various standpoints. For one, it would reduce metaspace overhead. Especially if we manage to make it homogeneous, and especially if it's close to but under a pow2 size. Maybe splitting off vtable, itable. I thought maybe Klass could hold a small inline vtable/itable and just branch to separate structures if they are exhausted.

Precisely. V-table and i-table stubs would come in two flavors, one which looks for a method in the first/fast part of the Klass, and the other which looks in the second/slower part of the Klass.

> About coloring bits, I may just have misunderstood you. Do you think about hiding the klass id inside the coloring bits of an oop? How would that work with narrow oops? Would you not lose the bits every time you store a narrow oop?

Yes, that's right. It would not work at all with narrow oops; it would be an option to enable instead of narrow oops. If you have a VM with a large heap, and you are forced to use 64-bit oops, suddenly there is a large amount of "slack" in your heap, near the top of every oop. That could in principle be used to carry klass bits. It's only worth it, IMO, if you can then reduce the header size below 64 bits by eliminating klass bits from the header, since they are now present in every (correctly formed) reference to the header. GCs might object, since they are in the business of reconstructing correctly formed references. But it's a good thought experiment at least, IMO.
