8325104: Lilliput: Shrink Classpointers #128


Closed
tstuefe wants to merge 5 commits into master from Smaller-ClassPointers

Conversation

tstuefe
Member

@tstuefe tstuefe commented Feb 1, 2024

Hi,

I wanted to get input on the following improvement for Lilliput. Testing is still ongoing, but things look really good, so this patch is hopefully near its final form (barring any objections from reviewers, of course).

Note: I have a companion patch prepared for upstream, minus the markword changes. I will attempt to get that one upstream quickly in order to not have a large delta between upstream and lilliput, especially in Metaspace.

High-Level Overview

(for a short sequence of slides, please see https://github.com/tstuefe/fosdem24/blob/master/classpointers-and-liliput.pdf - these accompanied a talk we held at FOSDEM 24).

We want to reduce the bit size of narrow Klass to free up bits in the MarkWord.

We cannot just reduce the Klass encoding range size (well, we could, and maybe we will later, but for now we decided not to). We instead increase the alignment Klass is stored at, and use that alignment shadow to store other information.

In other words, this patch effectively turns the narrow Klass pointer into a Klass ID: (almost) every value in its value range now refers to a different class. The value range of nKlass is therefore used much more efficiently.

We then use the newly freed bits in the MarkWord to restore the iHash to 31 bits:

[ 22-bit nKlass | 31-bit iHash | 4 free bits | age | fwd | lck ]

nKlass gets reduced to 22 bits. The identity hash gets re-inflated to 31 bits. Preceding the iHash are now 4 unused bits. The rest is unchanged.
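
As a rough sketch, the bit budget then adds up as follows (constant names here are illustrative, not necessarily the actual HotSpot identifiers; the review snippets further down show the real hash_shift_compact):

// 64-bit mark word budget, low bits first (illustrative names):
static const int lock_bits       = 2;   // lck
static const int self_fwd_bits   = 1;   // fwd
static const int age_bits        = 4;   // age
static const int unused_gap_bits = 4;   // the 4 free bits
static const int hash_bits       = 31;  // iHash
static const int klass_bits      = 22;  // nKlass
// 2 + 1 + 4 + 4 = 11, so iHash starts at bit 11 and nKlass at bit 42:
static const int hash_shift  = lock_bits + self_fwd_bits + age_bits + unused_gap_bits;
static const int klass_shift = hash_shift + hash_bits;
// 42 + 22 = 64: the whole word is accounted for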

(Note: I originally wanted to swap iHash and nKlass such that either of them could be loaded with a 32-bit load, but I found that tricky since C2 seems to rely on the nKlass offset in the Markword being > 0.)

nKlass reduction:

The reduction in nKlass size is achieved by storing Klass structures only at 10-bit-aligned addresses. That alignment (1KB) works well in practice since Klass - although variable-sized - is typically between 512 bytes and 1KB in size. Outliers are possible, but the size distribution is bell-curvish [1], so far-away outliers are very rare.
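
A minimal sketch of the resulting encoding/decoding, assuming an encoding base and the fixed shift of 10 (simplified; the real logic lives in CompressedKlassPointers):

static const int klass_alignment_shift = 10;  // log2 of the 1 KB storage alignment

static uint32_t encode_klass(uintptr_t klass_addr, uintptr_t base) {
  // klass_addr is 1 KB aligned, so the shift loses no bits
  return (uint32_t)((klass_addr - base) >> klass_alignment_shift);
}

static uintptr_t decode_klass(uint32_t nklass, uintptr_t base) {
  // the shift is now mandatory (see "Costs" below)
  return base + ((uintptr_t)nklass << klass_alignment_shift);
}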

To not lose memory to alignment waste, metaspace is reshaped to handle arbitrarily aligned allocations efficiently. Basically, we allow the non-Klass arena of a class loader to steal the alignment waste from the class arena, so alignment waste blocks are filled with non-Klass metadata. That works very well in practice since non-Klass metadata is numerous and fine-granular compared to the big Klass blocks. The total footprint loss in metaspace is therefore almost zero (a few KB at most). We see class space being used more and non-Klass space less, but the total footprint stays about the same.
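
Very roughly, the stealing works like this (a sketch with hypothetical helper names, not the actual metaspace code):

// When the class arena has to align up to the next 1 KB boundary, the
// skipped range is handed to the non-class arena's free-block list
// instead of being wasted:
MetaWord* allocate_klass_block(size_t word_size) {
  MetaWord* top     = class_arena_top();
  MetaWord* aligned = align_up(top, klass_alignment_words);
  if (aligned > top) {
    // the alignment shadow becomes a free block for non-Klass metadata
    add_to_non_class_free_blocks(top, pointer_delta(aligned, top));
  }
  return class_arena_allocate_at(aligned, word_size);
}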

The number of addressable classes is somewhat reduced: from the ~5 million that fit into a maxed-out (3 GB) class space before, down to ~3 million with 10-bit alignment (3 GB at a 1 KB stride yields roughly 3 million slots). That is still a very high and acceptable number.

CDS:

CDS is also modified to store Klass at 10-bit aligned addresses. That works well since CDS contains both non-Klass and Klass data, so things just shift around. As with Metaspace, there is very little footprint loss here.

We now create four CDS archives: with and without coops, and with and without UseCOH. Since that is potentially contentious, I guarded the +COH versions with the configure option --enable-cds-archive-coh.
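
For illustration, enabling the +COH archives at build time would look like this (only the flag itself comes from this patch; the rest of the invocation is the usual OpenJDK build procedure and is illustrative):

$ bash configure --enable-cds-archive-coh
$ make images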

Costs:

The costs of these improvements are:

  • when decoding an nKlass, the shift is now mandatory. Before, it was usually optional (only used on certain platforms when CDS was disabled).
  • we reduce the number of loadable classes from ~5 million to ~3 million
  • In theory, hyper-aligning to more than a cache line could be detrimental. I attempted to measure that effect with a custom microbenchmark that aims to overwhelm L1, but so far I have not seen any effect above background noise. I will repeat these tests, though, and have mitigations planned in case this turns out to be an issue (basically, varying the cadence by cache line size).

Testing

I tested the default variant (COH disabled) and a variant with +COH as default. GHAs are clean. I ran tier 1 and 2 manually on Linux x64 and MacOS.

Tests are currently being executed at Adoptium on Linux arm64 and x64.

Further outlook

Further size reductions are possible. We can probably squeeze out a bit or two by reducing the number of loadable classes. Going further, e.g. to 16-bit class pointers, is possible but would probably need something like John Rose's variable-sized-header idea.

In any case, this PR is a necessary prerequisite, since it allows us to use the bits in nKlass much more efficiently. So, a 16-bit nKlass can really address ~64k classes.

[1] https://github.com/tstuefe/metaspace-statistics/blob/master/Histogram.svg


Progress

  • Change must not contain extraneous whitespace
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Integration blocker

 ⚠️ Too few reviewers with at least role committer found (have 0, need at least 1) (failed with updated jcheck configuration in pull request)

Issue

  • JDK-8325104: Lilliput: Shrink Classpointers (Enhancement - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/lilliput.git pull/128/head:pull/128
$ git checkout pull/128

Update a local copy of the PR:
$ git checkout pull/128
$ git pull https://git.openjdk.org/lilliput.git pull/128/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 128

View PR using the GUI difftool:
$ git pr show -t 128

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/lilliput/pull/128.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Feb 1, 2024

👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@tstuefe tstuefe changed the title from "Smaller class pointers" to "JDK-8325104: Lilliput: Shrink Classpointer size" Feb 6, 2024
@tstuefe tstuefe changed the title from "JDK-8325104: Lilliput: Shrink Classpointer size" to "JDK-8325104: Lilliput: Shrink Classpointers" Feb 6, 2024
@tstuefe tstuefe marked this pull request as ready for review February 7, 2024 14:17
@openjdk

openjdk bot commented Feb 7, 2024

@tstuefe this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout Smaller-ClassPointers
git fetch https://git.openjdk.org/lilliput.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch rfr Pull request is ready for review labels Feb 7, 2024
@mlbridge

mlbridge bot commented Feb 7, 2024

Webrevs

Collaborator

@rkennke rkennke left a comment


Very nice work!

One comment on your description:

when decoding an nKlass, the shift is not mandatory.

Did you mean to say 'is now mandatory'? Otherwise it doesn't seem to make much sense.

I did a first pass over the changes. Comments inline.

// Used only with compact headers.
static const int klass_shift = hash_shift_compact + hash_bits_compact;
#endif
static const int hash_shift_compact = 11;
Collaborator


It seems odd to have a hole there in the middle. Could we move the hole to the uppermost bits instead? And if we really have to have it here, could we use a new unused_gap_bits_compact constant that is 4 instead of 1?

@tstuefe
Member Author

tstuefe commented Feb 8, 2024

Very nice work!

Thank you!

when decoding an nKlass, the shift is not mandatory.

Did you mean to say 'is now mandatory'? Otherwise it doesn't seem to make much sense.

Yes, you are right; I'll correct that.

@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Feb 10, 2024
@bridgekeeper

bridgekeeper bot commented Mar 11, 2024

@tstuefe This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

@openjdk

openjdk bot commented Mar 13, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

Comment on lines +343 to +345
constexpr int load_shift = markWord::klass_load_shift;
STATIC_ASSERT(load_shift % 8 == 0);
return mark_offset_in_bytes() + load_shift / 8;
Member


Isn't this broken for big-endian machines? The follow-up question then is, should we really be reading the klass pointer with 32-bit loads? If we load the entire 64-bit "object header" and then shift with klass_shift, we wouldn't have to think about endianess, right? Do we keep the 32-bit load because we don't want to mess with C2?

Member Author


There was a technical problem when I developed the patch, which was that klass_offset_in_bytes needed to be != mark_offset_in_bytes. See e.g. Compile::flatten_alias_type, but I believe there were other places.

Then, a 32-bit load can possibly be encoded with fewer bytes, no? I am not sure if it would be faster beyond that distinction, however.

Member


I still think it is odd that we return that the klass offset is 4 on big-endian machines. If we ever really try to read the klass at (obj + klass_offset_in_bytes()) on those machines, we will not get the klass bits; instead we will get some of the other bits in the object header.

My question was more whether it really is a good idea to load the klass with 4-byte loads; it seems safer (more platform-independent) to stick with 8-byte loads. When we use 8-byte loads and shifts, we don't have to think about endianness.

I glanced at how klass_offset_in_bytes is used in the C2 code. It looks like in most places where klass_offset_in_bytes() is used, it serves as a sort of token to indicate that the load (obj + offset) should be interpreted as a load of the klass. We don't actually use that offset for the load/decode of the klass; instead, we perform an 8-byte load from the start of the object, with an associated shift operation.

So, maybe my concern here is more that the name "klass_offset_in_bytes" is a bit misleading, as it sounds like this is the offset where the klass bits are located. Code using klass_offset_in_bytes() really needs to be written with the understanding that this is not actually the case.
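
For reference, a sketch of the two load strategies being discussed, using the constants from this PR (klass_shift == 42, klass_load_shift == 32, hence klass_offset_in_bytes() == mark_offset_in_bytes() + 4); function names are illustrative, and the 4-byte variant is the one that only works on little-endian:

static uint32_t load_nklass_wide(const void* obj) {
  uint64_t mark = *(const uint64_t*)obj;  // 8-byte load of the mark word at offset 0
  return (uint32_t)(mark >> 42);          // endianness-agnostic 22-bit nKlass
}

static uint32_t load_nklass_narrow_le(const void* obj) {
  // 4-byte load at offset 4: on little-endian this holds header bits 32..63,
  // so shifting by (42 - 32) yields the 22-bit nKlass; on big-endian the same
  // load returns bits 0..31 instead - the concern raised above
  uint32_t half = *(const uint32_t*)((const char*)obj + 4);
  return half >> 10;
}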

Collaborator


The trouble with C2 is that, currently, loading the Klass* (pre-lilliput) has its own memory slice, and is totally immutable. With Lilliput, we currently load the Klass* from the mark-word, and need to deal with header displacement in ObjectMonitors. This currently happens in the backend and is totally opaque to C2 IR.

However, it all could be done in C2 IR, even more so when we get the OMWorld stuff. Then the loads would happen on their true memory slice (the mark-word) and we would do the shifting/masking in IR, too. But it is questionable whether we want to do so. The fact that Load(N)Klass can currently be treated as immutable makes the node freely movable in the ideal graph. Wiring it up in the same slice as mark-word loads means that we would have to re-load the Klass* after anything that could potentially touch the mark-word - especially safepoints, calls, etc. - even though we know that the actual Klass* portion of the mark-word is still, in fact, immutable. Therefore I believe it is better to keep the LoadKlass stuff on its own memory slice, with its own offset, even if we never actually load from that offset. offset=4 is a good choice for that, because it would never clash with a true offset of a field. Long-term, something like offset=1 or 2 may be better: when we want to do 4-byte headers, fields may start at offset=4.

Also note the weird dance that we need to do in C2 .ad files to figure out the true offset. :-/

We should run this by some C2 experts to figure out the best way to deal with all that.

Member


Thanks for explaining some of the background for this. It sounds like a good idea to get a C2 expert to look at this.

@rose00
Collaborator

rose00 commented Mar 25, 2024

One possible problem with hyper-aligned data structures (such as Klass* pointers allocated zero mod 1024) is interactions with data cache dynamics. If your D$ maps blocks aligned zero mod 64, then your Klass blocks can occupy only 1/16 of possible D$ entries. Sometimes this matters, when you are loading from many Klasses: The effect is of making the D$, just for them, smaller by a factor of 16, causing more frequent collisions and evictions.

(This sort of effect has been observed, IIRC, with both stacks and TLABs, in the past. I don’t know if Klass accesses are frequent enough to make similar effects be detectable or not.)

A solution is to perturb the hyper-alignment in such a way that the lower bits select all possible D$ frames, instead of just 1/16 of them. The simplest way to do this is to make your hyper-aligned Klass* blocks aligned mod (1024 + 64), where 1024 is your intended size and 64 is your D$ entry size. (Or maybe mod 1024+8, if it is faster to shift by 8 instead of by 64.). Decoding a Klass index consists of multiplying by the sum of a pair of powers of two, instead of by a single power of two. That means an extra instruction in the decode fast-path, unless you can hide the extra shift/add in an addressing mode.

Given that we work hard to make the Klass base address be zero, it is clear that every instruction counts, and even if the Klass base is zero, you still need an extra instruction or two to do the extra shift/add. So that is sad, but the fact that we are sensitive to performance of these sequences means that we might possibly be sensitive to D$ collision effects also: Saving one instruction but having the average number of cycles go up, due to D$ misses, could be a net loss.

So I suggest considering (even to the point of doing performance tests) the idea of assigning Klass addresses mod 1088 instead of 1024. It might require some imaginative coding to adjust the allocator logic to hand out such addresses, but I don’t think it’s very bad. For allocator performance, multiplication by 1088 is trivial. There will be some divisions by 1088, probably, but those are optimized (usually) to multiplications as well, if the divisor (1088) is a compile time constant.

The allocator could search for available blocks mod 1024, and then only at the end, before handing out the block, round up to the next multiple of 1088. That would add a number from 0 to 63 to the base address which has already been selected. If that shift makes the block overflow at the other end, then the allocator has to keep searching.
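
For what it's worth, decoding with the proposed 1088-byte stride could look like this (a sketch reusing the base naming from the description above; multiplication by 1088 = 1024 + 64 decomposes into two shifts and an add):

static uintptr_t decode_klass_1088(uint32_t nklass, uintptr_t base) {
  // nklass * 1088 == (nklass << 10) + (nklass << 6)
  uintptr_t idx = (uintptr_t)nklass;
  return base + (idx << 10) + (idx << 6);
}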

@rose00
Collaborator

rose00 commented Mar 25, 2024

… interactions with data cache dynamics …

I see acknowledgement of this issue in a comment in this PR that says "smallest shift possible that still covers the whole range”. This is a good solution as long as that smallest shift is 64 (or less). A shift of 128 would dilute the D$ by 50%, but would reach 2^(22+7-10) distinct Klasses, half a million, which is great. I guess VMs which configure themselves for a quarter million classes or fewer just use the cache line as the allocation modulus, and all is well. So my solution (mod 1088) is really useful only for huge workloads. I retract my suggestion that performance experiments are needed at this time, given the solution you have in place.

@tstuefe
Member Author

tstuefe commented Mar 26, 2024

One possible problem with hyper-aligned data structures (such as Klass* pointers allocated zero mod 1024) is interactions with data cache dynamics. […]

Hi John,

I remember well your point from our last interchange (#13 (comment)). Last summer, I did a number of performance tests but found no clear result above background noise. But I had planned to revisit this issue and potentially fix it as a follow-up task. One never knows, my test methodology may have been off.

Like you, I was worried about the increased cost and complexity of Klass ID decoding. Last summer I investigated what a possible decoding would look like. Since I am not a compiler expert, I just looked at what C compilers do when referencing items in a 1088-stride array: x64, e.g., uses imul; aarch64 uses some form of shift+add. Maybe it's not so bad, but the effects need to be measured.

My short-term contingency plan had been to use some of the now free four bits to increase the Klass ID bits and thus bring alignment back to 64 bytes. That would work at least for 64-bit headers, as long as Valhalla does not need those bits too.

I will re-examine my performance runs, and double-check for methodology errors.

I see acknowledgement of this issue in a comment in this PR that says "smallest shift possible that still covers the whole range”. This is a good solution as long as that smallest shift is 64 (or less). A shift of 128 would dilute the D$ by 50%, but would reach 2^(22+7-10) distinct Klasses, half a million, which is great. I guess VMs which configure themselves for a quarter million classes or fewer just use the cache line as the allocation modulus, and all is well. So my solution (mod 1088) is really useful only for huge workloads. I retract my suggestion that performance experiments are needed at this time, given the solution you have in place.

Yes, that was the underlying reason for this logic. Unfortunately, the default is for the class space to be 1G, with some additional space for CDS, so we are at a 2G range, which brings us just one bit. There are knobs to play with, e.g. CDS size is usually very small, so both CDS and class space default together could easily fit into 1G. That brings us 2 bits, back to an alignment of 256. Still only every fourth cache line.

Member

@stefank stefank left a comment


Hi Thomas, I've started to read through the patch but am far from done. I'm sending a few early comments / questions / suggestions.


address CompressedKlassPointers::_base = (address)-1;
int CompressedKlassPointers::_shift = -1;
size_t CompressedKlassPointers::_range = (size_t)-1;
Member

@stefank stefank Mar 26, 2024


Doesn't this put 0x00000000FFFFFFFF in this size_t? I guess this works with the code using it, but it is not obvious that this is what we intended to put into the _range variable as the (uninitialized) value.

Related to this: how important is it to do these initialization checks? It seems to add casts and a bit of odd code. When this is stable, wouldn't it be nice to clean this up? That would also make _tiny_cp a bool instead of a tri-bool.

Member


Doesn't this put 0x00000000FFFFFFFF in this size_t?

Ignore that. I checked, and this is sign-extended out to 0xFFFFFFFFFFFFFFFF.

// For UseCompressedClassPointers.
class CompressedKlassPointers : public AllStatic {
friend class VMStructs;

// Tiny-class-pointer mode
static int _tiny_cp; // -1, 0=true, 1=false
Member


This comment seems wrong. I think it should be 0=false, 1=true.


// Narrow klass pointer bits for an unshifted narrow Klass pointer.
static constexpr int narrow_klass_pointer_bits_legacy = 32;
static constexpr int narrow_klass_pointer_bits_tinycp = 22;
Member


It might be nice to have a consistent naming between *_tinycp and _tiny_cp.

static int narrow_klass_pointer_bits() { check_init(_narrow_klass_pointer_bits); return _narrow_klass_pointer_bits; }

// The maximum possible shift; the actual shift employed later can be smaller (see initialize())
static int max_shift() { check_init(_max_shift); return _max_shift; }
Member


FWIW, you could get rid of the variable name duplication if we changed this to be something like { return read_check_init(_max_shift); } instead.
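
Presumably something like this (a sketch of the suggested helper, not existing code; it assumes -1 stays the "uninitialized" sentinel used throughout this class):

template <typename T>
static T read_check_init(T value) {
  assert(value != (T)-1, "not yet initialized");
  return value;
}

static int max_shift() { return read_check_init(_max_shift); }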

@tstuefe
Member Author

tstuefe commented Mar 26, 2024

Hi Thomas, I've started to read through the patch but am far from done. I'm sending a few early comments / questions / suggestions.

Many thanks, Stefan. I have to interleave work on this patch with a ton of other stuff, so I may be slow answering. But your reviews are appreciated.

Member

@stefank stefank left a comment


I took another pass over the code and have some more questions.

@@ -89,12 +100,28 @@ ClassLoaderMetaspace::~ClassLoaderMetaspace() {

// Allocate word_size words from Metaspace.
MetaWord* ClassLoaderMetaspace::allocate(size_t word_size, Metaspace::MetadataType mdType) {
word_size = align_up(word_size, Metaspace::min_allocation_word_size);
Member


Could you explain why you need to align word_size here? It seems like we perform a very similar alignment inside MetaspaceArena::allocate.

class_space_arena()->deallocate(ptr, word_size);
} else {
non_class_space_arena()->deallocate(ptr, word_size);
NOT_LP64(word_size = align_down(word_size, Metaspace::min_allocation_word_size);)
Member


Why is this needed? What code paths pass in a non-aligned word_size?

(BTW, I started to look at these in more details because I tend to think that we should stay away from modifying the input parameters, since it makes it easier to misunderstand the code).

}
if (UseNewCode)printf("\n");
Member


This looks like old debugging code.

Comment on lines +185 to 187
void BlockTree::zap_block(MetaBlock bl) {
memset(bl.base(), 0xF3, bl.word_size() * sizeof(MetaWord));
}
Member


In the header file the parameter is named block and here it is named bl. In add_block(MetaBlock mb) the name is mb. Maybe think about trying to use consistent naming for these?

size_t word_size() const { return _word_size; }
bool is_empty() const { return _base == nullptr; }
bool is_nonempty() const { return _base != nullptr; }
void reset() { _base = nullptr; _word_size = 0; }
Member


Do you need reset? I was wondering if you could make the _base, _word_size const by changing the few places that use this to: result = MetaBlock();

You would also need to skip having split_off_tail and replace its usage with:

wastage = MetaBlock(...);
result  = MetaBlock(...);

Just a thought if you wanted to make MetaBlock an even simpler tiny structure.

#ifndef SHARE_MEMORY_METASPACE_METABLOCK_INLINE_HPP
#define SHARE_MEMORY_METASPACE_METABLOCK_INLINE_HPP

#include "memory/metaspace/metablock.hpp"
Member


We have a convention that the .hpp file associated with the .inline.hpp file should come first (as you have here) and then there's a blank line separating it from the rest of the includes.


@rose00
Collaborator

rose00 commented Mar 27, 2024

My short-term contingency plan had been to use some of the now free four bits to increase the Klass ID bits and thus bring alignment back to 64 bytes. That would work at least for 64-bit headers, as long as Valhalla does not need those bits too.

Thanks for the response, Thomas. (It looks like you remembered our conversation better than I did.) Taking back the four free bits is a good tactic for now.

It may possibly squeeze Valhalla, at which point we will have to evaluate our choices: Should we encode the valhalla bits with a joint multi-bit encoding embedded in the klass-ID? Should we look again at mod-1088 encodings discussed here? Should we use a joint bit encoding for all array klass IDs, since Valhalla needs special markings on arrays anyway? Those are the sorts of questions that could arise.

@openjdk openjdk bot changed the title from "JDK-8325104: Lilliput: Shrink Classpointers" to "8325104: Lilliput: Shrink Classpointers" Apr 10, 2024
@tstuefe tstuefe force-pushed the Smaller-ClassPointers branch from 1260f2d to 49d289e April 12, 2024 12:32
@openjdk

openjdk bot commented Apr 12, 2024

@tstuefe Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@tstuefe tstuefe force-pushed the Smaller-ClassPointers branch from 49d289e to 279f532 April 12, 2024 13:02
@openjdk

openjdk bot commented Apr 12, 2024

@tstuefe Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

@tstuefe
Member Author

tstuefe commented Apr 12, 2024

I'm taking this back to draft to avoid further mail spamming.

@tstuefe tstuefe marked this pull request as draft April 12, 2024 15:03
@openjdk openjdk bot removed the rfr Pull request is ready for review label Apr 12, 2024
@coleenp
Collaborator

coleenp commented Apr 15, 2024

Hi Thomas, I haven't looked at the metaspace code in a very long time. Where does it decide to allocate non-class metaspace into the class space? This is actually disturbing to me. We worried about the pressure on the hard limit of class space, and even though we've never had a customer complaint, it doesn't take the worry away. Did you try and reject indirect class pointers again? My experiment was from over a decade ago and the c2 vtable performance was slightly worse for an old benchmark but the modern code is different.

I also worry about class space pressure because of LambdaForm and generated classes, which seem to be continuously proliferating in the code without user control. I did happen to observe that, since we never allocate instances of these, nor of interfaces and abstract classes (except VirtualMemoryError, which I think is a bug), these don't need to be allocated in the class metaspace. I have a hack that goes in the opposite direction: it moves more Klasses into the non-class metaspace instead.

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Apr 19, 2024
@tstuefe
Member Author

tstuefe commented May 1, 2024

Closed, since it's already part of the big Rebase.

@tstuefe tstuefe closed this May 1, 2024
@stefank
Member

stefank commented May 3, 2024

FYI: the current code can be found here:
https://github.com/openjdk/lilliput/commits/master/
0368d21

Labels
merge-conflict Pull request has merge conflict with target branch

5 participants