8325104: Lilliput: Shrink Classpointers #128
👋 Welcome back stuefe! A progress list of the required criteria for merging this PR into
@tstuefe this pull request can not be integrated due to merge conflicts. To resolve them, run:

git checkout Smaller-ClassPointers
git fetch https://git.openjdk.org/lilliput.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Webrevs
Very nice work!
One comment on your description:
when decoding an nKlass, the shift is not mandatory.
Did you mean to say 'is now mandatory'? Otherwise it doesn't seem to make much sense.
I did a first pass over the changes. Comments inline.
// Used only with compact headers.
static const int klass_shift = hash_shift_compact + hash_bits_compact;
#endif
static const int hash_shift_compact = 11;
It seems odd to have a hole there in the middle. Could we have the hole in the uppermost bits? If we really have to have it here, use a new unused_gap_bits_compact here that has 4 instead of 1?
Thank you!
Yes, you are right; I'll correct that.
@tstuefe This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
❗ This change is not yet ready to be integrated.
constexpr int load_shift = markWord::klass_load_shift;
STATIC_ASSERT(load_shift % 8 == 0);
return mark_offset_in_bytes() + load_shift / 8;
Isn't this broken for big-endian machines? The follow-up question then is: should we really be reading the klass pointer with 32-bit loads? If we load the entire 64-bit "object header" and then shift with klass_shift, we wouldn't have to think about endianness, right? Do we keep the 32-bit load because we don't want to mess with C2?
There was a technical problem when I developed the patch: klass_offset_in_bytes needed to be != mark_offset_in_bytes. See e.g. Compile::flatten_alias_type, but I believe there were other places as well.
Also, a 32-bit load can possibly be encoded with fewer bytes, no? I am not sure whether it would be faster beyond that distinction, however.
I still think it is odd that we return a klass offset of 4 on big-endian machines. If we ever actually try to read the klass at (obj + klass_offset_in_bytes()) on those machines, we will not get the klass bits; instead we will get some of the other bits in the object header.
My inquiry was more about whether it really is a good idea to load the klass with 4-byte loads; it seems safer (more platform-independent) to stick with 8-byte loads. When we use 8-byte loads and shifts we don't have to think about endianness.
I glanced at how klass_offset_in_bytes is used in the C2 code. It looks like in most places where klass_offset_in_bytes() is used, it serves as a sort of token to indicate that the load (obj + offset) should be interpreted as a load of the klass. We don't actually use that offset for the load/decode of the klass; instead we perform an 8-byte load from the start of the object, with an associated shift operation.
So maybe my concern here is more that the name "klass_offset_in_bytes" is a bit misleading, since it sounds like this is the offset where the klass bits are located. Code using klass_offset_in_bytes() really needs to be written with the understanding that this is not the case.
The trouble with C2 is that, currently, loading the Klass* (pre-Lilliput) has its own memory slice and is totally immutable. With Lilliput, we currently load the Klass* from the mark word, and need to deal with header displacement in ObjectMonitors. This currently happens in the backend and is totally opaque to C2 IR.
However, it all could be done in C2 IR, even more so when we get the OMWorld stuff. Then the loads would happen on their true memory slice (the mark word) and we would do the shifting/masking in IR, too. But it is questionable whether we want to do so. The fact that Load(N)Klass can currently be treated as immutable makes the node freely movable in the ideal graph. Wiring it up in the same slice as mark-word loads means that we would have to re-load the Klass* after anything that could potentially touch the mark word, especially safepoints, calls, etc., even though we know that the actual Klass* portion of the mark word is still, in fact, immutable. Therefore I believe it is better to keep the LoadKlass stuff on its own memory slice, with its own offset, even if we never actually load from that offset. offset=4 is a good choice for that, because it would never clash with a true offset of a field. Long-term, something like offset=1 or 2 may be better: when we want to do 4-byte headers, fields may start at offset=4.
Also note the weird dance that we need to do in C2 .ad files to figure out the true offset. :-/
We should run this by some C2 experts to figure out the best way to deal with all that.
Thanks for explaining some of the background for this. It sounds like a good idea to get a C2 expert to look at this.
One possible problem with hyper-aligned data structures (such as Klass* pointers allocated zero mod 1024) is interaction with data-cache dynamics. If your D$ maps blocks aligned zero mod 64, then your Klass blocks can occupy only 1/16 of possible D$ entries. Sometimes this matters, when you are loading from many Klasses: the effect is of making the D$, just for them, smaller by a factor of 16, causing more frequent collisions and evictions. (This sort of effect has been observed, IIRC, with both stacks and TLABs in the past. I don't know if Klass accesses are frequent enough to make similar effects detectable or not.)

A solution is to perturb the hyper-alignment in such a way that the lower bits select all possible D$ frames, instead of just 1/16 of them. The simplest way to do this is to make your hyper-aligned Klass* blocks aligned mod (1024 + 64), where 1024 is your intended size and 64 is your D$ entry size. (Or maybe mod 1024+8, if it is faster to shift by 8 instead of by 64.) Decoding a Klass index then consists of multiplying by the sum of a pair of powers of two, instead of by a single power of two. That means an extra instruction in the decode fast path, unless you can hide the extra shift/add in an addressing mode.

Given that we work hard to make the Klass base address be zero, it is clear that every instruction counts, and even if the Klass base is zero, you still need an extra instruction or two to do the extra shift/add. So that is sad, but the fact that we are sensitive to the performance of these sequences means that we might also be sensitive to D$ collision effects: saving one instruction but having the average number of cycles go up, due to D$ misses, could be a net loss. So I suggest considering (even to the point of doing performance tests) the idea of assigning Klass addresses mod 1088 instead of 1024. It might require some imaginative coding to adjust the allocator logic to hand out such addresses, but I don't think it's very bad.

For allocator performance, multiplication by 1088 is trivial. There will probably be some divisions by 1088, but those are usually optimized to multiplications as well, if the divisor (1088) is a compile-time constant. The allocator could search for available blocks mod 1024, and then only at the end, before handing out the block, round up to the next multiple of 1088. That would add a number from 0 to 63 to the base address that has already been selected. If that shift makes the block overflow at the other end, then the allocator has to keep searching.
I see acknowledgement of this issue in a comment in this PR that says "smallest shift possible that still covers the whole range". This is a good solution as long as that smallest shift is 64 (or less). A shift of 128 would dilute the D$ by 50%, but would reach 2^(22+7-10) distinct Klasses, half a million, which is great. I guess VMs which configure themselves for a quarter million classes or fewer just use the cache line as the allocation modulus, and all is well. So my solution (mod 1088) is really useful only for huge workloads. I retract my suggestion that performance experiments are needed at this time, given the solution you have in place.
Hi John, I remember your point well from our last interchange (#13 (comment)). Last summer, I did a number of performance tests but found no clear result above background noise. But I had planned to revisit this issue and potentially fix it as a follow-up task; one never knows, my test methodology may have been off. Like you, I was worried about the increased cost and complexity of Klass ID decoding. Last summer I investigated what a possible decoding would look like. Since I am not a compiler expert, I just looked at what C compilers do when referencing items in a 1088-stride array: x64, e.g., uses imul; aarch64 uses some form of shift+add. Maybe it's not so bad, but the effects need to be measured. My short-term contingency plan had been to use some of the now-free four bits to increase the Klass ID bits and thus bring alignment back to 64 bytes. That would work at least for 64-bit headers, as long as Valhalla does not need those bits too. I will re-examine my performance runs and double-check for methodology errors.
Yes, that was the underlying reason for this logic. Unfortunately, the default is for the class space to be 1 GB, with some additional space for CDS, so we are at a 2 GB range, which brings us just one bit. There are knobs to play with; e.g., CDS size is usually very small, so CDS and class space together could by default easily fit into 1 GB. That brings us 2 bits, back to an alignment of 256. Still only every fourth cache line.
Hi Thomas, I've started to read through the patch but am far from done. I'm sending a few early comments / questions / suggestions.
address CompressedKlassPointers::_base = (address)-1;
int CompressedKlassPointers::_shift = -1;
size_t CompressedKlassPointers::_range = (size_t)-1;
Doesn't this put 0x00000000FFFFFFFF in this size_t? I guess this works with the code using it, but it is not obvious that this is what we intended to put into the _range variable as the (uninitialized) value.
Related to this: how important are these initialization checks? They seem to add casts and somewhat odd code. When this is stable, wouldn't it be nice to clean this up? That would also make _tiny_cp a bool instead of a tri-bool.
Doesn't this put 0x00000000FFFFFFFF in this size_t?
Ignore that. I checked, and this is sign-extended out to 0xFFFFFFFFFFFFFFFF.
// For UseCompressedClassPointers.
class CompressedKlassPointers : public AllStatic {
friend class VMStructs;
// Tiny-class-pointer mode
static int _tiny_cp; // -1, 0=true, 1=false
This comment seems wrong. I think it should be 0=false, 1=true.
// Narrow klass pointer bits for an unshifted narrow Klass pointer.
static constexpr int narrow_klass_pointer_bits_legacy = 32;
static constexpr int narrow_klass_pointer_bits_tinycp = 22;
It might be nice to have consistent naming between *_tinycp and _tiny_cp.
static int narrow_klass_pointer_bits() { check_init(_narrow_klass_pointer_bits); return _narrow_klass_pointer_bits; }

// The maximum possible shift; the actual shift employed later can be smaller (see initialize())
static int max_shift() { check_init(_max_shift); return _max_shift; }
FWIW, you could get rid of the variable name duplication if we changed this to something like { return read_check_init(_max_shift); } instead.
Many thanks, Stefan. I have to interleave work on this patch with a ton of other stuff, so I may be slow answering.
I took another pass over the code and have some more questions.
@@ -89,12 +100,28 @@ ClassLoaderMetaspace::~ClassLoaderMetaspace() {

// Allocate word_size words from Metaspace.
MetaWord* ClassLoaderMetaspace::allocate(size_t word_size, Metaspace::MetadataType mdType) {
word_size = align_up(word_size, Metaspace::min_allocation_word_size);
Could you explain why you need to align word_size here? It seems like we perform a very similar alignment inside MetaspaceArena::allocate.
class_space_arena()->deallocate(ptr, word_size);
} else {
non_class_space_arena()->deallocate(ptr, word_size);
NOT_LP64(word_size = align_down(word_size, Metaspace::min_allocation_word_size);)
Why is this needed? Which code paths pass in a non-aligned word_size?
(BTW, I started to look at these in more detail because I tend to think we should stay away from modifying input parameters, since it makes the code easier to misunderstand.)
}
if (UseNewCode) printf("\n");
This looks like old debugging code.
void BlockTree::zap_block(MetaBlock bl) {
memset(bl.base(), 0xF3, bl.word_size() * sizeof(MetaWord));
}
In the header file the parameter is named block and here it is named bl. In add_block(MetaBlock mb) the name is mb. Maybe think about using consistent naming for these?
size_t word_size() const { return _word_size; }
bool is_empty() const { return _base == nullptr; }
bool is_nonempty() const { return _base != nullptr; }
void reset() { _base = nullptr; _word_size = 0; }
Do you need reset()? I was wondering if you could make _base and _word_size const by changing the few places that use this to: result = MetaBlock();
You would also need to drop split_off_tail and replace its usage with:
wastage = MetaBlock(...);
result = MetaBlock(...);
Just a thought, if you wanted to make MetaBlock an even simpler tiny structure.
#ifndef SHARE_MEMORY_METASPACE_METABLOCK_INLINE_HPP
#define SHARE_MEMORY_METASPACE_METABLOCK_INLINE_HPP

#include "memory/metaspace/metablock.hpp"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a convention that the .hpp file associated with the .inline.hpp file should come first (as you have here) and then there's a blank line separating it from the rest of the includes.
constexpr int load_shift = markWord::klass_load_shift;
STATIC_ASSERT(load_shift % 8 == 0);
return mark_offset_in_bytes() + load_shift / 8;
Thanks for explaining some of the background for this. It sounds like a good idea to get a C2 expert to look at this.
Thanks for the response, Thomas. (It looks like you remembered our conversation better than I did.) Taking back the four free bits is a good tactic for now. It may possibly squeeze Valhalla, at which point we will have to evaluate our choices: Should we encode the valhalla bits with a joint multi-bit encoding embedded in the klass-ID? Should we look again at mod-1088 encodings discussed here? Should we use a joint bit encoding for all array klass IDs, since Valhalla needs special markings on arrays anyway? Those are the sorts of questions that could arise.
Force-pushed from 1260f2d to 49d289e
@tstuefe Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers' Guide for more information.
Force-pushed from 49d289e to 279f532
@tstuefe Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers' Guide for more information.
I'm taking this back to draft to avoid further mail spamming.
Hi Thomas, I haven't looked at the metaspace code in a very long time. Where does it decide to allocate non-class metaspace into the class space?

This is actually disturbing to me. We worried about the pressure on the hard limit of class space, and even though we've never had a customer complaint, it doesn't take the worry away. Did you try and reject indirect class pointers again? My experiment was from over a decade ago, and the C2 vtable performance was slightly worse for an old benchmark, but the modern code is different.

I also worry about class space pressure because of LambdaForm and generated classes, which seem to be continuously proliferating in the code without user control. I did happen to observe that since we never allocate instances of these, nor do we allocate instances of interfaces and abstract classes (except VirtualMemoryError, which I think is a bug), these don't need to be allocated in the class metaspace. I have a hack that goes in the opposite direction: I move more Klasses into the non-class metaspace instead.
Closed, since it's already part of the big rebase
FYI: the current code can be found here: |
Hi,
I wanted to get input on the following improvement for Lilliput. Testing is still ongoing, but things look really good, so this patch is hopefully near its final form (barring any objections from reviewers, of course).
Note: I have a companion patch prepared for upstream, minus the markword changes. I will attempt to get that one upstream quickly in order to not have a large delta between upstream and lilliput, especially in Metaspace.
High-Level Overview
(for a short sequence of slides, please see https://github.com/tstuefe/fosdem24/blob/master/classpointers-and-liliput.pdf - these accompanied a talk we held at FOSDEM 24).
We want to reduce the bit size of narrow Klass to free up bits in the MarkWord.
We cannot just reduce the Klass encoding range size (well, we could, and maybe we will later, but for now we decided not to). We instead increase the alignment Klass is stored at, and use that alignment shadow to store other information.
In other words, this patch changes the narrow Klass Pointer to a Klass ID, since now (almost) every value in its value range points to a different class. Therefore, we use the value range of nKlass much more efficiently.
We then use the newly freed bits in the MarkWord to restore the iHash to 31 bits:
nKlass gets reduced to 22 bits. Identity hash gets re-inflated to 31 bits. Preceding iHash are now 4 unused bits. Rest is unchanged.
(Note: I originally wanted to swap iHash and nKlass such that either of them could be loaded with a 32-bit load, but I found that tricky since C2 seems to rely on the nKlass offset in the Markword being > 0.)
nKlass reduction:
The reduction in nKlass size is achieved by only storing Klass structures at 10-bit-aligned addresses. That alignment (1 KB) works well in practice since Klass, although variable-sized, is typically between 512 bytes and 1 KB in size. Outliers are possible, but the size distribution is bell-curve-shaped [1], so far-away outliers are very rare.
To not lose memory to alignment waste, metaspace is reshaped to handle arbitrarily aligned allocations efficiently. Basically, we allow the non-Klass arena of a class loader to steal the alignment waste storage from the class arena. So, alignment waste blocks are filled with non-Klass metadata. That works very well in practice since non-Klass metadata is numerous and fine-granular compared to the big Klass blocks. Total footprint loss in metaspace is, therefore, almost zero (a few KB at most). We see class space to be used more, non-Klass space to be used less, but the total footprint is about the same.
The number of addressable classes is somewhat reduced: from the ~5 million that previously fit into a maxed-out (3 GB) class space, down to ~3 million with 10-bit alignment. That is still a very high and acceptable number.
CDS:
CDS is also modified to store Klass at 10-bit aligned addresses. That works well since CDS contains both non-Klass and Klass data, so things just shift around. As with Metaspace, very little footprint loss here.
We now create four CDS archives: with and without coops, and with and without UseCOH. Since that is potentially contentious, I guarded the +COH versions with the configure option --enable-cds-archive-coh.
Costs:
The costs of these improvements are:
Testing
I tested the default variant (COH disabled) and a variant with +COH as default. GHAs are clean. I ran tier 1 and 2 manually on Linux x64 and MacOS.
Tests are currently being executed at Adoptium on Linux arm64 and x64.
Further outlook
Further size reductions are possible. We can probably squeeze a bit or two out of reducing the number of loadable classes. Going further, e.g. to 16-bit class pointers, is possible but would probably need something like the variable-sized-header idea of John Rose.
In any case, this PR is a necessary prerequisite, since it allows us to use the bits in nKlass much more efficiently. So, a 16-bit nKlass can really address ~64k classes.
[1] https://github.com/tstuefe/metaspace-statistics/blob/master/Histogram.svg
Progress
Integration blocker
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/lilliput.git pull/128/head:pull/128
$ git checkout pull/128
Update a local copy of the PR:
$ git checkout pull/128
$ git pull https://git.openjdk.org/lilliput.git pull/128/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 128
View PR using the GUI difftool:
$ git pr show -t 128
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/lilliput/pull/128.diff
Webrev
Link to Webrev Comment