
Guava ImmutableList (and others) offer awful performance in some cases due to size-optimized specializations #1268

Open
gissuebot opened this issue Oct 31, 2014 · 20 comments

Comments

@gissuebot

Original issue created by travis.downs on 2013-01-27 at 05:39 AM


Many of the Guava immutable collections have a cute trick where they have specializations for zero-element (EmptyImmutableList) and one-element (SingletonImmutableList) collections. These specializations take the form of subclasses of ImmutableList, to go along with the "Regular" implementation and a few other specializations like ReverseImmutable, SubList, etc.

Unfortunately, the result is that when these subclasses mix at some call site, the call is megamorphic, and performance is awful compared to classes without these specializations (worse by a factor of 20 or more).

Here's a simple benchmark that shows ImmutableList vs ArrayList - given that ImmutableList is a simple ArrayList, without the need for double bounds checks on the upper bound, you'd expect it to at least be comparable:

benchmark     minSize      ns linear runtime
ArrayList           0    60.9 =
ArrayList           1    60.6 =
ArrayList           2    60.6 =
ArrayList           3    60.7 =

ImmutableList       0  1169.0 ==============================
ImmutableList       1   107.4 ==
ImmutableList       2    90.7 ==
ImmutableList       3    90.7 ==

The benchmark just calls size() repeatedly on 100 ArrayLists or ImmutableLists. The sizes of the lists are evenly distributed in [minSize, 4].
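The attachment itself isn't reproduced in the thread; a rough Java sketch of its shape (class and method names invented here, and the timing harness omitted - the original presumably used a Caliper-style runner, given the "linear runtime" column) could be:

```java
import java.util.ArrayList;

public class SizeBenchmark {
    // Build 100 lists whose sizes are evenly distributed in [minSize, 4].
    static ArrayList<Integer>[] buildLists(int minSize) {
        @SuppressWarnings("unchecked")
        ArrayList<Integer>[] lists = new ArrayList[100];
        for (int i = 0; i < lists.length; i++) {
            int size = minSize + i % (4 - minSize + 1);
            ArrayList<Integer> list = new ArrayList<>();
            for (int j = 0; j < size; j++) {
                list.add(j);
            }
            lists[i] = list;
        }
        return lists;
    }

    // The timed inner loop: nothing but size() calls on the lists.
    static int sumSizes(ArrayList<Integer>[] lists, int reps) {
        int dummy = 0;
        for (int r = 0; r < reps; r++) {
            for (ArrayList<Integer> list : lists) {
                dummy += list.size(); // the call whose dispatch cost is measured
            }
        }
        return dummy; // returned so the JIT can't dead-code-eliminate the loop
    }

    public static void main(String[] args) {
        System.out.println(sumSizes(buildLists(0), 1)); // prints 200
    }
}
```

Swapping the element type of the array from ArrayList to ImmutableList (built via ImmutableList.of) gives the second row group above.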

You can see that when all lists have at least 2 elements, performance is comparable (~91 ns vs ~61 ns), with the difference here being attributable to CHA - in the ArrayList case the compiler can prove that the ArrayList class is effectively final, and can avoid the inline type check (so even in the >= 2 element case, the specializations hurt).

With 1 element lists present, the call is bimorphic, so it can still be inlined and optimized, but the extra check costs a bit.

With 0 element lists, performance tanks. The call is megamorphic and can't be well optimized. The cost of the call is ~20 times worse than ArrayList.

The penalty applies to every List<> call, not just size().

In the above benchmark, the type of the array of List was ArrayList[] and ImmutableList[] respectively.

If you change that to be List[] in both cases, the performance degrades even more:

benchmark     minSize      ns linear runtime
ArrayList           0    90.6 =
ArrayList           1    90.8 =
ArrayList           2    90.7 =
ArrayList           3    90.8 =

ImmutableList       0  2061.3 ==============================
ImmutableList       1   115.2 =
ImmutableList       2    90.6 =
ImmutableList       3    90.7 =

CHA isn't in play now, because the type of the array is List, which has multiple implementations, so ArrayList degrades to ~91 ns, just like ImmutableList.

The worst-case ImmutableList performance has been cut in half, however, since the call is now a megamorphic invokeinterface rather than an invokevirtual, ugh.

This kind of scenario is not at all far-fetched in real world code - it is natural to have lists of length 0 and 1 (which is probably why these specializations were created in the first place), and it is not unusual to find them mixed at a single call site - often in a hot method that takes a List<> as input. Unlike something like branch prediction, the worst-case behavior will occur permanently once the specializations have ever been seen at the call site. Even if 0- or 1-element lists are very uncommon, once you see one of each, you are hosed until you restart the VM (at least in every JIT that I know of).

The benchmark doesn't even test the worst case - the pattern of list sizes is totally regular, so the branches inherent in the bimorphic and megamorphic dispatch will be well predicted. Randomize it and it will get worse (especially the bimorphic case, because it is pretty fast so has a lot further to fall).

The same issue could occur with SubList and perhaps ReverseImmutableList also, although probably less frequently, and there may be no good alternative there.

A reasonable compromise here would be to keep only the Singleton list implementation, and use a global static instance of that for the 0-element case also, with a null element array, and special-case all the methods to do the right thing for a null array. This will get you most of the memory and GC benefit of the current implementation (only a slight increase for 0-element lists), while making the worst case bimorphic, which isn't too bad.
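A hypothetical sketch of that compromise (class name invented; Guava's real internals differ) - since Guava's immutable collections reject null elements anyway, a null field can safely mark the shared empty instance:

```java
import java.util.AbstractList;

// Sketch only, not Guava code: one class serves both the 0- and 1-element
// cases, so mixing those sizes at a call site stays monomorphic, and mixing
// with the general-purpose implementation is at worst bimorphic.
final class SmallImmutableListSketch<E> extends AbstractList<E> {
    // Global shared instance for all 0-element lists.
    static final SmallImmutableListSketch<Object> EMPTY =
        new SmallImmutableListSketch<>(null);

    private final E element; // null means "empty list"

    SmallImmutableListSketch(E element) {
        this.element = element;
    }

    @Override
    public int size() {
        return element == null ? 0 : 1;
    }

    @Override
    public E get(int index) {
        if (element == null || index != 0) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return element;
    }
}
```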

Alternately, you could keep only the 0-element case, and group 1-element lists into the general case. This will increase memory use for 1-element lists by 2.5x (16 bytes vs 40 bytes on HotSpot, according to my back-of-the-napkin calcs), and probably have somewhat worse runtime performance for heavy use of 1-element lists.

Benchmark attached.


Original comment posted by travis.downs on 2013-01-27 at 05:45 AM


Benchmark attached here...


Original comment posted by travis.downs on 2013-01-27 at 05:46 AM


You can uncomment the block above setUp(), and comment out the corresponding block right above that to try the List[] case (second set of results above).


Original comment posted by cpovirk@google.com on 2013-01-28 at 04:38 PM


Wow, thanks. Coincidentally, we recently started some internal discussions about the value of methods like EmptyImmutableList.isEmpty(). Your benchmark numbers are beyond my worst fears (20x??). Further, your reasoning shows that a solution would have to do more than just remove method overrides: It would actually have to remove implementation classes. And, as you point out, there are more of these than Empty+Singleton+Regular. (See also the various ImmutableAsList classes.) For this reason, List in particular may end up being a lost cause, but this deserves more thought. Maybe we can hide all the "weird" implementations behind a ForwardingImmutableList that delegates to non-ImmutableList-but-actually-immutable delegates? Or maybe we should focus our attention on immutable classes with fewer implementations (ImmutableMultimap)?

The distressing part of this is that the primary goal of the multiple "normal" ImmutableList implementations is to save memory, and there's only so much we can do to preserve the memory savings while eliminating subclasses. I like your suggestion of eliminating EmptyImmutableList (with the caveat that the existence of other ImmutableList implementations may still be a megamorphism problem). Another way of doing that is to use a RegularImmutableList for empty lists.

The overall size-space tradeoff is a hard one to evaluate, so I don't know exactly what we'll do here, but SingletonImmutableList, for one, looks tenuous. Thanks for supplying the numbers that have been lacking in our evaluation so far.


Status: Research
CC: lowasser@google.com, gak@google.com


Original comment posted by ogregoire on 2013-01-28 at 05:12 PM


Please try to @Param numLists with other values too. The benchmark shows the same tendency, but higher values of numLists show less and less difference compared to 100. Also, the singleton list may be an actual performance gain with higher values of numLists.

I actually don't know how to interpret that, so I'm just giving a lead here.


Original comment posted by lowasser@google.com on 2013-01-28 at 05:53 PM


ImmutableList does seem likely to always have many subclasses -- mostly the ImmutableAsLists.

Chris, I'm not sure I follow how ForwardingImmutableList would work, or would save anything?

I think we could safely eliminate EmptyImmutableList and its brethren -- as long as we maintain a singleton holding the empty ImmutableList instance, it will only require a constant amount of memory for the whole VM.

That said, I'm not 100% convinced the benchmark is valid, if only because the List interface itself is likely to have more implementations used than just ArrayList and ImmutableList...? Most implementations will be List references, not ArrayList-specific references, so I'd expect most operations on a basic List to be polymorphic calls.


Original comment posted by cpovirk@google.com on 2013-01-28 at 06:28 PM


Chris, I'm not sure I follow how ForwardingImmutableList would work, or
would save anything?

We can sacrifice performance on uncommon implementations by making the common call bimorphic:

ImmutableList has two subclasses, RegularImmutableList and ForwardingImmutableList. RegularImmutableList is used for 0-, many-, and perhaps 1-element lists. ForwardingImmutableList is used for everything else, including ImmutableAsList. ForwardingImmutableList is a wrapper around a plain (though immutable) List. Any other "ImmutableList implementations" are actually List implementations that get wrapped in a ForwardingImmutableList.
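A minimal sketch of that shape (class names echo the discussion, but the bodies here are invented, not Guava's actual code):

```java
import java.util.AbstractList;
import java.util.List;

// Sketch of the two-subclass idea: every call site sees at most two
// concrete subclasses, so dispatch is at worst bimorphic.
abstract class ImmutableListSketch<E> extends AbstractList<E> {}

// Covers 0-, many-, and perhaps 1-element lists: the common case.
final class RegularListSketch<E> extends ImmutableListSketch<E> {
    private final Object[] elements;

    RegularListSketch(Object[] elements) {
        this.elements = elements.clone();
    }

    @Override
    @SuppressWarnings("unchecked")
    public E get(int index) { return (E) elements[index]; }

    @Override
    public int size() { return elements.length; }
}

// Every "weird" implementation becomes a plain (immutable-in-practice)
// List wrapped in this one forwarding class.
final class ForwardingListSketch<E> extends ImmutableListSketch<E> {
    private final List<E> delegate;

    ForwardingListSketch(List<E> delegate) {
        this.delegate = delegate;
    }

    @Override
    public E get(int index) { return delegate.get(index); }

    @Override
    public int size() { return delegate.size(); }
}
```

The uncommon implementations pay an extra indirection through the delegate, but the common regular case stays monomorphic at most call sites.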

That said, I'm not 100% convinced the benchmark is valid, if only because
the List interface itself is likely to have more implementations used than
just ArrayList and ImmutableList...? Most implementations will be List
references, not ArrayList-specific references, so I'd expect most operations
on a basic List to be polymorphic calls.

While we have been encouraging people to store references in fields of type ImmutableList, we can be sure that that advice isn't always followed. In any case, it would be nice if the people who did follow the advice could get a performance boost out of it.

But there may be only so much gain from any of these approaches. It may be that the right takeaway from all this is that our attempts to subclass ImmutableList for speed are fruitless because, even if we could eliminate the entire 90ns cost of "the method itself," the percent performance increase would be in the single digits. Despite years of advice that the immutable implementations are probably more efficient than the JDK classes, we've avoided writing these benchmarks. (Well, at least we have some for Set.)


Original comment posted by lowasser@google.com on 2013-01-28 at 06:34 PM


To be clear, I'm arguing that the benchmark overestimates the speed of ArrayList. I also have to admit that I'd've assumed that if the ForwardingImmutableList procedure you describe worked, then the JIT would already have implemented a similar technique to deal with polymorphic calls when certain implementations were much more common than others.


Original comment posted by travis.downs on 2013-01-30 at 06:17 AM


Given the discussion so far, I think there is a bit of a misunderstanding about what makes things slow, so here's a quick summary based on the comments above, and then the gory details follow in the next post.

Most implementations will be List
references, not ArrayList-specific references, so I'd expect most operations
on a basic List to be polymorphic calls.

The important thing for performance isn't whether the "proven" type of the reference has subclasses, but whether multiple implementations have actually been seen in practice at some call site. In the language of my post to follow, CHA doesn't apply if List has multiple implementations, but "inline caching" does, as long as only one implementation is seen in practice at each call site. That's why ArrayList is fast (one class returned by Lists.newArrayList()) and ImmutableList is slow (multiple implementations returned in practice, for minSize in [0, 1] - but the 2-implementation case is still optimized well).

This is shown clearly by the second benchmark - the type is now List<>, but ArrayList and ImmutableList (for length > 1) are still very fast (a 0.3 ns penalty per call, since CHA isn't in play). The massive penalty incurred by the >1-implementation cases of ImmList is still being paid, and the relative cost is even greater (20x now vs 10x) since interface types are being used.

Declaring fields as ImmutableList vs List actually does give a performance boost, but only because ImmutableList is a class, and not an interface, so it can use the faster vtable dispatch vs the interface dispatch. That's secondary to the issue here - even virtual dispatch is very slow compared to inlined & fully optimized access.

Maybe we can hide all the "weird" implementations behind a ForwardingImmutableList
that delegates to non-ImmutableList-but-actually-immutable delegates? Or maybe we
should focus our attention on immutable classes with fewer implementations (ImmutableMultimap)?

As above, the issue isn't total implementations, but total implementations likely to be seen at one call site. I focused on the types returned by ImmutableList.of() because three different implementations can be transparently returned by .of(), completely undocumented, so you can get different concrete classes even if you did your best to avoid it. Such instances are very likely to end up at the same call site, and cause poor performance, compared to the other implementations, which will often never appear at any call site (see the next part, for an important clarification about "call site").

The other implementations, while common in some cases, only arise as the result of distinct java call sequences, so are likely to be limited to many fewer effective call sites (the difference between java-level call sites and effective call sites after inlining is critical, so see the last part of the next post).


Original comment posted by travis.downs on 2013-01-30 at 06:54 AM


Here are the gory details about JIT method inlining & optimization in general. When I say "JIT" here, I mean hotspot, but I don't think it is going to vary too much across other state-of-the-art JITs out there - or at least they'll usually be worse in specific cases, not better. This should help you when you are analyzing the tradeoffs to make.

Unlike, say, C++, every non-final instance method in Java is potentially virtual - that is, polymorphic dispatch is the default. Naively implemented, these calls are brutal for performance - partly because the dispatch requires some indirection, but mostly because they aren't inlined. Once a method isn't inlined, you lose all sorts of optimizations involving the caller and callee, so inlining is really the "granddaddy" optimization that enables the rest of the optimization family to kick in for hot code with method calls. It's worth noting that the fast cases for my benchmark take between 0.6 ns and 0.9 ns per call, which is exactly 2 or 3 cycles (3.33 GHz box) - pretty amazing given that there are a dozen bytecodes or more involved in each loop. The JIT has done a good job!

Because inlining is so important and virtual methods are so prevalent, a key optimization for the JIT is "devirtualization" - taking a method which is polymorphic in theory and compiling it as if the actual implementation were known, including inlining it. It does this through two very different mechanisms, with different behaviors and different implications for class hierarchy design, which I will describe below:

Class Hierarchy Analysis (CHA):

This is the trick that makes the final keyword on methods mostly useless for performance, despite lots of out-of-date advice to the contrary (it's still useful as an enforced restriction at compile time). If a method m() is declared final on class C, the compiler knows that for an object of type C or a subclass (but not a superclass), a call to m() will always resolve to the single final implementation, and it can be fully inlined and optimized.

Most methods, however, are not declared final, but are effectively final since they have no overrides in the currently loaded set of classes. Since the JVM knows, at all times, the complete set of loaded classes, it can perform this "is effectively final" analysis for each method. If m() has no overrides in any loaded class, it can be optimized just as if it were final.

The set of loaded classes can change of course, so some CHA-based assumptions which were valid when some code was compiled may become invalid later. This is handled through deoptimization when a class is loaded that violates earlier CHA assumptions. The deoptimization always occurs at a safepoint, so it piggy-backs on the existing safepoint polling mechanisms and basically means that CHA is "free" from the point of view of the compiled code - there is no trace of the fact that the CHA could become invalid in the generated assembly, other than safepoint polling, which is required regardless.

OK, so the key points about CHA based devirtualization:

  1. It applies to the entire VM, per method - either some method can be devirtualized via CHA everywhere, or nowhere (actually it could be per-classloader, but the general idea that it's a global optimization is the same).
  2. As a consequence of (1), if any class is loaded that overrides a method, CHA for that method is invalid everywhere a method is called, even if some call site never actually sees the overridden classes.
  3. CHA works on a method basis - if you have a class with M methods, and a subclass of that that overrides N methods (N < M), CHA is still applicable to the non-overridden methods.

CHA is great when it works, but it is pretty useless for this issue. As Louis points out above, interfaces like List will always have multiple loaded implementations, so CHA will almost never apply for List. It can still apply when the type of the object can be proven to be more specific than List, which is what occurs in the first benchmark - a concrete List implementation, like ArrayList, may be subject to CHA if no loaded class extends it (as in the benchmark), but in most large projects someone is probably going to extend it (BTW, this is a good reason to make performance-sensitive classes final).

The only time CHA shows up in the benchmarks above is the "fast case" in the first benchmark, with all lists of size 3 or 4. ArrayList gets 60 ns, while ImmList gets 90 ns, the difference being that CHA works for ArrayList but not for ImmList, which has multiple loaded (but not used, except perhaps internally in guava) implementations. In the second benchmark, where the array type is List rather than ArrayList/ImmutableList, CHA doesn't apply (since the call is against List, which does have multiple implementations). So CHA isn't big-picture important here, because it only comes into play in the 60 ns vs 90 ns difference (which is 0.3 ns per call, since 100 lists are involved - 0.3 ns is exactly one cycle).

Next, on to the optimization that is important here: inline caching.

This optimization applies when CHA can't - that is, when a method has overrides in loaded classes. So it applies all the time for things like List, which always have multiple implementations. In this case, the JIT takes advantage of the fact that even though overrides may exist, at any particular call site (a call site being a possibly inlined method invocation in the JIT'd assembly) there is often exactly one implementing class in practice. For ArrayList this is always true in my benchmark, and it is true for the minSize 3,4 case for ImmList. Take a look at your codebase (not guava, though, since you don't know who is calling your methods!) at, say, 10 random calls of List.get() or List.size() and trace back all the possible origins of the List in question - you'll often find that it is always a certain type.

What the JIT does here is look at the profiling information gathered during the interpreted phase and, if only 1 or 2 types of objects have been seen at that call site, optimistically assume that the type of the object will be one of those 1 or 2 types (the monomorphic and bimorphic cases). This allows it to inline/optimize the method pretty much the same way as CHA, above. The difference is that the assumption might be wrong, and there is no classloading event & safepoint poll that can be hooked to guard this case, so it actually needs to insert a check in the generated assembly, immediately before the actual code of the method, to ensure the type of the object is one of the assumed type(s). If this check fails, the method is deoptimized and recompiled with the new information. At most this can occur twice: mono -> bi, then bi -> mega. In the benchmarks above, this is the reason for the 90 ns ImmList times vs the 60 ns ArrayList times - ArrayList was using CHA, while ImmList needed an inline cache, and the extra check & branch cost 1 cycle. Still extremely fast. It's also why the ImmList minSize=1 case takes 107 ns vs 90 ns - we can still use inline caches, because we are bimorphic (the singleton ImmList and the regular ImmList), but we need an extra check to cover the two cases (the branch is well predicted because the pattern of lists is completely regular, but if it weren't, this case would suck too).
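Conceptually, the guard a monomorphic inline cache inserts can be modeled in plain Java (a toy illustration only - real HotSpot emits a one-cycle class-word compare in assembly, and the names and counters here are invented):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class InlineCacheSketch {
    static int fastPathHits = 0;
    static int slowPathHits = 0;

    // Toy model of a monomorphic inline cache: one cheap class check
    // guards the "inlined" body; any other receiver class falls through
    // to the slow path (where the real JIT would deoptimize/recompile).
    static int guardedSize(List<?> list) {
        if (list.getClass() == ArrayList.class) {
            fastPathHits++; // the fully inlined, optimized body would live here
        } else {
            slowPathHits++; // guard failed: full virtual/interface dispatch
        }
        return list.size();
    }

    public static void main(String[] args) {
        guardedSize(new ArrayList<String>());
        guardedSize(new LinkedList<String>());
        System.out.println(fastPathHits + " fast, " + slowPathHits + " slow");
    }
}
```

Note the guard tests the receiver's concrete class, not the method being called - which is exactly why subclasses that override nothing still defeat it (point 4 below).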

When more than 2 implementations have been observed at a call site, however, the JIT gives up - it doesn't inline the method and must compile an expensive virtual method call. If the underlying type is a class, this involves some indirection and parameter marshaling; if it is an interface, it involves that, plus a linear search through the implemented interfaces to find the right one (because the class hierarchy is a tree, but the interface + class hierarchy is a DAG). This case is the ~1170 ns and ~2060 ns ImmList cases. The call site has > 2 implementations of ImmList (or List in the second benchmark), so it is megamorphic not only in practice, but in theory. The second benchmark is 2x as bad as the first because the proven type of the receiver (the object on which the method is called) is List, an interface, not ImmutableList, an (abstract) class, so it has to pay the additional interface search cost.

The key points about inline caches:

  1. It applies only when CHA doesn't - that is, when there are multiple implementations of a class/interface. More precisely, when a method is called where there are multiple implementations of that particular method in subclasses of the proven type of the receiver.
  2. It applies on a per-call-site basis. Every call site is treated independently by the above. So the optimization can apply in many cases for some method n(), while failing at particular call sites of n() where it happens that multiple receivers are observed. So in many cases, global program behavior won't affect particular call sites.
  3. It applies with 1 or 2 receiver types, not more (CHA applies with exactly 1 type).
  4. It is based on the type of the object, not the specific method.

Of the points above, (2) and (4) are particularly important here.

(2) means that most of the discussion above about how many implementations of a class/interface exist is probably wrongly aimed - it's not how many exist, but how many actual instances of those implementations appear at a particular site. In the benchmark, ArrayList is always actually an ArrayList because Lists.newArrayList always returns an ArrayList. The ImmutableLists, however, even though they all originate from the same factory method, ImmutableList.of(), vary in their most-derived types! This is why I focused on the types returned by ImmutableList.of() - even though other specializations of ImmutableList exist, they are returned in specific cases and involve a different call path from a usual creation (sublist, reverse calls). On the other hand, 3 different implementations can be transparently returned by .of(), completely undocumented, so you can get different concrete classes even if you did your best to avoid it. Such instances are very likely to end up at the same call site, even in high-performance code, and cause poor performance, compared to the other implementations, which will often never appear at any call site (see the next part, for an important clarification about "call site").

The above makes clear that "call sites" are critically important. What is a distinct call site? A sufficient condition is usually an actual call of a Java method in the Java code:

static <E> boolean isInEither(List<? super E> list1, List<? super E> list2, E elem) {
   return list1.contains(elem) || list2.contains(elem);
}

However, this isn't a necessary condition. For example, the method above contains two Java-level call sites for contains() on the same line. This method is extremely simple, probably less than 35 bytecodes, and certainly less than 350 bytecodes, so it will in general be inlined into any callers. This means that the contains() calls (which may or may not be inlined, but it doesn't matter here) will actually appear at each call site of the isInEither() method, meaning that different invocations of the method don't really interfere with each other. If one method wants to use ArrayLists, and one method passes LinkedLists, no problem - they don't de-optimize each other's inline caches, because due to inlining the "effective" call site is different - much more fine-grained than the (single) call site in the Java code. This type of thing is usually true for performance-sensitive code: inlining means that the effective call site is moved up the call stack to a method that is too large to be inlined, which for hot code is around ~300 bytecodes. The exact behavior depends on JVM arguments and can probably only be exactly determined with -XX:+PrintCompilation.

(4) means that unlike CHA, for inline caches to work it only matters whether you have seen multiple subclasses at the call site, not whether those subclasses actually override the called method. For example, imagine you have a subclass of ArrayList which doesn't override any methods but implements the Comparable interface (it doesn't matter why you have a subclass, I'm just trying to come up with a realistic reason). For your new class, every method in List is implemented by the ArrayList method, not any new method. However, if you mix both at a call site, the call will be bimorphic, not monomorphic, under inline caching. Under CHA, a similar situation would be monomorphic, because CHA can tell that even though there is a subclass, it doesn't override the method of interest. The way the inline cache is implemented, however, is as a simple single-word check of the expected class object versus the actual class object - this happens in one cycle, as above - there is no general way to efficiently answer the more complicated question "does the receiver override method x", and the non-general ways probably aren't interesting enough that the JIT has implemented them yet.

The key part of (4) is that unless CHA is in play you have to talk about types - whether they have subclasses or not, and not whether methods have overrides or not. This is a big downside of inline caches - even if you can design a memory-efficient subclass, which overrides very few methods, you'll pay the price on every method call of the superclass, not just on those that you override.
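The override-free-subclass scenario above can be made concrete (a hypothetical class, invented for illustration - it overrides no List methods, yet mixing it with plain ArrayList at one call site makes the inline-cache class check bimorphic, even though CHA would still see size() as having a single implementation):

```java
import java.util.ArrayList;

// Overrides nothing from ArrayList/List; only adds an unrelated interface.
// Every List method still resolves to the ArrayList implementation, but
// the receiver's concrete class differs, which is all an inline cache sees.
class TaggedArrayList<E> extends ArrayList<E>
        implements Comparable<TaggedArrayList<E>> {
    @Override
    public int compareTo(TaggedArrayList<E> other) {
        return Integer.compare(size(), other.size());
    }
}
```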


Original comment posted by travis.downs on 2013-01-30 at 07:09 AM


All that said, if I were building this myself, here are the choices I would make:

Two concrete implementations of ImmutableList:

RegularImmutableList (the current implementation)
ForwardingImmutableList (similar to the proposal above, but clarified below)

OtherImmutableList can be implemented exactly as a delegate, like:

class OtherImmutableList<E> extends ImmutableList<E> {

  final OtherImmutableListDelegate<E> delegate;

  OtherImmutableList(OtherImmutableListDelegate<E> delegate) {
    this.delegate = delegate;
  }

  @Override
  public int size() {
    return delegate.size();
  }
}

I would implement the size 0 and 1 cases of ImmutableList in terms of RegularImmutableList - size 1 not being special at all, and size 0 always returning an empty-array version of RegularImmutableList (so being very fast and zero memory cost).
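A sketch of that empty-case handling (names invented here, not Guava's actual code): the factory hands out one shared instance backed by a zero-length array, so no extra class is loaded and the empty list costs no per-instance memory.

```java
import java.util.AbstractList;

// Hypothetical array-backed immutable list where size 0 is just the
// ordinary implementation with a shared zero-length element array.
final class ArrayBackedListSketch<E> extends AbstractList<E> {
    private static final Object[] NO_ELEMENTS = {};
    private static final ArrayBackedListSketch<Object> EMPTY =
        new ArrayBackedListSketch<>(NO_ELEMENTS);

    private final Object[] elements;

    private ArrayBackedListSketch(Object[] elements) {
        this.elements = elements;
    }

    @SuppressWarnings("unchecked")
    static <E> ArrayBackedListSketch<E> of(E... elements) {
        return elements.length == 0
            ? (ArrayBackedListSketch<E>) EMPTY // shared singleton, same class
            : new ArrayBackedListSketch<>(elements.clone());
    }

    @Override
    @SuppressWarnings("unchecked")
    public E get(int index) { return (E) elements[index]; }

    @Override
    public int size() { return elements.length; }
}
```

Because empty, singleton, and larger lists are all the same concrete class, a call site that mixes them stays monomorphic.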

The benefit of this is that it keeps the common case, of regular immutable lists, even of size 0 and 1. You'll likely find this faster in the case of 0 mixed with non-zero lists than the current implementation, yet not using any more memory unless 1-element lists are common, since the 0-element list is a singleton. That is, I disagree with ogregoire's assertion that you'll find Empty faster for larger list sizes - this won't happen except in extreme cases where the entire working set can't fit in L3 and the empty list avoids one cache miss (it doesn't need to examine .length).

For the unusual case, things will be about as fast as possible. As long as only 1 "other" implementation occurs at a call site, you'll end up with a bimorphic check + monomorphic check and branch. I'm just guessing here (not online), but something like 150-200 ns vs 90 ns for the monomorphic case. Branch prediction will help a lot here, since unlike the JIT it is dynamic for the life of the process and can recognize patterns. If you have mostly regular ImmLists mixed with a few "other" lists (of any type), you'll get close to 90 ns, since the JIT will set up the branches right, and branch prediction will get it right most of the time.


Original comment posted by travis.downs on 2013-01-30 at 07:10 AM


Sorry, above OtherImmutableList == RegularImmutableList, made a last second edit that I can't fix now.


Original comment posted by cpovirk@google.com on 2013-01-30 at 03:31 PM


Thanks for all the details. I'm bookmarking this for future reference.

When it comes to other implementations, my concern is less with reverse() and other oddballs and more with RegularImmutableAsList, which is what you get from calling ImmutableList.copyOf(immutableSetOfMultipleElements). I'm envisioning a constructor that accepts an Iterable and calls copyOf() on it. Some callers will pass an ImmutableSet and get RegularImmutableAsList; others will pass other types and get RegularImmutableList.

I have no idea how common this is. That's the source of my fears that ForwardingImmutableList may be common. (Even if it is, will the objects that end up holding RegularImmutableList and the objects that end up holding RegularImmutableAsList end up being the same "call sites," or will inlining separate them? That's even harder to say. It's plausible, though.)

Having two types is, of course, a problem addressed by your solution of RegularImmutableList+ForwardingImmutableList. And now, by eliminating SingletonImmutableList, we're making tradeoffs between memory and CPU. Hmm.


Original comment posted by lowasser@google.com on 2013-01-30 at 05:46 PM


@cpovirk: I'm fairly convinced that making sure the "common case" -- a vanilla ImmutableList -- is monomorphic should be our primary priority?


Original comment posted by travis.downs on 2013-01-30 at 06:11 PM


Yeah I guess the ImmutableAsList stuff can definitely cause issues too, but as you point out it all comes down to inlining: if the methods of the class that takes Iterable in the constructor are small enough to inline into callers, different users of that class shouldn't be impacted too much.

I see the behavior of ImmutableList.copyOf() as worse in the plain iterable case, however - you can pass in a series of identically typed Iterables, varying only by length, and get different implementations back.

At least in the ImmutableSetAsList case, the caller shares some responsibility because he is passing in different types as input in the first place.


Original comment posted by gak@google.com on 2013-01-30 at 07:06 PM


@lowasser, keep in mind that this whole issue is a tiny slice of the larger performance problem. We have a benchmark (that needs review) that shows certain characteristics for runtime performance for a certain invocation pattern. In general, we've made the choice to prioritize memory consumption over CPU for both the raw # of bytes in the heap and the overhead of GC. We haven't even started the discussion about CPU cost relative to any possible memory implications.

Basically, I think this is a great discussion that draws the conclusion that "all things being equal, monomorphic calls produce faster code than megamorphic calls". I agree. The JDK developers agree. We just need to make sure that we're keeping the larger performance picture in mind.

Making ImmutableList monomorphic shouldn't be our "primary priority" as it is a means, not an end. Making certain calls monomorphic that are likely to be invoked in certain patterns is as good a strategy as any other, provided informed decisions are made about the trade-offs.


Original comment posted by travis.downs on 2013-01-31 at 04:30 AM


Well at least the case of EmptyImmutableList isn't a case of favoring memory use over CPU, since it's a singleton. So at least there the intent must have been a faster specialization, since all methods can be hardcoded.

That seems like a straightforward win to remove - no increase in memory use (ok, a few bytes process-wide) and it should be faster in practice, except in degenerate cases like a huge number of empty lists.
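A sketch of what that straightforward win could look like (hypothetical FlatList, not Guava's implementation): drop the empty subclass entirely and serve the empty case from a shared instance of the one regular class, so empty and non-empty lists share a concrete type.

```java
// Hypothetical sketch: no dedicated empty subclass. The empty case reuses
// the single regular implementation via a shared instance, so call sites
// mixing empty and non-empty lists stay monomorphic.
final class FlatList<E> {
    private static final FlatList<Object> EMPTY = new FlatList<>(new Object[0]);

    private final Object[] elements;

    private FlatList(Object[] elements) { this.elements = elements; }

    @SuppressWarnings("unchecked")
    static <E> FlatList<E> of() {
        return (FlatList<E>) EMPTY; // no extra class, just a shared singleton
    }

    static <E> FlatList<E> copyOf(E[] source) {
        return source.length == 0 ? of() : new FlatList<>(source.clone());
    }

    int size() { return elements.length; }
}
```

The empty list is still allocated only once process-wide, but its getClass() now matches every other FlatList, which is the property the dispatch cares about.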

The size-1 case is definitely more subtle - I'd be curious whether your internal profiling shows these lists make up a large share of aggregate memory use.

@cgdecker cgdecker removed the migrated label Nov 1, 2014
lowasser added a commit that referenced this issue Apr 16, 2015
…with a null hash table.

Related: #1268
-------------
Created by MOE: http://code.google.com/p/moe-java
MOE_MIGRATED_REVID=89625833
@lowasser

We've been addressing this, and most immutable collection types are down to bimorphic now. We'll continue paying attention to this, though.

tommyettinger added a commit to yellowstonegames/SquidLib that referenced this issue Mar 5, 2019
This is part of a potentially-large effort to reduce the number of implementations of common interfaces, like Collection and Set, that might be passed at any of an individual program's frequent callsites. Arrays.asList() and Collections.emptySet() are serious offenders here, each defining its own class it returns but that is used nowhere else. If you expect an ArrayList, then use Maker.makeList(), which now has an optimization for the one-argument case. If you expect an OrderedSet, there's Maker.makeOS(), but if you expect it to be empty, just use new OrderedSet<>() rather than Collections.emptySet(). Similarly, there's Maker.makeUOS() for UnorderedSet instances. If you want to know the effect this can have, see this Guava issue google/guava#1268 ; a 20x slowdown can happen at megamorphic callsites. There's an explanation of the terms I used here http://insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/ .
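The JDK examples the commit message calls out are easy to verify: Arrays.asList() and Collections.emptySet() each return a private one-off class rather than ArrayList or HashSet, so they add receiver classes to any call site they feed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OneOffClasses {
    public static void main(String[] args) {
        List<String> fromArray = Arrays.asList("a");
        Set<String> empty = Collections.emptySet();

        // Arrays.asList returns java.util.Arrays$ArrayList, a private class
        // distinct from java.util.ArrayList.
        System.out.println(fromArray.getClass() == ArrayList.class); // false

        // Collections.emptySet returns java.util.Collections$EmptySet,
        // likewise distinct from HashSet.
        System.out.println(empty.getClass() == HashSet.class); // false
    }
}
```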
@thespags

Is the benchmark that was run still available with the move to GitHub?

@travisdowns

@thespags - the Wayback Machine has it.

I've attached it here (because I can't attach it to the first message in this thread):

ArrayListVsImmutableList.zip

tommyettinger added a commit to yellowstonegames/SquidLib that referenced this issue Dec 23, 2021