
Avoid creating an entry whose key is null at construction time. #6198

Closed
wants to merge 1 commit

Conversation

copybara-service[bot]
Contributor


In the case of an entry whose key is stored in a `WeakReference`, the resulting entry will never be enqueued in the `ReferenceQueue`, so I'm not sure that it will necessarily ever be removed from the cache.

Compare a similar change to `MapMakerInternalMap` that was part of cl/479157599.

I get the impression that there may be a similar situation with weak _values_, but I'm not biting that off right now.

As always, we [recommend](https://guava.dev/CacheBuilder) that you use [Caffeine](https://github.com/ben-manes/caffeine/wiki) instead of Guava's `cache` library.

RELNOTES=n/a
PiperOrigin-RevId: 479318112
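
For concreteness, here is a minimal sketch, using only plain `java.lang.ref` APIs rather than Guava's internals, of the failure mode the description refers to: a `WeakReference` whose referent is already null has nothing for the garbage collector to clear, so it is never enqueued, and queue-driven cleanup never sees the entry.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class NullReferentDemo {
  public static void main(String[] args) throws InterruptedException {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();

    // A reference constructed around a null referent can never be cleared
    // by the GC, so it will never show up on the queue.
    WeakReference<Object> ref = new WeakReference<>(null, queue);

    System.gc();
    Thread.sleep(100);

    // Prints "enqueued? false": queue-driven cleanup would never remove a
    // cache entry built this way, so it would linger indefinitely.
    System.out.println("enqueued? " + (queue.poll() != null));
  }
}
```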
@cpovirk
Member

cpovirk commented Oct 7, 2022

@ben-manes FYI

I don't think you have code quite like this in Caffeine, in part because you delegate the work of implementing a hash table to ConcurrentHashMap. And when you do call getKey(), you appear to have been more diligent than us about consistently using the result of the same call. But here's yet another reason to tell people to use Caffeine :)
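
A hedged sketch of the "result of the same call" point. `WeakKeyEntry` here is a hypothetical stand-in for illustration only, not Guava's or Caffeine's actual entry API.

```java
// Hypothetical weak-keyed entry: getKey() starts returning null once the
// weak key has been cleared by the GC.
interface WeakKeyEntry<K, V> {
  K getKey();
  V getValue();
}

class CopyExample {
  // Risky pattern: two getKey() calls can disagree if the GC clears the
  // weak key in between, so the copy may end up with a null key (or an
  // NPE, depending on the target map).
  static <K, V> void copyRacy(WeakKeyEntry<K, V> e, java.util.Map<K, V> target) {
    if (e.getKey() != null) {
      target.put(e.getKey(), e.getValue()); // second call may now return null
    }
  }

  // Safer pattern: read the key once into a local and reuse that value.
  static <K, V> void copySafely(WeakKeyEntry<K, V> e, java.util.Map<K, V> target) {
    K key = e.getKey();
    if (key == null) {
      return; // key already collected; skip instead of copying a dead entry
    }
    target.put(key, e.getValue());
  }
}
```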

@ben-manes
Contributor

Thanks for the ping!

This is an interesting and subtle bug! Like you said, it shouldn't impact Caffeine since we don't have a custom hash table where such mistakes can happen. I understood why Guava decided to fork and disagreed with that decision back then, so I took the easier path of delegation. Delegation is less memory-optimal for reference caching because it wraps the key/value instead of inlining the fields onto the hash entry. Caffeine optimizes towards size eviction, so reference caching is second class as the rarer case, whereas Guava initially took a big bet on soft references as the best policy and got bitten hard when that did not pan out.

Caffeine's tests include a large number of reference tests and it automatically performs an internal state check after every successful test case. All the tests run against Guava as well for a compatibility check, so we could try to write a similar internal state checker if helpful. Unfortunately, I'm afraid that any effort we put into finding more bugs here won't translate into anyone having the time to fix them.
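
To make the delegation-versus-inlining trade-off concrete, here is a rough sketch of the two entry layouts. The class names are invented for illustration and match neither Caffeine's nor Guava's real code.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Delegation: the key is wrapped in its own WeakReference object and the
// wrapper is stored in an off-the-shelf ConcurrentHashMap. Costs one extra
// object per entry, but no custom hash table to maintain.
final class WeakKeyWrapper<K> extends WeakReference<K> {
  final int hash; // cached, since the referent can disappear at any time

  WeakKeyWrapper(K key, ReferenceQueue<K> queue) {
    super(key, queue);
    // Identity equivalence, as weak-keyed maps typically use.
    this.hash = System.identityHashCode(key);
  }

  @Override public int hashCode() {
    return hash;
  }

  @Override public boolean equals(Object o) {
    if (o == this) {
      return true;
    }
    if (!(o instanceof WeakKeyWrapper)) {
      return false;
    }
    K key = get();
    return key != null && key == ((WeakKeyWrapper<?>) o).get();
  }
}

// Inlining: the hash-table entry itself *is* the WeakReference, with the
// value and bucket link as fields on it. No wrapper object per entry, but
// it only works if you maintain your own hash table.
final class InlinedWeakEntry<K, V> extends WeakReference<K> {
  final int hash;
  volatile V value;
  InlinedWeakEntry<K, V> next; // bucket chain

  InlinedWeakEntry(K key, V value, int hash, ReferenceQueue<K> queue) {
    super(key, queue);
    this.value = value;
    this.hash = hash;
  }
}
```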

@cpovirk
Member

cpovirk commented Oct 10, 2022

Yes, please never invest your time in anything that would rely on us to address cache bugs in response :( The problem here wasn't something that we learned about because a user encountered and reported it; instead, a developer happened to notice the identical bug in the MapMakerInternalMap source code as part of implementing #6197 / a2e8f3c.

As a fun twist, the developer was modifying MapMakerInternalMap because even it was not close enough to memory-optimal for one team :) However, that team is an unusual team by Google standards. And they may ultimately end up forking MapMakerInternalMap to make it still closer to memory-optimal. I'm not sure if that forking counts as a reason for us to have taken the MapMakerInternalMap approach (since it would make the fork easier) or a reason not to (since it still doesn't save the team that really cares about memory usage from having to develop a fork) :)

(The context for all that is weak references and interning. We do seem to be gradually training people to avoid soft references, thankfully.)

@ben-manes
Contributor

If they care about more memory than concurrency, then maybe forking WeakHashMap is more appropriate?

@cpovirk
Member

cpovirk commented Oct 10, 2022

Thanks. I think they care about concurrency, too. It's possible that what they'll actually fork in the end is ConcurrentHashMapV8.java, but I imagine that would be tricky :)

@ben-manes
Contributor

Maybe they can look at Spring's ConcurrentReferenceHashMap, which is custom and uses segmented locking. The JBoss implementation is similar to MapMaker's, so there might not be much of a difference (Google Collections had ReferenceMap via delegation, and the author felt one-upped, so MapMaker was an attempt at showmanship rather than thoughtful design).

@justinhorvitz

Hi, I'm the mysterious unnamed developer. You may remember me from ben-manes/caffeine#568. The project in question is to remove the memory overhead of interning in https://github.com/bazelbuild/bazel. I didn't bother open sourcing the design document, but in summary, the plan is to use existing bazel data structures to find canonical instances, and for various reasons (e.g. to support retrieval of map keys), we are considering maintaining our own concurrent map implementation. Thanks for the suggestion of Spring's implementation. I'm still weighing the pros and cons of each.

@ben-manes
Contributor

I had a feeling it was you 😄

I believe NonBlockingHashMap was slimmer than the JDK’s, but had footguns. It lacks reference caching, but is a different hash table design for consideration. I think maintaining a generic hash table as a micro optimization is cumbersome, but specializing when truly needed is reasonable. Good luck in finding a nice fit.

@justinhorvitz

Actually, we don't need weak/soft reference support. So that gives us some more options.

@ben-manes
Contributor

Oh what do you need exactly?

@justinhorvitz

A concurrent map that supports:

  1. Retrieving existing keys (for interning)
  2. A parallel iteration method similar to [ConcurrentHashMap#forEach](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ConcurrentHashMap.html#forEach(long,java.util.function.BiConsumer)). We actually don't need this to be compatible with concurrent updates.
  3. Ideally entries have no more than 3 fields (key, value, next). Hash can be computed on demand by calling key.hashCode().

We might also need the ability to synchronize between two of these maps so that a key never exists in both. I haven't yet decided whether this will require anything special or whether we can get away with, e.g., a side effect in computeIfAbsent.
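
A rough sketch of requirement 1 (retrieving existing keys for interning) built on a plain `ConcurrentHashMap` used as a self-map; this is only an illustration of the idea, not the design actually under consideration.

```java
import java.util.concurrent.ConcurrentHashMap;

final class SelfInterner<T> {
  private final ConcurrentHashMap<T, T> canonical = new ConcurrentHashMap<>();

  // Returns the canonical instance equal to the sample, inserting the
  // sample if no canonical instance exists yet.
  T intern(T sample) {
    T existing = canonical.putIfAbsent(sample, sample);
    return existing != null ? existing : sample;
  }

  void remove(T sample) {
    canonical.remove(sample, sample);
  }

  // Requirement 2: ConcurrentHashMap already offers a parallel forEach,
  // driven by a parallelism threshold.
  void dumpAll() {
    canonical.forEach(1L, (key, value) -> System.out.println(key));
  }
}
```

Note that a self-map like this stores each canonical object in both the key and value fields, so it already falls short of requirement 3; that redundancy is presumably part of the motivation for a custom implementation (ConcurrentHashMap.newKeySet, by contrast, cannot hand back the existing element).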

@ben-manes
Contributor

NBHM offers getk and does not store the hash code. It should be fast for traversal since it does not chain on collisions, so it might be trivially parallelized. However, the key may not be modified even after removal due to tombstoning, among other gotchas. Cliff Click might be approachable if you find it a fit.

@cpovirk
Member

cpovirk commented Oct 11, 2022

Thanks, Ben.

And ooo, my mistake about weak keys, thanks. That makes NBHM look appealing for another reason (on top of the lack of chaining): If you don't need a WeakReference, then you don't need to store an "entry" object at all, just the single array for keys+values. And IIUC, that's how NBHM is implemented. (Compare our Android RegularImmutableMap, except NBHM is way more sophisticated in order to support modification :))
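
A toy, single-threaded illustration of the "just one array, no entry objects" layout: keys and values live in alternating slots of a single `Object[]` and lookups use open addressing (linear probing here). NBHM's real lock-free implementation is far more sophisticated (concurrent resizing, tombstones, etc.), so treat this only as a picture of the memory layout.

```java
final class FlatArrayMap {
  private final Object[] slots; // even index = key, odd index = value
  private final int mask;

  FlatArrayMap(int capacityPowerOfTwo) {
    this.slots = new Object[capacityPowerOfTwo * 2];
    this.mask = capacityPowerOfTwo - 1;
  }

  Object get(Object key) {
    int i = key.hashCode() & mask;
    for (int probes = 0; probes <= mask; probes++) {
      Object k = slots[2 * i];
      if (k == null) {
        return null; // hit an empty slot: key is absent
      }
      if (k.equals(key)) {
        return slots[2 * i + 1];
      }
      i = (i + 1) & mask; // linear probe to the next slot
    }
    return null; // scanned the whole table
  }

  boolean put(Object key, Object value) {
    int i = key.hashCode() & mask;
    for (int probes = 0; probes <= mask; probes++) {
      Object k = slots[2 * i];
      if (k == null || k.equals(key)) {
        slots[2 * i] = key;
        slots[2 * i + 1] = value;
        return true;
      }
      i = (i + 1) & mask;
    }
    return false; // table full (a real implementation would resize)
  }
}
```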

@justinhorvitz

My mistake, actually. We might need weak keys after all. I was thinking we could just use the Guava interner, but we need to manually remove from it, and the synchronization for ensuring the key isn’t located in two places may also be tricky without some customization. Still up in the air.

@ben-manes
Contributor

Assuming NBHM, you could reimplement those features on top of it.

For weak references, you could resurrect com.google.common.collect.ReferenceMap (public in the 2008 snapshot); that approach appears to work with NBHM. However, it uses a dedicated thread (FinalizableReferenceQueue), which caused a lot of pain from being too slow, interfering with class unloading, etc. java.lang.ref.Cleaner is a very similar, modern alternative. When we inherited MapMaker, we replaced that with an amortized cleanup, which imho works better. In that model you simply increment a counter until it reaches a threshold, tryLock, sweep a ReferenceQueue, and reset the counter (or some equivalent variant of this).
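
A sketch of that amortized sweep, under the simplifying assumption that the map is keyed by the Reference objects themselves; the threshold and structure are illustrative, not MapMaker's or Caffeine's actual code.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

final class AmortizedCleanup<K, V> {
  private static final int DRAIN_THRESHOLD = 64; // illustrative value

  private final ConcurrentHashMap<Reference<K>, V> map = new ConcurrentHashMap<>();
  private final ReferenceQueue<K> queue = new ReferenceQueue<>();
  private final ReentrantLock drainLock = new ReentrantLock();
  private final AtomicInteger opsSinceDrain = new AtomicInteger();

  // Called from read/write paths; every so often it sweeps the queue.
  void maybeDrain() {
    if (opsSinceDrain.incrementAndGet() < DRAIN_THRESHOLD) {
      return;
    }
    if (drainLock.tryLock()) { // skip if another thread is already sweeping
      try {
        Reference<? extends K> cleared;
        while ((cleared = queue.poll()) != null) {
          map.remove(cleared); // the Reference itself is the map key here
        }
        opsSinceDrain.set(0);
      } finally {
        drainLock.unlock();
      }
    }
  }
}
```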

For synchronizing across data structures, simply use a striped lock (e.g. Striped). A fixed striped lock at a reasonable size, e.g. 1024, is probably good enough. Then reads can remain lock-free, independent writes can run (mostly) in parallel, and you get per-entry consistency when you take the lock.

A little more work than an existing off-the-shelf solution, but with less memory overhead, high concurrency, and less tricky code to maintain.
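
And a sketch of the striped-lock coordination across two maps: writers take the stripe for the key, so a cross-map invariant such as "a key lives in at most one map" holds without a global lock, while reads stay lock-free. `Striped` is Guava's `com.google.common.util.concurrent.Striped`; the two-map operations are hypothetical stand-ins.

```java
import com.google.common.util.concurrent.Striped;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;

final class TwoMapCoordinator<K, V> {
  private final ConcurrentHashMap<K, V> first = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<K, V> second = new ConcurrentHashMap<>();
  private final Striped<Lock> locks = Striped.lock(1024); // fixed, reasonable size

  // Lock-free read path: check both maps.
  V get(K key) {
    V value = first.get(key);
    return value != null ? value : second.get(key);
  }

  // The stripe makes the check-then-act across both maps atomic with
  // respect to other writers of the same key.
  void putInFirstUnlessInSecond(K key, V value) {
    Lock lock = locks.get(key);
    lock.lock();
    try {
      if (!second.containsKey(key)) {
        first.put(key, value);
      }
    } finally {
      lock.unlock();
    }
  }

  // Moving a key between maps is likewise serialized per stripe.
  void moveToSecond(K key) {
    Lock lock = locks.get(key);
    lock.lock();
    try {
      V value = first.remove(key);
      if (value != null) {
        second.put(key, value);
      }
    } finally {
      lock.unlock();
    }
  }
}
```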
