LocalCache deadlock in Guava 22 #2863

Open
dima5rr opened this issue Jun 30, 2017 · 6 comments
@dima5rr

dima5rr commented Jun 30, 2017

Throwing an exception from the compute function of the Cache's ConcurrentMap view, and then computing the same key again, causes a deadlock at LocalCache.java:2387.

import java.util.concurrent.ConcurrentMap;

import com.google.common.cache.CacheBuilder;
import org.junit.Assert;
import org.junit.Test;

public class CacheDeadlockTest {

    @Test
    public void testCacheLoaderException() {
        ConcurrentMap<Integer, Integer> cache =
                CacheBuilder.newBuilder().<Integer, Integer>build().asMap();

        Integer n = 1;
        try {
            cache.computeIfAbsent(n, k -> {
                throw new IllegalStateException("make deadlock");
            });
        } catch (Exception e) {
            // expected: the computation fails
        }
        int v = cache.computeIfAbsent(n, k -> k); // deadlock here
        Assert.assertEquals(1, v);
    }
}
@ben-manes
Contributor

All recursive operations on a map are undefined behavior, so this works as intended. When the cache was a feature of MapMaker it had this behavior for computations on get, but I switched it to fail fast by detecting the lock was already held. Ideally the computations would emulate that for consistency and friendliness, but it isn't required.

Java 8's ConcurrentHashMap will live-lock on this call and HashMap may corrupt itself. In Java 9 it will fail fast in both cases, if detected.

So this works as intended, but could be a tad nicer.
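
For context, the recursive pattern being referred to is a mapping function that re-enters the same map. A minimal illustration (not the reporter's code; keys and values here are arbitrary):

import java.util.Map;

import com.google.common.cache.CacheBuilder;

class RecursiveComputeExample {
    static void demo() {
        Map<Integer, Integer> map =
                CacheBuilder.newBuilder().<Integer, Integer>build().asMap();

        // Undefined behavior: the mapping function calls back into the same map.
        // Per the comment above, Java 8's ConcurrentHashMap may live-lock on such
        // a call, HashMap may corrupt itself, and Java 9 fails fast if detected.
        map.computeIfAbsent(1, k -> map.computeIfAbsent(2, k2 -> 42) + 1);
    }
}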

@lowasser
Contributor

lowasser commented Jul 1, 2017

I'm not sure I see the recursion here?

@ben-manes
Contributor

ben-manes commented Jul 1, 2017

Oh shoot, I misread. I've seen the recursion issues raised so many times across hash tables that, on seeing multiple computeIfAbsent calls, I saw the imagined pattern.

It does appear that LocalCache.compute fails to handle exceptions from loadingValueReference.compute and leaves the failed entry in the cache. The subsequent call waits for a future that has never been finalized.
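
Until that is addressed, one caller-side way to avoid leaving a stuck entry behind is to load through the older Cache.get(key, Callable) path instead of asMap().computeIfAbsent: loader exceptions there are propagated (wrapped in ExecutionException or UncheckedExecutionException) and the failed load is not cached, so a later call retries. A rough sketch only, where expensiveLoad is a hypothetical placeholder:

import java.util.concurrent.ExecutionException;

import com.google.common.cache.Cache;
import com.google.common.util.concurrent.UncheckedExecutionException;

class CacheGetWorkaround {
    static Integer load(Cache<Integer, Integer> cache, Integer key) throws ExecutionException {
        try {
            // Loader failures are not cached, so the next get() for this key
            // retries the computation instead of waiting on a dead future.
            return cache.get(key, () -> expensiveLoad(key));
        } catch (UncheckedExecutionException e) {
            // RuntimeExceptions thrown by the loader arrive wrapped here.
            throw (RuntimeException) e.getCause();
        }
    }

    static Integer expensiveLoad(Integer key) {
        return key; // hypothetical placeholder for the real computation
    }
}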

@andrewparmet

andrewparmet commented Jul 1, 2017

Isn't this the behavior referenced in #2799?

@slovdahl

slovdahl commented Oct 23, 2017

I'm also observing some strange behaviour with guava caches under contention. Not sure if these two issues are related. If they aren't, I'm happy to open a separate issue.

While I was doing some JMH benchmarks I noticed that benchmarks that were using guava caches stalled completely and/or had very erratic performance during thread contention. Benchmark iterations very often stalled for seconds, minutes or sometimes seemingly indefinitely. It turned out that the cache was the culprit in all cases.

The issue is very easy to trigger. I published a sample project on https://github.com/hibox-systems/guava-cache-test that demonstrates the issue. In my environment it works fine with 1, 2 and 3 threads. Using 4 threads or more always stalls the benchmark at some point. I have tried the benchmark with all published guava versions that are compatible with the sample code, i.e. guava 11.0 and newer. Tested with these JVM versions on Ubuntu 16.04 with an Intel Core i7-3770 CPU:

  • OpenJDK 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
  • Oracle 1.8.0_151-b12
  • Oracle 9.0.1+11

FWIW, I tried replacing the guava caches in the benchmarks with caffeine caches, and all kinds of stalling and erratic performance went away.
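
The linked project has the full harness; as a rough illustration only (class name, cache size, key range, and thread count here are invented, not taken from that repository), a contended JMH benchmark of roughly this shape is what the description above boils down to:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class GuavaCacheContentionBenchmark {

    private Cache<Integer, Integer> cache;

    @Setup
    public void setUp() {
        cache = CacheBuilder.newBuilder().maximumSize(10_000).build();
    }

    // Run with 4 or more threads to reproduce the stalls described above.
    @Threads(4)
    @Benchmark
    public Integer readMostlyWorkload() throws Exception {
        int key = ThreadLocalRandom.current().nextInt(20_000);
        return cache.get(key, () -> key);
    }
}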

@ben-manes
Contributor

It should be unrelated, since this bug deals with new code for Java 8, while your benchmark touches prior code.

Note that by default there are 4 hash segments, each having its own queue and counter for replaying reads. The queue is uncapped and draining is amortized across readers. There should be a per-drain threshold to avoid over-penalizing any thread. In a stress test like yours, one could imagine a memory issue, hang-ups for the penalized callers, and contention adding to the read buffer. Caffeine uses a lossy ring buffer which avoids all of this. My guess is you are hitting the worst case, which wouldn't be expected in practice, or at least not with the hardware available when the cache was designed.
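
For reference, the segment count mentioned above is what CacheBuilder.concurrencyLevel controls (the default is 4). A sketch of raising it; the value 64 is arbitrary for illustration, not a tuning recommendation from this thread:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

class HighConcurrencyCacheExample {
    // More segments means more read queues and locks, which can reduce the
    // read-path contention described above, at the cost of extra memory.
    static Cache<String, byte[]> newCache() {
        return CacheBuilder.newBuilder()
                .concurrencyLevel(64)    // default is 4; illustrative value only
                .maximumSize(100_000)
                .build();
    }
}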

lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Feb 26, 2018
This deadlock is not the typical deadlock where two threads each lock a
resource and then wait for a second resource already locked by the other
thread. The thread owning the account cache lock is parked, which tells
us that the lock was not released. I could not determine the exact
sequence of events leading to this deadlock, which makes it really hard
to report/fix the problem.

While investigating, I realized that there are quite a few reported
issues in Guava that could be major for Gerrit:

  Our deadlock happening in account cache
  https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

  Other deadlock
  google/guava#2976
  google/guava#2863

  Performance
  google/guava#2063

  Race condition
  google/guava#2971

Because I could not reproduce the deadlock in a dev environment or in
a unit test, making it almost impossible to fix, I considered other
options such as replacing Guava with something else.

The maintainer of the Caffeine[1] cache claims that Caffeine is a high
performance[2], near optimal caching library designed and implemented
based on the experience of designing Guava's cache and ConcurrentLinkedHashMap.
I also did some benchmarks that spawn a lot of threads reading/writing
values from the caches. I ran those benchmarks on both Guava and Caffeine,
and Guava always took at least double the time of Caffeine to complete
all operations.

Migrating to Caffeine is almost a drop-in replacement. Caffeine's
interfaces are very similar to Guava's cache, and there is an adapter to
migrate to Caffeine while keeping Guava's interfaces. After migrating
to Caffeine, we saw the deadlock occurrence drop from once a day to once
every 2 weeks on our production server.

The maintainer of Caffeine, Ben Manes, pointed us to the possible
cause[3] of this issue: a bug[4] in the kernel and its fix[5]. Our
kernel version is supposed to have the fix, but we will try another OS
and kernel version.

Replacing Guava caches with Caffeine brings two things: it reduces the
chances of hitting the deadlock, which is most likely caused by a kernel
bug, and it improves cache performance.

[1]https://github.com/ben-manes/caffeine
[2]https://github.com/ben-manes/caffeine/wiki/Benchmarks
[3]https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
[4]torvalds/linux@b0c29f7
[5]torvalds/linux@76835b0

Bug: Issue 7645
Change-Id: I8d2b17a94d0e9daf9fa9cdda563316c5768b29ae
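
The adapter mentioned in the commit above is Caffeine's Guava-compatibility module. A minimal sketch of that migration path, assuming the CaffeinatedGuava entry point from the com.github.ben-manes.caffeine:guava artifact (sizes here are illustrative):

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.guava.CaffeinatedGuava;

import com.google.common.cache.Cache;

class CaffeineMigrationExample {
    // Returns Guava's Cache interface backed by a Caffeine cache, so code
    // written against com.google.common.cache.Cache keeps compiling unchanged.
    static Cache<String, String> newCache() {
        return CaffeinatedGuava.build(
                Caffeine.newBuilder().maximumSize(10_000));
    }
}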
cpovirk added the P4 label Jan 13, 2020