
ISPN-6869 Avoid deadlock when stopping DefaultCacheManager #4459

Closed

Conversation

vjuranek
Member

@@ -134,7 +136,8 @@
 private final CacheContainerStats stats;
 private final ConfigurationManager configurationManager;

-@GuardedBy("this")
+private final ReadWriteLock stoppingLock = new ReentrantReadWriteLock();
+@GuardedBy("stoppingLock")
 private boolean stopping;
Contributor

Just thinking - do we need both - a lock and a flag?

Member Author

Yes, you have to synchronize on something to change the flag (before, it was the DefaultCacheManager instance itself, which leads to the deadlock), and dropping the flag and just blocking on the lock would result in even more deadlocks.
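(For illustration, a minimal sketch of how the pair cooperates, with simplified, hypothetical names rather than the actual DefaultCacheManager code: the flag lets stop() mark the manager as stopping under the write lock and then release it before the real shutdown work, instead of holding the lock for the whole shutdown.)

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the lock-plus-flag pattern, not Infinispan code.
class LockPlusFlagSketch {
    private final ReadWriteLock stoppingLock = new ReentrantReadWriteLock();
    private boolean stopping; // guarded by stoppingLock

    Object getCache(String name) {
        stoppingLock.readLock().lock();
        try {
            if (stopping)
                throw new IllegalStateException("Cache manager is stopping");
        } finally {
            stoppingLock.readLock().unlock();
        }
        return lookupOrStartCache(name); // readers never block each other
    }

    void stop() {
        stoppingLock.writeLock().lock();
        try {
            if (stopping)
                return;      // someone else is already shutting down
            stopping = true; // flip the flag, then release the lock...
        } finally {
            stoppingLock.writeLock().unlock();
        }
        stopCaches(); // ...so the shutdown itself runs without holding it
    }

    private Object lookupOrStartCache(String name) { return new Object(); }
    private void stopCaches() { }
}
```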

Contributor

How about using the [Lock#tryLock](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Lock.html#tryLock(long,%20java.util.concurrent.TimeUnit)) method? If it returns false, we know we have to throw an exception here. WDYT?
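(For illustration, a sketch of that idea against a hypothetical single lifecycle lock: wait a bounded time and fail with an exception instead of blocking forever.)

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the tryLock suggestion, not Infinispan code.
class TryLockSketch {
    private final Lock lifecycleLock = new ReentrantLock();

    Object getCache(String name) throws InterruptedException {
        // Bounded wait: if a conflicting lifecycle operation (e.g. stop)
        // holds the lock for too long, give up instead of deadlocking.
        if (!lifecycleLock.tryLock(10, TimeUnit.SECONDS))
            throw new IllegalStateException("Timed out acquiring cache manager lock");
        try {
            return lookupCache(name);
        } finally {
            lifecycleLock.unlock();
        }
    }

    private Object lookupCache(String name) { return new Object(); }
}
```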

Contributor

I looked into this more deeply and prepared this: #4559

@vjuranek @rvansa Please tell me what you think...

@vjuranek
Member Author

I'm not sure I understand how that can help here. The issue is not about synchronizing on one particular lock (see the JIRA description for more details).

@slaskawi
Contributor

Frankly, I think the synchronization policy in DefaultCacheManager is slightly wrong, and adding another synchronization object is not the right approach here. Let me explain a bit more...

When starting a new cache (inside the wireAndStartCache method) we synchronize on the caches instance (which is a concurrent map, so that's wrong IMO). The shutdown procedure (the stop method) synchronizes on this and a flag, stopping (!?). The terminate method (the one that actually kills the caches) uses no synchronization at all; it just takes the caches one by one (with a not-null check) and kills them.

As you can see, we use different objects for synchronizing starting caches (the caches field) and stopping them (this plus the stopping flag). In my opinion, the best way to solve the deadlock described in the JIRA is to refactor the startup/shutdown procedure (not just add a new lock). I totally agree with @vjuranek's approach of using a Lock for this (shutdown and creating a new cache should probably own a write lock, whereas returning an existing cache should operate under a read lock). There might be an even better approach: StampedLock, which supports upgrading from a read lock to a write lock (could be useful in the getCache method: return under the read lock if the cache exists, and if it doesn't, upgrade to the write lock and wire a new cache). A rough sketch of this policy follows.
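(A hypothetical sketch with illustrative names, not Infinispan code. Note that ReentrantReadWriteLock cannot upgrade a read lock to a write lock; that upgrade is exactly what StampedLock would add.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: one lock for the whole cache lifecycle.
class SingleLockPolicySketch {
    private final ReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
    private final Map<String, Object> caches = new ConcurrentHashMap<>();

    Object getCache(String name) {
        lifecycleLock.readLock().lock();
        try {
            Object cache = caches.get(name);
            if (cache != null)
                return cache; // fast path: existing cache under the read lock
        } finally {
            lifecycleLock.readLock().unlock();
        }
        lifecycleLock.writeLock().lock(); // re-acquire as a writer to wire a new cache
        try {
            return caches.computeIfAbsent(name, this::wireAndStartCache);
        } finally {
            lifecycleLock.writeLock().unlock();
        }
    }

    void stop() {
        lifecycleLock.writeLock().lock(); // shutdown owns the write lock too
        try {
            caches.values().forEach(this::terminate);
            caches.clear();
        } finally {
            lifecycleLock.writeLock().unlock();
        }
    }

    private Object wireAndStartCache(String name) { return new Object(); }
    private void terminate(Object cache) { }
}
```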

cc @danberindei @rvansa

@rvansa
Member

rvansa commented Jul 28, 2016

I don't think you need a read lock for getting the cache; accessing the CHM is fine. And I would refrain from StampedLock unless you need it for perf purposes - the risk of misuse is too high.

Nevertheless, I agree that all the synchronization should go through a single lock (be it an intrinsic lock or an explicit one), and I would rather use ComponentStatus on the DCM than another custom flag.
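(For illustration, a minimal sketch of what guarding on ComponentStatus could look like; the enum and its RUNNING/STOPPING/TERMINATED constants are Infinispan's, everything else here is a hypothetical simplification.)

```java
import org.infinispan.lifecycle.ComponentStatus;

// Hypothetical sketch: one intrinsic lock plus the manager's existing
// lifecycle status instead of an extra boolean flag.
class StatusGuardedManagerSketch {
    private final Object lifecycleLock = new Object();
    private ComponentStatus status = ComponentStatus.RUNNING; // guarded by lifecycleLock

    Object getCache(String name) {
        synchronized (lifecycleLock) {
            if (status != ComponentStatus.RUNNING)
                throw new IllegalStateException("Cache manager is " + status);
        }
        return lookupOrStartCache(name);
    }

    void stop() {
        synchronized (lifecycleLock) {
            if (status != ComponentStatus.RUNNING)
                return;                        // stop already in progress or done
            status = ComponentStatus.STOPPING; // single source of truth, no extra flag
        }
        stopCaches(); // outside the lock, like the flag-based variant above
        synchronized (lifecycleLock) {
            status = ComponentStatus.TERMINATED;
        }
    }

    private Object lookupOrStartCache(String name) { return new Object(); }
    private void stopCaches() { }
}
```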

@vjuranek
Member Author

@slaskawi @rvansa can you elaborate on why you think one lock can solve the problem? This was the first solution I tried, and it resulted in several other deadlocks in different places. I didn't investigate much further, but it seems to me that it would require changes in several other places (maybe substantial changes), and the risk of introducing new deadlock(s) would IMHO be quite high. Also, synchronizing everything on one single lock (even if it's deadlock-free) would very likely result in much worse performance.

@slaskawi as for syncing on caches - I believe the sync there is not there to protect the caches object itself, but to create and store the CacheWrapper atomically.

@rvansa ComponentStatus would be a better choice if it were properly set up during the whole lifecycle of the component (otherwise it can be quite confusing). It would also require obtaining the write lock several times, while that's actually not needed anywhere besides stopping, so I'm not completely sure it's worth it.

@slaskawi
Contributor

@vjuranek - Could you please tell us more about how you implemented this (ideally share a branch, a PR, or a diff)? I believe you cannot deadlock when using a single re-entrant lock, so there must be something else going on there...

As I suggested, IMO one lock should be OK: a read lock for obtaining a cache and a write lock for wiring it up and killing it (@rvansa prefers ComponentStatus whereas I prefer StampedLock with an upgrade, but frankly, it doesn't matter; it's only a detail).

BTW - I believe we should also have @danberindei's opinion on that. I wouldn't like to move this forward without his comment...

@vjuranek
Member Author

vjuranek commented Aug 1, 2016

@slaskawi sorry for the late reply, I was on PTO on Thu and Fri. Unfortunately I no longer have my old implementation, as it didn't work (probably not very helpful, but I had some issues with starting a cache from a cache-starting listener - i.e. with StartCacheFromListenerTest - and core testsuite execution was considerably slower). I was thinking about it a little bit and agree it would probably be doable to have only one lock (adjusting other parts to avoid e.g. StartCacheFromListenerTest failures), but I still think such an implementation wouldn't be very efficient.

Anyway, I completely agree we should wait for insight from @danberindei.

@vjuranek
Member Author

Rebased.
@danberindei could you please take a look once you have some time? Thanks

@vjuranek
Member Author

The failure after the rebase is fixed now.

@slaskawi
Contributor

Unfortunately I was looking in the wrong place... Of course LGTM!

@rvansa
Member

rvansa commented Sep 16, 2016

@vjuranek Maybe if you added a test case, alternative solutions would be proven wrong immediately.

@vjuranek
Member Author

@rvansa IIRC I tried, but wasn't able to figure out a reliable reproducer... will look into it again

@danberindei
Member

Sorry it took so long to look at this @vjuranek!

The deadlock seems to come from the fact that InternalCacheRegistryImpl.stop uses DCM.getCache(name, false) -- something I should have removed with ISPN-6277. DCM.getCache(name, false) blocks when another thread is in the process of starting the cache. But global components, including InternalCacheRegistryImpl, are stopped while holding the GlobalComponentRegistry monitor, blocking any threads using the global components. If the cache-starting thread needs to inject a global component into a cache component, it blocks too, causing the deadlock (the shape of the cycle is sketched below). The simple fix here is to remove InternalCacheRegistryImpl.stop().
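(Schematically, the cycle is the classic lock-ordering inversion; in this hypothetical sketch the two monitors stand in for the GlobalComponentRegistry monitor and the latch of the cache being started.)

```java
// Hypothetical sketch of the deadlock shape described above, not Infinispan code.
class DeadlockShapeSketch {
    private final Object globalRegistryMonitor = new Object(); // stand-in for GlobalComponentRegistry
    private final Object cacheStartLatch = new Object();       // stand-in for the starting cache

    void stoppingThread() { // mirrors the thread running DCM.stop()
        synchronized (globalRegistryMonitor) {
            // InternalCacheRegistryImpl.stop -> DCM.getCache(name, false)
            // waits for the cache the other thread is starting.
            synchronized (cacheStartLatch) { /* ... */ }
        }
    }

    void cacheStartingThread() { // mirrors the thread starting a cache
        synchronized (cacheStartLatch) {
            // Injecting a global component needs the registry monitor,
            // which the stopping thread already holds -> deadlock.
            synchronized (globalRegistryMonitor) { /* ... */ }
        }
    }
}
```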

DefaultCacheManager.stopCaches() also has a problem: it ignores caches that are not fully started yet. But instead of adding a new lock, I'd rather reuse the CacheWrapper latch. DCM.wireAndStartCache can also use ConcurrentMap methods instead of synchronized (caches).
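(For illustration, a hypothetical sketch of the ConcurrentMap variant; CacheWrapper here is a bare stand-in for the real wrapper, whose latch late arrivals would wait on.)

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: atomic map operations instead of synchronized (caches).
class WireAndStartSketch {
    private final ConcurrentMap<String, CacheWrapper> caches = new ConcurrentHashMap<>();

    CacheWrapper wireAndStartCache(String name) {
        CacheWrapper created = new CacheWrapper();
        CacheWrapper existing = caches.putIfAbsent(name, created);
        if (existing != null)
            return existing; // another thread won the race; reuse its wrapper
        startCache(name, created); // only the winner actually starts the cache
        return created;
    }

    private void startCache(String name, CacheWrapper wrapper) { }
    static class CacheWrapper { }
}
```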

I thought this would be a good time to change getCache(name) so it reports any exception during cache start to all callers of getCache(name), not just the first one, so I replaced CacheWrapper with a CompletableFuture and I implemented my proposed changes here: https://github.com/danberindei/infinispan/commits/ISPN-6869_dan
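(A minimal sketch of that idea, hypothetical rather than the code on the linked branch: all callers join the same CompletableFuture, so a startup exception reaches every one of them.)

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of a CompletableFuture per cache, not the branch's code.
class FutureGetCacheSketch {
    private final ConcurrentMap<String, CompletableFuture<Object>> caches = new ConcurrentHashMap<>();

    Object getCache(String name) {
        CompletableFuture<Object> created = new CompletableFuture<>();
        CompletableFuture<Object> future = caches.putIfAbsent(name, created);
        if (future == null) { // this thread won the race and starts the cache
            future = created;
            try {
                created.complete(startCache(name));
            } catch (Throwable t) {
                created.completeExceptionally(t); // failure is stored, not lost
            }
        }
        return future.join(); // rethrows a startup failure to every caller
    }

    private Object startCache(String name) { return new Object(); }
}
```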

The branch also includes a reproducer. It doesn't rely on InternalCacheRegistryImpl.stop, it just checks that manager.getCache(name) blocks manager.stop(), and manager.stop() fails all manager.getCache(name) calls that start after it.

WDYT?

@vjuranek
Member Author

Thanks for the review @danberindei!
Your solution is definitely more general and nicer (I hadn't even noticed that caches are stopped in two different places and that InternalCacheRegistryImpl.stop() can be completely removed) - closing this PR. BTW: also a neat reproducer! Thanks

@vjuranek closed this Sep 19, 2016
@gustavocoding

@danberindei which PR is replacing this one? AFAICT, the problem is still happening on master
