
ISPN-6869 Avoid deadlock when stopping DefaultCacheManager #4459

Closed

Conversation

vjuranek
Member

@@ -134,7 +136,8 @@
 private final CacheContainerStats stats;
 private final ConfigurationManager configurationManager;

-@GuardedBy("this")
+private final ReadWriteLock stoppingLock = new ReentrantReadWriteLock();
+@GuardedBy("stoppingLock")
 private boolean stopping;
Contributor

Just thinking - do we need both - a lock and a flag?

Member Author

Yes, you have to synchronize on something to change the flag (before, it was the DefaultCacheManager instance itself, which leads to the deadlock), and dropping the flag and just blocking on the lock would result in even more deadlocks.
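(For illustration, a minimal sketch of how the pair cooperates, with simplified, hypothetical names rather than the actual DefaultCacheManager code: the flag lets stop() mark the manager as stopping under the write lock and then release it before the real shutdown work, instead of holding the lock for the whole shutdown.)

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the lock-plus-flag pattern, not Infinispan code.
class LockPlusFlagSketch {
    private final ReadWriteLock stoppingLock = new ReentrantReadWriteLock();
    private boolean stopping; // guarded by stoppingLock

    Object getCache(String name) {
        stoppingLock.readLock().lock();
        try {
            if (stopping)
                throw new IllegalStateException("Cache manager is stopping");
        } finally {
            stoppingLock.readLock().unlock();
        }
        return lookupOrStartCache(name); // readers never block each other
    }

    void stop() {
        stoppingLock.writeLock().lock();
        try {
            if (stopping)
                return;      // someone else is already shutting down
            stopping = true; // flip the flag, then release the lock...
        } finally {
            stoppingLock.writeLock().unlock();
        }
        stopCaches(); // ...so the shutdown itself runs without holding it
    }

    private Object lookupOrStartCache(String name) { return new Object(); }
    private void stopCaches() { }
}
```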

Contributor

How about using the [Lock#tryLock](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Lock.html#tryLock(long,%20java.util.concurrent.TimeUnit)) method? If it returns false, we know we have to throw an exception here. WDYT?
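(For illustration, a sketch of that idea against a hypothetical single lifecycle lock: wait a bounded time and fail with an exception instead of blocking forever.)

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the tryLock suggestion, not Infinispan code.
class TryLockSketch {
    private final Lock lifecycleLock = new ReentrantLock();

    Object getCache(String name) throws InterruptedException {
        // Bounded wait: if a conflicting lifecycle operation (e.g. stop)
        // holds the lock for too long, give up instead of deadlocking.
        if (!lifecycleLock.tryLock(10, TimeUnit.SECONDS))
            throw new IllegalStateException("Timed out acquiring cache manager lock");
        try {
            return lookupCache(name);
        } finally {
            lifecycleLock.unlock();
        }
    }

    private Object lookupCache(String name) { return new Object(); }
}
```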

Contributor

I looked into this more deeply and prepared this: #4559

@vjuranek @rvansa Please tell me what you think...

@vjuranek
Member Author

I'm not sure I understand how that can help here. The issue is not about synchronizing on one particular lock (see the JIRA description for more details).

@slaskawi
Contributor

Frankly, I think the synchronization policy in DefaultCacheManager is slightly wrong, and adding another synchronization object is not the right approach here. Let me explain a bit more...

When starting a new cache (inside the wireAndStartCache method) we synchronize on the caches instance (which is a concurrent map, so that's wrong IMO). The shutdown procedure (the stop method) synchronizes on this and a flag, stopping (!?). The terminate method (the one that actually kills the caches) uses no synchronization at all; it just takes the caches one by one (with a not-null check) and kills them.

As you can see, we use different objects for synchronizing starting caches (the caches field) and stopping them (this plus the stopping flag). In my opinion, the best way to solve the deadlock described in the JIRA is to refactor the startup/shutdown procedure (not just add a new lock). I totally agree with @vjuranek's approach of using a Lock for this (shutdown and creating a new cache should probably own a write lock, whereas returning an existing cache should operate under a read lock). There might be an even better approach: StampedLock, which supports upgrading from a read lock to a write lock (could be useful in the getCache method: return under the read lock if the cache exists, and if it doesn't, upgrade to the write lock and wire a new cache). A rough sketch of this policy follows.
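(A hypothetical sketch with illustrative names, not Infinispan code. Note that ReentrantReadWriteLock cannot upgrade a read lock to a write lock; that upgrade is exactly what StampedLock would add.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: one lock for the whole cache lifecycle.
class SingleLockPolicySketch {
    private final ReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
    private final Map<String, Object> caches = new ConcurrentHashMap<>();

    Object getCache(String name) {
        lifecycleLock.readLock().lock();
        try {
            Object cache = caches.get(name);
            if (cache != null)
                return cache; // fast path: existing cache under the read lock
        } finally {
            lifecycleLock.readLock().unlock();
        }
        lifecycleLock.writeLock().lock(); // re-acquire as a writer to wire a new cache
        try {
            return caches.computeIfAbsent(name, this::wireAndStartCache);
        } finally {
            lifecycleLock.writeLock().unlock();
        }
    }

    void stop() {
        lifecycleLock.writeLock().lock(); // shutdown owns the write lock too
        try {
            caches.values().forEach(this::terminate);
            caches.clear();
        } finally {
            lifecycleLock.writeLock().unlock();
        }
    }

    private Object wireAndStartCache(String name) { return new Object(); }
    private void terminate(Object cache) { }
}
```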

cc @danberindei @rvansa

@rvansa
Member

rvansa commented Jul 28, 2016

I don't think you need a read lock for getting the cache; accessing the CHM is fine. And I would refrain from StampedLock unless you need it for perf purposes - the risk of misuse is too high.

Nevertheless, I agree that all the synchronization should go through a single lock (be it an intrinsic lock or an explicit one), and I would rather use ComponentStatus on the DCM than another custom flag.
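(For illustration, a minimal sketch of what guarding on ComponentStatus could look like; the enum and its RUNNING/STOPPING/TERMINATED constants are Infinispan's, everything else here is a hypothetical simplification.)

```java
import org.infinispan.lifecycle.ComponentStatus;

// Hypothetical sketch: one intrinsic lock plus the manager's existing
// lifecycle status instead of an extra boolean flag.
class StatusGuardedManagerSketch {
    private final Object lifecycleLock = new Object();
    private ComponentStatus status = ComponentStatus.RUNNING; // guarded by lifecycleLock

    Object getCache(String name) {
        synchronized (lifecycleLock) {
            if (status != ComponentStatus.RUNNING)
                throw new IllegalStateException("Cache manager is " + status);
        }
        return lookupOrStartCache(name);
    }

    void stop() {
        synchronized (lifecycleLock) {
            if (status != ComponentStatus.RUNNING)
                return;                        // stop already in progress or done
            status = ComponentStatus.STOPPING; // single source of truth, no extra flag
        }
        stopCaches(); // outside the lock, like the flag-based variant above
        synchronized (lifecycleLock) {
            status = ComponentStatus.TERMINATED;
        }
    }

    private Object lookupOrStartCache(String name) { return new Object(); }
    private void stopCaches() { }
}
```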

@vjuranek
Member Author

@slaskawi @rvansa can you elaborate on why you think one lock can solve the problem? This was the first solution I tried, and it resulted in several other deadlocks in different places. I didn't investigate much further, but it seems to me that it would require changes in several other places (maybe substantial changes), and the risk of introducing new deadlock(s) would IMHO be quite high. Also, synchronizing everything on one single lock (even if it's deadlock-free) would very likely result in much worse performance.

@slaskawi as for syncing on caches - I believe the sync there is not there to protect the caches object itself, but to create and store the CacheWrapper atomically.

@rvansa ComponentStatus would be a better choice if it were properly set up during the whole lifecycle of the component (otherwise it can be quite confusing). It would also require obtaining the write lock several times, while that's actually not needed anywhere besides stopping, so I'm not completely sure it's worth it.

@slaskawi
Contributor

@vjuranek - Could you please tell us more about how you implemented this (ideally share a branch, a PR, or a diff)? I believe you cannot deadlock when using a single re-entrant lock, so there must be something else going on there...

As I suggested, IMO one lock should be OK: a read lock for obtaining a cache and a write lock for wiring it up and killing it (@rvansa prefers ComponentStatus whereas I prefer StampedLock with an upgrade, but frankly, it doesn't matter; it's only a detail).

BTW - I believe we should also have @danberindei's opinion on that. I wouldn't like to move this forward without his comment...

@vjuranek
Member Author

vjuranek commented Aug 1, 2016

@slaskawi sorry for the late reply, I was on PTO on Thu and Fri. Unfortunately I no longer have my old implementation, as it didn't work (probably not very helpful, but I had some issues with starting a cache from a cache-starting listener - i.e. with StartCacheFromListenerTest - and core testsuite execution was considerably slower). I was thinking about it a little bit and agree it would probably be doable to have only one lock (adjusting other parts to avoid e.g. StartCacheFromListenerTest failures), but I still think such an implementation wouldn't be very efficient.

Anyway, I completely agree we should wait for insight from @danberindei.

@vjuranek
Member Author

Rebased.
@danberindei could you please take a look once you have some time? Thanks

@vjuranek
Member Author

The failure after the rebase is fixed now.

@slaskawi
Contributor

Unfortunately I was looking in the wrong place... Of course LGTM!

@rvansa
Member

rvansa commented Sep 16, 2016

@vjuranek Maybe if you added a test case, alternative solutions would be proven wrong immediately.

@vjuranek
Member Author

@rvansa IIRC I tried, but wasn't able to figure out a reliable reproducer... will look into it again

@danberindei
Member

Sorry it took so long to look at this @vjuranek!

The deadlock seems to come from the fact that InternalCacheRegistryImpl.stop uses DCM.getCache(name, false) -- something I should have removed with ISPN-6277. DCM.getCache(name, false) blocks when another thread is in the process of starting the cache. But global components, including InternalCacheRegistryImpl, are stopped while holding the GlobalComponentRegistry monitor, blocking any threads using the global components. If the cache-starting thread needs to inject a global component into a cache component, it blocks too, causing the deadlock (the shape of the cycle is sketched below). The simple fix here is to remove InternalCacheRegistryImpl.stop().
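(Schematically, the cycle is the classic lock-ordering inversion; in this hypothetical sketch the two monitors stand in for the GlobalComponentRegistry monitor and the latch of the cache being started.)

```java
// Hypothetical sketch of the deadlock shape described above, not Infinispan code.
class DeadlockShapeSketch {
    private final Object globalRegistryMonitor = new Object(); // stand-in for GlobalComponentRegistry
    private final Object cacheStartLatch = new Object();       // stand-in for the starting cache

    void stoppingThread() { // mirrors the thread running DCM.stop()
        synchronized (globalRegistryMonitor) {
            // InternalCacheRegistryImpl.stop -> DCM.getCache(name, false)
            // waits for the cache the other thread is starting.
            synchronized (cacheStartLatch) { /* ... */ }
        }
    }

    void cacheStartingThread() { // mirrors the thread starting a cache
        synchronized (cacheStartLatch) {
            // Injecting a global component needs the registry monitor,
            // which the stopping thread already holds -> deadlock.
            synchronized (globalRegistryMonitor) { /* ... */ }
        }
    }
}
```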

DefaultCacheManager.stopCaches() also has a problem: it ignores caches that are not fully started yet. But instead of adding a new lock, I'd rather reuse the CacheWrapper latch. DCM.wireAndStartCache can also use ConcurrentMap methods instead of synchronized (caches).
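(For illustration, a hypothetical sketch of the ConcurrentMap variant; CacheWrapper here is a bare stand-in for the real wrapper, whose latch late arrivals would wait on.)

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: atomic map operations instead of synchronized (caches).
class WireAndStartSketch {
    private final ConcurrentMap<String, CacheWrapper> caches = new ConcurrentHashMap<>();

    CacheWrapper wireAndStartCache(String name) {
        CacheWrapper created = new CacheWrapper();
        CacheWrapper existing = caches.putIfAbsent(name, created);
        if (existing != null)
            return existing; // another thread won the race; reuse its wrapper
        startCache(name, created); // only the winner actually starts the cache
        return created;
    }

    private void startCache(String name, CacheWrapper wrapper) { }
    static class CacheWrapper { }
}
```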

I thought this would be a good time to change getCache(name) so it reports any exception during cache start to all callers of getCache(name), not just the first one, so I replaced CacheWrapper with a CompletableFuture and I implemented my proposed changes here: https://github.com/danberindei/infinispan/commits/ISPN-6869_dan
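(A minimal sketch of that idea, hypothetical rather than the code on the linked branch: all callers join the same CompletableFuture, so a startup exception reaches every one of them.)

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of a CompletableFuture per cache, not the branch's code.
class FutureGetCacheSketch {
    private final ConcurrentMap<String, CompletableFuture<Object>> caches = new ConcurrentHashMap<>();

    Object getCache(String name) {
        CompletableFuture<Object> created = new CompletableFuture<>();
        CompletableFuture<Object> future = caches.putIfAbsent(name, created);
        if (future == null) { // this thread won the race and starts the cache
            future = created;
            try {
                created.complete(startCache(name));
            } catch (Throwable t) {
                created.completeExceptionally(t); // failure is stored, not lost
            }
        }
        return future.join(); // rethrows a startup failure to every caller
    }

    private Object startCache(String name) { return new Object(); }
}
```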

The branch also includes a reproducer. It doesn't rely on InternalCacheRegistryImpl.stop, it just checks that manager.getCache(name) blocks manager.stop(), and manager.stop() fails all manager.getCache(name) calls that start after it.

WDYT?

@vjuranek
Member Author

Thanks for the review @danberindei!
Your solution is definitely more general and nicer (I hadn't even noticed that caches are stopped in two different places and that InternalCacheRegistryImpl.stop() can be completely removed) - closing this PR. BTW: also a neat reproducer! Thanks

@vjuranek closed this Sep 19, 2016
@gustavocoding

@danberindei which PR is replacing this one? AFAICT, the problem is still happening on master
