ISPN-1106 and ISPN-1255 plus some random test stability issues #491

wants to merge 16 commits into



3 participants


Also t_1255_5.0.x for the 5.0.x branch.

There are too many commits to describe everything here, so it's better to look at the individual commits; to sum it up, it's all related to rehash reliability.

@maniksurtani and 2 others commented on an outdated diff Aug 4, 2011
@@ -71,15 +65,11 @@ public class LockControlCommand extends AbstractTransactionBoundaryCommand imple
public LockControlCommand(Collection<Object> keys, String cacheName, Set<Flag> flags, boolean implicit) {
this.cacheName = cacheName;
- this.keys = null;
- this.singleKey = null;
- if (keys != null && !keys.isEmpty()) {
- if (keys.size() == 1) {
- for (Object k: keys) this.singleKey = k;
- } else {
- // defensive copy
- this.keys = new HashSet<Object>(keys);
- }
+ if (keys != null) {
+ // defensive copy
+ this.keys = new ArrayList<Object>(keys);
+ } else {
+ keys = Collections.emptySet();
maniksurtani Aug 4, 2011 Member

Shouldn't this be this.keys = ... ?

danberindei Aug 4, 2011 Member

Good catch! I guess the tests never create a lock command without any key ;)

Sanne Aug 5, 2011 Member

ouch, please keep the names different :) shadowing is hard to deal with!

danberindei Aug 5, 2011 Member

This is the constructor, I think it's pretty standard to use the same names for the constructor parameters.

Sanne Aug 5, 2011 Member

sorry didn't notice that :)
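Putting the two comments together, the corrected constructor logic might look like the following. This is a standalone sketch based on the diff above, not the actual Infinispan class: the field name and the simplified signature are illustrative. The key points are that both branches assign `this.keys` (fixing the shadowing bug Manik spotted) and that the defensive copy preserves the caller's ordering.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Sketch of the corrected constructor: both branches assign this.keys,
// so the parameter is never silently reassigned.
public class LockControlCommandSketch {
   private final List<Object> keys;

   public LockControlCommandSketch(Collection<Object> keys) {
      if (keys != null) {
         // defensive copy that keeps the caller's iteration order
         this.keys = new ArrayList<Object>(keys);
      } else {
         this.keys = Collections.emptyList();
      }
   }

   public List<Object> getKeys() {
      return keys;
   }
}
```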

@Sanne and 1 other commented on an outdated diff Aug 5, 2011
@@ -538,28 +538,33 @@ public class DefaultCacheManager implements EmbeddedCacheManager, CacheManager {
@GuardedBy("Cache name lock container keeps a lock per cache name which guards this method")
private Cache createCache(String cacheName) {
- CacheWrapper existingCache = caches.get(cacheName);
- if (existingCache != null)
- return existingCache.getCache();
- Configuration c = getConfiguration(cacheName);
- setConfigurationName(cacheName, c);
- c.setGlobalConfiguration(globalConfiguration);
- c.accept(configurationValidator);
- c.assertValid();
- Cache cache = new InternalCacheFactory().createCache(c, globalComponentRegistry, cacheName, reflectionCache);
- CacheWrapper cw = new CacheWrapper(cache);
+ LogFactory.pushNDC(cacheName, log.isTraceEnabled());
Sanne Aug 5, 2011 Member

you should store the result of log.isTraceEnabled() in a local variable and use that for pushNDC and popNDC, because if the value changes between a push/pop it makes a mess in all subsequent debug sessions.

danberindei Aug 5, 2011 Member

good point, fixed
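The pattern Sanne suggests can be sketched as follows. The NDC stack and trace flag here are simple stand-ins for the real LogFactory/NDC API, just to show why capturing the flag once keeps the push and pop balanced even if the log level changes mid-call:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for LogFactory's NDC: push only happens when trace is on,
// so pop must use the SAME flag value or the stack goes out of sync.
public class NdcSketch {
   static final Deque<String> NDC = new ArrayDeque<String>();
   static volatile boolean traceEnabled = true;

   static void pushNDC(String name, boolean trace) {
      if (trace) NDC.push(name);
   }

   static void popNDC(boolean trace) {
      if (trace) NDC.pop();
   }

   static void createCache(String cacheName) {
      boolean trace = traceEnabled;   // captured once, up front
      pushNDC(cacheName, trace);
      try {
         // ... cache creation work; the log level may change here ...
         traceEnabled = false;
      } finally {
         popNDC(trace);               // same flag, so the stack stays balanced
      }
   }
}
```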

Dan Berindei added some commits Jul 28, 2011
Dan Berindei ISPN-1255 - Distributed tasks should not be queued
Distributed tasks can't be replayed after the cache has started, because the caller needs to know the return value.
So if the cache hasn't started it should return an UnsuccessfulResponse immediately instead of returning a RequestIgnoredResponse and then running the command.
Dan Berindei ISPN-1271 - Consistent Hash externalizer loses grouping configuration
As a workaround, skip the consistent hash check on the node receiving state.
Dan Berindei ISPN-1123 - BaseDldLaziLockingTest was only working if the first transaction won the coin toss
Dan Berindei ISPN-1123 - Fixed waiting for updated cluster views
Some tests were not using the barfIfTooManyMembers = false flag after killing a node.
Waiting for rehashing to finish is not enough, we must ensure that we got the updated views first.
DataLossOnJoinOneOwnerTest was not creating the second cache before blocking for view.

Some tests were creating a non-default cache but then waiting for the default cache to finish rehashing.
Dan Berindei ISPN-1123 - Fixed mismatched test names
I added a script (bin/ to find mismatched test names.
Dan Berindei ISPN-1123 - Make the joins really concurrent in ConcurrentJoinTest 21e1add
Dan Berindei ISPN-1123 - SyncReplImplicitLockingTest: Allow more time for the timeout to kick in
Dan Berindei ISPN-1106 - Log more information about locks and rehashing
* In LockManagerImpl log the other keys owned by the current transaction.
* In DefaultCacheManager push the cache name to the NDC during cache startup.
* Improved toString() for RehashControlCommand and DistributedExecuteCommand.
* In InboundInvocationHandler log the cache name.
* Log cache start/stop.
* Log the read lock owners in JGroupsDistSync.
Dan Berindei ISPN-1106 - Corrected DistributionManagerImpl.isAffectedByRehash()
It returns true if the new CH maps the key to the current node but the previous CH didn't, meaning the key may not have arrived yet from the previous owner.
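As a standalone illustration of that condition (owner lists stand in for the real ConsistentHash API, and the method name mirrors but does not reproduce DistributionManagerImpl):

```java
import java.util.List;

// A key is "affected by rehash" on this node when the new consistent hash
// maps it here but the old one did not: its state may still be in flight
// from the previous owner.
public class RehashCheckSketch {
   static boolean isAffectedByRehash(Object self,
                                     List<Object> oldOwners,
                                     List<Object> newOwners) {
      return newOwners.contains(self) && !oldOwners.contains(self);
   }
}
```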
Dan Berindei ISPN-1106 - Possibility of data loss when there's a pending rehash and the state receivers ignore our state

Moved key invalidation after receiving the rehash completed confirmation, and only if there is no pending rehash.
Re-added notifyDataRehashed post event.
Separated the rehash completion into two phases so RebalanceTask can invalidate the keys after rehashing is done but before the cache clients know it.
Dan Berindei ISPN-1106 - Remove locks for keys locked by a remote transaction that are no longer local after a rehash

I also changed the algorithm for eager single node locking to roll back a transaction if the primary data owner changed, even if it's not a joiner or leaver (see ISPN-1275).
Dan Berindei ISPN-1106 - Filter out the originator node from the targets of a remote get command

Needed if we are during a rehash and the key doesn't exist on the current node.
Dan Berindei ISPN-1106 - LockControlCommand was not locking keys in the order the user passed them in

The ordering was not consistent, as it relied on a HashSet, so different commands could lock the same subset of keys in a different order, leading to a deadlock.
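The fix can be sketched in isolation like this (the method name is illustrative, not the real command's API): when two commands lock an overlapping key set in the same order the user supplied, neither can end up waiting on the other in a cycle, whereas a HashSet iterates in hash order, which need not match the caller's order and can differ between commands.

```java
import java.util.ArrayList;
import java.util.List;

// Lock-ordering sketch: keep the user's key order instead of a HashSet's
// hash order, so commands that lock overlapping keys in the same supplied
// order cannot deadlock against each other.
public class LockOrderSketch {
   static List<Object> lockOrder(List<Object> userKeys) {
      // defensive copy that preserves the caller's ordering, as in the fix
      return new ArrayList<Object>(userKeys);
   }
}
```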
Dan Berindei ISPN-1170 - If the number of joiners >= numOwners the owner sets before and after rehash can be disjoint
Dan Berindei ISPN-1255 - Ongoing transactions waiting on locks are blocking the rehash from finishing

The generic scenario involves multiple caches.
Say we have transactions Tx1 and Tx2 spanning caches C1 and C2.
A new node joins the cluster, starting C1 and C2.
With the following sequence of events rehashing will be blocked for lockAcquisitionTimeout.

1. Tx1 prepares on C1 locking K1
2. Tx2 wants to prepare on C2, Tx2 gets the tx lock
3. Tx2 now waits to lock K1 while holding the tx lock on C2
4. Rehash starts on C2 but it can't proceed because Tx2 has the tx lock
5. Tx1 now wants to prepare on C2, but can't acquire the tx lock

I've implemented a crude "deadlock detection" scheme: a new tx will wait
the full lockAcquisitionTimeout for the tx lock, but a tx that already
has locks acquired will only wait 1/100 of that. So if there is a cycle
it will break much quicker and allow rehashing to proceed.

There is also a simpler variant where the transactions work with a single cache.
In that case, if the remote command can't acquire the tx lock with a 0 timeout, it knows that it already holds the tx lock on the origin node and is in a deadlock situation.
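The crude "deadlock detection" scheme described above can be sketched as follows. The class and method names are illustrative, not the real Infinispan API; the point is only the asymmetric timeout, where a transaction that already holds key locks backs off after 1/100 of the configured timeout so a cycle breaks quickly:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Asymmetric tx-lock acquisition: a new transaction waits the full
// lockAcquisitionTimeout, but one that already holds key locks only waits
// 1/100 of it, so a wait cycle with the rehash breaks much sooner.
public class TxLockSketch {
   static boolean acquireTxLock(ReentrantLock txLock, boolean holdsKeyLocks,
                                long lockAcquisitionTimeoutMs) {
      long timeout = holdsKeyLocks
            ? lockAcquisitionTimeoutMs / 100   // back off fast to break cycles
            : lockAcquisitionTimeoutMs;        // new tx waits the full timeout
      try {
         return txLock.tryLock(timeout, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         return false;
      }
   }
}
```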
Dan Berindei ISPN-1106 - Allow the users to start multiple caches at once
This is no longer strictly necessary for ISPN-1106, as we are waiting
with a shorter timeout on transactions with locks and so the rehash
does not block for a very long period of time.

It is recommended however to start all caches on application startup,
and this method provides an easy way for users to start all their caches.
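The intended usage can be sketched with a stand-in manager; the real method lives on the cache manager API and its exact signature may differ, so this only illustrates the idea of starting every cache in one call at application startup, rather than lazily one at a time:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-in for a cache manager exposing a bulk start method: all the
// named caches come up together, so the cluster rehashes once for the
// whole set instead of once per lazily started cache.
public class StartCachesSketch {
   final Set<String> running = new LinkedHashSet<String>();

   void startCaches(String... cacheNames) {
      for (String name : cacheNames) {
         running.add(name);   // stand-in for actually creating/starting the cache
      }
      // a single view/rehash would settle here for all the new caches
   }
}
```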