ISPN-1106 and ISPN-1255 plus some random test stability issues #491

Closed
wants to merge 16 commits into from

3 participants

Dan Berindei Manik Surtani Sanne Grinovero
Dan Berindei
Collaborator

https://issues.jboss.org/browse/ISPN-1106
https://issues.jboss.org/browse/ISPN-1255

Also t_1255_5.0.x for the 5.0.x branch.

There are too many commits to list everything here, so it's better to look at the individual commits. In short, they are all related to rehash reliability.

core/src/main/java/org/infinispan/commands/control/LockControlCommand.java
@@ -71,15 +65,11 @@ public class LockControlCommand extends AbstractTransactionBoundaryCommand imple
 
    public LockControlCommand(Collection<Object> keys, String cacheName, Set<Flag> flags, boolean implicit) {
       this.cacheName = cacheName;
-      this.keys = null;
-      this.singleKey = null;
-      if (keys != null && !keys.isEmpty()) {
-         if (keys.size() == 1) {
-            for (Object k: keys) this.singleKey = k;
-         } else {
-            // defensive copy
-            this.keys = new HashSet<Object>(keys);
-         }
+      if (keys != null) {
+         // defensive copy
+         this.keys = new ArrayList<Object>(keys);
+      } else {
+         keys = Collections.emptySet();
Manik Surtani Collaborator

Shouldn't this be this.keys = ... ?

Dan Berindei Collaborator
Sanne Grinovero Collaborator
Sanne added a note August 05, 2011

ouch, please keep the names different :) shadowing is hard to deal with!

Dan Berindei Collaborator

This is the constructor, I think it's pretty standard to use the same names for the constructor parameters.

Sanne Grinovero Collaborator
Sanne added a note August 05, 2011

sorry didn't notice that :)
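
For reference, here is a minimal sketch of the constructor with the else branch assigning the field, as Manik's question suggests. `KeyHolder` is a hypothetical stand-in for LockControlCommand and the field type is an assumption; only the null handling and the defensive copy are illustrated.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for LockControlCommand; only the key handling is shown.
class KeyHolder {
   private final List<Object> keys;

   KeyHolder(Collection<Object> keys) {
      if (keys != null) {
         // defensive copy that keeps the caller's iteration order
         this.keys = new ArrayList<Object>(keys);
      } else {
         // assign the field, not the parameter that shadows it
         this.keys = Collections.emptyList();
      }
   }

   List<Object> getKeys() {
      return keys;
   }
}
```

Declaring the field final is one way to let the compiler catch this kind of slip: a branch that only assigns the parameter leaves the field unassigned and the class no longer compiles.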

core/src/main/java/org/infinispan/manager/DefaultCacheManager.java
@@ -538,28 +538,33 @@ public class DefaultCacheManager implements EmbeddedCacheManager, CacheManager {
 
    @GuardedBy("Cache name lock container keeps a lock per cache name which guards this method")
    private Cache createCache(String cacheName) {
-      CacheWrapper existingCache = caches.get(cacheName);
-      if (existingCache != null)
-         return existingCache.getCache();
-
-      Configuration c = getConfiguration(cacheName);
-      setConfigurationName(cacheName, c);
-
-      c.setGlobalConfiguration(globalConfiguration);
-      c.accept(configurationValidator);
-      c.assertValid();
-      Cache cache = new InternalCacheFactory().createCache(c, globalComponentRegistry, cacheName, reflectionCache);
-      CacheWrapper cw = new CacheWrapper(cache);
+      LogFactory.pushNDC(cacheName, log.isTraceEnabled());
Sanne Grinovero Collaborator
Sanne added a note August 05, 2011

You should store the result of log.isTraceEnabled() in a local variable and use that for both pushNDC and popNDC. If the value changes between the push and the pop, it makes a mess of all subsequent debug sessions.

Dan Berindei Collaborator

good point, fixed
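
A minimal sketch of what Sanne's suggestion could look like. The `NdcSketch` class, its stack, and the `traceEnabled` flag are hypothetical stand-ins and not the Infinispan LogFactory; the point is only that the flag is read once and reused for the pop.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a logging context stack; not the Infinispan LogFactory.
class NdcSketch {
   static final Deque<String> stack = new ArrayDeque<String>();
   static volatile boolean traceEnabled = true;

   static void pushNDC(String name, boolean isTrace) {
      if (isTrace) stack.push(name);
   }

   static void popNDC(boolean isTrace) {
      if (isTrace) stack.pop();
   }

   static void createCache(String cacheName, Runnable startup) {
      // Read the flag once so that push and pop always agree,
      // even if the log level changes while the cache is starting.
      final boolean trace = traceEnabled;
      pushNDC(cacheName, trace);
      try {
         startup.run();
      } finally {
         popNDC(trace);
      }
   }
}
```

If the flag were re-read for the pop, a log level change during startup could leave a stale cache name on the stack or pop an entry that was never pushed.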

added some commits July 28, 2011
ISPN-1255 - Distributed tasks should not be queued
Distributed tasks can't be replayed after the cache has started, because the caller needs to know the return value.
So if the cache hasn't started it should return an UnsuccessfulResponse immediately instead of returning a RequestIgnoredResponse and then running the command.
34467b8
ISPN-1271 - Consistent Hash externalizer loses grouping configuration
As a workaround, skip the consistent hash check on the node receiving state.
46c7346
ISPN-1123 - BaseDldLaziLockingTest was only working if the first transaction won the coin toss
101b686
ISPN-1123 - Fixed waiting for updated cluster views
Some tests were not using the barfIfTooManyMembers = false flag after killing a node.
Waiting for rehashing to finish is not enough; we must first ensure that we got the updated views.
DataLossOnJoinOneOwnerTest was not creating the second cache before blocking for view.

Some tests were creating a non-default cache but then waiting for the default cache to finish rehashing.
c068fdf
ISPN-1123 - Fixed mismatched test names
I added a script (bin/mismatchedTestNames.sh) to find mismatched test names.
7e6924c
ISPN-1123 - Make the joins really concurrent in ConcurrentJoinTest
21e1add
ISPN-1123 - SyncReplImplicitLockingTest: Allow more time for the timeout to kick in
361c587
ISPN-1106 - Log more information about locks and rehashing
* In LockManagerImpl log the other keys owned by the current transaction.
* In DefaultCacheManager push the cache name to the NDC during cache startup.
* Improved toString() for RehashControlCommand and DistributedExecuteCommand.
* In InboundInvocationHandler log the cache name.
* Log cache start/stop.
* Log the read lock owners in JGroupsDistSync.
991b53e
ISPN-1106 - Corrected DistributionManagerImpl.isAffectedByRehash()
It returns true if the CH maps the key to the current node but the previous CH didn't, meaning the key may not have arrived yet from the previous owner.
5ba8005
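
The check described in this commit can be sketched roughly as follows. `OwnerLookup` is a hypothetical, simplified view of a consistent hash (key to owner names), not the Infinispan ConsistentHash interface, and the method signature is an assumption.

```java
import java.util.List;

class RehashCheckSketch {
   // Hypothetical, simplified consistent hash: maps a key to its owner names.
   interface OwnerLookup {
      List<String> locate(Object key);
   }

   // True if the current hash maps the key to this node but the previous hash
   // did not, i.e. the key may still be in flight from its previous owner.
   static boolean isAffectedByRehash(Object key, String self, OwnerLookup oldCH, OwnerLookup newCH) {
      return newCH.locate(key).contains(self) && !oldCH.locate(key).contains(self);
   }
}
```

A node that owned the key both before and after the rehash is not affected, because it already has the value locally.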
ISPN-1106 - Possibility of data loss when there's a pending rehash and the state receivers ignore our state

Moved key invalidation after receiving the rehash completed confirmation, and only if there is no pending rehash.
Re-added notifyDataRehashed post event.
Separated the rehash completion into two phases so RebalanceTask can invalidate the keys after rehashing is done but before the cache clients know it.
2b51342
ISPN-1106 - Remove locks for keys locked by a remote transaction that are no longer local after a rehash

I also changed the algorithm for eager single node locking to roll back a transaction if the primary data owner changed, even if it's not a joiner or leaver (see ISPN-1275).
7927cda
ISPN-1106 - Filter out the originator node from the targets of a remote get command

Needed if we are in the middle of a rehash and the key doesn't exist on the current node.
d266b29
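
As an illustration only, the filtering could look like the sketch below. `RemoteGetTargets`, the use of plain strings for node addresses, and the method name are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class RemoteGetTargets {
   // Returns the owners to query remotely, leaving out the node that is asking.
   static List<String> targetsFor(List<String> owners, String self) {
      List<String> targets = new ArrayList<String>(owners); // copy, the owners list is not modified
      targets.remove(self);
      return targets;
   }
}
```

During a rehash the current node can already be listed as an owner without having received the key yet, so sending the remote get to itself would not help.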
ISPN-1106 - LockControlCommand was not locking keys in the order the user passed them in

The ordering was not consistent, as it was relying on a HashSet. So different commands could lock the same subset of keys in a different order, leading to a deadlock.
1955085
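
A rough sketch of the idea, assuming a hypothetical per-key lock table (`OrderedLockingSketch` is not Infinispan's LockManager): the keys are copied into a list so that locks are taken in exactly the order the caller supplied.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-key lock table; not Infinispan's LockManager.
class OrderedLockingSketch {
   private final ConcurrentMap<Object, ReentrantLock> locks =
         new ConcurrentHashMap<Object, ReentrantLock>();

   // Locks the keys in the caller's iteration order and returns that order.
   List<Object> lockAll(Collection<Object> keys) {
      List<Object> ordered = new ArrayList<Object>(keys); // a list keeps the input order
      for (Object key : ordered) {
         locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
      }
      return ordered;
   }

   // Releases the locks in reverse acquisition order.
   void unlockAll(List<Object> ordered) {
      for (int i = ordered.size() - 1; i >= 0; i--) {
         locks.get(ordered.get(i)).unlock();
      }
   }
}
```

Preserving the caller's order does not prevent deadlocks on its own; it only means that callers who pass their keys in a consistent order get a consistent locking order.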
ISPN-1170 - If the number of joiners >= numOwners the owner sets before and after rehash can be disjoint
fc4f5d0
ISPN-1255 - Ongoing transactions waiting on locks are blocking the rehash from finishing

The generic scenario involves multiple caches.
Say we have transactions Tx1 and Tx2 spanning caches C1 and C2.
A new node joins the cluster, starting C1 and C2.
With the following sequence of events rehashing will be blocked for lockAcquisitionTimeout.

1. Tx1 prepares on C1 locking K1
2. Tx2 wants to prepare on C2, Tx2 gets the tx lock
3. Tx2 now waits to lock K1 while holding the tx lock on C2
4. Rehash starts on C2 but it can't proceed because Tx2 has the tx lock
5. Tx1 now wants to prepare on C2, but can't acquire the tx lock

I've implemented a crude "deadlock detection" scheme: a new tx will wait
the full lockAcquisitionTimeout for the tx lock, but a tx that already
has locks acquired will only wait 1/100 of that. So if there is a cycle
it will break much quicker and allow rehashing to proceed.

There is also a simpler variant where the transactions work with a single cache.
In that case if the remote command can't acquire the tx lock with 0 timeout it knows
that it has the tx lock on the origin node and it's in a deadlock situation.
6c827cb
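
The timeout rule described above can be sketched like this. The class and method names are hypothetical and a plain java.util.concurrent Lock stands in for the tx lock; only the full timeout versus 1/100 split comes from the commit message.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

class TxLockTimeoutSketch {
   // A new tx waits the full timeout; a tx that already holds locks waits 1/100 of it.
   static long txLockWaitMillis(long lockAcquisitionTimeoutMillis, boolean txHoldsLocks) {
      return txHoldsLocks ? lockAcquisitionTimeoutMillis / 100 : lockAcquisitionTimeoutMillis;
   }

   // Tries to take the tx lock within the computed wait; false means the caller should give up.
   static boolean acquireTxLock(Lock txLock, long lockAcquisitionTimeoutMillis, boolean txHoldsLocks) {
      long wait = txLockWaitMillis(lockAcquisitionTimeoutMillis, txHoldsLocks);
      try {
         return txLock.tryLock(wait, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt(); // preserve the interrupt for the caller
         return false;
      }
   }
}
```

With a 10 second lockAcquisitionTimeout, for example, a transaction that already holds locks would give up on the tx lock after 100 ms, which breaks the cycle long before the full timeout expires.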
ISPN-1106 - Allow the users to start multiple caches at once
This is no longer strictly necessary for ISPN-1106, as we are waiting
with a shorter timeout on transactions with locks and so the rehash
does not block for a very long period of time.

It is recommended however to start all caches on application startup,
and this method provides an easy way for users to start all their caches.
b5a0a0e
Manik Surtani maniksurtani closed this August 05, 2011

Showing 16 unique commits by 1 author.
