Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

ISPN-1106 and ISPN-1255 plus some random test stability isuess #491

Closed
wants to merge 16 commits into from

3 participants

@danberindei
Collaborator

https://issues.jboss.org/browse/ISPN-1106
https://issues.jboss.org/browse/ISPN-1255

Also t_1255_5.0.x for the 5.0.x branch.

There are too many commits to list everything here so it's better to look at the individual commits, but to sum it up it's all related to rehash reliability.

...g/infinispan/commands/control/LockControlCommand.java
@@ -71,15 +65,11 @@ public class LockControlCommand extends AbstractTransactionBoundaryCommand imple
public LockControlCommand(Collection<Object> keys, String cacheName, Set<Flag> flags, boolean implicit) {
this.cacheName = cacheName;
- this.keys = null;
- this.singleKey = null;
- if (keys != null && !keys.isEmpty()) {
- if (keys.size() == 1) {
- for (Object k: keys) this.singleKey = k;
- } else {
- // defensive copy
- this.keys = new HashSet<Object>(keys);
- }
+ if (keys != null) {
+ // defensive copy
+ this.keys = new ArrayList<Object>(keys);
+ } else {
+ keys = Collections.emptySet();
@maniksurtani Collaborator

SHouldn't this be this.keys = ... ?

@danberindei Collaborator
@Sanne Owner
Sanne added a note

ouch, please keep the names different :) shadowing is hard to deal with!

@danberindei Collaborator

This is the constructor, I think it's pretty standard to use the same names for the constructor parameters.

@Sanne Owner
Sanne added a note

sorry didn't notice that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
.../java/org/infinispan/manager/DefaultCacheManager.java
@@ -538,28 +538,33 @@ public class DefaultCacheManager implements EmbeddedCacheManager, CacheManager {
@GuardedBy("Cache name lock container keeps a lock per cache name which guards this method")
private Cache createCache(String cacheName) {
- CacheWrapper existingCache = caches.get(cacheName);
- if (existingCache != null)
- return existingCache.getCache();
-
- Configuration c = getConfiguration(cacheName);
- setConfigurationName(cacheName, c);
-
- c.setGlobalConfiguration(globalConfiguration);
- c.accept(configurationValidator);
- c.assertValid();
- Cache cache = new InternalCacheFactory().createCache(c, globalComponentRegistry, cacheName, reflectionCache);
- CacheWrapper cw = new CacheWrapper(cache);
+ LogFactory.pushNDC(cacheName, log.isTraceEnabled());
@Sanne Owner
Sanne added a note

you should store the result of log.isTraceEnabled() in local variable and use that for pushNDC && popNDC , as if the value changes between a push/pull it's making a mess in all subsequent debug sessions.

@danberindei Collaborator

good point, fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Dan Berindei added some commits
Dan Berindei ISPN-1255 - Distributed tasks should not be queued
Distributed tasks can't be replayed after the cache has started, because the caller needs to know the return value.
So if the cache hasn't started it should return an UnsuccessfulResponse immediately instead of returning a RequestIgnoredResponse and then running the command.
34467b8
Dan Berindei ISPN-1271 - Consistent Hash externalizer loses grouping configuration
As a workaround, skip the consistent hash check on the node receiving state.
46c7346
Dan Berindei ISPN-1123 - BaseDldLaziLockingTest was only working if the first tran…
…saction won the coin toss
101b686
Dan Berindei ISPN-1123 - Fixed waiting for updated cluster views
Some tests were not using the barfIfTooManyMembers = false flag after killing a node.
Waiting for rehashing to finish is not enough, we must ensure that we got the updated views first.
DataLossOnJoinOneOwnerTest was not creating the second cache before blocking for view.

Some tests were creating a non-default but then they were waiting for the default cache to finish rehashing.
c068fdf
Dan Berindei ISPN-1123 - Fixed mismatched test names
I added a script (bin/mismatchedTestNames.sh) to find mismatched test names.
7e6924c
Dan Berindei ISPN-1123 - Make the joins really concurrent in ConcurrentJoinTest 21e1add
Dan Berindei ISPN-1123 - SyncReplImplicitLockingTest: Allow more time for the time…
…out to kick in
361c587
Dan Berindei ISPN-1106 - Log more information about locks and rehashing
* In LockManagerImpl log the other keys owned by the current transaction.
* In DefaultCacheManager push the cache name to the NDC during cache startup.
* Improved toString() for RehashControlCommand and DistributedExecuteCommand.
* In InboundInvocationHandler log the cache name.
* Log cache start/stop.
* Log the read lock owners in JGroupsDistSync.
991b53e
Dan Berindei ISPN-1106 - Corrected DistributionManagerImpl.isAffectedByRehash()
It returns true is the CH maps the key to the current node but the previous CH didn't, meaning the key may have not arrived yet from the previous owner.
5ba8005
Dan Berindei ISPN-1106 - Possibility of data loss when there's a pending rehash an…
…d the state receivers ignore our state

Moved key invalidation after receiving the rehash completed confirmation, and only if there is no pending rehash.
Re-added notifyDataRehashed post event.
Separated the rehash completion into two phases so RebalanceTask can invalidate the keys after rehashing is done but before the cache clients know it.
2b51342
Dan Berindei ISPN-1106 - Remove locks for keys locked by a remote transaction that…
… are no longer local after a rehash

I also changed the algorithm for eager single node locking to rollback a transaction if the primary data owner changed, even if it's not a joiner or leaver (see ISPN-1275).
7927cda
Dan Berindei ISPN-1106 - Filter out the originator node from the targets of a remo…
…te get command

Needed if we are during a rehash and the key doesn't exist on the current node.
d266b29
Dan Berindei ISPN-1106 - LockControlCommand was not locking keys in the order the …
…user passed them in

The ordering was not consistent, as it was relying on a HashSet. So different commands could lock the same subset of keys in a different order, leading to a deadlock.
1955085
Dan Berindei ISPN-1170 - If the number of joiners >= numOwners the owner sets befo…
…re and after rehash can be disjoint
fc4f5d0
Dan Berindei ISPN-1255 - Ongoing transactions waiting on locks are blocking the re…
…hash from finishing

The generic scenario involves multiple caches.
Say we have transactions Tx1 and Tx2 spanning caches C1 and C2.
A new node joins the cluster, starting C1 and C2.
With the following sequence of events rehashing will be blocked for lockAcquisitionTimeout.

1. Tx1 prepares on C1 locking K1
2. Tx2 wants to prepare on C2, Tx2 gets the tx lock
3. Tx2 now waits to lock K1 while holding the tx lock on C2
4. Rehash starts on C2 but it can't proceed because Tx2 has the tx lock
5. Tx1 now wants to prepare on C2, but can't acquire the tx lock

I've implemented a crude "deadlock detection" scheme: a new tx will wait
the full lockAcquisitionTimeout for the tx lock, but a tx that already
has locks acquired will only wait 1/100 of that. So if there is a cycle
it will break much quicker and allow rehashing to proceed.

There is also a simpler variant where the transactions work with a single cache.
In that case if the remote command can't acquire the tx lock with 0 timeout it knows
that it has the tx lock on the origin node and it's in a deadlock situation.
6c827cb
Dan Berindei ISPN-1106 - Allow the users to start multiple caches at once
This is no longer strictly necessary for ISPN-1106, as we are waiting
with a shorter timeout on transactions with locks and so the rehash
does not block for a very long period of time.

It is recommended however to start all caches on application startup,
and this method provides an easy way for users to start all their caches.
b5a0a0e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Aug 5, 2011
  1. ISPN-1255 - Distributed tasks should not be queued

    Dan Berindei authored
    Distributed tasks can't be replayed after the cache has started, because the caller needs to know the return value.
    So if the cache hasn't started it should return an UnsuccessfulResponse immediately instead of returning a RequestIgnoredResponse and then running the command.
  2. ISPN-1271 - Consistent Hash externalizer loses grouping configuration

    Dan Berindei authored
    As a workaround, skip the consistent hash check on the node receiving state.
  3. ISPN-1123 - BaseDldLaziLockingTest was only working if the first tran…

    Dan Berindei authored
    …saction won the coin toss
  4. ISPN-1123 - Fixed waiting for updated cluster views

    Dan Berindei authored
    Some tests were not using the barfIfTooManyMembers = false flag after killing a node.
    Waiting for rehashing to finish is not enough, we must ensure that we got the updated views first.
    DataLossOnJoinOneOwnerTest was not creating the second cache before blocking for view.
    
    Some tests were creating a non-default but then they were waiting for the default cache to finish rehashing.
  5. ISPN-1123 - Fixed mismatched test names

    Dan Berindei authored
    I added a script (bin/mismatchedTestNames.sh) to find mismatched test names.
  6. ISPN-1123 - SyncReplImplicitLockingTest: Allow more time for the time…

    Dan Berindei authored
    …out to kick in
  7. ISPN-1106 - Log more information about locks and rehashing

    Dan Berindei authored
    * In LockManagerImpl log the other keys owned by the current transaction.
    * In DefaultCacheManager push the cache name to the NDC during cache startup.
    * Improved toString() for RehashControlCommand and DistributedExecuteCommand.
    * In InboundInvocationHandler log the cache name.
    * Log cache start/stop.
    * Log the read lock owners in JGroupsDistSync.
  8. ISPN-1106 - Corrected DistributionManagerImpl.isAffectedByRehash()

    Dan Berindei authored
    It returns true is the CH maps the key to the current node but the previous CH didn't, meaning the key may have not arrived yet from the previous owner.
  9. ISPN-1106 - Possibility of data loss when there's a pending rehash an…

    Dan Berindei authored
    …d the state receivers ignore our state
    
    Moved key invalidation after receiving the rehash completed confirmation, and only if there is no pending rehash.
    Re-added notifyDataRehashed post event.
    Separated the rehash completion into two phases so RebalanceTask can invalidate the keys after rehashing is done but before the cache clients know it.
  10. ISPN-1106 - Remove locks for keys locked by a remote transaction that…

    Dan Berindei authored
    … are no longer local after a rehash
    
    I also changed the algorithm for eager single node locking to rollback a transaction if the primary data owner changed, even if it's not a joiner or leaver (see ISPN-1275).
  11. ISPN-1106 - Filter out the originator node from the targets of a remo…

    Dan Berindei authored
    …te get command
    
    Needed if we are during a rehash and the key doesn't exist on the current node.
  12. ISPN-1106 - LockControlCommand was not locking keys in the order the …

    Dan Berindei authored
    …user passed them in
    
    The ordering was not consistent, as it was relying on a HashSet. So different commands could lock the same subset of keys in a different order, leading to a deadlock.
  13. ISPN-1170 - If the number of joiners >= numOwners the owner sets befo…

    Dan Berindei authored
    …re and after rehash can be disjoint
  14. ISPN-1255 - Ongoing transactions waiting on locks are blocking the re…

    Dan Berindei authored
    …hash from finishing
    
    The generic scenario involves multiple caches.
    Say we have transactions Tx1 and Tx2 spanning caches C1 and C2.
    A new node joins the cluster, starting C1 and C2.
    With the following sequence of events rehashing will be blocked for lockAcquisitionTimeout.
    
    1. Tx1 prepares on C1 locking K1
    2. Tx2 wants to prepare on C2, Tx2 gets the tx lock
    3. Tx2 now waits to lock K1 while holding the tx lock on C2
    4. Rehash starts on C2 but it can't proceed because Tx2 has the tx lock
    5. Tx1 now wants to prepare on C2, but can't acquire the tx lock
    
    I've implemented a crude "deadlock detection" scheme: a new tx will wait
    the full lockAcquisitionTimeout for the tx lock, but a tx that already
    has locks acquired will only wait 1/100 of that. So if there is a cycle
    it will break much quicker and allow rehashing to proceed.
    
    There is also a simpler variant where the transactions work with a single cache.
    In that case if the remote command can't acquire the tx lock with 0 timeout it knows
    that it has the tx lock on the origin node and it's in a deadlock situation.
  15. ISPN-1106 - Allow the users to start multiple caches at once

    Dan Berindei authored
    This is no longer strictly necessary for ISPN-1106, as we are waiting
    with a shorter timeout on transactions with locks and so the rehash
    does not block for a very long period of time.
    
    It is recommended however to start all caches on application startup,
    and this method provides an easy way for users to start all their caches.
Something went wrong with that request. Please try again.