ISPN-1106 and ISPN-1255 plus some random test stability isuess #491

Closed
wants to merge 16 commits into
from

Conversation

Projects
None yet
3 participants
@danberindei
Member

danberindei commented Aug 4, 2011

https://issues.jboss.org/browse/ISPN-1106
https://issues.jboss.org/browse/ISPN-1255

Also t_1255_5.0.x for the 5.0.x branch.

There are too many commits to list everything here so it's better to look at the individual commits, but to sum it up it's all related to rehash reliability.

@maniksurtani

View changes

core/src/main/java/org/infinispan/commands/control/LockControlCommand.java
+ // defensive copy
+ this.keys = new ArrayList<Object>(keys);
+ } else {
+ keys = Collections.emptySet();

This comment has been minimized.

@maniksurtani

maniksurtani Aug 4, 2011

Member

SHouldn't this be this.keys = ... ?

@maniksurtani

maniksurtani Aug 4, 2011

Member

SHouldn't this be this.keys = ... ?

This comment has been minimized.

@danberindei

danberindei Aug 4, 2011

Member

Good catch! I guess the tests never create a lock command without any key ;)

@danberindei

danberindei Aug 4, 2011

Member

Good catch! I guess the tests never create a lock command without any key ;)

This comment has been minimized.

@Sanne

Sanne Aug 5, 2011

Member

ouch, please keep the names different :) shadowing is hard to deal with!

@Sanne

Sanne Aug 5, 2011

Member

ouch, please keep the names different :) shadowing is hard to deal with!

This comment has been minimized.

@danberindei

danberindei Aug 5, 2011

Member

This is the constructor, I think it's pretty standard to use the same names for the constructor parameters.

@danberindei

danberindei Aug 5, 2011

Member

This is the constructor, I think it's pretty standard to use the same names for the constructor parameters.

This comment has been minimized.

@Sanne

Sanne Aug 5, 2011

Member

sorry didn't notice that :)

@Sanne

Sanne Aug 5, 2011

Member

sorry didn't notice that :)

@Sanne

View changes

core/src/main/java/org/infinispan/manager/DefaultCacheManager.java
- c.assertValid();
- Cache cache = new InternalCacheFactory().createCache(c, globalComponentRegistry, cacheName, reflectionCache);
- CacheWrapper cw = new CacheWrapper(cache);
+ LogFactory.pushNDC(cacheName, log.isTraceEnabled());

This comment has been minimized.

@Sanne

Sanne Aug 5, 2011

Member

you should store the result of log.isTraceEnabled() in local variable and use that for pushNDC && popNDC , as if the value changes between a push/pull it's making a mess in all subsequent debug sessions.

@Sanne

Sanne Aug 5, 2011

Member

you should store the result of log.isTraceEnabled() in local variable and use that for pushNDC && popNDC , as if the value changes between a push/pull it's making a mess in all subsequent debug sessions.

This comment has been minimized.

@danberindei

danberindei Aug 5, 2011

Member

good point, fixed

@danberindei

danberindei Aug 5, 2011

Member

good point, fixed

Dan Berindei added some commits Jul 28, 2011

Dan Berindei
ISPN-1255 - Distributed tasks should not be queued
Distributed tasks can't be replayed after the cache has started, because the caller needs to know the return value.
So if the cache hasn't started it should return an UnsuccessfulResponse immediately instead of returning a RequestIgnoredResponse and then running the command.
Dan Berindei
ISPN-1271 - Consistent Hash externalizer loses grouping configuration
As a workaround, skip the consistent hash check on the node receiving state.
Dan Berindei
ISPN-1123 - Fixed waiting for updated cluster views
Some tests were not using the barfIfTooManyMembers = false flag after killing a node.
Waiting for rehashing to finish is not enough, we must ensure that we got the updated views first.
DataLossOnJoinOneOwnerTest was not creating the second cache before blocking for view.

Some tests were creating a non-default but then they were waiting for the default cache to finish rehashing.
Dan Berindei
ISPN-1123 - Fixed mismatched test names
I added a script (bin/mismatchedTestNames.sh) to find mismatched test names.
Dan Berindei
ISPN-1106 - Log more information about locks and rehashing
* In LockManagerImpl log the other keys owned by the current transaction.
* In DefaultCacheManager push the cache name to the NDC during cache startup.
* Improved toString() for RehashControlCommand and DistributedExecuteCommand.
* In InboundInvocationHandler log the cache name.
* Log cache start/stop.
* Log the read lock owners in JGroupsDistSync.
Dan Berindei
ISPN-1106 - Corrected DistributionManagerImpl.isAffectedByRehash()
It returns true is the CH maps the key to the current node but the previous CH didn't, meaning the key may have not arrived yet from the previous owner.
Dan Berindei
ISPN-1106 - Possibility of data loss when there's a pending rehash an…
…d the state receivers ignore our state

Moved key invalidation after receiving the rehash completed confirmation, and only if there is no pending rehash.
Re-added notifyDataRehashed post event.
Separated the rehash completion into two phases so RebalanceTask can invalidate the keys after rehashing is done but before the cache clients know it.
Dan Berindei
ISPN-1106 - Remove locks for keys locked by a remote transaction that…
… are no longer local after a rehash

I also changed the algorithm for eager single node locking to rollback a transaction if the primary data owner changed, even if it's not a joiner or leaver (see ISPN-1275).
Dan Berindei
ISPN-1106 - Filter out the originator node from the targets of a remo…
…te get command

Needed if we are during a rehash and the key doesn't exist on the current node.
Dan Berindei
ISPN-1106 - LockControlCommand was not locking keys in the order the …
…user passed them in

The ordering was not consistent, as it was relying on a HashSet. So different commands could lock the same subset of keys in a different order, leading to a deadlock.
Dan Berindei
ISPN-1255 - Ongoing transactions waiting on locks are blocking the re…
…hash from finishing

The generic scenario involves multiple caches.
Say we have transactions Tx1 and Tx2 spanning caches C1 and C2.
A new node joins the cluster, starting C1 and C2.
With the following sequence of events rehashing will be blocked for lockAcquisitionTimeout.

1. Tx1 prepares on C1 locking K1
2. Tx2 wants to prepare on C2, Tx2 gets the tx lock
3. Tx2 now waits to lock K1 while holding the tx lock on C2
4. Rehash starts on C2 but it can't proceed because Tx2 has the tx lock
5. Tx1 now wants to prepare on C2, but can't acquire the tx lock

I've implemented a crude "deadlock detection" scheme: a new tx will wait
the full lockAcquisitionTimeout for the tx lock, but a tx that already
has locks acquired will only wait 1/100 of that. So if there is a cycle
it will break much quicker and allow rehashing to proceed.

There is also a simpler variant where the transactions work with a single cache.
In that case if the remote command can't acquire the tx lock with 0 timeout it knows
that it has the tx lock on the origin node and it's in a deadlock situation.
Dan Berindei
ISPN-1106 - Allow the users to start multiple caches at once
This is no longer strictly necessary for ISPN-1106, as we are waiting
with a shorter timeout on transactions with locks and so the rehash
does not block for a very long period of time.

It is recommended however to start all caches on application startup,
and this method provides an easy way for users to start all their caches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment