ISPN-1106 and ISPN-1255 plus some random test stability isuess #491

Closed
wants to merge 16 commits into
from

2 participants

@danberindei
Infinispan member

https://issues.jboss.org/browse/ISPN-1106
https://issues.jboss.org/browse/ISPN-1255

Also t_1255_5.0.x for the 5.0.x branch.

There are too many commits to list everything here so it's better to look at the individual commits, but to sum it up it's all related to rehash reliability.

@maniksurtani maniksurtani and 2 others commented on an outdated diff Aug 4, 2011
...g/infinispan/commands/control/LockControlCommand.java
@@ -71,15 +65,11 @@ public class LockControlCommand extends AbstractTransactionBoundaryCommand imple
public LockControlCommand(Collection<Object> keys, String cacheName, Set<Flag> flags, boolean implicit) {
this.cacheName = cacheName;
- this.keys = null;
- this.singleKey = null;
- if (keys != null && !keys.isEmpty()) {
- if (keys.size() == 1) {
- for (Object k: keys) this.singleKey = k;
- } else {
- // defensive copy
- this.keys = new HashSet<Object>(keys);
- }
+ if (keys != null) {
+ // defensive copy
+ this.keys = new ArrayList<Object>(keys);
+ } else {
+ keys = Collections.emptySet();
@maniksurtani
Infinispan member
maniksurtani added a line comment Aug 4, 2011

SHouldn't this be this.keys = ... ?

@danberindei
Infinispan member
danberindei added a line comment Aug 4, 2011
@Sanne
Infinispan member
Sanne added a line comment Aug 5, 2011

ouch, please keep the names different :) shadowing is hard to deal with!

@danberindei
Infinispan member
danberindei added a line comment Aug 5, 2011

This is the constructor, I think it's pretty standard to use the same names for the constructor parameters.

@Sanne
Infinispan member
Sanne added a line comment Aug 5, 2011

sorry didn't notice that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@Sanne Sanne and 1 other commented on an outdated diff Aug 5, 2011
.../java/org/infinispan/manager/DefaultCacheManager.java
@@ -538,28 +538,33 @@ public class DefaultCacheManager implements EmbeddedCacheManager, CacheManager {
@GuardedBy("Cache name lock container keeps a lock per cache name which guards this method")
private Cache createCache(String cacheName) {
- CacheWrapper existingCache = caches.get(cacheName);
- if (existingCache != null)
- return existingCache.getCache();
-
- Configuration c = getConfiguration(cacheName);
- setConfigurationName(cacheName, c);
-
- c.setGlobalConfiguration(globalConfiguration);
- c.accept(configurationValidator);
- c.assertValid();
- Cache cache = new InternalCacheFactory().createCache(c, globalComponentRegistry, cacheName, reflectionCache);
- CacheWrapper cw = new CacheWrapper(cache);
+ LogFactory.pushNDC(cacheName, log.isTraceEnabled());
@Sanne
Infinispan member
Sanne added a line comment Aug 5, 2011

you should store the result of log.isTraceEnabled() in local variable and use that for pushNDC && popNDC , as if the value changes between a push/pull it's making a mess in all subsequent debug sessions.

@danberindei
Infinispan member
danberindei added a line comment Aug 5, 2011

good point, fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Dan Berindei added some commits Jul 28, 2011
Dan Berindei ISPN-1255 - Distributed tasks should not be queued
Distributed tasks can't be replayed after the cache has started, because the caller needs to know the return value.
So if the cache hasn't started it should return an UnsuccessfulResponse immediately instead of returning a RequestIgnoredResponse and then running the command.
34467b8
Dan Berindei ISPN-1271 - Consistent Hash externalizer loses grouping configuration
As a workaround, skip the consistent hash check on the node receiving state.
46c7346
Dan Berindei ISPN-1123 - BaseDldLaziLockingTest was only working if the first tran…
…saction won the coin toss
101b686
Dan Berindei ISPN-1123 - Fixed waiting for updated cluster views
Some tests were not using the barfIfTooManyMembers = false flag after killing a node.
Waiting for rehashing to finish is not enough, we must ensure that we got the updated views first.
DataLossOnJoinOneOwnerTest was not creating the second cache before blocking for view.

Some tests were creating a non-default but then they were waiting for the default cache to finish rehashing.
c068fdf
Dan Berindei ISPN-1123 - Fixed mismatched test names
I added a script (bin/mismatchedTestNames.sh) to find mismatched test names.
7e6924c
Dan Berindei ISPN-1123 - Make the joins really concurrent in ConcurrentJoinTest 21e1add
Dan Berindei ISPN-1123 - SyncReplImplicitLockingTest: Allow more time for the time…
…out to kick in
361c587
Dan Berindei ISPN-1106 - Log more information about locks and rehashing
* In LockManagerImpl log the other keys owned by the current transaction.
* In DefaultCacheManager push the cache name to the NDC during cache startup.
* Improved toString() for RehashControlCommand and DistributedExecuteCommand.
* In InboundInvocationHandler log the cache name.
* Log cache start/stop.
* Log the read lock owners in JGroupsDistSync.
991b53e
Dan Berindei ISPN-1106 - Corrected DistributionManagerImpl.isAffectedByRehash()
It returns true is the CH maps the key to the current node but the previous CH didn't, meaning the key may have not arrived yet from the previous owner.
5ba8005
Dan Berindei ISPN-1106 - Possibility of data loss when there's a pending rehash an…
…d the state receivers ignore our state

Moved key invalidation after receiving the rehash completed confirmation, and only if there is no pending rehash.
Re-added notifyDataRehashed post event.
Separated the rehash completion into two phases so RebalanceTask can invalidate the keys after rehashing is done but before the cache clients know it.
2b51342
Dan Berindei ISPN-1106 - Remove locks for keys locked by a remote transaction that…
… are no longer local after a rehash

I also changed the algorithm for eager single node locking to rollback a transaction if the primary data owner changed, even if it's not a joiner or leaver (see ISPN-1275).
7927cda
Dan Berindei ISPN-1106 - Filter out the originator node from the targets of a remo…
…te get command

Needed if we are during a rehash and the key doesn't exist on the current node.
d266b29
Dan Berindei ISPN-1106 - LockControlCommand was not locking keys in the order the …
…user passed them in

The ordering was not consistent, as it was relying on a HashSet. So different commands could lock the same subset of keys in a different order, leading to a deadlock.
1955085
Dan Berindei ISPN-1170 - If the number of joiners >= numOwners the owner sets befo…
…re and after rehash can be disjoint
fc4f5d0
Dan Berindei ISPN-1255 - Ongoing transactions waiting on locks are blocking the re…
…hash from finishing

The generic scenario involves multiple caches.
Say we have transactions Tx1 and Tx2 spanning caches C1 and C2.
A new node joins the cluster, starting C1 and C2.
With the following sequence of events rehashing will be blocked for lockAcquisitionTimeout.

1. Tx1 prepares on C1 locking K1
2. Tx2 wants to prepare on C2, Tx2 gets the tx lock
3. Tx2 now waits to lock K1 while holding the tx lock on C2
4. Rehash starts on C2 but it can't proceed because Tx2 has the tx lock
5. Tx1 now wants to prepare on C2, but can't acquire the tx lock

I've implemented a crude "deadlock detection" scheme: a new tx will wait
the full lockAcquisitionTimeout for the tx lock, but a tx that already
has locks acquired will only wait 1/100 of that. So if there is a cycle
it will break much quicker and allow rehashing to proceed.

There is also a simpler variant where the transactions work with a single cache.
In that case if the remote command can't acquire the tx lock with 0 timeout it knows
that it has the tx lock on the origin node and it's in a deadlock situation.
6c827cb
Dan Berindei ISPN-1106 - Allow the users to start multiple caches at once
This is no longer strictly necessary for ISPN-1106, as we are waiting
with a shorter timeout on transactions with locks and so the rehash
does not block for a very long period of time.

It is recommended however to start all caches on application startup,
and this method provides an easy way for users to start all their caches.
b5a0a0e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment