
Queue poll operation - WrongTargetException #3754

Closed
IlaiG opened this issue Oct 2, 2014 · 18 comments

@IlaiG
commented Oct 2, 2014

We're running a two-node client-server deployment based on Hazelcast 3.2.5 on AWS EC2.

After one of the nodes disconnected and reconnected, a merge operation occurred. Following it, we started to get a huge number of the message below, and, more importantly, clients connected to SERVER1 were stuck on poll, not receiving any items even though the queue was not empty:

» 12:41:05.080 WARNING: [SERVER1]:5701 [dev] [3.2.5] Op: com.hazelcast.queue.PollOperation@5be9d2c5, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[SERVER1]:5701, target:Address[SERVER2]:5701, partitionId: 199, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
» 12:41:05.080 Oct 01, 2014 9:41:13 AM com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp
@stikku

commented Nov 6, 2014

We are also experiencing the same issue.

Our setup has 2 (in some cases 3) nodes running version 3.3.1. The same error occurs after a node disconnects and reconnects, though it does not seem to happen on every disconnect/reconnect.

We also outputted a list of all partitions and their owners on every node. The results match, so the partition table appears to be synchronized across nodes.

We would greatly appreciate an update on the progress of this issue and on how severe it currently is. Thanks.

@mdogan mdogan added Team: Core and removed Type: Defect labels Dec 31, 2014

@FlavioF

commented Mar 18, 2015

+1

@pveentjer

Member

commented Mar 18, 2015

I don't think this is an issue in the queue itself; I think there is a problem deeper in the system. We should get rid of the exception used to indicate that a partition has migrated and use one of our 'responses' instead. That makes it much cheaper to notify the invocation that the partition has migrated, since there is no need to build a stack trace and all that.

Does this logging eventually stop? A 'crazy' amount makes me think it isn't a temporary issue. Normally you get this exception and the operation is redirected to the right member, but this sounds different. How many log entries are there, and how long does it go on?

There is apparently a second issue, since the clients are stuck on the poll. I know @mdogan has done work on the waiting mechanism for when a member leaves the cluster; perhaps that solves this particular problem as well.
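The redirect-on-WrongTargetException behaviour described here amounts to a retry loop around the invocation. As a rough illustration only (this is not Hazelcast's actual invocation code: the retryable exception class is caller-supplied, and `maxTries`/`pauseMillis` merely mirror the `tryCount`/`tryPauseMillis` values visible in the logs in this thread), the pattern looks like:

```java
import java.util.concurrent.Callable;

// Sketch of the retry-and-redirect pattern: a retryable failure (standing in
// for WrongTargetException) triggers a pause and a re-invocation; anything
// else propagates immediately. Not Hazelcast's real invocation machinery.
public class RetryInvoker {

    public static <T> T invokeWithRetry(Callable<T> operation,
                                        Class<? extends Exception> retryable,
                                        int maxTries,
                                        long pauseMillis) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxTries; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (!retryable.isInstance(e)) {
                    throw e; // non-retryable failures propagate immediately
                }
                last = e;                  // remember the failure,
                Thread.sleep(pauseMillis); // back off, then re-invoke (after a
                                           // real migration the retry would be
                                           // routed to the new partition owner)
            }
        }
        throw last; // retries exhausted: surface the last retryable failure
    }
}
```

The complaint in this thread is precisely that the loop never terminates: the `WaitingOperation` keeps failing with `WrongTargetException` on every retry instead of being re-routed.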

@FlavioF


commented Mar 18, 2015

In my case it goes on forever. :) Yes, maybe that issue (#4676) will fix it.

@albertocsm


commented Jun 5, 2015

I'm afraid this issue is not resolved; I'm still experiencing it with 3.4.2.
@mdogan's PR #4678 did not fix it.

I do not have a split-brain condition as mentioned in #4676. What I do have is an isolated node that, as soon as it came up, was briefly seen by members of the cluster and then left right away.
When the isolated node rejoins the cluster, I get the aforementioned error log, and the node stays unresponsive indefinitely.
All the other nodes in the cluster (~50) behave normally. This is very much a show stopper for our use case.

Does anyone have any suggestion on how to approach a fix for this?

cc/ @FlavioF @jbonofre

@lhsvobodaj


commented Jul 16, 2015

It seems this issue isn't solved yet. I have a test setup with two nodes running version 3.5 on the same machine (ports 5701 and 5702). Both nodes have a thread accessing a shared queue and waiting on a take() operation, as shown by the log below:

Exception in thread "ON-DEMAND-EXECUTOR-THREAD" com.hazelcast.core.OperationTimeoutException: No response for 360000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=201, callId=620, invocationTime=1437047210813, waitTimeout=-180001, callTimeout=60000}, partitionId=201, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=2, callTimeout=60000, target=Address[127.0.0.1]:5702, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)

After the nodes finish the merge process, an endless stream of log messages is printed (the log below covers an interval of ~8 minutes):

[16/07/2015 08:46:27:275] [com.hazelcast.cluster.impl.TcpIpJoiner] DEBUG: [127.0.0.1]:5701 [nms_dist_poller] [3.5] Address[127.0.0.1]:5702 should merge to this node , because : node.getThisAddress().hashCode() < joinRequest.address.hashCode() , this node member count: 1
...
[16/07/2015 08:46:50:736] [com.hazelcast.cluster.impl.TcpIpJoiner] INFO: [127.0.0.1]:5702 [nms_dist_poller] [3.5] Address[127.0.0.1]:5702 is merging to Address[127.0.0.1]:5701, because : node.getThisAddress().hashCode() > joinRequest.address.hashCode() , this node member count: 1
...
[16/07/2015 08:46:58:458] [com.hazelcast.partition.InternalPartitionService] INFO: [127.0.0.1]:5701 [nms_dist_poller] [3.5] All migration tasks have been completed, queues are empty.
[16/07/2015 08:46:58:859] [com.hazelcast.core.LifecycleService] INFO: [127.0.0.1]:5702 [nms_dist_poller] [3.5] Address[127.0.0.1]:5702 is MERGED
[16/07/2015 08:47:51:732] [com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation] WARN: [127.0.0.1]:5702 [nms_dist_poller] [3.5] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=185, callId=621, invocationTime=1437047210813, waitTimeout=-240001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[127.0.0.1]:5702, target:Address[127.0.0.1]:5701, partitionId: 185, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService
...
[16/07/2015 08:55:25:793] [com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation] WARN: [127.0.0.1]:5702 [nms_dist_poller] [3.5] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=201, callId=0, invocationTime=1437047210813, waitTimeout=-180001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[127.0.0.1]:5702, target:Address[127.0.0.1]:5701, partitionId: 201, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService

As @albertocsm asked earlier, is there any workaround for this problem?
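One possible stopgap, assuming the root cause is the stuck blocking operation (this mitigates the symptom, it does not fix the underlying Hazelcast bug): avoid the indefinitely blocking take() and use poll() with a timeout, so a stalled operation surfaces as an empty result the caller can react to. Since Hazelcast's IQueue extends java.util.concurrent.BlockingQueue, the same call works against a real cluster queue; the sketch below uses a plain LinkedBlockingQueue as a stand-in, and the method name pollWithDeadline is hypothetical:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Stopgap sketch, not a fix: bound each poll so a stalled queue operation
// surfaces as null instead of hanging forever the way take() does here.
public class BoundedConsumer {

    public static <E> E pollWithDeadline(BlockingQueue<E> queue,
                                         long perAttemptTimeoutMs,
                                         int maxEmptyAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxEmptyAttempts; attempt++) {
            E item = queue.poll(perAttemptTimeoutMs, TimeUnit.MILLISECONDS);
            if (item != null) {
                return item;
            }
            // Empty (or silently stalled) attempt: the caller could log here,
            // check cluster health, or fail over before trying again.
        }
        return null; // give up after maxEmptyAttempts instead of blocking forever
    }
}
```

The caller then distinguishes "genuinely empty for N attempts" from "has items but nothing arrives", which is exactly the stuck state reported in this thread.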

@enesakar enesakar removed the PENDING label Nov 2, 2015

@bwzhang2011


commented Nov 14, 2015

@lhsvobodaj, @albertocsm, any update on this issue?

@albertocsm


commented Nov 14, 2015

Don't know. I solved it by no longer using HZ in the problematic use case.

@lhsvobodaj


commented Nov 14, 2015

@bwzhang2011, we made the same decision as @albertocsm; in our case we switched to a JMS implementation.

@bwzhang2011


commented Nov 21, 2015

@mdogan, @pveentjer, any update on this issue?

@chrismyers81


commented Nov 25, 2015

We just started seeing this issue after rebooting one of our CAS cluster nodes; we hadn't had a problem until today.

When that node (muwacnode2) rebooted, it became the master and then started spewing:

Nov 25, 2015 3:26:27 PM com.hazelcast.spi.impl.BasicInvocation
WARNING: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] No response for 120000 ms. BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:core:clusterService', op=MemberInfoUpdateOperation {
MemberInfo{address=Address[muwacnode2.millikin.edu]:5701}
MemberInfo{address=Address[muwadev1.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode3.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode1.millikin.edu]:5701}
}, partitionId=-1, replicaIndex=0, tryCount=100, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[muwacnode3.millikin.edu]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false}
Nov 25, 2015 3:26:27 PM com.hazelcast.nio.tcp.SocketConnector
INFO: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] Connecting to muwacnode3.millikin.edu/172.16.1.69:5701, timeout: 0, bind-any: true
Nov 25, 2015 3:26:27 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
INFO: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] Established socket connection between /172.16.1.67:35113 and muwacnode3.millikin.edu/172.16.1.69:5701
Nov 25, 2015 3:26:27 PM com.hazelcast.spi.impl.BasicInvocation
WARNING: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] Asking if operation execution has been started: BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:core:clusterService', op=MemberInfoUpdateOperation {
MemberInfo{address=Address[muwacnode2.millikin.edu]:5701}
MemberInfo{address=Address[muwadev1.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode3.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode1.millikin.edu]:5701}
}, partitionId=-1, replicaIndex=0, tryCount=100, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[muwacnode3.millikin.edu]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false}
Nov 25, 2015 3:26:27 PM com.hazelcast.spi.impl.BasicInvocation
WARNING: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] While asking 'is-executing': BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:core:clusterService', op=MemberInfoUpdateOperation {
MemberInfo{address=Address[muwacnode2.millikin.edu]:5701}
MemberInfo{address=Address[muwadev1.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode3.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode1.millikin.edu]:5701}
}, partitionId=-1, replicaIndex=0, tryCount=100, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[muwacnode3.millikin.edu]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false}
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.RetryableIOException: Packet not send to -> Address[muwacnode3.millikin.edu]:5701
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:343)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:194)
        at com.hazelcast.spi.impl.BasicInvocationFuture.isOperationExecuting(BasicInvocationFuture.java:421)
        at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:244)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:193)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:173)
        at com.hazelcast.spi.impl.BasicInvocationFuture.getSafely(BasicInvocationFuture.java:185)
        at com.hazelcast.util.FutureUtil.retrieveValue(FutureUtil.java:301)
        at com.hazelcast.util.FutureUtil.executeWithDeadline(FutureUtil.java:289)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:278)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:252)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.startJoin(ClusterServiceImpl.java:934)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.handleJoinRequest(ClusterServiceImpl.java:618)
        at com.hazelcast.cluster.impl.operations.JoinRequestOperation.run(JoinRequestOperation.java:40)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.handle(BasicOperationService.java:749)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.access$500(BasicOperationService.java:725)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:699)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:643)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.access$1500(BasicOperationService.java:630)
        at com.hazelcast.spi.impl.BasicOperationService$BasicDispatcherImpl.dispatch(BasicOperationService.java:582)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.process(BasicOperationScheduler.java:466)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.processPriorityMessages(BasicOperationScheduler.java:480)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.doRun(BasicOperationScheduler.java:457)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.run(BasicOperationScheduler.java:432)
Caused by: com.hazelcast.spi.exception.RetryableIOException: Packet not send to -> Address[muwacnode3.millikin.edu]:5701
        at com.hazelcast.spi.impl.BasicInvocation.doInvokeRemote(BasicInvocation.java:291)
        at com.hazelcast.spi.impl.BasicInvocation.doInvoke(BasicInvocation.java:260)
        at com.hazelcast.spi.impl.BasicInvocation.invoke(BasicInvocation.java:229)
        at com.hazelcast.spi.impl.BasicInvocationFuture.isOperationExecuting(BasicInvocationFuture.java:419)
        at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:244)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:193)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:173)
        at com.hazelcast.spi.impl.BasicInvocationFuture.getSafely(BasicInvocationFuture.java:185)
        at com.hazelcast.util.FutureUtil.retrieveValue(FutureUtil.java:301)
        at com.hazelcast.util.FutureUtil.executeWithDeadline(FutureUtil.java:289)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:278)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:252)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.startJoin(ClusterServiceImpl.java:934)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.handleJoinRequest(ClusterServiceImpl.java:618)
        at com.hazelcast.cluster.impl.operations.JoinRequestOperation.run(JoinRequestOperation.java:40)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.handle(BasicOperationService.java:749)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.access$500(BasicOperationService.java:725)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:699)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:643)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.access$1500(BasicOperationService.java:630)
        at com.hazelcast.spi.impl.BasicOperationService$BasicDispatcherImpl.dispatch(BasicOperationService.java:582)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.process(BasicOperationScheduler.java:466)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.processPriorityMessages(BasicOperationScheduler.java:480)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.doRun(BasicOperationScheduler.java:457)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.run(BasicOperationScheduler.java:432)
        at ------ End remote and begin local stack-trace ------.(Unknown Source)
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponse(BasicInvocationFuture.java:387)
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:320)
        ... 23 more

Nov 25, 2015 3:26:27 PM com.hazelcast.spi.impl.BasicInvocation
WARNING: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] 'is-executing': false -> BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:core:clusterService', op=MemberInfoUpdateOperation {
MemberInfo{address=Address[muwacnode2.millikin.edu]:5701}
MemberInfo{address=Address[muwadev1.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode3.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode1.millikin.edu]:5701}
}, partitionId=-1, replicaIndex=0, tryCount=100, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[muwacnode3.millikin.edu]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false}
Nov 25, 2015 3:26:27 PM com.hazelcast.cluster.ClusterService
WARNING: [muwacnode2.millikin.edu]:5701 [dev] [3.4.2] While waiting finalize join calls...
com.hazelcast.core.OperationTimeoutException: No response for 600000 ms. Aborting invocation! BasicInvocationFuture{invocation=BasicInvocation{ serviceName='hz:core:clusterService', op=MemberInfoUpdateOperation {
MemberInfo{address=Address[muwacnode2.millikin.edu]:5701}
MemberInfo{address=Address[muwadev1.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode3.millikin.edu]:5701}
MemberInfo{address=Address[muwacnode1.millikin.edu]:5701}
}, partitionId=-1, replicaIndex=0, tryCount=100, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[muwacnode3.millikin.edu]:5701, backupsExpected=0, backupsCompleted=0}, response=null, done=false} No response has been received!  backups-expected:0 backups-completed: 0
        at com.hazelcast.spi.impl.BasicInvocationFuture.newOperationTimeoutException(BasicInvocationFuture.java:309)
        at com.hazelcast.spi.impl.BasicInvocationFuture.waitForResponse(BasicInvocationFuture.java:246)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:193)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:173)
        at com.hazelcast.spi.impl.BasicInvocationFuture.getSafely(BasicInvocationFuture.java:185)
        at com.hazelcast.util.FutureUtil.retrieveValue(FutureUtil.java:301)
        at com.hazelcast.util.FutureUtil.executeWithDeadline(FutureUtil.java:289)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:278)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:252)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.startJoin(ClusterServiceImpl.java:934)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.handleJoinRequest(ClusterServiceImpl.java:618)
        at com.hazelcast.cluster.impl.operations.JoinRequestOperation.run(JoinRequestOperation.java:40)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.handle(BasicOperationService.java:749)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.access$500(BasicOperationService.java:725)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:699)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:643)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.access$1500(BasicOperationService.java:630)
        at com.hazelcast.spi.impl.BasicOperationService$BasicDispatcherImpl.dispatch(BasicOperationService.java:582)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.process(BasicOperationScheduler.java:466)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.processPriorityMessages(BasicOperationScheduler.java:480)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.doRun(BasicOperationScheduler.java:457)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.run(BasicOperationScheduler.java:432)
        at ------ End remote and begin local stack-trace ------.(Unknown Source)
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponse(BasicInvocationFuture.java:387)
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:320)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:194)
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:173)
        at com.hazelcast.spi.impl.BasicInvocationFuture.getSafely(BasicInvocationFuture.java:185)
        at com.hazelcast.util.FutureUtil.retrieveValue(FutureUtil.java:301)
        at com.hazelcast.util.FutureUtil.executeWithDeadline(FutureUtil.java:289)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:278)
        at com.hazelcast.util.FutureUtil.waitWithDeadline(FutureUtil.java:252)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.startJoin(ClusterServiceImpl.java:934)
        at com.hazelcast.cluster.impl.ClusterServiceImpl.handleJoinRequest(ClusterServiceImpl.java:618)
        at com.hazelcast.cluster.impl.operations.JoinRequestOperation.run(JoinRequestOperation.java:40)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.handle(BasicOperationService.java:749)
        at com.hazelcast.spi.impl.BasicOperationService$OperationHandler.access$500(BasicOperationService.java:725)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:699)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.handle(BasicOperationService.java:643)
        at com.hazelcast.spi.impl.BasicOperationService$OperationPacketHandler.access$1500(BasicOperationService.java:630)
        at com.hazelcast.spi.impl.BasicOperationService$BasicDispatcherImpl.dispatch(BasicOperationService.java:582)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.process(BasicOperationScheduler.java:466)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.processPriorityMessages(BasicOperationScheduler.java:480)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.doRun(BasicOperationScheduler.java:457)
        at com.hazelcast.spi.impl.BasicOperationScheduler$OperationThread.run(BasicOperationScheduler.java:432)

The other two nodes in the cluster are complaining:

WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
Nov 25, 2015 3:26:27 PM com.hazelcast.nio.tcp.SocketAcceptor
INFO: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Accepting socket connection from /172.16.1.67:35113
Nov 25, 2015 3:26:27 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
INFO: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Established socket connection between /172.16.1.69:5701 and muwacnode2.millikin.edu/172.16.1.67:35113
Nov 25, 2015 3:26:29 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
Nov 25, 2015 3:27:04 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
Nov 25, 2015 3:27:39 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
Nov 25, 2015 3:28:14 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
Nov 25, 2015 3:28:49 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
Nov 25, 2015 3:29:24 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [muwacnode3.millikin.edu]:5701 [dev] [3.4.2] Couldn't join to the master : Address[muwacnode2.millikin.edu]:5701
[2015-11-25 15:29:54,595] DEBUG org.jasig.cas.web.FlowExecutionExceptionResolver Ignoring the received exception due to a type mismatch 
org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
        at org.springframework.webflow.conversation.impl.JdkConcurrentConversationLock.lock(JdkConcurrentConversationLock.java:44)
        at org.springframework.webflow.conversation.impl.ContainedConversation.lock(ContainedConversation.java:69)
        at org.springframework.webflow.execution.repository.support.ConversationBackedFlowExecutionLock.lock(ConversationBackedFlowExecutionLock.java:51)
        at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:166)
        at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
        at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:838)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
        at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
        at org.jasig.cas.web.init.SafeDispatcherServlet.service(SafeDispatcherServlet.java:125)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
        at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
        at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:343)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:260)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.github.inspektr.common.web.ClientInfoThreadLocalFilter.doFilter(ClientInfoThreadLocalFilter.java:63)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:227)
        at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:347)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:879)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:617)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1774)
        at java.lang.Thread.run(Thread.java:722)
Nov 25, 2015 3:29:54 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet cas threw exception
org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
        at org.springframework.webflow.conversation.impl.JdkConcurrentConversationLock.lock(JdkConcurrentConversationLock.java:44)
        at org.springframework.webflow.conversation.impl.ContainedConversation.lock(ContainedConversation.java:69)
        at org.springframework.webflow.execution.repository.support.ConversationBackedFlowExecutionLock.lock(ConversationBackedFlowExecutionLock.java:51)
        at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:166)
        at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
        at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:838)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
        at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
        at org.jasig.cas.web.init.SafeDispatcherServlet.service(SafeDispatcherServlet.java:125)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
        at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
        at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:343)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:260)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.github.inspektr.common.web.ClientInfoThreadLocalFilter.doFilter(ClientInfoThreadLocalFilter.java:63)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:227)
        at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:347)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:879)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:617)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1774)
        at java.lang.Thread.run(Thread.java:722)

[2015-11-25 15:29:55,573] DEBUG org.jasig.cas.web.support.CasArgumentExtractor Extractor generated service for: https://my.millikin.edu/muportal/secure/ 
[2015-11-25 15:30:02,928] DEBUG org.jasig.cas.web.FlowExecutionExceptionResolver Ignoring the received exception due to a type mismatch 
org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
        at org.springframework.webflow.conversation.impl.JdkConcurrentConversationLock.lock(JdkConcurrentConversationLock.java:44)
        at org.springframework.webflow.conversation.impl.ContainedConversation.lock(ContainedConversation.java:69)
        at org.springframework.webflow.execution.repository.support.ConversationBackedFlowExecutionLock.lock(ConversationBackedFlowExecutionLock.java:51)
        at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:166)
        at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
        at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
        at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
        at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:936)
        at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:838)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:643)
        at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:812)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:723)
        at org.jasig.cas.web.init.SafeDispatcherServlet.service(SafeDispatcherServlet.java:125)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
        at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
        at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:343)
        at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:260)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at com.github.inspektr.common.web.ClientInfoThreadLocalFilter.doFilter(ClientInfoThreadLocalFilter.java:63)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:227)
        at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:347)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:879)
        at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:617)
        at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1774)
        at java.lang.Thread.run(Thread.java:722)
Nov 25, 2015 3:30:02 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet cas threw exception
org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
        ... (stack trace identical to the one above)
@chrismyers81

commented Nov 25, 2015

After that started happening, I tried rebooting muwacnode2 and muwacnode3 again, and things seem to be happier now.

@cmuramoto

commented Mar 3, 2016

Guys, this issue is very easy to reproduce.

I first saw this while testing network interface failures. Even several hours after the cluster merged, the messages were still there; the only thing that healed the node was restarting the JVM process that creates the Hazelcast instance.

Steps to reproduce:

  1. Create 3 virtual interfaces. E.g., as root, do
ifconfig eth0:1 172.30.6.251 netmask 255.255.0.0 up
ifconfig eth0:2 172.30.6.252 netmask 255.255.0.0 up
ifconfig eth0:3 172.30.6.253 netmask 255.255.0.0 up
  2. Create 2 nodes as consumers and 1 node as producer, preferably using n threads. In my test, I used n=4.
//Producer
for (int i = 0; i < n; i++) {
  IQueue<Object> queue = instance.getQueue("q#" + i);
  Thread t = new Thread(() -> {
    int k = 0;
    while (!Thread.currentThread().isInterrupted()) {
      queue.offer(instance.getName() + "#" + k++);
      //Don't overload the queues
      LockSupport.parkUntil(System.currentTimeMillis() + 1000);
    }
  });
  t.start();
}
//Consumer
for (int i = 0; i < n; i++) {
  final IQueue<Object> queue = instance.getQueue("q#" + i);
  Thread t = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        //Original snippet said TimeUnit.SECONDS (an ~83-minute timeout); a 5-second timeout was presumably intended
        Object obj = queue.poll(5000, TimeUnit.MILLISECONDS);
        if (obj != null) {
          System.out.println(instance.getName() + " :: " + obj);
          LockSupport.parkUntil(System.currentTimeMillis() + 1000);
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  });
  t.start();
}
  3. Start the nodes, ensuring the first consumer becomes master (e.g., by delaying startup of the second consumer and the producer).
  4. After some time (~30-60 sec), kill the IP address of the master node:
ifconfig eth0:1 down
  5. Wait a little (~2 minutes, assuming the default cfg of 120000 ms). After the previous master node throws some OperationTimeoutExceptions, recreate the virtual interface:
ifconfig eth0:1 172.30.6.251 netmask 255.255.0.0 up
  6. It may take a while for the cluster to merge. After it does, one will start getting WrongTargetExceptions on the first consumer node.

  7. XmlConfig:

<?xml version="1.0"?>
<hazelcast xmlns="http://www.hazelcast.com/schema/config"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.5.xsd ">
  <properties>
    <property name="hazelcast.logging.type">slf4j</property>
    <property name="hazelcast.wait.seconds.before.join">5</property>
    <property name="hazelcast.max.wait.seconds.before.join">60</property>
    <property name="hazelcast.backpressure.enabled">true</property>
    <property name="hazelcast.heartbeat.interval.seconds">2</property>
    <property name="hazelcast.max.no.heartbeat.seconds">30</property>
    <property name="hazelcast.restart.on.max.idle">true</property>
    <property name="hazelcast.io.thread.count">4</property>
    <property name="hazelcast.socket.bind.any">false</property>
    <property name="hazelcast.member.list.publish.interval.seconds">60</property>
  </properties>
  <management-center enabled="false">http://localhost:8080/mancenter
  </management-center>
  <group>
    <name>g</name>
    <password>a</password>
  </group>
  <network>
    <port auto-increment="false">9510</port>
    <interfaces enabled="true">
      <interface>172.30.6.251</interface> <!--Use 172.30.6.252 for second consumer and 172.30.6.253 for producer-->
    </interfaces>
    <join>
      <tcp-ip connection-timeout-seconds="10" enabled="true">
        <interface>172.30.6.251</interface>
        <interface>172.30.6.252</interface>
        <interface>172.30.6.253</interface>
      </tcp-ip>
      <multicast enabled="false" />
      <aws enabled="false" />
    </join>
  </network>
</hazelcast>

Obs: During debugging I forced a NodeEngine::reset() and the node sort of 'healed'. Before calling NodeEngine::reset(), no messages were ever consumed from the queues; after that, it just worked. I know this might lead to deeper consequences like data loss, but for my use case it was enough.

Edit: The issue is more likely to happen after split-brain, that is, after the first consumer 'evicts' the other two nodes and vice versa. I've changed the config to use a hazelcast.max.no.heartbeat.seconds of 30s instead of 300s.

@jerrinot


commented Mar 3, 2016

@cmuramoto: thanks for a very nice analysis!

@jerrinot jerrinot added this to the 3.7 milestone Mar 3, 2016

@cmuramoto


commented Mar 3, 2016

Guys, I think the problem lies in the re-scheduling mechanism of WaitNotifyServiceImpl:

private boolean doRun() throws Exception {
  long waitTime = FIRST_WAIT_TIME;
  while (waitTime > 0) {
    long begin = System.currentTimeMillis();
    WaitingOperation waitingOp = (WaitingOperation) delayQueue.poll(waitTime, TimeUnit.MILLISECONDS);
    if (waitingOp != null) {
      if (waitingOp.isValid()) {
        invalidate(waitingOp);
      }
    }
    long end = System.currentTimeMillis();
    waitTime -= (end - begin);
    if (waitTime > FIRST_WAIT_TIME) {
      waitTime = FIRST_WAIT_TIME;
    }
  }

  for (Queue<WaitingOperation> q : mapWaitingOps.values()) {
    for (WaitingOperation waitingOp : q) {
      if (Thread.interrupted()) {
        return true;
      }
      if (waitingOp.isValid() && waitingOp.needsInvalidation()) {
        invalidate(waitingOp); // ---> This will cause waitingOp to be re-scheduled endlessly, as the object is never removed from the queue associated with the map
      }
    }
  }
  return false;
}

I did some quick and dirty modifications in order to test this 'theory', and it seems correct.
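To make the failure mode concrete, here is a standalone, Hazelcast-free simulation of the loop above (all class and method names are invented for illustration): an op that keeps "needing invalidation" after a WrongTargetException is re-invoked on every expiration pass but never removed from its wait queue, so the invocation count grows without bound while the consumer makes no progress.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class EndlessRescheduleDemo {
    // Stand-in for WaitingOperation: stays "needing invalidation" after a WrongTargetException.
    static final class WaitingOp {
        int invocations;
        boolean needsInvalidation() { return true; } // stale partition owner -> always true
        void invalidate() { invocations++; }         // re-submits the op, does NOT dequeue it
    }

    public static void main(String[] args) {
        Queue<WaitingOp> waitQueue = new ArrayDeque<>();
        WaitingOp op = new WaitingOp();
        waitQueue.add(op);

        // Simulate 1000 passes of the expiration-task loop: the op is re-invoked
        // on every pass because nothing ever removes it from waitQueue.
        for (int pass = 0; pass < 1000; pass++) {
            for (WaitingOp w : waitQueue) {
                if (w.needsInvalidation()) {
                    w.invalidate();
                }
            }
        }
        System.out.println("invocations=" + op.invocations + " queueSize=" + waitQueue.size());
        // -> invocations=1000 queueSize=1
    }
}
```

In the real service each `invalidate` also re-triggers the warning log, which matches the "crazy amount" of WrongTargetException messages reported at the top of this issue.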

First, on the Operation class I've added another bitmask flag to mark whether an instance has received a WrongTargetException.

public abstract class Operation implements DataSerializable {

  //...
  static final int BITMASK_GOT_WRONG_TARGET = 1 << 7;

  //...

  public void markGotWrongTarget() {
    setFlag(true, BITMASK_GOT_WRONG_TARGET);
  }

  //...

  public boolean gotWrongTarget() {
    return isFlagSet(BITMASK_GOT_WRONG_TARGET);
  }

  //...
}

Next I changed the OperationRunnerImpl::ensureNoPartitionProblems code to mark such operations as problematic:

protected void ensureNoPartitionProblems(Operation op) {
  int partitionId = op.getPartitionId();

  if (partitionId < 0) {
    return;
  }

  if (partitionId != getPartitionId()) {
    throw new IllegalStateException("wrong partition, expected: " + getPartitionId() + " but found:" + partitionId);
  }

  if (internalPartition == null) {
    internalPartition = nodeEngine.getPartitionService().getPartition(partitionId);
  }

  if (retryDuringMigration(op) && internalPartition.isMigrating()) {
    throw new PartitionMigratingException(node.getThisAddress(), partitionId, op.getClass().getName(), op.getServiceName());
  }

  Address owner = internalPartition.getReplicaAddress(op.getReplicaIndex());
  if (op.validatesTarget() && !node.getThisAddress().equals(owner)) {
    op.markGotWrongTarget();
    throw new WrongTargetException(node.getThisAddress(), owner, partitionId, op.getReplicaIndex(), op.getClass().getName(), op.getServiceName());
  }
}

And finally I've changed WaitNotifyServiceImpl$ExpirationTask::doRun to purge problematic ops from the queues:

private class ExpirationTask implements Runnable {
  private boolean doRun() throws Exception {
    long waitTime = FIRST_WAIT_TIME;
    while (waitTime > 0) {
      long begin = System.currentTimeMillis();
      WaitingOperation waitingOp = (WaitingOperation) delayQueue.poll(waitTime, TimeUnit.MILLISECONDS);
      if (waitingOp != null) {
        // There's no need to check for a wrong target here: the operation was removed from
        // the queue and will be scheduled only once.
        if (waitingOp.isValid()) {
          invalidate(waitingOp);
        }
      }
      long end = System.currentTimeMillis();
      waitTime -= (end - begin);
      if (waitTime > FIRST_WAIT_TIME) {
        waitTime = FIRST_WAIT_TIME;
      }
    }

    for (Queue<WaitingOperation> q : mapWaitingOps.values()) {
      Iterator<WaitingOperation> itr = q.iterator();
      while (itr.hasNext()) {
        WaitingOperation waitingOp = itr.next();
        if (waitingOp.gotWrongTarget()) {
          logger.info("Removing " + waitingOp);
          itr.remove(); // There's no point in keeping it in the queue anymore
          try {
            waitingOp.onExpire(); // Release waiting consumers asap
          } catch (Exception e) {
            logger.warning(e);
          }
          continue;
        }

        // Why does this have to be in this loop?
        if (Thread.interrupted()) {
          return true;
        }

        if (waitingOp.isValid() && waitingOp.needsInvalidation()) {
          invalidate(waitingOp);
        }
      }
    }
    return false;
  }
}

Obs 1: If one does not call WaitingOperation::onExpire, the consumer threads get blocked for a long time (~480s on HZ 3.7-SNAPSHOT), which is not desirable, but eventually the problem goes away.

Obs 2: Since Operation instances seem to be shared among several threads, this approach may be flawed from a JMM point of view, as visibility of the modified 'flags' field might take a while to kick in. Storing this state in a volatile field might be a good idea.

Obs 3: Would it be possible to actually heal the operation? This issue seems to be related to re-partitioning schemes. In the case of IQueue::poll this is fairly harmless, but I don't know about the side effects that other operations might cause.

Edit: typos
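As a minimal, standalone sketch of the safe-publication concern in Obs 2 (class and field names are invented, not Hazelcast code): instead of OR-ing a new bit into a plain `int` flags field on one thread and reading it from another, the flag state can live in an `AtomicInteger`, whose CAS and volatile read give the cross-thread visibility the plain field lacks.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FlagVisibilityDemo {
    // Hypothetical stand-in for an Operation's flags field, made safely publishable.
    static final class Flags {
        static final int GOT_WRONG_TARGET = 1 << 7;
        private final AtomicInteger bits = new AtomicInteger();

        void set(int mask) {
            // CAS loop: publish the new bit without losing concurrent updates to other bits.
            int prev;
            do {
                prev = bits.get();
            } while (!bits.compareAndSet(prev, prev | mask));
        }

        boolean isSet(int mask) {
            return (bits.get() & mask) != 0; // volatile read: sees writes from other threads
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Flags flags = new Flags();
        // Mark the flag from another thread, as the operation thread would on WrongTargetException.
        Thread marker = new Thread(() -> flags.set(Flags.GOT_WRONG_TARGET));
        marker.start();
        marker.join();
        System.out.println("gotWrongTarget=" + flags.isSet(Flags.GOT_WRONG_TARGET));
        // -> gotWrongTarget=true
    }
}
```

With a plain non-volatile `flags` field, the expiration thread could in principle keep seeing the stale value and never purge the operation, which is exactly the hazard Obs 2 points at.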

@jerrinot jerrinot self-assigned this Mar 3, 2016

@cmuramoto


commented Mar 3, 2016

Just in case anyone is interested, this script might help to debug this issue by creating the virtual interfaces.

#!/bin/bash

OP="up"

if [ "$1" == "down" ]
then
OP="down"
fi

MASK=$(ifconfig |grep Mask|grep -v 127|xargs)
MASK=${MASK##*Mask:}
IP=$(ip ad|grep eth0|grep inet|xargs|cut -d' ' -f 2|cut -d'/' -f 1)
DIGIT=$(echo $IP|cut -d'.' -f 4)
PART=${IP%%"$DIGIT"}

echo "BASE: $IP, DIGIT: $DIGIT, MASK: $MASK"

for i in {1..3}
do
    ((DIGIT++))
    ADDR="$PART""$DIGIT"
    ifconfig eth0:$i $ADDR netmask $MASK $OP
done

It assumes you have at most two interfaces (eth0 and loopback) and picks the variables from eth0. Virtual interfaces created from loopback don't seem to cause network failures/split-brain!

@bwzhang2011


commented Apr 26, 2016

@jerrinot, any update on this issue?

@ygyg70


commented Apr 26, 2016

We are seeing the same issue, using Hazelcast with OrientDB. Two nodes; disconnect the network until both reconfigure as single-node clusters and one of them elects itself as the new master, then restore the network connection. One of the nodes will get into this state.

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Jun 20, 2016

Delay invocation when node is not joined
Reasoning:
When a cluster is being merged into another one
then a local node is marked as not-joined and invocations are
notified with MemberLeftException.

We do not want to retry them before the node is joined
again because partition table is still stale at this point.

Fix hazelcast#3754
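
The guard described in this commit message can be illustrated with a standalone sketch (all names here are hypothetical, not the actual Hazelcast implementation): instead of re-sending an operation immediately after a MemberLeftException, while the partition table is still stale, the retry path blocks until the node's "joined" state is set again.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RetryAfterJoinDemo {
    // Hypothetical stand-in for the node's membership state.
    static final class Node {
        private final CountDownLatch joined = new CountDownLatch(1);

        void markJoined() { joined.countDown(); }

        boolean awaitJoined(long timeout, TimeUnit unit) throws InterruptedException {
            return joined.await(timeout, unit);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Node node = new Node();

        // Simulate the cluster merge completing on another thread after 100 ms.
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) { }
            node.markJoined();
        }).start();

        // Retry path: rather than re-sending right after MemberLeftException
        // (against a stale partition table), wait until the node has re-joined.
        boolean ok = node.awaitJoined(5, TimeUnit.SECONDS);
        System.out.println(ok ? "retrying against fresh partition table" : "giving up");
        // -> retrying against fresh partition table
    }
}
```

The happy path is unaffected because a node that never left is already joined and the wait returns immediately.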


jerrinot added a commit to jerrinot/hazelcast that referenced this issue Jul 19, 2016

Delay operation retry when node is not joined
Fixes hazelcast#3754, hazelcast#3395

Reasoning:
When a smaller cluster is being merged into a bigger one then
operations waiting in the WaitNotify queues are cleared and
pending invocations receive MemberLeftException. However the
invocation can retry immediately and send the operation again.

This change is adding a check into a retry part to re-execute the
operation only when a node is joined (again). Happy path is
unaffected.

This commit is also adding a support class for split-brain tests.
Many thanks to @mdogan who helped me with analysis and also contributed
the original test.
