Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster is in PASSIVE state! vs TargetDisconnectedException 3.7-SNAP #8261

Closed
Danny-Hazelcast opened this issue May 27, 2016 · 4 comments

Comments

Projects
None yet
3 participants
@Danny-Hazelcast
Copy link
Member

commented May 27, 2016

while testing hzInstance.getCluster().shutdown(); on 3.6.3 and 3.7-SANPSHOT

using a 4 node cluster, with 4 clients making putAll ops.

during which we call, hzInstance.getCluster().shutdown(); from one member of the cluster

with 3.6.3 clients get

java.lang.IllegalStateException: Cluster is in PASSIVE state! Operation: com.hazelcast.map.impl.operation.PutAllOperation{identityHash=993079332, serviceName='hz:impl:mapService', partitionId=64, replicaIndex=0, callId=741874, invocationTime=1464269130111 (Thu May 26 13:25:30 UTC 2016), waitTimeout=-1, callTimeout=60000, name=mapBak1}
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.checkNodeState(OperationRunnerImpl.java:200)

with 3.7-SANPSHOT clients get

com.hazelcast.spi.exception.TargetDisconnectedException: Disconnecting from member [10.0.0.189]:5701 due to heartbeat problems. Current time: 2016-05-26 13:37:11.896. Last heartbeat: never
    at com.hazelcast.spi.exception.TargetDisconnectedException.newTargetDisconnectedExceptionCausedByHeartBeat(TargetDisconnectedException.java:48)

so me it seem 3.6.3, is correct, and exception given in 3.7-SANPSHOT is misleading

3.6.3 member log example

May 26, 2016 1:25:30 PM com.hazelcast.cluster.impl.operations.ChangeClusterStateOperation
INFO: [10.0.0.188]:5701 [HZ] [3.6.3] Changing cluster state state to PASSIVE, Initiator: Address[10.0.0.186]:5701
May 26, 2016 1:25:30 PM com.hazelcast.map.impl.operation.PutAllOperation
SEVERE: [10.0.0.188]:5701 [HZ] [3.6.3] Cluster is in PASSIVE state! Operation: com.hazelcast.map.impl.operation.PutAllOperation{identityHash=954986353, serviceName='hz:impl:mapService', partitionId=184, replicaIndex=0, callId=757747, invocationTime=1464269130111 (Thu May 26 13:25:30 UTC 2016), waitTimeout=-1, callTimeout=60000, name=mapBak1}
java.lang.IllegalStateException: Cluster is in PASSIVE state! Operation: com.hazelcast.map.impl.operation.PutAllOperation{identityHash=954986353, serviceName='hz:impl:mapService', partitionId=184, replicaIndex=0, callId=757747, invocationTime=1464269130111 (Thu May 26 13:25:30 UTC 2016), waitTimeout=-1, callTimeout=60000, name=mapBak1}
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.checkNodeState(OperationRunnerImpl.java:200)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:156)
    at com.hazelcast.spi.impl.operationexecutor.classic.ClassicOperationExecutor.runOnCallingThread(ClassicOperationExecutor.java:392)
    at com.hazelcast.spi.impl.operationexecutor.classic.ClassicOperationExecutor.runOnCallingThreadIfPossible(ClassicOperationExecutor.java:345)
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvokeLocal(Invocation.java:259)
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:243)
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.invokeInternal(Invocation.java:210)
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.invoke(Invocation.java:180)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationBuilderImpl.invoke(InvocationBuilderImpl.java:46)
    at com.hazelcast.client.impl.protocol.task.AbstractPartitionMessageTask.processMessage(AbstractPartitionMessageTask.java:65)
    at com.hazelcast.client.impl.protocol.task.AbstractMessageTask.initializeAndProcessMessage(AbstractMessageTask.java:118)
    at com.hazelcast.client.impl.protocol.task.AbstractMessageTask.run(AbstractMessageTask.java:98)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:127)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processPartitionSpecificRunnable(OperationThread.java:159)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:142)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:124)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:99)

and 3.7-SANPSHOT member log example

May 26, 2016 1:37:11 PM com.hazelcast.internal.cluster.impl.operations.ChangeClusterStateOperation
INFO: [10.0.0.188]:5701 [HZ] [3.7-SNAPSHOT] Changing cluster state state to PASSIVE, Initiator: [10.0.0.186]:5701
May 26, 2016 1:37:11 PM com.hazelcast.map.impl.operation.PutAllBackupOperation
SEVERE: [10.0.0.188]:5701 [HZ] [3.7-SNAPSHOT] Cluster is in PASSIVE state! Operation: com.hazelcast.spi.impl.operationservice.impl.operations.Backup{serviceName='null', identityHash=1313662299, partitionId=149, replicaIndex=1, callId=741874, invocationTime=-1 (1969-12-31 23:59:59.999), waitTimeout=-1, callTimeout=9223372036854775807, backupOp=com.hazelcast.map.impl.operation.PutAllBackupOperation{serviceName='hz:impl:mapService', identityHash=591857971, partitionId=-1, replicaIndex=0, callId=0, invocationTime=-1 (1969-12-31 23:59:59.999), waitTimeout=-1, callTimeout=9223372036854775807, name=mapBak1}, backupOpData=null, originalCaller=[10.0.0.189]:5701, version=[11167, 0, 0, 0, 0, 0], sync=true}
java.lang.IllegalStateException: Cluster is in PASSIVE state! Operation: com.hazelcast.spi.impl.operationservice.impl.operations.Backup{serviceName='null', identityHash=1313662299, partitionId=149, replicaIndex=1, callId=741874, invocationTime=-1 (1969-12-31 23:59:59.999), waitTimeout=-1, callTimeout=9223372036854775807, backupOp=com.hazelcast.map.impl.operation.PutAllBackupOperation{serviceName='hz:impl:mapService', identityHash=591857971, partitionId=-1, replicaIndex=0, callId=0, invocationTime=-1 (1969-12-31 23:59:59.999), waitTimeout=-1, callTimeout=9223372036854775807, name=mapBak1}, backupOpData=null, originalCaller=[10.0.0.189]:5701, version=[11167, 0, 0, 0, 0, 0], sync=true}
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.checkNodeState(OperationRunnerImpl.java:211)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:167)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:397)
    at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:117)
    at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:102)
@ihsandemir

This comment has been minimized.

Copy link
Contributor

commented May 27, 2016

I examined the problem. Recently, we added a cleaner thread that sets the invocation result for connections that are either disconnected (not live) or heartbeatFailed. Line https://github.com/hazelcast/hazelcast/blob/master/hazelcast-client/src/main/java/com/hazelcast/client/spi/impl/ClientInvocationServiceSupport.java#L203, may return false not only because the connection heartbeat fails but also during connection close where the live flag is set to false (https://github.com/hazelcast/hazelcast/blob/master/hazelcast-client/src/main/java/com/hazelcast/client/connection/nio/ClientConnection.java#L219). Hence, can set the invocation result for a more appropriate exception based on these two different cases.

A Suggestion for the fix:
If hearbeatFailed -- > newTargetDisconnectedExceptionCausedByHeartBeat
if !connection.live --> TargetDisconnectedException(Address address)

The solution will not produce the IllegalStateException as written above. This is an exception by server side, which is propagated at the client as the response to putAll. This shall still occur, but depending on if it comes before the CleanResourcesTask runs. If CleanResourcesTask runs first it shall be the exceptions as we proposed above, otherwise it may be the IllegalStateException.

ihsandemir added a commit to ihsandemir/hazelcast that referenced this issue Jul 18, 2016

Fix for issue hazelcast#8261 . Makes sure that the CleanupResourcesTa…
…sk only sets the future value after the connection is closed.

@asimarslan asimarslan self-assigned this Jul 25, 2016

@asimarslan

This comment has been minimized.

Copy link
Member

commented Jul 25, 2016

@Danny-Hazelcast, I can reproduce the above case as you mentioned for version 3.6.3 too.
so is this still a problem ?

@Danny-Hazelcast

This comment has been minimized.

Copy link
Member Author

commented Jul 27, 2016

client should expect either exception. on cluster passive
because there is always a possibility of hearbeat which end up with TargetDisconnectException

@asimarslan

This comment has been minimized.

Copy link
Member

commented Jul 27, 2016

One extra info on the issue: Client backward compatibility prevent changing these kind of exceptions.

ihsandemir added a commit to ihsandemir/hazelcast that referenced this issue Nov 25, 2016

Fix for issue hazelcast#8261 . Makes sure that the CleanupResourcesTa…
…sk only sets the future value after the connection is closed.

ihsandemir added a commit that referenced this issue Nov 28, 2016

Fix for issue #8261 (#8543)
* Fix for issue #8261 . Makes sure that the CleanupResourcesTask only sets the future value after the connection is closed.

tkountis added a commit to tkountis/hazelcast that referenced this issue Nov 30, 2016

Fix for issue hazelcast#8261 (hazelcast#8543)
* Fix for issue hazelcast#8261 . Makes sure that the CleanupResourcesTask only sets the future value after the connection is closed.

sancar added a commit to sancar/hazelcast that referenced this issue Dec 8, 2016

Cleanup for changes made in hazelcast#8261
In CleanResourceTask invocation.notifyException called twice.
, removing one of them.

Two static methods related to client is removed from
TargetDisconnectedException since it is also used in node and
methods specific to client should not be there

Detailed heartbeart related logging inside
newTargetDisconnectedExceptionCausedByHeartbeat is removed,
since these are already available in ClientConnection.toString.

ClientConnection.isClosed is removed since it nearly means !isAlive.
One caveat related to removing is race between closedTime and isAlive
boolean. To remove this race I also removed isAlive boolean.
isAlive means closedTime == 0 from now on.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.