
Too much delay for deciding heartbeat timeout #2766

Closed
sancar opened this issue Jun 19, 2014 · 1 comment · Fixed by #3028
Labels
Source: Internal PR or issue was opened by an employee · Type: Defect

Comments


sancar commented Jun 19, 2014

I took a look at the code. In the heartBeater method there are suspicious hardcoded numbers.
I did not quite follow the whole logic, but the last-read timestamp is updated when the connection to a member is null, which seems to be the reason the decision that a member is dead gets delayed.
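
To make the suspicion concrete, here is a minimal, self-contained sketch of the pattern I mean. The class, method, and field names are hypothetical and this is not the actual Hazelcast code; it only illustrates how bumping the last-read time when the connection is null keeps the no-heartbeat timer from ever expiring:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only, NOT Hazelcast source: illustrates the suspected
// "reset last-read when connection is null" pattern in the heartbeat check.
public class HeartbeatSketch {

    static final long MAX_NO_HEARTBEAT_MILLIS = 5_000; // hazelcast.max.no.heartbeat.seconds

    final Map<String, Long> lastReadTime = new ConcurrentHashMap<>();

    // Called periodically for every remote member.
    void checkMember(String memberAddress, Object connection) {
        long now = System.currentTimeMillis();
        if (connection == null) {
            // Suspected problem: with no connection the last-read time is pushed
            // forward, so the timeout below can never fire and the member is only
            // removed much later (e.g. after connect retries give up).
            lastReadTime.put(memberAddress, now);
            return;
        }
        Long lastRead = lastReadTime.getOrDefault(memberAddress, now);
        if (now - lastRead > MAX_NO_HEARTBEAT_MILLIS) {
            System.out.println("Removing member " + memberAddress + " (heartbeat timeout)");
        }
    }
}
```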

Here is the customer's detailed report:

After a TCP connection to a peer is lost, the Hazelcast instance takes 200+ seconds to update the members list.

I have a Hazelcast cluster with 3 nodes in the same IP subnet. Each server listens on port 5700 for peer-to-peer communication, with a 5-second peer heartbeat timeout (hazelcast.max.no.heartbeat.seconds).
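
For reference, a member with this setup can be started programmatically roughly like this (a sketch using the standard Config API; the same settings can also be given via XML or system properties):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class TestMember {
    public static void main(String[] args) {
        Config config = new Config();
        // Peer-to-peer communication port used in the test
        config.getNetworkConfig().setPort(5700);
        // Declare a peer dead after 5 seconds without a heartbeat
        config.setProperty("hazelcast.max.no.heartbeat.seconds", "5");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}
```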

I have a test to check the behavior of the cluster when a firewall blocks the communication port.

Steps:

  1. Configure a 3-node Hazelcast cluster as described above.
  2. Use Linux iptables to block the communication port on two of the members. Commands used:

iptables -I INPUT -p tcp --destination-port 5700 -j DROP
iptables -I OUTPUT -p tcp --destination-port 5700 -j DROP

  3. Check for cluster members on each Hazelcast node (a minimal membership-check snippet follows this list).
  4. In most runs of the test, cluster membership is updated within 5 seconds on all three nodes to reflect the loss of the two peer nodes. In a few runs, the membership update takes 3+ minutes on one of the machines.
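
The membership check in step 3 could look like the sketch below; it assumes a HazelcastInstance handle such as the one created in the configuration example above and prints the member list in the same shape as the cluster logs:

```java
import java.util.Set;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Member;

// Sketch: print the current cluster members, similar to the ClusterService log output.
public class MembershipCheck {
    static void printMembers(HazelcastInstance hz) {
        Set<Member> members = hz.getCluster().getMembers();
        System.out.println("Members [" + members.size() + "] {");
        for (Member member : members) {
            System.out.println("  Member " + member.getSocketAddress()
                    + (member.localMember() ? " this" : ""));
        }
        System.out.println("}");
    }
}
```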

I expect all nodes to detect the failure and settle on the correct membership quickly in every run. Let me know if you need any further clarification.

Some logs from the test run:

Node 1 - working fine

[2014-06-18 13:01:00.397] INFO ev.global-operation.thread-5 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2]

Members [3] {
Member [10.1.18.181]:5700
Member [10.1.18.182]:5700 this
Member [10.1.20.116]:5700
}

[2014-06-18 13:01:07.952] INFO stance_2_dev.cached.thread-7 com.hazelcast.nio.SocketConnector [10.1.18.182]:5700 [dev] [3.2.2] Could not connect to: /10.1.18.181:5700. Reason: SocketException[Connection timed out to address /10.1.18.181:5700]
[2014-06-18 13:01:30.795] WARN stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2] Master node has timed out its heartbeat and will be removed
[2014-06-18 13:01:30.796] INFO stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2] Master Address[10.1.18.181]:5700 left the cluster. Assigning new master Member [10.1.18.182]:5700 this
[2014-06-18 13:01:30.798] INFO stance_2_dev.cached.thread-2 com.hazelcast.nio.TcpIpConnection [10.1.18.182]:5700 [dev] [3.2.2] Connection [Address[10.1.18.181]:5700] lost. Reason: Socket explicitly closed
[2014-06-18 13:01:30.798] INFO stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2] Removing Member [10.1.18.181]:5700
[2014-06-18 13:01:30.808] INFO stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2]

Members [2] {
Member [10.1.18.182]:5700 this
Member [10.1.20.116]:5700
}

[2014-06-18 13:01:30.808] INFO stance_2_dev.cached.thread-6 com.hazelcast.nio.SocketConnector [10.1.18.182]:5700 [dev] [3.2.2] Connecting to /10.1.18.181:5700, timeout: 0, bind-any: true
[2014-06-18 13:01:31.796] WARN stance_2_dev.cached.thread-4 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2] Added Address[10.1.20.116]:5700 to list of dead addresses because of timeout since last read
[2014-06-18 13:01:31.798] INFO stance_2_dev.cached.thread-4 com.hazelcast.nio.TcpIpConnection [10.1.18.182]:5700 [dev] [3.2.2] Connection [Address[10.1.20.116]:5700] lost. Reason: Socket explicitly closed
[2014-06-18 13:01:31.799] INFO stance_2_dev.cached.thread-4 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2] Removing Member [10.1.20.116]:5700
[2014-06-18 13:01:31.801] INFO stance_2_dev.cached.thread-4 com.hazelcast.cluster.ClusterService [10.1.18.182]:5700 [dev] [3.2.2]

Members [1] {
Member [10.1.18.182]:5700 this
}

[2014-06-18 13:01:32.309] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.18.182]:5700 [dev] [3.2.2] Partition balance is ok, no need to re-partition cluster data...

Node 2 - working fine

[2014-06-18 13:01:00.396] INFO ev.global-operation.thread-8 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2]

Members [3] {
Member [10.1.18.181]:5700
Member [10.1.18.182]:5700
Member [10.1.20.116]:5700 this
}

[2014-06-18 13:01:01.894] INFO stance_2_dev.cached.thread-4 com.hazelcast.core.LifecycleService [10.1.20.116]:5700 [dev] [3.2.2] Address[10.1.20.116]:5700 is MERGED
[2014-06-18 13:01:06.977] INFO stance_2_dev.cached.thread-3 com.hazelcast.nio.SocketConnector [10.1.20.116]:5700 [dev] [3.2.2] Could not connect to: /10.1.18.181:5700. Reason: SocketException[Connection timed out to address /10.1.18.181:5700]
[2014-06-18 13:01:23.904] WARN stance_2_dev.cached.thread-3 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2] Master node has timed out its heartbeat and will be removed
[2014-06-18 13:01:23.904] INFO stance_2_dev.cached.thread-3 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2] Master Address[10.1.18.181]:5700 left the cluster. Assigning new master Member [10.1.18.182]:5700
[2014-06-18 13:01:23.905] INFO stance_2_dev.cached.thread-3 com.hazelcast.nio.TcpIpConnection [10.1.20.116]:5700 [dev] [3.2.2] Connection [Address[10.1.18.181]:5700] lost. Reason: Socket explicitly closed
[2014-06-18 13:01:23.905] INFO stance_2_dev.cached.thread-3 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2] Removing Member [10.1.18.181]:5700
[2014-06-18 13:01:23.911] INFO stance_2_dev.cached.thread-3 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2]

Members [2] {
Member [10.1.18.182]:5700
Member [10.1.20.116]:5700 this
}

[2014-06-18 13:01:23.911] INFO stance_2_dev.cached.thread-4 com.hazelcast.nio.SocketConnector [10.1.20.116]:5700 [dev] [3.2.2] Connecting to /10.1.18.181:5700, timeout: 0, bind-any: true
[2014-06-18 13:01:24.904] WARN stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2] Master node has timed out its heartbeat and will be removed
[2014-06-18 13:01:24.904] INFO stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2] Master Address[10.1.18.182]:5700 left the cluster. Assigning new master Member [10.1.20.116]:5700 this
[2014-06-18 13:01:24.906] INFO stance_2_dev.cached.thread-5 com.hazelcast.nio.TcpIpConnection [10.1.20.116]:5700 [dev] [3.2.2] Connection [Address[10.1.18.182]:5700] lost. Reason: Socket explicitly closed
[2014-06-18 13:01:24.906] INFO stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2] Removing Member [10.1.18.182]:5700
[2014-06-18 13:01:24.908] INFO stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.20.116]:5700 [dev] [3.2.2]

Members [1] {
Member [10.1.20.116]:5700 this
}

[2014-06-18 13:01:24.908] INFO stance_2_dev.cached.thread-1 com.hazelcast.nio.SocketConnector [10.1.20.116]:5700 [dev] [3.2.2] Connecting to /10.1.18.182:5700, timeout: 0, bind-any: true
[2014-06-18 13:01:25.657] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.20.116]:5700 [dev] [3.2.2] Partition balance is ok, no need to re-partition cluster data...

Node 3 - with problem


[2014-06-18 13:01:00.397] INFO ev.global-operation.thread-3 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2]
Members [3] {
Member [10.1.18.181]:5700 this
Member [10.1.18.182]:5700
Member [10.1.20.116]:5700
}

[2014-06-18 13:01:00.692] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.18.181]:5700 [dev] [3.2.2] Re-partitioning cluster data... Migration queue size: 90
[2014-06-18 13:01:02.027] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.18.181]:5700 [dev] [3.2.2] All migration tasks has been completed, queues are empty.
[2014-06-18 13:01:23.908] INFO nstance_2_dev.IO.thread-in-0 com.hazelcast.nio.TcpIpConnection [10.1.18.181]:5700 [dev] [3.2.2] Connection [Address[10.1.20.116]:5700] lost. Reason: java.io.EOFException[Remote socket closed!]
[2014-06-18 13:01:23.908] WARN nstance_2_dev.IO.thread-in-0 com.hazelcast.nio.ReadHandler [10.1.18.181]:5700 [dev] [3.2.2] hz._hzInstance_2_dev.IO.thread-in-0 Closing socket to endpoint Address[10.1.20.116]:5700, Cause:java.io.EOFException: Remote socket closed!
[2014-06-18 13:01:24.441] INFO stance_2_dev.cached.thread-4 com.hazelcast.nio.SocketConnector [10.1.18.181]:5700 [dev] [3.2.2] Connecting to /10.1.20.116:5700, timeout: 0, bind-any: true
[2014-06-18 13:01:29.441] WARN stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:01:30.442] WARN stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Added Address[10.1.18.182]:5700 to list of dead addresses because of timeout since last read
[2014-06-18 13:01:30.442] INFO stance_2_dev.cached.thread-5 com.hazelcast.nio.TcpIpConnection [10.1.18.181]:5700 [dev] [3.2.2] Connection [Address[10.1.18.182]:5700] lost. Reason: Socket explicitly closed
[2014-06-18 13:01:30.442] INFO stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Removing Member [10.1.18.182]:5700
[2014-06-18 13:01:30.442] WARN stance_2_dev.IO.thread-out-1 com.hazelcast.nio.WriteHandler [10.1.18.181]:5700 [dev] [3.2.2] hz._hzInstance_2_dev.IO.thread-out-1 Closing socket to endpoint Address[10.1.18.182]:5700, Cause:java.nio.channels.ClosedChannelException
[2014-06-18 13:01:30.450] INFO stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2]

Members [2] {
Member [10.1.18.181]:5700 this
Member [10.1.20.116]:5700
}

[2014-06-18 13:01:34.971] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.18.181]:5700 [dev] [3.2.2] Partition state sync invocation timed out: java.util.concurrent.TimeoutException: Call BasicInvocation{ serviceName='hz:core:partitionService', op=com.hazelcast.partition.impl.PartitionStateOperation@64e4d3c2, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=11, callTimeout=60000, target=Address[10.1.20.116]:5700} encountered a timeout
[2014-06-18 13:01:34.971] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.18.181]:5700 [dev] [3.2.2] Partition balance is ok, no need to re-partition cluster data...
[2014-06-18 13:01:35.442] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:01:40.443] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:01:46.444] WARN stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:01:52.445] WARN stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:01:54.405] WARN TaskExecSched-3-thread-8 com.hazelcast.spi.OperationService [10.1.18.181]:5700 [dev] [3.2.2] com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:01:58.445] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:04.446] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:09.447] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:14.448] WARN stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:17.980] WARN stance_2_dev.async.thread-17 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=ContainsKeyOperation{}, partitionId=15, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:19.480] WARN stance_2_dev.async.thread-12 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:core:partitionService', op=com.hazelcast.partition.impl.PartitionStateOperation@64e4d3c2, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:20.449] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:22.983] WARN stance_2_dev.async.thread-15 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=ContainsKeyOperation{}, partitionId=15, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=110, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:24.483] WARN stance_2_dev.async.thread-15 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:core:partitionService', op=com.hazelcast.partition.impl.PartitionStateOperation@64e4d3c2, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=110, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:26.450] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:27.593] INFO stance_2_dev.cached.thread-4 com.hazelcast.nio.SocketConnector [10.1.18.181]:5700 [dev] [3.2.2] Could not connect to: /10.1.20.116:5700. Reason: SocketException[Connection timed out to address /10.1.20.116:5700]
[2014-06-18 13:02:27.778] INFO stance_2_dev.cached.thread-2 com.hazelcast.nio.SocketConnector [10.1.18.181]:5700 [dev] [3.2.2] Connecting to /10.1.20.116:5700, timeout: 0, bind-any: true
[2014-06-18 13:02:27.985] WARN nstance_2_dev.async.thread-3 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=ContainsKeyOperation{}, partitionId=15, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=120, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:29.486] WARN stance_2_dev.async.thread-14 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:core:partitionService', op=com.hazelcast.partition.impl.PartitionStateOperation@64e4d3c2, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=120, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:32.450] WARN stance_2_dev.cached.thread-6 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:32.988] WARN nstance_2_dev.async.thread-5 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=ContainsKeyOperation{}, partitionId=15, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=130, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:34.489] WARN nstance_2_dev.async.thread-5 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:core:partitionService', op=com.hazelcast.partition.impl.PartitionStateOperation@64e4d3c2, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=130, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:37.451] WARN stance_2_dev.cached.thread-5 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Member [10.1.18.181]:5700 this has no connection to Address[10.1.20.116]:5700
[2014-06-18 13:02:37.991] WARN stance_2_dev.async.thread-18 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=ContainsKeyOperation{}, partitionId=15, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:02:39.492] WARN stance_2_dev.async.thread-15 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:core:partitionService', op=com.hazelcast.partition.impl.PartitionStateOperation@64e4d3c2, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700

...
<Several thousand similar lines of logs>
...

[2014-06-18 13:04:32.096] WARN stance_2_dev.async.thread-11 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapValuesOperation@26202d71, partitionId=153, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:04:32.096] WARN stance_2_dev.async.thread-18 com.hazelcast.spi.impl.BasicInvocation [10.1.18.181]:5700 [dev] [3.2.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapValuesOperation@4a3de75b, partitionId=82, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.1.20.116]:5700}, Reason: com.hazelcast.spi.exception.RetryableIOException: Packet not sent to -> Address[10.1.20.116]:5700
[2014-06-18 13:04:34.121] INFO stance_2_dev.cached.thread-2 com.hazelcast.nio.SocketConnector [10.1.18.181]:5700 [dev] [3.2.2] Could not connect to: /10.1.20.116:5700. Reason: SocketException[Connection timed out to address /10.1.20.116:5700]
[2014-06-18 13:04:34.121] WARN stance_2_dev.cached.thread-2 com.hazelcast.nio.ConnectionMonitor [10.1.18.181]:5700 [dev] [3.2.2] Removing connection to endpoint Address[10.1.20.116]:5700 Cause => java.net.SocketException {Connection timed out to address /10.1.20.116:5700}, Error-Count: 5
[2014-06-18 13:04:34.122] INFO stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2] Removing Member [10.1.20.116]:5700
[2014-06-18 13:04:34.124] INFO stance_2_dev.cached.thread-2 com.hazelcast.cluster.ClusterService [10.1.18.181]:5700 [dev] [3.2.2]

Members [1] {
Member [10.1.18.181]:5700 this
}

[2014-06-18 13:04:35.676] INFO ._hzInstance_2_dev.migration com.hazelcast.partition.InternalPartitionService [10.1.18.181]:5700 [dev] [3.2.2] Partition balance is ok, no need to re-partition cluster data...
@sancar sancar added this to the 3.x Backlog milestone Jun 19, 2014
@sancar sancar added the PENDING label Jun 19, 2014
@enesakar enesakar changed the title To much delay for deciding heartbeat timeout Too much delay for deciding heartbeat timeout Jun 19, 2014
@oliverburn

I have also experienced the same problem when performing a nearly identical test using Hz 3.2.3.

@sancar sancar self-assigned this Jul 3, 2014
@sancar sancar modified the milestones: 3.3, 3.x Backlog Jul 3, 2014
@sancar sancar removed their assignment Jul 3, 2014
@mdogan mdogan removed the PENDING label Jul 4, 2014
@mdogan mdogan self-assigned this Jul 18, 2014
@mdogan mdogan modified the milestones: 3.2.5, 3.3 Jul 18, 2014
@mmedenjak mmedenjak added the Source: Internal PR or issue was opened by an employee label Jan 28, 2020