
Hazelcast node continuously logs WrongTargetException #3395

Closed
agherardi opened this issue Aug 26, 2014 · 23 comments

@agherardi
Contributor

commented Aug 26, 2014

Hi,
While running a load test, after a few hours one of the nodes logs thousands of warnings like this:

2014-08-26T06:20:51.898Z] [WARN] [-] [hz._hzInstance_1_vmscore_1.operation.thread-2] [com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp] - [192.168.240.25]:5701 [vmscore_1] [3.2] Op: com.hazelcast.concurrent.lock.operations.LockOperation@4a642c96, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.240.25]:5701, target:Address[192.168.240.24]:5701, partitionId: 194, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:lockService

We're running 3.2 so I believe we have picked up the patch for
#1815

Any ideas?

@mdogan mdogan added PENDING labels Aug 26, 2014

@pveentjer

Member

commented Aug 26, 2014

Can you tell us something about the environment you are running in, e.g. EC2? And what is the exact Hazelcast version you are using: 3.2(.0)?

What is the test you are running, by the way?

And can you post some logging?

@agherardi

Contributor Author

commented Aug 27, 2014

The environment is a cluster of 2 Windows 2008 R2 servers. The server IPs are 192.168.240.24 and 192.168.240.25. Both servers are located in our lab - not in the cloud.

We use Hazelcast 3.2 (.0). The test uses various Hazelcast features - maps, queues and locks.

Regarding logging: trouble seems to start when 192.168.240.24 logs these warnings:
[2014-08-24T21:16:41.635Z] [WARN] [-] [Thread-11] [com.hazelcast.spi.impl.BasicInvocation] - [192.168.240.24]:5701 [vmscore_1] [3.2] No response for 120011 ms. InvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:queueService', op=com.hazelcast.queue.PollOperation@76a51bd3, partitionId=46, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.240.25]:5701}, done=false}
[2014-08-24T21:16:41.666Z] [WARN] [-] [Thread-11] [com.hazelcast.spi.impl.BasicInvocation] - [192.168.240.24]:5701 [vmscore_1] [3.2] Asking if operation execution has been started: InvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:queueService', op=com.hazelcast.queue.PollOperation@76a51bd3, partitionId=46, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.240.25]:5701}, done=false}
[2014-08-24T21:16:46.892Z] [WARN] [-] [Thread-11] [com.hazelcast.spi.impl.BasicInvocation] - [192.168.240.24]:5701 [vmscore_1] [3.2] While asking 'is-executing': InvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:queueService', op=com.hazelcast.queue.PollOperation@76a51bd3, partitionId=46, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.240.25]:5701}, done=false}
! java.util.concurrent.TimeoutException: Call BasicInvocation{ serviceName='hz:impl:queueService', op=com.hazelcast.spi.impl.BasicInvocation$IsStillExecuting@2313f5e0, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[192.168.240.25]:5701} encountered a timeout

! at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.resolveResponse(BasicInvocation.java:801)

! at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.resolveResponseOrThrowException(BasicInvocation.java:769)

! at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:696)

! at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.isOperationExecuting(BasicInvocation.java:875)

! at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:753)

! at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:695)
...

Unfortunately, I don't have logs from 192.168.240.25 at the same time - they rolled over. A few hours later, 192.168.240.25 is filling its log with WrongTargetExceptions.

@pveentjer

Member

commented Aug 27, 2014

Can you disable the rollover? I really need to see what happened between the machines.

@agherardi

Contributor Author

commented Aug 27, 2014

I reproduced the issue and have the logs from the 2 machines. Is it OK for me to email them to ala....@gmail.com?

@agherardi

Contributor Author

commented Aug 28, 2014

Hi Peter - I emailed you the log files at ala....@gmail.com

@pveentjer

Member

commented Aug 29, 2014

Thanks. I don't know if I'll have time for this issue today, so it will probably be next week.

@cszmajda

Contributor

commented Sep 24, 2014

We're seeing this as well in 3.3-RC3:

2014-09-24 04:29:04,749 WARN  [hz.hazelcast.cached.thread-2]  com.hazelcast.cluster.ClusterService [172.24.12.181]:5701 [stash-prod-clustered] [3.3-RC3] Removing Member [172.24.12.182]:5701 because it has not sent any heartbeats for 60000 ms.
2014-09-24 04:29:15,609 WARN  [hz.hazelcast.generic-operation.thread-2]  c.h.p.InternalPartitionService [172.24.12.181]:5701 [stash-prod-clustered] [3.3-RC3] This is the master node and received a PartitionRuntimeState from Address[172.24.12.182]:5701. Ignoring incoming state! 
...
2014-09-24 04:30:19,697 WARN  [hz.hazelcast.partition-operation.thread-14]  c.h.s.i.WaitNotifyServiceImpl$WaitingOp [172.24.12.181]:5701 [stash-prod-clustered] [3.3-RC3] Op: com.hazelcast.queue.PollOperation@656e6d34, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[172.24.12.181]:5701, target:Address[172.24.12.182]:5701, partitionId: 126, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-09-24 04:30:20,696 WARN  [hz.hazelcast.partition-operation.thread-14]  c.h.s.i.WaitNotifyServiceImpl$WaitingOp [172.24.12.181]:5701 [stash-prod-clustered] [3.3-RC3] Op: com.hazelcast.queue.PollOperation@656e6d34, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[172.24.12.181]:5701, target:Address[172.24.12.182]:5701, partitionId: 126, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-09-24 04:30:21,696 WARN  [hz.hazelcast.partition-operation.thread-14]  c.h.s.i.WaitNotifyServiceImpl$WaitingOp [172.24.12.181]:5701 [stash-prod-clustered] [3.3-RC3] Op: com.hazelcast.queue.PollOperation@656e6d34, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[172.24.12.181]:5701, target:Address[172.24.12.182]:5701, partitionId: 126, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService

And on the other node (this is just a 2-node cluster):

2014-09-24 04:26:37,774 WARN  [hz.hazelcast.cached.thread-10] ... com.hazelcast.cluster.ClusterService [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] This node does not have a connection to Member [172.24.12.181]:5701
2014-09-24 04:26:37,775 WARN  [hz.hazelcast.IO.thread-in-0]  com.hazelcast.nio.tcp.ReadHandler [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] hz.hazelcast.IO.thread-in-0 Closing socket to endpoint Address[172.24.12.181]:5701, Cause:java.io.EOFException: Remote socket closed!
2014-09-24 04:26:53,255 WARN  [ContainerBackgroundProcessor[StandardEngine[Catalina]]]  com.hazelcast.spi.OperationService [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[172.24.12.182]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.PartitionIteratingOperation, service: hz:impl:mapService
2014-09-24 04:27:04,369 INFO  [http-nio-9080-exec-13] *160S25Ox267x271377x13 172.24.253.78,172.24.36.105,172.24.12.146,172.24.12.182 "POST /plugins/servlet/mirror/pull/STASH/stash HTTP/1.1" c.a.s.p.m.servlet.PullTriggerServlet Trigger mirror on /STASH/stash
2014-09-24 04:27:14,440 WARN  [hz.hazelcast.IO.thread-in-1]  com.hazelcast.nio.tcp.ReadHandler [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] hz.hazelcast.IO.thread-in-1 Closing socket to endpoint Address[172.24.12.181]:5701, Cause:java.io.EOFException: Remote socket closed!
2014-09-24 04:27:14,983 WARN  [hz.hazelcast.generic-operation.thread-8]  com.hazelcast.cluster.ClusterService [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] New join request has been received from an existing endpoint! => Member [172.24.12.181]:5701 Removing old member and processing join request...
2014-09-24 04:27:15,496 WARN  [clusterScheduler_QuartzSchedulerThread]  c.h.map.BasicMapContextQuerySupport [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] Could not get results
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[172.24.12.181]:5701, partitionId: -1, operation: com.hazelcast.map.operation.QueryOperation, service: hz:impl:mapService
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:305) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:181) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:160) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.map.BasicMapContextQuerySupport.getQueryResult(BasicMapContextQuerySupport.java:329) ~[BasicMapContextQuerySupport.class:3.3-RC3]
        at com.hazelcast.map.BasicMapContextQuerySupport.addResultsOfPredicate(BasicMapContextQuerySupport.java:316) ~[BasicMapContextQuerySupport.class:3.3-RC3]
        at com.hazelcast.map.BasicMapContextQuerySupport.query(BasicMapContextQuerySupport.java:227) ~[BasicMapContextQuerySupport.class:3.3-RC3]
        at com.hazelcast.map.proxy.MapProxySupport.query(MapProxySupport.java:1056) [MapProxySupport.class:3.3-RC3]
        at com.hazelcast.map.proxy.MapProxyImpl.keySet(MapProxyImpl.java:576) [MapProxyImpl.class:3.3-RC3]
        at com.atlassian.stash.internal.quartz.HazelcastJobStore.acquireNextTriggers(HazelcastJobStore.java:68) [HazelcastJobStore.class:na]
        at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:272) [QuartzSchedulerThread.class:na]
com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[172.24.12.181]:5701, partitionId: -1, operation: com.hazelcast.map.operation.QueryOperation, service: hz:impl:mapService
        at com.hazelcast.spi.impl.BasicInvocation.initInvocationTarget(BasicInvocation.java:333) ~[BasicInvocation.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocation.doInvoke(BasicInvocation.java:263) ~[BasicInvocation.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocation.run(BasicInvocation.java:569) ~[BasicInvocation.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicTargetInvocation.run(BasicTargetInvocation.java:29) ~[BasicTargetInvocation.class:3.3-RC3]
        at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:209) ~[CachedExecutorServiceDelegate$Worker.class:3.3-RC3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_60]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_60]
        at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59) ~[PoolExecutorThreadFactory$ManagedThread.class:3.3-RC3]
        at ------ End remote and begin local stack-trace ------.(Unknown Source) ~[na:na]
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponse(BasicInvocationFuture.java:349) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocationFuture.resolveApplicationResponseOrThrowException(BasicInvocationFuture.java:282) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:181) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.spi.impl.BasicInvocationFuture.get(BasicInvocationFuture.java:160) ~[BasicInvocationFuture.class:3.3-RC3]
        at com.hazelcast.map.BasicMapContextQuerySupport.getQueryResult(BasicMapContextQuerySupport.java:329) ~[BasicMapContextQuerySupport.class:3.3-RC3]
        at com.hazelcast.map.BasicMapContextQuerySupport.addResultsOfPredicate(BasicMapContextQuerySupport.java:316) ~[BasicMapContextQuerySupport.class:3.3-RC3]
        at com.hazelcast.map.BasicMapContextQuerySupport.query(BasicMapContextQuerySupport.java:227) ~[BasicMapContextQuerySupport.class:3.3-RC3]
        at com.hazelcast.map.proxy.MapProxySupport.query(MapProxySupport.java:1056) [MapProxySupport.class:3.3-RC3]
        at com.hazelcast.map.proxy.MapProxyImpl.keySet(MapProxyImpl.java:576) [MapProxyImpl.class:3.3-RC3]
2014-09-24 04:29:03,277 WARN  [hz.hazelcast.partition-operation.thread-0]  c.h.m.o.ClearExpiredOperation [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this:Address[172.24.12.182]:5701, partitionId: 0, operation: com.hazelcast.map.operation.ClearExpiredOperation, service: null
2014-09-24 04:29:45,183 WARN  [hz.hazelcast.partition-operation.thread-0]  c.hazelcast.spi.impl.BasicInvocation [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] Retrying invocation: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@1deaf1c4, partitionId=0, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=10000, target=Address[172.24.12.182]:5701}, Reason: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this:Address[172.24.12.182]:5701, partitionId: 0, operation: com.hazelcast.concurrent.lock.operations.LockOperation, service: hz:impl:lockService
2014-09-24 04:29:50,187 WARN  [hz.hazelcast.partition-operation.thread-0]  c.hazelcast.spi.impl.BasicInvocation [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] Retrying invocation: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@1deaf1c4, partitionId=0, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=110, callTimeout=10000, target=Address[172.24.12.182]:5701}, Reason: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this:Address[172.24.12.182]:5701, partitionId: 0, operation: com.hazelcast.concurrent.lock.operations.LockOperation, service: hz:impl:lockService
2014-09-24 04:29:55,189 WARN  [hz.hazelcast.partition-operation.thread-0]  c.hazelcast.spi.impl.BasicInvocation [172.24.12.182]:5701 [stash-prod-clustered] [3.3-RC3] Retrying invocation: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@1deaf1c4, partitionId=0, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=120, callTimeout=10000, target=Address[172.24.12.182]:5701}, Reason: com.hazelcast.spi.exception.PartitionMigratingException: Partition is migrating! this:Address[172.24.12.182]:5701, partitionId: 0, operation: com.hazelcast.concurrent.lock.operations.LockOperation, service: hz:impl:lockService

I have the full logfiles if you need more.

@mhurne


commented Sep 24, 2014

We ran into similar warnings today after an upgrade to Hazelcast 3.3.

@jwest


commented Nov 26, 2014

Hi, I see the same problem on my cluster (version 3.3.3). Any ideas?

@SimonTurfer


commented Jan 4, 2015

We're getting this exception on version 3.3.3 as well, running on EC2 instances. It happens when one instance is being shut down from the AWS console.

We're also getting spammed with "Retrying[2] packet send operation to...." messages when this happens, and it won't stop retrying.

[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,928 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,932 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[1] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,937 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,953 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,954 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,955 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-22] 2015-01-04 18:03:09,956 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-3] 2015-01-04 18:03:09,956 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-3] 2015-01-04 18:03:09,958 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-3] 2015-01-04 18:03:09,964 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-3] 2015-01-04 18:03:09,969 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-6] 2015-01-04 18:03:09,969 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
[hz._hzInstance_1_dev.cached.thread-3] 2015-01-04 18:03:09,970 [x.x.x.x]:5701 [dev] [3.3.3] Retrying[2] packet send operation to: Address[x.x.x.x]:5701
@innokenty


commented Jul 7, 2015

Same for v3.5: thousands of such entries were logged, although the cluster seems to have been working okay in general:

Jul 06, 2015 6:00:01 PM com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation
WARNING: [hz-aerostat-test02.haze.yandex.net]:5701 [aerostat-testing] [3.5] Op: com.hazelcast.concurrent.lock.operations.LockOperation{serviceName='null', partitionId=199, callId=0, invocationTime=1436195180735, waitTimeout=240000, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[hz-aerostat-test02.haze.yandex.net]:5701, target:Address[hz-aerostat-test01.haze.yandex.net]:5701, partitionId: 199, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:lockService
@wenerme


commented Oct 12, 2015

Confirmed that the data loss was caused by my own mistake. When I do HA testing, I hit a lot of WrongTargetExceptions, but things keep working.

@peterlitvak


commented Oct 16, 2015

Getting this on 3.5.2, in EC2, when doing a rolling restart of the nodes.
One thing to note is that our app calls map.lock() in one of our beans upon instantiation.
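For illustration, the pattern is roughly like the sketch below; the class, map, and key names are placeholders, not our real code:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Placeholder bean illustrating the pattern: a distributed lock is taken
// while the bean (and possibly the whole cluster member) is still starting up.
public class StartupLockingBean {

    private final IMap<String, String> configMap;

    public StartupLockingBean(HazelcastInstance hz) {
        this.configMap = hz.getMap("config");
        configMap.lock("startup-key");          // cluster-wide lock during instantiation
        try {
            // ... one-time initialization guarded by the lock ...
        } finally {
            configMap.unlock("startup-key");
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        new StartupLockingBean(hz);
    }
}
```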

@enesakar enesakar removed the PENDING label Nov 2, 2015

@agherardi

Contributor Author

commented Mar 10, 2016

Same issue with 3.5.5. Scenario:

  • Cluster with 2 nodes - 10.220.178.194 and 10.220.178.91
  • Unplugged 10.220.178.91 from the network
  • As expected, 10.220.178.194 says that it's now the only member:

INFO [2016-03-10 19:23:50,660] com.hazelcast.cluster.ClusterService: [10.220.178.194]:5701 [XXX_1-c0d85fee-d356-4042-bb13-6b1c55bc8233] [3.5.5]
Members [1] {
Member [10.220.178.194]:5701 this
}

  • Plugged 10.220.178.91 back into the network
  • After a while, the 2 members re-join the same cluster:

INFO [2016-03-10 19:27:32,708] com.hazelcast.cluster.ClusterService: [10.220.178.194]:5701 [XXX_1-c0d85fee-d356-4042-bb13-6b1c55bc8233] [3.5.5]

Members [2] {
Member [10.220.178.91]:5701
Member [10.220.178.194]:5701 this
}

  • About 1 minute later, the avalanche of WrongTargetExceptions starts in the 10.220.178.194 log:

WARN [2016-03-10 19:28:26,597] com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation: [10.220.178.194]:5701 [XXX_1-c0d85fee-d356-4042-bb13-6b1c55bc8233] [3.5.5] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{identityHash=1657152703, serviceName='null', partitionId=176, callId=502, invocationTime=1457638045699, waitTimeout=-420001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.220.178.194]:5701, target:Address[10.220.178.91]:5701, partitionId: 176, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService
WARN [2016-03-10 19:28:27,598] com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation: [10.220.178.194]:5701 [XXX_1-c0d85fee-d356-4042-bb13-6b1c55bc8233] [3.5.5] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{identityHash=1657152703, serviceName='null', partitionId=176, callId=502, invocationTime=1457638045699, waitTimeout=-420001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.220.178.194]:5701, target:Address[10.220.178.91]:5701, partitionId: 176, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService

  • This keeps going on and on, at the rate of about 1 WARN a second.
@agherardi

Contributor Author

commented Mar 24, 2016

I submitted a tentative patch with this pull request: #7825

@colinduplantis


commented Apr 22, 2016

I have seen this problem as well with 3.5.x. For me, it requires some kind of cluster-wide operation (e.g. getLock or executeOnAllMembers) while an instance is starting up. FWIW, this problem seems easier to reproduce on a 16-core machine.
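To make the kind of trigger concrete, here is a minimal invented sketch (the lock, executor, and task names are placeholders); it simply fires cluster-wide calls right after an instance starts, which is exactly the window where we see the problem if another member is still joining:

```java
import java.io.Serializable;
import java.util.concurrent.Callable;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

// Invented sketch: cluster-wide operations issued immediately after startup.
public class ClusterWideCallDuringStartup {

    // Simple task broadcast to every member; must be Serializable to travel the wire.
    static class Ping implements Callable<String>, Serializable {
        @Override
        public String call() {
            return "pong";
        }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // A cluster-wide lock acquired right after start ...
        hz.getLock("startup-lock").lock();
        hz.getLock("startup-lock").unlock();

        // ... or an executor task sent to all members. If another member is
        // joining at exactly this moment, this is the window we suspect.
        IExecutorService executor = hz.getExecutorService("default");
        executor.submitToAllMembers(new Ping());
    }
}
```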

@tehsu


commented Jun 24, 2016

I'm seeing this as well. Do we know if there has been a fix? It looks like the PR is still waiting.

@colinduplantis


commented Jun 24, 2016

I don't have any info on a fix yet.

It's very difficult for us to reproduce reliably. One thing that has popped up for us, though, is that we were using Oracle as our RDBMS. There is a bug/feature in the Oracle JDBC driver that causes it to consume a fair bit of randomness from /dev/random on start. Combine this with a pre-3.16 Linux kernel, which was slow to collect entropy, and a bug in the Oracle Java 7 configuration that incorrectly points the random generator at the blocking random device instead of the non-blocking one.

Now, none of that has anything to do with Hazelcast yet, except that we found Hazelcast seems to be sensitive to executing cluster-wide operations (lock, exec-service execution on all members, replicated map, etc.) during start.

So, in a scenario where half our cluster has started and is executing, and another set of machines joins the party late, the first set is more likely to execute these cluster-wide operations while the second set is starting up. There appears to be a brief window during cluster start where Hcast is vulnerable, with a deadlock resulting if a cluster-wide operation is executed at exactly the right time during start.

Finally, the Oracle thing comes in. We start our cluster, which starts the Hcast service via Spring config, which depends on our DB pool. The DB pool fires up, and Hcast via Spring waits to complete initialization until the DB pool has started. The DB pool is blocking on /dev/random because of the clusterfsck detailed above. This holds Hcast in a state in which it appears to be vulnerable to deadlock for seconds to minutes, thus inviting disaster.

Factor in that we've only managed to reproduce this on a 16-core server.

This may or may not be what you're seeing, and we still can't quite nail it down, but we haven't seen a recurrence since we applied a tweak to force the JVM to use the non-blocking random device.

The fix is to add this to your JVM at start:

-Djava.security.egd=file:/dev/../dev/urandom

Of course, none of this may apply to you.


@bradmac


commented Jul 7, 2016

We've just seen the same issue in EC2 after a rolling deploy.

@peterlitvak


commented Jul 7, 2016

Yes, unfortunately we had to abandon Hazelcast for this reason in our current app.

@jerrinot jerrinot added this to the 3.7 milestone Jul 7, 2016

@jerrinot

Contributor

commented Jul 7, 2016

Related to #3754

@tombujok

Contributor

commented Jul 7, 2016

Thanks for your valuable comments, guys. We have a fix on the way, as @jerrinot pointed out.
We're planning to address it in the coming days.

jerrinot added a commit to jerrinot/hazelcast that referenced this issue Jul 19, 2016

Delay operation retry when node is not joined
Fixes hazelcast#3754, hazelcast#3395

Reasoning:
When a smaller cluster is being merged into a bigger one, operations waiting in the WaitNotify queues are cleared and pending invocations receive a MemberLeftException. However, the invocation can retry immediately and send the operation again.

This change adds a check to the retry path so that the operation is re-executed only once the node has joined (again). The happy path is unaffected.

This commit also adds a support class for split-brain tests. Many thanks to @mdogan, who helped me with the analysis and also contributed the original test.
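As a rough illustration of the guard described above (the class and method names below, such as RetryGuard and isJoined, are invented purely to sketch the delayed-retry idea; the real change lives in Hazelcast's invocation internals):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Invented sketch of the delayed-retry guard; not Hazelcast's actual code.
final class RetryGuard {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final BooleanSupplier isJoined;   // e.g. backed by the node's joined state
    private final long retryDelayMs;

    RetryGuard(BooleanSupplier isJoined, long retryDelayMs) {
        this.isJoined = isJoined;
        this.retryDelayMs = retryDelayMs;
    }

    void retry(Runnable invocation) {
        if (!isJoined.getAsBoolean()) {
            // The local node has not (re-)joined yet, e.g. right after a split-brain
            // merge: delay the retry instead of firing it immediately and getting
            // another WrongTargetException.
            scheduler.schedule(() -> retry(invocation), retryDelayMs, TimeUnit.MILLISECONDS);
            return;
        }
        // Happy path: the node is joined, retry the operation right away.
        invocation.run();
    }
}
```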

@jerrinot

Contributor

commented Jul 21, 2016

Fixed by #8550.
