New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OperationTimeoutException #2051

Closed
mgrossmann opened this Issue Mar 18, 2014 · 39 comments

Comments

Projects
None yet
8 participants
@mgrossmann

mgrossmann commented Mar 18, 2014

In case of running our code on a virtual machine, we are getting following exception:

com.hazelcast.core.OperationTimeoutException: No response for 10000 ms.

In production it could take several hours until occurrence. So we are using .setProperty(PROP_OPERATION_CALL_TIMEOUT_MILLIS, '5000')
to speed up this problem. The main issue is the fact that hazelcast keeps throwing exception and stays inresponsive.

We put together some example code: gist

If we run this code on two dedicated (virtual) machines,
it lasts up to 10 minutes until the error occurres.

@gurbuzali gurbuzali added the 3.x label Mar 18, 2014

@mgrossmann

This comment has been minimized.

mgrossmann commented Mar 20, 2014

some log messages from our load tests:

13 Mrz 2014 15:43:33,798 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] No response for 10001 ms. InvocationFuture{invocation=InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}, done=false}
13 Mrz 2014 15:43:33,800 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] Asking if operation execution has been started: InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}
13 Mrz 2014 15:43:38,826 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] While asking 'is-executing': InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=0, callTimeout=5000, target=Address[10.33.124.4]:5900}
java.util.concurrent.TimeoutException
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:449)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:297)
    at com.hazelcast.spi.impl.InvocationImpl.isOperationExecuting(InvocationImpl.java:493)
    at com.hazelcast.spi.impl.InvocationImpl.access$1300(InvocationImpl.java:35)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.waitForResponse(InvocationImpl.java:393)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:296)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:287)
    at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:220)
    at com.hazelcast.map.proxy.MapProxySupport.setInternal(MapProxySupport.java:266)
    at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:165)
    at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:152)
    at com.acme.JobController.updateTransmissionState(JobController.java:3250)
    at com.acme.JobController.updateTransmissionJob(JobController.java:727)
    at com.acme.JobController$$FastClassByCGLIB$$1b558953.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:698)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
    at com.acme.TracingAspect.traceExecutionTime(TracingAspect.java:62)
    at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:631)
    at com.acme.JobController$$EnhancerByCGLIB$$4a13de4e.updateTransmissionJob(<generated>)
    at com.acme.ProtocolMonitorImpl.updateTransmissionJob(ProtocolMonitorImpl.java:258)
    at com.acme.ProtocolMonitorImpl$$FastClassByCGLIB$$12d821d.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:698)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
    at com.acme.TracingAspect.traceExecutionTime(TracingAspect.java:62)
    at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:631)
    at com.acme.ProtocolMonitorImpl$$EnhancerByCGLIB$$74bc97d8.updateTransmissionJob(<generated>)
    at com.acme.UserMonitor.handleOutputEvent(UserMonitor.java:984)
    at com.acme.events.output.Event.handle(Event.java:26)
    at com.acme.UserMonitor.run(UserMonitor.java:348)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
13 Mrz 2014 15:43:38,857 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] 'is-executing': false -> InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=0, callTimeout=5000, target=Address[10.33.124.4]:5900}
13 Mrz 2014 15:43:48,858 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] No response for 10001 ms. InvocationFuture{invocation=InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}, done=false}
13 Mrz 2014 15:43:48,858 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] Asking if operation execution has been started: InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}
13 Mrz 2014 15:43:53,859 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] While asking 'is-executing': InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}
13 Mrz 2014 15:43:53,864 : [WARN  ] [Session: 2060166329] com.hazelcast.spi.Invocation.?(): [10.33.124.3]:5900 [evo6c1] 'is-executing': false -> InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}
13 Mrz 2014 15:43:53,865 : [ERROR ] [Session: 2060166329] com.acme.SessionImpl.?()        : An unexpected OperationTimeoutException occurred.
com.hazelcast.core.OperationTimeoutException: No response for 30000 ms. Aborting invocation! InvocationFuture{invocation=InvocationImpl{ serviceName='hz:impl:mapService', op=SetOperation{jobsWaitingForReceipt}, partitionId=122, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=5000, target=Address[10.33.124.4]:5900}, done=false}
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.waitForResponse(InvocationImpl.java:399)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:296)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:287)
    at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:220)
    at com.hazelcast.map.proxy.MapProxySupport.setInternal(MapProxySupport.java:266)
    at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:165)
    at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:152)
    at com.acme.JobController.updateTransmissionState(JobController.java:3250)
    at com.acme.JobController.updateTransmissionJob(JobController.java:727)
    at com.acme.JobController$$FastClassByCGLIB$$1b558953.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:698)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
    at com.acme.TracingAspect.traceExecutionTime(TracingAspect.java:62)
    at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:631)
    at com.acme.JobController$$EnhancerByCGLIB$$4a13de4e.updateTransmissionJob(<generated>)
    at com.acme.ProtocolMonitorImpl.updateTransmissionJob(ProtocolMonitorImpl.java:258)
    at com.acme.ProtocolMonitorImpl$$FastClassByCGLIB$$12d821d.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:698)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
    at com.acme.TracingAspect.traceExecutionTime(TracingAspect.java:62)
    at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:631)
    at com.acme.ProtocolMonitorImpl$$EnhancerByCGLIB$$74bc97d8.updateTransmissionJob(<generated>)
    at com.acme.UserMonitor.handleOutputEvent(UserMonitor.java:984)
    at com.acme.events.output.Event.handle(Event.java:26)
    at com.acme.UserMonitor.run(UserMonitor.java:348)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
    at ------ End remote and begin local stack-trace ------.(Unknown Source)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:428)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:297)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:287)
    at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:220)
    at com.hazelcast.map.proxy.MapProxySupport.setInternal(MapProxySupport.java:266)
    at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:165)
    at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:152)
    at com.acme.JobController.updateTransmissionState(JobController.java:3250)
    at com.acme.JobController.updateTransmissionJob(JobController.java:727)
    at com.acme.JobController$$FastClassByCGLIB$$1b558953.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:698)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
    at com.acme.TracingAspect.traceExecutionTime(TracingAspect.java:62)
    at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:631)
    at com.acme.JobController$$EnhancerByCGLIB$$4a13de4e.updateTransmissionJob(<generated>)
    at com.acme.ProtocolMonitorImpl.updateTransmissionJob(ProtocolMonitorImpl.java:258)
    at com.acme.ProtocolMonitorImpl$$FastClassByCGLIB$$12d821d.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:698)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
    at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
    at com.acme.TracingAspect.traceExecutionTime(TracingAspect.java:62)
    at sun.reflect.GeneratedMethodAccessor165.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
    at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
    at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:631)
    at com.acme.ProtocolMonitorImpl$$EnhancerByCGLIB$$74bc97d8.updateTransmissionJob(<generated>)
    at com.acme.UserMonitor.handleOutputEvent(UserMonitor.java:984)
    at com.acme.events.output.Event.handle(Event.java:26)
    at com.acme.UserMonitor.run(UserMonitor.java:348)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

@gurbuzali gurbuzali self-assigned this Mar 21, 2014

@gurbuzali gurbuzali added this to the 3.2.1 milestone Mar 21, 2014

@mgrossmann

This comment has been minimized.

mgrossmann commented Mar 24, 2014

Because the issue 1949 looks very simular to our issue, we tried to use the latest RC3-SNAPSHOT.
Unfortunately this do not fix our problems.

@mgrossmann

This comment has been minimized.

mgrossmann commented Mar 25, 2014

With latest 3.2-RCx version com.hazelcast.spi.impl.InvocationImpl$InvocationFuture is now
com.hazelcast.spi.impl.BasicInvocation$InvocationFuture

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 7, 2014

The virtual machines are configured to have a single core. Running the tests on non virtual machines
did not show the problem. Reducing the amount of parallel threads in our application seems to solve the problem, too. We also tried to reduce the number of distributed maps / queues, but the problem was still existing.

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 8, 2014

Hours later i have to correct my self. With a reduced amount of threads, the problem still exists.
May be it is interesting that the complete hazelcast subsystem seems to be unresponsive.
I'm not even able to retrieve any jmx information from your mbeans, after the problem occurred.

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 8, 2014

Another indication could be the blocked count of the IO.thread-out-0. 1.600.000 x blocked?!
IO.thread-in-0 was 100 x blocked, only ?!

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 8, 2014

Hi Mike,

how easy is it for you (or for us) to reproduce this issue?

I'll run it on my local setup because this is a very important bug for us to resolve.

So:
does it only happen on virtualised environments?

On Tue, Apr 8, 2014 at 1:53 PM, Mike Großmann notifications@github.comwrote:

Another indication could be the blocked count of the IO.thread-out-0.
1.600.000 x blocked?!
IO.thread-in-0 was 100 x blocked, only ?!

Reply to this email directly or view it on GitHubhttps://github.com//issues/2051#issuecomment-39834711
.

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 8, 2014

At the moment the example code here is the only way to produce a similar problem.
(it could not be reproduced by 100%)

Tomorrow i could try to improve this script.

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 8, 2014

Can you tell everything about the setup? Java versions, os, hypervisor etc.

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 8, 2014

Hi Peter ..

i will try :)

Both nodes running on java 1.7 on Ubuntu Precise.
These machines running on a vSphere 5.1.0

The following VM State is from the current test system (the problem already occurred)


Virtual Machine:ᅠ

Java HotSpot(TM) 64-Bit Server VM version 23.6-b04

Vendor:ᅠ

Oracle Corporation

Uptime:ᅠ

1 day 6 hours 24 minutes

Process CPU time:ᅠ

7 hours 37 minutes

JIT compiler:ᅠ

HotSpot 64-Bit Tiered Compilers

Total compile time:ᅠ

8 minutes


Live threads:ᅠ

ᅠᅠᅠᅠᅠ87

Peak:ᅠ

ᅠᅠᅠᅠ200

Daemon threads:ᅠ

ᅠᅠᅠᅠᅠ28

Total threads started:ᅠ

542.969

Current classes loaded:ᅠ

10.570

Total classes loaded:ᅠ

10.814

Total classes unloaded:ᅠ

ᅠᅠᅠ244


Current heap size:ᅠ

ᅠᅠ284.827 kbytes

Maximum heap size:ᅠ

2.027.264 kbytes

Committed memory:ᅠ

474.500 kbytes

Pending finalization:ᅠ

{0} objects

Garbage collector:ᅠ

Name = 'Copy', Collections = 34.385, Total time spent = 9 minutes

Garbage collector:ᅠ

Name = 'MarkSweepCompact', Collections = 38, Total time spent = 23,722 seconds


Operating System:ᅠ

Linux 3.2.0-60-generic

Architecture:ᅠ

amd64

Number of processors:ᅠ

1

Committed virtual memory:ᅠ

3.151.668 kbytes

Total physical memory:ᅠ

4.049.976 kbytes

Free physical memory:ᅠ

ᅠᅠ291.840 kbytes

Total swap space:ᅠ

4.193.276 kbytes

Free swap space:ᅠ

4.193.260 kbytes


VM arguments:ᅠ

-Dspring.profiles.active=oracle,cluster -Djava.net.preferIPv4Stack=true -Djava.security.auth.login.config=UserManager.config -Djava.util.prefs.systemRoot=/home/ -Xmx2048M -Djava.security.egd=file:///dev/urandom -Duser.language=de

Library path:ᅠ

/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib


This is our configuration:

    @Bean
    public HazelcastInstance createHzInstance()
    {
        final String clusterGroupName = this.environment.getProperty("cluster.group.name");
        final String nodeName = this.environment.getProperty("cluster.node.name");
        final String nodeIp = this.environment.getProperty("cluster.node.ip");
        final int nodePort = new Integer(this.environment.getProperty("cluster.node.port"));
        final String[] otherNodes = this.environment.getProperty("cluster.nodes").split(",");
        final String managementCenterUrl = this.environment.getProperty("cluster.mc.url");

        final Config config = new Config();
        config.setProperty(PROP_LOGGING_TYPE, "slf4j");
        config.setProperty(PROP_ENABLE_JMX, "true");
        config.setProperty(PROP_ENABLE_JMX_DETAILED, "true");
        config.setProperty(PROP_SYSTEM_LOG_ENABLED, "false");
        config.setProperty(PROP_SHUTDOWNHOOK_ENABLED, "false");
        config.setInstanceName(nodeName);

        if (managementCenterUrl != null)
        {
            ManagementCenterConfig managementCenterConfig = new ManagementCenterConfig();
            managementCenterConfig.setEnabled(true);
            managementCenterConfig.setUrl(managementCenterUrl);
            managementCenterConfig.setUpdateInterval(5);

            config.setManagementCenterConfig(managementCenterConfig);
        }

        final GroupConfig groupConfig = config.getGroupConfig();
        groupConfig.setName(clusterGroupName);
        groupConfig.setPassword("***");

        final NetworkConfig networkConfig = config.getNetworkConfig();

        if (nodePort == 0)
        {
            networkConfig.setPort(5900);
            networkConfig.setPortAutoIncrement(true);
        }
        else
        {
            networkConfig.setPort(nodePort);
            networkConfig.setPortAutoIncrement(false);
        }

        final JoinConfig join = networkConfig.getJoin();
        join.getMulticastConfig().setEnabled(false);

        for (final String node : otherNodes)
        {
            join.getTcpIpConfig().addMember(node).setEnabled(true);
        }

        InetAddress inetAddress;
        try
        {
            inetAddress = InetAddress.getByName(nodeIp);
            networkConfig.getInterfaces().addInterface(inetAddress.getHostAddress());
        }
        catch (final UnknownHostException e)
        {
            e.printStackTrace();
        }

        networkConfig.getInterfaces().setEnabled(true);

        // create and configure all needed maps and queues

        // cache
        configureStationCache(config);

        return Hazelcast.newHazelcastInstance(config);
    }
@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 8, 2014

Could you run with something else than ubuntu? E.g. centos/rhel. My
colleagues have been trying to fix a bug and eventually it boiled down to
ubuntu. It would be nice if we could confirm if is only happening under
ubuntu.

On Tue, Apr 8, 2014 at 5:58 PM, Mike Großmann notifications@github.comwrote:

Hi Peter ..

i will try :)

Both nodes running on java 1.7 on Ubuntu Precise.
These machines running on a vSphere 5.1.0

The following VM State is from the current test system (the problem

already occurred)

Virtual Machine:

Java HotSpot(TM) 64-Bit Server VM version 23.6-b04
Vendor:

Oracle Corporation
Uptime:

1 day 6 hours 24 minutes
Process CPU time:

7 hours 37 minutes
JIT compiler:

HotSpot 64-Bit Tiered Compilers
Total compile time:

8 minutes

Live threads:

87
Peak:

200
Daemon threads:

28
Total threads started:

542.969
Current classes loaded:

10.570
Total classes loaded:

10.814
Total classes unloaded:

244

Current heap size:

284.827 kbytes
Maximum heap size:

2.027.264 kbytes
Committed memory:

474.500 kbytes
Pending finalization:

{0} objects
Garbage collector:

Name = 'Copy', Collections = 34.385, Total time spent = 9 minutes
Garbage collector:

Name = 'MarkSweepCompact', Collections = 38, Total time spent = 23,722

seconds

Operating System:

Linux 3.2.0-60-generic
Architecture:

amd64
Number of processors:

1
Committed virtual memory:

3.151.668 kbytes
Total physical memory:

4.049.976 kbytes
Free physical memory:

291.840 kbytes
Total swap space:

4.193.276 kbytes
Free swap space:

4.193.260 kbytes

VM arguments:

-Dspring.profiles.active=oracle,cluster -Djava.net.preferIPv4Stack=true
-Djava.security.auth.login.config=UserManager.config
-Djava.util.prefs.systemRoot=/home/ -Xmx2048M
-Djava.security.egd=file:///dev/urandom -Duser.language=de
Library path:

/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

This is our configuration:

@Bean
public HazelcastInstance createHzInstance()
{
    final String clusterGroupName = this.environment.getProperty("cluster.group.name");
    final String nodeName = this.environment.getProperty("cluster.node.name");
    final String nodeIp = this.environment.getProperty("cluster.node.ip");
    final int nodePort = new Integer(this.environment.getProperty("cluster.node.port"));
    final String[] otherNodes = this.environment.getProperty("cluster.nodes").split(",");
    final String managementCenterUrl = this.environment.getProperty("cluster.mc.url");

    final Config config = new Config();
    config.setProperty(PROP_LOGGING_TYPE, "slf4j");
    config.setProperty(PROP_ENABLE_JMX, "true");
    config.setProperty(PROP_ENABLE_JMX_DETAILED, "true");
    config.setProperty(PROP_SYSTEM_LOG_ENABLED, "false");
    config.setProperty(PROP_SHUTDOWNHOOK_ENABLED, "false");
    config.setInstanceName(nodeName);

    if (managementCenterUrl != null)
    {
        ManagementCenterConfig managementCenterConfig = new ManagementCenterConfig();
        managementCenterConfig.setEnabled(true);
        managementCenterConfig.setUrl(managementCenterUrl);
        managementCenterConfig.setUpdateInterval(5);

        config.setManagementCenterConfig(managementCenterConfig);
    }

    final GroupConfig groupConfig = config.getGroupConfig();
    groupConfig.setName(clusterGroupName);
    groupConfig.setPassword("***");

    final NetworkConfig networkConfig = config.getNetworkConfig();

    if (nodePort == 0)
    {
        networkConfig.setPort(5900);
        networkConfig.setPortAutoIncrement(true);
    }
    else
    {
        networkConfig.setPort(nodePort);
        networkConfig.setPortAutoIncrement(false);
    }

    final JoinConfig join = networkConfig.getJoin();
    join.getMulticastConfig().setEnabled(false);

    for (final String node : otherNodes)
    {
        join.getTcpIpConfig().addMember(node).setEnabled(true);
    }

    InetAddress inetAddress;
    try
    {
        inetAddress = InetAddress.getByName(nodeIp);
        networkConfig.getInterfaces().addInterface(inetAddress.getHostAddress());
    }
    catch (final UnknownHostException e)
    {
        e.printStackTrace();
    }

    networkConfig.getInterfaces().setEnabled(true);

    // create and configure all needed maps and queues

    // cache
    configureStationCache(config);

    return Hazelcast.newHazelcastInstance(config);
}

Reply to this email directly or view it on GitHubhttps://github.com//issues/2051#issuecomment-39858129
.

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 8, 2014

i will check this.

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 9, 2014

Hi Peter,

should i get any further informations out of the currently stucked system, before i start with a new setup?

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 9, 2014

Hi Mike,

the most important piece of information we currently need is to narrow the problem down to a specific operating system.

How difficult is it to roll out on a different os?

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 9, 2014

We have to rent new virtual machines internally. This will last at least 3 working days.
So i could provide the needed informationsearliest at monday or tuesday.
I will try to push this in any way.

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 9, 2014

Do you know which hypervisor you are using? (What is your cloud provider?) We see that there is an issue with Ubuntu on EC2 (Xen).

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 9, 2014

We rent our machines internally. All that i know is that we are using VMWare.

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 9, 2014

I'll run on EC2 in combination with Ubuntu to see if it can be reproduced and with a non ubuntu to see if the issue is not reproduced. Do I need to have 2 os-instances, or can the application twice on the same os-instance?

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 9, 2014

For now we are running the tests on two different os-instances.

@enesakar enesakar assigned pveentjer and unassigned gurbuzali Apr 14, 2014

@mgrossmann

This comment has been minimized.

mgrossmann commented Apr 14, 2014

Hey Peter,

the test script fails after two hours on the new centos instances. :( now i have started the real application.

/Mike

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 14, 2014

Thanks. I'll convert your project to our stabilizer-system (stress testing system for hazelcast that can automatically roll out in ec2) and see if I can reproduce it. My main concern is to set up this environment, but I'll try to sneak on this test.

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 15, 2014

Finally I have time for this particular issue.

@pveentjer

This comment has been minimized.

Member

pveentjer commented Apr 15, 2014

Question about the code. I'm converting it to Java and ran into the following:

   def key = instance.config.instanceName + ' - ' + UUID.randomUUID()
    activityQueue.add(key)

Key is of type String, but activityQueue is of type Data. Am I missing something?

@mdogan mdogan modified the milestones: 3.2.2, 3.2.1 May 6, 2014

@engrun

This comment has been minimized.

engrun commented May 12, 2014

Just to let you know
We have observed similar situations with CentOs running on Hyper-V virtual machines.

We see lots of these warnings
2014-05-12 11:52:57,264 WARN com.hazelcast.spi.impl.BasicInvocation - [10.230.48.190]:5701 [qa-cluster] [3.2.1] 'is-executing': true -> InvocationFuture{invocation=BasicInvocation{ serviceName='hz:impl:executorService
', op=com.hazelcast.executor.CallableTaskOperation@95bc3ee, partitionId=172, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[10.230.48.189]:5701}, done=false}

Running Hazelcast 3.2.1

According to our developers, it's hard to reproduce, and happens only after some time.
When these warnings appear in the log, hazelcast is unresponsive, and a full restart is required to fix it.

@enesakar enesakar modified the milestones: 3.2.3, 3.2.2 May 21, 2014

@engrun

This comment has been minimized.

engrun commented May 21, 2014

Hi Mike.

I am currently converting the groovy examples to java/maven, and I have some questions regarding your example code.

In the consume closure, you are polling the preProcessingQueue for an instance of Data.
It is then added twice to the activityQueue.
This rapidly/continuously fills the queue, and after a while the process runs out of men. (running with 4GB on each node).

(The remove call in the Produce has (obviously) no relevance for the consume closure, as the consumer is likely to poll AFTER the producer has called remove)

In effect, each loop in the consume closure adds the data twice to the queue without removing it.

Any input on this?
Is this intendend behaviour?

  1. postProcessingQueue and sendQueue are never used.
    What's your initial idea for these queues?
@gurbuzali

This comment has been minimized.

Member

gurbuzali commented May 29, 2014

I believe #2516 is related with this one

@enesakar

This comment has been minimized.

Member

enesakar commented Jun 20, 2014

see also #2778

@mdogan

This comment has been minimized.

Member

mdogan commented Aug 14, 2014

#3207 partially fixes this issue.

@pveentjer

This comment has been minimized.

Member

pveentjer commented Aug 26, 2014

We have resolved this issue (the spurious wakueup fix). So I think we can close this issue.

@mdogan @jerrinot. WDYT?

@jerrinot

This comment has been minimized.

Contributor

jerrinot commented Aug 26, 2014

There is still the other failure mode - probably related to congestion or
flow control algorithm.
From the other hand I assume a vast majority of real
world OperationTimeoutException were caused by the spurious wake-up.

On Tue, Aug 26, 2014 at 10:13 AM, Peter Veentjer notifications@github.com
wrote:

We have resolved this issue (the spurious wakueup fix). So I think we can
close this issue.

@mdogan https://github.com/mdogan @jerrinot
https://github.com/jerrinot. WDYT?


Reply to this email directly or view it on GitHub
#2051 (comment)
.

“Perfection is achieved, not when there is nothing more to add, but when
there is nothing left to take away.”
Antoine de Saint Exupéry

@pveentjer

This comment has been minimized.

Member

pveentjer commented Aug 26, 2014

Perhaps we should close the issue for the time being and if we see another OperationTimeout, we reopen it (or create a new ticket but now targetted for the lower level congestion controll)

@gurbuzali

This comment has been minimized.

Member

gurbuzali commented Aug 26, 2014

@enesakar enesakar closed this Aug 26, 2014

@tincugabriel

This comment has been minimized.

tincugabriel commented Sep 14, 2014

Is there any release candidate available where this issue can be considered fixed ?

@pveentjer

This comment has been minimized.

Member

pveentjer commented Sep 14, 2014

3.3 and 3.2.6 for sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment