
[operation] Hazelcast 3.7: PollOperation invocation failed to complete due to operation-heartbeat-timeout #8831

Closed
maheshreddy77 opened this issue Sep 2, 2016 · 56 comments


@maheshreddy77 maheshreddy77 commented Sep 2, 2016

Hi,

I am using Hazelcast 3.7.
There are no other nodes in the cluster.

When I do a poll with a timeout, I get the error below:

{
    IQueue<Integer> notifyQueue = ...
    ...
    Integer segmentId = notifyQueue.poll(2, TimeUnit.MINUTES);
    return segmentId;
}

Error:

PollOperation invocation failed to complete due to operation-heartbeat-timeout. 
Current time: 2016-09-02 05:34:46.359. 
Total elapsed time: 121499 ms. 
Last operation heartbeat: never. 
Last operation heartbeat from member: 2016-09-02 05:34:42.350. Invocation{op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='hz:impl:queueService', identityHash=500295452, partitionId=51, replicaIndex=0, callId=0, invocationTime=1472819549423 (2016-09-02 05:32:29.423), waitTimeout=60000, callTimeout=60000}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1472819564860, firstInvocationTime='2016-09-02 05:32:44.860', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 16:00:00.000', target=[172.31.142.27]:5902, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}

Regards,
Mahesh

@jerrinot jerrinot added this to the 3.7.2 milestone Sep 2, 2016

@jerrinot jerrinot commented Sep 2, 2016

Hello @maheshreddy77,

Many thanks for reporting. It looks like a bug in the invocation system.


@pveentjer pveentjer commented Sep 2, 2016

I tried to reproduce this using the following program:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IQueue;

import java.util.concurrent.TimeUnit;

public class Main {

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IQueue<Object> notifyQueue = hz.getQueue("foo");
        System.out.println("Waiting");
        Object result = notifyQueue.poll(2, TimeUnit.MINUTES);
        System.out.println("Ready " + result);
    }
}

But this works fine. Can you provide a reproducer so we can figure out what is happening?


@jerrinot jerrinot commented Sep 22, 2016

Hi @maheshreddy77: do you have any update?

@jerrinot jerrinot modified the milestones: 3.7.3, 3.7.2 Sep 22, 2016

@jmcshane jmcshane commented Oct 20, 2016

I can't get a consistent reproduction of this issue, but I did want to report that this is happening regularly in one application that uses Hazelcast IQueue:

com.hazelcast.core.OperationTimeoutException: SizeOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2016-10-20 05:27:17.982. Total elapsed time: 120999 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2016-10-20 05:27:13.705. Invocation{op=com.hazelcast.collection.impl.queue.operations.SizeOperation{serviceName='hz:impl:queueService', identityHash=253967771, partitionId=3, replicaIndex=0, callId=0, invocationTime=1476955516983 (2016-10-20 05:25:16.983), waitTimeout=-1, callTimeout=60000}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1476955516983, firstInvocationTime='2016-10-20 05:25:16.983', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 19:00:00.000', target=[HOST2]:5715, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=10, /HOST1:5715->/HOST2:46870, endpoint=[HOST2]:5715, alive=true, type=MEMBER]}

This does not begin to happen immediately after the app starts, but rather after some significant workload has been placed on the Hazelcast queuing system.


@jerrinot jerrinot commented Oct 20, 2016

@jmcshane: what Hazelcast version is this?


@jmcshane jmcshane commented Oct 20, 2016

Version 3.7


@jerrinot jerrinot commented Oct 20, 2016

@jmcshane: can you try it with 3.7.2? There was actually a problem in 3.7 and it's fixed in 3.7.2: ecbd261


@jmcshane jmcshane commented Oct 20, 2016

Perfect, I'll do that


@jmcshane jmcshane commented Oct 26, 2016

@jerrinot I applied the update, and it took some time for these issues to reoccur, but we are experiencing problems once again. At this point, every operation between the two Hazelcast nodes is timing out with this operation-heartbeat-timeout. Are there some network requirements that I am missing? Would I need to look into the VMware setup? I have tried pinging the IPs and telnetting to the corresponding member's Hazelcast port, and both succeed even while these operations fail.

    java.util.concurrent.ExecutionException: com.hazelcast.core.OperationTimeoutException: LockOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2016-10-26 01:20:55.431. Total elapsed time: 10999 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2016-10-26 01:20:51.827. Invocation{op=com.hazelcast.concurrent.lock.operations.LockOperation{serviceName='hz:impl:lockService', identityHash=481872067, partitionId=174, replicaIndex=0, callId=0, invocationTime=1477459127420 (2016-10-26 01:18:47.420), waitTimeout=50, callTimeout=10000, namespace=com.hazelcast.concurrent.lock.InternalLockNamespace@ce6cf218, threadId=115}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=10000, firstInvocationTimeMs=1477459255432, firstInvocationTime='2016-10-26 01:20:55.432', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 19:00:00.000', target=[10.MYHOST1]:5715, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=64, /10.MYHOST2:34860->/10.MYHOST1:5715, endpoint=[10.MYHOST1]:5715, alive=true, type=MEMBER]}
            at java.util.concurrent.FutureTask.report(FutureTask.java:122)
            at java.util.concurrent.FutureTask.get(FutureTask.java:192)
            at myco.MyClass$1.run(MyClass.java:66)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: com.hazelcast.core.OperationTimeoutException: LockOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2016-10-26 01:20:55.431. Total elapsed time: 9998 ms. Last operation heartbeat: never$
            at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
            at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
            at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
            at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
            at com.hazelcast.concurrent.lock.LockProxySupport.tryLock(LockProxySupport.java:136)
            at com.hazelcast.concurrent.lock.LockProxySupport.tryLock(LockProxySupport.java:125)
            at com.hazelcast.concurrent.lock.LockProxy.tryLock(LockProxy.java:93)
            at myco.MyClass$2.run(MyClass.java:83)
            ... 7 common frames omitted
    01:21:06.431  INFO | hz._hzInstance_1_dev.InvocationMonitorThread |305062234 INFO  c.h.s.i.o.impl.InvocationMonitor - [10.MYHOST2]:5715 [dev] [3.7.2] Invocations:272 timeouts:1 backup-timeouts:0
    01:22:16.431 ERROR | pool-8-thread-1 |305132234 ERROR myco.MyClass
    com.hazelcast.core.OperationTimeoutException: SizeOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2016-10-26 01:22:16.431. Total elapsed time: 65999 ms. Last operation heartbeat: never.  Last operation heartbeat from member: 2016-10-26 01:23:14.403. Invocation{op=com.hazelcast.collection.impl.queue.operations.SizeOperation{serviceName='hz:impl:queueService', identityHash=2068134089, partitionId=268, replicaIndex=0, callId=0, invocationTime=1477459202474 (2016-10-26 01:20:02.474), waitTimeout=-1, callTimeout=60000}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1477459336434, firstInvocationTime='2016-10-26 01:22:16.434', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 19:00:00.000', target=[10.MYHOST1]:5715, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=64, /10.MYHOST2:34860->/10.MYHOST1:5715, endpoint=[10.MYHOST1]:5715, alive=true, type=MEMBER]}
            at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
            at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
            at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
            at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
            at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
            at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:170)
            at com.hazelcast.collection.impl.queue.QueueProxySupport.size(QueueProxySupport.java:104)
            at com.hazelcast.collection.impl.queue.QueueProxyImpl.size(QueueProxyImpl.java:40)
            at myco.MyClass$0(MyClass.java:75)

@jmcshane jmcshane commented Oct 26, 2016

As a note, we had been experiencing these issues for about 8 hours consecutively, trying to debug on the network side to see if there was a problem. We resorted to completely restarting the application and the Hazelcast cluster, and the application began functioning normally without these errors.


@jerrinot jerrinot commented Oct 26, 2016

@jmcshane: do you have (wall-clock) time in sync on all your boxes?


@jmcshane jmcshane commented Oct 26, 2016

@jerrinot The time in the two boxes is synced to a single source

@jerrinot jerrinot modified the milestones: 3.7.4, 3.7.3 Nov 1, 2016

@mufumbo mufumbo commented Nov 16, 2016

The same is happening here, with ntpdate on all servers. Is there any workaround? Maybe downgrade Hazelcast?


@pveentjer pveentjer commented Nov 17, 2016

It is likely to be caused by:

#9251

We are working on a fix. This isn't a networking/congestion problem, which is usually the cause of heartbeat problems. Rather, there is a bug in the heartbeat timeout for timed blocking operations like queue.poll that can lead to a premature heartbeat timeout being thrown.


@mufumbo mufumbo commented Nov 17, 2016

Not sure what's happening. It does seem like a congestion problem, but we don't have much traffic, and our Map and ReplicatedMap puts and gets are getting delayed by a huge amount, sometimes by 50 seconds!

We had to write a complete wrapper around Hazelcast to keep the app working. Code like:

 public V get(Object key) {
        long start = System.currentTimeMillis();
        V result = null;
        try {
            Future<V> future = executorService.submit(new Callable<V>() {
                @Override
                public V call() throws Exception {
                    addCount("get", 1);
                    V result = (V) getBacking().get(key);
                    return result;
                }
            });

            result = future.get(replicated ? 15 : 30, TimeUnit.MILLISECONDS);

            // TODO: make this work somehow
            if (valueClass != null && !valueClass.isInstance(result)) {
                throw new Exception("expecting " + valueClass.getSimpleName() + " but found " + result);
            }
        }
        catch (TimeoutException te) {
            log.info("timed out fetching cache " + key);
        }
        catch (Exception e) {
            log.warn("couldn't get cache " + key, e);
        }
        addTime("get", System.currentTimeMillis() - start);
        return result;
    }

Do you have any suggestions for debugging this issue, @pveentjer?

We use WAN replication with one cluster in Canada and another in Germany. Maybe this problem only happens when the network is slow like that.


@pveentjer pveentjer commented Nov 17, 2016

Perhaps something else is at play. But we have a well-known bug in timed blocking operations like q.poll(2, MINUTES). Blocked operations are periodically aborted and retried. We also have an operation heartbeat: every running operation sends a signal to the machine where the invocation was made, to confirm everything is fine.

The problem is that when retrying, the operation heartbeat isn't reset. So when the timeout detection kicks in, it assumes the operation hasn't sent a heartbeat for a long time and immediately aborts the invocation with an OperationTimeoutException.

This problem can easily happen in a system that isn't doing a lot of work; it is just a matter of chance.
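To make the failure mode concrete, here is a minimal illustrative sketch (this is not the actual Hazelcast invocation code, just the gist of the check described above): because the retried invocation never records a heartbeat, lastHeartbeatMillis stays at 0 (which is why the logs show lastHeartbeatTime as the epoch), so the "time since the last heartbeat" is enormous and the check trips even though the target member is healthy.

public class HeartbeatTimeoutSketch {
    public static void main(String[] args) {
        // Values match the log above: callTimeout=60000, lastHeartbeatMillis=0 ("never").
        long heartbeatTimeoutMillis = 60_000;
        long lastHeartbeatMillis = 0;   // not reset when the blocked operation is retried
        long now = System.currentTimeMillis();

        // With lastHeartbeatMillis stuck at 0, the elapsed time is "now since the epoch",
        // which always exceeds the timeout, so the invocation is aborted with an
        // OperationTimeoutException even though nothing is actually wrong.
        boolean heartbeatTimedOut = (now - lastHeartbeatMillis) > heartbeatTimeoutMillis;
        System.out.println("operation-heartbeat-timeout? " + heartbeatTimedOut);
    }
}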

If you are getting 50-second latencies, I would suggest adding the following (Hazelcast 3.7+).

On the member side the following parameters need to be added:

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30

On the client side, the following parameters need to be added:

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info

You can use this parameter to specify the location of the log file:

-Dhazelcast.diagnostics.directory=/your/log/directory

The above parameters enable diagnostics and various low-overhead plugins that periodically write to a dedicated file. Have a look at this file, especially the slow operations/slow invocations sections. Otherwise, send it to me at peter at hazelcast dot com and I'll have a look.

This can run in production without significant overhead; I always have it enabled when benchmarking.

In 3.8 we added another plugin to the diagnostics that provides more detailed information about MapStore/QueueStore/RingbufferStore and Cache Store/Loader latencies. This will make it easier to pinpoint which DB interaction is problematic.


@mufumbo mufumbo commented Nov 18, 2016

Do you add those parameters to the Java command at runtime? It doesn't write anything anywhere (we have 3.7).


@mufumbo mufumbo commented Nov 18, 2016

I mean, if you google "hazelcast.diagnostics.enabled" there are zero results. I believe that to solve this problem we need to understand what's happening inside Hazelcast. It's been 3 days of failures, and we have written lots of wrappers to handle Hazelcast always being slow.

I really don't understand why the architecture of Map or ReplicatedMap would EVER be synchronous for get. If we configured async-fillup="true", would the default behaviour be to return NULL if the fill-up hasn't finished? What happens now is that because the get times out, we do a lot of PUTs in sequence, so the writes never stop and it never gets synced.
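For reference, async fill-up is configurable per ReplicatedMap. Below is a minimal sketch, assuming the 3.x ReplicatedMapConfig API and the documented async fill-up behaviour (a joining member serves the map immediately but may miss entries until the fill-up completes); the map name is just an example:

import com.hazelcast.config.Config;
import com.hazelcast.config.ReplicatedMapConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ReplicatedMap;

public class ReplicatedMapFillupExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Example map name; async-fillup=true trades completeness for availability
        // while a newly joined member is still being filled up.
        ReplicatedMapConfig mapConfig = new ReplicatedMapConfig();
        mapConfig.setName("example-replicated-map");
        mapConfig.setAsyncFillup(true);
        config.addReplicatedMapConfig(mapConfig);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        ReplicatedMap<String, String> map = hz.getReplicatedMap("example-replicated-map");
        map.put("key", "value");
        System.out.println(map.get("key"));
    }
}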


@pveentjer pveentjer commented Nov 23, 2016

You can add them to the java command line, but you can also use Config.setProperty.

Do you not get a diagnostics log file in the current working directory?

Otherwise, set

-Dhazelcast.diagnostics.directory=/your/log/directory

to a specific directory and see if the file gets created.
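For completeness, here is a minimal sketch of setting the same diagnostics properties programmatically via Config.setProperty. The property names are taken from the list above; the directory is just an example path:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class DiagnosticsConfigExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Same properties as the -D flags above, set programmatically.
        config.setProperty("hazelcast.diagnostics.enabled", "true");
        config.setProperty("hazelcast.diagnostics.metric.level", "info");
        config.setProperty("hazelcast.diagnostics.invocation.sample.period.seconds", "30");
        config.setProperty("hazelcast.diagnostics.pending.invocations.period.seconds", "30");
        config.setProperty("hazelcast.diagnostics.slowoperations.period.seconds", "30");

        // Example path; point this at any directory the member can write to.
        config.setProperty("hazelcast.diagnostics.directory", "/var/log/hazelcast-diagnostics");

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}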


@pveentjer pveentjer commented Nov 23, 2016

I would suggest creating a new ticket. Issues with map.put and ReplicatedMap are likely of a different nature than queue.poll, where we know there is a bug in the timeout handling; it has already been addressed and will be released in 3.7.4 and 3.8.

#9287


@pveentjer pveentjer commented Nov 23, 2016

@jmcshane can you try 3.7.4-SNAPSHOT? It contains the fix for the OperationTimeoutException on timed blocking operations. I hope it solves your problem.

Map.put/ReplicatedMap.put are not timed blocking operations, so they do not suffer from this issue. If they do suffer from OperationTimeoutExceptions, something else must be going on. For that I would like to see the diagnostics output.

[edit]
I see you are getting an OperationTimeoutException on SizeOperation, which is also not a timed blocking operation. So something else must be going on here. Please run with diagnostics and send the output to me at peter at hazelcast dot com.


@mufumbo mufumbo commented Nov 23, 2016

@pveentjer the "-Dhazelcast.diagnostics.directory" option at JVM startup doesn't do anything with 3.7.2.

I'm getting both the ReplicatedMap issue and this one; they are probably related.
It seems to be caused by the large latency in a multi-continent cluster.

@jerrinot jerrinot added this to the 3.7.5 milestone Nov 29, 2016

@cforce cforce commented Sep 10, 2017

@bartprokop Can you explain why switching off LRO fixes the issue? It seems that did the trick for us too. Before, I also had "Last operation heartbeat: never" in my logs, and now none of this. I am really not too comfortable with Hazelcast needing such tweaks to perform stably, so I think we should reopen the issue and find out what the reason is.
It feels like the data is transmitted but then gets lost on the way from the TCP/IP layer to the application, or the application is unable to read it.

Are there special headers needed by Hazelcast? According to https://lwn.net/Articles/358910/, the data may be lossy with LRO on:

"But LRO is a bit of a flawed solution, according to Herbert; the real problem is that it "merges everything in sight." This transformation is lossy; if there are important differences between the headers in incoming packets, those differences will be lost. And that breaks things. If a system is serving as a router, it really should not be changing the headers on packets as they pass through. LRO can totally break satellite-based connections, where some very strange header tricks are done by providers to make the whole thing work. And bridging breaks, which is a serious problem: most virtualization setups use a virtual network bridge between the host and its clients. One might simply avoid using LRO in such situations, but these also tend to be the workloads that one really wants to optimize. Virtualized networking, in particular, is already slower; any possible optimization in this area is much needed."


@mrumpf mrumpf commented Sep 18, 2017

I can confirm what @cforce said. We were able to reproduce the issue with LRO turned on under high load. The issue does not occur in low-traffic scenarios (in our case below 1500 req/s). The LWN article states that LRO kicks in on 10 Gbit network interfaces only when the CPU is not able to deal with the high volume of network packets.
We had turned the parameter off with ethtool and the issue was gone. Yesterday one of the machines was rebooted and the LRO parameter came back ON again.
In our high-traffic phase the issue re-appeared, with that one node showing response times of more than 10 s (!).
The parameter also seems to depend on the operating system. We currently have 3 different Linux versions in use: RHEL 6.8, Oracle Linux 7.2 and 7.3. On RHEL 6.8 LRO support was added but not enabled by default:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.8_Release_Notes/new_features_kernel.html
On RHEL 7.2 LRO is turned on by default:
https://access.redhat.com/solutions/2168261
So a bug report without the Linux version and the LRO parameter setting does not make much sense.

What is missing is a detailed explanation of what exactly is going on when the issue occurs. Maybe someone with Hazelcast and in-depth networking knowledge can step up :)

@pveentjer pveentjer reopened this Sep 19, 2017
@degerhz degerhz modified the milestones: 3.8.5, 3.8.7 Sep 20, 2017

@bartprokop bartprokop commented Sep 26, 2017

@cforce @mrumpf I do not know why switching off LRO fixed the issue. My theory is that LRO somehow interferes with the kernel's capacity to read data from the network interface. It was a very long and non-trivial investigation to pinpoint this. What I have noticed is that Hazelcast uses TCP connections in a very specific way: the TCP pipes are actually used unidirectionally (solely for writes or solely for reads, never a mixture of read/write). The number of bytes sent on one end usually matches the number of bytes read on the other end (i.e. per I/O operation). When the issue is about to happen, the numbers of bytes written and read are no longer aligned between the nodes (it looks like the bytes available/written are grouped into different-sized "packets"). Then communication stops for that particular TCP pipe. I hope these observations are helpful for someone with knowledge of Hazelcast's network I/O internals.

@jerrinot jerrinot modified the milestones: 3.8.7, 3.10 Oct 5, 2017

@jerrinot jerrinot commented Oct 10, 2017

@bartprokop: many thanks for the analysis you have done!
We will have another look; knowing that the issue depends on the LRO switch is a massive help!


@cforce cforce commented Oct 10, 2017

Great to hear that. In times of virtualization and clouds (PaaS), it's an uncomfortable requirement to have to set LRO at the OS layer.


@hrushig hrushig commented Oct 24, 2017

Our issue, too, was resolved by disabling LRO at the guest level. Interestingly, CPU consumption went down by 10-15%.


@cforce cforce commented Oct 24, 2017

Yes, strange; normally the opposite should happen. Just one more sign that it does not behave as expected.


@ruslan-belinskyy ruslan-belinskyy commented Dec 7, 2017

We are having the same issue on PROD servers:
Cause: com.hazelcast.core.OperationTimeoutException: GetOperation invocation failed to complete due to operation-heartbeat-timeout

Hazelcast 3.8.3
Red Hat 6.4

LRO is on:
ethtool -k eth1 | grep large
Returns:
large-receive-offload: on

The server has at least 2 eth interfaces (ifconfig -a => eth0, eth1).
Why does it matter?

I found an interesting blog post: http://ehaselwanter.com/en/blog/2014/11/02/mtu-issue--nope-it-is-lro-with-bridge-and-bond/

which notes:

From the Base Driver for the Intel® Ethernet 10 Gigabit PCI Express Family of Adapters README (http://downloadmirror.intel.com/22919/eng/README.txt):

WARNING: The ixgbe driver compiles by default with the LRO (Large Receive Offload) feature enabled. This option offers the lowest CPU utilization for receives, but is completely incompatible with routing/ip forwarding and bridging. If enabling ip forwarding or bridging is a requirement, it is necessary to disable LRO using compile time options as noted in the LRO section later in this document. The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic.

My best guess is that Hazelcast can at some point act as a router ("IP forwarding"): a request comes in on one eth interface and moves to another for some reason.
I'm not sure that's actually the cause.

Another point: does anyone know from which version the issue reproduces? Most probably I will have to downgrade Hazelcast as the safer option, rather than changing server configs.


@vkandarpa vkandarpa commented Dec 7, 2017

We are having the same issue on our servers too, but our network configuration is as follows:

LRO is OFF on the links.

We have bonding enabled, and LRO is ON on the bond interface. Do we need to turn off LRO on the bond as well?

Also, we are running version 3.7.


@ruslan-belinskyy ruslan-belinskyy commented Feb 22, 2018

Just recently we stopped seeing issues related to OperationTimeoutException (without any Hazelcast/network changes).
It has already been a couple of days, so I'm not sure how it will hold up in the long run. But the major change on our side was removing a framework that was causing 'Too Many Open Files'.
Now when I monitor the app via 'lsof' I see slightly better numbers and no exceptions.


@jerrinot jerrinot commented Feb 22, 2018

@Batter2014: many thanks for the heads-up!


@jerrinot jerrinot commented Mar 8, 2018

I am closing this, as it appears to be caused by something outside Hazelcast. Feel free to re-open it.

@jerrinot jerrinot closed this Mar 8, 2018

@cforce cforce commented Mar 9, 2018

It's not solved in Hazelcast anyway, because almost every physical system, VM and container has LRO enabled by default, since it saves a lot of CPU load. For Hazelcast to be usable you need to switch off this cost and performance saver. Managing the LRO switch is even harder on systems you don't fully manage yourself (on-prem) but consume as CaaS or PaaS in the cloud.
So there is a (bad) workaround, but that's far from solved. It's not even officially explained why it is necessary.


@jerrinot jerrinot commented May 4, 2018

Many thanks to all contributors to this discussion. I thought I would add a few observations I have made in recent weeks.

Apparently LRO does not play nicely with packet routing/bridging or with interface bonding. This is not specific to Hazelcast at all; it's just the way LRO works.

See for example:
http://ehaselwanter.com/en/blog/2014/11/02/mtu-issue--nope-it-is-lro-with-bridge-and-bond/
or
moby/moby#32023

I believe bridging/routing or even bonding is routinely used in virtualized environments, and you should disable LRO to be on the safe side. See this VMware knowledge base article.

The Intel ixgbe driver documentation is also a good read:

WARNING:  The ixgbe driver compiles by default with the LRO (Large Receive
Offload) feature enabled.  This option offers the lowest CPU utilization for
receives, but is completely incompatible with *routing/ip forwarding* and
*bridging*.  If enabling ip forwarding or bridging is a requirement, it is
necessary to disable LRO using compile time options as noted in the LRO
section later in this document.  The result of not disabling LRO when combined
with ip forwarding or bridging can be low throughput or even a kernel panic.

and from the same document:

Do Not Use LRO When Routing or Bridging Packets
-----------------------------------------------
Due to a known general compatibility issue with LRO and routing, do not use
LRO when routing or bridging packets.

Bottom line:
Disable LRO in virtualized environments or when you use network interface bonding. Enable it only when you absolutely have to, and only after extensive testing.


@cforce cforce commented May 8, 2018

Good explanation, thanks so far.
