Hazelcast stuck in TIMED_WAITING when using 2nd-level cache #4406

Closed
amishasthana opened this Issue Jan 13, 2015 · 16 comments

Projects

None yet
@amishasthana

We are using Hazlecast as 2nd level cache for hibernate.
We are on hazlecast 3.2.6.
We are running a 12 node cluster. After two days we are seeing that all threads which are using hibernate/hazlecast are in Stuck state.
The Stack trace is :
qtp19557847-405" prio=10 tid=0x00007fab688ed000 nid=0xd37b in Object.wait() [0x00007faa94727000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
waiting on <0x000000064e350860> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.pollResponse(BasicInvocation.java:767)
locked <0x000000064e350860> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:719)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:697)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:676)
at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:256)
at com.hazelcast.map.proxy.MapProxySupport.setInternal(MapProxySupport.java:305)
at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:172)
at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:158)
at com.hazelcast.hibernate.distributed.IMapRegionCache.update(IMapRegionCache.java:106)
at com.hazelcast.hibernate.distributed.IMapRegionCache.put(IMapRegionCache.java:68)
at com.hazelcast.hibernate.access.AbstractAccessDelegate.put(AbstractAccessDelegate.java:60)
at com.hazelcast.hibernate.access.ReadWriteAccessDelegate.putFromLoad(ReadWriteAccessDelegate.java:57)
at com.hazelcast.hibernate.region.EntityRegionAccessStrategyAdapter.putFromLoad(EntityRegionAccessStrategyAdapter.java:80)
at org.hibernate.engine.internal.TwoPhaseLoad.doInitializeEntity(TwoPhaseLoad.java:217)
at org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:137)
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:1108)
at org.hibernate.loader.Loader.processResultSet(Loader.java:964)
at org.hibernate.loader.Loader.doQuery(Loader.java:911)
.....................................

When we check our cluster, it seem to be in good state and all 12 nodes are up and accessible.
Enabeling hazlecast logger we are seeing the following messages in logs:

[192.168.112.114] 01/12/2015 11:28:48.969 [hz.defaulttenant-defaultorg0.response - platform] DEBUG c.h.spi.impl.BasicInvocation - [192.168.112.114]:5701 [defaulttenant-defaultorg0] [3.2.6] Call timed-out during wait-notify phase, retrying call: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@b3bc7f, partitionId=263, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.112.100]:5701}
[192.168.112.114] 01/12/2015 11:28:49.430 [hz.defaulttenant-defaultorg0.response - platform] DEBUG c.h.spi.impl.BasicInvocation - [192.168.112.114]:5701 [defaulttenant-defaultorg0] [3.2.6] Call timed-out during wait-notify phase, retrying call: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@101e659a, partitionId=253, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.112.57]:5701}
[192.168.112.114] 01/12/2015 11:28:57.344 [hz.defaulttenant-defaultorg0.cached.thread-8 - platform] DEBUG c.h.cluster.ClusterService - [192.168.112.114]:5701 [defaulttenant-defaultorg0] [3.2.6] Sending MasterConfirmation to Member [192.168.112.57]:5701

Basically as we read it, the Master Confirmation message between node is working fine, however the BasicInvocation is failing with call timed out.
We have checked DB and there are no locks etc. at DB level.

@amishasthana

Please note that its basically blocking our whole app. We call this query which uses hibernate/hazlecast and never comes back causing all threads to block.
The same query if we execute against the DB will come back immediately.

@cmuramoto

I am facing the exact same problem with 3.4. Threads get stuck in InvocationFuture.pollResponse() when second level cache is accessed with a high degree of concurrency.

I am running my application with 8 dell R720 (32 cores each) nodes.

In the end I was forced to switch to LocalRegionCache, which uses CHM and is based on topic invalidation, that is, every node will end up hitting the database until it's local map is warmed up...

I would like to know the root cause of the problem... Up till now I was very confident that Hazelcast was a perfect substitute for EhCache + TC, and in general it is, but when it comes to Hibernate distributted 2nd Level cache I feel like I've taken a step back :(

I even suspected that it might be a system clock issue, but every node is perfectly in sync with my ntp servers.

Once I have time I'll try to create a test case to reproduce this behavior.

In my case it always happens more or less like in this scenario:

-Concurrency is high - I use a parallelStream().map() which in turn calls a service that queries the entities

-Entities for this specific region being queried are @OneToOne related with Entities of another region, which are also cached.

-I use query cache as well, but in this case Entities are being loaded with a session.get()

-The service querying my Entities is spring based and @Transactional

@bilalyasar
Collaborator

@amishasthana @cmuramoto are you facing this issue in hibernate3 or hibernate4? or both of them?

@cmuramoto

@bilalyasar I am using hibernate 3.6.10.Final

Just for the record, as far as I can remember this issue used to happen in hazelcast 3.1.X onwards, since I started to deploying my application with Hazelcast. I don't know if older releases suffer from this as well.

Every patch/new release I tried has had this problem. I upgraded to 3.4 yesterday and retried HazelcastCacheRegionFactory in place of HazelcastLocalCacheRegionFactory and today this issue presented itself.

@Donnerbart
Contributor

Thanks for reporting your issues. Since we have (at least) two different setups, please provide us your details via the following template:

  • Hazelcast version that you are using (e.g. 3.4, also specify minor release or latest snapshot)
  • Hazelcast configuration (hazelcast.xml or programmatic configuration)
  • Cluster topology
    • Number of Hazelcast member nodes
    • Number of Hazelcast clients
  • Java version (also JVM parameters can be added)
  • Operating System (for Linux kernel version will be helpful)
  • Logs and Stack Traces (if available)
  • Steps to reproduce (detailed description of steps to reproduce your issue)
    • Unit Test (if you can include a unit test which reproduces your issue, we would be grateful)
  • Integration module versions if available (e.g. Tomcat, Jetty, Spring, Hibernate)
    • Detailed configuration information (web.xml, Hibernate configuration, Spring context.xml etc.)

A small unit test would really be great to reproduce this issue.

@cmuramoto

Hello @Donnerbart, here is my environment.

  • Hazelcast version: 3.4
  • Hazelcast configuration: see xml bellow
  • Java version:
    • java version "1.8.0_20"
    • Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
    • Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
  • VM args: (Application specific parameters ommited)
    • -server -Xverify:none -Dsun.reflect.inflationThreshold=2147483647 -XX:+TieredCompilation -XX:+UseParallelGC -XX:ParallelGCThreads=16 -XX:MaxTenuringThreshold=5 -XX:PretenureSizeThreshold=1m -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/usr/local/jboss/bin/gc/gc.log -XX:+UseGCLogFileRotation -XX:GCLogFileSize=10M -XX:NumberOfGCLogFiles=20 -d64 -Xms2g -Xmx8g -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=512m -XX:MaxDirectMemorySize=10g -Xss256k -Xrs -XX:+UseCompressedOops -XX:StringTableSize=277331 -XX:AutoBoxCacheMax=1024 -Djava.awt.headless=true -Duser.timezone=UTC -Duser.language=en -Duser.country=US -Dsun.rmi.dgc.client.gcInterval=86400000 -Dsun.rmi.dgc.server.gcInterval=86400000 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.net.preferIPv4Stack=true -Djava.library.path=/usr/local/jboss/native -Djava.endorsed.dirs=/usr/local/jboss/lib/endorsed
  • Cluster Topology:
    • Production 8/Dev 4 (all discrete hardware, no virtual machines)
    • Clients: 0 (I don't use HazelcastClient)
  • Operating System:
    • Red Hat Enterprise Linux Server release 6.6 (Santiago)
    • Kernel: 2.6.32-504.el6.x86_64 # 1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
  • Steps to Reproduce: application-specific.
    • Basically, there's a background job that keeps processing data and issues several session.get(id) in parallel on some entities.
    • I'll try to create a unit test to emulate this scenario.
  • Logs and StackTraces: I will provide them as soon as I can revert dev servers to use Distributted Caches in place of Local/Topic based invalidation caches.
  • Integration module versions
    • hazelcast-hibernate3 (3.4)
    • hazelcast-wm (3.4)
    • There's just too much config in my application. Most of it is vanilla spring (I don't use spring to boot hazelcast).
    • My Hibernate SessionFactory is configured with both query and second level cache.

Hazelcast Config:

  • Only relevant caches are shown. I use one separate cache for unversioned (E_U) and versioned entities (E_V)
  • Entities are, of course, annotated with, e.g., @Cache(usage = READ_WRITE, region = "XXX.E_U")
  • I use TCP-based join and in every node I have just one specific interface enabled. This is the only parameter that is different in the whole xml config from node to node.
<hazelcast xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="..." xsi:schemaLocation="...">
  <properties>
    <property name="hazelcast.logging.type">slf4j</property>
  </properties>
  <group>
    <name>*****</name>
    <password>*****</password>
  </group>
  <network>
    <port auto-increment="false">9510</port>
    <interfaces enabled="true">
      <interface>172.30.10.1</interface>
    </interfaces>
    <join>
      <tcp-ip connection-timeout-seconds="30" enabled="true">
        <interface>172.30.10.1</interface>
        <interface>172.30.10.2</interface>
        <interface>172.30.10.3</interface>
        <interface>172.30.10.4</interface>
      </tcp-ip>
      <multicast enabled="false"/>
      <aws enabled="false"/>
    </join>
  </network>
  <map name="org.hibernate.cache.UpdateTimestampsCache">
    <eviction-policy>NONE</eviction-policy>
    <in-memory-format>OBJECT</in-memory-format>
    <max-size>5000</max-size>
    <time-to-live-seconds>0</time-to-live-seconds>
    <max-idle-seconds>0</max-idle-seconds>
    <read-backup-data>false</read-backup-data>
    <near-cache>
      <eviction-policy>NONE</eviction-policy>
      <in-memory-format>OBJECT</in-memory-format>
      <max-size>5000</max-size>
      <max-idle-seconds>0</max-idle-seconds>
      <time-to-live-seconds>0</time-to-live-seconds>
    </near-cache>
  </map>
  <map name="XXX.E_U">
    <eviction-policy>NONE</eviction-policy>
    <in-memory-format>OBJECT</in-memory-format>
    <max-size>15000</max-size>
    <time-to-live-seconds>0</time-to-live-seconds>
    <max-idle-seconds>0</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <read-backup-data>false</read-backup-data>
    <near-cache>
      <eviction-policy>NONE</eviction-policy>
      <in-memory-format>OBJECT</in-memory-format>
      <max-size>5000</max-size>
      <max-idle-seconds>0</max-idle-seconds>
      <time-to-live-seconds>0</time-to-live-seconds>
      <eviction-policy>LRU</eviction-policy>
    </near-cache>
  </map>
  <map name="XXX.E_V">
    <eviction-policy>NONE</eviction-policy>
    <in-memory-format>OBJECT</in-memory-format>
    <max-size>15000</max-size>
    <time-to-live-seconds>0</time-to-live-seconds>
    <max-idle-seconds>0</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <read-backup-data>false</read-backup-data>
    <near-cache>
      <eviction-policy>NONE</eviction-policy>
      <in-memory-format>OBJECT</in-memory-format>
      <max-size>5000</max-size>
      <max-idle-seconds>0</max-idle-seconds>
      <time-to-live-seconds>0</time-to-live-seconds>
      <eviction-policy>LRU</eviction-policy>
    </near-cache>
  </map>
  <serialization>
    <use-native-byte-order>true</use-native-byte-order>
    <allow-unsafe>true</allow-unsafe>
    <data-serializable-factories>
      <data-serializable-factory factory-id="3">
                com.my.company.hz.ObjectFactory
            </data-serializable-factory>
    </data-serializable-factories>
  </serialization>
</hazelcast>
@cmuramoto

Guys, after taking a look at the the code of NonStrictReadWriteAccessDelegate I think I might have a clue about the problem.

This little method caught my attention:

public void unlockRegion(final SoftLock lock) throws CacheException {
        removeAll();//calls IMap::clear()
    }

In my application I have like 25 regions that are used to cache more than 60 entities and the process that "hangs" involves types annotated with

@Cache (usage=CacheConcurrencyStrategy.NON_STRICT_READ_WRITE)

Maybe a standard IMap load test with eventual clear operations interleaved with lots of concurrent get ops might display this kind of behaviour.

@efic1
efic1 commented Feb 18, 2015

We are seeing the exact same issue while using @Cache(usage = CacheConcurrencyStrategy.READ_WRITE) on hazelcast 3.4

@stoerr
stoerr commented Feb 26, 2015

We just had the same problem with hazelcast 3.2.6 , 4 hazelcast member nodes integrated in the application (and no additional clients), Java(TM) 1.7.0_51-b13 (Java HotSpot(TM) 64-Bit Server VM, 24.51-b03, mixed mode), SunOS, 5.10 , amd64/64 (2 Kerne) , hibernate 4.3.6 . The stacktrace is different - see below.

What struck me odd is the following line in
com.hazelcast.spi.impl.BasicInvocation.InvocationFuture.get() :
return get(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
Is an infinite timeout appropriate? That seems to be present in hazelcast 3.4 , too: com.hazelcast.spi.impl.BasicInvocationFuture.get() .

com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.pollResponse(BasicInvocation.java:767)
com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:719)
com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:697)
com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:676)
com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:256)
com.hazelcast.map.proxy.MapProxySupport.removeInternal(MapProxySupport.java:318)
com.hazelcast.map.proxy.MapProxyImpl.remove(MapProxyImpl.java:182)
com.hazelcast.hibernate.distributed.IMapRegionCache.remove(IMapRegionCache.java:112)
com.hazelcast.hibernate.access.AbstractAccessDelegate.remove(AbstractAccessDelegate.java:99)
com.hazelcast.hibernate.region.CollectionRegionAccessStrategyAdapter.remove(CollectionRegionAccessStrategyAdapter.java:71
org.hibernate.action.internal.CollectionAction.evict(CollectionAction.java:152)
org.hibernate.action.internal.CollectionRemoveAction.execute(CollectionRemoveAction.java:121)
org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:463)
org.hibernate.engine.spi.ActionQueue.executeActions(ActionQueue.java:349)
org.hibernate.event.internal.AbstractFlushingEventListener.performExecutions(AbstractFlushingEventListener.java:350)
org.hibernate.event.internal.DefaultFlushEventListener.onFlush(DefaultFlushEventListener.java:56)
org.hibernate.internal.SessionImpl.flush(SessionImpl.java:1222)
org.hibernate.jpa.spi.AbstractEntityManagerImpl.flush(AbstractEntityManagerImpl.java:1335)

Some things from hazelcast.xml :

            <hz:network port="5701" port-auto-increment="true">
                <hz:join>
                    <hz:tcp-ip enabled="true" connection-timeout-seconds="2">
                        <hz:members>XXXX</hz:members>
                    </hz:tcp-ip>
                </hz:join>
            </hz:network>
            <hz:map name="default"
                    backup-count="1"
                    read-backup-data="true"
                    max-size="200"
                    max-size-policy="USED_HEAP_SIZE"
                    eviction-percentage="5"
                    eviction-policy="LFU"
                    in-memory-format="OBJECT">
                <hz:near-cache in-memory-format="OBJECT" />
            </hz:map>
@mesutcelik mesutcelik added this to the 3.4.3 milestone May 12, 2015
@kobalski kobalski was assigned by mesutcelik May 12, 2015
@kobalski kobalski modified the milestone: 3.4.3, 3.6 May 27, 2015
@akoledzhikov
Contributor

I also got something similar with HZ 3.2.3 - lots of threads stuck while attempting to obtain a lock:
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:723)
- locked <0x00000006e6044178> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:697)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:676)
at com.hazelcast.concurrent.lock.LockProxySupport.tryLock(LockProxySupport.java:100)
at com.hazelcast.concurrent.lock.LockProxy.tryLock(LockProxy.java:95)

Some observations:

  1. I've checked the logs and didn't find any traces of this warning:
    logger.warning("No response for " + lastPollTime + " ms. " + toString());
    Which perhaps means that all those threads are trying to lock something locally?
  2. The locks have backup count = 0, so it is the primary lock operation they're waiting for.
  3. There are 5 nodes in the cluster. None have suffered from split-brain syndrome - everyone has been connected from start to end, at least according to the logs.
  4. I've checked the thread dumps of the other nodes - all of their HZ-related threads seem to be idle - no stuck LockOperations being ran.
@akoledzhikov
Contributor

I also saw something strange - when doing the isExecuting check, there is this piece of code:

                if (nodeEngine.getThisAddress().equals(target)) {
                    // target may change during invocation because of migration!
                    continue;
                }

In case that a partition migration happened between the execution of the invocation and this check, and the target shifted from some other instance to the current one, you will never check if someone is actually working on the task at hand. Wouldn't it be better if you stored the original target and checked isExecuting against it?

@kobalski
Contributor

We have a community contribution(#4110) which solves problems of 2nd Level Cache usage when concurrency is high. Also it is ported to hibernate3(#5572). Our current implementation had problems with high loads because of using distributed locks on cache entries. Solving this issue will require architectural changes, so we cannot provide a simple backport bug fix. Changes will be available in 3.6 and you can try it with 3.6-SNAPSHOT version. I am closing this issue for now, if you see same TIMED_WAITING exception with current implementation, please reopen the issue.

@kobalski kobalski closed this Jul 13, 2015
@tzvatot
tzvatot commented Aug 31, 2015

Getting the same issue with HZ version 3.5.1 and 3.5.2.

java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.pollResponse(InvocationFuture.java:299)
        - locked <0x00000007601ece80> (a com.hazelcast.spi.impl.operationservice.impl.InvocationFuture)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:247)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
        at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:456)
        at com.hazelcast.map.impl.proxy.MapProxySupport.setInternal(MapProxySupport.java:508)
        at com.hazelcast.map.impl.proxy.MapProxyImpl.set(MapProxyImpl.java:171)
        at com.hazelcast.map.impl.proxy.MapProxyImpl.set(MapProxyImpl.java:161)
        at com.hazelcast.hibernate.distributed.IMapRegionCache.update(IMapRegionCache.java:94)
        at com.hazelcast.hibernate.distributed.IMapRegionCache.put(IMapRegionCache.java:76)
        at com.hazelcast.hibernate.access.AbstractAccessDelegate.put(AbstractAccessDelegate.java:62)
        at com.hazelcast.hibernate.access.NonStrictReadWriteAccessDelegate.putFromLoad(NonStrictReadWriteAccessDelegate.java:49)
        at com.hazelcast.hibernate.region.EntityRegionAccessStrategyAdapter.putFromLoad(EntityRegionAccessStrategyAdapter.java:82)
        at org.hibernate.engine.internal.TwoPhaseLoad.doInitializeEntity(TwoPhaseLoad.java:223)
        at org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:137)
        at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:1103)
        at org.hibernate.loader.Loader.processResultSet(Loader.java:960)
        at org.hibernate.loader.Loader.doQuery(Loader.java:910)
        at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:341)
        at org.hibernate.loader.Loader.doList(Loader.java:2516)
        at org.hibernate.loader.Loader.doList(Loader.java:2502)

Only 2 nodes, cache contains ~5000 objects.
Will be happy for any solution on this matter.

@kobalski
Contributor

@tzvatot Can you please try 3.6-SNAPSHOT? If you see the same exception again, please feel free to reopen the issue. Unfortunately, right now PR #4110 which solves this issue will be available in version 3.6 .

@tzvatot
tzvatot commented Aug 31, 2015

@kobalski I tried 3.6-SNAPSHOT, it had other issues, not sure if it's something on my side or not, but I didn't get any results at all.
Either way, I wouldn't risk placing a SNAPSHOT version on production. We'll be happy for any solution on official release.

@mutars
mutars commented Dec 31, 2015

@kobalski unfortunately all versions of hazelcast has same issue, i tried all early versions 3.6.x. Turning off L2 cache at all gives about 20% more performance when with a hazelcast caching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment