Hazelcast stuck in TIMED_WAITING when using 2nd-level cache #4406
Comments
Please note that this is basically blocking our whole app. We call this query, which uses Hibernate/Hazelcast, and it never comes back, causing all threads to block.
I am facing the exact same problem with 3.4. Threads get stuck in InvocationFuture.pollResponse() when the second-level cache is accessed with a high degree of concurrency. I am running my application on 8 Dell R720 nodes (32 cores each). In the end I was forced to switch to LocalRegionCache, which uses a ConcurrentHashMap and is based on topic invalidation, that is, every node will end up hitting the database until its local map is warmed up... I would like to know the root cause of the problem... Up till now I was very confident that Hazelcast was a perfect substitute for EhCache + TC, and in general it is, but when it comes to a distributed Hibernate 2nd-level cache I feel like I've taken a step back :( I even suspected that it might be a system clock issue, but every node is perfectly in sync with my NTP servers. Once I have time I'll try to create a test case to reproduce this behavior. In my case it always happens more or less like in this scenario:
- Concurrency is high - I use a parallelStream().map() which in turn calls a service that queries the entities
- Entities for this specific region being queried are @OneToOne related with entities of another region, which are also cached
- I use the query cache as well, but in this case entities are being loaded with a session.get()
- The service querying my entities is Spring based and @Transactional
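(For reference, a minimal sketch of the factory switch mentioned above - from the distributed HazelcastCacheRegionFactory to the topic-invalidated HazelcastLocalCacheRegionFactory - assuming a plain programmatic Properties-based Hibernate 3.x setup. Only the property names and factory class names come from the standard hazelcast-hibernate integration; everything else is illustrative.)

import java.util.Properties;

public class CacheFactorySwitchSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("hibernate.cache.use_second_level_cache", "true");
        props.setProperty("hibernate.cache.use_query_cache", "true");
        // Distributed variant (used in the reports in this issue): every region is backed by a
        // Hazelcast IMap, so each cache access is a remote operation that can block on an
        // InvocationFuture.
        // props.setProperty("hibernate.cache.region.factory_class",
        //         "com.hazelcast.hibernate.HazelcastCacheRegionFactory");
        // Local variant: per-node map with topic-based invalidation; each node warms up its own
        // cache against the database, as described in the comment above.
        props.setProperty("hibernate.cache.region.factory_class",
                "com.hazelcast.hibernate.HazelcastLocalCacheRegionFactory");
        // These properties would then be handed to the Hibernate Configuration / session factory.
    }
}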
@amishasthana @cmuramoto Are you facing this issue with Hibernate 3 or Hibernate 4, or both?
@bilalyasar I am using Hibernate 3.6.10.Final. Just for the record, as far as I can remember this issue has been happening since Hazelcast 3.1.x, i.e. since I started deploying my application with Hazelcast. I don't know if older releases suffer from this as well. Every patch/new release I tried has had this problem. I upgraded to 3.4 yesterday and retried HazelcastCacheRegionFactory in place of HazelcastLocalCacheRegionFactory, and today this issue presented itself.
Thanks for reporting your issues. Since we have (at least) two different setups, please provide us with your details via the following template:
A small unit test to reproduce this issue would really be great.
Hello @Donnerbart, here is my environment.
Hazelcast Config:
<hazelcast xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="..." xsi:schemaLocation="...">
<properties>
<property name="hazelcast.logging.type">slf4j</property>
</properties>
<group>
<name>*****</name>
<password>*****</password>
</group>
<network>
<port auto-increment="false">9510</port>
<interfaces enabled="true">
<interface>172.30.10.1</interface>
</interfaces>
<join>
<tcp-ip connection-timeout-seconds="30" enabled="true">
<interface>172.30.10.1</interface>
<interface>172.30.10.2</interface>
<interface>172.30.10.3</interface>
<interface>172.30.10.4</interface>
</tcp-ip>
<multicast enabled="false"/>
<aws enabled="false"/>
</join>
</network>
<map name="org.hibernate.cache.UpdateTimestampsCache">
<eviction-policy>NONE</eviction-policy>
<in-memory-format>OBJECT</in-memory-format>
<max-size>5000</max-size>
<time-to-live-seconds>0</time-to-live-seconds>
<max-idle-seconds>0</max-idle-seconds>
<read-backup-data>false</read-backup-data>
<near-cache>
<eviction-policy>NONE</eviction-policy>
<in-memory-format>OBJECT</in-memory-format>
<max-size>5000</max-size>
<max-idle-seconds>0</max-idle-seconds>
<time-to-live-seconds>0</time-to-live-seconds>
</near-cache>
</map>
<map name="XXX.E_U">
<in-memory-format>OBJECT</in-memory-format>
<max-size>15000</max-size>
<time-to-live-seconds>0</time-to-live-seconds>
<max-idle-seconds>0</max-idle-seconds>
<eviction-policy>LRU</eviction-policy>
<read-backup-data>false</read-backup-data>
<near-cache>
<in-memory-format>OBJECT</in-memory-format>
<max-size>5000</max-size>
<max-idle-seconds>0</max-idle-seconds>
<time-to-live-seconds>0</time-to-live-seconds>
<eviction-policy>LRU</eviction-policy>
</near-cache>
</map>
<map name="XXX.E_V">
<in-memory-format>OBJECT</in-memory-format>
<max-size>15000</max-size>
<time-to-live-seconds>0</time-to-live-seconds>
<max-idle-seconds>0</max-idle-seconds>
<eviction-policy>LRU</eviction-policy>
<read-backup-data>false</read-backup-data>
<near-cache>
<in-memory-format>OBJECT</in-memory-format>
<max-size>5000</max-size>
<max-idle-seconds>0</max-idle-seconds>
<time-to-live-seconds>0</time-to-live-seconds>
<eviction-policy>LRU</eviction-policy>
</near-cache>
</map>
<serialization>
<use-native-byte-order>true</use-native-byte-order>
<allow-unsafe>true</allow-unsafe>
<data-serializable-factories>
<data-serializable-factory factory-id="3">
com.my.company.hz.ObjectFactory
</data-serializable-factory>
</data-serializable-factories>
</serialization>
</hazelcast>
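(As an aside, a rough sketch of what a factory matching the data-serializable-factory element with factory-id 3 above might look like against the Hazelcast 3.x serialization API. The real com.my.company.hz.ObjectFactory is not shown in this issue, so the CachedEntity class and its type id below are made-up placeholders.)

import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.DataSerializableFactory;
import com.hazelcast.nio.serialization.IdentifiedDataSerializable;

import java.io.IOException;

// Sketch of a factory that would match <data-serializable-factory factory-id="3">.
public class ObjectFactory implements DataSerializableFactory {
    public static final int FACTORY_ID = 3;
    public static final int CACHED_ENTITY_ID = 1; // made-up type id

    @Override
    public IdentifiedDataSerializable create(int typeId) {
        return typeId == CACHED_ENTITY_ID ? new CachedEntity() : null;
    }

    // Placeholder entity illustrating the IdentifiedDataSerializable contract.
    public static class CachedEntity implements IdentifiedDataSerializable {
        private long id;
        private String name;

        @Override
        public int getFactoryId() { return FACTORY_ID; }

        @Override
        public int getId() { return CACHED_ENTITY_ID; }

        @Override
        public void writeData(ObjectDataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }

        @Override
        public void readData(ObjectDataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
    }
}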
Guys, after taking a look at the code of NonStrictReadWriteAccessDelegate I think I might have a clue about the problem. This little method caught my attention:

public void unlockRegion(final SoftLock lock) throws CacheException {
    removeAll(); // calls IMap::clear()
}

In my application I have about 25 regions that are used to cache more than 60 entities, and the process that "hangs" involves types annotated with @Cache(usage = CacheConcurrencyStrategy.NON_STRICT_READ_WRITE). Maybe a standard IMap load test with occasional clear operations interleaved with lots of concurrent get operations might display this kind of behaviour.
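(A rough sketch of the load test suggested above, assuming the plain Hazelcast 3.x IMap API: many reader threads hammering get() on one region map while a single loop issues occasional clear() calls, mimicking unlockRegion()/removeAll(). The map name and all sizes are arbitrary.)

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Sketch of the suggested reproduction: concurrent gets against a distributed map,
// interleaved with clear() calls like the ones issued by the non-strict read/write delegate.
public class RegionClearLoadTest {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Integer, String> region = hz.getMap("XXX.E_U"); // arbitrary region name

        // Warm the map up.
        for (int i = 0; i < 15_000; i++) {
            region.put(i, "entity-" + i);
        }

        ExecutorService readers = Executors.newFixedThreadPool(32);
        for (int t = 0; t < 32; t++) {
            readers.submit(() -> {
                for (int i = 0; i < 1_000_000; i++) {
                    region.get(ThreadLocalRandom.current().nextInt(15_000));
                }
            });
        }

        // Interleave clear() calls to simulate unlockRegion() -> removeAll().
        for (int c = 0; c < 100; c++) {
            TimeUnit.MILLISECONDS.sleep(200);
            region.clear();
        }

        readers.shutdown();
        readers.awaitTermination(10, TimeUnit.MINUTES);
        hz.shutdown();
    }
}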
We are seeing the exact same issue while using @Cache(usage = CacheConcurrencyStrategy.READ_WRITE) on Hazelcast 3.4.
We just had the same problem with Hazelcast 3.2.6: 4 Hazelcast member nodes integrated into the application (and no additional clients), Java(TM) 1.7.0_51-b13 (Java HotSpot(TM) 64-Bit Server VM, 24.51-b03, mixed mode), SunOS 5.10, amd64/64 (2 cores), Hibernate 4.3.6. The stack trace is different - see below. What struck me as odd is the following line in
Some things from hazelcast.xml:

<hz:network port="5701" port-auto-increment="true">
<hz:join>
<hz:tcp-ip enabled="true" connection-timeout-seconds="2">
<hz:members>XXXX</hz:members>
</hz:tcp-ip>
</hz:join>
</hz:network>
<hz:map name="default"
backup-count="1"
read-backup-data="true"
max-size="200"
max-size-policy="USED_HEAP_SIZE"
eviction-percentage="5"
eviction-policy="LFU"
in-memory-format="OBJECT">
<hz:near-cache in-memory-format="OBJECT" />
</hz:map>
I also got something similar with HZ 3.2.3 - lots of threads stuck while attempting to obtain a lock. Some observations:
I also saw something strange - when doing the isExecuting check, there is this piece of code:
If a partition migration happened between the execution of the invocation and this check, and the target shifted from some other instance to the current one, you will never check whether someone is actually working on the task at hand. Wouldn't it be better to store the original target and check isExecuting against it?
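(To make the suggestion concrete, a purely illustrative sketch; every type, field and method below is a hypothetical stand-in for the invocation internals being discussed, not the actual Hazelcast code.)

// Illustrative only: hypothetical stand-ins, not Hazelcast's real internals.
public class IsExecutingSketch {

    static class Address {
        final String host;
        Address(String host) { this.host = host; }
        boolean sameAs(Address other) { return host.equals(other.host); }
    }

    static class Invocation {
        Address originalTarget; // member the operation was actually sent to
        Address currentOwner;   // partition owner re-resolved at check time (may have migrated)
        long callId;
    }

    final Address thisAddress = new Address("local");

    // Behaviour as described in the comment above: the target is re-resolved, so if the
    // partition migrated to this member, the old owner is never asked about the call.
    boolean isExecutingReResolved(Invocation inv) {
        if (inv.currentOwner.sameAs(thisAddress)) {
            return false; // assumes "not executing", even though the old owner may still run it
        }
        return askMember(inv.currentOwner, inv.callId);
    }

    // Suggested alternative: keep the original target and check isExecuting against it.
    boolean isExecutingOriginalTarget(Invocation inv) {
        return askMember(inv.originalTarget, inv.callId);
    }

    // Placeholder for a remote "is this call still executing?" probe.
    boolean askMember(Address member, long callId) {
        return false;
    }
}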
We have a community contribution (#4110) which solves the problems with 2nd-level cache usage when concurrency is high. It has also been ported to hibernate3 (#5572). Our current implementation had problems under high load because it uses distributed locks on cache entries. Solving this issue requires architectural changes, so we cannot provide a simple backported bug fix. Changes will be available in
Getting the same issue with HZ versions 3.5.1 and 3.5.2.
Only 2 nodes; the cache contains ~5000 objects.
@kobalski I tried 3.6-SNAPSHOT; it had other issues - not sure if it's something on my side or not, but I didn't get any results at all.
@kobalski Unfortunately all versions of Hazelcast have the same issue; I tried all the early 3.6.x versions. Turning off the L2 cache entirely gives about 20% more performance than caching with Hazelcast.
We are using Hazelcast as a 2nd-level cache for Hibernate.
We are on Hazelcast 3.2.6.
We are running a 12-node cluster. After two days we are seeing that all threads which use Hibernate/Hazelcast are in a Stuck state.
The stack trace is:
"qtp19557847-405" prio=10 tid=0x00007fab688ed000 nid=0xd37b in Object.wait() [0x00007faa94727000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000064e350860> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.pollResponse(BasicInvocation.java:767)
- locked <0x000000064e350860> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:719)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:697)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:676)
at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:256)
at com.hazelcast.map.proxy.MapProxySupport.setInternal(MapProxySupport.java:305)
at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:172)
at com.hazelcast.map.proxy.MapProxyImpl.set(MapProxyImpl.java:158)
at com.hazelcast.hibernate.distributed.IMapRegionCache.update(IMapRegionCache.java:106)
at com.hazelcast.hibernate.distributed.IMapRegionCache.put(IMapRegionCache.java:68)
at com.hazelcast.hibernate.access.AbstractAccessDelegate.put(AbstractAccessDelegate.java:60)
at com.hazelcast.hibernate.access.ReadWriteAccessDelegate.putFromLoad(ReadWriteAccessDelegate.java:57)
at com.hazelcast.hibernate.region.EntityRegionAccessStrategyAdapter.putFromLoad(EntityRegionAccessStrategyAdapter.java:80)
at org.hibernate.engine.internal.TwoPhaseLoad.doInitializeEntity(TwoPhaseLoad.java:217)
at org.hibernate.engine.internal.TwoPhaseLoad.initializeEntity(TwoPhaseLoad.java:137)
at org.hibernate.loader.Loader.initializeEntitiesAndCollections(Loader.java:1108)
at org.hibernate.loader.Loader.processResultSet(Loader.java:964)
at org.hibernate.loader.Loader.doQuery(Loader.java:911)
.....................................
When we check our cluster, it seems to be in a good state and all 12 nodes are up and accessible.
Enabling the Hazelcast logger, we are seeing the following messages in the logs:
[192.168.112.114] 01/12/2015 11:28:48.969 [hz.defaulttenant-defaultorg0.response - platform] DEBUG c.h.spi.impl.BasicInvocation - [192.168.112.114]:5701 [defaulttenant-defaultorg0] [3.2.6] Call timed-out during wait-notify phase, retrying call: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@b3bc7f, partitionId=263, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.112.100]:5701}
[192.168.112.114] 01/12/2015 11:28:49.430 [hz.defaulttenant-defaultorg0.response - platform] DEBUG c.h.spi.impl.BasicInvocation - [192.168.112.114]:5701 [defaulttenant-defaultorg0] [3.2.6] Call timed-out during wait-notify phase, retrying call: BasicInvocation{ serviceName='hz:impl:lockService', op=com.hazelcast.concurrent.lock.operations.LockOperation@101e659a, partitionId=253, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.112.57]:5701}
[192.168.112.114] 01/12/2015 11:28:57.344 [hz.defaulttenant-defaultorg0.cached.thread-8 - platform] DEBUG c.h.cluster.ClusterService - [192.168.112.114]:5701 [defaulttenant-defaultorg0] [3.2.6] Sending MasterConfirmation to Member [192.168.112.57]:5701
Basically, as we read it, the MasterConfirmation message between nodes is working fine; however, the BasicInvocation is failing with a call timeout.
We have checked the DB and there are no locks, etc., at the DB level.